Machine Learning Made Easy (MLme): A Comprehensive Toolkit for Machine Learning-Driven Data Analysis


Abstract

Background

Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.

Results

To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.

Conclusion

MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.

Key Points

  • MLme is a novel tool that simplifies machine learning (ML) for researchers by integrating Data Exploration, AutoML, CustomML, and Visualization functionalities.

  • MLme improves efficiency and productivity by streamlining the ML workflow and eliminating the need for extensive coding efforts.

  • Rigorous testing on diverse datasets demonstrates MLme’s promising performance in classification problems.

  • MLme provides intuitive interfaces for data exploration, automated ML, customizable ML pipelines, and result visualization.

  • Future developments aim to expand MLme’s capabilities to include support for unsupervised learning, regression, hyperparameter tuning, and integration of user-defined algorithms.

Article activity feed

  1. Reviewer 2: Ryan J. Urbanowicz (Revision 2)

    At this point I earnestly wish to see this paper published, and, acknowledging my own potential bias as a developer of STREAMLINE and participant in the development of TPOT, I am still recommending minor revision. At minimum, for me to recommend acceptance of this paper the following small but critical issue needs to be addressed; otherwise I must recommend rejection. I believe this concern is well justified by scientific standards. I also still strongly recommend the authors reconsider the other non-critical issues reiterated below as a way to make their paper stronger and better received by the scientific community. If the journal editor disagrees with my assessment, I would still be happy to see this work published; however, I must stand by my assertions below.

    Critical issue:

    Limitations section: The authors updated the text "excells in it's core objective of addressing classification tasks" to "it excels in its primary objective of addressing pipeline development for classification tasks." The use of the word 'excells' is the key problem, as this word is defined as "to do or be better than others". While the change in phrasing correctly no longer implies that MLme performed better than the other evaluated AutoML tools, it does still imply that it is the best at developing a pipeline for classification tasks, but no specific evidence is provided in the paper to support this assertion. I.e., there were no studies comparing how easy the tool was for users to apply relative to other AutoML tools, and no detailed comparison of which pipeline elements can be included by MLme vs. other AutoML or pipeline development tools. The fact that MLme doesn't include hyperparameter optimization is in itself a limitation that I think would prevent MLme from being claimed as excelling or superior in pipeline development to other tools/platforms, even if it's easier to use than other tools.

    As phrased in the reviewer response, the authors could say that MLme is well equipped to handle pipeline development, as this would be a fair statement. Altogether, I'd strongly encourage the authors not to make statements about the superior aspects of MLme without clearly backing up these statements with direct comparisons. Instead, I'd suggest highlighting elements of MLme that are unique or provide more functionality in contrast with other tools. In the reviewer response the authors make the claim that MLme is superior in terms of ease of use for visualization and exploratory analysis. If they want to make that statement in the paper, backed up by accurate comparisons to other tools, I'd agree with that addition.

    Non-critical issues that I feel should still be addressed:

    1. Table S1 has been updated to remove the inaccuracies I previously pointed out; however, this alone does not change my broader concern regarding the intention of this table (which is to highlight the parts of MLme that appear better than other AutoML tools without fairly pointing out the limitations of MLme in contrast with other tools). As a supplementary table, I do not feel this is critical, but I think a table that more fairly reflects the strengths and limitations of the different tools would greatly strengthen this paper.

    2. The pipeline designs in Figures 2 and S10 are both high-level and still do not provide enough detail/clarity to understand exactly what happens, and in what order, when applying the AutoML element of MLme. The key words here are transparency and reproducibility. The supplementary materials could describe a detailed walk-through of what the AutoML does at each step. At minimum this could also be clearly addressed in the software documentation on GitHub.

    3. While I understand the need for brevity, I think adding a sentence that indicates specifically which AutoML tools are most similar to MLme is a reasonable request that better places MLme in the context of the greater AutoML research space.
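    As a minimal illustration of the hyperparameter-optimization step the reviewer describes as missing, the following scikit-learn sketch shows what such a step typically involves. The model, dataset, and parameter grid are illustrative assumptions, not taken from MLme.

```python
# A minimal sketch of a hyperparameter-optimization step; the model,
# dataset, and grid are illustrative only, not from MLme.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exhaustively search a small grid with 5-fold cross-validation on the
# training split only; the test split stays untouched for final scoring.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
    cv=5,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

    Even this small grid re-fits the model dozens of times, which is why AutoML frameworks usually automate this step rather than leave it out.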

  2. Reviewer 2: Ryan J. Urbanowicz (Revision 1)

    Overall, I think the authors have made some good improvements to this paper, although it does not seem like the main body of the paper has changed much, with most of the updates going into the supplementary materials. However, I think this work is worthy of publication once the following items are addressed (which I still feel strongly should be addressed, but which should be fairly easy to do).

    1. Limitations section: While the authors added some basic comparisons to a few other AutoML tools, I do not see how they are justified in saying that MLme 'excells' in its core objective of addressing classification tasks. This implies it performs classification better than other methods, which is not at all backed up here, and indeed would be very difficult to prove, as it would require a huge number of analyses over a broad range of simulated and real-world benchmark datasets, in comparison to many or all other AutoML tools. At best, I think the authors can say here that it is at least comparable in performance to AutoML tools (X, Y, Z) in its ability to conduct classification analyses. And according to Figure S9 this is only across 7 datasets, and focused only on the F1 score, which could also be misleading or cherry-picked. At best I believe the authors can say in the paper that "Initial evaluation across 7 datasets suggested that MLme performed comparably to TPOT and Hyperopt-sklearn with respect to F1 score performance. This suggests that MLme is effective as an automated ML tool for classification tasks." (or something similar).

    2. While the authors lengthened the supplemental materials table comparing ML tools (mainly by adding some other AutoML tools), this table intentionally presents the capabilities of tools in a way that makes it appear as if MLme does the most (with the exception of the 'features' column). For example, what about a column to indicate whether an AutoML tool has an automated pipeline discovery component (like TPOT)? In terms of AutoML, this table is structured to highlight the benefits of MLme rather than give a fair comparison of AutoML tools (which is my major concern here). In terms of AutoML performance and usability, there is a lot more to these different tools than the 6 columns presented. In this table, 'features' seems like an afterthought, but it is arguably the most important aspect of an AutoML tool.

    3. Additionally, the information presented in the AutoML comparison table does not seem to be entirely accurate, or at least how the columns are defined is not made entirely clear. Looking at STREAMLINE, which can be run by users with no coding experience (as a Google Colab notebook), it has a code-free option (just not a GUI); STREAMLINE also generates more than two exploratory analysis plots, and more results visualization plots than indicated. While I agree that MLme has much more ease-of-use functionality in comparison to STREAMLINE (which is a very nice plus), a reader might look at this table and think they need to know how to code in order to use STREAMLINE, which is not the case. Could the authors at least define their criteria for the "code free" column? As it's presented now, it seems to use the exact same criteria as the GUI column (in which case it is redundant). The same is true for the table legend, where '*' indicates that coding experience is required for designing a custom pipeline. This requires more clarification, as STREAMLINE can be customized easily without coding experience by simply changing options in the Google Colab notebook, and TPOT automatically discovers new analysis pipelines, which isn't reflected at all.

    4. While I appreciate the authors adding a citation for STREAMLINE and some other AutoML tools not previously cited, it would be nice for the authors to discuss other AutoML tools further in the main paper, as well as to acknowledge in the main paper which AutoML tools are most similar to MLme in overall design and capabilities. Based on my own review of AutoML tools, the most similar tools would include STREAMLINE and MLJAR-supervised.

    5. I like the addition of Figure S10, which more clearly lays out the elements included in MLme, but I still think the paper and documentation lack a clear and transparent walk-through of exactly what happens to the data and how the analyses are conducted from start to finish when using the AutoML (at least by default). This is important for trusting what happens under the hood, for reporting results, etc.

    Other comments responding to author responses:

    • I still disagree with the authors that a dataset with up to 1500 samples or up to 5520 features could be considered large by today's standards across all research domains. Even within biomedical data, datasets of up to 100K subjects are becoming common, and 'omics' datasets regularly reach hundreds of thousands to multiple millions of features. I am glad to see the authors adding a larger dataset, but I would still be cautious when making suggestions about how well MLme handles 'large' datasets without including specifics for context. However, ultimately this is subjective, and it is not preventing me from endorsing publication.
    • I also disagree that MLme isn't introducing a new methodology. The steps comprising an AutoML tool can themselves be considered a new methodology, even if it is built on established components, because there are still innumerable ways to put a machine learning analysis pipeline together that add bias, cause data leakage, or simply yield poorer performance. Thus I also don't think it's fair to just 'assume' your method will work as well as other AutoML tools, especially when you've run it on a limited number of datasets/problems.
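    The reviewer's point that pipeline assembly can itself introduce data leakage is illustrated by the standard leakage-safe pattern below: preprocessing is fit only on each training fold inside cross-validation. This is a generic scikit-learn sketch, not a description of MLme's actual internals.

```python
# A generic leakage-safe pipeline sketch (not MLme's internals): the
# scaler is re-fit on each training fold inside cross-validation, so no
# statistics from a held-out fold leak into model fitting.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                # fit on the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds gets its own freshly fitted scaler and classifier.
f1_scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
```

    Fitting the scaler once on the full dataset before splitting, by contrast, would let test-fold statistics influence training, which is exactly the kind of subtle pipeline-assembly choice the reviewer argues must be made transparent.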
  3. Reviewer 1: Joe Greener (Revision 1)

    The authors have adequately addressed my concerns and I believe that the manuscript is ready for publication.

  4. Reviewer 2: Ryan J. Urbanowicz (Original Submission)

    In this paper the authors introduce MLme, a comprehensive toolkit for machine learning-driven analysis. The authors discuss the benefits and limitations of their toolkit and provide a demonstration evaluation on 6 datasets suggesting its potential value. Overall, MLme seems like a nice, easy-to-use tool with a good deal of potential and value. However, as the developer of STREAMLINE, an AutoML toolkit with a number of very similar goals and a similar design architecture to MLme, it was very surprising to see it not referenced or compared to in this paper. My major concerns involve the limited details about what MLme specifically does/includes (e.g. what 16 ML algorithms are built in), as well as what seems like a limited and largely biased comparison of this toolkit's capabilities to other AutoML tools (most specifically STREAMLINE, which has a significant degree of similarity).

    • There are many other AutoML tools out there that the authors have not considered in their Table S1 or referenced, e.g. MLBox, Auto-WEKA, H2O, Devol, Auto-Keras, TransmogrifAI, and, most glaringly for this reviewer, STREAMLINE (https://github.com/UrbsLab/STREAMLINE).
    • In particular, with respect to STREAMLINE (https://link.springer.com/chapter/10.1007/978-981-19-8460-0_9), there are a large number of pipeline similarities and a similar analysis mission/goals to MLme that make it extremely relevant to cite and contrast in this manuscript as well as in Table S1. STREAMLINE has a similar focus on the end-to-end ML analysis pipeline, including automated exploratory analysis, data processing, feature selection, ML modeling with 16 algorithms, evaluation, results visualization generation, interactive visualizations, pickled output storage, etc. The first STREAMLINE paper was published in March of 2023, with a preprint of that manuscript published in June 2022, as well as a precursor implementation of this pipeline published as a preprint in August of 2020 (https://arxiv.org/abs/2008.12829). This is in contrast with MLme's preprint, published in July of 2023. While MLme has a number of potentially nice features that STREAMLINE does not (i.e. a GUI interface, spider plots, easy color palette selection, inclusion of a dummy classifier, and the ability to handle multi-class classification [which is not yet available, but in development for STREAMLINE along with regression]), it lacks other potentially important features that STREAMLINE does have (i.e. automated hyperparameter optimization, basic data cleaning and feature engineering [in the newest release], collective feature selection, pickled models for later reuse, collective feature importance visualizations, a PDF analysis summary report, the ability to quickly evaluate models on new replication data, and potentially other capabilities that I can't highlight because of the limited details on what MLme includes). The absence of hyperparameter optimization is a particularly problematic omission from MLme, as this is a fairly critical element of any machine learning analysis pipeline.
    • Table S1 should be expanded to highlight a broader range of toolkit features to better reflect the strengths and weaknesses of a greater variety of methodologies. The present table seems a bit cherry-picked to make MLme stand out as appearing to have more capabilities than other tools, but there are uncaptured advantages to these other approaches.
    • This manuscript includes no citations justifying the pipeline design choices. In particular, I'm most concerned with the authors' justification of automatically including data resampling by default, as it is well known that this can introduce bias into modeling. It's also not clear what determines whether data resampling is required, and whether this only impacts training data or also testing data.
    • It's not clear that resampling is a good/reliable strategy for an automated machine learning framework, since data resampling to create a more balanced dataset can also incorporate bias into an ML model.
    • In the context of potential datasets from different domains (including biomedical data), the datasets identified in this paper as being "large" have only up to 1500 samples and only up to 5520 features, which would not be considered large by most data science standards.
    • There are largely limited details in this paper and the software's GitHub documentation in terms of transparently indicating exactly what this pipeline does, and what options, algorithms, evaluation metrics, and visualizations it includes.
    • Since the authors do not benchmark MLme against any other AutoML tool and they have a very limited set of benchmark datasets (6 total, with limited diversity of data types, sizes, and feature types), I don't think it's fair to claim that their methodology necessarily excels in its core objective of addressing classification tasks. Ideally the authors would conduct benchmarking comparisons to STREAMLINE, as well as other AutoML toolkits; however, this might also understandably be outside the scope of this current paper. I do suggest the authors be more conservative in the assertions they make and conclusions they draw with respect to MLme. The authors might consider using established ML or AutoML benchmark datasets used by other algorithms and frameworks to compare or facilitate comparison of their pipeline toolkit to others.
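    The resampling concern raised above can be made concrete: class-balancing resampling should touch only the training split, never the held-out test set. The sketch below (synthetic data, plain scikit-learn; it does not depict MLme's actual behaviour) shows the leakage-safe pattern.

```python
# Sketch of leakage-safe class balancing on synthetic data: the minority
# class is oversampled within the training split only; the test split is
# never resampled. Illustrates the reviewer's concern, not MLme's code.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)            # 9:1 class imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Oversample (with replacement) the minority class to match the majority.
n_major = int((y_train == 0).sum())
X_minor_up = resample(X_train[y_train == 1], n_samples=n_major, random_state=0)

X_balanced = np.vstack([X_train[y_train == 0], X_minor_up])
y_balanced = np.array([0] * n_major + [1] * n_major)
```

    Resampling before the split (or resampling the test set) would let duplicated minority samples appear on both sides of the split, inflating the evaluation in exactly the way the reviewer warns about.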

  5. This work has been published in the *GigaScience* journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad111), and the reviews are published under the same license. These are as follows.

     Reviewer 1: Joe Greener (Original Submission)

    Akshay et al. present MLme, a toolkit for exploring data and automatically running machine learning (ML) models. This software could be useful for those with less experience in ML. I believe it is suitable for publication provided the following points are addressed.

    # Major

    1. The performance of models is consistently over 90%, but without a reference point it is unclear how good this is. Are there results from previous studies on the same data that can be compared to, with a table comparing accuracy with MLme to previous work? Otherwise it is unclear whether MLme is supposed to be a quick way to have a first go at prediction on the data or can entirely replace manual model refinement.

    2. With any automated ML system it is important to impress upon users the risks of ML. For example, the splitting of data into training and test sets is done randomly, but there are cases where this is not appropriate, as it will lead to data leakage between the training and test sets. This could be mentioned in the manuscript and somewhere on the GUI. There isn't really a replacement for domain knowledge, and users of MLme should have this in mind when using the software.

    # Minor

    3. More experienced ML users may want to use the software to have a first go at prediction on the data. For these users it may be useful to provide access to commands or scripts, or at least information on which functions were used, as additional options in the GUI. Users could then run these scripts themselves to tweak hyperparameters, etc.

    4. The visualisation tab lacks an info button by the file upload to say what the file format should be.
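    The random-splitting risk described in point 2 can be illustrated with a group-aware split, which keeps all samples from one group (e.g. one patient) on the same side of the split. This is a generic scikit-learn sketch with synthetic group labels, not part of MLme.

```python
# Sketch of a group-aware train/test split with synthetic group labels:
# all samples from a given group end up on the same side of the split,
# avoiding the leakage a purely random split can cause when samples
# within a group are correlated.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(12), 5)         # 12 subjects, 5 samples each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

# No group appears on both sides of the split.
overlap = set(groups[train_idx]) & set(groups[test_idx])
```

    A plain random split would scatter each subject's samples across both sets, letting a model memorise subjects rather than generalise — which is exactly why the reviewer argues such risks should be surfaced in the manuscript and the GUI.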