CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome
Abstract
Background
Despite the surge in microbiome data acquisition, few tools can effectively analyze these data and identify correlations between taxonomic composition and continuous environmental factors. Moreover, existing tools do not predict environmental factors in new samples, which underscores the need for new solutions that improve our understanding of microbiome dynamics and fill this prediction gap. Here we introduce CODARFE, a novel tool for sparse compositional microbiome predictor selection and for the prediction of continuous environmental factors.
Results
We tested CODARFE against 4 state-of-the-art tools in 2 experiments. First, CODARFE outperformed the other tools in predictor selection, in terms of correlation, in 21 of 24 databases. Second, among all the tools, CODARFE recovered the highest number of bacteria previously linked to environmental factors in human data, at least 7% more than the other tools. We also tested CODARFE in a cross-study setting, using the same biome under different external effects: a model trained on 1 dataset was used to predict environmental factors in another dataset, achieving a mean absolute percentage error of 11%. Finally, CODARFE is available in 5 formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify.
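For readers unfamiliar with the cross-study metric, the following is a minimal sketch of how a mean absolute percentage error (MAPE) is conventionally computed. The function name and the sample values are hypothetical and chosen only for illustration; this is not CODARFE's own code or API, and the numbers are not data from the study.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Return the MAPE, in percent, between observed and predicted values.

    Assumes the two sequences have equal length and that no observed
    value is zero (the percentage error is undefined in that case).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an observed value is zero")
    # Average of |observed - predicted| / |observed| over all samples.
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical observed and predicted values of a continuous
# environmental variable (illustrative only, not from the paper).
observed = [5.0, 6.0, 8.0]
predicted = [5.5, 5.4, 8.8]
mape = mean_absolute_percentage_error(observed, predicted)
print(f"MAPE: {mape:.1f}%")
```

In this toy example every prediction is off by a tenth of the observed value, so the MAPE is about 10%. Under the same definition, the 11% reported above would mean that predictions on the second dataset deviated from the observed values by 11% on average.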
Conclusions
Our findings underscore the robustness and broad applicability of CODARFE across diverse fields, even under varying experimental conditions. Additionally, the ability to predict outcomes in new samples allows for the generation of new insights in previously unexplored contexts, providing researchers with a versatile tool.
Article activity feed
Despite the surge in data acquisition, few tools can effectively analyze microbiome data and identify correlations between taxonomic composition and continuous environmental factors. Moreover, existing tools do not predict environmental factors in new samples, which underscores the need for new solutions that improve our understanding of microbiome dynamics and fill this prediction gap. Here, we introduce CODARFE, a novel tool for sparse compositional microbiome predictor selection and for the prediction of continuous environmental factors. We tested CODARFE against four state-of-the-art tools in two experiments. First, CODARFE outperformed the other tools in predictor selection, in terms of correlation, in 21 out of 24 databases. Second, among all the tools, CODARFE recovered the highest number of bacteria previously linked to environmental factors in human data, at least 7% more than the other tools. We also tested CODARFE in a cross-study setting, using the same biome under different external effects (e.g., ginseng field and cattle for arable soil, and HIV and Crohn's disease for human gut): a model trained on one dataset was used to predict environmental factors in another dataset, achieving a mean absolute percentage error of 11%. Finally, CODARFE is available in five formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify; see https://github.com/alerpaschoal/CODARFE.
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf055), which carries out open, named peer-review. The following review is published under a CC-BY 4.0 license:
Reviewer: Jaak Truu
This manuscript addresses key aspects of microbiome data analysis, particularly in relating continuous variables to microbiome data and utilizing microbiome data to predict variables of interest. The data analysis approach is well-articulated; however, there is a notable omission regarding the derivation of the microbiome datasets. While the sources of these datasets are mentioned, it remains unclear whether the authors processed the initial data to produce the count tables used as input or if these tables were directly adopted from the original publications. Given that the data in the main text are derived from studies based on 16S rDNA sequencing, variations in data processing pipelines between publications could introduce significant variability. Although the manuscript discusses the importance of the sequenced 16S rDNA region and the similarity of the environments from which the samples were obtained, it does not address the impact of the initial data processing pipeline (including taxonomy assignment).
Additionally, the number of samples in each dataset is not provided in the tables.
The manuscript includes a comparison of the proposed method with other tools; however, it omits MaAsLin (Microbiome Multivariable Association with Linear Models), which has been applied far more extensively in microbiome data analysis than the tools included in the current manuscript. Incorporating a comparison with MaAsLin would enhance the comprehensiveness of the evaluation.
