Identifying Optimal Machine Learning Approaches for Human Gut Microbiome (Shotgun Metagenomics) and Metabolomics Integration with Stable Feature Selection

Suzette N. Palmer
Animesh Mishra
Shuheng Gan
Dajiang Liu
Andrew Y. Koh
Xiaowei Zhan

Abstract

Microbiome research has been limited by methodological inconsistencies. Taxonomy-based profiling faces challenges such as data sparsity, variable taxonomic resolution, and a reliance on DNA-based profiling, which provides limited functional insight. Multi-omics integration has emerged as a promising approach to link microbiome composition with function. However, the lack of standardized methodologies and inconsistencies in machine learning strategies have hindered reproducibility. Additionally, while machine learning can identify key microbial and metabolic features, the stability of feature selection across models and data types remains underexplored, despite its importance for downstream experimental validation and biomarker discovery. Here, we systematically compare Elastic Net, Random Forest, and XGBoost across five multi-omics integration strategies (Concatenation, Averaged Stacking, Weighted Non-negative Least Squares (NNLS), Lasso Stacking, and Partial Least Squares (PLS)) as well as individual omics models. We evaluate performance across 588 binary and 735 continuous models using human gut microbiome metabolomics data and taxonomic data derived from shotgun metagenomic sequencing. We also assess the impact of feature reduction on model performance and feature selection stability. Among the approaches tested, Random Forest combined with NNLS yielded the highest overall performance across diverse datasets. Tree-based methods also demonstrated consistent feature selection across data types and dimensionalities. These results demonstrate how integration strategy, algorithm selection, data dimensionality, and response type affect both predictive performance and the stability of selected features in multi-omics microbiome modeling.
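To make the NNLS integration strategy more concrete, the sketch below shows one way a non-negative least squares stacker could combine predictions from two single-omics models. It is an illustration under stated assumptions, not the authors' implementation: the function name `nnls_two_views`, the two-view setup (metabolomics and taxonomy), the toy numbers, and the choice to leave the weights un-normalised are all hypothetical. With only two base models the non-negative optimum can be found exactly by checking the interior solution and the two boundary solutions, so no external solver is needed here.

```python
def nnls_two_views(pred_a, pred_b, y):
    """Return non-negative weights (w_a, w_b) minimising
    sum((w_a * pred_a + w_b * pred_b - y) ** 2).

    Hypothetical helper: pred_a and pred_b stand in for out-of-fold
    predictions from two single-omics models.
    """
    # Terms of the 2x2 normal equations.
    aa = sum(a * a for a in pred_a)
    bb = sum(b * b for b in pred_b)
    ab = sum(a * b for a, b in zip(pred_a, pred_b))
    ay = sum(a * t for a, t in zip(pred_a, y))
    by = sum(b * t for b, t in zip(pred_b, y))

    def sse(w_a, w_b):
        return sum((w_a * a + w_b * b - t) ** 2
                   for a, b, t in zip(pred_a, pred_b, y))

    # Boundary candidates: one or both weights fixed at zero.
    candidates = [(0.0, 0.0)]
    if aa > 0:
        candidates.append((max(ay / aa, 0.0), 0.0))
    if bb > 0:
        candidates.append((0.0, max(by / bb, 0.0)))

    # Interior candidate: the unconstrained solution, kept only if feasible.
    det = aa * bb - ab * ab
    if det > 1e-12:
        w_a = (ay * bb - by * ab) / det
        w_b = (by * aa - ay * ab) / det
        if w_a >= 0 and w_b >= 0:
            candidates.append((w_a, w_b))

    # The problem is convex, so the best candidate is the constrained optimum.
    return min(candidates, key=lambda w: sse(*w))


# Toy data (made up): the outcome is an exact 0.7 / 0.3 blend of the two views.
metab_pred = [1.0, 2.0, 3.0, 4.0]
taxa_pred = [2.0, 1.0, 4.0, 3.0]
outcome = [0.7 * a + 0.3 * b for a, b in zip(metab_pred, taxa_pred)]

weights = nnls_two_views(metab_pred, taxa_pred, outcome)
stacked = [weights[0] * a + weights[1] * b
           for a, b in zip(metab_pred, taxa_pred)]
```

The non-negativity constraint means a view that does not help prediction receives a weight of zero rather than a negative one, which keeps the stacked prediction interpretable as a weighted combination of the single-omics models. With more than two views, a general-purpose NNLS solver would replace the hand-enumerated candidates.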

Key Points

A total of 1,323 models were developed to comprehensively evaluate prediction performance and the robustness of feature selection for human gut microbiome datasets (metabolomics and taxonomy from shotgun metagenomic sequencing). These models included three widely used machine learning algorithms – Elastic Net, Random Forest, and XGBoost – applied across five integration strategies and single-omics approaches on datasets with binary and continuous outcomes.
For continuous outcomes, Random Forest combined with NNLS integration achieved the highest performance and remained strong across both full-dimensional and feature-reduced datasets.
For binary outcomes, Random Forest consistently performed well regardless of the integration strategy. Notably, single-omics models, especially those using metabolomics data, outperformed integrative approaches.
Tree-based models demonstrated greater consistency in feature selection across different dimensionalities and integration strategies.
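The idea of feature selection consistency in the points above can be illustrated with a small sketch. The summary here does not state which stability metric the authors used, so the mean pairwise Jaccard index below is an assumption chosen for illustration, and the function names and feature names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(selected_sets):
    """Mean pairwise Jaccard index over feature sets from repeated runs."""
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up top features from three runs (e.g. different folds or dimensionalities).
runs = [
    {"butyrate", "Bacteroides", "acetate"},
    {"butyrate", "Bacteroides", "propionate"},
    {"butyrate", "acetate", "Bacteroides"},
]
stability = selection_stability(runs)
```

A score near 1 means the same features are selected from run to run, while a score near 0 means the selected sets barely overlap. Comparing this score between algorithms or integration strategies is one simple way to express which approach selects features more consistently.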

Version published to 10.1101/2025.06.21.660858 on bioRxiv
Jun 26, 2025

