Systematic benchmarking of foundation models and classical baselines for microbiome-based disease prediction
Abstract
Background: Microbiome-based disease prediction is often hindered by sparse, compositional features and substantial inter-study heterogeneity. Foundation models and LLM-derived representations could, in principle, improve robustness and cross-cohort generalization, but their utility for microbiome prediction has not been systematically benchmarked.

Results: We benchmarked classical machine-learning baselines (regularized logistic regression and random forests), standard numerical feature representations, GPT-derived semantic embeddings, and two foundation-model paradigms: a general-purpose tabular foundation model (TabPFN) and a microbiome-specific foundation model (MGM). Using 83 publicly curated case–control cohorts spanning 20 diseases profiled by 16S rRNA sequencing and shotgun metagenomics, we assessed performance under three settings: intra-cohort cross-validation, cross-cohort transfer (train on one cohort, test on others), and leave-one-study-out (LOSO) validation. GPT-derived semantic embeddings consistently underperformed standard numerical representations. TabPFN achieved strong out-of-the-box performance and competitive cross-cohort robustness, but did not consistently outperform well-tuned classical baselines across cohorts. MGM's performance was disease-dependent and generally lagged behind the strongest tabular baselines, suggesting that current microbiome-specific pretraining at genus resolution does not yet confer a consistent advantage under study heterogeneity. Batch-effect correction methods provided limited and non-uniform improvements in LOSO evaluations.

Conclusions: In this large-scale benchmark, current foundation-model approaches offer, at best, modest gains over strong classical baselines for microbiome-based disease prediction. Our results highlight that standard numerical representations remain difficult to beat, general-purpose tabular foundation models can provide strong out-of-the-box performance under domain shift, and microbiome-specific foundation models may require advances in pretraining scale, taxonomic resolution, and architecture to translate pretraining into reliable cross-study generalization.
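To make the cross-study evaluation settings concrete, the following is a minimal sketch of how leave-one-study-out and pairwise cross-cohort transfer splits could be generated from per-sample study labels. It is not the authors' pipeline: the function names (`group_indices`, `loso_splits`, `cross_cohort_pairs`) and the toy study labels are hypothetical, and the sketch assumes that all samples passed in belong to cohorts of the same disease. Intra-cohort cross-validation is omitted, since it is ordinary k-fold splitting within a single cohort.

```python
from collections import defaultdict


def group_indices(study_ids):
    """Map each study label to the list of sample indices that belong to it."""
    by_study = defaultdict(list)
    for idx, study in enumerate(study_ids):
        by_study[study].append(idx)
    return dict(by_study)


def loso_splits(study_ids):
    """Leave-one-study-out: hold out each study once, train on all the others.

    Yields (held_out_study, train_indices, test_indices).
    """
    by_study = group_indices(study_ids)
    for held_out in sorted(by_study):
        train_idx = [
            i for study in sorted(by_study) if study != held_out
            for i in by_study[study]
        ]
        yield held_out, train_idx, by_study[held_out]


def cross_cohort_pairs(study_ids):
    """Cross-cohort transfer: train on one study, test on each other study.

    Yields (train_study, test_study, train_indices, test_indices)
    for every ordered pair of distinct studies.
    """
    by_study = group_indices(study_ids)
    for train_study in sorted(by_study):
        for test_study in sorted(by_study):
            if train_study != test_study:
                yield (train_study, test_study,
                       by_study[train_study], by_study[test_study])


# Hypothetical toy labels: six samples drawn from three studies.
study_ids = ["A", "A", "B", "B", "B", "C"]

for held_out, train_idx, test_idx in loso_splits(study_ids):
    print("LOSO held out:", held_out, "train:", train_idx, "test:", test_idx)

for train_study, test_study, train_idx, test_idx in cross_cohort_pairs(study_ids):
    print("transfer:", train_study, "->", test_study, train_idx, test_idx)
```

The key property in both schemes is that no study contributes samples to both the training and the test side of a split, so study-specific batch effects cannot leak into the performance estimate. With scikit-learn available, `sklearn.model_selection.LeaveOneGroupOut` produces equivalent LOSO splits when the study labels are passed as `groups`.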