Systematic benchmarking of foundation models and classical baselines for microbiome-based disease prediction

Abstract

Background: Microbiome-based disease prediction is often hindered by sparse, compositional features and substantial inter-study heterogeneity. Foundation models and LLM-derived representations could, in principle, improve robustness and cross-cohort generalization, but their utility for microbiome prediction has not been systematically benchmarked.

Results: We benchmarked classical machine-learning baselines (regularized logistic regression and random forests), standard numerical feature representations, GPT-derived semantic embeddings, and two foundation-model paradigms: a general-purpose tabular foundation model (TabPFN) and a microbiome-specific foundation model (MGM). Using 83 publicly curated case–control cohorts spanning 20 diseases profiled by 16S rRNA sequencing and shotgun metagenomics, we assessed performance under three settings: intra-cohort cross-validation, cross-cohort transfer (train on one cohort, test on others), and leave-one-study-out (LOSO) validation. GPT-derived semantic embeddings consistently underperformed standard numerical representations. TabPFN achieved strong out-of-the-box performance and competitive cross-cohort robustness, but did not consistently outperform well-tuned classical baselines across cohorts. MGM’s performance was disease-dependent and generally lagged behind the strongest tabular baselines, suggesting that current microbiome-specific pretraining at genus resolution does not yet confer a consistent advantage under study heterogeneity. Batch-effect correction methods provided limited and non-uniform improvements in LOSO evaluations.

Conclusions: In this large-scale benchmark, current foundation-model approaches offer, at best, modest gains over strong classical baselines for microbiome-based disease prediction. Our results highlight that standard numerical representations remain difficult to beat, general-purpose tabular foundation models can provide strong out-of-the-box performance under domain shift, and microbiome-specific foundation models may require advances in pretraining scale, taxonomic resolution, and architecture to translate pretraining into reliable cross-study generalization.
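
As an illustration of the leave-one-study-out (LOSO) setting named in the abstract, the sketch below holds out each study in turn, fits on the remaining studies, and reports a per-study AUROC. It is a minimal, standard-library-only sketch and not the authors' pipeline: the function names (`loso_splits`, `auroc`, `centroid_scores`, `loso_evaluate`), the toy centroid-difference scorer, and the synthetic two-feature data are all assumptions made for illustration. In practice the scorer would be replaced by one of the benchmarked models, such as regularized logistic regression or a random forest.

```python
from collections import defaultdict


def loso_splits(study_ids):
    """Yield (held_out_study, train_indices, test_indices), one fold per study."""
    by_study = defaultdict(list)
    for i, study in enumerate(study_ids):
        by_study[study].append(i)
    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        train_idx = sorted(
            i for study, idx in by_study.items() if study != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: project onto (case mean - control mean)."""
    n_feat = len(X_train[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]

    case = mean([x for x, y in zip(X_train, y_train) if y == 1])
    ctrl = mean([x for x, y in zip(X_train, y_train) if y == 0])
    w = [c - d for c, d in zip(case, ctrl)]
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X_test]


def loso_evaluate(X, y, study_ids):
    """Train on all studies but one, test on the held-out study, for every study."""
    results = {}
    for held_out, train_idx, test_idx in loso_splits(study_ids):
        scores = centroid_scores(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        results[held_out] = auroc([y[i] for i in test_idx], scores)
    return results


# Synthetic toy data (not from the paper): 3 studies, 2 cases + 2 controls each,
# two relative-abundance features per sample.
X = [
    [0.80, 0.20], [0.70, 0.30], [0.30, 0.70], [0.20, 0.80],  # study A
    [0.90, 0.10], [0.60, 0.40], [0.40, 0.60], [0.10, 0.90],  # study B
    [0.75, 0.25], [0.65, 0.35], [0.35, 0.65], [0.25, 0.75],  # study C
]
y = [1, 1, 0, 0] * 3
study_ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

results = loso_evaluate(X, y, study_ids)
for study, score in results.items():
    print(f"held-out study {study}: AUROC = {score:.2f}")
```

Splitting by study rather than by sample is the point of the design: no sample from the held-out study reaches training, so the score reflects generalization to an unseen study rather than within-study fit. The cross-cohort transfer setting differs only in that training uses a single cohort instead of all remaining ones.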
