Uncertainty Modeling Outperforms Machine Learning for Microbiome Data Analysis

Maxwell A. Konnaris
Manan Saxena
Nicole Lazar
Justin D. Silverman

Abstract

Microbiome sequencing measures relative rather than absolute abundances, providing no direct information about total microbial load. Normalization methods attempt to compensate but rely on strong, often untestable assumptions that can bias inference. Experimental measurements of load (e.g., qPCR, flow cytometry) offer a solution but remain costly and uncommon. A recent high-profile study proposed that machine learning could bypass this limitation by predicting microbial load from sequencing data alone. To evaluate this claim, we assembled mutt, the largest public database of paired sequencing and load measurements, spanning 35 studies and over 15,000 samples. Using mutt, we show that published machine learning models fail to generalize: on average, they perform worse than a naive baseline that always predicts the training-set mean. These failures stem from covariate shift (limited shared taxa between studies, differences in community composition, and differences in preprocessing pipelines) that silently derails model inputs. In contrast, Bayesian partially identified models do not attempt to impute microbial load, but instead propagate scale uncertainty through downstream analyses. Across 30 benchmark datasets, Bayesian partially identified models consistently outperformed normalization and machine learning approaches, providing a principled and reproducible foundation for microbiome inference.
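The comparison against a naive baseline can be made concrete with a small sketch. The Python below is an illustration only, not the authors' evaluation code: the function names (`mse`, `skill_vs_training_mean`) and the log10 load values are hypothetical, chosen to show how a model can score worse than a predictor that always returns the training-set mean.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted loads."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def skill_vs_training_mean(y_train, y_test, y_pred):
    """Return 1 - MSE(model) / MSE(baseline) on the test samples.

    The baseline predicts the mean of y_train for every test sample.
    A value of 1 is a perfect model, 0 matches the baseline, and a
    negative value means the model is worse than the baseline.
    Assumes the test loads are not all equal to the training mean
    (otherwise the baseline error is zero).
    """
    baseline_pred = np.full(len(y_test), np.mean(y_train))
    return 1.0 - mse(y_test, y_pred) / mse(y_test, baseline_pred)


# Hypothetical log10 microbial loads (made-up numbers for illustration).
y_train = [10.0, 11.0, 12.0]   # loads seen during training (mean 11.0)
y_test = [10.5, 11.5]          # loads from a held-out study
y_pred = [12.5, 9.5]           # model predictions on the held-out study

score = skill_vs_training_mean(y_train, y_test, y_pred)
print(score)  # negative: the model is worse than the training-mean baseline
```

A score below zero in this sketch corresponds to the failure mode described in the abstract: the model's out-of-study error exceeds that of simply predicting the training-set mean.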

Version published to 10.1101/2025.09.12.675956 on bioRxiv
Sep 16, 2025

Related articles

Integrating Microbiome Data Visualization into FAIRDatabase using Edge Functions

This article has 3 authors:
1. Roman van Eldijk
2. Shivam Kumar
3. Vivek Sheraton M
This article has no evaluations
Latest version Jan 27, 2026

Quantitative evaluation of microbiome sequencing resolution under varying experimental conditions using defined mock communities

This article has 5 authors:
1. Songhee Lee
2. Hyeonah Lee
3. Jung Wook Kim
4. Hyeon-Jin Kim
5. Kwang Jun Lee
This article has no evaluations
Latest version Dec 30, 2025

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations
Latest version Jan 29, 2026
