Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Lourdes Velo-Suárez
Anthony F. Herzig
Ozvan Bocher
Gaëlle Le Folgoc
Liana Le Roux
Christelle Delmas
Marie Zins
Jean-François Deleuze
Geneviève Héry-Arnaud
Emmanuelle Génin

Abstract

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ∼43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ∼4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ∼10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10⁶ reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ∼2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
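The rarefaction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes rarefaction means drawing a fixed number of reads from each sample without replacement. The toy profiles, the function names `rarefy` and `richness`, and the small depth are hypothetical choices for illustration, not the authors' pipeline or data.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Subsample per-taxon read counts to `depth` reads, without replacement."""
    total = sum(counts)
    if depth > total:
        raise ValueError(f"cannot rarefy {total} reads to depth {depth}")
    rng = random.Random(seed)
    # Draw `depth` reads; taxon index i appears counts[i] times in the pool.
    # The `counts` argument of random.sample requires Python 3.9 or later.
    drawn = rng.sample(range(len(counts)), k=depth, counts=counts)
    tally = Counter(drawn)
    return [tally.get(i, 0) for i in range(len(counts))]

def richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

# Hypothetical toy profiles (not data from the study).
deep = [60_000, 30_000, 9_000, 900, 90, 10]   # 100,000 reads
shallow = [6_000, 3_000, 900, 90, 10, 0]      # 10,000 reads
depth = 5_000  # toy depth; the abstract reports rarefaction to 10**6 reads

for name, profile in (("deep", deep), ("shallow", shallow)):
    sub = rarefy(profile, depth)
    print(name, "richness before:", richness(profile), "after:", richness(sub))
```

Comparing samples at a common depth in this way separates differences caused by sequencing effort from differences caused by extraction, which is the comparison the abstract describes.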

Importance

Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, so valuable microbial information is overlooked. We show that these reads are sufficient to reconstruct robust oral microbiome profiles without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host–microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.

Version published to 10.64898/2026.03.27.714786 on bioRxiv
Apr 1, 2026

