Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Abstract

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ∼43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ∼4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ∼10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10⁶ reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ∼2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
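The rarefaction step mentioned above (subsampling every sample to a common depth, here 10⁶ reads, before comparing richness) can be illustrated with a minimal sketch. The abstract does not say which tool or procedure the authors used, so what follows is only one generic way to do it: the function names `rarefy` and `richness`, the taxon labels, and the toy counts are all hypothetical, and the example depth is far smaller than 10⁶ to keep it quick to run.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Draw `depth` reads without replacement from a taxon -> read-count dict.

    Returns a new taxon -> count dict, or None if the sample holds fewer
    than `depth` reads (such samples are usually dropped).
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # One label per read. Simple but memory-hungry; a real pipeline would
    # use a dedicated subsampling routine instead of expanding every read.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(pool, depth)))


def richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


# Hypothetical toy profiles: a deep sample and a shallow one.
deep = {"taxon_A": 5000, "taxon_B": 3000, "taxon_C": 40, "taxon_D": 3}
shallow = {"taxon_A": 600, "taxon_B": 390, "taxon_C": 10}

common_depth = 1000  # stand-in for the 10^6 reads used in the study
deep_rarefied = rarefy(deep, common_depth)
shallow_rarefied = rarefy(shallow, common_depth)
print(richness(deep), richness(deep_rarefied), richness(shallow_rarefied))
```

Because rare taxa are the first to drop out when reads are subsampled, comparing richness before and after rarefaction gives a rough sense of how much of an apparent difference between two sample sets comes from depth alone.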

Importance

Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host–microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.
