Replicate-anchored calibration of within-host single nucleotide variant detection in Mycobacterium tuberculosis whole genome sequencing
Abstract
Background
Intra-host genetic heterogeneity in Mycobacterium tuberculosis is biologically and clinically informative, but its detection from short read whole genome sequencing depends on thresholds over read depth (DP), alternate allele support (AD), and minor allele frequency (MAF) that are rarely empirically anchored.
Methods
We developed a biological replicate-anchored, lexicographic calibration framework for per-specimen intra-host single nucleotide variant (iSNV) detection. Within-patient replicate sputum pairs from a pre-treatment tuberculosis (TB) cohort were scored across the joint (DP, AD, MAF) grid on six concordance metrics; the selector penalized no-signal regimes and ranked cells by reproducibility, with sensitivity as a tiebreaker. Selection stability was quantified by B=1,000 nonparametric bootstrap resamples of replicate pairs, defining a Looser/Primary/Tighter sensitivity ladder. The calibrated rule was applied to 97 patients contributing 282 cultured sputum specimens.
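The selection logic described above can be illustrated with a small sketch. It is not the authors' R pipeline: the six concordance metrics are not specified here, so this example substitutes a single pooled Jaccard overlap for reproducibility and the count of concordant calls for sensitivity. The function names, the toy grid, and the toy replicate pairs are all hypothetical. The key idea it shows is the lexicographic ordering: cells with no signal rank last, then reproducibility decides, and sensitivity only breaks ties.

```python
import random
from collections import Counter
from itertools import product

MAF_UPPER = 0.50  # upper MAF bound, taken from the reported Primary cell


def passes(call, cell):
    """True if one call (dp, ad, maf) clears the cell's (DP, AD, MAF) floors."""
    dp, ad, maf = call
    min_dp, min_ad, min_maf = cell
    return dp >= min_dp and ad >= min_ad and min_maf <= maf <= MAF_UPPER


def score_cell(cell, pairs):
    """Return (has_signal, reproducibility, sensitivity) for one grid cell.

    Assumption: reproducibility is the Jaccard overlap of passing sites pooled
    over replicate pairs; sensitivity is the number of concordant sites.
    """
    shared = total = 0
    for calls_a, calls_b in pairs:
        kept_a = {site for site, call in calls_a.items() if passes(call, cell)}
        kept_b = {site for site, call in calls_b.items() if passes(call, cell)}
        shared += len(kept_a & kept_b)
        total += len(kept_a | kept_b)
    reproducibility = shared / total if total else 0.0
    return (total > 0, reproducibility, shared)


def select_cell(grid, pairs):
    """Pick the lexicographically best cell; ties fall to the earliest grid cell."""
    best = max(grid, key=lambda cell: score_cell(cell, pairs))
    return best if score_cell(best, pairs)[0] else None


def bootstrap_selection(grid, pairs, n_boot=1000, seed=0):
    """Resample replicate pairs with replacement and tally the selected cell."""
    rng = random.Random(seed)
    picks = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        cell = select_cell(grid, resample)
        if cell is not None:  # resamples with no signal are not counted
            picks[cell] += 1
    return picks


# Toy grid and toy replicate pairs (site -> (dp, ad, maf)); purely illustrative.
grid = list(product([20, 60], [2, 3], [0.01, 0.02]))
pairs = [
    ({100: (80, 5, 0.05), 200: (70, 5, 0.015)}, {100: (90, 6, 0.06)}),
    ({300: (65, 4, 0.03)}, {300: (75, 3, 0.04), 400: (25, 3, 0.10)}),
    ({500: (80, 2, 0.05), 600: (100, 8, 0.20)}, {600: (95, 7, 0.18)}),
]

print(select_cell(grid, pairs))
print(bootstrap_selection(grid, pairs, n_boot=200, seed=1).most_common(3))
```

In this toy data each discordant call fails exactly one threshold, so only the strictest cell removes all of them. The bootstrap tally then shows how often the same cell is chosen when the set of replicate pairs changes, which is the quantity the stability analysis reports.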
Results
Calibration on 169 replicate pairs from 67 patients identified a Primary cell at DP ≥ 60x, AD ≥ 3, MAF ∈ [0.02, 0.50]. The bootstrap modal cell coincided with the Primary cell in 45.7% of 703 successful replications (MAF = 0.02 in 100%; AD = 3 in 90.2%). Per-patient prevalence of detectable within-host diversity at the Primary tier was 16.5% (16/97, Wilson 95% CI: 10.4% to 25.1%); Looser and Tighter tiers yielded comparable rates.
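As a worked check of the numbers above, the following minimal sketch applies the Primary rule to a single site and computes a Wilson score interval for 16 of 97 patients. The function names and the dictionary layout are hypothetical, MAF is passed in directly because its exact definition is not given here, and z = 1.96 is assumed for the 95% level.

```python
import math

# Primary tier thresholds as reported in the Results.
PRIMARY = {"min_dp": 60, "min_ad": 3, "maf_low": 0.02, "maf_high": 0.50}


def is_isnv(dp, ad, maf, rule=PRIMARY):
    """True if a site clears the depth, allele-support, and MAF thresholds."""
    return (
        dp >= rule["min_dp"]
        and ad >= rule["min_ad"]
        and rule["maf_low"] <= maf <= rule["maf_high"]
    )


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(16, 97)
print(f"prevalence = {16 / 97:.1%}, Wilson 95% CI: {low:.1%} to {high:.1%}")
print(is_isnv(dp=75, ad=4, maf=0.05))  # clears all three thresholds
print(is_isnv(dp=75, ad=2, maf=0.05))  # fails the AD >= 3 requirement
```

Under these assumptions the interval rounds to 10.4% and 25.1%, in agreement with the reported values.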
Conclusion
Within-patient replicate concordance provides a reproducible empirical anchor for iSNV detection thresholds in TB whole genome sequencing. The framework is internally calibrated and reported with explicit sensitivity tiers. The calibrated rule can be applied to external cohorts as a direct test of transferability.
Data Summary
Raw paired-end Illumina whole genome sequencing reads for the 282 Mycobacterium tuberculosis genomes analyzed in this study have been deposited at the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA1466981. Per-sample run accessions and corresponding metadata are provided in Supplementary Data S1.
The M. tuberculosis H37Rv reference genome used for read alignment is available from GenBank under accession NC_000962.3.
The full analysis pipeline (R), calibration scripts, application scripts, parameter files, and derived data tables that reproduce all reported results, main figures, and supplementary materials are available at https://github.com/K01-tb-genomics/paper1_isnv_calibration. A versioned snapshot of the repository at the time of manuscript submission is archived at Zenodo under DOI 10.5281/zenodo.20246996.
Supplementary Tables S1-S5 and Supplementary Figure S1 are provided as supplementary files alongside the article.
The authors confirm that all supporting data, code and protocols have been provided within the article or through the cited public repositories.
Impact Statement
Within-host genetic diversity in Mycobacterium tuberculosis is increasingly used to study transmission, treatment response, and resistance emergence. The bioinformatic thresholds that distinguish low-frequency variants from sequencing noise vary widely across published TB studies and are rarely empirically justified, so the choice of threshold changes which patients are reported as harboring detectable within-host diversity. We provide a biological replicate-anchored calibration framework that ties threshold selection to within-patient replicate concordance, quantifies stability under resampling, and reports explicit sensitivity tiers. Applied to 97 TB patients, the calibrated rule identifies 16.5% as carrying detectable within-host M. tuberculosis variation. The framework is transferable to other TB cohorts as a transparent basis for cross-study comparison.
Lay Summary
Tuberculosis (TB) bacteria can vary slightly within a single patient. Bacterial cells in the same person may carry small genetic differences, sometimes affecting only a small fraction of the bacterial population. Studying this kind of within-patient variation matters for understanding how TB responds to treatment and how it spreads between people. Detecting these differences from DNA sequencing means distinguishing real signal from background noise. However, the rules used to make that distinction affect which patients are reported as carrying mixed bacterial populations. In this study, we developed a method for making that detection decision more reliably. Most patients gave more than one sputum sample on the same day. We used the agreement between repeat samples from the same patient to pick rules that gave consistent answers. We checked how stable our chosen rules were by repeating the selection on randomly drawn versions of the data, and we report stricter and looser versions of the rule alongside our preferred one. We applied the method to 97 TB patients. About one in six (16.5%) showed detectable genetic variation within their TB bacterial population using our preferred rule. The same rules can also be applied to other TB studies to check whether they give consistent results when used in a different setting.