Accurate detection of mosaic mutations at short tandem repeats from bulk sequencing data
Abstract
Short tandem repeats (STRs) are among the most mutable regions of the human genome, yet their somatic mosaicism remains poorly characterized due to the technical challenges of distinguishing genuine mutations from high intrinsic polymorphism and sequencing noise. Here, we introduce BulkMonSTR, a computational framework that combines STR-specific error modelling with machine-learning classification to enable accurate detection of mosaic STR mutations from bulk next-generation sequencing data. BulkMonSTR identifies nucleotide-resolution mutations—including insertions, deletions, and single-nucleotide variants (SNVs)—and supports both control-independent and case-control study designs. Leveraging a comprehensive training dataset derived from pedigree-based validation and in silico spike-in simulations, our random forest classifier effectively discriminates true mosaic events from germline variants and technical artifacts. Benchmarking on simulated and real datasets demonstrates that BulkMonSTR achieves substantially improved precision and F1 scores across diverse coverages and variant allele frequencies. In normal samples, cancer samples and controlled in silico mixing experiments, BulkMonSTR consistently outperforms existing methods, capturing a broader spectrum of STR mutations—including those arising on non-reference alleles—while achieving high validation rates. By enabling systematic, genome-wide interrogation of STR mosaicism, BulkMonSTR provides a scalable foundation for investigating the contributions of somatic STR mutations to aging and disease.
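The in silico mixing experiments mentioned above rest on simple arithmetic: when reads from a donor sample carrying an STR allele are mixed into a recipient sample lacking it, the expected variant allele frequency is the donor's allele fraction scaled by the donor's share of total depth. The sketch below illustrates only that relationship. It is not taken from BulkMonSTR; the function names, parameters, and the assumption of a heterozygous donor (allele fraction 0.5) are hypothetical choices made for illustration.

```python
def expected_mosaic_vaf(donor_depth, recipient_depth, donor_allele_fraction=0.5):
    """Expected VAF of a donor-only allele after mixing reads from two samples.

    Assumes (hypothetically) that the recipient carries no reads supporting
    the allele and that reads are sampled in proportion to depth.
    """
    total_depth = donor_depth + recipient_depth
    if total_depth <= 0:
        raise ValueError("total depth must be positive")
    return donor_allele_fraction * donor_depth / total_depth


def donor_depth_for_target_vaf(target_vaf, total_depth, donor_allele_fraction=0.5):
    """Donor depth needed to reach a target VAF at a fixed total depth."""
    if not 0 <= target_vaf <= donor_allele_fraction:
        raise ValueError("target VAF cannot exceed the donor allele fraction")
    return target_vaf * total_depth / donor_allele_fraction


# Mixing 10x of a heterozygous donor into 90x of a recipient
# places the donor allele at an expected VAF of 5%.
vaf = expected_mosaic_vaf(donor_depth=10, recipient_depth=90)
```

Under these assumptions, a mixing series at several donor depths gives a set of spike-in alleles with known expected frequencies, which is the kind of ground truth a benchmark across coverages and variant allele frequencies requires.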