LongPhase-S: purity estimation and variant recalibration with somatic haplotyping for long-read sequencing
Abstract
Accurate detection of somatic variants is crucial for precision oncology, and long-read sequencing offers clear advantages in resolving complex cancer genomes. However, most long-read somatic callers rely on phasing designed for a diploid genome, an assumption that contamination, subclonal heterogeneity, and aneuploidy in tumors violate. We present LongPhase-S, a novel method that jointly reconstructs somatic haplotypes, infers tumor purity, and recalibrates somatic variants in a purity-aware manner for paired tumor-normal long-read sequencing. By anchoring each somatic read to a parental germline lineage, LongPhase-S provides a phase-resolved view in which germline and somatic reads are disentangled across the genome. Building on somatic haplotyping, LongPhase-S trains a phase-aware purity estimator that outperforms existing methods. On eight benchmark datasets comprising six cancer cell lines, including breast, melanoma, and lung cancers, LongPhase-S boosted the accuracy of state-of-the-art somatic callers by using the estimated purity and somatic haplotypes. Specifically, mean F1 scores increased by 4.5% and 7.1% for single-nucleotide variants and for insertions and deletions, respectively, with ClairS, and by 1.2% and 0.5% with DeepSomatic. Collectively, these results show that somatic haplotyping is a critical yet missing piece in existing somatic callers, one that enables purity-aware and phase-resolved variant interpretation in heterogeneous tumors.
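To make concrete why purity matters for variant calling, the sketch below computes the expected variant allele fraction (VAF) of a clonal somatic variant under the standard mixture model of tumor and normal cells. It is an illustration only and not the LongPhase-S estimator: the function name `expected_vaf` and its parameters are hypothetical, and the model assumes a clonal variant and a normal copy number of two.

```python
def expected_vaf(purity, mutant_copies=1, tumor_copy_number=2, normal_copy_number=2):
    """Expected VAF of a clonal somatic variant in a tumor-normal mixture.

    Hypothetical helper for illustration; not part of LongPhase-S.
    purity            -- fraction of tumor cells in the sample, in [0, 1]
    mutant_copies     -- copies of the locus carrying the variant in tumor cells
    tumor_copy_number -- total copies of the locus in tumor cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    if mutant_copies > tumor_copy_number:
        raise ValueError("mutant_copies cannot exceed tumor_copy_number")
    # Alleles contributed per cell, weighted by each population's share.
    tumor_alleles = purity * tumor_copy_number
    normal_alleles = (1.0 - purity) * normal_copy_number
    total_alleles = tumor_alleles + normal_alleles
    if total_alleles == 0:
        raise ValueError("no alleles at this locus")
    return purity * mutant_copies / total_alleles


if __name__ == "__main__":
    # A heterozygous variant in a diploid region at decreasing purity.
    for p in (1.0, 0.5, 0.2):
        print(f"purity={p:.1f} expected VAF={expected_vaf(p):.3f}")
```

Under this model, a heterozygous variant expected at a VAF of 0.5 in a pure diploid tumor drops to 0.25 at 50% purity, and aneuploidy shifts it further, which is one reason a fixed diploid assumption can misjudge true somatic variants.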