Accurate calling of low-frequency somatic mutations by sample-specific modeling of error rates

Yixin Lin
Carmen Oroperv
Peter Sørud Porsgård
Amanda Frydendahl Boll Johansen
Mads Heilskov Rasmussen
Thomas Bataillon
Mikkel Heide Schierup
Claus Lindbjerg Andersen
Kristian Almstrup
Søren Besenbacher


Abstract

Calling rare somatic variants from NGS data is more challenging than calling inherited variants, especially if the somatic variant is only present in a small fraction of the cells in the sequenced biopsy. In this case, having a good estimate of the error rate of a specific base in a particular read becomes essential. In paired-end sequencing, where some DNA fragments are shorter than twice the read length, the overlapping regions of the read pairs are an ideal resource for training models to discern context-dependent base error rates, as any discordant bases in the overlaps must be caused by a sequencing error or an alignment error. We have created a new tool named BBQ (an acronym for Better Base Quality) that uses overlapping reads to estimate the error rate conditional on the mutation type, sequence context, and base quality. We also estimate how much the error rate of concordant bases in overlapping reads is decreased compared to bases in non-overlapping reads. Results show that overlapping reads can remove sequencing errors induced by DNA damage and that the increased quality of overlapping reads differs between samples and mutation types, reflecting different damage patterns between samples. We use the error models to call rare somatic variants. Sequencing data from a testis biopsy and a cell-free DNA sample serve as a proof-of-concept for rare germ cell mutation calling and for detecting rare cancer mutations. We find that using the sample-specific error models of BBQ allows us to call rare somatic variants with fewer false positives than existing tools such as Mutect2 and Strelka2.
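The idea of learning error rates from read-pair overlaps can be illustrated with a small sketch. This is not BBQ's implementation, only a toy version under stated assumptions: each overlap observation is a hypothetical tuple `(context, ref, base1, qual1, base2, qual2)`; when the two mates disagree, the base matching the reference is assumed to be correct and the other is counted as an error; and the error rate is a plain fraction per (context, mutation type, base quality) with no smoothing. The function names `overlap_length`, `estimate_error_rates`, and `phred` are invented for illustration.

```python
import math
from collections import Counter


def overlap_length(read_length, fragment_length):
    """Bases covered by both mates of a read pair (0 if the fragment is too long)."""
    # Mates overlap only when the fragment is shorter than twice the read length.
    return max(0, min(fragment_length, 2 * read_length - fragment_length))


def estimate_error_rates(overlap_observations):
    """Estimate error rates per (context, mutation type, base quality).

    Each observation is (context, ref, base1, qual1, base2, qual2), describing
    one reference position covered by both mates of a read pair.
    Assumption: in a discordant pair, the base that matches the reference is
    correct and the other one is the error.
    """
    errors = Counter()   # (context, "ref>alt", qual) -> number of erroneous bases
    totals = Counter()   # (context, qual) -> number of bases observed
    for context, ref, base1, qual1, base2, qual2 in overlap_observations:
        if base1 == base2:
            if base1 == ref:
                totals[(context, qual1)] += 1
                totals[(context, qual2)] += 1
            # Concordant non-reference bases may be real variants: skip them.
            continue
        for base, qual, mate in ((base1, qual1, base2), (base2, qual2, base1)):
            if base == ref:
                totals[(context, qual)] += 1
            elif mate == ref:
                totals[(context, qual)] += 1
                errors[(context, f"{ref}>{base}", qual)] += 1
            # If neither mate matches the reference, the pair is ambiguous: skip.
    return {
        (context, mutation, qual): n_err / totals[(context, qual)]
        for (context, mutation, qual), n_err in errors.items()
    }


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


# Toy data: 98 concordant pairs and 2 discordant pairs at an ACG context.
toy = [("ACG", "C", "C", 37, "C", 37)] * 98 + [("ACG", "C", "T", 37, "C", 37)] * 2
for key, rate in estimate_error_rates(toy).items():
    print(key, rate, phred(rate))
```

A real model would read overlaps from aligned reads, handle strand and read orientation, and smooth sparse counts; the sketch only shows how discordant overlap bases yield an empirical error rate that can differ from the reported base quality.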

Version published to 10.1101/2024.12.17.629019 on bioRxiv
Dec 20, 2024

