Accurate calling of low-frequency somatic mutations by sample-specific modeling of error rates
Abstract
Calling rare somatic variants from next-generation sequencing (NGS) data is more challenging than calling inherited variants, especially when the somatic variant is present in only a small fraction of the cells in the sequenced biopsy. In this case, a good estimate of the error rate of a specific base in a particular read becomes essential. In paired-end sequencing, where some DNA fragments are shorter than twice the read length, the overlapping regions of the read pairs are an ideal resource for training models of context-dependent base error rates, because any discordant bases in the overlaps must be caused by a sequencing error or an alignment error. We have created a new tool, BBQ (Better Base Quality), that uses overlapping reads to estimate the error rate conditional on the mutation type, sequence context, and base quality. We also estimate how much lower the error rate of concordant bases in overlapping reads is than that of bases in non-overlapping reads. Our results show that overlapping reads can remove sequencing errors induced by DNA damage and that the quality gain from overlapping reads differs between samples and mutation types, reflecting different damage patterns between samples. We use the error models to call rare somatic variants. Sequencing data from a testis biopsy and a cell-free DNA sample serve as a proof of concept for calling rare germ cell mutations and detecting rare cancer mutations. We find that the sample-specific error models of BBQ allow us to call rare somatic variants with fewer false positives than existing tools such as Mutect2 and Strelka2.
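To make the idea of estimating error rates from read overlaps concrete, here is a minimal sketch in Python. It is not BBQ's implementation: the function names, the trinucleotide context, the input tuple layout, and the pseudocount smoothing are all assumptions chosen for illustration. The sketch assumes each input record describes one base in an overlap whose partner read agrees with the reference, so any mismatch is counted as an error, and it tabulates a smoothed error rate for each combination of sequence context, mutation type, and base quality.

```python
import math
from collections import Counter


def estimate_error_rates(overlap_bases, pseudocount=0.5):
    """Estimate error rates stratified by context, mutation type and base quality.

    overlap_bases: iterable of (context, observed_base, base_quality) tuples
    (hypothetical layout). `context` is a trinucleotide whose middle letter
    is the reference base; the partner read is assumed to match the reference.
    """
    totals = Counter()  # (context, bq) -> number of bases observed
    errors = Counter()  # (context, alt, bq) -> number of discordant bases
    for context, observed, bq in overlap_bases:
        totals[(context, bq)] += 1
        if observed != context[1]:
            errors[(context, observed, bq)] += 1

    rates = {}
    for (context, bq), n in totals.items():
        ref = context[1]
        for alt in "ACGT":
            if alt == ref:
                continue
            k = errors[(context, alt, bq)]
            # Additive smoothing over the four possible bases keeps
            # never-observed error types at a small non-zero rate.
            rates[(context, f"{ref}>{alt}", bq)] = (k + pseudocount) / (n + 4 * pseudocount)
    return rates


def phred(rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(rate)


# Toy data (invented): 1000 bases in context ACG at reported quality 37,
# three of which are discordant C>T calls.
toy = [("ACG", "C", 37)] * 997 + [("ACG", "T", 37)] * 3
rates = estimate_error_rates(toy)
for key in sorted(rates):
    print(key, f"rate={rates[key]:.5f}", f"phred={phred(rates[key]):.1f}")
```

In this toy example the empirical quality for C>T in context ACG comes out well below the reported base quality of 37, which is the kind of sample-specific recalibration the abstract describes. A real pipeline would read overlaps from aligned read pairs and would also need to exclude true variant sites, both of which are omitted here.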