Sub-consensus haploid variant calling in Long-read sequencing technology

Abstract

Background: Next-generation sequencing (NGS) has become crucial in epidemiology, particularly for tracking viral evolution during outbreaks. While Oxford Nanopore Technologies (ONT) sequencing has gained popularity due to its long-read capabilities and cost-effectiveness, accurately identifying low-frequency variants in long-read data remains challenging. LoFreq, a commonly used variant caller for identifying rare variants in haploid datasets, was developed for short reads. This study aims to validate the use of LoFreq on long-read data and to propose a calibration method that enhances accuracy.

Methods: We constructed truth sets using three plasmids containing SARS-CoV-2 spike genes (7179 bases) with 100 SNVs between them, as well as full-length Escherichia coli genomes. Libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Recall was benchmarked with LoFreq and compared across flow cell chemistries and library sizes. We also developed a method to adjust base quality (Phred) scores to improve accuracy in long-read datasets.

Results: LoFreq demonstrated high sensitivity for detecting variants at allelic frequencies as low as 0.1, particularly with R10.4.1 chemistry. However, false discovery rates (FDR) were high and varied with sequencing depth and chemistry. R10.4.1 outperformed R9.4.1 in both sensitivity and FDR. We propose a Phred score calibration method that significantly reduced false positives while maintaining recall rates in specific cases. However, the method was unsuitable for recalling variants at frequencies below 10% and for structural variant discovery, both of which suffered significant recall loss.

Conclusion: While LoFreq remains useful for low-frequency variant calling in long-read data, high false discovery rates on either flow cell chemistry make its direct use on long-read data inadvisable. Our proposed quality score adjustment allows for improved detection of sub-consensus variants while reducing false discoveries. Though more fine-tuning is required for broader applicability, these findings address the lack of sub-consensus variant calling tools for long-read datasets and provide an adequate workaround for applying LoFreq to nanopore reads, which is crucial for future outbreak surveillance and pathogen evolution studies.
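For readers unfamiliar with the quantities the abstract reports, the following is a minimal Python sketch of the standard Phred definition (Q = -10 log10 p) and of recall and false discovery rate computed against a truth set. It is not the authors' calibration method, which the abstract does not describe. All function names are hypothetical, and `empirical_phred` (an observed-mismatch estimate with a simple pseudocount) is an assumed illustration of what adjusting a base quality score could look like.

```python
import math


def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse of the Phred definition; p must be > 0."""
    return -10 * math.log10(p)


def empirical_phred(mismatches, total_bases):
    """Assumed illustration, not the paper's method: estimate a quality score
    from mismatches observed at known-reference positions. The +1/+2
    pseudocount keeps the estimate finite when no mismatches are seen."""
    p = (mismatches + 1) / (total_bases + 2)
    return error_prob_to_phred(p)


def recall_and_fdr(truth, called):
    """Recall = TP / (TP + FN); FDR = FP / (TP + FP).
    Variants are any hashable keys, e.g. (contig, position, ref, alt)."""
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # true variants that were called
    fp = len(called - truth)   # calls absent from the truth set
    fn = len(truth - called)   # true variants that were missed
    recall = tp / (tp + fn) if truth else float("nan")
    fdr = fp / (tp + fp) if called else float("nan")
    return recall, fdr


if __name__ == "__main__":
    # Made-up toy variants, purely for illustration.
    truth = [("spike", 100, "A", "G"), ("spike", 250, "C", "T"),
             ("spike", 900, "G", "A"), ("spike", 1500, "T", "C")]
    called = truth[:3] + [("spike", 2000, "A", "T")]
    print(recall_and_fdr(truth, called))
    print(phred_to_error_prob(20), empirical_phred(9, 998))
```

In this toy input, three of four true variants are recovered and one of four calls is false. A base-quality-aware caller weighs each base by its reported error probability, so reported scores that overstate accuracy could plausibly inflate false calls, which is the kind of problem a quality score adjustment targets.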
