Sub-consensus haploid variant calling in Long-read sequencing technology

Xavier Zair
Andreas Wilm
Miles C Benton
Cheng Yong Tham
Lin Yang
Paola Florez De Sessions
October Michael Sessions
Eng Hui Chew
Swapnil Mishra

Abstract

Background: Next-generation sequencing (NGS) has become crucial in epidemiology, particularly for tracking viral evolution during outbreaks. While Oxford Nanopore Technologies (ONT) sequencing has gained popularity due to its long-read capabilities and cost-effectiveness, accurately identifying low-frequency variants in long-read data remains challenging. LoFreq, a commonly used variant caller for identifying rare variants in haploid datasets, was developed for short reads. This study aims to validate the use of LoFreq on long-read data and propose a calibration method to enhance accuracy.

Methods: We constructed truth sets using three plasmids containing SARS-CoV-2 spike genes (7179 bases) with 100 SNVs between them, as well as full-length Escherichia coli genomes. Libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Recall was benchmarked with LoFreq and compared between flow cell chemistries and library size. We also developed a method to adjust base quality (Phred) scores to improve accuracy in long-read datasets.

Results: LoFreq demonstrated high sensitivity for detecting variants at allelic frequencies as low as 0.1, particularly with R10.4.1 chemistry. However, false discovery rates (FDR) were significant, varying by sequencing depth and chemistry. R10.4.1 showed superior performance in both sensitivity and FDR compared to R9.4.1. We propose a Phred score calibration method that significantly reduced false positives while maintaining recall rates in specific cases. However, it was found to be unsuitable for recalling variants at less than 10% and for structural variant discovery as they suffered significant recall loss.

Conclusion: While LoFreq remains useful for low-frequency variant calling in long-read data, high false discovery rates on either flow cell chemistry make its direct use on long-read data inadvisable. Our proposed quality score adjustment allows for improved detection of sub-consensus variants while reducing false discoveries. Though more fine-tuning is required for broader applicability, these findings address the lack of sub-consensus variant calling tools for long-read datasets and provide an adequate workaround for applying LoFreq to nanopore reads, which is crucial for future outbreak surveillance and pathogen evolution studies.
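
The quantities named in the abstract (Phred scores, recall, and false discovery rate) follow standard definitions, which the short Python sketch below illustrates. It is not the authors' calibration method, which the abstract does not describe. The function names and the representation of a variant as a (position, alternate base) tuple are assumptions made purely for illustration.

```python
import math


def phred_to_error_prob(q):
    """Standard Phred definition: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse of the above: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def recall_and_fdr(truth, called):
    """Compare called variants against a truth set.

    Both arguments are iterables of hashable variant keys, here assumed
    to be (position, alt_base) tuples (a hypothetical representation).
    Returns (recall, fdr) where
      recall = TP / (TP + FN)
      fdr    = FP / (TP + FP)
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # true variants that were called
    fp = len(called - truth)   # calls that are not in the truth set
    fn = len(truth - called)   # true variants that were missed
    recall = tp / (tp + fn) if truth else 0.0
    fdr = fp / (tp + fp) if called else 0.0
    return recall, fdr


# Toy example with made-up variants (not data from the study).
truth_set = [(100, "A"), (250, "T"), (400, "G"), (910, "C")]
calls = [(100, "A"), (250, "T"), (400, "G"), (555, "A")]
recall, fdr = recall_and_fdr(truth_set, calls)
```

In this toy example, three of the four true variants are recovered and one of the four calls is false, so a caller can show good recall while still carrying a non-trivial FDR, which is the trade-off the abstract describes.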

Version published to 10.21203/rs.3.rs-6226988/v1 on Research Square
Jun 3, 2025

Related articles

HitSV: Maximizing discovery of structural variants across sequencing technologies

This article has 5 authors:
1. Yadong Wang
2. Gaoyang Li
3. Yadong Liu
4. Bo Liu
5. Long Qian
This article has no evaluations. Latest version: Feb 20, 2026

Beyond SNPs: Scalable Detection of Structural Variants Unlocks Hidden Genetic Diversity in Tomato

This article has 1 author:
1. Reza Shekasteband
This article has no evaluations. Latest version: Mar 10, 2026

A sensitive and accurate framework for population-scale structural variant discovery and genotyping across sequence types

This article has 4 authors:
1. Xin Wang
2. Guangbao Luo
3. Li Xiao
4. Zhangjun Fei
This article has no evaluations. Latest version: Feb 18, 2026
