Enhancing variant detection in complex genomes: leveraging linked reads for robust SNP, Indel, and structural variant analysis

Abstract

Background: Accurate detection of genetic variants, including single nucleotide polymorphisms (SNPs), small insertions and deletions (INDELs), and structural variants (SVs), is critical for comprehensive genomic analysis. Traditional short-read sequencing performs well for SNP and INDEL detection but struggles to resolve SVs, especially in complex genomic regions, because of its limited read length. Linked-read sequencing technologies, such as single-tube Long Fragment Read sequencing (stLFR), overcome this limitation by employing molecular barcodes that provide long-range information.

Methods: This study investigates conventional paired-end linked reads (PE100 stLFR) and a conceptual extension of linked-read technology: barcoded single-end reads of 500 bp (SE500 stLFR) and 1000 bp (SE1000 stLFR), generated using the stLFR platform. Compared with paired-end linked reads, these longer single-end reads could offer improved resolution for variant detection by providing longer reads per barcode. To explore this potential, we developed stLFR-sim, a Python-based simulator that reproduces the stLFR linked-read sequencing workflow to enable realistic simulation and benchmarking of linked-read data. Using stLFR-sim, we simulated a diverse set of datasets for the HG002 sample using T2T-based realistic genome simulation. Variant detection performance was then systematically assessed across three stLFR configurations: standard PE100 stLFR, SE500 stLFR, and SE1000 stLFR.

Results: Benchmarking against the Genome in a Bottle (GIAB) gold standard reveals distinct strengths of each configuration. The extended single-end reads (SE500 stLFR and SE1000 stLFR) significantly enhance SV detection, with SE1000 stLFR providing the best balance between precision and recall. In contrast, the shorter PE100 stLFR reads achieve higher precision for SNP and INDEL calling, particularly within high-confidence regions, though their performance drops in low-mappability contexts. To explore optimization strategies, we constructed hybrid libraries that combine paired-end and single-end barcoded reads. These hybrid libraries integrate the complementary advantages of the different read types and consistently outperform single libraries across small variant types and genomic contexts.

Conclusion: Collectively, our findings offer a robust comparative framework for evaluating stLFR sequencing strategies, highlight the promise of barcoded single-end reads for improving SV detection, and provide practical guidance for tailoring sequencing designs to the complexities of the genome.
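
The precision and recall trade-off mentioned in the Results can be made concrete with a minimal Python sketch. This is not the authors' benchmarking code or part of stLFR-sim. The `CallCounts` class, the function names, and the counts in the example are hypothetical, chosen only to illustrate how the metrics are computed once calls have been compared against a truth set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Hypothetical container for a variant-calling comparison against a truth set."""
    tp: int  # calls that match a truth-set variant
    fp: int  # calls with no matching truth-set variant
    fn: int  # truth-set variants that were not called


def precision(c: CallCounts) -> float:
    """Fraction of reported calls that are correct."""
    called = c.tp + c.fp
    return c.tp / called if called else 0.0


def recall(c: CallCounts) -> float:
    """Fraction of truth-set variants that were recovered."""
    truth = c.tp + c.fn
    return c.tp / truth if truth else 0.0


def f1(c: CallCounts) -> float:
    """Harmonic mean of precision and recall, one common way to summarize their balance."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; these are not results from the study.
example = CallCounts(tp=90, fp=10, fn=30)
print(f"precision={precision(example):.3f} recall={recall(example):.3f} f1={f1(example):.3f}")
```

A configuration with high precision but low recall reports few false calls yet misses many true variants. A single summary such as F1 rewards configurations that keep both values high.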
