Assessing genomic reproducibility of read alignment tools


Abstract

Genomic research relies on accurate and reproducible computational analyses of DNA sequencing data to draw reliable biological conclusions. Read mapping, the process of aligning reads to a reference genome, is central to many applications, including variant detection and comparative genomics. While several tools have been developed for this task, genomic reproducibility [1], defined as the consistency of results across replicates, remains underexplored. Here, we address this gap by introducing a methodology based on synthetic replicates of sequencing data, generated by perturbing the original reads through shuffling, reverse complementing, or combined shuffling and reverse complementing. Our approach can simulate variability observed across sequencing runs due to differences in library preparation techniques. We evaluated the reproducibility of eight alignment tools (BWA-MEM2 [2], Bowtie 2 [3], HISAT2 [4], minimap2 [5], NextGenMap [6], SNAP [7], SMALT [7,8] and Subread [9]) under these perturbations using whole-genome sequencing (WGS) data. Synthetic replicates were aligned and compared to the original sample to quantify discrepancies. Changes in mapping accuracy ranged from 0.0001% to 4.4% for primary reads (alignments not marked as secondary, supplementary, or duplicate) and reached up to 12.2% for high-quality primary reads. For primary reads, the percentage of reads commonly mapped in both the original and synthetic replicate ranged from 91.66% to 100% relative to the total number of mapped reads in the original dataset. High-quality filtering improved consistency, though some tools still failed to recover more than 70% of the original alignments. Within the set of common reads, the incidence of inconsistent mappings was as high as 13.53% for primary reads and 6.73% for high-quality primary reads.
Bowtie 2 was fully reproducible under the shuffling replicate and Subread under the reverse-complement replicate, whereas NextGenMap exhibited only minor inconsistencies. By contrast, SNAP and minimap2 displayed the greatest variability under reverse complementing. We further demonstrate that alignment inconsistencies propagate to the downstream task of calling structural genomic variants. Using Manta for structural-variant (SV) calling, we observed that Bowtie 2, HISAT2, and minimap2 maintained perfect SV concordance between original and replicate alignments, whereas other tools exhibited lower concordance, with Subread reaching only 87%. In conclusion, our evaluation demonstrates that synthetic perturbations reveal critical differences in how alignment tools handle technical variability and how these differences propagate to downstream variant analyses. These findings underscore the need to incorporate reproducibility benchmarks into the selection and validation of read mappers to ensure robust and reliable genomic interpretations.
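
As an illustration of the perturbations and comparison metrics described in the abstract, the following is a minimal Python sketch and not the authors' implementation. It assumes single-end reads held in memory as hypothetical (name, sequence, quality) tuples, and alignments summarized as a hypothetical mapping from read name to (chromosome, position) of the primary alignment. Strand is deliberately left out of the comparison, on the assumption that a reverse-complemented read mapped consistently lands at the same position on the opposite strand. For paired-end data, both mate files would have to be shuffled with the same permutation, which this sketch does not handle.

```python
import random

# Complement table for DNA bases; N stays N (assumed alphabet: A, C, G, T, N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(record):
    """Reverse-complement one read given as (name, sequence, quality).

    The quality string is reversed so each score stays with its base.
    """
    name, seq, qual = record
    return name, seq.translate(_COMPLEMENT)[::-1], qual[::-1]


def shuffle_reads(records, seed=0):
    """Return the same reads in a random order; the seed makes it repeatable."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def make_replicates(records, seed=0):
    """Build the three synthetic replicate types named in the abstract."""
    rc = [reverse_complement(r) for r in records]
    return {
        "shuffled": shuffle_reads(records, seed),
        "reverse_complemented": rc,
        "shuffled_reverse_complemented": shuffle_reads(rc, seed),
    }


def compare_mappings(original, replicate):
    """Compare two {read_name: (chrom, pos)} dicts of primary alignments.

    common_pct: reads mapped in both runs, relative to the reads mapped
        in the original run.
    inconsistent_pct: share of those common reads whose location differs.
    """
    common = original.keys() & replicate.keys()
    inconsistent = sum(1 for name in common if original[name] != replicate[name])
    return {
        "common_pct": 100.0 * len(common) / len(original) if original else 0.0,
        "inconsistent_pct": 100.0 * inconsistent / len(common) if common else 0.0,
    }
```

In practice, the replicate reads would be written back to FASTQ, aligned with each tool, and the resulting alignments filtered to primary (and optionally high-quality) records before being passed to `compare_mappings`.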
