Columba: Fast Approximate Pattern Matching with Optimized Search Schemes
Abstract
Aligning sequencing reads to reference genomes is a fundamental task in bioinformatics. Aligners can be classified as lossy or lossless: lossy aligners prioritize speed by reporting only one or a few high-scoring alignments, whereas lossless aligners report all optimal alignments, ensuring completeness and sensitivity. This paper introduces Columba, a high-performance lossless aligner tailored to Illumina sequencing data. Columba processes single-end or paired-end reads in FASTQ format and outputs alignments in SAM format. It owes its speed to advanced search schemes and bit-parallel alignment techniques. Columba is available in two variants. The first is based on the bidirectional FM-index. The second, Columba RLC, employs run-length compression using a bidirectional move structure, which significantly reduces memory usage for large, repetitive datasets such as pan-genomes. In extensive benchmarks, Columba outperforms existing lossless aligners in speed, particularly at higher error rates. Tests on the human genome and on bacterial and human pan-genome datasets demonstrate Columba’s robustness and efficiency. We also integrated Columba into the OptiType HLA genotyping pipeline, where it substantially reduced computational time while maintaining accuracy. These results position Columba as a versatile, state-of-the-art tool for high-sensitivity genomic analyses.
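For readers unfamiliar with the FM-index named above, the following is a minimal sketch of exact backward search on a toy, unidirectional FM-index. It is not Columba's implementation: Columba uses a bidirectional index together with search schemes for approximate matching, and none of that is shown here. The function names `build_fm_index` and `backward_search` are hypothetical, and the naive suffix-array construction is only suitable for very short strings.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    n = len(text)
    # Naive suffix array construction (illustration only, not scalable).
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that are strictly smaller than c.
    C = {}
    total = 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    # Extend the match one character at a time, from right to left.
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, C, occ = build_fm_index("ACGTACGTAC")
    print(backward_search("ACG", sa, C, occ))  # [0, 4]
    print(backward_search("GG", sa, C, occ))   # []
```

Each step of the loop narrows the suffix-array interval using only two table lookups, so the search cost depends on the pattern length rather than the reference length. Roughly speaking, approximate matching allows substitutions, insertions and deletions while extending the interval, and a bidirectional index additionally allows extension in both directions.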