Columba: Fast Approximate Pattern Matching with Optimized Search Schemes

Luca Renders
Lore Depuydt
Travis Gagie
Jan Fostier

Abstract

Aligning sequencing reads to reference genomes is a fundamental task in bioinformatics. Aligners can be classified as lossy or lossless: lossy aligners prioritize speed by reporting only one or a few high-scoring alignments, whereas lossless aligners output all optimal alignments, ensuring completeness and sensitivity. This paper introduces Columba, a high-performance lossless aligner tailored for Illumina sequencing data. Columba processes single or paired-end reads in FASTQ format and outputs alignments in SAM format. By utilizing advanced search schemes and bit-parallel alignment techniques, Columba achieves exceptional speed. Columba is available in two variants. The first is based on the bidirectional FM-index. The second, Columba RLC, employs run-length compression using a bidirectional move structure, significantly reducing memory usage for large, repetitive datasets like pan-genomes. Through extensive benchmarking, Columba outperforms existing lossless aligners in speed, particularly at higher error rates. Tests on the human genome and bacterial and human pan-genome datasets demonstrate Columba’s robustness and efficiency. We integrated Columba into the OptiType HLA genotyping pipeline, where it substantially reduced computational time while maintaining accuracy. These results position Columba as a versatile, state-of-the-art tool for high-sensitivity genomic analyses.
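The abstract names two index families: the FM-index, and a run-length compressed variant suited to repetitive data. The sketch below is a toy illustration of the ideas behind both, not Columba's implementation. It covers only exact matching by backward search over a Burrows-Wheeler transform, and it counts the runs in the transform to show why repetitive inputs compress well. It does not implement search schemes, bit-parallel alignment, bidirectional indexes, or the move structure. All function names (`bwt`, `count_runs`, `fm_count`) are hypothetical and chosen for illustration.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    s = text + "$"  # '$' sorts before letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def fm_count(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT.

    Rank queries are done by naive counting here (linear time per step);
    a real FM-index would use precomputed occurrence tables instead.
    """
    # first_row[c] = number of characters in the text strictly smaller than c
    counts = Counter(bwt_str)
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt_str)  # half-open interval over the sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    repetitive = "ACGT" * 50
    transformed = bwt(repetitive)
    print(len(transformed), count_runs(transformed))
    print(fm_count(transformed, "GTAC"))
```

On a highly repetitive input the transform collapses into a handful of runs, which is the property that run-length compressed indexes exploit. Approximate matching, as described in the abstract, would additionally have to explore mismatches and indels during the search.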

Version published to 10.1101/2025.03.26.645543v1 on bioRxiv
Mar 31, 2025

Related articles

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

This article has 3 authors:
1. Ali Ghaffaari
2. Alexander Schönhuth
3. Tobias Marschall
This article has no evaluations
Latest version Feb 17, 2025

Run-length compressed metagenomic read classification with SMEM-finding and tagging

This article has 5 authors:
1. Lore Depuydt
2. Omar Y. Ahmed
3. Jan Fostier
4. Ben Langmead
5. Travis Gagie
This article has no evaluations
Latest version Mar 24, 2025

Chimera: Ultrafast and Memory-efficient Database Construction for High-Accuracy Taxonomic Classification in the Age of Expanding Genomic Data

This article has 6 authors:
1. Qinzhong Tian
2. Pinglu Zhang
3. Yanming Wei
4. Quan Zou
5. Yansu Wang
6. Ximei Luo
This article has no evaluations
Latest version Mar 28, 2025
