Run-length compressed metagenomic read classification with SMEM-finding and tagging

Lore Depuydt
Omar Y. Ahmed
Jan Fostier
Ben Langmead
Travis Gagie


Abstract

Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
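
The pipeline described in the abstract (find SMEMs of length at least L, tag each with one class, then form a consensus) can be illustrated with a deliberately naive sketch. It is not the authors' implementation: plain substring search stands in for the run-length compressed move-structure index, the tag of an SMEM is assumed to be the first class whose sequence contains it (the abstract instead uses a sampled tag array), and the consensus is assumed to be a length-weighted vote, which the abstract does not specify. All function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def matching_statistics(read, references):
    """ms[i] = length of the longest prefix of read[i:] found in any reference.

    Naive substring search; an assumption standing in for the compressed index.
    """
    ms = []
    for i in range(len(read)):
        length = 0
        while i + length < len(read) and any(
            read[i:i + length + 1] in seq for seq in references.values()
        ):
            length += 1
        ms.append(length)
    return ms


def find_smems(read, references, min_len):
    """Return (start, length) for every SMEM of length >= min_len."""
    ms = matching_statistics(read, references)
    smems = []
    for i, length in enumerate(ms):
        # The match at i is right-maximal by construction. It is also
        # left-maximal (hence an SMEM) iff the match starting at i - 1
        # does not reach as far to the right.
        left_maximal = i == 0 or ms[i - 1] <= length
        if left_maximal and length >= min_len:
            smems.append((i, length))
    return smems


def tag_smem(read, smem, references):
    """Assumed tagging rule: the first class whose sequence contains the SMEM."""
    start, length = smem
    pattern = read[start:start + length]
    for class_id, seq in references.items():
        if pattern in seq:
            return class_id
    return None


def classify_read(read, references, min_len):
    """Assumed consensus: each SMEM votes for its class, weighted by length."""
    votes = defaultdict(int)
    for smem in find_smems(read, references, min_len):
        votes[tag_smem(read, smem, references)] += smem[1]
    if not votes:
        return None  # no SMEM of sufficient length: read stays unclassified
    return max(votes, key=votes.get)


# Toy reference set: one sequence per class (hypothetical data).
references = {"A": "AAAACCCC", "B": "GGGGTTTT"}
smems = find_smems("AACCGT", references, 2)
label = classify_read("AACCGT", references, 2)
```

On the toy read "AACCGT", this sketch finds one SMEM of length 4 that occurs only in class A and one of length 2 that occurs only in class B, so the length-weighted vote assigns the read to class A.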

Version published to 10.1101/2025.02.25.640119v2 on bioRxiv
Mar 24, 2025
Version published to 10.1101/2025.02.25.640119v1 on bioRxiv
Feb 28, 2025

Related articles

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

This article has 3 authors:
1. Ali Ghaffaari
2. Alexander Schönhuth
3. Tobias Marschall
This article has no evaluations. Latest version: Feb 17, 2025

gcSV: a unified framework for comprehensive structural variant detection

This article has 5 authors:
1. Gaoyang Li
2. Yadong Liu
3. Bo Liu
4. Long Qian
5. Yadong Wang
This article has no evaluations. Latest version: Feb 15, 2025

DeepMM: Identify and correct Metagenome Misassemblies with deep learning

This article has 5 authors:
1. Yi Ding
2. Jin Xiao
3. Bohao Zou
4. Chao Yang
5. Lu Zhang
This article has no evaluations. Latest version: Feb 13, 2025
