raxtax: A k-mer-based non-Bayesian Taxonomic Classifier

Noah A. Wahl
Georgios Koutsovoulos
Ben Bettisworth
Alexandros Stamatakis


Abstract

Taxonomic classification in biodiversity studies is the process of assigning the anonymous sequences of a marker gene (barcode) or whole genomes (metagenomics) to a specific lineage using a reference database that contains named sequences in a known taxonomy. This classification is important for assessing the diversity of biological systems. Taxonomic classification faces two main challenges: first, accuracy is critical as errors can propagate to downstream analysis results; and second, the classification time requirements can limit study size and study design, in particular when considering the constantly growing reference databases. To address these two challenges, we introduce raxtax, an efficient, novel taxonomic classification tool for barcodes that uses common k-mers between all pairs of query and reference sequences. We also introduce two novel uncertainty scores which take into account the fundamental biases of reference databases.
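As a rough illustration of what "common k-mers between all pairs of query and reference sequences" means, the following minimal Python sketch counts the distinct k-mers a query shares with each reference and reports the reference with the most matches. It is not the raxtax implementation and does not reproduce its scoring or its uncertainty scores; the function names, the toy sequences, the lineage strings, and the choice of k = 4 are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k, giving an empty set
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(query: str, references: Dict[str, str], k: int) -> Dict[str, int]:
    """Count distinct k-mers shared between the query and every reference."""
    query_kmers = kmers(query, k)
    return {
        lineage: len(query_kmers & kmers(ref_seq, k))
        for lineage, ref_seq in references.items()
    }


def best_match(query: str, references: Dict[str, str], k: int) -> Tuple[str, int]:
    """Return the lineage sharing the most k-mers with the query, and that count."""
    counts = shared_kmer_counts(query, references, k)
    lineage = max(counts, key=counts.get)
    return lineage, counts[lineage]


# Hypothetical toy reference database: lineage string -> barcode sequence
references = {
    "Bacteria;Firmicutes;Bacillus": "ACGTACGTTGCAAGCTTAGC",
    "Bacteria;Proteobacteria;Escherichia": "TTGACCGGTATCGGATCCAA",
}

# Hypothetical query that differs from the first reference by one base
query = "ACGTACGTTGCAAGCTAAGC"

print(best_match(query, references, k=4))
```

A naive all-pairs comparison like this scales with the product of query and reference counts, which is the scalability pressure the abstract refers to; a practical tool would need an indexed or otherwise optimized lookup.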

Results

We validate raxtax on three widely used empirical reference databases and show that it is 2.7-100 times faster than competing state-of-the-art tools on the largest database while being equally accurate. In particular, raxtax exhibits increasing speedups with growing query and reference sequence numbers compared to existing tools (for 100,000 and 1,000,000 query and reference sequences overall, it is 1.3 and 2.9 times faster, respectively), and therefore alleviates the taxonomic classification scalability challenge.

Availability and Implementation

raxtax is available at https://github.com/noahares/raxtax under a CC BY-NC-SA license. The scripts and summary metrics used in our analyses are available at https://github.com/noahares/raxtax_paper_scripts. The source code, sequence data, and summarized results of the analyses are available at https://doi.org/10.5281/zenodo.15057027.

Version published to 10.1101/2025.03.11.642618 on bioRxiv
Mar 14, 2025

