raxtax: A k-mer-based non-Bayesian Taxonomic Classifier

Abstract

Taxonomic classification in biodiversity studies is the process of assigning anonymous sequences of a marker gene (barcode) to a specific lineage using a reference database that contains named sequences in a known taxonomy. This classification is important for assessing the complexity of biological systems. Taxonomic classification faces two inherent challenges: first, accuracy is critical, as errors can propagate to downstream analysis results; and second, classification time requirements can limit study size and study design, particularly given the constantly growing reference databases. To address these two challenges, we introduce raxtax, an efficient, novel taxonomic classification tool that uses common k-mers between all pairs of query and reference sequences. We also introduce two novel uncertainty scores that take into account the fundamental biases of reference databases. We validate raxtax on three widely used empirical reference databases and show that it is 2.7 to 100 times faster than competing state-of-the-art tools on the largest database while being equally accurate. In particular, raxtax exhibits increasing speedups with growing query and reference sequence numbers compared to existing tools (for 100,000 and 1,000,000 query and reference sequences overall, it is 1.3 and 2.9 times faster, respectively), and therefore alleviates the taxonomic classification scalability challenge.
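The idea of counting common k-mers between every query and reference pair can be illustrated with a minimal Python sketch. This is not the raxtax implementation: the function names (`kmers`, `shared_kmer_counts`), the inverted-index approach, and the default `k=8` are assumptions chosen for illustration, and the abstract does not specify how raxtax turns such counts into a classification or into its uncertainty scores.

```python
from collections import Counter


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in a sequence (hypothetical helper)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(
    queries: dict[str, str],
    references: dict[str, str],
    k: int = 8,  # assumed default, not taken from the abstract
) -> dict[str, dict[str, int]]:
    """For each query, count the distinct k-mers it shares with each reference."""
    # Inverted index: k-mer -> list of reference IDs containing it.
    # Each reference is then visited only for the k-mers it actually holds,
    # instead of intersecting every query set with every reference set.
    index: dict[str, list[str]] = {}
    for ref_id, ref_seq in references.items():
        for kmer in kmers(ref_seq, k):
            index.setdefault(kmer, []).append(ref_id)

    result: dict[str, dict[str, int]] = {}
    for query_id, query_seq in queries.items():
        counts: Counter[str] = Counter()
        for kmer in kmers(query_seq, k):
            for ref_id in index.get(kmer, ()):
                counts[ref_id] += 1
        result[query_id] = dict(counts)
    return result


# Toy data, invented for this example.
references = {"refA": "ACGTACGTAC", "refB": "TTTTGGGGCC"}
queries = {"q1": "ACGTACGA"}
matches = shared_kmer_counts(queries, references, k=4)
```

In this toy run, `matches["q1"]` holds an entry only for `refA`, because `refB` shares no 4-mer with the query. References with no common k-mers are simply absent from the result, which keeps the output sparse as the database grows.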
