Precise and scalable metagenomic profiling with sample-tailored minimizer libraries

Johan Nyström-Persson
Nishad Bapatdhar
Samik Ghosh

Abstract

Reference-based metagenomic profiling requires large genome libraries to maximize detection and minimize false positives. However, as libraries grow, classification accuracy suffers, particularly in k-mer-based tools: the growing overlap in genomic regions among organisms pushes more reads to high-level taxonomic assignments, blunting precision. To address this, we propose sample-tailored minimizer libraries, which improve on the minimizer-LCA (lowest common ancestor) classification algorithm from the widely used Kraken 2 [1]. In this method, an initial filtering step using a large library removes genomes that do not resemble the sample, followed by a refined classification step using a smaller minimizer library built dynamically from the remaining genomes. This 2-step classification method shows significant performance improvements compared to the state of the art: in the CAMI2 [2] “strain madness” samples, the fraction of reads classified at species level increased by 3.5x, while for in silico samples it increased by 2.2x. To implement the 2-step method, we develop a new computational tool called Slacken, a distributed and highly scalable platform based on Apache Spark, which improves speed while keeping the cost per sample comparable to Kraken 2. The 2-step method achieves the sensitivity of large genomic libraries and the specificity of smaller ones, unlocking the true potential of large reference libraries for metagenomic read profiling.
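
The 2-step idea in the abstract can be illustrated with a toy Python sketch. This is not Slacken's or Kraken 2's implementation: the sequences, the taxonomy, the parameters `k`, `w` and `min_reads`, and every function name are hypothetical. Classification is simplified to taking the LCA of all minimizer hits in a read, and the filter keeps a genome only if at least `min_reads` reads were assigned directly to it in the first pass, which is an assumed stand-in for the actual filtering criteria.

```python
from collections import Counter

def minimizers(seq, k, w):
    """Smallest k-mer (lexicographic order) in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < w:
        return {min(kmers)} if kmers else set()
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

def lca(a, b, parent):
    """Lowest common ancestor in a tree given as a child -> parent map (root -> None)."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b

def build_library(genomes, parent, k, w):
    """Map each minimizer to the LCA of all taxa whose genome contains it."""
    lib = {}
    for taxon, seq in genomes.items():
        for m in minimizers(seq, k, w):
            lib[m] = lca(lib[m], taxon, parent) if m in lib else taxon
    return lib

def classify(read, lib, parent, k, w):
    """Simplified rule (assumption): LCA of all taxa hit by the read's minimizers."""
    hits = [lib[m] for m in minimizers(read, k, w) if m in lib]
    if not hits:
        return None
    result = hits[0]
    for h in hits[1:]:
        result = lca(result, h, parent)
    return result

def two_step(reads, genomes, parent, k=8, w=4, min_reads=1):
    # Step 1: classify against the large library, then keep only detected genomes.
    large = build_library(genomes, parent, k, w)
    first = [classify(r, large, parent, k, w) for r in reads]
    counts = Counter(first)
    kept = {t: s for t, s in genomes.items() if counts[t] >= min_reads}
    # Step 2: reclassify against the smaller, sample-tailored library.
    tailored = build_library(kept, parent, k, w)
    second = [classify(r, tailored, parent, k, w) for r in reads]
    return first, second

# Toy data: species A and B (genus G) share a region; C is unrelated.
shared = "ACGTACGGTCAGCTAGCATG"
unique_a = "TTGACCAATGGCCTTAAGGC"
genomes = {
    "A": shared + unique_a,
    "B": shared + "GGATCCTTAGGAACCGTTAA",
    "C": "CCCCGGGGAAAATTTTCCGGAATT",
}
parent = {"root": None, "G": "root", "A": "G", "B": "G", "C": "root"}
reads = [unique_a, shared]  # the sample contains only species A

first, second = two_step(reads, genomes, parent)
print(first)   # ['A', 'G']
print(second)  # ['A', 'A']
```

In this toy case the read drawn from the shared region can only be placed at genus level while species B is in the library. Once the first pass shows no evidence for B, the tailored library no longer contains it, and the same read resolves to species A.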

Version published to 10.1101/2024.12.22.629657 on bioRxiv
Dec 25, 2024
