Precise and scalable metagenomic profiling with sample-tailored minimizer libraries
Abstract
Reference-based metagenomic profiling requires large genome libraries to maximize detection and minimize false positives. However, classification accuracy suffers as libraries grow, particularly in k-mer-based tools: the increasing overlap in genomic regions among organisms pushes more reads to high-level taxonomic assignments, blunting precision. To address this, we propose sample-tailored minimizer libraries, which improve on the minimizer-LCA (lowest common ancestor) classification algorithm of the widely used Kraken 2 [1]. In this method, an initial filtering step using a large library removes genomes that do not resemble the sample, and a refined classification step then uses a smaller minimizer library built dynamically from the remaining genomes. This 2-step classification method shows significant performance improvements compared to the state of the art. To implement it, we develop a new computational tool called Slacken, a distributed and highly scalable platform based on Apache Spark, which improves speed while keeping the cost per sample comparable to Kraken 2. Specifically, in the CAMI2 [2] “strain madness” samples, the fraction of reads classified at species level increased by 3.5x, while for in silico samples it increased by 2.2x. The 2-step method achieves the sensitivity of large genomic libraries and the specificity of smaller ones, unlocking the true potential of large reference libraries for metagenomic read profiling.
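To make the 2-step idea concrete, the following is a minimal toy sketch in Python, not Slacken's or Kraken 2's actual implementation. It assumes plain lexicographic window minimizers (no hashing, canonical k-mers, or spaced seeds), a tiny hand-made taxonomy, and a simple filter that keeps a genome only if at least `min_reads` reads were assigned directly to its taxon in the first pass. All function names, parameters, and sequences are hypothetical and chosen for illustration only.

```python
from collections import Counter

def minimizers(seq, k=5, w=3):
    """Smallest k-mer (lexicographic, a simplification) in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    n_windows = max(1, len(kmers) - w + 1)
    return {min(kmers[i:i + w]) for i in range(n_windows)}

def lineage(taxon, parent):
    """Path from a taxon up to the root (the root's parent is None)."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = parent[taxon]
    return path

def lca(a, b, parent):
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(a, parent))
    return next(t for t in lineage(b, parent) if t in ancestors)

def build_library(genomes, parent, k=5, w=3):
    """Map each minimizer to the LCA of all taxa whose genome contains it."""
    library = {}
    for taxon, seq in genomes.items():
        for m in minimizers(seq, k, w):
            library[m] = lca(library[m], taxon, parent) if m in library else taxon
    return library

def classify(read, library, parent, k=5, w=3):
    """Assign the hit taxon with the best root-to-taxon path score (ties: deeper taxon)."""
    hits = Counter(library[m] for m in minimizers(read, k, w) if m in library)
    if not hits:
        return None  # unclassified
    def score(t):
        path = lineage(t, parent)
        return (sum(hits[a] for a in path), len(path))
    return max(hits, key=score)

def two_step_classify(reads, genomes, parent, min_reads=1, k=5, w=3):
    # Step 1: classify against the full library to find taxa present in the sample.
    full = build_library(genomes, parent, k, w)
    first = [classify(r, full, parent, k, w) for r in reads]
    counts = Counter(t for t in first if t is not None)
    # Assumed filter: keep genomes whose own taxon received at least min_reads reads.
    kept = {t: s for t, s in genomes.items() if counts[t] >= min_reads}
    # Step 2: rebuild a smaller, sample-tailored library and reclassify.
    tailored = build_library(kept, parent, k, w)
    second = [classify(r, tailored, parent, k, w) for r in reads]
    return first, second

# Toy data: two species in one genus that share a genomic region.
parent = {"root": None, "genus_G": "root",
          "species_A": "genus_G", "species_B": "genus_G"}
shared = "ACGTTGCAAGCTTAGC"
unique_a = "CCCACACCAACCCAAC"
unique_b = "GGTGTTGGTTTGGTGT"
genomes = {"species_A": unique_a + shared, "species_B": unique_b + shared}

# The sample contains only species A: one read from its unique region, one from the shared region.
reads = [unique_a, shared]
first, second = two_step_classify(reads, genomes, parent)
print(first)   # ['species_A', 'genus_G']
print(second)  # ['species_A', 'species_A']
```

In this toy sample, the read from the shared region can only be resolved to the genus with the full library, because its minimizers occur in both species. Once species B is filtered out for lack of direct evidence, the tailored library maps the same minimizers to species A, which mirrors the gain in species-level assignments described above.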