From classification to confirmation: verifying taxonomic classifications by mapping metagenomic reads to reference genomes

Abstract

Background: Obtaining high precision while maintaining high recall is an ongoing problem for metagenomic taxonomic classification in microbial ecology research. Parameter adjustments can achieve this in simulated samples, but in real samples, especially those from marine and soil environments, the proportion of classified reads drops sharply as precision increases. We therefore suggest verifying metagenomic taxonomic classifications obtained from a tool like Kraken by mapping the reads assigned to each taxon to its reference genome and assessing genomic coverage.

Results: In simulations, filtering the identified species to only those with ≥0.5% reference genome coverage removed 99.7% of false-positive taxa. Applying this method to samples from real datasets requires a more nuanced approach that considers sequencing depth, whether the samples are high- or low-microbial biomass, and database completeness with respect to the sampled environment. Nevertheless, we show that clinically relevant taxa identified by Kraken in human stool samples, such as Helicobacter pylori, lack any reads mapping to their reference genomes and are likely false positives driven by contaminating phage sequences within reference genomes. Similarly, in human blood and tumour datasets, only 18 and 11 species, respectively, have ≥1% reference genome coverage and likely represent sample collection or sequencing contaminants. Marine and soil samples pose additional challenges because they are less well represented in reference databases, which leads to low nucleotide identity between sequenced reads and reference genomes and to similarity only at higher taxonomic ranks.

Conclusions: We recommend genome coverage checking to researchers in all fields of microbial ecology and provide an open-source pipeline on GitHub (GeCoCheck): https://github.com/R-Wright-1/GeCoCheck.
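The coverage filter described in the Results can be illustrated with a minimal sketch. This is not the GeCoCheck implementation. It assumes that "reference genome coverage" means breadth of coverage (the fraction of reference positions covered by at least one mapped read) and that alignments are already available as half-open intervals on a single reference sequence per species. The function names, the input layout, and the example species are hypothetical; only the 0.5% threshold comes from the abstract.

```python
from typing import Dict, List, Tuple

# Assumption: each alignment is a half-open interval [start, end) in
# 0-based reference coordinates, as in BED-style output.
Interval = Tuple[int, int]


def covered_bases(intervals: List[Interval]) -> int:
    """Count reference positions covered by at least one read."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_fraction(intervals: List[Interval], genome_length: int) -> float:
    """Breadth of coverage as a fraction of the reference genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return covered_bases(intervals) / genome_length


def filter_by_coverage(
    alignments: Dict[str, List[Interval]],
    genome_lengths: Dict[str, int],
    min_fraction: float = 0.005,  # 0.5%, the simulation threshold in the abstract
) -> Dict[str, float]:
    """Keep only species whose breadth of coverage meets the threshold."""
    kept = {}
    for species, intervals in alignments.items():
        fraction = coverage_fraction(intervals, genome_lengths[species])
        if fraction >= min_fraction:
            kept[species] = fraction
    return kept


# Hypothetical example data: two species with 10 kb reference genomes.
example_alignments = {
    "species_A": [(0, 100), (50, 150)],  # 150 bases covered after merging
    "species_B": [(0, 10)],              # 10 bases covered
}
example_lengths = {"species_A": 10_000, "species_B": 10_000}
example_kept = filter_by_coverage(example_alignments, example_lengths)
```

In this toy input, species_A passes the 0.5% threshold and species_B does not. As the abstract notes, a fixed threshold is unlikely to transfer directly to real samples, where sequencing depth, microbial biomass, and database completeness would all need to inform the choice of `min_fraction`.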
