Characterization of microbial dark matter at scale with MetaSBT and taxonomy-aware Sequence Bloom Trees

Abstract

Metagenomics has become a powerful tool for studying microbial communities, allowing researchers to investigate microbial diversity within complex environmental samples. Recent advances in sequencing technology have enabled the recovery of near-complete microbial genomes, known as metagenome-assembled genomes (MAGs), directly from metagenomic samples. However, accurately characterizing these genomes remains a significant challenge because of sequencing errors, incomplete assembly, and contamination. Here we present MetaSBT, a new tool for organizing, indexing, and characterizing microbial reference genomes and MAGs. It identifies clusters of genomes at all seven taxonomic levels, from kingdom down to species, using the Sequence Bloom Tree (SBT), a data structure that relies on Bloom Filters (BFs) to index large numbers of genomes based on their k-mer composition. We have built an initial set of databases composed of over 190 thousand viral genomes from NCBI GenBank and public sources, grouped into sequence-consistent clusters at different taxonomic levels, making MetaSBT the first software solution for the classification of viruses at different ranks, including still unknown ones. This results in the definition of over 40 thousand species clusters, of which ~80% do not match any known viral species in reference databases to date. Furthermore, we show how our databases can be used as a new basis for existing quantitative metagenomic profilers to unlock the detection of unknown microbes and the estimation of their abundance in metagenomic samples. Finally, the framework is released as open source and, along with its public databases, is fully integrated into the Galaxy Platform, enabling broad accessibility.

Importance

The MetaSBT framework and its databases, together with its integration in the Galaxy Platform, provide a powerful resource for microbial research. MetaSBT offers a scalable approach for classifying microbial genomes, including previously unknown ones. This facilitates the discovery and characterization of novel taxa, which is crucial for expanding our knowledge of microbial diversity and its implications for host health and the environment. Furthermore, MetaSBT databases can serve as a reference base for other state-of-the-art tools, enhancing their ability to identify, analyze, and classify unknown microbes in metagenomic samples.
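
To make the indexing idea concrete, the following is a minimal Python sketch of how a Bloom filter can index the k-mers of a genome, and how a parent node in a Sequence Bloom Tree can be formed as the union of its children's filters. It is an illustration only and is not MetaSBT's implementation: the class `BloomFilter`, the helpers `kmers`, `genome_filter`, and `fraction_present`, the toy sequences, and all parameter values (k = 5, filter size, number of hashes) are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array (hypothetical helper)."""

    def __init__(self, size=4096, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Bitwise OR of two filters with identical parameters.
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def kmers(sequence, k):
    """Return the set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def genome_filter(sequence, k=5, size=4096, num_hashes=3):
    """Index every k-mer of a sequence in a fresh Bloom filter."""
    bf = BloomFilter(size, num_hashes)
    for kmer in kmers(sequence, k):
        bf.add(kmer)
    return bf


def fraction_present(query, bf, k=5):
    """Fraction of the query's k-mers that the filter reports as present."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    hits = sum(1 for kmer in query_kmers if kmer in bf)
    return hits / len(query_kmers)


# Two toy "genomes" as leaves, and a parent node holding the union of both.
leaf_a = genome_filter("ACGTACGTTAGCCGATAGCT")
leaf_b = genome_filter("TTGACCGGTAACGTTGCAAT")
parent = leaf_a.union(leaf_b)
```

Because a parent filter contains every bit set in its children, a query whose k-mers are largely absent from the parent cannot match any genome below it, so that whole subtree can be skipped during a search. In this sketch, `fraction_present("ACGTACGTTAGC", parent)` plays the role of that per-node check.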
