Fast and accurate taxonomic domain assignment of short metagenomic reads using BBERT

Abstract

Biological diversity revealed by metagenomic sequencing far exceeds that of known or cultured organisms, yet much of this diversity remains inaccessible because most sequences from complex habitats, such as soil, lack database references. The majority of soil-derived reads cannot be assembled or annotated and are designated as "microbial dark matter." Recent advances in large language models provide an opportunity to classify such unannotated sequences without relying on assembly or reference databases. We present BBERT, a BERT-based language model trained to determine whether a metagenomic read originates from bacteria. Using an out-of-distribution detection framework, BBERT flags reads that diverge from learned bacterial patterns as non-bacterial. By focusing on this fundamental split in the tree of life and training only on bacterial sequences, BBERT achieves high accuracy and enables large-scale analyses of soil metagenomes. When tested across 1,971 soil metagenomes, BBERT reliably identified bacterial reads, predicted coding potential, and determined reading frames directly from short reads, without assembly. Applying BBERT enhances downstream analyses and assembly performance, enabling a more robust analysis of microbial processes and their environmental drivers.
