Fast and accurate taxonomic domain assignment of short metagenomic reads using BBERT
Abstract
Metagenomes from complex environments such as soil contain vast biodiversity, yet most short reads cannot be taxonomically or functionally annotated because they lack reference genomes, obscuring the true structure and function of microbial communities. We introduce BBERT, a nucleotide large language model. BBERT identifies bacterial sequence syntax without relying on reference databases, enabling accurate assignment of taxonomic domain, coding potential, and reading frame directly from reads as short as 100 bp. Applying BBERT to a global dataset of soil metagenomes reveals that the majority of previously unannotated “microbial dark matter” is non-bacterial, and that resolving this conflation reshapes functional inferences from global surveys, uncovering functional differences between temperate and boreal-arctic soils. BBERT also improves de novo metagenomic assembly, reducing mismatches and gaps while shortening runtime. By providing fast, reference-free classification of short reads, BBERT unlocks large metagenomic archives for more accurate ecological and evolutionary analyses.