Fast and accurate taxonomic domain assignment of short metagenomic reads using BBERT

Abstract

Metagenomes from complex environments such as soil contain vast biodiversity, yet most short reads cannot be taxonomically or functionally annotated because they lack reference genomes, obscuring the true structure and function of microbial communities. We introduce BBERT, a nucleotide large language model. BBERT identifies bacterial sequence syntax without relying on reference databases, enabling accurate assignment of taxonomic domain, coding potential, and reading frame directly from reads as short as 100 bp. Applying BBERT to a global dataset of soil metagenomes reveals that the majority of previously unannotated “microbial dark matter” is non-bacterial, and that resolving this conflation reshapes functional inferences from global surveys, uncovering functional differences between temperate and boreal-arctic soils. BBERT also improves de novo metagenomic assembly, reducing mismatches and gaps while shortening runtime. By providing fast, reference-free classification of short reads, BBERT unlocks large metagenomic archives for more accurate ecological and evolutionary analyses.
