CompleteBin: A transformer-based framework unlocks microbial dark matter through improved short contig binning
Abstract
Metagenomic binning is crucial for reconstructing microbial genomes from metagenomic sequencing samples. However, existing tools struggle in complex communities where short, low-abundance contigs predominate, thereby limiting the recovery of complete metagenome-assembled genomes (MAGs) and the identification of novel functions. Here, we introduce CompleteBin, a Transformer-based framework that integrates contig sequence context, pre-trained taxonomic embeddings from a genome language model, and dynamic contrastive learning to bin short contigs robustly. Across CAMI II datasets, CompleteBin increased near-complete MAG recovery by 38.5% over leading methods like COMEBin. Across diverse real-world datasets (marine, freshwater, plant-associated, cold seep sediment, and human gut), it achieved a 57.4% improvement on average. Applying CompleteBin to six cold seep sediment samples uncovered 129 strain-level genome bins across 30 phyla, including 13 phyla undetected by other tools, and taxonomically assigned 90,405 genes (32.1% of total), revealing previously unknown species in nitrogen and sulfur cycling. CompleteBin unlocks microbial dark matter in diverse environments, advancing our understanding of microbial ecology and biogeochemical processes.