Genos-m: a foundation model for human-associated microbial genomes
Abstract
Human-associated microbial genomes encode extensive strain-level diversity and niche-specific gene repertoires that are critical to host health. However, these complex sequence features remain difficult to capture using general-purpose DNA foundation models, highlighting the need for dedicated representation learning tailored to the human microbiome. Here, we introduce Genos-m, an open-source foundation model for human-associated microbial genome representation. Genos-m was pretrained on approximately 1.2 trillion nucleotide tokens from a curated microbial genome corpus, including human-associated prokaryotic isolates, high-quality metagenome-assembled genomes (MAGs) and bacteriophages, supplemented with GTDB species-level representative genomes to broaden prokaryotic taxonomic breadth. The model uses a sparsely activated Mixture-of-Experts (MoE) Transformer architecture, with 4.7 billion total parameters, approximately 330 million activated parameters per forward pass and a maximum context length of one million base pairs.
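To make "sparsely activated" concrete, the following is a minimal sketch of top-k expert routing, the general mechanism by which an MoE layer runs only a few experts per token. It is not the Genos-m implementation: the abstract does not state the number of experts or the value of k, so `top_k_route`, `moe_layer`, and the toy experts are illustrative assumptions. Only the two parameter counts come from the text above.

```python
import math

# Figures quoted in the abstract.
TOTAL_PARAMS = 4.7e9
ACTIVE_PARAMS = 330e6
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # roughly 7% of weights per forward pass


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax restricted to the selected experts
    (hypothetical routing rule; the real router may differ).
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k):
    """Combine the outputs of the selected experts only; the rest are never run."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy usage: four scalar "experts", two of which are activated.
toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 0.0, lambda x: 4 * x]
output = moe_layer(3.0, toy_experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Because unselected experts are skipped entirely, compute per token scales with the activated parameters rather than the total, which is what allows a 4.7-billion-parameter model to run with about 330 million parameters per forward pass.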
We evaluated frozen Genos-m representations across short-sequence and gene-level tasks, biosynthetic gene cluster (BGC)-based regional sequence tasks, whole-genome strain phenotype prediction, and zero-shot transfer on prokaryote-related RNA fitness assays. Across these benchmarks, Genos-m consistently ranked among the top-performing models compared, achieving the best performance in five of eight gene-fitness regression tasks and in BGC type classification. Using sparse autoencoders, we identified sparse features in Genos-m hidden activations that aligned with annotated ORFs, intergenic regions, and tRNA and rRNA loci.
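For readers unfamiliar with sparse autoencoders, the sketch below shows a common formulation: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The abstract does not describe the authors' autoencoder, so the function names, the centering on the decoder bias, and the `l1_coeff` parameter are assumptions made for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode a hidden activation x into non-negative features, then reconstruct it."""
    centered = [a - b for a, b in zip(x, b_dec)]
    pre_act = [z + b for z, b in zip(matvec(W_enc, centered), b_enc)]
    features = [max(0.0, z) for z in pre_act]  # ReLU keeps only positive activations
    recon = [z + b for z, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    return mse + l1_coeff * sum(abs(f) for f in features)


# Toy usage with identity weights on a 2-dimensional activation.
identity = [[1.0, 0.0], [0.0, 1.0]]
feats, recon = sae_forward([1.0, -2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
loss = sae_loss([1.0, -2.0], feats, recon, l1_coeff=0.1)
```

In an analysis of this kind, each learned feature can then be compared against genome annotations by checking where along a sequence it is active.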
In downstream applications, incorporating Genos-m-derived, genome-informed species representations into a self-supervised learning model of the human microbiome improved colorectal cancer (CRC)-versus-control classification relative to conventional random forest models trained on species abundances. Genos-m also generated stable sample-level embeddings from as few as 10,000 metagenomic reads, retaining gut microbial community structure that distinguished geographic origin and aligned with enterotypes defined from full-depth taxonomic profiles.
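One simple way to obtain a sample-level embedding from a small number of reads is to subsample the reads, embed each one, and average the vectors. The sketch below illustrates that idea only. The abstract does not specify how Genos-m aggregates reads, so mean pooling is an assumption, and `toy_embed_read` is a hypothetical stand-in (nucleotide frequencies) for a call to the actual model.

```python
import random


def toy_embed_read(read):
    """Hypothetical placeholder for a model call: returns A/C/G/T frequencies."""
    length = len(read) or 1
    upper = read.upper()
    return [upper.count(base) / length for base in "ACGT"]


def sample_embedding(reads, embed_read=toy_embed_read, n_reads=10_000, seed=0):
    """Mean-pool per-read embeddings over at most n_reads randomly chosen reads."""
    if not reads:
        raise ValueError("at least one read is required")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    subset = reads if len(reads) <= n_reads else rng.sample(reads, n_reads)
    vectors = [embed_read(read) for read in subset]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy usage: two reads, pooled into one 4-dimensional sample vector.
pooled = sample_embedding(["AAAA", "CCCC"])
```

Stability at low depth could then be checked by repeating the subsampling with different seeds and comparing the resulting vectors.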
Together, these results support Genos-m as a reusable representation model for microbial genomes and metagenomes, with conclusions bounded by the reported datasets, task definitions and evaluation protocols. Genos-m model weights, inference code, and usage documentation are publicly available on GitHub (https://github.com/BGI-HangzhouAI/Genos-m) and Hugging Face (https://huggingface.co/BGI-HangzhouAI/Genos-m).