Genos-m: a foundation model for human-associated microbial genomes

Abstract

Human-associated microbial genomes encode extensive strain-level diversity and niche-specific gene repertoires that are critical to host health. However, these complex sequence features remain difficult to capture using general-purpose DNA foundation models, highlighting the need for dedicated representation learning tailored to the human microbiome. Here, we introduce Genos-m, an open-source foundation model for human-associated microbial genome representation. Genos-m was pretrained on approximately 1.2 trillion nucleotide tokens from a curated microbial genome corpus, including human-associated prokaryotic isolates, high-quality metagenome-assembled genomes (MAGs) and bacteriophages, supplemented with GTDB species-level representative genomes to extend prokaryotic taxonomic coverage. The model uses a sparsely activated Mixture-of-Experts (MoE) Transformer architecture, with 4.7 billion total parameters, approximately 330 million activated parameters per forward pass and a maximum context length of one million base pairs.
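As a rough illustration of what sparse activation means here, the sketch below computes the activated fraction implied by the reported parameter counts and shows a generic top-k router. The router, the number of experts and the value of k are assumptions chosen for illustration; the abstract does not describe how Genos-m routes tokens.

```python
import math

# Parameter counts as reported in the abstract.
TOTAL_PARAMS = 4.7e9
ACTIVE_PARAMS = 330e6

# Fraction of the weights used in one forward pass (about 7%).
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS


def top_k_route(logits, k):
    """Generic top-k MoE gating (hypothetical, not the Genos-m router).

    Picks the k experts with the highest router logits and returns
    (expert_index, weight) pairs, where the weights are a softmax
    taken over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


# Toy example: 4 experts, 2 active per token (both numbers are made up).
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
```

In a layer of this kind only the selected experts run for a given token, which is how total capacity can be much larger than the per-pass compute.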

We evaluated frozen Genos-m representations across short-sequence and gene-level tasks, biosynthetic gene cluster (BGC)-based regional sequence tasks, whole-genome strain phenotype prediction, and zero-shot transfer on prokaryote-related RNA fitness assays. Across these benchmarks, Genos-m consistently ranked among the leading comparison models, with the best performance in five of eight gene-fitness regression tasks and in BGC type classification. Using sparse autoencoders, we identified sparse features in Genos-m hidden activations that aligned with annotated ORFs, intergenic regions, and tRNA and rRNA loci.
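For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the standard formulation: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The dimensions, weights and penalty coefficient are toy values, and the exact autoencoder variant applied to Genos-m activations is not stated in the abstract.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode a hidden activation x into features f, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps only positively firing features
    x_hat = [a + b for a, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, x_hat, f, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(v) for v in f)


# Toy example: a 2-dimensional activation and 3 features (all values made up).
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, l1_coeff=0.1)
```

After training, each feature in `f` can be compared position by position with genome annotations, which is the kind of alignment with ORFs, intergenic regions and RNA loci described above.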

In downstream applications, Genos-m-derived genome-informed species representations incorporated into a human microbiome self-supervised learning model improved colorectal cancer (CRC)-control classification over conventional species-abundance random forest models. Genos-m also generated stable sample-level embeddings from as few as 10,000 metagenomic reads, retaining gut microbial community structure that distinguished geographic origin and aligned with enterotypes defined from full-depth taxonomic profiles.
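One plausible way to obtain a sample-level embedding from reads, and to check its stability at low depth, is sketched below: subsample the reads, average their per-read embeddings, and compare the result with the full-depth average. Mean pooling and cosine similarity are assumptions made for illustration, and the per-read vectors would come from the model; the abstract does not specify the aggregation actually used.

```python
import math
import random


def subsample_reads(reads, n, seed=0):
    """Draw n reads without replacement; return all reads if n is too large."""
    if n >= len(reads):
        return list(reads)
    return random.Random(seed).sample(reads, n)


def mean_pool(vectors):
    """Average per-read embedding vectors into one sample-level vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy per-read embeddings standing in for model outputs (values are made up).
read_embeddings = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [2.0, 3.0]]
full = mean_pool(read_embeddings)
shallow = mean_pool(subsample_reads(read_embeddings, 2, seed=1))
stability = cosine(full, shallow)
```

With real data, `n` would be set to 10,000 and `stability` tracked across several seeds to see how much the sample embedding moves between subsamples.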

Together, these results support Genos-m as a reusable representation model for microbial genomes and metagenomes, with conclusions bounded by the reported datasets, task definitions and evaluation protocols. Genos-m model weights, inference code, and usage documentation are publicly available on GitHub (https://github.com/BGI-HangzhouAI/Genos-m) and Hugging Face (https://huggingface.co/BGI-HangzhouAI/Genos-m).
