Genos-m: a foundation model for human-associated microbial genomes

Chao Fang
Fangming Yang
Hao Hou
Huahui Ren
Huanzi Zhong
Huinan Xu
Jiahao Zhang
Jianxin Su
Jielun Cai
Jingnan Yuan
Leo Jingyu Lee
Junhua Li
Kui Wu
Lihui Wang
Liwen Xiong
Long Hou
Meng Ni
Shida Zhu
Shiping Liu
Sirong Liu
Ting Zhu
Xiaofang Chen
Xiaofeng Wang
Zhan Xiao
Xin Jin
Xinting Liu
Xuyang Feng
Yinbin Qiu
Yujing Liu
Yupeng Zhou
Yuxiang Lin
Zhaorong Li
Zhouming Huang
Zhun Shi

Abstract

Human-associated microbial genomes encode extensive strain-level diversity and niche-specific gene repertoires that are critical to host health. However, these complex sequence features remain difficult to capture using general-purpose DNA foundation models, highlighting the need for dedicated representation learning tailored to the human microbiome. Here, we introduce Genos-m, an open-source foundation model for human-associated microbial genome representation. Genos-m was pretrained on approximately 1.2 trillion nucleotide tokens from a curated microbial genome corpus, including human-associated prokaryotic isolates, high-quality metagenome-assembled genomes (MAGs) and bacteriophages, supplemented with GTDB species-level representative genomes to broaden prokaryotic taxonomic breadth. The model uses a sparsely activated Mixture-of-Experts (MoE) Transformer architecture, with 4.7 billion total parameters, approximately 330 million activated parameters per forward pass and a maximum context length of one million base pairs.
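To make the sparse-activation figures concrete, the sketch below computes the fraction of parameters active per forward pass from the two counts given above (roughly 7%) and illustrates generic top-k expert routing. The abstract does not state the number of experts, the value of k, or the gating function, so `top_k_route`, the toy logits, and the softmax-over-selected-experts weighting are assumptions chosen for illustration, not the Genos-m implementation.

```python
import math

TOTAL_PARAMS = 4.7e9    # total parameters, as stated in the abstract
ACTIVE_PARAMS = 330e6   # approximate activated parameters per forward pass


def active_fraction(active, total):
    """Share of parameters that participate in a single forward pass."""
    return active / total


def top_k_route(gate_logits, k):
    """Generic top-k routing (assumed, not taken from the paper).

    Returns (expert_index, weight) pairs for the k highest-scoring experts,
    with weights given by a softmax over the selected logits only.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"active fraction: {fraction:.1%}")

# Toy gate scores for four hypothetical experts; only two are used per token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routes)
```

Only the selected experts run for a given token, which is why compute per forward pass tracks the activated count rather than the total.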

We evaluated frozen Genos-m representations across short-sequence and gene-level tasks, biosynthetic gene cluster (BGC)-based regional sequence tasks, whole-genome strain phenotype prediction, and zero-shot transfer on prokaryote-related RNA fitness assays. Across these benchmarks, Genos-m consistently ranked among the leading comparison models, with the best performance in five of eight gene-fitness regression tasks and in BGC type classification. Using sparse autoencoders, we identified sparse features in Genos-m hidden activations that aligned with annotated ORFs, intergenic regions, and tRNA and rRNA loci.
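As a rough illustration of the sparse autoencoder idea mentioned above, the following sketch runs one forward pass of a textbook sparse autoencoder: a ReLU encoder, a linear decoder, and a loss combining reconstruction error with an L1 sparsity penalty. The abstract does not specify the autoencoder variant, dictionary size, or penalty used, so the function `sae_forward`, the toy weights, and `l1_coeff` are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """One forward pass of a generic sparse autoencoder (assumed form).

    x        : hidden activation vector taken from the frozen model
    features : non-negative codes; most should be zero after training
    loss     : mean squared reconstruction error + l1_coeff * sum(|features|)
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]               # ReLU encoder
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return features, recon, mse + l1_coeff * l1


# Toy example: a 2-dimensional activation and a 3-feature dictionary.
x = [1.0, -1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

features, recon, loss = sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1)
print(features, recon, loss)
```

In an analysis like the one described, each feature's activation along a genome would then be compared with annotation tracks such as ORFs or tRNA loci; that comparison step is not shown here.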

In downstream applications, Genos-m-derived genome-informed species representations incorporated into a human microbiome self-supervised learning model improved colorectal cancer (CRC)-control classification over conventional species-abundance random forest models. Genos-m also generated stable sample-level embeddings from as few as 10,000 metagenomic reads, retaining gut microbial community structure that distinguished geographic origin and aligned with enterotypes defined from full-depth taxonomic profiles.
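One simple way to picture a sample-level embedding built from a small read subsample is to embed each read and average the vectors. The sketch below does this under stated assumptions: the abstract does not say how reads are pooled, so mean pooling is an assumption, and `embed_read` is a hypothetical stand-in (a base-frequency vector) used only so the example runs without the model.

```python
import random


def embed_read(read):
    """Hypothetical stand-in for a model embedding: A/C/G/T frequencies."""
    length = max(len(read), 1)
    return [read.count(base) / length for base in "ACGT"]


def sample_embedding(reads, n_reads=10_000, seed=0, embed=embed_read):
    """Mean-pool read embeddings over a random subsample of at most n_reads.

    Mean pooling is an assumption made for illustration; a real pipeline
    would pass the model's own embedding function as `embed`.
    """
    if not reads:
        raise ValueError("at least one read is required")
    rng = random.Random(seed)   # fixed seed keeps the subsample reproducible
    chosen = reads if len(reads) <= n_reads else rng.sample(reads, n_reads)
    vectors = [embed(r) for r in chosen]
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


toy_reads = ["AAAA", "CCCC"]
print(sample_embedding(toy_reads))
```

Stability at low depth could then be checked by comparing embeddings from different seeds or subsample sizes for the same sample.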

Together, these results support Genos-m as a reusable representation model for microbial genomes and metagenomes, with conclusions bounded by the reported datasets, task definitions and evaluation protocols. Genos-m model weights, inference code, and usage documentation are publicly available on GitHub (https://github.com/BGI-HangzhouAI/Genos-m) and Hugging Face (https://huggingface.co/BGI-HangzhouAI/Genos-m).

Version published to 10.64898/2026.05.21.726868 on bioRxiv
May 24, 2026
