GENERanno: A Genomic Foundation Model for Metagenomic Annotation


Abstract

The rapid growth of genomic and metagenomic data has underscored the pressing need for advanced computational tools capable of deciphering complex biological sequences. In this study, we introduce GENERanno, a compact yet powerful genomic foundation model specifically optimized for metagenomic annotation. Trained on an extensive dataset comprising 715 billion base pairs (bp) of prokaryotic DNA, GENERanno employs a transformer encoder architecture with 500 million parameters, enabling bidirectional attention over sequences of up to 8192 nucleotides at single-nucleotide resolution. This design addresses key limitations of existing methods, including the inability of traditional Hidden Markov Models (HMMs) to handle fragmented DNA sequences and the suboptimal tokenization schemes of current genomic foundation models, which compromise fine-grained analysis. To evaluate model performance, we curated the Prokaryotic Gener Tasks, a biologically meaningful benchmark encompassing gene fitness prediction, antibiotic resistance prediction, gene classification, and taxonomic classification. Across these tasks, GENERanno consistently outperforms its counterparts, establishing itself as a leading genomic foundation model in the prokaryotic domain. For metagenomic annotation, GENERanno achieves higher accuracy than traditional HMM-based methods (e.g., GLIMMER3, GeneMarkS2, Prodigal) and recent LLM-based approaches (e.g., GeneLM), while generalizing well to archaeal genomes. Notably, GENERanno pioneers the prediction of pseudogenes from sequence data alone, leveraging its contextual understanding to differentiate non-functional sequences from active coding regions. Overall, GENERanno represents a significant advancement in genomic foundation modeling, bridging the gap between large-scale sequence analysis and fine-grained biological insights. By providing a versatile tool for metagenomic annotation and broader genomic exploration, this work lays the groundwork for future research in functional genomics and related fields. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERanno.
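
To make "single-nucleotide resolution" concrete, the following is a minimal sketch of per-base tokenization with an 8192-token window, contrasted with non-overlapping k-mer tokenization. It is an illustration only and not the GENERanno implementation: the vocabulary `VOCAB`, the helper names `tokenize`, `windows`, and `kmer_token_count`, and the choice of k = 6 are all assumptions; only the 8192 context length comes from the abstract.

```python
# Hypothetical vocabulary; the model's actual vocabulary is not given in the abstract.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
MAX_LEN = 8192  # context length stated in the abstract


def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to one token id; unknown symbols fall back to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def windows(seq: str, max_len: int = MAX_LEN) -> list[list[int]]:
    """Split a sequence into non-overlapping windows of at most max_len tokens."""
    ids = tokenize(seq)
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


def kmer_token_count(seq: str, k: int = 6) -> int:
    """Token count under non-overlapping k-mer tokenization, for comparison."""
    return -(-len(seq) // k)  # ceiling division


if __name__ == "__main__":
    fragment = "ATGACCNNGT" * 1000  # 10,000 bp toy fragment
    print([len(w) for w in windows(fragment)])
    print(kmer_token_count(fragment))
```

With one token per base, a per-token prediction such as coding versus non-coding maps directly onto individual nucleotides, whereas a k-mer token forces one label to cover k bases at once.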
