GENERanno: A Genomic Foundation Model for Metagenomic Annotation

Abstract

The rapid growth of genomic and metagenomic data has underscored the need for computational tools capable of deciphering complex biological sequences. In this study, we introduce GENERanno, a compact yet powerful genomic foundation model (GFM) specifically optimized for metagenomic annotation. Trained on 715 billion base pairs (bp) of prokaryotic DNA, GENERanno employs a transformer encoder architecture with 500 million parameters, enabling bidirectional attention over sequences of up to 8192 bp at single-nucleotide resolution. This design addresses key limitations of existing methods, including the inability of traditional Hidden Markov Models (HMMs) to handle fragmented DNA sequences from multi-species microbial communities, and the suboptimal tokenization schemes of existing GFMs, which compromise fine-grained analysis. At its core, GENERanno excels at identifying coding regions in fragmented and mixed DNA sequences, a hallmark of metagenomic analysis. It achieves higher accuracy than traditional HMM-based methods (e.g., GLIMMER3, GeneMarkS2, Prodigal) and recent LLM-based approaches (e.g., GeneLM), while generalizing robustly to archaeal genomes. Building on its contextual understanding of sequences, GENERanno further enables two essential functions: pseudogene prediction and taxonomic classification, both performed solely on raw sequence data, without reliance on reference databases or comparative genomics. Together, these functions streamline the metagenomic analysis pipeline, significantly reducing preprocessing requirements and enabling end-to-end interpretation of sequencing data. Beyond its primary role in metagenomic annotation, GENERanno also serves as a general-purpose GFM. To evaluate its broader utility, we curated the Prokaryotic Gener Tasks, a comprehensive benchmark suite specifically tailored for prokaryotic genomic analysis. It includes gene fitness prediction, antibiotic resistance identification, gene classification, and taxonomic classification, reflecting diverse aspects of functional genomics. On this benchmark, GENERanno consistently outperforms existing GFMs such as DNABERT-2, NT-v2, and GenomeOcean, demonstrating strong generalization across a wide range of genomic tasks. Overall, GENERanno provides a unified framework that integrates multiple critical functions for metagenomic annotation and beyond. By eliminating dependencies on external resources and offering rich contextual understanding of genomic sequences, this work delivers a foundational tool for advancing functional genomics in complex microbial communities. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERanno .
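
To make "single-nucleotide resolution" and the 8192 bp context limit concrete, the following is a minimal sketch of how a sequence might be tokenized one base at a time and split into windows that fit the stated context length. It is an illustration only, not the model's actual tokenizer: the vocabulary `VOCAB`, the helper names `tokenize` and `windows`, and the choice of non-overlapping windows are all assumptions made for this example. Only the 8192 bp limit comes from the abstract.

```python
# Hypothetical sketch: NOT the tokenizer shipped with the model.
# Only the 8192 bp context length is taken from the abstract.
MAX_LEN = 8192

# Assumed vocabulary: one token id per nucleotide, plus "N" for anything else.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize(seq: str) -> list[int]:
    """Map each base to exactly one token id (single-nucleotide resolution)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def windows(seq: str, max_len: int = MAX_LEN) -> list[list[int]]:
    """Split a sequence into non-overlapping windows of at most max_len tokens."""
    ids = tokenize(seq)
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


if __name__ == "__main__":
    # A 10,000 bp contig needs two windows under an 8192 bp limit.
    print([len(w) for w in windows("ACGT" * 2500)])
```

Because every base maps to its own token, a per-token prediction (such as coding versus non-coding) lines up directly with a genomic position, which is the property that coarser tokenization schemes give up.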
