OneGenomeRice (OGR): A Genomic Foundation Model for Rice
Abstract
The transition of genomics to a predictive intelligence discipline is driven by the advent of genomic foundation models. While substantial progress has been made with human-centric models, plant genomics, particularly for staple crops, remains hindered by a lack of such models. Here we introduce OneGenomeRice (OGR), a genomic foundation model for rice (Oryza sativa) built on a Mixture of Experts (MoE) transformer architecture with 1.25 billion parameters. OGR was pre-trained on a genomic dataset comprising 422 high-quality genomes of cultivated and wild rice. A comprehensive benchmark, including short-sequence motif identification, long-range regulatory modeling, single-nucleotide resolution prediction, selective sweep detection and subspecies classification, demonstrated that OGR significantly outperforms existing state-of-the-art plant or all-life genome models in 11 categories. The model was further applied to several downstream tasks, such as detecting introgression between indica and japonica subspecies using embedding-based supervised classification, identifying functional loci associated with agronomic traits through attention-derived importance signals, and predicting gene expression from DNA sequences. These results indicate that OGR is a promising foundational computational infrastructure for functional genomics and precision breeding of rice.