OneGenomeRice (OGR): A Genomic Foundation Model for Rice

Abstract

The transition of genomics into a predictive intelligence discipline is being driven by the advent of genomic foundation models. While human-centric models have made substantial progress, plant genomics, particularly for staple crops, remains hindered by a lack of such models. Here we introduce OneGenomeRice (OGR), a genomic foundation model for rice (Oryza sativa) built on a Mixture of Experts (MoE) transformer architecture with 1.25 billion parameters. OGR was pre-trained on a genomic dataset comprising 422 high-quality genomes of cultivated and wild rice. A comprehensive benchmark, covering short-sequence motif identification, long-range regulatory modeling, single-nucleotide resolution prediction, selective sweep detection, and subspecies classification, demonstrated that OGR significantly outperforms existing state-of-the-art plant or all-life genome models in 11 categories. The model was further applied to several downstream tasks: identifying introgression between the indica and japonica subspecies using embedding-based supervised classification, locating functional loci associated with agronomic traits through attention-derived importance signals, and predicting gene expression from DNA sequences. These results indicate that OGR is a promising foundational computational infrastructure for functional genomics and precision breeding of rice.
