A Large-Scale Foundation Model for RNA Enables Diverse Function and Structure Prediction
Abstract
Accurately predicting RNA structures and functions from nucleotide sequences, or conversely, designing sequences to meet structural and functional requirements, remains a fundamental challenge in RNA biology, largely due to limited annotated data and the inefficiency of ab initio modeling approaches. Here, we introduce AIDO.RNA, a large-scale RNA foundation model that leverages self-supervised pre-training to learn general and effective RNA representations, which can be transferred to tackle a wide range of RNA prediction and design tasks. AIDO.RNA is a 1.6-billion-parameter transformer-based language model, pre-trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution. It can be adapted to achieve state-of-the-art performance on 26 out of 28 diverse tasks, including RNA structure and function prediction, mRNA expression modeling, multi-modal RNA isoform expression prediction, and RNA inverse folding, demonstrating its effectiveness and versatility. We find that, beyond excelling at ncRNA-related tasks that lie directly within the pre-training data distribution, AIDO.RNA can be efficiently adapted to new domains through continued domain-specific pre-training, generalizing to the untranslated and coding regions of mRNA and suggesting a promising pathway for continually improving biological foundation models in general. We make AIDO.RNA open source and release model utilities in AIDO.ModelGenerator, a Python package that enables easy reproduction, application, and extension of our results.
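Since the abstract describes adapting the pre-trained model to downstream tasks, here is a minimal sketch of extracting per-nucleotide representations from AIDO.RNA via the standard Hugging Face transformers loading path. The repository id "genbio-ai/AIDO.RNA-1.6B", the use of trust_remote_code, and the example sequence are assumptions for illustration; the officially supported interface is the AIDO.ModelGenerator package.

```python
# Minimal sketch: load AIDO.RNA and compute per-nucleotide embeddings.
# Assumes the checkpoint is published on the Hugging Face Hub under
# "genbio-ai/AIDO.RNA-1.6B" and loads via the generic transformers API;
# see AIDO.ModelGenerator for the supported workflow.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "genbio-ai/AIDO.RNA-1.6B"  # assumed Hub repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Example ncRNA sequence; the model is pre-trained at single-nucleotide
# resolution, so each base maps to one token.
sequence = "ACGUAGGUAACGUAGC"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Hidden states serve as transferable features for downstream structure
# and function prediction heads.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (batch, tokens incl. specials, hidden dim)
```

Representations like these would typically feed a lightweight task-specific head, or the full model would be fine-tuned end to end, mirroring the adaptation strategy the abstract describes.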