Deciphering RNA regulation with a foundation language model
Abstract
RNAs are widely regulated in cellular processes by their cis-regulatory elements and trans-acting RNA-binding proteins, and this regulatory information is encoded in RNA sequences spanning evolutionary diversity. Deciphering the rules underlying RNA regulation can provide new insights into molecular mechanisms and RNA therapies. Large language models have demonstrated their efficacy in the analysis of human languages and protein sequences. Considering the similarities in semantic and syntactic features between RNA sequences and human language, we developed LAMAR, a foundation language model for RNA regulation, aimed at capturing the intrinsic characteristics of RNA sequences. The model was first pretrained on approximately 15 million gene and transcript sequences from 225 mammals and 1569 viruses. Using LAMAR as a foundational platform, we then fine-tuned the pretrained model on labeled datasets across various tasks. The resulting fine-tuned models outperformed the best benchmark by up to 9% in predicting mRNA translation efficiency and improved RNA half-life prediction by 7%. We further applied LAMAR to predict splice sites of pre-mRNAs and internal ribosome entry sites that drive cap-independent translation, achieving performance comparable to state-of-the-art methods. These results indicate that our foundation language model is applicable to comprehensive regulatory analysis of RNA, providing new clues to molecular cell biology and disease.
The code is available at https://github.com/rnasys/LAMAR.