A foundation language model to decipher diverse regulation of RNAs
Abstract
RNA metabolism is tightly regulated by cis-elements and trans-acting factors. Most information guiding such regulation is encoded in RNA sequences. Considering the similarities in semantic and syntactic features between RNAs and human language, we developed LAMAR, a transformer-based foundation language model for RNA regulation, to decipher general rules underlying RNA processing. The model was pretrained on approximately 15 million sequences from the genomes and transcriptomes of 225 mammals and 1569 viruses, and further fine-tuned with labeled datasets for various tasks. The resulting fine-tuned models outperformed state-of-the-art methods in predicting mRNA translation efficiency and mRNA half-life, while achieving accuracy comparable to that of specifically designed methods in predicting splice sites of pre-mRNAs and internal ribosome entry sites. Our results indicate that a single foundation language model is applicable to the comprehensive analysis of different aspects of RNA regulation, providing new insight into the design and optimization of RNA drugs.