A foundation language model to decipher diverse regulation of RNAs

Abstract

RNA metabolism is tightly regulated by cis-elements and trans-acting factors. Most information guiding such regulation is encoded in RNA sequences. Considering the similarities in semantic and syntactic features between RNAs and human language, we developed LAMAR, a transformer-based foundation language model for RNA regulation, to decipher general rules underlying RNA processing. The model was pretrained on approximately 15 million sequences from the genomes and transcriptomes of 225 mammals and 1569 viruses, and further fine-tuned with labeled datasets for various tasks. The resulting fine-tuned models outperformed state-of-the-art methods in predicting mRNA translation efficiency and mRNA half-life, while achieving accuracy comparable to that of specifically designed methods in predicting splice sites of pre-mRNAs and internal ribosome entry sites. Our results indicated that a single foundation language model is applicable to the comprehensive analysis of different aspects of RNA regulation, providing new insight into the design and optimization of RNA drugs.
