RNALens: Study on 5’ UTR Modeling and Cell-Specificity
Abstract
Recently, the Transformer architecture has been applied to predict the structure, function, and regulatory activity of biological sequences. Predicting the cell-specific regulatory impact of 5' untranslated regions (5' UTRs) on mRNA expression and translation remains a key challenge for rational mRNA design. Existing studies such as UTR-LM, RNABERT, and RNA-FM train transformer-based models solely on 5' UTR sequences with fixed nucleotide tokenization schemes and auxiliary structural features. These models give little attention to integrating broader genomic context and thermodynamic objectives, which limits their ability to generalize across diverse cell types and to accurately predict both mRNA expression level (EL) and translation efficiency (TE). In this paper, we propose RNALens, a foundation model pre-trained in two stages, first on multispecies genomic sequences and then on curated 5' UTR data, using masked language modeling augmented with secondary structure prediction and minimum free energy regression. RNALens employs byte-pair encoding to capture variable-length nucleotide motifs. It is then fine-tuned on high-throughput reporter assay datasets from HEK293T, PC3, and muscle tissues to yield specialized predictors for EL and TE in each cellular context. Experimental results on benchmark datasets demonstrate that RNALens outperforms existing machine learning methods on both expression and translation prediction across cell-specific and cross-context tests, offering an efficient in silico platform for guiding the design of mRNA therapeutics with precise cellular targeting.
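As a rough illustration of the multi-task pre-training objective described in the abstract (masked language modeling augmented with secondary structure prediction and minimum free energy regression), the PyTorch sketch below combines the three losses over a shared encoder output. The module name, head dimensions, loss weights, and structure label set are hypothetical assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultiTaskPretrainHead(nn.Module):
    """Hypothetical sketch of a three-task pre-training head:
    masked language modeling (MLM), per-token secondary structure
    classification, and sequence-level minimum free energy (MFE)
    regression. Vocabulary size (BPE), label set, and loss weights
    are assumptions, not values reported for RNALens.
    """

    def __init__(self, hidden=768, vocab_size=4096, n_struct_labels=3,
                 w_mlm=1.0, w_struct=0.5, w_mfe=0.5):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)          # masked-token logits
        self.struct_head = nn.Linear(hidden, n_struct_labels)  # e.g. paired/unpaired/loop
        self.mfe_head = nn.Linear(hidden, 1)                   # scalar MFE per sequence
        self.weights = (w_mlm, w_struct, w_mfe)
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)       # -100 marks positions to skip
        self.mse = nn.MSELoss()

    def forward(self, hidden_states, mlm_labels, struct_labels, mfe_targets):
        # hidden_states: (batch, seq_len, hidden) from a transformer encoder
        # CrossEntropyLoss expects (batch, n_classes, seq_len), hence the transpose
        mlm_loss = self.ce(self.mlm_head(hidden_states).transpose(1, 2), mlm_labels)
        struct_loss = self.ce(self.struct_head(hidden_states).transpose(1, 2), struct_labels)
        # Mean-pool token states for the sequence-level MFE regression target
        pooled = hidden_states.mean(dim=1)
        mfe_loss = self.mse(self.mfe_head(pooled).squeeze(-1), mfe_targets)
        w_mlm, w_struct, w_mfe = self.weights
        return w_mlm * mlm_loss + w_struct * struct_loss + w_mfe * mfe_loss
```

In this kind of setup, the auxiliary structure and MFE heads are typically discarded after pre-training, and the encoder is fine-tuned with fresh regression heads for EL and TE in each cellular context.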