Large mRNA language foundation modeling with NUWA for unified sequence perception and generation
Abstract
mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are far more concise and information-dense, making mRNA an ideal language through which to explore biological principles. In this study, we present NUWA, a large mRNA language foundation model built on a BERT-like architecture and trained with curriculum masked language modeling and a supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we used large-scale mRNA coding sequences, comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pretrained a domain-specific model for each of the three domains. This enables NUWA to learn coding-sequence patterns across the entire tree of life. The fine-tuned NUWA demonstrates strong performance across a variety of downstream tasks, excelling not only in RNA-related perception tasks but also in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA represents the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
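The abstract does not spell out how the entropy-guided strategy turns a masked language model into a generator. As a rough illustration only (an assumption, not NUWA's published algorithm), one common way to do this is iterative unmasking: start from a fully or partially masked sequence, score every masked position by the entropy of the model's predicted token distribution, commit the lowest-entropy (most confident) positions, and repeat until no masks remain. The sketch below uses a generic PyTorch masked-LM interface; `model`, `mask_id`, and the decoding schedule are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def entropy_guided_unmask(model, input_ids, mask_id, steps=10):
    """Illustrative entropy-guided decoding for a BERT-like MLM.

    Iteratively fills [MASK] positions, committing the positions with the
    lowest predictive entropy first, so later predictions condition on the
    model's most confident tokens. The exact NUWA schedule may differ.
    """
    ids = input_ids.clone()                      # (1, L) token ids
    masked = ids == mask_id
    per_step = max(1, int(masked.sum()) // steps)
    while masked.any():
        with torch.no_grad():
            logits = model(ids)                  # (1, L, vocab) logits
        probs = F.softmax(logits, dim=-1)
        # Per-position entropy of the predicted token distribution.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        entropy[~masked] = float("inf")          # rank only still-masked positions
        k = min(per_step, int(masked.sum()))
        pick = entropy.view(-1).topk(k, largest=False).indices
        # Commit greedy predictions at the selected low-entropy positions.
        ids.view(-1)[pick] = probs.view(-1, probs.size(-1))[pick].argmax(-1)
        masked = ids == mask_id
    return ids
```

Committing low-entropy positions first is one plausible reading of "entropy-guided": it lets the bidirectional model lock in codons it is confident about and refine the remainder conditioned on them, rather than decoding left to right as an autoregressive model would.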