codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints

Binita Rajbanshi
Anuj Guruacharya

Abstract

Emerging generative models for biology focus on DNA, non-coding RNA, or proteins, ignoring the information hidden in mRNA. In addition, designing mRNA sequences for protein engineering and mRNA therapeutics remains a challenge that lacks a clear framework. Here, we introduce and rigorously evaluate two novel methods: a foundational model for mRNA and a reinforcement learning framework for mRNA design built on that model. codonGPT is the first generative foundational language model trained directly on coding mRNA sequences. To handle the synonymous constraints that are unique to mRNA, we introduce a novel inference-time masking method, along with an evaluation on housekeeping genes. For the first time, we also rigorously demonstrate that, for precise mRNA therapeutics design, reinforcement learning on such a model provides a clear framework for biological optimization. Our study introduces a novel foundational model for mRNA and a new reinforcement learning-based paradigm for mRNA sequence design.
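
To make the idea of inference-time masking under synonymous constraints concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a codon-level vocabulary of 64 tokens, the standard genetic code, and a model that returns one logit per codon at each step. The names `synonymous_mask`, `masked_softmax`, `sample_codons`, and `toy_logits` are hypothetical, and `toy_logits` is a random stand-in for a trained model.

```python
import math
import random
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ('*' = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO))

def synonymous_mask(aa):
    """True for every codon that encodes amino acid `aa`."""
    return [CODON_TO_AA[c] == aa for c in CODONS]

def masked_softmax(logits, mask):
    """Softmax restricted to allowed codons; masked codons get probability 0."""
    top = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def sample_codons(protein, logits_fn, rng):
    """Sample one codon per residue, allowing only synonymous codons."""
    seq = []
    for aa in protein:
        probs = masked_softmax(logits_fn(seq), synonymous_mask(aa))
        seq.append(rng.choices(CODONS, weights=probs, k=1)[0])
    return seq

def toy_logits(prefix):
    """Hypothetical placeholder for a language model: pseudo-random logits."""
    r = random.Random(len(prefix))
    return [r.uniform(-1.0, 1.0) for _ in CODONS]

rng = random.Random(0)
protein = "MKWL*"
mrna = sample_codons(protein, toy_logits, rng)
translated = "".join(CODON_TO_AA[c] for c in mrna)
print("".join(mrna), translated)
```

Because the mask removes every non-synonymous codon before sampling, the generated sequence always translates back to the input protein, whatever the underlying logits are. In a reinforcement learning setting, this would let a reward act only on the choice among synonymous codons.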

Version published to 10.1101/2025.06.25.661500 on bioRxiv
Jun 27, 2025

Related articles

GENERator: A Long-Context Generative Genomic Foundation Model

This article has 18 authors:
1. Qiuyi Li
2. Wei Wu
3. Yuanyuan Zhang
4. Zhihao Zhan
5. Ruipu Chen
6. Mingyang Li
7. Kun Fu
8. Junyan Qi
9. Yongzhou Bao
10. Chao Wang
11. Yiheng Zhu
12. Zhiyun Zhang
13. Jian Tang
14. Fuli Feng
15. Jieping Ye
16. Liu Yuwen
17. Hui Xiong
18. Zheng Wang
This article has no evaluations. Latest version: Feb 4, 2026

Explicit Dynamic Cross-Strand Interactions for DNA Sequence Language Modeling

This article has 12 authors:
1. Xiao Luo
2. Cheng Yang
3. Yuansheng Liu
4. Lei Ling
5. Fengxin Li
6. Changjian Chen
7. Long Wang
8. Feng Yu
9. Liang Qiao
10. Xiangxiang Zeng
11. Kenli Li
12. Alexander Schönhuth
This article has no evaluations. Latest version: Jan 8, 2026

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025
