ppLM-CO: Pre-trained Protein Language Model for Codon Optimization

Shashank Pathak
Guohui Lin

Abstract

Messenger ribonucleic acid (mRNA) vaccines represent a major advancement in synthetic biology, yet their efficacy remains limited by how efficiently the encoded protein is translated within the host. Since multiple codons can code for the same amino acid, the search space of possible coding sequences (CDSs) within mRNA grows exponentially with protein length, making the problem highly underdetermined. Finding CDSs that yield efficient translation hinges on codon optimization, the process of choosing among synonymous codons, which encode the same protein but differ in their effects on translation speed, tRNA availability, and mRNA secondary structure. Recent deep learning approaches have framed codon optimization as a sequence learning problem, where the goal is to model context-dependent codon usage patterns across the amino acids of a protein sequence. However, these methods rely on large sequence models that learn amino-acid embeddings from scratch, leading to computationally intensive training. We propose ppLM-CO, a lightweight codon optimization framework that integrates pretrained protein language models (ppLMs) to directly provide contextual amino-acid embeddings, thereby eliminating the need for embedding learning. This design reduces trainable parameters by 92% to 99% compared with prior deep models while maintaining complete biological fidelity. In-silico evaluations across three species and two vaccine targets, the SARS-CoV-2 spike and Varicella-Zoster Virus (VZV) gE viral proteins, demonstrate that ppLM-CO consistently achieves higher expression and competitive stability, establishing a scalable and biologically consistent approach for codon optimization.
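The exponential growth of the CDS search space mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code (written here in the DNA alphabet, so T stands in for U) and counts how many distinct coding sequences encode a given protein. The helper name `count_coding_sequences` is hypothetical and is not part of ppLM-CO.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific alternative code is in use). '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
SYNONYMS: dict[str, list[str]] = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    SYNONYMS.setdefault(aa, []).append("".join(codon))


def count_coding_sequences(protein: str) -> int:
    """Number of distinct CDSs encoding `protein` (hypothetical helper).

    The count is the product of per-residue codon degeneracies, which is
    why it grows exponentially with protein length. Raises KeyError for
    letters outside the standard amino-acid alphabet.
    """
    return prod(len(SYNONYMS[aa]) for aa in protein)


if __name__ == "__main__":
    # Met has 1 codon, Lys has 2, Leu has 6, so "MKL" has 1 * 2 * 6 = 12 CDSs.
    print(count_coding_sequences("MKL"))
```

Even a protein of a few hundred residues therefore admits far more synonymous sequences than could be enumerated, which is the motivation for learning codon choices rather than searching exhaustively.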

Version published to 10.1101/2024.12.12.628267 on bioRxiv
Dec 16, 2024

Related articles

In-Context Learning in Genomic Language Models as a Biological Evaluation Task

This article has 2 authors:
1. Aadit Kapoor
2. Wendy Lee
This article has no evaluations
Latest version Dec 9, 2025

Emergence of Biological Structural Discovery in General-Purpose Language Models

This article has 1 author:
1. Liang Wang
This article has no evaluations
Latest version Jan 8, 2026

Best Practices for Using Large Language Models at Scale

This article has 5 authors:
1. Bhargavee Kannikanti
2. Arjun Coimbatore Nagarasan
3. Alberto Rosas
4. Sriram Kothandaraman
5. Sravan Kumar Kannuri
This article has no evaluations
Latest version Dec 12, 2025
