ppLM-CO: Pre-trained Protein Language Model for Codon Optimization
Abstract
Messenger ribonucleic acid (mRNA) vaccines represent a major advancement in synthetic biology, yet their efficacy remains limited by how efficiently the encoded protein is translated within the host. Since multiple codons can code for the same amino acid, the search space of possible coding sequences (CDS) within an mRNA grows exponentially with protein length, making the problem highly underdetermined. Finding CDSs that yield efficient translation hinges on codon optimization: the process of choosing among synonymous codons that encode the same protein but differ in their effects on translation speed, tRNA availability, and mRNA secondary structure. Recent deep learning approaches have framed codon optimization as a sequence learning problem, where the goal is to model context-dependent codon usage patterns across the amino acids of a protein sequence. However, these methods rely on large sequence models that learn amino-acid embeddings from scratch, leading to computationally intensive training. We propose ppLM-CO, a lightweight codon optimization framework that integrates pretrained protein language models (ppLMs) to directly provide contextual amino-acid embeddings, thereby eliminating the need for embedding learning. This design reduces trainable parameters by 92%–99% compared with prior deep models while maintaining complete biological fidelity, since synonymous codon choices leave the encoded protein unchanged. In-silico evaluations across three species and two vaccine targets, the SARS-CoV-2 spike and Varicella-Zoster Virus (VZV) gE viral proteins, demonstrate that ppLM-CO consistently achieves higher expression and competitive stability, establishing a scalable and biologically consistent approach for codon optimization.
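To make the general pattern described above concrete, the sketch below shows how a frozen pretrained protein language model can supply contextual per-residue embeddings to a small trainable codon-prediction head. This is a minimal sketch, not the authors' implementation: the specific pLM (ESM-2 8M via the fair-esm package), the single linear head, and the synonymous-codon masking step mentioned in the comments are illustrative assumptions.

```python
# Minimal sketch (not the ppLM-CO code): frozen protein-LM embeddings feeding a
# lightweight codon-prediction head. Assumes the `fair-esm` and `torch` packages;
# the ESM-2 8M model and the linear head are illustrative choices.
import torch
import torch.nn as nn
import esm

# 1) Load a small pretrained protein language model and freeze its weights.
plm, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
plm.eval()
for p in plm.parameters():
    p.requires_grad = False

batch_converter = alphabet.get_batch_converter()


class CodonHead(nn.Module):
    """Tiny trainable head: per-residue embedding -> logits over 64 codons."""

    def __init__(self, embed_dim: int = 320, n_codons: int = 64):
        super().__init__()
        self.proj = nn.Linear(embed_dim, n_codons)

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, length, embed_dim) -> (batch, length, n_codons)
        return self.proj(residue_embeddings)


head = CodonHead()

# 2) Embed a protein with the frozen pLM to get contextual per-residue embeddings.
_, _, tokens = batch_converter([("example", "MKTAYIAKQR")])
with torch.no_grad():
    reprs = plm(tokens, repr_layers=[6])["representations"][6]
residue_embs = reprs[:, 1:-1, :]  # drop BOS/EOS token positions

# 3) Predict codon logits; at decoding time these would be masked to the
#    synonymous codons of each amino acid, so the encoded protein is preserved.
codon_logits = head(residue_embs)
print(codon_logits.shape)  # torch.Size([1, 10, 64])
```

Because the pLM stays frozen, only the small head is trained, which is the source of the large reduction in trainable parameters claimed in the abstract; the exact head architecture and decoding strategy used by ppLM-CO are described in the full paper.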