CodonTranslator: a conditional codon language model for codon optimization across life domains

Abstract

Codon optimization is the selection of synonymous codons to match host-specific usage preferences. It is critical for heterologous expression but remains challenging due to the combinatorial design space. Under long-term evolutionary selection, natural coding sequences are near-optimal compromises among translational efficiency, accuracy, and regulatory constraints, providing a de facto standard for data-driven models. Recent deep learning–based language models therefore aim to learn the distribution of natural codon sequences and reuse it for design. However, existing approaches discard the rich semantic structure of taxonomic lineages, underutilize protein functional and evolutionary constraints, and often rely on masked-language objectives that lack a principled mechanism for sequence generation. Here we present CodonTranslator, a 150M-parameter decoder-only Transformer trained on 62 million CDS–protein pairs from over 2,100 species. CodonTranslator uses a pretrained language model to embed hierarchical species lineages and a pretrained protein language model to encode protein context, enabling interpolation across hosts and generalization to unseen species and proteins. Our results show that CodonTranslator implicitly learns the genetic code from data, faithfully reproduces species-specific codon usage, and designs coding sequences that match or surpass existing methods in both codon usage metrics and predicted biological stability. Our dataset, pretrained models, and code are available at https://github.com/poseidonchan/CodonTranslator.
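The design task the abstract describes can be made concrete with a simple baseline, independent of the model itself: given a protein sequence and a host-specific codon usage table, pick for each residue the synonymous codon the host prefers. The sketch below is illustrative only; the usage frequencies are hypothetical placeholders, not data from the paper, and CodonTranslator replaces this greedy table lookup with learned conditional generation.

```python
from itertools import product

# Standard genetic code: codon -> amino acid ("*" marks stop codons).
# Rows of AMINO follow first bases T, C, A, G; product() iterates in that order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in GENETIC_CODE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

def optimize(protein, usage):
    """Greedy baseline: per residue, choose the most frequent synonymous codon.

    `usage` maps codon -> relative frequency in the target host.
    """
    return "".join(
        max(SYNONYMS[aa], key=lambda c: usage.get(c, 0.0))
        for aa in protein
    )

# Toy usage table (hypothetical host preferences for a few codons).
toy_usage = {"ATG": 1.0, "AAA": 0.75, "AAG": 0.25,
             "GGT": 0.35, "GGC": 0.40, "GGA": 0.10, "GGG": 0.15}

cds = optimize("MKG", toy_usage)  # -> "ATGAAAGGC"
```

A learned model differs from this baseline in exactly the ways the abstract highlights: it conditions on the full protein context and the host's taxonomic lineage rather than on per-codon frequencies alone, which is what allows interpolation across hosts and generalization to unseen species.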
