A Suite of Foundation Models Captures the Contextual Interplay Between Codons
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
In the canonical genetic code, many amino acids are assigned more than one codon. Work by us and others has shown that the choice of these synonymous codon is not random, and carries regulatory and functional consequences. Existing protein foundation models ignore this context-dependent role of coding sequence in shaping the protein landscape of the cell. To address this gap, we introduce cdsFM, a suite of codon-resolution large language models, including both EnCodon and DeCodon models, with up to 1B parameters. Pre-trained on 60 million protein-coding sequences from more than 5,000 species, our models effectively learn the relationship between codons and amino acids, recapitualing the overall structure of the genetic code. In addition to outperforming state-of-the-art genomic foundation models in a variety of zero-shot and few-shot learning tasks, the larger pre-trained models were superior in predicting the choice of synonymous codons. To systematically assess the impact of synonymous codon choices on protein expression and our models' ability to capture these effects, we generated a large dataset measuring overall and surface expression levels of three proteins as a function of changes in their synonymous codons. We showed that our EnCodon models could be readily fine-tuned to predict the contextual consequences of synonymous codon choices. Armed with this knowledge, we applied EnCodon to existing clinical datasets of synonymous variants, and we identified a large number of synonymous codons that are likely pathogenic, several of which we experimentally confirmed in a cell-based model. Together, our findings establish the cdsFM suite as a powerful tool for decoding the complex functional grammar underlying the choice of synonymous codons.