DECODING SYNONYMOUS CODON SELECTION WITH A TRANSFORMER MODEL
Abstract
The genetic code is highly redundant, with many synonymous codons encoding the same amino acid. Codon usage influences RNA structure, signaling, and translation rates. Differences in tRNA availability modulate elongation: rare codons slow translation and thereby affect co-translational folding and gene expression. Despite their functional importance and non-random distribution, rare codons are underrepresented in natural datasets, which restricts the development of predictive models. We developed a transformer-based model that predicts codon sequences from amino acid sequences, substantially improving rare codon prediction. Without explicit labels, the model learns codon signatures that encode species identity, RNA thermodynamic properties, and elongation constraints. Attention analysis shows that codon choice depends on both short- and long-range sequence context, recovering known dicodon effects and highlighting additional motifs. Finally, model predictions correlate with experimental measurements of the impact of synonymous mutations on protein fitness. This links gene sequence to its functional consequences and provides a framework for connecting sequence variation, translation, and protein function.
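To make the rare-codon problem concrete, the following minimal sketch counts synonymous codon usage and builds a frequency-only baseline that always picks the most common codon for each amino acid. It is not the authors' model: the toy sequences in `TOY_SEQS` and all function names are hypothetical and chosen purely for illustration. It uses the standard genetic code and shows that a context-free baseline can never emit a rare codon, which is the gap a context-aware model would need to close.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

# Hypothetical toy coding sequences (not data from the preprint).
TOY_SEQS = [
    "ATGCTGCTGCTGCTAAAA",  # M L L L L K, with one rare CTA among the leucines
    "ATGCTGAAGAAATAA",     # M L K K stop
]


def split_codons(cds):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(seqs):
    """Count codon occurrences, grouped by the amino acid each codon encodes."""
    counts = defaultdict(Counter)
    for cds in seqs:
        for codon in split_codons(cds):
            counts[CODON_TABLE[codon]][codon] += 1
    return counts


def relative_usage(counts):
    """Convert counts to the fraction of each codon within its synonymous family."""
    return {
        aa: {codon: n / sum(family.values()) for codon, n in family.items()}
        for aa, family in counts.items()
    }


def majority_baseline(counts):
    """Map each amino acid to its single most frequent codon (context-free)."""
    return {aa: family.most_common(1)[0][0] for aa, family in counts.items()}


usage = codon_usage(TOY_SEQS)
fractions = relative_usage(usage)
baseline = majority_baseline(usage)

print("Leucine codon fractions:", fractions["L"])
print("Baseline choice for leucine:", baseline["L"])
```

In this toy set the leucine codon CTA appears once in five leucine positions, so the majority baseline always chooses CTG and misses every CTA site. Predicting rare codons therefore requires information beyond per-amino-acid frequencies, such as the surrounding sequence context.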