Reduced amino acid substitution matrices find traces of ancient coding alphabets in modern day proteins

Abstract

All known living systems make proteins from the same twenty canonically coded amino acids, but this was not always the case. Early genetic coding systems likely operated with a restricted pool of amino acid types and limited means to distinguish between them. Despite this, amino acid substitution models such as LG and WAG assume a coding alphabet that stays constant over time. That makes them especially inappropriate for the aminoacyl-tRNA synthetases (aaRS), the enzymes that govern translation. To address this limitation, we created a class of substitution models that accounts for evolutionary changes in the size of the coding alphabet by defining a transition from nineteen states in a past epoch to twenty states in the present one. We use a Bayesian phylogenetic framework both to improve phylogeny estimation and to test this two-alphabet hypothesis. The hypothesis was strongly rejected by datasets composed exclusively of “young” eukaryotic proteins. It was generally supported by “old” (aaRS and non-aaRS) proteins whose origins predate the last universal common ancestor. In both simulated and aaRS alignments, standard methods overestimate the divergence ages of proteins that originated under reduced coding alphabets. The new model reduces this bias substantially. Our findings support the late incorporation of tryptophan into the genetic code (relative to tyrosine) and suggest that isoleucine and valine were once coded interchangeably, forming protein quasispecies. This work provides a robust framework for reconstructing phylogenies from ancient protein datasets and offers further insights into the dawn of molecular biology.
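
To make the idea of a nineteen-state alphabet concrete, the following is a minimal sketch of how a modern sequence could be recoded when two amino acids are treated as a single state. It illustrates only the recoding step, not the substitution models described in the abstract. The pairing of isoleucine and valine follows the abstract's suggestion; the merged-state symbol "J" and the function names are assumptions made for illustration.

```python
# The twenty canonical amino acids, in one-letter code.
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"


def reduce_alphabet(seq, merged=("I", "V"), symbol="J"):
    """Recode `seq` so that every residue in `merged` maps to one `symbol`.

    `symbol` is a hypothetical placeholder for the merged state; it is
    chosen here only because "J" is not a canonical one-letter code.
    """
    table = str.maketrans({aa: symbol for aa in merged})
    return seq.translate(table)


def alphabet_size(merged=("I", "V")):
    """Number of states left after collapsing `merged` into a single state."""
    return len(CANONICAL) - len(set(merged)) + 1


if __name__ == "__main__":
    print(reduce_alphabet("MKVLIVAG"))  # isoleucine and valine become "J"
    print(alphabet_size())              # twenty states minus two, plus one
```

Under this recoding, isoleucine and valine become indistinguishable in the past epoch, which is one way to picture two amino acids being "coded interchangeably".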
