In silico evolution of globular protein folds from random sequences
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
The origin and evolution of protein folds are among the most challenging, long-standing problems in biology 1,2. Although many plausible scenarios of early protein evolution leading to fold nucleation have been proposed 3-8, realistic simulation of this process was not feasible for lack of efficient approaches to protein structure prediction. That situation changed with the advent of powerful tools for fast and robust protein structure prediction, such as AlphaFold 9,10 and ESMFold 11. We developed the protein fold evolution simulator (PFES), a computational approach with atomistic detail that provides insight into the mechanisms by which globular folds evolve from random amino acid sequences. PFES introduces random mutations into a population of protein sequences, evaluates the effect of the mutations on protein structure, and selects a new set of proteins for further evolution. Repeating this process iteratively makes it possible to track the evolutionary trajectory of a protein fold as it changes under selective pressure for fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as the evolution of distinct folds not known to exist in nature. We show that evolution of a stable fold from random sequences takes, on average, 3 to 8 amino acid replacements per site, suggesting that simple but stable protein folds can evolve relatively easily. These findings could shed light on the enigma of the rapid evolution of protein fold diversity at the earliest stages of the evolution of life.
PFES tracks the complete evolutionary history of each simulation, describing intermediate states at the sequence and structure levels, and can be used to test a wide range of hypotheses on protein fold evolution.
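The mutate, evaluate, select cycle described in the abstract can be illustrated with a minimal sketch. This is not the PFES implementation. The scoring function toy_fitness is a hypothetical placeholder that rewards an alternating hydrophobic/polar pattern, where a real run would score a predicted structure (for example from ESMFold). The truncation selection scheme, the population size, the sequence length, and all function names are assumptions made for illustration only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
HYDROPHOBIC = set("AVILMFWC")         # illustrative grouping, an assumption


def random_sequence(length, rng):
    """Draw each residue uniformly and independently from the 20 amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate(seq, rng):
    """Replace one randomly chosen residue with a different amino acid."""
    pos = rng.randrange(len(seq))
    new = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


def toy_fitness(seq):
    """Hypothetical placeholder score in [0, 1]; NOT a structure prediction.

    Rewards an alternating hydrophobic/polar pattern. A real run would
    score a predicted structure here instead.
    """
    hits = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def evolve(pop_size=20, length=30, generations=50, seed=0, fitness=toy_fitness):
    """Mutate, evaluate, select, repeat. Returns (population, best_per_generation)."""
    rng = random.Random(seed)
    # Each individual is (sequence, number of substitutions in its lineage).
    population = [(random_sequence(length, rng), 0) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # One single-residue mutant per parent.
        mutants = [(mutate(seq, rng), n + 1) for seq, n in population]
        pool = population + mutants
        # Truncation selection (an assumption): keep the fittest pop_size.
        pool.sort(key=lambda ind: fitness(ind[0]), reverse=True)
        population = pool[:pop_size]
        best_history.append(fitness(population[0][0]))
    return population, best_history


if __name__ == "__main__":
    final_pop, history = evolve()
    best_seq, n_subs = final_pop[0]
    print("best sequence:", best_seq)
    print("toy fitness:", history[-1])
    print("substitutions per site in its lineage:", n_subs / len(best_seq))
```

Because parents stay in the selection pool, the best score in this sketch never decreases between generations. Dividing the number of substitutions accumulated along a lineage by the sequence length gives a per-site count that is comparable in kind, though not in value, to the 3 to 8 replacements per site reported in the abstract.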
Article activity feed
-
duplications
Can you define this a little more? Are you saying that the gene duplicates and then there are two copies, each of which evolves independently, or that the gene duplicates and produces a new protein that is twice as long as the original? I assume the first, but then I'm a little confused about why the protein structure is different immediately after duplication. I would assume it would be the same immediately after duplication and would then evolve, maybe under a different selection scheme than the initial protein/copy.
-
novel fold
Can you provide information inline about what this means? Is it a novel fold geometry or is it a novel sequence that leads to a known fold geometry?
-
query that can be identified by sequence similarity search
Within the AlphaFold database or is this referring to some other set of homologs?
-
This structural search showed that 82 of the 200 structures evolved in the simulations had structural analogs among natural proteins, with a Foldseek probability of at least 0.95, including 23 hits from PDB (Table S1).
Does this have implications for structural searches from real proteins? Do we expect a higher rate of false positive matches? Or would these types of matches described here be consistent with convergent evolution where the selective pressures across two systems were similar and led to similar structures?
-
100 random peptide sequences
Can you clarify here what random means?
-
polypeptide sequences
What happens if you don't start from a known polypeptide sequence, but instead from a completely random set of amino acids that doesn't have any similarity to anything that occurs in nature? Or what if you start from non-coding regions of genomes?
-
random sequence
It would be helpful to know here how random these sequences really are (I assume this is covered later in your paper as well though).
I think this study also has tertiary connections to using protein language models as predictors for different tasks. It might be interesting to explore how random sequences are encoded in PLMs and how that differs from the encoding of sequences that are structured or have a function. I think that could help develop an intuition for how these PLMs encode information about proteins.
-
Nevertheless, many hypotheses converge on the scenario of proteins originating as small peptides with random sequences that gradually evolved into more complex structures with distinct folds 2-8,19.
This is really interesting, and somewhat separately reminds me of the sORF literature like this paper: https://www.cell.com/cell-reports/fulltext/S2211-1247(22)01696-5. There have been quite a few studies that show that sORFs are often evolutionarily young. I think your paper applies to this question as well.
-
reduced amino acid alphabets
Can you clarify in-text whether this means degenerate alphabets, such as Dayhoff or hydrophobic/polar encodings, or alphabets that use fewer letters of the amino acid alphabet?
-