In silico evolution of globular protein folds from random sequences


Abstract

The origin and evolution of protein folds are among the most challenging, long-standing problems in biology. We developed the Protein Fold Evolution Simulator (PFES), a computational approach that simulates the evolution of globular folds from random amino acid sequences in atomistic detail. PFES introduces random mutations in a population of protein sequences, evaluates the effect of the mutations on protein structure, and selects a new set of proteins for further evolution. Iteration of this process allows tracking the evolutionary trajectory of a changing protein fold that evolves under selective pressure for protein fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how stable, globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as the emergence of distinct folds not known to exist in nature. We show that evolution of small globular protein folds from random sequences takes, on average, 1.15 to 3 amino acid replacements per site, depending on the population size, with some simulations yielding stable folds after as few as 0.2 replacements per site. These values are lower than the characteristic numbers of replacements in conserved proteins during the time since the Last Universal Common Ancestor, suggesting that simple protein folds can evolve from random sequences relatively easily and quickly. PFES tracks the complete evolutionary history of each simulation and can be used to test hypotheses on protein fold evolution.
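
The mutate, evaluate, select cycle described in the abstract can be illustrated with a minimal sketch. This is not the PFES implementation: the names `evolve`, `mutate`, and `toy_fitness` are hypothetical, and the fitness function is a deliberately crude placeholder (agreement with an alternating hydrophobic/polar pattern) standing in for the structure-based evaluation that the abstract describes. The sketch only shows how a population can be iterated and how replacements per site along a lineage might be counted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVWC")  # assumed grouping, for the toy score only


def random_sequence(length, rng):
    """Draw a uniformly random amino acid sequence."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate(seq, rng):
    """Replace one randomly chosen residue with a different amino acid."""
    pos = rng.randrange(len(seq))
    new = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


def toy_fitness(seq):
    """Placeholder score, NOT a structure-based one: the fraction of
    positions that match an alternating hydrophobic/polar pattern."""
    hits = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def evolve(pop_size=20, length=30, generations=50, seed=0):
    """Run a simple mutate/evaluate/select loop.

    Returns the best sequence, the number of replacements per site
    accumulated along its lineage, and the best score per generation.
    """
    rng = random.Random(seed)
    # Each individual is (sequence, replacements accumulated in its lineage).
    population = [(random_sequence(length, rng), 0) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # One mutated offspring per parent.
        offspring = [(mutate(s, rng), n + 1) for s, n in population]
        # Parents compete with offspring; keep the top pop_size by score.
        pool = population + offspring
        pool.sort(key=lambda ind: toy_fitness(ind[0]), reverse=True)
        population = pool[:pop_size]
        history.append(toy_fitness(population[0][0]))
    best_seq, n_replacements = population[0]
    return best_seq, n_replacements / length, history


if __name__ == "__main__":
    best, per_site, history = evolve()
    print(best, round(per_site, 2), history[-1])
```

Because parents are kept in the selection pool, the best score in this sketch never decreases between generations. Swapping `toy_fitness` for a score derived from predicted structures would bring the loop closer to the approach the abstract outlines, at a much higher computational cost per evaluation.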

Article activity feed

  1. duplications

    Can you define this a little more? Are you saying that the gene duplicates and then there are two copies, each of which evolves independently, or that the gene duplicates and produces a new protein that is twice as long as the original? I assume the first, but then I'm a little confused about why the protein structure is different immediately after duplication. I would assume it would be the same immediately after duplication and would then evolve, perhaps under a different selection scheme than the initial protein/copy.

  2. This structural search showed that 82 of the 200 structures evolved in the simulations had structural analogs among natural proteins, with a Foldseek probability of at least 0.95, including 23 hits from PDB (Table S1).

    Does this have implications for structural searches from real proteins? Do we expect a higher rate of false positive matches? Or would these types of matches described here be consistent with convergent evolution where the selective pressures across two systems were similar and led to similar structures?

  3. polypeptide sequences

    What happens if you don't start from a known polypeptide sequence, but instead from a completely random set of amino acids that doesn't have any similarity to anything that occurs in nature? Or what if you start from non-coding regions of genomes?

  4. random sequence

    It would be helpful to know here how random these sequences really are (I assume this is covered later in your paper as well though).

    I think this study also has tertiary connections to using protein language models (PLMs) as predictors for different tasks. It might be interesting to explore this question a little: how random sequences are encoded in PLMs, and how that encoding differs from that of sequences that are structured, have a function, etc. I think that could help develop an intuition for how these PLMs encode information about proteins.

  5. Nevertheless, many hypotheses converge on the scenario of proteins originating as small peptides with random sequences that gradually evolved into more complex structures with distinct folds 2-8,19.

    This is really interesting, and somewhat separately reminds me of the sORF literature like this paper: https://www.cell.com/cell-reports/fulltext/S2211-1247(22)01696-5. There have been quite a few studies that show that sORFs are often evolutionarily young. I think your paper applies to this question as well.

  6. reduced amino acid alphabets

    Can you clarify in the text whether this means degenerate alphabets, such as Dayhoff or hydrophobic/polar encoding, or simply fewer letters in the amino acid alphabet?