Fluctuations and the limit of predictability in protein evolution
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein evolution involves mutations occurring across a wide range of time scales. In analogy with disordered systems in statistical physics, this dynamical heterogeneity suggests strong correlations between mutations happening at distinct sites and times. To quantify these correlations, we examine the role of various fluctuation sources in protein evolution, simulated using a data-driven energy landscape, used as a proxy for protein fitness. By applying spatio-temporal correlation functions developed in the context of disordered physical systems, we disentangle fluctuations originating from the initial condition, i.e. the ancestral sequence from which the evolutionary process originated, from those driven by stochastic mutations along independent evolutionary paths. Our analysis shows that, in diverse protein families, fluctuations from the ancestral sequence predominate at shorter time scales. This allows us to identify a time scale over which ancestral sequence information persists, enabling its reconstruction. We link this persistence to the strength of epistatic interactions: ancestral sequences with stronger epistatic signatures impact evolutionary trajectories over extended periods. At longer time scales, however, ancestral influence fades as epistatically constrained sites evolve collectively. To confirm this idea, we apply a standard ancestral sequence reconstruction algorithm and verify that the time-dependent recovery error is influenced by the properties of the ancestor itself.
Article activity feed
-
Thank you for your comment. We checked that the average pairwise distance between sequences in the natural MSA is comparable to (or even larger than) the one obtained from sequences generated with our protocol at large time scales. The time scales we were interested in for this study are much shorter than this.
For reference, you can check Fig. S3 in the supplementary material of "Emergent time scales of epistasis in protein evolution" (https://arxiv.org/abs/2403.09436), in which a similar simulation protocol is used and the pairwise Hamming distance between natural and synthetic sequences is compared.
Another interesting question is what happens at the opposite end of the spectrum: when the sequences in the MSA are very far apart, can we trust the results we obtain on short time scales? How much are our results determined by phylogeny rather than epistasis? Phylogenetic correlations create distinct clusters in sequence space, which might result in long evolutionary time scales.
However, in "Emergent time scales of epistasis in protein evolution" (https://arxiv.org/abs/2403.09436), we show that our simulations agree well with experimental data on epistatic phenomena such as contingency, entrenchment, and the variation of mutational effects between homologs that have a Hamming distance lower than 40%.
In other projects we are trying to infer a model from directed evolution data, which has much less divergence, and in that case your question is quite interesting. Is it possible to build a global model starting from just local data? What we are seeing is that the problem is not really the sampling dynamics, but rather the construction of the model. For example, one can bias the model to move far away from the wildtype from which the experimental data has been evolved.
-
Conversely, at larger time scales, the dynamical noise contribution dominates and the trajectory-to-trajectory fluctuations are large enough to hide the signal coming from the ancestral sequence, precluding the possibility to reconstruct i
It might be interesting to see what the scale of the Hamming distance distribution is in the underlying MSAs for the focal protein families, versus the scales of Hamming distance at which such effects are observed in the simulations. One potential concern is that couplings/epistasis are estimated from the MSA at one scale of sequence divergence, but the simulations are pushed to much larger scales, in which case the epistatic interactions inferred from the MSA might no longer be accurate.
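A comparison of this kind could be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function names and the toy sequences are hypothetical, the sequences are assumed to be already aligned to a common length, and gap characters are simply treated as one more symbol.

```python
from itertools import combinations
from statistics import mean


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming_fractions(seqs):
    """Normalized Hamming distance for every unordered pair of sequences."""
    return [hamming_fraction(a, b) for a, b in combinations(seqs, 2)]


# Toy aligned sequences, purely illustrative (not taken from any real MSA
# or from any actual simulation output).
natural_msa = ["ACDEF", "ACDEY", "WCDKF"]
simulated = ["ACDEF", "ACDKF", "ACDKY"]

natural_d = pairwise_hamming_fractions(natural_msa)
simulated_d = pairwise_hamming_fractions(simulated)

print("natural   mean / max:", mean(natural_d), max(natural_d))
print("simulated mean / max:", mean(simulated_d), max(simulated_d))
```

The number of pairs grows quadratically with the number of sequences, so for a real MSA one would probably compute the distances on a random subsample. If the simulated distances stay within the range spanned by the natural ones, the simulations remain inside the divergence scale on which the model was inferred.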
-