Reconstructability of evolutionary intermediates in generative epistatic landscapes

Abstract

Evolutionary intermediates connect observed proteins, but the sequence of steps that produced them is rarely recoverable from extant data alone. Here we ask what can, and cannot, be inferred about such intermediates from the endpoints. Using generative sequence landscapes as controlled models of protein-family evolution, we benchmark data-driven reconstruction against ground-truth simulated trajectories. We find that the best point prediction is not necessarily the most faithful evolutionary reconstruction: maximum-likelihood intermediates can be residue-wise accurate yet statistically atypical, whereas conditional sampling better captures the ensemble of plausible histories. Predictability is limited by the topology of the landscape. Constrained, low-mutability regions preserve information about the path, while permissive high-mutability regions open many alternative routes and erase path-specific memory. We also show that sequence divergence alone is an insufficient measure of elapsed evolutionary time; incorporating endpoint mutability provides a more reliable way to place intermediates in the landscape. These results recast intermediate reconstruction as a calibrated probabilistic problem. Rather than seeking a single "true" sequence, data-driven models should identify when endpoints contain evolutionary information and return realistic ensembles.
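
The contrast between a maximum-likelihood intermediate and conditionally sampled intermediates can be illustrated with a deliberately simplified toy model. The sketch below is not the method of the preprint: it assumes independent sites (no epistasis), a four-letter alphabet, and randomly generated per-site probabilities standing in for a hypothetical conditional distribution of an intermediate given the endpoints. All names (`site_probs`, `ml_sequence`, `sample_sequence`) are illustrative assumptions. It shows only why a per-site argmax sequence can be statistically atypical: its log-probability sits above that of essentially every sequence the model would actually generate.

```python
import math
import random

random.seed(0)

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption), not the 20 amino acids
NUM_SITES = 50     # arbitrary sequence length (assumption)


def make_site():
    """Random per-site distribution that moderately favours the first residue."""
    weights = [random.random() for _ in ALPHABET]
    weights[0] += 2.0
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical stand-in for P(intermediate residue | endpoints) at each site.
site_probs = [make_site() for _ in range(NUM_SITES)]


def log_prob(seq):
    """Log-probability of a sequence under the independent-site toy model."""
    return sum(math.log(p[ALPHABET.index(a)]) for p, a in zip(site_probs, seq))


def ml_sequence():
    """Point prediction: the most probable residue at every site."""
    return "".join(
        ALPHABET[max(range(len(p)), key=p.__getitem__)] for p in site_probs
    )


def sample_sequence():
    """One draw from the same distribution (ensemble-style reconstruction)."""
    return "".join(random.choices(ALPHABET, weights=p)[0] for p in site_probs)


ml_seq = ml_sequence()
samples = [sample_sequence() for _ in range(1000)]

ml_logp = log_prob(ml_seq)
sample_logps = [log_prob(s) for s in samples]
mean_sample_logp = sum(sample_logps) / len(sample_logps)

print("log P(ML sequence):        ", round(ml_logp, 2))
print("mean log P(sampled seqs):  ", round(mean_sample_logp, 2))
print("max log P(sampled seqs):   ", round(max(sample_logps), 2))
```

In this toy setting the point prediction is the single most probable sequence, yet sampled sequences cluster at markedly lower log-probability, so the point prediction does not resemble a typical member of the ensemble. An epistatic landscape would require a model with couplings between sites and a different sampler, which this sketch does not attempt.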
