Additive baselines furnish no evidence for epistasis learning by MULTI-evolve

Abstract

Recent work from Tran et al. (Science, 2026) introduced MULTI-evolve, a framework for protein engineering that combines single-mutant nomination via a protein language model (PLM) or a deep mutational scan (DMS), experimental single- and double-mutant characterization, and neural networks to engineer hyperactive multimutant proteins. The authors attribute the framework’s performance to “epistasis-aware modeling” and claim that their neural networks “learn the epistatic landscape” and “identify synergistic interactions” from limited double-mutant training data. Additive models, by definition, cannot represent epistasis, making them a natural null baseline for such claims. Here we show that MULTI-evolve’s multimutant predictions are almost perfectly correlated with an additive model’s across all three engineering applications (APEX, dCasRx, and HuABC2), such that the engineering of multimutants reduces to combining beneficial mutations with the largest additive effects, a standard protein engineering strategy for over four decades. We also find that MULTI-evolve’s neural networks do not outperform an additive model on held-out test set predictions, and do not even represent epistasis in their training data. Finally, we revisit a DMS benchmark finding presented as evidence of epistasis learning and show that the same pattern is expected even under a null additive model, due to an elementary statistical phenomenon; when we fit an additive model to the benchmark data, it reproduces the reported pattern. More broadly, our findings underscore the need to benchmark models for machine learning-guided directed evolution against additive null baselines before attributing performance to learned epistasis.
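
As a rough illustration of the kind of additive null baseline the abstract refers to, the sketch below predicts each multimutant's score as the sum of its constituent single-mutant effects and then correlates those predictions with another model's. It is not the preprint's code: the mutation names, effect sizes, `model_preds` values, and the helper `additive_predict` are all hypothetical, and it assumes effects are on a scale where summing is sensible (for example, log fold change relative to wild type).

```python
import numpy as np

def additive_predict(single_effects, variants):
    """Predict each variant's score as the sum of its single-mutant effects."""
    return np.array([sum(single_effects[m] for m in v) for v in variants])

# Hypothetical single-mutant effects (assumed: log fold change vs. wild type).
single_effects = {"A10V": 0.8, "K25R": 0.5, "S40T": -0.2, "G77D": 0.3}

# Hypothetical multimutants, each given as a tuple of its constituent mutations.
variants = [
    ("A10V", "K25R"),
    ("A10V", "S40T"),
    ("K25R", "G77D"),
    ("A10V", "K25R", "G77D"),
]

additive = additive_predict(single_effects, variants)

# Made-up stand-in for another model's predictions on the same variants.
model_preds = np.array([1.25, 0.65, 0.85, 1.55])

# Pearson correlation between the two sets of predictions.
r = np.corrcoef(additive, model_preds)[0, 1]
print(f"Pearson r between additive and model predictions: {r:.3f}")
```

A correlation near 1 in a comparison like this would mean the second model ranks variants much as simple summation does, so its predictions alone would not demonstrate that it has learned epistasis.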

Article activity feed

  1. The FCNN learns additivity but the authors misattribute its performance to epistasis because no linear baseline was tested

    One thing I have always wanted to see in this sort of data and experiment is some flavour of variance partitioning, as in quantitative genetics: how much variation does 2nd order epistasis explain in these systems? And, consequently, what improvement can we even expect from a model that successfully captures, say, 2nd order epistasis? From these results, it would probably be quite modest (at least in the context of these data). I don't think a proper analysis can be done here because of the experimental design: each mutant combination is measured only once, so epistasis cannot be distinguished from experimental noise. This paper from the Thornton lab has done an analysis along these lines, and the results seem to suggest additivity can get you a lot of the way there...
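
The variance-partitioning idea raised in the comment above can be sketched on simulated data. The example below assumes a fully crossed two-site design with replicate measurements, which is not the design of the experiments under discussion; every number and variable name in it is an assumption chosen for illustration. It uses a balanced two-way ANOVA decomposition to split variation into additive, interaction, and replicate-error parts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, n_rep = 8, 8, 2  # variants at site A, variants at site B, replicates

# Simulated ground truth (all standard deviations are arbitrary assumptions).
a = rng.normal(0, 1.0, n_a)                     # additive effects at site A
b = rng.normal(0, 1.0, n_b)                     # additive effects at site B
eps = rng.normal(0, 0.3, (n_a, n_b))            # pairwise epistasis
noise = rng.normal(0, 0.3, (n_a, n_b, n_rep))   # measurement noise
y = (a[:, None] + b[None, :] + eps)[:, :, None] + noise

# Balanced two-way decomposition on the replicate means.
cell = y.mean(axis=2)
grand = cell.mean()
row = cell.mean(axis=1, keepdims=True) - grand
col = cell.mean(axis=0, keepdims=True) - grand
additive_fit = grand + row + col
interaction = cell - additive_fit

ss_total = ((y - grand) ** 2).sum()
ss_add = n_rep * ((additive_fit - grand) ** 2).sum()
ss_int = n_rep * (interaction ** 2).sum()
ss_err = ((y - cell[:, :, None]) ** 2).sum()

# The interaction mean square contains noise as well as epistasis, so the
# replicate-error mean square is subtracted to estimate epistatic variance.
ms_int = ss_int / ((n_a - 1) * (n_b - 1))
ms_err = ss_err / (n_a * n_b * (n_rep - 1))
var_epi = max((ms_int - ms_err) / n_rep, 0.0)

print(f"additive share of total SS:    {ss_add / ss_total:.3f}")
print(f"interaction share of total SS: {ss_int / ss_total:.3f}")
print(f"error share of total SS:       {ss_err / ss_total:.3f}")
print(f"estimated epistatic variance:  {var_epi:.3f}")
```

With a single measurement per combination (`n_rep = 1`), the error term has no degrees of freedom and the noise is absorbed into the interaction term, which is the confounding the comment describes.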

  2. Each of these issues is evident in MULTI-evolve: the engineering success is real, but the source of performance is that mutational effects are sufficiently additive for the proteins and mutations considered, not that the neural network has learned epistatic synergies.

    Thank you for putting together this properly benchmarked analysis; the lack of a purely additive model in the original work is a significant omission. To give credit to the original authors, the biggest success here seems to be their ensemble PLM model. I suppose some claim of 'synergistic epistasis' could be made here, as the mutants the ensemble proposed do seem to be genuinely beneficial. However, this clearly does not extend to the MLP trained by the authors, which is where the claims of capturing epistasis are made.