Importance of higher-order epistasis in protein sequence-function relationships
Curation statements for this article:
Curated by eLife
eLife Assessment
In this important work, the authors present a new transformer-based neural network designed to isolate and quantify higher-order epistasis in protein sequences. They provide solid evidence that higher-order epistasis can play key roles in protein function. This work will be of interest to the communities interested in modeling biological sequence data and understanding mutational effects.
Abstract
Protein sequence–function relationships are inherently complex, as amino acids at different positions can interact in highly unpredictable ways. A key question for protein evolution and engineering is how often epistasis extends beyond pairwise interactions to involve three or more positions. Although experimental data have accumulated rapidly in recent years, addressing this question remains challenging, as the number of possible interactions is typically enormous even for proteins of moderate size. Here, we introduce an interpretable machine learning framework for studying higher-order epistasis that scales to full-length proteins. Our model builds on the transformer architecture, with key modifications that allow us to assess the importance of higher-order interactions by fitting a series of models of increasing complexity. Applying our method to 10 large protein sequence-function datasets, we found that while additive effects explain the majority of the variance, the contribution of higher-order epistasis within the epistatic component ranges from negligible to as much as 60%. We also found that higher-order epistasis is particularly important for generalizing locally sampled fitness data to distant regions of sequence space and for modeling an additional multi-peak fitness landscape. Our findings suggest that higher-order epistasis can play important roles in protein sequence-function relationships and thus should be properly considered in protein engineering and evolutionary data analysis.
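As a concrete illustration of the "series of models of increasing complexity" described above, the variance attribution could be computed roughly as follows. This is a minimal sketch, not the authors' code: the three predictors are hypothetical stand-ins for fitted models of increasing interaction order.

```python
import numpy as np

# Minimal sketch of the variance-decomposition idea in the abstract:
# fit models of increasing interaction order and attribute variance to
# each order via the incremental R^2. The predictor arguments are
# hypothetical stand-ins for fitted models of increasing depth.

def r2(y_true, y_pred):
    """Fraction of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def epistasis_decomposition(y, pred_additive, pred_pairwise, pred_full):
    """Split explained variance into additive, pairwise, and higher-order parts."""
    r2_add = r2(y, pred_additive)    # additive model only
    r2_pair = r2(y, pred_pairwise)   # up to pairwise interactions
    r2_full = r2(y, pred_full)       # all interaction orders
    epistatic = r2_full - r2_add     # total epistatic contribution
    higher_order = r2_full - r2_pair # beyond-pairwise contribution
    return {
        "additive": r2_add,
        "pairwise": r2_pair - r2_add,
        "higher_order": higher_order,
        # Share of the epistatic component due to higher-order terms:
        # the quantity reported as "negligible to as much as 60%".
        "higher_order_share": higher_order / epistatic if epistatic > 0 else 0.0,
    }
```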
Article activity feed
-
Reviewer #1 (Public review):
The authors present an approach that uses the transformer architecture to model epistasis in deep mutational scanning datasets. This is an original and very interesting idea. Applying the approach to 10 datasets, they quantify the contribution of higher-order epistasis, showing that it varies quite extensively.
Suggestions:
(1) The approach taken is very interesting, but it is not particularly well placed in the context of recent related work. MAVE-NN, LANTERN, and MoCHI are all approaches that different labs have developed for inferring and fitting global epistasis functions to DMS datasets. MoCHI can also be used to infer multi-dimensional global epistasis (for example, folding and binding energies) as well as pairwise (and higher-order) specific interaction terms (see 10.1186/s13059-024-03444-y and 10.1371/journal.pcbi.1012132). It would not detract from the current work to better introduce these recent approaches in the introduction. A comparison of the different capabilities of the methods may also be helpful. It may also be interesting to compare the contributions to variance of 1st, 2nd, and higher-order interaction terms estimated by the Epistatic transformer and MoCHI.
(2) https://doi.org/10.1371/journal.pcbi.1004771 is another useful reference that relates different metrics of epistasis, including the useful distinction between biochemical/background-relative and background-averaged epistasis (see the numerical sketch after these points).
(3) Which higher-order interactions are more important? Are there any mechanistic/structural insights?
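To unpack the distinction in point (2): biochemical (background-relative) epistasis measures the deviation of a double mutant from additivity in one fixed genetic background, whereas background-averaged epistasis averages that deviation over many backgrounds. A minimal numerical sketch, with fitness values invented purely for illustration:

```python
import numpy as np

# Hypothetical fitness values over two focal sites (0 = wild type,
# 1 = mutant) in two genetic backgrounds; numbers are made up.
fitness = {
    # (site_a, site_b, background): fitness
    (0, 0, 0): 1.0, (1, 0, 0): 0.8, (0, 1, 0): 0.7, (1, 1, 0): 0.9,
    (0, 0, 1): 1.1, (1, 0, 1): 0.9, (0, 1, 1): 0.8, (1, 1, 1): 0.6,
}

def pairwise_epistasis(bg):
    """Background-relative (biochemical) epistasis in one fixed background:
    deviation of the double mutant from additivity of the single mutants."""
    return (fitness[(1, 1, bg)] - fitness[(1, 0, bg)]
            - fitness[(0, 1, bg)] + fitness[(0, 0, bg)])

per_background = [pairwise_epistasis(bg) for bg in (0, 1)]
print(per_background)           # [0.4, ~0.0]: the interaction depends on background
print(np.mean(per_background))  # 0.2: the background-averaged estimate
```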
-
Reviewer #2 (Public review):
Summary:
This paper presents a novel transformer-based neural network model, termed the epistatic transformer, designed to isolate and quantify higher-order epistasis in protein sequence-function relationships. By modifying the multi-head attention architecture, the authors claim they can precisely control the order of specific epistatic interactions captured by the model. The approach is applied to both simulated data and ten diverse experimental deep mutational scanning (DMS) datasets, including full-length proteins. The authors argue that higher-order epistasis, although often modest in global contribution, plays critical roles in extrapolation and capturing distant genotypic effects, especially in multi-peak fitness landscapes.
Strengths:
(1) The study tackles a long-standing question in molecular evolution and protein engineering: "how significant are epistatic interactions beyond pairwise effects?" The question is relevant given the growing availability of large-scale DMS datasets and increasing reliance on machine learning in protein design.
(2) The manuscript includes both simulation and real-data experiments, as well as extrapolation tasks (e.g., predicting distant genotypes, cross-ortholog transfer). These well-rounded evaluations demonstrate robustness and applicability.
(3) The code is made available for reproducibility.
Weaknesses:
(1) The paper mainly compares its transformer models to additive models and occasionally to linear pairwise-interaction models. However, other strong baselines exist; for example, the authors should compare against methods such as "DANGO: Predicting higher-order genetic interactions". There is also a substantial literature on interaction detection, such as "Detecting statistical interactions from neural network weights", "shapiq: Shapley interactions for machine learning", and "Error-controlled non-additive interaction discovery in machine learning models".
(2) While the transformer architecture is cleverly adapted, the claim that it allows for "explicit control" and "interpretability" over interaction order may be overstated. Although the 2^M scaling with MHA layers is shown empirically, the actual biological interactions captured by the attention mechanism remain opaque. A deeper analysis of learned attention maps or embedding similarities (e.g., visualizations, site-specific interaction clusters) could substantiate claims about interpretability.
(3) The distinction between nonspecific (global) and specific epistasis is central to the modeling framework, yet it remains conceptually underdeveloped. While a sigmoid function is used to model global effects, it is unclear to what extent this functional form suffices. The authors should justify this choice more rigorously or at least acknowledge its limitations and potential implications (see the illustrative sketch after this list).
(4) The manuscript refers to "pairwise", "3-4-way", and ">4-way" interactions without always clearly defining the boundaries of these groupings or how exactly the interaction order is inferred from transformer layer depth. This can be confusing to readers unfamiliar with the architecture or with statistical definitions of interaction order. The authors should use terminology consistently. Including a visual mapping or table linking the number of layers to the maximum modeled interaction order could be helpful (an example of such a mapping follows this list).
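To make the global-versus-specific distinction in point (3) concrete, here is a minimal sketch assuming a latent additive phenotype passed through a sigmoid link; the names and the pairwise coupling form are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Illustrative contrast between nonspecific (global) and specific
# epistasis. All names and functional forms are assumptions for exposition.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_epistasis(additive_scores):
    # Nonspecific (global) epistasis: one monotone nonlinearity applied to
    # an additive latent phenotype. Mutations "interact" only through the
    # shared saturating scale, not through position-specific coupling.
    return sigmoid(np.sum(additive_scores))

def specific_epistasis(phi_a, phi_b, coupling):
    # Specific epistasis: an interaction term tied to particular positions
    # (here a pairwise coupling between two sites), which no monotone
    # transformation of an additive score can reproduce in general.
    return phi_a + phi_b + coupling * phi_a * phi_b
```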
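On point (4), assuming the 2^M scaling with the number of multi-head attention (MHA) layers reported in the manuscript, the requested mapping might look like this (layer counts are illustrative):

MHA layers (M)    Maximum interaction order (2^M)
0                 1 (additive only)
1                 2 (up to pairwise)
2                 4 (up to 4-way)
3                 8 (up to 8-way)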
-
Reviewer #3 (Public review):
Summary:
Sethi and Zou present a new neural network to study how epistatic interactions among pairs and larger groups of amino acids contribute to protein function. Their new model is validated on a small simulated data set and then applied to 10 empirical data sets. Results show that epistatic interactions in groups of amino acids can be important to predict the function of a protein, especially for sequences that are not very similar to the training data.
Strengths:
The manuscript relies on a novel neural network architecture that makes it easy to study specifically the contribution of interactions between 2, 3, 4, or more amino acids. The study of 10 different protein families shows that the contribution of such interactions varies among families.
Weaknesses:
The manuscript is good overall, but could have gone a bit deeper by comparing the new architecture to standard transformers, and by investigating whether differences between protein families explain some of the differences in the importance of interactions between amino acids. Finally, the GitHub repository needs some more information to be usable.