Design and experimental characterization of specificity-switching mutational paths of WW domains

Curation statements for this article:
  • Curated by eLife

Abstract

Specific interactions between proteins and other biomolecules are ubiquitous in cellular processes. How specificity is encoded in the protein sequence, and how it can be modified through a minimal set of concerted mutations, is a complex issue. In this work, we focus on the WW protein domain, whose variants specifically bind to different classes of proline-rich peptides. Combining unsupervised learning of homologous WW sequence data with Restricted Boltzmann Machines (RBM) and path-sampling methods, we design mutational paths of putative WW domains interpolating between two natural WW domains with either distinct or similar specificities. Sequences along the designed paths are then experimentally validated with high-throughput in vitro binding assays against three peptides of different classes. The vast majority (93%) of intermediate sequences along the designed paths are responsive to the initial and/or final peptides. In contrast, domains along scrambled paths, in which the same mutations are introduced in random order, are not functional, emphasizing how successful design crucially depends on the ability to model epistatic interactions. Interestingly, the switch in specificity between classes I and IV, whose representative peptides bind to different pockets on the WW domain, appears to be smooth, with intermediates displaying some level of binding cross-reactivity with all tested peptides. Finally, we show that the RBM paths share a high identity with internal nodes obtained from ancestral sequence reconstruction based on the seed WW domains.
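
As a rough illustration of the scrambled-path control described above, the sketch below builds a path between two endpoint sequences by introducing the differing residues one at a time, then builds a second path with the same substitutions in shuffled order. It is a simplification under stated assumptions: each differing site is mutated exactly once (a direct path), the two 10-letter strings are invented toy sequences rather than real WW domains, and all function names are hypothetical. The RBM-guided path sampling itself is not reproduced here.

```python
import random

def differing_sites(start, end):
    """Positions at which the two equal-length sequences differ."""
    if len(start) != len(end):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(start, end)) if a != b]

def path_from_order(start, end, order):
    """Introduce the end-point residue at each site in `order`, one per step."""
    seq = list(start)
    path = [start]
    for site in order:
        seq[site] = end[site]
        path.append("".join(seq))
    return path

def scrambled_path(start, end, rng):
    """Same set of substitutions as the ordered path, applied in random order."""
    order = differing_sites(start, end)
    rng.shuffle(order)
    return path_from_order(start, end, order)

# Toy endpoints (assumption: invented strings, not natural WW domains).
START = "LPAGWEMAKT"
END = "LPPGWEERQD"

ordered = path_from_order(START, END, differing_sites(START, END))
shuffled = scrambled_path(START, END, random.Random(0))
```

Both paths share the same endpoints and the same set of substitutions; only the order, and hence the intermediate sequences, can differ. Under epistasis, that ordering alone can decide whether the intermediates remain functional.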

Significance Statement

Generative machine-learning models are now used to design new protein sequences with desired functions. Here, we address a more demanding task: designing a full mutational path connecting two natural proteins with different binding specificities. We illustrate this problem with WW domains, small protein units capable of recognizing distinct classes of proline-rich peptides. We experimentally verify that most of the intermediate sequences along the designed path are functional and respond to the initial and/or final peptides. The designed sequences share significant homology with the sequences obtained as internal nodes of phylogenetic trees through ancestral sequence reconstruction.

Article activity feed

  1. eLife Assessment

    In this important study, the authors demonstrate that generative AI techniques (restricted Boltzmann machine) can be used effectively to design and characterize mutational pathways of WW domains with different binding specificities. The computational studies are complemented by experimental validations, and the results provide solid evidence supporting the idea that the sequence landscape holds significance in understanding protein evolution from a transition path perspective. The minor weakness of the study in its current form concerns limited success in designing variants with smoothly varying binding specificities. Nevertheless, the work will likely have a major impact on research aimed at understanding how evolution navigates fitness landscapes as well as reconstructing ancestral sequences.

  2. Reviewer #1 (Public review):

    Summary:

    The authors aim to study mutational paths connecting WW domains with different binding specificities. Their approach combines an unsupervised sequence generative model based on RBMs with a path-sampling algorithm. The key result is that most intermediate sequences along the designed transition paths retain measurable binding activity in wet-lab assays, whereas paths containing the same mutations introduced in a randomized order are largely non-functional. This difference is attributed to epistatic interactions captured by the RBM model.

    Strengths:

    Exploring mutational paths in high-dimensional protein sequence space is a challenging problem. The computational framework used here is state-of-the-art and is strengthened by systematic experimental characterization of binding activity. The study is comprehensive in scope, including multiple transition paths both within and across WW specificity classes, and the integration of modeling with high-throughput experimental validation is a clear strength.

    Weaknesses:

    A major concern is whether the stated goal of specificity switching is fully achieved. Along the sampled transition paths, most intermediate variants appear to retain specificity close to either the initial or the final class, rather than exhibiting gradually shifting specificity. For example, in Figure 4G (Class I to Class II/III), binding appears largely binary, with intermediates behaving similarly to one of the endpoints. A similar pattern is observed in Figure 3H for the Class I to Class IV transition, where binding responses are close to 0 or 1. In this sense, the specificity-switching objective is only partially realized by assigning two endpoints with different specificity. This raises a broader conceptual question: is it possible that different WW specificities evolved from a common ancestor without passing through intermediates that exhibit mixed or intermediate specificity? If so, then inferring specificity-switching pathways purely from extant natural sequences may be fundamentally challenging.

  3. Reviewer #2 (Public review):

    This is an extremely important work that shows how one can use generative models to construct specificity-switching mutational paths in complex fitness landscapes. The experimental evidence is very clear, and the theoretical tools are innovative.

    The work will likely have a deep impact on future research aimed at understanding how evolution navigates fitness landscapes as well as reconstructing ancestral sequences.

    The manuscript is extremely clear and well written, the experimental evidence is strong, and the methods are clearly described, so I do not have major issues to raise. A few minor issues are listed below.

    (1) I consider the WW domain as an 'easy' case from the point of view of generative modelling. The domain is rather short, epistatic effects are not very strong (e.g. Boltzmann learning usually converges very quickly to a very paramagnetic state), and the resulting models are well interpretable (e.g. the hidden units of the RBM correlate well with subclasses).

    This is not always (not often?) the case, however. In more complex proteins, the learning procedures can be slower and the resulting models less interpretable. Just for completeness, perhaps the authors could comment on the generality of the results and what they would expect for other systems based on their experience.

    (2) In Section 3.3, the authors say that direct paths connecting Class I and Class IV behave similarly to indirect paths, despite having lower scores according to the RBM. How generic is this? Does it also happen for other classes? This might be an important point to address, as direct paths are easier to sample.

    (3) The path shown in Figure 4 goes through a region of non-functionality around sequences 18-19. It seems that the sampled path is basically exploring the functional regions for Class I and Class II/III separately, trying to approach the other class, but then it can't really make the switch.

    By contrast, the path going from Class I to Class IV seems able to perform the functional switch in a single step (20-21) without losing too much of the function.

    Perhaps the authors could better comment on this? Is this a limitation of the sampling method, or a fundamental biological fact?

    (4) On page 12, it is stated that the temperature was set to 1/3 to maximize the score. This is important and should be mentioned earlier (I didn't notice it until that point).

    (5) On page 13, it is stated that: "However, the scores of the ancestral sequences along the phylogenetic pathways assigned by the RBM are significantly lower than the ones of the RBM-designed sequences. This result is expected as ASR reconstruction does not take into account epistasis, differently from RBM, and we expect ASR sequences to generally be of lesser quality."

    I was very surprised by this result. My own experience with ASR shows that, on the contrary, sequences found by ASR (via maximum likelihood) tend to have high scores in the (R)BM, and tend to be more stable than extant sequences. I attribute this to the fact that ASR typically finds a "consensus" sequence that maximizes the contribution to the score coming from the fields (the profile), which is typically dominant over the epistatic signal, resulting in a bigger score. Maybe the authors did not use maximum likelihood in the ASR? Some clarification might be useful here.
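
The argument in point (5) can be illustrated with a minimal sketch: under a purely site-independent (profile) model, the consensus sequence maximizes the score by construction, so any sequence close to consensus scores well on the field terms. The code assumes a toy four-sequence alignment, a pseudocount of 1, and hypothetical function names. It deliberately ignores the epistatic (hidden-unit) contribution of an RBM and is not the authors' scoring function.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def profile_fields(alignment, pseudocount=1.0):
    """Per-column log-frequencies, i.e. the 'fields' of an independent-site model."""
    n_cols = len(alignment[0])
    total = len(alignment) + pseudocount * len(ALPHABET)
    fields = []
    for i in range(n_cols):
        counts = Counter(seq[i] for seq in alignment)
        fields.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}
        )
    return fields

def profile_score(seq, fields):
    """Sum of field terms only; no couplings or hidden units (assumption)."""
    return sum(fields[i][a] for i, a in enumerate(seq))

def consensus(fields):
    """Most probable residue at each column."""
    return "".join(max(f, key=f.get) for f in fields)

# Toy alignment (assumption: invented 4-residue sequences).
alignment = ["GWEE", "GWTE", "GWEK", "AWEE"]
fields = profile_fields(alignment)
best = consensus(fields)
scores = {seq: profile_score(seq, fields) for seq in alignment}
```

No sequence in the alignment can exceed the consensus on this field-only score. Whether the same ordering holds once epistatic terms are included is exactly the question the reviewer raises.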