Protein Language Model Identifies Disordered, Conserved Motifs Driving Phase Separation

Curation statements for this article:
  • Curated by eLife

    eLife Assessment

    This valuable study presents an analysis of evolutionary conservation in intrinsically disordered regions, identified as key drivers of phase separation, leveraging a protein language model. The strength of evidence is potentially compelling, but a clearer justification of the methods and analyses is needed to fully support the main claims.

Abstract

Intrinsically disordered regions (IDRs) play a critical role in phase separation and are essential for the formation of membraneless organelles (MLOs). Mutations within IDRs can disrupt their multivalent interaction networks, altering phase behavior and contributing to various diseases. Therefore, examining the evolutionary fitness of IDRs provides valuable insights into the relationship between protein sequences and phase separation. In this study, we utilized the ESM2 protein language model to map the fitness landscape of IDRs. Our findings reveal that IDRs, particularly those actively participating in phase separation, contain conserved amino acids. This conservation is evident through mutational constraints predicted by ESM2 and supported by direct analyses of multiple sequence alignments. These conserved, disordered amino acids include residues traditionally identified as “stickers” as well as “spacers” and frequently form continuous sequence motifs. The strong conservation, combined with their critical role in phase separation, suggests that these motifs act as functional units under evolutionary selection to support stable MLO formation. Our findings underscore the insights into phase separation’s molecular grammar made possible through evolutionary analysis enabled by protein language models.
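As a rough illustration of what "continuous sequence motifs" of conserved residues could mean operationally, the sketch below extracts runs of consecutive positions whose per-residue score falls below a cutoff. It is only a minimal sketch under stated assumptions: the function name `conserved_motifs`, the threshold, the minimum run length, and the toy scores are all hypothetical and are not taken from the study.

```python
def conserved_motifs(sequence, scores, threshold, min_length=4):
    """Return (start, end, subsequence) for runs of low-scoring residues.

    `scores` holds one value per residue; lower means more constrained.
    `threshold` and `min_length` are illustrative assumptions.
    Indices are 0-based and `end` is exclusive.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    motifs = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i  # a new run of constrained residues begins here
        else:
            if start is not None and i - start >= min_length:
                motifs.append((start, i, sequence[start:i]))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        motifs.append((start, len(scores), sequence[start:]))
    return motifs


# Toy input: made-up scores, not model output.
toy_sequence = "AAFGRYGSAA"
toy_scores = [0.0, 0.0, -3.0, -3.0, -3.0, -3.0, -3.0, 0.0, 0.0, 0.0]
print(conserved_motifs(toy_sequence, toy_scores, threshold=-1.5))
```

Requiring a minimum run length is one simple way to separate isolated constrained residues from candidate motifs; other choices, such as allowing short gaps, are equally plausible.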

Article activity feed

  1. eLife Assessment

    This valuable study presents an analysis of evolutionary conservation in intrinsically disordered regions, identified as key drivers of phase separation, leveraging a protein language model. The strength of evidence is potentially compelling, but a clearer justification of the methods and analyses is needed to fully support the main claims.

  2. Reviewer #1 (Public review):

    The manuscript by Zhang et al. describes the use of a protein language model (pLM) to analyse disordered regions in proteins, with a focus on those that may be important in biological phase separation. While the paper is relatively easy to read overall, my main comment is that the authors could make it clearer which observations are new and which support previous work using related approaches. Further, while the link to phase separation is interesting, it is not completely clear which data support the statements made, and this could also be made clearer.

    Major comments:

    (1) The work could be placed in a better context of what has been done before. This is not to say that it contains no new information, but what the authors do is closely related to work by others, and I think it would be useful to make those links more directly. Some examples:

    (1a) Alderson et al. (reference 71) analysed in detail the conservation of IDRs (via pLDDT, which is itself related to conservation) to show, for example, that conserved residues fold upon binding. That analysis is very similar to the one in the current study, which uses ESM2 as a different measure of conservation. Thus, the approach (pages 7-8) described as "This distinction allows us to classify disordered regions into two types: "flexible disordered" regions, which show high ESM2 scores and greater mutational tolerance, and "conserved disordered" regions, which display low ESM2 scores, indicating varying levels of mutational constraint despite a lack of stable folding." is fundamentally very similar to that used by Alderson et al. Likewise, the result that "Given that low ESM2 scores generally reflect mutational constraint in folded proteins, the presence of region a among disordered residues suggests that certain disordered amino acids are evolutionarily conserved and likely functionally significant" is in some ways very similar to the results of that paper.

    (1b) Dasmeh et al. (https://doi.org/10.1093/genetics/iyab184), Lu et al. (https://doi.org/10.1371/journal.pcbi.1010238), and Ho & Huang (https://doi.org/10.1002/pro.4317) analysed conservation in IDRs, including aromatic residues and their role in phase separation.

    (1c) A number of groups have performed proteome-wide saturation scans using pLMs, including variants of the ESM family. Examples are Meier (reference 89, but cited about something else) and Cagiada et al. (https://doi.org/10.1101/2024.05.21.595203), who analysed variant effects in IDRs using a pLM. Thus, I think statements such as "their applicability to studying the fitness and evolutionary pressures on IDRs has yet to be established" should possibly be qualified.

    (2) On page 4, the authors write, "The conserved residues are primarily located in regions associated with phase separation." These results are presented as a central part of the work, but it is not completely clear what the evidence is.

    (3) It would be useful to have an assessment of the controls the authors used to check whether there are folded domains within their set of IDRs.
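
    To make the comparison in comments (1a) and (1c) concrete, the sketch below shows one way a saturation scan could be turned into a per-residue score and then into the "flexible disordered" versus "conserved disordered" labels quoted above. It assumes the score is the mean log-likelihood ratio of the 19 possible substitutions relative to the wild-type residue, which is a common convention for pLM variant-effect scores but may differ from the manuscript's exact definition. The function names, the toy probabilities, the threshold, and the disorder flags are all hypothetical; no real model is loaded.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mean_llr_scores(sequence, log_probs):
    """Mean log-likelihood ratio over the 19 substitutions at each position.

    log_probs[i][aa] is assumed to be a model's log-probability of amino
    acid `aa` at position i. Values near 0 indicate tolerance; strongly
    negative values indicate constraint.
    """
    scores = []
    for wild_type, row in zip(sequence, log_probs):
        llrs = [row[aa] - row[wild_type] for aa in AMINO_ACIDS if aa != wild_type]
        scores.append(sum(llrs) / len(llrs))
    return scores


def classify_residues(scores, is_disordered, threshold):
    """Label each residue; `threshold` is an illustrative assumption."""
    labels = []
    for score, disordered in zip(scores, is_disordered):
        if not disordered:
            labels.append("folded")
        elif score < threshold:
            labels.append("conserved disordered")
        else:
            labels.append("flexible disordered")
    return labels


def toy_row(wild_type, p_wild_type):
    """Toy distribution: the wild type gets p, the other 19 share the rest."""
    p_other = (1.0 - p_wild_type) / 19
    return {aa: math.log(p_wild_type if aa == wild_type else p_other)
            for aa in AMINO_ACIDS}


# Made-up example: G is unconstrained, Y is strongly preferred, S is folded.
toy_sequence = "GYS"
toy_log_probs = [toy_row("G", 0.05), toy_row("Y", 0.9), toy_row("S", 0.05)]
toy_scores = mean_llr_scores(toy_sequence, toy_log_probs)
print(classify_residues(toy_scores, [True, True, False], threshold=-1.5))
```

    In practice, the disorder flags would come from a separate predictor or annotation, which is where the controls asked for in comment (3) become relevant: a folded domain mislabelled as disordered would be reported here as "conserved disordered".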

  3. Reviewer #2 (Public review):

    This manuscript uses the ESM2 language model to map the evolutionary fitness landscape of intrinsically disordered regions (IDRs). The central idea is that mutational preferences predicted by these models could be useful in understanding eventual IDR-related behavior, such as disruption of otherwise stable phases. While ESM2-type models have been applied to analyze such mutational effects in folded proteins, they have not been used or verified for studying IDRs. Here, the authors use ESM2 to study membraneless organelle formation and the related fitness landscape of IDRs.

    The key finding of this work is the identification of a subset of amino acids that exhibit mutation resistance. The authors report a strong correlation between ESM2 scores and conservation scores, which, if true, could be useful for understanding IDRs in general. Through their ESM2-based calculations, the authors conclude that IDRs crucial for phase separation frequently contain conserved sequence motifs composed of both so-called sticker and spacer residues. The authors note that many such motifs have been experimentally validated as essential for phase separation.
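
    The reported correlation between ESM2 scores and conservation scores can be checked with a rank correlation, sketched below in plain Python. This is only an illustration: the manuscript's correlation measure and conservation metric are not specified here, so Spearman's coefficient, the helper names, and the toy numbers are assumptions. With conservation defined so that higher means more conserved, a negative coefficient would be expected, since low ESM2 scores indicate constraint.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-residue values (made up): model scores and alignment conservation.
toy_model_scores = [-5.0, -3.0, -1.0, 0.0]
toy_conservation = [0.9, 0.7, 0.4, 0.1]
print(spearman(toy_model_scores, toy_conservation))
```

    A rank-based measure is used in this sketch because the two scores live on different scales and need not be linearly related.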

    Unfortunately, I do not believe that the results can be trusted. ESM2 has not been validated for IDRs through experiments. The authors themselves point out its limited use in that context, and in this study they do not provide any further rationale for why this situation might have changed. Furthermore, they mention that experimental perturbations of the predicted motifs in vivo may further elucidate their functional importance, but none of that is done here. That some of the motifs have been previously validated does not give any credibility to the use of ESM2 here, given that such systems were probably seen during the training of the model.

    I believe that the authors should revamp their whole study and come up with a rigorous, scientific protocol where they make predictions and test them using ESM2 (or any other scientific framework).

  4. Reviewer #3 (Public review):

    Summary:

    This is a very nice and interesting paper about motif conservation in protein sequences, mainly in IDRs, using the ESM2 language model. The topic of the paper is timely, with strong biological significance. The paper can be of great interest to the scientific community in the field of protein phase transitions and for future applications of the ESM models. The ability of ESM2 to identify conserved motifs is crucial for disease prediction, as these regions may serve as potential drug targets. Therefore, I find these findings highly significant, and the authors strongly support them throughout the paper. The work motivates the scientific community towards further motif exploration related to diseases.

    Strengths:

    (1) Revealing conserved regions in IDRs by the ESM-2 language model.

    (2) Identification of functionally significant residues within protein sequences, especially in IDRs.

    (3) Findings supported by useful analyses.

    Weaknesses:

    (1) Lack of examples demonstrating the potential biological functions of these conserved regions.

    (2) Very limited discussion of potential future work and of limitations.