Decoding molecular mechanisms for loss-of-function variants in the human proteome

Curation statements for this article:
  • Curated by eLife


Abstract

Proteins play a critical role in cellular function by interacting with other biomolecules; missense variants that cause loss of protein function can lead to a broad spectrum of genetic disorders. While much progress has been made on predicting which missense variants may cause disease, our ability to predict the underlying molecular mechanisms remains limited. One common mechanism is that missense variants cause protein destabilization, resulting in decreased protein abundance and loss of function, while other variants directly disrupt key interactions with other molecules. Here, we have leveraged machine-learning models for protein sequence and structure to disentangle effects on protein function and abundance, and applied the resulting model to all missense variants in the human proteome. We find that approximately half of all missense variants that lead to loss of function and disease do so because they disrupt protein stability. We predicted functionally important positions in all human proteins and found that they cluster on protein structures and are often found on the protein surface. Our work provides a resource for interpreting both predicted and experimental variant effects across the human proteome, and a mechanistic starting point for developing therapies for genetic diseases.

Article activity feed

  1. eLife Assessment

    This work introduces FunC-ESMs, a proteome-scale framework to classify loss-of-function missense variants into distinct mechanistic groups by combining two complementary state-of-the-art machine learning models. The strength of evidence is convincing, supported by solid benchmarking, integration with experimental datasets, and careful methodological design. The significance of the findings is valuable, providing a resource of clear interest to researchers and diagnostic laboratories working on variant interpretation.

  2. Reviewer #1 (Public review):

    Summary:

    In this work, the authors aim to improve upon previous iterations of their frameworks and models for decoupling variant effects on protein stability from direct effects on function. This is motivated by the utility of understanding the specific molecular mechanisms underlying loss-of-function disease when developing potential treatment approaches, which differ depending on the causal mechanism. The authors demonstrably achieve this goal: FunC-ESMs is an elegant approach that uses the pre-trained ESM-1b and ESM-IF models, which freed them from model training and from running computationally intensive Rosetta predictions. While the performance improvements over their previous model are not unambiguous in some of the examples, FunC-ESMs allowed them to scale up their analysis to the proteome level, deriving stable-but-inactive and total-loss variant classifications across 20,144 human proteins and further allowing them to identify functionally and structurally critical sites. However, the manuscript could be strengthened by clarifying or rewording some terminology concerning the molecular effects, and by discussing what other underlying molecular mechanisms could also reside in the stable-but-inactive group, given the stated motivation of setting up a mechanistic starting point for therapeutic development and clinical applications.

    Strengths:

    Overall, the manuscript is very well framed and written, with clear motivations and objectives. The previous works are explained well and set up a clear methodological comparison with the new framework. FunC-ESMs is solidly designed to minimize data circularity, and the methodology used to derive optimal thresholds is well reasoned. The authors have made an effort to make all the data and code readily accessible.

    Weaknesses:

    (1) Considering how loss-of-function mechanisms dominate the known missense disease variant landscape, it is understandable that the scope of the work focuses on loss of function. However, variants exceeding the established ESM-1b threshold in the manuscript are often generalized as loss-of-function variants (e.g., lines 176, 304; line 285, for instance, uses much more neutral language), which can be misleading due to the guaranteed presence of deleterious variants that manifest through other mechanisms, such as gain-of-function.

    Although they are predicted less well than loss-of-function variants, gain-of-function variants would still likely show inflated ESM-1b scores and end up in the SBI class. Given the emphasis on the potential utility of the framework for tailoring therapeutic approaches, it seems pertinent to highlight gain-of-function and dominant-negative mechanisms in the manuscript, as they would require considerably different therapeutics than loss-of-function variants.

    A short disclaimer explaining the other mechanisms and the potential limitations of the framework in picking them out would improve the clarity of the manuscript. As an additional step, it would be interesting to explore where clinically validated gain-of-function and dominant-negative variant examples fall within the framework's classification.

    (2) Given the clinical angle, it would be useful to see the predicted label distribution in population datasets like gnomAD, for instance, focusing on dominant Mendelian disease genes to minimize the impact of non-penetrant or heterozygous disease variants. The performance demonstration using (likely) benign ClinVar variants is less informative about the real-world use cases in which clinicians or researchers would apply the method.

  3. Reviewer #2 (Public review):

    Summary:

    The paper by Cagiada et al. builds on their previously published work, but now uses two independent and complementary machine learning models to predict the deleteriousness of every missense change in the human proteome. The authors were able to separate all missense variants into three classes: wild-type-like, total-loss (important for stability), or stable-but-inactive (important for function). They show that the predictions correlate well with intuition in terms of clustering and location in folded versus intrinsically disordered regions. Evaluation of known pathogenic and benign variants from ClinVar suggested that around half of all pathogenic missense variants cause disease by disrupting protein stability. These results could be valuable for researchers and genomic diagnostics laboratories performing variant interpretation.
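
    The three-class scheme described above can be sketched roughly as follows. This is only an illustration of a two-threshold decision rule, not the authors' code: the function name, the score orientation (larger means more damaging), and both threshold values are assumptions made for the example and are not taken from the manuscript.

```python
from enum import Enum


class VariantClass(Enum):
    WT_LIKE = "wild-type-like"
    TOTAL_LOSS = "total-loss"
    SBI = "stable-but-inactive"


# Hypothetical placeholder cutoffs; NOT the values used in the manuscript.
FUNCTION_THRESHOLD = 1.0   # applied to an ESM-1b-derived score
STABILITY_THRESHOLD = 1.0  # applied to an ESM-IF-derived score


def classify_variant(function_score, stability_score,
                     function_threshold=FUNCTION_THRESHOLD,
                     stability_threshold=STABILITY_THRESHOLD):
    """Assign one of three classes from two independent scores.

    Assumption: both scores are oriented so that larger values mean
    a more damaging substitution.
    """
    # Not flagged by the sequence-based score: treated as wild-type-like.
    if function_score <= function_threshold:
        return VariantClass.WT_LIKE
    # Flagged, and the structure-based score also indicates destabilization.
    if stability_score > stability_threshold:
        return VariantClass.TOTAL_LOSS
    # Flagged, but predicted to remain stable.
    return VariantClass.SBI
```

    Under these assumptions, any variant that is flagged by the first score but not the second lands in the stable-but-inactive class, whatever its actual mechanism.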

    Strengths:

    The method uses data from two independent state-of-the-art ML models, which were developed and published by other groups. The predictions were provided for every missense variant in the entire human proteome, and have been validated against a small previously published experimental dataset, as well as using known pathogenic and benign variants from ClinVar. Results are clearly stated and well illustrated with useful figures.

    Weaknesses:

    Both the description and the analysis could benefit from some additional work around the thresholds used for both ML models (ESM-1b and ESM-IF). The thresholds were selected based on an ROC analysis using published MAVE data, which has various limitations, including the small number of proteins for which MAVE data are available. Moreover, the correlation between the predictions from the two ML models was not evaluated, and there was no discussion of the limitations of the models or of where their predictions might diverge, an issue that was sidestepped by using two independent thresholds. The threshold approach needs further explanation, and a sensitivity analysis of how the results would change using different thresholds, or thresholds defined in an alternative way, would be informative. In addition, the ClinVar pathogenic variants are all treated equally, when in fact it is known that some act via a gain-of-function rather than a loss-of-function mechanism. It would be useful to know if these known patho-mechanisms correlate with predictions of variants that affect stability versus function.
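
    As a rough illustration of the kind of sensitivity analysis suggested here, the sketch below picks a cutoff from labelled scores and then reports how the class fractions shift as the function cutoff is varied. Everything in it is an assumption made for the example: the helper names are hypothetical, Youden's J is used as the ROC criterion only because the criterion is not stated in this review, and scores are taken to be oriented so that larger means more damaging.

```python
def youden_threshold(scores, labels):
    """Return (cutoff, J) maximising J = TPR - FPR.

    A variant is called damaging when score > cutoff.
    labels[i] is True when variant i is damaging in the reference data.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both damaging and non-damaging examples")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s > cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s > cutoff)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


def class_fractions(function_scores, stability_scores, f_thr, s_thr):
    """Fraction of variants in each class for one pair of cutoffs."""
    counts = {"wild-type-like": 0, "total-loss": 0, "stable-but-inactive": 0}
    for f, s in zip(function_scores, stability_scores):
        if f <= f_thr:
            counts["wild-type-like"] += 1
        elif s > s_thr:
            counts["total-loss"] += 1
        else:
            counts["stable-but-inactive"] += 1
    total = len(function_scores)
    return {name: n / total for name, n in counts.items()}


def sweep_function_threshold(function_scores, stability_scores,
                             f_thresholds, s_thr):
    """Class fractions for each candidate function cutoff."""
    return {
        f_thr: class_fractions(function_scores, stability_scores, f_thr, s_thr)
        for f_thr in f_thresholds
    }
```

    The same sweep could be repeated over the stability cutoff, or over both cutoffs jointly, to see how stable the reported split between the two damaging classes is.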