Functional protein mining with conformal guarantees

Abstract

Molecular structure prediction and homology detection provide a promising path to discovering new protein function and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a novel approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees with user-specified risk and provides calibrated probabilities (rather than raw ML scores) for any protein search model. Our method (1) lets users select from many biologically relevant loss metrics (e.g., false discovery rate) and assigns reliable functional probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our framework enhances the reliability of protein homology detection and enables the discovery of new proteins with likely desirable functional properties.
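
As a rough illustration of what "statistical guarantees with user-specified risk" can look like in practice, the sketch below calibrates a similarity threshold so that the false discovery rate on held-out calibration queries stays under a target level. It is a simplified stand-in, not the authors' procedure: it assumes a score matrix from some search model, binary match labels, a Hoeffding bound, and fixed-sequence testing from strict to permissive thresholds. The function names, the parameters `alpha` and `delta`, and the synthetic data are all hypothetical.

```python
import numpy as np


def fdr_per_query(scores, labels, lam):
    """False discovery proportion for each query at threshold lam.

    scores, labels: arrays of shape (n_queries, n_candidates);
    labels are 1 for a true match and 0 otherwise.
    """
    retrieved = scores >= lam
    n_retrieved = retrieved.sum(axis=1)
    false_hits = (retrieved & (labels == 0)).sum(axis=1)
    # Queries that retrieve nothing incur zero loss by convention.
    return false_hits / np.maximum(n_retrieved, 1)


def calibrate_threshold(scores, labels, alpha=0.1, delta=0.1, n_grid=100):
    """Return the most permissive threshold certified at FDR <= alpha.

    Thresholds are tested from strict to permissive; testing stops at the
    first threshold whose Hoeffding upper bound on the risk exceeds alpha.
    Returns None if even the strictest threshold cannot be certified.
    """
    n = scores.shape[0]
    grid = np.linspace(scores.max(), scores.min(), n_grid)
    # Hoeffding margin for losses bounded in [0, 1] at confidence 1 - delta.
    margin = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    chosen = None
    for lam in grid:
        risk = fdr_per_query(scores, labels, lam).mean()
        if risk + margin > alpha:
            break
        chosen = lam
    return chosen


# Synthetic calibration data (assumption): true matches score higher on average.
rng = np.random.default_rng(0)
labels = (rng.random((500, 20)) < 0.3).astype(int)
scores = labels + rng.normal(0.0, 0.3, size=labels.shape)

lam = calibrate_threshold(scores, labels, alpha=0.1, delta=0.1)
print("calibrated threshold:", lam)
print("calibration FDR at threshold:", fdr_per_query(scores, labels, lam).mean())
```

Scanning from strict to permissive and stopping at the first failure avoids a multiple-testing correction across the grid, at the cost of possibly stopping early when the empirical risk is not monotone in the threshold.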

Article activity feed

  1. Although we extensively use Protein-Vec in this work, our approach is model agnostic and can be used with any search algorithm.

    Most of your examples seem to be embedding- or vector-based, which is very cool. But I think it could be useful to see some examples that use sequences or even structures, since that is also presumably doable with your approach.

  2. We find that 39.6% of coding genes of previously unknown function meet our criteria for an exact functional match

    I'm having a bit of trouble separating what your approach enables from what Protein-Vec alone does in this example. I know that your approach tells us about confidence in the annotations, but it might be interesting to discuss what comes out of Protein-Vec alone vs. what comes out with your approach.

  3. Structural alignment between predicted structure of functional hit of previously unannotated protein in Mycoplasma mycoides and characterized exonuclease.

    Might have missed this, but which proteins are which color?

  4. For example, a recent work Protein-Vec [24] presented state-of-the-art results across numerous benchmarks for function prediction.

    Because you use Protein-Vec quite a bit throughout this paper, it might be useful to give a bit more context up front.

  5. Our framework enhances the reliability of protein homology detection and enables the discovery of new proteins with likely desirable functional properties

    I think that the idea of using conformal prediction to generate some sort of confidence about which proteins to experiment with could be extremely useful! I really enjoyed reading this paper, and one of my favorite things about this paper is that the authors include so many different examples of how this could be applied. I think it could be very cool to take some of these predictions into the lab in the future!

  6. Functional protein mining with conformal guarantees

    I found this study very interesting, and despite my limited knowledge of pLMs and conformal statistics, I have a few comments about the results pertaining to section 3.1. Perhaps my comments may provide a data point on how non-experts may engage with the paper. Please feel free to take or leave any of my suggestions/remarks.

    I really like the approach of establishing conformal guarantees for all the reasons stated in the introduction. I especially liked the generality with which the application of conformal statistics to this problem is presented, and that it was made clear that an explicit "non-goal" of the study was to demo a new machine learning model for enzyme classification.

    While reading, I kept thinking about the fact that members of a Pfam domain do not necessarily share the same biochemical function. This is because less than 0.1% of protein functional annotations are linked to experimental evidence (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9374478/) and the rest--the vast and overwhelming majority--are annotated transitively based on similarity scores of some kind.

    With that in mind, I think the authors could do more to point out that the ground truth upon which their terms FP, TP, and FDR are defined is itself a proxy for shared function. I don't believe this at all detracts from the results of the paper, but pointing out these assumptions would increase the trust of readers who question what you mean by terms like conformal "guarantees" and "true" positives. My apologies if you already explained this somewhere and I missed it.

    Since JCVI Syn3.0 was published in 2016, it would be interesting to see whether the traditional search methods (BLAST & HMMSearch) still yield 20% unknown function, or whether our annotations have since improved.

    It would also be interesting to see if the Protein-Vec hits in the Syn3.0 case study that don't exceed lambda are systematically "worse" than the true positives, for example as measured by TM-score.

    Thanks again for putting out this interesting study.