ProteInfer, deep neural networks for protein functional inference
Curation statements for this article:
Curated by eLife
Evaluation Summary:
The authors describe newly developed software, ProteInfer, that analyses protein sequences to predict their functions. It is based on a single convolutional neural network that scans for all known domains in parallel. This software provides a convincing approach for computational scientists as well as experimentalists working near the interface of machine learning and molecular biology.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to predict a variety of protein functions – Enzyme Commission (EC) numbers and Gene Ontology (GO) terms – directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user’s personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, please visit https://google-research.github.io/proteinfer/ .
Article activity feed
Author Response
Reviewer #1 (Public Review):
This work describes a new method, Proteinfer, which uses dilated neural networks to predict protein function, using EC terms and GO terms. The software is fast and the server-side performance is fast and reliable. The method is very clearly described. However, it is hard to judge the accuracy of this method based on the current manuscript, and some more work is needed to do so.
I would like to address the following statement by the authors: (p3, left column): "We focus on Swiss Prot to ensure that our models learn from human-curated labels, rather than labels generated by electronic annotation".
There is a subtle but important point to be made here: while SwissProt (SP) entries are human-curated, they might still have their function annotated ("labeled") electronically only. The SP entry comprises the sequence, source organism, paper(s) (if any), annotations, cross-references, etc. A validated entry does not mean that the annotation was necessarily validated manually: but rather that there is a paper backing the veracity of the sequence itself, and that it is not an automatic generation from a genome project.
Example: 009L_FRG3G is a reviewed entry, and has four function annotations, all generated by BLAST, with an IEA (inferred by electronic annotation) evidence code. Most GO annotations in SwissProt are generated that way: a reviewed Swissprot entry, unlike what the authors imply, does not guarantee that the function annotation was made by non-electronic means. If the authors would like to use non-electronic annotations for functional labels, they should use those that are annotated with the GO experimental evidence codes (or, at the very least, not exclusively annotated with IEA). Therefore, most of the annotations in the authors' gold standard protein annotations are simply generated by BLAST and not reviewed by a person. Essentially the authors are comparing predictions with predictions, or at least not taking care not to do so. This is an important point that the authors need to address since there is no apparent gold standard they are using.
The above statement is relevant to GO. But since EC is mapped 1:1 to GO molecular function ontology (as a subset, there are many terms in GO MFO that are not enzymes of course), the authors can easily apply this to EC-based entries as well.
This may explain why, in Figure S8(b), BLAST retains such a high and even plateau of the precision-recall curve: BLAST hits are used throughout as gold-standard, and therefore BLAST performs so well. This is in contrast, say to CAFA assessments which use as a gold standard only those proteins which have experimental GO evidence codes, and therefore BLAST performs much poorer upon assessment.
We thank the reviewer for this point. We regret if we gave the impression that our training data derives exclusively, or even primarily, from direct experiments on the amino acid sequences in question. We had attempted to address this point in the discussion with this section:
"On the other hand, many entries come from experts applying existing computational methods, including BLAST and HMM-based approaches, to identify protein function. Therefore, the data may be enriched for sequences with functions that are easily ascribable using these techniques, which could limit the ability to estimate the added value of using an alternative alignment-free tool. An idealised dataset would involve training only on those sequences that have themselves been experimentally characterized, but at present less data exists than would be needed for a fully supervised deep-learning approach."
We have now added a sentence early in the manuscript reinforcing this point:
"Despite its curated nature, SwissProt contains many proteins annotated only on the basis of electronic tools."
We have also removed the phrase "rather than labels generated by a computational annotation pipeline" because we acknowledge that this could be read to imply that computational approaches are not used at all for SwissProt which would not be correct.
While we agree that SwissProt contains many entries inferred via electronic means, we nevertheless think its curated nature makes an important difference. Curators reconcile, as far as possible, all known data for a protein, often looking for the presence of key residues in the active sites. There are proteins where electronic annotation would suggest functions in direct contradiction to experimental data; such annotations are avoided thanks to this curation process. As one example, UniProt entry Q76NQ1 contains a rhomboid-like domain typically found in rhomboid proteases (IPR022764) and therefore inputting it into InterProScan results in a prediction of peptidase activity (GO:0004252). However this is in fact an inactive protein, as discovered by experiment, and so is not annotated with this activity in SwissProt. ProteInfer successfully avoids predicting peptidase activity as a result of this curated training data. (For transparency, ProteInfer is by no means perfect on this point: there are also cases in which UniProt curators have annotated single proteins as inactive but ProteInfer has not learnt this relationship, due to similar sequences which remain active.)
We had also attempted to address this point by comparing with phenotypes seen in a specific high-throughput experimental assay ("Comparison to experimental data" section).
We have now added a new analysis in which we assess the recall of GO terms while excluding IEA annotation codes. We find that at the threshold that maximises F1 score in the full analysis, our approach is able to recall 60-75% (depending on ontology) of annotations. Inferring precision is challenging because only a very small proportion of the possible function-gene combinations have in fact been tested, making it difficult to distinguish a true negative from a false negative.
"We also tested how well our trained model was able to recall the subset of GO term annotations which are not associated with the "inferred from electronic annotation" (IEA) evidence code, indicating either experimental work or more intensely-curated evidence. We found that at the threshold that maximised F1 score for overall prediction, 75% of molecular function annotations could be successfully recalled, 61% of cellular component annotations, and 60% of biological process annotations."
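One way the recall analysis described above can be computed is sketched below. The proteins, GO terms, scores, and evidence codes are purely illustrative, and `non_iea_recall` is a hypothetical helper, not the authors' actual evaluation code:

```python
# Sketch: recall of non-IEA GO annotations at a fixed score threshold.
# All data below is illustrative, not taken from the paper.

# Ground truth: protein -> {(go_term, evidence_code)}
annotations = {
    "P1": {("GO:0003677", "EXP"), ("GO:0005634", "IEA")},
    "P2": {("GO:0016787", "IDA"), ("GO:0005737", "IEA")},
    "P3": {("GO:0004252", "EXP")},
}

# Model output: protein -> {go_term: confidence score}
predictions = {
    "P1": {"GO:0003677": 0.92, "GO:0005634": 0.80},
    "P2": {"GO:0016787": 0.40, "GO:0005737": 0.88},
    "P3": {"GO:0004252": 0.95},
}

def non_iea_recall(annotations, predictions, threshold):
    """Fraction of non-IEA annotations recovered at or above the threshold."""
    total = recalled = 0
    for protein, annots in annotations.items():
        for term, evidence in annots:
            if evidence == "IEA":
                continue  # keep only experimentally/curator-backed labels
            total += 1
            if predictions.get(protein, {}).get(term, 0.0) >= threshold:
                recalled += 1
    return recalled / total if total else 0.0

print(non_iea_recall(annotations, predictions, threshold=0.5))  # 2 of 3 non-IEA terms
```

Filtering on the evidence code before counting is what restricts the denominator to annotations with non-electronic support.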
Pooling GO DAGs together: It is unclear how the authors generate performance data over GO as a whole. GO is really 3 disjoint DAGs (molecular function ontology or MFO, Biological Process or BPO, Cellular component or CCO). Any assessment of performance should be over each DAG separately, to make biological sense. Pooling together the three GO DAGs which describe completely different aspects of the function is not informative. Interestingly enough, in the browser applications, the GO DAG results are distinctly separated into the respective DAGs.
Thank you for this suggestion. To answer the question of how we were previously generating performance data: this was simply by treating all terms equivalently, regardless of their ontology.
We agree that it would be helpful to the reader to split out results by ontology type, especially given clear differences in performance.
We now provide PR-curve graphs split by ontology type.
We have also added the following text:
"The same trends for the relative performance of different approaches were seen for each of the directed acyclic graphs that make up the GO ontology (biological process, cellular component and molecular function), but there were substantial differences in absolute performance (Fig S10). Performance was highest for molecular function (max F1: 0.94), followed by biological process (max F1: 0.86) and then cellular component (max F1: 0.84)."
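The max-F1 statistic quoted above summarises an entire precision-recall curve by the best F1 achievable at any threshold. A minimal sketch, with made-up curve points rather than the paper's data:

```python
# Sketch: max F1 (often written F_max) from a precision-recall curve.
# The (precision, recall) points below are illustrative only.

def f_max(pr_points):
    """Maximum F1 = 2PR/(P+R) over a list of (precision, recall) pairs."""
    best = 0.0
    for p, r in pr_points:
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Hypothetical curves for the three GO ontologies.
curves = {
    "molecular_function": [(0.99, 0.70), (0.95, 0.93), (0.80, 0.97)],
    "biological_process": [(0.97, 0.60), (0.90, 0.82), (0.70, 0.90)],
    "cellular_component": [(0.96, 0.55), (0.88, 0.80), (0.65, 0.92)],
}

for name, pts in curves.items():
    print(name, round(f_max(pts), 2))
```

Because F1 is the harmonic mean of precision and recall, the maximum is usually found where the curve balances the two rather than at either extreme.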
Figure 3 and lack of baseline methods: the text refers to Figures 3A and 3B, but I could only see one figure with no panels. Is there an error here? It is not possible at this point to talk about the results in this figure as described. It looks like Figure 3A is missing, with Fmax scores. In any case, Figure 3(b?) has precision-recall curves showing the performance of predictions is the highest on Isomerases and lowest in hydrolases. It is hard to tell the Fmax values, but they seem reasonably high. However, there is no comparison with a baseline method such as BLAST or Naive, and those should be inserted. It is important to compare Proteinfer with these baseline methods to answer the following questions: (1) Does Proteinfer perform better than the go-to method of choice for most biologists? (2) does it perform better than what is expected given the frequency of these terms in the dataset? For an explanation of the Naive method which answers the latter question, see: ( https://www.nature.com/articles/nmeth.2340 )
We apologise for the errors in figure referencing in the text here. This emerged in part from the two versions of text required to support an interactive and legacy PDF version. We had provided baseline comparisons with BLAST in Fig. 5 of the interactive version (correctly referenced in the interactive version) and in Fig. S7 of the PDF version (incorrectly referenced as Fig 3B).
We have now moved the key panel of Fig S7 to the main-text of the PDF version (new Fig 3B), as suggested also by the editor, and updated the figure referencing appropriately. We have also added a Naive frequency-count based baseline. This baseline would not appear in Fig 3B due to axis truncation, but is shown in a supplemental figure, new Fig S9. We thank the reviewer and the editor for raising these points.
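The Naive frequency-count baseline discussed here (in the style of the CAFA reference the reviewer cites) assigns every query the same score vector: each label's frequency among training proteins. A toy sketch with hypothetical EC labels:

```python
# Sketch: "Naive" frequency baseline. Every query protein receives each
# label with a score equal to that label's frequency in the training set.
# The training labels below are toy data for illustration.
from collections import Counter

train_labels = [
    {"EC:3.1.-.-", "EC:2.7.-.-"},
    {"EC:3.1.-.-"},
    {"EC:3.1.-.-", "EC:1.1.-.-"},
    {"EC:2.7.-.-"},
]

counts = Counter(label for labels in train_labels for label in labels)
n = len(train_labels)
naive_scores = {label: c / n for label, c in counts.items()}

def naive_predict(threshold):
    """Labels predicted for ANY query: those with training frequency >= threshold."""
    return {label for label, score in naive_scores.items() if score >= threshold}

print(naive_scores)        # EC:3.1.-.- appears in 3 of 4 proteins -> 0.75
print(naive_predict(0.5))  # only the most common labels survive
```

Because the baseline ignores the sequence entirely, any method that fails to beat it is only recovering label frequency, which is why it is a useful floor for comparison.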
Reviewer #2 (Public Review):
In this paper, Sanderson et al. describe a convolutional neural network that predicts protein domains directly from amino acid sequences. They train this model with manually curated sequences from the Swiss-Prot database to predict Enzyme Commission (EC) numbers and Gene Ontology (GO) terms. This paper builds on previous work by this group, where they trained a separate neural network to recognize each known protein domain. Here, they train one convolutional neural network to identify enzymatic functions or GO terms. They discuss how this change can deal with protein domains that frequently co-occur and more efficiently handle proteins of different lengths. The tool, ProteInfer, is a useful addition to the computational analysis of proteins that complements existing methods like BLAST and Pfam.
The authors make three claims:
- "ProteInfer models reproduce curator decisions for a variety of functional properties across sequences distant from the training data"
This claim is well supported by the data presented in the paper. The authors compare the precision-recall curves of four model variations. The authors focus their training on the maximum F1 statistic of the precision-recall curve. Using precision-recall curves is appropriate for this kind of problem.
- "Attribution analysis shows that the predictions are driven by relevant regions of each protein sequence".
This claim is very well supported by the data and particularly well illustrated by Figure 4. The examples on the interactive website are also very nice. This section is a substantial innovation of this method. It shows the value of scanning for multiple functions at the same time and the value of being able to scan proteins of any length.
- "ProteInfer models create a generalised mapping between sequence space and the space of protein functions, which is useful for tasks other than those for which the models were trained."
This claim is also well supported. The print version of the figure is really clear, and the interactive version is even better. It is a clever use of UMAP representations to look at the abstract last layer of the network. It was very nice how each sub-functional class clustered.
The interactive website was very easy to use with a good user interface. I expect it will be accessible to experimental and computational biologists.
The manuscript has many strengths. The main text is clearly written, with high-level descriptions of the modeling. I initially printed and read the static PDF version of the paper. The interactive form is much more fun to read because of the ability to analyze my favorite proteins and zoom in on their figures (e.g. Figure 8). The new Figure 1 motivates the work nicely. The website has an excellent interactive graphic showing how the number of layers in the network and the kernel size change how data is pooled across residues. I will use this tool in my teaching.
We are grateful for these comments. We are excited that the reviewer hopes to use this figure for teaching, which is exactly the sort of impact we hoped for this interactive manuscript. We agree that the interactive manuscript is by far the most compelling version of this work.
The manuscript has only minor weaknesses. It was not clear if the interactive model on the website was the Single CNN model or the Ensemble CNN model.
We thank the reviewer for pointing out the ambiguity here. The model shown on the website is a Single CNN model, and is chosen with hyperparameters that achieve good performance whilst being readily downloadable to the user's machine for this demonstration without use of excessive bandwidth. We have added additional sentences to address this better in the manuscript.
"When the user loads the tool, lightweight EC (5 MB) and GO (7 MB) prediction models are downloaded and all predictions are then performed locally, with query sequences never leaving the user's computer. We selected the hyperparameters for these lightweight models by performing a tuning study in which we filtered results by the size of the model's parameters and then selected the best performing models. This approach uses a single neural network, rather than an ensemble. Inference in the browser for a 1500 amino-acid sequence takes < 1.5 seconds for both models."
Overall, ProteInfer will be a very useful resource for a broad user base. The analysis of the 171 new proteins in Figure 7 was particularly compelling and serves as a great example of the utility and power of ProteInfer. It complements leading tools in a very valuable way. I anticipate adding it to my standard analysis workflows. The data and code are publicly available.
Reviewer #3 (Public Review):
In this work, the authors employ a deep convolutional neural network approach to map protein sequence to function. The rationales are that (i) once trained, the neural network would offer fast predictions for new sequences, facilitating exploration and discovery without the need for extensive computational resources, (ii) that the embedding of protein sequences in a fixed-dimensional space would allow potential analyses and interpretation of sequence-function relationships across proteins, and (iii) predicting protein function in a way that is different from alignment-based approaches could lead to new insights or superior performance, at least in certain regimes, thereby complementing existing approaches. I believe the authors demonstrate i and iii convincingly, whereas ii was left open-ended.
A strength of the work is showing that the trained CNNs perform generally on par with existing alignment based-methods such as BLASTp, with a precision-recall tradeoff that differs from BLASTp. Because the method is more precise at lower recall values, whereas BLASTp has higher recall at lower precision values, it is indeed a good complement to BLASTp, as demonstrated by the top performance of the ensemble approach containing both methods.
Another strength of the work is its emphasis on usability and interpretability, as demonstrated in the graphical interface, use of class activation mapping for sub-sequence attribution, and the analysis of hierarchical functional clustering when projecting the high-dimensional embedding into UMAP projections.
We thank the reviewer for highlighting these points.
However, a main weakness is the premise that this approach is new. For example, the authors claim that existing deep learning "models cannot infer functional annotation for full-length protein sequences." However, as the proposed method is a straightforward deep neural network implementation, there have been other very similar approaches published for protein function prediction, for example Cai, Wang, and Deng, Frontiers in Bioengineering and Biotechnology (2020), which is also a CNN approach. As such, it is difficult to assess how this approach differs from or builds on previous work.
We agree that there has been a great deal of exciting work looking at the application of deep learning to protein sequences. Our core code has been publicly available on GitHub since April 2019 , and our preprint has now been available for more than a year. We regret the time taken to release a manuscript and for it to reach review: this was in part due to the SARS-CoV-2 pandemic, which the first author was heavily involved in the scientific response to. Nevertheless, we believe that our work has a number of important features that distinguish it from much other work in this space.
● We train across the entire GO ontology. In the paper referenced by the reviewer, training is with 491 BP terms, 321 MF terms, and 240 CC terms. In contrast, we train with a vocabulary of 32,102 GO labels, and the majority of these are predicted at least once in our test set.
● We use a dilated convolutional approach. In the referenced paper the network instead has fixed input dimensions. Such an approach imposes an upper limit on how large a protein can be input into the model, and means that this maximum length defines the computational resources used for every protein, including much smaller ones. In contrast, our dilated network scales to any size of protein, but when used with smaller input sequences it performs only the calculations needed for this size of sequence.
● We use class-activation mapping to determine regions of a protein responsible for predictions, and therefore potentially involved in specific functions.
● We provide a TensorFlow.JS implementation of our approach that allows lightweight models to be tested without any downloads.
● We provide a command-line tool that provides easy access to full models.
We have made some changes to bring out these points more clearly in the text:
"Since natural protein sequences can vary in length by at least three orders of magnitude, this pooling is advantageous because it allows our model to accommodate sequences of arbitrary length without imposing restrictive modeling assumptions or computational burdens that scale with sequence length. In contrast, many previous approaches operate on fixed sequence lengths: these techniques are unable to make predictions for proteins larger than this sequence length, and use unnecessary resources when employed on smaller proteins."
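The length-invariance described in the quoted passage comes from pooling per-residue features into a fixed-size vector. A dependency-free sketch, in which a one-hot encoding stands in for the dilated CNN's learned per-residue activations:

```python
# Sketch: mean-pooling per-residue features into a fixed-dimensional
# embedding, so sequences of any length yield the same output size.
# The one-hot "features" are a trivial stand-in for CNN activations.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIM = len(AMINO_ACIDS)

def residue_features(aa):
    """One-hot encoding: placeholder for learned per-residue activations."""
    v = [0.0] * DIM
    v[AMINO_ACIDS.index(aa)] = 1.0
    return v

def embed(sequence):
    """Mean-pool per-residue features into a fixed DIM-dimensional vector."""
    feats = [residue_features(aa) for aa in sequence]
    return [sum(col) / len(feats) for col in zip(*feats)]

short = embed("MKVL")          # 4 residues
long = embed("MKVL" * 250)     # 1000 residues
print(len(short), len(long))   # same embedding dimension for both
```

Because the pooled dimension never depends on sequence length, downstream layers impose no maximum input size, and the per-residue work done scales linearly with the actual sequence length.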
We have added a table that sets out the vocabulary sizes used in our work (5,134 for EC and 32,109 for GO):
"Gene Ontology (GO) terms describe important protein functional properties, with 32,109 such terms in Swiss-Prot (Table S6) that cover the molecular functions of proteins (e.g. DNA-binding, amylase activity), the biological processes they are involved in (e.g. DNA replication, meiosis), and the cellular components to which they localise (e.g. mitochondrion, cytosol)."
A second weakness is that it was not clear what new insights the UMAP projections of the sequence embedding could offer. For example, the authors mention that "a generalized mapping between sequence space and the space of protein functions...is useful for tasks other than those for which the models were trained." However, such tasks were not explicitly explained. The hierarchical clustering of enzymatic proteins shown in Fig. 5 and the clustering of non-enzymatic proteins in Fig. 6 are consistent with the expectation of separability in the high-dimensional embedding space that would be necessary for good CNN performance (although the sub-groups are sometimes not well-separated; for example, only the second level and leaf level are well-separated in the enzyme classification UMAP hierarchy). Therefore, the value added by the UMAP representation should be something like using these plots to gain insight into a family or sub-family of enzymes.
We thank the reviewer for highlighting this point. There are two types of embedding which we discuss in the paper. The first is the high-dimensional representation of the protein that the neural network constructs as part of the prediction process. This is the embedding we feel is most useful for downstream applications, and we discuss a specific example of training the EC-number network to recognise membrane proteins (a property on which it was not trained): "To quantitatively measure whether these embeddings capture the function of non-enzyme proteins, we trained a simple random forest classification model that used these embeddings to predict whether a protein was annotated with the intrinsic component of membrane GO term. We trained on a small set of non-enzymes containing 518 membrane proteins, and evaluated on the rest of the examples. This simple model achieved a precision of 97% and recall of 60% for an F1 score of 0.74. Model training and data-labelling took around 15 seconds. This demonstrates the power of embeddings to simplify other studies with limited labeled data, as has been observed in recent work (43, 72)."
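The embedding-transfer workflow described in this response can be sketched as follows. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the paper's random forest, and the embeddings are synthetic clusters rather than real ProteInfer outputs:

```python
# Sketch: train a simple classifier on protein embeddings to predict a
# property the network was not trained on (e.g. membrane localisation).
# Nearest-centroid replaces the paper's random forest; embeddings are
# synthetic, with the two classes drawn from separated Gaussian clusters.
import random

random.seed(0)
DIM = 8

def synth_embedding(is_membrane):
    # Pretend membrane and non-membrane proteins occupy different regions.
    center = 1.0 if is_membrane else -1.0
    return [random.gauss(center, 0.5) for _ in range(DIM)]

train = [(synth_embedding(m), m) for m in [True] * 50 + [False] * 50]
test = [(synth_embedding(m), m) for m in [True] * 20 + [False] * 20]

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

pos = centroid([v for v, m in train if m])
neg = centroid([v for v, m in train if not m])

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(v):
    return dist2(v, pos) < dist2(v, neg)

tp = sum(1 for v, m in test if predict(v) and m)
fp = sum(1 for v, m in test if predict(v) and not m)
fn = sum(1 for v, m in test if not predict(v) and m)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)
```

The point mirrors the one made above: when the embedding already separates the classes, even a very cheap downstream model trained on a small labeled set performs well.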
As the reviewer points out, there is a second embedding created by compressing this high-dimensional representation down to two dimensions using UMAP. This embedding can also be useful for understanding the properties seen by the network, for example the GO terms highlighted in Fig. 7, but in general it will contain less information than the higher-dimensional embedding.
The clear presentation, ease of use, and computationally accessible downstream analytics of this work make it of broad utility to the field.
-
Evaluation Summary:
The authors describe a newly developed software, ProteInfer, that analyses protein sequences to predict their functions. It is based on a single convolutional neural network scan for all known domains in parallel. This software provides a convincing approach for all computational scientists as well as experimentalists working near the interface of machine learning and molecular biology.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)
-
Reviewer #1 (Public Review):
This work describes a new method, Proteinfer, which uses dilated neural networks to predict protein function, using EC terms and GO terms. The software is fast and the server-side performance is fast and reliable. The method is very clearly described. However, it is hard to judge the accuracy of this method based on the current manuscript, and some more work is needed to do so.
I would like to address the following statement by the authors: (p3, left column): "We focus on Swiss Prot to ensure that our models learn from human-curated labels, rather than labels generated by electronic annotation".
There is a subtle but important point to be made here: while SwissProt (SP) entries are human-curated, they might still have their function annotated ("labeled") electronically only. The SP entry comprises the …
Reviewer #1 (Public Review):
This work describes a new method, Proteinfer, which uses dilated neural networks to predict protein function, using EC terms and GO terms. The software is fast and the server-side performance is fast and reliable. The method is very clearly described. However, it is hard to judge the accuracy of this method based on the current manuscript, and some more work is needed to do so.
I would like to address the following statement by the authors: (p3, left column): "We focus on Swiss Prot to ensure that our models learn from human-curated labels, rather than labels generated by electronic annotation".
There is a subtle but important point to be made here: while SwissProt (SP) entries are human-curated, they might still have their function annotated ("labeled") electronically only. The SP entry comprises the sequence, source organism, paper(s) (if any), annotations, cross-references, etc. A validated entry does not mean that the annotation was necessarily validated manually: but rather that there is a paper backing the veracity of the sequence itself, and that it is not an automatic generation from a genome project.
Example: 009L_FRG3G is a reviewed entry, and has four function annotations, all generated by BLAST, with an IEA (inferred by electronic annotation) evidence code. Most GO annotations in SwissProt are generated that way: a reviewed Swissprot entry, unlike what the authors imply, does not guarantee that the function annotation was made by non-electronic means. If the authors would like to use non-electronic annotations for functional labels, they should use those that are annotated with the GO experimental evidence codes (or, at the very least, not exclusively annotated with IEA). Therefore, most of the annotations in the authors' gold standard protein annotations are simply generated by BLAST and not reviewed by a person. Essentially the authors are comparing predictions with predictions, or at least not taking care not to do so. This is an important point that the authors need to address since there is no apparent gold standard they are using.The above statement is relevant to GO. But since EC is mapped 1:1 to GO molecular function ontology (as a subset, there are many terms in GO MFO that are not enzymes of course), the authors can easily apply this to EC-based entries as well.
This may explain why, in Figure S8(b), BLAST retains such a high and even plateau of the precision-recall curve: BLAST hits are used throughout as gold-standard, and therefore BLAST performs so well. This is in contrast, say to CAFA assessments which use as a gold standard only those proteins which have experimental GO evidence codes, and therefore BLAST performs much poorer upon assessment.
Pooling GO DAGs together: It is unclear how the authors generate performance data over GO as a whole. GO is really 3 disjoint DAGs (molecular function ontology or MFO, Biological Process or BPO, Cellular component or CCO). Any assessment of performance should be over each DAG separately, to make biological sense. Pooling together the three GO DAGs which describe completely different aspects of the function is not informative. Interestingly enough, in the browser applications, the GO DAG results are distinctly separated into the respective DAGs.
Figure 3 and lack of baseline methods: the text refers to Figures 3A and 3B, but I could only see one figure with no panels. Is there an error here? It is not possible at this point to talk about the results in this figure as described. It looks like Figure 3A is missing, with Fmax scores. In any case, Figure 3(b?) has precision-recall curves showing the performance of predictions is the highest on Isomerases and lowest in hydrolases. It is hard to tell the Fmax values, but they seem reasonably high. However, there is no comparison with a baseline method such as BLAST or Naive, and those should be inserted. It is important to compare Proteinfer with these baseline methods to answer the following questions: (1) Does Proteinfer perform better than the go-to method of choice for most biologists? (2) does it perform better than what is expected given the frequency of these terms in the dataset? For an explanation of the Naive method which answers the latter question, see: (https://www.nature.com/articles/nmeth.2340)
-
Reviewer #2 (Public Review):
In this paper, Sanderson et al. describe a convolutional neural network that predicts protein domains directly from amino acid sequences. They train this model with manually curated sequences from the Swiss-Prot database to predict Enzyme Commission (EC) numbers and Gene Ontology (GO) terms. This paper builds on previous work by this group, where they trained a separate neural network to recognize each known protein domain. Here, they train one convolutional neural network to identify enzymatic functions or GO terms. They discuss how this change can deal with protein domains that frequently co-occur and more efficiently handle proteins of different lengths. The tool, ProteInfer, adds a useful new tool for computational analysis of proteins that complements existing methods like BLAST and Pfam.
The authors …
Reviewer #2 (Public Review):
In this paper, Sanderson et al. describe a convolutional neural network that predicts protein domains directly from amino acid sequences. They train this model with manually curated sequences from the Swiss-Prot database to predict Enzyme Commission (EC) numbers and Gene Ontology (GO) terms. This paper builds on previous work by this group, where they trained a separate neural network to recognize each known protein domain. Here, they train one convolutional neural network to identify enzymatic functions or GO terms. They discuss how this change can deal with protein domains that frequently co-occur and more efficiently handle proteins of different lengths. The tool, ProteInfer, adds a useful new tool for computational analysis of proteins that complements existing methods like BLAST and Pfam.
The authors make three claims:
"ProteInfer models reproduce curator decisions for a variety of functional properties across sequences distant from the training data."
This claim is well supported by the data presented in the paper. The authors compare the precision-recall curves of four model variations. The authors focus their training on the maximum F1 statistic of the precision-recall curve. Using precision-recall curves is appropriate for this kind of problem.
"Attribution analysis shows that the predictions are driven by relevant regions of each protein sequence."
This claim is very well supported by the data and particularly well illustrated by Figure 4. The examples on the interactive website are also very nice. This section is a substantial innovation of this method. It shows the value of scanning for multiple functions at the same time and the value of being able to scan proteins of any length.
"ProteInfer models create a generalised mapping between sequence space and the space of protein functions, which is useful for tasks other than those for which the models were trained."
This claim is also well supported. The print version of the figure is really clear, and the interactive version is even better. It is a clever use of UMAP representations to look at the abstract last layer of the network. It was very nice how each sub-functional class clustered.
The interactive website was very easy to use, with a good user interface. I expect it will be accessible to both experimental and computational biologists.
The manuscript has many strengths. The main text is clearly written, with high-level descriptions of the modeling. I initially printed and read the static PDF version of the paper. The interactive form is much more fun to read because of the ability to analyze my favorite proteins and zoom in on their figures (e.g. Figure 8). The new Figure 1 motivates the work nicely. The website has an excellent interactive graphic showing how the number of layers in the network and the kernel size change how data is pooled across residues. I will use this tool in my teaching.
The manuscript has only minor weaknesses. It was not clear if the interactive model on the website was the Single CNN model or the Ensemble CNN model.
Overall, ProteInfer will be a very useful resource for a broad user base. The analysis of the 171 new proteins in Figure 7 was particularly compelling and serves as a great example of the utility and power of ProteInfer. It complements leading tools in a very valuable way. I anticipate adding it to my standard analysis workflows. The data and code are publicly available.
-
Reviewer #3 (Public Review):
In this work, the authors employ a deep convolutional neural network approach to map protein sequence to function. The rationales are that (i) once trained, the neural network would offer fast predictions for new sequences, facilitating exploration and discovery without the need for extensive computational resources, (ii) that the embedding of protein sequences in a fixed-dimensional space would allow potential analyses and interpretation of sequence-function relationships across proteins, and (iii) predicting protein function in a way that is different from alignment-based approaches could lead to new insights or superior performance, at least in certain regimes, thereby complementing existing approaches. I believe the authors demonstrate i and iii convincingly, whereas ii was left open-ended.
A strength of the work is showing that the trained CNNs perform generally on par with existing alignment based-methods such as BLASTp, with a precision-recall tradeoff that differs from BLASTp. Because the method is more precise at lower recall values, whereas BLASTp has higher recall at lower precision values, it is indeed a good complement to BLASTp, as demonstrated by the top performance of the ensemble approach containing both methods.
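The complementarity the reviewer describes is what makes ensembling attractive: the CNN is more precise at low recall, BLASTp recalls more at low precision, so a blend can dominate either alone. The following is only an illustrative blending scheme, not the specific ensembling method used in the paper, and assumes both predictors emit per-term scores normalized to [0, 1]:

```python
import numpy as np

def ensemble_scores(cnn_scores, blast_scores, weight=0.5):
    """Blend per-term scores from two predictors by convex combination.

    cnn_scores, blast_scores: arrays of matching shape with values in [0, 1].
    weight: contribution of the CNN predictor (1 - weight goes to BLASTp).
    """
    return weight * cnn_scores + (1 - weight) * blast_scores
```

In practice the weight (or a per-threshold combination rule) would be tuned on a held-out set so the ensemble's precision-recall curve sits above both constituents.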
Another strength of the work is its emphasis on usability and interpretability, as demonstrated in the graphical interface, use of class activation mapping for sub-sequence attribution, and the analysis of hierarchical functional clustering when projecting the high-dimensional embedding into UMAP projections.
However, a main weakness is the premise that this approach is new. For example, the authors claim that existing deep learning "models cannot infer functional annotation for full-length protein sequences." However, as the proposed method is a straightforward deep neural network implementation, there have been other very similar approaches published for protein function prediction, for example Cai, Wang, and Deng, Frontiers in Bioengineering and Biotechnology (2020), the latter also being a CNN approach. As such, it is difficult to assess how this approach differs from or builds on previous work.
A second weakness is that it was not clear what new insights the UMAP projections of the sequence embedding could offer. For example, the authors mention that "a generalized mapping between sequence space and the space of protein functions...is useful for tasks other than those for which the models were trained." However, such tasks were not explicitly explained. The hierarchical clustering of enzymatic proteins shown in Fig. 5 and the clustering of non-enzymatic proteins in Fig. 6 are consistent with the expectation of separability in the high-dimensional embedding space that would be necessary for good CNN performance (although the sub-groups are sometimes not well separated; for example, only the second level and leaf level are well separated in the enzyme classification UMAP hierarchy). Therefore, the value added by the UMAP representation should be something like using these plots to gain insight into a family or sub-family of enzymes.
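One concrete "other task" a fixed-dimensional embedding supports is nearest-neighbour search: ranking proteins by similarity to a query in embedding space, e.g. to propose functional analogues with no detectable sequence identity. A minimal numpy sketch using cosine similarity on a synthetic embedding matrix (the vectors are illustrative stand-ins, not ProteInfer's final-layer output):

```python
import numpy as np

def nearest_neighbors(embeddings, query_idx, k=3):
    """Return indices of the k proteins closest to the query by cosine similarity.

    embeddings: (n_proteins, d) matrix, one row per protein.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]          # cosine similarity to the query
    order = np.argsort(-sims)              # most similar first
    return [int(i) for i in order if i != query_idx][:k]
```

This is the kind of explicitly stated downstream use the reviewer asks for: the UMAP plots visualise the same geometry that such a lookup exploits.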
The clear presentation, ease of use, and computationally accessible downstream analytics of this work make it of broad utility to the field.
-