Prediction of Anti‐Freezing Proteins From Their Evolutionary Profile
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Prediction of antifreeze proteins (AFPs) holds significant importance due to their diverse applications in healthcare. An inherent limitation of current AFP prediction methods is their reliance on unreviewed proteins for evaluation. This study evaluates proposed and existing methods on an independent dataset containing 80 AFPs and 73 non-AFPs obtained from UniProt, all of which have already been reviewed by experts. Initially, we constructed machine learning models for AFP prediction using selected composition-based protein features and achieved a peak AUROC of 0.90 with an MCC of 0.69 on the independent dataset. Subsequently, we observed a notable enhancement in model performance, with the AUROC increasing from 0.90 to 0.93, upon incorporating evolutionary information instead of relying solely on the primary sequence of proteins. Furthermore, we explored hybrid models integrating our machine learning approaches with BLAST-based similarity and motif-based methods. However, the performance of these hybrid models either matched or was inferior to that of our best machine learning model. Our best model, based on evolutionary information, outperforms all existing methods on the independent/validation dataset. To assist users, a user-friendly web server with a standalone package, named "AFPropred", was developed ( https://webs.iiitd.edu.in/raghava/afpropred ).
Article activity feed
-
Once the model was evaluated, we chose our top-performing model for further analysis, in which we integrated the evolutionary features with composition-based features and the ML score with the BLAST score and named the hybrid methods
It's a bit confusing to me why you would carry out your model selection procedure without using the intended feature-set. My worry would be that certain models might perform better or worse with different feature types and that you might be missing that here. Could you elaborate on this?
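  For concreteness, here is a minimal sketch of what evaluating each candidate model on each candidate feature set might look like, assuming a scikit-learn workflow; the arrays `X_composition`, `X_pssm`, and `y` are hypothetical placeholders, not the paper's actual features:

  ```python
  # Illustrative sketch (not the authors' code): score every candidate model
  # on every candidate feature set during selection, instead of selecting a
  # model on one feature set and swapping features in afterwards.
  import numpy as np
  from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  rng = np.random.default_rng(0)
  X_composition = rng.random((200, 20))   # placeholder composition features
  X_pssm = rng.random((200, 400))         # placeholder PSSM-derived features
  y = rng.integers(0, 2, 200)             # placeholder AFP / non-AFP labels

  feature_sets = {"composition": X_composition, "pssm400": X_pssm}
  models = {
      "logreg": LogisticRegression(max_iter=1000),
      "rf": RandomForestClassifier(n_estimators=200, random_state=0),
      "et": ExtraTreesClassifier(n_estimators=200, random_state=0),
  }

  # Report cross-validated AUROC for every (feature set, model) pair.
  for feat_name, X in feature_sets.items():
      for model_name, model in models.items():
          auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          print(f"{feat_name:12s} {model_name:8s} AUROC={auc:.3f}")
  ```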
-
Feature selection techniques
Did you also consider exploring the use of regularization? I would suggest looking into L1 or L2 regularization to reduce the contribution of some of your features and mitigate the potential for overfitting.
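  As a concrete illustration of the suggestion, here is a minimal sketch using scikit-learn on synthetic placeholder data (`X`, `y` are hypothetical, not the paper's features); L1 drives some coefficients to exactly zero (an implicit feature selector), while L2 shrinks all of them smoothly:

  ```python
  # Hedged sketch of L1/L2-regularized models in a scikit-learn pipeline.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  # L1 penalty: sparse coefficients, so uninformative features drop out.
  l1_model = make_pipeline(
      StandardScaler(),
      LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
  )

  # L2 penalty: every feature kept, but its weight is shrunk toward zero.
  l2_model = make_pipeline(
      StandardScaler(),
      LogisticRegression(penalty="l2", solver="lbfgs", C=0.5, max_iter=1000),
  )

  # Toy data standing in for real composition/PSSM features.
  rng = np.random.default_rng(0)
  X, y = rng.random((150, 40)), rng.integers(0, 2, 150)
  l1_model.fit(X, y)
  n_kept = (l1_model[-1].coef_ != 0).sum()
  print(f"L1 kept {n_kept} of {X.shape[1]} features")
  ```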
-
PSSM-400
Is there a particular reason you chose this specific PSSM profile over the alternatives?
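  For readers unfamiliar with this representation, the sketch below shows one common way PSSM-400 is constructed (variants differ, e.g., in normalization): for each of the 20 residue types, average the 20 PSSM column scores over all positions holding that residue, yielding a length-independent 20 × 20 = 400-dimensional vector. The `sequence` and `pssm` inputs here are hypothetical stand-ins for a real PSI-BLAST profile:

  ```python
  # Hedged sketch of one common PSSM-400 construction (not necessarily the
  # exact variant used in the paper).
  import numpy as np

  AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

  def pssm_400(sequence: str, pssm: np.ndarray) -> np.ndarray:
      """sequence: protein of length L; pssm: L x 20 PSI-BLAST profile."""
      features = np.zeros((20, 20))
      for row_idx, residue in enumerate(AMINO_ACIDS):
          positions = [i for i, aa in enumerate(sequence) if aa == residue]
          if positions:
              # Average the 20 profile scores over positions with this residue.
              features[row_idx] = pssm[positions].mean(axis=0)
      return features.ravel()  # flatten to a 400-dimensional vector

  # Toy usage with random numbers standing in for a real profile.
  rng = np.random.default_rng(0)
  seq = "".join(rng.choice(list(AMINO_ACIDS), size=60))
  print(pssm_400(seq, rng.normal(size=(60, 20))).shape)  # (400,)
  ```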
-
The reliability of a method depends on the quality of the dataset used for training and evaluation.
Along these lines, it's important to make significant efforts to identify sources of bias in your training dataset and mitigate their potential impact on predictions. This is true for the training set, but it's similarly true for the test (referred to here as the validation) set - if the test set is biased or imbalanced with respect to some relevant biological feature, then the resultant prediction accuracies may not reflect true model performance.
What I would like to see explored more thoroughly here is whether there are taxonomic biases in the curated set of proteins used to train and test your model. If, for instance, some species/taxonomic groups are disproportionately represented in both your training and validation sets, this could lead to inflated prediction accuracies.
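  One way to probe this, sketched below under the assumption of a hypothetical per-protein taxon label array `taxa` (the dataset may record taxonomy differently), is to inspect the taxonomic composition and then use a group-aware split so no taxon contributes sequences to both training and test folds:

  ```python
  # Hedged sketch of a taxonomic-bias check plus a group-aware CV split.
  from collections import Counter
  import numpy as np
  from sklearn.model_selection import GroupKFold

  rng = np.random.default_rng(0)
  X = rng.random((120, 20))                       # placeholder features
  y = rng.integers(0, 2, 120)                     # placeholder labels
  taxa = rng.choice(["fish", "insect", "plant", "bacteria"], size=120)

  print(Counter(taxa))  # quick look at how taxa are distributed

  # GroupKFold keeps all proteins from one taxon within a single fold, so
  # each fold's performance is estimated on taxa unseen during training.
  for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=taxa):
      assert set(taxa[train_idx]).isdisjoint(taxa[test_idx])
  ```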
-
two datasets, the main and the validation
This is a bit confusing. Common terminology would be to call the former the training dataset (subsequently subdivided into K folds for cross-validation) and the latter the test set, rather than the validation set as you've done here.
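  For clarity, the convention described above looks like the following sketch (with hypothetical `X`, `y` placeholders): a test set is held out first, K-fold cross-validation runs only on the training portion, and the held-out test set is touched once at the end:

  ```python
  # Hedged sketch of the standard train/CV/test workflow.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score, train_test_split

  rng = np.random.default_rng(0)
  X, y = rng.random((200, 20)), rng.integers(0, 2, 200)

  # Hold out the test set before any model selection happens.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=0
  )

  model = RandomForestClassifier(n_estimators=200, random_state=0)

  # K-fold cross-validation on the training set only.
  cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
  print(f"5-fold CV AUROC on the training set: {cv_auc.mean():.3f}")

  # One-time evaluation on the held-out test set.
  model.fit(X_train, y_train)
  print(f"Held-out test accuracy: {model.score(X_test, y_test):.3f}")
  ```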
-
The major limitations of the existing methods is their dataset, as these methods have been evaluated on unreviewed data.
Can you elaborate on this? As written, it's not clear whether all of the studies listed above evaluated their methods on unreviewed data and, for those that did, what the sources of that unreviewed data were.