Prediction of anti-freezing proteins from their evolutionary profile

Abstract

Prediction of antifreeze proteins (AFPs) holds significant importance due to their diverse applications in healthcare. An inherent limitation of current AFP prediction methods is their reliance on unreviewed proteins for evaluation. This study evaluates the proposed and existing methods on an independent dataset containing 81 AFPs and 73 non-AFPs obtained from UniProt, all of which have already been reviewed by experts. Initially, we constructed machine learning models for AFP prediction using selected composition-based protein features and achieved a peak AUC of 0.90 with an MCC of 0.69 on the independent dataset. Subsequently, we observed a notable enhancement in model performance, with the AUC increasing from 0.90 to 0.93, upon incorporating evolutionary information instead of relying solely on the primary sequence of proteins. Furthermore, we explored hybrid models integrating our machine learning approaches with BLAST-based similarity and motif-based methods. However, the performance of these hybrid models either matched or was inferior to that of our best machine learning model. Our best model, based on evolutionary information, outperforms all existing methods on the independent/validation dataset. To facilitate users, we developed a user-friendly web server and standalone package named “AFPropred” ( https://webs.iiitd.edu.in/raghava/afpropred ).
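
For illustration, composition-based features of the kind described above can be derived directly from the primary sequence. The sketch below is a minimal example assuming scikit-learn and a generic tree-ensemble classifier, evaluated with AUC and MCC; it is illustrative only and is not the AFPropred implementation (the actual feature selection and classifier may differ).

```python
# Illustrative sketch (not the AFPropred code): amino acid composition features
# plus a tree-ensemble classifier, evaluated with AUC and MCC.
from collections import Counter

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> np.ndarray:
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq.upper())
    return np.array([counts.get(aa, 0) / max(len(seq), 1) for aa in AMINO_ACIDS])

def evaluate(train_seqs, y_train, test_seqs, y_test):
    """Train on the main dataset, report AUC and MCC on the independent dataset."""
    X_train = np.vstack([aa_composition(s) for s in train_seqs])
    X_test = np.vstack([aa_composition(s) for s in test_seqs])
    clf = ExtraTreesClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
    prob = clf.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)
    return roc_auc_score(y_test, prob), matthews_corrcoef(y_test, pred)
```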

Highlights

  • Prediction of antifreeze proteins with high precision

  • Evaluation of prediction models on an independent dataset

  • Machine learning-based models using sequence composition

  • Evolutionary information-based prediction models

  • A web server for predicting, scanning, and designing AFPs

Author’s Biography

  • Nishant Kumar is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

  • Shubham Choudhury is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

  • Nisha Bajiya is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

  • Sumeet Patiyal is currently working as a postdoctoral visiting fellow at the Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.

  • Gajendra P. S. Raghava is currently working as Professor and Head of the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

Article activity feed

    1. Once the model was evaluated, we chose our top-performing model for further analysis, in which we integrated the evolutionary features with composition-based features and the ML score with the BLAST score and named the hybrid methods

      It's a bit confusing to me why you would carry out your model selection procedure without using the intended feature set. My worry would be that certain models might perform better or worse with different feature types, and that you might be missing that here. Could you elaborate on this? (A sketch of the kind of score-level ML+BLAST fusion described in the quoted passage appears after this feed.)

    2. Feature selection techniques

      Did you also consider exploring the use of regularization? I would suggest looking into L1 or L2 regularization to reduce the contribution of some of your features and mitigate the potential for overfitting (see the regularization sketch after this feed).

    3. The reliability of a method depends on the quality of the dataset used for training and evaluation.

      Along these lines, it's important to make significant efforts to identify sources of bias in your training dataset and mitigate their potential impact on predictions. This is true for the training set, but it's similarly true for the test (referred to here as the validation) set: if the test set is biased or imbalanced with respect to some relevant biological feature, then the resultant prediction accuracies may not reflect true model performance.

      What I would like to see explored more thoroughly here is whether there are taxonomic biases in the curated set of proteins used to train and test your model. If, for instance, some species or taxonomic groups are disproportionately represented in both your training and validation sets, that could lead to inflated prediction accuracies (a group-aware cross-validation sketch illustrating one way to check this appears after this feed).

    4. two datasets, the main and the validation

      This is a bit confusing; the more common terminology would be to refer to the former as the training dataset (subsequently subdivided into K folds for cross-validation) and the latter as the test set, rather than the validation set as you've done here (see the split and cross-validation sketch after this feed).

    5. The major limitations of the existing methods is their dataset, as these methods have been evaluated on unreviewed data.

      Could you elaborate on this? As written, it's not clear whether all of the studies listed above evaluated their methods on unreviewed data, and, for those that did, what the sources of the unreviewed data were.
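
A minimal sketch of the score-level ML+BLAST fusion referred to in comment 1, assuming the BLAST search against known AFPs has already been reduced to a hit/no-hit signal; the weighting scheme and the hybrid_score helper are hypothetical and are not the AFPropred implementation.

```python
# Hypothetical score-level hybrid (illustration only, not the AFPropred code):
# combine a classifier probability with a BLAST-derived similarity signal.
def hybrid_score(ml_prob: float, blast_hit: bool, blast_weight: float = 0.5) -> float:
    """Add a fixed bonus/penalty from a BLAST hit against known AFPs
    to the machine-learning probability, then clip the result to [0, 1]."""
    blast_term = blast_weight if blast_hit else -blast_weight
    return min(1.0, max(0.0, ml_prob + blast_term))

# A borderline ML score is pushed above a 0.5 decision threshold
# when the query has a significant BLAST hit to a known AFP.
print(hybrid_score(0.45, blast_hit=True))   # 0.95
print(hybrid_score(0.45, blast_hit=False))  # 0.0
```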
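
For the regularization suggestion in comment 2, a minimal sketch using L1- and L2-penalized logistic regression, assuming scikit-learn and standardized input features; the penalty strength C=0.5 is arbitrary and would need tuning.

```python
# Sketch of L1/L2 regularization (assumes scikit-learn; C value is arbitrary).
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1 drives some coefficients to exactly zero (implicit feature selection);
# L2 shrinks all coefficients, which typically reduces overfitting.
l1_model = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
l2_model = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l2", solver="lbfgs", C=0.5))
```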
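
For the taxonomic-bias concern in comment 3, one way to probe for leakage is group-aware cross-validation, in which all proteins from a given taxon are kept in the same fold. The sketch below assumes scikit-learn and a per-protein taxon label, which may not be readily available for every entry; it is an illustration, not part of the published method.

```python
# Group-aware cross-validation: no taxon appears in both the training
# and the held-out fold, so inflated scores from taxonomic overlap are avoided.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

def taxon_aware_auc(X, y, taxa, n_splits=5):
    """Per-fold AUC with proteins grouped by their source taxon."""
    cv = GroupKFold(n_splits=n_splits)
    clf = ExtraTreesClassifier(n_estimators=500, random_state=0)
    return cross_val_score(clf, X, y, groups=taxa, cv=cv, scoring="roc_auc")
```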
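
Finally, the terminology suggested in comment 4 maps onto a standard split: a single held-out test set, with K-fold cross-validation run only on the training portion. A minimal sketch, assuming scikit-learn and using placeholder data in place of the actual feature matrix:

```python
# Training/test split with K-fold cross-validation confined to the training set.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.random((154, 20))           # placeholder feature matrix (e.g. composition)
y = rng.integers(0, 2, size=154)    # placeholder AFP / non-AFP labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    pass  # fit on fit_idx, score on val_idx (the per-fold "validation" data)

# X_test / y_test are held back and used only once, for the final reported numbers.
```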