AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors

Abstract

Motivation

Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation can not only result in a more accessible method, but also in more robust and interpretable evaluation, leading to more trustworthy models.

Results

We present a new automated method for drawing negative peptides that achieves a better balance between specificity and generalization than current alternatives. We study the effect of homology-based partitioning for generating the training and testing data subsets and demonstrate that model performance is overestimated when no such homology correction is used, which indicates that prior studies may have overstated how well their models generalize to new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or training algorithms. Finally, we demonstrate that an ensemble of optimized traditional machine learning algorithms can compete with more complex neural network models while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool that allows researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes.

Availability and implementation

Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML and a dedicated web-server at http://peptide.ucd.ie/AutoPeptideML. A static version of the software to ensure the reproduction of the results is available at https://zenodo.org/records/13363975.

Article activity feed

  1. The idea is that peptide 3D structures and their dynamics are less constrained than those of proteins (at least for non-cyclic peptides), in the sense that peptides are smaller and they don't have as many steric clashes as a larger protein would. Therefore, the evolutionary pressures on single-point mutations might not be the same as those governing protein evolution, and residues could switch according to different rules. However, this is just one possible explanation for the phenomenon we are seeing; it is perfectly possible that the biological effect is minimal compared to the computational effect of having a smaller attention window to exploit.

    It is difficult to find evolutionary analyses, as they tend to focus on specific bioactivities like cellular signalling, which might have additional evolutionary pressures. Regarding this idea of structural freedom, one interesting reference is: "London N, Movshovitz-Attias D, Schueler-Furman O. The structural basis of peptide-protein binding strategies. Structure. 2010 Feb 10;18(2):188-99. doi: 10.1016/j.str.2009.11.012. PMID: 20159464." They show that protein-peptide interactions occur more often through H-bonds with atoms in the backbone of the peptide than protein-protein interactions do. Admittedly, the dataset is quite sparse.

  2. For the time being, all classifiers are binary. If you want a multiclass classifier, say with 5 different classes (A, B, C, D, E), you can train 5 binary classifiers where the positives are each of the classes and the negatives are automatically generated. This would be the way to go for, say, the metagenome analysis, where you are interested in predicting against the baseline of not bioactive or broadly bioactive. If you want to distinguish only between those five classes, you can instead prepare 5 datasets where, in dataset 1, A is the positive class and B, C, D, and E are the negatives; in dataset 2, B is positive and A, C, D, and E are negative; and so on (see the sketch below). If we see that there is a need for this functionality, we will expand the code to offer the option to perform either of these operations automatically.
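
    As an illustration of the second scheme (not part of the tool itself; the file and column names are only placeholders), the one-vs-rest datasets could be prepared with a short script along these lines:

    ```python
    import pandas as pd

    # Placeholder input: one peptide per row, with columns "sequence" and
    # "class", where class is one of A-E.
    df = pd.read_csv("peptides.csv")

    for target in ["A", "B", "C", "D", "E"]:
        one_vs_rest = pd.DataFrame({
            "sequence": df["sequence"],
            # 1 for the class of interest, 0 for the remaining four classes
            "label": (df["class"] == target).astype(int),
        })
        one_vs_rest.to_csv(f"dataset_{target}_vs_rest.csv", index=False)
    ```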

  3. Right now it is not supported. There are workarounds that a user could do, but they would require some time. We'll try to add that option to the API in the coming weeks. The reason we haven't done it yet is that it has some theoretical drawbacks that an inexperienced user might not be aware of.

    The theoretical problem that strategy poses is that you lose a bit of control over what the model is learning. Right now, with our negatives, you know that it is learning to distinguish between your specific bioactivity and general bioactivity-like peptides. However, the moment you introduce custom negatives, you are muddying that definition somewhat.

  4. The short answer would be that the smallest dataset we have tried has 119 positive samples, and I think you should not try anything below that number, as even the evaluation is not going to be all that reliable.

    The slightly longer answer is that it depends on the difficulty of the classification task: the harder it is to separate positives from negatives, the more data is required, and the opposite is also true. However, if the task is easy enough that you can obtain a decent ML model with, say, 50-100 samples, then chances are that there is a simpler analytical solution (like a motif or a sequence-logo visualisation) that would be more informative than the ML model. All of this is obviously a qualitative assessment; I'm not entirely sure whether there is a way to provide a more formal estimate of the number of samples necessary.

  5. Secondly, the two general purpose predictors, one of which is based on a convolutional neural network (UniDL4BioPep-A), and the other on an ensemble of three simpler ML models (AutoPeptideML), both have comparable performance.

    I'll post this on GitHub as well, but are these models available for use? I know I could use UniDL4BioPep and their models, but I like your tool and would love to be able to use the models you built!

  6. The biological hypothesis is that peptide sequences evolve with fewer constraints compared to protein sequences.

    It would be interesting to unpack this sentence a bit more in the discussion. What do you mean by this, and can you provide any references for this?

  7. The self-reported values for the handcrafted models referenced in Table 1 are included with the evaluation in the original set of benchmarks to contextualise the contributions of both general purpose frameworks.

    The way this is plotted is really confusing. I would suggest showing the handcrafted models as a vertical reference line or a dot instead of as their own bars. The reference-line approach would make it clear that you're not re-running those models the way you are the other two.

  8. Step 6 - Model Evaluation: The ensemble obtained in the previous step is evaluated against the hold-out evaluation set using a wide range of metrics that include accuracy, balanced accuracy, weighted precision, precision, F1, weighted F1, recall, weighted recall, area under the receiver-operating characteristic curve, Matthews correlation coefficient (MCC), Jaccard similarity, and weighted Jaccard similarity, as implemented in scikit-learn [52]. The plots generated include the calibration curve, confusion matrix, precision-recall curve, and receiver-operating characteristic curve, as implemented in scikit-plot [60].

    Step 7 - Prediction: AutoPeptideML can predict the bioactivity of new samples given a pre-trained model generated in Step 5. Predictions are a score within the range [0, 1]. This result can be interpreted as the probability of the peptide sequence having the target bioactivity, given the predictive model (P(x ∈ + | MODEL)). This step outputs a CSV file with the peptides sorted according to their predicted bioactivity probability.
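
    As a rough sketch of what that metric report involves (illustrative only, not the tool's own evaluation code; the labels and scores below are placeholders), the listed metrics can be computed directly with scikit-learn:

    ```python
    import numpy as np
    from sklearn import metrics

    # Placeholder hold-out labels and predicted positive-class probabilities.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
    y_pred = (y_score >= 0.5).astype(int)  # threshold the [0, 1] scores

    report = {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "balanced accuracy": metrics.balanced_accuracy_score(y_true, y_pred),
        "weighted precision": metrics.precision_score(y_true, y_pred, average="weighted"),
        "F1": metrics.f1_score(y_true, y_pred),
        "recall": metrics.recall_score(y_true, y_pred),
        "AUROC": metrics.roc_auc_score(y_true, y_score),
        "MCC": metrics.matthews_corrcoef(y_true, y_pred),
        "Jaccard": metrics.jaccard_score(y_true, y_pred),
    }
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
    ```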

    This is sort of a dumb question at this point, but are all classifiers explicitly binary? How would you recommend a user who is interested in many classes of peptide bioactivity (or who plans to, say, predict all peptides in a metagenome and then predict their bioactivities) go about assessing all classes of bioactivity?

  9. To avoid introducing false negative peptides into the negative subset, the algorithm accepts an optional input containing a list of bioactivity tags that the user considers may overlap with the bioactivity of interest and should, therefore, be excluded.

    This is nice :)

  10. If both positive and negative peptides are provided, the program balances the classes by oversampling the underrepresented class and continues to Step 3; if negative peptides are not provided, it executes Step 2 to build the negative set.
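
    For context, balancing by oversampling amounts to randomly resampling the smaller class with replacement until the class sizes match; a minimal sketch (illustrative, not the tool's actual implementation) might look like this:

    ```python
    from sklearn.utils import resample

    # Placeholder peptide lists; here the negatives outnumber the positives.
    positives = ["ACDEFGHIK", "LMNPQRSTV", "WYACDEFGH"]
    negatives = ["KIHGFEDCA", "VTSRQPNML", "HGFEDCAYW", "AAAAKKKKW", "CCDDEEFFG"]

    # Oversample the smaller class (with replacement) up to the size of the larger one.
    if len(positives) < len(negatives):
        positives = resample(positives, replace=True,
                             n_samples=len(negatives), random_state=42)
    else:
        negatives = resample(negatives, replace=True,
                             n_samples=len(positives), random_state=42)

    print(len(positives), len(negatives))  # classes are now balanced
    ```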

    Would it be possible for a user to provide some negative peptides, but then to also have those negative peptides supplemented by the ones generated in Step 2? This might be nice for users who have small input datasets, if it's possible.

  11. AutoPeptideML only requires a dataset of peptides known to be positive for the bioactivity of interest

    It would be super helpful if you could provide estimates of the minimum number of required sequences, or any other information that helps a user build intuition for how many sequences they need to provide.