AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors

Abstract

Motivation

Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation can not only result in a more accessible method, but also in more robust and interpretable evaluation, leading to more trustworthy models.

Results

We present a new automated method for drawing negative peptides that achieves a better balance between specificity and generalization than current alternatives. We study the effect of homology-based partitioning for generating the training and testing data subsets and demonstrate that model performance is overestimated when no such homology correction is used, which indicates that prior studies may have overstated how well their models generalize to new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or training algorithms. Finally, we demonstrate that an ensemble of optimized traditional machine learning algorithms can compete with more complex neural network models while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool that allows researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes.

Availability and implementation

Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML and a dedicated web-server at http://peptide.ucd.ie/AutoPeptideML. A static version of the software to ensure the reproduction of the results is available at https://zenodo.org/records/13363975.

Article activity feed

  1. The idea is that peptide 3D structures and their dynamics are less constrained than those of proteins (at least for non-cyclic peptides), in the sense that peptides are smaller and they don't have as many steric clashes as a larger protein would. Therefore, the evolutionary pressures on single-point mutations might not be the same as those governing protein evolution, and residues could switch according to different rules. However, this is just one possible explanation for the phenomenon we are seeing; it is perfectly possible that the biological effect is minimal compared to the computational effect of having a smaller attention window to exploit.

    It is difficult to find evolutionary analyses, as they tend to focus on specific bioactivities like cellular signalling, which might have additional evolutionary pressures. Regarding this idea of structural freedom, one interesting reference is: "London N, Movshovitz-Attias D, Schueler-Furman O. The structural basis of peptide-protein binding strategies. Structure. 2010 Feb 10;18(2):188-99. doi: 10.1016/j.str.2009.11.012. PMID: 20159464." They show that protein-peptide interactions occur more often through H-bonds with atoms in the backbone of the peptide than protein-protein interactions do. Admittedly, the dataset is quite sparse.

  2. For the time being, all classifiers are binary. If you want a multiclass classifier, say with 5 different classes (A, B, C, D, E), you can train 5 binary classifiers where the positives are each of the classes and the negatives are automatically generated. This would be the way to go for, say, the metagenome analysis, where you are interested in predicting against the baseline of not bioactive or broadly bioactive. If you want to distinguish only between those five classes, you can instead prepare 5 datasets where, in dataset 1, A is the positive class and B, C, D, and E are the negatives; in dataset 2, B is positive and A, C, D, and E are negative; and so on (see the sketch below). If we see that there is a need for this functionality, we will expand the code to offer the option to perform either of these operations automatically.
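
    As an illustration of the second scheme (not part of the tool itself; the file and column names are only placeholders), the one-vs-rest datasets could be prepared with a short script along these lines:

    ```python
    import pandas as pd

    # Placeholder input: one peptide per row, with columns "sequence" and
    # "class", where class is one of A-E.
    df = pd.read_csv("peptides.csv")

    for target in ["A", "B", "C", "D", "E"]:
        one_vs_rest = pd.DataFrame({
            "sequence": df["sequence"],
            # 1 for the class of interest, 0 for the remaining four classes
            "label": (df["class"] == target).astype(int),
        })
        one_vs_rest.to_csv(f"dataset_{target}_vs_rest.csv", index=False)
    ```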

  3. Right now it is not supported. There are workarounds that a user could do, but they would require some time. We'll try to add that option to the API in the coming weeks. The reason we haven't done it yet is that it has some theoretical drawbacks that an inexperienced user might not be aware of.

    The theoretical problem that strategy poses is that you lose a bit of control over what the model is learning. Right now, with our negatives, you know that it is learning to distinguish between your specific bioactivity and general bioactivity-like peptides. However, the moment you introduce custom negatives, you are muddying that definition somewhat.

  4. The short answer would be that the smallest dataset we have tried has 119 positive samples, and I think you should not try anything below that number, as even the evaluation is not going to be all that reliable.

    The slightly longer answer is that it depends on the difficulty of the classification task: the harder it is to separate positives from negatives, the more data is required, and the opposite is also true. However, if the task is easy enough that you can obtain a decent ML model with, say, 50-100 samples, then chances are that there is a simpler analytical solution (like a motif or a sequence-logo visualisation) that would be more informative than the ML model. All of this is obviously a qualitative assessment; I'm not entirely sure whether there is a way to provide a more formal estimate of the number of samples necessary.

  5. Secondly, the two general purpose predictors, one of which is based on a convolutional neural network (UniDL4BioPep-A), and the other on an ensemble of three simpler ML models (AutoPeptideML), both have comparable performance.

    I'll post this on GitHub as well, but are these models available for use? I know I could use UniDL4BioPep and their models, but I like your tool and would love to be able to use the models you built!

  6. The biological hypothesis is that peptide sequences evolve with fewer constraints compared to protein sequences.

    It would be interesting to unpack this sentence a bit more in the discussion. What do you mean by this, and can you provide any references for this?

  7. The self-reported values for the handcrafted models referenced in Table 1 are included with the evaluation in the original set of benchmarks to contextualise the contributions of both general purpose frameworks.

    The way this is plotted is really confusing. I would suggest showing the handcrafted models as a vertical reference line or a dot instead of as their own bars. The reference-line approach would make it clear that you're not re-running those models the way you are the other two.

  8. Step 6 - Model Evaluation: The ensemble obtained in the previous step is evaluated against the hold-out evaluation set using a wide range of metrics that include accuracy, balanced accuracy, weighted precision, precision, F1, weighted F1, recall, weighted recall, area under the receiver-operating characteristic curve, Matthews correlation coefficient (MCC), Jaccard similarity, and weighted Jaccard similarity, as implemented in scikit-learn [52]. The plots generated include the calibration curve, confusion matrix, precision-recall curve, and receiver-operating characteristic curve, as implemented in scikit-plot [60].

    Step 7 - Prediction: AutoPeptideML can predict the bioactivity of new samples given a pre-trained model generated in Step 5. Predictions are a score within the range [0, 1]. This result can be interpreted as the probability of the peptide sequence having the target bioactivity, given the predictive model (P(x ∈ + | MODEL)). This step outputs a CSV file with the peptides sorted according to their predicted bioactivity probability.
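
    As a rough sketch of what that metric report involves (illustrative only, not the tool's own evaluation code; the labels and scores below are placeholders), the listed metrics can be computed directly with scikit-learn:

    ```python
    import numpy as np
    from sklearn import metrics

    # Placeholder hold-out labels and predicted positive-class probabilities.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
    y_pred = (y_score >= 0.5).astype(int)  # threshold the [0, 1] scores

    report = {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "balanced accuracy": metrics.balanced_accuracy_score(y_true, y_pred),
        "weighted precision": metrics.precision_score(y_true, y_pred, average="weighted"),
        "F1": metrics.f1_score(y_true, y_pred),
        "recall": metrics.recall_score(y_true, y_pred),
        "AUROC": metrics.roc_auc_score(y_true, y_score),
        "MCC": metrics.matthews_corrcoef(y_true, y_pred),
        "Jaccard": metrics.jaccard_score(y_true, y_pred),
    }
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
    ```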

    This is sort of a dumb question at this point, but are all classifiers explicitly binary? How would you recommend a user who is interested in many classes of peptide bioactivity (or who plans to, say, predict all peptides in a metagenome and then predict their bioactivities) go about assessing all classes of bioactivity?

  9. To avoid introducing false negative peptides into the negative subset, the algorithm accepts an optional input containing a list of bioactivity tags that the user considers may overlap with the bioactivity of interest and should, therefore, be excluded.

    This is nice :)

  10. If both positive and negative peptides are provided, the program balances the classes by oversampling the underrepresented class and continues to Step 3; if negative peptides are not provided, it executes Step 2 to build the negative set.
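
    For context, balancing by oversampling amounts to randomly resampling the smaller class with replacement until the class sizes match; a minimal sketch (illustrative, not the tool's actual implementation) might look like this:

    ```python
    from sklearn.utils import resample

    # Placeholder peptide lists; here the negatives outnumber the positives.
    positives = ["ACDEFGHIK", "LMNPQRSTV", "WYACDEFGH"]
    negatives = ["KIHGFEDCA", "VTSRQPNML", "HGFEDCAYW", "AAAAKKKKW", "CCDDEEFFG"]

    # Oversample the smaller class (with replacement) up to the size of the larger one.
    if len(positives) < len(negatives):
        positives = resample(positives, replace=True,
                             n_samples=len(negatives), random_state=42)
    else:
        negatives = resample(negatives, replace=True,
                             n_samples=len(positives), random_state=42)

    print(len(positives), len(negatives))  # classes are now balanced
    ```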

    Would it be possible for a user to provide some negative peptides, but then to also have those negative peptides supplemented by the ones generated in Step 2? This might be nice for users who have small input datasets, if it's possible.

  11. AutoPeptideML only requires a dataset of peptides known to be positive for the bioactivity of interest

    It would be super helpful if you could provide estimates of the minimum number of required sequences, or any other information that helps a user build intuition for how many sequences they need to provide.