AutoPeptideML: A study on how to build more trustworthy peptide bioactivity predictors


Abstract

Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build trustworthy models. We considered the effect of different design choices in the development of peptide bioactivity binary predictors and found that the choice of negative peptides and the use of homology-based partitioning strategies when constructing the evaluation set have a significant impact on perceived model performance, providing a more realistic estimate of how the model will perform when exposed to new data. We also show that using protein language models to generate peptide representations can both simplify computational pipelines and improve model performance, and that state-of-the-art protein language models perform similarly regardless of size or architecture. Finally, we integrate these results into an easy-to-use AutoML tool to support the development of new, robust predictive models of peptide bioactivity by biologists without strong machine learning expertise. Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML and a dedicated web-server at http://peptide.ucd.ie/AutoPeptideML .

Article activity feed

  1. The idea is that peptide 3D structures and their dynamics are less constrained than those of proteins (at least for non-cyclic peptides), in the sense that peptides are smaller and do not have as many steric clashes as a larger protein would. Therefore, the evolutionary pressures on single-point mutations might not be the same as those governing protein evolution, and residues could be substituted according to different rules. However, this is just one possible explanation for the phenomenon we are seeing; it is perfectly possible that the biological effect is minimal compared to the computational effect of having a smaller attention window to exploit.

    It is difficult to find evolutionary analyses, as they tend to focus on specific bioactivities like cellular signalling, which might be subject to additional evolutionary pressures. Regarding this idea of structural freedom, one interesting study is: "London N, Movshovitz-Attias D, Schueler-Furman O. The structural basis of peptide-protein binding strategies. Structure. 2010 Feb 10;18(2):188-99. doi: 10.1016/j.str.2009.11.012. PMID: 20159464." They show that protein-peptide interactions occur more often through H-bonds with atoms in the backbone of the peptide than protein-protein interactions do. Admittedly, the dataset is quite sparse.

  2. For the time being, all classifiers are binary. If you want a multiclass classifier with, say, 5 different classes (A, B, C, D, E), you can train 5 binary classifiers where the positives are each of the classes in turn and the negatives are automatically generated. This would be the way to go for, say, the metagenome analysis, where you are interested in predicting against a baseline of non-bioactive or broadly bioactive peptides. If you want to distinguish only between those five classes, you can instead prepare 5 datasets where, in dataset 1, A is the positive and classes B, C, D, and E are the negatives; in dataset 2, B is the positive and A, C, D, and E the negatives; and so on. If we see that there is a need for this functionality, we will expand the code to offer the option to perform either of these operations automatically.
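The one-vs-rest strategy described above can be sketched with scikit-learn. This is a minimal illustration with random features and hypothetical class labels; AutoPeptideML itself would supply the peptide representations and negative sets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical peptide feature matrix X and multiclass labels y over {A..E}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = np.array(list("ABCDE") * 20)

# One binary classifier per class: that class is the positive set,
# the remaining four classes are the negatives.
classifiers = {}
for cls in "ABCDE":
    y_bin = (y == cls).astype(int)
    classifiers[cls] = LogisticRegression(max_iter=1000).fit(X, y_bin)

# To assign a class, pick the classifier with the highest positive score.
scores = {cls: clf.predict_proba(X)[:, 1] for cls, clf in classifiers.items()}
pred = [max(scores, key=lambda c: scores[c][i]) for i in range(len(X))]
```

The same loop works for the "positives vs. auto-generated negatives" variant; only the construction of `y_bin` changes.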

  3. Right now it is not supported. There are workarounds a user could apply, but they would take some time. We'll try to add that option to the API in the coming weeks. The reason we haven't done it yet is that it has some theoretical drawbacks that an inexperienced user might not be aware of.

    The theoretical problem that strategy poses is that you lose some control over what the model is learning. Right now, with our negatives, you know that it is learning to distinguish between your specific bioactivity and general bioactivity-like peptides. However, the moment you introduce custom negatives you are muddying that definition somewhat.

  4. The short answer is that the smallest dataset we have tried has 119 positive samples, and I think you should not try anything below that number, as even the evaluation is not going to be all that reliable.

    The slightly longer answer is that it depends on the difficulty of the classification task: the harder it is to separate positives from negatives, the more data is required, and vice versa. However, if the task is easy enough that you can obtain a decent ML model with, say, 50-100 samples, then chances are there is a simpler analytical solution (like a motif or a sequence-logo visualisation) that would be more informative than the ML model. All of this is obviously a qualitative assessment; I'm not entirely sure there is a way to provide a more formal estimate of the number of samples necessary.

  5. Secondly, the two general-purpose predictors, one of which is based on a convolutional neural network (UniDL4BioPep-A) and the other on an ensemble of three simpler ML models (AutoPeptideML), have comparable performance.

    I'll post this on GitHub as well, but are these models available for use? I know I could use UniDL4BioPep and their models, but I like your tool and would love to be able to use the models you built!

  6. The biological hypothesis is that peptide sequences evolve with fewer constraints compared to protein sequences.

    It would be interesting to unpack this sentence a bit more in the discussion. What do you mean by this, and can you provide any references for this?

  7. The self-reported values for the handcrafted models referenced in Table 1 are included alongside the evaluation on the original set of benchmarks to contextualise the contributions of both general-purpose frameworks.

    The way this is plotted is really confusing. I would suggest showing the handcrafted models as a vertical line or a dot instead of their own bar. The reference-line approach would make it clear that you're not re-running those models the way you are the other two.

  8. Step 6 - Model Evaluation

    The ensemble obtained in the previous step is evaluated against the hold-out evaluation set using a wide range of metrics that include accuracy, balanced accuracy, precision, weighted precision, F1, weighted F1, recall, weighted recall, area under the receiver-operating characteristic curve (AUROC), Matthews correlation coefficient (MCC), Jaccard similarity, and weighted Jaccard similarity, as implemented in scikit-learn [52]. The plots generated include the calibration curve, confusion matrix, precision-recall curve, and receiver-operating characteristic curve, as implemented in scikit-plot [60].

    Step 7 - Prediction

    AutoPeptideML can predict the bioactivity of new samples given a pre-trained model generated in Step 5. Predictions are scores within the range [0, 1]. A score can be interpreted as the probability of the peptide sequence having the target bioactivity, given the predictive model, P(+ | x, MODEL). This step outputs a CSV file with the peptides sorted according to their predicted bioactivity probability.

    This is sort of a dumb question at this point, but are all classifiers explicitly binary? How would you recommend a user who is interested in many classes of peptide bioactivity (or who plans to say, predict all peptides in a metagenome and then predict bioactivities) go about assessing all classes of bioactivity?
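The scikit-learn portion of the evaluation quoted in Step 6 can be sketched as follows. This is a minimal illustration with hypothetical labels and scores, not AutoPeptideML's actual code:

```python
import numpy as np
from sklearn import metrics

# Hypothetical hold-out labels and ensemble scores in [0, 1]
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

report = {
    "accuracy": metrics.accuracy_score(y_true, y_pred),
    "balanced_accuracy": metrics.balanced_accuracy_score(y_true, y_pred),
    "precision": metrics.precision_score(y_true, y_pred),
    "weighted_f1": metrics.f1_score(y_true, y_pred, average="weighted"),
    "recall": metrics.recall_score(y_true, y_pred),
    "mcc": metrics.matthews_corrcoef(y_true, y_pred),
    "auroc": metrics.roc_auc_score(y_true, y_score),  # uses raw scores
    "jaccard": metrics.jaccard_score(y_true, y_pred),
}
```

Note that threshold-based metrics (accuracy, F1, MCC, Jaccard) take the binarised predictions, while AUROC takes the raw scores.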

  9. To avoid introducing false negative peptides into the negative subset, the algorithm accepts an optional input containing a list of bioactivity tags that the user considers may overlap with the bioactivity of interest and should, therefore, be excluded.

    This is nice :)

  10. If both positive and negative peptides are provided, the program balances the classes by oversampling the underrepresented class and continues to Step 3; if negative peptides are not provided, it executes Step 2 to build the negative set.

    Would it be possible for a user to provide some negative peptides, but then to also have those negative peptides supplemented by those generated in Step 2? This might be nice for users that have small input data sets if it's possible.
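For reference, the class balancing described in the quoted step could look roughly like the sketch below. This is a simplified illustration, not the actual AutoPeptideML implementation; `balance_by_oversampling` is a hypothetical helper:

```python
import numpy as np

def balance_by_oversampling(pos, neg, seed=0):
    """Resample the underrepresented class (with replacement) until both
    classes contain the same number of peptides."""
    rng = np.random.default_rng(seed)
    pos, neg = list(pos), list(neg)
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw extra samples from the smaller class to match the larger one
    extra = rng.choice(small, size=len(large) - len(small), replace=True)
    small.extend(extra)
    return pos, neg

positives = ["ACDK", "GHIL"]
negatives = ["MNPQ", "RSTV", "WYAC", "DEFG"]
pos_bal, neg_bal = balance_by_oversampling(positives, negatives)
# pos_bal now has 4 entries, all drawn from the original positives
```

Supplementing user-provided negatives with auto-generated ones, as the comment suggests, would slot in before this balancing step.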

  11. AutoPeptideML only requires a dataset of peptides known to be positive for the bioactivity of interest

    It would be super helpful if you could provide estimates for the minimum number of required sequences or any other information you can add to help a user build intuition for how many sequences they need to provide