Classification of bioactive peptides: a comparative analysis of models and encodings

Abstract

Bioactive peptides are short amino acid chains possessing biological activity and exerting specific physiological effects relevant to human health, which are increasingly produced through fermentation due to their therapeutic roles. One of the main open problems related to biopeptides remains the determination of their functional potential, which still mainly relies on time-consuming in vivo tests. While bioinformatic tools for the identification of bioactive peptides are available, they are focused on specific functional classes and have not been systematically tested on realistic settings. To tackle this problem, bioactive peptide sequences and functions were collected from a variety of databases to generate a comprehensive collection of bioactive peptides from microbial fermentation. This collection was organized into nine functional classes including some previously studied and some newly defined such as immunomodulatory, opioid and cardiovascular peptides. Upon assessing their native sequence properties, four alternative encoding methods were tested in combination with a multitude of machine learning algorithms, from basic classifiers like logistic regression to advanced algorithms like BERT. By testing a total set of 171 models, it was found that, while some functions are intrinsically easier to detect, no single combination of classifiers and encoders worked universally well for all the classes. For this reason, we unified all the best individual models for each class and generated CICERON (Classification of bIoaCtive pEptides fRom micrObial fermeNtation), a classification tool for the functional classification of peptides. State-of-the-art classifiers were found to underperform on our benchmark dataset compared to the models included in CICERON. Altogether, our work provides a tool for real-world peptide classification and can serve as a benchmark for future model development.

Article activity feed

  1. To compare the amino acid usage in the functional classes, single amino acid, dipeptide, and tripeptide frequencies were plotted (Figure 2). The amino acid frequency plot in Figure 2A reveals that some BPs classes have distinct characteristics. For example, celiac disease BPs have the highest frequency of proline and glutamine, opioid peptides are enriched in tyrosine and glycine, while cardiovascular BPs have slightly higher frequencies of alanine and the highest frequency of the negatively charged amino acids aspartic acid and glutamic acid.

    I'm curious whether degenerate amino acid alphabets (Dayhoff encoding, hydrophobic-polar, etc.) would further improve classification accuracy or reveal interesting patterns.

  2. After filtering the sequences, and merging functionally overlapping or related classes, the final database consisted of 3990 BPs divided into nine different functional groups (Table 1).

    It would be nice to see how many peptides were dropped at each stage of filtering. Since so much data was dropped, I'm also curious whether the model would generalize better if that data were included somehow.

  3. “antihypertensive”, “ACE-inhibitory” and “Renin-inhibitory” as Antihypertensive; “DPP-IV inhibitors” and “alpha-glucosidase inhibitors” as Antidiabetic; “antimicrobial”, “antifungal”, “antibacterial” and “anticancer” as Antimicrobial; “antithrombotic”, “CaMKII Inhibitor” as Cardiovascular with positive effects on vascular circulation; “Antiamnestic”, “anxiolytic-like”, “AChE inhibitors”, “PEP-inhibitory” and “neuropeptides” as Neuropeptides.

    I'm curious whether you tried training without these groupings, i.e., how dissimilar are the peptides that were placed into the combined groups, and could the model have done well without the groupings?

  4. Peptides with identical sequences but different functional class assignments were removed to avoid introducing potential biases in the classifier’s training.

    How often does this occur?

  5. The final result, CICERON, consists of nine different binary classifiers capable of identifying the products of microbial fermentation-derived BPs.

    Did you try framing this as a multi-class classification problem and find that binary classifiers performed better? Or was the underlying model limited to binary classification?

  6. Given the importance of BPs, there have been several attempts to create in-silico approaches to perform a preliminary assignment of the potential functional properties and facilitate the subsequent discovery and testing process in vivo [19–24]. These methods rely on several databases where peptides from various experiments have been collected and classified according to the BPs functional classes. Using the sequence properties of the peptides, such as amino acid composition, or the presence of sequence patterns of interest, peptides can be assigned to a functional class depending on the type of classifier used.

    I'm curious whether this task differs from determining whether an amino acid sequence of 2-50 residues is a bioactive peptide, or whether one must first know that the sequence is a peptide before applying these tools to assign it to a functional class.

  7. Classification of bIoaCtive pEptides fRom micrObial fermeNtation

    One question that this name, and the abstract in general, left me with is whether this method is extensible beyond microbial fermentation peptides. I will continue to read to find out, but I'm wondering if another sentence might be added to clarify this.

    I'm also curious if microbial fermentation peptides include all known classes of peptides, or if there are some functional classifications that might not be labelled by CICERON because CICERON has not seen them before. Again I will continue reading to hopefully find out!
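
Following up on comment 1, here is a minimal sketch of what recoding peptides into a degenerate alphabet could look like before recomputing the frequency plots. The Dayhoff table is the usual six-class grouping; the hydrophobic-polar split is one common convention and should be treated as an assumption. The function names `recode` and `kmer_frequencies` are hypothetical and are not taken from the CICERON code.

```python
from collections import Counter

# Six-class Dayhoff grouping; the one-letter class codes a-f are arbitrary labels.
DAYHOFF_GROUPS = {
    "a": "C",
    "b": "AGPST",
    "c": "DENQ",
    "d": "HKR",
    "e": "ILMV",
    "f": "FWY",
}
# One common hydrophobic-polar split (an assumption; other conventions exist).
HP_GROUPS = {"h": "AFGILMPVWY", "p": "NCSTDERHKQ"}


def build_table(groups):
    """Invert {class_code: residues} into {residue: class_code}."""
    return {aa: code for code, members in groups.items() for aa in members}


DAYHOFF = build_table(DAYHOFF_GROUPS)
HP = build_table(HP_GROUPS)


def recode(seq, table):
    """Map each residue to its class code; unknown residues become 'x'."""
    return "".join(table.get(aa, "x") for aa in seq.upper())


def kmer_frequencies(seq, k=1):
    """Relative frequency of each k-mer (k=1, 2, 3 for single, di-, tripeptides)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return {}
    counts = Counter(kmers)
    return {kmer: n / len(kmers) for kmer, n in counts.items()}


# Leu-enkephalin (YGGFL), an opioid peptide, used purely as an illustration.
dayhoff_seq = recode("YGGFL", DAYHOFF)
hp_seq = recode("YGGFL", HP)
dayhoff_freqs = kmer_frequencies(dayhoff_seq, k=1)
```

The recoded strings could then be fed to the same frequency plots or encoders as the native sequences, to check whether class-specific patterns become sharper or wash out under the reduced alphabet.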
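
Regarding comments 2 and 4, the number of sequences removed for carrying conflicting class labels could be reported with something like the sketch below. It assumes the collection is available as a list of `(sequence, functional_class)` pairs; that layout, the function names, and the toy records are all hypothetical and do not reflect the authors' actual pipeline or data.

```python
from collections import defaultdict


def find_conflicting_sequences(records):
    """Return {sequence: set_of_classes} for sequences assigned to more than one class."""
    classes_by_seq = defaultdict(set)
    for seq, functional_class in records:
        classes_by_seq[seq.upper()].add(functional_class)
    return {s: c for s, c in classes_by_seq.items() if len(c) > 1}


def drop_conflicts(records):
    """Remove every record whose sequence appears under more than one class."""
    conflicts = find_conflicting_sequences(records)
    kept = [(s, c) for s, c in records if s.upper() not in conflicts]
    return kept, conflicts


# Toy records, invented for illustration only.
toy_records = [
    ("AAA", "Antihypertensive"),
    ("AAA", "Antidiabetic"),
    ("GGG", "Opioid"),
    ("ggg", "Opioid"),
    ("KKK", "Antimicrobial"),
]

kept, conflicts = drop_conflicts(toy_records)
n_dropped = len(toy_records) - len(kept)
print(f"{len(conflicts)} conflicting sequence(s), {n_dropped} record(s) dropped")
```

Logging the same two counts after each filtering step would give the per-stage breakdown asked for in comment 2.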