Peptipedia v2.0: A peptide sequence database and user-friendly web platform. A major update

Abstract

In recent years, peptides have gained significant relevance due to their therapeutic properties. The surge in peptide production and synthesis has generated vast amounts of data, enabling the creation of comprehensive databases and information repositories. Advances in sequencing techniques and artificial intelligence have further accelerated the design of tailor-made peptides. However, leveraging these techniques requires versatile and continuously updated storage systems, along with tools that facilitate peptide research and the implementation of machine learning for predictive systems. This work introduces Peptipedia v2.0, one of the most comprehensive public repositories of peptides, supporting biotechnological research by simplifying peptide study and annotation. Peptipedia v2.0 has expanded its collection by over 45% with peptide sequences that have reported biological activities. The functional biological activity tree has been revised and enhanced, incorporating new categories such as cosmetic and dermatological activities, molecular binding, and anti-ageing properties. Utilizing protein language models and machine learning, more than 90 binary classification models have been trained, validated, and incorporated into Peptipedia v2.0. These models exhibit average sensitivities and specificities of 0.877 ± 0.0530 and 0.873 ± 0.054, respectively, facilitating the annotation of more than 3.6 million peptide sequences with unknown biological activities, also registered in Peptipedia v2.0. Additionally, Peptipedia v2.0 introduces description tools based on structural and ontological properties and user-friendly machine-learning tools to facilitate the application of machine-learning strategies to study peptide sequences. Peptipedia v2.0 is accessible under the Creative Commons CC BY-NC-ND 4.0 license at https://peptipedia.cl/ .

Article activity feed

  1. The database available in Peptipedia v2.0 was designed based on a relational schema. PostgreSQL manages all operations over the database. All queries performed to the database are managed through an application programming interface (API) implemented using the Flask framework version 2.3.1 and SQLAlchemy version 2.0.29.

    I tried to look at a specific peptide (e.g., https://app.peptipedia.cl/peptide/33572) and the page took a prohibitively long time to load (> 5 minutes). Is this fixable?

    The downloads tab was pleasantly fast, but I noticed that the downloaded files don't include the information that you quoted above. For example, for therapeutic peptides, I wanted to know which were labeled vs. predicted, the patent information, and really everything else, but none of this was in the download. How should we extract this information in bulk?
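As a rough illustration of the kind of bulk query the comment above asks about, here is a minimal sketch of a relational export that separates labeled from predicted activities. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and every table and column name (`peptide`, `peptide_activity`, `is_predicted`) is hypothetical; the actual Peptipedia schema and API are not described in the quoted text.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE peptide (
            id INTEGER PRIMARY KEY,
            sequence TEXT NOT NULL
        );
        CREATE TABLE peptide_activity (
            peptide_id INTEGER NOT NULL REFERENCES peptide(id),
            activity TEXT NOT NULL,
            is_predicted INTEGER NOT NULL  -- 0 = reported label, 1 = model prediction
        );
        """
    )
    # Toy rows, made up for the example only.
    conn.executemany(
        "INSERT INTO peptide VALUES (?, ?)",
        [(1, "GIGKFLHSAKKFGKAFVGEIMNS"), (2, "ACDEFGHIK")],
    )
    conn.executemany(
        "INSERT INTO peptide_activity VALUES (?, ?, ?)",
        [(1, "therapeutic", 0), (2, "therapeutic", 1), (2, "antimicrobial", 1)],
    )
    return conn


def export_activity(conn, activity):
    """Return (id, sequence, source) rows for one activity, ordered by peptide id."""
    cursor = conn.execute(
        """
        SELECT p.id,
               p.sequence,
               CASE a.is_predicted WHEN 1 THEN 'predicted' ELSE 'labeled' END
        FROM peptide AS p
        JOIN peptide_activity AS a ON a.peptide_id = p.id
        WHERE a.activity = ?
        ORDER BY p.id
        """,
        (activity,),
    )
    return cursor.fetchall()
```

Calling `export_activity(build_demo_db(), "therapeutic")` returns one row per peptide with a `labeled` or `predicted` tag; a real export would need the authors' actual table names and would likely also join publication and patent tables.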

  2. A binary dataset was constructed for each biological activity identified in the Peptipedia v2.0. The following steps were used to build the datasets: i) Peptides exhibiting the target activity were collected to generate positive examples, ii) Peptides without the target activity were collected to create the negative examples, iii) An undersampling strategy was applied to balance the dataset by randomly removing negative examples, and iv) Different pre-trained models were then used to generate embeddings, representing peptide sequences numerically for training the classification models (Dallago et al., 2021) (See more details in Section S2 of Supplementary Materials).

    AutoPeptideML recently put together a lovely preprint on bioactivity classifiers for peptides. I recommend that you take a look! I think the approach they take could help avoid common machine-learning problems that might crop up with this approach (e.g., data leakage from homology between sequences).
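Steps i) to iii) of the quoted passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the fixed seed, and the choice to drop negatives that also appear among the positives are all hypothetical. The embedding step iv) is not shown.

```python
import random


def build_balanced_dataset(positives, negatives, seed=0):
    """Return shuffled (sequence, label) pairs with at most as many negatives as positives.

    Label 1 marks peptides with the target activity, label 0 those without it.
    """
    rng = random.Random(seed)  # fixed seed so the undersampling is reproducible
    positive_set = set(positives)
    unique_positives = sorted(positive_set)
    # Assumption: a sequence listed as positive is never used as a negative.
    unique_negatives = sorted(set(negatives) - positive_set)
    # Random undersampling: keep only as many negatives as there are positives.
    n_keep = min(len(unique_positives), len(unique_negatives))
    kept_negatives = rng.sample(unique_negatives, n_keep)
    dataset = [(seq, 1) for seq in unique_positives]
    dataset += [(seq, 0) for seq in kept_negatives]
    rng.shuffle(dataset)
    return dataset
```

Random undersampling alone does not address the homology concern raised in the comment above; that would require grouping similar sequences before splitting into training and test sets.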

  3. Through homology mechanisms, enrichment analysis was performed for all canonical peptides using the MetaStudent tool (Hamp et al., 2013). MetaStudent allows the assignment of gene ontology terms (GO) from different sources, such as molecular function, cellular localization, and biological process.

    Will this work on short peptides? What is the rate of false positives?

  4. Then, a semantic analysis was generated to recognise the biological activity of the peptide sequence through the available description in the data sources.

    How did you make sure this was accurate? What did you count as a "label" vs. a "prediction"?

  5. Information related to the characteristics of the reported peptide sequences, including their biological activities, descriptions, experimental information, and related publications or patents, was downloaded from all data sources. Then, Python scripts were implemented to process the raw data downloaded from the data sources and transform the information for loading into the Peptipedia v2.0. A length filter was made, containing only peptides with a length equal to or less than 150 residues and higher than three residues. Also, the collected peptides were classified as canonical (with only the 20 natural amino acids) or non-canonical peptides. Then, a semantic analysis was generated to recognise the biological activity of the peptide sequence through the available description in the data sources. Finally, a loader Python script was implemented to load the register in the Peptipedia v2.0, developing a scalable ETL strategy (Extract, Transformation, and Load) for each utilised data source.

    Would you be willing to annotate inline in the text where each of the scripts that performs one of these actions is available? There's some information I'd be interested in hunting down for certain peptides, and having a pointer to the code that does each specific thing would be helpful to accomplish this.
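The length filter and canonical classification described in the quoted passage could look roughly like the sketch below. It is an illustration only, not the authors' script: the function names and record layout are hypothetical, and it assumes sequences use uppercase one-letter codes, so any other character (including lowercase letters) is counted as non-canonical.

```python
# The 20 natural amino acids in one-letter code.
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")

MIN_LENGTH_EXCLUSIVE = 3    # "higher than three residues"
MAX_LENGTH_INCLUSIVE = 150  # "equal to or less than 150 residues"


def passes_length_filter(sequence):
    """True when 3 < length <= 150."""
    return MIN_LENGTH_EXCLUSIVE < len(sequence) <= MAX_LENGTH_INCLUSIVE


def is_canonical(sequence):
    """True when every residue is one of the 20 natural amino acids."""
    return all(residue in CANONICAL_RESIDUES for residue in sequence)


def transform(raw_sequences):
    """Filter raw sequences by length and tag each survivor as canonical or not."""
    records = []
    for raw in raw_sequences:
        sequence = raw.strip()  # drop surrounding whitespace from the raw source
        if not passes_length_filter(sequence):
            continue
        records.append({"sequence": sequence, "canonical": is_canonical(sequence)})
    return records
```

For example, `transform(["ACD", "ACDK", "ACDX"])` drops the three-residue peptide, keeps `ACDK` as canonical, and keeps `ACDX` as non-canonical.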

  6. Peptides are usually classified by just one functional biological activity. However, moonlight peptides have two or more known activities within the same domain (Jeffery, 1999), and identifying the potential multiple activities of a peptide is relevant for biotechnological and pharmaceutical industries (Singh and Bhalla, 2020; Zanzoni et al., 2019).

    These sources are all about proteins, not about peptides. Is there any evidence that a peptide has multiple biological activities in vivo? I think there are some that do multiple things in cell culture, but it would be interesting to know if any do multiple things in the body