Peptipedia v2.0: A peptide sequence database and user-friendly web platform. A major update

Gabriel Cabas-Mora
Anamaría Daza
Nicole Soto-García
Valentina Garrido
Diego Alvarez
Marcelo Navarrete
Lindybeth Sarmiento-Varón
Julieta H. Sepúlveda Yañez
Mehdi D. Davari
Frederic Cadet
Álvaro Olivera-Nappa
Roberto Uribe-Paredes
David Medina-Ortiz

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

In recent years, peptides have gained significant relevance due to their therapeutic properties. The surge in peptide production and synthesis has generated vast amounts of data, enabling the creation of comprehensive databases and information repositories. Advances in sequencing techniques and artificial intelligence have further accelerated the design of tailor-made peptides. However, leveraging these techniques requires versatile and continuously updated storage systems, along with tools that facilitate peptide research and the implementation of machine learning for predictive systems. This work introduces Peptipedia v2.0, one of the most comprehensive public repositories of peptides, supporting biotechnological research by simplifying peptide study and annotation. Peptipedia v2.0 has expanded its collection by over 45% with peptide sequences that have reported biological activities. The functional biological activity tree has been revised and enhanced, incorporating new categories such as cosmetic and dermatological activities, molecular binding, and anti-ageing properties. Utilizing protein language models and machine learning, more than 90 binary classification models have been trained, validated, and incorporated into Peptipedia v2.0. These models exhibit average sensitivities and specificities of 0.877 ± 0.0530 and 0.873 ± 0.054, respectively, facilitating the annotation of more than 3.6 million peptide sequences with unknown biological activities, also registered in Peptipedia v2.0. Additionally, Peptipedia v2.0 introduces description tools based on structural and ontological properties and user-friendly machine-learning tools to facilitate the application of machine-learning strategies to study peptide sequences. Peptipedia v2.0 is accessible under the Creative Commons CC BY-NC-ND 4.0 license at https://peptipedia.cl/.

Arcadia Science
Aug 7, 2024

The database available in Peptipedia v2.0 was designed based on a relational schema. PostgreSQL manages all operations over the database. All queries to the database are handled through an application programming interface (API) implemented using the Flask framework version 2.3.1 and SQLAlchemy version 2.0.29.

I tried to look at a specific peptide (e.g., https://app.peptipedia.cl/peptide/33572) and it took a prohibitively long time for the page to load (> 5 minutes). Is this fixable?

The downloads tab was pleasantly fast, but I noticed that it doesn't provide the information that you quoted above. For example, for therapeutic peptides, I wanted to know which were labeled vs. predicted, the patent information, and really everything, but this information was not included in the download. How should we extract this information in bulk?
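To make the bulk-export question concrete, here is a minimal sketch of how a relational layout could record whether an activity annotation is labeled or predicted. Everything in it is an assumption for illustration: the table and column names, the `predicted` flag, and the toy sequences are hypothetical and are not taken from the Peptipedia schema. It also uses Python's built-in `sqlite3` rather than PostgreSQL, Flask, or SQLAlchemy so that it runs standalone.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE peptide (
            id INTEGER PRIMARY KEY,
            sequence TEXT NOT NULL UNIQUE
        );
        CREATE TABLE activity (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        -- Many-to-many link; `predicted` is 0 for reported, 1 for model-assigned.
        CREATE TABLE peptide_activity (
            peptide_id INTEGER NOT NULL REFERENCES peptide(id),
            activity_id INTEGER NOT NULL REFERENCES activity(id),
            predicted INTEGER NOT NULL,
            PRIMARY KEY (peptide_id, activity_id)
        );
        """
    )
    # Toy rows, invented for this example only.
    conn.executemany(
        "INSERT INTO peptide VALUES (?, ?)",
        [(1, "GIGKFLHSAKKF"), (2, "ACDEFGHIK")],
    )
    conn.execute("INSERT INTO activity VALUES (1, 'therapeutic')")
    conn.executemany(
        "INSERT INTO peptide_activity VALUES (?, ?, ?)",
        [(1, 1, 0), (2, 1, 1)],
    )
    return conn


def export_activity(conn: sqlite3.Connection, activity: str) -> list[tuple]:
    """Return (peptide id, sequence, 'labeled' or 'predicted') for one activity."""
    rows = conn.execute(
        """
        SELECT p.id, p.sequence, pa.predicted
        FROM peptide AS p
        JOIN peptide_activity AS pa ON pa.peptide_id = p.id
        JOIN activity AS a ON a.id = pa.activity_id
        WHERE a.name = ?
        ORDER BY p.id
        """,
        (activity,),
    ).fetchall()
    return [(pid, seq, "predicted" if flag else "labeled") for pid, seq, flag in rows]
```

With a layout along these lines, a bulk download could carry the labeled/predicted distinction as one extra column per row; whether the real database stores it this way is not stated in the manuscript.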

Read the original source
Arcadia Science
Aug 7, 2024

A binary dataset was constructed for each biological activity identified in Peptipedia v2.0. The following steps were used to build the datasets: i) peptides exhibiting the target activity were collected to generate positive examples, ii) peptides without the target activity were collected to create the negative examples, iii) an undersampling strategy was applied to balance the dataset by randomly removing negative examples, and iv) different pre-trained models were then used to generate embeddings, representing peptide sequences numerically for training the classification models (Dallago et al., 2021) (see more details in Section S2 of Supplementary Materials).

AutoPeptideML recently put together a lovely preprint on bioactivity classifiers for peptides. I recommend that you take a look! I think the approach they take could help avoid common problems with machine learning that might crop up with this approach (e.g., data set leakage from homology between sequences).
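Steps i) to iii) of the quoted passage can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input format (a mapping from sequence to a set of activity names), and the fixed seed are assumptions. Step iv) (embedding generation) is omitted, and the sketch does nothing about homology between sequences, which is the leakage risk raised in the comment above.

```python
import random


def build_binary_dataset(
    peptides: dict[str, set[str]], target: str, seed: int = 0
) -> list[tuple[str, int]]:
    """Build a balanced (sequence, label) list for one target activity.

    `peptides` maps each sequence to the set of activities reported for it
    (a hypothetical input format chosen for this example).
    """
    # i) positive examples: peptides annotated with the target activity
    positives = sorted(seq for seq, acts in peptides.items() if target in acts)
    # ii) negative examples: peptides without the target activity
    negatives = sorted(seq for seq, acts in peptides.items() if target not in acts)

    # iii) random undersampling of negatives down to the number of positives
    rng = random.Random(seed)
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))

    dataset = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng.shuffle(dataset)
    return dataset
```

Random undersampling keeps the classes balanced but treats near-identical sequences as independent, so a homology-aware split would still be needed before evaluating a model trained on such a dataset.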

Arcadia Science
Aug 7, 2024

Through homology mechanisms, enrichment analysis was performed for all canonical peptides using the MetaStudent tool (Hamp et al., 2013). MetaStudent allows the assignment of gene ontology terms (GO) from different sources, such as molecular function, cellular localization, and biological process.

Will this work on short peptides? What is the rate of false positives?

Arcadia Science
Aug 7, 2024

RaptorX-Property

Why this instead of, e.g., ESMFold?

Arcadia Science
Aug 7, 2024

the peptide descriptor service and enrichment analysis system.

What are these things? Are they what you describe below?

Arcadia Science
Aug 7, 2024

Then, a semantic analysis was generated to recognise the biological activity of the peptide sequence through the available description in the data sources.

How did you make sure this was accurate? What did you count as a "label" vs. a "prediction"?

Arcadia Science
Aug 7, 2024

Information related to the characteristics of the reported peptide sequences, including their biological activities, descriptions, experimental information, and related publications or patents, was downloaded from all data sources. Then, Python scripts were implemented to process the raw data downloaded from the data sources and transform the information for loading into Peptipedia v2.0. A length filter was applied, retaining only peptides longer than three residues and no longer than 150 residues. Also, the collected peptides were classified as canonical (containing only the 20 natural amino acids) or non-canonical. Then, a semantic analysis was performed to recognise the biological activity of each peptide sequence from the descriptions available in the data sources. Finally, a loader Python script was implemented to load the records into Peptipedia v2.0, yielding a scalable ETL (Extract, Transform, Load) strategy for each data source used.

Would you be willing to annotate inline in the text where each of the scripts that performs one of these actions is available? There's some information I'd be interested in hunting down for certain peptides, and having a pointer to the code that does each specific thing would help with that.
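Until such pointers exist, the transform step described in the quoted passage (length filter plus canonical/non-canonical classification) can be approximated as below. This is a sketch under stated assumptions, not the project's scripts: the function names and the record layout are hypothetical, and "canonical" is taken to mean that every character is one of the 20 standard one-letter amino acid codes in upper case.

```python
# The 20 standard amino acids in one-letter code.
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def passes_length_filter(sequence: str) -> bool:
    """Keep peptides longer than three residues and at most 150 residues."""
    return 3 < len(sequence) <= 150


def classify(sequence: str) -> str:
    """Label a sequence as canonical or non-canonical by its residue alphabet."""
    return "canonical" if set(sequence) <= CANONICAL_RESIDUES else "non-canonical"


def transform(raw_sequences: list[str]) -> list[dict]:
    """Filter raw sequences by length and attach a canonical/non-canonical tag."""
    records = []
    for raw in raw_sequences:
        sequence = raw.strip()  # drop surrounding whitespace from raw files
        if not passes_length_filter(sequence):
            continue
        records.append(
            {
                "sequence": sequence,
                "length": len(sequence),
                "type": classify(sequence),
            }
        )
    return records
```

Case is deliberately left untouched, since some sources use lower-case letters for modified residues; how the actual pipeline handles such notation is not described in the quoted text.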

Arcadia Science
Aug 7, 2024

Peptides are usually classified by just one functional biological activity. However, moonlight peptides have two or more known activities within the same domain (Jeffery, 1999), and identifying the potential multiple activities of a peptide is relevant for biotechnological and pharmaceutical industries (Singh and Bhalla, 2020; Zanzoni et al., 2019).

These sources are all about proteins, not peptides. Is there any evidence that a peptide has multiple biological activities in vivo? I think there are some that do multiple things in cell culture, but it would be interesting to know whether any do multiple things in the body.

Arcadia Science
Aug 7, 2024

collect

collected

Version published to 10.1101/2024.07.11.603053v1 on bioRxiv
Jul 16, 2024
Version published to 10.1093/database/baae113
Jan 1, 2024

PPIKB: A Comprehensive Knowledge Base and Analysis Platform for Protein–Peptide Interactions Based on Literature and Patents

This article has 7 authors:
1. Ning Zhu
2. Yanyu Ming
3. Chengyun Zhang
4. Cao Sen
5. Chongyang Li
6. Jingjing Guo
7. Hongliang Duan
This article has no evaluations
Latest version Jun 12, 2025

PEPlife2: A Updated Repository of the Half-life of Peptides

This article has 6 authors:
1. Urooj Alam
2. Kunal Chaudhary
3. Nishant Kumar
4. Ritu Tomer
5. Sumeet Patiyal
6. Gajendra P. S. Raghava
This article has no evaluations
Latest version May 16, 2025

ANABAG: Annotated Antibody Antigen dataset with unique features for Antibody Engineering Applications

This article has 3 authors:
1. Ilyas Grandguillaume
2. Fernando Luis Barroso da Silva
3. Catherine Etchebest
This article has no evaluations
Latest version Jul 7, 2025
