GenPept-Curated-2025: A Benchmark Dataset for Antimicrobial Peptide Prediction with Homology-Controlled Partitioning

Huynh Trong Pham
Bao Huynh
Thanh-Hoang Nguyen-Vo

Abstract

Antimicrobial peptides (AMPs) are promising therapeutic candidates against rising antimicrobial resistance, yet progress in AMP prediction is hampered by the lack of benchmark datasets that address homology leakage, negative set reliability, and distributional diversity. Existing AMP databases, designed as biological repositories, do not enforce the controlled partitioning required for rigorous machine learning evaluation. We present GenPept-Curated-2025, a curated, class-balanced benchmark of 11,000 peptide sequences (5,500 AMP / 5,500 non-AMP) derived from Bacteria, Archaea, and Fungi, and sourced exclusively from GenPept/NCBI Protein. The dataset was constructed through a reproducible pipeline comprising taxonomic scoping, quality control, precursor handling, annotation-based labeling, and Identical Protein Groups (IPG)-based deduplication, with sequence length restricted to 10–200 aa. The AMP proportion varies substantially across length bins (14.2% in [10, 50] aa to 77.1% in [101, 150] aa), identifying length-dependent class imbalance as a distribution shift that benchmarking must account for. The dataset is openly released to support standardized, reproducible, and leakage-free evaluation of AMP prediction models.
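
The length-dependent class imbalance described in the abstract can be checked on any labeled peptide set by computing the AMP fraction per length bin. The sketch below is illustrative only and is not the authors' pipeline: the function name, the toy records, and the two intermediate bins ([51, 100] and [151, 200]) are assumptions; only the [10, 50] and [101, 150] bins and the 10–200 aa range come from the abstract.

```python
from collections import defaultdict

# [10, 50] and [101, 150] are named in the abstract; the other two bins are assumed.
LENGTH_BINS = [(10, 50), (51, 100), (101, 150), (151, 200)]


def amp_fraction_by_length_bin(records, bins=LENGTH_BINS):
    """Return {bin: AMP fraction} for (sequence, label) pairs.

    label is 1 for AMP and 0 for non-AMP. Sequences whose length falls
    outside every bin (here, outside 10-200 aa) are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_amp, n_total]
    for seq, label in records:
        length = len(seq)
        for lo, hi in bins:
            if lo <= length <= hi:
                counts[(lo, hi)][0] += label
                counts[(lo, hi)][1] += 1
                break
    return {b: n_amp / n_total for b, (n_amp, n_total) in counts.items()}


# Hypothetical toy records (placeholder sequences, not real peptides).
toy_records = [
    ("A" * 12, 1), ("A" * 30, 0), ("A" * 45, 0),     # fall in [10, 50]
    ("A" * 120, 1), ("A" * 130, 1), ("A" * 140, 0),  # fall in [101, 150]
    ("A" * 5, 1), ("A" * 250, 0),                    # outside 10-200 aa, skipped
]
fractions = amp_fraction_by_length_bin(toy_records)
```

Reporting model metrics separately for each bin, rather than only in aggregate, is one way an evaluation could account for this kind of shift.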

Version published to 10.64898/2026.04.25.720793 on bioRxiv
Apr 29, 2026

BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models

This article has 7 authors:
1. Breno L. S. de Almeida
2. Robson P. Bonidia
3. Martin Bole
4. Anderson Avila-Santos
5. Peter F. Stadler
6. Ulisses N. da Rocha
7. André C. P. L. F. de Carvalho
This article has no evaluations. Latest version: Apr 22, 2026

Structural bias in machine learning-guided peptide design

This article has 2 authors:
1. Victor Daniel Aldas-Bulos
2. Fabien Plisson
This article has no evaluations. Latest version: May 8, 2026

Evaluating Reference-Independent Pipelines for the Detection of Spreading Organisms in Metagenomic Datasets

This article has 7 authors:
1. N.S. Popov
2. V.V. Panova
3. M. Molchanova
4. S.A. Gurov
5. A.N. Lukashev
6. E.N. Ilina
7. A.I. Manolov
This article has no evaluations. Latest version: May 6, 2026
