Structural bias in machine learning-guided peptide design

Victor Daniel Aldas-Bulos
Fabien Plisson

Abstract

Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RoseTTAFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.

Version published to 10.64898/2026.05.06.721805 on bioRxiv
May 8, 2026

Related articles

Improving Biological Sequence Prediction with AlphaFold2 Representation

This article has 3 authors:
1. Zhiqian Jiang
2. Canh Hao Nguyen
3. Hiroshi Mamitsuka
This article has no evaluations. Latest version: Apr 28, 2026

Systematic Benchmarking of Kinase Bioactivity Models Across Splitting Strategies and Protein Representations

This article has 1 author:
1. Joshua M. Abbott
This article has no evaluations. Latest version: Apr 22, 2026

Integrating Diffusion and Liquid AI Models for Predicting Peptide Affinity from mRNA Display Selections

This article has 8 authors:
1. Colin M. Leaf
2. Pearl Qi
3. Yash Pragnesh Gandhi
4. Farzad Jalali-Yazdi
5. Justin N. Ong
6. Terry T. Takahashi
7. Rajiv K. Kalia
8. Richard W. Roberts
This article has no evaluations. Latest version: May 11, 2026
