Advancing Protein Ensemble Predictions Across the Order–Disorder Continuum

Abstract

While deep learning has transformed structure prediction for ordered proteins, intrinsically disordered proteins remain poorly predicted because they are systematically underrepresented in training data, even though they constitute approximately 30% of eukaryotic proteomes. We introduce PeptoneBench, the first benchmark enabling systematic assessment of ensemble generators across both ordered and disordered proteins by integrating diverse experimental observables. Our analysis reveals that existing evaluation metrics are systematically biased toward the structured end of the proteome. Assessment of popular predictors (AlphaFold2, ESMFlow, Boltz2) confirms high accuracy on ordered proteins but shows that performance degrades with increasing disorder. We further present PepTron, a flow-matching ensemble generator trained on data augmented with synthetic disordered protein ensembles. On our benchmark, PepTron matches BioEmu on disordered regions while maintaining competitive accuracy on ordered protein benchmarks. This data augmentation approach demonstrates that targeted training strategies can approach the performance of computationally expensive simulation-based methods, and it establishes a generalizable framework applicable to other protein generative models. All datasets, models, and code are openly available.
