Advancing Protein Ensemble Predictions Across the Order–Disorder Continuum
Abstract
While deep learning has transformed structure prediction for ordered proteins, intrinsically disordered proteins remain poorly predicted because they are systematically underrepresented in training data, even though they constitute approximately 30% of eukaryotic proteomes. We introduce PeptoneBench, the first benchmark enabling systematic assessment of ensemble generators across both ordered and disordered proteins, integrating diverse experimental observables. Our analysis reveals that existing evaluation metrics are systematically biased toward the structured end of the proteome. Assessment of popular predictors (AlphaFold2, ESMFlow, Boltz2) confirms high accuracy on ordered proteins but shows performance degrading with increasing disorder. We further present PepTron, a flow-matching ensemble generator trained on data augmented with synthetic disordered protein ensembles. On our benchmark, PepTron matches BioEmu on disordered regions while maintaining competitive accuracy on ordered protein benchmarks. Our data augmentation approach demonstrates that targeted training strategies can approach the performance of computationally expensive simulation-based methods, establishing a generalizable framework applicable to other protein generative models. All datasets, models, and code are openly available.