SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset
Abstract
Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises 5,244,285 structures across 1,048,857 unique protein-ligand systems curated from the ChEMBL and BindingDB databases and computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately 3% of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (OnionNet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring functions, neither class exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions. The dataset can be found at https://www.sandboxaq.com/sair.
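To illustrate what "correlation with ground truth affinities" means in practice, the following is a minimal sketch that converts IC50 values to pIC50 and computes a Pearson correlation against predicted scores. The function names and the numeric values are hypothetical and chosen only for illustration; they are not taken from SAIR, and the abstract does not state which correlation metric the benchmark uses.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example data (not from the dataset).
ic50_nm = [1.0, 10.0, 100.0, 1000.0]   # experimental IC50 values in nM
predicted = [8.5, 8.1, 7.2, 6.0]       # model-predicted pIC50 values

experimental = [pic50_from_nm(v) for v in ic50_nm]
r = pearson_r(predicted, experimental)
print(f"Pearson r = {r:.3f}")
```

Working in pIC50 rather than raw IC50 places affinities on a logarithmic scale, which is the usual choice when comparing predictions across ligands whose potencies span several orders of magnitude.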