SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset

Abstract

Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises 5,244,285 structures across 1,048,857 unique protein-ligand systems curated from the ChEMBL and BindingDB databases and then computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately 3% of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (OnionNet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring function methods, neither class exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions. The dataset can be found at https://www.sandboxaq.com/sair.
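
As a rough illustration of how a predictor might be compared against ground truth affinities, the sketch below converts IC50 values to pIC50 and computes a Pearson correlation. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the abstract does not say which correlation metric was used, the nanomolar unit is assumed, and the function names `pic50_from_nm` and `pearson_r` are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in molar) = 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A typical use would be to map each measured IC50 through `pic50_from_nm`, collect a model's predicted scores for the same complexes, and pass both lists to `pearson_r`. Working on the logarithmic pIC50 scale keeps a few very weak binders from dominating the statistic.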
