GEMS – Enhancing Generalizable Binding Affinity Prediction by Removing Data Leakage and Integrating Language Model Embeddings into Graph Neural Networks

David Graber
Peter Stockinger
Fabian Meyer
Siddhartha Mishra
Claus Horn
Rebecca Buller

Abstract

The field of computational drug design requires accurate scoring functions to predict binding affinities for protein-ligand interactions. However, train-test data leakage between the PDBbind database and the CASF benchmark datasets has significantly inflated the performance metrics of currently available deep-learning-based binding affinity prediction models, leading to overestimation of their generalization capabilities. We address this issue by proposing PDBbind CleanSplit, a training dataset curated by a novel structure-based filtering algorithm that eliminates train-test data leakage as well as redundancies within the training set. Retraining current top-performing models on CleanSplit caused their benchmark performance to drop significantly, indicating that the performance of existing models is largely driven by data leakage. In contrast, our graph neural network model for efficient molecular scoring (GEMS) maintains high benchmark performance when trained on CleanSplit. Leveraging a sparse graph modeling of protein-ligand interactions and transfer learning from language models, GEMS is able to generalize to strictly independent test datasets.
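
As a rough illustration of the two filtering goals named in the abstract (removing training entries that resemble test entries, then removing redundancy within the training set), a minimal sketch might look like the following. This is not the CleanSplit algorithm: the abstract only says the filtering is structure-based, so the toy Jaccard similarity, the feature-set representation of a complex, the thresholds, and the names `jaccard` and `clean_split` are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical stand-in: a complex is represented as a set of feature tokens.
Complex = FrozenSet[str]


def jaccard(a: Complex, b: Complex) -> float:
    """Toy similarity in [0, 1]; a placeholder for a real structure-based score."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def clean_split(
    train: Dict[str, Complex],
    test: Dict[str, Complex],
    sim: Callable[[Complex, Complex], float] = jaccard,
    leak_threshold: float = 0.8,        # assumed value, not from the source
    redundancy_threshold: float = 0.9,  # assumed value, not from the source
) -> List[str]:
    """Return IDs of training entries that survive both filtering steps."""
    # Step 1: drop training entries that are too similar to any test entry.
    no_leak = [
        tid
        for tid, cplx in train.items()
        if all(sim(cplx, t) < leak_threshold for t in test.values())
    ]

    # Step 2: greedy redundancy removal. An entry is kept only if it is not
    # too similar to an entry that has already been kept.
    kept: List[str] = []
    for tid in no_leak:
        if all(sim(train[tid], train[k]) < redundancy_threshold for k in kept):
            kept.append(tid)
    return kept


train = {
    "a": frozenset({"x", "y", "z"}),
    "b": frozenset({"x", "y", "z"}),  # duplicate of "a"
    "c": frozenset({"p", "q"}),       # identical to the test entry
    "d": frozenset({"m", "n"}),
}
test = {"t": frozenset({"p", "q"})}
result = clean_split(train, test)
```

In this toy run, "c" is dropped because it matches the test entry and "b" is dropped as a duplicate of "a". The greedy second step depends on iteration order, which is one of several design choices a real pipeline would have to justify.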

Version published to 10.1101/2024.12.09.627482v2 on bioRxiv: Jun 10, 2025
Version published to 10.1101/2024.12.09.627482v1 on bioRxiv: Dec 11, 2024

Related articles

How Good is AlphaFold3 at Ranking Drug Binding Affinities?

This article has 9 authors:
1. Xin Hong
2. Bowen Gao
3. Yinjun Jia
4. Wenyu Zhu
5. Qixuan Chen
6. Xiaohe Tian
7. Zhenyi Zhong
8. Jianhui Wang
9. Yanyan Lan
This article has no evaluations
Latest version May 30, 2025

SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset

This article has 14 authors:
1. Pablo Lemos
2. Zane Beckwith
3. Sasaank Bandi
4. Maarten van Damme
5. Jordan Crivelli-Decker
6. Benjamin J. Shields
7. Thomas Merth
8. Punit K. Jha
9. Nicola De Mitri
10. Tiffany J. Callahan
11. AJ Nish
12. Paul Abruzzo
13. Romelia Salomon-Ferrer
14. Martin Ganahl
This article has no evaluations
Latest version Jun 21, 2025

ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design

This article has 4 authors:
1. Lukia Mistryukova
2. Vladimir Manuilov
3. Konstantin Avchaciov
4. Peter O. Fedichev
This article has no evaluations
Latest version Jun 20, 2025