Accelerating ligand discovery by combining Bayesian optimization with MMGBSA-based binding affinity calculations

Lucas Andersen
Max Rausch-Dupont
Alejandro Martínez León
Andrea Volkamer
Jochen S. Hub
Dietrich Klakow


Abstract

Predicting protein–ligand binding affinity with high accuracy is critical in structure-based drug discovery. While docking methods offer computational efficiency, they often lack the precision required for reliable affinity ranking. In contrast, molecular dynamics (MD)-based approaches such as MMGBSA provide more accurate binding free energy estimates but are computationally intensive, limiting their scalability. To address this trade-off, we introduce an active learning framework that automates molecule selection for docking and MD simulations, replacing manual expert-driven decisions with a data-efficient, model-guided strategy. Our approach integrates fixed molecular embeddings, partly from pre-trained deep learning models (MolFormer, ChemBERTa-2, and Morgan fingerprints), with adaptive regression models (e.g., Bayesian Ridge and Random Forest) to iteratively improve binding affinity predictions. We evaluate this approach retrospectively on a new dataset of 59,356 chemically diverse compounds from ZINC-22 targeting the MCL1 protein, using both AutoDock Vina and MMGBSA binding free energy scores. Our results show that incorporating MMGBSA scores into the active learning loop significantly enhances performance, recovering 79.9% of the top 1% binders in the whole dataset, compared to only 6.7% when using docking scores alone. Notably, MMGBSA exhibits a stronger correlation with experimental binding affinities than AutoDock Vina on our dataset and enables more accurate ranking of candidate compounds in a runtime-efficient way. Furthermore, we demonstrate that a one-at-a-time acquisition active learning strategy consistently outperforms traditional batched acquisition, the latter achieving just 78.4% recovery with MolFormer and Bayesian Ridge.
These findings underscore the potential of integrating deep learning-based molecular representations with MD-level accuracy in an active learning framework, offering a scalable and efficient path to accelerate virtual screening and improve hit identification in drug discovery.
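
The loop described in the abstract (fit a surrogate on the compounds scored so far, pick the next compounds to score, repeat, then measure how many of the top binders were recovered) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the acquisition rule is a plain greedy pick of the best predicted score, the data are synthetic, and a toy nearest-neighbour predictor stands in for the Bayesian Ridge and Random Forest regressors named in the abstract. It is not the authors' implementation.

```python
import random


def top_fraction_recovery(scores, acquired, fraction=0.01):
    """Share of the best-scoring `fraction` of compounds found in `acquired`.

    Lower score means stronger predicted binding, as with docking or
    MMGBSA energies.
    """
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    top = set(ranked[:n_top])
    return len(top & set(acquired)) / n_top


def nearest_neighbour_predict(features, scores, labelled, query):
    """Toy surrogate (assumption): return the score of the closest labelled compound."""
    def sq_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(features[i], features[j]))

    nearest = min(labelled, key=lambda i: sq_dist(i, query))
    return scores[nearest]


def active_learning(features, scores, n_init, n_rounds, batch_size, seed=0):
    """Greedy active learning; batch_size=1 gives one-at-a-time acquisition.

    `scores` plays the role of the expensive oracle: a score is only read
    for compounds that have already been acquired.
    """
    rng = random.Random(seed)
    pool = list(range(len(scores)))
    labelled = rng.sample(pool, n_init)  # random initial set
    labelled_set = set(labelled)
    for _ in range(n_rounds):
        candidates = [i for i in pool if i not in labelled_set]
        if not candidates:
            break
        # "Refit" the surrogate on all labelled data and predict the rest.
        preds = {
            i: nearest_neighbour_predict(features, scores, labelled, i)
            for i in candidates
        }
        # Acquire the compounds with the lowest (best) predicted score.
        picked = sorted(candidates, key=lambda i: preds[i])[:batch_size]
        labelled.extend(picked)
        labelled_set.update(picked)
    return labelled


# Synthetic example: 200 one-dimensional "embeddings", best scores near 0.3.
data_rng = random.Random(1)
features = [[data_rng.random()] for _ in range(200)]
scores = [-10.0 * (1.0 - abs(f[0] - 0.3)) for f in features]

# Same total budget of 50 scored compounds under both schedules.
one_at_a_time = active_learning(features, scores, n_init=10, n_rounds=40, batch_size=1)
batched = active_learning(features, scores, n_init=10, n_rounds=4, batch_size=10)

print("one-at-a-time:", top_fraction_recovery(scores, one_at_a_time, fraction=0.05))
print("batched:      ", top_fraction_recovery(scores, batched, fraction=0.05))
```

Both schedules spend the same budget; the only difference is how often the surrogate is updated between acquisitions. The toy data say nothing about which schedule performs better on real compounds.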

Version published to 10.1101/2025.06.22.660936 on bioRxiv
Jun 27, 2025

Related articles

Integrating Computational Biology in Modern Drug Discovery: A Synergistic Approach of Structure-Based, Ligand-Based, and Network Pharmacology Strategies
Cromwel Tepap Zemnou, Gabriel Tchuente Kamsu, Ramelle Ngakam, Etienne Junior Tcheumeni
Latest version Jan 29, 2026

Parameter-Efficient Adaptation of Large Language Models for Drug-Target Affinity Modeling in Drug Discovery
Virendra Singh Kaira
Latest version Jan 29, 2026

Multi-Modal Ensemble Learning for TLR4 Binding Prediction: Addressing Data Scarcity and Leakage in Small Molecule Drug Discovery
Brandon Yee, Maximilian Rutkowski, Wilson Collins
Latest version Jan 28, 2026
