Assessing the Generalizability of Machine Learning and Physics-Based Methods with DNA-Encoded Libraries

Marissa Dolorfino
Daniel Santos Perez
Yao Fu
Shu-Hang Lin
Sean McCarty
Matthew J. O’Meara
Terra Sztain

Abstract

Predicting protein-ligand binding is a central challenge in computational drug discovery. Machine learning (ML) and co-folding methods have advanced rapidly, but their ability to generalize beyond their training or parameterization regimes remains poorly understood. DNA-encoded libraries (DELs) enable simultaneous screening of billions of molecules, providing a useful testbed for evaluating these approaches at scale. A recent NeurIPS competition revealed that even top-performing ML models trained on DEL data failed to generalize to out-of-distribution (OOD) chemical space. We investigated whether integrating structural modeling could bridge this generalization gap. We systematically assessed state-of-the-art ML, docking, and co-folding methods, including Schrödinger Glide, Rosetta GALigandDock, and Boltz-2, on three biologically diverse protein targets screened against libraries containing multiple DEL synthesis formats. While ML excels in-distribution, OOD hit discrimination depends on both the target and the ligand context, with no single method consistently dominating. These findings demonstrate that benchmark performance alone is insufficient to predict OOD performance, highlighting the need for system-dependent evaluation of binding prediction methods. We provide DEL-iver, an open-source package for assessing protein-ligand prediction methods and analyzing high-throughput screening data.
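To make the in-distribution versus OOD comparison concrete, the following is a minimal sketch of how hit discrimination might be scored separately on the two splits. It is not the DEL-iver API and not the authors' protocol: the function names, the choice of ROC AUC as the metric, the use of held-out building blocks to define the OOD split, and the toy records are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

# Hypothetical record layout: (building_block_id, is_hit, model_score).
Record = Tuple[str, int, float]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random hit outscores a random non-hit (ties count 0.5)."""
    hits = [s for y, s in zip(labels, scores) if y == 1]
    non_hits = [s for y, s in zip(labels, scores) if y == 0]
    if not hits or not non_hits:
        raise ValueError("need at least one hit and one non-hit")
    wins = sum(
        1.0 if h > n else 0.5 if h == n else 0.0
        for h in hits
        for n in non_hits
    )
    return wins / (len(hits) * len(non_hits))


def auc_by_split(records: Iterable[Record], held_out: Set[str]) -> Dict[str, float]:
    """Score molecules whose building block was held out (OOD) separately
    from molecules whose building block was seen in training."""
    records = list(records)
    in_dist = [(y, s) for block, y, s in records if block not in held_out]
    ood = [(y, s) for block, y, s in records if block in held_out]
    return {
        "in_distribution": roc_auc(*zip(*in_dist)),
        "ood": roc_auc(*zip(*ood)),
    }


# Invented toy data: blocks A and B were "seen", block C is held out.
TOY_RECORDS = [
    ("A", 1, 0.9), ("A", 0, 0.2),
    ("B", 1, 0.8), ("B", 0, 0.1),
    ("C", 1, 0.4), ("C", 0, 0.6),
    ("C", 1, 0.7), ("C", 0, 0.3),
]

RESULT = auc_by_split(TOY_RECORDS, held_out={"C"})
print(RESULT)
```

On this toy data the model separates hits perfectly for the seen building blocks but ranks one held-out hit below a held-out non-hit, so the OOD score is lower. Reporting the two numbers side by side, per target, is one simple way to expose the kind of gap the abstract describes.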

Abstract Figure

Version published to 10.64898/2026.04.18.719394 on bioRxiv
Apr 19, 2026

Related articles

CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

This article has 8 authors:
1. Yongcan Chen
2. Lihao Fu
3. Xuchao Lu
4. Wenzhuo Li
5. Yuan Gao
6. Yibo Wang
7. Zhicheng Ruan
8. Tong Si
This article has no evaluations. Latest version: Mar 25, 2026

Integrating Diffusion and Liquid AI Models for Predicting Peptide Affinity from mRNA Display Selections

This article has 8 authors:
1. Colin M. Leaf
2. Pearl Qi
3. Yash Pragnesh Gandhi
4. Farzad Jalali-Yazdi
5. Justin N. Ong
6. Terry T. Takahashi
7. Rajiv K. Kalia
8. Richard W. Roberts
This article has no evaluations. Latest version: May 11, 2026

Cross-Attention Over RNA And Protein Sequences Enables Generalizable Interaction Prediction

This article has 7 authors:
1. Mario Catalano
2. Gerardo Pepe
3. Gabriele Ausiello
4. Claire McWhite
5. Giorgio Gambosi
6. Manuela Helmer Citterich
7. Pier Federico Gherardini
This article has no evaluations. Latest version: Apr 23, 2026
