Physics beats diffusion: Agentic AI-driven virtual screening benchmark on a GPCR target
Abstract
Virtual screening (VS) campaigns require expert decisions at every stage, from active compound curation and decoy generation to receptor preparation, docking engine selection, and statistical evaluation. I show that an autonomous large language model (LLM) coding agent (Claude Code, Anthropic) can design and execute a complete VS benchmark pipeline without human coding intervention, requiring only high-level scientific direction. The agent curated 1,000 FPR2 (a G protein-coupled receptor) actives from ChEMBL (pChEMBL ≥ 5), generated ca. 10,000 property-matched decoys, prepared ligand libraries using two protocols (naive defaults and expert-guided), and configured and ran docking with two fundamentally different engines: (1) Uni-Dock (GPU-accelerated, physics-based) and (2) DiffDock (diffusion-based machine learning). It then performed full statistical evaluation, including ROC AUC, BEDROC, enrichment factors, DeLong tests, and paired bootstrap confidence intervals. Uni-Dock achieved ROC AUC = 0.70–0.73 with significant discrimination (permutation p < 0.0001), while DiffDock confidence scores yielded near-random performance (AUC = 0.54–0.56; negligible Cliff's delta), consistent with the known underrepresentation of GPCR targets in its training data. Expert-guided protocols improved Uni-Dock AUC by +0.020 (DeLong p = 0.003; paired bootstrap p = 0.002). Single-ligand redocking confirmed that Vina reproduces the crystal pose (RMSD 0.22–0.39 Å), whereas both Uni-Dock batch mode (5.2–5.7 Å) and DiffDock (23–29 Å) failed. All code, data, and the agent's skill file are openly available. Scientific contribution: This is the first demonstration of an LLM coding agent autonomously constructing a reproducible VS benchmark from scratch.
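The enrichment metrics named above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the agent's actual code: ROC AUC is computed here via the rank-based Mann–Whitney convention (ties count 0.5), and the enrichment factor at a chosen fraction of the ranked list assumes higher score = better; function names and the tie-handling choice are the author's own assumptions.

```python
def roc_auc(scores_active, scores_decoy):
    """Rank-based ROC AUC: fraction of (active, decoy) pairs the model
    orders correctly (Mann-Whitney U / (n_a * n_d)); ties count 0.5."""
    wins = 0.0
    for a in scores_active:
        for d in scores_decoy:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_decoy))

def enrichment_factor(scores_active, scores_decoy, frac=0.01):
    """EF at the top `frac` of the ranked list (higher score = better):
    hit rate in the top slice divided by the overall active rate."""
    labeled = [(s, 1) for s in scores_active] + [(s, 0) for s in scores_decoy]
    labeled.sort(key=lambda t: t[0], reverse=True)
    n_top = max(1, int(round(frac * len(labeled))))
    hits = sum(label for _, label in labeled[:n_top])
    return (hits / n_top) / (len(scores_active) / len(labeled))
```

On a perfectly separating score list, `roc_auc` returns 1.0 and the EF at any fraction up to the active rate equals its maximum value, 1/frac capped by the active rate; an AUC near 0.5, as observed for DiffDock here, means the scores order active/decoy pairs no better than chance.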
The resulting benchmark provides the first head-to-head comparison of Uni-Dock and DiffDock on a GPCR target, revealing that physics-based docking (AUC = 0.70–0.73) substantially outperforms diffusion-based ML docking (AUC = 0.54–0.56, near-random) for this underrepresented target class.
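The paired bootstrap used to compare the two engines' AUCs can be sketched as follows. This is a minimal illustration under stated assumptions, not the agent's implementation: compounds are resampled with replacement jointly for both score lists, so each replicate compares the engines on the same resampled library, and the 95% interval is read from the percentiles of the AUC differences.

```python
import random

def roc_auc(pairs):
    """pairs: (score, label) with label 1 = active, 0 = decoy."""
    act = [s for s, y in pairs if y == 1]
    dec = [s for s, y in pairs if y == 0]
    wins = sum(1.0 if a > d else 0.5 if a == d else 0.0
               for a in act for d in dec)
    return wins / (len(act) * len(dec))

def paired_bootstrap(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """95% bootstrap CI for AUC(A) - AUC(B). Compounds are resampled
    jointly so the comparison between engines stays paired."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        if 0 < sum(labels[i] for i in idx) < n:  # need both classes
            a = roc_auc([(scores_a[i], labels[i]) for i in idx])
            b = roc_auc([(scores_b[i], labels[i]) for i in idx])
            diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

A confidence interval for the AUC difference that excludes zero (equivalently, a small bootstrap p-value, as in the +0.020 expert-guided improvement reported above) indicates the two protocols differ beyond resampling noise on this library.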