Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI Driven Molecular Property and Activity Prediction

Jinjiang Guo

Abstract

The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint-and-split entries (26 endpoints under each of three splitting schemes): random, Murcko scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry is evaluated with four model families: classical ML, GNNs, pretrained molecular sequence models and LLM-based SAR baselines. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) wins 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win 3. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
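The abstract mentions paired bootstrap analyses without describing the procedure. As a rough illustration only, the sketch below shows one common form of a paired bootstrap over matched benchmark entries. The function name `paired_bootstrap`, its parameters, the percentile interval and the example scores are all hypothetical and are not taken from the preprint.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap on matched entries (hypothetical helper, not the authors' code).

    scores_a[i] and scores_b[i] must be the scores of two models on the same
    benchmark entry. Returns the observed mean difference (A minus B), a 95%
    percentile interval for it, and the fraction of resamples in which the
    resampled mean difference is positive.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired one-to-one")
    rng = random.Random(seed)
    # Work on per-entry differences so each resample keeps the pairing intact.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean(sample))
    boot_means.sort()
    lower = boot_means[int(0.025 * n_resamples)]
    upper = boot_means[int(0.975 * n_resamples) - 1]
    win_rate = sum(m > 0 for m in boot_means) / n_resamples
    return mean(diffs), (lower, upper), win_rate


# Made-up scores for two model families on five matched entries (illustration only).
family_a = [0.82, 0.79, 0.88, 0.75, 0.91]
family_b = [0.80, 0.74, 0.85, 0.73, 0.86]
observed, interval, win_rate = paired_bootstrap(family_a, family_b)
print(observed, interval, win_rate)
```

Resampling the per-entry differences, not the two score lists independently, is what makes the comparison paired: variation in difficulty between endpoints cancels out within each pair.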

Version published to 10.64898/2026.04.29.721568 on bioRxiv
May 4, 2026

Related articles

Systematic Benchmarking of Kinase Bioactivity Models Across Splitting Strategies and Protein Representations

This article has 1 author:
1. Joshua M. Abbott
This article has no evaluations. Latest version Apr 22, 2026

Benchmarking Boltz-2 for Screening of Therapeutic Antibody-Antigen Interactions

This article has 10 authors:
1. Alexandra Fieux-Castagnet
2. Julian Waton
3. Alina Glukhonemykh
4. Eric Snow
5. Roshini Ashokkumar
6. Jess Fleming
7. David Champagne
8. Thomas Devenyns
9. Alex Peluffo
10. Chris Anagnostopoulos
This article has no evaluations. Latest version May 14, 2026

DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

This article has 6 authors:
1. Tianyu Liu
2. Sihan Jiang
3. Fan Zhang
4. Kunyang Sun
5. Teresa Head-Gordon
6. Hongyu Zhao
This article has no evaluations. Latest version Apr 7, 2026
