Bayesian Optimization for Biochemical Discovery with LLMs

Abstract

Incorporating prior domain knowledge into Bayesian optimization (BO) remains difficult for statistical methods, which also typically suffer from limited interpretability. Large language models (LLMs) offer complementary strengths in reasoning and knowledge integration, but it remains unclear when and how they improve BO. We address this gap through a systematic analysis of the success and failure modes of LLM-enabled approaches to BO and propose strategies to overcome their limitations. We benchmark two types of scientific tasks: molecular optimization using string-based representations and optimization of four-residue binding motifs within proteins. In the molecular task, we find that poor data comprehension and a limited ability to process large amounts of in-context data hinder LLM performance. Accordingly, we propose an agentic workflow that orchestrates chemical tools and statistical BO, and we show its effectiveness. In the protein task, reasoning LLMs prove effective at hypothesis generation grounded in domain knowledge, improving optimization performance while providing interpretable design strategies. Across both domains, we observe a tendency for LLMs to fixate on irrelevant context, such that withholding information paradoxically improves performance. These results clarify the conditions under which LLMs enhance BO and suggest hybrid approaches that combine statistical rigor with LLM-enabled reasoning and interpretability.
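To make the statistical component concrete, the sketch below shows a generic Bayesian optimization loop over a discrete candidate pool, using a Gaussian-process surrogate and an expected-improvement acquisition function. It is an illustrative outline only, not the authors' implementation; featurize, evaluate_property, and candidate_pool are hypothetical placeholders for a molecular featurizer, a property oracle, and a set of candidate strings.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    # Expected improvement for maximization, with a small exploration margin xi.
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_step(featurize, evaluate_property, candidate_pool, observed, scores):
    # One BO iteration: fit a GP surrogate on the observed data, score the
    # remaining candidates with expected improvement, and evaluate the best one.
    X = np.array([featurize(c) for c in observed])
    y = np.array(scores)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    pool = [c for c in candidate_pool if c not in observed]
    mu, sigma = gp.predict(np.array([featurize(c) for c in pool]), return_std=True)
    best = pool[int(np.argmax(expected_improvement(mu, sigma, y.max())))]
    return best, evaluate_property(best)

In the hybrid setting described in the abstract, an agentic workflow would wrap a loop like this as one tool alongside chemistry toolkits and LLM-generated hypotheses, rather than relying on the LLM to process the full observation history in context.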
