Bayesian Optimization for Biochemical Discovery with LLMs

Abstract

Incorporating prior domain knowledge into Bayesian optimization (BO) remains difficult for statistical methods, which also typically suffer from limited interpretability. Large language models (LLMs) offer complementary strengths in reasoning and knowledge integration, but it remains unclear when and how they improve BO. We address this gap through a systematic analysis of the success and failure modes of LLM-enabled approaches to BO and propose strategies to overcome their limitations. We benchmark two types of scientific tasks: molecular optimization using string-based representations and optimization of four-residue binding motifs within proteins. In the molecular task, we find that poor data comprehension and a limited ability to process large amounts of in-context data hinder LLM performance. Accordingly, we propose an agentic workflow that orchestrates chemical tools and statistical BO, and we show its effectiveness. In the protein task, reasoning LLMs prove effective at hypothesis generation grounded in domain knowledge, improving optimization performance while providing interpretable design strategies. Across both domains, we observe a tendency for LLMs to fixate on irrelevant context, such that withholding information paradoxically improves performance. These results clarify the conditions under which LLMs enhance BO and suggest hybrid approaches that combine statistical rigor with LLM-enabled reasoning and interpretability.
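To make the statistical component concrete, the sketch below shows a generic Bayesian optimization loop over a discrete candidate pool, using a Gaussian-process surrogate and an expected-improvement acquisition function. It is an illustrative outline only, not the authors' implementation; featurize, evaluate_property, and candidate_pool are hypothetical placeholders for a molecular featurizer, a property oracle, and a set of candidate strings.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    # Expected improvement for maximization, with a small exploration margin xi.
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_step(featurize, evaluate_property, candidate_pool, observed, scores):
    # One BO iteration: fit a GP surrogate on the observed data, score the
    # remaining candidates with expected improvement, and evaluate the best one.
    X = np.array([featurize(c) for c in observed])
    y = np.array(scores)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    pool = [c for c in candidate_pool if c not in observed]
    mu, sigma = gp.predict(np.array([featurize(c) for c in pool]), return_std=True)
    best = pool[int(np.argmax(expected_improvement(mu, sigma, y.max())))]
    return best, evaluate_property(best)

In the hybrid setting described in the abstract, an agentic workflow would wrap a loop like this as one tool alongside chemistry toolkits and LLM-generated hypotheses, rather than relying on the LLM to process the full observation history in context.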
