BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research

Yuanhao Qu
Yingzhou Lu
Xinming Tu
Serena Zhang
Tianwei She
Alexander Glenn Shaw
Jou-Ho Shih
Bingqing Zhao
Minjie Shen
Haochen Yang
Jielin Yan
Rongchuan Zhang
Xinze Wu
Tingting Li
Bin Zhou
Ning Wang
Adam Ma
Le Cong
Xiaobo Hu
Yuan Jiang
Jiayun Dong
Tao Peng
Jure Leskovec
Kexin Huang

Abstract

LLM agents now perform real biomedical research, but evaluating them rigorously is hard. Outcome-only benchmarks fail in two ways. First, a correct final answer can come from memorization, reward hacking, or wrong reasoning that produces the right number by chance. Second, valid alternative analyses are marked wrong simply because they differ from the reference. We introduce BiomniBench, a process-level evaluation framework that scores the full agent trajectory against expert-designed, task-specific rubrics. Our first release, BiomniBench-DA, contains 100 data-analysis tasks across 17 task types, 5 disease areas, and a general-biology category, each based on a paper from journals such as Nature, Cell, and Science and co-developed with an original author or a domain expert. Benchmarking frontier and open-weight models across four agent harnesses reveals three findings. Frontier and open-weight bases cluster within a few points of each other, with substantial headroom for all models. The agent harness shifts scores by more than the gap between successive model generations. Agents reliably ground claims in real sources yet consistently fall short on method selection, biological interpretation, and scientific reasoning. BiomniBench is the first process-level benchmark for LLM agents in biomedical research, providing the dimension-level diagnostics that outcome scoring cannot.
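As a rough illustration of what scoring a trajectory against a task-specific rubric could look like, the sketch below aggregates per-item judgments into an overall score and per-dimension scores. Everything in it is an assumption made for illustration: the abstract does not specify the rubric schema, the weighting, or how items are judged, so the `RubricItem` fields, the dimension labels, and the `score_trajectory` function are hypothetical and are not the BiomniBench implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion (hypothetical schema, not from the paper)."""
    dimension: str    # e.g. "method_selection"; labels are illustrative
    description: str  # what the trajectory must show to earn credit
    weight: float     # relative importance, assumed positive


def score_trajectory(rubric, judgments):
    """Aggregate per-item judgments into overall and per-dimension scores.

    `judgments` maps an item description to True/False, i.e. whether a
    grader decided the trajectory satisfied that item. Items missing from
    `judgments` count as not satisfied.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one item")
    earned = defaultdict(float)
    possible = defaultdict(float)
    for item in rubric:
        possible[item.dimension] += item.weight
        if judgments.get(item.description, False):
            earned[item.dimension] += item.weight
    per_dimension = {d: earned[d] / possible[d] for d in possible}
    overall = sum(earned.values()) / sum(possible.values())
    return overall, per_dimension


# Toy rubric and judgments, invented purely for this example.
RUBRIC = [
    RubricItem("grounding", "cites the dataset actually used", 1.0),
    RubricItem("method_selection", "chooses a suitable statistical test", 2.0),
    RubricItem("interpretation", "relates the result to the biology", 1.0),
]
JUDGMENTS = {
    "cites the dataset actually used": True,
    "chooses a suitable statistical test": False,
    "relates the result to the biology": True,
}

overall, per_dimension = score_trajectory(RUBRIC, JUDGMENTS)
print(overall)        # 0.5
print(per_dimension)  # grounding 1.0, method_selection 0.0, interpretation 1.0
```

In this toy case the final score alone (0.5) hides that the only failure is in method selection; the per-dimension breakdown is what exposes it, which is the kind of diagnostic the abstract attributes to process-level scoring.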

Dataset

huggingface.co/datasets/phylobio/BiomniBench-DA

Version published to 10.64898/2026.05.12.724604 on bioRxiv
May 14, 2026

Related articles

ClaroAI-Bench: Evaluating Agentic Scientific Reproducibility on Real Biomedical Papers
Author: Kyle O’Connell
This article has no evaluations. Latest version: May 12, 2026

Open-Rosalind: Tool-First Biomedical LLM Agents with Process-Aware Benchmarking
Author: Liang Wang
This article has no evaluations. Latest version: May 8, 2026

Evaluating open LLMs for agentic analysis orchestration in a typical biomedical lab
Author: Anton Nekrutenko
This article has no evaluations. Latest version: May 18, 2026
