Agentic systems are adept at solving well-scoped, verifiable problems in computational biology

Surag Nair
Laura Gunsalus
Brian Orcutt-Jahns
Jordan Rossen
Avantika Lal
Carlo De Donno
Muhammed Hasan Celik
Kipper Fletez-Brant
Xiaoman Xie
Hector Corrada Bravo
Gokcen Eraslan


Abstract

We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems that have a single ground-truth answer and require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy, Gemini CLI (3.1 Pro) reaching 82%, Claude Code (Opus 4.6) reaching 81%, and Claude Code (Opus 4.7) reaching 78%. On the hardest questions, Claude Code (Opus 4.6) reaches 69%, Codex CLI (GPT 5.4) reaches 59%, and Gemini CLI (3.1 Pro) reaches 49%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design. Data and a public leaderboard are available at https://huggingface.co/collections/Genentech/compbiobench-v1.
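The metadata scrambling/scrubbing idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `scrub_and_scramble` and `accuracy`, the record layout, and the toy samples are all assumptions made for illustration. The sketch hides a metadata label, replaces informative sample IDs with anonymous ones, and keeps the hidden labels as a single ground-truth answer key against which predictions can be scored.

```python
import random


def scrub_and_scramble(samples, label_key, seed=0):
    """Hypothetical helper: return (public_samples, answer_key).

    samples: list of dicts, each with an "id" and a metadata field `label_key`.
    The label is removed from the public records and stored in the answer key.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)  # scramble row order so position does not leak the label

    public, answer_key = [], {}
    for new_index, old_index in enumerate(order):
        record = dict(samples[old_index])  # shallow copy; input is not modified
        anon_id = f"sample_{new_index:03d}"
        answer_key[anon_id] = record.pop(label_key)  # scrub the label
        record["id"] = anon_id  # replace the informative ID
        public.append(record)
    return public, answer_key


def accuracy(predictions, answer_key):
    """Fraction of anonymized samples whose predicted label matches exactly."""
    correct = sum(predictions.get(k) == v for k, v in answer_key.items())
    return correct / len(answer_key)


# Toy data (invented for this example, not taken from the benchmark).
samples = [
    {"id": "liver_rep1", "tissue": "liver", "counts": [5, 0, 9]},
    {"id": "brain_rep1", "tissue": "brain", "counts": [0, 7, 1]},
    {"id": "heart_rep1", "tissue": "heart", "counts": [2, 2, 8]},
]
public, answer_key = scrub_and_scramble(samples, "tissue", seed=1)
```

An agent would receive only `public` and would have to infer each label from the data itself; exact-match scoring against `answer_key` is what makes the evaluation objective in this sketch.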

Version published to 10.64898/2026.04.06.716850 on bioRxiv
Apr 9, 2026

Related articles

MechAInistic: An LLM-guided Multi-Agent System for Reasoning over Genome-Scale Constraint-Based Metabolic Models

This article has 7 authors:
1. Josh Loecker
2. Narayna Puraja
3. William Bryant
4. Bhanwar Lal Puniya
5. Prakash Packrisamy
6. Ahmed Abdeen Hamed
7. Tomáš Helikar
This article has no evaluations
Latest version May 13, 2026

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

This article has 17 authors:
1. Lisheng Zhang
2. Lilong Wang
3. Xiangyu Sun
4. Wei Tang
5. Haoyang Su
6. Yuehui Qian
7. Qikui Yang
8. Qingsong Li
9. Zhenyu Tang
10. Haoran Sun
11. Yingnan Han
12. Yankai Jiang
13. Wenjie Lou
14. Bowen Zhou
15. Xiaosong Wang
16. Lei Bai
17. Zhengwei Xie
This article has no evaluations
Latest version Apr 6, 2026

S2F-agent: Skill-grounded agent for Sequence-to-Function computational genomics workflows

This article has 2 authors:
1. Jiaqi Li
2. Zhiwei Bao
This article has no evaluations
Latest version May 15, 2026
