Trustworthy agentic genomics through versioned skill libraries

Manuel Corpas
Alfredo Iacoangeli
Mathieu Bourdenx
Mahmoud Aldraimli
Nathan Skene
Segun Fatumo
Heinner Guio

Abstract

Genomics is adopting autonomous AI agents that interpret genomes from natural-language instructions faster than it is building the means to trust them. We report the first large-scale controlled evaluation of where, in an agentic genomic pipeline, correctness must reside for the system to be trustworthy at clinical scale. Using pharmacogenomics, a domain where errors are measurable and sometimes lethal, we benchmarked nine frontier large language models across 44,550 scored evaluations on 110 pharmacogenomic cases, and tested model interpretation of real star-allele diplotypes from more than 7,000 individuals in three ancestrally diverse populations. Trustworthiness proved to be a property of pipeline architecture, not of the model. Letting the model reason was stochastic and unsafe, and grounding it in the correct guidelines by retrieval paradoxically increased lethal-class errors. Encoding the validated decision logic as a versioned skill and executing it as code made the pharmacogenomic mapping exact, auditable and identical across models, confining all residual error to a single input-interpretation step. On individual genomes, unguarded model interpretation degraded along an ancestry gradient; execution removes this gradient from the clinical mapping, relocating it to the auditable completeness of the input caller. This establishes a generalisable, auditable architecture for trustworthy agentic genome interpretation at scale.

Highlights

Correctness must be executed, not reasoned or retrieved, to be trustworthy
Retrieval raises phenotype accuracy yet increases lethal-class errors; skills do not
Execution makes the clinical mapping exact and model-invariant; error stays at input
A deterministic input caller is the predicted route to all-correct emitted answers

In brief

Corpas and colleagues show that trustworthy agentic genome interpretation comes not from making language models reason correctly about biology, but from confining them to interpreting input while versioned, validated skills do the reasoning as executed code. Across nine large language models and 110 pharmacogenomics cases, executing the skill makes the clinical mapping deterministic, auditable and model-invariant.

Significance

Genomics is adopting autonomous, language-model-mediated agents faster than it is building the standards needed to trust them. On a pharmacogenomic benchmark with lethal-class consequences, we show that an agent’s trustworthiness is a property not of the model but of how the agent is constrained: correctness must be moved out of the stochastic model into a versioned skill executed as code, with the model confined to interpreting heterogeneous input. This gives the field a transferable architecture for trustworthy agentic genome interpretation, a predicted route to deploying it so that every emitted answer is correct (execute the validated skill, call the input deterministically, and abstain on the irreducible residual), and a way to develop genomic skills as validated, executable, versioned units rather than prompts. Following a validation framework described elsewhere, we use “clinical-grade” to mean determinism, auditability, traceability to versioned components and population-invariant performance, all achieved under skill-constrained execution. We distinguish two senses of population performance: the executed clinical mapping is population-invariant by construction, verified across individuals of European, Latin American and East African origin, whereas the model’s interpretation of real, ancestrally diverse diplotypes is not, degrading along an ancestry gradient, which is precisely why the mapping must be executed rather than reasoned. We do not claim full clinical validation, which would additionally require non-canonical inputs, real-world genomic and clinical data, human comparators and multi-site concordance.
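
The three-step route described above (execute the validated skill, call the input deterministically, abstain on the residual) can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the authors' implementation: the names `SKILL_VERSION`, `call_input` and `run_skill`, the input format, and the three-entry CYP2C19 table are all hypothetical, and the table is illustrative only, not a validated clinical resource.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical version tag: every answer is traceable to the skill that produced it.
SKILL_VERSION = "cyp2c19-phenotype/0.1.0"

# Illustrative entries only (assumption): a real skill would carry a validated table.
_PHENOTYPE_TABLE = {
    ("CYP2C19", ("*1", "*1")): "normal metabolizer",
    ("CYP2C19", ("*1", "*2")): "intermediate metabolizer",
    ("CYP2C19", ("*2", "*2")): "poor metabolizer",
}

# Deterministic input caller: accepts strings such as "CYP2C19 *1/*2".
_DIPLOTYPE_RE = re.compile(r"^\s*([A-Za-z0-9]+)\s+(\*\d+)\s*/\s*(\*\d+)\s*$")


@dataclass(frozen=True)
class Result:
    phenotype: Optional[str]  # None when the skill abstains
    skill_version: str
    abstained: bool


def call_input(text: str) -> Optional[Tuple[str, Tuple[str, str]]]:
    """Parse a gene and diplotype without any model in the loop."""
    match = _DIPLOTYPE_RE.match(text)
    if match is None:
        return None
    gene, first, second = match.groups()
    # Order alleles numerically so "*2/*1" and "*1/*2" map to the same key.
    low, high = sorted((first, second), key=lambda allele: int(allele[1:]))
    return gene.upper(), (low, high)


def run_skill(text: str) -> Result:
    """Execute the mapping as code; abstain on anything outside the table."""
    key = call_input(text)
    phenotype = _PHENOTYPE_TABLE.get(key) if key is not None else None
    return Result(phenotype, SKILL_VERSION, abstained=phenotype is None)
```

In this sketch the same input always yields the same output regardless of which language model sits upstream, and anything the caller cannot parse or the table does not cover produces an explicit abstention rather than a guess.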

Version published to 10.64898/2026.06.11.731523 on bioRxiv
Jun 15, 2026
