FHIR-AgentEval: A Modular Sandbox for Benchmarking Clinical LLM Agents with an Evaluation of Memory-Augmented Configurations
Abstract
Healthcare data exchange increasingly relies on HL7 FHIR, but FHIR's implementation complexity creates barriers for clinical workflows. Large language model (LLM) agents could bridge this gap by translating natural language requests into structured FHIR operations, yet their reliability remains unproven. We present FHIR-AgentEval, an extensible evaluation sandbox comprising 43 modular tasks for benchmarking LLM agents on realistic appointment management and genetic testing workflows. Each task executes against a resettable FHIR server, with custom deterministic validation of both the agent's response and the resulting server state. We run an ablation study of five agent configurations, varying access to an on-demand FHIR R4 specification server and to long-term memory trained with or without specification grounding. Across four experimental settings, memory consistently improves task success and reduces strategic failures such as incorrect tool selection and resource-type confusion. On held-out tasks, the best memory configuration improves task success by 9.1% over the baseline, offering a potential pathway toward more robust clinical deployment.
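To illustrate the evaluation loop the abstract describes (reset a sandbox FHIR server, let the agent act, then deterministically check the resulting state), the following minimal Python sketch shows one way such a task harness could look. This is not FHIR-AgentEval's actual code: the base URL, the $reset operation, and the appointment-booking check are all hypothetical, assuming only a plain FHIR R4 REST server reachable over HTTP.

```python
# Hypothetical sketch of one benchmark task's reset/validate cycle.
# FHIR-AgentEval's real task schema is not shown in the abstract; the
# server URL, the $reset endpoint, and the success criterion below are
# illustrative assumptions, not the benchmark's API.
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical sandbox server


def reset_server() -> None:
    """Restore the sandbox to a known fixture state before each task.

    $reset is NOT a standard FHIR operation; it stands in for whatever
    snapshot-reload mechanism a resettable server would expose.
    """
    requests.post(f"{FHIR_BASE}/$reset").raise_for_status()


def validate_server_state(patient_id: str) -> bool:
    """Deterministic check of resulting server state: did the agent leave
    exactly one booked Appointment for the target patient?"""
    resp = requests.get(
        f"{FHIR_BASE}/Appointment",
        params={"actor": f"Patient/{patient_id}", "status": "booked"},
    )
    resp.raise_for_status()
    bundle = resp.json()  # FHIR searchset Bundle
    return bundle.get("total", 0) == 1


if __name__ == "__main__":
    reset_server()
    # ... the agent under test would act here, e.g. handling
    # "Book an appointment for Patient/123" via FHIR tool calls ...
    print("task passed:", validate_server_state("123"))
```

A harness of this shape makes validation independent of the agent's wording: the same state check can also be paired with a separate check of the agent's textual response, matching the abstract's "both the agent's response and the resulting server state."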