FHIR-AgentEval: A Modular Sandbox for Benchmarking Clinical LLM Agents with an Evaluation of Memory-Augmented Configurations
Abstract
Healthcare data exchange increasingly relies on HL7 FHIR, but FHIR's implementation complexity creates barriers for clinical workflows. Large language model (LLM) agents could bridge this gap by translating natural language requests into structured FHIR operations, yet their reliability remains unproven. We present FHIR-AgentEval, an extensible evaluation sandbox comprising 43 modular tasks for benchmarking LLM agents on realistic appointment management and genetic testing workflows. Each task executes against a resettable FHIR server, with custom deterministic validation of both the agent's response and the resulting server state. We run an ablation study of five agent configurations, varying access to an on-demand FHIR R4 specification server and to long-term memory trained with or without specification grounding. Across four experimental settings, memory consistently improves task success and reduces strategic failures such as incorrect tool selection and resource-type confusion. On held-out tasks, the best memory configuration improves task success by 9.1% over the baseline, offering a potential pathway toward more robust clinical deployment.
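To illustrate the evaluation loop the abstract describes (reset a sandbox FHIR server, let the agent act, then deterministically check the resulting state), the following minimal Python sketch shows one way such a task harness could look. This is not FHIR-AgentEval's actual code: the base URL, the $reset operation, and the appointment-booking check are all hypothetical, assuming only a plain FHIR R4 REST server reachable over HTTP.

```python
# Hypothetical sketch of one benchmark task's reset/validate cycle.
# FHIR-AgentEval's real task schema is not shown in the abstract; the
# server URL, the $reset endpoint, and the success criterion below are
# illustrative assumptions, not the benchmark's API.
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical sandbox server


def reset_server() -> None:
    """Restore the sandbox to a known fixture state before each task.

    $reset is NOT a standard FHIR operation; it stands in for whatever
    snapshot-reload mechanism a resettable server would expose.
    """
    requests.post(f"{FHIR_BASE}/$reset").raise_for_status()


def validate_server_state(patient_id: str) -> bool:
    """Deterministic check of resulting server state: did the agent leave
    exactly one booked Appointment for the target patient?"""
    resp = requests.get(
        f"{FHIR_BASE}/Appointment",
        params={"actor": f"Patient/{patient_id}", "status": "booked"},
    )
    resp.raise_for_status()
    bundle = resp.json()  # FHIR searchset Bundle
    return bundle.get("total", 0) == 1


if __name__ == "__main__":
    reset_server()
    # ... the agent under test would act here, e.g. handling
    # "Book an appointment for Patient/123" via FHIR tool calls ...
    print("task passed:", validate_server_state("123"))
```

A harness of this shape makes validation independent of the agent's wording: the same state check can also be paired with a separate check of the agent's textual response, matching the abstract's "both the agent's response and the resulting server state."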