Clinical Agents Don’t Care

Abstract

Background

Large language models (LLMs) now power clinical agents that can plan, call tools, and write into electronic health records (EHRs). They are becoming actors, not assistants. Given known LLM failure modes, quality assurance is essential before clinical use. A key question is whether agents notice patient-identity errors or remain indifferent to them.

Methods

We created a record environment from publicly available real-world emergency department data in MIMIC-IV. Agents were instructed to copy ICD-10 codes from visit headers into patient records using Extract and Store tools, with the option to record “UNKNOWN” or abstain if uncertain.
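
A minimal sketch of what the Extract and Store tool interface could look like, assuming a JSON-style tool schema; the tool names follow the abstract, but the field names and structure here are illustrative assumptions, not the authors' implementation.

# Hypothetical tool definitions for the code-copying task (field names
# and schema layout are assumptions for illustration only).
EXTRACT_TOOL = {
    "name": "Extract",
    "description": "Read the ICD-10 codes listed in a visit header.",
    "parameters": {
        "type": "object",
        "properties": {"visit_id": {"type": "string"}},
        "required": ["visit_id"],
    },
}

STORE_TOOL = {
    "name": "Store",
    "description": (
        "Write ICD-10 codes into the patient record, or the literal string "
        "'UNKNOWN' if uncertain; the agent may also abstain by not calling Store."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "patient_mrn": {"type": "string"},
            "codes": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["patient_mrn", "codes"],
    },
}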

Each agent was presented with a batch of ten records from the same patient (the clean version). We then tampered with one of the records and evaluated how the agent responded.

We ran four separate batches: the clean baseline batch, a batch in which one visit’s header was fully swapped with another patient’s, a batch in which one visit’s MRN was changed by a single digit, and a batch in which the age was shifted in one visit.
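
The three tampering conditions could be constructed roughly as below; the record layout and helper names are assumptions for clarity, not the study’s actual preprocessing code.

# Illustrative sketches of the three tampering operations (hypothetical
# record structure: each record has a "header" dict with identity fields).
import copy
import random

def swap_header(record, other_patient_record):
    """Replace the full identity header with another patient's header."""
    tampered = copy.deepcopy(record)
    tampered["header"] = copy.deepcopy(other_patient_record["header"])
    return tampered

def change_one_mrn_digit(record):
    """Alter a single digit of the MRN, leaving everything else intact."""
    tampered = copy.deepcopy(record)
    mrn = list(tampered["header"]["mrn"])          # assumes an all-digit MRN
    i = random.randrange(len(mrn))
    mrn[i] = str((int(mrn[i]) + 1) % 10)           # bump one digit
    tampered["header"]["mrn"] = "".join(mrn)
    return tampered

def shift_age(record, years=10):
    """Shift the recorded age in one visit by a fixed offset."""
    tampered = copy.deepcopy(record)
    tampered["header"]["age"] += years
    return tampered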

Six models, both closed- and open-weight, completed 1.2 million tool calls. The primary endpoint was whether agents identified visits whose identity fields were inconsistent with the patient’s record.
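
One way the endpoint could be scored is the fraction of runs in which the agent flagged the tampered visit (stored “UNKNOWN” or abstained) rather than copying codes into it; the run structure below is an assumption for illustration.

def detection_rate(runs, tampered_visit_id):
    """Fraction of runs in which the tampered visit was flagged or skipped."""
    flagged = 0
    for run in runs:
        # Hypothetical log format: store_calls maps visit_id -> stored codes.
        action = run["store_calls"].get(tampered_visit_id)
        # Count a detection if the agent wrote UNKNOWN or never stored codes.
        if action is None or action == ["UNKNOWN"]:
            flagged += 1
    return flagged / len(runs)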

Results

Agents frequently failed to notice tampering and copied codes into the altered charts. GPT-4.1 flagged mismatched headers as UNKNOWN in 17.4% of runs but never detected subtle faults. GPT-4.1-nano detected 4.4% of header swaps and <1% of MRN or age changes. GPT-5-chat never identified mismatches but omitted responses in 12.6% of cases. Other models rarely abstained. Subtle tampering passed almost entirely without detection.

Conclusions

Clinical agents are often indifferent to patient-identity inconsistencies. The central risk is misbinding, not miscoding. Safe deployment requires explicit identity verification, abstention when uncertain, and benchmarks that treat record integrity, not just accuracy, as a primary outcome.