Clinical Agents Don’t Care

Abstract

Background

Large language models (LLMs) now power clinical agents that can plan, call tools, and write into electronic health records (EHRs). They are becoming actors, not assistants. Given known LLM failure modes, quality assurance is essential before clinical use. A key question is whether agents notice patient-identity errors or remain indifferent to them.

Methods

We created a record environment from publicly available real-world emergency department data in MIMIC-IV. Agents were instructed to copy ICD-10 codes from visit headers into patient records using Extract and Store tools, with the option to record “UNKNOWN” or abstain if uncertain.
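
A minimal sketch of what the Extract and Store tool interface could look like, assuming a JSON-style tool schema; the tool names follow the abstract, but the field names and structure here are illustrative assumptions, not the authors' implementation.

# Hypothetical tool definitions for the code-copying task (field names
# and schema layout are assumptions for illustration only).
EXTRACT_TOOL = {
    "name": "Extract",
    "description": "Read the ICD-10 codes listed in a visit header.",
    "parameters": {
        "type": "object",
        "properties": {"visit_id": {"type": "string"}},
        "required": ["visit_id"],
    },
}

STORE_TOOL = {
    "name": "Store",
    "description": (
        "Write ICD-10 codes into the patient record, or the literal string "
        "'UNKNOWN' if uncertain; the agent may also abstain by not calling Store."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "patient_mrn": {"type": "string"},
            "codes": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["patient_mrn", "codes"],
    },
}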

Each agent was presented with a batch of ten records from the same patient (the clean version). We then tampered with one of the records and evaluated how the agent responded.

We ran four separate batches: the clean baseline batch, a batch in which one visit’s header was fully swapped with another patient’s, a batch in which one visit’s MRN was changed by a single digit, and a batch in which the age was shifted in one visit.
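
The three tampering conditions could be constructed roughly as below; the record layout and helper names are assumptions for clarity, not the study’s actual preprocessing code.

# Illustrative sketches of the three tampering operations (hypothetical
# record structure: each record has a "header" dict with identity fields).
import copy
import random

def swap_header(record, other_patient_record):
    """Replace the full identity header with another patient's header."""
    tampered = copy.deepcopy(record)
    tampered["header"] = copy.deepcopy(other_patient_record["header"])
    return tampered

def change_one_mrn_digit(record):
    """Alter a single digit of the MRN, leaving everything else intact."""
    tampered = copy.deepcopy(record)
    mrn = list(tampered["header"]["mrn"])          # assumes an all-digit MRN
    i = random.randrange(len(mrn))
    mrn[i] = str((int(mrn[i]) + 1) % 10)           # bump one digit
    tampered["header"]["mrn"] = "".join(mrn)
    return tampered

def shift_age(record, years=10):
    """Shift the recorded age in one visit by a fixed offset."""
    tampered = copy.deepcopy(record)
    tampered["header"]["age"] += years
    return tampered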

Six models, both closed- and open-weight, completed 1.2 million tool calls. The primary endpoint was whether agents identified visits whose identity fields were inconsistent with the patient’s record.
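
One way the endpoint could be scored is the fraction of runs in which the agent flagged the tampered visit (stored “UNKNOWN” or abstained) rather than copying codes into it; the run structure below is an assumption for illustration.

def detection_rate(runs, tampered_visit_id):
    """Fraction of runs in which the tampered visit was flagged or skipped."""
    flagged = 0
    for run in runs:
        # Hypothetical log format: store_calls maps visit_id -> stored codes.
        action = run["store_calls"].get(tampered_visit_id)
        # Count a detection if the agent wrote UNKNOWN or never stored codes.
        if action is None or action == ["UNKNOWN"]:
            flagged += 1
    return flagged / len(runs)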

Results

Agents frequently failed to notice tampering and copied codes into the altered charts. GPT-4.1 flagged mismatched headers as UNKNOWN in 17.4% of runs but never detected subtle faults. GPT-4.1-nano detected 4.4% of header swaps and <1% of MRN or age changes. GPT-5-chat never identified mismatches but omitted responses in 12.6% of cases. Other models rarely abstained. Subtle tampering passed almost entirely without detection.

Conclusions

Clinical agents are often indifferent to patient-identity inconsistencies. The central risk is misbinding, not miscoding. Safe deployment requires explicit identity verification, abstention when uncertain, and benchmarks that treat record integrity, not just accuracy, as a primary outcome.