MedMisBench: Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Hongjian Zhou
Xinyu Zou
Jinge Wu
Sean Wu
Junchi Yu
Bradley Max Segal
Tobias Erich Niebuhr
Sara Amro
Michael Petrus
Sheikh Momin
Alexandra Cardoso Pinto
Rachel Niesen
Laura Sophie Wegner
Dhruv Darji
Jung Moses Koo
Joshua Fieggen
Kapil Narain
Mingde Zeng
Lei Clifton
Linda Shapiro
Fenglin Liu
David A. Clifton

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, which encourages the assumption that high scores imply safe medical judgment, even as patients increasingly turn to these models for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they often abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
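
The two headline metrics in the abstract, accuracy with and without misleading context and attack success, can be illustrated with a minimal sketch. The abstract does not give the exact definition of attack success, so the code assumes it is the share of originally correct answers that become incorrect once the misleading context is injected. The `Trial` record, the function names, and the four toy trials are hypothetical and are not taken from MedMisBench.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question paired with one misleading context (hypothetical layout)."""
    gold: str             # correct option label
    clean_answer: str     # model answer without injected context
    attacked_answer: str  # model answer with misleading context injected


def accuracy(trials, attacked=False):
    """Fraction of trials answered correctly, with or without the context."""
    if not trials:
        return 0.0
    hits = sum(
        (t.attacked_answer if attacked else t.clean_answer) == t.gold
        for t in trials
    )
    return hits / len(trials)


def attack_success_rate(trials):
    """Assumed definition: among trials answered correctly without the
    context, the share that are no longer correct after injection."""
    eligible = [t for t in trials if t.clean_answer == t.gold]
    if not eligible:
        return 0.0
    flipped = sum(t.attacked_answer != t.gold for t in eligible)
    return flipped / len(eligible)


# Toy data for illustration only; not drawn from the benchmark.
trials = [
    Trial(gold="A", clean_answer="A", attacked_answer="B"),  # flipped
    Trial(gold="C", clean_answer="C", attacked_answer="C"),  # held firm
    Trial(gold="B", clean_answer="B", attacked_answer="D"),  # flipped
    Trial(gold="D", clean_answer="A", attacked_answer="A"),  # wrong from the start
]

clean_acc = accuracy(trials)
attacked_acc = accuracy(trials, attacked=True)
asr = attack_success_rate(trials)
print(f"clean={clean_acc:.2f} attacked={attacked_acc:.2f} attack_success={asr:.2f}")
```

Restricting the attack success denominator to originally correct answers keeps the metric from counting questions the model would have missed anyway, which is one plausible reason the two numbers in the abstract are reported separately.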

Version published to 10.64898/2026.05.25.727671 on bioRxiv
May 28, 2026

Related articles

A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

This article has 24 authors:
1. Joshua Proulx
2. Bryce Daines
3. Michael Barton
4. Molly E. Leonard
5. Joseph A. Garcia
6. Bronson Young
7. Quinn Snell
8. Timothy W. West
9. Sam R. Watson
10. Maryam AlQaseer
11. Mathieu Louiset
12. Muhammad Bilal Maqsood
13. Mary J. Voutt-Goos
14. Caryn Douma
15. Nishaminy Kasbekar
16. Jaclyn Jeffries
17. Wadie Abu-Rahmeh
18. Karen Frush
19. Darshan K. Grewal
20. Mouna Bahsoun
21. Michael Leonard
22. Allan Frankel
23. David C. Classen
24. Stanley L. Pestotnik
This article has no evaluations
Latest version Jun 10, 2026

Evidence-Graded Decision Authorization for Safe Clinical AI: A Constrained Reasoning Framework

This article has 3 authors:
1. Che Lin
2. Jia-Yi Lin
3. Yao-San Lin
This article has no evaluations
Latest version May 22, 2026

Combined values alignment and epistemic verification prevent delusional reinforcement in conversational AI agents

This article has 4 authors:
1. Anna Carrano
2. Milit S. Patel
3. Stella Hartono
4. Stephen C. Ekker
This article has no evaluations
Latest version Jun 2, 2026