LLMs Can Do Medical Harm: Stress-Testing Clinical Decisions Under Social Pressure
Abstract
Background
Large language models (LLMs) are entering clinical workflows, yet their effect on clinical decisions and potential for harm are uncertain.
Methods
We measured harmful decision output from an ensemble of 20 LLMs across more than 10 million clinical scenarios involving safety or ethical dilemmas. Each case was presented under a neutral control condition and six Milgram-style social-pressure conditions, with or without a brief mitigation cue (“verify or escalate if unsafe”). The primary outcome was the proportion of potentially harmful responses, compared with two-proportion and χ² tests and with confirmatory mixed-effects logistic models.
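To make the primary comparison concrete, the sketch below runs a χ² test and an equivalent two-proportion z-test on a 2×2 table of harmful vs. non-harmful outputs, with and without the mitigation cue. The counts are not the study data; they are assumed round numbers chosen only to match the reported rates and illustrate the shape of the calculation.

```python
# Illustrative sketch of the primary contrast: proportion of potentially harmful
# outputs with vs. without the mitigation cue, compared by a chi-square test.
# Counts below are ASSUMED for illustration (roughly matching 16.6% vs 10.1%),
# not the study's actual data.
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Each row: [harmful, not harmful] counts for one arm (assumed split of runs)
unmitigated = [830_000, 4_170_000]   # ~16.6% harmful of 5,000,000 assumed runs
mitigated   = [505_000, 4_495_000]   # ~10.1% harmful of 5,000,000 assumed runs

table = [unmitigated, mitigated]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")

# Equivalent two-proportion z-test on the same assumed counts
count = [unmitigated[0], mitigated[0]]
nobs = [sum(unmitigated), sum(mitigated)]
z, p_z = proportions_ztest(count, nobs)
print(f"z = {z:.1f}, p = {p_z:.3g}")
```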
Results
Across all runs (N = 10,096,800), LLMs produced 1.18 million potentially harmful outputs (11.7%). The mitigation cue reduced harmful decisions from 16.6% to 10.1% (p < 0.001). Under social pressure, models behaved predictably but unevenly: prompts framed as authority or responsibility transfer generated the most harmful responses, whereas the neutral, pressure-free control prompts produced the fewest (mitigated 8.3–9.6%; unmitigated 14.3–16.0%; χ², p < 0.001). In other words, when told what to do, or told that someone else would take responsibility, models were more likely to comply, even when the instruction was unsafe. These effects were consistent across datasets and models.
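The confirmatory analysis uses mixed-effects logistic models; as a simplified stand-in, the sketch below fits a fixed-effects logistic regression of harm on pressure condition and mitigation, with standard errors clustered by model (replacing the paper's random effects with cluster-robust inference). The data frame `df`, the file name, and the column names `harmful`, `condition`, `mitigated`, and `model_id` are assumptions for illustration, not the study's actual schema.

```python
# Simplified stand-in for the confirmatory analysis: logistic regression of
# harmful output on pressure condition and mitigation cue, with standard errors
# clustered by model. The paper uses mixed-effects logistic models; this sketch
# substitutes cluster-robust inference. All names below are assumed.
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per scenario run, with columns
#   harmful   - 1 if the response was judged potentially harmful, else 0
#   condition - "control" or one of the six social-pressure framings
#   mitigated - 1 if the "verify or escalate if unsafe" cue was shown, else 0
#   model_id  - which of the 20 LLMs produced the response
df = pd.read_csv("runs.csv")  # hypothetical file of per-run outcomes

fit = smf.logit(
    "harmful ~ C(condition, Treatment('control')) + mitigated", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["model_id"]})
print(fit.summary())
```

Exponentiating the condition coefficients gives the odds of a harmful response under each pressure framing relative to the neutral control, holding mitigation constant.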
Conclusion
LLMs can generate harmful medical decisions at scale. A brief safety reminder reduces, but does not eliminate, this behavior. These results highlight the need to measure harm propensity as a core performance metric and to maintain guardrails and continuous physician oversight before integrating LLMs into clinical decision-making.