Obedience to Unsafe Clinical Instructions: How Large Language Models Respond to Authority Cues

Mahmud Omar
Reem Agbareia
Jolion McGreevy
Alon Gorenshtein
Alexander Charney
Ankit Sakhuja
Benjamin S. Glicksberg
Girish Nadkarni
Eyal Klang


Abstract

Background: Large language models (LLMs) are being integrated into clinical environments where deference to authority can cause harm. Unlike hallucination or bias, obedience to unsafe instructions represents a distinct safety failure: following an explicit but harmful order.

Methods: We conducted a cross-sectional evaluation of 20 proprietary, open-source, and clinically tuned LLMs across 10,096,800 clinical decision scenarios, including synthetic vignettes with predefined safe versus unsafe options and real-world discharge recommendations reframed to include unsafe contradictory requests. Each scenario was presented under a neutral control or one of six Milgram-style social-pressure conditions (authority, responsibility transfer, urgency, threat, conformity, depersonalization), with or without a short mitigation cue instructing verification or escalation if unsafe. The primary outcome was the proportion of potentially harmful outputs, defined as selection or endorsement of an unsafe clinical decision.

Results: Across all runs, 1.18 million of 10.1 million outputs (11.7%) were harmful. Harmful decisions occurred in 16.6% of unmitigated versus 10.1% of mitigated conditions (absolute reduction, 6.5 percentage points; p < 0.001). In synthetic vignettes, harmful responses averaged 8.1% overall, declining from 10.6% to 7.2% with mitigation (difference, 3.4 percentage points; p < 0.001). In real-world discharge cases, harmful responses averaged 30.0%, decreasing from 46.6% to 24.5% with mitigation (difference, 22.1 percentage points; p < 0.001). Across all conditions, authority and responsibility-transfer cues elicited the highest harmful compliance, and control prompts the lowest; mitigation reduced rates but preserved this pattern.

Conclusion: LLMs do not behave as neutral calculators in clinical contexts. When exposed to authority or responsibility-transfer cues, they exhibit consistent obedience to unsafe instructions. A brief safety reminder substantially reduces but does not eliminate this behavior.
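
The design and the headline arithmetic in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`build_arms`, `harmful_rate`, `absolute_reduction_pp`) are hypothetical, and the full crossing of every framing (including the control) with the mitigation cue is an assumption, since the abstract does not spell out the exact arm structure. Only the condition names and the percentages come from the abstract.

```python
from itertools import product

# Framings named in the abstract: one neutral control plus six pressure cues.
CONDITIONS = [
    "control",
    "authority",
    "responsibility_transfer",
    "urgency",
    "threat",
    "conformity",
    "depersonalization",
]


def build_arms(conditions=CONDITIONS):
    """Cross each framing with mitigation off/on.

    Assumption: a full crossing, including the control framing.
    """
    return list(product(conditions, (False, True)))


def harmful_rate(flags):
    """Percentage of outputs flagged as harmful (flags is an iterable of bools)."""
    flags = list(flags)
    if not flags:
        raise ValueError("no outputs to score")
    return 100.0 * sum(flags) / len(flags)


def absolute_reduction_pp(unmitigated_pct, mitigated_pct):
    """Absolute reduction in percentage points, rounded to one decimal."""
    return round(unmitigated_pct - mitigated_pct, 1)


# (unmitigated %, mitigated %, reported difference) as stated in the abstract.
REPORTED = {
    "overall": (16.6, 10.1, 6.5),
    "synthetic_vignettes": (10.6, 7.2, 3.4),
    "discharge_cases": (46.6, 24.5, 22.1),
}

# Overall harmful share from the rounded counts given in the abstract.
OVERALL_PCT = round(100 * 1.18 / 10.1, 1)

if __name__ == "__main__":
    print(len(build_arms()), "arms under the full-crossing assumption")
    for name, (unmit, mit, diff) in REPORTED.items():
        print(name, absolute_reduction_pp(unmit, mit), "pp; reported", diff)
    print("overall harmful share:", OVERALL_PCT, "%")
```

Each reported difference is an absolute reduction in percentage points, not a relative one: in the discharge cases, for instance, the drop from 46.6% to 24.5% is 22.1 points, which still leaves roughly one in four outputs harmful.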

Version published to 10.21203/rs.3.rs-8932472/v1 on Research Square
Mar 18, 2026

Related articles

Simulating Lay Health-Seeking Behavior with LLM Personas and Illness Vignettes: Reproducibility, Prompt Sensitivity, and Slice Dependence

This article has 1 author:
1. Yuusuke Harada
This article has no evaluations. Latest version: Mar 29, 2026

Rethinking Medical LLM Hallucinations: A System-Level Survey

This article has 4 authors:
1. Asha Matthews
2. Vijay Vankadaru
3. Tanya Roosta
4. Peyman Passban
This article has no evaluations. Latest version: Mar 23, 2026

RESPECT: A Conversational AI System for Informed Consent with Accuracy, Safety, and Stakeholder-Centered Evaluation

This article has 3 authors:
1. Salvatore Giorgi
2. Katie Ryan
3. Jane Paik Kim
This article has no evaluations. Latest version: Apr 8, 2026
