Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges: A Randomized Controlled Trial

Ihsan Ayyub Qazi
Ayesha Ali
Asad Ullah Khawaja
Muhammad Junaid Akhtar
Ali Zafar Sheikh
Muhammad Hamad Alizai

Abstract

As large language models (LLMs) enter clinical workflows, automation bias, the uncritical acceptance of automated output, poses a patient-safety risk. Optimal physician-AI collaboration requires trust calibration: matching the level of scrutiny to the accuracy of the LLM's recommendations. We report a randomized trial evaluating a behavioral nudge to mitigate automation bias. Seventy-two AI-trained physicians were randomized to evaluate six vignettes alongside ChatGPT-5.1 recommendations, consulted at each physician’s discretion; three contained deliberate, clinically significant errors. The treatment arm received a dual-component nudge: an anchoring cue reporting ChatGPT’s benchmark accuracy to calibrate expectations, and a case-specific selective-attention cue consisting of a numeric accuracy rating and a color-coded traffic light, derived from the mean across three LLMs from distinct model families. The control group saw the recommendations alone. Blinded reviewers scored diagnostic reasoning against an expert rubric. The treatment group scored significantly higher than the control group (mean difference, 7.6 percentage points; 95% CI, 1.4-13.9; P = 0.016), suggesting a scalable strategy to preserve clinical judgment in LLM-assisted care. ClinicalTrials.gov registration: NCT07328815.
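
The case-specific cue described in the abstract (a numeric rating plus a traffic light, derived from the mean of three LLMs) could be sketched roughly as follows. This is only an illustration under stated assumptions: the 0-100 rating scale, the cut-offs `GREEN_MIN` and `AMBER_MIN`, and the function name `traffic_light` are all hypothetical, since the abstract does not report how ratings were scaled or where the color boundaries were drawn.

```python
from statistics import mean

# Hypothetical cut-offs on an assumed 0-100 scale; the abstract does not
# state the thresholds the trial actually used.
GREEN_MIN = 80.0
AMBER_MIN = 50.0


def traffic_light(ratings):
    """Average three per-model accuracy ratings and map the mean to a color.

    `ratings` holds one rating per LLM (assumed 0-100), one from each of
    three distinct model families.
    """
    if len(ratings) != 3:
        raise ValueError("expected one rating from each of three LLMs")
    if any(not 0 <= r <= 100 for r in ratings):
        raise ValueError("ratings must lie between 0 and 100")

    score = mean(ratings)
    if score >= GREEN_MIN:
        color = "green"
    elif score >= AMBER_MIN:
        color = "amber"
    else:
        color = "red"
    return score, color


if __name__ == "__main__":
    # Illustrative ratings only, not trial data.
    print(traffic_light([90, 85, 95]))
    print(traffic_light([10, 20, 30]))
```

Averaging across model families, rather than relying on a single rater, is the only design detail taken from the abstract; everything else above is a placeholder.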

Version published to 10.64898/2026.06.01.26354596 on medRxiv
Jun 2, 2026

Related articles

Role-Prompting in Frontier Large Language Models Influences Clinical Reasoning in Complex Medical Cases

This article has 8 authors:
1. Chintan Dave
2. Adrianna Diviero
3. Tashni Dassanayake
4. Salman J. Alshahrani
5. Anas Al Mardini
6. Widad Khadir
7. Ashaki D. Patel
8. Adithya Srivastava
This article has no evaluations. Latest version Jul 1, 2026

The Unreliable Judges: Assessing Reproducibility and Self-Preference Bias of LLMs as Free-Text Evaluators

This article has 4 authors:
1. J I Alvarez-Arenas
2. D Jimenez-Carretero
3. D Mañanes
4. F Sanchez-Cabo
This article has no evaluations. Latest version Jun 17, 2026

Physician epistemic framing alters the accuracy of large language models for medical second opinions

This article has 4 authors:
1. Florian Reis
2. Wilfried Kunde
3. Felix Balzer
4. Sebastian D. Boie
This article has no evaluations. Latest version Jul 2, 2026
