Automation Bias in Large Language Model Assisted Diagnostic Reasoning Among AI-Trained Physicians
Abstract
Importance
Large language models (LLMs) show promise for improving clinical reasoning, but they also risk inducing automation bias, an over-reliance on AI recommendations that can degrade diagnostic accuracy. Whether AI-trained physicians are vulnerable to this bias when LLM use is voluntary remains unknown.
Objective
To determine whether exposure to erroneous LLM recommendations degrades AI-trained physicians’ diagnostic performance relative to error-free recommendations.
Design
A single-blind randomized clinical trial was conducted from June 20 to August 15, 2025.
Setting
Physicians were recruited from multiple medical institutions in Pakistan and participated in person or remotely via video conferencing.
Participants
Physicians holding MBBS degrees and registered with the Pakistan Medical and Dental Council who had completed 20 hours of AI-literacy training covering LLM capabilities, prompt engineering, and critical evaluation of AI output.
Intervention
Participants were randomized 1:1 to diagnose 6 clinical vignettes in 75 minutes. The control group was offered unmodified diagnostic recommendations from ChatGPT-4o; the treatment group’s recommendations contained deliberate errors in 3 of the 6 vignettes. In both groups, consulting the offered ChatGPT-4o recommendations was voluntary, and physicians could also use conventional diagnostic resources according to their clinical judgment.
Main Outcomes and Measures
The primary outcome was diagnostic reasoning accuracy (percentage), assessed by 3 blinded physicians using an expert-validated rubric evaluating differential diagnosis accuracy, appropriateness of supporting and opposing evidence, and quality of recommended diagnostic steps. The secondary outcome was top-choice diagnosis accuracy.
Results
Forty-four physicians (22 treatment, 22 control) participated. Physicians receiving error-free recommendations achieved a mean (SD) diagnostic accuracy of 84.9% (19.7%), whereas those exposed to flawed recommendations scored 73.3% (30.5%), an adjusted mean difference of -14.0 percentage points (95% CI, -19.7 to -8.3; P < .0001). Top-choice diagnosis accuracy per case was 76.1% (42.5%) in the treatment group and 90.5% (28.9%) in the control group, an adjusted difference of -18.3 percentage points (95% CI, -26.6 to -10.0; P < .0001).
Conclusions and Relevance
This trial demonstrates that erroneous LLM recommendations significantly degrade diagnostic performance by inducing automation bias, even in AI-trained physicians. Voluntary deference to flawed AI output highlights a critical patient safety risk, necessitating robust safeguards to ensure human oversight before widespread clinical deployment.
Trial Registration
ClinicalTrials.gov Identifier: NCT06963957