LLM Reasoning Does Not Protect Against Clinical Cognitive Biases - An Evaluation Using BiasMedQA


Abstract

Background

Cognitive biases are an important source of clinical errors. Large language models (LLMs) have emerged as promising tools to support clinical decision-making, but have been shown to be prone to the same cognitive biases as humans. Recent LLM capabilities that emulate human reasoning could potentially mitigate these vulnerabilities.

Methods

To assess the impact of reasoning on the susceptibility of LLMs to cognitive bias, Llama-3.3-70B and Qwen3-32B, along with their reasoning-enhanced variants, were evaluated on the public BiasMedQA dataset, which probes seven distinct cognitive biases across 1,273 clinical case vignettes. Each model was tested under three conditions: a base prompt, a debiasing prompt instructing the model to actively mitigate cognitive bias, and a few-shot prompt with additional sample cases of biased responses. For each model pair, two mixed-effects logistic regression models were fitted to determine the impact of biases and mitigation strategies on performance.
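As a minimal sketch of this setup, the following Python code illustrates the three prompting conditions and a mixed-effects logistic regression with a random intercept per vignette. The prompt wordings, file name, and column names are illustrative assumptions, not the study's actual materials; odds ratios correspond to exponentiated fixed-effect coefficients.

import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical versions of the three prompting conditions.
PROMPTS = {
    "base": "{vignette}\nAnswer with the single best option.",
    "debias": (
        "{vignette}\nThe case may contain misleading cues designed to "
        "trigger cognitive bias; actively question such cues before "
        "answering with the single best option."
    ),
    "few_shot": "{bias_examples}\n\n{vignette}\nAnswer with the single best option.",
}

# Assumed flat results table: one row per (vignette, prompt) trial with
# columns correct (0/1), bias_type, prompt, and case_id.
df = pd.read_csv("biasmedqa_results.csv")  # hypothetical file

# Mixed-effects logistic regression: bias type and prompting strategy as
# fixed effects, a random intercept for each of the clinical vignettes.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(bias_type) + C(prompt)",
    {"case": "0 + C(case_id)"},
    df,
)
fit = model.fit_vb()  # variational Bayes fit of the mixed logit
print(fit.summary())

# Odds ratios are exp() of the fixed-effect posterior means.
print(np.exp(fit.fe_mean))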

Results

Reasoning capabilities did not consistently prevent cognitive bias in either model, although both reasoning models achieved better overall performance than their respective base models (OR 4.0 for Llama-3.3-70B, OR 3.6 for Qwen3-32B). In Llama-3.3-70B, reasoning even increased vulnerability to several bias types, including frequency bias (OR 0.6, p = 0.006) and recency bias (OR 0.5, p < 0.001). In contrast, the debiasing and few-shot prompting approaches both produced statistically significant reductions in biased responses across the two model architectures, with the few-shot strategy being substantially more effective (OR 0.1 vs. 0.6 for Llama-3.3-70B; OR 0.25 vs. 0.6 for Qwen3-32B).

Conclusions

Our results indicate that contemporary reasoning capabilities in LLMs fail to protect against cognitive biases, extending the growing body of literature suggesting that purported reasoning abilities may represent sophisticated pattern recognition rather than genuine inferential cognition.
