Rethinking Medical LLM Hallucinations: A System-Level Survey
Abstract
Large language models (LLMs) demonstrate strong performance across biomedical and clinical tasks, yet their deployment in healthcare remains limited by hallucination. Prior research has often treated hallucination as an isolated model failure to be addressed through improved training, prompting, or retrieval. However, emerging theoretical and empirical evidence suggests that hallucination is a structural property of probabilistic language generation rather than a fully removable bug. This distinction is particularly critical in medicine, where near-correct answers, fabricated evidence, and unsafe recommendations can introduce real patient risk and legal liability. In this paper, we present a system-level survey of hallucination in medical LLMs. Rather than exhaustively cataloging prior work, we aim to highlight the dominant research directions and analyze the problem from a systems perspective. We synthesize literature spanning definitions, taxonomies, benchmarks, detection methods, and mitigation strategies, and examine how these components interact within real clinical workflows. Our analysis shows that, despite diverse models and technical advances, improvements to individual components rarely translate into reliable end-to-end systems. Based on this synthesis, we argue that hallucination in healthcare should be treated as a system-level risk management problem rather than a model-level defect. We outline key open challenges and emphasize the need to understand not only how to reduce hallucinations, but why they occur and how their impact propagates through clinical decision pipelines. Ultimately, progress toward trustworthy medical AI will depend on designing systems that anticipate, monitor, and safely manage hallucinations rather than assuming they can be fully eliminated.