A statistical framework for evaluating the repeatability and reproducibility of large language models
Abstract
A major concern in applying large language models (LLMs) to medicine is their reliability. Because LLMs generate text by sampling the next token (or word) from a probability distribution, the stochastic nature of this process can lead to different outputs even when the input prompt, model architecture, and parameters remain the same. Variation in model output has important implications for reliability in medical applications, yet it remains underexplored and lacks standardized metrics. To address this gap, we propose a statistical framework that systematically quantifies LLM variability using two metrics: repeatability, the consistency of LLM responses across repeated runs under identical conditions, and reproducibility, the consistency across runs under different conditions. Within these metrics, we evaluate two complementary dimensions: semantic consistency, which measures the similarity in meaning across responses, and internal stability, which measures the stability of the model’s underlying token-generating process. We applied this framework to medical reasoning as a use case, evaluating LLM repeatability and reproducibility on standardized United States Medical Licensing Examination (USMLE) questions and real-world rare disease cases from the Undiagnosed Diseases Network (UDN) using validated medical reasoning prompts. LLM responses were less variable for UDN cases than for USMLE questions, suggesting that the complexity and ambiguity of real-world patient presentations may constrain the model’s output space and yield more stable reasoning. Repeatability and reproducibility did not correlate with diagnostic accuracy, underscoring that an LLM producing a correct answer is not equivalent to producing it consistently. By providing a systematic approach to quantifying LLM repeatability and reproducibility, our framework supports more reliable use of LLMs in medicine and biomedical research.
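The semantic-consistency dimension described above can be illustrated with a minimal sketch: score every pair of responses from repeated runs with a similarity function and average the pairwise scores. The abstract does not specify the similarity measure, so the token-overlap (Jaccard) similarity below is a hypothetical stand-in for an embedding-based semantic similarity; the function names and example responses are likewise illustrative, not taken from the paper.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two responses.
    A simple stand-in for an embedding-based semantic similarity (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def semantic_consistency(responses: list[str]) -> float:
    """Mean pairwise similarity across repeated runs.
    Under identical conditions this estimates repeatability; comparing runs
    under different conditions (e.g. temperature, model version) would
    estimate reproducibility instead."""
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical repeated runs on the same diagnostic prompt
runs = [
    "the likely diagnosis is fabry disease",
    "the likely diagnosis is fabry disease",
    "fabry disease is the most likely diagnosis",
]
print(round(semantic_consistency(runs), 3))
```

A score of 1.0 means every run produced semantically identical output; lower values indicate variability even when the prompt and parameters are fixed, which is exactly the quantity the framework's repeatability metric is meant to capture.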