The effect of medical explanations from large language models on diagnostic decisions in radiology

Philipp Spitzer
Daniel Hendriks
Jan Rudolph
Sarah Schlaeger
Jens Ricke
Niklas Kühl
Boj Friedrich Hoppe
Stefan Feuerriegel

Abstract

Large language models (LLMs) are increasingly used by physicians for diagnostic support. A key advantage of LLMs is their ability to generate explanations that help physicians understand the reasoning behind a diagnosis. However, the best-suited format for LLM-generated explanations remains unclear. In this large-scale study, we examined the effect of different formats for LLM explanations on clinical decision-making. To this end, we conducted a randomized experiment in which radiologists reviewed patient cases with radiological images (N = 2020 assessments). Participants either received no LLM support (control group) or were supported by one of three LLM-generated output formats: (1) a standard output providing the diagnosis without explanation; (2) a differential diagnosis comparing multiple possible diagnoses; or (3) a chain-of-thought explanation offering a detailed reasoning process for the diagnosis. We find that the format of explanations significantly influences diagnostic accuracy. The chain-of-thought explanations yielded the best performance, improving diagnostic accuracy by 12.2% compared to the control condition without LLM support (P = 0.001). The chain-of-thought explanations were also superior to the standard output without explanation (+7.2%; P = 0.040) and the differential diagnosis format (+9.7%; P = 0.004). Evidently, explaining the reasoning for a diagnosis helps physicians identify and correct potential errors in LLM predictions and thus improves overall decisions. Altogether, the results highlight that how explanations in medical LLMs are generated matters for maximizing their utility in clinical practice. By designing explanations to support the reasoning processes of physicians, LLMs can improve diagnostic performance and, ultimately, patient outcomes.

Version published to 10.1101/2025.03.04.25323357v1 on medRxiv
Mar 6, 2025

Related articles

Do Language Models Think Like Doctors?

This article has 15 authors:
1. Liam G. McCoy
2. Rajiv Swamy
3. Nidhish Sagar
4. Minjia Wang
5. James Cao
6. Stephen Bacchi
7. Nigel Fong
8. Nigel CK Tan
9. Kevin Tan
10. Thomas A. Buckley
11. Peter Brodeur
12. Leo Anthony Celi
13. Arjun Manrai
14. Aloysius Humbert
15. Adam Rodman
This article has no evaluations. Latest version Feb 12, 2025

Large Language Models in Radiology Reporting—A Systematic Review of Performance, Limitations, and Clinical Implications

This article has 7 authors:
1. Yaara Artsi
2. Eyal Klang
3. Jeremy D. Collins
4. Benjamin S. Glicksberg
5. Panagiotis Korfiatis
6. Girish N Nadkarni
7. Vera Sorin
This article has no evaluations. Latest version Mar 19, 2025

Towards Evaluating the Diagnostic Ability of LLMs

This article has 2 authors:
1. Peter Sarvari
2. Zaid Al-fagih
Reviewed by PREreview

This article has 1 evaluation. Appears in 1 list. Latest version Mar 7, 2025. Latest activity Dec 4, 2024.
