Assessing the Limitations of Large Language Models in Clinical Practice Guideline-concordant Treatment Decision-making on Real-world Data


Abstract

Aims

Large Language Models (LLMs) have shown promise in making therapeutic decisions comparable to those of medical experts, but these studies have relied on highly curated patient data. The aim of this study was to determine whether LLMs can make guideline-concordant treatment decisions based on patient data as it typically presents in clinical practice.

Methods and Results

We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR, n=24) or transcatheter aortic valve replacement (TAVR, n=56) by our institutional Heart Team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, Llama-2, Mistral, PaLM 2, and DeepSeek-R1) were queried with either anonymized original medical reports or manually generated case summaries and asked to determine the most guideline-concordant treatment. Agreement with the Heart Team was measured using Cohen’s kappa coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using frequency bias indices (FBIs), with FBIs >1 indicating bias towards TAVR. When presented with original medical reports, LLMs showed poor performance (kappa: −0.47 to 0.22, ICC: 0.0–1.0, FBI: 0.95–1.51). The LLMs’ performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (kappa: −0.02 to 0.63, ICC: 0.01–1.0, FBI: 0.46–1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested.
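For readers unfamiliar with these metrics, the following minimal Python sketch illustrates how chance-corrected agreement (Cohen’s kappa) and a frequency bias index for TAVR could be computed from paired treatment decisions. The labels below are made up for demonstration and are not the study data; the authors’ exact definitions and tooling may differ.

```python
# Illustrative sketch (not the authors' code): agreement and frequency bias
# between LLM recommendations and the reference Heart Team decisions.
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired decisions (toy example, not real patient data)
heart_team = ["TAVR", "TAVR", "SAVR", "TAVR", "SAVR", "TAVR"]  # reference decisions
llm_output = ["TAVR", "SAVR", "SAVR", "TAVR", "TAVR", "TAVR"]  # LLM recommendations

# Chance-corrected agreement between the LLM and the Heart Team
kappa = cohen_kappa_score(heart_team, llm_output)

# Frequency bias index for TAVR: how often the LLM recommends TAVR
# relative to how often the Heart Team chose it (FBI > 1 suggests bias towards TAVR)
fbi_tavr = llm_output.count("TAVR") / heart_team.count("TAVR")

print(f"Cohen's kappa: {kappa:.2f}")
print(f"FBI (TAVR):    {fbi_tavr:.2f}")
```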

Conclusion

Even advanced LLMs require extensively curated input to make informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
