Assessing the Limitations of Large Language Models in Clinical Practice Guideline-concordant Treatment Decision-making on Real-world Data


Abstract

Aims

Large Language Models (LLMs) have shown promise in making therapeutic decisions comparable to those of medical experts, but these studies have relied on highly curated patient data. The aim of this study was to determine whether LLMs can make guideline-concordant treatment decisions based on patient data as it typically presents in clinical practice.

Methods and Results

We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR, n=24) or transcatheter aortic valve replacement (TAVR, n=56) by our institutional Heart Team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, Llama-2, Mistral, PaLM 2, and DeepSeek-R1) were queried with either anonymized original medical reports or manually generated case summaries and asked to determine the most guideline-concordant treatment. Agreement with the Heart Team was measured using Cohen’s kappa coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using frequency bias indices (FBIs), with FBIs >1 indicating bias towards TAVR. When presented with original medical reports, LLMs showed poor performance (kappa: −0.47 to 0.22, ICC: 0.0–1.0, FBI: 0.95–1.51). The LLMs’ performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (kappa: −0.02 to 0.63, ICC: 0.01–1.0, FBI: 0.46–1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested.
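For readers unfamiliar with these metrics, the following minimal Python sketch illustrates how chance-corrected agreement (Cohen’s kappa) and a frequency bias index for TAVR could be computed from paired treatment decisions. The labels below are made up for demonstration and are not the study data; the authors’ exact definitions and tooling may differ.

```python
# Illustrative sketch (not the authors' code): agreement and frequency bias
# between LLM recommendations and the reference Heart Team decisions.
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired decisions (toy example, not real patient data)
heart_team = ["TAVR", "TAVR", "SAVR", "TAVR", "SAVR", "TAVR"]  # reference decisions
llm_output = ["TAVR", "SAVR", "SAVR", "TAVR", "TAVR", "TAVR"]  # LLM recommendations

# Chance-corrected agreement between the LLM and the Heart Team
kappa = cohen_kappa_score(heart_team, llm_output)

# Frequency bias index for TAVR: how often the LLM recommends TAVR
# relative to how often the Heart Team chose it (FBI > 1 suggests bias towards TAVR)
fbi_tavr = llm_output.count("TAVR") / heart_team.count("TAVR")

print(f"Cohen's kappa: {kappa:.2f}")
print(f"FBI (TAVR):    {fbi_tavr:.2f}")
```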

Conclusion

Even advanced LLMs require extensively curated input to make informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
