Evaluation of Large Language Models in the Clinical Management of Patients With Upper Gastrointestinal Bleeding: Insights From Real-World Patient Data


Abstract

Objective

Upper gastrointestinal bleeding (UGIB) is a life-threatening emergency requiring rapid risk assessment. Current scoring tools have limited accuracy. Large language models (LLMs) may support clinical decision-making, but their role in UGIB management is unclear. This study evaluated LLMs for patient risk classification, prediction of endoscopic findings, and alignment with routine clinical decision-making.

Methods

In this retrospective study, we analyzed electronic health records (EHRs) of 384 UGIB patients who presented to two referral centers in Karaj, Iran, between March and December 2024. Included cases underwent upper gastrointestinal endoscopy; incomplete records were excluded. Five LLMs (GPT-5, Llama 4, Gemini-2.5-Flash, DeepSeek R1, and Grok) were assessed using in-context learning for (i) risk classification, (ii) prediction of probable endoscopic findings, and (iii) clinical justification generation. Performance metrics included accuracy, precision, recall, and F1-score, compared with conventional machine learning models. Two gastroenterologists independently assessed justifications across seven domains: relevance, clarity, originality, completeness, specificity, correctness, and consistency.
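The abstract does not state how precision, recall, and F1 were averaged across risk classes; a minimal sketch of these metrics, assuming macro-averaging over the class labels, could look like this (all function and variable names are illustrative, not from the study):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1
    for a multi-class risk classification task."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,  # macro average
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

For example, comparing a model's predicted risk labels against the endoscopy-confirmed labels for a cohort would yield a single accuracy figure comparable to those reported in the Results.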

Results

All LLMs outperformed conventional models (highest baseline accuracy 0.54). GPT-5 achieved the highest risk classification accuracy (0.66), followed by Llama 4 (0.64). Grok performed best in predicting endoscopic findings (0.32). The gastroenterologists noted variability in reasoning: GPT-5 and Grok provided the most complete justifications, though GPT-5 occasionally over-classified urgent cases. Llama 4 and Gemini-2.5-Flash were less specific, while DeepSeek R1 offered detailed patient summaries but lacked predictive outputs.

Conclusions

LLMs improved UGIB risk prediction and generated interpretive reasoning, but accuracy limitations, inconsistent reasoning, and occasional risk misclassification highlight the need for clinician oversight and prospective validation before clinical use.

Key Messages

What is already known on this topic

UGIB is a medical emergency requiring rapid risk stratification and timely management. LLMs are promising tools for clinical decision support, but their role in UGIB management remains unclear.

What this study adds

LLMs can improve risk prediction and interpretive reasoning in UGIB, but limitations in accuracy, inconsistent reasoning, and occasional misclassification highlight the need for clinician oversight and prospective validation.

How this study might affect research, practice, or policy

LLMs provide structured, human-readable explanations that could support clinical decision-making, potentially reducing unnecessary emergency endoscopies, improving care efficiency, and alleviating physician workload.
