Large-Language-Model Mortality Risk Stratification in the Intensive Care Unit: A Benchmark Against APACHE II

Abstract

Background

Accurately predicting clinical trajectories in critically ill patients remains challenging due to physiological instability and multisystem organ dysfunction. Traditional prognostic tools, such as the APACHE II score, offer standardized risk assessment but are constrained by static algorithms. This study evaluates the predictive performance and reliability of large language models (LLMs) compared to APACHE II for in-hospital mortality prediction.

Methods

This was a single-center, retrospective study. De-identified clinical data from 70 critically ill patients were provided to four LLMs: Gemini, Llama, GPT-4, and R1. Each model stratified patients into high-, intermediate-, or low-risk categories for in-hospital death, without being instructed to apply the APACHE II methodology. To assess the impact of additional context, the models were also provided with de-identified discharge summaries from prior hospital admissions. Consistency and rationale analyses were performed across multiple iterations.
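
One way to picture this protocol: repeatedly prompt each model for a one-word risk category and tally the answers across iterations. The sketch below is a minimal illustration under stated assumptions; `query_model` is a hypothetical placeholder for any provider's API, and the prompt wording and iteration count are illustrative rather than the study's exact setup.

```python
# Minimal sketch of the repeated risk-stratification protocol (illustrative).
# `query_model` is a hypothetical stand-in for a real LLM API call; the prompt
# text and default iteration count are assumptions, not the study's settings.
from collections import Counter

RISK_LEVELS = {"high", "intermediate", "low"}

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a provider SDK call; replace with a real client."""
    raise NotImplementedError

def stratify(model: str, clinical_data: str,
             discharge_summary: str | None = None,
             n_iterations: int = 5) -> Counter:
    """Query the model n_iterations times and tally the risk labels returned."""
    context = clinical_data
    if discharge_summary is not None:  # optional extra context, as in the study
        context += "\n\nPrior discharge summary:\n" + discharge_summary
    prompt = ("Classify this ICU patient's risk of in-hospital death as "
              "high, intermediate, or low. Answer with one word, then a "
              "brief rationale.\n\n" + context)
    votes: Counter = Counter()
    for _ in range(n_iterations):
        answer = query_model(model, prompt).strip().lower()
        label = answer.split()[0].rstrip(".,") if answer else ""
        votes[label if label in RISK_LEVELS else "unparseable"] += 1
    return votes
```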

Findings

The LLMs demonstrated a general tendency toward risk overestimation, classifying more patients as high risk than APACHE II did. Observed mortality within the high-risk groups was lower than the rates predicted by APACHE II, suggesting a calibration mismatch. Gemini, when supplemented with additional clinical context, was the only model to identify a low-risk group. Gemini, GPT-4, and R1 exhibited the highest consistency across repeated evaluations, while Llama showed greater variability that improved with added context. Semantic rationale analyses revealed greater stability among the larger models, indicating non-stochastic reasoning patterns.
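
One plausible way to operationalize such a semantic rationale analysis is to embed each iteration's free-text rationale and average the pairwise cosine similarities; a higher mean suggests more stable reasoning across runs. The sketch below assumes the sentence-transformers package, and both the embedding model and the example rationales are illustrative, not those used in the study.

```python
# Illustrative rationale-stability metric: mean pairwise cosine similarity
# between embeddings of the rationales a model produced across iterations.
# The embedding model and the example texts are assumptions for illustration.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

def mean_pairwise_similarity(rationales: list[str]) -> float:
    """Average cosine similarity over all pairs of rationale embeddings."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(rationales, convert_to_tensor=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
            for i, j in combinations(range(len(rationales)), 2)]
    return sum(sims) / len(sims)

# Hypothetical rationales from one model across three iterations
rationales = [
    "Septic shock with rising lactate suggests high mortality risk.",
    "Hemodynamic instability and an upward lactate trend indicate high risk.",
    "Vasopressor-dependent shock and lactate elevation point to high risk.",
]
print(f"mean pairwise similarity: {mean_pairwise_similarity(rationales):.2f}")
```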

Conclusions

LLMs, supplemented with discharge summaries from prior hospitalizations, show promise for mortality risk stratification in critically ill patients. However, further refinement is necessary to improve calibration and reliability before clinical implementation. Context-aware prompting strategies and improved model calibration may enhance the utility of LLMs alongside established systems such as APACHE II.
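
Calibration here means agreement between predicted and observed event rates within each risk stratum. As a rough illustration of the kind of check involved, the standard-library sketch below compares observed mortality in each model-assigned group against an expected rate; every value in it is a placeholder, not data from the study.

```python
# Illustrative calibration check: observed in-hospital mortality per
# model-assigned risk group versus an expected (e.g., APACHE II-derived) rate.
# All inputs below are placeholders, not values from the study.
from collections import defaultdict

def observed_mortality(assignments: dict[str, str],
                       outcomes: dict[str, bool]) -> dict[str, float]:
    """Fraction of patients in each risk group who died in hospital."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for patient, group in assignments.items():
        groups[group].append(outcomes[patient])
    return {g: sum(died) / len(died) for g, died in groups.items()}

# Hypothetical example inputs
assignments = {"p1": "high", "p2": "high", "p3": "intermediate", "p4": "low"}
outcomes = {"p1": True, "p2": False, "p3": False, "p4": False}
expected = {"high": 0.55, "intermediate": 0.25, "low": 0.08}  # placeholder rates

for group, obs in observed_mortality(assignments, outcomes).items():
    print(f"{group}: observed {obs:.0%} vs expected {expected[group]:.0%}")
```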

Author Summary

Predicting which critically ill patients are at greatest risk of dying in the hospital is one of the most important and difficult tasks faced by doctors. Traditionally, we’ve used structured scoring systems like APACHE II, which rely on a fixed set of patient measurements. In this study, we explored whether large language models (LLMs), the same kind of technology behind chatbots like ChatGPT, could perform this task just as well, or even better. We provided four different LLMs with real patient data from our intensive care unit and asked them to assess each patient’s risk of dying, without giving them any instructions about how to do so. We also tested whether adding more context, such as hospital discharge summaries, made their predictions more accurate or consistent. We found that while LLMs tended to overestimate risk, some models, especially when given extra clinical information, showed strong consistency and thoughtful reasoning in their predictions. Our findings suggest that LLMs may eventually serve as helpful partners to physicians, offering a flexible and adaptable way to interpret complex clinical data. However, more work is needed to ensure that these tools are safe, reliable, and transparent before they can be used in real-world hospital settings.
