Evaluating Deep Learning Sepsis Prediction Models in ICUs Under Distribution Shift: A Multi-Centre Retrospective Cohort Study
Abstract
Sepsis remains a leading cause of mortality in intensive care units (ICUs) worldwide, underscoring the urgent need for early detection to improve patient outcomes. While artificial intelligence (AI) models trained on ICU data show promise for sepsis prediction, their clinical utility is frequently undermined by poor generalization under external validation, largely attributable to distribution shifts arising from heterogeneity across datasets and care settings. Prior studies have focused on direct model deployment or conventional transfer learning (e.g., fine-tuning), and systematic exploration of alternative strategies, and of the root causes of performance degradation, remains limited. In this study, we quantify these distribution shifts across three harmonized adult ICU cohorts: the high-resolution HiRID database (Bern University Hospital, Switzerland; 29 698 stays, 2008–2019; 6.3 % sepsis), MIMIC-IV (Beth Israel Deaconess Medical Center, USA; 63 425 stays, 2008–2019; 5.2 % sepsis), and eICU (208 US hospitals; 123 413 stays, 2014–2015; 4.6 % sepsis), for a total of 216 536 stays and 10 846 sepsis cases. We then evaluate five deployment strategies across three model architectures (CNN, InceptionTime, LSTM) under four target-data regimes: none, small (<8000 stays), medium (8000–32000 stays), and large (>32000 stays). The strategies are direct generalization, standard transfer learning (fine-tuning/retraining), target training, supervised domain adaptation (DA, using maximum mean discrepancy (MMD) or correlation alignment (CORAL)), and fusion training (training on merged datasets). Key results show that fine-tuning consistently underperforms across all data sizes (adjusted p < 0.05 vs. DA, fusion, and retraining), even though it has been the go-to method in prior studies. Retraining and fusion training excel in small and large target domains, while supervised DA methods dominate on medium-sized datasets. For example, DA with MMD (DA-MMD) achieves higher area under the receiver operating characteristic curve (AUROC = 0.720) and normalized area under the precision-recall curve (nAUPRC = 2.352) than fusion training (AUROC = 0.712, nAUPRC = 2.215; p = 0.02, adjusted p = 0.07). Retraining remains competitive (AUROC = 0.719, nAUPRC = 2.326; p > 0.05 vs. DA-MMD) but lags in nAUPRC. Overall, our results call for moving beyond routine fine-tuning: retraining or fusion training is preferable in data-poor or data-rich scenarios, whereas domain adaptation offers the most stable and substantial gains when moderate target data are available.
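To make the DA objective above concrete, here is a minimal PyTorch sketch of an MMD penalty of the kind referenced in the abstract. It is illustrative only: the Gaussian RBF kernel, the bandwidth `sigma`, and the trade-off coefficient `lambda_mmd` are assumptions, not the study's exact configuration.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise Gaussian RBF kernel between two batches of latent features.
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_loss(source_feats: torch.Tensor, target_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Empirical estimate of squared maximum mean discrepancy:
    #   MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    # Driving this toward zero aligns source and target feature distributions.
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Supervised DA sketch: the total loss combines the task loss on labelled
# stays from both domains with the MMD penalty on their latent features.
# lambda_mmd is a hypothetical trade-off hyperparameter.
#   loss = bce(source_logits, source_labels) \
#        + bce(target_logits, target_labels) \
#        + lambda_mmd * mmd_loss(source_feats, target_feats)
```

Likewise, the nAUPRC values above exceed 1 because the raw AUPRC is normalized; one common definition, consistent with the reported numbers, divides AUPRC by the positive-class prevalence so that a chance-level classifier scores approximately 1. A sketch under that assumed definition:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def normalized_auprc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # Average precision (AUPRC) divided by prevalence; values > 1
    # indicate lift over a chance-level classifier.
    auprc = average_precision_score(y_true, y_score)
    prevalence = float(np.mean(y_true))
    return auprc / prevalence
```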