Integrating Machine Learning and SHAP for Interpretable Prediction of 28-Day Mortality in ICU Patients: A Comprehensive Analysis of Initial Physiological Features


Abstract

Background

Accurate, interpretable prediction of 28-day mortality in intensive care unit (ICU) patients is pivotal for timely clinical decision-making, resource optimization, and improving patient outcomes. Despite the growing application of machine learning (ML) models in mortality prediction—with reported area under the receiver operating characteristic curve (AUC) values ranging from 0.75 to 0.90—their "black-box" nature remains a major barrier to clinical adoption [1], as clinicians require transparent insights into how predictions are derived before they will trust and act on them. Traditional scoring systems such as APACHE II [2] and SOFA [3], while interpretable, are limited by linear assumptions, manual calculation burden, and failure to capture complex nonlinear interactions among clinical features, leading to suboptimal performance in diverse patient cohorts [4]. Initial physiological features, measured within 24 hours of ICU admission, are readily available and reflect the early severity of illness, making them ideal for early risk stratification. However, few studies have systematically evaluated the predictive value of these features using interpretable ML frameworks, and even fewer have validated clinical utility through calibration and decision curve analysis (DCA)—critical steps for translating predictive models into practice [5].

Methods

We conducted a retrospective cohort study of 654 ICU patients admitted to a tertiary care center between August 2022 and August 2024. Eligibility criteria included age ≥ 18 years, ICU stay ≥ 24 hours, and complete records of seven key initial features: age, weight, height, initial heart rate (HR), initial respiratory rate (RR), initial blood oxygen saturation (SpO₂), and initial body temperature (T). Patients with missing data for > 5% of features were excluded, and remaining missing values (< 5% per feature) were imputed using median values (selected over the mean for its robustness to outliers) [6].
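The median-imputation step described above can be sketched as follows; the column names are hypothetical stand-ins for the seven initial features, not the study's actual variable names:

```python
import pandas as pd

# Hypothetical column names standing in for the seven initial features.
FEATURES = ["age", "weight", "height", "hr", "rr", "spo2", "temp"]

def impute_medians(df: pd.DataFrame) -> pd.DataFrame:
    """Fill remaining missing values with per-feature medians
    (median preferred over mean for robustness to outliers)."""
    out = df.copy()
    out[FEATURES] = out[FEATURES].fillna(out[FEATURES].median())
    return out
```

Imputing with the cohort median keeps extreme vital-sign values (e.g., a single very high HR) from distorting the fill value, which a mean-based fill would not.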
The primary outcome was 28-day all-cause mortality. We compared three ML models (random forest [RF] [7], XGBoost [8], and logistic regression [LR]) with two traditional scoring systems (APACHE II, SOFA) using 5-fold stratified cross-validation to preserve class balance (32% mortality rate). Hyperparameter optimization for RF and XGBoost was performed via grid search, with 1,000 bootstrap samples used for confidence interval (CI) estimation. The best-performing model (RF) was further interpreted using SHapley Additive exPlanations (SHAP) [9], including summary plots (global feature importance), dependence plots (feature-response relationships), force plots (individual patient predictions), and interaction plots (feature-feature synergies). Model performance was evaluated for discrimination (accuracy, precision, recall, F1 score, AUC), calibration (calibration curves, Brier score), and clinical utility (DCA) [10]. Subgroup analyses were conducted by age (< 65 vs. ≥ 65 years), gender, and presence of chronic respiratory disease (CRD) to assess generalizability.

Results

The RF model outperformed XGBoost, LR, APACHE II, and SOFA, achieving an accuracy of 0.8313 (95% CI: 0.792–0.868), precision of 0.7059 (95% CI: 0.621–0.782), recall of 0.5484 (95% CI: 0.463–0.631), F1 score of 0.6154 (95% CI: 0.542–0.683), and AUC of 0.8792 (95% CI: 0.843–0.915). XGBoost showed comparable discriminative ability (AUC = 0.8564, 95% CI: 0.817–0.896) but was less interpretable, while LR (AUC = 0.7634, 95% CI: 0.718–0.809), APACHE II (AUC = 0.7215, 95% CI: 0.674–0.769), and SOFA (AUC = 0.7583, 95% CI: 0.712–0.805) performed substantially worse. Calibration curves demonstrated close alignment between predicted and observed mortality risks (Brier score = 0.124), with no significant miscalibration (Hosmer-Lemeshow test: χ² = 8.32, p = 0.401).
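The stratified cross-validation scheme described in the Methods can be sketched with scikit-learn. The synthetic data below merely stands in for the study cohort (654 patients, 7 features, ~32% mortality), and the hyperparameters are illustrative defaults, not the authors' grid-search results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the cohort: 654 patients, 7 features, ~32% positives.
X, y = make_classification(n_samples=654, n_features=7, n_informative=5,
                           weights=[0.68, 0.32], random_state=0)

# Stratified folds keep the ~32% mortality rate in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    prob = rf.predict_proba(X[test_idx])[:, 1]  # P(death) for held-out fold
    fold_aucs.append(roc_auc_score(y[test_idx], prob))

mean_auc = float(np.mean(fold_aucs))
```

In the study, SHAP values would then be computed on the fitted RF (e.g., with a tree-based explainer) to produce the summary, dependence, force, and interaction plots.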
DCA showed the RF model provided a net benefit across a wide range of risk thresholds (5–80%), surpassing all comparators; at a clinically relevant threshold of 20%, the net benefit was 0.28 (vs. 0.15 for SOFA, 0.10 for APACHE II, 0.22 for XGBoost, and 0.18 for LR). SHAP analysis identified initial SpO₂ as the most influential feature (mean absolute SHAP value = 0.1589), followed by initial RR (0.1423), initial HR (0.1367), age (0.1124), weight (0.0987), initial T (0.0876), and height (0.0762). Dependence plots revealed nonlinear relationships: initial HR < 120 bpm was associated with increasing mortality risk, while HR > 120 bpm correlated with reduced risk (likely reflecting compensatory mechanisms). Interaction plots showed that low initial SpO₂ (< 90%) combined with high initial RR (> 25 breaths/min) amplified mortality risk (SHAP interaction value = 0.214). Subgroup analyses confirmed the model's robustness: AUC remained > 0.85 in older patients (≥ 65 years), females, and those with CRD, with initial SpO₂ consistently ranked as the top feature.

Conclusion

Our study demonstrates that integrating RF with SHAP enables accurate, interpretable, and clinically useful prediction of 28-day ICU mortality using only initial physiological features. The RF model outperforms traditional scoring systems and other ML models in discriminative ability, calibration, and clinical utility. SHAP analysis clarifies the nonlinear and interactive effects of key features (initial SpO₂, RR, HR), providing actionable insights for early risk stratification and targeted interventions. This framework addresses the critical gap between predictive performance and interpretability, supporting the translation of ML models into routine ICU practice. Future multi-center validation and integration with electronic health record (EHR) systems will further enhance generalizability and clinical applicability.
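The net-benefit quantity underlying the DCA follows the standard decision-curve definition, NB = TP/n - (FP/n) * p_t/(1 - p_t), where p_t is the risk threshold. A minimal sketch (the function name is ours, not the authors'):

```python
import numpy as np

def net_benefit(y_true: np.ndarray, y_prob: np.ndarray, threshold: float) -> float:
    """Net benefit at risk threshold p_t: NB = TP/n - FP/n * p_t / (1 - p_t).

    "Treat" means the model's predicted risk meets or exceeds the threshold.
    The treat-none strategy has NB = 0 by construction, so any positive value
    indicates clinical utility at that threshold.
    """
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))  # correctly flagged deaths
    fp = np.sum(treat & (y_true == 0))  # survivors flagged as high risk
    return tp / n - fp / n * threshold / (1.0 - threshold)
```

Sweeping `threshold` over the reported 5–80% range and plotting `net_benefit` for each model against the treat-all and treat-none strategies reproduces the shape of a decision curve.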
