Interpretable Machine Learning for Mortality Risk Detection in National Health Data

Abstract

Background: Accurate mortality prediction is essential for identifying high-risk individuals and guiding public health interventions. However, machine learning (ML) models trained on nationally representative data such as NHANES, where mortality occurs in fewer than 10% of cases, often struggle with extreme class imbalance and limited interpretability, hindering their practical utility.

Objective: This study investigates whether loss-aware ML approaches can enhance both sensitivity and interpretability in predicting all-cause mortality, particularly in older and socioeconomically vulnerable populations, where most mortality events occur.

Methods: We used data from 4,188 U.S. adults in the 2011–2012 NHANES cycle, linked to the 2019 National Death Index. Four models (logistic regression, random forest, gradient boosting, and XGBoost) were trained under varying loss function strategies, without imputation or oversampling, to preserve the real-world class imbalance. Performance was assessed via recall, F1-score, and PR-AUC. SHapley Additive exPlanations (SHAP) were used for interpretability. Mortality label distributions across preprocessing strategies were compared using chi-square tests.

Results: Despite modest absolute metrics, the XGBoost model with a class-weighted loss achieved the best recall (30.7%) and F1-score (35.4%), identifying a substantial portion of deaths that baseline models missed. SHAP analysis revealed clinically consistent risk factors (age, HbA1c, systolic blood pressure, and poverty index), concentrated particularly in high-risk subgroups. Crucially, chi-square tests showed that both the raw (χ² = 48.26, p < 0.0001) and the SMOTE-augmented (χ² = 798.81, p < 0.0001) datasets differed significantly from our analytic dataset in outcome distribution, underscoring the importance of rigorous preprocessing. If ignored, these distortions could bias both model evaluation and real-world risk stratification.

Conclusion: Even when predictive accuracy appears modest, transparent and statistically grounded models offer valuable insights for targeted outreach and equitable health policy. Our results demonstrate that interpretable, loss-aware ML can play a critical role in population-level mortality prediction. All code and data will be made publicly available upon peer-reviewed publication.
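
As a minimal sketch of the loss-aware approach the abstract describes (not the authors' code, which is to be released upon publication), the following Python example trains a class-weighted XGBoost classifier on a synthetic imbalanced sample sized to roughly match the study (4,188 observations, ~10% positives), reports recall, F1-score, and PR-AUC, and computes SHAP attributions. All features and data here are synthetic placeholders, not NHANES variables.

```python
# Minimal sketch: class-weighted XGBoost on an imbalanced binary outcome,
# evaluated with recall, F1, and PR-AUC, plus SHAP for interpretability.
import xgboost as xgb
import shap
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the analytic sample: ~10% positive (mortality)
# class, mirroring the <10% event rate described in the abstract.
X, y = make_classification(n_samples=4188, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

# Class-weighted loss: scale_pos_weight = (# negatives) / (# positives)
# up-weights the rare mortality class without oversampling or imputation.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=pos_weight,
                          eval_metric="aucpr", random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"recall={recall_score(y_test, y_pred):.3f} "
      f"F1={f1_score(y_test, y_pred):.3f} "
      f"PR-AUC={average_precision_score(y_test, y_prob):.3f}")

# SHAP attributions: which features drive predicted mortality risk.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
```

With scale_pos_weight left at its default of 1, a model on data this imbalanced tends to maximize accuracy by predicting the majority class, which is exactly the low-recall failure mode the class-weighted loss is meant to counter.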
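The chi-square comparison of mortality label distributions can be sketched the same way; the contingency counts below are invented for illustration only and do not reproduce the reported χ² statistics.

```python
# Hypothetical illustration: compare outcome (alive/deceased) counts
# between the analytic dataset and an alternatively preprocessed one.
# These counts are invented; they are not the study's data.
from scipy.stats import chi2_contingency

table = [
    [3800, 388],  # analytic dataset: [alive, deceased]
    [3500, 688],  # alternative preprocessing: [alive, deceased]
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4g}")
```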
