Enhancing Cause of Death Prediction: Development and Validation of ML Models Using Multimodal Data Across Multiple Healthcare Sites

Mohammed Al-Garadi
Rishi J Desai
Kerry Ngan
Michele LeNoue-Newton
Ruth M. Reeves
Daniel Park
Jose J. Hernández-Muñoz
Shirley V. Wang
Judith C. Maro
Candace C. Fuller
Joshua Lin Kueiyu
Aida Kuzucan
Kevin Coughlin
Haritha Pillai
Melissa McPheeters
Jill Whitaker
Jessica A. Deere
Michael F. McLemore
Dax M. Westerman
Tony Morrow
Margaret A. Adgent
Michael E. Matheny

Abstract

Importance

Timely and accurate determination of causes of death (CoD) is essential for public health surveillance, epidemiological research, and healthcare policy development. However, obtaining up-to-date and detailed CoD information is challenging due to delays in official death records and inconsistencies in data reporting across institutions.

Objective

To develop and validate machine learning (ML) models capable of predicting probable CoD by integrating comprehensive features from structured electronic health record (EHR) data, unstructured clinical notes, and publicly available data.

Design, Setting, and Participants

This multi-institutional retrospective cohort study was conducted at Vanderbilt University Medical Center (VUMC) and Massachusetts General Brigham (MGB). Deceased patients were included if they had at least one inpatient or outpatient encounter between October 1, 2015, and January 1, 2021, with corresponding death records from state health departments and the National Death Index. The study comprised 13,708 deceased patients from VUMC and 34,839 from MGB.

Exposures

Integration of structured EHR data, unstructured clinical notes processed using advanced language models, and publicly available data into machine learning models to predict CoD.

Main Outcomes and Measures

The primary outcome was the underlying CoD, classified into one of the top 15 National Center for Health Statistics (NCHS) rankable CoD categories, with all other causes grouped into an “Other” category. Model performance was evaluated using weighted area under the receiver operating characteristic curve (AUC) and weighted F-measure.
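As an illustration of the outcome definition and the two metrics named above, the following is a minimal sketch rather than the authors' code. It assumes that "weighted" means averaging per-class scores by class prevalence (one-vs-rest for AUC), which the abstract does not spell out. The function names and the `rankable` set are hypothetical, and the category labels used in any call are placeholders, not the study's mapping.

```python
from collections import Counter

def group_causes(causes, rankable):
    """Keep labels that are in the rankable set; map everything else to 'Other'."""
    return [c if c in rankable else "Other" for c in causes]

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / len(y_true)) * f1
    return score

def binary_auc(is_positive, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged by class support.

    proba[i][k] is the predicted probability of classes[k] for sample i.
    Classes with no positives or no negatives are skipped (an assumption of
    this sketch), and the weights are renormalized over the remaining classes.
    """
    support = Counter(y_true)
    total_weight, auc = 0, 0.0
    for k, cls in enumerate(classes):
        n_pos = support[cls]
        if n_pos == 0 or n_pos == len(y_true):
            continue
        flags = [t == cls for t in y_true]
        auc += n_pos * binary_auc(flags, [row[k] for row in proba])
        total_weight += n_pos
    return auc / total_weight
```

The pairwise AUC here is quadratic in sample size, which is fine for a sketch; a rank-based formulation would be preferable for cohorts of the size reported in this study.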

Results

The XGBoost model using structured EHR data alone achieved weighted AUCs of 0.86 (95% CI, 0.84–0.88) at VUMC and 0.80 (95% CI, 0.79–0.80) at MGB. Adding unstructured notes improved performance, with weighted AUCs of 0.90 (95% CI, 0.88–0.93) at VUMC and 0.92 (95% CI, 0.91–0.92) at MGB. Adding publicly available data did not further improve performance. Cross-institutional validation revealed significant performance degradation.
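The abstract reports 95% confidence intervals but does not say how they were obtained. One common approach, shown here purely as an assumption, is a percentile bootstrap over the test set. The helper names `bootstrap_ci` and `accuracy` are hypothetical, and accuracy stands in for whatever metric is of interest; any function with the same `(y_true, y_pred)` signature could be passed instead.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred).

    Resamples patients with replacement n_boot times, recomputes the metric
    on each resample, and returns the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Narrow intervals such as those reported for MGB are what one would expect from a large test set under this kind of resampling, though the study's actual procedure may differ.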

Conclusions and Relevance

ML models integrating EHR structured and unstructured data to predict underlying CoD at the time of the most recent encounter among deceased patients achieved excellent performance within individual institutions. The inclusion of publicly available data did not improve performance, and all versions had poor portability between institutions. Healthcare institutions may benefit from adopting robust processes for locally tailored models, and future research should focus on enhancing model generalizability while addressing unique institutional data environments.
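The portability finding rests on comparing internal validation (train and test at the same site) with external validation (train at one site, test at the other). The sketch below illustrates that protocol only. It uses synthetic two-feature data and a nearest-centroid classifier as a deliberately simple stand-in for the study's XGBoost models; every name, the `shift` parameter, and the data generator are assumptions made for illustration.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, X):
    """Assign each row to the class whose centroid is closest (squared distance)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda label: dist2(centroids[label], row)) for row in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def make_site(rng, n, shift):
    """Synthetic site with two balanced classes; `shift` offsets every feature,
    mimicking a site-specific difference in how data are recorded."""
    X, y = [], []
    for i in range(n):
        label = "class_a" if i % 2 == 0 else "class_b"
        center = 0.0 if label == "class_a" else 2.0
        X.append([rng.gauss(center + shift, 0.5), rng.gauss(center + shift, 0.5)])
        y.append(label)
    return X, y

def internal_vs_external(seed=0, n=400, shift=2.0):
    """Train on half of site A; score on the other half of A and on all of site B."""
    rng = random.Random(seed)
    X_a, y_a = make_site(rng, n, shift=0.0)
    X_b, y_b = make_site(rng, n, shift=shift)
    split = n // 2
    model = fit_centroids(X_a[:split], y_a[:split])
    internal = accuracy(y_a[split:], predict(model, X_a[split:]))
    external = accuracy(y_b, predict(model, X_b))
    return internal, external

if __name__ == "__main__":
    internal, external = internal_vs_external()
    print(f"internal accuracy: {internal:.3f}, external accuracy: {external:.3f}")
```

With `shift=0.0` the two synthetic sites share a distribution and the external score tracks the internal one; a nonzero shift reproduces, in toy form, the kind of degradation the abstract describes.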

Version published to 10.1101/2025.06.24.25330213 on medRxiv
Jun 24, 2025

Related articles

Machine learning models for predicting severe clinical events in hospitalized patients with coronary artery disease

This article has 16 authors:
1. Hao Liu
2. Meijun Liu
3. Xinmiao Guan
4. Feng Cao
5. Changhao Liang
6. Zhongwen Qi
7. Jiaqi Hui
8. Junnan Zhao
9. Jingli Xing
10. Jianguo Zhou
11. Dong Zhang
12. Lei Liu
13. Xiaoliang Hao
14. Minjing Luo
15. Fengqin Xu
16. Yutong Fei
This article has no evaluations. Latest version Jan 12, 2026

Machine Learning Insights for Cardiovascular Risk Prediction in Diabetic Patients: Emphasis on Renal and Cardiac Markers Using Random Forests

This article has 1 author:
1. Julian Borges
This article has no evaluations. Latest version Jan 21, 2026

Risk Stratification for In-Hospital Mortality in Alzheimer’s Disease Using Interpretable Regression and Explainable AI

This article has 3 authors:
1. Tursun Alkam
2. Ebrahim Tarshizi
3. Andrew H. Van Benschoten
This article has no evaluations. Latest version Jan 7, 2026
