Heart Failure Prediction & Risk Stratification using Machine Learning

Abstract

Background: Heart failure (HF) is a leading cause of morbidity, mortality, and healthcare expenditure. Approximately 6.7 million adults in the U.S. live with the condition, and it contributes to hundreds of thousands of deaths annually. Early identification of high-risk individuals remains a challenge because HF symptoms are often ignored or misattributed to normal aging, stress, or minor illness, which delays diagnosis. This study evaluated whether routinely available electronic medical record (EMR) variables can support HF prediction for population screening and proactive care pathways.

Methods: We trained and evaluated several models for HF prediction, including logistic regression, support vector machine (SVM), k-nearest neighbors (KNN), random forest, XGBoost, multilayer perceptron (MLP), and a custom stacked ensemble, using stratified 5-fold cross-validation and 70/30 hold-out splits on EMR data from the All of Us Research Program. The cohort comprised 37,070 adults (13,577 with HF; 23,493 without HF). Predictors included readily available demographics, vital signs, laboratory values, common conditions, lifestyle factors, and a deprivation index. Preprocessing comprised IQR-based winsorization, median imputation, one-hot encoding, and quantile transformation (QuantileTransformer). Predicted probabilities were calibrated and then adjusted to a realistic population prevalence. SHAP analysis was used to identify the most influential features.

Results: On the test set, the stacked model achieved a ROC-AUC of 0.927, a PR-AUC of 0.895, and an accuracy of 0.856. Calibration and prevalence adjustment yielded interpretable probability estimates and a clear stratification of individuals into clinically actionable risk tiers. SHAP analysis identified atrial fibrillation, age, hypertensive disorder, sodium, and deprivation index as the five features with the greatest impact on the model's predictions. A secondary multiclass experiment (no HF, HF with reduced ejection fraction, and HF with preserved ejection fraction) achieved lower discrimination (macro-AUC ≈ 0.87) and lower per-class precision and recall, presumably because of label noise, class imbalance, and overlapping phenotypes.

Conclusions: A carefully calibrated stacked ensemble trained on a combination of readily available EMR variables can achieve strong discrimination for HF, making it a potentially effective component of an AI clinical decision support system (AI-CDSS) for population screening and proactive care pathways.
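The abstract does not say how predicted probabilities were adjusted to a population prevalence. One common approach, shown here only as an illustrative sketch and not as the authors' method, is the standard prior-shift correction, which rescales the odds of a calibrated probability by the ratio of target to training prevalence. The training prevalence below comes from the cohort counts in the abstract (13,577 of 37,070); the function name and the target prevalence of 0.03 are hypothetical placeholders.

```python
def adjust_to_prevalence(p: float, train_prev: float, target_prev: float) -> float:
    """Rescale a calibrated probability from the training prevalence
    to a target prevalence using the prior-shift (Bayes) correction.

    Assumes p is already well calibrated at the training prevalence.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not (0.0 < train_prev < 1.0 and 0.0 < target_prev < 1.0):
        raise ValueError("prevalences must lie strictly between 0 and 1")
    # Reweight the positive and negative class evidence by the prior ratios.
    pos = p * target_prev / train_prev
    neg = (1.0 - p) * (1.0 - target_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


# Training prevalence implied by the cohort described in the abstract.
TRAIN_PREV = 13577 / 37070
# Hypothetical target prevalence, chosen only for illustration.
TARGET_PREV = 0.03

if __name__ == "__main__":
    for raw in (0.10, 0.50, 0.90):
        adjusted = adjust_to_prevalence(raw, TRAIN_PREV, TARGET_PREV)
        print(f"cohort-scale p={raw:.2f} -> population-scale p={adjusted:.4f}")
```

Under this correction the ranking of individuals is unchanged, so ROC-AUC is unaffected. Only the absolute probabilities shift, and those are what risk-tier thresholds would act on.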