Machine Learning-Based Cardiovascular Disease Prediction: Comparative Analysis of SMOTE Impact on Imbalanced Healthcare Data
Abstract
Cardiovascular disease (CVD) is the leading cause of death worldwide, affecting roughly 18 million people annually. Machine learning approaches to CVD prediction are hampered by the class imbalance inherent in healthcare datasets: disease-positive cases are substantially underrepresented, which biases models toward the majority class. This study evaluated ten machine learning algorithms, including Random Forest, Support Vector Machine, XGBoost, and ensemble methods, on the Behavioral Risk Factor Surveillance System (BRFSS) dataset of 308,070 patient records. The Boruta algorithm was used for feature selection, and RandomizedSearchCV for hyperparameter optimization. Model performance was assessed on the original imbalanced data and again after class balancing with the Synthetic Minority Over-sampling Technique (SMOTE). On the original imbalanced data, models reached high overall accuracy (approximately 92%) but detected the minority class poorly (F1-scores: 0.00-0.28). Applying SMOTE substantially improved minority-class performance: the Stacking ensemble performed best, with 94.49% accuracy and an F1-score of 0.94 for CVD-positive cases. Ensemble methods adapted better to class balancing than linear algorithms, whose performance degraded substantially. These results indicate that SMOTE mitigates class imbalance in CVD prediction, improving minority-class detection while preserving overall accuracy, and that ensemble methods were the strongest of the approaches evaluated for imbalanced healthcare data.
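The two ideas the abstract relies on, that accuracy can look strong while minority-class F1 collapses, and that SMOTE rebalances classes by interpolating between minority samples, can be illustrated with a short sketch. This is a simplified pure-Python version written for illustration only, not the study's pipeline and not the imbalanced-learn implementation. The function names `smote` and `accuracy_and_f1`, the toy points, and the 92/8 class split are all assumptions chosen for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE sketch: create n_new synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class); F1 is 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (hypothetical, not from the study): 92 negatives, 8 positives.
y_true = [0] * 92 + [1] * 8
# A model that always predicts the majority class.
y_pred = [0] * 100
print(accuracy_and_f1(y_true, y_pred))  # (0.92, 0.0)

# Hypothetical 2-D minority points; oversample them with 6 synthetic points.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority_points, n_new=6, k=2, seed=1):
    print(point)
```

In practice, oversampling is normally applied to the training split only, so that held-out scores are measured on real, unmodified samples. For real work, a maintained implementation such as the `SMOTE` class in imbalanced-learn is the usual choice over a hand-rolled version like this one.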