Evaluating Feature Selection Methods and Feature Contributions for Cardiovascular Disease Risk Prediction
Abstract
Background: Cardiovascular disease (CVD) remains the leading cause of illness and death worldwide, underscoring the need for tools that predict risk early enough to support preventive care and timely clinical decisions. As healthcare data grow more complex, machine learning has shown considerable promise in extracting insights that improve medical decision-making. However, the performance and interpretability of machine learning models depend largely on the relevance and quality of their input features.

Methods: We compared three feature selection strategies, Alternating Decision Tree (ADT)-based analysis, Cross-Validated Feature Evaluation (CVFE), and Hypergraph-Based Feature Evaluation (HFE), to identify the most predictive clinical variables for assessing CVD risk. We used data from the National Health and Nutrition Examination Survey (NHANES), administered by the National Center for Health Statistics under the Centers for Disease Control and Prevention (CDC), comprising demographic, clinical, laboratory, and survey data collected across the U.S. from August 2021 through August 2023. The feature set produced by each selection technique was used to train an eXtreme Gradient Boosting (XGBoost) model, and the resulting models were compared on predictive performance. To understand the model's decision-making, SHapley Additive exPlanations (SHAP) was used to interpret the contribution of each feature in the top-performing model.

Results: The HFE method achieved the best results, with 75% accuracy and an AUC of 0.7857, outperforming the other two approaches. The most influential predictors identified by the best model included age, total cholesterol, glycohemoglobin level, systolic blood pressure, smoking history, and a diagnosis of diabetes. The web application, accessible at https://shiny.tricities.wsu.edu/cvdr-prediction/, presents predictions, probability scores, and a SHAP plot generated from the model trained on the feature set selected by the hypergraph-based approach.

Conclusions: This study highlights the importance of strategic feature selection in improving predictive accuracy and interpretability, offering a practical, data-centric approach that could help clinicians evaluate cardiovascular risk and tailor preventive care.

Trial registration: Not applicable, as this research is not a clinical trial.
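
The per-feature contributions that SHAP reports are based on Shapley values. The following is a minimal sketch of how such values can be computed exactly for a single patient, not the study's implementation. The scoring function `toy_risk`, its coefficients, the patient and reference rows, and the use of a single reference row as the "feature absent" state are all hypothetical assumptions made for illustration. The SHAP library uses faster, model-specific algorithms for tree ensembles such as XGBoost.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance against a single baseline row.

    Assumption: a feature that is "absent" from a coalition takes its
    baseline value. This is a simplification chosen for illustration.
    """
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's values; the rest keep the baseline.
        row = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(row)

    phi = {}
    for feat in names:
        others = [k for k in names if k != feat]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            # Shapley weight for a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                # Marginal contribution of `feat` when added to coalition `s`
                total += weight * (value(s | {feat}) - value(s))
        phi[feat] = total
    return phi


def toy_risk(row):
    """Hypothetical risk score with made-up coefficients (not the paper's model)."""
    score = 0.02 * row["age"] + 0.005 * row["total_cholesterol"] + 0.3 * row["diabetes"]
    score += 0.01 * row["age"] * row["diabetes"]  # age-diabetes interaction
    return score


# Hypothetical patient and reference rows
patient = {"age": 60, "total_cholesterol": 240, "diabetes": 1}
reference = {"age": 40, "total_cholesterol": 180, "diabetes": 0}

phi = shapley_values(toy_risk, patient, reference)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this sketch the contributions sum to the difference between the patient's score and the reference score, which is the additivity property that lets a SHAP plot decompose one prediction into per-feature effects. The interaction term is split between age and diabetes rather than assigned to either one alone.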