Diabetes Risk Stratification Using Self‐Reported Health Indicators

Abstract

This study presents a machine learning approach to classifying diabetes status using the CDC BRFSS 2015 dataset, which includes over 250,000 self-reported health records across 22 variables. The goal was to develop a multi-class classification model that assigns each individual to one of four diabetes-status classes (No, Yes, Borderline, or During Pregnancy) using only non-invasive, survey-based data. Exploratory Data Analysis (EDA) revealed meaningful trends in variables such as BMI, Sleep Time, and General Health. Spearman correlation analysis identified key associations with the target variable, with BMI (ρ = 0.31) and General Health (ρ = 0.25) showing small-to-moderate effect sizes. An XGBoost classifier was trained on an 80/20 stratified split and evaluated for accuracy and interpretability. Feature importance was assessed through built-in gain metrics and permutation importance. To enhance model transparency, SHAP (TreeExplainer) was applied, generating summary and waterfall plots that highlighted the positive contribution of features such as high BMI and poor general health toward predicting the diabetic classes. Taken together, statistical significance, effect size, and model interpretation provide robust and explainable insights into the risk factors. This work demonstrates that predictive modeling using self-reported indicators can serve as a cost-effective, scalable alternative to laboratory-based diabetes screening. The framework is particularly valuable for population-level health assessments and community outreach, enabling early identification and intervention without requiring invasive diagnostic procedures.
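
To make two of the steps named in the abstract concrete, the following is a minimal sketch of Spearman's rank correlation (with average ranks for ties, which are common in ordinal survey responses) and of an 80/20 stratified split. It is not the authors' code: the function names, the binary label, and the synthetic BMI values are all hypothetical stand-ins, and no BRFSS data or XGBoost model is involved.

```python
import random
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic stand-in data (assumption: NOT drawn from BRFSS).
rng = random.Random(42)
bmi = [rng.gauss(28, 6) for _ in range(1000)]
# Hypothetical binary label whose probability rises with BMI.
status = [1 if rng.random() < min(0.9, max(0.02, (b - 18) / 40)) else 0 for b in bmi]

rho = spearman_rho(bmi, status)
train_idx, test_idx = stratified_split(status)
print(f"rho = {rho:.2f}, train = {len(train_idx)}, test = {len(test_idx)}")
```

Splitting within each class keeps rare categories (such as Borderline in the study's four-class target) represented in both partitions at roughly the same proportion, which is the usual motivation for stratifying.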
