Diabetes Risk Stratification Using Self‐Reported Health Indicators

Tanuja Tummala

Abstract

This study presents a machine learning approach to classify diabetes status using the CDC BRFSS 2015 dataset, which includes over 250,000 self-reported health records across 22 variables. The goal was to develop a multi-class classification model that assigns each individual to one of four diabetes status classes (No, Yes, Borderline, or During Pregnancy) using only non-invasive, survey-based data. Exploratory Data Analysis (EDA) revealed meaningful trends in variables such as BMI, Sleep Time, and General Health. Statistical analysis using Spearman correlation identified key associations with the target variable, with BMI (ρ = 0.31) and General Health (ρ = 0.25) showing small-to-moderate effect sizes. An XGBoost classifier was trained on an 80/20 stratified split and evaluated for accuracy and interpretability. Feature importance was assessed through built-in gain metrics and permutation importance. To enhance model transparency, SHAP (Tree Explainer) was applied, generating summary and waterfall plots that highlighted the positive contribution of features such as high BMI and poor general health toward predicting diabetic classes. Taken together, statistical significance, effect size, and model interpretation provide robust and explainable insights into the risk factors. This work demonstrates that predictive modeling using self-reported indicators can serve as a cost-effective, scalable alternative to laboratory-based diabetes screening. The framework is particularly valuable for population-level health assessments and community outreach, enabling early identification and intervention without requiring invasive diagnostic procedures.
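As a rough illustration of two steps named in the abstract, the Spearman rank correlation and the 80/20 stratified split, the following is a minimal standard-library sketch. It is not the author's code: the function names, the toy BMI values, and the binary diabetes flag are all assumptions made for illustration, and the XGBoost and SHAP steps are deliberately left out.

```python
import random
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy, made-up values (NOT BRFSS data): BMI and a hypothetical 0/1 diabetes flag.
bmi = [22.0, 31.5, 27.0, 35.2, 24.1, 29.8, 40.3, 21.4, 33.0, 26.5]
diabetes = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

rho = spearman_rho(bmi, diabetes)
train_idx, test_idx = stratified_split(diabetes, test_fraction=0.2, seed=42)
print(f"Spearman rho (toy data): {rho:.2f}")
print("train:", train_idx, "test:", test_idx)
```

In practice the same two steps would more likely be done with an established statistics or machine learning library; the hand-rolled version is only meant to show what a rank correlation and a class-preserving split compute.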

Version published to 10.20944/preprints202506.1238.v1
Jun 16, 2025

Related articles

Health Indicator Predictions from lifestyle and biometric data using Machine Learning Models

This article has 1 author:
1. Manuela Pop
This article has no evaluations. Latest version Dec 19, 2025

Comparing Algorithm Effectiveness in Health Data Analysis

This article has 1 author:
1. Abdulmalik Hazaa Alshammari
This article has no evaluations. Latest version Jan 22, 2026

Early Prediction of Type 2 Debites Using Non-invasive Lifestyle Factors and Machine Learning

This article has 1 author:
1. Ameen Shabhashakhan
This article has no evaluations. Latest version Jan 9, 2026
