Predicting county-level diagnosed diabetes prevalence in the United States using explainable gradient boosting and geographic interpretation
Abstract
Diabetes affects approximately 38.4 million Americans, but the burden of diagnosed diabetes is not evenly distributed across U.S. counties. Existing machine-learning studies have mainly focused on individual risk prediction using biometric, clinical, or survey variables. These approaches are less suited to explaining why diagnosed diabetes prevalence differs geographically across counties. We developed an explainable gradient-boosting framework for predicting county-level diagnosed diabetes prevalence across 2,957 U.S. counties using an ecological cross-sectional design. The analysis integrated food-environment, socioeconomic, occupational, demographic, health-behavior, and clinical indicators from five public data sources. Four regression models were compared: Elastic Net, Random Forest, XGBoost, and LightGBM. LightGBM was selected as the primary model based on validation-set RMSE and interpreted using SHAP TreeExplainer. The validation-selected LightGBM model achieved a held-out test RMSE of 0.423 percentage points, R² = 0.964, and MAPE = 2.76%. Although XGBoost achieved a lower test RMSE (0.399) and a higher R² (0.968), it was retained as a secondary benchmark because primary-model selection was based only on validation performance. A sensitivity model using only structural and contextual predictors, and excluding CDC PLACES health-behavior and clinical covariates, retained substantial predictive performance (R² = 0.827). Poverty rate was the most frequent dominant positive structural SHAP contributor nationally (n = 772 counties, 26.1%), followed by food insecurity rate (n = 707, 23.9%), Supplemental Nutrition Assistance Program (SNAP) participation rate (n = 316, 10.7%), unemployment rate (n = 224, 7.6%), and median household income (n = 178, 6.0%). Residual Moran’s I decreased from 0.665 to 0.069 after model fitting. Explainable machine learning using public county-level data can characterize geographic variation in diagnosed diabetes prevalence.
County-level SHAP maps may support local hypothesis generation, but should be interpreted as explanations of model predictions rather than causal effects.
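The evaluation quantities named in the abstract (RMSE, R², MAPE, validation-based model selection, and Moran's I on residuals) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the toy prevalence values, the placeholder validation scores, and the four-county chain adjacency matrix are all assumptions made purely for illustration, and the abstract does not state which spatial weights the study used.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the outcome."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))


def select_primary(val_rmse_by_model):
    """Pick the model with the lowest validation RMSE.

    Test-set scores are deliberately not consulted, mirroring the
    validation-only selection rule described in the abstract.
    """
    return min(val_rmse_by_model, key=val_rmse_by_model.get)


def morans_i(values, weights):
    """Global Moran's I for a vector of values and a spatial weights matrix.

    I = (n / S0) * (z' W z) / (z' z), where z are mean-centered values
    and S0 is the sum of all weights.
    """
    z = np.asarray(values, float)
    z = z - z.mean()
    w = np.asarray(weights, float)
    return float((len(z) / w.sum()) * (z @ w @ z) / (z @ z))


# Toy data (hypothetical, not from the study): prevalence in percentage points.
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_pred = np.array([11.0, 11.0, 9.0, 13.0])

# Hypothetical chain adjacency: county i neighbors county i+1.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

print("RMSE:", rmse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
print("MAPE (%):", mape(y_true, y_pred))
print("Moran's I of outcome:", morans_i(y_true, W))
print("Moran's I of residuals:", morans_i(y_true - y_pred, W))

# Placeholder validation scores for two unnamed candidates.
print("Selected:", select_primary({"model_a": 0.45, "model_b": 0.47}))
```

Comparing Moran's I for the raw outcome with Moran's I for the residuals is the kind of check behind the reported drop from 0.665 to 0.069: a value near zero suggests that little spatial autocorrelation is left unexplained by the model.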