Predicting county-level diagnosed diabetes prevalence in the United States using explainable gradient boosting and geographic interpretation

Yussif Yahaya
Sagor Khan
Priyanka Rani Saha
Md Al Amin Meia

Abstract

Diagnosed diabetes affects approximately 38.4 million Americans, but its burden is not evenly distributed across U.S. counties. Existing machine-learning studies have mainly focused on individual risk prediction using biometric, clinical, or survey variables. These approaches are less suited to explaining why diagnosed diabetes prevalence differs geographically across counties. We developed an explainable gradient-boosting framework for predicting county-level diagnosed diabetes prevalence across 2,957 U.S. counties using an ecological cross-sectional design. The analysis integrated food-environment, socioeconomic, occupational, demographic, health-behavior, and clinical indicators from five public data sources. Four regression models were compared: Elastic Net, Random Forest, XGBoost, and LightGBM. LightGBM was selected as the primary model based on validation-set RMSE and interpreted using SHAP TreeExplainer. The validation-selected LightGBM model achieved a held-out test RMSE of 0.423 percentage points, R² = 0.964, and MAPE = 2.76%. Although XGBoost achieved a lower test RMSE of 0.399 and R² = 0.968, it was retained as a secondary benchmark because primary-model selection was based only on validation performance. A sensitivity model using only structural and contextual predictors, and excluding CDC PLACES health-behavior and clinical covariates, retained substantial predictive performance (R² = 0.827). Poverty rate was the most frequent dominant positive structural SHAP contributor nationally (n = 772 counties, 26.1%), followed by food insecurity rate (n = 707, 23.9%), Supplemental Nutrition Assistance Program (SNAP) participation rate (n = 316, 10.7%), unemployment rate (n = 224, 7.6%), and median household income (n = 178, 6.0%). Residual Moran’s I decreased from 0.665 to 0.069 after model fitting. Explainable machine learning using public county-level data can characterize geographic variation in diagnosed diabetes prevalence. 
County-level SHAP maps may support local hypothesis generation, but should be interpreted as explanations of model predictions rather than causal effects.
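The residual Moran's I figures quoted above measure how strongly neighboring counties share similar values: values near 1 indicate strong spatial clustering, and values near 0 indicate little remaining spatial structure. The following is a minimal pure-Python sketch of the global Moran's I statistic, not the authors' implementation. The abstract does not state the spatial weights specification, so the binary adjacency matrix, the four-unit toy layout, and the function name `morans_i` are all illustrative assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix.

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    total_sq = sum(d * d for d in dev)
    return (n / s0) * cross / total_sq


# Hypothetical toy layout: four units in a row, each adjacent to its
# immediate neighbors (binary weights, an assumption made for illustration).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbors always differ

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
```

In this toy case the clustered pattern gives a positive statistic and the alternating pattern a negative one. Applied to model residuals over a county adjacency graph, a value close to zero would suggest that the model has absorbed most of the spatial pattern, which is how the reported drop from 0.665 to 0.069 can be read.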

Version published to 10.64898/2026.06.23.26356400 on medRxiv, Jun 26, 2026

Related articles

Border-Region Status and Diagnosed Diabetes Prevalence in Texas: A Cross-Sectional Ecological Analysis

This article has 4 authors:
1. Priyanka Rani Saha
2. Sagor Khan
3. Yussif Yahaya
4. Md Al Amin Meia
This article has no evaluations. Latest version: Jun 2, 2026
Data-driven Prediction of Fifteen-Year All-Cause Mortality among 2.3 Million Individuals in the VA

This article has 14 authors:
1. Sayera Dhaubhadel
2. Judith D. Cohn
3. Tanmoy Bhattacharya
4. Ruy M. Ribeiro
5. Kumkum Ganguly
6. Nicolas Hengartner
7. Janet P. Tate
8. Lauren Costa
9. Yuk-Lam Ho
10. Kelly Cho
11. Jean C. Beckham
12. Nathan A. Kimbrel
13. Amy C. Justice
14. Benjamin H. McMahon
This article has no evaluations. Latest version: Jul 9, 2026
Development and validation of a risk prediction algorithm to estimate all-cause mortality among community-dwelling Canadians – the Mortality Population Risk Tool (MPoRT)

This article has 13 authors:
1. Douglas G. Manuel
2. Anan Bader Eddeen
3. Philippe Fines
4. Carol Bennett
5. Stacey Fisher
6. Mahsa Jessri
7. Richard Perez
8. Meltem Tuna
9. Claudia Sanmartin
10. Yulric Sequeria
11. Juan Li
12. Laura C. Rosella
13. Deirdre Hennessy
This article has no evaluations. Latest version: Jun 22, 2026
