Explainable machine learning for health disparities: type 2 diabetes in the All of Us research program

Abstract

Type 2 diabetes (T2D) is a disease with high morbidity and mortality and a disproportionate impact on minority groups. Machine learning (ML) is increasingly used to characterize T2D risk factors; however, it has not been used to study T2D health disparities. Our objective was to use explainable ML methods to discover and characterize T2D health disparity risk factors. To this end, we applied SHapley Additive exPlanations (SHAP), a class of explainable ML methods that provides interpretability for ML classifiers. ML classifiers were used to model T2D risk within and between self-identified race and ethnicity (SIRE) groups, and SHAP values were calculated to quantify the effect of T2D risk factors. We then stratified SHAP values by SIRE to quantify the effect of T2D risk factors on prevalence differences between groups. We found that ML classifiers (random forest, LightGBM, and XGBoost) accurately modeled T2D risk and recapitulated the observed prevalence differences between SIRE groups. SHAP analysis showed that the top seven most important T2D risk factors were the same for all SIRE groups, although the order of feature importance differed between groups. SHAP values stratified by SIRE showed that income, waist circumference, and education best explain the higher prevalence of T2D in the Black or African American group, compared to the White group, whereas income, education, and triglycerides best explain the higher prevalence of T2D in the Hispanic or Latino group. This study demonstrates that explainable ML can be used to elucidate health disparity risk factors and quantify their group-specific effects.
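The analysis pipeline described above (fit a classifier, attribute each prediction to its input features with SHAP, then average the attributions within each group) can be sketched from first principles. The following is a minimal illustration, not the study's actual code: the three risk features, the toy linear risk model, and the two-group labels are all hypothetical stand-ins, and the exact Shapley computation below (brute-force enumeration of feature subsets, with absent features masked by the background mean) is only tractable because the feature set is tiny. Production analyses would use a tree ensemble and an efficient explainer such as the `shap` package's TreeExplainer.

```python
from itertools import combinations
from math import factorial
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cohort: three hypothetical standardized risk features per
# person (e.g. income, waist circumference, education) plus a group
# label standing in for two SIRE groups. All illustrative assumptions.
n = 500
X = rng.normal(size=(n, 3))
group = rng.integers(0, 2, size=n)
coef = np.array([0.8, 1.2, -0.5])  # toy effect sizes, not study estimates

def model(X):
    """Toy risk model: probability of T2D via a logistic link."""
    return 1.0 / (1.0 + np.exp(-(X @ coef)))

def shapley_values(x, X_background, model):
    """Exact Shapley values for one instance. The value of a feature
    subset is the model output with absent features replaced by the
    background mean (a simple interventional value function)."""
    d = len(x)
    mu = X_background.mean(axis=0)

    def value(subset):
        z = mu.copy()
        z[list(subset)] = x[list(subset)]
        return model(z[None, :])[0]

    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            for S in combinations(others, r):
                # Shapley kernel weight |S|! (d - |S| - 1)! / d!
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

# Attribute predictions for a subsample, then stratify mean SHAP by
# group to see which features drive the between-group risk difference.
phi_all = np.array([shapley_values(x, X, model) for x in X[:100]])
g = group[:100]
for lbl in (0, 1):
    print(f"group {lbl}: mean SHAP per feature =",
          phi_all[g == lbl].mean(axis=0).round(3))
```

By the efficiency property of Shapley values, each row of `phi_all` sums to the model's prediction for that person minus the prediction at the background mean, so the group-stratified averages decompose the groups' mean predicted risk difference into per-feature contributions.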

Author Summary

While machine learning (ML) methods hold great promise for epidemiological studies, their practical utility is limited by a lack of interpretability. Increasingly complex ML models are adept at predicting disease risk, but how they arrive at a given prediction is often obscured by model complexity. Explainable ML is an emerging discipline that seeks to render ML models more transparent by elucidating how and why input features contribute to output predictions. This study reports a novel application of explainable ML to epidemiology, focusing on type 2 diabetes (T2D) as a paradigm of health disparities. We found that ML classifiers were able to accurately model T2D disparities for a large cohort of Black, Hispanic, and White Americans, and explainable ML revealed which risk factors contributed to the observed disparities and how. The results demonstrate that explainable ML can be a powerful tool for the discovery and characterization of health disparity risk factors.
