Comparative Performance of Linear Regression and Machine Learning Models for Predicting Glycemic Status in Uncontrolled Type 2 Diabetes: SHAP-Based Analysis
Abstract
Background This study aimed to compare the predictive performance of linear regression (LR) with that of other machine learning (ML) models and to assess the importance of clinical, biochemical, and medication adherence predictors using SHapley Additive exPlanations (SHAP) analysis. Methods A cross-sectional study was conducted among adults (≥ 18 years) with type 2 diabetes mellitus (T2DM) and uncontrolled glycated hemoglobin (HbA1c ≥ 7%); HbA1c was the primary outcome. After data preprocessing and feature selection, four supervised regression models were trained and evaluated: LR, random forest (RF), support vector regression (SVR), and extreme gradient boosting (XGBoost). An ANOVA F-test identified the top predictive continuous variables, and SHAP analysis was used for clinical interpretation. Results Data from 223 patients were analyzed (mean age: 57.4 ± 9.8 years; 50.7% female). LR achieved the highest coefficient of determination (R² = 0.28), while RF had the lowest mean absolute error (MAE = 1.18). SVR and XGBoost underperformed, with R² values of 0.19 and 0.07, respectively. Key predictors of high HbA1c included fasting blood glucose (FBG), diastolic blood pressure (DBP), body mass index (BMI), insulin dose, serum magnesium concentration, and medication adherence. SHAP analysis confirmed the influence of DBP, FBG, insulin dose, magnesium levels, and low adherence on elevated HbA1c. Conclusion Although the RF model predicted HbA1c moderately well, LR outperformed the other ML models in terms of R². SHAP analysis highlighted interpretable predictors, supporting the use of explainable ML models for personalized glycemic risk stratification and clinical decision-making in T2DM management. Future studies should consider larger, multi-center datasets with more features and external validation to improve the predictive accuracy and generalizability of ML models.
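The Results report that LR had the highest R² while RF had the lowest MAE. The two metrics can rank models differently because R² is built on squared errors and MAE on absolute errors. The following is a minimal sketch of both metrics in plain Python. The HbA1c values and the two sets of predictions (`model_a`, `model_b`) are made up for illustration and are not the study's data or models.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical HbA1c values (%), for illustration only; NOT study data.
y_true = [7.0, 8.0, 9.0, 10.0, 11.0]
model_a = [7.5, 8.5, 9.5, 10.5, 11.5]  # every prediction off by 0.5
model_b = [7.0, 8.0, 9.0, 10.0, 12.8]  # exact except for one large miss

scores = {
    name: (r_squared(y_true, pred), mean_absolute_error(y_true, pred))
    for name, pred in [("model_a", model_a), ("model_b", model_b)]
}

for name, (r2, mae) in scores.items():
    print(f"{name}: R2 = {r2:.3f}, MAE = {mae:.2f}")
```

In this toy case `model_a` has the higher R² and `model_b` the lower MAE, because squaring penalizes the single large miss more heavily than five small ones. This shows only that such a split is possible, not why it occurred in the study.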