A Physiology-Aware, Orthogonalized, Calibration-First Machine-Learning Diabetes Risk Model Robust to Population Shift
Abstract
Type 2 diabetes remains a global health burden, with rising prevalence despite extensive public health campaigns and clinical interventions. Traditional risk estimation tools, such as the American Diabetes Association (ADA) risk test and rule-based scoring systems, have long been used for early identification but often lack calibration, interpretability, or adaptability across populations. More recently, machine learning (ML) approaches have attempted to improve predictive accuracy using complex ensemble models or deep learning architectures. However, these methods frequently sacrifice interpretability and calibration, and they often underperform when externally validated or applied under distributional shift. To address these limitations, we propose a calibration-centered, physiology-aware, and parsimonious logistic regression model using five core predictors: age, BMI, glucose, insulin, and pregnancies. Our approach incorporates a novel orthogonalization of insulin with respect to glucose to manage collinearity, and it prioritizes calibrated probability estimates through Brier score minimization. Experimental evaluation against multiple ML baselines (e.g., Logistic Regression, Random Forest, XGBoost, Gradient Boosting) demonstrated that our model achieved comparable or superior performance (Logistic Regression: AUROC = 0.836, AUPRC = 0.714, Brier = 0.166) while maintaining excellent calibration and generalizability under covariate shift simulations. Notably, our model’s simplicity and transparency make it well suited to clinical deployment, offering interpretable and actionable insights into individual risk. Compared to prior models, ours provides a rare balance of statistical rigor, clinical relevance, and resilience to population shift. These results reinforce the need to prioritize model trustworthiness over complexity and point toward a new paradigm of machine learning in healthcare, one grounded in physiological understanding, probabilistic honesty, and real-world usability.
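The two technical ideas named in the abstract, orthogonalizing insulin with respect to glucose and scoring probabilities with the Brier score, can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so this assumes orthogonalization means taking the residual of an ordinary least-squares regression of insulin on glucose; the function names `orthogonalize` and `brier_score` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def orthogonalize(target, reference):
    """Residual of an OLS fit of `target` on `reference` (with intercept).

    Assumption: this residualization stands in for the paper's
    orthogonalization step, whose details the abstract does not give.
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)
    design = np.column_stack([np.ones_like(reference), reference])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


# Synthetic, illustrative data only: insulin is built to correlate with glucose.
rng = np.random.default_rng(0)
glucose = rng.normal(120.0, 30.0, size=500)
insulin = 0.8 * glucose + rng.normal(0.0, 20.0, size=500)

insulin_resid = orthogonalize(insulin, glucose)

corr_before = np.corrcoef(glucose, insulin)[0, 1]
corr_after = np.corrcoef(glucose, insulin_resid)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.2e}")
```

The residual carries only the part of insulin that glucose does not linearly explain, so the two predictors no longer compete for the same variance in a logistic regression. In a real pipeline the regression coefficients would presumably be estimated on training data only and then reused on validation or shifted data, to avoid leakage.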