Predicting Type 2 Diabetes Using Baseline and Longitudinal Changes in Lifestyle and Clinical Markers: A Machine Learning Approach
Abstract
Most type 2 diabetes (T2D) prediction models rely on static baseline measurements and often include diagnostic glycemic markers, limiting their ability to capture temporal risk evolution and creating circular reasoning. This study developed a machine learning framework that systematically integrates baseline measurements with longitudinal interval changes to predict incident T2D. Using the Ansan-Ansung cohort of the Korean Genome and Epidemiology Study (KoGES; 2001–2018), we included 7,510 initially diabetes-free participants in this prospective analysis. The framework jointly modeled static variables and 2-year interval changes in lifestyle, anthropometric, and biochemical markers using XGBoost, Random Forest, LGBM, logistic regression, neural networks, and ensemble methods. Principal component analysis addressed multicollinearity. Diagnostic glycemic markers (fasting glucose, HbA1c) were excluded to ensure genuine risk prediction. The ensemble model achieved AUROC 0.763, with XGBoost (0.752) and LGBM (0.750) showing comparable performance. SHAP analysis identified changes in C-reactive protein (ΔCRP) and body mass index (ΔBMI), together with baseline triglycerides, as the most influential predictors. Examination of decision tree structures revealed clinically meaningful and biologically plausible thresholds (e.g., BMI < 25.6 kg/m²). The resulting ensemble model was implemented through the Multi-Domain Simulation Interface (MDSi) framework, enabling population-level inference across lifestyle, anthropometric, and metabolic domains. Overall, change variables contributed more strongly than static measures, suggesting that accelerated physiological shifts precede the onset of T2D.
By capturing dynamic metabolic trajectories rather than static risk profiles, this framework differentiates true risk prediction from early disease detection, enabling clinically interpretable prediction with substantial potential for preventive interventions before diagnostic thresholds are reached.
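
To make the feature construction described in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It pairs each baseline marker with its change over one interval, applies principal component analysis to reduce collinearity, and scores a gradient-boosted classifier by AUROC. Several things here are assumptions for illustration only: the data are synthetic, the column names (`bmi_t0`, `crp_t1`, and so on) and the helper `add_interval_changes` are hypothetical, the 95% variance cutoff for PCA is arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for the XGBoost and LGBM models named in the abstract. No glycemic marker appears among the features, mirroring the exclusion of fasting glucose and HbA1c.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def add_interval_changes(df, markers, prev="_t0", nxt="_t1"):
    """Hypothetical helper: baseline value plus change (t1 - t0) per marker."""
    out = pd.DataFrame(index=df.index)
    for m in markers:
        out[f"{m}_baseline"] = df[m + prev]
        out[f"delta_{m}"] = df[m + nxt] - df[m + prev]
    return out


# Synthetic stand-in for two consecutive survey waves (NOT cohort data).
rng = np.random.default_rng(0)
n = 2000
markers = ["bmi", "crp", "tg"]
raw = pd.DataFrame({
    "bmi_t0": rng.normal(24, 3, n),
    "crp_t0": rng.lognormal(0, 0.5, n),
    "tg_t0": rng.normal(130, 40, n),
})
raw["bmi_t1"] = raw["bmi_t0"] + rng.normal(0, 1.0, n)
raw["crp_t1"] = raw["crp_t0"] + rng.normal(0, 0.5, n)
raw["tg_t1"] = raw["tg_t0"] + rng.normal(0, 20, n)

features = add_interval_changes(raw, markers)

# Synthetic outcome in which changes matter more than baselines (assumed
# coefficients, chosen only so the example has a learnable signal).
logit = (-2.0
         + 0.8 * features["delta_bmi"]
         + 0.5 * features["delta_crp"]
         + 0.01 * (features["tg_baseline"] - 130))
prob = 1.0 / (1.0 + np.exp(-logit.to_numpy()))
y = rng.binomial(1, prob)

X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize, decorrelate with PCA (keep 95% of variance), then boost.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    GradientBoostingClassifier(random_state=0),
)
model.fit(X_train, y_train)

auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUROC on held-out synthetic data: {auroc:.3f}")
```

Fitting the scaler and PCA inside the pipeline keeps both transformations restricted to the training split, so the held-out AUROC is not inflated by information from the test rows. Because PCA mixes the original variables, an attribution analysis such as SHAP would need to be run on a model trained on the untransformed features, or mapped back through the component loadings.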