Improving Type 2 Diabetes Prediction: Comparative Evaluation of Machine Learning Classifiers Using Balanced Data from the AWI-Gen Cohort

Abstract

Background: Type 2 diabetes mellitus (T2DM) is an escalating public health concern across Africa, but regionally tailored predictive models are scarce. Advances in machine learning (ML) offer potential for early identification, though previous research has been constrained by methodological issues such as data leakage, class imbalance, and overfitting, limiting clinical deployment, especially in digital health contexts.

Methods: This study analysed data from 2,010 participants in the H3Africa AWI-Gen cohort in northern Ghana to develop and evaluate ML-based prediction models tailored to African settings. Rigorous preprocessing steps, including handling class imbalance with SMOTE and excluding diagnostic biomarkers prone to target leakage, were applied. Eight ML classifiers underwent robust Bayesian hyperparameter optimisation. Model performance was assessed via stratified 5-fold cross-validation and confirmed through extensive sensitivity and calibration analyses.

Results: The optimised XGBoost model yielded an AUC of 0.845 (95% CI: 0.812–0.878) and a sensitivity of 78.2% on unseen data. Including glucose as a predictor increased performance by 11.5%, underscoring the necessity of its exclusion to avoid biased evaluation. Models using only anthropometric and lifestyle variables (AUC = 0.783) demonstrated robust predictive capacity, with waist circumference, physical activity, and BMI standing out as the most stable predictors across analyses.

Conclusion: Our findings demonstrate that ML models constructed from routinely collected clinical and lifestyle data can attain clinically meaningful diabetes prediction suitable for digital health applications in low-resource African contexts. This study addresses prior methodological gaps and offers a data-driven framework that is both robust and clinically plausible for early T2DM detection, with potential implications for public health policy and digital screening programmes in similar populations.
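
The Methods combine class rebalancing with stratified 5-fold cross-validation. The abstract does not say at which stage SMOTE was applied, so the following is only a minimal sketch of one common leakage-aware arrangement, assuming oversampling is done on the training folds alone so that every held-out fold contains only real participants. It is not the authors' code: the function names (`stratified_folds`, `smote_like`, `resampled_cv_splits`), the simplified SMOTE-style interpolation, and the synthetic two-feature toy data are all hypothetical, written with the Python standard library only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split row indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def smote_like(X, y, minority=1, n_neighbors=3, seed=0):
    """Simplified SMOTE-style oversampling for binary labels (illustrative).

    New minority rows are interpolated between a minority row and one of
    its nearest minority neighbours until both classes are the same size.
    """
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    if len(minority_rows) < 2 or n_needed <= 0:
        return list(X), list(y)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        others = [row for row in minority_rows if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:n_neighbors])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return list(X) + synthetic, list(y) + [minority] * n_needed


def resampled_cv_splits(X, y, k=5, seed=0):
    """Yield (X_train, y_train, X_test, y_test); only training data is resampled."""
    folds = stratified_folds(y, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        X_train, y_train = smote_like(
            [X[t] for t in train_idx], [y[t] for t in train_idx], seed=seed + i
        )
        yield X_train, y_train, [X[t] for t in test_idx], [y[t] for t in test_idx]


# Toy imbalanced data (80 negatives, 20 positives); purely synthetic.
toy_rng = random.Random(42)
X = [[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(80)]
X += [[toy_rng.gauss(2, 1), toy_rng.gauss(2, 1)] for _ in range(20)]
y = [0] * 80 + [1] * 20

splits = list(resampled_cv_splits(X, y, k=5, seed=0))
for X_train, y_train, X_test, y_test in splits:
    print(len(y_train), y_train.count(1), len(y_test), y_test.count(1))
```

In this arrangement the classifier would be fitted on each balanced training set and scored on the matching untouched test fold. Resampling before splitting would instead let interpolated copies of held-out participants leak into training, which is the kind of optimistic bias the abstract warns about.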
