Early risk prediction of gestational diabetes using routine antenatal care clinical data using machine learning

Abstract

Gestational diabetes mellitus (GDM) is a major contributor to adverse maternal and neonatal outcomes, particularly in low-resource settings where access to timely diagnostic testing is limited. Early risk stratification using routinely collected antenatal data may support targeted screening and preventive interventions. This study developed and evaluated machine learning models for early GDM risk prediction using routinely collected antenatal data from 3,525 pregnancies in Uganda, comprising 2,153 non-GDM and 1,372 GDM cases. Logistic regression, Random Forest, XGBoost, and a soft-voting ensemble were trained and evaluated on a held-out test set. Model performance was assessed using the area under the receiver operating characteristic curve (ROC AUC), the area under the precision–recall curve (PR AUC), accuracy, precision, recall, and F1-score. Confusion matrix analysis was used to examine screening trade-offs, and model interpretability was examined using Shapley Additive Explanations (SHAP). Non-linear models demonstrated strong discrimination, with Random Forest and XGBoost achieving ROC AUC values of 0.997 and PR AUC values exceeding 0.995. Random Forest achieved an accuracy of 0.971, a recall of 0.994, and an F1-score of 0.963, missing only two GDM cases in the test set. Logistic regression showed slightly lower but still robust performance (ROC AUC = 0.982, F1-score = 0.951). The ensemble model did not consistently outperform the strongest individual learner. SHAP analysis identified body mass index, HDL cholesterol, blood pressure, and prediabetes status as the most influential predictors, with feature effects aligning with established clinical knowledge. Machine learning models, particularly Random Forest and XGBoost, can support early GDM risk stratification using routinely collected antenatal data. The combination of strong discrimination, high sensitivity, and clinically interpretable explanations highlights the potential of these models as decision-support tools for antenatal screening in low-resource healthcare settings.
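
The threshold-based metrics named in the abstract (accuracy, precision, recall, F1-score) all follow from the four cells of a binary confusion matrix. The sketch below shows that arithmetic in plain Python. The function name screening_metrics and the counts in the usage example are hypothetical and chosen only for illustration; they are not the study's confusion matrix, which the abstract does not report.

```python
def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive common screening metrics from confusion-matrix counts.

    tp: GDM cases flagged as GDM       fn: GDM cases missed
    fp: non-GDM cases flagged as GDM   tn: non-GDM cases correctly cleared
    """

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    total = tp + fn + fp + tn
    recall = ratio(tp, tp + fn)          # sensitivity: share of true cases caught
    precision = ratio(tp, tp + fp)       # share of flagged cases that are true cases
    specificity = ratio(tn, tn + fp)     # share of non-cases correctly cleared
    accuracy = ratio(tp + tn, total)
    f1 = ratio(2 * tp, 2 * tp + fp + fn) # harmonic mean of precision and recall

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from the study).
    metrics = screening_metrics(tp=90, fn=10, fp=20, tn=180)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

In a screening context the missed cases (fn) are usually the costlier error, which is why recall is reported alongside accuracy: a model can trade some precision for fewer missed cases by lowering its decision threshold.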
