Early risk prediction of gestational diabetes using routine antenatal care clinical data using machine learning

Emmanuel Ahishakiye
Justine Nakirijja
Shallon Ahimbisibwe

Abstract

Gestational diabetes mellitus (GDM) is a major contributor to adverse maternal and neonatal outcomes, particularly in low-resource settings where access to timely diagnostic testing is limited. Early risk stratification using routinely collected antenatal data may support targeted screening and preventive interventions. This study developed and evaluated machine learning models for early GDM risk prediction using routinely collected antenatal data from 3,525 pregnancies in Uganda, comprising 2,153 non-GDM and 1,372 GDM cases. Logistic regression, Random Forest, XGBoost, and a soft-voting ensemble were trained and evaluated on a held-out test set. Model performance was assessed using receiver operating characteristic area under the curve (ROC AUC), precision–recall AUC (PR AUC), accuracy, precision, recall, and F1-score. Confusion matrix analysis was used to examine screening trade-offs, and model interpretability was evaluated using Shapley Additive Explanations (SHAP). Non-linear models demonstrated strong discrimination, with Random Forest and XGBoost achieving ROC AUC values of 0.997 and PR AUC values exceeding 0.995. Random Forest achieved an accuracy of 0.971, recall of 0.994, and F1-score of 0.963, missing only two GDM cases in the test set. Logistic regression showed slightly lower but still robust performance (ROC AUC = 0.982, F1-score = 0.951). The ensemble model did not consistently outperform the strongest individual learner. SHAP analysis identified body mass index, HDL cholesterol, blood pressure, and prediabetes status as the most influential predictors, with feature effects aligning with established clinical knowledge. Machine learning models, particularly Random Forest and XGBoost, can support early GDM risk stratification using routinely collected antenatal data. 
The combination of strong discrimination, high sensitivity, and clinically interpretable explanations highlights the potential of these models as decision-support tools for antenatal screening in low-resource healthcare settings.
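
The evaluation workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic (generated with `make_classification`, sized and weighted to resemble the reported 2,153 to 1,372 class split), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example within one library, the 80/20 stratified split and the 0.5 decision threshold are assumptions, and PR AUC is approximated with average precision. All variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for antenatal records (NOT the study data).
X, y = make_classification(
    n_samples=3525, n_features=10, n_informative=6,
    weights=[0.61, 0.39], random_state=0,
)

# Assumed 80/20 stratified split; the abstract only says "held-out test set".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

base_models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    # Assumption: gradient boosting from scikit-learn replaces XGBoost here.
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# Soft voting averages the predicted class probabilities of the base models.
models = dict(base_models)
models["soft_vote"] = VotingClassifier(estimators=base_models, voting="soft")

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)  # assumed decision threshold
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    results[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "recall": tp / (tp + fn),
        "missed_cases": int(fn),  # false negatives: positives the model missed
    }

for name, metrics in results.items():
    print(name, metrics)
```

In a screening context the false-negative count is often the figure to watch, which is why the sketch reports it alongside recall; lowering the threshold below 0.5 would trade more false positives for fewer missed cases.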

Version published to 10.21203/rs.3.rs-8721947/v1 on Research Square
Feb 17, 2026

Related articles

Perinatal Mortality Prediction and Risk Factor Identification Using Machine Learning on Recent Sub-Saharan African DHS Data Affiliations

This article has 8 authors:
1. Tadele Chekol Maru
2. Andualem Enyew
3. Makda Fekadie Tewelgne
4. Eliyas Addisu Taye
5. Agerie Mengistie Zeleke
6. Belayneh Jejaw Abate
7. Deresse Abebe Gebrehana
8. Azanaw Amare Muche
This article has no evaluations. Latest version: Mar 30, 2026

Predicting Adequate Antenatal Care Utilization Among Pregnant Women in Kenya: A Comparative Machine Learning Study Using the Kenya Demographic and Health Survey

This article has 1 author:
1. Calvince Otieno Ngaji
This article has no evaluations. Latest version: Mar 27, 2026

Optimizing machine learning models for predicting iron supplementation uptake among pregnant women in Somaliland: insights from the 2020 Somaliland demographic and health survey data

This article has 3 authors:
1. Abdifatah Ibrahim Mouse
2. Omran Salih
3. Abdisalam Hassan Muse
This article has no evaluations. Latest version: Mar 17, 2026
