Application of Machine Learning to Predict Teenage Pregnancy in Zambia: Evidence from 2024 Zambia Demographic and Health Surveys
Abstract
Background: Teenage pregnancy remains a significant public health issue in Zambia, contributing to high maternal and child morbidity and mortality, school dropout, and socioeconomic vulnerability. Identifying its most influential predictors is a key first step in developing targeted interventions. This study therefore applied machine learning (ML) and ordinary logistic regression to determine and explain the most influential predictors of teenage pregnancy in Zambia.

Methods: This cross-sectional study used data from the 2024 Zambia Demographic and Health Survey (ZDHS). Feature selection was performed using mutual information scores and the Boruta algorithm within 5-fold cross-validation (CV). The data were split into an 80% training set and a 20% held-out test set. Seven ML models were trained in Python on 80% of the training set using 5-fold CV, and the metrics were averaged across folds to identify the best-performing model, which was then evaluated on the held-out test set. Model performance was evaluated using accuracy and the area under the receiver operating characteristic curve (AUC-ROC). The SHapley Additive exPlanations (SHAP) method was applied to determine feature importance. Weighted logistic regression in Stata was then used to estimate odds ratios (ORs) for key predictors.

Results: A total of 3292 adolescents were included in the analysis, of whom 26.3% had experienced teenage pregnancy. The Extreme Gradient Boosting (XGB) model outperformed the other models, achieving an accuracy of 0.806 (0.789–0.822) and an AUC-ROC of 0.839 (0.821–0.858). On the held-out test set, the XGB model achieved an accuracy of 0.746 and an AUC-ROC of 0.824. The most influential predictors of teenage pregnancy were the participant’s current age, followed by age of the household head, knowledge of the ovulatory cycle, wealth status, education, type of place of residence, frequency of watching television, internet use, and province.
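The split-and-validate workflow and the two metrics named in the Methods (accuracy and AUC-ROC) can be illustrated without any ML library. The following is a minimal, dependency-free sketch, not the authors' code: the helper names (`train_test_split_indices`, `k_fold`, `accuracy`, `auc_roc`) are hypothetical, the toy labels and scores are invented for illustration, and only the sample size of 3292 comes from the abstract. No model is fitted here.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=5):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC-ROC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 80/20 split of the 3292 respondents reported in the abstract.
train_idx, test_idx = train_test_split_indices(3292)

# Sizes of the five validation folds carved out of the training set.
fold_sizes = [len(validate) for _, validate in k_fold(train_idx, k=5)]

# Toy labels and predicted probabilities (invented, for illustration only).
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.4]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 threshold is an assumption

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = auc_roc(y_true, scores)
```

In practice, per-fold metrics would be computed on each `validate` subset and averaged, and the held-out `test_idx` rows would be touched only once, for the final evaluation of the selected model.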
A one-year increase in the participant’s age (OR = 2.20; 95% CI = 2.03–2.37) and knowledge of the ovulatory cycle (OR = 2.70; 95% CI = 2.12–3.42) increased the odds of teenage pregnancy in Zambia. Conversely, a one-year increase in the age of the household head (OR = 0.98; 95% CI = 0.98–0.99), belonging to the richer wealth quintile (OR = 0.58; 95% CI = 0.38–0.90), belonging to the richest wealth quintile (OR = 0.18; 95% CI = 0.10–0.31), having attained secondary or higher education (OR = 0.42; 95% CI = 0.25–0.71), watching television at least once a week (OR = 0.67; 95% CI = 0.47–0.95), watching television almost every day (OR = 0.72; 95% CI = 0.52–0.99), and internet use in the last 12 months (OR = 0.67; 95% CI = 0.49–0.92) reduced the odds of teenage pregnancy.

Conclusion: Applying ML models and conventional modelling could help identify and explain the most influential predictors of teenage pregnancy to support evidence-based, targeted interventions.
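To make the reported odds ratios easier to read, the sketch below shows how a logistic regression coefficient maps to an odds ratio with a Wald-type confidence interval, and how a per-year odds ratio scales over several years. It is a minimal illustration under stated assumptions, not the authors' Stata analysis: the function names are hypothetical, survey weighting is ignored, and the scaling assumes the effect of age is linear on the log-odds scale.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logit coefficient and its standard error to (OR, lower, upper).

    Uses the Wald interval exp(beta +/- z * se); z = 1.96 gives roughly 95%.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, assuming linearity in the log-odds."""
    return or_per_unit ** units

# Reported OR for a one-year increase in the participant's age.
or_age_per_year = 2.20

# Implied odds ratio for a two-year age difference under the linearity assumption.
or_age_two_years = scale_odds_ratio(or_age_per_year, 2)

# A coefficient of zero corresponds to an odds ratio of exactly 1 (no association).
# The standard error of 0.1 is an arbitrary illustrative value.
null_or, null_lower, null_upper = odds_ratio_ci(beta=0.0, se=0.1)
```

Under these assumptions, the per-year odds ratio of 2.20 implies an odds ratio of about 4.84 for a two-year age difference, which is why small per-unit odds ratios on continuous predictors can still correspond to large differences across the adolescent age range.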