A machine learning approach for predicting childhood anaemia in Lesotho: An analysis of the 2023-24 Demographic and Health Survey

Abstract

Background Anaemia remains a major global health concern, particularly in sub-Saharan Africa, where it contributes significantly to childhood morbidity and mortality. In Lesotho, the 2014 Lesotho Demographic and Health Survey (LDHS) reported that 51% of children under five were anaemic. Although conventional statistical approaches have identified some risk factors, the use of machine learning (ML) to predict childhood anaemia has received little attention in Lesotho. This study aimed to develop and compare multiple ML algorithms for predicting anaemia among children aged 6–59 months using the most recent LDHS data.

Methods A secondary analysis was performed using data from the 2023–2024 LDHS, including all children with valid haemoglobin measurements. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training dataset so that the models would learn from both classes more evenly. The data were then split into training (80%) and testing (20%) subsets. Six ML algorithms (Logistic Regression, Decision Tree, K-Nearest Neighbours, Support Vector Machine, Random Forest, and Extreme Gradient Boosting (XGBoost)) were trained and evaluated using a 5-fold cross-validation procedure. Model performance was assessed on the independent test set using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), accuracy, sensitivity, and specificity.

Results The prevalence of anaemia was 33.7%. After SMOTE balancing, the ensemble models (Random Forest and XGBoost) outperformed the traditional classifiers. Random Forest achieved the highest performance (AUC = 0.841, accuracy = 75.2%, sensitivity = 76.6%, specificity = 73.8%), followed by XGBoost (AUC = 0.792, accuracy = 72.5%). Logistic Regression showed the weakest predictive ability (AUC = 0.522). Feature importance analysis identified the child’s age in months as the most influential predictor, followed by the child’s sex, recent morbidity, and household wealth quintile.

Conclusion Ensemble machine learning methods, particularly Random Forest, can accurately predict childhood anaemia in Lesotho using routinely collected socio-demographic, health, and nutritional data. Incorporating SMOTE improved model balance and generalizability. The resulting model offers a scalable and practical decision-support tool for early identification of high-risk children in resource-limited settings, supporting more targeted screening and timely intervention to reduce the national anaemia burden.
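
The four evaluation metrics named in the abstract (AUC-ROC, accuracy, sensitivity, and specificity) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the eight labels and scores are made-up toy values rather than survey data, and the 0.5 decision threshold is an assumption. Labels use 1 for anaemic and 0 for not anaemic.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating 1 (anaemic) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly anaemic children that the model flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of non-anaemic children that the model correctly clears."""
    return tn / (tn + fp)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all children classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as half (the Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not results from the survey).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
THRESHOLD = 0.5  # assumed cut-off for turning a score into a class label
y_pred = [1 if s >= THRESHOLD else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn))        # 0.75
print(specificity(tn, fp))        # 0.75
print(accuracy(tp, fp, tn, fn))   # 0.75
print(auc_roc(y_true, scores))    # 0.875
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC is computed from the ranking of scores alone, which is one reason a model can report an AUC that differs noticeably from its accuracy.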
