From Risk Factors to Predictive Modelling: Applying Machine Learning to Childhood Malaria Surveillance in Resource-Limited Settings
Abstract
Background: Malaria remains a serious public health concern in sub-Saharan Africa, especially among children under five. Despite control efforts, Nigeria accounts for almost 30% of malaria-related child deaths globally. Machine learning (ML) approaches can detect complex patterns in large datasets and may therefore improve prediction accuracy, clarify the drivers of malaria in children, and inform targeted interventions.
Methods: We conducted a cross-sectional study of 693 caregiver-child pairs from high-burden Internally Displaced Persons (IDP) camps in Nigeria. Data on sociodemographic characteristics, household conditions, malaria knowledge, and prevention practices were collected alongside Rapid Diagnostic Test (RDT) results. The data were split 70:30 to train and evaluate four ML models: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Gradient Boosting Machine (GBM). Model performance was evaluated using Area Under the Curve (AUC), precision, recall, and F1-score, and variable importance was used to identify key predictors.
Results: Malaria prevalence was 68.5%, and significant associations were observed with caregiver gender, education, and housing conditions. Having a male caregiver was associated with lower odds of malaria positivity (aOR = 0.44, p < 0.001), as were mud walls (aOR = 0.60, p = 0.002). Random Forest (AUC = 0.89) was the top-performing model, identifying caregiver occupation (15.7% importance) and residential camp (14.7% importance) as the leading predictors. GBM (AUC = 0.87) and LR (AUC = 0.82) followed, while DT (AUC = 0.78) had the lowest AUC. There was a clear knowledge gap: 60.3% of caregivers lacked malaria prevention knowledge.
Conclusion: Machine learning improved malaria risk prediction, with RF performing best among the four models. Important modifiable variables include housing conditions, caregiver education, and localized vector control.
This study recommends a precision public health approach integrating ML within surveillance for real-time risk mapping and resource optimization in high-burden areas.
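To make the evaluation criteria named in the Methods concrete, the following is a minimal standard-library sketch of a 70:30 split and of the AUC, precision, recall, and F1 calculations. It is not the authors' code: the function names, the rounding used for the split, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and an analysis of this kind would more typically rely on an established ML library.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in a 70:30 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))  # rounding rule is an assumption
    return shuffled[:cut], shuffled[cut:]

def auc(y_true, scores):
    """AUC as the chance that a random positive case is scored above a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = RDT positive)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data, invented for illustration only (not study data).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

train, test = split_70_30(range(693))  # 693 = number of pairs in the study
print(len(train), len(test))
print(round(auc(y_true, scores), 3))
print(precision_recall_f1(y_true, y_pred))
```

AUC is threshold-independent, whereas precision, recall, and F1 depend on the cut-off applied to the predicted scores, which is one reason to report them together.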