AI for Cholera Outbreak Prediction, Real-Time Tracking, and Low-Resource Diagnostics using Federated and Privacy-Preserving Machine Learning

Abstract

This study presents a multi-model computational framework for predicting cholera outbreaks from spatio-temporal, climatic, and socio-environmental predictors in regions with recurrent epidemics. The dataset comprised 17,842 region-day records from January 2024 to May 2025, split 70%/15%/15% into training, validation, and test sets for the cross-sectional models, with rolling-origin splits for the time-series models. Two forecasting tasks were examined: (i) prediction of reported case counts and (ii) classification of outbreak severity as low (0–9 cases), medium (10–29 cases), or high (≥ 30 cases). Baseline statistical analyses used Poisson and Negative Binomial regression. An overdispersion test (variance/mean ratio = 2.7) favored the Negative Binomial model, which identified rainfall (IRR = 1.18, 95% CI: 1.10–1.26) and water salinity (IRR = 1.11, 95% CI: 1.06–1.16) as major contributors to outbreak risk, whereas sanitation coverage was associated with a 23% lower incidence rate (IRR = 0.77, 95% CI: 0.71–0.84). Machine learning models improved substantially on these baselines. Random Forest regression lowered RMSE from 41.2 (baseline) to 28.9, while classification reached a macro-F1 of 0.81. XGBoost improved classification further, with macro-F1 = 0.87 and ROC-AUC = 0.91, surpassing Random Forest (macro-F1 = 0.79, ROC-AUC = 0.86). SHAP analysis identified rainfall, sanitation, and mobility index as the three primary factors, together accounting for 62% of the variance in outbreak predictions. Among the deep learning models, Long Short-Term Memory (LSTM) networks delivered the most accurate temporal forecasts. At a 7-day horizon, LSTM achieved RMSE = 25.3 ± 6.2, MAE = 18.4 ± 4.7, and MAPE = 12.8 ± 3.1, compared with RMSE = 27.9 ± 7.4 and MAPE = 17.5 ± 4.5 for ARIMA and MAPE ≥ 20% for naive benchmarks.
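The severity bands and the variance/mean overdispersion check mentioned in the abstract can be illustrated with a minimal sketch. The function names and the example counts below are hypothetical and are not taken from the study; only the band thresholds (0–9, 10–29, ≥ 30) come from the text.

```python
from statistics import mean, variance


def severity_label(cases: int) -> str:
    """Map a reported case count to the severity bands given in the abstract."""
    if cases < 0:
        raise ValueError("case count must be non-negative")
    if cases <= 9:
        return "low"
    if cases <= 29:
        return "medium"
    return "high"


def dispersion_ratio(counts: list[int]) -> float:
    """Sample variance divided by sample mean.

    A Poisson model implies a ratio near 1; a ratio well above 1
    suggests overdispersion, where a Negative Binomial model may fit better.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; ratio is undefined")
    return float(variance(counts)) / float(m)


# Hypothetical daily case counts for one region (illustration only).
example_counts = [2, 0, 5, 12, 31, 7, 9, 18, 3, 44]
labels = [severity_label(c) for c in example_counts]
ratio = dispersion_ratio(example_counts)
print(labels)
print(f"variance/mean ratio: {ratio:.2f}")
```

A ratio computed this way is only a quick diagnostic; the abstract does not say which formal overdispersion test the study used.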
At a 14-day horizon, LSTM maintained its advantage with RMSE = 39.5 ± 10.2 and MAPE = 20.5 ± 5.6, surpassing ARIMA (RMSE = 41.2 ± 11.0; MAPE = 24.7 ± 6.3). Federated learning trials with 5 regional clients performed comparably to centralized learning, reaching an accuracy of 0.84 without differential privacy (DP) and 0.78 with DP (σ = 1.0). The privacy-utility trade-off yielded ε = 3.1–7.8 at δ = 1e-5, and an average communication overhead of 11.4 MB per round supports practicality in low-bandwidth settings. Overall, LSTM-based forecasting improved prediction accuracy by as much as 25% over ARIMA and 35% over naive methods, while XGBoost improved outbreak severity classification by 8% relative to Random Forest. Federated models offered privacy-preserving scalability with only a 5–9% loss in utility. These findings highlight the promise of combining ensemble learning, deep temporal models, and federated AI to build resilient, data-sovereign public health surveillance systems for cholera-prone areas.
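The abstract does not state which metric or forecast horizon underlies its summary percentages. As a rough sketch of the arithmetic, assuming they are relative changes between the point estimates quoted above, the comparisons could be computed as follows. The helper names are hypothetical, and the study may have computed its figures differently.

```python
def relative_reduction(baseline: float, model: float) -> float:
    """Fractional reduction of a metric relative to a baseline value."""
    return (baseline - model) / baseline


def relative_gain(baseline: float, model: float) -> float:
    """Fractional increase of a score relative to a baseline value."""
    return (model - baseline) / baseline


# Point estimates as quoted in the abstract.
comparisons = {
    "MAPE, 7-day, LSTM vs ARIMA": relative_reduction(17.5, 12.8),
    "MAPE, 14-day, LSTM vs ARIMA": relative_reduction(24.7, 20.5),
    "RMSE, 7-day, LSTM vs ARIMA": relative_reduction(27.9, 25.3),
    "macro-F1, XGBoost vs Random Forest": relative_gain(0.79, 0.87),
    "accuracy loss, federated with DP vs without": relative_reduction(0.84, 0.78),
}

# Absolute macro-F1 difference, in score points rather than percent.
f1_gain_abs = 0.87 - 0.79

for name, value in comparisons.items():
    print(f"{name}: {value:.1%}")
print(f"macro-F1 absolute difference: {f1_gain_abs:.2f}")
```

Printing both the relative and the absolute macro-F1 difference makes it easier to see which reading of "8%" matches the reported scores.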
