AI for Cholera Outbreak Prediction, Real-Time Tracking, and Low-Resource Diagnostics using Federated and Privacy-Preserving Machine Learning
Abstract
This study presents a multi-model computational framework for predicting cholera outbreaks from spatio-temporal, climatic, and socio-environmental predictors in regions with recurrent epidemics. The dataset comprised 17,842 region-day records from January 2024 to May 2025, split 70%/15%/15% into training, validation, and test sets for the cross-sectional models, with rolling-origin splits for the time-series models. Two forecasting tasks were examined: (i) prediction of reported case counts and (ii) classification of outbreak severity as low (0–9 cases), medium (10–29 cases), or high (≥ 30 cases). Poisson and Negative Binomial regression served as statistical baselines. An overdispersion check (variance/mean ratio = 2.7) favored the Negative Binomial model, which identified rainfall (IRR = 1.18, 95% CI: 1.10–1.26) and water salinity (IRR = 1.11, 95% CI: 1.06–1.16) as major contributors to outbreak risk, whereas sanitation coverage was associated with a 23% lower incidence rate (IRR = 0.77, 95% CI: 0.71–0.84). Machine learning models improved substantially on these baselines. Random Forest regression lowered RMSE from 41.2 (baseline) to 28.9, whereas classification reached a macro-F1 of 0.81. XGBoost improved classification further, with macro-F1 = 0.87 and ROC-AUC = 0.91, surpassing Random Forest (macro-F1 = 0.79, ROC-AUC = 0.86). SHAP analysis identified rainfall, sanitation, and mobility index as the three primary factors, together responsible for 62% of the variance in outbreak predictions. Long Short-Term Memory (LSTM) networks delivered the most accurate temporal forecasts. At the 7-day horizon, LSTM achieved RMSE = 25.3 ± 6.2, MAE = 18.4 ± 4.7, and MAPE = 12.8% ± 3.1%, while ARIMA showed RMSE = 27.9 ± 7.4 and MAPE = 17.5% ± 4.5%, and naive benchmarks had MAPE ≥ 20%.
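The severity thresholds, the variance/mean overdispersion check, and the rolling-origin evaluation described above can be illustrated with a minimal sketch. The function names, the use of the sample variance, and the expanding-window form of the rolling-origin split are assumptions made here for illustration; the abstract does not specify these details, and the example counts are hypothetical rather than study data.

```python
from statistics import mean, variance


def severity_label(cases: int) -> str:
    """Map a case count to the severity classes given in the abstract."""
    if cases < 0:
        raise ValueError("case count must be non-negative")
    if cases <= 9:
        return "low"      # 0-9 cases
    if cases <= 29:
        return "medium"   # 10-29 cases
    return "high"         # 30 or more cases


def dispersion_ratio(counts):
    """Sample variance divided by the mean (assumption: sample variance).

    A Poisson model implies a ratio near 1; values well above 1 point
    towards overdispersion and a Negative Binomial model.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean of counts is zero; ratio is undefined")
    return variance(counts) / m


def rolling_origin_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) with an expanding training window.

    Assumption: the forecast origin advances by `step` observations
    (defaulting to `horizon`), and the training window always starts at 0.
    """
    if initial_train <= 0 or horizon <= 0:
        raise ValueError("initial_train and horizon must be positive")
    step = step or horizon
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


if __name__ == "__main__":
    daily_cases = [2, 0, 5, 12, 31, 7, 3]  # hypothetical counts
    print([severity_label(c) for c in daily_cases])
    print(round(dispersion_ratio(daily_cases), 2))
    for train, test in rolling_origin_splits(30, initial_train=14, horizon=7):
        print(f"train 0..{train.stop - 1}, test {test.start}..{test.stop - 1}")
```

Each rolling-origin test window lies strictly after its training window, so a forecast is never scored on data the model has already seen.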
At the 14-day horizon, LSTM maintained its advantage with RMSE = 39.5 ± 10.2 and MAPE = 20.5% ± 5.6%, surpassing ARIMA (RMSE = 41.2 ± 11.0; MAPE = 24.7% ± 6.3%). Federated learning trials with 5 regional clients showed performance comparable to centralized learning, achieving an accuracy of 0.84 without differential privacy (DP) and 0.78 with DP (σ = 1.0). The privacy-utility trade-off corresponded to ε = 3.1–7.8 at δ = 1e-5, and the average communication overhead of 11.4 MB per round suggests the approach is practical in low-bandwidth settings. The results indicate that LSTM-based forecasting improves epidemic prediction accuracy by as much as 25% compared to ARIMA and 35% compared to naive methods, while XGBoost improves outbreak severity classification by 8% relative to Random Forest. Federated models offer privacy-preserving scalability at a cost of only 5–9% in utility. These findings highlight the promise of combining ensemble learning, deep temporal models, and federated AI to build resilient, data-sovereign public health surveillance systems for cholera-prone areas.
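The federated setup with differential privacy can be sketched as one round of federated averaging with clipped client updates and Gaussian noise. The abstract states only the number of clients (5) and σ = 1.0, so the clipping step, the interpretation of σ as a noise multiplier, and all names below (`clip_update`, `dp_federated_average`, `clip_norm`) are assumptions rather than the authors' implementation. The privacy accounting that would produce ε for a given δ is not shown.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a client update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_federated_average(client_updates, clip_norm=1.0,
                         noise_multiplier=1.0, rng=None):
    """Aggregate one round of client updates with the Gaussian mechanism.

    Each update is clipped, the clipped updates are summed, Gaussian noise
    with standard deviation noise_multiplier * clip_norm is added to every
    coordinate of the sum, and the result is divided by the client count.
    """
    if not client_updates:
        raise ValueError("at least one client update is required")
    rng = rng or random.Random()
    n_clients = len(client_updates)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    dim = len(clipped[0])
    if any(len(u) != dim for u in clipped):
        raise ValueError("all client updates must have the same length")
    noise_std = noise_multiplier * clip_norm
    sums = [sum(u[i] for u in clipped) for i in range(dim)]
    return [(s + rng.gauss(0.0, noise_std)) / n_clients for s in sums]


if __name__ == "__main__":
    # Five hypothetical regional clients, each sending a 3-parameter update.
    source = random.Random(0)
    updates = [[source.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
    print(dp_federated_average(updates, clip_norm=1.0,
                               noise_multiplier=1.0, rng=random.Random(1)))
```

Setting `noise_multiplier=0.0` reduces the function to plain averaging of the clipped updates, which is a convenient way to compare runs with and without the noise.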