Machine Learning and Probabilistic Approaches for Forecasting Infectious Disease Transmission and Cases
Abstract
Objectives
Forecasting the effective reproductive number (R_t) and infection case counts is critical for guiding public health responses. We developed a framework that uses machine learning to forecast R_t and a probabilistic model to forecast daily COVID-19 cases across South Carolina counties, with the flexibility to generalize to other infectious diseases.
Methods
We first estimated R_t using the EpiNow2 R package, which incorporates Bayesian time-series modeling and accounts for reporting delay and incubation period. These initial estimates were refined using spatial covariate-adjusted smoothing through the Integrated Nested Laplace Approximation (INLA). We then generated R_t forecasts using an ensemble of linear regression, random forest, and XGBoost models. Daily case forecasts were obtained by linking R_t trajectories with historical case data via a Poisson model.
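As an illustration of the last step, the sketch below links an R_t trajectory to historical case counts through a Poisson renewal model, in which the expected number of new cases on a day is R_t times a weighted sum of recent cases. The abstract does not specify the exact form of the Poisson model, so the renewal formulation, the function names `expected_cases` and `sample_cases`, and the generation-interval weights are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def expected_cases(past_cases, rt_forecast, gen_interval):
    """Deterministic mean path: mu_t = R_t * sum_s w_s * I_{t-s}.

    gen_interval[0] is the (assumed) weight on cases from 1 day ago,
    gen_interval[1] on cases from 2 days ago, and so on.
    """
    w = np.asarray(gen_interval, dtype=float)
    w = w / w.sum()  # normalize weights to sum to 1
    history = [float(c) for c in past_cases]
    means = []
    for rt in rt_forecast:
        # Most recent len(w) days, newest first, to align with w.
        recent = np.array(history[-len(w):][::-1])
        mu = rt * float(np.dot(w[:len(recent)], recent))
        means.append(mu)
        history.append(mu)  # feed the mean forward as the next day's input
    return np.array(means)


def sample_cases(past_cases, rt_forecast, gen_interval, n_sims=1000, seed=0):
    """Monte Carlo paths: each day's count is drawn from Poisson(mu_t)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(gen_interval, dtype=float)
    w = w / w.sum()
    k = len(w)
    past = np.asarray(past_cases, dtype=float)[-k:]
    n_past = past.shape[0]
    paths = np.tile(past, (n_sims, 1))  # one copy of the history per simulation
    for rt in rt_forecast:
        recent = paths[:, -k:][:, ::-1]  # newest first
        mu = (rt * recent) @ w[:recent.shape[1]]
        draws = rng.poisson(mu)
        paths = np.column_stack([paths, draws])
    return paths[:, n_past:]  # keep only the forecast days


# Hypothetical inputs: two days of history and a 2-day generation interval.
means = expected_cases([100, 100], [2.0, 2.0], [0.5, 0.5])
paths = sample_cases([100, 100], [2.0, 2.0], [0.5, 0.5], n_sims=1000, seed=0)
lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)
```

Taking percentiles across the simulated paths gives a simple prediction interval for each forecast day. In practice, the R_t values passed in would come from the ensemble forecast, and the generation-interval weights would be chosen for the disease in question.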
Results
The ensemble-based approach outperformed EpiNow2 at all forecast horizons (7-day, 14-day, and 21-day). In the first forecast period (November 11, 2020 to February 2, 2021), the ensemble achieved a median PA of 96.5% (IQR: 95.4%–97.1%) for the 7-day-horizon R_t forecast, compared with 87.0% (IQR: 84.4%–89.4%) for EpiNow2. In the second period (December 11, 2022 to March 4, 2023), the ensemble attained a median PA of 93.0% (IQR: 90.8%–95.4%) for the R_t forecast, while EpiNow2 reached 86.8% (IQR: 82.5%–89.2%). Similar trends were observed for case forecasts, where the ensemble model also performed better.
Conclusion
This study presents a flexible forecasting framework that integrates Bayesian estimation, spatial smoothing, and ensemble machine learning to improve the accuracy of COVID-19 transmission and case forecasts. The approach enhances epidemic forecasting performance and offers scalable tools to support data-driven public health preparedness and response.