Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This large-scale collaborative study is a timely contribution that will be of interest to researchers working in the fields of infectious disease forecasting and epidemic control. This paper provides a comprehensive evaluation of the predictive skills of real-time COVID-19 forecasting models in Europe. The conclusions of the paper are well supported by the data and are consistent with findings from studies in other countries.

Abstract

Background:

Short-term forecasts of infectious disease burden can contribute to situational awareness and aid capacity planning. Based on best practice in other fields and recent insights in infectious disease epidemiology, one can maximise the predictive performance of such forecasts if multiple models are combined into an ensemble. Here, we report on the performance of ensembles in predicting COVID-19 cases and deaths across Europe between 08 March 2021 and 07 March 2022.

Methods:

We used open-source tools to develop a public European COVID-19 Forecast Hub. We invited groups globally to contribute weekly forecasts for COVID-19 cases and deaths reported by a standardised source for 32 countries over the next 1–4 weeks. Teams submitted forecasts from March 2021 using standardised quantiles of the predictive distribution. Each week we created an ensemble forecast, where each predictive quantile was calculated as the equally weighted average (initially the mean and then, from 26 July, the median) of all individual models’ predictive quantiles. We measured the performance of each model using the relative Weighted Interval Score (WIS), which compares a model’s forecast accuracy with that of all other models. We retrospectively explored alternative methods for ensemble forecasts, including weighted averages based on models’ past predictive performance.
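
The quantile-wise combination described above can be illustrated with a minimal sketch. This is not the Hub's code: the function name `quantile_ensemble`, the array layout, and the forecast values are hypothetical and chosen only to show how the mean and the median of models' predictive quantiles can differ when one model is an outlier.

```python
import numpy as np

def quantile_ensemble(model_quantiles, method="median"):
    """Combine models' predictive quantiles, one quantile level at a time.

    model_quantiles: array-like of shape (n_models, n_quantile_levels);
    every row holds one model's predicted values at the same quantile levels.
    All models receive equal weight.
    """
    q = np.asarray(model_quantiles, dtype=float)
    if method == "mean":
        return q.mean(axis=0)
    if method == "median":
        return np.median(q, axis=0)
    raise ValueError(f"unknown method: {method!r}")

# Hypothetical forecasts at quantile levels 0.25, 0.5 and 0.75
# (illustrative numbers only, not data from the study).
forecasts = [
    [90.0, 100.0, 110.0],
    [95.0, 105.0, 120.0],
    [400.0, 500.0, 600.0],  # an outlying model
]

mean_ens = quantile_ensemble(forecasts, "mean")      # pulled towards the outlier
median_ens = quantile_ensemble(forecasts, "median")  # stays close to the majority
```

In this toy case the mean ensemble is dragged far above the two agreeing models, while the median ensemble is not, which is the kind of behaviour the comparison of ensemble methods is concerned with.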

Results:

Over 52 weeks, we collected forecasts from 48 unique models. We evaluated 29 models’ forecast scores in comparison to the ensemble model. We found that the weekly ensemble performed consistently well across countries and over time. Across all horizons and locations, the ensemble performed better on relative WIS than 83% of participating models’ forecasts of incident cases (with a total N=886 predictions from 23 unique models), and 91% of participating models’ forecasts of deaths (N=763 predictions from 20 models). Across a 1–4 week time horizon, ensemble performance declined with longer forecast periods when forecasting cases, but remained stable over 4 weeks for incident death forecasts. In every forecast across 32 countries, the ensemble outperformed most contributing models when forecasting either cases or deaths, frequently outperforming all of its individual component models. Among several choices of ensemble method, we found that the most influential and best choice was to use the median of models rather than the mean, regardless of how component forecast models were weighted.

Conclusions:

Our results support combining forecasts from individual models into an ensemble in order to improve predictive performance across epidemiological targets and populations during infectious disease epidemics. Our findings further suggest that median ensemble methods yield better predictive performance than ones based on means. Our findings also highlight that forecast consumers should place more weight on incident death forecasts than on incident case forecasts at forecast horizons greater than 2 weeks.

Funding:

AA, BH, BL, LWa, MMa, PP, SV funded by National Institutes of Health (NIH) Grant 1R01GM109718, NSF BIG DATA Grant IIS-1633028, NSF Grant No.: OAC-1916805, NSF Expeditions in Computing Grant CCF-1918656, CCF-1917819, NSF RAPID CNS-2028004, NSF RAPID OAC-2027541, US Centers for Disease Control and Prevention 75D30119C05935, a grant from Google, University of Virginia Strategic Investment Fund award number SIF160, Defense Threat Reduction Agency (DTRA) under Contract No. HDTRA1-19-D-0007, and respectively Virginia Dept of Health Grant VDH-21-501-0141, VDH-21-501-0143, VDH-21-501-0147, VDH-21-501-0145, VDH-21-501-0146, VDH-21-501-0142, VDH-21-501-0148. AF, AMa, GL funded by SMIGE - Modelli statistici inferenziali per governare l'epidemia, FISR 2020-Covid-19 I Fase, FISR2020IP-00156, Codice Progetto: PRJ-0695. AM, BK, FD, FR, JK, JN, JZ, KN, MG, MR, MS, RB funded by Ministry of Science and Higher Education of Poland with grant 28/WFSN/2021 to the University of Warsaw. BRe, CPe, JLAz funded by Ministerio de Sanidad/ISCIII. BT, PG funded by PERISCOPE European H2020 project, contract number 101016233. CP, DL, EA, MC, SA funded by European Commission - Directorate-General for Communications Networks, Content and Technology through the contract LC-01485746, and Ministerio de Ciencia, Innovacion y Universidades and FEDER, with the project PGC2018-095456-B-I00. DE., MGu funded by Spanish Ministry of Health / REACT-UE (FEDER). DO, GF, IMi, LC funded by Laboratory Directed Research and Development program of Los Alamos National Laboratory (LANL) under project number 20200700ER. DS, ELR, GG, NGR, NW, YW funded by National Institutes of General Medical Sciences (R35GM119582; the content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or the National Institutes of Health). FB, FP funded by InPresa, Lombardy Region, Italy. HG, KS funded by European Centre for Disease Prevention and Control. 
IV funded by Agencia de Qualitat i Avaluacio Sanitaries de Catalunya (AQuAS) through contract 2021-021OE. JDe, SMo, VP funded by Netzwerk Universitatsmedizin (NUM) project egePan (01KX2021). JPB, SH, TH funded by Federal Ministry of Education and Research (BMBF; grant 05M18SIA). KH, MSc, YKh funded by Project SaxoCOV, funded by the German Free State of Saxony. Presentation of data, model results and simulations also funded by the NFDI4Health Task Force COVID-19 ( https://www.nfdi4health.de/task-force-covid-19-2 ) within the framework of a DFG-project (LO-342/17-1). LP, VE funded by Mathematical and Statistical modelling project (MUNI/A/1615/2020), Online platform for real-time monitoring, analysis and management of epidemic situations (MUNI/11/02202001/2020); VE also supported by RECETOX research infrastructure (Ministry of Education, Youth and Sports of the Czech Republic: LM2018121), the CETOCOEN EXCELLENCE (CZ.02.1.01/0.0/0.0/17-043/0009632), RECETOX RI project (CZ.02.1.01/0.0/0.0/16-013/0001761). NIB funded by Health Protection Research Unit (grant code NIHR200908). SAb, SF funded by Wellcome Trust (210758/Z/18/Z).

Article activity feed

  1. eLife assessment

    This large-scale collaborative study is a timely contribution that will be of interest to researchers working in the fields of infectious disease forecasting and epidemic control. This paper provides a comprehensive evaluation of the predictive skills of real-time COVID-19 forecasting models in Europe. The conclusions of the paper are well supported by the data and are consistent with findings from studies in other countries.

  2. Reviewer #1 (Public Review):

    The paper, fundamentally, is a description of the accuracy of individual model and ensemble model short-term forecasts of COVID-19. This has been done before in both weather and infectious disease. So what are the contributions of this manuscript? I see the following:

    1. The authors show that ensemble prediction (a straight average) generally outperforms individual component models. This is not new and has been shown, as the authors cite, for weather, climate, and infectious disease.
    2. Use of the median estimate across models, rather than the mean, buffers against outliers. This is a well-recognized workaround for right-skewed distributions, though the specific finding in this study is of some importance, as this hasn't always been the case (noted by the authors in their discussion).
    3. Deaths are better forecasted than cases. This is not new, either, as the authors note, as deaths are a lagged function of cases/infections.
    4. It presents the archive of European COVID-19 forecasts.

    Although I don't see a lot of novelty in these findings, this COVID-19 forecasting work is important and represents a considerable effort on the part of the individual modelers. The paper is well written, but it doesn't show much that is novel methodologically. For instance, it doesn't propose and validate an approach for improving forecasting or projection accuracy. Are there new ways to handle or predict behavioral, vaccination uptake, or viral changes? Are there novel post-processing approaches, other than 'ensembling', that could improve forecast accuracy?

  3. Reviewer #2 (Public Review):

    This paper by Sherratt et al. evaluated the performance of real-time predictions for COVID-19 submitted to the European COVID-19 Forecast Hub between March 8 2021 and March 7 2022. This large-scale multi-team multi-country collaboration collected short-term forecasts for COVID-19 from 26 teams generated for 32 countries in Europe, making this dataset one of the largest archives of real-time COVID-19 forecasts. The results indicate that ensemble models combining forecasts from individual models generally perform better than each individual model, and that ensemble methods based on medians outperform the ones based on means. The comparison also shows that incident death forecasts are more reliable than incident case forecasts beyond two weeks into the future. The paper further included detailed discussions on several practical considerations in the operational use of forecasting models. These findings provide practical guides for generating real-time forecasts for infectious diseases and novel insights into coordinating international forecasting efforts during a public health emergency.

    The conclusions of this paper are well supported by the data and analyses. A few aspects could be further discussed in the manuscript.

    1. A parallel effort of real-time COVID-19 forecasting in the US (i.e., the US COVID-19 Forecast Hub) reported similar findings on the use of ensemble models. This study from Europe provides independent validation that shows the robustness of these findings. While both studies followed similar guidelines and used the same evaluation metrics (coverage and WIS), I believe there should be unique challenges associated with forecasting for multiple countries (as opposed to forecasting in a single country). As a result, it might be worthwhile to discuss those challenges and potential solutions to inform similar efforts in the future.

    2. WIS is a strictly proper score for evaluating forecast performance; however, it must rely on a reference forecast model. This may make forecast accuracy difficult to interpret for members of the general public, who may not understand the concept of WIS. For instance, what WIS value is good enough to trust? The authors may want to include a simple metric (e.g., mean absolute error) as a supplement even though these metrics have some caveats. I presume the performance should be highly correlated using different evaluation metrics.

    3. It might be helpful to elaborate more on the assumptions for near-term predictions in participating models (e.g., status quo, reactive change of transmission, etc.). Essentially all real-time predictions were generated based on assumptions, although sometimes those assumptions were not stated explicitly. For behavior-induced change points (peaks or troughs), it might be challenging to predict using the status quo without considering a change in model states.

    4. Data in the tables and figures were used to compare forecasts. It would be great to have a formal statistical test for comparing model performance, if possible.
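
For readers unfamiliar with the score discussed in point 2, the following is a minimal sketch of the Weighted Interval Score for a single forecast, shown next to the absolute error of the median. It assumes the commonly used formulation in which the score combines the absolute error of the median with interval scores for K central prediction intervals, normalised by K + 1/2. The function names, argument layout, and numbers are hypothetical and are not taken from the paper or its code.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score of a central (1 - alpha) prediction interval for outcome y."""
    width = upper - lower
    penalty_below = (2.0 / alpha) * max(lower - y, 0.0)  # y fell below the interval
    penalty_above = (2.0 / alpha) * max(y - upper, 0.0)  # y fell above the interval
    return width + penalty_below + penalty_above

def weighted_interval_score(median, intervals, y):
    """WIS for one forecast.

    intervals: dict mapping alpha -> (lower, upper) for each central
    (1 - alpha) prediction interval (hypothetical layout for this sketch).
    """
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)

# Illustrative forecast: median 100, a 50% interval and a 90% interval.
intervals = {0.5: (90.0, 110.0), 0.1: (70.0, 140.0)}

wis_miss = weighted_interval_score(100.0, intervals, y=150.0)  # outcome outside both intervals
wis_hit = weighted_interval_score(100.0, intervals, y=100.0)   # outcome equal to the median
abs_error_miss = abs(150.0 - 100.0)                            # simple point-forecast error
```

Under this formulation the score is on the scale of the data, lower is better, and even a forecast whose median is exactly right is charged for the width of its intervals. A relative WIS would additionally compare such scores across models, which this sketch does not attempt.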