Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations
Curation statements for this article:
Curated by eLife
eLife assessment
This large-scale collaborative study is a timely contribution that will be of interest to researchers working in the fields of infectious disease forecasting and epidemic control. This paper provides a comprehensive evaluation of the predictive skills of real-time COVID-19 forecasting models in Europe. The conclusions of the paper are well supported by the data and are consistent with findings from studies in other countries.
This article has been reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Short-term forecasts of infectious disease burden can contribute to situational awareness and aid capacity planning. Based on best practice in other fields and recent insights in infectious disease epidemiology, one can maximise the predictive performance of such forecasts if multiple models are combined into an ensemble. Here, we report on the performance of ensembles in predicting COVID-19 cases and deaths across Europe between 08 March 2021 and 07 March 2022.
Methods:
We used open-source tools to develop a public European COVID-19 Forecast Hub. We invited groups globally to contribute weekly forecasts for COVID-19 cases and deaths reported by a standardised source for 32 countries over the next 1–4 weeks. Teams submitted forecasts from March 2021 using standardised quantiles of the predictive distribution. Each week we created an ensemble forecast, where each predictive quantile was calculated as the equally weighted average (initially the mean, and from 26 July 2021 the median) of all individual models’ predictive quantiles. We measured the performance of each model using the relative Weighted Interval Score (WIS), which compares each model’s forecast accuracy to that of all other models. We retrospectively explored alternative methods for ensemble forecasts, including weighted averages based on models’ past predictive performance.
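The two mechanics described above, combining models quantile-by-quantile and scoring with the WIS, can be sketched as follows. This is an illustrative reimplementation, not the Hub's own tooling (which uses R packages such as scoringutils); it uses the common approximation of WIS as twice the mean pinball (quantile) loss across the submitted quantile levels, which matches the interval-score formulation when levels are symmetric around the median.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss for observation y against the predicted
    tau-quantile q_pred."""
    return np.where(y >= q_pred,
                    tau * (y - q_pred),
                    (1 - tau) * (q_pred - y))

def wis(y, quantile_levels, quantile_preds):
    """Approximate Weighted Interval Score: twice the mean pinball loss
    over all quantile levels of one forecast."""
    losses = [pinball_loss(y, q, tau)
              for tau, q in zip(quantile_levels, quantile_preds)]
    return 2 * np.mean(losses, axis=0)

def quantile_ensemble(model_quantiles, method="median"):
    """Equally weighted ensemble built quantile-by-quantile: each ensemble
    quantile is the mean (or, from 26 July 2021 in the Hub, the median)
    of the component models' predictions at that level.

    model_quantiles: array of shape (n_models, n_quantile_levels)."""
    agg = np.median if method == "median" else np.mean
    return agg(model_quantiles, axis=0)
```

For example, with three models submitting quantiles at the same levels, `quantile_ensemble` returns one combined quantile per level; a single outlier model shifts the mean ensemble but leaves the median ensemble largely unchanged, which is one intuition for the median's better performance reported here.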
Results:
Over 52 weeks, we collected forecasts from 48 unique models. We evaluated 29 models’ forecast scores in comparison to the ensemble model. We found that a weekly ensemble performed consistently well across countries over time. Across all horizons and locations, the ensemble scored better on relative WIS than 83% of participating models’ forecasts of incident cases (with a total N=886 predictions from 23 unique models), and 91% of participating models’ forecasts of deaths (N=763 predictions from 20 models). Across the 1–4 week horizon, ensemble performance declined at longer horizons when forecasting cases, but remained stable over 4 weeks for incident death forecasts. In every forecast across 32 countries, the ensemble outperformed most contributing models when forecasting either cases or deaths, and frequently outperformed all of its individual component models. Among several choices of ensemble method, the most influential and best-performing choice was to take the median rather than the mean of component models’ quantiles, regardless of how component forecasts were weighted.
Conclusions:
Our results support combining forecasts from individual models into an ensemble to improve predictive performance across epidemiological targets and populations during infectious disease epidemics. Our findings further suggest that median ensemble methods yield better predictive performance than those based on means. They also highlight that forecast consumers should place more weight on incident death forecasts than on incident case forecasts at forecast horizons greater than two weeks.
Funding:
AA, BH, BL, LWa, MMa, PP, SV funded by National Institutes of Health (NIH) Grant 1R01GM109718, NSF BIG DATA Grant IIS-1633028, NSF Grant No.: OAC-1916805, NSF Expeditions in Computing Grant CCF-1918656, CCF-1917819, NSF RAPID CNS-2028004, NSF RAPID OAC-2027541, US Centers for Disease Control and Prevention 75D30119C05935, a grant from Google, University of Virginia Strategic Investment Fund award number SIF160, Defense Threat Reduction Agency (DTRA) under Contract No. HDTRA1-19-D-0007, and respectively Virginia Dept of Health Grant VDH-21-501-0141, VDH-21-501-0143, VDH-21-501-0147, VDH-21-501-0145, VDH-21-501-0146, VDH-21-501-0142, VDH-21-501-0148. AF, AMa, GL funded by SMIGE - Modelli statistici inferenziali per governare l'epidemia, FISR 2020-Covid-19 I Fase, FISR2020IP-00156, Codice Progetto: PRJ-0695. AM, BK, FD, FR, JK, JN, JZ, KN, MG, MR, MS, RB funded by Ministry of Science and Higher Education of Poland with grant 28/WFSN/2021 to the University of Warsaw. BRe, CPe, JLAz funded by Ministerio de Sanidad/ISCIII. BT, PG funded by PERISCOPE European H2020 project, contract number 101016233. CP, DL, EA, MC, SA funded by European Commission - Directorate-General for Communications Networks, Content and Technology through the contract LC-01485746, and Ministerio de Ciencia, Innovacion y Universidades and FEDER, with the project PGC2018-095456-B-I00. DE, MGu funded by Spanish Ministry of Health / REACT-UE (FEDER). DO, GF, IMi, LC funded by Laboratory Directed Research and Development program of Los Alamos National Laboratory (LANL) under project number 20200700ER. DS, ELR, GG, NGR, NW, YW funded by National Institute of General Medical Sciences (R35GM119582; the content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or the National Institutes of Health). FB, FP funded by InPresa, Lombardy Region, Italy. HG, KS funded by European Centre for Disease Prevention and Control.
IV funded by Agencia de Qualitat i Avaluacio Sanitaries de Catalunya (AQuAS) through contract 2021-021OE. JDe, SMo, VP funded by Netzwerk Universitatsmedizin (NUM) project egePan (01KX2021). JPB, SH, TH funded by Federal Ministry of Education and Research (BMBF; grant 05M18SIA). KH, MSc, YKh funded by Project SaxoCOV, funded by the German Free State of Saxony. Presentation of data, model results and simulations also funded by the NFDI4Health Task Force COVID-19 ( https://www.nfdi4health.de/task-force-covid-19-2 ) within the framework of a DFG-project (LO-342/17-1). LP, VE funded by Mathematical and Statistical modelling project (MUNI/A/1615/2020), Online platform for real-time monitoring, analysis and management of epidemic situations (MUNI/11/02202001/2020); VE also supported by RECETOX research infrastructure (Ministry of Education, Youth and Sports of the Czech Republic: LM2018121), the CETOCOEN EXCELLENCE (CZ.02.1.01/0.0/0.0/17-043/0009632), RECETOX RI project (CZ.02.1.01/0.0/0.0/16-013/0001761). NIB funded by Health Protection Research Unit (grant code NIHR200908). SAb, SF funded by Wellcome Trust (210758/Z/18/Z).
Article activity feed
Reviewer #1 (Public Review):
The paper, fundamentally, is a description of the accuracy of individual model and ensemble model short-term forecasts of COVID-19. This has been done before in both weather and infectious disease. So what are the contributions of this manuscript? I see the following:
1. The authors show that ensemble prediction (a straight average) generally outperforms individual component models. This is not new and has been shown, as the authors cite, for weather, climate, and infectious disease.
2. Use of the median estimate across models, rather than the mean, buffers against outliers. This is a well-recognized workaround for right-skewed distributions, though the specific finding in this study is of some importance, as this hasn't always been the case (noted by the authors in their discussion).
3. Deaths are better forecasted than cases. This is not new, either, as the authors note, as deaths are a lagged function of cases/infections.
4. It presents the archive of European COVID-19 forecasts.

Although I don't see a lot of novelty in these findings, this COVID-19 forecasting work is important and represents a considerable effort on the part of the individual modelers. The paper is well written, but it doesn't show much that is novel methodologically. For instance, it doesn't propose and validate an approach for improving forecasting or projection accuracy. Are there new ways to handle or predict behavioral change, vaccination uptake, or viral changes? Are there novel post-processing approaches, other than ensembling, that could improve forecast accuracy?
Reviewer #2 (Public Review):
This paper by Sherratt et al. evaluated the performance of real-time predictions for COVID-19 submitted to the European COVID-19 Forecast Hub between 8 March 2021 and 7 March 2022. This large-scale, multi-team, multi-country collaboration collected short-term forecasts for COVID-19 from 26 teams generated for 32 countries in Europe, making this dataset one of the largest archives of real-time COVID-19 forecasts. The results indicate that ensemble models combining forecasts from individual models generally perform better than each individual model, and that ensemble methods based on medians outperform those based on means. The comparison also shows that incident death forecasts are more reliable than incident case forecasts beyond two weeks into the future. The paper further included detailed discussions on several practical considerations in the operational use of forecasting models. These findings provide practical guidance for generating real-time forecasts for infectious diseases and novel insights into coordinating international forecasting efforts during a public health emergency.
The conclusions of this paper are well supported by the data and analyses. A few aspects could be further discussed in the manuscript.
1. A parallel effort of real-time COVID-19 forecasting in the US (i.e., the US COVID-19 Forecast Hub) reported similar findings on the use of ensemble models. This study from Europe provides independent validation that shows the robustness of these findings. While both studies followed similar guidelines and used the same evaluation metrics (coverage and WIS), I believe there should be unique challenges associated with forecasting for multiple countries (as opposed to forecasting in a single country). As a result, it might be worthwhile to discuss those challenges and potential solutions to inform similar efforts in the future.
2. WIS is a strictly proper score for evaluating forecast performance; however, it must rely on a reference forecast model. This may create difficulties in interpreting forecast accuracy for the general public who may not understand the concept of WIS. For instance, what is a WIS score good enough to trust? The authors may want to include a simple metric (e.g., mean absolute error) as a supplement even though these metrics have some caveats. I presume the performance should be highly correlated using different evaluation metrics.
3. It might be helpful to elaborate more on the assumptions for near-term predictions in participating models (e.g., status quo, reactive change of transmission, etc.). Essentially all real-time predictions were generated based on assumptions, although sometimes those assumptions were not stated explicitly. For behavior-induced changing points (peaks or troughs), it might be challenging to predict using the status quo without considering a change in model states.
4. Data in the tables and figures were used to compare forecasts. It would be great to have a formal statistical test for comparing model performance, if possible.
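The reviewer's second point, that WIS is hard to interpret for a general audience, can be illustrated with a simple identity: when only the median (the 0.5 quantile) of a forecast is scored, the WIS reduces exactly to the absolute error, so WIS can be read as a generalisation of MAE to full predictive distributions. A minimal sketch (illustrative only, not part of the paper's evaluation code):

```python
import numpy as np

def median_only_wis(y, median_pred):
    """WIS evaluated on the 0.5 quantile alone: twice the pinball loss at
    tau = 0.5, which simplifies to |y - median_pred|."""
    tau = 0.5
    loss = np.where(y >= median_pred,
                    tau * (y - median_pred),
                    (1 - tau) * (median_pred - y))
    return 2 * loss
```

Averaged over forecasts, this quantity is exactly the mean absolute error of the point forecast, which supports the reviewer's suggestion of reporting MAE alongside WIS as a more familiar companion metric.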