Forecasting the COVID-19 Epidemic By Integrating Symptom Search Behavior Into Predictive Models: Infoveillance Study

Alessandro Rabiolo
Eugenio Alladio
Esteban Morales
Andrew Ian McNaught
Francesco Bandello
Abdelmonem A Afifi
Alessandro Marchese

Abstract

Previous studies have suggested associations between web search trends and traditional COVID-19 metrics. It remains unclear whether models that incorporate digital search trends lead to better predictions.

Objective

The aim of this study is to investigate the relationship between Google Trends searches of symptoms associated with COVID-19 and confirmed COVID-19 cases and deaths. We aim to develop predictive models to forecast the COVID-19 epidemic based on a combination of Google Trends searches of symptoms and conventional COVID-19 metrics.

Methods

An open-access web application was developed to evaluate Google Trends and traditional COVID-19 metrics via an interactive framework based on principal component analysis (PCA) and time series modeling. The application facilitates the analysis of symptom search behavior associated with COVID-19 disease in 188 countries. In this study, we selected the data of nine countries as case studies to represent all continents. PCA was used to perform data dimensionality reduction, and three different time series models (error, trend, seasonality; autoregressive integrated moving average; and feed-forward neural network autoregression) were used to predict COVID-19 metrics in the upcoming 14 days. The models were compared in terms of prediction ability using the root mean square error (RMSE) of the first principal component (PC1). The predictive abilities of models generated with both Google Trends data and conventional COVID-19 metrics were compared with those fitted with conventional COVID-19 metrics only.
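The pipeline described above (reduce several correlated series to PC1, hold out the last 14 days, score a forecast by RMSE) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three input series are synthetic, PC1 is obtained by power iteration on the correlation matrix, and a naive "repeat the last value" forecast stands in for the three time series models named in the text, which are not implemented.

```python
import math

def standardize(columns):
    """Center each series to mean 0 and scale it to unit sample standard deviation."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        out.append([(x - mean) / sd for x in col])
    return out

def first_principal_component(columns, iterations=500):
    """Return (loadings, scores) of PC1 via power iteration on the correlation matrix."""
    z = standardize(columns)
    p, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # uniform start; adequate for positively correlated series
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(v[i] * z[i][t] for i in range(p)) for t in range(n)]
    return v, scores

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual))

# Synthetic daily series (hypothetical, not data from the study).
days = range(60)
cases = [100 + 5 * t + 20 * math.sin(t / 3) for t in days]
deaths = [2 + 0.1 * t + math.sin((t - 7) / 3) for t in days]
searches = [50 + 2 * t + 10 * math.sin((t + 5) / 3) for t in days]

loadings, pc1 = first_principal_component([cases, deaths, searches])

# Hold out the final 14 days and score a naive forecast (a stand-in model).
horizon = 14
train, test = pc1[:-horizon], pc1[-horizon:]
naive_forecast = [train[-1]] * horizon
print("RMSE of naive 14-day forecast on PC1:", rmse(test, naive_forecast))
```

Repeating the same scoring with and without the search series among the inputs gives the kind of comparison the study reports, although the sketch simplifies matters by computing PC1 on the full series before splitting.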

Results

The degree of correlation and the best time lag varied with the country and the search topic; in general, the optimal time lag was within 15 days. Overall, predictions of PC1 based on both search terms and traditional COVID-19 metrics had lower RMSE than those not including Google searches (median 1.56, IQR 0.90-2.49 versus median 1.87, IQR 1.09-2.95, respectively), but the improvement varied with the country and time frame. The best-performing model also varied with the country and the time period selected. Models based on a 7-day moving average led to considerably smaller RMSE values than those calculated with raw data (median 0.90, IQR 0.50-1.53 versus median 2.27, IQR 1.62-3.74, respectively).
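The two operations behind these results, smoothing with a 7-day moving average and searching for the time lag with the highest correlation, can be sketched as follows. This is an illustration under stated assumptions: the moving average is taken as trailing (the text does not say whether it is trailing or centered), the lag is defined as searches preceding cases, Pearson correlation is used, and the series are synthetic with a 5-day shift built in.

```python
import math

def moving_average(series, window=7):
    """Trailing moving average; the first window - 1 points are dropped."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_lag(searches, cases, max_lag=15):
    """Return (lag, r) maximizing the correlation of searches[t] with cases[t + lag]."""
    best = None
    for lag in range(max_lag + 1):
        x = searches[:len(searches) - lag]
        y = cases[lag:]
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (lag, r)
    return best

def bump(t):
    """Hypothetical epidemic-wave shape peaking at day 30."""
    return math.exp(-((t - 30) / 8) ** 2)

# Synthetic series: cases follow searches by exactly 5 days (an assumption).
searches = [bump(t) for t in range(80)]
cases = [100 * bump(t - 5) for t in range(80)]

lag, r = best_lag(searches, cases)
print("best lag (days):", lag, "correlation:", round(r, 4))
print("7-day moving average of 1..8:", moving_average([1, 2, 3, 4, 5, 6, 7, 8]))
```

Real search and case series are noisy, so smoothing both with `moving_average` before calling `best_lag` is a natural order of operations, though the text does not specify how the application sequences these steps.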

Conclusions

The inclusion of online search data in statistical models may improve the nowcasting and forecasting of the COVID-19 epidemic and could serve as one component of COVID-19 surveillance. We provide a free web application operating with nearly real-time data that anyone can use to make predictions of outbreaks, improve estimates of the dynamics of ongoing epidemics, and predict future or rebound waves.

Version published to 10.2196/28876
Aug 11, 2021
Version published to 10.2196/preprints.28876
Mar 17, 2021

SciScore for 10.1101/2021.03.09.21253186:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
To further assess the PCA models based on both Google Trends data and conventional COVID-19 metrics, we also generated predictive models based on conventional COVID-19 metrics only; we then compared the predictive ability of models with and without Google Trends data by means of RMSE for each country.	Google suggested: (Google, RRID:SCR_017097)

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Modeling epidemics is a complex task that depends on several assumptions and entails numerous limitations. The choice of data input is a crucial part of model development for accurate predictions [16]. Most of the current models for COVID-19 rely on confirmed cases or deaths, but neither of these measures is satisfactory. Confirmed COVID-19 cases represent only a part of all infected subjects, as those with mild symptoms may not seek medical attention or get tested. Also, the number of confirmed cases is highly dependent on the number of tests performed, which varies greatly among countries and across the phases of the epidemic. Confirmed COVID-19 deaths are likely to be a more reliable measure, but they occur in the final stages of the disease and are, therefore, a poor indicator for detecting outbreaks at their earliest stages. Also, COVID-19 deaths are not uniformly reported among countries and may vary as a function of the healthcare system, population demographics, and public health status. In the past decade, there has been increasing interest in the use of internet big data to understand patterns of disease and population behaviors and to conduct surveillance of infectious diseases. Despite being initially welcomed with enthusiasm, models based only on Google Trends data proved inaccurate in determining the absolute numbers of cases in epidemics, but were helpful in identifying temporal dynamics, anticipating peaks, and improving forecasting when use...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Version published to 10.1101/2021.03.09.21253186 on medRxiv
Mar 12, 2021

Related articles

WITHDRAWN: Behavior-Aware COVID-19 Forecasting Using Markov SIR Models on Dynamic Contact Networks: An Observational Modeling Study

This article has 1 author:
1. Mojtaba Dadashkarimi
This article has no evaluations. Latest version: Aug 5, 2025

Ensemble forecasts of COVID-19 activity to support Australia’s pandemic response: 2020–22

This article has 19 authors:
1. Robert Moss
2. Ruarai J. Tobin
3. Mitchell O’Hara-Wild
4. Adeshina I. Adekunle
5. Dennis Liu
6. Tobin South
7. Dylan J. Morris
8. Gerard E. Ryan
9. Tianxiao Hao
10. Aarathy Babu
11. Katharine L. Senior
12. James G. Wood
13. Nick Golding
14. Joshua V. Ross
15. Peter Dawson
16. Rob J. Hyndman
17. David J. Price
18. James M. McCaw
19. Freya M. Shearer
This article has no evaluations. Latest version: Sep 12, 2025

Retrospective analysis of macroscopic health, socioeconomic, and demographic risk predictors for COVID-19 accumulated mortality ratio

This article has 2 authors:
1. Murat Razi
2. Manuel Grana
This article has no evaluations. Latest version: Sep 5, 2025
