Digital Data and Machine Learning for Influenza Prediction: Enhancing Healthcare Sustainability in Norway


Abstract

Background
Influenza presents a significant public health challenge globally, with seasonal outbreaks straining healthcare systems. Healthcare centers often experience high traffic from influenza-like illnesses (ILIs), many of which require only basic self-care advice. These visits contribute to avoidable congestion. Timely ILI forecasts could support alternative strategies, such as SMS-based guidance, to reduce unnecessary visits. Internet search data offer real-time insight into public health trends and may improve upon traditional surveillance systems. This study assessed the effectiveness of using Google search query data, alongside ILI incidence, to forecast influenza activity in Norway with machine learning models.

Methods
Weekly ILI data from the Norwegian Syndromic Surveillance System (NorSySS) were collected from 2006 to 2024, along with normalized Google search query data for 13 influenza-related terms. Pearson correlation analysis was conducted to identify search terms significantly associated with ILI incidence. Machine learning models, including Linear Regression, Random Forest, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks, were used to predict ILI incidence. Model performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).

Results
The final predictor matrix combined 29 symptom- and medication-related Google search terms, identified through Pearson correlation (r ≥ 0.30), mutual information, and LASSO regression, with their lagged variants. These features were used to train the machine learning models. Random Forest achieved the best predictive performance (RMSE = 0.47, R² = 0.62), closely followed by XGBoost (RMSE = 0.48, R² = 0.60). Linear Regression and SVR showed moderate accuracy, while LSTM performed worst (RMSE = 0.76, R² = 0.11). Compared to LSTM, Random Forest reduced RMSE by 38% and captured weekly ILI trends most accurately.

Conclusions
This study highlights the potential of integrating online search query data with machine learning models to improve the accuracy of influenza forecasting. The findings support the use of digital data sources as a complementary tool for influenza prediction, contributing to more sustainable healthcare resource management and timely public health interventions.
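The quantities named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' pipeline: the function names, the lag convention (search volume at week t - lag paired with ILI at week t), and the toy series are assumptions made purely for illustration. Only the two RMSE values (0.76 for LSTM, 0.47 for Random Forest) and the r ≥ 0.30 threshold come from the abstract.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean Absolute Error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_pearson(search, ili, lag):
    """Correlate search volume at week t - lag with ILI at week t.

    The lag convention is an assumption; the abstract only says that
    lagged variants of the search terms were used as features.
    """
    if lag == 0:
        return pearson_r(search, ili)
    return pearson_r(search[:-lag], ili[lag:])


def relative_error_reduction(baseline_rmse, model_rmse):
    """Fractional RMSE reduction of a model relative to a baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical toy series (not study data): searches lead ILI by one week.
search = [1.0, 2.0, 3.0, 4.0, 5.0, 3.0]
ili = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
keep_term = lagged_pearson(search, ili, lag=1) >= 0.30  # threshold from the abstract

# RMSE values reported in the abstract: LSTM 0.76, Random Forest 0.47.
reduction = relative_error_reduction(0.76, 0.47)
print(f"Random Forest vs. LSTM RMSE reduction: {reduction:.0%}")
```

With the reported RMSE values, (0.76 - 0.47) / 0.76 ≈ 0.38, which matches the 38% reduction stated in the Results.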
