Integrating Google Trends and Hybrid Statistical-Machine Learning Models for Dengue Surveillance in an Inland Vietnamese Province: A 9-Year Evaluation with Media Bias Assessment

Abstract

Accurate and timely dengue surveillance remains a critical challenge in endemic regions because of delayed reporting and under-detection. This study evaluated the feasibility of integrating Google Trends data with statistical and machine learning models to enhance dengue surveillance in Dong Nai, an inland province of Vietnam, over a nine-year period (2013–2021). Monthly confirmed dengue cases were paired with the Google Trends Index (GTI) for dengue-related search terms. The modeling framework comprised time series regression (Poisson, Quasi-Poisson, Negative Binomial), machine learning algorithms (Random Forest, XGBoost, LightGBM), and hybrid ensemble approaches. The GTI correlated strongly with dengue incidence (Pearson r = 0.975, p < 0.001), with the strongest alignment at zero lag. The optimal Negative Binomial model, which incorporated GTI and an autoregressive epidemic-memory term, log(DF_{t-1} + 1), achieved the best predictive accuracy (Dispersion Index = 1.20, RMSE = 419.71). Random Forest outperformed the other machine learning models (RMSE = 560.36, R² = 0.766) but remained inferior to the Negative Binomial model. The hybrid ensemble model (Negative Binomial + Random Forest) provided enhanced robustness, with RMSE = 485.49 and R² = 0.825. Importantly, media bias analysis identified seven GTI spikes during the study period, most of which coincided with actual outbreaks. Calculated bias indices were consistently low (0.0049–0.018), indicating minimal distortion of the GTI signal by external media influence. Outbreak detection performance was excellent at the 95th percentile outbreak threshold, with an AUC of 1.00, sensitivity of 100%, and specificity of 94.4%. These findings demonstrate that Google Trends, when integrated with autoregressive statistical models, provides a feasible and reliable digital signal for dengue surveillance. Hybrid statistical–machine learning models offer scalable solutions for real-time outbreak prediction, with minimal susceptibility to media-induced noise. Future work incorporating multi-source ecological, climatic, and behavioral data could further optimize predictive capacity for early warning systems in dengue-endemic settings.
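
The quantities reported in the abstract (lagged Pearson correlation between GTI and cases, the epidemic-memory term log(DF_{t-1} + 1), and sensitivity and specificity at a percentile-based outbreak threshold) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the monthly series are synthetic, the function names are hypothetical, and the alarm rule (GTI at or above its own 90th percentile) is not taken from the study, whose exact detection procedure is not described in the abstract.

```python
import math
from statistics import mean

# Synthetic 24-month series (NOT the study's data): cases and a GTI-like index.
cases = [120, 95, 80, 110, 150, 310, 620, 980, 1400, 900, 420, 200,
         130, 100, 90, 105, 180, 350, 700, 1100, 2100, 1300, 500, 220]
gti = [6, 5, 4, 6, 8, 15, 30, 47, 66, 44, 21, 10,
       7, 5, 5, 6, 9, 17, 33, 52, 100, 62, 25, 11]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(signal, target, lag):
    """Correlate signal at month t with target at month t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(signal, target)
    return pearson(signal[:-lag], target[lag:])

def epidemic_memory(series):
    """log(DF_{t-1} + 1) for months 2..n; the first month has no predecessor."""
    return [math.log(c + 1) for c in series[:-1]]

def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def sensitivity_specificity(signal, target, target_q=95, signal_q=90):
    """Outbreak = target >= its target_q percentile (as in the abstract).
    Alarm = signal >= its signal_q percentile (assumed rule for this sketch)."""
    outbreak_cut = percentile(target, target_q)
    alarm_cut = percentile(signal, signal_q)
    tp = fp = tn = fn = 0
    for s, t in zip(signal, target):
        outbreak, alarm = t >= outbreak_cut, s >= alarm_cut
        if outbreak and alarm:
            tp += 1
        elif outbreak:
            fn += 1
        elif alarm:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

r0 = lagged_correlation(gti, cases, 0)   # contemporaneous correlation
r1 = lagged_correlation(gti, cases, 1)   # GTI leading cases by one month
memory = epidemic_memory(cases)          # regressor aligned with cases[1:]
sens, spec = sensitivity_specificity(gti, cases)

print(f"r(lag 0) = {r0:.3f}, r(lag 1) = {r1:.3f}")
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In a full analysis, the `memory` values and the GTI would enter a Negative Binomial regression as covariates for `cases[1:]`; that fitting step is omitted here because it requires a statistics library and model settings that the abstract does not specify.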
