Comparative Analysis of Machine Learning Models for Multi-Horizon PM2.5 Forecasting
Abstract
Accurate forecasting of fine particulate matter (PM2.5) concentrations is critical for public health management and environmental policy-making. This study compares six machine learning models for multi-horizon PM2.5 prediction: Linear Regression, Support Vector Regression (SVR), Random Forest (RF), Gradient Boosting Decision Trees (GBDT), Multi-Layer Perceptron (MLP), and Long Short-Term Memory (LSTM). Using hourly air quality data from 11 cities in Zhejiang Province, China (January to February 2024), we evaluate model performance at three forecast horizons: 1, 6, and 24 hours ahead. Our results show that model performance depends strongly on the forecast horizon. For short-term (1-hour) predictions, Linear Regression achieves the best performance (RMSE=10.682, R²=0.901), suggesting near-linear temporal dynamics. At the longest horizon (24-hour), ensemble tree-based models outperform the others, with GBDT achieving RMSE=24.264 and R²=0.467. Surprisingly, deep learning approaches (LSTM) underperform traditional machine learning methods, particularly for long-term forecasting. Feature importance analysis reveals that the most recent PM2.5 value (lag-1) accounts for 47.8% of total feature importance, while the Air Quality Index contributes 42.3%, highlighting the dominance of temporal autocorrelation in PM2.5 dynamics.
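To make the evaluation setup concrete, the following is a minimal sketch of how a lag-feature, multi-horizon comparison with RMSE and R² could be organized. It is not the study's code. The function names (`make_supervised`, `evaluate_horizons`, `synthetic_series`), the choice of three lags, the 80/20 chronological split, and the synthetic AR(1) series are all assumptions made for illustration, and the numbers it prints have no relation to the reported results.

```python
import numpy as np


def make_supervised(series, n_lags, horizon):
    """Turn a 1-D series into (X, y) for direct h-step-ahead forecasting.

    Column k of X holds the value k+1 steps back from the forecast origin
    (column 0 is lag-1, the most recent observation); y is the value
    `horizon` steps after the most recent observation.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested lags and horizon")
    X = np.column_stack(
        [series[n_lags - 1 - k : n_lags - 1 - k + n_samples] for k in range(n_lags)]
    )
    y = series[n_lags + horizon - 1 : n_lags + horizon - 1 + n_samples]
    return X, y


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [bias, w_1, ..., w_p]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]


def evaluate_horizons(series, n_lags=3, horizons=(1, 6, 24), train_frac=0.8):
    """Fit one linear model per horizon and score it on the later part of the series."""
    results = {}
    for h in horizons:
        X, y = make_supervised(series, n_lags, h)
        split = int(len(X) * train_frac)  # chronological split, no shuffling
        coef = fit_linear(X[:split], y[:split])
        pred = predict_linear(coef, X[split:])
        results[h] = (rmse(y[split:], pred), r2(y[split:], pred))
    return results


def synthetic_series(n=2000, phi=0.9, seed=0):
    """Synthetic AR(1) series standing in for hourly PM2.5 (illustration only)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


if __name__ == "__main__":
    for h, (err, score) in evaluate_horizons(synthetic_series()).items():
        print(f"horizon={h:>2}h  RMSE={err:.3f}  R2={score:.3f}")
```

On a strongly autocorrelated series like this one, the error of a lag-based linear model typically grows with the horizon, which is the qualitative pattern the abstract describes. Swapping `fit_linear` for another regressor would allow the same loop to compare model families.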