Data Augmentation Strategies for Improved PM<sub>2.5</sub> Forecasting Using Transformer Architectures
Abstract
Breathing in fine particulate matter with a diameter of less than 2.5 µm (PM2.5) greatly increases an individual’s risk of cardiovascular and respiratory disease. As climate change progresses, extreme weather events, including wildfires, are expected to become more frequent, exacerbating air pollution. The 2023 Canadian wildfires highlighted the growing threat of PM2.5 as smoke spread across U.S. cities such as New York, Philadelphia, and Washington, D.C. This research investigates data augmentation techniques for improving the accuracy of PM2.5 concentration forecasts in these urban environments. Models trained on imbalanced datasets often struggle to capture extreme pollution events and underestimate high PM2.5 levels, because training is dominated by the more frequent low-value samples. To address this, we implemented cluster-based undersampling and trained transformer models using two cutoff thresholds (12.1 µg/m³ and 35.5 µg/m³) and five partial sampling ratios (10/90, 20/80, 30/70, 40/60, and 50/50). Our results demonstrate that the 35.5 µg/m³ threshold, coupled with a 20/80 partial sampling ratio, provides the best performance in terms of RMSE and R², particularly in capturing high PM2.5 events. Overall, models trained on augmented data significantly outperformed those trained on the original data, highlighting the importance of resampling techniques for air quality forecasting, especially in high-pollution scenarios. These insights contribute to a better understanding of PM2.5 pollution and may help inform public health and environmental policy.
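As a rough illustration of the resampling idea described in the abstract, the following Python sketch keeps every sample at or above a PM2.5 cutoff and thins out the samples below it cluster by cluster. It is not the authors' implementation. Several details are assumptions made only for this example: the function names `kmeans_labels` and `cluster_undersample`, the small NumPy k-means used for clustering, the number of clusters `k`, and the reading of a "20/80" ratio as 20% high-value to 80% low-value samples after resampling.

```python
import numpy as np


def kmeans_labels(x, k, n_iter=20, seed=0):
    """Tiny Lloyd's k-means (illustrative helper): one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct rows of x.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def cluster_undersample(features, pm25, cutoff=35.5, high_fraction=0.2, k=5, seed=0):
    """Return sorted indices of the samples to keep.

    Assumption: all samples with pm25 >= cutoff are kept, and low samples are
    drawn from each cluster in proportion to its size until high samples make
    up roughly `high_fraction` of the result.
    """
    rng = np.random.default_rng(seed)
    high_idx = np.flatnonzero(pm25 >= cutoff)
    low_idx = np.flatnonzero(pm25 < cutoff)
    if len(low_idx) == 0:
        return high_idx

    # Number of low samples needed to reach the requested high/low mix.
    n_low_target = int(round(len(high_idx) * (1 - high_fraction) / high_fraction))
    n_low_target = min(n_low_target, len(low_idx))

    k = min(k, len(low_idx))
    labels = kmeans_labels(features[low_idx], k, seed=seed)

    kept = []
    for j in range(k):
        members = low_idx[labels == j]
        # Proportional share for this cluster; rounding makes the total approximate.
        n_take = min(int(round(n_low_target * len(members) / len(low_idx))), len(members))
        if n_take:
            kept.append(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.concatenate([high_idx] + kept))
```

Calling `cluster_undersample(features, pm25, cutoff=35.5, high_fraction=0.2)` would correspond, under the assumptions above, to the 35.5 µg/m³ threshold with a 20/80 ratio. In a forecasting setting each row of `features` would likely be an input window, and the returned indices would select which windows go into the training set.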