Machine Learning–Based Prediction of Particulate Matter and Gaseous Pollutants in Mega Cities

Abstract

Background: Air pollution remains a major public health concern in large metropolitan areas, where complex interactions between particulate matter, gaseous pollutants, and meteorological conditions shape short- and medium-term pollution dynamics. Traditional statistical approaches often struggle to capture these nonlinear and time-dependent relationships, prompting increased use of machine learning (ML) techniques for air quality prediction. In Türkiye, comprehensive ML-based forecasting studies focusing on major metropolitan areas remain limited.

Objective: This study aims to predict PM₂.₅ concentrations in Istanbul and Ankara using machine learning models and to examine the relative contribution of pollutant history and meteorological variables to short-term PM₂.₅ dynamics through interpretable modeling approaches.

Methods: Daily air quality data (PM₂.₅, PM₁₀, NO₂, SO₂, O₃, CO) were obtained from the National Air Quality Monitoring Network of the Ministry of Environment, Urbanization, and Climate Change of Türkiye. Meteorological variables were sourced from official meteorological stations. Feature engineering incorporated lagged pollutant values, moving averages, seasonal indicators, and meteorological parameters. Random Forest was selected as the primary modeling approach and evaluated using time-series cross-validation. Model interpretability was assessed through feature importance metrics and SHAP analyses. Multicollinearity diagnostics were conducted using variance inflation factors (VIF).

Results: The Random Forest model demonstrated stable predictive performance across time-series cross-validation folds, yielding an average RMSE of 5.70 µg/m³, with fold-specific RMSE values ranging from 3.71 to 7.84 µg/m³. Feature importance analysis revealed that lagged PM₁₀ concentrations (PM₁₀ lag 1) dominated the model, accounting for approximately 82.6% of the total explanatory contribution, indicating a strong short-term autoregressive structure. Meteorological variables such as wind speed and dew point exhibited smaller but consistent contributions (each <5%). SHAP-based interpretability analyses further confirmed the nonlinear influence of both pollutant persistence and meteorological conditions on PM₂.₅ predictions.

Conclusions: Machine learning models effectively capture nonlinear and time-dependent patterns in PM₂.₅ concentrations in large metropolitan areas. The findings highlight the dominant role of short-term pollutant persistence, complemented by meteorological influences. Interpretable ML approaches provide actionable insights for air quality management, supporting resource planning and early intervention strategies. The study contributes to the growing body of evidence supporting ML-based air pollution forecasting while emphasizing the importance of interpretability for policy-relevant applications.
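
The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: lagged and moving-average features, and time-series cross-validation scored by RMSE. Everything in it is an assumption made for illustration: the toy PM₂.₅ values are invented, the function names are hypothetical, the fold layout (an expanding training window) is one common choice rather than the authors' confirmed scheme, and a simple persistence forecast (tomorrow equals today) stands in for the Random Forest so that the example runs with the Python standard library alone.

```python
from math import sqrt
from typing import List, Optional, Tuple


def lag(series: List[float], k: int) -> List[Optional[float]]:
    """Value observed k days earlier; the first k days have no history."""
    if k <= 0:
        return list(series)
    return [None] * k + series[:-k]


def rolling_mean(series: List[float], window: int) -> List[Optional[float]]:
    """Mean of the `window` days strictly before day t, so day t itself never leaks in."""
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t < window:
            out.append(None)
        else:
            out.append(sum(series[t - window:t]) / window)
    return out


def expanding_splits(n_samples: int, n_folds: int,
                     min_train: int) -> List[Tuple[List[int], List[int]]]:
    """Train/test index pairs in which every test block comes after its training data."""
    fold_size = (n_samples - min_train) // n_folds
    splits = []
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        test_end = train_end + fold_size
        splits.append((list(range(train_end)), list(range(train_end, test_end))))
    return splits


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error, in the same units as the target (here µg/m³)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Hypothetical daily PM2.5 values (µg/m³); NOT data from the study.
pm25 = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 18.0, 17.0, 16.0, 20.0, 19.0, 21.0]

pm25_lag1 = lag(pm25, 1)          # yesterday's concentration
pm25_ma3 = rolling_mean(pm25, 3)  # mean of the previous three days

splits = expanding_splits(len(pm25), n_folds=3, min_train=3)

# Persistence baseline: predict each test day with its lag-1 value.
# A real model would be fitted on train_idx only; the baseline needs no fitting.
fold_rmse = []
for train_idx, test_idx in splits:
    y_true = [pm25[t] for t in test_idx]
    y_pred = [pm25_lag1[t] for t in test_idx]
    fold_rmse.append(rmse(y_true, y_pred))

mean_rmse = sum(fold_rmse) / len(fold_rmse)
print("fold RMSE:", [round(r, 2) for r in fold_rmse])
print("mean RMSE:", round(mean_rmse, 2))
```

Splitting by time rather than at random matters here because the features are built from past values: a random split would let the model train on days that come after the ones it is tested on. A persistence baseline like this one is also a useful reference point when lag features dominate a model, since it shows how much error remains once yesterday's value is known.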
