Machine Learning Approaches for Predicting Air Pollution Levels: A Transparent, Time-Aware Pipeline for Daily AQI in Indian Cities
Abstract
Accurate city-scale forecasts of the Air Quality Index (AQI) are essential for exposure advisories and short-term mitigation. We present a transparent, leakage-aware machine-learning pipeline for daily AQI prediction across Indian cities using publicly available regulatory measurements. The workflow standardizes types, de-duplicates City×Date records, and applies within-city interpolation followed by median filling; it then constructs calendar variables, PM2.5/PM10 interaction features, and city-specific time-aware history (lags at t−1, t−3, t−7; rolling mean/standard deviation over 3/7/14 days computed on shifted series to prevent leakage). Under a strictly chronological split (last 20% of dates as a forward hold-out), we compare Linear Regression, Ridge, Random Forest, and Histogram-based Gradient Boosting using MAE, RMSE, R², and MAPE. Random Forest attains the best test performance (MAE = 12.7742, RMSE = 24.8427, R² = 0.9320, MAPE = 11.3123%). Feature importance indicates short-memory persistence (AQI_{t−1}) as the dominant driver, with co-pollutants (CO, PM2.5, PM10) and recent variability providing incremental skill. The pipeline is fully reproducible and deployment-ready, offering a strong operational baseline that agencies can extend with exogenous drivers (e.g., meteorology) and to alternative targets (e.g., PM2.5 or hourly horizons).
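The leakage-aware history features and the chronological hold-out described in the abstract could be sketched roughly as follows. This is a minimal illustration in pandas, not the authors' code: the column names `City`, `Date`, and `AQI`, the helper names `add_history_features` and `chronological_split`, and the small synthetic frame are all assumptions made for the example.

```python
import pandas as pd


def add_history_features(df, target="AQI", lags=(1, 3, 7), windows=(3, 7, 14)):
    """Add per-city lag and rolling features that use only past values.

    Hypothetical helper; column names are assumptions, not from the paper.
    """
    out = df.sort_values(["City", "Date"]).copy()
    for lag in lags:
        # Shifting inside each city group keeps one city's history
        # from bleeding into another city's first rows.
        out[f"{target}_lag{lag}"] = out.groupby("City")[target].shift(lag)
    for w in windows:
        # Shift by one day before rolling, so the window for day t
        # ends at t-1 and never contains the value being predicted.
        out[f"{target}_rollmean{w}"] = out.groupby("City")[target].transform(
            lambda s: s.shift(1).rolling(w, min_periods=1).mean()
        )
        out[f"{target}_rollstd{w}"] = out.groupby("City")[target].transform(
            lambda s: s.shift(1).rolling(w, min_periods=2).std()
        )
    return out


def chronological_split(df, date_col="Date", holdout_frac=0.2):
    """Hold out the last `holdout_frac` of unique dates as a forward test set."""
    dates = pd.Index(df[date_col].unique()).sort_values()
    n_test = max(1, int(round(len(dates) * holdout_frac)))
    cutoff = dates[-n_test]
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test


# Synthetic two-city example (illustrative values only, not real AQI data).
days = pd.date_range("2020-01-01", periods=10, freq="D")
demo = pd.DataFrame(
    {
        "City": ["A"] * 10 + ["B"] * 10,
        "Date": list(days) * 2,
        "AQI": list(range(100, 110)) + list(range(200, 210)),
    }
)
feats = add_history_features(demo)
train, test = chronological_split(feats)
```

Splitting on unique dates rather than on rows keeps every city's test period aligned to the same calendar window, which is one plausible reading of "last 20% of dates"; the first rows of each city carry missing history features and would need to be dropped or imputed before model fitting.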