Machine Learning Approaches for Predicting Air Pollution Levels: A Transparent, Time-Aware Pipeline for Daily AQI in Indian Cities

Abstract

Accurate city-scale forecasts of the Air Quality Index (AQI) are essential for exposure advisories and short-term mitigation. We present a transparent, leakage-aware machine-learning pipeline for daily AQI prediction across Indian cities using publicly available regulatory measurements. The workflow standardizes types, de-duplicates City×Date records, and applies within-city interpolation followed by median filling; it then constructs calendar variables, PM2.5/PM10 interaction features, and city-specific time-aware history (lags at t−1, t−3, t−7; rolling mean/standard deviation over 3/7/14 days, computed on shifted series to prevent leakage). Under a strictly chronological split (last 20% of dates as a forward hold-out), we compare Linear Regression, Ridge, Random Forest, and Histogram-based Gradient Boosting using MAE, RMSE, R², and MAPE. Random Forest attains the best test performance (MAE = 12.7742, RMSE = 24.8427, R² = 0.9320, MAPE = 11.3123%). Feature importance indicates short-memory persistence (AQI_{t−1}) as the dominant driver, with co-pollutants (CO, PM2.5, PM10) and recent variability providing incremental skill. The pipeline is fully reproducible and deployment-ready, offering a strong operational baseline that agencies can extend with exogenous drivers (e.g., meteorology) and to alternative targets (e.g., PM2.5 or hourly horizons).
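
The leakage-aware history features and the chronological hold-out described in the abstract can be illustrated with a minimal pandas sketch. This is not the authors' code: the column names `City`, `Date`, and `AQI`, the helper names `add_history_features` and `chronological_split`, and the toy data are all assumptions made for illustration. The key idea shown is that rolling statistics are computed on a series already shifted by one day, so the value at day t never sees the AQI of day t.

```python
import pandas as pd


def add_history_features(df, target="AQI", lags=(1, 3, 7), windows=(3, 7, 14)):
    """Add per-city lags and rolling statistics that use past days only.

    Hypothetical helper; column names are assumptions, not from the paper.
    """
    out = df.sort_values(["City", "Date"]).reset_index(drop=True)
    by_city = out.groupby("City")[target]

    # Plain lags: value of the target 1, 3, and 7 days earlier within a city.
    for lag in lags:
        out[f"{target}_lag{lag}"] = by_city.shift(lag)

    # Shift by one day first, then roll, so the window ending at day t
    # covers t-1 and earlier and never includes the value being predicted.
    shifted = by_city.shift(1)
    for w in windows:
        grouped = shifted.groupby(out["City"])
        out[f"{target}_rmean{w}"] = grouped.transform(lambda s: s.rolling(w).mean())
        out[f"{target}_rstd{w}"] = grouped.transform(lambda s: s.rolling(w).std())
    return out


def chronological_split(df, test_fraction=0.2):
    """Hold out the last `test_fraction` of distinct dates as the test set."""
    dates = df["Date"].drop_duplicates().sort_values().reset_index(drop=True)
    n_test = max(1, round(len(dates) * test_fraction))
    cutoff = dates.iloc[-n_test]  # first date that belongs to the hold-out
    return df[df["Date"] < cutoff], df[df["Date"] >= cutoff]


# Toy data (made up): two cities, ten consecutive days each.
demo = pd.DataFrame({
    "City": ["A"] * 10 + ["B"] * 10,
    "Date": list(pd.date_range("2020-01-01", periods=10)) * 2,
    "AQI": [float(v) for v in range(100, 110)] + [float(v) for v in range(200, 210)],
})
features = add_history_features(demo)
train, test = chronological_split(features)
```

Grouping by city before shifting keeps one city's history from spilling into another's first rows, and splitting on distinct dates rather than on row counts keeps every test observation strictly later than every training observation.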
