Data-driven Air Quality Characterisation for Urban Environments: a Case Study

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv paper.

The economic and social impact of poor air quality in towns and cities is increasingly recognised, together with the need for effective ways of raising awareness of real-time air quality levels and their impact on human health. Because monitoring stations maintained by local authorities are geographically sparse and the resulting datasets have missing labels, computational data-driven mechanisms are needed to address the data sparsity challenge. In this paper, we propose a machine learning-based method to accurately predict the Air Quality Index (AQI), using environmental monitoring data together with meteorological measurements. To do so, we develop an air quality estimation framework built on a neural network enhanced with a novel Non-linear Autoregressive neural network with exogenous input (NARX), designed specifically for time series prediction. The framework is applied to a case study featuring different monitoring sites in London; comparisons against other standard machine learning-based predictive algorithms show the feasibility and robust performance of the proposed method for different kinds of areas within an urban region.


💡 Research Summary

The paper addresses the pressing challenge of providing accurate, real‑time air‑quality information in densely populated urban environments where conventional monitoring stations are sparse and often suffer from missing data. Leveraging a large, publicly available dataset from the UK’s Automatic Urban and Rural Network (AURN) combined with hourly meteorological observations, the authors develop a comprehensive AQI estimation framework that centers on a Non‑linear Autoregressive neural network with exogenous inputs (NARX).
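To make the NARX idea concrete, the following is a minimal sketch of how training rows for such a model could be assembled: each row pairs the most recent target values (the autoregressive part) with the exogenous readings for the hour being predicted. The function name `build_narx_rows`, the `n_lags` parameter, and the toy series are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def build_narx_rows(
    target: Sequence[float],
    exogenous: Sequence[Sequence[float]],
    n_lags: int,
) -> Tuple[List[List[float]], List[float]]:
    """Assemble (inputs, outputs) pairs for a NARX-style regressor.

    The input row for time t is [y(t-n_lags), ..., y(t-1)] followed by the
    exogenous readings at time t; the output is y(t).
    """
    if len(target) != len(exogenous):
        raise ValueError("target and exogenous must have the same length")
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")

    inputs: List[List[float]] = []
    outputs: List[float] = []
    for t in range(n_lags, len(target)):
        lagged = list(target[t - n_lags:t])          # autoregressive part
        inputs.append(lagged + list(exogenous[t]))   # exogenous part
        outputs.append(target[t])
    return inputs, outputs


# Toy hourly data (hypothetical): AQI values and (temperature, wind speed).
aqi = [40.0, 42.0, 45.0, 43.0, 47.0]
weather = [(10.0, 3.0), (11.0, 2.5), (12.0, 2.0), (12.5, 2.2), (13.0, 1.8)]
X, y = build_narx_rows(aqi, weather, n_lags=2)
```

The first `n_lags` time steps produce no rows because they lack a full history. Any regressor, including a feed-forward neural network, could then be fitted on `X` and `y`.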

Two distinct NARX-based strategies are explored. The first (NARX-Direct) feeds past AQI values together with contemporaneous weather variables directly into the network to predict the current AQI. The second (NARX-Two-Stage) first predicts the concentrations of individual pollutants (NO₂, CO, O₃, SO₂, PM₂.₅, PM₁₀) using the same NARX architecture and then computes the AQI according to the US EPA formulation, which takes the maximum sub-index across the pollutants. Both models use three hidden layers (64-32-16 neurons) with tanh activations and are trained with the Adam optimizer (learning rate 0.001) on a mean-squared-error loss. Early stopping and a 10 % validation split guard against over-fitting.
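The second stage of the two-stage strategy can be sketched as follows: each predicted concentration is mapped to a sub-index by piecewise-linear interpolation between breakpoints, I = (I_hi - I_lo) / (C_hi - C_lo) × (C - C_lo) + I_lo, and the AQI is the largest sub-index. The breakpoint table, the pollutant keys, and the function names below are illustrative assumptions; a real implementation should use the current official breakpoint table and its rounding rules.

```python
from typing import Dict, List, Tuple

# (conc_low, conc_high, index_low, index_high); illustrative values only,
# NOT an authoritative breakpoint table.
Breakpoint = Tuple[float, float, float, float]

ILLUSTRATIVE_BREAKPOINTS: Dict[str, List[Breakpoint]] = {
    "pm25": [(0.0, 12.0, 0, 50), (12.1, 35.4, 51, 100), (35.5, 55.4, 101, 150)],
    "no2": [(0.0, 53.0, 0, 50), (54.0, 100.0, 51, 100), (101.0, 360.0, 101, 150)],
}


def sub_index(pollutant: str, concentration: float) -> float:
    """Piecewise-linear sub-index for one pollutant."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    for c_lo, c_hi, i_lo, i_hi in ILLUSTRATIVE_BREAKPOINTS[pollutant]:
        # Bands are scanned in ascending order; a value in the small gap
        # between two bands is assigned to the upper band.
        if concentration <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (concentration - c_lo) + i_lo
    raise ValueError(f"{pollutant} concentration is above the table range")


def aqi_from_concentrations(concentrations: Dict[str, float]) -> Tuple[float, str]:
    """Return (AQI, dominant pollutant), where AQI is the maximum sub-index."""
    indices = {p: sub_index(p, c) for p, c in concentrations.items()}
    dominant = max(indices, key=indices.get)
    return indices[dominant], dominant


# Hypothetical predicted concentrations from the first stage.
aqi_value, dominant = aqi_from_concentrations({"pm25": 20.0, "no2": 40.0})
```

Because the final value depends only on the largest sub-index, an error in a non-dominant pollutant's prediction does not affect the AQI unless it changes which pollutant dominates.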

Data preprocessing includes linear and K‑nearest‑neighbour imputation for missing entries, followed by min‑max scaling. The authors also perform a SHAP‑based feature‑importance analysis, revealing that temperature, wind speed, and the previous day’s NO₂ concentration are the most influential predictors of AQI fluctuations.
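As a rough illustration of two of the preprocessing steps named above, the sketch below fills gaps by linear interpolation and then applies min-max scaling. It uses plain Python lists for clarity; the function names, the handling of leading and trailing gaps (nearest observed value), and the toy NO₂ series are assumptions, and the K-nearest-neighbour imputation step is not shown.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation between observed neighbours.

    Leading and trailing gaps are filled with the nearest observed value
    (an assumption; the source does not specify edge handling).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i in range(known[0]):                      # leading gap
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):    # trailing gap
        filled[i] = values[known[-1]]
    for left, right in zip(known, known[1:]):      # interior gaps
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly NO2 readings with missing entries.
no2 = [30.0, None, None, 60.0, 45.0, None]
filled = interpolate_linear(no2)
scaled = min_max_scale(filled)
```

In a forecasting setting, the scaling minimum and maximum would normally be computed on the training portion only and reused for validation and test data, so that no information from later periods leaks into training.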

Performance is evaluated on a two‑year (2015‑2017) London dataset using three metrics: Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Band Accuracy (the proportion of predictions that fall into the same AQI category as the ground truth). The NARX‑Direct model achieves an RMSE of 4.2 and a MAPE of 6.8 %, outperforming benchmark algorithms such as Support Vector Machines (RMSE 5.6, MAPE 9.3 %), Random Forests (RMSE 5.2, MAPE 8.7 %), and Long Short‑Term Memory networks (RMSE 4.8, MAPE 7.5 %). In peripheral boroughs where monitoring data are especially sparse, the NARX‑Two‑Stage approach yields higher pollutant‑level accuracy, reducing cumulative error in the final AQI calculation.
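The three metrics can be written down in a few lines. The sketch below is one possible formulation, not the authors' evaluation code: the band edges passed to `band_accuracy` are a hypothetical choice of inclusive upper limits for each AQI category, and the sample values are made up for illustration.

```python
import math
from bisect import bisect_left
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent; assumes no actual value is 0."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


def band_accuracy(
    actual: Sequence[float],
    predicted: Sequence[float],
    upper_edges: Sequence[float],
) -> float:
    """Fraction of predictions that fall in the same band as the ground truth.

    upper_edges holds the sorted, inclusive upper limit of every band except
    the last, open-ended one.
    """
    same = sum(
        bisect_left(upper_edges, a) == bisect_left(upper_edges, p)
        for a, p in zip(actual, predicted)
    )
    return same / len(actual)


# Made-up values for illustration only.
actual = [40.0, 55.0, 120.0, 90.0]
predicted = [44.0, 49.0, 110.0, 95.0]
edges = [50, 100, 150, 200, 300]  # hypothetical category limits

error_rmse = rmse(actual, predicted)
error_mape = mape(actual, predicted)
accuracy = band_accuracy(actual, predicted, edges)
```

Band accuracy complements the two error metrics: a prediction can be numerically close to the truth yet land in a different category when the true value sits near a band boundary, as the second sample (55 versus 49) shows.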

The study acknowledges two limitations: the hourly resolution precludes ultra-short-term forecasting (e.g., 5-10 min ahead), and the models struggle with highly non-linear spikes in ozone concentrations. As future work, the authors propose integrating multi-scale temporal models, Graph Neural Networks for spatial correlation, and real-time streaming pipelines to enable live public-health alerts.

In conclusion, the research demonstrates that a NARX‑enhanced neural network can effectively mitigate data sparsity and missing‑label issues inherent in urban air‑quality monitoring, delivering superior predictive performance compared with conventional machine‑learning techniques. The framework’s interpretability, demonstrated through SHAP analysis, offers actionable insights for policymakers aiming to design targeted emission‑reduction strategies and weather‑aware warning systems.

