The Dataset of Daily Air Quality for the Years 2013-2023 in Italy

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Air quality and climate are major issues in Italian society and lie at the intersection of many research fields, including public health and policy planning. There is an increasing need for readily available, easily accessible, ready-to-use and well-documented datasets on air quality and climate. In this paper, we present the GRINS AQCLIM dataset, created under the GRINS project framework and covering the Italian domain over an extended time period. It includes daily statistics (e.g., minimum, quartiles, mean, median and maximum) for a collection of air pollutant concentrations and climate variables at the locations of the 700+ available monitoring stations. Input data are retrieved from the European Environment Agency and the Copernicus Programme and were subjected to multiple processing steps to ensure their reliability and quality. These steps include automatic procedures for fixing raw files, manual inspection of station information, the detection and removal of anomalies, and temporal harmonisation to a daily basis. Datasets are hosted on Zenodo under open-access principles.


💡 Research Summary

The paper presents the GRINS AQCLIM dataset, a comprehensive, open‑access collection of daily air‑quality and climate statistics for Italy covering the period 2013–2023. The authors gathered raw measurements from more than 700 ground‑based monitoring stations managed by regional environmental protection agencies (ARPAs) via the European Environment Agency (EEA) Air Quality Portal, and complemented these with atmospheric reanalysis data from the Copernicus Climate Change Service (ERA5 and the higher‑resolution ERA5‑Land). After parallel downloading with 50 workers, over 20 000 CSV files (each representing a single station, year, and pollutant) were obtained. The raw files exhibited numerous issues: duplicate entries, inconsistent temporal resolutions (hourly vs. daily), overlapping bi‑hourly records, and occasional corrupted values. An automated pipeline (illustrated in Figure 2) was developed to resolve duplication, standardise file formats, and flag problematic records.
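The download stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `fetch` callable stands in for whatever request the EEA portal actually requires, and `build_tasks` and `download_all` are hypothetical names. The only details taken from the summary are the 50 workers and the one-file-per-station-year-pollutant layout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product
from typing import Callable, Dict, List, Tuple, Union

# One task per (station, year, pollutant), matching the one-file-per-combination layout.
Task = Tuple[str, int, str]


def build_tasks(stations, years, pollutants) -> List[Task]:
    """Enumerate every station/year/pollutant combination to request."""
    return list(product(stations, years, pollutants))


def download_all(
    tasks: List[Task],
    fetch: Callable[[Task], bytes],  # hypothetical: performs the real HTTP request
    max_workers: int = 50,
) -> Dict[Task, Union[bytes, Exception]]:
    """Run fetch(task) concurrently; keep exceptions so failures can be retried."""
    results: Dict[Task, Union[bytes, Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:  # record the failure instead of aborting the batch
                results[task] = exc
    return results
```

Storing the exception next to each failed task, rather than raising, lets a later pass retry only the missing files, which matters when a batch contains tens of thousands of requests.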

The dataset includes eight key pollutants (NO, NO₂, CO, NH₃, O₃, SO₂, PM₁₀, and PM₂.₅) and twelve climate variables, among them boundary‑layer height, vegetation indices, relative humidity, temperature, solar radiation, precipitation, wind speed, and wind direction. For each pollutant and each day, six summary statistics are provided: minimum, first quartile, mean, median, third quartile, and maximum. Climate variables are extracted at the station coordinates from ERA5‑Land (0.1° × 0.1°), with ERA5 (0.25° × 0.25°) filling gaps, especially for coastal or island stations where ERA5‑Land is unavailable.
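As a rough illustration of how the six daily statistics could be derived from hourly records, the sketch below uses pandas. The column names (`station_id`, `timestamp`, `value`) and the function name are assumptions made for this example, and the quartile interpolation is pandas' default, which may differ from the rule used in the paper.

```python
import pandas as pd

STAT_COLUMNS = ["min", "q1", "mean", "median", "q3", "max"]


def daily_summary(hourly: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly values into six daily statistics per station.

    Expects hypothetical columns: station_id, timestamp (datetime64), value.
    """
    # Truncate each timestamp to its calendar day to form the grouping key.
    keyed = hourly.assign(date=hourly["timestamp"].dt.floor("D"))
    grouped = keyed.groupby(["station_id", "date"])["value"]

    stats = grouped.agg(["min", "mean", "median", "max"])
    # Quartiles use pandas' default linear interpolation; NaNs are skipped.
    stats["q1"] = grouped.quantile(0.25)
    stats["q3"] = grouped.quantile(0.75)
    return stats[STAT_COLUMNS].reset_index()
```

Calling `daily_summary` once per pollutant yields one row per station and day. A real pipeline would probably also record how many hourly values fed each day, so that sparsely covered days can be identified.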

Quality control proceeds in several stages. First, the EEA validity flag (‑99, ‑1, 1, 2, 3) is used to discard clearly invalid measurements. Second, anomalous concentrations are identified through a combination of verification flags and statistical thresholds. The authors experimented with z‑score methods (4σ) on raw and log‑transformed data, but due to the skewed nature of pollutant distributions, they ultimately adopted percentile‑based cut‑offs (99 % to 99.999 %). Fixed absolute thresholds were set per pollutant (e.g., CO > 100 µg m⁻³, PM₂.₅ > 10 000 µg m⁻³, SO₂ > 10 000 µg m⁻³) after visual inspection of histograms coloured by station, which revealed that extreme outliers originated from only a few stations and were likely instrument errors. Third, the station registry was cleaned: duplicate coordinates with different names were reconciled, and 32 stations erroneously assigned to multiple area categories (traffic, industrial, background) were manually corrected using satellite imagery. Fourth, temporal harmonisation was performed. Hourly pollutant data were imputed for missing points using a local Kalman‑smoother model (as in reference
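The three anomaly-screening ideas mentioned in the quality-control paragraph (z-score, percentile cut-off, fixed ceiling) can be compared with a short NumPy sketch. The function names and the default `q=99.9` are assumptions chosen for illustration. The summary states only that percentiles between 99 % and 99.999 % were used and does not say which value applies to which pollutant.

```python
import numpy as np


def zscore_flags(values, k=4.0, log=False):
    """Flag points more than k standard deviations from the mean."""
    v = np.asarray(values, dtype=float)
    if log:
        v = np.log(v)  # assumes strictly positive concentrations
    return np.abs(v - np.nanmean(v)) > k * np.nanstd(v)


def percentile_flags(values, q=99.9):
    """Flag points above the q-th percentile of the observed distribution."""
    v = np.asarray(values, dtype=float)
    return v > np.nanpercentile(v, q)


def absolute_flags(values, limit):
    """Flag points above a fixed, pollutant-specific ceiling."""
    return np.asarray(values, dtype=float) > limit
```

On heavily right-skewed data the mean and standard deviation are themselves inflated by the extremes, so a z-score rule can behave erratically. A percentile rule depends only on rank, which is one plausible reason for preferring it here.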

