Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting
Global air quality forecasting grapples with extreme spatial heterogeneity and the poor generalization of existing transductive models to unseen regions. To tackle this, we propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. By encoding invariant physical environmental attributes into generalizable station identities and dynamically constructing adaptive sparse topologies, our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. We further curate WorldAir, a massive dataset covering over 7,800 stations worldwide. Extensive experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models, while effectively bridging the monitoring gap in data-sparse regions.
💡 Research Summary
The paper tackles the long‑standing challenge of scaling air‑quality forecasting from local, data‑rich regions to a truly global setting where monitoring stations are unevenly distributed and atmospheric dynamics exhibit complex, non‑Euclidean spatial dependencies. Existing transductive spatio‑temporal graph neural networks (STGNNs) rely on learned node embeddings tied to specific stations, which hampers zero‑shot generalization to unseen locations and forces models to use static adjacency matrices that cannot reflect the ever‑changing wind‑driven transport of pollutants. To overcome these limitations, the authors propose OmniAir, an inductive semantic topology learning framework that integrates physical environmental attributes, dynamic sparse graph construction, and a novel diffusion‑generation propagation mechanism.
Key components
Inductive Semantic Identity Encoder – Instead of fixed node embeddings, each monitoring station is represented by a continuous “semantic identity” (e_ID). This identity is generated from observable metadata such as latitude, longitude, elevation, climate zone, and other static attributes. The encoder first maps geographic coordinates through a multi‑scale Fourier feature expansion, then concatenates these features with local statistical descriptors (means, variances, directional offsets) and other physical attributes. A shared MLP projects the concatenated vector into a high‑dimensional embedding space. Because the representation is derived solely from observable data, it can be computed for any new or sparsely observed station, enabling true zero‑shot inference.
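A minimal sketch of how such an identity could be computed from metadata alone is shown here. It is not the authors' implementation: the geometric frequency ladder, the single tanh layer standing in for the shared MLP, and the names `fourier_features`, `init_layer`, and `semantic_identity` are all assumptions made for illustration.

```python
import math
import random


def fourier_features(lat_deg, lon_deg, num_scales=4):
    """Multi-scale sin/cos expansion of a coordinate pair (angles in radians)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    feats = []
    for s in range(num_scales):
        freq = 2.0 ** s  # assumed frequency ladder: 1, 2, 4, 8, ...
        for angle in (lat, lon):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats  # length = 4 * num_scales


def init_layer(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for trained weights: small random values."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def semantic_identity(lat_deg, lon_deg, static_attrs, weights, bias, num_scales=4):
    """Concatenate Fourier features with static attributes, then apply one
    shared tanh layer (a one-layer stand-in for the shared MLP)."""
    x = fourier_features(lat_deg, lon_deg, num_scales) + list(static_attrs)
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]
```

Because the function depends only on coordinates and static attributes, the same weights can be applied to a station that was never seen during training. Using integer multiples of the angle also keeps the longitude encoding continuous across the ±180° meridian.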
Dynamic Sparse Topology Generator – The graph is built on‑the‑fly for each training/inference step. Two neighbor sets are defined: (a) geographic neighbors selected by the Haversine distance (k_geo nearest stations) and (b) semantic neighbors selected by the smallest Euclidean distance in the learned identity space (k_sem most similar stations). This hybrid neighbor set captures both local terrain effects and long‑range functional similarity (e.g., stations sharing similar climate and emission profiles across continents). Edge weights are not static; a linear projection of the current node features feeds a LeakyReLU‑based attention mechanism that yields a dynamic score α_ij. This score is blended with a Gaussian distance‑based static weight w_static using a learned sigmoid gate g_ij, allowing the model to fall back on geographic proximity when atmospheric conditions provide little signal, and to rely on attention when wind patterns or synoptic events dominate.
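The hybrid neighbor selection and gated edge weight can be sketched as follows. This is an illustrative sketch only: the summary does not give the exact blending formula, so the convex combination g·α + (1 − g)·w_static, the Gaussian form exp(−(d/σ)²), the bandwidth `sigma_km`, and all function names are assumptions. The attention score α_ij is taken as an input rather than computed, and the brute-force search is for clarity, not efficiency.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def k_nearest(i, n, k, dist):
    """Indices of the k stations closest to station i under `dist`."""
    others = [j for j in range(n) if j != i]
    return sorted(others, key=lambda j: dist(i, j))[:k]


def hybrid_neighbors(i, coords, identities, k_geo, k_sem):
    """Union of geographic (Haversine) and semantic (Euclidean) neighbors."""
    n = len(coords)
    geo = k_nearest(i, n, k_geo, lambda a, b: haversine_km(*coords[a], *coords[b]))
    sem = k_nearest(i, n, k_sem, lambda a, b: math.dist(identities[a], identities[b]))
    return sorted(set(geo) | set(sem))


def gaussian_weight(dist_km, sigma_km=500.0):
    """Static distance-based weight; sigma_km is a hypothetical bandwidth."""
    return math.exp(-(dist_km / sigma_km) ** 2)


def gated_edge_weight(alpha, w_static, gate_logit):
    """Blend dynamic attention alpha with the static weight via a sigmoid gate.
    Assumed form: g * alpha + (1 - g) * w_static."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return g * alpha + (1.0 - g) * w_static
```

With a strongly negative gate logit the edge weight collapses to the geographic prior, which mirrors the fallback behavior described above; with a strongly positive logit the attention score dominates.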
Adaptive Sparsity via Learnable Pruning – Global monitoring networks are highly heterogeneous: densely instrumented urban clusters coexist with vast deserts containing only a few stations. Rather than fixing a uniform k for all nodes, OmniAir predicts a node‑specific effective neighbor count β_i through an MLP applied to the node’s current representation. A soft mask m_ij, controlled by a temperature‑like parameter η, suppresses edges whose rank r_ij exceeds β_i. This differentiable pruning yields a sparse adjacency matrix that automatically allocates broader receptive fields to data‑rich nodes while keeping the graph compact for isolated stations, preserving signal‑to‑noise ratio and reducing computational load.
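One plausible form of the soft mask is m_ij = sigmoid((β_i − r_ij) / η), which approaches a hard top‑β cut as η shrinks. The summary does not state the exact expression, so this form, the default η, and the name `soft_prune` are assumptions; in the model β_i would come from an MLP, whereas here it is passed in directly.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_prune(scores, beta, eta=0.1):
    """Down-weight candidate edges whose rank exceeds beta.

    scores: edge scores for one node's candidate neighbors (higher = stronger)
    beta:   effective neighbor count for this node (may be fractional)
    eta:    temperature; smaller values give a sharper cut
    Returns (masked_scores, mask).
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0] * len(scores)
    for r, j in enumerate(order, start=1):  # rank 1 = strongest edge
        ranks[j] = r
    mask = [sigmoid((beta - r) / eta) for r in ranks]
    return [s * m for s, m in zip(scores, mask)], mask
```

Because the mask is a smooth function of β_i, gradients can flow back into whatever network predicts β_i, which is what makes the pruning learnable rather than a fixed top‑k selection.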
Air‑Aware Differential Propagation – Traditional GNN layers perform low‑pass filtering, which models diffusion but cannot represent pollutant sources. OmniAir introduces a multi‑step diffusion process inspired by the physics of atmospheric transport. Starting from the raw input X (pollutant concentrations), the model iteratively updates node states:
$$h_i^{(l)} = \sum_{j \in \mathcal{N}_i} \tilde{w}_{ij}\, h_j^{(l-1)} + \lambda\, h_i^{(0)}, \qquad l = 1, \dots, L$$
where λ is a restart probability that balances diffusion with preservation of the original measurement. The collection of states across L steps encodes information from immediate neighborhoods (l=1) to global clusters (l=L). To fuse these multi‑scale representations, each node computes query, key, and value projections of its stacked diffusion states and applies a tanh‑scaled attention matrix A_i = tanh(Q_i K_i^T / √d_k ⊙ B). Positive attention coefficients reinforce standard smoothing, while negative coefficients enable “anti‑diffusion,” effectively highlighting sharp gradients associated with local emission sources. This dual capability allows the network to simultaneously capture smooth regional transport and abrupt local spikes.
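The update rule and the signed fusion step can be illustrated with a small sketch. Several simplifications are assumed here: the adjacency is a dense list of lists rather than a sparse structure, the term B (not defined in this summary) is omitted, the projection matrices are supplied by the caller, and the names `diffuse` and `signed_fusion` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-list matrix product: (n x m) @ (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diffuse(weights, h0, num_steps, restart):
    """Return [h^(1), ..., h^(L)] with h^(l) = W h^(l-1) + restart * h^(0).

    weights: N x N normalized edge weights (zero for non-neighbors)
    h0:      N x d initial node states
    """
    states, h = [], h0
    for _ in range(num_steps):
        prop = matmul(weights, h)
        h = [[p + restart * x0 for p, x0 in zip(prow, xrow)]
             for prow, xrow in zip(prop, h0)]
        states.append(h)
    return states


def signed_fusion(node_states, wq, wk, wv):
    """Fuse one node's L diffusion states (L x d) with tanh attention.

    tanh keeps coefficients in [-1, 1], so some may be negative,
    unlike softmax attention, whose coefficients are always positive.
    """
    q, k, v = matmul(node_states, wq), matmul(node_states, wk), matmul(node_states, wv)
    d_k = len(wk[0])
    k_t = [list(col) for col in zip(*k)]
    attn = [[math.tanh(s / math.sqrt(d_k)) for s in row] for row in matmul(q, k_t)]
    return attn, matmul(attn, v)
```

The key design point is the replacement of softmax by tanh: a negative coefficient subtracts a smoothed state from the fused output, which sharpens rather than blurs the signal at that node.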
Temporal Decoder – The aggregated spatial representation is fed into a Transformer‑style temporal decoder that predicts pollutant concentrations for the next τ time steps. The decoder benefits from the rich spatial context and the inductive identities, enabling it to extrapolate to future conditions even in regions where historical data are scarce.
Dataset – WorldAir
To evaluate OmniAir, the authors curate WorldAir, the largest publicly released station‑level air‑quality dataset to date. It comprises over 7,800 monitoring stations spanning all inhabited continents, covering multiple pollutants (PM2.5, PM10, CO, NO₂, SO₂, O₃, etc.) and a 15‑year period (2010‑2025). Each station is annotated with static attributes (elevation, climate zone, population density, land‑use type) and dynamic meteorological covariates (temperature, humidity, wind speed/direction). The dataset exhibits extreme spatial heterogeneity: dense networks in North America, Europe, and East Asia contrast with sparse coverage in Sub‑Saharan Africa, the Middle East, and parts of South America.
Experimental Results
OmniAir is benchmarked against 18 strong baselines, including classic GNNs (GCN, Graph WaveNet), recent STGNNs (ST‑Transformer, AirFormer), physics‑informed models, and large‑scale weather foundation models that operate on gridded data. Evaluation metrics are MAE, RMSE, MAPE, and R² across all pollutants and regions. OmniAir consistently outperforms baselines, achieving an average reduction of 12‑18% in error metrics. The gains are especially pronounced in data‑sparse regions, where MAE improvements reach up to 25% relative to the best competing method.
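For reference, the four reported metrics and the relative error reduction can be computed as in the sketch below. These are the standard textbook definitions; how the paper averages across pollutants, regions, and horizons is not stated in this summary, and skipping zero targets in MAPE is an assumption made here to avoid division by zero.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; zero targets are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_reduction(baseline_err, model_err):
    """Percentage by which model_err is lower than baseline_err."""
    return 100.0 * (baseline_err - model_err) / baseline_err
```

Under this definition, a reported 25% MAE improvement means the model's MAE is three quarters of the best baseline's MAE on the same stations.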
From a computational standpoint, OmniAir’s graph construction and propagation have linear complexity O(N·K) (K = average effective neighbor count), compared to O(N²) or O(N log N) for many existing approaches. On a single NVIDIA A100 GPU, inference for a batch of 32 stations takes ~0.12 seconds, roughly ten times faster than the closest competitor. This efficiency makes real‑time global monitoring feasible.
Ablation Studies
- Semantic vs. Geographic Neighbors: Removing semantic neighbors degrades performance by ~7%, confirming that long‑range functional similarity is crucial.
- Dynamic Edge Weights: Replacing the attention‑based dynamic weights with static Gaussian distances increases error by ~5%, highlighting the importance of capturing temporal atmospheric variability.
- Adaptive Sparsity: Fixing a uniform k leads to over‑smoothing in dense regions and noisy predictions in sparse regions, confirming the benefit of learnable β_i.
Interpretability
Visualization of the learned graph reveals cross‑continental connections that align with known atmospheric transport pathways, such as Saharan dust moving to the Caribbean and East Asian pollution influencing Pacific islands. The attention scores correlate with wind speed and direction, providing physical interpretability.
Conclusions and Impact
OmniAir introduces a paradigm shift for global air‑quality forecasting by (1) replacing transductive node embeddings with inductive, physics‑grounded semantic identities, (2) constructing a dynamic, sparsely connected graph that respects both geographic proximity and semantic similarity, and (3) employing a diffusion‑generation propagation that simultaneously models pollutant spread and source emission. The framework achieves state‑of‑the‑art accuracy, dramatically lower computational cost, and robust zero‑shot generalization to regions lacking historical measurements. By releasing the WorldAir dataset and the OmniAir codebase, the authors provide a solid foundation for future research and for policymakers seeking reliable, near‑real‑time air‑quality information worldwide, especially in underserved areas where monitoring infrastructure is limited.