Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting
The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
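To make the "Deep Variography" idea concrete, here is a minimal, self-contained toy sketch (not the authors' code) of recovering a spatial decay parameter by backpropagation. It assumes an exponential covariance kernel and fits its range parameter to the sample covariance of a synthetic Gaussian field; the paper instead recovers the parameter end-to-end through the forecasting objective.

```python
# Toy illustration of "Deep Variography": recovering a spatial decay (range)
# parameter by gradient descent. Simplified stand-in for the paper's setup --
# here the range is fit directly to the empirical covariance of a synthetic
# Gaussian field rather than through a forecasting loss.
import torch

torch.manual_seed(0)

# Synthetic ground truth: 50 sensors, exponential covariance with range rho = 0.3.
coords = torch.rand(50, 2)
dists = torch.cdist(coords, coords)
true_rho = 0.3
cov_true = torch.exp(-dists / true_rho)

# Draw samples from the field and form the empirical covariance.
L = torch.linalg.cholesky(cov_true + 1e-6 * torch.eye(50))
samples = torch.randn(2000, 50) @ L.T
cov_emp = samples.T @ samples / samples.shape[0]

# Learnable log-range, optimized to match the empirical covariance.
log_rho = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_rho], lr=0.05)
for step in range(300):
    opt.zero_grad()
    cov_model = torch.exp(-dists / torch.exp(log_rho))
    loss = ((cov_model - cov_emp) ** 2).mean()
    loss.backward()
    opt.step()

print(f"recovered rho ~ {torch.exp(log_rho).item():.3f} (true {true_rho})")
```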
💡 Research Summary
The paper introduces a Spatially‑Informed Transformer (SIT) that merges the rigorous covariance structure of classical geostatistics with the expressive power of modern transformer networks for spatio‑temporal forecasting. Traditional Gaussian Process (GP) models provide exact uncertainty quantification but scale cubically with the number of observations, making them infeasible for large sensor networks. Graph Neural Networks (GNNs) address irregular domains but are limited by local receptive fields and suffer from oversmoothing when modeling long‑range teleconnections. Transformers, with their global self‑attention, naturally capture long‑range dependencies, yet their standard attention mechanism is permutation‑invariant and lacks any built‑in notion of physical distance; positional encodings used in NLP are inadequate for continuous 2‑D/3‑D spaces.
To bridge this gap, the authors embed a learnable geostatistical kernel directly into the attention score. The modified attention becomes
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + \mathcal{C}_{\theta}(D)\right) V,
\]
where \(\mathcal{C}_{\theta}(D)\) is the learnable stationary covariance kernel (the physical prior) evaluated on the matrix of pairwise sensor distances \(D\), and the standard \(QK^{\top}\) term acts as the non-stationary, data-driven residual.
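As a rough illustration of this formulation, the PyTorch sketch below adds a learnable stationary kernel to the attention logits of a single head. The exponential kernel form, the mixing weight `lam`, and all names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: single-head attention with a geostatistical bias
# added to the logits. Kernel form and parameterization are assumptions.
import torch
import torch.nn as nn


class SpatiallyBiasedAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learnable kernel parameters: log-range and log-sill keep them positive.
        self.log_rho = nn.Parameter(torch.zeros(1))
        self.log_sigma2 = nn.Parameter(torch.zeros(1))
        self.lam = nn.Parameter(torch.ones(1))  # weight of the spatial prior

    def kernel(self, coords: torch.Tensor) -> torch.Tensor:
        # Stationary exponential covariance C_ij = sigma^2 * exp(-d_ij / rho),
        # evaluated on pairwise Euclidean distances between sensor coordinates.
        d = torch.cdist(coords, coords)
        return torch.exp(self.log_sigma2) * torch.exp(-d / torch.exp(self.log_rho))

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_sensors, d_model); coords: (n_sensors, 2)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # data-driven residual
        scores = scores + self.lam * self.kernel(coords)        # stationary spatial prior
        return torch.softmax(scores, dim=-1) @ v


# Example: 8 sensors scattered in the unit square, feature dimension 16.
layer = SpatiallyBiasedAttention(d_model=16)
out = layer(torch.randn(4, 8, 16), torch.rand(8, 2))
print(out.shape)  # torch.Size([4, 8, 16])
```

Because the kernel parameters sit inside the attention logits, gradients from the forecasting loss flow directly into the range and sill, which is what allows the decay structure of the underlying process to be learned end-to-end.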