A Space-Time Transformer for Precipitation Nowcasting
Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades, traditional numerical weather prediction (NWP) models have been the state of the art for precipitation forecasting. However, physically parameterized models suffer from two core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, their performance degrades at nowcasting timescales (i.e., 0--4 hour lead times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures to weather forecasting remain underexplored. To address these gaps, we propose \textbf{SaTformer}: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets: namely, we reformulate precipitation regression as a classification problem and employ a class-weighted loss to address label imbalances. Our model scored \textbf{first place} on the NeurIPS Weather4Cast 2025 ``Cumulative Rainfall'' challenge. Code and model weights are available at \texttt{\href{github.com/leharris3/w4c-25}{github.com/leharris3/satformer}}.
💡 Research Summary
The paper introduces SaTformer, a video‑transformer architecture designed for precipitation nowcasting (0‑4 h lead times) using low‑resolution geostationary satellite HRIT imagery. Traditional numerical weather prediction (NWP) models are computationally expensive and lose skill at such short horizons, prompting the rise of AI‑Weather Prediction (AI‑WP) methods. Existing AI‑WP work has largely focused on medium‑range, global forecasts and on convolutional networks; the authors argue that video‑understanding transformers can better capture the spatio‑temporal dynamics of extreme precipitation.
SaTformer processes an input sequence of T = 4 frames, each with C = 11 spectral channels and a spatial size of 32 × 32 pixels (covering roughly a 512 × 512 km region). Each frame is divided into non-overlapping 4 × 4 patches, yielding N = (H·W)/P² = 64 tokens per timestep. Tokens are linearly projected to a hidden dimension d = 512, summed with learnable positional embeddings, and a learnable class token (CLS) is prepended to the sequence. The full token sequence (NT + 1 = 257 tokens) is fed through L = 12 transformer encoder blocks. Within each block, full space-time self-attention is applied: every token attends to every other token across both the spatial and temporal axes, implemented with multi-head attention (8 heads, head dim = 64). After attention, a residual connection and a feed-forward MLP update the token representations. The final CLS token is extracted and passed through a single-layer MLP followed by a softmax to produce a probability distribution over n = 64 discrete precipitation bins.
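Under the shapes stated above, the tokenization step can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the reshape-based patch extraction mirrors the described scheme, while the random projection matrix and the zero CLS vector are stand-ins for learned parameters.

```python
import numpy as np

# Shapes from the summary: T=4 frames, C=11 channels, 32x32 pixels, 4x4 patches.
T, C, H, W, P = 4, 11, 32, 32, 4
d = 512  # hidden dimension

frames = np.random.randn(T, C, H, W).astype(np.float32)

# Split each frame into non-overlapping PxP patches and flatten each patch:
# (T, C, H, W) -> (T, H/P, W/P, C*P*P) -> (T, N, C*P*P)
patches = frames.reshape(T, C, H // P, P, W // P, P)
patches = patches.transpose(0, 2, 4, 1, 3, 5).reshape(T, (H // P) * (W // P), C * P * P)

N = patches.shape[1]  # tokens per timestep: (H*W)/P^2 = 64

# Stand-in for the learned linear projection to dimension d.
proj = np.random.randn(C * P * P, d).astype(np.float32)
tokens = patches.reshape(T * N, C * P * P) @ proj  # (T*N, d) = (256, 512)

# Prepend a CLS token (learnable in the model; zeros here for illustration).
cls = np.zeros((1, d), dtype=np.float32)
seq = np.concatenate([cls, tokens], axis=0)  # (N*T + 1, d) = (257, 512)
```

With these shapes, the encoder operates on a 257-token sequence, so full space-time attention costs O((NT + 1)²) per layer, which stays tractable at this resolution.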
To address the severe class imbalance inherent in extreme‑precipitation datasets, the authors reformulate the regression problem as a classification task. The continuous target y_reg is quantized into equally spaced bins of width δ, and the nearest bin index is encoded as a one‑hot vector. Training uses a class‑weighted cross‑entropy loss: each class i receives weight w_i = ‑log(|D_i|/|D_total|), where |D_i| is the number of training samples in that bin. This weighting amplifies the contribution of rare, high‑precipitation bins, improving calibration and extreme‑event skill.
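A minimal sketch of this label construction and weighting on a toy long-tailed target distribution; the quantization and the weight formula w_i = -log(|D_i|/|D_total|) follow the summary, while the bin width δ = 1 mm, the synthetic data, and the uniform prediction are illustrative assumptions.

```python
import numpy as np

n_bins, delta = 64, 1.0  # delta (bin width, mm) is an assumed value
rng = np.random.default_rng(0)

# Toy long-tailed precipitation targets (mm), heavily skewed toward zero.
y_reg = rng.exponential(scale=3.0, size=10_000)

# Nearest-bin quantization, clipped into [0, n_bins - 1]; the one-hot target
# is just the index `bins[k]` for sample k.
bins = np.clip(np.round(y_reg / delta).astype(int), 0, n_bins - 1)

# Class weights w_i = -log(|D_i| / |D_total|): rare (heavy-rain) bins get
# larger weights; empty bins are given weight 0 to avoid log(0).
counts = np.bincount(bins, minlength=n_bins)
freq = counts / counts.sum()
weights = np.where(counts > 0, -np.log(np.maximum(freq, 1e-12)), 0.0)

# Weighted cross-entropy for a single softmax prediction p over the bins
# (uniform prediction used purely for illustration).
p = np.full(n_bins, 1.0 / n_bins)
i = bins[0]                       # true bin of the first sample
loss = -weights[i] * np.log(p[i])
```

Because the weight grows logarithmically as a bin's empirical frequency shrinks, common near-zero-rain bins contribute only modestly to the loss while rare extreme-precipitation bins are strongly upweighted, which is what drives the improved extreme-event skill.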
Training details: the model is trained for 200 epochs (≈25 k steps) with Adam (lr = 1e‑5), batch size 128, on four NVIDIA A6000 GPUs. Input features are normalized to