Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing
Predicting satellite imagery requires balancing structural accuracy with textural detail. Deterministic methods such as PredRNN and SimVP minimize pixel-wise errors but suffer from the “regression to the mean” problem, producing blurry outputs that obscure subtle geospatial features. Generative models render realistic textures but often hallucinate structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines self-supervised learning (SSL) with latent diffusion models (LDMs). An I-JEPA module predicts stable semantic representations, which then steer a frozen Stable Diffusion backbone through a lightweight cross-attention adapter, ensuring that the synthesized high-fidelity textures are grounded in structurally faithful predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries: it achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, subject to the usual limits of autoregressive stability. The code and dataset are publicly available at https://github.com/VU-AIML/SAT-JEPA-DIFF.
💡 Research Summary
SatJEPADiff addresses a long‑standing dilemma in satellite image forecasting: deterministic models such as PredRNN and SimVP excel at minimizing pixel‑wise losses but inevitably produce blurry outputs due to regression‑to‑the‑mean, while generative diffusion models generate realistic textures but can hallucinate structural elements that do not exist in the true scene. The proposed framework bridges this gap by decomposing the prediction task into two complementary stages.
In the first stage, an Image Joint‑Embedding Predictive Architecture (I‑JEPA) is adapted for spatiotemporal forecasting. An input frame Iₜ is split into 8×8 patches and encoded by a Vision Transformer (ViT) into a sequence of 256 tokens (dimension 768). A lightweight transformer predictor (default 6 layers) forecasts the future semantic embedding ẑ_{t+1}. A target encoder, maintained as an exponential moving average (EMA) copy of the main encoder, provides stable reference embeddings z*_{t+1} derived from the AlphaEarth foundation model (64‑dimensional per‑pixel vectors). The I‑JEPA loss combines L1 reconstruction, cosine similarity, a spatial variance regularizer, and an InfoNCE contrastive term, ensuring that the predicted latent captures both fine‑grained semantics and global discriminability.
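The four-term objective above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the equal term weighting, the hinge-at-1 form of the variance regularizer, and the token-level InfoNCE with in-batch negatives are all assumptions for clarity.

```python
import numpy as np

def jepa_loss(z_pred, z_tgt, temperature=0.1, eps=1e-4):
    """Sketch of the combined I-JEPA objective: L1 + cosine +
    variance regularizer + InfoNCE. z_pred, z_tgt: (N, D) arrays of
    predicted and EMA-target token embeddings."""
    # L1 reconstruction term
    l1 = np.abs(z_pred - z_tgt).mean()
    # Cosine-similarity term (1 - mean cosine between matched tokens)
    zp = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    zt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)
    cos = 1.0 - (zp * zt).sum(axis=1).mean()
    # Variance regularizer: push per-dimension std above 1 to avoid collapse
    std = np.sqrt(z_pred.var(axis=0) + eps)
    var_reg = np.maximum(0.0, 1.0 - std).mean()
    # InfoNCE: matched token pairs are positives, all other tokens negatives
    logits = zp @ zt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -np.diag(log_probs).mean()
    return l1 + cos + var_reg + info_nce
```

A matched pair of embeddings should score lower than a mismatched one, which makes the loss easy to sanity-check in isolation.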
The second stage conditions a frozen Stable Diffusion 3.5 latent diffusion model (LDM) on the predicted semantics. A cross‑attention adapter transforms the I‑JEPA tokens into a 4096‑dimensional token stream h and also projects a down‑sampled (32×32) RGB version of the current frame into a 2048‑dimensional global conditioning vector p. A learned sigmoid gate α balances semantic and coarse‑spatial signals (h = α·h_semantic + (1−α)·h_coarse). These conditioning signals are fed into the diffusion backbone, which is trained only via low‑rank adaptation (LoRA) on the attention projections. Training follows a rectified‑flow formulation: noisy latents x_σ are constructed as a linear interpolation between the clean latent x₀ and Gaussian noise ε, and the model learns to predict the velocity v = ε − x₀. The diffusion loss consists of an L2 velocity term plus an SSIM regularizer.
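The gating and the rectified-flow target construction can be written down in a few lines. This is an illustrative sketch under the definitions given above (x_σ = (1−σ)·x₀ + σ·ε and v = ε − x₀); tensor shapes and the scalar form of the gate are assumptions, not the paper's exact code.

```python
import numpy as np

def gated_condition(h_semantic, h_coarse, alpha_logit):
    """Gated fusion of semantic and coarse-spatial conditioning:
    h = sigmoid(alpha_logit) * h_semantic + (1 - sigmoid(alpha_logit)) * h_coarse."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * h_semantic + (1.0 - alpha) * h_coarse

def rectified_flow_pair(x0, sigma, rng):
    """Build one rectified-flow training pair: the noisy latent
    x_sigma = (1 - sigma) * x0 + sigma * eps and the velocity target
    v = eps - x0 that the backbone learns to predict."""
    eps = rng.standard_normal(x0.shape)
    x_sigma = (1.0 - sigma) * x0 + sigma * eps
    v = eps - x0
    return x_sigma, v
```

A useful consistency check falls out of the linear interpolation: x₀ + σ·v = (1−σ)·x₀ + σ·ε = x_σ, so walking the clean latent along the velocity for time σ recovers the noisy latent exactly.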
The authors curated a large‑scale dataset spanning 2017‑2024, covering 100 geographically diverse regions. Each sample includes Sentinel‑2 surface reflectance (RGB bands, 10 m GSD) and pre‑computed AlphaEarth embeddings. Experiments compare SatJEPADiff against deterministic baselines (PredRNN, SimVP) and pure diffusion baselines (Stable Diffusion 3.5, MCVD). While deterministic models achieve higher PSNR and SSIM, they lag dramatically on perceptual metrics: GSSIM (gradient‑based SSIM) and FID. SatJEPADiff attains GSSIM = 0.8984 (a >14 % improvement over the best deterministic score) and FID = 0.1475 (a 67 % reduction compared to the vanilla diffusion baseline). Qualitative visualizations show that SatJEPADiff preserves sharp urban edges, road networks, and forest boundaries that deterministic models blur, while avoiding the unrealistic artifacts sometimes seen in unconstrained diffusion outputs.
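To make the perceptual metric concrete, here is a simplified sketch of gradient-based SSIM: the standard SSIM statistics applied to gradient-magnitude maps rather than raw pixels. Using global (non-windowed) statistics is a simplifying assumption to keep the example short; the paper's exact GSSIM formulation may differ.

```python
import numpy as np

def gssim(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified gradient-based SSIM between two images (2-D arrays):
    compute SSIM's luminance/contrast/structure statistics over
    gradient-magnitude maps instead of raw intensities."""
    def grad_mag(img):
        gy, gx = np.gradient(img.astype(float))
        return np.hypot(gx, gy)
    ga, gb = grad_mag(a), grad_mag(b)
    mu_a, mu_b = ga.mean(), gb.mean()
    va, vb = ga.var(), gb.var()
    cov = ((ga - mu_a) * (gb - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))
```

Because the score is built from gradient maps, it rewards sharp, correctly placed edges, which is exactly where blurred deterministic predictions lose most of their score.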
Ablation studies replace the ViT encoder with the Panopticon multi‑sensor foundation model; performance remains comparable, indicating that the framework is not tied to a specific encoder architecture. Long‑horizon roll‑out experiments (5‑step predictions) reveal modest error accumulation but retain structural consistency, suggesting the method’s potential for multi‑step forecasting.
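The multi-step roll-out described above reduces to a simple autoregressive loop: each predicted frame is fed back as the next input. The sketch below abstracts the three stages behind placeholder callables (`encode`, `predict`, `decode` standing in for the I-JEPA encoder, the latent predictor, and the diffusion decoding stage); this structure, not the callables themselves, is what the paper describes.

```python
def rollout(encode, predict, decode, frame0, steps):
    """Autoregressive multi-step forecast: encode the current frame,
    predict the next latent, decode it to a frame, and feed that
    frame back in. Errors from each step compound into the next."""
    frames = []
    frame = frame0
    for _ in range(steps):
        z = encode(frame)
        z_next = predict(z)
        frame = decode(z_next)
        frames.append(frame)
    return frames
```

The loop makes the error-accumulation behaviour easy to see: any bias introduced by `decode` at step t becomes part of the input at step t+1, which is why the 5-step experiments show modest but growing drift.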
Limitations include the stochastic nature of diffusion, which introduces variability in multi‑step roll‑outs, and the current focus on single‑step prediction. Future work is outlined: (1) designing recurrent adapters for better long‑term temporal coherence, (2) incorporating vision‑language scene descriptions as additional conditioning signals, and (3) exploring lightweight latent decoders for real‑time deployment.
In summary, SatJEPADiff is the first model that successfully merges self‑supervised semantic forecasting with a frozen latent diffusion generator, achieving both high‑fidelity textures and structurally accurate predictions in satellite imagery. The approach sets a new benchmark for remote‑sensing forecasting and opens avenues for downstream applications such as disaster monitoring, agricultural management, and climate change analysis.