TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation
We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate the superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow takes initial steps toward deep-learning-based risk map prediction for natural disasters, a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.
💡 Research Summary
TerraFlow introduces a novel pre‑training framework that explicitly incorporates temporal structure into multimodal Earth observation (EO) models. Building on the TerraMind foundation model, the authors add two key components: (1) Rotary Positional Encoding (RoPE) applied along the time axis, which encodes relative temporal offsets directly into the query‑key attention mechanism, and (2) Temporal Disjoint Sampling (TDS), a masking strategy that forces the decoder to predict tokens from timestamps that are disjoint from those provided as input. This design encourages the model to learn change patterns, persistence, and dynamics rather than relying solely on spatial cues at a single time point.
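The two components can be illustrated with a short sketch. This is a generic illustration, not the TerraFlow codebase: the function names (`apply_time_rope`, `tds_split`) and the interleaved-pair RoPE formulation are assumptions chosen to show the idea that attention scores depend only on relative temporal offsets, and that target timestamps are kept disjoint from input timestamps.

```python
# Illustrative sketch only: names and formulation are assumptions,
# not taken from the TerraFlow implementation.
import numpy as np

def rotate_half(x):
    # Pair up adjacent feature dimensions and rotate each pair by 90 degrees.
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., ::2], out[..., 1::2] = -x2, x1
    return out

def apply_time_rope(x, timestamps, base=10000.0):
    """Rotate query/key features by angles proportional to the token's
    timestamp, so q.k attention scores depend only on the relative
    temporal offset. x: (..., T, D) features; timestamps: (T,) indices."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)   # (D/2,)
    angles = np.asarray(timestamps)[:, None] * inv_freq  # (T, D/2)
    cos = np.repeat(np.cos(angles), 2, axis=-1)          # (T, D)
    sin = np.repeat(np.sin(angles), 2, axis=-1)
    return x * cos + rotate_half(x) * sin

def tds_split(timestamps, rng, n_target=1):
    """Temporal Disjoint Sampling: pick target timestamps that are never
    visible to the encoder, forcing prediction across time."""
    perm = rng.permutation(len(timestamps))
    return perm[n_target:], perm[:n_target]  # (input indices, target indices)
```

A quick way to see the "relative" property: rotating a query at time 2 and a key at time 5 yields the same dot product as a query at time 12 and a key at time 15, since only the offset (3) matters.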
The architecture remains a standard transformer encoder‑decoder trained with masked token reconstruction across modalities (Sentinel‑1 SAR, Sentinel‑2 optical, DEM, land‑cover). Tokens can be raw pixel values or quantized symbols; the loss is cross‑entropy over masked tokens. Two model sizes are explored: a Tiny version (5.5 M parameters) and a Base version (85.5 M parameters). Both are initialized from publicly released TerraMind weights and then continually pre‑trained on the SSL4EO‑S12 v1.1 dataset, which contains four‑timestamp chips (2019‑2021) for 10 000 of the world’s most populated cities. Pre‑training uses AdamW, a cosine learning‑rate schedule, and mixed‑precision (float16) on four NVIDIA A100‑80GB GPUs (≈8 days for Tiny, 14 days for Base).
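The reconstruction objective described above reduces to a cross-entropy computed only over masked positions. A minimal sketch, with the model and tokenization abstracted away (the function name is illustrative, not from the TerraMind/TerraFlow code):

```python
# Minimal sketch of a masked-token cross-entropy loss; the surrounding
# model and tokenizer are assumed, not shown.
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Cross-entropy averaged over masked positions only.
    logits: (N, V) predicted scores over a V-token vocabulary,
    targets: (N,) ground-truth token ids,
    mask: (N,) bool, True where the token was hidden from the encoder."""
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()
```

Unmasked (visible) tokens contribute nothing to the loss, so the gradient signal comes entirely from reconstructing hidden content, across space, modality, and (with TDS) time.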
Four variants are produced (Tiny‑TDS, Tiny‑RS, Base‑TDS, Base‑RS) to assess the impact of the temporal masking regime. The authors evaluate on the GEO‑Bench‑2 benchmark, which includes four temporally‑rich downstream tasks: Kuro Siwo flood mapping (T=3 SAR timestamps), PASTIS crop‑type segmentation (T=7 Sentinel‑2 timestamps), BioMassters forest biomass regression (T=7 multi‑sensor series), and DynamicEarthNet land‑use change (T=6 PlanetScope + static Sentinel‑2). Hyper‑parameter search is limited to 16 trials per model‑dataset pair, followed by five‑fold re‑training with different random seeds. Standard augmentations (horizontal/vertical flips) are applied consistently across all modalities and timestamps.
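The "consistent across all modalities and timestamps" detail matters: each flip decision must be sampled once per chip and reused everywhere, or spatial alignment between sensors and time steps is destroyed. A small sketch of this, assuming arrays of shape (T, H, W, C) keyed by modality (the function name is hypothetical):

```python
# Sketch of chip-level augmentation: one flip decision shared across
# every (modality, timestamp) array so spatial alignment is preserved.
import numpy as np

def consistent_flips(sample, rng):
    """sample: dict modality -> array of shape (T, H, W, C)."""
    flip_h = rng.random() < 0.5  # horizontal flip (width axis)
    flip_v = rng.random() < 0.5  # vertical flip (height axis)
    out = {}
    for name, x in sample.items():
        if flip_h:
            x = x[..., ::-1, :]      # flip W (second-to-last axis)
        if flip_v:
            x = x[..., ::-1, :, :]   # flip H (third-to-last axis)
        out[name] = x
    return out
```

Sampling the flips outside the per-modality loop is the whole point; moving `rng.random()` inside the loop would decorrelate modalities and corrupt the cross-modal supervision signal.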
Across all GEO‑Bench‑2 tasks, TerraFlow consistently outperforms the original TerraMind and other state‑of‑the‑art EO foundation models. The TDS‑based variants achieve an average 8–12 % gain over Random Sampling, demonstrating that explicit temporal masking is beneficial. Notably, the Base‑TDS model yields the best results on both the segmentation (mIoU) and regression (RMSE) metrics.
Beyond standard benchmarks, the authors test TerraFlow on a more demanding forecasting scenario: predicting pre‑event disaster risk maps (flood and wildfire) using only pre‑event EO context. While many existing models collapse on this task, TerraFlow attains up to a 50 % increase in F1 score and a 24 % reduction in Brier score relative to competitors, highlighting the practical value of learned temporal dynamics for early warning applications.
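For readers unfamiliar with the two metrics quoted above: F1 is the harmonic mean of precision and recall on the binarized risk map, and the Brier score is the mean squared error between the predicted event probability and the observed 0/1 outcome (lower is better). A minimal per-pixel sketch:

```python
# Per-pixel metric sketches for binary risk maps; in practice a
# library implementation (e.g. scikit-learn) would be used instead.
import numpy as np

def brier_score(prob, outcome):
    """prob: predicted risk in [0, 1]; outcome: observed 0/1 event map.
    Lower is better; 0 is a perfect, fully confident forecast."""
    prob = np.asarray(prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return np.mean((prob - outcome) ** 2)

def f1_score(pred, truth):
    """Binary F1 from a thresholded prediction mask and a truth mask."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Note the direction of each quoted gain: the 50 % figure is an *increase* in F1, while the 24 % figure is a *reduction* in Brier score.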
The paper also discusses limitations. The pre‑training data only provides up to four timestamps per location, so the model’s ability to handle much longer or higher‑frequency time series remains untested. RoPE captures relative timing but does not encode absolute seasonal cues, suggesting that additional metadata (e.g., month embeddings) could further improve performance on tasks with strong seasonal signals. Finally, while the risk‑map experiments are promising, they are exploratory; integrating physical hydrological models or uncertainty quantification would be necessary for operational deployment.
In summary, TerraFlow delivers a compelling solution for multimodal, multitemporal representation learning in Earth observation. By marrying rotary temporal embeddings with a disjoint sampling mask, it achieves robust temporal reasoning without sacrificing the strong cross‑modal spatial features learned by large‑scale self‑supervised pre‑training. The results on GEO‑Bench‑2 and disaster risk prediction underscore the importance of temporal pre‑training for high‑impact EO applications, and the framework opens avenues for future work on longer sequences, additional sensor modalities, and hybrid physics‑ML systems.