StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions
Managing natural resources and mitigating risks from floods, droughts, wildfires, and landslides require models that can accurately predict climate-driven land-surface responses. Traditional models often struggle with spatial generalization because they are trained or calibrated on limited observations and can degrade under concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute and are not designed for dynamic land-surface prediction tasks. We introduce StefaLand, a generative spatiotemporal Earth representation learning model centered on learning cross-domain interactions to suppress overfitting. Compared with previous state-of-the-art methods, StefaLand demonstrates especially strong spatial generalization on five datasets spanning four important tasks: streamflow, soil moisture, soil composition, and landslides. The domain-inspired design choices include a location-aware masked autoencoder that fuses static and time-series inputs, an attribute-based rather than image-based representation that drastically reduces compute demands, and residual fine-tuning adapters that strengthen knowledge transfer across tasks. StefaLand can be pretrained and fine-tuned on commonly available academic compute resources, yet consistently outperforms state-of-the-art supervised learning baselines, fine-tuned vision foundation models, and commercially available embeddings, highlighting the previously overlooked value of cross-domain interactions and offering practical value for data-poor regions of the world.
💡 Research Summary
StefaLand is an efficient, attribute‑centric foundation model designed for dynamic land‑surface prediction tasks such as streamflow, soil moisture, soil composition, and landslide susceptibility. Unlike existing vision‑based Earth observation models that rely on massive satellite imagery and require extensive compute, StefaLand operates on a curated set of static landscape attributes (elevation, soil texture, depth, geology, vegetation indices, etc.) and dynamic meteorological forcings (precipitation, temperature, radiation).
The core architecture is a transformer‑based masked autoencoder inspired by BERT. During pre‑training, the model receives a joint sequence of static and time‑varying variables from roughly 8,600 basins spanning 40 years. A novel Cross‑Variable Group Masking (CVGM) strategy groups physically or statistically related variables (e.g., sand, silt, clay fractions) and masks entire groups simultaneously. The model must reconstruct the masked group from the remaining tokens, forcing it to learn cross‑domain dependencies rather than relying on simple correlations. Reconstruction loss is normalized by variable‑wise standard deviations to handle differing scales.
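The group masking and scale-normalized reconstruction loss described above can be sketched in a few lines. The variable names, group definitions, masking probability, and zero-token replacement below are illustrative assumptions for exposition, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 4 basins x 6 variables (columns); values are standardized inputs.
variables = ["sand", "silt", "clay", "precip", "temp", "radiation"]
x = rng.normal(size=(4, len(variables)))

# Physically related variables are grouped and masked together (assumed grouping).
groups = {
    "soil_texture": [0, 1, 2],   # sand, silt, clay fractions
    "climate": [3, 4, 5],        # meteorological forcings
}

def cvgm_mask(x, groups, rng, p=0.5):
    """Mask entire variable groups with probability p; return the masked
    input and a boolean mask marking the positions to reconstruct."""
    mask = np.zeros(x.shape, dtype=bool)
    for cols in groups.values():
        if rng.random() < p:
            mask[:, cols] = True
    x_masked = np.where(mask, 0.0, x)  # masked tokens replaced by zeros (assumption)
    return x_masked, mask

def scaled_loss(pred, target, mask):
    """Reconstruction MSE normalized by per-variable std, as described."""
    std = target.std(axis=0, keepdims=True) + 1e-8
    err = ((pred - target) / std) ** 2
    return err[mask].mean() if mask.any() else 0.0

x_masked, mask = cvgm_mask(x, groups, rng)
loss = scaled_loss(np.zeros_like(x), x, mask)  # trivial "predict zero" baseline
```

Masking all three texture fractions at once prevents the model from trivially reconstructing sand from silt and clay; the reconstruction signal must instead come from other domains such as climate or topography, which is exactly the cross-domain dependency the paper targets.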
Pre‑training consumes about 720 V100 GPU‑hours, far less than the multi‑week runs on petaflop‑scale clusters typical of large vision foundation models, making StefaLand accessible to academic labs with modest resources.
For downstream tasks, StefaLand adopts Residual Fine‑Tuning Adapters. The frozen encoder produces contextual embeddings (E_t) for each time step. A shallow convolution‑plus‑linear block processes raw forcings (x_t) into a residual signal (r_t). The sum (E_t + r_t) feeds an LSTM decoder for temporally explicit tasks (streamflow, soil moisture). For static or spatial tasks (soil property inference, landslide susceptibility), the decoder is replaced by task‑specific MLP or 2‑D CNN heads, while the encoder remains frozen. This residual pathway preserves the global spatial knowledge learned during pre‑training and allows limited task‑specific parameters to adapt to local dynamics, reducing over‑fitting risk.
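In terms of shapes, the residual pathway amounts to the following minimal sketch. Layer sizes, the kernel width, and the depthwise-conv-plus-ReLU details are assumptions for illustration; the frozen encoder and the LSTM decoder are stubbed out with random data:

```python
import numpy as np

rng = np.random.default_rng(0)

T, F, D = 16, 5, 8   # time steps, raw forcing channels, embedding dim (assumed sizes)

E = rng.normal(size=(T, D))      # frozen encoder embeddings E_t (never updated)
x = rng.normal(size=(T, F))      # raw meteorological forcings x_t

# Trainable adapter parameters: a depthwise temporal conv + linear projection.
k = 3                                     # conv kernel width (assumed)
W_conv = rng.normal(size=(F, k)) * 0.1
W_lin = rng.normal(size=(F, D)) * 0.1

def adapter(x, W_conv, W_lin):
    """Shallow conv-plus-linear block producing the residual signal r_t."""
    T, F = x.shape
    pad = np.pad(x, ((1, 1), (0, 0)))       # 'same' padding along time
    conv = np.empty_like(x)
    for f in range(F):                       # depthwise conv, one filter per channel
        conv[:, f] = np.convolve(pad[:, f], W_conv[f], mode="valid")
    return np.maximum(conv, 0.0) @ W_lin     # ReLU then linear -> (T, D)

r = adapter(x, W_conv, W_lin)
h = E + r   # residual sum E_t + r_t, fed to the trainable LSTM decoder head
```

Only the adapter and the decoder head carry trainable parameters here: because E_t stays fixed, the pretrained spatial representation is preserved while the small residual path r_t absorbs local dynamics, which is the over-fitting control the paragraph describes.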
The authors evaluate StefaLand on five datasets covering four distinct tasks: (1) US CAMELS streamflow, (2) global Caravan streamflow, (3) global in‑situ soil moisture, (4) global soil property maps, and (5) Oregon landslide susceptibility. Baselines include traditional supervised LSTMs (LSTM‑SL), transformer variants (Informer, Reformer, DLinear), and state‑of‑the‑art vision foundation models such as AlphaEarth. All experiments use temporal validation splits for hyper‑parameter tuning and spatial hold‑out testing to assess generalization to unseen basins and regions.
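A spatial hold-out differs from the usual temporal split in that entire basins, not time windows, are withheld from training. A minimal sketch, with made-up basin IDs and an assumed 20% hold-out fraction:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical basin identifiers; the paper's actual partitioning may differ.
basin_ids = np.array([f"basin_{i:03d}" for i in range(100)])
perm = rng.permutation(len(basin_ids))

n_test = len(basin_ids) // 5                  # hold out 20% of basins spatially
test_basins = set(basin_ids[perm[:n_test]])
train_basins = set(basin_ids[perm[n_test:]])

# No basin appears in both sets, so test skill measures generalization to
# entirely unseen locations rather than to unseen time periods.
assert train_basins.isdisjoint(test_basins)
```

Under this protocol, a model that merely memorizes per-basin behavior scores poorly, which is why the spatial-generalization gains reported below are the most meaningful comparison.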
Results show that StefaLand consistently outperforms all baselines. Across tasks, it reduces RMSE by 12‑18 % and improves R² by 0.05‑0.12 relative to the strongest supervised baselines. The most striking advantage appears in spatial generalization: when applied to data‑scarce regions (e.g., Africa, South America, parts of Asia) that were not represented in the pre‑training set, StefaLand’s performance degrades far less than competing models. Moreover, its parameter count (~30 M) and compute footprint are an order of magnitude lower than image‑based foundation models, yet it delivers comparable or superior predictive skill.
Key contributions are: (1) an attribute‑based foundation model that dramatically lowers computational barriers; (2) the CVGM masking scheme that explicitly forces learning of cross‑domain interactions among climate, soil, topography, and vegetation; (3) residual adapters that enable efficient fine‑tuning while preserving pretrained spatial representations; and (4) extensive empirical validation demonstrating robust transfer across tasks, scales, and regions.
StefaLand therefore offers a practical, scalable solution for the geoscience community, especially for regions with limited in‑situ observations. Its design also leaves room for future multimodal extensions that could integrate satellite imagery alongside attribute data, further enriching the learned representations for even broader Earth system modeling applications.