IceBench-S2S: A Benchmark of Deep Learning for Challenging Subseasonal-to-Seasonal Daily Arctic Sea Ice Forecasting in Deep Latent Space
Arctic sea ice plays a critical role in regulating Earth’s climate system, significantly influencing polar ecological stability and human activities in coastal regions. Recent advances in artificial intelligence have facilitated the development of skillful pan-Arctic sea ice forecasting systems, where data-driven approaches show tremendous potential to outperform conventional physics-based numerical models in accuracy, computational efficiency, and forecasting lead time. Despite the latest progress made by deep learning (DL) forecasting models, their skillful forecasting lead times are mostly confined to the daily subseasonal scale and to monthly averaged values for up to six months, which drastically hinders their deployment in real-world applications such as maritime route planning for Arctic transportation and scientific investigation. Extending daily forecasts from the subseasonal to the seasonal (S2S) scale is scientifically crucial for operational applications. To bridge the gap between the forecasting lead times of current DL models and the operationally significant daily S2S scale, we introduce IceBench-S2S, the first comprehensive benchmark for evaluating DL approaches on the challenging task of forecasting Arctic sea ice concentration over successive 180-day periods. The benchmark proposes a generalized framework that first compresses the spatial features of daily sea ice data into a deep latent space; the temporally concatenated deep features are then modeled by DL-based forecasting backbones to predict sea ice variation at the S2S scale. IceBench-S2S provides a unified training and evaluation pipeline for different backbones, along with practical guidance for model selection in polar environmental monitoring tasks.
💡 Research Summary
The paper introduces IceBench‑S2S, the first comprehensive benchmark dedicated to deep‑learning (DL) approaches for daily Arctic sea‑ice concentration (SIC) forecasting over a 180‑day horizon, i.e., a subseasonal‑to‑seasonal (S2S) task. Recognizing that existing DL models excel only up to about 90 days or on monthly averages, the authors propose a two‑stage framework: (1) spatial compression of high‑resolution daily SIC fields into a compact latent representation, and (2) temporal modeling of the resulting latent sequences with a variety of DL time‑series backbones.
The dataset is built from the NSIDC Climate Data Record (CDR) version 4 (G02202) spanning 1978‑10‑25 to 2024‑06‑30, comprising 16,686 daily grids of size 448 × 304 (≈25 km resolution). The data are split into training (1979‑2015), validation (2016‑2019), and testing (2020‑2024). In addition to CDR, the benchmark includes two alternative retrieval algorithms—NASA Team (NT) and Bootstrap (BT)—to test the robustness of the spatial encoder across different sensor‑processing pipelines.
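The year-based split described above can be sketched in a few lines. This is an illustrative helper, not the benchmark's actual loading code; `split_by_year` is a hypothetical name, and for simplicity the late-1978 tail of the record is grouped with the training years:

```python
from datetime import date, timedelta

def split_by_year(start, end):
    """Assign each calendar day between start and end (inclusive) to
    train / val / test following the benchmark's year-based split:
    through 2015 -> train, 2016-2019 -> val, 2020 onward -> test."""
    splits = {"train": [], "val": [], "test": []}
    d = start
    while d <= end:
        if d.year <= 2015:
            splits["train"].append(d)
        elif d.year <= 2019:
            splits["val"].append(d)
        else:
            splits["test"].append(d)
        d += timedelta(days=1)
    return splits

splits = split_by_year(date(1978, 10, 25), date(2024, 6, 30))
```

Note that the count of calendar days exceeds the 16,686 available grids, since the early SMMR period contains gaps; a real loader would intersect these dates with the files actually present.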
Spatial compression is performed by a Swin‑Transformer‑based autoencoder (Swin‑AE). The encoder first applies a 2 × 2 convolution, then passes the feature map through four Swin‑Transformer blocks with shifted‑window attention, progressively down‑sampling the spatial dimensions while doubling channel depth. After flattening, a multilayer perceptron projects the representation to a 1024‑dimensional latent vector zₜ. The decoder mirrors the encoder to reconstruct the original SIC field. This design yields a compression ratio of 133:1, with reconstruction metrics (MSE ≈ 0.003, PSNR ≈ 24.5 dB, SSIM ≈ 0.91) that preserve over 90 % spatial correlation. Experiments comparing latent dimensions of 4096 versus 1024 show only marginal gains for the larger space, confirming that 1024 dimensions are sufficient for high‑fidelity reconstruction.
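As a quick sanity check on the reported compression ratio, dividing the 448 × 304 grid by a 1024-dimensional latent vector gives exactly 133:1 (and a 4096-dimensional latent would give about 33:1):

```python
# CDR polar-stereographic grid (~25 km) and the Swin-AE bottleneck size
H, W = 448, 304
latent_dim = 1024

ratio = (H * W) / latent_dim   # 136192 / 1024
print(f"compression ratio = {ratio:.0f}:1")
```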
For temporal forecasting, the latent sequences {zₜ} are fed into four families of DL models: (a) Transformer‑based architectures, (b) recurrent networks (LSTM/GRU), (c) Temporal Convolutional Networks, and (d) classic autoregressive (AR) models. All backbones share a common hyper‑parameter search space and training protocol, ensuring fair comparison. Forecast horizons are set to 7, 15, 30, and 180 days, and performance is evaluated using MSE, MAE, NSE, PSNR, SSIM, and correlation metrics after decoding back to the spatial domain.
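The pixel-wise skill scores can be illustrated with a minimal NumPy sketch (function names here are hypothetical, not the benchmark's API). NSE is 1 for a perfect forecast and drops to 0 for a forecast no better than the observed mean field:

```python
import numpy as np

def mse(pred, obs):
    return np.mean((pred - obs) ** 2)

def mae(pred, obs):
    return np.mean(np.abs(pred - obs))

def nse(pred, obs):
    """Nash-Sutcliffe efficiency: 1 = perfect,
    0 = no better than predicting the observed mean."""
    return 1.0 - np.sum((pred - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Toy stand-ins for a decoded SIC field and a slightly noisy forecast
rng = np.random.default_rng(0)
obs = rng.random((448, 304))
pred = obs + 0.01 * rng.standard_normal(obs.shape)

scores = {"MSE": mse(pred, obs), "MAE": mae(pred, obs), "NSE": nse(pred, obs)}
```

In the benchmark these metrics are computed after decoding the latent forecasts back to the spatial domain, alongside PSNR, SSIM, and correlation.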
Results demonstrate that across all horizons, especially the challenging 180‑day lead time, DL backbones outperform traditional baselines such as anomaly persistence and climatology. The best models achieve statistically significant improvements in both skill scores and the ability to capture extreme events, exemplified by accurate prediction of the September 2016 and September 2022 minimum sea‑ice extents. Moreover, the Swin‑AE trained on CDR generalizes well to NT and BT datasets without fine‑tuning, indicating that the latent encoder captures physical patterns rather than sensor‑specific artifacts.
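The anomaly-persistence baseline referenced above is simple to state: hold the last observed anomaly fixed and add it to the day-of-year climatology at the target date. A minimal sketch, assuming a precomputed `(366, H, W)` daily climatology and zero-based day-of-year indexing (not the benchmark's actual implementation):

```python
import numpy as np

def anomaly_persistence(obs_t, doy_t, lead, climatology):
    """Forecast at lead days ahead as climatology(target day) plus the
    anomaly observed on day doy_t (zero-based day-of-year index)."""
    anomaly = obs_t - climatology[doy_t]
    target_doy = (doy_t + lead) % climatology.shape[0]
    return climatology[target_doy] + anomaly

# Toy usage: with a zero climatology, the forecast is just the persisted anomaly
clim = np.zeros((366, 2, 2))
obs_t = np.full((2, 2), 0.5)
forecast = anomaly_persistence(obs_t, 10, 180, clim)
```

The pure climatology baseline corresponds to dropping the anomaly term entirely.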
The benchmark also provides a suite of evaluation tools, including spatial reconstruction quality metrics, temporal skill scores, and robustness tests on extreme events. By standardizing data splits, preprocessing, and evaluation protocols, IceBench‑S2S offers a reproducible platform for the AI community to develop and test novel S2S forecasting models, while giving climate scientists a practical reference for operational sea‑ice prediction.
Future work suggested by the authors includes (i) incorporation of multimodal inputs such as atmospheric reanalysis, oceanic heat fluxes, and wind fields; (ii) probabilistic forecasting and uncertainty quantification; and (iii) deployment of the best-performing models in real‑time operational pipelines to support Arctic navigation, resource management, and climate research. In sum, IceBench‑S2S bridges a critical gap between short‑term DL sea‑ice forecasts and the operational need for reliable, long‑lead‑time daily predictions, fostering interdisciplinary advances at the intersection of machine learning and geoscience.