Comparing and Contrasting DLWP Backbones on Navier-Stokes and Atmospheric Dynamics
A large number of Deep Learning Weather Prediction (DLWP) architectures – based on various backbones, including U-Net, Transformer, Graph Neural Network, and Fourier Neural Operator (FNO) – have demonstrated their potential at forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures are most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes and real-world global weather dynamics. On synthetic data, we observe favorable performance of FNO, while on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged weather rollouts of up to 50 years, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. The code is available at https://github.com/amazon-science/dlwp-benchmark.
💡 Research Summary
This paper presents a systematic benchmark of the most prominent deep‑learning weather prediction (DLWP) backbones—U‑Net, ConvLSTM, Swin‑Transformer, Graph Neural Networks (GNN), and Fourier Neural Operators (FNO)—under tightly controlled experimental conditions. The authors evaluate the models on two distinct domains: synthetic two‑dimensional incompressible Navier‑Stokes simulations and real‑world global atmospheric data from the WeatherBench dataset. By fixing the number of trainable parameters, training schedule, optimizer, and input variable set, the study isolates the effect of the architectural backbone on forecast skill, computational efficiency, and physical fidelity.
Synthetic Navier‑Stokes experiments
The authors generate 64 × 64 periodic‑domain Navier‑Stokes data at two Reynolds numbers (Re = 10³ for relatively laminar flow and Re = 10⁴ for more turbulent flow). Two data‑size regimes are considered (1 k and 10 k training samples). Each model receives a history of ten frames and predicts the future autoregressively, except for the (T)FNO variants, which output the whole sequence in one forward pass. Parameter counts are varied from 5 × 10⁴ to 1.28 × 10⁸. Across all turbulence levels and data‑size settings, the Tensor‑Factorized FNO (TFNO) consistently yields the lowest root‑mean‑square error (RMSE), followed by Swin‑Transformer, U‑Net, and ConvLSTM when the model size is limited to roughly one million parameters. The ranking remains stable when the training set is enlarged tenfold, indicating that the superiority of TFNO is not merely a data‑efficiency artifact but stems from its spectral‑domain operations, which capture the underlying PDE dynamics more directly than purely spatial convolutions.
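The autoregressive protocol described above can be sketched as a simple feedback loop: the model predicts one frame, which is appended to the conditioning window for the next step. This is an illustrative sketch only; the `model` callable, array shapes, and `context` length of ten are assumptions matching the described setup, not the authors' implementation.

```python
import numpy as np

def autoregressive_rollout(model, history, n_steps, context=10):
    """Roll a one-step model forward, feeding each prediction back as input.

    history: array of shape (T, H, W) holding the T conditioning frames
    (T = 10 in the setup above). `model` maps a (context, H, W) stack of
    recent frames to the next (H, W) frame.
    """
    frames = [history[t] for t in range(history.shape[0])]
    preds = []
    for _ in range(n_steps):
        x = np.stack(frames[-context:], axis=0)  # most recent frames as input
        y = model(x)                             # next-frame prediction
        preds.append(y)
        frames.append(y)                         # feed the prediction back in
    return np.stack(preds, axis=0)               # (n_steps, H, W)
```

Errors compound through this feedback, which is why long rollouts stress a model far more than single-step evaluation; the (T)FNO variants sidestep the loop by emitting the whole target sequence in one forward pass.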
Real‑world WeatherBench experiments
For the atmospheric benchmark, the authors select eight core prognostic variables (2‑m temperature, 850 hPa temperature, geopotential height at 1000, 700, 500, 300 hPa, and 10‑m zonal/meridional wind components) together with four static fields (latitude, longitude, topography, land‑sea mask) and solar insolation as a forcing input. The data are downsampled to 5.625° resolution (64 × 32 grid) with a 6‑hour cadence. All models are trained with the same AdamW optimizer, cosine learning‑rate schedule, and 200 epochs, while sweeping the parameter budget across the same decade‑spanning range used in the synthetic experiments. Forecast skill is assessed at 3, 5, and 7‑day lead times using RMSE and Anomaly Correlation Coefficient (ACC).
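On a latitude‑longitude grid, both metrics are conventionally cosine‑latitude weighted so that the shrinking grid cells near the poles are not over‑counted. A minimal sketch of both metrics follows, using the common WeatherBench convention; the function names and normalization are illustrative, not the authors' exact evaluation code.

```python
import numpy as np

def lat_weights(lats_deg):
    """Cosine-latitude area weights, normalized to mean 1."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE over a (lat, lon) field."""
    w = lat_weights(lats_deg)[:, None]  # broadcast over longitude
    return np.sqrt(np.mean(w * (forecast - truth) ** 2))

def weighted_acc(forecast, truth, climatology, lats_deg):
    """Anomaly correlation coefficient: latitude-weighted correlation of
    forecast and observed anomalies about the climatological mean."""
    w = lat_weights(lats_deg)[:, None]
    fa = forecast - climatology  # forecast anomaly
    ta = truth - climatology     # observed anomaly
    num = np.sum(w * fa * ta)
    den = np.sqrt(np.sum(w * fa ** 2) * np.sum(w * ta ** 2))
    return num / den
```

A perfect forecast yields RMSE 0 and ACC 1; ACC values below roughly 0.6 are traditionally regarded as no longer synoptically useful.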
The short‑to‑mid‑range results reveal a surprising dominance of the recurrent ConvLSTM architecture, which achieves the lowest RMSE and highest ACC across most variables, especially temperature and low‑level wind. Swin‑Transformer follows closely, and the FourCastNet implementation based on the Adaptive FNO (AFNO) also performs competitively. Pure spatial backbones (U‑Net) and the vanilla FNO lag behind, suggesting that explicit temporal modeling is crucial for capturing the evolution of atmospheric fields over a few days.
Long‑range roll‑outs (up to 365 days and beyond) expose stark differences in numerical stability and physical realism. Models that operate directly on a spherical mesh—GraphCast (a hierarchical icosahedral MeshGraphNet) and the spherical variant of FNO (SFNO)—maintain bounded error growth, preserve kinetic‑energy spectra, and reproduce characteristic zonal jet patterns even after many months of autonomous prediction. In contrast, ConvLSTM, U‑Net, and Swin‑Transformer exhibit error accumulation that eventually leads to unphysical temperature spikes and loss of coherent wind structures. The authors also compare two geographic representations: the traditional latitude‑longitude grid and the HEALPix (HPX) mesh. The HPX‑based experiments confirm that reducing polar distortion improves the long‑term fidelity of spherical models.
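A common diagnostic for the physical realism of long rollouts is the kinetic‑energy spectrum: a stable model should keep energy distributed across zonal wavenumbers roughly as the verifying analysis does, rather than blurring out small scales or accumulating spurious high‑frequency energy. The sketch below computes a per‑zonal‑wavenumber spectrum via an FFT along longitude; it is an illustrative diagnostic under that assumption, not the paper's evaluation code.

```python
import numpy as np

def zonal_energy_spectrum(u, v):
    """Kinetic-energy spectrum per zonal wavenumber, averaged over latitude.

    u, v: (lat, lon) wind components. Returns an array indexed by zonal
    wavenumber k = 0 .. lon // 2.
    """
    n_lon = u.shape[1]
    uh = np.fft.rfft(u, axis=1) / n_lon  # zonal Fourier coefficients
    vh = np.fft.rfft(v, axis=1) / n_lon
    ke = 0.5 * (np.abs(uh) ** 2 + np.abs(vh) ** 2)  # energy per wavenumber
    return ke.mean(axis=0)                          # average over latitude
```

Comparing this curve for a year‑long rollout against the same curve for reanalysis data makes spectral blurring or energy pile‑up visible at a glance, which is the kind of failure the lat‑lon backbones exhibit and the spherical architectures avoid.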
From a computational standpoint, ConvLSTM is the most memory‑efficient and fastest to train for a given parameter budget (≈1 M parameters), making it attractive for operational settings with limited GPU resources. FNO‑based models, while more demanding in GPU memory, deliver higher accuracy per parameter, especially in the synthetic turbulence task. GraphCast occupies a middle ground: its message‑passing architecture incurs moderate memory usage but offers superior stability for climate‑scale simulations.
A notable observation is the saturation of performance with increasing model size. Beyond roughly 10 M parameters, further scaling yields diminishing returns in RMSE reduction for both synthetic and real‑world tasks. This suggests that the current dataset size, variable selection, and training schedule are insufficient to fully exploit larger capacities, pointing to a need for richer training data (e.g., longer historical records, higher vertical resolution) and possibly physics‑informed loss functions.
Conclusions and future directions
The study concludes that there is no single “best” backbone for all weather‑prediction scenarios. For operational short‑term forecasts (≤14 days), ConvLSTM provides the best trade‑off between accuracy, speed, and resource consumption, with Swin‑Transformer as a strong alternative. For climate‑scale or very long roll‑outs, spherical‑aware architectures such as GraphCast and SFNO are preferable due to their numerical stability and adherence to physical constraints. The authors recommend future work to (1) expand training datasets and incorporate multi‑scale physics, (2) embed conservation laws directly into loss functions or network architectures, and (3) explore hybrid designs that combine the temporal modeling strengths of recurrent networks with the spectral efficiency of FNOs. Such advances could break the observed performance ceiling and enable DLWP models to rival traditional numerical weather prediction across a broader spectrum of forecasting horizons.