AI weather prediction ensembles that inject latent noise and are optimized with the continuous ranked probability score (CRPS) have produced accurate, well-calibrated predictions at far lower computational cost than diffusion-based methods. However, current CRPS ensemble approaches vary in their training strategies and noise injection mechanisms, with most injecting noise globally throughout the network via conditional normalization. This structure increases training expense and limits the physical interpretability of the stochastic perturbations. We introduce Stochastic Decomposition Layers (SDL) for converting deterministic machine learning weather models into probabilistic ensemble systems. Adapted from StyleGAN's hierarchical noise injection, SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling. When applied to WXFormer via transfer learning, SDL requires less than 2% of the computational cost needed to train the baseline model. Each ensemble member is generated from a compact latent tensor (approximately 5 MB), enabling perfect reproducibility and post-inference spread adjustment through latent rescaling. Evaluation on 2022 ERA5 reanalysis shows ensembles with spread-skill ratios approaching unity and rank histograms that progressively flatten toward uniformity through medium-range forecasts, achieving calibration competitive with operational IFS-ENS. Multi-scale experiments reveal hierarchical uncertainty: coarse layers modulate synoptic patterns while fine layers control mesoscale variability. The explicit latent parameterization provides interpretable uncertainty quantification for operational forecasting and climate applications.
National Meteorological and Hydrological Services have increasingly incorporated ensemble forecasting methods to provide probabilistic predictions and quantify forecast uncertainty over the past 3 decades [1,2]. Traditional ensemble numerical weather prediction (NWP) approaches perturb initial conditions and/or use perturbed model physics parameterizations [3][4][5][6], but these methods are computationally expensive, requiring multiple full model integrations. Operational NWP ensembles have a relatively small number of members, usually on the order of 30-50, and may run at a lower spatial resolution compared with deterministic flagship members [7]. Recent advances in machine learning for weather prediction [8][9][10][11][12][13] have demonstrated skill comparable to traditional numerical weather prediction models, yet extending these capabilities to probabilistic forecasting while respecting chaotic dynamics remains challenging.
The theoretical foundation for ensemble prediction stems from Lorenz's discovery that the atmosphere is a chaotic system where small initial condition errors grow exponentially [14]. Quantitative estimates from early global circulation models established a practical deterministic predictability limit of roughly two weeks [15], later confirmed as a fundamental property of atmospheric dynamics [16][17][18]. Epstein and coauthors [19] proposed a stochastic-dynamic formulation of the primitive equations that predicted the mean and variance of each state variable but was viewed as computationally intractable at the time. Monte Carlo ensemble approximations [20] and optimization with least squares techniques [21] demonstrated that calibrated NWP ensembles could be produced that improve on deterministic forecasts. In general, ensemble forecasting quantifies the fundamental predictability limitations of chaotic dynamics by sampling the phase space of possible atmospheric states, representing forecast uncertainty through trajectory divergence on the climate system's attractor [2,22].
Early ML weather models employed deterministic architectures trained with mean squared error (MSE) loss. When combined with multi-step training, these models produced overly smooth forecasts that underrepresented atmospheric variability. Generative modeling approaches, particularly diffusion-based weather models [23][24][25][26], address this limitation by iteratively denoising random fields to produce realistic ensemble members. However, diffusion-based weather models [26] require learning full denoising trajectories across hundreds of timesteps during training and at least 20-50 iterative refinement steps per forecast at inference, imposing substantial computational burden on operational systems requiring rapid generation of large ensembles (50+ members).
Recent work has explored parameter-efficient alternatives that avoid iterative denoising by injecting stochasticity through various architectural mechanisms and training directly on the Continuous Ranked Probability Score (CRPS) to achieve calibrated ensembles with computational costs proportional to ensemble size [27][28][29][30][31]. Foundation models demonstrate strong transfer learning capabilities across weather and climate tasks [32][33][34]. Alternative approaches to ensemble generation include conditional diffusion models with explicit uncertainty quantification [35], latent diffusion with knowledge alignment mechanisms for precipitation nowcasting [36], and diffusion-based ensemble emulation methods that generate large ensembles conditioned on few operational forecast members [37]. From a dynamical systems perspective, these approaches sample the forecast distribution but provide no mechanism to systematically navigate the sampled trajectories or regenerate specific ensemble members. In addition, many of these approaches inject noise globally throughout the network without explicit hierarchical decomposition, which limits the spatial and scale-specific interpretability of the stochastic perturbations.
Operational ensemble forecasting requires not only probabilistic skill but also interpretable control over uncertainty structure. Forecasters benefit from the ability to systematically explore alternative scenarios within the ensemble distribution, for example by amplifying or suppressing specific synoptic patterns to understand their downstream impacts on regional weather. Similarly, ensemble post-processing and recalibration workflows require adjusting spread without regenerating forecasts, particularly when computational constraints limit ensemble size or when verification reveals systematic over- or underdispersion [38,39]. The multi-scale nature of atmospheric predictability, where large-scale synoptic systems exhibit different error growth rates than mesoscale features [40], motivates hierarchical uncertainty decomposition. Injecting perturbations at distinct spatial scales spanning the synoptic/meso-α boundary through the meso-β range while maintaining meteorological consistency enables physically interpretable ensemble spread that reflects the scale-dependent structure of forecast uncertainty. Current ML ensemble methods lack explicit architectural mechanisms for scale-dependent uncertainty control and provide limited capability to adjust or explore the generated forecast distribution after inference.
In this work, we introduce Stochastic Decomposition Layers (SDL) that separate latent-driven modulation from per-pixel stochasticity, enabling controlled exploration of forecast uncertainty. We build upon the WXFormer architecture [13,41], incorporating SDL at multiple hierarchical scales to enable two critical capabilities: (1) calibrated probabilistic forecasts competitive with physics-based ensembles, achieved through CRPS-based training with computational efficiency comparable to other computationally efficient methods, and (2) post-inference exploration of forecast uncertainty by systematically perturbing latent tensor and noise components. While noise scaling is theoretically possible in any stochastic model, SDL learns a symmetric latent geometry that provides calibrated control over ensemble spread without regeneration. Each ensemble member is determined by a compact latent tensor, enabling perfect reproducibility and spread adjustment after inference. This provides operational control over forecast uncertainty without regenerating ensemble members.
The WXFormer model [13,41,42] employs a hierarchical encoder-decoder structure for weather prediction (Figure 1). The encoder processes input features on progressively coarser spatial grids through a series of cross embedding layers (CEL) and transformer blocks, enabling efficient multiscale feature extraction. The decoder uses U-Net-style skip connections to integrate high-resolution spatial information from the encoder with globally-informed features from the deepest layer, producing weather forecasts at the original input resolution. This multi-scale processing provides injection points for stochasticity across different atmospheric resolutions, from coarse large-scale structures in deeper layers to fine-grained spatial details in shallower layers. By strategically placing noise injection layers at each level of this hierarchy (Figure 1), we enable simultaneous perturbation of large-scale circulation patterns and small-scale atmospheric features, creating ensemble members that span a physically coherent range of forecast solutions.
Recent approaches to ensemble generation in machine learning weather models typically apply noise globally without explicit scale separation. The StyleGAN architecture [43,44] demonstrated that hierarchical, decomposed noise injection enables both diversity and fine-grained control in generative models by separating global style variations, which influence composition and large-scale structure, from local stochastic details, which influence texture and fine-scale features. Style transfer techniques have been successfully adapted to atmospheric science applications [45], showing that neural style decomposition can capture complex physical relationships in atmospheric data.
Table 1: Components of the SDL layer corresponding to Equation 1 and Figure 1b.
These insights motivate our adaptation of hierarchical noise decomposition for atmospheric ensemble forecasting. We accomplish this by injecting controlled stochastic perturbations into feature maps using three multiplicative components within SDL. The output feature map F_out ∈ R^(B×C×H×W) is computed as

    F_out = F_in + α (R_Gauss ⊙ S ⊙ M),    (1)

where F_in is the input feature map, R_Gauss ∼ N(0, I) is per-pixel Gaussian noise sampled at full spatial resolution, α is a non-trainable scaling parameter, S ∈ R^(B×C×1×1) is a channel-wise tensor derived from a latent tensor Z_ℓ ∼ N(0, I), sampled independently at each SDL layer, via a learned linear transformation, and M ∈ R^(1×C×1×1) is a learnable channel-wise modulation parameter. The operator ⊙ denotes element-wise multiplication with broadcasting across spatial dimensions.
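For concreteness, the following is a minimal PyTorch sketch of one SDL layer under the additive form of Equation 1. The module and argument names (SDLayer, latent_dim) are illustrative, and the latent is simplified to a vector per sample, so this is a sketch of the mechanism rather than the miles-credit implementation.

```python
import torch
import torch.nn as nn

class SDLayer(nn.Module):
    """Stochastic Decomposition Layer (illustrative sketch).

    Applies F_out = F_in + alpha * (R ⊙ S ⊙ M), where
      R : per-pixel Gaussian noise, shape (B, C, H, W)
      S : channel-wise tensor from a latent Z via a learned linear map, shape (B, C, 1, 1)
      M : learnable channel-wise modulation, shape (1, C, 1, 1)
    """
    def __init__(self, channels: int, latent_dim: int, alpha: float = 0.235):
        super().__init__()
        self.alpha = alpha                                   # fixed, non-trainable scaling
        self.to_style = nn.Linear(latent_dim, channels)      # learned map producing S from Z
        self.modulation = nn.Parameter(torch.ones(1, channels, 1, 1))  # M

    def forward(self, feat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        noise = torch.randn(b, c, h, w, device=feat.device)  # R_Gauss at full resolution
        style = self.to_style(z).view(b, c, 1, 1)            # S, broadcast over H, W
        return feat + self.alpha * noise * style * self.modulation
```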
The SDL architecture supports injection at six hierarchical levels: three encoder positions and three decoder positions (Figure 1). To enable extensive experimentation within computational constraints, we implement decoder-only SDL injection in this work, with three independent latent vectors Z1, Z2, Z3 sampled at each decoder stage (Figure 1, yellow boxes). In the decoder, SDL layers are positioned after each upsampling operation but before skip-connection concatenation, ensuring that stochastic perturbations are applied to upsampled features before encoder information is incorporated. This ordering enables localized variations while maintaining large-scale consistency through the skip connections. We refer to architectural positions as “decoder levels” (Level 1 through Level 4, Table 2) and to SDL application locations as “injection points” controlled by parameters β1, β2, and β3, respectively, spanning from the synoptic/meso-α boundary (Level 1) through the meso-β scale (Level 4).
The decoder-only configuration requires storing three latent tensors per ensemble member (one per injection point), totaling approximately 5 MB in single precision (float32): Level 1 (24×36×71, ∼240 KB), Level 2 (48×72×71, ∼960 KB), and Level 3 (96×144×71, ∼3.7 MB). Full encoder-decoder injection would require six tensors (∼10 MB per member). These compact latent representations enable perfect reproducibility of any ensemble member and support post-inference spread adjustment through latent rescaling, all while requiring orders of magnitude less storage than archiving complete forecast fields.
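The quoted storage figures follow directly from the tensor shapes; a quick check in Python (float32 = 4 bytes per value, sizes reported in binary units):

```python
# Storage per ensemble member: three float32 latent tensors (shapes from the text).
shapes = {"Level 1": (24, 36, 71), "Level 2": (48, 72, 71), "Level 3": (96, 144, 71)}
total = 0
for name, (h, w, c) in shapes.items():
    nbytes = 4 * h * w * c                       # float32 = 4 bytes
    total += nbytes
    print(f"{name}: {nbytes / 1024:.0f} KiB")    # ~240, ~960, ~3830 KiB
print(f"Total: {total / 2**20:.2f} MiB per member")  # ~4.9 MiB, i.e. ~5 MB
```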
Each injection point employs a fixed scaling parameter α to accommodate feature magnitude variations across network depth. Appropriate α selection enables reasonable initial ensemble spread, accelerating convergence of learnable SDL components (M and the linear transformation generating S). The transfer learning framework allows freezing pre-trained WXFormer weights while training only SDL parameters, or joint fine-tuning for superior ensemble quality. We set α = 0.235 at all injection points based on preliminary experiments.
During inference, multiple ensemble members are generated by sampling different latent vectors {Z 1 , Z 2 , Z 3 } independently at each of the three decoder SDL injection points. The learned parameters M and the linear transformations producing S, combined with the fixed α scaling parameters, ensure variations remain physically plausible and appropriately scaled. This multi-scale injection produces comprehensive uncertainty quantification: deep-layer perturbations affect large-scale flow patterns (e.g., synoptic troughs and ridges), while shallow-layer perturbations modulate fine-scale features (e.g., mesoscale convective systems and surface temperature gradients). The ensemble spread reflects the model’s learned uncertainty structure rather than relying on ad-hoc perturbation strategies.
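A sketch of ensemble generation at inference under two stated assumptions: the model exposes a hypothetical `model(x, latents=[z1, z2, z3])` interface feeding the three decoder injection points, and fresh latents are drawn at every forecast step for every member (consistent with the per-member, per-forecast-time storage described later). This is illustrative, not the miles-credit API.

```python
import torch

@torch.no_grad()
def generate_ensemble(model, x0, latent_shapes, n_members=50, n_steps=40):
    """Autoregressive ensemble rollout (illustrative sketch).

    latent_shapes: shapes of the three per-level latent tensors.
    Returns a tensor of shape (n_members, n_steps, C, H, W).
    """
    members = []
    for _ in range(n_members):
        x, trajectory = x0.clone(), []
        for _ in range(n_steps):
            # one independent latent per decoder injection point, redrawn each step
            latents = [torch.randn(*s, device=x0.device) for s in latent_shapes]
            x = model(x, latents=latents)
            trajectory.append(x)
        members.append(torch.stack(trajectory))
    return torch.stack(members)
```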
This study uses ERA5 reanalysis [46] coarsened to approximately 1° horizontal resolution (a 192×288 grid), with upper-air variables represented on 16 vertical levels. This vertical discretization emphasizes lower-to-mid tropospheric representation where most weather-relevant processes occur while accepting coarser resolution at higher altitudes. The temporal split follows [13]: training (1979-2017), validation (2018-2019), and testing (2020-2022).
Complete details on variable selection, vertical level specification, and data processing are provided in [13,47].
We initialize from a pre-trained deterministic WXFormer model trained on ERA5 following [13].
The SDL components (α, M, and the linear transformations producing S) are inserted into the model, and all model weights are then fine-tuned jointly. This transfer learning approach leverages the pre-trained model's deterministic forecast skill as a strong initialization, requiring approximately 1.61% of the computational cost needed to train the baseline WXFormer from scratch (see ref [13] for baseline training details). Fine-tuning all weights produces superior ensemble quality compared to training only the SDL parameters while maintaining computational efficiency. See Appendix A for complete training details and computational cost analysis.
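A sketch of the two training options mentioned above, freezing the pre-trained backbone versus joint fine-tuning. The assumption that SDL parameters can be identified by an "sdl" substring in their names is purely illustrative.

```python
import torch

def configure_optimizer(model, mode="finetune_all", lr=1e-4):
    """Illustrative parameter-group setup for SDL transfer learning.

    mode="sdl_only"     : freeze pre-trained WXFormer weights, train only SDL parameters.
    mode="finetune_all" : joint fine-tuning of SDL and backbone (used in this work).
    """
    if mode == "sdl_only":
        for name, p in model.named_parameters():
            p.requires_grad = "sdl" in name        # hypothetical naming convention
        params = [p for p in model.parameters() if p.requires_grad]
    else:
        for p in model.parameters():
            p.requires_grad = True
        params = model.parameters()
    return torch.optim.AdamW(params, lr=lr)
```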
During training, we generate M = 10 ensemble members per input by sampling different latent tensors Z at each SDL injection point. The SDL parameters are optimized using the almost fair Continuous Ranked Probability Score (afCRPS) loss [27], defined for an ensemble {x_j}, j = 1, ..., M, and verification target y as

    afCRPS = (1/M) Σ_j |x_j − y| − (1 − ε) / (2M(M−1)) Σ_j Σ_k |x_j − x_k|,    (2)

where ε := (1 − α)/M and α ∈ (0, 1] is a hyperparameter. Setting α = 1 recovers the fair CRPS [48], which adjusts for finite ensemble size. However, the fair CRPS suffers from a degeneracy in which all members except one match the target, leaving the remaining member unconstrained. The almost fair formulation with α < 1 avoids this pathology while maintaining near-fair scoring properties.
We use α = 0.95 following [27]. The loss is averaged across all variables, vertical levels, and spatial locations with latitude weighting following [13].
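A compact PyTorch sketch of the afCRPS estimator in Equation 2, with ε = (1 − α)/M and optional latitude weighting; the tensor layout is an assumption for illustration.

```python
import torch

def almost_fair_crps(ensemble, target, alpha=0.95, lat_weights=None):
    """Almost-fair CRPS (sketch of Equation 2), averaged over all grid points.

    ensemble: (M, ..., H, W) forecast members; target: (..., H, W).
    With alpha = 1 this reduces to the fair CRPS.
    """
    m = ensemble.shape[0]
    eps = (1.0 - alpha) / m
    skill = (ensemble - target.unsqueeze(0)).abs().mean(dim=0)            # (1/M) Σ_j |x_j − y|
    pairwise = (ensemble.unsqueeze(0) - ensemble.unsqueeze(1)).abs()      # |x_j − x_k|
    spread = (1.0 - eps) * pairwise.sum(dim=(0, 1)) / (2 * m * (m - 1))   # ensemble term
    crps = skill - spread
    if lat_weights is not None:                                           # cosine-latitude weights
        crps = crps * lat_weights
    return crps.mean()
```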
Training uses the same 1979-2017 ERA5 data as the deterministic model, with validation performed on 2018-2019 and testing on 00Z initializations from 2022. This fine-tuning strategy is substantially more efficient than diffusion-based ensemble generation, which requires iterative denoising steps at inference and prohibitively expensive training on full trajectory sequences. Our approach incurs computational overhead only proportional to the desired ensemble size. Generating 50 members requires 50 forward passes with different Z samples but no architectural modifications or iterative refinement. The pretrained base model ensures that deterministic forecast quality is preserved without training from scratch, while the learned SDL parameters add controlled stochasticity appropriate to each hierarchical level.
We evaluate our SDL-enhanced WXFormer (SDL-WXFormer) ensemble against the ECMWF Integrated Forecasting System (IFS-ENS) on the 2022 test period using standard probabilistic metrics: Continuous Ranked Probability Score (CRPS), root mean square error (RMSE) of the ensemble mean, spread-skill ratio (SSR), and rank histograms. We also compare the underlying deterministic SDL-WXFormer against IFS HRES to establish baseline forecast skill. All forecasts are verified on a common 192×288 grid; details on operational system resolutions, regridding procedures, and pressure-level interpolation are provided in Appendix B. Figure 2 shows results for key atmospheric variables at selected pressure levels, while Figure 3 presents complementary results.

Deterministic RMSE comparisons (Figure 2, first row) show that the base deterministic WXFormer model delivers performance comparable to operational IFS HRES forecasts across all evaluated variables. For dynamical fields (Z500, U700, V850) and the thermodynamic field (T850), the model's RMSE curves closely track, and in some cases outperform, IFS HRES throughout the 10-day IFS forecast window. Beyond day 10, the model maintains stable skill through day 15, demonstrating that the ERA5-trained dynamics extend reliably into a longer forecast horizon than that provided by IFS in this comparison. Specific humidity (Q850) shows particularly strong deterministic skill, with consistently lower RMSE than IFS HRES across all lead times. This strong performance on moisture fields reflects effective learning of boundary layer processes and moisture transport despite the well-known challenges of humidity prediction in data-driven models [10,11].
Ensemble mean RMSE (Figures 2 and 3, second rows) follows similar patterns to both deterministic RMSE and CRPS, with SDL forecasts maintaining accuracy comparable to or better than IFS-ENS through medium-range timescales. The shared y-axis range between deterministic and ensemble RMSE rows reveals that ensemble averaging provides consistent error reduction across all variables, with ensemble mean performance closely matching single-member deterministic skill. The RMSE growth rates are consistent with ensemble mean forecast error accumulation in operational systems, suggesting that the pre-trained deterministic WXFormer base provides stable skill while SDL perturbations add calibrated uncertainty without degrading deterministic performance. Differences in skill across vertical levels reflect the model's 16-level discretization, which provides finest resolution in the weather-relevant lower troposphere while accepting coarser sampling aloft.
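For reference, minimal NumPy sketches of the two calibration diagnostics used here, assuming the standard definitions of the spread-skill ratio (ensemble spread divided by ensemble-mean RMSE) and the verification-rank histogram:

```python
import numpy as np

def spread_skill_ratio(ensemble, truth):
    """SSR = ensemble spread / ensemble-mean RMSE (standard definition assumed).
    ensemble: (M, N) members flattened over grid points; truth: (N,)."""
    mean = ensemble.mean(axis=0)
    rmse = np.sqrt(np.mean((mean - truth) ** 2))
    spread = np.sqrt(np.mean(ensemble.var(axis=0, ddof=1)))
    return spread / rmse

def rank_histogram(ensemble, truth):
    """Rank of the verifying analysis within the ensemble at each grid point (0..M)."""
    ranks = (ensemble < truth[None, :]).sum(axis=0)
    return np.bincount(ranks, minlength=ensemble.shape[0] + 1)
```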
Operational forecast centers require the ability to exactly reproduce disseminated ensemble members for verification, case studies, and audit trails. While stochastic perturbation methods can reproduce prior forecasts by storing random number generator (RNG) seeds, this approach only enables rigid replay. In contrast, SDL's latent tensor storage enables structural reproducibility, where the stored representation can be mathematically manipulated (e.g., rescaled or interpolated) post-inference, a capability unavailable with simple RNG seeds. SDL generates each ensemble member from three independent latent tensors Z1, Z2, Z3 ∼ N(0, I) sampled at each decoder injection point. Rather than storing the raw latent vectors, we archive the computed noise perturbations α (R_Gauss ⊙ S(Z_ℓ) ⊙ M) applied at each SDL layer during the forward pass. These stored perturbations are compact (∼5 MB per member per forecast time), substantially reducing archival requirements relative to complete atmospheric state fields (∼35 MB per member per time). Crucially, this representation supports post-inference manipulation through mathematical operations such as scaling or interpolation, a capability that distinguishes SDL from methods that discard noise after inference.
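A sketch of the archival and post-inference rescaling workflow described above; the file layout and the dictionary of per-level perturbations are assumptions for illustration, not the project's actual storage format.

```python
import torch

def save_member_perturbations(perturbations, path):
    """Archive the per-layer perturbations alpha * (R ⊙ S ⊙ M) for one member.
    `perturbations` is a dict {level_name: tensor}, ordered Level 1..3."""
    torch.save({k: v.to(torch.float32).cpu() for k, v in perturbations.items()}, path)

def rescale_member(path, beta=(1.0, 1.0, 1.0)):
    """Reload a member's stored perturbations and rescale each injection point by
    its beta, enabling spread adjustment without regenerating the forecast."""
    stored = torch.load(path)
    return {k: b * v for (k, v), b in zip(stored.items(), beta)}
```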
We investigate whether SDL learns scale-dependent uncertainty decomposition aligned with the architectural hierarchy described in Table 2 through systematic manipulation of the individual injection-point scaling parameters. Varying β1 (left column of Figure 6) while holding β2 = β3 = 1.0 reveals the coarsest layer's control over large-scale organization at the synoptic/meso-α boundary (Table 2, Level 1: 800 km effective resolution). Z500 anomalies exhibit large-amplitude, spatially coherent patterns with magnitudes exceeding ±600 m concentrated in mid-latitudes. Kinetic energy spectra (Figure 7b) confirm that β1 variations primarily modulate energy at synoptic scales and the large-scale end of the meso-α range (wavenumbers n = 2-10), with the inset panel revealing complex, wavenumber-dependent energy redistribution: at n = 1, β1 = -3 maximally increases KE; conversely, at the spectral peak (n ≈ 4), β1 = 3 generates the largest KE amplification. This wavenumber-dependent response demonstrates that coarse-layer perturbations redistribute energy across synoptic and large meso-α scales rather than applying uniform scaling.
The center column explores β2 variation with β1 = β3 = 1.0, isolating intermediate-layer contributions at the meso-α scale (Level 2: 400 km, Table 2); the resulting Z500 anomalies show comparable spatial coherence at these intermediate scales. Right-column results demonstrate that β3 variation (finest layer, Level 3: 200 km at the meso-α/meso-β transition, Table 2) produces minimal Z500 anomalies, typically below ±200 m with limited spatial coherence. Spectral analysis (Figure 7d) shows that β3 effects emerge primarily at high wavenumbers (n > 20-30) corresponding to meso-β processes, with negligible impact on synoptic and large meso-α scales. This scale separation is consistent with the architectural design: perturbations at 2× downsampling operate at scales below those dominating upper-tropospheric height fields, which are controlled by synoptic-scale and large meso-α wave dynamics.
While each SDL layer exhibits sensitivity across multiple wavenumbers, reflecting the overlapping receptive fields inherent to convolutional architectures, the spectral and spatial analyses confirm hierarchical stratification aligned with architectural scale. The β1 layer produces maximum kinetic energy modulation at synoptic scales (n = 2-10), β2 dominates at meso-α scales (n = 10-20), and β3 effects emerge primarily at meso-β scales (n > 20-30). This scale-dependent behavior parallels findings from physics-based ensemble systems, where forecast variance evolution exhibits strong scale dependence with rapid saturation at meso-β scales and slower growth at synoptic and large meso-α scales [49].
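The single-layer sensitivity experiments can be sketched as follows, assuming a hypothetical interface in which each injection point's perturbation is multiplied by its β before being applied; the function and keyword names are illustrative only.

```python
import torch

@torch.no_grad()
def beta_sensitivity(model, x0, base_latents, level=0, betas=(-3, -1, 0, 1, 3)):
    """Vary the scaling beta of a single SDL injection point (illustrative sketch).

    `model(x, latents=..., betas=...)` is an assumed interface in which each
    latent-driven perturbation is multiplied by its beta before injection.
    Returns anomalies of each scaled forecast relative to the beta = 1 forecast.
    """
    reference = model(x0, latents=base_latents, betas=[1.0, 1.0, 1.0])
    anomalies = {}
    for b in betas:
        scale = [1.0, 1.0, 1.0]
        scale[level] = float(b)          # perturb only the chosen injection point
        forecast = model(x0, latents=base_latents, betas=scale)
        anomalies[b] = forecast - reference
    return anomalies
```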
Table 3: Latent interpolation between initial states Z_i(0) and Z_j(0) with autoregressive evolution to future forecast times T. Columns correspond to interpolation parameters e = 0.0, 0.1, 0.2, ..., 0.9, 1.0; each column shows an interpolated latent trajectory.
The deterministic latent space enables controlled exploration between ensemble members. Given two stored latent tensors Z_i and Z_j from different members, we generate intermediate forecasts by linear interpolation:

    Z_e = (1 − e) Z_i + e Z_j,    e ∈ [0, 1].
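A minimal sketch of this interpolation, assuming the linear "weighted combination" form and applying it to the three per-level latents of each member:

```python
def interpolate_latents(z_i, z_j, e):
    """Linear interpolation between two members' stored latents (sketch).
    z_i, z_j: lists of the three per-level latent tensors; e in [0, 1]."""
    return [(1.0 - e) * a + e * b for a, b in zip(z_i, z_j)]

# Sweep e = 0.0, 0.1, ..., 1.0 as in Table 3, reusing the interpolated latents
# at every autoregressive step of the rollout.
```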
Table 3 shows that interpolation applies independently at each forecast step. For every parameter e ∈ [0, 1], the latent state Z_e(T) is a weighted combination of Z_i(T) and Z_j(T).

The 16-level vertical discretization, optimized for lower-tropospheric resolution, limits representation of stratospheric processes relevant to medium-range prediction. Sudden stratospheric warmings, stratosphere-troposphere coupling during blocking events, and quasi-biennial oscillation teleconnections may be inadequately captured in upper levels where vertical resolution is coarsest. Applications requiring explicit stratospheric dynamics would benefit from expanded vertical discretization or hybrid approaches combining fine tropospheric resolution with targeted stratospheric sampling.
Stochastic parameterization in numerical weather prediction represents model uncertainty arising from unresolved scales and uncertain physics [2,6,50]. Machine learning approaches for stochastic parameterization include generative adversarial networks applied to simplified atmospheric models [51] and comprehensive reviews of data-driven stochastic methods [52]. These methods are designed based on theoretical understanding of atmospheric processes and can be interpreted through physical mechanisms.
Despite this fundamental difference, SDL ensembles exhibit similar statistical properties to physics-based systems: calibrated spread-skill relationships, close-to-uniform rank histograms, and realistic trajectory divergence. This suggests that the functional form of uncertainty, including how perturbations are structured across scales, variables, and forecast lead times, can be learned from data through appropriate training objectives. SDL never explicitly encodes knowledge about baroclinic instability or moist convection, yet, through CRPS optimization on IFS-derived reanalysis data, it produces ensembles whose statistical behavior resembles systems that explicitly parameterize these processes.
The learned nature of SDL has both advantages and limitations compared to theory-driven approaches. SDL can capture systematic model errors and complex uncertainty patterns that emerge from the full modeling system without requiring explicit specification. However, it lacks interpretability in terms of physical processes and cannot generalize beyond training distribution bounds. These approaches are complementary rather than competing. The learned perturbations could augment physics-based stochastic schemes by capturing emergent uncertainty patterns while retaining interpretable physical perturbations for fundamental atmospheric processes.
While SDL shares computational efficiency with other CRPS-trained ensemble methods, its hierarchical latent parameterization distinguishes it through explicit control mechanisms. The learned latent space enables targeted scale-dependent perturbation and post-inference spread adjustment, capabilities that provide operational flexibility beyond what is achievable with fixed stochastic injection schemes.
The multi-scale injection strategy produces scale-dependent uncertainty control that remains coherent even when individual layer perturbations are amplified beyond training distribution magnitudes.
This robustness suggests the learned decomposition captures fundamental atmospheric uncertainty structure rather than distribution-specific artifacts. Critically, this hierarchical organization emerges from CRPS optimization without explicit architectural constraints enforcing scale separation, analogous to learned hierarchical representations in generative modeling [43,44] where stochastic injection combined with appropriate loss functions induces natural scale-dependent representations.
We introduce Stochastic Decomposition Layers, a computationally efficient method for converting deterministic machine learning weather models into calibrated probabilistic ensemble forecasting systems. SDL employs hierarchical noise decomposition to inject learned stochastic perturbations at multiple scales through latent-driven modulation, per-pixel noise, and learned channel modulation.
When applied to WXFormer via transfer learning, SDL achieves three primary contributions: (1) computational efficiency, with fine-tuning requiring less than 2% of the cost of training the baseline model; (2) calibrated probabilistic forecasts competitive with operational IFS-ENS through medium-range lead times; and (3) interpretable, hierarchical uncertainty control through the explicit latent parameterization. This approach leverages the pre-trained deterministic forecast skill as a strong initialization while requiring substantially reduced computational resources compared to training ensemble systems from random initialization.

ERA5 employs a 137-level hybrid sigma-pressure vertical coordinate system, and SDL-WXFormer produces forecasts on a 16-level subset of these model levels, which are then interpolated to standard pressure levels (500, 700, 850 hPa) during evaluation using the pressure-level interpolation procedure described in [13]. ERA5 verification data undergo identical model-to-pressure level interpolation to ensure consistent comparison.
The ECMWF IFS ensemble system (IFS-ENS) is accessed through WeatherBench 2 archives, which provide IFS-ENS forecasts regridded to a 240×121 equiangular grid. We use this WeatherBench 2 distribution format rather than the native IFS-ENS resolution (O640 octahedral reduced Gaussian grid, 9 km) due to data storage constraints. The ECMWF high-resolution deterministic system (HRES) archived data are provided at 1440×721 resolution (0.25° regular latitude-longitude).
Both operational systems provide output on pressure levels directly.
All forecast verification in this study uses a common 192×288 grid. For deterministic forecast comparison (Figure 2, first row), HRES forecasts were interpolated from 1440×721 to the 192×288 verification grid. This substantial spatial coarsening from 0.25° to approximately 1° resolution applies implicit smoothing that may reduce small-scale error magnitudes. ERA5 verification data were regridded to the same 192×288 grid.
Following horizontal interpolation to the common grid, SDL-WXFormer predictions and ERA5 verification both undergo vertical interpolation from model levels to pressure levels. IFS-ENS and HRES forecasts are already provided on pressure levels and require no vertical interpolation.
Forecast skill metrics (RMSE, MAE, ACC) are computed on the 192×288 verification grid using cosine-latitude weights normalized such that Σ_j w_j = N_lat to account for meridional convergence.
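A small NumPy sketch of this latitude-weighted metric, assuming a regular latitude vector for the 192×288 grid:

```python
import numpy as np

def latitude_weighted_rmse(forecast, truth, lats_deg):
    """RMSE with cosine-latitude weights normalized so that sum_j w_j = N_lat.
    forecast, truth: (N_lat, N_lon) fields; lats_deg: (N_lat,) grid latitudes."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w * len(lats_deg) / w.sum()              # enforce Σ_j w_j = N_lat
    sq_err = (forecast - truth) ** 2
    return np.sqrt(np.mean(w[:, None] * sq_err))
```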
Ensemble metrics (CRPS, spread-skill ratio, rank histograms) use the almost-fair CRPS formulation (Equation 2) with α = 0.95. The 2022 test period comprises 365 forecast initializations at 00Z.
The common verification grid enables direct metric comparison but introduces resolution-dependent effects. Coarsening HRES from 0.25° to 1° applies substantial spatial smoothing that suppresses small-scale features and may artificially reduce deterministic RMSE. However, the effective resolution of numerical weather prediction models is typically 4-5 grid spacings due to numerical diffusion and spectral truncation, suggesting an HRES effective resolution of approximately 1°-1.25° despite 0.25° nominal spacing. The 192×288 grid therefore captures the predictive information relevant for medium-range evaluation while enabling consistent comparison. Ensemble spread calibration (spread-skill ratios near unity, uniform rank histograms) indicates appropriate probabilistic skill at the scales resolved by this verification grid.
The ERA5 reanalysis data for this study can be accessed through the NSF NCAR Research Data Archive [46]. Operational IFS-HRES deterministic forecasts and IFS-ENS ensemble forecasts are obtained from the WeatherBench 2 archive [53], which provides regridded ECMWF operational predictions on standardized verification grids at 240×121 (ensemble) and 1440×721 (deterministic) horizontal resolutions. The neural networks described here and the simulation code used to train and test the models are archived at https://github.com/NCAR/miles-credit.