Fair scores reward ensemble forecast members that behave like samples from the same distribution as the verifying observations. They are therefore an attractive choice as loss functions for training data-driven ensemble forecasts or post-processing methods when large training ensembles are either unavailable or computationally prohibitive. The adjusted continuous ranked probability score (aCRPS) is fair and unbiased with respect to ensemble size, provided forecast members are exchangeable and interpretable as conditionally independent draws from an underlying predictive distribution. However, distribution-aware post-processing methods that introduce structural dependency between members can violate this assumption, rendering aCRPS unfair. We demonstrate this effect using two approaches designed to minimize the expected aCRPS of a finite ensemble: (1) a linear member-by-member calibration, which couples members through a common dependency on the sample ensemble mean, and (2) a deep-learning method, which couples members via transformer self-attention across the ensemble dimension. In both cases, the results are sensitive to ensemble size, and apparent gains in aCRPS can correspond to systematic unreliability characterized by over-dispersion. We introduce trajectory transformers as a proof-of-concept that ensemble-size independence can be achieved. This approach is an adaptation of the Post-processing Ensembles with Transformers (PoET) framework and applies self-attention over lead time while preserving the conditional independence required by aCRPS. When applied to weekly mean $T_{2m}$ forecasts from the ECMWF subseasonal forecasting system, this approach successfully reduces systematic model biases whilst also improving or maintaining forecast reliability, regardless of the ensemble size used in training (3 vs 9 members) or in real-time forecasts (9 vs 100 members).
Ensemble forecasting can be viewed as a Monte Carlo method, in which ensemble members approximate samples from a probability distribution of future atmospheric states, such that each member represents an equally plausible forecast trajectory (Leith, 1974; Molteni et al., 1996; Leutbecher and Palmer, 2008). However, state-of-the-art forecast models may suffer from systematic biases and flow-dependent errors due to many factors, including discretization and parameterization errors, unresolved or simplified physical processes, and errors in the specified initial state (Bauer et al., 2015; Magnusson et al., 2019). Raw ensemble forecasts can thus be biased and unreliable, where reliability is a measure of the statistical consistency between forecast probabilities and observed frequencies (Wilks, 2011; Gneiting et al., 2007).
In a statistical sense, the goal of ensemble forecasting is to maximize forecast sharpness subject to reliability (Gneiting et al., 2007). This principle has motivated the use of proper scoring rules and their ‘fair’ variants (Gneiting et al., 2007; Ferro, 2014) as loss functions for ensemble post-processing methods (Gneiting et al., 2005; Rasp and Lerch, 2018; Grönquist et al., 2021; Ben Bouallègue et al., 2024) and data-driven ensemble forecast systems (Lang et al., 2024; Kochkov et al., 2024). A negatively oriented scoring rule is considered strictly proper if the expected score is uniquely minimized when the predictive distribution is equal to the true distribution of the observations (Gneiting and Raftery, 2007). Fair scores are proper scoring rules that account for finite ensemble size effects and reward forecast members that behave as though they are sampled from the same distribution as the verifying observations (Ferro, 2014).
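As a minimal numerical illustration of propriety (not taken from the study above), the closed-form CRPS of a Gaussian predictive distribution (Gneiting et al., 2005) can be used to verify that the expected score is smallest when the forecast distribution matches the distribution of the observations; the helper name `crps_gaussian` and the toy setup below are ours.

```python
# Monte Carlo check that the CRPS behaves as a proper score, using the
# closed-form CRPS of a Gaussian predictive distribution (Gneiting et al., 2005).
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """CRPS of a Gaussian predictive distribution N(mu, sigma^2) for observation(s) y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=100_000)   # observations drawn from the "true" N(0, 1)

# The mean CRPS is smallest for the forecast that matches the true distribution (sigma = 1).
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma = {sigma}: mean CRPS = {crps_gaussian(0.0, sigma, y).mean():.4f}")
```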
For ensemble members $\{x_1, \ldots, x_N\}$ and observation $y$, the unadjusted kernel representation of the continuous ranked probability score (CRPS; Gneiting and Raftery, 2007) is a proper score defined by

$$\mathrm{CRPS}\left(\{x_1, \ldots, x_N\}, y\right) = \frac{1}{N}\sum_{j=1}^{N} |x_j - y| \;-\; \frac{1}{2N^2}\sum_{j=1}^{N}\sum_{k=1}^{N} |x_j - x_k|. \qquad (1)$$
However, the unadjusted CRPS is not fair for finite ensembles and rewards overconfident forecasts. To address this limitation, Ferro et al. (2008) define an adjusted version of the CRPS, which can be written in the kernel representation following Leutbecher (2019) as

$$\mathrm{aCRPS}\left(\{x_1, \ldots, x_N\}, y\right) = \frac{1}{N}\sum_{j=1}^{N} |x_j - y| \;-\; \frac{1}{2N(N-1)}\sum_{j=1}^{N}\sum_{k=1}^{N} |x_j - x_k|. \qquad (2)$$
When forecast members are exchangeable, such that they can be interpreted as a random sample from an underlying predictive distribution, the aCRPS is both fair and unbiased with respect to ensemble size.
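As a sketch of how Eqs. (1) and (2) might be evaluated in practice (the function names below are illustrative, not taken from any particular verification package), the two kernel sums can be computed directly with NumPy. For exchangeable members drawn from the same distribution as the observations, the mean aCRPS is approximately independent of ensemble size, whereas the unadjusted CRPS keeps improving as $N$ grows.

```python
# Direct NumPy implementation of the kernel CRPS (Eq. 1) and aCRPS (Eq. 2).
import numpy as np

def crps_kernel(x, y):
    """Unadjusted kernel CRPS, Eq. (1). x: ensemble, shape (cases, N); y: observations, shape (cases,)."""
    n = x.shape[-1]
    term_obs = np.abs(x - y[:, None]).mean(axis=-1)
    term_pairs = np.abs(x[:, :, None] - x[:, None, :]).sum(axis=(-2, -1)) / (2.0 * n**2)
    return term_obs - term_pairs

def acrps_kernel(x, y):
    """Adjusted (fair) CRPS, Eq. (2): the member-pair term is rescaled by N(N - 1)."""
    n = x.shape[-1]
    term_obs = np.abs(x - y[:, None]).mean(axis=-1)
    term_pairs = np.abs(x[:, :, None] - x[:, None, :]).sum(axis=(-2, -1)) / (2.0 * n * (n - 1))
    return term_obs - term_pairs

# Exchangeable members drawn from the same distribution as the observations:
# the mean aCRPS is (up to sampling noise) independent of N, the CRPS is not.
rng = np.random.default_rng(1)
y = rng.normal(size=2_000)                        # small Monte Carlo sample for brevity
for n in (3, 9, 100):
    x = rng.normal(size=(y.size, n))
    print(f"N={n:3d}  CRPS={crps_kernel(x, y).mean():.4f}  aCRPS={acrps_kernel(x, y).mean():.4f}")
```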
Importantly, fair scoring rules are specific to the dependence structure of the sampled ensemble members and do not exist for all forms of dependency (Ferro, 2014). For this reason, distribution-aware post-processing methods that introduce structural dependency between ensemble members can break the underlying assumptions of a fair score (e.g. aCRPS) such that members are rewarded when they appear to be sampled from a different distribution to the verifying observations. In this case, the post-processed forecasts that minimize the target loss function become systematically unreliable such that forecasts and observations have different statistical properties. This type of unreliability can be diagnosed from a mismatch between the average ensemble variance and the average squared error of the ensemble mean (after correction for finite ensemble size effects) or from differences in the total variance of forecast members and observations when evaluated over many events (Leutbecher and Palmer, 2008;Johnson and Bowler, 2009;Roberts and Leutbecher, 2025).
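The following sketch (with synthetic data and a helper name of our own) illustrates the first of these diagnostics: for a reliable ensemble, the mean squared error of the ensemble mean matches the mean ensemble variance once the latter is inflated by the finite-ensemble factor $(N+1)/N$, while an over-dispersive ensemble produces a corrected spread that exceeds the error.

```python
# Spread-error reliability diagnostic with a finite-ensemble-size correction.
import numpy as np

def spread_error_diagnostic(x, y):
    """x: ensemble forecasts, shape (cases, N); y: observations, shape (cases,)."""
    n = x.shape[1]
    mse = np.mean((x.mean(axis=1) - y) ** 2)       # mean squared error of the ensemble mean
    spread = np.mean(x.var(axis=1, ddof=1))        # mean (unbiased) ensemble variance
    return mse, (1.0 + 1.0 / n) * spread           # the two should match for a reliable ensemble

rng = np.random.default_rng(2)
truth = rng.normal(size=50_000)
y_obs = truth + rng.normal(scale=1.0, size=truth.size)
# Reliable ensemble: members drawn from the same distribution as the observations.
x_rel = truth[:, None] + rng.normal(scale=1.0, size=(truth.size, 9))
# Over-dispersive ensemble: members drawn with too much spread.
x_over = truth[:, None] + rng.normal(scale=1.5, size=(truth.size, 9))

for label, x in (("reliable", x_rel), ("over-dispersive", x_over)):
    mse, corrected_spread = spread_error_diagnostic(x, y_obs)
    print(f"{label:16s} MSE of mean = {mse:.3f}   corrected spread = {corrected_spread:.3f}")
```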
Traditional post-processing methods can be broadly categorized into two classes: (1) parametric methods, which assume a specific distributional form (e.g., Gaussian) and estimate distribution parameters from ensemble statistics (Gneiting et al., 2005; Scheuerer and Möller, 2015), and (2) non-parametric methods, which adjust each ensemble member individually and can preserve multivariate ensemble dependencies (Van Schaeybroeck and Vannitsem, 2015; Scheuerer and Hamill, 2015). Recent advances in machine learning methods, combined with the increasing accessibility of deep learning software frameworks and large reforecast datasets, have also facilitated the development of sophisticated data-driven ensemble post-processing methods, which are able to learn complex non-linear relationships from high-dimensional input data (Gneiting et al., 2005; Rasp and Lerch, 2018; Grönquist et al., 2021; Ben Bouallègue et al., 2024; Horat and Lerch, 2024). However, the computational expense of generating datasets and training deep-learning methods means that data-driven post-processing methods are commonly trained using a small ensemble size (e.g. 10 members) with the goal of applying these methods to real-time forecasts with a larger ensemble size (e.g. 50+ members).
For this reason, it is important to demonstrate that data-driven approaches trained on reduced ensemble sizes do not compromise the reliability of the resulting post-processed forecasts.
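For concreteness, a linear member-by-member calibration of the kind referred to in the abstract (cf. Van Schaeybroeck and Vannitsem, 2015) can be sketched as follows; the coefficient names and toy data are ours, and in practice the coefficients would be fitted by minimizing the chosen score over a training set.

```python
# Linear member-by-member calibration: each member is rebuilt from the sample
# ensemble mean plus a scaled deviation, so all members share a dependence on
# the same finite-ensemble mean.
import numpy as np

def member_by_member(x, a, b, c):
    """x: raw ensemble, shape (cases, N); returns calibrated members of the same shape."""
    xbar = x.mean(axis=1, keepdims=True)           # sample ensemble mean, shared by all members
    return a + b * xbar + c * (x - xbar)           # shift/scale the mean, rescale the deviations

# Toy example: debias and inflate a biased, under-dispersive 9-member ensemble.
rng = np.random.default_rng(3)
x_raw = 1.0 + 0.8 * rng.normal(size=(1000, 9))
x_cal = member_by_member(x_raw, a=-1.0, b=1.0, c=1.25)
```

Because every calibrated member depends on the same finite-sample ensemble mean, the calibrated members are no longer conditionally independent draws from a predictive distribution, which is precisely the kind of structural dependency discussed above.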
This study focuses on a new class of distribution-aware, data-driven post-processing methods that introduce structural dependencies between ensemble members, and examines how such dependencies interact with the fairness of the aCRPS when the ensemble sizes used for training and for real-time forecasting differ.