Data-Efficient Generative Modeling of Non-Gaussian Global Climate Fields via Scalable Composite Transformations
Quantifying uncertainty in future climate projections is hindered by the prohibitive computational cost of running physical climate models, which severely limits the availability of training data. We propose a data-efficient framework for emulating the internal variability of global climate fields, specifically designed to overcome these sample-size constraints. Inspired by copula modeling, our approach constructs a highly expressive joint distribution via a composite transformation to a multivariate standard normal space. We combine a nonparametric Bayesian transport map for spatial dependence modeling with flexible, spatially varying marginal models, essential for capturing non-Gaussian behavior and heavy-tailed extremes. These marginals are defined by a parametric model followed by a semi-parametric B-spline correction to capture complex distributional features. The marginal parameters are spatially smoothed using Gaussian-process priors with low-rank approximations, rendering the computational cost linear in the spatial dimension. When applied to global log-precipitation-rate fields at more than 50,000 grid locations, our stochastic surrogate achieves high fidelity, accurately quantifying the climate distribution’s spatial dependence and marginal characteristics, including the tails. Using only 10 training samples, it outperforms a state-of-the-art competitor trained on 80 samples, effectively octupling the computational budget for climate research. We provide a Python implementation at https://github.com/jobrachem/ppptm.
💡 Research Summary
The paper introduces a novel statistical framework for generating high‑dimensional, non‑Gaussian climate fields in a data‑efficient manner. Physical climate models are computationally expensive, typically limiting the number of available training samples to a few dozen. Existing approaches either assume Gaussianity, scale poorly with dimension, or require thousands of samples (e.g., deep generative models). To overcome these constraints, the authors propose a three‑stage composite transformation that maps the original field to a multivariate standard normal space.
The first stage, G, is a parametric marginal transformation. For each spatial location, a chosen parametric family (e.g., a location‑scale t‑distribution) is fitted, and its cumulative distribution function (CDF) is composed with the standard‑normal quantile function (inverse CDF). This yields a strictly monotonic map that controls tail behavior explicitly. Because estimating the parameters independently at each location would be unstable with few samples, the authors place independent Gaussian‑process (GP) priors on each parameter across space, sharing information and enforcing smoothness. Low‑rank approximations of the GP kernels keep the computational cost linear in the number of grid points (L ≈ 5 × 10⁴).
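The stage‑G map at a single grid point is a probability integral transform followed by the standard‑normal quantile function. A minimal sketch, assuming a location‑scale t family (function and parameter names are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy import stats

def marginal_gaussianize(x, df, loc, scale):
    """Sketch of stage G at one location: z = Phi^{-1}(F_t(x)).

    F_t is the fitted location-scale t CDF; Phi^{-1} is the standard-normal
    quantile function. Both pieces are strictly monotone, so the map is
    invertible and the t parameters control tail behavior explicitly.
    """
    u = stats.t.cdf(x, df=df, loc=loc, scale=scale)  # probability integral transform
    return stats.norm.ppf(u)                          # push to standard-normal space

def marginal_inverse(z, df, loc, scale):
    """Inverse of stage G, as needed when sampling new fields."""
    return stats.t.ppf(stats.norm.cdf(z), df=df, loc=loc, scale=scale)

# Round trip: heavy-tailed samples -> approximately standard normal -> back
x = stats.t.rvs(df=4, loc=1.0, scale=0.5, size=2000, random_state=0)
z = marginal_gaussianize(x, df=4, loc=1.0, scale=0.5)
x_back = marginal_inverse(z, df=4, loc=1.0, scale=0.5)
```

Under the true parameters, `z` is approximately standard normal, which is exactly the state that the later stages H and T expect as input.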
The second stage, H, adds a semi‑parametric correction using a monotone B‑spline. The spline is defined on a fixed knot grid; its coefficients are parameterized as an intercept plus log‑increments to guarantee monotonicity. Outside the interior knot interval the transformation defaults to the identity, ensuring that the parametric tails remain untouched while the spline flexibly adjusts the bulk of the distribution. The spline coefficients are also given spatial GP priors, again with low‑rank approximations. This design yields a flexible yet regularized marginal model that can capture complex shapes without sacrificing extrapolation stability.
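The intercept‑plus‑log‑increments trick that guarantees monotonicity can be shown in a few lines. This is a generic sketch of the parameterization, not the paper's code; the names are ours:

```python
import numpy as np

def monotone_spline_coefficients(intercept, log_increments):
    """Strictly increasing spline coefficients from unconstrained parameters:
    c_0 = intercept, c_k = c_{k-1} + exp(log_increments[k-1]).

    Because exp(.) > 0, successive differences are positive for any real
    input, so monotonicity holds by construction and no projection step
    is ever needed during optimization.
    """
    steps = np.exp(np.asarray(log_increments))
    return intercept + np.concatenate([[0.0], np.cumsum(steps)])

# Monotone regardless of the signs of the log-increments
c = monotone_spline_coefficients(-1.0, [0.0, -2.0, 1.0])
```

The same unconstrained parameters can then be given spatial GP priors directly, since no constraint ties them to a bounded domain.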
The third stage, T, is a triangular scalable Bayesian transport map (BATM). Each component T_i depends on the current standardized variable and all previous ones via a nonlinear regression function f_i and a positive scale d_i:
T_i(ẑ_{1:i}) = (ẑ_i − f_i(ẑ_{1:i−1})) / d_i.
Because the map is lower‑triangular, its Jacobian determinant is the product of the d_i terms, allowing exact likelihood evaluation and straightforward inversion. The map thus acts as a copula, removing any remaining dependence after the marginal Gaussianization performed by G and H.
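A toy version of the triangular map and its sequential inverse illustrates both properties. The `f_list` functions below are arbitrary placeholders; the actual BATM learns nonparametric Bayesian regressions for each component:

```python
import numpy as np

def transport_forward(z_hat, f_list, d):
    """Apply the lower-triangular map: eps_i = (z_hat_i - f_i(z_hat_{<i})) / d_i."""
    eps = np.empty_like(z_hat)
    for i in range(len(z_hat)):
        eps[i] = (z_hat[i] - f_list[i](z_hat[:i])) / d[i]
    return eps

def transport_inverse(eps, f_list, d):
    """Invert sequentially: z_hat_i = f_i(z_hat_{<i}) + d_i * eps_i."""
    z_hat = np.empty_like(eps)
    for i in range(len(eps)):
        z_hat[i] = f_list[i](z_hat[:i]) + d[i] * eps[i]
    return z_hat

# Placeholder regression functions; f_1 ignores its (empty) input.
f_list = [lambda prev: 0.0,
          lambda prev: 0.5 * prev[0],
          lambda prev: prev[0] - 0.3 * prev[1]]
d = np.array([1.0, 2.0, 0.5])

z_hat = np.array([0.2, -1.1, 0.7])
eps = transport_forward(z_hat, f_list, d)

# Lower-triangular structure: log|det J| of the forward map is just -sum(log d_i)
log_det = -np.log(d).sum()
```

The inversion loop makes the "straightforward inversion" claim concrete: each ẑ_i is recovered from already‑inverted components, so sampling a new field is a single sequential pass.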
Training proceeds by maximizing the joint likelihood (or Bayesian posterior) of all parameters (the marginal GP hyper‑parameters, spline coefficients, and transport‑map functions). The monotonicity constraints are built into the parameterization, so no additional projection steps are needed.
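By the change‑of‑variables formula, the joint log‑likelihood of one field decomposes into a standard‑normal term plus the log‑Jacobians of the three stages. A schematic sketch, with the stage callables and Jacobian terms as placeholders rather than the paper's actual objective:

```python
import numpy as np
from scipy import stats

def composite_log_likelihood(x, G, H, T, log_jac_G, log_jac_H, log_jac_T):
    """log p(x) = sum_i log phi(eps_i) + log|J_T| + log|J_H| + log|J_G|,
    where eps = T(H(G(x))) and phi is the standard-normal density."""
    z = G(x)                 # parametric marginal stage
    z_tilde = H(z)           # semi-parametric spline correction
    eps = T(z_tilde)         # triangular transport map
    return (stats.norm.logpdf(eps).sum()
            + log_jac_T(z_tilde) + log_jac_H(z) + log_jac_G(x))

# Sanity check: with identity stages (zero log-Jacobians) the objective
# reduces to the standard-normal log-density of x itself.
identity = lambda v: v
zero_jac = lambda v: 0.0
x = np.array([0.0, 1.0])
ll = composite_log_likelihood(x, identity, identity, identity,
                              zero_jac, zero_jac, zero_jac)
```

Because all three stages are monotone by construction, this objective can be maximized (or used as a likelihood in a posterior) over all parameters jointly, with no projection steps.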
The methodology is evaluated on log‑precipitation‑rate fields from the Community Earth System Model (CESM) Large Ensemble Project, comprising 55,296 grid points worldwide. With only N = 10 training replicates, the proposed model outperforms a state‑of‑the‑art competitor trained on N = 80 samples across several metrics: mean‑squared error, spatial correlation structure, tail probability estimation, and visual similarity of generated fields. The model accurately reproduces long‑range dependencies and local variability while preserving realistic extreme‑value behavior. Computationally, generating hundreds of new fields on a standard laptop takes a few minutes, whereas a single CESM simulation requires weeks on a supercomputer.
Key contributions include: (1) a principled separation of marginal and dependence modeling via a Gaussian‑space composite map, (2) a novel monotone B‑spline correction that blends parametric tail control with flexible bulk fitting, (3) scalable spatial smoothing of all marginal parameters using low‑rank GP priors, and (4) the integration of a triangular Bayesian transport map as a high‑dimensional copula. The approach is generic and can be applied to any high‑dimensional data with a meaningful distance metric, not only climate variables. Limitations noted by the authors involve the fixed choice of parametric family for all locations, the need for a predefined ordering in the triangular map, and the current handling of spline boundaries. Future work may explore adaptive ordering, mixture families for margins, and more expressive dependence structures. Overall, the paper delivers a powerful, computationally tractable tool for uncertainty quantification in climate modeling when training data are scarce.