Combining spatial information sources while accounting for systematic errors in proxies
Environmental research increasingly uses high-dimensional remote sensing and numerical model output to help fill space-time gaps between traditional observations. Such output is often a noisy proxy for the process of interest. Thus one needs to separate and assess the signal and noise (often called discrepancy) in the proxy given complicated spatio-temporal dependencies. Here I extend a popular two-likelihood hierarchical model using a more flexible representation for the discrepancy. I employ the little-used Markov random field approximation to a thin plate spline, which can capture small-scale discrepancy in a computationally efficient manner while better modeling smooth processes than standard conditional auto-regressive models. The increased flexibility reduces identifiability, but the lack of identifiability is inherent in the scientific context. I model particulate matter air pollution using satellite aerosol and atmospheric model output proxies. The estimated discrepancies occur at a variety of spatial scales, with small-scale discrepancy particularly important. The examples indicate little predictive improvement over modeling the observations alone. Similarly, in simulations with an informative proxy, the presence of discrepancy and resulting identifiability issues prevent improvement in prediction. The results highlight but do not resolve the critical question of how best to use proxy information while minimizing the potential for proxy-induced error.
💡 Research Summary
Environmental scientists increasingly rely on high‑dimensional auxiliary data—such as satellite remote‑sensing products and numerical atmospheric model outputs—to fill spatial and temporal gaps in traditional ground‑based observations. These auxiliary data are typically treated as noisy proxies for the true process of interest (e.g., surface concentrations of particulate matter). The central statistical challenge is to separate the genuine signal from the systematic error, or “discrepancy,” that arises because the proxy does not perfectly represent the underlying phenomenon.
The paper builds on the widely used two‑likelihood hierarchical framework, in which the observation model and the proxy model are linked through a linear relationship but each possesses its own error structure. The author argues that the conventional implementation, which often employs a conditional autoregressive (CAR) prior for the discrepancy, is insufficient for capturing small‑scale, smooth deviations that are common in remote‑sensing and model‑output data. To address this, the author introduces a more flexible discrepancy representation: a thin‑plate spline approximated by a Markov random field (MRF). The thin‑plate spline provides a globally smooth surface, while the MRF approximation makes computation tractable on large lattices by exploiting conditional independence among neighboring nodes. This hybrid construction retains the spline’s ability to model smooth, multi‑scale variation while offering the sparse precision matrices that enable efficient Bayesian inference via Markov chain Monte Carlo (MCMC).
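To make the sparse-precision idea concrete, here is a minimal sketch of one common Markov random field approximation to a thin-plate spline: a second-order intrinsic GMRF whose precision matrix is built from discrete Laplacians (the construction popularized by Rue and Held). This is illustrative, not the paper's exact implementation; boundary handling is deliberately simplified, and the resulting precision is singular, as expected for an intrinsic (improper) prior.

```python
import numpy as np
import scipy.sparse as sp

def thin_plate_mrf_precision(nrow, ncol):
    """Sparse precision matrix of a second-order intrinsic GMRF on an
    nrow x ncol lattice, a common MRF approximation to a thin-plate
    spline. Built as Q = L'L with L the 2-D discrete Laplacian.
    Q is singular (its null space contains flat surfaces), so it
    defines an improper prior, as is standard for intrinsic fields."""
    def second_diff(n):
        # 1-D second-difference operator
        return sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
    # 2-D discrete Laplacian via a Kronecker sum
    L = (sp.kron(sp.identity(nrow), second_diff(ncol))
         + sp.kron(second_diff(nrow), sp.identity(ncol)))
    return (L.T @ L).tocsc()

Q = thin_plate_mrf_precision(50, 50)
print(Q.shape)   # (2500, 2500)
print(Q.nnz)     # only ~13 nonzeros per row: sparse, hence fast MCMC
```

The sparsity is the computational point: solves and log-determinant evaluations with `Q` scale far better than with the dense covariance of an exact thin-plate spline.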
The model consists of three layers. (1) The latent true process Z(s) is modeled as a spatial Gaussian process with a conventional covariance function. (2) Each proxy X_i(s) is linked to Z(s) through a linear coefficient plus a systematic error term δ_i(s). (3) The systematic error δ_i(s) is expressed as a sum of thin‑plate‑spline‑based MRF components that operate at distinct spatial scales (large, medium, and small). Hyper‑parameters control the variance and smoothness at each scale, allowing the data to reveal where discrepancy is most pronounced.
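A toy one-dimensional generative sketch of these three layers may help fix ideas. All parameter values below are illustrative stand-ins, not numbers from the paper, and a single discrepancy scale is used in place of the paper's multi-scale sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D generative sketch of the three-layer structure; every
# parameter value here is illustrative, not taken from the paper.
n = 200
s = np.linspace(0.0, 1.0, n)
dist = np.abs(s[:, None] - s[None, :])

# (1) Latent true process Z(s): Gaussian process, exponential covariance
K_z = np.exp(-dist / 0.2) + 1e-8 * np.eye(n)   # jitter for stability
Z = rng.multivariate_normal(np.zeros(n), K_z)

# (3) Systematic error delta(s): a shorter-range (small-scale) field
K_d = 0.3 * np.exp(-dist / 0.05) + 1e-8 * np.eye(n)
delta = rng.multivariate_normal(np.zeros(n), K_d)

# (2) Proxy linked to Z through a linear coefficient plus discrepancy
beta0, beta1 = 0.5, 0.9                         # hypothetical values
X = beta0 + beta1 * Z + delta + rng.normal(0.0, 0.1, n)

# Ground observations of Z carry their own measurement error
Y = Z + rng.normal(0.0, 0.25, n)
```

The identifiability tension discussed next is already visible here: given only Y and X, an increase attributed to Z can instead be absorbed by δ, and vice versa.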
A key theoretical issue is identifiability. Because both the latent process and the discrepancy are latent fields, they can trade off against each other in the likelihood, leading to non‑unique posterior modes. The author acknowledges that this lack of identifiability is intrinsic to the scientific problem: without external information about the quality of the proxy, the data alone cannot fully disentangle signal from systematic error. The proposed flexible discrepancy model therefore trades increased expressive power for heightened identifiability concerns.
The methodology is applied to a nationwide United States dataset of ground‑based PM2.5 measurements. Two proxies are used: (a) satellite‑derived aerosol optical depth (AOD) and (b) output from the Community Multiscale Air Quality (CMAQ) model. The posterior analysis reveals discrepancy structures at both coarse (hundreds of kilometres) and fine (tens of kilometres) scales, with the fine‑scale component accounting for a substantial portion of the total discrepancy variance. Despite this richer representation, predictive performance, assessed via cross‑validation, improves only marginally relative to a baseline model that uses the ground observations alone; in many folds the improvement is not statistically significant.
To explore the limits of the approach, the author conducts simulation studies where the proxy is deliberately made highly informative (correlation ≈ 0.8) versus weakly informative (correlation ≈ 0.3). Even when the proxy is highly correlated with the true process, the presence of a non‑trivial discrepancy prevents the hierarchical model from achieving substantial gains in predictive accuracy. The simulations confirm that the identifiability problem, rather than computational issues, is the primary barrier to exploiting proxy information.
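The correlation targets in such a design can be achieved with a simple mixing construction, sketched below. This uses i.i.d. fields purely for brevity; the paper's simulated fields are spatially structured, and `make_proxy` is a hypothetical helper, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_proxy(z, target_corr, rng):
    """Mix the standardized true field with an independent discrepancy
    so the proxy has (approximately) a chosen correlation with z.
    Illustrative construction only; not the paper's simulation code."""
    disc = rng.normal(0.0, 1.0, z.size)
    return target_corr * z + np.sqrt(1.0 - target_corr**2) * disc

Z = rng.normal(0.0, 1.0, 100_000)       # stand-in for the true field
X_strong = make_proxy(Z, 0.8, rng)      # highly informative proxy
X_weak = make_proxy(Z, 0.3, rng)        # weakly informative proxy

print(np.corrcoef(Z, X_strong)[0, 1])   # close to 0.8
print(np.corrcoef(Z, X_weak)[0, 1])     # close to 0.3
```

Because the discrepancy term `disc` is independent of `Z`, even the strong proxy carries irreducible error, which mirrors the paper's finding that high marginal correlation alone does not guarantee predictive gains.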
In summary, the paper makes three substantive contributions. First, it demonstrates that a thin‑plate‑spline‑based MRF can capture small‑scale, smooth discrepancy more effectively than traditional CAR priors while remaining computationally feasible for large spatial datasets. Second, it provides empirical evidence that, in realistic air‑quality applications, the added flexibility does not automatically translate into better predictions because the proxy‑induced systematic error can dominate the information gain. Third, it highlights the persistent identifiability challenge: without external validation or strong prior knowledge about the proxy’s bias structure, hierarchical integration of proxy data may yield limited practical benefit.
The findings suggest that future work should focus on (i) incorporating independent assessments of proxy bias (e.g., field campaigns, physical model diagnostics), (ii) imposing informative priors or constraints on the discrepancy field to improve identifiability, and (iii) exploring multi‑proxy frameworks where complementary proxies jointly constrain the discrepancy. Only by addressing these methodological gaps can the promise of high‑dimensional auxiliary data be fully realized in environmental exposure modeling.