A spatial random forest algorithm for population-level epidemiological risk assessment


Spatial epidemiology identifies the drivers of elevated population-level disease risks, using disease counts, exposures and known confounders at the areal unit level. Inference typically relies on Poisson regression models, which incorporate a linear/additive regression component and allow for unmeasured confounding via a set of spatially autocorrelated random effects. This approach requires the confounder interactions and their functional relationships with disease risk to be specified in advance, rather than being learned from the data. Therefore, this paper proposes the SPAR-Forest-ERF algorithm, which is the first fusion of random forests for capturing non-linear and interacting confounder-response effects with Bayesian spatial autocorrelation models that can estimate interpretable exposure response functions (ERF) with full uncertainty quantification. Methodologically, we extend existing methods set in a prediction context by propagating uncertainty between the ML and statistical models, developing a new stopping criterion designed to ensure the stability of the primary inferential target, and incorporating a range of different ERFs for maximum model flexibility. This methodology is motivated by a new study quantifying the impact of air pollution concentrations on self-rated health in Scotland, using data from the recently released 2022 national census.

💡 Research Summary

This paper introduces SPAR‑Forest‑ERF, a novel hybrid methodology that integrates random forests with Bayesian spatial smoothing to estimate exposure‑response functions (ERFs) in population‑level epidemiological studies while fully accounting for spatial autocorrelation and providing comprehensive uncertainty quantification. Traditional spatial epidemiology relies on Poisson regression with linear/additive covariate effects and a spatially correlated random effect (often a CAR prior). Such models require analysts to pre‑specify interaction terms and functional forms, limiting their ability to learn complex, non‑linear relationships from the data.

SPAR‑Forest‑ERF addresses these limitations through three key innovations. First, it models the exposure effect g(x) using flexible forms: a simple linear term, a Bayesian p‑spline (non‑linear) implemented as a second‑order random‑walk latent field via INLA, or a measurement‑error formulation that treats the observed exposure as a noisy proxy for the true exposure. This flexibility allows researchers to directly interpret how risk changes across exposure levels, which is essential for policy guidance.
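For readers unfamiliar with the second-order random walk mentioned above, its standard textbook definition places a Gaussian prior on second differences of the ERF evaluated at ordered exposure values. The indexing $j$, the number of grid points $J$, and the precision symbol $\tau$ are notation assumed here, not taken from the paper:

```latex
g_j - 2\,g_{j-1} + g_{j-2} \sim \mathcal{N}\!\left(0, \tau^{-1}\right), \qquad j = 3, \dots, J
```

A large $\tau$ shrinks the second differences towards zero and so pulls $g$ towards a straight line, while a small $\tau$ permits more curvature.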

Second, the confounding component m(z) is captured by a random forest. The forest efficiently learns high‑dimensional, non‑linear, and interacting relationships among nuisance covariates, producing out‑of‑bag predictions that serve as an estimate of the confounder‑adjusted risk surface. Because random forests assume independent observations, residual spatial dependence remains in the data.
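The idea of an out-of-bag prediction can be illustrated with a small sketch. This is not the authors' implementation: to stay in plain Python it bags depth-one regression stumps on a single covariate rather than growing full trees, and the function names `fit_stump` and `oob_predictions` are hypothetical. The point it shows is that each area is predicted only by the trees whose bootstrap sample excluded it.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-one regression tree on one covariate by least squares."""
    best = None
    for split in sorted(set(xs))[:-1]:  # drop the maximum so the right side is non-empty
        left = [y for x, y in zip(xs, ys) if x <= split]
        right = [y for x, y in zip(xs, ys) if x > split]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    if best is None:  # every x in the bootstrap sample was identical
        m = sum(ys) / len(ys)
        return lambda x: m
    _, split, ml, mr = best
    return lambda x: ml if x <= split else mr

def oob_predictions(xs, ys, n_trees=200, seed=0):
    """Average, for each observation, the trees that did not train on it."""
    rng = random.Random(seed)
    n = len(xs)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # observation i is out-of-bag for this tree
                sums[i] += tree(xs[i])
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy data: a step in the response at x = 10.
xs = [float(i) for i in range(20)]
ys = [0.0 if x < 10 else 1.0 for x in xs]
preds = oob_predictions(xs, ys)
```

In practice a random forest library would supply these predictions directly; the sketch only makes the bookkeeping explicit.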

Third, the residual spatial structure is modeled with a Bayesian spatial random effect ϕ, typically a conditional autoregressive (CAR) prior. Crucially, the algorithm propagates uncertainty from the forest stage to the spatial stage: posterior samples of the forest predictions are fed into the spatial model, and the joint posterior of all parameters is obtained via an iterative MCMC scheme. This yields a coherent full‑posterior distribution for the ERF, rather than a point estimate with ad‑hoc error bars.
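Putting the three components together, the model structure described in this summary can be written as follows. The symbols $g$, $m$ and $\phi$ follow the text above; $Y_k$ (observed count in area $k$), $E_k$ (expected count) and $\theta_k$ (relative risk) are notation assumed here, and the paper's exact parameterisation may differ:

```latex
\begin{aligned}
Y_k \mid E_k, \theta_k &\sim \text{Poisson}(E_k\,\theta_k), \\
\ln \theta_k &= g(x_k) + m(z_k) + \phi_k .
\end{aligned}
```

Here $x_k$ is the exposure, $z_k$ the vector of confounders, and $\phi_k$ the spatially autocorrelated random effect for area $k$.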

A novel stopping rule is introduced that monitors the stability of the ERF itself. The iterative procedure (forest → spatial model → updated residuals → forest) continues until successive ERF estimates differ by less than a pre‑specified tolerance, ensuring that the primary inferential target has converged rather than relying on generic predictive‑error criteria.
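A minimal sketch of such an ERF-based stopping rule is given below. The summary does not state which distance or tolerance the authors use, so the maximum absolute change on a fixed exposure grid, the default `tol`, and the names `max_abs_change` and `iterate_until_stable` are all assumptions. The `update` argument stands in for one full forest-to-spatial-model cycle; in the toy usage it is a simple contraction.

```python
def max_abs_change(prev_erf, curr_erf):
    """Largest pointwise change between two ERF estimates on the same grid."""
    return max(abs(c - p) for p, c in zip(prev_erf, curr_erf))

def iterate_until_stable(update, erf0, tol=1e-3, max_iter=100):
    """Apply `update` until successive ERF estimates differ by less than `tol`.

    Returns (final ERF, iterations used, whether the tolerance was met).
    """
    erf = list(erf0)
    for it in range(1, max_iter + 1):
        new_erf = update(erf)
        change = max_abs_change(erf, new_erf)
        erf = new_erf
        if change < tol:
            return erf, it, True
    return erf, max_iter, False

# Toy stand-in for one algorithm cycle: move halfway towards a fixed target.
target = [0.0, 0.1, 0.4]
toy_update = lambda erf: [0.5 * (e + t) for e, t in zip(erf, target)]

final_erf, n_iter, converged = iterate_until_stable(toy_update, [0.0, 0.0, 0.0])
```

Monitoring the ERF rather than predictive error means the loop stops when the quantity of interest has settled, even if predictions are still changing slightly.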

The methodology is evaluated through extensive simulations that vary the strength of spatial autocorrelation, the degree of non‑linearity in the true exposure‑response relationship, and the presence of measurement error. Compared with standard Poisson‑CAR models and with a two‑stage random‑forest‑then‑kriging approach, SPAR‑Forest‑ERF consistently achieves lower mean absolute error, better coverage of the true ERF, and more accurate recovery of non‑linear dose‑response shapes.
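The two headline metrics can be computed as in the sketch below. The summary does not say exactly how the authors define them, so pointwise evaluation on a shared exposure grid and the function names `mean_absolute_error` and `interval_coverage` are assumptions.

```python
def mean_absolute_error(truth, estimate):
    """Average absolute gap between the true and estimated ERF on a grid."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)

def interval_coverage(truth, lower, upper):
    """Share of grid points where the interval [lower, upper] contains the truth."""
    hits = sum(1 for t, lo, hi in zip(truth, lower, upper) if lo <= t <= hi)
    return hits / len(truth)

# Toy values on a four-point exposure grid.
truth = [0.0, 0.1, 0.2, 0.3]
estimate = [0.0, 0.2, 0.2, 0.2]
lower = [-0.05, 0.15, 0.15, 0.15]
upper = [0.05, 0.25, 0.25, 0.25]

mae = mean_absolute_error(truth, estimate)
cov = interval_coverage(truth, lower, upper)
```

In a simulation study these quantities would be averaged over many replicated datasets, with coverage compared against the nominal level of the credible interval.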

The authors apply the method to a newly released 2022 Scottish census dataset comprising 6,972 data zones. The health outcome is self‑rated general health, dichotomized into “bad” (Bad or Very Bad) and “good” (Good or Very Good) categories, with expected counts obtained via indirect standardization. Exposures include NO₂, PM₂.₅, and PM₁₀, derived from the Pollution Climate Mapping (PCM) model and aggregated to each zone. The analysis adjusts for a set of 12 socio‑economic and demographic confounders (e.g., SIMD deprivation index, population density, urban/rural indicators, loneliness).
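Indirect standardization can be sketched as follows. The summary does not say which strata the authors use, so the example assumes population strata (for instance age-sex groups) with reference rates pooled over the whole study region; the function name `expected_counts` and the toy numbers are illustrative only.

```python
def expected_counts(pop_by_area, cases_by_stratum):
    """Expected cases per area under region-wide stratum-specific rates.

    pop_by_area: {area: [population in stratum 0, stratum 1, ...]}
    cases_by_stratum: [total cases in stratum 0, stratum 1, ...]
    """
    n_strata = len(cases_by_stratum)
    total_pop = [sum(p[j] for p in pop_by_area.values()) for j in range(n_strata)]
    rates = [cases_by_stratum[j] / total_pop[j] for j in range(n_strata)]
    return {area: sum(n * r for n, r in zip(pops, rates))
            for area, pops in pop_by_area.items()}

# Two areas, two strata; reference rates work out to 0.1 and 0.2.
pop = {"A": [100, 50], "B": [300, 50]}
cases = [40, 20]
expected = expected_counts(pop, cases)

# Standardized ratio for area A if 30 cases were observed there.
smr_a = 30 / expected["A"]
```

The expected counts sum to the total number of cases, so a ratio of observed to expected above 1 marks an area with more cases than its population structure would suggest.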

Exploratory analysis shows significant residual spatial autocorrelation after fitting both a Poisson GLM and a random forest, motivating the need for a spatial smoothing component. Using SPAR‑Forest‑ERF, the authors estimate three types of ERFs for each pollutant. The non‑linear p‑spline ERFs reveal that risk of reporting “bad” health remains relatively flat at low pollutant concentrations but rises sharply beyond urban thresholds (≈20 µg m⁻³ for NO₂, similar inflection points for PM₂.₅ and PM₁₀). The measurement‑error ERF accounts for uncertainty in the PCM exposure estimates, slightly attenuating the steepness of the dose‑response curves. The spatial random effects capture residual clusters of elevated risk in parts of Edinburgh, Glasgow, and the north‑east coast, indicating unmeasured local factors.
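One common way to check residuals for spatial autocorrelation is Moran's I. The summary does not name the statistic the authors used, so the sketch below is an assumption. It takes binary neighbourhood weights (1 if two areas share a border, 0 otherwise), and the function name `morans_i` is hypothetical.

```python
def morans_i(values, neighbours):
    """Moran's I with binary adjacency weights.

    values: list of residuals, one per area
    neighbours: {area index: [indices of adjacent areas]}, listed symmetrically
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(len(nb) for nb in neighbours.values())  # sum of all weights
    cross = sum(dev[i] * dev[j] for i, nb in neighbours.items() for j in nb)
    denom = sum(d * d for d in dev)
    return (n / s0) * cross / denom

# Four areas in a line: 0 - 1 - 2 - 3.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

smooth = morans_i([1.0, 2.0, 3.0, 4.0], line)         # neighbours are similar
alternating = morans_i([1.0, -1.0, 1.0, -1.0], line)  # neighbours are opposite
```

Positive values indicate that neighbouring areas have similar residuals. Significance is usually assessed with a permutation test, which is omitted here.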

The paper discusses strengths such as interpretability of ERFs, flexibility to model complex confounder effects, and rigorous uncertainty propagation. Limitations include sensitivity to random‑forest hyper‑parameters, computational demands of the Bayesian MCMC, and the requirement for external information to specify measurement‑error variance. Future work is suggested on extending the framework to spatio‑temporal data, jointly modeling multiple exposures, and leveraging variational inference for scalability.

In conclusion, SPAR‑Forest‑ERF provides a comprehensive, inference‑focused solution for spatial epidemiology, enabling researchers to uncover realistic, non‑linear exposure‑response relationships while properly accounting for spatial dependence and delivering full posterior uncertainty. These features are essential for evidence‑based public health decision‑making.
