Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

We express the mean and variance terms in a double exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model, and whether they enter linearly or flexibly. When the variance term is null we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is estimated using Markov chain Monte Carlo simulation and the methodology is illustrated using real and simulated data sets.


💡 Research Summary

The paper introduces a Bayesian framework for semiparametric overdispersed generalized linear models (GLMs) that simultaneously models the mean and the dispersion (variance) as additive functions of covariates. Starting from the double‑exponential family, the authors write the log‑mean η₁ = log(μ) and the log‑dispersion η₂ = log(φ) as
η₁ = β₀ + Σ_{j∈S₁} β_j x_j + Σ_{k∈F₁} f_k(x_k)
η₂ = γ₀ + Σ_{j∈S₂} γ_j x_j + Σ_{k∈F₂} g_k(x_k),
where S₁ and S₂ denote covariates that enter linearly, while F₁ and F₂ denote covariates that enter through smooth, flexible functions f_k and g_k. This formulation nests several familiar models: dropping the dispersion model reduces it to a generalized additive model (GAM), and if, in addition, every covariate enters the mean linearly, it reduces to a standard GLM.
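As a minimal numerical sketch of the two additive predictors, the snippet below builds η₁ and η₂ for a toy design, using a truncated-linear spline basis as a stand-in for the paper's smooth terms. The basis choice, knot placement, and all coefficient values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated-linear spline basis: one column (x - k)_+ per knot."""
    return np.maximum(x[:, None] - knots[None, :], 0.0)

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-1, 1, n)   # enters the mean smoothly (illustrative)
x2 = rng.uniform(-1, 1, n)   # enters both predictors linearly (illustrative)

knots = np.linspace(-0.8, 0.8, 5)
B1 = spline_basis(x1, knots)

# Illustrative coefficient values for the additive predictors.
beta0, beta2 = 0.5, 0.8
f1_coef = rng.normal(0.0, 0.3, knots.size)
gamma0, gamma2 = -0.2, 0.4

eta1 = beta0 + beta2 * x2 + B1 @ f1_coef   # eta1 = log(mu)
eta2 = gamma0 + gamma2 * x2                # eta2 = log(phi)

mu, phi = np.exp(eta1), np.exp(eta2)       # mean and dispersion, both positive
```

Because both predictors live on the log scale, the implied mean μ and dispersion φ are automatically positive, whatever the covariate values.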

Variable selection and the decision whether a covariate should be treated linearly or nonlinearly are performed jointly through binary inclusion indicators, one flagging linear entry and one flagging smooth entry for each covariate. Sparse priors such as beta‑Bernoulli or stick‑breaking constructions are placed on these indicators to encourage parsimonious models. Smooth functions are given Bayesian spline or Gaussian‑process priors, with hyper‑priors on the smoothing parameters that allow the data to dictate the degree of flexibility.
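A beta‑Bernoulli construction of the kind described above can be sketched as follows. The hyper‑parameter values, and the hierarchy under which a covariate can enter smoothly only if it enters at all, are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 8                      # number of candidate covariates (illustrative)

# Beta-Bernoulli sparsity prior: w ~ Beta(a, b), indicator_j | w ~ Bernoulli(w).
# a < b pulls the shared inclusion probability w toward small values.
a, b = 1.0, 4.0            # illustrative hyper-parameters favouring sparsity
w = rng.beta(a, b)

linear_in = rng.binomial(1, w, p)              # does covariate j enter at all?
# Assumed hierarchy: a smooth term is only possible for included covariates.
smooth_in = rng.binomial(1, w, p) & linear_in  # does it also enter smoothly?
```

Drawing both indicator vectors from the prior like this is how one would simulate model configurations before seeing data; in the posterior sampler the Bernoulli probabilities are instead conditioned on the current fit.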

Inference proceeds via a block‑wise Markov chain Monte Carlo (MCMC) algorithm that combines Gibbs updates for conditionally conjugate parameters (linear coefficients, spline coefficients) with Metropolis–Hastings steps for the inclusion indicators and the dispersion parameters. The algorithm cycles through: (1) updating the linear coefficients β and γ; (2) updating the spline coefficients for f_k and g_k; (3) sampling the inclusion indicators from their Bernoulli conditional posteriors; (4) updating the dispersion parameters with a Metropolis step; and (5) updating the smoothing hyper‑parameters. Convergence diagnostics (Gelman–Rubin statistics, trace plots) are used to check that the posterior samples are reliable.
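As a self-contained illustration of steps (1) and (3), here is a toy Gibbs sampler for a single coefficient with an inclusion indicator under a conjugate normal prior. The model (Gaussian noise, one covariate), the priors, and all numbers are deliberate simplifications for exposition, not the paper's algorithm:

```python
import numpy as np

# Toy model: y = delta * beta * x + N(0, 1) noise, with beta ~ N(0, tau2)
# and delta ~ Bernoulli(0.5). With prior odds of 1, the posterior odds of
# inclusion equal the Bayes factor, which is available in closed form.
rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)   # the true model includes the covariate

tau2 = 10.0                        # illustrative prior variance for beta
xx, xy = x @ x, x @ y
delta_draws = []
for _ in range(2000):
    # (3) Indicator update: log Bayes factor for including x, from the
    # closed-form marginal likelihoods (shared constants cancel). In the
    # full algorithm these quantities would depend on the current values
    # of the other blocks and be recomputed each sweep.
    v = 1.0 / (xx + 1.0 / tau2)            # posterior variance of beta
    log_bf = 0.5 * np.log(v / tau2) + 0.5 * v * xy**2
    p_in = 1.0 / (1.0 + np.exp(-log_bf))   # Bernoulli posterior probability
    delta = rng.binomial(1, p_in)
    # (1) Coefficient update: conjugate normal draw when included, else 0.
    beta = rng.normal(v * xy, np.sqrt(v)) if delta else 0.0
    delta_draws.append(delta)

print(np.mean(delta_draws))   # estimated posterior inclusion probability
```

With a strong true signal, as here, the sampler includes the covariate in essentially every sweep; shrinking the true coefficient toward zero drives the inclusion frequency down accordingly.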

A key contribution is the incorporation of model averaging. After MCMC, each sampled model configuration receives a posterior probability; predictions are formed by averaging over all sampled models weighted by these probabilities. This approach mitigates the risk of over‑fitting to a single “best” model and provides calibrated predictive intervals that reflect model uncertainty.
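The averaging step can be sketched as follows, with made-up visit counts and per-model predictions standing in for actual MCMC output; visit frequencies estimate the posterior model probabilities, which then weight the models' predictions:

```python
# Hypothetical MCMC output: how often each model configuration was visited.
# The configuration names, counts, and predictions are all illustrative.
visits = {"linear": 120, "smooth": 850, "null": 30}
total = sum(visits.values())
weights = {m: c / total for m, c in visits.items()}   # posterior model probs

# Each configuration's predictive mean at some new covariate value.
preds = {"linear": 2.1, "smooth": 2.6, "null": 1.8}

# Model-averaged prediction: posterior-probability-weighted combination.
y_hat = sum(weights[m] * preds[m] for m in visits)
```

Averaging the full predictive distributions, rather than only the means as above, is what yields intervals that also reflect model uncertainty.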

The authors evaluate the method through extensive simulation studies. In scenarios with strong overdispersion and nonlinear mean effects, the proposed approach recovers the true functional forms more accurately than a standard GLM, a GAM, or a non‑Bayesian double‑exponential model without variable selection. It achieves higher true‑positive rates for variable inclusion, lower false‑discovery rates, and reduced root‑mean‑square prediction error. When the true mean is linear but the variance is highly heterogeneous, the method correctly identifies the need for a flexible dispersion component, leading to substantially lower bias in the estimates of φ.

Two real‑world applications illustrate practical utility. The first uses automobile insurance claim data, where age and prior accident history exhibit nonlinear effects on the expected claim amount, while vehicle type influences only the dispersion. Model‑averaged predictions produce claim‑size intervals that align closely with observed losses, outperforming a conventional GLM that underestimates tail risk. The second application examines air‑pollution metrics (PM₂.₅, NO₂, temperature, humidity) and respiratory disease incidence. Here, PM₂.₅ and temperature affect the mean incidence nonlinearly, whereas humidity drives a nonlinear increase in dispersion, capturing heightened variability on humid days. The resulting risk assessments are more nuanced than those obtained from a standard GAM that ignores dispersion heterogeneity.

In conclusion, the paper delivers a comprehensive Bayesian solution for overdispersed count and continuous data that (i) jointly models mean and variance in a semiparametric additive fashion, (ii) performs simultaneous variable selection and linear‑versus‑nonlinear decision making, and (iii) incorporates model averaging to quantify predictive uncertainty. The methodology is computationally tractable via MCMC, adaptable to a wide range of exponential‑family responses, and demonstrably superior in both simulated and real data contexts. Future directions suggested include scaling the approach to high‑dimensional “big data” settings through more aggressive sparsity priors, extending to multivariate or spatially correlated outcomes, and exploring variational inference alternatives for faster computation.

