Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models

📝 Abstract

We express the mean and variance terms in a double exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model, and whether they enter linearly or flexibly. When the variance term is null we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is estimated using Markov chain Monte Carlo simulation and the methodology is illustrated using real and simulated data sets.

📄 Content

Flexibly modelling the response variance in regression is important for efficient estimation of mean parameters, for correct inference, and for understanding the sources of variability in the response. Response distributions commonly used for non-Gaussian data, such as the binomial and Poisson, are natural and interpretable, but their variance is a function of the mean. Real data often exhibit more variability than this mean-variance relationship implies, a phenomenon referred to as overdispersion. Underdispersion, where the data exhibit less variability than expected, can also occur, although it is less frequent.
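As a rough illustration of overdispersion, the following Python sketch compares the variance-to-mean ratio of plain Poisson counts with that of gamma-mixed Poisson counts. The helper name `dispersion_index`, the seed and the parameter values are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def dispersion_index(y):
    """Sample variance divided by sample mean (close to 1 for Poisson data)."""
    y = np.asarray(y, dtype=float)
    return y.var(ddof=1) / y.mean()

rng = np.random.default_rng(0)
n, mu, shape = 20_000, 4.0, 2.0   # illustrative values only

# Plain Poisson counts: the variance equals the mean.
y_pois = rng.poisson(mu, size=n)

# Gamma-mixed Poisson counts: the rate varies between observations,
# so the variance is mu + mu**2 / shape, which exceeds the mean.
rates = rng.gamma(shape, mu / shape, size=n)
y_mix = rng.poisson(rates)

d_pois = dispersion_index(y_pois)
d_mix = dispersion_index(y_mix)
print(f"Poisson index: {d_pois:.2f}, mixed Poisson index: {d_mix:.2f}")
```

Under these assumed settings the theoretical index of the mixed counts is 1 + mu / shape = 3, against 1 for the Poisson counts.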

Generalized linear models (GLMs) have traditionally been used to model non-Gaussian regression data (Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989), where the response y has a distribution from the exponential family and a transformation of the mean response is a linear function of predictors. This framework is extended to generalized additive models (GAMs) by Hastie and Tibshirani (1990) where a transformation of the mean is modelled as a flexible additive function of the predictors. However, the restriction to the exponential family in GLMs and GAMs is sometimes not general enough. While there is often strong motivation for using exponential family distributions on the grounds of interpretability, the variance of these distributions is a function of the mean and the data often exhibit greater variability than is implied by such mean-variance relationships.
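In symbols, the two model classes and the mean-variance constraint can be written as below. The notation (link g, coefficients \beta_j, smooth functions f_j, variance function V, scale \phi) is standard but is assumed here; the excerpt does not fix its own notation.

```latex
\begin{align*}
\text{GLM:}\quad & g(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
  \qquad \mu_i = E(y_i), \\
\text{GAM:}\quad & g(\mu_i) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}), \\
\text{variance:}\quad & \operatorname{Var}(y_i) = \phi \, V(\mu_i),
  \qquad \text{e.g. } V(\mu) = \mu,\ \phi = 1 \text{ for the Poisson.}
\end{align*}
```

Once the mean is modelled, the Poisson and binomial leave no free parameter for the variance, which is why overdispersed data fall outside these models.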

Quasi-likelihood (Wedderburn, 1974) provides one simple approach to inference in the presence of overdispersion, where the exponential family assumption is dropped and only a model for the mean is given with the response variance a function of the mean up to a multiplicative constant. However, this approach does not allow overdispersion to be modelled as a function of covariates. An extension of quasi-likelihood which allows this is the extended quasi-likelihood of Nelder and Pregibon (1987), but in general extended quasi-likelihood estimators may not be consistent (Davidian and Carroll, 1988).
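A minimal sketch of the constant-dispersion quasi-likelihood idea, assuming a log-link mean model and Var(y) = phi * mu: the mean is fitted by iteratively reweighted least squares and phi is estimated by the Pearson statistic divided by the residual degrees of freedom. The function names, the simulated data and the use of the Pearson estimator are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_log_link_irls(X, y, n_iter=50, tol=1e-10):
    """Fit E(y) = exp(X @ beta) with variance function V(mu) = mu by IRLS."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())           # assumes X[:, 0] is an intercept column
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        XtW = X.T * mu                   # working weights equal mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def pearson_dispersion(X, y, beta):
    """Pearson chi-square divided by residual degrees of freedom."""
    mu = np.exp(X @ beta)
    return np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])

# Simulated counts with Var(y) = phi * mu (illustrative values only).
rng = np.random.default_rng(1)
n, phi_true = 5_000, 3.0
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.0])
mu_true = np.exp(X @ beta_true)
# Gamma-mixed Poisson with shape proportional to mu gives variance phi * mu.
rates = rng.gamma(mu_true / (phi_true - 1.0), phi_true - 1.0)
y = rng.poisson(rates).astype(float)

beta_hat = fit_log_link_irls(X, y)
phi_hat = pearson_dispersion(X, y, beta_hat)
print(beta_hat, phi_hat)
```

Because phi is a single constant here, the sketch cannot let the dispersion depend on covariates, which is the limitation noted above.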

A working normal likelihood for estimating mean and variance parameters can also be used to model overdispersion (Peck et al., 1984). However, a difficulty with this approach is that estimation of the mean parameters is not robust when the variance function is incorrectly specified.

Alternatively, a generalized least squares estimating equation for mean parameters can be combined with a normal score estimating equation for variance parameters, a procedure referred to as pseudolikelihood (Davidian and Carroll, 1987). Both the working normal likelihood and pseudolikelihood approaches are related to the theory of generalized estimating equations (see Davidian and Giltinan, 1995, p. 57). Additive extensions of generalized estimating equations are considered by Wild and Yee (1996). Smyth (1989) considers modelling the mean and variance in a parametric class of models which allows normal, inverse Gaussian and gamma response distributions, and a quasi-likelihood extension is also proposed which uses a similar approach to pseudolikelihood for estimation of variance parameters. Smyth and Verbyla (1999) consider extensions of residual maximum likelihood (REML) estimation of variance parameters to double generalized linear models where dispersion parameters are modelled linearly in terms of covariates after transformation by a link function.
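The alternation described at the start of this paragraph can be sketched as follows, under strong simplifying assumptions: a linear mean E(y) = X @ beta, a log-linear variance Var(y) = exp(Z @ gamma), and one Fisher scoring step on the normal log-likelihood per iteration for the variance parameters. The function name `pseudolikelihood_fit` and the simulated data are hypothetical, and the sketch should not be read as the exact procedure of Davidian and Carroll (1987).

```python
import numpy as np

def pseudolikelihood_fit(X, Z, y, n_iter=50):
    """Alternate weighted least squares for beta with a normal-score step for gamma."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from ordinary least squares
    gamma = np.zeros(Z.shape[1])
    gamma[0] = np.log(np.mean((y - X @ beta) ** 2))    # assumes Z[:, 0] is an intercept
    ZtZ = Z.T @ Z
    for _ in range(n_iter):
        # Mean step: generalized least squares with weights 1 / variance.
        w = np.exp(-(Z @ gamma))
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        # Variance step: Fisher scoring on the normal log-likelihood,
        # holding beta fixed at its current value.
        r2 = (y - X @ beta) ** 2
        score_resid = r2 * np.exp(-(Z @ gamma)) - 1.0
        gamma = gamma + np.linalg.solve(ZtZ, Z.T @ score_resid)
    return beta, gamma

# Simulated heteroscedastic Gaussian data (illustrative values only).
rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
Z = X.copy()
beta_true = np.array([1.0, 2.0])
gamma_true = np.array([-1.0, 2.0])
y = X @ beta_true + rng.normal(size=n) * np.exp(0.5 * (Z @ gamma_true))

beta_hat, gamma_hat = pseudolikelihood_fit(X, Z, y)
print(beta_hat, gamma_hat)
```

The scoring update uses the fact that, for the normal log-likelihood with a log-linear variance, the expected information for gamma is proportional to Z.T @ Z, so each step reduces to a linear solve.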

Inference about mean and variance functions using estimating equations has the drawback that there is no fully specified model, making it difficult to deal with characteristics of the predictive distribution for a future response, other than its mean and variance. Model-based approaches to modelling overdispersion include exponential dispersion models and related approaches (Jorgensen, 1997; Smyth, 1989), the extended Poisson process models of Faddy (1997) and mixture models such as the beta-binomial, negative binomial and generalized linear mixed models (Breslow and Clayton, 1993; Lee and Nelder, 1996). One drawback of mixture models is that they are unable to model underdispersion. Generalized additive mixed models incorporating random effects in GAMs are considered by Lin and Zhang (1999). Both Yee and Wild (1996) and Rigby and Stasinopoulos (2005) consider very general frameworks for additive modelling and algorithms for estimating the additive terms. There is clearly scope for further research on inference: Rigby and Stasinopoulos (2005) suggest that one use for their methods is as an exploratory tool for a subsequent fully Bayesian analysis of the kind considered in our article. See also Brezger and Lang (2005) and Smith and Kohn (1996) for other recent work on Bayesian generalized additive models.
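The remark that mixture models cannot capture underdispersion follows from the law of total variance. As a short worked case, assume a Poisson mixture in which y given \lambda is Poisson with rate \lambda and the random rate has mean \mu (the symbols are assumed for the illustration):

```latex
\begin{align*}
\operatorname{Var}(y)
  &= E\{\operatorname{Var}(y \mid \lambda)\} + \operatorname{Var}\{E(y \mid \lambda)\} \\
  &= E(\lambda) + \operatorname{Var}(\lambda) \\
  &= \mu + \operatorname{Var}(\lambda) \;\ge\; \mu .
\end{align*}
```

The mixing term \operatorname{Var}(\lambda) is never negative, so the variance of such a mixture can equal or exceed the Poisson variance \mu but cannot fall below it.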

Our framework for flexibly modelling the mean and variance functions is based on the double exponential regression models introduced by Efron (1986), an appro

This content is AI-processed based on ArXiv data.
