arXiv:0810.5655v1 [stat.ME] 31 Oct 2008
The Annals of Statistics
2008, Vol. 36, No. 5, 2207–2231
DOI: 10.1214/07-AOS547
© Institute of Mathematical Statistics, 2008
GIBBS POSTERIOR FOR VARIABLE SELECTION IN
HIGH-DIMENSIONAL CLASSIFICATION AND DATA MINING1
By Wenxin Jiang and Martin A. Tanner
Northwestern University
In the popular approach of “Bayesian variable selection” (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables “$K$” can be much larger than the sample size “$n$.” In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.
1. Introduction.

The problem of interest here is to predict $y$, a $\{0,1\}$ response, based on $x$, a vector of predictors of dimension $\dim(x) = K$. We have $D_n = (y^{(i)}, x^{(i)})_{i=1}^{n}$, the observed data with sample size $n$, typically assumed to form $n$ i.i.d. (independent and identically distributed) copies of $(y, x)$.

One is often interested in modeling the relation between $y$ and $x$, selecting components of $x$ that are most relevant to $y$, and predicting $y$ using selected information from $x$.
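As a concrete illustration of this setup, the sketch below simulates such a data set $D_n$ with $K > n$ under an assumed sparse logit-linear truth (as in the models discussed later); the seed, dimensions and coefficient values are arbitrary choices for illustration only.

```python
import numpy as np

# Simulate D_n = (y^(i), x^(i))_{i=1..n}: n i.i.d. copies of (y, x),
# with K candidate predictors and K > n (high-dimensional setting).
rng = np.random.default_rng(0)
n, K = 50, 200                      # sample size smaller than dimension
X = rng.normal(size=(n, K))         # rows are the x^(i)
beta_true = np.zeros(K)
beta_true[:3] = [1.5, -2.0, 1.0]    # only a sparse subset of x matters
mu = 1.0 / (1.0 + np.exp(-(X @ beta_true)))  # P(y = 1 | x), logit linear
y = rng.binomial(1, mu)             # the {0,1} responses y^(i)
```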
In the approach of Bayesian variable selection (BVS), one chooses components of $x$ according to some probability distribution (prior and posterior). The BVS approach is very popular for handling high-dimensional data (with large dimension $K$, sometimes larger than the sample size $n$), and has had a wide range of successful applications. See, for example, Smith and Kohn (1996), George and McCulloch (1997), Gerlach, Bird and Hall (2002), Lee, Sha, Dougherty, Vannucci and Mallick (2003), Zhou, Liu and Wong (2004) and Dobra, Hans, Jones, Nevins, Yao and West (2004), among others.

Received February 2007; revised August 2007.
1 Supported in part by NSF Grant DMS-07-06885.
AMS 2000 subject classifications. Primary 62F99; secondary 82-08.
Key words and phrases. Data augmentation, data mining, Gibbs posterior, high-dimensional data, linear classification, Markov chain Monte Carlo, prior distribution, risk performance, sparsity, variable selection.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2008, Vol. 36, No. 5, 2207–2231. This reprint differs from the original in pagination and typographic detail.
For classification purposes, a regression model $p = p(y|x)$ ($y \in \{0,1\}$) is typically assumed to be logit linear or probit linear and parameterized by a parameter $\beta$; that is, $p(y|x) = \mu^{y}(1-\mu)^{1-y}$, where $\mu = \exp(x^T\beta)/\{1+\exp(x^T\beta)\}$ (for logistic regression) or $\mu = \int_{-\infty}^{x^T\beta}(2\pi)^{-1/2}e^{-u^2/2}\,du$ (for probit regression). A prior on $p$ is then induced by placing a prior on the parameter $\beta$, forcing most of its components to be zero, such that only a low-dimensional subset of $x$ is selected in regression. The corresponding posterior follows a standard Bayesian treatment as $(\text{posterior}) \propto (\text{likelihood}) \times (\text{prior}) \propto \{\prod_{i=1}^{n} p(y^{(i)}|x^{(i)})\} \times (\text{prior})$. A number of things can be generated from this posterior: the parameter $\beta$, the conditional density $p(y|x)$, the mean function $\mu$, as well as the classification rule (for $y$) $I[\mu > 0.5] = I[x^T\beta > 0]$. Jiang (2007) has shown that under certain regularity conditions, the prior can be specified to render near-optimal posterior performance for density estimation, mean estimation and classification.
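The quantities above can be made concrete in a short sketch. Assuming a logistic model with a generic Gaussian prior on $\beta$ (the prior scale `prior_sd` is a hypothetical choice for illustration, not the paper's specification), the unnormalized log posterior and the induced classification rule $I[x^T\beta > 0]$ are:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of the logistic model: sum_i log p(y_i | x_i, beta)."""
    eta = X @ beta                           # linear predictor x^T beta
    # log mu = eta - log(1 + e^eta); log(1 - mu) = -log(1 + e^eta)
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def log_posterior_kernel(beta, X, y, prior_sd=10.0):
    """Unnormalized log posterior: log-likelihood + Gaussian log-prior."""
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return log_likelihood(beta, X, y) + log_prior

def classify(beta, X):
    """Classification rule I[x^T beta > 0], equivalent to I[mu > 0.5]."""
    return (X @ beta > 0).astype(int)
```

An MCMC sampler would evaluate `log_posterior_kernel` at proposed values of `beta`; the normalizing constant is never needed.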
The current paper introduces a new direction to BVS. Unlike Jiang (2007), we will construct a modified posterior (called Gibbs posterior) using a risk function of interest (such as the classification error) directly, instead of using the usual likelihood-based Bayesian posterior. We will first focus on the statistical properties (e.g., classification performance) of BVS with a Gibbs posterior. (Section 7 will handle the algorithmic aspects.)
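The construction just described can be sketched as follows: the log-likelihood term of the usual posterior is replaced by a scaled, negated empirical risk, so no probability model for the data is assumed. The inverse-temperature constant `psi` and the Gaussian prior below are hypothetical illustrations, not the paper's exact specification.

```python
import numpy as np

def empirical_risk(beta, X, y):
    """Empirical classification error R_n(beta) = n^{-1} sum_i I[y_i != I(x_i^T beta > 0)]."""
    preds = (X @ beta > 0).astype(int)
    return np.mean(preds != y)

def log_gibbs_kernel(beta, X, y, psi=1.0, prior_sd=10.0):
    """Unnormalized log Gibbs posterior: -psi * n * R_n(beta) + log-prior.

    The likelihood of the usual Bayesian posterior is replaced by
    exp(-psi * n * R_n), built directly from the risk of interest;
    psi is a tuning constant (its default here is an arbitrary choice).
    """
    n = len(y)
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return -psi * n * empirical_risk(beta, X, y) + log_prior
```

Because the empirical risk, not a likelihood, drives the posterior weights, a misspecified probability model for $p(y|x)$ cannot degrade the target criterion.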
A problem with the usual Bayesian posterior.

Below, we first demonstrate by a simple example that in case of model misspecification, the usual likelihood-based BVS can provide suboptimal performance. Later our theory will suggest that the proposed BVS with Gibbs posterior can improve over the usual approach, since we will show that the proposed method can still achieve near-optimality in some sense, despite the potential misspecification.
In Jiang (2007), it is assumed that the true model (with density p∗) is
of a known transformed linear form, say, logit linear, so that ln{p∗(y =
1|x)/p∗(y
…(Full text truncated)…