Gibbs posterior for variable selection in high-dimensional classification and data mining
Authors: **Wenxin Jiang (Northwestern University), Martin A. Tanner (Northwestern University)**
The Annals of Statistics, 2008, Vol. 36, No. 5, 2207–2231. DOI: 10.1214/07-AOS547. © Institute of Mathematical Statistics, 2008.

Abstract. In the popular approach of "Bayesian variable selection" (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables "K" can be much larger than the sample size "n." In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.

Received February 2007; revised August 2007. Supported in part by NSF Grant DMS-07-06885.
AMS 2000 subject classifications. Primary 62F99; secondary 82-08.
Key words and phrases. Data augmentation, data mining, Gibbs posterior, high-dimensional data, linear classification, Markov chain Monte Carlo, prior distribution, risk performance, sparsity, variable selection.

1. Introduction. The problem of interest here is to predict y, a {0,1} response, based on x, a vector of predictors of dimension dim(x) = K. We have D_n = (y^{(i)}, x^{(i)})_1^n, the observed data with sample size n, typically assumed to form n i.i.d. (independent and identically distributed) copies of (y, x). One is often interested in modeling the relation between y and x, selecting components of x that are most relevant to y, and predicting y using selected information from x.

In the approach of Bayesian variable selection (BVS), one chooses components of x according to some probability distribution (prior and posterior). The BVS approach is very popular for handling high-dimensional data (with large dimension K, sometimes larger than the sample size n), and has had a wide range of successful applications. See, for example, Smith and Kohn (1996), George and McCulloch (1997), Gerlach, Bird and Hall (2002), Lee, Sha, Dougherty, Vannucci and Mallick (2003), Zhou, Liu and Wong (2004) and Dobra, Hans, Jones, Nevins, Yao and West (2004), among others.

For classification purposes, a regression model p = p(y|x) (y ∈ {0,1}) is typically assumed to be logit linear or probit linear and parameterized by a parameter β, that is, p(y|x) = µ^y (1 − µ)^{1−y}, where µ = exp(x^T β)/(1 + exp(x^T β)) (for logistic regression) or µ = ∫_{−∞}^{x^T β} (2π)^{−1/2} e^{−u^2/2} du (for probit regression).
A prior on p is then induced by placing a prior on the parameter β, forcing most of its components to be zero, such that only a low-dimensional subset of x is selected in regression. The corresponding posterior follows a standard Bayesian treatment as (posterior) ∝ (likelihood) × (prior) ∝ {∏_{i=1}^n p(y^{(i)}|x^{(i)})} × (prior). A number of things can be generated from this posterior: the parameter β, the conditional density p(y|x), the mean function µ, as well as the classification rule (for y) I[µ > 0.5] = I[x^T β > 0]. Jiang (2007) has shown that under certain regularity conditions, the prior can be specified to render near-optimal posterior performance for density estimation, mean estimation and classification.

The current paper introduces a new direction to BVS. Unlike Jiang (2007), we will construct a modified posterior (called the Gibbs posterior) using a risk function of interest (such as the classification error) directly, instead of using the usual likelihood-based Bayesian posterior. We will first focus on the statistical properties (e.g., classification performance) of BVS with a Gibbs posterior. (Section 7 will handle the algorithmic aspects.)

A problem with the usual Bayesian posterior. Below, we first demonstrate by a simple example that in case of model misspecification, the usual likelihood-based BVS can provide suboptimal performance. Later our theory will suggest that the proposed BVS with Gibbs posterior can improve over the usual approach, since we will show that the proposed method can still achieve near-optimality in some sense, despite the potential misspecification.

In Jiang (2007), it is assumed that the true model (with density p*) is of a known transformed linear form, say, logit linear, so that ln{p*(y = 1|x)/p*(y = 0|x)} is linear in the predictors x_1, ..., x_K, which can be, for example, expressions of K candidate genes. Suppose we denote the true model by p* and the set of all logit linear models by Λ. Then the assumption says p* ∈ Λ.

What if this assumption (e.g., logit linearity) is not true, so that higher-order terms and interactions are important but not included? That is, what if the prior proposes densities in Λ, but the true density p* ∉ Λ? Then the usual likelihood-based posterior will propose densities that are consistent for (often close to) p_KL = arg min_{p∈Ω} ∫ p* ln(p*/p), a minimizer of the KL (Kullback–Leibler) difference KL(p*, p) = ∫ p* ln(p*/p), under some regularity conditions; see Kleijn and van der Vaart (2006). However, this limit p_KL (of the usual Bayesian inference) may have a suboptimal risk performance! That is, one can have R̃(p_KL) > inf_{p∈Ω} R̃(p) for a risk function of practical interest such as the classification error R̃(p) = P*{y ≠ I[p(y = 1|x) > 0.5]}. [See Devroye et al. (1996), Section 4.6 (least squares) and Section 15.2 (maximum likelihood) for some related comments.]

An example. Consider the case when P*(x = ±1) = λ, P*(x = 0) = 1 − 2λ for some λ ∈ (0, 0.25), and P*(y = 1|x) = 1 − P*(y = 0|x) = I[x ≠ 0], which define the true density p*. Let Λ be the set of densities from logistic regression with p(y = 1|x) = e^{α+xβ}/(1 + e^{α+xβ}), α, β ∈ ℜ. Note that p* ∉ Λ.
The logistic regression model is misspecified. According to the KL criterion, the best choice is p_KL(y = 1|x) = 2λ < 0.5. This is what the usual posterior-based logistic regression will converge to according to Kleijn and van der Vaart (2006). The resulting classifier C_KL(x) = I[p_KL(y = 1|x) > 0.5] = 0 always predicts 0, and the resulting classification error is R̃(p_KL) = 2λ. On the other hand, R̃(p) is minimized to be R̃(p_R) = λ at p = p_R, where, for example, p_R(y = 1|x) = e^{x−0.7}/(1 + e^{x−0.7}), which corresponds to a linear classification rule C_R = I[p_R(y = 1|x) > 0.5] = I[x − 0.7 > 0]. For example, when λ = 0.125, R̃(p_KL) = 0.25 > R̃(p_R) = 0.125, even though both p_KL, p_R ∈ Λ. So, when the model is misspecified, the usual posterior-based logistic regression is not reliable; it produces suboptimal classification error even from among the misspecified logistic regression models Λ.

In such situations with model misspecification, a modified posterior directly related to the risk function of interest, called the Gibbs posterior, can still perform very well, unlike the usual likelihood-based (Bayesian) posterior. In Section 2, we discuss the Gibbs posterior for risk minimization. What is the Gibbs posterior? How is it interpreted? In addition, we incorporate a smoothed risk function into the Gibbs posterior for computing ease. Then we describe how to evaluate the risk performance of the proposed method in two scenarios in Section 3. In Section 4, we introduce for the first time the framework of BVS with the Gibbs posterior, which is intended to effectively handle high-dimensional data. Then we provide some results on classification performance in Section 5, which show that BVS with the Gibbs posterior can perform very well in some sense, despite high dimensionality, without assuming that the true model is logit linear. These results use a special kind of normal-binary prior. The results are proved under a more general framework in Section 6, using more general conditions on the prior and on the risk function. In particular, this covers more general risk functions used in data mining (in addition to classification performance). Some preparatory results for the proofs will be presented in Section 6.1. Section 7 will handle the algorithmic aspects of sampling from the Gibbs posterior with variable selection. We will show that a convenient and modular Markov chain Monte Carlo (MCMC) algorithm is available based on data augmentation [Tanner and Wong (1987)], so that all sampling steps are based on standard distributions.

2. Risk minimization with Gibbs posterior. The previous example shows that for the purpose of minimizing the classification error R̃(p) over the logit linear models p ∈ Λ, it is preferable not to use p proposed from the usual likelihood-based (Bayesian) posterior over p ∈ Λ of the form (posterior) ∝ (likelihood) × (prior). Note that for logit linear models p ∈ Λ, the classification rule I[p(y = 1|x) > 0.5] = I[x^T β > 0] forms a linear decision rule (indexed by β). We are interested in minimizing

R̃(p) = P*{y ≠ I[p(y = 1|x) > 0.5]} = P*{y ≠ I(x^T β > 0)} ≡ R(β).
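As a quick numerical check of the example above (a sketch added here for illustration, not part of the paper), the two classification errors can be computed exactly under the stated P*:

```python
# Exact risks for the misspecification example, with lambda = 0.125.
lam = 0.125                                  # lambda in (0, 0.25)
px = {1: lam, -1: lam, 0: 1 - 2 * lam}       # P*(x)
py1 = {1: 1.0, -1: 1.0, 0: 0.0}              # P*(y = 1 | x) = I[x != 0]

def risk(classify):
    """Classification error P*{y != classify(x)} for a rule x -> {0, 1}."""
    err = 0.0
    for x, p in px.items():
        c = classify(x)
        err += p * (py1[x] * (1 - c) + (1 - py1[x]) * c)
    return err

# The KL-optimal logistic fit is the constant p_KL(y=1|x) = 2*lam < 0.5,
# so its classifier always predicts 0.
print(risk(lambda x: 0))                     # 0.25  = 2*lam = error of p_KL
# p_R(y=1|x) = e^{x-0.7}/(1+e^{x-0.7}) classifies via I[x - 0.7 > 0].
print(risk(lambda x: int(x - 0.7 > 0)))      # 0.125 = lam   = error of p_R
```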
For this purpose, there is really no need to assume a probability model p and interpret β as a parameter associated with p. Instead, we can think of β as indexing a linear decision rule I[x^T β > 0] and try to minimize a risk function R(β) = P*{y ≠ I(x^T β > 0)}. For this purpose, it is better to use a Gibbs posterior over β ∈ Ω for some parameter space Ω ⊂ ℜ^{K_n}:

ω(dβ|D_n) = w(β|D_n) π(dβ) = e^{−nψ R_n(β)} π(dβ) / ∫_{β∈Ω} e^{−nψ R_n(β)} π(dβ),

where π is a prior over β ∈ Ω, and ψ > 0 is a constant to be explained later in this section. Here R_n is a sample version of R depending on the (i.i.d.) data D_n. Examples include:

(i) R_n = n^{−1} Σ_{i=1}^n I[y^{(i)} ≠ A_i] = −ψ^{−1} n^{−1} Σ_{i=1}^n log{A_i e^{ψ(y^{(i)}−1)} + (1 − A_i) e^{−ψ y^{(i)}}}, where A_i = I[p(y^{(i)} = 1|x^{(i)}) > 0.5] = I[(x^{(i)})^T β > 0];

(ii) R_n = −ψ^{−1} n^{−1} Σ_{i=1}^n log{Φ_i e^{ψ(y^{(i)}−1)} + (1 − Φ_i) e^{−ψ y^{(i)}}}, where Φ_i = Φ(σ_n^{−1} (x^{(i)})^T β), Φ is the standard normal cumulative distribution function and σ_n is a scaling factor.

Choices (ii) and (i) are close when σ_n → 0, but choice (ii) makes R_n smooth in β! Later on (in Remark 2 and Section 7) we will see that R_n in (ii) is related to a mixture model and can be used to simplify the posterior simulation.

The Gibbs posterior density w (with respect to the prior π) minimizes a combination of an averaged sample risk and a penalty against the "change in knowledge" (from prior π to posterior wπ). Such an interpretation is given in Zhang (2006a), Proposition 5.1 and Zhang (2006b), Section IV.

Proposition 1 [Zhang (2006a, 2006b)]. The Gibbs posterior density

w = e^{−nψ R_n} / ∫_{β∈Ω} e^{−nψ R_n} π(dβ)

minimizes ∫_{β∈Ω} w n R_n(β) π(dβ) + ψ^{−1} KL(wπ(dβ), π(dβ)) over all densities w on Ω with respect to the prior π. Here KL(wπ(dβ), π(dβ)) = ∫_{β∈Ω} w (log w) π(dβ).

The parameter ψ^{−1} in the Gibbs posterior is related to the temperature in statistical mechanics and was used, for example, in Geman and Geman (1984) when studying simulated annealing. The case of zero or very low temperature corresponds to deterministic empirical risk minimization. Allowing nonzero temperatures results in a more general setup of random estimation and allows potential improvement over the deterministic approach. The temperature ψ^{−1} is typically treated as a given constant [e.g., in Zhang (2006b)], but when necessary, an optimal temperature [e.g., Zhang (1999)] may be obtained by, for example, cross validation, as mentioned in Zhang (2006b).

This framework of the Gibbs posterior has been overlooked by most statisticians for a long time, especially when compared to the long-term popularity of the (likelihood-based) Bayesian posterior. Recently, however, the sequence of papers by Zhang (1999, 2006a, 2006b) has laid a foundation for understanding the statistical behavior of the Gibbs posterior, which we believe will open a productive new line of research. While Zhang's (2006b) work concerns fundamental convergence properties of the Gibbs posterior in general, our work focuses on the aspect of variable selection, which is important for handling high-dimensional data with the Gibbs posterior (see the counterexample in Section 4.1).
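To make the smoothed risk in choice (ii) concrete, here is a minimal sketch (not the authors' code; the default values psi = 1.0 and sigma_n = 0.1 are illustrative assumptions) of R_n and of the unnormalized Gibbs posterior weight e^{−nψR_n(β)}:

```python
import numpy as np
from scipy.stats import norm

def smoothed_risk(beta, X, y, psi=1.0, sigma_n=0.1):
    """R_n(beta) = -(n*psi)^{-1} sum_i log{Phi_i e^{psi(y_i-1)} + (1-Phi_i) e^{-psi y_i}},
    with Phi_i = Phi(x_i' beta / sigma_n); smooth in beta, and close to the empirical
    classification error as sigma_n -> 0."""
    Phi = norm.cdf(X @ beta / sigma_n)
    terms = Phi * np.exp(psi * (y - 1)) + (1 - Phi) * np.exp(-psi * y)
    return -np.mean(np.log(terms)) / psi

def gibbs_log_weight(beta, X, y, psi=1.0, sigma_n=0.1):
    """log of e^{-n*psi*R_n(beta)}; multiplied by the prior this gives the Gibbs
    posterior up to its normalizing constant."""
    return -len(y) * psi * smoothed_risk(beta, X, y, psi, sigma_n)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))
y = (X[:, 0] > 0).astype(float)
print(smoothed_risk(np.array([1.0, 0.0, 0.0]), X, y))
```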
In addition, we allow a computation-friendly smoothed risk function R_n to be used in a proposed algorithm later. Also, Zhang (2006b) has considered the case with high temperature (small ψ), while our result holds for any ψ, even for low temperature, which might be of interest. It might be of interest to use, for example, a low temperature to recover the results from empirical risk minimization (or maximize the Gibbs posterior) using an approach similar to simulated annealing. Also, we expect that the MCMC algorithm in Section 7 may have better convergence behavior in the low-temperature case since it will depend on the data more heavily.

3. Critical questions on risk performance: two scenarios. Define P_{β,D} as the joint distribution based on p*(D_n) w(β|D_n), with E_{β,D} being the corresponding expectation. This corresponds to randomly generating data D_n from the true density p* and then selecting β randomly from the Gibbs posterior ω(dβ|D_n). The word "often" in the following statements refers to a high probability in P_{β,D}.

Let R(β) be a risk function such as R(β) = P*[y ≠ I(x^T β > 0)]. We will denote inf_{β∈B} R(β) = inf R(B) for a set of decision rules I(x^T β > 0) indexed by β ∈ B. We will address the following question.

With high-dimensional data [K = dim(x) ≫ n], will the Gibbs posterior (with variable selection) often lead to a good risk performance which is competitive to all models in B? That is, will the method often propose β such that R(β) ≤ inf_{β∈B} R(β) + (small δ)?

We will answer this question in two scenarios with a trade-off between the strengths of assumptions and results. Scenario I will involve more assumptions (including a sparseness assumption) but better risk performance (competitive to a bigger set of models B). Scenario II will involve fewer assumptions (allowing nonsparse cases) but will guarantee a less optimal risk performance (competitive to a smaller set of models B).

The Scenario-I treatment uses a bigger set B = Ω, which here corresponds to the set of all linear decision rules (see Section 4.2 for a more precise definition). We will try to show posterior performance competitive to all linear rules ["often" R(β) ≤ inf_{β∈Ω} R(β) + δ]. We will typically need to assume that a best linear rule in Ω satisfies some sparseness conditions: β_R ∈ H, where β_R is a minimizer of R over Ω and H is a "sparse subset" of Ω satisfying some sparseness conditions.

The Scenario-II treatment will address a smaller set B = H, which corresponds to some set of sparse linear decision rules. We will try to show posterior performance competitive to all sparse linear rules ["often" R(β) ≤ inf_{β∈H} R(β) + δ]. Although the results are competitive to fewer rules, the assumptions needed are also less restrictive: we no longer need to assume that a best linear rule is sparse (β_R ∈ H).

This study is about a "nearly best" performance over a set of decision rules in B, while not assuming a true probability model for the data. This is similar to the "persistence" study for risk minimization by Greenshtein (2006), in a frequentist approach. We now are considering the Bayesian analog, so the use of the prior π will also matter, which will form part of the regularity conditions.
The questions raised in this section will be answered in the next two sections.

4. BVS with a Gibbs posterior. To answer the questions in Section 3 on risk performance, we first give an example to show the need of variable selection in the high-dimensional case. Without variable selection, even if the Gibbs posterior is used, the risk performance may still be very poor when K = dim(x) ≫ n. With variable selection (to be described in Sections 4.2 and 4.3), however, we will show later (in Section 5) that the risk performance can be very good in the two scenarios described in Section 3.

4.1. An example: high-dimensional classification with Gibbs posterior without variable selection. Suppose the true model P* is specified by y = I[z = 1], where z is uniform over {1/K, ..., K/K}. Define x as the vector with components x_j = I[z = (K + 1 − j)/K], j = 1, ..., K. Note that the best linear classification rule can be written as I[x^T β > 0] where β = (1, 0, ..., 0)^T. This classification rule is equal to I[z = 1] = y and therefore has classification error R(β) = P*[y ≠ I(x^T β > 0)] = P*[y ≠ y] = 0. (Note that x^T β = β_{K+1−Kz}.) Such a perfect performance can be approximately achieved, due to the results later, using variable selection with the Gibbs posterior. (See, e.g., Section 5.) However, without variable selection, the use of the Gibbs posterior alone will not guarantee a good classification error.

For example, suppose that according to the prior π, the β_j's are i.i.d. N(0,1) (or more generally, any independent symmetric distributions which have π[β_j > 0] = π[β_j ≤ 0]). Suppose the Gibbs posterior ∝ e^{−nψ R_n} × π, where R_n depends on β through (x^{(i)})^T β (= β_{K+1−Kz^{(i)}}), i = 1, ..., n, where (x^{(i)})_j = I[z^{(i)} = (K + 1 − j)/K], j = 1, ..., K, and (y^{(i)}, z^{(i)})_1^n (data) and (y, z) form an i.i.d. sample. Note that the posterior for β_j will only be updated by the data if j ∈ ∆ ≡ {K + 1 − Kz^{(i)}}_{i=1}^n.

Consider the expected classification error E P*_{y,x}[y ≠ I(x^T β > 0)] = E P*_{y,z}[y ≠ I(β_{K+1−Kz} > 0)] (where E = E*_{(y^{(i)},z^{(i)})_1^n} E_{β|(y^{(i)},z^{(i)})_1^n}). This is the "overall" probability of misclassification P̃[y ≠ I(x^T β > 0)] = P̃[y ≠ I(β_{K+1−Kz} > 0)], where β is also random, in addition to the random y and z's. Here, the distribution P̃ is specified by noting that (y^{(i)}, z^{(i)})_1^n are i.i.d. from the true model P*, β|(y^{(i)}, z^{(i)})_1^n follows the Gibbs posterior, and (y, z) denotes an independent future observation from P*.

Suppose z ∉ {z^{(i)}}_1^n; then the posterior for β_{K+1−Kz} will not be updated by the data (y^{(i)}, z^{(i)})_1^n. So assuming the event z ∉ {z^{(i)}}_1^n, the conditional probability P̃[y ≠ I(β_{K+1−Kz} > 0) | z, {z^{(i)}}_1^n] is 0.5, since it is determined by the (un-updated) prior of β_{K+1−Kz}, which is symmetric about 0. Therefore the probability P̃{[y ≠ I(x^T β > 0)] ∩ [z ∉ {z^{(i)}}_1^n]} is 0.5 P*[z ∉ {z^{(i)}}_1^n] ≥ 0.5(1 − n/K), which can be close to 0.5 for K ≫ n.
This also forms a lower bound of P̃[y ≠ I(x^T β > 0)], which is bounded below by P̃{[y ≠ I(x^T β > 0)] ∩ [z ∉ {z^{(i)}}_1^n]}. Therefore, without variable selection, the expected classification error can be close to 50% when K ≫ n, even if the Gibbs posterior is used.

We now consider applying BVS with the Gibbs posterior for classification, when subsets of candidate variables are used to effectively handle high-dimensional data.

4.2. A parameterization. Consider a decision rule I(x^T β > 0) for β ∈ ℜ^{K_n} (x can include the constant component 1). The risk can be, for example, the misclassification probability R(β) = P*{y ≠ I(x^T β > 0)}. It is noted that the decision rule I(x^T β > 0) and the risk R(β) are not changed under the rescaling of β. Following the approach of Horowitz (1992), we suppose it is possible to use a standardization with |β_1| = 1 or β_1 ∈ {±1}, and define β^T = (β_1, β̃^T) ∈ {±1} × ℜ^{K_n−1}, and correspondingly x^T = (x_1, x̃^T). Let Ω_n denote the (standardized) parameter space Ω_n = {±1} × ℜ^{K_n−1}.

Characterize β by (γ, β_γ), where γ = (γ_j)_1^{K_n} is the "model" indicator with γ_j = I[β_j ≠ 0] (γ_1 = 1), telling which components of β are nonzero. For any vector v, the notation v_γ denotes the subset of the v_j's with γ_j = 1.

Note that in this parameterization, x_1 is always contained in the decision rule with coefficient being ±1. It can be a variable that we always want to keep for decision-making due to some practical considerations. We can still allow x_1 to have effectively very small impact on classification, by allowing other β̃ coefficients to be much larger. Adopting such a standardization reduces the redundancy of the parameterization and can improve the convergence of the algorithms when simulating the Gibbs posterior.

The Gibbs posterior is induced by a prior π on β ∈ Ω, which could be equivalently specified by putting a prior on the parameters (γ, β_γ). Then a Gibbs posterior is obtained as ω(dβ|D_n) ∝ e^{−nψ R_n(β)} π(dβ), as described in Section 2. Below we will first consider a normal-binary prior for (γ, β_γ).

4.3. A prior specification (normal-binary). For a vector v = (v_j)_1^d, we will denote its ℓ_p norm (p = 1, 2, ...) as |v|_p = (Σ_{j=1}^d |v_j|^p)^{1/p}, its ℓ_∞ norm as |v|_∞ = sup_{j=1,...,d} |v_j|, and its ℓ_0 norm as |v|_0 = Σ_{j=1}^d I[|v_j| > 0].

Suppose β ∈ Ω_n, with the standardization |β_1| = 1 as described above. Suppose for the prior π, (γ_j)_{j=2}^{K_n} (the "model" indicators) are i.i.d. binary with selection probability λ_n and size restriction r̄_n. Conceptually, one first generates γ̆ = (γ̆_j)_1^{K_n}, where γ̆_1 = 1 and (γ̆_j)_2^{K_n} are i.i.d. binary with selection probability λ_n. Then set γ = γ̆ only when |γ̆|_1 ≤ r̄_n. Suppose that conditional on γ, β_1 is independent of β̃_γ [the subset of (β_j)_2^{K_n} with γ_j = 1], β_1|γ = ±1 with probability 0.5 each, and β̃_γ|γ ∼ N(0, V_γ), according to the prior π.
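For concreteness, a minimal sketch of one draw from this normal-binary prior (an illustration under the simplifying assumption V_γ = I; the numerical values of K_n, λ_n and r̄_n below are arbitrary):

```python
import numpy as np

def draw_prior(K_n, lambda_n, r_bar, rng):
    # i.i.d. binary model indicators with selection probability lambda_n,
    # kept only when the size restriction |gamma|_1 <= r_bar holds.
    while True:
        gamma = rng.binomial(1, lambda_n, size=K_n)
        gamma[0] = 1                          # gamma_1 = 1 always
        if gamma.sum() <= r_bar:
            break
    beta = np.zeros(K_n)
    beta[0] = rng.choice([-1.0, 1.0])         # beta_1 = +/-1 with probability 0.5 each
    sel = np.flatnonzero(gamma[1:]) + 1
    beta[sel] = rng.normal(0.0, 1.0, size=sel.size)   # beta_gamma | gamma ~ N(0, I)
    return gamma, beta

rng = np.random.default_rng(1)
gamma, beta = draw_prior(K_n=1000, lambda_n=0.005, r_bar=20, rng=rng)
print(gamma.sum(), np.count_nonzero(beta))
```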
5. Results on risk performance for BVS with Gibbs posterior. This section will address the risk performance in the two scenarios described in Section 3, when BVS is applied to the Gibbs posterior as described in Sections 4.2 and 4.3. The risk function R(β) here is the classification error, while the Gibbs posterior is constructed from the smooth sample risk R_n(β) as described in Section 2 [choice (ii)].

Define the following collection of conditions. Different conditions will be used from this collection for different results, to enable a compressed description of many results.

0′. The candidate variables x_j are standardized to be between ±1 for all j.

0″. The conditional density p(x_1|x̃) with respect to the Lebesgue measure exists for all x and is bounded above by a constant S > 0.

1′. The rate δ_n is smaller than 1 and larger than n^{−1/2} log n in order. (1 ≻ δ_n ≻ n^{−1/2} log n.)

3′. The dimension K_n = dim(x) is high and is polynomial in n. (n ≺ K_n ≺ n^α for some α > 1.)

(σ). The smoothing parameter σ_n used in a sample version of R_n decreases to zero in some way as n increases. [(n/log n)^{1/2} ≾ σ_n^{−1} ≺ n^{q″} for some q″ > 1/2.]

(V). The eigenvalues of the prior variance V_γ and its inverse are bounded as the "model" size |γ|_1 grows. [max{ch_1(V_γ), ch_1(V_γ^{−1})} ≤ B for some constant B > 0, for all large |γ|_1.]

(rδ). The prior size restriction (denoted as r̄_n in Section 4.3) and the prior expectation of the "model" size (before the size restriction, which is about λ_n K_n) grow with n in some slow ways: M′ nδ_n^2/(log n)^2 ≤ λ_n K_n ≤ r̄_n = ⌈M nδ_n^2/(log n)^2⌉ for some M > 1 and M′ > 0. (Here ⌈·⌉ denotes the integer part.)

Finally, we define a collection of "sparse subsets" H of the linear decision rules Ω_n, which will be used in a condensed statement of many different results.

Let H_b be a "sparse set of rules" of at most nδ_n^2/(log n)^2 variables with coefficients at most C (some constant): H_b = {β ∈ Ω_n : Σ_j I[|β̃_j| ≠ 0] ≤ nδ_n^2/(log n)^2, sup_j |β̃_j| ≤ C}.

Let H_m and H_E be sparse sets satisfying some ℓ_1 summability conditions with various types of tail behavior (polynomial with power m and exponential, resp.). The formal definitions are: H_m = {β ∈ Ω_n : Σ_{j≤K_n} |β̃_(j)| ≤ C, Σ_{j>r} |β̃_(j)| ≤ r^{−m} for all r ≥ q} for some constants m, q, C > 0; H_E = {β ∈ Ω_n : Σ_{j≤K_n} |β̃_(j)| ≤ C, Σ_{j>r} |β̃_(j)| ≤ e^{−C″r} for all r ≥ q} for some constants q, C, C″ > 0. (We use β̃_(j) to denote the component of β̃ that has the jth largest absolute value.)

Let H_{1,2,3} ⊃ H_b be three other sparse sets, which have at most about nδ_n^2/(log n)^2 possibly large β-coefficients, while allowing many more other β-coefficients to be small and nonzero. The mathematical details are given below:

H_1 = {β ∈ Ω_n : Σ_{j ≤ nδ_n^2/(log n)^2} |β̃_(j)|^2 ≤ C^2 nδ_n^2/(log n), Σ_{j > nδ_n^2/(log n)^2} |β̃_(j)| ≤ C′δ_n/(log n)};

H_2 = {β ∈ Ω_n : sup_{j ≤ nδ_n^2/(log n)^2} |β̃_(j)| ≤ C√(log n), Σ_{j > nδ_n^2/(log n)^2} |β̃_(j)| ≤ C′δ_n/(log n)};

H_3 = {β ∈ Ω_n : Σ_{j ≤ K_n} |β̃_(j)| ≤ C, Σ_{j > nδ_n^2/(log n)^2} |β̃_(j)| ≤ C′δ_n/(log n)}

for some constants C, C′ > 0.

The following proposition addresses the risk performance of BVS (with a Gibbs posterior) in the two scenarios described in Section 3.
The results concern the use of the Gibbs posterior ω(dβ|D_n) based on R_n, under the probability distribution P_{β,D} [based on p*(D_n) ω(dβ|D_n)] and the corresponding expectation E_{β,D}.

Proposition 2 (Risk performance). (i) (Scenario II; "exponentially sparse" H_E.) Assuming conditions 0′, 0″, 3′, (σ), (V) and (rδ), where δ_n = n^{−1/2}(log n)^2, we have R − inf R(H_E) ≤ c n^{−1/2+ξ} with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(H_E) ≤ c n^{−1/2+ξ} for all large enough n, for any ξ > 0, for some c > 0.

(ii) (Scenario I; "exponentially sparse" H_E.) Suppose in addition that inf_{β∈Ω_n} R(β) is reached at some β_R ∈ H_E (a best rule in Ω_n satisfies the sparsity condition in H_E). Then R − inf R(Ω_n) ≤ c n^{−1/2+ξ} with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(Ω_n) ≤ c n^{−1/2+ξ} for all large enough n, for any ξ > 0, for some c > 0.

(i)′ (Scenario II; "polynomially sparse" H_m.) Assuming conditions 0′, 0″, 3′, (σ), (V) and (rδ), where δ_n = n^{−m/(2m+1)}(log n)^2, we have R − inf R(H_m) ≤ c n^{−m/(2m+1)+ξ} with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(H_m) ≤ c n^{−m/(2m+1)+ξ} for all large enough n, for any ξ > 0, for some c > 0.

(ii)′ (Scenario I; "polynomially sparse" H_m.) Suppose in addition that inf_{β∈Ω_n} R(β) is reached at some β_R ∈ H_m (a best rule in Ω_n satisfies the sparsity condition in H_m). Then R − inf R(Ω_n) ≤ c n^{−m/(2m+1)+ξ} with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(Ω_n) ≤ c n^{−m/(2m+1)+ξ} for all large enough n, for any ξ > 0, for some c > 0.

Therefore (i) suggests that the Gibbs posterior will lead to performance in R that is no worse than the best performance among the sparse linear rules in H_E, up to a rate close to n^{−1/2}, despite the high dimension K_n, which can be, for example, n^{10}. Result (ii) says that if a best linear rule is sparse in H_E, then the performance actually is no worse than the best linear rules in Ω_n, up to the same rate despite the high dimension.

When the sparsity conditions from H_E are relaxed to H_m, the rate becomes about n^{−m/(2m+1)}, which is still not deteriorating as dim(x) = K increases (even when K ≫ n). This is in contrast to some other situations (such as regression without variable selection, or piecewise constant models) which have rates deteriorating as the dimension K increases.

The above proposition involves sparse rules that require a bounded ℓ_1-sum of the β-coefficients. This limits the number of "potentially important" (or "possibly large") coefficients to be bounded (in n). The next proposition generalizes this and allows some other sparse rules, where the number of "possibly large" coefficients can grow in n in some way that affects the convergence rate.

Proposition 3 (Risk performance; other sparse cases). (i) (Scenario II; other sparse cases.) Under conditions 0′, 0″, 3′, (σ), (V), (rδ), with δ_n satisfying 1′, we have R − inf R(H_{1,2,3,b}) ≤ cδ_n with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(H_{1,2,3,b}) ≤ cδ_n for all large enough n, for some c > 0.

(ii) (Scenario I; other sparse cases.)
If in addition inf_{β∈Ω_n} R(β) is reached at some β_R ∈ H_{1,2,3,b} (a best model in Ω_n satisfies the sparsity condition in H_{1,2,3,b}, resp.), then R − inf R(Ω_n) ≤ cδ_n with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(Ω_n) ≤ cδ_n for all large enough n, for some c > 0.

Note that there is some compromise between the convergence rate δ_n and the number v_n = nδ_n^2/(log n)^2 (or rather its integer part), which is the number of "possibly large" β̃-coefficients allowed in the "sparse set" H_{1,2,3,b}. When δ_n is "precise" or small (such as about n^{−0.49}), then v_n is small (about n^{0.02}). When δ_n is "rough" or large (such as n^{−0.01}), v_n is large (about n^{0.98}).

Propositions 2 and 3 will be proved in a more general context of data mining (which need not be classification), as Proposition 5 in Section 6 below, where we also accommodate more general priors (which need not be normal-binary).

6. Proofs and results for more general priors and risk functions. Some of the proofs below utilize preparatory results that will be presented in Section 6.1.

Define the following conditions and notation. Different subsets of these conditions will be used later when formulating different results.

(RnR): For h ∈ (0,1) and q > 0, denote p_0 = (e^{ψhq} − 1)/(e^{ψq} − 1), p_1 = (1 − e^{−ψhq})/(1 − e^{−ψq}), Φ_i = Φ((x^{(i)})^T β/σ_n), A = I(x^T β > 0). Let R_n = −(nψ)^{−1} Σ_{i=1}^n ln{Φ_i p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} + (1 − Φ_i) p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}}. Let R = −ψ^{−1} E ln{A p_1^y(1 − p_1)^{1−y} + (1 − A) p_0^y(1 − p_0)^{1−y}}.

(C2R): Let R′ = E ρ(y, A) where A = I[x^T β > 0], y ∈ {0,1}, ρ(0,0) < ρ(0,1) and ρ(1,1) < ρ(1,0). Define q = ρ(1,0) + ρ(0,1) − ρ(1,1) − ρ(0,0), and h = [ρ(0,1) − ρ(0,0)]/q.

(Rs): R_n is an i.i.d. average of terms that are bounded between [0, Q] for some positive constant Q.

(Rp): R is nonstochastic and bounded between [0, Q].

(C2L): (Uniform continuity for R.) There exists a constant W > 0 and a constant ε > 0 such that |R(β) − R(β′)| ≤ W|β − β′|_1, for all β and β′ in Ω_n ⊂ ℜ^{K_n}, whenever |β − β′|_1 ≤ ε.

(L): (Lipschitz for R_n.) For some q′ ≥ 0, for all large enough n, |R_n(β) − R_n(β′)| ≤ n^{q′}|β − β′|_∞ with probability 1, for all β and β′ in ℜ^{K_n}.

(B): (Bias.) sup_{β∈Ω_n} |E_{D_n} R_n(β) − R(β)| ≺ δ_n.

(C): H_n is such that π[R − inf R(H_n) < δ_n] ≥ e^{−nψδ_n} for all large enough n.

(C2b): H_n (⊂ Ω_n) is a compact set of β's each satisfying the following: β ∈ Ω_n, and for any small enough η > 0, there exists a large enough N_η, such that the prior π around a neighborhood of β satisfies π[b : |b − β|_1 < ηδ_n] ≥ e^{−nψδ_n} for all n > N_η.

(T1): For some M > 1 and u ≥ 0, π(Θ_n^c) ≺ e^{−2nψc′} for any constant c′ > 0, where Θ_n^c = Ω_n − Θ_n, Ω_n is the support of π, and the set Θ_n = {β ∈ Ω_n : |β|_0 (= |γ|_1) ≤ ⌈M nδ_n^2/(log n)^2⌉, |β|_∞ ≤ n^u}.

0′. |x_j| ≤ 1 for all j.

0″. The conditional density p(x_1|x̃) with respect to the Lebesgue measure exists for all x and is bounded above by a constant S > 0.

1′. 1 ≻ δ_n ≻ n^{−1/2} log n.

3′. n ≺ K ≺ n^α for some α > 1.

Lemma 1.
R′(β) = R(β) + c, where R is the risk function in (RnR), R′ is the risk function in (C2R), and c is a constant. Both R′ and R are equal to qE(A(h − Y)) + a constant.

Proof. Note that for all A, y ∈ {0,1},

ln{A p_1^y(1 − p_1)^{1−y} + (1 − A) p_0^y(1 − p_0)^{1−y}} = Ay ln{p_1(1 − p_0) p_0^{−1}(1 − p_1)^{−1}} + A ln{(1 − p_1)(1 − p_0)^{−1}} + y ln{p_0(1 − p_0)^{−1}} + ln(1 − p_0),

which is, using the definitions of p_{0,1}, equal to Ayψq − Ahψq plus something that does not depend on A. Then, due to (RnR), R(β) = −ψ^{−1} E(Ayψq − Ahψq) = qE[A(h − y)], up to an additive constant.

In (C2R),

R′(β) = E ρ(Y, A) = Σ_{y,a∈{0,1}} ρ(y, a) E[Y^y(1 − Y)^{1−y} A^a(1 − A)^{1−a}]
= ρ(0,0) + (ρ(1,1) + ρ(0,0) − ρ(1,0) − ρ(0,1)) E[YA] + (ρ(1,0) − ρ(0,0)) EY + (ρ(0,1) − ρ(0,0)) EA
= ρ(0,0) + (ρ(1,0) − ρ(0,0)) EY + (ρ(1,1) + ρ(0,0) − ρ(1,0) − ρ(0,1)) E[A(Y + (ρ(0,1) − ρ(0,0))/(ρ(1,1) + ρ(0,0) − ρ(1,0) − ρ(0,1)))]
= constant + (ρ(1,0) + ρ(0,1) − ρ(1,1) − ρ(0,0)) E[A((ρ(0,1) − ρ(0,0))/(ρ(1,0) + ρ(0,1) − ρ(1,1) − ρ(0,0)) − Y)]
= constant + qE[A(h − Y)],

where q = ρ(1,0) + ρ(0,1) − ρ(1,1) − ρ(0,0) > 0 and h = (ρ(0,1) − ρ(0,0))/(ρ(1,0) + ρ(0,1) − ρ(1,1) − ρ(0,0)) ∈ (0,1) due to ρ(0,1) > ρ(0,0) and ρ(1,0) > ρ(1,1).

Remark 1. The risk function R′ in condition (C2R) [or equivalently R in (RnR)] describes a risk function in data mining that is more general than the classification error. For one example in a data mining context: a marketing effort A = I[mail] of mailing out an advertisement with cost c = 1 can be based on x (including, e.g., gender, age, ethnic group, education, ...) through a decision rule A = I(x^T β > 0). The outcome will be Y = I[purchase], where a purchase will lead to net income g = 100. Then one would like to maximize the expected profit E[(gY − c)A] or minimize a risk R = constant − E[(gY − c)A]. Here, up to a constant, ρ(Y, A) = −(gY − c)A, so that ρ(0,0) = ρ(1,0) = 0, ρ(0,1) = c = 1, ρ(1,1) = c − g = −99. Such profit-and-loss decision matrices are included in popular data mining software such as SAS Enterprise Miner. When ρ(0,1) = ρ(1,0) = 1 and ρ(1,1) = ρ(0,0) = 0, we obtain the special case of R being the classification error, in which case q = 2 and h = 0.5.

Remark 2. Consider the smooth sample risk used for classification [choice (ii) in Section 2]: R_n = −ψ^{−1} n^{−1} Σ_{i=1}^n log{Φ_i e^{ψ(y^{(i)}−1)} + (1 − Φ_i) e^{−ψ y^{(i)}}}, where Φ_i = Φ(σ_n^{−1}(x^{(i)})^T β), Φ is the standard normal cumulative distribution function and σ_n is a scaling factor. It is noted that, up to a constant in β, e^{−nψR_n} = (constant) × ∏_{i=1}^n {Φ_i p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} + (1 − Φ_i) p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}}, where p_1 = e^ψ/(1 + e^ψ) and p_0 = 1/(1 + e^ψ). So this forms a special case of R_n in (RnR) with h = 1/2 and q = 2.
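As a small worked check of Remarks 1 and 2 (an illustration added here, not from the paper), the quantities q and h of condition (C2R) can be computed directly from a profit-and-loss matrix:

```python
# Derive (q, h) from a loss matrix rho(y, a) as in (C2R), and confirm the
# classification-error special case gives q = 2, h = 0.5.
def q_and_h(rho):
    """rho is a dict {(y, a): loss}; returns (q, h) as defined in (C2R)."""
    q = rho[(1, 0)] + rho[(0, 1)] - rho[(1, 1)] - rho[(0, 0)]
    h = (rho[(0, 1)] - rho[(0, 0)]) / q
    return q, h

# Marketing example of Remark 1: rho(Y, A) = -(g*Y - c)*A with mailing cost
# c = 1 and net income g = 100 per purchase.
c, g = 1, 100
marketing = {(0, 0): 0, (1, 0): 0, (0, 1): c, (1, 1): c - g}
print(q_and_h(marketing))        # (100, 0.01): here q = g and h = c/g

# Classification error: rho(0,1) = rho(1,0) = 1, rho(1,1) = rho(0,0) = 0.
zero_one = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}
print(q_and_h(zero_one))         # (2, 0.5)
```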
Proposition 4 (General prior). Under (Rs), (Rp), (C2L), (L), (B), (C2b), (T1), 1′, 3′, we have R − inf R(H_n) ≤ 6δ_n with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(H_n) ≤ 6δ_n for all large enough n. The same results hold if the risk R is replaced by a translated new risk R′ = R + c for any constant c.

Proof. The proposition is proved by combining Lemmas 2 and 3 below.

Lemma 2. Under (Rs), (Rp), (L), (B), (C) (in Proposition 7 of Section 6.1), (T1), 1′, 3′, we have R − inf R(H_n) ≤ 6δ_n with P_{β,D}-probability tending to 1 as n → ∞, and E_{β,D} R − inf R(H_n) ≤ 6δ_n for all large enough n. The same results hold if the risk R is replaced by a translated new risk R′ = R + c for any constant c.

Proof. We will apply Proposition 7 from Section 6.1. Here we can take R̄ = Q and R̄ − inf R(H_n) ∈ [0, Q], due to (Rp). The parameter b is denoted as β here. The set F_n is denoted as Θ_n here. Note that R_n ≥ 0 due to (Rs), and δ_n ≺ 1 due to condition 1′. Note that (T1) implies (T).

To prove the bounds in Lemma 2, it suffices to show that the two terms on the right-hand side of (1) in Proposition 7 are both o(δ_n). The second term 4e^{−nψδ_n} ≺ δ_n due to the condition 1′ for δ_n. The first term P*[sup_{β∈Θ_n} |R_n(β) − R(β)| > δ_n] is bounded above by P*[sup_{β∈Θ_n} |R_n(β) − ER_n(β)| > δ_n/2] for all large n, due to the bias condition (B) sup_{β∈Θ_n} |E_{D_n} R_n(β) − R(β)| ≺ δ_n. Therefore it suffices to prove that P*[sup_{β∈Θ_n} |R_n(β) − ER_n(β)| > δ_n/2] ≺ δ_n.

Note that in general, due to (Rs), P*[sup_{b∈Θ_n} |R_n(b) − ER_n(b)| > ε_n] can be bounded using a covering number for Θ_n, a Lipschitz condition |R_n(b) − R_n(b′)| ≤ L_n|b − b′|_∞, a union bound and a Hoeffding inequality. Suppose Θ_n can be covered by N balls of radius s, such that for any b ∈ Θ_n, there exists a b_k ∈ ℜ^{K_n}, k ∈ {1, ..., N}, such that |b − b_k|_∞ < s. Then for any b ∈ Θ_n, one can find one of these N b_k's, say b_j, such that |R_n(b) − ER_n(b)| − |R_n(b_j) − ER_n(b_j)| ≤ 2ε_n/3 by choosing s = ε_n/(3L_n), due to the Lipschitz condition. Therefore sup_{b∈Θ_n} |R_n(b) − ER_n(b)| ≤ sup_{j∈{1,...,N}} |R_n(b_j) − ER_n(b_j)| + 2ε_n/3. Then P*[sup_{b∈Θ_n} |R_n(b) − ER_n(b)| > ε_n] ≤ P*[sup_{j∈{1,...,N}} |R_n(b_j) − ER_n(b_j)| > ε_n/3], which is at most N · 2e^{−2n(ε_n/3)^2 Q^{−2}} due to the union bound, condition (Rs), and the Hoeffding inequality.

Note that one can choose N ≤ N̄ = Σ_{r≤d} K^r (n^u/s + 1)^r, where d = ⌈M nδ_n^2/(log n)^2⌉, since the definition of Θ_n implies that there can be at most K^r "model" indicators γ with size |γ|_1 = r (r ≤ d), each of which has a parameter space (of the nonzero β-components) that can be covered by at most (2n^u/(2s) + 1)^r balls of size s. Now Σ_{r≤d} K^r(n^u/s + 1)^r ≤ (d + 1)K^d(n^u/s + 1)^d ≤ K^{d+1}(n^u/s + 1)^d ≤ K^{2d}(n^u/s + 1)^d ≤ n^{2dα}(n^u/s + 1)^d for all large n, since 1 ≺ d ≺ K ≺ n^α due to 1 ≺ d ≺ n (implied by condition 1′) and n ≺ K ≺ n^α (by condition 3′). So we can choose N ≤ n^{2dα}(n^u/s + 1)^d for all large enough n, where s = ε_n/(3L_n) as prescribed before.
Now taking ε_n = δ_n/2 and L_n = n^{q′} [from condition (L)], we get, for all large n,

P*[ sup_{b∈Θ_n} |R_n(b) − ER_n(b)| > δ_n/2 ] ≤ [ n^{2α}( n^u/((δ_n/2)/(3n^{q′})) + 1 ) ]^{⌈M nδ_n^2/(log n)^2⌉} · 2 e^{−2n((δ_n/2)/3)^2 Q^{−2}},

which can be proved to be less than δ_n in order due to the condition 1′: 1 ≻ δ_n ≻ n^{−1/2} log n. Collecting these steps together leads to the proof.

Lemma 3. If the conditions (C2L) and (C2b) are satisfied for some sequence δ_n ≺ 1, then (C) is also satisfied for the same δ_n.

Proof. Note that inf R(H_n) is achieved at some β ∈ H_n (possibly depending on n) due to the compactness of H_n in (C2b) and the continuity of R implied by (C2L). Then for any b ∈ Ω_n, we have (*) R(b) − inf R(H_n) = R(b) − R(β) ≤ |R(b) − R(β)| ≤ W|b − β|_1 for all small enough |b − β|_1. Therefore, for any sequence δ_n ≺ 1, |b − β|_1 < δ_n W^{−1} implies R(b) − inf R(H_n) < Wδ_n W^{−1} = δ_n, for all large n. Therefore π[b : R(b) − inf R(H_n) < δ_n] ≥ π[b : |b − β|_1 < δ_n W^{−1}], which is ≥ e^{−nψδ_n} for all large n, by taking η = W^{−1} in (C2b).

Remark 3. Note that (C2b) is a condition for proving (C) (see Lemma 3 above). Condition (C) describes that the prior π is competitive against the rules in H_n in some sense [when comparing the generated R(β)'s to inf R(H_n)]. Condition (C2b) describes one way to construct such a set of rules H_n over which the prior π is competitive: a compact set of rules such that around each of these rules the prior assigns a not too low probability.

Lemma 4. (i) Condition (RnR) implies (Rs) and (Rp). (ii) Conditions 0′, 0″ and (RnR) imply (C2L). (iii) Conditions 0′, (σ), 3′ and (RnR) imply (L). (iv) Conditions 0″, 1′, (σ) and (RnR) imply (B).

Proof. For (i), note that the proofs for (Rs) and (Rp) are similar. Note that h ∈ (0,1) and q > 0 imply that p_{0,1} ∈ (0,1). The terms inside ln{} are averages of p_1 and p_0, or averages of (1 − p_1) and (1 − p_0), which are all between (p_min, 1), where p_min = min{p_0, p_1, 1 − p_0, 1 − p_1} ∈ (0,1). This implies that −ψ^{−1} ln{} ∈ (0, ψ^{−1} ln(1/p_min)). So (Rs) and (Rp) are proved with Q = ψ^{−1} ln(1/p_min).

For (ii), note that R = qE[A(h − y)], up to an additive constant, due to Lemma 1. For any b and β in {±1} × ℜ^{K−1} such that |b − β|_1 ≤ ε for some small enough ε, we must have b_1 = β_1 ∈ {±1}. Let us take b_1 = β_1 = +1. (The other case is similar.) Then |R(b) − R(β)| = |qE[(I[x^T b > 0] − I[x^T β > 0])(h − y)]| ≤ qE|I[x^T β > 0] − I[x^T b > 0]|. Here we are using the representations such as b^T = (b_1, b̃^T) and β^T = (β_1, β̃^T).
Now b_1 = β_1 = 1 implies that (‡) E|I[x^T β > 0] − I[x^T b > 0]| = E_{x̃} E_{x_1|x̃} I[−x̃^T b̃ ≥ x_1 > −x̃^T β̃ or −x̃^T β̃ ≥ x_1 > −x̃^T b̃] ≤ E S|x̃^T b̃ − x̃^T β̃| ≤ E S|x|_∞ |b − β|_1 ≤ S|b − β|_1, where |x|_∞ ≤ 1 due to 0′ and S is an upper bound of the conditional density p(x_1|x̃) in 0″. So |R(b) − R(β)| ≤ qE|I[x^T β > 0] − I[x^T b > 0]| ≤ qS|b − β|_1. So we can take W = qS to obtain the proof for (ii).

For (iii), note that (∗) |R_n(b) − R_n(b′)| ≤ K_n C_n |b − b′|_∞, where C_n is any upper bound of |∂_{b_j} R_n| over all j and over the parameter space. Note that

|∂_{b_j} R_n| = |−(nψ)^{−1} Σ_{i=1}^n {Φ_i p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} + (1 − Φ_i) p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}}^{−1} (∂_{b_j} Φ_i) {p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} − p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}}| ≤ (nψ)^{−1} Σ_{i=1}^n (p_min)^{−1}(1/√(2π)) σ_n^{−1} × 1,

since |p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} − p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}| ≤ 1, {Φ_i p_1^{y^{(i)}}(1 − p_1)^{1−y^{(i)}} + (1 − Φ_i) p_0^{y^{(i)}}(1 − p_0)^{1−y^{(i)}}} ≥ p_min = min{p_0, p_1, 1 − p_0, 1 − p_1} ∈ (0,1), and |∂_{b_j} Φ_i| = |∂_{b_j} Φ((x^{(i)})^T b/σ_n)| = |x_j^{(i)}| σ_n^{−1}(1/√(2π)) e^{−0.5((x^{(i)})^T b/σ_n)^2} ≤ σ_n^{−1}(1/√(2π)) due to 0′. So |∂_{b_j} R_n| ≤ ψ^{−1}(p_min)^{−1}(1/√(2π)) σ_n^{−1}, which can be taken as the upper bound C_n in (∗), which implies that |R_n(b) − R_n(b′)| ≤ (constant) K_n σ_n^{−1}|b − b′|_∞ ≤ n^{α+q″+1}|b − b′|_∞ for all large n, due to conditions 3′ and (σ). Then (L) is proved with q′ = α + q″ + 1.

For (iv), note that |ER_n − R| = ψ^{−1}|E ln{Φ p_1^y(1 − p_1)^{1−y} + (1 − Φ) p_0^y(1 − p_0)^{1−y}} − E ln{A p_1^y(1 − p_1)^{1−y} + (1 − A) p_0^y(1 − p_0)^{1−y}}|, where Φ = Φ(x^T β/σ_n) and A = I(x^T β > 0). By a first-order Taylor expansion one shows that |ER_n − R| ≤ constant · E|Φ − A| = constant · E_{x̃} E_{x_1|x̃}|Φ − A|. Now suppose β_1 = 1 (the case of β_1 = −1 is similar); then E_{x_1|x̃}|Φ − A| = E_{x_1|x̃}{I(|x_1 + x̃^T β̃| ≤ u)|Φ − A|} + E_{x_1|x̃}{I(|x_1 + x̃^T β̃| > u)|Φ − A|} ≤ S(2u) + e^{−0.5(u/σ_n)^2}/√(2π(u/σ_n)^2) for any u > 0, due to 0″ and the Mills ratio. The resulting upper bound is uniformly correct for all β, and becomes O(log n/√n) by taking u = σ_n√(log n) and using (σ). So sup_β |ER_n − R| ≤ O(log n/√n) ≺ δ_n due to 1′.

Lemma 5. (i) H_1 ⊃ H_2. (ii) H_2 ⊃ H_3 for all large n. (iii) H_2 ⊃ H_b for all large n. (iv) H_3 ⊅ H_b assuming condition 1′. (v) H_b ⊅ H_3 assuming 1′ and 3′. (vi) H_m ⊃ H_E for all large enough q. (vii) H_3 ⊃ H_m for all large n if δ_n ≥ n^{−m/(2m+1)}(log n)^2, assuming 1′. (viii) H_3 ⊃ H_E for all large n if δ_n ≥ n^{−1/2}(log n)^2, assuming 1′. (ix) H_b ⊅ H_E under 3′, for all large enough q.

Proof. Part (vi) is due to the domination of the power law decays over the exponential decays. Parts (vii) and (viii) can be proved by applying the (polynomial and exponential, resp.) bounds of Σ_{j>r} |β̃_(j)| for r = v_n ≡ nδ_n^2/(log n)^2. Part (iv) is proved by examining β = (1, C, ..., C, 0, ..., 0)^T with about v_n C's, which is a member of H_b but not of H_3 due to an ℓ_1 norm that is unbounded as n increases. Part (v) is proved by examining β = (1, (1/2)C′δ_n/(log n), (1/2)^2 C′δ_n/(log n), (1/2)^3 C′δ_n/(log n), ...)^T, which is a member of H_3 but not of H_b. Part (iii) is proved by noting that H_b implies a zero tail for the sum over j > v_n and bounded terms for j ≤ v_n. Part (ii) is proved by noting that a bounded ℓ_1 norm implies that all coefficients are bounded. Part (i) is proved by noting that Σ_{j≤v_n} |β̃_(j)|^2 ≤ (sup_{j≤v_n} |β̃_(j)|)^2 v_n.
To prove part (ix), note that β = (1, ψ_0 ξ, ψ_0 ξ^2, ψ_0 ξ^3, ...)^T is a member of H_E for all large q, if ψ_0 = C(e^{2C″} − 1) and ξ = e^{−2C″}. On the other hand, β ∉ H_b under 3′.

Lemma 6. (i) R − inf R(H_{2,3,b}) ≤ R − inf R(H_1). (ii) If δ_n ≥ n^{−m/(2m+1)}(log n)^2, then R − inf R(H_m) ≤ R − inf R(H_1). (iii) If δ_n ≥ n^{−1/2}(log n)^2, then R − inf R(H_E) ≤ R − inf R(H_1).

Proof. Note that the previous lemma on the relations among the sparse sets implies that H_1 contains all other sparse sets in all situations of this lemma [with the specifications of δ_n for situations (ii) and (iii)]. Then inf R(H_1) is the smallest among all these infimums and R − inf R(H_{2,3,b,m,E}) ≤ R − inf R(H_1).

Proposition 5. Propositions 2 and 3 hold for the more general risk functions in a data mining context specified in (RnR).

Proof. We only need prove the Scenario-II results, since they obviously imply the Scenario-I results. [If we assume that inf_{β∈Ω_n} R(β) is achieved at some β_H ∈ H_n ⊂ Ω_n, then inf_{β∈Ω_n} R(β) = inf_{β∈H_n} R(β).] Due to Lemma 6(i), it is obvious that we only need to prove Proposition 3 for H_1, in order to prove Proposition 3. For getting the upper bounds of the risk performances in Proposition 2, we start from Proposition 3 and apply Lemma 6(ii) and (iii), where we take "=" for the choices of δ_n, and note that the factors (log n)^2 in δ_n are less than the factor n^ξ for all large enough n, for any ξ > 0.

To prove Proposition 3 for H_1 in the general context (RnR), we apply Proposition 4 and note that all conditions hold, by applying Lemma 4, as well as Lemmas 7 and 8 to be given later.

Lemma 7. For δ_n in condition 1′, assume that K_n satisfies 3′ and that we use the normal-binary prior for π satisfying (V). Assume condition (rδ). Then the sparse set H_1 satisfies the condition (C2b).

Proof. The set H_1 is obviously compact. Consider any β ∈ H_1. Define the "model" γ_n to be the set of indices for the (β_j)_{j>1}'s that have the top ⌈v_n⌉ largest absolute values, in addition to the index 1 for β_1 ∈ {±1}, which is always kept in the "model." Then Σ_{j∉γ_n} |β_j| = Σ_{j>v_n} |β̃_(j)| ≤ C′δ_n/(log n). Here v_n = nδ_n^2/(log n)^2.

Note that for any η > 0, π[b : |b − β|_1 < ηδ_n] ≥ π(γ = γ_n) π[b : |b − β|_1 < δ_n η | γ = γ_n], where γ is the "model" indicator for the set of nonzero components of b. Following the notation of Section 4.2, γ = (γ_j)_1^{K_n}, where γ_j = I(|b_j| ≠ 0). Here and below, for a K_n-vector ζ (e.g., ζ can be γ or γ̆), the notation ζ = γ_n for a set γ_n ⊂ {1, ..., K_n} means that ζ_j = I[j ∈ γ_n], j = 1, ..., K_n. We will show that π(γ = γ_n) and π[b : |b − β|_1 < δ_n η | γ = γ_n] are both not too small.

Note that given the "model" γ_n, b_1 = β_1 and |b̃_γ − β̃_γ|_1 ≡ Σ_{j>1, j∈γ_n} |b_j − β_j| < δ_n/log n will imply that |b − β|_1 < δ_n η (for all large enough n). This is because b only has nonzero components in γ_n, so |b − β|_1 = |b_1 − β_1| + Σ_{j>1, j∈γ_n} |b_j − β_j| + Σ_{j∉γ_n} |β_j| ≤ 0 + δ_n/log n + C′δ_n/(log n), which is less than δ_n in order. So π[b : |b − β|_1 < δ_n η | γ = γ_n] ≥ π[b_1 = β_1, |b̃_γ − β̃_γ|_1 < δ_n/log n | γ = γ_n]
= 0.5 π[|b̃_γ − β̃_γ|_1 < δ_n/log n | γ = γ_n] for all large n, noting that b_1 = ±1 with equal probability and is independent of other things in the prior.

The last probability π[|b̃_γ − β̃_γ|_1 < δ_n/log n | γ = γ_n] is obtained by integrating a normal density |2πV_γ|^{−1/2} e^{−0.5 b̃_γ^T V_γ^{−1} b̃_γ} over a set S = [|b̃_γ − β̃_γ|_1 < δ_n/log n] ⊃ [v_n|b̃_γ − β̃_γ|_∞ < δ_n/log n], which has at least volume (v_n^{−1}δ_n/log n)^{⌈v_n⌉}, since the vector b̃_γ is ⌈v_n⌉-dimensional under the model γ = γ_n. The normal density over S is bounded below by exp{−0.5 v_n log(2πB) − 0.5|b̃_γ|_2^2 B}, using the bounds of the eigenvalues of the prior variance in (V). Note also that |b̃_γ|_2^2 ≤ 2|β̃_γ|_2^2 + 2|b̃_γ − β̃_γ|_2^2 ≤ 2|β̃_γ|_2^2 + 2|b̃_γ − β̃_γ|_1^2 ≤ 2|β̃_γ|_2^2 + 2δ_n^2/(log n)^2 over b̃_γ ∈ S, which is ≤ 2C^2 nδ_n^2/(log n) + 2δ_n^2/(log n)^2 since β ∈ H_1. Collecting all these together we get, for all large n,

π[b : |b − β|_1 < δ_n η | γ = γ_n] ≥ 0.5 exp{−0.5 v_n log(2πB) − C^2 nδ_n^2 B/(log n) − δ_n^2 B/(log n)^2}(v_n^{−1}δ_n/log n)^{⌈v_n⌉}
= 0.5 exp{−0.5 v_n log(2πB) − C^2 nδ_n^2 B/(log n) − δ_n^2 B/(log n)^2 − ⌈v_n⌉ log(v_n log n/δ_n)},

where v_n = nδ_n^2/(log n)^2. It is then easy to verify that all terms in the exponent are of the form −o(nδ_n^2) under condition 1′ for δ_n.

Now we consider π(γ = γ_n) under the (size-restricted) binary prior. Note that for all large enough n, π(γ = γ_n) = π(γ̆ = γ_n | |γ̆|_1 ≤ r̄_n) ≥ π(γ̆ = γ_n, |γ̆|_1 ≤ r̄_n) = π(γ̆ = γ_n), where r̄_n is the size restriction chosen as (the integer part of) M nδ_n^2/(log n)^2 (M > 1) in condition (rδ), and the "model" γ_n has size 1 + ⌈v_n⌉ = ⌈nδ_n^2/(log n)^2⌉ + 1 < r̄_n for all large enough n. Note that γ̆ has unrestricted i.i.d. binary components (except that γ̆_1 = 1 always) and the probability π(γ̆ = γ_n) = λ_n^{⌈v_n⌉}(1 − λ_n)^{K_n−1−⌈v_n⌉}. Note that λ_n ∼ v_n/K_n due to (rδ) and v_n ≺ K_n due to 1′ and 3′. Therefore

log π(γ̆ = γ_n) = ⌈v_n⌉ log λ_n + (K_n − 1 − ⌈v_n⌉) log(1 − λ_n) = ⌈v_n⌉ log λ_n + (K_n − 1 − ⌈v_n⌉)(−λ_n + o(λ_n)) = ⌈v_n⌉ log λ_n (1 + o(1)) ≥ (⌈v_n⌉ log(M′v_n) − ⌈v_n⌉ log K_n)(1 + o(1)) ≥ −v_n log K_n (1 + o(1))

for all large n [since v_n = nδ_n^2/(log n)^2 ≻ 1 due to 1′]. Now v_n log K_n = [nδ_n^2/(log n)^2] log K_n ≤ [nδ_n^2/(log n)^2] log(n^α) = o(nδ_n^2). Collecting these results together, we know that π[b : |b − β|_1 < ηδ_n] ≥ π(γ = γ_n) π[b : |b − β|_1 < δ_n η | γ = γ_n], where both factors can be expressed as being at least e^{−o(nδ_n^2)}, which will be greater than e^{−ψnδ_n} for all large n. (Note that δ_n ≺ 1 due to 1′ is used.)

Lemma 8. With conditions (V), (rδ), 1′, the normal-binary prior π (with size restriction) satisfies the tail condition (T1).

Proof. Take M as the one used in condition (rδ) and take u = 1 in (T1). Denote r̄_n = ⌈M nδ_n^2/(log n)^2⌉. Note that π(Θ_n^c) ≤ π(|γ|_1 > r̄_n) + Σ_{γ:|γ|_1≤r̄_n} π[|β|_∞ > n^u | γ] π(γ) ≤ π(|γ|_1 > r̄_n) + sup_{γ:|γ|_1≤r̄_n} π[|β|_∞ > n^u | γ]. The first term is 0 due to the size restriction.
The term sup_{γ:|γ|_1≤r̄_n} π[|β|_∞ > n^u | γ] = sup_{γ:|γ|_1≤r̄_n} π[∪_{j:γ_j=1}[|β_j| > n^u] | γ] ≤ r̄_n sup_{γ:|γ|_1≤r̄_n} sup_{j:γ_j=1} π[|β_j| > n^u | γ], where π[|β_j| > n^u | γ] can be bounded above by 2e^{−0.5 n^{2u}/B}/√(2π n^{2u}/B) using the Mills ratio and the eigenvalue bound (V). Collecting all these together we get π(Θ_n^c) ≤ r̄_n · 2e^{−0.5 n^{2u}/B}/√(2π n^{2u}/B) (where u is taken to be 1), which is at most e^{−0.5 n^2/B} under conditions (rδ) and 1′, for all large n, and is therefore ≺ e^{−2nψc′} for any constant c′ > 0.

6.1. Supplementary results on risk performance of the Gibbs posterior. In this section we will consider a very general setup. The results here have been applied in the proofs in Section 6. Here we will consider the performance of a general risk R(b) [or more generally, r_n(b), which is nonstochastic but can depend on n]. Suppose b is sampled from a Gibbs posterior ω(db|D_n), which is constructed from a sample risk R_n and a prior π(db), and D_n denotes data generated from a true density p*.

More formally, in both propositions below, we will assume that the data D_n (indexed by a sample size n) follow a probability distribution P* with density p*(D_n) with respect to some dominating measure dD_n. Let b|D_n denote a distribution (conditional on D_n) with a density w(b|D_n) ∝ e^{−nψR_n(b)} with respect to a prior π(db), where R_n(b) depends on a parameter b and data D_n. Denote by P_{b,D} the resulting joint distribution of b and D_n, and by E_{b,D} the corresponding expectation.

Proposition 6. Assume that R_n(b) ≥ 0 for any b and D_n. If the prior π is such that the support supp(π) = Ω_n = F_n ∪ F_n^c, where F_n^c = Ω_n − F_n, then for any r_n(b) nonstochastic (possibly depending on n but not otherwise on D_n) and any ρ_n and δ_n nonstochastic and not depending on b,

P_{b,D}[r_n(b) − ρ_n > 5δ_n] ≤ P*[ sup_{b∈F_n} |R_n(b) − r_n(b)| > δ_n ] + [π(F_n^c) e^{nψ(ρ_n+2δ_n)} + e^{−nψ(2δ_n)}] / (π[r_n(b) − ρ_n < δ_n] − π(F_n^c))_+.

Here we use the notation A_+ = A I(A > 0).

Proof. The left-hand side is E_D Ψ = ∫ P*(dD_n)[(N1 + N2)/Den], where E_D = ∫ P*(dD_n), Ψ = (N1 + N2)/Den, N1 = ∫_{F_n^c} e^{−nψ(R_n−ρ_n)} I[r_n − ρ_n > 5δ_n] π(db), N2 = ∫_{F_n} e^{−nψ(R_n−r_n+r_n−ρ_n)} I[r_n − ρ_n > 5δ_n] π(db), and Den = ∫ e^{−nψ(R_n−r_n+r_n−ρ_n)} π(db). Note that N1 ≤ π(F_n^c) e^{nψρ_n}, N2 ≤ e^{nψ∆_n − nψ(5δ_n)}, where ∆_n = sup_{F_n} |R_n(b) − r_n(b)|, and

Den ≥ ∫_{F_n} I[r_n − ρ_n < δ_n] e^{−nψ∆_n − nψ(r_n−ρ_n)} π(db) ≥ e^{−nψ∆_n − nψδ_n} π([r_n − ρ_n < δ_n] ∩ F_n) ≥ e^{−nψ∆_n − nψδ_n}(π[r_n − ρ_n < δ_n] − π(F_n^c))_+.

Therefore Ψ = (N1 + N2)/Den ≤ G(∆_n), where

G(∆_n) = [e^{(∆_n+δ_n+ρ_n)nψ} π(F_n^c) + e^{(∆_n+δ_n+∆_n−5δ_n)nψ}] / (π[r_n − ρ_n < δ_n] − π(F_n^c))_+.

Note that Ψ = P_{b|D_n}[r_n − ρ_n > 5δ_n] ≤ 1 and G(∆_n) is increasing in ∆_n. Then the left-hand side is

E_D Ψ = E_D(Ψ I[∆_n > δ_n]) + E_D(Ψ I[∆_n ≤ δ_n]) ≤ P*[∆_n > δ_n] + E_D{G(∆_n) I[∆_n ≤ δ_n]} ≤ P*[∆_n > δ_n] + G(δ_n) ≤ P*[∆_n > δ_n] + [e^{(2δ_n+ρ_n)nψ} π(F_n^c) + e^{(−2δ_n)nψ}] / (π[r_n − ρ_n < δ_n] − π(F_n^c))_+.

Proposition 7.
Proposition 7. Assume that $R_n(b)\ge0$ for any $b$ and $D_n$. Consider any positive sequence $\delta_n$ which is nonstochastic and does not depend on $b$, and assume that $\delta_n\prec1$. For all large enough $n$, if the prior $\pi$ has support $\operatorname{supp}(\pi)=\Omega_n=F_n\cup F_n^c$, where $F_n^c=\Omega_n-F_n$, and satisfies

(T) $\pi(F_n^c)\le e^{-2n\psi\bar R}$ for some constant $\bar R>0$;

(C) a subset $H_n$ of $\operatorname{supp}(\pi)$ is such that $\pi[R(b)-\inf_{b\in H_n}R(b)<\delta_n]\ge e^{-n\psi\delta_n}$, for some nonstochastic $R(b)\le\bar R$;

then we have, for all large enough $n$,
\[
P_{b,D}\Big[R(b)-\inf_{b\in H_n}R(b)>5\delta_n\Big]\le P^*\Big[\sup_{b\in F_n}|R_n(b)-R(b)|>\delta_n\Big]+4e^{-n\psi\delta_n}\qquad(1)
\]
and
\[
E_{b,D}\Big[R(b)-\inf_{b\in H_n}R(b)\Big]\le5\delta_n+\Big(\bar R-\inf_{b\in H_n}R(b)\Big)P_{b,D}\Big[R(b)-\inf_{b\in H_n}R(b)>5\delta_n\Big].\qquad(2)
\]

Proof. Note that the second inequality relates an expectation $E$ to a probability $P$, which is bounded in the first inequality. Such a relation follows from the general fact that, for a constant $g>0$ and a random variable $G$ bounded above by a constant $c$, $EG=E(GI[G>g])+E(GI[G\le g])\le cP[G>g]+g$. We can then simply take $g=5\delta_n$ and $G=R(b)-\inf_{b\in H_n}R(b)$, which is bounded above by the constant $c=\bar R-\inf_{b\in H_n}R(b)$.

Now we prove the first inequality, on $P$. This is proved by applying Proposition 6. We take $r_n(b)=R(b)$, $\rho_n=\inf_{b\in H_n}R(b)$, and apply conditions (T) and (C) to the long fraction on the right-hand side of the inequality in Proposition 6, which is then bounded above by $4e^{-n\psi\delta_n}$, noting that $\bar R>0$ and $\delta_n\prec1$.

Remark 4. This proposition simplifies the long fraction in Proposition 6 by imposing conditions (T) and (C) on the prior $\pi$ and on a "scope of comparison" $H_n$. The performance of $R(b)$ is then evaluated under $E_{b,D}$ or $P_{b,D}$ (as generated by the data-generating mechanism $P^*$ for $D_n$ and the Gibbs posterior for $b\mid D_n$), and is compared to the best performance $\inf_{b\in H_n}R(b)$ over the scope $H_n$. It will be close to this best performance if $n^{-1}\prec\delta_n\prec1$ and if a uniform convergence result makes $P^*[\sup_{b\in F_n}|R_n(b)-R(b)|>\delta_n]$ small. Such a relation is very general and accommodates many different situations by invoking different techniques. For example, Vapnik–Chervonenkis theory, or uniform continuity of $R_n(b)-R(b)$ together with covering numbers of $F_n$, may be used to handle the $P^*[\sup\ldots]$ term with a union bound. The probability of a large $|R_n-R|$ may also be bounded by Hoeffding's or Bernstein's inequalities for non-i.i.d. data or data that are dependent in some weak way (such as $\alpha$- or $\phi$-mixing, ergodic Markov chains, etc.).

7. An MCMC algorithm. This section describes some computational aspects of sampling from the Gibbs posterior $\omega(d\beta\mid D_n)\propto e^{-n\psi R_n}\pi(d\beta)$, where $\pi$ is the normal-binary prior specified in Sections 4.2 and 4.3. Consider the smoothed sample risk function $R_n$ in (RnR). It is noted that
\[
e^{-n\psi R_n}=\prod_{i=1}^n\{\Phi_i\,p_1^{y^{(i)}}(1-p_1)^{1-y^{(i)}}+(1-\Phi_i)\,p_0^{y^{(i)}}(1-p_0)^{1-y^{(i)}}\},
\]
where $\Phi_i=\Phi(\sigma_n^{-1}(x^{(i)})^T\beta)$, $\Phi$ is the standard normal cumulative distribution function and $\sigma_n$ is a scaling factor. This can be recognized as the likelihood of a mixture of two binary models with mixing probability $\Phi_i$.
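To make the identity above concrete, the following Python sketch (ours, not part of the paper) evaluates the right-hand product for a given coefficient vector and then recovers the smoothed sample risk $R_n$ as $-(n\psi)^{-1}$ times its logarithm. The constants $p_0<1/2<p_1$, the scaling factor $\sigma_n$ and the constant $\psi$ are specified elsewhere in the paper; here they are simply passed in as arguments, and the numerical values in the usage example are placeholders.

```python
import numpy as np
from scipy.stats import norm

def smoothed_risk(beta, X, y, p0, p1, sigma_n, psi):
    """Evaluate exp(-n*psi*R_n) via the two-component mixture identity above,
    and return R_n = -(n*psi)^{-1} * log of that product."""
    n = len(y)
    Phi = norm.cdf(X @ beta / sigma_n)      # Phi_i = Phi(sigma_n^{-1} x_i' beta)
    lik1 = p1**y * (1 - p1)**(1 - y)        # p_1^{y_i} (1 - p_1)^{1 - y_i}
    lik0 = p0**y * (1 - p0)**(1 - y)        # p_0^{y_i} (1 - p_0)^{1 - y_i}
    mix = Phi * lik1 + (1 - Phi) * lik0     # mixture term for observation i
    return -np.sum(np.log(mix)) / (n * psi)

# toy usage; the values of p0, p1, sigma_n and psi below are illustrative only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=50) > 0).astype(float)
print(smoothed_risk(np.array([1.0, 0.0, 0.0]), X, y, p0=0.25, p1=0.75, sigma_n=0.1, psi=1.0))
```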
This mixture representation suggests a data augmentation method [see, e.g., Tanner (1996)] incorporating latent variables $Z=(Z^{(i)})_1^n$, where the $Z^{(i)}$ are independent $N((x^{(i)})^T\beta,\sigma_n^2)$, so that the $y^{(i)}\mid Z^{(i)}$ are independent $\mathrm{Bin}(1,p_{I[Z^{(i)}>0]})$, which leads to a computational advantage. The Gibbs sampler can be used to obtain the joint distribution of $(Z,\gamma,\beta)$, where all full conditional distributions are standard. Similarly to, for example, Lee et al. (2003), we can integrate over $\beta_\gamma$ and use the distribution $\gamma\mid Z$ instead of $\gamma\mid Z,\beta_\gamma$ in the Gibbs sampler, in order to speed up the computations.

Define $\beta^T=(\beta_1,\tilde\beta^T)$, $\tilde\gamma=(\beta_1,\gamma_2,\ldots,\gamma_{K_n})$, and let $\tilde\beta_\gamma$ include the $\tilde\beta_j$'s ($j=2,\ldots,K_n$) with $\gamma_j=1$. Consider the following MCMC algorithm, starting from any initial position. For $t=1,2,\ldots$:

(Step 1) Sample $Z^t\mid\tilde\beta_\gamma^{t-1},\tilde\gamma^{t-1}$.

(Step 2) Sample $\tilde\gamma^t\mid Z^t$.

(Step 3) Sample $\tilde\beta_\gamma^t\mid\tilde\gamma^t,Z^t$.

Below we explain each of the three steps and omit the time index $t$ to simplify notation.

Step 1. Note that $Z=(Z^{(1)},\ldots,Z^{(n)})^T$. The step is carried out by independently sampling the $Z^{(i)}$'s according to a "shifted" normal distribution:

1a: Generate $Z_i^*\sim N((x_\gamma^{(i)})^T\beta_\gamma,\sigma^2)$ independently, where $v_\gamma$ denotes the subvector of the $v_j$'s whose $\gamma_j$'s equal 1.

1b: Generate independent uniform variables $U_i^*\sim\mathrm{Unif}[0,1]$.

1c (Case 1): If $Z_i^*>0$, set $Z^{(i)}=Z_i^*$ only when $U_i^*\le a_+=a_1/\max\{a_1,a_0\}$, where $a_{0,1}=p_{0,1}^{y^{(i)}}(1-p_{0,1})^{1-y^{(i)}}$.

1c (Case 2): If $Z_i^*\le0$, set $Z^{(i)}=Z_i^*$ only when $U_i^*\le a_-=a_0/\max\{a_1,a_0\}$.

Step 2. Iteratively update one component at a time, conditional on all other components of $\tilde\gamma$. Define $Z^{(i)}(\beta_1)=Z^{(i)}-x_1^{(i)}\beta_1$, $Z(\beta_1)=(Z^{(1)}(\beta_1),\ldots,Z^{(n)}(\beta_1))^T$ and $\tilde X_\gamma=(\tilde x_\gamma^{(1)},\ldots,\tilde x_\gamma^{(n)})^T$.

2a: Simulate $\beta_1\mid\gamma_2^{K_n},Z$ to take values in $\{\pm1\}$, with probability
\[
p(\beta_1\mid Z,\gamma_2^{K_n})\propto0.5\,e^{0.5\sigma^{-2}Z(\beta_1)^T[\tilde X_\gamma(\sigma^2V_\gamma^{-1}+\tilde X_\gamma^T\tilde X_\gamma)^{-1}\tilde X_\gamma^T-I]Z(\beta_1)}.
\]

2b: For $j=2,\ldots,K_n$, simulate $\gamma_j\mid\{\gamma_k:k=2,\ldots,K_n,\,k\ne j\},\beta_1,Z$ to take values in $\{0,1\}$, with probability
\[
p(\gamma_j\mid\{\gamma_k:k=2,\ldots,K_n,\,k\ne j\},\beta_1,Z)\propto\lambda^{\gamma_j}(1-\lambda)^{1-\gamma_j}I[\,|\gamma|\le\bar r\,]\times e^{0.5\sigma^{-2}Z(\beta_1)^T[\tilde X_\gamma(\sigma^2V_\gamma^{-1}+\tilde X_\gamma^T\tilde X_\gamma)^{-1}\tilde X_\gamma^T-I]Z(\beta_1)}\times\{\det[I+\sigma^{-2}\tilde X_\gamma^T\tilde X_\gamma V_\gamma]\}^{-1/2}.
\]

Step 3. Simulate
\[
\tilde\beta_\gamma\mid\beta_1,\gamma,Z\sim N\{(\sigma^2V_\gamma^{-1}+\tilde X_\gamma^T\tilde X_\gamma)^{-1}\tilde X_\gamma^TZ(\beta_1),\ \sigma^2(\sigma^2V_\gamma^{-1}+\tilde X_\gamma^T\tilde X_\gamma)^{-1}\}.
\]

Note that all these conditional distributions are standard. It can easily be shown that a stationary distribution of the proposed MCMC algorithm [which yields a Markov chain $(\gamma^t,\beta_\gamma^t)$ and its corresponding parameters $(\beta^t)$] is the desired Gibbs posterior $\omega(d\beta\mid D_n)\propto e^{-n\psi R_n}\pi(d\beta)$, where $R_n$ is the smoothed empirical risk in (RnR). We conjecture that the proposed MCMC algorithm converges to the desired Gibbs posterior in total variation distance as $t\to\infty$, regardless of the starting position.
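To fix ideas, the following Python sketch implements one sweep of Steps 1–3 under simplifying assumptions that are ours, not the paper's: the prior covariance is taken to be $V_\gamma=B\,I$; the constants $p_0,p_1,\lambda,\sigma,B$ and the size cap $\bar r$ are passed in as fixed arguments; Step 1c is read as a rejection step that is repeated until acceptance; and all function and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_collapsed(Zb, Xg, sigma, B):
    # log of the gamma-dependent part of the collapsed conditional used in 2a/2b:
    #   0.5*sigma^{-2} * Zb'[Xg (sigma^2 V^{-1} + Xg'Xg)^{-1} Xg' - I] Zb
    #   - 0.5*log det[I + sigma^{-2} Xg'Xg V],   assuming V_gamma = B * I.
    # (The determinant term depends only on gamma, so including it is harmless
    # for the beta_1 update, where it cancels in the normalization.)
    if Xg.shape[1] == 0:
        return -0.5 * (Zb @ Zb) / sigma**2
    A = (sigma**2 / B) * np.eye(Xg.shape[1]) + Xg.T @ Xg
    quad = Zb @ (Xg @ np.linalg.solve(A, Xg.T @ Zb)) - Zb @ Zb
    _, logdet = np.linalg.slogdet(np.eye(Xg.shape[1]) + (B / sigma**2) * (Xg.T @ Xg))
    return 0.5 * quad / sigma**2 - 0.5 * logdet

def sweep(X, y, beta, gamma, p0, p1, lam, sigma, B, rbar):
    """One sweep of Steps 1-3; beta is the full K-vector (zeros where gamma_j = 0,
    beta[0] = beta_1 in {+1,-1}); gamma is a boolean K-vector with gamma[0] = True."""
    n, K = X.shape
    a1 = p1**y * (1 - p1)**(1 - y)                  # a_1 per observation
    a0 = p0**y * (1 - p0)**(1 - y)                  # a_0 per observation
    amax = np.maximum(a1, a0)

    # Step 1: latent Z^{(i)} by rejection from the shifted normal N(x_i' beta, sigma^2)
    Z = np.empty(n)
    todo = np.arange(n)
    mean = X @ beta
    while todo.size:
        Zstar = rng.normal(mean[todo], sigma)
        acc = np.where(Zstar > 0, a1[todo], a0[todo]) / amax[todo]
        keep = rng.uniform(size=todo.size) <= acc
        Z[todo[keep]] = Zstar[keep]
        todo = todo[~keep]

    # Step 2a: beta_1 | gamma_2..K, Z  (the symmetric 1/2 prior on +-1 cancels)
    Xt = X[:, 1:][:, gamma[1:]]
    logw = np.array([log_collapsed(Z - s * X[:, 0], Xt, sigma, B) for s in (1.0, -1.0)])
    pr = np.exp(logw - logw.max()); pr /= pr.sum()
    beta1 = 1.0 if rng.uniform() < pr[0] else -1.0
    Zb = Z - beta1 * X[:, 0]                        # Z(beta_1)

    # Step 2b: each gamma_j given the rest, beta_1 and Z, with size restriction |gamma| <= rbar
    for j in range(1, K):
        logp = np.full(2, -np.inf)
        for gj in (0, 1):
            gamma[j] = bool(gj)
            if gamma.sum() > rbar:
                continue
            Xg = X[:, 1:][:, gamma[1:]]
            logp[gj] = (gj * np.log(lam) + (1 - gj) * np.log(1 - lam)
                        + log_collapsed(Zb, Xg, sigma, B))
        pj = np.exp(logp - logp.max()); pj /= pj.sum()
        gamma[j] = bool(rng.uniform() < pj[1])

    # Step 3: beta_gamma | gamma, beta_1, Z ~ N(A^{-1} X' Z(beta_1), sigma^2 A^{-1})
    beta = np.zeros(K); beta[0] = beta1
    idx = np.flatnonzero(gamma[1:]) + 1
    if idx.size:
        Xg = X[:, idx]
        A = (sigma**2 / B) * np.eye(idx.size) + Xg.T @ Xg
        beta[idx] = rng.multivariate_normal(np.linalg.solve(A, Xg.T @ Zb),
                                            sigma**2 * np.linalg.inv(A))
    return beta, gamma
```

Iterating `sweep` (e.g., from the initial state $\gamma=(1,0,\ldots,0)$ and $\beta=(1,0,\ldots,0)$) produces the chain $(\gamma^t,\beta^t)$ whose stationary distribution, as noted above, is the desired Gibbs posterior.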
8. Discussion. The current paper studies a new Bayesian variable selection (BVS) method using a Gibbs posterior, which is directly constructed from a sample risk function of interest. This approach can perform better than the usual approach that uses a likelihood-based posterior, which in some situations can give suboptimal risk performance under model misspecification. A smoothed sample risk function is used to provide convenient posterior computation in the style of Markov chain Monte Carlo. With BVS, the procedure can effectively handle high-dimensional data. We show that the resulting risk performance, even in a very high-dimensional case ($K\gg n$), can resemble the risk performance in a low-dimensional setting, in the sense that it can approach the best possible risk performance (achievable by certain sparse decision rules) at a low-dimensional convergence rate.

The approximately parametric/low-dimensional rate that BVS achieves, despite the high dimensionality ($K\gg n$), seems to defy the "curse of dimensionality." The reason is that BVS uses the so-called "bet-on-sparsity" principle [e.g., Friedman, Hastie, Rosset, Tibshirani and Zhu (2004)] by fitting effectively low-dimensional models through the use of the prior distribution. Such a bet can of course be wrong: we may be in the nonsparse case where all the $x_j$'s are important. However, in such cases not too much is lost by the wrong bet, since nothing else seems to work well in a high-dimensional nonsparse case. On the other hand, when the bet is right, BVS can do much better than a minimax-type rule that tries to protect against the bad cases. Intuitively speaking, a linear regression model without variable selection would have a large variance of order $K/n$ to start with, which is doomed to fail from the beginning when $K\gg n$. In contrast, BVS uses lower-dimensional submodels to make sure that the variance part is not out of control in the first place. When sparseness holds (i.e., only a few out of the $K$ candidate $x_j$'s are important), the method will perform very well. It is noted that the sparse case describes quite practical situations, such as a disease being mainly affected by only a few genes out of thousands.

A related approach to variable selection is based on Bayesian decision theory, which was described by Lindley (1968) and more recently extended, for example, by Brown, Fearn and Vannucci (1999) to the multivariate case. This approach is characterized by assuming normal data and using a friendly loss function (such as the quadratic loss). Under this framework, various expectations can be computed analytically and the optimization can be simplified to depend only on the model indicator $\gamma$. Our approach cannot have such a computational simplification, and the Gibbs posterior needs to generate both $\gamma$ and $\beta_\gamma$ (the parameter within the model). This is because we allow more general cases with a nonnormal predictor $x$ and a nonquadratic loss. Our approach can handle classification error as well as realistic dollar costs used in data mining. In addition, the current paper studies frequentist properties of risk performance, which were not addressed in previous works using the Bayesian decision-theoretic approach.

The current approach generates a Gibbs posterior based on which both variable selection and model averaging can be performed.
The theoretical result in the current paper concerns the good performance of the expected risk $E_{\beta,D}R(\beta)=E_D[E_{\beta\mid D}R(\beta)]$, which involves using models drawn randomly from the Gibbs posterior for $\beta\mid D$. [The parameter $\beta$ has certain nonzero components selected by a model indicator and determines a decision rule $I(x^T\beta>0)$.] We argue that how to optimally utilize these good decision rules obtained from the Gibbs posterior (e.g., how model averaging should be done) is a nontrivial and interesting problem. Model averaging would involve using rules parameterized by $E(\beta\mid D)$ instead of $\beta$. By Jensen's inequality, if $R(\beta)$ is convex, model averaging is always beneficial, since $E_D[R(E(\beta\mid D))]\le E_D[E_{\beta\mid D}R(\beta)]$. However, the classification error $R$ can be nonconvex and can have multiple minima. In such cases the averaged decision rule can be a poor one even if each individual rule being averaged is good (see the small numerical illustration at the end of this section). It may be that some kind of "local" model averaging, over a limited region of approximate convexity, can still be beneficial, roughly speaking. This is worth further investigation.

The current approach uses a general framework allowing model misspecification (when the true generating process can be outside the support of the prior). Although the proposed approach has an advantage in such a misspecified case, we expect that in the case without misspecification (when the true model is within the support of the prior), the conventional approach using the likelihood-based posterior should perform comparably well to our procedure. This is because the conventional approach essentially minimizes the Kullback–Leibler (KL) divergence, which will lead to good risk performance when there is no misspecification, due to a relation between the two. Such a relation is known, for example, between the classification risk and the KL divergence [see, e.g., Devroye, Györfi and Lugosi (1996), Problem 15.3].

Although this paper has focused on a smoothed sample risk for constructing the Gibbs posterior [choice (ii) in Section 2], similar results on good risk performance can be obtained when the unsmoothed sample risk [choice (i), or a sample version of the more general data mining risk described in Remark 1] is used. The proof would be more conventional and would involve probability bounds for the uniform deviation of the sample risk, based on Vapnik–Chervonenkis theory [see, e.g., Devroye, Györfi and Lugosi (1996), Chapters 12 and 13, for a good description]. The posterior simulation can be based on the Metropolis algorithm [see, e.g., Tanner (1996), Chapter 6, for a description]. It is also noted that, although we have focused on the Gibbs sampler in this paper, the Metropolis algorithm can be applied both to the unsmoothed and to the smoothed sample risk. In the latter case, with a smoothed sample risk, the Gibbs sampler approach of Section 7 may require a relatively large smoothing parameter ($\sigma$) to improve the algorithmic convergence. This would lead to some bias, which can be corrected by applying the Metropolis algorithm with less (or no) smoothing.
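As a toy numerical illustration of the nonconvexity point made earlier in this section (ours, not taken from the paper): on the small artificial dataset below, two linear rules each misclassify about 5.6% of the points, while the rule defined by the average of their coefficient vectors misclassifies about 11%, worse than either individual rule.

```python
import numpy as np

def risk01(beta, X, y):
    """Empirical classification error of the linear rule I(x' beta > 0)."""
    return np.mean((X @ beta > 0).astype(int) != y)

# two large, easy groups plus two small groups that the two rules trade off against each other
X = np.vstack([np.tile([ 1.0,  0.0], (80, 1)),    # y = 1, classified correctly by all rules
               np.tile([-1.0,  0.0], (80, 1)),    # y = 0, classified correctly by all rules
               np.tile([-0.05, 1.0], (10, 1)),    # y = 1, caught only by beta_a
               np.tile([-0.05, -1.0], (10, 1))])  # y = 1, caught only by beta_b
y = np.concatenate([np.ones(80), np.zeros(80), np.ones(10), np.ones(10)])

beta_a = np.array([1.0,  0.2])
beta_b = np.array([1.0, -0.2])
beta_avg = 0.5 * (beta_a + beta_b)

print(risk01(beta_a, X, y), risk01(beta_b, X, y), risk01(beta_avg, X, y))
# approximately 0.056, 0.056 and 0.111: each rule is good, their average is worse than either
```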
Acknowledgment. We wish to thank an Associate Editor for the insightful comments and some additional references.

REFERENCES

Brown, P. J., Fearn, T. and Vannucci, M. (1999). The choice of variables in multivariate regression: A non-conjugate Bayesian decision theory approach. Biometrika 86 635–648. MR1723783
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York. MR1383093
Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G. and West, M. (2004). Sparse graphical models for exploring gene expression data. J. Multivariate Anal. 90 196–212. MR2064941
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). Discussion on boosting. Ann. Statist. 32 102–107.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6 721–741.
George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339–373.
Gerlach, R., Bird, R. and Hall, A. (2002). Bayesian variable selection in logistic regression: Predicting company earnings direction. Aust. N. Z. J. Statist. 44 155–168. MR1963292
Greenshtein, E. (2006). Best subset selection, persistency in high dimensional statistical learning and optimization under $\ell_1$ constraint. Ann. Statist. 34 2367–2386. MR2291503
Horowitz, J. L. (1992). A smoothed maximum score estimator for the binary response model. Econometrica 60 505–531. MR1162997
Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34 837–877. MR2283395
Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: Convergence rates of the fitted densities. Ann. Statist. 35 1487–1511. MR2351094
Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M. and Mallick, B. K. (2003). Gene selection: A Bayesian variable selection approach. Bioinformatics 19 90–97.
Lindley, D. V. (1968). The choice of variables in multiple regression (with discussion). J. Roy. Statist. Soc. Ser. B 30 31–66. MR0231492
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics 75 317–343.
Tanner, M. A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd ed. Springer, New York. MR1396311
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82 528–550. MR0898357
Zhang, T. (1999). Theoretical analysis of a class of randomized regularization methods. In COLT '99: Proceedings of the Twelfth Annual Conference on Computational Learning Theory 156–163. ACM Press, New York. MR1811611
Zhang, T. (2006a). From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34 2180–2210. MR2291497
Zhang, T. (2006b). Information theoretical upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory 52 1307–1321. MR2241190
Zhou, X., Liu, K.-Y. and Wong, S. T. C. (2004). Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomedical Informatics 37 249–259.
Department of Statistics
Northwestern University
Evanston, Illinois 60208
USA
E-mail: wjiang@northwestern.edu
mat132@northwestern.edu