Orthogonalized smoothing for rescaled spike and slab models
Rescaled spike and slab models are a new Bayesian variable selection method for linear regression models. In high dimensional orthogonal settings such models have been shown to possess optimal model selection properties. We review background theory a…
Authors: Hemant Ishwaran, Ariadni Papana
IMS Collectio ns Pushing the Limits of Con temp orary Statist ics: Contributions in Honor of Jay an ta K. Ghosh V ol. 3 ( 2008) 267–281 c Institute of Mathe matical Statistics , 2008 DOI: 10.1214/ 07492170 80000001 92 Orthogon alized smo othing for rescaled spik e and s lab mo dels Heman t Ish w ara n ∗ 1 and Ariadni P apana 2 Cleveland Clinic and Case West ern R eserve University Abstract: Rescaled spi k e and slab models are a new Ba y esian v ariable se- lection meth od for linear regression models. In high dimensional orthogon al settings suc h models hav e b een shown t o possess optimal mo del select ion prop- erties. W e review bac kground theory and discuss applications of r escaled spike and slab mo dels to prediction problems in v olving orthogonal p ol ynomials. W e first consider global smo othing and di scuss p oten tial w eaknesses. Some of these deficiencies are remedied b y usi ng local regression. The local regression ap- proac h relies on an intimate connection b etw ee n lo cal weigh te d regressi on and we igh ted generalized ridge regressi on. An i mp or tan t implication i s tha t one can trace the effec tiv e degrees of fr eedom of a curve as a wa y to visualize and classify curv ature. Sev eral motiv ating examples are pr esen ted. Con ten ts 1 Int ro duction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 2 Rescaled spike a nd slab mo de ls . . . . . . . . . . . . . . . . . . . . . . . . 26 8 2.1 Resca ling, the c hoice of π , and implications for s hr ink ag e . . . . . . . 2 69 2.2 Selective shrink ag e r ecast in terms of p ena lization . . . . . . . . . . . 27 0 3 Orthogo nal p olynomials : first illustration . . . . . . . . . . . . . . . . . . 27 2 3.1 Compa rative analys is using e ffective kernels . . . . . . . . . . . . . . 272 3.2 Effective degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . 2 74 4 Lo cal regr ession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 4.1 Resca led spike and slab weigh ted regressio n . . . . . . . . . . . . . . 276 4.2 Or thogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 77 4.3 Spinal BMD da ta revis ited . . . . . . . . . . . . . . . . . . . . . . . 27 7 4.4 Cos mic microw a v e background ra diation . . . . . . . . . . . . . . . . 2 7 8 4.5 Mas s sp e c trometry protein data . . . . . . . . . . . . . . . . . . . . . 278 Ac knowledgmen ts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 1. In tro duction Rescaled spik e and slab models were introduced in [ 1 ] as a Bay esian v a riable se- lection metho d in linear r e g ression mo dels. Such mo dels w ere shown to posse s s a ∗ Supported by the NSF Gran t DMS-04-05675. 1 Departmen t of Quantitativ e Health Sciences Wb4, Clev eland Clinic, 9500 Euclid Av en ue, Cleve land, Ohio 4419 5, USA, e-m ail: hemant.i shwarant @gmail.com 2 Departmen t of Statistics, Case W estern Reserv e Universit y , 10900 Euclid Aven ue, Cleve land, Ohio 44106, USA, e-mail: ariadni. papana@c ase.edu AMS 2000 subje c t classific ations: Pr imary 62J07; secondary 62J05. Keywor ds and phr ases: effect ive degrees of freedom, penalization, selectiv e shrink age . 267 268 H. Ishwar an and A. Pap ana sele ctive shrinkage pro p e rty in or thogonal models . This pro p erty allows the p oste- rior mean for the coefficients to shrink to zer o for truly zero co efficients while for the non-zero co efficients p osterio r estimates are similar to the ordinary least squares (OLS) estima tes . In [ 2 ], resca led spike and slab mo dels were used to analyze multi- group microarr ay data (an extension of pr evious work [ 3 ]). Selec tive shrink ag e was shown to be a sufficient condition for oracle- like tota l miscla s sification. A finite sample ada ptive method for selecting v ariables using this principle was given. In this manuscript we extend the application of r escaled s pike and slab mo dels to smo othing problems. Giv en an outcome v a lue Y related to a v ar iable x throug h an unkno wn function f ( x ), w e would like a ccurate recov e r y of f ( x ) . Smoothing is a pre dic tio n problem, and an imp ortant contribution of the pap er is adv ancing applications o f resca led spike a nd slab mo dels to pre diction settings. How e ver this do es not mea n s e lective shrink age, a co re ingredient to mo del selection, is no t at play in a prediction pa radigm. Indee d, as s hown, selective shrink a ge plays a crucia l role in adaptive selection of ov er-para meterized ba sis functions in resp onse to curv atur e of f ( x ). W e consider global smo othing via orthogo nal p oly nomial regress ion as w ell as lo cal regr ession using orthog onal p oly no mials. Orthogonality is a k ey ingredient in our approa ch. No t only do es it allow us to exploit the selective shrink age pr op erty of resca led spike and sla b mo dels , which follow as a consequence of or tho gonality , but it als o gr eatly improves the computationally efficiency of our algorithms. While m uch work has been done in the area o f smo othing, w e note ther e are novel feature s in our approach p otentially useful in applied settings. One imp orta n t feature be- ing that selective shr ink ag e a llows for gr eater adaptivity to curv ature and grea ter robustness to misspecifica tion of dimension of ba sis functions. Secondly , in lo cal regres s ion s ettings, a daptivity via s elective shrink age can be in terpreted in ter ms of dimensionality and curv ature. F rom this we pr ovide an effective degrees of free- dom plot for gra phing estima ted dimensionality of f ( x ) as a function of x . Such plots provide a s imple and p ow erful w ay to regis ter curves. Several applicatio ns a re provided a s illus tration. 2. Rescaled spi ke and sl ab mo del s W e begin by first revie wing background theory for r e s caled spik e and slab mo dels. The underlying setting is the linear regr ession mo del where Y 1 , . . . , Y n are indepen- dent res po nses suc h that (2.1) Y i = β 1 x i, 1 + · · · + β d x i,d + ε i = x t i β β β + ε i , i = 1 , . . . , n. Here x 1 , . . . , x n are non- random (fixed design) d -dimensio nal cov ariates and β β β = ( β 1 , . . . , β d ) is the unknown co e fficient vector. The ε i are independent r andom v ari- ables (but not necessar ily identically distributed) s uch that E ( ε i ) = 0, E ( ε 2 i ) = σ 2 0 and E ( ε 4 i ) ≤ M for so me M < ∞ (the la st condition is needed to inv ok e a trian- gular central limit theor em la ter, but is no t crucial and can certainly b e r elaxed). The v ariance σ 2 0 > 0 is assumed to be unknown. Thr o ughout we assume x i are standardized so that P n i =1 x i,k = 0 and P n i =1 x 2 i,k = n for k = 1 , . . . , d (without loss of gener ality w e assume that there is no in tercept term in ( 2.1 )). W e shall also assume throughout that X , the n × d design matr ix, is orthog onal, i.e., X t X = n I . As mentioned in the Int ro duction, this will allow us to exploit certain elega nt the- ories for r escaled spike and slab metho ds, although, o f course, the spike and slab metho d works for g eneral design matr ic e s. Ortho gonalize d r esc ale d spike and slab smo othing 269 Spike and s lab methods first appea red in the works of [ 4 ] and [ 5 ] for subset selection in linear regressio n models. The expres sion “spik e and slab,” coined b y Mitc hell and Beauchamp in [ 5 ], referred to the prior for the r e gressio n co e fficient s used in their hiera r chical formulation. This w as ch osen so that the co efficients were m utually indep endent with a t w o-p oint mixtur e distr ibutio n made up of a uniform flat distribution (the slab) and a degener ate dis tribution at ze ro (the s pike). In [ 6 ] a differen t t y pe of prior was used. This in v olved a scale mixture of t w o norma l distributions. In pa r ticular, the use of a normal pr ior was highly adv antageous and led to a Gibbs sampling method that highly popular ized the spik e and slab approach; see [ 7 , 8 , 9 , 10 , 1 1 ]. As po inted out in [ 1 ], priors inv olving a normal scale mixture distribution, of which [ 6 ] is a specia l ex a mple, co nstitute a wide class o f mo dels termed “spike a nd slab mo dels.” A mo dified class of spike and slab mo dels calle d “r escaled spike and slab mo dels” was in tro duced [ 1 ]. These new mo dels differed in that the or iginal Y i v alues were replac e d by new v a lues scaled by the s q uare r o ot of the sample size and divided by the square ro ot o f an estimate for σ 2 0 . Rescaling was shown to induce a non-v a nishing pena lization effect for the p oster ior mean, and when used in tandem with a contin uous bimo dal prio r , the resulting p osterio r mean w as shown to p osses s a selective shr ink age pr op erty in orthog onal mo dels [ 1 ]. A r esc ale d spike and slab mo del was defined in [ 1 ] to deno te a Bayesian hierar - chical mo del s pec ified as follows: ( Y ∗ i | x i , β β β ) ind ∼ N( x t i β β β , n ) , i = 1 , . . . , n, ( β β β | γ ) ∼ N( 0 , Γ ) , γ ∼ π ( d γ ) . (2.2) Here Y ∗ i are the r escaled Y i v alues defined by Y ∗ i = ˆ σ − 1 n 1 / 2 Y i , where ˆ σ 2 = || Y − X ˆ β β β 0 || 2 / ( n − d ) is the unbiased estima to r for σ 2 0 based on the full mo del, and ˆ β β β 0 is the OLS estimate for β β β (other estimators for σ 2 0 are also p ossible; these details, howev er, play a mino r r ole). The v alue of n used in the firs t level of the hier arch y in ( 2.2 ) is a v a riance infla tion facto r introduced to compens a te for the res caling. Mor eov er, inclusion of n in the hier arch y was shown in [ 1 ] to b e neces s ary for selective shr ink age to take place. Without rescaling , shrink age for the p os ter ior mean v a nis hes in the limit due to the prior b e c oming swamped by the likelihoo d [ 1 ]. In ( 2.2 ), 0 denotes a d -dimensional zero vector, Γ = diag( γ 1 , . . . , γ d ) is a d × d dia gonal matrix and π is the prior meas ure for γ = ( γ 1 , . . . , γ d ) t . A B ayesian parameter σ 2 can also be in tro duced in ( 2.2 ) at the fir s t le vel of the hiera rch y . How ev er, we av oid this approach her e and opt for the simpler set up (( 2.2 )). The rationale for this is the following: (i) we hav e alrea dy removed the effect of σ 2 0 when rescaling Y i , a nd (ii) the simpler s e tup enforces a s parse s olution for the p osterior mean in ill-determined settings when d is of the same size, or larger, than n . Poin t (ii) is esp ecially r elev ant as this is the setting we are interested in here. 2.1. R esc aling, the choic e of π , and implic ati ons for shrinkage In additio n to rescaling the resp onse, the prior for γ k m ust satisfy c ertain re q uire- men ts in o rder for selective shrink a ge to o ccur. A sufficient condition requires the prior to hav e a bimoda l prop erty such that the rig h t tail of the distribution is con- tin uous a nd such that ther e is a spike in the distribution near zero (see Theor em 6 of [ 1 ] fo r pr ecise details). One such ex ample is the contin uous bimo dal pr ior used 270 H. Ishwar an and A. Pap ana Fig 1 . Conditional density for γ k given w : (a) w = 0 . 1 , (b) w = 0 . 25 . Observe that only the densities height changes as w is varie d. One c an think of w as a c omp lexity p ar ameter co ntr ol ling mo del dimension. Prior b ase d on hyp erp ar ameters a 1 = 5 , a 2 = 50 and v 0 = 0 . 005 as in [ 1 , 2 , 3 ] . in [ 1 , 2 , 3 ]. This prior is induced by a parameter ization inv olving a binary v ar iable and a p ositive v ar iable with an inv erse-gamma distr ibutio n. Mor e precise ly , define γ k by γ k = I k τ 2 k , where I k and τ 2 k are para meter s with priors specified according to ( I k | v 0 , w ) iid ∼ (1 − w ) δ v 0 ( · ) + w δ 1 ( · ) , k = 1 , . . . , d, ( τ − 2 k | a 1 , a 2 ) iid ∼ Gamma( a 1 , a 2 ) , w ∼ Uniform[0 , 1 ] . (2.3) The choice for v 0 (a sma ll p ositive v alue) and a 1 and a 2 (the s hap e and scale parameters for a ga mma density) are selected so that γ k has a contin uous bimo dal distribution with a spike a t v 0 and a right co nt in uous tail (see Figure 1). Such a prior allows the po sterior to shr ink a co e fficient to zer o dep ending up on the v alue for γ k . Small v alues heavily shrink a co e fficie nt to w ards zero. 2.2. Sele ctive shrinkage r e c ast i n terms of p enali zation One c a n view the p osterior mean as a solution to a constr ained least squar es op- timization proble m in which the hyp e rv ar iances are related to p enalty terms. This provides us with a nother wa y to think a b out the effects of selective shr ink age. As befo re, we consider the orthogo nal s etting wher e X t X = n I . Let V k = E ( ν k | Y ∗ ) where ν k = γ k / (1 + γ k ). F or our argument it will be ea sier to think of pena lization in terms of ˆ β β β = ˆ σ n − 1 / 2 ˆ β β β ∗ , wher e ˆ β β β ∗ = E ( β β β | Y ∗ ) denotes the p o sterior mean for β β β under our r escaled spike and slab mo del. It can b e shown tha t (2.4) ˆ β β β = ar g min β β β ( || Y − X β β β || 2 + n d X k =1 1 − V k V k β 2 k ) . Observe how ea ch β k co efficient in ( 2.4 ) is pe na lized by a unique v alue (1 − V k ) /V k . The closer V k is to 1, the smaller the p enalty and the less the shrink a ge for β k , while the closer V k is to z e ro, the larger the p enalty , a nd the more β k is shrunk to zero. It is clear the mo re adaptive V k is to the true co efficient v a lue, the mo re accurate v ariable se le ction bec o mes. This ar gument can b e formalized b y studying the asymptotic be havior o f V k . Using a similar argument a s in [ 2 ], one can show that under the spik e and slab Ortho gonalize d r esc ale d spike and slab smo othing 271 Fig 2 . Limit ing densit y for ν k c onditione d on w = 0 . 1 and Z 2 k under the nul l that β k is truly zer o. V alues for Z 2 k sele c te d fro m the 25th, 50th, 75th, 90th, 95th and 99th p er c entiles of a χ 2 1 - distribution. Mo de on the left i ncr ea ses as Z 2 k de cr e ases, wher eas mo de on t he right de cr e ases. mo del ( 2.2 ) with contin uous bimo da l pr ior ( 2.3 ) (sp ecified in Figure 1), the following holds: Theorem 2.1. Assum e that max 1 ≤ i ≤ n || x i || / √ n → 0 . If ( 2.1 ) r epr esents t he t rue data mo de l, and the c o efficient β k for variable k is tru ly non-z er o, t hen (2.5) V k d 1 . Mor e over, if β k is truly zer o, then (2.6) E ( ν k | w, Y ∗ ) d R 1 0 ν exp ν Z 2 k / 2 (1 − ν ) − 3 / 2 f ν / (1 − ν ) | w dν R 1 0 exp ν Z 2 k / 2 (1 − ν ) − 3 / 2 f ν / (1 − ν ) | w dν , wher e f ( ·| w ) = (1 − w ) g 0 ( · ) + wg 1 ( · ) is t he prior density for γ k given w , wher e g 0 ( u ) = v 0 u − 2 g ( v 0 u − 1 ) , g 1 ( u ) = u − 2 g ( u − 1 ) and g ( u ) = a a 1 2 ( a 1 − 1 )! u a 1 − 1 exp( − a 2 u ) and Z k has a N(0 , 1 ) distribution. Result ( 2.5 ) of Theorem 2 .1 shows that V k approaches the v alue 1 in the case where there is true signal, and hence the p enalty for β k in ( 2.4 ) v anishes as the sample size increases, a nd β k will not be shrunk, just as we’d exp ect and hope for . Result ( 2.6 ) applies to the case when β k is really zer o. The term Z k app earing in ( 2.6 ) is the limit o f the p o sterior mean ˆ β ∗ k under the n ull, and thus Z k reflects the effect of the data on the p os terior under the null. In particular, unless ˆ β ∗ k is unduly large, the poster io r mean for ν k should b e rela tively c lo se to the v alue under the prior. This has implicatio ns for spa rse settings. In such cas es, the p o sterior v alue for w will b e small and the p os terior for ν k conditioned on w (which will lo ok like the prior given w ) will b e concentrated near zero. Thus, the left-hand side of ( 2.6 ) should be small and the poster io r mean penalized and shrunk tow ards ze ro. On the other hand, if ˆ β ∗ k is la rge, then the le ft-ha nd side of ( 2.6 ) will b e larg e, and there will b e less p ena liz ation and le s s s hrink ag e fo r β k . A larg e v a lue for ˆ β ∗ k is unlikely under the null and in fact is expected only whe n β k is really non-zer o, whic h is another wa y to see why ( 2.5 ) holds. Figur e 2 illustrates how ν k might dep end up on ˆ β ∗ k in a spars e setting under the null that β k is truly zero . 272 H. Ishwar an and A. Pap ana 3. Orthogonal po l ynomials: first illustration F or our first illustration we consider a da ta set rela ted to spinal bo ne mineral dens ity (BMD) (see [ 12 ] for a mo re complete description of the data). The r esp onse is the relative ch ange in s pinal BMD as a function of age in ma le and female adolesce n ts. Figure 3 plots the res ults of our analy sis. Predicted v alues for Y bas ed on the po sterior of the rescaled s pike a nd slab mo del ( 2.2 ) are sup erimp osed on the figure as solid dar k a nd das hed dark lines for men and women, r esp ectively . The ana lysis on the left side of the plot is ba sed on an orthogonal p olyno mia l des ig n ma trix with d = 10 basis functions. Also sup erimp osed are O LS estimates (gray line s ). While the left side of Figure 3 shows some difference betw een the metho ds, discrepancies b ecome mor e apparent if d is allow ed to increase. W e re-r an the same analysis but using an overly para meterized design inv olving d = 25 basis functions. The right s ide of Figure 3 reco rds the result. Notice how badly OLS ov erfits, wher e as rescaled spike and slab predictors rema in relatively unaffected. 3.1. Comp ar ati ve analysis u sing effe ctive kernels A more formal co mparison b e t ween the tw o a pproaches can b e bas ed on an effective kernel analysis. Effective kernels w ere int ro duced in [ 13 ] (Chapter 2.8), as a wa y to ev alua te the differences b etw een kernel smo others . Supp ose we hav e data ( x i , Y i ), i = 1 , . . . , n , where Y i are the resp onse v alues. It is assumed that (3.1) Y i = f i + ε i , i = 1 , . . . , n, where f i = x t i β β β are unknown mean v alues a nd x i ∈ R d are the v a lues of the pre- chosen underlying d basis functions ev aluated a t x i . Call a smo other s ( x ) linear in Y , if for each x 0 s ( x 0 ) = n X j =1 S j ( x 0 ) Y j , where S j ( x 0 ) dep ends o nly up on the x -v alues and no t the r esp onses. More generally , let f = (f 1 , . . . , f n ) t . If s ( x i ) is a linear smo other fo r f i , then ˆ f = SY Fig 3 . L e f t plot: R elativ e c hange in spinal BMD as a function of age. Solid dark and dashe d dark lines ar e spike and slab pr e dicte d c urves for men and women using an ortho gonal p ol ynomial design matrix with d = 10 b asis functions. Gr ay solid and dashe d lines ar e OL S estimates for men and women. Right plot: Analysis similar as befor e, but now using an ove r p ar ameterize d b asis function, d = 25 . Note the spiky beha vior of the OL S. Ortho gonalize d r esc ale d spike and slab smo othing 273 is a linea r smoo thed estimate of f , where S is the n × n smo other matrix , S = { s i,j } for s i,j = S j ( x i ). The v a lue S i ( x i ) = s i,i is o ften refer red to as the effe ctive kernel at x i [ 13 , 14 ]. The effective kernel mea sures the influence o f x i on Y i . The set of v alues { s i,j : j = 1 , . . . , n } , which is the i th r ow o f S , is called the effective kernel for Y i . Plotting the effective kernel is a way to compa r e differ e n t smo others [ 13 ]. This idea can b e adapted to our setting a s follows. First we der ive the effective kernel for the OLS estimate. Co nsider the orthogo nal reg ression setting in which X t X = n I . Let x ( k ) denote the k th column of the desig n matrix X . It follows that ˆ f = SY , where (3.2) S = X ( X t X ) − 1 X t = n − 1 d X k =1 x ( k ) x t ( k ) . The effectiv e kernel for Y i is n − 1 P d k =1 x i,k x t ( k ) and the effectiv e kernel at x i is s i,i = n − 1 x t i x i . The no tion of an effective kernel needs to be slightly mo dified to handle adaptive pena lization. W e adopt the notion of an adaptive smo o ther ma trix that allows the effective k ernel to dep end upon b oth x i and Y i . Define the spike and slab predicto r as ˆ f ∗ = X ˆ β β β . Theorem 3.1 . Under otho gonality, the spike and slab pr e dictor for Y i in ( 3.1 ) c an b e writt en as ˆ f ∗ = S ∗ Y , wher e (3.3) S ∗ = n − 1 d X k =1 V k x ( k ) x t ( k ) . One c an c onc eptualize S ∗ as an adaptive line ar smo other matrix. Conse qu en tly, t he effe ctive kernel for Y i is define d as n − 1 P d k =1 V k x i,k x t ( k ) , and the effe ct ive kernel at x i is s ∗ i,i = n − 1 P d k =1 V k x 2 i,k . Figure 4 shows the effective kernels at x i for the OLS and r escaled spike and slab predictors, wher e x i is age. The plots are base d o n the over-parameterized orthogo nal p o lynomial desig n inv olving d = 25 basis functions. The lar ge num b er of predictors helps to emphasize the non- r obustness of the OLS estimate. Note esp ecially ho w the OLS es timate is affected by the p oints near the edges of the plots. In co nt rast, note the robustness of the spike and slab appro ach. Fig 4 . Left plot: Effe ctive kernels at x i , i = 1 , . . . , n , for men using over-p ar ameterize d design, d = 25 (r esc al e d spike and slab values depict e d by thick line; O LS by dashe d line ). Right plot: Effe c tive ke rne ls for women. 274 H. Ishwar an and A. Pap ana 3.2. Effe ctive de gr e es of fr e e dom The smo other matrix provides information ab out the nature of a predicted curve. Ideally , ho wev er, we would lik e a rig orous and sys tema tic manner in which to sum- marize this information as a way to re gister (class ify) a cur ve. O ne way to r igoro usly compare c urves is to use the notion of the effe ctive de gr e es of fr e e dom [ 13 ]. F or a ny smo other matrix S , the effective degr e es of freedom, D f , is defined as D f ( S ) = tr( S ) = n X i =1 s i,i . F or the O LS smoo ther matrix ( 3.2 ), D f ( S ) = n − 1 P n i =1 P d k =1 x 2 i,k = d (the last ident it y on the right is due to or thogonality). Meanwhile, for the s pike and slab smo other matrix ( 3.3 ), we have the following corollar y to Theorem 3.1 . Corollary 3. 1. Under the c onditions of The or em 3 .1 , D f ( S ∗ ) = n − 1 n X i =1 d X k =1 V k x 2 i,k = d X k =1 V k ≤ d. Henc e, the de gr e es of fr e e dom for the spike and slab smo other is b ounde d by t he dimension of the un derlying p olynomia l b asis. Observe how { V k } , the shrink age parameters , dictate the degr ees of freedom. The larger the v alue, the mo re degr ees of freedom used up, and the less s hrink ag e there is. In the analy s is pr esented earlie r us ing a satur a ted desig n ( d = 25), the effective degrees of freedom ar e 4 . 2 a nd 5 . 8 for men a nd women, resp ectively , indicating more ov erall shrink age for men a nd evidence of differences in the tw o curves. Effective degr ees of freedom a re useful for assess ing overall difference s b etw een curves. How e ver the metho d is limited in its a bilit y to register a curve, as it reduce s the ov erall prop er ties of a curve to a single s umma r y v alue. In the next sectio n w e illustrate a muc h more effective wa y to r egister curves. 4. Lo cal regres s ion In this section we illustrate how rescaled spike and slab mo dels can b e us ed for lo cal regres s ion, a n alternative metho d of smo othing [ 15 , 16 ]. By exploiting orthog onality , and by drawing co nnections to generalized ridg e reg ression, w e show that rescaled spike and slab predicto rs ca n b e viewed as lo cal re g ression smo o thers with a lo cal smo other matrix whose effective degrees of freedom c a n be tra c ed ov er x a s a wa y to characterize cur v ature of the under lying function. Another nice feature of using rescaled spike and slab mo dels , just like in global smo othing, is that w e end up being fairly robust to the choice of the dimension of the underlying basis functions. First let’s re v iew some ba ckground on lo cal regre ssion. In lo c a l regr esssion, for a given x i , ra ther tha n perfor ming a glo bal regres sion to es tima te f i , one instead fits a weigh ted regr ession mo del using weigh ted least-squa res, with weigh ts for an observ ation x chosen by how close they are to x i . This re sults in a lo cal estimator ˆ f ( x i ) = ( ˆ f i, 1 , . . . , ˆ f i,n ) t in whic h the i th co o rdinate, ˆ f i,i , is used as an estimator for f i = E ( Y i ). Unlike ( 3.1 ), how ev er, the r elationship b etw een f i and x i can v ar y with i . Ortho gonalize d r esc ale d spike and slab smo othing 275 As well known, a lo cal regressio n predictor is nothing more than a weigh ted least squares predictor. That is, for a given x i , let b i,j = ( b i,j, 1 , · · · , b i,j,d ) t be the v alues of the d basis functions chosen for x i ev alua ted at x j . The lo ca l regressio n pr edictor is defined as ˆ f ( x i ) = B i ˆ β β β W , where (4.1) ˆ β β β W = ar g min β β β ( n X j =1 Y j − d X k =1 β k b i,j,k 2 K x j − x i h ) , and K ( · ) is a p ositive kernel function with unknown bandwidth par ameter h > 0. Solving, it can b e shown that ˆ f ( x i ) is the weigh ted lea st squares predictor, (4.2) ˆ f ( x i ) = B i ( B t i W ( x i ) B i ) − 1 B t i W ( x i ) Y where B i is the n × d design matr ix with j th row b i,j , and W ( x i ) = diag { W i,j } is the n × n diagonal weigh t matrix , where W i,j = K x j − x i h , j = 1 , . . . , n. See [ 14 ] for details. Example 4.1 . A po pular basis function ex pansion for lo cal re g ression is in terms of p olynomials [ 1 7 , 18 ]. In this case , the design matrix is (4.3) B i = 1 x 1 − x i · · · ( x 1 − x i ) d 1 x 2 − x i · · · ( x 2 − x i ) d . . . . . . . . . . . . 1 x n − x i · · · ( x n − x i ) d n × ( d +1) . Note that B i has rank d + 1 because w e alwa ys include an intercept term. The rationale for using a p olynomial expansio n follows by considering an expans io n o f of E ( Y j ) = f ( x j ) around x i . Since, f ( x j ) = f ( x i ) + p X k =1 ( x j − x i ) k k ! f ( k ) ( x i ) + R j , where R j is a small r emainder ter m, the lo cal weigh ted reg ression ( 4.1 ) is (approx- imately) the v alue for β β β minimizing n X j =1 f ( x i ) + p X k =1 ( x j − x i ) k k ! f ( k ) ( x i ) − β 0 + d X k =1 β k ( x j − x i ) k 2 K x j − x i h . If d = p , then β 0 estimates f ( x i ) while β k estimates f ( k ) ( x i ) /k ! for k = 1 , . . . , d . The lo cal reg ression predictor is ˆ f i,j = ˆ β 0 ,W + d X k =1 ˆ β k,W ( x j − x i ) k , j = 1 , . . . , n, which sho uld be a go o d approximation to f ( x j ) when x j is near x i . 276 H. Ishwar an and A. Pap ana 4.1. R esc ale d spike and slab weighte d r e gr ession The repres entation ( 4.2 ) pres ent s an immediate tie-in to the spike and sla b metho d- ology . The resca le d p oster ior mean, ˆ β β β , from ( 2.2 ) is a mo del averaged genera liz ed ridge regr ession (GRR) estimator, expr essible as ˆ β β β = E X t X + n Γ − 1 − 1 X t Y Y ∗ . It is not ha rd to s ee that by appropriately in tro ducing a weigh ting matr ix into the hierarch y , that one ca n ar rive at a mo del averaged weigh ted GRR estimator, a nd a smo other of the form ( 4.2 ). The adv a ntage of this type o f approach is that the resulting smo other will be based on an e stimator that uses adaptive penaliza tion. In this mo difica tion, similar to ( 4.3 ), we work with a po lynomial basis tha t depe nds up on i . How ev er, our p olynomial bas is will b e strictly or thogonal. Let I i,h = j : K x j − x i h > 0 . F or an orthogona l basis we define B i to b e the design matrix for x i obtained using a d -degr ee o rthogona l basis using only those x j v alues wher e j ∈ I i,h . F or each j ∈ I i,h , define Y ∗ j = ˆ σ − 1 i n 1 / 2 i Y j , where n i is the cardinality of I i,h and ˆ σ 2 i is a n estimato r fo r σ 2 0 for the set o f res po nses, { Y j : j ∈ I i,h } . W e use the estimator due to [ 19 ], ˆ σ 2 i = 1 2( n i − 1 ) n i − 1 X j =1 ( Y ( j +1) − Y ( j ) ) 2 , where Y ( j ) is the Y -v a lue corresp onding to the j th o rdered x -v a lue in I i,h (the estimator is most ea sily computed by s orting the x v a lues ). Let Y ∗ i be the vector of the r escaled v a lues Y ∗ j = ˆ σ − 1 i n 1 / 2 i Y j for j ∈ I i,h . Let W i be the subset of W ( x i ) corresp onding to those j ∈ I i,h . F or a given x i , the mo dified rescaled spike and slab mo de l is ( Y ∗ i | B i , W i , β β β ) ∼ N( B i β β β , n i W − 1 i ) , ( β β β | γ ) ∼ N( 0 , Γ ) , γ ∼ π ( d γ ) . (4.4) Consider the following theorem which c haracterizes the s pike and slab pre dictor ˆ f ∗ ( x i ) = B i ˆ β β β i,W , wher e ˆ β β β i,W is the rescaled p oster ior mea n from ( 4.4 ). W e use this result later to explicitly characterize the smo other ma trix and its effective deg rees of freedom under orthogo nality . Theorem 4.1. U nder the Bayesian hier ar chy ( 4.4 ), the spike and slab lo c al pr e- dictor c an b e expr esse d as ˆ f ∗ ( x i ) = ( ˆ f ∗ i, 1 , . . . , ˆ f ∗ i,n ) t = S ∗ i,h Y i , wher e S ∗ i,h is the mo del aver age d s m o othing matrix define d by S ∗ i,h = E B i B t i W i B i + n i Γ − 1 − 1 B t i W i Y ∗ i . Note that the smo other matrix S ∗ i,h , u nlike ( 4.2 ), takes advantage of adaptive p e- nalization. Ortho gonalize d r esc ale d spike and slab smo othing 277 4.2. Ortho gonality Our construction for the basis ensures that B t i B i = n i I . Ho w ever, in order to fully exploit or thogonality , we a dditionally require that (4.5) B t i W i B i = n i I . F or ( 4.5 ) to hold w e must have W i = I . The simplest wa y to satis fy this condition is to use a nea rest neighbour kernel. F or a fixed bandwidth v a lue h , let K x h = 1 {| x | < h } . The neares t neighbour kernel puts a w eight o f 1 on all v alues of x within a distance of h to zero . Using suc h a kernel implies that W i = I and I i,h = { j : | x j − x i | < h } . Shrink ag e, just as in the g lobal orthogo nal regr ession setting, is intimately related to the degrees of freedom of the s mo other matrix. Co nsider the following coro llary to Theore m ( 4.1 ) c haracterizing effective deg rees of freedom under or tho gonality . Corollary 4.1. Under the ortho gonality assumption ( 4.5 ), t he lo c al s mo other ma- trix for t he r esc ale d spi ke and slab pr e dictor is S ∗ i,h = n − 1 i B i V i B t i . The effe ctive de gr e es of fr e e dom of S ∗ i,h e quals D f ( S ∗ i,h ) = n − 1 i tr ( B i V i B t i ) = n − 1 i tr ( B t i B i V i ) = d X k =1 V i,k ≤ d, wher e V i = diag { V i,k } and V i,k = E γ k 1 + γ k Y ∗ i , k = 1 , . . . , d. Henc e, the de gr e es of fr e e dom of the spike and slab lo c al smo other is b ounde d by the dimension of the lo c al p olynomial b asis. The effective deg rees of free dom can b e used to provide insight into the g eometry of a cur ve f . If the effectiv e degrees of freedom is large, f will po ssess higher order lo cal curv ature, whereas if the degree s o f freedo m are s ma ll, f is likely to b e flat. Plotting D f ( S ∗ i,h ) is therefore a way to register a cur ve and to identify key differences betw een curves. W e illustrate this concept by way of three different examples. 4.3. Spinal BMD data r evisite d F or our first example, we applied the lo cal re s caled spike and s lab mo del ( 4.4 ) to the prev iously a na lyzed BMD da ta. F o r the analysis, we used a near est neighbour kernel with a bandwidth set a t h = 1, corresp onding to one year o f ag e. F or the basis function we used cubic or thogonal p o lynomials. Figure 5 plots the spike and slab predictor for men and women (line types as in Figure 3 ). Also sup erimp os e d on the figure are pr edicted curves us ing F riedman’s super smo other (implemented in the pro gramming lang ua ge R by the call “supsmu(x,y)”). The tw o metho ds ag ree closely , altho ug h F riedman’s smoo ther app ear s to o v er-smo oth the data fo r men. The r ight-hand side of Figure 5 plots the e ffective deg rees of freedo m for men and women. One can immediately see a phase shift in the figure , signifying distinct mo des for the tw o curves. Also, overall, there is significantly less shrink age for women with degr ees of fre e do m b eing pos itive ov er a muc h wider reg ion than men. In b oth case s , curves even tually fla tten out at ar ound a ge 20. 278 H. Ishwar an and A. Pap ana Fig 5 . L eft plot: L o c al ke rnel r e g ressio n for BMD data via r esc ale d spike and slab mo dels with ortho gonal p olyno mial cubic b asis functions. Solid dark and dashe d dark lines ar e spike and slab pr e dictors for men and women, r e sp e ctively. Gr ay solid and dashe d lines ar e F ri edm an ’s sup e r- smo other for men and women. R ight plot: Effe ctive de gr e es of f re e do m of spike and slab lo c al smo other (solid line s ar e men, dashe d lines ar e women). 4.4. Cosmic micr owave b ackgr ound r adiation As another illustration w e loo k a t data related to cosmic microw a ve ba ckground (CMB) radiation [ 20 ]. Here , the v a lue for x is the mu ltipo le moment a nd Y is the estimated p ow er sp ectrum of the temper ature fluctuations. The outcome is sound wa v es in the cosmic microw ave background r adiation, which is the heat left ov er from the big ba ng . W e used the s ame s trategy a nd settings as b efore. F or the bandwidth we used h = 25 which was estimated prior to fitting using g eneralized c r oss-v alidation. Results from the analysis are depicted in Figure 6 with plots zoo med in o n differe nt regions of x in order to help visualiz e the v a rying curv ature. The bo ttom rig ht plot of Figure 6 shows the e ffectiv e deg rees o f freedom. The plot sug gests the presence of at lea st 4 distinct inflection p oints. In particular , note that initially for x < 200 there is a steep incr ease in the curve sig nified by the effective degree s o f freedom being roughly co ns tant at 3.0. A t aro und x = 200 there is a significa nt drop in the effective degr ees of freedom, follow ed by an increase and a flattening out until around x = 400. The drop at x = 2 0 0 indicates the first inflection point. At x = 40 0 there is a nother dr op in the e ffectiv e degree s o f freedom. Simila r ly , there is a drop near x = 600 a nd x = 800 . All told, this sugg ests at least 4 distinct inflections, a ll app earing in multiples of 200 starting at x = 200. 4.5. Mass sp e ctr ometry pr otein data The study of proteins is critical to under standing living organisms at the molec ula r level as pro teins a re the main comp onents of physiological pathw a ys of cells . Pro- teomics, the study of proteins o n a la rge sca le, is o ften considered the na tur al s tep after genomics in the s tudy of biologica l s y stems. Gr eatly co mplicating an y system- wide analysis of proteins, how ev er, is the dynamic natur e of the pro teome, which constantly changes through its bio chemical interactions with the g e nome and the environmen t. While the challenges faced b y pro teomics are grea t, the benefits at the sa me time are po tent ially huge. F or e xample, by studying protein differences for diseased individuals, one might b e able to discov er pathw ays re sp onsible for these differences, which in turn could lead to nov el biomarkers for ident ifying disease . Ortho gonalize d r esc ale d spike and slab smo othing 279 Fig 6 . First 3 plots (top to b ottom left to right) ar e r esc ale d spike and slab lo c al pr ed ictors (thick dark lines) for CMB data. Gr ay lines ar e F rie dman ’s smo other. Bott om right plot: Effe ct ive de gr e es of fr e e dom for r esca le d spike and slab pr ed ictor. Note the pr esenc e of 4 mo des suggeste d by this last plot. One promising technology for profiling pr otein be havior is SELDI-TOF-MS (sur- face enhanced laser deso r ption/ionizatio n time-of-flight mass s p ectr ometry). In this techn ology , homog eneous biolog ic al samples a r e placed on the active surface of an array . The protein s a mples ar e washed and an energ y absorbing molec ule so lution is plac ed on the surface o f the arr ay a nd allow ed to cr y stalize. The a rray is then queried by a la ser which ionizes the proteins in the sa mple. Charge d gaseo us pep- tides are emitted and their int ensity is detected downstream. The ma ss ov er charge ratio ( m/ z ) of a p eptide-io n is determined from the recorded TOF (time-of-flig h t). The data collected from a SE LDI-TOF-MS exper iment consists of the in tensit y (abundance) of pr oteins in the sa mple for a given m/z r atio. One can think of the set of these tw o v a lues as constituting a sp e ctra. Ea ch biological sample pro duces one spectra and it is of interest to study differences in s pec tra a s a function of phenotype. See [ 21 ] for more details a nd further references . Ident ification of unique p eaks in the spectra , a method commonly referred to as peak identification, is a crucia l part of a na lyzing mass s pe ctrometry da ta. F rom a statistica l p ersp ective, pea k identification can b e recas t as a s mo othing pro blem where the go al is to identify mo des in the da ta a fter appropr iate smo othing. The outcomes are the sp ectrometry intensit y measurements, wherea s x is the sp ecific m/z ratio. T o illustra te how our spike and slab metho d can b e used for pea k detec- tion, we ana lyzed a set of 8 calibr ation spectr a av a ila ble as part of the “PROcess” library [ 21 ] in the Bio conductor R-suite. The data is unique b ecaus e it is k nown a priori that the same 5 proteins ar e pr e s ent in each of the 8 samples. The results fr om the analysis are plotted in Figure 7 (we note that the data was first baseline nor malized prio r to analysis). The black lines in the top plot are the rescaled spike and slab smo othed predictors for ea ch sp ectr a. W e used ortho gonal 280 H. Ishwar an and A. Pap ana Fig 7 . Mass sp e ctr ometry c alibr ation data: 8 sp e ctr a, e ach c omprising 13,468 distinct m/z r atios (horizontal axis c onstr aine d to a subset of observe d m/z r atios to help zo om in figur e). T op plot: Solid lines ar e r esc ale d spike and slab pr e dictors and dashe d vertic al lines indic ate known unique pr oteins. Bottom plot: Effe ctive de gr e es of fr e e dom. po lynomials of deg ree 5 (the high degre es o f freedom used due to the spik y nature of the data). The bandwidth was set at h = 50. Superimp ose d are 5 dashed vertical lines indicating the 5 distinct proteins. Interestingly , we find that 3 o f the 5 pr oteins are clear ly identified in all 8 sp ectra. Howev er, the tw o smallest proteins m /z = 1084 and m/ z = 1638 a r e les s vis ible, the protein at m/z = 108 4 e sp ecially so. There is a ls o evidence o f at lea st 2 additional p eaks a t approximately m/z = 3500 a nd m/z = 4 500. The effective degr ees of fre e dom plo t, also given in Fig ure 7 , confirms these findings . The plot a lso indicates that overlap o f sp ectra is sub-par suggesting further nor ma lization of the data is needed. Ac kno wle dgment s. The authors ar e grateful to Bertra nd Clarke and the ref- eree who rev ie wed an earlier version o f the pap er . W e a ls o thank Lia ng Li for his comments. References [1] Ishw aran, H. and Ra o, J. S . (2005). Spike and slab v a r iable selection: frequentist and B ayesian strategies. Ann. S tatist. 33 730–7 73. MR21631 58 Ortho gonalize d r esc ale d spike and slab smo othing 281 [2] Ishw aran, H. a nd Ra o, J. S . (20 0 5). Spike and slab gene selection for multi- group microarr ay data. J. Amer. Statist. Asso c. 1 00 764– 7 80. MR22010 09 [3] Ishw aran, H. and Ra o, J. S. (200 3). Detecting differentially expressed ge ne s in microarrays using Bay esian mo del selection. J. Amer. Statist. Asso c. 9 8 438–4 55. MR19957 20 [4] Lempers, F. B. (19 71). Posterior Pr ob abilities of Alternative Line ar Mo dels . Rotterdam Univ. Pr e ss, Ro tterdam. MR0359 2 34 [5] Mitchell, T. J. and Beauchamp, J. J. (1988). Bay esian v ariable selection in linear reg ression. J. Amer. Statist. Asso c. 83 1 023–1 036. MR099 7578 [6] George, E . I. an d McCulloch, R. E. (19 93). V ariable selectio n via Gibbs sampling. J. Amer. St atist. Ass o c. 88 881–8 89. [7] George, E. I. and McCulloch, R. E. (199 7 ). Approaches for Bay esian v aria ble selection. Statist. Sinic a 7 33 9–37 3. [8] Chipman, H. (19 9 6). Bay esian v a riable selectio n with related predictor s. Canad. J. Statist. 24 17 –36. MR13947 38 [9] Cl yde, M. , D eSimone, H. and P armigiani, G . (1996). Prediction via or- thogonalized mo del mixing. J. Amer. Statist. Asso c. 91 1197–1 208. [10] Geweke, J. and Mee se, R. (1981). Estimating regressio n mo dels o f finite but unknown or der. International Ec on. Rev. 22 55 –70. MR06143 47 [11] Chipman, H. A., George , E. I. and McCulloch R. E . (2001). The prac- tical implemen tation of Ba yesian model selection (with discussion). In IMS L e ctur e Notes Mono gr. Ser. 3 8 . Mo del Sele ct ion (P . La hiri, ed.) 63 –134 . IMS, Beach w oo d, OH. MR20007 52 [12] Bachra ch, L. K., Hastie, T., W ang, M.-C. and Narasimhan , B. (1999). Bone miner a l acq uis ition in hea lth y asia n, hispanic, black and caucasian youth. A longitudinal study . J. Clin. Endo crinol. Metab. 84 4702 –471 2. [13] Hastie, T. and Tibshirani, R. (19 90). Gener alize d A dditive Mo dels . Cha p- man and Hall, Lo ndon. MR10821 47 [14] Hastie, T. and Loader, C. (19 93). Lo cal reg ression: automatic kernel car- pentry . Statist. Sci. 8 120 –143. [15] Stone, C. J. (1977). Consistent nonpar ametric regression. A nn. Statist. 5 595–6 20. MR04432 04 [16] Cleveland, W. S. (197 9). Robust lo cally weighted reg r ession and smo othing scatterplots. J. Amer. Stat ist . Asso c. 7 4 829–8 36. MR0556 476 [17] F an, J. (19 92). Design-a daptive nonpar ametric re g ression. J. Amer. Statist. Asso c. 8 7 998–1 004. MR120 9561 [18] F an, J. and Gijbels, I. ( 1995). Data-driven ba ndwidth s election in loca l po lynomial fitting: v ar ia ble ba ndwidth and spatial adaptation. J. R oy. Statist. So c. Ser. B 57 371–3 94. MR13233 45 [19] Rice, J. (1984). B andwidth choice for nonpa r ametric r egress io n. Ann. S tatist. 12 1215 – 1230 . MR07 6068 4 [20] Genovese, C. R., Miller, C. J., Nichol, R. C., Arjunw a dkar, M. and W asserman, L. (2004 ). Nonparametric inference for the cosmic micr ow a v e background. St atist. Sci. 19 308– 321. MR21469 46 [21] Li, X., Gentleman, R., Lu, X., Sh i, Q., Iglehar t, J. D ., Harris, L. and Miron, A. (200 5). SELDI-TOF mass sp ectometry protein data. In Bioin- formatics and Computational Biolo gy Solutions Using R and Bio c onductor (R. Gentleman et al., eds.) 91 – 108. Springer, New Y ork. MR22018 36
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment