Orthogonalized smoothing for rescaled spike and slab models

IMS Collectio ns Pushing the Limits of Con temp orary Statist ics: Contributions in Honor of Jay an ta K. Ghosh V ol. 3 ( 2008) 267–281 c  Institute of Mathe matical Statistics , 2008 DOI: 10.1214/ 07492170 80000001 92 Orthogon alized smo othing for rescaled spik e and s lab mo dels Heman t Ish w ara n ∗ 1 and Ariadni P apana 2 Cleveland Clinic and Case West ern R eserve University Abstract: Rescaled spi k e and slab models are a new Ba y esian v ariable se- lection meth od for linear regression models. In high dimensional orthogon al settings suc h models hav e b een shown t o possess optimal mo del select ion prop- erties. W e review bac kground theory and discuss applications of r escaled spike and slab mo dels to prediction problems in v olving orthogonal p ol ynomials. W e ﬁrst consider global smo othing and di scuss p oten tial w eaknesses. Some of these deﬁciencies are remedied b y usi ng local regression. The local regression ap- proac h relies on an intimate connection b etw ee n lo cal weigh te d regressi on and we igh ted generalized ridge regressi on. An i mp or tan t implication i s tha t one can trace the eﬀec tiv e degrees of fr eedom of a curve as a wa y to visualize and classify curv ature. Sev eral motiv ating examples are pr esen ted. Con ten ts 1 Int ro duction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 2 Rescaled spike a nd slab mo de ls . . . . . . . . . . . . . . . . . . . . . . . . 26 8 2.1 Resca ling, the c hoice of π , and implications for s hr ink ag e . . . . . . . 2 69 2.2 Selective shrink ag e r ecast in terms of p ena lization . . . . . . . . . . . 27 0 3 Orthogo nal p olynomials : ﬁrst illustration . . . . . . . . . . . . . . . . . . 27 2 3.1 Compa rative analys is using e ﬀective kernels . . . . . . . . . . . . . . 272 3.2 Eﬀective degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . 2 74 4 Lo cal regr ession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 4.1 Resca led spike and slab weigh ted regressio n . . . . . . . . . . . . . . 276 4.2 Or thogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 77 4.3 Spinal BMD da ta revis ited . . . . . . . . . . . . . . . . . . . . . . . 27 7 4.4 Cos mic microw a v e background ra diation . . . . . . . . . . . . . . . . 2 7 8 4.5 Mas s sp e c trometry protein data . . . . . . . . . . . . . . . . . . . . . 278 Ac knowledgmen ts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 1. In tro duction Rescaled spik e and slab models were introduced in [ 1 ] as a Bay esian v a riable se- lection metho d in linear r e g ression mo dels. Such mo dels w ere shown to posse s s a ∗ Supported by the NSF Gran t DMS-04-05675. 1 Departmen t of Quantitativ e Health Sciences Wb4, Clev eland Clinic, 9500 Euclid Av en ue, Cleve land, Ohio 4419 5, USA, e-m ail: hemant.i shwarant @gmail.com 2 Departmen t of Statistics, Case W estern Reserv e Universit y , 10900 Euclid Aven ue, Cleve land, Ohio 44106, USA, e-mail: ariadni. papana@c ase.edu AMS 2000 subje c t classiﬁc ations: Pr imary 62J07; secondary 62J05. Keywor ds and phr ases: eﬀect ive degrees of freedom, penalization, selectiv e shrink age . 267 268 H. Ishwar an and A. Pap ana sele ctive shrinkage pro p e rty in or thogonal models . This pro p erty allows the p oste- rior mean for the coeﬃcients to shrink to zer o for truly zero co eﬃcients while for the non-zero co eﬃcients p osterio r estimates are similar to the ordinary least squares (OLS) estima tes . In [ 2 ], resca led spike and slab mo dels were used to analyze multi- group microarr ay data (an extension of pr evious work [ 3 ]). Selec tive shrink ag e was shown to be a suﬃcient condition for oracle- like tota l miscla s siﬁcation. A ﬁnite sample ada ptive method for selecting v ariables using this principle was given. In this manuscript we extend the application of r escaled s pike and slab mo dels to smo othing problems. Giv en an outcome v a lue Y related to a v ar iable x throug h an unkno wn function f ( x ), w e would like a ccurate recov e r y of f ( x ) . Smoothing is a pre dic tio n problem, and an imp ortant contribution of the pap er is adv ancing applications o f resca led spike a nd slab mo dels to pre diction settings. How e ver this do es not mea n s e lective shrink age, a co re ingredient to mo del selection, is no t at play in a prediction pa radigm. Indee d, as s hown, selective shrink a ge plays a crucia l role in adaptive selection of ov er-para meterized ba sis functions in resp onse to curv atur e of f ( x ). W e consider global smo othing via orthogo nal p oly nomial regress ion as w ell as lo cal regr ession using orthog onal p oly no mials. Orthogonality is a k ey ingredient in our approa ch. No t only do es it allow us to exploit the selective shrink age pr op erty of resca led spike and sla b mo dels , which follow as a consequence of or tho gonality , but it als o gr eatly improves the computationally eﬃciency of our algorithms. While m uch work has been done in the area o f smo othing, w e note ther e are novel feature s in our approach p otentially useful in applied settings. One imp orta n t feature be- ing that selective shr ink ag e a llows for gr eater adaptivity to curv ature and grea ter robustness to misspeciﬁca tion of dimension of ba sis functions. Secondly , in lo cal regres s ion s ettings, a daptivity via s elective shrink age can be in terpreted in ter ms of dimensionality and curv ature. F rom this we pr ovide an eﬀective degrees of free- dom plot for gra phing estima ted dimensionality of f ( x ) as a function of x . Such plots provide a s imple and p ow erful w ay to regis ter curves. Several applicatio ns a re provided a s illus tration. 2. Rescaled spi ke and sl ab mo del s W e begin by ﬁrst revie wing background theory for r e s caled spik e and slab mo dels. The underlying setting is the linear regr ession mo del where Y 1 , . . . , Y n are indepen- dent res po nses suc h that (2.1) Y i = β 1 x i, 1 + · · · + β d x i,d + ε i = x t i β β β + ε i , i = 1 , . . . , n. Here x 1 , . . . , x n are non- random (ﬁxed design) d -dimensio nal cov ariates and β β β = ( β 1 , . . . , β d ) is the unknown co e ﬃcient vector. The ε i are independent r andom v ari- ables (but not necessar ily identically distributed) s uch that E ( ε i ) = 0, E ( ε 2 i ) = σ 2 0 and E ( ε 4 i ) ≤ M for so me M < ∞ (the la st condition is needed to inv ok e a trian- gular central limit theor em la ter, but is no t crucial and can certainly b e r elaxed). The v ariance σ 2 0 > 0 is assumed to be unknown. Thr o ughout we assume x i are standardized so that P n i =1 x i,k = 0 and P n i =1 x 2 i,k = n for k = 1 , . . . , d (without loss of gener ality w e assume that there is no in tercept term in ( 2.1 )). W e shall also assume throughout that X , the n × d design matr ix, is orthog onal, i.e., X t X = n I . As mentioned in the Int ro duction, this will allow us to exploit certain elega nt the- ories for r escaled spike and slab metho ds, although, o f course, the spike and slab metho d works for g eneral design matr ic e s. Ortho gonalize d r esc ale d spike and slab smo othing 269 Spike and s lab methods ﬁrst appea red in the works of [ 4 ] and [ 5 ] for subset selection in linear regressio n models. The expres sion “spik e and slab,” coined b y Mitc hell and Beauchamp in [ 5 ], referred to the prior for the r e gressio n co e ﬃcient s used in their hiera r chical formulation. This w as ch osen so that the co eﬃcients were m utually indep endent with a t w o-p oint mixtur e distr ibutio n made up of a uniform ﬂat distribution (the slab) and a degener ate dis tribution at ze ro (the s pike). In [ 6 ] a diﬀeren t t y pe of prior was used. This in v olved a scale mixture of t w o norma l distributions. In pa r ticular, the use of a normal pr ior was highly adv antageous and led to a Gibbs sampling method that highly popular ized the spik e and slab approach; see [ 7 , 8 , 9 , 10 , 1 1 ]. As po inted out in [ 1 ], priors inv olving a normal scale mixture distribution, of which [ 6 ] is a specia l ex a mple, co nstitute a wide class o f mo dels termed “spike a nd slab mo dels.” A mo diﬁed class of spike and slab mo dels calle d “r escaled spike and slab mo dels” was in tro duced [ 1 ]. These new mo dels diﬀered in that the or iginal Y i v alues were replac e d by new v a lues scaled by the s q uare r o ot of the sample size and divided by the square ro ot o f an estimate for σ 2 0 . Rescaling was shown to induce a non-v a nishing pena lization eﬀect for the p oster ior mean, and when used in tandem with a contin uous bimo dal prio r , the resulting p osterio r mean w as shown to p osses s a selective shr ink age pr op erty in orthog onal mo dels [ 1 ]. A r esc ale d spike and slab mo del was deﬁned in [ 1 ] to deno te a Bayesian hierar - chical mo del s pec iﬁed as follows: ( Y ∗ i | x i , β β β ) ind ∼ N( x t i β β β , n ) , i = 1 , . . . , n, ( β β β | γ ) ∼ N( 0 , Γ ) , γ ∼ π ( d γ ) . (2.2) Here Y ∗ i are the r escaled Y i v alues deﬁned by Y ∗ i = ˆ σ − 1 n 1 / 2 Y i , where ˆ σ 2 = || Y − X ˆ β β β 0 || 2 / ( n − d ) is the unbiased estima to r for σ 2 0 based on the full mo del, and ˆ β β β 0 is the OLS estimate for β β β (other estimators for σ 2 0 are also p ossible; these details, howev er, play a mino r r ole). The v alue of n used in the ﬁrs t level of the hier arch y in ( 2.2 ) is a v a riance inﬂa tion facto r introduced to compens a te for the res caling. Mor eov er, inclusion of n in the hier arch y was shown in [ 1 ] to b e neces s ary for selective shr ink age to take place. Without rescaling , shrink age for the p os ter ior mean v a nis hes in the limit due to the prior b e c oming swamped by the likelihoo d [ 1 ]. In ( 2.2 ), 0 denotes a d -dimensional zero vector, Γ = diag( γ 1 , . . . , γ d ) is a d × d dia gonal matrix and π is the prior meas ure for γ = ( γ 1 , . . . , γ d ) t . A B ayesian parameter σ 2 can also be in tro duced in ( 2.2 ) at the ﬁr s t le vel of the hiera rch y . How ev er, we av oid this approach her e and opt for the simpler set up (( 2.2 )). The rationale for this is the following: (i) we hav e alrea dy removed the eﬀect of σ 2 0 when rescaling Y i , a nd (ii) the simpler s e tup enforces a s parse s olution for the p osterior mean in ill-determined settings when d is of the same size, or larger, than n . Poin t (ii) is esp ecially r elev ant as this is the setting we are interested in here. 2.1. R esc aling, the choic e of π , and implic ati ons for shrinkage In additio n to rescaling the resp onse, the prior for γ k m ust satisfy c ertain re q uire- men ts in o rder for selective shrink a ge to o ccur. A suﬃcient condition requires the prior to hav e a bimoda l prop erty such that the rig h t tail of the distribution is con- tin uous a nd such that ther e is a spike in the distribution near zero (see Theor em 6 of [ 1 ] fo r pr ecise details). One such ex ample is the contin uous bimo dal pr ior used 270 H. Ishwar an and A. Pap ana Fig 1 . Conditional density for γ k given w : (a) w = 0 . 1 , (b) w = 0 . 25 . Observe that only the densities height changes as w is varie d. One c an think of w as a c omp lexity p ar ameter co ntr ol ling mo del dimension. Prior b ase d on hyp erp ar ameters a 1 = 5 , a 2 = 50 and v 0 = 0 . 005 as in [ 1 , 2 , 3 ] . in [ 1 , 2 , 3 ]. This prior is induced by a parameter ization inv olving a binary v ar iable and a p ositive v ar iable with an inv erse-gamma distr ibutio n. Mor e precise ly , deﬁne γ k by γ k = I k τ 2 k , where I k and τ 2 k are para meter s with priors speciﬁed according to ( I k | v 0 , w ) iid ∼ (1 − w ) δ v 0 ( · ) + w δ 1 ( · ) , k = 1 , . . . , d, ( τ − 2 k | a 1 , a 2 ) iid ∼ Gamma( a 1 , a 2 ) , w ∼ Uniform[0 , 1 ] . (2.3) The choice for v 0 (a sma ll p ositive v alue) and a 1 and a 2 (the s hap e and scale parameters for a ga mma density) are selected so that γ k has a contin uous bimo dal distribution with a spike a t v 0 and a right co nt in uous tail (see Figure 1). Such a prior allows the po sterior to shr ink a co e ﬃcient to zer o dep ending up on the v alue for γ k . Small v alues heavily shrink a co e ﬃcie nt to w ards zero. 2.2. Sele ctive shrinkage r e c ast i n terms of p enali zation One c a n view the p osterior mean as a solution to a constr ained least squar es op- timization proble m in which the hyp e rv ar iances are related to p enalty terms. This provides us with a nother wa y to think a b out the eﬀects of selective shr ink age. As befo re, we consider the orthogo nal s etting wher e X t X = n I . Let V k = E ( ν k | Y ∗ ) where ν k = γ k / (1 + γ k ). F or our argument it will be ea sier to think of pena lization in terms of ˆ β β β = ˆ σ n − 1 / 2 ˆ β β β ∗ , wher e ˆ β β β ∗ = E ( β β β | Y ∗ ) denotes the p o sterior mean for β β β under our r escaled spike and slab mo del. It can b e shown tha t (2.4) ˆ β β β = ar g min β β β ( || Y − X β β β || 2 + n d X k =1 1 − V k V k β 2 k ) . Observe how ea ch β k co eﬃcient in ( 2.4 ) is pe na lized by a unique v alue (1 − V k ) /V k . The closer V k is to 1, the smaller the p enalty and the less the shrink a ge for β k , while the closer V k is to z e ro, the larger the p enalty , a nd the more β k is shrunk to zero. It is clear the mo re adaptive V k is to the true co eﬃcient v a lue, the mo re accurate v ariable se le ction bec o mes. This ar gument can b e formalized b y studying the asymptotic be havior o f V k . Using a similar argument a s in [ 2 ], one can show that under the spik e and slab Ortho gonalize d r esc ale d spike and slab smo othing 271 Fig 2 . Limit ing densit y for ν k c onditione d on w = 0 . 1 and Z 2 k under the nul l that β k is truly zer o. V alues for Z 2 k sele c te d fro m the 25th, 50th, 75th, 90th, 95th and 99th p er c entiles of a χ 2 1 - distribution. Mo de on the left i ncr ea ses as Z 2 k de cr e ases, wher eas mo de on t he right de cr e ases. mo del ( 2.2 ) with contin uous bimo da l pr ior ( 2.3 ) (sp eciﬁed in Figure 1), the following holds: Theorem 2.1. Assum e that max 1 ≤ i ≤ n || x i || / √ n → 0 . If ( 2.1 ) r epr esents t he t rue data mo de l, and the c o eﬃcient β k for variable k is tru ly non-z er o, t hen (2.5) V k d 1 . Mor e over, if β k is truly zer o, then (2.6) E ( ν k | w, Y ∗ ) d R 1 0 ν exp  ν Z 2 k / 2  (1 − ν ) − 3 / 2 f  ν / (1 − ν ) | w  dν R 1 0 exp  ν Z 2 k / 2  (1 − ν ) − 3 / 2 f  ν / (1 − ν ) | w  dν , wher e f ( ·| w ) = (1 − w ) g 0 ( · ) + wg 1 ( · ) is t he prior density for γ k given w , wher e g 0 ( u ) = v 0 u − 2 g ( v 0 u − 1 ) , g 1 ( u ) = u − 2 g ( u − 1 ) and g ( u ) = a a 1 2 ( a 1 − 1 )! u a 1 − 1 exp( − a 2 u ) and Z k has a N(0 , 1 ) distribution. Result ( 2.5 ) of Theorem 2 .1 shows that V k approaches the v alue 1 in the case where there is true signal, and hence the p enalty for β k in ( 2.4 ) v anishes as the sample size increases, a nd β k will not be shrunk, just as we’d exp ect and hope for . Result ( 2.6 ) applies to the case when β k is really zer o. The term Z k app earing in ( 2.6 ) is the limit o f the p o sterior mean ˆ β ∗ k under the n ull, and thus Z k reﬂects the eﬀect of the data on the p os terior under the null. In particular, unless ˆ β ∗ k is unduly large, the poster io r mean for ν k should b e rela tively c lo se to the v alue under the prior. This has implicatio ns for spa rse settings. In such cas es, the p o sterior v alue for w will b e small and the p os terior for ν k conditioned on w (which will lo ok like the prior given w ) will b e concentrated near zero. Thus, the left-hand side of ( 2.6 ) should be small and the poster io r mean penalized and shrunk tow ards ze ro. On the other hand, if ˆ β ∗ k is la rge, then the le ft-ha nd side of ( 2.6 ) will b e larg e, and there will b e less p ena liz ation and le s s s hrink ag e fo r β k . A larg e v a lue for ˆ β ∗ k is unlikely under the null and in fact is expected only whe n β k is really non-zer o, whic h is another wa y to see why ( 2.5 ) holds. Figur e 2 illustrates how ν k might dep end up on ˆ β ∗ k in a spars e setting under the null that β k is truly zero . 272 H. Ishwar an and A. Pap ana 3. Orthogonal po l ynomials: ﬁrst illustration F or our ﬁrst illustration we consider a da ta set rela ted to spinal bo ne mineral dens ity (BMD) (see [ 12 ] for a mo re complete description of the data). The r esp onse is the relative ch ange in s pinal BMD as a function of age in ma le and female adolesce n ts. Figure 3 plots the res ults of our analy sis. Predicted v alues for Y bas ed on the po sterior of the rescaled s pike a nd slab mo del ( 2.2 ) are sup erimp osed on the ﬁgure as solid dar k a nd das hed dark lines for men and women, r esp ectively . The ana lysis on the left side of the plot is ba sed on an orthogonal p olyno mia l des ig n ma trix with d = 10 basis functions. Also sup erimp osed are O LS estimates (gray line s ). While the left side of Figure 3 shows some diﬀerence betw een the metho ds, discrepancies b ecome mor e apparent if d is allow ed to increase. W e re-r an the same analysis but using an overly para meterized design inv olving d = 25 basis functions. The right s ide of Figure 3 reco rds the result. Notice how badly OLS ov erﬁts, wher e as rescaled spike and slab predictors rema in relatively unaﬀected. 3.1. Comp ar ati ve analysis u sing eﬀe ctive kernels A more formal co mparison b e t ween the tw o a pproaches can b e bas ed on an eﬀective kernel analysis. Eﬀective kernels w ere int ro duced in [ 13 ] (Chapter 2.8), as a wa y to ev alua te the diﬀerences b etw een kernel smo others . Supp ose we hav e data ( x i , Y i ), i = 1 , . . . , n , where Y i are the resp onse v alues. It is assumed that (3.1) Y i = f i + ε i , i = 1 , . . . , n, where f i = x t i β β β are unknown mean v alues a nd x i ∈ R d are the v a lues of the pre- chosen underlying d basis functions ev aluated a t x i . Call a smo other s ( x ) linear in Y , if for each x 0 s ( x 0 ) = n X j =1 S j ( x 0 ) Y j , where S j ( x 0 ) dep ends o nly up on the x -v alues and no t the r esp onses. More generally , let f = (f 1 , . . . , f n ) t . If s ( x i ) is a linear smo other fo r f i , then ˆ f = SY Fig 3 . L e f t plot: R elativ e c hange in spinal BMD as a function of age. Solid dark and dashe d dark lines ar e spike and slab pr e dicte d c urves for men and women using an ortho gonal p ol ynomial design matrix with d = 10 b asis functions. Gr ay solid and dashe d lines ar e OL S estimates for men and women. Right plot: Analysis similar as befor e, but now using an ove r p ar ameterize d b asis function, d = 25 . Note the spiky beha vior of the OL S. Ortho gonalize d r esc ale d spike and slab smo othing 273 is a linea r smoo thed estimate of f , where S is the n × n smo other matrix , S = { s i,j } for s i,j = S j ( x i ). The v a lue S i ( x i ) = s i,i is o ften refer red to as the eﬀe ctive kernel at x i [ 13 , 14 ]. The eﬀective kernel mea sures the inﬂuence o f x i on Y i . The set of v alues { s i,j : j = 1 , . . . , n } , which is the i th r ow o f S , is called the eﬀective kernel for Y i . Plotting the eﬀective kernel is a way to compa r e diﬀer e n t smo others [ 13 ]. This idea can b e adapted to our setting a s follows. First we der ive the eﬀective kernel for the OLS estimate. Co nsider the orthogo nal reg ression setting in which X t X = n I . Let x ( k ) denote the k th column of the desig n matrix X . It follows that ˆ f = SY , where (3.2) S = X ( X t X ) − 1 X t = n − 1 d X k =1 x ( k ) x t ( k ) . The eﬀectiv e kernel for Y i is n − 1 P d k =1 x i,k x t ( k ) and the eﬀectiv e kernel at x i is s i,i = n − 1 x t i x i . The no tion of an eﬀective kernel needs to be slightly mo diﬁed to handle adaptive pena lization. W e adopt the notion of an adaptive smo o ther ma trix that allows the eﬀective k ernel to dep end upon b oth x i and Y i . Deﬁne the spike and slab predicto r as ˆ f ∗ = X ˆ β β β . Theorem 3.1 . Under otho gonality, the spike and slab pr e dictor for Y i in ( 3.1 ) c an b e writt en as ˆ f ∗ = S ∗ Y , wher e (3.3) S ∗ = n − 1 d X k =1 V k x ( k ) x t ( k ) . One c an c onc eptualize S ∗ as an adaptive line ar smo other matrix. Conse qu en tly, t he eﬀe ctive kernel for Y i is deﬁne d as n − 1 P d k =1 V k x i,k x t ( k ) , and the eﬀe ct ive kernel at x i is s ∗ i,i = n − 1 P d k =1 V k x 2 i,k . Figure 4 shows the eﬀective kernels at x i for the OLS and r escaled spike and slab predictors, wher e x i is age. The plots are base d o n the over-parameterized orthogo nal p o lynomial desig n inv olving d = 25 basis functions. The lar ge num b er of predictors helps to emphasize the non- r obustness of the OLS estimate. Note esp ecially ho w the OLS es timate is aﬀected by the p oints near the edges of the plots. In co nt rast, note the robustness of the spike and slab appro ach. Fig 4 . Left plot: Eﬀe ctive kernels at x i , i = 1 , . . . , n , for men using over-p ar ameterize d design, d = 25 (r esc al e d spike and slab values depict e d by thick line; O LS by dashe d line ). Right plot: Eﬀe c tive ke rne ls for women. 274 H. Ishwar an and A. Pap ana 3.2. Eﬀe ctive de gr e es of fr e e dom The smo other matrix provides information ab out the nature of a predicted curve. Ideally , ho wev er, we would lik e a rig orous and sys tema tic manner in which to sum- marize this information as a way to re gister (class ify) a cur ve. O ne way to r igoro usly compare c urves is to use the notion of the eﬀe ctive de gr e es of fr e e dom [ 13 ]. F or a ny smo other matrix S , the eﬀective degr e es of freedom, D f , is deﬁned as D f ( S ) = tr( S ) = n X i =1 s i,i . F or the O LS smoo ther matrix ( 3.2 ), D f ( S ) = n − 1 P n i =1 P d k =1 x 2 i,k = d (the last ident it y on the right is due to or thogonality). Meanwhile, for the s pike and slab smo other matrix ( 3.3 ), we have the following corollar y to Theorem 3.1 . Corollary 3. 1. Under the c onditions of The or em 3 .1 , D f ( S ∗ ) = n − 1 n X i =1 d X k =1 V k x 2 i,k = d X k =1 V k ≤ d. Henc e, the de gr e es of fr e e dom for the spike and slab smo other is b ounde d by t he dimension of the un derlying p olynomia l b asis. Observe how { V k } , the shrink age parameters , dictate the degr ees of freedom. The larger the v alue, the mo re degr ees of freedom used up, and the less s hrink ag e there is. In the analy s is pr esented earlie r us ing a satur a ted desig n ( d = 25), the eﬀective degrees of freedom ar e 4 . 2 a nd 5 . 8 for men a nd women, resp ectively , indicating more ov erall shrink age for men a nd evidence of diﬀerences in the tw o curves. Eﬀective degr ees of freedom a re useful for assess ing overall diﬀerence s b etw een curves. How e ver the metho d is limited in its a bilit y to register a curve, as it reduce s the ov erall prop er ties of a curve to a single s umma r y v alue. In the next sectio n w e illustrate a muc h more eﬀective wa y to r egister curves. 4. Lo cal regres s ion In this section we illustrate how rescaled spike and slab mo dels can b e us ed for lo cal regres s ion, a n alternative metho d of smo othing [ 15 , 16 ]. By exploiting orthog onality , and by drawing co nnections to generalized ridg e reg ression, w e show that rescaled spike and slab predicto rs ca n b e viewed as lo cal re g ression smo o thers with a lo cal smo other matrix whose eﬀective degrees of freedom c a n be tra c ed ov er x a s a wa y to characterize cur v ature of the under lying function. Another nice feature of using rescaled spike and slab mo dels , just like in global smo othing, is that w e end up being fairly robust to the choice of the dimension of the underlying basis functions. First let’s re v iew some ba ckground on lo cal regre ssion. In lo c a l regr esssion, for a given x i , ra ther tha n perfor ming a glo bal regres sion to es tima te f i , one instead ﬁts a weigh ted regr ession mo del using weigh ted least-squa res, with weigh ts for an observ ation x chosen by how close they are to x i . This re sults in a lo cal estimator ˆ f ( x i ) = ( ˆ f i, 1 , . . . , ˆ f i,n ) t in whic h the i th co o rdinate, ˆ f i,i , is used as an estimator for f i = E ( Y i ). Unlike ( 3.1 ), how ev er, the r elationship b etw een f i and x i can v ar y with i . Ortho gonalize d r esc ale d spike and slab smo othing 275 As well known, a lo cal regressio n predictor is nothing more than a weigh ted least squares predictor. That is, for a given x i , let b i,j = ( b i,j, 1 , · · · , b i,j,d ) t be the v alues of the d basis functions chosen for x i ev alua ted at x j . The lo ca l regressio n pr edictor is deﬁned as ˆ f ( x i ) = B i ˆ β β β W , where (4.1) ˆ β β β W = ar g min β β β ( n X j =1  Y j − d X k =1 β k b i,j,k  2 K  x j − x i h  ) , and K ( · ) is a p ositive kernel function with unknown bandwidth par ameter h > 0. Solving, it can b e shown that ˆ f ( x i ) is the weigh ted lea st squares predictor, (4.2) ˆ f ( x i ) = B i ( B t i W ( x i ) B i ) − 1 B t i W ( x i ) Y where B i is the n × d design matr ix with j th row b i,j , and W ( x i ) = diag { W i,j } is the n × n diagonal weigh t matrix , where W i,j = K  x j − x i h  , j = 1 , . . . , n. See [ 14 ] for details. Example 4.1 . A po pular basis function ex pansion for lo cal re g ression is in terms of p olynomials [ 1 7 , 18 ]. In this case , the design matrix is (4.3) B i =      1 x 1 − x i · · · ( x 1 − x i ) d 1 x 2 − x i · · · ( x 2 − x i ) d . . . . . . . . . . . . 1 x n − x i · · · ( x n − x i ) d      n × ( d +1) . Note that B i has rank d + 1 because w e alwa ys include an intercept term. The rationale for using a p olynomial expansio n follows by considering an expans io n o f of E ( Y j ) = f ( x j ) around x i . Since, f ( x j ) = f ( x i ) + p X k =1 ( x j − x i ) k k ! f ( k ) ( x i ) + R j , where R j is a small r emainder ter m, the lo cal weigh ted reg ression ( 4.1 ) is (approx- imately) the v alue for β β β minimizing n X j =1  f ( x i ) + p X k =1 ( x j − x i ) k k ! f ( k ) ( x i ) −  β 0 + d X k =1 β k ( x j − x i ) k   2 K  x j − x i h  . If d = p , then β 0 estimates f ( x i ) while β k estimates f ( k ) ( x i ) /k ! for k = 1 , . . . , d . The lo cal reg ression predictor is ˆ f i,j = ˆ β 0 ,W + d X k =1 ˆ β k,W ( x j − x i ) k , j = 1 , . . . , n, which sho uld be a go o d approximation to f ( x j ) when x j is near x i . 276 H. Ishwar an and A. Pap ana 4.1. R esc ale d spike and slab weighte d r e gr ession The repres entation ( 4.2 ) pres ent s an immediate tie-in to the spike and sla b metho d- ology . The resca le d p oster ior mean, ˆ β β β , from ( 2.2 ) is a mo del averaged genera liz ed ridge regr ession (GRR) estimator, expr essible as ˆ β β β = E   X t X + n Γ − 1  − 1 X t Y     Y ∗  . It is not ha rd to s ee that by appropriately in tro ducing a weigh ting matr ix into the hierarch y , that one ca n ar rive at a mo del averaged weigh ted GRR estimator, a nd a smo other of the form ( 4.2 ). The adv a ntage of this type o f approach is that the resulting smo other will be based on an e stimator that uses adaptive penaliza tion. In this mo diﬁca tion, similar to ( 4.3 ), we work with a po lynomial basis tha t depe nds up on i . How ev er, our p olynomial bas is will b e strictly or thogonal. Let I i,h =  j : K  x j − x i h  > 0  . F or an orthogona l basis we deﬁne B i to b e the design matrix for x i obtained using a d -degr ee o rthogona l basis using only those x j v alues wher e j ∈ I i,h . F or each j ∈ I i,h , deﬁne Y ∗ j = ˆ σ − 1 i n 1 / 2 i Y j , where n i is the cardinality of I i,h and ˆ σ 2 i is a n estimato r fo r σ 2 0 for the set o f res po nses, { Y j : j ∈ I i,h } . W e use the estimator due to [ 19 ], ˆ σ 2 i = 1 2( n i − 1 ) n i − 1 X j =1 ( Y ( j +1) − Y ( j ) ) 2 , where Y ( j ) is the Y -v a lue corresp onding to the j th o rdered x -v a lue in I i,h (the estimator is most ea sily computed by s orting the x v a lues ). Let Y ∗ i be the vector of the r escaled v a lues Y ∗ j = ˆ σ − 1 i n 1 / 2 i Y j for j ∈ I i,h . Let W i be the subset of W ( x i ) corresp onding to those j ∈ I i,h . F or a given x i , the mo diﬁed rescaled spike and slab mo de l is ( Y ∗ i | B i , W i , β β β ) ∼ N( B i β β β , n i W − 1 i ) , ( β β β | γ ) ∼ N( 0 , Γ ) , γ ∼ π ( d γ ) . (4.4) Consider the following theorem which c haracterizes the s pike and slab pre dictor ˆ f ∗ ( x i ) = B i ˆ β β β i,W , wher e ˆ β β β i,W is the rescaled p oster ior mea n from ( 4.4 ). W e use this result later to explicitly characterize the smo other ma trix and its eﬀective deg rees of freedom under orthogo nality . Theorem 4.1. U nder the Bayesian hier ar chy ( 4.4 ), the spike and slab lo c al pr e- dictor c an b e expr esse d as ˆ f ∗ ( x i ) = ( ˆ f ∗ i, 1 , . . . , ˆ f ∗ i,n ) t = S ∗ i,h Y i , wher e S ∗ i,h is the mo del aver age d s m o othing matrix deﬁne d by S ∗ i,h = E  B i  B t i W i B i + n i Γ − 1  − 1 B t i W i     Y ∗ i  . Note that the smo other matrix S ∗ i,h , u nlike ( 4.2 ), takes advantage of adaptive p e- nalization. Ortho gonalize d r esc ale d spike and slab smo othing 277 4.2. Ortho gonality Our construction for the basis ensures that B t i B i = n i I . Ho w ever, in order to fully exploit or thogonality , we a dditionally require that (4.5) B t i W i B i = n i I . F or ( 4.5 ) to hold w e must have W i = I . The simplest wa y to satis fy this condition is to use a nea rest neighbour kernel. F or a ﬁxed bandwidth v a lue h , let K  x h  = 1 {| x | < h } . The neares t neighbour kernel puts a w eight o f 1 on all v alues of x within a distance of h to zero . Using suc h a kernel implies that W i = I and I i,h = { j : | x j − x i | < h } . Shrink ag e, just as in the g lobal orthogo nal regr ession setting, is intimately related to the degrees of freedom of the s mo other matrix. Co nsider the following coro llary to Theore m ( 4.1 ) c haracterizing eﬀective deg rees of freedom under or tho gonality . Corollary 4.1. Under the ortho gonality assumption ( 4.5 ), t he lo c al s mo other ma- trix for t he r esc ale d spi ke and slab pr e dictor is S ∗ i,h = n − 1 i B i V i B t i . The eﬀe ctive de gr e es of fr e e dom of S ∗ i,h e quals D f ( S ∗ i,h ) = n − 1 i tr ( B i V i B t i ) = n − 1 i tr ( B t i B i V i ) = d X k =1 V i,k ≤ d, wher e V i = diag { V i,k } and V i,k = E  γ k 1 + γ k    Y ∗ i  , k = 1 , . . . , d. Henc e, the de gr e es of fr e e dom of the spike and slab lo c al smo other is b ounde d by the dimension of the lo c al p olynomial b asis. The eﬀective deg rees of free dom can b e used to provide insight into the g eometry of a cur ve f . If the eﬀectiv e degrees of freedom is large, f will po ssess higher order lo cal curv ature, whereas if the degree s o f freedo m are s ma ll, f is likely to b e ﬂat. Plotting D f ( S ∗ i,h ) is therefore a way to register a cur ve and to identify key diﬀerences betw een curves. W e illustrate this concept by way of three diﬀerent examples. 4.3. Spinal BMD data r evisite d F or our ﬁrst example, we applied the lo cal re s caled spike and s lab mo del ( 4.4 ) to the prev iously a na lyzed BMD da ta. F o r the analysis, we used a near est neighbour kernel with a bandwidth set a t h = 1, corresp onding to one year o f ag e. F or the basis function we used cubic or thogonal p o lynomials. Figure 5 plots the spike and slab predictor for men and women (line types as in Figure 3 ). Also sup erimp os e d on the ﬁgure are pr edicted curves us ing F riedman’s super smo other (implemented in the pro gramming lang ua ge R by the call “supsmu(x,y)”). The tw o metho ds ag ree closely , altho ug h F riedman’s smoo ther app ear s to o v er-smo oth the data fo r men. The r ight-hand side of Figure 5 plots the e ﬀective deg rees of freedo m for men and women. One can immediately see a phase shift in the ﬁgure , signifying distinct mo des for the tw o curves. Also, overall, there is signiﬁcantly less shrink age for women with degr ees of fre e do m b eing pos itive ov er a muc h wider reg ion than men. In b oth case s , curves even tually ﬂa tten out at ar ound a ge 20. 278 H. Ishwar an and A. Pap ana Fig 5 . L eft plot: L o c al ke rnel r e g ressio n for BMD data via r esc ale d spike and slab mo dels with ortho gonal p olyno mial cubic b asis functions. Solid dark and dashe d dark lines ar e spike and slab pr e dictors for men and women, r e sp e ctively. Gr ay solid and dashe d lines ar e F ri edm an ’s sup e r- smo other for men and women. R ight plot: Eﬀe ctive de gr e es of f re e do m of spike and slab lo c al smo other (solid line s ar e men, dashe d lines ar e women). 4.4. Cosmic micr owave b ackgr ound r adiation As another illustration w e loo k a t data related to cosmic microw a ve ba ckground (CMB) radiation [ 20 ]. Here , the v a lue for x is the mu ltipo le moment a nd Y is the estimated p ow er sp ectrum of the temper ature ﬂuctuations. The outcome is sound wa v es in the cosmic microw ave background r adiation, which is the heat left ov er from the big ba ng . W e used the s ame s trategy a nd settings as b efore. F or the bandwidth we used h = 25 which was estimated prior to ﬁtting using g eneralized c r oss-v alidation. Results from the analysis are depicted in Figure 6 with plots zoo med in o n diﬀere nt regions of x in order to help visualiz e the v a rying curv ature. The bo ttom rig ht plot of Figure 6 shows the e ﬀectiv e deg rees o f freedom. The plot sug gests the presence of at lea st 4 distinct inﬂection p oints. In particular , note that initially for x < 200 there is a steep incr ease in the curve sig niﬁed by the eﬀective degree s o f freedom being roughly co ns tant at 3.0. A t aro und x = 200 there is a signiﬁca nt drop in the eﬀective degr ees of freedom, follow ed by an increase and a ﬂattening out until around x = 400. The drop at x = 2 0 0 indicates the ﬁrst inﬂection point. At x = 40 0 there is a nother dr op in the e ﬀectiv e degree s o f freedom. Simila r ly , there is a drop near x = 600 a nd x = 800 . All told, this sugg ests at least 4 distinct inﬂections, a ll app earing in multiples of 200 starting at x = 200. 4.5. Mass sp e ctr ometry pr otein data The study of proteins is critical to under standing living organisms at the molec ula r level as pro teins a re the main comp onents of physiological pathw a ys of cells . Pro- teomics, the study of proteins o n a la rge sca le, is o ften considered the na tur al s tep after genomics in the s tudy of biologica l s y stems. Gr eatly co mplicating an y system- wide analysis of proteins, how ev er, is the dynamic natur e of the pro teome, which constantly changes through its bio chemical interactions with the g e nome and the environmen t. While the challenges faced b y pro teomics are grea t, the beneﬁts at the sa me time are po tent ially huge. F or e xample, by studying protein diﬀerences for diseased individuals, one might b e able to discov er pathw ays re sp onsible for these diﬀerences, which in turn could lead to nov el biomarkers for ident ifying disease . Ortho gonalize d r esc ale d spike and slab smo othing 279 Fig 6 . First 3 plots (top to b ottom left to right) ar e r esc ale d spike and slab lo c al pr ed ictors (thick dark lines) for CMB data. Gr ay lines ar e F rie dman ’s smo other. Bott om right plot: Eﬀe ct ive de gr e es of fr e e dom for r esca le d spike and slab pr ed ictor. Note the pr esenc e of 4 mo des suggeste d by this last plot. One promising technology for proﬁling pr otein be havior is SELDI-TOF-MS (sur- face enhanced laser deso r ption/ionizatio n time-of-ﬂight mass s p ectr ometry). In this techn ology , homog eneous biolog ic al samples a r e placed on the active surface of an array . The protein s a mples ar e washed and an energ y absorbing molec ule so lution is plac ed on the surface o f the arr ay a nd allow ed to cr y stalize. The a rray is then queried by a la ser which ionizes the proteins in the sa mple. Charge d gaseo us pep- tides are emitted and their int ensity is detected downstream. The ma ss ov er charge ratio ( m/ z ) of a p eptide-io n is determined from the recorded TOF (time-of-ﬂig h t). The data collected from a SE LDI-TOF-MS exper iment consists of the in tensit y (abundance) of pr oteins in the sa mple for a given m/z r atio. One can think of the set of these tw o v a lues as constituting a sp e ctra. Ea ch biological sample pro duces one spectra and it is of interest to study diﬀerences in s pec tra a s a function of phenotype. See [ 21 ] for more details a nd further references . Ident iﬁcation of unique p eaks in the spectra , a method commonly referred to as peak identiﬁcation, is a crucia l part of a na lyzing mass s pe ctrometry da ta. F rom a statistica l p ersp ective, pea k identiﬁcation can b e recas t as a s mo othing pro blem where the go al is to identify mo des in the da ta a fter appropr iate smo othing. The outcomes are the sp ectrometry intensit y measurements, wherea s x is the sp eciﬁc m/z ratio. T o illustra te how our spike and slab metho d can b e used for pea k detec- tion, we ana lyzed a set of 8 calibr ation spectr a av a ila ble as part of the “PROcess” library [ 21 ] in the Bio conductor R-suite. The data is unique b ecaus e it is k nown a priori that the same 5 proteins ar e pr e s ent in each of the 8 samples. The results fr om the analysis are plotted in Figure 7 (we note that the data was ﬁrst baseline nor malized prio r to analysis). The black lines in the top plot are the rescaled spike and slab smo othed predictors for ea ch sp ectr a. W e used ortho gonal 280 H. Ishwar an and A. Pap ana Fig 7 . Mass sp e ctr ometry c alibr ation data: 8 sp e ctr a, e ach c omprising 13,468 distinct m/z r atios (horizontal axis c onstr aine d to a subset of observe d m/z r atios to help zo om in ﬁgur e). T op plot: Solid lines ar e r esc ale d spike and slab pr e dictors and dashe d vertic al lines indic ate known unique pr oteins. Bottom plot: Eﬀe ctive de gr e es of fr e e dom. po lynomials of deg ree 5 (the high degre es o f freedom used due to the spik y nature of the data). The bandwidth was set at h = 50. Superimp ose d are 5 dashed vertical lines indicating the 5 distinct proteins. Interestingly , we ﬁnd that 3 o f the 5 pr oteins are clear ly identiﬁed in all 8 sp ectra. Howev er, the tw o smallest proteins m /z = 1084 and m/ z = 1638 a r e les s vis ible, the protein at m/z = 108 4 e sp ecially so. There is a ls o evidence o f at lea st 2 additional p eaks a t approximately m/z = 3500 a nd m/z = 4 500. The eﬀective degr ees of fre e dom plo t, also given in Fig ure 7 , conﬁrms these ﬁndings . The plot a lso indicates that overlap o f sp ectra is sub-par suggesting further nor ma lization of the data is needed. Ac kno wle dgment s. The authors ar e grateful to Bertra nd Clarke and the ref- eree who rev ie wed an earlier version o f the pap er . W e a ls o thank Lia ng Li for his comments. References [1] Ishw aran, H. and Ra o, J. S . (2005). Spike and slab v a r iable selection: frequentist and B ayesian strategies. Ann. S tatist. 33 730–7 73. MR21631 58 Ortho gonalize d r esc ale d spike and slab smo othing 281 [2] Ishw aran, H. a nd Ra o, J. S . (20 0 5). Spike and slab gene selection for multi- group microarr ay data. J. Amer. Statist. Asso c. 1 00 764– 7 80. MR22010 09 [3] Ishw aran, H. and Ra o, J. S. (200 3). Detecting diﬀerentially expressed ge ne s in microarrays using Bay esian mo del selection. J. Amer. Statist. Asso c. 9 8 438–4 55. MR19957 20 [4] Lempers, F. B. (19 71). Posterior Pr ob abilities of Alternative Line ar Mo dels . Rotterdam Univ. Pr e ss, Ro tterdam. MR0359 2 34 [5] Mitchell, T. J. and Beauchamp, J. J. (1988). Bay esian v ariable selection in linear reg ression. J. Amer. Statist. Asso c. 83 1 023–1 036. MR099 7578 [6] George, E . I. an d McCulloch, R. E. (19 93). V ariable selectio n via Gibbs sampling. J. Amer. St atist. Ass o c. 88 881–8 89. [7] George, E. I. and McCulloch, R. E. (199 7 ). Approaches for Bay esian v aria ble selection. Statist. Sinic a 7 33 9–37 3. [8] Chipman, H. (19 9 6). Bay esian v a riable selectio n with related predictor s. Canad. J. Statist. 24 17 –36. MR13947 38 [9] Cl yde, M. , D eSimone, H. and P armigiani, G . (1996). Prediction via or- thogonalized mo del mixing. J. Amer. Statist. Asso c. 91 1197–1 208. [10] Geweke, J. and Mee se, R. (1981). Estimating regressio n mo dels o f ﬁnite but unknown or der. International Ec on. Rev. 22 55 –70. MR06143 47 [11] Chipman, H. A., George , E. I. and McCulloch R. E . (2001). The prac- tical implemen tation of Ba yesian model selection (with discussion). In IMS L e ctur e Notes Mono gr. Ser. 3 8 . Mo del Sele ct ion (P . La hiri, ed.) 63 –134 . IMS, Beach w oo d, OH. MR20007 52 [12] Bachra ch, L. K., Hastie, T., W ang, M.-C. and Narasimhan , B. (1999). Bone miner a l acq uis ition in hea lth y asia n, hispanic, black and caucasian youth. A longitudinal study . J. Clin. Endo crinol. Metab. 84 4702 –471 2. [13] Hastie, T. and Tibshirani, R. (19 90). Gener alize d A dditive Mo dels . Cha p- man and Hall, Lo ndon. MR10821 47 [14] Hastie, T. and Loader, C. (19 93). Lo cal reg ression: automatic kernel car- pentry . Statist. Sci. 8 120 –143. [15] Stone, C. J. (1977). Consistent nonpar ametric regression. A nn. Statist. 5 595–6 20. MR04432 04 [16] Cleveland, W. S. (197 9). Robust lo cally weighted reg r ession and smo othing scatterplots. J. Amer. Stat ist . Asso c. 7 4 829–8 36. MR0556 476 [17] F an, J. (19 92). Design-a daptive nonpar ametric re g ression. J. Amer. Statist. Asso c. 8 7 998–1 004. MR120 9561 [18] F an, J. and Gijbels, I. ( 1995). Data-driven ba ndwidth s election in loca l po lynomial ﬁtting: v ar ia ble ba ndwidth and spatial adaptation. J. R oy. Statist. So c. Ser. B 57 371–3 94. MR13233 45 [19] Rice, J. (1984). B andwidth choice for nonpa r ametric r egress io n. Ann. S tatist. 12 1215 – 1230 . MR07 6068 4 [20] Genovese, C. R., Miller, C. J., Nichol, R. C., Arjunw a dkar, M. and W asserman, L. (2004 ). Nonparametric inference for the cosmic micr ow a v e background. St atist. Sci. 19 308– 321. MR21469 46 [21] Li, X., Gentleman, R., Lu, X., Sh i, Q., Iglehar t, J. D ., Harris, L. and Miron, A. (200 5). SELDI-TOF mass sp ectometry protein data. In Bioin- formatics and Computational Biolo gy Solutions Using R and Bio c onductor (R. Gentleman et al., eds.) 91 – 108. Springer, New Y ork. MR22018 36

Orthogonalized smoothing for rescaled spike and slab models

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment