Stochastic Stepwise Ensembles for Variable Selection


Authors: Lu Xin, Mu Zhu

University of Waterloo
Waterloo, ON, Canada N2L 3G1

November 26, 2024

Abstract

In this article, we advocate the ensemble approach for variable selection. We point out that the stochastic mechanism used to generate the variable-selection ensemble (VSE) must be picked with care. We construct a VSE using a stochastic stepwise algorithm, and compare its performance with numerous state-of-the-art algorithms.

1 Introduction

The ensemble approach for statistical modelling was first made popular by such algorithms as boosting (Freund and Schapire 1996; Friedman et al. 2000), bagging (Breiman 1996), random forest (Breiman 2001), and the gradient boosting machine (Friedman 2001). They are powerful algorithms for solving prediction problems. This article is concerned with using the ensemble approach for a different problem, variable selection. We shall use the terms "prediction ensemble" and "variable-selection ensemble" to differentiate ensembles used for these different purposes.

1.1 Variable-selection ensembles (VSEs)

First, we give a general description of variable-selection ensembles (VSEs). Suppose there are p candidate variables. A VSE (of size B) can be represented by a B × p matrix, say E, whose j-th column contains B independent measures of how important variable j is. Let E(b, j) denote the (b, j)-th entry of E. Using the ensemble E as a whole, one typically ranks the importance of variable j using a majority-vote type of summary, such as

    R(j) = \frac{1}{B} \sum_{b=1}^{B} E(b, j),    (1)

and the variables that are ranked "considerably higher" than the rest are then selected. The key for generating a VSE lies in producing multiple measures of importance for each candidate variable.
By contrast, traditional variable selection procedures, including stepwise selection and the Lasso, typically produce just one such measure, that is, B = 1. It shouldn't be hard for any statistician to appreciate that averaging over a number of independent measures is often beneficial. This is the main reason why VSEs are attractive and more powerful than many traditional approaches.

To make selection decisions, one must be more precise about what it means to say that some variables are ranked "considerably higher" than the rest. One option is to select variable j if it is ranked "above average," i.e., if

    R(j) > \frac{1}{p} \sum_{k=1}^{p} R(k).    (2)

This is what we use in all the experiments reported below, but we emphasize that other thresholding rules can be used as well. For example, one can make a so-called "scree plot" of R(1), R(2), ..., R(p) and look for an "elbow," a very common practice in principal component analysis (e.g., Jolliffe 2002); but the precise location of the "elbow" is highly subjective, which is why we choose not to use this strategy here. The distinction between ranking and thresholding is particularly important for VSEs; we will say more about this in Section 4.1.

1.2 PGA

Zhu and Chipman (2006) constructed a VSE using a so-called parallel genetic algorithm (PGA). To produce multiple measures of variable importance, PGA repeatedly performs stochastic rather than deterministic optimization of the Akaike Information Criterion (AIC; Akaike 1973), while deliberately stopping each optimization path prematurely. In practice, one must be more exact about what "premature" means, but we won't go into the specifics here and refer the readers to Zhu and Chipman (2006, Section 3.2) for details. The main idea is as follows.
Early termination forces each optimization path to produce sub-optimal rather than optimal solutions, while the use of stochastic rather than deterministic optimization allows each of these sub-optimal solutions to be different from each other (Zhu 2008). For example, suppose we have five candidate variables, x_1, x_2, ..., x_5. The first time we stochastically optimize the AIC for just a few steps, we may arrive at the solution {x_1, x_2, x_3}; the second time, we may arrive at {x_1, x_2, x_4}; and the third time, perhaps {x_1, x_2, x_5}. This produces the following ensemble:

    E = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \end{pmatrix}.

Since R(1) = R(2) = 1 > 1/3 = R(3) = R(4) = R(5), the ensemble selects {x_1, x_2}.

Zhu and Chipman (2006) used the genetic algorithm (GA; Goldberg 1989) as their stochastic optimizer in each path, but our general description of VSEs above (Section 1.1) makes it clear that any other stochastic optimizer can be used for PGA to work, despite the name "PGA." This is a crucial point, and we will come back to it later (Sections 1.3 and 4.2).

Though driven by the AIC, it has been observed that PGA has a much higher probability of selecting the correct subset of variables than optimizing the AIC by exhaustive search. Of course, such observations have only been made on mid-sized simulation problems where an exhaustive search is feasible and the correct subset is known. Nonetheless, they show quite conclusively that PGA is not merely a better search algorithm, because one cannot possibly perform a better search than an exhaustive one. Therefore, PGA can be seen as an effective AIC "booster," and this is precisely why the ensemble approach to variable selection is valuable and powerful.
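As a concrete check of the aggregation rules (1) and (2) on the toy ensemble above, the entire selection step takes only a few lines. The following is a minimal pure-Python sketch, not the authors' code:

```python
# Toy VSE from Section 1.2: three paths over five candidate variables.
# Row b holds path b's 0/1 importance measures E(b, j).
E = [
    [1, 1, 1, 0, 0],  # path 1 arrived at {x1, x2, x3}
    [1, 1, 0, 1, 0],  # path 2 arrived at {x1, x2, x4}
    [1, 1, 0, 0, 1],  # path 3 arrived at {x1, x2, x5}
]
B, p = len(E), len(E[0])

# Equation (1): R(j) = (1/B) * sum_b E(b, j)
R = [sum(E[b][j] for b in range(B)) / B for j in range(p)]

# Equation (2): select variable j if R(j) exceeds the average of all R(k)
avg = sum(R) / p
selected = [j + 1 for j in range(p) if R[j] > avg]  # 1-based variable labels

print(R)         # R(1) = R(2) = 1, R(3) = R(4) = R(5) = 1/3
print(selected)  # [1, 2]
```

With R(1) = R(2) = 1 and the remaining three at 1/3, the average is 0.6, so rule (2) selects exactly {x_1, x_2}, matching the text.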
1.3 Motivating example: A weak signal

One of the key objectives of this article is to present a better variable-selection ensemble than what PGA produces, one which we call the stochastic stepwise ensemble (ST2E). Like Zhu and Chipman (2006), we also focus on multiple linear regression models, although VSEs are easily applicable to other statistical models such as logistic regression and Cox regression.

We first describe a simple experiment to motivate our work. There are 20 potential predictors, x_1, x_2, ..., x_{20}, but only three of them are actually used in the true model to generate the response, y:

    y = \alpha x_1 + 2 x_2 + 3 x_3 + \sigma \epsilon,    x_1, ..., x_{20}, \epsilon \sim N(0, I).    (3)

The sample size n is taken to be 100, and σ = 3. In addition, the three variables that generate y are correlated, with Corr(x_i, x_j) = 0.7 for i, j ∈ {1, 2, 3} and i ≠ j. The x's and ε are otherwise independent of each other. We shall consider α = 0.1, 0.2, ..., 0.9, 1.0, 1.2 and 1.5. By construction, there are three types of variables: x_1 is a relatively weak variable (it is part of the true model but its signal-to-noise ratio is relatively low); x_2 and x_3 are relatively strong variables; and x_4, ..., x_{20} are noise variables, which are not part of the true model.

For each α, the experiment is repeated 100 times. Figure 1 shows the average frequency the three different types of variables are selected by two different VSEs, PGA and ST2E, as our experimental parameter α varies. The two VSEs are both of size B = 300. The messages from this experiment are as follows. In terms of catching the strong signals (x_2 and x_3), ST2E and PGA are about the same. In terms of guarding against the noise variables (x_j for j > 3), ST2E is slightly better than PGA.
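The simulation design in equation (3) is easy to reproduce. The following is a minimal pure-Python sketch (not the authors' code); the shared-factor construction for the equicorrelated trio and the seed are implementation choices for illustration:

```python
import math
import random

def simulate(alpha, n=100, p=20, sigma=3.0, rho=0.7, seed=0):
    """Draw one data set from model (3): y = alpha*x1 + 2*x2 + 3*x3 + sigma*eps,
    with Corr(x_i, x_j) = rho for i != j in {1, 2, 3} and all other x's independent."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1 - rho)  # shared-factor equicorrelation
    X, y = [], []
    for _ in range(n):
        w = rng.gauss(0, 1)                    # common factor for x1, x2, x3
        row = [a * w + b * rng.gauss(0, 1) for _ in range(3)]
        row += [rng.gauss(0, 1) for _ in range(p - 3)]
        X.append(row)
        y.append(alpha * row[0] + 2 * row[1] + 3 * row[2] + sigma * rng.gauss(0, 1))
    return X, y

X, y = simulate(alpha=0.5)
```

Writing x_i = \sqrt{\rho}\, w + \sqrt{1-\rho}\, z_i with a common factor w gives unit variances and pairwise correlation ρ exactly, so no Cholesky factorization is needed for this equicorrelated block.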
But, most importantly, we see that ST2E is significantly better than PGA at catching the weak signal, x_1. It is in this sense that ST2E is a better VSE than PGA. The improved performance of ST2E is due to the use of a more structured stochastic optimizer in each path. In particular, a so-called stochastic stepwise (STST, or simply ST2) algorithm is used, which is why we call it the "stochastic stepwise ensemble" (or ST2E). Explanations for why a more structured stochastic optimizer is desirable are given in Section 2.3.1.

Figure 1: Motivating example (Section 1.3). Average frequency the three different types of variables are selected by ST2E and PGA, plotted against α. Panels: Pr(x_j included) for j = 1 (weak signal), j = 2, 3 (strong signal), and j > 3 (noise). Notice that the vertical scales are not identical.

1.4 Random Lasso and stability selection

The idea of using the ensemble approach for variable selection has started to catch on in recent years; examples include the "random Lasso" method (Wang et al. 2009) and the "stability selection" method (Meinshausen and Bühlmann 2010). The latter consists of a general class of methods for structural estimation, including graphical modeling, but we shall limit our discussions here to the regression problem. For regression problems, these algorithms essentially give rise to different VSEs as we have defined them in Section 1.1, although they are not explicitly presented or labeled as ensemble algorithms by the authors.
The "random Lasso" method, for instance, was presented to "alleviate [various] limitations of [the] Lasso" when dealing with highly correlated variables and "large p, small n" problems. The "stability selection" method, on the other hand, was presented to reduce the influence of regularization parameters and to provide "finite sample familywise error control" for the expected number of false discoveries. We shall see later that, while this method does provide very good error control for false discoveries, it appears to do so at the expense of missing true signals. Even though one can apply stability selection to different regression procedures, we will consider only its application to the Lasso (Tibshirani 1996).

At a very high level, both the random Lasso and the stability selection algorithms consist of running a randomized Lasso repeatedly on many bootstrap samples and aggregating the results. That is, they can both be regarded as VSEs. They differ in how the Lasso is randomized. For Wang et al. (2009), each Lasso is run with a randomly selected subset of variables; both the size of this subset and the ℓ1-regularization parameter λ are selected by cross validation (CV) and fixed at the CV choices. For Meinshausen and Bühlmann (2010), all regularization parameters λ ≥ λ_min are considered; each Lasso is run by randomly scaling the regularization parameter for every regression coefficient; and the parameter λ_min is chosen to control the expected number of false discoveries.

1.5 Outline

We proceed as follows. In Section 2, we describe a more structured approach to generate VSEs, the ST2 algorithm. We explain why the ST2 algorithm produces better ensembles than the genetic algorithm. We also describe how to set the tuning parameter in ST2.
In Section 3, we present a number of simulated and real-data examples, and compare ST2E with a number of other variable-selection algorithms, such as the Lasso (Tibshirani 1996) and its variations (e.g., Zou 2006; Meinshausen 2007), LARS (Efron et al. 2004), SCAD (Fan and Li 2001), the elastic net (Zou and Hastie 2005), VISA (Radchenko and James 2008), and the two ensemble approaches mentioned above: random Lasso (Wang et al. 2009) and stability selection (Meinshausen and Bühlmann 2010). In Section 4, we make a few more general comments about the ensemble approach for variable selection, and present a simple extension of ST2E to tackle "large p, small n" problems.

2 Stochastic stepwise selection

In this section, we describe a more structured stochastic search algorithm suitable for variable selection. In Section 2.3.1, some explanations will be given as to why a more structured stochastic search is desirable.

2.1 The ST2 algorithm

Traditional stepwise regression combines forward and backward selection, alternating between forward and backward steps. In the forward step, each variable other than those already included is added to the current model, one at a time, and the one that can best improve the objective function, e.g., the AIC, is retained. In the backward step, each variable already included is deleted from the current model, one at a time, and the one that can best improve the objective function is discarded. The algorithm continues until no improvement can be made by either the forward or the backward step.

Instead of adding or deleting variables one at a time, ST2 adds or deletes a group of variables at a time, where the group size is randomly decided. In traditional stepwise, the group size is one and each candidate variable is assessed.
When the group size is larger than one, as is often the case for ST2, the total number of variable groups can be quite large. Instead of evaluating all possible groups, only a randomly selected few are assessed and the best one chosen. Table 1 contains a detailed description of the ST2 algorithm.

Table 1: The stochastic stepwise (STST, or simply ST2) algorithm for variable selection.

Repeat
  1. (Forward Step) Suppose d variables, {x_{l_1}, x_{l_2}, ..., x_{l_d}}, are not in the current model. Initially, d = p.
     (a) Determine the number of variables we want to add into the model, or the group size, say g_f ≤ d.†
     (b) Determine the number of candidate groups that will be assessed, k_f.†
     (c) Generate k_f candidate groups of size g_f, each by randomly choosing g_f variables without replacement from the set {x_{l_1}, x_{l_2}, ..., x_{l_d}}.
     (d) Assess each candidate group by adding it into the current model, one group at a time. The one that can best improve the objective function (e.g., the AIC) is added into the model.
  2. (Backward Step) Suppose h variables, {x_{l_1}, x_{l_2}, ..., x_{l_h}}, are in the current model.
     (a) Determine the number of variables we want to delete from the model, or the group size, say g_b ≤ h.†
     (b) Determine the number of candidate groups that will be assessed, k_b.†
     (c) Generate k_b candidate groups of size g_b, each by randomly choosing g_b variables without replacement from the set {x_{l_1}, x_{l_2}, ..., x_{l_h}}.
     (d) Assess each candidate group by deleting it from the current model, one group at a time. The one that can best improve the objective function (e.g., the AIC) is deleted from the model.
Until no improvement can be made by either the forward or the backward step.

† Details for how the numbers g_f, k_f, g_b, k_b are determined are given in Sections 2.2 and 2.3.

2.2 Tuning functions

We now explain how the numbers g_f, g_b, k_f, k_b (see Table 1) are determined. Suppose we are doing a forward (backward) step and the potential predictors to be added (deleted) are {x_1, x_2, ..., x_m}. First, we need to determine g, the number of variables to add (delete). Intuitively, it seems reasonable that g should depend on m, say g = \phi_g(m). Second, given g, we have a total of \binom{m}{g} possible groups of variables and need to determine k, the number of groups to assess. Intuitively, it also seems reasonable that k should depend on \binom{m}{g}, say k = \phi_k(m, g).

We define the function \phi_g as

    \phi_g(m) \sim \mathrm{Unif}(\Psi_m), \quad \text{where } \Psi_m = \{1, 2, 3, ..., \lfloor \lambda m + 0.5 \rfloor\},

for some 0 < λ < 1, and ⌊x⌋ is the largest integer not greater than x. The function \phi_k is defined as

    \phi_k(m, g) = \left\lfloor \binom{m}{g}^{1/\kappa} + 0.5 \right\rfloor    (4)

for some κ > 1. We fix λ = 1/2 and discuss how to choose the parameter κ later (Section 2.3). Here, we first explain why the functions \phi_g and \phi_k are chosen to have these particular forms, and why we fix λ = 1/2.

First, it is important that g = \phi_g(m) is a stochastic and not a deterministic function. Consider the first forward step and the first backward step. Suppose there are m = p = 20 potential predictors. If \phi_g(m) is a deterministic function, say \phi_g(20) = 7 and \phi_g(7) = 3, then, for all individual paths, only models with 7 predictors are assessed by the forward step and those with 7 - 3 = 4 predictors are assessed by the backward step. Many models, such as those with 2 or 3 predictors, are never assessed. Clearly, more flexibility is needed. According to our definition and with λ = 1/2, \phi_g(20) \sim \mathrm{Unif}\{1, 2, ..., 10\}. In the first forward step, some paths may add 3 variables, while others may add 1, 2, 4, 5, ..., or 10 variables.

Next, it makes sense that we should not add or delete too many variables in a single step.
This is why we fix λ = 1/2, so that at most half of the available variables can be added or deleted. Notice that this restriction only applies to single search steps; it does not preclude the algorithm (which consists of many such steps) from selecting all the variables if necessary.

Finally, we want the function k = \phi_k(m, g) to be monotonically increasing in \binom{m}{g}: as more subsets become available, more candidate groups should be assessed in order to have a reasonable chance of finding an improvement. However, we cannot afford to let this grow linearly in \binom{m}{g}, since it can be a very large number; that is why we use \binom{m}{g}^{1/\kappa} for some κ > 1.

2.3 Tuning parameters

We now explain how to choose the tuning parameter, κ. For prediction ensembles, cross validation can be used to adjust various tuning parameters in order to maximize prediction accuracy, but this requires our validation data to contain the right answer. For variable-selection ensembles, however, this is not viable. We cannot empirically adjust the tuning parameter(s) to maximize selection accuracy because, no matter how we set data aside for validation, we won't know which variables are in the "true model" and which ones are not. For this reason, many researchers still use cross-validated prediction error to guide the choice of tuning parameters for their variable-selection procedure, but Yang (2005) has clearly established that prediction accuracy and selection accuracy are fundamentally at odds with each other. In a more recent article, Shmueli (2010) discusses various related issues from a less technical and more philosophical point of view. It must be stated without ambiguity that we are aiming for selection accuracy, not prediction accuracy. As such, cross validation is out of the question.
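Before turning to the choice of κ, it may help to make the mechanics of Sections 2.1 and 2.2 concrete. The following is a minimal, self-contained sketch of one ST2 search path. The toy objective standing in for the AIC, the "true" subset {0, 1, 2}, its 2:1 penalty weights, and the seed are all illustrative assumptions, not from the paper:

```python
import math
import random

def phi_g(m, lam=0.5, rng=random):
    """Group size: Uniform on {1, ..., floor(lam*m + 0.5)} (Section 2.2, lam = 1/2)."""
    return rng.randint(1, max(1, int(lam * m + 0.5)))

def phi_k(m, g, kappa=2.0):
    """Number of candidate groups to assess: floor(C(m,g)^(1/kappa) + 0.5), eq. (4)."""
    return max(1, int(math.comb(m, g) ** (1.0 / kappa) + 0.5))

def st2_path(p, objective, kappa=2.0, seed=None):
    """One ST2 search path (Table 1): alternate stochastic forward and backward
    group steps until neither step improves the objective (lower is better)."""
    rng = random.Random(seed)
    model, best = set(), objective(set())
    improved = True
    while improved:
        improved = False
        for direction in ("forward", "backward"):
            pool = (sorted(set(range(p)) - model) if direction == "forward"
                    else sorted(model))
            if not pool:
                continue
            g = phi_g(len(pool), rng=rng)
            k = phi_k(len(pool), g, kappa)
            best_cand, best_score = None, best
            for _ in range(k):  # steps (c)-(d): assess k random groups
                group = set(rng.sample(pool, g))
                cand = model | group if direction == "forward" else model - group
                score = objective(cand)
                if score < best_score:
                    best_cand, best_score = cand, score
            if best_cand is not None:  # apply the best group, if any improves
                model, best, improved = best_cand, best_score, True
    return model

# Toy objective standing in for the AIC (illustrative, not from the paper):
# penalize missing a "true" variable twice as much as including a noise one.
TRUE_SET = {0, 1, 2}
def toy_objective(S):
    return 2 * len(TRUE_SET - S) + len(S - TRUE_SET)

print(st2_path(p=10, objective=toy_objective, seed=1))
```

Because each path is stochastic and allowed to stall before reaching the optimum, different seeds yield different sub-optimal models; stacking many such paths into a matrix E and applying rules (1) and (2) then recovers the ensemble selection.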
Instead, we use ideas from Breiman (2001) to help us specify our tuning parameter.

2.3.1 Strength-diversity tradeoff

Breiman (2001) studied prediction ensembles that he called random forests (RF):

    \mathrm{RF} = \{ f(x; \theta_b) : \theta_b \overset{iid}{\sim} P_\theta,\ b = 1, 2, ..., B \}.

For classification, f(x; θ_b) is a classifier completely parameterized by θ_b, and the statement "θ_b iid∼ P_θ" means that each f(·; θ_b) is generated using an iid stochastic mechanism, P_θ, e.g., bootstrap sampling (e.g., Breiman 1996) and/or random subspaces (e.g., Ho 1998). Breiman (2001) proved that, for a random forest to be effective, individual members of the ensemble must be as good classifiers as possible, but they must also be as uncorrelated with each other as possible. In other words, a good classification ensemble is a diverse collection of strong classifiers.

Typically, the diversity of an ensemble can be increased by increasing Var(P_θ). But, unfortunately, this almost always reduces its strength. Therefore, it is important to use a stochastic mechanism P_θ with a "reasonable" Var(P_θ). This basic principle has been noted elsewhere, too. For example, Friedman and Popescu (2003) described ensembles from an importance-sampling point of view. There, the corresponding notion of Var(P_θ) is simply the variance of the importance-sampling distribution, also referred to as the trial distribution. It is well known in the importance-sampling literature that the variance of the trial distribution must be specified carefully (e.g., Liu 2001, Section 2.5).

Although Breiman's theory of the strength-diversity tradeoff was developed for prediction ensembles, some of these ideas can be borrowed for VSEs. In fact, this tradeoff explains why ST2E produces better variable-selection ensembles than PGA.
Recall that PGA uses the genetic algorithm as the main stochastic mechanism to produce the ensemble, whereas ST2E uses the more structured ST2 algorithm (Section 2.1). Using Breiman's language, we can say that the more structured ST2 algorithm has a "more reasonable" variance than the genetic algorithm. It also allows us to exercise more control over the algorithm via the tuning parameter κ, in order to better balance the intricate tradeoff between strength and diversity.

2.3.2 Computable measures of strength and diversity

We now describe how to use the diversity-strength tradeoff to specify the tuning parameter, κ. Given p potential covariates, let E be a VSE of size B. The idea is to define computable measures of its diversity and strength, say D(E) and S(E), and choose κ to balance the tradeoff between them.

Given a VSE, E, its diversity D(E) can be measured by the average within-ensemble variation. For every variable j, there are B independent measures of how important the variable is. The quantity

    v(j) = \frac{1}{B-1} \sum_{b=1}^{B} \left[ E(b, j) - \frac{1}{B} \sum_{b=1}^{B} E(b, j) \right]^2    (5)

is the within-ensemble variance of these measures, and the quantity

    D(E) = \frac{1}{p} \sum_{j=1}^{p} v(j)    (6)

is a measure of the average within-ensemble variation.

Let F(·) denote the objective function that each path of the VSE aims to optimize. We measure the mean strength of E by the average percent improvement of the objective function over the null model, i.e.,

    S(E) = \frac{1}{B} \sum_{b=1}^{B} \frac{|F(E(b, \cdot)) - F_0|}{F_0},    (7)

where F_0 is the objective function evaluated at the null model, i.e., a model that does not contain any predictors.

2.3.3 Example

Figure 2 shows an example of how our tuning strategy works. This is based on 50 simulations using equation (3), while fixing α = 1. The quantities S(E) and D(E) are plotted against κ (left and middle).
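The quantities in equations (5), (6), and (7) are direct to compute from an ensemble matrix. A minimal pure-Python sketch (the small example ensemble and the objective values for its paths are made up for illustration):

```python
def diversity(E):
    """D(E): average within-ensemble variance across variables, eqs. (5)-(6)."""
    B, p = len(E), len(E[0])
    total = 0.0
    for j in range(p):
        col = [E[b][j] for b in range(B)]
        mean = sum(col) / B
        total += sum((x - mean) ** 2 for x in col) / (B - 1)  # v(j), eq. (5)
    return total / p                                          # D(E), eq. (6)

def strength(F_values, F0):
    """S(E): average percent improvement of the objective over the null model, eq. (7)."""
    return sum(abs(F - F0) / F0 for F in F_values) / len(F_values)

# Toy ensemble from Section 1.2; columns 3-5 each have v(j) = 1/3, so D(E) = 1/5.
E = [[1, 1, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [1, 1, 0, 0, 1]]
print(diversity(E))

# Made-up objective values F(E(b, .)) for the three paths, with null value F0 = 100.
print(strength([80.0, 82.0, 81.0], F0=100.0))  # (0.20 + 0.18 + 0.19) / 3 = 0.19
```

In practice one would take F to be the AIC reached by each path, so that both D(E) and S(E) come for free once the ensemble has been generated.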
For κ, the logarithmic scale is used. Variable-selection performance, measured here by

    Perf(E) = ASF(Signal) - ASF(Noise),

where "ASF" stands for the average selection frequency, is also plotted (right).

The behavior depicted here is fairly typical. We see from the left panel that S(E) tends to decrease as we increase κ. This is because, when κ is relatively small, the search conducted by steps 1(d) and 2(d) in the ST2 algorithm (Table 1) is relatively greedy; many candidate subsets (of a certain given size) are evaluated and the best one, chosen. This results in high strength. As κ increases, the search becomes less greedy, which reduces strength.

On the other hand, the greedier the search, the higher the chance of finding the same subset. This explains why a small κ tends to produce low diversity. As κ increases and the search becomes less greedy, diversity starts to increase. But the parameter κ controls not only the greediness but also the scope of the search. For example, it is easy to see from (4) that \phi_k(m, g) \to 1 as \kappa \to \infty. This means that, when κ is very large, the scope of the search performed by steps 1(d) and 2(d) becomes quite limited, which also reduces diversity. This is why we see that, in the middle panel, the diversity measure D(E) first increases but eventually decreases with κ.

Figure 2: Illustration of how to select the tuning parameter, κ, in ST2E. Based on 50 simulations of model (3) with α = 1. Panels plot strength S(E) (left), diversity D(E) (middle), and performance Perf(E) (right) against log(κ).

Finally, we see from the right panel that, as D(E) reaches its peak level, the variable-selection performance also starts to level off or drop.
This means choosing κ by looking for the peak in the D(E) plot can be an effective strategy. This is what we use in all the experiments we report below. Of course, we must emphasize that the measure Perf(E), and hence the right panel of Figure 2, are typically not available; they are only available for simulated examples where the true model is known. We include them here merely as a validation that it makes sense to choose κ in such a way. In reality, one must rely on the plot of D(E) alone to make the choice.

2.4 Effect of sample size

Before we move on to examples and performance comparisons, we conduct a simple experiment to examine the performance of ST2E as the sample size n increases. As in Section 2.3.3, we simulate from our motivating example, equation (3), except we fix α = 0.15 this time. Using a small α creates a more difficult problem, which will allow us to see the effect of n more clearly. For each n = 50, 100, 150, 250 and 500, we perform 100 simulations and record the average number of times the three types of variables (strong, weak, noise) are selected. Table 2 shows that ST2E behaves "reasonably" in this respect. As n increases, signal variables (both strong and weak) are selected with increasing probability, and the opposite is true for noise variables.

3 Examples

We now present a few examples and compare the performance of ST2E with a number of other methods. For VSEs (ST2E, PGA, and stability selection), we use a size of B = 300. To run stability selection, two tuning parameters are required, π_thr and λ_min; see Meinshausen and Bühlmann (2010) for details.
The authors suggested fixing either one at a "sensible" value and choosing the other to control the expected number of false discoveries. We fix π_thr = 0.6, and report results for a range of λ_min values.

Table 2: Average number of times (out of 100) the three types of variables (weak signal, j = 1; strong signal, j = 2, 3; noise, j > 3) are selected by ST2E as the sample size n varies. Same setting as the motivating example in Section 1.3, with α = 0.15.

  Sample size (n)   Weak signal (j = 1)   Strong signal (j = 2, 3)   Noise (j > 3)
  50                62                    99                         17.12
  100               86                    100                        15.41
  150               89                    100                        13.18
  250               97                    100                        11.82
  500               99                    100                        11.53

3.1 A widely used benchmark

First, we look at a widely used benchmark simulation. There are p = 8 variables, x_1, ..., x_8, each generated from the standard normal. Furthermore, the variables are generated to be correlated, with ρ(x_i, x_j) = 0.5^{|i-j|} for all i ≠ j. The response y is generated by

    y = 3 x_1 + 1.5 x_2 + 2 x_5 + \sigma \epsilon,    (8)

where ε ∼ N(0, I). That is, only three variables are true signals; the remaining five are noise.

This benchmark was first used by Tibshirani (1996), but it has been used by almost every major variable-selection paper published ever since. For example, Fan and Li (2001, Example 4.1) used this simulation to compare the Lasso, the non-negative garrote (Breiman 1995), hard thresholding (see, e.g., Donoho 1995), best subset regression (e.g., Miller 2002), and SCAD (Fan and Li 2001). Wang et al. (2009, Example 1) used this simulation to compare the Lasso, adaptive Lasso (Zou 2006), elastic net (Zou and Hastie 2005), relaxed Lasso (Meinshausen 2007), VISA (Radchenko and James 2008), and random Lasso (Wang et al. 2009).

Results from Fan and Li (2001) are replicated in Table 3, together with results from three VSEs: ST2E, PGA, and stability selection.
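As an implementation aside, ρ(x_i, x_j) = 0.5^{|i-j|} is exactly the autocorrelation function of a stationary AR(1) process, so this benchmark design can be generated sequentially, with no Cholesky factorization. A minimal pure-Python sketch (not the authors' code; the seed and defaults are illustrative):

```python
import math
import random

def benchmark_row(p=8, rho=0.5, rng=random):
    """One row of predictors with Corr(x_i, x_j) = rho**|i-j|, via an AR(1) recursion."""
    x = [rng.gauss(0, 1)]
    scale = math.sqrt(1 - rho ** 2)   # keeps each x_j at unit variance
    for _ in range(p - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0, 1))
    return x

def simulate_benchmark(n, sigma, seed=0):
    """Model (8): y = 3*x1 + 1.5*x2 + 2*x5 + sigma*eps."""
    rng = random.Random(seed)
    X = [benchmark_row(rng=rng) for _ in range(n)]
    y = [3 * r[0] + 1.5 * r[1] + 2 * r[4] + sigma * rng.gauss(0, 1) for r in X]
    return X, y

X, y = simulate_benchmark(n=40, sigma=3.0)
```

The recursion x_{j+1} = \rho x_j + \sqrt{1-\rho^2}\, z_{j+1} preserves unit variance, and iterating it gives Corr(x_i, x_j) = \rho^{|i-j|}, the correlation structure stated above.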
The ensemble algorithms are clearly among the most competitive. PGA and stability selection using a relatively large λ_min are slightly better than ST2E at excluding noise variables, but, as will become clearer in Table 4, they are also more likely than ST2E to miss true signals.

Results from Wang et al. (2009) are replicated in Table 4, together with results from ST2E, PGA, and stability selection. Here, the difficulty of PGA becomes clearer. It is better than ST2E in terms of excluding noise variables, but it also misses true signals more often. The same can be said about stability selection using a relatively large λ_min value. It controls the number of false discoveries quite effectively, but we see that this will cause the method to behave poorly in terms of catching true signals. In order to improve its false negative rates, we must reduce λ_min, but this necessarily allows for more false discoveries. Random Lasso has the best ability to catch true signals, but, to do so, it makes a large number of false discoveries at the same time.

Table 3: Widely-used benchmark (Section 3.1). Average number of zero coefficients for the noise group (oracle result = 5) and the signal group (oracle result = 0), based on 100 simulations. Results other than those for ST2E, PGA and stability selection are taken from Fan and Li (2001, Table 1). SCAD1 uses cross-validation to select the tuning parameter, while SCAD2 uses a fixed tuning parameter; details are in Fan and Li (2001).

                             Avg. no. of 0 coef.
  Method                  x_j ∈ Noise (j = 3, 4, 6, 7, 8)   x_j ∈ Signal (j = 1, 2, 5)

  n = 40, σ = 3
  SCAD1                   4.20   0.21
  SCAD2                   4.31   0.27
  LASSO                   3.53   0.07
  Hard                    4.09   0.19
  Best subset             4.50   0.35
  Garrote                 2.80   0.09
  Oracle                  5.00   0.00
  ST2E                    4.56   0.18
  PGA                     4.75   0.16
  Stability selection
    λ_min = 1.5           4.96   1.03
    λ_min = 1.0           4.86   0.58
    λ_min = 0.5           4.54   0.18

  n = 60, σ = 1
  SCAD1                   4.37   0.00
  SCAD2                   4.42   0.00
  LASSO                   3.56   0.00
  Hard                    4.02   0.00
  Best subset             4.73   0.00
  Garrote                 3.38   0.00
  Oracle                  5.00   0.00
  ST2E                    4.81   0.00
  PGA                     4.96   0.00
  Stability selection
    λ_min = 1.5           5.00   0.21
    λ_min = 1.0           5.00   0.01
    λ_min = 0.5           4.95   0.00

3.2 Highly-correlated predictors

Next, we look at a simulation (Wang et al. 2009, Example 4) that is specifically created to study variable selection algorithms when the predictors are highly correlated and their coefficients have opposite signs. There are p = 40 variables, x_1, ..., x_{40}, each generated from the standard normal. The response y is generated by

    y = 3 x_1 + 3 x_2 - 2 x_3 + 3 x_4 + 3 x_5 - 2 x_6 + \sigma \epsilon,    (9)

where ε ∼ N(0, I) and σ = 6. In addition, the 40 variables are generated to have the following (block-diagonal) correlation structure:

    \begin{pmatrix} C & & \\ & C & \\ & & I_{34} \end{pmatrix}, \quad \text{where } C = \begin{pmatrix} 1 & 0.9 & 0.9 \\ 0.9 & 1 & 0.9 \\ 0.9 & 0.9 & 1 \end{pmatrix}.

That is, there are three groups of variables: V_1 = {1, 2, 3}, V_2 = {4, 5, 6}, and V_3 = {7, 8, ..., 40}. The first two groups, V_1 and V_2, are true signals, but, within each of V_1 and V_2, the variables are highly correlated. There is no between-group correlation.

Results from Wang et al. (2009) are replicated in Table 5, together with results from ST2E, PGA, and stability selection. Clearly, ST2E and random Lasso are the top two performers for this problem. PGA has a much higher tendency than ST2E to miss true signals. Stability selection appears to have some difficulties here as well. If we choose λ_min to control the false discoveries too aggressively, we start to lose signals quite badly. When its false discovery rate is as low as that of relaxed Lasso and VISA, for example, stability selection misses true signals much more often.
On the other hand, even if λ_min is relaxed to allow for many false discoveries in this case, stability selection still cannot seem to catch the signals as well as ST2E or random Lasso. This example clearly demonstrates that, when the predictors are highly correlated, the performances of PGA and stability selection can start to deteriorate quite significantly, whereas ST2E produces a much more robust VSE.

Table 4: Widely-used benchmark (Section 3.1). Minimal, median, and maximal number of times (out of 100 simulations) different types of variables (signal versus noise) are selected. Results other than those for ST2E, PGA and stability selection are taken from Wang et al. (2009, Table 2).

                         x_j ∈ Signal (j = 1,2,5)   x_j ∈ Noise (j = 3,4,6,7,8)
  Method                 Min    Median   Max         Min    Median   Max
  n = 50, σ = 1
  Lasso                  100    100      100         46     58       64
  Adaptive Lasso         100    100      100         23     27       38
  Elastic Net            100    100      100         46     59       64
  Relaxed Lasso          100    100      100         10     15       19
  VISA                   100    100      100         11     17       20
  Random Lasso           100    100      100         28     33       44
  ST2E                   100    100      100         1      1        8
  PGA                    100    100      100         0      2        6
  Stability selection
    λ_min = 1.5          75     86       100         0      0        2
    λ_min = 1.0          100    100      100         0      0        2
    λ_min = 0.5          100    100      100         0      0        7
  n = 50, σ = 3
  Lasso                  99     100      100         48     55       61
  Adaptive Lasso         95     99       100         33     40       48
  Elastic Net            100    100      100         44     55       69
  Relaxed Lasso          93     100      100         11     18       21
  VISA                   97     100      100         15     21       24
  Random Lasso           99     100      100         45     57       68
  ST2E                   89     96       100         4      12       20
  PGA                    82     98       100         4      7        11
  Stability selection
    λ_min = 1.5          59     64       100         0      0        3
    λ_min = 1.0          81     83       100         0      2        9
    λ_min = 0.5          90     98       100         4      8        22
  n = 50, σ = 6
  Lasso                  76     85       99          47     49       53
  Adaptive Lasso         62     76       96          32     36       38
  Elastic Net            85     92       100         43     51       70
  Relaxed Lasso          60     70       98          15     19       21
  VISA                   61     72       98          15     19       24
  Random Lasso           92     94       100         40     48       58
  ST2E                   68     69       96          9      13       21
  PGA                    54     76       94          9      14       16
  Stability selection
    λ_min = 1.5          40     41       83          0      4        8
    λ_min = 1.0          59     61       92          4      8        18
    λ_min = 0.5          76     84       100         30     42       50

Table 5: Highly-correlated predictors (Section 3.2). Minimal, median, and maximal number of times (out of 100 simulations) different types of variables (signal versus noise) are selected. Results other than those for ST2E, PGA, and stability selection are taken from Wang et al. (2009, Table 2).

                         x_j ∈ Signal (j ≤ 6)       x_j ∈ Noise (j > 6)
  Method                 Min    Median   Max         Min    Median   Max
  n = 50
  Lasso                  11     70       77          12     17       25
  Adaptive Lasso         16     49       59          4      8        14
  Elastic Net            63     92       96          9      17       23
  Relaxed Lasso          4      63       70          0      4        9
  VISA                   4      62       73          1      3        8
  Random Lasso           84     96       97          11     21       30
  ST2E                   85     96       100         18     25       34
  PGA                    55     87       90          14     23       32
  Stability selection
    λ_min = 0.1          1      35       42          1      5        13
    λ_min = 0.01         1      37       45          7      13       22
    λ_min = 0.002        1      40       52          31     42       54
  n = 100
  Lasso                  8      84       88          12     22       31
  Adaptive Lasso         17     62       72          4      10       14
  Elastic Net            70     98       99          7      14       21
  Relaxed Lasso          3      75       84          1      3        8
  VISA                   3      76       85          1      4        9
  Random Lasso           89     99       99          8      14       21
  ST2E                   93     100      100         14     21       27
  PGA                    40     85       92          13     22       33
  Stability selection
    λ_min = 0.5          1      67       73          3      8        13
    λ_min = 0.3          2      69       75          13     26       32
    λ_min = 0.2          3      71       78          60     72       78

3.3 Diabetes data

Finally, we analyze the "diabetes" data set, which was used as the main example in the "least angle regression" (LAR) paper (Efron et al. 2004). This is a real data set; a standardized version of the data set is available as part of the R package lars. There are n = 442 diabetes patients and p = 10 variables, such as age, sex, body mass index, and so on. The response is a measure of disease progression.

Figure 3 shows results from both LAR and ST2E. For LAR, the entire solution paths are displayed for all the variables. As the penalty size decreases, the variables enter the model sequentially. The order in which they enter the model is listed in Table 6. For ST2E, the variable importance measure (1) is plotted for each variable. The order in which these variables are ranked by ST2E is also listed in Table 6.
[Figure 3: Diabetes data. Left: results from least angle regression (standardized coefficient paths plotted against |beta|/max|beta|). Right: results from ST2E (the importance measure (1) for each of the 10 variables).]

According to Table 6, LAR and ST2E agree that "bmi", "ltg", and "map" are the top three variables, whereas "age" is the least important variable. For the intermediate variables, LAR and ST2E seem to disagree on their relative importance. For example, the variable "ldl" is the last one to be entered into the model by LAR before the variable "age", an indication that it is perhaps not an important variable, whereas ST2E ranks it in the middle. Upon closer examination, however, we can see from Figure 3 that, once "ldl" is in the model, it actually gets a relatively large coefficient, larger than some of the other variables that were entered earlier; this can almost certainly be attributed to "ldl" being highly correlated with these other variables. Therefore, there is good reason why ST2E does not rank "ldl" close to the bottom. Similar statements can be made for "tc", "hdl", and "sex".

Table 6: Diabetes data.

  Method    Ordering and Ranking of Variables (top → bottom)
  LAR       bmi  ltg  map  hdl  sex  glu  tc   tch  ldl  age
  ST2E      bmi  ltg  map  tc   sex  ldl  hdl  glu  tch  age

4 Discussions

Before we end, we would like to discuss a few important issues.

4.1 Ranking versus thresholding

The variable importance measure (1) is a particularly nice feature of the ensemble approach.
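Concretely, the ranking step (1) is just a column average of the ensemble matrix E defined in Section 1.1, and any cutoff then turns the ranking into a selection. A minimal sketch; the cutoff value 0.5 is ours for illustration only, standing in for the automatic rule (2), which is not reproduced here:

```python
import numpy as np

# Ranking: equation (1), R(j) = (1/B) * sum_b E(b, j), computed on a
# B x p ensemble matrix E of 0/1 selection indicators.
def rank_variables(E):
    R = E.mean(axis=0)
    order = np.argsort(-R)      # most important variable first
    return R, order

# Thresholding: a simple illustrative cutoff (stand-in for rule (2)).
def select(R, cutoff=0.5):
    return np.flatnonzero(R > cutoff)

# Toy ensemble with B = 4 paths and p = 4 candidate variables:
E = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0]], dtype=float)
R, order = rank_variables(E)    # R = [1.0, 0.25, 0.75, 0.0]
print(select(R))                # variables 0 and 2 survive the cutoff
```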
Using an ensemble approach, variable selection is performed in two steps: ranking and thresholding. We first rank the variables, e.g., by equation (1), and then use a thresholding rule to make the selection, e.g., equation (2). As proponents of the ensemble approach, we are of the opinion that the task of ranking is the more fundamental of the two. From a decision-theoretic point of view, once the variables are ranked, the choice of the decision threshold has more to do with one's prior belief of how sparse the model is likely to be. We also think that variable selection per se is not quite the right objective, whereas variable ranking is. Imagine the problem of searching for a few biomarkers that are associated with a certain disease. What type of answer is more useful to the medical doctor? Telling her that you think it is biomarkers A, B, and C that are associated? Or giving her a ranked list of the biomarkers? Such a list is precisely what the ensemble approach aims to provide.

Therefore, we would have preferred not to introduce any thresholding rule at all. But, in order to compare with other methods in the literature, it is not enough to just rank the variables; we must make an active selection decision. In fact, because experiments are typically repeated multiple times, we must use a thresholding rule that is automatic, such as equation (2). However, it is not our intention to take this thresholding rule too seriously; it is only introduced so that we can produce the same type of results as everybody else on various benchmark experiments. As far as we are concerned, the output of variable-selection ensembles is the variable importance measure (1), which can be used to rank potential variables.

4.2 Stochastic generating mechanisms

It is clear from our definition (Section 1.1) that there can be many ways to construct VSEs.
To do so, we must employ a stochastic mechanism so that we can repeatedly perform traditional variable selection and obtain slightly different answers each time. One way to achieve this is to use a stochastic rather than a deterministic search algorithm to perform the selection, e.g., PGA (Zhu and Chipman 2006) and ST2E (this paper). Another way is to perform the selection on bootstrap samples, e.g., random Lasso (Wang et al. 2009) and stability selection (Meinshausen and Bühlmann 2010). According to our own empirical experience (not reported in this paper), bootstrapping alone usually does not generate enough diversity within the ensemble to give satisfactory performance. This explains why extra randomization is employed by both the random Lasso and the stability selection methods; see Section 1.4. Table 7 summarizes the stochastic mechanisms used to generate different VSEs.

Table 7: Stochastic generating mechanisms for different VSEs.

  VSE                    Generating Mechanism
  PGA                    genetic algorithm
  ST2E                   stochastic stepwise search algorithm
  Random Lasso           bootstrap + random subsets
  Stability Selection    bootstrap + random scaling of regularization parameter

This leads us to a very natural question: what makes a good stochastic mechanism for generating VSEs? This question will likely take some time to answer; we certainly do not have an answer at the moment. What we have shown in this paper is the following: the ST2 algorithm appears to be quite an effective VSE-generating mechanism, and the various ideas we have used to develop ST2E can help us think about this kind of question in a more systematic manner.

4.3 "Large p, small n" problems

We now describe a simple extension that allows ST2E to tackle "large p, small n" problems.
To do so, we simply insert a pre-screening step before running each ST2 path, by performing "sure independence screening" or SIS (Fan and Lv 2008) on a bootstrap sample to pre-select q < n variables, e.g., q = n - 1. The ST2 algorithm is then applied to this subset of q variables. Notice that, while this imposes an upper limit on how many variables each ST2 path can include, it by no means restricts the number of variables the ST2E ensemble can include as a whole.

To test this strategy, we use another simulation from Wang et al. (2009, Example 5) with p = 120 and n = 50, designed specifically as a test case for p > n problems. The 120 covariates are generated from a multivariate normal distribution with mean zero and covariance matrix

    [ Σ_{30×30}                                  ]
    [            Σ_{30×30}  J_{30×30}            ]
    [            J_{30×30}  Σ_{30×30}            ]
    [                                  Σ_{30×30} ]

(blank blocks are zero), where Σ_{i,j} = 1 if i = j and 0.7 if i ≠ j, and J_{i,j} = 0.2 for all i, j. The first 60 coefficients are generated from N(3, 0.5) and then fixed for all simulations; the remaining 60 coefficients are equal to zero. We are able to obtain and use exactly the same set of 60 non-zero coefficients and the same noise level for ε (σ = 50) as those used by Wang et al. (2009).

Table 8: A "large p, small n" problem (Section 4.3). Minimal, median, and maximal number of times (out of 100 simulations) different types of variables (signal versus noise) are selected. Results other than those for ST2E and stability selection are taken from Wang et al. (2009, Table 2).
                         x_j ∈ Signal (j = 1,...,60)   x_j ∈ Noise (j = 61,...,120)
  Method                 Min    Median   Max            Min    Median   Max
  Lasso                  19     30       40             3      8        14
  Adaptive Lasso         15     25       35             0      7        11
  Elastic Net            40     50       61             1      5        8
  Relaxed Lasso          14     23       34             0      3        8
  VISA                   16     27       35             0      2        8
  Random Lasso           76     86       95             18     29       38
  ST2E (with SIS)        81     88       95             0      1        5
  Stability selection
    λ_min = 10^-2        4      22       52             0      3        10
    λ_min = 10^-4        13     40       80             1      6        25
    λ_min = 10^-5        73     95       100            16     38       80

Table 8 shows the results. Again, random Lasso is competitive in terms of catching the signals, but it makes a large number of false discoveries. Stability selection has the same difficulty as before: it misses signals when large values of λ_min are used to regulate false discoveries, and its ability to catch the signals can match that of ST2E and random Lasso only if λ_min is set to be so low as to tolerate a very large number of false discoveries.

Readers will find that the performance of ST2E (with SIS) on this p > n example is quite remarkable. This is because the extra pre-selection step really makes it an entirely different VSE altogether, since the stochastic generating mechanism (see Section 4.2 and Table 7) has fundamentally changed. Intuitively, it is easy to see that SIS can improve ensemble strength by screening out noise variables, while doing so on bootstrap samples can further enhance ensemble diversity. This reinforces the points we have made earlier in Section 4.2: the question of how to design and/or characterize a good stochastic mechanism for generating VSEs is a very intriguing one indeed, and thinking about the strength and diversity of the ensemble often gives us some useful insights.

4.4 False positives versus false negatives

Another striking phenomenon that we can observe from all our experiments is the extremely delicate balance between false positive and false negative rates in variable selection problems.
It is very hard to reduce one without significantly affecting the other. For VSEs, the underlying stochastic generating mechanism is, again, critical. Among the four VSEs listed in Table 7, PGA seems to "care" more about false positives, whereas random Lasso appears to "care" more about false negatives. Stability selection allows users to control the number of false positives through a tuning parameter, but, as our experiments have shown, aggressive control of false positive rates necessarily leads to very poor false negative rates, and vice versa. Overall, ST2E appears to balance the two objectives nicely, and this is precisely why we think it is a valuable practical algorithm. But, of course, there is no reason to believe that one cannot find another generating mechanism to produce a VSE that balances the two objectives even better. However, as we have alluded to earlier, the more interesting question is not whether we can find another mechanism, but how we can know that we have found a good one. This we leave to future research.

5 Summary

We are now ready to summarize the main contributions of this paper. First, we gave a formal and general description of the ensemble approach for variable selection. Next, we pointed out that Breiman's theory for prediction ensembles, in particular the tradeoff between diversity and strength, is useful in guiding the development of variable-selection ensembles as well. Finally, we used a more structured stochastic mechanism, the ST2 algorithm, to construct a better variable-selection ensemble, ST2E, which we demonstrated to be more robust than other VSEs, and competitive against many state-of-the-art algorithms.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267-281.

Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373-384.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613-627.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32(2), 407-499.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348-1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society Series B, 70, 849-911.

Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, San Francisco. Morgan Kaufmann.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.

Friedman, J. H. and Popescu, B. (2003). Importance-sampled learning ensembles. Technical report, Stanford University.

Friedman, J. H., Hastie, T. J., and Tibshirani, R. J. (2000). Additive logistic regression: a statistical view of boosting (with discussion). The Annals of Statistics, 28(2), 337-407.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, USA.

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.

Jolliffe, I. T. (2002). Principal Component Analysis. Springer-Verlag, 2nd edition.

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer-Verlag.

Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics & Data Analysis, 52, 374-393.

Meinshausen, N. and Bühlmann, P. (2010). Stability selection (with discussion). Journal of the Royal Statistical Society Series B, 72, 417-473.

Miller, A. J. (2002). Subset Selection in Regression. Chapman and Hall, 2nd edition.

Radchenko, P. and James, G. (2008). Variable inclusion and shrinkage algorithms. Journal of the American Statistical Association, 103, 1304-1315.

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289-310.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58, 267-288.

Wang, S., Nan, B., Rosset, S., and Zhu, J. (2009). Random Lasso. Manuscript under review.

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92, 937-950.

Zhu, M. (2008). Kernels and ensembles: perspectives on statistical learning. The American Statistician, 62, 97-109.

Zhu, M. and Chipman, H. A. (2006). Darwinian evolution in parallel universes: a parallel genetic algorithm for variable selection. Technometrics, 48, 491-502.

Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.

Zou, H. and Hastie, T. J. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67, 301-320.
