Multi-stage Convex Relaxation for Feature Selection

Authors: Tong Zhang, Statistics Department, Rutgers University, Piscataway, NJ 08854 (tzhang@stat.rutgers.edu)

Abstract

A number of recent works have studied the effectiveness of feature selection using Lasso. It is known that under the restricted isometry property (RIP), Lasso does not generally lead to exact recovery of the set of nonzero coefficients, due to the looseness of the convex relaxation. This paper considers the feature selection property of nonconvex regularization, where the solution is given by a multi-stage convex relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure recovers the set of nonzero coefficients without suffering from the bias of the Lasso relaxation, which complements the parameter estimation results for this procedure in [16].

1 Introduction

We consider the linear regression problem, where we observe a set of input vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ with corresponding desired output variables $y_1, \ldots, y_n$. In a statistical linear model, it is common to assume that there exists a target coefficient vector $\bar w \in \mathbb{R}^p$ such that

$$y_i = \bar w^\top x_i + \epsilon_i \quad (i = 1, \ldots, n), \qquad (1)$$

where the $\epsilon_i$ are zero-mean independent random noises (but not necessarily identically distributed). Moreover, we assume that the target vector $\bar w$ is sparse; that is, $\bar k = \|\bar w\|_0$ is small. Here we use the standard notation

$$\mathrm{supp}(w) = \{j : w_j \neq 0\}, \qquad \|w\|_0 = |\mathrm{supp}(w)|$$

for any vector $w \in \mathbb{R}^p$. This paper focuses on the feature selection problem, where we are interested in estimating the set of nonzero coefficients $\mathrm{supp}(\bar w)$ (also called the support set). Let $y$ denote the vector $[y_i]$ and let $X$ be the $n \times p$ matrix whose rows are the vectors $x_i$. The standard statistical method is subset selection ($L_0$ regularization), which computes the estimator

$$\hat w_{L_0} = \arg\min_{w \in \mathbb{R}^p} \|Xw - y\|_2^2 \quad \text{subject to } \|w\|_0 \le k, \qquad (2)$$

where $k$ is a tuning parameter. This method is arguably a natural method for feature selection because, if the noise terms $\epsilon_i$ are iid Gaussian random variables, then (2) can be regarded as a Bayes procedure with an appropriately defined sparse prior over $w$. However, because the optimization problem in (2) is nonconvex, its global solution cannot be computed efficiently. In practice, one can only find an approximate solution of (2). The most popular approximation to $L_0$ regularization is the $L_1$ regularization method, often referred to as Lasso [9]:

$$\hat w_{L_1} = \arg\min_{w \in \mathbb{R}^p} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \|w\|_1 \right], \qquad (3)$$

where $\lambda > 0$ is an appropriately chosen regularization parameter. The global optimum of (3) can be computed easily using standard convex programming techniques. It is known that in practice $L_1$ regularization often leads to sparse solutions (although often suboptimal ones), and its performance has been analyzed theoretically in recent years. For example, it is known from the compressed sensing literature (e.g., [3]) that under certain conditions referred to as the restricted isometry property (RIP), the solution of the $L_1$ relaxation (3) approximates the solution of the $L_0$ regularization problem (2). The prediction and parameter estimation performance of this method has been considered in [2, 1, 6, 14, 15, 10]. Exact support recovery was considered by various authors such as [8, 18, 11].
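For concreteness, the following is a minimal sketch of how the Lasso estimator (3) can be computed in practice. It is not from the paper: it assumes scikit-learn is available, the toy data are hypothetical, and note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|Xw-y\|_2^2 + \alpha\|w\|_1$, so its `alpha` corresponds to $\lambda/2$ in the parameterization of (3).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy instance of model (1): sparse target, Gaussian noise.
rng = np.random.default_rng(0)
n, p, k_bar = 100, 250, 5
X = rng.standard_normal((n, p))
w_bar = np.zeros(p)
w_bar[:k_bar] = 3.0                      # nonzero coefficients of the target
y = X @ w_bar + rng.standard_normal(n)   # eps_i ~ N(0, 1)

# Lasso (3): min_w (1/n)||Xw - y||_2^2 + lam * ||w||_1.
# scikit-learn minimizes (1/(2n))||Xw - y||_2^2 + alpha * ||w||_1,
# so alpha = lam / 2 matches the paper's scaling.
lam = 2 * np.sqrt(np.log(p) / n)         # order sigma * sqrt(ln p / n)
w_hat = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y).coef_

print("estimated support:", np.flatnonzero(w_hat))
```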
It is known that under some more restrictive conditions, referred to as irrepresentable conditions, $L_1$ regularization can achieve exact recovery of the support set. However, the $L_1$ regularization method (3) does not achieve exact recovery of the support set under the RIP type of conditions, which we are interested in here. Although it is possible to achieve exact recovery by post-processing (thresholding the small coefficients of the Lasso solution), this approach is suboptimal under RIP in comparison to the $L_0$ regularization method (2): it requires the smallest nonzero coefficients to be $\sqrt{\bar k}$ times larger than the noise level, whereas $L_0$ regularization in (2) only requires the nonzero coefficients to be larger than the noise level. This issue, referred to as the bias of Lasso for feature selection, was extensively discussed in [13]; a detailed discussion can be found after Theorem 1. It is worth mentioning that under a stronger mutual coherence condition (similar to the irrepresentable condition), this post-processing step does not incur the bias factor $\sqrt{\bar k}$, as shown in [7] (also see [15]). Therefore the advantage of bias removal for the multi-stage procedure discussed here applies when RIP holds but the irrepresentable and mutual incoherence conditions fail. A thorough discussion of the various conditions is beyond the scope of the current paper; we refer the reader to [10]. Nevertheless, it is worth pointing out that even in the classical $p < n$ setting with the design matrix $X$ of rank $p$, the irrepresentable condition or the mutual incoherence condition can still be violated while the RIP type sparse-eigenvalue condition used in this paper holds trivially. In fact, this was pointed out in [19] as the main motivation for adaptive Lasso. Adaptive Lasso behaves similarly to the above-mentioned post-processing, and thus suffers from the same bias problem.

The bias of Lasso is due to the looseness of the convex relaxation of $L_0$ regularization. The remedy is therefore to use a nonconvex regularizer that is closer to $L_0$ regularization. One drawback of a nonconvex optimization formulation is that we can only find a local optimal solution, and different computational procedures may lead to different local solutions. Therefore the theoretical analysis has to be integrated with a specific computational procedure, to show that the local minimum obtained by that procedure has desirable properties (e.g., exact support recovery). Several nonconvex computational procedures have been analyzed in the literature, including an adaptive forward-backward greedy procedure (referred to as FoBa) that approximately solves the regularization method (2), considered in [17], and the MC+ method of [13], which solves a nonconvex regularized problem using a path-following procedure. Both methods can achieve unbiased feature selection. Related to the above-mentioned work, a different procedure, referred to as multi-stage convex relaxation, was analyzed in [16]. This procedure solves a nonconvex problem using multiple stages of Lasso relaxations, where the convex formulations are iteratively refined based on solutions obtained from the previous stages. However, only the parameter estimation performance was analyzed in [16].
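As a side note, the thresholding post-processing mentioned above is simple to state in code. A minimal sketch (the threshold `tau` is a hypothetical tuning parameter, not a value prescribed by the paper):

```python
import numpy as np

def threshold_support(w_hat: np.ndarray, tau: float) -> np.ndarray:
    """Hard-threshold a Lasso solution: keep coefficients above tau.

    Under RIP, exact recovery via this post-processing needs tau (and
    hence the smallest nonzero |w_j|) at the order of sqrt(k_bar) times
    the noise level -- the "bias" of Lasso discussed above.
    """
    return np.flatnonzero(np.abs(w_hat) > tau)
```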
Unfortunately, the result in [16] does not directly imply that multi-stage convex relaxation achieves unbiased recovery of the support set. The purpose of this paper is to prove such a support recovery result, analogous to the related result in [13] (which is for a different procedure); this result complements the parameter estimation result of [16].

2 Multi-Stage Convex Relaxation with Capped-$L_1$ Regularization

We are interested in recovering $\bar w$ from the noisy observations $y$ using the following nonconvex regularization formulation:

$$\hat w = \arg\min_{w} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \sum_{j=1}^{p} g(|w_j|) \right], \qquad (4)$$

where $g(|w_j|)$ is a regularization function. For simplicity, this paper only considers the specific regularizer

$$g(u) = \min(u, \theta), \qquad (5)$$

which is referred to as capped-$L_1$ regularization in [16]. The parameter $\theta$ is a thresholding parameter: we use $L_1$ penalization when a coefficient is sufficiently small, but the penalty does not increase once the coefficient is larger than the threshold $\theta$. Detailed discussions can be found in [16]. Similar to [16], one can analyze a general regularization function $g(u)$; however, some such functions (such as adaptive Lasso) do not completely remove the bias. Therefore we only analyze the simple function (5) in this paper, for clarity. While a theoretical justification has been given in [16] for multi-stage convex relaxation, similar procedures have been shown to work well empirically without theoretical justification [4, 12]. Moreover, a two-stage version was proposed in [20], which does not remove the bias issue discussed in this paper.

Since the regularizer (5) is nonconvex, the resulting optimization problem (4) is a nonconvex regularization problem. However, the regularizer in (5) is continuous and piecewise differentiable, and thus its solution is easier to compute than that of the $L_0$ regularization method in (2). For example, standard numerical techniques such as sub-gradient descent lead to local minimum solutions. Unfortunately, it is difficult to find the global optimum, and it is also difficult to analyze the quality of the local minimum obtained by gradient descent. As a matter of fact, results with nonconvex regularization are difficult to reproduce because different numerical optimization procedures can lead to different local minima; the quality of the solution therefore depends heavily on the numerical procedure used.

In the following, we consider a specific numerical procedure, referred to as multi-stage convex relaxation in [16]. The algorithm is given in Figure 1. The procedure converges to a local optimal solution of (4) by a simple concave duality argument, where (4) is rewritten as

$$\hat w = \arg\min_{w} \min_{\{\lambda_j \ge 0\}} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{p} \lambda_j |w_j| + \sum_{j=1}^{p} g^*(\lambda_j) \right],$$

with $g^*(\lambda_j) = \max((\lambda - \lambda_j)\theta, 0)$. The procedure of Figure 1 can be regarded as an alternating optimization method for this joint optimization problem over $w$ and $\{\lambda_j\}$: the first step solves for $w$ with $\{\lambda_j\}$ fixed, and the second step is the closed-form solution for $\{\lambda_j\}$ with $w$ fixed. A more detailed discussion can be found in [16]. Our goal is to show that this procedure can achieve unbiased feature selection as described in [13].

Initialize $\lambda^{(0)}_j = \lambda$ for $j = 1, \ldots, p$.
For $\ell = 1, 2, \ldots$:
- Let
$$\hat w^{(\ell)} = \arg\min_{w \in \mathbb{R}^p} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{p} \lambda^{(\ell-1)}_j |w_j| \right]. \qquad (6)$$
- Let $\lambda^{(\ell)}_j = \lambda \, I(|\hat w^{(\ell)}_j| \le \theta)$ for $j = 1, \ldots, p$.

Figure 1: Multi-stage Convex Relaxation for Sparse Regularization
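Each stage of Figure 1 is a weighted Lasso, so the procedure is straightforward to implement on top of any Lasso solver. The following is a minimal sketch, not the author's reference implementation: it assumes scikit-learn, and since scikit-learn's `Lasso` supports only a scalar penalty, the per-coordinate weights $\lambda^{(\ell-1)}_j$ are absorbed by rescaling the columns of $X$ (a standard reweighting trick); a small floor `eps` keeps the rescaling finite when $\lambda_j = 0$, so those coordinates are only approximately unpenalized.

```python
import numpy as np
from sklearn.linear_model import Lasso

def multi_stage_capped_l1(X, y, lam, theta, n_stages=8, eps=1e-4):
    """Multi-stage convex relaxation (Figure 1) via reweighted Lasso.

    Each stage solves the weighted Lasso (6) with weights
    lam_j = lam * I(|w_j| <= theta) from the previous stage.
    Weights are absorbed by column rescaling: with v_j = w_j * lam_j / lam,
    the weighted problem becomes a standard Lasso in v.  The floor eps
    keeps the rescaling finite when lam_j = 0 (those coordinates are then
    only approximately unpenalized).
    """
    n, p = X.shape
    lam_j = np.full(p, lam)                  # stage ell = 1 is standard Lasso
    w = np.zeros(p)
    for _ in range(n_stages):
        scale = lam / np.maximum(lam_j, eps)         # column scaling lam / lam_j
        # sklearn's Lasso minimizes (1/(2n))||Xv - y||^2 + alpha ||v||_1,
        # so alpha = lam / 2 matches the paper's (1/n) scaling.
        lasso = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50000)
        v = lasso.fit(X * scale, y).coef_
        w = v * scale                                # map back to w
        lam_j = lam * (np.abs(w) <= theta)           # second step of Figure 1
    return w
```

With the simulation parameters of Section 4 (e.g., $\lambda \approx 0.94$, $\theta = \lambda$, eight stages), a call such as `multi_stage_capped_l1(X, y, lam=0.94, theta=0.94, n_stages=8)` would correspond to the best-performing configuration in Table 1.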
3 Theoretical Analysis

We require some technical conditions for our analysis. First, we assume sub-Gaussian noise as follows.

Assumption 1. Assume that $\{\epsilon_i\}_{i=1,\ldots,n}$ in (1) are independent (but not necessarily identically distributed) sub-Gaussians: there exists $\sigma \ge 0$ such that for all $i$ and all $t \in \mathbb{R}$,
$$\mathbb{E}_{\epsilon_i} e^{t\epsilon_i} \le e^{\sigma^2 t^2/2}.$$

Both Gaussian and bounded random variables are sub-Gaussian under this definition. For example, if a random variable $\xi \in [a, b]$, then $\mathbb{E}_\xi e^{t(\xi - \mathbb{E}\xi)} \le e^{(b-a)^2 t^2/8}$; if a random variable is Gaussian, $\xi \sim N(0, \sigma^2)$, then $\mathbb{E}_\xi e^{t\xi} \le e^{\sigma^2 t^2/2}$.

We also introduce the concept of sparse eigenvalues, which is standard in the analysis of $L_1$ regularization.

Definition 1. Given $k$, define
$$\rho_+(k) = \sup\left\{ \frac{1}{n}\|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\}, \qquad \rho_-(k) = \inf\left\{ \frac{1}{n}\|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\}.$$

The following result for parameter estimation was obtained in [16] under Assumption 1. Assume that the target $\bar w$ is sparse, with $\mathbb{E} y_i = \bar w^\top x_i$ and $\bar k = \|\bar w\|_0$, and choose $\theta$ and $\lambda$ such that
$$\lambda \ge 20\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n} \quad \text{and} \quad \theta \ge 9\lambda/\rho_-(2\bar k + s).$$
Assume moreover that $\rho_+(s)/\rho_-(2\bar k + 2s) \le 1 + 0.5\, s/\bar k$ for some $s \ge 2\bar k$. Then with probability larger than $1 - \eta$:
$$\|\hat w^{(\ell)} - \bar w\|_2 \le \frac{17}{\rho_-(2\bar k + s)}\left[ 2\sigma\sqrt{\rho_+(\bar k)}\left( \sqrt{\frac{7.4\,\bar k}{n}} + \sqrt{\frac{2.7\ln(2/\eta)}{n}} \right) + \lambda\sqrt{k_\theta} \right] + 0.7^\ell\,\sqrt{\bar k}\,\frac{\lambda}{\rho_-(2\bar k + s)}, \qquad (7)$$
where $\hat w^{(\ell)}$ is the solution of (6), and $k_\theta = \big|\{ j \in \bar F : |\bar w_j| \le 2\theta \}\big|$.

The condition $\rho_+(s)/\rho_-(2\bar k + 2s) \le 1 + 0.5\, s/\bar k$ requires the eigenvalue ratio $\rho_+(s)/\rho_-(s)$ to grow sub-linearly in $s$. Such a condition, referred to as the sparse eigenvalue condition, is also needed in the standard analysis of $L_1$ regularization [14, 15]. It is related to, but slightly weaker than, the RIP condition in compressive sensing [3], which requires
$$1 - \delta_{s'} \le \rho_-(s') \le \rho_+(s') \le 1 + \delta_{s'}$$
for some $\delta_{s'} \in (0, 1)$ and $s' > \bar k$. For example, with $s' = 6\bar k$ and restricted isometry constant $\delta_{s'} \le 1/3$, the sparse eigenvalue condition above holds with $s = 2\bar k$. For simplicity, in this paper we do not distinguish between RIP and the sparse eigenvalue condition. Note that in traditional low-dimensional statistical analysis, one assumes that $\rho_+(s)/\rho_-(2\bar k + 2s) < \infty$ as $s \to \infty$, which is significantly stronger than the condition we use here.
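The sparse eigenvalues of Definition 1 are defined by optimizing over all supports of size $k$, so they can be computed exactly only for tiny problems. The following brute-force sketch (not from the paper; exponential in $k$ and intended purely as an illustration of Definition 1) makes the quantities concrete:

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(X, k):
    """Brute-force rho_+(k) and rho_-(k) from Definition 1.

    For vectors w supported on a fixed set S, (1/n)||Xw||^2 / ||w||^2
    ranges over the eigenvalues of the Gram matrix (1/n) X_S^T X_S, and
    by eigenvalue interlacing it suffices to scan the extreme eigenvalues
    over supports of size exactly k.  Exponential in k: illustration only.
    """
    n, p = X.shape
    rho_plus, rho_minus = 0.0, np.inf
    for S in combinations(range(p), k):
        G = X[:, S].T @ X[:, S] / n          # k x k Gram matrix
        eigs = np.linalg.eigvalsh(G)         # ascending order
        rho_plus = max(rho_plus, eigs[-1])
        rho_minus = min(rho_minus, eigs[0])
    return rho_plus, rho_minus
```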
Although in practice it is often difficult to verify the sparse eigenvalue condition for real problems, the parameter estimation result in (7) nevertheless provides important theoretical insights for multi-stage convex relaxation. For standard Lasso, we have the bound
$$\|\hat w_{L_1} - \bar w\|_2 = O(\sqrt{\bar k}\,\lambda),$$
where $\hat w_{L_1}$ is the solution of the standard $L_1$ regularization. This bound is tight for Lasso, in the sense that the right-hand side cannot be improved except for the constant; this can be easily verified with an orthogonal design matrix. It is known that for Lasso to be effective, one has to pick $\lambda$ no smaller than the order $\sigma\sqrt{\ln p/n}$; therefore, the parameter estimation error of the standard Lasso is of the order $\sigma\sqrt{\bar k \ln p/n}$, which cannot be improved. In comparison, if we consider the capped-$L_1$ regularization with $g(|w_j|)$ defined in (5), the bound in (7) can be significantly better when most nonzero coefficients of $\bar w$ are relatively large in magnitude. In the extreme case where $k_\theta = |\{j : |\bar w_j| \in (0, 2\theta]\}| = 0$, which can be achieved when all nonzero components of $\bar w$ are larger than the order $\sigma\sqrt{\ln p/n}$, we obtain the better bound
$$\|\hat w^{(\ell)} - \bar w\|_2 = O\left( \sqrt{\bar k/n} + \sqrt{\ln(1/\eta)/n} \right)$$
for the multi-stage procedure, for sufficiently large $\ell$ at the order of $\ln\bar k + \ln\ln p$. This bound is superior to the standard one-stage $L_1$ regularization bound $\|\hat w_{L_1} - \bar w\|_2 = O(\sqrt{\bar k \ln(p/\eta)/n})$.

In the literature, one is often interested in two types of results: parameter estimation bounds as in (7), and feature selection consistency, that is, identifying the set of nonzero coefficients of the truth. Although the parameter estimation bound in (7) is superior to Lasso, it does not imply that one can correctly select all variables under this condition. Moreover, the specific proof presented in [16] does not directly imply such a result. Therefore it is important to know whether multi-stage convex relaxation can achieve unbiased feature selection as studied in [13]. In the following, we present such a result, which supplements the parameter estimation bound of (7). While the main high-level argument follows that of [16], there are many differences in the details, and hence a full proof (included in Section 5) is still needed. This theorem is the main result of the paper.

It is worth mentioning that although we only consider the simple capped-$L_1$ regularizer, similar results can be obtained (with virtually the same proof) for other regularizers such that $g'(u) \in [0, \infty)$, $g'(u) > 0$ when $u$ belongs to a neighborhood of $0$, and $g'(u) = 0$ when $u \ge \theta$, with the threshold $\theta > 0$ appropriately chosen at the order of the noise level; the condition $g'(u) = 0$ for $u \ge \theta$ is what ensures the removal of the feature selection "bias" of Lasso discussed above. As an example, a very similar result can be obtained for the MC+ penalty of [13] or the SCAD penalty of [5] using the multi-stage convex relaxation procedure here. In fact, in practice there may be additional advantages to using a smooth nonconvex penalty such as MC+ due to the extra smoothness, although this advantage is not revealed in our theoretical analysis.

Theorem 1. Let Assumption 1 hold. Assume also that the target $\bar w$ is sparse, with $\mathbb{E} y_i = \bar w^\top x_i$ and $\bar k = \|\bar w\|_0$. Let $\bar F = \mathrm{supp}(\bar w)$. Choose $\theta$ and $\lambda$ such that
$$\lambda \ge 7\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n} \quad \text{and} \quad \theta > 9\lambda/\rho_-(1.5\bar k + s).$$
Assume that $\min_{j \in \bar F} |\bar w_j| > 2\theta$ and $\rho_+(s)/\rho_-(1.5\bar k + 2s) \le 1 + 2s/(3\bar k)$ for some $s \ge 1.5\bar k$. Then with probability larger than $1 - \eta$:
$$\mathrm{supp}(\hat w^{(\ell)}) = \mathrm{supp}(\bar w) \quad \text{when } \ell > L,$$
where $\hat w^{(\ell)}$ is the solution of (6) and
$$L = \left\lfloor \frac{0.5\ln\bar k}{\ln\big(\rho_-(1.5\bar k + s)\,\theta/(6\lambda)\big)} \right\rfloor + 1.$$
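To get a feel for the stage count $L$ in Theorem 1, the following is a small worked computation under assumed values (hypothetical, for illustration only):

```python
import math

# Hypothetical illustration of the stage count L in Theorem 1:
# L = floor(0.5 * ln(k_bar) / ln(rho_- * theta / (6 * lam))) + 1.
k_bar = 30.0
ratio = 3.0   # assumed value of rho_-(1.5*k_bar + s) * theta / (6 * lam)
L = math.floor(0.5 * math.log(k_bar) / math.log(ratio)) + 1
print(L)      # -> 2, so exact recovery holds from stage ell = 3 onward
```

This reflects the logarithmic dependence on $\bar k$: even for much larger $\bar k$, only a handful of stages is needed.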
If $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$ for a sufficiently large constant $c$ that is independent of $\bar k$ (but may depend on the RIP condition), then we can pick both parameters $\lambda = O(\sigma\sqrt{\ln p/n})$ and $\theta = O(\sigma\sqrt{\ln p/n})$ at the noise level, so that Theorem 1 applies. In this case, Theorem 1 implies that multi-stage capped-$L_1$ regularization achieves exact recovery of the support set $\mathrm{supp}(\bar w)$. In comparison, Lasso does not achieve exact sparse recovery under RIP conditions. While running Lasso followed by thresholding small coefficients to zero (or using the adaptive Lasso of [19] or the two-stage procedure of [20]) may achieve exact recovery, such a procedure requires the condition

$$\min_{j \in \bar F} |\bar w_j| \ge c'\sigma\sqrt{\bar k \ln p/n} \qquad (8)$$

for some constant $c'$ (which also depends on the RIP condition). This extra $\sqrt{\bar k}$ factor is referred to as the bias of the Lasso procedure in [13]. Moreover, it is known that for exact recovery to hold, the requirement $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$ (up to a constant) is necessary for all statistical procedures, in the sense that if $\min_{j \in \bar F} |\bar w_j| \le c'\sigma\sqrt{\ln p/n}$ for a sufficiently small constant $c'$ (under appropriate RIP conditions), then no statistical procedure can achieve exact recovery with large probability. Therefore, statistical procedures that achieve exact support recovery without the extra $\sqrt{\bar k}$ factor of (8), that is, under a condition of the form $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$, are referred to as (nearly) unbiased feature selection methods in [13]. Theorem 1 shows that multi-stage convex relaxation with capped-$L_1$ regularization achieves unbiased feature selection.

The results most comparable to ours are those for the FoBa procedure in [17] and the MC+ procedure in [13]. Both can be regarded as (approximate) optimization methods for nonconvex formulations: the former is a forward-backward greedy algorithm, which does not optimize (4), while the latter is a path-following algorithm for solving formulations similar to (4). Although the results in [13] are comparable to ours, we should note that, unlike our procedure, which is efficient because it requires only a finite number of convex optimizations, there is no proof showing that the path-following strategy in [13] is always efficient (in the sense that there may be exponentially many switching points).

4 Simulation Study

Numerical examples demonstrating the advantage of multi-stage convex relaxation over Lasso can be found in [16], so we do not repeat a comprehensive study here. Nevertheless, this section presents a simple simulation study to illustrate the theoretical results. The $n \times p$ design matrix $X$ is generated with iid random Gaussian entries, and each column is normalized to 2-norm $\sqrt n$. Here $n = 100$ and $p = 250$. We then generate a vector $\bar w$ with $\bar k = 30$ nonzero coefficients, where each nonzero coefficient is generated uniformly from the interval $(1, 10)$. The observation is $y = X\bar w + \epsilon$, where $\epsilon$ is zero-mean iid Gaussian noise with standard deviation $\sigma = 1$. We study the feature selection performance of the multi-stage convex relaxation method of Figure 1 using various configurations of $\lambda = \tau\sigma\sqrt{\ln(p)/n}$ (with $\tau = 1, 2, 4, 8, 16, 32$) and $\theta = \mu\lambda$ for the constants $\mu = 0.5, 1, 2, 4$. The experiments are repeated 100 times, and Table 1 reports the probability (percentage over the 100 runs) of exact support recovery for each configuration at various stages $\ell$.
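A minimal sketch of this simulation, reusing the hypothetical `multi_stage_capped_l1` helper sketched in Section 2 (with fewer repetitions than the 100 used for Table 1, purely to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k_bar, sigma = 100, 250, 30, 1.0

def one_run(tau, mu, n_stages):
    # Design matrix with iid Gaussian entries, columns normalized to 2-norm sqrt(n).
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
    # Sparse target: k_bar nonzeros, each uniform on (1, 10).
    w_bar = np.zeros(p)
    support = rng.choice(p, k_bar, replace=False)
    w_bar[support] = rng.uniform(1, 10, k_bar)
    y = X @ w_bar + sigma * rng.standard_normal(n)
    lam = tau * sigma * np.sqrt(np.log(p) / n)
    w = multi_stage_capped_l1(X, y, lam=lam, theta=mu * lam, n_stages=n_stages)
    return set(np.flatnonzero(w)) == set(support)    # exact support recovery?

# e.g. tau = 4 (lam ~ 0.94), mu = 1, ell = 8: the best cell in Table 1
recovery_rate = np.mean([one_run(4, 1, 8) for _ in range(20)])
print(recovery_rate)
```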
Note that $\ell = 1$ corresponds to Lasso, and $\ell = 2$ is an adaptive-Lasso-like two-stage method [19, 20]. The main purpose of this study is to illustrate that it is beneficial to use more than two stages, as predicted by our theory. However, since only $O(\ln\bar k)$ stages are sufficient, optimal results can be achieved with a relatively small number of stages. These conclusions can be clearly seen from Table 1. Specifically, the results for $\ell = 2$ are better than those for $\ell = 1$ (standard Lasso), and the results for $\ell = 4$ are better than those for $\ell = 2$. Although the performance at $\ell = 8$ is better still, the improvement over $\ell = 4$ is small at the optimal configuration of $\lambda$ and $\theta$. This is consistent with our theory, which implies that a relatively small number of stages is needed to achieve good performance.

θ = 0.5λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0      0.02   0      0      0
  ℓ = 4   0      0.05   0.63   0.18   0      0
  ℓ = 8   0      0.12   0.83   0.25   0      0

θ = λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.04   0.15   0.06   0      0
  ℓ = 4   0      0.33   0.86   0.13   0      0
  ℓ = 8   0      0.38   0.93   0.16   0      0

θ = 2λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.14   0.22   0      0      0
  ℓ = 4   0      0.29   0.6    0.02   0      0
  ℓ = 8   0      0.3    0.62   0.02   0      0

θ = 4λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.01   0.01   0      0      0
  ℓ = 4   0      0.06   0.06   0      0      0
  ℓ = 8   0      0.06   0.06   0      0      0

Table 1: Probability of exact support recovery for multi-stage convex relaxation.

5 Proof of Theorem 1

The analysis is an adaptation of [16]. While the main proof structure is similar, there are nevertheless subtle and important differences in the details, and hence a complete proof is still necessary. The main technical differences are as follows. The proof of [16] tracks the progress from one stage $\ell - 1$ to the next stage $\ell$ using a bound on the 2-norm parameter estimate, while in the current proof we track the progress using the set of variables that differ significantly from the true variables. Moreover, in [16] we compare the current estimated parameter to the true parameter $\bar w$, which is sufficient for parameter estimation. However, in order to establish the feature selection result of this paper, it is necessary to compare the current estimated parameter to the least squares solution $\tilde w$ within the true feature set $\bar F$, as defined below in (9). These subtle technical differences mean that many details of the proofs presented below differ from those of [16].

5.1 Auxiliary lemmas

We first introduce some definitions. Consider the positive semi-definite matrix $A = n^{-1} X^\top X \in \mathbb{R}^{p \times p}$. Given $s, k \ge 1$ such that $s + k \le p$, let $I, J$ be disjoint subsets of $\{1, \ldots, p\}$ with $k$ and $s$ elements respectively. Let $A_{I,I} \in \mathbb{R}^{k \times k}$ be the restriction of $A$ to indices $I$, and $A_{I,J} \in \mathbb{R}^{k \times s}$ be the restriction of $A$ to indices $I$ on the left and $J$ on the right. Similarly, we define the restriction $w_I$ of a vector $w \in \mathbb{R}^p$ to $I$; for convenience, we allow either $w_I \in \mathbb{R}^k$ or $w_I \in \mathbb{R}^p$ (with components not in $I$ set to zero), depending on the context.
We also need the following quantity in our analysis:
$$\pi(k, s) = \sup_{v \in \mathbb{R}^k,\, u \in \mathbb{R}^s,\, I, J} \frac{v^\top A_{I,J}\, u \,\|v\|_2}{v^\top A_{I,I}\, v \,\|u\|_\infty}.$$
The following two lemmas are taken from [15]; we skip the proofs.

Lemma 1. The following inequality holds:
$$\pi(k, s) \le \frac{s^{1/2}}{2}\sqrt{\rho_+(s)/\rho_-(k+s) - 1}.$$

Lemma 2. Consider $k, s > 0$ and $G \subset \{1, \ldots, p\}$ such that $|G^c| = k$. Given any $w \in \mathbb{R}^p$, let $J$ be the indices of the $s$ largest components of $w_G$ (in absolute value), and $I = G^c \cup J$. Then
$$\max(0, w_I^\top A w) \ge \rho_-(k+s)\big( \|w_I\|_2 - \pi(k+s, s)\|w_G\|_1/s \big)\|w_I\|_2.$$

Our analysis requires us to keep track of progress with respect to the least squares solution $\tilde w$ restricted to the true feature set $\bar F$, which we define as
$$\tilde w = \arg\min_{w \in \mathbb{R}^p} \|Xw - y\|_2^2 \quad \text{subject to } \mathrm{supp}(w) \subset \bar F, \qquad (9)$$
where $\bar F = \mathrm{supp}(\bar w)$. The following lemmas require varying degrees of modification from similar lemmas in [16], and thus the proofs are included for completeness.

Lemma 3. Define $\hat\epsilon = \frac{1}{n} X^\top (X\tilde w - y)$. Under the conditions of Assumption 1, with probability larger than $1 - \eta$:
$$\forall j \in \bar F : \quad |\hat\epsilon_j| = 0, \qquad |\tilde w_j - \bar w_j| \le \sigma\sqrt{2\rho_-(\bar k)^{-1}\ln(2p/\eta)/n},$$
and
$$\forall j \notin \bar F : \quad |\hat\epsilon_j| \le \sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n}.$$

Proof. Let $\tilde P$ be the projection matrix onto the subspace spanned by the columns of $X$ in $\bar F$; then we know that $X\tilde w = \tilde P y$ and $(I - \tilde P)\mathbb{E}y = \mathbb{E}y - X\bar w = 0$. Therefore, for each $j$,
$$n|\hat\epsilon_j| = |X_j^\top (X\tilde w - y)| = |X_j^\top (I - \tilde P)(y - \mathbb{E}y)|.$$
This implies that $\hat\epsilon_j = 0$ if $j \in \bar F$. Since each column $X_j$ satisfies $\|X_j^\top (I - \tilde P)\|_2^2 \le n\rho_+(1)$, we have from the sub-Gaussian tail bound that for all $j \notin \bar F$ and all $\epsilon > 0$:
$$P[|\hat\epsilon_j| \ge \epsilon] \le 2\exp\big[-n\epsilon^2/(2\sigma^2\rho_+(1))\big].$$
Moreover, for each $j \in \bar F$, we have
$$|\tilde w_j - \bar w_j| = \big|e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} X_{\bar F}^\top (y - \mathbb{E}y)\big|.$$
Since $\|e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} X_{\bar F}^\top\|_2^2 = e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} e_j \le n^{-1}\rho_-(\bar k)^{-1}$, we have for all $\epsilon > 0$:
$$P[|\tilde w_j - \bar w_j| \ge \epsilon] \le 2\exp\big[-n\rho_-(\bar k)\epsilon^2/(2\sigma^2)\big].$$
Taking the union bound over $j = 1, \ldots, p$ (each with probability $\eta/p$), we obtain the desired inequality. □

Lemma 4. Consider $G \subset \{1, \ldots, p\}$ such that $\bar F \cap G = \emptyset$. Let $\hat w = \hat w^{(\ell)}$ be the solution of (6), and let $\Delta\hat w = \hat w - \tilde w$. Let $\lambda_G = \min_{j \in G}\lambda^{(\ell-1)}_j$ and $\lambda_0 = \max_j \lambda^{(\ell-1)}_j$. If $2\|\hat\epsilon\|_\infty < \lambda_G$, then
$$\sum_{j \in G} |\hat w_j| \le \frac{2\|\hat\epsilon\|_\infty}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sum_{j \notin \bar F \cup G}|\hat w_j| + \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sum_{j \in \bar F}|\Delta\hat w_j| \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\|\Delta\hat w_{G^c}\|_1.$$

Proof. For simplicity, let $\lambda_j = \lambda^{(\ell-1)}_j$. The first-order condition implies that
$$\frac{1}{n}\sum_{i=1}^n 2(x_i^\top \hat w - y_i)x_{i,j} + \lambda_j\,\mathrm{sgn}(\hat w_j) = 0,$$
where $\mathrm{sgn}(w_j) = 1$ when $w_j > 0$, $\mathrm{sgn}(w_j) = -1$ when $w_j < 0$, and $\mathrm{sgn}(w_j) \in [-1, 1]$ when $w_j = 0$. This implies that for all $v \in \mathbb{R}^p$ we have
$$2v^\top A\Delta\hat w \le -2v^\top\hat\epsilon - \sum_{j=1}^p \lambda_j v_j\,\mathrm{sgn}(\hat w_j). \qquad (10)$$
Now, let $v = \Delta\hat w$ in (10), and note that $\hat\epsilon_{\bar F} = 0$; we obtain
$$0 \le 2\Delta\hat w^\top A\Delta\hat w \le 2|\Delta\hat w^\top\hat\epsilon| - \sum_{j=1}^p \lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$\le 2\|\Delta\hat w_{\bar F^c}\|_1\|\hat\epsilon\|_\infty - \sum_{j \in \bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) - \sum_{j \notin \bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$\le 2\|\Delta\hat w_{\bar F^c}\|_1\|\hat\epsilon\|_\infty + \sum_{j \in \bar F}\lambda_j|\Delta\hat w_j| - \sum_{j \notin \bar F}\lambda_j|\hat w_j|$$
$$\le \sum_{j \in G}(2\|\hat\epsilon\|_\infty - \lambda_G)|\hat w_j| + \sum_{j \notin G\cup\bar F} 2\|\hat\epsilon\|_\infty|\hat w_j| + \sum_{j \in \bar F}\lambda_0|\Delta\hat w_j|.$$
By rearranging the above inequality, we obtain the first desired bound. The second inequality uses $2\|\hat\epsilon\|_\infty \le \lambda_0$. □

Lemma 5. Using the notation of Lemma 4, let $J$ be the indices of the largest $s$ coefficients (in absolute value) of $\hat w_G$, let $I = G^c \cup J$, and let $k = |G^c|$. If $0 \le \lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le 3$, then
$$\|\Delta\hat w\|_2 \le \big(1 + (3k/s)^{0.5}\big)\|\Delta\hat w_I\|_2.$$

Proof. Using $\lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le 3$, we obtain from Lemma 4
$$\|\hat w_G\|_1 \le 3\|\Delta\hat w - \hat w_G\|_1.$$
Therefore
$$\|\Delta\hat w - \Delta\hat w_I\|_\infty \le \|\Delta\hat w_J\|_1/s = s^{-1}\big[\|\Delta\hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big] \le s^{-1}\big[3\|\Delta\hat w - \hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big],$$
which implies that
$$\|\Delta\hat w - \Delta\hat w_I\|_2 \le \big(\|\Delta\hat w - \Delta\hat w_I\|_1\,\|\Delta\hat w - \Delta\hat w_I\|_\infty\big)^{1/2}$$
$$\le \big[\|\Delta\hat w - \Delta\hat w_I\|_1\big(3\|\Delta\hat w - \hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big)\big]^{1/2} s^{-1/2}$$
$$\le \big[(3\|\Delta\hat w - \hat w_G\|_1/2)^2\big]^{1/2} s^{-1/2} = (3/2)\,s^{-1/2}\|\Delta\hat w - \hat w_G\|_1 \le (3/2)\,s^{-1/2} k^{1/2}\|\Delta\hat w - \hat w_G\|_2 \le (3k/s)^{1/2}\|\Delta\hat w_I\|_2.$$
The third inequality uses the simple algebraic inequality $a(3b - a) \le (3b/2)^2$. Rearranging yields the desired bound. Note that in the above derivation we used the fact that $\bar F \cap G = \emptyset$, which implies $\Delta\hat w_G = \hat w_G$, and thus $\Delta\hat w - \hat w_G = \Delta\hat w_{G^c}$. □

Lemma 6. Let the conditions of Lemma 4 and Lemma 5 hold, and let $k = |G^c|$. If
$$t = 1 - \pi(k+s, s)\,k^{1/2} s^{-1} \in (0, 4/3)$$
and $0 \le \lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le (4-t)/(4-3t)$, then
$$\|\Delta\hat w\|_2 \le \big(1 + (3k/s)^{0.5}\big)\|\Delta\hat w_I\|_2 \le \frac{1 + (3k/s)^{0.5}}{t\,\rho_-(k+s)}\left[2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}(\lambda^{(\ell-1)}_j)^2\Big)^{1/2}\right].$$

Proof. Let $J$ be the indices of the largest $s$ coefficients (in absolute value) of $\hat w_G$, and $I = G^c \cup J$. The conditions of the lemma imply that
$$\max(0, \Delta\hat w_I^\top A\Delta\hat w) \ge \rho_-(k+s)\big[\|\Delta\hat w_I\|_2 - \pi(k+s,s)\|\hat w_G\|_1/s\big]\|\Delta\hat w_I\|_2$$
$$\ge \rho_-(k+s)\big[1 - (1-t)(4-t)(4-3t)^{-1}\big]\|\Delta\hat w_I\|_2^2 \ge 0.5\,t\,\rho_-(k+s)\|\Delta\hat w_I\|_2^2.$$
In the above derivation, the first inequality is due to Lemma 2; the second is due to the conditions of this lemma plus Lemma 4, which implies
$$\|\hat w_G\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\|\Delta\hat w_{G^c}\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sqrt{k}\,\|\Delta\hat w_I\|_2;$$
and the last inequality follows from $1 - (1-t)(4-t)(4-3t)^{-1} \ge 0.5t$, which holds for $t \in (0, 4/3)$.

If $\Delta\hat w_I^\top A\Delta\hat w \le 0$, then the above inequality, together with Lemma 5, implies the lemma. Therefore in the following we can assume that $\Delta\hat w_I^\top A\Delta\hat w \ge 0.5\,t\,\rho_-(k+s)\|\Delta\hat w_I\|_2^2$. Moreover, let $\lambda_j = \lambda^{(\ell-1)}_j$. We obtain from (10) with $v = \Delta\hat w_I$:
$$2\Delta\hat w_I^\top A\Delta\hat w \le -2\Delta\hat w_I^\top\hat\epsilon - \sum_{j\in I}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$= -2\Delta\hat w_I^\top\hat\epsilon_{G^c} - 2\Delta\hat w_I^\top\hat\epsilon_G - \sum_{j\in\bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) - \sum_{j\in G}\lambda_j|\Delta\hat w_j| - \sum_{j\in\bar F^c\cap G^c}\lambda_j|\Delta\hat w_j|$$
$$\le 2\|\Delta\hat w_I\|_2\|\hat\epsilon_{G^c}\|_2 + 2\|\hat\epsilon_G\|_\infty\sum_{j\in G}|\Delta\hat w_j| + \sum_{j\in\bar F}\lambda_j|\Delta\hat w_j| - \sum_{j\in G}\lambda_j|\Delta\hat w_j|$$
$$\le 2\|\Delta\hat w_I\|_2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2}\|\Delta\hat w_I\|_2.$$
Note that the equality uses the facts that $G \subset \bar F^c$ and $\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) = |\hat w_j|$ for $j \in \bar F^c$. The last inequality uses the fact that $\lambda_j \ge \lambda_G \ge 2\|\hat\epsilon_G\|_\infty$ for all $j \in G$. Now, combining the above two estimates, we obtain
$$\|\Delta\hat w_I\|_2 \le \frac{1}{t\,\rho_-(k+s)}\left[2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2}\right].$$
The desired bound follows from Lemma 5. □

Lemma 7. Let $\lambda_j = \lambda\, I(|w_j| \le \theta)$ for some $w \in \mathbb{R}^p$. Then
$$\Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2} \le \lambda\sqrt{\sum_{j\in\bar F} I(|\bar w_j| \le 2\theta)} + \lambda\,\big|\{j\in\bar F : |\bar w_j - w_j| \ge \theta\}\big|^{1/2}.$$

Proof. If $|\bar w_j - w_j| \ge \theta$, then $I(|w_j| \le \theta) \le 1 \le I(|\bar w_j - w_j| \ge \theta)$; otherwise, $I(|w_j| \le \theta) \le I(|\bar w_j| \le 2\theta)$. It follows that the following inequality always holds:
$$I(|w_j| \le \theta) \le I(|\bar w_j| \le 2\theta) + I(|\bar w_j - w_j| \ge \theta).$$
The desired bound is a direct consequence of this result and the 2-norm triangle inequality
$$\Big(\sum_j (x_j + \Delta x_j)^2\Big)^{1/2} \le \Big(\sum_j x_j^2\Big)^{1/2} + \Big(\sum_j \Delta x_j^2\Big)^{1/2}. \quad \Box$$

Lemma 8. Define $F^{(\ell)} = \{j : |\hat w^{(\ell)}_j - \bar w_j| \ge \theta\}$. Under the conditions of Theorem 1, we have for all $s \ge 2\bar k$:
$$\|\hat w^{(\ell)} - \tilde w\|_2 \le \frac{5.7\lambda}{\rho_-(1.5\bar k + s)}\sqrt{|F^{(\ell-1)}|}, \quad\text{and}\quad \sqrt{|F^{(\ell)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k + s)}\sqrt{|F^{(\ell-1)}|}.$$

Proof. For all $t \in [0.5, 4/3)$, using Lemma 3, we know that the condition of the theorem implies that
$$\frac{\lambda}{\lambda - 2\|\hat\epsilon\|_\infty} \le \frac{7}{5} \le \frac{4-t}{4-3t}.$$
Moreover, Lemma 1 implies that the condition $0.5 \le t = 1 - \pi(1.5\bar k + s, s)(1.5\bar k)^{0.5}/s$ is also satisfied. This means that the conditions of Lemma 6 (with $\lambda_0 = \lambda_G = \lambda$) are satisfied. Now assume that at some $\ell \ge 1$, $|G^c_\ell| \le 1.5\bar k$, where
$$G_\ell = \{j \notin \bar F : \lambda^{(\ell-1)}_j = \lambda\}; \qquad (11)$$
then it is easy to verify that $G^c_\ell \setminus \bar F \subset F^{(\ell-1)}$.

Moreover, with the choice $G = G_\ell$ in Lemma 6 and Lemma 7, we can set $\lambda_0 = \lambda_G = \lambda$ and obtain (note also that $\hat\epsilon_{\bar F} = 0$):
$$\|\hat w^{(\ell)} - \tilde w\|_2 \le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[2\|\hat\epsilon_{G^c_\ell\setminus\bar F}\|_2 + \Big(\sum_{j\in\bar F}(\lambda^{(\ell-1)}_j)^2\Big)^{1/2}\right]$$
$$\le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[2\sqrt{|F^{(\ell-1)}\setminus\bar F|}\,\|\hat\epsilon\|_\infty + \sqrt{|F^{(\ell-1)}\cap\bar F|}\,\lambda\right]$$
$$\le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[(2/7)\sqrt{|F^{(\ell-1)}\setminus\bar F|} + \sqrt{|F^{(\ell-1)}\cap\bar F|}\right]\lambda$$
$$\le \frac{1+\sqrt 3}{0.5\,\rho_-(1.5\bar k+s)}\sqrt{1.082\,|F^{(\ell-1)}|}\,\lambda \le \frac{5.7\lambda}{\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|},$$
where the first inequality is due to Lemma 6; the second uses the facts that $G^c_\ell\setminus\bar F \subset F^{(\ell-1)}\setminus\bar F$ and Lemma 7 with $I(|\bar w_j| \le 2\theta) = 0$ (for all $j \in \bar F$); the third uses $2\|\hat\epsilon\|_\infty \le (2/7)\lambda$; and the fourth uses $(2/7)a + b \le \sqrt{1.082(a^2+b^2)}$.

Since Lemma 3 implies that
$$\|\tilde w - \bar w\|_\infty \le (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)},$$
we know that $j \in F^{(\ell)}$ implies
$$|\tilde w_j - \hat w^{(\ell)}_j| \ge \theta - (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)} \ge (41/42)\,\theta.$$
Therefore
$$\sqrt{|F^{(\ell)}|} \le (41\theta/42)^{-1}\|\tilde w - \hat w^{(\ell)}\|_2 \le \frac{5.7\lambda\,(42/41)}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|}.$$
That is, under assumption (11), the lemma holds at $\ell$. Therefore it remains to prove, by induction on $\ell$, that (11) holds for all $\ell = 1, 2, \ldots$. When $\ell = 1$, we have $G^c_1 = \bar F$, which implies that (11) holds. Now assume that (11) holds at some $\ell \ge 1$. Then by the induction hypothesis we know that the lemma holds at $\ell$. This means that
$$\sqrt{|G^c_{\ell+1}\setminus\bar F|} \le \sqrt{|F^{(\ell)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|} \le \sqrt{0.5\,|F^{(\ell-1)}|} \le \cdots \le 0.5^{\ell/2}\sqrt{|F^{(0)}|}.$$
The first inequality is due to the fact that $G^c_{\ell+1}\setminus\bar F \subset F^{(\ell)}$; the second uses the assumption on $\theta$ in the theorem; the last uses induction.
Now note that $F^{(0)} = \bar F$; we thus have $|G^c_{\ell+1}\setminus\bar F| \le 0.5\bar k$. This completes the induction step. □

5.2 Proof of Theorem 1

Define
$$\beta = \frac{6\lambda}{\theta\,\rho_-(1.5\bar k + s)};$$
we have $\beta < 1$ by the assumption of the theorem. Using induction, we have from Lemma 8 that
$$\sqrt{|F^{(L)}|} \le \beta\sqrt{|F^{(L-1)}|} \le \cdots \le \beta^L\sqrt{|F^{(0)}|} \le \beta^L\sqrt{\bar k} < 1.$$
This means that when $\ell > L$, $|F^{(\ell-1)}| = 0$. Therefore, applying Lemma 8 again, we obtain $\|\hat w^{(\ell)} - \tilde w\|_2 = 0$. Since Lemma 3 implies that
$$\|\tilde w - \bar w\|_\infty \le (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)} < \theta,$$
we have $\mathrm{supp}(\tilde w) = \mathrm{supp}(\bar w)$. This implies that $\mathrm{supp}(\hat w^{(\ell)}) = \mathrm{supp}(\bar w)$. □

6 Discussion

This paper investigated the performance of multi-stage convex relaxation for feature selection, showing that under RIP the procedure achieves unbiased feature selection. This result complements that of [16], which studies the parameter estimation performance of multi-stage convex relaxation. It also complements similar results obtained in [17] and [13] for different computational procedures. One advantage of our result over that of [13] is that the multi-stage convex relaxation method is provably efficient, because the correct feature set is obtained after no more than $O(\ln\bar k)$ iterations. In comparison, a computational efficiency statement for the path-following method of [13] remains open.

References

[1] Peter Bickel, Ya'acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705-1732, 2009.
[2] Florentina Bunea, Alexandre Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169-194, 2007.
[3] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203-4215, 2005.
[4] Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, 2008.
[5] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.
[6] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, 2008.
[7] Karim Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90-102, 2008.
[8] Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34:1436-1462, 2006.
[9] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58:267-288, 1996.
[10] Sara A. van de Geer and Peter Buhlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[11] Martin Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.
[12] David P. Wipf and Srikantan Nagarajan. Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. Journal of Selected Topics in Signal Processing, 4(2):317-329, 2010.
[13] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894-942, 2010.
[14] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567-1594, 2008.
[15] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37(5A):2109-2144, 2009.
[16] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1087-1107, 2010.
[17] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57:4689-4708, 2011.
[18] Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, 2006.
[19] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, 2006.
[20] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-1533, 2008.
