Multi-stage Convex Relaxation for Feature Selection

Authors: Tong Zhang, Statistics Department, Rutgers University, Piscataway, NJ 08854 (tzhang@stat.rutgers.edu)

Abstract

A number of recent works have studied the effectiveness of feature selection using Lasso. It is known that under the restricted isometry property (RIP), Lasso does not generally lead to exact recovery of the set of nonzero coefficients, due to the looseness of the convex relaxation. This paper considers the feature selection property of nonconvex regularization, where the solution is given by a multi-stage convex relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure recovers the set of nonzero coefficients without suffering from the bias of the Lasso relaxation, which complements the parameter estimation results for this procedure in [16].

1 Introduction

We consider the linear regression problem, where we observe a set of input vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ with corresponding desired output variables $y_1, \ldots, y_n$. In a statistical linear model, it is common to assume that there exists a target coefficient vector $\bar w \in \mathbb{R}^p$ such that

$$y_i = \bar w^\top x_i + \epsilon_i \quad (i = 1, \ldots, n), \qquad (1)$$

where the $\epsilon_i$ are zero-mean independent random noises (but not necessarily identically distributed). Moreover, we assume that the target vector $\bar w$ is sparse; that is, $\bar k = \|\bar w\|_0$ is small. Here we use the standard notation

$$\mathrm{supp}(w) = \{j : w_j \neq 0\}, \qquad \|w\|_0 = |\mathrm{supp}(w)|$$

for any vector $w \in \mathbb{R}^p$. This paper focuses on the feature selection problem, where we are interested in estimating the set of nonzero coefficients $\mathrm{supp}(\bar w)$ (also called the support set). Let $y$ denote the vector $[y_i]$ and let $X$ be the $n \times p$ matrix whose rows are the vectors $x_i$. The standard statistical method is subset selection ($L_0$ regularization), which computes the estimator

$$\hat w_{L_0} = \arg\min_{w \in \mathbb{R}^p} \|Xw - y\|_2^2 \quad \text{subject to } \|w\|_0 \le k, \qquad (2)$$

where $k$ is a tuning parameter. This method is arguably a natural method for feature selection because, if the noise terms $\epsilon_i$ are iid Gaussian random variables, then (2) can be regarded as a Bayes procedure with an appropriately defined sparse prior over $w$. However, because the optimization problem in (2) is nonconvex, its global solution cannot be computed efficiently. In practice, one can only find an approximate solution of (2). The most popular approximation to $L_0$ regularization is the $L_1$ regularization method, often referred to as Lasso [9]:

$$\hat w_{L_1} = \arg\min_{w \in \mathbb{R}^p} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \|w\|_1 \right], \qquad (3)$$

where $\lambda > 0$ is an appropriately chosen regularization parameter. The global optimum of (3) can be computed easily using standard convex programming techniques. It is known that in practice $L_1$ regularization often leads to sparse solutions (although often suboptimal ones), and its performance has been analyzed theoretically in recent years. For example, it is known from the compressed sensing literature (e.g., [3]) that under certain conditions referred to as the restricted isometry property (RIP), the solution of the $L_1$ relaxation (3) approximates the solution of the $L_0$ regularization problem (2). The prediction and parameter estimation performance of this method has been considered in [2, 1, 6, 14, 15, 10]. Exact support recovery was considered by various authors such as [8, 18, 11].
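For concreteness, the following is a minimal sketch of how the Lasso estimator (3) can be computed in practice. It is not from the paper: it assumes scikit-learn is available, the toy data are hypothetical, and note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|Xw-y\|_2^2 + \alpha\|w\|_1$, so its `alpha` corresponds to $\lambda/2$ in the parameterization of (3).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy instance of model (1): sparse target, Gaussian noise.
rng = np.random.default_rng(0)
n, p, k_bar = 100, 250, 5
X = rng.standard_normal((n, p))
w_bar = np.zeros(p)
w_bar[:k_bar] = 3.0                      # nonzero coefficients of the target
y = X @ w_bar + rng.standard_normal(n)   # eps_i ~ N(0, 1)

# Lasso (3): min_w (1/n)||Xw - y||_2^2 + lam * ||w||_1.
# scikit-learn minimizes (1/(2n))||Xw - y||_2^2 + alpha * ||w||_1,
# so alpha = lam / 2 matches the paper's scaling.
lam = 2 * np.sqrt(np.log(p) / n)         # order sigma * sqrt(ln p / n)
w_hat = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y).coef_

print("estimated support:", np.flatnonzero(w_hat))
```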
It is known that under some more restrictive conditions, referred to as irrepresentable conditions, $L_1$ regularization can achieve exact recovery of the support set. However, the $L_1$ regularization method (3) does not achieve exact recovery of the support set under the RIP type of conditions, which we are interested in here. Although it is possible to achieve exact recovery by post-processing (thresholding the small coefficients of the Lasso solution), this approach is suboptimal under RIP in comparison to the $L_0$ regularization method (2): it requires the smallest nonzero coefficients to be $\sqrt{\bar k}$ times larger than the noise level, whereas $L_0$ regularization in (2) only requires the nonzero coefficients to be larger than the noise level. This issue, referred to as the bias of Lasso for feature selection, was extensively discussed in [13]; a detailed discussion can be found after Theorem 1. It is worth mentioning that under a stronger mutual coherence condition (similar to the irrepresentable condition), this post-processing step does not incur the bias factor $\sqrt{\bar k}$, as shown in [7] (also see [15]). Therefore the advantage of bias removal for the multi-stage procedure discussed here applies when RIP holds but the irrepresentable and mutual incoherence conditions fail. A thorough discussion of the various conditions is beyond the scope of the current paper; we refer the reader to [10]. Nevertheless, it is worth pointing out that even in the classical $p < n$ setting with the design matrix $X$ of rank $p$, the irrepresentable condition or the mutual incoherence condition can still be violated while the RIP type sparse-eigenvalue condition used in this paper holds trivially. In fact, this was pointed out in [19] as the main motivation for adaptive Lasso. Adaptive Lasso behaves similarly to the above-mentioned post-processing, and thus suffers from the same bias problem.

The bias of Lasso is due to the looseness of the convex relaxation of $L_0$ regularization. The remedy is therefore to use a nonconvex regularizer that is closer to $L_0$ regularization. One drawback of a nonconvex optimization formulation is that we can only find a local optimal solution, and different computational procedures may lead to different local solutions. Therefore the theoretical analysis has to be integrated with a specific computational procedure, to show that the local minimum obtained by that procedure has desirable properties (e.g., exact support recovery). Several nonconvex computational procedures have been analyzed in the literature, including an adaptive forward-backward greedy procedure (referred to as FoBa) that approximately solves the regularization method (2), considered in [17], and the MC+ method of [13], which solves a nonconvex regularized problem using a path-following procedure. Both methods can achieve unbiased feature selection. Related to the above-mentioned work, a different procedure, referred to as multi-stage convex relaxation, was analyzed in [16]. This procedure solves a nonconvex problem using multiple stages of Lasso relaxations, where the convex formulations are iteratively refined based on solutions obtained from the previous stages. However, only the parameter estimation performance was analyzed in [16].
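As a side note, the thresholding post-processing mentioned above is simple to state in code. A minimal sketch (the threshold `tau` is a hypothetical tuning parameter, not a value prescribed by the paper):

```python
import numpy as np

def threshold_support(w_hat: np.ndarray, tau: float) -> np.ndarray:
    """Hard-threshold a Lasso solution: keep coefficients above tau.

    Under RIP, exact recovery via this post-processing needs tau (and
    hence the smallest nonzero |w_j|) at the order of sqrt(k_bar) times
    the noise level -- the "bias" of Lasso discussed above.
    """
    return np.flatnonzero(np.abs(w_hat) > tau)
```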
Unfortunately, the result in [16] does not directly imply that multi-stage convex relaxation achieves unbiased recovery of the support set. The purpose of this paper is to prove such a support recovery result, analogous to the related result in [13] (which is for a different procedure); this result complements the parameter estimation result of [16].

2 Multi-Stage Convex Relaxation with Capped-$L_1$ Regularization

We are interested in recovering $\bar w$ from the noisy observations $y$ using the following nonconvex regularization formulation:

$$\hat w = \arg\min_{w} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \sum_{j=1}^{p} g(|w_j|) \right], \qquad (4)$$

where $g(|w_j|)$ is a regularization function. For simplicity, this paper only considers the specific regularizer

$$g(u) = \min(u, \theta), \qquad (5)$$

which is referred to as capped-$L_1$ regularization in [16]. The parameter $\theta$ is a thresholding parameter: we use $L_1$ penalization when a coefficient is sufficiently small, but the penalty does not increase once the coefficient is larger than the threshold $\theta$. Detailed discussions can be found in [16]. Similar to [16], one can analyze a general regularization function $g(u)$; however, some such functions (such as adaptive Lasso) do not completely remove the bias. Therefore we only analyze the simple function (5) in this paper, for clarity. While a theoretical justification has been given in [16] for multi-stage convex relaxation, similar procedures have been shown to work well empirically without theoretical justification [4, 12]. Moreover, a two-stage version was proposed in [20], which does not remove the bias issue discussed in this paper.

Since the regularizer (5) is nonconvex, the resulting optimization problem (4) is a nonconvex regularization problem. However, the regularizer in (5) is continuous and piecewise differentiable, and thus its solution is easier to compute than that of the $L_0$ regularization method in (2). For example, standard numerical techniques such as sub-gradient descent lead to local minimum solutions. Unfortunately, it is difficult to find the global optimum, and it is also difficult to analyze the quality of the local minimum obtained by gradient descent. As a matter of fact, results with nonconvex regularization are difficult to reproduce because different numerical optimization procedures can lead to different local minima; the quality of the solution therefore depends heavily on the numerical procedure used.

In the following, we consider a specific numerical procedure, referred to as multi-stage convex relaxation in [16]. The algorithm is given in Figure 1. The procedure converges to a local optimal solution of (4) by a simple concave duality argument, where (4) is rewritten as

$$\hat w = \arg\min_{w} \min_{\{\lambda_j \ge 0\}} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{p} \lambda_j |w_j| + \sum_{j=1}^{p} g^*(\lambda_j) \right],$$

with $g^*(\lambda_j) = \max((\lambda - \lambda_j)\theta, 0)$. The procedure of Figure 1 can be regarded as an alternating optimization method for this joint optimization problem over $w$ and $\{\lambda_j\}$: the first step solves for $w$ with $\{\lambda_j\}$ fixed, and the second step is the closed-form solution for $\{\lambda_j\}$ with $w$ fixed. A more detailed discussion can be found in [16]. Our goal is to show that this procedure can achieve unbiased feature selection as described in [13].

Initialize $\lambda^{(0)}_j = \lambda$ for $j = 1, \ldots, p$.
For $\ell = 1, 2, \ldots$:
- Let
$$\hat w^{(\ell)} = \arg\min_{w \in \mathbb{R}^p} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{p} \lambda^{(\ell-1)}_j |w_j| \right]. \qquad (6)$$
- Let $\lambda^{(\ell)}_j = \lambda \, I(|\hat w^{(\ell)}_j| \le \theta)$ for $j = 1, \ldots, p$.

Figure 1: Multi-stage Convex Relaxation for Sparse Regularization
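Each stage of Figure 1 is a weighted Lasso, so the procedure is straightforward to implement on top of any Lasso solver. The following is a minimal sketch, not the author's reference implementation: it assumes scikit-learn, and since scikit-learn's `Lasso` supports only a scalar penalty, the per-coordinate weights $\lambda^{(\ell-1)}_j$ are absorbed by rescaling the columns of $X$ (a standard reweighting trick); a small floor `eps` keeps the rescaling finite when $\lambda_j = 0$, so those coordinates are only approximately unpenalized.

```python
import numpy as np
from sklearn.linear_model import Lasso

def multi_stage_capped_l1(X, y, lam, theta, n_stages=8, eps=1e-4):
    """Multi-stage convex relaxation (Figure 1) via reweighted Lasso.

    Each stage solves the weighted Lasso (6) with weights
    lam_j = lam * I(|w_j| <= theta) from the previous stage.
    Weights are absorbed by column rescaling: with v_j = w_j * lam_j / lam,
    the weighted problem becomes a standard Lasso in v.  The floor eps
    keeps the rescaling finite when lam_j = 0 (those coordinates are then
    only approximately unpenalized).
    """
    n, p = X.shape
    lam_j = np.full(p, lam)                  # stage ell = 1 is standard Lasso
    w = np.zeros(p)
    for _ in range(n_stages):
        scale = lam / np.maximum(lam_j, eps)         # column scaling lam / lam_j
        # sklearn's Lasso minimizes (1/(2n))||Xv - y||^2 + alpha ||v||_1,
        # so alpha = lam / 2 matches the paper's (1/n) scaling.
        lasso = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50000)
        v = lasso.fit(X * scale, y).coef_
        w = v * scale                                # map back to w
        lam_j = lam * (np.abs(w) <= theta)           # second step of Figure 1
    return w
```

With the simulation parameters of Section 4 (e.g., $\lambda \approx 0.94$, $\theta = \lambda$, eight stages), a call such as `multi_stage_capped_l1(X, y, lam=0.94, theta=0.94, n_stages=8)` would correspond to the best-performing configuration in Table 1.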
3 Theoretical Analysis

We require some technical conditions for our analysis. First, we assume sub-Gaussian noise as follows.

Assumption 1. Assume that $\{\epsilon_i\}_{i=1,\ldots,n}$ in (1) are independent (but not necessarily identically distributed) sub-Gaussians: there exists $\sigma \ge 0$ such that for all $i$ and all $t \in \mathbb{R}$,
$$\mathbb{E}_{\epsilon_i} e^{t\epsilon_i} \le e^{\sigma^2 t^2/2}.$$

Both Gaussian and bounded random variables are sub-Gaussian under this definition. For example, if a random variable $\xi \in [a, b]$, then $\mathbb{E}_\xi e^{t(\xi - \mathbb{E}\xi)} \le e^{(b-a)^2 t^2/8}$; if a random variable is Gaussian, $\xi \sim N(0, \sigma^2)$, then $\mathbb{E}_\xi e^{t\xi} \le e^{\sigma^2 t^2/2}$.

We also introduce the concept of sparse eigenvalues, which is standard in the analysis of $L_1$ regularization.

Definition 1. Given $k$, define
$$\rho_+(k) = \sup\left\{ \frac{1}{n}\|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\}, \qquad \rho_-(k) = \inf\left\{ \frac{1}{n}\|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\}.$$

The following result for parameter estimation was obtained in [16] under Assumption 1. Assume that the target $\bar w$ is sparse, with $\mathbb{E} y_i = \bar w^\top x_i$ and $\bar k = \|\bar w\|_0$, and choose $\theta$ and $\lambda$ such that
$$\lambda \ge 20\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n} \quad \text{and} \quad \theta \ge 9\lambda/\rho_-(2\bar k + s).$$
Assume moreover that $\rho_+(s)/\rho_-(2\bar k + 2s) \le 1 + 0.5\, s/\bar k$ for some $s \ge 2\bar k$. Then with probability larger than $1 - \eta$:
$$\|\hat w^{(\ell)} - \bar w\|_2 \le \frac{17}{\rho_-(2\bar k + s)}\left[ 2\sigma\sqrt{\rho_+(\bar k)}\left( \sqrt{\frac{7.4\,\bar k}{n}} + \sqrt{\frac{2.7\ln(2/\eta)}{n}} \right) + \lambda\sqrt{k_\theta} \right] + 0.7^\ell\,\sqrt{\bar k}\,\frac{\lambda}{\rho_-(2\bar k + s)}, \qquad (7)$$
where $\hat w^{(\ell)}$ is the solution of (6), and $k_\theta = \big|\{ j \in \bar F : |\bar w_j| \le 2\theta \}\big|$.

The condition $\rho_+(s)/\rho_-(2\bar k + 2s) \le 1 + 0.5\, s/\bar k$ requires the eigenvalue ratio $\rho_+(s)/\rho_-(s)$ to grow sub-linearly in $s$. Such a condition, referred to as the sparse eigenvalue condition, is also needed in the standard analysis of $L_1$ regularization [14, 15]. It is related to, but slightly weaker than, the RIP condition in compressive sensing [3], which requires
$$1 - \delta_{s'} \le \rho_-(s') \le \rho_+(s') \le 1 + \delta_{s'}$$
for some $\delta_{s'} \in (0, 1)$ and $s' > \bar k$. For example, with $s' = 6\bar k$ and restricted isometry constant $\delta_{s'} \le 1/3$, the sparse eigenvalue condition above holds with $s = 2\bar k$. For simplicity, in this paper we do not distinguish between RIP and the sparse eigenvalue condition. Note that in traditional low-dimensional statistical analysis, one assumes that $\rho_+(s)/\rho_-(2\bar k + 2s) < \infty$ as $s \to \infty$, which is significantly stronger than the condition we use here.
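The sparse eigenvalues of Definition 1 are defined by optimizing over all supports of size $k$, so they can be computed exactly only for tiny problems. The following brute-force sketch (not from the paper; exponential in $k$ and intended purely as an illustration of Definition 1) makes the quantities concrete:

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(X, k):
    """Brute-force rho_+(k) and rho_-(k) from Definition 1.

    For vectors w supported on a fixed set S, (1/n)||Xw||^2 / ||w||^2
    ranges over the eigenvalues of the Gram matrix (1/n) X_S^T X_S, and
    by eigenvalue interlacing it suffices to scan the extreme eigenvalues
    over supports of size exactly k.  Exponential in k: illustration only.
    """
    n, p = X.shape
    rho_plus, rho_minus = 0.0, np.inf
    for S in combinations(range(p), k):
        G = X[:, S].T @ X[:, S] / n          # k x k Gram matrix
        eigs = np.linalg.eigvalsh(G)         # ascending order
        rho_plus = max(rho_plus, eigs[-1])
        rho_minus = min(rho_minus, eigs[0])
    return rho_plus, rho_minus
```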
Although in practice it is often difficult to verify the sparse eigenvalue condition for real problems, the parameter estimation result in (7) nevertheless provides important theoretical insights for multi-stage convex relaxation. For standard Lasso, we have the bound
$$\|\hat w_{L_1} - \bar w\|_2 = O(\sqrt{\bar k}\,\lambda),$$
where $\hat w_{L_1}$ is the solution of the standard $L_1$ regularization. This bound is tight for Lasso, in the sense that the right-hand side cannot be improved except for the constant; this can be easily verified with an orthogonal design matrix. It is known that for Lasso to be effective, one has to pick $\lambda$ no smaller than the order $\sigma\sqrt{\ln p/n}$; therefore, the parameter estimation error of the standard Lasso is of the order $\sigma\sqrt{\bar k \ln p/n}$, which cannot be improved. In comparison, if we consider the capped-$L_1$ regularization with $g(|w_j|)$ defined in (5), the bound in (7) can be significantly better when most nonzero coefficients of $\bar w$ are relatively large in magnitude. In the extreme case where $k_\theta = |\{j : |\bar w_j| \in (0, 2\theta]\}| = 0$, which can be achieved when all nonzero components of $\bar w$ are larger than the order $\sigma\sqrt{\ln p/n}$, we obtain the better bound
$$\|\hat w^{(\ell)} - \bar w\|_2 = O\left( \sqrt{\bar k/n} + \sqrt{\ln(1/\eta)/n} \right)$$
for the multi-stage procedure, for sufficiently large $\ell$ at the order of $\ln\bar k + \ln\ln p$. This bound is superior to the standard one-stage $L_1$ regularization bound $\|\hat w_{L_1} - \bar w\|_2 = O(\sqrt{\bar k \ln(p/\eta)/n})$.

In the literature, one is often interested in two types of results: parameter estimation bounds as in (7), and feature selection consistency, that is, identifying the set of nonzero coefficients of the truth. Although the parameter estimation bound in (7) is superior to Lasso, it does not imply that one can correctly select all variables under this condition. Moreover, the specific proof presented in [16] does not directly imply such a result. Therefore it is important to know whether multi-stage convex relaxation can achieve unbiased feature selection as studied in [13]. In the following, we present such a result, which supplements the parameter estimation bound of (7). While the main high-level argument follows that of [16], there are many differences in the details, and hence a full proof (included in Section 5) is still needed. This theorem is the main result of the paper.

It is worth mentioning that although we only consider the simple capped-$L_1$ regularizer, similar results can be obtained (with virtually the same proof) for other regularizers such that $g'(u) \in [0, \infty)$, $g'(u) > 0$ when $u$ belongs to a neighborhood of $0$, and $g'(u) = 0$ when $u \ge \theta$, with the threshold $\theta > 0$ appropriately chosen at the order of the noise level; the condition $g'(u) = 0$ for $u \ge \theta$ is what ensures the removal of the feature selection "bias" of Lasso discussed above. As an example, a very similar result can be obtained for the MC+ penalty of [13] or the SCAD penalty of [5] using the multi-stage convex relaxation procedure here. In fact, in practice there may be additional advantages to using a smooth nonconvex penalty such as MC+ due to the extra smoothness, although this advantage is not revealed in our theoretical analysis.

Theorem 1. Let Assumption 1 hold. Assume also that the target $\bar w$ is sparse, with $\mathbb{E} y_i = \bar w^\top x_i$ and $\bar k = \|\bar w\|_0$. Let $\bar F = \mathrm{supp}(\bar w)$. Choose $\theta$ and $\lambda$ such that
$$\lambda \ge 7\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n} \quad \text{and} \quad \theta > 9\lambda/\rho_-(1.5\bar k + s).$$
Assume that $\min_{j \in \bar F} |\bar w_j| > 2\theta$ and $\rho_+(s)/\rho_-(1.5\bar k + 2s) \le 1 + 2s/(3\bar k)$ for some $s \ge 1.5\bar k$. Then with probability larger than $1 - \eta$:
$$\mathrm{supp}(\hat w^{(\ell)}) = \mathrm{supp}(\bar w) \quad \text{when } \ell > L,$$
where $\hat w^{(\ell)}$ is the solution of (6) and
$$L = \left\lfloor \frac{0.5\ln\bar k}{\ln\big(\rho_-(1.5\bar k + s)\,\theta/(6\lambda)\big)} \right\rfloor + 1.$$
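To get a feel for the stage count $L$ in Theorem 1, the following is a small worked computation under assumed values (hypothetical, for illustration only):

```python
import math

# Hypothetical illustration of the stage count L in Theorem 1:
# L = floor(0.5 * ln(k_bar) / ln(rho_- * theta / (6 * lam))) + 1.
k_bar = 30.0
ratio = 3.0   # assumed value of rho_-(1.5*k_bar + s) * theta / (6 * lam)
L = math.floor(0.5 * math.log(k_bar) / math.log(ratio)) + 1
print(L)      # -> 2, so exact recovery holds from stage ell = 3 onward
```

This reflects the logarithmic dependence on $\bar k$: even for much larger $\bar k$, only a handful of stages is needed.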
If $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$ for a sufficiently large constant $c$ that is independent of $\bar k$ (but may depend on the RIP condition), then we can pick both parameters $\lambda = O(\sigma\sqrt{\ln p/n})$ and $\theta = O(\sigma\sqrt{\ln p/n})$ at the noise level, so that Theorem 1 applies. In this case, Theorem 1 implies that multi-stage capped-$L_1$ regularization achieves exact recovery of the support set $\mathrm{supp}(\bar w)$. In comparison, Lasso does not achieve exact sparse recovery under RIP conditions. While running Lasso followed by thresholding small coefficients to zero (or using the adaptive Lasso of [19] or the two-stage procedure of [20]) may achieve exact recovery, such a procedure requires the condition

$$\min_{j \in \bar F} |\bar w_j| \ge c'\sigma\sqrt{\bar k \ln p/n} \qquad (8)$$

for some constant $c'$ (which also depends on the RIP condition). This extra $\sqrt{\bar k}$ factor is referred to as the bias of the Lasso procedure in [13]. Moreover, it is known that for exact recovery to hold, the requirement $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$ (up to a constant) is necessary for all statistical procedures, in the sense that if $\min_{j \in \bar F} |\bar w_j| \le c'\sigma\sqrt{\ln p/n}$ for a sufficiently small constant $c'$ (under appropriate RIP conditions), then no statistical procedure can achieve exact recovery with large probability. Therefore, statistical procedures that achieve exact support recovery without the extra $\sqrt{\bar k}$ factor of (8), that is, under a condition of the form $\min_{j \in \bar F} |\bar w_j| \ge c\sigma\sqrt{\ln p/n}$, are referred to as (nearly) unbiased feature selection methods in [13]. Theorem 1 shows that multi-stage convex relaxation with capped-$L_1$ regularization achieves unbiased feature selection.

The results most comparable to ours are those for the FoBa procedure in [17] and the MC+ procedure in [13]. Both can be regarded as (approximate) optimization methods for nonconvex formulations: the former is a forward-backward greedy algorithm, which does not optimize (4), while the latter is a path-following algorithm for solving formulations similar to (4). Although the results in [13] are comparable to ours, we should note that, unlike our procedure, which is efficient because it requires only a finite number of convex optimizations, there is no proof showing that the path-following strategy in [13] is always efficient (in the sense that there may be exponentially many switching points).

4 Simulation Study

Numerical examples demonstrating the advantage of multi-stage convex relaxation over Lasso can be found in [16], so we do not repeat a comprehensive study here. Nevertheless, this section presents a simple simulation study to illustrate the theoretical results. The $n \times p$ design matrix $X$ is generated with iid random Gaussian entries, and each column is normalized to 2-norm $\sqrt n$. Here $n = 100$ and $p = 250$. We then generate a vector $\bar w$ with $\bar k = 30$ nonzero coefficients, where each nonzero coefficient is generated uniformly from the interval $(1, 10)$. The observation is $y = X\bar w + \epsilon$, where $\epsilon$ is zero-mean iid Gaussian noise with standard deviation $\sigma = 1$. We study the feature selection performance of the multi-stage convex relaxation method of Figure 1 using various configurations of $\lambda = \tau\sigma\sqrt{\ln(p)/n}$ (with $\tau = 1, 2, 4, 8, 16, 32$) and $\theta = \mu\lambda$ for the constants $\mu = 0.5, 1, 2, 4$. The experiments are repeated 100 times, and Table 1 reports the probability (percentage over the 100 runs) of exact support recovery for each configuration at various stages $\ell$.
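A minimal sketch of this simulation, reusing the hypothetical `multi_stage_capped_l1` helper sketched in Section 2 (with fewer repetitions than the 100 used for Table 1, purely to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k_bar, sigma = 100, 250, 30, 1.0

def one_run(tau, mu, n_stages):
    # Design matrix with iid Gaussian entries, columns normalized to 2-norm sqrt(n).
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
    # Sparse target: k_bar nonzeros, each uniform on (1, 10).
    w_bar = np.zeros(p)
    support = rng.choice(p, k_bar, replace=False)
    w_bar[support] = rng.uniform(1, 10, k_bar)
    y = X @ w_bar + sigma * rng.standard_normal(n)
    lam = tau * sigma * np.sqrt(np.log(p) / n)
    w = multi_stage_capped_l1(X, y, lam=lam, theta=mu * lam, n_stages=n_stages)
    return set(np.flatnonzero(w)) == set(support)    # exact support recovery?

# e.g. tau = 4 (lam ~ 0.94), mu = 1, ell = 8: the best cell in Table 1
recovery_rate = np.mean([one_run(4, 1, 8) for _ in range(20)])
print(recovery_rate)
```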
Note that $\ell = 1$ corresponds to Lasso, and $\ell = 2$ is an adaptive-Lasso-like two-stage method [19, 20]. The main purpose of this study is to illustrate that it is beneficial to use more than two stages, as predicted by our theory. However, since only $O(\ln\bar k)$ stages are sufficient, optimal results can be achieved with a relatively small number of stages. These conclusions can be clearly seen from Table 1. Specifically, the results for $\ell = 2$ are better than those for $\ell = 1$ (standard Lasso), and the results for $\ell = 4$ are better than those for $\ell = 2$. Although the performance at $\ell = 8$ is better still, the improvement over $\ell = 4$ is small at the optimal configuration of $\lambda$ and $\theta$. This is consistent with our theory, which implies that a relatively small number of stages is needed to achieve good performance.

θ = 0.5λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0      0.02   0      0      0
  ℓ = 4   0      0.05   0.63   0.18   0      0
  ℓ = 8   0      0.12   0.83   0.25   0      0

θ = λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.04   0.15   0.06   0      0
  ℓ = 4   0      0.33   0.86   0.13   0      0
  ℓ = 8   0      0.38   0.93   0.16   0      0

θ = 2λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.14   0.22   0      0      0
  ℓ = 4   0      0.29   0.6    0.02   0      0
  ℓ = 8   0      0.3    0.62   0.02   0      0

θ = 4λ
  λ       0.23   0.47   0.94   1.9    3.8    7.5
  ℓ = 1   0      0      0      0      0      0
  ℓ = 2   0      0.01   0.01   0      0      0
  ℓ = 4   0      0.06   0.06   0      0      0
  ℓ = 8   0      0.06   0.06   0      0      0

Table 1: Probability of exact support recovery for multi-stage convex relaxation.

5 Proof of Theorem 1

The analysis is an adaptation of [16]. While the main proof structure is similar, there are nevertheless subtle and important differences in the details, and hence a complete proof is still necessary. The main technical differences are as follows. The proof of [16] tracks the progress from one stage $\ell - 1$ to the next stage $\ell$ using a bound on the 2-norm parameter estimate, while in the current proof we track the progress using the set of variables that differ significantly from the true variables. Moreover, in [16] we compare the current estimated parameter to the true parameter $\bar w$, which is sufficient for parameter estimation. However, in order to establish the feature selection result of this paper, it is necessary to compare the current estimated parameter to the least squares solution $\tilde w$ within the true feature set $\bar F$, as defined below in (9). These subtle technical differences mean that many details of the proofs presented below differ from those of [16].

5.1 Auxiliary lemmas

We first introduce some definitions. Consider the positive semi-definite matrix $A = n^{-1} X^\top X \in \mathbb{R}^{p \times p}$. Given $s, k \ge 1$ such that $s + k \le p$, let $I, J$ be disjoint subsets of $\{1, \ldots, p\}$ with $k$ and $s$ elements respectively. Let $A_{I,I} \in \mathbb{R}^{k \times k}$ be the restriction of $A$ to indices $I$, and $A_{I,J} \in \mathbb{R}^{k \times s}$ be the restriction of $A$ to indices $I$ on the left and $J$ on the right. Similarly, we define the restriction $w_I$ of a vector $w \in \mathbb{R}^p$ to $I$; for convenience, we allow either $w_I \in \mathbb{R}^k$ or $w_I \in \mathbb{R}^p$ (with components not in $I$ set to zero), depending on the context.
We also need the following quantity in our analysis:
$$\pi(k, s) = \sup_{v \in \mathbb{R}^k,\, u \in \mathbb{R}^s,\, I, J} \frac{v^\top A_{I,J}\, u \,\|v\|_2}{v^\top A_{I,I}\, v \,\|u\|_\infty}.$$
The following two lemmas are taken from [15]; we skip the proofs.

Lemma 1. The following inequality holds:
$$\pi(k, s) \le \frac{s^{1/2}}{2}\sqrt{\rho_+(s)/\rho_-(k+s) - 1}.$$

Lemma 2. Consider $k, s > 0$ and $G \subset \{1, \ldots, p\}$ such that $|G^c| = k$. Given any $w \in \mathbb{R}^p$, let $J$ be the indices of the $s$ largest components of $w_G$ (in absolute value), and $I = G^c \cup J$. Then
$$\max(0, w_I^\top A w) \ge \rho_-(k+s)\big( \|w_I\|_2 - \pi(k+s, s)\|w_G\|_1/s \big)\|w_I\|_2.$$

Our analysis requires us to keep track of progress with respect to the least squares solution $\tilde w$ restricted to the true feature set $\bar F$, which we define as
$$\tilde w = \arg\min_{w \in \mathbb{R}^p} \|Xw - y\|_2^2 \quad \text{subject to } \mathrm{supp}(w) \subset \bar F, \qquad (9)$$
where $\bar F = \mathrm{supp}(\bar w)$. The following lemmas require varying degrees of modification from similar lemmas in [16], and thus the proofs are included for completeness.

Lemma 3. Define $\hat\epsilon = \frac{1}{n} X^\top (X\tilde w - y)$. Under the conditions of Assumption 1, with probability larger than $1 - \eta$:
$$\forall j \in \bar F : \quad |\hat\epsilon_j| = 0, \qquad |\tilde w_j - \bar w_j| \le \sigma\sqrt{2\rho_-(\bar k)^{-1}\ln(2p/\eta)/n},$$
and
$$\forall j \notin \bar F : \quad |\hat\epsilon_j| \le \sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n}.$$

Proof. Let $\tilde P$ be the projection matrix onto the subspace spanned by the columns of $X$ in $\bar F$; then we know that $X\tilde w = \tilde P y$ and $(I - \tilde P)\mathbb{E}y = \mathbb{E}y - X\bar w = 0$. Therefore, for each $j$,
$$n|\hat\epsilon_j| = |X_j^\top (X\tilde w - y)| = |X_j^\top (I - \tilde P)(y - \mathbb{E}y)|.$$
This implies that $\hat\epsilon_j = 0$ if $j \in \bar F$. Since each column $X_j$ satisfies $\|X_j^\top (I - \tilde P)\|_2^2 \le n\rho_+(1)$, we have from the sub-Gaussian tail bound that for all $j \notin \bar F$ and all $\epsilon > 0$:
$$P[|\hat\epsilon_j| \ge \epsilon] \le 2\exp\big[-n\epsilon^2/(2\sigma^2\rho_+(1))\big].$$
Moreover, for each $j \in \bar F$, we have
$$|\tilde w_j - \bar w_j| = \big|e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} X_{\bar F}^\top (y - \mathbb{E}y)\big|.$$
Since $\|e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} X_{\bar F}^\top\|_2^2 = e_j^\top (X_{\bar F}^\top X_{\bar F})^{-1} e_j \le n^{-1}\rho_-(\bar k)^{-1}$, we have for all $\epsilon > 0$:
$$P[|\tilde w_j - \bar w_j| \ge \epsilon] \le 2\exp\big[-n\rho_-(\bar k)\epsilon^2/(2\sigma^2)\big].$$
Taking the union bound over $j = 1, \ldots, p$ (each with probability $\eta/p$), we obtain the desired inequality. □

Lemma 4. Consider $G \subset \{1, \ldots, p\}$ such that $\bar F \cap G = \emptyset$. Let $\hat w = \hat w^{(\ell)}$ be the solution of (6), and let $\Delta\hat w = \hat w - \tilde w$. Let $\lambda_G = \min_{j \in G}\lambda^{(\ell-1)}_j$ and $\lambda_0 = \max_j \lambda^{(\ell-1)}_j$. If $2\|\hat\epsilon\|_\infty < \lambda_G$, then
$$\sum_{j \in G} |\hat w_j| \le \frac{2\|\hat\epsilon\|_\infty}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sum_{j \notin \bar F \cup G}|\hat w_j| + \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sum_{j \in \bar F}|\Delta\hat w_j| \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\|\Delta\hat w_{G^c}\|_1.$$

Proof. For simplicity, let $\lambda_j = \lambda^{(\ell-1)}_j$. The first-order condition implies that
$$\frac{1}{n}\sum_{i=1}^n 2(x_i^\top \hat w - y_i)x_{i,j} + \lambda_j\,\mathrm{sgn}(\hat w_j) = 0,$$
where $\mathrm{sgn}(w_j) = 1$ when $w_j > 0$, $\mathrm{sgn}(w_j) = -1$ when $w_j < 0$, and $\mathrm{sgn}(w_j) \in [-1, 1]$ when $w_j = 0$. This implies that for all $v \in \mathbb{R}^p$ we have
$$2v^\top A\Delta\hat w \le -2v^\top\hat\epsilon - \sum_{j=1}^p \lambda_j v_j\,\mathrm{sgn}(\hat w_j). \qquad (10)$$
Now, let $v = \Delta\hat w$ in (10), and note that $\hat\epsilon_{\bar F} = 0$; we obtain
$$0 \le 2\Delta\hat w^\top A\Delta\hat w \le 2|\Delta\hat w^\top\hat\epsilon| - \sum_{j=1}^p \lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$\le 2\|\Delta\hat w_{\bar F^c}\|_1\|\hat\epsilon\|_\infty - \sum_{j \in \bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) - \sum_{j \notin \bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$\le 2\|\Delta\hat w_{\bar F^c}\|_1\|\hat\epsilon\|_\infty + \sum_{j \in \bar F}\lambda_j|\Delta\hat w_j| - \sum_{j \notin \bar F}\lambda_j|\hat w_j|$$
$$\le \sum_{j \in G}(2\|\hat\epsilon\|_\infty - \lambda_G)|\hat w_j| + \sum_{j \notin G\cup\bar F} 2\|\hat\epsilon\|_\infty|\hat w_j| + \sum_{j \in \bar F}\lambda_0|\Delta\hat w_j|.$$
By rearranging the above inequality, we obtain the first desired bound. The second inequality uses $2\|\hat\epsilon\|_\infty \le \lambda_0$. □

Lemma 5. Using the notation of Lemma 4, let $J$ be the indices of the largest $s$ coefficients (in absolute value) of $\hat w_G$, let $I = G^c \cup J$, and let $k = |G^c|$. If $0 \le \lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le 3$, then
$$\|\Delta\hat w\|_2 \le \big(1 + (3k/s)^{0.5}\big)\|\Delta\hat w_I\|_2.$$

Proof. Using $\lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le 3$, we obtain from Lemma 4
$$\|\hat w_G\|_1 \le 3\|\Delta\hat w - \hat w_G\|_1.$$
Therefore
$$\|\Delta\hat w - \Delta\hat w_I\|_\infty \le \|\Delta\hat w_J\|_1/s = s^{-1}\big[\|\Delta\hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big] \le s^{-1}\big[3\|\Delta\hat w - \hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big],$$
which implies that
$$\|\Delta\hat w - \Delta\hat w_I\|_2 \le \big(\|\Delta\hat w - \Delta\hat w_I\|_1\,\|\Delta\hat w - \Delta\hat w_I\|_\infty\big)^{1/2}$$
$$\le \big[\|\Delta\hat w - \Delta\hat w_I\|_1\big(3\|\Delta\hat w - \hat w_G\|_1 - \|\Delta\hat w - \Delta\hat w_I\|_1\big)\big]^{1/2} s^{-1/2}$$
$$\le \big[(3\|\Delta\hat w - \hat w_G\|_1/2)^2\big]^{1/2} s^{-1/2} = (3/2)\,s^{-1/2}\|\Delta\hat w - \hat w_G\|_1 \le (3/2)\,s^{-1/2} k^{1/2}\|\Delta\hat w - \hat w_G\|_2 \le (3k/s)^{1/2}\|\Delta\hat w_I\|_2.$$
The third inequality uses the simple algebraic inequality $a(3b - a) \le (3b/2)^2$. Rearranging yields the desired bound. Note that in the above derivation we used the fact that $\bar F \cap G = \emptyset$, which implies $\Delta\hat w_G = \hat w_G$, and thus $\Delta\hat w - \hat w_G = \Delta\hat w_{G^c}$. □

Lemma 6. Let the conditions of Lemma 4 and Lemma 5 hold, and let $k = |G^c|$. If
$$t = 1 - \pi(k+s, s)\,k^{1/2} s^{-1} \in (0, 4/3)$$
and $0 \le \lambda_0/(\lambda_G - 2\|\hat\epsilon\|_\infty) \le (4-t)/(4-3t)$, then
$$\|\Delta\hat w\|_2 \le \big(1 + (3k/s)^{0.5}\big)\|\Delta\hat w_I\|_2 \le \frac{1 + (3k/s)^{0.5}}{t\,\rho_-(k+s)}\left[2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}(\lambda^{(\ell-1)}_j)^2\Big)^{1/2}\right].$$

Proof. Let $J$ be the indices of the largest $s$ coefficients (in absolute value) of $\hat w_G$, and $I = G^c \cup J$. The conditions of the lemma imply that
$$\max(0, \Delta\hat w_I^\top A\Delta\hat w) \ge \rho_-(k+s)\big[\|\Delta\hat w_I\|_2 - \pi(k+s,s)\|\hat w_G\|_1/s\big]\|\Delta\hat w_I\|_2$$
$$\ge \rho_-(k+s)\big[1 - (1-t)(4-t)(4-3t)^{-1}\big]\|\Delta\hat w_I\|_2^2 \ge 0.5\,t\,\rho_-(k+s)\|\Delta\hat w_I\|_2^2.$$
In the above derivation, the first inequality is due to Lemma 2; the second is due to the conditions of this lemma plus Lemma 4, which implies
$$\|\hat w_G\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\|\Delta\hat w_{G^c}\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat\epsilon\|_\infty}\sqrt{k}\,\|\Delta\hat w_I\|_2;$$
and the last inequality follows from $1 - (1-t)(4-t)(4-3t)^{-1} \ge 0.5t$, which holds for $t \in (0, 4/3)$.

If $\Delta\hat w_I^\top A\Delta\hat w \le 0$, then the above inequality, together with Lemma 5, implies the lemma. Therefore in the following we can assume that $\Delta\hat w_I^\top A\Delta\hat w \ge 0.5\,t\,\rho_-(k+s)\|\Delta\hat w_I\|_2^2$. Moreover, let $\lambda_j = \lambda^{(\ell-1)}_j$. We obtain from (10) with $v = \Delta\hat w_I$:
$$2\Delta\hat w_I^\top A\Delta\hat w \le -2\Delta\hat w_I^\top\hat\epsilon - \sum_{j\in I}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j)$$
$$= -2\Delta\hat w_I^\top\hat\epsilon_{G^c} - 2\Delta\hat w_I^\top\hat\epsilon_G - \sum_{j\in\bar F}\lambda_j\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) - \sum_{j\in G}\lambda_j|\Delta\hat w_j| - \sum_{j\in\bar F^c\cap G^c}\lambda_j|\Delta\hat w_j|$$
$$\le 2\|\Delta\hat w_I\|_2\|\hat\epsilon_{G^c}\|_2 + 2\|\hat\epsilon_G\|_\infty\sum_{j\in G}|\Delta\hat w_j| + \sum_{j\in\bar F}\lambda_j|\Delta\hat w_j| - \sum_{j\in G}\lambda_j|\Delta\hat w_j|$$
$$\le 2\|\Delta\hat w_I\|_2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2}\|\Delta\hat w_I\|_2.$$
Note that the equality uses the facts that $G \subset \bar F^c$ and $\Delta\hat w_j\,\mathrm{sgn}(\hat w_j) = |\hat w_j|$ for $j \in \bar F^c$. The last inequality uses the fact that $\lambda_j \ge \lambda_G \ge 2\|\hat\epsilon_G\|_\infty$ for all $j \in G$. Now, combining the above two estimates, we obtain
$$\|\Delta\hat w_I\|_2 \le \frac{1}{t\,\rho_-(k+s)}\left[2\|\hat\epsilon_{G^c}\|_2 + \Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2}\right].$$
The desired bound follows from Lemma 5. □

Lemma 7. Let $\lambda_j = \lambda\, I(|w_j| \le \theta)$ for some $w \in \mathbb{R}^p$. Then
$$\Big(\sum_{j\in\bar F}\lambda_j^2\Big)^{1/2} \le \lambda\sqrt{\sum_{j\in\bar F} I(|\bar w_j| \le 2\theta)} + \lambda\,\big|\{j\in\bar F : |\bar w_j - w_j| \ge \theta\}\big|^{1/2}.$$

Proof. If $|\bar w_j - w_j| \ge \theta$, then $I(|w_j| \le \theta) \le 1 \le I(|\bar w_j - w_j| \ge \theta)$; otherwise, $I(|w_j| \le \theta) \le I(|\bar w_j| \le 2\theta)$. It follows that the following inequality always holds:
$$I(|w_j| \le \theta) \le I(|\bar w_j| \le 2\theta) + I(|\bar w_j - w_j| \ge \theta).$$
The desired bound is a direct consequence of this result and the 2-norm triangle inequality
$$\Big(\sum_j (x_j + \Delta x_j)^2\Big)^{1/2} \le \Big(\sum_j x_j^2\Big)^{1/2} + \Big(\sum_j \Delta x_j^2\Big)^{1/2}. \quad \Box$$

Lemma 8. Define $F^{(\ell)} = \{j : |\hat w^{(\ell)}_j - \bar w_j| \ge \theta\}$. Under the conditions of Theorem 1, we have for all $s \ge 2\bar k$:
$$\|\hat w^{(\ell)} - \tilde w\|_2 \le \frac{5.7\lambda}{\rho_-(1.5\bar k + s)}\sqrt{|F^{(\ell-1)}|}, \quad\text{and}\quad \sqrt{|F^{(\ell)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k + s)}\sqrt{|F^{(\ell-1)}|}.$$

Proof. For all $t \in [0.5, 4/3)$, using Lemma 3, we know that the condition of the theorem implies that
$$\frac{\lambda}{\lambda - 2\|\hat\epsilon\|_\infty} \le \frac{7}{5} \le \frac{4-t}{4-3t}.$$
Moreover, Lemma 1 implies that the condition $0.5 \le t = 1 - \pi(1.5\bar k + s, s)(1.5\bar k)^{0.5}/s$ is also satisfied. This means that the conditions of Lemma 6 (with $\lambda_0 = \lambda_G = \lambda$) are satisfied. Now assume that at some $\ell \ge 1$, $|G^c_\ell| \le 1.5\bar k$, where
$$G_\ell = \{j \notin \bar F : \lambda^{(\ell-1)}_j = \lambda\}; \qquad (11)$$
then it is easy to verify that $G^c_\ell \setminus \bar F \subset F^{(\ell-1)}$.

Moreover, with the choice $G = G_\ell$ in Lemma 6 and Lemma 7, we can set $\lambda_0 = \lambda_G = \lambda$ and obtain (note also that $\hat\epsilon_{\bar F} = 0$):
$$\|\hat w^{(\ell)} - \tilde w\|_2 \le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[2\|\hat\epsilon_{G^c_\ell\setminus\bar F}\|_2 + \Big(\sum_{j\in\bar F}(\lambda^{(\ell-1)}_j)^2\Big)^{1/2}\right]$$
$$\le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[2\sqrt{|F^{(\ell-1)}\setminus\bar F|}\,\|\hat\epsilon\|_\infty + \sqrt{|F^{(\ell-1)}\cap\bar F|}\,\lambda\right]$$
$$\le \frac{1+\sqrt 3}{t\,\rho_-(1.5\bar k+s)}\left[(2/7)\sqrt{|F^{(\ell-1)}\setminus\bar F|} + \sqrt{|F^{(\ell-1)}\cap\bar F|}\right]\lambda$$
$$\le \frac{1+\sqrt 3}{0.5\,\rho_-(1.5\bar k+s)}\sqrt{1.082\,|F^{(\ell-1)}|}\,\lambda \le \frac{5.7\lambda}{\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|},$$
where the first inequality is due to Lemma 6; the second uses the facts that $G^c_\ell\setminus\bar F \subset F^{(\ell-1)}\setminus\bar F$ and Lemma 7 with $I(|\bar w_j| \le 2\theta) = 0$ (for all $j \in \bar F$); the third uses $2\|\hat\epsilon\|_\infty \le (2/7)\lambda$; and the fourth uses $(2/7)a + b \le \sqrt{1.082(a^2+b^2)}$.

Since Lemma 3 implies that
$$\|\tilde w - \bar w\|_\infty \le (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)},$$
we know that $j \in F^{(\ell)}$ implies
$$|\tilde w_j - \hat w^{(\ell)}_j| \ge \theta - (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)} \ge (41/42)\,\theta.$$
Therefore
$$\sqrt{|F^{(\ell)}|} \le (41\theta/42)^{-1}\|\tilde w - \hat w^{(\ell)}\|_2 \le \frac{5.7\lambda\,(42/41)}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|}.$$
That is, under assumption (11), the lemma holds at $\ell$. Therefore it remains to prove, by induction on $\ell$, that (11) holds for all $\ell = 1, 2, \ldots$. When $\ell = 1$, we have $G^c_1 = \bar F$, which implies that (11) holds. Now assume that (11) holds at some $\ell \ge 1$. Then by the induction hypothesis we know that the lemma holds at $\ell$. This means that
$$\sqrt{|G^c_{\ell+1}\setminus\bar F|} \le \sqrt{|F^{(\ell)}|} \le \frac{6\lambda}{\theta\,\rho_-(1.5\bar k+s)}\sqrt{|F^{(\ell-1)}|} \le \sqrt{0.5\,|F^{(\ell-1)}|} \le \cdots \le 0.5^{\ell/2}\sqrt{|F^{(0)}|}.$$
The first inequality is due to the fact that $G^c_{\ell+1}\setminus\bar F \subset F^{(\ell)}$; the second uses the assumption on $\theta$ in the theorem; the last uses induction.
Now note that $F^{(0)} = \bar F$; we thus have $|G^c_{\ell+1}\setminus\bar F| \le 0.5\bar k$. This completes the induction step. □

5.2 Proof of Theorem 1

Define
$$\beta = \frac{6\lambda}{\theta\,\rho_-(1.5\bar k + s)};$$
we have $\beta < 1$ by the assumption of the theorem. Using induction, we have from Lemma 8 that
$$\sqrt{|F^{(L)}|} \le \beta\sqrt{|F^{(L-1)}|} \le \cdots \le \beta^L\sqrt{|F^{(0)}|} \le \beta^L\sqrt{\bar k} < 1.$$
This means that when $\ell > L$, $|F^{(\ell-1)}| = 0$. Therefore, applying Lemma 8 again, we obtain $\|\hat w^{(\ell)} - \tilde w\|_2 = 0$. Since Lemma 3 implies that
$$\|\tilde w - \bar w\|_\infty \le (1/7)\lambda\big/\sqrt{\rho_+(1)\rho_-(\bar k)} < \theta,$$
we have $\mathrm{supp}(\tilde w) = \mathrm{supp}(\bar w)$. This implies that $\mathrm{supp}(\hat w^{(\ell)}) = \mathrm{supp}(\bar w)$. □

6 Discussion

This paper investigated the performance of multi-stage convex relaxation for feature selection, showing that under RIP the procedure achieves unbiased feature selection. This result complements that of [16], which studies the parameter estimation performance of multi-stage convex relaxation. It also complements similar results obtained in [17] and [13] for different computational procedures. One advantage of our result over that of [13] is that the multi-stage convex relaxation method is provably efficient, because the correct feature set is obtained after no more than $O(\ln\bar k)$ iterations. In comparison, a computational efficiency statement for the path-following method of [13] remains open.

References

[1] Peter Bickel, Ya'acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705-1732, 2009.
[2] Florentina Bunea, Alexandre Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169-194, 2007.
[3] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203-4215, 2005.
[4] Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, 2008.
[5] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.
[6] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, 2008.
[7] Karim Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90-102, 2008.
[8] Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34:1436-1462, 2006.
[9] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58:267-288, 1996.
[10] Sara A. van de Geer and Peter Buhlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[11] Martin Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.
[12] David P. Wipf and Srikantan Nagarajan. Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. Journal of Selected Topics in Signal Processing, 4(2):317-329, 2010.
[13] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894-942, 2010.
[14] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567-1594, 2008.
[15] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37(5A):2109-2144, 2009.
[16] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1087-1107, 2010.
[17] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57:4689-4708, 2011.
[18] Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, 2006.
[19] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, 2006.
[20] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-1533, 2008.
