On the $\ell_1$-$\ell_q$ Regularized Regression



Han Liu¹·², Jian Zhang³
¹Machine Learning Department, ²Statistics Department, Carnegie Mellon University, Pittsburgh, PA 15213
³Department of Statistics, Purdue University, West Lafayette, IN 47907

July 23, 2013

ABSTRACT

In this paper we consider the problem of grouped variable selection in high-dimensional regression using $\ell_1$-$\ell_q$ regularization ($1 \leq q \leq \infty$), which can be viewed as a natural generalization of the $\ell_1$-$\ell_2$ regularization (the group Lasso). The key condition is that the dimensionality $p_n$ can increase much faster than the sample size $n$, i.e. $p_n \gg n$ (in our case $p_n$ is the number of groups), but the number of relevant groups is small. The main conclusion is that many good properties from $\ell_1$-regularization (Lasso) naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \leq q \leq \infty$), even if the number of variables within each group also increases with the sample size. With fixed design, we show that the whole family of estimators is both estimation consistent and variable selection consistent under different conditions. We also show a persistency result with random design under a much weaker condition. These results provide a unified treatment for the whole family of estimators ranging from $q = 1$ (Lasso) to $q = \infty$ (iCAP), with $q = 2$ (group Lasso) as a special case. When there is no group structure available, all the analysis reduces to the current results for the Lasso estimator ($q = 1$).

Keywords: $\ell_1$-$\ell_q$ regularization, $\ell_1$-consistency, variable selection consistency, sparsity oracle inequalities, rates of convergence, Lasso, iCAP, group Lasso, simultaneous Lasso

I. Introduction

We consider the problem of recovering a high-dimensional vector $\beta^* \in \mathbb{R}^{m_n}$ using a sample of independent pairs $(X_{1\bullet}, Y_1), \ldots, (X_{n\bullet}, Y_n)$ from a multiple linear regression model,

$$Y = X\beta^* + \epsilon.$$

Here $Y$ is the $n \times 1$ response vector and $X$ represents the observed $n \times m_n$ design matrix whose $i$-th row vector is denoted by $X_{i\bullet}$; $\beta^*$ is the true unknown coefficient vector that we want to recover, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is an $n \times 1$ vector of i.i.d. noise with $\epsilon_i \sim N(0, \sigma^2)$.

In this paper we are interested in the situation where all the variables are naturally partitioned into $p_n$ groups. Grouped variables often appear in real-world applications. For example, in many data mining problems we encode categorical variables using a set of dummy variables, which as a result form a group. Another example is the additive model, where each component function can be represented by its basis expansion, which can be treated as a group. Suppose the number of variables in the $j$-th group is $d_j$; then by definition we have $m_n = \sum_{j=1}^{p_n} d_j$. We can rewrite the above linear model as

$$Y = X\beta^* + \epsilon = \sum_{j=1}^{p_n} X_j \beta_j^* + \epsilon \qquad (1.1)$$

where $X_j$ is the $n \times d_j$ submatrix corresponding to the $j$-th group (which could be either categorical or continuous) and $\beta_j^*$ is the corresponding $d_j \times 1$ coefficient subvector. Therefore we have $X = (X_1, \ldots, X_{p_n})$ and $\beta^* = (\beta_1^{*T}, \ldots, \beta_{p_n}^{*T})^T$. All predictors and the response variable are assumed to be centered at zero to simplify notation. Furthermore, when indexing individual variables we also use $X_j$ ($j = 1, \ldots, m_n$) to represent the $j$-th column of the design matrix $X$, with the meaning clear from context, and assume that all columns of the design matrix are standardized, i.e.
$$\frac{1}{n}\|X_j\|_{\ell_2}^2 = 1, \quad j = 1, \ldots, m_n.$$

Similar to the notation $X_j$, we use $\beta_j^*$ ($j = 1, \ldots, m_n$) to denote the $j$-th individual element of the vector $\beta^*$. Since we are mainly interested in the high-dimensional setting, we allow the number of groups $p_n$ to increase as the number of examples $n$ increases, and our results mainly focus on the case where $p_n \gg n$. Furthermore, we also allow the group size $d_j$ to increase with $n$ at a rate $d_j = o(n)$ and define $\bar{d}_n = \max_j d_j$ to be the upper bound of the group size for a fixed $n$. In the rest of the paper we suppress the subscript $n$ when there is no confusion.

In order to obtain a reliable estimate of $\beta^*$ when $p_n \gg n$, the key assumption is that the true coefficient vector $\beta^*$ is sparse. Denote by $S = \{j : \|\beta_j^*\|_{\ell_\infty} \neq 0,\ j = 1, \ldots, p_n\}$ the set of relevant group indices and let $s_n = |S|$ be the cardinality of $S$; we also denote by $\beta_S^*$ the vector concatenating all subvectors $\beta_j^*$ for $j \in S$. The sparsity assumption means that $s_n \ll p_n$. Therefore, even if $\beta^*$ has a very high dimension, the only effective part is $\beta_S^*$, while the remaining part $\beta_{S^c}^* = 0$. Our task is to select and recover the nonzero groups of variables corresponding to the index set $S$.

Sparsity has a long history of success in solving such high-dimensional problems. Without considering the group structure, there exist many classical methods for variable selection, such as AIC (Akaike, 1973), BIC (Schwarz, 1978), Mallows' $C_p$ (Mallows, 1973), etc. Although these methods have been proven to be theoretically sound and have been shown to perform well in practice, they are only computationally feasible when the number of variables is small. Recently, more attention has been focused on the $\ell_1$-regularized least squares (Lasso) estimator (Tibshirani, 1996; Chen et al., 1998), which is defined as

$$\hat{\beta}^{\lambda_n} = \arg\min_\beta \left\{ \frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n \|\beta\|_{\ell_1} \right\} \qquad (1.2)$$

where $\lambda_n$ is the regularization parameter for the $\ell_1$-norm of the coefficients $\beta$, and $\hat{\beta}^{\lambda_n}$ denotes the Lasso solution when $\lambda_n$ is used for regularization. In the following we suppress the superscript when no confusion is caused. The Lasso can be formulated as a quadratic programming problem and solved efficiently (Osborne et al., 2000; Efron et al., 2004). Its asymptotic properties for fixed dimensionality have been studied by Fu and Knight (2000). For the high-dimensional setting, Greenshtein and Ritov (2004) prove that the Lasso estimator is persistent, in the sense that, when constrained to a class, the predictive risk of the Lasso estimator converges in probability to the risk obtained by the oracle estimator. However, recent studies (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Zou, 2006) show that the Lasso estimator is not in general variable selection consistent, which means that in general the correct sparse subset of relevant variables cannot be identified even asymptotically. In particular, Zhao and Yu (2007) and Wainwright et al. (2006) show that in order for the Lasso to be variable selection consistent, the so-called irrepresentable condition has to be satisfied. Zou (2006) proposes the adaptive Lasso and shows that, by using adaptive weights for different variables, the $\ell_1$ penalty can lead to variable selection consistency.
In terms of estimation, Meinshausen and Yu (2006) show that under weaker conditions the Lasso estimator is $\ell_2$-consistent in the high-dimensional setting where the total number of variables can grow almost as fast as $\exp(n)$. Under a stronger assumption, Bunea et al. (2007a) further prove sparsity oracle inequalities for the Lasso estimator with fixed design, which bound the $\ell_2$-norm of the prediction error in terms of the number of nonzero components of the oracle vector. Such results can be applied to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators. Parallel to the fixed design result, a similar result for random design can be found in (Bunea et al., 2007b). A more recent result of Bickel et al. (2007) refines similar oracle inequalities under weaker assumptions. All these results show that for sparse linear models, the Lasso can overcome the curse of dimensionality even when facing increasing dimensions.

When variables are naturally grouped together, it is more meaningful to select variables at the group level instead of individually, as can be seen from the previous examples. A general strategy for grouped variable selection is to use block $\ell_1$-norm regularization. For the variables within each block (group), an $\ell_q$ norm is applied, and the different blocks are then combined by an $\ell_1$ norm (hence the name $\ell_1$-$\ell_q$ regularization). One such example is the group Lasso (Yuan and Lin, 2006), which is an extension of the Lasso for grouped variables and can be viewed as $\ell_1$-$\ell_2$ regularized regression. Other work related to grouped variable selection includes the iCAP estimator (Zhao et al., 2008), which can be viewed as $\ell_1$-$\ell_\infty$ regularized regression, and group logistic regression (Meier et al., 2007). Using random design, Meier et al. (2007) prove an estimation consistency result for the group Lasso with Lipschitz-type loss functions. Also with random design, Bach (2007) derives an irrepresentable condition similar to that of (Zhao and Yu, 2007) and proves a variable selection consistency result for the group Lasso. However, to the best of our knowledge, there are no corresponding results for the estimation and variable selection consistency of the group Lasso and iCAP estimators with fixed design, nor persistency results with random design. There is also no systematic theoretical treatment of the whole family of the more general $\ell_1$-$\ell_q$ regularized regression with $1 \leq q \leq \infty$.

Our work tries to bridge this gap and provide a unified treatment of $\ell_1$-$\ell_q$ regularized regression over the whole range from $q = 1$ to $q = \infty$. The main conclusion of our study is that many good properties of $\ell_1$-regularization (Lasso) naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \leq q \leq \infty$), even if the number of variables within each group increases with the sample size $n$. Using fixed design, under different conditions we show that the $\ell_1$-$\ell_q$ estimator is both estimation consistent and variable selection consistent; and if the linear model assumption does not hold, sparsity oracle inequalities for the prediction error can still be obtained under a weaker condition. Using random design, we show that a constrained form of the $\ell_1$-$\ell_q$ regression estimator is persistent.
Our results provide a simultaneous analysis of both the iCAP ($q = \infty$) and the group Lasso ($q = 2$) estimators. When there is no group structure, all the analysis naturally reduces to the current results for the Lasso estimator ($q = 1$). One interesting application of these results is to analyze the simultaneous Lasso estimator (Turlach et al., 2005), which can be viewed as $\ell_1$-$\ell_\infty$ regularized regression with a block design.

The rest of the paper is organized as follows. In Section 2 we first introduce some preliminaries of the $\ell_1$-$\ell_q$ regularized regression and then describe some characteristics of its solution. In Section 3 we study the variable selection consistency result. In Section 4 we study the estimation consistency and the sparsity oracle inequalities. In Section 5 we study the persistency property. We conclude with some discussion in Section 6.

II. ℓ1-ℓq Regularized Regression

Given the design matrix $X$ and the response vector $Y$, the $\ell_1$-$\ell_q$ regularized regression estimator is defined as the solution of the following convex optimization problem:

$$\hat{\beta}^{\lambda_n} = \arg\min_\beta \frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} \qquad (2.1)$$

where $\lambda_n$ is a positive number that penalizes complex models, and $q'$ is the conjugate exponent of $q$, satisfying $\frac{1}{q'} + \frac{1}{q} = 1$ (with the convention $\frac{1}{\infty} = 0$). The terms $(d_j)^{1/q'}$ adjust for the effect of different group sizes. It is easy to see that when $q = 1$ this reduces to the standard Lasso estimator; when $q = 2$ it reduces to the group Lasso estimator (Yuan and Lin, 2006); and when $q = \infty$ it reduces to the $\ell_1$-$\ell_\infty$ regularized regression estimator, or the iCAP estimator of (Zhao et al., 2008).

To characterize the solution to this problem, the following result can be obtained straightforwardly from the Karush-Kuhn-Tucker (KKT) optimality conditions for convex optimization.

Proposition 2.1. (KKT conditions) A vector $\hat{\beta} = (\hat{\beta}_1^T, \ldots, \hat{\beta}_p^T)^T \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, is an optimum of the objective function in (2.1) if and only if there exists a sequence of subgradients $\hat{g}_j \in \partial\|\hat{\beta}_j\|_{\ell_q}$ such that

$$\frac{1}{n} X_j^T \left(X\hat{\beta} - Y\right) + \lambda_n (d_j)^{1/q'} \hat{g}_j = 0. \qquad (2.2)$$

The subdifferential $\partial\|\hat{\beta}_j\|_{\ell_q}$ is the set of vectors $\hat{g}_j \in \mathbb{R}^{d_j}$ characterized as follows. If $1 < q < \infty$, then

$$\hat{g}_j \in \partial\|\hat{\beta}_j\|_{\ell_q} = \begin{cases} B_{q'}(1) & \text{if } \hat{\beta}_j = 0 \\ \left\{ \left( \dfrac{|\hat{\beta}_{j\ell}|^{q-1}\,\mathrm{sign}(\hat{\beta}_{j\ell})}{\|\hat{\beta}_j\|_{\ell_q}^{q-1}} \right)_{\ell=1}^{d_j} \right\} & \text{otherwise} \end{cases} \qquad (2.3)$$

where $B_{q'}(1)$ denotes the ball of radius 1 in the dual norm, with $1/q + 1/q' = 1$. It is easy to see that $\|\hat{g}_j\|_{\ell_{q'}} \leq 1$ for any $j$. If $q = \infty$, then

$$\hat{g}_j \in \partial\|\hat{\beta}_j\|_{\ell_\infty} = \begin{cases} B_1(1) & \text{if } \hat{\beta}_j = 0 \\ \mathrm{conv}\{\mathrm{sign}(\hat{\beta}_{j\ell})\, e_\ell : |\hat{\beta}_{j\ell}| = \|\hat{\beta}_j\|_{\ell_\infty}\} & \text{otherwise} \end{cases} \qquad (2.4)$$

where $\mathrm{conv}(A)$ denotes the convex hull of a set $A$ and $e_\ell$ the $\ell$-th canonical unit vector in $\mathbb{R}^{d_j}$. It is also easy to see that $\|\hat{g}_j\|_{\ell_{q'}} = \|\hat{g}_j\|_{\ell_1} \leq 1$ for all $j$ when $q = \infty$. If $q = 1$, then

$$\hat{g}_j \in \partial\|\hat{\beta}_j\|_{\ell_1} = \{\xi \in \mathbb{R}^{d_j} : \xi_\ell \in \partial|\cdot|(\hat{\beta}_{j\ell}),\ \ell = 1, \ldots, d_j\}. \qquad (2.5)$$

By Proposition 2.1, the $\ell_1$-$\ell_q$ regularized regression estimator can be solved efficiently even with large $n$ and $p_n$. For example, blockwise coordinate descent algorithms as in (Zhao et al., 2008) can easily be applied.
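To make the estimator concrete, here is a minimal proximal-gradient sketch for the $q = 2$ (group Lasso) instance of (2.1). This is our own illustration, not the algorithm of Zhao et al.; the helper name `l1_l2_prox_gradient` and its signature are assumptions. The blockwise update is the proximal operator (group soft-thresholding) of the weighted block norm, with the group weight $(d_j)^{1/q'} = \sqrt{d_j}$ for $q = 2$.

```python
import numpy as np

def l1_l2_prox_gradient(X, Y, groups, lam, n_iter=500):
    """Proximal-gradient sketch for the q = 2 (group Lasso) case of (2.1).

    groups : list of integer index arrays, one per group; the weight
    (d_j)^{1/q'} equals sqrt(d_j) when q = 2.
    """
    n, m = X.shape
    beta = np.zeros(m)
    # step size = 1/L, where L = ||X||_2^2 / n is the Lipschitz constant
    # of the gradient of the smooth part (1/(2n)) ||Y - X beta||^2
    t = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n       # gradient of the quadratic loss
        z = beta - t * grad
        for g in groups:                      # blockwise soft-thresholding:
            w = np.sqrt(len(g))               # prox of t * lam * w * ||.||_2
            norm_g = np.linalg.norm(z[g])
            beta[g] = max(0.0, 1.0 - t * lam * w / max(norm_g, 1e-12)) * z[g]
    return beta
```

Groups whose gradient-step norm falls below $t\lambda\sqrt{d_j}$ are set exactly to zero, which is how the block penalty produces group-level sparsity.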
When $q = 1$ and $q = \infty$, since the feasible parameters are constrained to lie within a polyhedral region with parallel level curves, efficient path algorithms can be developed (Efron et al., 2004; Zhao et al., 2008). At each iteration of the blockwise coordinate descent algorithm, $\beta_j$ for $j = 1, \ldots, p_n$ is updated with the rest of the coefficients fixed. Coupled with a thresholding operator, these algorithms generally converge very fast and an exact solution can be obtained. Standard optimization methods, such as interior-point methods (Boyd and Vandenberghe, 2004), can also be directly applied to solve the $\ell_1$-$\ell_q$ regularized regression problems.

It is well known (Osborne et al., 2000) that under some conditions the Lasso can select at most $n$ nonzero variables, even in the case $p_n \gg n$. A similar but weaker result can be obtained for the $\ell_1$-$\ell_q$ regularized regression.

Proposition 2.2. For the $\ell_1$-$\ell_q$ regularized regression problem defined in equation (2.1) with $\lambda_n > 0$, there exists a solution $\hat{\beta}^\lambda$ such that the number of nonzero groups $|S(\hat{\beta})|$ is upper bounded by $n$, the number of given data points, where $S(\hat{\beta}) = \{j : \hat{\beta}_j \neq 0\}$.

Remark 2.3. Notice that the solution to the $\ell_1$-$\ell_q$ regularized regression problem may not be unique, especially when $p_n \gg n$ (similar to the Lasso case), since the optimization problem might not be strictly convex. Consequently, there might exist other solutions that contain more than $n$ active groups. However, a compact solution $\hat{\beta}$ with $|S(\hat{\beta})| \leq n$ can always be obtained by following the easy and mechanical step described in the proof of Proposition 2.2.

Proof: From the KKT conditions in Proposition 2.1, we know that any solution $\hat{\beta}$ must satisfy, for $j = 1, \ldots, p_n$,

$$\frac{1}{n} X_j^T (Y - X\hat{\beta}) = \lambda_n (d_j)^{1/q'} g_j, \quad g_j \in \partial\|\hat{\beta}_j\|_{\ell_q}.$$

Now suppose there is a solution $\hat{\beta}$ with $s = |S(\hat{\beta})| > n$ active groups; in the following we show that we can always construct another solution $\tilde{\beta}$ with one less active group, i.e. $|S(\tilde{\beta})| = |S(\hat{\beta})| - 1$. Without loss of generality, assume that the first $s$ groups of variables in $\hat{\beta}$ are active, i.e. $\hat{\beta}_j \neq 0$ for $j = 1, \ldots, s$. Since

$$X\hat{\beta} = \sum_{j=1}^{s} X_j \hat{\beta}_j \in \mathbb{R}^{n \times 1}$$

and $s > n$, the set of vectors $X_1\hat{\beta}_1, \ldots, X_s\hat{\beta}_s$ is linearly dependent. Without loss of generality, assume $X_1\hat{\beta}_1 = \alpha_2 X_2\hat{\beta}_2 + \ldots + \alpha_s X_s\hat{\beta}_s$. Now define $\tilde{\beta}_j = 0$ for $j = 1$ and $j > s$, and $\tilde{\beta}_j = (1 + \alpha_j)\hat{\beta}_j$ for $j = 2, \ldots, s$; it is straightforward to check that $\tilde{\beta}$ satisfies the KKT conditions and thus is also a solution to the $\ell_1$-$\ell_q$ regularized regression problem in equation (2.1). The result follows by induction. □
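The elimination step in this proof is mechanical enough to state as code. The sketch below is our own hypothetical helper, not from the paper: it finds a null-space combination of the fitted group contributions and rescales the remaining active groups so that the fit $X\hat{\beta}$, and hence the KKT system, is unchanged while one group is deactivated.

```python
import numpy as np

def drop_one_active_group(X_blocks, beta_blocks, tol=1e-10):
    """One elimination step from the proof of Proposition 2.2 (a sketch).

    When more than n groups are active, the fitted contributions
    v_j = X_j beta_j (vectors in R^n) are linearly dependent, so some
    c with sum_j c_j v_j = 0 exists.  Zeroing a pivot group j0 and
    rescaling every other active group by (1 - c_j / c_{j0}) leaves
    X beta unchanged.
    """
    n = X_blocks[0].shape[0]
    active = [j for j, b in enumerate(beta_blocks) if np.linalg.norm(b) > tol]
    assert len(active) > n, "the construction applies only when #active > n"
    V = np.column_stack([X_blocks[j] @ beta_blocks[j] for j in active])
    # with more columns than rows, the last right singular vector lies in
    # the null space of V, i.e. V @ c is (numerically) zero
    _, _, Vt = np.linalg.svd(V)
    c = Vt[-1]
    pivot = int(np.argmax(np.abs(c)))          # position with largest |c_j|
    new_beta = [b.copy() for b in beta_blocks]
    for idx, j in enumerate(active):
        if idx == pivot:
            new_beta[j] = np.zeros_like(new_beta[j])   # deactivate the pivot
        else:
            new_beta[j] = (1.0 - c[idx] / c[pivot]) * new_beta[j]
    return new_beta
```

Iterating this step until at most $n$ groups remain active yields the compact solution described in Remark 2.3.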
The main objective of the paper is to investigate several important statistical properties of the $\ell_1$-$\ell_q$ estimator $\hat{\beta}$. We first give rough definitions of the properties we would like to establish; more details are given in the corresponding sections.

Definition 2.4. (Variable selection consistency) An estimator is said to be variable selection consistent if it recovers the correct sparsity pattern with probability going to 1. For grouped variable selection, $\hat{\beta}$ is said to be variable selection consistent if

$$P\left(S(\hat{\beta}) = S(\beta^*)\right) \to 1. \qquad (2.6)$$

Definition 2.5. ($\ell_1$-estimation consistency) An estimator is said to be $\ell_1$-estimation consistent if the $\ell_1$-norm of the difference between the estimator and the true parameter vector converges to 0 in probability, i.e. for all $\delta > 0$,

$$P\left(\|\hat{\beta} - \beta^*\|_{\ell_1} > \delta\right) \to 0. \qquad (2.7)$$

Definition 2.6. (Prediction error consistency) An estimator is said to be prediction error consistent if its prediction error, defined as $\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2$, converges to 0 in probability, i.e. for all $\delta > 0$,

$$P\left(\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 > \delta\right) \to 0. \qquad (2.8)$$

Definition 2.7. (Risk consistency or persistency) Here the true model $f^*(X)$ need not be linear. For the regression model with random design, $(X, Y) \sim F_n \in \mathcal{F}_n$, where $\mathcal{F}_n$ is a collection of distributions of i.i.d. $(m_n + 1)$-dimensional random vectors. Define the risk function under the distribution $F_n$ to be $R_{F_n}(\beta)$ (more details in Section 5). Given a sequence of sets of predictors $\mathcal{B}_n$, the sequence of estimators $\hat{\beta}_{\hat{F}_n} \in \mathcal{B}_n$ is called persistent if for every sequence $F_n \in \mathcal{F}_n$,

$$R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{F_n}(\beta_*^{F_n}) \xrightarrow{P} 0, \qquad (2.9)$$

where

$$\beta_*^{F_n} = \arg\min_{\beta \in \mathcal{B}_n} R_{F_n}(\beta). \qquad (2.10)$$

For the $\ell_1$-$\ell_q$ regularized regression we will later use $\mathcal{B}_n = \{\beta : \sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q} \leq L_n\}$ for some $L_n = o((n/\log n)^{1/4})$.

The following list gives a high-level summary of our main results, ordered from very stringent assumptions to much weaker ones:

- (R1) Variable selection consistency: $P\big(S(\hat{\beta}) = S(\beta^*)\big) \to 1$.
- (R2) $\ell_1$-estimation convergence rate: $\|\hat{\beta} - \beta^*\|_{\ell_1} = O_P\Big(s_n \bar{d}_n \sqrt{\frac{\log m_n}{n}}\Big)$.
- (R3) Prediction error convergence rate: $\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 = O_P\Big(\frac{s_n \bar{d}_n \log m_n}{n}\Big)$.
- (R3*) Prediction (misspecified model): $\frac{1}{n}\|\hat{Y} - f^*\|_{\ell_2}^2 = O_P\Big(\frac{s' \bar{d}_n \log m_n}{n}\Big)$.
- (R4) Persistency (misspecified model): $R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{F_n}(\beta_*^{F_n}) \xrightarrow{P} 0$.

Remark 2.8. (R1) to (R3) assume the true model is linear, while (R3*) and (R4) relax this condition so that the model can be misspecified. Even though (R3) and (R3*) look very similar, (R3*) drops the linear model assumption at the price of enforcing another "weak sparsity" condition. Also, (R1), (R2), (R3), and (R3*) are fixed design results, while (R4) is a random design result. In general, the condition for variable selection consistency is the strongest, since it involves not only certain relations among $n$, $\lambda_n$, $p_n$, $s_n$, $\bar{d}_n$, but also the minimum absolute value of the parameters, $\rho_n^* = \min_{j \in S}\|\beta_j^*\|_{\ell_\infty}$. The $\ell_1$-estimation consistency and prediction error consistency require weaker conditions than variable selection consistency. Unlike the previous properties, when the model is misspecified, the prediction error consistency in (R3*) follows from a sparsity oracle inequality. Since both the sparsity oracle inequalities and persistency do not require the existence of a true linear model, they are more general; in particular, persistency concerns the consistency of the predictive risk under random design and needs only a very weak assumption about the design.

III. Variable Selection Consistency

In this section we study the conditions under which the $\ell_1$-$\ell_q$ estimator is variable selection consistent. Our proof is adapted from (Wainwright, 2006) and (Ravikumar et al., 2007).
The former paper develops the "witness" proof idea, which is the main framework used in our proof. The latter mainly treats variable selection consistency for $q = 2$ in a nonparametric sparse additive model setting, which makes their conditions more stringent than ours even when $q = 2$.

In the following, let $S$ denote the true set of group indices $\{j : \beta_j^* \neq 0\}$, with $s_n = |S|$, and let $S^c$ denote its complement. Denote by $\Lambda_{\min}(C)$ the minimum eigenvalue of a matrix $C$. Then we have:

Theorem 3.1. Let $q$ and $q'$ be conjugate exponents, that is, $\frac{1}{q} + \frac{1}{q'} = 1$ and $1 \leq q, q' \leq \infty$. Suppose that the following conditions hold on the design matrix $X$:

$$\Lambda_{\min}\left(\frac{1}{n} X_S^T X_S\right) \geq C_{\min} > 0,$$

$$\max_{j \in S^c}\left\|(X_j^T X_S)(X_S^T X_S)^{-1}\right\|_{q',q'} \leq 1 - \delta, \quad \text{for some } 0 < \delta \leq 1, \qquad (3.1)$$

where $\|\cdot\|_{a,b}$ is the matrix norm defined as $\|A\|_{a,b} = \sup_x \frac{\|Ax\|_{\ell_b}}{\|x\|_{\ell_a}}$, $1 \leq a, b \leq \infty$. Assume the maximum number of variables within each group satisfies $\bar{d}_n \to \infty$ and $\bar{d}_n = o(n)$. Furthermore, suppose the following conditions hold, relating the regularization parameter $\lambda_n$ to the design parameters $n$, $p_n$, the number of relevant groups $s_n$, and the maximum group size $\bar{d}_n$:

$$\frac{\lambda_n^2 n}{\log((p_n - s_n)\bar{d}_n)} \longrightarrow \infty, \qquad (3.2)$$

$$\frac{1}{\rho_n^*}\left\{\sqrt{\frac{\log(s_n\bar{d}_n)}{n}} + \lambda_n(\bar{d}_n)^{1/q'}\left\|\left(\frac{1}{n}X_S^T X_S\right)^{-1}\right\|_{\infty,\infty}\right\} \longrightarrow 0, \qquad (3.3)$$

where $\rho_n^* = \min_{j \in S}\|\beta_j^*\|_{\ell_\infty}$. Then the $\ell_1$-$\ell_q$ regularized regression is variable selection consistent.

Remark 3.2. First, notice that the result established in Theorem 3.1 is a direct generalization of the variable selection result for the Lasso in (Wainwright, 2006), obtained by setting $q = 1$ and $\bar{d}_n = 1$ (the $\ell_1$-$\ell_q$ penalty then degenerates to the Lasso). The theorem gives sufficient conditions for exact recovery of the sparsity pattern of $\beta^*$ by the $\ell_1$-$\ell_q$ regularized regression. Also notice that when $\bar{d}_n$ is bounded from above, the conditions are almost the same as those for the Lasso, except the condition in equation (3.1), which depends on the value of $q$.

Second, we consider the case when $\rho_n^*$ is bounded away from zero. Assuming that $q = \infty$ and $\bar{d}_n = n^{1/5}$ (as in fitting an additive model with basis expansions), we must have $\lambda_n = o(n^{-1/5})$, and as a result of $\frac{\lambda_n^2 n}{\log((p_n - s_n)\bar{d}_n)} \to \infty$, we need $p_n = o(\exp(n^{3/5}))$. This means that even with increasing group size $\bar{d}_n$, the sparsity pattern (in terms of grouped variables) can still be correctly identified with a large $p_n$. Finally, when the minimum parameter value $\rho_n^* \to 0$, to ensure variable selection consistency it can converge to zero at most at a rate slower than $n^{-1/2}$.
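Although the theorem is asymptotic, the design conditions (3.1) can be inspected numerically for a concrete design when $q \in \{1, 2, \infty\}$, since the $(q', q')$ operator norm then has a closed form (maximum absolute row sum, spectral norm, and maximum absolute column sum, respectively). The following checker is our own illustration; the helper name and its interface are assumptions, not from the paper.

```python
import numpy as np

def check_design_conditions(X, groups, S, q):
    """Numerically inspect the design conditions in (3.1) (a sketch).

    groups : list of column-index arrays, one per group; S : iterable of
    relevant group indices.  Only q in {1, 2, inf} is handled, since the
    (q', q') operator norm then has a closed form.
    """
    n = X.shape[0]
    S = set(S)
    XS = X[:, np.concatenate([groups[j] for j in sorted(S)])]
    Sigma_SS = XS.T @ XS / n
    lam_min = np.linalg.eigvalsh(Sigma_SS).min()   # want >= C_min > 0

    def dual_op_norm(A):
        if q == 1:               # q' = inf: maximum absolute row sum
            return np.abs(A).sum(axis=1).max()
        if q == 2:               # q' = 2: spectral norm
            return np.linalg.norm(A, 2)
        if q == np.inf:          # q' = 1: maximum absolute column sum
            return np.abs(A).sum(axis=0).max()
        raise ValueError("only q in {1, 2, inf} handled here")

    Sigma_SS_inv = np.linalg.inv(Sigma_SS)
    irrep = max(dual_op_norm((X[:, groups[j]].T @ XS / n) @ Sigma_SS_inv)
                for j in range(len(groups)) if j not in S)
    return lam_min, irrep        # want irrep <= 1 - delta for some delta > 0
```

The $1/n$ factors cancel in $(X_j^T X_S)(X_S^T X_S)^{-1}$, so using the scaled Gram matrices, as above, gives the same quantity as in (3.1).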
Proof: Note that the special case $q = 1$ has already been proved in (Wainwright et al., 2006); here we only consider the case $1 < q \leq \infty$. A vector $\hat{\beta} \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, is an optimum of the objective function in (2.1) if and only if there exists a sequence of subgradients $\hat{g}_j \in \partial\|\hat{\beta}_j\|_{\ell_q}$ such that

$$\frac{1}{n} X_j^T\left(\sum_{k=1}^{p_n} X_k\hat{\beta}_k - Y\right) + \lambda_n(d_j)^{1/q'}\hat{g}_j = 0. \qquad (3.4)$$

The subdifferentials $\partial\|\hat{\beta}_j\|_{\ell_q}$ satisfy the KKT conditions in Proposition 2.1. Our argument closely follows the approach of Wainwright et al. (2006) in the linear case.

In particular, we proceed by a "witness" proof technique, showing the existence of a coefficient-subgradient pair $(\hat{\beta}, \hat{g})$ for which $\mathrm{supp}(\hat{\beta}) = \mathrm{supp}(\beta^*)$. To do so, we first set $\hat{\beta}_{S^c} = 0$ and set $\hat{g}_S$ to be the vector concatenating all the subvectors $\hat{g}_j$ for $j \in S$; we define $\hat{g}_{S^c}$ and $\hat{\beta}_S$ similarly. We then obtain $\hat{\beta}_S$ and $\hat{g}_{S^c}$ from the stationarity conditions in (3.4). By showing that, with high probability,

$$\hat{\beta}_j \neq 0 \quad \text{for } j \in S, \qquad (3.5)$$

$$\hat{g}_j \in B_{q'}(1) \quad \text{for } j \in S^c, \qquad (3.6)$$

we demonstrate that with high probability there exists an optimal solution to the optimization problem in (2.1) that has the same sparsity pattern as the true model. Setting $\hat{\beta}_{S^c} = 0$ and

$$\hat{g}_j = \begin{cases} \left\{\left(\dfrac{|\hat{\beta}_{j\ell}|^{q-1}\,\mathrm{sign}(\hat{\beta}_{j\ell})}{\|\hat{\beta}_j\|_{\ell_q}^{q-1}}\right)_{\ell=1}^{d_j}\right\} & 1 < q < \infty \\ \mathrm{conv}\{\mathrm{sign}(\hat{\beta}_{j\ell})\,e_\ell : |\hat{\beta}_{j\ell}| = \|\hat{\beta}_j\|_{\ell_\infty}\} & q = \infty \end{cases} \qquad (3.7)$$

for $j \in S$, denote $W = \mathrm{diag}((d_1)^{1/q'} I_{d_1}, \ldots, (d_p)^{1/q'} I_{d_p})$, where $I_{d_j}$ is the $d_j \times d_j$ identity matrix. We define $W_S$ to be the submatrix of $W$ obtained by extracting the rows and columns corresponding to the group index set $S$. The stationarity condition for $\hat{\beta}_S$ is

$$\frac{1}{n} X_S^T(X_S\hat{\beta}_S - Y) + \lambda_n W_S\hat{g}_S = 0. \qquad (3.8)$$

Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$; then the stationarity condition can be written as

$$\frac{1}{n} X_S^T X_S(\hat{\beta}_S - \beta_S^*) - \frac{1}{n} X_S^T\epsilon + \lambda_n W_S\hat{g}_S = 0 \qquad (3.9)$$

or

$$\hat{\beta}_S - \beta_S^* = \left(\frac{1}{n} X_S^T X_S\right)^{-1}\left(\frac{1}{n} X_S^T\epsilon - \lambda_n W_S\hat{g}_S\right) \qquad (3.10)$$

assuming that $\frac{1}{n} X_S^T X_S$ is nonsingular. Recalling our definition

$$\rho_n^* = \min_{j \in S}\|\beta_j^*\|_{\ell_\infty} > 0, \qquad (3.11)$$

it suffices to show that

$$\|\hat{\beta}_S - \beta_S^*\|_{\ell_\infty} < \frac{\rho_n^*}{2} \qquad (3.12)$$

in order to ensure that $\mathrm{supp}(\beta_S^*) = \mathrm{supp}(\hat{\beta}_S) = \{j : \|\hat{\beta}_j\|_{\ell_\infty} \neq 0\}$. Writing $\Sigma_{SS} = \frac{1}{n} X_S^T X_S$ to simplify notation, we have the $\ell_\infty$ bound

$$\|\hat{\beta}_S - \beta_S^*\|_{\ell_\infty} \leq \left\|\Sigma_{SS}^{-1}\left(\frac{1}{n} X_S^T\epsilon\right)\right\|_{\ell_\infty} + \lambda_n\left\|\Sigma_{SS}^{-1} W_S\hat{g}_S\right\|_{\ell_\infty}. \qquad (3.13)$$

We now proceed to bound the quantities above. First note that for $j \in S$, $\|\hat{g}_j\|_{\ell_{q'}} = 1$. Therefore, since

$$\|\hat{g}_S\|_{\ell_\infty} = \max_{j \in S}\|\hat{g}_j\|_{\ell_\infty} \leq \max_{j \in S}\|\hat{g}_j\|_{\ell_{q'}} = 1, \qquad (3.14)$$

we have that

$$\left\|\Sigma_{SS}^{-1} W_S\hat{g}_S\right\|_{\ell_\infty} \leq (\bar{d}_n)^{1/q'}\left\|\Sigma_{SS}^{-1}\right\|_{\infty,\infty}. \qquad (3.15)$$

Therefore

$$\|\hat{\beta}_S - \beta_S^*\|_{\ell_\infty} \leq \left\|\Sigma_{SS}^{-1}\left(\frac{1}{n} X_S^T\epsilon\right)\right\|_{\ell_\infty} + \lambda_n(\bar{d}_n)^{1/q'}\left\|\Sigma_{SS}^{-1}\right\|_{\infty,\infty}.$$

Finally, consider $Z = \Sigma_{SS}^{-1}\left(\frac{1}{n} X_S^T\epsilon\right)$. Note that $\epsilon \sim N(0, \sigma^2 I)$, so that $Z$ is Gaussian as well, with mean zero. Consider its $\ell$-th component, $Z_\ell = e_\ell^T Z$. Then $E[Z_\ell] = 0$ and

$$\mathrm{Var}(Z_\ell) = \frac{\sigma^2}{n} e_\ell^T\Sigma_{SS}^{-1} e_\ell \leq \frac{\sigma^2}{n C_{\min}}. \qquad (3.16)$$

By the comparison results on Gaussian maxima (Ledoux and Talagrand, 1991), we then have

$$E[\|Z\|_{\ell_\infty}] \leq 3\sqrt{\log(s\bar{d}_n)}\,\max_\ell\sqrt{\mathrm{Var}(Z_\ell)} \leq 3\sigma\sqrt{\frac{\log(s\bar{d}_n)}{n C_{\min}}}. \qquad (3.17)$$

An application of Markov's inequality then gives

$$P\left(\|\hat{\beta}_S - \beta_S^*\|_{\ell_\infty} > \frac{\rho_n^*}{2}\right) \leq P\left(\|Z\|_{\ell_\infty} + \lambda_n(\bar{d}_n)^{1/q'}\|\Sigma_{SS}^{-1}\|_{\infty,\infty} > \frac{\rho_n^*}{2}\right)$$

$$\leq \frac{2}{\rho_n^*}\left\{E[\|Z\|_{\ell_\infty}] + \lambda_n(\bar{d}_n)^{1/q'}\|\Sigma_{SS}^{-1}\|_{\infty,\infty}\right\} \qquad (3.18)$$

$$\leq \frac{2}{\rho_n^*}\left\{3\sigma\sqrt{\frac{\log(s\bar{d}_n)}{n C_{\min}}} + \lambda_n(\bar{d}_n)^{1/q'}\|\Sigma_{SS}^{-1}\|_{\infty,\infty}\right\} \qquad (3.19)$$

which converges to zero under the condition that

$$\frac{1}{\rho_n^*}\left\{\sqrt{\frac{\log(s\bar{d}_n)}{n}} + \lambda_n(\bar{d}_n)^{1/q'}\|\Sigma_{SS}^{-1}\|_{\infty,\infty}\right\} \longrightarrow 0. \qquad (3.20)$$

We now analyze $\hat{g}_{S^c}$. Recall that we have set $\hat{\beta}_{S^c} = \beta_{S^c}^* = 0$.
The stationarity condition for $j \in S^c$ is thus given by

$$\frac{1}{n} X_j^T\left(X_S\hat{\beta}_S - X_S\beta_S^* - \epsilon\right) + \lambda_n(d_j)^{1/q'}\hat{g}_j = 0. \qquad (3.21)$$

Therefore,

$$\hat{g}_{S^c} = \frac{W_{S^c}^{-1}}{\lambda_n}\left\{\frac{1}{n} X_{S^c}^T X_S(\beta_S^* - \hat{\beta}_S) + \frac{1}{n} X_{S^c}^T\epsilon\right\} = \frac{W_{S^c}^{-1}}{\lambda_n}\left\{\frac{1}{n} X_{S^c}^T X_S\left(\frac{1}{n} X_S^T X_S\right)^{-1}\left(\lambda_n W_S\hat{g}_S - \frac{1}{n} X_S^T\epsilon\right) + \frac{1}{n} X_{S^c}^T\epsilon\right\}$$

$$= \frac{W_{S^c}^{-1}}{\lambda_n}\left\{\Sigma_{S^c S}\Sigma_{SS}^{-1}\left(\lambda_n W_S\hat{g}_S - \frac{1}{n} X_S^T\epsilon\right) + \frac{1}{n} X_{S^c}^T\epsilon\right\} \qquad (3.22)$$

from equation (3.10). We want to show that

$$\hat{g}_j \in B_{q'}(1) \qquad (3.23)$$

for all $j \in S^c$. From (3.22) we see that $\hat{g}_j$ is Gaussian, with mean

$$\mu_j = E(\hat{g}_j) = (d_j)^{-1/q'}\Sigma_{jS}\Sigma_{SS}^{-1} W_S\hat{g}_S. \qquad (3.24)$$

We then obtain the bound

$$\|\mu_j\|_{\ell_{q'}} \leq \left\|\Sigma_{jS}\Sigma_{SS}^{-1}\right\|_{q',q'}\|\hat{g}_S\|_{\ell_{q'}} = \left\|\Sigma_{jS}\Sigma_{SS}^{-1}\right\|_{q',q'} \leq 1 - \delta \quad \text{for some } \delta > 0.$$

It therefore suffices to show that

$$P\left(\max_{j \in S^c}(d_j)^{1/q'}\|\hat{g}_j - \mu_j\|_{\ell_\infty} > \frac{\delta}{2}\right) \longrightarrow 0 \qquad (3.25)$$

since this implies that

$$\|\hat{g}_j\|_{\ell_{q'}} \leq \|\mu_j\|_{\ell_{q'}} + \|\hat{g}_j - \mu_j\|_{\ell_{q'}} \qquad (3.26)$$

$$\leq \|\mu_j\|_{\ell_{q'}} + (d_j)^{1/q'}\|\hat{g}_j - \mu_j\|_{\ell_\infty} \qquad (3.27)$$

$$\leq (1 - \delta) + \frac{\delta}{2} + o(1) \qquad (3.28)$$

with probability approaching one. To show (3.25), we again appeal to comparison results for Gaussian maxima. Define

$$Z_j = (d_j)^{1/q'}\lambda_n(\hat{g}_j - \mu_j) = X_j^T\left(I - X_S(X_S^T X_S)^{-1} X_S^T\right)\frac{\epsilon}{n} \qquad (3.29)$$

for $j \in S^c$. Then the $Z_j$ are zero-mean Gaussian random vectors, and we need to show that

$$P\left(\max_{j \in S^c}\frac{\|Z_j\|_{\ell_\infty}}{\lambda_n} \geq \frac{\delta}{2}\right) \longrightarrow 0. \qquad (3.30)$$

Let $Z_{jk}$ represent the $k$-th element of $Z_j$ for $j \in S^c$. A calculation shows that $E(Z_{jk}^2) \leq \frac{\sigma^2}{n}$. Therefore, by Markov's inequality and the comparison results for Gaussian maxima,

$$P\left(\max_{j \in S^c}\frac{\|Z_j\|_{\ell_\infty}}{\lambda_n} \geq \frac{\delta}{2}\right) \leq \frac{2}{\delta\lambda_n} E\left[\max_{j \in S^c, k}|Z_{jk}|\right] \leq \frac{2}{\delta\lambda_n}\left\{3\sqrt{\log((p_n - s_n)\bar{d}_n)}\max_{j \in S^c, k}\sqrt{E(Z_{jk}^2)}\right\} \qquad (3.31)$$

$$\leq \frac{6\sigma}{\delta\lambda_n}\sqrt{\frac{\log((p_n - s_n)\bar{d}_n)}{n}} \qquad (3.32)$$

which converges to zero under the condition that

$$\frac{\lambda_n^2 n}{\log((p_n - s_n)\bar{d}_n)} \longrightarrow \infty. \qquad (3.33)$$

This is just the condition in the statement of the theorem. □

IV. Estimation Consistency

In this section we prove the estimation consistency results under two types of assumptions: (i) when the model is correctly specified, i.e. the true model is linear, we achieve $\ell_1$-consistency and derive the optimal rate of convergence for the prediction error; (ii) when the model is misspecified, i.e. the true model is not linear, we can still achieve a sparsity oracle inequality, which bounds the prediction error in terms of the loss of the prediction oracle and the number of nonzero groups involved. Under a "weak sparsity" condition, we can still obtain a rate of convergence for the prediction error similar to that obtained under the linear model assumption.

We begin with a technical lemma, which is essentially Lemma 1 of (Bunea et al., 2007a) and (Bickel et al., 2007), extended to handle group structures in the more general $\ell_1$-$\ell_q$ regularized regression setting.

Lemma 4.1. Let $\epsilon_1, \ldots, \epsilon_n$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$, and let $\hat{Y} = X\hat{\beta}$ be the $\ell_1$-$\ell_q$ regularized regression estimator with $1 \leq q \leq \infty$ as in (2.1), with

$$\lambda_n = A\sigma\sqrt{\frac{\log m_n}{n}} \qquad (4.1)$$

for some $A > 2\sqrt{2}$.
Then, for all $m_n \geq 2$, $n > 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have, simultaneously for all $\beta \in \mathbb{R}^{m_n}$:

$$\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq \frac{1}{n}\|X\beta - X\beta^*\|_{\ell_2}^2 + 4\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \qquad (4.2)$$

where $S(\beta)$ denotes the set of nonzero group indices of $\beta$.

Proof: By the definition of $\hat{Y} = X\hat{\beta}$, we have

$$\frac{1}{2n}\|Y - X\hat{\beta}\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q} \leq \frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q}$$

for all $\beta \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, which we may rewrite as

$$\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_{\ell_2}^2 + 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q} \leq \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q} + \frac{2}{n}\epsilon^T X(\hat{\beta} - \beta). \qquad (4.3)$$

For each $j = 1, \ldots, m_n$, we define the random variables $V_j = \frac{1}{n}X_j^T\epsilon$ and the event

$$\mathcal{A} = \bigcap_{j=1}^{m_n}\left\{2|V_j| \leq \lambda_n\right\}.$$

Under the normality assumption, we have that

$$\sqrt{n}\,V_j \sim N(0, \sigma^2), \quad j = 1, \ldots, m_n. \qquad (4.4)$$

Using the elementary bound on the tails of a Gaussian distribution, we find that the probability of the complementary event $\mathcal{A}^c$ satisfies

$$P\{\mathcal{A}^c\} \leq \sum_{j=1}^{m_n} P\{\sqrt{n}|V_j| > \sqrt{n}\lambda_n/2\} \leq m_n P\{|Z| \geq \sqrt{n}\lambda_n/(2\sigma)\} \qquad (4.5)$$

$$\leq m_n\exp\left(-\frac{n\lambda_n^2}{8\sigma^2}\right) = m_n\exp\left(-\frac{A^2\log m_n}{8}\right) = m_n^{1 - A^2/8} \qquad (4.6)$$

where $Z \sim N(0, 1)$. Then, on the set $\mathcal{A}$, we have

$$\frac{2}{n}\epsilon^T X(\hat{\beta} - \beta) = 2\sum_{j=1}^{m_n}V_j(\hat{\beta}_j - \beta_j) \leq \sum_{j=1}^{m_n}\lambda_n|\hat{\beta}_j - \beta_j| \leq \sum_{j=1}^{p_n}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}$$

and therefore, still on the set $\mathcal{A}$,

$$\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_{\ell_2}^2 \leq \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q} + \sum_{j=1}^{p_n}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} - 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q}. \qquad (4.7)$$

Adding the term $\sum_{j=1}^{p_n}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}$ to both sides, we obtain

$$\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q} + 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} - 2\lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q}. \qquad (4.8)$$

Recall that $S(\beta)$ is the set of nonzero group indices of $\beta$. Rewriting the right-hand side of the previous display, on the set $\mathcal{A}$,

$$\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}$$

$$\leq \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\left\{\sum_{j=1}^{p_n}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} - \sum_{j \notin S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q}\right\} + 2\left\{\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\beta_j\|_{\ell_q} - \sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j\|_{\ell_q}\right\}$$

$$\leq \frac{1}{n}\|X\beta - X\beta^*\|_{\ell_2}^2 + 4\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}$$

by the triangle inequality and the fact that $\beta_j = 0$ for $j \notin S(\beta)$. □

A. Estimation Consistency Under the Linear Model Assumption

Assuming the true model is linear, to obtain the $\ell_1$-consistency result a key assumption on the design matrix is needed, stated as follows.

Assumption 1. Recall that $s_n = |S(\beta^*)|$. Assume that

$$\kappa \equiv \min_{S_0 \subset \{1,\ldots,p\}:\, |S_0| \leq s_n}\ \min_{\substack{\gamma \neq 0:\ \sum_{j \in S_0^c}(d_j)^{1/q'}\|\gamma_j\|_{\ell_q} \leq 3\sum_{j \in S_0}(d_j)^{1/q'}\|\gamma_j\|_{\ell_q}}}\ \frac{\|X\gamma\|_{\ell_2}}{\sqrt{n}\sqrt{\sum_{j \in S_0}(d_j)^{2/q'-1}\|\gamma_j\|_{\ell_q}^2}} > 0, \qquad (4.9)$$

where the inner minimum is over all $\gamma \in \mathbb{R}^{m_n}$ satisfying the stated cone constraint.

Remark 4.2. Before proving the following theorem, we pause to make some comments about this assumption.
First, for $q = 1$ (thus $q' = \infty$), this assumption is very similar to the restricted eigenvalue assumption of (Bickel et al., 2007), which is defined as

$$\kappa \equiv \min_{S_0 \subset \{1,\ldots,p\}:\, |S_0| \leq s_n}\ \min_{\sum_{j \in S_0^c}\|\gamma_j\|_{\ell_1} \leq 3\sum_{j \in S_0}\|\gamma_j\|_{\ell_1}}\ \frac{\|X\gamma\|_{\ell_2}}{\sqrt{n}\sqrt{\sum_{j \in S_0}\|\gamma_j\|_{\ell_2}^2}} > 0. \qquad (4.10)$$

However, our assumption is slightly weaker, due to the fact that for any $\gamma_j \in \mathbb{R}^{d_j}$,

$$\|\gamma_j\|_{\ell_1}^2 \leq d_j\|\gamma_j\|_{\ell_2}^2. \qquad (4.11)$$

Second, the quantity $\sqrt{\sum_{j \in S_0}(d_j)^{2/q'-1}\|\gamma_j\|_{\ell_q}^2}$ in our assumption balances between $q = 1$ and $q = \infty$. For example, when $q = 1$, $\|\gamma_j\|_{\ell_1}^2$ is relatively large but $(d_j)^{2/q'-1} = (d_j)^{-1}$ is very small; while for $q = \infty$, $\|\gamma_j\|_{\ell_q}^2 = \|\gamma_j\|_{\ell_\infty}^2$ is relatively small but $(d_j)^{2/q'-1} = d_j$ is very significant. In this sense $q = 2$ seems the most balanced choice, due to the fact that

$$\sum_{j \in S_0}(d_j)^{-1}\|\gamma_j\|_{\ell_1}^2 \leq \sum_{j \in S_0}\|\gamma_j\|_{\ell_2}^2 \leq \sum_{j \in S_0}d_j\|\gamma_j\|_{\ell_\infty}^2. \qquad (4.12)$$

Therefore, among $q = 1, 2, \infty$, the choice $q = 2$ needs the weakest assumption; this provides some insight into why the group Lasso may be a suitable choice for grouped variable selection. However, we need to be more cautious in claiming which value of $q$ is best, since in real applications the choice of $q$ may depend on the true relevant coefficients $\beta_S^*$. If the components in the relevant groups are of the same order of magnitude, $q = \infty$ may be more suitable; on the contrary, if some relevant coefficients are very small relative to the others, $q = 1$ may be better. We plan to investigate this issue in a separate paper.

Theorem 4.3. (Estimation consistency under the linear model assumption) Under Assumption 1, let $\epsilon_1, \ldots, \epsilon_n$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Consider the $\ell_1$-$\ell_q$ regularized estimator defined by (2.1) with

$$\lambda_n = A\sigma\sqrt{\frac{\log m_n}{n}} \qquad (4.13)$$

for some $A > 2\sqrt{2}$. Then, for all $n \geq 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have

$$\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 \leq \frac{9A^2\sigma^2}{\kappa^2}\,\frac{s_n\bar{d}_n\log m_n}{n}, \qquad (4.14)$$

$$\|\hat{\beta} - \beta^*\|_{\ell_1} \leq \frac{12A\sigma\, s_n\bar{d}_n}{\kappa^2}\sqrt{\frac{\log m_n}{n}}. \qquad (4.15)$$

Remark 4.4. From this theorem we obtain $\ell_1$-consistency and the corresponding rate of convergence. Due to the fact that $\|\gamma\|_{\ell_q} \leq \|\gamma\|_{\ell_1}$ for all $1 < q \leq \infty$, we also obtain $\ell_q$-consistency if

$$s_n\bar{d}_n\sqrt{\frac{\log m_n}{n}} \to 0.$$

For the rate of convergence in $\ell_2$, a direct consequence is

$$\|\hat{\beta} - \beta^*\|_{\ell_2}^2 \leq \frac{144A^2\sigma^2 s_n^2\bar{d}_n^2}{\kappa^4}\,\frac{\log m_n}{n}, \qquad (4.16)$$

which is suboptimal. Recall that $\|\hat{\beta} - \beta^*\|_{\ell_1}^2 \leq p_n\bar{d}_n\|\hat{\beta} - \beta^*\|_{\ell_2}^2$; if $|S(\hat{\beta})|$ is $O(s_n)$ and the elements of $\hat{\beta}_j - \beta_j^*$ are balanced for $j \in S$, then we can also achieve the optimal rate of convergence in the $\ell_2$-norm. How to obtain the optimal rate of convergence for $\ell_q$-consistency for general $q$ would be interesting future work.

Proof: From equation (4.2) with $\beta = \beta^*$, we have on the event $\mathcal{A}$ (using the Cauchy-Schwarz inequality for the second step),

$$\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 \leq 3\sum_{j \in S(\beta^*)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} \leq 3\lambda_n\sqrt{\bar{d}_n s_n}\sqrt{\sum_{j \in S(\beta^*)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q}^2} \qquad (4.17)$$

and

$$\sum_{j \in S(\beta^*)^c}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} \leq 3\sum_{j \in S(\beta^*)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q}. \qquad (4.18)$$
By the last equation, the cone constraint of Assumption 1 holds on the event $\mathcal{A}$ with $\gamma = \hat{\beta} - \beta^*$ and $S_0 = S(\beta^*)$; by this assumption we then have

$$\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 \geq \kappa^2\sum_{j \in S(\beta^*)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q}^2. \qquad (4.19)$$

Combining the above inequalities, we get

$$\frac{1}{n}\|\hat{Y} - X\beta^*\|_{\ell_2}^2 \leq \frac{9\lambda_n^2 s_n\bar{d}_n}{\kappa^2} \qquad (4.20)$$

and

$$\sqrt{\sum_{j \in S(\beta^*)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q}^2} \leq \frac{3\lambda_n\sqrt{\bar{d}_n s_n}}{\kappa^2}. \qquad (4.21)$$

Thus we have

$$\|\hat{\beta} - \beta^*\|_{\ell_1} = \sum_{j=1}^{p_n}\|\hat{\beta}_j - \beta_j^*\|_{\ell_1} \leq \sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} \qquad (4.22)$$

$$= \sum_{j \in S(\beta^*)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} + \sum_{j \in S(\beta^*)^c}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} \qquad (4.23)$$

$$\leq 4\sum_{j \in S(\beta^*)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q} \leq 4\sqrt{\bar{d}_n s_n}\sqrt{\sum_{j \in S(\beta^*)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j^*\|_{\ell_q}^2} \qquad (4.24)$$

$$\leq \frac{12\lambda_n\bar{d}_n s_n}{\kappa^2} = \frac{12A\sigma\, s_n\bar{d}_n}{\kappa^2}\sqrt{\frac{\log m_n}{n}}. \qquad (4.25)$$

Note that equation (4.20), with $\lambda_n = A\sigma\sqrt{\log m_n / n}$, is exactly equation (4.14). □

B. Oracle Inequalities for Prediction Error Under Misspecified Models

Assume the true regression function $f^*(X)$ is not linear, i.e. the model is misspecified. We can no longer obtain the optimal rate of convergence directly, but we can still obtain a sparsity oracle inequality, which bounds the prediction error in terms of the number of nonzero components of the prediction oracle.

Assumption 2. Let $s'$ be an integer such that $1 \leq s' \leq p_n$ and $\delta$ some positive number. Then, for any $\gamma \neq 0$,

$$\kappa(s', \delta) \equiv \min_{S_0 \subset \{1,\ldots,p\}:\, |S_0| \leq s'}\ \min_{\sum_{j \in S_0^c}(d_j)^{1/q'}\|\gamma_j\|_{\ell_q} \leq (2 + 3/\delta)\sum_{j \in S_0}(d_j)^{1/q'}\|\gamma_j\|_{\ell_q}}\ \frac{\|X\gamma\|_{\ell_2}}{\sqrt{n}\sqrt{\sum_{j \in S_0}(d_j)^{2/q'-1}\|\gamma_j\|_{\ell_q}^2}} > 0.$$

Theorem 4.5. Under Assumption 2, let $\epsilon_1, \ldots, \epsilon_n$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Consider the $\ell_1$-$\ell_q$ regularized estimator defined by (2.1) with

$$\lambda_n = A\sigma\sqrt{\frac{\log m_n}{n}} \qquad (4.26)$$

for some $A > 2\sqrt{2}$. Then, for all $n \geq 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have

$$\frac{1}{n}\|f^* - X\hat{\beta}\|_{\ell_2}^2 \leq (1 + \delta)\inf_{\beta \in \mathbb{R}^{m_n}:\, |S(\beta)| \leq s'}\left\{\frac{1}{n}\|f^* - X\beta\|_{\ell_2}^2 + \frac{C(\delta)A^2\sigma^2}{\kappa(s',\delta)^2}\,\frac{\bar{d}_n|S(\beta)|\log m_n}{n}\right\} \qquad (4.27)$$

where $C(\delta) > 0$ is a constant depending only on $\delta$, and $|S(\beta)|$ is the number of nonzero groups, i.e. the cardinality of the set $S(\beta)$.

Remark 4.6. From this sparsity oracle inequality, if we add the assumption that there exists some $\beta'$ with $\frac{1}{n}\|f^* - X\beta'\|_{\ell_2}^2 \to 0$, then we can still obtain prediction error consistency provided $\frac{\bar{d}_n|S(\beta')|\log m_n}{n} \to 0$. If we also want a convergence rate similar to that of Theorem 4.3, more conditions are needed, as shown in Corollary 4.8.

Proof: Fix an arbitrary $\beta \in \mathbb{R}^{m_n}$ with $|S(\beta)| \leq s'$. On the event $\mathcal{A}$, we get from Lemma 4.1 that

$$\frac{1}{n}\|\hat{Y} - f^*\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 4\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}. \qquad (4.28)$$

From the above, we get

$$\frac{1}{n}\|\hat{Y} - f^*\|_{\ell_2}^2 \leq \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n\sum_{j \in S(\beta)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \qquad (4.29)$$

$$\leq \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n\sqrt{\bar{d}_n|S(\beta)|}\sqrt{\sum_{j \in S(\beta)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j\|_{\ell_q}^2}. \qquad (4.30)$$

Consider separately the cases where

$$3\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq \frac{\delta}{n}\|X\beta - f^*\|_{\ell_2}^2 \qquad (4.31)$$

and

$$3\sum_{j \in S(\beta)}\lambda_n(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} > \frac{\delta}{n}\|X\beta - f^*\|_{\ell_2}^2. \qquad (4.32)$$
In case (4.31), the result of the theorem follows trivially from equation (4.28), so we only consider case (4.32). All the subsequent inequalities are valid on the event $\mathcal{A} \cap \mathcal{A}_1$, where $\mathcal{A}_1$ is defined by (4.32). On this event, we get from (4.28) that

$$\sum_{j=1}^{p_n}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq 3\left(1 + \frac{1}{\delta}\right)\sum_{j \in S(\beta)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \qquad (4.33)$$

which further implies that

$$\sum_{j \in S(\beta)^c}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q} \leq \left(2 + \frac{3}{\delta}\right)\sum_{j \in S(\beta)}(d_j)^{1/q'}\|\hat{\beta}_j - \beta_j\|_{\ell_q}. \qquad (4.34)$$

By Assumption 2, we have

$$\kappa(s',\delta)\sqrt{\sum_{j \in S(\beta)}(d_j)^{2/q'-1}\|\hat{\beta}_j - \beta_j\|_{\ell_q}^2} \leq \sqrt{\frac{1}{n}\|X(\hat{\beta} - \beta)\|_{\ell_2}^2} = \frac{1}{\sqrt{n}}\|\hat{Y} - X\beta\|_{\ell_2}. \qquad (4.35)$$

Combining this with (4.30), we get

$$\frac{1}{n}\|\hat{Y} - f^*\|_{\ell_2}^2 \leq \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n\kappa^{-1}(s',\delta)\sqrt{\bar{d}_n|S(\beta)|}\left(\frac{1}{\sqrt{n}}\|\hat{Y} - X\beta\|_{\ell_2}\right) \qquad (4.36)$$

$$\leq \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 4\lambda_n\kappa^{-1}(s',\delta)\sqrt{\bar{d}_n|S(\beta)|}\left(\frac{1}{\sqrt{n}}\|\hat{Y} - f^*\|_{\ell_2} + \frac{1}{\sqrt{n}}\|X\beta - f^*\|_{\ell_2}\right). \qquad (4.37)$$

This inequality is of the same form as (A.4) in (Bunea et al., 2007a). A standard decoupling argument as in (Bunea et al., 2007a), using the inequality $2xy \leq \frac{x^2}{b} + by^2$ with $b > 1$, $x = \lambda_n\kappa^{-1}(s',\delta)\sqrt{\bar{d}_n|S(\beta)|}$, and $y$ being either $\frac{1}{\sqrt{n}}\|\hat{Y} - f^*\|_{\ell_2}$ or $\frac{1}{\sqrt{n}}\|X\beta - f^*\|_{\ell_2}$, yields

$$\frac{1}{n}\|\hat{Y} - f^*\|_{\ell_2}^2 \leq \frac{b+1}{b-1}\,\frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + \frac{8b^2}{(b-1)\kappa^2(s',\delta)}\lambda_n^2\bar{d}_n|S(\beta)|, \quad \forall b > 1. \qquad (4.38)$$

Taking $b = 1 + 2/\delta$ in the last display finishes the proof of the theorem. □

From the above sparsity oracle inequality, we can show that the $\ell_1$-$\ell_q$ regression estimator achieves the optimal rate of convergence if a "weak sparsity" condition holds (Bunea et al., 2007b). The main intuition is: even if the true function $f^*$ cannot be represented exactly by a linear model $X\beta$, if for some $\tilde{\beta} \in \mathbb{R}^{m_n}$ the squared distance from $f^*$ to $X\tilde{\beta}$ can be controlled, up to logarithmic factors, by $|S(\tilde{\beta})|/n$, then the optimal rate of convergence can still be achieved. More formally, we define an oracle set as follows.

Definition 4.7. Let $B$ be a constant depending only on $f^*$ and define the oracle set

$$\mathcal{B} = \left\{\beta : \frac{1}{n}\|f^* - X\beta\|_{\ell_2}^2 \leq B\lambda_n^2|S(\beta)|\right\}. \qquad (4.39)$$

Corollary 4.8. Under the same conditions as in Theorem 4.5, if the oracle set $\mathcal{B}$ is nonempty and contains at least one element $\tilde{\beta}$ with $|S(\tilde{\beta})| \leq s'$, we have

$$\frac{1}{n}\|f^* - \hat{Y}\|_{\ell_2}^2 = O_P\left(\frac{\bar{d}_n s'\log m_n}{n}\right). \qquad (4.40)$$

Therefore, when $s' \leq s_n$, the $\ell_1$-$\ell_q$ regression estimator achieves the optimal rate of convergence.

Remark 4.9. Generally, the conditions for estimation consistency are weaker than those for variable selection consistency. For $q = 1$, explanations of why Assumptions 1 and 2 are weaker than the assumptions of Theorem 3.1 can be found in (Meinshausen and Yu, 2006) and (Bickel et al., 2007). The cases $q > 1$ and the group cases follow in a similar way.

V. Risk Consistency

In this section we study the risk consistency (or persistency) property with random design, which holds under a much weaker condition than variable selection consistency and does not require the true model to be linear. Instead of directly showing the persistency result for the estimator defined in equation (2.1), we show the persistency result for a constrained-form estimator, which is equivalent to the estimator in (2.1) in the sense of primal and dual problems.
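For concreteness, the constrained form over $\mathcal{B}_n$ can be written down directly with a generic convex solver. The sketch below uses cvxpy, our choice of tooling rather than anything from the paper; the helper `constrained_l1_lq` and its arguments are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def constrained_l1_lq(X, Y, groups, L, q=2):
    """Constrained form of the estimator used in Section V (a sketch).

    Minimizes the empirical risk ||Y - X beta||_2^2 over the class
    B_n = { beta : sum_j d_j^{1/q'} ||beta_j||_q <= L }.
    """
    beta = cp.Variable(X.shape[1])
    inv_q_prime = 1.0 - 1.0 / q          # 1/q' = 1 - 1/q (0 when q = 1)
    block_norm = cp.sum(cp.hstack(
        [len(g) ** inv_q_prime * cp.norm(beta[g], q) for g in groups]))
    problem = cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ beta)),
                         [block_norm <= L])
    problem.solve()
    return beta.value
```

Setting `q=1`, `q=2`, or `q=np.inf` recovers the constrained Lasso, group Lasso, and iCAP-style estimators respectively, with the budget `L` playing the role of $L_n$.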
Due to the random design and increasing dimensions, we adopt the same triangular array statistical paradigm as in (Greenshtein and Ritov, 2004). In the following we use calligraphic letters, such as $\mathcal{Z}$, to represent random variables, and $Z$ to represent their realizations. Consider the triangular array $\mathcal{Z}_1^{(n)}, \ldots, \mathcal{Z}_n^{(n)}$ (abbreviated as $\mathcal{Z}_1, \ldots, \mathcal{Z}_n$); our study mainly focuses on the case where $\mathcal{Z}_1, \ldots, \mathcal{Z}_n \overset{\mathrm{iid}}{\sim} F_n \in \mathcal{F}_n$, where $\mathcal{F}_n$ is a collection of distributions of $(m_n + 1)$-dimensional i.i.d. random vectors

$$\mathcal{Z}_i = (\mathcal{Y}_i, \mathcal{X}_{i,1}, \ldots, \mathcal{X}_{i,m_n}), \quad i = 1, \ldots, n, \qquad (5.1)$$

with corresponding realizations

$$Z_i = (Y_i, X_{i,1}, \ldots, X_{i,m_n}), \quad i = 1, \ldots, n. \qquad (5.2)$$

Denote

$$\gamma = (-1, \beta_1, \ldots, \beta_{m_n}) = (\beta_0, \beta_1, \ldots, \beta_{m_n}) \qquad (5.3)$$

and define

$$R_{F_n}(\beta) = E\left(\mathcal{Y} - \sum_{j=1}^{m_n}\mathcal{X}_j\beta_j\right)^2 = \gamma^T\Sigma_{F_n}\gamma \qquad (5.4)$$

where $\mathcal{Z} = (\mathcal{Y}, \mathcal{X}_1, \ldots, \mathcal{X}_{m_n}) \sim F_n \in \mathcal{F}_n$ and $\Sigma_{F_n} = E\,\mathcal{Z}^T\mathcal{Z}$. Given $n$ observations $Z_1, \ldots, Z_n$, denote their empirical distribution by $\hat{F}_n$ and define the empirical risk as

$$R_{\hat{F}_n}(\beta) = \gamma^T\Sigma_{\hat{F}_n}\gamma \qquad (5.5)$$

where

$$\Sigma_{\hat{F}_n} = \frac{1}{n}\sum_{i=1}^{n}Z_i Z_i^T.$$

Given a sequence of sets of predictors $\mathcal{B}_n = \{\beta : \sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q} \leq L_n\}$, the sequence of estimators $\hat{\beta}_{\hat{F}_n}$ is called persistent if for every sequence $F_n \in \mathcal{F}_n$,

$$R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{F_n}(\beta_*^{F_n}) \xrightarrow{P} 0, \qquad (5.6)$$

where

$$\hat{\beta}_{\hat{F}_n} = \arg\min_{\beta \in \mathcal{B}_n}R_{\hat{F}_n}(\beta) = \arg\min_{\beta \in \mathcal{B}_n}\|Y - X\beta\|_{\ell_2}^2, \qquad (5.7)$$

$$\beta_*^{F_n} = \arg\min_{\beta \in \mathcal{B}_n}R_{F_n}(\beta). \qquad (5.8)$$

To show the persistency result, a moment condition as in (Zhou et al., 2007) is needed.

Assumption 3. For each $j, k \in \{1, \ldots, m_n + 1\}$, denote $\mathcal{E} = (\mathcal{Z}\mathcal{Z}^T - E(\mathcal{Z}\mathcal{Z}^T))_{j,k}$, where $\mathcal{Z} = (\mathcal{Y}, \mathcal{X}_1, \ldots, \mathcal{X}_{m_n})$. Suppose that there exist constants $M$ and $s$ such that

$$E(|\mathcal{E}|^k) \leq k!\,M^{k-2}\,s/2 \qquad (5.9)$$

for every integer $k \geq 2$ and every $F_n \in \mathcal{F}_n$.

Theorem 5.1. Suppose that $m_n \leq e^{n^\xi}$ for some $\xi < 1$. If $L_n = o\left((n/\log n)^{1/4}\right)$, then the $\ell_1$-$\ell_q$ regularized regression is persistent; that is, for every sequence $F_n \in \mathcal{F}_n$:

$$R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{F_n}(\beta_*^{F_n}) = o_P(1). \qquad (5.10)$$

Proof: For any $j, k \in \{1, \ldots, m_n + 1\}$ and any $\delta > 0$, from Assumption 3 we can apply Bernstein's inequality and obtain

$$P\left(\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right| > \delta\right) \leq e^{-cn\delta^2} \qquad (5.11)$$

for some $c > 0$. Therefore, by the Bonferroni bound we have

$$P\left(\max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right| > \delta\right) \leq m_n^2 e^{-cn\delta^2} \leq e^{2n^\xi - cn\delta^2} \leq e^{-cn\delta^2/2} \qquad (5.12)$$

for large enough $n$. For the sequence $\delta_n = \sqrt{\frac{2\log n}{cn}}$, we have

$$P\left(\max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right| > \delta_n\right) \leq \frac{1}{n} \to 0 \qquad (5.13)$$

which implies that

$$\max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right| = O_P\left(\sqrt{\frac{\log n}{n}}\right). \qquad (5.14)$$

Therefore,

$$\sup_{\beta \in \mathcal{B}_n}\left|R_{F_n}(\beta) - R_{\hat{F}_n}(\beta)\right| = \sup_{\beta \in \mathcal{B}_n}\left|\gamma^T(\Sigma_{F_n} - \Sigma_{\hat{F}_n})\gamma\right| \qquad (5.15)$$

$$\leq \max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right|\,\|\gamma\|_{\ell_1}^2 \qquad (5.16)$$

$$\leq \max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right|\left(1 + \sum_{j=1}^{p_n}\|\beta_j\|_{\ell_1}\right)^2 \qquad (5.17)$$

$$\leq \max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right|\left(1 + \sum_{j=1}^{p_n}(d_j)^{1/q'}\|\beta_j\|_{\ell_q}\right)^2 \qquad (5.18)$$

$$\leq \max_{j,k}\left|(\Sigma_{\hat{F}_n})_{j,k} - (\Sigma_{F_n})_{j,k}\right|(1 + L_n)^2 = o_P(1)$$

for $L_n = o\left((n/\log n)^{1/4}\right)$. Further, by the definition of the minimizer we have $R_{\hat{F}_n}(\hat{\beta}_{\hat{F}_n}) \leq R_{\hat{F}_n}(\beta_*^{F_n})$; combining this with the following inequalities,

$$R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{\hat{F}_n}(\hat{\beta}_{\hat{F}_n}) \leq \sup_{\beta \in \mathcal{B}_n}\left|R_{F_n}(\beta) - R_{\hat{F}_n}(\beta)\right|, \qquad (5.19)$$
$$R_{\hat{F}_n}(\beta_*^{F_n}) - R_{F_n}(\beta_*^{F_n}) \leq \sup_{\beta \in \mathcal{B}_n}\left|R_{F_n}(\beta) - R_{\hat{F}_n}(\beta)\right|, \qquad (5.20)$$

this implies that

$$R_{F_n}(\hat{\beta}_{\hat{F}_n}) - R_{F_n}(\beta_*^{F_n}) \leq 2\sup_{\beta \in \mathcal{B}_n}\left|R_{F_n}(\beta) - R_{\hat{F}_n}(\beta)\right| = o_P(1), \qquad (5.21)$$

which completes the proof. □

VI. Discussions

The results presented here show that many good properties of $\ell_1$-regularization (Lasso) naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \leq q \leq \infty$), even if the number of variables within each group increases with the sample size $n$. Using fixed design, we obtain both variable selection and estimation consistency under different conditions. Using random design, we obtain persistency under a much weaker condition. Our results provide a unified treatment for both the iCAP estimator ($q = \infty$) and the group Lasso estimator ($q = 2$).

Our results can also provide theoretical analysis of the simultaneous Lasso estimator (Turlach et al., 2005; Tropp et al., 2006) for joint sparsity, which finds a good approximation of several response variables at once using different linear combinations of the high-dimensional covariates, while balancing the approximation error against the total number of covariates that participate. Assume that we have altogether $\bar{d}_n$ responses, the $i$-th signal is represented as $Y^{(i)} \in \mathbb{R}^n$, and the design matrix is $X = (X_1, \ldots, X_{p_n}) \in \mathbb{R}^{n \times p_n}$. Denote the model as

$$Y^{(i)} = X\beta^{(i)} + \epsilon^{(i)}, \quad i = 1, \ldots, \bar{d}_n. \qquad (6.1)$$

The simultaneous Lasso estimator can be formulated as

$$\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(\bar{d}_n)} = \arg\min_{\beta^{(1)},\ldots,\beta^{(\bar{d}_n)}}\frac{1}{2n}\sum_{k=1}^{\bar{d}_n}\left\|Y^{(k)} - X\beta^{(k)}\right\|_{\ell_2}^2 + \lambda_n\sum_{j=1}^{p_n}\max_{\ell \in \{1,\ldots,\bar{d}_n\}}|\beta_j^{(\ell)}|. \qquad (6.2)$$

This problem can be reformulated as a standard $\ell_1$-$\ell_q$ regularized regression estimator with $q = \infty$. To this end, define

$$\tilde{Y} = \begin{pmatrix}Y^{(1)}\\ \vdots\\ Y^{(\bar{d}_n)}\end{pmatrix} \in \mathbb{R}^{n\bar{d}_n}, \quad \tilde{X} = I_{\bar{d}_n} \otimes X = \begin{pmatrix}X & & \\ & \ddots & \\ & & X\end{pmatrix}, \quad \beta = \begin{pmatrix}\beta^{(1)}\\ \vdots\\ \beta^{(\bar{d}_n)}\end{pmatrix} \qquad (6.3)$$

where $\otimes$ denotes the Kronecker product. Therefore, the simultaneous Lasso estimator can be rewritten as

$$\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(\bar{d}_n)} = \arg\min_{\beta^{(1)},\ldots,\beta^{(\bar{d}_n)}}\frac{1}{2n}\left\|\tilde{Y} - \tilde{X}\beta\right\|_{\ell_2}^2 + \lambda'_n\sum_{j=1}^{p_n}\bar{d}_n\max_{\ell \in \{1,\ldots,\bar{d}_n\}}|\beta_j^{(\ell)}| \qquad (6.4)$$

where $\lambda'_n = \lambda_n/\bar{d}_n$. This is just an $\ell_1$-$\ell_\infty$ regularized regression estimator with block design. Therefore, all the results in this paper can be applied to analyze estimators of this type.
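The reformulation in (6.3) is easy to mechanize. The sketch below (our own helper, not from the paper) builds $\tilde{Y}$, $\tilde{X} = I_{\bar{d}_n} \otimes X$, and the groups that tie coefficient $j$ across tasks; feeding them to an $\ell_1$-$\ell_\infty$ solver, for instance the constrained sketch in Section V with $q = \infty$, recovers (6.4) up to the $\lambda'_n$ scaling.

```python
import numpy as np

def simultaneous_lasso_blocks(X, Y_list):
    """Build the block design of (6.3) for the simultaneous Lasso (a sketch).

    Stacks the d responses into Y_tilde and forms X_tilde = I_d kron X
    (dense here for clarity; block-sparse in practice).  Group j collects
    coefficient j across all d tasks, so an l1-l_infinity penalty on these
    groups is exactly the max-over-tasks penalty in (6.4).
    """
    d = len(Y_list)
    n, p = X.shape
    Y_tilde = np.concatenate(Y_list)           # shape (n * d,)
    X_tilde = np.kron(np.eye(d), X)            # shape (n * d, p * d)
    # coefficient j of task l sits at index l * p + j in the stacked beta
    groups = [np.arange(d) * p + j for j in range(p)]
    return Y_tilde, X_tilde, groups
```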
VII. Acknowledgements

We thank John Lafferty, Pradeep Ravikumar, Alessandro Rinaldo, Larry Wasserman, and Shuheng Zhou for their very helpful discussions and comments.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory 267–281.

Bach, F. (2007). Consistency of the group lasso and multiple kernel learning. Tech. rep.

Bickel, P. J., Ritov, Y. and Tsybakov, A. (2007). Simultaneous analysis of lasso and Dantzig selector. Technical report, U.C. Berkeley.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Bunea, F., Tsybakov, A. B. and Wegkamp, M. (2007a). Aggregation for Gaussian regression. The Annals of Statistics 35 1674–1697.

Bunea, F., Tsybakov, A. and Wegkamp, M. (2007b). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1 169–194.

Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 33–61.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32 407–499.

Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28 1356–1378.

Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971–988.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag.

Mallows, C. L. (1973). Some comments on Cp. Technometrics 15 661–675.

Meier, L., van de Geer, S. and Bühlmann, P. (2007). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70 53–71.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 1436–1462.

Meinshausen, N. and Yu, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics 9 319–337.

Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2007). SpAM: Sparse additive models. In Advances in Neural Information Processing Systems 20. MIT Press.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 461–464.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.

Tropp, J., Gilbert, A. C. and Strauss, M. J. (2006). Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing 86 572–588.

Turlach, B., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363.

Wainwright, M., Ravikumar, P. and Lafferty, J. (2006). High-dimensional graphical model selection using ℓ1-regularized logistic regression. In Advances in Neural Information Processing Systems 19. MIT Press.

Wainwright, M. J. (2006). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programs. In Proc. Allerton Conference on Communication, Control and Computing.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 49–67.

Zhao, P., Rocha, G. and Yu, B. (2008). Grouped and hierarchical model selection through composite absolute penalties. The Annals of Statistics, to appear.

Zhao, P. and Yu, B. (2007). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2567.

Zhou, S., Lafferty, J. and Wasserman, L. (2007). Compressed regression. Tech. rep., Carnegie Mellon University.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 1418–1429.
