Theoretical Properties of the Overlapping Groups Lasso
Authors: Daniel Percival
Electronic Journal of Statistics
ISSN: 1935-7524

Theoretical Properties of the Overlapping Groups Lasso

Daniel Percival ∗, †, ‡
Carnegie Mellon University, Department of Statistics, Pittsburgh, PA 15213, USA
e-mail: dperciva@andrew.cmu.edu

Abstract: We present two sets of theoretical results on the grouped lasso with overlap due to Jacob, Obozinski and Vert (2009) in the linear regression setting. This method jointly selects predictors in sparse regression, allowing for complex structured sparsity over the predictors encoded as a set of groups. This flexible framework suggests that arbitrarily complex structures can be encoded with an intricate set of groups. Our results show that this strategy has unexpected theoretical consequences for the procedure. In particular, we give two sets of results: (1) finite sample bounds on prediction and estimation, and (2) asymptotic distribution and selection. Both sets of results demonstrate negative consequences from choosing an increasingly complex set of groups for the procedure, as well as for when the set of groups cannot recover the true sparsity pattern. Additionally, these results demonstrate the differences and similarities between the grouped lasso procedure with and without overlapping groups. Our analysis shows that while the procedure enjoys advantages over the standard lasso, the set of groups must be chosen with caution: an overly complex set of groups will damage the analysis.

Keywords and phrases: Sparsity, Variable Selection, Structured Sparsity, Regularized Methods.

1. Introduction

In this paper, we consider the linear regression model: $y = X\beta^0 + \epsilon$, where $X$ is an $n \times p$ real valued data matrix, $y \in \mathbb{R}^n$ is a vector of responses, $\beta^0 \in \mathbb{R}^p$ is a vector of linear weights, and $\epsilon$ is an error vector.
Much work focuses on estimating a sparse $\hat\beta$, where many of the entries are equal to zero, effectively excluding many of the dimensions of $X$ (the candidate predictors) from the model. Recent work adds the notion of structure to this setting. That is, we desire the set of nonzero entries in $\hat\beta$ to follow some predefined structure over the candidate predictors. There are now many methods tailored to a diverse collection of structures, including hierarchical structures, group structures, and graph derived structures: see Bach (2010a, 2008b, 2010b); Huang, Zhang and Metaxas (2009); Jenatton, Audibert and Bach (2009); Jenatton, Obozinski and Bach (2010); Peng et al. (2010); Percival et al. (2011); Kim and Xing (2010); Zhao et al. (2007) for examples.

∗ This work is part of the author's Ph.D. thesis.
† This work was funded by the National Institutes of Health grant MH057881 and National Science Foundation grant DMS-0943577. The author was also partially supported by National Institutes of Health SBIR grant 7R44GM074313-04 at Insilicos LLC.
‡ The author would like to thank Larry Wasserman and Aarti Singh for helpful comments and discussions, as well as the two anonymous referees whose suggestions helped greatly improve the contents of this paper.

imsart-ejs ver. 2009/08/13, file: percival-overlap-theory-rev1.tex, date: November 26, 2024

One such structured sparse method is the grouped lasso of Yuan et al. (2006), which extends the familiar $\ell_1$ penalization to a grouped $\ell_1$ norm. In particular, the grouped lasso allows for groups of predictors to enter the model together, a useful property in settings such as ANOVA or multi-task regression. For example, we can encode a factor predictor with $m$ levels as $m - 1$ indicator variables in $X$.
When we build a sparse regression model, we might prefer to select none or all of this group of $m - 1$ variables, but not any other subset. The grouped lasso enables this type of grouped selection. Bach (2008a); Chesneau and Hebiri (2008); Huang and Zhang (2010); Lounici et al. (2009); Nardi and Rinaldo (2008) give theoretical results for this procedure, including oracle inequalities and asymptotic distributions. In particular, they showed that for some problems the grouped lasso outperforms the ordinary lasso.

However, the grouped $\ell_1$ norm of the grouped lasso is limited in that it only allows groups that partition the set of candidate predictors. This restricts the complexity and types of structures over the candidate predictors that can be encoded in the groups. For example, we could represent the potential structures of $\beta^0$ as a graph over $p$ nodes, where each node represents a candidate predictor. We might then seek to build a sparse model where the selected predictors correspond to a subgraph, such as a neighborhood or clique, of this graph. This structure can be encoded, for example, as a series of overlapping neighborhoods, such as 4-cycles in a 2-dimensional lattice graph. The grouped $\ell_1$ norm does not allow for such a set of groups.

The grouped lasso with overlap of Jacob, Obozinski and Vert (2009) is one solution to this problem (see also the CAP penalty of Zhao et al. (2007), as well as other group based procedures in Bach (2010b); Jenatton, Audibert and Bach (2009)). Using an extension of the grouped norm of the grouped lasso, this procedure allows for complex, overlapping group structures. Given a collection of subsets of the set of candidate predictors, the procedure recovers nonzero patterns equal to a union of some subset of this collection.
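As a concrete illustration of such a collection (our own sketch, not code from the paper), the 4-cycles of a 2-dimensional lattice graph mentioned above can be enumerated as overlapping groups; the function name and indexing convention are our assumptions:

```python
# Illustrative sketch (not from the paper): encode the 4-cycles (unit
# squares) of an r x c two-dimensional lattice graph as overlapping groups.
# Predictors are the lattice nodes, indexed row-major from 0.

def lattice_four_cycle_groups(r, c):
    """Return each unit square of the r x c lattice as a group of 4 node indices."""
    groups = []
    for i in range(r - 1):
        for j in range(c - 1):
            top_left = i * c + j
            groups.append({top_left, top_left + 1,
                           top_left + c, top_left + c + 1})
    return groups

groups = lattice_four_cycle_groups(3, 3)
# A 3 x 3 lattice has 4 unit squares; adjacent squares share two nodes,
# so the groups overlap and do not partition the predictors.
print(len(groups))
print(groups[0] & groups[1])
```

Adjacent squares share an edge (two nodes), so no partition-based grouped norm can represent this collection.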
This property can encode many complex structures over the candidate predictors, and thus within the resulting sparsity patterns of the estimated coefficients. While Jacob, Obozinski and Vert (2009) gave some initial theoretical results on this procedure, including a consistency result, many theoretical questions were left open. In particular, the impact of increasingly complex sets of groups on the predictive and estimation performance of the procedure remained unanswered. The overlapping nature of the groups allows the possibility for an arbitrarily large set of groups to encode complex structures, or many possible structures simultaneously. If we suppose that there is no consequence to increasing the number and complexity of the groups, then we can freely run the procedure under many structural conditions simultaneously. The concluding remarks of Huang and Zhang (2010) indicate that the grouped lasso does not perform well with overlapping groups.

The goal of this paper is to expose exactly how introducing the possibility of overlapping groups impacts the grouped lasso. Towards this goal, we demonstrate some theoretical properties of the overlapping grouped lasso, with a focus on the consequences of the number and complexity of the groups of predictors. We give a finite sample and an asymptotic result. In particular, we make the following contributions:

1. We show that both the finite sample and asymptotic performance of the overlapping grouped lasso suffers as the number and complexity of the groups grows.
2. In the finite sample case, we show that the assumptions on the design matrix $X$ become more restrictive as the complexity of the groups grows.
3. In our asymptotic analysis, we introduce the adaptive overlapping grouped lasso, and give an adaptive weighting scheme with asymptotic selection guarantees similar to the adaptive lasso of Zou (2006) (see also the adaptive grouped lasso results in Nardi and Rinaldo (2008)).

Overall, we conclude that the overlapping grouped lasso enjoys many of the same theoretical guarantees as the grouped lasso, provided that the set of groups is not too complex or large. We therefore recommend that the procedure be used with a set of groups that is neither overly complex nor contains a nested structure.

The paper is organized as follows: we first introduce notation for the overlapping grouped lasso. We also reproduce some basic theoretical properties of the procedure and the associated overlapping grouped norm. We next give our finite sample results, and then our asymptotic results. We then present a simulation study to support our theoretical results. Proofs of the main results along with supporting lemmas appear in the appendix.

2. Notation

We adopt a combination of the notation of Jacob, Obozinski and Vert (2009) and Lounici et al. (2009). Recall our basic setting, the linear model:

$y = X\beta^0 + \epsilon$.  (1)

Here, $X$ is an $n \times p$ data matrix, $y \in \mathbb{R}^n$ is the response, $\beta^0 \in \mathbb{R}^p$ is a vector of true linear coefficients, and $\epsilon$ is a stochastic error term. Our goal is to estimate a sparse $\hat\beta$, such that the nonzero entries follow some structure which we assume to be known a priori. In particular, we consider structures defined in terms of groups of predictors, which we define as subsets of the set of candidate predictor indices: $I = \{1, 2, \ldots, p\}$. We denote a collection of groups as $\mathcal{G}$ with elements $g$ such that each $g \subseteq I$. Let $|\mathcal{G}| = M$, and assume $\bigcup_{g \in \mathcal{G}} g = I$.
For coefficient vectors $\beta$, we define $\beta_g \in \mathbb{R}^{|g|}$ as the subvector consisting of the entries corresponding to the indices in $g$. Define the support of a vector as:

$\mathrm{supp}(\beta) = \{ i : \beta_i \neq 0 \} \subseteq I$.  (2)

We now give a framework, proposed by Jacob, Obozinski and Vert (2009), to measure the structured sparsity of vectors in $\mathbb{R}^p$. We adopt the following convention: for vectors denoted $v_g \in \mathbb{R}^p$ we have that $\mathrm{supp}(v_g) \subseteq g$. We define a decomposition of $\beta \in \mathbb{R}^p$ with respect to $\mathcal{G}$ as:

$V_{\mathcal{G}}(\beta) = \{ v_g : g \in \mathcal{G} \}$ such that $\sum_{g \in \mathcal{G}} v_g = \beta$.  (3)

That is, each decomposition in $V_{\mathcal{G}}(\beta)$ is a collection of $M$ vectors in $\mathbb{R}^p$, each satisfying $\mathrm{supp}(v_g) \subseteq g$ for a different $g \in \mathcal{G}$. From now on, we suppress the $\mathcal{G}$ in the notation for decompositions and write $V(\beta)$. $V(\beta)$ is not unique in general. We define the following norms:

$\|\beta\|_{2,p,\mathcal{G}} = \min_{V(\beta)} \left( \sum_{g \in \mathcal{G}} \left( \sum_{i \in g} v_{g,i}^2 \right)^{p/2} \right)^{1/p}$  (4)
$\qquad = \min_{V(\beta)} \left( \sum_{g \in \mathcal{G}} \|v_g\|^p \right)^{1/p}$;  (5)
$\|\beta\|_{2,\infty,\mathcal{G}} = \min_{V(\beta)} \max_{g \in \mathcal{G}} \|v_g\|$.  (6)

Here, $\|\cdot\|$ denotes the Euclidean or $\ell_2$ norm. The above two equations are norms by arguments presented in Jacob, Obozinski and Vert (2009). Note that the notation $\min_{V(\beta)}$ indicates the minimum over all possible decompositions. Note that the decomposition that minimizes these norms is not necessarily unique, as we state in the following lemma.

Lemma 1. (Corollary 1 from Jacob, Obozinski and Vert (2009)) For any collections $\{v_g\}$, $\{v'_g\}$ minimizing the norm (5), we have, for all $g \in \mathcal{G}$:

$\|v_g\| \times \|v'_g\| = 0$ or $\dfrac{v_g}{\|v_g\|} = \dfrac{v'_g}{\|v'_g\|}$.  (7)

The above lemma implies that in some cases the collection of groups used in the decomposition, that is, $\{ g \in \mathcal{G} : v_g \neq 0 \}$, is not unique.
Finally, for $J \subseteq \mathcal{G}$ we write $\beta_J = \sum_{g \in \mathcal{G}} v_g 1_{g \in J}$; note that $J$ may equivalently be indexed as a subset of $\{1, 2, \ldots, M\}$. Let $J_v(\beta) = \{ g : v_g \neq 0 \}$, and $M_v(\beta) = |J_v(\beta)|$. Thus, $J_v(\beta)$ is the set of groups used to decompose $\beta$ for a particular decomposition, and $M_v(\beta)$ is a measure of the structured sparsity of $\beta$ with respect to that decomposition. Let $M(\beta) = \min_v M_v(\beta)$, where this minimum is taken over the set of decompositions minimizing the norm (5). Thus, $M(\beta)$ measures the overall structured sparsity of $\beta$ with respect to the groups $\mathcal{G}$.

Here is a simple example to illustrate the setting. Let $p = 3$, and consider the following groups:

$\mathcal{G} = \{ \{1, 2\}, \{2, 3\} \}$.  (8)

For any $\alpha \in \mathbb{R}$, we have the following possible decomposition of $\beta = [a, b, c]$:

$V(\beta) = \{ v_{\{1,2\}}, v_{\{2,3\}} \}$,  (9)
$v_{\{1,2\}} = [a, \alpha b, 0]$,  (10)
$v_{\{2,3\}} = [0, (1 - \alpha) b, c]$.  (11)

Thus, the norm from Equation (5) can be expressed as:

$\|\beta\|_{2,1,\mathcal{G}} = \min_\alpha \sqrt{a^2 + (\alpha b)^2} + \sqrt{((1 - \alpha) b)^2 + c^2}$.  (12)

Finally, it is clear that for $a, c \neq 0$, $M(\beta) = 2$, and $M(\beta) = 1$ otherwise.

3. Overlapping Grouped Lasso

Recall our goal: under the model of Equation (1), we estimate the target $\beta^0$ with a sparse $\hat\beta$, that is, many entries of $\hat\beta$ are set to zero. Additionally, we know these nonzero entries occur in a structured pattern, as given by $\mathcal{G}$. We evaluate the fit with the usual quadratic loss:

$\ell(\beta) = \frac{1}{n} \|y - X\beta\|^2$.  (13)

The overlapping grouped lasso solves the following optimization problem:

$\hat\beta = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \left( \ell(\beta) + 2\lambda \|\beta\|_{2,1,\mathcal{G}} \right)$.  (14)

Here $\lambda > 0$ is a tuning parameter controlling the amount of regularization.
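The scalar minimization in Equation (12) is easy to check numerically. The following sketch (our illustration, not code from the paper) grid-searches over $\alpha$; for instance, with $c = 0$ the minimizer sends the entire middle coefficient into the first group, and the norm collapses to $\sqrt{a^2 + b^2}$:

```python
import numpy as np

# Illustrative sketch (not from the paper): compute the overlapping group
# norm ||beta||_{2,1,G} of Equation (12) for p = 3, G = {{1,2},{2,3}},
# by a grid search over the split parameter alpha.

def overlap_norm(a, b, c, n_grid=200001):
    alphas = np.linspace(-1.0, 2.0, n_grid)
    vals = (np.sqrt(a**2 + (alphas * b)**2)
            + np.sqrt(((1 - alphas) * b)**2 + c**2))
    i = np.argmin(vals)
    return vals[i], alphas[i]

val, alpha = overlap_norm(3.0, 4.0, 0.0)
# With c = 0 the optimum is alpha = 1: the whole middle coefficient joins
# group {1,2}, giving sqrt(3^2 + 4^2) = 5.
print(round(val, 4), round(alpha, 4))
```

Note that the grid search also exhibits the general lower bound $\|\beta\|_{2,1,\mathcal{G}} \ge \|\beta\|$, which follows from the triangle inequality applied to any decomposition.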
If the elements of $\mathcal{G}$ are restricted to be pairwise disjoint, then the norm $\|\cdot\|_{2,1,\mathcal{G}}$ reduces to the grouped $\ell_1$ norm, and we recover the original formulation of the grouped lasso. In the special case where the groups are all singletons, $\mathcal{G} = \{ \{i\} : i \in I \}$, we recover the familiar lasso of Tibshirani (1996). If we allow $\mathcal{G}$ to be any collection, allowing for the possibility of overlap between groups, then the minimum over $V(\beta)$ in the norm now plays a role, since the decomposition of $\beta$ is no longer unique in general. This setting gives us the overlapping grouped lasso. For each of these problems, the key fact is that the support of $\hat\beta$ will be a union of members of a subset of $\mathcal{G}$.

Finally, we also introduce the adaptive overlapping grouped lasso:

$\hat\beta = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \left( \ell(\beta) + 2\lambda \min_{V(\beta)} \sum_{g \in \mathcal{G}} \lambda_g \|v_g\| \right)$.  (15)

As previous work and theory have suggested (Nardi and Rinaldo (2008); Zou (2006)), the choice of weights $\lambda_g = 1 / \|\beta^{OLS}_g\|^\gamma$, where $\beta^{OLS} = (X^T X)^{-1} X^T y$ and $\gamma > 0$, gives good asymptotic guarantees. In Section 5, we show that a different choice is needed in our setting to give similar asymptotic guarantees.

Finally, as noted by Jacob, Obozinski and Vert (2009), the overlapping grouped lasso method is simple to implement. In the case where $\mathcal{G}$ consists of non-overlapping groups, there are several efficient algorithms available. In the overlapping case, no new specialized algorithm is required. Write $X_g$ for the submatrix of $X$ containing only the columns of $X$ indexed by the elements of $g$. Now define $\tilde{X} = [X_g]_{g \in \mathcal{G}}$, an $n \times \sum_g |g|$ matrix formed by concatenating the columns of $X$ corresponding to each group in $\mathcal{G}$.
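The construction of $\tilde{X}$ can be sketched as follows (a minimal illustration under our own naming conventions, not the paper's code): each group contributes a copy of its columns, and an expanded latent coefficient vector maps back to $\beta$ by summing the per-group vectors.

```python
import numpy as np

# Sketch of the column-duplication reformulation: X_tilde stacks the
# columns of X for each group, so an overlapping-group problem on X
# becomes a non-overlapping grouped-lasso problem on X_tilde.

rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.standard_normal((n, p))
groups = [[0, 1], [1, 2]]          # overlapping groups of predictor indices

# Expanded design: n x sum(|g|) concatenation of the group submatrices.
X_tilde = np.hstack([X[:, g] for g in groups])

# A latent coefficient vector for the expanded problem, one block per group.
v_blocks = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
v_tilde = np.concatenate(v_blocks)

# Map the latent blocks back to beta by summing the v_g vectors.
beta = np.zeros(p)
for g, v in zip(groups, v_blocks):
    beta[g] += v

# The expanded model and the original model agree: X_tilde v_tilde = X beta.
assert np.allclose(X_tilde @ v_tilde, X @ beta)
```

Any non-overlapping grouped lasso solver run on $\tilde{X}$ with one block per group therefore implicitly searches over decompositions $\{v_g\}$ of $\beta$.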
We can then solve the optimization problem with a new, non-overlapping set of groups defined on the appropriate columns of $\tilde{X}$. Since this new collection is non-overlapping for $\tilde{X}$, we can simply apply existing algorithms for the grouped lasso.

4. Finite Sample Bounds

We now give a sparsity oracle inequality for the overlapping grouped lasso. This finite sample result is an extension of a result on multitask regression due to Lounici et al. (2009), which is in turn built on results from Bickel, Ritov and Tsybakov (2009). We first state and discuss our main assumption, which is an adaptation of the restricted eigenvalue condition of Bickel, Ritov and Tsybakov (2009) to the overlapping grouped lasso setting.

Assumption 1. Suppose $1 \le s \le M = |\mathcal{G}|$. Then there exists $\kappa(s) > 0$ such that:

$\kappa(s) \le \min \left\{ \dfrac{\sqrt{\Delta^T X^T X \Delta}}{\sqrt{n} \sum_{g \in J} \|v^\Delta_g\|} : J \subseteq \mathcal{G};\ J \in \mathcal{J}(s) \right\}$,  (16)

$\mathcal{J}(s) := \left\{ J \subseteq \mathcal{G};\ |J| \le s;\ \Delta \in \mathbb{R}^p \setminus 0;\ V(\Delta) = \{v^\Delta_g\} \text{ s.t. } \sum_{g \in J^c} \|v^\Delta_g\| \le 3 \sum_{g \in J} \|v^\Delta_g\| \right\}$.  (17)

Here $J^c = \{ g : g \in \mathcal{G}, g \notin J \}$, and $V(\Delta) = \{v^\Delta_g\}$ denotes the decomposition minimizing the norm $\|\Delta\|_{2,1,\mathcal{G}}$.

In the subsequent results, the integer $s$ measures the structured sparsity of the target. There are two key differences between this assumption and other restricted eigenvalue conditions. First, it relies on norms of the decompositions of vectors, rather than norms of the vector or appropriate sub-vectors. Note that the decomposition of $\Delta$ must be a decomposition minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm. As we will discuss later, this condition grows more restrictive as $\mathcal{G}$ becomes more complex. The second key difference in the assumption lies in the denominator term $\sum_{g \in J} \|v^\Delta_g\|$, which appears instead of the directly analogous $\|\sum_{g \in J} v^\Delta_g\|$.
We know by the triangle inequality that $\|\sum_{g \in J} v^\Delta_g\| \le \sum_{g \in J} \|v^\Delta_g\|$, and so our $\kappa$ is less than or equal to a $\kappa'$ obtained under the analogous assumption. In the case of non-overlapping groups, this is an equality, and the assumption is identical in this case.

We now examine some sufficient conditions for the existence of $\kappa(s)$. Examining the numerator of the main quantity defining $\kappa(s)$, we see that $\sqrt{\Delta^T X^T X \Delta / n} \ge |\rho_X|^{1/2} \|\Delta\|$, where $\rho_X$ is the minimal eigenvalue of $X^T X / n$. Examining the denominator, we can make the following bounds:

$\sum_{g \in J} \|v^\Delta_g\| \le \sum_{g \in \mathcal{G}} \|v^\Delta_g\|$  (18)
$\qquad \le \sum_{g \in \mathcal{G}} \|\Delta_g\|$  (19)
$\qquad \le \|\Delta\| (M\, G_{\mathrm{overlap}})^{1/2}$.  (20)

Here, $G_{\mathrm{overlap}} := \max_{j \in I} \left[ \sum_{g \in \mathcal{G}} 1_{j \in g} \right]$ is the maximal number of times a candidate predictor appears in the groups of the collection $\mathcal{G}$. Thus, as long as $X^T X$ has a nonzero minimal eigenvalue, we are guaranteed to find a $\kappa(s)$ of at most $(\rho_X / (M\, G_{\mathrm{overlap}}))^{1/2}$. In particular, for $\kappa(s)$ to exist, it is sufficient for $X^T X$ to be positive definite.

We now state our main result.

Theorem 1. Consider the model in Equation (1). Suppose $|\mathcal{G}| = M \ge 2$, and $n \ge 1$. Assume that the entries of $\epsilon$ are i.i.d. Gaussian with mean 0 and variance $\sigma^2$. Let $X$ be normalized so that the diagonal entries of $X^T X / n$ are all equal to 1. Denote $M(\beta^0) \le s$ as the maximum number of nonzero groups in decompositions of $\beta^0$, $V(\beta^0)$. Let Assumption 1 hold with $\kappa = \kappa(s)$. Let:

$\lambda = \dfrac{2\sigma \sqrt{\max_g |g|\, G_{\mathrm{overlap}}}}{\sqrt{n}} \left( 1 + \dfrac{A \log M}{\sqrt{\max_g |g|}} \right)^{1/2}$.  (21)

Here, $A > 8$.
Define $q = \min_g(\rho_g^{-2}) \min\left( A \sqrt{\min_g |g|}/8,\ 8 \log M \right)$, where $\rho_g$ is the maximal absolute eigenvalue of a Cholesky decomposition of $X_g^T X_g$, and $X_g$ is the submatrix of $X$ corresponding to the columns indexed by the group $g$. Then, with probability at least $1 - M^{1-q}$, for any solution $\hat\beta$ to Equation (14), for all $\beta^0 \in \mathbb{R}^p$, the following inequalities hold:

$\dfrac{1}{n} \|X(\hat\beta - \beta^0)\|^2 \le \dfrac{64 \sigma^2}{\kappa^2 n} \left( \max_g |g| + A \sqrt{\max_g |g|} \log M \right)$,  (22)

$\|\hat\beta - \beta^0\|_{2,1,\mathcal{G}} \le \dfrac{32 \sigma}{\kappa \sqrt{n}} \left( \max_g |g| + A \sqrt{\max_g |g|} \log M \right)^{1/2}$.  (23)

The proof of this result is given in Appendix A.1.2, and relies on Lemma 4 given in Appendix A.1.1. We now discuss the result.

1. As the set of groups grows, the finite sample guarantees degrade. In Theorem 1, the prediction and estimation bounds both get coarser as the number of groups increases. Note that the set of groups can grow not only as the dimension of the problem grows, but also if we encode complex structures over the predictors using $\mathcal{G}$. Thus, even for a problem of fixed dimension $p$, there is a consequence to choosing an arbitrarily complex set of groups. To make this result clear, let the groups be maximally complex: $\mathcal{G} = 2^I$, the power set of the set of predictors. Now, as the dimension of the problem grows, the prediction bound grows at rate $O(p^{3/2})$, and the estimation bound at rate $O(p)$. If $|\mathcal{G}|$ is instead of the same order as $p$ and the maximum group size is constant, these rates are instead both $O(\log p)$. This shows that grouped sparsity achieves the tightest upper bounds if both the maximum group size and the number of groups grow at slower rates than $p$. Note the contrast here to the results of Lounici et al.
(2009) in the multi-task setting, where a growing number of tasks benefitted the procedure. Note that in the multi-task setting, the number of observations necessarily grows with the number of tasks, contrary to our setting.

2. As the complexity of the groups grows, Assumption 1 becomes more restrictive. Since $\kappa$ appears in the denominator of both the prediction and estimation bounds, the bounds become less tight as $\kappa$ decreases. Consider the condition:

$\sum_{g \in J^c} \|v^\Delta_g\| \le 3 \sum_{g \in J} \|v^\Delta_g\|$.  (24)

Recall that $J$ is a cardinality $s$ set of groups. Thus, for fixed $s$, as the complexity of $\mathcal{G}$ grows, the flexibility of the decompositions grows, and more vectors $\Delta$ satisfy this condition. This makes $\kappa$ decreasing as a function of $|\mathcal{G}|$. We also recall that when $X^T X$ has a nonzero minimal absolute eigenvalue, $\kappa$ is at most $\sqrt{\rho_X / (M\, G_{\mathrm{overlap}})}$. As noted earlier, as the complexity of the groups grows, $G_{\mathrm{overlap}}$ increases as well, leading to a smaller $\kappa$ and in turn inferior prediction and estimation bounds. If $G_{\mathrm{overlap}}$ is of the same order as the number of predictors, then $\kappa(s)$ is of order $1/M$ rather than $1/\sqrt{M}$. This dependence shows that our bounds depend equally on the dimension of the problem $M$ and the group complexity as measured by $G_{\mathrm{overlap}}$. In the case of the lasso or group lasso, $G_{\mathrm{overlap}} = 1$, giving us no dependence on group complexity, as expected.

3. The results show that the procedure enjoys an advantage over non-structured procedures when $\beta^0$ is structured sparse. For example, in the finite sample case, none of our bounds depends explicitly on the dimension of the problem $p$. Thus, we can adopt an argument similar to those of Lounici et al. (2009) to show that, compared to the lasso, the overlapping grouped lasso gives superior results in the case where $\beta^0$ is structured sparse.
That is, from Bickel, Ritov and Tsybakov (2009), if we let:

$\lambda = A\sigma \sqrt{\dfrac{\log p}{n}}$,  (25)

then for $A > 2\sqrt{2}$, we have with probability at least $1 - p^{1 - A^2/8}$:

$\dfrac{1}{n} \|X(\hat\beta_{\mathrm{lasso}} - \beta^0)\|^2 \le \dfrac{16 A^2 \sigma^2}{\kappa^2 n} \log p$.  (26)

Thus, if $\sqrt{\max_g |g|} \log M + \max_g |g|$ is of smaller order than $\log p$, the procedure has a predictive advantage. Since $\kappa$ depends on the structured sparsity of the target, this result holds only for structured sparse targets $\beta^0$ which give sufficiently large values of $\kappa$ under our assumption.

4. In the non-overlapping case, we can recover many results available in the literature. Here we have $G_{\mathrm{overlap}} = 1$. We adjust our assumption to match the literature, so that the quantity in the minimum is replaced with:

$\dfrac{\sqrt{\Delta^T X^T X \Delta}}{\sqrt{n} \left\| \sum_{g \in J} v^\Delta_g \right\|}$.  (27)

Combining this with an application of the Cauchy-Schwarz inequality in the last steps of the proofs of the result, we can recover the results of Lounici et al. (2009) in the multi-task case. In the case of the grouped lasso, we can recover the result from Nardi and Rinaldo (2008). The dependence on the minimal eigenvalues of the Cholesky decomposition of each $X_g^T X_g$ is related to the conditions given in Huang and Zhang (2010). In the settings of Lounici et al. (2009), $\rho_g = 1$ for all $g$.

5. We can show a similar result solely in terms of $\max_g |g|$. In particular, for:

$\lambda = \dfrac{2\sigma \sqrt{G_{\mathrm{overlap}}}}{\sqrt{n}} \left( \max_g |g| + A \log M \right)^{1/2}$,  (28)

the same results hold with probability $1 - M^{1-q}$, for $q = \min_g(\rho_g^{-2}) \min\left( A/8,\ 8 \log M / \max_g |g| \right)$. This result is a consequence of a simple adjustment for this choice of $\lambda$ in the proof of Lemma 4 from the appendix.
This alternate result shows that as the maximum group size grows, the estimation and prediction bounds become less tight, and the probability that they hold falls.

6. The result does not depend on any uniqueness assumptions on the decomposition of $\beta^0$. The consistency result for the overlapping grouped lasso in Jacob, Obozinski and Vert (2009) assumes that the decomposition of $\beta^0$ that minimizes the $\|\cdot\|_{2,1,\mathcal{G}}$ norm is unique. Our result, in contrast, depends only on the maximal structured sparsity of such decompositions. Thus, in the case where $\beta^0$ does not have a unique decomposition minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm, our results still hold. This is a contrast to the asymptotic results of the next section.

5. Asymptotic Results

In this section, we consider fixed dimension asymptotics for the adaptive overlapping grouped lasso as described in Equation (15). These results extend those on the grouped lasso found in Nardi and Rinaldo (2008) to the case of overlapping groups. To begin, define the sets of indices of the true linear coefficient vector $\beta^0$ which are nonzero and zero, respectively:

$H = \{ i : \beta^0_i \neq 0 \}$,  (29)
$H^c = \{ i : \beta^0_i = 0 \}$.  (30)

Accordingly, we define $X_H$ as the submatrix containing the entries with column indices in the set $H$. Similarly, for a $p$-vector $x$, let $x_H$ be the subvector containing the entries with indices in the set $H$. Clearly, $H \cup H^c = I$. Note, however, that $H$ and $H^c$ are not necessarily the union of members of $J(\beta^0)$ and $J(\beta^0)^c$, respectively. We next define the following three subsets of $\mathcal{G}$ related to $H$ and $H^c$:

$\mathcal{G}_H = \{ g : g \subseteq H \}$,  (31)
$\mathcal{G}_{H^c} = \{ g : g \subseteq H^c \}$,  (32)
$\mathcal{G}_{H^o} = \{ g : |g \cap H| > 0;\ |g \cap H^c| > 0 \}$.
(33)

These are, respectively, the sets of groups in which the indices are all nonzero in $\beta^0$, all zero in $\beta^0$, and a mix of zero and nonzero in $\beta^0$. For this setting, we now make the following assumptions:

Assumption 2. As $n \to \infty$, $X^T X \to M$, where $M$ is positive definite.

Assumption 3. The entries of the stochastic term $\epsilon$ in Equation (1) are i.i.d. with finite second moment $\sigma^2$.

Assumption 4. There exists a neighborhood in $\mathbb{R}^p$ around $\beta^0$ such that any vector $b$ in the neighborhood has a unique decomposition $\{v^b_g\}$ minimizing the norm $\|b\|_{2,1,\mathcal{G}}$. In particular, the decomposition $\{v^0_g\}$ minimizing the norm $\|\beta^0\|_{2,1,\mathcal{G}}$ is unique. Further, this decomposition is such that $v^0_g = 0$ for all $g \in \mathcal{G}_{H^o}$.

Assumptions 2 and 3 are directly taken from the grouped lasso setting. Assumption 4 is another such condition adapted to our setting. A direct adaptation would be that there exists some $G \subseteq \mathcal{G}$ such that $\cup_{g \in G}\, g = \mathrm{supp}(\beta^0)$; this property is implied by Assumption 4. Note that these three assumptions are analogous to those needed for the consistency result given in Jacob, Obozinski and Vert (2009). This assumption also indirectly addresses the issue of identifiability of the groups. For example, for $M = 3$ and $\mathcal{G} = \{ \{1,2\}, \{2,3\}, \{1,3\} \}$, the target $\beta^0 = (a, a, a)$ does not admit a unique, norm minimizing decomposition within any neighborhood. Similarly, we can create the set $\{1, 2, 3\}$ in four possible ways from unions of members of $\mathcal{G}$. Thus, this particular $\mathcal{G}$ does not satisfy Assumption 4 for some targets.

In the following result, we consider the adaptive overlapping grouped lasso of Equation (15). We now propose a set of weights $\{\lambda_g\}$ for the adaptive overlapping grouped lasso. Let $\beta^{OLS} = (X^T X)^{-1} X^T y$, and let $\{v^{OLS}_g\} = V(\beta^{OLS})$ be
any decomposition minimizing the norm $\|\beta^{OLS}\|_{2,1,\mathcal{G}}$. Then, let $\lambda_g = 1 / \|v^{OLS}_g\|$. This choice of weights gives us our main result:

Theorem 2. Consider the adaptive overlapping grouped lasso. Suppose Assumptions 2, 3, and 4 hold. Let $\beta^{OLS} = (X^T X)^{-1} X^T y$, and let $\{v^{OLS}_g\} = V(\beta^{OLS})$ be any decomposition minimizing the norm $\|\beta^{OLS}\|_{2,1,\mathcal{G}}$. Then, let $\lambda_g = 1 / \|v^{OLS}_g\|^\gamma$, for $\gamma > 0$ such that $n^{(\gamma+1)/2} \lambda \to \infty$. If $\sqrt{n}\lambda \to 0$, then, as $n \to \infty$:

$\sqrt{n}(\hat\beta - \beta^0) \to Z$,  (34)

where the above is convergence in distribution. The vector $Z$ has entries:

$Z_H \sim N_{|H|}(0, \sigma^2 M_H^{-1})$,  (35)
$Z_{H^c} = 0$,  (36)

where $M_H$ is the submatrix of $M$ consisting of the entries with row and column indices in $H$.

We now make some comments on the result.

1. In the non-overlapping case, our result reduces to previous results from Nardi and Rinaldo (2008). In particular, the weights are clearly $\lambda_g = 1 / \|\beta^{OLS}_g\|^\gamma$. Given this, we could ask: what is the consequence of simply choosing $\lambda_g = 1 / \|\beta^{OLS}_g\|^\gamma$ for the adaptive weights in any case? In the proof of the result, the impact is for the case when $g \in \mathcal{G}_{H^o}$. In summary, the term $n^{\gamma/2} \|\beta^{OLS}_g\|^\gamma$ is no longer $O_p(1)$, since $\|\beta^0_g\| > 0$. Then, we get the following distribution:

$Z_{H^o} \sim N_{|H^o|}(0, \sigma^2 M_{H^o}^{-1})$,  (37)
$Z_{H^{o\,c}} = 0$.  (38)

The resulting distribution is nonzero with positive probability in coordinates that are zero in $\beta^0$. In this situation, the problem can be remedied by assuming that $\mathcal{G}_{H^o}$ is empty, that is:

Assumption 5. (Separation of support) There exists $G \subset \mathcal{G}$ such that $\cup_{g \in G}\, g = H$ and $\cup_{g \notin G}\, g = H^c$.

For many settings with overlap, this is an overly restrictive assumption.
Note that this assumption corresponds to assuming the groups are correct in the non-overlapping grouped lasso. If the groups are incorrect, the result of this proposition gives us some insight as to what goes wrong asymptotically.

2. The result gives a consequence of having an "incorrect" set of groups, relative to the support of $\beta^0$. When the condition $\forall g \in \mathcal{G}_{H^o}:\ v^0_g = 0$ of Assumption 4 is violated, we have that $n^{\gamma/2} \|\beta^{OLS}_g\|^\gamma$ is no longer $O_p(1)$ for $g \in \mathcal{G}_{H^o}$, and the consequence is similar to the previous remark. Again, we get the wrong asymptotic mean, and the estimator does not have good selection properties. Such a violation of Assumption 4 implies that the structure implied by $\mathcal{G}$ is not sufficient to capture the structure in $\beta^0$.

3. These results exclude some types of structures: in particular, nested groups in $\mathcal{G}$. The uniqueness assumption implies that we cannot use a $\mathcal{G}$ which contains nested groups. In this case, given such a set of groups, the uniqueness condition of Assumption 4 is violated for some $\beta^0$. For example, suppose $p = 5$ and

$\mathcal{G} = \{ \{1, 2\}, \{3, 4\}, \{1, 2, 3, 4\}, \{5\} \}$.  (39)

Then, for $\beta^0 = [a, a, 0, 0, c]$, there are an infinite number of decompositions minimizing the $\|\cdot\|_{2,1,\mathcal{G}}$ norm. In particular, for any $\alpha \in (0, a)$, the following decomposition minimizes the norm:

$v^0_1 = [a - \alpha, a - \alpha, 0, 0, 0]$,  (40)
$v^0_2 = [0, 0, 0, 0, 0]$,  (41)
$v^0_3 = [\alpha, \alpha, 0, 0, 0]$,  (42)
$v^0_4 = [0, 0, 0, 0, c]$.  (43)

Then, consider the weights $\lambda_g = 1 / \|v^{OLS}_g\|$. In almost all data applications we have $\mathrm{supp}(\beta^{OLS}) \supset \{1, 2, 3, 4\}$. The minimizing decomposition of $\|\beta^{OLS}\|_{2,1,\mathcal{G}}$ will clearly have $v^{OLS}_{\{1,2\}} = v^{OLS}_{\{3,4\}} = 0$.
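The non-uniqueness claimed for this nested example can be checked numerically. The following sketch (ours, not the paper's) evaluates $\sum_g \|v_g\|$ for the family of decompositions (40)-(43) and confirms the value does not depend on $\alpha$:

```python
import numpy as np

# Sketch: for p = 5, G = {{1,2},{3,4},{1,2,3,4},{5}} and beta0 = [a,a,0,0,c],
# the decompositions indexed by alpha in (0, a) all achieve the same value
# of sum_g ||v_g||, illustrating the non-uniqueness discussed above.

def decomposition_value(a, c, alpha):
    v1 = np.array([a - alpha, a - alpha, 0, 0, 0])  # group {1,2}
    v2 = np.zeros(5)                                # group {3,4}
    v3 = np.array([alpha, alpha, 0, 0, 0])          # group {1,2,3,4}
    v4 = np.array([0, 0, 0, 0, c])                  # group {5}
    beta = v1 + v2 + v3 + v4
    assert np.allclose(beta, [a, a, 0, 0, c])       # a valid decomposition
    return sum(np.linalg.norm(v) for v in (v1, v2, v3, v4))

a, c = 1.0, 2.0
vals = [decomposition_value(a, c, alpha) for alpha in (0.1, 0.25, 0.5, 0.9)]
# Every alpha gives sqrt(2)*a + |c|: infinitely many minimizing decompositions.
print(vals)
```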
This effectively excludes the first two groups, and we will be unable to detect all possible sparsity patterns. More generally, by the same argument as in the example, in the case where groups are nested there exist some β^0 which cannot be uniquely decomposed to minimize the ||·||_{2,1,G} norm. Thus, using nested groups degrades the asymptotic guarantees of the overlapping grouped lasso. This property precludes using a complex nested set of groups to encode multiple structures.

6. Simulation Study

We now present the results of a simulation study to illuminate and support our earlier theoretical claims. For ease of comparison, we imitate the setting of Huang and Zhang (2010). Here, we explore issues most pertinent to the overlapping groups lasso, leaving aside some of the issues addressed by the simulation study in Huang and Zhang (2010). We generate an n × p design matrix X with i.i.d. standard normal entries, with each row scaled so it has unit magnitude. We next generate a structured sparse β^0 vector with the nonzero entries defined as the union of the first k groups from our set of groups G. We choose the first k groups to achieve a consistent amount of overlap in β^0 with respect to G between trials. We define k, G, n, and p separately in each experiment. After constructing our response from X and β^0, we add zero-mean Gaussian noise with standard deviation σ = 0.01. We compare the standard lasso against the overlapping groups lasso with set of groups G. As in Huang and Zhang (2010), we adopt the following metric to evaluate the performance of both estimators:

    Recovery Error: ||β^0 − β̂||_2 / ||β^0||_2.    (44)

We conduct the following pair of experiments:

1.
Study on the effect of overlap. Here, we simulate a problem that has nearly constant difficulty for the ordinary, un-grouped lasso, but increasing difficulty for the grouped lasso. We set p = 512, and set each group so that it consists of 8 consecutive (by index) predictors. We then vary G_Overlap ∈ {1, 2, ..., 8}. For example, with G_Overlap = 1, our first two groups are g_1 = {1, 2, ..., 8}; g_2 = {9, 10, ..., 16}, and with G_Overlap = 2, g_1 = {1, 2, ..., 8}; g_2 = {8, 9, ..., 15}, and so forth. We select k = ⌈(64 − 8)/(8 + G_Overlap)⌉ + 1 groups to be nonzero in β^0, and set n = 192.

2. Study on the effect of sample size. We adopt a setting similar to the first experiment. We set G_Overlap = 4, and set G, p, and k in a similar manner as the first experiment. We consider n satisfying log_2(n/48) ∈ {0, 1, 2, 3, 4}.

The purpose of the first experiment is to study the effect of increasing complexity of G on estimation performance. For G_Overlap ∈ {1, 2, 3, 4}, we see that as the degree of overlap increases, the estimator performance degrades, though not dramatically in these settings. For G_Overlap = 5, with groups of size 8, we can see that due to the consecutive placement of the signal, about half of the groups may be dropped without degradation in performance, and we return to the setting and performance of G_Overlap = 1. For G_Overlap ∈ {6, 7, 8}, the estimator again does worse than in the case of no overlap, but no worse than G_Overlap = 4. This result supports the discussion surrounding Assumption 1 and Theorem 1, but still indicates that the procedure is more robust to overlap than postulated in Huang and Zhang (2010).
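The data-generating process of this study can be sketched in a few lines of Python. This is only a sketch of the simulation setup, not of the grouped-lasso solver: the group constructor follows the paper's indexing convention (G_Overlap = 1 gives disjoint groups of 8 consecutive predictors), and drawing the nonzero magnitudes of β^0 as standard normals is an assumption, since the text does not restate them.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_groups(p, size=8, g_overlap=1):
    """Consecutive groups of `size` predictors. Per the paper's convention,
    g_overlap = 1 gives disjoint groups; each increment shifts the next
    group's start one index earlier."""
    step = size - (g_overlap - 1)
    return [list(range(s, s + size)) for s in range(0, p - size + 1, step)]

def simulate(n, p, groups, k, sigma=0.01):
    """Design, structured-sparse signal, and noisy response of Section 6."""
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-magnitude rows
    support = sorted(set().union(*groups[:k]))        # union of first k groups
    beta0 = np.zeros(p)
    beta0[support] = rng.standard_normal(len(support))  # assumed magnitudes
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0

def recovery_error(beta0, beta_hat):
    """The metric of Equation 44."""
    return np.linalg.norm(beta0 - beta_hat) / np.linalg.norm(beta0)

# First experiment's setting with G_Overlap = 4:
# k = ceil((64 - 8) / (8 + 4)) + 1 = 6 nonzero groups.
groups = make_groups(p=512, size=8, g_overlap=4)
X, y, beta0 = simulate(n=192, p=512, groups=groups, k=6)
```

A fitted β̂ from any estimator — the lasso baseline or an overlapping grouped lasso solver — is then scored with `recovery_error(beta0, beta_hat)`.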
In the sample size study, we see that for a reasonable set of groups (G_Overlap = 4), the estimator outperforms the lasso: it is able to achieve a limiting level of recovery error at lower sample sizes than the lasso. This supports the conclusions of Theorem 1, as well as the conclusions from the literature about the grouped lasso, e.g. Huang and Zhang (2010) and Lounici et al. (2009). We thus see that even in the overlap case, the procedure still enjoys a benefit due to group sparsity.

[Figure 1 about here.]

Fig 1. Results of the simulation study. We compare the overlapping grouped lasso to the ordinary lasso. Left: study on the degree of overlap of the groups. Right: study on sample size.

7. Discussion and Conclusions

In the previous two sections, we have given results on the performance of the overlapping grouped lasso in both the finite sample and asymptotic settings. One of the basic steps in practical applications of this procedure is the choice of the collection of groups G. In both cases, we showed that an overly complex choice of G degrades the theoretical guarantees on the performance of the estimator. In the case where the dimension of the problem is fixed, increasing the number of groups leads to less tight upper bounds on both prediction and estimation in the finite sample case. In the asymptotic setting, nested groups lead to inconsistent selection of the true sparsity pattern. Nonetheless, when G is suitably chosen,
we still see that the procedure retains the theoretical benefits of the grouped lasso demonstrated in previous literature.

In summary, we find that the overlapping grouped lasso is a useful extension of the grouped lasso that must be used with caution. The flexibility allowed by overlapping groups is valuable in many applications, and can encode a wide variety of structures as collections of groups. We have shown that allowing for overlap does not remove many of the theoretical properties and benefits proven for the lasso and grouped lasso. However, while the flexible nature of the procedure suggests that the analyst may encode many structures simultaneously, this approach is not supported by the results in this paper.

Appendix A: Proofs

A.1. Finite Sample Result

A.1.1. Auxiliary Lemmas

Lemma 2. Let χ²_D be a chi-squared random variable with D degrees of freedom. Then:

    P(χ²_D > D + x) ≤ exp(−(1/8) min{x, x²/D}).    (45)

Proof. See Lemma A.1 from Lounici et al. (2009).

Lemma 3. Let α, β ∈ R^p. Then, for all G:

    α^T β ≤ G_overlap^{1/2} max_g ||α_g||_2 ||β||_{2,1,G}.    (46)

Proof. Let {v^{β*}_g} denote any decomposition of β minimizing the norm ||β||_{2,1,G}, so that β = Σ_g v^{β*}_g with each v^{β*}_g supported on g. Then, beginning with Hölder's inequality applied within each group (with exponents p = q = 2, i.e. Cauchy–Schwarz), the following chain gives the desired result, using G_overlap ≥ 1:

    α^T β = Σ_g α^T v^{β*}_g    (47)
    ≤ Σ_g ||α_g||_2 ||v^{β*}_g||_2    (48)
    ≤ G_overlap^{1/2} max_g ||α_g||_2 Σ_g ||v^{β*}_g||_2    (49)
    = G_overlap^{1/2} max_g ||α_g||_2 ||β||_{2,1,G}.    (50)

Lemma 4. Consider the model in Equation 1. Suppose |G| = M ≥ 2 and n ≥ 1. Assume that the entries of ε are i.i.d. Gaussian with mean 0 and variance σ². Let X be normalized so that the diagonal entries of X^T X / n are all equal to 1.
Let {v^{β̂−β}_g} denote a decomposition of β̂ − β minimizing the ||·||_{2,1,G} norm. Let J = J(β^0) = {g : v^0_g ≠ 0} be the set of groups that are nonzero in the norm-minimizing decomposition of β^0. Let:

    λ = (2σ √(max_g |g|) / √n) · (1 + A log M / √(max_g |g|))^{1/2}.    (51)

Here, A > 8. Define q = min{A √(min_g |g|) / 8, 8 log M}. Then, with probability at least 1 − M^{1−q}, for any solution β̂ to Equation 14 and for all β ∈ R^p, the following inequality holds:

    (1/n) ||X(β̂ − β^0)||² + λ ||β̂ − β||_{2,1,G} ≤ (1/n) ||X(β − β^0)||² + 4λ Σ_{g ∈ J} ||v^{β̂−β}_g||.    (52)

Proof. We follow the proof strategy of Lounici et al. (2009). For all β ∈ R^p, we have:

    (1/n) ||Xβ̂ − y||² + 2λ ||β̂||_{2,1,G} ≤ (1/n) ||Xβ − y||² + 2λ ||β||_{2,1,G}.    (53)

Let y = Xβ^0 + ε to obtain:

    (1/n) ||X(β̂ − β^0)||² ≤ (1/n) ||X(β − β^0)||² + (2/n) ε^T X(β̂ − β) + 2λ (||β||_{2,1,G} − ||β̂||_{2,1,G}).    (54)

We now examine the second term on the right-hand side:

    (2/n) ε^T X(β̂ − β) ≤ (2 G_overlap^{1/2} / n) max_g ||ε^T X_g|| · ||β̂ − β||_{2,1,G}    (55)
    = (2 G_overlap^{1/2} / n) max_g [ Σ_{j ∈ g} (Σ_{i=1}^n X_{ij} ε_i)² ]^{1/2} · ||β̂ − β||_{2,1,G}.    (56)

Here, we apply our version of Hölder's inequality (Lemma 3). We now consider the event:

    A = { (1/n) max_g [ Σ_{j ∈ g} (Σ_{i=1}^n X_{ij} ε_i)² ]^{1/2} ≤ λ / (2 G_overlap^{1/2}) }.    (57)

Note that the random variables V_{g(j)} = (1/(σ√n)) Σ_{i=1}^n X_{ij} ε_i, where g(j) denotes the j-th element of g ∈ G, are standard Gaussian random variables. Within a group, they have a multivariate normal distribution with covariance matrix X_g^T X_g / n, where X_g denotes the submatrix of X consisting of the columns indexed by the group g.
It then follows that, provided X_g^T X_g is invertible, (X_g^T X_g)^{-1/2} X_g^T ε / σ is a vector of i.i.d. standard normal random variables. Thus, letting ρ_g denote the maximal eigenvalue of (X_g^T X_g / n)^{1/2}, we have ||X_g^T ε / (σ√n)|| ≤ ρ_g ||(X_g^T X_g / n)^{-1/2} X_g^T ε / (σ√n)|| by properties of the operator norm of (X_g^T X_g / n)^{1/2}. Now, for any g ∈ G define:

    γ_g = (2σ √(|g| G_overlap) / √n) · (1 + A log M / √|g|)^{1/2}.    (58)

Note that γ_g ≤ λ for all g ∈ G. Now:

    P( Σ_{j ∈ g} (Σ_{i=1}^n X_{ij} ε_i)² ≥ λ² n² / (4 G_overlap) ) ≤ P( ρ_g² χ²_{|g|} ≥ λ² n / (4σ² G_overlap) )    (59)
    ≤ P( χ²_{|g|} ≥ γ_g² n / (4σ² ρ_g² G_overlap) )    (60)
    = P( χ²_{|g|} ≥ ρ_g^{-2} (|g| + A √|g| log M) )    (61)
    ≤ exp( −ρ_g^{-2} (A log M / 8) min{√|g|, A log M} )    (62)
    ≤ exp( −min_g[ρ_g^{-2}] (A log M / 8) min{min_g √|g|, A log M} ).    (63)

In the above, we used Lemma 2 for the probability bound on χ² variables. We now apply the union bound to obtain:

    P(A^c) ≤ M exp( −min_g[ρ_g^{-2}] (A log M / 8) min{min_g √|g|, A log M} )    (64)
    ≤ M^{1−q}.    (65)

Now, on the event A, we can obtain, from Equation 54:

    (1/n) ||X(β̂ − β^0)||² + λ ||β̂ − β||_{2,1,G}    (66)
    ≤ (1/n) ||X(β − β^0)||² + 2λ (||β̂ − β||_{2,1,G} + ||β||_{2,1,G} − ||β̂||_{2,1,G})    (67)
    ≤ (1/n) ||X(β − β^0)||² + 4λ Σ_{g ∈ J} ||v^{β̂−β}_g||,    (68)

where {v^{β̂−β}_g} denotes a decomposition of β̂ − β minimizing the ||·||_{2,1,G} norm. Note that the last line follows from the fact that ||·||_{2,1,G} obeys the triangle inequality. This gives us the desired result in Equation 52.

A.1.2. Proof of Theorem 1

Again, we follow the proof strategy of Lounici et al. (2009). Fix a decomposition of β^0: {v^0_g}. Let J = J(β^0) = {g : v^0_g ≠ 0}.
Let the event A in Lemma 4 hold, and let β = β^0 in the inequality (52):

    λ ||β̂ − β^0||_{2,1,G} ≤ 4λ Σ_{g ∈ J} ||v^{β̂−β^0}_g||,    (69)
    ⟹ Σ_{g ∈ J^c} ||v^{β̂−β^0}_g|| ≤ 3 Σ_{g ∈ J} ||v^{β̂−β^0}_g||.    (70)

Thus, we can apply Assumption 1 with ∆ = β̂ − β^0, V(∆) = {v^{β̂−β^0}_g} to obtain:

    Σ_{g ∈ J} ||v^{β̂−β^0}_g|| ≤ ||X(β̂ − β^0)|| / (κ √n).    (71)

Again, when the event A in Lemma 4 holds and for β = β^0 in the inequality (52):

    (1/n) ||X(β̂ − β^0)||² ≤ 4λ Σ_{g ∈ J} ||v^{β̂−β^0}_g||    (72)
    ≤ (4λ / (κ √n)) ||X(β̂ − β^0)||    (73)
    ⟹ (1/n²) ||X(β̂ − β^0)||² ≤ 16λ² / (κ² n)    (74)
    ⟹ (1/n) ||X(β̂ − β^0)||² ≤ (64σ² / (κ² n)) (max_g |g| + A √(max_g |g|) log M).    (75)

This corresponds to the result in Equation 22. Equation 23 follows from an analogous chain, beginning with the inequality (69).

A.2. Asymptotic Setting

Before we prove the main result, we give the following lemma.

Lemma 5. Let Assumption 4 hold. For g ∈ G_{H^o}, n^{γ/2} ||v^OLS_g||^γ is O_p(1) for γ > 0.

Proof. By Assumption 4, we may denote {v^0_g} = V(β^0) as the unique decomposition minimizing the norm ||β^0||_{2,1,G}. To make the dependence on n explicit, we denote by β^OLS_n the ordinary least squares estimate of β^0 using n data points. We know β^OLS_n → β^0 in probability as n → ∞. By Assumption 4, there exists an N such that, with high probability, β^OLS_n has a unique decomposition for all n ≥ N. We denote this unique decomposition, minimizing ||β^OLS_n||_{2,1,G}, as {v^{OLS,n}_g} = V(β^OLS_n). We next write β^OLS_n = β^0 + δ_n, and then define the decomposition v^{δ_n}_g = v^{OLS,n}_g − v^0_g. Recall that for g ∈ G_{H^o} we have ||v^0_g|| = 0, and furthermore ||β^OLS_n||_{2,1,G} → ||β^0||_{2,1,G} in probability.
Thus, considering the terms in ||β^OLS_n||_{2,1,G} corresponding to those g ∈ G_{H^o}, we conclude ||v^{δ_n}_g|| → 0 in probability as n → ∞ for g ∈ G_{H^o}. Finally, for g ∈ G_{H^o}: √n ||v^OLS_g|| = √n (||v^OLS_g|| − ||v^0_g||) = √n (||v^{δ_n}_g|| − 0) ∈ O_p(1). The result then follows for γ > 0 by the continuous mapping theorem.

A.2.1. Proof of Theorem 2

We follow the general proof strategy of Theorem 3.2 from Nardi and Rinaldo (2008), which is adapted from similar results on the lasso from Fu and Knight (2000) and Zou (2006). First, define β_n = β^0 + u/√n. Let {v^0_g} = V(β^0) and {v^n_g} = V(β_n) be decompositions minimizing ||β^0||_{2,1,G} and ||β_n||_{2,1,G}, respectively. Therefore, the following is a decomposition of u: ∀ g ∈ G, v^u_g = √n (v^n_g − v^0_g). To begin, we write the objective from Equation 15 (multiplied by n) as:

    Q_n(u) = (1/2) ||ε − (1/√n) Xu||² + Σ_g n λ λ_g ||v^0_g + (1/√n) v^u_g||.

Let:

    D_n(u) = Q_n(u) − Q_n(0)
           = (1/(2n)) u^T X^T X u − (1/√n) u^T X^T ε + √n λ Σ_g λ_g √n (||v^0_g + (1/√n) v^u_g|| − ||v^0_g||)
           = I_{1,n} + Σ_g I_{2,n,g}.

We now proceed to examine the terms in the second summation. The behavior of these terms depends on the group g:

• For g ∈ G_H, we have λ_g → 1/||v^0_g||^γ in probability, by the uniqueness of the decomposition {v^0_g} along with Assumption 4. Also:

    √n (||v^0_g + (1/√n) v^u_g|| − ||v^0_g||) → (v^u_g)^T v^0_g / ||v^0_g||.

Since √n λ = o(1), the term I_{2,n,g} → 0.

• For g ∈ G_{H^c}, n^{γ/2} ||v^OLS_g||^γ = O_p(1) and:

    √n (||v^0_g + (1/√n) v^u_g|| − ||v^0_g||) = ||v^u_g||.

Since n^{(γ+1)/2} λ → ∞, we have I_{2,n,g} → ∞ whenever v^u_g ≠ 0.

• For g ∈ G_{H^o}, n^{γ/2} ||v^OLS_g||^γ is O_p(1) by Lemma 5. As before,

    √n (||v^0_g + (1/√n) v^u_g|| − ||v^0_g||) = ||v^u_g||.
So again I_{2,n,g} → ∞ whenever v^u_g ≠ 0. Now, I_{1,n} → (1/2) u^T M u − u^T W, where W ~ N_p(0, σ² M). Since p is fixed and finite, it follows that D_n(u) → D(u), where:

    D(u) = (1/2) u^T M u − u^T W   if v^u_g = 0 for all g ∉ G_H,
    D(u) = ∞                       otherwise.

Now, u = (M_H^{-1} W_H, 0)^T minimizes D(u), and so by the argmax theorem (van der Vaart and Wellner, 1998, Corollary 3.2.3), the result follows.

References

Bach, F. (2008a). Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research 9 1179–1225.
Bach, F. (2008b). Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning. In Advances in Neural Information Processing Systems. NIPS '08.
Bach, F. (2010a). Shaping Level Sets with Submodular Functions. Technical Report No. arXiv:1012.1501v1.
Bach, F. (2010b). Structured Sparsity-Inducing Norms through Submodular Functions. In Advances in Neural Information Processing Systems. NIPS '10.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37 1705–1732.
Chesneau, C. and Hebiri, M. (2008). Some theoretical results on the Grouped Variables Lasso. Mathematical Methods of Statistics 17 317–326.
Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28 1356–1378.
Huang, J., Zhang, T. and Metaxas, D. (2009). Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09 417–424. ACM, New York, NY, USA.
Huang, J. and Zhang, T. (2010). The Benefit of Group Sparsity. Annals of Statistics 38 1978–2004.
Jacob, L., Obozinski, G. and Vert, J.-P. (2009). Group lasso with overlap and graph lasso.
In Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09 433–440. ACM, New York, NY, USA.
Jenatton, R., Audibert, J.-Y. and Bach, F. (2009). Structured Variable Selection with Sparsity-Inducing Norms. Technical Report No. arXiv:0904.3523v3.
Jenatton, R., Obozinski, G. and Bach, F. (2010). Structured Sparse Principal Component Analysis. In Proceedings of the International Conference on Artificial Intelligence and Statistics. AISTATS '10.
Kim, S. and Xing, E. (2010). Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity. In Proceedings of the 27th International Conference on Machine Learning. ICML '10.
Lounici, K., Tsybakov, A. B., Pontil, M. and van de Geer, S. A. (2009). Taking Advantage of Sparsity in Multi-Task Learning. In COLT 2009.
Nardi, Y. and Rinaldo, A. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics 2 605–633.
Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics 4 53–77.
Percival, D., Roeder, K., Rosenfeld, R. and Wasserman, L. (2011). Structured, Sparse Regression With Application to HIV Drug Resistance. Annals of Applied Statistics. To appear.
Tibshirani, R. (1996). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B 58 267–288.
van der Vaart, A. W. and Wellner, J. A. (1998). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 49–67.
Zhao, P., Rocha, G. and Yu, B. (2007). The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37 3468–3497.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101 1418–1429.