Structured Sparsity via Alternating Direction Methods


Authors: Zhiwei Qin, Donald Goldfarb

Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027

October 25, 2018

Abstract: We consider a class of sparse learning problems in high-dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty $l_1/l_2$-norm and the $l_1/l_\infty$-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms.

Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian

(Research supported in part by DMS 10-16571, ONR Grant N00014-08-1-1118, and DOE Grant DE-FG02-08ER25856. Email: zq2107@columbia.edu, goldfarb@columbia.edu.)

1 Introduction

For feature learning problems in a high-dimensional space, sparsity in the feature vector is usually a desirable property. Many statistical models have been proposed in the literature to enforce sparsity, dating back to the classical Lasso model ($l_1$-regularization) [45, 10].
The Lasso model is particularly appealing because it can be solved by very efficient proximal gradient methods; e.g., see [12]. However, the Lasso does not take into account the structure of the features [50]. In many real applications, the features in a learning problem are often highly correlated, exhibiting a group structure. Structured sparsity has been shown to be effective in those cases. The Group Lasso model [49, 2, 41] assumes disjoint groups and enforces sparsity on the pre-defined groups of features. This model has been extended to allow for groups that are hierarchical as well as overlapping [25, 27, 3], with a wide array of applications from gene selection [27] to computer vision [23, 26]. For image denoising problems, extensions with non-integer block sizes and adaptive partitions have been proposed in [36, 37]. In this paper, we consider the following basic model of minimizing the squared-error loss with a regularization term to induce group sparsity:

$$\min_{x \in \mathbb{R}^m} \; L(x) + \Omega(x), \qquad (1)$$

where $L(x) = \frac{1}{2}\|Ax - b\|^2$, $A \in \mathbb{R}^{n \times m}$, and

$$\Omega(x) = \begin{cases} \Omega_{l_1/l_2}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|, & \text{or} \\ \Omega_{l_1/l_\infty}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|_\infty, \end{cases} \qquad (2)$$

$S = \{s_1, \dots, s_{|S|}\}$ is the set of group indices with $|S| = J$, and the elements (features) in the groups possibly overlap [11, 30, 25, 3]. In this model, $\lambda$, $w_s$, and $S$ are all pre-defined, and $\|\cdot\|$ without a subscript denotes the $l_2$-norm. We note that the penalty term $\Omega_{l_1/l_2}(x)$ in (2) is different from the one proposed in [24], although both are called overlapping Group Lasso penalties. In particular, (1)-(2) cannot be cast into a non-overlapping group lasso problem as is done in [24].
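To make the model concrete, the objective in (1)-(2) is straightforward to evaluate directly. The following sketch is our own illustration (the helper name and toy data are not from the paper); it computes $F_{\mathrm{obj}}$ for overlapping groups under either norm:

```python
import numpy as np

def overlapping_group_lasso_objective(A, b, x, groups, lam, weights=None, inf_norm=False):
    """Evaluate F(x) = 0.5*||Ax - b||^2 + lam * sum_s w_s * ||x_s|| of (1)-(2).

    `groups` is a list of index lists; groups may overlap. With inf_norm=True
    the l1/l-infinity penalty is used instead of the l1/l2 one.
    """
    if weights is None:
        weights = np.ones(len(groups))
    block_norm = (lambda z: np.max(np.abs(z))) if inf_norm else np.linalg.norm
    loss = 0.5 * np.sum((A @ x - b) ** 2)
    penalty = lam * sum(w * block_norm(x[g]) for w, g in zip(weights, groups))
    return loss + penalty

# Toy instance: feature 1 belongs to both groups; at x = b the loss vanishes.
A = np.eye(3)
b = np.array([1.0, 2.0, 2.0])
groups = [[0, 1], [1, 2]]
F = overlapping_group_lasso_objective(A, b, b, groups, lam=0.5)
```

Here $F = 0.5\,(\|(1,2)\| + \|(2,2)\|) = 0.5(\sqrt{5} + \sqrt{8})$, illustrating that an overlapping feature contributes to every group norm that contains it.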
1.1 Related Work

Two proximal gradient methods have been proposed to solve a close variant of (1) with an $l_1/l_2$ penalty,

$$\min_{x \in \mathbb{R}^m} \; L(x) + \Omega_{l_1/l_2}(x) + \lambda \|x\|_1, \qquad (3)$$

which has an additional $l_1$-regularization term on $x$. Chen et al. [11] replace $\Omega_{l_1/l_2}(x)$ with a smooth approximation $\Omega_\eta(x)$ using Nesterov's smoothing technique [33] and solve the resulting problem by the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [4]. The parameter $\eta$ is a smoothing parameter, upon which the practical and theoretical convergence speed of the algorithm critically depends. Liu and Ye [29] also apply FISTA to solve (3), but in each iteration they transform the computation of the proximal operator associated with the combined penalty term into an equivalent constrained smooth problem and solve it by Nesterov's accelerated gradient descent method [33]. Mairal et al. [30] apply the accelerated proximal gradient method to (1) with the $l_1/l_\infty$ penalty and propose a network flow algorithm to solve the proximal problem associated with $\Omega_{l_1/l_\infty}(x)$. Mosci et al.'s method [32] for solving the Group Lasso problem in [24] is in the same spirit as [29], but their approach uses a projected Newton method.

1.2 Our Contributions

We take a unified approach to tackle problem (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. Our strategy is to develop efficient algorithms based on the Alternating Linearization Method with Skipping (ALM-S) [19] and FISTA for solving an equivalent constrained version of problem (1) (to be introduced in Section 2) in an augmented Lagrangian method framework. Specifically, we make the following contributions in this paper:

• We build a general framework based on the augmented Lagrangian method, under which learning problems with both $l_1/l_2$ and $l_1/l_\infty$ regularizations (and their variants) can be solved.
This framework allows for experimentation with its key building blocks.

• We propose new algorithms, ALM-S with partial splitting (APLM-S) and FISTA with partial linearization (FISTA-p), to serve as the key building blocks of this framework. We prove that APLM-S and FISTA-p have convergence rates of $O(1/k)$ and $O(1/k^2)$ respectively, where $k$ is the number of iterations. Our algorithms are easy to implement and tune, and they do not require line-search, eliminating the need to evaluate the objective function at every iteration.

• We evaluate the quality and speed of the proposed algorithms and framework against state-of-the-art approaches on a rich set of synthetic test data, and compare the $l_1/l_2$ and $l_1/l_\infty$ models on breast cancer gene expression data [47] and a video sequence background subtraction task [30].

2 A Variable-Splitting Augmented Lagrangian Framework

In this section, we present a unified framework, based on variable splitting and the augmented Lagrangian method, for solving (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. This framework reformulates problem (1) as an equivalent linearly-constrained problem via the following variable-splitting procedure. Let $y \in \mathbb{R}^{\sum_{s \in S} |s|}$ be the vector obtained from $x \in \mathbb{R}^m$ by repeating components of $x$ so that no component of $y$ belongs to more than one group, and let $M = \sum_{s \in S} |s|$. The relationship between $x$ and $y$ is specified by the linear constraint $Cx = y$, where the $(i,j)$-th element of the matrix $C \in \mathbb{R}^{M \times m}$ is

$$C_{i,j} = \begin{cases} 1, & \text{if } y_i \text{ is a replicate of } x_j, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

For examples of $C$, refer to [11]. Consequently, (1) is equivalent to

$$\min \; F_{\mathrm{obj}}(x, y) \equiv \tfrac{1}{2}\|Ax - b\|^2 + \tilde{\Omega}(y) \quad \text{s.t. } Cx = y, \qquad (5)$$

where $\tilde{\Omega}(y)$ is the non-overlapping group-structured penalty term corresponding to $\Omega$ defined in (2).
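The replication matrix $C$ of (4) is easy to construct from a list of (possibly overlapping) groups; a minimal sketch (the function name is our own, for illustration only):

```python
import numpy as np

# Build the replication matrix C of (4): one row of y per group membership,
# each row carrying a single 1 in the column of the replicated feature.
def build_replication_matrix(groups, m):
    M = sum(len(g) for g in groups)      # M = sum_{s in S} |s|
    C = np.zeros((M, m))
    i = 0
    for g in groups:
        for j in g:
            C[i, j] = 1.0                # y_i is a replicate of x_j
            i += 1
    return C

groups = [[0, 1], [1, 2]]                # feature 1 is shared by both groups
C = build_replication_matrix(groups, m=3)
y = C @ np.array([5.0, 6.0, 7.0])        # -> [5, 6, 6, 7]: x_1 is duplicated
D = C.T @ C                              # D_jj counts the groups containing x_j
```

Note that $D = C^T C$ comes out diagonal, with diagonal entries $(1, 2, 1)$ in this example: each entry counts how many groups contain the corresponding feature.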
Note that $C$ is a highly sparse matrix, and $D = C^T C$ is a diagonal matrix whose diagonal entries equal the number of times each entry of $x$ is included in some group. Problem (5) now involves two sets of variables $x$ and $y$, where $x$ appears only in the loss term $L(x)$ and $y$ appears only in the penalty term $\tilde{\Omega}(y)$. All the non-overlapping versions of $\Omega(\cdot)$, including the Lasso and Group Lasso, are special cases of $\Omega(\cdot)$ with $C = I$, i.e. $x = y$; hence, (5) in this case is equivalent to applying variable-splitting on $x$. Problems with a composite penalty term, such as the Elastic Net, $\lambda_1 \|x\|_1 + \lambda_2 \|x\|^2$, can also be reformulated in a similar way by merging the smooth part of the penalty term ($\lambda_2 \|x\|^2$ in the case of the Elastic Net) with the loss function $L(x)$.

To solve (5), we apply the augmented Lagrangian method [22, 38, 34, 6]. This method, Algorithm 2.1, minimizes the augmented Lagrangian

$$\mathcal{L}(x, y, v) = \tfrac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \tfrac{1}{2\mu}\|Cx - y\|^2 + \tilde{\Omega}(y) \qquad (6)$$

exactly for a given Lagrange multiplier $v$ in every iteration, followed by an update to $v$. The parameter $\mu$ in (6) controls the amount of weight placed on violations of the constraint $Cx = y$.

Algorithm 2.1 AugLag
1: Choose $x^0, y^0, v^0$.
2: for $l = 0, 1, \dots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow \arg\min_{x,y} \mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for

Algorithm 2.1 can also be viewed as a dual ascent algorithm applied to $P(v) = \min_{x,y} \mathcal{L}(x, y, v)$ [5], where $v$ is the dual variable, $\frac{1}{\mu}$ is the step-length, and $Cx - y$ is the gradient $\nabla_v P(v)$. This algorithm does not require $\mu$ to be very small to guarantee convergence to the solution of problem (5) [34]. However, solving the problem in Line 3 of Algorithm 2.1 exactly can be very challenging in the case of structured sparsity.
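The outer loop of Algorithm 2.1 can be sketched generically. Below, `approx_min` stands in for the minimization in Line 3, $\mu$ is kept fixed for simplicity, and we exercise the loop on a trivial instance with $\tilde{\Omega} = 0$, for which the joint minimizer of (6) is available in closed form (all names and data are our own, not the paper's code):

```python
import numpy as np

# Sketch of the outer loop of Algorithm 2.1: minimize the augmented
# Lagrangian (exactly or approximately), then take a dual step on v.
def auglag(approx_min, C, x, y, v, mu, iters=20):
    for _ in range(iters):
        x, y = approx_min(x, y, v, mu)
        v = v - (C @ x - y) / mu          # multiplier update, step-length 1/mu
    return x, y, v

# Trivial instance with no penalty: the joint minimizer of (6) is the
# least-squares x together with y = Cx - mu*v (first-order conditions).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
C = np.eye(2)
def exact_min(x, y, v, mu):
    x = np.linalg.solve(A.T @ A, A.T @ b)
    return x, C @ x - mu * v
x, y, v = auglag(exact_min, C, np.zeros(2), np.zeros(2), np.ones(2), mu=0.1)
```

On this instance the multiplier becomes exact ($v = 0$) after the first outer iteration, and the constraint $Cx = y$ holds exactly from the second iteration onward.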
We instead seek an approximate minimizer of the augmented Lagrangian via the abstract subroutine ApproxAugLagMin($x, y, v$). The following theorem [40] guarantees the convergence of this inexact version of Algorithm 2.1.

Theorem 2.1. Let $\alpha_l := \mathcal{L}(x^l, y^l, v^l) - \inf_{x \in \mathbb{R}^m, y \in \mathbb{R}^M} \mathcal{L}(x, y, v^l)$ and let $F^*$ be the optimal value of problem (5). Suppose problem (5) satisfies the modified Slater's condition and

$$\sum_{l=1}^{\infty} \sqrt{\alpha_l} < +\infty. \qquad (7)$$

Then the sequence $\{v^l\}$ converges to $v^*$, which satisfies

$$\inf_{x \in \mathbb{R}^m, y \in \mathbb{R}^M} \left\{ F_{\mathrm{obj}}(x, y) - (v^*)^T (Cx - y) \right\} = F^*,$$

while the sequence $\{x^l, y^l\}$ satisfies $\lim_{l \to \infty} Cx^l - y^l = 0$ and $\lim_{l \to \infty} F_{\mathrm{obj}}(x^l, y^l) = F^*$.

Condition (7) requires that the augmented Lagrangian subproblem be solved with increasing accuracy. We formally state this framework in Algorithm 2.2.

Algorithm 2.2 OGLasso-AugLag
1: Choose $x^0, y^0, v^0$.
2: for $l = 0, 1, \dots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow$ ApproxAugLagMin($x^l, y^l, v^l$), an approximate minimizer of $\mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for

We index the iterations of Algorithm 2.2 by $l$ and call them 'outer iterations'. In Section 3, we develop algorithms that implement ApproxAugLagMin($x, y, v$); the iterations of these subroutines are indexed by $k$ and are called 'inner iterations'.

3 Methods for Approximately Minimizing the Augmented Lagrangian

In this section, we use the overlapping Group Lasso penalty $\Omega(x) = \lambda \sum_{s \in S} w_s \|x_s\|$ to illustrate the optimization algorithms under discussion. The case of $l_1/l_\infty$-regularization is discussed in Section 4. From now on, we assume without loss of generality that $w_s = 1$ for every group $s$.
3.1 Alternating Direction Augmented Lagrangian (ADAL) Method

The well-known Alternating Direction Augmented Lagrangian (ADAL) method [15, 17, 18, 8]¹ approximately minimizes the augmented Lagrangian by minimizing (6) with respect to $x$ and $y$ alternatingly and then updating the Lagrange multiplier $v$ on each iteration (e.g., see [7], Section 3.4). Specifically, the single-iteration procedure that serves as ApproxAugLagMin($x, y, v$) is given below as Algorithm 3.1.

Algorithm 3.1 ADAL
1: Given $x^l$, $y^l$, and $v^l$.
2: $x^{l+1} \leftarrow \arg\min_x \mathcal{L}(x, y^l, v^l)$
3: $y^{l+1} \leftarrow \arg\min_y \mathcal{L}(x^{l+1}, y, v^l)$
4: return $x^{l+1}, y^{l+1}$.

The ADAL method, also known as the alternating direction method of multipliers (ADMM) and the split Bregman method, has recently been applied to problems in signal and image processing [12, 1, 20] and low-rank matrix recovery [28]. Its convergence has been established in [15]. The method can accommodate a sum of more than two functions. For example, by applying variable-splitting (e.g., see [7, 8]) to the problem $\min_x f(x) + \sum_{i=1}^K g_i(C_i x)$, it can be transformed into

$$\min_{x, y_1, \dots, y_K} \; f(x) + \sum_{i=1}^K g_i(y_i) \quad \text{s.t. } y_i = C_i x, \; i = 1, \dots, K.$$

The subproblems corresponding to the $y_i$'s can thus be solved simultaneously by the ADAL method. This so-called simultaneous direction method of multipliers (SDMM) [42] is related to Spingarn's method of partial inverses [43] and has been shown to be a special instance of a more general parallel proximal algorithm with inertia parameters [35].

Note that the problem solved in Line 3 of Algorithm 3.1,

$$y^{l+1} = \arg\min_y \mathcal{L}(x^{l+1}, y, v^l) \equiv \arg\min_y \left\{ \tfrac{1}{2\mu}\|d^l - y\|^2 + \tilde{\Omega}(y) \right\}, \qquad (8)$$

where $d^l = Cx^{l+1} - \mu v^l$, is group-separable and hence can be solved in parallel.
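Each block of the group-separable problem (8) has a closed-form solution; a minimal NumPy sketch of the resulting blockwise shrinkage (our own illustration, assuming unit group weights):

```python
import numpy as np

# Block soft-thresholding: closed-form solution of one block of (8),
#   min_y  (1/(2*mu)) * ||d - y||^2 + lam * ||y||,
# namely T(d, mu*lam) = (d / ||d||) * max(0, ||d|| - lam*mu).
def block_soft_threshold(d, tau):
    nrm = np.linalg.norm(d)
    if nrm == 0.0:
        return np.zeros_like(d)
    return (d / nrm) * max(0.0, nrm - tau)

y = block_soft_threshold(np.array([3.0, 4.0]), tau=2.5)   # ||d|| = 5 -> [1.5, 2.0]
```

Blocks whose norm falls below the threshold are set to zero wholesale, which is exactly how the penalty induces group sparsity.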
As in [39], each subproblem can be solved by applying the block soft-thresholding operator

$$T(d^l_s, \mu\lambda) \equiv \frac{d^l_s}{\|d^l_s\|} \max(0, \|d^l_s\| - \lambda\mu), \quad s = 1, \dots, J.$$

Solving for $x^{l+1}$ in Line 2 of Algorithm 3.1, i.e.

$$x^{l+1} = \arg\min_x \mathcal{L}(x, y^l, v^l) \equiv \arg\min_x \left\{ \tfrac{1}{2}\|Ax - b\|^2 - (v^l)^T Cx + \tfrac{1}{2\mu}\|Cx - y^l\|^2 \right\}, \qquad (9)$$

involves solving the linear system

$$\left( A^T A + \tfrac{1}{\mu} D \right) x = A^T b + C^T v^l + \tfrac{1}{\mu} C^T y^l, \qquad (10)$$

where the matrix on the left-hand side of (10) has dimension $m \times m$. Many real-world data sets, such as gene expression data, are highly under-determined: the number of features ($m$) is much larger than the number of samples ($n$). In such cases, one can use the Sherman-Morrison-Woodbury formula,

$$\left( A^T A + \tfrac{1}{\mu} D \right)^{-1} = \mu D^{-1} - \mu^2 D^{-1} A^T \left( I + \mu A D^{-1} A^T \right)^{-1} A D^{-1}, \qquad (11)$$

and instead solve an $n \times n$ linear system involving the matrix $I + \mu A D^{-1} A^T$. In addition, as long as $\mu$ stays the same, one has to factorize $A^T A + \frac{1}{\mu} D$ or $I + \mu A D^{-1} A^T$ only once and can store the factors for subsequent iterations. When both $n$ and $m$ are very large, it might be infeasible to compute or store $A^T A$, let alone its eigen-decomposition or the Cholesky decomposition of $A^T A + \frac{1}{\mu} D$. In this case, one can solve the linear systems using the preconditioned Conjugate Gradient (PCG) method [21]. Similar comments apply to the other algorithms proposed in Sections 3.2-3.4 below. Alternatively, we can apply FISTA to Line 3 in Algorithm 2.2 (see Section 3.5).

¹ Recently, Mairal et al. [31] also applied ADAL with two variants based on variable-splitting to the overlapping Group Lasso problem.

3.2 ALM-S: partial split (APLM-S)

We now consider applying the Alternating Linearization Method with Skipping (ALM-S) from [19] to approximately minimize (6).
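Before turning to ALM-S, here is a quick numerical check (our own, with random data) of the Sherman-Morrison-Woodbury route (11): the $m \times m$ solve in (10) agrees with the reduced $n \times n$ solve.

```python
import numpy as np

# Verify (11): replace the m x m system (A^T A + D/mu) x = rhs by an
# n x n system involving I + mu * A D^{-1} A^T.
rng = np.random.default_rng(0)
n, m, mu = 3, 8, 0.5                          # under-determined regime, m >> n
A = rng.standard_normal((n, m))
d = rng.integers(1, 4, size=m).astype(float)  # diagonal of D (group counts >= 1)
D_inv = np.diag(1.0 / d)
rhs = rng.standard_normal(m)

x_direct = np.linalg.solve(A.T @ A + np.diag(d) / mu, rhs)        # m x m solve

small = np.eye(n) + mu * A @ D_inv @ A.T                          # n x n solve
x_smw = mu * (D_inv @ rhs) - mu**2 * D_inv @ A.T @ np.linalg.solve(small, A @ (D_inv @ rhs))
```

The two solutions coincide to machine precision, while only a $3 \times 3$ system is factorized in the second route.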
In particular, we apply variable splitting (Section 2) to the variable $y$, to which the group-sparse regularizer $\tilde{\Omega}$ is applied (the original ALM-S splits both variables $x$ and $y$), and reformulate (6) as follows:

$$\min_{x, y, \bar{y}} \; \tfrac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \tfrac{1}{2\mu}\|Cx - y\|^2 + \tilde{\Omega}(\bar{y}) \quad \text{s.t. } y = \bar{y}. \qquad (12)$$

Note that the Lagrange multiplier $v$ is fixed here. Defining

$$f(x, y) := \tfrac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \tfrac{1}{2\mu}\|Cx - y\|^2, \qquad (13)$$

$$g(y) := \tilde{\Omega}(y) = \lambda \sum_s \|y_s\|, \qquad (14)$$

problem (12) is of the form

$$\min \; f(x, y) + g(\bar{y}) \quad \text{s.t. } y = \bar{y}, \qquad (15)$$

to which we now apply partial linearization.

3.2.1 Partial linearization and convergence rate analysis

Let us define

$$F(x, y) := f(x, y) + g(y) = \mathcal{L}(x, y; v), \qquad (16)$$

$$\mathcal{L}_\rho(x, y, \bar{y}, \gamma) := f(x, y) + g(\bar{y}) + \gamma^T(\bar{y} - y) + \tfrac{1}{2\rho}\|\bar{y} - y\|^2, \qquad (17)$$

where $\gamma$ is the Lagrange multiplier in the augmented Lagrangian (17) corresponding to problem (15), and $\gamma_g(\bar{y})$ denotes a subgradient of $g$ at $\bar{y}$. We now present our partial-split alternating linearization algorithm, which implements ApproxAugLagMin($x, y, v$) in Algorithm 2.2. We note that in Line 6 of Algorithm 3.2,

$$x^{k+1} = \arg\min_x \mathcal{L}_\rho(x; y^{k+1}, \bar{y}^k, \gamma^k) \equiv \arg\min_x f(x; y^{k+1}) \equiv \arg\min_x f(x; \bar{y}^k). \qquad (18)$$

We now have a variant of Lemma 2.2 in [19].

Algorithm 3.2 APLM-S
1: Given $x^0$, $\bar{y}^0$, $v$. Choose $\rho$, $\gamma^0$ such that $-\gamma^0 \in \partial g(\bar{y}^0)$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \dots$ until stopping criterion is satisfied do
3:   $(x^{k+1}, y^{k+1}) \leftarrow \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar{y}^k, \gamma^k)$.
4:   if $F(x^{k+1}, y^{k+1}) > \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar{y}^k, \gamma^k)$ then
5:     $y^{k+1} \leftarrow \bar{y}^k$
6:     $x^{k+1} \leftarrow \arg\min_x f(x, y^{k+1}) \equiv \arg\min_x \mathcal{L}_\rho(x; y^{k+1}, \bar{y}^k, \gamma^k)$
7:   end if
8:   $\bar{y}^{k+1} \leftarrow p_f(x^{k+1}, y^{k+1}) \equiv \arg\min_{\bar{y}} \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar{y}, \nabla_y f(x^{k+1}, y^{k+1}))$
9:   $\gamma^{k+1} \leftarrow \nabla_y f(x^{k+1}, y^{k+1}) - \frac{y^{k+1} - \bar{y}^{k+1}}{\rho}$
10: end for
11: return $(x^{K+1}, \bar{y}^{K+1})$

Lemma 3.1. For any $(x, y)$, if $\bar{q} := \arg\min_{\bar{y}} \mathcal{L}_\rho(x, y, \bar{y}, \nabla_y f(x, y))$ and

$$F(x, \bar{q}) \leq \mathcal{L}_\rho(x, y, \bar{q}, \nabla_y f(x, y)), \qquad (19)$$

then for any $(\bar{x}, \bar{y})$,

$$2\rho \left( F(\bar{x}, \bar{y}) - F(x, \bar{q}) \right) \geq \|\bar{q} - \bar{y}\|^2 - \|y - \bar{y}\|^2 + 2\rho (\bar{x} - x)^T \nabla_x f(x, y). \qquad (20)$$

Similarly, for any $\bar{y}$, if $(p, q) := \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar{y}, -\gamma_g(\bar{y}))$ and

$$F(p, q) \leq \mathcal{L}_\rho((p, q), \bar{y}, -\gamma_g(\bar{y})), \qquad (21)$$

then for any $(x, y)$,

$$2\rho \left( F(x, y) - F(p, q) \right) \geq \|q - y\|^2 - \|\bar{y} - y\|^2. \qquad (22)$$

Proof. See Appendix A.

Algorithm 3.2 checks condition (21) at Line 4 because the function $g$ is non-smooth, and condition (21) may fail to hold no matter what the value of $\rho$ is. When this condition is violated, a skipping step occurs, in which the value of $y$ is set to the value of $\bar{y}$ from the previous iteration (Line 5) and $\mathcal{L}_\rho$ is re-minimized with respect to $x$ (Line 6) to ensure convergence. Let us define a regular iteration of Algorithm 3.2 to be an iteration in which no skipping step occurs, i.e. Lines 5 and 6 are not executed, and likewise a skipping iteration to be one in which a skipping step occurs. We are now ready to state the iteration complexity result for APLM-S.

Theorem 3.1. Assume that $\nabla_y f(x, y)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$, i.e. for any $x$, $\|\nabla_y f(x, y) - \nabla_y f(x, z)\| \leq L_y(f) \|y - z\|$ for all $y$ and $z$.
For $\rho \leq \frac{1}{L_y(f)}$, the iterates $(x^k, \bar{y}^k)$ of Algorithm 3.2 satisfy

$$F(x^k, \bar{y}^k) - F(x^*, y^*) \leq \frac{\|\bar{y}^0 - y^*\|^2}{2\rho (k + k_n)}, \quad \forall k, \qquad (23)$$

where $(x^*, y^*)$ is an optimal solution to (12), and $k_n$ is the number of regular iterations among the first $k$ iterations.

Proof. See Appendix B.

Remark 3.1. For Theorem 3.1 to hold, we need $\rho \leq \frac{1}{L_y(f)}$. From the definition of $f(x, y)$ in (13), it is easy to see that $L_y(f) = \frac{1}{\mu}$ regardless of the loss function $L(x)$. Hence, we set $\rho = \mu$, so that condition (19) in Lemma 3.1 is satisfied.

In Section 3.3, we will discuss the case where the iterations consist entirely of skipping steps. We will show that this is equivalent to ISTA [4] with partial linearization, as well as to a variant of ADAL. In this case, the inner Lagrange multiplier $\gamma$ is redundant.

3.2.2 Solving the subproblems

We now show how to solve the subproblems in Algorithm 3.2. First, observe that, since $\rho = \mu$,

$$\arg\min_{\bar{y}} \mathcal{L}_\rho(x, y, \bar{y}, \nabla_y f(x, y)) \equiv \arg\min_{\bar{y}} \left\{ \nabla_y f(x, y)^T \bar{y} + \tfrac{1}{2\mu}\|\bar{y} - y\|^2 + g(\bar{y}) \right\} \qquad (24)$$

$$\equiv \arg\min_{\bar{y}} \left\{ \tfrac{1}{2\mu}\|d - \bar{y}\|^2 + \lambda \sum_s \|\bar{y}_s\| \right\}, \qquad (25)$$

where $d = Cx - \mu v$. Hence, $\bar{y}$ can be obtained by applying the block soft-thresholding operator $T(d_s, \mu\lambda)$ as in Section 3.1. Next, consider the subproblem

$$\min_{(x,y)} \mathcal{L}_\rho(x, y, \bar{y}, \gamma) \equiv \min_{(x,y)} \left\{ f(x, y) + \gamma^T(\bar{y} - y) + \tfrac{1}{2\mu}\|\bar{y} - y\|^2 \right\}. \qquad (26)$$

It is easy to verify that solving the linear system given by the optimality conditions for (26) by block Gaussian elimination yields the system

$$\left( A^T A + \tfrac{1}{2\mu} D \right) x = r_x + \tfrac{1}{2} C^T r_y \qquad (27)$$

for computing $x$, where $r_x = A^T b + C^T v$ and $r_y = -v + \gamma + \frac{\bar{y}}{\rho}$. Then $y$ can be computed as $\frac{\mu}{2}\left( r_y + \frac{1}{\mu} Cx \right)$. As in Section 3.1, only one Cholesky factorization of $A^T A + \frac{1}{2\mu} D$ is required for each invocation of Algorithm 3.2.
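As a sanity check, the block-elimination formulas around (27) can be verified numerically: the $(x, y)$ they produce should satisfy the joint first-order optimality conditions of (26). A sketch with random data (our own, taking $\rho = \mu$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, M = 4, 5, 7
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
C = np.zeros((M, m))
for i, j in enumerate([0, 1, 1, 2, 3, 3, 4]):   # replication pattern of (4)
    C[i, j] = 1.0
D = C.T @ C                                     # diagonal
mu = 0.3
rho = mu
v, gam, ybar = (rng.standard_normal(M) for _ in range(3))

# Block elimination (27): solve for x, then recover y in closed form.
r_x = A.T @ b + C.T @ v
r_y = -v + gam + ybar / rho
x = np.linalg.solve(A.T @ A + D / (2 * mu), r_x + 0.5 * C.T @ r_y)
y = (mu / 2) * (r_y + C @ x / mu)

# Gradients of the objective in (26) with respect to x and y.
grad_x = A.T @ (A @ x - b) - C.T @ v + C.T @ (C @ x - y) / mu
grad_y = v - (C @ x - y) / mu - gam - (ybar - y) / rho
```

Both gradients vanish to machine precision, confirming that a single $m \times m$ solve plus a closed-form $y$-update recovers the joint minimizer of (26).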
Hence, the amount of work involved in each iteration of Algorithm 3.2 is comparable to that of an ADAL iteration. It is straightforward to derive an accelerated version of Algorithm 3.2, which we shall refer to as FAPLM-S, that corresponds to a partial-split version of the FALM algorithm proposed in [19] and also requires $O(\sqrt{L(f)/\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. In Section 3.4, we present the algorithm FISTA-p, a special version of FAPLM-S in which every iteration is a skipping iteration; it has a much simpler form than FAPLM-S while having essentially the same iteration complexity.

It is also possible to apply ALM-S directly, splitting both $x$ and $y$, to solve the augmented Lagrangian subproblem. Similarly to (12), we reformulate (6) as

$$\min_{(x,y),(\bar{x},\bar{y})} \; \tfrac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \tfrac{1}{2\mu}\|Cx - y\|^2 + \lambda \sum_s \|\bar{y}_s\| \quad \text{s.t. } x = \bar{x}, \; y = \bar{y}. \qquad (28)$$

The functions $f$ and $g$ are defined as in (13) and (14), except that we now write $g$ as $g(\bar{x}, \bar{y})$ even though the variable $\bar{x}$ does not appear in the expression for $g$. It can be shown that $\bar{y}$ admits exactly the same expression as in APLM-S, whereas $\bar{x}$ is obtained by a gradient step, $x - \rho \nabla_x f(x, y)$. To obtain $x$, we solve the linear system

$$\left( A^T A + \tfrac{1}{\mu + \rho} D + \tfrac{1}{\rho} I \right) x = r_x + \tfrac{\rho}{\mu + \rho} C^T r_y, \qquad (29)$$

after which $y$ is computed as $y = \left( \frac{\mu\rho}{\mu + \rho} \right)\left( r_y + \frac{1}{\mu} Cx \right)$.

Remark 3.2. For ALM-S, the Lipschitz constant for $\nabla f(x, y)$ is $L_f = \lambda_{\max}(A^T A) + \frac{1}{\mu} d_{\max}$, where $d_{\max} = \max_i D_{ii} \geq 1$. For the complexity results in [19] to hold, we need $\rho \leq \frac{1}{L_f}$. Since $\lambda_{\max}(A^T A)$ is usually not known, it is necessary to perform a backtracking line-search on $\rho$ to ensure that $F(x^{k+1}, y^{k+1}) \leq \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar{x}^k, \bar{y}^k, \gamma^k)$. In practice, we adopted the following continuation scheme instead.
We initially set $\rho = \rho_0 = \frac{\mu}{d_{\max}}$ and decreased $\rho$ by a factor of $\beta$ after a given number of iterations, until $\rho$ reached a user-supplied minimum value $\rho_{\min}$. This scheme prevents $\rho$ from becoming too small and hence negatively impacting computational performance. However, in both cases the left-hand side of the system (29) has to be re-factorized every time $\rho$ is updated.

As we have seen above, the Lipschitz constant resulting from splitting both $x$ and $y$ is potentially much larger than $\frac{1}{\mu}$. Hence, partial linearization reduces the Lipschitz constant, which improves the bound on the right-hand side of (23) and allows Algorithm 3.2 to take larger step sizes (equal to $\mu$). Compared to ALM-S, solving for $x$ in the skipping step (Line 6) becomes harder. Intuitively, APLM-S does a better job of 'load-balancing' by managing a better trade-off between the hardness of the subproblems and the practical convergence rate.

3.3 ISTA: partial linearization (ISTA-p)

We can also minimize the augmented Lagrangian (6), which we write as $\mathcal{L}(x, y, v) = f(x, y) + g(y)$ with $f(x, y)$ and $g(y)$ defined as in (13) and (14), using a variant of ISTA that linearizes $f(x, y)$ only with respect to the $y$ variables. As in Section 3.2, we can set $\rho = \mu$ and guarantee the convergence properties of ISTA-p (see Corollary 3.1 below). Formally, let $(x, y)$ be the current iterate and $(x^+, y^+)$ the next iterate. We compute $y^+$ by

$$y^+ = \arg\min_{y'} \mathcal{L}_\rho(x, y, y', \nabla_y f(x, y)) \qquad (30)$$

$$= \arg\min_{y'} \sum_j \left\{ \tfrac{1}{2\mu} \|y'_j - d^y_j\|^2 + \lambda \|y'_j\| \right\}, \qquad (31)$$

where $d^y = Cx - \mu v$. Hence, the solution $y^+$ to problem (31) is given blockwise by $T([d^y]_j, \mu\lambda)$, $j = 1, \dots, J$.
Now, given $y^+$, we solve for $x^+$ by

$$x^+ = \arg\min_{x'} f(x', y^+) = \arg\min_{x'} \left\{ \tfrac{1}{2}\|Ax' - b\|^2 - v^T(Cx' - y^+) + \tfrac{1}{2\mu}\|Cx' - y^+\|^2 \right\}. \qquad (32)$$

The algorithm that implements the subroutine ApproxAugLagMin($x, y, v$) in Algorithm 2.2 by ISTA with partial linearization is stated below as Algorithm 3.3. As we remarked in Section 3.2, Algorithm 3.3 is equivalent to Algorithm 3.2 (APLM-S) with every iteration being a skipping iteration. Hence, we have from Theorem 3.1:

Algorithm 3.3 ISTA-p (partial linearization)
1: Given $x^0$, $\bar{y}^0$, $v$. Choose $\rho$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \dots$ until stopping criterion is satisfied do
3:   $x^{k+1} \leftarrow \arg\min_x f(x; \bar{y}^k)$
4:   $\bar{y}^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, \bar{y}^k, y, \nabla_y f(x^{k+1}, \bar{y}^k))$
5: end for
6: return $(x^{K+1}, \bar{y}^{K+1})$

Corollary 3.1. Assume $\nabla_y f(\cdot, \cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$. For $\rho \leq \frac{1}{L_y(f)}$, the iterates $(x^k, \bar{y}^k)$ of Algorithm 3.3 satisfy

$$F(x^k, \bar{y}^k) - F(x^*, y^*) \leq \frac{\|\bar{y}^0 - y^*\|^2}{2\rho k}, \quad \forall k, \qquad (33)$$

where $(x^*, y^*)$ is an optimal solution to (12).

It is easy to see that (31) is equivalent to (8), and that (32) is the same as (9) in ADAL.

Remark 3.3. We have shown that, with a fixed $v$, the ISTA-p iterations are exactly the same as the ADAL iterations. The difference between the two algorithms is that ADAL updates the (outer) Lagrange multiplier $v$ in each iteration, while in ISTA-p, $v$ stays the same throughout the inner iterations. We can thus view ISTA-p as a variant of ADAL with delayed updating of the Lagrange multiplier.

The 'load-balancing' behavior discussed in Section 3.2 is more obvious for ISTA-p. As we will see in Section 3.5, if we apply ISTA (with full linearization) to minimize (6), solving for $x$ is simply a gradient step.
Here, we need to minimize $f(x, y)$ with respect to $x$ exactly, while being able to take larger step sizes in the other subproblem, due to the smaller associated Lipschitz constant.

3.4 FISTA-p

We now present FISTA-p, an accelerated version of ISTA-p. FISTA-p is a special case of FAPLM-S in which a skipping step occurs in every iteration. We state the algorithm formally as Algorithm 3.4. The iteration complexity of FISTA-p (and FAPLM-S) is given by the following theorem.

Algorithm 3.4 FISTA-p (partial linearization)
1: Given $x^0$, $\bar{y}^0$, $v$. Choose $\rho$, and set $z^0 = \bar{y}^0$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \dots, K$ do
3:   $x^{k+1} \leftarrow \arg\min_x f(x; z^k)$
4:   $\bar{y}^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, z^k, y, \nabla_y f(x^{k+1}, z^k))$
5:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
6:   $z^{k+1} \leftarrow \bar{y}^{k+1} + \left( \frac{t_k - 1}{t_{k+1}} \right)(\bar{y}^{k+1} - \bar{y}^k)$
7: end for
8: return $(x^{K+1}, \bar{y}^{K+1})$

Theorem 3.2. Assume that $\nabla_y f(\cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$ and $\rho \leq \frac{1}{L_y(f)}$. Then the sequence $\{x^k, \bar{y}^k\}$ generated by Algorithm 3.4 satisfies

$$F(x^k, \bar{y}^k) - F(x^*, y^*) \leq \frac{2\|\bar{y}^0 - y^*\|^2}{\rho (k + 1)^2}. \qquad (34)$$

Although we need to solve a linear system in every iteration of Algorithms 3.2, 3.3, and 3.4, the left-hand side of the system stays constant throughout the invocation of the algorithms because, following Remark 3.1, we can always set $\rho = \mu$. Hence, no line-search is necessary, and this step essentially requires only one backward- and one forward-substitution, whose complexity is the same as that of a gradient step.

3.5 ISTA/FISTA: full linearization

ISTA solves the following problem in each iteration to produce the next iterate $(x^+, y^+)$:

$$\min_{x', y'} \; \tfrac{1}{2\rho} \left\| \begin{pmatrix} x' \\ y' \end{pmatrix} - d \right\|^2 + \lambda \sum_s \|y'_s\| \equiv \tfrac{1}{2\rho}\|x' - d_x\|^2 + \sum_j \left\{ \tfrac{1}{2\rho}\|y'_j - d^y_j\|^2 + \lambda \|y'_j\| \right\}, \qquad (35)$$

where $d = \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} - \rho \nabla f(x, y)$, and $f(x, y)$ is defined in (13).
It is easy to see that we can solve for $x^+$ and $y^+$ separately in (35). Specifically,

$$x^+ = d_x, \qquad (36)$$

$$y^+_j = \frac{d^y_j}{\|d^y_j\|} \max(0, \|d^y_j\| - \lambda\rho), \quad j = 1, \dots, J. \qquad (37)$$

Using ISTA to solve the outer augmented Lagrangian subproblem (6) is equivalent to taking only skipping steps in ALM-S. In our experiments, we used the accelerated version of ISTA, i.e. FISTA (Algorithm 3.5), to solve (6). FISTA (resp. ISTA) is, in fact, an inexact version of FISTA-p (resp. ISTA-p), in which we minimize with respect to $x$ a linearized approximation

$$\tilde{f}(x, z^k) := f(x^k, z^k) + \nabla_x f(x^k, z^k)^T (x - x^k) + \tfrac{1}{2\rho}\|x - x^k\|^2$$

of the quadratic objective function $f(x, z^k)$ in (32). The update to $x$ in Line 3 of Algorithm 3.4 is replaced by (36) as a result. Similarly to FISTA-p, FISTA is also a special skipping version of the full-split FALM-S. Considering that FISTA has an iteration complexity of $O(1/k^2)$, it is not surprising that FISTA-p has the same iteration complexity.

Remark 3.4. Since FISTA requires only the gradient of $f(x, y)$, it can easily handle any smooth convex loss function, such as the logistic loss for binary classification, $L(x) = \sum_{i=1}^N \log(1 + \exp(-b_i a_i^T x))$, where $a_i^T$ is the $i$-th row of $A$ and $b$ is the vector of labels. Moreover, when the scale of the data ($\min\{n, m\}$) is so large that it is impractical to compute the Cholesky factorization of $A^T A$, FISTA is a good choice to serve as the subroutine ApproxAugLagMin($x, y, v$) in OGLasso-AugLag.

Algorithm 3.5 FISTA
1: Given $\bar{x}^0$, $\bar{y}^0$, $v$. Choose $\rho_0$. Set $t_0 = 1$, $z^0_x = \bar{x}^0$, $z^0_y = \bar{y}^0$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \dots$ until stopping criterion is satisfied do
3:   Perform a backtracking line-search on $\rho$, starting from $\rho_0$.
4:   $\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} z^k_x \\ z^k_y \end{pmatrix} - \rho \nabla f(z^k_x, z^k_y)$
5:   $\bar{x}^{k+1} \leftarrow d_x$
6:   $\bar{y}^{k+1}_j \leftarrow \frac{d^y_j}{\|d^y_j\|} \max(0, \|d^y_j\| - \lambda\rho)$, $j = 1, \dots, J$
7:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
8:   $z^{k+1}_x \leftarrow \bar{x}^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar{x}^{k+1} - \bar{x}^k)$
9:   $z^{k+1}_y \leftarrow \bar{y}^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar{y}^{k+1} - \bar{y}^k)$
10: end for
11: return $(\bar{x}^{K+1}, \bar{y}^{K+1})$

4 Overlapping Group $l_1/l_\infty$-Regularization

The subproblems with respect to $y$ (or $\bar{y}$) involved in all of the algorithms presented in the previous sections take the form

$$\min_y \; \tfrac{1}{2\rho}\|c - y\|^2 + \tilde{\Omega}(y), \qquad (38)$$

where $\tilde{\Omega}(y) = \lambda \sum_{s \in \tilde{S}} w_s \|y_s\|_\infty$ in the case of $l_1/l_\infty$-regularization. In (8), for example, $c = Cx - \mu v$. The solution to (38) is the proximal operator of $\tilde{\Omega}$ [13, 12]. As with the classical Group Lasso, this problem is block-separable, and hence all blocks can be solved simultaneously. Again, for notational simplicity, we assume $w_s = 1$ for all $s \in \tilde{S}$ and omit it from now on. For each $s \in \tilde{S}$, the subproblem in (38) is of the form

$$\min_{y_s} \; \tfrac{1}{2}\|c_s - y_s\|^2 + \rho\lambda \|y_s\|_\infty. \qquad (39)$$

As shown in [48], the optimal solution to the above problem is $c_s - P(c_s)$, where $P$ denotes the orthogonal projection onto the ball of radius $\rho\lambda$ in the dual norm of the $l_\infty$-norm, i.e. the $l_1$-norm. The Euclidean projection onto the simplex can be computed in (expected) linear time [14, 9]. Duchi et al. [14] show that the problem of computing the Euclidean projection onto the $l_1$-ball can be reduced to that of finding the Euclidean projection onto the simplex in the following way. First, we replace $c_s$ in problem (39) by $|c_s|$, where the absolute value is taken component-wise. After we obtain the projection $z_s$ onto the simplex, we can construct the projection onto the $l_1$-ball by setting $y^*_s = \mathrm{sign}(c_s) \odot z_s$, where the sign is also taken component-wise.

5 Experiments

We tested the OGLasso-AugLag framework (Algorithm 2.2) with four subroutines: ADAL, APLM-S, FISTA-p, and FISTA.
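The $l_1$-ball projection route for the $l_\infty$ prox of Section 4 can be sketched as follows; this is our own NumPy implementation of the Duchi et al. reduction (sort-based rather than expected-linear-time), not the authors' code:

```python
import numpy as np

def project_simplex(u, r):
    # Euclidean projection of u onto {w >= 0 : sum(w) = r} (sort-based).
    srt = np.sort(u)[::-1]
    css = np.cumsum(srt)
    k = np.nonzero(srt * np.arange(1, u.size + 1) > css - r)[0][-1]
    theta = (css[k] - r) / (k + 1.0)
    return np.maximum(u - theta, 0.0)

def project_l1_ball(c, r):
    # Reduce the l1-ball projection to a simplex projection of |c|,
    # then restore the signs component-wise.
    if np.abs(c).sum() <= r:
        return c.copy()
    return np.sign(c) * project_simplex(np.abs(c), r)

def prox_linf(c, tau):
    # Prox of tau*||.||_inf on one block, as in (39): c - P(c), where P is
    # the projection onto the l1-ball of radius tau (the dual-norm ball).
    return c - project_l1_ball(c, tau)
```

For example, `prox_linf(np.array([3.0, -1.0]), 2.0)` yields $(1, -1)$: the residual after removing the $l_1$-ball projection $(2, 0)$ of the input.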
We implemented the framework with the first three subroutines in C++ to compare them with the ProxFlow algorithm proposed in [30]. We used the C interface and the BLAS and LAPACK subroutines provided by the AMD Core Math Library (ACML).^2 To compare with ProxGrad [11], we implemented the framework and all four algorithms in Matlab. We did not include ALM-S in our experiments because it is time-consuming to find the right ρ for the inner loops, as discussed in Remark 3.2, and our preliminary computational experience showed that ALM-S was slower than the other algorithms even when the heuristic ρ-setting scheme discussed in Remark 3.2 was used, because a large number of steps were skipping steps, which meant that the computation involved in solving the linear systems in those steps was wasted.

Footnote 2: http://developer.amd.com/libraries/acml/pages/default.aspx. Ideally, we would have used the Intel Math Kernel Library (Intel MKL), which is optimized for Intel processors, but Intel MKL is not freely available.

Algorithm | Outer rel. dual residual s^{l+1} | Inner rel. primal residual | Inner rel. objective gradient residual
ADAL      | ‖C^T(y^{l+1} − y^l)‖ / ‖C^T y^l‖ | − | −
FISTA-p   | ‖C^T(ȳ^{K+1} − z^K)‖ / ‖C^T z^K‖ | ‖ȳ^{k+1} − z^k‖ / ‖z^k‖ | ‖C^T(ȳ^{k+1} − z^k)‖ / ‖C^T z^k‖
APLM-S    | ‖C^T(ȳ^{K+1} − y^{K+1})‖ / ‖C^T y^{K+1}‖ | ‖ȳ^{k+1} − y^{k+1}‖ / ‖y^{k+1}‖ | ‖C^T(ȳ^{k+1} − y^{k+1})‖ / ‖C^T y^{k+1}‖
FISTA     | ‖(x̄^{K+1}, ȳ^{K+1}) − (z_x^K, z_y^K)‖ / ‖(z_x^K, z_y^K)‖ | ‖(x̄^{k+1}, ȳ^{k+1}) − (z_x^k, z_y^k)‖ / ‖(z_x^k, z_y^k)‖ | same as the inner rel. primal residual

Table 1: Specification of the quantities used in the outer and inner stopping criteria.
All of our experiments were performed on a laptop PC with an Intel Core 2 Duo 2.0 GHz processor and 4 GB of memory.

5.1 Algorithm parameters and termination criteria

Each algorithm (framework + subroutine)^3 required several parameters to be set and termination criteria to be specified. We used stopping criteria based on the primal and dual residuals as in [8]. We specify the criteria for each of the algorithms below, but defer their derivation to Appendix C. The maximum number of outer iterations was set to 500, and the tolerance for the outer loop was set at ε_out = 10^{−4}. The number of inner iterations was capped at 2000, and the tolerance at the l-th outer iteration for the inner loops was ε_in^l. Our termination criterion for the outer iterations was

max{r^l, s^l} ≤ ε_out,   (40)

where r^l = ‖Cx^l − y^l‖ / max{‖Cx^l‖, ‖y^l‖} is the outer relative primal residual and s^l is the relative dual residual, which is given for each algorithm in Table 1. Recall that K + 1 is the index of the last inner iteration of the l-th outer iteration; for example, for APLM-S, (x^{l+1}, y^{l+1}) takes the value of the last inner iterate (x^{K+1}, ȳ^{K+1}). We stopped the inner iterations when the maximum of the relative primal residual and the relative objective gradient residual for the inner problem was less than ε_in^l. (See Table 1 for the expressions of these two quantities.) We see there that s^{l+1} can be obtained directly from the relative gradient residual computed in the last inner iteration of the l-th outer iteration. We set µ_0 = 0.01 in all algorithms, except that we set µ_0 = 0.1 in ADAL for the data sets other than the first synthetic set and the breast cancer data set. We set ρ = µ in FISTA-p and APLM-S, and ρ_0 = µ in FISTA. For Theorem 2.1 to hold, the solution returned by the function ApproxAugLagMin(x, y, v) has to become increasingly more accurate over the outer iterations.
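For concreteness, the outer criterion (40) with the ADAL residuals from Table 1 amounts to a few matrix-vector products. A minimal NumPy sketch (variable names are ours; C, x, y denote the quantities in the constraint Cx = y):

```python
import numpy as np

def outer_residuals_adal(C, x, y_new, y_old):
    """Relative primal residual r^l and ADAL relative dual residual s^{l+1}
    as specified in Table 1."""
    Cx = C @ x
    r = np.linalg.norm(Cx - y_new) / max(np.linalg.norm(Cx), np.linalg.norm(y_new))
    s = np.linalg.norm(C.T @ (y_new - y_old)) / np.linalg.norm(C.T @ y_old)
    return r, s

def outer_converged(r, s, eps_out=1e-4):
    """Outer stopping test (40): max{r^l, s^l} <= eps_out."""
    return max(r, s) <= eps_out
```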
However, it is not possible to evaluate the sub-optimality quantity α_l in (7) exactly, because the optimal value of the augmented Lagrangian L(x, y, v^l) is not known in advance. (Footnote 3: For conciseness, we use the subroutine names (e.g. FISTA-p) to refer to the full algorithms consisting of the OGLasso-AugLag framework and the corresponding subroutine.) In our experiments, we used the maximum of the relative primal and dual residuals, max{r^l, s^l}, as a surrogate for α_l, for two reasons. First, it has been shown in [8] that r^l and s^l are closely related to α_l. Second, the quantities r^l and s^l are readily available as by-products of the inner and outer iterations. To ensure that the sequence {ε_in^l} satisfies (7), we basically set

ε_in^{l+1} = β_in ε_in^l,   (41)

with ε_in^0 = 0.01 and β_in = 0.5. However, since we terminate the outer iterations at ε_out > 0, it is not necessary to solve the subproblems to an accuracy much higher than that of the outer loop. On the other hand, it is also important for ε_in^l to decrease below ε_out, since s^l is closely related to the quantities involved in the inner stopping criteria. Hence, we slightly modified (41) and used ε_in^{l+1} = max{β_in ε_in^l, 0.2 ε_out}. Recently, we became aware of an alternative 'relative error' stopping criterion [16] for the inner loops, which guarantees convergence of Algorithm 2.2. In our context, this criterion essentially requires that the absolute dual residual be less than a fraction of the absolute primal residual. For FISTA-p, for instance, this condition requires that the (l+1)-th iterate satisfy

2‖(w_x^0 − x^{l+1}; w_y^l − y^{l+1})‖ s̄^{l+1} + (s̄^{l+1})² µ² ≤ σ (r̄^{l+1})²,

where r̄ and s̄ are the numerators in the expressions for r and s respectively, σ = 0.99, w_x^0 is a constant, and w_y is an auxiliary variable updated in each outer iteration by w_y^{l+1} = w_y^l − (1/µ²) C^T(ȳ^{K+1} − z^K). We experimented with this criterion but did not find any computational advantage over the heuristic based on the relative primal and dual residuals.

5.2 Strategies for updating µ

The penalty parameter µ in the outer augmented Lagrangian (6) not only controls the infeasibility in the constraint Cx = y, but also serves as the step length in the y-subproblem (and the x-subproblem in the case of FISTA). We adopted two kinds of strategies for updating µ. The first simply kept µ fixed. In this case, choosing an appropriate µ_0 was important for good performance; this was especially true for ADAL in our computational experiments. Usually, a µ_0 in the range of 10^{−1} to 10^{−3} worked well. The second strategy is a dynamic scheme based on the values r^l and s^l [8]. Since 1/µ penalizes the primal infeasibility, a small µ tends to result in a small primal residual; on the other hand, a large µ tends to yield a small dual residual. Hence, to keep r^l and s^l approximately balanced in each outer iteration, our scheme updated µ as follows:

µ^{l+1} ← max{β µ^l, µ_min}   if r^l > τ s^l,
µ^{l+1} ← min{µ^l / β, µ_max}  if s^l > τ r^l,
µ^{l+1} ← µ^l                  otherwise,   (42)

where we set µ_max = 10, µ_min = 10^{−6}, τ = 10, and β = 0.5, except for the first synthetic data set, where we set β = 0.1 for ADAL, FISTA-p, and APLM-S.

5.3 Synthetic examples

To compare our algorithms with the ProxGrad algorithm proposed in [11], we first tested a synthetic data set (ogl) generated by the procedure reported in [11] and [24]. The sequence of decision variables x was arranged in groups of ten, with adjacent groups having an overlap of three variables. The support of x was set to the first half of the variables. Each entry in the design matrix A and the non-zero entries of x were sampled from i.i.d.
standard Gaussian distributions, and the output b was set to b = Ax + ε, where the noise ε ∼ N(0, I). Two sets of data were generated as follows: (a) fix n = 5000 and vary the number of groups J from 100 to 1000 in increments of 100; (b) fix J = 200 and vary n from 1000 to 10000 in increments of 1000. The stopping criterion for ProxGrad was the same as the one used for FISTA, and we set its smoothing parameter to 10^{−3}. Figure 1 plots the CPU times taken by the Matlab version of our algorithms and ProxGrad (also in Matlab) on these scalability tests with l1/l2-regularization. A subset of the numerical results on which these plots are based is presented in Tables 4 and 5. The plots clearly show that the alternating direction methods were much faster than ProxGrad on these two data sets. Compared to ADAL, FISTA-p performed slightly better, and it showed a clear computational advantage over its general version, APLM-S. In the plot on the left of Figure 1, FISTA exhibited the advantage of a gradient-based algorithm when both n and m are large. In that case (towards the right end of the plot), the Cholesky factorizations required by ADAL, APLM-S, and FISTA-p became relatively expensive. When min{n, m} is small or the linear systems can be solved cheaply, as the plot on the right shows, FISTA-p and ADAL have an edge over FISTA due to the smaller numbers of inner iterations required. We generated a second data set (dct) using the approach from [30] for scalability tests on both the l1/l2 and l1/l∞ group penalties. The design matrix A was formed from over-complete dictionaries of discrete cosine transforms (DCT). The set of groups comprised all contiguous sequences of length five in one-dimensional space. x had about 10% non-zero entries, selected randomly. We generated the output as b = Ax + ε, where ε ∼ N(0, 0.01‖Ax‖²).
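Both synthetic designs use sliding-window group structures (the ogl sets use windows of ten with an overlap of three; the dct windows of length five slide by one position, i.e. overlap four). A sketch of how such instances can be generated, with hypothetical helper names and NumPy in place of the Matlab/C++ used in the paper:

```python
import numpy as np

def sliding_groups(num_groups, size, overlap):
    """Index sets of contiguous groups of `size` features, with `overlap`
    features shared between adjacent groups (ogl: size=10, overlap=3;
    dct: size=5, overlap=4)."""
    stride = size - overlap
    return [np.arange(j * stride, j * stride + size) for j in range(num_groups)]

def make_ogl_instance(n, num_groups, size=10, overlap=3, noise=1.0, seed=0):
    """Synthetic ogl-style data: Gaussian A and x, support on the first half
    of the variables, output b = Ax + eps with eps ~ N(0, noise^2 I)."""
    rng = np.random.default_rng(seed)
    groups = sliding_groups(num_groups, size, overlap)
    m = groups[-1][-1] + 1                 # total number of features
    x = rng.standard_normal(m)
    x[m // 2:] = 0.0                       # support = first half of the variables
    A = rng.standard_normal((n, m))
    b = A @ x + noise * rng.standard_normal(n)
    return A, x, b, groups
```

With size 10 and overlap 3 the stride is 7, so J groups span m = 7J + 3 features, matching data-set names such as ogl-5000-100-10-3 (n = 5000, J = 100, m = 703).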
We fixed n = 1000 and varied the number of features m from 5000 to 30000 in increments of 5000. This data set leads to considerably harder problems than the previous one because the groups overlap heavily and the DCT dictionary-based design matrix exhibits local correlations. Due to the excessive running time required in Matlab, we ran the C++ version of our algorithms on this data set, leaving out APLM-S and ProxGrad, whose performance relative to the other algorithms is already fairly clear from Figure 1. For ProxFlow, we set the tolerance on the relative duality gap to 10^{−4}, the same as ε_out, and kept all the other parameters at their default values. Figure 2 presents the CPU times required by the algorithms versus the number of features. In the case of l1/l2-regularization, it is clear that FISTA-p outperformed the other two algorithms. For l1/l∞-regularization, ADAL and FISTA-p performed equally well and compared favorably to ProxFlow. In both cases, the growth of the CPU times for FISTA follows the same trend as that for FISTA-p, and the two required a similar number of outer iterations, as shown in Tables 6 and 7. However, FISTA lagged behind in speed due to larger numbers of inner iterations. Unlike in the case of the ogl data set, Cholesky factorization was not a bottleneck for FISTA-p and ADAL here because we needed to compute it only once. To simulate the situation where computing or caching A^T A and its Cholesky factorization is not feasible, we switched ADAL and FISTA-p to PCG mode by always using PCG to solve the linear systems in the subproblems. We compared the performance of ADAL, FISTA-p, and FISTA on the previous data set for both the l1/l2 and l1/l∞ models. The results for ProxFlow are copied from Figure 2 and Table 9 to serve as a reference. We experimented with the fixed-value and the dynamic updating schemes for µ on all three algorithms.
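The dynamic scheme (42) is a simple balancing rule on the two residuals; a direct transcription in Python, with the parameter values used in our experiments as defaults:

```python
def update_mu(mu, r, s, beta=0.5, tau=10.0, mu_min=1e-6, mu_max=10.0):
    """Dynamic update (42): shrink mu when the primal residual r dominates,
    grow it when the dual residual s dominates, otherwise leave it unchanged."""
    if r > tau * s:
        return max(beta * mu, mu_min)    # primal residual too large: tighten feasibility
    if s > tau * r:
        return min(mu / beta, mu_max)    # dual residual too large: relax the penalty
    return mu
```

For instance, with r ten times larger than τ·s the rule halves µ (down to the floor µ_min), and in the balanced case it leaves µ untouched, which keeps r^l and s^l of comparable magnitude across the outer iterations.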
From Figure 3, it is clear that the performance of FISTA-p was significantly improved by the dynamic scheme. For ADAL, however, the dynamic scheme worked well only in the l1/l2 case, while its performance generally became worse in the l1/l∞ case. We did not include the results for FISTA with the dynamic scheme because the solutions obtained were considerably more suboptimal than those obtained with the fixed-µ scheme.

Figure 1: Scalability test results of the algorithms on the synthetic overlapping Group Lasso data sets from [11]. The scale of the y-axis is logarithmic. The dynamic scheme for µ was used for all algorithms except ProxGrad.

Tables 8 and 9 report the best results of the algorithms in each case. The plots and numerical results show that FISTA-p compares favorably to ADAL and stays competitive with ProxFlow. In terms of the quality of the solutions, FISTA-p and ADAL also did a better job than FISTA, as evidenced in Table 9. On the other hand, the gap in CPU time between FISTA and the other three algorithms is less pronounced.

5.4 Real-world Examples

To demonstrate the practical usefulness of our algorithms, we tested them on two real-world applications.

5.4.1 Breast Cancer Gene Expressions

We used the breast cancer data set [47] with canonical pathways from MSigDB [44]. The data was collected from 295 breast cancer tumor samples and contains gene expression measurements for 8,141 genes. The goal was to select a small set of the most relevant genes that yield the best prediction performance. A detailed description of the data set can be found in [11, 24]. In our experiment, we performed a regression task to predict the length of survival of the patients. The canonical pathways naturally provide grouping information for the genes, so we used them as the groups for the group-structured regularization term Ω(·). Table 2 summarizes the data attributes.
The numerical results for the l1/l2-norm are collected in Table 10, which shows that FISTA-p and ADAL were the fastest on this data set. Again, we had to tune ADAL with different initial values (µ_0) and updating schemes for µ for speed and solution quality, and we eventually kept µ constant at 0.01. The dynamic updating scheme for µ also did not work for FISTA, which returned a very suboptimal solution in this case. We instead adopted a simple scheme of halving µ every 10 outer iterations. Figure 6 graphically depicts the performance of the different algorithms. In terms of the outer iterations, APLM-S behaved identically to FISTA-p, and FISTA behaved similarly to ADAL. However, APLM-S and FISTA were considerably slower due to larger numbers of inner iterations.

Figure 2: Scalability test results on the DCT set with l1/l2-regularization (left column) and l1/l∞-regularization (right column). The scale of the y-axis is logarithmic. FISTA-p, FISTA, and ADAL were all run with a fixed µ = µ_0.

Data set         | N (no. samples) | J (no. groups) | group size (avg) | average frequency
BreastCancerData | 295             | 637            | 23.7             | 4

Table 2: The breast cancer data set.

We plot the root-mean-squared error (RMSE) over different values of λ (which lead to different numbers of active genes) in the left half of Figure 4. The training set consists of 200 randomly selected samples, and the RMSE was computed on the remaining 95 samples. l1/l2-regularization achieved a lower RMSE in this case. However, l1/l∞-regularization yielded better group sparsity, as shown in Figure 5. The sets of active genes selected by the two models were very similar, as illustrated in the right half of Figure 4. In general, the magnitudes of the coefficients returned by l1/l∞-regularization tended to be similar within a group, whereas those returned by l1/l2-regularization did not follow that pattern.
This is because l1/l∞-regularization penalizes only the maximum element, rather than all the coefficients in a group, resulting in many coefficients having the same magnitudes.

Figure 3: Scalability test results on the DCT set with l1/l2-regularization (left column) and l1/l∞-regularization (right column). The scale of the y-axis is logarithmic. FISTA-p and ADAL are in PCG mode. The dotted lines denote the results obtained with the dynamic updating scheme for µ.

5.4.2 Video Sequence Background Subtraction

We next considered the video sequence background subtraction task from [30, 23]. The main objective here is to segment out foreground objects in an image (frame), given a sequence of m frames from a fixed camera. The data used in this experiment is available online^4 [46]. The basic setup of the problem is as follows. We represent each frame of n pixels as a column vector A_j ∈ R^n and form the matrix A ∈ R^{n×m} as A ≡ [A_1 A_2 ⋯ A_m]. The test frame is represented by b ∈ R^n. We model the relationship between b and A by b ≈ Ax + e, where x is assumed to be sparse, and e is the 'noise' term, which is also assumed to be sparse. Ax is thus a sparse linear combination of the video frame sequence and accounts for the background present in both A and b; e contains the sparse foreground objects in b. The basic model with l1-regularization (Lasso) is

min_{x,e} (1/2)‖Ax + e − b‖² + λ(‖x‖₁ + ‖e‖₁).   (43)

It has been shown in [30] that we can significantly improve the quality of segmentation by applying a group-structured regularization Ω(·) to e, where the groups are all the overlapping k × k square patches in the image. Here, we set k = 3. The model thus becomes

min_{x,e} (1/2)‖Ax + e − b‖² + λ(‖x‖₁ + ‖e‖₁ + Ω(e)).   (44)

Note that (44) still fits into the group-sparse framework if we treat the l1-regularization terms as sums of group norms, where each group consists of a single element. We also considered an alternative model, in which a Ridge regularization is applied to x and an Elastic-Net penalty [50] to e:

min_{x,e} (1/2)‖Ax + e − b‖² + λ₁‖e‖₁ + λ₂(‖x‖² + ‖e‖²).   (45)

This model does not yield a sparse x, but sparsity in x is not a crucial factor here. It is, however, well suited to our partial linearization methods (APLM-S and FISTA-p), since there is no need for the augmented Lagrangian framework. Of course, we can also apply FISTA to solve (45).

Footnote 4: http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm

Figure 4: On the left: plot of root-mean-squared error against the number of active genes for the breast cancer data. The plot is based on the regularization path for ten different values of λ. The total CPU time (in Matlab) using FISTA-p was 51 seconds for l1/l2-regularization and 115 seconds for l1/l∞-regularization. On the right: the recovered sparse gene coefficients for predicting the length of the survival period. The value of λ used here was the one minimizing the RMSE in the plot on the left.

Figure 5: Pathway-level sparsity vs. gene-level sparsity.

We recovered the foreground objects by solving the above optimization problems and applying the sparsity pattern of e as a mask for the original test frame. A hand-segmented evaluation image from [46] served as the ground truth. The regularization parameters λ, λ₁, and λ₂ were selected so that the recovered foreground objects matched the ground truth to the maximum extent. FISTA-p was used to solve all three models.
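The overlapping patch groups on e in (44) can be enumerated directly. A sketch for k = 3 patches over an h × w image grid, producing index sets into the row-major flattened image (hypothetical helper, NumPy):

```python
import numpy as np

def patch_groups(h, w, k=3):
    """All overlapping k-by-k square patches of an h-by-w image, each returned
    as a group of indices into the row-major flattened image; these are the
    groups defining Omega(e) in model (44)."""
    groups = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows, cols = np.meshgrid(np.arange(i, i + k), np.arange(j, j + k),
                                     indexing="ij")
            groups.append((rows * w + cols).ravel())
    return groups
```

An h × w image yields (h − k + 1)(w − k + 1) groups of k² pixels each, so each interior pixel belongs to k² overlapping groups — the heavy overlap that makes this problem a good fit for the augmented Lagrangian framework.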
The l1 model (43) was treated as a special case of the group regularization model (44), with each group containing only one component of the feature vector.^5 For the Ridge/Elastic-Net penalty model, we applied FISTA-p directly, without the outer augmented Lagrangian layer.

Figure 6: Objective value vs. outer iterations and objective value vs. CPU time for the breast cancer data. The results for ProxGrad are not plotted because of the different objective function that it minimizes. The red (APLM-S) and blue (FISTA-p) lines overlap in the left column.

The solutions for the l1/l2, l1/l∞, and Lasso models were not strictly sparse, in the sense that the supposedly zero feature coefficients had non-zero (albeit extremely small) magnitudes, since we enforced the linear constraints Cx = y through an augmented Lagrangian approach. To obtain sparse solutions, we truncated the non-sparse solutions using thresholds ranging from 10^{−9} to 10^{−3} and selected the threshold that yielded the best accuracy. Note that because of the additional feature vector e, the data matrix is effectively Ã = [A I_n] ∈ R^{n×(m+n)}. For solving (44), FISTA-p has to solve the linear system

[A^T A + (1/µ)D_x   A^T           ] [x]   [r_x]
[A                  I_n + (1/µ)D_e] [e] = [r_e],   (46)

where D is a diagonal matrix, and D_x, D_e, r_x, r_e are the components of D and r corresponding to x and e, respectively. In this example, n is much larger than m, e.g. n = 57600, m = 200. To avoid solving a system of size n × n, we took the Schur complement of I_n + (1/µ)D_e and instead solved the positive definite m × m system

(A^T A + (1/µ)D_x − A^T (I + (1/µ)D_e)^{−1} A) x = r_x − A^T (I + (1/µ)D_e)^{−1} r_e,   (47)
e = (I + (1/µ)D_e)^{−1} (r_e − Ax).   (48)

The l1/l∞ model yielded the best background separation accuracy (marginally better than the l1/l2 model), but it was also the most computationally expensive. (See Table 3 and Figure 7.)
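The Schur-complement reduction (47)–(48) replaces one n × n solve by one m × m solve plus diagonal operations. A dense NumPy sketch, for illustration only (d_x and d_e stand for the diagonals of D_x and D_e):

```python
import numpy as np

def solve_x_e(A, d_x, d_e, r_x, r_e, mu):
    """Solve the block system (46) via the Schur complement of I_n + (1/mu)D_e,
    i.e. equations (47)-(48)."""
    w = 1.0 / (1.0 + d_e / mu)            # diagonal of (I + (1/mu)D_e)^{-1}
    # m-by-m Schur complement: A^T A + (1/mu)D_x - A^T (I + (1/mu)D_e)^{-1} A
    M = A.T @ A + np.diag(d_x / mu) - A.T @ (w[:, None] * A)
    x = np.linalg.solve(M, r_x - A.T @ (w * r_e))   # eq. (47)
    e = w * (r_e - A @ x)                           # eq. (48)
    return x, e
```

Since (I + (1/µ)D_e)^{−1} ≺ I for positive d_e, the matrix M is positive definite whenever d_x > 0, so the m × m solve is well posed; with n = 57600 and m = 200 this is the difference between a 200 × 200 and a 57600 × 57600 factorization per iteration.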
Although the Ridge/Elastic-Net model yielded separation results as poor as those of the Lasso (l1) model, it was orders of magnitude faster to solve using FISTA-p. We again observed that the dynamic scheme for µ worked better for FISTA-p than for ADAL. With a constant µ over the entire run, ADAL took at least twice as long as FISTA-p to produce a solution of the same quality. A typical run of FISTA-p on this problem with the best selected λ took fewer than 10 outer iterations, whereas ADAL took more than 500 iterations to meet the stopping criteria.

Footnote 5: We did not use the original version of FISTA to solve the model as an l1-regularization problem because it took too long to converge in our experiments due to extremely small step sizes.

Figure 7: Separation results for the video sequence background subtraction example. Each training image had 120 × 160 RGB pixels. The training set contained 200 images in sequence. The accuracy indicated for each of the different models is the percentage of pixels that matched the ground truth.

Model               | Accuracy (percent) | Total CPU time (s) | No. parameter values on reg. path
l1/l2               | 97.17              | 2.48e+003          | 8
l1/l∞               | 98.18              | 4.07e+003          | 6
l1                  | 87.63              | 1.61e+003          | 11
ridge + elastic net | 87.89              | 1.82e+002          | 64

Table 3: Computational results for the video sequence background subtraction example. The algorithm used is FISTA-p. We used the Matlab version for ease of generating the images; from our experience in the previous experiments, the C++ version runs at least four times faster. We report the best accuracy found on the regularization path of each model. The total CPU time is recorded for computing the entire regularization path, with the specified number of regularization parameter values.

5.5 Comments on Results

The computational results exhibit two general patterns.
First, the simpler algorithms (FISTA-p and ADAL) were significantly faster than the more general algorithms, such as APLM-S. Interestingly, the majority of the APLM-S inner iterations consisted of skipping steps in the tests on synthetic data and the breast cancer data, which means that APLM-S essentially behaved like ISTA-p in those cases. Indeed, FISTA-p generally required the same number of outer iterations as APLM-S but far fewer inner iterations, as predicted by theory. In addition, no computational steps were wasted and no function evaluations were required for FISTA-p and ADAL. Second, FISTA-p converged faster (required fewer iterations) than its full-linearization counterpart FISTA. We have suggested possible reasons for this in Section 3. On the other hand, FISTA was very effective for data both of whose dimensions were large, because it requires only gradient computations and soft-thresholding operations and does not require linear systems to be solved. Our experiments showed that the performance of ADAL (as well as the quality of the solution that it returned) varied considerably as a function of the parameter settings, and it was tricky to tune them optimally. In contrast, FISTA-p exhibited fairly stable performance for a simple set of parameters that we rarely had to alter, and it generally performed better than ADAL. It may seem straightforward to apply FISTA directly to the Lasso problem (43) without the augmented Lagrangian framework.^6 However, as we have seen in our experiments, FISTA took much longer than AugLag-FISTA-p to solve this problem. We believe that this is further evidence of the 'load-balancing' property of the latter algorithm, which we discussed in Section 3.2. It also demonstrates the versatility of our approach to regularized learning problems.
Footnote 6: To avoid confusion with our algorithms, which consist of inner-outer iterations, we prefix our algorithms with 'AugLag' here.

6 Conclusion

We have built a unified framework for solving sparse learning problems involving group-structured regularization, in particular the l1/l2- or l1/l∞-regularization of arbitrarily overlapping groups of variables. For the key building block of this framework, we developed new efficient algorithms based on alternating partial-linearization/splitting, with proven convergence rates. In addition, we have incorporated ADAL and FISTA into our framework. Computational tests on several sets of synthetic test data demonstrated the relative strengths of the algorithms, and through two real-world applications we compared the relative merits of the two structured sparsity-inducing norms. Among the algorithms studied, FISTA-p and ADAL performed the best on most of the data sets, and FISTA appeared to be a good alternative for large-scale data. From our experience, FISTA-p is easier to configure and is more robust to variations in the algorithm parameters. Together, they form a flexible and versatile suite of methods for group-sparse problems of different sizes.

7 Acknowledgement

We would like to thank Katya Scheinberg and Shiqian Ma for many helpful discussions, and Xi Chen for providing the Matlab code for ProxGrad.

A Proof of Lemma 3.1

F(x̄, ȳ) − F(x, q̄) ≥ F(x̄, ȳ) − L_ρ(x, y, q̄, ∇_y f(x, y))
 = F(x̄, ȳ) − [f(x, y) + ∇_y f(x, y)^T (q̄ − y) + (1/2ρ)‖q̄ − y‖² + g(q̄)].   (49)

From the optimality of q̄, we also have

γ_g(q̄) + ∇_y f(x, y) + (1/ρ)(q̄ − y) = 0.   (50)

Since F(x, y) = f(x, y) + g(y), and f and g are convex functions, for any (x̄, ȳ),

F(x̄, ȳ) ≥ g(q̄) + (ȳ − q̄)^T γ_g(q̄) + f(x, y) + (ȳ − y)^T ∇_y f(x, y) + (x̄ − x)^T ∇_x f(x, y).   (51)

Therefore, from (49), (50), and (51), it follows that

F(x̄, ȳ) − F(x, q̄)
 ≥ g(q̄) + (ȳ − q̄)^T γ_g(q̄) + f(x, y) + (ȳ − y)^T ∇_y f(x, y) + (x̄ − x)^T ∇_x f(x, y)
   − [f(x, y) + ∇_y f(x, y)^T (q̄ − y) + (1/2ρ)‖q̄ − y‖² + g(q̄)]
 = (ȳ − q̄)^T (γ_g(q̄) + ∇_y f(x, y)) − (1/2ρ)‖q̄ − y‖² + (x̄ − x)^T ∇_x f(x, y)
 = (ȳ − q̄)^T (−(1/ρ)(q̄ − y)) − (1/2ρ)‖q̄ − y‖² + (x̄ − x)^T ∇_x f(x, y)
 = (1/2ρ)(‖q̄ − ȳ‖² − ‖y − ȳ‖²) + (x̄ − x)^T ∇_x f(x, y).   (52)

The proof of the second part of the lemma is very similar, but we give it for completeness. Writing q for the y-component of (p, q),

F(x, y) − F(p, q) ≥ F(x, y) − [f(p, q) + g(ȳ) + γ_g(ȳ)^T (q − ȳ) + (1/2ρ)‖q − ȳ‖²].   (53)

By the optimality of (p, q), we have

∇_x f(p, q) = 0,   (54)
∇_y f(p, q) + γ_g(ȳ) + (1/ρ)(q − ȳ) = 0.   (55)

Since F(x, y) = f(x, y) + g(y), it follows from the convexity of both f and g and from (54) that

F(x, y) ≥ g(ȳ) + (y − ȳ)^T γ_g(ȳ) + f(p, q) + (y − q)^T ∇_y f(p, q).   (56)

Now combining (53), (55), and (56), it follows that

F(x, y) − F(p, q) ≥ (y − q)^T (γ_g(ȳ) + ∇_y f(p, q)) − (1/2ρ)‖q − ȳ‖²
 = (y − q)^T ((1/ρ)(ȳ − q)) − (1/2ρ)‖q − ȳ‖²
 = (1/2ρ)(‖q − y‖² − ‖y − ȳ‖²).   (57)

B Proof of Theorem 3.1

Let I be the set of all regular iteration indices among the first k − 1 iterations, and let I^c be its complement. For all n ∈ I^c, y^{n+1} = ȳ^n. For n ∈ I, we can apply Lemma 3.1, since (21) automatically holds and (19) holds when ρ ≤ 1/L(f). In (22), by letting (x, y) = (x*, y*) and ȳ = ȳ^n, we get (p, q) = (x^{n+1}, y^{n+1}), and

2ρ(F(x*, y*) − F(x^{n+1}, y^{n+1})) ≥ ‖y^{n+1} − y*‖² − ‖ȳ^n − y*‖².   (58)

In (20), by letting (x̄, ȳ) = (x*, y*) and (x, y) = (x^{n+1}, y^{n+1}), we get q̄ = ȳ^{n+1} and

2ρ(F(x*, y*) − F(x^{n+1}, ȳ^{n+1}))
 ≥ ‖ȳ^{n+1} − y*‖² − ‖y^{n+1} − y*‖² + (x* − x^{n+1})^T ∇_x f(x^{n+1}, y^{n+1})
 = ‖ȳ^{n+1} − y*‖² − ‖y^{n+1} − y*‖²,   (59)

since ∇_x f(x^{n+1}, y^{n+1}) = 0, for n ∈ I by (54) and for n ∈ I^c by (18). Adding (59) to (58), we get

2ρ(2F(x*, y*) − F(x^{n+1}, y^{n+1}) − F(x^{n+1}, ȳ^{n+1})) ≥ ‖ȳ^{n+1} − y*‖² − ‖ȳ^n − y*‖².   (60)

For n ∈ I^c, since ∇_x f(x^{n+1}, y^{n+1}) = 0, we have that (59) holds. Since y^{n+1} = ȳ^n, it follows that

2ρ(F(x*, y*) − F(x^{n+1}, ȳ^{n+1})) ≥ ‖ȳ^{n+1} − y*‖² − ‖ȳ^n − y*‖².   (61)

Summing (60) and (61) over n = 0, 1, …, k − 1 and observing that 2|I| + |I^c| = k + k_n, we obtain

2ρ((k + k_n) F(x*, y*) − Σ_{n=0}^{k−1} F(x^{n+1}, ȳ^{n+1}) − Σ_{n∈I} F(x^{n+1}, y^{n+1}))   (62)
 ≥ Σ_{n=0}^{k−1} (‖ȳ^{n+1} − y*‖² − ‖ȳ^n − y*‖²) = ‖ȳ^k − y*‖² − ‖ȳ^0 − y*‖² ≥ −‖ȳ^0 − y*‖².

In Lemma 3.1, by letting (x̄, ȳ) = (x^{n+1}, y^{n+1}) in (20) instead of (x*, y*), we have from (59) that

2ρ(F(x^{n+1}, y^{n+1}) − F(x^{n+1}, ȳ^{n+1})) ≥ ‖ȳ^{n+1} − y^{n+1}‖² ≥ 0.   (63)

Similarly, for n ∈ I, if we let (x, y) = (x^n, ȳ^n) instead of (x*, y*) in (58), we have

2ρ(F(x^n, ȳ^n) − F(x^{n+1}, y^{n+1})) ≥ ‖y^{n+1} − ȳ^n‖² ≥ 0.   (64)

For n ∈ I^c, y^{n+1} = ȳ^n; from (18), since x^{n+1} = argmin_x F(x, y) with y = ȳ^n = y^{n+1},

2ρ(F(x^n, ȳ^n) − F(x^{n+1}, y^{n+1})) ≥ 0.   (65)

Hence, from (63)–(65), F(x^n, y^n) ≥ F(x^n, ȳ^n) ≥ F(x^{n+1}, y^{n+1}) ≥ F(x^{n+1}, ȳ^{n+1}). Then we have

Σ_{n=0}^{k−1} F(x^{n+1}, ȳ^{n+1}) ≥ k F(x^k, ȳ^k),  and  Σ_{n∈I} F(x^{n+1}, y^{n+1}) ≥ k_n F(x^k, y^k).   (66)

Combining (62) and (66) yields

2ρ(k + k_n)(F(x*, y*) − F(x^k, ȳ^k)) ≥ −‖ȳ^0 − y*‖².

C Derivation of the Stopping Criteria

In this section, we show that the quantities used in our stopping criteria correspond to the primal and dual residuals [8] for the outer iterations and to the gradient residuals for the inner iterations. We first consider the inner iterations.

FISTA-p. The necessary and sufficient optimality conditions for problem (12) or (15) are primal feasibility

ȳ* − y* = 0,   (67)

and vanishing of the gradient of the objective function at (x*, ȳ*), i.e.

0 = ∇_x f(x*, ȳ*),   (68)
0 ∈ ∇_y f(x*, ȳ*) + ∂g(ȳ*).   (69)

Since y^{k+1} = z^k, the primal residual is thus ȳ^{k+1} − y^{k+1} = ȳ^{k+1} − z^k. It follows from the optimality of x^{k+1} in Line 3 of Algorithm 3.4 that

A^T(Ax^{k+1} − b) − C^T v^l + (1/µ)C^T(Cx^{k+1} − ȳ^{k+1}) + (1/µ)C^T(ȳ^{k+1} − z^k) = 0,

so that ∇_x f(x^{k+1}, ȳ^{k+1}) = (1/µ)C^T(z^k − ȳ^{k+1}). Similarly, from the optimality of ȳ^{k+1} in Line 4, we have that

0 ∈ ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, z^k) + (1/ρ)(ȳ^{k+1} − z^k)
 = ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, ȳ^{k+1}) − (1/µ)(ȳ^{k+1} − z^k) + (1/ρ)(ȳ^{k+1} − z^k)
 = ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, ȳ^{k+1}),

where the last step follows from µ = ρ. Hence, we see that (1/µ)C^T(z^k − ȳ^{k+1}) is the gradient residual corresponding to (68), while (69) is satisfied at every inner iteration.

APLM-S. The primal residual is ȳ^{k+1} − y^{k+1} from (67). Following the derivation for FISTA-p, it is not hard to verify that (69) is always satisfied, and the gradient residual corresponding to (68) is (1/µ)C^T(y^{k+1} − ȳ^{k+1}).
FISTA. Similar to FISTA-p, the necessary and sufficient optimality conditions for problem (28) are primal feasibility $(x^*, y^*) = (\bar{x}^*, \bar{y}^*)$, and vanishing of the objective gradient at $(\bar{x}^*, \bar{y}^*)$:

$$0 = \nabla_x f(\bar{x}^*, \bar{y}^*),$$
$$0 \in \nabla_y f(\bar{x}^*, \bar{y}^*) + \partial g(\bar{y}^*).$$

Clearly, the primal residual is $(\bar{x}^{k+1} - z_x^k,\ \bar{y}^{k+1} - z_y^k)$, since $(x^{k+1}, y^{k+1}) \equiv (z_x^k, z_y^k)$. From the optimality of $(\bar{x}^{k+1}, \bar{y}^{k+1})$, it follows that

$$0 = \nabla_x f(z_x^k, z_y^k) + \frac{1}{\rho}(\bar{x}^{k+1} - z_x^k),$$
$$0 \in \partial g(\bar{y}^{k+1}) + \nabla_y f(z_x^k, z_y^k) + \frac{1}{\rho}(\bar{y}^{k+1} - z_y^k).$$

Here, we simply use $\frac{1}{\rho}(\bar{x}^{k+1} - z_x^k)$ and $\frac{1}{\rho}(\bar{y}^{k+1} - z_y^k)$ to approximate the gradient residuals.

Data Sets            Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-5000-100-10-3    ADAL      1.70e+00   61      1.00e+00        1.9482e+05
                     APLM-S    1.71e+00   8       4.88e+00        1.9482e+05
                     FISTA-p   9.08e-01   8       4.38e+00        1.9482e+05
                     FISTA     2.74e+00   10      7.30e+00        1.9482e+05
                     ProxGrad  7.92e+01   3858    -               -
ogl-5000-600-10-3    ADAL      6.75e+01   105     1.00e+00        1.4603e+06
                     APLM-S    1.79e+02   9       1.74e+01        1.4603e+06
                     FISTA-p   4.77e+01   9       8.56e+00        1.4603e+06
                     FISTA     3.28e+01   12      1.36e+01        1.4603e+06
                     ProxGrad  7.96e+02   5608    -               -
ogl-5000-1000-10-3   ADAL      2.83e+02   151     1.00e+00        2.6746e+06
                     APLM-S    8.06e+02   10      2.76e+01        2.6746e+06
                     FISTA-p   2.49e+02   10      1.28e+01        2.6746e+06
                     FISTA     5.21e+01   13      1.55e+01        2.6746e+06
                     ProxGrad  1.64e+03   6471    -               -

Table 4: Numerical results for ogl set 1. For ProxGrad, the Avg Sub-iters and F(x) fields are not applicable, since the algorithm is not based on an outer-inner iteration scheme and the objective function that it minimizes is different from ours. We tested ten problems with J = 100, ..., 1000, but show the results for only three of them to save space.

Next, we consider the outer iterations.
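As background for the outer iterations, the stopping test follows the generic ADMM primal/dual residual check of Boyd et al. [8] for the constraint Cx = y. A minimal sketch with illustrative data is below; the helper `outer_converged` and all variables are hypothetical stand-ins, and the dual residual here is the generic ADMM form rather than the exact expression derived next.

```python
import numpy as np

def outer_converged(C, x, y_new, y_old, mu, tol_p=1e-4, tol_d=1e-4):
    """Generic ADMM-style stopping test for the constraint C x = y [8]."""
    r = C @ x - y_new                        # primal residual
    s = (1.0 / mu) * C.T @ (y_new - y_old)   # dual residual from the v-update
    return bool(np.linalg.norm(r) <= tol_p and np.linalg.norm(s) <= tol_d)

rng = np.random.default_rng(1)
C = rng.standard_normal((5, 8))
x = rng.standard_normal(8)
y = C @ x                                    # feasible point: primal residual is 0
print(outer_converged(C, x, y, y, 1.0))      # both residuals vanish -> True
```

With y_new equal to y_old, the dual residual vanishes as well, so the test passes exactly when the iterate is feasible.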
The necessary and sufficient optimality conditions for problem (5) are primal feasibility

$$Cx^* - y^* = 0, \quad (70)$$

and dual feasibility

$$0 = \nabla L(x^*) - C^T v^*, \quad (71)$$
$$0 \in \partial \tilde{\Omega}(y^*) + v^*. \quad (72)$$

Clearly, the primal residual is $r^l = Cx^l - y^l$. The dual residual is

$$\begin{pmatrix} \nabla L(x^{l+1}) - C^T\big(v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1})\big) \\ \partial \tilde{\Omega}(y^{l+1}) + v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1}) \end{pmatrix},$$

recalling that $v^{l+1} = v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1})$. The above is simply the gradient of the augmented Lagrangian (6) evaluated at $(x^l, y^l, v^l)$. Now, since the objective function of an inner iteration is the augmented Lagrangian with $v = v^l$, the dual residual for an outer iteration is readily available from the gradient residual computed in the last inner iteration of that outer iteration.

D Numerical Results

Data Sets            Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-1000-200-10-3    ADAL      4.18e+00   77      1.00e+00        9.6155e+04
                     APLM-S    1.64e+01   9       2.32e+01        9.6156e+04
                     FISTA-p   3.85e+00   9       1.02e+01        9.6156e+04
                     FISTA     2.92e+00   11      1.44e+01        9.6158e+04
                     ProxGrad  1.16e+02   4137    -               -
ogl-5000-200-10-3    ADAL      5.04e+00   63      1.00e+00        4.1573e+05
                     APLM-S    8.42e+00   8       8.38e+00        4.1576e+05
                     FISTA-p   3.96e+00   9       6.56e+00        4.1572e+05
                     FISTA     6.54e+00   10      9.70e+00        4.1573e+05
                     ProxGrad  1.68e+02   4345    -               -
ogl-10000-200-10-3   ADAL      6.41e+00   44      1.00e+00        1.0026e+06
                     APLM-S    1.46e+01   10      7.60e+00        1.0026e+06
                     FISTA-p   5.60e+00   10      5.50e+00        1.0026e+06
                     FISTA     1.09e+01   10      8.50e+00        1.0027e+06
                     ProxGrad  3.31e+02   6186    -               -

Table 5: Numerical results for ogl set 2. We ran the test for ten problems with n = 1000, ..., 10000, but show the results for only three of them to save space.

References

[1] M. Afonso, J. Bioucas-Dias, and M. Figueiredo. An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems. IEEE Transactions on Image Processing, (99):1–1, 2009.

[2] F. Bach. Consistency of the group Lasso and multiple kernel learning.
The Journal of Machine Learning Research, 9:1179–1225, 2008.

[3] F. Bach. Structured sparsity-inducing norms through submodular functions. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 118–126. 2010.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[5] D. Bertsekas. Multiplier methods: a survey. Automatica, 12(2):133–145, 1976.

[6] D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

[7] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.

[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010.

[9] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

[10] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.

[11] X. Chen, Q. Lin, S. Kim, J. Peña, J. Carbonell, and E. Xing. An efficient proximal-gradient method for single and multi-task regression with structured sparsity. arXiv preprint arXiv:1005.4717, 2010.
Data Sets              Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    ADAL      1.14e+01   194     1.00e+00        8.4892e+02
                       FISTA-p   1.21e+01   20      1.11e+01        8.4892e+02
                       FISTA     2.49e+01   24      2.51e+01        8.4893e+02
ogl-dct-1000-10000-1   ADAL      3.31e+01   398     1.00e+00        1.4887e+03
                       FISTA-p   2.54e+01   41      5.61e+00        1.4887e+03
                       FISTA     6.33e+01   44      1.74e+01        1.4887e+03
ogl-dct-1000-15000-1   ADAL      6.09e+01   515     1.00e+00        2.7506e+03
                       FISTA-p   3.95e+01   52      4.44e+00        2.7506e+03
                       FISTA     9.73e+01   54      1.32e+01        2.7506e+03
ogl-dct-1000-20000-1   ADAL      9.52e+01   626     1.00e+00        3.3415e+03
                       FISTA-p   6.66e+01   63      6.10e+00        3.3415e+03
                       FISTA     1.81e+02   64      1.61e+01        3.3415e+03
ogl-dct-1000-25000-1   ADAL      1.54e+02   882     1.00e+00        4.1987e+03
                       FISTA-p   7.50e+01   88      3.20e+00        4.1987e+03
                       FISTA     1.76e+02   89      8.64e+00        4.1987e+03
ogl-dct-1000-30000-1   ADAL      1.87e+02   957     1.00e+00        4.6111e+03
                       FISTA-p   8.79e+01   96      2.86e+00        4.6111e+03
                       FISTA     2.24e+02   94      8.54e+00        4.6111e+03

Table 6: Numerical results for dct set 2 (scalability test) with $l_1/l_2$-regularization. All three algorithms were run in factorization mode with a fixed $\mu = \mu_0$.

[12] P. Combettes and J. Pesquet. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212, 2011.

[13] P. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2006.

[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[15] J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.

[16] J. Eckstein and P. Silva.
A practical relative error criterion for augmented Lagrangians. Technical report, Rutgers University, 2011.

[17] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

[18] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Rev. Française d'Automat. Inf. Recherche Opérationelle, (9):41–76, 1975.

Data Sets              Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    ADAL      1.53e+01   266     1.00e+00        7.3218e+02
                       FISTA-p   1.61e+01   10      3.05e+01        7.3219e+02
                       FISTA     3.02e+01   16      4.09e+01        7.3233e+02
                       ProxFlow  1.97e+01   -       -               7.3236e+02
ogl-dct-1000-10000-1   ADAL      3.30e+01   330     1.00e+00        1.2707e+03
                       FISTA-p   3.16e+01   10      3.10e+01        1.2708e+03
                       FISTA     7.27e+01   24      3.25e+01        1.2708e+03
                       ProxFlow  3.67e+01   -       -               1.2709e+03
ogl-dct-1000-15000-1   ADAL      4.83e+01   328     1.00e+00        2.2444e+03
                       FISTA-p   5.40e+01   15      2.52e+01        2.2444e+03
                       FISTA     8.64e+01   23      2.66e+01        2.2449e+03
                       ProxFlow  9.91e+01   -       -               2.2467e+03
ogl-dct-1000-20000-1   ADAL      8.09e+01   463     1.00e+00        2.6340e+03
                       FISTA-p   8.09e+01   16      2.88e+01        2.6340e+03
                       FISTA     1.48e+02   26      2.93e+01        2.6342e+03
                       ProxFlow  2.55e+02   -       -               2.6357e+03
ogl-dct-1000-25000-1   ADAL      7.48e+01   309     1.00e+00        3.5566e+03
                       FISTA-p   1.15e+02   30      1.83e+01        3.5566e+03
                       FISTA     2.09e+02   38      2.30e+01        3.5568e+03
                       ProxFlow  1.38e+02   -       -               3.5571e+03
ogl-dct-1000-30000-1   ADAL      9.99e+01   359     1.00e+00        3.7057e+03
                       FISTA-p   1.55e+02   29      2.17e+01        3.7057e+03
                       FISTA     2.60e+02   39      2.25e+01        3.7060e+03
                       ProxFlow  1.07e+02   -       -               3.7063e+03

Table 7: Numerical results for dct set 2 (scalability test) with $l_1/l_\infty$-regularization. The algorithm configurations are exactly the same as in Table 6.

[19] D. Goldfarb, S. Ma, and K. Scheinberg.
Fast alternating linearization methods for minimizing the sum of two convex functions. arXiv preprint arXiv:0912.4571v2, 2009.

[20] T. Goldstein and S. Osher. The split Bregman method for l1-regularized problems. SIAM Journal on Imaging Sciences, 2:323, 2009.

[21] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

[22] M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[23] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 417–424. ACM, 2009.

[24] L. Jacob, G. Obozinski, and J. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.

[25] R. Jenatton, J. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Stat, 1050, 2009.

Data Sets              Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    FISTA-p   1.83e+01   12      2.34e+01        8.4892e+02
                       FISTA     2.49e+01   24      2.51e+01        8.4893e+02
                       ADAL      1.35e+01   181     1.00e+00        8.4892e+02
ogl-dct-1000-10000-1   FISTA-p   3.16e+01   14      1.73e+01        1.4887e+03
                       FISTA     6.33e+01   44      1.74e+01        1.4887e+03
                       ADAL      4.43e+01   270     1.00e+00        1.4887e+03
ogl-dct-1000-15000-1   FISTA-p   4.29e+01   14      1.51e+01        2.7506e+03
                       FISTA     9.73e+01   54      1.32e+01        2.7506e+03
                       ADAL      5.37e+01   216     1.00e+00        2.7506e+03
ogl-dct-1000-20000-1   FISTA-p   7.53e+01   13      2.06e+01        3.3416e+03
                       FISTA     1.81e+02   64      1.61e+01        3.3415e+03
                       ADAL      1.57e+02   390     1.00e+00        3.3415e+03
ogl-dct-1000-25000-1   FISTA-p   7.41e+01   15      1.47e+01        4.1987e+03
                       FISTA     1.76e+02   89      8.64e+00        4.1987e+03
                       ADAL      8.79e+01   231     1.00e+00        4.1987e+03
ogl-dct-1000-30000-1   FISTA-p   8.95e+01   14      1.58e+01        4.6111e+03
                       FISTA     2.24e+02   94      8.54e+00        4.6111e+03
                       ADAL      1.12e+02   249     1.00e+00        4.6111e+03

Table 8: Numerical results for the DCT set with $l_1/l_2$-regularization. FISTA-p and ADAL were run in PCG mode with the dynamic scheme for updating $\mu$; $\mu$ was fixed at $\mu_0$ for FISTA.

[26] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. arXiv preprint arXiv:0909.1440, 2009.

[27] S. Kim and E. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the 27th Annual International Conference on Machine Learning, 2010.

[28] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.

[29] J. Liu and J. Ye. Fast overlapping group Lasso. arXiv preprint arXiv:1009.0306, 2010.

[30] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1558–1566. 2010.

[31] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. arXiv preprint arXiv:1104.1872v1, 2011.

[32] S. Mosci, S. Villa, A. Verri, and L. Rosasco. A primal-dual algorithm for group sparse regularization with overlapping groups. In NIPS, 2010.

[33] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[34] J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 1999.

[35] J. Pesquet and N. Pustelnik. A parallel inertial proximal optimization method. Preprint, 2010.
Data Sets              Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    FISTA-p   2.30e+01   11      2.93e+01        7.3219e+02
                       ADAL      1.89e+01   265     1.00e+00        7.3218e+02
                       FISTA     3.02e+01   16      4.09e+01        7.3233e+02
                       ProxFlow  1.97e+01   -       -               7.3236e+02
ogl-dct-1000-10000-1   FISTA-p   5.09e+01   11      3.16e+01        1.2708e+03
                       ADAL      4.77e+01   323     1.00e+00        1.2708e+03
                       FISTA     7.27e+01   24      3.25e+01        1.2708e+03
                       ProxFlow  3.67e+01   -       -               1.2709e+03
ogl-dct-1000-15000-1   FISTA-p   6.33e+01   12      2.48e+01        2.2445e+03
                       ADAL      9.41e+01   333     1.00e+00        2.2444e+03
                       FISTA     8.64e+01   23      2.66e+01        2.2449e+03
                       ProxFlow  9.91e+01   -       -               2.2467e+03
ogl-dct-1000-20000-1   FISTA-p   8.21e+01   12      2.42e+01        2.6341e+03
                       ADAL      1.59e+02   415     1.00e+00        2.6340e+03
                       FISTA     1.48e+02   26      2.93e+01        2.6342e+03
                       ProxFlow  2.55e+02   -       -               2.6357e+03
ogl-dct-1000-25000-1   FISTA-p   1.43e+02   13      2.98e+01        3.5567e+03
                       ADAL      1.20e+02   310     1.00e+00        3.5566e+03
                       FISTA     2.09e+02   38      2.30e+01        3.5568e+03
                       ProxFlow  1.38e+02   -       -               3.5571e+03
ogl-dct-1000-30000-1   FISTA-p   1.75e+02   13      3.18e+01        3.7057e+03
                       ADAL      2.01e+02   361     1.00e+00        3.7057e+03
                       FISTA     2.60e+02   39      2.25e+01        3.7060e+03
                       ProxFlow  1.07e+02   -       -               3.7063e+03

Table 9: Numerical results for the DCT set with $l_1/l_\infty$-regularization. FISTA-p and ADAL were run in PCG mode. The dynamic updating scheme for $\mu$ was applied to FISTA-p, while $\mu$ was fixed at $\mu_0$ for ADAL and FISTA.

[36] G. Peyré and J. Fadili. Group sparsity with overlapping partition functions. In EUSIPCO 2011, 2011.

[37] G. Peyré, J. Fadili, and C. Chesneau. Adaptive structured block sparsity via dyadic partitioning. In EUSIPCO 2011, 2011.

[38] M. Powell. Optimization, chapter A Method for Nonlinear Constraints in Minimization Problems. Academic Press, New York, New York, 1972.

[39] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group Lasso. 2010.

[40] R. Rockafellar.
The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6):555–562, 1973.

[41] V. Roth and B. Fischer. The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848–855. ACM, 2008.

Data Sets          Algs      CPU (s)    Iters   Avg Sub-iters   F(x)
BreastCancerData   ADAL      6.24e+00   136     1.00e+00        2.9331e+03
                   APLM-S    4.02e+01   12      4.55e+01        2.9331e+03
                   FISTA-p   6.86e+00   12      1.48e+01        2.9331e+03
                   FISTA     5.11e+01   75      1.29e+01        2.9340e+03
                   ProxGrad  7.76e+02   6605    1.00e+00        -

Table 10: Numerical results for the breast cancer data using $l_1/l_2$-regularization. In this experiment, we kept $\mu$ constant at 0.01 for ADAL. The CPU time is for a single run on the entire data set, with the value of $\lambda$ selected to minimize the RMSE in Figure 4.

[42] S. Setzer, G. Steidl, and T. Teuber. Deblurring Poissonian images by split Bregman techniques. Journal of Visual Communication and Image Representation, 21(3):193–199, 2010.

[43] J. Spingarn. Partial inverse of a monotone operator. Applied Mathematics & Optimization, 10(1):247–265, 1983.

[44] A. Subramanian, P. Tamayo, V. Mootha, S. Mukherjee, B. Ebert, M. Gillette, A. Paulovich, S. Pomeroy, T. Golub, E. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545, 2005.

[45] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

[46] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In ICCV, page 255. IEEE Computer Society, 1999.

[47] M. Van De Vijver, Y.
He, L. van't Veer, H. Dai, A. Hart, D. Voskuil, G. Schreiber, J. Peterse, C. Roberts, M. Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999, 2002.

[48] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

[49] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[50] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
