Recursive $\ell_{1,\infty}$ Group Lasso


Authors: Yilun Chen, Alfred O. Hero III

Yilun Chen, Student Member, IEEE, and Alfred O. Hero, III, Fellow, IEEE

Abstract—We introduce a recursive adaptive group lasso algorithm for real-time penalized least squares prediction that produces a time sequence of optimal sparse predictor coefficient vectors. At each time index the proposed algorithm computes an exact update of the optimal $\ell_{1,\infty}$-penalized recursive least squares (RLS) predictor. Each update requires solving a convex but non-differentiable optimization problem. We develop an online homotopy method to reduce the computational complexity. Numerical simulations demonstrate that the proposed algorithm outperforms the $\ell_1$-regularized RLS algorithm for a group-sparse system identification problem and has lower implementation complexity than direct group lasso solvers.

Index Terms—RLS, group sparsity, mixed norm, homotopy, group lasso, system identification

I. INTRODUCTION

Recursive Least Squares (RLS) is a widely used method for adaptive filtering and prediction in signal processing and related fields. Its applications include acoustic echo cancelation, wireless channel equalization, interference cancelation, and data streaming predictors. In these applications a measurement stream is recursively fitted to a linear model, described by the coefficients of an FIR prediction filter, in such a way as to minimize a weighted average of squared residual prediction errors. Compared to other adaptive filtering algorithms such as Least Mean Square (LMS) filters, RLS is popular because of its fast convergence and low steady-state error.

In many applications it is natural to constrain the predictor coefficients to be sparse. In such cases the adaptive FIR prediction filter is a sparse system: only a few of the impulse response coefficients are non-zero. Sparse systems can be divided into general sparse systems and group-sparse systems [1], [2].
Unlike a general sparse system, whose impulse response can have arbitrary sparse structure, a group-sparse system has an impulse response composed of a few distinct clusters of non-zero coefficients. Examples of group-sparse systems include specular multipath acoustic and wireless channels [3], [4] and compressive spectrum sensing of narrowband sources [5].

The exploitation of sparsity to improve prediction performance has attracted considerable interest. For general sparse systems, the $\ell_1$ norm has been recognized as an effective promoter of sparsity [6], [7]. In particular, $\ell_1$-regularized LMS [2], [8] and RLS [9], [10] algorithms have been proposed for sparsification of adaptive filters. For group-sparse systems, mixed norms such as the $\ell_{1,2}$ norm and the $\ell_{1,\infty}$ norm have been applied to promote sparsity in statistical regression [11]–[13], commonly referred to as the group lasso, and in sparse signal recovery for signal processing and communications [1], [14]. However, most of the proposed estimation algorithms operate in offline mode and are not designed for time-varying systems and online prediction. This is the motivation for our work.

In this paper, we propose an RLS method penalized by the $\ell_{1,\infty}$ norm to promote group sparsity, called the recursive $\ell_{1,\infty}$ group lasso. Our recursive group lasso algorithm is suitable for online applications where data are acquired sequentially. The algorithm is based on the homotopy approach to solving the lasso problem and is an extension of [15]–[17] to group-sparse systems.

Y. Chen and A. O. Hero are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA. Tel: 1-734-763-0564. Fax: 1-734-763-8041. Emails: {yilun, hero}@umich.edu. This work was partially supported by AFOSR, grant number FA9550-06-1-0324.

The paper is organized as follows. Section II formulates the problem.
In Section III we develop the homotopy-based algorithm to solve the recursive $\ell_{1,\infty}$ group lasso in an online recursive manner. Section IV provides numerical simulation results and Section V summarizes our principal conclusions. The proofs of theorems and some details of the proposed algorithm are provided in the Appendix.

Notations: In the following, matrices and vectors are denoted by boldface upper case and boldface lower case letters, respectively; $(\cdot)^T$ denotes the transpose operator, and $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the $\ell_1$ and $\ell_\infty$ norms of a vector, respectively; for a set $\mathcal{A}$, $|\mathcal{A}|$ denotes its cardinality and $\phi$ denotes the empty set; $\mathbf{x}_{\mathcal{A}}$ denotes the sub-vector of $\mathbf{x}$ indexed by the set $\mathcal{A}$, and $\mathbf{R}_{\mathcal{A}\mathcal{B}}$ denotes the sub-matrix of $\mathbf{R}$ formed from the row index set $\mathcal{A}$ and the column index set $\mathcal{B}$.

II. PROBLEM FORMULATION

A. Recursive Least Squares

Let $\mathbf{w}$ be a $p$-dimensional coefficient vector. Let $\mathbf{y}$ be an $n$-dimensional vector comprised of observations $\{y_j\}_{j=1}^{n}$, and let $\{\mathbf{x}_j\}_{j=1}^{n}$ be a sequence of $p$-dimensional predictor variables. In standard adaptive filtering terminology, $y_j$, $\mathbf{x}_j$ and $\mathbf{w}$ are the primary signal, the reference signal, and the adaptive filter weights, respectively. The RLS algorithm solves the following quadratic minimization problem recursively over time $n = p, p+1, \ldots$:

$$\hat{\mathbf{w}}_n = \arg\min_{\mathbf{w}} \sum_{j=1}^{n} \gamma^{n-j} \left( y_j - \mathbf{w}^T \mathbf{x}_j \right)^2, \qquad (1)$$

where $\gamma \in (0, 1]$ is the forgetting factor controlling the trade-off between transient and steady-state behaviors.

To serve as a template for the sparse RLS extensions described below, we briefly review the RLS update algorithm. Define $\mathbf{R}_n$ and $\mathbf{r}_n$ as

$$\mathbf{R}_n = \sum_{j=1}^{n} \gamma^{n-j} \mathbf{x}_j \mathbf{x}_j^T \qquad (2)$$

and

$$\mathbf{r}_n = \sum_{j=1}^{n} \gamma^{n-j} \mathbf{x}_j y_j. \qquad (3)$$

[Fig. 1. Examples of (a) a general sparse system and (b) a group-sparse system.]

The solution $\hat{\mathbf{w}}_n$ to (1) can then be expressed as

$$\hat{\mathbf{w}}_n = \mathbf{R}_n^{-1} \mathbf{r}_n. \qquad (4)$$
The matrix $\mathbf{R}_n$ and the vector $\mathbf{r}_n$ are updated as $\mathbf{R}_n = \gamma \mathbf{R}_{n-1} + \mathbf{x}_n \mathbf{x}_n^T$ and $\mathbf{r}_n = \gamma \mathbf{r}_{n-1} + \mathbf{x}_n y_n$. Applying the Sherman-Morrison-Woodbury formula [18],

$$\mathbf{R}_n^{-1} = \gamma^{-1} \mathbf{R}_{n-1}^{-1} - \gamma^{-1} \alpha_n \mathbf{g}_n \mathbf{g}_n^T, \qquad (5)$$

where

$$\mathbf{g}_n = \mathbf{R}_{n-1}^{-1} \mathbf{x}_n \qquad (6)$$

and

$$\alpha_n = \frac{1}{\gamma + \mathbf{x}_n^T \mathbf{g}_n}. \qquad (7)$$

Substituting (5) into (4), we obtain the weight update [19]

$$\hat{\mathbf{w}}_n = \hat{\mathbf{w}}_{n-1} + \alpha_n \mathbf{g}_n e_n, \qquad (8)$$

where

$$e_n = y_n - \hat{\mathbf{w}}_{n-1}^T \mathbf{x}_n. \qquad (9)$$

Equations (5)–(9) define the RLS algorithm, which has computational complexity of order $O(p^2)$.

B. Non-recursive $\ell_{1,\infty}$ Group Lasso

The $\ell_{1,\infty}$ group lasso is a regularized least squares approach which uses the $\ell_{1,\infty}$ mixed norm to promote a group-wise sparse pattern in the predictor coefficient vector. The $\ell_{1,\infty}$ norm of a vector $\mathbf{w}$ is defined as

$$\|\mathbf{w}\|_{1,\infty} = \sum_{m=1}^{M} \|\mathbf{w}_{\mathcal{G}_m}\|_\infty,$$

where $\{\mathcal{G}_m\}_{m=1}^{M}$ is a group partition of the index set $\mathcal{G} = \{1, \ldots, p\}$, i.e.,

$$\bigcup_{m=1}^{M} \mathcal{G}_m = \mathcal{G}, \qquad \mathcal{G}_m \cap \mathcal{G}_{m'} = \phi \ \text{ if } \ m \neq m',$$

and $\mathbf{w}_{\mathcal{G}_m}$ is the sub-vector of $\mathbf{w}$ indexed by $\mathcal{G}_m$. The $\ell_{1,\infty}$ norm is a mixed norm: it encourages correlation among coefficients inside each group via the within-group $\ell_\infty$ norm, and promotes sparsity across groups via the $\ell_1$ norm. The mixed norm $\|\mathbf{w}\|_{1,\infty}$ is convex in $\mathbf{w}$ and reduces to $\|\mathbf{w}\|_1$ when each group contains only one coefficient, i.e., $|\mathcal{G}_1| = |\mathcal{G}_2| = \cdots = |\mathcal{G}_M| = 1$.

The $\ell_{1,\infty}$ group lasso solves the following penalized least squares problem:

$$\hat{\mathbf{w}}_n = \arg\min_{\mathbf{w}} \frac{1}{2} \sum_{j=1}^{n} \gamma^{n-j} \left( y_j - \mathbf{w}^T \mathbf{x}_j \right)^2 + \lambda \|\mathbf{w}\|_{1,\infty}, \qquad (10)$$

where $\lambda$ is a regularization parameter. Eq. (10) is a convex problem and can be solved by standard convex optimizers or path tracing algorithms [12]. Direct solution of (10) has computational complexity of $O(p^3)$.

C. Recursive $\ell_{1,\infty}$ Group Lasso

In this subsection we obtain a recursive solution to (10) that gives an update of $\hat{\mathbf{w}}_n$ from $\hat{\mathbf{w}}_{n-1}$.
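Before developing the recursive update, the two building blocks introduced so far — the plain RLS recursion (5)–(9) and the $\ell_{1,\infty}$ mixed norm — can be sketched in a few lines of Python (NumPy). This is an illustrative sketch with our own variable names, not the paper's recursive group lasso solver; the initial inverse scale `delta` is an assumption of the sketch.

```python
import numpy as np

def linf_group_norm(w, groups):
    """l_{1,inf} mixed norm: sum over groups of the within-group max magnitude."""
    return sum(np.max(np.abs(w[g])) for g in groups)

class RLS:
    """Standard (unpenalized) RLS via the updates (5)-(9)."""
    def __init__(self, p, gamma=0.9, delta=1e2):
        self.gamma = gamma
        self.P = delta * np.eye(p)   # running estimate of R_n^{-1}
        self.w = np.zeros(p)

    def update(self, x, y):
        g = self.P @ x                         # eq. (6)
        alpha = 1.0 / (self.gamma + x @ g)     # eq. (7)
        e = y - self.w @ x                     # eq. (9)
        self.w = self.w + alpha * g * e        # eq. (8)
        self.P = (self.P - alpha * np.outer(g, g)) / self.gamma  # eq. (5)
        return self.w
```

With singleton groups, `linf_group_norm` reduces to the $\ell_1$ norm, matching the remark above.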
The approach taken is a group-wise generalization of recent works [15], [16] that use the homotopy approach to sequentially solve the lasso problem. Using the definitions (2) and (3), the problem (10) is equivalent to

$$\hat{\mathbf{w}}_n = \arg\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{R}_n \mathbf{w} - \mathbf{w}^T \mathbf{r}_n + \lambda \|\mathbf{w}\|_{1,\infty} = \arg\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \left( \gamma \mathbf{R}_{n-1} + \mathbf{x}_n \mathbf{x}_n^T \right) \mathbf{w} - \mathbf{w}^T \left( \gamma \mathbf{r}_{n-1} + \mathbf{x}_n y_n \right) + \lambda \|\mathbf{w}\|_{1,\infty}. \qquad (11)$$

Let $f(\beta, \lambda)$ be the solution to the following parameterized problem:

$$f(\beta, \lambda) = \arg\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \left( \gamma \mathbf{R}_{n-1} + \beta \mathbf{x}_n \mathbf{x}_n^T \right) \mathbf{w} - \mathbf{w}^T \left( \gamma \mathbf{r}_{n-1} + \beta \mathbf{x}_n y_n \right) + \lambda \|\mathbf{w}\|_{1,\infty}, \qquad (12)$$

where $\beta$ is a constant between 0 and 1. The solutions $\hat{\mathbf{w}}_{n-1}$ and $\hat{\mathbf{w}}_n$ of problem (11) can then be expressed as $\hat{\mathbf{w}}_{n-1} = f(0, \gamma\lambda)$ and $\hat{\mathbf{w}}_n = f(1, \lambda)$. Our proposed method computes $\hat{\mathbf{w}}_n$ from $\hat{\mathbf{w}}_{n-1}$ in the following two steps:

Step 1. Fix $\beta = 0$ and calculate $f(0, \lambda)$ from $f(0, \gamma\lambda)$. This is accomplished by computing the regularization path between $\gamma\lambda$ and $\lambda$ using homotopy methods introduced for the non-recursive $\ell_{1,\infty}$ group lasso. The solution path is piecewise linear and the algorithm is described in [12].

Step 2. Fix $\lambda$ and calculate the solution path between $f(0, \lambda)$ and $f(1, \lambda)$. This is the key problem addressed in this paper.

[Fig. 2. Illustration of the partitioning of a 20-element coefficient vector $\mathbf{w}$ into 5 groups of 4 indices. The sets $\mathcal{P}$ and $\mathcal{Q}$ contain the active groups and the inactive groups, respectively. Within each of the two active groups the maximal coefficients are denoted by the dark red color.]

To ease the notation we denote $\mathbf{x}_n$ and $y_n$ by $\mathbf{x}$ and $y$, respectively, and define the following variables:

$$\mathbf{R}(\beta) = \gamma \mathbf{R}_{n-1} + \beta \mathbf{x}\mathbf{x}^T, \qquad (13)$$

$$\mathbf{r}(\beta) = \gamma \mathbf{r}_{n-1} + \beta \mathbf{x} y. \qquad (14)$$

Problem (12) is then

$$f(\beta, \lambda) = \arg\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{R}(\beta) \mathbf{w} - \mathbf{w}^T \mathbf{r}(\beta) + \lambda \|\mathbf{w}\|_{1,\infty}. \qquad (15)$$

In Section III we will show how to propagate $f(0, \lambda)$ to $f(1, \lambda)$ using the homotopy approach applied to (15).

III. ONLINE HOMOTOPY UPDATE

A. Set Notation
We begin by introducing a series of set definitions; Figure 2 provides an example. We divide the set of group indices into $\mathcal{P}$ and $\mathcal{Q}$, where $\mathcal{P}$ contains the active groups and $\mathcal{Q}$ is its complement. For each active group $m \in \mathcal{P}$, we partition the group into two parts: the maximal values, with indices $\mathcal{A}_m$, and the rest of the values, with indices $\mathcal{B}_m$:

$$\mathcal{A}_m = \arg\max_{i \in \mathcal{G}_m} |w_i|, \quad m \in \mathcal{P}, \qquad \mathcal{B}_m = \mathcal{G}_m - \mathcal{A}_m.$$

The sets $\mathcal{A}$ and $\mathcal{B}$ are defined as the unions of the $\mathcal{A}_m$ and $\mathcal{B}_m$ sets, respectively:

$$\mathcal{A} = \bigcup_{m \in \mathcal{P}} \mathcal{A}_m, \qquad \mathcal{B} = \bigcup_{m \in \mathcal{P}} \mathcal{B}_m.$$

Finally, we define

$$\mathcal{C} = \bigcup_{m \in \mathcal{Q}} \mathcal{G}_m \qquad \text{and} \qquad \mathcal{C}_m = \mathcal{G}_m \cap \mathcal{C}.$$

B. Optimality Condition

The objective function in (15) is convex but non-smooth, as the $\ell_{1,\infty}$ norm is non-differentiable. Therefore, problem (15) reaches its global minimum at $\mathbf{w}$ if and only if the sub-differential of the objective function contains the zero vector. Let $\partial \|\mathbf{w}\|_{1,\infty}$ denote the sub-differential of the $\ell_{1,\infty}$ norm at $\mathbf{w}$. A vector $\mathbf{z} \in \partial \|\mathbf{w}\|_{1,\infty}$ only if $\mathbf{z}$ satisfies the following conditions [12], [14]:

$$\|\mathbf{z}_{\mathcal{A}_m}\|_1 = 1, \quad m \in \mathcal{P}, \qquad (16)$$
$$\mathrm{sgn}(\mathbf{z}_{\mathcal{A}_m}) = \mathrm{sgn}(\mathbf{w}_{\mathcal{A}_m}), \quad m \in \mathcal{P}, \qquad (17)$$
$$\mathbf{z}_{\mathcal{B}} = \mathbf{0}, \qquad (18)$$
$$\|\mathbf{z}_{\mathcal{C}_m}\|_1 \leq 1, \quad m \in \mathcal{Q}, \qquad (19)$$

where $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{P}$ and $\mathcal{Q}$ are the $\beta$-dependent sets defined on $\mathbf{w}$ in Section III-A. For notational convenience we drop $\beta$ in $\mathbf{R}(\beta)$ and $\mathbf{r}(\beta)$, leaving the $\beta$-dependency implicit. The optimality condition is then written as

$$\mathbf{R}\mathbf{w} - \mathbf{r} + \lambda \mathbf{z} = \mathbf{0}, \quad \mathbf{z} \in \partial \|\mathbf{w}\|_{1,\infty}. \qquad (20)$$

As $\mathbf{w}_{\mathcal{C}} = \mathbf{0}$ and $\mathbf{z}_{\mathcal{B}} = \mathbf{0}$, (20) implies the three conditions

$$\mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{w}_{\mathcal{A}} + \mathbf{R}_{\mathcal{A}\mathcal{B}} \mathbf{w}_{\mathcal{B}} - \mathbf{r}_{\mathcal{A}} + \lambda \mathbf{z}_{\mathcal{A}} = \mathbf{0}, \qquad (21)$$
$$\mathbf{R}_{\mathcal{B}\mathcal{A}} \mathbf{w}_{\mathcal{A}} + \mathbf{R}_{\mathcal{B}\mathcal{B}} \mathbf{w}_{\mathcal{B}} - \mathbf{r}_{\mathcal{B}} = \mathbf{0}, \qquad (22)$$
$$\mathbf{R}_{\mathcal{C}\mathcal{A}} \mathbf{w}_{\mathcal{A}} + \mathbf{R}_{\mathcal{C}\mathcal{B}} \mathbf{w}_{\mathcal{B}} - \mathbf{r}_{\mathcal{C}} + \lambda \mathbf{z}_{\mathcal{C}} = \mathbf{0}. \qquad (23)$$

The vector $\mathbf{w}_{\mathcal{A}}$ lies in a low dimensional subspace. Indeed, by the definition of $\mathcal{A}_m$, if $|\mathcal{A}_m| > 1$ then $|w_i| = |w_{i'}|$ for all $i, i' \in \mathcal{A}_m$. Therefore, for any active group $m \in \mathcal{P}$,

$$\mathbf{w}_{\mathcal{A}_m} = \mathbf{s}_{\mathcal{A}_m} \alpha_m, \qquad (24)$$

where $\alpha_m = \|\mathbf{w}_{\mathcal{G}_m}\|_\infty$ and $\mathbf{s}_{\mathcal{A}} = \mathrm{sgn}(\mathbf{w}_{\mathcal{A}})$. Using matrix notation, we represent (24) as

$$\mathbf{w}_{\mathcal{A}} = \mathbf{S}\mathbf{a}, \qquad (25)$$

where

$$\mathbf{S} = \begin{pmatrix} \mathbf{s}_{\mathcal{A}_1} & & \\ & \ddots & \\ & & \mathbf{s}_{\mathcal{A}_{|\mathcal{P}|}} \end{pmatrix} \qquad (26)$$
is an $|\mathcal{A}| \times |\mathcal{P}|$ block-diagonal sign matrix and the vector $\mathbf{a}$ is comprised of the $\alpha_m$, $m \in \mathcal{P}$.

The solution to (15) can be determined in closed form if the sign matrix $\mathbf{S}$ and the sets $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$ are available. Indeed, from (16) and (17),

$$\mathbf{S}^T \mathbf{z}_{\mathcal{A}} = \mathbf{1}, \qquad (27)$$

where $\mathbf{1}$ is a $|\mathcal{P}| \times 1$ vector of ones. With (25) and (27), (21) and (22) are equivalent to

$$\mathbf{S}^T \mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S}\mathbf{a} + \mathbf{S}^T \mathbf{R}_{\mathcal{A}\mathcal{B}} \mathbf{w}_{\mathcal{B}} - \mathbf{S}^T \mathbf{r}_{\mathcal{A}} + \lambda \mathbf{1} = \mathbf{0}, \qquad \mathbf{R}_{\mathcal{B}\mathcal{A}} \mathbf{S}\mathbf{a} + \mathbf{R}_{\mathcal{B}\mathcal{B}} \mathbf{w}_{\mathcal{B}} - \mathbf{r}_{\mathcal{B}} = \mathbf{0}. \qquad (28)$$

Therefore, by defining the (a.s. invertible) matrix

$$\mathbf{H} = \begin{pmatrix} \mathbf{S}^T \mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S} & \mathbf{S}^T \mathbf{R}_{\mathcal{A}\mathcal{B}} \\ \mathbf{R}_{\mathcal{B}\mathcal{A}} \mathbf{S} & \mathbf{R}_{\mathcal{B}\mathcal{B}} \end{pmatrix} \qquad (29)$$

and

$$\mathbf{b} = \begin{pmatrix} \mathbf{S}^T \mathbf{r}_{\mathcal{A}} \\ \mathbf{r}_{\mathcal{B}} \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} \mathbf{a} \\ \mathbf{w}_{\mathcal{B}} \end{pmatrix}, \qquad (30)$$

(28) is equivalent to $\mathbf{H}\mathbf{v} = \mathbf{b} - \lambda \mathbf{e}$, where $\mathbf{e} = (\mathbf{1}^T, \mathbf{0}^T)^T$, so that

$$\mathbf{v} = \mathbf{H}^{-1} (\mathbf{b} - \lambda \mathbf{e}). \qquad (31)$$

As $\mathbf{w}_{\mathcal{A}} = \mathbf{S}\mathbf{a}$, the solution vector $\mathbf{w}$ can be directly obtained from $\mathbf{v}$ via (30). For the sub-gradient vector, it can be shown that

$$\lambda \mathbf{z}_{\mathcal{A}} = \mathbf{r}_{\mathcal{A}} - (\mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{A}\mathcal{B}}) \mathbf{v}, \qquad (32)$$
$$\mathbf{z}_{\mathcal{B}} = \mathbf{0}, \qquad (33)$$
$$\lambda \mathbf{z}_{\mathcal{C}} = \mathbf{r}_{\mathcal{C}} - (\mathbf{R}_{\mathcal{C}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{C}\mathcal{B}}) \mathbf{v}. \qquad (34)$$

C. Online Update

Now we consider (15) using the results of Section III-B. Let $\beta_0$ and $\beta_1$ be two constants such that $\beta_1 > \beta_0$. For a given value of $\beta \in [\beta_0, \beta_1]$, define the class of sets $\mathcal{S} = (\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$, and make the $\beta$-dependency explicit by writing $\mathcal{S}(\beta)$. Recall that $\mathcal{S}(\beta)$ is specified by the solution $f(\beta, \lambda)$ defined in (15). Assume that $\mathcal{S}(\beta)$ does not change for $\beta \in [\beta_0, \beta_1]$. The following theorem propagates $f(\beta_0, \lambda)$ to $f(\beta_1, \lambda)$ via a simple algebraic relation.

Theorem 1. Let $\beta_0$ and $\beta_1$ be two constants such that $\beta_1 > \beta_0$, and suppose that for every $\beta \in [\beta_0, \beta_1]$ the solutions to (15) share the same sets $\mathcal{S} = (\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$. Let $\mathbf{v}'$ and $\mathbf{v}$ be the vectors (30) corresponding to $f(\beta_1, \lambda)$ and $f(\beta_0, \lambda)$, respectively.
Then

$$\mathbf{v}' = \mathbf{v} + \frac{\beta_1 - \beta_0}{1 + \sigma_H^2 \beta_1} (y - \hat{y}) \mathbf{g}, \qquad (35)$$

and the corresponding sub-gradient vector has the explicit update

$$\lambda \mathbf{z}'_{\mathcal{A}} = \lambda \mathbf{z}_{\mathcal{A}} + \frac{\beta_1 - \beta_0}{1 + \sigma_H^2 \beta_1} (y - \hat{y}) \left\{ \mathbf{x}_{\mathcal{A}} - (\mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{A}\mathcal{B}}) \mathbf{g} \right\} \qquad (36)$$

and

$$\lambda \mathbf{z}'_{\mathcal{C}} = \lambda \mathbf{z}_{\mathcal{C}} + \frac{\beta_1 - \beta_0}{1 + \sigma_H^2 \beta_1} (y - \hat{y}) \left\{ \mathbf{x}_{\mathcal{C}} - (\mathbf{R}_{\mathcal{C}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{C}\mathcal{B}}) \mathbf{g} \right\}, \qquad (37)$$

where $\mathbf{R} = \mathbf{R}(0)$ as defined in (13), $(\mathbf{x}, y)$ is the new sample as in (13) and (14), the sign matrix $\mathbf{S}$ is obtained from the solution at $\beta = \beta_0$, $\mathbf{H}_0$ is calculated from (29) using $\mathbf{S}$ and $\mathbf{R}(0)$, and $\mathbf{d}$, $\mathbf{g}$, $\hat{y}$ and $\sigma_H^2$ are defined by

$$\mathbf{d} = \begin{pmatrix} \mathbf{S}^T \mathbf{x}_{\mathcal{A}} \\ \mathbf{x}_{\mathcal{B}} \end{pmatrix}, \qquad (38)$$
$$\mathbf{g} = \mathbf{H}_0^{-1} \mathbf{d}, \qquad (39)$$
$$\hat{y} = \mathbf{d}^T \mathbf{v}, \qquad (40)$$
$$\sigma_H^2 = \mathbf{d}^T \mathbf{g}. \qquad (41)$$

The proof of Theorem 1 is provided in Appendix A. Theorem 1 gives a closed-form update for the solution path $f(\beta_0, \lambda) \to f(\beta_1, \lambda)$ under the assumption that the associated sets $\mathcal{S}(\beta)$ remain unaltered over the path. Next, we partition the range $\beta \in [0, 1]$ into contiguous segments over which $\mathcal{S}(\beta)$ is piecewise constant. Within each segment we can use Theorem 1 to propagate the solution from the left endpoint to the right endpoint. Below we specify an algorithm for finding the endpoints of each of these segments.

Fix an endpoint $\beta_0$ of one of these segments. We seek a critical point $\beta_1$, defined as the maximum $\beta_1$ such that $\mathcal{S}(\beta)$ remains unchanged within $[\beta_0, \beta_1]$. As $\beta_1$ increases from $\beta_0$, the sets $\mathcal{S}(\beta)$ will not change until at least one of the following conditions is met:

Condition 1. There exists $i \in \mathcal{A}$ such that $z'_i = 0$;
Condition 2. There exists $i \in \mathcal{B}_m$ such that $|w'_i| = \alpha'_m$;
Condition 3. There exists $m \in \mathcal{P}$ such that $\alpha'_m = 0$;
Condition 4. There exists $m \in \mathcal{Q}$ such that $\|\mathbf{z}'_{\mathcal{C}_m}\|_1 = 1$.

Condition 1 follows from (17) and (18), Conditions 2 and 3 are based on the definitions of $\mathcal{A}$ and $\mathcal{P}$, respectively, and Condition 4 comes from (16) and (19). Following [12], [20], the four conditions can be assumed to be mutually exclusive. The actions corresponding to Conditions 1–4 are as follows.

Action 1.
Move the entry $i$ from $\mathcal{A}$ to $\mathcal{B}$: $\mathcal{A} \leftarrow \mathcal{A} - \{i\}$, $\mathcal{B} \leftarrow \mathcal{B} \cup \{i\}$;

Action 2. Move the entry $i$ from $\mathcal{B}$ to $\mathcal{A}$: $\mathcal{A} \leftarrow \mathcal{A} \cup \{i\}$, $\mathcal{B} \leftarrow \mathcal{B} - \{i\}$;

Action 3. Remove group $m$ from the active group list: $\mathcal{P} \leftarrow \mathcal{P} - \{m\}$, $\mathcal{Q} \leftarrow \mathcal{Q} \cup \{m\}$, and update the related sets $\mathcal{A} \leftarrow \mathcal{A} - \mathcal{A}_m$, $\mathcal{C} \leftarrow \mathcal{C} \cup \mathcal{A}_m$;

Action 4. Add group $m$ to the active group list: $\mathcal{P} \leftarrow \mathcal{P} \cup \{m\}$, $\mathcal{Q} \leftarrow \mathcal{Q} - \{m\}$, and update the related sets $\mathcal{A} \leftarrow \mathcal{A} \cup \mathcal{C}_m$, $\mathcal{C} \leftarrow \mathcal{C} - \mathcal{C}_m$.

By Theorem 1, the solution update from $\beta_0$ to $\beta_1$ is in closed form, and the critical point $\beta_1$ can be determined in a straightforward manner (details are provided in Appendix B). Let $\beta_1^{(k)}$, $k = 1, \ldots, 4$, be the minimum value greater than $\beta_0$ that meets Condition $k$. The critical point is then $\beta_1 = \min_{k=1,\ldots,4} \beta_1^{(k)}$.

D. Homotopy Algorithm Implementation

We now have all the ingredients for the homotopy update algorithm; the pseudo code is given in Algorithm 1.

Algorithm 1: Homotopy update from $f(0, \lambda)$ to $f(1, \lambda)$.

    Input: $f(0, \lambda)$, $\mathbf{R}(0)$, $\mathbf{x}$, $y$
    Output: $f(1, \lambda)$
    Initialize $\beta_0 = 0$, $\beta_1 = 0$, $\mathbf{R} = \mathbf{R}(0)$;
    Calculate $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$ and $(\mathbf{v}, \lambda\mathbf{z}_{\mathcal{A}}, \lambda\mathbf{z}_{\mathcal{C}})$ from $f(0, \lambda)$;
    while $\beta_0 < 1$ do
        Calculate the environmental variables $(\mathbf{S}, \mathbf{H}_0, \mathbf{d}, \mathbf{g}, \hat{y}, \sigma_H^2)$ from $f(\beta_0, \lambda)$ and $\mathbf{R}$;
        Calculate $\{\beta_1^{(k)}\}_{k=1}^{4}$ meeting Conditions 1–4, respectively;
        Calculate the critical point $\beta_1$ meeting Condition $k^*$: $k^* = \arg\min_k \beta_1^{(k)}$ and $\beta_1 = \beta_1^{(k^*)}$;
        if $\beta_1 \leq 1$ then
            Update $(\mathbf{v}, \lambda\mathbf{z}_{\mathcal{A}}, \lambda\mathbf{z}_{\mathcal{C}})$ using (35), (36) and (37);
            Update $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$ by Action $k^*$;
            $\beta_0 = \beta_1$;
        else
            break;
        end
    end
    $\beta_1 = 1$;
    Update $(\mathbf{v}, \lambda\mathbf{z}_{\mathcal{A}}, \lambda\mathbf{z}_{\mathcal{C}})$ using (35);
    Calculate $f(1, \lambda)$ from $\mathbf{v}$.

Next we analyze the computational cost of Algorithm 1. The complexity of computing each critical point is summarized in Table I, where $N$ is the dimension of $\mathbf{H}_0$. As $N = |\mathcal{P}| + |\mathcal{B}| \leq |\mathcal{A}| + |\mathcal{B}|$, $N$ is upper bounded by the number of non-zeros in
The vector g can be computed in O ( N 2 ) time using the matrix-in verse lemma [18] and the fact that, for each action, H 0 is at most perturbed by a rank-two matrix. This implies that the computation complexity per critical point is O ( p max { N , log p } ) and the total complexity of the online update is O ( k 2 · p max { N , log p } ) , where k 2 is the number of critical points of β in the solution path f (0 , λ ) → f (1 , λ ) . This is the computational cost required for Step 2 in Section II-C. A similar analysis can be performed for the complexity of Step 1, which requires O ( k 1 · p max { N , log p } ) where k 1 is the number of critical points in the solution path f (0 , γ λ ) → f (0 , λ ) . Therefore, the overall computation complexity of the recursiv e ` 1 , ∞ group lasso is O ( k · p max { N , log p } ) , where k = k 1 + k 2 , i.e., the total number of critical points in the solution path f (0 , γ λ ) → f (0 , λ ) → f (1 , λ ) . An instructiv e benchmark is to directly solve the n -samples problem (12) from the solution path f (1 , ∞ ) ( i.e., a zero vector) → f (1 , λ ) [12], without using the pre vious solution ˆ w n − 1 . This algorithm, called iCap in [12], requires O ( k 0 · p max { N , log p } ) , where k 0 is the number of critical points in f (1 , ∞ ) → f (1 , λ ) . Empirical comparisons between k and k 0 , provided in the following section, indicate that iCap requires significantly more computation than our proposed Algorithm 1. I V . N U M E R I C A L S I M U L A T I O N S In this section we demonstrate our proposed recursiv e ` 1 , ∞ group lasso algorithm by numerical simulation. W e simulated the model y j = w T ∗ x j + v j , j = 1 , . . . 
where $v_j$ is zero-mean Gaussian noise and $\mathbf{w}_*$ is a sparse $p = 100$ element vector containing only 14 non-zero coefficients, clustered between indices 29 and 42; see Fig. 3(a). After 200 time units, the locations of the non-zero coefficients of $\mathbf{w}_*$ are shifted to the right, as indicated in Fig. 3(b).

TABLE I. COMPUTATION COSTS OF THE ONLINE HOMOTOPY UPDATE FOR EACH CRITICAL POINT

| Operation | Cost |
|---|---|
| $\mathbf{g} = \mathbf{H}_0^{-1}\mathbf{d}$ | $O(N^2)$ |
| $\mathbf{x}_{\mathcal{A}} - (\mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{A}\mathcal{B}}) \mathbf{g}$ | $O(|\mathcal{A}| N)$ |
| $\mathbf{x}_{\mathcal{C}} - (\mathbf{R}_{\mathcal{C}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{C}\mathcal{B}}) \mathbf{g}$ | $O(|\mathcal{C}| N)$ |
| $\beta_1^{(1)}$ | $O(|\mathcal{A}|)$ |
| $\beta_1^{(2)}$ | $O(|\mathcal{B}|)$ |
| $\beta_1^{(3)}$ | $O(|\mathcal{P}|)$ |
| $\beta_1^{(4)}$ | $O(|\mathcal{C}| \log |\mathcal{C}|)$ |

[Fig. 3. Responses of the time varying system. (a): Initial response. (b): Response after the 200th iteration. The groups for Algorithm 1 were chosen as 20 equal size contiguous groups of coefficients partitioning the range $1, \ldots, 100$.]

The input vectors were generated as independent identically distributed Gaussian random vectors with zero mean and identity covariance matrix, and the variance of the observation noise $v_j$ is 0.01. We created the groups in the recursive $\ell_{1,\infty}$ group lasso by dividing the 100 RLS filter coefficients into 20 contiguous groups, each containing 5 coefficients. The forgetting factor $\gamma$ and the regularization parameter $\lambda$ were set to 0.9 and 0.1, respectively. We repeated the simulation 100 times; the averaged mean squared errors of standard RLS, sparse RLS, and the proposed algorithm are shown in Fig. 4. We implemented standard RLS and $\ell_1$-regularized sparse RLS with forgetting factors also set to 0.9. For sparse RLS [15] we chose the regularization parameter achieving the lowest steady-state error, resulting in $\lambda = 0.05$. It can be seen from Fig. 4 that our proposed sparse RLS method outperforms standard RLS and sparse RLS in both convergence rate and steady-state MSE.
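For reference, the data-generating model of this simulation can be sketched in Python (NumPy). This sketch reproduces only the stated setup (100 taps, 14 non-zero coefficients at indices 29–42, noise variance 0.01, 20 groups of 5); the non-zero amplitudes, the seed, and the shift amount at the change point are our own assumptions, and the recursive group lasso solver itself is not reproduced here.

```python
import numpy as np

def make_group_sparse_system(p=100, support=range(28, 42), seed=0):
    """Group-sparse impulse response: 14 non-zero taps clustered at
    (1-based) indices 29-42, as in Fig. 3(a). Amplitudes are assumed."""
    rng = np.random.default_rng(seed)
    w = np.zeros(p)
    idx = np.array(list(support))          # 0-based indices 28..41
    w[idx] = rng.standard_normal(idx.size)
    return w

def simulate(w_star, n=400, noise_var=0.01, shift=10, change_at=200, seed=1):
    """y_j = w*^T x_j + v_j with i.i.d. N(0, I) inputs; the support of w*
    shifts right at the change point (shift amount is an assumption)."""
    rng = np.random.default_rng(seed)
    p = w_star.size
    X = rng.standard_normal((n, p))
    W = np.vstack([w_star if j < change_at else np.roll(w_star, shift)
                   for j in range(n)])
    y = np.einsum('ij,ij->i', X, W) + np.sqrt(noise_var) * rng.standard_normal(n)
    return X, y

# 20 contiguous groups of 5 coefficients partitioning {0, ..., 99}
groups = [list(range(5 * m, 5 * m + 5)) for m in range(20)]
```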
This demonstrates the power of our group sparsity penalty. At the change point of 200 iterations, both the proposed method and the sparse RLS of [15] show superior tracking performance compared to standard RLS. We also observe that the proposed method achieves an even smaller MSE after the change point occurs. This is due to the fact that the active cluster spans group boundaries in the initial system (Fig. 3(a)), while the active clusters in the shifted system overlap with fewer groups.

[Fig. 4. Averaged MSE of the proposed algorithm, RLS and recursive lasso.]

[Fig. 5. Averaged number of critical points for the proposed recursive method of implementing the $\ell_{1,\infty}$ lasso and the iCap [12] non-recursive method of implementation.]

Fig. 5 shows the average number of critical points (accounting for both trajectories in $\beta$ and $\lambda$) of the proposed algorithm, i.e., the number $k$ defined in Section III-D. As a comparison, we implemented the iCap method of [12], a homotopy based algorithm that traces the solution path only over $\lambda$. The average number of critical points for iCap, the number $k_0$ of Section III-D, is also plotted in Fig. 5. The proposed algorithm and iCap yield the same solution but have computational complexities proportional to $k$ and $k_0$, respectively. It can be seen that the proposed algorithm saves as much as 75% of the computational cost for equivalent performance.

V. CONCLUSION

In this paper we proposed an $\ell_{1,\infty}$-regularized RLS algorithm for online sparse linear prediction. We developed a homotopy based method to sequentially update the solution vector as new measurements are acquired.
Our proposed algorithm uses the previous estimate as a "warm start," from which we compute the homotopy update to the current solution. The proposed algorithm can process streaming measurements with time-varying predictors and is computationally efficient compared to non-recursive group lasso solvers. Numerical simulations demonstrated that the proposed method outperforms standard and $\ell_1$-regularized RLS for identifying an unknown group-sparse system, in terms of both tracking and steady-state mean squared error. The work presented here assumed non-overlapping group partitions. In the future, we will investigate overlapping groups and other flexible partitions [21].

VI. APPENDIX

A. Proof of Theorem 1

We begin by deriving (35). According to (31),

$$\mathbf{v}' = \mathbf{H}'^{-1} (\mathbf{b}' - \lambda \mathbf{e}'). \qquad (42)$$

As $\mathbf{S}$ and $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{P}, \mathcal{Q})$ remain constant within $[\beta_0, \beta_1]$,

$$\mathbf{e}' = \mathbf{e}, \qquad (43)$$
$$\mathbf{b}' = \mathbf{b} + \delta \mathbf{d} y, \qquad (44)$$

and $\mathbf{H}' = \mathbf{H} + \delta \mathbf{d}\mathbf{d}^T$, where $\delta = \beta_1 - \beta_0$, and $\mathbf{H}$ and $\mathbf{b}$ are calculated using $\mathbf{S}$, $\mathbf{R}(\beta_0)$ and $\mathbf{r}(\beta_0)$. We emphasize that $\mathbf{H}$ is based on $\mathbf{R}(\beta_0)$ and is different from $\mathbf{H}_0$ defined in Theorem 1. According to the Sherman-Morrison-Woodbury formula,

$$\mathbf{H}'^{-1} = \mathbf{H}^{-1} - \frac{\delta}{1 + \sigma^2 \delta} (\mathbf{H}^{-1}\mathbf{d})(\mathbf{H}^{-1}\mathbf{d})^T, \qquad (45)$$

where $\sigma^2 = \mathbf{d}^T \mathbf{H}^{-1} \mathbf{d}$. Substituting (43), (44) and (45) into (42), after simplification we obtain

$$\mathbf{v}' = \left[ \mathbf{H}^{-1} - \frac{\delta}{1 + \sigma^2 \delta} (\mathbf{H}^{-1}\mathbf{d})(\mathbf{H}^{-1}\mathbf{d})^T \right] (\mathbf{b} + \delta \mathbf{d} y - \lambda \mathbf{e}) = \mathbf{v} + \frac{\delta}{1 + \sigma^2 \delta} (y - \mathbf{d}^T\mathbf{v}) \mathbf{H}^{-1}\mathbf{d} = \mathbf{v} + \frac{\delta}{1 + \sigma^2 \delta} (y - \hat{y}) \mathbf{H}^{-1}\mathbf{d}, \qquad (46)$$

where $\hat{y} = \mathbf{d}^T\mathbf{v}$ as defined in (40). Note that $\mathbf{H}$ is defined in terms of $\mathbf{R}(\beta_0)$ rather than $\mathbf{R}(0)$, and $\mathbf{H} = \mathbf{H}_0 + \beta_0 \mathbf{d}\mathbf{d}^T$, so that

$$\mathbf{H}^{-1} = \mathbf{H}_0^{-1} - \frac{\beta_0}{1 + \sigma_H^2 \beta_0} \mathbf{g}\mathbf{g}^T, \qquad (47)$$

where $\mathbf{g}$ and $\sigma_H^2$ are defined by (39) and (41), respectively. As $\sigma_H^2 = \mathbf{d}^T\mathbf{g}$,

$$\mathbf{H}^{-1}\mathbf{d} = \mathbf{H}_0^{-1}\mathbf{d} - \frac{\sigma_H^2 \beta_0}{1 + \sigma_H^2 \beta_0} \mathbf{g}. \qquad (48)$$
Accordingly,

$$\sigma^2 = \mathbf{d}^T \mathbf{H}^{-1} \mathbf{d} = \sigma_H^2 - \frac{\sigma_H^2 \beta_0}{1 + \sigma_H^2 \beta_0} \sigma_H^2 = \frac{\sigma_H^2}{1 + \sigma_H^2 \beta_0}. \qquad (49)$$

Substituting (48) and (49) into (46), we finally obtain

$$\mathbf{v}' = \mathbf{v} + \frac{\delta}{1 + \sigma_H^2 \beta_1} (y - \hat{y}) \mathbf{g} = \mathbf{v} + \frac{\beta_1 - \beta_0}{1 + \sigma_H^2 \beta_1} (y - \hat{y}) \mathbf{g}.$$

Equations (36) and (37) can be established by direct substitution of (35) into the definitions (32) and (34), which completes the proof of Theorem 1.

B. Computation of Critical Points

For ease of notation we work with $\rho$, defined by

$$\rho = \frac{\beta_1 - \beta_0}{1 + \sigma_H^2 \beta_1}. \qquad (50)$$

It is easy to see that for $\beta_1 > \beta_0$, $\rho$ is monotonically increasing in $\beta_1$ and takes values in $(0, 1/\sigma_H^2)$. Therefore, (50) can be inverted as

$$\beta_1 = \frac{\rho + \beta_0}{1 - \sigma_H^2 \rho}, \qquad (51)$$

where $\rho \in (0, 1/\sigma_H^2)$ to ensure $\beta_1 > \beta_0$. Suppose we have obtained $\rho^{(k)}$, $k = 1, \ldots, 4$; then $\beta_1^{(k)}$ can be calculated using (51) and the critical point is $\beta_1 = \min_{k=1,\ldots,4} \beta_1^{(k)}$. We now calculate the critical value of $\rho$ for each condition, one by one.

1) Critical point for Condition 1: Define the temporary vector

$$\mathbf{t}_{\mathcal{A}} = (y - \hat{y}) \left\{ \mathbf{x}_{\mathcal{A}} - (\mathbf{R}_{\mathcal{A}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{A}\mathcal{B}}) \mathbf{g} \right\}.$$

According to (36), $\lambda \mathbf{z}'_{\mathcal{A}} = \lambda \mathbf{z}_{\mathcal{A}} + \rho \mathbf{t}_{\mathcal{A}}$. Condition 1 is met for any $\rho = \rho_i^{(1)}$ such that

$$\rho_i^{(1)} = -\frac{\lambda z_i}{t_i}, \quad i \in \mathcal{A}.$$

Therefore, the critical value of $\rho$ that satisfies Condition 1 is

$$\rho^{(1)} = \min \left\{ \rho_i^{(1)} \;\middle|\; i \in \mathcal{A}, \; \rho_i^{(1)} \in (0, 1/\sigma_H^2) \right\}.$$

2) Critical point for Condition 2: By the definition (30), $\mathbf{v}$ is a concatenation of the $\alpha_m$ and the $\mathbf{w}_{\mathcal{B}_m}$, $m \in \mathcal{P}$:

$$\mathbf{v}^T = \left( (\alpha_m)_{m \in \mathcal{P}}, \mathbf{w}_{\mathcal{B}_1}^T, \ldots, \mathbf{w}_{\mathcal{B}_{|\mathcal{P}|}}^T \right), \qquad (52)$$

where $(\alpha_m)_{m \in \mathcal{P}}$ denotes the vector comprised of $\alpha_m$, $m \in \mathcal{P}$. Now we partition the vector $(y - \hat{y})\mathbf{g}$ in the same manner as (52), denoting by $\tau_m$ and $\mathbf{u}_m$ the counterparts of $\alpha_m$ and $\mathbf{w}_{\mathcal{B}_m}$, i.e.,

$$(y - \hat{y})\mathbf{g}^T = \left( (\tau_m)_{m \in \mathcal{P}}, \mathbf{u}_1^T, \ldots, \mathbf{u}_{|\mathcal{P}|}^T \right).$$

Eq. (35) is then equivalent to

$$\alpha'_m = \alpha_m + \rho \tau_m \qquad (53)$$

and

$$w'_{\mathcal{B}_m, i} = w_{\mathcal{B}_m, i} + \rho u_{m,i},$$

where $u_{m,i}$ is the $i$-th element of the vector $\mathbf{u}_m$.
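The two algebraic devices used in this appendix — the Sherman-Morrison step behind (45)–(46) and the change of variables (50)–(51) — are easy to sanity-check numerically. The following sketch uses arbitrary test values of our own choosing and absorbs the constant term $\lambda\mathbf{e}$ into $\mathbf{b}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Check the change of variables (50) and its inverse (51). ---
sigma2_H = 1.7                # an arbitrary positive value of sigma_H^2
beta0 = 0.2
for beta1 in [0.25, 0.5, 1.0]:
    rho = (beta1 - beta0) / (1.0 + sigma2_H * beta1)       # eq. (50)
    beta1_rec = (rho + beta0) / (1.0 - sigma2_H * rho)     # eq. (51)
    assert np.isclose(beta1_rec, beta1)

# --- Check the rank-one update (46) against a direct solve of (42). ---
N = 6
A = rng.standard_normal((N, N))
H = A @ A.T + N * np.eye(N)   # a symmetric positive definite stand-in for H
d = rng.standard_normal(N)
b = rng.standard_normal(N)    # b - lambda*e, folded into one vector
delta, y = 0.3, 0.8
v = np.linalg.solve(H, b)
sigma2 = d @ np.linalg.solve(H, d)
# closed-form update of v when H -> H + delta*d*d^T and b -> b + delta*d*y
v_new = v + delta / (1 + sigma2 * delta) * (y - d @ v) * np.linalg.solve(H, d)
v_direct = np.linalg.solve(H + delta * np.outer(d, d), b + delta * d * y)
assert np.allclose(v_new, v_direct)
```

Both assertions pass, confirming the rank-one identity that Theorem 1 rests on.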
Condition 2 requires $\alpha'_m = \pm w'_{\mathcal{B}_m,i}$, and is satisfied if $\rho = \rho_{m,i}^{(2+)}$ or $\rho = \rho_{m,i}^{(2-)}$, where

$$\rho_{m,i}^{(2+)} = \frac{\alpha_m - w_{\mathcal{B}_m,i}}{u_{m,i} - \tau_m}, \qquad \rho_{m,i}^{(2-)} = -\frac{\alpha_m + w_{\mathcal{B}_m,i}}{u_{m,i} + \tau_m}.$$

Therefore, the critical value of $\rho$ for Condition 2 is

$$\rho^{(2)} = \min \left\{ \rho_{m,i}^{(2\pm)} \;\middle|\; m \in \mathcal{P}, \; i = 1, \ldots, |\mathcal{B}_m|, \; \rho_{m,i}^{(2\pm)} \in (0, 1/\sigma_H^2) \right\}.$$

3) Critical point for Condition 3: According to (53), $\alpha'_m = 0$ yields $\rho = \rho_m^{(3)}$ determined by

$$\rho_m^{(3)} = -\frac{\alpha_m}{\tau_m}, \quad m \in \mathcal{P},$$

and the critical value is

$$\rho^{(3)} = \min \left\{ \rho_m^{(3)} \;\middle|\; m \in \mathcal{P}, \; \rho_m^{(3)} \in (0, 1/\sigma_H^2) \right\}.$$

4) Critical point for Condition 4: Define

$$\mathbf{t}_{\mathcal{C}} = (y - \hat{y}) \left\{ \mathbf{x}_{\mathcal{C}} - (\mathbf{R}_{\mathcal{C}\mathcal{A}} \mathbf{S} \;\; \mathbf{R}_{\mathcal{C}\mathcal{B}}) \mathbf{g} \right\}.$$

Eq. (37) is then $\lambda \mathbf{z}'_{\mathcal{C}_m} = \lambda \mathbf{z}_{\mathcal{C}_m} + \rho \mathbf{t}_{\mathcal{C}_m}$, and Condition 4 is equivalent to

$$\sum_{i \in \mathcal{C}_m} |\rho t_i + \lambda z_i| = \lambda. \qquad (54)$$

To solve (54) we develop a fast method with complexity $O(N \log N)$, where $N = |\mathcal{C}_m|$; the algorithm is given in Appendix C. For each $m \in \mathcal{Q}$, let $\rho_m^{(4)}$ be the minimum positive solution to (54). The critical value of $\rho$ for Condition 4 is then

$$\rho^{(4)} = \min \left\{ \rho_m^{(4)} \;\middle|\; m \in \mathcal{Q}, \; \rho_m^{(4)} \in (0, 1/\sigma_H^2) \right\}.$$

C. Fast Algorithm for Critical Condition 4

Here we develop an algorithm to solve problem (54). Consider the more general problem

$$\sum_{i=1}^{N} a_i |x - x_i| = y, \qquad (55)$$

where the $a_i$ and $x_i$ are constants with $a_i > 0$. Note that the notations here have no connection to those of the previous sections. Define the function

$$h(x) = \sum_{i=1}^{N} a_i |x - x_i|.$$

The problem is then equivalent to finding $h^{-1}(y)$, if it exists. It can be shown that $h(x)$ is piecewise linear and convex in $x$.

[Fig. 6. An illustration of the fast algorithm for critical condition 4: the function $h(x)$ with breakpoints $x_1, \ldots, x_5$, values $h(x_1), \ldots, h(x_4)$, and segment slopes $k_0, \ldots, k_5$.]

Therefore, equation (55) generally has two solutions, if any exist, denoted
as $x_{\min}$ and $x_{\max}$. Based on the piecewise linearity we propose a search algorithm to solve (55). The pseudo code is shown in Algorithm 2; its computational complexity is dominated by the sorting operation, which requires $O(N \log N)$.

Algorithm 2: Solve $\sum_{i=1}^{N} a_i |x - x_i| = y$ for $x$.

    Input: $\{a_i, x_i\}_{i=1}^{N}$, $y$
    Output: $x_{\min}$, $x_{\max}$
    Sort $\{x_i\}_{i=1}^{N}$ in ascending order: $x_1 \leq x_2 \leq \ldots \leq x_N$;
    Re-order $\{a_i\}_{i=1}^{N}$ so that $a_i$ corresponds to $x_i$;
    Set $k_0 = -\sum_{i=1}^{N} a_i$;
    for $i = 1, \ldots, N$ do
        $k_i = k_{i-1} + 2 a_i$;
    end
    Calculate $h_1 = \sum_{i=2}^{N} a_i |x_1 - x_i|$;
    for $i = 2, \ldots, N$ do
        $h_i = h_{i-1} + k_{i-1} (x_i - x_{i-1})$;
    end
    if $\min_i h_i > y$ then
        No solution; exit;
    else
        if $y > h_1$ then
            $x_{\min} = x_1 + (y - h_1)/k_0$;
        else
            Seek $j$ such that $y \in [h_j, h_{j-1}]$;
            $x_{\min} = x_j + (y - h_j)/k_{j-1}$;
        end
        if $y > h_N$ then
            $x_{\max} = x_N + (y - h_N)/k_N$;
        else
            Seek $j$ such that $y \in [h_{j-1}, h_j]$;
            $x_{\max} = x_{j-1} + (y - h_{j-1})/k_{j-1}$;
        end
    end

REFERENCES

[1] Y. C. Eldar, P. Kuppinger, and H. Bolcskei, "Block-sparse signals: Uncertainty relations and efficient recovery," IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.
[2] Y. Chen, Y. Gu, and A. O. Hero, "Regularized Least-Mean-Square Algorithms," Arxiv preprint, 2010.
[3] W. F. Schreiber, "Advanced television systems for terrestrial broadcasting: Some problems and some proposed solutions," Proceedings of the IEEE, vol. 83, no. 6, pp. 958–981, 1995.
[4] Y. Gu, J. Jin, and S. Mei, "$\ell_0$ norm constraint LMS algorithm for sparse system identification," IEEE Signal Processing Letters, vol. 16, pp. 774–777, 2009.
[5] M. Mishali and Y. C. Eldar, "From theory to practice: Sub-Nyquist sampling of sparse wideband analog signals," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 375–391, 2010.
[6] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Statist. Soc. B, vol.
58, pp. 267–288, 1996.
[7] E. Candès, "Compressive sampling," Int. Congress of Mathematics, vol. 3, pp. 1433–1452, 2006.
[8] Y. Chen, Y. Gu, and A. O. Hero, "Sparse LMS for system identification," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3125–3128.
[9] B. Babadi, N. Kalouptsidis, and V. Tarokh, "SPARLS: The sparse RLS algorithm," IEEE Transactions on Signal Processing, vol. 58, no. 8, pp. 4013–4025, 2010.
[10] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: Where RLS meets the $\ell_1$-norm," IEEE Transactions on Signal Processing, vol. 58, no. 7, pp. 3436–3447, 2010.
[11] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[12] P. Zhao, G. Rocha, and B. Yu, "The composite absolute penalties family for grouped and hierarchical variable selection," Annals of Statistics, vol. 37, no. 6A, pp. 3468–3497, 2009.
[13] F. R. Bach, "Consistency of the group Lasso and multiple kernel learning," Journal of Machine Learning Research, vol. 9, pp. 1179–1225, 2008.
[14] S. Negahban and M. J. Wainwright, "Joint support recovery under high-dimensional scaling: Benefits and perils of $\ell_{1,\infty}$-regularization," Advances in Neural Information Processing Systems, pp. 1161–1168, 2008.
[15] P. J. Garrigues and E. L. Ghaoui, "An homotopy algorithm for the Lasso with online observations," in Neural Information Processing Systems (NIPS), 2008, vol. 21.
[16] S. Asif and J. Romberg, "Dynamic updating for $\ell_1$ minimization," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 421–434, 2010.
[17] D. M. Malioutov, S. R. Sanghavi, and A. S. Willsky, "Sequential compressed sensing," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp.
435–444, 2010.
[18] W. W. Hager, "Updating the inverse of a matrix," SIAM Review, vol. 31, no. 2, pp. 221–239, 1989.
[19] B. Widrow and S. D. Stearns, Adaptive Signal Processing. New Jersey: Prentice Hall, 1985.
[20] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, no. 2, pp. 407–451, 2004.
[21] R. Jenatton, J.-Y. Audibert, and F. Bach, "Structured variable selection with sparsity-inducing norms," Arxiv preprint, 2009.
