A simple and efficient algorithm for fused lasso signal approximator with convex loss function

Heng Lian
Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
E-mail: henglian@ntu.edu.sg

Abstract

We consider the augmented Lagrangian method (ALM) as a solver for the fused lasso signal approximator (FLSA) problem. The ALM is a dual method in which squares of the constraint functions are added as penalties to the Lagrangian. In order to apply this method to FLSA, two types of auxiliary variables are introduced to transform the original unconstrained minimization problem into a linearly constrained minimization problem. Each update in this iterative algorithm is just a simple one-dimensional convex programming problem, with a closed-form solution in many cases. While the existing literature has mostly focused on the quadratic loss function, our algorithm is easily implemented for general convex losses. The most attractive feature of this algorithm is its simplicity of implementation compared to other existing fast solvers. We also provide some convergence analysis of the algorithm. Finally, the method is illustrated on some simulated datasets.

Keywords: Augmented Lagrangian; Convergence analysis; LAD-FLASSO

1 Introduction

In this paper we examine the one-dimensional fused lasso signal approximator (Tibshirani et al., 2005), which solves

$$\min_{\beta} f(\beta) = F(y, \beta) + \lambda_1 \sum_{i=1}^n |\beta_i| + \lambda_2 \sum_{i=2}^n |\beta_i - \beta_{i-1}|, \qquad (1)$$

where $y = (y_1, \ldots, y_n)$ are the noisy observations, $\lambda_1, \lambda_2 > 0$ are two regularization parameters, and $F(y, \beta) = \sum_{i=1}^n F_i(\beta_i, y_i)$ is the loss function. The most frequently appearing case is the quadratic loss $F_i(\beta_i, y_i) = (y_i - \beta_i)^2/2$, for which there exist several solvers.
Here we also consider the more general case where $F_i$ is a convex and coercive function of $\beta_i$. By definition, a coercive $F_i$ satisfies $\lim_{|\beta_i| \to \infty} F_i(\beta_i, y_i) = \infty$ for all $y_i \in \mathbb{R}$, which ensures the existence of a minimizer. As demonstrated in Huang et al. (2005) and Tibshirani and Wang (2008), an important application of FLSA is the reconstruction of copy numbers from CGH arrays.

Several algorithms have been proposed for FLSA, including a specially designed quadratic program (Tibshirani et al., 2005; Tibshirani and Wang, 2008), a coordinate descent and fusion algorithm (Friedman et al., 2007), and a path algorithm that solves the problem for all regularization parameters simultaneously (Hoefling, 2010). Based on the numerical results in Hoefling (2010), the latter two algorithms are clearly very fast and efficient and represent the state of the art. However, these two algorithms require substantial implementation effort from non-expert programmers, since one needs to keep track of the "fused sets" containing the coefficients $\beta_i$ that assume the same value. Besides, the algorithm of Friedman et al. (2007) has the disadvantage that once coefficients are fused, the linkage cannot be removed later (a similar problem is noticed in Zou and Li (2008) for the locally quadratic approximation algorithm proposed for variable selection with non-concave penalties), and no convergence analysis is available. The algorithm of Hoefling (2010) is designed to solve (1) for all regularization parameters, but it does not work for general convex losses since the solution path is in general not piecewise linear (Rosset and Zhu, 2007).
Here we consider the augmented Lagrangian method (ALM), developed independently by Hestenes (1969) and Powell (1969) almost half a century ago, which aims to solve convex optimization problems with linear constraints. There has been a recent surge of interest in applying this method to different optimization problems (Tai and Wu, 2009; Tao and Yuan, 2010; Wen et al., 2009; Yang and Zhang, 2009; Yang and Yuan, 2010). We will show that after some simple transformations of (1), the ALM can be applied to solve FLSA efficiently with general loss functions. The most attractive feature of the method is its simplicity of implementation. We present our R code for solving (1) with quadratic loss in Appendix B in the Supplementary Material, in which the main iterations consist of only about 20 lines of commands. We provide a clear, self-contained convergence analysis of ALM in our context (Appendix A in the Supplementary Material), following existing ideas. Our algorithm can be initialized essentially arbitrarily, in particular with zero values, while for the algorithms of Friedman et al. (2007) and Hoefling (2010) such an initialization will not work and the coefficients will stay at zero at all times.

2 Augmented Lagrangian Formulation

By introducing the auxiliary variables $\theta_i$, $i = 2, \ldots, n$, the following linearly constrained problem is trivially equivalent to (1):

$$\min_{\beta, \theta} g(\beta, \theta) = \sum_{i=1}^n F_i(y_i, \beta_i) + \lambda_1 \sum_{i=1}^n |\beta_i| + \lambda_2 \sum_{i=2}^n |\theta_i| \quad \text{s.t. } \theta_i = \beta_i - \beta_{i-1}, \; i = 2, \ldots, n.$$

Following Glowinski and Le Tallec (1989), we define the augmented Lagrangian, for $c > 0$, by

$$L_c(\beta, \theta, \nu) = g(\beta, \theta) + \sum_{i=2}^n \nu_i(\theta_i - \beta_i + \beta_{i-1}) + \frac{c}{2} \sum_{i=2}^n (\theta_i - \beta_i + \beta_{i-1})^2,$$

where $\nu = (\nu_2, \ldots, \nu_n)$ is the Lagrange multiplier. We consider the following saddle-point problem: find $\beta^*, \theta^*, \nu^*$ such that
$$L_c(\beta^*, \theta^*, \nu) \le L_c(\beta^*, \theta^*, \nu^*) \le L_c(\beta, \theta, \nu^*), \quad \forall \beta, \theta, \nu. \qquad (2)$$

The proof of the following is well known from classical duality theory (Rockafellar, 1970; Ekeland and Turnbull, 1983) and is thus omitted.

Proposition 1. $\beta^*$ is a solution of (1) if and only if $(\beta^*, \theta^*, \nu^*)$ is a solution of (2) for some $\theta^*$ and $\nu^*$.

The basic algorithm for finding the saddle point is the following Algorithm 1 (Glowinski and Le Tallec, 1989).

Algorithm 1. Initialize $\nu^0$ arbitrarily. For $k = 1, 2, \ldots$:
  $(\beta^k, \theta^k) = \arg\min_{(\beta, \theta)} L_c(\beta, \theta, \nu^{k-1})$
  $\nu_i^k = \nu_i^{k-1} + c(\theta_i^k - \beta_i^k + \beta_{i-1}^k)$, $i = 2, \ldots, n$

In general it is difficult to minimize $L_c(\beta, \theta, \nu^k)$ over $\beta$ and $\theta$ simultaneously, but it may be easier to minimize over $\beta$ with $\theta$ fixed and vice versa. In this case, we can alternate these two steps until convergence. It turns out that we can update $\beta$ and $\theta$ just once each while the other is fixed, resulting in the following algorithm.

Algorithm 2. Initialize $\nu^0$ and $\theta^0$ arbitrarily. For $k = 1, 2, \ldots$:
  $\beta^k = \arg\min_{\beta} L_c(\beta, \theta^{k-1}, \nu^{k-1})$
  $\theta^k = \arg\min_{\theta} L_c(\beta^k, \theta, \nu^{k-1})$
  $\nu_i^k = \nu_i^{k-1} + c(\theta_i^k - \beta_i^k + \beta_{i-1}^k)$, $i = 2, \ldots, n$

Example. We apply Algorithm 2 to (1) with quadratic loss. In this case, the augmented Lagrangian is

$$L_c(\beta, \theta, \nu) = \frac{1}{2} \sum_i (y_i - \beta_i)^2 + \lambda_1 \sum_{i=1}^n |\beta_i| + \lambda_2 \sum_{i=2}^n |\theta_i| + \sum_{i=2}^n \nu_i(\theta_i - \beta_i + \beta_{i-1}) + \frac{c}{2} \sum_{i=2}^n (\theta_i - \beta_i + \beta_{i-1})^2.$$
If $\lambda_1 = 0$, then given $\theta^{k-1}$ and $\nu^{k-1}$, the minimization over $\beta$ is a simple quadratic problem, and all components of $\beta$ can be found simultaneously by solving a linear system $B\beta = b$. We do not write down the expressions for the matrix $B$ and the vector $b$ explicitly, but note that due to the special structure of the problem, $B$ is tridiagonal, so there is an efficient algorithm with complexity linear in $n$ for solving the linear system (see, for example, Conte and De Boor (1980)).

For $\lambda_1 > 0$, it is more difficult to update $\beta$ directly. Fortunately, for quadratic loss, the solution of FLSA with $\lambda_1 > 0$ can be obtained by thresholding the solution with $\lambda_1 = 0$, as shown in Friedman et al. (2007), and thus for this example we only consider $\lambda_1 = 0$. With $\beta = \beta^k$ and $\nu = \nu^{k-1}$ fixed, the minimization over $\theta$ is a simple lasso regression with orthogonal design, so we have the simple component-wise soft-thresholding update

$$\theta_i^k = \mathrm{sign}(\hat\theta_i)(|\hat\theta_i| - \lambda_2/c)_+, \qquad (3)$$

where $\hat\theta_i = \beta_i^k - \beta_{i-1}^k - \nu_i^{k-1}/c$ and $(a)_+$ denotes the positive part of $a$.

For quadratic loss, the example shows that both the update for $\beta$ and the update for $\theta$ can be computed efficiently when $\lambda_1 = 0$. However, for more general losses and/or for $\lambda_1 > 0$, it is difficult to update $\beta$ directly, and thus in our implementation we do not use Algorithms 1 and 2. We propose next a further augmentation step that decouples the quadratic term $(\theta_i - \beta_i + \beta_{i-1})^2$ from the loss function. We introduce another set of auxiliary variables $\gamma_i$, $i = 1, \ldots, n$, and consider the following problem, which is still obviously equivalent to (1):

$$\min_{\gamma, \beta, \theta} g(\gamma, \beta, \theta) = \sum_{i=1}^n F_i(y_i, \gamma_i) + \lambda_1 \sum_{i=1}^n |\gamma_i| + \lambda_2 \sum_{i=2}^n |\theta_i| \quad \text{s.t. } \gamma_i = \beta_i, \; i = 1, \ldots, n, \quad \theta_j = \beta_j - \beta_{j-1}, \; j = 2, \ldots, n.$$
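Concretely, the two building blocks described above for each Algorithm 2 iteration with quadratic loss — a linear-in-$n$ tridiagonal solve for the $\beta$-update and the soft-thresholding rule (3) for the $\theta$-update — might be sketched as follows. This is our own Python illustration (the paper's implementation is in R), and the function names are ours:

```python
import numpy as np

def thomas_solve(lower, diag, upper, b):
    """Solve a tridiagonal system in O(n) by forward elimination and
    back substitution (the Thomas algorithm)."""
    n = len(diag)
    d = np.array(diag, dtype=float)      # working copies; inputs untouched
    rhs = np.array(b, dtype=float)
    for i in range(1, n):
        w = lower[i - 1] / d[i - 1]
        d[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    x = np.empty(n)
    x[-1] = rhs[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - upper[i] * x[i + 1]) / d[i]
    return x

def soft_threshold(x, t):
    """Component-wise soft thresholding sign(x) * (|x| - t)_+ as in (3)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# theta-update of Algorithm 2: threshold beta_i - beta_{i-1} - nu_i / c at lambda_2 / c
beta = np.array([0.0, 1.2, 1.3, 0.1])
nu = np.array([0.5, -0.2, 0.0])          # multipliers nu_2, ..., nu_n
c, lam2 = 5.0, 2.0
theta = soft_threshold(np.diff(beta) - nu / c, lam2 / c)  # approx. [0.7, 0.0, -0.8]
```

The $O(n)$ solve is what keeps each iteration linear in the sequence length.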
The corresponding (doubly) augmented Lagrangian is

$$L_c(\gamma, \beta, \theta, \mu, \nu) = g(\gamma, \beta, \theta) + \sum_{i=1}^n \mu_i(\gamma_i - \beta_i) + \frac{c}{2} \sum_{i=1}^n (\gamma_i - \beta_i)^2 + \sum_{i=2}^n \nu_i(\theta_i - \beta_i + \beta_{i-1}) + \frac{c}{2} \sum_{i=2}^n (\theta_i - \beta_i + \beta_{i-1})^2. \qquad (4)$$

In the above, the coefficients of both quadratic penalties are the same (equal to $c/2$). In principle, we could use different coefficients, but computationally it is difficult to tune both parameters, so we settle for this simpler choice. With the newly defined Lagrangian in (4), we can modify the saddle-point problem (2) in the obvious way, and it can be shown that the saddle-point problem is equivalent to the original FLSA problem (1). Accordingly, we have the following algorithms for finding the saddle point, which directly extend Algorithm 1 and Algorithm 2, respectively.

Algorithm 3. Initialize $\nu^0$ and $\mu^0$ arbitrarily. For $k = 1, 2, \ldots$:
  $(\gamma^k, \beta^k, \theta^k) = \arg\min_{(\gamma, \beta, \theta)} L_c(\gamma, \beta, \theta, \mu^{k-1}, \nu^{k-1})$
  $\nu_i^k = \nu_i^{k-1} + c(\theta_i^k - \beta_i^k + \beta_{i-1}^k)$, $i = 2, \ldots, n$
  $\mu_i^k = \mu_i^{k-1} + c(\gamma_i^k - \beta_i^k)$, $i = 1, \ldots, n$

Algorithm 4. Initialize $\nu^0$, $\mu^0$, $\beta^0$ and $\theta^0$ arbitrarily. For $k = 1, 2, \ldots$:
  $\gamma^k = \arg\min_{\gamma} L_c(\gamma, \beta^{k-1}, \theta^{k-1}, \mu^{k-1}, \nu^{k-1})$
  $\beta^k = \arg\min_{\beta} L_c(\gamma^k, \beta, \theta^{k-1}, \mu^{k-1}, \nu^{k-1})$
  $\theta^k = \arg\min_{\theta} L_c(\gamma^k, \beta^k, \theta, \mu^{k-1}, \nu^{k-1})$
  $\nu_i^k = \nu_i^{k-1} + c(\theta_i^k - \beta_i^k + \beta_{i-1}^k)$, $i = 2, \ldots, n$
  $\mu_i^k = \mu_i^{k-1} + c(\gamma_i^k - \beta_i^k)$, $i = 1, \ldots, n$

In Algorithm 3, $\arg\min_{(\gamma, \beta, \theta)} L_c(\gamma, \beta, \theta, \mu^{k-1}, \nu^{k-1})$ is typically difficult to find directly, so iterative updating of each variable with the others fixed is applied (i.e., the first three steps in the loop of Algorithm 4 are repeated until convergence). We will later use simulation to compare the relative efficiency of Algorithm 3 and Algorithm 4. We now consider each update in detail.
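To make the iteration concrete, here is a minimal Python sketch of Algorithm 4 for the quadratic loss. This translation is our own illustration (the paper's reference code is in R), and a dense solve stands in for the $O(n)$ tridiagonal algorithm that would be used in practice. A short calculation shows that the $\beta$-update solves $(I + A^T A)\beta = \gamma + \mu/c + A^T(\theta + \nu/c)$, where $A$ is the $(n-1) \times n$ differencing matrix:

```python
import numpy as np

def soft(x, t):
    """Component-wise soft thresholding: sign(x) * (|x| - t)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def flsa_alm(y, lam1, lam2, c=5.0, n_iter=3000):
    """Algorithm 4 for the quadratic loss F_i(y_i, g) = (y_i - g)^2 / 2."""
    n = len(y)
    A = np.diff(np.eye(n), axis=0)         # (n-1) x n differencing matrix
    M = np.eye(n) + A.T @ A                # tridiagonal; dense solve for brevity
    beta, theta = np.zeros(n), np.zeros(n - 1)
    mu, nu = np.zeros(n), np.zeros(n - 1)  # multipliers for the two constraint sets
    for _ in range(n_iter):
        # gamma-update: one-dimensional problems; for quadratic loss, soft thresholding
        gamma = soft((y + c * beta - mu) / (1.0 + c), lam1 / (1.0 + c))
        # beta-update: solve (I + A'A) beta = gamma + mu/c + A'(theta + nu/c)
        beta = np.linalg.solve(M, gamma + mu / c + A.T @ (theta + nu / c))
        # theta-update: soft thresholding as in (3)
        theta = soft(np.diff(beta) - nu / c, lam2 / c)
        # multiplier updates
        nu = nu + c * (theta - np.diff(beta))
        mu = mu + c * (gamma - beta)
    return beta

# example: a large lambda_2 fuses the whole sequence toward its mean
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta_hat = flsa_alm(y, lam1=0.0, lam2=10.0)
```

Only the `gamma` line depends on the loss; swapping in another convex loss changes a single one-dimensional update.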
Note that the doubly augmented Lagrangian is

$$L_c(\gamma, \beta, \theta, \mu, \nu) = \sum_{i=1}^n F_i(y_i, \gamma_i) + \lambda_1 \sum_{i=1}^n |\gamma_i| + \lambda_2 \sum_{i=2}^n |\theta_i| + \sum_{i=1}^n \mu_i(\gamma_i - \beta_i) + \frac{c}{2} \sum_{i=1}^n (\gamma_i - \beta_i)^2 + \sum_{i=2}^n \nu_i(\theta_i - \beta_i + \beta_{i-1}) + \frac{c}{2} \sum_{i=2}^n (\theta_i - \beta_i + \beta_{i-1})^2.$$

The update for $\beta$ can be performed in closed form by solving a linear system, which still involves a tridiagonal matrix and can be solved efficiently. Note that the effect of introducing $\gamma_i$ is to decouple some terms in the Lagrangian, so that the loss function and the lasso penalty do not come into play when updating $\beta$. The update for $\theta$ is the same as before and can be performed with component-wise thresholding using the same formula (3). The update for $\gamma$ is generally not available in closed form. However, due to the special separable structure of the functional, $\gamma$ can be updated component by component, resulting in multiple one-dimensional convex optimization problems for which many efficient solvers exist. For a different convex loss, only the updates for $\gamma_i$ need to be modified. We also note that for the quadratic loss, the update for $\gamma_i$ is again a simple soft thresholding.

Example. In this example we take $F_i(y_i, \gamma_i) = |y_i - \gamma_i|$, the absolute deviation or $L_1$ loss. The $L_1$ loss is an interesting alternative to the quadratic loss in that it is more robust to outliers. We refer to the resulting FLSA problem (1) with $L_1$ loss as LAD-FLASSO. In this case, the update of $\gamma_i$ consists in minimizing $|y_i - \gamma_i| + \lambda_1|\gamma_i| + \mu_i(\gamma_i - \beta_i) + (c/2)(\gamma_i - \beta_i)^2$. Although the solution is not available in closed form, the function is strictly convex and differentiable except at two points, $0$ and $y_i$.
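Since this objective for $\gamma_i$ is strictly convex and non-differentiable only at $0$ and $y_i$, its minimizer is either one of these two kinks or the stationary point of one of the smooth branches, each branch being fixed by the signs of $\gamma_i$ and $\gamma_i - y_i$. A minimal Python sketch of this update (the function name is ours, not from the paper):

```python
def lad_gamma_update(y_i, beta_i, mu_i, lam1, c):
    """Minimize h(g) = |y_i - g| + lam1*|g| + mu_i*(g - beta_i) + (c/2)*(g - beta_i)^2.
    On the smooth branch with s1 = sign(g - y_i) and s2 = sign(g), the stationary
    point solves s1 + lam1*s2 + mu_i + c*(g - beta_i) = 0."""
    def h(g):
        return (abs(y_i - g) + lam1 * abs(g)
                + mu_i * (g - beta_i) + 0.5 * c * (g - beta_i) ** 2)
    candidates = [0.0, y_i]                 # the two kinks
    for s1 in (-1.0, 1.0):                  # sign(g - y_i)
        for s2 in (-1.0, 1.0):              # sign(g)
            candidates.append(beta_i - (mu_i + s1 + lam1 * s2) / c)
    return min(candidates, key=h)
```

Because $h$ is evaluated exactly at every candidate, a branch root lying outside its own branch can never beat the true minimizer, so the argmin over these six candidates is exact.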
Thus the minimizer can be found by comparing its values at $0$, $y_i$, and other potential stationary points, a total of only six cases (by considering the signs of $\gamma_i$ and $y_i - \gamma_i$). The update of $\gamma$ can therefore also be found efficiently and implemented easily.

In the following theorem, we give the convergence of Algorithms 1-4. It shows that $\beta^k$ is a minimizing sequence of the FLSA (1). If the minimizer is unique, then $\beta^k$ converges to the minimizer. The proof of the theorem is given in Appendix A in the Supplementary Material.

Theorem 1. For any of Algorithms 1-4, we have $f(\beta^k) \to \min_\beta f(\beta)$, where $f$ is the FLSA functional defined in (1).

3 Simulation Results

We follow simulation setups similar to those used in Hoefling (2010). Each simulated sequence consists of data points with values $0$, $1$, $2$; roughly 20% of the data points have value $1$ and another 20% have value $2$, with Gaussian noise added (except in Experiment 4 below, where noise with a heavy-tailed distribution is used). In Experiments 1-3 below, we restrict ourselves to quadratic loss functions. The experiments are performed on an HP workstation xw4400 with an Intel Core 2 Duo processor (2.66GHz) and 2GB of RAM, implemented in R. We also make use of the limSolve package in R, which implements the tridiagonal matrix algorithm. We apply our doubly augmented Lagrangian method to the simulated datasets. The difference between Algorithm 3 and Algorithm 4 is that Algorithm 3 has an additional inner loop that applies the first three updates in the loop of Algorithm 4 repeatedly until convergence.

Experiment 1. First we study the effect of the number of iterations, $T$, performed in this inner loop. Thus Algorithm 3 corresponds to the case $T \to \infty$, while Algorithm 4 corresponds to $T = 1$.
In this experiment, we set the sequence length to $n = 200$ and $n = 2000$, with Gaussian noise of variance $0.1$, $c = 5$, and $\lambda_1 = 0.5$, $\lambda_2 = 4$. 100 datasets are simulated. The convergence criterion used is $\|\nu^k - \nu^{k-1}\| + \|\mu^k - \mu^{k-1}\| < 10^{-10}$.

Table 1: Simulation results for Experiment 1, varying the number of iterations of the inner loop.

                          n = 200                       n = 2000
                  T=1     T=2     T=5     T=10    T=1     T=2     T=5     T=10
  iterations    226.95  131.61   72.70   69.09  212.02  147.84   78.56   71.21
  time (s)       0.117   0.118   0.144   0.258   0.354   0.654   0.838   1.753

In Table 1, we show the average number of iterations required until convergence as well as the time (in seconds) elapsed. We see that although using $T > 1$ reduces the number of (outer-loop) iterations required, the overall computation time is either similar to the case $T = 1$ or significantly increased, even for small values of $T$. Thus we see no advantage in using $T > 1$, and Algorithm 4 is adopted in the following. We have also conducted simulations using other sequence lengths and parameters, and the conclusion is the same.

Experiment 2. Next we consider the effect of the parameter $c$ on the convergence of the algorithm. Although in theory the ALM converges in the limit for any $c > 0$, we will see that this parameter can affect the speed of convergence. Our simulation involves a sequence of length $n = 1000$ with $N(0, 0.1)$ noise, and we solve the FLSA problem with $\lambda_1 = 0$ and $\lambda_2 = 0.1$. We choose many different values for $c$, and the evolution of the mean squared error $\sum_{i=1}^n (\beta_i^k - \beta_i)^2 / n$ for five values of $c$ is plotted in Figure 1(a). Here $\beta_i$ represents the true signal and $\beta_i^k$ is the estimate at the $k$-th iteration.
We see that for small values of $c$ the convergence of the estimate is slow and oscillates in the initial stage, while for big values of $c$ the convergence is also slow. For this sequence, a value between $0.5$ and $5$ generally produces a visually reasonable speed of convergence.

Now we vary the different parameters involved in the optimization to investigate how these changes affect the choice of $c$. First we generate a sequence of length $n = 100$ and another with $n = 10000$. The convergence diagnostic plots are shown in Figure 1(b) and (c). Remarkably, the plots show that the choice of $c$ is almost unaffected by the length of the sequence, and the number of iterations required for convergence does not depend on $n$. This empirical observation has at least two implications. (i) In order to choose a reasonable value of $c$ for an extremely long sequence, we can run the algorithm on a subsequence with several different values of $c$ and choose the best one based on the convergence speed on the subsequence. Of course, for this to work we need to assume the sequence is stationary in some sense. (ii) The complexity of the algorithm is linear in the length of the sequence, since each iteration is linear in $n$ and the number of iterations does not vary much with the length (note this is based only on empirical observation).

Then we use the same sequence with $n = 1000$ but with a bigger noise variance of $0.4$. The plot shown in Figure 1(d) looks different, but the range of values of $c$ that results in fast convergence is similar to before. In Figure 1(e), we show the results when solving FLSA with $\lambda_1 = 0.5$ and $\lambda_2 = 4$, and in Figure 1(f), (g), we multiply and divide the noisy sequence by a factor of 10, respectively. When these parameters are changed, we see the trace plot is more variable.
Since all these types of changes can be regarded as changes in the relative sizes of the different terms in (4), we conclude that the choice of $c$ depends on this relative scale but is quite stable otherwise. Even so, we still observe that the optimal choice of $c$ is somewhere between $0.2$ and $10$. We have generated different sequences and worked with different regularization parameters to make sure the observations made above apply to a wide variety of settings. Since our algorithm is relatively fast, we suggest running the algorithm for several different values of $c$ and visually checking its convergence, except when the sequence is extremely long ($> 10^6$), in which case we can run the algorithm on one or more subsequences to choose $c$ before running it on the entire sequence. In all the following experiments we set $c = 5$.

Experiment 3. Here we want to say something about the computation speed of our algorithm, based on comparisons with previous approaches. We download from CRAN the flsa package (version 1.03), which is based on the path algorithm presented in Hoefling (2010). For each case of $n = 100, 1000, 10000, 100000$, we generate 100 sequences, and the average computation times per sequence are shown in Table 2 for $\lambda_1 = 0.5$, $\lambda_2 = 4$. From the results reported in the table, we see that the path algorithm is about 20 times faster than our ALM algorithm for $n \ge 1000$. For the case $n = 100$, the difference is about 100-fold. We observe from Table 2 that both algorithms have computation time approximately linear in $n$, except for ALM when $n = 100$. Thus the large difference for $n = 100$ may be due to the fact that in this case most of the computation time in ALM is spent on ancillary chores such as calling the R function, setting up parameter values, and returning results.
The reported computation time for the path algorithm is slower than that reported in Hoefling (2010); the reason might be differences in the simulation setup and in the computer systems used. We also note that the path algorithm is specifically designed for computing the entire solution path for all regularization parameters, and in this sense it should have even better relative performance when solutions for many regularization parameter values are sought. However, this algorithm does not work with general loss functions, as the ALM does.

There is no publicly available package implementing the descent algorithm of Friedman et al. (2007). However, Table 1 in Hoefling (2010) reported that the path algorithm is about 10-100 times faster than the descent algorithm when $n \le 10^4$, while the two algorithms have comparable speed for larger $n$. Based on this comparison, we think our algorithm is probably comparable to the descent algorithm when $n \le 10^4$ but much slower for bigger sequence lengths.

Figure 1: Evolution of the mean squared error of the reconstructed signals with iterations; panels (a)-(g) each show traces for $c = 0.2, 1, 5, 10, 30$.

Table 2: Comparison of computation speed (in seconds) for our ALM algorithm and the flsa package based on the path algorithm.

           n = 100    n = 1000    n = 10^4    n = 10^5    n = 10^6
  ALM      0.09811    0.1797      1.861       21.702      223.9
  flsa     0.00092    0.0073      0.072       0.958       10.51

Finally, we note that it is difficult to exactly compare the computation speed of different algorithms, since all algorithms involve some parameter choices. In particular, the convergence criterion used in our implementation is $\|\nu^k - \nu^{k-1}\| + \|\mu^k - \mu^{k-1}\| \le 10^{-10}$, and if we increase the threshold to $10^{-5}$, the algorithm becomes 3 to 5 times faster. Besides, the flsa package uses C code in its underlying implementation, which makes it faster, while our implementation uses the tridiagonal matrix algorithm from the limSolve package, which is implemented in Fortran, so the net effect is difficult to compare. We emphasize again that the biggest advantages of our algorithm are its ease of implementation and the fact that it works with general convex loss functions.

Experiment 4. Finally, in this experiment, we consider the LAD-FLASSO problem, where the loss function in (1) is defined by $F_i(y_i, \beta_i) = |y_i - \beta_i|$. We only illustrate here with a sequence of length $n = 100$, where the noise has a $t$ distribution with 2 degrees of freedom and scale parameter equal to $0.3$. With heavy-tailed noise, LAD-FLASSO is expected to perform better than the usual FLSA with quadratic loss. Indeed, Figure 2 shows the noisy sequence, the true signal, and the two reconstructions. The regularization parameters $\lambda_1$ and $\lambda_2$ in the two cases are those minimizing the sum of squared errors and the sum of absolute deviations, respectively (of course, this depends on knowledge of the true signal in the simulation), found by searching over a fine grid. An obvious difference between the two reconstructions is seen at positions 70-80, where an extremely high observation value occurs due to the heavy-tailed noise distribution.
Figure 2: Reconstruction of a signal sequence based on FLSA with quadratic loss and absolute deviation ($L_1$) loss.

4 Conclusion

In this paper we propose a simple algorithm for the FLSA problem. Although not as fast as the path algorithm implemented in the flsa package, the most attractive features of this algorithm are the simplicity of its implementation and the fact that it works for general convex loss functions. The computational speed of the current implementation in R could likely be improved if a lower-level programming language such as C were used for the underlying implementation. Another advantage of the algorithm is that it is provably convergent for any initialization values, and its convergence properties are investigated in the simulation studies presented here. The flexibility of the approach is demonstrated by our implementation of the LAD-FLASSO problem, which is lacking from other existing implementations based on either the descent algorithm or the path algorithm. We expect that the ALM, as a general technique, will be very useful for other optimization problems in statistical learning.

References

Conte, S. D. and De Boor, C. Elementary Numerical Analysis: An Algorithmic Approach. New York: McGraw-Hill, 3rd edition (1980).

Ekeland, I. and Turnbull, T. Infinite-Dimensional Optimization and Convexity. Chicago Lectures in Mathematics. Chicago: University of Chicago Press (1983).

Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. "Pathwise coordinate optimization." Annals of Applied Statistics, 1(2):302-332 (2007).

Glowinski, R. and Le Tallec, P. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. Philadelphia: Society for Industrial and Applied Mathematics (1989).

Hestenes, M. R.
"Multiplier and gradient methods." Journal of Optimization Theory and Applications, 4:303-320 (1969).

Hoefling, H. "A path algorithm for the fused lasso signal approximator." Manuscript available at http://www.holgerhoefling.com/ (2010).

Huang, T., Wu, B. L., Lizardi, P., and Zhao, H. Y. "Detection of DNA copy number alterations using penalized least squares regression." Bioinformatics, 21(20):3811-3817 (2005).

Powell, M. J. D. "A method for nonlinear constraints in minimization problems." In: Fletcher, R. (ed.), Optimization, 283-298 (1969).

Rockafellar, R. T. Convex Analysis. Princeton, N.J.: Princeton University Press (1970).

Rosset, S. and Zhu, J. "Piecewise linear regularized solution paths." Annals of Statistics, 35(3):1012-1030 (2007).

Tai, X.-C. and Wu, C. "Augmented Lagrangian method, dual methods and split Bregman iteration for ROF model." In 2nd International Conference on Scale Space and Variational Methods in Computer Vision, 502-513 (2009).

Tao, M. and Yuan, X. M. "Recovering low-rank and sparse components of matrices from incomplete and noisy observations." Preprint, available at http://www.optimization-online.org (2010).

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. "Sparsity and smoothness via the fused lasso." Journal of the Royal Statistical Society, Series B, 67:91-108 (2005).

Tibshirani, R. and Wang, P. "Spatial smoothing and hot spot detection for CGH data using the fused lasso." Biostatistics, 9(1):18-29 (2008).

Wen, Z. W., Goldfarb, D., and Yin, W. "Alternating direction augmented Lagrangian methods for semidefinite programming." TR09-42, CAAM Report, Rice University (2009).

Yang, J. F. and Yuan, X. M. "An inexact alternating direction method for trace norm regularized least squares problems." Preprint, available at http://www.optimization-online.org (2010).

Yang, J. F. and Zhang, Y.
"Alternating direction method for L1 problems in compressive sensing." TR09-37, CAAM Report, Rice University (2009).

Zou, H. and Li, R. Z. "One-step sparse estimates in nonconcave penalized likelihood models." Annals of Statistics, 36(4):1509-1533 (2008).

Supplementary Material

Appendix A

Proof of Theorem 1. In the proof we use matrix and vector notation. In particular, the expressions $\theta_i - \beta_i + \beta_{i-1}$, $i = 2, \ldots, n$, can be written as $\theta - A\beta$ with $A$ an $(n-1) \times n$ matrix. We also make frequent use of standard classical results from convex analysis, such as those contained in Rockafellar (1970) and Ekeland and Turnbull (1983), most notably the properties of the subdifferential of a convex function. Also, we only show the convergence of Algorithms 1 and 2; the analysis for Algorithms 3 and 4 is very much the same but more tedious to write down, and is thus omitted.

We start with Algorithm 1, for which the augmented Lagrangian can be written as

$$L(\beta, \theta, \nu) = U(\beta) + V(\theta) + \frac{c}{2}\|\theta - A\beta\|^2 + \nu^T(\theta - A\beta),$$

where $U(\beta) = \sum_{i=1}^n F_i(y_i, \beta_i) + \lambda_1 \sum_{i=1}^n |\beta_i|$, $V(\theta) = \lambda_2 \sum_{i=2}^n |\theta_i|$, and $\nu^T$ is the transpose of the column vector $\nu$. In the proof we only need the convexity of $U$ and $V$. Using the usual notation, suppose $(\beta^*, \theta^*, \nu^*)$ is the saddle point of $L$, satisfying

$$L(\beta^*, \theta^*, \nu) \le L(\beta^*, \theta^*, \nu^*) \le L(\beta, \theta, \nu^*) \quad \forall \beta, \theta, \nu. \qquad (5)$$

From the first inequality of (5), we have $\theta^* = A\beta^*$. The update for $\nu$ in Algorithm 1 is $\nu^k = \nu^{k-1} + c(\theta^k - A\beta^k)$, which implies

$$\bar\nu^k = \bar\nu^{k-1} + c(\bar\theta^k - A\bar\beta^k), \qquad (6)$$

where we set $\bar\beta^k = \beta^k - \beta^*$, $\bar\theta^k = \theta^k - \theta^*$ and $\bar\nu^k = \nu^k - \nu^*$. From (6), we immediately get

$$\|\bar\nu^{k-1}\|^2 - \|\bar\nu^k\|^2 = -2c(\bar\nu^{k-1})^T(\bar\theta^k - A\bar\beta^k) - c^2\|\bar\theta^k - A\bar\beta^k\|^2.$$

Next we show that the right-hand side of the above is nonnegative.
From the second inequality of (5), we have

$$0 \in \partial_\beta L(\beta^*, \theta^*, \nu^*) \iff 0 \in \partial U(\beta^*) - cA^T(\theta^* - A\beta^*) - A^T\nu^*, \qquad (7)$$
$$0 \in \partial_\theta L(\beta^*, \theta^*, \nu^*) \iff 0 \in \partial V(\theta^*) + c(\theta^* - A\beta^*) + \nu^*, \qquad (8)$$

where $\partial$ denotes the subdifferential of a convex function. Correspondingly, based on the updates of $\beta^k$ and $\theta^k$ in Algorithm 1, we have

$$0 \in \partial_\beta L(\beta^k, \theta^k, \nu^{k-1}) \iff 0 \in \partial U(\beta^k) - cA^T(\theta^k - A\beta^k) - A^T\nu^{k-1}, \qquad (9)$$
$$0 \in \partial_\theta L(\beta^k, \theta^k, \nu^{k-1}) \iff 0 \in \partial V(\theta^k) + c(\theta^k - A\beta^k) + \nu^{k-1}. \qquad (10)$$

Subtracting (7) from (9) and subtracting (8) from (10), we get

$$0 \in \partial U(\beta^k) - \partial U(\beta^*) - cA^T(\bar\theta^k - A\bar\beta^k) - A^T\bar\nu^{k-1}, \qquad (11)$$
$$0 \in \partial V(\theta^k) - \partial V(\theta^*) + c(\bar\theta^k - A\bar\beta^k) + \bar\nu^{k-1}. \qquad (12)$$

Taking the inner product of (11) with $\bar\beta^k$, taking the inner product of (12) with $\bar\theta^k$, and adding the two expressions gives

$$0 \in \langle \partial U(\beta^k) - \partial U(\beta^*), \bar\beta^k \rangle + \langle \partial V(\theta^k) - \partial V(\theta^*), \bar\theta^k \rangle + c\|\bar\theta^k - A\bar\beta^k\|^2 + (\bar\nu^{k-1})^T(\bar\theta^k - A\bar\beta^k), \qquad (13)$$

where we use $\langle \cdot, \cdot \rangle$ to denote the dot product of two vectors in some places above, to be consistent with the usual notation in convex analysis as in Ekeland and Turnbull (1983). From standard results in convex analysis (monotonicity of the subdifferential), all elements of $\langle \partial U(\beta^k) - \partial U(\beta^*), \bar\beta^k \rangle$ and $\langle \partial V(\theta^k) - \partial V(\theta^*), \bar\theta^k \rangle$ are nonnegative, and thus we get

$$c\|\bar\theta^k - A\bar\beta^k\|^2 + (\bar\nu^{k-1})^T(\bar\theta^k - A\bar\beta^k) \le 0,$$

which immediately implies

$$\|\bar\nu^{k-1}\|^2 - \|\bar\nu^k\|^2 = -2c(\bar\nu^{k-1})^T(\bar\theta^k - A\bar\beta^k) - c^2\|\bar\theta^k - A\bar\beta^k\|^2 \ge c^2\|\bar\theta^k - A\bar\beta^k\|^2.$$

Now since $\|\bar\nu^k\|^2$ is nonnegative and decreasing, we obtain $\bar\theta^k - A\bar\beta^k \to 0$.
Using this in (13), we get

$$0 \le \langle \partial U(\beta^k) - \partial U(\beta^*), \bar\beta^k \rangle \to 0, \qquad 0 \le \langle \partial V(\theta^k) - \partial V(\theta^*), \bar\theta^k \rangle \to 0, \qquad (14)$$

where the above expression is taken to mean, for example, that "there exists some sequence $u^k \in \langle \partial U(\beta^k) - \partial U(\beta^*), \bar\beta^k \rangle$ with $0 \le u^k \to 0$". Similar interpretations are used in the following. By the definition of the subdifferential, we have

$$U(\beta^k) \ge U(\beta^*) + \langle \partial U(\beta^*), \bar\beta^k \rangle, \qquad U(\beta^*) \ge U(\beta^k) - \langle \partial U(\beta^k), \bar\beta^k \rangle,$$

resulting in

$$U(\beta^k) - \langle \partial U(\beta^*), \bar\beta^k \rangle \ge U(\beta^*) \ge U(\beta^k) - \langle \partial U(\beta^k), \bar\beta^k \rangle.$$

By (14), the difference between the left-hand side and the right-hand side converges to zero, and thus $U(\beta^k) \to U(\beta^*)$. Similarly we can show $V(\theta^k) \to V(\theta^*)$. Combined with $\bar\theta^k - A\bar\beta^k \to 0$, these facts prove the convergence of Algorithm 1.

For Algorithm 2, the proof strategy is similar and we only point out the differences. The proof is the same as before up to equation (8). Because of the order in which $\beta$ and $\theta$ are updated in Algorithm 2, equation (9) is replaced by

$$0 \in \partial_\beta L(\beta^k, \theta^{k-1}, \nu^{k-1}) \iff 0 \in \partial U(\beta^k) - cA^T(\theta^{k-1} - A\beta^k) - A^T\nu^{k-1},$$

and thus equation (11) becomes instead

$$0 \in \partial U(\beta^k) - \partial U(\beta^*) - cA^T(\bar\theta^{k-1} - A\bar\beta^k) - A^T\bar\nu^{k-1},$$

while equation (12) remains the same. Then we have, in place of (13),

$$0 \in \langle \partial U(\beta^k) - \partial U(\beta^*), \bar\beta^k \rangle + \langle \partial V(\theta^k) - \partial V(\theta^*), \bar\theta^k \rangle + c\,\|\bar\theta^k - A\bar\beta^k\|^2 + (\bar\nu^{k-1})^T(\bar\theta^k - A\bar\beta^k) - c\,(\bar\beta^k)^T A^T(\bar\theta^{k-1} - \bar\theta^k),$$

which then implies

$$\|\bar\nu^{k-1}\|^2 - \|\bar\nu^k\|^2 \ge c^2\,\|\bar\theta^k - A\bar\beta^k\|^2 - 2c^2\,(\bar\beta^k)^T A^T(\bar\theta^{k-1} - \bar\theta^k). \qquad (15)$$

So the difference from the corresponding analysis for Algorithm 1 is the extra term $-2c^2(\bar\beta^k)^T A^T(\bar\theta^{k-1} - \bar\theta^k)$ on the right-hand side above. We now analyze the term $2c^2(\bar\beta^k)^T A^T(\bar\theta^k - \bar\theta^{k-1})$.
From (12) (which still holds for Algorithm 2) and the update rule for $\nu$ in Algorithm 2, we have

$$0 \in \partial V(\theta^k) - \partial V(\theta^*) + c(\bar\theta^k - A\bar\beta^k) + \bar\nu^{k-1}, \qquad (16)$$

$$0 \in \partial V(\theta^{k-1}) - \partial V(\theta^*) + c(\bar\theta^{k-1} - A\bar\beta^{k-1}) + \bar\nu^{k-2}, \qquad (17)$$

$$\bar\nu^{k-1} - \bar\nu^{k-2} = c(\bar\theta^{k-1} - A\bar\beta^{k-1}). \qquad (18)$$

Subtracting (17) from (16) and taking (18) into account, we get

$$0 \in \partial V(\theta^k) - \partial V(\theta^{k-1}) + c(\bar\theta^k - A\bar\beta^k).$$

Taking the inner product with $\theta^k - \theta^{k-1}$ in the above and using the property of convex functions that $\langle \partial V(\theta^k) - \partial V(\theta^{k-1}), \theta^k - \theta^{k-1} \rangle \ge 0$, we get

$$(\bar\theta^k - A\bar\beta^k)^T(\theta^k - \theta^{k-1}) \le 0,$$

which can be rewritten as

$$(\bar\beta^k)^T A^T(\bar\theta^k - \bar\theta^{k-1}) \ge (\bar\theta^k)^T(\bar\theta^k - \bar\theta^{k-1}).$$

Using the identity $(\bar\theta^k)^T(\bar\theta^k - \bar\theta^{k-1}) = \frac{1}{2}(\|\bar\theta^k\|^2 - \|\bar\theta^{k-1}\|^2 + \|\bar\theta^k - \bar\theta^{k-1}\|^2)$, we obtain from (15)

$$\|\bar\nu^{k-1}\|^2 - \|\bar\nu^k\|^2 \ge c^2\,\|\bar\theta^k - A\bar\beta^k\|^2 + c^2\,(\|\bar\theta^k\|^2 - \|\bar\theta^{k-1}\|^2 + \|\bar\theta^k - \bar\theta^{k-1}\|^2).$$

After rearranging, we get

$$(\|\bar\nu^{k-1}\|^2 + c^2\,\|\bar\theta^{k-1}\|^2) - (\|\bar\nu^k\|^2 + c^2\,\|\bar\theta^k\|^2) \ge c^2\,\|\bar\theta^k - A\bar\beta^k\|^2 + c^2\,\|\bar\theta^k - \bar\theta^{k-1}\|^2,$$

and then $\|\bar\theta^k - A\bar\beta^k\| \to 0$ and $\|\bar\theta^k - \bar\theta^{k-1}\| \to 0$. The rest of the analysis follows that for Algorithm 1 with no changes.
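The convergence argument treats the updates abstractly; as a bridge to the code in Appendix B, note that the $\theta$-update is a one-dimensional convex program with a closed-form soft-thresholding solution. Writing $d_i$ for the $i$-th entry of $A\beta$,

```latex
\theta_i
= \operatorname*{arg\,min}_{t}\;
  \lambda_2 |t| + \frac{c}{2}\,(t - d_i)^2 + \nu_i t
= S_{\lambda_2/c}\!\big(d_i - \nu_i/c\big),
\qquad
S_\kappa(z) = \operatorname{sign}(z)\,\max(|z| - \kappa,\, 0),
```

which is exactly the soft-thresholding step in the Appendix B code (with $c$ there denoted by `C`).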
Appendix B

R code for FLSA with quadratic loss

# Requires the limSolve package, which provides Solve.tridiag().
flasso.alm <- function(y, lambda1, lambda2, C = 5, tol = 1e-10) {
  n <- length(y)
  # initialization
  beta <- y
  theta <- rep(0, n - 1)
  gamma <- rep(0, n)
  mu <- rep(0, n)
  nu <- rep(0, n - 1)
  conv <- 100
  iter <- 0
  while (conv > tol) {
    ## update gamma by soft-thresholding
    temp <- (y + C * beta - mu / 2) / (1 + C)
    gamma <- abs(temp) - lambda1 / (1 + C)
    gamma <- pmax(0, gamma) * sign(temp)
    ## compute rhs of the linear system for solving beta
    temp1 <- -C * gamma
    temp2 <- -mu
    temp3 <- c(theta[1], diff(theta), -theta[n - 1]) * C
    temp4 <- c(nu[1], diff(nu), -nu[n - 1])
    rhs <- temp1 + temp2 + temp3 + temp4
    ## compute the three diagonals of the linear system
    diag1 <- rep(-C / 2, n - 1)
    diag2 <- c(C / 2, rep(C, n - 2), C / 2) + rep(C / 2, n)
    ## call the tridiagonal matrix algorithm
    beta <- Solve.tridiag(diag1, diag2, diag1, -rhs / 2)
    ## update theta by soft-thresholding
    temp <- diff(beta) - nu / C
    theta <- abs(temp) - lambda2 / C
    theta <- pmax(0, theta) * sign(temp)
    ## dual updates
    premu <- mu
    mu <- mu + C * (gamma - beta)
    prenu <- nu
    nu <- nu + C * (theta - diff(beta))
    conv <- mean(c((nu - prenu)^2, (mu - premu)^2))  # used to test convergence
    iter <- iter + 1
  }  # while loop end
  # return the estimated signal and the number of iterations performed
  list(beta = beta, iter = iter)
}
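The function Solve.tridiag used above comes from the limSolve package. For readers who want to avoid that dependency, the linear system can be solved with the standard Thomas algorithm; the sketch below is a self-contained alternative (the name tridiag_solve and its argument order are our own choices, not part of the paper's code):

```r
# Thomas algorithm for a tridiagonal linear system.
# lower:  sub-diagonal   (length n-1)
# center: main diagonal  (length n)
# upper:  super-diagonal (length n-1)
# rhs:    right-hand side (length n)
tridiag_solve <- function(lower, center, upper, rhs) {
  n <- length(rhs)
  cp <- numeric(n - 1)  # modified super-diagonal
  dp <- numeric(n)      # modified right-hand side
  cp[1] <- upper[1] / center[1]
  dp[1] <- rhs[1] / center[1]
  # forward elimination
  for (i in 2:n) {
    denom <- center[i] - lower[i - 1] * cp[i - 1]
    if (i < n) cp[i] <- upper[i] / denom
    dp[i] <- (rhs[i] - lower[i - 1] * dp[i - 1]) / denom
  }
  # back substitution
  x <- numeric(n)
  x[n] <- dp[n]
  for (i in (n - 1):1) x[i] <- dp[i] - cp[i] * x[i + 1]
  x
}
```

Inside flasso.alm, the call Solve.tridiag(diag1, diag2, diag1, -rhs/2) could then be replaced by tridiag_solve(diag1, diag2, diag1, -rhs/2). The sweep runs in O(n) time and is well behaved here because the system matrix is strictly diagonally dominant.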