Accelerated Multiplicative Updates and Hierarchical ALS Algorithms for Nonnegative Matrix Factorization
Authors: Nicolas Gillis, François Glineur
Abstract

Nonnegative matrix factorization (NMF) is a data analysis technique used in a great variety of applications such as text mining, image processing, hyperspectral data analysis, computational biology, and clustering. In this paper, we consider two well-known algorithms designed to solve NMF problems, namely the multiplicative updates of Lee and Seung and the hierarchical alternating least squares of Cichocki et al. We propose a simple way to significantly accelerate these schemes, based on a careful analysis of the computational cost needed at each iteration, while preserving their convergence properties. This acceleration technique can also be applied to other algorithms, which we illustrate on the projected gradient method of Lin. The efficiency of the accelerated algorithms is empirically demonstrated on image and text datasets, and compares favorably with a state-of-the-art alternating nonnegative least squares algorithm.

Keywords: nonnegative matrix factorization, algorithms, multiplicative updates, hierarchical alternating least squares.

1 Introduction

Nonnegative matrix factorization (NMF) consists in approximating a nonnegative matrix $M$ as a low-rank product of two nonnegative matrices $W$ and $H$: given a matrix $M \in \mathbb{R}^{m \times n}_+$ and an integer $r < \min\{m, n\}$, find two matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ such that $WH \approx M$.
With a nonnegative input data matrix $M$, nonnegativity constraints on the factors $W$ and $H$ are well known to lead to low-rank decompositions with better interpretation in many applications such as text mining (Shahnaz et al., 2006), image processing (Lee & Seung, 1999), hyperspectral data analysis (Pauca et al., 2006), computational biology (Devarajan, 2008), and clustering (Ding et al., 2005). Unfortunately, imposing these constraints is also known to render the problem computationally difficult (Vavasis, 2009). Since an exact low-rank representation of the input matrix does not exist in general, the quality of the approximation is measured by some criterion, typically the sum of the squares of the errors on the entries, which leads to the following minimization problem:

$$\min_{W \in \mathbb{R}^{m \times r},\, H \in \mathbb{R}^{r \times n}} \|M - WH\|_F^2 \quad \text{such that } W \geq 0 \text{ and } H \geq 0, \qquad \text{(NMF)}$$

where $\|A\|_F = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$ denotes the Frobenius norm of matrix $A$. Most NMF algorithms are iterative, and exploit the fact that (NMF) reduces to an efficiently solvable convex nonnegative least squares problem (NNLS) when one of the factors $W$ or $H$ is fixed.

[Author footnotes: 1. University of Waterloo, Department of Combinatorics and Optimization, Waterloo, Ontario N2L 3G1, Canada. E-mail: ngillis@uwaterloo.ca. This work was carried out when the author was a Research fellow of the Fonds de la Recherche Scientifique (F.R.S.-FNRS) at Université catholique de Louvain. 2. Université catholique de Louvain, CORE and ICTEAM Institute, B-1348 Louvain-la-Neuve, Belgium. E-mail: francois.glineur@uclouvain.be. This text presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister's Office, Science Policy Programming. The scientific responsibility is assumed by the authors.]
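For illustration, the (NMF) objective above can be evaluated with a few lines of NumPy (a minimal sketch; the function name `nmf_error` is ours and not part of any algorithm discussed in this paper):

```python
import numpy as np

def nmf_error(M, W, H):
    """Frobenius norm of the residual M - WH; its square is the (NMF) objective."""
    return np.linalg.norm(M - W @ H, "fro")

# A nonnegative matrix of exact rank 1 is approximated with zero error,
# while perturbing a factor yields a strictly positive error.
W = np.array([[1.0], [2.0]])
H = np.array([[3.0, 4.0]])
M = W @ H
err_exact = nmf_error(M, W, H)
err_perturbed = nmf_error(M, W + 1.0, H)
```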
Actually, it seems that nearly all algorithms proposed for NMF adhere to the following general framework:

(0) Select initial matrices $(W^{(0)}, H^{(0)})$ (e.g., randomly). Then for $k = 0, 1, 2, \dots$, do
(a) Fix $H^{(k)}$ and find $W^{(k+1)} \geq 0$ such that $\|M - W^{(k+1)} H^{(k)}\|_F^2 < \|M - W^{(k)} H^{(k)}\|_F^2$.
(b) Fix $W^{(k+1)}$ and find $H^{(k+1)} \geq 0$ such that $\|M - W^{(k+1)} H^{(k+1)}\|_F^2 < \|M - W^{(k+1)} H^{(k)}\|_F^2$.

More precisely, at each iteration, one of the two factors is fixed and the other is updated in such a way that the objective function is reduced, which amounts to a two-block coordinate descent method. Notice that the roles of matrices $W$ and $H$ are perfectly symmetric: if one transposes the input matrix $M$, the new matrix $M^T$ has to be approximated by a product $H^T W^T$, so that any formula designed to update the first factor in this product directly translates into an update for the second factor in the original problem. Formally, if the update performed in step (a) is described by $W^{(k+1)} = \mathrm{update}(M, W^{(k)}, H^{(k)})$, an algorithm preserving symmetry will update the factor in step (b) according to $H^{(k+1)} = \mathrm{update}(M^T, H^{(k)T}, W^{(k+1)T})^T$. In this paper, we only consider such symmetrical algorithms, and focus on the update of matrix $W$.

This update can be carried out in many different ways: the most natural possibility is to compute an optimal solution for the NNLS subproblem, which leads to a class of algorithms called alternating nonnegative least squares (ANLS), see, e.g., H. Kim & Park (2008). However, this computation, which can be performed with active-set-like methods (H. Kim & Park, 2008; J. Kim & Park, 2008), is relatively costly.
Therefore, since an optimal solution for the NNLS problem corresponding to one factor is not required before the update of the other factor is performed, several algorithms only compute an approximate solution of the NNLS subproblem, sometimes very roughly, but at a cheaper computational cost, leading to an inexact two-block coordinate descent scheme. We now present two such procedures: the multiplicative updates of Lee and Seung and the hierarchical alternating least squares of Cichocki et al.

In their seminal papers, Lee & Seung (1999, 2001) introduced the multiplicative updates:

$$W^{(k+1)} = \mathrm{MU}(M, W^{(k)}, H^{(k)}) = W^{(k)} \circ \frac{[M H^{(k)T}]}{[W^{(k)} H^{(k)} H^{(k)T}]},$$

where $\circ$ (resp. $\frac{[\,\cdot\,]}{[\,\cdot\,]}$) denotes the component-wise product (resp. division) of matrices, and proved that each update monotonically decreases the Frobenius norm of the error $\|M - WH\|_F$, i.e., satisfies the description of steps (a) and (b). This technique was actually originally proposed by Daube-Witherspoon & Muehllehner (1986) to solve NNLS problems. The popularity of this algorithm came along with the popularity of NMF, and many authors have studied or used this algorithm or variants to compute NMF's, see, e.g., Berry et al. (2007); Cichocki et al. (2009) and the references therein. In particular, the MATLAB Statistics Toolbox implements this method. However, MU have been observed to converge relatively slowly, especially when dealing with dense matrices $M$, see Han et al. (2009); Gillis & Glineur (2008) and the references therein, and many other algorithms have subsequently been introduced which perform better in most situations. For example, Cichocki et al.
(2007); Cichocki & Phan (2009) and, independently, several other authors (Ho, 2008; Gillis & Glineur, 2008; Li & Zhang, 2009) proposed a technique called hierarchical alternating least squares (HALS)¹, which successively updates each column of $W$ with an optimal and easy-to-compute closed-form solution. In fact, when fixing all variables but a single column $W_{:p}$ of $W$, the problem reduces to

$$\min_{W_{:p} \geq 0} \|M - WH\|_F^2 = \Big\| \Big(M - \sum_{l \neq p} W_{:l} H_{l:}\Big) - W_{:p} H_{p:} \Big\|_F^2 = \sum_{i=1}^{m} \Big\| \Big(M_{i:} - \sum_{l \neq p} W_{il} H_{l:}\Big) - W_{ip} H_{p:} \Big\|_F^2.$$

Because each row of $W$ only affects the corresponding row of the product $WH$, this problem can be further decoupled into $m$ independent quadratic programs in one variable $W_{ip}$, corresponding to the $i$th row of $M$. The optimal solution $W^*_{ip}$ of these subproblems can easily be written in closed form:

$$W^*_{ip} = \max\Big(0, \frac{\big(M_{i:} - \sum_{l \neq p} W_{il} H_{l:}\big) H_{p:}^T}{H_{p:} H_{p:}^T}\Big) = \max\Big(0, \frac{M_{i:} H_{p:}^T - \sum_{l \neq p} W_{il} H_{l:} H_{p:}^T}{H_{p:} H_{p:}^T}\Big), \quad 1 \leq i \leq m.$$

Hence HALS successively updates the columns of $W$, so that $W^{(k+1)} = \mathrm{HALS}(M, W^{(k)}, H^{(k)})$ can be computed in the following way:

$$W^{(k+1)}_{:p} = \max\Big(0, \frac{A_{:p} - \sum_{l=1}^{p-1} W^{(k+1)}_{:l} B_{lp} - \sum_{l=p+1}^{r} W^{(k)}_{:l} B_{lp}}{B_{pp}}\Big),$$

successively for $p = 1, 2, \dots, r$, where $A = M H^{(k)T}$ and $B = H^{(k)} H^{(k)T}$. This amounts to approximately solving each NNLS subproblem in $W$ with a single complete round of an exact block-coordinate descent method with $r$ blocks of $m$ variables corresponding to the columns of $W$ (notice that any other ordering for the update of the columns of $W$ is also possible).

¹ Ho (2008) refers to HALS as rank-one residue iteration (RRI), and Li & Zhang (2009) as FastNMF.
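As an illustration of these two inexact inner solvers, both updates can be sketched in NumPy as follows (a minimal sketch under the notation above; the function names are ours, and a small constant guards the MU denominator against division by zero):

```python
import numpy as np

def mu_update(M, W, H, guard=1e-16):
    """Lee-Seung multiplicative update of W for fixed H:
    W <- W o [M H^T] / [W H H^T] (component-wise product and division)."""
    return W * (M @ H.T) / np.maximum(W @ (H @ H.T), guard)

def hals_update(M, W, H):
    """One complete round of HALS column updates of W for fixed H,
    reusing A = M H^T and B = H H^T for all r columns."""
    A, B = M @ H.T, H @ H.T
    W = W.copy()
    for p in range(W.shape[1]):
        # C_{:p} = sum_{l != p} W_{:l} B_{lp}, with columns l < p already updated
        C_p = W @ B[:, p] - W[:, p] * B[p, p]
        W[:, p] = np.maximum(0.0, (A[:, p] - C_p) / B[p, p])
    return W

# Both updates decrease the error ||M - WH||_F from a random starting point.
rng = np.random.default_rng(0)
M = rng.random((20, 30))
W0, H = rng.random((20, 5)), rng.random((5, 30))
e0 = np.linalg.norm(M - W0 @ H, "fro")
e_mu = np.linalg.norm(M - mu_update(M, W0, H) @ H, "fro")
e_hals = np.linalg.norm(M - hals_update(M, W0, H) @ H, "fro")
```

By the transposition symmetry described in Section 1, the corresponding $H$-updates are obtained by applying the same functions to $(M^T, H^T, W^T)$.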
Other approaches based on iterative methods to solve the NNLS subproblems include projected gradient descent (Lin, 2007a) and Newton-like methods (Dhillon et al., 2007; Cichocki et al., 2006); see also Cichocki et al. (2009) and the references therein.

We first analyze in Section 2 the computational cost needed to update the factor $W$ in MU and HALS, then make several simple observations leading in Section 3 to the design of accelerated versions of these algorithms. These improvements can in principle be applied to any two-block coordinate descent NMF algorithm, as demonstrated in Section 3.4 on the projected gradient method of Lin (2007a). We mainly focus on MU, because it is by far the most popular NMF algorithm, and on HALS, because it is very efficient in practice. Section 4 studies convergence of the accelerated variants to stationary points, and shows that they preserve the properties of the original schemes. In Section 5, we experimentally demonstrate a significant acceleration in convergence on several image and text datasets, with a comparison with the state-of-the-art ANLS algorithm of J. Kim & Park (2008).

2 Analysis of the Computational Cost of Factor Updates

In order to make our analysis valid for both dense and sparse input matrices, let us introduce a parameter $K$ denoting the number of nonzero entries in matrix $M$ ($K = mn$ when $M$ is dense). Factors $W$ and $H$ are typically stored as dense matrices throughout the execution of the algorithms. We assume that NMF achieves compression, which is often a requirement in practice. This means that storing $W$ and $H$ must be cheaper than storing $M$: roughly speaking, the number of entries in $W$ and $H$ must be smaller than the number of nonzero entries in $M$, i.e., $r(m + n) \leq K$.
Descriptions of Algorithms 1 and 2 below provide separate estimates for the number of floating point operations (flops) in each matrix product computation needed to update factor $W$ in MU and HALS. One can check that the proposed organization of the different matrix computations (and, in particular, the ordering of the matrix products) minimizes the total computational cost (for example, starting the computation of the MU denominator $WHH^T$ with the product $WH$ is clearly worse than starting with $HH^T$). MU and HALS possess almost exactly the same computational cost (the difference being a typically negligible $mr$ flops).

Algorithm 1 MU update for $W^{(k)}$
1: $A = M H^{(k)T}$;  →  $2Kr$ flops
2: $B = H^{(k)} H^{(k)T}$;  →  $2nr^2$ flops
3: $C = W^{(k)} B$;  →  $2mr^2$ flops
4: $W^{(k+1)} = W^{(k)} \circ [A]/[C]$;  →  $2mr$ flops
% Total: $r(2K + 2nr + 2mr + 2m)$ flops

Algorithm 2 HALS update for $W^{(k)}$
1: $A = M H^{(k)T}$;  →  $2Kr$ flops
2: $B = H^{(k)} H^{(k)T}$;  →  $2nr^2$ flops
3: for $p = 1, 2, \dots, r$ do
4:   $C_{:p} = \sum_{l=1}^{p-1} W^{(k+1)}_{:l} B_{lp} + \sum_{l=p+1}^{r} W^{(k)}_{:l} B_{lp}$;  →  $2m(r-1)$ flops
5:   $W^{(k+1)}_{:p} = \max\big(0, (A_{:p} - C_{:p})/B_{pp}\big)$;  →  $3m$ flops
6: end for
% Total: $r(2K + 2nr + 2mr + m)$ flops

It is particularly interesting to observe that

1. Steps 1 and 2 in both algorithms are identical and do not depend on the matrix $W^{(k)}$;
2. Recalling our assumption $K \geq r(m+n)$, the computation of $M H^{(k)T}$ (step 1) is the most expensive among all steps.

Therefore, this time-consuming step should be performed sparingly, and we should take full advantage of having computed the relatively expensive $M H^{(k)T}$ and $H^{(k)} H^{(k)T}$ matrix products. This can be done by updating $W^{(k)}$ several times before the next update of $H^{(k)}$, i.e., by repeating steps 3 and 4 in MU (resp. steps 3 to 6 in HALS) several times after the computation of matrices $M H^{(k)T}$ and $H^{(k)} H^{(k)T}$.
In this fashion, better solutions of the corresponding NNLS subproblems will be obtained at a relatively cheap additional cost. The original MU and HALS algorithms do not take advantage of this fact, and alternately update matrices $W$ and $H$ only once per (outer) iteration. An important question for us is now: how many times should we update $W$ per outer iteration? That is, how many inner iterations of MU and HALS should we perform? This is the topic of the next section.

3 Stopping Criterion for the Inner Iterations

In this section, we discuss two different strategies for choosing the number of inner iterations: the first uses a fixed number of inner iterations determined by the flop counts, while the second is based on a dynamic stopping criterion that checks the difference between two consecutive iterates. The first approach is shown empirically to work better. We also describe a third, hybrid strategy that provides a further small improvement in performance.

3.1 Fixed Number of Inner Iterations

Let us focus on the MU algorithm (a completely similar analysis holds for HALS, as both methods differ only by a negligible number of flops). Based on the flop counts, we estimate how expensive the first inner update of $W$ is relative to the next ones (all performed while keeping $H$ fixed), which is given by the following factor $\rho_W$ (the corresponding value for $H$ will be denoted $\rho_H$):

$$\rho_W = \frac{2Kr + 2nr^2 + 2mr^2 + 2mr}{2mr^2 + 2mr} = 1 + \frac{K + nr}{mr + m}, \qquad \rho_H = 1 + \frac{K + mr}{nr + n}.$$

Values of $\rho_W$ and $\rho_H$ for several datasets are given in Section 5, see Tables 1 and 2. Notice that for $K \geq r(m+n)$, we have $\rho_W \geq \frac{2r}{r+1}$, so that the first inner update of $W$ is at least about twice as expensive as the subsequent ones.
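To make the orders of magnitude concrete, $\rho_W$ and $\rho_H$ are easy to evaluate; the sketch below (function name ours) reproduces the dense-case values reported for the CBCL dataset in Section 3.3 ($m = 361$, $n = 2429$, $r = 20$):

```python
def rho(K, m, n, r):
    """Cost of the first inner update of W relative to the subsequent ones:
    rho_W = 1 + (K + n*r) / (m*r + m). By symmetry, rho_H swaps m and n."""
    return 1 + (K + n * r) / (m * r + m)

m, n, r = 361, 2429, 20    # CBCL face database; dense, so K = m*n
K = m * n
rho_W = rho(K, m, n, r)    # about 123
rho_H = rho(K, n, m, r)    # about 18
```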
For a dense matrix, $K$ is equal to $mn$ and we actually have

$$\rho_W = 1 + \frac{n(m+r)}{m(r+1)} \geq 1 + \frac{n}{r+1},$$

which is typically quite large since $n$ is often much greater than $r$. This means, for example, that in our accelerated scheme $W$ could be updated about $1 + \rho_W$ times for the same computational cost as two independent updates of $W$ in the original MU.

A simple and natural choice consists in performing inner updates of $W$ and $H$ a fixed number of times, depending on the values of $\rho_W$ and $\rho_H$. Let us introduce a parameter $\alpha \geq 0$ such that $W$ is updated $(1 + \alpha \rho_W)$ times before the next update of $H$, and $H$ is updated $(1 + \alpha \rho_H)$ times before the next update of $W$. Let us also denote the corresponding algorithm MU$_\alpha$ (MU$_0$ reduces to the original MU). Therefore, performing $(1 + \alpha \rho_W)$ inner updates of $W$ in MU$_\alpha$ has approximately the same computational cost as performing $(1 + \alpha)$ updates of $W$ in MU$_0$.

In order to find an appropriate value for parameter $\alpha$, we have performed some preliminary tests on image and text datasets. First, let us denote $e(t)$ the Frobenius norm of the error $\|M - WH\|_F$ achieved by an algorithm within time $t$, and define

$$E(t) = \frac{e(t) - e_{\min}}{e(0) - e_{\min}}, \qquad (3.1)$$

where $e(0)$ is the error of the initial iterate $(W^{(0)}, H^{(0)})$, and $e_{\min}$ is the smallest error observed among all algorithms across all initializations. The quantity $E(t)$ is therefore a normalized measure of the improvement of the objective function (relative to the initial gap) with respect to time; we have $0 \leq E(t) \leq 1$ for monotonically decreasing algorithms (such as MU and HALS). The advantage of $E(t)$ over $e(t)$ is that one can meaningfully take the average over several runs involving different initializations and datasets, and display the average behavior of a given algorithm.
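Equation (3.1) is straightforward to implement; the sketch below (function name ours) shows the normalization on artificial error values:

```python
def normalized_error(e_t, e_0, e_min):
    """E(t) = (e(t) - e_min) / (e(0) - e_min), Equation (3.1): the fraction of
    the initial error gap still remaining at time t (1 at the start, 0 at e_min)."""
    return (e_t - e_min) / (e_0 - e_min)

# Hypothetical run: initial error 10, best error ever observed 2, current error 4.
E_start = normalized_error(10.0, 10.0, 2.0)  # 1.0: none of the gap closed yet
E_now = normalized_error(4.0, 10.0, 2.0)     # 0.25: three quarters of the gap closed
```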
Figure 1 displays the average of this function $E(t)$ for dense (left) and sparse (right) matrices, using the datasets described in Section 5, for five values of $\alpha = 0, 0.5, 1, 2, 4$. We observe that the original MU algorithm ($\alpha = 0$) converges significantly less rapidly than all the other tested variants (especially in the dense case). The best value for parameter $\alpha$ seems to be 1. Figure 2 displays the same computational experiments for HALS². HALS with $\alpha = 0.5$ performs the best. For sparse matrices, the improvement is harder to discern (but still present); an explanation for that fact will be given at the end of Section 3.3.

3.2 Dynamic Stopping Criterion for Inner Iterations

In the previous section, a fixed number of inner iterations is performed. One could instead consider switching dynamically from one factor to the other based on an appropriate criterion. For example, it is possible to use the norm of the projected gradient, as proposed by Lin (2007a). A simpler and cheaper possibility is to rely solely on the norm of the difference between two iterates. Denoting $W^{(k,l)}$ the iterate after $l$ updates of $W^{(k)}$ (while $H^{(k)}$ is kept fixed), we stop the inner iterations as soon as

$$\|W^{(k,l+1)} - W^{(k,l)}\|_F \leq \epsilon \, \|W^{(k,1)} - W^{(k,0)}\|_F, \qquad (3.2)$$

² Because HALS involves a loop over the columns of $W$ and rows of $H$, we observed that an update of HALS is noticeably slower than an update of MU when using MATLAB (especially for $r \gg 1$), despite the quasi-equivalent theoretical computational cost. Therefore, to obtain fair results, we adjusted $\rho_W$ and $\rho_H$ by measuring directly the ratio between the time spent for the first update and the next one, using the cputime function of MATLAB.
Figure 1: Average of functions $E(t)$ for MU using different values of $\alpha$: (left) dense matrices, (right) sparse matrices; computed over 4 image datasets and 6 text datasets, using two different values of the rank for each dataset and 10 random initializations, see Section 5.

Figure 2: Average of functions $E(t)$ for HALS using different values of $\alpha$: (left) dense matrices, (right) sparse matrices. Same settings as in Figure 1.

i.e., as soon as the improvement of the last update becomes negligible compared to the one obtained with the first update, but without any a priori fixed maximal number of inner iterations. Figure 3 shows the results for MU with different values of $\epsilon$ (we also include the original MU and MU with $\alpha = 1$ presented in the previous section to serve as a comparison). Figure 4 displays the same experiment for HALS. In both cases, we observe that the dynamic stopping criterion is not able to outperform the approach based on a fixed number of inner iterations ($\alpha = 1$ for MU, $\alpha = 0.5$ for HALS). Moreover, in the experiments for HALS with sparse matrices, it is not even able to compete with the original algorithm.

Figure 3: Average of functions $E(t)$ for MU using different values of $\epsilon$, with $\alpha = 0$ and $\alpha = 1$ for reference (see Section 3.1): (left) dense matrices, (right) sparse matrices. Same settings as in Figure 1.

Figure 4: Average of functions $E(t)$ for HALS using different values of $\epsilon$, with $\alpha = 0$ and $\alpha = 0.5$ for reference (see Section 3.1): (left) dense matrices, (right) sparse matrices. Same settings as in Figure 1.

3.3 A Hybrid Stopping Criterion

We have shown in the previous section that using a fixed number of inner iterations works better than a stopping criterion based solely on the difference between two iterates.
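The dynamic criterion (3.2) only compares the norms of two successive differences; a minimal sketch (function name ours):

```python
import numpy as np

def inner_loop_should_stop(W_curr, W_prev, first_step_norm, eps):
    """Criterion (3.2): stop as soon as the last inner update moves W by at most
    eps times the move produced by the first inner update."""
    return np.linalg.norm(W_curr - W_prev, "fro") <= eps * first_step_norm

# Suppose the first inner update moved W by 2.0 in Frobenius norm; with eps = 0.1,
# a later update moving W by 0.02 triggers the stop, while a move of 1.0 does not.
W = np.ones((2, 2))
stop = inner_loop_should_stop(W + 0.01, W, 2.0, 0.1)  # ||0.01 * ones||_F = 0.02 <= 0.2
cont = inner_loop_should_stop(W + 0.5, W, 2.0, 0.1)   # ||0.5 * ones||_F = 1.0 > 0.2
```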
However, in some circumstances, we have observed that inner iterations become ineffective before their maximal count is reached, so that it would in some cases be worth switching earlier to the other factor. This occurs in particular when the numbers of rows $m$ and columns $n$ of matrix $M$ have different orders of magnitude. For example, assume without loss of generality that $m \ll n$, so that we have $\rho_W \gg \rho_H$. Hence, on the one hand, matrix $W$ has significantly fewer entries than $H$ ($mr \ll nr$), and the corresponding NNLS subproblem features a much smaller number of variables; on the other hand, $\rho_W \gg \rho_H$, so that the above choice will lead to many more updates of $W$ being performed. In other words, many more iterations are performed on the simpler problem, which might be unreasonable. For example, for the CBCL face database (cf. Section 5) with $m = 361$, $n = 2429$ and $r = 20$, we have $\rho_H \approx 18$ and $\rho_W \approx 123$, and this large number of inner $W$-updates is typically not necessary to obtain an iterate close to an optimal solution of the corresponding NNLS subproblem.

Therefore, to avoid unnecessary inner iterations, we propose to combine the fixed number of inner iterations proposed in Section 3.1 with the supplementary stopping criterion described in Section 3.2. This safeguard procedure will stop the inner iterations before their maximum number $\lfloor 1 + \alpha \rho_W \rfloor$ is reached when they become ineffective (depending on parameter $\epsilon$, see Equation (3.2)). Algorithm 3 displays the pseudocode for the corresponding accelerated MU, as well as a similar adaptation for HALS. Figures 5 and 6 display the numerical experiments for MU and HALS respectively.

Algorithm 3 Accelerated MU and HALS
Require: Data matrix $M \in \mathbb{R}^{m \times n}_+$ and initial iterates $(W^{(0)}, H^{(0)}) \in \mathbb{R}^{m \times r}_+ \times \mathbb{R}^{r \times n}_+$.
1: for $k = 0, 1, 2, \dots$
do
2:   Compute $A = M H^{(k)T}$ and $B = H^{(k)} H^{(k)T}$; $W^{(k,0)} = W^{(k)}$;
3:   for $l = 1 : \lfloor 1 + \alpha \rho_W \rfloor$ do
4:     Compute $W^{(k,l)}$ using either MU or HALS (cf. Algorithms 1 and 2);
5:     if $\|W^{(k,l)} - W^{(k,l-1)}\|_F \leq \epsilon \|W^{(k,1)} - W^{(k,0)}\|_F$ then
6:       break;
7:     end if
8:   end for
9:   $W^{(k+1)} = W^{(k,l)}$;
10:  Compute $H^{(k+1)}$ from $H^{(k)}$ and $W^{(k+1)}$ using a symmetrically adapted version of steps 2-9;
11: end for

Figure 5: Average of functions $E(t)$ for MU using different values of $\alpha$ and $\epsilon$: (left) dense matrices, (right) sparse matrices. Same settings as in Figure 1.

Figure 6: Average of functions $E(t)$ for HALS using different values of $\alpha$ and $\epsilon$: (left) dense matrices, (right) sparse matrices. Same settings as in Figure 1.

In the dense case, this safeguard procedure slightly improves performance. We also note that the best values of parameter $\alpha$ now seem to be higher than in the unsafeguarded case ($\alpha = 2$ versus $\alpha = 1$ for MU, and $\alpha = 1$ versus $\alpha = 0.5$ for HALS). The worse performance of those higher values of $\alpha$ in the unsafeguarded scheme can be explained by the fact that additional inner iterations, although sometimes useful, become too costly overall if they are not stopped when they become ineffective. In the sparse case, the improvement is rather limited (if not absent) and most accelerated variants provide similar performance. In particular, as already observed in Sections 3.1 and 3.2, the accelerated variant of HALS does not perform very differently from the original HALS on sparse matrices. We explain this by the fact that HALS applied to sparse matrices is extremely efficient, and one inner update already decreases the objective function significantly.
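Algorithm 3 with MU inner updates can be sketched as follows (a minimal dense-case NumPy sketch; the helper names are ours, and the $10^{-16}$ guard on the denominator anticipates the modified updates of Section 4.1):

```python
import numpy as np

def accelerated_mu(M, W, H, n_outer=50, alpha=2.0, eps=0.1):
    """Algorithm 3 sketch: per outer iteration, A = M H^T and B = H H^T are computed
    once and reused by up to floor(1 + alpha * rho) cheap inner MU updates,
    stopped early by the safeguard (3.2)."""
    m, n = M.shape
    r = W.shape[1]
    K = M.size  # dense case: K = m*n
    rho_W = 1 + (K + n * r) / (m * r + m)
    rho_H = 1 + (K + m * r) / (n * r + n)
    for _ in range(n_outer):
        W = _inner_mu(M, W, H, int(1 + alpha * rho_W), eps)          # steps 2-9
        H = _inner_mu(M.T, H.T, W.T, int(1 + alpha * rho_H), eps).T  # step 10
    return W, H

def _inner_mu(M, W, H, max_inner, eps):
    A = M @ H.T    # the expensive product, computed once per outer iteration
    B = H @ H.T
    first_move = None
    for _ in range(max_inner):
        W_next = W * A / np.maximum(W @ B, 1e-16)  # inner MU step reusing A and B
        move = np.linalg.norm(W_next - W, "fro")
        W = W_next
        if first_move is None:
            first_move = move
        elif move <= eps * first_move:             # safeguard (3.2)
            break
    return W

rng = np.random.default_rng(1)
M = rng.random((30, 40))
W0, H0 = rng.random((30, 5)), rng.random((5, 40))
W, H = accelerated_mu(M, W0, H0)
e_start = np.linalg.norm(M - W0 @ H0, "fro")
e_end = np.linalg.norm(M - W @ H, "fro")
```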
To illustrate this, Figure 7 shows the evolution of the relative error

$$E_k(l) = \frac{\|M - W^{(k,l)} H^{(k)}\|_F - e^k_{\min}}{\|M - W^{(k,0)} H^{(k)}\|_F - e^k_{\min}}$$

of the iterate $W^{(k,l)}$ for a sparse matrix $M$, where³ $e^k_{\min} = \min_{W \geq 0} \|M - W H^{(k)}\|_F$.

³ We have used the active-set algorithm of J. Kim & Park (2008) to compute the optimal value of the NNLS subproblem.

Recall that $(W^{(k,0)}, H^{(k)})$ denotes the solution obtained after $k$ outer iterations (starting from randomly generated matrices). For $k = 1$ (resp. $k = 20$), the relative error is reduced by a factor of more than 87% (resp. 97%) after only one inner iteration.

3.4 Application to Lin's Projected Gradient Algorithm

The accelerating procedure described in the previous sections can potentially be applied to many other NMF algorithms. To illustrate this, we have modified Lin's projected gradient algorithm (PG) (Lin, 2007a) by replacing the original dynamic stopping criterion (based on the stationarity conditions) with the hybrid strategy described in Section 3.3. It is in fact straightforward to see that our analysis is applicable in this case, since Lin's algorithm also requires the computation of $HH^T$ and $MH^T$ when updating $W$, because the gradient of the objective function in (NMF) is given by

$$\nabla_W \|M - WH\|_F^2 = 2WHH^T - 2MH^T.$$

This is also a direct confirmation that our approach can be straightforwardly applied to many more NMF algorithms than those considered in this paper. Figure 8 displays the corresponding computational results, comparing the original PG algorithm (as available from Lin (2007a)), with its dynamic stopping criterion (based on the norm of the projected gradient), to our variants based on a (safeguarded) fixed number of inner iterations. It demonstrates
that our accelerated schemes perform significantly better, both in the sparse and dense cases (notice that in the sparse case, most accelerated variants perform similarly). The choice $\alpha = 0.5$ gives the best results, and the safeguard procedure does not help much; the reason being that PG converges relatively slowly (we will see in Section 5 that its accelerated variant converges more slowly than the accelerated MU).

Figure 7: Evolution of the relative error $E_k(l)$ of the iterates of inner iterations in MU and HALS, solving the NNLS subproblem $\min_{W \geq 0} \|M - W H^{(k)}\|_F$ with $r = 40$ for the classic text dataset (cf. Table 2).

Figure 8: Average of functions $E(t)$ for the projected gradient algorithm of Lin (2007a) and its modification using a fixed number of inner iterations. Same settings as in Figure 1.

4 Convergence to Stationary Points

In this section, we briefly recall convergence properties of both MU and HALS, and show that they are inherited by their accelerated variants.

4.1 Multiplicative Updates

It was shown by Daube-Witherspoon & Muehllehner (1986), and later by Lee & Seung (1999), that a single multiplicative update of $W$ (i.e., replacing $W$ by $W \circ \frac{[MH^T]}{[WHH^T]}$ while $H$ is kept fixed) guarantees that the objective function $\|M - WH\|_F^2$ does not increase. Since our accelerated variant simply performs several updates of $W$ while $H$ is unchanged (and vice versa), we immediately obtain that the objective function $\|M - WH\|_F^2$ is non-increasing under the iterations of Algorithm 3. Unfortunately, this property does not guarantee convergence to a stationary point of (NMF), and this question of the convergence of the MU seems to be still open, see Lin (2007b). Furthermore, in practice, rounding errors might set some entries in $W$ or $H$ to zero, and then multiplicative updates cannot modify their values.
Hence, it was observed that, despite their monotonicity, MU do not necessarily converge to a stationary point, see Gonzales & Zhang (2005). However, Lin (2007b) proposed a slight modification of MU in order to obtain convergence to a stationary point. Roughly speaking, MU is recast as a rescaled gradient descent method and the step length is modified accordingly. Another, even simpler, possibility is proposed by Gillis & Glineur (2008), who proved the following theorem (see also (Gillis, 2011, §4.1), where the influence of parameter $\delta$ is discussed):

Theorem 1 (Gillis & Glineur (2008)). For any constant $\delta > 0$, $M \geq 0$ and any⁴ $(W, H) \geq \delta$, $\|M - WH\|_F$ is nonincreasing under

$$W \leftarrow \max\Big(\delta,\, W \circ \frac{[MH^T]}{[WHH^T]}\Big), \qquad H \leftarrow \max\Big(\delta,\, H \circ \frac{[W^T M]}{[W^T W H]}\Big), \qquad (4.1)$$

where the max is taken component-wise. Moreover, every limit point of the corresponding (alternated) algorithm is a stationary point of the following optimization problem:

$$\min_{W \geq \delta,\, H \geq \delta} \|M - WH\|_F^2.$$

The proof of Theorem 1 relies only on the fact that the limit points of the updates (4.1) are fixed points (there always exists at least one limit point because the objective function is bounded below and non-increasing under the updates (4.1)). Therefore, one can easily check that the proof still holds when a bounded number of inner iterations is performed, i.e., the theorem applies to our accelerated variant (cf. Algorithm 3).

It is important to realize that this is not merely a theoretical issue and that this observation can really play a crucial role in practice. To illustrate this, Figure 9 shows the evolution of the normalized objective function (cf. Equation (3.1)) using $\delta = 0$ and $\delta = 10^{-16}$, starting from the same initial matrices $W^{(0)}$ and $H^{(0)}$ randomly generated (each entry uniformly drawn between 0 and 1).
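The modified updates (4.1) amount to a one-line change to MU; the sketch below (function name ours) shows how an entry of $W$ that is exactly zero, and therefore frozen by the plain multiplicative update, is kept at $\delta$ instead:

```python
import numpy as np

def mu_update_delta(M, W, H, delta=1e-16):
    """Updates (4.1): the usual multiplicative updates followed by a component-wise
    max with delta > 0, so that no entry of W or H can get stuck at exactly zero."""
    W = np.maximum(delta, W * (M @ H.T) / (W @ H @ H.T))
    H = np.maximum(delta, H * (W.T @ M) / (W.T @ W @ H))
    return W, H

# W[0, 0] = 0 would remain 0 forever under the plain MU; here it is lifted to delta.
M = np.ones((2, 2))
W = np.array([[0.0, 1.0], [1.0, 1.0]])
H = np.ones((2, 2))
W1, H1 = mu_update_delta(M, W, H)
```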
We observe that, after some number of iterations, the original MU (i.e., with $\delta = 0$) gets stuck, while the variant with $\delta = 10^{-16}$ is still able to slightly improve $W$ and $H$. Notice that this is especially critical for sparse matrices (see Figure 9, right), because many more entries of $W$ and $H$ are expected to be equal to zero at stationarity. For this reason, in this paper, all numerical experiments with MU use the updates from Equation (4.1) with $\delta = 10^{-16}$ (instead of the original version with $\delta = 0$).

⁴ $(W, H) \geq \delta$ means that $W$ and $H$ are component-wise larger than $\delta$.

Figure 9: Functions $E(t)$ for $\delta = 0$ and $\delta = 10^{-16}$ on the (dense) ORL face dataset (cf. Table 1) and the (sparse) classic text dataset (cf. Table 2) with $r = 40$.

4.2 Hierarchical Alternating Least Squares

HALS is an exact block-coordinate descent method where blocks of variables (columns of $W$ and rows of $H$) are optimized in a cyclic way (first the columns of $W$, then the rows of $H$, etc.). Clearly, exact block-coordinate descent methods always guarantee the objective function to decrease. However, convergence to a stationary point requires additional assumptions. For example, Bertsekas (1999a,b) (Proposition 2.7.1) showed that exact block-coordinate descent methods converge to a stationary point if the following three conditions hold:

• each block of variables belongs to a closed convex set (which is the case here, since the blocks of variables belong either to $\mathbb{R}^m_+$ or $\mathbb{R}^n_+$);
• the minimum computed at each iteration for a given block of variables is uniquely attained;
• the function is monotonically nonincreasing in the interval from one iterate to the next.
The second and third requirements are satisfied as long as no column of $W$ and no row of $H$ becomes entirely zero (the subproblems are then strictly convex quadratic programs, whose unique optimal solutions are given by the HALS updates, see Section 1). In practice, if a column of $W$ or a row of $H$ becomes zero⁵, we reinitialize it to a small positive constant (we used $10^{-16}$). We refer the reader to Ho (2008) and Gillis & Glineur (2008) for more details on the convergence issues related to HALS.

Because our accelerated variant of HALS is just another type of exact block-coordinate descent method (the only difference being that the variables are optimized in a different order: first several times the columns of $W$, then several times the rows of $H$, etc.), it inherits all the above properties. In fact, the statement of the above-mentioned theorem in (Bertsekas, 1999b, p. 6) mentions that 'the order of the blocks may be arbitrary as long as there is an integer K such that each block-component is iterated at least once in every group of K contiguous iterations', which clearly holds for our accelerated algorithm with a fixed number of inner iterations and for its hybrid variant⁶.

5 Numerical Experiments

In this section, we compare the following algorithms, choosing for our accelerated MU, HALS and PG schemes the hybrid stopping criterion and the best compromise for the values of parameters $\alpha$ and $\epsilon$ according to the tests performed in Section 3.

1. (MU) The multiplicative updates algorithm of Lee & Seung (2001).

⁵ In practice, this typically only happens if the initial factors are not properly chosen, see Ho (2008).
6 Note however that the accelerated algorithms based solely on the dynamic stopping criterion (Section 3.2) might not satisfy this requirement, because the number of inner iterations for each outer iteration can in principle grow indefinitely in the course of the algorithm.

2. (A-MU) The accelerated MU with a safeguarded fixed number of inner iterations using α = 2 and ε = 0.1 (cf. Algorithm 3).
3. (HALS) The hierarchical alternating least squares algorithm of Cichocki et al. (2007).
4. (A-HALS) The accelerated HALS with a safeguarded fixed number of inner iterations using α = 0.5 and ε = 0.1 (cf. Algorithm 3).
5. (PG) The projected gradient method of Lin (2007a).
6. (A-PG) The modified projected gradient method of Lin (2007a) using α = 0.5 and ε = 0 (cf. Section 3.4).
7. (ANLS) The alternating nonnegative least squares algorithm^7 of J. Kim & Park (2008), which alternately optimizes W and H exactly using a block-pivot active set method. Kim and Park showed that their method typically outperforms other tested algorithms (in particular MU and PG) on synthetic, image and text datasets.

All tests were run using MATLAB 7.1 (R14), on a 3 GHz Intel Core 2 dual-core processor. We present numerical results on image datasets (dense matrices, Section 5.1) and on text datasets (sparse matrices, Section 5.2). Code for all algorithms but ANLS is available at http://sites.google.com/site/nicolasgillis/.

5.1 Dense Matrices - Image Datasets

Table 1 summarizes the characteristics of the different datasets.

Table 1: Image datasets.

Data     # pixels  m      n     r       ⌊ρ_W⌋     ⌊ρ_H⌋
ORL^1    112 × 92  10304  400   30, 60  358, 195  13, 7
Umist^2  112 × 92  10304  575   30, 60  351, 188  19, 10
CBCL^3   19 × 19   361    2429  30, 60  12, 7     85, 47
Frey^2   28 × 20   560    1965  30, 60  19, 10    67, 36

⌊x⌋ denotes the largest integer smaller than x.
1 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
2 http://www.cs.toronto.edu/~roweis/data.html
3 http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html

For each dataset, we use two different values for the rank (r = 30, 60) and initialize the algorithms with the same 50 random factors (W^(0), H^(0)) (using i.i.d. uniform random variables on [0, 1])^8. In order to assess the performance of the different algorithms, we display individually for each dataset the average over all runs of the function E(t) defined in Equation (3.1), see Figure 10.

First, these results confirm what was already observed in previous works: PG performs better than MU (Lin, 2007a), ANLS performs better than MU and PG (J. Kim & Park, 2008), and HALS performs the best (Ho, 2008). Second, they confirm that the accelerated algorithms are indeed more efficient: A-MU (resp. A-PG) clearly outperforms MU (resp. PG) in all cases, while A-HALS is, by far, the most efficient algorithm for the tested databases. It is interesting to notice that A-MU performs better than A-PG, and only slightly worse than ANLS, often decreasing the error as fast during the first iterations.

7 Code is available at http://www.cc.gatech.edu/~hpark/.
8 Generating the initial matrices (W^(0), H^(0)) randomly typically leads to a very large initial error e(0) = ||M − W^(0)H^(0)||_F. This implies that E(t) will get very small after one step of any algorithm. To avoid this large initial decrease, we have scaled the initial matrices such that argmin_α ||M − αW^(0)H^(0)||_F = 1; this simply amounts to multiplying W and H by an appropriate constant, see Gillis & Glineur (2008).

Figure 10: Average of the functions E(t) for different image datasets: ORL (top left), Umist (top right), CBCL (bottom left) and Frey (bottom right).
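The initial scaling described in the footnote can be sketched as follows. This is a minimal NumPy sketch with our own function name (`scale_initial_factors`): the optimal scalar α* = ⟨M, W^(0)H^(0)⟩ / ||W^(0)H^(0)||_F^2 is absorbed into the factors so that the best scaling of the initial product becomes 1.

```python
import numpy as np

def scale_initial_factors(M, W0, H0):
    """Scale (W0, H0) so that argmin_a ||M - a * W0 @ H0||_F equals 1.

    The unconstrained optimum is a* = <M, W0 H0> / ||W0 H0||_F^2;
    multiplying both factors by sqrt(a*) absorbs it, which avoids the
    artificially large initial error of unscaled random factors."""
    P = W0 @ H0
    a = np.sum(M * P) / np.sum(P * P)   # optimal scalar a* (positive here)
    s = np.sqrt(a)
    return W0 * s, H0 * s
```

After this scaling, the first iteration of any algorithm no longer produces a spuriously large drop in error, making the E(t) curves of different algorithms comparable.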
5.2 Sparse Matrices - Text Datasets

Table 2 summarizes the characteristics of the different datasets. The factorization rank r was set to 10 and 20. For the comparison, we used the same settings as for the dense matrices. Figure 11 displays for each dataset the evolution of the average of the functions E(t) over all runs.

Again, the accelerated algorithms are much more efficient. In particular, A-MU and A-PG converge initially much faster than ANLS, and also obtain better final solutions^9. A-MU, HALS and A-HALS have the fastest initial convergence rates, and HALS and A-HALS generate the best solutions in all cases. Notice that A-HALS does not always obtain better final solutions than HALS (still, this happens on half of the datasets), because HALS already performs remarkably well (see the discussion at the end of Section 3.3). However, the initial convergence of A-HALS is in all cases at least as fast as that of HALS.

9 We also observe that ANLS no longer outperforms the original MU and PG algorithms, and only sometimes generates better solutions.

Table 2: Text mining datasets (Zhong & Ghosh, 2005) (sparsity is given in %: 100 * #zeros/(mn)).

Data     m      n      r       #nonzero  sparsity  ⌊ρ_W⌋   ⌊ρ_H⌋
classic  7094   41681  10, 20  223839    99.92     12, 9   2, 1
sports   8580   14870  10, 20  1091723   99.14     18, 11  10, 6
reviews  4069   18483  10, 20  758635    98.99     35, 22  8, 4
hitech   2301   10080  10, 20  331373    98.57     25, 16  5, 4
ohscal   11162  11465  10, 20  674365    99.47     7, 4    7, 4
la1      3204   31472  10, 20  484024    99.52     31, 21  3, 2

6 Conclusion

In this paper, we considered the multiplicative updates of Lee & Seung (2001) and the hierarchical alternating least squares algorithm of Cichocki et al. (2007). We introduced accelerated variants of these two schemes, based on a careful analysis of the computational cost spent at each iteration, which preserve the convergence properties of the original algorithms.
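To make the acceleration principle concrete, one accelerated outer update of H in the MU scheme might look as follows. This is a simplified NumPy sketch under our own naming, not the paper's exact Algorithm 3 (which additionally uses the safeguard parameter ε and a dynamic stopping criterion); `rho` stands for the ratio between the cost of the expensive precomputations involving M and the cost of one cheap inner update, in the spirit of the ρ_W, ρ_H values reported in Tables 1 and 2.

```python
import numpy as np

def accelerated_mu_H(M, W, H, rho, alpha=2.0, delta=1e-16):
    """One accelerated outer iteration of MU on H (illustrative sketch).

    The products involving the data matrix M are computed once per outer
    iteration (the dominant O(mnr) cost); the cheap multiplicative update
    (O(nr^2)) is then repeated 1 + floor(alpha * rho) times, reusing the
    precomputed W^T M and W^T W."""
    WtM = W.T @ M    # expensive: done once per outer iteration
    WtW = W.T @ W    # small r x r Gram matrix
    tiny = np.finfo(float).tiny
    for _ in range(1 + int(np.floor(alpha * rho))):
        H = np.maximum(delta, H * WtM / np.maximum(tiny, WtW @ H))
    return H
```

The point of the sketch is that the inner loop touches only WtM, WtW and H, so repeating it several times costs little relative to the single pass over M, which is exactly how the accelerated variants extract more progress per unit of computation.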
The idea behind our approach is to take better advantage of the most expensive part of the algorithms, by repeating a (safeguarded) fixed number of times the cheaper part of the iterations. This technique can in principle be applied to most NMF algorithms; in particular, we showed how it can substantially improve the projected gradient method of Lin (2007a). We then experimentally showed that these accelerated variants, despite the relative simplicity of the modification, significantly outperform the original ones, especially on dense matrices, and compete favorably with a state-of-the-art algorithm, namely the ANLS method of J. Kim & Park (2008). A direction for future research would be to choose the number of inner iterations in a more sophisticated way, with the hope of further improving the efficiency of A-MU, A-PG and A-HALS. Finally, we observed that HALS and its accelerated version are the most efficient variants for solving NMF problems, sometimes by far.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments, which helped to improve the paper.

References

Berry, M.W., Browne, M., Langville, A.N., Pauca, P.V. & Plemmons, R.J. (2009). Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics and Data Analysis, 52, 155–173.

Bertsekas, D.P. (1999a). Nonlinear Programming: Second Edition. Athena Scientific, Massachusetts.

Bertsekas, D.P. (1999b). Corrections for the book Nonlinear Programming: Second Edition. Athena Scientific, Massachusetts. Available at http://www.athenasc.com/nlperrata.pdf.

Cichocki, A., Zdunek, R. & Amari, S. (2006). Non-negative Matrix Factorization with Quasi-Newton Optimization. Lecture Notes in Artificial Intelligence, Springer, 4029, 870–879.

Cichocki, A., Zdunek, R. & Amari, S. (2007).
Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization. Lecture Notes in Computer Science, Springer, 4666, 169–176.

Figure 11: Average of the functions E(t) for the text datasets: classic (top left), sports (top right), reviews (middle left), hitech (middle right), ohscal (bottom left) and la1 (bottom right).

Cichocki, A. & Phan, A.H. (2009). Fast local algorithms for large scale Nonnegative Matrix and Tensor Factorizations. IEICE Transactions on Fundamentals of Electronics, Vol. E92-A, No. 3, 708–721.

Cichocki, A., Amari, S., Zdunek, R. & Phan, A.H. (2009). Non-negative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley-Blackwell.

Daube-Witherspoon, M.E. & Muehllehner, G. (1986). An iterative image space reconstruction algorithm suitable for volume ECT. IEEE Trans. Med. Imaging, 5, 61–66.

Devarajan, K. (2008). Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology. PLoS Computational Biology, 4(7), e1000029.

Dhillon, I.S., Kim, D. & Sra, S. (2007). Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation problem. Proc. of SIAM Conf. on Data Mining, 343–354.

Ding, C., He, X. & Simon, H.D. (2005). On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering. Proc. of SIAM Conf. on Data Mining, 606–610.

Gillis, N. (2011). Nonnegative Matrix Factorization: Complexity, Algorithms and Applications. Université catholique de Louvain, PhD Thesis.

Gillis, N. & Glineur, F. (2008). Nonnegative Factorization and The Maximum Edge Biclique Problem. CORE Discussion paper 2008/64.

Gonzales, E.F. & Zhang, Y. (2005). Accelerating the Lee-Seung algorithm for non-negative matrix factorization.
Technical report, Department of Computational and Applied Mathematics, Rice University.

Han, J., Han, L., Neumann, M. & Prasad, U. (2009). On the rate of convergence of the image space reconstruction algorithm. Operators and Matrices, 3(1), 41–58.

Ho, N.-D. (2008). Nonnegative Matrix Factorization - Algorithms and Applications. Université catholique de Louvain, PhD Thesis.

Kim, H. & Park, H. (2008). Non-negative Matrix Factorization Based on Alternating Non-negativity Constrained Least Squares and Active Set Method. SIAM J. Matrix Anal. Appl., 30(2), 713–730.

Kim, J. & Park, H. (2008). Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. Proc. of IEEE Int. Conf. on Data Mining, 353–362.

Lee, D.D. & Seung, H.S. (1999). Learning the Parts of Objects by Nonnegative Matrix Factorization. Nature, 401, 788–791.

Lee, D.D. & Seung, H.S. (2001). Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing, 13.

Li, L. & Zhang, Y.-J. (2009). FastNMF: highly efficient monotonic fixed-point nonnegative matrix factorization algorithm with good applicability. J. Electron. Imaging, 18, 033004.

Lin, C.-J. (2007a). Projected Gradient Methods for Nonnegative Matrix Factorization. Neural Computation, MIT Press, 19, 2756–2779.

Lin, C.-J. (2007b). On the Convergence of Multiplicative Update Algorithms for Nonnegative Matrix Factorization. IEEE Trans. on Neural Networks, 18(6), 1589–1596.

Pauca, P.V., Piper, J. & Plemmons, R.J. (2006). Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 406(1), 29–47.

Shahnaz, F., Berry, M.W., Langville, A.N., Pauca, V.P. & Plemmons, R.J. (2006). Document clustering using nonnegative matrix factorization. Information Processing and Management, 42, 373–386.

Vavasis, S.A. (2009).
On the Complexity of Nonnegative Matrix Factorization. SIAM Journal on Optimization, 20(3), 1364–1377.

Zhong, S. & Ghosh, J. (2005). Generative model-based document clustering: a comparative study. Knowledge and Information Systems, 8(3), 374–384.