Descent methods for Nonnegative Matrix Factorization
Ngoc-Diep Ho, Paul Van Dooren and Vincent D. Blondel (Université catholique de Louvain)
Chapter 1

DESCENT METHODS FOR NONNEGATIVE MATRIX FACTORIZATION

Ngoc-Diep Ho, Paul Van Dooren and Vincent D. Blondel
CESAME, Université catholique de Louvain, Av. Georges Lemaître 4, B-1348 Louvain-la-Neuve, Belgium.
Tel: +32 (10) 47 22 64, Fax: +32 (10) 47 21 80
{ngoc.ho, paul.vandooren, vincent.blondel}@uclouvain.be

Abstract: In this paper, we present several descent methods that can be applied to nonnegative matrix factorization, and we analyze a recently developed fast block coordinate method called Rank-one Residue Iteration (RRI). We also give a comparison of these different methods and show that the new block coordinate method has better properties in terms of approximation error and complexity. By interpreting this method as a rank-one approximation of the residue matrix, we prove that it converges, extend it to nonnegative tensor factorization, and introduce some variants of the method by imposing additional controllable constraints such as sparsity, discreteness and smoothness.

Keywords: Algorithm, Nonnegative matrix, Factorization

1. Introduction

Linear algebra has become a key tool in almost all modern techniques for data analysis. Most of these techniques make use of linear subspaces represented by eigenvectors of a particular matrix. In this paper, we consider a set of $n$ data points $a_1, a_2, \ldots, a_n$, where each point is a real vector of size $m$, $a_i \in \mathbb{R}^m$. We then approximate these data points by linear combinations of $r$ basis vectors $u_j \in \mathbb{R}^m$:
\[
a_i \approx \sum_{j=1}^{r} v_{ij} u_j, \qquad v_{ij} \in \mathbb{R},\ u_j \in \mathbb{R}^m.
\]
This can be rewritten in matrix form as $A \approx UV^T$, where the $a_i$ and $u_i$ are respectively the columns of $A$ and $U$, and the $v_{ij}$'s are the elements of $V$. Optimal solutions of this approximation in terms of the Euclidean (or Frobenius) norm can be obtained by the Singular Value Decomposition (SVD) [13].
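As a concrete sketch (ours, not from the paper; the function name is hypothetical), such an optimal pair $(U, V)$ for a given reduced rank $r$ can be read off a truncated SVD, splitting the top-$r$ singular values evenly between the two factors:

```python
import numpy as np

def best_rank_r_factors(A, r):
    """Return U (m x r) and V (n x r) with A ~ U V^T optimal in Frobenius norm.

    Uses the SVD A = P Sigma Q^T and sets U = P_r D^(1/2), V = Q_r D^(1/2),
    where D holds the r largest singular values.
    """
    P, s, Qt = np.linalg.svd(A, full_matrices=False)
    d_sqrt = np.sqrt(s[:r])
    U = P[:, :r] * d_sqrt        # scale each column by sqrt(sigma_i)
    V = Qt[:r, :].T * d_sqrt
    return U, V

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, V = best_rank_r_factors(A, 2)
err = np.linalg.norm(A - U @ V.T, "fro")  # equals sqrt(sigma_3^2 + sigma_4^2)
```

The balanced scaling of the two factors is one convenient normalization; any invertible rescaling between $U$ and $V$ gives the same product.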
In many cases, data points are constrained to a subset of $\mathbb{R}^m$. For example, light intensities, concentrations of substances and absolute temperatures are, by their nature, nonnegative (or even positive) and lie in the nonnegative orthant $\mathbb{R}^m_+$. The input matrix $A$ then becomes elementwise nonnegative, and it is natural to constrain the basis vectors $u_j$ and the coefficients $v_{ij}$ to be nonnegative as well. In order to satisfy this constraint, we approximate the columns of $A$ by the following additive model:
\[
a_i \approx \sum_{j=1}^{r} v_{ij} u_j, \qquad v_{ij} \in \mathbb{R}_+,\ u_j \in \mathbb{R}^m_+.
\]
Many algorithms have been proposed to find such a representation, which is referred to as a Nonnegative Matrix Factorization (NMF). The earliest algorithms were introduced by Paatero [25, 26], but the topic became quite popular with the publication of the algorithm of Lee and Seung in 1999 [20], where multiplicative rules were introduced to solve the problem. This algorithm is very simple and elegant, but it lacks a complete convergence analysis. Other methods and variants can be found in [23], [21], [17].

The quality of the approximation is often measured by a distance. Two popular choices are the Euclidean (Frobenius) norm and the generalized Kullback-Leibler divergence. In this paper, we focus on the Euclidean distance and we investigate descent methods for this measure. One characteristic of descent methods is their monotonic decrease until they reach a stationary point. This point may be located in the interior of the nonnegative orthant or on its boundary. In the second case, the constraints become active and may prohibit any further decrease of the distance measure. This is a key issue to be analyzed for any descent method.
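To see why the SVD basis is at odds with such nonnegative data (an illustrative sketch of our own, not from the paper): for an elementwise-positive matrix the dominant left singular vector can be chosen positive (Perron-Frobenius applied to $AA^T$), but every further singular vector is orthogonal to that positive vector and must therefore contain entries of both signs:

```python
import numpy as np

# An elementwise-positive data matrix, e.g. light intensities.
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(8, 5))

P, s, Qt = np.linalg.svd(A, full_matrices=False)
u1 = P[:, 0] if P[0, 0] > 0 else -P[:, 0]  # dominant left singular vector, sign-fixed
u2 = P[:, 1]                               # second left singular vector

# u1 is positive throughout, but u2 is orthogonal to the positive u1,
# so it necessarily changes sign somewhere.
has_mixed_signs = (u2 > 0).any() and (u2 < 0).any()
```

This is precisely the obstruction that the additive nonnegative model above removes.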
In this paper, $\mathbb{R}^m_+$ denotes the set of (elementwise) nonnegative real vectors and $[v]_+$ the projection of the vector $v$ onto $\mathbb{R}^m_+$. We use $v \ge 0$ and $A \ge 0$ to denote nonnegative vectors and matrices, and $v > 0$ and $A > 0$ to denote positive vectors and matrices. $A \circ B$ and $\frac{[A]}{[B]}$ are respectively the Hadamard (elementwise) product and quotient. $A_{:i}$ and $A_{i:}$ are the $i$-th column and $i$-th row of $A$.

This paper is an extension of the internal report [15], where we proposed to decouple the problem based on rank-one approximations to create a new algorithm called Rank-one Residue Iteration (RRI). During the revision of this report, we were informed that essentially the same algorithm was independently proposed and published in [8] under the name Hierarchical Alternating Least Squares (HALS). The present paper gives several additional results; the major contributions are the convergence proof of the method and its extensions to many practical situations and constraints. The paper also compares a selection of recent descent methods from the literature and aims at providing a survey of such methods for nonnegative matrix factorization. For that reason, we try to be self-contained and hence recall some well-known results. We also provide short proofs when useful for a better understanding of the rest of the paper.

We first give a short introduction to low-rank approximations, both unconstrained and constrained. In Section 3 we discuss error bounds of various approximations, and in Section 4 we give a number of descent methods for nonnegative matrix factorization. In Section 5 we describe the method based on successive rank-one approximations.
This method is then also extended to approximate higher-order tensors and to take into account constraints other than nonnegativity. In Section 5 we also discuss various regularization methods, and in Section 6 we present numerical experiments comparing the different methods. We end with some concluding remarks.

2. Low-rank matrix approximation

Low-rank approximation is a special case of the matrix nearness problem [14]. When only a rank constraint is imposed, the optimal approximation with respect to the Frobenius norm can be obtained from the Singular Value Decomposition.

We first investigate the problem without the nonnegativity constraint on the low-rank approximation. This is useful for understanding properties of the approximation when the nonnegativity constraints are imposed but inactive. We begin with the well-known Eckart-Young theorem.

Theorem 2.1 (Eckart-Young). Let $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) have the singular value decomposition
\[
A = P \Sigma Q^T, \qquad
\Sigma = \begin{pmatrix}
\sigma_1 & 0 & \dots & 0 \\
0 & \sigma_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma_n \\
\vdots & \vdots & & \vdots \\
0 & 0 & \dots & 0
\end{pmatrix},
\]
where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ are the singular values of $A$ and where $P \in \mathbb{R}^{m \times m}$ and $Q \in \mathbb{R}^{n \times n}$ are orthogonal matrices. Then for $1 \le r \le n$, the matrix
\[
A_r = P \Sigma_r Q^T, \qquad
\Sigma_r = \begin{pmatrix}
\sigma_1 & 0 & \dots & 0 & \dots & 0 \\
0 & \sigma_2 & \dots & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
0 & 0 & \dots & \sigma_r & \dots & 0 \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
0 & 0 & \dots & 0 & \dots & 0
\end{pmatrix}
\]
is a global minimizer of the problem
\[
\min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) \le r} \ \tfrac{1}{2}\|A - B\|_F^2 \tag{1.1}
\]
and its error is
\[
\tfrac{1}{2}\|A - A_r\|_F^2 = \tfrac{1}{2}\sum_{i=r+1}^{n} \sigma_i^2.
\]
Moreover, if $\sigma_r > \sigma_{r+1}$, then $A_r$ is the unique global minimizer.

The proof and other implications can be found, for instance, in [13]. The columns of $P$ and $Q$ are called the singular vectors of $A$; the vectors corresponding to the largest singular values are referred to as the dominant singular vectors.
Let us now look at the following modified problem:
\[
\min_{X \in \mathbb{R}^{m \times r},\ Y \in \mathbb{R}^{n \times r}} \ \tfrac{1}{2}\|A - XY^T\|_F^2, \tag{1.2}
\]
where the rank constraint is implicit in the product $XY^T$, since the dimensions of $X$ and $Y$ guarantee that $\mathrm{rank}(XY^T) \le r$. Conversely, every matrix of rank at most $r$ can be trivially rewritten as a product $XY^T$, where $X \in \mathbb{R}^{m \times r}$ and $Y \in \mathbb{R}^{n \times r}$. Therefore Problems (1.1) and (1.2) are equivalent. But even when the product $A_r = XY^T$ is unique, the pairs $(XR^T, YR^{-1})$ with $R$ invertible yield the same product $XY^T$. In order to avoid this, we can always choose $X$ and $Y$ such that
\[
X = PD^{\frac{1}{2}} \quad \text{and} \quad Y = QD^{\frac{1}{2}}, \tag{1.3}
\]
where $P^TP = I_{r \times r}$, $Q^TQ = I_{r \times r}$ and $D$ is an $r \times r$ nonnegative diagonal matrix. Doing this is equivalent to computing a compact SVD decomposition of the product $A_r = XY^T = PDQ^T$.

As usual for optimization problems, we calculate the gradients with respect to $X$ and $Y$ and set them equal to $0$:
\[
\nabla_X = XY^TY - AY = 0, \qquad \nabla_Y = YX^TX - A^TX = 0. \tag{1.4}
\]
If we then premultiply the first equation by $A^T$ and the second by $A$, we obtain
\[
(A^TA)Y = (A^TX)Y^TY, \qquad (AA^T)X = (AY)X^TX. \tag{1.5}
\]
Substituting $A^TX = YX^TX$ and $AY = XY^TY$ into (1.5) yields
\[
(A^TA)Y = YX^TXY^TY, \qquad (AA^T)X = XY^TYX^TX. \tag{1.6}
\]
Replacing (1.3) into (1.6) and using $P^TP = Q^TQ = I_{r \times r}$, so that $X^TX = Y^TY = D$, yields
\[
(A^TA)QD^{\frac{1}{2}} = QD^{\frac{5}{2}} \quad \text{and} \quad (AA^T)PD^{\frac{1}{2}} = PD^{\frac{5}{2}}.
\]
When $D$ is invertible, this finally yields
\[
(A^TA)Q = QD^2 \quad \text{and} \quad (AA^T)P = PD^2.
\]
This shows that the columns of $P$ and $Q$ are singular vectors and the $D_{ii}$'s are nonzero singular values of $A$. Notice that if $D$ is singular, one can throw away the corresponding columns of $P$ and $Q$ and reduce the problem to a smaller-rank approximation with the same properties. Without loss of generality, we can therefore focus on approximations of Problem (1.2) which are of exact rank $r$.
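The stationarity conditions (1.4) can be checked numerically (a sketch of ours, not from the paper): building $X = PD^{1/2}$, $Y = QD^{1/2}$ from any $r$ singular triplets of $A$ makes both gradients vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 5))
r = 3

# Take the top-r singular triplets and balance the factors as in (1.3).
P, s, Qt = np.linalg.svd(A, full_matrices=False)
d_sqrt = np.sqrt(s[:r])
X = P[:, :r] * d_sqrt
Y = Qt[:r, :].T * d_sqrt

grad_X = X @ (Y.T @ Y) - A @ Y    # gradient (1.4) with respect to X
grad_Y = Y @ (X.T @ X) - A.T @ X  # gradient (1.4) with respect to Y
```

Both residuals are zero up to rounding, confirming that every selection of singular triplets gives a stationary point of (1.2), not only the dominant one.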
We can summarize the above reasoning in the following theorem.

Theorem 2.2. Let $A \in \mathbb{R}^{m \times n}$ ($m > n$ and $\mathrm{rank}(A) = t$). If $A_r$ ($1 \le r \le t$) is a rank-$r$ stationary point of Problem (1.2), then there exist two orthogonal matrices $P \in \mathbb{R}^{m \times m}$ and $Q \in \mathbb{R}^{n \times n}$ such that
\[
A = P \hat{\Sigma} Q^T \quad \text{and} \quad A_r = P \hat{\Sigma}_r Q^T,
\]
where
\[
\hat{\Sigma} = \begin{pmatrix}
\hat{\sigma}_1 & 0 & \dots & 0 \\
0 & \hat{\sigma}_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \hat{\sigma}_n \\
\vdots & \vdots & & \vdots \\
0 & 0 & \dots & 0
\end{pmatrix}, \qquad
\hat{\Sigma}_r = \begin{pmatrix}
\hat{\sigma}_1 & 0 & \dots & 0 & \dots & 0 \\
0 & \hat{\sigma}_2 & \dots & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
0 & 0 & \dots & \hat{\sigma}_r & \dots & 0 \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
0 & 0 & \dots & 0 & \dots & 0
\end{pmatrix}
\]
and the $\hat{\sigma}_i$'s are unsorted singular values of $A$. Moreover, the approximation error is
\[
\tfrac{1}{2}\|A - A_r\|_F^2 = \tfrac{1}{2}\sum_{i=r+1}^{t} \hat{\sigma}_i^2.
\]

This result shows that, if the singular values are all distinct, there are $\frac{n!}{r!(n-r)!}$ possible stationary points $A_r$. When there are multiple singular values, there are infinitely many stationary points $A_r$, since there are infinitely many singular subspaces. The next result identifies the minima among all stationary points. The other stationary points are saddle points, every neighborhood of which contains both lower and higher points.

Theorem 2.3. The only minima of Problem (1.2) are given by Theorem 2.1 and are global minima. All other stationary points are saddle points.

Proof. Let us assume that $A_r$ is a stationary point given by Theorem 2.2 but not by Theorem 2.1. Then there always exists a permutation of the columns of $P$ and $Q$, and of the diagonal elements of $\hat{\Sigma}$ and $\hat{\Sigma}_r$, such that $\hat{\sigma}_{r+1} > \hat{\sigma}_r$. We then construct two points in the $\epsilon$-neighborhood of $A_r$ that yield, respectively, an increase and a decrease of the distance measure. They are obtained by taking
\[
\bar{\Sigma}_r(\epsilon) = \begin{pmatrix}
\hat{\sigma}_1 + \epsilon & \dots & 0 & \dots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
0 & \dots & \hat{\sigma}_r & \dots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
0 & \dots & 0 & \dots & 0
\end{pmatrix}, \qquad
\bar{A}_r(\epsilon) = P \bar{\Sigma}_r(\epsilon) Q^T
\]
and
\[
\underline{\Sigma}_r(\epsilon) = \begin{pmatrix}
\hat{\sigma}_1 & \dots & 0 & 0 & \dots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \dots & \hat{\sigma}_r & \epsilon\sqrt{\hat{\sigma}_r} & \dots & 0 \\
0 & \dots & \epsilon\sqrt{\hat{\sigma}_r} & \epsilon^2 & \dots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \dots & 0 & 0 & \dots & 0
\end{pmatrix}, \qquad
\underline{A}_r(\epsilon) = P \underline{\Sigma}_r(\epsilon) Q^T.
\]
Clearly, $\bar{A}_r(\epsilon)$ and $\underline{A}_r(\epsilon)$ are of rank $r$. Evaluating the distance measure yields
\[
\|A - \underline{A}_r(\epsilon)\|_F^2
= 2\hat{\sigma}_r \epsilon^2 + (\hat{\sigma}_{r+1} - \epsilon^2)^2 + \sum_{i=r+2}^{t} \hat{\sigma}_i^2
= \epsilon^2\bigl[\epsilon^2 - 2(\hat{\sigma}_{r+1} - \hat{\sigma}_r)\bigr] + \sum_{i=r+1}^{t} \hat{\sigma}_i^2
< \sum_{i=r+1}^{t} \hat{\sigma}_i^2 = \|A - A_r\|_F^2
\]
for all $\epsilon \in \bigl(0, \sqrt{2(\hat{\sigma}_{r+1} - \hat{\sigma}_r)}\bigr)$, and
\[
\|A - \bar{A}_r(\epsilon)\|_F^2 = \epsilon^2 + \sum_{i=r+1}^{t} \hat{\sigma}_i^2 > \sum_{i=r+1}^{t} \hat{\sigma}_i^2 = \|A - A_r\|_F^2
\]
for all $\epsilon > 0$. Hence, for an arbitrarily small positive $\epsilon$, we obtain
\[
\|A - \underline{A}_r(\epsilon)\|_F^2 < \|A - A_r\|_F^2 < \|A - \bar{A}_r(\epsilon)\|_F^2,
\]
which shows that $A_r$ is a saddle point of the distance measure. $\square$

When we add a nonnegativity constraint in the next section, the results of this section will help to identify stationary points at which all the nonnegativity constraints are inactive.

3. Nonnegativity constraint

In this section, we investigate the problem of Nonnegative Matrix Factorization. This problem differs from Problem (1.2) of the previous section because of the additional nonnegativity constraints on the factors. We first discuss the effects of adding such a constraint. The problem is then no longer easy, because of the existence of local minima on the boundary of the nonnegative orthant. Determining the lowest minimum among these minima is far from trivial. On the other hand, a minimum that coincides with a minimum of the unconstrained problem (i.e. Problem (1.2)) may be easily reached by standard descent methods, as we will see.

Problem 1 (Nonnegative Matrix Factorization - NMF).
Given an $m \times n$ nonnegative matrix $A$ and an integer $r < \min(m, n)$, solve
\[
\min_{U \in \mathbb{R}^{m \times r}_{+},\ V \in \mathbb{R}^{n \times r}_{+}} \ \tfrac{1}{2}\|A - UV^T\|_F^2,
\]
where $r$ is called the reduced rank. From now on, $m$ and $n$ will be used to denote the size of the target matrix $A$, and $r$ is the reduced rank of the factorization.

We rewrite the nonnegative matrix factorization as a standard nonlinear optimization problem:
\[
\min_{-U \le 0,\ -V \le 0} \ \tfrac{1}{2}\|A - UV^T\|_F^2.
\]
The associated Lagrangian function is
\[
L(U, V, \mu, \nu) = \tfrac{1}{2}\|A - UV^T\|_F^2 - \langle \mu, U \rangle - \langle \nu, V \rangle,
\]
where $\mu$ and $\nu$ are two matrices of the same size as $U$ and $V$, respectively, containing the Lagrange multipliers associated with the nonnegativity constraints $U_{ij} \ge 0$ and $V_{ij} \ge 0$. The Karush-Kuhn-Tucker conditions for the nonnegative matrix factorization problem then say that if $(U, V)$ is a local minimum, there exist $\mu_{ij} \ge 0$ and $\nu_{ij} \ge 0$ such that:
\[
U \ge 0, \quad V \ge 0, \tag{1.7}
\]
\[
\nabla_U L = 0, \quad \nabla_V L = 0, \tag{1.8}
\]
\[
\mu \circ U = 0, \quad \nu \circ V = 0. \tag{1.9}
\]
Developing (1.8), we have
\[
UV^TV - AV - \mu = 0, \qquad VU^TU - A^TU - \nu = 0,
\]
or
\[
\mu = UV^TV - AV, \qquad \nu = VU^TU - A^TU.
\]
Combining this with $\mu_{ij} \ge 0$, $\nu_{ij} \ge 0$ and (1.9) gives the following conditions:
\[
U \ge 0, \quad V \ge 0, \tag{1.10}
\]
\[
\nabla_U F = UV^TV - AV \ge 0, \quad \nabla_V F = VU^TU - A^TU \ge 0, \tag{1.11}
\]
\[
U \circ (UV^TV - AV) = 0, \quad V \circ (VU^TU - A^TU) = 0, \tag{1.12}
\]
where the Lagrange multipliers for $U$ and $V$ coincide with the gradients of $F$ with respect to $U$ and $V$. Since the Euclidean distance is not convex with respect to both variables $U$ and $V$ at the same time, these conditions are only necessary; saddle points and maxima also satisfy them. We call all the points that satisfy the above conditions stationary points.

Definition 1 (NMF stationary point).
We call $(U, V)$ a stationary point of the NMF problem if and only if $U$ and $V$ satisfy the KKT conditions (1.10), (1.11) and (1.12).

Alternatively, a stationary point $(U, V)$ of the NMF problem can also be defined by using the following necessary condition (see for example [4]) on the convex sets $\mathbb{R}^{m \times r}_+$ and $\mathbb{R}^{n \times r}_+$:
\[
\left\langle \begin{pmatrix} \nabla_U F \\ \nabla_V F \end{pmatrix},
\begin{pmatrix} X - U \\ Y - V \end{pmatrix} \right\rangle \ge 0,
\qquad \forall X \in \mathbb{R}^{m \times r}_+,\ Y \in \mathbb{R}^{n \times r}_+, \tag{1.13}
\]
which can be shown to be equivalent to the KKT conditions (1.10), (1.11) and (1.12). Indeed, it is trivial that the KKT conditions imply (1.13), and by carefully choosing different values of $X$ and $Y$ in (1.13), one can easily prove that the KKT conditions hold.

There are two values of the reduced rank $r$ for which we can trivially identify the global solution, namely $r = 1$ and $r = \min(m, n)$. For $r = 1$, a pair of dominant singular vectors is a global minimizer, and for $r = \min(m, n)$, $(U = A, V = I)$ is a global minimizer. Since most existing methods for the nonnegative matrix factorization are descent algorithms, we should pay attention to all local minimizers. For the rank-one case, these can easily be characterized.

Rank-one case

The rank-one NMF problem for a nonnegative matrix $A$ can be rewritten as
\[
\min_{u \in \mathbb{R}^m_+,\ v \in \mathbb{R}^n_+} \ \tfrac{1}{2}\|A - uv^T\|_F^2, \tag{1.14}
\]
and a complete analysis can be carried out. It is well known that any pair of nonnegative Perron vectors of $AA^T$ and $A^TA$ yields a global minimizer of this problem, but we can also show that the only stationary points of (1.14) are given by such vectors. The following theorem excludes the case where $u = 0$ and/or $v = 0$.

Theorem 3.1. The pair $(u, v)$ is a local minimizer of (1.14) if and only if $u$ and $v$ are nonnegative eigenvectors of $AA^T$ and $A^TA$, respectively, of the eigenvalue $\sigma = \|u\|_2^2 \|v\|_2^2$.

Proof. The "if" part follows easily from Theorem 2.2. For the "only if" part, we proceed as follows.
Without loss of generality, we can permute the rows and columns of $A$ such that the corresponding vectors $u$ and $v$ are partitioned as $(u_+ \ 0)^T$ and $(v_+ \ 0)^T$, respectively, where $u_+, v_+ > 0$. Partition the corresponding matrix $A$ conformably as
\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.
\]
Then from (1.11) we have
\[
\begin{pmatrix} u_+ \\ 0 \end{pmatrix} \|v_+\|_2^2
- \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} v_+ \\ 0 \end{pmatrix} \ge 0
\quad \text{and} \quad
\begin{pmatrix} v_+ \\ 0 \end{pmatrix} \|u_+\|_2^2
- \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}
\begin{pmatrix} u_+ \\ 0 \end{pmatrix} \ge 0,
\]
implying that $A_{21} v_+ \le 0$ and $A_{12}^T u_+ \le 0$. Since $A_{21}, A_{12} \ge 0$ and $u_+, v_+ > 0$, we conclude that $A_{12} = 0$ and $A_{21} = 0$. Then from (1.12) we have
\[
u_+ \circ (\|v_+\|_2^2 u_+ - A_{11} v_+) = 0 \quad \text{and} \quad v_+ \circ (\|u_+\|_2^2 v_+ - A_{11}^T u_+) = 0.
\]
Since $u_+, v_+ > 0$, we have
\[
\|v_+\|_2^2 u_+ = A_{11} v_+ \quad \text{and} \quad \|u_+\|_2^2 v_+ = A_{11}^T u_+,
\]
or
\[
\|u_+\|_2^2 \|v_+\|_2^2 u_+ = A_{11} A_{11}^T u_+ \quad \text{and} \quad \|u_+\|_2^2 \|v_+\|_2^2 v_+ = A_{11}^T A_{11} v_+.
\]
Setting $\sigma = \|u_+\|_2^2 \|v_+\|_2^2$ and using the block diagonal structure of $A$ yields the desired result. $\square$

Theorem 3.1 guarantees that all stationary points of the rank-one case are nonnegative singular vectors of a submatrix of $A$. These results imply that a global minimizer of the rank-one NMF can be computed correctly from the largest singular value and the corresponding singular vectors of the matrix $A$.

For ranks other than $1$ and $\min(m, n)$, there are no longer trivial stationary points. In the remainder of this section, we derive some simple characteristics of the local minima of the nonnegative matrix factorization.

The KKT conditions (1.12) help to characterize the stationary points of the NMF problem. Summing up all the elements of one of the conditions (1.12), we get
\[
0 = \sum_{ij} \bigl[U \circ (UV^TV - AV)\bigr]_{ij}
= \bigl\langle U,\ UV^TV - AV \bigr\rangle
= \bigl\langle UV^T,\ UV^T - A \bigr\rangle. \tag{1.15}
\]
From this, we obtain some simple characteristics of the NMF solutions.

Theorem 3.2.
Let $(U, V)$ be a stationary point of the NMF problem. Then $UV^T \in B\bigl(\frac{A}{2}, \frac{1}{2}\|A\|_F\bigr)$, the ball centered at $\frac{A}{2}$ with radius $\frac{1}{2}\|A\|_F$.

Proof. From (1.15) it immediately follows that
\[
\left\langle \frac{A}{2} - UV^T,\ \frac{A}{2} - UV^T \right\rangle = \left\langle \frac{A}{2},\ \frac{A}{2} \right\rangle,
\]
which implies $UV^T \in B\bigl(\frac{A}{2}, \frac{1}{2}\|A\|_F\bigr)$. $\square$

Theorem 3.3. Let $(U, V)$ be a stationary point of the NMF problem. Then
\[
\tfrac{1}{2}\|A - UV^T\|_F^2 = \tfrac{1}{2}\bigl(\|A\|_F^2 - \|UV^T\|_F^2\bigr).
\]

Proof. From (1.15), we have $\langle UV^T, A \rangle = \langle UV^T, UV^T \rangle$. Therefore,
\[
\tfrac{1}{2}\langle A - UV^T,\ A - UV^T \rangle
= \tfrac{1}{2}\bigl(\|A\|_F^2 - 2\langle UV^T, A \rangle + \|UV^T\|_F^2\bigr)
= \tfrac{1}{2}\bigl(\|A\|_F^2 - \|UV^T\|_F^2\bigr). \qquad \square
\]

Theorem 3.3 also shows that at a stationary point $(U, V)$ of the NMF problem, we have $\|A\|_F^2 \ge \|UV^T\|_F^2$. This norm inequality can also be found in [7] for the less general case where $\nabla_U F = 0$ and $\nabla_V F = 0$ at a stationary point. For this particular class of NMF stationary points, all the nonnegativity constraints on $U$ and $V$ are inactive, and all such stationary points are also stationary points of the unconstrained problem, characterized by Theorem 2.2.

We have seen in Theorem 2.3 that, for the unconstrained least-squares problem, the only stable stationary points are in fact global minima. Therefore, if the stationary points of the constrained problem lie inside the nonnegative orthant (i.e. all constraints are inactive), we can probably reach the global minimum of the NMF problem, since the constraints then no longer prohibit the descent of the update.

Let $A_r$ be the optimal rank-$r$ approximation of a nonnegative matrix $A$, which we obtain from the singular value decomposition as indicated in Theorem 2.1. Then we can easily construct its nonnegative part $[A_r]_+$, obtained from $A_r$ by simply setting all its negative elements equal to zero.
This is in fact the closest matrix, in the Frobenius norm, to $A_r$ in the cone of nonnegative matrices (in that sense, it is the projection of $A_r$ onto that cone). We now derive some bounds for the error $\|A - [A_r]_+\|_F$.

Theorem 3.4. Let $A_r$ be the best rank-$r$ approximation of a nonnegative matrix $A$, and let $[A_r]_+$ be its nonnegative part. Then
\[
\|A - [A_r]_+\|_F \le \|A - A_r\|_F.
\]

Proof. This follows easily from the convexity of the cone of nonnegative matrices. Since both $A$ and $[A_r]_+$ are nonnegative and since $[A_r]_+$ is the closest matrix in that cone to $A_r$, we immediately obtain the inequality
\[
\|A - A_r\|_F^2 \ge \|A - [A_r]_+\|_F^2 + \|A_r - [A_r]_+\|_F^2 \ge \|A - [A_r]_+\|_F^2,
\]
from which the result readily follows. $\square$

The approximation $[A_r]_+$ has the merit of requiring only as much storage as a rank-$r$ approximation, even though its rank is larger than $r$ whenever $A_r \ne [A_r]_+$. We will look at the quality of this approximation in Section 6.

If we now compare this bound with the nonnegative approximations, we obtain the following inequalities. Let $U_* V_*^T$ be an optimal nonnegative rank-$r$ approximation of $A$ and let $UV^T$ be any stationary point of the KKT conditions for a nonnegative rank-$r$ approximation. Then we have
\[
\|A - [A_r]_+\|_F^2 \le \|A - A_r\|_F^2 = \sum_{i=r+1}^{n} \sigma_i^2 \le \|A - U_* V_*^T\|_F^2 \le \|A - UV^T\|_F^2.
\]
For more implications of the NMF problem, see [16].

4. Existing descent algorithms

We focus on descent algorithms that guarantee a nonincreasing update at each iteration. Based on the search space, there are two categories: full-space search and (block) coordinate search. Algorithms in the former category try to find updates for both $U$ and $V$ at the same time. This requires a search for a descent direction in the $(m+n)r$-dimensional space.
Note also that the NMF problem in this full space is not convex, but the optimality conditions may be easier to achieve. Algorithms in the latter category, on the other hand, find updates for each (block) coordinate in order to guarantee the descent of the objective function. Usually, the search subspaces are chosen to make the objective function convex, so that efficient methods can be applied. Such a simplification, however, might lead to the loss of some convergence properties.

Most of the algorithms use the following column partitioning:
\[
\tfrac{1}{2}\|A - UV^T\|_F^2 = \tfrac{1}{2}\sum_{i=1}^{n} \|A_{:,i} - U(V_{i,:})^T\|_2^2, \tag{1.16}
\]
which shows that one can minimize with respect to each of the rows of $V$ independently. The problem thus decouples into smaller convex problems. This leads to the solution of quadratic problems of the form
\[
\min_{v \ge 0} \ \tfrac{1}{2}\|a - Uv\|_2^2. \tag{1.17}
\]
Updates for the rows of $V$ are then alternated with updates for the rows of $U$ in a similar manner, by transposing $A$ and $UV^T$.

Independently of the search space, most algorithms use the Projected Gradient scheme, for which three basic steps are carried out at each iteration: calculating the gradient $\nabla F(x^k)$, choosing the step size $\alpha_k$, and projecting the update onto the nonnegative orthant,
\[
x^{k+1} = [x^k - \alpha_k \nabla F(x^k)]_+,
\]
where $x^k$ is the variable in the selected search space. The last two steps can be merged into one iterative process and must guarantee a sufficient decrease of the objective function as well as the nonnegativity of the new point.

Multiplicative rules (Mult)

Multiplicative rules were introduced in [20]. The algorithm applies a block coordinate type search and uses the above column partitioning to formulate the updates. A special feature of this method is that the step size is calculated for each element of the vector.
For the elementary problem (1.17), it is given by
\[
v^{k+1} = v^k - \alpha_k \circ \nabla F(v^k) = v^k \circ \frac{[U^T a]}{[U^T U v^k]},
\qquad \text{where} \quad [\alpha_k]_i = \frac{v^k_i}{[U^T U v^k]_i}.
\]
Applying this to all rows of $V$ and $U$ gives the updating rules of Algorithm 1 to compute $(U^*, V^*) = \arg\min_{U \ge 0,\ V \ge 0} \|A - UV^T\|_F^2$.

Algorithm 1 (Mult)
1: Initialize $U^0$, $V^0$ and $k = 0$
2: repeat
3:   $U^{k+1} = U^k \circ \dfrac{[A V^k]}{[U^k (V^k)^T (V^k)]}$
4:   $V^{k+1} = V^k \circ \dfrac{[A^T U^{k+1}]}{[V^k (U^{k+1})^T (U^{k+1})]}$
5:   $k = k + 1$
6: until Stopping condition

These updates automatically guarantee the nonnegativity of the factors, but may fail to give a sufficient decrease of the objective function. The method may also get stuck at a non-stationary point and hence suffer from poor convergence. Variants can be found in [21, 24].

Line search using Armijo criterion (Line)

In order to ensure a sufficient descent, the following projected gradient scheme with Armijo criterion [23, 22] can be applied to minimize $x^* = \arg\min_x F(x)$.

Algorithm 2 (Line)
1: Initialize $x^0$, $\sigma$, $\beta$, $\alpha_0 = 1$ and $k = 1$
2: repeat
3:   $\alpha_k = \alpha_{k-1}$
4:   $y = [x^k - \alpha_k \nabla F(x^k)]_+$
5:   if $F(y) - F(x^k) > \sigma \langle \nabla F(x^k),\ y - x^k \rangle$ then
6:     repeat
7:       $\alpha_k = \alpha_k \cdot \beta$
8:       $y = [x^k - \alpha_k \nabla F(x^k)]_+$
9:     until $F(y) - F(x^k) \le \sigma \langle \nabla F(x^k),\ y - x^k \rangle$
10:  else
11:    repeat
12:      $lasty = y$
13:      $\alpha_k = \alpha_k / \beta$
14:      $y = [x^k - \alpha_k \nabla F(x^k)]_+$
15:    until $F(y) - F(x^k) > \sigma \langle \nabla F(x^k),\ y - x^k \rangle$
16:    $y = lasty$
17:  end if
18:  $x^{k+1} = y$
19:  $k = k + 1$
20: until Stopping condition

Algorithm 2 needs two parameters $\sigma$ and $\beta$ that may affect its convergence. It requires only gradient information, and it is applied in [23] with two different strategies: to the whole space $(U, V)$ (Algorithm FLine) and to $U$ and $V$ separately in an alternating fashion (Algorithm CLine). With a good choice of parameters ($\sigma = 0.01$ and $\beta = 0.1$) and a good strategy of alternating between the variables, it was reported in [23] to be faster than the multiplicative rules.

Projected gradient with first-order approximation (FO)

In order to find the solution to $x^* = \arg\min_x F(x)$, we can also approximate the function $F(x)$ at each iteration by
\[
\tilde{F}(x) = F(x^k) + \bigl\langle \nabla_x F(x^k),\ x - x^k \bigr\rangle + \frac{L}{2}\|x^k - x\|_2^2,
\]
where $L$ is a Lipschitz constant satisfying $F(x) \le \tilde{F}(x)$, $\forall x$. Because of this inequality, the solution of the problem
\[
x^{k+1} = \arg\min_{x \ge 0} \tilde{F}(x)
\]
is also a point of descent for the function $F(x)$, since
\[
F(x^{k+1}) \le \tilde{F}(x^{k+1}) \le \tilde{F}(x^k) = F(x^k).
\]
Since the constant $L$ is not known a priori, an inner loop is needed. Algorithm 3 presents an iterative way to carry out this scheme. As in the previous algorithm, it requires only gradient information and can therefore be applied with two different strategies: to the whole space $(U, V)$ (Algorithm FFO) and to $U$ and $V$ separately in an alternating fashion (Algorithm CFO). A main difference with the previous algorithm is its stopping criterion for the inner loop. This algorithm also requires a parameter $\beta$, for which the practical choice is $2$.

Alternating least squares methods

The first algorithm proposed for solving the nonnegative matrix factorization was the alternating least squares method [25]. It is known that, when either $U$ or $V$ is fixed, the problem becomes a least squares problem with a nonnegativity constraint. Since the least squares problems in Algorithm 4 can be perfectly decoupled into smaller problems corresponding to the columns or rows of $A$, we can directly apply methods for the Nonnegative Least Squares problem to each of the small problems. Methods that can be applied include [19], [6], etc.
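As an executable companion to Algorithm 1, here is a minimal NumPy sketch of our own (the small constant `eps` guarding the denominators is our safeguard against division by zero, not part of the original rules):

```python
import numpy as np

def nmf_mult(A, r, iters=200, eps=1e-12, seed=0):
    """Multiplicative updates (Lee-Seung) for min ||A - U V^T||_F with U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.uniform(0.1, 1.0, (m, r))
    V = rng.uniform(0.1, 1.0, (n, r))
    for _ in range(iters):
        U *= (A @ V) / (U @ (V.T @ V) + eps)     # step 3 of Algorithm 1
        V *= (A.T @ U) / (V @ (U.T @ U) + eps)   # step 4 of Algorithm 1
    return U, V

rng = np.random.default_rng(3)
A = rng.uniform(0.0, 1.0, (10, 8))
U, V = nmf_mult(A, 2)
```

The objective is nonincreasing under these updates, but, as noted above, this alone does not guarantee convergence to a stationary point.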
Algorithm 3 (FO)
1: Initialize $x^0$, $L_0$ and $k = 0$
2: repeat
3:   $y = [x^k - \frac{1}{L_k}\nabla F(x^k)]_+$
4:   while $F(y) - F(x^k) > \langle \nabla F(x^k),\ y - x^k \rangle + \frac{L_k}{2}\|y - x^k\|_2^2$ do
5:     $L_k = L_k \cdot \beta$
6:     $y = [x^k - \frac{1}{L_k}\nabla F(x^k)]_+$
7:   end while
8:   $x^{k+1} = y$
9:   $L_{k+1} = L_k / \beta$
10:  $k = k + 1$
11: until Stopping condition

Algorithm 4 Alternating Least Squares (ALS)
1: Initialize $U$ and $V$
2: repeat
3:   Solve: $\min_{V \ge 0} \frac{1}{2}\|A - UV^T\|_F^2$
4:   Solve: $\min_{U \ge 0} \frac{1}{2}\|A^T - VU^T\|_F^2$
5: until Stopping condition

Implementation

The most time-consuming job is the test for sufficient decrease, which is also the stopping condition of the inner loop. As mentioned at the beginning of the section, the above methods can be carried out using two different strategies: full-space search or coordinate search. In some cases, the function $F(U, V)$ must be evaluated repeatedly. We mention here how to do this efficiently in the coordinate search.

Full-space search: the exact evaluation of $F(x) = F(U, V) = \|A - UV^T\|_F^2$ needs $O(mnr)$ operations. When there is a correction $y = (U + \Delta U, V + \Delta V)$, we have to calculate $F(y)$, which also requires $O(mnr)$ operations. Hence, $O(tmnr)$ operations are required to determine a step size in $t$ iterations of the inner loop.

Coordinate search: when $V$ is fixed, the Euclidean distance is a quadratic function of $U$:
\[
F(U) = \|A - UV^T\|_F^2
= \langle A, A \rangle - 2\langle UV^T, A \rangle + \langle UV^T, UV^T \rangle
= \|A\|_F^2 - 2\langle U, AV \rangle + \langle U, U(V^TV) \rangle.
\]
The most expensive step is the computation of $AV$, which requires $O(mnr)$ operations. But when $V$ is fixed, $AV$ can be calculated once at the beginning of the inner loop. The remaining computations are $\langle U, AV \rangle$ and $\langle U, U(V^TV) \rangle$, which require $O(mr)$ and $O(mr^2 + mr)$ operations.
Therefore, $O(tmr^2)$ operations are required to determine a step size in $t$ iterations of the inner loop, which is much less than the $O(tmnr)$ operations of the full-space search. This is due to the assumption $r \ll n$. Similarly, when $U$ is fixed, $O(tnr^2)$ operations are needed to determine a step size.

If we consider an iteration to be a sweep, i.e. one update of all the variables, the following table summarizes the complexity of each sweep of the described algorithms:

  Algorithm    Complexity per iteration
  Mult         $O(mnr)$
  FLine        $O(tmnr)$
  CLine        $O(t_1 mr^2 + t_2 nr^2)$
  FFO          $O(tmnr)$
  CFO          $O(t_1 mr^2 + t_2 nr^2)$
  ALS          $O(2^r mnr)$ *
  IALS         $O(mnr)$

where $t$, $t_1$ and $t_2$ are the numbers of inner-loop iterations, which cannot be bounded in general. For the algorithm ALS, the complexity is reported for the case where the active set method [19] is used. Although $O(2^r mnr)$ is a very high theoretical upper bound that counts all possible subsets of the $r$ variables of each subproblem, in practice the active set method needs far fewer iterations to converge. One might as well use more efficient convex optimization tools to solve the subproblems instead of the active set method.

Scaling and stopping criterion

For descent methods, several stopping conditions are used in the literature. We now discuss some problems arising when implementing these conditions for NMF. The very first condition is the decrease of the objective function: the algorithm stops when it fails to make the objective function decrease by a certain amount,
\[
F(U^{k+1}, V^{k+1}) - F(U^k, V^k) < \epsilon
\quad \text{or} \quad
\frac{F(U^{k+1}, V^{k+1}) - F(U^k, V^k)}{F(U^k, V^k)} < \epsilon.
\]
This is not a good choice in all cases, since the algorithm may then stop at a point very far from a stationary point. Time and iteration bounds can also be imposed for very slowly converging algorithms.
But here again this may not be good for the optimality conditions. A better choice is probably the norm of the projected gradient, as suggested in [23]. For the NMF problem it is defined as follows:
$$[\nabla^P_X]_{ij} = \begin{cases} [\nabla_X]_{ij} & \text{if } X_{ij} > 0 \\ \min(0, [\nabla_X]_{ij}) & \text{if } X_{ij} = 0 \end{cases}$$
where $X$ stands for $U$ or $V$. The proposed condition then becomes
$$\left\| \begin{bmatrix} \nabla^P_{U_k} \\ \nabla^P_{V_k} \end{bmatrix} \right\|_F \le \epsilon \left\| \begin{bmatrix} \nabla_{U_1} \\ \nabla_{V_1} \end{bmatrix} \right\|_F. \qquad (1.18)$$

We should also take into account the scaling invariance between $U$ and $V$. Putting $\bar U = \gamma U$ and $\bar V = \frac{1}{\gamma} V$ does not change the approximation $UV^T$, but the above projected gradient norm is affected:
$$\left\| \begin{bmatrix} \nabla^P_{\bar U} \\ \nabla^P_{\bar V} \end{bmatrix} \right\|_F^2 = \|\nabla^P_{\bar U}\|_F^2 + \|\nabla^P_{\bar V}\|_F^2 = \frac{1}{\gamma^2}\|\nabla^P_U\|_F^2 + \gamma^2 \|\nabla^P_V\|_F^2 \ne \left\| \begin{bmatrix} \nabla^P_U \\ \nabla^P_V \end{bmatrix} \right\|_F^2. \qquad (1.19)$$

Two approximate factorizations $UV^T = \bar U \bar V^T$ giving the same approximation should be considered equivalent in terms of precision. One could choose $\gamma^2 := \|\nabla^P_U\|_F / \|\nabla^P_V\|_F$, which minimizes (1.19) and forces $\|\nabla^P_{\bar U}\|_F = \|\nabla^P_{\bar V}\|_F$, but this may not be a good choice when only one of the two gradient norms is nearly zero. In fact, the gradient $\begin{bmatrix} \nabla_U \\ \nabla_V \end{bmatrix}$ is scale dependent in the NMF problem, and any stopping criterion that uses gradient information is affected by this scaling. To limit that effect, we suggest the following scaling after each iteration:
$$\tilde U_k \leftarrow U_k D_k, \qquad \tilde V_k \leftarrow V_k D_k^{-1},$$
where $D_k$ is the positive diagonal matrix with $[D_k]_{ii} = \sqrt{\|V_{:i}\|_2 / \|U_{:i}\|_2}$. This ensures that $\|\tilde U_{:i}\|_2 = \|\tilde V_{:i}\|_2$ and hopefully also reduces the difference between $\|\nabla^P_{\tilde U}\|_F^2$ and $\|\nabla^P_{\tilde V}\|_F^2$. The same scaling should be applied to the initial point $(U_1, V_1)$ as well when using (1.18) as the stopping condition.

5.
Rank-one Residue Iteration

In the previous section, we have seen that it is very appealing to decouple the problem into convex subproblems. But this may "converge" to solutions that are far from the global minimizers of the problem.

In this section, we analyze a different decoupling of the problem, based on rank-one approximations. This also allows us to formulate a very simple basic subproblem. This scheme has a major advantage over the other methods: the subproblems can be solved optimally in closed form. Therefore, a strong convergence result can be proved through its damped version, and the scheme can be extended to more general types of factorizations, such as nonnegative tensor factorizations, and to some practical constraints such as sparsity and smoothness. Moreover, the experiments in Section 6 suggest that this method outperforms the other ones in most cases. During the completion of the revised version of this report, we were informed that an independent report [8] had also proposed this decoupling, without any convergence investigation or extensions.

New partition of variables

Let the $u_i$'s and $v_i$'s be respectively the columns of $U$ and $V$. Then the NMF problem can be rewritten as follows:

Problem 2 (Nonnegative Matrix Factorization). Given an $m \times n$ nonnegative matrix $A$, solve
$$\min_{u_i \ge 0,\ v_i \ge 0} \frac{1}{2}\left\|A - \sum_{i=1}^r u_i v_i^T\right\|_F^2.$$

Let us fix all the variables except a single vector $v_t$ and consider the following least squares problem:
$$\min_{v \ge 0} \frac{1}{2}\|R_t - u_t v^T\|_F^2, \qquad (1.20)$$
where $R_t = A - \sum_{i \ne t} u_i v_i^T$. We have:
$$\|R_t - u_t v^T\|_F^2 = \operatorname{trace}\left[(R_t - u_t v^T)^T (R_t - u_t v^T)\right] \qquad (1.21)$$
$$= \|R_t\|_F^2 - 2\, v^T R_t^T u_t + \|u_t\|_2^2\, \|v\|_2^2. \qquad (1.22)$$
From this formulation, one now derives the following lemma.

Lemma 5.1.
If $[R_t^T u_t]_+ \ne 0$, then $v^* := \frac{[R_t^T u_t]_+}{\|u_t\|_2^2}$ is the unique global minimizer of (1.20), and the optimal function value equals $\|R_t\|_F^2 - \frac{\|[R_t^T u_t]_+\|_2^2}{\|u_t\|_2^2}$.

Proof. Let us permute the elements of the vectors $x := R_t^T u_t$ and $v$ such that
$$Px = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Pv = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}, \quad \text{with } x_1 \ge 0,\ x_2 < 0,$$
where $P$ is the permutation matrix. Then
$$\|R_t - u_t v^T\|_F^2 = \|R_t\|_F^2 - 2 v_1^T x_1 - 2 v_2^T x_2 + \|u_t\|_2^2 (v_1^T v_1 + v_2^T v_2).$$
Since $x_2 < 0$ and $v_2 \ge 0$, it is obvious that $\|R_t - u_t v^T\|_F^2$ can only be minimal if $v_2 = 0$. Our assumption implies that $x_1$ is nonempty and nonzero. Moreover, $[R_t^T u_t]_+ \ne 0$ and $u_t \ge 0$ imply $\|u_t\|_2^2 > 0$; one can then find the optimal $v_1$ by minimizing the remaining quadratic function
$$\|R_t\|_F^2 - 2 v_1^T x_1 + \|u_t\|_2^2\, v_1^T v_1,$$
which yields the solution $v_1 = \frac{x_1}{\|u_t\|_2^2}$. Putting the two components together yields the result
$$v^* = \frac{[R_t^T u_t]_+}{\|u_t\|_2^2} \quad \text{and} \quad \|R_t - u_t v_*^T\|_F^2 = \|R_t\|_F^2 - \frac{\|[R_t^T u_t]_+\|_2^2}{\|u_t\|_2^2}.$$

Algorithm 5 (RRI)
1: Initialize the $u_i$'s and $v_i$'s, for $i = 1$ to $r$
2: repeat
3:   for $t = 1$ to $r$ do
4:     $R_t = A - \sum_{i \ne t} u_i v_i^T$
5:     if $[R_t^T u_t]_+ \ne 0$ then
6:       $v_t \leftarrow \frac{[R_t^T u_t]_+}{\|u_t\|_2^2}$
7:     else
8:       $v_t = 0$
9:     end if
10:    if $[R_t v_t]_+ \ne 0$ then
11:      $u_t \leftarrow \frac{[R_t v_t]_+}{\|v_t\|_2^2}$
12:    else
13:      $u_t = 0$
14:    end if
15:  end for
16: until Stopping condition

Remark 1: The above lemma of course has a dual form, where one fixes $v_t$ but solves for the optimal $u$ to minimize $\|R_t - u v_t^T\|_F^2$. This yields the updating rules
$$v_t \leftarrow \frac{[R_t^T u_t]_+}{\|u_t\|_2^2} \quad \text{and} \quad u_t \leftarrow \frac{[R_t v_t]_+}{\|v_t\|_2^2}, \qquad (1.23)$$
which can be used to recursively update the approximation $\sum_{i=1}^r u_i v_i^T$ by modifying each rank-one matrix $u_t v_t^T$ in a cyclic manner.
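Lemma 5.1 is easy to check numerically. The sketch below (variable names ours) compares the closed-form minimizer and its predicted objective value against random nonnegative perturbations:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 7
Rt = rng.standard_normal((m, n))   # a residue matrix, not necessarily nonnegative
ut = rng.random(m)                 # a nonzero nonnegative vector

x = Rt.T @ ut
v_star = np.maximum(x, 0) / (ut @ ut)        # v* = [Rt^T ut]_+ / ||ut||_2^2

def err(v):
    """Objective of subproblem (1.20): ||Rt - ut v^T||_F^2."""
    return np.linalg.norm(Rt - np.outer(ut, v), 'fro') ** 2

# Optimal value predicted by the lemma
predicted = (np.linalg.norm(Rt, 'fro') ** 2
             - np.linalg.norm(np.maximum(x, 0)) ** 2 / (ut @ ut))
assert np.isclose(err(v_star), predicted)

# No nonnegative perturbation of v* does better
for _ in range(200):
    v = np.maximum(v_star + 0.1 * rng.standard_normal(n), 0)
    assert err(v) >= err(v_star) - 1e-12
```

The perturbation check is of course not a proof, only a sanity test of the formula.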
This problem is different from the NMF, since the error matrices $R_t = A - \sum_{i \ne t} u_i v_i^T$ are no longer nonnegative. We will therefore call this method the Rank-one Residue Iteration (RRI), i.e. Algorithm 5. The same algorithm was independently reported as Hierarchical Alternating Least Squares (HALS) [8].

Remark 2: In the case where $[R_t^T u_t]_+ = 0$, we have the trivial solution $v = 0$, which is not covered by Lemma 5.1. In addition, if $u_t = 0$, this solution is no longer unique: $v$ can in fact be taken arbitrarily, yielding a rank-deficient approximation. The effect of this on the convergence of the algorithm will be discussed further in the next section.

Remark 3: Notice that the optimality of Lemma 5.1 implies that $\|A - UV^T\|$ cannot increase. Since $A \ge 0$ is fixed, $UV^T \ge 0$ must remain bounded, and therefore its components $u_i v_i^T$ ($i = 1, \ldots, r$) must be bounded as well. One can moreover scale the vector pairs $(u_i, v_i)$ at each stage, as explained in Section 4.0, without affecting the local optimality of Lemma 5.1. It then follows that the rank-one products $u_i v_i^T$ and their scaled vectors remain bounded.

Convergence

In the previous section, we established the partial updates for each of the variables $u_i$ and $v_i$. For an NMF problem with reduced rank $r$, we have in total $2r$ vector variables (the $u_i$'s and $v_i$'s). The described algorithm can also be considered a projected gradient method, since the update (1.23) can be rewritten as
$$u_t \leftarrow \frac{[R_t v_t]_+}{\|v_t\|_2^2} = \frac{\left[\left(A - \sum_{i \ne t} u_i v_i^T\right) v_t\right]_+}{\|v_t\|_2^2} = \frac{\left[\left(A - \sum_i u_i v_i^T + u_t v_t^T\right) v_t\right]_+}{\|v_t\|_2^2} = \frac{\left[\left(A - \sum_i u_i v_i^T\right) v_t + u_t v_t^T v_t\right]_+}{\|v_t\|_2^2} = \left[u_t - \frac{1}{\|v_t\|_2^2} \nabla_{u_t}\right]_+.$$
Similarly, the update for $v_t$ can be rewritten as
$$v_t \leftarrow \left[v_t - \frac{1}{\|u_t\|_2^2} \nabla_{v_t}\right]_+.$$
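The residue $R_t$ never needs to be formed explicitly to apply (1.23). A NumPy check of the underlying identity (our variable names):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r, t = 50, 40, 6, 2
A = rng.random((m, n))
U, V = rng.random((m, r)), rng.random((n, r))
ut = U[:, t]

# Naive: form the residue R_t = A - sum_{i != t} u_i v_i^T explicitly, O(mnr)
Rt = A - U @ V.T + np.outer(U[:, t], V[:, t])
naive = Rt.T @ ut

# Cheap: R_t^T u_t = A^T u_t - sum_{i != t} v_i (u_i^T u_t), O(mn + (m + n) r)
w = U.T @ ut          # all inner products u_i^T u_t
w[t] = 0.0            # drop the i = t term
cheap = A.T @ ut - V @ w

assert np.allclose(naive, cheap)
```

The same identity holds for $R_t v_t$ with the roles of $U$ and $V$ exchanged.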
Therefore, the new method follows the projected gradient scheme described in the previous section, but it produces the optimal solution in closed form. For each update of a column $v_t$ (or $u_t$), the proposed algorithm requires just a matrix-vector multiplication $R_t^T u_t$ (or $R_t v_t$), wherein the residue matrix $R_t = A - \sum_{i \ne t} u_i v_i^T$ does not have to be calculated explicitly. Indeed, by computing $R_t^T u_t$ (or $R_t v_t$) from $A^T u_t$ (or $A v_t$) and $\sum_{i \ne t} v_i (u_i^T u_t)$ (or $\sum_{i \ne t} u_i (v_i^T v_t)$), the complexity is reduced from $O(mnr + mn)$ to only $O(mn + (m+n)(r-1))$, which is dominated by $O(mn)$. This implies that each sweep through the $2r$ variables $u_t$ and $v_t$ requires only $O(mnr)$ operations, which is equivalent to one sweep of the multiplicative rules and to one inner loop of any of the gradient methods. This is very low, since the evaluation of the whole gradient already requires the same complexity.

Because at each of the $2r$ basic steps of Algorithm 5 we compute an optimal rank-one nonnegative correction to the corresponding error matrix $R_t$, the Frobenius norm of the error cannot increase. This is a reassuring property, but it does not imply convergence of the algorithm. Each vector $u_t$ or $v_t$ lies in a convex set $U_t \subset \mathbb{R}^m_+$ or $V_t \subset \mathbb{R}^n_+$. Moreover, because of the possibility to include scaling, we can set an upper bound on $\|U\|$ and $\|V\|$, in such a way that all the sets $U_t$ and $V_t$ can be considered closed and convex. We can then use the following theorem to prove a stronger convergence result for Algorithm 5.

Theorem 5.1. Every limit point generated by Algorithm 5 is a stationary point.

Proof.
We notice that if $u_t = 0$ and $v_t = 0$ at some stage of Algorithm 5, they remain zero and no longer take part in any subsequent iteration. We can therefore divide the execution of Algorithm 5 into two phases.

During the first phase, some of the pairs $(u_t, v_t)$ become zero. Because there is only a finite number ($2r$) of such vectors, the number of iterations in this phase is also finite. At the end of this phase, we can rearrange and partition the matrices $U$ and $V$ such that $U = (U_+ \; 0)$ and $V = (V_+ \; 0)$, where $U_+$ and $V_+$ do not have any zero column. We temporarily remove the zero columns from the approximation.

During the second phase, no column of $U_+$ or $V_+$ becomes zero, which guarantees that the updates for the columns of $U_+$ and $V_+$ are unique and optimal. Moreover, $\frac{1}{2}\|A - \sum_{i=1}^r u_i v_i^T\|_F^2$ is continuously differentiable over the set $U_1 \times \ldots \times U_r \times V_1 \times \ldots \times V_r$, and the $U_i$'s and $V_i$'s are closed and convex. A direct application of Proposition 2.7.1 in [4] proves that every limit point $(U_+^*, V_+^*)$ is a stationary point. It is then easy to prove that if zero columns were removed at the end of the first phase, adding them back yields another stationary point, $U^* = (U_+^* \; 0)$ and $V^* = (V_+^* \; 0)$, of the required dimension. However, in this case, the rank of the approximation will be lower than the requested dimension $r$.

In Algorithm 5, the variables are updated in the order $u_1, v_1, u_2, v_2, \ldots$. We can alternate the variables in a different order as well, for example $u_1, u_2, \ldots, u_r, v_1, v_2, \ldots, v_r, \ldots$. Whenever this is carried out in a cyclic fashion, Theorem 5.1 still holds, and this does not increase the complexity of each iteration of the algorithm. As pointed out above, stationary points given by Algorithm 5 may contain useless zero components.
To improve on this, one could replace $u_t v_t^T$ ($\equiv 0$) by any nonnegative rank-one approximation that reduces the norm of the error matrix. For example, the substitution
$$u_t = e_{i^*}, \qquad v_t = [R_t^T e_{i^*}]_+, \qquad (1.24)$$
where $i^* = \operatorname{argmax}_i \|[R_t^T e_i]_+\|_2^2$, reduces the error norm by $\|[R_t^T e_{i^*}]_+\|_2^2 > 0$ unless $R_t \le 0$. These substitutions can be made as soon as $u_t$ and $v_t$ become zero. If such substitutions are made only a finite number of times before the algorithm starts to converge, Theorem 5.1 still holds. In practice, only a few substitutions in total are usually needed for the algorithm to converge to a stationary point without any zero components. Note that the matrix rank of the approximation might not be $r$, even when all the $u_t$'s and $v_t$'s ($t = 1, \ldots, r$) are nonzero.

A possibly better way to fix the problem of zero components is to use the following damped RRI algorithm, in which we introduce $2r$ new dummy variables $w_i \in U_i$ and $z_i \in V_i$, $i = 1, \ldots, r$. The new problem to solve is:

Problem 3 (Damped Nonnegative Matrix Factorization).
$$\min_{u_i \ge 0,\ v_i \ge 0,\ w_i \ge 0,\ z_i \ge 0} \frac{1}{2}\left\|A - \sum_{i=1}^r u_i v_i^T\right\|_F^2 + \frac{\psi}{2} \sum_i \|u_i - w_i\|_2^2 + \frac{\psi}{2} \sum_i \|v_i - z_i\|_2^2,$$
where the damping factor $\psi$ is a positive constant.

Again, the coordinate descent scheme is applied with the cyclic update order $u_1, w_1, v_1, z_1, u_2, w_2, v_2, z_2, \ldots$, resulting in the following optimal updates for $u_t$, $v_t$, $w_t$ and $z_t$:
$$u_t = \frac{[R_t v_t + \psi w_t]_+}{\|v_t\|_2^2 + \psi}, \quad w_t = u_t, \quad v_t = \frac{[R_t^T u_t + \psi z_t]_+}{\|u_t\|_2^2 + \psi} \quad \text{and} \quad z_t = v_t, \qquad (1.25)$$
where $t = 1, \ldots, r$. The updates $w_t = u_t$ and $z_t = v_t$ can be integrated in the updates of $u_t$ and $v_t$ to yield Algorithm 6. We have the following result:

Theorem 5.2.
Every limit point generated by Algorithm 6 is a stationary point of the NMF problem 2.

Algorithm 6 (Damped RRI)
1: Initialize the $u_i$'s and $v_i$'s, for $i = 1$ to $r$
2: repeat
3:   for $t = 1$ to $r$ do
4:     $R_t = A - \sum_{i \ne t} u_i v_i^T$
5:     $v_t \leftarrow \frac{[R_t^T u_t + \psi v_t]_+}{\|u_t\|_2^2 + \psi}$
6:     $u_t \leftarrow \frac{[R_t v_t + \psi u_t]_+}{\|v_t\|_2^2 + \psi}$
7:   end for
8: until Stopping condition

Proof. Clearly, the cost function in Problem 3 is continuously differentiable over the set $U_1 \times \ldots \times U_r \times U_1 \times \ldots \times U_r \times V_1 \times \ldots \times V_r \times V_1 \times \ldots \times V_r$, and the $U_i$'s and $V_i$'s are closed and convex. The uniqueness of the global minimum of the elementary problems and a direct application of Proposition 2.7.1 in [4] prove that every limit point of Algorithm 6 is a stationary point of Problem 3. Moreover, at a stationary point of Problem 3, we have $u_t = w_t$ and $v_t = z_t$, $t = 1, \ldots, r$, so the cost function of Problem 3 coincides with the cost function of the NMF problem 2. This implies that every stationary point of Problem 3 yields a stationary point of the standard NMF problem 2.

This damped version not only eliminates the problem of zero components in the convergence analysis, but may also help to avoid zero columns in the approximation when $\psi$ is carefully chosen. That is, however, not an easy task. Small values of $\psi$ provide an automatic treatment of zeros while changing the RRI updates only slightly. Larger values of $\psi$ might help to prevent the vectors $u_t$ and $v_t$ ($t = 1, \ldots, r$) from becoming zero too soon. But too large values of $\psi$ restrict the updates to small changes, which slows down the convergence. In general, the rank of the approximation can still be lower than the requested dimension, and patches may still be needed when a zero component appears. Therefore, in our experiments, using the undamped RRI Algorithm 5 with the substitution (1.24) is still the best choice.
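One sweep of Algorithm 6 can be sketched as follows (a sketch with our variable names; the residue is formed explicitly here for clarity, although the cheaper computation discussed earlier applies equally):

```python
import numpy as np

def damped_rri_sweep(A, U, V, psi=1e-3):
    """One cyclic sweep of the damped RRI updates of Algorithm 6.
    psi > 0 is the damping factor of Problem 3; the denominators are
    therefore always positive, even when a column is zero."""
    r = U.shape[1]
    for t in range(r):
        Rt = A - U @ V.T + np.outer(U[:, t], V[:, t])
        V[:, t] = np.maximum(Rt.T @ U[:, t] + psi * V[:, t], 0) \
                  / (U[:, t] @ U[:, t] + psi)
        U[:, t] = np.maximum(Rt @ V[:, t] + psi * U[:, t], 0) \
                  / (V[:, t] @ V[:, t] + psi)
    return U, V

rng = np.random.default_rng(4)
A = rng.random((20, 15))
U, V = rng.random((20, 4)), rng.random((15, 4))

err = [np.linalg.norm(A - U @ V.T, 'fro')]
for _ in range(50):
    U, V = damped_rri_sweep(A, U, V)
    err.append(np.linalg.norm(A - U @ V.T, 'fro'))

# Each elementary damped update cannot increase the plain approximation
# error either, so the error sequence is non-increasing.
assert all(e1 <= e0 + 1e-9 for e0, e1 in zip(err, err[1:]))
```

With `psi=0` the sweep reduces to the undamped Algorithm 5, at the cost of possible division by zero for zero columns.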
Variants of the RRI method

We now extend the Rank-one Residue Iteration to factorizations of the type $XDY^T$, where $D$ is diagonal and nonnegative and the columns of the nonnegative matrices $X$ and $Y$ are normalized. The NMF formulation then becomes
$$\min_{x_i \in X_i,\ y_i \in Y_i,\ d_i \in \mathbb{R}_+} \frac{1}{2}\left\|A - \sum_{i=1}^r d_i x_i y_i^T\right\|_F^2,$$
where the $X_i$'s and $Y_i$'s are sets of normed vectors. The variants that we present here depend on the choice of the $X_i$'s and $Y_i$'s. A generalized Rank-one Residue Iteration method for low-rank approximation is given in Algorithm 7. This algorithm needs to solve a sequence of elementary problems of the type
$$\max_{s \in S} y^T s, \qquad (1.26)$$
where $y \in \mathbb{R}^n$ and $S \subset \mathbb{R}^n$ is a set of normed vectors. We first introduce a permutation vector $I_y = (i_1\ i_2\ \ldots\ i_n)$ which reorders the elements of $y$ in non-increasing order: $y_{i_k} \ge y_{i_{k+1}}$, $k = 1, \ldots, n-1$. The function $p(y)$ returns the number of positive entries of $y$.

Algorithm 7 (GRRI)
1: Initialize the $x_i$'s, $y_i$'s and $d_i$'s, for $i = 1$ to $r$
2: repeat
3:   for $i = 1$ to $r$ do
4:     $R_i = A - \sum_{j \ne i} d_j x_j y_j^T$
5:     $y_i \leftarrow \operatorname{argmax}_{s \in Y_i} x_i^T R_i s$
6:     $x_i \leftarrow \operatorname{argmax}_{s \in X_i} y_i^T R_i^T s$
7:     $d_i = x_i^T R_i y_i$
8:   end for
9: until Stopping condition

Let us first point out that for the set of normed nonnegative vectors, the solution of problem (1.26) is given by $s^* = \frac{y_+}{\|y_+\|_2}$. It then follows that Algorithm 7 is essentially the same as Algorithm 5, since the solutions $u_i$ and $v_i$ of each step of Algorithm 5, given by (1.23), correspond exactly to those of problem (1.26) via the relations $x_i = u_i / \|u_i\|_2$, $y_i = v_i / \|v_i\|_2$ and $d_i = \|u_i\|_2 \|v_i\|_2$. Below we list the sets for which the solution $s^*$ of (1.26) can be easily computed.

Set of normed vectors: $s^* = \frac{y}{\|y\|_2}$.
This is useful when one wants to create factorizations where only one of the factors $U$ or $V$ is nonnegative and the other is a real matrix.

Set of normed nonnegative vectors: $s^* = \frac{y_+}{\|y_+\|_2}$.

Set of normed bounded nonnegative vectors $\{s\}$, where $0 \le l_i \le s_i \le p_i$: the optimal solution of (1.26) is given by
$$s^* = \max\left(l,\ \min\left(p,\ \frac{y_+}{\|y_+\|_2}\right)\right).$$

Set of normed binary vectors $\{s\}$, where $s = \frac{b}{\|b\|}$ and $b \in \{0,1\}^n$: the optimal solution of (1.26) is given by
$$[s^*]_{i_t} = \begin{cases} \frac{1}{\sqrt{k^*}} & \text{if } t \le k^* \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } k^* = \operatorname{argmax}_k \frac{\sum_{t=1}^k y_{i_t}}{\sqrt{k}}.$$

Set of normed sparse nonnegative vectors: all normed nonnegative vectors having at most $K$ nonzero entries. The optimal solution of (1.26) is obtained by normalizing the vector $p^*$ defined by
$$[p^*]_{i_t} = \begin{cases} y_{i_t} & \text{if } t \le \min(p(y), K) \\ 0 & \text{otherwise.} \end{cases}$$

Set of normed fixed-sparsity nonnegative vectors: all nonnegative vectors $s$ of a fixed sparsity, where
$$\operatorname{sparsity}(s) = \frac{\sqrt{n} - \|s\|_1 / \|s\|_2}{\sqrt{n} - 1}.$$
The optimal solution of (1.26) is given by the projection scheme in [17].

One can also imagine other variants, for instance by combining the ones above. Depending on how the data need to be approximated, one can create new algorithms, provided it is relatively simple to solve problem (1.26). There have been some particular ideas in the literature, such as NMF with sparseness constraints [17], Semidiscrete Matrix Decomposition [18] and Semi-Nonnegative Matrix Factorization [11], for which variants of the above scheme can offer an alternative choice of algorithm.

Remark: Only the first three sets are the normed version of a closed convex set, as required for convergence by Theorem 5.1. The algorithm might therefore not converge to a stationary point with the other sets. However, the algorithm always guarantees a non-increasing update even in those cases and can therefore be expected to return a good approximation.
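The binary-set case is perhaps the least obvious of the list; it can be checked by brute force on a small instance (a sketch; the function name is ours):

```python
import numpy as np

def best_binary_direction(y):
    """Solve max_s y^T s over normed binary vectors s = b/||b||, b in {0,1}^n:
    sort y in non-increasing order and take the prefix of length k* that
    maximizes (sum of k largest entries) / sqrt(k)."""
    order = np.argsort(-y)
    prefix = np.cumsum(y[order])
    k = int(np.argmax(prefix / np.sqrt(np.arange(1, len(y) + 1)))) + 1
    s = np.zeros_like(y, dtype=float)
    s[order[:k]] = 1.0 / np.sqrt(k)
    return s

rng = np.random.default_rng(5)
y = rng.standard_normal(12)
s_star = best_binary_direction(y)

# Brute force over all 2^12 - 1 nonzero binary vectors
best = -np.inf
for mask in range(1, 1 << 12):
    b = np.array([(mask >> i) & 1 for i in range(12)], dtype=float)
    best = max(best, y @ (b / np.linalg.norm(b)))
assert np.isclose(y @ s_star, best)
```

For fixed support size $k$, the objective is maximized by the $k$ largest entries of $y$, so only the $n$ prefix lengths need to be compared.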
Nonnegative Tensor Factorization

If we refer to the problem of finding the nearest nonnegative vector to a given vector $a$ as nonnegative approximation in one dimension, then NMF is its generalization in two dimensions, and it can naturally be extended to even higher-order tensor approximation problems. The algorithms described in the previous section use the closed-form solution of the one-dimensional problem to solve the two-dimensional problem. We now generalize this to higher orders: since in one dimension such an approximation is easy to construct, we continue to use this approach to build solutions for higher-order problems.

For a low-rank tensor, there are two popular kinds of factored tensors, namely those of Tucker and of Kruskal [2]. We only give an algorithm for finding approximations of Kruskal type; it is easy to extend it to tensors of Tucker type, but this is omitted here.

Given a $d$-dimensional tensor $T$, we derive an algorithm for approximating it by a rank-$r$ nonnegative Kruskal tensor $S \in \mathbb{R}^{n_1 \times n_2 \times \ldots \times n_d}_+$, represented as a sum of $r$ rank-one tensors:
$$S = \sum_{i=1}^r \sigma_i\, u_{1i} \star u_{2i} \star \ldots \star u_{di},$$
where $\sigma_i \in \mathbb{R}_+$ is a scaling factor, $u_{ti} \in \mathbb{R}^{n_t}_+$ is a normed vector (i.e. $\|u_{ti}\|_2 = 1$), and $a \star b$ stands for the outer product between two vectors or tensors $a$ and $b$. The following update rules generalize the matrix case to higher-order tensors:
$$y = (\ldots((\ldots(R_k u_{1k}) \ldots u_{(t-1)k}) u_{(t+1)k}) \ldots) u_{dk}, \qquad (1.27)$$
$$\sigma_k = \|[y]_+\|_2, \qquad u_{tk} = \frac{[y]_+}{\sigma_k}, \qquad (1.28)$$
where $R_k = T - \sum_{i \ne k} \sigma_i\, u_{1i} \star u_{2i} \star \ldots \star u_{di}$ is the residue tensor calculated without the $k$th component of $S$, and $R_k u_{jk}$ denotes the ordinary tensor/vector product in the corresponding dimension. We can then produce an algorithm which updates all the vectors $u_{tk}$ in a cyclic fashion.
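The contraction (1.27) and update (1.28) can be sketched in NumPy for $d = 3$ and $r = 1$ (so the residue is $T$ itself); the function name is ours:

```python
import numpy as np

def contract_all_but(R, us, t):
    """The y of update (1.27): contract the residue tensor R with every
    vector in `us` except the t-th one."""
    y = R
    for mode in reversed(range(len(us))):   # highest mode first keeps axes valid
        if mode != t:
            y = np.tensordot(y, us[mode], axes=([mode], [0]))
    return y

rng = np.random.default_rng(7)
T = rng.random((4, 5, 6))                  # a 3-way nonnegative tensor
us = [rng.random(4), rng.random(5), rng.random(6)]

t = 1
y = contract_all_but(T, us, t)
assert np.allclose(y, np.einsum('ijk,i,k->j', T, us[0], us[2]))

# Update (1.28): project onto the nonnegative orthant and renormalize
sigma = np.linalg.norm(np.maximum(y, 0))
us[t] = np.maximum(y, 0) / sigma
assert np.isclose(np.linalg.norm(us[t]), 1.0)
```

For $r > 1$, the residue would be $T$ minus the other $r - 1$ rank-one terms, exactly as in the matrix case.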
This is in fact a direct extension of Algorithm 5, and one can carry out the same discussion about convergence to guarantee that each limit point of this algorithm is a stationary point of the nonnegative tensor factorization problem, and to improve the approximation quality. Again, as in the previous section, the procedure can be extended to take into account different constraints on the vectors $u_{ti}$, such as discreteness, sparseness, etc.

The approach proposed here differs from that of [9], where a similar cascade procedure for multilayer nonnegative matrix factorization is used to compute a 3D tensor approximation. The approximation error of such a cascade will generally be higher than with our proposed method, since our cost function is minimized by taking all the dimensions into account simultaneously.

Regularizations

Regularization is a common way to cope with the ill-posedness of inverse problems. Knowing some additional information about the solution, one may want to impose constraints a priori on the algorithms, such as smoothness, sparsity or discreteness. To add such regularizations to the RRI algorithms, it is possible to modify the NMF cost function by adding regularizing terms. We list here the updates for the $u_i$'s and $v_i$'s when some simple regularizations are added to the original cost function. The proofs of these updates are straightforward and hence omitted.

One-norm $\|\cdot\|_1$ regularization: the one-norm of the vector variable can be added as a heuristic for finding a sparse solution. This is an alternative to the fixed-sparsity variant presented above. The regularized cost function with respect to the variable $v_t$ is
$$\frac{1}{2}\|R_t - u_t v^T\|_F^2 + \beta \|v\|_1, \qquad \beta > 0,$$
for which the optimal update is given by
$$v_t^* = \frac{[R_t^T u_t - \beta 1_{n \times 1}]_+}{\|u_t\|_2^2}.$$
The constant $\beta > 0$ can be varied to control the trade-off between the approximation error $\frac{1}{2}\|R_t - u_t v^T\|_F^2$ and $\|v\|_1$. From this update, one can see that the regularization works by zeroing out the elements of $R_t^T u_t$ that are smaller than $\beta$, hence reducing the number of nonzero elements of $v_t^*$.

Smoothness regularization $\|v - B\hat v_t\|_F^2$: here $\hat v_t$ is the current value of $v_t$, and the matrix $B$ computes, at each element of $v$, the average of the neighboring elements. When $v$ is a smooth 1D function, $B$ can be the following $n \times n$ matrix:
$$B = \begin{pmatrix} 0 & 1 & & & \\ \frac{1}{2} & 0 & \frac{1}{2} & & \\ & \ddots & \ddots & \ddots & \\ & & \frac{1}{2} & 0 & \frac{1}{2} \\ & & & 1 & 0 \end{pmatrix}. \qquad (1.29)$$
This matrix can be defined differently to take the true topology of $v$ into account, for instance when $v = \operatorname{vec}(F)$ with $F$ a matrix. The regularized cost function with respect to the variable $v_t$ is
$$\frac{1}{2}\|R_t - u_t v^T\|_F^2 + \frac{\delta}{2}\|v - B\hat v_t\|_F^2, \qquad \delta > 0,$$
for which the optimal update is given by
$$v_t^* = \frac{[R_t^T u_t + \delta B \hat v_t]_+}{\|u_t\|_2^2 + \delta}.$$
The constant $\delta \ge 0$ can be varied to control the trade-off between the approximation error $\frac{1}{2}\|R_t - u_t v^T\|_F^2$ and the smoothness of $v_t$ at the fixed point. From the update, one can see that this works by searching for the optimal update $v_t^*$ with some preference for the neighborhood of $B\hat v_t$, i.e., a smoothed version of the current value $\hat v_t$.

The two above regularizations can be added independently to each of the columns of $U$ and/or $V$, and the trade-off factor $\beta$ (or $\delta$) can differ for each column. A combination of different regularizations on a column (for instance $v_t$) can also be used to solve the multi-criterion problem
$$\frac{1}{2}\|R_t - u_t v^T\|_F^2 + \beta \|v\|_1 + \frac{\delta}{2}\|v - B\hat v_t\|_F^2, \qquad \beta, \delta > 0,$$
for which the optimal update is given by
$$v_t^* = \frac{[R_t^T u_t - \beta 1_{n \times 1} + \delta B \hat v_t]_+}{\|u_t\|_2^2 + \delta}.$$
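The regularized updates above can be sketched in one function (our names; with $\beta = \delta = 0$ it reduces to the plain RRI update):

```python
import numpy as np

def regularized_v_update(Rt, ut, beta=0.0, delta=0.0, v_hat=None, B=None):
    """RRI update for v_t with one-norm (beta) and smoothness (delta) terms."""
    num = Rt.T @ ut
    den = ut @ ut
    if beta > 0:
        num = num - beta               # subtract beta * 1_n
    if delta > 0:
        num = num + delta * (B @ v_hat)
        den = den + delta
    return np.maximum(num, 0) / den

def smoothing_matrix(n):
    """The averaging matrix B of (1.29) for a smooth 1D vector."""
    B = np.zeros((n, n))
    B[0, 1] = B[-1, -2] = 1.0
    for i in range(1, n - 1):
        B[i, i - 1] = B[i, i + 1] = 0.5
    return B

rng = np.random.default_rng(6)
m, n = 15, 10
Rt, ut = rng.standard_normal((m, n)), rng.random(m)
v_hat, B = rng.random(n), smoothing_matrix(n)

v_plain = regularized_v_update(Rt, ut)
v_sparse = regularized_v_update(Rt, ut, beta=0.5)
v_smooth = regularized_v_update(Rt, ut, delta=2.0, v_hat=v_hat, B=B)

# The one-norm term can only shrink the support of the update
assert np.count_nonzero(v_sparse) <= np.count_nonzero(v_plain)
```

The sparsity check makes the soft-thresholding interpretation of the one-norm term concrete: entries of $R_t^T u_t$ below $\beta$ are zeroed out.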
The one-norm regularization as well as the two-norm regularization can be found in [1] and [3]. A major difference with those methods is that the norm constraints there are added to the rows rather than to the columns of $V$ or $U$, as done here. For the two versions of the one-norm regularization, however, the effects are somewhat similar. The two-norm regularization on the columns of $U$ and $V$ amounts to a simple scaling, and therefore has no effect in the RRI algorithm. We consequently only test the smoothness regularization, at the end of this chapter, on some numerically generated data. For more extensions and variants, see [16].

6. Experiments

We present here several experiments comparing the different descent algorithms presented in this paper. For all the algorithms, the scaling scheme proposed in Section 4.0 was applied.

Random matrices

We generated 100 random nonnegative matrices of different sizes and used seven different algorithms to approximate each matrix: the multiplicative rules (Mult); alternating least squares using the Matlab function lsqnonneg (ALS); a full space search using line search and the Armijo criterion (FLine); a coordinate search alternating on $U$ and $V$, using line search and the Armijo criterion (CLine); a full space search using a first-order approximation (FFO); a coordinate search alternating on $U$ and $V$, using a first-order approximation (CFO); and the iterative rank-one residue approximation (RRI).

For each matrix, the same starting point is used for every algorithm. We create a starting point by randomly generating two matrices $U$ and $V$ and then rescaling them to yield a first approximation of the original matrix $A$, as proposed in Section 4.0:
$$U = UD\sqrt{\alpha}, \qquad V = VD^{-1}\sqrt{\alpha},$$
where
$$\alpha := \frac{\langle A, UV^T \rangle}{\langle UV^T, UV^T \rangle} \quad \text{and} \quad D_{ij} = \begin{cases} \sqrt{\|V_{:i}\|_2 / \|U_{:i}\|_2} & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}$$
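The starting-point scaling can be sketched as follows (our variable names): $D$ balances the column norms, and $\alpha$ is the least-squares-optimal scaling of $UV^T$ against $A$.

```python
import numpy as np

def scaled_start(A, U, V):
    """Rescale a random starting point (U, V): D balances matching column
    norms of U and V, and alpha minimizes ||A - alpha * U V^T||_F."""
    d = np.sqrt(np.linalg.norm(V, axis=0) / np.linalg.norm(U, axis=0))
    UVt = U @ V.T
    alpha = np.sum(A * UVt) / np.sum(UVt * UVt)
    return U * d * np.sqrt(alpha), (V / d) * np.sqrt(alpha)

rng = np.random.default_rng(8)
A = rng.random((12, 9))
U, V = rng.random((12, 3)), rng.random((9, 3))

U0, V0 = scaled_start(A, U, V)
# The scaled product equals alpha * U V^T: never a worse fit than U V^T itself
assert np.linalg.norm(A - U0 @ V0.T) <= np.linalg.norm(A - U @ V.T) + 1e-12
# and the matching column norms of U0 and V0 are balanced
assert np.allclose(np.linalg.norm(U0, axis=0), np.linalg.norm(V0, axis=0))
```

Since the diagonal scaling cancels in the product, only $\alpha$ changes the initial approximation error.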
From (1.15), we see that when approaching a KKT stationary point of the problem, the above scaling factor satisfies $\alpha \to 1$. This implies that every KKT stationary point of this problem is scale-invariant.

Table 1.1. Comparison of the average successful running times of the algorithms over 100 random matrices. The time limit is 45 seconds. An entry such as 0.02(96) means that a result with the required precision $\epsilon$ was returned within 45 seconds for 96 (of 100) matrices, with an average running time of 0.02 seconds; 45*(0) means the algorithm failed on all 100 matrices.

$\epsilon$ | Mult | ALS | FLine | CLine | FFO | CFO | RRI
($m = 30$, $n = 20$, $r = 2$)
$10^{-2}$ | 0.02(96) | 0.40 | 0.04 | 0.02 | 0.02 | 0.01 | 0.01
$10^{-3}$ | 0.08(74) | 1.36 | 0.12 | 0.09 | 0.05 | 0.04 | 0.03
$10^{-4}$ | 0.17(71) | 2.81 | 0.24 | 0.17 | 0.11 | 0.08 | 0.05
$10^{-5}$ | 0.36(64) | 4.10 | 0.31 | 0.25 | 0.15 | 0.11 | 0.07
$10^{-6}$ | 0.31(76) | 4.74 | 0.40 | 0.29 | 0.19 | 0.15 | 0.09
($m = 100$, $n = 50$, $r = 5$)
$10^{-2}$ | 45*(0) | 3.48 | 0.10 | 0.09 | 0.09 | 0.04 | 0.02
$10^{-3}$ | 45*(0) | 24.30(96) | 0.59 | 0.63 | 0.78 | 0.25 | 0.15
$10^{-4}$ | 45*(0) | 45*(0) | 2.74 | 2.18 | 3.34 | 0.86 | 0.45
$10^{-5}$ | 45*(0) | 45*(0) | 5.93 | 4.06 | 6.71 | 1.58 | 0.89
$10^{-6}$ | 45*(0) | 45*(0) | 7.23 | 4.75 | 8.98 | 1.93 | 1.30
($m = 100$, $n = 50$, $r = 10$)
$10^{-2}$ | 45*(0) | 11.61 | 0.28 | 0.27 | 0.18 | 0.11 | 0.05
$10^{-3}$ | 45*(0) | 41.89(5) | 1.90 | 2.11 | 1.50 | 0.74 | 0.35
$10^{-4}$ | 45*(0) | 45*(0) | 7.20 | 5.57 | 5.08 | 2.29 | 1.13
$10^{-5}$ | 45*(0) | 45*(0) | 12.90 | 9.69 | 10.30 | 4.01 | 1.71
$10^{-6}$ | 45*(0) | 45*(0) | 14.62(99) | 11.68(99) | 13.19 | 5.26 | 2.11
($m = 100$, $n = 50$, $r = 15$)
$10^{-2}$ | 45*(0) | 25.98 | 0.66 | 0.59 | 0.40 | 0.20 | 0.09
$10^{-3}$ | 45*(0) | 45*(0) | 3.90 | 4.58 | 3.18 | 1.57 | 0.61
$10^{-4}$ | 45*(0) | 45*(0) | 16.55(98) | 13.61(99) | 9.74 | 6.12 | 1.87
$10^{-5}$ | 45*(0) | 45*(0) | 21.72(97) | 17.31(92) | 16.59(98) | 7.08 | 2.39
$10^{-6}$ | 45*(0) | 45*(0) | 25.88(89) | 19.76(98) | 19.20(98) | 10.34 | 3.66
($m = 100$, $n = 100$, $r = 20$)
$10^{-2}$ | 45*(0) | 42.51(4) | 1.16 | 0.80 | 0.89 | 0.55 | 0.17
$10^{-3}$ | 45*(0) | 45*(0) | 9.19 | 8.58 | 10.51 | 5.45 | 1.41
$10^{-4}$ | 45*(0) | 45*(0) | 28.59(86) | 20.63(94) | 29.89(69) | 12.59 | 4.02
$10^{-5}$ | 45*(0) | 45*(0) | 32.89(42) | 27.94(68) | 34.59(34) | 18.83(90) | 6.59
$10^{-6}$ | 45*(0) | 45*(0) | 37.14(20) | 30.75(60) | 36.48(8) | 22.80(87) | 8.71
($m = 200$, $n = 100$, $r = 30$)
$10^{-2}$ | 45*(0) | 45*(0) | 2.56 | 2.20 | 2.68 | 1.31 | 0.44
$10^{-3}$ | 45*(0) | 45*(0) | 22.60(99) | 25.03(98) | 29.67(90) | 12.94 | 4.12
$10^{-4}$ | 45*(0) | 45*(0) | 36.49(2) | 39.13(13) | 45*(0) | 33.33(45) | 14.03
$10^{-5}$ | 45*(0) | 45*(0) | 45*(0) | 39.84(2) | 45*(0) | 37.60(6) | 21.96(92)
$10^{-6}$ | 45*(0) | 45*(0) | 45*(0) | 45*(0) | 45*(0) | 45*(0) | 25.61(87)

The algorithms are all stopped when the projected gradient norm is lower than $\epsilon$ times the gradient norm at the starting point, or when the run takes more than 45 seconds. The relative precisions $\epsilon$ are chosen equal to $10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$ and $10^{-6}$. No limit was imposed on the number of iterations. For the alternating gradient algorithms CLine and CFO, we use different precisions $\epsilon_U$ and $\epsilon_V$ for the inner iterations on $U$ and on $V$, as suggested in [23], where $\epsilon_U$ and $\epsilon_V$ are initialized to $10^{-3}$. When the inner loop for $U$ or $V$ needs no iteration to reach the precision $\epsilon_U$ or $\epsilon_V$, one more digit of precision is added (i.e. $\epsilon_U = \epsilon_U/10$ or $\epsilon_V = \epsilon_V/10$).

Table 1.1 shows that for all sizes and ranks, Algorithm RRI is the fastest to reach the required precision. Even though it is widely used in practice, algorithm Mult fails to provide solutions to the NMF problem within the allocated time.
A further investigation shows that the algorithm easily gets trapped at boundary points where some $U_{ij}$ and/or $V_{ij}$ is zero while $\nabla_{U_{ij}}$ and/or $\nabla_{V_{ij}}$ is negative, hence violating one of the KKT conditions (1.11). The multiplicative rules then fail to move away from such points, which are not local minimizers. A slightly modified version of this algorithm was given in [21], but it needs to wait until it gets sufficiently close to such points before attempting an escape, and is therefore also not efficient. The ALS algorithm can return a stationary point, but it takes too long.

We select five methods, FLine, CLine, FFO, CFO and RRI, for a more detailed comparison. For each matrix $A$, we run these algorithms with 100 different starting points. Figures 1.1, 1.2, 1.3 and 1.4 show the results for several different settings. One can see that, while the approximation errors are almost the same across the algorithms, RRI is the best overall in terms of running time. This is probably because the RRI algorithm optimizes only one vector $u_t$ or $v_t$ at a time, which allows it to move optimally along one coordinate direction rather than taking a small step along a more global direction. Furthermore, the computational load of each update is very small: only one matrix-vector multiplication is needed. All these factors make the running time of the RRI algorithm very attractive.

Image data

The following experiments use the Cambridge ORL face database as input data. The database contains 400 images of 40 persons (10 images per person). Each image has size $112 \times 92$, with 256 gray levels per pixel, and represents a front view of the face of a person. The images
Figure 1.1. Comparison of selected algorithms for ε = 10^-7 (m = 30, n = 20, r = 2; relative Euclidean error and running time)

Figure 1.2. Comparison of selected algorithms for ε = 10^-5 (m = 100, n = 50, r = 10; relative Euclidean error and running time)

Figure 1.3. Comparison of selected algorithms for ε = 10^-4 (m = 100, n = 50, r = 15; relative Euclidean error and running time)

Figure 1.4. Comparison of selected algorithms for ε = 10^-3 (m = 100, n = 100, r = 20; relative Euclidean error and running time)

The images are then transformed into 400 "face vectors" in R^10304 (112 × 92 = 10304) to form the data matrix A of size 10304 × 400. We used three weight matrices of the same size as A (i.e. 10304 × 400). Since its use in [20], this data set has become a standard benchmark for NMF algorithms.

Figure 1.5. NMF: Error vs. Iterations

In the first experiment, we run the six NMF algorithms described above on this data with reduced rank 49. The original matrix A is formed by taking each image as one of its columns.
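The vectorization step described above — flattening each 112 × 92 image into a column of length 10304 — can be sketched as follows; synthetic random images stand in for the ORL data here:

```python
import numpy as np

# Synthetic stand-in for the ORL database: 400 gray-scale images of 112 x 92
# with 256 gray levels per pixel.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(400, 112, 92))

# Each image is flattened into a "face vector" of length 112 * 92 = 10304
# and becomes one column of the data matrix A (10304 x 400).
A = images.reshape(400, 112 * 92).T.astype(float)
```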
Figure 1.5 shows, for the six algorithms, the evolution of the error versus the number of iterations. Because the minimization process differs from one algorithm to another, we say that one iteration corresponds to all elements of both U and V being updated. Figure 1.6 shows the evolution of the error versus time. Since the work per iteration varies from one algorithm to another, it is crucial to plot the error versus time to get a fair comparison between the different algorithms. In the two figures, we can see that the RRI algorithm behaves very well on this dataset. Since its computational load per iteration is small and constant (there is no inner loop), this algorithm converges faster than the others.

In the second experiment, we construct a third-order nonnegative tensor approximation. We first build a tensor by stacking all 400 images to obtain a 112 × 92 × 400 nonnegative tensor. Using the proposed algorithm, a rank-142 nonnegative tensor is calculated to approximate this tensor. Figure 1.7 shows the result for six images chosen randomly from the 400 images. Their approximations given by the rank-142 nonnegative tensor are much better than those given by the rank-8 nonnegative matrix, even though the two require similar storage: 8 · (112 · 92 + 400) = 85632 and 142 · (112 + 92 + 400) = 85768 entries, respectively. The rank-8 truncated SVD approximation (i.e. [A_8]_+) is also included for reference.

Figure 1.6. NMF: Error vs. Time

Figure 1.7. Tensor Factorization vs. Matrix Factorization on facial data. Six randomly chosen images from the 400 of the ORL dataset. From top to bottom: original images, their rank-8 truncated SVD approximation, their rank-142 nonnegative tensor approximation (150 RRI iterations) and their rank-8 nonnegative matrix approximation (150 RRI iterations).
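As noted above, each RRI update touches a single vector u_t or v_t and costs essentially one matrix-vector product. The following NumPy sketch reflects our reading of one RRI sweep — the closed-form nonnegative update of one vector against the residue — and is illustrative only; function names are ours:

```python
import numpy as np

def rri_sweep(A, U, V):
    """One sweep of Rank-one Residue Iteration (illustrative sketch).

    For each rank-one term u_t v_t^T, the residue of all the other terms
    is formed and v_t, then u_t, are set to the closed-form solution of
    their nonnegative least-squares subproblem. Each vector update needs
    only one matrix-vector product; the residue is recomputed from
    scratch here for clarity, while practical code maintains it
    incrementally.
    """
    for t in range(U.shape[1]):
        # Residue with the t-th rank-one term removed.
        Rt = A - U @ V.T + np.outer(U[:, t], V[:, t])
        u = U[:, t]
        if u @ u > 0:
            V[:, t] = np.maximum(Rt.T @ u, 0.0) / (u @ u)
        v = V[:, t]
        if v @ v > 0:
            U[:, t] = np.maximum(Rt @ v, 0.0) / (v @ v)
    return U, V
```

Because every vector update solves its block subproblem exactly, the approximation error is nonincreasing from sweep to sweep, which matches the monotone behavior observed in Figures 1.5 and 1.6.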
In the third experiment, we apply the variants of the RRI algorithm mentioned in Section 5 to the face database. The following settings are compared:

Original: original faces from the database.

49NMF: standard factorization (nonnegative vectors), r = 49.

100Binary: columns of V are limited to scaled binary vectors, r = 100.

49Sparse10: columns of U are sparse; not more than 10% of the elements of each column of U are positive, r = 49.

49Sparse20: columns of U are sparse; not more than 20% of the elements of each column of U are positive, r = 49.

49HSparse60: columns of U are sparse; the Hoyer sparsity of each column of U is 0.6, r = 49.

49HSparse70: columns of U are sparse; the Hoyer sparsity of each column of U is 0.7, r = 49.

49HBSparse60: columns of U are sparse with Hoyer sparsity 0.6; columns of V are scaled binary, r = 49.

49HBSparse70: columns of U are sparse with Hoyer sparsity 0.7; columns of V are scaled binary, r = 49.

For each setting, we use the RRI algorithm to compute the corresponding factorization. Some randomly selected faces reconstructed under these settings are shown in Figure 1.8. For each setting, the RRI algorithm produces a different set of bases to approximate the original faces. When the columns of V are constrained to scaled binary vectors (100Binary), the factorization can be rewritten as UV^T = ÛB^T, where B is a binary matrix. This implies that each image is reconstructed by just the presence or absence of the 100 bases shown in Figure 1.9. Figures 1.10 and 1.11 show nonnegative bases obtained by imposing sparsity on the columns of U. The sparsity can easily be controlled by the percentage of positive elements or by the Hoyer sparsity measure. Figure 1.12 combines the sparsity of the bases (columns of U) with the binary representation of V.
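The Hoyer sparsity used in the 49HSparse settings is the sparseness measure of [17]: it equals 0 for a vector with all entries equal and 1 for a vector with a single nonzero entry. A minimal implementation for a nonzero vector:

```python
import numpy as np

def hoyer_sparsity(x):
    """Sparseness measure of Hoyer [17] for a nonzero vector x:
    (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1), which lies in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
```

For example, hoyer_sparsity([0, 0, 5, 0]) is 1 and hoyer_sparsity([1, 1, 1, 1]) is 0; the 49HSparse60 setting constrains each column of U to the value 0.6.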
The sparsity is measured by the Hoyer measure, as in Figure 1.11. Using only the absence or presence of these 49 features, faces are approximated as shown in the last two rows of Figure 1.8.

Figure 1.8. Nonnegative matrix factorization with several sparse settings

Figure 1.9. Bases from the 100Binary setting

Figure 1.10. Sparse bases 49Sparse20 and 49Sparse10. The maximal percentage of positive elements is 20% (a) and 10% (b).

Figure 1.11. Hoyer sparse bases 49HSparse60 and 49HSparse70. The sparsity of the bases is 0.6 (a) and 0.7 (b).

Figure 1.12. Hoyer sparse bases 49HBSparse60 and 49HBSparse70. The sparsity of the bases is 0.6 (a) and 0.7 (b); V is a binary matrix.

The above examples show how the variants of the RRI algorithm can be used to control the sparsity of the bases. One can see that the sparser the bases are, the less storage is needed for the approximation. Moreover, this provides a part-based decomposition using local features of the faces.

Smooth approximation

We carry out this experiment to test the new smoothness constraint introduced in the previous section:

    (1/2) ||R_i - u_i v^T||_F^2 + (δ/2) ||v - B v̂_i||_F^2,    δ > 0,

where B is defined in (1.29). We generate the data using four smooth nonnegative functions f_1, f_2, f_3 and f_4, shown in Figure 1.13, where each function is represented as a nonnegative vector of size 200. We then generate a matrix A containing 100 mixtures of these functions as follows:

    A = max(F E^T + N, 0),

Figure 1.13.
Smooth functions

Figure 1.14. Randomly selected generated data

where F = [f_1 f_2 f_3 f_4], E is a random nonnegative matrix and N is normally distributed random noise with ||N||_F = 0.2 ||F E^T||_F. Four randomly selected columns of A are plotted in Figure 1.14.

Figure 1.15. Original functions vs. reconstructed functions ((a) original functions; (b) δ = 0; (c) δ = 10; (d) δ = 100)

We run the regularized RRI algorithm to enforce the smoothness of the columns of U. In each run, we apply the same value of δ to all the columns of U: δ = 0, 10, 100. The results are presented in Figure 1.15. We see that without regularization, i.e. δ = 0, the noise remains present in the approximation, which produces nonsmooth solutions. When the regularization is increased, i.e. δ = 10, 100, the reconstructed functions become smoother and the shape of the original functions is well preserved.

This smoothing technique can be used in applications like that of [27], where smooth spectral reflectance data from space objects is unmixed. There, the multiplicative rules are modified by adding two-norm regularization on the factors U and V to enforce smoothness. Since this is a different approach, a comparison should be carried out.

We have described a new method for nonnegative matrix factorization that has good and fast convergence. Moreover, it is very flexible: variants are easy to derive and additional constraints easy to impose. The numerical experiments show that this method and its derived variants behave very well on different types of data.
This gives enough motivation to extend the method to other types of data and applications in the future. In the last two chapters of this thesis, it is applied to weighted cost functions and to symmetric factorizations.

7. Conclusion

This paper focuses on descent methods for Nonnegative Matrix Factorization, which are characterized by nonincreasing updates at each iteration.

We also present the Rank-one Residue Iteration algorithm for computing an approximate Nonnegative Matrix Factorization. It uses recursive nonnegative rank-one approximations of a residual matrix that is not necessarily nonnegative. This algorithm requires no parameter tuning, has nice properties and typically converges quite fast. It also has many potential extensions.

During the revision of this report, we were informed that essentially the same algorithm was published in an independent contribution [8] and was also mentioned later in an independent personal communication [12].

Acknowledgments

This paper presents research results of the Concerted Research Action (ARC) "Large Graphs and Networks" of the French Community of Belgium and of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors. Ngoc-Diep Ho is a FRIA fellow.

References

[1] R. Albright, J. Cox, D. Duling, A.N. Langville, and C.D. Meyer. Algorithms, initializations, and convergence for the nonnegative matrix factorization. Preprint, 2006.

[2] B.W. Bader and T.G. Kolda. Efficient MATLAB computations with sparse and factored tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, NM and Livermore, CA, Dec. 2006.

[3] M.W.
Berry, M. Browne, A.N. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155–173, 2007.

[4] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Mass., 1999.

[5] M. Biggs, A. Ghodsi and S. Vavasis. Nonnegative matrix factorization via rank-one downdate. University of Waterloo, Preprint, 2007.

[6] R. Bro and S. De Jong. A fast non-negativity constrained least squares algorithm. Journal of Chemometrics, 11(5):393–401, 1997.

[7] M. Catral, L. Han, M. Neumann, and R.J. Plemmons. On reduced rank nonnegative matrix factorization for symmetric nonnegative matrices. Linear Algebra and Its Applications, 393:107–126, 2004.

[8] A. Cichocki, R. Zdunek, and S. Amari. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In Proceedings of Independent Component Analysis, ICA 2007, London, UK, September 9–12, 2007, Lecture Notes in Computer Science, Springer, 4666:169–176, 2007.

[9] A. Cichocki, R. Zdunek, and S. Amari. Nonnegative matrix and tensor factorization. IEEE Signal Processing Magazine, 25:142–145, 2008.

[10] A. Cichocki and R. Zdunek. NMFLAB for Signal Processing, available at http://www.bsp.brain.riken.jp/ICALAB/nmflab.html.

[11] C. Ding, T. Li, and M.I. Jordan. Convex and semi-nonnegative matrix factorizations. Technical Report, LBNL Tech Report 60428, 2006.

[12] N. Gillis and F. Glineur. Nonnegative matrix factorization and underapproximation. Preprint, 2007.

[13] G. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. The Johns Hopkins Univ. Press, Baltimore, 1996.

[14] N.J. Higham, M.J.C. Gover and S. Barnett. Matrix nearness problems and applications. Applications of Matrix Theory, 1–27, 1989.

[15] N.-D. Ho, P. Van Dooren, and V.D. Blondel.
Descent algorithms for nonnegative matrix factorization. Technical Report 2007-57, CESAME, Université catholique de Louvain, Belgium, 2007.

[16] N.-D. Ho. Nonnegative Matrix Factorization - Algorithms and Applications. PhD Thesis, Université catholique de Louvain, Belgium, 2008.

[17] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

[18] T.G. Kolda and D.P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Transactions on Information Systems (TOIS), 16(4):322–346, 1998.

[19] C.L. Lawson and R.J. Hanson. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ, 1974.

[20] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[21] C.-J. Lin. On the convergence of multiplicative update algorithms for non-negative matrix factorization. IEEE Transactions on Neural Networks, 2007. To appear.

[22] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 2007. To appear.

[23] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Technical Report ISSTECH-95-013, Department of Computer Science, National Taiwan University, 2005.

[24] M. Merritt and Y. Zhang. Interior-point gradient method for large-scale totally nonnegative least squares problems. Journal of Optimization Theory and Applications, 126(1):191–202, 2005.

[25] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(1):111–126, 1994.

[26] P. Paatero.
A weighted non-negative least squares algorithm for three-way 'parafac' factor analysis. Chemometrics and Intelligent Laboratory Systems, 38(2):223–242, 1997.

[27] V.P. Pauca, J. Piper and R.J. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 416(1):29–47, 2006.