Non-Negative Matrix Factorization, Convexity and Isometry*

Nikolaos Vasiloglou, Alexander G. Gray, David V. Anderson†

*Supported by Google Grants. †Georgia Institute of Technology.

Abstract

In this paper we explore avenues for improving the reliability of dimensionality reduction methods such as Non-Negative Matrix Factorization (NMF) as interpretive exploratory data analysis tools. We first explore the difficulties of the optimization problem underlying NMF, showing for the first time that non-trivial NMF solutions always exist and that the optimization problem is actually convex, by using the theory of Completely Positive Factorization. We subsequently explore four novel approaches to finding globally-optimal NMF solutions using various ideas from convex optimization. We then develop a new method, isometric NMF (isoNMF), which preserves non-negativity while also providing an isometric embedding, simultaneously achieving two properties which are helpful for interpretation. Though it results in a more difficult optimization problem, we show experimentally that the resulting method is scalable and even achieves more compact spectra than standard NMF.

1 Introduction.

In this paper we explore avenues for improving the reliability of dimensionality reduction methods such as Non-Negative Matrix Factorization (NMF) [21] as interpretive exploratory data analysis tools, to make them reliable enough for, say, making scientific conclusions from astronomical data. NMF is a dimensionality reduction method of much recent interest which can, for some common kinds of data, sometimes yield results which are more meaningful than those returned by the classical method of Principal Component Analysis (PCA), for example (though it will not in general yield better dimensionality reduction than PCA, as we'll illustrate later). For data of significant interest such as images (pixel intensities), text (presence/absence of words), or astronomical spectra (magnitude in various frequencies), where the data values are non-negative, NMF can produce components which can themselves be interpreted as objects of the same type as the data, which are added together to produce the observed data. In other words, the components are more likely to be sensible images or documents or spectra. This makes it a potentially very useful interpretive data mining tool for such data.

A second important interpretive usage of dimensionality reduction methods is the plot of the data points in the low-dimensional space obtained (2-D or 3-D, generally). Multidimensional scaling methods and recent nonlinear manifold learning methods focus on this usage, typically enforcing that the distances between the points in the original high-dimensional space are preserved in the low-dimensional space (isometry constraints). Then, apparent relationships in the low-D plot (indicating for example cluster structure or outliers) correspond to actual relationships. A plot of the points using components found by standard NMF methods will in general produce misleading results in this regard, as existing methods do not enforce such a constraint.
Another major reason that NMF might not yield reliable interpretive results is that current optimization methods [18, 22] might not find the actual optimum, leading to poor performance in terms of both of the above interpretive usages. This is because its objective function is not convex, and so unconstrained optimizers are used. Thus, obtaining a reliably interpretable NMF method requires understanding its optimization problem more deeply, especially if we are going to create an additionally difficult optimization problem by adding isometry constraints.

1.1 Paper Organization.

In Section 2 we first study at a fundamental level the optimization problem of standard NMF. We relate for the first time the NMF problem to the theory of Completely Positive Factorization; using that theory, we show that every non-negative matrix has a non-trivial exact non-negative matrix factorization of the form $V = WH$, a basic fact which had not been shown until now. Using this theory we also show that a convex formulation of the NMF optimization problem exists, though a practical solution method for this formulation does not yet exist. We then explore four novel formulations of the NMF optimization problem toward achieving a global optimum: convex relaxation using the positive semidefinite cone, approximating the semidefinite cone with smaller ones, convex multi-objective optimization, and generalized geometric programming. We highlight the difficulties encountered by each approach.

In order to turn to the question of how to create a new isometric NMF, in Section 3 we give background on two recent successful manifold learning methods, Maximum Variance Unfolding (MVU) [33] and a new variant of it, Maximum Furthest Neighbor Unfolding (MFNU) [25]. It has been shown experimentally [34] that they can recover the intrinsic dimension of a dataset very reliably and effectively, compared to previous well-known methods such as ISOMAP [31], Laplacian Eigen-Maps [2] and Diffusion Maps [6]. In synthetic experiments the above methods manage to decompose data into meaningful dimensions. For example, for a data set consisting of images of a statue photographed from different horizontal and vertical angles, MVU and MFNU find two dimensions that can be identified as the horizontal and the vertical camera angle. MVU and MFNU contain ideas, particularly concerning the formulation of their optimization problems, upon which isometric NMF will be based.

In Section 4 we show a practical algorithm for an isometric NMF (isoNMF for short), representing a new data mining method capable of producing both interpretable components and interpretable plots simultaneously. We use ideas for efficient optimization and efficient neighborhood computation to obtain a practical, scalable method.

In Section 5 we demonstrate the utility of isoNMF in experiments with datasets used in previous papers. We show that the components it finds are comparable to those found by standard NMF, while it additionally preserves distances much better, and also results in more compact spectra.

2 Convexity in Non-Negative Matrix Factorization.

Given a non-negative matrix $V \in \Re^{N \times m}_+$, the goal of NMF is to decompose it into two matrices $W \in \Re^{N \times k}_+$, $H \in \Re^{k \times m}_+$ such that $V = WH$. Such a factorization always exists for $k \ge m$: the factorization has the trivial solution $W = V$ and $H = I_m$.
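As a quick illustration of this setup, consider the following minimal sketch (ours, not from the paper; the matrix and shapes are hypothetical):

```python
import numpy as np

# A small non-negative data matrix V (N = 2 points, m = 3 features).
V = np.array([[0.5, 1.0, 0.2],
              [0.3, 0.7, 0.9]])

# Trivial exact factorization with k = m: W = V, H = I_m.
W, H = V.copy(), np.eye(3)
assert np.allclose(V, W @ H)               # exact reconstruction
assert (W >= 0).all() and (H >= 0).all()   # both factors non-negative
```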
Determining the minimum $k$ is a difficult problem, and no algorithm exists for finding it. In general we can show that NMF can be cast as a Completely Positive (CP) Factorization problem [3].

Definition 2.1. A matrix $A \in \Re^{N \times N}_+$ is Completely Positive if it can be factored in the form $A = BB^T$, where $B \in \Re^{N \times k}_+$. The minimum $k$ for which $A = BB^T$ holds is called the cp-rank of $A$.

Not all matrices admit a completely positive factorization, even if they are positive definite and non-negative. Notice, though, that for every positive definite non-negative matrix a Cholesky factorization always exists, but there is no guarantee that the Cholesky factors are non-negative too. Up to now there is no algorithm of polynomial complexity that can decide if a given positive matrix is CP. A simple observation shows that $A$ has to be positive definite, but this is a necessary and not a sufficient condition.

Theorem 2.1. If $A \in \Re^{N \times N}_+$ is CP, then $\mathrm{rank}(A) \le \text{cp-rank}(A) \le \frac{N(N+1)}{2} - 1$.

The proof can be found in [3], p. 156. It is also conjectured that the upper bound can be tightened to $\frac{N^2}{4}$.

Theorem 2.2. If $A \in \Re^{N \times N}_+$ is diagonally dominant (i.e., $a_{ii} \ge \sum_{j \ne i} |a_{ij}|$), then it is also CP.

The proof of the theorem can be found in [17]. Although a CP factorization ($A = BB^T$) doesn't exist for every matrix, we prove that a non-trivial NMF ($A = WH$) always exists.

Theorem 2.3. Every non-negative matrix $V \in \Re^{N \times m}_+$ has a non-trivial, non-negative factorization of the form $V = WH$.

Proof. Consider the following matrix:

$$Z = \begin{bmatrix} D & V \\ V^T & E \end{bmatrix} \quad (2.1)$$

We want to prove that there always exists $B \in \Re^{(N+m) \times k}_+$ such that $Z = BB^T$. If this is true, then $B$ can take the form:

$$B = \begin{bmatrix} W \\ H^T \end{bmatrix} \quad (2.2)$$

Notice that if $D$ and $E$ are arbitrary diagonally dominant completely positive matrices, then $B$ always exists. The simplest choice is to take them as diagonal matrices where each element is greater than or equal to the corresponding row/column sum of $V$. Then $Z$ is diagonally dominant and, by Theorem 2.2, always CP. Since $Z$ is CP, $B$ exists, and so do $W$ and $H$.

Although Theorem 2.2 also provides an algorithm for constructing the CP factorization, the cp-rank is usually high. A corollary of Theorems 2.1 (cp-rank$(A) \ge \mathrm{rank}(A)$) and 2.3 (existence of NMF) is that SVD always has a more compact spectrum than NMF. There is no algorithm known yet for computing an exact NMF, despite its existence. In practice, scientists try to minimize the norm of the factorization error [12, 22]:

$$\min_{W,H} \|V - WH\|^2 \quad (2.3)$$

This is the objective function we use in the experiments for this paper.
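The construction in the proof of Theorem 2.3 is easy to check numerically. The sketch below (ours, not code from the paper) builds $Z$ with diagonal $D$ and $E$ set to the row and column sums of $V$, and verifies that $Z$ is diagonally dominant, hence CP by Theorem 2.2:

```python
import numpy as np

V = np.array([[0.5, 1.0, 0.2],
              [0.3, 0.7, 0.9]])
N, m = V.shape

# D_ii >= sum of row i of V, E_jj >= sum of column j of V.
D = np.diag(V.sum(axis=1))
E = np.diag(V.sum(axis=0))

# Z = [[D, V], [V^T, E]] as in eq. (2.1).
Z = np.block([[D, V], [V.T, E]])

# Diagonal dominance: z_ii >= sum_{j != i} |z_ij| for every row.
off_diag = np.abs(Z).sum(axis=1) - np.abs(np.diag(Z))
assert (np.diag(Z) >= off_diag - 1e-12).all()
```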
2.1 Solving the optimization problem of NMF.

Although in the current literature it is widely believed that NMF is a non-convex problem and only local minima can be found, we will show in the following subsections that a convex formulation does exist. Despite the existence of the convex formulation, we also show that a formulation of the problem as a generalized geometric program, which is non-convex, could give a better approach for finding the global optimum.

2.1.1 NMF as a convex conic program.

Theorem 2.4. The set of Completely Positive matrices $\mathcal{K}_{CP}$ is a convex cone.

Proof. See [3], p. 71.

It is always desirable to find the minimum rank of NMF, since we are looking for the most compact representation of the data matrix $V$. Finding the minimum-rank NMF can be cast as the following optimization problem:

$$\min_{W, H} \ \mathrm{rank}\begin{bmatrix} W & V \\ V^T & H \end{bmatrix} \quad (2.4)$$
subject to:
$$W \in \mathcal{K}_{CP}, \quad H \in \mathcal{K}_{CP} \quad (2.5)$$

Since minimizing the rank is non-convex, we can use its convex envelope, which according to [28] is the trace of the matrix. So a convex relaxation of the above problem is:

$$\min_{W, H} \ \mathrm{Trace}\begin{bmatrix} W & V \\ V^T & H \end{bmatrix} \quad (2.6)$$
subject to:
$$W \in \mathcal{K}_{CP}, \quad H \in \mathcal{K}_{CP} \quad (2.7\text{-}2.8)$$

After determining the optimal diagonal blocks of (2.6), the NMF factors $W$ and $H$ can be recovered by CP factorization of those blocks, which again is not an easy problem. In fact, there is no practical barrier function known yet for the CP cone, so Interior Point Methods cannot be employed; finding a practical description of the CP cone is an open problem. So although the problem is convex, there is no algorithm known for solving it.

2.2 Convex relaxations of the NMF problem.

In the following subsections we investigate convex relaxations of the NMF problem with the Positive Semidefinite Cone [26].

2.2.1 A simple convex upper bound with Singular Value Decomposition.

Singular Value Decomposition (SVD) can decompose a matrix into two factors $U, V$:

$$A = UV \quad (2.9)$$

Unfortunately, the sign of the SVD components of $A \ge 0$ cannot be guaranteed to be non-negative, except for the first eigenvector [24]. However, if we project $U, V$ onto the non-negative orthant ($U, V \ge 0$) we get a very good estimate (upper bound) for NMF. We will call it clipped SVD (CSVD). CSVD was used as a benchmark for the relaxations that follow. It has also been used as an initializer for NMF algorithms [20].
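A minimal CSVD sketch in numpy (our illustration of the idea; the paper gives no code). For a rank-$k$ estimate we truncate the SVD, fold the singular values into the two factors, and clip negative entries to zero:

```python
import numpy as np

def clipped_svd(V, k):
    """Rank-k clipped-SVD estimate of NMF factors (upper bound on the error)."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    # Fold singular values into both factors, then project onto the
    # non-negative orthant.
    W = np.maximum(U[:, :k] * np.sqrt(s[:k]), 0.0)
    H = np.maximum(np.sqrt(s[:k])[:, None] * Vt[:k], 0.0)
    return W, H

V = np.abs(np.random.rand(20, 10))
W, H = clipped_svd(V, k=3)
print(np.linalg.norm(V - W @ H))  # reconstruction error of the CSVD estimate
```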
2.2.2 Relaxation with a positive semidefinite cone.

In the minimization problem of eq. (2.3), where the cost function is the $L_2$ norm, the nonlinear terms $w_{il}h_{lj}$ appear. A typical way to get these terms [26] is to generate a large vector $z = [W'(:); H(:)]$, where we use the MATLAB notation ($H(:)$ is the column-wise unfolding of a matrix). If $Z = zz^T$ (so $\mathrm{rank}(Z) = 1$) and $z > 0$, then the terms appearing in $\|V - WH\|^2$ are linear in $Z$. In the following example, eqs. (2.10) and (2.11), where $V \in \Re^{2 \times 3}$, $W \in \Re^{2 \times 2}$, $H \in \Re^{2 \times 3}$, we show the structure of $Z$. The terms in bold are the ones we need to express the constraint $V = WH$, e.g. $v_{11} = w_{11}h_{11} + w_{12}h_{21}$:

$$z = [w_{11}\ w_{12}\ w_{21}\ w_{22}\ h_{11}\ h_{21}\ h_{12}\ h_{22}\ h_{13}\ h_{23}]^T \quad (2.10)$$

$$Z = zz^T = \begin{bmatrix}
w_{11}^2 & w_{11}w_{12} & \cdots & \mathbf{w_{11}h_{11}} & \cdots & w_{11}h_{23} \\
w_{12}w_{11} & w_{12}^2 & \cdots & w_{12}h_{11} & \cdots & w_{12}h_{23} \\
\vdots & & \ddots & & & \vdots \\
h_{23}w_{11} & h_{23}w_{12} & \cdots & h_{23}h_{11} & \cdots & h_{23}^2
\end{bmatrix} \quad (2.11)$$

(The full $10 \times 10$ matrix contains every pairwise product of the entries of $z$.) Now the optimization problem of eq. (2.3) is equivalent to:

$$\min \ \sum_{i=1}^{N}\sum_{j=1}^{m}\Big(\sum_{l=1}^{k} Z_{(i-1)k+l,\ Nk+(j-1)k+l} - V_{ij}\Big)^2 \quad (2.12)$$
subject to: $\mathrm{rank}(Z) = 1$

This is not a convex problem, but it can easily be relaxed to [7]:

$$\min \ \mathrm{Trace}(Z) \quad (2.13)$$
subject to:
$$A_{ij} \bullet Z = V_{ij}, \quad Z \succeq 0, \quad Z \succeq zz^T, \quad Z \ge 0$$

where $A_{ij}$ is a matrix that selects the appropriate elements of $Z$. For example, the matrix $A_{13}$ that selects the elements of $Z$ that should sum to the $V_{13}$ element (namely $w_{11}h_{13}$ and $w_{12}h_{23}$, at positions $(1,9)$ and $(2,10)$ in eq. (2.11)) is zero everywhere except for ones at those entries. $\quad (2.14)$

In the second formulation (2.13) we have relaxed $Z = zz^T$ to $Z \succeq zz^T$. The objective function tries to minimize the rank of the matrix, while the constraints try to match the values of the given matrix $V$. After solving the optimization problem, the solution can be read off the first eigenvector of $Z$. The quality of the relaxation depends on the ratio of the first eigenvalue to the sum of the rest. The positivity of $Z$ guarantees that the first eigenvector has elements of the same sign, according to the Perron-Frobenius theorem [24]. Ideally, if the rest of the eigenvectors are positive, they can also be included.

One of the problems of this method is the complexity: $Z$ is $(N+m)k \times (N+m)k$, and there are $\frac{(N+m)k((N+m)k - 1)}{2}$ non-negativity constraints. Very quickly the problem becomes unsolvable. In practice, the problem as posed in (2.12) always gives $W$ and $H$ matrices that are rank one. After testing the method exhaustively with random matrices $V$, whether or not they had an exact product $V = WH$ representation, the solution was always rank one in both $W$ and $H$. This was the case with all of the convex formulations presented in this paper. This is because a constraint is missing that would let the energy of the dot products spread among dimensions, something that should characterize the spectrum of $H$. The $H$ matrix is often interpreted as the basis vectors of the factorization, and $W$ as the matrix holding the coefficients. It is widely known that in nature, spectral analysis gives spectra that decay either exponentially ($e^{-\lambda f}$) or more slowly ($1/f^\gamma$). Depending on the problem we can try different spectral functions; in our experiments we chose the exponential one. Of course the decay parameter $\lambda$ has to be set ad hoc. We experimented with several values of $\lambda$, but we could not come up with a systematic, heuristic and practical rule: in some cases the reconstruction error was low, but in others it was not. Another relaxation that was necessary for making the optimization tractable was to restrict the non-negativity constraints to only those elements involved in the equality constraints.
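The lifting in eqs. (2.10)-(2.12) is easy to sanity-check numerically. The sketch below (ours, with MATLAB-style column-major unfolding as the assumed ordering) builds $z = [W'(:); H(:)]$, forms $Z = zz^T$, and confirms that every $V_{ij}$ is a sum of $k$ entries of $Z$:

```python
import numpy as np

N, k, m = 2, 2, 3
W = np.abs(np.random.rand(N, k))
H = np.abs(np.random.rand(k, m))
V = W @ H

# z = [W'(:); H(:)] with MATLAB column-major unfolding:
# W'(:) lists W row by row, H(:) lists H column by column.
z = np.concatenate([W.reshape(-1), H.T.reshape(-1)])
Z = np.outer(z, z)  # the rank-one lifting Z = z z^T

# Each V_ij is a sum of k entries of Z, i.e. linear in Z,
# as in eq. (2.12) (0-based indices here).
for i in range(N):
    for j in range(m):
        s = sum(Z[i * k + l, N * k + j * k + l] for l in range(k))
        assert np.isclose(s, V[i, j])
```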
2.2.3 Approximating the SDP cone with smaller ones.

A different way to deal with the computational complexity of the SDP (eq. 2.13) is to approximate the big $(N+m)k \times (N+m)k$ SDP cone with smaller ones. Let $W_i$ be the $i$th row of $W$ and $H_j$ the $j$th column of $H$. Now $z_{ij} = [W_i(:)'; H_j(:)]$ (a $2k$-dimensional vector) and $Z_{ij} = z_{ij}z_{ij}^T$ (a $2k \times 2k$ matrix), or

$$Z_{ij} = \begin{bmatrix} W_i^T W_i & W_i^T H_j^T \\ H_j W_i & H_j H_j^T \end{bmatrix} \quad (2.15)$$

or, better, think of it in the form:

$$Z_{ij} = \begin{bmatrix} \mathbf{W}_i & Z_{WH} \\ Z_{WH}^T & \mathbf{H}_j \end{bmatrix} \quad (2.16)$$

Once $\mathbf{W}_i, \mathbf{H}_j$ are found, $W_i, H_j$ can be recovered from the SVD decompositions of $\mathbf{W}_i, \mathbf{H}_j$, and the quality of the relaxation is judged by the magnitude of the first eigenvalue compared to the sum of the others. Now the optimization problem becomes:

$$\min \ \sum_{i=1}^{N}\sum_{j=1}^{m} \mathrm{Trace}(Z_{ij}) \quad (2.17)$$
subject to:
$$Z_{ij} \ge 0, \quad Z_{ij} \succeq 0, \quad A_{ij} \bullet Z_{ij} = v_{ij}, \ \forall i, j$$

The above method has $Nm$ constraints. In terms of storage it needs (see the sketch after this list):
- $(N+m)$ symmetric positive definite $k \times k$ matrices, one for every row/column of $W, H$, which is $(N+m)\frac{k(k+1)}{2}$;
- $Nm$ symmetric positive definite $k \times k$ matrices, one for every $W_iH_j$ product, which is $Nm\frac{k(k+1)}{2}$.

In total the storage complexity is $O((N+m+Nm)\frac{k(k+1)}{2})$, which is smaller by an order of magnitude than the $O(\frac{(N+m)k((N+m)k-1)}{2})$ of the previous method. There is also a significant improvement in the computational part. The SDP problem is solved with interior point methods [26], which require the inversion of a symmetric positive definite matrix at some point. In the previous method that would require $O((N+m)^3k^3)$ steps, while with this method we have to invert $Nm$ matrices of size $2k \times 2k$, which would cost $Nm(2k)^3$. Because of their special structure the actual cost is $(Nm)k^3 + \max(N,m)k^3 = (Nm + \max(N,m))k^3$. We know that $\mathbf{W}_i, \mathbf{H}_j \succeq 0$. Since $Z_{ij}$ is PSD, by the Schur complement on eq. (2.16):

$$\mathbf{H}_j - Z_{WH}^T \mathbf{W}_i^{-1} Z_{WH} \succeq 0 \quad (2.18)$$

So instead of inverting the full $2k \times 2k$ matrix of (2.16), which would cost $8k^3$, we can invert the $k \times k$ blocks of (2.18). This formulation gives similar results to the big SDP cone, and in most cases the results are comparable to the CSVD.
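To get a feel for the savings, here is a quick back-of-the-envelope computation of the two storage counts stated above (our sketch; the sizes are hypothetical):

```python
# Number of stored scalars for the big-cone relaxation vs. the
# per-entry small cones, using the counts stated above.
N, m, k = 1000, 500, 10

big = ((N + m) * k) * ((N + m) * k - 1) // 2   # one (N+m)k x (N+m)k cone
small = (N + m + N * m) * k * (k + 1) // 2     # (N+m) + Nm small k x k cones

print(f"big cone: {big:.3e} entries, small cones: {small:.3e} entries")
print(big / small)  # the small-cone formulation is smaller by this factor
```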
2.2.4 NMF as a convex multi-objective problem.

A different approach would be to find a convex set in which the solution of the NMF lives, and search for it there. Assume that we want to match $V_{ij} = W_iH_j = \sum_{l=1}^{k} W_{il}H_{lj}$. In this section we show that by controlling the ratio of the $L_2/L_1$ norms of $W, H$ it is possible to find the solution to NMF. Define $W_{il}H_{lj} = V_{ij,l}$ and $\sum_{l=1}^{k} V_{ij,l} = V_{ij}$. We form the following matrix that we require to be PSD:

$$\begin{bmatrix} 1 & W_{il} & H_{lj} \\ W_{il} & t_{il} & V_{ij,l} \\ H_{lj} & V_{ij,l} & t_{lj} \end{bmatrix} \succeq 0 \quad (2.19)$$

If we use the Schur complement we have:

$$\begin{bmatrix} t_{il} - W_{il}^2 & V_{ij,l} - W_{il}H_{lj} \\ V_{ij,l} - W_{il}H_{lj} & t_{lj} - H_{lj}^2 \end{bmatrix} \succeq 0 \quad (2.20)$$

An immediate consequence is that

$$t_{il} \ge W_{il}^2 \quad (2.21)$$
$$t_{lj} \ge H_{lj}^2 \quad (2.22)$$
$$(t_{il} - W_{il}^2)(t_{lj} - H_{lj}^2) \ge (V_{ij,l} - W_{il}H_{lj})^2 \quad (2.23)$$

From the last inequality we see that the $L_2$ error $\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{l=1}^{k}(V_{ij,l} - W_{il}H_{lj})^2$ becomes zero if $t_{il} = W_{il}^2$ and $t_{lj} = H_{lj}^2$ for all $t_{il}, t_{lj}$. In general we want to minimize the $t$ variables while maximizing $\|W\|_2$ and $\|H\|_2$. $L_2$ norm maximization is not convex, but instead we can maximize $\sum W_{il}, \sum H_{lj}$, which are equal to the $L_1$ norms since everything is positive. This can be cast as a convex multi-objective problem (also known as vector optimization) on the second order cone [4]:

$$\min \begin{bmatrix} \sum_{i=1}^{N}\sum_{l=1}^{k} t_{il} + \sum_{j=1}^{m}\sum_{l=1}^{k} t_{lj} \\ -\sum_{i=1}^{N}\sum_{l=1}^{k} W_{il} - \sum_{j=1}^{m}\sum_{l=1}^{k} H_{lj} \end{bmatrix} \quad (2.24)$$
subject to:
$$\begin{bmatrix} t_{il} - W_{il}^2 & V_{ij,l} - W_{il}H_{lj} \\ V_{ij,l} - W_{il}H_{lj} & t_{lj} - H_{lj}^2 \end{bmatrix} \succeq 0$$

Unfortunately, multi-objective optimization problems, even when they are convex, have local minima that are not global. An interesting direction would be to test the robustness of existing multi-objective algorithms on NMF.

2.2.5 Local solution of the non-convex problem.

In the previous sections we gave several convex formulations and relaxations of the NMF problem that unfortunately are either unsolvable or give trivial rank-one solutions that are not useful at all. In practice, the non-convex formulation of eq. (2.3) (the classic NMF objective), along with others like the KL distance between $V$ and $WH$, is what is actually used [22]. All of them are non-convex, and several methods have been recommended, such as alternating least squares, gradient descent, or active set methods [18]. In our experiments we used the L-BFGS method, which scales very well for large matrices; a minimal sketch of this local solver is given below.
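A minimal sketch of the local L-BFGS approach, using scipy's L-BFGS-B as a stand-in for the solver used in the paper (ours; the bound constraints enforce non-negativity):

```python
import numpy as np
from scipy.optimize import minimize

def nmf_lbfgs(V, k, seed=0):
    """Local minimization of ||V - WH||^2 (eq. 2.3) with L-BFGS-B."""
    N, m = V.shape
    rng = np.random.default_rng(seed)
    x0 = rng.random(N * k + k * m)

    def f_and_grad(x):
        W = x[:N * k].reshape(N, k)
        H = x[N * k:].reshape(k, m)
        R = W @ H - V                      # residual
        gW = 2 * R @ H.T                   # d/dW ||R||^2
        gH = 2 * W.T @ R                   # d/dH ||R||^2
        return (R ** 2).sum(), np.concatenate([gW.ravel(), gH.ravel()])

    res = minimize(f_and_grad, x0, jac=True, method="L-BFGS-B",
                   bounds=[(0, None)] * x0.size)  # non-negativity bounds
    return res.x[:N * k].reshape(N, k), res.x[N * k:].reshape(k, m)
```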
2.2.6 NMF as a Generalized Geometric Program and its Global Optimum.

The objective function (eq. 2.3) can be written in the following form:

$$\|V - WH\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{m}\Big(V_{ij} - \sum_{l=1}^{k} W_{il}H_{lj}\Big)^2 \quad (2.25)$$

The above function is twice differentiable, so according to [11] it can be cast as a difference of convex (d.c.) functions, and the problem can be solved with general off-the-shelf global optimization algorithms. It can also be formulated as a special case of d.c. programming, generalized geometric programming. With the transformation $W_{il} = e^{\tilde{w}_{il}}$, $H_{lj} = e^{\tilde{h}_{lj}}$ the objective becomes:

$$\|V - WH\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{m}\Big(V_{ij} - \sum_{l=1}^{k} e^{\tilde{w}_{il} + \tilde{h}_{lj}}\Big)^2 \quad (2.26)$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{m} V_{ij}^2 + \sum_{i=1}^{N}\sum_{j=1}^{m}\Big(\sum_{l=1}^{k} e^{\tilde{w}_{il} + \tilde{h}_{lj}}\Big)^2 - 2\sum_{i=1}^{N}\sum_{j=1}^{m} V_{ij}\sum_{l=1}^{k} e^{\tilde{w}_{il} + \tilde{h}_{lj}}$$

The first term is constant and can be ignored for the optimization. The other two terms,

$$f(\tilde{w}_{il}, \tilde{h}_{lj}) = \sum_{i=1}^{N}\sum_{j=1}^{m}\Big(\sum_{l=1}^{k} e^{\tilde{w}_{il} + \tilde{h}_{lj}}\Big)^2 \quad (2.27)$$
$$g(\tilde{w}_{il}, \tilde{h}_{lj}) = 2\sum_{i=1}^{N}\sum_{j=1}^{m} V_{ij}\sum_{l=1}^{k} e^{\tilde{w}_{il} + \tilde{h}_{lj}} \quad (2.28)$$

are convex functions, also known as the exponential form of posynomials [4] (a posynomial is a product of positive variables, each raised to an arbitrary real power). For the global solution of the problem

$$\min_{\tilde{W}, \tilde{H}} \ f(\tilde{W}, \tilde{H}) - g(\tilde{W}, \tilde{H}) \quad (2.29)$$

the algorithm proposed in [8] can be employed. The above algorithm uses a branch and bound scheme that is impractical for high-dimensional optimization problems, as it requires too many iterations to converge. It is worthwhile, though, to compare it with the local non-convex NMF solver on a small matrix. We tried an NMF of order 2 on the following random matrix:

$$\begin{bmatrix} 0.45 & 0.434 & 0.35 \\ 0.70 & 0.64 & 0.43 \\ 0.22 & 0.01 & 0.3 \end{bmatrix}$$

After 10000 restarts of the local solver the best error we got was 2.7%, while the global optimizer very quickly gave 0.015% error, which is 2 orders of magnitude less than the local optimizer. Another direction that is not investigated in this paper is the recently developed algorithm for Difference Convex problems by Tao [30], which has been applied successfully to other data mining applications such as Multidimensional Scaling [1].

3 Isometric Embedding

The key concept in Manifold Learning (ML) is to represent a dataset in a lower-dimensional space by preserving the local distances. The difference between methods like Isomap [31], Maximum Variance Unfolding (MVU) [33], Laplacian EigenMaps [2] and Diffusion Maps [6] is in how they treat distances between points that are not in the local neighborhood. For example, IsoMap preserves exactly the geodesic distances, while Diffusion Maps preserve distances based on the diffusion kernel. Maximum Furthest Neighbor Unfolding (MFNU) [25], a variant of Maximum Variance Unfolding, preserves local distances while trying to maximize the distance between furthest neighbors. In this section we present the MFNU method, as it will be the basis for building isoNMF.

3.1 Convex Maximum Furthest Neighbor Unfolding.

Weinberger formulated the problem of isometric unfolding as a Semidefinite Programming algorithm [33] (a similar approach to learning metrics is given in [29]). In [19] Kulis presented a non-convex formulation of the problem that requires less memory than the semidefinite one, and also claimed that the non-convex formulation is scalable. The non-convex formulation has the same global optimum as the semidefinite one, as proven in [5]. In [25] Vasiloglou presented experiments verifying the scalability of this formulation. A variant of MVU, Maximum Furthest Neighbor Unfolding (MFNU), was also presented in the same paper. The latter formulation tends to be more robust and scalable than MVU, which is why we employ it as the basis of isoNMF. Both methods can be cast as a semidefinite programming problem [32].

Given a set of data $X \in \Re^{N \times d}$, where $N$ is the number of points and $d$ is the dimensionality, the dot product or Gram matrix is defined as $G = XX^T$. The goal is to find a new Gram matrix $K$ such that $\mathrm{rank}(K) < \mathrm{rank}(G)$; in other words $K = \hat{X}\hat{X}^T$, where $\hat{X} \in \Re^{N \times d'}$ and $d' < d$. Now the dataset is represented by $\hat{X}$, which has fewer dimensions than $X$. The requirement of isometric unfolding is that the Euclidean distances in $\Re^{d'}$, for a given neighborhood around every point, have to be the same as in $\Re^{d}$. This is expressed in:

$$K_{ii} + K_{jj} - K_{ij} - K_{ji} = G_{ii} + G_{jj} - G_{ij} - G_{ji}, \quad \forall i, j \in I_i$$

where $I_i$ is the set of the indices of the neighbors of the $i$th point.
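The right-hand sides of these constraints depend only on the input Gram matrix. A small sketch of how they would be assembled (ours; the k-nearest neighborhoods are computed naively here):

```python
import numpy as np

def isometry_targets(X, n_neighbors=3):
    """Distance targets d_ij = G_ii + G_jj - G_ij - G_ji for each
    point's nearest neighbors (these are squared Euclidean distances)."""
    G = X @ X.T
    # Squared pairwise distances recovered from the Gram matrix.
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    targets = {}
    for i in range(len(X)):
        order = np.argsort(sq[i])          # neighbors of i, nearest first
        for j in order[1:n_neighbors + 1]:  # skip i itself at distance 0
            targets[(i, int(j))] = sq[i, j]
    return targets
```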
From all the admissible $K$ matrices, MFNU chooses the one that maximizes the distances between furthest-neighbor pairs. So the algorithm is presented as an SDP:

$$\max_K \ \sum_{i=1}^{N} B_i \bullet K$$
subject to:
$$A_{ij} \bullet K = d_{ij}, \ \forall j \in I_i$$
$$K \succeq 0$$
$$\sum_{i,j} K_{ij} = 0$$

where $A \bullet X = \mathrm{Trace}(AX^T)$ is the dot product between matrices. $A_{ij}$ is zero everywhere except for $+1$ at entries $(i,i)$ and $(j,j)$ and $-1$ at entries $(i,j)$ and $(j,i)$ (3.30), and

$$d_{ij} = G_{ii} + G_{jj} - G_{ij} - G_{ji} \quad (3.31)$$

$B_i$ has the same structure as $A_{ij}$ and computes the distance of the $i$th point to its furthest neighbor. The last condition is just a centering constraint for the covariance matrix. The new dimensions $\hat{X}$ are the eigenvectors of $K$. In general, MFNU gives Gram matrices that have a compact spectrum, at least more compact than traditional linear Principal Component Analysis (PCA), and the method performs comparably to MVU. Both MVU and MFNU are convex, so they converge to the global optimum. Unfortunately, this method cannot handle datasets of more than a few hundred points because of its complexity. In the following section a non-convex formulation of the problem that scales better is presented.

3.2 The Non-Convex Maximum Furthest Neighbor Unfolding.

By replacing the constraint $K \succeq 0$ with an explicit rank constraint $K = RR^T$ [5], the problem becomes non-convex and is reformulated to:

$$\max \ \sum_{i=1}^{N} B_i \bullet RR^T \quad (3.32)$$
subject to:
$$A_{ij} \bullet RR^T = d_{ij}$$

In [5], Burer proved that the above formulation has the same global optimum as the convex one. The above problem can be solved with the augmented Lagrangian method [27]:

$$L = -\sum_{i=1}^{N} B_i \bullet RR^T - \sum_{i=1}^{N}\sum_{\forall j \in I_i} \lambda_{ij}(A_{ij} \bullet RR^T - d_{ij}) + \frac{\sigma}{2}\sum_{i=1}^{N}\sum_{\forall j \in I_i} (A_{ij} \bullet RR^T - d_{ij})^2 \quad (3.33)$$

Our goal is to minimize the Lagrangian; that is why the objective term enters as $-\sum_i B_i \bullet RR^T$ and not $+\sum_i B_i \bullet RR^T$. The derivative of the augmented Lagrangian is:

$$\frac{\partial L}{\partial R} = -2\sum_{i=1}^{N} B_i R - 2\sum_{i=1}^{N}\sum_{\forall j \in I_i} \lambda_{ij} A_{ij} R + 2\sigma \sum_{i=1}^{N}\sum_{\forall j \in I_i} (A_{ij} \bullet RR^T - d_{ij}) A_{ij} R \quad (3.34)$$

Gradient descent is a possible way to minimize the Lagrangian, but it is rather slow. The Newton method is also prohibitive; the Hessian of this problem is a sparse matrix, and although the cost of the inversion might be high, it is worth investigating. In our experiments we used the limited-memory BFGS (L-BFGS) method [23, 27], which is known to give a good rate of convergence. MFNU in this non-convex formulation behaves much better than MVU: in the experiments presented in [25], MFNU tends to find the global optimum more often than MVU. The experiments also showed that the method scales well, up to 100K points.
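A compact numpy sketch of the augmented Lagrangian of eqs. (3.33)-(3.34) (our illustration; `pairs` maps neighbor pairs (i, j) to their targets d_ij, and `furthest` lists each point's furthest-neighbor pair, both hypothetical inputs):

```python
import numpy as np

def mfnu_lagrangian(R, pairs, furthest, lam, sigma):
    """Augmented Lagrangian L of eq. (3.33) and its gradient, eq. (3.34).

    pairs:    dict {(i, j): d_ij} of local isometry constraints
    furthest: list of (i, j) furthest-neighbor pairs (the B_i terms)
    lam:      dict {(i, j): lambda_ij} of Lagrange multipliers
    """
    L, grad = 0.0, np.zeros_like(R)

    # Objective: maximize furthest-neighbor distances, so subtract them.
    for i, j in furthest:
        diff = R[i] - R[j]
        L -= diff @ diff                  # B_i . R R^T = ||R_i - R_j||^2
        grad[i] -= 2 * diff
        grad[j] += 2 * diff

    # Constraint violations c_ij = A_ij . R R^T - d_ij, with multiplier
    # and quadratic penalty terms.
    for (i, j), d in pairs.items():
        diff = R[i] - R[j]
        c = diff @ diff - d
        L += -lam[(i, j)] * c + 0.5 * sigma * c * c
        coef = 2 * (-lam[(i, j)] + sigma * c)
        grad[i] += coef * diff
        grad[j] -= coef * diff
    return L, grad
```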
4 Isometric NMF.

NMF and MFNU are both optimization problems. The goal of isoNMF is to combine them into a single optimization problem. MFNU has a convex and a non-convex formulation, while for NMF only a non-convex formulation that can be solved is known.

4.1 Convex isoNMF.

By using the theory presented in Section 2.1.1 we can cast isoNMF as a convex problem:

$$\max_{\tilde{W}, \tilde{H}} \ \sum_{i=1}^{N} B_i \bullet Z \quad (4.35)$$
subject to:
$$A_{ij} \bullet \tilde{W} = d_{ij}$$
$$Z = \begin{bmatrix} \tilde{W} & V \\ V^T & \tilde{H} \end{bmatrix}$$
$$Z \in \mathcal{K}_{CP}, \quad \tilde{W} \in \mathcal{K}_{CP}, \quad \tilde{H} \in \mathcal{K}_{CP}$$

Then $W, H$ can be found by the completely positive factorizations $\tilde{W} = WW^T$ and $\tilde{H} = HH^T$. Again, although this problem is convex, there is no polynomial algorithm known for solving it.

4.2 Non-convex formulation of isoNMF.

The non-convex isoNMF can be cast as the following problem:

$$\max \ \sum_{i=1}^{N} B_i \bullet WW^T \quad (4.36)$$
subject to:
$$A_{ij} \bullet WW^T = d_{ij}$$
$$WH = V$$
$$W \ge 0, \quad H \ge 0$$

The augmented Lagrangian with quadratic penalty function is the following:

$$L = -\sum_{i=1}^{N} B_i \bullet WW^T - \sum_{i=1}^{N}\sum_{\forall j \in I_i}\lambda_{ij}(A_{ij} \bullet WW^T - d_{ij}) - \sum_{i=1}^{N}\sum_{j=1}^{m}\mu_{ij}\Big(\sum_{l=1}^{k} W_{il}H_{lj} - V_{ij}\Big) + \frac{\sigma_1}{2}\sum_{i=1}^{N}\sum_{\forall j \in I_i}(A_{ij} \bullet WW^T - d_{ij})^2 + \frac{\sigma_2}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\Big(\sum_{l=1}^{k} W_{il}H_{lj} - V_{ij}\Big)^2 \quad (4.37)$$

The non-negativity constraints are missing from the Lagrangian because we can enforce them through the limited-bound BFGS, also known as L-BFGS-B. The derivatives of the augmented Lagrangian are:

$$\frac{\partial L}{\partial W} = -2\sum_{i=1}^{N} B_i W - 2\sum_{i=1}^{N}\sum_{\forall j \in I_i}\lambda_{ij}A_{ij}W - \sum_{i=1}^{N}\sum_{j=1}^{m}\mu_{ij}H + 2\sigma_1\sum_{i=1}^{N}\sum_{\forall j \in I_i}(A_{ij} \bullet WW^T - d_{ij})A_{ij}W + 2\sigma_2\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{l=1}^{k}(W_{il}H_{lj} - V_{ij})H \quad (4.38)$$

$$\frac{\partial L}{\partial H} = -\sum_{i=1}^{N}\sum_{j=1}^{m}\mu_{ij}W + 2\sigma_2\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{l=1}^{k}(W_{il}H_{lj} - V_{ij})W \quad (4.39)$$

4.3 Computing the local neighborhoods.

As already discussed in a previous section, MFNU and isoNMF require the computation of all-nearest and all-furthest neighbors. The all-nearest-neighbor problem is a special case of a more general class of problems called N-body problems [10]. In the following sections we give a short description of the nearest neighbor computation. The actual algorithm is a four-way recursion; more details can be found in [10].

4.3.1 Kd-tree.

The kd-tree is a hierarchical partitioning structure for fast nearest neighbor search [9]. Every node is recursively partitioned into two nodes until the points contained are fewer than a fixed number; such a node is a leaf. Nearest neighbor search is based on a top-down recursion until the query point finds the closest leaf. When the recursion hits a leaf, it searches locally for a candidate nearest neighbor. At this point we have an upper bound on the nearest neighbor distance, meaning that the true neighbor will be at most as far away as the candidate. As the recursion backtracks, it eliminates (prunes) nodes that are further away than the candidate neighbor. Kd-trees provide nearest neighbor search in O(log N) time on average, although for pathological cases the kd-tree performance can asymptotically degrade to the linear complexity of the naive method. A short single-tree usage sketch follows.
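A minimal single-tree usage sketch with scipy's kd-tree (our stand-in; the paper uses its own dual-tree implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(1000, 3)   # N points in d dimensions
tree = cKDTree(X)

# All-nearest-neighbors: query every point against the tree; k=2 because
# each point's nearest hit is itself at distance 0.
dist, idx = tree.query(X, k=2)
nearest = idx[:, 1]           # index of each point's nearest neighbor
```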
4.3.2 The Dual-Tree Algorithm for nearest neighbor computation.

In the single-tree algorithm the reference points are organized on a kd-tree. Every nearest neighbor computation requires O(log N) computations; since there are N query points, the total cost is O(N log N). The dual-tree algorithm [10] orders the query points on a tree too. If the query set and the reference set are the same, then they can share the same tree. Instead of querying a single point at a time, the dual-tree algorithm always queries a group of points that live in the same node: instead of doing the top-down recursion individually for every point, it does it for the whole group at once. Moreover, instead of computing distances between points and nodes, it computes distances between nodes. This is the reason why, most of the time, the dual-tree algorithm can prune larger portions of the tree than the single-tree algorithm. The complexity of the dual-tree algorithm is empirically O(N). If the data set is pathological, then the algorithm can be of quadratic complexity too. The pseudo-code for the algorithm is described in fig. 1.

```
recurse(q: KdTree, r: KdTree) {
  /* distance(q, r): minimum distance between the two nodes */
  if (max_nearest_neighbor_distance_in_node(q) < distance(q, r)) {
    /* prune: r is further away than every current candidate in q */
  } else if (IsLeaf(q) == true and IsLeaf(r) == true) {
    /* search for every point in the q node its nearest neighbor in the
       r node; at leaves we must resort to exhaustive O(n^2) search */
    /* update max_nearest_neighbor_distance_in_node(q) */
  } else if (IsLeaf(q) == false and IsLeaf(r) == true) {
    /* recurse first on the child of q that is closer to r */
    recurse(closest(r, q.left, q.right), r)
    recurse(furthest(r, q.left, q.right), r)
  } else if (IsLeaf(q) == true and IsLeaf(r) == false) {
    /* recurse first on the child of r that is closer to q */
    recurse(q, closest(q, r.left, r.right))
    recurse(q, furthest(q, r.left, r.right))
  } else {
    recurse(q.left,  closest(q.left,  r.left, r.right))
    recurse(q.left,  furthest(q.left, r.left, r.right))
    recurse(q.right, closest(q.right, r.left, r.right))
    recurse(q.right, furthest(q.right, r.left, r.right))
  }
}
```

Figure 1: Pseudo-code for the dual-tree all-nearest-neighbor algorithm

4.3.3 The Dual-Tree Algorithm for the all-furthest-neighbor problem.

Computing the furthest neighbor with the naive computation is also of quadratic complexity, and the use of trees can speed up the computation too. It turns out that furthest neighbor search for a single query point is very similar to the nearest neighbor search presented in the original kd-tree paper [9]. The only difference is that in the top-down recursion the algorithm always chooses the furthest node; similarly, in the bottom-up recursion we prune a node only if the maximum distance between the point and the node is smaller than the current furthest distance. The pseudo-code is presented in fig. 2.

```
recurse(q: KdTree, r: KdTree) {
  /* distance(q, r): maximum distance between the two nodes */
  if (distance(q, r) < min_furthest_neighbor_distance_in_node(q)) {
    /* prune: r cannot contain a further neighbor for any point in q */
  } else if (IsLeaf(q) == true and IsLeaf(r) == true) {
    /* search for every point in the q node its furthest neighbor in the
       r node; at leaves we must resort to exhaustive O(n^2) search */
    /* update min_furthest_neighbor_distance_in_node(q) */
  } else if (IsLeaf(q) == false and IsLeaf(r) == true) {
    /* recurse first on the child of q that is furthest from r */
    recurse(furthest(r, q.left, q.right), r)
    recurse(closest(r, q.left, q.right), r)
  } else if (IsLeaf(q) == true and IsLeaf(r) == false) {
    /* recurse first on the child of r that is furthest from q */
    recurse(q, furthest(q, r.left, r.right))
    recurse(q, closest(q, r.left, r.right))
  } else {
    recurse(q.left,  furthest(q.left,  r.left, r.right))
    recurse(q.left,  closest(q.left,  r.left, r.right))
    recurse(q.right, furthest(q.right, r.left, r.right))
    recurse(q.right, closest(q.right, r.left, r.right))
  }
}
```

Figure 2: Pseudo-code for the dual-tree all-furthest-neighbor algorithm
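For testing a tree implementation, the quadratic naive computation mentioned above makes a handy reference (our sketch):

```python
import numpy as np

def all_furthest_neighbors_naive(X):
    """O(N^2) reference implementation of the all-furthest-neighbor problem."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise distances
    return sq.argmax(axis=1)  # index of each point's furthest neighbor
```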
5 Experimental Results

In order to evaluate and compare the performance of isoNMF with traditional NMF, we picked 3 benchmark datasets that have been tested in the literature:

1. The CBCL faces database, fig. 3(a,b) [13], used in the experiments of the original paper on NMF [21]. It consists of 2429 grayscale 19x19 images that are hand-aligned. The dataset was normalized as in [21].
2. The isomap statue dataset, fig. 3(c) [14], which consists of 698 synthetic 64x64 face images photographed from different angles. The data was downsampled to 32x32 with the Matlab imresize function (bicubic interpolation).
3. The ORL faces [15], fig. 3(d), presented in [12]. The set consists of 472 19x19 grayscale images that are not aligned.

Figure 3: (a) Some images from the CBCL face database. (b) The same images after variance normalization, mean set to 0.25, and thresholding to the interval [0,1]. (c) The synthetic statue dataset from the isomap website [14]. (d) 472 images from the ORL faces database [15].

For visualization of the results we used the nmfpack code available on the web [16]. The results for classic NMF and isoNMF with a k-neighborhood equal to 3 are presented in fig. 4 and Tables 1 and 2. We observe that classic NMF always gives lower reconstruction error rates, though not far below those of isoNMF. Classic NMF fails to preserve distances, contrary to isoNMF, which always does a good job of preserving them. Another observation is that isoNMF gives sparser solutions than classic NMF. The only case where NMF has a large advantage in reconstruction error is the CBCL face database with preprocessing; this is mainly because the preprocessing distorts the images and spoils the manifold structure. Without the preprocessing, fig. 4(f), the reconstruction errors of NMF and isoNMF are almost the same. We would also like to point out that isoNMF scales as well as classic NMF, and the two seem to show the same sensitivity to initial conditions.

| classic NMF | cbcl norm. | cbcl | statue | orl |
|---|---|---|---|---|
| rec. error | 22.01% | 9.20% | 13.62% | 8.46% |
| sparsity | 63.23% | 29.06% | 48.36% | 46.80% |
| dist. error | 92.10% | 98.61% | 97.30% | 90.79% |

Table 1: Classic NMF: the relative root mean square error, sparsity, and distance error for the four different datasets (CBCL normalized and plain, statue, and ORL).

| isoNMF | cbcl norm. | cbcl | statue | orl |
|---|---|---|---|---|
| rec. error | 33.34% | 10.16% | 16.81% | 11.77% |
| sparsity | 77.69% | 43.98% | 53.84% | 54.86% |
| dist. error | 4.19% | 3.07% | 0.03% | 0.01% |

Table 2: isoNMF: the relative root mean square error, sparsity, and distance error for the four different datasets (CBCL normalized and plain, statue, and ORL).

In fig. 6 we see a comparison of the energy spectra of classic NMF and isoNMF. We define the spectrum as

$$s_i = \frac{\sum_{l=1}^{N} W_{li}^2}{\sqrt{\sum_{l=1}^{M} H_{il}^2}}$$

This represents the energy of the component, normalized by the energy of the prototype image generated by NMF/isoNMF. Although the results show that isoNMF is much more compact than NMF, this is not a fully fair metric, because the prototypes (the rows of the $H$ matrix) are not orthogonal to each other. So in reality $\sum_{i=1}^{k} s_i < \sum_{i=1}^{N}\sum_{j=1}^{m}(WH)_{ij}^2$, and actually much smaller, because the dot products between the rows are not zero.
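A sketch of how such numbers would be computed (ours; the exact sparsity definition used by nmfpack is not restated in the paper, so only the reconstruction error and the spectrum of fig. 6 are shown, and the peak-normalization for plotting is our assumption):

```python
import numpy as np

def relative_rms_error(V, W, H):
    """Relative root mean square reconstruction error, as in Tables 1-2."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

def spectrum(W, H):
    """Per-component energy s_i as defined above, normalized to peak 1
    for plotting (cf. the 'normalized energy' axis of fig. 6)."""
    s = (W ** 2).sum(axis=0) / np.sqrt((H ** 2).sum(axis=1))
    return s / s.max()
```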
6 Conclusion

In this paper we presented a deep study of the optimization problem of NMF, showing some fundamental existence theorems for the first time, as well as various advanced optimization approaches, convex and non-convex, global and local. We believe that this study can open doors for further advances in NMF-like methods as well as other machine learning problems. We also developed and experimentally demonstrated a new method, isoNMF, which preserves both non-negativity and isometry, simultaneously providing two types of interpretability. With the added reliability and scalability stemming from an effective optimization algorithm, we believe that this method represents a potentially valuable practical new tool for the exploratory analysis of common data such as images, text, and spectra.

Figure 4: Top row: 49 classic NMF prototype images. Bottom row: 49 isoNMF prototype images. (a,e) CBCL face database with mean-variance normalization and thresholding; (b,f) CBCL face database without preprocessing; (c,g) statue dataset; (d,h) ORL face database.

Figure 5: Scatter plots of the two largest components of classic NMF (in blue) and Isometric NMF (in red) for (a) CBCL faces, (b) isomap faces, (c) ORL faces.

Figure 6: The spectrum of classic NMF (solid line) and Isometric NMF (dashed line) for the three datasets (a) CBCL faces, (b) isomap statue, (c) ORL faces; x-axis: NMF dimension, y-axis: normalized energy. Although isoNMF gives a much more compact spectrum, we have to point out that the basis functions are not orthogonal, so this figure is not comparable to SVD-type spectra.

References

[1] L.T.H. An and P.D. Tao. DC Programming Approach and Solution Algorithm to the Multidimensional Scaling Problem. From Local to Global Optimization, pages 231-276, 2001.
[2] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, 2003.
[3] A. Berman and N. Shaked-Monderer. Completely Positive Matrices. World Scientific, 2003.
[4] S.P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.
[6] R.R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.
[7] M. Fazel, H. Hindi, and S.P. Boyd. A rank minimization heuristic with application to minimum order system approximation. Proceedings of the American Control Conference, 6, 2001.
[8] C.A. Floudas. Deterministic Global Optimization: Theory, Methods and Applications. Kluwer Academic Publishers, 2000.
[9] J.H. Friedman, J.L. Bentley, and R.A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209-226, 1977.
[10] A. Gray and A.W. Moore. N-Body problems in statistical learning. Advances in Neural Information Processing Systems, 13, 2001.
[11] R. Horst and H. Tuy. Global Optimization: Deterministic Approaches. Springer, 1996.
[12] P.O. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. The Journal of Machine Learning Research, 5:1457-1469, 2004.
[13] http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html.
[14] http://isomap.stanford.edu/face_data.mat.Z.
[15] http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
[16] http://www.cs.helsinki.fi/u/phoyer/software.html.
[17] M. Kaykobad. On nonnegative factorization of matrices. Linear Algebra and its Applications, 96:27-33, 1987.
[18] H. Kim and H. Park. Non-Negative Matrix Factorization Based on Alternating Non-Negativity Constrained Least Squares and Active Set Method.
[19] B. Kulis, A.C. Surendran, and J.C. Platt. Fast Low-Rank Semidefinite Programming for Embedding and Clustering. In Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007, 2007.
[20] A.N. Langville, C.D. Meyer, and R. Albright. Initializations for the nonnegative matrix factorization. Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[21] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.
[22] D.D. Lee and H.S. Seung. Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems, pages 556-562, 2001.
[23] D.C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503-528, 1989.
[24] H. Minc. Nonnegative matrices. Wiley, New York, 1988.
[25] N. Vasiloglou, A. Gray, and D. Anderson. Scalable Semidefinite Manifold Learning. IEEE Machine Learning in Signal Processing, 2008.
[26] A. Nemirovski. Lectures on Modern Convex Optimization.
[27] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 1999.
[28] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. Arxiv preprint arXiv:0706.4138, 2007.
[29] R. Rosales and G. Fung. Learning sparse metrics via linear programming. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367-373. ACM New York, NY, USA, 2006.
[30] P.D. Tao and L.T.H. An. Difference of convex functions optimization algorithms (DCA) for globally minimizing nonconvex quadratic forms on Euclidean balls and spheres. Operations Research Letters, 19(5):207-216, 1996.
[31] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction, 2000.
[32] L. Vandenberghe and S. Boyd. Semidefinite Programming. SIAM Review, 38(1):49-95, 1996.
[33] K.Q. Weinberger and L.K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 2006.
[34] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the twenty-first international conference on Machine learning. ACM New York, NY, USA, 2004.