Robust Large Scale Non-negative Matrix Factorization using Proximal Point Algorithm


Authors: Jason Gejie Liu, Shuchin Aeron

Department of Electrical and Computer Engineering, Tufts University, Medford, MA 02155
Gejie.Liu@tufts.edu, shuchin@ece.tufts.edu

Abstract

A robust algorithm for non-negative matrix factorization (NMF) is presented in this paper with the purpose of dealing with large-scale data, where the separability assumption is satisfied. In particular, we modify the Linear Programming (LP) algorithm of [9] by introducing a reduced set of constraints for exact NMF. In contrast to the previous approaches, the proposed algorithm does not require knowledge of the factorization rank (extreme rays [3] or topics [7]). Furthermore, motivated by a similar problem arising in the context of metabolic network analysis [13], we consider an entirely different regime where the number of extreme rays or topics can be much larger than the dimension of the data vectors. The performance of the algorithm on different synthetic data sets is reported.

I. INTRODUCTION

Matrix factorization has numerous applications to real-world problems, where the data matrices representing the numerical observations are often huge and hard to analyze. Factorizing them into lower-rank forms can reveal the inherent structure and features, which helps in the meaningful interpretation of the data. In a wide range of natural signals, such as pixel intensities, amplitude spectra, and occurrence counts, negative values are usually physically meaningless. In order to deal with this non-negativity constraint, Non-Negative Matrix Factorization (NMF) was introduced. NMF was first proposed in [1] and used by Lee and Seung for parts-based data representation [2]. It is well known that NMF may not be unique. In this context, a sufficient condition for the uniqueness of NMF was pointed out in [3].
A geometrical interpretation [4] of this condition amounts to the fact that the extreme rays (topics) generating the cone (in the non-negative orthant) are contained in the data. Thus, for NMF, one only needs to identify these extreme rays. Additionally, it was demonstrated in [5] that under such a separability assumption, one can use Linear Programming (LP) to isolate the extreme rays from the non-extreme rays. In this paper we focus only on such cases. Bittorf et al. [9] presented an LP-based NMF algorithm named Hottopixx. Kumar et al. [4], instead, presented a fast conical hull algorithm to deal with large-scale NMF based on its polyhedral structure; it was shown to perform much faster than Hottopixx. However, both of these algorithms require the factorization rank, i.e. the number of extreme rays, as a necessary input. Some applications grant this as prior knowledge, but from the point of view of robustness, the preference goes to an NMF algorithm that does not need it. Gillis and Luce [10] reformulated the algorithm in [9] to detect the number of extreme rays automatically. Nevertheless, a limitation remains: the number of constraints in the LP is enormous in the face of large-scale data. Alternatively, motivated by Theorem 5.4 in [5], we propose a simpler modification of the constraints to alleviate these issues. In particular, we reduce the number of constraints and do not require the number of extreme rays to be known. This allows us to use a proximal point-based algorithm [12] to solve the LP efficiently for large data sets. In addition, we also consider an entirely different regime from the NMF applications in the literature so far, in which the data lies in a high-dimensional space with a comparatively much smaller factorization rank (number of extreme rays). Specifically, we look at the case when the number of extreme rays is much larger than the dimensionality of the space.
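To make the separable regime concrete, here is a small NumPy sketch (not from the paper; the sizes and values are illustrative): the extreme rays themselves appear among the data columns, so X factors exactly as X = X_I W, where I indexes those columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance: r = 3 extreme rays in R^4_+.
F = rng.uniform(0.1, 1.0, size=(4, 3))

# The remaining columns are non-negative combinations of the extreme rays,
# so the whole dataset lies in the cone generated by the columns of F.
H = rng.uniform(0.0, 1.0, size=(3, 5))
X = np.hstack([F, F @ H])          # separable: F itself appears in X

# Algebraically, X = X_I W with I = {0, 1, 2} (the first r columns here).
I = [0, 1, 2]
W = np.hstack([np.eye(3), H])
assert np.allclose(X, X[:, I] @ W)
```

In real data the index set I is of course unknown and unordered; recovering it is exactly the computational problem addressed below.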
This regime is motivated by computational issues arising in the Double Description (DD) method [6] for the analysis of metabolic networks, where the Elementary Flux Modes (EFMs) are found as the set of extreme rays of a polyhedral cone [13]. An NMF-like problem arises as an intermediate step of the DD method. In this context, we believe that computational advances in NMF can help address the computational issues in the DD method [6].

The organization of the paper is as follows. Section II provides a brief review of NMF from the geometric perspective as well as the Hottopixx algorithm. Section III explains the proposed proximal point algorithm with the reformulated LP constraints. The experimental results are presented in Section IV and the paper concludes in Section V.

Notation: Matrices are denoted by boldface capital letters and vectors by boldface small letters. In addition, we use the MATLAB notation diag both to transform a vector into a diagonal matrix and to extract the diagonal from the matrix in the argument. We also use the MATLAB "," and ";" operators for matrix concatenations.

II. NON-NEGATIVE MATRIX FACTORIZATION

For the non-noisy case, given a data matrix X = [x_1, x_2, ..., x_n] ∈ R_+^{m×n}, NMF aims to find two non-negative matrices F ∈ R_+^{m×r} and W ∈ R_+^{r×n} such that X = FW. For an approximate NMF, instead, the aim is to solve the following optimization problem:

    min_{F, W ≥ 0} ||X − FW||_2^2    (1)

A. Geometry of the NMF Problem

The factorization X = FW implies that all the columns of X can be represented as non-negative combinations of the columns {f_i}_{i=1}^r of the matrix F. The algebraic characterization can be described as below.

Definition 1.
The simplicial cone generated by the columns {f_i}_{i=1}^r is given by

    Γ = Γ_F = { x : x = Σ_i α_i f_i, α_i ≥ 0 }    (2)

The factorization X = FW means geometrically that the x_i, i = 1, 2, ..., n all lie in or on the surface of the simplicial cone generated by the {f_i}_{i=1}^r, as depicted in Fig. 1.

Fig. 1. Geometry of the NMF problem. Separability implies that the data is contained in a cone generated by a subset of r extreme rays (indicated by purple squares).

With this viewpoint in mind, we make three assumptions:

• Assumption 1: The extreme rays are simplicial: no extreme ray is a convex combination of the other extreme rays. This is also shown to be necessary and sufficient for exact recovery in topic modeling [8].
• Assumption 2: The dataset, consisting of all columns of X, resides in or on the surface of a cone generated by these extreme rays of X [3].
• Assumption 3: Assuming that the columns of X are normalized to unity, there are no duplicate columns in X.

The above three assumptions will be collectively referred to as the separability assumption in the following: the entire dataset, i.e. all columns of X, resides in or on the surface of a cone generated by a small subset of r columns of X, the vectors in this subset are simplicial, and there are no duplicate columns in X after column normalization. In algebraic terms, X = FW = X_I W for some subset I ⊆ {1, 2, ..., n} of r columns (the extreme rays) of X, where X_I denotes the matrix built from the columns of X indexed by I. This means that the r columns of F are hidden among the columns of X (I is unknown) [5]. Equivalently, it implies that the corresponding subset of r rows of W constitutes the r × n weight matrix. Therefore, the computational challenge is to identify the extreme rays efficiently. In this context, we first outline the LP-based Hottopixx algorithm from [9].

B. Hottopixx

Bittorf et al.
[9] proposed an NMF algorithm under the separability assumption based on the following LP problem:

    min_{C ∈ Φ_1(X)} p^T diag(C)    (3)

where p ∈ R^{n×1} is a random vector with distinct positive entries and C ∈ R_+^{n×n} is referred to as a factorization localizing matrix [9], which belongs to the following polyhedral set:

    Φ_1(X) = { C : XC = X, Trace(C) = r,
               C(i,i) ≤ 1 for all i,
               C(i,j) ≤ C(i,i) for all i, j,
               C ≥ 0 }    (4)

For a large-scale set-up they proposed an incremental gradient descent algorithm to solve the LP.

III. ROBUST NMF USING PROXIMAL POINT ALGORITHM

As explained before, two of the prominent shortcomings of the existing algorithms for the NMF problem are (i) dependence on knowledge of the number of extreme rays r, and (ii) an enormous number of constraints when dealing with a large data set. An approach in this direction was taken in [10]. However, the number of constraints in their reformulation is still immense for large data. In this paper we present a reformulation which drastically reduces the set of constraints.

A. LP Reformulation

Assuming that the columns of X are normalized to have unit L1 norm, our LP reformulation for NMF is given as

    min_{C ∈ Φ_2(X)} p^T diag(C)    (5)

where

    Φ_2(X) = { C : XC = X, C^T 1 = 1, C ≥ 0 }    (6)

and 1 ∈ R_+^{n×1} is the vector of all ones and p ∈ R_+^{n×1} is the same as the vector in (3).

Proposition 1. Suppose X admits a separable factorization FW. Let C be a minimizer of (5) and let I = { i : C_ii = 1 }; then F = X_I.

In order to prove the above proposition, we consider the Lagrangian of the optimization problem in (5),

    L(C, R, λ, M) = p^T diag(C) + Tr{ R^T (XC − X) } + λ^T (C^T 1 − 1) + Tr{ M^T C }    (7)

where R, λ and M are the Lagrange multipliers.
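Because the objective p^T diag(C) charges column j of C only through its diagonal entry C(j,j), the LP (5)-(6) decouples into one small LP per column. A minimal SciPy sketch of this reformulation (the instance below is synthetic and illustrative, not from the paper) might look as follows:

```python
import numpy as np
from scipy.optimize import linprog

def localizing_matrix(X, p):
    """Solve the LP (5)-(6): min p^T diag(C) s.t. XC = X, C^T 1 = 1, C >= 0.

    The problem decouples over the columns of C: column j contributes only
    p[j] * C[j, j] to the objective, so we solve one small LP per column.
    """
    m, n = X.shape
    A_eq = np.vstack([X, np.ones((1, n))])    # stack X c = x_j and 1^T c = 1
    C = np.zeros((n, n))
    for j in range(n):
        cost = np.zeros(n)
        cost[j] = p[j]                        # only the diagonal entry is penalized
        b_eq = np.append(X[:, j], 1.0)
        res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        C[:, j] = res.x
    return C

# Hypothetical separable instance: 2 extreme rays hidden among 4 columns.
rng = np.random.default_rng(1)
F = rng.uniform(0.1, 1.0, size=(3, 2))
X = np.hstack([F, F @ rng.uniform(0.1, 1.0, size=(2, 2))])
X /= X.sum(axis=0)                            # unit L1 column normalization
C = localizing_matrix(X, p=rng.uniform(1.0, 2.0, size=4))
extreme = [i for i in range(4) if np.isclose(C[i, i], 1.0)]
print(extreme)   # the planted extreme rays, columns 0 and 1
```

Note that the L1 normalization makes the constraint C^T 1 = 1 automatic for any non-negative representation, since ||x_j||_1 = Σ_i c_i ||x_i||_1 = Σ_i c_i; this is why Lemmas 1 and 2 below pin the diagonal to exactly 0 or 1.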
Then the dual of (5) is

    max_{R, λ, M}  − Tr{ R^T X } − λ^T 1
    s.t.  P + X^T R + 1 λ^T + M = 0,  M ≥ 0    (8)

where P = diag(p). The proof of the proposition follows from Lemma 1 and Lemma 2 below.

Lemma 1. If ℓ ∉ I, then C_ℓℓ = 0 for all C ∈ Φ_2(X).

Proof: For ℓ ∉ I, consider the LP problem

    min_{C ∈ Φ_2(X)} − e_ℓ^T diag(C)    (9)

where e_ℓ ∈ R_+^{n×1} denotes the vector with ℓ-th entry 1 and the rest 0. Setting P = −diag(e_ℓ) and using the constraint C ≥ 0, we can claim that −e_ℓ^T diag(C) ≤ 0. Under the separability assumption, there exists a collection of vectors {ρ_i} ⊂ R^{n×1} such that

    ρ_i^T x_i = u_i,
    ρ_i^T x_j ≤ −v_i for j ≠ i    (10)

with u_i = 0 and some v_i ≥ 0. A feasible solution to (8) is

    P = −diag(e_ℓ),  λ = 0 ∈ R^{n×1},
    R = [0, ..., ρ_i, ..., 0] for some i ∈ I,
    M = M_1 + M_2 :  M_1 = diag(e_ℓ),  M_2 = −X^T R ≥ 0    (11)

With this selection, the dual cost function equals 0. From weak duality [11] it follows that −e_ℓ^T diag(C) ≥ 0. Combined with −e_ℓ^T diag(C) ≤ 0, this implies e_ℓ^T diag(C) = 0 and C_ℓℓ = 0.

Lemma 2. If ℓ ∈ I, then C_ℓℓ = 1 for all C ∈ Φ_2(X).

Proof: For ℓ ∈ I, consider the LP problem

    min_{C ∈ Φ_2(X)} e_ℓ^T diag(C)    (12)

Note that the constraint C^T 1 = 1, together with C ≥ 0, implies C ≤ 1 entrywise, and therefore e_ℓ^T diag(C) ≤ 1. For the dual program a feasible solution can be found as

    P = diag(e_ℓ),
    λ^T = [0, 0, ..., −1, ..., 0] ∈ R^{1×n}, where the ℓ-th entry is −1,
    R = 0,  M = −1 λ^T − P    (13)

for which the dual cost function in (8) equals 1. Again using weak duality [11], it follows that e_ℓ^T diag(C) ≥ 1, and from e_ℓ^T diag(C) ≤ 1 we have e_ℓ^T diag(C) = 1 and C_ℓℓ = 1.

Proof of Proposition 1. Let C_0 denote the factorization localizing matrix which identifies the factorization with the lowest cost p^T diag(C) and has either ones or zeros on the diagonal. Then C_0 is the unique optimal solution of (5). To see this, let I denote the set of simplicial columns of minimum cost.
Since each column of X either belongs to I or does not, once X is given the diagonal of C is determined, and hence the lowest cost p^T diag(C) is determined, which is unique.

Remark. Note that W can be readily obtained from C as W = C(I, :) (in MATLAB notation).

B. Proximal Point Algorithm

Based on the above reformulation, a proximal point-based (Proximal Point) algorithm is employed in this paper. A necessary pre-processing step is column normalization, which ensures that the sum of each column of X equals one. After the normalization, we can rewrite the LP in (5) as

    min_{C ∈ Φ_3(X)} p^T diag(C)    (14)

where

    Φ_3(X) = { C : AC = A, C ≥ 0 }    (15)

with A = [X; 1^T]. We solve this LP using the proximal point algorithm [12], given in Algorithm 1; in our implementation, the {t_k} are set to a large constant, 100. A discussion of the convergence of the algorithm can be found in [12].

Algorithm 1: Robust NMF by the Proximal Point Algorithm
Input: a column-normalized matrix X ∈ R_+^{m×n}, stopping threshold ε.
Output: matrices F ∈ R_+^{m×r} and W ∈ R_+^{r×n} such that X = FW.
1: Initialize Q^0 = 0 and C^0 = 0; randomly generate p ∈ R_+^{n×1}.
2: Update C^{k+1}:
     C^{k+1} = argmin_C { p^T diag(C) + (1/t_k) || Q^k + t_k (AC − A) ||_2^2 }
             = (1/(2 t_k)) (A^T A)^{−1} ( 2 t_k A^T A − diag(p) − 2 A^T Q^k )
3: Project C^{k+1} onto the constraint C ≥ 0 via C^{k+1} = pos(C^{k+1}), where pos(·) keeps the positive elements and sets the negative elements to 0.
4: Update Q^{k+1}: Q^{k+1} = Q^k + t_k (A C^{k+1} − A).
5: Stop the iterations if || C^{k+1} − C^k ||_2 ≤ ε.
6: Let I = { i : C_ii = 1 }, set F = X_I and obtain W = C(I, :).

IV. EXPERIMENTAL RESULTS

All of the experiments were run on an identical configuration: a dual Xeon W3505 (2.53 GHz) machine with 6 GB RAM. The Proximal Point algorithm is implemented in MATLAB, version 2013a.
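The paper's implementation is in MATLAB; purely as a reading aid, Algorithm 1 might be sketched in NumPy as below. This is a minimal sketch under stated assumptions: `pinv` replaces the explicit inverse of A^T A (which is singular whenever n > m + 1), and the tolerance used to detect C_ii = 1 in step 6 is an illustrative choice, not the paper's.

```python
import numpy as np

def proximal_point_nmf(X, eps=1e-5, t=100.0, max_iter=2000, seed=0):
    """Sketch of Algorithm 1: robust NMF via a proximal point iteration.

    X is assumed column-normalized (each column sums to 1).
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    p = rng.uniform(1.0, 2.0, size=n)          # random distinct positive costs
    A = np.vstack([X, np.ones((1, n))])        # A = [X; 1^T]
    AtA = A.T @ A
    AtA_pinv = np.linalg.pinv(AtA)             # A^T A is singular when n > m + 1
    Q = np.zeros_like(A)
    C = np.zeros((n, n))
    for _ in range(max_iter):
        C_prev = C
        # Step 2: unconstrained minimizer of the proximal subproblem.
        C = AtA_pinv @ (2 * t * AtA - np.diag(p) - 2 * A.T @ Q) / (2 * t)
        C = np.maximum(C, 0.0)                 # Step 3: project onto C >= 0
        Q = Q + t * (A @ C - A)                # Step 4: multiplier update
        if np.linalg.norm(C - C_prev) <= eps:  # Step 5: stopping rule
            break
    I = [i for i in range(n) if np.isclose(C[i, i], 1.0, atol=1e-2)]
    return X[:, I], C[I, :]                    # F = X_I, W = C(I, :)
```

Each iteration costs only matrix products with the fixed n × n matrix A^T A, which is what makes the reduced constraint set Φ_3(X) attractive at scale compared with the O(n^2) inequality constraints of (4).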
TABLE I. Experiments on different data sets for regimes C1-C3.

    Data Set          # of Extreme Rays   Accuracy    ε
    100 × 75 (C1)     25                  25/25       10^-5
    500 × 375 (C1)    25                  23/25       10^-4
    1200 × 600 (C1)   300                 300/300     10^-4
    25 × 100 (C2)     15                  14/15       10^-5
    125 × 500 (C2)    75                  74/75       10^-4
    425 × 1200 (C2)   225                 223/225     10^-4
    25 × 100 (C3)     45                  45/45       10^-5
    125 × 500 (C3)    150                 150/150     10^-4
    425 × 1200 (C3)   625                 625/625     10^-4

A. Random Data Generation

To generate our instances, r independent extreme rays are first created in R_+^{m×1}, with element values in [0, 100]. The remaining columns are then generated as random non-negative combinations of r' of the extreme rays, where r' ∈ [2, r] is randomly selected for each of the n − r points. Column normalization is carried out afterwards. Three regimes of NMF problems are analyzed here:

• (C1) m ≥ n, m ≥ r, which is motivated by the data structure for face recognition [14];
• (C2) r ≤ m ≤ n, which is the scenario for the topic modeling problem [7];
• (C3) m ≤ r ≤ n, which applies to metabolic network data [13].

Furthermore, since the algorithm is insensitive to the order of the columns, the r extreme rays are placed at the beginning of each data set. Data sets of different sizes, from small to large-scale, are generated to check the effectiveness of the Proximal Point algorithm. In Table I, the last column indicates the loosest stopping criterion ε that achieves the listed accuracy. The experiments show that our algorithm handles all three regimes of data at different sizes, and the identification accuracy is satisfying.

B. Application to Image Processing

In this section, we apply the Proximal Point algorithm to a face image data set, namely the CBCL dataset [14]. The CBCL face dataset consists of 2429 gray-level face images of 19 × 19 pixels.
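The data-generation protocol of Section IV-A can be sketched as follows (a Python sketch of the procedure described above; the paper's own generator is in MATLAB, and the seed and mixing weights here are illustrative):

```python
import numpy as np

def make_separable(m, n, r, seed=0):
    """Generate a synthetic separable instance as in Section IV-A.

    r extreme rays are drawn in R^m_+ with entries in [0, 100]; each of the
    remaining n - r columns is a random non-negative combination of r' of
    the rays, with r' drawn uniformly from [2, r]. Columns are then
    L1-normalized, and the extreme rays are placed first.
    """
    rng = np.random.default_rng(seed)
    F = rng.uniform(0.0, 100.0, size=(m, r))
    cols = []
    for _ in range(n - r):
        r_prime = rng.integers(2, r + 1)               # how many rays to mix
        idx = rng.choice(r, size=r_prime, replace=False)
        w = rng.uniform(0.1, 1.0, size=r_prime)        # strictly positive weights
        cols.append(F[:, idx] @ w)
    X = np.column_stack([F] + cols)
    return X / X.sum(axis=0)                           # column normalization

X = make_separable(m=25, n=100, r=15, seed=3)   # a C2-regime instance
```

Placing the rays first is harmless because, as noted above, the algorithm is insensitive to column order; it simply makes the ground-truth index set I = {1, ..., r} easy to score against.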
We randomly choose 20 images from the dataset, vectorized, to serve as the generators, so the number of extreme rays in this case is r = 20. Through random non-negative combinations of these extreme rays, a 361 × 500 facial data matrix is created. Applying the Proximal Point algorithm to this dataset, the results of the extreme-ray identification are shown in Fig. 2: the algorithm recovers the initial 20 generator images.

Fig. 2. The facial images identified as extreme rays for the 361 × 500 data set with 20 extreme rays. The stopping criterion was set to 10^-5.

V. ACKNOWLEDGEMENTS

The second author would like to acknowledge several useful discussions with Prof. Prakash Ishwar at Boston University.

REFERENCES

[1] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, pp. 111-126, 1994.
[2] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2001, pp. 556-562.
[3] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in NIPS, 2004, MIT Press.
[4] A. Kumar, V. Sindhwani, and P. Kambadur, "Fast conical hull algorithms for near-separable non-negative matrix factorization," arXiv:1210.1190, 2012.
[5] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization -- provably," in STOC, 2012, pp. 145-162.
[6] K. Fukuda and A. Prodon, "Double description method revisited," in Combinatorics and Computer Science, vol. 1120, pp. 91-111, 1996.
[7] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "A new geometric approach to latent topic modeling and discovery," [stat.ML], 2013.
[8] W. Ding, P. Ishwar, M. H. Rohban, and V. Saligrama, "Necessary and sufficient conditions for novel word detection in separable topic models," arXiv:1310.7994 [cs.LG], 2013.
[9] B. Recht, C. Re, J. Tropp, and V. Bittorf, "Factoring nonnegative matrices with linear programs," in NIPS, 2012, pp. 1223-1231.
[10] N. Gillis and R. Luce, "Robust near-separable nonnegative matrix factorization using linear optimization," arXiv:1302.4385, 2013.
[11] D. G. Luenberger, Optimization by Vector Space Methods. New York: Wiley, 1969.
[12] J. Eckstein, "Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming," Math. Oper. Res., vol. 18, pp. 203-226, 1993.
[13] M. Terzer and J. Stelling, "Accelerating the computation of elementary modes using pattern trees," in WABI, 2006, pp. 333-343.
[14] CBCL Face Database, http://cbcl.mit.edu/software-datasets/FaceData2.html
[15] D. P. Bertsekas, "A new class of incremental gradient methods for least squares problems," SIAM J. Optim., vol. 7, no. 4, pp. 913-926, 1997.
