MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization


Authors: Ramakrishnan Kannan (Oak Ridge National Laboratory, TN), Grey Ballard (Wake Forest University, NC), Haesun Park (Georgia Institute of Technology, GA)

Abstract. Non-negative matrix factorization (NMF) is the problem of determining two non-negative low-rank factors $W$ and $H$, for a given input matrix $A$, such that $A \approx WH$. NMF is a useful tool for many applications in different domains, such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solve alternating non-negative least squares (NLS) subproblems for $W$ and $H$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices whose sizes span from a few hundred million to a few billion entries. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.

1. INTRODUCTION

Non-negative Matrix Factorization (NMF) is the problem of finding two low-rank factors $W \in \mathbb{R}^{m \times k}_+$ and $H \in \mathbb{R}^{k \times n}_+$ for a given input matrix $A \in \mathbb{R}^{m \times n}_+$, such that $A \approx WH$.
Here, $\mathbb{R}^{m \times n}_+$ denotes the set of $m \times n$ matrices with non-negative real values. Formally, the NMF problem [Seung and Lee 2001] can be defined as

$$\min_{W \geq 0,\, H \geq 0} \|A - WH\|_F, \qquad (1)$$

where $\|X\|_F = (\sum_{ij} x_{ij}^2)^{1/2}$ is the Frobenius norm.

NMF is widely used in data mining and machine learning as a dimension reduction and factor analysis method. It is a natural fit for many real-world problems, as non-negativity is inherent in many representations of real-world data, and the resulting low-rank factors are expected to have a natural interpretation. The applications of NMF range from text mining [Pauca et al. 2004], computer vision [Hoyer 2004], and bioinformatics [Kim and Park 2007] to blind source separation [Cichocki et al. 2009], unsupervised clustering [Kuang et al. 2012; Kuang et al. 2013], and many other areas. In the typical case, $k \ll \min(m, n)$; for problems today, $m$ and $n$ can be on the order of millions or more, and $k$ is on the order of a few tens to thousands.

There is a vast literature on algorithms for NMF and their convergence properties [Kim et al. 2014]. The commonly adopted NMF algorithms are (i) Multiplicative Update (MU) [Seung and Lee 2001], (ii) Hierarchical Alternating Least Squares (HALS) [Cichocki et al. 2009; Ho et al. 2008], (iii) NMF based on Alternating Non-negative Least Squares and Block Principal Pivoting (ABPP) [Kim and Park 2011], and (iv) Stochastic Gradient Descent (SGD) updates [Gemulla et al. 2011]. Most of the algorithms in the NMF literature are based on alternately optimizing each of the low-rank factors $W$ and $H$ while keeping the other fixed, in which case each subproblem is a constrained convex optimization problem. Subproblems can then be solved using standard optimization techniques such as projected gradient or interior point methods; a detailed survey for solving such problems can be found in [Wang and Zhang 2013; Kim et al. 2014].
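To make the objective concrete, here is a minimal NumPy sketch of the Frobenius-norm objective in Eq. (1); the function name is ours, not from the MPI-FAUN code:

```python
import numpy as np

def nmf_objective(A, W, H):
    """Frobenius-norm NMF objective ||A - WH||_F from Eq. (1)."""
    return np.linalg.norm(A - W @ H, "fro")

# A matrix that is exactly the product of non-negative rank-2 factors
# achieves (near-)zero objective at those factors.
rng = np.random.default_rng(0)
W = rng.random((6, 2))
H = rng.random((2, 4))
A = W @ H
print(nmf_objective(A, W, H))  # ~0, up to floating-point roundoff
```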
In this paper, our implementation uses either ABPP, MU, or HALS, but our parallel framework is extensible to other algorithms as-is or with a few modifications, as long as they fit an alternating-updating framework (defined in Section 4).

With the advent of large-scale internet data and interest in Big Data, researchers have started studying the scalability of many foundational machine learning algorithms. To illustrate the dimensions of matrices commonly used in the machine learning community, we present a few examples. Nowadays, the adjacency matrix of a billion-node social network is common. In the matrix representation of video data, every frame contains three matrices, one for each RGB color, which are reshaped into a column. Thus, in the case of 4K video, every frame takes approximately 27 million rows (4096 row pixels x 2196 column pixels x 3 colors). Similarly, the popular representation of documents in text mining is a bag-of-words matrix, where the rows are the dictionary and the columns are the documents (e.g., webpages). Each entry $A_{ij}$ in the bag-of-words matrix is generally the frequency count of word $i$ in document $j$. Typically, with the explosion of new terms in social media, the number of words spans to millions. To handle such high-dimensional matrices, it is important to study low-rank approximation methods in a data-distributed and parallel computing environment.

In this work, we present an efficient algorithm and implementation using tools from the field of High-Performance Computing (HPC). We maintain data in memory (distributed across processors), take advantage of optimized libraries like BLAS and LAPACK for local computational routines, and use the Message Passing Interface (MPI) standard to organize interprocessor communication.
Furthermore, the current hardware trend is that available parallelism (and therefore aggregate computational rate) is increasing much more quickly than improvements in network bandwidth and latency, which implies that the relative cost of communication (compared to computation) is increasing. To address this challenge, we analyze algorithms in terms of both their computation and communication costs. In particular, we prove in Section 5.2 that in the case of dense input and under a mild assumption, our proposed algorithm minimizes the amount of data communicated between processors to within a constant factor of the lower bound.

A key attribute of our framework is that its efficiency does not require a loss of generality of NMF algorithms. Our central observation is that most NMF algorithms consist of two main tasks: (a) performing matrix multiplications and (b) solving Non-negative Least Squares (NLS) subproblems, either approximately or exactly. More importantly, NMF algorithms tend to perform the same matrix multiplications, differing only in how they solve NLS subproblems, and the matrix multiplications often dominate the running time of the algorithms. Our framework is designed to perform the matrix multiplications efficiently and organize the data so that the NLS subproblems can be solved independently in parallel, leveraging any of a number of possible methods. We explore the overall efficiency of the framework and compare three different NMF methods in Section 6, performing convergence, scalability, and parameter-tuning experiments on over 1500 processors.

    Dataset         Type     Matrix size                 NMF Time
    Video           Dense    1 Million x 13,824          5.73 seconds
    Stack Exchange  Sparse   627,047 x 12 Million        67 seconds
    Webbase-2001    Sparse   118 Million x 118 Million   25 minutes

Table I: MPI-FAUN on large real-world datasets. Reported time is for 30 iterations on 1536 processors with a low rank of 50.
With our framework, we are able to explore several large-scale synthetic and real-world data sets, some dense and some sparse. In Table I, we present the NMF computation wall-clock time on some very large real-world datasets. We describe the results of the computation in Section 6, showing the range of application of NMF and the ability of our framework to scale to large data sets.

A preliminary version of this work has already appeared as a conference paper [Kannan et al. 2016]. While the focus of the previous work was the parallel performance of ABPP, the goal of this paper is to explore more data-analytic questions. In particular, the new contributions of this paper include (1) implementing a software framework to compare ABPP with MU and HALS for large-scale data sets, (2) benchmarking on a data analysis cluster and scaling up to over 1500 processors, and (3) providing an interpretation of results for real-world data sets. We provide a detailed comparison with other related work, including MapReduce implementations of NMF, in Section 3.

    $A$        Input matrix
    $W$        Left low-rank factor
    $H$        Right low-rank factor
    $m$        Number of rows of input matrix
    $n$        Number of columns of input matrix
    $k$        Low rank
    $M_i$      $i$th row block of matrix $M$
    $M^i$      $i$th column block of matrix $M$
    $M_i^j$    $(i, j)$th subblock of $M$
    $p$        Number of parallel processes
    $p_r$      Number of rows in processor grid
    $p_c$      Number of columns in processor grid

Table II: Notation

Our main contribution is a new, high-performance parallel computational framework for a broad class of NMF algorithms. The framework is efficient, scalable, flexible, and demonstrated to be effective for large-scale dense and sparse matrices. To the best of our knowledge, ours is the fastest NMF implementation available in the literature. The code and the datasets used for conducting the experiments can be downloaded from https://github.com/ramkikannan/nmflibrary.

2. PRELIMINARIES
2.1. Notation

Table II summarizes the notation we use throughout this paper. We use uppercase letters for matrices and lowercase letters for vectors. We use both subscripts and superscripts for sub-blocks of matrices. For example, $A_i$ is the $i$th row block of matrix $A$, and $A^i$ is the $i$th column block. Likewise, $a_i$ is the $i$th row of $A$, and $a^i$ is the $i$th column. We use $m$ and $n$ to denote the numbers of rows and columns of $A$, respectively, and we assume without loss of generality $m \geq n$ throughout.

2.2. Communication model

To analyze our algorithms, we use the $\alpha$-$\beta$-$\gamma$ model of distributed-memory parallel computation. In this model, interprocessor communication occurs in the form of messages sent between two processors across a bidirectional link (we assume a fully connected network). We model the cost of a message of size $n$ words as $\alpha + n\beta$, where $\alpha$ is the per-message latency cost and $\beta$ is the per-word bandwidth cost. Each processor can compute floating point operations (flops) on data that resides in its local memory; $\gamma$ is the per-flop computation cost. With this communication model, we can predict the performance of an algorithm in terms of the number of flops it performs as well as the number of words and messages it communicates. For simplicity, we will ignore the possibility of overlapping computation with communication in our analysis. For more details on the $\alpha$-$\beta$-$\gamma$ model, see [Thakur et al. 2005; Chan et al. 2007].

2.3. MPI collectives

Point-to-point messages can be organized into collective communication operations that involve more than two processors. MPI provides an interface to the most commonly used collectives, like broadcast, reduce, and gather, as the algorithms for these collectives can be optimized for particular network topologies and processor characteristics. The algorithms we consider use the all-gather, reduce-scatter, and all-reduce collectives, so we review them here, along with their costs.
Our analysis assumes optimal collective algorithms are used (see [Thakur et al. 2005; Chan et al. 2007]), though our implementation relies on the underlying MPI implementation.

At the start of an all-gather collective, each of $p$ processors owns data of size $n/p$. After the all-gather, each processor owns a copy of the entire data of size $n$. The cost of an all-gather is $\alpha \cdot \log p + \beta \cdot \frac{p-1}{p} n$.

At the start of a reduce-scatter collective, each processor owns data of size $n$. After the reduce-scatter, each processor owns a subset of the sum over all data, which is of size $n/p$. (Note that the reduction can be computed with other associative operators besides addition.) The cost of a reduce-scatter is $\alpha \cdot \log p + (\beta + \gamma) \cdot \frac{p-1}{p} n$.

At the start of an all-reduce collective, each processor owns data of size $n$. After the all-reduce, each processor owns a copy of the sum over all data, which is also of size $n$. The cost of an all-reduce is $2\alpha \cdot \log p + (2\beta + \gamma) \cdot \frac{p-1}{p} n$. Note that the costs of each of the collectives are zero when $p = 1$.

3. RELATED WORK

In the data mining and machine learning literature there is an overlap between low-rank approximations and matrix factorizations due to the nature of applications. Despite its name, non-negative matrix "factorization" is really a low-rank approximation. Recently there has been growing interest in collaborative-filtering-based recommender systems. One of the popular techniques for collaborative filtering is matrix factorization, often with non-negativity constraints, and its implementation is widely available in many off-the-shelf distributed machine learning libraries such as GraphLab [Low et al. 2012], MLlib [Meng et al. 2015], and many others [Satish et al. 2014; Yun et al. 2014] as well.
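These cost expressions are easy to encode and sanity-check. The sketch below (helper names are ours) verifies that, to leading order, an all-reduce costs the same as a reduce-scatter followed by an all-gather of the result:

```python
from math import log2

def all_gather_cost(n, p, alpha, beta):
    """alpha-beta cost of an all-gather of n total words on p processors."""
    return alpha * log2(p) + beta * (p - 1) / p * n

def reduce_scatter_cost(n, p, alpha, beta, gamma):
    """Cost of a reduce-scatter, including the per-flop reduction work."""
    return alpha * log2(p) + (beta + gamma) * (p - 1) / p * n

def all_reduce_cost(n, p, alpha, beta, gamma):
    """Cost of an all-reduce of n words on p processors."""
    return 2 * alpha * log2(p) + (2 * beta + gamma) * (p - 1) / p * n

# Sanity check: an all-reduce matches a reduce-scatter followed by an
# all-gather of the n-word result, term by term.
n, p, a, b, g = 1_000_000, 64, 1e-6, 1e-9, 1e-10
combined = reduce_scatter_cost(n, p, a, b, g) + all_gather_cost(n, p, a, b)
assert abs(all_reduce_cost(n, p, a, b, g) - combined) < 1e-12
```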
However, we would like to clarify that collaborative filtering using matrix factorization is a different problem than NMF: in the case of collaborative filtering, the zeros in the matrix are considered to be missing entries, while in the case of NMF, the zeros in the matrix correspond to true zero values.

There are several recent distributed NMF algorithms in the literature [Liao et al. 2014; Faloutsos et al. 2014; Yin et al. 2014; Liu et al. 2010]. Liu et al. propose running Multiplicative Update (MU) for KL divergence, squared loss, and "exponential" loss functions [Liu et al. 2010]. Matrix multiplication, element-wise multiplication, and element-wise division are the building blocks of the MU algorithm. The authors discuss performing these matrix operations effectively in Hadoop for sparse matrices. Using similar approaches, Liao et al. implement an open-source Hadoop-based MU algorithm and study its scalability on large-scale biological data sets [Liao et al. 2014]. Also, Yin, Gao, and Zhang present a scalable NMF that can perform frequent updates, which aim to use the most recently updated data [Yin et al. 2014]. Similarly, Faloutsos et al. propose a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions [Faloutsos et al. 2014]. The authors also provide an implementation that can enforce non-negativity constraints on the factor matrices. All of these works use Hadoop to implement their algorithms.
We emphasize that our MPI-based approach has several advantages over Hadoop-based approaches:

- Efficiency: our approach maintains data in memory, never communicating the data matrix, while Hadoop-based approaches must read/write data to/from disk and involve global shuffles of data matrix entries;
- Generality: our approach is well-designed for both dense and sparse data matrices, whereas Hadoop-based approaches generally require sparse inputs;
- Privacy: our approach allows processors to collaborate on computing an approximation without ever sharing their local input data (important for applications involving sensitive data, such as electronic health records), while Hadoop requires the user to relinquish control of data placement.

We note that Spark [Zaharia et al. 2010] is a popular big-data processing infrastructure that is generally more efficient for iterative algorithms such as NMF than Hadoop, as it maintains data in memory and avoids file system I/O. Even with a Spark implementation of a previously proposed Hadoop-based NMF algorithm, we expect performance to suffer from expensive communication of input matrix entries, and Spark will not overcome the shortcomings of generality and privacy of the previous algorithms. Although Spark has collaborative filtering libraries such as MLlib [Meng et al. 2015], which use matrix factorization and can impose non-negativity constraints, none of them implement pure NMF, and so we do not have a direct comparison against NMF running on Spark. As mentioned above, the problem of collaborative filtering is different from NMF, and therefore different computations are performed at each iteration.

Fairbanks et al. [Fairbanks et al. 2015] present a parallel NMF algorithm designed for multicore machines.
To demonstrate the importance of minimizing communication, we consider this approach to parallelizing an alternating-updating NMF algorithm in distributed memory (see Section 5.1). While this naive algorithm exploits the natural parallelism available within the alternating iterations (the fact that rows of $W$ and columns of $H$ can be computed independently), it performs more communication than necessary to set up the independent problems. We compare the performance of this algorithm with our proposed approach to demonstrate the importance of designing algorithms to minimize communication; that is, simply parallelizing the computation is not sufficient for satisfactory performance and parallel scalability.

Apart from distributed NMF algorithms using Hadoop and multicores, there are also implementations of the MU algorithm in a distributed memory setting using X10 [Grove et al. 2014] and on a GPU [Mejía-Roa et al. 2015].

4. ALTERNATING-UPDATING NMF ALGORITHMS

We define Alternating-Updating NMF algorithms as those that (1) alternate between updating $W$ for a given $H$ and updating $H$ for a given $W$ and (2) use the Gram matrix associated with the fixed factor matrix and the product of the input data matrix $A$ with the fixed factor matrix. We show the structure of the framework in Algorithm 1.

Algorithm 1 $[W, H] = \text{AU-NMF}(A, k)$

Require: $A$ is an $m \times n$ matrix, $k$ is the rank of the approximation
1: Initialize $H$ with a non-negative matrix in $\mathbb{R}^{k \times n}_+$.
2: while stopping criteria not satisfied do
3:   Update $W$ using $HH^T$ and $AH^T$
4:   Update $H$ using $W^TW$ and $W^TA$
5: end while

The specifics of lines 3 and 4 depend on the NMF algorithm, and we refer to the computation associated with these lines as the Local Update Computations (LUC), as they will not affect the parallelization schemes we define in Section 5.2.
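The structure of Algorithm 1 can be sketched serially in a few lines of NumPy. The `update_w`/`update_h` callables stand in for the Local Update Computations; as one concrete (and here illustrative) choice we plug in the Multiplicative Update rule of Section 4.1. All function names are ours:

```python
import numpy as np

def au_nmf(A, k, update_w, update_h, n_iters=100, seed=0):
    """Serial sketch of Algorithm 1: each iteration forms the Gram matrix
    of the fixed factor and its product with A, then delegates to the
    algorithm-specific Local Update Computations (LUC)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(n_iters):
        W = update_w(W, H @ H.T, A @ H.T)   # line 3: uses HH^T and AH^T
        H = update_h(H, W.T @ W, W.T @ A)   # line 4: uses W^TW and W^TA
    return W, H

# One possible LUC: the Multiplicative Update rule (Section 4.1); the
# small constant guards against division by zero.
mu_w = lambda W, HHt, AHt: W * AHt / (W @ HHt + 1e-12)
mu_h = lambda H, WtW, WtA: H * WtA / (WtW @ H + 1e-12)

rng = np.random.default_rng(1)
A = rng.random((30, 2)) @ rng.random((2, 20))  # exact non-negative rank 2
W, H = au_nmf(A, 2, mu_w, mu_h)
```

Swapping in HALS sweeps or a BPP solver changes only the two callables; the Gram-matrix plumbing, which is what the parallel framework optimizes, stays the same.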
Because these computations are performed locally, we use a function $F(m, n, k)$ to denote the number of flops required for each algorithm's LUC (and we do not consider communication costs). We note that AU-NMF is very similar to a two-block, block coordinate descent (BCD) framework, but it has a key difference. In the BCD framework where the two blocks are the unknown factors $W$ and $H$, we solve the following subproblems, which have a unique solution for full-rank $H$ and $W$:

$$W \leftarrow \underset{\tilde{W} \geq 0}{\operatorname{argmin}} \left\|A - \tilde{W}H\right\|_F, \qquad H \leftarrow \underset{\tilde{H} \geq 0}{\operatorname{argmin}} \left\|A - W\tilde{H}\right\|_F. \qquad (2)$$

Since each subproblem involves non-negative least squares, this two-block BCD method is also called the Alternating Non-negative Least Squares (ANLS) method [Kim et al. 2014]. For example, Block Principal Pivoting (ABPP), discussed in more detail in Section 4.3, is one algorithm that solves these NLS subproblems. In the context of the AU-NMF algorithm, an ANLS method maximally reduces the overall NMF objective function value by finding the optimal solution for given $H$ and $W$ in lines 3 and 4, respectively.

There are other popular NMF algorithms that update the factor matrices alternately without maximally reducing the objective function value each time, in the same sense as in ANLS. These updates do not necessarily solve each of the subproblems (2) to optimality but simply improve the overall objective function (1). Such methods include Multiplicative Update (MU) [Seung and Lee 2001] and Hierarchical Alternating Least Squares (HALS) [Cichocki et al. 2009], which was also proposed as Rank-one Residual Iteration (RRI) [Ho et al. 2008]. To show how these methods can fit into the AU-NMF framework, we discuss them in more detail in Sections 4.1 and 4.2. The convergence properties of these different algorithms are discussed in detail by Kim, He, and Park [Kim et al. 2014].
We emphasize here that both MU and HALS require computing Gram matrices and matrix products of the input matrix with each factor matrix. Therefore, if the update ordering follows the convention of updating all of $W$ followed by all of $H$, both methods fit into the AU-NMF framework. We note that both MU and HALS are defined for more general update orders, but for our purposes we constrain them to be AU-NMF algorithms.

While we focus on three NMF algorithms in this paper, we highlight that our framework is extensible to other NMF algorithms, including those based on the Alternating Direction Method of Multipliers (ADMM) [Sun and Févotte 2014], Nesterov-based methods [Guan et al. 2012], or any other method that fits the framework of Algorithm 1.

4.1. Multiplicative Update (MU)

In the case of MU [Seung and Lee 2001], individual entries of $W$ and $H$ are updated with all other entries fixed. In this case, the update rules are

$$w_{ij} \leftarrow w_{ij} \frac{(AH^T)_{ij}}{(WHH^T)_{ij}}, \qquad \text{and} \qquad h_{ij} \leftarrow h_{ij} \frac{(W^TA)_{ij}}{(W^TWH)_{ij}}. \qquad (3)$$

Instead of performing these $(m + n)k$ updates in an arbitrary order, if all of $W$ is updated before $H$ (or vice versa), this method also follows the AU-NMF framework. After computing the Gram matrices $HH^T$ and $W^TW$ and the products $AH^T$ and $W^TA$, the extra cost of computing $W(HH^T)$ and $(W^TW)H$ is $F(m, n, k) = 2(m + n)k^2$ flops to perform updates for all entries of $W$ and $H$, as the other elementwise operations affect only lower-order terms. Thus, when MU is used, lines 3 and 4 in Algorithm 1 (and functions UpdateW and UpdateH in Algorithms 2 and 3) implement the expressions in (3), given the previously computed matrices.

4.2. Hierarchical Alternating Least Squares (HALS)

In the case of HALS [Cichocki et al. 2009; Cichocki and Anh-Huy 2009], updates are performed on individual columns of $W$ and rows of $H$ with all other entries in the factor matrices fixed.
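As a numerical sanity check on (3), the sketch below applies one full update of $W$ (with $H$ fixed) and then one of $H$, and verifies the Lee-Seung guarantee that each full update does not increase the objective. The small `eps` guarding against division by zero is our addition:

```python
import numpy as np

def mu_update_w(A, W, H, eps=1e-12):
    """w_ij <- w_ij * (AH^T)_ij / (W HH^T)_ij, elementwise, as in Eq. (3)."""
    return W * (A @ H.T) / (W @ (H @ H.T) + eps)

def mu_update_h(A, W, H, eps=1e-12):
    """h_ij <- h_ij * (W^T A)_ij / (W^T W H)_ij, elementwise, as in Eq. (3)."""
    return H * (W.T @ A) / ((W.T @ W) @ H + eps)

rng = np.random.default_rng(3)
A = rng.random((30, 20))
W, H = rng.random((30, 4)), rng.random((4, 20))
f0 = np.linalg.norm(A - W @ H, "fro")
W = mu_update_w(A, W, H)           # full update of W, H fixed
f1 = np.linalg.norm(A - W @ H, "fro")
H = mu_update_h(A, W, H)           # full update of H, W fixed
f2 = np.linalg.norm(A - W @ H, "fro")
assert f1 <= f0 + 1e-9 and f2 <= f1 + 1e-9   # objective is non-increasing
```

In MPI-FAUN the Gram matrices $HH^T$, $W^TW$ and the products $AH^T$, $W^TA$ are computed once per iteration and reused; here they are formed inline for brevity.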
This approach is a BCD method with $2k$ blocks, set to minimize the function

$$f(w_1, \ldots, w_k, h_1, \ldots, h_k) = \left\|A - \sum_{i=1}^{k} w_i h_i\right\|_F, \qquad (4)$$

where $w_i$ is the $i$th column of $W$ and $h_i$ is the $i$th row of $H$. The update rules [Cichocki and Anh-Huy 2009, Algorithm 2] can be written in closed form:

$$w_i \leftarrow \left[w_i + (AH^T)^i - W(HH^T)^i\right]_+, \quad w_i \leftarrow \frac{w_i}{\|w_i\|}, \quad \text{and} \quad h_i \leftarrow \left[h_i + (W^TA)_i - (W^TW)_i H\right]_+. \qquad (5)$$

Note that the columns of $W$ and rows of $H$ are updated in order, so that the most up-to-date values are always used, and these $2k$ updates can be done in an arbitrary order. However, if all the $W$ updates are done before $H$ (or vice versa), the method falls into the AU-NMF framework. After computing the matrices $HH^T$, $AH^T$, $W^TW$, and $W^TA$, the extra computation is $F(m, n, k) = 2(m + n)k^2$ flops for updating both $W$ and $H$. Thus, when HALS is used, lines 3 and 4 in Algorithm 1 (and functions UpdateW and UpdateH in Algorithms 2 and 3) implement the expressions in (5), given the previously computed matrices.

4.3. Alternating Non-negative Least Squares with Block Principal Pivoting

Block Principal Pivoting (BPP) is an active-set-like method for solving the NLS subproblems in Eq. (2). The main subroutine of BPP is the single right-hand-side NLS problem

$$\min_{x \geq 0} \|Cx - b\|_2. \qquad (6)$$

The Karush-Kuhn-Tucker (KKT) optimality conditions for Eq. (6) are as follows:

$$y = C^TCx - C^Tb, \qquad (7a)$$
$$y \geq 0, \qquad (7b)$$
$$x \geq 0, \qquad (7c)$$
$$x_i y_i = 0 \;\; \forall i. \qquad (7d)$$

The KKT conditions (7) state that at optimality, the support sets (i.e., the non-zero elements) of $x$ and $y$ are complementary to each other. Therefore, Eq. (7) is an instance of the Linear Complementarity Problem (LCP), which arises frequently in quadratic programming. When $k \ll \min(m, n)$, active-set and active-set-like methods are very suitable because most computations involve matrices of sizes $m \times k$, $n \times k$, and $k \times k$, which are small and easy to handle.
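The HALS sweeps can be sketched serially as follows. For simplicity (an assumption of ours, following the common variant in [Cichocki and Anh-Huy 2009]) this sketch folds the exact $1/(HH^T)_{ii}$ and $1/(W^TW)_{ii}$ scalings into the steps and omits the normalization of $w_i$ from Eq. (5); each column/row update is then an exact coordinate minimization, so the objective cannot increase:

```python
import numpy as np

def hals_iteration(A, W, H, eps=1e-12):
    """One HALS sweep over the columns of W, then the rows of H
    (cf. Eq. (5)); updated columns/rows are used immediately."""
    HHt, AHt = H @ H.T, A @ H.T     # Gram matrix and product, reused below
    k = W.shape[1]
    for i in range(k):
        W[:, i] = np.maximum(
            W[:, i] + (AHt[:, i] - W @ HHt[:, i]) / (HHt[i, i] + eps), 0)
    WtW, WtA = W.T @ W, W.T @ A     # recomputed after the W sweep
    for i in range(k):
        H[i, :] = np.maximum(
            H[i, :] + (WtA[i, :] - WtW[i, :] @ H) / (WtW[i, i] + eps), 0)
    return W, H

rng = np.random.default_rng(4)
A = rng.random((25, 2)) @ rng.random((2, 18))  # exact non-negative rank 2
W, H = rng.random((25, 2)), rng.random((2, 18))
f0 = np.linalg.norm(A - W @ H, "fro")
for _ in range(50):
    W, H = hals_iteration(A, W, H)
```

Note that only the $2(m+n)k^2$ flops of the sweeps are algorithm-specific; the four matrix products are the same ones every AU-NMF method computes.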
If we knew which indices correspond to nonzero values in the optimal solution, then computing the solution would be an unconstrained least squares problem on these indices. In the optimal solution, call the set of indices $i$ such that $x_i = 0$ the active set, and let the remaining indices be the passive set. The BPP algorithm works to find this final active set and passive set. It greedily swaps indices between the intermediate active and passive sets until finding a partition that satisfies the KKT conditions. In the partition of the optimal solution, the values of the indices that belong to the active set will be zero. The values of the indices that belong to the passive set are determined by solving the unconstrained least squares problem restricted to the passive set. Kim and Park [Kim and Park 2011] discuss the BPP algorithm in further detail.

We use the notation $X \leftarrow \text{SolveBPP}(C^TC, C^TB)$ to define the (local) function for using BPP to solve Eq. (6) for every column of $X$. We define $C_{\text{BPP}}(k, c)$ as the cost of SolveBPP, given the $k \times k$ matrix $C^TC$ and the $k \times c$ matrix $C^TB$. SolveBPP mainly involves solving least squares problems over the intermediate passive sets. Our implementation uses the normal equations to solve the unconstrained least squares problems, because the normal equations matrices have been pre-computed in order to check the KKT conditions. However, more numerically stable methods such as QR decomposition can also be used. Thus, when ABPP is used, lines 3 and 4 in Algorithm 1 (and functions UpdateW and UpdateH in Algorithms 2 and 3) correspond to calls to SolveBPP. The number of flops involved in SolveBPP is not a closed-form expression; in this case, $F(m, n, k) = C_{\text{BPP}}(k, m) + C_{\text{BPP}}(k, n)$.

5. PARALLEL ALGORITHMS
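The passive-set idea can be made concrete for tiny $k$ by brute force: enumerate all active/passive partitions, solve the unconstrained normal equations restricted to the passive set, and return the first partition satisfying the KKT conditions (7). This is exponential in $k$ and purely illustrative (BPP instead swaps indices between the sets greedily); the function name is ours:

```python
import itertools
import numpy as np

def nls_exhaustive(C, b, tol=1e-10):
    """Solve min_{x>=0} ||Cx - b||_2 by enumerating active/passive
    partitions. Since KKT is sufficient for this convex problem, any
    partition passing both checks yields the optimum."""
    k = C.shape[1]
    CtC, Ctb = C.T @ C, C.T @ b
    for passive in itertools.chain.from_iterable(
            itertools.combinations(range(k), r) for r in range(k, -1, -1)):
        P = list(passive)
        x = np.zeros(k)
        if P:
            # Unconstrained least squares restricted to the passive set,
            # via the (pre-computed) normal equations.
            x[P] = np.linalg.solve(CtC[np.ix_(P, P)], Ctb[P])
        if (x < -tol).any():
            continue  # primal infeasible partition, Eq. (7c)
        y = CtC @ x - Ctb  # Eq. (7a); y is ~0 on the passive set
        if (y < -tol).any():
            continue  # dual infeasible on the active set, Eq. (7b)
        return np.clip(x, 0, None), y
    raise RuntimeError("no KKT partition found")

rng = np.random.default_rng(2)
C = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x, y = nls_exhaustive(C, b)
```

The supports of the returned `x` and `y` are complementary, exactly as (7d) requires.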
5.1. Naive Parallel NMF Algorithm

In this section we present a naive parallelization of NMF algorithms, which has previously appeared in the context of a shared-memory parallel platform [Fairbanks et al. 2015]. Each NLS problem with multiple right-hand sides can be parallelized based on the observation that the problems for multiple right-hand sides are independent of each other. For example, we can solve several instances of Eq. (6) independently.

Algorithm 2 $[W, H] = \text{Naive-Parallel-AUNMF}(A, k)$

Require: $A$ is an $m \times n$ matrix distributed both row-wise and column-wise across $p$ processors, $k$ is the rank of the approximation
Require: Local matrices: $A_i$ is $m/p \times n$, $A^i$ is $m \times n/p$, $W_i$ is $m/p \times k$, $H^i$ is $k \times n/p$
1: $p_i$ initializes $H^i$
2: while stopping criteria not satisfied do
     /* Compute W given H */
3:   collect $H$ on each processor using all-gather
4:   $p_i$ computes $W_i \leftarrow \text{UpdateW}(HH^T, A_iH^T)$
     /* Compute H given W */
5:   collect $W$ on each processor using all-gather
6:   $p_i$ computes $(H^i)^T \leftarrow \text{UpdateH}(W^TW, (W^TA^i)^T)$
7: end while
Ensure: $W, H \approx \operatorname{argmin}_{\tilde{W} \geq 0, \tilde{H} \geq 0} \|A - \tilde{W}\tilde{H}\|$
Ensure: $W$ is an $m \times k$ matrix distributed row-wise across processors, $H$ is a $k \times n$ matrix distributed column-wise across processors

    Algorithm                Flops                                    Words                                Messages          Memory
    Naive-Parallel-AUNMF     $4mnk/p + (m+n)k^2 + F(m/p, n/p, k)$     $O((m+n)k)$                          $O(\log p)$*      $O(mn/p + (m+n)k)$
    MPI-FAUN ($m/p > n$)     $4mnk/p + (m+n)k^2/p + F(m/p, n/p, k)$   $O(nk)$                              $O(\log p)$*      $O(mn/p + mk/p + nk)$
    MPI-FAUN ($m/p < n$)     $4mnk/p + (m+n)k^2/p + F(m/p, n/p, k)$   $O(\sqrt{mnk^2/p})$                  $O(\log p)$*      $O(mn/p + \sqrt{mnk^2/p})$
    Lower Bound              --                                       $\Omega(\min(\sqrt{mnk^2/p}, nk))$   $\Omega(\log p)$  $mn/p + (m+n)k/p$

Table III: Leading-order algorithmic costs for Naive-Parallel-AUNMF and MPI-FAUN (per iteration). Note that the computation and memory costs assume the data matrix $A$ is dense, but the communication costs (words and messages) apply to both dense and sparse cases.
The function $F(\cdot)$ denotes the number of flops required for the particular NMF algorithm's Local Update Computation, aside from the matrix multiplications common across AU-NMF algorithms. (*) The stated latency cost assumes no communication is required in the LUC; HALS requires $k \log p$ messages for normalization steps.

For different right-hand sides $b$ with $C$ fixed, the NLS problems are independent, which implies that we can optimize row blocks of $W$ and column blocks of $H$ in parallel.

Algorithm 2 and Figure 1 present a straightforward approach to setting up the independent subproblems. Let us divide $W$ into row blocks $W_1, \ldots, W_p$ and $H$ into column blocks $H^1, \ldots, H^p$. We then double-partition the data matrix $A$ accordingly into row blocks $A_1, \ldots, A_p$ and column blocks $A^1, \ldots, A^p$ so that processor $i$ owns both $A_i$ and $A^i$ (see Figure 1). With these partitions of the data and the variables, one can implement any AU-NMF algorithm in parallel, with only one communication step for each solve. We summarize the algorithmic costs of Algorithm 2 (derived in the following subsections) in Table III.

Fig. 1: Naive-Parallel-AUNMF. Note that both rows and columns of $A$ are 1D distributed. The algorithm works by iteratively (all-)gathering the entire fixed factor matrix to each processor and then performing the Local Update Computations to update the variable factor matrix.

This naive algorithm [Fairbanks et al. 2015] has three main drawbacks: (1) it requires storing two copies of the data matrix (one in row distribution and one in column distribution) and both full factor matrices locally, (2) it does not parallelize the computation of $HH^T$ and $W^TW$ (each processor computes them redundantly), and (3) as we will see in Section 5.2, it communicates more data than necessary.

5.1.1. Computation Cost. The computation cost of Algorithm 2 depends on the particular NMF algorithm used. The computation at line 4 consists of computing $A_iH^T$ and $HH^T$ and performing
the algorithm-specific Local Update Computations for $m/p$ rows of $W$. Likewise, the computation at line 6 consists of computing $W^TA^i$ and $W^TW$ and performing the Local Update Computations for $n/p$ columns of $H$. In the dense case, this amounts to $4mnk/p + (m+n)k^2 + F(m/p, n/p, k)$ flops. In the sparse case, processor $i$ performs $2(\mathrm{nnz}(A_i) + \mathrm{nnz}(A^i))k$ flops to compute $A_iH^T$ and $W^TA^i$ instead of $4mnk/p$.

5.1.2. Communication Cost. The size of $W$ is $mk$ words, and the size of $H$ is $nk$ words. Thus, the communication cost of the all-gathers at lines 3 and 5, based on the expressions given in Section 2.3, is $\alpha \cdot 2\log p + \beta \cdot (m+n)k$.

5.1.3. Memory Requirements. The local memory requirement includes storing each processor's part of the matrices $A$, $W$, and $H$. In the case of dense $A$, this is $2mn/p + (m+n)k/p$ words, as $A$ is stored twice; in the sparse case, processor $i$ requires $\mathrm{nnz}(A_i) + \mathrm{nnz}(A^i)$ words for the input matrix and $(m+n)k/p$ words for the output factor matrices. Local memory is also required for storing temporary matrices $W$ and $H$ of size $(m+n)k$ words.

5.2. MPI-FAUN

We present our proposed algorithm, MPI-FAUN, as Algorithm 3. The main ideas of the algorithm are to (1) exploit the independence of the Local Update Computations for rows of $W$ and columns of $H$ and (2) use communication-optimal matrix multiplication algorithms to set up the Local Update Computations. The naive approach (Algorithm 2) shares the first property, parallelizing over rows of $W$ and columns of $H$, but it uses parallel matrix multiplication algorithms that communicate more data than necessary. The central intuition for communication-efficient parallel algorithms for computing $HH^T$, $AH^T$, $W^TW$, and $W^TA$ comes from a classification proposed by Demmel et al. [Demmel et al. 2013].
They consider three cases, depending on the relative sizes of the dimensions of the matrices and the number of processors; the four multiplies for NMF fall into either the "one large dimension" or "two large dimensions" cases. MPI-FAUN uses a careful data distribution in order to use a communication-optimal algorithm for each of the matrix multiplications, while at the same time exploiting the parallelism in the LUC.

The algorithm uses a 2D distribution of the data matrix $A$ across a $p_r \times p_c$ grid of processors (with $p = p_r p_c$), as shown in Figure 2. As we derive in the subsequent subsections, Algorithm 3 performs an alternating method in parallel with a per-iteration bandwidth cost of $O(\min\{\sqrt{mnk^2/p},\, nk\})$ words, a latency cost of $O(\log p)$ messages, and load-balanced computation (up to the sparsity pattern of $A$ and convergence rates of local BPP computations). To minimize the communication cost and local memory requirements, in the typical case $p_r$ and $p_c$ are chosen so that $m/p_r \approx n/p_c \approx \sqrt{mn/p}$, in which case the bandwidth cost is $O(\sqrt{mnk^2/p})$. If the matrix is very tall and skinny, i.e., $m/p > n$, then we choose $p_r = p$ and $p_c = 1$. In this case, the distribution of the data matrix is 1D, and the bandwidth cost is $O(nk)$ words.

The matrix distributions for Algorithm 3 are given in Figure 2; we use a 2D distribution of $A$ and 1D distributions of $W$ and $H$. Recall from Table II that $M_i$ and $M^i$ denote row and column blocks of $M$, respectively. Thus, the notation $(W_i)_j$ denotes the $j$th row block within the $i$th row block of $W$. Lines 3-8 compute $W$ for a fixed $H$, and lines 9-14 compute $H$ for a fixed $W$; note that the computations and communication patterns for the two alternating iterations are analogous. In the rest of this section, we derive the per-iteration computation and communication costs, as well as the local memory requirements.
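Before the formal derivation, the contrast between the naive all-gather volume of Section 5.1.2 and MPI-FAUN's bandwidth bound above can be sketched in a few lines of Python. This is an illustrative cost model only (the dimensions below are hypothetical), not part of the implementation:

```python
import math

def naive_words(m, n, k):
    """All-gather volume of Naive-Parallel-AUNMF per iteration
    (Section 5.1.2): (m + n) * k words, independent of p."""
    return (m + n) * k

def faun_words(m, n, k, p):
    """MPI-FAUN per-iteration bandwidth cost, up to constant factors:
    min{sqrt(m*n*k^2/p), n*k} words for m >= n."""
    return min(math.sqrt(m * n * k**2 / p), n * k)

# Hypothetical square-ish example: the 2D distribution's volume shrinks
# with p, while the naive volume does not depend on p at all.
m, n, k, p = 1 << 20, 1 << 20, 64, 1 << 12
assert faun_words(m, n, k, p) < naive_words(m, n, k)
```

The key qualitative point is that the naive cost has no $p$ in it, so adding processors never reduces the per-processor communication, whereas the 2D cost decays like $1/\sqrt{p}$.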
We also argue the communication-optimality of the algorithm in the dense case. Table III summarizes the results of this section and compares them to Naive-Parallel-AUNMF.

5.2.1. Computation Cost. Local matrix computations occur at lines 3, 6, 9, and 12. In the case that $A$ is dense, each processor performs
$$\frac{n}{p}k^2 + 2\frac{m}{p_r}\frac{n}{p_c}k + \frac{m}{p}k^2 + 2\frac{m}{p_r}\frac{n}{p_c}k = \frac{4mnk}{p} + \frac{(m+n)k^2}{p}$$
flops. In the case that $A$ is sparse, processor $(i,j)$ performs $(m+n)k^2/p$ flops in computing $U_{ij}$ and $X_{ij}$, and $4\,\mathrm{nnz}(A_{ij})k$ flops in computing $V_{ij}$ and $Y_{ij}$. Local update computations occur at lines 8 and 14. In each case, the symmetric positive semi-definite matrix is $k \times k$, and the numbers of columns/rows of length $k$ to be computed are $m/p$ and $n/p$, respectively. These costs together are given by $F(m/p, n/p, k)$. There are computation costs associated with the all-reduce and reduce-scatter collectives, but these contribute only to lower-order terms.

5.2.2. Communication Cost. Communication occurs during six collective operations (lines 4, 5, 7, 10, 11, and 13). We use the cost expressions presented in Section 2.3 for these collectives. The communication cost of the all-reduces (lines 4 and 10) is $\alpha \cdot 4\log p + \beta \cdot 2k^2$; the cost of the two all-gathers (lines 5 and 11) is $\alpha \cdot \log p + \beta \cdot ((p_r-1)nk/p + (p_c-1)mk/p)$; and the cost of the two reduce-scatters (lines 7 and 13) is $\alpha \cdot \log p + \beta \cdot ((p_c-1)mk/p + (p_r-1)nk/p)$.

We note that LUC may introduce significant communication cost, depending on the NMF algorithm used. The normalization of the columns of $W$ within HALS, for example, introduces an extra $k\log p$ latency cost. We ignore such costs in our general analysis. In the case that $m/p < n$, we choose $p_r = \sqrt{mp/n} > 1$ and $p_c = \sqrt{np/m} > 1$, and these communication costs simplify to $\alpha \cdot O(\log p) + \beta \cdot O(mk/p_r + nk/p_c + k^2) = \alpha \cdot O(\log p) + \beta \cdot O(\sqrt{mnk^2/p} + k^2)$.
In the case that $m/p > n$, we choose $p_c = 1$, and the costs simplify to $\alpha \cdot O(\log p) + \beta \cdot O(nk)$.

Algorithm 3: $[W, H]$ = MPI-FAUN($A$, $k$)

Require: $A$ is an $m \times n$ matrix distributed across a $p_r \times p_c$ grid of processors; $k$ is the rank of the approximation
Require: Local matrices: $A_{ij}$ is $m/p_r \times n/p_c$, $W_i$ is $m/p_r \times k$, $(W_i)_j$ is $m/p \times k$, $H^j$ is $k \times n/p_c$, and $(H^j)^i$ is $k \times n/p$
1: $p_{ij}$ initializes $(H^j)^i$
2: while stopping criteria not satisfied do
   /* Compute $W$ given $H$ */
3:   $p_{ij}$ computes $U_{ij} = (H^j)^i ((H^j)^i)^T$
4:   compute $HH^T = \sum_{i,j} U_{ij}$ using an all-reduce across all processors   ▷ $HH^T$ is $k \times k$ and symmetric
5:   $p_{ij}$ collects $H^j$ using an all-gather across processor columns
6:   $p_{ij}$ computes $V_{ij} = A_{ij} (H^j)^T$   ▷ $V_{ij}$ is $m/p_r \times k$
7:   compute $(AH^T)_i = \sum_j V_{ij}$ using a reduce-scatter across each processor row to achieve a row-wise distribution of $(AH^T)_i$   ▷ $p_{ij}$ owns the $m/p \times k$ submatrix $((AH^T)_i)_j$
8:   $p_{ij}$ computes $(W_i)_j \leftarrow$ UpdateW($HH^T$, $((AH^T)_i)_j$)
   /* Compute $H$ given $W$ */
9:   $p_{ij}$ computes $X_{ij} = (W_i)_j^T (W_i)_j$
10:  compute $W^TW = \sum_{i,j} X_{ij}$ using an all-reduce across all processors   ▷ $W^TW$ is $k \times k$ and symmetric
11:  $p_{ij}$ collects $W_i$ using an all-gather across processor rows
12:  $p_{ij}$ computes $Y_{ij} = W_i^T A_{ij}$   ▷ $Y_{ij}$ is $k \times n/p_c$
13:  compute $(W^TA)^j = \sum_i Y_{ij}$ using a reduce-scatter across each processor column to achieve a column-wise distribution of $(W^TA)^j$   ▷ $p_{ij}$ owns the $k \times n/p$ submatrix $((W^TA)^j)^i$
14:  $p_{ij}$ computes $((H^j)^i)^T \leftarrow$ UpdateH($W^TW$, $(((W^TA)^j)^i)^T$)
15: end while
Ensure: $W, H \approx \operatorname{argmin}_{\tilde W \ge 0,\, \tilde H \ge 0} \|A - \tilde W \tilde H\|$
Ensure: $W$ is an $m \times k$ matrix distributed row-wise across processors; $H$ is a $k \times n$ matrix distributed column-wise across processors

5.2.3. Memory Requirements. The local memory requirement includes storing each processor's part of the matrices $A$, $W$, and $H$.
In the case of dense $A$, this is $mn/p + (m+n)k/p$ words; in the sparse case, processor $(i,j)$ requires $\mathrm{nnz}(A_{ij})$ words for the input matrix and $(m+n)k/p$ words for the output factor matrices. Local memory is also required for storing the temporary matrices $W_i$, $H^j$, $V_{ij}$, and $Y_{ij}$, of size $2mk/p_r + 2nk/p_c$ words. In the dense case, assuming $k < n/p_c$ and $k < m/p_r$, the local memory requirement is no more than a constant times the size of the original data. For the optimal choices of $p_r$ and $p_c$, this assumption simplifies to $k < \max\{\sqrt{mn/p},\, m/p\}$.

We note that if the temporary memory requirements become prohibitive, the computation of $((AH^T)_i)_j$ and $((W^TA)^j)^i$ via all-gathers and reduce-scatters can be blocked, decreasing the local memory requirements at the expense of greater latency costs. When $A$ is sparse and $k$ is large enough, the memory footprint of the factor matrices can be larger than that of the input matrix. In this case, the extra temporary memory requirements can become prohibitive; we observed this for a sparse data set with very large dimensions (see Section 6.3.5). We leave the implementation of the blocked algorithm to future work.

5.2.4. Communication Optimality. In the case that $A$ is dense, Algorithm 3 provably minimizes communication costs. Theorem 5.1 establishes the bandwidth cost lower bound for any algorithm that computes $W^TA$ or $AH^T$ each iteration. A latency lower bound of $\Omega(\log p)$ exists in our communication model for any algorithm that aggregates global information [Chan et al. 2007], and for NMF, this global aggregation is necessary in each iteration. Based on the costs derived above, MPI-FAUN is communication optimal under the assumption $k < \sqrt{mn/p}$, matching these lower bounds to within constant factors.

[Fig. 2: Data distributions for MPI-FAUN: (a) 1D distribution with $p = p_r = 4$ and $p_c = 1$; (b) 2D distribution with $p_r = 3$ and $p_c = 2$. Note that for the 2D distribution, $A_{ij}$ is $m/p_r \times n/p_c$, $W_i$ is $m/p_r \times k$, $(W_i)_j$ is $m/p \times k$, $H^j$ is $k \times n/p_c$, and $(H^j)^i$ is $k \times n/p$.]

Theorem 5.1 ([Demmel et al. 2013]). Let $A \in \mathbb{R}^{m\times n}$, $W \in \mathbb{R}^{m\times k}$, and $H \in \mathbb{R}^{k\times n}$ be dense matrices, with $k < n \le m$. If $k < \sqrt{mn/p}$, then any distributed-memory parallel algorithm on $p$ processors that load balances the matrix distributions and computes $W^TA$ and/or $AH^T$ must communicate at least $\Omega(\min\{\sqrt{mnk^2/p},\, nk\})$ words along its critical path.

Proof. The proof follows directly from [Demmel et al. 2013, Section II.B]. Each matrix multiplication $W^TA$ and $AH^T$ has dimensions $k < n \le m$, so the assumption $k < \sqrt{mn/p}$ ensures that neither multiplication has "3 large dimensions." Thus, the communication lower bound is either $\Omega(\sqrt{mnk^2/p})$ in the case of $p > m/n$ ("2 large dimensions"), or $\Omega(nk)$ in the case of $p < m/n$ ("1 large dimension"). If $p < m/n$, then $nk < \sqrt{mnk^2/p}$, so the lower bound can be written as $\Omega(\min\{\sqrt{mnk^2/p},\, nk\})$.

We note that the communication costs of Algorithm 3 are the same for dense and sparse data matrices (the data matrix itself is never communicated). In the case that $A$ is sparse, this communication lower bound does not necessarily apply, as the required data movement depends on the sparsity pattern of $A$. Thus, we cannot make claims of optimality in the sparse case (for general $A$). The communication lower bounds for $W^TA$ and/or $AH^T$ (where $A$ is sparse) can be expressed in terms of hypergraphs that encode the sparsity structure of $A$ [Ballard et al. 2015].
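As a sanity check on the blocking and collective structure of Algorithm 3 in the dense case, the distributed multiplications can be simulated serially: summing the local contributions must reproduce $HH^T$ and $AH^T$ exactly. The NumPy sketch below is purely illustrative (tiny hypothetical dimensions, collectives replaced by explicit sums), not the authors' C++/MPI implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, pr, pc = 6, 12, 2, 3, 2   # toy sizes, divisible by the grid
p = pr * pc
A = rng.random((m, n))
H = rng.random((k, n))

mb, nb, nsub = m // pr, n // pc, n // p  # block sizes

# Local pieces: A_ij, and (H^j)^i (the i-th column sub-block of H^j).
A_blk = {(i, j): A[i*mb:(i+1)*mb, j*nb:(j+1)*nb]
         for i in range(pr) for j in range(pc)}
H_blk = {(i, j): H[:, j*nb + i*nsub : j*nb + (i+1)*nsub]
         for i in range(pr) for j in range(pc)}

# Lines 3-4: local Gram contributions U_ij, combined by an "all-reduce".
HHt = sum(H_blk[i, j] @ H_blk[i, j].T for i in range(pr) for j in range(pc))
assert np.allclose(HHt, H @ H.T)

# Line 5: the all-gather across a processor column reassembles H^j.
Hj = {j: np.hstack([H_blk[i, j] for i in range(pr)]) for j in range(pc)}

# Lines 6-7: local products V_ij = A_ij (H^j)^T, then a "reduce-scatter"
# across each processor row yields the row-distributed AH^T.
AHt = np.vstack([sum(A_blk[i, j] @ Hj[j].T for j in range(pc))
                 for i in range(pr)])
assert np.allclose(AHt, A @ H.T)
```

The second alternating half (computing $H$ given $W$, lines 9-14) is the transpose-symmetric analogue and can be checked the same way.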
Indeed, hypergraph partitioners have been used to reduce communication and achieve load balance for a similar problem: computing a low-rank representation of a sparse tensor (without non-negativity constraints on the factors) [Kaya and Uçar 2015].

[Fig. 3: Parallel matrix multiplications within MPI-FAUN for finding $H$ given $W$, with $p_r = 3$ and $p_c = 2$. The computation of $W^TW$ appears on the far left; the rest of the figure depicts the computation of $W^TA$.]

6. EXPERIMENTS

In this section, we describe our implementation of MPI-FAUN and evaluate its performance. We identify a few synthetic and real world data sets to experiment with, with dimensions ranging from thousands to hundreds of millions. We compare the performance and explore the scaling behavior of different NMF algorithms – MU, HALS, and ANLS/BPP (ABPP) – implemented using the parallel MPI-FAUN framework. The code and the datasets used for conducting the experiments can be downloaded from https://github.com/ramkikannan/nmflibrary.

6.1. Experimental Setup

6.1.1. Data Sets. We used sparse and dense matrices that are either synthetically generated or come from real world applications. We explain the data sets in this section.

— Dense Synthetic Matrix: We generate a low rank matrix as the product of two uniform random matrices of size 207,360 × 100 and 100 × 138,240. The dimensions of this matrix are chosen to be evenly divisible for a particular set of processor grids.

— Sparse Synthetic Matrix: We generate a random sparse Erdős-Rényi matrix of size 207,360 × 138,240 with density 0.001. That is, every entry is nonzero with probability 0.001.

— Dense Real World Matrix (Video): NMF is used on video data for background subtraction in order to detect moving objects. The low rank matrix $\hat{A} = WH$ represents the background, and the error matrix $A - \hat{A}$ represents the moving objects.
Detecting moving objects has many real-world applications such as traffic estimation [Fujimoto et al. 2014] and security monitoring [Bouwmans et al. 2015]. In the case of detecting moving objects, only the last minute or two of video is taken from the live video camera. An algorithm to incrementally adjust the NMF based on new streaming video is presented in [Kim et al. 2014]. To simulate this scenario, we collected a video at a busy intersection of the Georgia Tech campus at 20 frames per second. From this video, we took approximately 12 minutes of footage and reshaped it so that every RGB frame is a column of our matrix; the resulting matrix is dense, with size 1,013,400 × 13,824.

— Sparse Real World Matrix (Webbase): This data set is a directed sparse graph whose nodes correspond to webpages (URLs) and edges correspond to hyperlinks from one webpage to another. The NMF output of this directed graph helps us understand clusters in graphs. We consider two versions of the data set: webbase-1M and webbase-2001. The dataset webbase-1M contains about 1 million nodes (1,000,005) and 3.1 million edges (3,105,536), and was first reported by Williams et al. [Williams et al. 2009]. The version webbase-2001 has about 118 million nodes (118,142,155) and over 1 billion edges (1,019,903,190); it was first reported by Boldi and Vigna [Boldi and Vigna 2004]. Both data sets are available in the University of Florida Sparse Matrix Collection [Davis and Hu 2011], with webbase-2001 being the largest matrix in the entire collection.

— Text Data (Stack Exchange): Stack Exchange is a network of question-and-answer websites on topics in varied fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. There are many Stack Exchange forums, such as ask ubuntu, mathematics, and latex.
We downloaded the latest anonymized dump of all user-contributed content on the Stack Exchange network from https://archive.org/details/stackexchange as of 28-Jul-2016. We used only the questions from the most popular site, Stack Overflow, and did not include the answers and comments. We removed the standard 571 English stop words (such as are, am, be, above, below) and then used Snowball stemming, available through the Natural Language Toolkit (NLTK) package (www.nltk.org). After this initial pre-processing, we deleted HTML tags (such as lt, gt, em) from the posts. The resulting bag-of-words matrix has a vocabulary of size 627,047 over 11,708,841 documents with 365,168,945 non-zero entries.

The sizes of all the real world data sets were adjusted to the nearest size that allows uniform distribution of the matrix.

6.1.2. Implementation Platform. We conducted our experiments on "Rhea" at the Oak Ridge Leadership Computing Facility (OLCF). Rhea is a commodity-type Linux cluster with a total of 512 nodes and a 4X FDR InfiniBand interconnect. Each node contains dual-socket 8-core Intel Sandy Bridge-EP processors and 128 GB of memory. Each socket has a shared 20 MB L3 cache, and each core has a private 256 KB L2 cache.

Our objective in the implementation is to use open source software as much as possible to promote reproducibility and reuse of our code. The entire C++ code was developed using the matrix library Armadillo [Sanderson 2010]. In Armadillo, the elements of a dense matrix are stored in column-major order, and sparse matrices are stored in Compressed Sparse Column (CSC) format. For dense BLAS and LAPACK operations, we linked Armadillo with Intel MKL – the default LAPACK/BLAS library on Rhea. It is also easy to link Armadillo with OpenBLAS [Xianyi 2015]. We use Armadillo's own implementation of sparse matrix-dense matrix multiplication, the default GNU C++ compiler (g++ (GCC) 4.8.2), and the default MPI library (Open MPI 1.8.4) on Rhea.
We chose a commodity cluster with open source software so that the numbers presented here are representative of common use.

6.1.3. Algorithms. In our experiments, we considered the following algorithms:

— MU: MPI-FAUN (Algorithm 3) with MU (Equation (3))
— HALS: MPI-FAUN (Algorithm 3) with HALS (Equation (5))
— ABPP: MPI-FAUN (Algorithm 3) with BPP (Section 4.3)
— Naive: Naive-Parallel-AUNMF (Algorithm 2, Section 5.1)

Our implementation of Naive (Algorithm 2) uses BPP but can easily be extended to MU, HALS, and other NMF algorithms. A detailed comparison of Naive-Parallel-AUNMF with MPI-FAUN is made in our earlier work [Kannan et al. 2016]. We include some benchmark results from Naive to reiterate the point that communication efficiency is key to obtaining reasonable performance, but we omit other Naive results in order to focus attention on comparisons among the other algorithms.

For the algorithms based on MPI-FAUN, we use the processor grid that is closest to the theoretical optimum (see Section 5.2.2) in order to minimize communication costs. See Section 6.3.4 for an empirical evaluation of varying processor grids for a particular algorithm and data set.

To ensure fair comparison among algorithms, the same random seed is used across the different methods. That is, the initial random matrix H is generated with the same random seed when testing with different algorithms (note that W need not be initialized). In our experiments, we use the number of iterations as the stopping criterion for all the algorithms.

While we would like to compare against other high-performance NMF algorithms in the literature, the only other distributed-memory implementations of which we are aware are implemented using Hadoop and are designed only for sparse matrices [Liao et al. 2014; Liu et al. 2010; Gemulla et al. 2011; Yin et al. 2014; Faloutsos et al. 2014].
We stress that Hadoop is not designed for high performance computing of iterative numerical algorithms, requiring disk I/O between steps, so a run time comparison between a Hadoop implementation and a C++/MPI implementation is not a fair comparison of parallel algorithms. A qualitative example of the difference in run time is that a Hadoop implementation of the MU algorithm on a large sparse matrix of size $2^{17} \times 2^{16}$ with $2 \times 10^8$ nonzeros (with k = 8) takes on the order of 50 minutes per iteration [Liu et al. 2010], while our MU implementation takes 0.065 seconds per iteration for the synthetic data set (which is an order of magnitude larger in terms of rows, columns, and nonzeros) running on only 16 nodes.

6.2. Relative Error over Iterations

There are various metrics to compare the quality of NMF algorithms [Kim et al. 2014]. The most common among these metrics are (a) relative error and (b) projected gradient. The former represents the closeness of the low rank approximation $\hat{A} \approx WH$, which is generally the optimization objective. The latter represents the quality of the produced low rank factors and the stationarity of the final solution. These metrics are also used as the stopping criterion for terminating the iterations of the NMF algorithm, as in line 2 of Algorithm 1. Typically, a combination of the number of iterations and the improvement of these metrics until a tolerance is met is used as the stopping criterion. In this paper, we use relative error for the comparison as it is monotonically decreasing, as opposed to the projected gradient of the low rank factors, which oscillates over iterations. The relative error can be formally defined as $\|A - WH\|_F / \|A\|_F$.

In Figure 4, we measure the relative error at the end of every iteration (i.e., after the updates of both W and H) for all three algorithms MU, HALS, and ABPP. We consider three real world datasets, video, stack exchange, and webbase-1M, and set k = 50.
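The relative error defined above can be computed directly; the dense NumPy sketch below is for exposition only (in a distributed setting one would evaluate the norm via the precomputed Gram matrices rather than forming $A - WH$ explicitly):

```python
import numpy as np

def relative_error(A, W, H):
    """Relative error ||A - W H||_F / ||A||_F of a low rank approximation."""
    return np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro")

# An exact factorization has (numerically) zero relative error.
rng = np.random.default_rng(0)
W = rng.random((20, 5))
H = rng.random((5, 8))
assert relative_error(W @ H, W, H) < 1e-12
```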
We used only the number of iterations as the stopping criterion and, just for this section, ran all the algorithms for 50 iterations.

To begin with, we explain the observations on the dense video dataset presented in Figure 4a. After 50 iterations, the relative error of MU was the highest at 0.1804, and that of ABPP was the lowest at 0.1170; HALS's relative error was 0.1208. From the figure, we can observe that the ABPP error did not change after 29 iterations, whereas HALS and MU were still improving marginally in the fourth decimal place even after 50 iterations.

We can observe that the relative error for stack exchange in Figure 4b is better than that for webbase-1M in Figure 4c across all three algorithms. In the case of the stack exchange dataset, the relative errors after 50 iterations follow the pattern MU > HALS > ABPP, with values 0.8480, 0.8365, and 0.8333, respectively. Unlike on the video dataset, both MU and HALS stopped improving after 23 iterations, whereas ABPP was still improving in the fourth decimal place even though its error was already better than the others. The differences in relative error for the webbase-1M dataset were not as significant as for the others, though the relative ordering MU > HALS > ABPP was consistent, with values of 0.9703 for MU, 0.9697 for HALS, and 0.9695 for ABPP.

In general, for these datasets ABPP identified better approximations than MU and HALS, which is consistent with the literature [Kim et al. 2014; Kim and Park 2011]. However, for the sparse datasets, the differences in relative error are small across the NMF algorithms.

6.3. Time Per Iteration

In this section we focus on the per-iteration time of all the algorithms. We report four types of experiments: varying the number of processors (Section 6.3.2), the rank of the approximation (Section 6.3.3), the shape of the processor grid (Section 6.3.4), and scaling up the dataset size (Section 6.3.5).
For each experiment we report a time breakdown in terms of the overall computation and communication steps (described in Section 6.3.1) shared by all algorithms.

6.3.1. Time Breakdown. To differentiate the computation and communication costs among the algorithms, we present the time breakdown among the various tasks within the algorithms for all performance experiments. For Algorithm 3, there are three local computation tasks and three communication tasks to compute each of the factor matrices:

— MM, computing a matrix multiplication with the local data matrix and one of the factor matrices;
— LUC, local updates, either using ABPP or applying the remaining work of the MU or HALS updates (i.e., the total time for both UpdateW and UpdateH functions);
— Gram, computing the local contribution to the Gram matrix;
— All-Gather, to compute the global matrix multiplication;
— Reduce-Scatter, to compute the global matrix multiplication;
— All-Reduce, to compute the global Gram matrix.

In our results, we do not distinguish the costs of these tasks for W and H separately; we report their sum, though we note that we do not always expect balance between the two contributions for each task. Algorithm 2 performs all of these tasks except Reduce-Scatter and All-Reduce; all of its communication is in All-Gather.

6.3.2. Scaling p: Strong Scaling. Figure 5 presents a strong scaling experiment with four data sets: sparse synthetic, dense synthetic, webbase-1M, and video. In this experiment, for each data set and algorithm, we use low rank k = 50 and vary the number of processors (with fixed problem size). We use {1, 6, 24, 54, 96} nodes; since each node has 16 cores, this corresponds to {16, 96, 384, 864, 1536} cores. We report average per-iteration times.
We highlight three main observations from these experiments: (1) Naive is slower than all other algorithms for large p; (2) MU, HALS, and ABPP (the algorithms based on MPI-FAUN) scale up to over 1000 processors; (3) the relative per-iteration cost of LUC decreases as p increases (for all algorithms), and therefore the extra per-iteration cost of ABPP (compared with MU and HALS) becomes negligible.

Observation 1. We report Naive performance only for the synthetic data sets (Figures 5a and 5b); the results for the real-world data sets are similar. For the Sparse Synthetic data set, Naive is 4.2× slower than the fastest algorithm (ABPP) on 1536 processors; for the Dense Synthetic data set, Naive is 1.6× slower than the fastest algorithm (MU) at that scale. Nearly all of this slowdown is due to the communication costs of Naive. Theoretical and practical evidence supporting this observation is also reported in our previous paper [Kannan et al. 2016]. However, we also note that Naive is the fastest algorithm at the smallest p for each problem, which is largely due to reduced MM time. Each algorithm performs exactly the same number of flops per MM; the efficiency of Naive at small p is due to cache effects. For example, for the Dense Synthetic problem on 96 processors, the output matrix of Naive's MM fits in L2 cache, but the output matrix of MPI-FAUN's MM does not; these effects disappear as p increases.

[Fig. 4: Relative error comparison of MU, HALS, and ABPP on real world datasets for k = 50: (a) Dense Real World (Video), (b) Stack Exchange, (c) Webbase.]

Observation 2. The algorithms based on MPI-FAUN (MU, HALS, ABPP) scale well, up to over 1000 processors. All algorithms' run times decrease as p increases, with the exception of the Sparse Real World data set, on which all algorithms slow down when scaling from p = 864 to p = 1536 (we attribute this lack of scaling to load imbalance). For the sparse problems, comparing p = 16 to p = 1536 (a factor increase of 96), we observe speedups from ABPP of 59× (synthetic) and 22× (real world). For the dense problems, comparing p = 96 to p = 1536 (a factor increase of 16), ABPP's speedup is 12× for both problems. MU and HALS demonstrate similar scaling results. For comparison, the speedups for Naive were 8× and 3× (sparse) and 6× and 4× (dense).

Observation 3. MU, HALS, and ABPP share all the same subroutines except those characterized as LUC. Considering only LUC subroutines, MU and HALS require fewer operations than ABPP. However, HALS has to make one additional communication for the normalization of W. For small p, these cost differences are apparent in Figure 5. For example, for the sparse real world data set on 16 processors, ABPP's LUC time is 16× that of MU, and the per-iteration times differ by a factor of 4.5. However, as p increases, the relative time spent in LUC computations decreases, so the extra time taken by ABPP has less of an effect on the total per-iteration time. By contrast, for the dense real world data set on 1536 processors, ABPP spends a factor of 27 more time in LUC than MU but only 11% longer over the entire iteration. For the synthetic data sets, LUC takes 24% (sparse) of the per-iteration time on 16 processors and 84% (dense) on 96 processors, and that percentage drops to 11% (sparse) and 15% (dense) on 1536 processors.

These trends can also be seen theoretically (Table III). We expect local computations like MM, LUC, and Gram to scale like 1/p, assuming load balance is preserved.
If communication costs are dominated by the number of words being communicated (i.e., the communication is bandwidth bound), then we expect time spent in communication to scale like $1/\sqrt{p}$, and at least for dense problems, this scaling is the best possible. Thus, communication costs will eventually dominate computation costs for all NMF problems, for sufficiently large p. (Note that if communication is latency bound and proportional to the number of messages, then time spent communicating actually increases with p.)

The overall conclusion from these empirical and theoretical observations is that the extra per-iteration cost of ABPP over alternatives like MU and HALS decreases as the number of processors p increases. As shown in Section 6.2, the faster error reduction of ABPP typically reduces the overall time to solution compared with the alternatives, even if it requires more time per iteration. As we scale up p, this tradeoff is further relaxed, so that ABPP becomes more and more advantageous for both quality and performance.

6.3.3. Scaling k. Figure 6 presents an experiment scaling up the low rank value k from 10 to 50 for each of the four data sets. In this experiment, for each data set and algorithm, the problem size is fixed and the number of processors is fixed at p = 864. As in Section 6.3.2, we report the average per-iteration times. We again omit Naive data for the real world data sets to highlight the comparisons among MU, HALS, and ABPP. We highlight two observations from these experiments: (1) Naive is plagued by communication time that increases linearly with k; (2) ABPP's time increases more quickly with k than those of MU and HALS.

Observation 1. We see from the synthetic data sets (Figures 6a and 6b) that the overall time of Naive increases more rapidly with k than that of any other algorithm and that the increase in time is due mainly to communication (All-Gather).
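The scaling argument above (local computation proportional to $1/p$, bandwidth-bound communication proportional to $1/\sqrt{p}$, latency proportional to $\log p$) can be illustrated with a toy per-iteration cost model. The constants below are hypothetical and chosen only to show the crossover, not fitted to any measurement:

```python
import math

def per_iteration_time(p, a=1000.0, b=10.0, c=0.01):
    """Toy cost model: computation a/p, bandwidth-bound communication
    b/sqrt(p), latency c*log2(p). All constants are illustrative only."""
    return a / p + b / math.sqrt(p) + c * math.log2(p)

# Computation dominates at small p; the sqrt(p) communication term
# eventually dominates, so parallel efficiency must fall off.
assert 1000.0 / 16 > 10.0 / math.sqrt(16)        # compute-bound regime
assert 1000.0 / 16384 < 10.0 / math.sqrt(16384)  # bandwidth-bound regime
```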
Table III predicts that Naive's communication volume scales linearly with k, and we see that in practice the prediction is almost perfect for the synthetic problems. This confirms that the communication is dominated by bandwidth costs and not latency costs (which are constant with respect to k). We note that the communication cost of MPI-FAUN scales like $\sqrt{k}$, which is why we do not see as dramatic an increase in communication time for MU, HALS, or ABPP in Figure 6.

Observation 2. Focusing attention on the time spent in LUC computations, we can compare how MU, HALS, and ABPP scale differently with k.

[Fig. 5: Strong scaling (varying p) with k = 50, benchmarking per-iteration times: (a) Sparse Synthetic, (b) Dense Synthetic, (c) Sparse Real World (webbase-1M), (d) Dense Real World (Video). Each bar shows the breakdown into All-Reduce, Reduce-Scatter, All-Gather, Gram, LUC, and MM time.]

We see a more rapid increase of LUC time for ABPP than for MU or HALS; this is expected because the LUC computations unique to ABPP require between $O(k^3)$ and $O(k^4)$ operations (depending on the data), while the unique LUC computations for MU and HALS are $O(k^2)$, with all other parameters fixed.
Thus, the extra per-iteration cost of ABPP increases with k, so the error-reduction advantage of ABPP must also increase with k for it to remain superior at large values of k. We also note that although the number of operations within MM is O(k), we do not observe much increase in MM time from k = 10 to k = 50; this is due to the improved efficiency of local MM for larger values of k.

6.3.4. Varying Processor Grid. In this section we demonstrate the effect of the dimensions of the processor grid on per-iteration performance. For a fixed total number of processors p, the communication cost of Algorithm 3 varies with the choice of $p_r$ and $p_c$. To minimize the amount of data communicated, the theoretical analysis suggests that the processor grid should be chosen to make the local blocks of the data matrix as square as possible. This implies that if $m/p > n$, then $p_r = p$ and $p_c = 1$ is the optimal choice (a 1D processor grid); likewise, if $n/p > m$, then a 1D processor grid with $p_r = 1$ and $p_c = p$ is the optimal choice. Otherwise, a 2D processor grid minimizes communication, with $p_r \approx \sqrt{mp/n}$ and $p_c \approx \sqrt{np/m}$ (subject to integrality and $p_r p_c = p$).

Figure 7 presents a benchmark of ABPP on the Sparse Synthetic data set for fixed values of p and k. We vary the processor grid dimensions from both 1D grids to the 2D grid that matches the theoretical optimum exactly. Because the size of the Sparse Synthetic matrix is 172,800 × 115,200 and the number of processors is 1536, the theoretically optimal grid is $p_r = \sqrt{mp/n} = 48$ and $p_c = \sqrt{np/m} = 32$. The experimental results confirm that this processor grid is optimal, and we see that the time spent communicating increases as the processor grid deviates from the optimum, with the 1D grids performing the worst.

6.3.5. Scaling up to Very Large Sparse Datasets. In this section, we test MPI-FAUN by scaling up the problem size.
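The grid-selection rule of Section 6.3.4 above can be sketched as a small helper. This is an illustrative sketch of the stated rule, not the implementation's actual grid-selection code; the divisor-snapping step is our own simple way of handling integrality:

```python
import math

def optimal_grid(m, n, p):
    """Processor grid (p_r, p_c) making the local blocks of the m-by-n
    data matrix as square as possible (Section 6.3.4). Integrality is
    handled by snapping sqrt(m*p/n) to the nearest divisor of p."""
    if m / p > n:
        return p, 1          # very tall and skinny: 1D row distribution
    if n / p > m:
        return 1, p          # very short and wide: 1D column distribution
    target = math.sqrt(m * p / n)
    pr = min((d for d in range(1, p + 1) if p % d == 0),
             key=lambda d: abs(d - target))
    return pr, p // pr

# Sparse Synthetic benchmark of Figure 7: 172,800 x 115,200 on p = 1536.
assert optimal_grid(172_800, 115_200, 1536) == (48, 32)
```

The assertion reproduces the paper's reported optimum: $\sqrt{172{,}800 \cdot 1536 / 115{,}200} = \sqrt{2304} = 48$, and $1536 / 48 = 32$.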
While we've used webbase-1M in previous experiments, we consider webbase-2001 in this section as it is the largest sparse data set in the University of Florida Sparse Matrix Collection [Davis and Hu 2011]. The former dataset has about 1 million nodes and 3 million edges, whereas the latter has over 100 million nodes and 1 billion edges (see Section 6.1.1 for more details). Not only is the size of the input matrix increased by two orders of magnitude (because of the increase in the number of edges), but the size of the output matrices is also increased by two orders of magnitude (because of the increase in the number of nodes). In fact, with a low rank of k = 50, the size of the output matrices dominates that of the input matrix: W and H together require a total of 88 GB, while A (stored in compressed column format) is only 16 GB. At this scale, because each node (consisting of 16 cores) of Rhea has 128 GB of memory, multiple nodes are required to store the input and output matrices with room for other intermediate values. As mentioned in Section 5.1.3, MPI-FAUN requires considerably more temporary memory than necessary when the output matrices require more memory than the input matrix. While we were not limited by this property for the other sparse matrices, the webbase-2001 matrix dimensions are so large that we need the memories of tens of nodes to run the algorithm. Thus, we report results only for the largest number of processors in our experiments: 1536 processors (96 nodes). The extra temporary memory used by MPI-FAUN is a latency-minimizing optimization; the algorithm can be updated to avoid this extra memory cost using a blocked matrix multiplication algorithm. The extra memory can be reduced to a negligible amount at the expense of more messages between processors and synchronizations across the parallel machine. We have not yet implemented this update. We present results for webbase-2001 in Figure 8.
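The memory figures quoted above can be reproduced with a back-of-the-envelope sketch. The node count, nonzero count, and index widths below are assumptions chosen only to be consistent with the text ("over 100 million nodes and 1 billion edges"), so the totals are approximate rather than exact:

```python
# Rough memory estimate for webbase-2001 NMF at rank k = 50.
# All counts below are assumptions consistent with the text, not exact values.
m = n = 110_000_000   # assumed node count ("over 100 million nodes")
nnz = 1_000_000_000   # assumed edge count ("1 billion edges")
k = 50
val, idx = 8, 8       # assumed bytes per value and per index

# Dense factors: W is m x k, H is k x n.
factors_gb = (m * k + n * k) * val / 1e9

# A in compressed column format: one value and one row index per nonzero,
# plus n + 1 column pointers.
input_gb = (nnz * (val + idx) + (n + 1) * idx) / 1e9

print(f"W and H: {factors_gb:.0f} GB, A: {input_gb:.0f} GB")
```

Under these assumptions the factors come to 88 GB, matching the figure in the text, and the input lands near the reported 16 GB; the exact numbers depend on the true dimensions and index widths.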
The timing results are consistent with the observations from the other synthetic and real-world sparse datasets discussed in Section 6.3.2, though the raw times are about two orders of magnitude larger, as expected. In the case of the error plot, as observed in other experiments, ABPP outperforms the other algorithms; however, we see that MU reduces error at a faster rate than HALS in the first 30 iterations. At the 30th iteration, the error for HALS was still improving in the third decimal place, whereas MU's was improving only in the fourth. We suspect that over a greater number of iterations the error of HALS could become smaller than that of MU, which would be more consistent with the other datasets.

[Fig. 6: Varying low rank k (k = 10, 20, 30, 40, 50) with p = 864, benchmarking per-iteration times. Panels: (a) Sparse Synthetic, (b) Dense Synthetic, (c) Sparse Real World (webbase-1M), (d) Dense Real World (Video). Each bar breaks per-iteration time for MU, HALS, ABPP, and Naive into All-Reduce, Reduce-Scatter, All-Gather, Gram, LUC, and MM components.]

[Fig. 7: Tuning the processor grid for ABPP on the Sparse Synthetic data set with p = 1536 and k = 50. Grids range from 1 × 1536 through 48 × 32 (the theoretical optimum) to 1536 × 1, with per-iteration time broken into All-Reduce, Reduce-Scatter, All-Gather, Gram, LUC, and MM components.]

[Fig. 8: NMF comparison on webbase-2001 for k = 50 on 1536 processors. Panels: (a) per-iteration time breakdown for MU, HALS, and ABPP; (b) relative error over 30 iterations.]

6.4. Interpretation of Results. In this section, we present results from two of the real-world datasets. The first example shows an image-processing application of background separation and moving-object detection in surveillance video data, and the second shows topic-modeling output on the Stack Exchange text dataset. The details of these datasets are presented in Section 6.1.1. While the literature covers fine-tuning NMF and different NMF variants for higher-quality results on these two tasks in more detail [Zhou and Tao 2011; Bouwmans 2014; Anandkumar et al. 2014; Kim et al. 2015], our main focus is to show how quickly we can produce a baseline NMF output and its real-world interpretation.

6.4.1. Moving Object Detection in Surveillance Video Data. As explained in Section 6.1.1, we processed 12 minutes of video captured from a busy junction at Georgia Tech to separate the background and the moving objects. In Figure 9 we present some sample frames to compare the input image with the separated background and moving objects. The background is given by the low rank approximation Â = WH output by our MPI-FAUN algorithm, and the moving objects are given by A − Â. We can clearly see that the background remains static and the moving objects (e.g., cars) are visible.

[Fig. 9: Moving object detection for video data using NMF. Each row of images corresponds to a particular frame in the video. The left column is the original frame (A), the middle column is the reconstructed frame from the low-rank approximation (WH, which captures the background), and the right column is the difference (A − WH, which captures the moving objects).]

6.4.2.
Topic Modeling of Stack Exchange Data. We downloaded the latest Stack Overflow dump from its archive on 28-Jul-2016. The details of the preprocessing and the sparse matrix generation are explained in Section 6.1.1. We ran our MPI-FAUN algorithm on this dataset, which has nearly 12 million questions from the Stack Overflow site (under Stack Exchange), to produce 50 topics. The matrix W can be interpreted as a vocabulary-topic distribution and H as a topic-document distribution. We took the top 5 words for each of the 50 topics and present them in Table IV. Typically, a good topic generation satisfies properties such as (a) finding discriminative rather than common words, capturing words that provide some information; (b) finding distinct topics, so that the similarity between different topics is low; and (c) coherence, so that all the words belonging to one topic are coherent. There are topic quality metrics [Newman et al. 2010] that capture the usefulness of a topic generation algorithm. We can see that NMF generated generally high-quality and coherent topics. Also, the topics come from different domains such as databases, C/C++ programming, Java programming, and web technologies like PHP and HTML.

7. CONCLUSION

In this paper, we propose a high-performance distributed-memory parallel framework for NMF algorithms that iteratively update the low rank factors in an alternating fashion. Our parallel algorithm is designed to avoid communication overheads and scales well to over 1500 cores. The framework is flexible, being (a) expressive enough to leverage many different NMF algorithms and (b) efficient for both sparse and dense matrices of sizes that span from a few hundred to hundreds of millions. Our open-source software implementation is available for download.
[Table IV: Top 5 words of the 50 topics from the Stack Exchange data set, presented as two side-by-side panels (Topics 1-25 and Topics 26-50) with five stemmed keywords per topic.]

For solving data mining problems at today's scale, parallel computation and distributed-memory systems are becoming prerequisites.
We argue in this paper that by using techniques from high-performance computing, the computations for NMF can be performed very efficiently. Our framework allows the HPC techniques (efficient matrix multiplication) to be separated from the data mining techniques (choice of NMF algorithm), and we compare data mining techniques at large scale, in terms of data sizes and numbers of processors. One conclusion we draw from the empirical and theoretical observations is that the extra per-iteration cost of ABPP over alternatives like MU and HALS decreases as the number of processors p increases, making ABPP more advantageous in terms of both quality and performance at larger scales. By reporting time breakdowns that separate local computation from interprocessor communication, we also see that our efficient algorithm prevents communication from bottlenecking the overall computation; our comparison with a naive approach shows that communication can easily dominate the running time of each iteration.

In future work, we would like to extend the MPI-FAUN algorithm to dense and sparse tensors, computing the CANDECOMP/PARAFAC decomposition in parallel with non-negativity constraints on the factor matrices. We plan on extending our software to include more NMF algorithms that fit the AU-NMF framework; these can be used for both matrices and tensors. We would also like to explore more intelligent distributions of sparse matrices: while our 2D distribution is based on evenly dividing rows and columns, it does not necessarily load balance the nonzeros of the matrix, which can lead to load imbalance in matrix multiplications. We are interested in using graph and hypergraph partitioning techniques to load balance the memory and computation while at the same time reducing communication costs as much as possible.

ACKNOWLEDGMENTS

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S.
Department of Energy. This project was partially funded by the Laboratory Director's Research and Development fund. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy. Also, partial funding for this work was provided by AFOSR Grant FA9550-13-1-0100, National Science Foundation (NSF) grants IIS-1348152 and ACI-1338745, and Defense Advanced Research Projects Agency (DARPA) XDATA program grant FA8750-12-2-0309. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the USDOE, NERSC, AFOSR, NSF or DARPA.

REFERENCES

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15, 1 (2014), 2773–2832.
Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2015. Brief Announcement: Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication. In Proceedings of SPAA. 86–88. http://doi.acm.org/10.1145/2755573.2755613
P. Boldi and S. Vigna. 2004. The Webgraph Framework I: Compression Techniques. In Proceedings of WWW '04. New York, NY, USA, 595–602. http://doi.acm.org/10.1145/988672.988752
Thierry Bouwmans. 2014. Traditional and recent approaches in background modeling for foreground detection: An overview. Computer Science Review 11-12 (2014), 31–66. http://dx.doi.org/10.1016/j.cosrev.2014.04.001
Thierry Bouwmans, Andrews Sobral, Sajid Javed, Soon Ki Jung, and El-Hadi Zahzah. 2015. Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset. arXiv preprint arXiv:1511.01245 (2015).
E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. 2007. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19, 13 (2007), 1749–1783. http://dx.doi.org/10.1002/cpe.1206
Andrzej Cichocki and Phan Anh-Huy. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 92, 3 (2009), 708–721.
Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. 2009. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley.
Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec. 2011), 25 pages. http://dx.doi.org/10.1145/2049662.2049663
J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. In Proceedings of IPDPS. 261–272. http://dx.doi.org/10.1109/IPDPS.2013.80
James P. Fairbanks, Ramakrishnan Kannan, Haesun Park, and David A. Bader. 2015. Behavioral clusters in dynamic graphs. Parallel Comput. 47 (2015), 38–50. http://dx.doi.org/10.1016/j.parco.2015.03.002
Christos Faloutsos, Alex Beutel, Eric P. Xing, Evangelos E. Papalexakis, Abhimanu Kumar, and Partha Pratim Talukdar. 2014. Flexi-FaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop. In Proceedings of the SDM. 109–117. http://epubs.siam.org/doi/abs/10.1137/1.9781611973440.13
Richard Fujimoto, Angshuman Guin, Michael Hunter, Haesun Park, Gaurav Kanitkar, Ramakrishnan Kannan, Michael Milholen, SaBra Neal, and Philip Pecher. 2014. A Dynamic Data Driven Application System for Vehicle Tracking. Procedia Computer Science 29 (2014), 1203–1215. http://dx.doi.org/10.1016/j.procs.2014.05.108
Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. 2011. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the KDD. ACM, 69–77. http://dx.doi.org/10.1145/2020408.2020426
David Grove, Josh Milthorpe, and Olivier Tardieu. 2014. Supporting Array Programming in X10. In Proceedings of the ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY'14). Article 38, 6 pages. http://doi.acm.org/10.1145/2627373.2627380
N. Guan, D. Tao, Z. Luo, and B. Yuan. 2012. NeNMF: An Optimal Gradient Method for Nonnegative Matrix Factorization. IEEE Transactions on Signal Processing 60, 6 (June 2012), 2882–2898. http://dx.doi.org/10.1109/TSP.2012.2190406
Ngoc-Diep Ho, Paul Van Dooren, and Vincent D. Blondel. 2008. Descent methods for Nonnegative Matrix Factorization. CoRR abs/0801.3199 (2008).
Patrik O. Hoyer. 2004. Non-negative matrix factorization with sparseness constraints. JMLR 5 (2004), 1457–1469. www.jmlr.org/papers/volume5/hoyer04a/hoyer04a.pdf
Ramakrishnan Kannan, Grey Ballard, and Haesun Park. 2016. A High-performance Parallel Algorithm for Nonnegative Matrix Factorization. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). ACM, New York, NY, USA, 9:1–9:11. http://doi.acm.org/10.1145/2851141.2851152
Oguz Kaya and Bora Uçar. 2015. Scalable Sparse Tensor Decompositions in Distributed Memory Systems. In Proceedings of SC. ACM, Article 77, 11 pages. http://doi.acm.org/10.1145/2807591.2807624
Hannah Kim, Jaegul Choo, Jingu Kim, Chandan K. Reddy, and Haesun Park. 2015. Simultaneous Discovery of Common and Discriminative Topics via Joint Nonnegative Matrix Factorization. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015. 567–576. http://dx.doi.org/10.1145/2783258.2783338
Hyunsoo Kim and Haesun Park. 2007. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 12 (2007), 1495–1502. http://dx.doi.org/10.1093/bioinformatics/btm134
Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization 58, 2 (2014), 285–319. http://dx.doi.org/10.1007/s10898-013-0035-4
Jingu Kim and Haesun Park. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing 33, 6 (2011), 3261–3281. http://dx.doi.org/10.1137/110821172
Da Kuang, Chris Ding, and Haesun Park. 2012. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of SDM. 106–117. http://epubs.siam.org/doi/pdf/10.1137/1.9781611972825.10
Da Kuang, Sangwoon Yun, and Haesun Park. 2013. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization (2013), 1–30. http://dx.doi.org/10.1007/s10898-014-0247-2
Ruiqi Liao, Yifan Zhang, Jihong Guan, and Shuigeng Zhou. 2014. CloudNMF: A MapReduce Implementation of Nonnegative Matrix Factorization for Large-scale Biological Datasets. Genomics, Proteomics & Bioinformatics 12, 1 (2014), 48–51. http://dx.doi.org/10.1016/j.gpb.2013.06.001
Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. 2010. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of the WWW. ACM, 681–690. http://dx.doi.org/10.1145/1772690.1772760
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (April 2012), 716–727. http://dx.doi.org/10.14778/2212351.2212354
Edgardo Mejía-Roa, Daniel Tabas-Madrid, Javier Setoain, Carlos García, Francisco Tirado, and Alberto Pascual-Montano. 2015. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics 16, 1 (2015), 43. http://dx.doi.org/10.1186/s12859-015-0485-4
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2015. MLlib: Machine Learning in Apache Spark. (26 May 2015). http://arxiv.org/abs/1505.06807
David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 100–108.
V. Paul Pauca, Farial Shahnaz, Michael W. Berry, and Robert J. Plemmons. 2004. Text mining using nonnegative matrix factorizations. In Proceedings of SDM.
Conrad Sanderson. 2010. Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments. Technical Report, NICTA. http://arma.sourceforge.net/armadillo_nicta_2010.pdf
Nadathur Satish, Narayanan Sundaram, Md Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. 2014. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 979–990.
D. Seung and L. Lee. 2001. Algorithms for non-negative matrix factorization. NIPS 13 (2001), 556–562.
D. L. Sun and C. Févotte. 2014. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6201–6205. http://dx.doi.org/10.1109/ICASSP.2014.6854796
Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of Collective Communication Operations in MPICH. International Journal of High Performance Computing Applications 19, 1 (2005), 49–66. http://hpc.sagepub.com/content/19/1/49.abstract
Yu-Xiong Wang and Yu-Jin Zhang. 2013. Nonnegative Matrix Factorization: A Comprehensive Review. TKDE 25, 6 (June 2013), 1336–1353. http://dx.doi.org/10.1109/TKDE.2012.51
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (2009), 178–194.
Zhang Xianyi. OpenBLAS. http://www.openblas.net (Last accessed 03-Dec-2015).
Jiangtao Yin, Lixin Gao, and Zhongfei (Mark) Zhang. 2014. Scalable Nonnegative Matrix Factorization with Block-wise Updates. In Machine Learning and Knowledge Discovery in Databases (LNCS), Vol. 8726. 337–352. http://dx.doi.org/10.1007/978-3-662-44845-8_22
Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, S. V. N. Vishwanathan, and Inderjit Dhillon. 2014. NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. Proceedings of the VLDB Endowment 7, 11 (2014), 975–986.
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, 10–10. http://dl.acm.org/citation.cfm?id=1863103.1863113
Tianyi Zhou and Dacheng Tao. 2011. GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 33–40.
