Fast and Efficient Boolean Matrix Factorization by Geometric Segmentation

Changlin Wan 1,2, Wennan Chang 1,2, Tong Zhao 3, Mengya Li 2, Sha Cao 2,*, Chi Zhang 2,*
1 Purdue University, 2 Indiana University, 3 Amazon
{wan82,chang534}@purdue.edu, zhaoton@amazon.com, imyli1024@gmail.com, {shacao,czhang87}@iu.edu

Abstract

Boolean matrices have been used to represent digital information in many fields, including bank transactions, crime records, natural language processing, protein-protein interaction, etc. Boolean matrix factorization (BMF) aims to decompose a Boolean matrix into the product of two low-rank Boolean matrices, benefiting a number of applications on Boolean matrices, e.g., data denoising, clustering, dimension reduction and community detection. Inspired by binary matrix permutation theories and geometric segmentation, in this work we developed a fast and scalable BMF approach, called MEBF (Median Expansion for Boolean Factorization). MEBF adopts a heuristic approach to locate binary patterns presented as submatrices that are dense in 1s. In each iteration, MEBF permutates the rows and columns such that the permutated matrix is approximately Upper Triangular-Like (UTL) with the so-called Simultaneous Consecutive-ones Property (SC1P). The largest submatrix dense in 1s would lie in the upper triangular area of the permutated matrix, and its location is determined by a geometric segmentation of a triangle. We compared MEBF with state-of-the-art BMF baselines on data scenarios with different density and noise levels. Through comprehensive experiments, MEBF demonstrated superior performance in lower reconstruction error and higher computational efficiency, as well as more accurate density pattern mining, than state-of-the-art methods such as ASSO, PANDA and Message Passing.
We also present the application of MEBF on non-binary data sets, and reveal its further potential in knowledge retrieval and data denoising on general matrix factorization problems.

* Corresponding author. Copyright (c) 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Binary data gains more and more attention during the transformation of modern living (Kocayusufoglu, Hoang, and Singh 2018; Balasubramaniam, Nayak, and Yuen 2018). It makes up a large domain of our everyday life, where the 1s or 0s in a binary matrix can physically mean whether or not an event of online shopping transaction, web browsing, medical record, journal submission, etc., has occurred. The scale of these datasets has increased exponentially over the years. Mining the patterns within binary data, as well as adapting to the drastic increase of dimensionality, is of prominent interest for today's data science research. Recent studies also showed that some continuous data could benefit from binary pattern mining. For instance, the binarization of continuous single-cell gene expression data into its on and off states can better reflect the coordination patterns of genes in regulatory networks (Larsson et al. 2019). However, owing to its two-value characteristic, the rank of a binary matrix under normal linear algebra can be very high due to certain spike rows or columns. This makes it infeasible to apply established methods such as SVD and PCA for BMF (Wall, Rechtsteiner, and Rocha 2003). Boolean matrix factorization (BMF) has been developed particularly for binary pattern mining; it factorizes a binary matrix into approximately the product of two low-rank binary matrices following Boolean algebra, as shown in Figure 1. The decomposition of a binary matrix into low-rank binary patterns is equivalent to locating submatrices that are dense in 1s.
Analyzing binary matrices with BMF shows its unique power. In the most optimal case, it significantly reduces the rank of the original matrix, as calculated under normal linear algebra, to its log scale (Monson, Pullman, and Rees 1995). Since the binary patterns are usually embedded within a noisy and randomly arranged binary matrix, BMF is known to be an NP-hard problem (Miettinen et al. 2008).

Background

Related work

BMF was first introduced as a set basis problem in 1975 (Stockmeyer 1975). The area has received wide attention after a series of works by Miettinen et al. (Miettinen et al. 2008; Miettinen and Vreeken 2014; Karaev, Miettinen, and Vreeken 2015). Among them, the ASSO algorithm performs factorization by retrieving binary bases from a row-wise correlation matrix in a heuristic manner (Miettinen et al. 2008). Despite its popularity, the high computational cost of ASSO makes it impracticable when dealing with large-scale data. More recently, an algorithm called Nassau was developed by the same group (Karaev, Miettinen, and Vreeken 2015). Nassau optimizes the initialization of the matrix factorization by locating dense seeds hidden within the matrix, with improved performance compared to ASSO; however, optimal parameter selection remains a challenge for Nassau.

Figure 1: BMF, the addition of rank-1 binary matrices.

A second series of works, called PANDA, was developed by Lucchese et al. (Lucchese, Orlando, and Perego 2010; Lucchese, Orlando, and Perego 2013). PANDA aims to find the most significant patterns in the current binary matrix by discovering core patterns iteratively (Lucchese, Orlando, and Perego 2010). After each iteration, PANDA retains only a residual matrix in which all the non-zero values covered by identified patterns have been removed. Later, PANDA+ was developed to reduce the noise level in core pattern detection and extension (Lucchese, Orlando, and Perego 2013).
These two methods also suffer from prohibitive computational cost, as they need to recalculate a global loss function at each iteration. More algorithms and applications of BMF have been proposed in recent years. FastStep (Araujo, Ribeiro, and Faloutsos 2016) relaxed the BMF constraints to non-negativity by integrating non-negative matrix factorization (NMF) and Boolean thresholding, but interpreting the derived non-negative bases can be challenging. With prior network information, Kocayusufoglu et al. decompose a binary matrix in a stepwise fashion with bases sampled from a given network space (Kocayusufoglu, Hoang, and Singh 2018). Bayesian probability mapping has also been applied in this field. Ravanbakhsh et al. proposed a probabilistic graphical model called factor-graph to characterize the embedded patterns, and developed a message passing approach, called MP (Ravanbakhsh, Poczos, and Greiner 2016). Similarly, Ormachine, proposed by Rukat et al., provides a probabilistic generative model for BMF (Rukat et al. 2017). These Bayesian approaches likewise suffer from low computational efficiency, and Bayesian model fitting can be highly sensitive to noisy data.

Notations

A matrix is denoted by an uppercase character with a superscript n × m indicating its dimension, such as X^{n×m}, and with subscripts X_{i,:}, X_{:,j}, X_{ij} indicating the i-th row, j-th column, and (i,j)-th element, respectively. A vector is denoted by a bold lowercase character, such as a, with subscript a_i indicating its i-th element. A scalar value is represented by a lowercase character, such as a, with [a] its integer part. |X| and |x| denote the ℓ1 norm of a matrix and a vector, respectively. Under Boolean algebra, the basic operations include ∧ (AND: 1∧1=1, 1∧0=0, 0∧0=0), ∨ (OR: 1∨1=1, 0∨1=1, 0∨0=0), and ¬ (NOT: ¬1=0, ¬0=1).
Denote the Boolean element-wise sum, subtraction and product as A ⊕ B = A ∨ B, A ⊖ B = (A ∧ ¬B) ∨ (¬A ∧ B), and A ⊛ B = A ∧ B, and the Boolean matrix product of two Boolean matrices as X^{n×m} = A^{n×k} ⊗ B^{k×m}, where X_{ij} = ∨_{l=1}^{k} (A_{il} ∧ B_{lj}).

Figure 2: Three simplified scenarios for UTL matrices with direct SC1P.

Problem statement

Given a binary matrix X ∈ {0,1}^{n×m} and a criteria parameter τ, the BMF problem is defined as identifying two binary matrices A* and B*, called pattern matrices, that minimize the cost function γ(A, B; X) under criteria τ, i.e., (A*, B*) = argmin_{A,B} (γ(A, B; X) | τ). The criteria τ can vary with different problem assumptions. The criteria used in the current study is to identify A* and B* with at most k patterns, i.e., A ∈ {0,1}^{n×k}, B ∈ {0,1}^{k×m}, and the cost function is γ(A, B; X) = |X ⊖ (A ⊗ B)|. We call the l-th column of matrix A together with the l-th row of matrix B the l-th binary pattern, or the l-th basis, l = 1, ..., k.

MEBF Algorithm Framework

Motivation of MEBF

BMF is equivalent to decomposing the matrix into the sum of multiple rank-1 binary matrices, each of which is also referred to as a pattern or basis in the BMF literature (Lucchese, Orlando, and Perego 2010).

Lemma 1 (Submatrix detection). Let A*, B* be the solution to argmin_{A∈{0,1}^{n×k}, B∈{0,1}^{k×m}} |X ⊖ (A ⊗ B)|; then the k patterns identified in A*, B* correspond to k submatrices in X that are dense in 1s. In other words, finding A*, B* is equivalent to identifying submatrices X_{I_l,J_l}, I_l ⊂ {1,...,n}, J_l ⊂ {1,...,m}, l = 1,...,k, s.t. |X_{I_l,J_l}| ≥ t_0 (|I_l| · |J_l|). Here |I_l| is the cardinality of the index set I_l, and t_0 is a positive number between 0 and 1 that controls the noise level of X_{I_l,J_l}.

Proof.
∀l, it suffices to let I_l be the indices of the l-th column of A* such that A*_{:,l} = 1, and let J_l be the indices of the l-th row of B* such that B*_{l,:} = 1.

Figure 3: A schematic overview of the MEBF pipeline for three data scenarios where the matrix is roughly UTL with SC1P (panels a1-a9 and b1-b9: original matrix, approximate UTL with direct SC1P, bidirectional growth steps, residual matrices, weak signal detection, MEBF decomposition, and MEBF patterns).

Motivated by Lemma 1, instead of looking for patterns directly, we turn to identifying large submatrices of X that are enriched in 1s, such that each submatrix corresponds to one binary pattern or basis.

Definition 1 (Direct consecutive-ones property, direct C1P). A binary matrix X has direct C1P if, for each of its row vectors, all 1s occur at consecutive indices.
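As a concrete rendering of the problem statement, the Boolean matrix product and the cost function γ can be written in a few lines of NumPy. This is a minimal sketch; `boolean_product` and `cost` are our own illustrative names, not part of any MEBF release.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product A (x) B: X_ij = OR_l (A_il AND B_lj).
    For 0/1 integer arrays, an ordinary matmul clipped to {0, 1}
    gives exactly the Boolean product."""
    return (A @ B > 0).astype(int)

def cost(A, B, X):
    """gamma(A, B; X) = |X (-) (A (x) B)|: the Boolean subtraction (-)
    is element-wise XOR, so the cost counts mismatched entries."""
    return int(np.sum(X ^ boolean_product(A, B)))

# A toy 4x4 matrix assembled from k = 2 rank-1 Boolean patterns.
A = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
B = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
X = boolean_product(A, B)
assert cost(A, B, X) == 0  # an exact factorization has zero cost
```

Note that the last row of X is covered by both patterns; under Boolean algebra the overlap simply saturates at 1 rather than summing to 2.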
Definition 2 (Simultaneous consecutive-ones property, SC1P). A binary matrix X has direct SC1P if both X and X^T have direct C1P; and a binary matrix X has SC1P if there exists a permutation of the rows and columns such that the permutated matrix has direct SC1P.

Definition 3 (Upper Triangular-Like matrix, UTL). A binary matrix X^{m×n} is called an Upper Triangular-Like (UTL) matrix if 1) ∑_{i=1}^{m} X_{i1} ≤ ∑_{i=1}^{m} X_{i2} ≤ ··· ≤ ∑_{i=1}^{m} X_{in}; 2) ∑_{j=1}^{n} X_{1j} ≥ ∑_{j=1}^{n} X_{2j} ≥ ··· ≥ ∑_{j=1}^{n} X_{mj}. In other words, the matrix has non-increasing row sums from top down, and non-decreasing column sums from left to right.

Lemma 2 (UTL matrix with direct SC1P). Assume X has no all-zero rows or columns. If X is a UTL matrix and has direct SC1P, then an all-1 submatrix of the largest area in X is seeded where one of its columns lies in the median column of the matrix, or one of its rows lies in the median row of the matrix, as shown in Figure 2.

Figure 2 presents three simplified scenarios of a UTL matrix with direct SC1P. In (a) and (b), the 1s are organized in a triangular shape, where certain rows in (a) and certain columns in (b) are all zero; in (c), the 1s are shaped in a block diagonal. After removing all-zero rows and columns, the upper triangular area of the shuffled matrix is dense in 1s. It is easy to show that the rectangle of the largest area inscribed in a right triangle is the one defined by the three midpoints of the sides together with the right-angle vertex, colored red in Figure 2. The width and height of this rectangle equal half of the two legs of the triangle, i.e., (m/2, n0/2), (m0/2, n/2) and (m/2, n/2) for the three scenarios in Figure 2, respectively. According to Lemma 2, this largest rectangle contains at least one row or one column (colored in yellow) of the largest all-1 submatrix of the matrix.
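The sorting step implied by Definition 3 and the direct SC1P check of Definition 2 are both straightforward to express in NumPy. The following is an illustrative sketch with our own helper names, not the authors' implementation:

```python
import numpy as np

def utl_permute(X):
    """Sort rows by non-increasing row sums and columns by
    non-decreasing column sums, i.e. toward the UTL form of Definition 3."""
    row_order = np.argsort(-X.sum(axis=1), kind="stable")
    col_order = np.argsort(X.sum(axis=0), kind="stable")
    return X[row_order][:, col_order]

def has_direct_sc1p(X):
    """Direct SC1P (Definition 2): X and X^T both have direct C1P,
    i.e. the 1s in every row and in every column form one consecutive run."""
    def direct_c1p(M):
        for row in M:
            ones = np.flatnonzero(row)
            if ones.size and ones[-1] - ones[0] + 1 != ones.size:
                return False
        return True
    return direct_c1p(X) and direct_c1p(X.T)

# A small matrix whose UTL permutation has direct SC1P.
X = np.array([[0, 1, 1],
              [1, 1, 1],
              [0, 0, 1]])
Xp = utl_permute(X)
assert has_direct_sc1p(Xp)
```

On real data the permutation only approximates UTL with direct SC1P, which is why the paper treats it as a heuristic step rather than an exact transformation.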
Consequently, starting with one row or column, expansion with new rows or columns can be done easily by adding those that show strong similarity to the first row or column. After the expansion concludes, one can determine whether to retain the submatrix expanded row-wise or column-wise, whichever reduces the cost function more. It is common that an underlying SC1P pattern does not exist for a binary matrix, in which case we turn to finding the matrix with closest SC1P.

Definition 4 (Closest SC1P). Given a binary matrix X and a nonnegative weight matrix W, a matrix X̂ that has SC1P and minimizes the distance d_W(X, X̂) is the closest SC1P matrix of X.

Based on Lemma 2, we could find all the submatrices in Lemma 1 by first permutating the rows and columns of X to obtain a UTL matrix with closest direct SC1P, then locating the largest all-1 submatrix as our first binary pattern. We are then left with a residual matrix whose entries covered by existing patterns are set to zero, and we repeat the process on the residual matrix until convergence. However, finding the matrix of closest SC1P for a matrix X is NP-hard (Junttila and others 2011; Oswald and Reinelt 2009).

Lemma 3 (Closest SC1P). Given a binary matrix X and a nonnegative weight matrix W, finding a matrix X̂ that has SC1P and minimizes the distance d_W(X, X̂) is an NP-hard problem.

The NP-hardness of the closest SC1P problem was shown in (Oswald and Reinelt 2009; Junttila 2011). Both exact and heuristic algorithms are known for the problem, and it has also been shown that if the number of rows or columns is bounded, solving closest SC1P requires only polynomial time (Oswald 2003). In our MEBF algorithm, we address it with heuristic and approximation methods.

Overview

Overall, MEBF adopts a heuristic approach to locate submatrices that are dense in 1s iteratively.
Starting with the original matrix as the residual matrix, at each iteration MEBF permutates the rows and columns of the current residual matrix so that the 1s gather in the upper triangular area. This step approximates the permutation it takes to make a matrix UTL with direct SC1P. Then, as illustrated in Figures 2 and 3, the rectangle of the largest area in the upper triangle, presumably with the highest frequency of 1s, is captured. The pattern corresponding to this submatrix represents a good rank-1 approximation of the current residual matrix. Before the end of each iteration, the residual matrix is updated by flipping all the 1s located in the identified submatrix to 0.

As shown in Figure 3a, for an input Boolean matrix (a1), MEBF first rearranges the matrix to obtain an approximate UTL matrix with closest direct SC1P. This is achieved by reordering the rows so that the row norms are non-increasing, and the columns so that the column norms are non-decreasing (a2). Then, MEBF takes either the column or the row with the median number of 1s as one basis or pattern (a3). As the name suggests, MEBF then performs a median expansion step, in which the median column or row propagates to other columns or rows via a bidirectional growth algorithm until a stopping criterion is met. Whether to choose the pattern expanded row-wise or column-wise depends on which one minimizes the cost function with regard to the current residual matrix. Before the end of each iteration, MEBF computes the new residual matrix by Boolean subtraction of the newly selected rank-1 pattern matrix from the current residual matrix (a4). This process continues until the convergence criterion is met.
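One iteration of the median-expansion idea, restricted to the column direction for brevity, can be sketched as follows. The similarity score (overlap with the median column, normalized by its number of 1s) and the choice of median among the non-empty columns are our illustrative reading of the procedure, not the authors' exact implementation:

```python
import numpy as np

def column_pattern(X, t=0.7):
    """Pick the median column of the UTL-ordered matrix and expand it:
    columns overlapping it strongly enough (threshold t) join the pattern."""
    nz = np.flatnonzero(X.sum(axis=0) > 0)            # drop all-zero columns
    order = nz[np.argsort(X[:, nz].sum(axis=0), kind="stable")]
    med = order[len(order) // 2]                      # median column index
    a = X[:, med]                                     # candidate basis column
    sim = (X.T @ a) / a.sum()                         # overlap with every column
    b = (sim > t).astype(int)                         # columns joining the pattern
    return a, b

X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0]])
a, b = column_pattern(X, t=0.7)
# The rank-1 pattern a (x) b covers the all-1 block on rows {0, 1, 3}
# and columns {0, 1} of X.
```

The full algorithm repeats this in both directions and keeps whichever of the row-wise or column-wise expansions lowers the cost function more.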
If the patterns identified by the bidirectional growth step stop decreasing the cost function before the convergence criterion is met, another step called weak signal detection is conducted (a6, a7). Figure 3b illustrates a special case in which the permutated matrix is roughly block diagonal (b1), corresponding to the third scenario in Figure 2. The same procedure as shown in Figure 3a guarantees the accurate location of all the patterns. The computational complexities of the bidirectional growth and weak signal detection algorithms are both O(nm), so the complexity of each MEBF iteration is O(nm). The main algorithm of MEBF is illustrated below.

Bidirectional Growth

For an input binary (residual) matrix X, we first rearrange X by reordering the rows and columns so that the row norms are non-increasing and the column norms are non-decreasing. The rearranged X, after removing its all-zero columns and rows, is denoted X'; the median column and median row of X' are denoted X'_{:,med} and X'_{med,:}. Denote X_{:,(med)} and X_{(med),:} as the column and row of X corresponding to X'_{:,med} and X'_{med,:}. The similarity between X_{:,(med)} and the columns of X can be computed as a column-wise correlation vector m ∈ (0,1)^m, where m_j = ⟨X_{:,(med)}, X_{:,j}⟩ / ⟨X_{:,(med)}, X_{:,(med)}⟩. Similarly, the similarity between X_{(med),:} and the rows of X can be computed as a vector n ∈ (0,1)^n, with n_i = ⟨X_{(med),:}, X_{i,:}⟩ / ⟨X_{(med),:}, X_{(med),:}⟩. A pre-specified threshold t ∈ (0,1) is then applied, and two vectors e and f indicating the similarity strength of the columns and rows of X with X_{:,(med)} and X_{(med),:} are obtained, where e_j = 1(m_j > t) and f_i = 1(n_i > t). The binary vectors e and f each represent one potential BMF pattern. In each iteration, we select the row or column pattern, whichever fits the current residual matrix better, i.e.
the column pattern if γ(X_{:,(med)}, e; X) < γ(f, X_{(med),:}; X), or the row pattern otherwise. Here, the cost function for a single pattern (a, b) is γ(a, b; X) = |X ⊖ (a ⊗ b)|. This is equivalent to selecting the pattern that achieves the lower overall cost function at the current step. Evidently, a smaller t achieves higher coverage with fewer patterns, while a larger t enables a sparser decomposition of the input matrix with a greater number of patterns. Patterns found by bidirectional growth do not guarantee a constant decrease of the cost function; in case the cost function increases, we adopt a weak signal detection step before stopping the algorithm.

Algorithm 1: MEBF
Inputs: X ∈ {0,1}^{n×m}, t ∈ (0,1), τ
Outputs: A* ∈ {0,1}^{n×k}, B* ∈ {0,1}^{k×m}
MEBF(X, t, τ):
    X_residual ← X, γ0 ← ∞
    A* ← NULL, B* ← NULL
    while !τ do
        (a, b) ← bidirectional_growth(X_residual, t)
        A_tmp ← append(A*, a); B_tmp ← append(B*, b)
        if γ(A_tmp, B_tmp; X) > γ0 then
            (a, b) ← weak_signal_detection(X_residual, t)
            A_tmp ← append(A*, a); B_tmp ← append(B*, b)
            if γ(A_tmp, B_tmp; X) > γ0 then
                break
        A* ← append(A*, a); B* ← append(B*, b)
        γ0 ← γ(A*, B*; X)
        X_residual,ij ← 0 where (a ⊗ b)_ij = 1
    end

Algorithm 2: Bidirectional Growth
Inputs: X ∈ {0,1}^{n×m}, t ∈ (0,1]
Outputs: (a, b)
bidirectional_growth(X, t):
    X' ← UTL operation on X
    d ← X_{:,(med)}; e_j ← 1(⟨d, X_{:,j}⟩ / ⟨d, d⟩ > t), j = 1,...,m
    f ← X_{(med),:}; g_i ← 1(⟨f, X_{i,:}⟩ / ⟨f, f⟩ > t), i = 1,...,n
    if γ(d, e; X) > γ(g, f; X) then
        a ← g; b ← f
    else
        a ← d; b ← e
    end

Weak Signal Detection Algorithm

The bidirectional growth step does not guarantee a constant decrease of the cost function, especially after the "large" patterns have been identified and the remaining "small" patterns are easily confused with noise.
To identify weak patterns from a residual matrix, we devised a weak signal detection algorithm to locate regions that may still contain small but true patterns. Here, from the current residual matrix, we take the two columns with the most 1s and form a new column as their intersection, and the two rows with the most 1s and form a new row as their intersection. Starting from the new column and new row as candidate patterns, similar to bidirectional growth, we locate the rows or columns of the residual matrix that have high enough similarity to each candidate, thus expanding a single row or column into a submatrix. The pattern among the two with the lower cost function with regard to the residual matrix is selected. If adding this pattern to the existing patterns decreases the cost function with regard to the original matrix, it is retained; otherwise, the algorithm stops.

Algorithm 3: Weak Signal Detection
Inputs: X ∈ {0,1}^{n×m}, t ∈ (0,1]
Outputs: (a, b)
weak_signal_detection(X, t):
    X' ← UTL operation on X
    d1 ← X'_{:,m} ∧ X'_{:,m−1}; e1_j ← 1(⟨d1, X_{:,j}⟩ / ⟨d1, d1⟩ > t), j = 1,...,m
    e2 ← X'_{1,:} ∧ X'_{2,:}; d2_i ← 1(⟨e2, X_{i,:}⟩ / ⟨e2, e2⟩ > t), i = 1,...,n
    l ← argmin_{l∈{1,2}} γ(d_l, e_l; X)
    a ← d_l; b ← e_l

Experiment

Simulation data

We first compared MEBF with three state-of-the-art approaches, ASSO, PANDA and Message Passing (MP), on simulated datasets. A binary matrix X^{n×m} is simulated as

X^{n×m} = U^{n×k} ⊗ V^{k×m} +_f E, where U_ij, V_ij ~ Bernoulli(p0) and E_ij ~ Bernoulli(p).

Here "+_f" is a flipping operation, s.t. X_ij = ∨_{l=1}^{k} (U_il ∧ V_lj) if E_ij = 0, and X_ij = ¬ ∨_{l=1}^{k} (U_il ∧ V_lj) if E_ij = 1. The parameter p0 controls the density level of the true patterns, and E is introduced as noise that flips binary values, with the noise level regulated by the parameter p.
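The generative model above is easy to reproduce: since "+_f" flips an entry exactly where E_ij = 1, it amounts to an XOR with the noise mask. A sketch with our own function name `simulate`:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, m, k, p0, p):
    """Draw U, V, E entrywise from Bernoulli distributions and return
    X = (U (x) V) +_f E, i.e. the Boolean product XOR-ed with the noise."""
    U = (rng.random((n, k)) < p0).astype(int)
    V = (rng.random((k, m)) < p0).astype(int)
    E = (rng.random((n, m)) < p).astype(int)
    X = ((U @ V > 0).astype(int)) ^ E   # flipping operation = XOR with E
    return X, U, V

X, U, V = simulate(100, 100, 5, 0.2, 0.01)
```

This mirrors the small-scale setting used in the experiments (n = m = 100, k = 5, p0 = 0.2, p = 0.01).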
We simulated two data scales, a small one with n = m = 100 and a large one with n = m = 1000. For each data scale, the number of patterns k is set to 5, and we used two density levels, p0 = 0.2 and 0.4, and two noise levels, p = 0 and 0.01. Fifty simulations were run for each data scale under each scenario. (The code is available at https://github.com/clwan/MEBF.)

We evaluate the goodness of the algorithms by considering two metrics, namely the reconstruction error and density (Belohlavek, Outrata, and Trnecka 2018; Rukat et al. 2017), defined as:

Reconstruction error := |(U ⊗ V) ⊖ (A* ⊗ B*)| / |U ⊗ V|

Density := (|A*^{n×k}| + |B*^{k×m}|) / ((n + m) × k)

Here U, V are the ground-truth patterns while A* and B* are the patterns decomposed by each algorithm. The density metric is introduced to evaluate whether the decomposed patterns reflect the sparsity/density levels of the true patterns. Notably, at the same reconstruction error, patterns of lower density, i.e., higher sparsity, are more desirable, as they lead to more parsimonious models.

In Figures 4 and 5, we show that, compared with ASSO, PANDA and MP, MEBF is the fastest and most robust algorithm. The convergence criteria for the algorithms are set as: (1) 10 patterns have been identified; or (2) for MEBF, PANDA and ASSO, a newly identified pattern does not decrease the cost function.

As shown in Figure 4, MEBF has the best performance on small and big matrices for all four scenarios, over 50 simulations each. It achieved the lowest reconstruction error with the least computation time compared with all other algorithms. The convergence rate of MEBF also outperforms PANDA and MP. Though ASSO converges early with the fewest patterns, its reconstruction error is considerably higher than MEBF's, especially for high-density matrices.
In addition, ASSO-derived patterns tend to be denser than the true patterns, while those derived by the other three methods have density levels similar to the true patterns. By increasing the number of patterns, PANDA steadily decreased the reconstruction error, but it has a considerably slow convergence rate and high computational cost. MP struggled to fit small matrices, and in the case of a low-density matrix with noise, MP-derived patterns did not converge. The standard deviations of reconstruction error and density across the 50 simulations are quite low, as indicated by the sizes of the shapes in Figure 4. The computational cost and its standard deviation for each algorithm are shown as bar plots in Figure 5; clearly, MEBF has the best computational efficiency among all.

Real world data application

We next focus on the application of MEBF to two real-world datasets and its performance comparison with MP. Both datasets, Chicago Crime records (X ∈ {0,1}^{6787×752}) and Head and Neck Cancer Single Cell RNA Sequencing data (X ∈ {0,1}^{344×5902}), are publicly available.
The Chicago crime records were downloaded on August 20, 2019 from https://data.cityofchicago.org/Public-Safety; the head and neck sequencing data can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103322.

Figure 4: Performance comparisons of MEBF, ASSO, PANDA and MP on the accuracy of decomposition (reconstruction error and density versus number of patterns, for small 100×100 and big 1000×1000 matrices under each density/noise scenario).

Figure 5: Performance comparison of MEBF, ASSO, PANDA and MP on computational cost.

The computational costs of ASSO and PANDA are too prohibitive for datasets of such a large scale, so they were dropped from the comparisons.
Due to the lack of ground truth for the two low-rank factor matrices and the possibly high noise level in real-world datasets, it may not be reasonable to examine the reconstruction error with respect to the original matrix. Instead, we focused on two metrics, the density and coverage levels. Density is defined as in the simulation study, and the coverage rate is defined as

Coverage rate := |X ⊛ (A* ⊗ B*)| / |X|

At the same reconstruction error, binary patterns of higher sparsity, i.e., lower density, are more desirable, as they make further interpretation easier and avoid possible overfitting. On the other hand, in many real-world data, an observed 0 is more likely to be a false negative than an observed 1 is to be a false positive. In this regard, a higher coverage rate, meaning higher recovery of the 1s, is a more reasonable metric than a lower reconstruction error against the noisy original matrix. We compared MEBF and MP for k = 5 and k = 20; the density and coverage rate of the derived patterns and the time consumption of the two algorithms are presented in Table 1. Overall, for both k = 5 and k = 20, MEBF outperforms MP in all three measures: higher coverage rate, roughly half the density of MP, and approximately 1% of MP's time consumption. Also of note, with the increased number of patterns, the coverage rate of MP unexpectedly drops from 0.812 to 0.807 on the crime data, suggesting the low robustness of MP.

Next we discuss in detail the application of BMF to discrete data mining and continuous data mining, and present the findings on the two datasets using MEBF.

Table 1: Comparison of MEBF and MP on real-world data

                     Coverage (MEBF/MP)   Density (MEBF/MP)    Time/s (MEBF/MP)
Crime k=5            0.835/0.812          0.019/0.027          2.913/333.601
Crime k=20           0.891/0.807          0.030/0.066          10.608/992.011
Single cell k=5      0.496/0.463          2.06e-4/2.86e-4      1.846/137.623
Single cell k=20     0.626/0.580          3.34e-4/7.22e-4      5.954/390.217

Discrete data mining

Chicago is the most populous city in the US Midwest, and it has one of the highest crime rates in the US. It is well understood that the majority of crimes, such as theft and robbery, have strong date patterns. For example, crimes committed to repay regular debts such as credit cards follow a strong date pattern within each month. Here we apply MEBF to Chicago crime data from 2001 to 2019 to find crime patterns over time and date for different regions. The crime patterns are useful for the allocation of police force, and could also reflect the overall standard of living of regions in general. We divided the Chicago area into ~800 regions of roughly equal size. For each of the 19 years, a binary matrix X_d^{n×m} for the d-th year is constructed, where n is the number of dates in the year and m is the number of regions; X^d_{i,j} = 1 means that a crime was reported on date i in region j in year d, and X^d_{i,j} = 0 otherwise. MEBF was then applied to each of the constructed binary matrices with parameters (t = 0.7, k = 20), producing outputs A_d^{n×k} and B_d^{k×m}. The reconstructed binary matrix is accordingly calculated as X_d* = A_d^{n×k} ⊗ B_d^{k×m}. A crime index was defined as the total number of days with crime committed in region j in year d: C^d_j = ∑_{i=1}^{n} X^d_{i,j}.

Figure 6: MEBF analysis of Chicago crime data over the years (panels A, B, C).

Figure 6 shows the crime patterns over time, and only even years are shown due to space limit.
In Figure 6A, for the years 2002-2018, the crime index calculated from the reconstructed matrix, namely C^d*_j = Σ_{i=1}^{n} X^d*_{i,j}, is shown on the y-axis for all regions along the x-axis. Points colored in red indicate regions whose crime index equals the total number of dates in the year, i.e., 365 or 366; these regions are so heavily plagued with crime that there is no day without a crime committed. Points colored in green indicate the opposite, namely regions with no crimes committed over the entire year. All other points are colored in gray. Figure 6B shows the crime index computed on the original matrix. Clearly, the green and red regions are distinctly separated, i.e., green at the bottom with low crime index and red at the top with high crime index. This shows the consistency of the crime patterns between the reconstructed and original crime data, and thus validates the effectiveness of MEBF pattern mining. Notably, the dramatic decrease in crime index starting from 2008, shown in Figures 6A and 6B, correlates with the reported crime decrease in the Chicago area since 2008. Figure 6C shows the crime trend over the years on the map of Chicago. Clearly, from 2008 onward there is a gradual increase in the green regions and a decrease in the red regions, indicating an overall positive transformation for Chicago. This result also indicates that more police force could be allocated between the red and green regions when available.

Continuous data denoising

Binary matrix factorization can also be helpful for continuous matrix factorization, as the Boolean rank of a matrix can be much smaller than its rank under linear algebra.
Recently, clustering of single cells using single-cell RNA sequencing data has gained extensive utility in many fields. The biggest challenge is that the high-dimensional gene features, mostly noise features, make the distance measure between single cells in clustering algorithms highly unreliable. Here we aim to use MEBF to denoise the continuous matrix while clustering. We collected a single-cell RNA sequencing (scRNA-seq) dataset (Puram et al. 2017) that measured more than 300 gene expression features for over 5,000 single cells, i.e., X of dimension 5000 × 300. We first binarize the original matrix X into X*, s.t. X*_{ij} = 1 where X_{ij} > 0, and X*_{ij} = 0 otherwise. Then, applying MEBF to X* with parameters (t = 0.6, k = 5) outputs A (n × k) and B (k × m). Let X** = A ⊗ B, and let X_use be the element-wise product of X and X**, namely X_use = X · X**. X_use represents a denoised version of X, retaining only the entries of X with true non-zero gene expression, as reconstructed from the hidden patterns extracted by MEBF. As shown in Figure 7, clustering on the denoised expression matrix X_use results in much tighter and better separated clusters (right) than clustering on the original expression matrix (left), as visualized by t-SNE plots. t-SNE is a non-linear dimension reduction approach for the visualization of high-dimensional data (Maaten and Hinton 2008). It is worth noting that, in generating Figure 7, a Boolean rank of 5 was chosen for the factorization, indicating that the heterogeneity among cell types in such a high-dimensional feature space can be well captured by matrices of Boolean rank 5.

[Figure 7: t-SNE visualization of single-cell clustering before (Original, left) and after MEBF denoising (right); axes: t-SNE 1 and t-SNE 2.]
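The denoising step described above amounts to masking the continuous matrix with its Boolean reconstruction. A minimal sketch follows, with a small random matrix and placeholder boolean factors standing in for the 5000 × 300 scRNA-seq data and for MEBF's actual output:

```python
import numpy as np

rng = np.random.default_rng(1)

# Continuous, sparse, non-negative matrix (placeholder for the scRNA-seq data).
n, m, k = 10, 6, 2
X = rng.random((n, m)) * (rng.random((n, m)) > 0.5)

# Step 1: binarize X into X* (1 wherever X_ij > 0).
X_star = (X > 0).astype(int)

# Step 2: placeholder boolean factors A, B (stand-ins for MEBF applied to X*).
A = rng.integers(0, 2, size=(n, k))
B = rng.integers(0, 2, size=(k, m))

# Step 3: Boolean reconstruction X** = A ⊗ B.
X_dstar = (A @ B > 0).astype(int)

# Step 4: denoised matrix X_use = X · X**, keeping only the entries of X
# that fall inside the reconstructed boolean patterns.
X_use = X * X_dstar

print(X_use.shape)
```

Because X** is binary, the element-wise product simply zeroes out entries of X outside the mined patterns while leaving the surviving continuous values untouched.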
Interestingly, we can see that a small portion of the fibroblast cells (dark blue) lies much closer to the cancer cells (red) than to the majority of the fibroblast population, which could biologically indicate a strong cancer-fibroblast interaction in the cancer micro-environment. Unfortunately, such an interaction is not easily visible in the clustering plot of the original noisy matrix.

Conclusion

We sought to develop a fast and efficient algorithm for Boolean matrix factorization, and adopted a heuristic approach to iteratively locate submatrices that are dense in 1's, where each such submatrix corresponds to one binary pattern in BMF. The submatrix identification was inspired by binary matrix permutation theory and geometric segmentation. Approximately, we permutate the rows and columns of the input matrix so that the 1's are "driven" to the upper triangular part of the matrix as much as possible, where a dense submatrix can be more easily located geometrically. Compared with three state-of-the-art methods, ASSO, PANDA and MP, MEBF achieved lower reconstruction error, better density, and much higher computational efficiency on simulation data covering an array of situations. Additionally, we demonstrated the use of MEBF on discrete data pattern mining and continuous data denoising, where in both cases MEBF generated consistent and robust findings.

Acknowledgment

This work was supported by R01 award #1R01GM131399-01, NSF IIS (No. 1850360), the Showalter Young Investigator Award from Indiana CTSI, and the Indiana University Grand Challenge Precision Health Initiative.

References

[Araujo, Ribeiro, and Faloutsos 2016] Araujo, M.; Ribeiro, P.; and Faloutsos, C. 2016. FastStep: Scalable boolean matrix decomposition. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 461-473. Springer.

[Balasubramaniam, Nayak, and Yuen 2018] Balasubramaniam, T.; Nayak, R.; and Yuen, C. 2018.
People to people recommendation using coupled nonnegative boolean matrix factorization.

[Belohlavek, Outrata, and Trnecka 2018] Belohlavek, R.; Outrata, J.; and Trnecka, M. 2018. Toward quality assessment of boolean matrix factorizations. Information Sciences 459:71-85.

[Junttila and others 2011] Junttila, E., et al. 2011. Patterns in permuted binary matrices.

[Karaev, Miettinen, and Vreeken 2015] Karaev, S.; Miettinen, P.; and Vreeken, J. 2015. Getting to know the unknown unknowns: Destructive-noise resistant boolean matrix factorization. In Proceedings of the 2015 SIAM International Conference on Data Mining, 325-333. SIAM.

[Kocayusufoglu, Hoang, and Singh 2018] Kocayusufoglu, F.; Hoang, M. X.; and Singh, A. K. 2018. Summarizing network processes with network-constrained boolean matrix factorization. In 2018 IEEE International Conference on Data Mining (ICDM), 237-246. IEEE.

[Larsson et al. 2019] Larsson, A. J.; Johnsson, P.; Hagemann-Jensen, M.; Hartmanis, L.; Faridani, O. R.; Reinius, B.; Segerstolpe, Å.; Rivera, C. M.; Ren, B.; and Sandberg, R. 2019. Genomic encoding of transcriptional burst kinetics. Nature 565(7738):251.

[Lucchese, Orlando, and Perego 2010] Lucchese, C.; Orlando, S.; and Perego, R. 2010. Mining top-k patterns from binary datasets in presence of noise. In Proceedings of the 2010 SIAM International Conference on Data Mining, 165-176. SIAM.

[Lucchese, Orlando, and Perego 2013] Lucchese, C.; Orlando, S.; and Perego, R. 2013. A unifying framework for mining approximate top-k binary patterns. IEEE Transactions on Knowledge and Data Engineering 26(12):2900-2913.

[Maaten and Hinton 2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605.

[Miettinen and Vreeken 2014] Miettinen, P., and Vreeken, J. 2014. MDL4BMF: Minimum description length for boolean matrix factorization.
ACM Transactions on Knowledge Discovery from Data (TKDD) 8(4):18.

[Miettinen et al. 2008] Miettinen, P.; Mielikäinen, T.; Gionis, A.; Das, G.; and Mannila, H. 2008. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering 20(10):1348-1362.

[Monson, Pullman, and Rees 1995] Monson, S. D.; Pullman, N. J.; and Rees, R. 1995. A survey of clique and biclique coverings and factorizations of (0, 1)-matrices. Bull. Inst. Combin. Appl. 14:17-86.

[Oswald and Reinelt 2009] Oswald, M., and Reinelt, G. 2009. The simultaneous consecutive ones problem. Theoretical Computer Science 410(21-23):1986-1992.

[Oswald 2003] Oswald, M. 2003. Weighted consecutive ones problems. Ph.D. Dissertation.

[Puram et al. 2017] Puram, S. V.; Tirosh, I.; Parikh, A. S.; Patel, A. P.; Yizhak, K.; Gillespie, S.; Rodman, C.; Luo, C. L.; Mroz, E. A.; Emerick, K. S.; et al. 2017. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell 171(7):1611-1624.

[Ravanbakhsh, Póczos, and Greiner 2016] Ravanbakhsh, S.; Póczos, B.; and Greiner, R. 2016. Boolean matrix factorization and noisy completion via message passing. In ICML, 945-954.

[Rukat et al. 2017] Rukat, T.; Holmes, C. C.; Titsias, M. K.; and Yau, C. 2017. Bayesian boolean matrix factorisation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2969-2978. JMLR.org.

[Stockmeyer 1975] Stockmeyer, L. J. 1975. The set basis problem is NP-complete. IBM Thomas J. Watson Research Division.

[Wall, Rechtsteiner, and Rocha 2003] Wall, M. E.; Rechtsteiner, A.; and Rocha, L. M. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis. Springer. 91-109.