A Fast Greedy Algorithm for Generalized Column Subset Selection

Ahmed K. Farahat, Ali Ghodsi, and Mohamed S. Kamel
University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
{afarahat, aghodsib, mkamel}@uwaterloo.ca

Abstract

This paper defines a generalized column subset selection problem which is concerned with the selection of a few columns from a source matrix A that best approximate the span of a target matrix B. The paper then proposes a fast greedy algorithm for solving this problem and draws connections to different problems that can be efficiently solved using the proposed algorithm.
1 Generalized Column Subset Selection

The Column Subset Selection (CSS) problem can be generally defined as the selection of a few columns from a data matrix that best approximate its span [2-5, 10, 15]. We extend this definition to the generalized problem of selecting a few columns from a source matrix to approximate the span of a target matrix. The generalized CSS problem can be formally defined as follows:

Problem 1 (Generalized Column Subset Selection) Given a source matrix $A \in \mathbb{R}^{m \times n}$, a target matrix $B \in \mathbb{R}^{m \times r}$ and an integer $l$, find a subset of columns $\mathcal{L}$ from $A$ such that $|\mathcal{L}| = l$ and
$$\mathcal{L} = \arg\min_{\mathcal{S}} \left\| B - P^{(\mathcal{S})} B \right\|_F^2,$$
where $\mathcal{S}$ is the set of the indices of the candidate columns from $A$, $P^{(\mathcal{S})} \in \mathbb{R}^{m \times m}$ is a projection matrix which projects the columns of $B$ onto the span of the set $\mathcal{S}$ of columns, and $\mathcal{L}$ is the set of the indices of the selected columns from $A$.

The CSS criterion $F(\mathcal{S}) = \| B - P^{(\mathcal{S})} B \|_F^2$ represents the sum of squared errors between the target matrix $B$ and its rank-$l$ approximation $P^{(\mathcal{S})} B$. In other words, it calculates the Frobenius norm of the residual matrix $F = B - P^{(\mathcal{S})} B$. Other types of matrix norms can also be used to quantify the reconstruction error [2, 3].
The present work, however, focuses on developing algorithms that minimize the Frobenius norm of the residual matrix. The projection matrix $P^{(\mathcal{S})}$ can be calculated as $P^{(\mathcal{S})} = A_{:\mathcal{S}} \left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T$, where $A_{:\mathcal{S}}$ is the sub-matrix of $A$ which consists of the columns corresponding to $\mathcal{S}$. It should be noted that if $\mathcal{S}$ is known, the term $\left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T B$ is the closed-form solution of the least-squares problem $T^{*} = \arg\min_{T} \| B - A_{:\mathcal{S}} T \|_F^2$.

2 A Fast Greedy Algorithm for Generalized CSS

Problem 1 is a combinatorial optimization problem whose optimal solution can be obtained in $O\left( \max\left( n^l m r l,\; n^l m l^2 \right) \right)$. In order to approximate this optimal solution, we propose a fast greedy algorithm that selects one column from $A$ at a time. The greedy algorithm is based on a recursive formula for the projection matrix $P^{(\mathcal{S})}$ which can be derived as follows.

Lemma 1 Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$, $P^{(\mathcal{S})} = P^{(\mathcal{P})} + R^{(\mathcal{R})}$, where $R^{(\mathcal{R})} = E_{:\mathcal{R}} \left( E_{:\mathcal{R}}^T E_{:\mathcal{R}} \right)^{-1} E_{:\mathcal{R}}^T$ is a projection matrix which projects the columns of $E = A - P^{(\mathcal{P})} A$ onto the span of the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns.

Proof. Define $D = A_{:\mathcal{S}}^T A_{:\mathcal{S}}$. The projection matrix $P^{(\mathcal{S})}$ can be written as $P^{(\mathcal{S})} = A_{:\mathcal{S}} D^{-1} A_{:\mathcal{S}}^T$. Without loss of generality, the columns and rows of $A_{:\mathcal{S}}$ and $D$ can be rearranged such that the first sets of rows and columns correspond to $\mathcal{P}$. Let $S = D_{\mathcal{R}\mathcal{R}} - D_{\mathcal{P}\mathcal{R}}^T D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}}$ be the Schur complement [17] of $D_{\mathcal{P}\mathcal{P}}$ in $D$, where $D_{\mathcal{P}\mathcal{P}} = A_{:\mathcal{P}}^T A_{:\mathcal{P}}$, $D_{\mathcal{P}\mathcal{R}} = A_{:\mathcal{P}}^T A_{:\mathcal{R}}$ and $D_{\mathcal{R}\mathcal{R}} = A_{:\mathcal{R}}^T A_{:\mathcal{R}}$.
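The closed-form relationship between the projection and the least-squares solution can be checked numerically. The sketch below (our illustration, not from the paper; NumPy assumed, with hypothetical random matrices and an arbitrary index set) verifies that $P^{(\mathcal{S})} B = A_{:\mathcal{S}} T^{*}$ and that $P^{(\mathcal{S})}$ is idempotent.

```python
import numpy as np

# Illustrative check: projecting B onto the span of the selected columns
# A[:, S] equals A[:, S] times the closed-form least-squares solution T*.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))   # source matrix (hypothetical sizes)
B = rng.standard_normal((8, 3))   # target matrix
S = [0, 2, 4]                     # a hypothetical selected subset of columns

AS = A[:, S]
P_S = AS @ np.linalg.inv(AS.T @ AS) @ AS.T       # projection matrix P^(S)
T_star = np.linalg.lstsq(AS, B, rcond=None)[0]   # T* = argmin_T ||B - A_:S T||_F

assert np.allclose(P_S @ B, AS @ T_star)         # P^(S) B = A_:S T*
assert np.allclose(P_S @ P_S, P_S)               # projection matrices are idempotent
```

The second assertion is the idempotence property used repeatedly in the proofs below.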
Using the block-wise inversion formula [17], $D^{-1}$ can be calculated as
$$D^{-1} = \begin{bmatrix} D_{\mathcal{P}\mathcal{P}}^{-1} + D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}} S^{-1} D_{\mathcal{P}\mathcal{R}}^T D_{\mathcal{P}\mathcal{P}}^{-1} & - D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}} S^{-1} \\ - S^{-1} D_{\mathcal{P}\mathcal{R}}^T D_{\mathcal{P}\mathcal{P}}^{-1} & S^{-1} \end{bmatrix}.$$
Substituting with $A_{:\mathcal{S}}$ and $D^{-1}$ in $P^{(\mathcal{S})} = A_{:\mathcal{S}} D^{-1} A_{:\mathcal{S}}^T$, the projection matrix can be simplified to
$$P^{(\mathcal{S})} = A_{:\mathcal{P}} D_{\mathcal{P}\mathcal{P}}^{-1} A_{:\mathcal{P}}^T + \left( A_{:\mathcal{R}} - A_{:\mathcal{P}} D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}} \right) S^{-1} \left( A_{:\mathcal{R}}^T - D_{\mathcal{P}\mathcal{R}}^T D_{\mathcal{P}\mathcal{P}}^{-1} A_{:\mathcal{P}}^T \right). \quad (1)$$
The first term of the right-hand side is the projection matrix $P^{(\mathcal{P})}$ which projects vectors onto the span of the subset $\mathcal{P}$ of columns. The second term can be simplified as follows. Let $E$ be an $m \times n$ residual matrix which is calculated as $E = A - P^{(\mathcal{P})} A$. The sub-matrix $E_{:\mathcal{R}}$ can be expressed as
$$E_{:\mathcal{R}} = A_{:\mathcal{R}} - P^{(\mathcal{P})} A_{:\mathcal{R}} = A_{:\mathcal{R}} - A_{:\mathcal{P}} \left( A_{:\mathcal{P}}^T A_{:\mathcal{P}} \right)^{-1} A_{:\mathcal{P}}^T A_{:\mathcal{R}} = A_{:\mathcal{R}} - A_{:\mathcal{P}} D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}}.$$
Since projection matrices are idempotent, $P^{(\mathcal{P})} P^{(\mathcal{P})} = P^{(\mathcal{P})}$ and
$$E_{:\mathcal{R}}^T E_{:\mathcal{R}} = \left( A_{:\mathcal{R}} - P^{(\mathcal{P})} A_{:\mathcal{R}} \right)^T \left( A_{:\mathcal{R}} - P^{(\mathcal{P})} A_{:\mathcal{R}} \right) = A_{:\mathcal{R}}^T A_{:\mathcal{R}} - A_{:\mathcal{R}}^T P^{(\mathcal{P})} A_{:\mathcal{R}}.$$
Substituting with $P^{(\mathcal{P})} = A_{:\mathcal{P}} \left( A_{:\mathcal{P}}^T A_{:\mathcal{P}} \right)^{-1} A_{:\mathcal{P}}^T$ gives
$$E_{:\mathcal{R}}^T E_{:\mathcal{R}} = A_{:\mathcal{R}}^T A_{:\mathcal{R}} - A_{:\mathcal{R}}^T A_{:\mathcal{P}} \left( A_{:\mathcal{P}}^T A_{:\mathcal{P}} \right)^{-1} A_{:\mathcal{P}}^T A_{:\mathcal{R}} = D_{\mathcal{R}\mathcal{R}} - D_{\mathcal{P}\mathcal{R}}^T D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}} = S.$$
Substituting $A_{:\mathcal{P}} D_{\mathcal{P}\mathcal{P}}^{-1} A_{:\mathcal{P}}^T$, $A_{:\mathcal{R}} - A_{:\mathcal{P}} D_{\mathcal{P}\mathcal{P}}^{-1} D_{\mathcal{P}\mathcal{R}}$ and $S$ with $P^{(\mathcal{P})}$, $E_{:\mathcal{R}}$ and $E_{:\mathcal{R}}^T E_{:\mathcal{R}}$ respectively, Equation (1) can be expressed as $P^{(\mathcal{S})} = P^{(\mathcal{P})} + E_{:\mathcal{R}} \left( E_{:\mathcal{R}}^T E_{:\mathcal{R}} \right)^{-1} E_{:\mathcal{R}}^T$. The second term is the projection matrix $R^{(\mathcal{R})}$ which projects vectors onto the span of $E_{:\mathcal{R}}$. This proves that $P^{(\mathcal{S})}$ can be written in terms of $P^{(\mathcal{P})}$ and $\mathcal{R}$ as $P^{(\mathcal{S})} = P^{(\mathcal{P})} + R^{(\mathcal{R})}$.

Given the recursive formula for $P^{(\mathcal{S})}$, the following theorem derives a recursive formula for $F(\mathcal{S})$.

Theorem 2 Given a set of columns $\mathcal{S}$.
For any $\mathcal{P} \subset \mathcal{S}$, $F(\mathcal{S}) = F(\mathcal{P}) - \| R^{(\mathcal{R})} F \|_F^2$, where $F = B - P^{(\mathcal{P})} B$ and $R^{(\mathcal{R})}$ is a projection matrix which projects the columns of $F$ onto the span of the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns of $E = A - P^{(\mathcal{P})} A$.

Proof. By definition, $F(\mathcal{S}) = \| B - P^{(\mathcal{S})} B \|_F^2$. Using Lemma 1, $P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} B$. The term $R^{(\mathcal{R})} B$ is equal to $R^{(\mathcal{R})} F$ as $E_{:\mathcal{R}}^T B = E_{:\mathcal{R}}^T F$. To prove that, multiplying $E_{:\mathcal{R}}^T$ by $F = B - P^{(\mathcal{P})} B$ gives $E_{:\mathcal{R}}^T F = E_{:\mathcal{R}}^T B - E_{:\mathcal{R}}^T P^{(\mathcal{P})} B$. Using $E_{:\mathcal{R}} = A_{:\mathcal{R}} - P^{(\mathcal{P})} A_{:\mathcal{R}}$, the expression $E_{:\mathcal{R}}^T P^{(\mathcal{P})}$ can be written as $E_{:\mathcal{R}}^T P^{(\mathcal{P})} = A_{:\mathcal{R}}^T P^{(\mathcal{P})} - A_{:\mathcal{R}}^T P^{(\mathcal{P})} P^{(\mathcal{P})}$. This is equal to $0$ as $P^{(\mathcal{P})} P^{(\mathcal{P})} = P^{(\mathcal{P})}$ (an idempotent matrix). Substituting in $F(\mathcal{S})$ and using $F = B - P^{(\mathcal{P})} B$ gives
$$F(\mathcal{S}) = \left\| B - P^{(\mathcal{P})} B - R^{(\mathcal{R})} F \right\|_F^2 = \left\| F - R^{(\mathcal{R})} F \right\|_F^2.$$
Using the relation between the Frobenius norm and the trace, $F(\mathcal{S})$ can be simplified to
$$F(\mathcal{S}) = \mathrm{tr}\left( \left( F - R^{(\mathcal{R})} F \right)^T \left( F - R^{(\mathcal{R})} F \right) \right) = \mathrm{tr}\left( F^T F - F^T R^{(\mathcal{R})} F \right) = \| F \|_F^2 - \left\| R^{(\mathcal{R})} F \right\|_F^2.$$
Using $F(\mathcal{P}) = \| F \|_F^2$ proves the theorem.

Using the recursive formula for $F(\mathcal{S} \cup \{i\})$ allows the development of a greedy algorithm which at iteration $t$ selects column $p$ such that
$$p = \arg\min_{i} F(\mathcal{S} \cup \{i\}) = \arg\max_{i} \left\| P^{(\{i\})} F \right\|_F^2.$$
Let $G = E^T E$ and $H = F^T E$. The objective function $\| P^{(\{i\})} F \|_F^2$ can be simplified to
$$\left\| E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right\|_F^2 = \mathrm{tr}\left( F^T E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right) = \frac{\left\| F^T E_{:i} \right\|^2}{E_{:i}^T E_{:i}} = \frac{\| H_{:i} \|^2}{G_{ii}}.$$
This allows the definition of the following greedy generalized CSS problem.

Problem 2 (Greedy Generalized CSS) At iteration $t$, find column $p$ such that
$$p = \arg\max_{i} \frac{\| H_{:i} \|^2}{G_{ii}},$$
where $H = F^T E$, $G = E^T E$, $F = B - P^{(\mathcal{S})} B$, $E = A - P^{(\mathcal{S})} A$ and $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.

For iteration $t$, define $\delta = G_{:p}$, $\gamma = H_{:p}$, $\omega = G_{:p} / \sqrt{G_{pp}} = \delta / \sqrt{\delta_p}$ and $\upsilon = H_{:p} / \sqrt{G_{pp}} = \gamma / \sqrt{\delta_p}$.
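To make the selection rule of Problem 2 concrete, it can be solved naively by recomputing the residuals $E$ and $F$ and the criterion $\| H_{:i} \|^2 / G_{ii}$ from scratch at every iteration. The NumPy sketch below is our illustration of this naive baseline, not the paper's efficient recursive algorithm (the small epsilon guard is our addition to avoid dividing by the zero residual of already-selected columns).

```python
import numpy as np

def greedy_gcss_naive(A, B, l):
    """Naive greedy generalized CSS: at each iteration, pick the column p
    maximizing ||H_:i||^2 / G_ii with H = F^T E and G = E^T E, recomputing
    the residuals from scratch. Illustrative only; Algorithm 1 in the paper
    reaches the same selection with cheap rank-one updates."""
    m, _ = A.shape
    S = []
    for _ in range(l):
        if S:
            AS = A[:, S]
            P = AS @ np.linalg.pinv(AS.T @ AS) @ AS.T  # projection P^(S)
        else:
            P = np.zeros((m, m))
        E = A - P @ A                  # residual of the source matrix
        F = B - P @ B                  # residual of the target matrix
        G = E.T @ E
        H = F.T @ E
        crit = np.sum(H ** 2, axis=0) / np.maximum(np.diag(G), 1e-12)
        crit[S] = -np.inf              # never reselect a column
        S.append(int(np.argmax(crit)))
    return S
```

Each iteration of this baseline costs $O(m^2 n)$ because the residuals are rebuilt explicitly; the recursive formulas developed next avoid that cost entirely.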
The vectors $\delta^{(t)}$ and $\gamma^{(t)}$ can be calculated in terms of $A$, $B$ and the previous $\omega$'s and $\upsilon$'s as
$$\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}, \qquad \gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}. \quad (2)$$
The numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner without explicitly calculating $H$ or $G$ using the following theorem.

Theorem 3 Let $f_i = \| H_{:i} \|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the greedy criterion function for column $i$ respectively, $f = [f_i]_{i=1..n}$, and $g = [g_i]_{i=1..n}$. Then,
$$f^{(t)} = \left( f - 2\,\omega \circ \left( A^T B \upsilon - \sum_{r=1}^{t-2} \left( \upsilon^{(r)T} \upsilon \right) \omega^{(r)} \right) + \| \upsilon \|^2 \left( \omega \circ \omega \right) \right)^{(t-1)},$$
$$g^{(t)} = \left( g - \left( \omega \circ \omega \right) \right)^{(t-1)},$$
where $\circ$ represents the Hadamard product operator.

In the update formulas of Theorem 3, $A^T B$ can be calculated once and then used in different iterations. This makes the computational complexity of these formulas $O(nr)$ per iteration. The computational complexity of the algorithm is dominated by that of calculating $A^T A_{:p}$ in (2), which is $O(mn)$ per iteration. The other complex step is that of calculating the initial $f$, which is $O(mnr)$. However, these steps can be implemented in an efficient way if the data matrix is sparse. The total computational complexity of the algorithm is $O(\max(mnl, mnr))$, where $l$ is the number of selected columns. Algorithm 1 in Appendix A shows the complete greedy algorithm.

3 Generalized CSS Problems

We describe a variety of problems that can be formulated as a generalized column subset selection (see Table 1). It should be noted that for some of these problems, the use of greedy algorithms has been explored in the literature.
However, identifying the connection between these problems and the problem presented in this paper gives more insight about these problems, and allows the efficient greedy algorithm presented in this paper to be explored in other interesting domains.

Column Subset Selection. The basic column subset selection [2-4, 10, 15] is clearly an instance of the generalized CSS problem. In this instance, the target matrix is the same as the source matrix ($B = A$) and the goal is to select a subset of columns from a data matrix that best represent the other columns. The greedy algorithm presented in this paper can be directly used for solving the basic CSS problem. A detailed comparison of the greedy CSS algorithm and the state-of-the-art CSS methods can be found in [11]. In our previous work [13, 14], we successfully used the proposed greedy algorithm for unsupervised feature selection, which is an instance of the CSS problem. We used the greedy algorithm to solve two instances of the generalized CSS problem: one is based on selecting features that approximate the original matrix ($B = A$) and the other is based on selecting features that approximate a random partitioning of the features ($B_{:c} = \sum_{j \in \mathcal{P}_c} A_{:j}$). The proposed greedy algorithms achieved superior clustering performance in comparison to state-of-the-art methods for unsupervised feature selection.

Table 1: Different problems as instances of the generalized column subset selection problem.

    Method                              Source           Target
    Generalized CSS                     A                B
    Column Subset Selection             Data matrix A    Data matrix A
    Distributed CSS                     Data matrix A    Random subspace A*Omega
    SVD-based CSS                       Data matrix A    SVD-based subspace U_k Sigma_k
    Sparse Approximation                Atoms D          Target vector y
    Simultaneous Sparse Approximation   Atoms D          Target vectors y^(1), y^(2), ..., y^(r)

Distributed Column Subset Selection.
The generalized CSS problem can be used to define distributed variants of the basic column subset selection problem. In this case, the matrix $B$ is defined to encode a concise representation of the span of the original matrix $A$. This concise representation can be obtained using an efficient method like random projection. In our recent work [12], we defined a distributed CSS based on this idea and used the proposed greedy algorithm to select columns from big data matrices that are massively distributed across different machines.

SVD-based Column Subset Selection. Çivril and Magdon-Ismail [5] proposed a CSS method which first calculates the Singular Value Decomposition (SVD) of the data matrix, and then selects the subset of columns which best approximates the leading singular values of the data matrix. The formulation of this CSS method is an instance of the generalized CSS problem, in which the target matrix is calculated from the leading singular vectors of the data matrix. The greedy algorithm presented in [5] can be implemented using Algorithm 1 by setting $B = U_k \Sigma_k$, where $U_k$ is a matrix whose columns represent the leading left singular vectors of the data matrix, and $\Sigma_k$ is a matrix whose diagonal elements represent the corresponding singular values. Our greedy algorithm is, however, more efficient than the greedy algorithm of [5].

Sparse Approximation. Given a target vector and a set of basis vectors, also called atoms, the goal of sparse approximation is to represent the target vector as a linear combination of a few atoms [20]. Different instances of this problem have been studied in the literature under different names, such as variable selection for linear regression [8], sparse coding [16, 19], and dictionary selection [6, 9].
If the goal is to minimize the discrepancy between the target vector and its projection onto the subspace of selected atoms, the sparse approximation can be considered an instance of the generalized CSS problem in which the target matrix is a vector and the columns of the source matrix are the atoms. Several greedy algorithms have been proposed for sparse approximation, such as basic matching pursuit [18], orthogonal matching pursuit [21], and orthogonal least squares [7]. The greedy algorithm for generalized CSS is equivalent to the orthogonal least squares algorithm (as defined in [1]) because at each iteration it selects a new column such that the reconstruction error after adding this column is minimum. Algorithm 1 can be used to efficiently implement the orthogonal least squares algorithm by setting $B = y$, where $y$ is the target vector. However, an additional step will be needed to calculate the weights of the selected atoms as $\left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T y$.

Simultaneous Sparse Approximation. A more general sparse approximation problem is the selection of atoms which represent a group of target vectors. This problem is referred to as simultaneous sparse approximation [22]. Different greedy algorithms have been proposed for simultaneous sparse approximation with different constraints [6, 22]. If the goal is to select a subset of atoms to represent different target vectors without imposing sparsity constraints on each representation, simultaneous sparse approximation will be an instance of the greedy CSS problem, where the source columns are the atoms and the target columns are the input signals.

4 Conclusions

We define a generalized variant of the column subset selection problem and present a fast greedy algorithm for solving it.
The proposed greedy algorithm can be effectively used to solve a variety of problems that are instances of the generalized column subset selection problem.

References

[1] T. Blumensath and M. E. Davies. On the difference between orthogonal matching pursuit and orthogonal least squares. 2007. Unpublished manuscript.
[2] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near optimal column-based matrix reconstruction. In Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS'11), pages 305-314, 2011.
[3] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'09), pages 968-977, 2009.
[4] C. Boutsidis, J. Sun, and N. Anerousis. Clustered subset selection and its applications on IT service metrics. In Proceedings of the Seventeenth ACM Conference on Information and Knowledge Management (CIKM'08), pages 599-608, 2008.
[5] A. Çivril and M. Magdon-Ismail. Column subset selection via sparse approximation of SVD. Theoretical Computer Science, 421:1-14, 2012.
[6] V. Cevher and A. Krause. Greedy dictionary selection for sparse representation. Journal of Selected Topics in Signal Processing, 5(5):979-988, 2011.
[7] S. Chen, S. A. Billings, and W. Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873-1896, 1989.
[8] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC'08), pages 45-54, 2008.
[9] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection.
In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pages 1057-1064, 2011.
[10] P. Drineas, M. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 316-326. Springer, 2006.
[11] A. K. Farahat. Greedy Representative Selection for Unsupervised Data Analysis. PhD thesis, University of Waterloo, 2012.
[12] A. K. Farahat, A. Elgohary, A. Ghodsi, and M. S. Kamel. Distributed column subset selection on MapReduce. In Proceedings of the Thirteenth IEEE International Conference on Data Mining (ICDM'13), 2013. In press.
[13] A. K. Farahat, A. Ghodsi, and M. S. Kamel. An efficient greedy method for unsupervised feature selection. In Proceedings of the Eleventh IEEE International Conference on Data Mining (ICDM'11), pages 161-170, 2011.
[14] A. K. Farahat, A. Ghodsi, and M. S. Kamel. Efficient greedy feature selection for unsupervised learning. Knowledge and Information Systems, 35(2):285-310, 2013.
[15] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science (FOCS'98), pages 370-378, 1998.
[16] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19 (NIPS'06), pages 801-808. MIT Press, 2006.
[17] H. Lütkepohl. Handbook of Matrices. John Wiley & Sons Inc, 1996.
[18] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397-3415, 1993.
[19] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3326, 1997.
[20] J. Tropp.
Greed is good: Algorithmic results for sparse approximation. Information Theory, IEEE Transactions on, 50(10):2231-2242, 2004.
[21] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655-4666, 2007.
[22] J. Tropp, A. Gilbert, and M. Strauss. Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86(3):572-588, 2006.

Appendix A

Algorithm 1 Greedy Generalized Column Subset Selection
Input: Source matrix $A$, target matrix $B$, number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
1: Initialize $f_i^{(0)} = \| B^T A_{:i} \|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
2: Repeat $t = 1 \to l$:
3:   $p = \arg\max_i f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
4:   $\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}$
5:   $\gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}$
6:   $\omega^{(t)} = \delta^{(t)} / \sqrt{\delta_p^{(t)}}$, $\upsilon^{(t)} = \gamma^{(t)} / \sqrt{\delta_p^{(t)}}$
7:   Update $f_i$'s, $g_i$'s (Theorem 3)

Proof of Theorem 3. Let $\mathcal{S}$ denote the set of columns selected during the first $t-1$ iterations, $F^{(t-1)}$ denote the residual matrix of $B$ at the start of the $t$-th iteration (i.e., $F^{(t-1)} = B - P^{(\mathcal{S})} B$), and $p$ be the column selected at iteration $t$. From Lemma 1, $P^{(\mathcal{S} \cup \{p\})} = P^{(\mathcal{S})} + R^{(\{p\})}$. Multiplying both sides with $B$ gives $P^{(\mathcal{S} \cup \{p\})} B = P^{(\mathcal{S})} B + R^{(\{p\})} B$. Subtracting both sides from $B$ and substituting $B - P^{(\mathcal{S})} B$ and $B - P^{(\mathcal{S} \cup \{p\})} B$ with $F^{(t-1)}$ and $F^{(t)}$ respectively gives $F^{(t)} = \left( F - R^{(\{p\})} B \right)^{(t-1)}$. Since $R^{(\{p\})} B = R^{(\{p\})} F$ (see the proof of Theorem 2), $F^{(t)}$ can be calculated recursively as $F^{(t)} = \left( F - R^{(\{p\})} F \right)^{(t-1)}$. Similarly, $E^{(t)}$ can be expressed as $E^{(t)} = \left( E - R^{(\{p\})} E \right)^{(t-1)}$. Substituting with $F$ and $E$ in $H = F^T E$ gives
$$H^{(t)} = \left( \left( F - R^{(\{p\})} F \right)^T \left( E - R^{(\{p\})} E \right) \right)^{(t-1)} = \left( H - F^T R^{(\{p\})} E \right)^{(t-1)}.$$
Using $R^{(\{p\})} = E_{:p} \left( E_{:p}^T E_{:p} \right)^{-1} E_{:p}^T$, and given that $\omega = G_{:p} / \sqrt{G_{pp}} = E^T E_{:p} / \sqrt{E_{:p}^T E_{:p}}$ and $\upsilon = H_{:p} / \sqrt{G_{pp}} = F^T E_{:p} / \sqrt{E_{:p}^T E_{:p}}$, the matrix $H$ can be calculated recursively as $H^{(t)} = \left( H - \upsilon \omega^T \right)^{(t-1)}$. Similarly, $G$ can be expressed as $G^{(t)} = \left( G - \omega \omega^T \right)^{(t-1)}$. Using these recursive formulas, $f_i^{(t)}$ can be calculated as
$$f_i^{(t)} = \left( \| H_{:i} \|^2 \right)^{(t)} = \left( \| H_{:i} - \omega_i \upsilon \|^2 \right)^{(t-1)} = \left( \left( H_{:i} - \omega_i \upsilon \right)^T \left( H_{:i} - \omega_i \upsilon \right) \right)^{(t-1)}$$
$$= \left( H_{:i}^T H_{:i} - 2 \omega_i H_{:i}^T \upsilon + \omega_i^2 \| \upsilon \|^2 \right)^{(t-1)} = \left( f_i - 2 \omega_i H_{:i}^T \upsilon + \omega_i^2 \| \upsilon \|^2 \right)^{(t-1)}.$$
Similarly, $g_i^{(t)}$ can be calculated as $g_i^{(t)} = G_{ii}^{(t)} = \left( G_{ii} - \omega_i^2 \right)^{(t-1)} = \left( g_i - \omega_i^2 \right)^{(t-1)}$. Let $f = [f_i]_{i=1..n}$ and $g = [g_i]_{i=1..n}$; $f^{(t)}$ and $g^{(t)}$ can be expressed as
$$f^{(t)} = \left( f - 2\,\omega \circ \left( H^T \upsilon \right) + \| \upsilon \|^2 \left( \omega \circ \omega \right) \right)^{(t-1)}, \qquad g^{(t)} = \left( g - \left( \omega \circ \omega \right) \right)^{(t-1)}, \quad (3)$$
where $\circ$ represents the Hadamard product operator. Using the recursive formula of $H$, the term $H^T \upsilon$ at iteration $(t-1)$ can be expressed as
$$H^T \upsilon = \left( A^T B - \sum_{r=1}^{t-2} \omega^{(r)} \upsilon^{(r)T} \right) \upsilon = A^T B \upsilon - \sum_{r=1}^{t-2} \left( \upsilon^{(r)T} \upsilon \right) \omega^{(r)}.$$
Substituting with $H^T \upsilon$ in (3) gives the update formulas for $f$ and $g$.
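Putting Algorithm 1 and the Theorem 3 updates together, a compact NumPy sketch might look as follows. This is our illustration, not the authors' code: variable names follow the paper, and the small epsilon guard on the denominator is our addition to keep already-selected columns (whose $g_i$ drops to zero) out of trouble numerically.

```python
import numpy as np

def greedy_gcss(A, B, l):
    """Sketch of Algorithm 1: greedy generalized CSS using the memoized
    omega/upsilon vectors of Theorem 3, so E, F, G and H are never formed."""
    n = A.shape[1]
    f = np.sum((B.T @ A) ** 2, axis=0)   # f_i^(0) = ||B^T A_:i||^2
    g = np.einsum('ij,ij->j', A, A)      # g_i^(0) = A_:i^T A_:i
    omegas, upsilons, S = [], [], []
    for _ in range(l):
        crit = f / np.maximum(g, 1e-12)
        crit[S] = -np.inf                # exclude already-selected columns
        p = int(np.argmax(crit))
        S.append(p)
        # Equation (2): delta and gamma from A, B and the memoized vectors.
        delta = A.T @ A[:, p] - sum(w[p] * w for w in omegas)
        gamma = B.T @ A[:, p] - sum(w[p] * u for w, u in zip(omegas, upsilons))
        omega = delta / np.sqrt(delta[p])
        upsilon = gamma / np.sqrt(delta[p])
        # Theorem 3: H^T upsilon = A^T B upsilon - sum_r (upsilon^(r)^T upsilon) omega^(r)
        Hu = A.T @ (B @ upsilon) - sum((u @ upsilon) * w for w, u in zip(omegas, upsilons))
        f = f - 2 * omega * Hu + (upsilon @ upsilon) * omega ** 2
        g = g - omega ** 2
        omegas.append(omega)
        upsilons.append(upsilon)
    return S
```

Each iteration touches only $A^T A_{:p}$, $B^T A_{:p}$ and the stored vectors, matching the $O(mn + nr)$ per-iteration cost stated in Section 2.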