A Maximum Matching Algorithm for Basis Selection in Spectral Learning
Authors: Ariadna Quattoni, Xavier Carreras, Matthias Gallé
Xerox Research Centre Europe (XRCE), Meylan, France
{ariadna.quattoni, xavier.carreras, matthias.galle}@xrce.xerox.com

Abstract

We present a solution to scale spectral algorithms for learning sequence functions. We are interested in the case where these functions are sparse (that is, for most sequences they return 0). Spectral algorithms reduce the learning problem to the task of computing an SVD decomposition over a special type of matrix called the Hankel matrix, which is designed to capture the relevant statistics of the training sequences. What is crucial is that to capture long-range dependencies we must consider very large Hankel matrices, so the computation of the SVD becomes a critical bottleneck. Our solution finds a subset of rows and columns of the Hankel matrix that realizes a compact and informative Hankel submatrix. The novelty lies in the way this subset is selected: we exploit a maximum bipartite matching combinatorial algorithm to look for a sub-block with full structural rank, and show how the computation of this sub-block can be further improved by exploiting the specific structure of Hankel matrices.

1 INTRODUCTION

Our goal is to model functions whose domain is discrete sequences over some finite alphabet. Our focus is on sparse functions, by which we mean functions with the property that only a very small proportion of the sequences in the domain map to a non-zero value. We call those sequences the support of the function. The main motivation lies in solving problems arising in Natural Language Processing (NLP) applications, where sparse sequence functions are of special interest. For example, think of all possible sequences of T letters that constitute valid English words of length T. If Σ is the set of English letters, it is clear that out of the $|\Sigma|^T$ possible letter sequences only a very small fraction are valid words (i.e., should have non-zero probability).

One interesting function class over Σ⋆ is that of functions computed by Non-Deterministic Weighted Automata (WA), since this class properly includes classes such as n-gram models and hidden Markov models. In recent years several approaches for estimating WAs have been proposed that are based on representing the function computed by a WA using a Hankel matrix [Beimel et al., 2000, Jaeger, 2000, Hsu et al., 2009, Anandkumar et al., 2012, Balle et al., 2013].

As an illustration of the method, consider the following problem: assume we are given a set of pairs (x, f(x)), where x is a sequence in the support of some target function f over Σ⋆, and we wish to learn a WA that approximates f. The spectral method provides a solution to this problem, working in four steps:

1. Basis selection: choose a set of prefixes P and suffixes S.
2. Build a Hankel matrix $H \in \mathbb{R}^{|P| \times |S|}$, where the entry H(p, s) is the value of the target function on the sequence obtained by concatenating prefix p with suffix s.
3. Perform SVD on $H = U \Sigma V^\top$.
4. Use the factorization $F = U\Sigma$ and $B = V^\top$, together with H, to recover the parameters of the minimal WA, following Hsu et al. [2009] (see §2.3 for details).
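As a toy illustration of steps 1-3 in code (a hedged sketch in Python/NumPy: the tiny sample below is made up, and a real implementation would store the Hankel matrix sparsely):

```python
import numpy as np

# Hypothetical sample of (sequence, value) pairs for a sparse target f over {a, b}.
f = {"": 1.0, "ab": 0.5, "b": 0.25}

def value(x):
    return f.get(x, 0.0)          # sparse function: unseen sequences map to 0

# Steps 1-2: choose a basis (P, S) and fill in H(p, s) = f(ps).
P = ["", "a", "b"]                # prefixes
S = ["", "b"]                     # suffixes
H = np.array([[value(p + s) for s in S] for p in P])

# Step 3: SVD of the Hankel block (dense here; large and sparse in practice).
U, Sigma, Vt = np.linalg.svd(H, full_matrices=False)
```

Step 4, the recovery of the WA parameters from the factorization, is spelled out in §2.3.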
The computational cost of the algorithm is dominated by the SVD step, $O(\min(|P|, |S|)^3)$; thus, to control the computational complexity, it is critical to choose a small and yet informative basis.

The theory of spectral learning tells us that if the target function has a minimal WA representation of size n, there will be a complete basis where |P| = |S| = n, where complete means that the rank of the corresponding Hankel matrix defined over that basis is the same as the size of the minimal WA. But the theory does not give a practical answer to how to choose such a basis. The design of efficient algorithms for choosing an informative and yet small sample-dependent basis is still an open problem, and it is the focus of our paper.

We propose an efficient combinatorial algorithm for sample-dependent basis selection. At its core, our strategy computes a maximum matching of the bipartite graph associated with the sparsity pattern of a Hankel matrix. The main idea is quite simple: we find a subset of prefixes and suffixes in the given sample such that the corresponding Hankel matrix defined over that basis has full structural rank. The key insight is that for sparse matrices it is easy to remove symbolic dependencies (i.e., dependencies at the level of the sparsity pattern of the matrix). Similar ideas have a long history in the numerical optimization literature, where combinatorial algorithms are used to compute preconditioners for solving large sparse linear systems. However, to the best of our knowledge, we are the first to apply this idea in the context of spectral learning.

We show that when the Hankel matrix of a function satisfies some non-degeneracy assumptions, our basis selection algorithm is optimal, in the sense that it computes the smallest complete basis. While the non-degeneracy assumption will not always be satisfied, our experiments suggest that it is almost always satisfied for sparse sequence functions.

Our experiments on real sequence modeling tasks show that the proposed algorithm can select a basis that is at least an order of magnitude smaller than the best alternative methods for basis selection, resulting in an SVD step that is at least two orders of magnitude faster.

1.1 Related Work

Although choosing a basis is in practice an important task for obtaining a robust spectral learning algorithm, not much research has focused on this problem. One popular approach is to choose a basis by selecting all observed prefixes and suffixes of length less than T, for some T > 0 [Hsu et al., 2009, Siddiqi et al., 2010]. In practice, this strategy only works if there are no long-range dependencies in the target function. Wiewiora [2005] presented a greedy heuristic in which each prefix added to the basis requires a computation that takes time exponential in the number of states n. Bailly et al. [2009] suggest including all prefixes and suffixes observed in the sample in the basis.
There are some theoretical results [Denis et al., 2016] suggesting that under certain assumptions this is the optimal strategy, in the sense that there is no statistical harm in considering all prefixes and suffixes. However, this approach is unfeasible in practice: to give a concrete example, if one considers modeling the distribution of n-grams up to length 10 in a standard NLP benchmark, the number of unique observed prefixes and suffixes is at least in the tens of millions. Finally, Balle et al. [2012] gave the first theoretical results for the problem of basis selection. They show that by sampling prefixes and suffixes proportionally to their frequency in a large enough sample, with high probability a complete basis will be found. They also provide experimental results [Balle et al., 2013].

2 PRELIMINARIES

2.1 Non-Deterministic Weighted Finite State Automata

We start by defining a class of functions over discrete sequences. More specifically, let $x = x_1 \cdots x_t$ be a sequence of length t over some finite alphabet Σ. We use Σ⋆ to denote the set of all finite sequences with elements in Σ, and we use ε to denote the empty sequence. The domain of our functions is Σ⋆.

A Non-Deterministic Weighted Automaton (WA) with n states is defined as a tuple $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma} \rangle$, where $\alpha_0, \alpha_\infty \in \mathbb{R}^n$ are the initial and final weight vectors and $A_\sigma \in \mathbb{R}^{n \times n}$ are the transition matrices associated with each symbol σ ∈ Σ. The function $f_A : \Sigma^\star \to \mathbb{R}$ realized by a WA A is defined as:

$$f_A(x) = \alpha_0^\top A_{x_1} \cdots A_{x_t} \alpha_\infty. \qquad (1)$$

The above equation is an algebraic representation of the computation performed by a WA on a sequence x. To see this, consider a state vector $s_i \in \mathbb{R}^n$ whose j-th entry represents the sum of the weights of all the state paths that generate the prefix $x_{1:i}$ and end in state j. Initially $s_0 = \alpha_0$, and then $s_i^\top = s_{i-1}^\top A_{x_i}$ updates the state distribution by simultaneously emitting the symbol $x_i$ and transitioning to generate the next state vector. WAs constitute a rich function class that properly includes popular sequence models such as HMMs.
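Equation (1) is direct to implement. The following minimal NumPy sketch (the two-state automaton is made up for illustration, not an example from the paper) evaluates a WA on a sequence exactly as the state-vector recursion above describes:

```python
import numpy as np

def wa_score(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_{x_1} ... A_{x_t} alpha_inf (Equation 1)."""
    s = alpha0.copy()              # s_0 = alpha_0
    for sigma in x:                # s_i^T = s_{i-1}^T A_{x_i}
        s = s @ A[sigma]
    return float(s @ alpha_inf)

# A made-up 2-state WA over Sigma = {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 1.0])
A = {"a": np.array([[0.5, 0.5], [0.0, 1.0]]),
     "b": np.array([[1.0, 0.0], [0.2, 0.3]])}
print(wa_score(alpha0, alpha_inf, A, "ab"))   # -> 0.15
```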
2.2 Hankel Matrices

We now introduce the concept of Hankel matrices for WAs, which are central to the spectral learning algorithm and to the results in this paper.

[Figure 1: Illustration of the maximum bipartite matching sub-block. Left: a training set and the associated target function. Middle: a prefix-suffix graph with a corresponding maximum matching in red. Right: the full Hankel matrix for the training set, and the submatrix given by the matching. The example uses Σ = {a, b, c} and T = {(ε, 1), (aab, 1), (b, 1), (bb, 1), (c, 1), (ca, 1), (cb, 1)}, so that f_T(x) = 1 if x ∈ {ε, aab, b, bb, c, ca, cb} and 0 otherwise; P_T = {ε, a, aa, aab, b, bb, c, ca, cb} and S_T = {ε, b, ab, aab, bb, a, c, ca, cb}; the matching selects P = {ε, a, aa, aab, c} and S = {ε, b, ab, aab, a}; both H_{P_T × S_T} and H_{P × S} have rank equal to structural rank equal to 5.]

Let f : Σ⋆ → R be an arbitrary function from sequences to reals (not necessarily computed by a WA). Let P, S ⊆ Σ⋆ be sets of sequences. We call the elements p ∈ P prefixes and the elements s ∈ S suffixes. The Hankel matrix $H_f \in \mathbb{R}^{P \times S}$ for f over the block (P, S) is defined by entries H(p, s) = f(ps), where ps is the concatenation of prefix p ∈ P and suffix s ∈ S. The following theorem gives a bijection between the class of functions computed by WAs and Hankel matrices:

Theorem 1 [Schützenberger, 1961, Carlyle and Paz, 1971, Fliess, 1974]. A function f : Σ⋆ → R can be realized by a WA with n states if and only if, for every possible block (P, S), the corresponding Hankel matrix H_f has rank at most n.

2.3 The Spectral Method

We now give a brief description of the spectral method for estimating a minimal WA representation of a target function. The algorithm is a constructive version of the theorem above: it builds a Hankel matrix of rank n and computes the associated n-state WA from it. We only provide a high-level description of the method; for a complete derivation and the theory justifying the algorithm we refer the reader to the works by Hsu et al. [2009] and Balle et al. [2013].

Assume a training set T in the form of a collection of sequences, each paired with a target real value. We denote by f_T the function obtained from the training set, i.e. if x ∈ T, then f_T(x) is the target value. For example, T could be a corpus of English sentences, and f_T(x) the probability with which x appears in T. Given a training set T, the spectral algorithm computes a WA A with n states, where n is a parameter of the algorithm, such that f_A is a good approximation of f_T. See Hsu et al. [2009] for the generalization theory of the algorithm. The method is described by the following steps:

(1) Select a Hankel block. Let P_T and S_T be, respectively, the sets of all unique prefixes and suffixes of sequences in T. Select a block out of them, namely a subset of prefixes P ⊆ P_T and a subset of suffixes S ⊆ S_T.

(2) Compute Hankel matrices for (P, S). (a) Compute $H \in \mathbb{R}^{P \times S}$ with entries H(p, s) = f_T(ps). (b) Compute $h_P \in \mathbb{R}^P$ with $h_P(p) = f_T(p)$, and $h_S \in \mathbb{R}^S$ with $h_S(s) = f_T(s)$. (c) For each σ ∈ Σ, compute $H_\sigma \in \mathbb{R}^{P \times S}$ with entries $H_\sigma(p, s) = f_T(p \sigma s)$.

(3) Compute an n-rank factorization of H. Compute the truncated SVD of H, i.e. $H \approx U \Sigma V^\top$, resulting in a matrix $F = U\Sigma \in \mathbb{R}^{P \times n}$ and a matrix $B = V \in \mathbb{R}^{S \times n}$.

(4) Recover the WA A with n states. Let $M^+$ denote the Moore-Penrose pseudo-inverse of a matrix M. The elements of A are recovered as follows. Initial vector: $\alpha_0^\top = h_S^\top B$. Final vector: $\alpha_\infty = F^+ h_P$. Transition matrices: $A_\sigma = F^+ H_\sigma B$, for σ ∈ Σ.

There are some observations to make that motivate the contribution of this paper. Consider the complete training block (P_T, S_T), and let H_T denote the Hankel matrix for this complete block. If we want to fully reconstruct the function f_T, we need an automaton A with as many states as the rank of H_T. By using fewer states, we will be learning a low-rank approximation of f_T in the form of a WA.
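The recovery in steps (3) and (4) is compact enough to sketch directly; this is a hedged dense-NumPy illustration (a practical implementation would use sparse matrices and a truncated sparse SVD), assuming the Hankel blocks from step (2) have already been filled in:

```python
import numpy as np

def recover_wa(H, h_P, h_S, H_sig, n):
    """Steps (3)-(4): rank-n factorization of H and recovery of the operators.

    H: |P| x |S| Hankel block; h_P, h_S: prefix and suffix value vectors;
    H_sig: dict mapping each symbol to its |P| x |S| block H_sigma; n: states.
    """
    # Step (3): truncated SVD, H ~ F B^T with F = U.Sigma and B = V.
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    F = U[:, :n] * S[:n]
    B = Vt[:n].T
    # Step (4): recover the WA via the Moore-Penrose pseudo-inverse F^+.
    Fp = np.linalg.pinv(F)
    alpha0 = B.T @ h_S                                # alpha_0^T = h_S^T B
    alpha_inf = Fp @ h_P                              # alpha_inf = F^+ h_P
    A = {s: Fp @ Hs @ B for s, Hs in H_sig.items()}   # A_sigma = F^+ H_sigma B
    return alpha0, alpha_inf, A
```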
The second observation is that any sub-block (P, S) whose Hankel submatrix has full rank (with respect to the rank of H_T) can be used to fully recover f_T. Thus, in the ideal case, step (1) of the algorithm would select a compact submatrix of H_T that preserves the rank. By doing so, the cost of the subsequent steps, in particular the SVD, would only depend on the size of the submatrix. Even if we cannot get the ideal block, it would be good to have a method for step (1) that produces a small and informative block. Unfortunately, in the general case (i.e., for an arbitrary real matrix), finding the submatrix of fixed size that has maximal rank is known to be NP-complete [Peeters, 2003]. In this paper we propose an algorithm to find a small submatrix of H_T of high rank.

As a final note, spectral methods can be used to learn a language model, that is, a probability distribution over all sentences of a language. A straightforward way to learn a language model is to regard the training collection T as an empirical distribution over sequences of words, where the probability of a sequence is proportional to the number of times it appears, i.e. $f_T(x) = \Pr_T(x)$. Another choice, sometimes referred to as moment matching, is to set the function f_T(x) to be the expected number of times that the sequence x appears as a subsequence of a random sequence sampled from an empirical distribution. In this case, the spectral algorithm will learn a WA that computes expectations of subsequence frequencies. One useful result is that this WA can be converted to another WA that corresponds to the underlying language model, i.e. a distribution over sequences; see Balle et al. [2013] for details. In practice this second method is preferred, since subsequence frequency expectations are statistics that are more stable to estimate from a training set.

3 SUB-BLOCK SELECTION VIA BEST BIPARTITE MATCHING

We start this section by defining the structural rank of a matrix.

Definition 1. The structural rank of a matrix is the maximum rank over all numerical matrices with the same non-zero pattern.

Our proposed algorithm will then search for a submatrix of H with full structural rank. In the context of WAs and Hankel matrices this has a nice interpretation as a notion of complexity of the support of a function, because the structural rank of a Hankel matrix corresponds to the number of states of the minimal WA for the hardest function defined over that support. Notice that, by definition, the numerical rank of a matrix is always less than or equal to its structural rank; thus the structural rank of the Hankel matrix H of a function f_A will always be greater than or equal to the number of states of the minimal WA computing f_A. Our algorithm is based on finding a submatrix of H with full structural rank.

The problem of finding a full structural rank sub-block of H can be cast as an instance of maximum bipartite matching [Edmonds, 1967]. Given a bipartite graph (V, E), where V is the set of vertices and E the set of edges, a maximum bipartite matching is a largest set of non-intersecting edges, where non-intersecting means that no two edges in the set share a common vertex.
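Both the structural rank and the matching it rests on are available off the shelf in SciPy (scipy.sparse.csgraph; the matching semantics below assume SciPy 1.4 or newer). A quick illustration on a made-up 3x3 sparsity pattern:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching, structural_rank

# Hypothetical sparsity pattern: 1 wherever the entry may be non-zero.
pattern = csr_matrix(np.array([[1, 1, 0],
                               [0, 1, 0],
                               [1, 0, 0]]))
print(structural_rank(pattern))     # -> 2: at most two rows/columns match up
match = maximum_bipartite_matching(pattern, perm_type='column')
print(match)                        # column matched to each row, -1 if unmatched
```

The structural rank equals the size of a maximum matching in the bipartite graph of the sparsity pattern, which is what makes the reduction described next possible.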
In the case of the Hankel matrix of a function f_A, we have a bipartite graph (V, E) where the vertices on one side correspond to all unique prefixes in the support of f_A and those on the other side to all unique suffixes; thus |V| = |P| + |S|. There is an edge connecting nodes i and j if the sequence formed by concatenating prefix i and suffix j is in the support of f_A. For every sequence of length T in the support of f_A and every possible cut of it into a prefix and a suffix, there are T + 1 corresponding edges in E; thus $|E| = O(T |f_A|)$, where we use |f_A| to denote the number of sequences in the support of f_A.

The maximum bipartite matching of a set of sequences is a subset of the sequences such that no two sequences share a common prefix or suffix, and there is no larger subset satisfying that property. We define the maximum bipartite matching sub-block as the block consisting of all vertices (prefixes and suffixes) in a maximum matching. Figure 1 shows an example of a function, its corresponding graph, a maximum bipartite matching for that graph, and the corresponding sub-block and Hankel submatrix.

There are several classical algorithms for finding a maximum bipartite matching. The Augmenting Paths algorithm runs in $O(|V||E|)$, but in practice it has a much lower average-case complexity. The Hopcroft-Karp algorithm runs in $O(|E| \sqrt{|V|})$, removing the linear dependence on |V| (however, in our experiments the Augmenting Paths algorithm was already very fast). In the next section we propose an algorithm that takes advantage of the structure of the Hankel matrix to obtain further speed-ups.
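Putting the construction just described into code (a hedged sketch built on SciPy's matcher; it mirrors the definitions above and is not the authors' implementation):

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def matching_basis(support):
    """Select a basis (P, S) via a maximum bipartite matching (Section 3).

    support: iterable of sequences x with f(x) != 0.
    """
    prefixes, suffixes, edges = {}, {}, set()
    for x in support:
        for i in range(len(x) + 1):          # every cut of x yields an edge
            p, s = x[:i], x[i:]
            pi = prefixes.setdefault(p, len(prefixes))
            si = suffixes.setdefault(s, len(suffixes))
            edges.add((pi, si))
    rows, cols = zip(*edges)
    graph = csr_matrix(([1] * len(edges), (rows, cols)),
                       shape=(len(prefixes), len(suffixes)))
    match = maximum_bipartite_matching(graph, perm_type='column')
    inv_p = {i: p for p, i in prefixes.items()}
    inv_s = {i: s for s, i in suffixes.items()}
    P = [inv_p[i] for i, j in enumerate(match) if j != -1]
    S = [inv_s[j] for j in match if j != -1]
    return P, S

# On the support of Figure 1 this selects a complete 5x5 sub-block.
P, S = matching_basis(["", "aab", "b", "bb", "c", "ca", "cb"])
```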
3.1 On the Optimality of the Maximum Matching Sub-block

We will use a weak version of the matching property, an assumption used by Hoffman and McCormick [1982]. Let M be a matrix of structural rank s. M has the weak matching property (WMP) if, for any submatrix M′ with at least s rows and s columns, the rank of M′ is equal to the structural rank of M′.

Lemma 1. Let H be a Hankel matrix that satisfies the weak matching property. Let B be a maximum bipartite matching of H and let H_B be the corresponding submatrix. Then B is a basis of H, i.e. the rank of H_B is equal to the rank of H.

Proof. Let s(M) denote the structural rank of a matrix M. Let n be the rank of H, and note that s(H) = n because H has the WMP. Now note that s(H_B) is also n: since the maximum bipartite matching of H is included in H_B, s(H_B) is at least n; and it is at most n, since otherwise s(H) ≥ s(H_B) > n. Since H has the WMP, the rank of H_B is n.

Ideally, we would not have to assume the matching property, and instead we could provide theoretical guarantees on the maximum gap between the structural and numerical rank of a matrix. Unfortunately, because of the discrete nature of the structural rank, deriving useful bounds for this gap has been shown to be a hard theoretical challenge [Hoffman and McCormick, 1982]. Thus, to provide validation for our assumption, we resorted to an empirical evaluation of the gap on a wide range of sequence modeling datasets, where we observe that the weak matching property is a reasonable assumption. The complete results are in Section B of the supplementary material.

4 FASTER BIPARTITE MATCHING FOR HANKEL MATRICES

As stated in the previous section, finding the structural rank can be reduced to the maximum bipartite matching problem. In this section we propose a simple heuristic to speed up maximum bipartite matching for the specific case where the underlying matrix is a Hankel matrix. We do this by exploiting structural properties of these matrices in the underlying subroutine, the augmenting path algorithm. Each basic application of the augmenting path increases the matching by one, and a matching is maximum if and only if there is no further augmenting path. The straightforward solution of applying it to each node is equivalent to the maximal flow algorithm, and while more sophisticated algorithms have been proposed [Hopcroft and Karp, 1973] that find several paths per iteration, benchmarks [Setubal, 1996] have shown that the simple algorithm is in general faster.

[Figure 2: Illustrations of the augmenting path algorithm, in three panels (a)-(c), on a bipartite graph with left vertices x, y and right vertices z, w.]

We first describe the basic procedure. Assume the graph depicted in Figure 2(a), and furthermore assume that the current (non-maximum) matching is M = {(y, z)}. This is clearly not maximum, as a better (and maximum) matching would be {(x, z), (y, w)}. The augmenting path procedure maps the previous matching to the directed graph G depicted in Figure 2(b): unselected edges are directed from left to right, while selected edges are directed from right to left. An augmenting path is then defined as a path $x_1, \ldots, x_m$ over G such that $x_1$ belongs to the left partition, $x_m$ to the right one, and both $x_1$ and $x_m$ are unmatched (that is, they do not yet belong to the matching). No restrictions are placed on the intermediate nodes, but it becomes clear that the path alternates between unmatched pairs (left-to-right edges) and matched pairs (right-to-left edges). Such paths can easily be retrieved with a standard graph traversal (in our implementation we use a depth-first search, which we assumed to be faster on sparse graphs, although this was not verified). Starting from node x, the path x, z, y, w can be retrieved, and the graph is then rewired to the one depicted in Figure 2(c). No further augmenting paths exist there, and the maximum matching algorithm therefore finishes with the matching {(x, z), (y, w)}.

The specific case where the left part of the bipartite graph consists of prefixes and the right part of suffixes creates some strong structural constraints. Notably:

Property 1. (pσ, s) is an edge in the graph iff (p, σs) is an edge.

That is, the edges of the bipartite graph denoting a Hankel matrix come in (possibly overlapping) groups, each group originating in one of the support sequences.
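For reference, a minimal version of the augmenting-path subroutine just described (a generic sketch, not the authors' code; the left_order argument anticipates the prefix-ordering heuristic proposed next, and an iterative traversal would avoid Python's recursion limit on long paths):

```python
def augmenting_path_matching(adj, n_left, n_right, left_order=None):
    """Maximum bipartite matching by repeated augmenting-path DFS.

    adj: adj[u] lists the right-side neighbours of left node u.
    Returns match_l (left -> right, -1 if free) and match_r (right -> left).
    """
    match_l = [-1] * n_left
    match_r = [-1] * n_right

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere:
            # flipping the alternating path grows the matching by one edge.
            if match_r[v] == -1 or try_augment(match_r[v], visited):
                match_l[u], match_r[v] = v, u
                return True
        return False

    for u in (left_order if left_order is not None else range(n_left)):
        try_augment(u, set())
    return match_l, match_r

# Figure 2's graph: left {x=0, y=1}, right {z=0, w=1}; edges x-z, y-z, y-w.
ml, mr = augmenting_path_matching({0: [0], 1: [0, 1]}, 2, 2)
# ml == [0, 1], i.e. the maximum matching {(x, z), (y, w)}.
```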
We propose to take advantage of this structural knowledge to speed up the maximum matching algorithm. First, we sort the prefixes by their lengths, and start applying the augmenting path procedure from the longest prefix node. Each augmenting path procedure returns a set of edges R to be removed from the matching and a set of edges A to be added to the matching. For each edge $(\sigma_1 \ldots \sigma_k, s) \in A$ we consider all shifted pairs $(\sigma_1 \ldots \sigma_i, \sigma_{i+1} \ldots \sigma_k s)$. Due to Property 1, each of these pairs is an edge in the bipartite graph. We check each such pair, and if both nodes are unmatched we simply add the pair to the matching. Assuming a bitset implementation of sets, the checks can be done in O(|E|); but in the worst case it may well be that none of the shifted pairs is free, in which case they only add computation without improving the matching. In §B of the supplementary material we report synthetic experiments showing the speed-ups of this strategy compared to the standard method.

5 EXPERIMENTS

To validate our sub-block selection strategy, we present comparisons to methods for scaling up spectral learning. We first compare to general methods for scaling SVD, and then to sub-block selection strategies for Hankel matrices. We end this section with a comparison to state-of-the-art methods on the SPiCe benchmark.

In all experiments we use natural language data for the task of language modeling. The goal is to learn a language model that predicts the next symbol of a sentence prefix (including ending the sentence). As evaluation metric we use Bits per Character (BpC), the average log-2 probability that the model gives to each symbol in the evaluation sequences, including sequence ends. As datasets we use the English Penn Treebank [Marcus et al., 1994] with standard splits (49 characters; 5017k / 393k / 442k characters in the train / dev / test portions), the War and Peace dataset [Karpathy et al., 2016] (84 symbols; 2658k / 300k / 300k characters in the train / dev / test portions), and the NLP datasets of the SPiCe benchmark [Balle et al., 2016].

5.1 Scalable SVD Methods

We conducted experiments comparing our method with two other strategies for scaling SVD. The first uses Randomized Projections to perform SVD [Halko et al., 2009]; this idea was previously used to scale spectral learning [Hamilton et al., 2013]. The second strategy is based on Sampling, and selects the k top rows and columns that have the highest norm [Deshpande and Vempala, 2006].

[Figure 3: Comparison of strategies for scaling spectral learning: bits per character (roughly 1.26 to 1.36) versus training time in seconds (0 to 800) for Matching, Complete, Random Projections, and Sampling.]

For this comparison we used the Penn Treebank dataset with simplified part-of-speech tags (12 symbols). We chose this dataset because it results in a relatively small Hankel matrix on which we can run sparse SVD. In particular, we used a moment size of T = 5, which results in a square Hankel matrix of size 52,450, with numerical rank 312 and structural rank 313. Thus, the Complete method runs sparse SVD on this matrix. We present a trade-off between performance (in terms of bits-per-character) and training time for each method. When appropriate, we generate solutions that utilize different amounts of time.
For Sampling, since it selects k rows and columns proportionally to their norm, a natural way of generating different solutions is to vary k. For Randomized Projections we do not select a sub-block; instead, we project the Hankel matrix to a lower ℓ-dimensional space and then perform the SVD on the projected matrix. Thus, to get performance as a function of training cost, we can change the size of the projection.

The training time of a method (all experiments were run on a 2.2 GHz Intel Core processor) consists of: (1) time spent selecting the Hankel sub-block, for algorithms that start with sub-block selection (e.g. best matching); (2) time spent computing the singular value decomposition; and (3) time spent computing inverses, i.e. recovering operators. Notice that all spectral methods perform SVD on a Hankel sub-block. Whenever we compute an SVD, we take the cost of the most efficient variant (i.e. sparse or full SVD) as the cost of the algorithm. Another important observation is that the sparse SVD algorithm takes as a parameter the number of singular values to compute; we take this to be the optimal number of states found using the validation data.

Figure 3 shows the trade-off for the four methods. The first observation is that, with a sufficient amount of computation time, both Random Projections and Sampling achieve the same performance as using the Complete Hankel. This is expected, since by setting k and ℓ sufficiently large we should always obtain the same result as using the complete Hankel. Random Projections seems to be significantly better than Sampling in terms of speed-up, and it can obtain the same solution as Complete in less than 1/4 of the time. Best Bipartite Matching obtains a slightly higher bits-per-character than Random Projections, but is significantly more efficient. More precisely, to achieve the same performance as with Matching, Random Projections requires about 50 times more time.

5.2 Sub-block Selection Strategies for Spectral Methods

We now present an empirical comparison between the most prevalent sub-block selection strategies for spectral learning. We train spectral language models at the character level that use a fixed window of T characters both at training and test time. At training, we collect all substrings x of length up to T. Following Balle et al. [2013], we set the target function f_T(x) to be the expected number of times that x appears as a subsequence of a random sentence sampled from the training data. We run the spectral algorithm with f_T and obtain a WA. At test time, we run the WA to compute the probability of the next character given a sliding prefix of length T − 1.

We compare maximum matching sub-block selection to three strategies: full block, random cuts, and length up-to. Full block uses all substrings of the support of f_T as prefixes and suffixes. Random Cuts follows Balle et al. [2012]: it samples a string x of the support and chooses a random cut of x into a prefix and a suffix, which are added to the sub-block; this process is repeated until the sub-block reaches size k (a parameter). Length ≤ ℓ selects all substrings up to length ℓ.
Table 1 compares sub-block selection methods in terms of the numerical rank of the sub-matrix, the time it takes to compute an n-rank factorization, and the quality of the resulting n-state WA in terms of bits per character (BpC). n is a parameter that we tune on validation data over a range of values up to the rank of the sub-matrix.

Table 1: Comparison between sub-block selection methods for support strings up to size T = 5.

    method            size      rank    sec.    BpC
    Full              144,378   -       18,000  1.735
    Matching          1,661     1,612   8       1.741
    Random Cuts 1×    1,661     739     10      2.011
    Random Cuts 2×    3,322     807     74      1.828
    Random Cuts 3×    4,983     902     163     1.812
    Random Cuts 4×    6,664     989     271     1.791
    Random Cuts 5×    8,305     1,010   302     1.769
    Random Cuts 6×    9,966     1,086   411     1.761
    Random Cuts 7×    11,627    1,114   825     1.752
    Length ≤ 2        861       92      2       3.105
    Length ≤ 3        7,455     417     290     2.662
    Length ≤ 4        38,314    907     3,500   1.856

The matching sub-block obtains results that are very close to using the full matrix. However, it is much faster: the time to compute a matching is negligible, and the time to factorize the matrix is three orders of magnitude smaller. Compared to the other strategies, the matching sub-block is the most accurate and the most compact, and thus it is drastically faster. This improvement is achieved because it selects a very compact sub-block (of size 1,661) that has approximately full numerical rank (1,612). In contrast, with Random Cuts, a block of the same size as the matching sub-block (1,661) has a rank of only 739, which results in lower-quality predictions. When increasing the block of Random Cuts up to 7 times the size of the matching, we obtain a rank of 1,114 and results very close to the matching and full sub-blocks; however, factorizing the sub-matrix is 100 times more costly. Sub-block selection by maximum length also performs poorly. This last result is evidence that long-range statistical dependencies exist in this data, and these are not captured by small moments. On the other hand, a brute-force approach to capturing such long-range dependencies is prohibitive. Our method clearly offers a very competitive solution.

Next we present results of models trained on larger substrings, of size up to 7, for the Penn Treebank and War and Peace datasets. Table 2 compares the performance of the matching sub-block to using the full block. (For the War and Peace data, we measure performance in terms of test negative log-likelihood, so that we can compare to published results.)

Table 2: Results of spectral models for increasing length of strings in the support.

         Penn Treebank      War and Peace (Lik)
    T    Full    Match.     Full    Match.   KJL16
    5    1.735   1.741      1.377   1.405    1.451
    6    1.623   1.653      1.326   1.393    1.339
    7    1.597   1.622      1.323   1.369    1.321

As we increase the size of the substrings (T), the models get better.
There is always a performance gap between using the full and the matching blocks; however, the matching sub-block scales much better: the cost of computing a matching is negligible (less than 15 seconds), and the cost of the factorization is three orders of magnitude smaller. Table 2 also compares to the results of Karpathy et al. [2016] (noted KJL16), in terms of negative log-likelihood on test characters (noted Lik). We report their results corresponding to non-recurrent feed-forward neural models, which condition each prediction on the T − 1 latest characters (see Table 2 of their paper). The results are fairly comparable, exhibiting the same trend.

5.3 Comparison with the State of the Art

In order to compare the performance of our proposed method to other state-of-the-art methods for sparse sequence modeling, we ran experiments on the five NLP datasets of the SPiCe sequence prediction competition [Balle et al., 2016]. The task of the competition was the following: given a string (prefix) of symbols in a finite alphabet, the goal is to predict a ranking of possible next symbols. The evaluation metric measures the average ranking that the model gives to the correct next symbol (we refer the reader to the SPiCe benchmark website: http://spice.lif.univ-mrs.fr). Both validation and test sets are available from the challenge website.

There were a total of 26 teams implementing a wide range of methods, including many different types of neural network models, boosting, spectral and classical state-merging algorithms for learning weighted automata, and ensemble methods combining several techniques.

Table 3 shows results for the top 5 teams of the competition. The top team (RNN-P) is a novel RNN architecture where the state vector is augmented with an indicator vector representing the previous n-gram in the history. The second best team (COMBO-NN) is an ensemble of MLP, CNN, LSTM and n-gram models. The third team (COMBO-B) is also an ensemble method, combining n-gram, spectral, RNN and tree boosting models. The fourth team (LSTM) is an RNN with LSTM cells, and the fifth team (COMBO-Sp) is another ensemble method that combines a spectral model with n-gram models.

Table 3: Results on the NLP datasets of the SPiCe sequence prediction competition. Stars mark the top performing method for each dataset.

    team          Verbs     LM (words)  LM (characters)  POS       Normalization  All
    RNN-P         *0.6078*  *0.5434*    *0.8101*         *0.6573*  *0.5882*       *0.6414*
    COMBO-NN-1    0.5794    0.5014      0.7632           0.6331    0.5181         0.5990
    COMBO-B       0.5514    0.4264      0.7978           0.5890    0.3843         0.5498
    LSTM          0.5123    0.4034      0.7630           0.5941    0.4187         0.5383
    COMBO-Sp      0.5273    0.4148      0.6142           0.6235    0.4990         0.5358
    Sp-BM         0.5928    0.4998      0.7820           0.6356    0.5441         0.6109

The performance of our spectral method with the proposed sub-block selection via best bipartite matching (Sp-BM) is given in the last row. Running the proposed algorithm out of the box and without any model combination, we obtain very competitive performance: second best overall (0.6414 vs 0.6109) and second in 3 out of 5 datasets. One of the most attractive properties of our method is that the most costly training times (those corresponding to datasets with Hankel matrices of higher structural rank) were less than 5 minutes.

6 CONCLUSIONS

We presented a novel strategy for scaling spectral learning algorithms that is specifically designed for modeling long-range dependencies in sparse sequence functions. The main idea is to use maximum bipartite matching to find a Hankel sub-block of maximal structural rank.
Our experiments on a real sequence modeling task show that: (1) exploiting large Hankel matrices is essential for the success of spectral learning algorithms; and (2) our proposed sub-block selection strategy for handling large Hankel matrices can be much faster than running sparse SVD over the complete Hankel matrix, without a significant loss in performance. Our algorithm leads to a very appealing trade-off between computational complexity and model performance.

References

Animashree Anandkumar, Daniel J. Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory (COLT), volume 23 of JMLR Proceedings, pages 33.1–33.34. JMLR.org, 2012.

R. Bailly, F. Denis, and L. Ralaivola. Grammatical inference as a principal component analysis problem. In Proc. ICML, 2009.

B. Balle, X. Carreras, F.M. Luque, and A. Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 2013.

Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: A new insight into spectral learning. In ICML '12, 2012.

Borja Balle, Rémi Eyraud, Franco M. Luque, Ariadna Quattoni, and Sicco Verwer. Results of the sequence prediction challenge (SPiCe): a competition on learning the next symbol in a sequence. In Proceedings of the 13th International Conference on Grammatical Inference, 2016.

A. Beimel, F. Bergadano, N.H. Bshouty, E. Kushilevitz, and S. Varricchio. Learning functions represented as multiplicity automata. JACM, 2000.

J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 1971.

François Denis, Mattias Gybels, and Amaury Habrard. Dimension-free concentration bounds on Hankel matrices for spectral learning. Journal of Machine Learning Research, 17(31):1–32, 2016. URL http://jmlr.org/papers/v17/14-501.html.

A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In RANDOM '06, 2006.

J. Edmonds. Systems of distinct representatives and linear algebra. Journal of Research of the National Bureau of Standards, 71B(4):241–245, 1967.

M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 1974.

William L. Hamilton, Mahdi M. Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 178–186. JMLR Workshop and Conference Proceedings, 2013.

A.J. Hoffman and S.T. McCormick. A fast algorithm that makes matrices optimally sparse. Technical Report 13, Stanford University Systems Optimization Laboratory, 1982.

John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

D. Hsu, S.M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. of COLT, 2009.

H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. In ICLR Workshop Track, 2016.

Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 1994.

N. Halko, P.G. Martinsson, and J.A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061, 2009.

R. Peeters. The maximum edge biclique problem is NP-complete. Discrete Applied Mathematics, 2003.

M.P. Schützenberger. On the definition of a family of automata. Information and Control, 1961.

João C. Setubal. Sequential and parallel experimental results with bipartite matching algorithms, 1996.

Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

Eric Wiewiora. Learning predictive representations from a history. In Proceedings of the 22nd International Conference on Machine Learning, pages 964–971. ACM, 2005.