A Maximum Matching Algorithm for Basis Selection in Spectral Learning
Authors: Ariadna Quattoni, Xavier Carreras, Matthias Gallé
Xerox Research Centre Europe (XRCE), Meylan, France
{ariadna.quattoni, xavier.carreras, matthias.galle}@xrce.xerox.com

Abstract

We present a solution to scale spectral algorithms for learning sequence functions. We are interested in the case where these functions are sparse (that is, for most sequences they return 0). Spectral algorithms reduce the learning problem to the task of computing an SVD decomposition over a special type of matrix called the Hankel matrix, which is designed to capture the relevant statistics of the training sequences. What is crucial is that to capture long-range dependencies we must consider very large Hankel matrices, so the computation of the SVD becomes a critical bottleneck. Our solution finds a subset of rows and columns of the Hankel matrix that realizes a compact and informative Hankel submatrix. The novelty lies in the way this subset is selected: we exploit a maximum bipartite matching combinatorial algorithm to look for a sub-block with full structural rank, and show how the computation of this sub-block can be further improved by exploiting the specific structure of Hankel matrices.

1 INTRODUCTION

Our goal is to model functions whose domain is discrete sequences over some finite alphabet. Our focus is on sparse functions, by which we mean functions with the property that only a very small proportion of the sequences in the domain map to a non-zero value. We call those sequences the support of the function. The main motivation lies in solving problems arising in Natural Language Processing (NLP) applications, where sparse sequence functions are of special interest. For example, think of all possible sequences of T letters that constitute valid English words of length T. If Σ is the set of English letters, it is clear that out of the $|\Sigma|^T$ possible letter sequences only a very small fraction are valid words (i.e., should have non-zero probability).

One interesting function class over Σ⋆ is that of functions computed by Non-Deterministic Weighted Automata (WA), since this class properly includes classes such as n-gram models and hidden Markov models. In recent years several approaches for estimating WAs have been proposed that are based on representing the function computed by a WA using a Hankel matrix [Beimel et al., 2000, Jaeger, 2000, Hsu et al., 2009, Anandkumar et al., 2012, Balle et al., 2013].

As an illustration of the method, consider the following problem: assume we are given a set of pairs (x, f(x)), where x is a sequence in the support of some target function f over Σ⋆, and we wish to learn a WA that approximates f. The spectral method provides a solution to this problem, working in four steps:

1. Basis selection: choose a set of prefixes P and suffixes S.
2. Build a Hankel matrix $H \in \mathbb{R}^{|P| \times |S|}$, where the entry H(p, s) is the value of the target function on the sequence obtained by concatenating prefix p with suffix s.
3. Perform SVD on $H = U \Sigma V^\top$.
4. Use the factorization $F = U\Sigma$ and $B = V^\top$, together with H, to recover the parameters of the minimal WA, following Hsu et al. [2009] (see §2.3 for details).
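As a toy illustration of steps 1-3 in code (a hedged sketch in Python/NumPy: the tiny sample below is made up, and a real implementation would store the Hankel matrix sparsely):

```python
import numpy as np

# Hypothetical sample of (sequence, value) pairs for a sparse target f over {a, b}.
f = {"": 1.0, "ab": 0.5, "b": 0.25}

def value(x):
    return f.get(x, 0.0)          # sparse function: unseen sequences map to 0

# Steps 1-2: choose a basis (P, S) and fill in H(p, s) = f(ps).
P = ["", "a", "b"]                # prefixes
S = ["", "b"]                     # suffixes
H = np.array([[value(p + s) for s in S] for p in P])

# Step 3: SVD of the Hankel block (dense here; large and sparse in practice).
U, Sigma, Vt = np.linalg.svd(H, full_matrices=False)
```

Step 4, the recovery of the WA parameters from the factorization, is spelled out in §2.3.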
The computational cost of the algorithm is dominated by the SVD step, $O(\min(|P|, |S|)^3)$; thus, to control the computational complexity, it is critical to choose a small and yet informative basis.

The theory of spectral learning tells us that if the target function has a minimal WA representation of size n, there will be a complete basis where |P| = |S| = n, where complete means that the rank of the corresponding Hankel matrix defined over that basis is the same as the size of the minimal WA. But the theory does not give a practical answer to how to choose such a basis. The design of efficient algorithms for choosing an informative and yet small sample-dependent basis is still an open problem, and it is the focus of our paper.

We propose an efficient combinatorial algorithm for sample-dependent basis selection. At its core, our strategy computes a maximum matching of the bipartite graph associated with the sparsity pattern of a Hankel matrix. The main idea is quite simple: we find a subset of prefixes and suffixes in the given sample such that the corresponding Hankel matrix defined over that basis has full structural rank. The key insight is that for sparse matrices it is easy to remove symbolic dependencies (i.e., dependencies at the level of the sparsity pattern of the matrix). Similar ideas have a long history in the numerical optimization literature, where combinatorial algorithms are used to compute preconditioners for solving large sparse linear systems. However, to the best of our knowledge, we are the first to apply this idea in the context of spectral learning.

We show that when the Hankel matrix of a function satisfies some non-degeneracy assumptions, our basis selection algorithm is optimal, in the sense that it computes the smallest complete basis. While the non-degeneracy assumption will not always be satisfied, our experiments suggest that it is almost always satisfied for sparse sequence functions.

Our experiments on real sequence modeling tasks show that the proposed algorithm can select a basis that is at least an order of magnitude smaller than the best alternative methods for basis selection, resulting in an SVD step that is at least two orders of magnitude faster.

1.1 Related Work

Although choosing a basis is in practice an important task for obtaining a robust spectral learning algorithm, not much research has focused on this problem. One popular approach is to choose a basis by selecting all observed prefixes and suffixes of length less than T, for some T > 0 [Hsu et al., 2009, Siddiqi et al., 2010]. In practice, this strategy only works if there are no long-range dependencies in the target function. Wiewiora [2005] presented a greedy heuristic in which each prefix added to the basis requires a computation that takes time exponential in the number of states n. Bailly et al. [2009] suggest including all prefixes and suffixes observed in the sample in the basis.
There are some theoretical results [Denis et al., 2016] suggesting that under certain assumptions this is the optimal strategy, in the sense that there is no statistical harm in considering all prefixes and suffixes. However, this approach is unfeasible in practice: to give a concrete example, if one considers modeling the distribution of n-grams up to length 10 in a standard NLP benchmark, the number of unique observed prefixes and suffixes is at least in the tens of millions. Finally, Balle et al. [2012] gave the first theoretical results for the problem of basis selection. They show that by sampling prefixes and suffixes proportionally to their frequency in a large enough sample, with high probability a complete basis will be found. They also provide experimental results [Balle et al., 2013].

2 PRELIMINARIES

2.1 Non-Deterministic Weighted Finite State Automata

We start by defining a class of functions over discrete sequences. More specifically, let $x = x_1 \cdots x_t$ be a sequence of length t over some finite alphabet Σ. We use Σ⋆ to denote the set of all finite sequences with elements in Σ, and we use ε to denote the empty sequence. The domain of our functions is Σ⋆.

A Non-Deterministic Weighted Automaton (WA) with n states is defined as a tuple $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma} \rangle$, where $\alpha_0, \alpha_\infty \in \mathbb{R}^n$ are the initial and final weight vectors and $A_\sigma \in \mathbb{R}^{n \times n}$ are the transition matrices associated with each symbol σ ∈ Σ. The function $f_A : \Sigma^\star \to \mathbb{R}$ realized by a WA A is defined as:

$$f_A(x) = \alpha_0^\top A_{x_1} \cdots A_{x_t} \alpha_\infty. \qquad (1)$$

The above equation is an algebraic representation of the computation performed by a WA on a sequence x. To see this, consider a state vector $s_i \in \mathbb{R}^n$ whose j-th entry represents the sum of the weights of all the state paths that generate the prefix $x_{1:i}$ and end in state j. Initially $s_0 = \alpha_0$, and then $s_i^\top = s_{i-1}^\top A_{x_i}$ updates the state distribution by simultaneously emitting the symbol $x_i$ and transitioning to generate the next state vector. WAs constitute a rich function class that properly includes popular sequence models such as HMMs.
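Equation (1) is direct to implement. The following minimal NumPy sketch (the two-state automaton is made up for illustration, not an example from the paper) evaluates a WA on a sequence exactly as the state-vector recursion above describes:

```python
import numpy as np

def wa_score(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_{x_1} ... A_{x_t} alpha_inf (Equation 1)."""
    s = alpha0.copy()              # s_0 = alpha_0
    for sigma in x:                # s_i^T = s_{i-1}^T A_{x_i}
        s = s @ A[sigma]
    return float(s @ alpha_inf)

# A made-up 2-state WA over Sigma = {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 1.0])
A = {"a": np.array([[0.5, 0.5], [0.0, 1.0]]),
     "b": np.array([[1.0, 0.0], [0.2, 0.3]])}
print(wa_score(alpha0, alpha_inf, A, "ab"))   # -> 0.15
```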
2.2 Hankel Matrices

We now introduce the concept of Hankel matrices for WAs, which are central to the spectral learning algorithm and to the results in this paper.

[Figure 1: Illustration of the maximum bipartite matching sub-block. Left: a training set and the associated target function. Middle: a prefix-suffix graph with a corresponding maximum matching in red. Right: the full Hankel matrix for the training set, and the submatrix given by the matching. The example uses Σ = {a, b, c} and T = {(ε, 1), (aab, 1), (b, 1), (bb, 1), (c, 1), (ca, 1), (cb, 1)}, so that f_T(x) = 1 if x ∈ {ε, aab, b, bb, c, ca, cb} and 0 otherwise; P_T = {ε, a, aa, aab, b, bb, c, ca, cb} and S_T = {ε, b, ab, aab, bb, a, c, ca, cb}; the matching selects P = {ε, a, aa, aab, c} and S = {ε, b, ab, aab, a}; both H_{P_T × S_T} and H_{P × S} have rank equal to structural rank equal to 5.]

Let f : Σ⋆ → R be an arbitrary function from sequences to reals (not necessarily computed by a WA). Let P, S ⊆ Σ⋆ be sets of sequences. We call the elements p ∈ P prefixes and the elements s ∈ S suffixes. The Hankel matrix $H_f \in \mathbb{R}^{P \times S}$ for f over the block (P, S) is defined by entries H(p, s) = f(ps), where ps is the concatenation of prefix p ∈ P and suffix s ∈ S. The following theorem gives a bijection between the class of functions computed by WAs and Hankel matrices:

Theorem 1 [Schützenberger, 1961, Carlyle and Paz, 1971, Fliess, 1974]. A function f : Σ⋆ → R can be realized by a WA with n states if and only if, for every possible block (P, S), the corresponding Hankel matrix H_f has rank at most n.

2.3 The Spectral Method

We now give a brief description of the spectral method for estimating a minimal WA representation of a target function. The algorithm is a constructive version of the theorem above: it builds a Hankel matrix of rank n and computes the associated n-state WA from it. We only provide a high-level description of the method; for a complete derivation and the theory justifying the algorithm we refer the reader to the works by Hsu et al. [2009] and Balle et al. [2013].

Assume a training set T in the form of a collection of sequences, each paired with a target real value. We denote by f_T the function obtained from the training set, i.e. if x ∈ T, then f_T(x) is the target value. For example, T could be a corpus of English sentences, and f_T(x) the probability with which x appears in T. Given a training set T, the spectral algorithm computes a WA A with n states, where n is a parameter of the algorithm, such that f_A is a good approximation of f_T. See Hsu et al. [2009] for the generalization theory of the algorithm. The method is described by the following steps:

(1) Select a Hankel block. Let P_T and S_T be, respectively, the sets of all unique prefixes and suffixes of sequences in T. Select a block out of them, namely a subset of prefixes P ⊆ P_T and a subset of suffixes S ⊆ S_T.

(2) Compute Hankel matrices for (P, S). (a) Compute $H \in \mathbb{R}^{P \times S}$ with entries H(p, s) = f_T(ps). (b) Compute $h_P \in \mathbb{R}^P$ with $h_P(p) = f_T(p)$, and $h_S \in \mathbb{R}^S$ with $h_S(s) = f_T(s)$. (c) For each σ ∈ Σ, compute $H_\sigma \in \mathbb{R}^{P \times S}$ with entries $H_\sigma(p, s) = f_T(p \sigma s)$.

(3) Compute an n-rank factorization of H. Compute the truncated SVD of H, i.e. $H \approx U \Sigma V^\top$, resulting in a matrix $F = U\Sigma \in \mathbb{R}^{P \times n}$ and a matrix $B = V \in \mathbb{R}^{S \times n}$.

(4) Recover the WA A with n states. Let $M^+$ denote the Moore-Penrose pseudo-inverse of a matrix M. The elements of A are recovered as follows. Initial vector: $\alpha_0^\top = h_S^\top B$. Final vector: $\alpha_\infty = F^+ h_P$. Transition matrices: $A_\sigma = F^+ H_\sigma B$, for σ ∈ Σ.

There are some observations to make that motivate the contribution of this paper. Consider the complete training block (P_T, S_T), and let H_T denote the Hankel matrix for this complete block. If we want to fully reconstruct the function f_T, we need an automaton A with as many states as the rank of H_T. By using fewer states, we will be learning a low-rank approximation of f_T in the form of a WA.
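The recovery in steps (3) and (4) is compact enough to sketch directly; this is a hedged dense-NumPy illustration (a practical implementation would use sparse matrices and a truncated sparse SVD), assuming the Hankel blocks from step (2) have already been filled in:

```python
import numpy as np

def recover_wa(H, h_P, h_S, H_sig, n):
    """Steps (3)-(4): rank-n factorization of H and recovery of the operators.

    H: |P| x |S| Hankel block; h_P, h_S: prefix and suffix value vectors;
    H_sig: dict mapping each symbol to its |P| x |S| block H_sigma; n: states.
    """
    # Step (3): truncated SVD, H ~ F B^T with F = U.Sigma and B = V.
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    F = U[:, :n] * S[:n]
    B = Vt[:n].T
    # Step (4): recover the WA via the Moore-Penrose pseudo-inverse F^+.
    Fp = np.linalg.pinv(F)
    alpha0 = B.T @ h_S                                # alpha_0^T = h_S^T B
    alpha_inf = Fp @ h_P                              # alpha_inf = F^+ h_P
    A = {s: Fp @ Hs @ B for s, Hs in H_sig.items()}   # A_sigma = F^+ H_sigma B
    return alpha0, alpha_inf, A
```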
The second observation is that any sub-block (P, S) whose Hankel submatrix has full rank (with respect to the rank of H_T) can be used to fully recover f_T. Thus, in the ideal case, step (1) of the algorithm would select a compact submatrix of H_T that preserves the rank. By doing so, the cost of the subsequent steps, in particular the SVD, would only depend on the size of the submatrix. Even if we cannot get the ideal block, it would be good to have a method for step (1) that produces a small and informative block. Unfortunately, in the general case (i.e., for an arbitrary real matrix), finding the submatrix of fixed size that has maximal rank is known to be NP-complete [Peeters, 2003]. In this paper we propose an algorithm to find a small submatrix of H_T of high rank.

As a final note, spectral methods can be used to learn a language model, that is, a probability distribution over all sentences of a language. A straightforward way to learn a language model is to regard the training collection T as an empirical distribution over sequences of words, where the probability of a sequence is proportional to the number of times it appears, i.e. $f_T(x) = \Pr_T(x)$. Another choice, sometimes referred to as moment matching, is to set the function f_T(x) to be the expected number of times that the sequence x appears as a subsequence of a random sequence sampled from an empirical distribution. In this case, the spectral algorithm will learn a WA that computes expectations of subsequence frequencies. One useful result is that this WA can be converted to another WA that corresponds to the underlying language model, i.e. a distribution over sequences; see Balle et al. [2013] for details. In practice this second method is preferred, since subsequence frequency expectations are statistics that are more stable to estimate from a training set.

3 SUB-BLOCK SELECTION VIA BEST BIPARTITE MATCHING

We start this section by defining the structural rank of a matrix.

Definition 1. The structural rank of a matrix is the maximum rank over all numerical matrices with the same non-zero pattern.

Our proposed algorithm will then search for a submatrix of H with full structural rank. In the context of WAs and Hankel matrices this has a nice interpretation as a notion of complexity of the support of a function, because the structural rank of a Hankel matrix corresponds to the number of states of the minimal WA for the hardest function defined over that support. Notice that, by definition, the numerical rank of a matrix is always less than or equal to its structural rank; thus the structural rank of the Hankel matrix H of a function f_A will always be greater than or equal to the number of states of the minimal WA computing f_A. Our algorithm is based on finding a submatrix of H with full structural rank.

The problem of finding a full structural rank sub-block of H can be cast as an instance of maximum bipartite matching [Edmonds, 1967]. Given a bipartite graph (V, E), where V is the set of vertices and E the set of edges, a maximum bipartite matching is a largest set of non-intersecting edges, where non-intersecting means that no two edges in the set share a common vertex.
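Both the structural rank and the matching it rests on are available off the shelf in SciPy (scipy.sparse.csgraph; the matching semantics below assume SciPy 1.4 or newer). A quick illustration on a made-up 3x3 sparsity pattern:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching, structural_rank

# Hypothetical sparsity pattern: 1 wherever the entry may be non-zero.
pattern = csr_matrix(np.array([[1, 1, 0],
                               [0, 1, 0],
                               [1, 0, 0]]))
print(structural_rank(pattern))     # -> 2: at most two rows/columns match up
match = maximum_bipartite_matching(pattern, perm_type='column')
print(match)                        # column matched to each row, -1 if unmatched
```

The structural rank equals the size of a maximum matching in the bipartite graph of the sparsity pattern, which is what makes the reduction described next possible.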
In the case of the Hankel matrix of a function f_A, we have a bipartite graph (V, E) where the vertices on one side correspond to all unique prefixes in the support of f_A and those on the other side to all unique suffixes; thus |V| = |P| + |S|. There is an edge connecting nodes i and j if the sequence formed by concatenating prefix i and suffix j is in the support of f_A. For every sequence of length T in the support of f_A and every possible cut of it into a prefix and a suffix, there are T + 1 corresponding edges in E; thus $|E| = O(T |f_A|)$, where we use |f_A| to denote the number of sequences in the support of f_A.

The maximum bipartite matching of a set of sequences is a subset of the sequences such that no two sequences share a common prefix or suffix, and there is no larger subset satisfying that property. We define the maximum bipartite matching sub-block as the block consisting of all vertices (prefixes and suffixes) in a maximum matching. Figure 1 shows an example of a function, its corresponding graph, a maximum bipartite matching for that graph, and the corresponding sub-block and Hankel submatrix.

There are several classical algorithms for finding a maximum bipartite matching. The Augmenting Paths algorithm runs in $O(|V||E|)$, but in practice it has a much lower average-case complexity. The Hopcroft-Karp algorithm runs in $O(|E| \sqrt{|V|})$, removing the linear dependence on |V| (however, in our experiments the Augmenting Paths algorithm was already very fast). In the next section we propose an algorithm that takes advantage of the structure of the Hankel matrix to obtain further speed-ups.
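Putting the construction just described into code (a hedged sketch built on SciPy's matcher; it mirrors the definitions above and is not the authors' implementation):

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def matching_basis(support):
    """Select a basis (P, S) via a maximum bipartite matching (Section 3).

    support: iterable of sequences x with f(x) != 0.
    """
    prefixes, suffixes, edges = {}, {}, set()
    for x in support:
        for i in range(len(x) + 1):          # every cut of x yields an edge
            p, s = x[:i], x[i:]
            pi = prefixes.setdefault(p, len(prefixes))
            si = suffixes.setdefault(s, len(suffixes))
            edges.add((pi, si))
    rows, cols = zip(*edges)
    graph = csr_matrix(([1] * len(edges), (rows, cols)),
                       shape=(len(prefixes), len(suffixes)))
    match = maximum_bipartite_matching(graph, perm_type='column')
    inv_p = {i: p for p, i in prefixes.items()}
    inv_s = {i: s for s, i in suffixes.items()}
    P = [inv_p[i] for i, j in enumerate(match) if j != -1]
    S = [inv_s[j] for j in match if j != -1]
    return P, S

# On the support of Figure 1 this selects a complete 5x5 sub-block.
P, S = matching_basis(["", "aab", "b", "bb", "c", "ca", "cb"])
```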
3.1 On the Optimality of the Maximum Matching Sub-block

We will use a weak version of the matching property, an assumption used by Hoffman and McCormick [1982]. Let M be a matrix of structural rank s. M has the weak matching property (WMP) if, for any submatrix M′ with at least s rows and s columns, the rank of M′ is equal to the structural rank of M′.

Lemma 1. Let H be a Hankel matrix that satisfies the weak matching property. Let B be a maximum bipartite matching of H and let H_B be the corresponding submatrix. Then B is a basis of H, i.e. the rank of H_B is equal to the rank of H.

Proof. Let s(M) denote the structural rank of a matrix M. Let n be the rank of H, and note that s(H) = n because H has the WMP. Now note that s(H_B) is also n: since the maximum bipartite matching of H is included in H_B, s(H_B) is at least n; and it is at most n, since otherwise s(H) ≥ s(H_B) > n. Since H has the WMP, the rank of H_B is n.

Ideally, we would not have to assume the matching property, and instead we could provide theoretical guarantees on the maximum gap between the structural and numerical rank of a matrix. Unfortunately, because of the discrete nature of the structural rank, deriving useful bounds for this gap has been shown to be a hard theoretical challenge [Hoffman and McCormick, 1982]. Thus, to provide validation for our assumption, we resorted to an empirical evaluation of the gap on a wide range of sequence modeling datasets, where we observe that the weak matching property is a reasonable assumption. The complete results are in Section B of the supplementary material.

4 FASTER BIPARTITE MATCHING FOR HANKEL MATRICES

As stated in the previous section, finding the structural rank can be reduced to the maximum bipartite matching problem. In this section we propose a simple heuristic to speed up maximum bipartite matching for the specific case where the underlying matrix is a Hankel matrix. We do this by exploiting structural properties of these matrices in the underlying subroutine, the augmenting path algorithm. Each basic application of the augmenting path increases the matching by one, and a matching is maximum if and only if there is no further augmenting path. The straightforward solution of applying it to each node is equivalent to the maximal flow algorithm, and while more sophisticated algorithms have been proposed [Hopcroft and Karp, 1973] that find several paths per iteration, benchmarks [Setubal, 1996] have shown that the simple algorithm is in general faster.

[Figure 2: Illustrations of the augmenting path algorithm, in three panels (a)-(c), on a bipartite graph with left vertices x, y and right vertices z, w.]

We first describe the basic procedure. Assume the graph depicted in Figure 2(a), and furthermore assume that the current (non-maximum) matching is M = {(y, z)}. This is clearly not maximum, as a better (and maximum) matching would be {(x, z), (y, w)}. The augmenting path procedure maps the previous matching to the directed graph G depicted in Figure 2(b): unselected edges are directed from left to right, while selected edges are directed from right to left. An augmenting path is then defined as a path $x_1, \ldots, x_m$ over G such that $x_1$ belongs to the left partition, $x_m$ to the right one, and both $x_1$ and $x_m$ are unmatched (that is, they do not yet belong to the matching). No restrictions are placed on the intermediate nodes, but it becomes clear that the path alternates between unmatched pairs (left-to-right edges) and matched pairs (right-to-left edges). Such paths can easily be retrieved with a standard graph traversal (in our implementation we use a depth-first search, which we assumed to be faster on sparse graphs, although this was not verified). Starting from node x, the path x, z, y, w can be retrieved, and the graph is then rewired to the one depicted in Figure 2(c). No further augmenting paths exist there, and the maximum matching algorithm therefore finishes with the matching {(x, z), (y, w)}.

The specific case where the left part of the bipartite graph consists of prefixes and the right part of suffixes creates some strong structural constraints. Notably:

Property 1. (pσ, s) is an edge in the graph iff (p, σs) is an edge.

That is, the edges of the bipartite graph denoting a Hankel matrix come in (possibly overlapping) groups, each group originating in one of the support sequences.
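For reference, a minimal version of the augmenting-path subroutine just described (a generic sketch, not the authors' code; the left_order argument anticipates the prefix-ordering heuristic proposed next, and an iterative traversal would avoid Python's recursion limit on long paths):

```python
def augmenting_path_matching(adj, n_left, n_right, left_order=None):
    """Maximum bipartite matching by repeated augmenting-path DFS.

    adj: adj[u] lists the right-side neighbours of left node u.
    Returns match_l (left -> right, -1 if free) and match_r (right -> left).
    """
    match_l = [-1] * n_left
    match_r = [-1] * n_right

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere:
            # flipping the alternating path grows the matching by one edge.
            if match_r[v] == -1 or try_augment(match_r[v], visited):
                match_l[u], match_r[v] = v, u
                return True
        return False

    for u in (left_order if left_order is not None else range(n_left)):
        try_augment(u, set())
    return match_l, match_r

# Figure 2's graph: left {x=0, y=1}, right {z=0, w=1}; edges x-z, y-z, y-w.
ml, mr = augmenting_path_matching({0: [0], 1: [0, 1]}, 2, 2)
# ml == [0, 1], i.e. the maximum matching {(x, z), (y, w)}.
```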
We propose to take advantage of this structural knowledge to speed up the maximum matching algorithm. First, we sort the prefixes by their lengths, and start applying the augmenting path procedure from the longest prefix node. Each augmenting path procedure returns a set of edges R to be removed from the matching and a set of edges A to be added to the matching. For each edge $(\sigma_1 \ldots \sigma_k, s) \in A$ we consider all shifted pairs $(\sigma_1 \ldots \sigma_i, \sigma_{i+1} \ldots \sigma_k s)$. Due to Property 1, each of these pairs is an edge in the bipartite graph. We check each such pair, and if both nodes are unmatched we simply add the pair to the matching. Assuming a bitset implementation of sets, the checks can be done in O(|E|); but in the worst case it may well be that none of the shifted pairs is free, in which case they only add computation without improving the matching. In §B of the supplementary material we report synthetic experiments showing the speed-ups of this strategy compared to the standard method.

5 EXPERIMENTS

To validate our sub-block selection strategy, we present comparisons to methods for scaling up spectral learning. We first compare to general methods for scaling SVD, and then to sub-block selection strategies for Hankel matrices. We end this section with a comparison to state-of-the-art methods on the SPiCe benchmark.

In all experiments we use natural language data for the task of language modeling. The goal is to learn a language model that predicts the next symbol of a sentence prefix (including ending the sentence). As evaluation metric we use Bits per Character (BpC), the average log-2 probability that the model gives to each symbol in the evaluation sequences, including sequence ends. As datasets we use the English Penn Treebank [Marcus et al., 1994] with standard splits (49 characters; 5017k / 393k / 442k characters in the train / dev / test portions), the War and Peace dataset [Karpathy et al., 2016] (84 symbols; 2658k / 300k / 300k characters in the train / dev / test portions), and the NLP datasets of the SPiCe benchmark [Balle et al., 2016].

5.1 Scalable SVD Methods

We conducted experiments comparing our method with two other strategies for scaling SVD. The first uses Randomized Projections to perform SVD [Halko et al., 2009]; this idea was previously used to scale spectral learning [Hamilton et al., 2013]. The second strategy is based on Sampling, and selects the k top rows and columns that have the highest norm [Deshpande and Vempala, 2006].

[Figure 3: Comparison of strategies for scaling spectral learning: bits per character (roughly 1.26 to 1.36) versus training time in seconds (0 to 800) for Matching, Complete, Random Projections, and Sampling.]

For this comparison we used the Penn Treebank dataset with simplified part-of-speech tags (12 symbols). We chose this dataset because it results in a relatively small Hankel matrix on which we can run sparse SVD. In particular, we used a moment size of T = 5, which results in a square Hankel matrix of size 52,450, with numerical rank 312 and structural rank 313. Thus, the Complete method runs sparse SVD on this matrix. We present a trade-off between performance (in terms of bits-per-character) and training time for each method. When appropriate, we generate solutions that utilize different amounts of time.
For Sampling, since it selects k rows and columns proportionally to their norm, a natural way of generating different solutions is to vary k. For Randomized Projections we do not select a sub-block; instead, we project the Hankel matrix to a lower ℓ-dimensional space and then perform the SVD on the projected matrix. Thus, to get performance as a function of training cost, we can change the size of the projection.

The training time of a method (all experiments were run on a 2.2 GHz Intel Core processor) consists of: (1) time spent selecting the Hankel sub-block, for algorithms that start with sub-block selection (e.g. best matching); (2) time spent computing the singular value decomposition; and (3) time spent computing inverses, i.e. recovering operators. Notice that all spectral methods perform SVD on a Hankel sub-block. Whenever we compute an SVD, we take the cost of the most efficient variant (i.e. sparse or full SVD) as the cost of the algorithm. Another important observation is that the sparse SVD algorithm takes as a parameter the number of singular values to compute; we take this to be the optimal number of states found using the validation data.

Figure 3 shows the trade-off for the four methods. The first observation is that, with a sufficient amount of computation time, both Random Projections and Sampling achieve the same performance as using the Complete Hankel. This is expected, since by setting k and ℓ sufficiently large we should always obtain the same result as using the complete Hankel. Random Projections seems to be significantly better than Sampling in terms of speed-up, and it can obtain the same solution as Complete in less than 1/4 of the time. Best Bipartite Matching obtains a slightly higher bits-per-character than Random Projections, but is significantly more efficient. More precisely, to achieve the same performance as with Matching, Random Projections requires about 50 times more time.

5.2 Sub-block Selection Strategies for Spectral Methods

We now present an empirical comparison between the most prevalent sub-block selection strategies for spectral learning. We train spectral language models at the character level that use a fixed window of T characters both at training and test time. At training, we collect all substrings x of length up to T. Following Balle et al. [2013], we set the target function f_T(x) to be the expected number of times that x appears as a subsequence of a random sentence sampled from the training data. We run the spectral algorithm with f_T and obtain a WA. At test time, we run the WA to compute the probability of the next character given a sliding prefix of length T − 1.

We compare maximum matching sub-block selection to three strategies: full block, random cuts, and length up-to. Full block uses all substrings of the support of f_T as prefixes and suffixes. Random Cuts follows Balle et al. [2012]: it samples a string x of the support and chooses a random cut of x into a prefix and a suffix, which are added to the sub-block; this process is repeated until the sub-block reaches size k (a parameter). Length ≤ ℓ selects all substrings up to length ℓ.
Table 1 compares sub-block selection methods in terms of the numerical rank of the sub-matrix, the time it takes to compute an n-rank factorization, and the quality of the resulting n-state WA in terms of bits per character (BpC). n is a parameter that we tune on validation data over a range of values up to the rank of the sub-matrix.

Table 1: Comparison between sub-block selection methods for support strings up to size T = 5.

    method            size      rank    sec.    BpC
    Full              144,378   -       18,000  1.735
    Matching          1,661     1,612   8       1.741
    Random Cuts 1×    1,661     739     10      2.011
    Random Cuts 2×    3,322     807     74      1.828
    Random Cuts 3×    4,983     902     163     1.812
    Random Cuts 4×    6,664     989     271     1.791
    Random Cuts 5×    8,305     1,010   302     1.769
    Random Cuts 6×    9,966     1,086   411     1.761
    Random Cuts 7×    11,627    1,114   825     1.752
    Length ≤ 2        861       92      2       3.105
    Length ≤ 3        7,455     417     290     2.662
    Length ≤ 4        38,314    907     3,500   1.856

The matching sub-block obtains results that are very close to using the full matrix. However, it is much faster: the time to compute a matching is negligible, and the time to factorize the matrix is three orders of magnitude smaller. Compared to the other strategies, the matching sub-block is the most accurate and the most compact, and thus it is drastically faster. This improvement is achieved because it selects a very compact sub-block (of size 1,661) that has approximately full numerical rank (1,612). In contrast, with Random Cuts, a block of the same size as the matching sub-block (1,661) has a rank of only 739, which results in lower-quality predictions. When increasing the block of Random Cuts up to 7 times the size of the matching, we obtain a rank of 1,114 and results very close to the matching and full sub-blocks; however, factorizing the sub-matrix is 100 times more costly. Sub-block selection by maximum length also performs poorly. This last result is evidence that long-range statistical dependencies exist in this data, and these are not captured by small moments. On the other hand, a brute-force approach to capturing such long-range dependencies is prohibitive. Our method clearly offers a very competitive solution.

Next we present results of models trained on larger substrings, of size up to 7, for the Penn Treebank and War and Peace datasets. Table 2 compares the performance of the matching sub-block to using the full block. (For the War and Peace data, we measure performance in terms of test negative log-likelihood, so that we can compare to published results.)

Table 2: Results of spectral models for increasing length of strings in the support.

         Penn Treebank      War and Peace (Lik)
    T    Full    Match.     Full    Match.   KJL16
    5    1.735   1.741      1.377   1.405    1.451
    6    1.623   1.653      1.326   1.393    1.339
    7    1.597   1.622      1.323   1.369    1.321

As we increase the size of the substrings (T), the models get better.
There is always a performance gap between using the full and the matching blocks; however, the matching sub-block scales much better: the cost of computing a matching is negligible (less than 15 seconds), and the cost of the factorization is three orders of magnitude smaller. Table 2 also compares to the results of Karpathy et al. [2016] (noted KJL16), in terms of negative log-likelihood on test characters (noted Lik). We report their results corresponding to non-recurrent feed-forward neural models, which condition each prediction on the T − 1 latest characters (see Table 2 of their paper). The results are fairly comparable, exhibiting the same trend.

5.3 Comparison with the State of the Art

In order to compare the performance of our proposed method to other state-of-the-art methods for sparse sequence modeling, we ran experiments on the five NLP datasets of the SPiCe sequence prediction competition [Balle et al., 2016]. The task of the competition was the following: given a string (prefix) of symbols in a finite alphabet, the goal is to predict a ranking of possible next symbols. The evaluation metric measures the average ranking that the model gives to the correct next symbol (we refer the reader to the SPiCe benchmark website: http://spice.lif.univ-mrs.fr). Both validation and test sets are available from the challenge website.

There were a total of 26 teams implementing a wide range of methods, including many different types of neural network models, boosting, spectral and classical state-merging algorithms for learning weighted automata, and ensemble methods combining several techniques.

Table 3 shows results for the top 5 teams of the competition. The top team (RNN-P) is a novel RNN architecture where the state vector is augmented with an indicator vector representing the previous n-gram in the history. The second best team (COMBO-NN) is an ensemble of MLP, CNN, LSTM and n-gram models. The third team (COMBO-B) is also an ensemble method, combining n-gram, spectral, RNN and tree boosting models. The fourth team (LSTM) is an RNN with LSTM cells, and the fifth team (COMBO-Sp) is another ensemble method that combines a spectral model with n-gram models.

Table 3: Results on the NLP datasets of the SPiCe sequence prediction competition. Stars mark the top performing method for each dataset.

    team          Verbs     LM (words)  LM (characters)  POS       Normalization  All
    RNN-P         *0.6078*  *0.5434*    *0.8101*         *0.6573*  *0.5882*       *0.6414*
    COMBO-NN-1    0.5794    0.5014      0.7632           0.6331    0.5181         0.5990
    COMBO-B       0.5514    0.4264      0.7978           0.5890    0.3843         0.5498
    LSTM          0.5123    0.4034      0.7630           0.5941    0.4187         0.5383
    COMBO-Sp      0.5273    0.4148      0.6142           0.6235    0.4990         0.5358
    Sp-BM         0.5928    0.4998      0.7820           0.6356    0.5441         0.6109

The performance of our spectral method with the proposed sub-block selection via best bipartite matching (Sp-BM) is given in the last row. Running the proposed algorithm out of the box and without any model combination, we obtain very competitive performance: second best overall (0.6414 vs 0.6109) and second in 3 out of 5 datasets. One of the most attractive properties of our method is that the most costly training times (those corresponding to datasets with Hankel matrices of higher structural rank) were less than 5 minutes.

6 CONCLUSIONS

We presented a novel strategy for scaling spectral learning algorithms that is specifically designed for modeling long-range dependencies in sparse sequence functions. The main idea is to use maximum bipartite matching to find a Hankel sub-block of maximal structural rank.
Our experiments on a real sequence modeling task show that: (1) exploiting large Hankel matrices is essential for the success of spectral learning algorithms; and (2) our proposed sub-block selection strategy for handling large Hankel matrices can be much faster than running sparse SVD over the complete Hankel matrix, without a significant loss in performance. Our algorithm leads to a very appealing trade-off between computational complexity and model performance.

References

Animashree Anandkumar, Daniel J. Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory (COLT), volume 23 of JMLR Proceedings, pages 33.1–33.34. JMLR.org, 2012.

R. Bailly, F. Denis, and L. Ralaivola. Grammatical inference as a principal component analysis problem. In Proc. ICML, 2009.

B. Balle, X. Carreras, F.M. Luque, and A. Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 2013.

Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: A new insight into spectral learning. In ICML '12, 2012.

Borja Balle, Rémi Eyraud, Franco M. Luque, Ariadna Quattoni, and Sicco Verwer. Results of the sequence prediction challenge (SPiCe): a competition on learning the next symbol in a sequence. In Proceedings of the 13th International Conference on Grammatical Inference, 2016.

A. Beimel, F. Bergadano, N.H. Bshouty, E. Kushilevitz, and S. Varricchio. Learning functions represented as multiplicity automata. JACM, 2000.

J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 1971.

François Denis, Mattias Gybels, and Amaury Habrard. Dimension-free concentration bounds on Hankel matrices for spectral learning. Journal of Machine Learning Research, 17(31):1–32, 2016. URL http://jmlr.org/papers/v17/14-501.html.

A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In RANDOM '06, 2006.

J. Edmonds. Systems of distinct representatives and linear algebra. Journal of Research of the National Bureau of Standards, 71B(4):241–245, 1967.

M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 1974.

William L. Hamilton, Mahdi M. Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 178–186. JMLR Workshop and Conference Proceedings, 2013.

A.J. Hoffman and S.T. McCormick. A fast algorithm that makes matrices optimally sparse. Technical Report 13, Stanford University Systems Optimization Laboratory, 1982.

John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

D. Hsu, S.M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. of COLT, 2009.

H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. In ICLR Workshop Track, 2016.

Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 1994.

N. Halko, P.G. Martinsson, and J.A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061, 2009.

R. Peeters. The maximum edge biclique problem is NP-complete. Discrete Applied Mathematics, 2003.

M.P. Schützenberger. On the definition of a family of automata. Information and Control, 1961.

João C. Setubal. Sequential and parallel experimental results with bipartite matching algorithms, 1996.

Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

Eric Wiewiora. Learning predictive representations from a history. In Proceedings of the 22nd International Conference on Machine Learning, pages 964–971. ACM, 2005.