Graph Kernels via Functional Embedding
Authors: Anshumali Shrivastava, Ping Li
Graph Kernels via Functional Embedding

Anshumali Shrivastava
Department of Computer Science, Computing and Information Science, Cornell University, Ithaca, NY 14853, USA. Email: anshu@cs.cornell.edu

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA. Email: pingli@stat.rutgers.edu

Abstract

We propose a representation of a graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. The Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.

1 Introduction

Graphs are becoming ubiquitous in modern applications spanning bioinformatics, social networks, search, computer vision, natural language processing, etc. Computing a meaningful similarity measure between graphs is a crucial prerequisite for a variety of learning algorithms operating on graph data. This notion of similarity typically varies with the application. In designing similarities (e.g., kernels) between graphs, it is desirable to have a measure which incorporates the rich structural information and is not affected by spurious transformations like reordering of vertices. This report is mainly for archival purpose.
It is essentially a paper initially submitted to a conference on Dec. 15, 2012. A separate paper [22], which focuses on social networks, used similar methodologies (but only the covariance part).

Note that, in certain applications, graphs can come with additional label information such as node or edge labels [18, 4]. These additional annotations are not always available in every domain (e.g., social networks) and are typically expensive to obtain. In this paper, we focus only on the basic graph structures, without assuming any additional information.

A common approach for computing kernels is to extract an explicit feature map from the graph, and then compute the kernel values via certain standard operations between features (e.g., inner products). This line of techniques typically makes use of graph invariants [20], such as eigenvalues of the graph Laplacian, as features. For example, [12] uses harmonic analysis techniques to extract a set of graph invariants. It was shown that a simple linear kernel, i.e., the dot product between these graph invariant numbers, outperforms many other graph kernels.

Alternatively, one can design a kernel function $K(G_1, G_2)$, given graphs $G_1$ and $G_2$, directly using "similarity" between them [25, 26]. For example, the random walk kernel [7, 9] is based on counting common random walks between two given graphs. Another example is the shortest-path kernel [1], which is based on counting pairs of vertices, between two graphs, having similar shortest distance between them.

Although random walk kernels and path based kernels are still among the widely adopted graph kernels, one common disadvantage with them is that walks and paths do not capture information about the substructures present in the graph [21, 16]. To address this problem, a flurry of interest arose in kernels based on counting common subgraph patterns.
Counting all possible common subgraphs is known to be NP-complete [7]. This led to the development of graph kernels focusing only on counting small subgraphs; for example, [21] counts common subgraphs with only 1, 2, or 3 nodes, also called graphlets. This kind of technique is very popular in social network classification. Recently, [23] used histograms of size-four subgraphs for classifying Facebook social networks.

However, simply counting common substructures like walks, paths, subgraphs, etc., ignores some crucial relative information between the substructures. For instance, the information of how different triangles are relatively embedded in the graph structure cannot be captured by simply counting the number of triangles. This relative information, as we show in this paper, is necessary for discriminating between different graph structures.

This paper follows an altogether different approach. We represent a graph as an expressive functional object. We first use the dynamical properties of the graph adjacency matrix to construct an informative summary of the graph. We then impose a probability distribution over the summary, and we show that this distribution is a graph invariant. The Bhattacharyya kernel between the obtained distributions, which we call the Power Kernel, significantly outperforms other well-known graph kernels on standard benchmark graph classification datasets. In addition, we show that, unlike other kernels, most of which require $O(n^3)$ time to compute (where n is the number of nodes), our kernel can be computed in time linear in the number of edges (which is at most $O(n^2)$). This makes the proposed methodology significantly more practical for larger graphs.

2 Notation

Given a graph G with n nodes, we denote its adjacency matrix by $A \in \mathbb{R}^{n \times n}$.
In this paper, entries of A are binary (0/1), i.e., $A(i,j) = 1$ means there is an edge between node i and node j. We interchangeably use the terms nodes and vertices, and the terms graph G and adjacency matrix A. The graph will always be assumed to be unlabeled, undirected and unweighted, with n nodes by default, unless otherwise specified. We use $\mathbf{1}$ for a vector of all ones. By vector, we mean a column vector, i.e., an $n \times 1$ matrix.

To avoid overloading subscripts, we will follow Matlab-style notation when denoting rows and columns of a matrix. For a given matrix A, $A(i,:)$ will denote the $i$th row of A, while $A(:,i)$ will refer to the $i$th column. For a vector x, $x(i)$ will denote its $i$th component.

Every permutation $\pi : \{1, 2, ..., n\} \to \{1, 2, ..., n\}$ is associated with a corresponding permutation matrix P. One important property of a permutation matrix is that its transpose is equal to its inverse, $P^T = P^{-1}$. The effect of left multiplying a given matrix A by P is to shuffle its rows according to $\pi$, i.e., the $\pi(i)$th row of $PA$ is the $i$th row of A. Right multiplication has the same effect on columns instead of rows. For any permutation matrix P, the graphs represented by adjacency matrices A and $PAP^T$ are isomorphic, i.e., they represent the same graph structure except that the nodes are reordered according to $\pi$.

3 Graphs as ARMA Models and Random Walk Kernels

One way of representing graphs is to think of the adjacency matrix $A \in \mathbb{R}^{n \times n}$ as a matrix operator operating in $\mathbb{R}^n$. A natural way of characterizing an operator is to see how it transforms a given vector $v \in \mathbb{R}^n$. This idea was pioneered in the case of graphs by works on diffusion kernels [13], followed by Binet-Cauchy kernels [27].
Here, the adjacency matrix was treated as a dynamical system, and a similarity measure between these systems was used as a similarity between the corresponding graphs. In [27], a graph with adjacency matrix A was associated with the following noiseless ARMA model:

\[ y_t = x_t; \quad x_{t+1} = A x_t. \quad (1) \]

It was shown that the random walk kernel between two graphs, with adjacency matrices A and A', is actually the Binet-Cauchy trace kernel over the corresponding ARMA models, which takes the following form (see Eq. (10) in [27]):

\[ K(A, A') = \sum_{t=1}^{\infty} e^{-\lambda t} \, y_t^T W y'_t, \quad (2) \]

where $x_0 = \mathbf{1}/|V|$, $x'_0 = \mathbf{1}/|V'|$, and W is a matrix of all ones. The discounting term $e^{-\lambda t}$ is necessary for the finiteness of the summation. Fortunately, the infinite summation in Eq. (2) has a closed form solution and can be computed in $O(n^3)$ [25].

It can be observed from Eq. (2) that the random walk kernel is simply a discounted summation of the similarity between $y_t$ and $y'_t$, where the summation is taken over t. It does not take into account the covariance structure of the dynamical system. In particular, given the adjacency matrix A, if we think of $\{y_t : t \in \mathbb{N}\}$ as a series, one of the identifying characteristics of a series is how $y_t$ relates to $y_{t'}$ for $t \neq t'$. Such auto-covariance structures are crucial in the time series modeling literature. Unfortunately, this information is not taken into consideration while computing the similarity in Eq. (2).

There are more expressive kernels for ARMA models, like the determinant kernel [27]. However, the determinant kernel for ARMA models is not applicable to graphs because it is sensitive to reordering of rows [29].
It should be noted that, given a permutation matrix P and an adjacency matrix A, A and $PAP^T$ lead to different dynamical systems, but the graphs represented by them are isomorphic. Therefore, we need a very different approach for defining kernels between graphs, one which takes into account the covariance structure of the series $\{y_t : t \in \mathbb{N}\}$. We proceed by computing an isomorphism-invariant functional representation of a given graph, which captures the covariance information of the dynamical system. We describe this functional embedding in the next section.

4 Graph Embedding in Functional Space

In Eq. (1), $y_t$ is simply a power iteration of matrix A. A small history of power iteration often captures sufficient information about the underlying matrix [17]. Our representation capitalizes on this fact. We first extract a summary of the power iteration as shown in Algorithm 1. In standard power iteration, we start with a given normalized vector $x^{(0)}$ and at each iteration $t \in \{1, 2, ..., k\}$ we generate the vector $x^{(t)} = \frac{A x^{(t-1)}}{\|x^{(t-1)}\|_1}$ recursively. The choice of normalization is not important.

Algorithm 1 Power Summary of Graph
Input: Adjacency matrix $A \in \mathbb{R}^{n \times n}$; initial vector $x^{(0)} \in \mathbb{R}^{n \times 1}$; number of power iterations k
for t = 1 to k do
    $x^{(t)} = \frac{A \, x^{(t-1)}}{\|x^{(t-1)}\|_1}$
    $S^A_{x^{(0)}}(:, t) = x^{(t)}$
end for
return $S^A_{x^{(0)}}$

We refer to the $(n \times k)$ matrix, whose $j$th column corresponds to $x^{(j)}$, as $S^A_{x^{(0)}}$. $S^A_{x^{(0)}}$ is not permutation invariant because $A^t x^{(0)}$ and $(PAP^T)^t x^{(0)}$, for general $x^{(0)}$, are not equal. However, if the starting vector is $x^{(0)} = \mathbf{1}$ (the vector of all ones), then reordering the nodes by permutation matrix P just shuffles the rows of $S^A_{\mathbf{1}}$ in the same order. This fact can be stated as the following theorem.
Theorem 1 If P is any permutation matrix, then $S^{PAP^T}_{\mathbf{1}} = P \times S^A_{\mathbf{1}}$, and the all-one vector $\mathbf{1}$ is the unique starting vector, up to scaling, having such a property for all A and P.

Proof Using the identity $P^T = P^{-1}$, it is not difficult to show that for any permutation matrix P, $(PAP^T)^k = P A^k P^T$. This, along with the fact $P^T \times \mathbf{1} = \mathbf{1}$, yields the required result. For uniqueness, let $x^{(0)}$ have two different components at i and j; then $PAP^T x^{(0)} \neq PAx^{(0)}$ in general. Equality here forces a constraint on A, P and $x^{(0)}$. Since we have limited degrees of freedom for $x^{(0)}$ compared to the choices of A and P, this ends in a contradiction.

A more intuitive way to see why Theorem 1 is true is to disregard normalization and imagine that at time step t = 0, we associate every node in the graph with the starting number 1. During every iteration of Algorithm 1, which is multiplication by A, we update this number on every node with the sum of the numbers on all its neighbors. A simple recursive argument tells us that the sequence of numbers generated on each node, under this process, is not going to change as long as the neighborhood structure is preserved. The all-ones vector $\mathbf{1}$ is the only starting choice that does not distinguish between nodes. In fact, each row vector of $S^A_{\mathbf{1}}$ can be treated as a representation for the corresponding node in the graph. Such updates are very informative and are the motivation behind many celebrated link analysis algorithms, including Hyper-text Induced Topic Search (HITS) [11].

In light of Theorem 1, we can associate a set of n vectors, corresponding to the rows of $S^A_{\mathbf{1}} \in \mathbb{R}^{n \times k}$, with graph G as a permutation invariant representation. Our proposal, therefore, is a mathematical quantity that describes this set of vectors as a representation for the graph.
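A minimal NumPy sketch of Algorithm 1 and the row-shuffling property of Theorem 1 (`power_summary` is our own illustrative name, not the authors' code):

```python
import numpy as np

def power_summary(A, k, x0=None):
    """Algorithm 1: collect k L1-normalized power-iteration vectors as columns."""
    n = A.shape[0]
    x = np.ones(n) if x0 is None else x0.astype(float)
    S = np.empty((n, k))
    for t in range(k):
        x = A @ x / np.linalg.norm(x, 1)  # one power-iteration step
        S[:, t] = x
    return S

# Small example: a path graph on 4 nodes.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
S = power_summary(A, k=3)

# Theorem 1 check: reordering the nodes by P just shuffles the rows of S.
P = np.eye(4)[[2, 0, 3, 1]]               # a permutation matrix
S_perm = power_summary(P @ A @ P.T, k=3)
assert np.allclose(S_perm, P @ S)
```

Each row of `S` is the trajectory of one node under the iteration, which is why a relabeling of nodes only permutes rows.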
We have two choices: 1) we can think of the subspace represented by these n vectors, or 2) we can think of these n vectors as samples from some probability distribution [15]. This choice depends on the relative sizes of n and k. In the case where n is large compared to k, the subspace represented by n vectors of dimension k will almost always be the whole k-dimensional Euclidean vector space, and it will not be very informative. On the other hand, when k is large compared to n, the subspace representation may be more informative than fitting a distribution. For example, if we decide to fit a Gaussian distribution over these vectors, when k is more than n, the covariance matrix is not very informative.

Power iteration converges very quickly due to its geometric rate of convergence. We therefore need much smaller values of k compared to n. Hence, we associate a probability function with the rows of $S^A_{\mathbf{1}}$. We can get a variety of permutation independent functional embeddings with different choices of this distribution function. We use the most natural distribution function, the Gaussian, for two major reasons: the similarity computations are usually in closed form, and it nicely captures the correlation structure of $S^A_{\mathbf{1}}$. Since we will always use $x^{(0)} = \mathbf{1}$, for notational convenience we will drop the subscript $\mathbf{1}$.

Definition 1 Given an undirected graph G with adjacency matrix $A \in \mathbb{R}^{n \times n}$ and $S^A$ computed from Algorithm 1 run for k iterations, let $\mu^A \in \mathbb{R}^k$ be the mean of the rows of $S^A$ and $\Sigma^A \in \mathbb{R}^{k \times k}$ denote the covariance matrix of $S^A$:

\[ \mu^A = \frac{1}{n} \sum_{i=1}^{n} S^A(i,:), \qquad \Sigma^A = \frac{1}{n} \sum_{i=1}^{n} \left( S^A(i,:) - \mu^A \right) \left( S^A(i,:) - \mu^A \right)^T. \]
We define $\Psi^A : \mathbb{R}^k \to \mathbb{R}^+$ of graph G as the probability density function of a multivariate Gaussian with mean $\mu^A$ and covariance $\Sigma^A$:

\[ \Psi^A(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma^A|}} \, e^{-\frac{(x - \mu^A)^T (\Sigma^A)^{-1} (x - \mu^A)}{2}}. \]

Since our representation is defined as a Gaussian density over a bag of vectors, it can also be interpreted as a Gaussian Process; see [15]. This Ψ representation has the desired property that it is invariant under reordering of nodes.

Theorem 2 For any permutation matrix P we have $\Psi^A = \Psi^{PAP^T}$.

Proof Using Theorem 1, it is not difficult to see that $\mu^A = \mu^{PAP^T}$ and $\Sigma^A = \Sigma^{PAP^T}$.

Although Theorem 2 captures graph isomorphism in one direction, this representation is not an if-and-only-if relationship, and we cannot hope for it to be, as we would then have solved the Graph Isomorphism Problem. Although the complexity of the Graph Isomorphism Problem is still a big open question, for most graphs in practice it is known to be easy, and a small summary of power iteration is almost always enough to discriminate between non-isomorphic graphs. In fact, real world graphs usually possess very distinct spectral behavior [6]. We can therefore expect the Ψ embedding to be an effective representation for graphs encountered in practice.

It might seem a little uncomfortable to call this a distribution, because the row vectors of $S^A$ never change, and so there is nothing stochastic. It is better to think of this representation of the graph as an object in a functional space ($\mathbb{R}^k \to \mathbb{R}^+$). The distribution analogy gives the motivation for a mathematical object describing a set of vectors, and a simple intuition as to why Theorem 2 is true given Theorem 1.
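The invariance claimed in Theorem 2 can be checked numerically: the mean and covariance of the rows of the power summary do not change under node reordering (`gaussian_summary` is our own illustrative name for this sketch):

```python
import numpy as np

def gaussian_summary(A, k):
    """Mean and covariance of the rows of the power summary S^A (x0 = all-ones)."""
    n = A.shape[0]
    x = np.ones(n)
    S = np.empty((n, k))
    for t in range(k):
        x = A @ x / np.linalg.norm(x, 1)
        S[:, t] = x
    mu = S.mean(axis=0)                # mu^A in R^k
    C = (S - mu).T @ (S - mu) / n      # Sigma^A in R^{k x k}
    return mu, C

# Theorem 2 check: mu and Sigma are unchanged under a random node reordering.
A = np.array([[0,1,1,0,0],[1,0,1,0,0],[1,1,0,1,0],
              [0,0,1,0,1],[0,0,0,1,0]], float)
P = np.eye(5)[np.random.permutation(5)]
mu1, C1 = gaussian_summary(A, k=4)
mu2, C2 = gaussian_summary(P @ A @ P.T, k=4)
assert np.allclose(mu1, mu2) and np.allclose(C1, C2)
```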
5 The Proposed Kernel

We define the Power Kernel between two graphs with adjacency matrices A and B as the Bhattacharyya kernel [8] between $\Psi^A(x)$ and $\Psi^B(x)$:

\[ K(A, B) = \int_{\Omega} \sqrt{\Psi^A(x)} \sqrt{\Psi^B(x)} \, dx. \quad (3) \]

Algorithm 2 Power Kernel
Input: $A$ ($n_a \times n_a$), $B$ ($n_b \times n_b$), k
1) Compute $S^A$ and $S^B$ using Algorithm 1 for k iterations.
2) $\mu^A = \frac{1}{n_a} \sum_{i=1}^{n_a} S^A(i,:)$
3) $\mu^B = \frac{1}{n_b} \sum_{i=1}^{n_b} S^B(i,:)$
4) $\Sigma^A = \frac{1}{n_a} \sum_{i=1}^{n_a} (S^A(i,:) - \mu^A)(S^A(i,:) - \mu^A)^T$
5) $\Sigma^B = \frac{1}{n_b} \sum_{i=1}^{n_b} (S^B(i,:) - \mu^B)(S^B(i,:) - \mu^B)^T$
6) Compute K(A, B) using Eq. (4).
return K(A, B)

Since $\Psi^A(x)$ and $\Psi^B(x)$ are Gaussian pdfs, Eq. (3) has the closed form solution:

\[ K(A, B) = |\Sigma^A|^{-\frac{1}{4}} \, |\Sigma^B|^{-\frac{1}{4}} \, |\Sigma|^{\frac{1}{2}} \, e^{(T_1 + T_2 + T_3)}, \quad (4) \]

where

\[ T_1 = -\frac{1}{4} (\mu^A)^T (\Sigma^A)^{-1} \mu^A, \qquad T_2 = -\frac{1}{4} (\mu^B)^T (\Sigma^B)^{-1} \mu^B, \qquad T_3 = \frac{1}{2} \mu^T \Sigma \mu, \]

\[ \Sigma = \left( \frac{(\Sigma^A)^{-1} + (\Sigma^B)^{-1}}{2} \right)^{-1}, \qquad \mu = \frac{1}{2} (\Sigma^A)^{-1} \mu^A + \frac{1}{2} (\Sigma^B)^{-1} \mu^B. \]

While designing kernels for graphs, ensuring positive semi-definiteness is not trivial, and many previously proposed kernels do not satisfy this property [26, 24]. Since our kernel is a kernel over a well studied mathematical representation, we get this property for free, as an immediate consequence of the result that Bhattacharyya kernels are positive semidefinite.

Theorem 3 The Power Kernel is positive semidefinite.

Overall, we have a very simple procedure for computing the kernel between two graphs with adjacency matrices $A \in \mathbb{R}^{n_a \times n_a}$ and $B \in \mathbb{R}^{n_b \times n_b}$. The procedure is summarized in Algorithm 2.

The value of k determines the number of power iterations in Algorithm 1. For adjacency matrix A, let $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_n$ be the eigenvalues and $v_1, v_2, ..., v_n$ be the corresponding eigenvectors. The $t$th iteration on vector $x^{(0)}$ will generate $A^t x^{(0)} = c_1 \lambda_1^t v_1 + c_2 \lambda_2^t v_2 + ... + c_n \lambda_n^t v_n$, where $(c_1, c_2, ..., c_n)$ is the representation of $x^{(0)}$ in the basis of the $v_i$'s, i.e., $x^{(0)} = c_1 v_1 + c_2 v_2 + ... + c_n v_n$. This gives

\[ A^t x^{(0)} = \lambda_1^t \left[ c_1 v_1 + \sum_{i=2}^{n} \left( \frac{\lambda_i}{\lambda_1} \right)^t c_i v_i \right]. \quad (5) \]

We can see that power iteration loses information about the $i$th eigenvalue and eigenvector at an exponential rate of $(\lambda_i / \lambda_1)^t$. A matrix is uniquely characterized by the set of its eigenvalues and eigenvectors, and we need all of them to fully capture the information in the matrix. It should be noted here that, unlike other machine learning applications where small eigenvalues correspond to noise, in our case the information of the whole spectrum is needed. We therefore need small values of k, like 4 or 5. Larger values of k will cause the information of the larger eigenvalues to dominate the representation, and this will make the kernel values biased towards the dominant spectrum of A.

6 Running Time

We now analyze the running time of each step in Algorithm 2. For simplicity, let $n = \max(n_a, n_b)$. Step (1) requires running Algorithm 1 on both graphs, which consists of matrix-vector multiplications for k iterations. The complexity of Step (1) is thus $O(n^2 k)$. Steps (2) and (3) compute the mean of n vectors, each with dimension k, both of which cost $O(nk)$. Steps (4) and (5) compute the sample covariance matrices, whose complexity is $O(nk^2)$ each. The final step requires evaluating Eq. (4) on the computed mean and covariance matrices, which requires $O(k^3)$ operations. Overall, the total time complexity for computing the kernel from scratch is $O(n^2 k + nk^2 + k^3)$.

The recommended value of k is usually a small constant (e.g., 4 or 5) even for large graphs. Treating k as a constant, the time complexity is $O(n^2)$ in the worst case.
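As a concrete sketch, Algorithm 2 together with the Gaussian closed form of the Bhattacharyya kernel can be implemented in a few lines. This is our own NumPy sketch, not the authors' code; the small ridge term added to the covariance is our implementation detail for numerical invertibility:

```python
import numpy as np

def power_kernel(A, B, k=5, ridge=1e-8):
    """Sketch of Algorithm 2: closed-form Bhattacharyya kernel between the
    Gaussian summaries of two graphs. `ridge` keeps the covariance invertible
    when the power-iteration columns become nearly collinear."""
    def summary(M):
        n = M.shape[0]
        x = np.ones(n)
        S = np.empty((n, k))
        for t in range(k):
            x = M @ x / np.linalg.norm(x, 1)   # Algorithm 1 step
            S[:, t] = x
        mu = S.mean(axis=0)
        return mu, (S - mu).T @ (S - mu) / n + ridge * np.eye(k)

    muA, CA = summary(A)
    muB, CB = summary(B)
    CAi, CBi = np.linalg.inv(CA), np.linalg.inv(CB)
    Sig = np.linalg.inv(0.5 * CAi + 0.5 * CBi)   # Sigma in the closed form
    mu = 0.5 * CAi @ muA + 0.5 * CBi @ muB       # mu in the closed form
    T1 = -0.25 * muA @ CAi @ muA
    T2 = -0.25 * muB @ CBi @ muB
    T3 = 0.5 * mu @ Sig @ mu
    return (np.linalg.det(Sig) ** 0.5
            * np.linalg.det(CA) ** -0.25
            * np.linalg.det(CB) ** -0.25
            * np.exp(T1 + T2 + T3))

# A graph has kernel value 1 with itself and with any reordering of itself.
A = np.array([[0,1,1,0,0],[1,0,1,0,0],[1,1,0,1,0],
              [0,0,1,0,1],[0,0,0,1,0]], float)
P = np.eye(5)[[4, 2, 0, 1, 3]]
K_self = power_kernel(A, A, k=2)
K_iso = power_kernel(A, P @ A @ P.T, k=2)
```

With identical summaries the determinant factors cancel and the exponent vanishes, so both `K_self` and `K_iso` evaluate to 1 up to floating-point error.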
In fact, due to the sparsity of the adjacency matrix A, the actual time complexity is $O(E)$, where E is the total number of edges (which is at most $O(n^2)$). In other words, our total running time is linear in the number of edges. The current state-of-the-art kernels, including the skew spectrum of graphs [12] and the random walk kernel [7], require $O(n^3)$ computations, while the shortest path kernel [1] is even costlier. The graphs that we encounter in most real-world applications are in general very sparse, i.e., $E \ll n^2$. Moreover, when the number of edges is on the order of the number of vertices (which is not unusual), our algorithm is actually linear in n. This makes our proposed power kernel scalable even for web applications.

Note that we can pre-compute the first five steps of Algorithm 2 independently for each graph. After this preprocessing, the kernel computation only requires $O(k^3)$ per pair, which is a constant.

7 Why Covariance Captures Relative Information?

The work of [12] was based on extracting permutation invariant features from graphs using an algebraic approach. Our representation leads to a new set of invariants. As a consequence of Theorem 2, $\mu^A$ and $\Sigma^A$ are graph invariants.

Define $N^i_t$ as the number of disjoint paths of length t, in the given graph G, having node i as one of its endpoints. In computing $N^i_t$, we allow repetition of nodes. One simple observation is that the $i$th component of $A^t \mathbf{1}$, i.e., $A^t \mathbf{1}(i)$, is equal to $N^i_t$. This fact can be proven by a simple inductive argument, where the base case $N^i_1$ corresponds to the degree of node i. The $t$th component of $\mu^A$ is the mean of the $N^i_t$, i.e.,

\[ \mu^A(t) = C_1 \sum_{i=1}^{n} N^i_t, \]

which is a trivial graph invariant, because it is the total number of paths of length t in the given graph, multiplied by a constant.
The constant $C_1$ comes into the picture due to normalization.

A more interesting set of invariants comes from the matrix $\Sigma^A$. The $(t_1, t_2)$th element of $\Sigma^A$ can be written as

\[ \Sigma^A(t_1, t_2) = C_2 \sum_{i=1}^{n} \left( N^i_{t_1} - \mu^A(t_1) \right) \times \left( N^i_{t_2} - \mu^A(t_2) \right), \]

which is the correlation between the number of paths of length $t_1$ and the number of paths of length $t_2$ having a common endpoint. When $t_1 = t_2 = t$, it can be interpreted as the variance in the number of paths of length t having a common endpoint. In hindsight, it is not difficult to see that these aggregated statistics of paths of different lengths starting at a given node are graph invariants. We will see in the next section that this information is very useful in discriminating various graph structures.

$\mu^A(t)$ captures information about the mean statistics of the different kinds of paths present in the graph. $\Sigma^A$ captures the relative structure of nodes in each graph. The correlations between various kinds of paths relative to a node indicate its relative connectivity in the graph structure. This kind of relative correlation information is missing in random walk kernels and path based kernels, which only count common paths or walks of the same length between two given graphs. Even kernels trying to count common small subgraphs do not capture this relative structural information sufficiently. $\mu^A(t)$ and $\Sigma^A$ capture the aggregate behavior of paths relative to different nodes, and they can also be treated as an informative summary of the given graph.

The Gaussian density function is just one way of exploiting this correlation structure. We can generate other functionals on the rows of $S^A$. For example, we can generate an expressive functional by using kernel density estimation on this set of n vectors, and Theorem 1 guarantees that the obtained functional is a graph invariant.
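The claim that the $i$th entry of $A^t \mathbf{1}$ counts the length-t paths (with node repetition allowed) ending at node i can be checked numerically on a tiny illustrative graph:

```python
import numpy as np

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], float)
ones = np.ones(4)

walks1 = A @ ones            # base case t = 1: the degree of each node
assert np.array_equal(walks1, A.sum(axis=1))

walks2 = A @ A @ ones        # length-2 paths touching each node
# Node 3's only neighbor is node 2 (degree 3), so 3 length-2 paths end at node 3.
assert walks2[3] == 3
```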
We believe that such invariants can provide deeper insight which could prove beneficial for many applications dealing with graphs. The behaviors of these graph invariants raise many interesting theoretical questions which could be of independent interest. For instance, we can ask, "what will be the behavior of these invariants if the graph has low expansion?"

8 Experiments

We follow the evaluation procedure of [12, 14]. We chose the same four benchmark graph classification datasets consisting of the graph structures of chemical compounds: MUTAG, ENZYMES, NCI1 and NCI109, used in [12, 14] for their diversity in terms of size as well as tasks. In each of these datasets, each data point is a graph structure associated with a classification label. MUTAG [5] is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds, labeled according to whether or not they have a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium. The maximum number of nodes in this dataset is 28 with a mean around 19, while the maximum number of edges is 33 and the mean is around 20. ENZYMES is a dataset of protein tertiary structures, which was used in [2]. It consists of 600 enzymes from the BRENDA enzyme database [19]. This is a multi-class classification task, where each enzyme is labeled with which of the 6 EC top level classes it belongs to. The maximum number of nodes in this dataset is 126 with an average around 32.6, while the maximum number of edges is 149 and the mean is around 62. The other two balanced datasets, NCI1 and NCI109, classify compounds based on whether or not they are active in an anti-cancer screen [28]. For both NCI1 and NCI109 the maximum number of nodes is 111 with a mean around 30, and the maximum number of edges is 119 with a mean around 32.
Our focus will remain on evaluating the basic structure captured by our functional representation $\Psi^A$. We therefore focus our comparisons on methodologies not relying on node and edge label information. We repeat the evaluation procedure followed in [12, 14] with the power kernel. The evaluation consists of running kernel SVM on the four datasets using the different kernels. The standard evaluation procedure used is as follows. First split each dataset into 10 folds of identical size. Combine 9 of these folds and again split them into 10 parts, then use the first 9 parts to train the C-SVM [3] and use the 10th part as a validation set to find the best performing value of C from $\{10^{-7}, 10^{-6}, ..., 10^{7}\}$. With this choice of C, train the C-SVM on all the 9 folds (from the initial 10 folds) and predict on the 10th fold acting as an independent evaluation set. The procedure is repeated 10 times with each fold acting as an independent test set once. For each dataset the whole procedure is then repeated 10 times, randomizing over partitions. The mean classification accuracies and the standard errors are shown in Table 1.

Since the results are averaged over 10 runs with different partitions, the numbers are very stable. We borrowed the accuracy values of the state-of-the-art unlabeled graph kernels: random walk kernel [7], shortest path kernel [1], graphlet count kernel [21], and reduced skew spectrum of graph from [12, 14], where parameters, if any, for these kernels were optimized for best performance.

As noted before, the value of k should not be large. Though we have the choice to tune this value for different datasets independently, to keep things simple and allow easy replication of results, we report the results for a fixed value of k = 5 on all four datasets.
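The nested cross-validation protocol above can be sketched with scikit-learn. This is a hedged sketch: `nested_cv_accuracy` is our own illustrative name, the precomputed kernel matrix `K` and labels `y` stand in for the benchmark kernels, and the synthetic data below is only a sanity check, not an experiment from the paper:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC

def nested_cv_accuracy(K, y, C_grid=tuple(10.0 ** e for e in range(-7, 8)),
                       seed=0):
    """10-fold CV; C is picked on an inner validation split of the 9
    training folds before scoring on the held-out fold."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accs = []
    for train, test in outer.split(K, y):
        # Inner split: 9 parts train / 1 part validation.
        tr, va = train_test_split(train, test_size=0.1, random_state=seed,
                                  stratify=y[train])
        best_C = max(C_grid, key=lambda C: SVC(C=C, kernel="precomputed")
                     .fit(K[np.ix_(tr, tr)], y[tr])
                     .score(K[np.ix_(va, tr)], y[va]))
        clf = SVC(C=best_C, kernel="precomputed")
        clf.fit(K[np.ix_(train, train)], y[train])
        accs.append(clf.score(K[np.ix_(test, train)], y[test]))
    return float(np.mean(accs))

# Synthetic sanity check with a linear kernel on random data (not graph data).
rs = np.random.RandomState(1)
X = rs.randn(60, 5)
y = (X[:, 0] > 0).astype(int)
acc = nested_cv_accuracy(X @ X.T, y)
```

In the paper's setup, `K` would be the pairwise power-kernel matrix precomputed once per dataset, and the whole loop would be repeated over 10 random re-partitions.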
From the results, we can see that, other than on the MUTAG dataset, the power kernel outperforms the other kernels on the remaining 3 datasets. On NCI1 and NCI109, which are larger datasets with larger graphs compared to MUTAG, we beat the previous best performing kernel, which is based on the skew spectrum of graphs, by a huge margin. On these two datasets, the power kernel gives a classification accuracy of around 70%, while the best performing baseline can only achieve around 62%. In the case of the ENZYMES dataset, the shortest path kernel performs the best among the baseline kernels and achieves 27.53% accuracy, while we achieve around 34.6%. This significant improvement clearly establishes the expressiveness of our representation in capturing the structure of graphs.

On the MUTAG dataset an accuracy of 88.61% is achieved by the reduced skew spectrum kernel, while the power kernel gives 83.22%. We believe that this is due to the fact that MUTAG consists of relatively much smaller graphs, and it seems that the few graph invariant features generated by the reduced skew spectrum sufficiently capture the discriminative information in this dataset. On datasets with larger graphs, such features are less expressive than our functional representation, and hence the power kernel leads to much better results.

Table 1: Prediction accuracy in percentage for the power kernel and the state-of-the-art graph kernels on four classification benchmark datasets. The reported results are averaged over 10 repetitions of 10-fold cross-validation. Standard errors are indicated using parentheses.

Datasets                    | MUTAG       | ENZYMES     | NCI1        | NCI109
No. of Instances/Classes    | 188/2       | 600/6       | 4110/2      | 4127/2
Max number of nodes         | 28          | 126         | 111         | 111
Power Kernel (This Paper)   | 83.22(0.47) | 34.60(0.48) | 70.73(0.10) | 70.15(0.12)
Reduced-Skew-Spectrum [12]  | 88.61(0.21) | 25.83(0.34) | 62.72(0.05) | 62.62(0.03)
Graphlet-Count-Kernel [21]  | 81.7(0.67)  | 23.94(0.4)  | 54.34(0.04) | 52.39(0.09)
Random-Walk-Kernel [7]      | 71.89(0.66) | 14.97(0.28) | 51.30(0.23) | 53.11(0.11)
Shortest-Path-Kernel [1]    | 81.28(0.45) | 27.53(0.29) | 61.66(0.10) | 62.35(0.13)

Also, the MUTAG dataset contains only 188 data elements, and so the percentage difference is not as significant as on larger datasets like NCI1 and NCI109. We always outperform the graphlet count kernel, random walk kernel and shortest path kernel. This shows that our basic representation is much more expressive and superior. It is not surprising, because we are capturing higher order correlation information, while kernels based on counting common paths or subgraphs of small size miss this relative information. Dissecting graphs into small subgraphs loses a lot of information.

As shown in Section 6, our algorithm runs in $O(E)$, and from the statistics of the datasets we can see that on average the number of edges is of the order of the number of vertices, so the running time complexity of the power kernel in this case is actually around $O(n)$, while all other competing methods except the graphlet count kernel require at least $O(n^3)$. Therefore, we have a huge gain in performance. The running time complexity of the graphlet kernel is competitive with our method, but accuracy-wise our method is much superior. The whole procedure for the power kernel is simple, and since we haven't tuned anything except C for the SVM, all these numbers are easily reproducible.

9 Discussion: Effect of Perturbations

The success of isomorphism capturing kernels is due to their ability to preserve near neighbors with high probability. Our proposed power kernel possesses the following two properties:

1. If two graphs A and B are isomorphic then K(A, B) = 1, and if they are not, then likely K(A, B) < 1.

2.
If two graphs A and B are slightly perturbed versions of each other, then K(A, B) should be close to 1; in particular, it should be higher than for two random graphs.

Determining which graphs are uniquely determined by their spectrum is in general a very hard problem, but all graphs encountered in practice are well behaved and uniquely determined by their dynamics. Hence, our proposed embedding does not lose much information.

For power kernels, it is clear from Theorem 2 that if two graphs are isomorphic then K(A, B) = 1. Because of the permutation invariance property, we do not have to worry about which ordering of nodes to consider, as long as there exists one which gives the required bijection. If two graphs are not isomorphic, then their spectra follow very different behaviors, and hence the kernel value between them should be much less than 1.

To illustrate why our representation satisfies Property 2, we use the fact that the spectrum of the adjacency matrix is usually very stable under small perturbations; see [10]. Here, a perturbation means an operation like adding or deleting a few nodes and edges, which is different from the usual small normed perturbations. Moreover, our kernel relies on stable statistics, such as the covariance Σ and mean µ of S_A, which do not undergo any major jump under small changes in S_A, assuming the size of the graph n is large. Our method thus ensures that small graph perturbations do not lead to any blowup causing relatively big changes in the kernel values.

Although it might be difficult to quantify the sensitivity of power kernels with respect to small perturbations in the graph, we can empirically verify the above claim. We chose the same four datasets used in the experiments. From each dataset, we randomly sample 100 graphs for the evaluations.
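To make the objects in this discussion concrete, the following sketch shows how S_A, its statistics (µ, Σ), and the kernel value K(A, B) fit together. This is our illustrative reconstruction, not the authors' implementation: the helper names, the all-ones start vector, and the per-column normalization are assumptions of this sketch (see [22] for the precise construction). Each iteration is one sparse matrix-vector product, so computing S_A costs O(kE); the Bhattacharyya kernel between the two fitted Gaussians then has the standard closed form.

```python
import numpy as np
from scipy.sparse import csr_matrix

def power_embedding(A, k=5):
    """k steps of power iteration on a sparse adjacency matrix A (n x n).
    Each step is one sparse mat-vec, so the total cost is O(k * E).
    Returns S_A: an n x k matrix whose i-th row traces vertex i through
    the iteration (columns normalized to keep values bounded)."""
    n = A.shape[0]
    x = np.ones(n) / n               # permutation-invariant start vector
    cols = []
    for _ in range(k):
        x = A @ x
        x = x / np.linalg.norm(x)
        cols.append(x)
    return np.column_stack(cols)

def bhattacharyya_kernel(SA, SB, eps=1e-6):
    """Bhattacharyya kernel K = integral sqrt(p(x) q(x)) dx between the
    Gaussians N(mu1, S1), N(mu2, S2) fitted to the rows of S_A and S_B."""
    k = SA.shape[1]
    mu1, mu2 = SA.mean(axis=0), SB.mean(axis=0)
    S1 = np.cov(SA, rowvar=False) + eps * np.eye(k)   # regularized covariance
    S2 = np.cov(SB, rowvar=False) + eps * np.eye(k)
    S = (S1 + S2) / 2.0
    d = mu1 - mu2
    expo = 0.125 * d @ np.linalg.solve(S, d)          # (1/8) d^T S^-1 d
    logdet = (0.25 * np.linalg.slogdet(S1)[1]
              + 0.25 * np.linalg.slogdet(S2)[1]
              - 0.5 * np.linalg.slogdet(S)[1])
    return float(np.exp(-expo + logdet))
```

For a permuted copy of the same adjacency matrix, the rows of S_B are exactly a permutation of the rows of S_A (the all-ones start vector is permutation invariant), so µ and Σ coincide and the kernel evaluates to 1, matching Property 1.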
We perturb each graph structure by flipping a random edge, i.e., we choose two nodes i and j at random; if the edge (i, j) is present in the graph we delete it, otherwise we add it. We apply this perturbation process 20 times, one step after the other, thereby obtaining a sequence of 20 graphs with increasing amounts of perturbation. After each perturbation, we compute the kernel value between the perturbed graph and the original graph. The value of k was again set to 5. We plot the kernel values averaged over these 100 graphs, for all four datasets, in Figure 1.

Figure 1: Changes in the value of the power kernel with increasing perturbation of the given graph (kernel value, 0 to 1, versus number of perturbations, 0 to 20; one curve per dataset: MUTAG, ENZYMES, NCI1, NCI109).

We can clearly see that the kernel values smoothly decrease with increasing perturbation. For the MUTAG dataset, which consists of smaller graphs, the effect of perturbation is larger compared to the other datasets with relatively bigger graphs, which is expected. The plots clearly demonstrate that small perturbations do not lead to discontinuous jumps in the kernel values.

10 Conclusion

We approached the problem of graph kernels by finding an embedding in a functional space. The power kernel, which is based on a kernel between these functionals, significantly outperforms the existing state-of-the-art kernels on benchmark graph classification datasets. Our kernel only requires O(E) time to compute, and thus this scheme is very practical.

Our focus was to demonstrate the power of an expressive functional representation in the simplest possible way. We believe that there is huge scope for improvement in the proposed kernel, owing to the flexibility of our approach. For example, the choice of Gaussian functions was the natural one.
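As one hypothetical illustration of this flexibility, the parametric Gaussian fit could be replaced by a nonparametric density estimate over the rows of S_A, with the Bhattacharyya affinity integral sqrt(p(x) q(x)) dx computed numerically. The sketch below is our own construction, not part of the paper; it treats the k coordinates as independent one-dimensional densities purely to keep the numerical integration tractable.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_bhattacharyya(SA, SB, grid_size=512):
    """Bhattacharyya affinity between KDE fits of two embeddings.
    Simplifying assumption (ours): coordinates are independent, so the
    k-dimensional integral factorizes into k one-dimensional ones:
    K = prod_d integral sqrt(p_d(x) * q_d(x)) dx."""
    K = 1.0
    for d in range(SA.shape[1]):
        a, b = SA[:, d], SB[:, d]
        # common evaluation grid, padded past both supports
        lo = min(a.min(), b.min()) - 1.0
        hi = max(a.max(), b.max()) + 1.0
        xs = np.linspace(lo, hi, grid_size)
        p = gaussian_kde(a)(xs)          # 1-D kernel density estimates
        q = gaussian_kde(b)(xs)
        # Riemann-sum approximation of the 1-D Bhattacharyya integral
        K *= float(np.sum(np.sqrt(p * q)) * (xs[1] - xs[0]))
    return K
```

Whether such nonparametric functionals improve over the closed-form Gaussian Bhattacharyya kernel is left open here; the sketch only shows that the functional-embedding interface stays the same when the density model is swapped.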
There is ample room for deriving more expressive functionals on the row vectors of S_A, for instance using kernel density estimators. Incorporating node and edge label information into the power kernel is another area to explore. The idea of discounting subsequent columns of the power iteration could be a useful extension.

We have demonstrated that our functional representation can provide an easy interface for dealing with graphs, a combinatorially hard object. Although we have seen significant gains in performance over the existing state-of-the-art kernels, in light of the possible future work, we believe a lot more is yet to come.

Acknowledgement

The work was partially supported by NSF (DMS0808864, SES1131848, III1249316) and AFOSR (FA9550-13-1-0137).

References

[1] K. M. Borgwardt and H. P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 74-81, 2005.

[2] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. Kriegel. Protein function prediction via graph kernels. In ISMB (Supplement of Bioinformatics), pages 47-56, 2005.

[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011.

[4] F. Costa and K. D. Grave. Fast neighborhood subgraph pairwise distance kernel. In ICML, pages 255-262, 2010.

[5] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem, 34:786-797, 1991.

[6] Illes J. Farkas, Imre Derényi, Albert-László Barabási, and Tamas Vicsek. Spectra of real-world graphs: Beyond the semicircle law. Physical Review E, 64(2):026704, 2001.

[7] T. Gärtner, P. A. Flach, and S. Wrobel.
On graph kernels: Hardness results and efficient alternatives. In COLT, pages 129-143, 2003.

[8] T. Jebara, R. I. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819-844, 2004.

[9] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML, pages 321-328, 2003.

[10] A. D. Keedwell, editor. Surveys in Combinatorics. Cambridge University Press, 1991.

[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, pages 668-677, 1998.

[12] R. I. Kondor and K. M. Borgwardt. The skew spectrum of graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 496-503, 2008.

[13] R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML, pages 315-322, 2002.

[14] R. I. Kondor, N. Shervashidze, and K. M. Borgwardt. The graphlet spectrum. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 529-536, 2009.

[15] Risi Kondor and Tony Jebara. A kernel between sets of vectors.

[16] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. In ICML, 2012.

[17] Frank Lin and William W. Cohen. Power iteration clustering. Citeseer.

[18] P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert. Extensions of marginalized graph kernels. In ICML, 2004.

[19] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32(Database-Issue):431-433, 2004.

[20] J. Shawe-Taylor. Symmetries and discriminability in feedforward network architectures. IEEE Trans. on Neural Networks, 4:816-826, 1993.

[21] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K.
Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2009), pages 129-143, 2009.

[22] Anshumali Shrivastava and Ping Li. A new space for comparing graphs. Technical report, arXiv:1404.4644, 2014.

[23] Johan Ugander, Lars Backstrom, and Jon M. Kleinberg. Subgraph frequencies: mapping the empirical and extremal geography of large graph collections. In WWW, pages 1307-1318, 2013.

[24] Jean-Philippe Vert. The optimal assignment kernel is not positive definite. arXiv preprint arXiv:0801.4061, 2008.

[25] S. V. N. Vishwanathan, K. M. Borgwardt, and N. N. Schraudolph. Fast computation of graph kernels. In NIPS, pages 1449-1456, 2006.

[26] S. V. N. Vishwanathan, N. N. Schraudolph, R. I. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201-1242, 2010.

[27] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95-119, 2007.

[28] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In ICDM, pages 678-689, 2006.

[29] Lior Wolf and Amnon Shashua. Learning over sets using kernel principal angles. The Journal of Machine Learning Research, 4:913-931, 2003.