Learning Convolutional Neural Networks for Graphs


Authors: Mathias Niepert, Mohamed Ahmed, Konstantin Kutzkov

NEC Labs Europe, Heidelberg, Germany (mathias.niepert@neclab.eu, mohamed.ahmed@neclab.eu, konstantin.kutzkov@neclab.eu)

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48.

Abstract

Numerous important problems can be framed as learning from graph data. We propose a framework for learning convolutional neural networks for arbitrary graphs. These graphs may be undirected, directed, and with both discrete and continuous node and edge attributes. Analogous to image-based convolutional networks that operate on locally connected regions of the input, we present a general approach to extracting locally connected regions from graphs. Using established benchmark data sets, we demonstrate that the learned feature representations are competitive with state of the art graph kernels and that their computation is highly efficient.

1. Introduction

With this paper we aim to bring convolutional neural networks to bear on a large class of graph-based learning problems. We consider the following two problems.

1. Given a collection of graphs, learn a function that can be used for classification and regression problems on unseen graphs. The nodes of any two graphs are not necessarily in correspondence. For instance, each graph of the collection could model a chemical compound and the output could be a function mapping unseen compounds to their level of activity against cancer cells.

2. Given a large graph, learn graph representations that can be used to infer unseen graph properties such as node types and missing edges.

We propose a framework for learning representations for classes of directed and undirected graphs. The graphs may have nodes and edges with multiple discrete and continuous attributes and may have multiple types of edges. Similar to convolutional neural networks for images, we construct locally connected neighborhoods from the input graphs. These neighborhoods are generated efficiently and serve as the receptive fields of a convolutional architecture, allowing the framework to learn effective graph representations.

[Figure 1: A CNN with a receptive field of size 3x3. The field is moved over an image from left to right and top to bottom using a particular stride (here: 1) and zero-padding (here: none) (a). The values read by the receptive fields are transformed into a linear layer and fed to a convolutional architecture (b). The node sequence for which the receptive fields are created and the shapes of the receptive fields are fully determined by the hyper-parameters.]

The proposed approach builds on concepts from convolutional neural networks (CNNs) (Fukushima, 1980; Atlas et al., 1988; LeCun et al., 1998; 2015) for images and extends them to arbitrary graphs. Figure 1 illustrates the locally connected receptive fields of a CNN for images. An image can be represented as a square grid graph whose nodes represent pixels. Now, a CNN can be seen as traversing a node sequence (nodes 1-4 in Figure 1(a)) and generating fixed-size neighborhood graphs (the 3x3 grids in Figure 1(b)) for each of the nodes.
The neighborhood graphs serve as the receptive fields to read feature values from the pixel nodes. Due to the implicit spatial order of the pixels, the sequence of nodes for which neighborhood graphs are created, from left to right and top to bottom, is uniquely determined. The same holds for NLP problems, where each sentence (and its parse tree) determines a sequence of words. However, for numerous graph collections a problem-specific ordering (spatial, temporal, or otherwise) is missing and the nodes of the graphs are not in correspondence. In these instances, one has to solve two problems: (i) determining the node sequences for which neighborhood graphs are created, and (ii) computing a normalization of neighborhood graphs, that is, a unique mapping from a graph representation into a vector space representation. The proposed approach, termed PATCHY-SAN, addresses these two problems for arbitrary graphs. For each input graph, it first determines the nodes (and their order) for which neighborhood graphs are created. For each of these nodes, a neighborhood consisting of exactly k nodes is extracted and normalized, that is, it is uniquely mapped to a space with a fixed linear order. The normalized neighborhood serves as the receptive field for the node under consideration. Finally, feature learning components such as convolutional and dense layers are combined with the normalized neighborhood graphs as the CNN's receptive fields.

[Figure 2: An illustration of the proposed architecture. A node sequence is selected from a graph via a graph labeling procedure. For some nodes in the sequence, a local neighborhood graph is assembled and normalized. The normalized neighborhoods are used as receptive fields and combined with existing CNN components.]

Figure 2 illustrates the PATCHY-SAN architecture, which has several advantages over existing approaches. First, it is highly efficient, naively parallelizable, and applicable to large graphs. Second, for a number of applications, ranging from computational biology to social network analysis, it is important to visualize learned network motifs (Milo et al., 2002). PATCHY-SAN supports feature visualizations providing insights into the structural properties of graphs. Third, instead of crafting yet another graph kernel, PATCHY-SAN learns application-dependent features without the need for feature engineering. Our theoretical contributions are the definition of the normalization problem on graphs and its complexity; a method for comparing graph labeling approaches for a collection of graphs; and a result showing that PATCHY-SAN generalizes CNNs on images. Using standard benchmark data sets, we demonstrate that the learned CNNs for graphs are both efficient and effective compared to state of the art graph kernels.

2. Related Work

Graph kernels allow kernel-based learning approaches such as SVMs to work directly on graphs (Vishwanathan et al., 2010). Kernels on graphs were originally defined as similarity functions on the nodes of a single graph (Kondor & Lafferty, 2002). Two representative classes of kernels are the skew spectrum kernel (Kondor & Borgwardt, 2008) and kernels based on graphlets (Kondor et al., 2009; Shervashidze et al., 2009). The latter is related to our work, as it builds kernels based on fixed-sized subgraphs. These subgraphs, which are often called motifs or graphlets, reflect functional network properties (Milo et al., 2002; Alon, 2007). However, due to the combinatorial complexity of subgraph enumeration, graphlet kernels are restricted to subgraphs with few nodes.
An effective class of graph kernels are the Weisfeiler-Lehman (WL) kernels (Shervashidze et al., 2011). WL kernels, however, only support discrete features and use memory linear in the number of training examples at test time. PATCHY-SAN uses WL as one possible labeling procedure to compute receptive fields. Deep graph kernels (Yanardag & Vishwanathan, 2015) and graph invariant kernels (Orsini et al., 2015) compare graphs based on the existence or count of small substructures such as shortest paths (Borgwardt & Kriegel, 2005), graphlets, subtrees, and other graph invariants (Haussler, 1999; Orsini et al., 2015). In contrast, PATCHY-SAN learns substructures from graph data and is not limited to a predefined set of motifs. Moreover, while all graph kernels have a training complexity at least quadratic in the number of graphs (Shervashidze et al., 2011), which is prohibitive for large-scale problems, PATCHY-SAN scales linearly with the number of graphs.

Graph neural networks (GNNs) (Scarselli et al., 2009) are a recurrent neural network architecture defined on graphs. GNNs apply recurrent neural networks to walks on the graph structure, propagating node representations until a fixed point is reached. The resulting node representations are then used as features in classification and regression problems. GNNs support only discrete labels and perform as many backpropagation operations as there are edges and nodes in the graph per learning iteration. Gated Graph Sequence Neural Networks modify GNNs to use gated recurrent units and to output sequences (Li et al., 2015).

Recent work extended CNNs to topologies that differ from the low-dimensional grid structure (Bruna et al., 2014; Henaff et al., 2015). All of these methods, however, assume one global graph structure, that is, a correspondence of the vertices across input examples. Duvenaud et al. (2015) perform convolution-type operations on graphs, developing a differentiable variant of one specific graph feature.

3. Background

We provide a brief introduction to the required background in convolutional networks and graph theory.

3.1. Convolutional Neural Networks

CNNs were inspired by earlier work showing that the visual cortex in animals contains complex arrangements of cells responsible for detecting light in small local regions of the visual field (Hubel & Wiesel, 1968). CNNs were developed in the 1980s and have been applied to image, speech, text, and drug discovery problems (Atlas et al., 1988; LeCun et al., 1989; 1998; 2015; Wallach et al., 2015). A predecessor to CNNs was the Neocognitron (Fukushima, 1980). A typical CNN is composed of convolutional and dense layers. The purpose of the first convolutional layer is the extraction of common patterns found within local regions of the input images. CNNs convolve learned filters over the input image, computing the inner product at every location in the image and outputting the result as tensors whose depth is the number of filters.

3.2. Graphs
A graph G is a pair (V, E) with V = {v_1, ..., v_n} the set of vertices and E ⊆ V × V the set of edges. Let n be the number of vertices and m the number of edges. Each graph can be represented by an adjacency matrix A of size n × n, where A_{i,j} = 1 if there is an edge from vertex v_i to vertex v_j, and A_{i,j} = 0 otherwise. In this case, we say that vertex v_i has position i in A. Moreover, if A_{i,j} = 1 we say v_i and v_j are adjacent. Node and edge attributes are features that attain one value for each node and edge of a graph. We use the term attribute value instead of label to avoid confusion with the graph-theoretical concept of a labeling. A walk is a sequence of nodes in a graph in which consecutive nodes are connected by an edge. A path is a walk with distinct nodes. We write d(u, v) to denote the distance between u and v, that is, the length of the shortest path between u and v. N_1(v) is the 1-neighborhood of a node v, that is, all nodes that are adjacent to v.

Labeling and Node Partitions. PATCHY-SAN utilizes graph labelings to impose an order on nodes. A graph labeling ℓ is a function ℓ: V → S from the set of vertices V to an ordered set S such as the real numbers or the integers. A graph labeling procedure computes a graph labeling for an input graph. When it is clear from the context, we use labeling to refer to both the graph labeling and the procedure that computes it. A ranking (or coloring) is a function r: V → {1, ..., |V|}. Every labeling induces a ranking r with r(u) < r(v) if and only if ℓ(u) > ℓ(v). If the labeling ℓ of graph G is injective, it determines a total order of G's vertices and a unique adjacency matrix A^ℓ(G) of G in which vertex v has position r(v). Moreover, every graph labeling induces a partition {V_1, ..., V_n} of V with u, v ∈ V_i if and only if ℓ(u) = ℓ(v). Examples of graph labeling procedures are node degree and other measures of centrality commonly used in the analysis of networks. For instance, the betweenness centrality of a vertex v computes the fraction of shortest paths that pass through v. The Weisfeiler-Lehman algorithm (Weisfeiler & Lehman, 1968; Douglas, 2011), also known as color refinement and naive vertex classification, is a procedure for partitioning the vertices of a graph (see the sketch at the end of this section). Color refinement has attracted considerable interest in the ML community since it can be applied to speed up inference in graphical models (Kersting et al., 2009; 2014) and as a method to compute graph kernels (Shervashidze et al., 2011). PATCHY-SAN applies these labeling procedures, among others (degree, PageRank, eigenvector centrality, etc.), to impose an order on the nodes of graphs, replacing application-dependent orders (temporal, spatial, etc.) where missing.

Isomorphism and Canonicalization. The computational problem of deciding whether two graphs are isomorphic surfaces in several application domains. The graph isomorphism (GI) problem is in NP but not known to be in P or NP-hard. Under several mild restrictions, GI is known to be in P. For instance, GI is in P for graphs of bounded degree (Luks, 1982). A canonicalization of a graph G is a graph G' with a fixed vertex order which is isomorphic to G and which represents its entire isomorphism class. In practice, the graph canonicalization tool NAUTY has shown remarkable performance (McKay & Piperno, 2014).
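To make the labeling machinery concrete, the following is a minimal Python sketch of 1-dimensional Weisfeiler-Lehman color refinement and of the ranking a labeling induces. The function names and the plain adjacency-list representation are our own illustration under these definitions, not code from the paper.

```python
def wl_colors(adj, iterations=3):
    """1-dimensional Weisfeiler-Lehman (color refinement).

    adj: dict mapping each vertex to a list of its neighbors.
    Returns a dict mapping each vertex to its final color (an int).
    Vertices end with the same color iff 1-WL cannot distinguish them.
    """
    colors = {v: 0 for v in adj}  # start from a uniform coloring
    for _ in range(iterations):
        # New signature: own color plus the multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        # Compress signatures to small consecutive integers, deterministically.
        palette = {}
        for _, sig in sorted(sigs.items(), key=lambda kv: kv[1]):
            palette.setdefault(sig, len(palette))
        new_colors = {v: palette[sigs[v]] for v in adj}
        if new_colors == colors:  # stable partition reached
            break
        colors = new_colors
    return colors

def ranking_from_labeling(labeling):
    """Ranking induced by a labeling: larger label value means smaller rank."""
    order = sorted(labeling, key=lambda v: -labeling[v])
    return {v: i + 1 for i, v in enumerate(order)}

# Example: a path 1-2-3-4; the two end vertices receive the same color.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(wl_colors(adj))                       # e.g. {1: 0, 2: 1, 3: 1, 4: 0}
print(ranking_from_labeling(wl_colors(adj)))
```

Note that the labeling is not injective here (the two endpoints tie), which is exactly the situation that later requires tie-breaking via canonicalization.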
4. Learning CNNs for Arbitrary Graphs

When CNNs are applied to images, a receptive field (a square grid) is moved over each image with a particular step size. The receptive field reads the pixels' feature values, once for each channel, and a patch of values is created per channel. Since the pixels of an image have an implicit arrangement, a spatial order, the receptive fields are always moved from left to right and top to bottom. Moreover, the spatial order uniquely determines the nodes of each receptive field and the way these nodes are mapped to a vector space representation (see Figure 1(b)). Consequently, the values read from two pixels using two different locations of the receptive field are assigned to the same relative position if and only if the pixels' structural roles (their spatial positions within the receptive field) are identical.

To show the connection between CNNs and PATCHY-SAN, we frame CNNs on images as identifying a sequence of nodes in the square grid graph representing the image and building a normalized neighborhood graph, a receptive field, for each node in the identified sequence. For graph collections where an application-dependent node order is missing and where the nodes of any two graphs are not yet aligned, we need to determine for each graph (i) the sequences of nodes for which we create neighborhoods, and (ii) a unique mapping from the graph representation to a vector representation such that nodes with similar structural roles in the neighborhood graphs are positioned similarly in the vector representation.

We address these problems by leveraging graph labeling procedures that assign nodes from two different graphs to a similar relative position in their respective adjacency matrices if their structural roles within the graphs are similar. Given a collection of graphs, PATCHY-SAN (SELECT-ASSEMBLE-NORMALIZE) applies the following steps to each graph: (1) select a fixed-length sequence of nodes from the graph; (2) assemble a fixed-size neighborhood for each node in the selected sequence; (3) normalize the extracted neighborhood graph; and (4) learn neighborhood representations with convolutional neural networks from the resulting sequence of patches. In the following, we describe methods that address these challenges.

4.1. Node Sequence Selection

Node sequence selection is the process of identifying, for each input graph, a sequence of nodes for which receptive fields are created. Algorithm 1 lists one such procedure.

Algorithm 1 SELNODESEQ: Select Node Sequence
1: input: graph labeling procedure ℓ, graph G = (V, E), stride s, width w, receptive field size k
2: V_sort = top w elements of V according to ℓ
3: i = 1, j = 1
4: while j ≤ w do
5:   if i ≤ |V_sort| then
6:     f = RECEPTIVEFIELD(V_sort[i])
7:   else
8:     f = ZERORECEPTIVEFIELD()
9:   apply f to each input channel
10:  i = i + s, j = j + 1

First, the vertices of the input graph are sorted with respect to a given graph labeling. Second, the resulting node sequence is traversed using a given stride s, and for each visited node Algorithm 3 is executed to construct a receptive field, until exactly w receptive fields have been created. The stride s determines the distance, relative to the selected node sequence, between two consecutive nodes for which a receptive field is created.
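As a concrete illustration, here is a minimal Python sketch of the selection loop in Algorithm 1. Here `labeling` and `receptive_field` are stand-ins for a labeling procedure and for Algorithm 3, and `None` stands in for an all-zero padding field; this is our rendering, not the authors' code.

```python
def select_node_sequence(vertices, labeling, receptive_field, stride, width):
    """Algorithm 1 (SELNODESEQ), sketched in Python.

    vertices:        iterable of vertex ids of the input graph
    labeling:        dict vertex -> label value (larger = ranked earlier)
    receptive_field: function vertex -> normalized field (Algorithm 3)
    Returns a list of exactly `width` fields; missing ones are None
    (standing in for the all-zero receptive fields used as padding).
    """
    # Sort vertices by the labeling and keep the top `width` of them.
    v_sort = sorted(vertices, key=lambda v: -labeling[v])[:width]
    fields, i = [], 0
    while len(fields) < width:
        if i < len(v_sort):
            fields.append(receptive_field(v_sort[i]))
        else:
            fields.append(None)  # zero receptive field (padding)
        i += stride              # stride over the selected node sequence
    return fields
```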
If the number of nodes is smaller than w, the algorithm creates all-zero receptive fields for padding purposes.

Several alternative methods for vertex sequence selection are possible, for instance a depth-first traversal of the input graph guided by the values of the graph labeling. We leave these ideas to future work.

4.2. Neighborhood Assembly

For each of the nodes identified in the previous step, a receptive field has to be constructed. Algorithm 3 first calls Algorithm 2 to assemble a local neighborhood for the input node. The nodes of the neighborhood are the candidates for the receptive field.

Algorithm 2 NEIGHASSEMB: Neighborhood Assembly
1: input: vertex v, receptive field size k
2: output: set of neighborhood nodes N for v
3: N = [v]
4: L = [v]
5: while |N| < k and |L| > 0 do
6:   L = ∪_{u ∈ L} N_1(u) \ N
7:   N = N ∪ L
8: return the set of vertices N

Algorithm 2 lists the neighborhood assembly steps. Given as inputs a node v and the size of the receptive field k, the procedure performs a breadth-first search, exploring vertices with increasing distance from v and adding these vertices to a set N. If the number of collected nodes is smaller than k, the 1-neighborhoods of the vertices most recently added to N are collected, and so on, until at least k vertices are in N, or until there are no more neighbors to add. Note that at this point the size of N may differ from k.
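A minimal Python sketch of this breadth-first assembly, under the same adjacency-dict convention as the earlier snippets (the names are ours):

```python
def assemble_neighborhood(adj, v, k):
    """Algorithm 2 (NEIGHASSEMB): breadth-first candidate collection.

    Grows a candidate set around v, one BFS ring at a time, until at
    least k vertices are collected or the component is exhausted.
    The returned set may be smaller or larger than k; Algorithm 4
    later crops or pads it to exactly k.
    """
    collected = {v}  # the set N from the pseudocode
    frontier = {v}   # the set L: vertices added in the previous ring
    while len(collected) < k and frontier:
        # Next ring: not-yet-collected neighbors of the current frontier.
        frontier = {u for w in frontier for u in adj[w]} - collected
        collected |= frontier
    return collected
```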
4.3. Graph Normalization

The receptive field for a node is constructed by normalizing the neighborhood assembled in the previous step. As illustrated in Figure 3, the normalization imposes an order on the nodes of the neighborhood graph so as to map from the unordered graph space to a vector space with a linear order. The basic idea is to leverage graph labeling procedures that assign nodes of two different graphs to a similar relative position in the respective adjacency matrices if and only if their structural roles within the graphs are similar.

[Figure 3: The normalization is performed for each of the graphs induced on the neighborhood of a root node v (the red node; node colors indicate distance to the root node). A graph labeling is used to rank the nodes and to create the normalized receptive fields, one of size k (here: k = 9) for node attributes and one of size k × k for edge attributes. Normalization also includes cropping of excess nodes and padding with dummy nodes. Each vertex (edge) attribute corresponds to an input channel with the respective receptive field.]

To formalize this intuition, we define the optimal graph normalization problem, which aims to find a labeling that is optimal relative to a given collection of graphs.

Problem 1 (Optimal graph normalization). Let G be a collection of unlabeled graphs with k nodes, let ℓ be an injective graph labeling procedure, let d_G be a distance measure on graphs with k nodes, and let d_A be a distance measure on k × k matrices. Find

ℓ̂ = arg min_ℓ E_G [ | d_A(A^ℓ(G), A^ℓ(G′)) − d_G(G, G′) | ].

The problem amounts to finding a graph labeling procedure ℓ such that, for any two graphs drawn uniformly at random from G, the expected difference between the distance of the graphs in vector space (with respect to the adjacency matrices based on ℓ) and the distance of the graphs in graph space is minimized. The optimal graph normalization problem is a generalization of the classical graph canonicalization problem. A canonical labeling algorithm, however, is optimal only for isomorphic graphs and might perform poorly for graphs that are similar but not isomorphic. In contrast, the smaller the expectation of the optimal normalization problem, the better the labeling aligns nodes with similar structural roles. Note that the similarity is determined by d_G. We have the following result concerning the complexity of the optimal normalization problem.

Theorem 1. Optimal graph normalization is NP-hard.

Proof: By reduction from subgraph isomorphism.

PATCHY-SAN does not solve the above optimization problem. Instead, it may compare different graph labeling methods and choose the one that performs best relative to a given collection of graphs.

Theorem 2. Let G be a collection of graphs and let (G_1, G′_1), ..., (G_N, G′_N) be a sequence of pairs of graphs sampled independently and uniformly at random from G. Let θ̂_ℓ := Σ_{i=1}^{N} d_A(A^ℓ(G_i), A^ℓ(G′_i)) / N and θ_ℓ := E_G [ | d_A(A^ℓ(G), A^ℓ(G′)) − d_G(G, G′) | ]. If d_A ≥ d_G, then E_G[θ̂_{ℓ_1}] < E_G[θ̂_{ℓ_2}] if and only if θ_{ℓ_1} < θ_{ℓ_2}.

Theorem 2 enables us to compare different labeling procedures in an unsupervised manner via a comparison of the corresponding estimators. Under the assumption d_A ≥ d_G, the smaller the estimate θ̂_ℓ, the smaller the absolute difference; therefore, we can simply choose the labeling ℓ for which θ̂_ℓ is minimal. The assumption d_A ≥ d_G holds, for instance, for the edit distance on graphs and the Hamming distance on adjacency matrices. Finally, note that all of the above results can be extended to directed graphs.

The graph normalization problem and the application of appropriate graph labeling procedures for the normalization of local graph structures are at the core of the proposed approach. Within the PATCHY-SAN framework, we normalize the neighborhood graphs of a vertex v. The labeling of the vertices is therefore constrained by the graph distance to v: for any two vertices u, w, if u is closer to v than w, then u is always ranked higher than w. This definition ensures that v always has rank 1, and that the closer a vertex is to v in G, the higher it is ranked in the vector space representation.

Algorithm 3 RECEPTIVEFIELD: Create Receptive Field
1: input: vertex v, graph labeling ℓ, receptive field size k
2: N = NEIGHASSEMB(v, k)
3: G_norm = NORMALIZEGRAPH(N, v, ℓ, k)
4: return G_norm

Algorithm 4 NORMALIZEGRAPH: Graph Normalization
1: input: subset of vertices U from original graph G, vertex v, graph labeling ℓ, receptive field size k
2: output: receptive field for v
3: compute ranking r of U using ℓ, subject to ∀ u, w ∈ U: d(u, v) < d(w, v) ⇒ r(u) < r(w)
4: if |U| > k then
5:   N = top k vertices in U according to r
6:   compute ranking r of N using ℓ, subject to ∀ u, w ∈ N: d(u, v) < d(w, v) ⇒ r(u) < r(w)
7: else if |U| < k then
8:   N = U and k − |U| dummy nodes
9: else
10:  N = U
11: construct the subgraph G[N] for the vertices N
12: canonicalize G[N], respecting the prior coloring r
13: return G[N]

Algorithm 4 lists the normalization procedure. If the size of the input set U is larger than k, it first applies the ranking based on ℓ to select the top k nodes and recomputes a ranking on the smaller set of nodes. If the size of U is smaller than k, it adds disconnected dummy nodes. Finally, it induces the subgraph on the vertices N and canonicalizes the graph, taking the ranking r as prior coloring.
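A minimal sketch of the ranking-and-crop core of Algorithm 4, continuing the earlier conventions; the final canonicalization step is left as a comment because the paper delegates it to the external tool NAUTY:

```python
def normalize_graph(candidates, v, labeling, k, dist):
    """Algorithm 4 (NORMALIZEGRAPH) without the final NAUTY call.

    Ranks the candidate vertices by (distance to v, descending label),
    so that v gets rank 1 and closer vertices always rank higher, then
    crops to the top k or pads with dummy nodes (None) to exactly k.
    dist: function (u, v) -> shortest-path distance in the original graph.
    """
    rank_key = lambda u: (dist(u, v), -labeling[u])
    ranked = sorted(candidates, key=rank_key)
    if len(ranked) > k:
        # In the paper, the labeling is recomputed on the cropped set
        # before re-ranking; for simplicity we reuse the original labels.
        ranked = sorted(ranked[:k], key=rank_key)
    field = ranked + [None] * (k - len(ranked))  # pad with dummy nodes
    # A full implementation would now build the subgraph induced on
    # `field` and canonicalize it with NAUTY, using the ranking as
    # prior coloring to break the remaining ties.
    return field
```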
Since most labeling methods are not injective, it is necessary to break ties between same-label nodes. To do so, we use NAUTY (McKay & Piperno, 2014). NAUTY accepts prior node partitions as input and breaks remaining ties by choosing the lexicographically maximal adjacency matrix. Graph isomorphism is in PTIME for graphs of bounded degree (Luks, 1982), and due to the constant size k of the neighborhood graphs, the algorithm runs in time polynomial in the size of the original graph and, on average, in time linear in k (Babai et al., 1980). Our experiments verify that computing a canonical labeling of the graph neighborhoods adds a negligible overhead.

We can relate PATCHY-SAN to CNNs for images as follows.

Theorem 3. Given a sequence of pixels taken from an image, applying PATCHY-SAN with receptive field size (2m − 1)^2, stride s, no zero padding, and 1-WL normalization to the sequence is identical (up to a fixed permutation of the receptive field) to the first layer of a CNN with receptive field size 2m − 1, stride s, and no zero padding.

Proof: It is possible to show that if an input graph is a square grid, then the 1-WL normalized receptive field constructed for a vertex is always a square grid graph with a unique vertex order.

4.4. Convolutional Architecture

PATCHY-SAN is able to process both vertex and edge attributes (discrete and continuous). Let a_v be the number of vertex attributes and let a_e be the number of edge attributes. For each input graph G, the normalized receptive fields for vertices and edges yield one (w, k, a_v) tensor and one (w, k, k, a_e) tensor. These can be reshaped to a (w·k, a_v) tensor and a (w·k^2, a_e) tensor, where a_v and a_e are the numbers of input channels. We can now apply a 1-dimensional convolutional layer with stride and receptive field size k to the first tensor and k^2 to the second. The rest of the architecture can be chosen arbitrarily; for instance, merge layers may combine the convolutional layers representing nodes and edges, respectively.

5. Complexity and Implementation

PATCHY-SAN's algorithm for creating receptive fields is highly efficient and naively parallelizable because the fields are generated independently. We can show the following asymptotic worst-case result.

Theorem 4. Let N be the number of graphs, let k be the receptive field size, let w be the width, and let O(f(n, m)) be the complexity of computing a given labeling ℓ for a graph with n vertices and m edges. PATCHY-SAN has a worst-case complexity of O(N·w·(f(n, m) + n log n + exp(k))) for computing the receptive fields for N graphs.

Proof: Node sequence selection requires the labeling of each input graph and the retrieval of the w highest ranked nodes.
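To make the tensor bookkeeping in Section 4.4 concrete, here is a sketch of the node and edge channels in the modern Keras API; the paper's implementation used the 2015 Theano/Keras stack, so treat this as an illustration of the reshaping and striding, not as the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def patchy_san_input(w, k, a_v, a_e, filters=16):
    """One 1-D conv channel for node attributes and one for edge
    attributes, merged afterwards (Section 4.4).

    Node fields arrive as a (w, k, a_v) tensor reshaped to (w*k, a_v);
    edge fields as a (w, k, k, a_e) tensor reshaped to (w*k*k, a_e).
    Stride equals the field size, so each convolution application sees
    exactly one normalized receptive field.
    """
    nodes = layers.Input(shape=(w * k, a_v), name="node_fields")
    edges = layers.Input(shape=(w * k * k, a_e), name="edge_fields")
    hn = layers.Conv1D(filters, kernel_size=k, strides=k,
                       activation="relu")(nodes)
    he = layers.Conv1D(filters, kernel_size=k * k, strides=k * k,
                       activation="relu")(edges)
    # Merge layer: one (2*filters)-dim vector per receptive field.
    merged = layers.Concatenate()([hn, he])
    return models.Model([nodes, edges], merged)

# Example: w=18 fields of size k=10 with 3 node and 2 edge attributes.
m = patchy_san_input(w=18, k=10, a_v=3, a_e=2)
m.summary()  # merged output shape: (None, 18, 32)
```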
For the creation of normalized graph patches, most computational effort is spent applying the labeling procedure ℓ to a neighborhood whose size may be larger than k. Let d be the maximum degree of the input graph G, and let U be the neighborhood returned by Algorithm 2. We have |U| ≤ (k − 2)d ≤ n. The term exp(k) comes from the worst-case complexity of the graph canonicalization algorithm NAUTY on a k-node graph (Miyazaki, 1997). For instance, for the Weisfeiler-Lehman algorithm, which has a complexity of O((n + m) log n) (Berkholz et al., 2013), and constants w ≪ n and k ≪ n, the complexity of PATCHY-SAN is linear in N and quasi-linear in m and n.

6. Experiments

We conduct three types of experiments: a runtime analysis, a qualitative analysis of the learned features, and a comparison to graph kernels on benchmark data sets.

6.1. Runtime Analysis

We assess the efficiency of PATCHY-SAN by applying it to real-world graphs. The objective is to compare the rate at which receptive fields are generated to the rate at which state of the art CNNs perform learning. All input graphs are part of the collection of the Python module GRAPH-TOOL (https://graph-tool.skewed.de/). For a given graph, we used PATCHY-SAN to compute a receptive field for all nodes, using the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm (Douglas, 2011) for the normalization. torus is a periodic lattice with 10,000 nodes; random is a random undirected graph with 10,000 nodes, a degree distribution P(k) ∝ 1/k, and k_max = 3; power is a network representing the topology of a power grid in the US; polbooks is a co-purchasing network of books about US politics published during the 2004 presidential election; preferential is a preferential attachment network model where newly added vertices have degree 3; astro-ph is a coauthorship network between authors of preprints posted on the astrophysics arXiv (Newman, 2001); email-enron is a communication network generated from about half a million sent emails (Leskovec et al., 2009). All experiments were run on commodity hardware with 64 GB RAM and a single 2.8 GHz CPU.

[Figure 4: Receptive fields per second on the different graphs (torus, random, power, polbooks, preferential, astro-ph, email-enron) for field sizes k from 5 to 50.]

Figure 4 depicts the receptive fields per second rates for each input graph. For receptive field sizes k = 5 and k = 10, PATCHY-SAN creates fields at a rate of more than 1000/s, except for email-enron with rates of 600/s and 320/s, respectively. For k = 50, the largest tested size, fields are created at a rate of at least 100/s.
A CNN with 2 convolutional and 2 dense layers learns at a rate of about 200-400 training examples per second on the same machine. Hence, the speed at which receptive fields are generated is sufficient to saturate a downstream CNN.

6.2. Feature Visualization

The visualization experiments aim to qualitatively investigate whether popular models such as the restricted Boltzmann machine (RBM) (Freund & Haussler, 1992) can be combined with PATCHY-SAN for unsupervised feature learning. For every input graph, we generated receptive fields for all nodes and used these as input to an RBM. The RBM had 100 hidden nodes and was trained for 30 epochs with contrastive divergence and a learning rate of 0.01. We visualize the features learned by a single-layer RBM for 1-dimensional Weisfeiler-Lehman (1-WL) normalized receptive fields of size 9. Note that the features learned by the RBM correspond to reoccurring receptive field patterns. Figure 5 depicts some of the features and samples drawn from them for four different graphs.

[Figure 5: Visualization of RBM features learned with 1-dimensional WL normalized receptive fields of size 9 for a torus (periodic lattice, top left), a preferential attachment graph (Barabási & Albert, 1999, bottom left), a co-purchasing network of political books (top right), and a random graph (bottom right). Instances of these graphs with about 100 nodes are depicted on the left. Shown are a visual representation of each feature's weights (the darker a pixel, the stronger the corresponding weight) and 3 graphs sampled from the RBMs by setting all but the hidden node corresponding to the feature to zero. Yellow nodes have position 1 in the adjacency matrices. (Best seen in color.)]
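As a rough sketch of this unsupervised setup, one could feed flattened receptive-field patches to scikit-learn's BernoulliRBM. The hyperparameters mirror the ones stated above, but the pipeline itself, including the use of k × k adjacency patches and the random stand-in input, is our illustration, not the authors' code.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Stand-in input: one flattened binary k x k adjacency patch per node,
# k = 9 (in the paper these come from 1-WL normalized neighborhoods).
rng = np.random.default_rng(0)
fields = rng.integers(0, 2, size=(1000, 9 * 9)).astype(float)

rbm = BernoulliRBM(n_components=100,    # 100 hidden nodes
                   learning_rate=0.01,  # as in the experiments
                   n_iter=30,           # 30 epochs of contrastive divergence
                   random_state=0)
rbm.fit(fields)
print(rbm.components_.shape)  # (100, 81): one 9x9 weight pattern per feature
```

Each row of `components_` is then one of the recurring receptive-field patterns that Figure 5 visualizes as a pixel image.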
6.3. Graph Classification

Graph classification is the problem of assigning graphs to one of several categories.

Data Sets. We use 6 standard benchmark data sets to compare run-time and classification accuracy with state of the art graph kernels: MUTAG, PTC, NCI1, NCI109, PROTEINS, and D&D. MUTAG (Debnath et al., 1991) is a data set of 188 nitro compounds where classes indicate whether the compound has a mutagenic effect on a bacterium. PTC consists of 344 chemical compounds where classes indicate carcinogenicity for male and female rats (Toivonen et al., 2003). NCI1 and NCI109 are chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines (Wale & Karypis, 2006). PROTEINS is a graph collection where nodes are secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3D space; graphs are classified as enzyme or non-enzyme. D&D is a data set of 1178 protein structures (Dobson & Doig, 2003) classified into enzymes and non-enzymes.

Table 1. Properties of the data sets and accuracy and timing results (in seconds) for PATCHY-SAN and four state of the art graph kernels.

Data set     MUTAG               PTC                 NCI1                 PROTEINS             D&D
Max          28                  109                 111                  620                  5748
Avg          17.93               25.56               29.87                39.06                284.32
Graphs       188                 344                 4110                 1113                 1178
SP           85.79 ± 2.51        58.53 ± 2.55        73.00 ± 0.51         75.07 ± 0.54         > 3 days
RW           83.68 ± 1.66        57.26 ± 1.30        > 3 days             74.22 ± 0.42         > 3 days
GK           81.58 ± 2.11        57.32 ± 1.13        62.28 ± 0.29         71.67 ± 0.55         78.45 ± 0.26
WL           80.72 ± 3.00 (5s)   56.97 ± 2.01 (30s)  80.22 ± 0.51 (375s)  72.92 ± 0.56 (143s)  77.95 ± 0.70 (609s)
PSCN k=5     91.58 ± 5.86 (2s)   59.43 ± 3.14 (4s)   72.80 ± 2.06 (59s)   74.10 ± 1.72 (22s)   74.58 ± 2.85 (121s)
PSCN k=10    88.95 ± 4.37 (3s)   62.29 ± 5.68 (6s)   76.34 ± 1.68 (76s)   75.00 ± 2.51 (30s)   76.27 ± 2.64 (154s)
PSCN k=10E   92.63 ± 4.21 (3s)   60.00 ± 4.82 (6s)   78.59 ± 1.89 (76s)   75.89 ± 2.76 (30s)   77.12 ± 2.41 (154s)
PSLR k=10    87.37 ± 7.88        58.57 ± 5.46        70.00 ± 1.98         71.79 ± 3.71         68.39 ± 5.56

Experimental Set-up. We compared PATCHY-SAN with the shortest-path kernel (SP) (Borgwardt & Kriegel, 2005), the random walk kernel (RW) (Gaertner et al., 2003), the graphlet count kernel (GK) (Shervashidze et al., 2009), and the Weisfeiler-Lehman subtree kernel (WL) (Shervashidze et al., 2011). Similar to previous work (Yanardag & Vishwanathan, 2015), we set the height parameter of WL to 2, the size of the graphlets for GK to 7, and chose the decay factor for RW from {10^-6, 10^-5, ..., 10^-1}. We performed 10-fold cross-validation with LIBSVM (Chang & Lin, 2011), using 9 folds for training and 1 for testing, and repeated the experiments 10 times. We report average prediction accuracies and standard deviations. For PATCHY-SAN (referred to as PSCN), we used 1-dimensional WL normalization, a width w equal to the average number of nodes (see Table 1), and receptive field sizes of k = 5 and k = 10. For these experiments we only used node attributes. In addition, we ran experiments for k = 10 where we combined receptive fields for nodes and edges using a merge layer (k = 10E).

Moreover, we ran experiments with the same set-up on larger social graph data sets (up to 12,000 graphs each, with an average of 400 nodes), and compared PATCHY-SAN with previously reported results for the graphlet count kernel (GK) and the deep graphlet count kernel (DGK) (Yanardag & Vishwanathan, 2015). Due to the larger size of these data sets, we removed dropout. We used the normalized node degree as attribute for PATCHY-SAN, highlighting one of its advantages: it can easily incorporate continuous features.

Table 2. Comparison of accuracy results on social graphs (Yanardag & Vishwanathan, 2015).

Data set   GK             DGK            PSCN k=10
COLLAB     72.84 ± 0.28   73.09 ± 0.25   72.60 ± 2.15
IMDB-B     65.87 ± 0.98   66.96 ± 0.56   71.00 ± 2.29
IMDB-M     43.89 ± 0.38   44.55 ± 0.52   45.23 ± 2.84
RE-B       77.34 ± 0.18   78.04 ± 0.39   86.30 ± 1.58
RE-M5k     41.01 ± 0.17   41.27 ± 0.18   49.10 ± 0.70
RE-M10k    31.82 ± 0.08   32.22 ± 0.10   41.32 ± 0.42

To make a fair comparison, we used a single network architecture with two convolutional layers, one dense hidden layer, and a softmax layer for all experiments. The first convolutional layer had 16 output channels (feature maps). The second convolutional layer has 8 output channels, a stride of s = 1, and a field size of 10. The convolutional layers have rectified linear units. The dense layer has 128 rectified linear units with a dropout rate of 0.5. Dropout and the relatively small number of neurons are needed to avoid overfitting on the smaller data sets. The only hyperparameters we optimized are the number of epochs and the batch size for the mini-batch gradient descent algorithm RMSPROP. All of the above was implemented with the THEANO (Bergstra et al., 2010) wrapper KERAS (Chollet, 2015). We also applied a logistic regression classifier (PSLR) on the patches for k = 10.
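Under the same caveat as the earlier snippet, the one-fits-all architecture described above might look as follows in today's Keras; w, k, and the number of node attributes a_v come from the setup above, and the training data is out of scope here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def pscn_model(w, k, a_v, n_classes):
    """One-fits-all PSCN architecture from the experiments, sketched
    in the modern Keras API (the paper used the 2015 Theano/Keras stack)."""
    model = models.Sequential([
        layers.Input(shape=(w * k, a_v)),
        # First conv: 16 feature maps, one application per receptive field.
        layers.Conv1D(16, kernel_size=k, strides=k, activation="relu"),
        # Second conv: 8 feature maps, field size 10, stride 1
        # (requires w >= 10, which holds for the benchmark widths).
        layers.Conv1D(8, kernel_size=10, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example for a MUTAG-like setting: w=18, k=10, one node attribute channel.
m = pscn_model(w=18, k=10, a_v=1, n_classes=2)
m.summary()
```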
Results. Table 1 lists the results of the experiments. We omit the results for NCI109 as they are almost identical to those for NCI1. Despite using a one-fits-all CNN architecture, the CNNs' accuracy is highly competitive with existing graph kernels. In most cases, a receptive field size of 10 results in the best classification accuracy. The relatively high variance can be explained by the small size of the benchmark data sets and the fact that the CNN hyperparameters (with the exception of epochs and batch size) were not tuned to individual data sets. Similar to the experience on image and text data, we expect PATCHY-SAN to perform even better on large data sets. Moreover, PATCHY-SAN is between 2 and 8 times more efficient than the most efficient graph kernel (WL); we expect this performance advantage to be much more pronounced for data sets with a large number of graphs. Results for betweenness centrality normalization are similar, with the exception of the runtime, which increases by about 10%. Logistic regression applied to PATCHY-SAN's receptive fields performs worse, indicating that PATCHY-SAN works especially well in conjunction with CNNs, which learn non-linear feature combinations and share weights across receptive fields.

PATCHY-SAN is also highly competitive on the social graph data; Table 2 lists the results of these experiments. It significantly outperforms the other two kernels on four of the six data sets and achieves ties on the rest.

7. Conclusion and Future Work

We proposed a framework for learning graph representations that are especially beneficial in conjunction with CNNs. It combines two complementary procedures: (a) selecting a sequence of nodes that covers large parts of the graph and (b) generating local normalized neighborhood representations for each of the nodes in the sequence. Experiments show that the approach is competitive with state of the art graph kernels.

Directions for future work include the use of alternative neural network architectures such as RNNs; combining different receptive field sizes; pretraining with RBMs and autoencoders; and statistical relational models based on the ideas of the approach.

Acknowledgments

Many thanks to the anonymous ICML reviewers who provided tremendously helpful comments. The research leading to these results has received funding from the European Union's Horizon 2020 innovation action program under grant agreement No 653449-TYPES.

References

Alon, Uri. Network motifs: theory and experimental approaches. Nature Reviews Genetics, 8(6):450-461, 2007.

Atlas, Les E., Homma, Toshiteru, and Marks, Robert J. II. An artificial neural network for spatio-temporal bipolar patterns: Application to phoneme classification. In Anderson, D. Z. (ed.), Neural Information Processing Systems, pp. 31-40, 1988.

Babai, László, Erdős, Paul, and Selkow, Stanley M. Random graph isomorphism. SIAM J. Computing, 9(3):628-635, 1980.

Barabási, Albert-László and Albert, Réka. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

Berkholz, Christoph, Bonsma, Paul S., and Grohe, Martin. Tight lower and upper bounds for the complexity of canonical colour refinement. In Proceedings of the European Symposium on Algorithms, pp. 145-156, 2013.

Borgwardt, Karsten M. and Kriegel, Hans-Peter. Shortest-path kernels on graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), pp. 74-81, 2005.

Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014.
Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1-27:27, 2011.

Chollet, François. Keras. https://github.com/fchollet/keras, 2015.

Debnath, Asim Kumar, Lopez de Compadre, Rosa L., Debnath, Gargi, Shusterman, Alan J., and Hansch, Corwin. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem., 34:786-797, 1991.

Dobson, Paul D. and Doig, Andrew J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771-783, 2003.

Douglas, Brendan L. The Weisfeiler-Lehman method and graph isomorphism testing. arXiv preprint arXiv:1101.5211, 2011.

Duvenaud, David K., Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2215-2223, 2015.

Freund, Yoav and Haussler, David. Unsupervised learning of distributions of binary vectors using two layer networks. In Advances in Neural Information Processing Systems, pp. 912-919, 1992.

Fukushima, Kunihiko. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.

Gaertner, Thomas, Flach, Peter, and Wrobel, Stefan. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory, pp. 129-143, 2003.

Haussler, David. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.

Henaff, Mikael, Bruna, Joan, and LeCun, Yann. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Hubel, David H. and Wiesel, Torsten N. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195:215-243, 1968.

Kersting, Kristian, Ahmadi, Babak, and Natarajan, Sriraam. Counting belief propagation. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 277-284, 2009.

Kersting, Kristian, Mladenov, Martin, Garnett, Roman, and Grohe, Martin. Power iterated color refinement. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), pp. 1904-1910, 2014.

Kondor, Risi and Borgwardt, Karsten M. The skew spectrum of graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 496-503, 2008.

Kondor, Risi and Lafferty, John. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the 19th International Conference on Machine Learning (ICML), pp. 315-322, 2002.

Kondor, Risi, Shervashidze, Nino, and Borgwardt, Karsten M. The graphlet spectrum. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 529-536, 2009.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541-551, 1989.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521:436-444, 2015.

Leskovec, Jure, Lang, Kevin J., Dasgupta, Anirban, and Mahoney, Michael W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29-123, 2009.

Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Luks, Eugene M. Isomorphism of graphs of bounded valence can be tested in polynomial time. Journal of Computer and System Sciences, 25:42-65, 1982.

McKay, Brendan D. and Piperno, Adolfo. Practical graph isomorphism, II. Journal of Symbolic Computation, 60:94-112, 2014.

Milo, Ron, Shen-Orr, Shai, Itzkovitz, Shalev, Kashtan, Nadav, Chklovskii, Dmitri, and Alon, Uri. Network motifs: simple building blocks of complex networks. Science, 298(5594):824-827, 2002.

Miyazaki, Takunari. The complexity of McKay's canonical labeling algorithm. In Groups and Computation II, volume 28, pp. 239-256, 1997.

Newman, Mark E. J. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404-409, 2001.

Orsini, F., Frasconi, P., and De Raedt, L. Graph invariant kernels. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 678-689, 2015.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009.

Shervashidze, Nino, Vishwanathan, S. V. N., Petri, Tobias H., Mehlhorn, Kurt, and Borgwardt, Karsten M. Efficient graphlet kernels for large graph comparison. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 488-495, 2009.

Shervashidze, Nino, Schweitzer, Pascal, van Leeuwen, Erik Jan, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res., 12:2539-2561, 2011.

Toivonen, Hannu, Srinivasan, Ashwin, King, Ross D., Kramer, Stefan, and Helma, Christoph. Statistical evaluation of the predictive toxicology challenge 2000-2001. Bioinformatics, 19(10):1183-1193, 2003.

Vishwanathan, S. V. N., Schraudolph, Nicol N., Kondor, Risi, and Borgwardt, Karsten M. Graph kernels. J. Mach. Learn. Res., 11:1201-1242, 2010.

Wale, Nikil and Karypis, George. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of the International Conference on Data Mining (ICDM), pp. 678-689, 2006.

Wallach, Izhar, Dzamba, Michael, and Heifets, Abraham. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. CoRR, abs/1510.02855, 2015.

Weisfeiler, Boris and Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12-16, 1968.

Yanardag, Pinar and Vishwanathan, S. V. N. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365-1374, 2015.
