Inductive Representation Learning on Large Graphs


Authors: William L. Hamilton, Rex Ying, Jure Leskovec

Department of Computer Science, Stanford University, Stanford, CA 94305
wleif@stanford.edu, rexying@stanford.edu, jure@cs.stanford.edu

Abstract

Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.

1 Introduction

Low-dimensional vector embeddings of nodes in large graphs¹ have proved extremely useful as feature inputs for a wide variety of prediction and graph analysis tasks [5, 11, 28, 35, 36]. The basic idea behind node embedding approaches is to use dimensionality-reduction techniques to distill the high-dimensional information about a node's graph neighborhood into a dense vector embedding. These node embeddings can then be fed to downstream machine learning systems and aid in tasks such as node classification, clustering, and link prediction [11, 28, 35].
However, previous works have focused on embedding nodes from a single fixed graph, and many real-world applications require embeddings to be quickly generated for unseen nodes, or entirely new (sub)graphs. This inductive capability is essential for high-throughput, production machine learning systems, which operate on evolving graphs and constantly encounter unseen nodes (e.g., posts on Reddit, users and videos on YouTube). An inductive approach to generating node embeddings also facilitates generalization across graphs with the same form of features: for example, one could train an embedding generator on protein-protein interaction graphs derived from a model organism, and then easily produce node embeddings for data collected on new organisms using the trained model.

The inductive node embedding problem is especially difficult, compared to the transductive setting, because generalizing to unseen nodes requires "aligning" newly observed subgraphs to the node embeddings that the algorithm has already optimized on. An inductive framework must learn to recognize structural properties of a node's neighborhood that reveal both the node's local role in the graph, as well as its global position.

∗ The two first authors made equal contributions.
¹ While it is common to refer to these data structures as social or biological networks, we use the term graph to avoid ambiguity with neural network terminology.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1: Visual illustration of the GraphSAGE sample and aggregate approach.]

Most existing approaches to generating node embeddings are inherently transductive. The majority of these approaches directly optimize the embeddings for each node using matrix-factorization-based objectives, and do not naturally generalize to unseen data, since they make predictions on nodes in a single, fixed graph [5, 11, 23, 28, 35, 36, 37, 39].
These approaches can be modified to operate in an inductive setting (e.g., [28]), but these modifications tend to be computationally expensive, requiring additional rounds of gradient descent before new predictions can be made. There are also recent approaches to learning over graph structures using convolution operators that offer promise as an embedding methodology [17]. So far, graph convolutional networks (GCNs) have only been applied in the transductive setting with fixed graphs [17, 18]. In this work we both extend GCNs to the task of inductive unsupervised learning and propose a framework that generalizes the GCN approach to use trainable aggregation functions (beyond simple convolutions).

Present work. We propose a general framework, called GraphSAGE (SAmple and aggreGatE), for inductive node embedding. Unlike embedding approaches that are based on matrix factorization, we leverage node features (e.g., text attributes, node profile information, node degrees) in order to learn an embedding function that generalizes to unseen nodes. By incorporating node features in the learning algorithm, we simultaneously learn the topological structure of each node's neighborhood as well as the distribution of node features in the neighborhood. While we focus on feature-rich graphs (e.g., citation data with text attributes, biological data with functional/molecular markers), our approach can also make use of structural features that are present in all graphs (e.g., node degrees). Thus, our algorithm can also be applied to graphs without node features.

Instead of training a distinct embedding vector for each node, we train a set of aggregator functions that learn to aggregate feature information from a node's local neighborhood (Figure 1). Each aggregator function aggregates information from a different number of hops, or search depth, away from a given node.
At test, or inference, time, we use our trained system to generate embeddings for entirely unseen nodes by applying the learned aggregation functions. Following previous work on generating node embeddings, we design an unsupervised loss function that allows GraphSAGE to be trained without task-specific supervision. We also show that GraphSAGE can be trained in a fully supervised manner.

We evaluate our algorithm on three node-classification benchmarks, which test GraphSAGE's ability to generate useful embeddings on unseen data. We use two evolving document graphs based on citation data and Reddit post data (predicting paper and post categories, respectively), and a multi-graph generalization experiment based on a dataset of protein-protein interactions (predicting protein functions). Using these benchmarks, we show that our approach is able to effectively generate representations for unseen nodes and outperform relevant baselines by a significant margin: across domains, our supervised approach improves classification F1-scores by an average of 51% compared to using node features alone, and GraphSAGE consistently outperforms a strong, transductive baseline [28], despite this baseline taking ∼100× longer to run on unseen nodes. We also show that the new aggregator architectures we propose provide significant gains (7.4% on average) compared to an aggregator inspired by graph convolutional networks [17]. Lastly, we probe the expressive capability of our approach and show, through theoretical analysis, that GraphSAGE is capable of learning structural information about a node's role in a graph, despite the fact that it is inherently based on features (Section 5).

2 Related work

Our algorithm is conceptually related to previous node embedding approaches, general supervised approaches to learning over graphs, and recent advancements in applying convolutional neural networks to graph-structured data.²
Factorization-based embedding approaches. There are a number of recent node embedding approaches that learn low-dimensional embeddings using random walk statistics and matrix-factorization-based learning objectives [5, 11, 28, 35, 36]. These methods also bear close relationships to more classic approaches to spectral clustering [23], multi-dimensional scaling [19], as well as the PageRank algorithm [25]. Since these embedding algorithms directly train node embeddings for individual nodes, they are inherently transductive and, at the very least, require expensive additional training (e.g., via stochastic gradient descent) to make predictions on new nodes. In addition, for many of these approaches (e.g., [11, 28, 35, 36]) the objective function is invariant to orthogonal transformations of the embeddings, which means that the embedding space does not naturally generalize between graphs and can drift during re-training. One notable exception to this trend is the Planetoid-I algorithm introduced by Yang et al. [40], which is an inductive, embedding-based approach to semi-supervised learning. However, Planetoid-I does not use any graph structural information during inference; instead, it uses the graph structure as a form of regularization during training. Unlike these previous approaches, we leverage feature information in order to train a model to produce embeddings for unseen nodes.

Supervised learning over graphs. Beyond node embedding approaches, there is a rich literature on supervised learning over graph-structured data. This includes a wide variety of kernel-based approaches, where feature vectors for graphs are derived from various graph kernels (see [32] and references therein). There are also a number of recent neural network approaches to supervised learning over graph structures [7, 10, 21, 31]. Our approach is conceptually inspired by a number of these algorithms.
However, whereas these previous approaches attempt to classify entire graphs (or subgraphs), the focus of this work is generating useful representations for individual nodes.

Graph convolutional networks. In recent years, several convolutional neural network architectures for learning over graphs have been proposed (e.g., [4, 9, 8, 17, 24]). The majority of these methods do not scale to large graphs or are designed for whole-graph classification (or both) [4, 9, 8, 24]. However, our approach is closely related to the graph convolutional network (GCN), introduced by Kipf et al. [17, 18]. The original GCN algorithm [17] is designed for semi-supervised learning in a transductive setting, and the exact algorithm requires that the full graph Laplacian is known during training. A simple variant of our algorithm can be viewed as an extension of the GCN framework to the inductive setting, a point which we revisit in Section 3.3.

3 Proposed method: GraphSAGE

The key idea behind our approach is that we learn how to aggregate feature information from a node's local neighborhood (e.g., the degrees or text attributes of nearby nodes). We first describe the GraphSAGE embedding generation (i.e., forward propagation) algorithm, which generates embeddings for nodes assuming that the GraphSAGE model parameters are already learned (Section 3.1). We then describe how the GraphSAGE model parameters can be learned using standard stochastic gradient descent and backpropagation techniques (Section 3.2).

3.1 Embedding generation (i.e., forward propagation) algorithm

In this section, we describe the embedding generation, or forward propagation, algorithm (Algorithm 1), which assumes that the model has already been trained and that the parameters are fixed.
In particular, we assume that we have learned the parameters of K aggregator functions (denoted AGGREGATE_k, ∀k ∈ {1, ..., K}), which aggregate information from node neighbors, as well as a set of weight matrices W^k, ∀k ∈ {1, ..., K}, which are used to propagate information between different layers of the model, or "search depths". Section 3.2 describes how we train these parameters.

² In the time between this paper's original submission to NIPS 2017 and the submission of the final, accepted (i.e., "camera-ready") version, there have been a number of closely related (e.g., follow-up) works published on pre-print servers. For temporal clarity, we do not review or compare against these papers in detail.

Algorithm 1: GraphSAGE embedding generation (i.e., forward propagation) algorithm
Input: graph G(V, E); input features {x_v, ∀v ∈ V}; depth K; weight matrices W^k, ∀k ∈ {1, ..., K}; non-linearity σ; differentiable aggregator functions AGGREGATE_k, ∀k ∈ {1, ..., K}; neighborhood function N : v → 2^V
Output: vector representations z_v for all v ∈ V
1  h^0_v ← x_v, ∀v ∈ V
2  for k = 1 ... K do
3      for v ∈ V do
4          h^k_{N(v)} ← AGGREGATE_k({h^{k−1}_u, ∀u ∈ N(v)})
5          h^k_v ← σ(W^k · CONCAT(h^{k−1}_v, h^k_{N(v)}))
6      end
7      h^k_v ← h^k_v / ‖h^k_v‖₂, ∀v ∈ V
8  end
9  z_v ← h^K_v, ∀v ∈ V

The intuition behind Algorithm 1 is that at each iteration, or search depth, nodes aggregate information from their local neighbors, and as this process iterates, nodes incrementally gain more and more information from further reaches of the graph. Algorithm 1 describes the embedding generation process in the case where the entire graph, G = (V, E), and features for all nodes, x_v, ∀v ∈ V, are provided as input. We describe how to generalize this to the minibatch setting below.
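To make the control flow of Algorithm 1 concrete, here is a minimal sketch in plain Python (not the authors' TensorFlow implementation): the AGGREGATE placeholder is filled in with an elementwise mean, σ is ReLU, and the dict-of-lists data structures and function names are illustrative assumptions. It assumes every node has at least one neighbor.

```python
import math

def graphsage_forward(features, neighbors, weights, K):
    """Sketch of Algorithm 1 with a mean aggregator and concatenation.

    features:  dict node -> list[float], the inputs x_v
    neighbors: dict node -> list of neighbor nodes, N(v)
    weights:   list of K weight matrices (each a list of rows), one W^k per depth
    """
    h = dict(features)  # h^0_v = x_v (line 1)
    for k in range(K):
        new_h = {}
        for v in h:
            # Line 4: aggregate the depth-(k-1) representations of v's neighbors.
            neigh = [h[u] for u in neighbors[v]]
            agg = [sum(col) / len(col) for col in zip(*neigh)]
            # Line 5: concatenate self and neighborhood vectors, apply W^k
            # and a nonlinearity (here ReLU).
            z = h[v] + agg
            out = [max(0.0, sum(w_i * z_i for w_i, z_i in zip(row, z)))
                   for row in weights[k]]
            # Line 7: L2-normalize the representation.
            norm = math.sqrt(sum(x * x for x in out)) or 1.0
            new_h[v] = [x / norm for x in out]
        h = new_h
    return h  # z_v = h^K_v (line 9)
```

Because each layer concatenates a d-dimensional self vector with a d-dimensional aggregate, each W^k here maps 2d inputs to d outputs.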
Each step in the outer loop of Algorithm 1 proceeds as follows, where k denotes the current step in the outer loop (or the depth of the search) and h^k denotes a node's representation at this step. First, each node v ∈ V aggregates the representations of the nodes in its immediate neighborhood, {h^{k−1}_u, ∀u ∈ N(v)}, into a single vector h^k_{N(v)}. Note that this aggregation step depends on the representations generated at the previous iteration of the outer loop (i.e., k − 1), and the k = 0 ("base case") representations are defined as the input node features. After aggregating the neighboring feature vectors, GraphSAGE then concatenates the node's current representation, h^{k−1}_v, with the aggregated neighborhood vector, h^k_{N(v)}, and this concatenated vector is fed through a fully connected layer with nonlinear activation function σ, which transforms the representations to be used at the next step of the algorithm (i.e., h^k_v, ∀v ∈ V). For notational convenience, we denote the final representations output at depth K as z_v ≡ h^K_v, ∀v ∈ V. The aggregation of the neighbor representations can be done by a variety of aggregator architectures (denoted by the AGGREGATE placeholder in Algorithm 1), and we discuss different architecture choices in Section 3.3 below.

To extend Algorithm 1 to the minibatch setting, given a set of input nodes, we first forward sample the required neighborhood sets (up to depth K) and then we run the inner loop (line 3 in Algorithm 1), but instead of iterating over all nodes, we compute only the representations that are necessary to satisfy the recursion at each depth (Appendix A contains complete minibatch pseudocode).

Relation to the Weisfeiler-Lehman Isomorphism Test. The GraphSAGE algorithm is conceptually inspired by a classic algorithm for testing graph isomorphism.
If, in Algorithm 1, we (i) set K = |V|, (ii) set the weight matrices as the identity, and (iii) use an appropriate hash function as an aggregator (with no non-linearity), then Algorithm 1 is an instance of the Weisfeiler-Lehman (WL) isomorphism test, also known as "naive vertex refinement" [32]. If the sets of representations {z_v, ∀v ∈ V} output by Algorithm 1 for two subgraphs are identical, then the WL test declares the two subgraphs to be isomorphic. This test is known to fail in some cases, but is valid for a broad class of graphs [32]. GraphSAGE is a continuous approximation to the WL test, where we replace the hash function with trainable neural network aggregators. Of course, we use GraphSAGE to generate useful node representations, not to test graph isomorphism. Nevertheless, the connection between GraphSAGE and the classic WL test provides theoretical context for our algorithm design to learn the topological structure of node neighborhoods.

Neighborhood definition. In this work, we uniformly sample a fixed-size set of neighbors, instead of using full neighborhood sets in Algorithm 1, in order to keep the computational footprint of each batch fixed.³ That is, using overloaded notation, we define N(v) as a fixed-size, uniform draw from the set {u ∈ V : (u, v) ∈ E}, and we draw different uniform samples at each iteration, k, in Algorithm 1. Without this sampling, the memory and expected runtime of a single batch are unpredictable and in the worst case O(|V|). In contrast, the per-batch space and time complexity for GraphSAGE is fixed at O(∏_{i=1}^{K} S_i), where S_i, i ∈ {1, ..., K}, and K are user-specified constants. Practically speaking, we found that our approach could achieve high performance with K = 2 and S_1 · S_2 ≤ 500 (see Section 4.4 for details).
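A sketch of this overloaded N(v) follows. The convention used here, sampling without replacement when the neighborhood is large enough and resampling with replacement otherwise, is an illustrative assumption, not the paper's exact implementation.

```python
import random

def sample_neighborhood(adj, v, size, rng=None):
    """Fixed-size uniform neighbor sample: bounds per-batch cost at
    O(prod_i S_i) instead of the worst-case O(|V|) of full neighborhoods.

    adj: dict node -> list of neighbor nodes; size: the constant S_k.
    """
    rng = rng or random.Random(0)
    neigh = adj[v]
    if len(neigh) >= size:
        return rng.sample(neigh, size)          # without replacement
    return [rng.choice(neigh) for _ in range(size)]  # pad by resampling
```

A fresh sample would be drawn at every depth k of Algorithm 1, so different layers see different subsets of each node's neighborhood.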
3.2 Learning the parameters of GraphSAGE

In order to learn useful, predictive representations in a fully unsupervised setting, we apply a graph-based loss function to the output representations, z_u, ∀u ∈ V, and tune the weight matrices, W^k, ∀k ∈ {1, ..., K}, and parameters of the aggregator functions via stochastic gradient descent. The graph-based loss function encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct:

    J_G(z_u) = −log(σ(z_u⊤ z_v)) − Q · E_{v_n ∼ P_n(v)} log(σ(−z_u⊤ z_{v_n})),    (1)

where v is a node that co-occurs near u on a fixed-length random walk, σ is the sigmoid function, P_n is a negative sampling distribution, and Q defines the number of negative samples. Importantly, unlike previous embedding approaches, the representations z_u that we feed into this loss function are generated from the features contained within a node's local neighborhood, rather than training a unique embedding for each node (via an embedding look-up). This unsupervised setting emulates situations where node features are provided to downstream machine learning applications, as a service or in a static repository. In cases where representations are to be used only on a specific downstream task, the unsupervised loss (Equation 1) can simply be replaced, or augmented, by a task-specific objective (e.g., cross-entropy loss).

3.3 Aggregator architectures

Unlike machine learning over N-D lattices (e.g., sentences, images, or 3-D volumes), a node's neighbors have no natural ordering; thus, the aggregator functions in Algorithm 1 must operate over an unordered set of vectors. Ideally, an aggregator function would be symmetric (i.e., invariant to permutations of its inputs) while still being trainable and maintaining high representational capacity.
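Before examining concrete aggregators, here is a minimal Monte-Carlo sketch of the unsupervised objective in Equation (1) from Section 3.2. The expectation over P_n is approximated by an average over explicitly supplied negative-sample embeddings; the function names are illustrative, not from the paper's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def graph_loss(z_u, z_v, z_negatives, Q):
    """Equation (1): pull the random-walk co-occurring pair (u, v)
    together, and push u away from negatives drawn from P_n, weighted
    by the negative-sample count Q."""
    positive = -math.log(sigmoid(dot(z_u, z_v)))
    negative = -Q * sum(math.log(sigmoid(-dot(z_u, z_n)))
                        for z_n in z_negatives) / len(z_negatives)
    return positive + negative
```

A pair of similar embeddings with dissimilar negatives yields a lower loss than the reverse arrangement, which is exactly the gradient signal used to tune W^k and the aggregator parameters.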
The symmetry property of the aggregation function ensures that our neural network model can be trained and applied to arbitrarily ordered node neighborhood feature sets. We examined three candidate aggregator functions:

Mean aggregator. Our first candidate aggregator function is the mean operator, where we simply take the elementwise mean of the vectors in {h^{k−1}_u, ∀u ∈ N(v)}. The mean aggregator is nearly equivalent to the convolutional propagation rule used in the transductive GCN framework [17]. In particular, we can derive an inductive variant of the GCN approach by replacing lines 4 and 5 in Algorithm 1 with the following:⁴

    h^k_v ← σ(W · MEAN({h^{k−1}_v} ∪ {h^{k−1}_u, ∀u ∈ N(v)})).    (2)

We call this modified mean-based aggregator convolutional since it is a rough, linear approximation of a localized spectral convolution [17]. An important distinction between this convolutional aggregator and our other proposed aggregators is that it does not perform the concatenation operation in line 5 of Algorithm 1, i.e., the convolutional aggregator does not concatenate the node's previous-layer representation h^{k−1}_v with the aggregated neighborhood vector h^k_{N(v)}. This concatenation can be viewed as a simple form of a "skip connection" [13] between the different "search depths", or "layers", of the GraphSAGE algorithm, and it leads to significant gains in performance (Section 4).

LSTM aggregator. We also examined a more complex aggregator based on an LSTM architecture [14]. Compared to the mean aggregator, LSTMs have the advantage of larger expressive capability. However, it is important to note that LSTMs are not inherently symmetric (i.e., they are not permutation invariant), since they process their inputs in a sequential manner. We adapt LSTMs to operate on an unordered set by simply applying the LSTMs to a random permutation of the node's neighbors.
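The aggregation step inside the convolutional mean rule of Equation (2) reduces to a sketch like the following (a hypothetical helper; the learned W and the nonlinearity σ are omitted to isolate the aggregation itself):

```python
def gcn_mean_aggregate(h_self, h_neighbors):
    """Eq. (2) aggregation: elementwise mean over the node's own previous
    representation and its neighbors' representations. Note there is no
    concatenation / skip connection, unlike line 5 of Algorithm 1."""
    vectors = [h_self] + list(h_neighbors)
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

Because the self vector is pooled into the same mean rather than concatenated, the output has the same dimension as the inputs, which is precisely why this variant forgoes the skip connection discussed above.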
³ Exploring non-uniform samplers is an important direction for future work.
⁴ Note that this differs from Kipf et al.'s exact equation by a minor normalization constant [17].

Pooling aggregator. The final aggregator we examine is both symmetric and trainable. In this pooling approach, each neighbor's vector is independently fed through a fully-connected neural network; following this transformation, an elementwise max-pooling operation is applied to aggregate information across the neighbor set:

    AGGREGATE^pool_k = max({σ(W_pool h^k_{u_i} + b), ∀u_i ∈ N(v)}),    (3)

where max denotes the elementwise max operator and σ is a nonlinear activation function. In principle, the function applied before the max pooling can be an arbitrarily deep multi-layer perceptron, but we focus on simple single-layer architectures in this work. This approach is inspired by recent advancements in applying neural network architectures to learn over general point sets [29]. Intuitively, the multi-layer perceptron can be thought of as a set of functions that compute features for each of the node representations in the neighbor set. By applying the max-pooling operator to each of the computed features, the model effectively captures different aspects of the neighborhood set. Note also that, in principle, any symmetric vector function could be used in place of the max operator (e.g., an elementwise mean). We found no significant difference between max- and mean-pooling in development tests and thus focused on max-pooling for the rest of our experiments.

4 Experiments

We test the performance of GraphSAGE on three benchmark tasks: (i) classifying academic papers into different subjects using the Web of Science citation dataset, (ii) classifying Reddit posts as belonging to different communities, and (iii) classifying protein functions across various biological protein-protein interaction (PPI) graphs.
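To round out Section 3.3 before turning to the experiments, the pooling aggregator of Equation (3) can be sketched as follows, with a single-layer MLP and ReLU standing in for σ; the parameter names W_pool and b follow Equation (3), but the function itself is an illustrative assumption.

```python
def pool_aggregate(h_neighbors, W_pool, b):
    """Eq. (3): push each neighbor vector through a one-layer MLP with a
    ReLU nonlinearity, then take the elementwise max over the neighbor
    set. Symmetric in the neighbor ordering, yet trainable via W_pool, b."""
    transformed = [
        [max(0.0, sum(w * x for w, x in zip(row, h)) + bj)
         for row, bj in zip(W_pool, b)]
        for h in h_neighbors
    ]
    return [max(col) for col in zip(*transformed)]
```

Since the max is taken feature-wise after a shared transformation, permuting the neighbor list leaves the output unchanged, which is the symmetry property required of an aggregator.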
Sections 4.1 and 4.2 summarize the datasets, and the supplementary material contains additional information. In all these experiments, we perform predictions on nodes that are not seen during training, and, in the case of the PPI dataset, we test on entirely unseen graphs.

Experimental set-up. To contextualize the empirical results on our inductive benchmarks, we compare against four baselines: a random classifier, a logistic regression feature-based classifier (that ignores graph structure), the DeepWalk algorithm [28] as a representative factorization-based approach, and a concatenation of the raw features and DeepWalk embeddings. We also compare four variants of GraphSAGE that use the different aggregator functions (Section 3.3). Since the "convolutional" variant of GraphSAGE is an extended, inductive version of Kipf et al.'s semi-supervised GCN [17], we term this variant GraphSAGE-GCN. We test unsupervised variants of GraphSAGE trained according to the loss in Equation (1), as well as supervised variants that are trained directly on classification cross-entropy loss. For all the GraphSAGE variants we used rectified linear units as the non-linearity and set K = 2 with neighborhood sample sizes S_1 = 25 and S_2 = 10 (see Section 4.4 for sensitivity analyses).

For the Reddit and citation datasets, we use "online" training for DeepWalk as described in Perozzi et al. [28], where we run a new round of SGD optimization to embed the new test nodes before making predictions (see the Appendix for details). In the multi-graph setting, we cannot apply DeepWalk, since the embedding spaces generated by running the DeepWalk algorithm on different disjoint graphs can be arbitrarily rotated with respect to each other (Appendix D). All models were implemented in TensorFlow [1] with the Adam optimizer [16] (except DeepWalk, which performed better with the vanilla gradient descent optimizer).
We designed our experiments with the goals of (i) verifying the improvement of GraphSAGE over the baseline approaches (i.e., raw features and DeepWalk) and (ii) providing a rigorous comparison of the different GraphSAGE aggregator architectures. In order to provide a fair comparison, all models share an identical implementation of their minibatch iterators, loss function, and neighborhood sampler (when applicable). Moreover, in order to guard against unintentional "hyperparameter hacking" in the comparisons between GraphSAGE aggregators, we sweep over the same set of hyperparameters for all GraphSAGE variants (choosing the best setting for each variant according to performance on a validation set). The set of possible hyperparameter values was determined on early validation tests using subsets of the citation and Reddit data that we then discarded from our analyses. The appendix contains further implementation details.⁵

⁵ Code and links to the datasets: http://snap.stanford.edu/graphsage/

Table 1: Prediction results for the three datasets (micro-averaged F1 scores). Results for unsupervised and fully supervised GraphSAGE are shown. Analogous trends hold for macro-averaged scores.

                         Citation              Reddit                PPI
Name                     Unsup. F1  Sup. F1    Unsup. F1  Sup. F1    Unsup. F1  Sup. F1
Random                   0.206      0.206      0.043      0.042      0.396      0.396
Raw features             0.575      0.575      0.585      0.585      0.422      0.422
DeepWalk                 0.565      0.565      0.324      0.324      —          —
DeepWalk + features      0.701      0.701      0.691      0.691      —          —
GraphSAGE-GCN            0.742      0.772      0.908      0.930      0.465      0.500
GraphSAGE-mean           0.778      0.820      0.897      0.950      0.486      0.598
GraphSAGE-LSTM           0.788      0.832      0.907      0.954      0.482      0.612
GraphSAGE-pool           0.798      0.839      0.892      0.948      0.502      0.600
% gain over feat.        39%        46%        55%        63%        19%        45%

Figure 2: A: Timing experiments on Reddit data, with training batches of size 512 and inference on the full test set (79,534 nodes).
B: Model performance with respect to the size of the sampled neighborhood, where the "neighborhood sample size" refers to the number of neighbors sampled at each depth for K = 2 with S_1 = S_2 (on the citation data using GraphSAGE-mean).

4.1 Inductive learning on evolving graphs: Citation and Reddit data

Our first two experiments are on classifying nodes in evolving information graphs, a task that is especially relevant to high-throughput production systems, which constantly encounter unseen data.

Citation data. Our first task is predicting paper subject categories on a large citation dataset. We use an undirected citation graph dataset derived from the Thomson Reuters Web of Science Core Collection, corresponding to all papers in six biology-related fields for the years 2000-2005. The node labels for this dataset correspond to the six different field labels. In total, this dataset contains 302,424 nodes with an average degree of 9.15. We train all the algorithms on the 2000-2004 data and use the 2005 data for testing (with 30% used for validation). For features, we used node degrees and processed the paper abstracts according to Arora et al.'s [2] sentence embedding approach, with 300-dimensional word vectors trained using the GenSim word2vec implementation [30].

Reddit data. In our second task, we predict which community different Reddit posts belong to. Reddit is a large online discussion forum where users post and comment on content in different topical communities. We constructed a graph dataset from Reddit posts made in the month of September 2014. The node label in this case is the community, or "subreddit", that a post belongs to. We sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first 20 days for training and the remaining days for testing (with 30% used for validation).
For features, we use off-the-shelf 300-dimensional GloVe CommonCrawl word vectors [27]; for each post, we concatenated (i) the average embedding of the post title, (ii) the average embedding of all the post's comments, (iii) the post's score, and (iv) the number of comments made on the post.

The first four columns of Table 1 summarize the performance of GraphSAGE as well as the baseline approaches on these two datasets. We find that GraphSAGE outperforms all the baselines by a significant margin, and the trainable, neural network aggregators provide significant gains compared to the GCN approach. For example, the unsupervised variant GraphSAGE-pool outperforms the concatenation of the DeepWalk embeddings and the raw features by 13.8% on the citation data and 29.1% on the Reddit data, while the supervised version provides a gain of 19.7% and 37.2%, respectively. Interestingly, the LSTM-based aggregator shows strong performance, despite the fact that it is designed for sequential data and not unordered sets. Lastly, we see that the performance of unsupervised GraphSAGE is reasonably competitive with the fully supervised version, indicating that our framework can achieve strong performance without task-specific fine-tuning.

4.2 Generalizing across graphs: Protein-protein interactions

We now consider the task of generalizing across graphs, which requires learning about node roles rather than community structure. We classify protein roles, in terms of their cellular functions from gene ontology, in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue [41]. We use positional gene sets, motif gene sets, and immunological signatures as features and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database [34]. The average graph contains 2,373 nodes, with an average degree of 28.8.
We train all algorithms on 20 graphs and then average prediction F1 scores on two test graphs (with two other graphs used for validation). The final two columns of Table 1 summarize the accuracies of the various approaches on this data. Again we see that GraphSAGE significantly outperforms the baseline approaches, with the LSTM- and pooling-based aggregators providing substantial gains over the mean- and GCN-based aggregators.⁶

4.3 Runtime and parameter sensitivity

Figure 2.A summarizes the training and test runtimes for the different approaches. The training times for the methods are comparable (with GraphSAGE-LSTM being the slowest). However, the need to sample new random walks and run new rounds of SGD to embed unseen nodes makes DeepWalk 100-500× slower at test time. For the GraphSAGE variants, we found that setting K = 2 provided a consistent boost in accuracy of around 10-15%, on average, compared to K = 1; however, increasing K beyond 2 gave marginal returns in performance (0-5%) while increasing the runtime by a prohibitively large factor of 10-100×, depending on the neighborhood sample size. We also found diminishing returns for sampling large neighborhoods (Figure 2.B). Thus, despite the higher variance induced by sub-sampling neighborhoods, GraphSAGE is still able to maintain strong predictive accuracy, while significantly improving the runtime.

4.4 Summary comparison between the different aggregator architectures

Overall, we found that the LSTM- and pool-based aggregators performed the best, in terms of both average performance and number of experimental settings where they were the top-performing method (Table 1). To give more quantitative insight into these trends, we consider each of the six different experimental settings (i.e., (3 datasets) × (unsupervised vs. supervised)) as trials and consider what performance trends are likely to generalize.
In particular, we use the non-parametric Wilcoxon Signed-Rank Test [33] to quantify the differences between the different aggregators across trials, reporting the T-statistic and p-value where applicable. Note that this method is rank-based and essentially tests whether we would expect one particular approach to outperform another in a new experimental setting. Given our small sample size of only 6 different settings, this significance test is somewhat underpowered; nonetheless, the T-statistic and associated p-values are useful quantitative measures to assess the aggregators' relative performances. We see that the LSTM-, pool-, and mean-based aggregators all provide statistically significant gains over the GCN-based approach (T = 1.0, p = 0.02 for all three). However, the gains of the LSTM and pool approaches over the mean-based aggregator are more marginal (T = 1.5, p = 0.03, comparing LSTM to mean; T = 4.5, p = 0.10, comparing pool to mean). There is no significant difference between the LSTM and pool approaches (T = 10.0, p = 0.46). However, GraphSAGE-LSTM is significantly slower than GraphSAGE-pool (by a factor of ≈2×), perhaps giving the pooling-based aggregator a slight edge overall.

⁶ Note that in very recent follow-up work, Chen and Zhu [6] achieve superior performance by optimizing the GraphSAGE hyperparameters specifically for the PPI task and implementing new training techniques (e.g., dropout, layer normalization, and a new sampling scheme). We refer the reader to their work for the current state-of-the-art numbers on the PPI dataset that are possible using a variant of the GraphSAGE approach.

5 Theoretical analysis

In this section, we probe the expressive capabilities of GraphSAGE in order to provide insight into how GraphSAGE can learn about graph structure, even though it is inherently based on features.
As a case study, we consider whether GraphSAGE can learn to predict the clustering coefficient of a node, i.e., the proportion of triangles that are closed within the node's 1-hop neighborhood [38]. The clustering coefficient is a popular measure of how clustered a node's local neighborhood is, and it serves as a building block for many more complicated structural motifs [3]. We can show that Algorithm 1 is capable of approximating clustering coefficients to an arbitrary degree of precision:

Theorem 1. Let $x_v \in U, \forall v \in \mathcal{V}$ denote the feature inputs for Algorithm 1 on graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $U$ is any compact subset of $\mathbb{R}^d$. Suppose that there exists a fixed positive constant $C \in \mathbb{R}^+$ such that $\|x_v - x_{v'}\|_2 > C$ for all pairs of nodes. Then we have that $\forall \epsilon > 0$ there exists a parameter setting $\Theta^*$ for Algorithm 1 such that after $K = 4$ iterations $|z_v - c_v| < \epsilon, \forall v \in \mathcal{V}$, where $z_v \in \mathbb{R}$ are final output values generated by Algorithm 1 and $c_v$ are node clustering coefficients.

Theorem 1 states that for any graph there exists a parameter setting for Algorithm 1 such that it can approximate clustering coefficients in that graph to an arbitrary precision, if the features for every node are distinct (and if the model is sufficiently high-dimensional). The full proof of Theorem 1 is in the Appendix. Note that as a corollary of Theorem 1, GraphSAGE can learn about local graph structure, even when the node feature inputs are sampled from an absolutely continuous random distribution (see the Appendix for details). The basic idea behind the proof is that if each node has a unique feature representation, then we can learn to map nodes to indicator vectors and identify node neighborhoods. The proof of Theorem 1 relies on some properties of the pooling aggregator, which also provides insight into why GraphSAGE-pool outperforms the GCN- and mean-based aggregators.
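For concreteness, the quantity being approximated can be computed directly from the graph; a minimal sketch (the adjacency-dict representation is just for illustration):

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v: the fraction of pairs of
    v's neighbors that are themselves connected (Watts and Strogatz [38]).
    `adj` maps each node to a set of its neighbors."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Count each connected neighbor pair once (u < w).
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (d * (d - 1))

# A triangle {0, 1, 2} plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(adj, 0))  # 1 closed pair of 3 -> 0.333...
```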
6 Conclusion

We introduced a novel approach that allows embeddings to be efficiently generated for unseen nodes. GraphSAGE consistently outperforms state-of-the-art baselines, it effectively trades off performance and runtime by sampling node neighborhoods, and our theoretical analysis provides insight into how our approach can learn about local graph structures. A number of extensions and potential improvements are possible, such as extending GraphSAGE to incorporate directed or multi-modal graphs. A particularly interesting direction for future work is exploring non-uniform neighborhood sampling functions, and perhaps even learning these functions as part of the GraphSAGE optimization.

Acknowledgments

The authors thank Austin Benson, Aditya Grover, Bryan He, Dan Jurafsky, Alex Ratner, Marinka Zitnik, and Daniel Selsam for their helpful discussions and comments on early drafts. The authors would also like to thank Ben Johnson for his many useful questions and comments on our code, and Nikhil Mehta and Yuhui Ding for catching some minor errors in a previous version of the appendix. This research has been supported in part by NSF IIS-1149837, DARPA SIMPLEX, the Stanford Data Science Initiative, Huawei, and Chan Zuckerberg Biohub. WLH was also supported by the SAP Stanford Graduate Fellowship and an NSERC PGS-D grant. The views and conclusions expressed in this material are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the above funding agencies, corporations, or the U.S. and Canadian governments.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, 2016.
[2] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
[3] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
[4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
[5] S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural information. In KDD, 2015.
[6] J. Chen and J. Zhu. Stochastic training of graph convolutional networks. arXiv preprint arXiv:1710.10568, 2017.
[7] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
[8] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
[9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
[10] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734, 2005.
[11] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
[12] W. L. Hamilton, J. Leskovec, and D. Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. In ACL, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
[18] T. N. Kipf and M. Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.
[19] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
[20] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.
[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2015.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[23] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[24] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
[25] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[27] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[28] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.
[29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[30] R. Řehůřek and P. Sojka. Software framework for topic modelling with large corpora. In LREC, 2010.
[31] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[33] S. Siegal. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1956.
[34] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.
[35] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, 2015.
[36] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In KDD, 2016.
[37] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving network embedding. In AAAI, 2017.
[38] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[39] L. Xu, X. Wei, J. Cao, and P. S. Yu. Embedding identity and interest for social networks. In WWW, 2017.
[40] Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.
[41] M. Zitnik and J. Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):190–198, 2017.

Appendices

A Minibatch pseudocode

In order to use stochastic gradient descent, we adapt our algorithm to allow forward and backward propagation for minibatches of nodes and edges. Here we focus on the minibatch forward propagation algorithm, analogous to Algorithm 1. In the forward propagation of GraphSAGE, the minibatch B contains the nodes that we want to generate representations for. Algorithm 2 gives the pseudocode for the minibatch approach.
Algorithm 2: GraphSAGE minibatch forward propagation algorithm

Input: graph G(V, E); input features {x_v, ∀v ∈ B}; depth K; weight matrices W^k, ∀k ∈ {1, ..., K}; non-linearity σ; differentiable aggregator functions AGGREGATE_k, ∀k ∈ {1, ..., K}; neighborhood sampling functions N_k : v → 2^V, ∀k ∈ {1, ..., K}
Output: vector representations z_v for all v ∈ B

 1  B^K ← B
 2  for k = K ... 1 do
 3      B^{k-1} ← B^k
 4      for u ∈ B^k do
 5          B^{k-1} ← B^{k-1} ∪ N_k(u)
 6      end
 7  end
 8  h^0_v ← x_v, ∀v ∈ B^0
 9  for k = 1 ... K do
10      for u ∈ B^k do
11          h^k_{N(u)} ← AGGREGATE_k({h^{k-1}_{u'}, ∀u' ∈ N_k(u)})
12          h^k_u ← σ(W^k · CONCAT(h^{k-1}_u, h^k_{N(u)}))
13          h^k_u ← h^k_u / ||h^k_u||_2
14      end
15  end
16  z_u ← h^K_u, ∀u ∈ B

The main idea is to sample all the nodes needed for the computation first. Lines 2-7 of Algorithm 2 correspond to the sampling stage. Each set B^k contains the nodes that are needed to compute the representations of the nodes v ∈ B^{k+1}, i.e., the nodes in the (k+1)-st iteration, or "layer", of Algorithm 1. Lines 9-15 correspond to the aggregation stage, which is almost identical to the batch inference algorithm. Note that in Lines 12 and 13, the representation at iteration k of any node in set B^k can be computed, because its representation at iteration k−1 and the representations of its sampled neighbors at iteration k−1 have already been computed in the previous loop. The algorithm thus avoids computing the representations for nodes that are not in the current minibatch and not used during the current iteration of stochastic gradient descent. We use the notation N_k(u) to denote a deterministic function which specifies a random sample of a node's neighborhood (i.e., the randomness is assumed to be pre-computed in the mappings). We index this function by k to denote the fact that the random samples are independent across the iterations over k.
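The sampling stage (Lines 2-7) can be sketched in Python as follows; all names are illustrative, and uniform sampling with replacement is used when a node's degree is smaller than the per-depth sample size:

```python
import random

def sample_minibatch_sets(batch, neighbors, sample_sizes):
    """Sketch of Lines 2-7 of Algorithm 2: starting from the target
    minibatch B^K, work backwards and collect every node whose
    depth-(k-1) representation will be needed. `neighbors[v]` is v's
    adjacency list; `sample_sizes[k-1]` is the sample size S_k."""
    K = len(sample_sizes)
    B = {K: set(batch)}
    for k in range(K, 0, -1):
        B[k - 1] = set(B[k])
        for u in B[k]:
            n = sample_sizes[k - 1]
            pool = neighbors[u]
            # Uniform sample; fall back to sampling with replacement
            # when the degree is smaller than the sample size.
            sample = (random.sample(pool, n) if len(pool) >= n
                      else random.choices(pool, k=n))
            B[k - 1].update(sample)
    return B

# Toy graph: a path 0-1-2-3, with K = 2 and sample sizes S_1 = S_2 = 2.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
sets = sample_minibatch_sets([0], nbrs, sample_sizes=[2, 2])
print(sets[2], sets[1], sets[0])
```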
We use a uniform sampling function in this work and sample with replacement in cases where the sample size is larger than the node's degree. Note that the sampling process in Algorithm 2 is conceptually reversed compared to the iterations over k in Algorithm 1: we start with the "layer-K" nodes (i.e., the nodes in B) that we want to generate representations for; then we sample their neighbors (i.e., the nodes at "layer-(K−1)" of the algorithm), and so on. One consequence of this is that the definition of the neighborhood sample sizes can be somewhat counterintuitive. In particular, if we use K = 2 total iterations with sample sizes S_1 and S_2, then this means that we sample S_1 nodes during iteration k = 1 of Algorithm 1 and S_2 nodes during iteration k = 2, and, from the perspective of the "target" nodes in B that we want to generate representations for after iteration k = 2, this amounts to sampling S_2 of their immediate neighbors and S_1 · S_2 of their 2-hop neighbors.

B Additional Dataset Details

In this section, we provide some additional, relevant dataset details. The full PPI and Reddit datasets are available at: http://snap.stanford.edu/graphsage/. The Web of Science dataset (WoS) is licensed by Thomson Reuters and can be made available to groups with valid WoS licenses.

Reddit data To sample communities, we ranked communities by their total number of comments in 2014 and selected the communities with ranks [11,50] (inclusive). We omitted the largest communities because they are large, generic default communities that substantially skew the class distribution. We selected the largest connected component of the graph defined over the union of these communities. We performed early validation experiments and model development on data from October and November, 2014. Details on the source of the Reddit data are at: https://archive.org/details/FullRedditSubmissionCorpus2006ThruAugust2015 and https://archive.org/details/2015_reddit_comments_corpus.

WoS data We selected the following subfields manually, based on them being of relatively equal size and all being biology-related fields. We performed early validation and model development on the neuroscience subfield (code=RU), which is excluded from our final set. We did not run any experiments on any other subsets of the WoS data. We took the largest connected component of the graph defined over the union of these fields.

• Immunology (code: NI, number of documents: 77356)
• Ecology (code: GU, number of documents: 37935)
• Biophysics (code: DA, number of documents: 36688)
• Endocrinology and Metabolism (code: IA, number of documents: 52225)
• Cell Biology (code: DR, number of documents: 84231)
• Biology (other) (code: CU, number of documents: 13988)

PPI Tissue Data For training, we randomly selected 20 PPI networks that had at least 15,000 edges. For testing and validation, we selected 4 large networks (2 for validation, 2 for testing, each with at least 35,000 edges). All experiments for model design and development were performed on the same 2 validation networks, and we used the same random training set in all experiments. We selected features that included at least 10% of the proteins that appear in any of the PPI graphs. Note that the feature data is very sparse for this dataset (42% of nodes have no non-zero feature values), which makes leveraging neighborhood information critical.

C Details on the Experimental Setup and Hyperparameter Tuning

Random walks for the unsupervised objective For all settings, we ran 50 random walks of length 5 from each node in order to obtain the pairs needed for the unsupervised loss (Equation 1). Our implementation of the random walks is in pure Python and is based directly on Python code provided by Perozzi et al. [28].
Logistic regression model For the feature-only model and to make predictions on the embeddings output from the unsupervised models, we used the logistic SGDClassifier from the scikit-learn Python package [26], with all default settings. Note that this model is always optimized only on the training nodes, and it is not fine-tuned on the embeddings that are generated for the test data.

Hyperparameter selection In all settings, we performed hyperparameter selection on the learning rate and the model dimension. With the exception of DeepWalk, we performed a parameter sweep on initial learning rates {0.01, 0.001, 0.0001} for the supervised models and {2×10⁻⁶, 2×10⁻⁷, 2×10⁻⁸} for the unsupervised models.⁷ When applicable, we tested a "big" and a "small" version of each model, where we tried to keep the overall model sizes comparable. For the pooling aggregator, the "big" model had a pooling dimension of 1024, while the "small" model had a dimension of 512. For the LSTM aggregator, the "big" model had a hidden dimension of 256, while the "small" model had a hidden dimension of 128; note that the actual parameter count for the LSTM is roughly 4× this number, due to the weights for the different gates. In all experiments and for all models, we specify the output dimension of the h^k_i vectors at every depth k of the recursion to be 256. All models use rectified linear units as the non-linear activation function. All the unsupervised GraphSAGE models and DeepWalk used 20 negative samples with context distribution smoothing over node degrees using a smoothing parameter of 0.75, following [11, 22, 28]. Initial experiments revealed that DeepWalk performed much better with large learning rates, so we swept over rates in the set {0.2, 0.4, 0.8}. For the supervised GraphSAGE methods, we ran 10 epochs for all models. All methods except DeepWalk used batch sizes of 512.
We found that DeepWalk achieved faster wall-clock convergence with a smaller batch size of 64.

Hardware Except for DeepWalk, we ran our experiments on a single machine with 4 NVIDIA Titan X Pascal GPUs (12Gb of RAM, at 10Gbps speed), 16 Intel Xeon CPUs (E5-2623 v4 @ 2.60GHz), and 256Gb of RAM. DeepWalk was faster on a CPU-intensive machine with 144 Intel Xeon CPUs (E7-8890 v3 @ 2.50GHz) and 2Tb of RAM. Overall, our experiments took about 3 days in a shared resource setting. We expect that a consumer-grade single-GPU machine (e.g., with a Titan X GPU) could complete our full set of experiments in 4-7 days, if its full resources were dedicated.

Notes on the DeepWalk implementation Existing DeepWalk implementations [28, 11] are simply wrappers around dedicated word2vec code, and they do not easily support embedding new nodes and other variations. Moreover, this makes it difficult to compare runtimes and other statistics for these approaches. For this reason, we reimplemented DeepWalk in pure TensorFlow, using the vector initializations etc. that are described in the TensorFlow word2vec tutorial.⁸ We found that DeepWalk was much slower to converge than the other methods, and since it is 2-5× faster at training time, we gave it 5 passes over the random walk data, instead of one. To update the DeepWalk method on new data, we ran 50 random walks of length 5 (as described above) and performed updates on the embeddings for the new nodes while holding the already trained embeddings fixed. We also tested two variants, one where we restricted the sampled random walk "context nodes" to only be from the set of already trained nodes (which alleviates statistical drift) and an approach without this restriction. We always selected the better performing variant.
Note that, despite DeepWalk's poor performance on the inductive task, it is far more competitive when tested in the transductive setting, where it can be extensively trained on a single, fixed graph. (That said, Kipf et al. [17][18] found that GCN-based approaches consistently outperformed DeepWalk, even in the transductive setting on link prediction, a task that theoretically favors DeepWalk.) We did observe that DeepWalk's performance could improve with further training, and in some cases it could become competitive with the unsupervised GraphSAGE approaches (but not the supervised approaches) if we let it run for >1000× longer than the other approaches (in terms of wall-clock time for prediction on the test set); however, we did not deem this to be a meaningful comparison for the inductive task. Note that DeepWalk is also equivalent to the node2vec model [11] with p = q = 1.

Notes on neighborhood sampling Due to the heavy-tailed nature of degree distributions, we downsample the edges in all graphs before feeding them into the GraphSAGE algorithm. In particular, we subsample edges so that no node has degree larger than 128. Since we only sample at most 25 neighbors per node, this is a reasonable tradeoff. This downsampling allows us to store neighborhood information as dense adjacency lists, which drastically improves computational efficiency. For the Reddit data, we also downsampled the edges of the original graph as a pre-processing step, since the original graph is extremely dense. All experiments are on the downsampled version, but we release the full version on the project website for reference.

⁷ Note that these values differ from our previously reported pre-print values because they are corrected to account for an extraneous normalization by the batch size. We thank Ben Johnson for pointing out this discrepancy.
⁸ https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py
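The degree-capping pre-processing described above can be sketched as follows; this simplified version trims each adjacency list independently, so a removed edge may survive in the other endpoint's list, which may differ from the paper's exact procedure:

```python
import random

def cap_degrees(neighbors, max_degree=128, seed=0):
    """Subsample each adjacency list so that no node keeps more than
    `max_degree` neighbors, allowing the graph to be stored as dense
    adjacency lists."""
    rng = random.Random(seed)
    return {v: (nbrs if len(nbrs) <= max_degree
                else rng.sample(nbrs, max_degree))
            for v, nbrs in neighbors.items()}

# A hub with 300 neighbors is cut down to 128; small nodes are untouched.
adj = {0: list(range(1, 301)), 1: [0]}
capped = cap_degrees(adj)
print(len(capped[0]), len(capped[1]))  # 128 1
```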
D Alignment Issues and Orthogonal Invariance for DeepWalk and Related Approaches

DeepWalk [28], node2vec [11], and other recent successful node embedding approaches employ objective functions of the form

$$\alpha \sum_{(i,j) \in \mathcal{A}} f(z_i^\top z_j) + \beta \sum_{(i,j) \in \mathcal{B}} g(z_i^\top z_j), \qquad (4)$$

where $f$ and $g$ are smooth, continuous functions, the $z_i$ are the node representations that are being directly optimized (i.e., via embedding look-ups), and $\mathcal{A}, \mathcal{B}$ are sets of pairs of nodes. Note that in many cases, in the actual code implementations used by the authors of these approaches, nodes are associated with two unique embedding vectors, and the arguments to the dot products in $f$ and $g$ are drawn from distinct embedding look-ups (e.g., [11, 28]); however, this does not fundamentally alter the learning algorithm. The majority of approaches also normalize the learned embeddings to unit length, so we assume this post-processing as well.

By connection to word embedding approaches and the arguments of [20], these approaches can also be viewed as stochastic, implicit matrix factorizations where we are trying to learn a matrix $Z \in \mathbb{R}^{|\mathcal{V}| \times d}$ such that

$$ZZ^\top \approx M, \qquad (5)$$

where $M$ is some matrix containing random walk statistics. An important consequence of this structure is that the embeddings can be rotated by an arbitrary orthogonal matrix without impacting the objective:

$$ZQ^\top Q Z^\top = ZZ^\top, \qquad (6)$$

where $Q \in \mathbb{R}^{d \times d}$ is any orthogonal matrix. Since the embeddings are otherwise unconstrained and the only error signal comes from the orthogonally-invariant objective (4), the entire embedding space is free to arbitrarily rotate during training. Two clear consequences of this are:

1. Suppose we run an embedding approach based on (4) on two separate graphs A and B using the same output dimension. Without some explicit penalty enforcing alignment, the learned embedding spaces for the two graphs will be arbitrarily rotated with respect to each other after training.
Thus, for any node classification method that is trained on individual embeddings from graph A, inputting the embeddings from graph B will be essentially random. This fact is also simply true by virtue of the fact that the M matrices of these graphs are completely disjoint. Of course, if we had a way to match "similar" nodes between the graphs, then it could be possible to use an alignment procedure to share information between the graphs, such as the procedure proposed by [12] for aligning the output of word embedding algorithms. Investigating such alignment procedures is an interesting direction for future work; though these approaches will inevitably be slow to run on new data, compared to approaches like GraphSAGE that can simply generate embeddings for new nodes without any additional training or alignment.

2. Suppose that we run an embedding approach based on (4) on graph C at time t and train a classifier on the learned embeddings. Then at time t+1 we add more nodes to C and run a new round of SGD to update all embeddings. Two issues arise: First, by analogy to point 1 above, if the new nodes are only connected to a very small number of the old nodes, then the embedding space for the new nodes can essentially become rotated with respect to the original embedding space. Moreover, if we update all embeddings during training (not just those for the new nodes), as suggested by [28]'s streaming approach to DeepWalk, then the embedding space can arbitrarily rotate compared to the embedding space that we trained our classifier on, which only further exacerbates the problem.

Note that this rotational invariance is not problematic for tasks that only rely on pairwise node distances (e.g., link prediction via dot products).
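The invariance in Equation (6), which makes pairwise dot products safe but cross-graph embedding comparisons meaningless, is easy to verify numerically:

```python
import numpy as np

# Rotating an embedding matrix Z by any orthogonal Q leaves the Gram
# matrix Z Z^T (and hence any objective of the form (4)) unchanged.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 8))                  # 20 nodes, d = 8
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix

Z_rot = Z @ Q.T
assert np.allclose(Z_rot @ Z_rot.T, Z @ Z.T)  # all dot products preserved
print("pairwise dot products are invariant under rotation")
```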
Moreover, some reasonable approaches to alleviate this issue of statistical drift are (1) to not update the already trained embeddings when optimizing the embeddings for new test nodes and (2) to only keep existing nodes as "context nodes" in the sampled random walks, i.e., to ensure that every dot-product in the skip-gram objective is the product of an already-trained node and a new/test node. We tried both of these approaches in this work and always selected the best performing DeepWalk variant.

Also note that, empirically, DeepWalk performs better on the citation data than the Reddit data (Section 4.1) because this statistical drift is worse in the Reddit data, compared to the citation graph. In particular, the Reddit data has fewer edges from the test set back to the train set, which would help prevent misalignment: 96% of the 2005 citation links connect back to the 2000-2004 data, while only 73% of edges in the Reddit test set connect back to the train data.

E Proof of Theorem 1

To prove Theorem 1, we first prove three lemmas:

• Lemma 1 states that there exists a continuous function that is guaranteed to be positive only in closed balls around a fixed number of points, with some noise tolerance.
• Lemma 2 notes that we can approximate the function in Lemma 1 to an arbitrary precision using a multilayer perceptron with a single hidden layer.
• Lemma 3 builds off the preceding two lemmas to prove that the pooling architecture can learn to map nodes to unique indicator vectors, assuming that all the input feature vectors are sufficiently distinct.

We also rely on the fact that the max-pooling operator (with at least one hidden layer) is capable of approximating any Hausdorff continuous, symmetric function to an arbitrary $\epsilon$ precision [29]. We note that all of the following are essentially identifiability arguments.
We show that there exists a parameter setting for which Algorithm 1 can learn nodes' clustering coefficients, which is non-obvious given that it operates by aggregating feature information. The efficient learnability of the functions described is the subject of future work. We also note that these proofs are conservative in the sense that clustering coefficients may in fact be identifiable in fewer iterations, or with fewer restrictions, than we impose. Moreover, due to our reliance on two universal approximation theorems [15, 29], the required dimensionality is in principle $O(|\mathcal{V}|)$. We can provide a more informative bound on the required output dimension of some particular layers (e.g., Lemma 3); however, in the worst case this identifiability argument relies on having a dimension of $O(|\mathcal{V}|)$. It is worth noting, however, that Kipf et al.'s "featureless" GCN approach has parameter dimension $O(|\mathcal{V}|)$, so this requirement is not entirely unreasonable [17, 18].

Following Theorem 1, we let $x_v \in U, \forall v \in \mathcal{V}$ denote the feature inputs for Algorithm 1 on graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $U$ is any compact subset of $\mathbb{R}^d$.

Lemma 1. Let $C \in \mathbb{R}^+$ be a fixed positive constant. Then for any non-empty finite subset of nodes $\mathcal{D} \subseteq \mathcal{V}$, there exists a continuous function $g : U \to \mathbb{R}$ such that

$$g(x) \begin{cases} > \epsilon, & \text{if } \|x - x_v\|_2 = 0 \text{ for some } v \in \mathcal{D} \\ \leq -\epsilon, & \text{if } \|x - x_v\|_2 > C, \; \forall v \in \mathcal{D}, \end{cases} \qquad (7)$$

where $\epsilon < 0.5$ is a chosen error tolerance.

Proof. Many such functions exist. For concreteness, we provide one construction that satisfies these criteria. Let $x \in U$ denote an arbitrary input to $g$, let $d_v = \|x - x_v\|_2, \forall v \in \mathcal{D}$, and let $g$ be defined as $g(x) = \sum_{v \in \mathcal{D}} g_v(x)$ with

$$g_v(x) = \frac{3|\mathcal{D}|\epsilon}{b d_v^2 + 1} - 2\epsilon, \qquad (8)$$

where $b = \frac{3|\mathcal{D}| - 1}{C^2} > 0$. By construction:

1. $g_v$ has a unique maximum of $3|\mathcal{D}|\epsilon - 2\epsilon > 2|\mathcal{D}|\epsilon$ at $d_v = 0$.
2. $\lim_{d_v \to \infty} \left( \frac{3|\mathcal{D}|\epsilon}{b d_v^2 + 1} - 2\epsilon \right) = -2\epsilon$.
3. $\frac{3|\mathcal{D}|\epsilon}{b d_v^2 + 1} - 2\epsilon \leq -\epsilon$ if $d_v \geq C$.
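The three properties above can be checked numerically for concrete constants; the values of $\epsilon$, $C$, and $|\mathcal{D}|$ below are arbitrary illustrations:

```python
# Check of the construction in Equation (8): with b = (3|D| - 1) / C^2,
# g_v peaks at 3|D|eps - 2eps when d_v = 0 and equals -eps when d_v = C.
eps, C, D_size = 0.1, 1.0, 4
b = (3 * D_size - 1) / C ** 2

def g_v(d):
    return 3 * D_size * eps / (b * d ** 2 + 1) - 2 * eps

assert abs(g_v(0.0) - (3 * D_size * eps - 2 * eps)) < 1e-9  # property 1
assert abs(g_v(C) + eps) < 1e-9                             # property 3
assert g_v(100.0) > -2 * eps            # approaches -2*eps from above
print("properties 1-3 hold for these constants")
```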
Note also that $g$ is continuous on its domain ($d_v \in \mathbb{R}^+$) since it is the sum of a finite set of continuous functions. Moreover, we have that, for a given input $x \in U$, if $d_v \geq C$ for all points $v \in \mathcal{D}$, then $g(x) = \sum_{v \in \mathcal{D}} g_v(x) \leq -\epsilon$ by property 3 above. And, if $d_v = 0$ for any $v \in \mathcal{D}$, then $g$ is positive by construction, by properties 1 and 2, since in this case

$$g_v(x) + \sum_{v' \in \mathcal{D} \setminus v} g_{v'}(x) \geq g_v(x) - (|\mathcal{D}| - 1) \cdot 2\epsilon > g_v(x) - 2|\mathcal{D}|\epsilon > 2|\mathcal{D}|\epsilon - 2|\mathcal{D}|\epsilon = 0,$$

so we know that $g$ is positive whenever $d_v = 0$ for any node and negative whenever $d_v > C$ for all nodes.

Lemma 2. The function $g : U \to \mathbb{R}$ can be approximated to an arbitrary degree of precision by a standard multilayer perceptron (MLP) with at least one hidden layer and a non-constant monotonically increasing activation function (e.g., a rectified linear unit). In precise terms, if we let $f_{\theta_\sigma}$ denote this MLP and $\theta_\sigma$ its parameters, we have that $\forall \epsilon, \exists \theta_\sigma$ such that $|f_{\theta_\sigma}(x) - g(x)| < \epsilon, \forall x \in U$.

Proof. This is a direct consequence of Theorem 2 in [15].

Lemma 3. Let $A$ be the adjacency matrix of $\mathcal{G}$, let $N^2(v)$ denote the 2-hop neighborhood of a node $v$, and define $\chi(\mathcal{G}^4)$ as the chromatic number of the graph with adjacency matrix $A^4$ (ignoring self-loops). Suppose that there exists a fixed positive constant $C \in \mathbb{R}^+$ such that $\|x_v - x_{v'}\|_2 > C$ for all pairs of nodes. Then we have that there exists a parameter setting for Algorithm 1, using a pooling aggregator at depth $k = 1$, where this pooling aggregator has $\geq 2$ hidden layers with rectified non-linear units, such that

$$h^1_v \neq h^1_{v'}, \;\; \forall (v, v') \in \{(v, v') : \exists u \in \mathcal{V}, \; v, v' \in N^2(u)\}, \quad h^1_v, h^1_{v'} \in E_I^{\chi(\mathcal{G}^4)},$$

where $E_I^{\chi(\mathcal{G}^4)}$ is the set of one-hot indicator vectors of dimension $\chi(\mathcal{G}^4)$.

Proof.
By the definition of the chromatic number, we know that we can label every node in $V$ using $\chi(G^4)$ unique colors, such that no two nodes that co-occur in any node's 2-hop neighborhood are assigned the same color. Thus, with exactly $\chi(G^4)$ dimensions we can assign a unique one-hot indicator vector to every node, where no two nodes that co-occur in any 2-hop neighborhood have the same vector. In other words, each color defines a subset of nodes $D \subseteq V$, and this subset of nodes can all be mapped to the same indicator vector without introducing conflicts. By Lemmas 1 and 2 and the assumption that $\|x_v - x_{v'}\|_2 > C$ for all pairs of nodes, we can choose an $\epsilon < 0.5$ such that there exists a single-layer MLP, $f_{\theta_\sigma}$, with, for any subset of nodes $D \subseteq V$:

$$f_{\theta_\sigma}(x_v) > 0, \; \forall v \in D, \qquad f_{\theta_\sigma}(x_v) < 0, \; \forall v \in V \setminus D. \quad (9)$$

By making this MLP one layer deeper and specifically using a rectified linear activation function, we can return a positive value only for nodes in the subset $D$ and zero otherwise; and, since we normalize after applying the aggregator layer, this single positive value can be mapped to an indicator vector. Moreover, we can create $\chi(G^4)$ such MLPs, where each MLP corresponds to a different color/subset; equivalently, each MLP corresponds to a different max-pooling dimension in Equation 3 of the main text. $\square$

We now restate Theorem 1 and provide a proof.

Theorem 1. Let $x_v \in U, \forall v \in V$ denote the feature inputs for Algorithm 1 on graph $G = (V, E)$, where $U$ is any compact subset of $\mathbb{R}^d$. Suppose that there exists a fixed positive constant $C \in \mathbb{R}^+$ such that $\|x_v - x_{v'}\|_2 > C$ for all pairs of nodes. Then we have that $\forall \epsilon > 0$ there exists a parameter setting $\Theta^*$ for Algorithm 1 such that after $K = 4$ iterations, $|z_v - c_v| < \epsilon, \forall v \in V$, where $z_v \in \mathbb{R}$ are the final output values generated by Algorithm 1 and $c_v$ are node clustering coefficients, as defined in [38].

Proof.
Without loss of generality, we describe how to compute the clustering coefficient for an arbitrary node $v$. For notational convenience we use $\oplus$ to denote vector concatenation and $d_v$ to denote the degree of node $v$. This proof requires 4 iterations of Algorithm 1, where we use the pooling aggregator at all depths. For clarity, we ignore issues related to vector normalization, and we use the fact that the pooling aggregator can approximate any Hausdorff continuous symmetric function to an arbitrary $\epsilon$ precision [29]. Note that we can always account for normalization constants (line 7 in Algorithm 1) by having the aggregators prepend a unit value to all output representations; the normalization constant can then be recovered at later layers by taking the inverse of this prepended value. Note also that there almost certainly exist settings where the symmetric functions described below can be computed exactly by the pooling aggregator (or a variant of it), but the symmetric universal approximation theorem of [29], along with Lipschitz continuity arguments, suffices for the purposes of proving identifiability of clustering coefficients (up to an arbitrary precision). In particular, the functions described below that we need to approximate in order to compute clustering coefficients are all Lipschitz continuous on their domains (assuming we only run on nodes with positive degree), so the errors introduced by approximation remain bounded by fixed constants (that can be made arbitrarily small). We assume that the weight matrices $W^1, W^2$ at depths $k = 2$ and $k = 3$ are the identity, and that all non-linearities are rectified linear units. In addition, for the final iteration (i.e., $k = 4$) we completely ignore neighborhood information and simply treat this layer as an MLP with a single hidden layer; Theorem 1 can thus be equivalently stated as requiring $K = 3$ iterations of Algorithm 1, with the representations then being fed to a single-layer MLP.
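As an aside, the unit-prepending trick for normalization constants described above can be illustrated concretely. The following sketch is our own minimal illustration (the function names are ours, not from the paper):

```python
import numpy as np

# Minimal illustration of the normalization trick: prepend a unit value
# before L2-normalizing, so that the normalization constant can later be
# undone by dividing by the prepended coordinate.
def normalize_with_unit(h):
    augmented = np.concatenate(([1.0], h))  # prepend the unit value
    return augmented / np.linalg.norm(augmented)

def recover(h_norm):
    # The first coordinate equals 1/||(1, h)||_2, so dividing by it
    # inverts the normalization and returns the original representation.
    return h_norm[1:] / h_norm[0]

h = np.array([3.0, 4.0])
assert np.allclose(recover(normalize_with_unit(h)), h)
```

This is why normalization can be safely ignored in the argument: any later layer can recover the unnormalized representation from the prepended coordinate.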
By Lemma 3, we can assume that at depth $k = 1$ all nodes in $v$'s 2-hop neighborhood have unique one-hot indicator vectors, $h^1_v \in E_I$. Thus, at depth $k = 2$ in Algorithm 1, suppose that we sum the unnormalized representations of the neighboring nodes. Then, without loss of generality, we will have that $h^2_v = h^1_v \oplus A_v$, where $A$ is the adjacency matrix of the subgraph containing all nodes connected to $v$ in $G^4$ and $A_v$ is the row of the adjacency matrix corresponding to $v$. Then, at depth $k = 3$, again assuming that we sum the neighboring representations (with the weight matrices as the identity), we will have that

$$h^3_v = h^1_v \oplus A_v \oplus \left( \sum_{v' \in N(v)} h^1_{v'} \oplus A_{v'} \right). \quad (10)$$

Letting $m$ denote the dimensionality of the $h^1_v$ vectors (i.e., $m \equiv \chi(G^4)$ from Lemma 3) and using square brackets to denote vector indexing, we can observe that:

• $a \equiv h^3_v[0 : m]$ is $v$'s one-hot indicator vector.
• $b \equiv h^3_v[m : 2m]$ is $v$'s row in the adjacency matrix, $A$.
• $c \equiv h^3_v[3m : 4m]$ is the sum of the adjacency rows of $v$'s neighbors.

Thus, we have that $b^\top c$ is the number of edges in the subgraph containing only $v$ and its immediate neighbors, and $\sum_i b[i] = d_v$. Finally, we can compute

$$\frac{2(b^\top c - d_v)}{d_v(d_v - 1)} = \frac{2\,|\{e_{v',v''} : v', v'' \in N(v), \; e_{v',v''} \in E\}|}{d_v(d_v - 1)} \quad (11)$$
$$= c_v, \quad (12)$$

and since this is a continuous function of $h^3_v$, we can approximate it to an arbitrary $\epsilon$ precision with a single-layer MLP (or, equivalently, one more iteration of Algorithm 1, ignoring neighborhood information). Again, this last step follows directly from [15]. $\square$

Figure 3: Accuracy (in F1-score) for different approaches on the citation data as the feature matrix is incrementally replaced with random Gaussian noise.

Corollary 2. Suppose we sample node features from any probability distribution $\mu$ over $x \in U$, where $\mu$ is absolutely continuous with respect to the Lebesgue measure.
Then the conditions of Theorem 1 are almost surely satisfied with feature inputs $x_v \sim \mu$.

Corollary 2 is a direct consequence of Theorem 1 and the fact that, for any probability distribution that is absolutely continuous w.r.t. the Lebesgue measure, the probability of sampling two identical points is zero. Empirically, we found that GraphSAGE-pool was in fact capable of maintaining modest performance by leveraging graph structure, even with completely random feature inputs (see Figure 3). However, the performance of GraphSAGE-GCN was not so robust, which makes intuitive sense given that Lemmas 1, 2, and 3 rely directly on the universal expressive capability of the pooling aggregator.

Finally, we note that Theorem 1 and Corollary 2 are expressed with respect to a particular given graph and are thus somewhat transductive. For the inductive setting, we can state:

Corollary 3. Suppose that for all graphs $G = (V, E)$ belonging to some class of graphs $G^*$, we have that $\exists k, d \geq 0$, $k, d \in \mathbb{Z}$, such that

$$h^k_v \neq h^k_{v'}, \quad \forall (v, v') \in \{(v, v') : \exists u \in V, \; v, v' \in N^3(u)\}, \quad h^k_v, h^k_{v'} \in E^d_I;$$

then we can approximate clustering coefficients to an arbitrary $\epsilon$ precision after $K = k + 4$ iterations of Algorithm 1.

Corollary 3 simply states that, if after $k$ iterations of Algorithm 1 we can learn to uniquely identify nodes for a class of graphs, then we can also approximate clustering coefficients to an arbitrary precision for this class of graphs.
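For reference, the target quantity throughout this section is the standard local clustering coefficient of [38]. The sketch below is our own illustration of computing it directly from an adjacency matrix (the example graph is arbitrary):

```python
import numpy as np

# Direct computation of the local clustering coefficient c_v, the
# quantity identified by Theorem 1:
# c_v = 2 * |{edges among v's neighbors}| / (d_v * (d_v - 1)).
def clustering_coefficient(A, v):
    neighbors = np.flatnonzero(A[v])
    d_v = len(neighbors)
    if d_v < 2:
        return 0.0  # undefined for degree < 2; conventionally 0
    # Each edge among the neighbors is counted twice in the submatrix sum.
    edges = A[np.ix_(neighbors, neighbors)].sum() / 2
    return 2.0 * edges / (d_v * (d_v - 1))

# Example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
```

Here node 0's two neighbors are themselves connected, giving $c_0 = 1$, while node 2 has three neighbors with only one edge among them, giving $c_2 = 1/3$.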
