EdgeNets: Edge Varying Graph Neural Networks


Authors: Elvin Isufi, Fernando Gama, Alejandro Ribeiro

Elvin Isufi, Fernando Gama and Alejandro Ribeiro

Abstract—Driven by the outstanding performance of neural networks in the structured Euclidean domain, recent years have seen a surge of interest in developing neural networks for graphs and data supported on graphs. The graph is leveraged at each layer of the neural network as a parameterization to capture detail at the node level with a reduced number of parameters and computational complexity. Following this rationale, this paper puts forth a general framework that unifies state-of-the-art graph neural networks (GNNs) through the concept of EdgeNet. An EdgeNet is a GNN architecture that allows different nodes to use different parameters to weigh the information of different neighbors. By extrapolating this strategy to more iterations between neighboring nodes, the EdgeNet learns edge- and neighbor-dependent weights to capture local detail. This is a general linear and local operation that a node can perform and encompasses under one formulation all existing graph convolutional neural networks (GCNNs) as well as graph attention networks (GATs). In writing different GNN architectures with a common language, EdgeNets highlight specific architecture advantages and limitations, while providing guidelines to improve their capacity without compromising their local implementation. For instance, we show that GCNNs have a parameter sharing structure that induces permutation equivariance. This can be an advantage or a limitation, depending on the application. In cases where it is a limitation, we propose hybrid approaches and provide insights to develop several other solutions that promote parameter sharing without enforcing permutation equivariance. Another interesting conclusion is the unification of GCNNs and GATs, approaches that have so far been perceived as separate.
In particular, we show that GATs are GCNNs on a graph that is learned from the features. This particularization opens the doors to developing alternative attention mechanisms for improving discriminatory power.

Index Terms—Edge varying, graph neural networks, graph signal processing, graph filters, learning on graphs.

1 INTRODUCTION

Data generated by networks is increasingly common. Examples include user preferences in recommendation systems, writer proclivities in blog networks [2], or properties of assembled molecular compounds [3]. Different from data encountered in the structured temporal or spatial domains, network data lives in high-dimensional irregular spaces. This fact makes it difficult to extend tools that exploit the regularity of time and space, leading to a rising interest in novel techniques for dealing with network data [4]. Since graphs are the prominent mathematical tool to model individual node properties (product ratings, writer bias, or molecule properties) along with node dependencies (user similarities, blog hyperlinks, or molecular bonds), the interest in network data has translated into a concomitant increase in the interest in tools for processing graphs and data supported on graphs [5]. Several recent works have proposed graph neural networks (GNNs) as a means of translating to graphs the success convolutional and recurrent neural networks have attained at learning on time and space [6]–[23]. GNNs are first concretized in [6], [7] by means of recursive neighboring label aggregations combined with pointwise nonlinearities.

• Isufi is with the Intelligent Systems Department, Delft University of Technology, Delft, The Netherlands. Gama is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94709. Ribeiro is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104.
Isufi conducted this research during his postdoctoral period at the Department of Electrical and Systems Engineering, University of Pennsylvania. This work is supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS and Intel DevCloud. E-mails: e.isufi-1@tudelft.nl, fgama@berkeley.edu, aribeiro@seas.upenn.edu. Part of this work has been presented in [1].

The convolutional GNN counterpart appears in [9], where graph convolutions are defined as pointwise operators in the Laplacian's spectrum. To avoid the cost and numerical instability of spectral decompositions, [10] approximates this spectral convolution with a Chebyshev polynomial on the Laplacian matrix. Parallel to these efforts, the field of graph signal processing has developed notions of graph convolutional filters as polynomials on a matrix representation of a graph [24]–[30]. This has led to GNNs described as architectures that simply replace time convolutions with graph convolutions [11], [12]. A third approach to define GNNs is to focus on the locality of convolutions by replacing the adjacency of points in time with the adjacency of neighbors in a graph; something that can be accomplished by mixing nodes' features with their neighbors' features [13], [14]. Despite their different motivations, spectral GNNs [9], [10], polynomial GNNs [11], [12], and local GNNs [13], [14] can all be seen to be equivalent to each other (Section 4). In particular, they all share the reuse of parameters across all neighborhoods of a graph as well as indifference towards the values of different neighbors; see also [31]. This is an important limitation that is tackled, e.g., by the graph attention networks (GATs) of [19]–[23] through the use of attention mechanisms [32], [33].
In this paper, we leverage edge varying graph filters [34] to provide a generic framework for the design of GNNs that can afford the flexibility to use different parameters at different nodes as well as different weighings for different neighbors of a node (Section 3). Edge varying filters are linear finite order recursions that allow individual nodes to introduce weights that are specific to the node, specific to each neighbor, and specific to the recursion index. In this way, the edge varying recursion represents a general linear operation that a node can implement locally, i.e., a general operation that relies on information exchanges only with neighbor nodes (Section 2).

As an alternative to the EdgeNet, the graph network [35] is a popular framework for generalizing GNNs. Graph networks consist of general update and aggregation functions over nodal, edge, and entire-graph features. This unification is slightly more general than the message passing neural network [36] and considers updates to be principally affected by information exchange only with the one-hop neighbors. While providing relevant insights on the local detail of order-one filter GNNs such as [13], [19], this strategy does not put emphasis on the role of the filter within the GNN or how the parameters of such a filter are allocated to the different multi-hop neighbors [10], [11]. Instead, the EdgeNet framework focuses only on nodal feature aggregations to highlight the role of multi-hop exchanges within a layer and to put emphasis on how different solutions operate from a node perspective. At the same time, the EdgeNet framework allows for a filter spectral analysis [37], which provides a better understanding of the type of filters that conform the learned filter bank. This spectral perspective of GNNs will shed light, for instance, on the advantages of ARMA filters [30] in learning sharper responses with fewer parameters than finite impulse response graph filters [25], [26], [29].
In its most general form, the edge varying GNN allocates different parameters to the different edges, which is of the order of the number of nodes and edges of the graph. While allocating different parameters over the edges can help exploit the graph structure better for the learning task at hand, edge-specific parameters sacrifice the inductive capabilities across different graphs. To reduce the complexity of this parameterization, we can regularize EdgeNets in different ways by imposing restrictions on the freedom to choose different parameters at different nodes. We explain that existing GNN architectures are particular cases of EdgeNets associated with different parameter restrictions. In turn, this shows how these solutions sacrifice degrees of freedom to gain in parameter sharing and inductive capabilities. We further utilize the insight of edge varying recursions to propose novel GNN architectures. In consequence, the novel contributions of this paper are:

(i) We define EdgeNets, which parameterize the linear operation of neural networks through a bank of edge varying recursions. EdgeNets are a generic framework to design GNN architectures (Section 3).

(ii) We show the approaches in [9]–[18] (among others) are EdgeNets where all nodes share the parameters. We extend the representation power of these networks by adding some level of variability in weighing different nodes and different edges of a node (Section 4).

(iii) Replacing finite length polynomials by rational functions provides an alternative parameterization of convolutional GNNs in terms of autoregressive moving average (ARMA) convolutional graph filters [30]. These ARMA GNNs generalize rational functions based on Cayley polynomials [15] (Section 4.3).

(iv) We show that GATs can be understood as GNNs with convolutional graph filters where a graph is learned ad hoc in each layer to represent the required abstraction between nodes.
The weights of this graph choose neighbors whose values should most influence the computations at a particular node. This reinterpretation allows for the proposal of more generic GATs with higher expressive power (Section 5).

This paper has the following structure. Section 2 reviews edge varying recursions on graphs and Section 3 introduces edge varying GNNs. To ease exposition, the GNNs are grouped into two categories: convolutional in Section 4 and attentional in Section 5. Within each category, we follow the same rationale. First, we discuss the state-of-the-art solutions as a particular case of the EdgeNet framework. Then, we discuss their architectural advantages and limitations. Finally, we leverage the EdgeNet viewpoint to propose new solutions that address some of these limitations and highlight the corresponding tradeoffs. Section 6 evaluates these solutions with numerical results and Section 7 concludes the paper.

2 EDGE VARYING LINEAR GRAPH FILTERS

Consider a weighted graph G with vertex set V = {1, ..., N}, edge set E ⊆ V × V composed of |E| = M ordered pairs (i, j), and weight function W : E → R. For each node i, define the neighborhood N_i = {j : (j, i) ∈ E} as the set of nodes connected to i and let N_i := |N_i| denote the number of elements (neighbors) in this set. Associated with G is a graph shift operator matrix S ∈ R^{N×N} whose sparsity pattern matches that of the edge set, i.e., entry S_ij ≠ 0 when (j, i) ∈ E or when i = j. Supported on the vertex set are graph signals x = [x_1, ..., x_N]^T ∈ R^N in which component x_i is associated with node i ∈ V. The adjacency of points in time signals or the adjacency of points in images codifies a sparse and local relationship between signal components. This sparsity and locality are leveraged by time or space filters.
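To make these definitions concrete, the following NumPy sketch (an illustration added here, not code from the paper) builds a small directed graph, its shift operator S, a neighborhood, and a graph signal; the edge weights are arbitrary assumptions:

```python
import numpy as np

# A small weighted, directed graph on N = 4 nodes.
# An edge (j, i) means information flows from node j to node i.
N = 4
edges = {(0, 1): 0.5, (1, 2): 1.0, (2, 3): 0.8, (3, 0): 0.3, (0, 2): 0.7}
M = len(edges)

# Shift operator S: entry S[i, j] != 0 when (j, i) is an edge.
S = np.zeros((N, N))
for (j, i), w in edges.items():
    S[i, j] = w

# Neighborhood of node 2: all j such that (j, 2) is an edge.
N_2 = {j for (j, i) in edges if i == 2}

# A graph signal: one scalar x_i per node.
x = np.array([1.0, -2.0, 0.5, 3.0])

# One application of the shift operator mixes each node's value
# with its neighbors' values: (Sx)_i = sum_j S[i, j] x_j,
# a local operation that respects the graph sparsity.
Sx = S @ x
```

Here node 2 aggregates only from its neighbors 0 and 1, which is exactly the locality the shift operator encodes.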
Similarly, S captures the sparsity and locality of the relationship between components of a signal x supported on G. It is then natural to take the shift operator as the basis for defining filters for graph signals. In this spirit, let Φ^(0) be an N × N diagonal matrix and Φ^(1), ..., Φ^(K) be a collection of K matrices sharing the sparsity pattern of I_N + S. Consider then the sequence of signals z^(k) as

$$ \mathbf{z}^{(k)} = \prod_{k'=0}^{k} \boldsymbol{\Phi}^{(k')} \mathbf{x} = \boldsymbol{\Phi}^{(k:0)} \mathbf{x}, \quad \text{for } k = 0, \ldots, K \qquad (1) $$

where the product matrix Φ^(k:0) := ∏_{k'=0}^{k} Φ^(k') = Φ^(k) ··· Φ^(0) is defined for future reference. Signal z^(k) can be computed using the recursion

$$ \mathbf{z}^{(k)} = \boldsymbol{\Phi}^{(k)} \mathbf{z}^{(k-1)}, \quad \text{for } k = 0, \ldots, K \qquad (2) $$

with initialization z^(−1) = x. This recursive expression implies signal z^(k) is produced from z^(k−1) using operations that are local in the graph. Indeed, since Φ^(k) shares the sparsity pattern of S, node i computes its component z_i^(k) as

$$ z_i^{(k)} = \sum_{j \in \mathcal{N}_i \cup \{i\}} \Phi_{ij}^{(k)} z_j^{(k-1)}. \qquad (3) $$

Particularizing (3) to k = 0, it follows each node i builds the i-th entry of z^(0) as a scaled version of its signal x_i by the diagonal matrix Φ^(0), i.e., z_i^(0) = Φ_ii^(0) x_i. Particularizing to k = 1, (3) yields that the components of z^(1) depend on the values of the signal x at most at neighboring nodes.

Fig. 1. Edge Varying Graph Filters. Each edge varying matrix Φ^(k) acts as a different shift operator that locally combines the graph signal. (Top-left) The colored discs are centered at five reference nodes and their coverage shows the amount of local information needed to compute z^(1) = Φ^(1:0)x at these nodes. The coverage of the discs in the other graphs shows the signal information needed by the reference nodes to produce the successive outputs. (Bottom) Schematic illustration of the edge varying filter output of order K = 3.
Particularizing to k = 2, (3) shows the components of z^(2) depend only on the values of signal z^(1) at neighboring nodes which, in turn, depend only on the values of x at their neighbors. Thus, the components of z^(2) are a function of the values of x at most at the respective two-hop neighbors. Repeating this argument iteratively, z_i^(k) represents an aggregation of information at node i coming from its k-hop neighborhood; see Figure 1. The collection of signals z^(k) behaves like a sequence of scaled shift operations, except that instead of shifting the signal in time, the signal is diffused through the graph (the signal values are shifted between neighboring nodes). Leveraging this interpretation, the graph filter output u is defined as the sum

$$ \mathbf{u} = \sum_{k=0}^{K} \mathbf{z}^{(k)} = \sum_{k=0}^{K} \boldsymbol{\Phi}^{(k:0)} \mathbf{x}. \qquad (4) $$

A filter output in time is a sum of scaled and shifted copies of the input signal. That (4) behaves as a filter follows from interpreting Φ^(k:0) as a scaled shift, which holds because of its locality. Each shift Φ^(k:0) is a recursive composition of individual shifts Φ^(k). These individual shifts represent different operators that respect the structure of G while reweighing individual edges differently when needed. For future reference, define the filter matrix A(S) so that (4) rewrites as u = A(S)x. For this to hold, the filter matrix must be

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \boldsymbol{\Phi}^{(k:0)} = \sum_{k=0}^{K} \left( \prod_{k'=0}^{k} \boldsymbol{\Phi}^{(k')} \right). \qquad (5) $$

Following [34], A(S) is a K-th order edge varying graph filter. Each matrix Φ^(k) contains at most M + N nonzero elements corresponding to the nonzero entries of I_N + S; thus, the total number of parameters defining filter A(S) in (5) is K(M + N) + N. For short filters, this is smaller than the N² components of an arbitrary linear transform. Likewise, computing z^(k) = Φ^(k) z^(k−1) as per (2) incurs a computational complexity of order O(M + N).
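The recursion (2) and the output (4) translate into a few lines of NumPy. The sketch below is an illustration under assumed random data, not the authors' implementation; it also checks that when every Φ^(k) equals S (and Φ^(0) = I), the edge varying filter collapses to the polynomial filter Σ_k S^k x:

```python
import numpy as np

def edge_varying_filter(Phi, x):
    """Apply the edge varying filter (4): u = sum_k Phi^(k:0) x.

    Phi is the list [Phi^(0), ..., Phi^(K)]; Phi^(0) is diagonal and
    every Phi^(k) shares the sparsity of I + S, so each step of the
    recursion (2) only exchanges values between neighboring nodes.
    """
    z = x.copy()                 # z^(-1) = x
    u = np.zeros_like(x)
    for Phi_k in Phi:            # z^(k) = Phi^(k) z^(k-1), cf. (2)
        z = Phi_k @ z
        u += z                   # accumulate the sum in (4)
    return u

# Sanity check on a random sparse shift operator: with Phi^(0) = I
# and Phi^(k) = S for k >= 1, the filter reduces to sum_k S^k x.
rng = np.random.default_rng(0)
N, K = 6, 3
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.4)
x = rng.normal(size=N)

Phi = [np.eye(N)] + [S] * K
u = edge_varying_filter(Phi, x)
u_poly = sum(np.linalg.matrix_power(S, k) @ x for k in range(K + 1))
```

Each loop iteration costs one sparse matrix-vector product, which is the O(M + N) per-step complexity stated above.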
This further results in an overall computational complexity of order O(K(M + N)) for obtaining the filter output u in (4). This reduced number of parameters and computational cost is leveraged next to define graph neural network (GNN) architectures with a controlled number of parameters and computational complexity matched to the graph sparsity.

Remark 1. The presence of the edge (j, i) in graph G is interpreted here as signal components x_j and x_i being related by the given structure in the data. The shift operator entry S_ij is a measure of the expected similarity. Larger entries indicate linked signal components are more related to each other. Therefore, the definition of the shift operator S makes it a valid stand-in for any graph representation matrix. Forthcoming discussions are valid whether S is an adjacency or a Laplacian matrix in any of their various normalized and unnormalized forms. We use S to keep discussions generic.

3 EDGE VARYING GRAPH NEURAL NETWORKS

Edge varying graph filters are the basis for defining GNN architectures through composition with pointwise nonlinear functions. Formally, consider a set of L layers indexed by l = 1, ..., L and let A_l(S) = Σ_{k=0}^{K} Φ_l^(k:0) be the graph filter used at layer l. A GNN is defined by the recursive expression

$$ \mathbf{x}_l = \sigma\big( \mathbf{A}_l(\mathbf{S}) \mathbf{x}_{l-1} \big) = \sigma\left( \sum_{k=0}^{K} \boldsymbol{\Phi}_l^{(k:0)} \mathbf{x}_{l-1} \right) \qquad (6) $$

where we convene that x_0 = x is the input to the GNN and x_L is its output. To augment the representation power of GNNs, it is customary to add multiple node features per layer. We do this by defining matrices X_l = [x_l^1, ..., x_l^{F_l}] ∈ R^{N×F_l} in which each column x_l^f represents a different graph signal at layer l. These so-called features are cascaded through layers where they are processed with edge varying graph filters and composed with pointwise nonlinearities according to

$$ \mathbf{X}_l = \sigma\left( \sum_{k=0}^{K} \boldsymbol{\Phi}_l^{(k:0)} \mathbf{X}_{l-1} \mathbf{A}_{lk} \right) \qquad (7) $$

where A_lk ∈ R^{F_{l−1}×F_l} is a parameter matrix that affords the flexibility to process different features with different filter parameters. It is ready to see that (7) represents a bank of edge varying graph filters A_l^{fg}(S) applied to a set of F_{l−1} input features x_{l−1}^g to produce a set of F_l output features x_l^f. Indeed, if we let a_lk^{fg} = [A_lk]_{fg} denote the (f, g)-th entry of A_lk, (7) produces a total of F_{l−1}F_l intermediate features of the form¹

$$ \mathbf{u}_l^{fg} = \mathbf{A}_l^{fg}(\mathbf{S}) \mathbf{x}_{l-1}^{g} = \sum_{k=0}^{K} a_{lk}^{fg} \boldsymbol{\Phi}_l^{fg,(k:0)} \mathbf{x}_{l-1}^{g} \qquad (8) $$

for g = 1, ..., F_{l−1} and f = 1, ..., F_l. The features u_l^{fg} are then aggregated across all g and passed through a pointwise nonlinearity to produce the output features of layer l as

$$ \mathbf{x}_l^{f} = \sigma\left( \sum_{g=1}^{F_{l-1}} \mathbf{u}_l^{fg} \right). \qquad (9) $$

At layer l = 1 the input feature is a graph signal x_0^1 = x. This feature is passed through F_1 filters to produce F_1 higher-level features as per (8). The latter are then processed by a pointwise nonlinearity [cf. (9)] to produce F_1 output features x_1^f. The subsequent layers l > 1 start with F_{l−1} input features x_{l−1}^g that are passed through the filter bank A_l^{fg}(S) [cf. (8)] to produce the higher-level features u_l^{fg}. These are aggregated across all g = 1, ..., F_{l−1} and passed through a nonlinearity to produce the layer's output features x_l^f [cf. (9)]. In the last layer l = L, we consider without loss of generality that the number of output features is F_L = 1. This single feature x_L^1 = x_L is the output of the edge varying GNN or, for short, EdgeNet. Remark that the EdgeNet aggregates at each layer information from neighbors that are up to K hops away [cf. (7)]. This increases its flexibility to process intermediate features and generalizes the masking aggregation rule in [38], which can be seen as an EdgeNet of order K = 1.

The EdgeNet output is a function of the input signal x and the collection of filter banks A_l^{fg} [cf. (5)]. Group the filters in the filter tensor A(S) = {A_l^{fg}(S)}_{lfg} so as to define the GNN output as the mapping

$$ \Psi\big( \mathbf{x}; \mathcal{A}(\mathbf{S}) \big) := \mathbf{x}_L \quad \text{with} \quad \mathcal{A}(\mathbf{S}) = \big\{ \mathbf{A}_l^{fg}(\mathbf{S}) \big\}_{lfg}. \qquad (10) $$

The filter parameters are trained to minimize a loss over a training set of input-output pairs T = {(x, y)}. This loss measures the difference between the EdgeNet output x_L and the true value y averaged over the examples (x, y) ∈ T. As it follows from (5), the number of parameters in each filter is K(M + N) + N. This gets scaled by the number of filters per layer F_{l−1}F_l and the number of layers L. To provide an order bound on the number of parameters defining the EdgeNet, set the maximum feature number F = max_l F_l and observe the number of parameters per layer is of order (K(M + N) + N)F² + F². Likewise, the computational complexity at each layer is of order O(K(M + N)F²). This number of parameters and computational complexity are expected to be smaller than the corresponding numbers of a fully connected neural network. This is a consequence of exploiting the sparse nature of edge varying filters [cf. (4) and (5)]. A GNN can then be considered as an architecture that exploits the graph structure to reduce the number of parameters of a fully connected neural network. The implicit hypothesis is that signal components associated with different nodes are processed together in accordance with the nodes' proximity in the graph.

¹Throughout the paper, we will denote any graph filter by A_l^{fg}(S) to indicate that it is a matrix depending on the graph shift operator S. When this filter contains additional parameters beyond those on the edges (i.e., Φ_l^{fg,(k:0)}), we will indicate them with scalars a_lk^{fg} [cf. (8)]. For consistency, when expressing the bank of filters in a single recursion [cf. (7)], we will group the parameters a_lk^{fg} into the matrix A_lk.
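A single EdgeNet layer as in (7) can be sketched as follows. This is an illustrative NumPy rendition with assumed sizes and random parameters, not the authors' code; it builds the products Φ^(k:0)X recursively and mixes features with the matrices A_lk:

```python
import numpy as np

def edgenet_layer(Phi_l, A_l, X_in, sigma=np.tanh):
    """One EdgeNet layer, cf. (7): X_l = sigma(sum_k Phi^(k:0) X_{l-1} A_lk).

    Phi_l : list of K+1 edge varying shift matrices [Phi^(0), ..., Phi^(K)]
    A_l   : list of K+1 feature-mixing matrices A_lk, each F_in x F_out
    X_in  : N x F_in matrix of input features
    """
    Z = X_in                                        # Phi^(k:0) X, k = -1
    out = np.zeros((X_in.shape[0], A_l[0].shape[1]))
    for Phi_k, A_k in zip(Phi_l, A_l):
        Z = Phi_k @ Z                               # build Phi^(k:0) X recursively
        out += Z @ A_k                              # mix features with A_lk
    return sigma(out)

# Hypothetical sizes: N = 5 nodes, F_in = 2 -> F_out = 3 features, order K = 2.
rng = np.random.default_rng(1)
N, K, F_in, F_out = 5, 2, 2, 3
S = (rng.random((N, N)) < 0.5) * rng.normal(size=(N, N))
mask = (S != 0).astype(float) + np.eye(N)           # sparsity pattern of I + S

Phi_l = [np.diag(rng.normal(size=N))]               # Phi^(0): diagonal
Phi_l += [mask * rng.normal(size=(N, N)) for _ in range(K)]
A_l = [rng.normal(size=(F_in, F_out)) for _ in range(K + 1)]

X = rng.normal(size=(N, F_in))
X_out = edgenet_layer(Phi_l, A_l, X)
```

Stacking L such layers and training the Φ and A matrices against a loss over the pairs (x, y) gives the full EdgeNet map Ψ in (10).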
We will show different existing GNN architectures are particular cases of (8)-(9) using different subclasses of edge varying graph filters (Section 4), and the same is true for graph attention networks (Section 5). Establishing these relationships allows the proposal of natural architectural generalizations that increase the descriptive power of GNNs while still retaining manageable complexity.

Remark 2. The key property of the EdgeNet is to allocate trainable parameters for each edge in each shift. While this formulation improves the expressive power of a GNN, it affects its inductive capability over graphs (i.e., the ability to generalize to new unseen graphs) [39]. In the full form (7), the EdgeNet does not have inductive capabilities for graphs but only for graph signals. That is, it cannot be applied to test cases where new unseen graphs are present but can only be applied to test cases where new unseen graph signals are present for a fixed graph support. We shall see in the next section that graph convolutional neural networks instead sacrifice degrees of freedom to gain inductive capabilities also for graphs.

Remark 3. In the proposed EdgeNet, we considered graphs with single edge features, i.e., each edge is described by a single scalar. However, even when the graph has multiple edge features, say E, the EdgeNet extends readily to this scenario. This can be obtained by seeing the multi-edge featured graph as the union of E graphs G_e = (V, E_e) with identical node set V and respective shift operator matrix S_e. For {Φ_e^(k)} being the collection of the edge varying parameter matrices [cf. (1)] relative to the shift operator S_e, the l-th layer output X_l [cf. (7)] becomes

$$ \mathbf{X}_l = \sigma\left( \sum_{e=1}^{E} \sum_{k=0}^{K} \boldsymbol{\Phi}_{e,l}^{(k:0)} \mathbf{X}_{l-1} \mathbf{A}_{lk}^{e} \right). \qquad (11) $$

That is, the outputs of each filter are aggregated also over the edge-feature dimension. The number of parameters and computational complexity get scaled by E. The GNN architectures discussed in the remainder of this manuscript, as special cases of the EdgeNet, are readily extendable to the multi-edge feature scenario by replacing (7) with (11). The approach in [40] is the particular case of (11) with K = 1 and the parameter matrix reduced to a scalar.

Fig. 2. Permutation equivariance of machine learning on graphs. Many tasks in machine learning on graphs are equivariant to permutations (cf. Proposition 1) but not all are. E.g., we expect agents 3, 5, 8, and 12 to be interchangeable from the perspective of predicting product ratings from the ratings of other nodes. But from the perspective of community classification we expect 3 and 5 or 8 and 12 to be interchangeable, while 3 and 5 are not interchangeable with 8 and 12.

4 GRAPH CONVOLUTIONAL NEURAL NETWORKS

Graph convolutional neural networks (GCNNs) have shown great success in learning representations for graph data, with prominent variants introduced in [10]–[13]. All these variants can be written as GNN architectures in which the edge varying component in (7) is fixed and given by powers of the shift operator matrix, Φ_l^(k:0) = S^k,

$$ \mathbf{X}_l = \sigma\left( \sum_{k=0}^{K} \mathbf{S}^{k} \mathbf{X}_{l-1} \mathbf{A}_{lk} \right). \qquad (12) $$

By comparing (8) with (12), it follows this particular restriction yields a tensor A(S) with filters of the form

$$ \mathbf{A}_l^{fg}(\mathbf{S}) = \sum_{k=0}^{K} a_{lk}^{fg} \mathbf{S}^{k} \qquad (13) $$

for some order K and scalar parameters a_{l0}^{fg}, ..., a_{lK}^{fg}. Our focus in this section is to discuss variations on (13). To simplify the discussion, we omit the layer and feature indices and for the remainder of this section write

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} a_k \mathbf{S}^{k}. \qquad (14) $$

The filters in (14) are of the form in (5) with Φ^(0) = a_0 I_N and Φ^(k:0) = a_k S^k for k ≥ 1. By particularizing G to the line graph, (14) represents a linear time-invariant filter described by a regular convolution.
This justifies using the qualifier convolutional for an architecture with filters of the form (14). The appeal of the graph convolutional filters in (14) is that they reduce the number of parameters from the K(M + N) + N of the edge varying filters in (5) to just K + 1, yielding also a computational complexity of order O(KM). While we can reduce the number of parameters in several ways, the formulation in (14) is of note because it endows the resulting GNN with equivariance to permutations of the labels of the graph. We state this property formally in the following proposition.

Proposition 1. Let x be a graph signal defined on the vertices of a graph G = (V, E) with shift operator S. Consider also the output of a GCNN Ψ(x; A(S)) [cf. (10)] with input x and tensor A(S) = {A(S)} composed of filters of the form in (14). Then, for a permutation matrix P, it holds that

$$ \mathbf{P}^{\mathsf{T}} \Psi\big( \mathbf{x}; \mathcal{A}(\mathbf{S}) \big) = \Psi\big( \mathbf{P}^{\mathsf{T}}\mathbf{x}; \mathcal{A}(\mathbf{P}^{\mathsf{T}}\mathbf{S}\mathbf{P}) \big). $$

That is, the GCNN output operating on the graph G with input x is a permuted version of the GCNN output operating on the permuted graph G' = (V', E') with permuted shift operator S' = P^T S P and permuted input signal x' = P^T x.

Proof. See Appendix A.

Proposition 1 establishes that the output of a GCNN is independent of node labeling. This is important not just because graph signals are independent of labeling (therefore, so should be their processing) but because it explains how GCNNs exploit the internal signal symmetries. If two parts of the graph are topologically identical and the nodes support identical signal values, a GCNN yields identical outputs [41]–[43]. It must be emphasized that permutation equivariance is of use only inasmuch as this is a desirable property of the considered task. Permutation equivariance holds in, e.g., recommendation systems but does not hold in, e.g., community classification.
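Proposition 1 can also be verified empirically. The following sketch (a toy two-layer, single-feature GCNN with random coefficients, assumed for illustration) checks that permuting the graph and the input permutes the output:

```python
import numpy as np

def gcnn_output(S, x, coeffs, sigma=np.tanh):
    """Toy multi-layer, single-feature GCNN with filters of the form (14)."""
    for a in coeffs:  # one tap vector a = (a_0, ..., a_K) per layer
        x = sigma(sum(a_k * np.linalg.matrix_power(S, k) @ x
                      for k, a_k in enumerate(a)))
    return x

rng = np.random.default_rng(2)
N = 7
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.5)
x = rng.normal(size=N)
coeffs = [rng.normal(size=4), rng.normal(size=4)]   # two layers, K = 3

# A random permutation matrix P.
P = np.eye(N)[rng.permutation(N)]

# Proposition 1: P^T Psi(x; A(S)) == Psi(P^T x; A(P^T S P)).
lhs = P.T @ gcnn_output(S, x, coeffs)
rhs = gcnn_output(P.T @ S @ P, P.T @ x, coeffs)
```

The equality holds because (P^T S P)^k = P^T S^k P and the pointwise nonlinearity commutes with permutations.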
In the graph in Figure 2, we expect agents 3, 5, 8, and 12 to be interchangeable from the perspective of predicting product ratings from the ratings of other nodes. But from the perspective of community classification, we expect 3 and 5 or 8 and 12 to be interchangeable, while 3 and 5 are not interchangeable with 8 and 12. When equivariance is not a property of the task, GCNNs are not expected to do well. GCNNs will suffer in any problem in which local detail around a node is important. This is because the filter in (14) forces all nodes to weigh the information of all k-hop neighbors with the same parameter a_k irrespectively of the relative importance of different nodes and different edges. To avoid this limitation, we can use a GNN that relies on the edge varying filters (5) in which each node i learns a different parameter Φ_ij^(k) for each neighbor j. These two cases are analogous to CNNs processing time signals with conventional convolutional filters as opposed to a neural network that operates with arbitrarily time varying filters (i.e., filters whose coefficients change in time). The appealing intermediate solution is to use filters with controlled edge variability to mix the advantage of a permutation equivariant parameterization (Proposition 1) with the processing of local detail. We introduce architectures that construct different versions of filters with controlled edge variability in Sections 4.1-4.3.

Remark 4. Along with the above-referred works, the works in [16]–[18] and [14] also use versions of the convolutional filter in (14). Specifically, [16]–[18] consider single shifts on the graph with the shift operator being a learnable weight matrix, a Gaussian kernel, and a random walk, respectively. The work in [14] adopts multi-layer perceptrons along the feature dimension at each node, before exchanging information with their neighbors.
This is equivalent to (12) with the first layers having order K = 0 (depending on the depth of the MLP), followed by a final layer of order K = 1.

4.1 GNNs with Controlled Edge Variability

To build a GNN that fits between a permutation equivariant GCNN [cf. (14)] and a full edge varying GNN [cf. (5)], we use different filter parameters in different parts of the graph. Formally, let B = {B_1, ..., B_B} be a partition of the node set into B blocks with block B_i having B_i nodes. Define the tall matrix C_B ∈ {0, 1}^{N×B} such that [C_B]_{ij} = 1 if node i belongs to block B_j and 0 otherwise. Let also a_B^(k) ∈ R^B be a vector of block parameters of filter order k. Block varying graph filters are then defined as

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \mathrm{diag}\big( \mathbf{C}_{\mathcal{B}} \mathbf{a}_{\mathcal{B}}^{(k)} \big) \mathbf{S}^{k}. \qquad (15) $$

Filters in (15) use the parameter [a_B^(k)]_i for all nodes in block B_i. Block varying filters belong to the family of node varying graph filters [29] and are of the form in (5) with

$$ \boldsymbol{\Phi}^{(k:0)} = \mathrm{diag}\big( \mathbf{C}_{\mathcal{B}} \mathbf{a}_{\mathcal{B}}^{(k)} \big) \mathbf{S}^{k}. \qquad (16) $$

Substituting (16) into (5) generates block varying GNNs [44]. Block varying GNNs have B(K + 1)F² parameters per layer and a computational complexity of order O(KF²M). Alternatively, we can consider what we call hybrid filters, defined as linear combinations of convolutional filters and edge varying filters that operate on a subset of nodes; see Figure 3. Formally, let I ⊂ V denote an important subset of I = |I| nodes and define the shift matrices Φ_I^(k) such that the diagonal matrix Φ_I^(0) has entries [Φ_I^(0)]_ii ≠ 0 for all i ∈ I, and [Φ_I^(k)]_ij = 0 for all i ∉ I or (i, j) ∉ E and k ≥ 1. That is, the parameter matrices Φ_I^(k) may contain nonzero elements only at rows i that belong to set I and with node j being a neighbor of i. We define hybrid filters as those of the form

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \left( \prod_{k'=0}^{k} \boldsymbol{\Phi}_{\mathcal{I}}^{(k')} + a_k \mathbf{S}^{k} \right). \qquad (17) $$

Substituting (17) in (5) generates hybrid GNNs.
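A block varying filter (15) is straightforward to realize once the assignment matrix C_B is fixed. The sketch below (an illustration with an assumed two-block partition and random taps) also checks the consistency case: with a single block, (15) must collapse to the convolutional filter (14):

```python
import numpy as np

def block_varying_filter(S, C_B, a_B, x):
    """Block varying graph filter, cf. (15):
    u = sum_k diag(C_B a_B^(k)) S^k x.

    C_B : N x B block-assignment matrix (node i -> its block)
    a_B : list of K+1 vectors of length B (one tap per block per order k)
    """
    u = np.zeros_like(x)
    z = x.copy()                       # S^k x, built recursively
    for k, a_k in enumerate(a_B):
        if k > 0:
            z = S @ z
        u += (C_B @ a_k) * z           # diag(C_B a^(k)) S^k x, nodewise scale
    return u

rng = np.random.default_rng(3)
N, B, K = 6, 2, 2
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.5)
x = rng.normal(size=N)

# Hypothetical partition: nodes 0-2 in block 0, nodes 3-5 in block 1.
C_B = np.zeros((N, B))
C_B[:3, 0] = 1.0
C_B[3:, 1] = 1.0
a_B = [rng.normal(size=B) for _ in range(K + 1)]
u = block_varying_filter(S, C_B, a_B, x)

# With B = 1 block the filter must reduce to the convolutional filter (14).
C_1 = np.ones((N, 1))
a_1 = [np.array([float(k + 1)]) for k in range(K + 1)]
u_block = block_varying_filter(S, C_1, a_1, x)
u_conv = sum((k + 1) * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))
```

At the other extreme, one block per node recovers the full node varying filter of [29].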
In essence, nodes i ∈ I learn edge dependent parameters, which may also differ across nodes, while nodes i ∉ I learn global parameters. Hybrid filters are defined by a number of parameters that depends on the total number of neighbors of all nodes in the importance set I. Define then M_I = Σ_{i∈I} N_i and observe Φ_I^(0) has I nonzero entries since it is a diagonal matrix, while the matrices Φ_I^(k) for k ≥ 1 each have M_I nonzero values. We then have KM_I + I parameters in the edge varying filters and K + 1 parameters in the convolutional filters. We therefore have a total of (I + KM_I + K + 1)F² parameters per layer in a hybrid GNN. The implementation cost of a hybrid GNN layer is of order O(KF²(M + N)) since both terms in (17) respect the graph sparsity. Block GNNs depend on the choice of blocks B and hybrid GNNs on the choice of the importance set I. We explore the use of different heuristics based on centrality and clustering measures in Section 6, where we will see that the choice of B and I is in general problem specific.

Fig. 3. Hybrid Edge Varying Filter [cf. (17)]. The nodes in set I = {2, 7} are highlighted. Nodes 2 and 7 have edge varying parameters associated with their incident edges. All nodes, including 2 and 7, also use the global parameter a_k as in a regular convolutional graph filter.

4.2 Spectral Graph Convolutional Neural Networks

The convolutional operation of the graph filter in (14) can be represented in the spectral domain. To do so, consider the input-output relationship u = A(S)x along with the eigenvector decomposition of the shift operator S = VΛV^{−1}.
Projecting the input and output signals onto the eigenvector space of $S$ creates the so-called graph Fourier transforms $\tilde{x} := V^{-1}x$ and $\tilde{u} := V^{-1}u$ [45], which allow us to write

$$\tilde{u} = \bigg( \sum_{k=0}^{K} a_k \Lambda^k \bigg) \tilde{x}. \quad (18)$$

Eq. (18) reveals that convolutional graph filters are pointwise in the spectral domain, due to the diagonal nature of the eigenvalue matrix $\Lambda$. We can, therefore, define the filter's spectral response $a : \mathbb{R} \to \mathbb{R}$ as the function

$$a(\lambda) = \sum_{k=0}^{K} a_k \lambda^k, \quad (19)$$

which is a single-variable polynomial characterizing the graph filter $A(S)$. If we allow for filters of order $K = N - 1$, there is always a set of parameters $a_k$ such that $a(\lambda_i) = \tilde{a}_i$ for any set of spectral responses $\tilde{a}_i$ [26]. Thus, training over the set of spectral parameters $a(\lambda_1), \ldots, a(\lambda_N)$ is equivalent to training over the space of (nodal) parameters $a_0, \ldots, a_{N-1}$. GCNNs were first introduced in [9] using the spectral representation of graph filters in (19).

By using edge varying graph filters [cf. (5)], we can propose an alternative parameterization of the space of filters of order $N$ which, as we will see, may have some advantages. To explain this better, let $\mathcal{J}$ be the index set defining the zero entries of $S + I_N$ and let $C_{\mathcal{J}} \in \{0,1\}^{|\mathcal{J}| \times N^2}$ be a binary selection matrix whose rows are those of $I_{N^2}$ indexed by $\mathcal{J}$. Let also $B$ be a basis matrix that spans the null space of

$$C_{\mathcal{J}} \big( V^{-\top} \ast V \big), \quad (20)$$

where "$\ast$" is the (column-wise) Khatri-Rao product [46]; this matrix arises from applying the column-wise vectorization operator $\operatorname{vec}(\cdot)$ to the sparsity constraint on matrices of the form $V \Lambda^{(k)} V^{-1}$. Then, the following proposition from [34] quantifies the spectral response of a particular class of the edge varying graph filter in (5).

Proposition 2. Consider the subclass of the edge varying graph filters in (5) where the parameter matrices $\big(\Phi^{(0)} + \Phi^{(1)}\big)$ and $\Phi^{(k)}$ for all $k = 2, \ldots, K$ are restricted to those that share the eigenvectors with $S$, i.e., $\big(\Phi^{(0)} + \Phi^{(1)}\big) = V \Lambda^{(1)} V^{-1}$ and $\Phi^{(k)} = V \Lambda^{(k)} V^{-1}$ for all $k = 2, \ldots, K$.
The spectral response of this subclass of edge varying filters has the form

$$a(\Lambda) = \sum_{k=1}^{K} \prod_{k'=1}^{k} \Lambda^{(k')} = \sum_{k=1}^{K} \prod_{k'=1}^{k} \operatorname{diag}\big( B \mu^{(k')} \big), \quad (21)$$

where $B$ is an $N \times b$ basis kernel matrix that spans the null space of (20) and $\mu^{(k)}$ is a $b \times 1$ vector containing the expansion parameters of $\Lambda^{(k)}$ into $B$.

Proof. See Appendix B.

Proposition 2 provides a subclass of the edge varying graph filters where, instead of learning $K(M+N) + N$ parameters, we learn the $Kb$ entries of $\mu^{(1)}, \ldots, \mu^{(K)}$ in (21). These filters build the output features as a pointwise multiplication between the filter spectral response $a(\Lambda)$ and the input spectral transform $\tilde{x} = V^{-1}x$, i.e., $u = V a(\Lambda) \tilde{x} = V a(\Lambda) V^{-1} x$. Following the analogies with conventional signal processing, (21) represents the spectral response of a convolutional edge varying graph filter.

Spectral GCNNs are a particular case of (21) with order $K = 1$ and a kernel $B$ independent from the graph (e.g., a spline kernel). Besides generalizing [9], a graph-dependent kernel allows implementing (21) in the vertex domain through an edge varying filter of the form (5); hence, it has a complexity of order $\mathcal{O}(K(M+N))$, in contrast to the $\mathcal{O}(N^2)$ required for graph-independent kernels. The edge varying implementation also captures local detail up to a region of radius $K$ from a node, while retaining a spectral interpretation. Nevertheless, both the graph-dependent GNN [cf. (21)] and the graph-independent GNN [9] are mostly of theoretical interest, since they require the eigendecomposition of the shift operator $S$. This implies a cubic complexity in the number of nodes, and an accurate learning process will suffer from numerical instabilities, since it requires an order $K \approx N$ and hence high order matrix powers $S^k$.
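The pointwise spectral behavior in (18)-(19) can be verified numerically; a minimal sketch with an arbitrary symmetric shift operator (all values are illustrative assumptions):

```python
import numpy as np

# Sketch: a polynomial graph filter acts pointwise in the spectral domain,
# i.e., u = A(S) x equals V a(Lambda) V^{-1} x with a(λ) from Eq. (19).
# Graph and parameters below are illustrative assumptions.

rng = np.random.default_rng(1)
N, K = 8, 3
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2                        # symmetric shift operator
np.fill_diagonal(S, 0)

a = rng.standard_normal(K + 1)           # filter taps a_0, ..., a_K
x = rng.standard_normal(N)

# Vertex-domain output: u = sum_k a_k S^k x
u_vertex = sum(a[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Spectral-domain output: u = V diag(a(λ_i)) V^{-1} x
lam, V = np.linalg.eigh(S)               # S = V Λ V^T for symmetric S
x_hat = V.T @ x                          # graph Fourier transform of x
response = sum(a[k] * lam**k for k in range(K + 1))   # a(λ) pointwise
u_spectral = V @ (response * x_hat)

print(np.allclose(u_vertex, u_spectral))  # True
```

For a symmetric $S$ we have $V^{-1} = V^\top$, so the graph Fourier transform reduces to a projection onto the eigenvectors.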
4.3 ARMA Graph Convolutional Neural Networks

We can increase the descriptive power of the filter in (14) by growing its order $K$, which allows learning filters with a more discriminative polynomial frequency response [cf. (19)]. However, this also increases the number of parameters and the computational cost. Most importantly, it introduces numerical issues associated with high order matrix powers $S^k$, ultimately leading to poor interpolatory and extrapolatory performance [47]. These challenges can be overcome by considering graph filters with a rational spectral response, since rational functions have better interpolatory and extrapolatory properties than polynomials [47]-[49]. Rational functions can also achieve more complicated responses with lower degrees in both numerator and denominator, and thus fewer learnable parameters. Autoregressive moving average (ARMA) graph filters [30] serve this purpose and implement rational functions of the form

$$A(S) = \bigg( I + \sum_{p=1}^{P} a_p S^p \bigg)^{-1} \bigg( \sum_{q=0}^{Q} b_q S^q \bigg) := P^{-1}(S)\, Q(S), \quad (22)$$

where we have defined $P(S) := I + \sum_{p=1}^{P} a_p S^p$ and $Q(S) := \sum_{q=0}^{Q} b_q S^q$. The ARMA filter in (22) is defined by $P$ denominator parameters $a = [a_1, \ldots, a_P]^\top$ and $Q + 1$ numerator parameters $b = [b_0, \ldots, b_Q]^\top$. The input-output relationship $u = A(S)x$ of the ARMA filter can be represented in the spectral domain as [cf. (18)]

$$\tilde{u} = \bigg( I + \sum_{p=1}^{P} a_p \Lambda^p \bigg)^{-1} \bigg( \sum_{q=0}^{Q} b_q \Lambda^q \bigg) \tilde{x}. \quad (23)$$

It follows that ARMA filters are also pointwise operators in the spectral domain, characterized by the rational spectral response function

$$a(\lambda) = \bigg( \sum_{q=0}^{Q} b_q \lambda^q \bigg) \bigg/ \bigg( 1 + \sum_{p=1}^{P} a_p \lambda^p \bigg). \quad (24)$$

In particular, it follows that the space of ARMA filters defined by (22) is equivalent to the space of spectral ARMA filters defined by (24), which is equivalent to the space of spectral filters in (19) and, in turn, to the graph convolutional filters in (14).
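On a small graph, the ARMA filter in (22) can be sketched by solving the linear system $P(S)u = Q(S)x$ directly, and checked against the pointwise rational response in (24); graph and coefficients below are illustrative assumptions:

```python
import numpy as np

# Sketch of the ARMA graph filter in Eq. (22): u = P(S)^{-1} Q(S) x,
# computed here by solving the linear system P(S) u = Q(S) x directly.
# Graph and coefficients are illustrative assumptions.

rng = np.random.default_rng(2)
N = 8
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2
np.fill_diagonal(S, 0)
S = S / np.abs(np.linalg.eigvalsh(S)).max()   # normalize for stability

a = np.array([0.3, -0.1])        # denominator parameters a_1, ..., a_P
b = np.array([1.0, 0.5, 0.2])    # numerator parameters b_0, ..., b_Q

P_S = np.eye(N) + sum(a[p] * np.linalg.matrix_power(S, p + 1) for p in range(len(a)))
Q_S = sum(b[q] * np.linalg.matrix_power(S, q) for q in range(len(b)))

x = rng.standard_normal(N)
u = np.linalg.solve(P_S, Q_S @ x)            # ARMA filter output

# Check the pointwise rational response of Eq. (24) in the spectral domain.
lam, V = np.linalg.eigh(S)
resp = (b[0] + b[1] * lam + b[2] * lam**2) / (1 + a[0] * lam + a[1] * lam**2)
print(np.allclose(u, V @ (resp * (V.T @ x))))  # True
```

The direct solve is what the Jacobi implementation of Section 4.3 replaces when the graph is too large for an $\mathcal{O}(N^3)$ inversion.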
That the filter spaces are equivalent does not mean the filters have the same properties. We expect ARMA filters to produce useful spectral responses with fewer parameters than the convolutional filters in (14) or the spectral filters in (19).

As it follows from (22), we need to compute the inverse matrix $P^{-1}(S)$ to obtain the ARMA output. The latter incurs a cubic complexity, which is computationally unaffordable unless the graph is of limited dimensions. When the graph is large, we need an iterative method that exploits the sparsity of the graph to approximate the inverse at reduced cost [30], [49]. Due to its faster convergence, we consider a parallel structure that consists of first transforming the polynomial ratio in (24) into its partial fraction decomposition form and subsequently using the Jacobi method to approximate the inverse. While other Krylov approaches are also possible, the parallel Jacobi method offers a better tradeoff between computational complexity and convergence rate.

Partial fraction decomposition of ARMA filters. The partial fraction decomposition of the rational function $a(\lambda)$ in (24) provides an equivalent representation of ARMA filters. Let $\gamma = [\gamma_1, \ldots, \gamma_P]^\top$ be a set of poles, $\beta = [\beta_1, \ldots, \beta_P]^\top$ a corresponding set of residuals, and $\alpha = [\alpha_0, \ldots, \alpha_K]^\top$ a set of direct terms; we can then rewrite (24) as

$$a(\lambda) = \sum_{p=1}^{P} \frac{\beta_p}{\lambda - \gamma_p} + \sum_{k=0}^{K} \alpha_k \lambda^k, \quad (25)$$

where $\alpha$, $\beta$, and $\gamma$ are computed from $a$ and $b$. A graph filter whose spectral response is as in (25) is one in which the spectral variable $\lambda$ is replaced by the shift operator variable $S$. It follows that if $\alpha$, $\beta$, and $\gamma$ are chosen to make (25) and (24) equivalent, the filter in (22) is, in turn, equivalent to

$$A(S) = \sum_{p=1}^{P} \beta_p \big( S - \gamma_p I \big)^{-1} + \sum_{k=0}^{K} \alpha_k S^k. \quad (26)$$

The equivalence of (22) and (26) means that instead of training $a$ and $b$ in (22), we can train $\alpha$, $\beta$, and $\gamma$ in (26).

Jacobi implementation of single-pole filters.
To circumvent the matrix inverses in (26), we first consider each single-pole filter in (26) separately and implement the input-output relationship

$$u_p = \beta_p \big( S - \gamma_p I \big)^{-1} x. \quad (27)$$

Expression (27) is equivalent to the linear equation $(S - \gamma_p I) u_p = \beta_p x$, which we can solve iteratively through a Jacobi recursion. This requires separating $(S - \gamma_p I)$ into diagonal and off-diagonal components. We therefore begin by defining the diagonal degree matrix $D = \operatorname{diag}(S)$ so that the shift operator can be written as

$$S = D + \big( S - D \big) := \operatorname{diag}(S) + \big( S - \operatorname{diag}(S) \big). \quad (28)$$

With this definition, we write $(S - \gamma_p I_N) = (D - \gamma_p I_N) + (S - D)$, which is a decomposition into the diagonal term $(D - \gamma_p I_N)$ and the off-diagonal term $(S - D)$. Jacobi iteration $k$ for (27) is given by the recursive expression

$$u_{pk} = -\big( D - \gamma_p I_N \big)^{-1} \Big[ \beta_p x - \big( S - D \big) u_{p(k-1)} \Big], \quad (29)$$

initialized with $u_{p0} = x$. We can unroll this iteration to write an explicit relationship between $u_{pK}$ and $x$. To do so, we define the parameterized shift operator

$$R(\gamma_p) = -\big( D - \gamma_p I_N \big)^{-1} \big( S - D \big) \quad (30)$$

and use it to write the $K$th iterate of the Jacobi recursion as

$$u_{pK} = \beta_p \sum_{k=0}^{K-1} R^k(\gamma_p)\, x + R^K(\gamma_p)\, x. \quad (31)$$

For a convergent Jacobi recursion, the signal $u_{pK}$ in (31) converges to the output $u_p$ of the single-pole filter in (27). Truncating (31) at a finite $K$ yields an approximation in which single-pole filters are written as polynomials in the shift operator $R(\gamma_p)$. That is, a single-pole filter is approximated by a convolutional filter of order $K$ [cf. (14)] in which the shift operator $S$ of the graph is replaced by the shift operator $R(\gamma_p)$ defined in (30). This convolutional filter uses parameters $\beta_p$ for $k = 0, \ldots, K-1$ and $1$ for $k = K$.

Jacobi ARMA filters and Jacobi ARMA GNNs.
Assuming we use Jacobi iterations to approximate all single-pole filters in (26) and that we truncate all of these iterations at $K$, we can write ARMA filters as

$$A(S) = \sum_{p=1}^{P} H_K\big( R(\gamma_p) \big) + \sum_{k=0}^{K} \alpha_k S^k, \quad (32)$$

where $H_K(R(\gamma_p))$ is a $K$th order Jacobi approximation of the single-pole filter, which, as per (31), is given by

$$H_K\big( R(\gamma_p) \big) = \beta_p \sum_{k=0}^{K-1} R^k(\gamma_p) + R^K(\gamma_p). \quad (33)$$

A Jacobi ARMA filter of order $(P, K)$ is defined by (32) and (33). The order $P$ represents the number of poles in the filter and the order $K$ the number of Jacobi iterations we consider appropriate to approximate individual single-pole filters. Notice the number of taps $K$ in the filter $\sum_{k=0}^{K} \alpha_k S^k$ need not be the same as the number of Jacobi iterations used in (33), but we use the same value to avoid complicating notation.

For sufficiently large $K$, (32)-(33), (26), and (22) are all equivalent expressions of ARMA filters of order $(P, Q)$. We could train parameters using any of these equivalent expressions, but we advocate for the use of (32)-(33), as no inversions are necessary except for the elementary inversion of the diagonal matrix $(D - \gamma_p I)$. It is interesting to note that in this latter form ARMA filters are reminiscent of the convolutional filters in (14), but the similarity is superficial. In (14), we train $K + 1$ parameters $a_k$ that multiply shift operator powers $S^k$. In (32)-(33) we also train $K + 1$ parameters of this form in the filter $\sum_{k=0}^{K} \alpha_k S^k$, but this is in addition to the parameters $\beta_p$ and $\gamma_p$ of each of the single-pole filter approximations $H_K(R(\gamma_p))$. These single-pole filters are themselves reminiscent of the convolutional filters in (14), but the similarity is again superficial. Instead of parameters $a_k$ that multiply shift operator powers $S^k$, the filters in (33) train a parameter $\gamma_p$ that represents a constant subtracted from the diagonal entries of the shift operator $S$.
The fact that this is equivalent to an ARMA filter suggests (32)-(33) may help design more discriminative filters. We corroborate in Section 6 that GNNs using (32)-(33) outperform GNNs that utilize the filters in (14). An ARMA GNN has $(2P + K + 1)F^2$ parameters per layer and a computational complexity of order $\mathcal{O}\big(F^2 P(MK + N)\big)$. This decomposes as $\mathcal{O}(PN)$ to invert the diagonal matrices $(D - \gamma_p I_N)$; $\mathcal{O}(PM)$ to scale the nonzeros of $(S - D)$ by the inverse diagonal; and $\mathcal{O}(PKM)$ to obtain the outputs of the Jacobi ARMA filters (32) of order $K$.

ARMA GNNs as EdgeNets. ARMA GNNs (ARMANets) are another subclass of the EdgeNet. To see this, consider that each shift operator $R(\gamma_p)$ in (30) shares its support with $I_N + S$. Hence, we can express the graph filter in (32) as the union of $P + 1$ edge varying graph filters. The first $P$ of these filters have parameter matrices of the form

$$\Phi_p^{(k:0)} = \begin{cases} \beta_p R^k(\gamma_p), & k = 0, \ldots, K-1, \\ R^K(\gamma_p), & k = K, \end{cases}$$

while the last filter captures the direct term with edge varying parameter matrices $\Phi^{(k:0)} = \alpha_k S^k$ [cf. Section 4]. The union of these edge varying filters has the expression

$$A(S)\,x = \sum_{k=0}^{K} \bigg( \sum_{p=1}^{P} \Phi_p^{(k:0)} + \Phi^{(k:0)} \bigg) x, \quad (34)$$

which, by further grouping the terms of the same order $k$, leads to a single edge varying graph filter of the form in (5).

ARMANet provides an alternative parameterization of the EdgeNet that is different from that of the polynomial convolutional filters in (14). In particular, ARMANets promote the use of multiple polynomial filters of smaller order (i.e., the number of Jacobi iterations) with parameters shared between them. Each of the filters $H_K(R(\gamma_p))$ depends on two parameters, $\beta_p$ and $\gamma_p$. We believe this parameter sharing among the different orders and the different nodes is behind the improved performance of the ARMA GNN compared with the single polynomial filters in (14).
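A minimal numerical sketch of solving the single-pole filter in (27) with a standard Jacobi iteration; the graph, pole $\gamma_p$, and residual $\beta_p$ are illustrative assumptions chosen so that the recursion converges:

```python
import numpy as np

# Sketch of a Jacobi solve for the single-pole filter in Eq. (27):
# u_p = beta_p (S - gamma_p I)^{-1} x, iterating
# u_k = (D - gamma_p I)^{-1} (beta_p x - (S - D) u_{k-1}).
# Graph, pole gamma_p, and residual beta_p are illustrative assumptions.

rng = np.random.default_rng(3)
N = 10
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2
np.fill_diagonal(S, 0)
S = S / np.abs(np.linalg.eigvalsh(S)).max()    # spectral radius 1

gamma, beta = 2.0, 0.7                         # |gamma| > rho(S) => convergence
D = np.diag(np.diag(S))                        # diagonal part (zero here)
M_inv = 1.0 / (np.diag(S) - gamma)             # inverse of diagonal D - gamma I
off = S - D                                    # off-diagonal part

x = rng.standard_normal(N)
u = x.copy()                                   # u_{p0} = x
for _ in range(50):
    u = M_inv * (beta * x - off @ u)           # Jacobi update

u_exact = beta * np.linalg.solve(S - gamma * np.eye(N), x)
print(np.allclose(u, u_exact, atol=1e-8))      # True
```

With $S$ normalized to unit spectral radius and $|\gamma_p| > 1$, the iteration matrix has spectral radius below one, so a few dozen sparse iterations replace the $\mathcal{O}(N^3)$ inversion.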
Given also the hybrid solutions developed in Section 4.1 for the polynomial filters, a direction that may attain further improvements is that of ARMA GNN architectures with controlled edge variability.

Fig. 4. Jacobi Autoregressive Moving Average Filter. The input signal $x$ is processed by a parallel bank of filters. One of these filters is a convolutional filter of the form in (14) operating w.r.t. the shift operator $S$ (highlighted in red). The remaining filters operate w.r.t. the scaled shift operators of (30) (highlighted in blue). All filter outputs are summed together to yield the overall Jacobi ARMA output.

ARMANet generalizes the architecture in [15]: instead of restricting the polynomials in (22) to Cayley polynomials, it allows the use of general polynomials. Iterative approaches to implement an ARMA graph filter [30], [50] have recently been used to develop GNNs that resemble this recursion [51], [52]. However, neither [51] nor [52] implements an ARMA layer as (23) does. Instead, their propagation rule can be seen either as that used in graph recurrent neural networks [53, eq. (5)] but with a constant input, or as conceptually similar to that used in iterative sparse coding [54].

5 GRAPH CONVOLUTIONAL ATTENTION NETWORKS

Graph convolutional neural networks parameterize the EdgeNet by using a fixed underlying shift operator $S$ and learning only the filter parameters. However, this shift operator may often be estimated from data disjointly from the GNN task, or its specific weights may be unknown. In these instances, we can use the graph attention mechanism [19] to parameterize the EdgeNet in a way that learns both the shift operator weights and the convolutional filter parameters for the task at hand.
We propose the graph convolutional attention network (GCAT), which utilizes filters as in (12) but convolutional in a layer-specific matrix $\Phi_l = \Phi$ that may differ from the shift operator $S$:

$$X_l = \sigma\bigg( \sum_{k=0}^{K} \Phi^k X_{l-1} A_k \bigg). \quad (35)$$

Note that $A_k = A_{lk}$ and $\Phi = \Phi_l$ are layer-dependent, but we omit the layer index to simplify notation. Since matrix $\Phi$ shares the sparsity pattern of $S$, (35) defines a GNN as per (7). Matrix $\Phi$ is learned from the features $X_{l-1}$ passed from layer $l-1$ following the attention mechanism [19]. Specifically, we consider a trainable matrix $B \in \mathbb{R}^{F_{l-1} \times F_l}$ and vector $e \in \mathbb{R}^{2F_l}$, and compute the edge scores

$$\alpha_{ij} = \sigma\Big( e^\top \big[ [X_{l-1} B]_i ,\, [X_{l-1} B]_j \big]^\top \Big) \quad (36)$$

for all edges $(i,j) \in \mathcal{E}$. In (36), we start with the features $X_{l-1}$ and mix them as per the parameters in $B$. This produces a collection of graph signals $X_{l-1}B$ in which each node $i$ has $F_l$ features that correspond to the $i$th row $[X_{l-1}B]_i$ of the product matrix $X_{l-1}B$. The features at node $i$ are concatenated with the features of node $j$, and the resulting vector of $2F_l$ components is multiplied by vector $e$. This product produces the score $\alpha_{ij}$ after passing through the nonlinearity $\sigma(\cdot)$. Note that $B = B_l$ and $e = e_l$ are global parameters for all scores $\alpha_{ij} = \alpha_{lij}$ and depend on the layer index $l$. As in the case of $A_k$ and $\Phi$ in (35), we omit this index for simplicity. The score $\alpha_{ij}$ could be used directly as an entry of the matrix $\Phi$, but to encourage attention sparsity we pass $\alpha_{ij}$ through a local soft maximum operator

$$\Phi_{ij} = \exp(\alpha_{ij}) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp(\alpha_{ij'}) \bigg)^{-1}. \quad (37)$$

The soft maximum assigns edge weights $\Phi_{ij}$ close to 1 to the largest of the edge scores $\alpha_{ij}$ and weights close to 0 to the rest. See also Figure 5(a).

In Section 2, we introduced arbitrary edge varying graph filters [cf. (5)], which we leveraged in Section 3 to build edge varying GNNs [cf. (6)-(7)].
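The scoring and normalization steps in (36)-(37) can be sketched as follows; the graph, features, and trainable parameters $B$ and $e$ are illustrative assumptions, and we use a leaky ReLU for $\sigma$, as in [19]:

```python
import numpy as np

# Sketch of the attention scores (36) and neighborhood soft maximum (37).
# Graph, features, and trainable parameters B, e are illustrative assumptions.

rng = np.random.default_rng(4)
N, F_in, F_out = 5, 3, 4
A = (rng.random((N, N)) < 0.5).astype(float)    # adjacency support (sparsity of S)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                          # undirected support

X = rng.standard_normal((N, F_in))              # features X_{l-1}
B = rng.standard_normal((F_in, F_out))          # trainable mixing matrix
e = rng.standard_normal(2 * F_out)              # trainable scoring vector

Z = X @ B                                       # mixed features, one row per node
leaky = lambda t: np.where(t > 0, t, 0.2 * t)   # sigma in (36)

Phi = np.zeros((N, N))
for i in range(N):
    nbrs = list(np.flatnonzero(A[i])) + [i]     # N_i together with i itself
    scores = np.array([leaky(e @ np.concatenate([Z[i], Z[j]])) for j in nbrs])
    weights = np.exp(scores) / np.exp(scores).sum()   # soft maximum, Eq. (37)
    Phi[i, nbrs] = weights

# Each row of Phi sums to one over the closed neighborhood of node i.
print(np.allclose(Phi.sum(axis=1), 1.0))        # True
```

The learned $\Phi$ then replaces $S$ in the convolutional filter of (35); its rows are normalized over each closed neighborhood, which is the attention-sparsity effect of the soft maximum.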
In Section 4, we pointed out that edge varying graph filters leave too many degrees of freedom in the learning parameterization, a problem we could overcome with the use of graph convolutional filters [cf. (14)]. The latter suffer from the opposite problem, as they may excessively constrict the GNN. GATs provide a solution of intermediate complexity. Indeed, the filters in (35) allow us to build a GNN with convolutional graph filters where the shift operator $\Phi$ is learned ad hoc in each layer to represent the required abstraction between nodes. The edges of this shift operator try to choose neighbors whose values should most influence the computations at a particular node. This is as in any arbitrary edge varying graph filter, but the novelty of GATs is to reduce the number of learnable parameters by tying edge values to the matrix $B$ and vector $e$; observe that in (36) $e$ is the same for all edges. Thus, the computation of the scores $\alpha_{ij}$ depends on the $F_{l-1} \times F_l$ parameters in $B$ and the $2F_l$ parameters in $e$. This is of order no more than $F^2$ if we make $F = \max_l F_l$. It follows that for the GCAT in (35) the number of learnable parameters is at most $F^2 + 2F + F^2(K+1)$, which depends on design choices and is independent of the number of edges. We point out that, since $\Phi$ respects the graph sparsity, the computational complexity of implementing (35) and its parameterization is of order $\mathcal{O}(F(NF + KM))$.

5.1 Edge varying GAT networks

The idea of using attention mechanisms to estimate the entries of a shift operator $\Phi$ can be extended to estimate the entries $\Phi^{(k:0)}$ of an edge varying graph filter. To be specific, we

(a) Convolutional GAT filter [cf. (35)-(37)].
(b) Multi-parameter edge varying GAT [cf. (38)-(41)].

Fig. 5. Higher-order Graph Attention Filters. (a) Graph convolutional attention filter. The input features $X_{l-1}$ are shifted by the same edge varying shift operator $\Phi(B, e)$ and weighted by different parameter matrices $A_k$. The edge varying parameters in all $\Phi(B, e)$ are parameterized by the same matrix $B$ and vector $e$ following the attention mechanism. (b) Edge varying GAT filter. The input features $X_{l-1}$ are shifted by different edge varying shift operators $\Phi^{(k)}(B_k, e_k)$ and weighted by different parameter matrices $A_k$. The edge varying parameters in the different $\Phi^{(k)}(B_k, e_k)$ are parameterized by a different matrix $B_k$ and vector $e_k$ following the attention mechanism.

TABLE 1
Properties of Different Graph Neural Network Architectures. The parameters and complexity are considered per layer. Architectures in bold are proposed in this work. Legend: N - number of nodes; M - number of edges; F - maximum number of features; K - recursion order; b - dimension of the kernel in (21); B - number of blocks in (15); I - the set of important nodes in (15) and (17); M_I - total neighbors for the nodes in I; P - parallel J-ARMANet branches; R - parallel attention branches. * Self-loops are not considered. ** The eigendecomposition cost O(N^3) is computed once.

| Architecture | Expression | Order parameters* O(·) | Order of complexity*,** O(·) |
| Fully connected | n/a | N²F² | N²F² |
| Edge varying | Eq. (7) | K(M+N)F² | KF²(M+N) |
| GCNN [10]-[13] | Eq. (12) | KF² | KF²M |
| Node varying [44] | Eq. (15) | BKF² | KF²M |
| Hybrid edge varying | Eq. (17) | (|I| + K M_I)F² | KF²(M+N) |
| Spec. edge varying GNN** | Eq. (21) | KbF² | KF²(M+N) |
| Spec. kernel GCNN** [9] | Eq. (21) for K = 1 | bF² | N²F² |
| ARMANet | Eq. (22)-(33) | (P+K)F² | F²P(MK+N) |
| GCAT | Eq. (35) | RKF² | R(NF²+KFM) |
| GAT [19] | Eq. (35) for K = 1 | RF² | R(NF²+FM) |
| Edge varying GAT | Eq. (38) | RKF² | RK(NF²+MF) |

propose to implement a generic GNN as defined by recursion (7), which we repeat here for ease of reference:

$$X_l = \sigma\bigg( \sum_{k=0}^{K} \Phi^{(k:0)} X_{l-1} A_k \bigg). \quad (38)$$

Further recall that each edge varying filter parameter matrix $\Phi^{(k:0)}$ is itself defined recursively as [cf. (5)]

$$\Phi^{(k:0)} = \Phi^{(k)} \Phi^{(k-1:0)} = \prod_{k'=0}^{k} \Phi^{(k')}. \quad (39)$$

We propose to generalize (36) so that we compute a different matrix $\Phi^{(k)}$ for each filter order $k$. Consider then matrices $B_k$ and vectors $e_k$ to compute the edge scores

$$\alpha_{ij}^{(k)} = \sigma\Big( e_k^\top \big[ [X_{l-1} B_k]_i ,\, [X_{l-1} B_k]_j \big]^\top \Big) \quad (40)$$

for all edges $(i,j) \in \mathcal{E}$. As in (36), we could use $\alpha_{ij}^{(k)}$ as edge weights in $\Phi^{(k)}$, but to promote attention sparsity we send $\alpha_{ij}^{(k)}$ through a soft maximum function to yield the edge weights

$$\Phi_{ij}^{(k)} = \exp\big(\alpha_{ij}^{(k)}\big) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp\big(\alpha_{ij'}^{(k)}\big) \bigg)^{-1}. \quad (41)$$

Each of the edge varying matrices $\Phi^{(k)}$ for $k = 1, \ldots, K$ is parameterized by the tuple of transform parameters $(B_k, e_k)$. Put simply, we are using a different GAT mechanism for each edge varying matrix $\Phi^{(k)}$. These learned matrices are then used to build an edge varying filter to process the features $X_{l-1}$ passed from the previous layer; see Figure 5(b). The edge varying GAT filter employs $K + 1$ transform matrices $B_k$ of dimensions $F_l \times F_{l-1}$, $K + 1$ vectors $e_k$ of dimension $2F_l$, and $K + 1$ matrices $A_k$ of dimensions $F_l \times F_{l-1}$. Hence, the total number of parameters for the edge varying GAT filter is at most $(K+1)(2F^2 + 2F)$. The computational complexity of the edge varying GAT is of order $\mathcal{O}(KF(NF + M))$.

Remark 5. Graph attention networks first appeared in [19].
In that paper, (36) and (37) are proposed as an attention mechanism for the signals of neighboring nodes, and GNN layers are of the form $X_l = \sigma(\Phi_l X_{l-1} A_l)$. The latter is a particular case of either (35) or (38) in which only the term $k = 1$ is not null. Our observation in this section is that this is equivalent to computing a different graph, represented by the shift operator $\Phi$. This allows for the generalization to filters of arbitrary order $K$ [cf. (35)] and to edge varying graph filters of arbitrary order [cf. (38)]. The approaches presented in this paper can likewise be extended with the multi-head attention mechanism proposed in [19] to improve the network capacity.

5.2 Discussions

Reducing model complexity. As defined in (36) and (40), the attention mechanisms are separate from the filtering. To reduce the number of parameters, we can equate the attention matrices $B$ or $B_k$ with the filtering matrices $A_k$. For the GCAT in (35), the original proposal in [19] is to make $B = A_1$, so that (36) reduces to

$$\alpha_{ij} = \sigma\Big( e^\top \big[ [X_{l-1} A_1]_i ,\, [X_{l-1} A_1]_j \big]^\top \Big). \quad (42)$$

For the edge varying GATs in (38), it is natural to equate $B_k = A_k$, in which case (40) reduces to

$$\alpha_{ij}^{(k)} = \sigma\Big( e_k^\top \big[ [X_{l-1} A_k]_i ,\, [X_{l-1} A_k]_j \big]^\top \Big). \quad (43)$$

The choice in (43) removes $(K+1)F^2$ parameters.

Accounting for differences in edge weights in the original shift operator. The major benefit of the GAT mechanism is to build a GNN without requiring full knowledge of $S$. This is beneficial as it yields a GNN robust to uncertainties in the edge weights. This benefit becomes a drawback when $S$ is well estimated, as it renders weights $s_{ij}$ equivalent regardless of their relative values. One possible solution to this drawback is to use a weighted soft maximum operator, so that the entries of $\Phi^{(k)}$ are chosen as

$$\Phi_{ij}^{(k)} = \exp\big(s_{ij}\, \alpha_{ij}^{(k)}\big) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp\big(s_{ij'}\, \alpha_{ij'}^{(k)}\big) \bigg)^{-1}.$$
(44)

Alternatively, we can resort to a hybrid GAT in which we combine a GAT filter of the form in (38) with a regular convolutional filter of the form in (12):

$$X_l = \sigma\bigg( \sum_{k=0}^{K} S^k X_{l-1} A_k + \Phi^{(k:0)} X_{l-1} A_k' \bigg). \quad (45)$$

This is the GAT version of the hybrid GNN we proposed in (17). The filters in (45) account for both the GAT learned shifts $\Phi^{(k)}$ and the original shift $S$.

6 NUMERICAL RESULTS

This section corroborates the capacity of the different models with numerical results on synthetic and real-world graph signal classification problems. Given the different hyperparameters of the models and the trade-offs involved (e.g., complexity, number of parameters, radius of local information), we aim to provide insights on which methods better exploit the graph prior for learning purposes rather than to achieve the best performance. In the sequel, we provide a summary of the different architectures, while in the upcoming four sections we analyze each experiment in detail. Section 6.5 provides some general observations. The code used for these simulations can be found at http://github.com/alelab-upenn/graph-neural-networks.

Summary of the architectures. A summary of the different architectures discussed in this paper is provided in Table 1.

• Fully connected stands for the standard multi-layer perceptron neural network that uses no graph prior.
• Edge varying stands for the EdgeNet in its full form in (7), which allocates different parameters per edge and shift and has the most degrees of freedom.
• The graph convolutional neural network in (12) instead shares parameters among all nodes, edges, and shifts.
• The node varying [cf. (15)] and the hybrid edge varying [cf. (17)] are intermediate solutions between the full-form EdgeNet and the GCNN. Both rely on selecting a set of important nodes and allocating different parameters to them. The hybrid edge varying also allocates different parameters to the edges of these important nodes. The remaining nodes use shared parameters as in the GCNN.
• The spectral edge varying GNN is the subclass of the full-form EdgeNet in which the filters share the eigenvectors with the shift operator [cf. Prop. 2], while the spectral kernel GCNN is the equivalent representation of the GCNN in the spectral domain.
• ARMANet is the GCNN that we proposed in Section 4.3. Contrary to the polynomial GCNN form [cf. (12)], its filter has a rational rather than polynomial spectral response.
• The graph convolutional attention network and its variants are the architectures we developed in Section 5. The GCAT generalizes the popular GAT [19] to a convolutional filter of higher order, so that it can account for multi-hop neighbor information in each layer. The edge varying GAT [cf. (38)] is an alternative to the GCAT that uses edge varying filters together with attention instead of convolutional filters. Lastly, we also discussed two variants of the GCAT and edge varying GAT in Section 5.2, and several others can be developed under the EdgeNet framework.

From all these architectures, we do not evaluate the fully connected network, the spectral kernel GCNN [9], and the spectral edge varying GNN (21), since their computational cost is higher than linear. We also leave the extensions discussed in Section 5.2 to interested readers.

Fig. 6. Source Localization Test Error in the Stochastic Block Model graph. The y-axis scale is deformed to improve visibility. The thick bar interval indicates the average performance for different parameter choices (e.g., filter order, attention heads). The circle marker represents the mean value of this interval. The thin line spans an interval of one standard deviation from the average performance. The convolutional-based approaches perform better than the attention-based ones. We attribute the poor performance of the attention techniques to their slow learning rate: both the GAT and the edge varying GAT required more than 40 epochs to reach a local minimum. The graph convolutional attention network (GCAT), however, does not suffer from this issue, leading to faster learning.

For the remaining solutions, we trained all of them using ADAM with the standard forgetting factors [55].

6.1 Source localization on SBM graphs

The goal of this experiment is to identify which community in a stochastic block model (SBM) graph is the source of a diffused signal by observing different realizations at different time instants. We considered a connected and undirected graph of $N = 50$ nodes divided into five blocks, each representing one community $\{c_1, \ldots, c_5\}$. The intra- and inter-community edge formation probabilities are $0.8$ and $0.2$, respectively. The source node is one of the five nodes $(i_1, \ldots, i_5)$ with the largest degree in the respective community. The source signal $x(0)$ is a Kronecker delta centered at the source node. The source signal is diffused at time $t \in [0, 50]$ as $x(t) = S^t x(0)$, where $S$ is the graph adjacency matrix normalized by the maximum eigenvalue. The training set is composed of 10240 tuples of the form $(x(t), c_i)$ for random $t$ and $i \in \{1, \ldots, 5\}$. These tuples are used to train the EdgeNets, which are subsequently used to predict the source community for a testing signal $x'(t)$, again for a random value of $t$. The validation and the test set are both composed of 2560 tuples ($25\%$ of the training set). The performance of the different algorithms is averaged over ten different graph realizations and ten data splits, for a total of 100 Monte-Carlo iterations.
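The data generation just described can be sketched as follows; this is a minimal version on a smaller graph, with all sizes and the random seed being illustrative assumptions:

```python
import numpy as np

# Sketch of the source localization data generation: sample an SBM graph,
# normalize its adjacency, and diffuse a Kronecker delta from a source node.
# Probabilities follow the text, but the graph here is smaller than N = 50.

rng = np.random.default_rng(5)
C, n_per = 5, 10                        # five communities of 10 nodes each
N = C * n_per
labels = np.repeat(np.arange(C), n_per)

# SBM: edge prob. 0.8 within a community, 0.2 across communities.
p = np.where(labels[:, None] == labels[None, :], 0.8, 0.2)
U = rng.random((N, N))
A = np.triu((U < p).astype(float), k=1)
A = A + A.T                             # undirected adjacency, no self-loops

S = A / np.abs(np.linalg.eigvalsh(A)).max()   # normalize by max eigenvalue

# Source: the highest-degree node of a random community; diffuse for t steps.
c = rng.integers(C)
members = np.flatnonzero(labels == c)
src = members[A[members].sum(axis=1).argmax()]
x0 = np.zeros(N)
x0[src] = 1.0                           # Kronecker delta at the source

t = rng.integers(0, 51)                 # t in [0, 50]
x_t = np.linalg.matrix_power(S, t) @ x0 # diffused signal x(t) = S^t x(0)
print(x_t.shape, c)                     # one training tuple (x(t), c)
```

Repeating the last block for random $t$ and $c$ yields the labeled tuples $(x(t), c_i)$ used for training, validation, and testing.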
ADAM is run for 40 epochs with batches of 100 samples and learning rate 10 − 3 . Architecture parameters. For this experiment, we compar ed 14 differ ent architectures. All architectures comprise the cascade of a graph filtering layer with ReLU nonlinearity and a fully connected layer with softmax nonlinearity . The architectur es are: i ) the edge varying GNN (7); ii ) the GCNN (12); iii ) thr ee node varying GNNs (15), wher e the five important nodes are selected based on iii − a ) maximum degree; iii − b ) spectral proxies [56], which ranks the nodes according to their contribution to different frequencies; iii − c ) diffusion centrality (see Appendix C); iv ) three node dependent edge varying GNNs, where the five important nodes B are selected similalr to the node varying case; v ) three ARMANets (32) with Jacobi iterations v − a ) K = 1 ; v − b ) K = 3 ; v − c ) K = 5 ; v i ) the GA T network from [19]; v ii ) the GCA T network (35); and v iii ) the edge varying GA T (38) network. Our goal is to see how the different architectures handle their degrees of freedom, while all having linear complexity . T o make this comparison more insightful, we proceed with the following rationale. For the approaches in i ) − iv ) , we analyzed filter orders in the interval K ∈ { 1 , . . . , 5 } . This is the only handle we have on these filters to control the number of parameters and locality radius while keeping the same computational complexity . For the ARMANet in v ) , we set the direct term or der to K = 0 to observe only the effect of the rational part. Subsequently , for each Jacobi iteration value K , we analyzed rational orders in the interval P ∈ { 1 , . . . , 5 } as for the former approaches. While this strategy helps us controlling the local radius, recall the ARMANet has a computational complexity slightly higher than the former four architectures. For the GA T in v i ) , we analyzed different attention heads R ∈ { 1 , . . . 
, 5} such that the algorithm complexity matches that of the approaches i)-iv). The number of attention heads is the only handle in the GAT network. Finally, for the GCAT in vii) and the edge varying GAT in viii), we fixed the attention heads to R = 3 and analyzed different filter orders K ∈ {1, ..., 5}. The latter allows comparing the impact of the local radius for the median attention head value. Recall that these architectures again have a slightly higher complexity for K ≥ 2.

TABLE 2
Source Localization Test Error in Facebook Subnetwork. The goal is to grid-search the parameters to achieve a mean error of at most 2%. For the architectures that did not achieve this criterion, the minimum error is reported.

Architecture      | mean  | std. dev. | order         | attention heads | epochs           | learning rate
                  |       |           | {1,2,3,4,5}   | {1,2,3,4,5}     | {10,20,40,100}   | {10^-2, 10^-3}
GCNN              | 4.0%  | 13.0%     | 3             | n/a             | 100              | 10^-3
Edge varying      | 1.5%  | 8.4%      | 1             | n/a             | 10               | 10^-3
Node varying      | 6.0%  | 15.8%     | 3             | n/a             | 20               | 10^-3
Hybrid edge var.  | 6.6%  | 15.9%     | 2             | n/a             | 40               | 10^-3
ARMANet           | 2.0%  | 9.7%      | 1             | n/a             | 20               | 10^-3
GAT               | 10.9% | 20.8%     | n/a           | 1               | 40               | 10^-3
GCAT              | 8.0%  | 18.4%     | 3             | 1               | 100              | 10^-3
Edge varying GAT  | 7.1%  | 17.8%     | 2             | 3               | 100              | 10^-3

Observations. The results of this experiment are shown in Figure 6. We make the following observations. First, the attention-based approaches are characterized by a slow learning rate, leading to a poor performance in 40 epochs. This is reflected in the higher test error of the GAT and the edge varying GAT networks. However, this is not the case for the GCAT network. We attribute this reduced error to the superposition of the graph convolution to the attention that the GCAT explores; all convolutional approaches learn faster. On the contrary, the error increases further for the edge varying GAT, since multiple attention strategies are adopted for all k ∈ {1, ..., K} in (38).
Therefore, our conclusion is that the graph convolutional prior can be significantly helpful for attention mechanisms. We will see this consistent improvement in all our experiments. Second, the edge varying GNN in (7) achieves the lowest error, although it has the largest number of parameters. The convolutional approaches parameterize the edge varying filter well, highlighting the benefit of the graph convolution. ARMANet is the best among the latter, characterized by both a lower mean error and a lower standard deviation. This reduced error for ARMANet is not entirely surprising, since rational functions have better interpolation and extrapolation properties than polynomial ones. It is, however, remarkable that the best result is obtained for a Jacobi iteration of K = 1; i.e., the parameter sharing imposed by ARMANet reaches a good local optimum even with a coarse approximation of the rational function. Notice also that the source localization task is not permutation equivariant; therefore, architectures that are not permutation equivariant (edge varying, node varying, hybrid edge varying, ARMA for low Jacobi orders) are expected to perform better. Third, for the node selection strategies in architectures iii) and iv), there is no clear difference between the degree and the diffusion centrality. For the node varying GNNs, the diffusion centrality offers a lower error both in the mean and in the deviation. In the hybrid edge varying GNNs [cf. (17)], instead, the degree centrality achieves a lower error but pays in deviation. The spectral proxy centrality yields the worst performance. Finally, we remark that we did not find any particular trend while changing the parameters of the different GNNs (e.g., order, attention heads). A rough observation is that low order recursions are often sufficient to reach low errors.
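The diffusion centrality used for node selection above is defined in Appendix C as the vector δ = Σ_{k=0}^K S^k 1. A small sketch of ranking nodes by it (function names are ours):

```python
import numpy as np

def diffusion_centrality(S, K=3):
    """DC of Appendix C: delta = sum_{k=0}^K S^k 1, via repeated shifts."""
    delta = np.zeros(S.shape[0])
    z = np.ones(S.shape[0])
    for _ in range(K + 1):
        delta += z          # accumulate S^k 1
        z = S @ z           # next shift
    return delta

def most_central_nodes(S, m, K=3):
    """Indices of the m nodes with the largest diffusion centrality."""
    return np.argsort(-diffusion_centrality(S, K))[:m]

# example: in a star graph, the hub is the most diffusion-central node
A = np.zeros((6, 6))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
hub = most_central_nodes(A, 1)[0]   # node 0
```

The m selected indices would then play the role of the important nodes in the node varying and hybrid edge varying architectures.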
6.2 Source localization on Facebook subnetwork

In the second experiment, we considered a similar source localization problem on a real-world network: a 234-user Facebook subgraph obtained as the largest connected component of the dataset in [57]. This graph has two well-defined connected communities of different sizes, and the objective is to identify which of the two communities originated the diffusion. The performance of the different algorithms is averaged over 100 Monte-Carlo iterations. The remaining parameters for generating the synthetic data are as before.

Architecture parameters. We compared the eight GNN architectures reported in the leftmost column of Table 2. For the node varying and the hybrid edge varying GNNs, the important nodes are again 10% of all nodes selected based on diffusion centrality. The number of Jacobi iterations for the ARMANet is K = 1 and there is no direct term. Overall, this problem is easy to solve if the GNN is hypertuned with enough width and depth. However, this strategy hides the impact of the specific filter. To highlight the role of the latter, we considered minimal GNN architectures composed of one layer and two features. In turn, this allows a better understanding of how much the specific filter contributes to the performance. We then grid-searched all parameters in Table 2 to reach a classification error of at most 2%. For the architectures that reach this criterion, we report the smallest parameters. For the architectures that do not reach this criterion, we report the minimum achieved error and the respective parameters. Our rationale is that the minimum parameters yield a lower complexity and show better the contribution of the filter type. They also lead to faster training; the opposite holds for the learning rate. From Table 2, we observe that only the edge varying GNN and the ARMANet reach the predefined error.
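The Jacobi iteration referenced here approximates the matrix inverse in the ARMANet's rational term. As a generic illustration (this is not the paper's exact layer; the function name and setup are ours), K Jacobi steps for a system M y = x look as follows:

```python
import numpy as np

def jacobi(M, x, K=1):
    """K Jacobi steps for M y = x starting from y = 0:
    y <- D^{-1} (x - R y), with D = diag(M) and R = M - D.
    A small K gives the coarse inverse approximation discussed in the text."""
    d = np.diag(M)
    R = M - np.diag(d)
    y = np.zeros_like(x)
    for _ in range(K):
        y = (x - R @ y) / d
    return y

# for a diagonally dominant M, the iteration converges to the exact solve
M = np.array([[4.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])
y1 = jacobi(M, x, K=1)      # one step: y = x / diag(M)
y_many = jacobi(M, x, K=60)
```

With K = 1 the output is just the diagonally scaled input, which is why the single-iteration ARMANet stays as cheap as a polynomial filter.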
Both architectures stress our observation that low order recursions (K = 1) are often sufficient. Nevertheless, this is not the case for all other architectures. These observations suggest the edge varying GNN explores its degrees of freedom well and adapts well to the non-permutation equivariance of the task. The ARMANet exploits the convolutional prior the best; in accordance with the former results, the Jacobi implementation does not need to run until convergence to achieve impressive results. We also conclude that the convolutional prior helps to reduce the degrees of freedom of the EdgeNet but requires a deeper and/or wider network to achieve the predefined criterion. This is particularly seen in the GAT-based architectures. The GCAT architecture, here, exploits the convolutional prior and reduces the error compared with the GAT. Finally, we remark that, for all approaches, a substantially lower variance can be achieved by solely increasing the number of features.

6.3 Authorship attribution

In this third experiment, we assess the performance of the different GNN architectures in an authorship attribution problem based on real data. The goal is to classify whether a text excerpt belongs to a specific author or to any other of the 20 contemporary authors based on word adjacency networks (WANs) [58]. A WAN is an author-specific directed graph whose nodes are function words without semantic meaning (e.g., prepositions, pronouns, conjunctions). The relative positioning of function words carries stylistic information about the author. To capture this information, we build a directed graph, where each node is a function word and each weighted edge represents the average co-occurrence of the corresponding function words, discounted by relative distance (i.e., if the two words are next to each other, the weight is higher than if the two words are further apart). We build this graph support only once, before training.
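A minimal sketch of such a co-occurrence graph follows. It is a simplification of the WAN construction: we count unit-weight co-occurrences within a ten-word window and omit the distance discount; `build_wan` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def build_wan(tokens, function_words, window=10):
    """Row-normalized co-occurrence matrix over function-word pairs
    (unit weights within the window; no distance discount)."""
    idx = {w: i for i, w in enumerate(function_words)}
    W = np.zeros((len(function_words), len(function_words)))
    hits = [(p, idx[t]) for p, t in enumerate(tokens) if t in idx]
    for a in range(len(hits)):
        for b in range(a + 1, len(hits)):
            if hits[b][0] - hits[a][0] >= window:
                break                       # later hits are even farther away
            W[hits[a][1], hits[b][1]] += 1.0
    rows = W.sum(axis=1, keepdims=True)     # row-normalize: Markov-like shift
    return np.divide(W, rows, out=np.zeros_like(W), where=rows > 0)

S = build_wan("the cat and the dog in the barn".split(), ["the", "and", "in"])
```

Row-normalizing the counts yields the stochastic shift operator interpretation used in this experiment.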
The signal on top of this graph is the frequency count of the function words in text excerpts of 1,000 words. These are the graph signals that form the dataset used for training and testing. The WANs and the word frequency counts serve as author signatures and allow learning representation patterns in their writing style. The task translates into a binary classification problem, where one indicates the text excerpt is written by the author of interest and zero by any other author. A more detailed account of the creation of WANs can be found in [58], and the dataset used is available in our code.

The WANs of the respective authors have from N = 190 to N = 210 function word nodes. Following [58], we built single-author WANs for Jane Austen, Emily Brontë, and Edgar Allan Poe. For each author, we processed the texts to count the number of times each function word pair co-appears in a window of ten words. These co-appearances are imputed into an N × N matrix and normalized row-wise. The resulting matrix is used as the shift operator, which can also be interpreted as a Markov chain transition matrix. We considered a train-test split of 95%-5% of the available texts. Around 8.7% of the training samples are used for validation. This division leads to: (i) Austen: 1346 training samples, 118 validation samples, and 78 testing samples; (ii) Brontë: 1192 training samples, 104 validation samples, and 68 testing samples; (iii) Poe: 740 training samples, 64 validation samples, and 42 testing samples. For each author, the sets are extended by a similar amount with texts from the other 20 authors, shared equally between them.

Architecture parameters. We considered again the eight GNN architectures of the former section, shown in the leftmost column of Table 3. Following the setup in [11], all architectures comprise a graph neural layer of F = 32 features with ReLU nonlinearity followed by a fully connected layer.
The baseline order for all filters is K = 4. For the ARMANet, this is also the number of denominator parameters and the order of the direct term in (32); the number of Jacobi iterations in (33) is one. We want to show how much the rational part helps to improve the performance of the GCNN (which is the direct term in the ARMANet [cf. (32)]). The important nodes for the node varying and the hybrid edge varying are 20 (~10% of N), selected with degree centrality. The GAT, GCAT, and edge varying GAT have a single attention head to highlight the role of the convolutional and edge varying recursion over it. The loss function is the cross-entropy optimized over 25 epochs with a learning rate of 0.005. The performance is averaged over ten data splits.

TABLE 3
Authorship Attribution Test Error. The results show the average classification test error and standard deviation on 10 different 95%-5% training-test splits.

Architecture     | Austen       | Brontë       | Poe
GCNN             | 7.2(±2.0)%   | 12.9(±3.5)%  | 14.3(±6.4)%
Edge varying     | 7.1(±2.2)%   | 13.1(±3.9)%  | 10.7(±4.3)%
Node varying     | 7.4(±2.1)%   | 14.6(±4.2)%  | 11.7(±4.9)%
Hybrid edge var. | 6.9(±2.6)%   | 14.0(±3.7)%  | 11.7(±4.8)%
ARMANet          | 7.9(±2.3)%   | 11.6(±5.0)%  | 10.9(±3.7)%
GAT              | 10.9(±4.6)%  | 22.1(±7.4)%  | 12.6(±5.5)%
GCAT             | 8.2(±2.9)%   | 13.1(±3.5)%  | 13.6(±5.8)%
Edge varying GAT | 14.5(±5.9)%  | 23.7(±9.0)%  | 18.1(±8.4)%

Table 3 shows the results of this experiment. Overall, we see again that the graph convolution is a solid prior for learning meaningful representations. This is particularly highlighted in the improved performance of the GCAT for Austen and Brontë compared with the GAT, even with a single attention head. These observations also suggest the GAT and the edge varying GAT architectures require multi-head approaches to achieve comparable performance. An exception is the case of Poe.
In this instance, multi-head attention is also needed for the GCAT. The (approximated) rational part of the ARMANet gives a consistent improvement over the GCNN. Hence, we recommend considering the additional parameterization of the ARMANet when implementing graph convolutional neural networks, since the increased number of parameters and implementation costs are minimal. Finally, we remark that the hybrid edge varying GNN improves the accuracy of its node varying counterpart.

6.4 Recommender Systems

In this last experiment, we evaluate all former architectures for movie rating prediction on a subset of the MovieLens 100K dataset [59]. The full dataset comprises U = 943 users, I = 1,582 movies, and 100K out of ~1.5M potential ratings. We set the missing ratings to zero. From the incomplete U × I rating matrix, we consider two scenarios: a user-based and a movie-based one. In the user-based scenario, we considered the 200 users that have rated the most movies as the nodes of a graph whose edges represent Pearson similarities between any two users. Each of the I = 1,582 movies is treated as a different graph signal whose value at a node is the rating given to that movie by a user, or zero if unrated. We are interested in predicting the rating of a specific user u with GNNs, which corresponds to completing the u-th row of the 200 × 1,582 sub-rating matrix. In the movie-based scenario, we considered the 200 movies with the largest number of ratings as nodes of a graph whose edges represent Pearson similarities between any two movies. In this instance, there are 943 graph signals: the ratings each user gives to all 200 movies form one such graph signal. We are interested in predicting the rating of a specific movie i with GNNs, which corresponds to completing the i-th column of the rating matrix. We remark that this task is permutation equivariant; therefore, we expect architectures holding this property to perform better.

Architecture parameters.
We considered the same architectural settings as in the authorship attribution experiments to highlight consistent behaviors and differences. Following [60], we chose ten 90%-10% splits for the training and test sets and pruned the graphs to keep only the top-40 most similar connections per node. The shift operator is again the adjacency matrix normalized by its maximum eigenvalue. The ADAM learning algorithm is run over 40 epochs in batches of five with a learning rate of 5 × 10^-3. We trained the networks on a smooth-ℓ1 loss and measure the accuracy through the root mean squared error (RMSE).

TABLE 4
Average RMSE on the user graph.

Archit./User-ID  | 405  | 655  | 13   | 450  | 276  | Average
GCNN             | 1.09 | 0.72 | 1.18 | 0.82 | 0.66 | 0.89
Edge var.        | 1.25 | 0.74 | 1.34 | 0.99 | 0.70 | 1.00
Node var.        | 1.17 | 0.68 | 1.19 | 0.83 | 0.67 | 0.91
Hybrid edge var. | 1.10 | 0.72 | 1.27 | 0.80 | 0.60 | 0.90
ARMANet          | 1.13 | 0.69 | 1.24 | 0.80 | 0.65 | 0.90
GAT              | 1.27 | 0.74 | 1.44 | 0.92 | 0.80 | 1.03
GCAT             | 1.09 | 0.71 | 1.12 | 0.77 | 0.65 | 0.87
Edge var. GAT    | 1.19 | 0.70 | 1.31 | 0.85 | 0.75 | 0.96

Tables 4 and 5 show the results for the five users and five movies with the largest number of ratings, respectively. The first thing to note is that the GCAT consistently improves over the GAT. This further stresses that multi-head attention is more needed in the GAT than in the GCAT. Second, the edge varying GNN yields the worst performance because it is not a permutation equivariant architecture. In fact, the node varying and the hybrid edge varying GNNs, which are approaches in-between permutation equivariance and local detail, work much better. This trend is also observed in the edge varying GAT results, suggesting that the number of parameters in the edge varying is too high for this task.

6.5 General Observations

Altogether, these experiments lead to four main observations.

Edge varying GNN useful for non-permutation equivariant tasks. The edge varying GNN [cf.
(7)] can perform well if the task is not permutation equivariant (e.g., source localization). We have observed that minimal architectures (i.e., a lower number of features F, layers L, and filter order K) adapt more easily to the task. This is because a minimal architecture has fewer degrees of freedom and can avoid overfitting. On the contrary, when the task is permutation equivariant, the edge varying GNN will suffer and parameterizations matched to the task are needed.

Convolution provides a strong parameterization. In permutation equivariant tasks, we have seen GCNNs are a valid parameterization. They have also shown potential in tasks that are not permutation equivariant. However, contrarily to the edge varying GNN, GCNNs require a wider architecture and filters of higher order. We have also seen that the ARMANet [cf. (22)-(33)] can improve the performance of its polynomial counterpart [cf. (12)]. This is because of the rational frequency response the ARMANet implements in each layer, which requires fewer parameters. Surprisingly to us, however, we have observed that even with a few Jacobi iterations (ARMANets require at each layer the computation of a matrix inverse, which we solve iteratively with the Jacobi method) the ARMANet can often achieve a better performance than the polynomial GCNN.

Hybrid solutions offer a good parameterization. While the edge varying GNN and the GCNN may outperform each other depending on the task, the node varying [cf. (15)] and the hybrid edge varying [cf. (17)] GNNs have shown a good parameterization of the EdgeNet in all tasks. We have seen the hybrid edge varying perform overall better than the node varying due to its edge-dependent parameters. However, we did not find any consistent difference between the node selection strategies. Our rationale is that the sampling strategy of the important nodes needs to be matched with the task at hand.

Graph convolutions improve attention. The graph convolutional attention network [cf.
(35)] improves consistently over the GAT. This is because it generalizes the latter from a convolutional filter of order one to an arbitrary order. Working with the GCAT has been shown to accelerate the learning procedure of the GAT and to perform well in any task. In addition, the GCAT may be effective even with a single attention head. Instead, by generalizing the GAT to an edge varying GAT [cf. (38)], we have not seen substantial improvement beyond that of the GCAT. This indicates that the attention mechanism may not be the best strategy to learn different shift operators in each layer.

7 CONCLUSION

This paper introduced EdgeNets: GNN architectures that allow each node to collect information from its direct neighbors and apply different weights to each of them. EdgeNets preserve the state-of-the-art implementation complexity and provide a single recursion that encompasses all state-of-the-art architectures. By showcasing how each solution is a particular instance of the EdgeNet, we provided guidelines to develop more expressive GNN architectures without compromising the computational complexity. In specific, this paper proposed eight GNN architectures that can be readily extended to scenarios containing multi-edge features. The EdgeNet link showed a tight connection between the graph convolutional and graph attention mechanisms, which have so far been treated as two separate approaches. We found that the graph attention network learns the weights of a graph and then performs an order one convolution over this learned graph. Following this link, we introduced the concept of graph convolutional attention networks, which is an EdgeNet that jointly learns the edge weights and the parameters of a convolutional filter. We advocate the EdgeNet as a more formal way to build GNN solutions. However, further research is needed in three main directions.
First, research should be done to explore the connection between EdgeNets and receptive fields. This will lead to different parameterizations and architectures. Second, work needs to be done to assess the capabilities of EdgeNet solutions to handle graph isomorphisms [61], [62]. Third, theoretical work is also needed to characterize the stability of the EdgeNet to link perturbations [43], [63].

TABLE 5
Average RMSE on the movie graph.

Archit./Movie-ID | 50   | 258  | 100  | 181  | 294  | Average
GCNN             | 0.82 | 1.08 | 0.95 | 0.86 | 1.04 | 0.95
Edge var.        | 0.93 | 1.03 | 1.00 | 0.88 | 1.24 | 1.02
Node var.        | 0.78 | 1.04 | 1.00 | 0.87 | 1.00 | 0.94
Hybrid edge var. | 0.75 | 1.02 | 0.98 | 0.82 | 1.08 | 0.93
ARMANet          | 0.81 | 1.05 | 1.02 | 0.87 | 1.09 | 0.97
GAT              | 0.98 | 1.24 | 1.28 | 1.00 | 1.30 | 1.16
GCAT             | 0.83 | 1.06 | 1.04 | 0.83 | 1.05 | 0.96
Edge var. GAT    | 0.81 | 1.04 | 1.01 | 0.86 | 1.07 | 0.96

APPENDIX A
PROOF OF PROPOSITION 1

Denote the respective graph shift operator matrices of the graphs G and G' as S and S'. For P being a permutation matrix, S' and x' can be written as S' = P^T S P and x' = P^T x. Then, the output of the convolutional filter in (14) applied to x' is

  u' = Σ_{k=0}^{K} a_k (S')^k x' = Σ_{k=0}^{K} a_k (P^T S P)^k P^T x.   (46)

By using the permutation matrix properties P^T P = P P^T = I_N, so that (P^T S P)^k = P^T S^k P, the output u' becomes

  u' = P^T ( Σ_{k=0}^{K} a_k S^k x ) = P^T u,   (47)

which implies that the filter output operating on the permuted graph G' with input x' is simply the permutation of the convolutional filter in (14) applied to x. Subsequently, since the nonlinearities of each layer are pointwise, they preserve permutation equivariance; hence, the output of a GCNN layer is permuted likewise. These permutations propagate through the cascade of the different layers, yielding the final permuted output.
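The identity in this proof is easy to check numerically; a minimal sketch with a random symmetric shift operator (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
A = rng.random((N, N))
S = (A + A.T) / 2                       # arbitrary symmetric shift operator
x = rng.standard_normal(N)
a = rng.standard_normal(K + 1)          # filter taps a_0, ..., a_K

def graph_filter(S, x, a):
    """u = sum_k a_k S^k x, computed with the shift recursion z <- S z."""
    u, z = np.zeros_like(x), x.copy()
    for a_k in a:
        u += a_k * z
        z = S @ z
    return u

P = np.eye(N)[rng.permutation(N)]       # random permutation matrix
u = graph_filter(S, x, a)
u_prime = graph_filter(P.T @ S @ P, P.T @ x, a)   # filter on the permuted graph
print(np.allclose(u_prime, P.T @ u))    # True: u' = P^T u, as in (47)
```

The check holds for any shift operator and any taps, which is exactly the equivariance statement of Proposition 1.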
APPENDIX B
PROOF OF PROPOSITION 2

To start, let Φ̌^(1) = Φ^(0) + Φ^(1) and Φ̌^(k) = Φ^(k) for all k = 2, ..., K be the transformed parameter matrices. Recall also that Φ^(0) is a diagonal matrix; thus, Φ̌^(1) shares its support with Φ^(1) and with S + I_N. Given the eigendecomposition of the transformed parameter matrices Φ̌^(k) = V Λ^(k) V^{-1} for all k = 1, ..., K, the edge varying filter can be written in the graph spectral domain as

  a(Λ) = Σ_{k=1}^{K} Π_{k'=1}^{k} Λ^(k').   (48)

Subsequently, recall that J is the index set defining the zero entries of S + I_N and that C_J ∈ {0,1}^{|J| × N²} is the selection matrix whose rows are those of I_{N²} indexed by J [cf. (20)]. Then, the fixed support condition for Φ̌^(k) for all k = 1, ..., K is

  C_J vec(Φ̌^(k)) = C_J vec(V Λ^(k) V^{-1}) = 0_{|J|}.   (49)

Put differently, equation (49) expresses in vector form the zero entries of Φ̌^(k) (hence, of Φ^(0), ..., Φ^(K)) that match those of S + I_N. From the properties of the vectorization operation, (49) can be rewritten as

  C_J vec(V Λ^(k) V^{-1}) = C_J (V^{-T} ∗ V) λ^(k),   (50)

where "∗" denotes the Khatri-Rao product and λ^(k) = diag(Λ^(k)) is the N-dimensional vector composed of the diagonal elements of Λ^(k). As follows from (49), (50) implies that the vector λ^(k) lies in the null space of C_J (V^{-T} ∗ V), i.e.,

  λ^(k) ∈ null( C_J (V^{-T} ∗ V) ).   (51)

Let then B be a basis that spans this null space [cf. (20)]. The vector λ^(k) can be expanded as

  λ^(k) = B μ^(k),   (52)

where μ^(k) is the vector containing the basis expansion parameters. Finally, by putting together (49)-(52), (48) becomes

  a(Λ) = Σ_{k=1}^{K} Π_{k'=1}^{k} diag(B μ^(k')).   (53)

The N × b basis matrix B is a kernel that depends on the specific graph and, in particular, on the eigenvectors V.
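This construction can be checked numerically. The sketch below makes two illustrative assumptions that are not part of the proposition: V is taken as the eigenvectors of S + I_N itself, and the Khatri-Rao product is the column-wise one, so that vec(V diag(λ) V^{-1}) = (V^{-T} ∗ V) λ.

```python
import numpy as np

N = 6
A = np.zeros((N, N))
for i in range(N):                            # 6-node cycle as a toy support
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
M = A + np.eye(N)                             # support pattern of S + I_N

_, V = np.linalg.eigh(M)                      # illustrative choice of V
Vi = np.linalg.inv(V)

# column-wise Khatri-Rao of V^{-T} and V: columns kron(Vi.T[:, j], V[:, j])
KR = np.column_stack([np.kron(Vi.T[:, j], V[:, j]) for j in range(N)])

mask = (M.flatten(order="F") == 0.0)          # index set J: zeros of S + I_N
_, sv, Vh = np.linalg.svd(KR[mask])           # rows of KR selected by C_J
rank = int((sv > 1e-10).sum())
B = Vh[rank:].T                               # N x b basis of the null space in (51)

# any lambda = B mu yields Phi = V diag(lambda) V^{-1} supported on S + I_N
rng = np.random.default_rng(0)
mu = rng.standard_normal(B.shape[1])
Phi = V @ np.diag(B @ mu) @ Vi
print(np.allclose(Phi.flatten(order="F")[mask], 0.0))   # True
```

Note that the null space is non-trivial here: both λ = 1 (giving Φ = I_N) and the eigenvalues of M (giving Φ = M) satisfy the support constraint, so b ≥ 2 in this toy example.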
The kernel dimension b depends on the rank of B and, thus, on the rank of the null space in (51). In practice, it is often observed that rank(B) = b ≪ N.

APPENDIX C
DIFFUSION CENTRALITY

Let S be the shift operator used to represent the graph structure. We define the diffusion centrality (DC) δ_i of a node i in K shifts as the i-th entry of the vector

  δ = Σ_{k=0}^{K} S^k 1_N.   (54)

The DC describes how much each node influences the passing of information in the network for a finite number of hops. The DC vector δ can also be seen as the convolution of the constant graph signal with a convolutional filter of the form in (14) that has all unitary parameters. This definition of DC is more appropriate for choices of S being the adjacency matrix or normalizations of it. For S being the discrete Laplacian matrix, the DC is zero for all nodes, since the constant vector is the eigenvector of the discrete Laplacian corresponding to the zero eigenvalue. The above DC definition is the particularization of the DC proposed in [64] for the stochastic setting to the case where all nodes decide to take part in the signal diffusion. Both the DC in (54) and the one from [64] are correlated with the Katz-Bonacich centrality and the eigenvector centrality.

REFERENCES

[1] E. Isufi, F. Gama, and A. Ribeiro, "Generalizing graph convolutional neural networks with edge-variant recursions on graphs," in 27th Eur. Signal Process. Conf. A Coruña, Spain: Eur. Assoc. Signal Process., 2-6 Sep. 2019.
[2] L. Tang and H. Liu, "Relational learning via latent social dimensions," in 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining. Paris, France: ACM, 28 June-1 July 2009, pp. 817-826.
[3] N. Wale, I. A. Watson, and G. Karypis, "Comparison of descriptor spaces for chemical compound retrieval and classification," Knowledge and Information Systems, vol. 14, no. 3, pp. 347-375, 2008.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P.
Vandergheynst, "Geometric deep learning: Going beyond Euclidean data," IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18-42, July 2017.
[5] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83-98, May 2013.
[6] F. Scarselli, S. L. Yong, M. Gori, M. Hagenbuchner, A. C. Tsoi, and M. Maggini, "Graph neural networks for ranking web pages," in The 2005 IEEE/WIC/ACM Int. Conf. Web Intelligence. Compiegne, France: IEEE, 19-22 Sep. 2005, pp. 1-7.
[7] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61-80, Jan. 2009.
[8] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Graphs, convolutions, and neural networks: From graph filters to graph neural networks," IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128-138, Nov. 2020.
[9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," in 2nd Int. Conf. Learning Representations. Banff, AB: Assoc. Comput. Linguistics, 14-16 Apr. 2014, pp. 1-14.
[10] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Foundation, 5-10 Dec. 2016, pp. 3844-3858.
[11] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional neural network architectures for signals supported on graphs," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034-1049, Feb. 2019.
[12] J. Du, J. Shi, S. Kar, and J. M. F. Moura, "On graph convolution for graph CNNs," in 2018 IEEE Data Sci. Workshop. Lausanne, Switzerland: IEEE, 4-6 June 2018, pp. 239-243.
[13] T. N. Kipf and M.
Welling, "Semi-supervised classification with graph convolutional networks," in 5th Int. Conf. Learning Representations. Toulon, France: Assoc. Comput. Linguistics, 24-26 Apr. 2017, pp. 1-14.
[14] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in 7th Int. Conf. Learning Representations. New Orleans, LA: Assoc. Comput. Linguistics, 6-9 May 2019, pp. 1-17.
[15] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, "CayleyNets: Graph convolutional neural networks with complex rational spectral filters," IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97-109, Jan. 2019.
[16] M. Simonovsky and N. Komodakis, "Dynamic edge-conditioned filters in convolutional neural networks on graphs," in Conf. Comput. Vision and Pattern Recognition 2017. Honolulu, HI: Comput. Vision Foundation, 21-26 July 2017, pp. 3693-3702.
[17] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, "Geometric deep learning on graphs and manifolds using mixture model CNNs," in Conf. Comput. Vision and Pattern Recognition 2017. Honolulu, HI: Comput. Vision Foundation, 21-26 July 2017, pp. 3693-3702.
[18] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Foundation, 5-10 Dec. 2016.
[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in 6th Int. Conf. Learning Representations. Vancouver, BC: Assoc. Comput. Linguistics, 30 Apr.-3 May 2018, pp. 1-12.
[20] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," arXiv:1901.00596v3 [cs.LG], 8 Aug. 2019. [Online]. Available: http://arxiv.org/abs/1901.00596
[21] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications," arXiv:1812.08434v4 [cs.LG], 10 July 2019.
[Online]. Available: http://arxiv.org/abs/1812.08434
[22] Z. Zhang, P. Cui, and W. Zhu, "Deep learning on graphs: A survey," arXiv:1812.04202v1 [cs.LG], 11 Dec. 2018. [Online]. Available: http://arxiv.org/abs/1812.04202
[23] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, "Attention models in graphs: A survey," arXiv:1807.07984v1 [cs.AI], 20 July 2018. [Online]. Available: http://arxiv.org/abs/1807.07984
[24] G. Taubin, "Geometric signal processing on polygonal meshes," in EUROGRAPHICS '2000. Interlaken, Switzerland: The Eurographics Association, 21-25 Aug. 2000, pp. 1-11.
[25] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Trans. Signal, Inform. Process. Networks, vol. 4, no. 4, pp. 736-751, Dec. 2018.
[26] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644-1656, Apr. 2013.
[27] S. K. Narang, "Compact support biorthogonal wavelet filterbanks for arbitrary undirected graphs," IEEE Trans. Signal Process., vol. 61, no. 19, pp. 4673-4685, Oct. 2013.
[28] O. Teke and P. P. Vaidyanathan, "Extending classical multirate signal processing theory to graphs—Part I: Fundamentals," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 409-422, Jan. 2017.
[29] S. Segarra, A. G. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117-4131, Aug. 2017.
[30] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Autoregressive moving average graph filtering," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 274-288, Jan. 2017.
[31] J. You, J. Leskovec, K. He, and S. Xie, "Graph structure of neural networks," in International Conference on Machine Learning. PMLR, 2020, pp. 10881-10891.
[32] D. Bahdanau, K. Cho, and Y.
Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd Int. Conf. Learning Representations. San Diego, CA: Assoc. Comput. Linguistics, 7-9 May 2015, pp. 1-15.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in 31st Conf. Neural Inform. Process. Syst. Long Beach, CA: Neural Inform. Process. Syst. Foundation, 4-9 Dec. 2017, pp. 1-11.
[34] M. Coutino, E. Isufi, and G. Leus, "Advances in distributed graph filtering," IEEE Trans. Signal Process., vol. 67, no. 9, pp. 2320-2333, May 2019.
[35] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint, 2018.
[36] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," arXiv preprint arXiv:1704.01212, 2017.
[37] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, "Graph signal processing: Overview, challenges, and applications," Proceedings of the IEEE, vol. 106, no. 5, pp. 808-828, 2018.
[38] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[39] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024-1034.
[40] V. N. Ioannidis, A. G. Marques, and G. B. Giannakis, "A recurrent graph neural network for multi-relational data," in 44th IEEE Int. Conf. Acoust., Speech and Signal Process. Brighton, UK: IEEE, 12-17 May 2019, pp. 8157-8161.
[41] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman, "Invariant and equivariant graph networks," arXiv preprint, 2018.
[42] R. Levie, E. Isufi, and G. Kutyniok, “On the transferability of spec- tral graph filters,” in 13th Int. Conf. Sampling Theory Applications . Bordeaux, France: IEEE, 8-12 July 2019, pp. 1–5. [43] F . Gama, J. Bruna, and A. Ribeir o, “Stability pr operties of graph neural networks,” IEEE T rans. Signal Process. , vol. 68, pp. 5680– 5695, 25 Sep. 2020. [44] F . Gama, G. Leus, A. G. Marques, and A. Ribeiro, “Convolutional neural networks via node-varying graph filters,” in 2018 IEEE Data Sci. Workshop . Lausanne, Switzerland: IEEE, 4-6 June 2018, pp. 220–224. [45] A. Sandyhaila and J. M. F . Moura, “Discrete signal processing on graphs: Fr equency analysis,” IEEE T rans. Signal Process. , vol. 62, no. 12, pp. 3042–3054, June 2014. [46] S. Liu and G. T renkler , “Hadamar d, khatri-rao, kronecker and other matrix products,” International Journal of Information and Systems Sciences , vol. 4, no. 1, pp. 160–177, 2008. 18 [47] N. A. Heckert and J. J. Filliben, “Nist/sematech e-handbook of statistical methods; chapter 1: Exploratory data analysis,” 2003. [48] L. N. T refethen, Approximation Theory and Approximation Practice, Extended Edition . SIAM, 2019. [49] J. Liu, E. Isufi, and G. Leus, “Filter design for autoregressive moving average graph filters,” IEEE T rans. Signal, Inform. Process. Networks , vol. 5, no. 1, pp. 47–60, March 2019. [50] E. Isufi, A. Loukas, and G. Leus, “Autoregressive moving average graph filtering: A stable distributed implementation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , no. EPFL-CONF-223825, 2017. [51] F . M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, “Graph neural networks with convolutional ARMA filters,” arXiv:1901.01343v5 [cs.LG] , 24 Oct. 2019. [Online]. A vailable: http://arxiv .org/abs/1901.01345 [52] W . A. S. W ijesinghe and Q. W ang, “Dfnets: Spectral cnns for graphs with feedback-looped filters,” in Advances in Neural Infor- mation Processing Systems , 2019, pp. 6007–6018. 
[53] L. Ruiz, F . Gama, and A. Ribeiro, “Gated graph recurr ent neural networks,” IEEE T rans. Signal Pr ocess. , vol. 68, pp. 6303–6318, 26 Oct. 2020. [54] K. Gregor and Y . LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th international conference on inter- national conference on machine learning , 2010, pp. 399–406. [55] D. P . Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in 3rd Int. Conf. Learning Representations . San Diego, CA: Assoc. Comput. Linguistics, 7-9 May 2015, pp. 1–15. [56] A. Anis, A. Gadde, and A. Ortega, “Efficient sampling set selection for bandlimited graph signals using graph spectral proxies,” IEEE T rans. Signal Process. , vol. 64, no. 14, pp. 3775–3789, July 2016. [57] J. McAuley and J. Leskovec, “Learning to discover social circles in Ego networks,” in 26th Conf. Neural Inform. Pr ocess. Syst. Stateline, TX: Neural Inform. Process. Foundation, 3-8 Dec. 2012. [58] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE T rans. Signal Process. , vol. 63, no. 20, pp. 5464–5478, Oct. 2015. [59] F . M. Harper and J. A. Konstan, “The movielens datasets: History and context,” Acm transactions on interactive intelligent systems (tiis) , vol. 5, no. 4, pp. 1–19, 2015. [60] L. Ruiz, F . Gama, A. G. Marques, and A. Ribeiro, “Invariance- preserving localized activation functions for graph neural net- works,” IEEE T rans. Signal Process. , vol. 68, pp. 127–141, 25 Nov . 2019. [61] H. Maron, H. Ben-Hamu, H. Serviansky , and Y . Lipman, “Prov- ably powerful graph networks,” in Advances in neural information processing systems , 2019, pp. 2156–2167. [62] C. Morris, M. Ritzert, M. Fey , W . L. Hamilton, J. E. Lenssen, G. Rat- tan, and M. Grohe, “W eisfeiler and leman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019, pp. 4602–4609. [63] R. Levie, W . Huang, L. Bucci, M. 
M. Bronstein, and G. Kutyniok, “T ransferability of spectral graph convolutional neural networks,” arXiv preprint arXiv:1907.12972 , 2019. [64] A. Banerjee, A. G. Chandrasekhar , E. Duflo, and M. O. Jackson, “The diffusion of microfinance,” Science , vol. 341, no. 6144, pp. 1 236 498 (1–7), July 2013.
