EdgeNets: Edge Varying Graph Neural Networks


Authors: Elvin Isufi, Fernando Gama, Alejandro Ribeiro

Elvin Isufi, Fernando Gama and Alejandro Ribeiro

Abstract—Driven by the outstanding performance of neural networks in the structured Euclidean domain, recent years have seen a surge of interest in developing neural networks for graphs and data supported on graphs. The graph is leveraged at each layer of the neural network as a parameterization to capture detail at the node level with a reduced number of parameters and computational complexity. Following this rationale, this paper puts forth a general framework that unifies state-of-the-art graph neural networks (GNNs) through the concept of EdgeNet. An EdgeNet is a GNN architecture that allows different nodes to use different parameters to weigh the information of different neighbors. By extrapolating this strategy to more iterations between neighboring nodes, the EdgeNet learns edge- and neighbor-dependent weights to capture local detail. This is a general linear and local operation that a node can perform and encompasses under one formulation all existing graph convolutional neural networks (GCNNs) as well as graph attention networks (GATs). In writing different GNN architectures with a common language, EdgeNets highlight specific architecture advantages and limitations, while providing guidelines to improve their capacity without compromising their local implementation. For instance, we show that GCNNs have a parameter sharing structure that induces permutation equivariance. This can be an advantage or a limitation, depending on the application. In cases where it is a limitation, we propose hybrid approaches and provide insights to develop several other solutions that promote parameter sharing without enforcing permutation equivariance. Another interesting conclusion is the unification of GCNNs and GATs, approaches that have so far been perceived as separate.
In particular, we show that GATs are GCNNs on a graph that is learned from the features. This particularization opens the doors to developing alternative attention mechanisms for improving discriminatory power.

Index Terms—Edge varying, graph neural networks, graph signal processing, graph filters, learning on graphs.

1 INTRODUCTION

Data generated by networks is increasingly common. Examples include user preferences in recommendation systems, writer proclivities in blog networks [2], or properties of assembled molecular compounds [3]. Different from data encountered in the structured temporal or spatial domains, network data lives in high-dimensional irregular spaces. This fact makes it difficult to extend tools that exploit the regularity of time and space, leading to a rising interest in novel techniques for dealing with network data [4]. Since graphs are the prominent mathematical tool to model individual node properties (product ratings, writer bias, or molecule properties) along with node dependencies (user similarities, blog hyperlinks, or molecular bonds), the interest in network data has translated into a concomitant increase in the interest in tools for processing graphs and data supported on graphs [5]. Several recent works have proposed graph neural networks (GNNs) as a means of translating to graphs the success convolutional and recurrent neural networks have attained at learning on time and space [6]–[23]. GNNs are first concretized in [6], [7] by means of recursive neighboring label aggregations combined with pointwise nonlinearities.

• Isufi is with the Intelligent Systems Department, Delft University of Technology, Delft, The Netherlands. Gama is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94709. Ribeiro is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104.
Isufi conducted this research during his postdoctoral period at the Department of Electrical and Systems Engineering, University of Pennsylvania. This work is supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS and Intel DevCloud. E-mails: e.isufi-1@tudelft.nl, fgama@berkeley.edu, aribeiro@seas.upenn.edu. Part of this work has been presented in [1].

The convolutional GNN counterpart appears in [9], where graph convolutions are defined as pointwise operators in the Laplacian's spectrum. To avoid the cost and numerical instability of spectral decompositions, [10] approximates this spectral convolution with a Chebyshev polynomial on the Laplacian matrix. Parallel to these efforts, the field of graph signal processing has developed notions of graph convolutional filters as polynomials on a matrix representation of a graph [24]–[30]. This has led to GNNs described as architectures that simply replace time convolutions with graph convolutions [11], [12]. A third approach to define GNNs is to focus on the locality of convolutions by replacing the adjacency of points in time with the adjacency of neighbors in a graph; something that can be accomplished by mixing nodes' features with their neighbors' features [13], [14]. Despite their different motivations, spectral GNNs [9], [10], polynomial GNNs [11], [12], and local GNNs [13], [14] can all be seen to be equivalent to each other (Section 4). In particular, they all share the reuse of parameters across all neighborhoods of a graph as well as indifference towards the values of different neighbors; see also [31]. This is an important limitation that is tackled, e.g., by the graph attention networks (GATs) of [19]–[23] through the use of attention mechanisms [32], [33].
In this paper, we leverage edge varying graph filters [34] to provide a generic framework for the design of GNNs that can afford the flexibility to use different parameters at different nodes as well as different weighings for different neighbors of a node (Section 3). Edge varying filters are linear finite order recursions that allow individual nodes to introduce weights that are specific to the node, specific to each neighbor, and specific to the recursion index. In this way, the edge varying recursion represents a general linear operation that a node can implement locally, i.e., a general operation that relies on information exchanges only with neighbor nodes (Section 2).

As an alternative to the EdgeNet, the graph network [35] is a popular framework for generalizing GNNs. Graph networks consist of general update and aggregation functions over nodal, edge, and entire-graph features. This unification is slightly more general than the message passing neural network [36] and considers updates to be principally affected by information exchange only with the one-hop neighbors. While providing relevant insights on the local detail of order-one filter GNNs such as [13], [19], this strategy does not put emphasis on the role of the filter within the GNN or how the parameters of such a filter are allocated to the different multi-hop neighbors [10], [11]. Instead, the EdgeNet framework focuses only on nodal feature aggregations to highlight the role of multi-hop exchanges within a layer and to put emphasis on how different solutions operate from a node perspective. At the same time, the EdgeNet framework allows for a filter spectral analysis [37], which provides a better understanding of the type of filters that conform the learned filter bank. This spectral perspective of GNNs will shed light, for instance, on the advantages of ARMA filters [30] in learning sharper responses with fewer parameters than finite impulse response graph filters [25], [26], [29].
In its most general form, the edge varying GNN allocates different parameters to the different edges, which is of the order of the number of nodes and edges of the graph. While allocating different parameters over the edges can help exploit the graph structure better for the learning task at hand, edge-specific parameters sacrifice the inductive capabilities across different graphs. To reduce the complexity of this parameterization, we can regularize EdgeNets in different ways by imposing restrictions on the freedom to choose different parameters at different nodes. We explain that existing GNN architectures are particular cases of EdgeNets associated with different parameter restrictions. In turn, this shows how these solutions sacrifice degrees of freedom to gain in parameter sharing and inductive capabilities. We further utilize the insight of edge varying recursions to propose novel GNN architectures. In consequence, the novel contributions of this paper are:

(i) We define EdgeNets, which parameterize the linear operation of neural networks through a bank of edge varying recursions. EdgeNets are a generic framework to design GNN architectures (Section 3).

(ii) We show the approaches in [9]–[18] (among others) are EdgeNets where all nodes share the parameters. We extend the representation power of these networks by adding some level of variability in weighing different nodes and different edges of a node (Section 4).

(iii) Replacing finite length polynomials by rational functions provides an alternative parameterization of convolutional GNNs in terms of autoregressive moving average (ARMA) convolutional graph filters [30]. These ARMA GNNs generalize rational functions based on Cayley polynomials [15] (Section 4.3).

(iv) We show that GATs can be understood as GNNs with convolutional graph filters where a graph is learned ad hoc in each layer to represent the required abstraction between nodes.
The weights of this graph choose neighbors whose values should most influence the computations at a particular node. This reinterpretation allows for the proposal of more generic GATs with higher expressive power (Section 5).

This paper has the following structure. Section 2 reviews edge varying recursions on graphs and Section 3 introduces edge varying GNNs. To ease exposition, the GNNs are grouped into two categories: convolutional in Section 4 and attentional in Section 5. Within each category, we follow the same rationale. First, we discuss the state-of-the-art solutions as a particular case of the EdgeNet framework. Then, we discuss their architectural advantages and limitations. Finally, we leverage the EdgeNet viewpoint to propose new solutions that address some of these limitations and highlight the corresponding tradeoffs. Section 6 evaluates these solutions with numerical results and Section 7 concludes the paper.

2 EDGE VARYING LINEAR GRAPH FILTERS

Consider a weighted graph G with vertex set V = {1, ..., N}, edge set E ⊆ V × V composed of |E| = M ordered pairs (i, j), and weight function W : E → R. For each node i, define the neighborhood N_i = {j : (j, i) ∈ E} as the set of nodes connected to i and let N_i := |N_i| denote the number of elements (neighbors) in this set. Associated with G is a graph shift operator matrix S ∈ R^{N×N} whose sparsity pattern matches that of the edge set, i.e., entry S_ij ≠ 0 when (j, i) ∈ E or when i = j. Supported on the vertex set are graph signals x = [x_1, ..., x_N]^T ∈ R^N in which component x_i is associated with node i ∈ V. The adjacency of points in time signals or the adjacency of points in images codifies a sparse and local relationship between signal components. This sparsity and locality are leveraged by time or space filters.
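To make these definitions concrete, the following NumPy sketch (an illustration added here, not code from the paper) builds a small directed graph, its shift operator S, a neighborhood, and a graph signal; the edge weights are arbitrary assumptions:

```python
import numpy as np

# A small weighted, directed graph on N = 4 nodes.
# An edge (j, i) means information flows from node j to node i.
N = 4
edges = {(0, 1): 0.5, (1, 2): 1.0, (2, 3): 0.8, (3, 0): 0.3, (0, 2): 0.7}
M = len(edges)

# Shift operator S: entry S[i, j] != 0 when (j, i) is an edge.
S = np.zeros((N, N))
for (j, i), w in edges.items():
    S[i, j] = w

# Neighborhood of node 2: all j such that (j, 2) is an edge.
N_2 = {j for (j, i) in edges if i == 2}

# A graph signal: one scalar x_i per node.
x = np.array([1.0, -2.0, 0.5, 3.0])

# One application of the shift operator mixes each node's value
# with its neighbors' values: (Sx)_i = sum_j S[i, j] x_j,
# a local operation that respects the graph sparsity.
Sx = S @ x
```

Here node 2 aggregates only from its neighbors 0 and 1, which is exactly the locality the shift operator encodes.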
Similarly, S captures the sparsity and locality of the relationship between components of a signal x supported on G. It is then natural to take the shift operator as the basis for defining filters for graph signals. In this spirit, let Φ^(0) be an N × N diagonal matrix and Φ^(1), ..., Φ^(K) be a collection of K matrices sharing the sparsity pattern of I_N + S. Consider then the sequence of signals z^(k) as

$$ \mathbf{z}^{(k)} = \prod_{k'=0}^{k} \boldsymbol{\Phi}^{(k')} \mathbf{x} = \boldsymbol{\Phi}^{(k:0)} \mathbf{x}, \quad \text{for } k = 0, \ldots, K \qquad (1) $$

where the product matrix Φ^(k:0) := ∏_{k'=0}^{k} Φ^(k') = Φ^(k) ··· Φ^(0) is defined for future reference. Signal z^(k) can be computed using the recursion

$$ \mathbf{z}^{(k)} = \boldsymbol{\Phi}^{(k)} \mathbf{z}^{(k-1)}, \quad \text{for } k = 0, \ldots, K \qquad (2) $$

with initialization z^(−1) = x. This recursive expression implies signal z^(k) is produced from z^(k−1) using operations that are local in the graph. Indeed, since Φ^(k) shares the sparsity pattern of S, node i computes its component z_i^(k) as

$$ z_i^{(k)} = \sum_{j \in \mathcal{N}_i \cup \{i\}} \Phi_{ij}^{(k)} z_j^{(k-1)}. \qquad (3) $$

Particularizing (3) to k = 0, it follows each node i builds the i-th entry of z^(0) as a scaled version of its signal x_i by the diagonal matrix Φ^(0), i.e., z_i^(0) = Φ_ii^(0) x_i. Particularizing to k = 1, (3) yields that the components of z^(1) depend on the values of the signal x at most at neighboring nodes.

Fig. 1. Edge Varying Graph Filters. Each edge varying matrix Φ^(k) acts as a different shift operator that locally combines the graph signal. (Top-left) The colored discs are centered at five reference nodes and their coverage shows the amount of local information needed to compute z^(1) = Φ^(1:0)x at these nodes. The coverage of the discs in the other graphs shows the signal information needed by the reference nodes to produce the successive outputs. (Bottom) Schematic illustration of the edge varying filter output of order K = 3.
Particularizing to k = 2, (3) shows the components of z^(2) depend only on the values of signal z^(1) at neighboring nodes which, in turn, depend only on the values of x at their neighbors. Thus, the components of z^(2) are a function of the values of x at most at the respective two-hop neighbors. Repeating this argument iteratively, z_i^(k) represents an aggregation of information at node i coming from its k-hop neighborhood; see Figure 1. The collection of signals z^(k) behaves like a sequence of scaled shift operations, except that instead of shifting the signal in time, the signal is diffused through the graph (the signal values are shifted between neighboring nodes). Leveraging this interpretation, the graph filter output u is defined as the sum

$$ \mathbf{u} = \sum_{k=0}^{K} \mathbf{z}^{(k)} = \sum_{k=0}^{K} \boldsymbol{\Phi}^{(k:0)} \mathbf{x}. \qquad (4) $$

A filter output in time is a sum of scaled and shifted copies of the input signal. That (4) behaves as a filter follows from interpreting Φ^(k:0) as a scaled shift, which holds because of its locality. Each shift Φ^(k:0) is a recursive composition of individual shifts Φ^(k). These individual shifts represent different operators that respect the structure of G while reweighing individual edges differently when needed. For future reference, define the filter matrix A(S) so that (4) rewrites as u = A(S)x. For this to hold, the filter matrix must be

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \boldsymbol{\Phi}^{(k:0)} = \sum_{k=0}^{K} \left( \prod_{k'=0}^{k} \boldsymbol{\Phi}^{(k')} \right). \qquad (5) $$

Following [34], A(S) is a K-th order edge varying graph filter. Each matrix Φ^(k) contains at most M + N nonzero elements corresponding to the nonzero entries of I_N + S; thus, the total number of parameters defining filter A(S) in (5) is K(M + N) + N. For short filters, this is smaller than the N² components of an arbitrary linear transform. Likewise, computing z^(k) = Φ^(k) z^(k−1) as per (2) incurs a computational complexity of order O(M + N).
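The recursion (2) and the output (4) translate into a few lines of NumPy. The sketch below is an illustration under assumed random data, not the authors' implementation; it also checks that when every Φ^(k) equals S (and Φ^(0) = I), the edge varying filter collapses to the polynomial filter Σ_k S^k x:

```python
import numpy as np

def edge_varying_filter(Phi, x):
    """Apply the edge varying filter (4): u = sum_k Phi^(k:0) x.

    Phi is the list [Phi^(0), ..., Phi^(K)]; Phi^(0) is diagonal and
    every Phi^(k) shares the sparsity of I + S, so each step of the
    recursion (2) only exchanges values between neighboring nodes.
    """
    z = x.copy()                 # z^(-1) = x
    u = np.zeros_like(x)
    for Phi_k in Phi:            # z^(k) = Phi^(k) z^(k-1), cf. (2)
        z = Phi_k @ z
        u += z                   # accumulate the sum in (4)
    return u

# Sanity check on a random sparse shift operator: with Phi^(0) = I
# and Phi^(k) = S for k >= 1, the filter reduces to sum_k S^k x.
rng = np.random.default_rng(0)
N, K = 6, 3
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.4)
x = rng.normal(size=N)

Phi = [np.eye(N)] + [S] * K
u = edge_varying_filter(Phi, x)
u_poly = sum(np.linalg.matrix_power(S, k) @ x for k in range(K + 1))
```

Each loop iteration costs one sparse matrix-vector product, which is the O(M + N) per-step complexity stated above.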
This further results in an overall computational complexity of order O(K(M + N)) for obtaining the filter output u in (4). This reduced number of parameters and computational cost is leveraged next to define graph neural network (GNN) architectures with a controlled number of parameters and computational complexity matched to the graph sparsity.

Remark 1. The presence of the edge (j, i) in graph G is interpreted here as signal components x_j and x_i being related by the given structure in the data. The shift operator entry S_ij is a measure of the expected similarity. Larger entries indicate linked signal components are more related to each other. Therefore, the definition of the shift operator S makes it a valid stand-in for any graph representation matrix. Forthcoming discussions are valid whether S is an adjacency or a Laplacian matrix in any of their various normalized and unnormalized forms. We use S to keep discussions generic.

3 EDGE VARYING GRAPH NEURAL NETWORKS

Edge varying graph filters are the basis for defining GNN architectures through composition with pointwise nonlinear functions. Formally, consider a set of L layers indexed by l = 1, ..., L and let A_l(S) = Σ_{k=0}^{K} Φ_l^(k:0) be the graph filter used at layer l. A GNN is defined by the recursive expression

$$ \mathbf{x}_l = \sigma\big( \mathbf{A}_l(\mathbf{S}) \mathbf{x}_{l-1} \big) = \sigma\left( \sum_{k=0}^{K} \boldsymbol{\Phi}_l^{(k:0)} \mathbf{x}_{l-1} \right) \qquad (6) $$

where we convene that x_0 = x is the input to the GNN and x_L is its output. To augment the representation power of GNNs, it is customary to add multiple node features per layer. We do this by defining matrices X_l = [x_l^1, ..., x_l^{F_l}] ∈ R^{N×F_l} in which each column x_l^f represents a different graph signal at layer l. These so-called features are cascaded through layers where they are processed with edge varying graph filters and composed with pointwise nonlinearities according to

$$ \mathbf{X}_l = \sigma\left( \sum_{k=0}^{K} \boldsymbol{\Phi}_l^{(k:0)} \mathbf{X}_{l-1} \mathbf{A}_{lk} \right) \qquad (7) $$

where A_lk ∈ R^{F_{l−1}×F_l} is a parameter matrix that affords the flexibility to process different features with different filter parameters. It is ready to see that (7) represents a bank of edge varying graph filters A_l^{fg}(S) applied to a set of F_{l−1} input features x_{l−1}^g to produce a set of F_l output features x_l^f. Indeed, if we let a_lk^{fg} = [A_lk]_{fg} denote the (f, g)-th entry of A_lk, (7) produces a total of F_{l−1}F_l intermediate features of the form¹

$$ \mathbf{u}_l^{fg} = \mathbf{A}_l^{fg}(\mathbf{S}) \mathbf{x}_{l-1}^{g} = \sum_{k=0}^{K} a_{lk}^{fg} \boldsymbol{\Phi}_l^{fg,(k:0)} \mathbf{x}_{l-1}^{g} \qquad (8) $$

for g = 1, ..., F_{l−1} and f = 1, ..., F_l. The features u_l^{fg} are then aggregated across all g and passed through a pointwise nonlinearity to produce the output features of layer l as

$$ \mathbf{x}_l^{f} = \sigma\left( \sum_{g=1}^{F_{l-1}} \mathbf{u}_l^{fg} \right). \qquad (9) $$

At layer l = 1 the input feature is a graph signal x_0^1 = x. This feature is passed through F_1 filters to produce F_1 higher-level features as per (8). The latter are then processed by a pointwise nonlinearity [cf. (9)] to produce F_1 output features x_1^f. The subsequent layers l > 1 start with F_{l−1} input features x_{l−1}^g that are passed through the filter bank A_l^{fg}(S) [cf. (8)] to produce the higher-level features u_l^{fg}. These are aggregated across all g = 1, ..., F_{l−1} and passed through a nonlinearity to produce the layer's output features x_l^f [cf. (9)]. In the last layer l = L, we consider without loss of generality that the number of output features is F_L = 1. This single feature x_L^1 = x_L is the output of the edge varying GNN or, for short, EdgeNet. Remark that the EdgeNet aggregates at each layer information from neighbors that are up to K hops away [cf. (7)]. This increases its flexibility to process intermediate features and generalizes the masking aggregation rule in [38], which can be seen as an EdgeNet of order K = 1.

The EdgeNet output is a function of the input signal x and the collection of filter banks A_l^{fg} [cf. (5)]. Group the filters in the filter tensor A(S) = {A_l^{fg}(S)}_{lfg} so as to define the GNN output as the mapping

$$ \Psi\big( \mathbf{x}; \mathcal{A}(\mathbf{S}) \big) := \mathbf{x}_L \quad \text{with} \quad \mathcal{A}(\mathbf{S}) = \big\{ \mathbf{A}_l^{fg}(\mathbf{S}) \big\}_{lfg}. \qquad (10) $$

The filter parameters are trained to minimize a loss over a training set of input-output pairs T = {(x, y)}. This loss measures the difference between the EdgeNet output x_L and the true value y averaged over the examples (x, y) ∈ T. As it follows from (5), the number of parameters in each filter is K(M + N) + N. This gets scaled by the number of filters per layer F_{l−1}F_l and the number of layers L. To provide an order bound on the number of parameters defining the EdgeNet, set the maximum feature number F = max_l F_l and observe the number of parameters per layer is of order (K(M + N) + N)F² + F². Likewise, the computational complexity at each layer is of order O(K(M + N)F²). This number of parameters and computational complexity are expected to be smaller than the corresponding numbers of a fully connected neural network. This is a consequence of exploiting the sparse nature of edge varying filters [cf. (4) and (5)]. A GNN can then be considered as an architecture that exploits the graph structure to reduce the number of parameters of a fully connected neural network. The implicit hypothesis is that signal components associated with different nodes are processed together in accordance with the nodes' proximity in the graph.

¹Throughout the paper, we will denote any graph filter by A_l^{fg}(S) to indicate that it is a matrix depending on the graph shift operator S. When this filter contains additional parameters beyond those on the edges (i.e., Φ_l^{fg,(k:0)}), we will indicate them with scalars a_lk^{fg} [cf. (8)]. For consistency, when expressing the bank of filters in a single recursion [cf. (7)], we will group the parameters a_lk^{fg} into the matrix A_lk.
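A single EdgeNet layer as in (7) can be sketched as follows. This is an illustrative NumPy rendition with assumed sizes and random parameters, not the authors' code; it builds the products Φ^(k:0)X recursively and mixes features with the matrices A_lk:

```python
import numpy as np

def edgenet_layer(Phi_l, A_l, X_in, sigma=np.tanh):
    """One EdgeNet layer, cf. (7): X_l = sigma(sum_k Phi^(k:0) X_{l-1} A_lk).

    Phi_l : list of K+1 edge varying shift matrices [Phi^(0), ..., Phi^(K)]
    A_l   : list of K+1 feature-mixing matrices A_lk, each F_in x F_out
    X_in  : N x F_in matrix of input features
    """
    Z = X_in                                        # Phi^(k:0) X, k = -1
    out = np.zeros((X_in.shape[0], A_l[0].shape[1]))
    for Phi_k, A_k in zip(Phi_l, A_l):
        Z = Phi_k @ Z                               # build Phi^(k:0) X recursively
        out += Z @ A_k                              # mix features with A_lk
    return sigma(out)

# Hypothetical sizes: N = 5 nodes, F_in = 2 -> F_out = 3 features, order K = 2.
rng = np.random.default_rng(1)
N, K, F_in, F_out = 5, 2, 2, 3
S = (rng.random((N, N)) < 0.5) * rng.normal(size=(N, N))
mask = (S != 0).astype(float) + np.eye(N)           # sparsity pattern of I + S

Phi_l = [np.diag(rng.normal(size=N))]               # Phi^(0): diagonal
Phi_l += [mask * rng.normal(size=(N, N)) for _ in range(K)]
A_l = [rng.normal(size=(F_in, F_out)) for _ in range(K + 1)]

X = rng.normal(size=(N, F_in))
X_out = edgenet_layer(Phi_l, A_l, X)
```

Stacking L such layers and training the Φ and A matrices against a loss over the pairs (x, y) gives the full EdgeNet map Ψ in (10).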
We will show different existing GNN architectures are particular cases of (8)-(9) using different subclasses of edge varying graph filters (Section 4), and the same is true for graph attention networks (Section 5). Establishing these relationships allows the proposal of natural architectural generalizations that increase the descriptive power of GNNs while still retaining manageable complexity.

Remark 2. The key property of the EdgeNet is to allocate trainable parameters for each edge in each shift. While this formulation improves the expressive power of a GNN, it affects its inductive capability over graphs (i.e., the ability to generalize to new unseen graphs) [39]. In the full form (7), the EdgeNet does not have inductive capabilities for graphs but only for graph signals. That is, it cannot be applied to test cases where new unseen graphs are present but can only be applied to test cases where new unseen graph signals are present for a fixed graph support. We shall see in the next section that graph convolutional neural networks instead sacrifice degrees of freedom to gain inductive capabilities also for graphs.

Remark 3. In the proposed EdgeNet, we considered graphs with single edge features, i.e., each edge is described by a single scalar. However, even when the graph has multiple edge features, say E, the EdgeNet extends readily to this scenario. This can be obtained by seeing the multi-edge featured graph as the union of E graphs G_e = (V, E_e) with identical node set V and respective shift operator matrix S_e. For {Φ_e^(k)} being the collection of the edge varying parameter matrices [cf. (1)] relative to the shift operator S_e, the l-th layer output X_l [cf. (7)] becomes

$$ \mathbf{X}_l = \sigma\left( \sum_{e=1}^{E} \sum_{k=0}^{K} \boldsymbol{\Phi}_{e,l}^{(k:0)} \mathbf{X}_{l-1} \mathbf{A}_{lk}^{e} \right). \qquad (11) $$

That is, the outputs of each filter are aggregated also over the edge-feature dimension. The number of parameters and computational complexity get scaled by E. The GNN architectures discussed in the remainder of this manuscript, as special cases of the EdgeNet, are readily extendable to the multi-edge feature scenario by replacing (7) with (11). The approach in [40] is the particular case of (11) with K = 1 and the parameter matrix reduced to a scalar.

Fig. 2. Permutation equivariance of machine learning on graphs. Many tasks in machine learning on graphs are equivariant to permutations (cf. Proposition 1) but not all are. E.g., we expect agents 3, 5, 8, and 12 to be interchangeable from the perspective of predicting product ratings from the ratings of other nodes. But from the perspective of community classification we expect 3 and 5 or 8 and 12 to be interchangeable, while 3 and 5 are not interchangeable with 8 and 12.

4 GRAPH CONVOLUTIONAL NEURAL NETWORKS

Graph convolutional neural networks (GCNNs) have shown great success in learning representations for graph data, with prominent variants introduced in [10]–[13]. All these variants can be written as GNN architectures in which the edge varying component in (7) is fixed and given by powers of the shift operator matrix, Φ_l^(k:0) = S^k,

$$ \mathbf{X}_l = \sigma\left( \sum_{k=0}^{K} \mathbf{S}^{k} \mathbf{X}_{l-1} \mathbf{A}_{lk} \right). \qquad (12) $$

By comparing (8) with (12), it follows this particular restriction yields a tensor A(S) with filters of the form

$$ \mathbf{A}_l^{fg}(\mathbf{S}) = \sum_{k=0}^{K} a_{lk}^{fg} \mathbf{S}^{k} \qquad (13) $$

for some order K and scalar parameters a_{l0}^{fg}, ..., a_{lK}^{fg}. Our focus in this section is to discuss variations on (13). To simplify the discussion, we omit the layer and feature indices and for the remainder of this section write

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} a_k \mathbf{S}^{k}. \qquad (14) $$

The filters in (14) are of the form in (5) with Φ^(0) = a_0 I_N and Φ^(k:0) = a_k S^k for k ≥ 1. By particularizing G to the line graph, (14) represents a linear time-invariant filter described by a regular convolution.
This justifies using the qualifier convolutional for an architecture with filters of the form (14). The appeal of the graph convolutional filters in (14) is that they reduce the number of parameters from the K(M + N) + N of the edge varying filters in (5) to just K + 1, yielding also a computational complexity of order O(KM). While we can reduce the number of parameters in several ways, the formulation in (14) is of note because it endows the resulting GNN with equivariance to permutations of the labels of the graph. We state this property formally in the following proposition.

Proposition 1. Let x be a graph signal defined on the vertices of a graph G = (V, E) with shift operator S. Consider also the output of a GCNN Ψ(x; A(S)) [cf. (10)] with input x and tensor A(S) = {A(S)} composed of filters of the form in (14). Then, for a permutation matrix P, it holds that

$$ \mathbf{P}^{\mathsf{T}} \Psi\big( \mathbf{x}; \mathcal{A}(\mathbf{S}) \big) = \Psi\big( \mathbf{P}^{\mathsf{T}}\mathbf{x}; \mathcal{A}(\mathbf{P}^{\mathsf{T}}\mathbf{S}\mathbf{P}) \big). $$

That is, the GCNN output operating on the graph G with input x is a permuted version of the GCNN output operating on the permuted graph G' = (V', E') with permuted shift operator S' = P^T S P and permuted input signal x' = P^T x.

Proof. See Appendix A.

Proposition 1 establishes that the output of a GCNN is independent of node labeling. This is important not just because graph signals are independent of labeling (therefore, so should be their processing) but because it explains how GCNNs exploit the internal signal symmetries. If two parts of the graph are topologically identical and the nodes support identical signal values, a GCNN yields identical outputs [41]–[43]. It must be emphasized that permutation equivariance is of use only inasmuch as this is a desirable property of the considered task. Permutation equivariance holds in, e.g., recommendation systems but does not hold in, e.g., community classification.
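Proposition 1 can also be verified empirically. The following sketch (a toy two-layer, single-feature GCNN with random coefficients, assumed for illustration) checks that permuting the graph and the input permutes the output:

```python
import numpy as np

def gcnn_output(S, x, coeffs, sigma=np.tanh):
    """Toy multi-layer, single-feature GCNN with filters of the form (14)."""
    for a in coeffs:  # one tap vector a = (a_0, ..., a_K) per layer
        x = sigma(sum(a_k * np.linalg.matrix_power(S, k) @ x
                      for k, a_k in enumerate(a)))
    return x

rng = np.random.default_rng(2)
N = 7
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.5)
x = rng.normal(size=N)
coeffs = [rng.normal(size=4), rng.normal(size=4)]   # two layers, K = 3

# A random permutation matrix P.
P = np.eye(N)[rng.permutation(N)]

# Proposition 1: P^T Psi(x; A(S)) == Psi(P^T x; A(P^T S P)).
lhs = P.T @ gcnn_output(S, x, coeffs)
rhs = gcnn_output(P.T @ S @ P, P.T @ x, coeffs)
```

The equality holds because (P^T S P)^k = P^T S^k P and the pointwise nonlinearity commutes with permutations.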
In the graph in Figure 2, we expect agents 3, 5, 8, and 12 to be interchangeable from the perspective of predicting product ratings from the ratings of other nodes. But from the perspective of community classification, we expect 3 and 5 or 8 and 12 to be interchangeable, while 3 and 5 are not interchangeable with 8 and 12. When equivariance is not a property of the task, GCNNs are not expected to do well. GCNNs will suffer in any problem in which local detail around a node is important. This is because the filter in (14) forces all nodes to weigh the information of all k-hop neighbors with the same parameter a_k irrespectively of the relative importance of different nodes and different edges. To avoid this limitation, we can use a GNN that relies on the edge varying filters (5) in which each node i learns a different parameter Φ_ij^(k) for each neighbor j. These two cases are analogous to CNNs processing time signals with conventional convolutional filters as opposed to a neural network that operates with arbitrarily time varying filters (i.e., filters whose coefficients change in time). The appealing intermediate solution is to use filters with controlled edge variability to mix the advantage of a permutation equivariant parameterization (Proposition 1) with the processing of local detail. We introduce architectures that construct different versions of filters with controlled edge variability in Sections 4.1-4.3.

Remark 4. Along with the above-referred works, the works in [16]–[18] and [14] also use versions of the convolutional filter in (14). Specifically, [16]–[18] consider single shifts on the graph with the shift operator being a learnable weight matrix, a Gaussian kernel, and a random walk, respectively. The work in [14] adopts multi-layer perceptrons along the feature dimension at each node, before exchanging information with their neighbors.
This is equivalent to (12) with the first layers having order K = 0 (depending on the depth of the MLP), followed by a final layer of order K = 1.

4.1 GNNs with Controlled Edge Variability

To build a GNN that fits between a permutation equivariant GCNN [cf. (14)] and a full edge varying GNN [cf. (5)], we use different filter parameters in different parts of the graph. Formally, let B = {B_1, ..., B_B} be a partition of the node set into B blocks with block B_i having B_i nodes. Define the tall matrix C_B ∈ {0, 1}^{N×B} such that [C_B]_{ij} = 1 if node i belongs to block B_j and 0 otherwise. Let also a_B^(k) ∈ R^B be a vector of block parameters of filter order k. Block varying graph filters are then defined as

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \mathrm{diag}\big( \mathbf{C}_{\mathcal{B}} \mathbf{a}_{\mathcal{B}}^{(k)} \big) \mathbf{S}^{k}. \qquad (15) $$

Filters in (15) use the parameter [a_B^(k)]_i for all nodes in block B_i. Block varying filters belong to the family of node varying graph filters [29] and are of the form in (5) with

$$ \boldsymbol{\Phi}^{(k:0)} = \mathrm{diag}\big( \mathbf{C}_{\mathcal{B}} \mathbf{a}_{\mathcal{B}}^{(k)} \big) \mathbf{S}^{k}. \qquad (16) $$

Substituting (16) into (5) generates block varying GNNs [44]. Block varying GNNs have B(K + 1)F² parameters per layer and a computational complexity of order O(KF²M). Alternatively, we can consider what we call hybrid filters, defined as linear combinations of convolutional filters and edge varying filters that operate on a subset of nodes; see Figure 3. Formally, let I ⊂ V denote an important subset of I = |I| nodes and define the shift matrices Φ_I^(k) such that the diagonal matrix Φ_I^(0) has entries [Φ_I^(0)]_ii ≠ 0 for all i ∈ I, and [Φ_I^(k)]_ij = 0 for all i ∉ I or (i, j) ∉ E and k ≥ 1. That is, the parameter matrices Φ_I^(k) may contain nonzero elements only at rows i that belong to set I and with node j being a neighbor of i. We define hybrid filters as those of the form

$$ \mathbf{A}(\mathbf{S}) = \sum_{k=0}^{K} \left( \prod_{k'=0}^{k} \boldsymbol{\Phi}_{\mathcal{I}}^{(k')} + a_k \mathbf{S}^{k} \right). \qquad (17) $$

Substituting (17) in (5) generates hybrid GNNs.
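A block varying filter (15) is straightforward to realize once the assignment matrix C_B is fixed. The sketch below (an illustration with an assumed two-block partition and random taps) also checks the consistency case: with a single block, (15) must collapse to the convolutional filter (14):

```python
import numpy as np

def block_varying_filter(S, C_B, a_B, x):
    """Block varying graph filter, cf. (15):
    u = sum_k diag(C_B a_B^(k)) S^k x.

    C_B : N x B block-assignment matrix (node i -> its block)
    a_B : list of K+1 vectors of length B (one tap per block per order k)
    """
    u = np.zeros_like(x)
    z = x.copy()                       # S^k x, built recursively
    for k, a_k in enumerate(a_B):
        if k > 0:
            z = S @ z
        u += (C_B @ a_k) * z           # diag(C_B a^(k)) S^k x, nodewise scale
    return u

rng = np.random.default_rng(3)
N, B, K = 6, 2, 2
S = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.5)
x = rng.normal(size=N)

# Hypothetical partition: nodes 0-2 in block 0, nodes 3-5 in block 1.
C_B = np.zeros((N, B))
C_B[:3, 0] = 1.0
C_B[3:, 1] = 1.0
a_B = [rng.normal(size=B) for _ in range(K + 1)]
u = block_varying_filter(S, C_B, a_B, x)

# With B = 1 block the filter must reduce to the convolutional filter (14).
C_1 = np.ones((N, 1))
a_1 = [np.array([float(k + 1)]) for k in range(K + 1)]
u_block = block_varying_filter(S, C_1, a_1, x)
u_conv = sum((k + 1) * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))
```

At the other extreme, one block per node recovers the full node varying filter of [29].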
In essence, nodes i ∈ I learn edge dependent parameters, which may also differ across nodes, while nodes i ∉ I learn global parameters. Hybrid filters are defined by a number of parameters that depends on the total number of neighbors of all nodes in the importance set I. Define then M_I = Σ_{i∈I} N_i and observe Φ_I^(0) has I nonzero entries since it is a diagonal matrix, while the matrices Φ_I^(k) for k ≥ 1 each have M_I nonzero values. We then have KM_I + I parameters in the edge varying filters and K + 1 parameters in the convolutional filters. We therefore have a total of (I + KM_I + K + 1)F² parameters per layer in a hybrid GNN. The implementation cost of a hybrid GNN layer is of order O(KF²(M + N)) since both terms in (17) respect the graph sparsity. Block GNNs depend on the choice of blocks B and hybrid GNNs on the choice of the importance set I. We explore the use of different heuristics based on centrality and clustering measures in Section 6, where we will see that the choice of B and I is in general problem specific.

Fig. 3. Hybrid Edge Varying Filter [cf. (17)]. The nodes in set I = {2, 7} are highlighted. Nodes 2 and 7 have edge varying parameters associated with their incident edges. All nodes, including 2 and 7, also use the global parameter a_k as in a regular convolutional graph filter.

4.2 Spectral Graph Convolutional Neural Networks

The convolutional operation of the graph filter in (14) can be represented in the spectral domain. To do so, consider the input-output relationship u = A(S)x along with the eigenvector decomposition of the shift operator S = VΛV^{−1}.
Projecting the input and output signals onto the eigenvector space of $S$ creates the so-called graph Fourier transforms $\tilde{x} := V^{-1}x$ and $\tilde{u} := V^{-1}u$ [45], which allow us to write

$$\tilde{u} = \bigg( \sum_{k=0}^{K} a_k \Lambda^k \bigg) \tilde{x}. \quad (18)$$

Eq. (18) reveals that convolutional graph filters are pointwise in the spectral domain, due to the diagonal nature of the eigenvalue matrix $\Lambda$. We can, therefore, define the filter's spectral response $a : \mathbb{R} \to \mathbb{R}$ as the function

$$a(\lambda) = \sum_{k=0}^{K} a_k \lambda^k, \quad (19)$$

which is a single-variable polynomial characterizing the graph filter $A(S)$. If we allow for filters of order $K = N - 1$, there is always a set of parameters $a_k$ such that $a(\lambda_i) = \tilde{a}_i$ for any set of spectral responses $\tilde{a}_i$ [26]. Thus, training over the set of spectral parameters $a(\lambda_1), \ldots, a(\lambda_N)$ is equivalent to training over the space of (nodal) parameters $a_0, \ldots, a_{N-1}$. GCNNs were first introduced in [9] using the spectral representation of graph filters in (19).

By using edge varying graph filters [cf. (5)], we can propose an alternative parameterization of the space of filters of order $N$ which, as we will see, may have some advantages. To explain this better, let $\mathcal{J}$ be the index set defining the zero entries of $S + I_N$ and let $C_{\mathcal{J}} \in \{0,1\}^{|\mathcal{J}| \times N^2}$ be a binary selection matrix whose rows are those of $I_{N^2}$ indexed by $\mathcal{J}$. Let also $B$ be a basis matrix that spans the null space of

$$C_{\mathcal{J}} \big( V^{-\top} \ast V \big), \quad (20)$$

where "$\ast$" is the (column-wise) Khatri-Rao product [46]; this matrix arises from applying the column-wise vectorization operator $\operatorname{vec}(\cdot)$ to the sparsity constraint on matrices of the form $V \Lambda^{(k)} V^{-1}$. Then, the following proposition from [34] quantifies the spectral response of a particular class of the edge varying graph filter in (5).

Proposition 2. Consider the subclass of the edge varying graph filters in (5) where the parameter matrices $\big(\Phi^{(0)} + \Phi^{(1)}\big)$ and $\Phi^{(k)}$ for all $k = 2, \ldots, K$ are restricted to those that share the eigenvectors with $S$, i.e., $\big(\Phi^{(0)} + \Phi^{(1)}\big) = V \Lambda^{(1)} V^{-1}$ and $\Phi^{(k)} = V \Lambda^{(k)} V^{-1}$ for all $k = 2, \ldots, K$.
The spectral response of this subclass of edge varying filters has the form

$$a(\Lambda) = \sum_{k=1}^{K} \prod_{k'=1}^{k} \Lambda^{(k')} = \sum_{k=1}^{K} \prod_{k'=1}^{k} \operatorname{diag}\big( B \mu^{(k')} \big), \quad (21)$$

where $B$ is an $N \times b$ basis kernel matrix that spans the null space of (20) and $\mu^{(k)}$ is a $b \times 1$ vector containing the expansion parameters of $\Lambda^{(k)}$ into $B$.

Proof. See Appendix B.

Proposition 2 provides a subclass of the edge varying graph filters where, instead of learning $K(M+N) + N$ parameters, we learn the $Kb$ entries of $\mu^{(1)}, \ldots, \mu^{(K)}$ in (21). These filters build the output features as a pointwise multiplication between the filter spectral response $a(\Lambda)$ and the input spectral transform $\tilde{x} = V^{-1}x$, i.e., $u = V a(\Lambda) \tilde{x} = V a(\Lambda) V^{-1} x$. Following the analogies with conventional signal processing, (21) represents the spectral response of a convolutional edge varying graph filter.

Spectral GCNNs are a particular case of (21) with order $K = 1$ and a kernel $B$ independent from the graph (e.g., a spline kernel). Besides generalizing [9], a graph-dependent kernel allows implementing (21) in the vertex domain through an edge varying filter of the form (5); hence, it has a complexity of order $\mathcal{O}(K(M+N))$, in contrast to the $\mathcal{O}(N^2)$ required for graph-independent kernels. The edge varying implementation also captures local detail up to a region of radius $K$ from a node, while retaining a spectral interpretation. Nevertheless, both the graph-dependent GNN [cf. (21)] and the graph-independent GNN [9] are mostly of theoretical interest, since they require the eigendecomposition of the shift operator $S$. This implies a cubic complexity in the number of nodes, and an accurate learning process will suffer from numerical instabilities, since it requires an order $K \approx N$ and hence high order matrix powers $S^k$.
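The pointwise spectral behavior in (18)-(19) can be verified numerically; a minimal sketch with an arbitrary symmetric shift operator (all values are illustrative assumptions):

```python
import numpy as np

# Sketch: a polynomial graph filter acts pointwise in the spectral domain,
# i.e., u = A(S) x equals V a(Lambda) V^{-1} x with a(λ) from Eq. (19).
# Graph and parameters below are illustrative assumptions.

rng = np.random.default_rng(1)
N, K = 8, 3
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2                        # symmetric shift operator
np.fill_diagonal(S, 0)

a = rng.standard_normal(K + 1)           # filter taps a_0, ..., a_K
x = rng.standard_normal(N)

# Vertex-domain output: u = sum_k a_k S^k x
u_vertex = sum(a[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Spectral-domain output: u = V diag(a(λ_i)) V^{-1} x
lam, V = np.linalg.eigh(S)               # S = V Λ V^T for symmetric S
x_hat = V.T @ x                          # graph Fourier transform of x
response = sum(a[k] * lam**k for k in range(K + 1))   # a(λ) pointwise
u_spectral = V @ (response * x_hat)

print(np.allclose(u_vertex, u_spectral))  # True
```

For a symmetric $S$ we have $V^{-1} = V^\top$, so the graph Fourier transform reduces to a projection onto the eigenvectors.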
4.3 ARMA Graph Convolutional Neural Networks

We can increase the descriptive power of the filter in (14) by growing its order $K$, which allows learning filters with a more discriminative polynomial frequency response [cf. (19)]. However, this also increases the number of parameters and the computational cost. Most importantly, it introduces numerical issues associated with high order matrix powers $S^k$, ultimately leading to poor interpolatory and extrapolatory performance [47]. These challenges can be overcome by considering graph filters with a rational spectral response, since rational functions have better interpolatory and extrapolatory properties than polynomials [47]-[49]. Rational functions can also achieve more complicated responses with lower degrees in both numerator and denominator, and thus fewer learnable parameters. Autoregressive moving average (ARMA) graph filters [30] serve this purpose and implement rational functions of the form

$$A(S) = \bigg( I + \sum_{p=1}^{P} a_p S^p \bigg)^{-1} \bigg( \sum_{q=0}^{Q} b_q S^q \bigg) := P^{-1}(S)\, Q(S), \quad (22)$$

where we have defined $P(S) := I + \sum_{p=1}^{P} a_p S^p$ and $Q(S) := \sum_{q=0}^{Q} b_q S^q$. The ARMA filter in (22) is defined by $P$ denominator parameters $a = [a_1, \ldots, a_P]^\top$ and $Q + 1$ numerator parameters $b = [b_0, \ldots, b_Q]^\top$. The input-output relationship $u = A(S)x$ of the ARMA filter can be represented in the spectral domain as [cf. (18)]

$$\tilde{u} = \bigg( I + \sum_{p=1}^{P} a_p \Lambda^p \bigg)^{-1} \bigg( \sum_{q=0}^{Q} b_q \Lambda^q \bigg) \tilde{x}. \quad (23)$$

It follows that ARMA filters are also pointwise operators in the spectral domain, characterized by the rational spectral response function

$$a(\lambda) = \bigg( \sum_{q=0}^{Q} b_q \lambda^q \bigg) \bigg/ \bigg( 1 + \sum_{p=1}^{P} a_p \lambda^p \bigg). \quad (24)$$

In particular, it follows that the space of ARMA filters defined by (22) is equivalent to the space of spectral ARMA filters defined by (24), which is equivalent to the space of spectral filters in (19) and, in turn, to the graph convolutional filters in (14).
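On a small graph, the ARMA filter in (22) can be sketched by solving the linear system $P(S)u = Q(S)x$ directly, and checked against the pointwise rational response in (24); graph and coefficients below are illustrative assumptions:

```python
import numpy as np

# Sketch of the ARMA graph filter in Eq. (22): u = P(S)^{-1} Q(S) x,
# computed here by solving the linear system P(S) u = Q(S) x directly.
# Graph and coefficients are illustrative assumptions.

rng = np.random.default_rng(2)
N = 8
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2
np.fill_diagonal(S, 0)
S = S / np.abs(np.linalg.eigvalsh(S)).max()   # normalize for stability

a = np.array([0.3, -0.1])        # denominator parameters a_1, ..., a_P
b = np.array([1.0, 0.5, 0.2])    # numerator parameters b_0, ..., b_Q

P_S = np.eye(N) + sum(a[p] * np.linalg.matrix_power(S, p + 1) for p in range(len(a)))
Q_S = sum(b[q] * np.linalg.matrix_power(S, q) for q in range(len(b)))

x = rng.standard_normal(N)
u = np.linalg.solve(P_S, Q_S @ x)            # ARMA filter output

# Check the pointwise rational response of Eq. (24) in the spectral domain.
lam, V = np.linalg.eigh(S)
resp = (b[0] + b[1] * lam + b[2] * lam**2) / (1 + a[0] * lam + a[1] * lam**2)
print(np.allclose(u, V @ (resp * (V.T @ x))))  # True
```

The direct solve is what the Jacobi implementation of Section 4.3 replaces when the graph is too large for an $\mathcal{O}(N^3)$ inversion.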
That the filter spaces are equivalent does not mean the filters have the same properties. We expect ARMA filters to produce useful spectral responses with fewer parameters than the convolutional filters in (14) or the spectral filters in (19).

As it follows from (22), we need to compute the inverse matrix $P^{-1}(S)$ to obtain the ARMA output. The latter incurs a cubic complexity, which is computationally unaffordable unless the graph is of limited dimensions. When the graph is large, we need an iterative method that exploits the sparsity of the graph to approximate the inverse at reduced cost [30], [49]. Due to its faster convergence, we consider a parallel structure that consists of first transforming the polynomial ratio in (24) into its partial fraction decomposition form and subsequently using the Jacobi method to approximate the inverse. While other Krylov approaches are also possible, the parallel Jacobi method offers a better tradeoff between computational complexity and convergence rate.

Partial fraction decomposition of ARMA filters. The partial fraction decomposition of the rational function $a(\lambda)$ in (24) provides an equivalent representation of ARMA filters. Let $\gamma = [\gamma_1, \ldots, \gamma_P]^\top$ be a set of poles, $\beta = [\beta_1, \ldots, \beta_P]^\top$ a corresponding set of residuals, and $\alpha = [\alpha_0, \ldots, \alpha_K]^\top$ a set of direct terms; we can then rewrite (24) as

$$a(\lambda) = \sum_{p=1}^{P} \frac{\beta_p}{\lambda - \gamma_p} + \sum_{k=0}^{K} \alpha_k \lambda^k, \quad (25)$$

where $\alpha$, $\beta$, and $\gamma$ are computed from $a$ and $b$. A graph filter whose spectral response is as in (25) is one in which the spectral variable $\lambda$ is replaced by the shift operator variable $S$. It follows that if $\alpha$, $\beta$, and $\gamma$ are chosen to make (25) and (24) equivalent, the filter in (22) is, in turn, equivalent to

$$A(S) = \sum_{p=1}^{P} \beta_p \big( S - \gamma_p I \big)^{-1} + \sum_{k=0}^{K} \alpha_k S^k. \quad (26)$$

The equivalence of (22) and (26) means that instead of training $a$ and $b$ in (22), we can train $\alpha$, $\beta$, and $\gamma$ in (26).

Jacobi implementation of single-pole filters.
To circumvent the matrix inverses in (26), we first consider each single-pole filter in (26) separately and implement the input-output relationship

$$u_p = \beta_p \big( S - \gamma_p I \big)^{-1} x. \quad (27)$$

Expression (27) is equivalent to the linear equation $(S - \gamma_p I) u_p = \beta_p x$, which we can solve iteratively through a Jacobi recursion. This requires separating $(S - \gamma_p I)$ into diagonal and off-diagonal components. We therefore begin by defining the diagonal degree matrix $D = \operatorname{diag}(S)$ so that the shift operator can be written as

$$S = D + \big( S - D \big) := \operatorname{diag}(S) + \big( S - \operatorname{diag}(S) \big). \quad (28)$$

With this definition, we write $(S - \gamma_p I_N) = (D - \gamma_p I_N) + (S - D)$, which is a decomposition into the diagonal term $(D - \gamma_p I_N)$ and the off-diagonal term $(S - D)$. Jacobi iteration $k$ for (27) is given by the recursive expression

$$u_{pk} = -\big( D - \gamma_p I_N \big)^{-1} \Big[ \beta_p x - \big( S - D \big) u_{p(k-1)} \Big], \quad (29)$$

initialized with $u_{p0} = x$. We can unroll this iteration to write an explicit relationship between $u_{pK}$ and $x$. To do so, we define the parameterized shift operator

$$R(\gamma_p) = -\big( D - \gamma_p I_N \big)^{-1} \big( S - D \big) \quad (30)$$

and use it to write the $K$th iterate of the Jacobi recursion as

$$u_{pK} = \beta_p \sum_{k=0}^{K-1} R^k(\gamma_p)\, x + R^K(\gamma_p)\, x. \quad (31)$$

For a convergent Jacobi recursion, the signal $u_{pK}$ in (31) converges to the output $u_p$ of the single-pole filter in (27). Truncating (31) at a finite $K$ yields an approximation in which single-pole filters are written as polynomials in the shift operator $R(\gamma_p)$. That is, a single-pole filter is approximated by a convolutional filter of order $K$ [cf. (14)] in which the shift operator $S$ of the graph is replaced by the shift operator $R(\gamma_p)$ defined in (30). This convolutional filter uses parameters $\beta_p$ for $k = 0, \ldots, K-1$ and $1$ for $k = K$.

Jacobi ARMA filters and Jacobi ARMA GNNs.
Assuming we use Jacobi iterations to approximate all single-pole filters in (26) and that we truncate all of these iterations at $K$, we can write ARMA filters as

$$A(S) = \sum_{p=1}^{P} H_K\big( R(\gamma_p) \big) + \sum_{k=0}^{K} \alpha_k S^k, \quad (32)$$

where $H_K(R(\gamma_p))$ is a $K$th order Jacobi approximation of the single-pole filter, which, as per (31), is given by

$$H_K\big( R(\gamma_p) \big) = \beta_p \sum_{k=0}^{K-1} R^k(\gamma_p) + R^K(\gamma_p). \quad (33)$$

A Jacobi ARMA filter of order $(P, K)$ is defined by (32) and (33). The order $P$ represents the number of poles in the filter and the order $K$ the number of Jacobi iterations we consider appropriate to approximate individual single-pole filters. Notice the number of taps $K$ in the filter $\sum_{k=0}^{K} \alpha_k S^k$ need not be the same as the number of Jacobi iterations used in (33), but we use the same value to avoid complicating notation.

For sufficiently large $K$, (32)-(33), (26), and (22) are all equivalent expressions of ARMA filters of order $(P, Q)$. We could train parameters using any of these equivalent expressions, but we advocate for the use of (32)-(33), as no inversions are necessary except for the elementary inversion of the diagonal matrix $(D - \gamma_p I)$. It is interesting to note that in this latter form ARMA filters are reminiscent of the convolutional filters in (14), but the similarity is superficial. In (14), we train $K + 1$ parameters $a_k$ that multiply shift operator powers $S^k$. In (32)-(33) we also train $K + 1$ parameters of this form in the filter $\sum_{k=0}^{K} \alpha_k S^k$, but this is in addition to the parameters $\beta_p$ and $\gamma_p$ of each of the single-pole filter approximations $H_K(R(\gamma_p))$. These single-pole filters are themselves reminiscent of the convolutional filters in (14), but the similarity is again superficial. Instead of parameters $a_k$ that multiply shift operator powers $S^k$, the filters in (33) train a parameter $\gamma_p$ that represents a constant subtracted from the diagonal entries of the shift operator $S$.
The fact that this is equivalent to an ARMA filter suggests (32)-(33) may help design more discriminative filters. We corroborate in Section 6 that GNNs using (32)-(33) outperform GNNs that utilize the filters in (14). An ARMA GNN has $(2P + K + 1)F^2$ parameters per layer and a computational complexity of order $\mathcal{O}\big(F^2 P(MK + N)\big)$. This decomposes as $\mathcal{O}(PN)$ to invert the diagonal matrices $(D - \gamma_p I_N)$; $\mathcal{O}(PM)$ to scale the nonzeros of $(S - D)$ by the inverse diagonal; and $\mathcal{O}(PKM)$ to obtain the outputs of the Jacobi ARMA filters (32) of order $K$.

ARMA GNNs as EdgeNets. ARMA GNNs (ARMANets) are another subclass of the EdgeNet. To see this, consider that each shift operator $R(\gamma_p)$ in (30) shares its support with $I_N + S$. Hence, we can express the graph filter in (32) as the union of $P + 1$ edge varying graph filters. The first $P$ of these filters have parameter matrices of the form

$$\Phi_p^{(k:0)} = \begin{cases} \beta_p R^k(\gamma_p), & k = 0, \ldots, K-1, \\ R^K(\gamma_p), & k = K, \end{cases}$$

while the last filter captures the direct term with edge varying parameter matrices $\Phi^{(k:0)} = \alpha_k S^k$ [cf. Section 4]. The union of these edge varying filters has the expression

$$A(S)\,x = \sum_{k=0}^{K} \bigg( \sum_{p=1}^{P} \Phi_p^{(k:0)} + \Phi^{(k:0)} \bigg) x, \quad (34)$$

which, by further grouping the terms of the same order $k$, leads to a single edge varying graph filter of the form in (5).

ARMANet provides an alternative parameterization of the EdgeNet that is different from that of the polynomial convolutional filters in (14). In particular, ARMANets promote the use of multiple polynomial filters of smaller order (i.e., the number of Jacobi iterations) with parameters shared between them. Each of the filters $H_K(R(\gamma_p))$ depends on two parameters, $\beta_p$ and $\gamma_p$. We believe this parameter sharing among the different orders and the different nodes is behind the improved performance of the ARMA GNN compared with the single polynomial filters in (14).
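A minimal numerical sketch of solving the single-pole filter in (27) with a standard Jacobi iteration; the graph, pole $\gamma_p$, and residual $\beta_p$ are illustrative assumptions chosen so that the recursion converges:

```python
import numpy as np

# Sketch of a Jacobi solve for the single-pole filter in Eq. (27):
# u_p = beta_p (S - gamma_p I)^{-1} x, iterating
# u_k = (D - gamma_p I)^{-1} (beta_p x - (S - D) u_{k-1}).
# Graph, pole gamma_p, and residual beta_p are illustrative assumptions.

rng = np.random.default_rng(3)
N = 10
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
S = (S + S.T) / 2
np.fill_diagonal(S, 0)
S = S / np.abs(np.linalg.eigvalsh(S)).max()    # spectral radius 1

gamma, beta = 2.0, 0.7                         # |gamma| > rho(S) => convergence
D = np.diag(np.diag(S))                        # diagonal part (zero here)
M_inv = 1.0 / (np.diag(S) - gamma)             # inverse of diagonal D - gamma I
off = S - D                                    # off-diagonal part

x = rng.standard_normal(N)
u = x.copy()                                   # u_{p0} = x
for _ in range(50):
    u = M_inv * (beta * x - off @ u)           # Jacobi update

u_exact = beta * np.linalg.solve(S - gamma * np.eye(N), x)
print(np.allclose(u, u_exact, atol=1e-8))      # True
```

With $S$ normalized to unit spectral radius and $|\gamma_p| > 1$, the iteration matrix has spectral radius below one, so a few dozen sparse iterations replace the $\mathcal{O}(N^3)$ inversion.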
Given also the hybrid solutions developed in Section 4.1 for the polynomial filters, a direction that may attain further improvements is that of ARMA GNN architectures with controlled edge variability.

Fig. 4. Jacobi Autoregressive Moving Average Filter. The input signal $x$ is processed by a parallel bank of filters. One of these filters is a convolutional filter of the form in (14) operating w.r.t. the shift operator $S$ (highlighted in red). The remaining filters operate w.r.t. the scaled shift operators of (30) (highlighted in blue). All filter outputs are summed together to yield the overall Jacobi ARMA output.

ARMANet generalizes the architecture in [15]: instead of restricting the polynomials in (22) to Cayley polynomials, it allows the use of general polynomials. Iterative approaches to implement an ARMA graph filter [30], [50] have recently been used to develop GNNs that resemble this recursion [51], [52]. However, neither [51] nor [52] implements an ARMA layer as (23) does. Instead, their propagation rule can be seen either as that used in graph recurrent neural networks [53, eq. (5)] but with a constant input, or as conceptually similar to that used in iterative sparse coding [54].

5 GRAPH CONVOLUTIONAL ATTENTION NETWORKS

Graph convolutional neural networks parameterize the EdgeNet by using a fixed underlying shift operator $S$ and learning only the filter parameters. However, this shift operator may often be estimated from data disjointly from the GNN task, or its specific weights may be unknown. In these instances, we can use the graph attention mechanism [19] to parameterize the EdgeNet in a way that learns both the shift operator weights and the convolutional filter parameters for the task at hand.
We propose the graph convolutional attention network (GCAT), which utilizes filters as in (12) but convolutional in a layer-specific matrix $\Phi_l = \Phi$ that may differ from the shift operator $S$:

$$X_l = \sigma\bigg( \sum_{k=0}^{K} \Phi^k X_{l-1} A_k \bigg). \quad (35)$$

Note that $A_k = A_{lk}$ and $\Phi = \Phi_l$ are layer-dependent, but we omit the layer index to simplify notation. Since matrix $\Phi$ shares the sparsity pattern of $S$, (35) defines a GNN as per (7). Matrix $\Phi$ is learned from the features $X_{l-1}$ passed from layer $l-1$ following the attention mechanism [19]. Specifically, we consider a trainable matrix $B \in \mathbb{R}^{F_{l-1} \times F_l}$ and vector $e \in \mathbb{R}^{2F_l}$, and compute the edge scores

$$\alpha_{ij} = \sigma\Big( e^\top \big[ [X_{l-1} B]_i ,\, [X_{l-1} B]_j \big]^\top \Big) \quad (36)$$

for all edges $(i,j) \in \mathcal{E}$. In (36), we start with the features $X_{l-1}$ and mix them as per the parameters in $B$. This produces a collection of graph signals $X_{l-1}B$ in which each node $i$ has $F_l$ features that correspond to the $i$th row $[X_{l-1}B]_i$ of the product matrix $X_{l-1}B$. The features at node $i$ are concatenated with the features of node $j$, and the resulting vector of $2F_l$ components is multiplied by vector $e$. This product produces the score $\alpha_{ij}$ after passing through the nonlinearity $\sigma(\cdot)$. Note that $B = B_l$ and $e = e_l$ are global parameters for all scores $\alpha_{ij} = \alpha_{lij}$ and depend on the layer index $l$. As in the case of $A_k$ and $\Phi$ in (35), we omit this index for simplicity. The score $\alpha_{ij}$ could be used directly as an entry of the matrix $\Phi$, but to encourage attention sparsity we pass $\alpha_{ij}$ through a local soft maximum operator

$$\Phi_{ij} = \exp(\alpha_{ij}) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp(\alpha_{ij'}) \bigg)^{-1}. \quad (37)$$

The soft maximum assigns edge weights $\Phi_{ij}$ close to 1 to the largest of the edge scores $\alpha_{ij}$ and weights close to 0 to the rest. See also Figure 5(a).

In Section 2, we introduced arbitrary edge varying graph filters [cf. (5)], which we leveraged in Section 3 to build edge varying GNNs [cf. (6)-(7)].
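The scoring and normalization steps in (36)-(37) can be sketched as follows; the graph, features, and trainable parameters $B$ and $e$ are illustrative assumptions, and we use a leaky ReLU for $\sigma$, as in [19]:

```python
import numpy as np

# Sketch of the attention scores (36) and neighborhood soft maximum (37).
# Graph, features, and trainable parameters B, e are illustrative assumptions.

rng = np.random.default_rng(4)
N, F_in, F_out = 5, 3, 4
A = (rng.random((N, N)) < 0.5).astype(float)    # adjacency support (sparsity of S)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                          # undirected support

X = rng.standard_normal((N, F_in))              # features X_{l-1}
B = rng.standard_normal((F_in, F_out))          # trainable mixing matrix
e = rng.standard_normal(2 * F_out)              # trainable scoring vector

Z = X @ B                                       # mixed features, one row per node
leaky = lambda t: np.where(t > 0, t, 0.2 * t)   # sigma in (36)

Phi = np.zeros((N, N))
for i in range(N):
    nbrs = list(np.flatnonzero(A[i])) + [i]     # N_i together with i itself
    scores = np.array([leaky(e @ np.concatenate([Z[i], Z[j]])) for j in nbrs])
    weights = np.exp(scores) / np.exp(scores).sum()   # soft maximum, Eq. (37)
    Phi[i, nbrs] = weights

# Each row of Phi sums to one over the closed neighborhood of node i.
print(np.allclose(Phi.sum(axis=1), 1.0))        # True
```

The learned $\Phi$ then replaces $S$ in the convolutional filter of (35); its rows are normalized over each closed neighborhood, which is the attention-sparsity effect of the soft maximum.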
In Section 4, we pointed out that edge varying graph filters leave too many degrees of freedom in the learning parameterization, a problem we could overcome with the use of graph convolutional filters [cf. (14)]. The latter suffer from the opposite problem, as they may excessively constrict the GNN. GATs provide a solution of intermediate complexity. Indeed, the filters in (35) allow us to build a GNN with convolutional graph filters where the shift operator $\Phi$ is learned ad hoc in each layer to represent the required abstraction between nodes. The edges of this shift operator try to choose neighbors whose values should most influence the computations at a particular node. This is as in any arbitrary edge varying graph filter, but the novelty of GATs is to reduce the number of learnable parameters by tying edge values to the matrix $B$ and vector $e$; observe that in (36) $e$ is the same for all edges. Thus, the computation of the scores $\alpha_{ij}$ depends on the $F_{l-1} \times F_l$ parameters in $B$ and the $2F_l$ parameters in $e$. This is of order no more than $F^2$ if we make $F = \max_l F_l$. It follows that for the GCAT in (35) the number of learnable parameters is at most $F^2 + 2F + F^2(K+1)$, which depends on design choices and is independent of the number of edges. We point out that, since $\Phi$ respects the graph sparsity, the computational complexity of implementing (35) and its parameterization is of order $\mathcal{O}(F(NF + KM))$.

5.1 Edge varying GAT networks

The idea of using attention mechanisms to estimate the entries of a shift operator $\Phi$ can be extended to estimate the entries $\Phi^{(k:0)}$ of an edge varying graph filter. To be specific, we

(a) Convolutional GAT filter [cf. (35)-(37)].
(b) Multi-parameter edge varying GAT [cf. (38)-(41)].

Fig. 5. Higher-order Graph Attention Filters. (a) Graph convolutional attention filter. The input features $X_{l-1}$ are shifted by the same edge varying shift operator $\Phi(B, e)$ and weighted by different parameter matrices $A_k$. The edge varying parameters in all $\Phi(B, e)$ are parameterized by the same matrix $B$ and vector $e$ following the attention mechanism. (b) Edge varying GAT filter. The input features $X_{l-1}$ are shifted by different edge varying shift operators $\Phi^{(k)}(B_k, e_k)$ and weighted by different parameter matrices $A_k$. The edge varying parameters in the different $\Phi^{(k)}(B_k, e_k)$ are parameterized by a different matrix $B_k$ and vector $e_k$ following the attention mechanism.

TABLE 1
Properties of Different Graph Neural Network Architectures. The parameters and complexity are considered per layer. Architectures in bold are proposed in this work. Legend: N - number of nodes; M - number of edges; F - maximum number of features; K - recursion order; b - dimension of the kernel in (21); B - number of blocks in (15); I - the set of important nodes in (15) and (17); M_I - total neighbors for the nodes in I; P - parallel J-ARMANet branches; R - parallel attention branches. * Self-loops are not considered. ** The eigendecomposition cost O(N^3) is computed once.

| Architecture | Expression | Order parameters* O(·) | Order of complexity*,** O(·) |
| Fully connected | n/a | N²F² | N²F² |
| Edge varying | Eq. (7) | K(M+N)F² | KF²(M+N) |
| GCNN [10]-[13] | Eq. (12) | KF² | KF²M |
| Node varying [44] | Eq. (15) | BKF² | KF²M |
| Hybrid edge varying | Eq. (17) | (|I| + K M_I)F² | KF²(M+N) |
| Spec. edge varying GNN** | Eq. (21) | KbF² | KF²(M+N) |
| Spec. kernel GCNN** [9] | Eq. (21) for K = 1 | bF² | N²F² |
| ARMANet | Eq. (22)-(33) | (P+K)F² | F²P(MK+N) |
| GCAT | Eq. (35) | RKF² | R(NF²+KFM) |
| GAT [19] | Eq. (35) for K = 1 | RF² | R(NF²+FM) |
| Edge varying GAT | Eq. (38) | RKF² | RK(NF²+MF) |

propose to implement a generic GNN as defined by recursion (7), which we repeat here for ease of reference:

$$X_l = \sigma\bigg( \sum_{k=0}^{K} \Phi^{(k:0)} X_{l-1} A_k \bigg). \quad (38)$$

Further recall that each edge varying filter parameter matrix $\Phi^{(k:0)}$ is itself defined recursively as [cf. (5)]

$$\Phi^{(k:0)} = \Phi^{(k)} \Phi^{(k-1:0)} = \prod_{k'=0}^{k} \Phi^{(k')}. \quad (39)$$

We propose to generalize (36) so that we compute a different matrix $\Phi^{(k)}$ for each filter order $k$. Consider then matrices $B_k$ and vectors $e_k$ to compute the edge scores

$$\alpha_{ij}^{(k)} = \sigma\Big( e_k^\top \big[ [X_{l-1} B_k]_i ,\, [X_{l-1} B_k]_j \big]^\top \Big) \quad (40)$$

for all edges $(i,j) \in \mathcal{E}$. As in (36), we could use $\alpha_{ij}^{(k)}$ as edge weights in $\Phi^{(k)}$, but to promote attention sparsity we send $\alpha_{ij}^{(k)}$ through a soft maximum function to yield the edge weights

$$\Phi_{ij}^{(k)} = \exp\big(\alpha_{ij}^{(k)}\big) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp\big(\alpha_{ij'}^{(k)}\big) \bigg)^{-1}. \quad (41)$$

Each of the edge varying matrices $\Phi^{(k)}$ for $k = 1, \ldots, K$ is parameterized by the tuple of transform parameters $(B_k, e_k)$. Put simply, we are using a different GAT mechanism for each edge varying matrix $\Phi^{(k)}$. These learned matrices are then used to build an edge varying filter to process the features $X_{l-1}$ passed from the previous layer; see Figure 5(b). The edge varying GAT filter employs $K + 1$ transform matrices $B_k$ of dimensions $F_l \times F_{l-1}$, $K + 1$ vectors $e_k$ of dimension $2F_l$, and $K + 1$ matrices $A_k$ of dimensions $F_l \times F_{l-1}$. Hence, the total number of parameters for the edge varying GAT filter is at most $(K+1)(2F^2 + 2F)$. The computational complexity of the edge varying GAT is of order $\mathcal{O}(KF(NF + M))$.

Remark 5. Graph attention networks first appeared in [19].
In that paper, (36) and (37) are proposed as an attention mechanism for the signals of neighboring nodes, and GNN layers are of the form $X_l = \sigma(\Phi_l X_{l-1} A_l)$. The latter is a particular case of either (35) or (38) in which only the term $k = 1$ is not null. Our observation in this section is that this is equivalent to computing a different graph, represented by the shift operator $\Phi$. This allows for the generalization to filters of arbitrary order $K$ [cf. (35)] and to edge varying graph filters of arbitrary order [cf. (38)]. The approaches presented in this paper can likewise be extended with the multi-head attention mechanism proposed in [19] to improve the network capacity.

5.2 Discussions

Reducing model complexity. As defined in (36) and (40), the attention mechanisms are separate from the filtering. To reduce the number of parameters, we can equate the attention matrices $B$ or $B_k$ with the filtering matrices $A_k$. For the GCAT in (35), the original proposal in [19] is to make $B = A_1$, so that (36) reduces to

$$\alpha_{ij} = \sigma\Big( e^\top \big[ [X_{l-1} A_1]_i ,\, [X_{l-1} A_1]_j \big]^\top \Big). \quad (42)$$

For the edge varying GATs in (38), it is natural to equate $B_k = A_k$, in which case (40) reduces to

$$\alpha_{ij}^{(k)} = \sigma\Big( e_k^\top \big[ [X_{l-1} A_k]_i ,\, [X_{l-1} A_k]_j \big]^\top \Big). \quad (43)$$

The choice in (43) removes $(K+1)F^2$ parameters.

Accounting for differences in edge weights in the original shift operator. The major benefit of the GAT mechanism is to build a GNN without requiring full knowledge of $S$. This is beneficial as it yields a GNN robust to uncertainties in the edge weights. This benefit becomes a drawback when $S$ is well estimated, as it renders weights $s_{ij}$ equivalent regardless of their relative values. One possible solution to this drawback is to use a weighted soft maximum operator, so that the entries of $\Phi^{(k)}$ are chosen as

$$\Phi_{ij}^{(k)} = \exp\big(s_{ij}\, \alpha_{ij}^{(k)}\big) \bigg( \sum_{j' \in \mathcal{N}_i \cup i} \exp\big(s_{ij'}\, \alpha_{ij'}^{(k)}\big) \bigg)^{-1}.$$
(44)

Alternatively, we can resort to a hybrid GAT in which we combine a GAT filter of the form in (38) with a regular convolutional filter of the form in (12):

$$X_l = \sigma\bigg( \sum_{k=0}^{K} S^k X_{l-1} A_k + \Phi^{(k:0)} X_{l-1} A_k' \bigg). \quad (45)$$

This is the GAT version of the hybrid GNN we proposed in (17). The filters in (45) account for both the GAT learned shifts $\Phi^{(k)}$ and the original shift $S$.

6 NUMERICAL RESULTS

This section corroborates the capacity of the different models with numerical results on synthetic and real-world graph signal classification problems. Given the different hyperparameters of the models and the trade-offs involved (e.g., complexity, number of parameters, radius of local information), we aim to provide insights on which methods better exploit the graph prior for learning purposes rather than to achieve the best performance. In the sequel, we provide a summary of the different architectures, while in the upcoming four sections we analyze each experiment in detail. Section 6.5 provides some general observations. The code used for these simulations can be found at http://github.com/alelab-upenn/graph-neural-networks.

Summary of the architectures. A summary of the different architectures discussed in this paper is provided in Table 1.

• Fully connected stands for the standard multi-layer perceptron neural network that uses no graph prior.
• Edge varying stands for the EdgeNet in its full form in (7), which allocates different parameters per edge and shift and has the most degrees of freedom.
• The graph convolutional neural network in (12) instead shares parameters among all nodes, edges, and shifts.
• The node varying [cf. (15)] and the hybrid edge varying [cf. (17)] are intermediate solutions between the full-form EdgeNet and the GCNN. Both rely on selecting a set of important nodes and allocating different parameters to them. The hybrid edge varying also allocates different parameters to the edges of these important nodes. The remaining nodes use shared parameters as in the GCNN.
• The spectral edge varying GNN is the subclass of the full-form EdgeNet in which the filters share the eigenvectors with the shift operator [cf. Prop. 2], while the spectral kernel GCNN is the equivalent representation of the GCNN in the spectral domain.
• ARMANet is the GCNN that we proposed in Section 4.3. Contrary to the polynomial GCNN form [cf. (12)], its filter has a rational rather than polynomial spectral response.
• The graph convolutional attention network and its variants are the architectures we developed in Section 5. The GCAT generalizes the popular GAT [19] to a convolutional filter of higher order, so that it can account for multi-hop neighbor information in each layer. The edge varying GAT [cf. (38)] is an alternative to the GCAT that uses edge varying filters together with attention instead of convolutional filters. Lastly, we also discussed two variants of the GCAT and edge varying GAT in Section 5.2, and several others can be developed under the EdgeNet framework.

From all these architectures, we do not evaluate the fully connected network, the spectral kernel GCNN [9], and the spectral edge varying GNN (21), since their computational cost is higher than linear. We also leave the extensions discussed in Section 5.2 to interested readers.

Fig. 6. Source Localization Test Error in the Stochastic Block Model graph. The y-axis scale is deformed to improve visibility. The thick bar interval indicates the average performance for different parameter choices (e.g., filter order, attention heads). The circle marker represents the mean value of this interval. The thin line spans an interval of one standard deviation from the average performance. The convolutional-based approaches perform better than the attention-based ones. We attribute the poor performance of the attention techniques to their slow learning rate: both the GAT and the edge varying GAT required more than 40 epochs to reach a local minimum. The graph convolutional attention network (GCAT), however, does not suffer from this issue, leading to faster learning.

For the remaining solutions, we trained all of them using ADAM with the standard forgetting factors [55].

6.1 Source localization on SBM graphs

The goal of this experiment is to identify which community in a stochastic block model (SBM) graph is the source of a diffused signal by observing different realizations at different time instants. We considered a connected and undirected graph of $N = 50$ nodes divided into five blocks, each representing one community $\{c_1, \ldots, c_5\}$. The intra- and inter-community edge formation probabilities are $0.8$ and $0.2$, respectively. The source node is one of the five nodes $(i_1, \ldots, i_5)$ with the largest degree in the respective community. The source signal $x(0)$ is a Kronecker delta centered at the source node. The source signal is diffused at time $t \in [0, 50]$ as $x(t) = S^t x(0)$, where $S$ is the graph adjacency matrix normalized by the maximum eigenvalue. The training set is composed of 10240 tuples of the form $(x(t), c_i)$ for random $t$ and $i \in \{1, \ldots, 5\}$. These tuples are used to train the EdgeNets, which are subsequently used to predict the source community for a testing signal $x'(t)$, again for a random value of $t$. The validation and the test set are both composed of 2560 tuples ($25\%$ of the training set). The performance of the different algorithms is averaged over ten different graph realizations and ten data splits, for a total of 100 Monte-Carlo iterations.
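The data generation just described can be sketched as follows; this is a minimal version on a smaller graph, with all sizes and the random seed being illustrative assumptions:

```python
import numpy as np

# Sketch of the source localization data generation: sample an SBM graph,
# normalize its adjacency, and diffuse a Kronecker delta from a source node.
# Probabilities follow the text, but the graph here is smaller than N = 50.

rng = np.random.default_rng(5)
C, n_per = 5, 10                        # five communities of 10 nodes each
N = C * n_per
labels = np.repeat(np.arange(C), n_per)

# SBM: edge prob. 0.8 within a community, 0.2 across communities.
p = np.where(labels[:, None] == labels[None, :], 0.8, 0.2)
U = rng.random((N, N))
A = np.triu((U < p).astype(float), k=1)
A = A + A.T                             # undirected adjacency, no self-loops

S = A / np.abs(np.linalg.eigvalsh(A)).max()   # normalize by max eigenvalue

# Source: the highest-degree node of a random community; diffuse for t steps.
c = rng.integers(C)
members = np.flatnonzero(labels == c)
src = members[A[members].sum(axis=1).argmax()]
x0 = np.zeros(N)
x0[src] = 1.0                           # Kronecker delta at the source

t = rng.integers(0, 51)                 # t in [0, 50]
x_t = np.linalg.matrix_power(S, t) @ x0 # diffused signal x(t) = S^t x(0)
print(x_t.shape, c)                     # one training tuple (x(t), c)
```

Repeating the last block for random $t$ and $c$ yields the labeled tuples $(x(t), c_i)$ used for training, validation, and testing.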
ADAM is run for 40 epochs with batches of 100 samples and learning rate 10 − 3 . Architecture parameters. For this experiment, we compar ed 14 differ ent architectures. All architectures comprise the cascade of a graph filtering layer with ReLU nonlinearity and a fully connected layer with softmax nonlinearity . The architectur es are: i ) the edge varying GNN (7); ii ) the GCNN (12); iii ) thr ee node varying GNNs (15), wher e the five important nodes are selected based on iii − a ) maximum degree; iii − b ) spectral proxies [56], which ranks the nodes according to their contribution to different frequencies; iii − c ) diffusion centrality (see Appendix C); iv ) three node dependent edge varying GNNs, where the five important nodes B are selected similalr to the node varying case; v ) three ARMANets (32) with Jacobi iterations v − a ) K = 1 ; v − b ) K = 3 ; v − c ) K = 5 ; v i ) the GA T network from [19]; v ii ) the GCA T network (35); and v iii ) the edge varying GA T (38) network. Our goal is to see how the different architectures handle their degrees of freedom, while all having linear complexity . T o make this comparison more insightful, we proceed with the following rationale. For the approaches in i ) − iv ) , we analyzed filter orders in the interval K ∈ { 1 , . . . , 5 } . This is the only handle we have on these filters to control the number of parameters and locality radius while keeping the same computational complexity . For the ARMANet in v ) , we set the direct term or der to K = 0 to observe only the effect of the rational part. Subsequently , for each Jacobi iteration value K , we analyzed rational orders in the interval P ∈ { 1 , . . . , 5 } as for the former approaches. While this strategy helps us controlling the local radius, recall the ARMANet has a computational complexity slightly higher than the former four architectures. For the GA T in v i ) , we analyzed different attention heads R ∈ { 1 , . . . 
, 5} such that the algorithm complexity matches that of the approaches i)-iv). The number of attention heads is the only handle in the GAT network. Finally, for the GCAT in vii) and the edge varying GAT in viii), we fixed the attention heads to R = 3 and analyzed different filter orders K ∈ {1, ..., 5}. The latter allows comparing the impact of the local radius for the median attention head value. Recall that these architectures again have a slightly higher complexity for K ≥ 2.

TABLE 2
Source Localization Test Error in Facebook Subnetwork. The goal is to grid-search the parameters to achieve a mean error of at most 2%. For the architectures that did not achieve this criterion, the minimum error is reported.

Architecture      | mean  | std. dev. | order         | attention heads | epochs           | learning rate
                  |       |           | {1,2,3,4,5}   | {1,2,3,4,5}     | {10,20,40,100}   | {10^-2, 10^-3}
GCNN              | 4.0%  | 13.0%     | 3             | n/a             | 100              | 10^-3
Edge varying      | 1.5%  | 8.4%      | 1             | n/a             | 10               | 10^-3
Node varying      | 6.0%  | 15.8%     | 3             | n/a             | 20               | 10^-3
Hybrid edge var.  | 6.6%  | 15.9%     | 2             | n/a             | 40               | 10^-3
ARMANet           | 2.0%  | 9.7%      | 1             | n/a             | 20               | 10^-3
GAT               | 10.9% | 20.8%     | n/a           | 1               | 40               | 10^-3
GCAT              | 8.0%  | 18.4%     | 3             | 1               | 100              | 10^-3
Edge varying GAT  | 7.1%  | 17.8%     | 2             | 3               | 100              | 10^-3

Observations. The results of this experiment are shown in Figure 6. We make the following observations. First, the attention-based approaches are characterized by a slow learning rate, leading to a poor performance in 40 epochs. This is reflected in the higher test error of the GAT and the edge varying GAT networks. However, this is not the case for the GCAT network. We attribute this reduced error to the superposition of the graph convolution to the attention that the GCAT explores; all convolutional approaches learn faster. On the contrary, the error increases further for the edge varying GAT, since multiple attention strategies are adopted for all k ∈ {1, ..., K} in (38).
Therefore, our conclusion is that the graph convolutional prior can be significantly helpful for attention mechanisms. We will see this consistent improvement in all our experiments. Second, the edge varying GNN in (7) achieves the lowest error, although it has the largest number of parameters. The convolutional approaches parameterize the edge varying filter well, highlighting the benefit of the graph convolution. ARMANet is the best among the latter, characterized by both a lower mean error and a lower standard deviation. This reduced error for ARMANet is not entirely surprising, since rational functions have better interpolation and extrapolation properties than polynomial ones. It is, however, remarkable that the best result is obtained for a Jacobi iteration of K = 1; i.e., the parameter sharing imposed by ARMANet reaches a good local optimum even with a coarse approximation of the rational function. Notice also that the source localization task is not permutation equivariant; therefore, architectures that are not permutation equivariant (edge varying, node varying, hybrid edge varying, ARMA for low Jacobi orders) are expected to perform better. Third, for the node selection strategies in architectures iii) and iv), there is no clear difference between the degree and the diffusion centrality. For the node varying GNNs, the diffusion centrality offers a lower error both in the mean and in the deviation. In the hybrid edge varying GNNs [cf. (17)], instead, the degree centrality achieves a lower error but pays in deviation. The spectral proxy centrality yields the worst performance. Finally, we remark that we did not find any particular trend while changing the parameters of the different GNNs (e.g., order, attention heads). A rough observation is that low order recursions are often sufficient to reach low errors.
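The diffusion centrality used for node selection above is defined in Appendix C as the vector δ = Σ_{k=0}^K S^k 1. A small sketch of ranking nodes by it (function names are ours):

```python
import numpy as np

def diffusion_centrality(S, K=3):
    """DC of Appendix C: delta = sum_{k=0}^K S^k 1, via repeated shifts."""
    delta = np.zeros(S.shape[0])
    z = np.ones(S.shape[0])
    for _ in range(K + 1):
        delta += z          # accumulate S^k 1
        z = S @ z           # next shift
    return delta

def most_central_nodes(S, m, K=3):
    """Indices of the m nodes with the largest diffusion centrality."""
    return np.argsort(-diffusion_centrality(S, K))[:m]

# example: in a star graph, the hub is the most diffusion-central node
A = np.zeros((6, 6))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
hub = most_central_nodes(A, 1)[0]   # node 0
```

The m selected indices would then play the role of the important nodes in the node varying and hybrid edge varying architectures.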
6.2 Source localization on Facebook subnetwork

In the second experiment, we considered a similar source localization problem on a real-world network: a 234-user Facebook subgraph obtained as the largest connected component of the dataset in [57]. This graph has two well-defined connected communities of different sizes, and the objective is to identify which of the two communities originated the diffusion. The performance of the different algorithms is averaged over 100 Monte-Carlo iterations. The remaining parameters for generating the synthetic data are as before.

Architecture parameters. We compared the eight GNN architectures reported in the leftmost column of Table 2. For the node varying and the hybrid edge varying GNNs, the important nodes are again 10% of all nodes selected based on diffusion centrality. The number of Jacobi iterations for the ARMANet is K = 1 and there is no direct term. Overall, this problem is easy to solve if the GNN is hypertuned with enough width and depth. However, this strategy hides the impact of the specific filter. To highlight the role of the latter, we considered minimal GNN architectures composed of one layer and two features. In turn, this allows a better understanding of how much the specific filter contributes to the performance. We then grid-searched all parameters in Table 2 to reach a classification error of at most 2%. For the architectures that reach this criterion, we report the smallest parameters. For the architectures that do not reach this criterion, we report the minimum achieved error and the respective parameters. Our rationale is that the minimum parameters yield a lower complexity and show better the contribution of the filter type. They also lead to faster training; the opposite holds for the learning rate. From Table 2, we observe that only the edge varying GNN and the ARMANet reach the predefined error.
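The Jacobi iteration referenced here approximates the matrix inverse in the ARMANet's rational term. As a generic illustration (this is not the paper's exact layer; the function name and setup are ours), K Jacobi steps for a system M y = x look as follows:

```python
import numpy as np

def jacobi(M, x, K=1):
    """K Jacobi steps for M y = x starting from y = 0:
    y <- D^{-1} (x - R y), with D = diag(M) and R = M - D.
    A small K gives the coarse inverse approximation discussed in the text."""
    d = np.diag(M)
    R = M - np.diag(d)
    y = np.zeros_like(x)
    for _ in range(K):
        y = (x - R @ y) / d
    return y

# for a diagonally dominant M, the iteration converges to the exact solve
M = np.array([[4.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])
y1 = jacobi(M, x, K=1)      # one step: y = x / diag(M)
y_many = jacobi(M, x, K=60)
```

With K = 1 the output is just the diagonally scaled input, which is why the single-iteration ARMANet stays as cheap as a polynomial filter.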
Both architectures stress our observation that low order recursions (K = 1) are often sufficient. Nevertheless, this is not the case for all other architectures. These observations suggest the edge varying GNN explores its degrees of freedom well and adapts well to the non-permutation equivariance of the task. The ARMANet exploits the convolutional prior the best; in accordance with the former results, the Jacobi implementation does not need to run until convergence to achieve impressive results. We also conclude that the convolutional prior helps to reduce the degrees of freedom of the EdgeNet but requires a deeper and/or wider network to achieve the predefined criterion. This is particularly seen in the GAT-based architectures. The GCAT architecture, here, exploits the convolutional prior and reduces the error compared with the GAT. Finally, we remark that, for all approaches, a substantially lower variance can be achieved by solely increasing the number of features.

6.3 Authorship attribution

In this third experiment, we assess the performance of the different GNN architectures in an authorship attribution problem based on real data. The goal is to classify whether a text excerpt belongs to a specific author or to any other of the 20 contemporary authors based on word adjacency networks (WANs) [58]. A WAN is an author-specific directed graph whose nodes are function words without semantic meaning (e.g., prepositions, pronouns, conjunctions). The relative positioning of function words carries stylistic information about the author. To capture this information, we build a directed graph, where each node is a function word and each weighted edge represents the average co-occurrence of the corresponding function words, discounted by relative distance (i.e., if the two words are next to each other, the weight is higher than if the two words are further apart). We build this graph support only once, before training.
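A minimal sketch of such a co-occurrence graph follows. It is a simplification of the WAN construction: we count unit-weight co-occurrences within a ten-word window and omit the distance discount; `build_wan` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def build_wan(tokens, function_words, window=10):
    """Row-normalized co-occurrence matrix over function-word pairs
    (unit weights within the window; no distance discount)."""
    idx = {w: i for i, w in enumerate(function_words)}
    W = np.zeros((len(function_words), len(function_words)))
    hits = [(p, idx[t]) for p, t in enumerate(tokens) if t in idx]
    for a in range(len(hits)):
        for b in range(a + 1, len(hits)):
            if hits[b][0] - hits[a][0] >= window:
                break                       # later hits are even farther away
            W[hits[a][1], hits[b][1]] += 1.0
    rows = W.sum(axis=1, keepdims=True)     # row-normalize: Markov-like shift
    return np.divide(W, rows, out=np.zeros_like(W), where=rows > 0)

S = build_wan("the cat and the dog in the barn".split(), ["the", "and", "in"])
```

Row-normalizing the counts yields the stochastic shift operator interpretation used in this experiment.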
The signal on top of this graph is the frequency count of the function words in text excerpts of 1,000 words. These are the graph signals that form the dataset used for training and testing. The WANs and the word frequency counts serve as author signatures and allow learning representation patterns in their writing style. The task translates into a binary classification problem, where one indicates the text excerpt is written by the author of interest and zero by any other author. A more detailed account of the creation of WANs can be found in [58], and the dataset used is available in our code.

The WANs of the respective authors have from N = 190 to N = 210 function word nodes. Following [58], we built single-author WANs for Jane Austen, Emily Brontë, and Edgar Allan Poe. For each author, we processed the texts to count the number of times each function word pair co-appears in a window of ten words. These co-appearances are imputed into an N × N matrix and normalized row-wise. The resulting matrix is used as the shift operator, which can also be interpreted as a Markov chain transition matrix. We considered a train-test split of 95%-5% of the available texts. Around 8.7% of the training samples are used for validation. This division leads to: (i) Austen: 1346 training samples, 118 validation samples, and 78 testing samples; (ii) Brontë: 1192 training samples, 104 validation samples, and 68 testing samples; (iii) Poe: 740 training samples, 64 validation samples, and 42 testing samples. For each author, the sets are extended by a similar amount with texts from the other 20 authors, shared equally between them.

Architecture parameters. We considered again the eight GNN architectures of the former section, shown in the leftmost column of Table 3. Following the setup in [11], all architectures comprise a graph neural layer of F = 32 features with ReLU nonlinearity followed by a fully connected layer.
The baseline order for all filters is K = 4. For the ARMANet, this is also the number of denominator parameters and the order of the direct term in (32); the number of Jacobi iterations in (33) is one. We want to show how much the rational part helps to improve the performance of the GCNN (which is the direct term in the ARMANet [cf. (32)]). The important nodes for the node varying and the hybrid edge varying are 20 (~10% of N), selected with degree centrality. The GAT, GCAT, and edge varying GAT have a single attention head to highlight the role of the convolutional and edge varying recursion over it. The loss function is the cross-entropy optimized over 25 epochs with a learning rate of 0.005. The performance is averaged over ten data splits.

TABLE 3
Authorship Attribution Test Error. The results show the average classification test error and standard deviation on 10 different 95%-5% training-test splits.

Architecture     | Austen       | Brontë       | Poe
GCNN             | 7.2(±2.0)%   | 12.9(±3.5)%  | 14.3(±6.4)%
Edge varying     | 7.1(±2.2)%   | 13.1(±3.9)%  | 10.7(±4.3)%
Node varying     | 7.4(±2.1)%   | 14.6(±4.2)%  | 11.7(±4.9)%
Hybrid edge var. | 6.9(±2.6)%   | 14.0(±3.7)%  | 11.7(±4.8)%
ARMANet          | 7.9(±2.3)%   | 11.6(±5.0)%  | 10.9(±3.7)%
GAT              | 10.9(±4.6)%  | 22.1(±7.4)%  | 12.6(±5.5)%
GCAT             | 8.2(±2.9)%   | 13.1(±3.5)%  | 13.6(±5.8)%
Edge varying GAT | 14.5(±5.9)%  | 23.7(±9.0)%  | 18.1(±8.4)%

Table 3 shows the results of this experiment. Overall, we see again that the graph convolution is a solid prior for learning meaningful representations. This is particularly highlighted in the improved performance of the GCAT for Austen and Brontë compared with the GAT, even with a single attention head. These observations also suggest the GAT and the edge varying GAT architectures require multi-head approaches to achieve comparable performance. An exception is the case of Poe.
In this instance, multi-head attention is also needed for the GCAT. The (approximated) rational part of the ARMANet gives a consistent improvement over the GCNN. Hence, we recommend considering the additional parameterization of the ARMANet when implementing graph convolutional neural networks, since the increased number of parameters and implementation costs are minimal. Finally, we remark that the hybrid edge varying GNN improves the accuracy of its node varying counterpart.

6.4 Recommender Systems

In this last experiment, we evaluate all former architectures for movie rating prediction on a subset of the MovieLens 100K dataset [59]. The full dataset comprises U = 943 users, I = 1,582 movies, and 100K out of ~1.5M potential ratings. We set the missing ratings to zero. From the incomplete U × I rating matrix, we consider two scenarios: a user-based and a movie-based one. In the user-based scenario, we considered the 200 users that have rated the most movies as the nodes of a graph whose edges represent Pearson similarities between any two users. Each of the I = 1,582 movies is treated as a different graph signal whose value at a node is the rating given to that movie by a user, or zero if unrated. We are interested in predicting the rating of a specific user u with GNNs, which corresponds to completing the u-th row of the 200 × 1,582 sub-rating matrix. In the movie-based scenario, we considered the 200 movies with the largest number of ratings as nodes of a graph whose edges represent Pearson similarities between any two movies. In this instance, there are 943 graph signals: the ratings each user gives to all 200 movies form one such graph signal. We are interested in predicting the rating of a specific movie i with GNNs, which corresponds to completing the i-th column of the rating matrix. We remark that this task is permutation equivariant; therefore, we expect architectures holding this property to perform better.

Architecture parameters.
We considered the same architectural settings as in the authorship attribution experiments to highlight consistent behaviors and differences. Following [60], we chose ten 90%-10% splits for the training and test sets and pruned the graphs to keep only the top-40 most similar connections per node. The shift operator is again the adjacency matrix normalized by its maximum eigenvalue. The ADAM learning algorithm is run over 40 epochs in batches of five with a learning rate of 5 × 10^-3. We trained the networks on a smooth-ℓ1 loss and measure the accuracy through the root mean squared error (RMSE).

TABLE 4
Average RMSE on the user graph.

Archit./User-ID  | 405  | 655  | 13   | 450  | 276  | Average
GCNN             | 1.09 | 0.72 | 1.18 | 0.82 | 0.66 | 0.89
Edge var.        | 1.25 | 0.74 | 1.34 | 0.99 | 0.70 | 1.00
Node var.        | 1.17 | 0.68 | 1.19 | 0.83 | 0.67 | 0.91
Hybrid edge var. | 1.10 | 0.72 | 1.27 | 0.80 | 0.60 | 0.90
ARMANet          | 1.13 | 0.69 | 1.24 | 0.80 | 0.65 | 0.90
GAT              | 1.27 | 0.74 | 1.44 | 0.92 | 0.80 | 1.03
GCAT             | 1.09 | 0.71 | 1.12 | 0.77 | 0.65 | 0.87
Edge var. GAT    | 1.19 | 0.70 | 1.31 | 0.85 | 0.75 | 0.96

Tables 4 and 5 show the results for the five users and five movies with the largest number of ratings, respectively. The first thing to note is that the GCAT consistently improves over the GAT. This further stresses that multi-head attention is more needed in the GAT than in the GCAT. Second, the edge varying GNN yields the worst performance because it is not a permutation equivariant architecture. In fact, the node varying and the hybrid edge varying GNNs, which are approaches in-between permutation equivariance and local detail, work much better. This trend is also observed in the edge varying GAT results, suggesting that the number of parameters in the edge varying is too high for this task.

6.5 General Observations

Altogether, these experiments lead to four main observations.

Edge varying GNN useful for non-permutation equivariant tasks. The edge varying GNN [cf.
(7)] can perform well if the task is not permutation equivariant (e.g., source localization). We have observed that minimal architectures (i.e., a lower number of features F, layers L, and filter order K) adapt more easily to the task. This is because a minimal architecture has fewer degrees of freedom and can avoid overfitting. On the contrary, when the task is permutation equivariant, the edge varying GNN will suffer and parameterizations matched to the task are needed.

Convolution provides a strong parameterization. In permutation equivariant tasks, we have seen GCNNs are a valid parameterization. They have also shown potential in tasks that are not permutation equivariant. However, contrarily to the edge varying GNN, GCNNs require a wider architecture and filters of higher order. We have also seen that the ARMANet [cf. (22)-(33)] can improve the performance of its polynomial counterpart [cf. (12)]. This is because of the rational frequency response the ARMANet implements in each layer, which requires fewer parameters. Surprisingly to us, however, we have observed that even with a few Jacobi iterations (ARMANets require at each layer the computation of a matrix inverse, which we solve iteratively with the Jacobi method) the ARMANet can often achieve a better performance than the polynomial GCNN.

Hybrid solutions offer a good parameterization. While the edge varying GNN and the GCNN may outperform each other depending on the task, the node varying [cf. (15)] and the hybrid edge varying [cf. (17)] GNNs have shown a good parameterization of the EdgeNet in all tasks. We have seen the hybrid edge varying perform overall better than the node varying due to its edge-dependent parameters. However, we did not find any consistent difference between the node selection strategies. Our rationale is that the sampling strategy of the important nodes needs to be matched with the task at hand.

Graph convolutions improve attention. The graph convolutional attention network [cf.
(35)] improves consistently over the GAT. This is because it generalizes the latter from a convolutional filter of order one to an arbitrary order. Working with the GCAT has been shown to accelerate the learning procedure of the GAT and to perform well in any task. In addition, the GCAT may be effective even with a single attention head. Instead, by generalizing the GAT to an edge varying GAT [cf. (38)], we have not seen substantial improvement beyond that of the GCAT. This indicates that the attention mechanism may not be the best strategy to learn different shift operators in each layer.

7 CONCLUSION

This paper introduced EdgeNets: GNN architectures that allow each node to collect information from its direct neighbors and apply different weights to each of them. EdgeNets preserve the state-of-the-art implementation complexity and provide a single recursion that encompasses all state-of-the-art architectures. By showcasing how each solution is a particular instance of the EdgeNet, we provided guidelines to develop more expressive GNN architectures without compromising the computational complexity. In specific, this paper proposed eight GNN architectures that can be readily extended to scenarios containing multi-edge features. The EdgeNet link showed a tight connection between the graph convolutional and graph attention mechanisms, which have so far been treated as two separate approaches. We found that the graph attention network learns the weights of a graph and then performs an order one convolution over this learned graph. Following this link, we introduced the concept of graph convolutional attention networks, which is an EdgeNet that jointly learns the edge weights and the parameters of a convolutional filter. We advocate the EdgeNet as a more formal way to build GNN solutions. However, further research is needed in three main directions.
First, research should be done to explore the connection between EdgeNets and receptive fields. This will lead to different parameterizations and architectures. Second, work needs to be done to assess the capabilities of EdgeNet solutions to handle graph isomorphisms [61], [62]. Third, theoretical work is also needed to characterize the stability of the EdgeNet to link perturbations [43], [63].

TABLE 5
Average RMSE on the movie graph.

Archit./Movie-ID | 50   | 258  | 100  | 181  | 294  | Average
GCNN             | 0.82 | 1.08 | 0.95 | 0.86 | 1.04 | 0.95
Edge var.        | 0.93 | 1.03 | 1.00 | 0.88 | 1.24 | 1.02
Node var.        | 0.78 | 1.04 | 1.00 | 0.87 | 1.00 | 0.94
Hybrid edge var. | 0.75 | 1.02 | 0.98 | 0.82 | 1.08 | 0.93
ARMANet          | 0.81 | 1.05 | 1.02 | 0.87 | 1.09 | 0.97
GAT              | 0.98 | 1.24 | 1.28 | 1.00 | 1.30 | 1.16
GCAT             | 0.83 | 1.06 | 1.04 | 0.83 | 1.05 | 0.96
Edge var. GAT    | 0.81 | 1.04 | 1.01 | 0.86 | 1.07 | 0.96

APPENDIX A
PROOF OF PROPOSITION 1

Denote the respective graph shift operator matrices of the graphs G and G' as S and S'. For P being a permutation matrix, S' and x' can be written as S' = P^T S P and x' = P^T x. Then, the output of the convolutional filter in (14) applied to x' is

  u' = Σ_{k=0}^{K} a_k (S')^k x' = Σ_{k=0}^{K} a_k (P^T S P)^k P^T x.   (46)

By using the permutation matrix properties P^T P = P P^T = I_N, so that (P^T S P)^k = P^T S^k P, the output u' becomes

  u' = P^T ( Σ_{k=0}^{K} a_k S^k x ) = P^T u,   (47)

which implies that the filter output operating on the permuted graph G' with input x' is simply the permutation of the convolutional filter in (14) applied to x. Subsequently, since the nonlinearities of each layer are pointwise, they preserve permutation equivariance; hence, the output of a GCNN layer is permuted likewise. These permutations propagate through the cascade of the different layers, yielding the final permuted output.
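The identity in this proof is easy to check numerically; a minimal sketch with a random symmetric shift operator (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
A = rng.random((N, N))
S = (A + A.T) / 2                       # arbitrary symmetric shift operator
x = rng.standard_normal(N)
a = rng.standard_normal(K + 1)          # filter taps a_0, ..., a_K

def graph_filter(S, x, a):
    """u = sum_k a_k S^k x, computed with the shift recursion z <- S z."""
    u, z = np.zeros_like(x), x.copy()
    for a_k in a:
        u += a_k * z
        z = S @ z
    return u

P = np.eye(N)[rng.permutation(N)]       # random permutation matrix
u = graph_filter(S, x, a)
u_prime = graph_filter(P.T @ S @ P, P.T @ x, a)   # filter on the permuted graph
print(np.allclose(u_prime, P.T @ u))    # True: u' = P^T u, as in (47)
```

The check holds for any shift operator and any taps, which is exactly the equivariance statement of Proposition 1.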
APPENDIX B
PROOF OF PROPOSITION 2

To start, let Φ̌^(1) = Φ^(0) + Φ^(1) and Φ̌^(k) = Φ^(k) for all k = 2, ..., K be the transformed parameter matrices. Recall also that Φ^(0) is a diagonal matrix; thus, Φ̌^(1) shares its support with Φ^(1) and with S + I_N. Given the eigendecomposition of the transformed parameter matrices Φ̌^(k) = V Λ^(k) V^{-1} for all k = 1, ..., K, the edge varying filter can be written in the graph spectral domain as

  a(Λ) = Σ_{k=1}^{K} Π_{k'=1}^{k} Λ^(k').   (48)

Subsequently, recall that J is the index set defining the zero entries of S + I_N and that C_J ∈ {0,1}^{|J| × N²} is the selection matrix whose rows are those of I_{N²} indexed by J [cf. (20)]. Then, the fixed support condition for Φ̌^(k) for all k = 1, ..., K is

  C_J vec(Φ̌^(k)) = C_J vec(V Λ^(k) V^{-1}) = 0_{|J|}.   (49)

Put differently, equation (49) expresses in vector form the zero entries of Φ̌^(k) (hence, of Φ^(0), ..., Φ^(K)) that match those of S + I_N. From the properties of the vectorization operation, (49) can be rewritten as

  C_J vec(V Λ^(k) V^{-1}) = C_J (V^{-T} ∗ V) λ^(k),   (50)

where "∗" denotes the Khatri-Rao product and λ^(k) = diag(Λ^(k)) is the N-dimensional vector composed of the diagonal elements of Λ^(k). As follows from (49), (50) implies that the vector λ^(k) lies in the null space of C_J (V^{-T} ∗ V), i.e.,

  λ^(k) ∈ null( C_J (V^{-T} ∗ V) ).   (51)

Let then B be a basis that spans this null space [cf. (20)]. The vector λ^(k) can be expanded as

  λ^(k) = B μ^(k),   (52)

where μ^(k) is the vector containing the basis expansion parameters. Finally, by putting together (49)-(52), (48) becomes

  a(Λ) = Σ_{k=1}^{K} Π_{k'=1}^{k} diag(B μ^(k')).   (53)

The N × b basis matrix B is a kernel that depends on the specific graph and, in particular, on the eigenvectors V.
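This construction can be checked numerically. The sketch below makes two illustrative assumptions that are not part of the proposition: V is taken as the eigenvectors of S + I_N itself, and the Khatri-Rao product is the column-wise one, so that vec(V diag(λ) V^{-1}) = (V^{-T} ∗ V) λ.

```python
import numpy as np

N = 6
A = np.zeros((N, N))
for i in range(N):                            # 6-node cycle as a toy support
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
M = A + np.eye(N)                             # support pattern of S + I_N

_, V = np.linalg.eigh(M)                      # illustrative choice of V
Vi = np.linalg.inv(V)

# column-wise Khatri-Rao of V^{-T} and V: columns kron(Vi.T[:, j], V[:, j])
KR = np.column_stack([np.kron(Vi.T[:, j], V[:, j]) for j in range(N)])

mask = (M.flatten(order="F") == 0.0)          # index set J: zeros of S + I_N
_, sv, Vh = np.linalg.svd(KR[mask])           # rows of KR selected by C_J
rank = int((sv > 1e-10).sum())
B = Vh[rank:].T                               # N x b basis of the null space in (51)

# any lambda = B mu yields Phi = V diag(lambda) V^{-1} supported on S + I_N
rng = np.random.default_rng(0)
mu = rng.standard_normal(B.shape[1])
Phi = V @ np.diag(B @ mu) @ Vi
print(np.allclose(Phi.flatten(order="F")[mask], 0.0))   # True
```

Note that the null space is non-trivial here: both λ = 1 (giving Φ = I_N) and the eigenvalues of M (giving Φ = M) satisfy the support constraint, so b ≥ 2 in this toy example.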
The kernel dimension b depends on the rank of B and, thus, on the rank of the null space in (51). In practice, it is often observed that rank(B) = b ≪ N.

APPENDIX C
DIFFUSION CENTRALITY

Let S be the shift operator used to represent the graph structure. We define the diffusion centrality (DC) δ_i of a node i in K shifts as the i-th entry of the vector

  δ = Σ_{k=0}^{K} S^k 1_N.   (54)

The DC describes how much each node influences the passing of information in the network for a finite number of hops. The DC vector δ can also be seen as the convolution of the constant graph signal with a convolutional filter of the form in (14) that has all unitary parameters. This definition of DC is more appropriate for choices of S being the adjacency matrix or normalizations of it. For S being the discrete Laplacian matrix, the DC is zero for all nodes, since the constant vector is the eigenvector of the discrete Laplacian corresponding to the zero eigenvalue. The above DC definition is the particularization of the DC proposed in [64] for the stochastic setting to the case where all nodes decide to take part in the signal diffusion. Both the DC in (54) and the one from [64] are correlated with the Katz-Bonacich centrality and the eigenvector centrality.

REFERENCES

[1] E. Isufi, F. Gama, and A. Ribeiro, "Generalizing graph convolutional neural networks with edge-variant recursions on graphs," in 27th Eur. Signal Process. Conf. A Coruña, Spain: Eur. Assoc. Signal Process., 2-6 Sep. 2019.
[2] L. Tang and H. Liu, "Relational learning via latent social dimensions," in 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining. Paris, France: ACM, 28 June-1 July 2009, pp. 817-826.
[3] N. Wale, I. A. Watson, and G. Karypis, "Comparison of descriptor spaces for chemical compound retrieval and classification," Knowledge and Information Systems, vol. 14, no. 3, pp. 347-375, 2008.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P.
Vandergheynst, "Geometric deep learning: Going beyond Euclidean data," IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18-42, July 2017.
[5] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83-98, May 2013.
[6] F. Scarselli, S. L. Yong, M. Gori, M. Hagenbuchner, A. C. Tsoi, and M. Maggini, "Graph neural networks for ranking web pages," in The 2005 IEEE/WIC/ACM Int. Conf. Web Intelligence. Compiegne, France: IEEE, 19-22 Sep. 2005, pp. 1-7.
[7] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61-80, Jan. 2009.
[8] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Graphs, convolutions, and neural networks: From graph filters to graph neural networks," IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128-138, Nov. 2020.
[9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," in 2nd Int. Conf. Learning Representations. Banff, AB: Assoc. Comput. Linguistics, 14-16 Apr. 2014, pp. 1-14.
[10] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Foundation, 5-10 Dec. 2016, pp. 3844-3858.
[11] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional neural network architectures for signals supported on graphs," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034-1049, Feb. 2019.
[12] J. Du, J. Shi, S. Kar, and J. M. F. Moura, "On graph convolution for graph CNNs," in 2018 IEEE Data Sci. Workshop. Lausanne, Switzerland: IEEE, 4-6 June 2018, pp. 239-243.
[13] T. N. Kipf and M.
Welling, "Semi-supervised classification with graph convolutional networks," in 5th Int. Conf. Learning Representations. Toulon, France: Assoc. Comput. Linguistics, 24-26 Apr. 2017, pp. 1-14.
[14] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in 7th Int. Conf. Learning Representations. New Orleans, LA: Assoc. Comput. Linguistics, 6-9 May 2019, pp. 1-17.
[15] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, "CayleyNets: Graph convolutional neural networks with complex rational spectral filters," IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97-109, Jan. 2019.
[16] M. Simonovsky and N. Komodakis, "Dynamic edge-conditioned filters in convolutional neural networks on graphs," in Conf. Comput. Vision and Pattern Recognition 2017. Honolulu, HI: Comput. Vision Foundation, 21-26 July 2017, pp. 3693-3702.
[17] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, "Geometric deep learning on graphs and manifolds using mixture model CNNs," in Conf. Comput. Vision and Pattern Recognition 2017. Honolulu, HI: Comput. Vision Foundation, 21-26 July 2017, pp. 3693-3702.
[18] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Foundation, 5-10 Dec. 2016.
[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in 6th Int. Conf. Learning Representations. Vancouver, BC: Assoc. Comput. Linguistics, 30 Apr.-3 May 2018, pp. 1-12.
[20] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," arXiv:1901.00596v3 [cs.LG], 8 Aug. 2019. [Online]. Available: http://arxiv.org/abs/1901.00596
[21] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications," arXiv:1812.08434v4 [cs.LG], 10 July 2019.
[Online]. Available: http://arxiv.org/abs/1812.08434
[22] Z. Zhang, P. Cui, and W. Zhu, "Deep learning on graphs: A survey," arXiv:1812.04202v1 [cs.LG], 11 Dec. 2018. [Online]. Available: http://arxiv.org/abs/1812.04202
[23] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, "Attention models in graphs: A survey," arXiv:1807.07984v1 [cs.AI], 20 July 2018. [Online]. Available: http://arxiv.org/abs/1807.07984
[24] G. Taubin, "Geometric signal processing on polygonal meshes," in EUROGRAPHICS '2000. Interlaken, Switzerland: The Eurographics Association, 21-25 Aug. 2000, pp. 1-11.
[25] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Trans. Signal, Inform. Process. Networks, vol. 4, no. 4, pp. 736-751, Dec. 2018.
[26] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644-1656, Apr. 2013.
[27] S. K. Narang, "Compact support biorthogonal wavelet filterbanks for arbitrary undirected graphs," IEEE Trans. Signal Process., vol. 61, no. 19, pp. 4673-4685, Oct. 2013.
[28] O. Teke and P. P. Vaidyanathan, "Extending classical multirate signal processing theory to graphs—Part I: Fundamentals," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 409-422, Jan. 2017.
[29] S. Segarra, A. G. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117-4131, Aug. 2017.
[30] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Autoregressive moving average graph filtering," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 274-288, Jan. 2017.
[31] J. You, J. Leskovec, K. He, and S. Xie, "Graph structure of neural networks," in International Conference on Machine Learning. PMLR, 2020, pp. 10881-10891.
[32] D. Bahdanau, K. Cho, and Y.
Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd Int. Conf. Learning Representations. San Diego, CA: Assoc. Comput. Linguistics, 7-9 May 2015, pp. 1-15.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in 31st Conf. Neural Inform. Process. Syst. Long Beach, CA: Neural Inform. Process. Syst. Foundation, 4-9 Dec. 2017, pp. 1-11.
[34] M. Coutino, E. Isufi, and G. Leus, "Advances in distributed graph filtering," IEEE Trans. Signal Process., vol. 67, no. 9, pp. 2320-2333, May 2019.
[35] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint, 2018.
[36] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," arXiv preprint arXiv:1704.01212, 2017.
[37] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, "Graph signal processing: Overview, challenges, and applications," Proceedings of the IEEE, vol. 106, no. 5, pp. 808-828, 2018.
[38] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[39] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024-1034.
[40] V. N. Ioannidis, A. G. Marques, and G. B. Giannakis, "A recurrent graph neural network for multi-relational data," in 44th IEEE Int. Conf. Acoust., Speech and Signal Process. Brighton, UK: IEEE, 12-17 May 2019, pp. 8157-8161.
[41] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman, "Invariant and equivariant graph networks," arXiv preprint, 2018.
[42] R. Levie, E. Isufi, and G. Kutyniok, “On the transferability of spec- tral graph filters,” in 13th Int. Conf. Sampling Theory Applications . Bordeaux, France: IEEE, 8-12 July 2019, pp. 1–5. [43] F . Gama, J. Bruna, and A. Ribeir o, “Stability pr operties of graph neural networks,” IEEE T rans. Signal Process. , vol. 68, pp. 5680– 5695, 25 Sep. 2020. [44] F . Gama, G. Leus, A. G. Marques, and A. Ribeiro, “Convolutional neural networks via node-varying graph filters,” in 2018 IEEE Data Sci. Workshop . Lausanne, Switzerland: IEEE, 4-6 June 2018, pp. 220–224. [45] A. Sandyhaila and J. M. F . Moura, “Discrete signal processing on graphs: Fr equency analysis,” IEEE T rans. Signal Process. , vol. 62, no. 12, pp. 3042–3054, June 2014. [46] S. Liu and G. T renkler , “Hadamar d, khatri-rao, kronecker and other matrix products,” International Journal of Information and Systems Sciences , vol. 4, no. 1, pp. 160–177, 2008. 18 [47] N. A. Heckert and J. J. Filliben, “Nist/sematech e-handbook of statistical methods; chapter 1: Exploratory data analysis,” 2003. [48] L. N. T refethen, Approximation Theory and Approximation Practice, Extended Edition . SIAM, 2019. [49] J. Liu, E. Isufi, and G. Leus, “Filter design for autoregressive moving average graph filters,” IEEE T rans. Signal, Inform. Process. Networks , vol. 5, no. 1, pp. 47–60, March 2019. [50] E. Isufi, A. Loukas, and G. Leus, “Autoregressive moving average graph filtering: A stable distributed implementation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , no. EPFL-CONF-223825, 2017. [51] F . M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, “Graph neural networks with convolutional ARMA filters,” arXiv:1901.01343v5 [cs.LG] , 24 Oct. 2019. [Online]. A vailable: http://arxiv .org/abs/1901.01345 [52] W . A. S. W ijesinghe and Q. W ang, “Dfnets: Spectral cnns for graphs with feedback-looped filters,” in Advances in Neural Infor- mation Processing Systems , 2019, pp. 6007–6018. 
[53] L. Ruiz, F . Gama, and A. Ribeiro, “Gated graph recurr ent neural networks,” IEEE T rans. Signal Pr ocess. , vol. 68, pp. 6303–6318, 26 Oct. 2020. [54] K. Gregor and Y . LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th international conference on inter- national conference on machine learning , 2010, pp. 399–406. [55] D. P . Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in 3rd Int. Conf. Learning Representations . San Diego, CA: Assoc. Comput. Linguistics, 7-9 May 2015, pp. 1–15. [56] A. Anis, A. Gadde, and A. Ortega, “Efficient sampling set selection for bandlimited graph signals using graph spectral proxies,” IEEE T rans. Signal Process. , vol. 64, no. 14, pp. 3775–3789, July 2016. [57] J. McAuley and J. Leskovec, “Learning to discover social circles in Ego networks,” in 26th Conf. Neural Inform. Pr ocess. Syst. Stateline, TX: Neural Inform. Process. Foundation, 3-8 Dec. 2012. [58] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE T rans. Signal Process. , vol. 63, no. 20, pp. 5464–5478, Oct. 2015. [59] F . M. Harper and J. A. Konstan, “The movielens datasets: History and context,” Acm transactions on interactive intelligent systems (tiis) , vol. 5, no. 4, pp. 1–19, 2015. [60] L. Ruiz, F . Gama, A. G. Marques, and A. Ribeiro, “Invariance- preserving localized activation functions for graph neural net- works,” IEEE T rans. Signal Process. , vol. 68, pp. 127–141, 25 Nov . 2019. [61] H. Maron, H. Ben-Hamu, H. Serviansky , and Y . Lipman, “Prov- ably powerful graph networks,” in Advances in neural information processing systems , 2019, pp. 2156–2167. [62] C. Morris, M. Ritzert, M. Fey , W . L. Hamilton, J. E. Lenssen, G. Rat- tan, and M. Grohe, “W eisfeiler and leman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019, pp. 4602–4609. [63] R. Levie, W . Huang, L. Bucci, M. 
M. Bronstein, and G. Kutyniok, “T ransferability of spectral graph convolutional neural networks,” arXiv preprint arXiv:1907.12972 , 2019. [64] A. Banerjee, A. G. Chandrasekhar , E. Duflo, and M. O. Jackson, “The diffusion of microfinance,” Science , vol. 341, no. 6144, pp. 1 236 498 (1–7), July 2013.
