Convolutional Neural Network Architectures for Signals Supported on Graphs
IEEE TRANSACTIONS ON SIGNAL PROCESSING (ACCEPTED)

Fernando Gama, Antonio G. Marques, Geert Leus, and Alejandro Ribeiro

Abstract—Two architectures that generalize convolutional neural networks (CNNs) for the processing of signals supported on graphs are introduced. We start with the selection graph neural network (GNN), which replaces linear time invariant filters with linear shift invariant graph filters to generate convolutional features and reinterprets pooling as a possibly nonlinear subsampling stage where nearby nodes pool their information in a set of preselected sample nodes. A key component of the architecture is to remember the position of sampled nodes to permit computation of convolutional features at deeper layers. The second architecture, dubbed aggregation GNN, diffuses the signal through the graph and stores the sequence of diffused components observed by a designated node. This procedure effectively aggregates all components into a stream of information having temporal structure, to which the convolution and pooling stages of regular CNNs can be applied. A multinode version of aggregation GNNs is further introduced for operation in large-scale graphs. An important property of selection and aggregation GNNs is that they reduce to conventional CNNs when particularized to time signals reinterpreted as graph signals on a circulant graph. Comparative numerical analyses are performed in a source localization application over synthetic and real-world networks. Performance is also evaluated for an authorship attribution problem and for text category classification. Multinode aggregation GNNs are consistently the best performing GNN architecture.

Index Terms—deep learning, convolutional neural networks, graph signal processing, graph filters, pooling
I. INTRODUCTION

We consider signals with irregular structure and describe their underlying topology with a graph whose edge weights capture a notion of expected similarity or proximity between signal components expressed at nodes [1]-[4]. Of particular importance in this paper is the interpretation of matrix representations of the graph as shift operators that can be applied to graph signals. Shift operators represent local (one-hop neighborhood) operations on the graph, and allow for different generalizations of convolution, sampling and reconstruction. These generalizations stem either from representations of graph filters as polynomials in the shift operator [1], [5], [6] or from the aggregation of sequences generated through successive application of the shift operator [7]. They not only capture the intuitive idea of convolution, sampling and reconstruction as local operations, but also share some other interesting theoretical properties [1], [2], [5]. Our goal here is to build on these definitions to generalize convolutional neural networks (CNNs) to graph signals.

CNNs consist of layers that are sequentially composed, each of which is itself the composition of convolution and pooling operations (Section II and Figure 1). The input to a layer is a multichannel signal composed of features extracted from the previous layer, or the input signal itself at the first layer.

[Funding and affiliations: Supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS and Intel DevCloud; and Spain MINECO grants No TEC2013-41604-R and TEC2016-75361-R. F. Gama and A. Ribeiro are with the Dept. of Electrical and Systems Eng., Univ. of Pennsylvania. A. G. Marques is with the Dept. of Signal Theory and Comms., King Juan Carlos Univ. G. Leus is with the Dept. of Microelectronics, Delft Univ. of Technology. Email: {fgama,aribeiro}@seas.upenn.edu, antonio.garcia.marques@urjc.es, g.j.t.leus@tudelft.nl.]
The main step in the convolution stage is the processing of each feature with a bank of linear time invariant filters (Section II-A). To keep complexity under control and avoid the number of intermediate features growing exponentially, the outputs of some filters are merged via simple pointwise summations. In the pooling stage we begin by computing local summaries in which feature components are replaced with a summary of their values at nearby points (Sec. II-B). These summaries can be linear, e.g., a weighted average of adjacent components, or nonlinear, e.g., the maximum value among adjacent components. Pooling also involves a subsampling of the summarized outputs. This subsampling reduces dimensionality with a (small) loss of information because the summarizing function is a low-pass operation. The output of the layer is finally obtained by applying a pointwise nonlinear activation function to produce features that become an input to the next layer. This architecture is both simple to implement [8] and simple to train [9]. Most importantly, the performance of CNNs in regression and classification is remarkable, to the extent that they have become the standard tool in machine learning to handle such inference tasks [10]-[12].

As follows from the above description, a CNN layer involves five operations: (i) convolution with linear time invariant filters; (ii) summation of different features; (iii) computation of local summaries; (iv) subsampling; (v) activation with a pointwise nonlinearity. A graph neural network (GNN) is an architecture adapted to graph signals that generalizes these five operations. Operations (ii) and (v) are pointwise, hence independent of the underlying topology, so they can be applied to graph signals without modification. Generalizing (iii) is straightforward because the notion of adjacent components is well defined by graph neighborhoods.
Generalizing operation (i) is not difficult in the context of graph signal processing advances, whereby linear time invariant filters are particular cases of linear shift invariant graph filters. This has motivated the definition of GNNs with convolutional features computed from shift invariant graph filters, an idea first introduced in [13] and further explored in [14]-[19]. Architectures based on receptive fields, which are different from but conceptually similar to graph filters, have also been proposed [20]-[22]. However, generalizing operation (iv) has proven more challenging because once the signal is downsampled, it is not easy to identify a coarsened graph to connect the components of the subsampled signal. The use of multiscale hierarchical clustering has been proposed to produce a collection of smaller graphs [13], [14], [16], but it is not clear which clustering or coarsening criterion is appropriate for GNN architectures. The difficulty of designing and implementing proper pooling is highlighted by the fact that several works exclude the pooling stage altogether [17], [20], [21], [23].

In this paper we propose two different GNN architectures, selection GNNs and aggregation GNNs, that include convolutional and pooling stages but bypass the need to create a coarsened graph. In selection GNNs (Sec. III and Fig. 2) we replace convolutions with linear shift invariant graph filters and replace regular sampling with graph selection sampling. In the first layer of the selection GNN, linear shift invariant filters are well defined as polynomials on the given graph. At the first pooling stage, however, we sample a smaller number of signal components and face the challenge of computing a graph to describe the topology of the subsampled signal. Our proposed strategy is to bypass the computation of a coarsened graph by using zero padding (Sec. III-A).
This simple technique permits the computation of features that are convolutional on the input graph. The pooling stage is modified to aggregate information in multihop neighborhoods as determined by the structure of the original graph and the sparsity of the subsampled signal (Sec. III-B).

In aggregation GNNs we borrow ideas from aggregation sampling [7] to create a signal with temporal structure that incorporates the topology of the graph (Sec. IV and Fig. 3). This can be accomplished by focusing on a designated node and considering the local sequence that is generated by subsequent applications of the graph shift operator. This is a signal with a temporal structure because it reflects the propagation of a diffusion process. Yet, it also captures the topology of the graph because subsequent components correspond to the aggregation of information in nested neighborhoods of increasing reach. Aggregation GNNs apply a regular CNN to the diffusion signal observed at the designated node.

We finally introduce a multinode version of aggregation GNNs, where several regular CNNs are run at several designated nodes (Sec. IV-A and Fig. 4). The resulting CNN outputs are diffused on the input graph to generate another sequence with temporal structure at a smaller subset of nodes, to which regular CNNs are applied in turn. We can think of multinode aggregation GNNs as composed of inner and outer layers. Inner layers are regular CNN layers. Outer layers stack CNNs joined together by a linear diffusion process. Multinode aggregation GNNs are consistently the best performing GNN architecture (Sec. V). We remark that aggregation GNNs, as well as selection GNNs, are proper generalizations of conventional CNNs because they both reduce to a CNN architecture when particularized to a cyclic graph.

The proposed architectures are applied to the problems of localizing the source of a diffusion process on synthetic networks (Sec.
V-A) as well as on real-world social networks (Sec. V-B). Performance is additionally evaluated on problems of authorship attribution (Sec. V-C) and classification of articles of the 20NEWS dataset (Sec. V-D), involving real datasets. Results are compared to those obtained from a graph coarsening architecture using a multiscale hierarchical clustering scheme [16]. The results are encouraging and show that the multinode approach consistently outperforms the other architectures.

Notation: The $n$th component of a vector $x$ is denoted $[x]_n$. The $(m,n)$ entry of a matrix $X$ is $[X]_{mn}$. The vector $x := [x_1; \ldots; x_n]$ is a column vector stacking the column vectors $x_n$. When $n$ denotes a set of subindices, $|n|$ is the number of elements in $n$ and $[x]_n$ denotes the column vector formed by the elements of $x$ whose subindices are in $n$. The vector $\mathbf{1}$ is the all-ones vector.

II. CONVOLUTIONAL NEURAL NETWORKS

Given a training set $\mathcal{T} := \{(x, y)\}$ formed by inputs $x$ and their associated outputs $y$, a learning algorithm produces a representation (mapping) that can estimate the output $\hat y$ that should be assigned to an input $\hat x \notin \mathcal{T}$. NNs produce a representation using a stacked layered architecture in which each layer composes a linear transformation with a pointwise nonlinearity [24]. Formally, the first layer of the architecture begins with a linear transformation producing the intermediate output $u_1 := A_1 x_0 = A_1 \hat x$, followed by a pointwise nonlinearity producing the first layer output $x_1 := \sigma_1(u_1) = \sigma_1(A_1 x_0)$. This procedure is applied recursively, so that at the $\ell$th layer we compute the transformation

$x_\ell := \sigma_\ell(u_\ell) := \sigma_\ell(A_\ell x_{\ell-1})$.  (1)

In an architecture with $L$ layers, the input $\hat x = x_0$ is fed to the first layer and the output $\hat y = x_L$ is read from the last layer [25].
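The layer recursion in (1) is compact enough to sketch directly. The following minimal sketch (illustrative names, numpy only, ReLU chosen as an example nonlinearity) applies $x_\ell = \sigma_\ell(A_\ell x_{\ell-1})$ for a given list of matrices $A_\ell$:

```python
import numpy as np

def relu(x):
    # Pointwise nonlinearity sigma, applied elementwise
    return np.maximum(x, 0.0)

def mlp_forward(A_list, x0, sigma=relu):
    """Fully connected NN forward pass: x_l = sigma(A_l x_{l-1}) [cf. (1)]."""
    x = x0
    for A in A_list:
        x = sigma(A @ x)  # linear transformation followed by pointwise nonlinearity
    return x
```

With $L$ matrices in `A_list`, the returned vector is the estimate $\hat y = x_L$ associated with the input $\hat x = x_0$.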
Elements of the training set $\mathcal{T}$ are used to find matrices $A_\ell$ that optimize a training cost of the form $\sum_{(x,y)\in\mathcal{T}} f(y, x_L)$, where $f(y, x_L)$ is a fitting metric that assesses the difference between the NN's output $x_L$ produced by input $x$ and the desired output $y$ stored in the training set. Computation of the optimal NN coefficients $A_\ell$ is typically carried out by stochastic gradient descent, which can be implemented efficiently using the backpropagation algorithm [9].

The NN architecture in (1) is a multilayer perceptron composed of fully connected layers [25]. If we denote as $M_\ell$ the number of entries of the output of layer $\ell$, the matrix $A_\ell$ contains $M_\ell \times M_{\ell-1}$ components. This potentially very large number of parameters not only makes training challenging, but empirical evidence suggests that it leads to overfitting [26]. CNNs resolve this problem with the introduction of two operations: convolution and pooling.

A. Convolutional Features

To describe the creation of convolutional features, write the output of the $(\ell-1)$st layer as $x_{\ell-1} := [x_{\ell-1}^1; \ldots; x_{\ell-1}^{F_{\ell-1}}]$. This decomposes the $M_{\ell-1}$-dimensional output of the $(\ell-1)$st layer as a stacking of $F_{\ell-1}$ features of dimension $N_{\ell-1}$. This collection of features is the input to the $\ell$th layer.

Figure 1. Convolutional Neural Networks. (a) Consider the input to be a discrete time signal, represented by a succession of signal values. (b) Convolve this signal with a filter to obtain corresponding features [cf. (2)]. The color disks centered at each node symbolize the convolution operation. (c) Apply pooling [cf. (4)]. The color disks symbolize the reach of the pooling operation (the number of samples that are pooled together). (d) Downsample to obtain a discrete time signal of smaller size [cf. (5)].
(e)-(i) Repeat the application of convolution and pooling, trading off the temporal dimension for more features.

Likewise, the intermediate output $u_\ell$ can be written as a collection of $F_\ell$ features $u_\ell := [u_\ell^1; \ldots; u_\ell^{F_\ell}]$, where $u_\ell^f$ is of length $N_{\ell-1}$ and is obtained through convolution and linear aggregation of the features $x_{\ell-1}^g$ of the previous layer, $g = 1, \ldots, F_{\ell-1}$. Specifically, let $h_\ell^{fg} := [[h_\ell^{fg}]_0; \ldots; [h_\ell^{fg}]_{K_\ell-1}]$ be the coefficients of a $K_\ell$-tap linear time invariant filter that is used to process the $g$th feature of the $(\ell-1)$st layer to produce the intermediate feature $u_\ell^{fg}$ at layer $\ell$. Since the filter is defined by a convolution, the components of $u_\ell^{fg}$ are explicitly given by

$[u_\ell^{fg}]_n := [h_\ell^{fg} * x_{\ell-1}^g]_n = \sum_{k=0}^{K_\ell-1} [h_\ell^{fg}]_k\, [x_{\ell-1}^g]_{n-k}$,  (2)

where we consider that: (i) the output has the same size as the input, and (ii) the convolution in (2) is circular to account for border effects. After evaluating the convolutions in (2), the $\ell$th layer features $u_\ell^f$ are computed by aggregating the intermediate features $u_\ell^{fg}$ associated with each of the previous layer features $x_{\ell-1}^g$ using a simple summation,

$u_\ell^f := \sum_{g=1}^{F_{\ell-1}} u_\ell^{fg} = \sum_{g=1}^{F_{\ell-1}} h_\ell^{fg} * x_{\ell-1}^g$.  (3)

The vector $u_\ell := [u_\ell^1; \ldots; u_\ell^{F_\ell}]$ obtained from (2) and (3) represents the output of the linear operation of the $\ell$th layer of the CNN [cf. (1)]. Although not explicitly required, the number of features $F_\ell$ and the number of filter taps $K_\ell$ are typically much smaller than the dimensionality $M_{\ell-1}$ of the features $x_{\ell-1}$ processed by the $\ell$th layer. This reduces the number of learnable parameters from $M_\ell \times M_{\ell-1}$ in (1) to $K_\ell \times F_\ell \times F_{\ell-1}$, simplifying training and reducing overfitting.

B. Pooling

The features $u_\ell^{fg}$ in (2) and their consolidated counterparts $u_\ell^f$ in (3) have $N_{\ell-1}$ components.
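Before detailing the pooling stage, the convolutional stage of (2)-(3), a circular convolution of each input feature followed by a summation over input features, can be sketched as follows (a direct, unoptimized transcription; the names are illustrative):

```python
import numpy as np

def conv_features(h, X):
    """Compute u^f = sum_g h^{fg} * x^g with circular convolutions [cf. (2)-(3)].

    h has shape (F_out, F_in, K) holding the filter taps; X has shape (F_in, N).
    Returns U with shape (F_out, N).
    """
    F_out, F_in, K = h.shape
    _, N = X.shape
    U = np.zeros((F_out, N))
    for f in range(F_out):
        for g in range(F_in):
            for n in range(N):
                # [h * x]_n = sum_k h_k x_{n-k}; indices taken modulo N (circular)
                U[f, n] += sum(h[f, g, k] * X[g, (n - k) % N] for k in range(K))
    return U
```

Note that the parameter count is $K F_{\mathrm{out}} F_{\mathrm{in}}$, independent of the signal length $N$, which is the reduction the text attributes to convolutional layers.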
This number of components is reduced to $N_\ell$ at the pooling stage, in which the values of a group of neighboring elements are aggregated into a single scalar using a possibly nonlinear summarization function $\rho_\ell$. To codify the locality of $\rho_\ell$ we define, with a slight abuse of notation, $n_\ell$ as a vector containing the indices associated with index $n$ (e.g., use $n_\ell = [n-1; n; n+1]$ to group adjacent components) and define the signal $v_\ell^f$ with components

$[v_\ell^f]_n = \rho_\ell([u_\ell^f]_{n_\ell})$.  (4)

The summarization function $\rho_\ell$ in (4) acts as a low-pass operation, and the most common choices are the maximum $\rho_\ell([u_\ell^f]_{n_\ell}) = \max([u_\ell^f]_{n_\ell})$ and the average $\rho_\ell([u_\ell^f]_{n_\ell}) = \mathbf{1}^T [u_\ell^f]_{n_\ell} / |n_\ell|$ [27].

To complete the pooling stage we follow (4) with a downsampling operation. To that end, we define the sampling matrix $C_\ell$ as a fat binary matrix with $N_{\ell-1}$ columns and $N_\ell$ rows, which are selected from the rows of the identity matrix. When the sampling matrix $C_\ell$ is regular, the nonzero entries follow the pattern $[C_\ell]_{mn} = 1$ if $n = (N_{\ell-1}/N_\ell)\,m$ and zero otherwise; hence, the product $C_\ell v_\ell^f$ selects one out of every $(N_{\ell-1}/N_\ell)$ components of $v_\ell^f$. Downsampling is composed with a pointwise nonlinearity to produce the $\ell$th layer features

$x_\ell^f = \sigma_\ell(C_\ell v_\ell^f)$.  (5)

The compression or downsampling factor $(N_{\ell-1}/N_\ell)$ is often matched to the local summarization function $\rho_\ell$, so that the set $n_\ell$ contains $(N_{\ell-1}/N_\ell)$ adjacent indices. We further note that although we defined (4) for all $n$, in practice we only compute the components of $v_\ell^f$ that are to be selected by the sampling matrix $C_\ell$. In fact, it is customary to combine (4) and (5) and simply write $[x_\ell^f]_n = \sigma_\ell(\rho_\ell([u_\ell^f]_{n_\ell}))$ for $n$ in the selection set. Separating the nonlinearity in (4) from the downsampling operation in (5) is convenient to elucidate pooling strategies for signals on graphs.
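Under regular pooling, where $n_\ell$ collects $(N_{\ell-1}/N_\ell)$ adjacent indices starting at each retained sample, (4) and (5) combine into the familiar pool-then-activate step. A minimal sketch (max summarization; illustrative names, with tanh as an example nonlinearity):

```python
import numpy as np

def pool_downsample(u, factor, sigma=np.tanh):
    """Summarize [cf. (4)], then downsample and activate [cf. (5)].

    u: length N_{l-1} feature vector; factor = N_{l-1} / N_l.
    Only the components kept by the sampling matrix C are ever computed.
    """
    N = len(u)
    # rho = max over the set n_l of `factor` adjacent components
    v = [max(u[n:n + factor]) for n in range(0, N, factor)]
    return sigma(np.array(v))  # pointwise nonlinearity sigma
```

Computing only the retained windows mirrors the remark above that $v_\ell^f$ is evaluated only at indices in the selection set.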
Equations (2)-(5) complete the specification of the CNN architecture. We begin at each layer with the input $x_{\ell-1} := [x_{\ell-1}^1; \ldots; x_{\ell-1}^{F_{\ell-1}}]$. Features are fed to parallel convolutional channels to produce the features $u_\ell^{fg}$ in (2), and consolidated into the features $u_\ell^f$ in (3). These features are fed to the local summarization function $\rho_\ell$ to produce the features $v_\ell^f$ [cf. (4)], which are then downsampled and processed by the pointwise activation nonlinearity $\sigma_\ell$ to produce the features $x_\ell^f$ [cf. (5)]. The output of the $\ell$th layer is the vector $x_\ell := [x_\ell^1; \ldots; x_\ell^{F_\ell}]$ that groups the features in (5). We point out for completeness that the $L$th layer is often a fully connected layer in the mold of (1) that does not abide by the convolutional and pooling paradigm of (2)-(5). Thus, the $L$th layer produces an arbitrary (nonconvolutional) linear combination of the $F_{L-1}$ features to produce the final $F_L$ scalar features $x_L$. The output of this readout layer provides the estimate $\hat y = x_L$ associated with the input $\hat x = x_0$ fed to the first layer.

C. Signals on Graphs

There is overwhelming empirical evidence that CNNs are superb representations of signals defined on regular domains such as time series and images [10]. Our goal in this paper is to contribute to the extension of these architectures to signals supported on irregular domains described by arbitrary graphs. Consider then a weighted graph with $N$ nodes, edge set $\mathcal{E}$ and weight function $\mathcal{W}: \mathcal{E} \to \mathbb{R}$. We endow the graph with a shift operator $S$, an $N \times N$ square matrix having the same sparsity pattern as the graph; i.e., we can have $[S]_{mn} \neq 0$ if and only if $(n,m) \in \mathcal{E}$ or $m = n$. The shift operator is a stand-in for one of the matrix representations of the graph.
Commonly used shift operators include the adjacency matrix $A$, with nonzero elements $[A]_{mn} = \mathcal{W}(n,m)$ for all $(n,m) \in \mathcal{E}$, the Laplacian $L := \mathrm{diag}(A\mathbf{1}) - A$, and their normalized counterparts $\bar A$ and $\bar L$ [3]. Consider the signal $x = [x^1; \ldots; x^F]$ formed by $F$ feature vectors $x^f$ with $N$ components each. The feature vector $x^f$ is said to be a graph signal when each of its $N$ components is assigned to a different vertex of the graph. The graph describes the underlying support of the data $x^f$ (hence, of $x$) by using the weights $\mathcal{W}$ to encode arbitrary pairwise relationships between data elements.

The graph shift enables processing of the graph signal $x^f$ because it defines a local linear operation that can be applied to graph signals. Indeed, if we consider the signal $y^f := S x^f$, it follows from the sparsity of $S$ that the $n$th element of $y^f$ depends only on the elements of $x^f$ associated with neighbors of node $n$,

$[y^f]_n = \sum_{m : (m,n) \in \mathcal{E}} [S]_{nm}\, [x^f]_m$.  (6)

It is instructive to consider the cyclic graph adjacency matrix $A_{dc}$, with nonzero elements $[A_{dc}]_{1 + (n \bmod N),\, n} = 1$. Since the cyclic graph describes the structure of discrete (periodic) time, we can say that a discrete time signal $x$ is a graph signal defined on the cyclic graph. When particularized to $S = A_{dc}$, (6) yields $[y^f]_{1 + (n \bmod N)} = [x^f]_n$, implying that $y^f$ is a circularly time shifted copy of $x^f$. This motivates the interpretation of $S$ as the generalization of time shifts to signals supported on the corresponding graph [1].

Enabling CNNs to process data modeled as graph signals entails extending the operations of convolution and pooling to handle the irregular nature of the underlying support. Convolution [cf. (2)] can be readily replaced by the use of linear shift invariant graph filters [cf. (7)]. The summarizing function [cf. (4)] can also be readily extended by using the notion of neighborhood defined by the underlying graph support.
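The cyclic-graph interpretation is easy to check numerically. The sketch below (illustrative names; 0-indexed, so the nonzero pattern reads $[A_{dc}]_{(n+1) \bmod N,\, n} = 1$) builds $A_{dc}$ and applies it as a shift operator; applying $S$ once circularly delays the signal, exactly as (6) predicts:

```python
import numpy as np

def cyclic_adjacency(N):
    """Adjacency matrix of the directed cyclic graph: node n feeds node (n+1) mod N."""
    A = np.zeros((N, N))
    for n in range(N):
        A[(n + 1) % N, n] = 1.0
    return A

S = cyclic_adjacency(4)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = S @ x  # [y]_n = [x]_{(n-1) mod N}: a circular time shift
```

Repeated application $S^k x$ gives the $k$-sample circular delay, which is why polynomials in $S$ recover ordinary convolution on this graph.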
The pointwise nonlinearity can be kept unmodified [cf. (5)], but there are two general downsampling strategies for graph signals: selection sampling [28] and aggregation sampling [7]. Inspired by these, we propose two architectures: selection GNNs (Section III) and aggregation GNNs (Section IV).

Remark 1. Although our current theoretical understanding of CNNs is limited, empirical evidence suggests that convolution and pooling work in tandem to act as feature extractors at different levels of resolution. At each layer, the convolution operation linearly relates up to $K_\ell$ nearby values of each input feature. Since the same filter taps are used to process the whole signal, the convolution finds patterns that, albeit local, are irrespective of the specific location of the pattern in the signal. The use of several features allows the collection of different patterns through the learning of different filters, thus yielding a more expressive operation. The pooling stage summarizes information into a feature of lower dimensionality. It follows that subsequent convolutions operate on summaries of different regions. As we move into deeper layers, we pool summaries of summaries, progressively growing the region of the signal that affects a certain feature. The conjectured value of composing local convolutions with pooling summaries is adopted prima facie as we seek graph neural architectures that exploit the locality of the shift operator to generalize the convolution and pooling operations.

III. SELECTION GRAPH NEURAL NETWORKS

Generalizing the first layer of a CNN to signals supported on graphs is straightforward, as it follows directly from the definition of a linear shift invariant filter [5]. Going back to the definition of convolutional features in (2), we reinterpret the filters $h_1^{fg}$ as graph filters that process the features $x_0^g$ through a graph convolution.
This results in intermediate features $u_1^{fg}$ having components

$[u_1^{fg}]_n := [h_1^{fg} *_S x_0^g]_n := \sum_{k=0}^{K_1-1} [h_1^{fg}]_k\, [S^k x_0^g]_n$,  (7)

where we have used $*_S$ to denote the graph convolution operation on $S$. The summations in (2) and (7) are analogous except for the different interpretations of what it means to shift the input signal $x_0^g$. In (2), a $k$-unit shift at index $n$ means considering $[x_0^g]_{n-k}$, the value of the signal $x_0^g$ at time $n-k$. In (7), graph shifting at node $n$ entails the operation $[S^k x_0^g]_n$, which composes a multiplication by $S^k$ with the selection of the resulting value at $n$. In fact, particularizing (7) to the cyclic graph by making $S = A_{dc}$ renders (2) and (7) equivalent. From the perspective of utilizing (7) as an extractor of local (graph) convolutional features, it is important to note that graph convolutions aggregate information through successive local operations [cf. (6)]. A filter with $K_1$ taps incorporates information at node $n$ that comes from nodes in its $(K_1-1)$-hop neighborhood. Although we wrote (7) componentwise to emphasize its similarity with (2), we can drop the $n$ subindices to write a vector relationship. For future reference we further define the linear shift invariant filter $H_1^{fg} := \sum_{k=0}^{K_1-1} [h_1^{fg}]_k S^k$ to write

$u_1^{fg} = \sum_{k=0}^{K_1-1} [h_1^{fg}]_k S^k x_0^g := H_1^{fg} x_0^g$.  (8)

The graph filter in (8) is a generalization of the Chebyshev filter in [16]. More precisely, if $\mathcal{G}$ is an undirected graph and we adopt the normalized Laplacian as the graph shift operator $S$, then (8) boils down to a Chebyshev filter. The convolutional stage in [18] is a Chebyshev filter with $K = 2$ taps, and thus is also a special case of (8). We also note that the use of polynomials on arbitrary graph shift operators for the convolutional stage

Figure 2. Selection Graph Neural Networks.
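A linear shift invariant graph filter as in (8) is simply a polynomial in $S$. A minimal sketch (illustrative names), which reduces to a circular convolution when $S = A_{dc}$:

```python
import numpy as np

def graph_filter(S, taps):
    """Build H = sum_k h_k S^k [cf. (8)] from a list of filter taps h_k."""
    N = S.shape[0]
    H = np.zeros((N, N))
    Sk = np.eye(N)          # S^0
    for hk in taps:
        H += hk * Sk        # accumulate h_k S^k
        Sk = Sk @ S         # next power of the shift operator
    return H
```

Applying `graph_filter(S, h) @ x0` yields the features in (7); with $K$ taps, the value at node $n$ only mixes information from its $(K-1)$-hop neighborhood, in line with the locality discussion above.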
Consider the input to be a signal supported on a known $N$-node graph. First, convolutional features are obtained by means of graph filtering on the original graph [cf. (8)]. The color disks in the second column illustrate the convolution operation at each node. Then, a subset of $N_1$ nodes is selected, and the summarizing function $\rho_1$ and pointwise nonlinearity $\sigma_1$ are applied to the neighborhood $n_1$ of each of these nodes, obtaining the output $x_1^f$ of the first layer. The color disks in the third column show the reach of the pooling operation, i.e., the size of the neighborhood being pooled (in the first row, the disks include only the one-hop neighborhood; also, only a few disks are shown so as not to clutter the illustration). To obtain convolutional features for the following layers, we zero pad the signal to fit the original graph [cf. (9)] so as to apply a graph filter, and then resample the output at the same set of nodes [cf. (11)-(13)]. Then, a new, smaller subset of nodes is selected, and the summarizing function and pointwise nonlinearity are applied to a neighborhood of these nodes [cf. (15)]. This process is repeated while selecting fewer and fewer nodes.

has also been proposed in [17], [23]. Aside from replacing the linear time invariant filter in (2) with the graph shift invariant filter in (8), the remaining components of the conventional CNN architecture can remain more or less unchanged. The feature aggregation in (3) to obtain $u_1^f$ needs no modification, as it is a simple summation independent of the graph structure. The summarization operator in (4) requires a redefinition of locality. This is not difficult because it follows from (8) that $u_1^f$ is another $N$-node graph signal defined over the same graph $S$. We can then use $n_1$ to represent a graph neighborhood of node $n$ and apply the same summary operator. We point out that $n_1$ need not be the 1-hop neighborhood of $n$.
The sampling and activation operation in (5) requires a matrix $C_1$ to sample over the irregular graph domain. Apart from the challenge of selecting sampling matrices for graphs (see (16) and [7], [28]-[30]), this does not require any further modification to (5). The first row of Fig. 2 shows the operations carried out in this first layer.

The challenge in generalizing CNNs to GNNs arises beyond the first layer. After implementing the sampling operation in (5), the signal $x_1^f$ is of lower dimensionality than $u_1^f$ and can no longer be interpreted as a signal supported on $S$. In regular domains this is not a problem because we use the extraneous geometric information of the underlying domain to define convolutions in the space of lower dimensionality. To see this in terms of graph signals, interpret the signal $x_0^g$ defined on a regular domain as one defined on a cyclic graph with $N_0 = N$ nodes, which is also the graph that describes $u_1^f$. Then, if we consider a downsampling factor of $(N_1/N_0)$, another cyclic graph with $N_1$ nodes describes the signal $x_1^f$. However, when graph signals are defined on a generic irregular domain, there is no extraneous information to elucidate the form of the graph that describes signals beyond the first layer. Resolving mismatched supports is a well-known problem in signal processing whose simplest and most widely used solution is zero padding. The following sections illustrate how zero padding can be leveraged to resolve one of the critical challenges in the implementation of GNNs.

A. Selection Sampling on Graph Convolutional Features

Sampling is an operation that selects components of a signal. To explain the construction of convolutional features on graphs, it is more convenient to think of sampling as the selection of nodes of the graph, which we call active nodes.
This implies that at each layer we place the input features $x_{\ell-1}^f$ of dimension $N_{\ell-1}$ on top of the active nodes of the graph $S$. Selection schemes are further discussed in Sec. III-C. Doing so requires that we keep track of the location of the samples. Thus, at each layer we consider input features $x_{\ell-1}^g$, each with $N_{\ell-1}$ components, and zero padded features $\tilde x_{\ell-1}^g$, each of size $N$ but with only $N_{\ell-1}$ nonzero components, which replicate the values of $x_{\ell-1}^g$. The indices of the nonzero components of $\tilde x_{\ell-1}^g$ correspond to the locations of the elements of $x_{\ell-1}^g$ in the original graph. It is clear that we can move from the unpadded to the padded representation by multiplying with an $N \times N_{\ell-1}$ tall binary sampling matrix $D_{\ell-1}^T$. Indeed, if we let $[D_{\ell-1}]_{mn} = 1$ indicate that the $m$th component of the unpadded feature, $[x_{\ell-1}^g]_m$, is located at the $n$th node of the graph, we can write the padded feature as

$\tilde x_{\ell-1}^g = D_{\ell-1}^T x_{\ell-1}^g$.  (9)

The advantage of keeping track of the padded signal is that convolutional features can be readily obtained by operating on the original graph. Given the notion of graph convolution in (8), and (re)defining $h_\ell^{fg}$ to be the graph filter coefficients at layer $\ell$, we can define intermediate features as [cf. (2)]

$\tilde u_\ell^{fg} := \sum_{k=0}^{K_\ell-1} [h_\ell^{fg}]_k S^k \tilde x_{\ell-1}^g$.  (10)

Although a technical solution to the construction of convolutional features, (10) does not exploit the computational advantages of sampling. These can be recovered by selecting the components of $\tilde u_\ell^{fg}$ at the same set of nodes that support $x_{\ell-1}^g$. We then define $u_\ell^{fg} := D_{\ell-1} \tilde u_\ell^{fg}$. If we further use (9) to substitute $\tilde x_{\ell-1}^g$ into the definition of the convolutional features in (10), we can write

$u_\ell^{fg} := D_{\ell-1} \tilde u_\ell^{fg} = D_{\ell-1} \sum_{k=0}^{K_\ell-1} [h_\ell^{fg}]_k S^k D_{\ell-1}^T x_{\ell-1}^g$.  (11)
If we further define the reduced-dimensionality $k$-shift matrices

$S_\ell^{(k)} := D_{\ell-1} S^k D_{\ell-1}^T$,  (12)

and reorder and regroup terms in (11), we can reduce the definition of the convolutional features to

$u_\ell^{fg} = \sum_{k=0}^{K_\ell-1} [h_\ell^{fg}]_k S_\ell^{(k)} x_{\ell-1}^g = H_\ell^{fg} x_{\ell-1}^g$,  (13)

where we have also defined the subsampled linear shift invariant filter $H_\ell^{fg} := \sum_{k=0}^{K_\ell-1} [h_\ell^{fg}]_k S_\ell^{(k)}$. Implementing (11) entails repeated application of the shift operator to the padded signal, which can be carried out at low cost if the original input graph is sparse. In (13), the filter $H_\ell^{fg}$ takes advantage of sampling to operate directly on a space of lower dimension $N_{\ell-1}$. The matrices $S_\ell^{(k)}$ can be computed beforehand because they depend only on the graph shift operator and the sampling matrices. We emphasize that, save for subsampling, (13) and (11) are equivalent and that, therefore, the features $u_\ell^{fg}$ generated by the subsampled filter $H_\ell^{fg}$ are convolutional relative to the original graph (shift) $S$. The middle image in Fig. 2 shows the zero padding of the input signal, convolution on the original graph, and resampling of the filter output. The features $u_\ell^f$ can be obtained from the features $u_\ell^{fg}$ using the same linear aggregation operation in (3), which does not require adaptation to the structure of the graph,

$u_\ell^f = \sum_{g=1}^{F_{\ell-1}} H_\ell^{fg} x_{\ell-1}^g$.  (14)

This completes the construction of convolutional features and leads to the pooling stage we describe next.

B. Selection Sampling and Pooling

The pooling stage requires that we redefine the summary and sampling operations in (4) and (5). Generalizing the summary operation requires redefining the aggregation neighborhood. In the first layer this can be readily accomplished by selecting the $\alpha_1$-hop neighborhood of each node, for some given $\alpha_1$ that defines the reach of the summary operation. This information is contained in the powers of the shift operator.
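Because (13) follows from (11) by linearity, the subsampled filter built from the precomputed matrices $S_\ell^{(k)}$ in (12) must match the pad-filter-resample route exactly. A sketch that can be checked numerically (illustrative names):

```python
import numpy as np

def kshift_matrices(S, D, K):
    """Reduced-dimensionality k-shift matrices S^(k) = D S^k D^T [cf. (12)]."""
    mats, Sk = [], np.eye(S.shape[0])
    for _ in range(K):
        mats.append(D @ Sk @ D.T)
        Sk = Sk @ S            # next power of the shift operator
    return mats

def subsampled_filter(S, D, taps):
    """H = sum_k h_k S^(k), acting directly on the sampled signal [cf. (13)]."""
    return sum(h * Sk for h, Sk in zip(taps, kshift_matrices(S, D, len(taps))))
```

Here `D` plays the role of the nested sampling matrix $D_{\ell-1}$ (rows of the identity), so `D.T @ x` zero pads as in (9), and `subsampled_filter(S, D, taps) @ x` reproduces $D_{\ell-1}(\sum_k h_k S^k) D_{\ell-1}^T x$ from (11).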
The 1-hop neighborhood of $n$ is the set of nodes $m$ such that $[S]_{nm} \neq 0$, the 2-hop neighborhood is the union of this set with those nodes $m$ with $[S^2]_{nm} \neq 0$, and so on. In the case of the sampled features, the graph neighborhoods need to be intersected with the set of active nodes. This intersection is already captured by the $k$-shift matrices $S^{(k)}_\ell$ [cf. (12)]. Thus, at layer $\ell$ we introduce an integer $\alpha_\ell$ to specify the reach of the summary operator and define the $\alpha_\ell$-hop neighborhood of $n$ as
$$n_\ell = \Big\{ m : \big[S^{(k)}_\ell\big]_{nm} \neq 0, \text{ for some } k \le \alpha_\ell \Big\}. \qquad (15)$$
Summary features $[v^f_\ell]_n$ at node $n$ are computed from (4) using the graph neighborhoods in (15). These neighborhoods follow the node proximity encoded by $S$; see the third column of Fig. 2.

To formally explain the downsampling operation in (5) in the context of graph signals, we begin by defining sampling matrices adapted to irregular domains. This can be easily done at the $\ell$th layer if we let the sampling matrix $C_\ell$ be a fat matrix with $N_\ell$ rows and $N_{\ell-1}$ columns with the properties
$$[C_\ell]_{mn} \in \{0,1\}, \qquad C_\ell \mathbf{1} = \mathbf{1}, \qquad C^T_\ell \mathbf{1} \le \mathbf{1}. \qquad (16)$$
When $[C_\ell]_{mn} = 1$, the $n$th component of $v^f_\ell$ is selected in the product $C_\ell v^f_\ell$ and stored as the $m$th component of the output. The properties in (16) ensure that exactly $N_\ell$ components of $v^f_\ell$ are selected and that no component is selected more than once. They do not, however, enforce a regular sampling pattern. We further define the nested sampling matrix $D_\ell$ as the product of all sampling matrices applied up until layer $\ell$,
$$D_\ell = C_\ell C_{\ell-1} \cdots C_1 = \prod_{\ell'=1}^{\ell} C_{\ell'}. \qquad (17)$$
The matrix $D_\ell$ keeps track of the location of the selected nodes in the original graph for each layer, and is thus used for the zero-padding operation in (11). Each layer of the selection GNN architecture is determined by (13)-(14) for the convolution operation and (4)-(5) for pooling and nonlinearity.
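The neighborhood rule (15) can be sketched directly from powers of the shift operator. For simplicity, this illustration works on the full graph with no subsampling, so plain $S^k$ plays the role of $S^{(k)}_\ell$; the path graph and the function name are illustrative assumptions:

```python
import numpy as np

# Sketch of (15): node m belongs to the alpha-hop neighborhood of n if
# [S^k]_{nm} != 0 for some k <= alpha. Illustrated on a 6-node path graph.
N = 6
S = np.zeros((N, N))
for i in range(N - 1):
    S[i, i + 1] = S[i + 1, i] = 1.0

def alpha_hop_neighborhood(S, n, alpha):
    """Nodes reachable from n within alpha shifts (k = 0 includes n itself)."""
    N = S.shape[0]
    reach = np.zeros(N, dtype=bool)
    Sk = np.eye(N)                      # S^0
    for _ in range(alpha + 1):          # k = 0, 1, ..., alpha
        reach |= Sk[n] != 0
        Sk = Sk @ S
    return set(np.flatnonzero(reach))

# On a path graph, the 2-hop neighborhood of node 2 is {0, 1, 2, 3, 4}.
assert alpha_hop_neighborhood(S, 2, 2) == {0, 1, 2, 3, 4}
```

With subsampling, one would replace the powers $S^k$ by the precomputed matrices $S^{(k)}_\ell$ of (12) and restrict $n$ and $m$ to the active nodes.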
To summarize, the input to layer $\ell$ is $x_{\ell-1}$, comprised of $F_{\ell-1}$ features $x^f_{\ell-1}$ located at a subset of nodes given by $D_{\ell-1}$. Then, we use the reduced-dimensionality $k$-shift matrices (12) to process $x^f_{\ell-1}$ with a graph filter as in (13) and obtain the aggregated features $u^f_\ell$ (14). A neighborhood $n_\ell$ for each element of $u^f_\ell$ is determined by (15) for some $\alpha_\ell$, and the output $v^f_\ell$ of the summarizing function $\rho_\ell$ is computed as in (4). Finally, following (5), a smaller subset of nodes is selected by means of $C_\ell$ and the pointwise nonlinearity $\sigma_\ell$ is applied to obtain the $\ell$th output features $x^f_\ell$, for $f = 1, \ldots, F_\ell$. See Algorithm 1 for details.

Remark 2. The selection GNN architecture recovers a conventional CNN when particularized to graph signals described by a cyclic graph (conventional discrete-time signals). To see this, let $S = A_{dc}$ for a graph of size $N$, and let $C_{\ell-1}$ be the sampling matrix that takes $N_{\ell-1}$ equally spaced samples out of the previous $N_{\ell-2}$ samples, for every $\ell$. Then, the nested sampling matrix $D_{\ell-1}$ becomes a sampling matrix that takes $N_{\ell-1}$ equally spaced samples out of the $N$ original ones. This implies that $S^{(k)}_\ell = D_{\ell-1} A^k_{dc} D^T_{\ell-1}$ becomes either the $k$th power of the adjacency matrix of a cyclic graph with $N_{\ell-1}$ nodes, for $k$ a multiple of $N/N_{\ell-1}$, or the all-zero matrix otherwise. As a result, the convolutional features obtained with (13) are equivalent to those obtained with (2). Likewise, making $\alpha_\ell = N_{\ell-1}/N_\ell$ for all $\ell$ leads to regular pooling and downsampling. This shows that the selection GNN does indeed boil down to the conventional CNN for discrete-time signals.

Remark 3. The dimension $N_\ell$ is effectively reduced without the need for a complex multiscale hierarchical clustering algorithm.
More specifically, in each layer only a new set of nodes is used, but there is no need to recompute edges between these nodes or new weight functions, since the underlying graph on which the operations are actually carried out is the same graph that supports the initial input data $x$. This not only avoids the computational cost of obtaining multiscale hierarchical clusters, but also avoids the need to assess when such a clustering scheme is adequate.

C. Practical Considerations

Algorithm 1 Selection Graph Neural Network.
Input: $\{\hat{x}\}$: testing dataset, $\mathcal{T}$: training dataset, $S$: graph shift operator, $L$: number of layers, $\{F_\ell\}$: number of features, $\{K_\ell\}$: degree of filters, $\{\rho_\ell\}$: neighborhood summarizing function, selection: selection sampling method, $\{N_\ell\}$: number of nodes at each layer, $\{\sigma_\ell\}$: pointwise nonlinearity
Output: $\{\hat{y}\}$: estimates for $\{\hat{x}\}$
1: procedure SelectionGNN($\{\hat{x}\}$, $\mathcal{T}$, $S$, $L$, $\{F_\ell\}$, $\{K_\ell\}$, $\{\rho_\ell\}$, selection, $\{N_\ell\}$, $\{\sigma_\ell\}$)
   ▷ Create architecture:
2:  for $\ell = 1 : L-1$ do
3:    Compute $D_{\ell-1} = C_{\ell-1} D_{\ell-2}$  ▷ See (17)
4:    Compute $S^{(k)}_\ell$ for $k = 0, \ldots, K_\ell - 1$  ▷ See (12)
5:    Create $[h^{fg}_\ell]_k$, $f = 1, \ldots, F_\ell$, $g = 1, \ldots, F_{\ell-1}$
6:    Compute filters $H^{fg}_\ell = \sum_{k=0}^{K_\ell-1} [h^{fg}_\ell]_k S^{(k)}_\ell$
7:    Aggregate filtered features $\sum_{g=1}^{F_{\ell-1}} (H^{fg}_\ell \,\cdot\,)$
8:    Apply summarizing function $\rho_\ell(\cdot)$
9:    Select $N_\ell$ nodes following method selection: $C_\ell = \text{selection}(N_\ell, C_{\ell-1})$
10:   Downsample output of summarizing function $C_\ell \rho_\ell$
11:   Apply pointwise nonlinearity $\sigma_\ell(\cdot)$
12: end for
13: Create fully connected layer $A_L$
   ▷ Train:
14: Learn $\{[h^{fg}_\ell]_k\}$ and $A_L$ from $\mathcal{T}$
   ▷ Evaluate:
15: Obtain $\{\hat{y}\}$ by applying the GNN to $\{\hat{x}\}$ with the learned coefficients
16: end procedure

Selection of nodes. There is a vast GSP literature on sampling by selecting nodes; see, e.g., [28]-[32].
In this paper, we consider that any one of these methods is adopted throughout the selection GNN, and at each layer the matrix $C_\ell$ is determined by following the chosen method. At each layer, the subset of nodes selected by $C_\ell$ is always a subset of the nodes chosen in the previous layer. This implies that $N_\ell \le N_{\ell-1}$ and that $C_\ell C_{\ell-1}$ never yields the zero matrix. In particular, in Sec. V we adopt the methods proposed in [29] and [32] to study their impact on the overall performance of the selection GNN.

Locality of filtering. The graph convolution remains a local operation with respect to the original input graph. Since each convolutional feature is zero padded to fit the graph, the graph filter at each layer can be implemented by means of local exchanges over the original support. This can be a good computational option if the original input graph is sparse, since repeatedly applying the graph shift operator then exploits this sparsity. It turns out to be particularly useful when the support represents a physical network with physical connections.

Centralized computing. When the selection pooling architecture is executed as a whole from a single centralized unit (i.e., when local connectivity is not important for computation purposes, for example in the training phase), the computational cost of carrying out the convolutions (13) reduces to matrix multiplications in the smaller $N_\ell$-dimensional space. Note that the reduced-dimensionality $k$-shift matrices (12) can be obtained before the training phase, and that the statistical properties of learning the filter taps are not affected by them. This observation, coupled with the previous one, shows that the selection pooling architecture adequately addresses the global vs. local duality by efficiently computing convolutions in both settings.

Computation of nonlinearities.
From an implementation perspective, observe that, while the local summarizing function $\rho_\ell$ involves the neighborhoods of the $N_{\ell-1}$ nodes (which are more than the $N_\ell$ nodes that are kept in layer $\ell$), this function only has to be computed at the $N_\ell$ nodes that remain after downsampling. That is, there is no need to compute $\rho_\ell$ at each of the $N_{\ell-1}$ nodes, but only at the $N_\ell$ nodes that are actually kept. In this sense, this nonlinear operation can be subsumed with the pointwise nonlinearity $\sigma_\ell$ that is applied to the $N_\ell$ nodes. To illustrate this point, suppose that max-pooling is used and that the corresponding pointwise nonlinearity is a ReLU, $\sigma_\ell(x) = \max\{0, x\}$. Then both operations can be performed simultaneously at node $n$ by computing $\max\{0, \{x_m : (m,n) \in n_\ell\}\}$, where $n_\ell$ denotes the neighborhood, and this operation is carried out only at the nodes $n$ that are part of the $N_\ell \le N_{\ell-1}$ selected nodes.

Regularization of filter taps. As the selection GNN grows in depth (more layers), the number of filter taps in the convolution stage might increase in order to access information located at farther away neighbors (this happens if the few selected nodes at some deeper layer are far away from each other, as measured by the number of neighborhood exchanges). It is then a good idea to structure the filter coefficients $h^{fg}_\ell$ in these deeper layers. More specifically, filtering with $N$ taps might be necessary, so it makes sense to choose $[h^{fg}_\ell]_k$ constant over a range of $k$, since no substantial new information is going to be included for a wide range of those $k$. This reduces the number of trainable parameters and, consequently, overfitting.

Definition of neighborhoods. Information from the weight function $\mathcal{W}$ of the graph can be included in the pooling stage (15).
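The fused computation of max-pooling and the ReLU described under "Computation of nonlinearities" above can be sketched in a few lines; the signal values, kept nodes, and neighborhood sets below are illustrative assumptions:

```python
import numpy as np

# Sketch of fusing max-pooling with the ReLU: evaluate
# max{0, {x_m : m in n_ell}} only at the nodes kept after downsampling,
# instead of summarizing at all N_{l-1} nodes first.
x = np.array([-1.0, 3.0, 0.5, -2.0, 4.0, 1.0])   # feature on 6 nodes
kept_nodes = [1, 4]                               # the N_l surviving nodes
neighborhoods = {1: [0, 1, 2], 4: [3, 4, 5]}      # n_ell for each kept node

pooled = np.array([
    max(0.0, x[neighborhoods[n]].max())           # ReLU and max-pool in one pass
    for n in kept_nodes
])
# pooled -> [3.0, 4.0]
```

Only two summaries are computed here, rather than six, which is the saving the text describes.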
More precisely, instead of defining the neighborhood by looking only at the edge set $\mathcal{E}$, that is, $[S^{(k)}_\ell]_{nm} \neq 0$, we can require $[S^{(k)}_\ell]_{nm} \ge \delta$, so that we summarize only across edges stronger than $\delta$.

Frequency interpretation of convolutional features. One advantage of having convolutional features defined on the same graph $\mathcal{G}$ at every layer is that they can be easily analyzed from a frequency perspective. Since the graph Fourier transform of a signal depends on the eigenvectors $V$ of the graph shift operator [2], and since the same $S = V \Lambda V^{-1}$ is used to define all convolutional features [cf. (11)], they all share the same frequency basis, allowing for a comprehensive frequency analysis at all layers. In particular, if we focus on normal-matrix GSOs, i.e., $V^{-1} = V^H$, the aliasing effect of zero padding is evidenced by the fact that $V^H D^T D V$ need not be the identity matrix for arbitrary eigenvectors $V$ and downsampling matrices $D$, altering the frequency content of the input signal to a filter. However, the filter taps are learned from the training set taking this aliasing effect into account, and are therefore able to cope with it and extract useful features.

Computational cost. The number of computations at each layer is given by the cost of the convolution operation, which is $O(|\mathcal{E}| K_\ell F_\ell F_{\ell-1})$ if (11) is used, or $O(N^2_{\ell-1} K_\ell F_\ell F_{\ell-1})$ if (13) is used, since pooling and downsampling incur negligible cost. We observe that in (13) the cost tends to be dominated by $N^2_{\ell-1}$, making dimensionality reduction (i.e., pooling) a critical step for scalability.

Number of parameters. The number of parameters to be learned at each layer is determined by the length of the filters and the number of input and output features, and is given by $O(K_\ell F_\ell F_{\ell-1})$, independent of $N_{\ell-1}$.

IV.
AGGREGATION GRAPH NEURAL NETWORKS

The selection GNNs of Section III create convolutional features adapted to the structure of the graph with linear shift-invariant graph filters. The aggregation GNNs that we describe here apply the conventional CNN architecture of Section II to a signal with temporal (regular) structure that is generated to incorporate the topology of the graph. To create such a temporal arrangement, we consider successive applications of the graph shift operator $S$ to the input graph signal $x^g$ (see the first row of Fig. 3). This creates a sequence of $N$ graph-shifted signals $y^g_0, \ldots, y^g_{N-1}$. The first signal of the sequence is $y^g_0 = x^g$, the second is $y^g_1 = S x^g$, and subsequent members of the sequence are recursively obtained as $y^g_k = S y^g_{k-1} = S^k x^g$. We observe that each vector $y^g_k$ incorporates the underlying support by means of multiplication by the graph shift operator $S$. We arrange the sequence of signals $y^g_k$ into the matrix representation of the graph signal $x^g$, defined as
$$X^g := [y^g_0, y^g_1, \ldots, y^g_{N-1}] := [x^g, S x^g, \ldots, S^{N-1} x^g]. \qquad (18)$$
The matrix $X^g$ is a redundant representation of $x^g$. In fact, for any connected graph, any row of $X^g$ is sufficient to recover $x^g$, as each row contains $N$ linear combinations of $x^g$ [7]. We thus note that any such row has successfully incorporated the graph structure included in the powers of the graph shift operator $S$, without any loss of information. Our goal here is to work at a designated node $p$ with the signal $z^g_p$ that contains the components of the diffusion sequence $y^g_k$ observed at node $p$ (see the second row of Fig. 3). This is simply the $p$th row of $X^g$ and leads to the definition
$$z^g_p := \big[X^g\big]^T_p = \Big[ [x^g]_p ; [S x^g]_p ; \ldots ; [S^{N-1} x^g]_p \Big]. \qquad (19)$$
The signal $z^g_p$ is a local representation at node $p$ that accounts for the topology of the graph in a temporally structured manner.
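The diffusion sequence (18) and the per-node aggregated signal (19) can be sketched in a few lines of NumPy; the random shift operator and the choice of node $p$ are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

# Sketch of (18)-(19): diffuse x with S and record what node p observes.
rng = np.random.default_rng(1)
N, p = 5, 2
S = rng.random((N, N)) * (rng.random((N, N)) < 0.5)  # arbitrary shift operator
x = rng.standard_normal(N)                           # input graph signal x^g

# X = [x, Sx, ..., S^{N-1} x], built by repeated (local) shifts.
cols = [x]
for _ in range(N - 1):
    cols.append(S @ cols[-1])
X = np.stack(cols, axis=1)              # N x N matrix of (18)

z_p = X[p]                              # row p: the aggregated signal of (19)
assert z_p.shape == (N,)
assert np.isclose(z_p[0], x[p])         # first entry: the raw sample at p
assert np.isclose(z_p[1], (S @ x)[p])   # second entry: p's 1-hop aggregate
```

Note that each column is obtained from the previous one by a single application of $S$, i.e., by local exchanges with neighbors only.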
Figure 3. Aggregation Graph Neural Networks. Select a node $p \in \mathcal{V}$ and perform successive local exchanges with its neighbors. For each $k$-hop neighborhood (illustrated by the increasing disks in the first row), record $S^k x^g$ at node $p$ and build the signal $z^g_p$, which exhibits a regular structure [cf. (19)]. Once a regular time-structure signal is obtained, we proceed to apply regular convolution and pooling to process the data [cf. (2)-(5)].

Indeed, since the diffusion sequence $y^g_k$ is generated from a temporal diffusion process, the components of the sequence $z^g_p$ are elements of a time sequence. Yet, the components of this time sequence depend on the topology of the graph. The first element of $z^g_p$ is the value of the input signal $x^g$ at node $p$, which is independent of the graph topology, but the second element of $z^g_p$ aggregates information from the values of the input $x^g$ within the neighborhood of $p$, as defined by the nodes that are connected to $p$. The third element of $z^g_p$ is an aggregate of aggregates, which results in the aggregation of information from the 2-hop neighborhood of $p$. As we move forward in the sequence $z^g_p$, we incorporate information from nodes that are farther from $p$, as determined by the topology of the graph. In this way, we have successfully generated a regularly structured signal that effectively incorporates the underlying graph structure. We note that two consecutive elements of $z^g_p$ indeed relate neighboring values according to the topology of the graph. Since the signal $z^g_p$ is a signal in time, it can be processed with a regular CNN architecture, and this is indeed our definition of aggregation GNNs. At the first layer $\ell = 1$, we take the locally aggregated signal $z^g_p$ as input and produce features $u^{fg}_{p1}$ by convolving with the $K_{p1}$-tap filter $h^{fg}_{p1}$ [cf.
(2)],
$$\big[u^{fg}_{p1}\big]_n := \big[h^{fg}_{p1} * z^g_p\big]_n = \sum_{k=0}^{K_{p1}-1} \big[h^{fg}_{p1}\big]_k \big[z^g_p\big]_{n-k}, \qquad (20)$$
where we use zero padding to account for border effects and assume that the size of the output is the same as that of the input. The convolution in (20) is the regular time convolution. In fact, except for minor notational differences to identify the aggregation node $p$, (20) is the same as (2) with $\ell = 1$. The topology of the graph is incorporated in (20) not because of the convolution but because of the way in which we construct $z^g_p$. To emphasize the effect of the topology of the graph, we use (19) to rewrite (20) as
$$\big[u^{fg}_{p1}\big]_n = \sum_{k=0}^{K_{p1}-1} \big[h^{fg}_{p1}\big]_k \big[S^{n-k-1} x^g\big]_p. \qquad (21)$$
Since the convolution in (21) considers consecutive values of the signal $z^g_p$, the features $u^{fg}_{p1}$ have a structure that is convolutional on the graph $S$. Each feature element $[u^{fg}_{p1}]_n$ is a linear combination of $K_{p1}$ consecutive neighboring values of the input $x^g$, starting with the shift $S^{n-1} x^g$ and ending at $S^{n-K_{p1}} x^g$. Alternatively, note that the regular convolution operation linearly relates consecutive elements of a vector; since consecutive elements of the vector $z^g_p$ reflect nearby neighborhoods according to the graph, we have effectively related neighboring values of the graph signal by means of a regular convolution. Thus, coefficients $h^{fg}_{p1}$ encoding low-pass filters further aggregate information across neighborhoods, while high-pass filters output features quantifying differences between consecutive neighborhood resolutions. That is, low-pass time filters applied to $z^g_p$ detect features that are smooth on the graph $S$, while high-pass time filters applied to $z^g_p$ detect sharp transitions between the signal values of nearby nodes.
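A minimal sketch of the regular convolution (20) applied to an aggregated signal, using a two-tap difference (high-pass) filter to illustrate how such filters quantify changes between consecutive neighborhood resolutions; the function name and signal values are illustrative assumptions:

```python
import numpy as np

# Sketch of (20): [u]_n = sum_k h_k z_{n-k}, with z zero padded on the
# left so the output has the same length as the input.
def causal_conv(h, z):
    """Regular time convolution of taps h with signal z (same length out)."""
    K, N = len(h), len(z)
    z_pad = np.concatenate([np.zeros(K - 1), z])
    return np.array([h @ z_pad[n:n + K][::-1] for n in range(N)])

h = np.array([1.0, -1.0])               # simple high-pass (difference) filter
z = np.array([1.0, 2.0, 4.0, 4.0])      # stand-in for an aggregated z_p
u = causal_conv(h, z)
# u_n = z_n - z_{n-1}: differences between consecutive resolutions
assert np.allclose(u, [1.0, 1.0, 2.0, 0.0])
```

A low-pass choice such as $h = [0.5, 0.5]$ would instead average consecutive entries of $z^g_p$, i.e., smooth across neighborhood resolutions.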
Figure 4. Multinode Aggregation Graph Neural Networks. Start by selecting a subset $\mathcal{P}_1 \subset \mathcal{V}$ of $P_1$ nodes of the graph (row 1, diagram 1). Then, proceed to perform $Q_1$ local exchanges with neighbors (row 1, diagrams 2, 3, and 4) in order to build $P_1$ regular time-structure signals, one at each node (row 2); see (22). We note that, in row 1, the color disks illustrate the reach of the $Q_1$ local exchanges of each of the selected nodes in $\mathcal{P}_1$. Once the regularly structured signals have been constructed at each of the $P_1$ nodes, proceed with a regular CNN, applying regular convolution (row 3) and regular pooling (row 4), until $F_{L_1}$ features are obtained at each node (row 5); see (2)-(5), (23). Now, view each feature as a graph signal supported on the selected nodes, see (24), zero padded to fit the graph (row 6, diagram 1); see (25). We then select a smaller subset $\mathcal{P}_2 \subseteq \mathcal{P}_1$ of $P_2 \le P_1$ nodes (row 6, diagram 2) and carry out $Q_2$ local exchanges with the neighbors (row 6, diagrams 2, 3, and 4), illustrated with color disks in the last row. These neighbor exchanges create new regularly structured signals at each of the $P_2$ nodes; see (26). Then, we continue by computing $F_{L_2}$ regional features at each node by means of regular CNNs, and so on.

Once the features $u^{fg}_{p1}$ in (20), or their equivalents in (21), are computed, we sum the features $u^{fg}_{p1}$ as per (3), obtaining $u^f_{p1}$; compute local summaries as per (4), yielding $v^f_{p1}$; and subsample according to (5), resulting in features $x^f_{p1}$. Since in this case the indexes of the feature vector represent (neighborhood) resolution, some applications may benefit from non-equally spaced sampling schemes that put more emphasis on sampling the high-resolution (or low-resolution) part of the feature vector. Subsequent layers repeat the computation of convolutional features and the pooling steps in (2)-(5).
Formally, all of the variables in (2)-(5) need to be marked with a subindex $p$ to identify the aggregation node.

Remark 4. The aggregation GNN architecture reduces trivially to conventional CNNs when particularized to graph signals defined over a cyclic graph. Since $[A^k_{dc} x^g]_p = [x^g]_{1 + (p+k) \bmod N}$ is a cyclic shift of the input signal $x^g$, we have $z^g_p = x^g$ in (19) for all $p$, and a regular CNN follows.

Remark 5. The aggregation GNN architecture rests on transforming the data on the graph in such a way that it becomes supported on a regular structure, so that regular CNN techniques can be applied. Transforming graph data is the main concern of graph embeddings [33]. Unlike the methods surveyed in [33], we consider the underlying graph support $\mathcal{G}$ as given (not learned); we do not attempt to compress the graph data, as the construction of the aggregated vector $z^g_p$ does not entail any loss of information (if all eigenvalues of $S$ are distinct); and the focus is on the data defined on top of the graph (the graph signal), rather than on the graph itself (given by $S$).

A. Multinode Aggregation Graph Neural Networks

Selecting a single node $p \in \mathcal{V}$ to aggregate all the information generally entails $N-1$ local exchanges with neighbors [cf. (18)]. For large networks, carrying out all these exchanges might be infeasible, either due to the associated communication overhead or due to numerical instabilities. This can be overcome by selecting a subset of nodes to aggregate local information, i.e., selecting a submatrix of (18) with a few rows and columns in lieu of a single row with all the columns; see Fig. 4. The selected nodes first process their own samples using an aggregation GNN and then exchange the obtained outputs with the other selected nodes. This process is repeated until the information has propagated through the entire graph.
To explain this two-level hierarchical architecture, let us denote by $\ell$ the layer index for the aggregation stage and by $r$ the layer index for the exchange stage. The total number of exchange (outer) layers is $R$ and, for each outer layer $r$, a total of $L_r$ aggregation (inner) layers are run. We start by describing the procedure for $r = 1$, where $\mathcal{P}_1 \subset \mathcal{V}$ denotes the subset of selected nodes and $Q_1$ denotes the number of times the shift is applied ($S^q$, for $q = 0, \ldots, Q_1 - 1$). This amounts to selecting $P_1 = |\mathcal{P}_1|$ rows and $Q_1$ consecutive columns of (18). Setting $\ell = 0$, the signal aggregating the $(Q_1-1)$-hop neighborhood information at each node $p \in \mathcal{P}_1$ can be constructed as [cf. (19)]
$$z^g_{p0}(1, Q_1) := \Big[ [x^g]_p ; [S x^g]_p ; \ldots ; [S^{Q_1 - 1} x^g]_p \Big]. \qquad (22)$$
Since $z^g_{p0}$ exhibits a time structure, the regular CNN steps (2)-(5) can be applied individually at each node (see Fig. 4). More specifically, with $L_1$ denoting the number of layers of the aggregation stage when $r = 1$, a set of $F_{L_1}$ descriptive features of the $(Q_1-1)$-hop neighborhood of node $p$ is constructed by concatenating $\ell = 0, \ldots, L_1 - 1$ layers of the form (2)-(5), as is done in the aggregation GNN. Setting $\ell = L_1$, the output of the last layer of the aggregation stage is
$$z_{pL_1}(1, Q_1) = \Big[ z^1_{pL_1} ; \ldots ; z^{F_{L_1}}_{pL_1} \Big]. \qquad (23)$$
A different feature vector $z_{pL_1}$ of dimension $F_{L_1}$ is thus obtained at each of the selected nodes $p$, describing the corresponding $(Q_1-1)$-hop neighborhood. In order to further aggregate these local features (describing local neighborhoods) into more global information, we need to collect each feature $g$ at every selected node $p \in \mathcal{P}_1$. More precisely, let $P_1 = |\mathcal{P}_1|$ be the number of selected nodes; then
$$x^g_1 = \Big[ z^g_{p_1 L_1} ; \ldots ; z^g_{p_{P_1} L_1} \Big], \qquad (24)$$
where each $p_k \in \mathcal{P}_1$. We now set $r = 2$ and select a subset of nodes $\mathcal{P}_2 \subseteq \mathcal{P}_1$ to aggregate the features $x^g_1$ by means of local neighborhood exchanges.
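A minimal sketch of this exchange stage, in which features supported on a subset of nodes are zero padded onto the full graph and re-diffused, the steps formalized in (25)-(26) below; the random shift, the selected subset, and all variable names are illustrative assumptions:

```python
import numpy as np

# Sketch of the multinode exchange: pad subset features onto the full
# graph with the fat selection matrix P, then apply S up to Q-1 times
# and record the values observed at each selected node.
rng = np.random.default_rng(2)
N, Q = 6, 3
S = (rng.random((N, N)) < 0.5).astype(float)  # arbitrary shift operator
selected = [0, 2, 5]                          # the subset P of selected nodes
x_sub = rng.standard_normal(len(selected))    # one feature per selected node

# P is the |P| x N fat binary selection matrix; P^T zero pads.
P = np.zeros((len(selected), N))
P[np.arange(len(selected)), selected] = 1.0
x_pad = P.T @ x_sub                           # zero-padded signal on all N nodes

# Diffuse Q-1 times and collect the shifted values at the selected nodes.
shifted = [x_pad]
for _ in range(Q - 1):
    shifted.append(S @ shifted[-1])
z = {p: np.array([s[p] for s in shifted]) for p in selected}

assert all(z[p].shape == (Q,) for p in selected)
assert np.isclose(z[0][0], x_sub[0])          # q = 0 returns the node's own feature
```

Each `z[p]` again has a regular temporal structure of length $Q$, so a small regular CNN can be run at each selected node, exactly as in the single-node aggregation stage.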
However, the signal $x^g_1$ has dimension $P_1 < N$, so it cannot be directly exchanged over the original graph $\mathcal{G}$. We therefore use zero padding to make $x^g_1$ fit the graph,
$$\tilde{x}^g_1 = P^T_1 x^g_1, \qquad (25)$$
with $P_1$ being the $P_1 \times N$ fat binary matrix that selects the subset $\mathcal{P}_1$ of rows of (18). Then, we apply the original shift $S$ a total of $Q_2$ times to the signal $\tilde{x}^g_1$, collecting the information at the nodes $p \in \mathcal{P}_2$,
$$z^g_{p0}(2, Q_2) := \Big[ [\tilde{x}^g_1]_p ; [S \tilde{x}^g_1]_p ; \ldots ; [S^{Q_2 - 1} \tilde{x}^g_1]_p \Big]. \qquad (26)$$
Once $z^g_{p0}$ is collected at each node $p \in \mathcal{P}_2$, the time structure of the signal is exploited to deploy another regular CNN (2)-(5) (an aggregation GNN stage) in order to obtain $F_{L_2}$ features that describe the region. In general, consider that the output of outer layer $r-1$ is $x^g_{r-1}$, consisting of feature $g$ at a subset $\mathcal{P}_{r-1}$ of $P_{r-1}$ nodes [cf. (24)], for $g = 1, \ldots, F_{L_{r-1}}$. This signal is zero padded to fit the original graph, $\tilde{x}^g_{r-1} = P^T_{r-1} x^g_{r-1}$ [cf. (25)], and the graph shift $S$ is applied $Q_r$ times, collecting the shifted versions at a subset of nodes $\mathcal{P}_r$ to construct the time-structure signal $z^g_{p0}(r, Q_r)$ [cf. (26)]. Each node $p \in \mathcal{P}_r$ runs a regular CNN (2)-(5) with $L_r$ inner layers to produce $F_{L_r}$ features $z_{pL_r}(r, Q_r)$ [cf. (23)], which are then collected at each of the nodes $p \in \mathcal{P}_r$ to produce $x^f_r$ [cf. (24)], for $f = 1, \ldots, F_{L_r}$. See Fig. 4 for an illustration of the architecture.

B. Practical Considerations

Local architecture. The single-node aggregation GNN architecture is entirely local. Only one node $p \in \mathcal{V}$ is selected, and that node gathers all the relevant information about the data by means of local exchanges only. Furthermore, the output at the last layer is also obtained at a single node, so there is no need to have physical access to every node in the network.

Regular CNN design.
Since the signal $z^g_p$ gathered at node $p$ exhibits a regular time structure, the state-of-the-art expertise in designing conventional CNNs can be immediately leveraged to inform the design of the convolutional layers of aggregation GNNs.

Numerical normalization. For big networks, some of the entries of $S^k$ (as well as the components of $z^g_p$ associated with those powers) can grow too large, leading to numerical instability. To avoid this, aggregation schemes typically work with a normalized version of the graph shift operator that guarantees that the spectral radius of $S$ is one.

Choice of aggregating node. The choice of the nodes that aggregate all the information has an impact on the overall performance of the algorithm. This decision can be informed by several criteria, such as the degree, the frequency content of the signals of interest [7], or different measures of centrality in the network [34]. In particular, in the experiments carried out in Sec. V, we select nodes based on the leverage scores obtained by the two sampling schemes described in [29] and [32].

Filter taps. For a generic (inner) layer $1 < \ell < L_r$, the generation of the feature vectors $u^{fg}_\ell \in \mathbb{R}^{N_{\ell-1}}$ and $u^f_\ell \in \mathbb{R}^{N_{\ell-1}}$ is as in (2) and (3), so that we have $u^f_\ell = \sum_{g=1}^{F_{\ell-1}} u^{fg}_\ell = \sum_{g=1}^{F_{\ell-1}} h^{fg}_\ell * z^g_{p(\ell-1)}$. The main difference in this case is in the type and length of the filter coefficients $h^{fg}_\ell \in \mathbb{R}^{K_\ell}$. While in classical CNNs the filter coefficients are critical to aggregate the information at different resolutions, here part of that aggregation has already been taken care of in the first layer, when transforming $x^g$ into $z^g_p$. As a result, the filter taps in the aggregation GNN architecture can have a shorter length and place more emphasis on high-frequency features.

Pooling. Something similar applies to the pooling schemes.
The summarization and downsampled vectors for the aggregation architecture are obtained as $[v^f_\ell]_n = \rho_\ell([u^f_\ell]_{n_\ell})$ and $x^f_\ell = \sigma_\ell(C_\ell v^f_\ell)$, which coincide with their counterparts for classical CNNs in (4) and (5). The difference is therefore not in the expressions, but in how $n_\ell$ and $C_\ell$ are selected. While in traditional CNNs the signal $x^g$ is global, in that all the samples have the same resolution, in the aggregation architecture the signal $z^g_p$ is local and different samples correspond to different levels of resolution. More specifically, for the aggregation architecture, pooling schemes for $n_\ell$ and $C_\ell$ that preserve the top samples of the feature vectors $u^f_\ell$ (to keep the finer resolutions), combined with a few bottom samples (to account for coarser information), are reasonable, while in traditional CNNs regular schemes for $n_\ell$ and $C_\ell$ that extract information and sample the signal support regularly can be more adequate.

Design flexibility. The multinode aggregation GNN acts as a decentralized method for constructing regional features. We note that, for ease of exposition, the number of shifts $Q_r$ at each outer layer is the same for all nodes, as is the number of features $F_{L_r}$ obtained at each node. However, this architecture can be adapted to different node-dependent parameters in a straightforward manner.

Computational cost. The computational cost of the multinode aggregation GNN at each outer layer $r$ is that of processing the regular CNN at each node, $O\big(\sum_{p=1}^{P_r} \sum_{\ell=1}^{L_r} N_{\ell-1} K_\ell F_{\ell-1} F_\ell\big)$, which can be easily distributed among the $P_r$ involved nodes.

Number of parameters. The number of parameters of the multinode aggregation GNN is $O\big(\sum_{p=1}^{P_r} \sum_{\ell=1}^{L_r} K_\ell F_\ell F_{\ell-1}\big)$.
We observe, though, that the regular CNNs employed tend to be very small, since the initial $Q_r$ (the length of the signal processed by the regular CNN at each node) as well as the lengths of the filters $K_\ell$ are very small (typically, $K_\ell \ll Q_r$; cf. Sec. II).

V. NUMERICAL EXPERIMENTS

We test the proposed GNN architectures and compare their performance with the graph coarsening (multiscale hierarchical clustering) approach of [16]. In the first scenario (Sec. V-A), we address the problem of source localization on synthetic stochastic block model (SBM) networks. Then, we move the source localization problem to the more realistic setting of a Facebook network of 234 users (Sec. V-B). As a third experiment, we address the problem of authorship attribution (Sec. V-C). Finally, we test the proposed architectures on the problem of text categorization on the 20NEWS dataset (Sec. V-D).

We test the proposed selection (Sec. III), aggregation (Sec. IV), and multinode (Sec. IV-A) GNN architectures. For the selection of the nodes involved in each of the architectures, we test three different strategies. First, we choose nodes based on their degree; second, we select them following the leverage scores proposed by the experimentally designed sampling (EDS) in [32]; and third, we determine the appropriate nodes using the spectral-proxies approach (SP) in [29]. In all architectures, the last layer is a fully connected readout layer, followed by a softmax, to perform classification. Unless otherwise specified, all GNNs were trained using the ADAM optimizer [35] with learning rate $0.001$ and forgetting factors $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The training phase is carried out for 40 epochs with batches of 100 training samples.

Architecture                        Accuracy
Selection (S) Degree                86.9(±5.9)%
Selection (S) EDS                   90.0(±4.6)%
Selection (S) SP                    91.1(±4.7)%
Aggregation (A) Degree              94.2(±4.7)%
Aggregation (A) EDS                 96.5(±3.1)%
Aggregation (A) SP                  95.2(±4.4)%
Multinode (MN) Degree               96.1(±3.4)%
Multinode (MN) EDS                  96.0(±3.5)%
Multinode (MN) SP                   97.3(±2.7)%
Graph Coarsening (C) Clustering     87.4(±3.2)%

Table I: Considering that SBM graphs are random, we generate 10 different instances of SBM graphs with $N = 100$ nodes and $C = 5$ communities of 20 nodes each. For each of these 10 graphs, we randomly generate 10 different datasets (training, validation, and test sets). We compute the classification accuracy of each realization and average across all 10 realizations for each graph, obtaining 10 average classification accuracies. The table shows the classification accuracy averaged across the 10 graph instances; the standard deviation across these 10 graphs is also shown.

The loss function considered in all cases is the cross-entropy loss between one-hot target vectors and the output of the last layer of each architecture, interpreted as probabilities of belonging to each class. Also, in all cases, we consider max-pooling summarizing functions and ReLU activation functions for the corresponding GNN layers.

A. Source Localization

Consider a connected stochastic block model (SBM) network with $N = 100$ nodes and $C = 5$ communities of 20 nodes each [36]. In SBM graphs, edges are randomly drawn between nodes within the same community, independently, with probability $0.8$, while edges are randomly drawn between nodes of different communities, independently, with probability $0.2$. Denote by $A$ the adjacency matrix of such a graph. In the problem of source localization, we observe a signal that has been diffused over the graph and estimate the spatial origin of the diffused process. More precisely, let $\delta^c$ be a graph signal that has a 1 at node $c$ and 0 at every other node. Define $x = A^t \delta^c$ as the diffused graph signal, for some unknown $t \ge 0$. The objective is to estimate the origin $c$ of the diffusion.
In this particular situation, we are interested in estimating the community c (rather than the node) that originated the observed signal x. We can thus model this scenario as a classification problem in which we observe the graph signal x and have to assign it to one of the C = 5 communities. In the simulations, we generate the training and test sets by randomly selecting the origin c from a pool of C = 5 nodes (the largest-degree node of each community; recall that all nodes have, on average, the same degree) and randomly selecting the diffusion time t < 25 as well. We generate a training set of 10,000 signals and a test set of 200 signals. The training set is further split into 2,000 signals for validation and the rest for training. We simulate 10 graphs and, for each graph, we simulate 10 realizations of the training and test sets.

Figure 5: Validation and training loss during the training stage for (a) Selection GNN SP, (b) Aggregation GNN EDS, (c) Multinode GNN SP. We observe that the validation loss and the training loss are essentially equal throughout the training stage for all three architectures. This shows that the proposed models are not overfitting the data, since the validation loss keeps decreasing with the training steps. The best performing selection method of each architecture is represented.

For numerical reasons, the adopted graph shift operator is S = A/λ_max, where λ_max is the maximum eigenvalue of A. The architectures tested are as follows. For the selection GNN we consider two layers, selecting 10 nodes in each. The number of output features in each layer is F1 = F2 = 32 and the filters consist of K1 = K2 = 5 taps [cf. (13)]. For the summarizing functions, we consider neighborhoods of size α1 = 6 and α2 = 8, respectively [cf. (15)].
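The convolutional features of the selection GNN are produced by linear shift-invariant graph filters, i.e., polynomials in S applied to the signal. A minimal single-feature sketch follows (the actual architecture uses banks of such filters, one per input-output feature pair [cf. (13)]; the toy 4-node cycle and the filter taps are our own choices):

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply the LSI graph filter y = sum_k h[k] * S^k @ x (K = len(h) taps)."""
    y = np.zeros_like(x)
    z = x.copy()        # z holds S^k x, starting at k = 0
    for hk in h:
        y += hk * z
        z = S @ z       # one more local exchange with the neighbors
    return y

# toy example: 4-node cycle graph, rescaled as S = A / lambda_max
A = np.roll(np.eye(4), 1, axis=1) + np.roll(np.eye(4), -1, axis=1)
S = A / np.max(np.abs(np.linalg.eigvals(A)))
x = np.array([1.0, 0.0, 0.0, 0.0])
y = graph_filter(S, x, h=np.array([0.5, 0.3, 0.2]))   # K = 3 taps
```

Note that computing y only requires K − 1 exchanges of information between neighboring nodes, which is what keeps each layer local.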
In the aggregation GNN, we select the single node with the highest a) degree, b) EDS leverage score, or c) spectral-proxies (SP) norm, depending on the strategy chosen. Then, we construct the regular-structured signal [cf. (19)] and apply the aggregation GNN with two layers. The number of features in each layer is F1 = 16 and F2 = 32, with filters of size K1 = 4 and K2 = 8 [cf. (21)]. Max-pooling is applied to reduce the size of the regular signal by half in each layer, and the nonlinearity used is the ReLU.

Finally, for the multinode GNN, we consider two outer layers selecting P1 = 10 and P2 = 5 nodes, and shifting the signal Q1 = 7 and Q2 = 5 times to build the regular signal at each node [cf. (22)]. Then, for each outer layer, we apply two inner layers. In the first outer layer, we obtain 16 features at each inner layer; in the second outer layer, the two inner layers compute 16 and 32 features, respectively. In all inner layers, the filters are of size 3, with max-pooling by 2 and a ReLU nonlinearity. We recall that the selection of nodes depends on the sampling strategy chosen (degree, EDS or SP).

We compare against a two-layer architecture using graph coarsening [16], reducing the number of nodes to a half in each layer and computing F1 = F2 = 32 features with filters consisting of K1 = K2 = 5 filter taps. In contrast with the previous cases, where S was set to the rescaled adjacency matrix, in the graph coarsening architecture we set S to the normalized Laplacian, since that was the specification in [16] and, more importantly, it yields better performance.

The plots in Fig. 5 show the value of the loss function on the training and validation sets as the training stage progresses. We observe that both drop with training, showing that the model is effectively learning from the data. We see that it takes some time for the models to start learning (reaching half of the training stage in the case of aggregation), but then they effectively lower the training loss.
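The regular-structured signal gathered by the aggregation GNN [cf. (19)] is simply the sequence of values the designated node p observes as the signal is repeatedly shifted through the graph. A minimal sketch follows (the toy S, signal and 2-tap filter are our own choices):

```python
import numpy as np

def aggregation_sequence(S, x, p, Q):
    """Collect z = [x_p, (Sx)_p, ..., (S^{Q-1} x)_p]: the stream of values
    node p observes over Q-1 successive exchanges with its neighbors."""
    z = np.empty(Q)
    for q in range(Q):
        z[q] = x[p]
        x = S @ x
    return z

# toy example: 3-node path graph; the sequence z has a regular (temporal)
# structure, so an ordinary 1-D convolution with K taps applies directly
S = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]) / 2.0
z = aggregation_sequence(S, np.array([1.0, 2.0, 3.0]), p=1, Q=4)
feature = np.convolve(z, np.ones(2) / 2, mode='valid')  # a 2-tap averaging filter
```

Successive entries of z aggregate information from successively larger neighborhoods of p, which is why regular convolution and pooling on z are meaningful graph operations.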
We also see that the multinode GNN achieves a lower loss value, which translates into better performance on the test set, and that it takes the least number of training steps before starting to lower the loss function. Finally, we note that the validation loss and the training loss are essentially the same, showing that the models do not overfit.

Accuracy results on the test set are presented in Table I. The accuracy results for all 10 realizations of each graph are averaged, and then the 10 graph mean accuracies are averaged to obtain the values shown in Table I. The error values in the table are the square root of the variance computed across the means obtained for each of the 10 graphs. We observe that the best performance is achieved by the multinode GNN with nodes chosen following the spectral-proxies method. We also observe that all multinode and aggregation GNNs outperform the graph coarsening approach, and so do selection GNNs following EDS and spectral-proxies sampling.

B. Facebook network

For this second experiment, we also consider the source localization problem, but in this case we test it on top of a real-world network. In particular, we built a 234-user Facebook network as the largest connected network within the larger dataset provided in [37]. We observe that the resulting network exhibits two communities of quite different size. The source localization problem formulation is the same as the one described in the previous section, where the objective is to identify which of the two communities originated the diffusion process. This is analogous to finding the start of a rumor. Again, we set S = A/λ_max. The datasets are generated in the same fashion as described in the previous section.

The three architectures used are as follows. For the selection GNN we use two layers, choosing 10 nodes after the first one, and use filters with K1 = K2 = 5 taps that generate F1 = F2 = 32 features in each layer.
For the pooling stage, we use a max{·} summarization with α1 = 2 and α2 = 4. In the aggregation GNN we select the best node based on one of the three sampling strategies (degree, EDS or SP) and gather the regular-structured data at that node. We then process it with a two-layer CNN that generates F1 = 32 and F2 = 64 features, using K1 = K2 = 4. Max-pooling of size 2 is used in each layer (i.e., half of the samples gathered at the node are kept after each layer). In the case of the multinode GNN we use two outer layers, selecting P1 = 30 and P2 = 10 nodes in each, and gathering Q1 = Q2 = 5 shifted versions of the signal at each node. Then, for the inner layers, we use two-layer architectures that generate 16 features in each layer in the first outer layer, and 16 and 32 features in the second outer layer. In all cases, we use filters of size 3 and max-pooling by a factor of 2. Finally, for the graph coarsening architecture, we adopt the normalized Laplacian as GSO, as described in [16], and use two layers computing F1 = F2 = 32 features using graph filters with K1 = K2 = 5 filter taps. After each layer, the number of nodes is reduced by half. For training we use 80 epochs.

Table II: Source localization on the Facebook network. Classification accuracy averaged across 10 different realizations of the training and test sets for the same underlying graph. In parentheses, we show the standard deviation of the classification accuracy.

Architecture                    | Accuracy
--------------------------------|--------------
Selection (S) Degree            | 96.0 (±1.5)%
Selection (S) EDS               | 95.6 (±1.0)%
Selection (S) SP                | 97.6 (±1.2)%
Aggregation (A) Degree          | 95.8 (±1.6)%
Aggregation (A) EDS             | 96.9 (±1.2)%
Aggregation (A) SP              | 95.8 (±1.4)%
Multinode (MN) Degree           | 97.6 (±1.3)%
Multinode (MN) EDS              | 96.8 (±1.2)%
Multinode (MN) SP               | 99.0 (±0.8)%
Graph Coarsening (C) Clustering | 95.2 (±1.2)%
We also generate 10 different random realizations of the dataset to account for random variability in the setting. Results for all ten architectures are shown in Table II. We observe that all architectures achieve a very high classification accuracy. We note that the selection GNN tends to outperform the aggregation GNN. The best result, a 99.0% classification accuracy, is obtained by the multinode GNN using spectral proxies.

C. Authorship attribution

As a third experiment, we study the problem of authorship attribution, as detailed in [38]. We consider excerpts of works written by a number of contemporaneous authors from the 19th century. We then build a word adjacency network (WAN) using functional words in these excerpts, and obtain a graph profile for each author, i.e., a graph that represents an author's writing style by the way functional words (which act as nodes) are linked (weighted edges) in the excerpts written; see [38] for full details on the authors considered and the specific construction of WANs. Then, we take a new excerpt, of unknown authorship, and by looking at the frequency of the functional words, we want to determine who the author is. In the framework presented in this paper, the signature word adjacency network constitutes the underlying graph support, and the frequency count of functional words becomes the graph signal.

In particular, we focus on texts authored by Emily Brontë. We consider a corpus of 682 excerpts of around 1,000 words authored by her, and take into consideration 211 functional words. Then, we take 546 of these excerpts as a training set, used both to build the signature WAN and as training samples. The constructed graph consists of N = 211 nodes, one for each functional word; the edges and their weights are determined by the precedence relationships between words, as described in [38]. Each training sample consists of a graph signal, where the value associated with each node is the frequency count of that specific functional word. The remaining 136 excerpts are used as test samples.

Table III: Authorship attribution. Classification accuracy averaged across 10 different realizations of the training and test sets (recall that the training and test sets are chosen at random from the available corpus, and the choice of training set affects the constructed underlying graph). In parentheses, we show the standard deviation of the classification accuracy.

Architecture                    | Accuracy
--------------------------------|--------------
Selection (S) Degree            | 69.6 (±5.6)%
Selection (S) EDS               | 68.1 (±5.3)%
Selection (S) SP                | 73.0 (±4.8)%
Aggregation (A) Degree          | 69.5 (±2.0)%
Aggregation (A) EDS             | 71.0 (±2.8)%
Aggregation (A) SP              | 69.2 (±4.0)%
Multinode (MN) Degree           | 80.4 (±2.0)%
Multinode (MN) EDS              | 80.5 (±2.6)%
Multinode (MN) SP               | 79.9 (±2.8)%
Graph Coarsening (C) Clustering | 65.2 (±5.0)%

Once the signature WAN for Brontë is built, we construct a training set of 1,092 text excerpts, 546 corresponding to the author and 546 corresponding to other contemporary authors, and a test set of 272 excerpts, 136 belonging to Brontë and 136 written by other authors. The excerpts corresponding to the training and test sets, written by either Brontë or other contemporary authors, are chosen uniformly at random. The objective is to decide whether the excerpts in the test set were written by Brontë.

Again, we consider the three GNN architectures proposed in this paper, as well as the graph coarsening GNN of [16]. For the selection GNN, we consider a two-layer architecture, where we choose 100 nodes (functional words) as determined by each of the three sampling strategies (degree, EDS and SP). For each layer we set F1 = F2 = 32, K1 = K2 = 5, and α1 = 2 and α2 = 4. In the aggregation GNN we consider three layers, after aggregating all the information at the chosen node (the choice depends on the sampling strategy).
In the first layer we compute F1 = 32 features with a filter of size K1 = 6, and apply max-pooling, reducing the number of samples by 4. The second and third layers use filters of size K2 = K3 = 4 to obtain F2 = 64 and F3 = 128 features, respectively. Pooling is applied in each of the last two aggregation GNN layers, reducing the size of the vector by a factor of 2. The multinode GNN employed consists of two outer layers, choosing P1 = 30 and P2 = 10 nodes, respectively, and aggregating information up to the Q1 = Q2 = 5 hop neighborhood. For each outer layer, we have two inner layers, with 16 features in each of them for the first outer layer, and 16 and 32 features for the second outer layer. All filters are of size 3 and pooling reduces the size of the vectors by half. Finally, the graph coarsening GNN consists of two layers obtaining F1 = F2 = 32 features in each, with graph filters of size K1 = K2 = 5, and pooling reducing the size of the graph by half in each layer.

The graph shift operator S is set to the adjacency matrix after normalizing the weights of each row (to add up to 1) and symmetrizing it, except for the case of the graph coarsening GNNs, where the GSO is the normalized Laplacian obtained from the aforementioned adjacency matrix.

Table IV: 20NEWS dataset on a word2vec graph embedding of N = 1,000 nodes. Classification accuracy averaged across 10 different runs. In parentheses, we show the standard deviation of the classification accuracy.

Architecture                    | Accuracy
--------------------------------|--------------
Selection (S) Degree            | 55.7 (±0.5)%
Selection (S) EDS               | 58.1 (±0.5)%
Selection (S) SP                | 59.2 (±0.4)%
Aggregation (A) Degree          | 49.0 (±0.4)%
Aggregation (A) EDS             | 51.3 (±0.5)%
Aggregation (A) SP              | 52.9 (±0.5)%
Multinode (MN) Degree           | 65.7 (±0.4)%
Multinode (MN) EDS              | 66.5 (±0.5)%
Multinode (MN) SP               | 67.0 (±0.5)%
Graph Coarsening (C) Clustering | 62.8 (±0.5)%
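The two GSO choices used in this experiment, the row-normalized symmetrized adjacency and the normalized Laplacian, can be sketched as follows (a minimal sketch assuming a dense NumPy adjacency matrix W with no isolated nodes; the toy W is our own):

```python
import numpy as np

def row_normalized_symmetric(W):
    """Normalize each row of W to sum to 1, then symmetrize (GSO for the GNNs)."""
    P = W / W.sum(axis=1, keepdims=True)
    return 0.5 * (P + P.T)

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} (GSO for the graph-coarsening baseline)."""
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    return np.eye(W.shape[0]) - Dinv @ W @ Dinv

# toy weighted graph on 3 nodes
W = np.array([[0., 2., 0.], [2., 0., 1.], [0., 1., 0.]])
S = row_normalized_symmetric(W)
L = normalized_laplacian(W)
```

Both operators are symmetric; the normalized Laplacian additionally has D^{1/2}1 in its null space, which is the property the spectral filters of [16] rely on.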
For training we use 80 epochs, and we run the experiment 10 times to account for the randomness in the selection of the training and test sets (and thus for the randomness in the creation of the underlying WAN). Results can be found in Table III, where we show the classification accuracy averaged over the 10 different realizations of the training and test sets, as well as the estimated standard deviation. We first observe that, in this case, all proposed GNN architectures outperform the graph coarsening GNN. We note that the multinode GNN is the best performing architecture. We also observe that selecting nodes via the EDS sampling method works best for the aggregation and multinode GNNs, but spectral proxies yield better results in the case of the selection GNN. The best classification accuracy obtained is 80.5%, on average across all realizations, achieved by the multinode GNN whose nodes are selected by means of EDS sampling.

D. 20NEWS dataset

Finally, we consider the classification of articles in the 20NEWS dataset, which consists of 16,617 texts (9,922 of which are used for training and 6,695 for testing) [39]. The graph signals are constructed as in [16]: each document x is represented using a normalized bag-of-words model, and the underlying graph support is constructed as a 16-NN graph on the word2vec embedding [40] of the 1,000 most common words. The GSO adopted is the normalized Laplacian, as in [16].

The selection GNN architecture consists of 2 convolutional layers, selecting P1 = 250 and P2 = 100 nodes according to each of the three different sampling strategies. Each layer uses graph filters of K1 = K2 = 5 taps to build F1 = 32 and F2 = 64 features. The pooling neighborhoods correspond to α1 = 7 and α2 = 12. For the aggregation GNN we also consider 2 layers, and use filters of length K1 = K2 = 11 to build F1 = F2 = 32 features in each layer.
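The 16-NN graph support described above can be sketched as follows. This is a generic k-NN similarity graph with Gaussian edge weights and max-symmetrization; the exact weighting used in [16] may differ, and the random embeddings below are a stand-in for the actual word2vec vectors:

```python
import numpy as np

def knn_graph(emb, k=16):
    """Symmetric k-NN graph: connect each point to its k nearest neighbors
    (Euclidean), with Gaussian edge weights; symmetrize by taking the max."""
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * emb @ emb.T   # squared distances
    d2 = np.maximum(d2, 0.0)                           # guard against round-off
    np.fill_diagonal(d2, np.inf)                       # exclude self-loops
    sigma2 = np.mean(np.sort(d2, axis=1)[:, k - 1])    # bandwidth: mean k-th NN distance
    idx = np.argsort(d2, axis=1)[:, :k]                # k nearest neighbors per node
    rows = np.repeat(np.arange(len(emb)), k)
    W = np.zeros_like(d2)
    W[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / sigma2)
    return np.maximum(W, W.T)                          # symmetrize

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 8))   # hypothetical embeddings: 50 words, 8 dimensions
W = knn_graph(emb, k=5)
```

The normalized Laplacian of W then serves as the GSO, as stated above.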
Pooling size is 4, and the data is aggregated at a single node chosen by each of the sampling strategies. The multinode GNN consists of 2 outer layers that select P1 = 70 and P2 = 30 nodes, respectively. The numbers of local exchanges used to create a temporally-structured signal are Q1 = 10 and Q2 = 25. Each outer layer employs a regular CNN with 2 inner layers. Each inner layer of the first outer layer creates 16 features, while the two inner layers of the second outer layer compute 16 and 32 features, respectively. All filters involved are of length 5 and the pooling size is 4. Finally, for the graph coarsening architecture, we consider 2 layers, reducing the number of nodes by half in each layer and computing F1 = 32 and F2 = 64 features, using filters of length K1 = K2 = 5. Training is done for 80 epochs.

Classification accuracy results, averaged over 10 runs, are listed in Table IV. We note that the multinode GNN is the best performing architecture, followed by graph coarsening. The comparatively poor performance of the aggregation GNN is most likely due to the numerical instabilities that arise from performing a large number of neighborhood exchanges.

VI. CONCLUSIONS

In this paper we proposed two architectures that extend convolutional neural networks to process graph signals. The selection graph neural network replaces the convolution operation with graph filtering by means of linear shift-invariant graph filters. Pooling is reinterpreted as a neighborhood summarizing function that gathers the relevant regional information at a subset of nodes, followed by a downsampling. By keeping track of the location of these subsets of nodes in the original graph, convolutional features can be computed at deeper layers through the use of zero padding. In this way, the selection GNN respects the original topology that describes the data, while reducing the computational complexity at each layer.
Furthermore, the resulting features at each layer can be appropriately analyzed in terms of the original graph (frequency analysis, local filtering).

The aggregation GNN collects, at a single node, diffused versions of the original signal. The resulting signal simultaneously possesses a regular temporal structure and includes all relevant information about the topology of the graph. Since the signal collected at this single node has a temporal structure, a regular CNN can be applied to it. In large-scale networks, however, gathering all the information of the graph signal at a single node might be infeasible. To overcome this, we proposed a multinode variation of the aggregation GNN in which we use a subset of nodes to subsequently create meaningful features over increasing neighborhoods.

We have tested the proposed architectures on a source localization problem with both synthetic and real datasets, as well as on authorship attribution and the classification of articles in the 20NEWS dataset. We considered three different ways of choosing nodes in each architecture, based on three existing sampling techniques (namely, by degree, and by leverage scores computed from experimentally designed sampling and from spectral proxies). We compared the results with an existing graph coarsening GNN that employs multiscale hierarchical clustering for the pooling stage. We observe that the multinode aggregation GNN exhibits the best performance. All in all, the proposed GNN architectures exploit advances in graph signal processing to present novel deep learning constructions that are able to handle network data represented as signals supported on graphs.

REFERENCES

[1] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[2] A. Sandryhaila and J. M. F.
Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, June 2014.
[3] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
[4] A. Sandryhaila and J. M. F. Moura, "Big data analysis with signal processing on graphs," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80–90, Sep. 2014.
[5] S. Segarra, A. G. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, Aug. 2017.
[6] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Trans. Signal, Inform. Process. Netw., 6 Apr. 2018, early access.
[7] A. G. Marques, S. Segarra, G. Leus, and A. Ribeiro, "Sampling of graph signals with successive local aggregations," IEEE Trans. Signal Process., vol. 64, no. 7, pp. 1832–1843, Apr. 2016.
[8] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, and N. Seliya, "Deep learning applications and challenges in big data analytics," J. Big Data, vol. 2, no. 1, pp. 1–21, Dec. 2015.
[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[11] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in 2010 IEEE Int. Symp. Circuits and Syst., Paris, France: IEEE, 30 May–2 June 2010.
[12] H. Greenspan, B. van Ginneken, and R. M. Summers, "Deep learning in medical imaging: Overview and future promise of an exciting new technique," IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1153–1159, May 2016.
[13] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," arXiv:1312.6203 [cs.LG], 21 May 2014. [Online]. Available: http://arxiv.org/abs/1312.6203
[14] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv:1506.05163v1 [cs.LG], 16 June 2015. [Online]. Available: http://arxiv.org/abs/1506.05163
[15] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in 30th Neural Inform. Process. Syst., Barcelona, Spain: NIPS Foundation, 5–10 Dec. 2016.
[16] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Neural Inform. Process. Syst. 2016, Barcelona, Spain: NIPS Foundation, 5–10 Dec. 2016.
[17] J. Du, S. Zhang, G. Wu, J. M. F. Moura, and S. Kar, "Topology adaptive graph convolutional networks," arXiv:1710.10370v2 [cs.LG], 2 Nov. 2017. [Online]. Available: http://arxiv.org/abs/1710.10370v2
[18] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in 5th Int. Conf. Learning Representations, Toulon, France, 24–26 Apr. 2017.
[19] F. Gama, A. G. Marques, A. Ribeiro, and G. Leus, "MIMO graph filters for convolutional networks," arXiv:1803.02247v1 [cs.LG], 6 March 2018. [Online]. Available: http://arxiv.org/abs/1803.02247
[20] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in 33rd Int. Conf. Mach. Learning, New York, NY, 24–26 June 2016.
[21] B. Pasdeloup, V. Gripon, J.-C. Vialatte, and D. Pastor, "Convolutional neural networks on irregular domains through approximate translations on inferred graphs," arXiv:1710.10035v1 [cs.DM], 27 Oct. 2017. [Online]. Available: http://arxiv.org/abs/1710.10035
[22] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," arXiv:1710.10903v3 [stat.ML], 4 Feb. 2018. [Online]. Available: http://arxiv.org/abs/1710.10903
[23] F. Gama, G. Leus, A. G. Marques, and A. Ribeiro, "Convolutional neural networks via node-varying graph filters," in 2018 IEEE Data Sci. Workshop, Lausanne, Switzerland: IEEE, 4–6 June 2018.
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, ser. The Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2016.
[25] C.-C. J. Kuo, "The CNN as a guided multilayer RECOS transform," IEEE Signal Process. Mag., vol. 34, no. 3, pp. 81–89, May 2017.
[26] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition 2017, Honolulu, HI: IEEE Comput. Soc., 21–26 July 2017.
[27] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1845–1866, March 2018.
[28] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, "Discrete signal processing on graphs: Sampling theory," IEEE Trans. Signal Process., vol. 63, no. 24, pp. 6510–6523, Dec. 2015.
[29] A. Anis, A. Gadde, and A. Ortega, "Efficient sampling set selection for bandlimited graph signals using graph spectral proxies," IEEE Trans. Signal Process., vol. 64, no. 14, pp. 3775–3789, July 2016.
[30] M. Tsitsvero, S. Barbarossa, and P. Di Lorenzo, "Signals on graphs: Uncertainty principle and sampling," IEEE Trans. Signal Process., vol. 64, no. 18, pp. 4845–4860, Sep. 2016.
[31] G. Puy, N. Tremblay, R. Gribonval, and P. Vandergheynst, "Random sampling of bandlimited graph signals," Appl. Comput. Harmonic Anal., vol. 44, no. 2, pp. 446–475, March 2018.
[32] R. Varma, S. Chen, and J. Kovačević, "Spectrum-blind signal recovery on graphs," in 2015 IEEE Int. Workshop Comput. Advances Multi-Sensor Adaptive Process., Cancún, México: IEEE, 13–16 Dec. 2015, pp. 81–84.
[33] H. Cai, V. W. Zheng, and K. C.-C. Chang, "A comprehensive survey of graph embedding: Problems, techniques and applications," arXiv:1709.07604v3 [cs.AI], 2 Feb. 2018. [Online]. Available: http://arxiv.org/abs/1709.07604
[34] S. Segarra and A. Ribeiro, "Stability and continuity of centrality measures in weighted graphs," IEEE Trans. Signal Process., vol. 64, no. 3, pp. 543–555, Feb. 2016.
[35] D. P. Kingma and J. L. Ba, "ADAM: A method for stochastic optimization," in 3rd Int. Conf. Learning Representations, San Diego, CA, 7–9 May 2015.
[36] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, "Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications," Physical Review E, vol. 84, no. 6, p. 066106, Dec. 2011.
[37] J. McAuley and J. Leskovec, "Learning to discover social circles in ego networks," in 26th Neural Inform. Process. Syst., Stateline, TX: NIPS Foundation, 3–8 Dec. 2012.
[38] S. Segarra, M. Eisen, and A. Ribeiro, "Authorship attribution through function word adjacency networks," IEEE Trans. Signal Process., vol. 63, no. 20, pp. 5464–5478, Oct. 2015.
[39] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," Carnegie Mellon University, Computer Science Technical Report CMU-CS-96-118, 1996.
[40] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in 1st Int. Conf. Learning Representations, Scottsdale, AZ, 2–4 May 2013.