Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs



Tanvir Hossain¹, Muhammad Ifte Khairul Islam¹, Lilia Chebbah¹, Charles Fanning², and Esra Akbas¹

¹ Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
² Department of Data Science and Analytics, Kennesaw State University, 1000 Chastain Road, Kennesaw, GA 30144, USA
{thossain5, mislam29, lchebbah1}@student.gsu.edu, cfannin8@students.kennesaw.edu, eakbas1@gsu.edu

Abstract. Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. The framework enhances node representations by emphasizing cohesive structure in long-range information while removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.

Keywords: Oversquashing · Graph Decomposition · Graph Representation Learning

1 Introduction

Graph Neural Networks (GNNs) have shown strong performance in learning from graph-structured data by iteratively propagating information through local neighborhoods.
While GNNs capture local signals effectively, the n-hop neighborhood grows exponentially with distance and messages must pass through a limited number of edges, causing important long-range information to be compressed into fixed-size node embeddings [2]; this is referred to as the oversquashing problem. Many studies have been proposed to understand and quantify oversquashing using node sensitivity [25], effective resistance [8], and commute time [11]. To address this issue, one of the most common strategies is graph rewiring, which adds edges to reduce structural bottlenecks and improve information flow. However, these approaches often introduce significant computational overhead [18,25,24,10]. In addition, they rely on computationally intensive techniques, such as spectral decomposition [18] and optimal transport [24].

In real-world networks, critical information is often concentrated within densely connected regions that form cohesive subgraphs. When message passing aggregates information across these regions, long-range dependencies may still be forced through narrow inter-community bottlenecks, while dense regions dominate the aggregation process, further amplifying oversquashing. On the other hand, applying GNNs to these cohesive subgraphs captures considerable task-relevant information for vertices in downstream graph analytics. Our pilot study (detailed in Section 2.6) illustrates that cohesion-sensitive graph partitioning shortens long-range dependencies while preserving homophily within subgraphs.

In this paper, we present a novel graph learning framework that utilizes Cross-attentive Cohesive Subgraph Embeddings (CaCoSE) to alleviate oversquashing in GNNs. After decomposing graphs into dense cohesive k-core subgraphs, we construct edge-induced cohesive subgraphs guided by k-core values and learn node representations via a GNN in each cohesive subgraph.
Next, we apply graph pooling to these cohesive subgraphs, which selectively filters out noisy or irrelevant connections while preserving task-relevant structures, and embeds each subgraph into a subgraph embedding that carries critical global information. This is followed by a cross-subgraph attention mechanism that enriches the subgraph embeddings by capturing long-range information across subgraphs. Finally, we combine these enriched subgraph representations with the embeddings of nodes within the subgraphs. The combined node representation, with enhanced message aggregation, captures both local subgraph and global long-range information for nodes, even in large-scale heterophilic networks. Our model architecture is provided in Figure 1.

The contributions of our model are as follows.

– This study opens up a new viewpoint for understanding the causes of oversquashing. While k-core decomposition provides cohesion awareness among vertices, which gives essential locality for GNN operations, pooling enhances the network's homophily to attain meaningful global subgraph representations.
– The cross-subgraph attention mechanism encodes essential global relations among all subgraphs. Merging node representations with enriched subgraph embeddings maintains both local and global connectivity, helping to alleviate the oversquashing problem in GNNs.
– In comprehensive experiments on multiple datasets, our model outperforms the standard baseline models in both node and graph classification tasks.

2 Preliminaries

In this section, we describe the primary components of CaCoSE.

[Figure 1 panels: Input Graph; (1) Cohesive Subgraph Decomposition; (2) Graph Learning; (3) Cross-Subgraph Attention and Feature Combination; (4) Final Prediction.]

Fig. 1: Model Architecture.
The CaCoSE framework first applies closure-aware k-core partitioning to extract cohesive edge-induced subgraphs (black). A GCN is used to learn node representations within each subgraph, followed by SAGPool to obtain compact subgraph embeddings (red). Next, a cross-attention mechanism captures the mutual information among subgraphs. The resulting representations are concatenated with node representations, and the average of the attended subgraph embeddings gives the final graph representation (green). Finally, the refined node and graph representations are evaluated through an MLP (orange).

2.1 Oversquashing.

Oversquashing is one of the major drawbacks of GNNs; it occurs when information is severely bottlenecked due to a failure in useful message passing. According to [25], for a node t connected to another node s at an r-hop distance, the Jacobian ∂h_t^{(r+1)}/∂x_s denotes the impact of a change in the feature vector x_s on the (r+1)-st layer's output h_t^{(r+1)}. The quantification is observed from the absolute value |∂h_t^{(r+1)}/∂x_s| of the Jacobian, where a negligible value indicates limited information propagation, i.e., oversquashing. In addition, as presented in [8], oversquashing can be associated with the effective resistance between a pair of nodes: the lower the effective resistance, the more influence the two nodes have on each other during GNN operations.

2.2 Graph Neural Network.

Graph neural networks (GNNs) embed the structural information of the graph into node and graph representations via message propagation. In a graph convolutional network (GCN), the message propagation rule is defined as:

H^{(l+1)} = σ( D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} Θ^{(l)} )   (1)

where Ã = A + I_{N_v} denotes the adjacency matrix with self-connections, D̃_{ii} = Σ_j Ã_{ij} is the degree matrix, and Θ ∈ R^{d×h} is the learnable weight matrix for h-dimensional embeddings. σ(·
) denotes the ReLU(x) = max(0, x) activation. H^{(l)} ∈ R^{N_v × h} denotes the l-th layer's node embedding matrix of dimension h, where H^{(0)} = X.

2.3 Self-Attention Graph Pooling.

Graph pooling captures the entire graph's representation through compression. SAGPool [20] facilitates self-attentive graph learning by selecting impactful neighbors of vertices. It utilizes a GCN to measure the self-attention scores:

Attn_v = σ( D̃^{-1/2} Ã D̃^{-1/2} X Θ_att )   (2)

where Attn_v ∈ R^{N_v × 1} is calculated from Ã = A + I_N, X, and a learnable parameter matrix Θ_att. Then, utilizing the pooling ratio (PR), it selects the top-k node indices, idx = topk(Attn_v, ⌈(PR) N_v⌉). Next, it considers only the top-k vertices and their connections, Attn_mask = Attn_v[idx]. After computing the top-k node indices, the pooled node features are obtained as X_out = X_{idx,:} ⊙ Attn_mask and A_out = A_{idx,idx}. Finally, through a READOUT function, the representation of the entire graph is computed as Z_G = READOUT(X_out), where READOUT is a global pooling function: sum, mean, or another advanced learnable aggregator.

2.4 k-core Decomposition.

Cohesive subgraph decomposition is instrumental in determining the intended graph regions. The decomposition is performed by iteratively peeling away nodes whose degrees fall below the specified threshold: nodes with degree less than k are removed. In particular, if the graph has no isolated node, the original graph G can be denoted as the 1-core. The cores form a hierarchy G_{k_max} ⊆ ... ⊆ G_3 ⊆ G_2 ⊆ G_1, where k ∈ {1, 2, ..., k_max}. A k-core subgraph is defined as follows.

Definition 1.
(k-core): For a given k ≥ 1, the k-core subgraph G_k of the graph G is the subgraph in which each node has at least k neighbors within the subgraph, i.e., |N(v)| ≥ k, where N(v) denotes the set of neighbors of node v ∈ G_k.

As the k-core is defined on nodes, we extend this concept to edges by introducing an edge score, defined as follows.

Definition 2. (Edge score): In the context of the k-core, for a graph G = (V, E) and k ≥ 1, an edge (u, v) ∈ E can appear in multiple k-core subgraphs. The score of an edge is assigned as

C(u, v) = max{ k | (u, v) ∈ G_k }   (3)

C(u, v) is computed for the highest-k subgraph in which (u, v) exists.

2.5 Problem Statement.

Our study addresses the challenge of oversquashing by introducing a cohesive subgraph decomposition framework along with graph pooling and a cross-subgraph attention mechanism. The primary objective is to enhance the expressivity of graph component features in downstream analytics tasks. Formally, a graph is denoted as G = (V, E, X), where V is the vertex set, E is the edge set, and X ∈ R^{N_v × d} represents the initial feature vectors of the vertices. The number of nodes is |V| = N_v and d denotes the dimension of the node features. The graph is partitioned into cohesion-centric subgraphs S = {S_1, S_2, ..., S_{k_max}}, where each S_k ⊆ G corresponds to cohesion level k ∈ {1, 2, ..., k_max}. We apply the proposed CaCoSE model to these subgraphs. For node classification, the model is trained on D_v = (G, X, Y_v) to learn the mapping f_v : X → Y_v. For graph classification, it is trained on D_G = (G, Y_G) to learn f_G : G → Y_G. In both cases, the learned functions leverage the reduced graph complexity achieved through the CaCoSE framework.
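As a concrete illustration, the peeling procedure of Definition 1 and the edge score of Definition 2 can be sketched in a few lines of Python. This is our own minimal sketch, not the paper's implementation; the adjacency-dict representation and the toy graph are illustrative choices. The edge score follows from the fact that an edge (u, v) lies in G_k exactly when both endpoints do, so C(u, v) = min(core(u), core(v)).

```python
from collections import deque

def core_numbers(adj):
    """k-core decomposition by iterative peeling (Definition 1): at stage k,
    repeatedly remove nodes whose remaining degree is below k; a node removed
    at stage k survives up to the (k-1)-core."""
    deg = {v: len(ns) for v, ns in adj.items()}
    core, remaining = {}, set(adj)
    k = 0
    while remaining:
        k += 1
        queue = deque(v for v in remaining if deg[v] < k)
        while queue:
            v = queue.popleft()
            if v not in remaining:
                continue
            remaining.remove(v)
            core[v] = k - 1
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] < k:
                        queue.append(u)
    return core

def edge_scores(adj):
    """Edge score of Definition 2: C(u, v) = max{k | (u, v) in G_k},
    which equals min(core(u), core(v))."""
    core = core_numbers(adj)
    return {(u, v): min(core[u], core[v])
            for u in adj for v in adj[u] if u < v}

# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
scores = edge_scores(adj)
# Triangle edges get score 2; the pendant edge (3, 4) gets score 1.
```

Grouping edges by their score then yields the edge-score partitions used later in the methodology.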
2.6 Pilot Study

GNNs struggle to learn long-range dependencies due to bottleneck effects in the graph structure [2]. Cohesion-aware graph decomposition not only reduces the burden on bottlenecked channels but also preserves essential structural properties. However, in heterophilic networks, selecting task-relevant neighbors remains challenging due to cross-class mixing. This pilot study demonstrates that applying k-core decomposition followed by pooling (SAGPool) preserves essential class consistency among vertices, even in heterophilic graphs. This enables GNNs to aggregate more reliable signals, producing stable node representations.

Fig. 2 presents a pilot study on the Cora (homophilic) and Chameleon (heterophilic) datasets to examine how well subgraph decomposition preserves assortativity and disassortativity. According to Definitions 1 and 2, we first compute edge scores and extract edge-induced dense subgraphs for the top three k-core values: from Cora, k ∈ {2, 3, 4}, and from Chameleon, k ∈ {61, 62, 63}. Next, we compute the average number of paths (ANP) for vertices at hop distances n ∈ {4, 5} across the original graphs, the subgraphs, and their homophilic counterparts H = {(u, v) ∈ E : y_u = y_v}. The measures are presented as bar graphs.

According to Figs. 2(a) and 2(c), the subgraphs in both datasets retain the homophilic and heterophilic (y_u ≠ y_v) properties of their original graphs. While this preservation is often sufficient for homophilic networks to maintain task-relevant features, it becomes more challenging for heterophilic graphs like Chameleon. To extend the study, we apply SAGPool to each subgraph and repeat the ANP evaluation on the pooled subgraphs (PS-ANP). Interestingly, in the Chameleon dataset (Fig. 2(d)), the pooled subgraphs exhibit a higher ratio of homophilic paths to the total average number of paths per node compared to the initial evaluation.
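The ANP-style measurement can be approximated with a short stdlib-only sketch. As a simplifying assumption of ours (not the paper's exact procedure), we count n-hop walks via powers of the adjacency matrix rather than simple paths, and build the homophilic counterpart H by keeping only same-label edges as defined above.

```python
def matmul(A, B):
    """Dense matrix product over nested lists (stdlib only)."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def avg_walks(A, n_hops):
    """Average number of n-hop walks per node: entries of A^n, averaged.
    A walk-count proxy for the paper's average number of paths (ANP)."""
    P = A
    for _ in range(n_hops - 1):
        P = matmul(P, A)
    return sum(map(sum, P)) / len(A)

# Path graph 0-1-2 with class labels ['a', 'a', 'b']. The homophilic
# counterpart H keeps only edges whose endpoints share a label
# (H = {(u, v) in E : y_u = y_v}), here just the edge between 0 and 1.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
labels = ['a', 'a', 'b']
H = [[A[i][j] if labels[i] == labels[j] else 0 for j in range(3)]
     for i in range(3)]
anp, h_anp = avg_walks(A, 2), avg_walks(H, 2)
# The ratio h_anp / anp indicates how many 2-hop walks stay within a
# class, the kind of quantity the pilot study tracks per subgraph.
```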
This preserved homophily facilitates effective representation learning in graph models [14]. Moreover, pooling increases the ratio for the homophilic dataset as well.

Fig. 2: Pilot Study. Average number (#) of paths (ANP) per node for hop distances (HD) n ∈ {4, 5} in the Cora and Chameleon (Chm) datasets. Panels: (a) ANP in Cora, (b) PS-ANP in Cora, (c) ANP in Chameleon, (d) PS-ANP in Chameleon. Blue bars denote the original graph (G), its cores (k), and their pooled subgraphs (P-k); green bars present their homophilic counterparts (H-G, H-k, and PH-k). From right to left, the deep blue and green bar pairs with hatching ('*') in Figs. 2(a) and 2(c) present the ANP of the original graph and its homophilic subgraph, respectively; in the other cases, a deeper bar color denotes a denser subgraph. K denotes thousands; M|B denotes millions|billions.

3 Methodology

3.1 Cohesive Subgraph Decomposition.

Density-informed subgraphs, called cohesive subgraphs, are crucial for effective graph structure learning, as each vertex gains sufficient connectivity within a particular region. Hence, learning vertex representations within these cohesive regions can capture adequate proximal insight. The k-core is one of the most popular algorithms for obtaining cohesion-focused subgraphs [9].

Theorem 1 (Closure-aware Edge Filtration, CaEF). Let G_k denote the k-core of G = (V, E). For an edge e = (u, v) ∈ G_k, where N(u) and N(v) are the neighbor sets of nodes u and v, its triadic support is defined as

S(u, v) = |N(u) ∩ N(v)|   (4)

For k ≥ δ, if S(u, v) = 0, then (u, v) is removed from G_k and reassigned to the previous core, C(u, v) = (k − 1), where δ denotes the edge filtering threshold.

In this module, to partition a graph into cohesive subgraphs, CaCoSE employs the k-core decomposition algorithm in [6]. Initially, it calculates the edge coreness following Definition 2.
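The filtration step of Theorem 1 can be sketched as follows, assuming the edge coreness values have already been computed per Definition 2. This is an illustrative sketch of ours: the toy graph and the choice δ = 2 are only for visibility in a small example (the paper's default threshold, stated in the experimental setup, is δ = 3).

```python
def caef_filter(adj, edge_score, delta):
    """Closure-aware edge filtration (Theorem 1): an edge whose coreness k
    is at least delta but whose endpoints share no common neighbor
    (triadic support S(u, v) = |N(u) & N(v)| = 0) is demoted to k - 1."""
    out = {}
    for (u, v), k in edge_score.items():
        support = len(adj[u] & adj[v])   # S(u, v) = |N(u) ∩ N(v)|
        out[(u, v)] = k - 1 if (k >= delta and support == 0) else k
    return out

# Toy graph: a 4-cycle 1-2-3-4 plus node 5 closing a triangle with 1 and 2.
# Every node has degree >= 2, so all edges have coreness 2.
adj = {1: {2, 4, 5}, 2: {1, 3, 5}, 3: {2, 4}, 4: {1, 3}, 5: {1, 2}}
edge_score = {(1, 2): 2, (1, 4): 2, (1, 5): 2,
              (2, 3): 2, (2, 5): 2, (3, 4): 2}
filtered = caef_filter(adj, edge_score, delta=2)
# Edges with a common neighbor, e.g. (1, 2) closed by node 5, keep score 2;
# triangle-free edges such as (3, 4) are demoted to score 1.
```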
For example, edge (v_4, v_6) is part of both the 1-core and the 2-core subgraphs. As 2 > 1, the coreness score is C(v_4, v_6) = 2. Similarly, (v_6, v_7) ∈ G_1 ∩ G_2, hence C(v_6, v_7) = 2.

Simultaneously, we examine the existence of narrow edges in k-core subgraphs. According to Theorem 1, such edges are removed from the k-core subgraph and reassigned to the (k − 1)-core subgraph. For example, in Figure 1, when k = 3, the edge (v_4, v_7) has no support. It may act as a noisy channel during neighborhood aggregation in GNNs. Hence, (v_4, v_7) is pruned from G_3 and assigned to G_2, so C(v_4, v_7) = 2.

After assigning scores to all edges, the graph is decomposed into subgraphs corresponding to the same edge-score groups:

S = {S_k}_{k=1}^{k_max},  S_k = (V_S, E_S),  E_S = {(u, v) ∈ E | C(u, v) = k}   (5)

Note that S_k ≠ G_k, e.g., S_3 ≠ G_3. Although many nodes overlap across different subgraphs, they consistently carry the same weighted edges within each subgraph. Hence, these score-based partitions avoid structural inconsistencies and preserve meaningful subgraphs for graph operations.

3.2 Subgraph Learning.

In this part, CaCoSE aims to embed the topological information of all decomposed subgraphs into node representations. For this purpose, it applies a GCN to each subgraph and obtains the representation h_v ∈ H_k of each vertex separately:

H_k = GCN_k(S_k)   (6)

Here, H_k ∈ R^{N_{v_S} × h} is the node feature matrix, where N_{v_S} = |S_k| denotes the number of vertices in the subgraph. Nevertheless, learning within a subgraph alone captures local cohesive information but loses global information. To obtain the global information of each subgraph, CaCoSE attains subgraph embeddings by employing self-attention graph pooling (SAGPool):
Z_k = SAGPool_k(S_k, H_k)   (7)

This pooling encodes essential global structural information of the subgraphs through relevant neighborhood selection, with Z_k ∈ R^{d_S}, where d_S denotes the pooling dimension. It effectively filters out task-irrelevant or noisy neighbors of candidate nodes, even in highly heterophilic networks. Combining these subgraph embeddings with node embeddings provides crucial global information to vertices.

3.3 Cross-Subgraph Attention & Feature Combination.

Cohesion-centric partitioning reduces bottleneck loads but causes a loss of long-range dependencies among vertices. Additionally, the subgraph embeddings obtained via pooling capture only local structural information while overlooking the other subgraphs. Hence, a way to recover global connectivity across partitions is required. The attention mechanism [27] enables the modeling of long-range dependencies through sequential learning. Toward this goal, cross-subgraph attention facilitates communication and information propagation among the distinct subgraphs.

In this stage, CaCoSE employs cohesion-sensitive subgraph attention to update the subgraph embeddings. Each subgraph embedding is processed through cross-attention to encode mutual information across regions. Attention [27] helps capture essential awareness among entities. The pooled subgraph embeddings are represented as Z_S = [Z_1, Z_2, ..., Z_{k_max}]^T ∈ R^{N_S × d_S}, where N_S = |S| denotes the number of subgraphs and d_S the feature dimension. From the subgraph feature matrix, the query, key, and value matrices are computed as Q = Z_S W_Q, K = Z_S W_K, V = Z_S W_V, where W_Q, W_K, W_V are learnable weight matrices.
Then, the attention scores are measured and the subgraph representations are updated as

Attn_S = softmax( QK^T / √d_S ),  Z^attn_S = Attn_S V   (8)

where Z^attn_S = [Z^attn_1, Z^attn_2, ..., Z^attn_{k_max}]^T ∈ R^{N_S × d_S} presents the updated subgraph representations, incorporating the other subgraphs' attention, and Attn_S ∈ R^{N_S × N_S} denotes the subgraph attention matrix. It is noteworthy that this attention procedure resembles learning over a complete graph. Meanwhile, cohesion-sensitive partitioning produces only a small number of subgraphs; therefore, cross-subgraph attention adds negligible computational overhead. The final graph representation is derived by taking the mean of the cross-attentive subgraph embeddings, preserving inter-subgraph relational information:

Z_G = Mean( [Z^attn_1, Z^attn_2, ..., Z^attn_{k_max}] )   (9)

CaCoSE obtains informative node embeddings by concatenating (∥) each subgraph's attentive features with its vertex representations:

h_v = h_v ∥ Z^attn_k,  v ∈ S_k   (10)

If a node belongs to multiple subgraphs, a mapping (Map) function matches its occurrences, and all of its representations are summed. For example, for v ∈ {S_l, S_m, S_n} with S_l, S_m, S_n ⊆ S, the final node representation is computed as:

z_v = Σ_{k ∈ {l,m,n}} h_v^{(k)}   (11)

3.4 Final Prediction.

Decomposition reduces excessive information processing through bottlenecked channels in the graph. Subsequently, selection-based learning provides compact and meaningful subgraph representations. Following that, cross-subgraph attention alleviates long-hop dependency by modeling interactions among vertices across different subgraphs. Finally, the processed node and graph representations Z_v and Z_G are passed through a multilayer perceptron (MLP) for final prediction.
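The cross-subgraph attention and feature combination of Eqs. (8)-(10) can be sketched in numpy as follows. This is a minimal single-head sketch of ours: the random matrices stand in for the learned W_Q, W_K, W_V, and the shapes (three subgraphs, dimension 4) are illustrative.

```python
import numpy as np

def cross_subgraph_attention(Z, seed=0):
    """Scaled dot-product attention over pooled subgraph embeddings
    Z (N_S x d_S), Eq. (8): Attn_S = softmax(Q K^T / sqrt(d_S)),
    Z_attn = Attn_S V. Random weights stand in for learned W_Q, W_K, W_V."""
    n_s, d_s = Z.shape
    rng = np.random.default_rng(seed)
    W_q, W_k, W_v = (rng.standard_normal((d_s, d_s)) for _ in range(3))
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(d_s)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ V, attn

# Three pooled subgraph embeddings of dimension 4.
Z = np.arange(12, dtype=float).reshape(3, 4)
Z_attn, attn = cross_subgraph_attention(Z)
Z_graph = Z_attn.mean(axis=0)                # Eq. (9): mean readout
h_v = np.ones(4)                             # a node embedding, v in S_1
z_v = np.concatenate([h_v, Z_attn[0]])       # Eq. (10): h_v || Z_attn_k
```

For a node appearing in several subgraphs, Eq. (11) would then sum the per-subgraph vectors z_v produced this way.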
Experimental results validate the model's ability to mitigate oversquashing and enhance performance on downstream inference. A detailed examination of the CaCoSE algorithm, its complexity, theoretical justification, and scalability analysis is available in Appendix A.

4 Related Works

Graph Decompositions. Numerous graph decomposition algorithms have been applied to improve GNNs' effectiveness. In the early stages of GNNs, spectral clustering [31,7] methods were popular, along with hierarchical [30] and modularity-based [26] clustering models. However, due to expensive eigenvalue decomposition and cluster-assignment steps, these methods incur relatively high time complexity.

Cohesion-sensitive Decompositions. These algorithms are mostly applied to determine the interconnectedness between nodes in multiple network regions for solving problems in different domains: graph compression [1], high-performance computing [22], analyzing social networks [9], etc. Only a few methods utilize these algorithms to enhance the effectiveness of GNNs. TGS [17] utilizes the k-truss algorithm for edge scoring and sparsifies noisy edges to mitigate oversmoothing in GNNs. Another model, CTAug [29], utilizes the k-truss and k-core algorithms to provide cohesive subgraph awareness and improve graph contrastive learning (GCL). Nonetheless, due to repetitive edge filtering for sparsification and GCL's resource-expensive nature, both models require longer run times.

Oversquashing. Prior work [2] illustrates that, due to bottlenecks in the graph, long-range neighborhood signals are distorted, which degrades GNNs' performance in downstream graph learning tasks. SDRF [25] proposed curvature-based graph rewiring to detect edges responsible for the information bottleneck.
FoSR [18] applies systematic edge-addition operations guided by spectral expansion, while BORF [24] demonstrates the impact of curvature signs on oversquashing and oversmoothing. However, due to their dependency on intermediate layer outputs and subgraph matching with optimal transport, these models show inconsistency on large heterophilic graphs. Another approach, GTR [8], uses effective resistance and a repeated edge-addition technique to reduce the impact of oversquashing in GNNs. However, this method considers the entire graph's information for edge rewiring, which increases complexity on large-scale graphs. Recently, GoKU [21] utilizes spectrum-preserving graph rewiring, first densifying the network and subsequently applying structure-aware graph sparsification. In contrast, LASER [5] adopts locality-aware sequential rewiring while considering multiple graph snapshots. GraphViT [16] leverages Vision Transformers combined with an MLP-mixing mechanism to model long-range dependencies in graphs. On the other hand, LRGB [12] addresses the limitations of benchmark GNNs and Transformers in analyzing long-range vertex interactions. However, both approaches primarily focus on relatively ordered and structurally regular datasets. In contrast, our work concentrates on reducing oversquashing in complex, irregular real-world and social networks.

5 Experiment Results

This section covers CaCoSE's experimental outcomes. First, it discusses the datasets and baselines. Then, it describes the experimental setup and, finally, presents the model's results compared with other models across various experiments. Note that the synthetic datasets used in prior work [12,2] (tree neighbors-match, ring transfer, etc.) are unsuitable for k-core decomposition due to their fixed structure. Hence, our experiments focus on real-world complex networks.
5.1 Experimental Settings

Datasets and Baselines. Our CaCoSE model is evaluated on 8 datasets (D_v) for the node classification task. Five datasets are homophilic: Cora, CiteSeer, CoAuthor CS, Amazon Computers, and Amazon Photos. Three datasets are heterophilic: Chameleon, Squirrel, and Texas. For graph classification, we experiment on six datasets: four from the social network domain (IMDB-BINARY, IMDB-MULTI, COLLAB, and REDDIT-BINARY) and two from the biomedical domain (MUTAG and PROTEINS).

Table 1: Node Classification Accuracy Comparison. Highest accuracy is bolded, second-best is underlined. OOM: Out of Memory in execution.

M / D_v   Cora   C.Seer  Co.CS  Am.Cmp  Am.Pto  Texas  Cham.  Squir.
GCN       84.24  69.10   88.86  87.77   92.12   52.36  63.32  48.79
GAT       85.00  67.94   87.13  88.41   92.01   52.37  61.70  46.18
SAGE      83.60  67.76   88.42  87.22   91.58   56.05  62.64  47.15
LRGB      71.02  57.58   59.50  70.34   74.04   57.36  46.93  30.49
BORF      83.68  67.08   90.52  89.79   91.93   54.21  60.24  OOM
FoSR      83.66  67.24   90.44  89.82   91.83   52.37  60.31  40.19
SDRF      84.04  67.42   90.56  86.79   91.71   52.63  60.74  43.02
GTR       84.07  69.15   88.87  85.37   91.76   52.37  63.78  48.67
DR        44.35  24.90   72.50  71.43   79.86   69.45  27.87  23.85
LASER     75.77  64.26   76.90  38.63   67.36   32.63  41.90  25.87
GOKU      82.66  66.08   90.04  OOM     92.51   37.10  65.90  46.74
CaCoSE    85.00  69.42   90.73  90.43   92.98   54.47  68.99  58.86

Table 2: Comparison of different methods on graph classification benchmarks.
M / D_G   MUTAG  IMDB-B  IMDB-M  RDT-B  COLLAB  PROTEINS
SDRF      74.53  62.90   41.53   85.40  70.22   66.88
FoSR      75.89  60.40   37.33   83.25  69.85   66.70
BORF      64.00  60.82   38.20   84.92  OOM     68.41
GTR       76.00  70.20   45.33   89.65  68.02   71.52
DR        71.00  53.60   35.60   73.60  55.54   72.23
LASER     68.00  67.30   45.40   81.70  71.32   71.07
GOKU      74.50  64.80   42.40   80.55  69.02   70.80
CaCoSE    76.99  73.20   49.14   85.70  80.95   71.79

Our model is compared with 11 standard methods for the node classification (NC) task: GCN [19], GAT [28], SAGE [15], LRGB [12], BORF [24], FoSR [18], SDRF [25], GTR [8], DR [3], LASER [5], and GOKU [21]. For graph classification (GC), we compare CaCoSE with seven methods (BORF, FoSR, SDRF, GTR, DR, LASER, and GOKU), excluding the first four baselines.

Experimental Setup. We evaluate our model by running the baselines' code for a fair comparison. Except for Cora (1208, 500, 500) and CiteSeer (1812, 500, 500), we split the other datasets into 48%, 32%, and 20% for training, validation, and testing, respectively. For graph classification (GC), the ratio is (80% : 10% : 10%) for all datasets. Each model runs for up to 250 iterations (NC) and 100 iterations (GC), with early stopping after 50 and 25 consecutive epochs without validation improvement, respectively. Node features are initialized with one-hot encoding, and the hidden dimension of the GNNs is set to 128. With a learning rate of 2.5e-3, CaCoSE utilizes l2 regularization with a weight decay of 1e-4. The pooling ratio is set to 0.5 in both cases, while the number of heads is set to 2 (NC) and 1 (GC). Besides, we set the CaEF threshold δ to 3. Finally, we split each dataset using 10 different seeds and report the mean accuracy as the final result.

Fig. 3: Sensitivity Analysis. Varying Pooling Ratios (PR) and Numbers of Heads (NH). Panels: (a) NC (PR), (b) NC (NH), (c) GC (PR), (d) GC (NH).
CaCoSE's settings for node classification (PR = 50% and NH = 2) and for graph classification (PR = 50% and NH = 1) are highlighted in bold. NC datasets: Cora, CiteSeer, Texas, Chameleon. GC datasets: IMDB-B, COLLAB, MUTAG, PROTEINS.

Table 3: Performance comparison across different values of δ.

NC / δ   3      4      5      6      7
Cora     85.00  84.88  85.22  84.96  85.10
C.Seer   69.42  69.12  68.90  68.96  69.00
Cham.    68.99  68.73  68.46  68.58  68.93

GC / δ   3      4      5      6      7
RDT-B    85.70  83.50  83.20  83.15  81.80
IMDB-B   73.20  71.60  73.40  72.70  73.79
PROT.    71.79  72.77  72.05  71.52  72.41

5.2 Result Analysis

Node Classification and Graph Classification. Table 1 compares our model with the baselines on the accuracy metric. In most cases, CaCoSE achieves a superior gain, computed as ((acc(CaCoSE) − acc(baseline)) / acc(baseline)) × 100 (%), over the other methods. Particularly on the dense heterophilic Chameleon and Squirrel datasets, it shows performance gains of 4.69% and 20.63%, respectively, over the nearest-performing baselines, GOKU and GCN.

Table 2 presents the performance of our model in comparison with the standard oversquashing-addressing baselines. CaCoSE achieves better or comparable performance relative to the baselines. Notably, on the IMDB-BINARY and COLLAB datasets, it surpasses the other baselines with substantial gains of 4.27% and 13.50%, respectively, where the closest competing baselines are GTR and LASER.

Sensitivity Analysis. We analyze the impact of the pooling ratio and the number of heads on model performance. As shown in Fig. 3(a), the accuracy curves for CaCoSE exhibit minimal fluctuation on the four NC datasets from lower to higher pooling ratios. In contrast, on the GC datasets (Fig. 3(c)), MUTAG's curve displays an uneven trend, while the other three show moderate movement. Regarding the number of heads, Fig.
3(b) illustrates that our model experiences small-scale performance changes on the Texas and CiteSeer datasets, with nearly identical performance on the other two NC datasets. For the GC datasets (Fig. 3(d)), there is little fluctuation in performance on IMDB-BINARY and COLLAB, while PROTEINS exhibits a gradual increase in performance.

Furthermore, in Table 3, we analyze CaCoSE's performance by changing the CaEF threshold δ across three NC datasets and three GC datasets. Generally, model performance is observed to decrease as the value of δ increases; however, it fluctuates on some datasets, especially IMDB-B and PROTEINS. We use δ = 3 as the default value in our model.

Analysis of Bridges in Heterophilic Networks. In the experiments, CaCoSE shows outstanding performance on heterophilic networks. To determine the underlying reason, we analyze bridge edges, which act as narrow channels in graphs. For each bridge edge, we examine the 2-hop neighborhood of its endpoint nodes and color the vertices by class label. The same analysis is repeated on the corresponding edges in the decomposed subgraphs. In many cases, the neighborhoods around decomposed heterophilic bridges reveal latent homophily.

Fig. 4: Snippets of Bridge Analysis. Edge (1976, 473) in the Chameleon (Chm) dataset and edge (4799, 358) in the Squirrel (Sqr) dataset. Panels: (a) Chm (1976, 473), (b) Chm-k1c (1976, 473), (c) Sqr (4799, 358), (d) Sqr-k1c (4799, 358). k1c denotes the 1-core subgraph; the bridge edges are highlighted.

Figure 4 illustrates the 2-hop surroundings of bridge edges in both the original graphs (Figs. 4(a) and 4(c)) and the edge-induced partitioned subgraphs (Figs. 4(b) and 4(d)). In the original graphs, bridge edges show strong heterophilic characteristics, causing cross-class contamination in GNN aggregation and reducing representational expressivity.
After k-core decomposition, the surroundings of the bridge edges remain heterophilic and resemble a star-like pattern in which the hub node differs from its neighbors. However, most leaf nodes share the same class label. This allows a 2-layer GNN to effectively capture the homophilic structure. Hence, the partition preserves latent homophily in those subgraphs, which enhances the model's performance.

Ablation Study. In this analysis, we experiment with our model by altering its components. Figure 5 illustrates the performance gains of the k-core decomposition in CaCoSE compared to alternative partition methods on six different datasets. We use four NC datasets, namely two citation networks (Cora, CiteSeer) and two heterophilic networks (Chameleon, Squirrel), and two GC datasets, IMDB-BINARY and MUTAG. With a few exceptions, our model consistently achieves higher accuracy than the other decompositions.

Fig. 5: Performance gain (in %) of CaCoSE vs. other decompositions, illustrated on six datasets (4 NC and 2 GC): Louvain (Lv), Metis (M), Hierarchical (Hi), and Random-Walk (Rw). Except for Louvain, the methods are annotated with the number of partitions; for instance, (M-4) denotes Metis with four partitions. Green shades indicate positive gains, red negative.

Table 4: Ablation study and performance comparison across datasets. Accuracy values exceeding CaCoSE are indicated in bold.

Changed Component                        Cora   C.Seer  Cham.  PROT.  IMDB-B
Attention     without cross-attention    84.42  68.38   69.23  71.87  73.00
CaEF          without CaEF               84.96  68.41   68.73  65.89  73.10
SAGPool       TopKPool                   84.94  69.20   65.62  65.09  72.90
              DMoN                       85.10  69.37   65.82  74.55  67.50
              GMT                        84.70  69.74   65.27  56.90  60.71
GCN           GAT                        83.31  68.86   65.49  61.33  71.70
              SAGE                       83.92  69.12   60.04  63.84  73.90

Table 4 presents an ablation study on five datasets (three NC and two GC) to assess the contribution of the different modules.
First, we remove the attention mechanism, followed by the exclusion of the closure component (CaEF). In most cases, the resulting drop in accuracy highlights the significance of these components in CaCoSE. Next, we replace SAGPool with alternative pooling methods, including TopKPool [13], DMoNPool [26], and GMT [4]. With only a few exceptions, our model consistently outperforms these variants. Finally, we substitute the GCN backbone with GAT and GraphSAGE. Although GraphSAGE achieves higher accuracy on the IMDB-BINARY dataset, it performs worse on the other datasets. Compared with GAT, our model consistently yields superior performance across all datasets.

6 Conclusion

In summary, our CaCoSE model demonstrates its efficacy in facilitating graph representation learning and overcoming long-range dependency issues in GNNs. Through cohesion-aware graph decomposition, it provides essential locality to the vertices. In addition, SAGPool filters out noisy connections in the networks, and the attention mechanism supplies crucial global information across subgraphs. Extensive experiments on benchmark datasets illustrate its consistency over standard graph learning models. We expect this technique to open up new research directions: subgraph-wise graph rewiring, regional graph super-resolution, and parallel subgraph learning for robust graph representation learning models.

References

1. Akbas, E., Zhao, P.: Truss-based community search: a truss-equivalence based indexing approach. Proceedings of the VLDB Endowment 10(11), 1298–1309 (2017)
2. Alon, U., Yahav, E.: On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205 (2020)
3. Attali, H., Buscaldi, D., Pernelle, N.: Delaunay graph: Addressing over-squashing and over-smoothing using delaunay triangulation.
In: Forty-first International Conference on Machine Learning (2024)
4. Bacciu, D., Conte, A., Grossi, R., Landolfi, F., Marino, A.: K-plex cover pooling for graph neural networks. Data Mining and Knowledge Discovery 35(5), 2200–2220 (2021)
5. Barbero, F., Velingker, A., Saberi, A., Bronstein, M., Di Giovanni, F.: Locality-aware graph-rewiring in GNNs. arXiv preprint arXiv:2310.01668 (2023)
6. Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049 (2003)
7. Bianchi, F.M., Grattarola, D., Alippi, C.: Spectral clustering with graph neural networks for graph pooling. In: International Conference on Machine Learning. pp. 874–883. PMLR (2020)
8. Black, M., Wan, Z., Nayyeri, A., Wang, Y.: Understanding oversquashing in GNNs through the lens of effective resistance. In: International Conference on Machine Learning. pp. 2528–2547. PMLR (2023)
9. Chen, H., Conte, A., Grossi, R., Loukides, G., Pissis, S.P., Sweering, M.: On breaking truss-based and core-based communities. ACM Transactions on Knowledge Discovery from Data 18(6), 1–43 (2024)
10. Chen, R., et al.: Redundancy-free message passing for graph neural networks. Advances in Neural Information Processing Systems 35, 4316–4327 (2022)
11. Di Giovanni, F., Giusti, L., et al.: On over-squashing in message passing neural networks: The impact of width, depth, and topology. In: International Conference on Machine Learning. pp. 7865–7885. PMLR (2023)
12. Dwivedi, V.P., Rampášek, et al.: Long range graph benchmark. Advances in Neural Information Processing Systems 35, 22326–22340 (2022)
13. Gao, H., Ji, S.: Graph U-Nets. In: International Conference on Machine Learning. pp. 2083–2092. PMLR (2019)
14. Gu, M., Yang, G., Zhou, S., Ma, N., Chen, J., Tan, Q., Liu, M., Bu, J.: Homophily-enhanced structure learning for graph clustering.
In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. pp. 577–586 (2023)
15. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017)
16. He, X., Hooi, B., Laurent, T., Perold, A., LeCun, Y., Bresson, X.: A generalization of ViT/MLP-Mixer to graphs. In: International Conference on Machine Learning. pp. 12724–12745. PMLR (2023)
17. Hossain, T., Saifuddin, K.M., et al.: Tackling oversmoothing in GNN via graph sparsification. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 161–179. Springer (2024)
18. Karhadkar, K., Banerjee, P.K., Montúfar, G.: FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. arXiv preprint arXiv:2210.11790 (2022)
19. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
20. Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. In: International Conference on Machine Learning. pp. 3734–3743. PMLR (2019)
21. Liang, L., Bu, F., Song, Z., Xu, Z., Pan, S., Shin, K.: Mitigating over-squashing in graph neural networks by spectrum-preserving sparsification. arXiv preprint arXiv:2506.16110 (2025)
22. Liu, Q.C., et al.: Parallel k-core decomposition with batched updates and asynchronous reads. In: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. pp. 286–300 (2024)
23. Malliaros, F.D., Giatsidis, C., Papadopoulos, A.N., Vazirgiannis, M.: The core decomposition of networks: Theory, algorithms and applications. The VLDB Journal 29(1), 61–92 (2020)
24. Nguyen, K., Hieu, N.M., Nguyen, V.D., Ho, N., Osher, S., Nguyen, T.M.: Revisiting over-smoothing and over-squashing using Ollivier-Ricci curvature. In: International Conference on Machine Learning. pp.
25956–25979. PMLR (2023)
25. Topping, J., Di Giovanni, F., Chamberlain, B.P., Dong, X., Bronstein, M.M.: Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522 (2021)
26. Tsitsulin, A., Palowitch, J., Perozzi, B., Müller, E.: Graph clustering with graph neural networks. Journal of Machine Learning Research 24(127), 1–21 (2023)
27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
28. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
29. Wu, Y., Wang, L., et al.: Graph contrastive learning with cohesive subgraph awareness. In: Proceedings of the ACM Web Conference 2024. pp. 629–640 (2024)
30. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., Leskovec, J.: Hierarchical graph representation learning with differentiable pooling. Advances in Neural Information Processing Systems 31 (2018)
31. Zhang, X., Liu, H., Wu, X.M., Zhang, X., Liu, X.: Spectral embedding network for attributed graph clustering. Neural Networks 142, 388–396 (2021)

A Appendix

A.1 Proof through Ollivier-Ricci Curvature

Ollivier-Ricci curvature [24], denoted by κ(u, v), measures the local similarity of the neighborhoods of two vertices, typically by means of random walk measures centered at those vertices. This value is positive or negative when the overlap between the probability distributions of random walkers over N(u) and N(v) is large or small, respectively. Let G = (V_G, E_G) be a simple graph with shortest path distance d(u, v) for all u, v ∈ V_G.
Let Π(μ_u, μ_v) be the family of joint probability distributions (couplings) of μ_u and μ_v, and take μ_u to be the 1-step random walk measure given by

  μ_u(w) = 1/deg(u)  if w ∈ N(u),  and  μ_u(w) = 0  otherwise.

Then define the L1-Wasserstein distance as

  W_1(μ_u, μ_v) = inf_{π ∈ Π(μ_u, μ_v)}  Σ_{(p,q) ∈ V_G × V_G} π(p, q) d(p, q),

which measures the minimum distance random walks starting from u and v must travel to meet each other. The Ollivier-Ricci curvature is then

  κ(u, v) = 1 − W_1(μ_u, μ_v) / d(u, v).

This carries the interpretation that whenever two random walkers are unlikely to meet, we have κ(u, v) < 0.

Theorem 2. For all (u, v) ∈ E_G, if N(u) ∩ N(v) = ∅, then κ(u, v) ≤ 0.

Proof. For any π ∈ Π(μ_u, μ_v), consider only the pairs with π(p, q) > 0, since pairs with π(p, q) = 0 contribute nothing to Σ_{(p,q)} π(p, q) d(p, q). For such pairs, p ∈ N(u) and q ∈ N(v): the marginal conditions Σ_{q ∈ V_G} π(p, q) = μ_u(p) and Σ_{p ∈ V_G} π(p, q) = μ_v(q) give p ∉ N(u) ⇒ Σ_{q ∈ V_G} π(p, q) = 0 and q ∉ N(v) ⇒ Σ_{p ∈ V_G} π(p, q) = 0, respectively. But N(u) ∩ N(v) = ∅, so p ≠ q for all (p, q) ∈ V_G × V_G with π(p, q) > 0, and thus d(p, q) ≥ 1. It follows that

  Σ_{(p,q)} π(p, q) d(p, q) ≥ Σ_{(p,q)} π(p, q) · 1 = 1,

and therefore W_1(μ_u, μ_v) ≥ 1. Since (u, v) ∈ E_G, we have d(u, v) = 1. Hence, κ(u, v) = 1 − W_1(μ_u, μ_v) ≤ 0. □

Thus, Theorem 2 shows that the edges removed by the CaEF step of the algorithm have nonpositive Ollivier-Ricci curvature; in other words, they correspond to bottlenecks. Our approach can then be said to mitigate oversquashing by removing the message-passing pathways associated with this class of bottlenecks.
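Theorem 2 can be checked numerically on a small graph. The sketch below computes κ(u, v) for an edge of the 4-cycle, whose endpoint neighborhoods are disjoint. It brute-forces W_1 over permutation matchings, which is exact here only because both 1-step walk measures are uniform over equal-size supports (an extreme point of the transport polytope is then a permutation, by Birkhoff's theorem); this equal-degree restriction is an assumption of the sketch, not of the paper:

```python
from collections import deque
from itertools import permutations

def bfs_distances(adj, source):
    """Shortest-path distances from `source` via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def ollivier_ricci(adj, u, v):
    """kappa(u, v) = 1 - W1(mu_u, mu_v) / d(u, v) for an edge (u, v).

    W1 is found by brute force over permutation matchings, exact only
    when deg(u) == deg(v) so both uniform measures have equal support."""
    su, sv = adj[u], adj[v]
    assert len(su) == len(sv), "sketch assumes equal degrees"
    dist = {p: bfs_distances(adj, p) for p in su}
    w1 = min(sum(dist[p][q] for p, q in zip(su, perm))
             for perm in permutations(sv)) / len(su)
    return 1.0 - w1  # d(u, v) = 1 since (u, v) is an edge

# 4-cycle 0-1-2-3-0: N(0) = {1, 3} and N(1) = {0, 2} are disjoint,
# so Theorem 2 predicts kappa(0, 1) <= 0.
c4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(ollivier_ricci(c4, 0, 1))  # 0.0
```

Here every unit of probability mass must move distance at least 1, so W_1 = 1 and the curvature is exactly the boundary value 0 of the theorem.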
Algorithm 1: CaCoSE Algorithm
Input: A graph G = (V, E)
Output: Vertex embeddings Z_v and graph embedding Z_G
1  # Measure edge scores with the k-core algorithm
2  C(u, v) = max{ k | (u, v) ∈ G_k }
3  # Apply the CaEF check: if S(u, v) = 0, assign the previous core value as the edge score, C(u, v) = C(u, v) − 1
4  S_k = { (u, v) ∈ E : C(u, v) = k }
5  # Extract subgraphs S
6  S = { Edge-Subgraph(S_k) : k ∈ 1, …, K_max }
7  foreach S_k = (V_S, E_S) ∈ S do
8      H_k = GCN_k(S_k)
9      Z_k = SAGPool(S_k, H_k)
10 Z_attn_{S_k} = Attention(Z_{S_k})                 /* apply attention among subgraphs, ∀ Z_{S_k} */
11 Z_G = Mean(Z_attn_{S_1}, Z_attn_{S_2}, …, Z_attn_{S_k})   /* graph representation */
12 Z_v = zeros(N_v, d_v + d_S)
13 foreach (H_k, Z_attn_k) ∈ S do
14     for h_v ∈ H_k do
15         h_v = (h_v ∥ Z_attn_k)                    /* concatenation (∥) operation */
16         z_v = z_v + h_v                           /* map v from H_k into Z_v */
17 return Z_G, Z_v

A.2 Representation Learning by CaCoSE

Algorithm 1 presents the execution of the CaCoSE model with k-core decomposition. In lines 2-3, the input graph is first decomposed with the k-core algorithm, where nodes obtain their coreness scores. Next, CaCoSE traverses all edges and assigns the coreness scores as edge weights. Additionally, it applies edge filtering (CaEF) to update the edge scores of specific narrow channels. In the next stage (lines 4-9), CaCoSE extracts edge-induced subgraphs from the input graph based on the coreness scores. Note that the k-core algorithm [23] iteratively removes nodes with degree less than k as k increases; thus, at each node-removal stage, the value (k − 1) is assigned to the removed nodes. Then, on the partitioned subgraphs, it concurrently applies GCN for node representation learning and SAGPool to encode each subgraph's information.
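The edge-scoring step in lines 1-6 of Algorithm 1 follows from a standard observation: an edge survives in G_k exactly when both endpoints have core number at least k, so C(u, v) = min(core(u), core(v)). A minimal stdlib sketch; the CaEF adjustment on line 3 is omitted because its score S(u, v) is defined elsewhere in the paper:

```python
def core_numbers(adj):
    """Core number of each vertex via min-degree peeling
    (an O(m) bucket variant is due to Batagelj and Zaversnik [6])."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=deg.__getitem__)
        k = max(k, deg[v])          # core values never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                deg[w] -= 1
    return core

def edge_score(adj):
    """C(u, v) = max{k | (u, v) in G_k} = min(core(u), core(v))."""
    core = core_numbers(adj)
    return {(u, v): min(core[u], core[v])
            for u in adj for v in adj[u] if u < v}

# Triangle {0, 1, 2} with a pendant vertex 3 attached to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(edge_score(adj))  # {(0, 1): 2, (0, 2): 2, (0, 3): 1, (1, 2): 2}
```

The pendant edge gets score 1 and the triangle edges score 2, so the 1-core and 2-core edge-induced subgraphs in line 6 would separate the bridge-like pendant from the cohesive triangle.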
In the next step (line 10), CaCoSE applies the attention mechanism across the subgraph embeddings to capture the mutual information among them. It then takes the average of all subgraph embeddings to form the final representation of the graph (line 11). For node embeddings (lines 12-17), each subgraph's cross-attentive embedding is concatenated with its node representations, while shared nodes are mapped and their combined features summed into the vertices' global feature matrix.

Complexity. In the worst case, the time complexities of k-core decomposition, SAGPool, and cross-attention are O(V + E), O(V²), and O(k²d), respectively. Hence, the overall complexity of CaCoSE is O(V² + V + E + k²d). Although this looks complex, CaCoSE benefits from parallelizable k-core decompositions and efficient GPU execution of SAGPool and cross-attention. The detailed algorithm is omitted due to space constraints.

Fig. 6: Scalability Test. This figure presents the maximum core value K_max along the y-axis for networks generated with various combinations of graph size (#Nodes) and density (p). The value of K_max corresponds to the maximum number of subgraphs that can be extracted through k-core decomposition.

Scalability. In this experiment, we apply the Erdős–Rényi graph generator to produce random networks of varying scale. During generation, we construct graphs with vertex counts of 10², 10³, 10⁴, and 10⁵. For each graph size (#Nodes), we employ edge-creation probabilities p ∈ {0.01, 0.05, 0.10, 0.25, 0.50}. As the graph size increases, the number of possible edges, and consequently the total number of generated edges, grows significantly. For each generated graph, we compute the maximum core number (K_max), representing the highest number of subgraphs that can be extracted through the k-core decomposition.
Figure 6 plots K_max against the graph density p (x-axis) for networks of different scales (distinguished by colors and markers). In addition, each plot is annotated with the number of edges corresponding to each combination of graph size and density.

The results show that even in extremely large and densely connected networks, the maximum core value remains below 10⁴. In practical cases, the value of K_max is substantially smaller, indicating that the number of cohesive subgraphs extractable via k-core decomposition is limited in realistic settings. Since real-world graphs are sparser and the resulting subgraphs are smaller, the computational overhead introduced by the attention mechanism in our framework is negligible. Note that the generation process was constrained by hardware capacity: for graphs with 10⁴ vertices we limit the density to p ≤ 0.25, and for graphs with 10⁵ vertices we consider p ≤ 0.10.
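The scalability experiment can be reproduced in miniature with the standard library alone. The sketch below uses a much smaller graph size than the paper's (n = 500 rather than up to 10⁵), purely to keep it fast; the seed and sizes are illustrative choices, not the paper's settings:

```python
import random

def gnp_graph(n, p, seed=0):
    """Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible edges kept with probability p."""
    rng = random.Random(seed)
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def max_core(adj):
    """K_max: the largest core number, via min-degree peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining, k = set(adj), 0
    while remaining:
        v = min(remaining, key=deg.__getitem__)
        k = max(k, deg[v])
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                deg[w] -= 1
    return k

# K_max grows slowly with density, echoing the trend in Fig. 6.
for p in (0.01, 0.05, 0.10):
    g = gnp_graph(500, p, seed=42)
    print(f"n=500, p={p}: K_max = {max_core(g)}")
```

Even at these toy sizes, K_max stays far below n, consistent with the observation that the number of extractable subgraphs, and hence the attention cost, remains modest.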
