Reading time: 20 minute
...

📝 Original Info

  • Title:
  • ArXiv ID: 2512.22388
  • Date:
  • Authors: Unknown

📝 Abstract

Graph Neural Networks (GNNs) are powerful tools for learning from graphstructured data, but their application to large graphs is hindered by computational costs. The need to process every neighbor for each node creates memory and computational bottlenecks. To address this, we introduce BLISS, a Bandit Layer Importance Sampling Strategy. It uses multi-armed bandits to dynamically select the most informative nodes at each layer, balancing exploration and exploitation to ensure comprehensive graph coverage. Unlike existing static sampling methods, BLISS adapts to evolving node importance, leading to more informed node selection and improved performance. It demonstrates versatility by integrating with both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), adapting its selection policy to their specific aggregation mechanisms. Experiments show that BLISS maintains or exceeds the accuracy of full-batch training.

📄 Full Content

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, enabling applications such as personalized recommendations Ying et al. [2018], Wang et al. [2019], drug discovery Lim et al. [2019], Merchant et al. [2023], image understanding Han et al. [2022Han et al. [ , 2023]], and enhancing Large Language Models (LLMs) Yoon et al. [2023], Tang et al. [2023], Chen et al. [2023]. Architectures like GCNs and GATs have addressed early limitations in capturing long-range dependencies.

However, training GNNs on large graphs remains challenging due to prohibitive memory and computational demands, primarily because considering all neighbor nodes for each node leads to excessive memory and computational costs. While mini-batching, common in deep neural networks, can mitigate memory issues, uninformative mini-batches can lead to: 1) Sparse representations: Nodes may be isolated, neglecting crucial connections and resulting in poor representations. 2) Neighborhood explosion: A node’s receptive field grows exponentially with layers, making recursive neighbor aggregation computationally prohibitive even for single-node mini-batches.

Efficient neighbor sampling is essential to address these challenge. Techniques include random selection, feature-or importance-based sampling, and adaptive strategies learned during training. They fall into three categories: (1) Node-wise sampling, which selects neighbors per node to reduce cost but risks redundancy (e.g., GraphSAGE Hamilton et al. [2017], VR- GCN Chen et al. [2017], BS- GNN Liu et al. [2020]); (2) Layer-wise sampling, which samples neighbors jointly at each layer for efficiency and broader coverage but may introduce bias (e.g., FastGCN Chen et al. [2018], LADIES Zou et al. [2019], LABOR Balin and Çatalyürek [2023]); and (3) Sub-graph sampling, which uses induced subgraphs for message passing, improving efficiency but potentially losing global context if reused across layers (e.g., Cluster-GCN Chiang et al. [2019], GraphSAINT Zeng et al. [2019]). Layer-wise sampling. Left: Node-wise sampling selects nodes per target node, often causing redundancy (e.g., v 4 sampled for both u 1 and u 2 ), higher sampling rates (e.g., v 4 , v 5 ), and missing edges (e.g., u 2 -v 4 ). Right: Layer-wise sampling considers all nodes in the previous layer, preserving structure and connectivity while sampling fewer nodes.

Our key contributions are: (1) Modeling neighbor selection as a layer-wise bandit problem: Each edge represents an “arm” and the reward is based on the neighbor’s contribution to reducing the variance of the representation estimator. (2) Applicability to Different GNN Architectures: BLISS is designed to be compatible with various GNN architectures, including GCNs and GATs.

The remainder is organized as follows: section 2 describes BLISS; section 3 reports results; section 4 concludes. A detailed background and related work appear in sections B and C.

2 Proposed Method 2.1 Bandit-Based Layer Importance Sampling Strategy (BLISS) BLISS selects informative neighbors per node and layer via a policy-based approach, guided by a dynamically updated sampling distribution driven by rewards reflecting each neighbor’s contribution to node representation. Using bandit algorithms, BLISS balances exploration and exploitation, adapts to evolving embeddings, and maintains scalability on large graphs. Traditional node sampling often fails to manage this trade-off or adapt to changing node importance, reducing accuracy and scalability. While Liu et al. [2020] framed node-wise sampling as a bandit problem, BLISS extends it to layer-wise sampling, leveraging inter-layer information flow and reducing redundancy (see fig. 1).

Initially, edge weights w ij = 1 for all j ∈ N i , with sampling probabilities q ij set proportionally. BLISS proceeds top-down from the final layer L, computing layer-wise sampling probabilities p j for nodes in layer l. These are passed to algorithm 4, which selects k nodes. The GNN then performs a forward pass, where each node i aggregates from sampled neighbors j s to approximate its representation μi :

Here, j s ∼ q i denotes the s-th sampled neighbor of node i, drawn from the per-node sampling distribution q i . This process updates node representations h j . The informativeness of neighbors is quantified as a reward r ij , and the estimated rewards rij are calculated as:

where S t i is the set of sampled neighbors at step t, α ij is the aggregation coefficient, and h j is the node embedding. The edge weights w ij and sampling probabilities q ij are updated using the EXP3 algorithm (see algorithm 5). The edge weights are updated as follows:

where δ is a scaling factor and η is the bandit learning rate.

BLISS operates through an iterative process of four steps: (1) dynamically selecting nodes at each layer via a bandit algorithm (e.g., EXP3) that assigns sampling probabilities, (2) estimating node representations by aggregating from sampled neighbors using Monte Carlo estimation, (3) performing standard GNN message passing with these samples, and (4) calculating rewards based on neighbor contributions to update the bandit policy and refine future sampling distributions. For the detailed algorithm check algorithm 2.

BLISS: We extend BLISS to attentive GNNs, following Liu et al. [2020]. With only a sampled neighbor set S i , true normalized attention α ij is unavailable. We compute unnormalized scores αij and define adjusted feedback attention:

where q ij is the banditdetermined sampling probability of edge e ij . We use j∈Si q ij as a surrogate for the normalization over the full neighborhood N i , thus approximating α ij while properly weighting sampled neighbors within the attention mechanism.

PLADIES:Applying LADIES to attentive GNNs (e.g., GATs) requires preserving at least one neighbor per node after sampling to respect attention’s dependence on neighbor information. The PLADIES edge-sampling procedure, adapted from Balin and Çatalyürek [2023] and detailed in algorithm 4, first computes initial probabilities (p j ), then iteratively adjusts a scaling factor (c) so the sum of clipped probabilities approaches the target sample size (k). Probabilities for seed nodes V skip are set to ∞, guaranteeing selection and creating “skip connections,” ensuring each node retains a neighbor for attention while enabling LADIES to leverage attention efficiently.

We evaluate the performance of each method in the node prediction task on the following datasets: Cora, Citeseer Sen et al. [2008], Pubmed Namata et al. [2012], Flickr, Yelp Zeng et al. [2019], and Reddit Hamilton et al. [2017]. More details of the benchmark datasets are given in table 3.

We compare BLISS with PLADIES, a strong baseline among existing layer-wise sampling algorithms. The code for both BLISS and PLADIES is publicly available.1 Model and Training. We use 3-layer GNNs (GraphSAGE and GATv2) with a hidden dimension of 256. Models are trained with the ADAM optimizer with a learning rate of 0.002. For bandit experiments, we set η = 0.4 and δ = η/10 6 to prevent large updates.

Sampling Parameters. Batch sizes and fanouts for each dataset are listed in table 4. For smaller datasets (Citeseer, Cora, Pubmed), a small batch size is chosen to ensure the sampler does not process all training nodes in a single step (training nodes: 120, 140, and 60 respectively). For larger datasets (Flickr, Yelp, Reddit), relatively small batch sizes are used to accommodate limited computational resources (tested on a P100 GPU with 16GB VRAM). An incremental fanout configuration ensures sufficient local neighborhood aggregation: the first layer’s fanout is set to four times the batch size, and subsequent layers’ fanouts are twice the preceding layer’s.

Evaluation. For all methods and datasets, training is conducted 5 times with different seeds, and the mean and standard deviation of the F1-score on the test set are reported. The number of training steps for each dataset is specified in table 4. We run the experiments on GraphSAGE Hamilton et al. [2017] and GATv2 Brody et al. [2021]. citeseer BLISS 0.927 ± 0.005 0.947 ± 0.013 0.712 ± 0.004 0.598 ± 0.028 0.706 ± 0.002 0.580 ± 0.032 PLADIES 0.912 ± 0.007 0.963 ± 0.016 0.699 ± 0.008 0.616 ± 0.020 0.683 ± 0.005 0.601 ± 0.017 cora BLISS 0.989 ± 0.002 0.983 ± 0.005 0.802 ± 0.005 0.785 ± 0.005 0.813 ± 0.004 0.795 ± 0.009 PLADIES 0.989 ± 0.003 0.981 ± 0.005 0.800 ± 0.004 0.767 ± 0.011 0.809 ± 0.003 0.772 ± 0.014 flickr BLISS 0.515 ± 0.003 0.516 ± 0.002 0.511 ± 0.003 0.503 ± 0.001 0.511 ± 0.002 0.503 ± 0.002 PLADIES 0.511 ± 0.006 0.515 ± 0.001 0.507 ± 0.005 0.504 ± 0.001 0.507 ± 0.005 0.505 ± 0.001 pubmed BLISS 0.907 ± 0.008 0.807 ± 0.063 0.748 ± 0.006 0.594 ± 0.047 0.731 ± 0.007 0.597 ± 0.057 PLADIES 0.910 ± 0.008 0.760 ± 0.042 0.750 ± 0.014 0.571 ± 0.038 0.718 ± 0.013 0.557 ± 0.042 reddit BLISS 0.953 ± 0.001 0.979 ± 0.001 0.949 ± 0.001 0.962 ± 0.000 0.949 ± 0.001 0.962 ± 0.000 PLADIES 0.954 ± 0.002 0.979 ± 0.001 0.951 ± 0.001 0.962 ± 0.000 0.950 ± 0.001 0.962 ± 0.000 yelp BLISS 0.540 ± 0.002 0.530 ± 0.005 0.538 ± 0.002 0.527 ± 0.005 0.540 ± 0.002 0.529 ± 0.005 PLADIES 0.540 ± 0.002 0.503 ± 0.009 0.537 ± 0.002 0.501 ± 0.009 0.539 ± 0.002 0.502 ± 0.009 Baseline Justification. We compare BLISS against PLADIES from Balin and Çatalyürek [2023] because it represents the state-of-the-art in layer-wise sampling, which is the specific category BLISS belongs to. While other sampling methods like GraphSAINT Zeng et al. [2019] (subgraph sampling) or GCN-BS Liu et al. [2020] (node-wise bandit sampling) exist, direct comparison would require different experimental setups or fall outside the scope of layer-wise sampling. Our goal is to achieve accuracy comparable to full-batch training while maintaining scalability, which PLADIES also aims for within the layer-wise paradigm.

Our experiments confirm that BLISS, a dynamic layer-wise sampling strategy, consistently outperforms the PLADIES sampler across multiple benchmark datasets and GNN architectures (GAT and GraphSAGE), as shown in table 1. It is worth noting that the original LADIES and PLADIES were designed specifically for GraphSAGE. A comparison of BLISS on GAT against the original PLADIES (or LADIES) on GraphSAGE reveals a noticeable advantage for BLISS (e.g., Citeseer: 70.6% vs. 60.1%; Pubmed: 73.1% vs. 55.7%).

The results demonstrate superior F1-scores for BLISS, particularly with GAT models on Citeseer (70.6% vs. 68.3%) and Pubmed (73.1% vs. 71.8%). This advantage stems from its bandit-driven mechanism, which better adapts to evolving node importance, thereby reducing variance and improving generalization. The performance gains are most pronounced on smaller datasets (Cora, Citeseer, Pubmed) and on complex, heterogeneous graphs like Yelp, where BLISS effectively captures nuanced class relationships (52.9% vs. 50.2% with SAGE). On denser, more uniform graphs like Flickr and Reddit, the performance difference is minimal. fig. 2, fig. 3 summarizes the F1-scores (mean ± standard deviation) and loss for both samplers on GAT and GraphSAGE architectures.

These results validate our theoretical analysis: BLISS minimizes estimator variance by dynamically prioritizing informative neighbors, unlike PLADIES’ static sampling which risks under-sampling critical nodes. The only noted exception was overfitting on the Yelp dataset with GAT for both samplers, which was unevaluated to maintain uniform experimental conditions.

In this work, we proposed Bandit-Based Layer Importance Sampling Strategy (BLISS), a layer-wise sampling method for scalable and accurate training of deep GNNs on large graphs. BLISS employs multi-armed bandits to dynamically select informative nodes per layer, balancing exploration of under-sampled regions with exploitation of valuable neighbors. This enables efficient message passing and improved scalability. We demonstrated its applicability to diverse GNNs, including GCNs and GATs, and presented an adaptation of PLADIES for GATs. Experiments show BLISS matches or exceeds state-of-the-art performance while remaining computationally efficient. Future directions include exploring advanced bandit algorithms (e.g., CMAB) and extending BLISS to domains such as GNN-augmented LLMs and vision tasks.

A.1 Notation Summary

We denote a directed graph G = (V, E) consisting of a set of nodes V = {v i } i=1:N and a set of edges

where N is the number of nodes, N i denotes the set of neighbors of node v i , and L is the number of layers.

Graph Neural Networks. GNNs operate on the principle of neural message passing Gilmer et al. [2017], where nodes iteratively aggregate information from their local neighborhoods. In a typical GNN, the embedding of node v i at layer l + 1 is computed from layer l as follows:

where W (l) is a learnable weight matrix, h

j is the node feature vector at layer l, and σ is a non-linear activation function. The term α ij represents the aggregation coefficient, which varies depending on the GNN architecture (e.g., static in GCNs Kipf and Welling [2016] or dynamic in GATs Velickovic et al. [2017]).

Layer-wise Sampling. Following Huang et al. [2018], eq. ( 7) can be written in expectation form:

where p ij = p(v j |v i ) is the probability of sampling v j given v i , and N (i) = j α ij . To make the computation of eq. ( 8) tractable, the expectation µ p (i) = E pij [h (l) j ] can be approximated via Monte-Carlo sampling:

eq. ( 9) defines node-wise sampling, where neighbors are recursively sampled for each node. While this reduces immediate computational load, the receptive field still grows exponentially with network depth d, leading to O(n d ) dependencies in the input layer for deep networks. An alternative approach is to apply importance sampling to eq. ( 8), which forms the basis for layer-wise sampling methods:

where q j = q(v j |v 1 , …, v n ) is the probability of sampling node v j from the entire layer. We estimate the expectation µ q (i) via Monte-Carlo sampling:

p ij q j ĥ(l) j , vj ∼ q j (11)

The embedding then becomes h

. Without loss of generality, following the setting from Liu et al. [2020], we assume p ij = α ij and normalize the probabilities such that N (i) = 1. We denote µ q (i) as µ i for simplicity and ignore non-linearities. The goal of a layer-wise sampler is to approximate:

An effective estimator should minimize variance. The variance of the estimator in eq. ( 12) is:

We seek q ⋆ j ≥ 0 that minimizes V(μ i ). The optimal sampling distribution is:

To address these challenges, efficient neighbor sampling techniques are crucial. These methods typically involve randomly selecting a fixed number of neighbors, sampling based on node features or importance scores, or employing adaptive strategies that learn optimal sampling during training. They can be broadly categorized into three groups:

Graph Convolutional Networks (GCNs) leverage a simplified convolution operation on the graph, aggregating information from a node’s neighbors, and produces the normalized sum of them as in eq. ( 7) where σ is an activation function (ReLU for GCNs), N (i) is the set of its one-hop neighbors, α l) is the weight matrix for the l-th layer, h (l) j denotes the node feature matrix at layer l. For GraphSAGE is c ij = |N (i)|. This equally weights the contributions from all neighbors. Graph Attention Networks (GATs) address the equal contribution by introducing an attention mechanism that assigns learnable weights (α) to each neighbor based on their features, allowing the model to focus on the most relevant information:

this computes a pair-wise un-normalized attention score between two neighbors. It first concatenates the linear transformation of l-th layer embeddings of the two nodes, where || denotes concatenation, then takes a dot product of it and a learnable weight vector ⃗ a (l) , and applies a LeakyReLU in the end.

This difference allows GATs to capture more nuanced relationships within the graph than GCNs. However, Brody et al. [2021] argues that the original GAT uses a static attention mechanism due to the specific order of operations in eq. ( 15). While the weights depend on both nodes, this structure can limit the expressiveness of the attention calculation. GATv2 introduces a dynamic attention mechanism by modifying this order. This change allows the attention weights to depend on the features of both the sending node (neighbor) and the receiving node in a potentially more expressive way. The key difference lies in the order of operations. eq. ( 15) will be changed to:

This section reviews existing sampling techniques and discusses their limitations.

Layer-Dependent Importance Sampling (LADIES) Zou et al. [2019]: LADIES leverages layerwise importance scores based on node features and graph structure to guide node selection. LADIES begins by selecting a subset of nodes in the upper layer. For each selected node, it constructs a bipartite subgraph of its immediate neighbors. It then calculates importance scores for these neighbors and samples a fixed number based on these scores. This process is repeated recursively for each layer. However, LADIES relies on pre-computed importance scores, which can be computationally expensive and may not adapt well to dynamic edge-weight changes. Additionally, LADIES employs sampling with replacement, which can be suboptimal as it may select the same node multiple times. Calculate sampling probability for each node using p l j in eq. ( 18) 4:

Sample k nodes in l-th layer using p l j .

5:

Normalize the edge weights of the sampled nodes in the layer by eq. ( 19). 6: end for 7: return Modified edge weights αij and Sampled Nodes; While LADIES suggests using α ij values similar to GraphSAGE (c ij = |N (i)|), their implementation utilizes eq. ( 19) for normalization. Instead of directly feeding α ij to the model, it is first used to calculate an importance score:

, where π

The most important nodes are then sampled using p (l) j . Before passing these nodes to the model, the original α ij is re-weighted by p (l) j and normalized by dividing it over c ij of the selected (union of sampled nodes and seed nodes) nodes. These new α(l) ij values are passed to the model at each layer for the selected points.

SKETCH Chen et al. [2022] proposed a fix for the sampling equation and the normalization of the edge weights. Instead of eq. ( 18), they suggested:

They also suggested an alternative normalization for the edge weight instead of eq. ( 19):

where ns (l) j is the number of sampled nodes for node i at layer l. Layer-Neighbor Sampling (LABOR) Balin and Çatalyürek [2023]: The LABOR sampler combines layer-based and node-based sampling. It introduces a per-node hyperparameter to estimate the expected number of sampled neighbors, enabling correlated sampling decisions among vertices. This hyperparameter and the sampling probabilities are optimized to sample the fewest vertices in an unbiased manner.

The paper also introduced PLADIES (Poisson LADIES), which employs Poisson sampling to achieve unbiased estimation with reduced variance. PLADIES assigns each node j in the neighborhood of source nodes S (denoted N (S)) a sampling probability p j ∈ [0, 1] such that j∈N (S) p j = k, where k is the desired sample size. A node j is then sampled if a random number ϕ j ∼ U (0, 1) satisfies ϕ j ≤ p j . PLADIES achieves this unbiased estimation in linear time, in contrast to the quadratic complexity of some debiasing methods Chen et al. [2022]. Notably, its variance converges to 0 if all p j = 1, highlighting its effectiveness.

Bandit Samplers Liu et al. [2020]: Bandit Samplers frame the optimization of sampling variance as an adversarial bandit problem, where rewards depend on evolving node embeddings and model weights. While node-wise bandit sampling is effective, selecting neighbors individually can lead to redundancy and may not capture long-range dependencies efficiently. This highlights the importance of extending to layer-wise sampling.

Their method employs a multi-armed bandit framework to learn a sampling distribution q t i for each node v i at each training step t. The algorithm initializes a uniform sampling distribution. During training, it samples k neighbors for each node based on q t i , computes rewards based on GNN performance, and updates the distribution using an algorithm like EXP3. This process prioritizes informative neighbors to improve training efficiency. Our work builds upon this foundation by applying bandit principles to the layer-wise sampling paradigm.

Table 5: Summary of memory complexity, time complexity, and variance for Full-Batch, GraphSAGE, LADIES, and BLISS methods. This table provides a theoretical comparison of the computational and statistical properties of each method, emphasizing BLISS’s ability to minimize variance while maintaining scalability.

• O(L|E|): Stores bandit weights w ij for all edges across L layers.

• O(LKs layer ): Stores embeddings for s layer sampled nodes per layer (K-dimensional).

• O(LK 2 ): Stores L weight matrices W (l) ∈ R K×K .

• O(L|E|): Bandit weight updates (EXP3) over all edges in L layers.

• O(LKs 2 layer ): Importance score computation for s layer nodes per layer. • O(LK 2 s layer ): Message passing and aggregation for s layer nodes.

• Key Difference from LADIES: The (1 -η) -1 term accounts for exploration in bandit sampling. • Derivation: Minimizing eq. ( 13) with bandit-optimized q j (eq. ( 6)) introduces the ηdependent denominator.

PLADIES (Poisson LADIES) shares identical complexity terms with LADIES. The differences between them is that PLADIES uses Poisson sampling (variable-size, unbiased) instead of fixed-size sampling, and PLADIES reduces empirical variance but retains the same asymptotic bound.

In the table 5, PLADIES is grouped under LADIES since their theoretical complexities are identical. BLISS explicitly diverges due to bandit overhead and adaptive exploration.

The reported time in table 6 measures per-iteration training time -the wall-clock time taken to execute one training step. The code for BLISS is not optimized (using naive for loops in the current implementation) and the comparison might not be reasonable, but for the sake of the having a clearer image about the performance. The time is also averaged over 5 runs per experiment. for l = L to 1 do ▷ Top-down layers 4:

Calculate sampling distribution q ij , using eq. ( 6) 5:

Calculate node sampling probability p j , using eq. ( 1) 6:

Pass the p j to algorithm 4 7:

Sample k nodes for the current layer based on p j .

end for 9:

Run forward pass of GNN 10:

Get the updated node embeddings h j from eq. ( 2) and rewards r ij using eq. ( 3)

The advantages of BLISS are particularly pronounced in smaller datasets (Citeseer, Cora, Pubmed) and highly heterogeneous graphs like Yelp (100 classes) with SAGE. For Yelp, BLISS achieves a test F1-score 52.9% (SAGE), while PLADIES lags at 50.2%. The bandit mechanism likely captures nuanced class relationships more effectively in such complex settings. In contrast, Flickr and Reddit exhibit minimal differences between the samplers, possibly due to their dense connectivity and uniform class distributions, which reduce the impact of adaptive sampling.

GAT models generally benefit more from BLISS than SAGE. For example, on Cora, BLISS achieves a test F1-score of 81.3% (GAT) compared to 80.9% for PLADIES, while SAGE shows narrower margins 79.5% vs. 77.2%. This aligns with our hypothesis that attention mechanisms, which dynamically weigh neighbor contributions, synergize well with BLISS’s reward-driven sampling. SAGE’s uniform aggregation is less sensitive to neighbor selection, though BLISS still improves its performance.

Despite larger fanouts and batch sizes for Flickr, Reddit, and Yelp (table 4), BLISS maintains computational efficiency. Reddit’s test F1-scores 94.9% for BLISS vs. 95.0% for PLADIES, highlight that both samplers scale effectively to massive graphs, though BLISS’s adaptive policy incurs negligible overhead. The higher step counts for Reddit (3,000) and Yelp (10,000) reflect their size but do not compromise BLISS’s stability, as evidenced by low standard deviations.

The Yelp dataset with GAT presented a challenge for both samplers, showing overfitting (fig. 3). While early stopping or hyperparameter adjustments could potentially alleviate this, they were not added here to preserve uniform experimental conditions across all datasets.

Require: Graph G, Sample size k, Bandit learning rate η, Steps T , Number of layers L 1: Initialize w ij = 1 if j ∈ N i else 0 2: for t = 1 to T do 3:

The code implementation is available at: https://github.com/linhthi/BLISS-GNN

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut