GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism


Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data-parallel approaches, however, perform redundant work because the subgraphs sampled by different GPUs overlap significantly. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can exceed the savings from eliminating redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems such as DGL, Quiver, and $P^3$.


💡 Research Summary

This paper introduces GSplit, a novel system for scaling Graph Neural Network (GNN) training on large graphs via a new hybrid parallel paradigm termed “Split Parallelism.” The work addresses a fundamental inefficiency in standard data-parallel mini-batch GNN training, where independently sampled micro-batches across different GPUs contain significant overlap in their k-hop neighbor subgraphs. This overlap leads to redundant feature loading, sampling, and computation, wasting resources across the entire training pipeline.

The core innovation of GSplit is the split parallelism paradigm. Instead of each GPU processing a separate micro-batch, all GPUs collaboratively work on a single mini-batch per iteration. This mini-batch is dynamically split on-the-fly into non-overlapping partitions called “splits,” with each split assigned to a specific GPU. Consequently, every vertex’s associated work—sampling, feature loading, and forward/backward computation—is performed exactly once by a single GPU, eliminating redundancy at its root. However, this approach introduces a new challenge: the need to shuffle intermediate vertex features between GPUs at each GNN layer, which can incur communication overhead that outweighs the savings from redundancy removal.
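The split-plus-shuffle idea can be illustrated with a small single-process simulation. Everything below is an illustrative stand-in rather than GSplit's actual kernels: the round-robin assignment, the mean aggregator, and the function names are all assumptions made for the sketch.

```python
def make_splits(batch_vertices, num_gpus, assign):
    """Partition a sampled mini-batch into disjoint per-GPU 'splits'."""
    splits = [[] for _ in range(num_gpus)]
    for v in batch_vertices:
        splits[assign(v)].append(v)
    return splits

def mean_aggregate_layer(splits, edges, feats):
    """One message-passing layer under split parallelism, simulated in one
    process. Each destination vertex is computed by exactly one GPU; a
    source feature owned by a different GPU counts as shuffled traffic."""
    owner = {v: g for g, split in enumerate(splits) for v in split}
    new_feats, shuffled = {}, 0
    for v in owner:
        nbrs = [u for (u, w) in edges if w == v]
        if not nbrs:
            new_feats[v] = feats[v]  # no in-edges: carry the feature through
            continue
        # cross-GPU source features would be sent over the inter-GPU link
        shuffled += sum(owner[u] != owner[v] for u in nbrs)
        new_feats[v] = sum(feats[u] for u in nbrs) / len(nbrs)
    return new_feats, shuffled
```

Note that each vertex appears in exactly one split, so its aggregation runs once; the `shuffled` counter makes the paradigm's trade-off visible, since a poor split assignment inflates it even though no work is duplicated.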

To tackle this, the authors devise a lightweight probabilistic splitting algorithm, a key technical contribution. Rather than running an expensive online graph partitioner on each sampled mini-batch (an NP-hard problem), the algorithm assigns vertices to GPUs according to a pre-computed probability distribution derived from the global graph structure. This distribution provably minimizes the expected communication cost and balances the expected per-GPU workload for a randomly sampled mini-batch. The method adds negligible overhead on the critical sampling path and is more effective than an offline partitioner such as Metis, which does not account for the stochastic nature of mini-batch sampling.
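A toy version of such a pre-computed assignment might look as follows. The neighbor-frequency mixture, the `alpha` weight, the seeding scheme, and all names here are assumptions for illustration, not the paper's actual derivation; the only property the sketch tries to capture is that a vertex is biased toward the GPU owning most of its neighbors (fewer expected cross-GPU edges) while a uniform term keeps the expected load balanced.

```python
import random

def precompute_assignment(adj, num_gpus, seed_assign, alpha=0.9, seed=0):
    """Illustrative stand-in for a probabilistic splitting rule: map each
    vertex to a GPU by sampling from a distribution that mixes neighbor
    co-location (weight alpha) with a uniform load-balancing term."""
    rng = random.Random(seed)
    assign = dict(seed_assign)  # a few vertices pre-assigned as anchors
    for v in sorted(adj):
        counts = [0] * num_gpus
        for u in adj[v]:
            if u in assign:
                counts[assign[u]] += 1
        total = sum(counts)
        if total == 0:
            weights = [1.0 / num_gpus] * num_gpus
        else:
            weights = [alpha * c / total + (1 - alpha) / num_gpus
                       for c in counts]
        assign[v] = rng.choices(range(num_gpus), weights=weights, k=1)[0]
    return assign
```

Because the distribution is computed once offline, looking up a vertex's GPU during sampling is a constant-time operation, which is what keeps the critical path cheap.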

The paper details the implementation of these ideas in the GSplit system. GSplit maintains programming abstractions compatible with data-parallel systems (like local and mixed frontiers) to leverage existing optimized single-GPU kernels for sampling and training. Its training pipeline involves layer-wise sampling where splits are determined, generation of efficient shuffle indices, feature loading (compatible with GPU caching schemes), and the forward/backward pass with integrated inter-GPU communication for feature shuffling at each layer.
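Shuffle-index generation can be sketched as a communication plan per GPU pair. The function below and its data layout are hypothetical, not GSplit's implementation; at runtime such a plan would drive a collective feature exchange at each layer (e.g. an all-to-all), but here only the plan itself is computed.

```python
def build_shuffle_indices(owner, needed_per_gpu):
    """Hypothetical shuffle-index generation: for each (src, dst) GPU pair,
    list the vertices dst needs whose features live on src. plan[src][dst]
    would become the send-index list of an all-to-all exchange."""
    num_gpus = len(needed_per_gpu)
    plan = [[[] for _ in range(num_gpus)] for _ in range(num_gpus)]
    for dst, needed in enumerate(needed_per_gpu):
        for v in needed:
            src = owner[v]
            if src != dst:  # locally owned features need no transfer
                plan[src][dst].append(v)
    return plan
```

Precomputing these index lists during sampling means the per-layer shuffle reduces to indexed gathers plus one collective call, with no per-iteration graph analysis on the critical path.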

Comprehensive evaluations demonstrate GSplit’s superiority. Experiments on large-scale graphs (Orkut, Papers100M, Friendster) and popular GNN models (GAT, GraphSAGE) show that GSplit outperforms state-of-the-art systems significantly. It achieves up to 4.4x (2.4x on average) speedup over DGL, up to 1.9x (1.4x on average) over Quiver, and up to 4.1x (2.4x on average) over an implementation of P³’s push-pull parallelism in a single-host multi-GPU setting. The performance breakdown confirms that the probabilistic splitting algorithm is crucial to this success, effectively controlling communication overhead and load imbalance while fully realizing the benefits of redundancy elimination. In conclusion, GSplit presents a compelling and practical solution for scaling mini-batch GNN training, establishing split parallelism as an efficient alternative to conventional data-parallel approaches.

