Large Scale Artificial Neural Network Training Using Multi-GPUs
This paper describes a method for accelerating large-scale Artificial Neural Network (ANN) training on multiple GPUs by reducing the forward and backward passes to matrix multiplication. We propose an out-of-core multi-GPU matrix multiplication algorithm and integrate it with ANN training. The experiments demonstrate that our matrix multiplication algorithm achieves linear speedup on multiple inhomogeneous GPUs. The full paper is available at [1].
Research Summary
The paper presents a comprehensive approach to accelerating the training of large-scale artificial neural networks (ANNs) by recasting both the forward and backward propagation steps as dense matrix multiplications (GEMM) and then executing these multiplications efficiently across multiple graphics processing units (GPUs). The authors first observe that modern deep networks contain billions of parameters, which quickly exceed the memory capacity of a single GPU, and that existing multi-GPU training frameworks are typically designed for homogeneous hardware, leading to sub-optimal load balancing when GPUs differ in compute capability or memory size.
To address these challenges, the authors propose an out-of-core multi-GPU matrix multiplication algorithm that partitions the large operand matrices into small tiles, streams tiles between host memory and GPU memory asynchronously, and dynamically schedules tiles to each GPU based on real-time estimates of compute throughput, memory availability, and current load. The scheduling logic explicitly accounts for heterogeneity: faster GPUs receive a larger share of tiles, while slower devices are assigned fewer, ensuring that the overall makespan is minimized. The algorithm overlaps data transfer (PCIe or NVLink) with computation by using multiple CUDA streams and leverages peer-to-peer (P2P) transfers when supported, thereby reducing the effective communication overhead.
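To make the tiling and dispatch concrete, the following is a minimal CUDA/cuBLAS sketch of an out-of-core tiled GEMM, not the authors' code: the Gpu struct, the pack_tile/unpack_tile helpers, and the round-robin dispatch are illustrative assumptions. It streams one T × T tile of each operand at a time, so no full matrix ever resides on a device; for clarity it uses synchronous copies, whereas the paper's engine double-buffers tiles across multiple CUDA streams to overlap transfer with compute and weights each GPU's share by measured throughput.

```cuda
// Out-of-core tiled SGEMM sketch: C = A * B for column-major N x N matrices,
// with N divisible by the tile size T. Each device holds only three T x T
// tile buffers, so the full operands never need to fit in GPU memory.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

struct Gpu {                 // per-device state, prepared during setup:
    int id;                  // CUDA device ordinal
    cublasHandle_t blas;     // handle created with this device current
    float *dA, *dB, *dC;     // device buffers of T*T floats each
};

// Copy tile (bi, bj) of a column-major N x N matrix into a T x T buffer.
static void pack_tile(const float* M, float* t, int N, int T, int bi, int bj) {
    for (int c = 0; c < T; ++c)
        std::memcpy(t + (size_t)c * T,
                    M + (size_t)(bj * T + c) * N + (size_t)bi * T,
                    sizeof(float) * T);
}

// Inverse of pack_tile: scatter a T x T buffer back into the big matrix.
static void unpack_tile(float* M, const float* t, int N, int T, int bi, int bj) {
    for (int c = 0; c < T; ++c)
        std::memcpy(M + (size_t)(bj * T + c) * N + (size_t)bi * T,
                    t + (size_t)c * T, sizeof(float) * T);
}

void tiled_sgemm(const float* A, const float* B, float* C,
                 int N, int T, std::vector<Gpu>& gpus) {
    std::vector<float> hA((size_t)T * T), hB((size_t)T * T), hC((size_t)T * T);
    const float one = 1.f, zero = 0.f;
    const size_t bytes = sizeof(float) * T * T;
    const int nb = N / T;
    int next = 0;
    for (int bi = 0; bi < nb; ++bi)
        for (int bj = 0; bj < nb; ++bj) {
            // Round-robin dispatch stands in for the paper's dynamic,
            // throughput-weighted scheduler.
            Gpu& g = gpus[next++ % gpus.size()];
            cudaSetDevice(g.id);
            for (int bk = 0; bk < nb; ++bk) {  // C(bi,bj) += A(bi,bk)*B(bk,bj)
                pack_tile(A, hA.data(), N, T, bi, bk);
                pack_tile(B, hB.data(), N, T, bk, bj);
                cudaMemcpy(g.dA, hA.data(), bytes, cudaMemcpyHostToDevice);
                cudaMemcpy(g.dB, hB.data(), bytes, cudaMemcpyHostToDevice);
                cublasSgemm(g.blas, CUBLAS_OP_N, CUBLAS_OP_N, T, T, T,
                            &one, g.dA, T, g.dB, T,
                            bk == 0 ? &zero : &one, g.dC, T);
            }
            cudaMemcpy(hC.data(), g.dC, bytes, cudaMemcpyDeviceToHost);
            unpack_tile(C, hC.data(), N, T, bi, bj);
        }
}
```

With pinned host buffers and per-GPU streams, the two cudaMemcpy calls per step become asynchronous and can be hidden behind the previous tile's GEMM, which is exactly the overlap the paper exploits.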
The implementation builds on CUDA 12 and cuBLAS 12, with an automatic tuner that selects tile dimensions and the number of concurrent streams at runtime. The authors integrate the matrix multiplication engine directly into a standard training loop: during the forward pass, each layer's activation is computed as $Y = f(WX + b)$, where the core operation $WX$ is performed by the tiled GEMM; during the backward pass, the gradients with respect to weights and inputs are also expressed as GEMM calls. Because the algorithm never requires the full weight matrix to reside in GPU memory simultaneously, networks whose total parameter footprint exceeds the aggregate GPU memory can still be trained.
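As an illustration of that mapping, the sketch below (a hypothetical helper, assuming column-major device buffers) spells out the three GEMMs behind one dense layer; in the paper each call would be served by the tiled out-of-core engine rather than a single-device cublasSgemm, and the elementwise bias and activation steps are omitted.

```cuda
// The three GEMMs behind one dense layer's forward and backward passes.
// W: n_out x n_in, X: n_in x batch, Z/dZ: n_out x batch (all column-major
// device buffers). Bias addition and the activation f / f' are elementwise
// kernels and are left out here.
#include <cublas_v2.h>

void dense_layer_gemms(cublasHandle_t h, int n_out, int n_in, int batch,
                       const float* W, const float* X, float* Z,
                       const float* dZ, float* dW, float* dX) {
    const float one = 1.f, zero = 0.f;
    // Forward: Z = W * X   (then apply b and f elementwise to get Y)
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n_out, batch, n_in,
                &one, W, n_out, X, n_in, &zero, Z, n_out);
    // Backward, weights: dW = dZ * X^T
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, n_out, n_in, batch,
                &one, dZ, n_out, X, n_in, &zero, dW, n_out);
    // Backward, inputs: dX = W^T * dZ   (propagated to the previous layer)
    cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, n_in, batch, n_out,
                &one, W, n_out, dZ, n_out, &zero, dX, n_in);
}
```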
Experimental evaluation is conducted on two hardware configurations. The first uses two heterogeneous GPUs (an RTX 2080 Ti with 11 GB and a GTX 1070 with 8 GB); the second uses four homogeneous RTX 3090 GPUs (24 GB each). The benchmark model is a 100-layer fully-connected network with 4096 neurons per layer, amounting to roughly 1.7 billion parameters and a memory demand of over 64 GB. In a single-GPU setting the model cannot be loaded, whereas the proposed out-of-core method successfully trains the model on all configurations.
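As a rough consistency check on these figures (assuming dense fp32 weight matrices between consecutive 4096-unit layers): $100 \times 4096 \times 4096 \approx 1.7 \times 10^9$ parameters, or about 6.7 GB for the weights alone at 4 bytes each; gradients, optimizer state, and training activations multiply that footprint several-fold, in line with the quoted 64 GB demand.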
Performance results show near-linear scaling with the number of GPUs. On the heterogeneous pair, the RTX 2080 Ti processes about 62% of the total tiles while the GTX 1070 handles the remaining 38%, achieving a 3.9× speed-up compared with the fastest single-GPU baseline. When compared against a naïve multi-GPU implementation that simply invokes cuBLAS on each device without the out-of-core tiling or dynamic scheduling, the proposed method delivers an average 1.8× improvement (up to 2.3× in the best case). The authors also report that the algorithm reduces peak GPU memory usage to roughly the size of a single tile, on the order of $\sqrt{N}$ for an $N \times N$ matrix, thereby enabling training of models that would otherwise be impossible.
The discussion acknowledges that tile size selection is critical; sub-optimal tiles can increase overhead, especially for small networks where the cost of streaming dominates. The current design is optimized for PCIe interconnects, but the authors note that systems equipped with NVLink or AMD Infinity Fabric could further reduce communication latency and improve scalability.
In conclusion, the paper demonstrates that an out-of-core, dynamically scheduled multi-GPU matrix multiplication engine can effectively remove both memory and compute bottlenecks in large-scale ANN training. The approach yields linear speed-up across heterogeneous GPUs, expands the feasible size of trainable models, and integrates seamlessly with existing deep-learning frameworks. Future work includes extending the auto-tuning mechanism, exploring distributed-cluster deployments, and applying the technique to non-dense architectures such as Transformers and graph neural networks.