Large Scale Artificial Neural Network Training Using Multi-GPUs

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

This paper describes a method for accelerating large-scale artificial neural network (ANN) training on multiple GPUs by reducing the forward and backward passes to matrix multiplication. We propose an out-of-core multi-GPU matrix multiplication algorithm and integrate it with ANN training. Experiments demonstrate that our matrix multiplication algorithm achieves linear speedup on multiple inhomogeneous GPUs. The full paper for this project can be found at [1].


💡 Research Summary

The paper presents a comprehensive approach to accelerating the training of large‑scale artificial neural networks (ANNs) by recasting both the forward and backward propagation steps as dense matrix multiplications (GEMM) and then executing these multiplications efficiently across multiple graphics processing units (GPUs). The authors first observe that modern deep networks contain billions of parameters, which quickly exceed the memory capacity of a single GPU, and that existing multi‑GPU training frameworks are typically designed for homogeneous hardware, leading to sub‑optimal load balancing when GPUs differ in compute capability or memory size.

To address these challenges, the authors propose an out‑of‑core multi‑GPU matrix multiplication algorithm that partitions the large operand matrices into small tiles, streams tiles between host memory and GPU memory asynchronously, and dynamically schedules tiles to each GPU based on real‑time estimates of compute throughput, memory availability, and current load. The scheduling logic explicitly accounts for heterogeneity: faster GPUs receive a larger share of tiles, while slower devices are assigned fewer, ensuring that the overall makespan is minimized. The algorithm overlaps data transfer (PCIe or NVLink) with computation by using multiple CUDA streams and leverages peer‑to‑peer (P2P) transfers when supported, thereby reducing the effective communication overhead.
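The heterogeneity-aware assignment described above can be sketched as a greedy earliest-finish-time loop: whichever device frees up first receives the next tile, so faster GPUs naturally take a larger share. This is a minimal illustration under assumed per-device throughputs, not the authors' implementation:

```python
# Sketch of heterogeneity-aware tile scheduling (illustrative only).
# Each GPU is handed the next tile as soon as it becomes free, so faster
# devices process more tiles and the overall makespan shrinks.
import heapq

def schedule_tiles(num_tiles, throughputs):
    """Greedy earliest-finish-time assignment of tiles to devices.

    throughputs: tiles per second for each (possibly heterogeneous) GPU.
    Returns how many tiles each device ends up processing.
    """
    # Min-heap of (finish_time, device_index); all devices start idle.
    heap = [(0.0, dev) for dev in range(len(throughputs))]
    heapq.heapify(heap)
    counts = [0] * len(throughputs)
    for _ in range(num_tiles):
        finish, dev = heapq.heappop(heap)  # device that frees up first
        counts[dev] += 1
        heapq.heappush(heap, (finish + 1.0 / throughputs[dev], dev))
    return counts

# A GPU twice as fast as its partner receives roughly twice the tiles.
print(schedule_tiles(300, [2.0, 1.0]))  # → [200, 100]
```

In a real system the per-device throughputs would be measured online rather than fixed, which is what allows the scheduler to adapt to load as the paper describes.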

The implementation builds on CUDA 12 and cuBLAS 12, with an automatic tuner that selects tile dimensions and the number of concurrent streams at runtime. The authors integrate the matrix multiplication engine directly into a standard training loop: during the forward pass, each layer’s activation is computed as $Y = f(WX + b)$, where the core operation $WX$ is performed by the tiled GEMM; during the backward pass, gradients with respect to weights and inputs are also expressed as GEMM calls. Because the algorithm never requires the full weight matrix to reside in GPU memory simultaneously, networks whose total parameter footprint exceeds the aggregate GPU memory can still be trained.
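The GEMM view of one fully connected layer can be written out explicitly. The NumPy sketch below uses assumed shapes and ReLU as the activation $f$; in the paper's setting, the plain `@` products would be served by the tiled out-of-core GEMM engine:

```python
# One fully connected layer, forward and backward, expressed as GEMMs.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))   # weights: out_dim x in_dim (assumed sizes)
b = rng.standard_normal((64, 1))     # bias
X = rng.standard_normal((128, 32))   # inputs: in_dim x batch

# Forward pass: the core operation W @ X is a single GEMM.
Z = W @ X + b
Y = relu(Z)

# Backward pass: weight and input gradients are also GEMMs.
dY = rng.standard_normal(Y.shape)    # upstream gradient
dZ = dY * (Z > 0)                    # ReLU derivative
dW = dZ @ X.T                        # GEMM: out_dim x in_dim
dX = W.T @ dZ                        # GEMM: in_dim x batch
db = dZ.sum(axis=1, keepdims=True)

assert dW.shape == W.shape and dX.shape == X.shape
```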

Experimental evaluation is conducted on two hardware configurations. The first uses two heterogeneous GPUs (an RTX 2080 Ti with 11 GB and a GTX 1070 with 8 GB); the second uses four homogeneous RTX 3090 GPUs (each 24 GB). The benchmark model is a 100‑layer fully‑connected network with 4096 neurons per layer, amounting to roughly 160 million parameters and a memory demand of over 64 GB. In a single‑GPU setting the model cannot be loaded, whereas the proposed out‑of‑core method successfully trains the model on all configurations.

Performance results show near‑linear scaling with the number of GPUs. On the heterogeneous pair, the RTX 2080 Ti processes about 62% of the total tiles while the GTX 1070 handles the remaining 38%, achieving a 3.9× speed‑up compared with the fastest single‑GPU baseline. When compared against a naïve multi‑GPU implementation that simply invokes cuBLAS on each device without the out‑of‑core tiling or dynamic scheduling, the proposed method delivers an average 1.8× improvement (up to 2.3× in the best case). The authors also report that the algorithm reduces peak GPU memory usage to roughly the size of a single tile, which is on the order of $\sqrt{N}$ for an $N \times N$ matrix, thereby enabling training of models that would otherwise be impossible.
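The memory claim is easy to sanity-check: resident GPU memory per operand depends only on the tile side, not on the full matrix. The sizes below are illustrative assumptions, not the paper's configuration:

```python
# Per-operand resident memory is tile_side^2 elements regardless of the
# full N x N matrix size, which is what makes out-of-core GEMM possible.
def tile_mem_bytes(tile_side, dtype_bytes=4):
    return tile_side * tile_side * dtype_bytes

N, T = 65536, 1024                  # illustrative sizes, fp32
full_bytes = N * N * 4              # 16 GiB per full operand
per_tile = tile_mem_bytes(T)        # 4 MiB resident per tile
assert full_bytes == 16 * 2**30
assert per_tile == 4 * 2**20
```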

The discussion acknowledges that tile size selection is critical; sub‑optimal tiles can increase overhead, especially for small networks where the cost of streaming dominates. The current design is optimized for PCIe interconnects, but the authors note that systems equipped with NVLink or AMD Infinity Fabric could further diminish communication latency and improve scalability.
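The tile-size trade-off can be made concrete with a rough per-tile cost model. All throughput figures below are assumptions for illustration (10 TFLOP/s of GEMM throughput, 12 GB/s of PCIe bandwidth, fp32 operands):

```python
# Rough per-tile cost model: GEMM compute time for a T x T tile versus the
# time to stream its operands over the interconnect (assumed throughputs).
def tile_costs(T, flops=10e12, pcie=12e9, dtype_bytes=4):
    compute = 2 * T ** 3 / flops                # ~2*T^3 FLOPs per tile GEMM
    transfer = 3 * T ** 2 * dtype_bytes / pcie  # stream A, B in and C out
    return compute, transfer

# Small tiles are transfer-bound (streaming dominates, the GPU idles);
# large tiles amortize the transfer and become compute-bound.
c_small, t_small = tile_costs(256)
c_large, t_large = tile_costs(8192)
assert t_small > c_small and c_large > t_large
```

Because compute grows as $T^3$ while transfer grows as $T^2$, larger tiles eventually hide the streaming cost entirely, which is why sub-optimal (too small) tiles hurt most on small networks.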

In conclusion, the paper demonstrates that an out‑of‑core, dynamically scheduled multi‑GPU matrix multiplication engine can effectively remove both memory and compute bottlenecks in large‑scale ANN training. The approach yields linear speed‑up across heterogeneous GPUs, expands the feasible size of trainable models, and integrates seamlessly with existing deep‑learning frameworks. Future work includes extending the autotuning mechanism, exploring distributed‑cluster deployments, and applying the technique to non‑dense architectures such as Transformers and graph neural networks.

