Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
Convolution is a fundamental operation in many applications, including computer vision, natural language processing, and image processing. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating convolution operations. However, maximally exploiting the available memory bandwidth of GPUs for convolution is a challenging task. This paper introduces a general model to address the mismatch between the memory bank width of GPUs and the computation data width of threads. Based on this model, we develop two convolution kernels, one for the general case and the other for a special case with one input channel. By carefully optimizing memory access patterns and computation patterns, we design a communication-optimized kernel for the special case and a communication-reduced kernel for the general case. Experimental data based on implementations on Kepler GPUs show that our kernels achieve 5.16X and 35.5% average performance improvement over the latest cuDNN library, for the special case and the general case, respectively.
💡 Research Summary
The paper addresses the critical issue of memory bandwidth utilization in direct convolution kernels on NVIDIA Kepler GPUs, where a mismatch often exists between the shared memory (SM) bank width (WSMB) and the per‑thread computation data width (WCD). On Kepler devices the SM bank width is 8 bytes, while most convolution implementations operate on 4‑byte floats, resulting in n = 2 (WSMB = n·WCD). This mismatch leads to bank conflicts: two threads accessing the same bank must be serialized, halving the effective SM bandwidth. The authors propose a general model that aligns WCD with WSMB by having each thread process n basic elements as a single unit, using vector types such as float2 or float4. This alignment eliminates bank conflicts and yields an n‑fold increase in SM bandwidth.
Two kernels are built on this model. The first targets the special case of a single input channel (C = 1), which appears in the first layer of CNNs for grayscale images and many classic image‑processing tasks. The image is partitioned into blocks of size H × W. Each block is assigned to a thread block (TB). Within a TB, W threads cooperatively load a row of the block (including halo pixels) into shared memory, then each thread loads its required pixels into registers. Horizontal data sharing is achieved through shared memory, while vertical sharing is realized by keeping the necessary pixels in registers across successive rows. Because each pixel of a block is read from global memory (GM) exactly once (except for a small fraction of halo pixels), the kernel approaches the theoretical lower bound on GM traffic, making it “communication‑optimal”. To handle the n = 2 mismatch on Kepler, each thread processes two contiguous output pixels per iteration, reading and writing n elements at a time with vector loads/stores. This reduces the number of active threads per TB by a factor of n and requires only a modest increase in register usage (O(K·(n‑1))) to store the extra pixels needed for the overlapping convolutions.
The second kernel handles the general multi‑channel case (C > 1). Here a single convolution cannot keep all required K × K × C input values in registers, so the computation is split across multiple steps. The kernel loads K rows of the block into shared memory, then iteratively prefetches the next row while computing the current one, overlapping computation and memory transfers. Filters, being small, are placed in constant memory, allowing all threads in a warp to broadcast the same filter values without contention. The kernel again aligns WCD with WSMB by processing n pixels per thread using vector types, ensuring conflict‑free shared memory accesses and coalesced global memory traffic. Accumulation across channels is performed in registers, and after processing all channels the final results are written back to GM.
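The structure of the multi-channel kernel can be sketched in serial C (an illustrative stand-in, not the paper's code; sizes K = 3, C = 2 are assumed, and the compute/prefetch overlap is necessarily serialized on the host). The sketch keeps K rows of the current channel in row buffers standing in for shared memory, accumulates partial sums across all channels in a local array standing in for registers, and writes the result back only once, mirroring the kernel's single write to GM.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define K 3   /* filter size (assumed) */
#define C 2   /* input channels (assumed) */
#define W 8   /* image width */
#define H 6   /* image height */

/* Compute output row `y` of a C-channel KxK valid convolution.
 * rows[] models the K shared-memory row buffers (in the CUDA kernel,
 * loading row y+K overlaps with computing row y); acc[] models the
 * per-thread registers that accumulate across channels. */
void conv_row(const float *in /* [C][H][W] */,
              const float *f  /* [C][K][K] */,
              int y, float *out_row /* [W-K+1] */) {
    float acc[W - K + 1];
    memset(acc, 0, sizeof acc);
    for (int ch = 0; ch < C; ch++) {
        const float *img = in + ch * H * W;
        const float *flt = f + ch * K * K;
        float rows[K][W];                       /* "shared memory" buffers */
        for (int r = 0; r < K; r++)             /* stage K rows of this channel */
            memcpy(rows[r], img + (y + r) * W, sizeof rows[r]);
        for (int x = 0; x <= W - K; x++)        /* accumulate in "registers" */
            for (int r = 0; r < K; r++)
                for (int c = 0; c < K; c++)
                    acc[x] += rows[r][c + x] * flt[r * K + c];
    }
    memcpy(out_row, acc, sizeof acc);           /* single write-back to "GM" */
}
```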
Experimental evaluation on a Kepler K40m GPU compares the proposed kernels against the state‑of‑the‑art cuDNN library (which uses a GEMM‑based approach with runtime‑generated sub‑blocks). For the single‑channel case the new kernel achieves a 5.16× speedup; for the multi‑channel case it yields an average 35.5% performance improvement across a range of image sizes, filter sizes, and channel counts. The authors also demonstrate that matching WCD to WSMB recovers up to 36% of execution time lost in a baseline GEMM kernel (MAGMA) on Kepler, confirming the importance of the bank‑width model.
In summary, the paper makes three key contributions: (1) a formal model describing the impact of SM bank width versus computation data width on GPU convolution performance; (2) a communication‑optimal kernel for the single‑channel case that minimizes global memory traffic by exploiting both inter‑thread (shared memory) and intra‑thread (register) data reuse; and (3) a communication‑reduced kernel for the general multi‑channel case that eliminates bank conflicts and overlaps memory transfers with computation. The results show that careful alignment of data widths and memory access patterns can substantially improve direct convolution performance on bandwidth‑limited GPUs, reaffirming the relevance of direct convolution as a competitive alternative to GEMM, FFT, or Winograd‑based methods, especially when memory usage and implementation complexity are concerns.