GPU sample sort
In this paper, we present the design of a sample sort algorithm for manycore GPUs. Despite being one of the most efficient comparison-based sorting algorithms for distributed memory architectures, its performance on GPUs was previously unknown. For uniformly distributed keys, our sample sort is at least 25% and on average 68% faster than the best comparison-based sorting algorithm, GPU Thrust merge sort, and on average more than 2 times faster than GPU quicksort. Moreover, for 64-bit integer keys it is at least 63% and on average 2 times faster than the highly optimized GPU Thrust radix sort that directly manipulates the binary representation of keys. Our implementation is robust to different distributions and entropy levels of keys and scales almost linearly with the input size. These results indicate that multi-way techniques in general, and sample sort in particular, achieve substantially better performance than two-way merge sort and quicksort.
💡 Research Summary
This paper presents a comprehensive design and evaluation of a sample‑sort algorithm tailored for many‑core GPUs. Sample sort, a multi‑way comparison‑based sorting technique traditionally used in distributed‑memory systems, had not been explored on GPUs, where two‑way merge sort, quicksort, and radix sort dominate. The authors adapt the classic algorithm to the GPU’s SIMD (SIMT) execution model and memory hierarchy by dividing the input into small blocks, having each warp extract local samples, and aggregating these samples into a globally sorted pivot array. The pivots approximate data quantiles, enabling each thread block to compute its assigned partition boundaries using shared‑memory scans without costly atomic operations. Each partition is then sorted independently with an efficient intra‑block sorting kernel (e.g., bitonic or small‑scale merge sort). To maintain load balance across diverse data distributions, the implementation dynamically adjusts the number of samples relative to input size and introduces a “re‑sampling” step that refines pivots when partition size variance exceeds a threshold.
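The phases described above (sampling, pivot selection, multi-way partitioning, per-bucket sorting) can be illustrated with a sequential host-side sketch. This is not the paper's CUDA implementation — the GPU version classifies keys in parallel per thread block and uses shared-memory scans — and the parameter names (`k`, `oversample`) and constants here are illustrative choices, not the authors' tuning:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Sequential sketch of the sample-sort phases:
// sample -> pivots -> multi-way partition -> per-bucket sort.
std::vector<int> sample_sort(std::vector<int> data, int k = 8) {
    if (static_cast<int>(data.size()) <= k) {
        std::sort(data.begin(), data.end());
        return data;
    }
    // 1. Draw an oversampled random sample and sort it.
    const int oversample = 4;
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, data.size() - 1);
    std::vector<int> sample;
    for (int i = 0; i < k * oversample; ++i) sample.push_back(data[pick(rng)]);
    std::sort(sample.begin(), sample.end());

    // 2. Choose k-1 equidistant pivots; these approximate the k-quantiles.
    std::vector<int> pivots;
    for (int i = 1; i < k; ++i) pivots.push_back(sample[i * oversample]);

    // 3. Multi-way partition: binary-search each key against the pivots
    //    (on the GPU this classification runs in parallel per thread block).
    std::vector<std::vector<int>> buckets(k);
    for (int x : data) {
        size_t b = std::upper_bound(pivots.begin(), pivots.end(), x) - pivots.begin();
        buckets[b].push_back(x);
    }

    // 4. Sort each bucket independently and concatenate.
    std::vector<int> out;
    for (auto& bkt : buckets) {
        std::sort(bkt.begin(), bkt.end());
        out.insert(out.end(), bkt.begin(), bkt.end());
    }
    return out;
}
```

Because the pivots are only approximate quantiles, bucket sizes vary with the sample quality — which is exactly the imbalance the oversampling and re-sampling mechanisms are meant to control.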
Implementation details include careful tuning of block and warp sizes, exploitation of coalesced global memory accesses, and limiting overall memory overhead to roughly 1.2× the input size, ensuring compatibility with contemporary GPU memory capacities. The algorithm is implemented in CUDA 10+ and runs on modern NVIDIA GPUs.
Experimental evaluation covers a wide spectrum of key distributions (uniform, normal, exponential) and entropy levels, testing both 32-bit and 64-bit integer keys across input sizes from 2²⁰ to 2²⁸ elements. Results show that for uniformly distributed keys the GPU sample sort outperforms the best comparison-based GPU sort (Thrust merge sort) by at least 25% and on average 68%. Compared with GPU quicksort, it is on average more than twice as fast. For 64-bit integer keys, the sample sort beats the highly optimized Thrust radix sort by at least 63% and on average by a factor of two, reaching up to a 200% speedup in the best cases. Moreover, the algorithm scales almost linearly with input size, confirming its excellent scalability.
The analysis attributes these gains to three main factors: (1) multi‑way partitioning reduces the depth of recursion and the number of synchronization points; (2) the pivot selection and re‑sampling mechanisms keep partition sizes balanced even under skewed distributions; and (3) the use of shared‑memory scans yields contiguous memory accesses, minimizing global memory traffic. The authors also discuss limitations, noting that extremely low‑entropy data with many duplicate keys can cause temporary imbalance, which is mitigated by the re‑sampling step.
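Factor (1) can be checked with back-of-the-envelope arithmetic: a k-way split needs roughly log_k levels to reach buckets small enough for the base-case sort, versus log_2 levels for two-way schemes. The constants below (n = 2²⁸ keys, k = 128, leaf buckets of 2¹² keys) are illustrative choices, not the paper's actual tuning:

```cpp
#include <cstdint>

// Number of partitioning levels needed to shrink n keys down to
// buckets of at most `leaf` keys, splitting k ways per level,
// i.e. ceil(log_k(n / leaf)).
int levels(uint64_t n, uint64_t k, uint64_t leaf) {
    int depth = 0;
    while (n > leaf) {
        n = (n + k - 1) / k;  // ceiling division: largest possible bucket
        ++depth;
    }
    return depth;
}
```

With these numbers, a 128-way sample sort needs only `levels(1ull << 28, 128, 1ull << 12)` = 3 partitioning passes, while a two-way scheme (`k = 2`) needs 16 — each pass being a full sweep over global memory with its own synchronization, which is where the reduced depth pays off.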
In conclusion, the study demonstrates that multi‑way sorting techniques, and sample sort in particular, can deliver substantially superior performance on GPUs compared to traditional two‑way merge sort, quicksort, and even radix sort. The paper suggests future work extending the approach to floating‑point and string keys, as well as integrating the algorithm with newer GPU features such as Tensor Cores for further acceleration.