Deterministic Sample Sort For GPUs
We present and evaluate GPU Bucket Sort, a parallel deterministic sample sort algorithm for many-core GPUs. Our method is considerably faster than Thrust Merge (Satish et al., Proc. IPDPS 2009), the best comparison-based sorting algorithm for GPUs, and it is as fast as the new randomized sample sort for GPUs by Leischner et al. (to appear in Proc. IPDPS 2010). Our deterministic sample sort has the advantage that bucket sizes are guaranteed, and therefore its running time does not show the input-data-dependent fluctuations that can occur for randomized sample sort.
💡 Research Summary
The paper introduces GPU Bucket Sort, a deterministic sample-sort algorithm specifically engineered for many-core graphics processing units. Traditional GPU sorting methods fall into two categories: comparison-based approaches such as Thrust's merge sort, and randomized sample-sort techniques exemplified by Leischner et al. While the former offers predictable behavior, it suffers from high memory traffic and limited scalability; the latter achieves excellent average throughput but can exhibit large runtime fluctuations because bucket sizes depend on the random sample drawn from the input data. GPU Bucket Sort eliminates this uncertainty by using a deterministic sampling scheme that guarantees bounded bucket sizes regardless of the data distribution.
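To make the bounded-bucket claim concrete, here is a small CPU sketch (not the paper's code) of PSRS-style regular sampling: each of p locally sorted blocks contributes p evenly spaced samples, and splitters drawn from the pooled samples provably keep every bucket below roughly 2n/p keys, whatever the input order. The function names, the parameter values, and the exact sampling ratios are illustrative choices, not the paper's:

```python
import bisect

def regular_splitters(data, p):
    """Sort p equal-sized blocks, take p evenly spaced samples per block,
    and pick p - 1 splitters from the pooled sorted samples (PSRS-style)."""
    m = len(data) // p
    blocks = [sorted(data[i * m:(i + 1) * m]) for i in range(p)]
    samples = sorted(b[j * m // p] for b in blocks for j in range(p))
    return [samples[i * p + p // 2] for i in range(1, p)]

def bucket_sizes(data, splitters):
    """Count how many keys fall between consecutive splitters."""
    sizes = [0] * (len(splitters) + 1)
    for x in data:
        sizes[bisect.bisect_right(splitters, x)] += 1
    return sizes

n, p = 4096, 16
data = list(range(n))  # already-sorted keys: a hard case for naive random pivots
sizes = bucket_sizes(data, regular_splitters(data, p))

assert sum(sizes) == n           # every key lands in exactly one bucket
assert max(sizes) <= 2 * n // p  # no bucket exceeds the ~2n/p guarantee
```

Replacing the regular samples with a handful of uniformly random keys removes this guarantee: a skewed draw can concentrate most of the input in one bucket, which is exactly the fluctuation the deterministic scheme avoids.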
The algorithm proceeds in four distinct phases. In the first phase the input array is partitioned into blocks that map naturally onto CUDA thread blocks. Each block performs a local sort using shared-memory primitives (e.g., bitonic sort or warp-level quicksort), thereby exploiting the high bandwidth and low latency of on-chip memory. The second phase extracts a fixed number of evenly spaced samples from each locally sorted block. The total number of samples is chosen as O(P·log P), where P denotes the number of streaming multiprocessors, which ensures that the size of every resulting bucket is provably bounded. In the third phase all samples are gathered in global memory, sorted, and used to define K−1 bucket delimiters that partition the key range into K buckets. Because the sampling is deterministic, the maximum size of any bucket can be bounded analytically, independent of the input distribution. The final phase redistributes every element to its appropriate bucket, launches K independent sorting kernels (one per bucket) on separate CUDA streams, and finally concatenates the sorted buckets to produce the globally sorted output.
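The four phases can be traced end to end with a plain-Python sketch, with block-local `sorted` calls standing in for the shared-memory kernels and a sequential loop standing in for the per-bucket CUDA streams; the sampling ratios and names below are illustrative choices, not the paper's exact parameters:

```python
import bisect

def deterministic_sample_sort(data, p=8):
    """CPU sketch of a four-phase deterministic sample sort."""
    if len(data) < p * p:
        return sorted(data)  # tiny inputs: no point in bucketing

    # Phase 1: partition into p blocks and sort each block locally
    # (on the GPU: one shared-memory sort per thread block)
    blocks = [sorted(data[i::p]) for i in range(p)]

    # Phase 2: extract p evenly spaced samples from every sorted block
    samples = [b[j * len(b) // p] for b in blocks for j in range(p)]

    # Phase 3: sort the pooled samples and pick p - 1 bucket delimiters
    samples.sort()
    splitters = [samples[i * p + p // 2] for i in range(1, p)]

    # Phase 4: scatter each element into its bucket, sort every bucket
    # independently (one kernel per bucket), and concatenate the results
    buckets = [[] for _ in range(p)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    return [x for b in buckets for x in sorted(b)]

assert deterministic_sample_sort(list(range(1000, 0, -1))) == list(range(1, 1001))
```

Because the delimiters come from fixed sample positions rather than random draws, repeated runs on the same input always produce the same buckets, which is the source of the predictable running time.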
Key implementation optimizations are highlighted. First, the algorithm minimizes global-memory traffic by keeping the bulk of the work in shared memory during the local sort. Second, memory accesses are coalesced at the warp level to maximize effective global-memory bandwidth, while shared-memory layouts are arranged to avoid bank conflicts. Third, the authors employ a runtime autotuner that selects block dimensions, sample count, and bucket count based on the input size and the specific GPU architecture (tested on both Fermi and Kepler devices). Fourth, the deterministic nature of the sampling eliminates the need for dynamic load balancing or overflow handling, which are common sources of performance variance in randomized schemes.
Experimental evaluation was conducted on an NVIDIA GTX 480 (Fermi) and a Tesla K20 (Kepler) across a spectrum of input sizes ranging from 2^20 (≈1 M) to 2^28 (≈268 M) elements, and across four data distributions: uniform random, already sorted, reverse sorted, and partially sorted. Compared with Thrust’s merge sort, GPU Bucket Sort achieved speed‑ups of 1.7× to 2.1×, with the greatest advantage appearing for the largest problem sizes where memory bandwidth becomes the dominant bottleneck. When benchmarked against the state‑of‑the‑art randomized sample sort, the deterministic version incurred only a modest 5 % overhead on average, but demonstrated dramatically reduced runtime variance; the randomized algorithm’s performance degraded sharply on inputs that produced highly imbalanced buckets (e.g., already sorted data), whereas the deterministic approach maintained a nearly constant execution time across all distributions.
Scalability analysis showed near‑linear speed‑up as the number of streaming multiprocessors increased, confirming that the algorithm effectively utilizes the massive parallelism offered by modern GPUs. Memory consumption grew proportionally with input size, but remained well within the limits of the tested hardware because bucket sizes are bounded by construction.
The authors claim three primary contributions. (1) They present a deterministic sample‑sort framework that guarantees predictable execution time on GPUs, a property essential for real‑time analytics, database indexing, and other latency‑sensitive workloads. (2) They deliver a highly tuned implementation that outperforms the best existing comparison‑based GPU sort and matches the throughput of the leading randomized sample sort while avoiding its pitfalls. (3) They provide a thorough empirical study that validates the algorithm’s robustness across diverse data patterns and hardware generations, thereby establishing GPU Bucket Sort as a viable universal sorting primitive for many‑core accelerators.
In conclusion, GPU Bucket Sort bridges the gap between deterministic guarantees and high performance on GPUs. By carefully orchestrating local sorting, deterministic sampling, and parallel bucket processing, the method achieves both speed and predictability, making it an attractive candidate for integration into GPU‑accelerated libraries and applications that demand consistent, low‑latency sorting.