Engineering Parallel String Sorting
We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our implementations scale very well on real-world inputs and modern machines.
💡 Research Summary
The paper “Engineering Parallel String Sorting” addresses the lack of practical parallel string‑sorting algorithms for modern multi‑core shared‑memory systems, especially those with non‑uniform memory access (NUMA) characteristics. It begins by reviewing the fundamentals of string sorting, defining the number of strings n, the total input length N, the distinguishing prefix size D, and the longest common prefix (LCP) array, which together yield a lower bound of Ω(D + n log n) on the work required by comparison‑based string sorting. The authors then survey sequential string‑sorting techniques such as multikey quicksort, MSD radix sort, burstsort, and LCP‑aware mergesort, highlighting their memory‑access patterns and cache behavior.
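The quantities above can be made concrete with a small sketch. This is not code from the paper; it is a minimal Python illustration of the standard definitions: the LCP array of a sorted sequence, and D as the sum, over all strings, of the prefix length needed to separate each string from its sorted neighbours.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def lcp_array(sorted_strings):
    """LCP array H of a sorted sequence: H[i] = lcp of element i with
    element i-1, with H[0] = 0 by convention."""
    return [0] + [lcp(sorted_strings[i - 1], sorted_strings[i])
                  for i in range(1, len(sorted_strings))]

def distinguishing_prefix_size(strings):
    """D = total number of characters that must be inspected to establish
    the sorted order: for each string, one character past the longer LCP
    it shares with its two sorted neighbours (capped at its own length)."""
    s = sorted(strings)
    h = lcp_array(s)
    d = 0
    for i, w in enumerate(s):
        left = h[i]
        right = h[i + 1] if i + 1 < len(s) else 0
        d += min(len(w), max(left, right) + 1)
    return d
```

For example, for the input {"abc", "abd", "x"} the sorted LCP array is [0, 2, 0] and D = 3 + 3 + 1 = 7, since "abc" and "abd" must be read to their third character to be told apart.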
Recognizing that contemporary CPUs feature deep cache hierarchies, high branch‑misprediction penalties, and word‑level parallelism, the authors design a new algorithm called Super Scalar String Sample Sort (S⁵). S⁵ adapts the classic sample‑sort framework to strings by (1) classifying strings against splitters using fixed‑length key prefixes packed into machine words, so that full‑string comparisons are largely avoided, (2) choosing a splitter set small enough to fit into cache, and (3) employing an oversampling factor α over k buckets to achieve balanced partitions. Dynamic load balancing distributes recursive sub‑tasks across cores, while branch‑free classification and loop unrolling reduce branch mispredictions and exploit superscalar pipelines. The result is an algorithm that attains the theoretical bound O(D + n log n) while being highly cache‑friendly.
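The sample‑sort skeleton behind S⁵ can be sketched sequentially. The following Python sketch is illustrative only: it keeps the oversampled splitter selection and the 2k+1 buckets of the paper ('less‑than' buckets interleaved with 'equal' buckets, so identical keys land together), but omits the parallelism, the classification tree, and the word‑packed splitter keys that make the real S⁵ fast; the parameter names `k`, `alpha`, and `base_case` are this sketch's, not the paper's.

```python
import bisect
import random

def string_sample_sort(strings, k=4, alpha=2, base_case=32):
    """Sequential sketch of string sample sort with equality buckets."""
    if len(strings) <= base_case:
        return sorted(strings)
    # Oversample by factor alpha, sort the sample, pick k equidistant splitters.
    sample = sorted(random.sample(strings, min(len(strings), alpha * k)))
    step = max(1, len(sample) // k)
    splitters = sample[::step][:k]
    # 2k+1 buckets: bucket 2i holds strings < splitters[i],
    # bucket 2i+1 holds strings == splitters[i], bucket 2k the rest.
    buckets = [[] for _ in range(2 * len(splitters) + 1)]
    for s in strings:
        i = bisect.bisect_left(splitters, s)
        if i < len(splitters) and splitters[i] == s:
            buckets[2 * i + 1].append(s)   # equality bucket
        else:
            buckets[2 * i].append(s)       # less-than (or final) bucket
    out = []
    for idx, b in enumerate(buckets):
        # Equality buckets contain identical strings and need no recursion.
        out.extend(b if idx % 2 else string_sample_sort(b, k, alpha, base_case))
    return out
```

The equality buckets are the string‑specific twist: runs of duplicate keys, common in real text, are finished in one pass instead of being re‑partitioned.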
For NUMA machines, the paper introduces a parallel multi‑way LCP‑aware merge and an LCP‑aware mergesort. The input is first divided into pre‑sorted sub‑sequences, each accompanied by its LCP array. During the multi‑way merge, the algorithm reuses the stored LCP values so that strings sharing a common prefix are read from memory only once, dramatically reducing remote memory traffic. This approach scales almost linearly on multi‑socket systems where remote memory bandwidth is a bottleneck.
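The core trick of LCP‑aware merging can be shown for two runs; the paper's version is multi‑way and parallel, but the binary case already captures how stored LCPs replace character accesses. The sketch below is an illustration under that simplification, not the paper's implementation: whenever the two current heads have different LCPs with the last emitted string, the comparison is decided without touching any characters.

```python
def lcp_merge(a, ha, b, hb):
    """Merge sorted string runs a and b, given their LCP arrays
    (ha[i] = lcp of a[i] with a[i-1], ha[0] = 0; likewise hb).
    Returns the merged run and its LCP array."""
    out, hout = [], []
    i = j = 0
    la = lb = 0  # lcp of a[i] / b[j] with the last string written to out
    while i < len(a) and j < len(b):
        if la > lb:
            take_a = True          # a[i] extends the last output further, so it is smaller
        elif la < lb:
            take_a = False
        else:
            # Tie: compare characters, starting after the shared prefix.
            h, x, y = la, a[i], b[j]
            while h < len(x) and h < len(y) and x[h] == y[h]:
                h += 1
            take_a = h == len(x) or (h < len(y) and x[h] < y[h])
            if take_a:
                lb = h             # loser's lcp is now relative to the winner
            else:
                la = h
        if take_a:
            out.append(a[i]); hout.append(la)
            i += 1
            la = ha[i] if i < len(a) else 0
        else:
            out.append(b[j]); hout.append(lb)
            j += 1
            lb = hb[j] if j < len(b) else 0
    # Drain the remaining run; its stored LCPs carry over unchanged.
    while i < len(a):
        out.append(a[i]); hout.append(la)
        i += 1
        la = ha[i] if i < len(a) else 0
    while j < len(b):
        out.append(b[j]); hout.append(lb)
        j += 1
        lb = hb[j] if j < len(b) else 0
    return out, hout
```

On a NUMA machine this matters because the character comparisons that are skipped would otherwise be reads of remote string data, while the LCP arrays are small and sequentially accessed.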
In addition to S⁵, the authors parallelize two well‑known string‑sorting methods. Their parallel multikey quicksort retains the classic three‑way partitioning but augments it with character caching (keeping the next machine word of each string in a scratch array) and insertion‑sort base cases, achieving better cache locality. Their parallel radix sort processes groups of characters as machine words (e.g., 8‑bit or 16‑bit digits), turning per‑character look‑ups into word‑level operations and thereby improving memory‑access locality. Both algorithms leverage word‑level parallelism, loop unrolling, and cache‑aware bucket sizing.
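The three‑way partitioning that multikey quicksort is built on can be sketched as follows. This is a minimal sequential Python illustration of the classic algorithm (Bentley–Sedgewick style), without the caching, parallelism, or insertion‑sort base cases the paper adds; the invariant is that all strings in one call already agree on their first `depth` characters.

```python
def multikey_quicksort(strings, depth=0):
    """Sequential sketch of multikey quicksort (three-way radix quicksort).
    Partitions on the character at position `depth`; "" acts as an
    end-of-string sentinel that sorts before every real character.
    Only the 'equal' partition advances to depth + 1."""
    if len(strings) <= 1:
        return list(strings)

    def ch(s):
        return s[depth] if depth < len(s) else ""

    pivot = ch(strings[len(strings) // 2])
    lt = [s for s in strings if ch(s) < pivot]
    eq = [s for s in strings if ch(s) == pivot]
    gt = [s for s in strings if ch(s) > pivot]
    # If the pivot is the sentinel, the 'equal' strings have all ended and,
    # by the shared-prefix invariant, are identical: nothing left to sort.
    eq = eq if pivot == "" else multikey_quicksort(eq, depth + 1)
    return multikey_quicksort(lt, depth) + eq + multikey_quicksort(gt, depth)
```

Recursing with depth + 1 only into the equal partition is what gives the method its O(D + n log n) flavor: each character of a distinguishing prefix is examined O(log n) times in expectation, and characters past it never at all.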
The experimental evaluation spans five current multi‑core platforms, covering a range of core counts, cache configurations, and NUMA topologies. Benchmarks include real‑world data sets such as large text corpora, DNA sequences, and synthetic random strings. Results show that on single‑socket machines S⁵ outperforms the authors’ own parallel multikey quicksort and radix sort by factors of 1.8–2.5. On NUMA systems, the parallel multi‑way LCP‑aware merge delivers the highest speedups, up to four times faster than competing methods, while keeping memory usage linear in the number of strings. Cache‑miss rates and remote‑memory accesses are significantly reduced across all tests.
The paper concludes that by tightly integrating algorithmic design with hardware‑aware optimizations—cache‑friendly splitter selection, LCP reuse, dynamic load balancing, and word‑level parallelism—string sorting can be made to scale efficiently on today’s many‑core and NUMA architectures. Future work is suggested in areas such as external‑memory extensions, GPU/FPGA acceleration, and adaptive tuning of oversampling and bucket parameters for dynamic workloads. This work sets a new practical benchmark for parallel string sorting and provides a solid foundation for further research and industrial adoption.