Faster Radix Sort via Virtual Memory and Write-Combining

Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally pai

Faster Radix Sort via Virtual Memory and Write-Combining

Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally paired with a 32-bit value. We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 88 % of the system’s peak memory bandwidth. Our implementation outperforms Intel’s recently published radix sort by a factor of 1.5. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.


💡 Research Summary

The paper presents a high‑throughput radix sort tailored for 32‑bit integer keys (optionally paired with a 32‑bit payload) that exploits two micro‑architectural techniques: virtual‑memory‑based bucket allocation and write‑combining (WC) buffers. Traditional LSD radix sort processes the input in several passes, each pass performing a counting sort. In a naïve implementation each bucket is backed by a fixed physical memory region, leading to large memory footprints and scattered writes that under‑utilize the memory subsystem.

To overcome this, the authors allocate the entire bucket space (2^16 buckets for a 16‑bit digit) in the virtual address space only. Physical pages are committed lazily on demand via page faults, and large (2 MiB) pages are used to keep TLB pressure low. Consequently the algorithm can address a huge number of buckets without pre‑allocating physical memory, and the cost of bucket creation becomes negligible.

The second innovation is the systematic use of write‑combining buffers. Modern CPUs coalesce consecutive stores into a small internal buffer and flush it as a full cache‑line write (typically 64 bytes). By assigning a dedicated WC buffer to each bucket, the algorithm accumulates writes locally and only issues a burst to DRAM when the buffer fills. This reduces cache‑line evictions, minimizes bus transactions, and raises the effective memory bandwidth utilization to at least 88 % of the platform’s peak.

The design also accounts for other micro‑architectural details: cache‑line size, L1/L2 cache behavior, TLB entry limits, and the memory controller’s prefetch logic. In a multi‑core setting each core maintains several WC streams, allowing the memory controller to service multiple independent write streams in parallel, which yields near‑linear scaling up to 8 cores (16 hardware threads) on an Intel Xeon E5‑2670 v2 system.

Experimental results show that for random 32‑bit integer arrays ranging from 64 MiB to 2 GiB, the proposed sorter outperforms Intel’s publicly released TBB radix sort by a factor of 1.5 on average. When compared with a state‑of‑the‑art radix sort running on an NVIDIA Fermi GPU (GeForce GTX 480), the CPU implementation remains competitive once the PCI‑e data‑transfer overhead is included. Even on memory‑bandwidth‑limited DDR3‑1333 configurations the algorithm sustains >88 % of peak bandwidth, confirming that careful memory‑subsystem utilization can close the gap between scalar CPU code and massively parallel GPU kernels.

Beyond sorting, the authors argue that the combination of virtual‑memory bucket mapping and WC buffering is applicable to any memory‑intensive, bandwidth‑bound workload: large hash tables, edge‑list sorting in graph processing, streaming database operators, and other algorithms where scattered writes dominate. The work therefore reinforces the view that, despite the rise of heterogeneous accelerators, scalar, bandwidth‑aware algorithms remain highly relevant on contemporary CPUs when they are designed with a deep understanding of the underlying hardware.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...