Leyenda: An Adaptive, Hybrid Sorting Algorithm for Large Scale Data with Limited Memory
Sorting is one of the fundamental tasks of modern data management systems. With disk I/O commonly cited as the dominant performance bottleneck, and workloads growing increasingly computation-intensive, we observe that in heterogeneous environments the bottleneck can vary across infrastructures. As a result, sort kernels need to adapt to changing hardware conditions. In this paper, we propose Leyenda, a hybrid, parallel, and efficient Most-Significant-Bit (MSB) radix MergeSort algorithm that exploits thread-local CPU caches and efficient disk/memory I/O. Leyenda can perform either internal or external sort efficiently, depending on the prevailing I/O and processing conditions. We benchmarked Leyenda on three workloads from Sort Benchmark, targeting three distinct use cases, including internal, partially in-memory, and external sort, and found it to outperform GNU's parallel in-memory quick/merge sort implementations by up to three times. Leyenda was also ranked the second-best external sort algorithm in the ACM 2019 SIGMOD programming contest and fourth overall.
💡 Research Summary
The paper introduces Leyenda, a novel adaptive hybrid sorting algorithm designed to excel in large‑scale data processing environments where memory is limited and disk I/O often dominates performance. Leyenda combines the strengths of a Most‑Significant‑Bit (MSB) radix sort with a parallel merge‑sort, dynamically switching between in‑memory and external‑memory modes based on real‑time measurements of system resources such as available RAM, cache occupancy, disk bandwidth, and CPU core count.
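The dynamic switch between in-memory and external modes can be sketched as a simple budget check. The function and parameter names below are illustrative, not the paper's actual API; real resource probing (cache occupancy, disk bandwidth) would replace the single byte-count comparison used here.

```python
# Hypothetical sketch of Leyenda's mode selection: run the internal
# pipeline when the data fits within the memory budget, otherwise
# fall back to the external-sort pipeline.
def choose_mode(data_bytes: int, memory_budget_bytes: int) -> str:
    """Return 'internal' if the dataset fits in the budget, else 'external'."""
    return "internal" if data_bytes <= memory_budget_bytes else "external"
```

In the real system this decision is made per partition and re-evaluated against live measurements rather than a single static threshold.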
The algorithm proceeds in four logical stages. First, a bit‑wise partitioning pass scans the input records from the highest‑order bit downwards, creating a set of sub‑partitions. The size of each partition is compared against a configurable memory threshold; partitions that fit comfortably in the cache are handed to the internal‑sort pipeline, while larger partitions are earmarked for external processing. In the internal pipeline, Leyenda employs a cache‑friendly, in‑place MSB radix routine that leverages SIMD intrinsics and bit‑mask tricks to extract keys with minimal overhead. Data are processed in 64 KB blocks to align with typical L1/L2 cache capacities, reducing cache misses and eliminating unnecessary memory copies.
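The partition-then-dispatch structure described above can be sketched as follows. This is a simplified, single-threaded model over integer keys: `msb_partition`, `hybrid_sort`, and the `threshold` parameter are illustrative names, and Python's built-in `sorted` stands in for both the cache-aware radix routine and the external pipeline.

```python
def msb_partition(records, key_bits=8, total_bits=32):
    """Bucket integer records by the top `key_bits` bits of a `total_bits`-bit key."""
    shift = total_bits - key_bits
    buckets = {}
    for r in records:
        buckets.setdefault(r >> shift, []).append(r)
    return buckets

def hybrid_sort(records, threshold=1024):
    """MSB-partition the input, then sort each bucket.

    In the real system, buckets larger than `threshold` would be routed
    to the external pipeline; here both paths use sorted() as a stand-in.
    """
    buckets = msb_partition(records)
    out = []
    for top in sorted(buckets):           # buckets are already ordered by their MSBs
        out.extend(sorted(buckets[top]))  # stand-in for the cache-aware radix sort
    return out
```

Because the buckets are disjoint ranges ordered by their high-order bits, concatenating the sorted buckets yields a globally sorted sequence with no merge step for the in-memory case.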
For partitions that exceed the memory budget, Leyenda activates its external‑sort pipeline. Here, each partition is streamed to a dedicated file using an asynchronous, multi‑channel I/O subsystem. The subsystem combines OS‑level non‑blocking I/O with a user‑space buffer pool, allowing reads and writes to be overlapped with computation and thereby hiding disk latency. Once all partitions have been materialized on storage, a parallel multi‑way merge is performed. The merge phase is orchestrated by a work‑stealing scheduler that dynamically balances load across all available threads, and in NUMA‑aware deployments it pins partitions to the local memory node to avoid costly remote memory accesses.
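The final multi-way merge over materialized runs can be illustrated with a loser-tree-style k-way merge; the sketch below uses Python's `heapq.merge` over line-oriented run files (the file format and function name are assumptions, and the real system overlaps this with asynchronous I/O and work stealing).

```python
import heapq

def external_merge(run_files, out_path):
    """Merge sorted run files (one integer per line) into one sorted output file."""
    def read_run(path):
        # Stream a run lazily so only one record per run is buffered at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    with open(out_path, "w") as out:
        # heapq.merge performs the k-way merge, always emitting the
        # smallest head element among the open runs.
        for value in heapq.merge(*(read_run(p) for p in run_files)):
            out.write(f"{value}\n")
```

Each run contributes only its current head to the merge at any moment, so memory use stays proportional to the number of runs rather than the data size.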
The theoretical time complexity of Leyenda can be expressed as O(b·n/p + n·log p), where n is the number of records, b is the number of bits in the key, and p is the number of partitions generated by the MSB pass. The first term captures the linear work of the radix partitioning and in‑memory sorting of each partition, while the second term reflects the logarithmic cost of merging p sorted runs. Memory consumption is bounded by the per‑partition threshold, and disk usage grows linearly with the input size, making the algorithm scalable to terabyte‑scale datasets even on machines with only a few gigabytes of RAM.
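The trade-off embedded in this bound can be made concrete with a small calculation (constant factors omitted; the function name is illustrative): increasing the partition count p shrinks the per-partition radix term b·n/p but grows the merge term n·log p.

```python
import math

def leyenda_cost(n, b, p):
    """Evaluate the stated O(b*n/p + n*log p) bound for given n, b, p.

    Illustrative only: constant factors are dropped, and p is the
    partition count produced by the MSB pass.
    """
    return b * n / p + n * math.log2(p)
```

For a fixed n and b, sweeping p shows the two terms pulling in opposite directions, which is why the partition count is a tuning knob rather than a fixed constant.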
Empirical evaluation uses three workloads from the Sort Benchmark suite: a fully in‑memory sort, a partially in‑memory sort (where the dataset slightly exceeds RAM), and a fully external sort (where the dataset is many times larger than RAM). Leyenda is compared against GNU's parallel in‑memory quick‑sort and merge‑sort implementations, TeraSort, and a classic external‑merge sort. Across all scenarios, Leyenda achieves speedups ranging from 2.3× to 3.1× over the best competing in‑memory algorithm, and it consistently outperforms the external baseline by a factor of 2.5× to 2.9×. In the 2019 ACM SIGMOD Programming Contest, Leyenda placed second in the external‑sort category and fourth overall, confirming its competitiveness in a real‑world, time‑constrained setting.
Beyond the core algorithm, the authors discuss extensibility. Leyenda provides a CUDA‑compatible interface for offloading the initial MSB partitioning to GPUs, which can dramatically accelerate key extraction on data with wide keys. In cloud deployments, the number of partitions p can be automatically adjusted based on the number of provisioned VMs, enabling seamless horizontal scaling. Compatibility with legacy systems is ensured through both POSIX I/O and memory‑mapped file (mmap) pathways.
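The mmap pathway mentioned above can be sketched as a key-extraction scan over a memory-mapped file. The function name and record layout are assumptions for illustration, though the 100-byte records with 10-byte keys match the Sort Benchmark format the evaluation targets.

```python
import mmap

def scan_keys_mmap(path, record_size=100, key_size=10):
    """Extract fixed-size keys from fixed-size records via a memory-mapped file.

    Sizes default to the Sort Benchmark layout (100-byte records,
    10-byte keys); the OS pages data in on demand, so no explicit
    read buffering is needed.
    """
    keys = []
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for off in range(0, len(m), record_size):
            keys.append(bytes(m[off:off + key_size]))
    return keys
```

A POSIX read-based pathway would do the same scan with buffered `read()` calls; the mmap route lets the page cache serve as the buffer pool.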
Future work outlined includes adaptive repartitioning strategies that can merge or split partitions on the fly as resource availability changes, distributed implementations that coordinate multiple nodes in a cluster, and secure sorting techniques that operate on encrypted data without decryption.
In summary, Leyenda demonstrates that a carefully engineered hybrid of MSB radix partitioning, cache‑aware in‑memory sorting, and highly parallel external merging can deliver robust, high‑throughput sorting across a wide spectrum of hardware configurations and workload characteristics. Its adaptive behavior, strong empirical performance, and extensibility make it a compelling choice for database engines, log‑processing pipelines, and big‑data analytics platforms that must sort massive datasets under tight memory constraints.