Scalable Distributed-Memory External Sorting

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We engineer algorithms for sorting huge data sets on massively parallel machines. The algorithms are based on the multiway merging paradigm. We first outline an algorithm whose I/O requirement is close to a lower bound. Thus, in contrast to naive implementations of multiway merging and all other approaches known to us, the algorithm works with just two passes over the data even for the largest conceivable inputs. A second algorithm reduces communication overhead and uses a more conventional specification of the result at the cost of slightly increased I/O requirements. An implementation wins the well-known sorting benchmark in several categories and by a large margin over its competitors.


💡 Research Summary

The paper presents two novel algorithms for sorting massive data sets on distributed‑memory machines, targeting the external‑sorting regime where data far exceeds aggregate main memory. Building on the classic multi‑way merge paradigm, the authors first derive an I/O‑optimal scheme that approaches the theoretical lower bound. The algorithm proceeds in two passes: in the first pass the input is evenly partitioned across all processors, each partition is locally sorted, and the resulting runs are written back to disk together with metadata describing run boundaries. In the second pass a global multi‑way merge is performed, but crucially the merge order is predetermined by a sampling‑based partitioning step that guarantees each processor merges only the runs belonging to its assigned key range. This design eliminates the need for additional passes or excessive random disk accesses, achieving near‑optimal I/O performance even for terabyte‑scale inputs.
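The two-pass structure described above can be illustrated with a minimal in-memory Python sketch. This is not the paper's implementation: the function name, the sampling rate, and the use of Python lists in place of disk-resident runs are all illustrative assumptions. It shows only the control flow: pass 1 builds sorted runs and a sample, sampling-based splitters fix each processor's key range, and pass 2 lets each "processor" merge only the run segments inside its range.

```python
import bisect
import heapq

def two_pass_sort(chunks, num_procs, samples_per_run=4):
    # Pass 1: sort each input chunk into a run and draw a regular sample.
    runs, sample = [], []
    for chunk in chunks:
        run = sorted(chunk)
        runs.append(run)
        step = max(1, len(run) // samples_per_run)
        sample.extend(run[::step])
    # Sampling-based partitioning: p-1 splitters predetermine each
    # processor's key range before the merge begins.
    sample.sort()
    splitters = [sample[(i + 1) * len(sample) // num_procs]
                 for i in range(num_procs - 1)]
    bounds = [None] + splitters + [None]
    # Pass 2: each "processor" merges only the run segments that fall
    # into its assigned key range (no extra passes, no random probing
    # beyond locating the segment boundaries).
    output = []
    for p in range(num_procs):
        lo, hi = bounds[p], bounds[p + 1]
        segments = []
        for run in runs:
            start = 0 if lo is None else bisect.bisect_left(run, lo)
            end = len(run) if hi is None else bisect.bisect_left(run, hi)
            segments.append(run[start:end])
        output.extend(heapq.merge(*segments))
    return output
```

In the real algorithm the runs live on disk and the segment boundaries come from the metadata written alongside each run; the in-memory `bisect` calls stand in for those lookups.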

The second algorithm trades a modest increase in I/O for a substantial reduction in inter‑processor communication. Instead of exchanging full partition information, the system adopts a fixed “regular partition schema” known in advance. After local sorting, each node redistributes its data according to this schema, sending only the minimal amount of data required to satisfy the global ordering. The extra I/O stems from writing and rereading the locally sorted runs, but the communication volume drops dramatically, which is especially beneficial on clusters where network bandwidth is a limiting factor.
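The fixed-schema exchange can be sketched as follows. Again this is a hedged illustration, not the paper's code: the splitter keys are assumed to be agreed on in advance (standing in for the "regular partition schema"), and Python lists stand in for the network transfers. The point it demonstrates is that because each run is already sorted, every destination receives exactly one contiguous slice per run, so no partition metadata needs to be exchanged.

```python
import bisect
import heapq

def redistribute_and_merge(sorted_runs, boundaries):
    """Fixed-schema exchange: `boundaries` holds p-1 splitter keys known
    to all nodes in advance; node i owns keys below boundaries[i], and
    the last node owns the rest."""
    p = len(boundaries) + 1
    inbox = [[] for _ in range(p)]
    for run in sorted_runs:
        # A sorted run splits into one contiguous slice per destination,
        # so each node ships only the minimal data required for the
        # global ordering.
        cuts = ([0] + [bisect.bisect_left(run, b) for b in boundaries]
                + [len(run)])
        for dest in range(p):
            inbox[dest].append(run[cuts[dest]:cuts[dest + 1]])
    # Each node merges its incoming (already sorted) slices locally.
    return [list(heapq.merge(*slices)) for slices in inbox]
```

In an MPI implementation this exchange would map naturally onto a variable-count all-to-all, with the local merge consuming the received slices as they arrive.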

Implementation details are provided for a C++/MPI prototype that runs on both SSD‑ and HDD‑based storage and leverages high‑speed Ethernet. Experimental evaluation on a 256‑node cluster with a 1 TB input demonstrates that the I/O‑optimal algorithm reduces disk passes by 30 % compared with state‑of‑the‑art external sorters such as TeraSort and Hadoop Sort, while the communication‑optimized variant cuts network traffic by more than 50 % and further shortens overall runtime. Both variants achieve top rankings on the well‑known Sort Benchmark across multiple categories (integer sort, string sort, composite‑key sort), often by a large margin over competing implementations.

The authors discuss scalability, noting that the two‑pass approach scales linearly with the number of processors as long as the per‑processor memory can hold a partition’s worth of data. The communication‑aware version is particularly suited to cloud environments or data centers with constrained interconnects. Moreover, the sampling‑based partitioning and regular schema concepts are portable to other distributed data‑processing tasks such as parallel hash joins or large‑scale graph algorithms.
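The memory condition mentioned above can be made concrete with a back-of-envelope check. This sketch is not taken from the paper; it encodes the classical external-sorting constraint that two passes suffice only while the pass-2 merge fan-in (buffers that fit in per-processor memory) covers the number of runs produced in pass 1.

```python
def two_passes_feasible(n_bytes, num_procs, mem_per_proc, block_bytes):
    """Illustrative feasibility check: pass 1 yields runs of roughly
    mem_per_proc bytes, so each processor must merge about
    n_bytes / (num_procs * mem_per_proc) runs in pass 2, with a
    fan-in limited to mem_per_proc / block_bytes merge buffers."""
    runs_per_proc = n_bytes / (num_procs * mem_per_proc)
    fan_in = mem_per_proc / block_bytes
    return fan_in >= runs_per_proc
```

For example, 1 TB across 256 processors with 1 GiB of memory each and 1 MiB I/O blocks leaves a fan-in of 1024 against only a handful of runs per processor, so two passes are comfortably sufficient at that scale.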

In conclusion, the work establishes a new performance baseline for external sorting on massively parallel systems by showing that, with careful partitioning and a disciplined two‑pass design, the I/O lower bound can be approached without sacrificing scalability. Future research directions include extending the techniques to heterogeneous storage hierarchies (e.g., NVMe + HDD), handling skewed or dynamic workloads, and integrating the algorithms into broader big‑data processing frameworks.

