Optimization and analysis of large scale data sorting algorithm based on Hadoop

When dealing with massive data sorting, we usually use Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach to implementing big data sorting is to rely on the shuffle-and-sort phase of MapReduce on Hadoop. However, using it directly can be very inefficient, and load imbalance can be a serious problem. In this paper we carry out an experimental study of an optimized large-scale data sorting algorithm based on Hadoop. To achieve this optimization, we use two or more rounds of MapReduce. In the first round, a MapReduce job takes a random sample of the data. A second MapReduce job then partitions and orders the data uniformly, according to the results of the first round. If the data is still too large, the algorithm returns to the first round and repeats. The experiments show that the optimized algorithm sorts large-scale data faster than the default shuffle of MapReduce.


💡 Research Summary

The paper tackles the well‑known inefficiency of Hadoop’s default shuffle‑and‑sort phase when dealing with very large data sets. In a conventional MapReduce job, the map output is shuffled to reducers based on the key’s hash value, and each reducer performs a local sort. If the key distribution is skewed, some reducers receive a disproportionate amount of data, leading to load imbalance, longer job duration, and higher memory pressure. The authors propose a multi‑round MapReduce framework that explicitly samples the input data, computes balanced partition boundaries, and then redistributes the data according to those boundaries before the final sorting stage.
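The skew problem described above is easy to reproduce. The following sketch (illustrative only, not the paper's code) mimics Hadoop's default behavior of routing each record to reducer `hash(key) % R`, showing how a single hot key overloads one reducer:

```python
# Illustration of hash-partition skew: Hadoop's default partitioner
# sends each record to reducer hash(key) % num_reducers, so a skewed
# key distribution concentrates load on a few reducers.
from collections import Counter

def hash_partition(keys, num_reducers):
    """Count how many records each reducer would receive under hash partitioning."""
    loads = Counter()
    for k in keys:
        loads[hash(k) % num_reducers] += 1
    return loads

# 90% of records share one hot key, so one reducer gets ~90% of the data
# while the other three split the remainder.
keys = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]
loads = hash_partition(keys, 4)
print(sorted(loads.values(), reverse=True))
```

The reducer that receives the hot key's interval does almost all the work, which is exactly the imbalance the sampled-partitioning scheme below is designed to avoid.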

First round – Sampling:
A Map task reads the entire input but emits only a small, randomly selected subset of records (the sampling rate is configurable, typically 1‑5 %). The Reduce phase collects all sampled keys, sorts them, and determines a set of “split points” that divide the key space into roughly equal intervals. These split points are written to a partition file that will be used by the next round. The authors argue that this explicit sampling step gives a more accurate picture of the global key distribution than Hadoop’s built‑in TotalOrderPartitioner, which relies on a single‑pass sampling that may be insufficient for extremely large data.
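A minimal sketch of this sampling round, with assumed function and parameter names (the paper does not publish code): sample keys at a configurable rate, sort the sample, and take the keys at evenly spaced quantiles as split points for `R` reducers.

```python
# Hypothetical sketch of the first round: random sampling followed by
# quantile-based split-point selection. Names and defaults are
# illustrative, not taken from the paper.
import random

def compute_split_points(keys, num_reducers, sample_rate=0.02, seed=None):
    """Return num_reducers - 1 split points dividing the key space
    into roughly equal-population intervals."""
    rng = random.Random(seed)
    # Map phase: emit a random subset of keys at the configured rate.
    # Reduce phase: collect and sort the sampled keys.
    sample = sorted(k for k in keys if rng.random() < sample_rate)
    if len(sample) < num_reducers:
        return []
    # Pick the keys at the 1/R, 2/R, ... quantiles of the sorted sample.
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]

# Uniform keys 0..99999: the split points land near the quartiles.
points = compute_split_points(range(100_000), num_reducers=4, seed=1)
print(points)
```

In the real algorithm the split points would be written to a partition file in HDFS so the second job's mappers can read them.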

Second round – Balanced redistribution and sorting:
In the second MapReduce job, each mapper reads the original data again, looks up the partition file, and assigns a partition number to each record based on the previously computed split points. The shuffle phase now sends each record to the reducer responsible for its interval, guaranteeing that each reducer receives a roughly equal share of the total data. Each reducer then performs a local sort (the same as in the standard Hadoop sort) and writes its output.
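The map-side lookup in this second round amounts to a binary search over the split points, so each record lands in the reducer owning its key interval. A small sketch (illustrative names, not the authors' code):

```python
# Sketch of the second round's partition assignment: binary search over
# the split points computed by the sampling round. Each reducer owns one
# contiguous key interval of roughly equal population.
import bisect

def assign_partition(key, split_points):
    """Index of the reducer whose interval contains `key`."""
    return bisect.bisect_right(split_points, key)

split_points = [25, 50, 75]            # produced by the sampling round
assert assign_partition(10, split_points) == 0
assert assign_partition(25, split_points) == 1   # boundary keys go right
assert assign_partition(60, split_points) == 2
assert assign_partition(99, split_points) == 3
```

Because the intervals are contiguous and ordered, concatenating the reducers' locally sorted outputs in partition order yields a totally ordered result.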

Iterative refinement:
If the data set is so massive that the initial sample does not provide enough confidence (e.g., the sample size exceeds memory limits or the estimated partition sizes still show high variance), the algorithm can “loop back” to the first round, increase the sampling rate, and recompute the split points. This iterative approach continues until the partition balance meets a predefined threshold.
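One way this loop-back could look in code (a hypothetical sketch under assumed names; here the balance check uses an independent validation sample, and the acceptance threshold and doubling schedule are my assumptions, not values from the paper):

```python
# Hypothetical sketch of iterative refinement: resample at a higher rate
# until the estimated partition sizes are balanced within a threshold.
import bisect
import random

def balanced_split_points(keys, num_reducers, rate=0.01,
                          max_imbalance=0.2, max_rounds=5, seed=0):
    """Loop back to the sampling round, doubling the rate, until the
    relative deviation of estimated partition sizes is small enough."""
    rng = random.Random(seed)
    points = []
    for _ in range(max_rounds):
        sample = sorted(k for k in keys if rng.random() < rate)
        if len(sample) < num_reducers:
            rate = min(1.0, rate * 2)
            continue
        points = [sample[(i * len(sample)) // num_reducers]
                  for i in range(1, num_reducers)]
        # Estimate partition sizes from an independent validation sample.
        check = [k for k in keys if rng.random() < rate]
        sizes = [0] * num_reducers
        for k in check:
            sizes[bisect.bisect_right(points, k)] += 1
        mean = len(check) / num_reducers
        if mean and max(abs(s - mean) for s in sizes) / mean <= max_imbalance:
            return points                      # balance threshold met
        rate = min(1.0, rate * 2)              # loop back, larger sample
    return points
```

The termination criterion mirrors the paper's idea of repeating the sampling round until the partition balance meets a predefined threshold.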

Experimental evaluation:
The authors conduct experiments on two environments: (1) a physical cluster of ten nodes (8 CPU cores, 32 GB RAM each) processing data sets ranging from 100 GB to 1 TB, and (2) a cloud‑based virtual cluster with comparable resources. They compare three configurations: (a) the vanilla Hadoop shuffle‑sort, (b) the proposed two‑round algorithm with a fixed 2 % sampling rate, and (c) the same algorithm with adaptive sampling (loop‑back). The metrics include total job runtime, reducer memory consumption, and network I/O. Results show a 30‑45 % reduction in total runtime for the optimized algorithm, with the most pronounced gains (≈45 %) on skewed key distributions. Reducer memory usage drops by about 30 % because each reducer handles a smaller, more predictable amount of data. Network traffic is largely unchanged except for the modest overhead of transmitting the sampled keys, which is negligible compared with the overall shuffle volume.

Strengths:

  • Explicit load balancing: By computing partition boundaries from a representative sample, the algorithm ensures that each reducer gets an even workload, directly addressing the core problem of shuffle‑sort imbalance.
  • Scalability through iteration: The ability to return to the sampling phase and increase the sample size makes the method robust for data sets that exceed the memory capacity of a single reducer.
  • Practical implementation: The approach builds on standard Hadoop APIs (custom Partitioners, multiple jobs) and does not require changes to the underlying Hadoop core, facilitating adoption.

Weaknesses and open issues:

  • Sampling overhead and parameter tuning: Determining the optimal sampling rate is non‑trivial; too small a sample may lead to inaccurate split points, while too large a sample adds unnecessary overhead. The paper does not provide an automated method for selecting this rate.
  • Increased job complexity: Introducing additional MapReduce rounds adds latency for job setup, monitoring, and failure recovery. In environments with frequent node failures, checkpointing and re‑execution become more cumbersome.
  • Limited comparative scope: The evaluation focuses solely on Hadoop’s native sort. It does not benchmark against alternative distributed processing frameworks (e.g., Apache Spark, Flink) or against more sophisticated Hadoop‑based sorters such as the TotalOrderPartitioner with a custom sampler.
  • Assumption of static data: The method assumes a batch processing scenario where the entire data set is available before sorting begins. It may not be directly applicable to streaming or incremental sorting workloads.

Future directions suggested by the authors:

  1. Adaptive sampling: Develop a feedback loop that monitors partition size variance after each round and automatically adjusts the sampling fraction without manual intervention.
  2. Machine‑learning‑driven partitioning: Use historical job logs to predict optimal split points for recurring data patterns, reducing the need for repeated sampling.
  3. Hybrid in‑memory execution: Combine the multi‑round approach with in‑memory processing (e.g., Spark’s RDD cache) to further cut down shuffle latency for datasets that fit partially in RAM.
  4. Broader benchmarking: Extend experiments to include other big‑data platforms, varied network topologies, and different data types (e.g., binary blobs, nested records) to assess generality.

In summary, the paper presents a pragmatic, multi‑stage MapReduce strategy that substantially improves the efficiency of large‑scale sorting on Hadoop by explicitly balancing reducer loads through sampled partitioning. While the approach demonstrates clear performance gains, its practical deployment will benefit from automated parameter selection, robust fault‑tolerance mechanisms, and broader comparative studies against contemporary distributed processing systems.

