Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds
MapReduce has become a popular programming model for running data-intensive applications on the cloud. Completion-time goals, or deadlines, that users set for MapReduce jobs are becoming crucial in existing cloud-based data processing environments like Hadoop. There is a conflict between scheduling MapReduce jobs to meet deadlines and preserving "data locality" (assigning tasks to nodes that contain their input data): to meet a deadline, a task may be scheduled on a node without local input data, causing expensive data transfer from a remote node. In this paper, a novel scheduler is proposed to address this problem, based primarily on a dynamic resource-reconfiguration approach. It has two components: 1) a Resource Predictor, which dynamically determines the number of Map/Reduce slots each job requires to meet its completion-time guarantee; and 2) a Resource Reconfigurator, which adjusts CPU resources by dynamically growing or shrinking individual VMs, maximizing data locality and the use of resources among active jobs without violating users' completion-time goals. The proposed scheduler has been evaluated against the Fair Scheduler on a virtual cluster built on a physical cluster of 20 machines. The results demonstrate an improvement of about 12% in job throughput.
💡 Research Summary
The paper addresses a fundamental tension in cloud‑based data‑intensive processing: meeting user‑specified deadlines for MapReduce jobs while preserving data locality, i.e., assigning map and reduce tasks to nodes that already store the required input blocks. Traditional Hadoop schedulers either prioritize locality—thereby risking deadline violations when a job cannot be placed on a local node—or prioritize deadlines—causing extensive remote data transfers that increase network load and overall execution time. To reconcile these conflicting objectives, the authors propose a novel scheduler that leverages dynamic resource reconfiguration made possible by virtualization.
The scheduler consists of two tightly coupled components. The first, the Resource Predictor, continuously estimates the minimum number of map and reduce slots each incoming job needs to satisfy its deadline. It ingests job characteristics (input size, historical runtimes), current cluster metrics (CPU utilization, slot availability, network bandwidth), and applies a hybrid model combining linear regression with a k‑nearest‑neighbors approach. In the authors’ experiments the predictor achieves an average accuracy of 93 % in slot‑requirement estimation.
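The core of such a predictor is a deadline-to-slots calculation: given an estimate of how fast one slot processes data, the minimum degree of parallelism follows directly. The sketch below illustrates that arithmetic only; the function name, the per-slot rate parameter, and the numbers are hypothetical and are not taken from the paper (which layers a regression/k-NN model on top of historical job data).

```python
import math

def required_slots(input_size_mb, per_slot_rate_mb_s, deadline_s):
    """Illustrative lower bound on slot count: total serial processing
    time divided by the deadline, rounded up. Ignores reduce-phase
    overlap, stragglers, and setup cost that a real predictor models."""
    serial_time_s = input_size_mb / per_slot_rate_mb_s
    return math.ceil(serial_time_s / deadline_s)

# e.g. 60 GB of input, ~8 MB/s per slot, 10-minute deadline
slots = required_slots(60 * 1024, 8.0, 600)  # -> 13 slots
```

A real predictor would refine `per_slot_rate_mb_s` per job type from historical runtimes rather than assume a constant.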
The second component, the Resource Reconfigurator, manipulates the CPU allocation of individual virtual machines (VMs) in real time. When a task cannot be placed on a node that holds its input data, the reconfigurator temporarily augments the vCPU count of the target VM, allowing the task to execute quickly despite the lack of locality. Conversely, when locality is achieved, excess vCPUs are reclaimed and redistributed to other VMs that need additional slots. Memory and disk I/O allocations remain static to keep reconfiguration overhead low; the measured latency for a vCPU adjustment is under 5 ms, which is negligible for the scheduling horizon. By dynamically scaling CPU resources, the system maximizes data locality, reduces remote data transfer, and still respects the deadline constraints derived by the predictor.
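The reconfigurator's decision logic can be pictured as a borrow-and-return policy over per-VM vCPU counts. The following is a toy, in-memory sketch of that policy under assumptions of my own (names, cap/floor values, and the "take from the largest peer" heuristic are all illustrative); the paper's component applies such decisions to live KVM guests rather than a dictionary.

```python
def rebalance_vcpus(vms, task_node, needs_boost, step=1, vcpu_cap=8, vcpu_floor=1):
    """Toy rebalancing policy: when a task must run on `task_node`
    without local data, grant that VM an extra vCPU taken from the
    most over-provisioned peer; otherwise leave allocations alone.
    `vms` maps VM name -> current vCPU count (mutated in place)."""
    if needs_boost and vms[task_node] < vcpu_cap:
        donor = max((v for v in vms if v != task_node), key=vms.get)
        if vms[donor] > vcpu_floor:
            vms[donor] -= step      # reclaim a vCPU from the donor VM
            vms[task_node] += step  # boost the VM running the non-local task
    return vms

# vm3 hosts a non-local task, so it borrows one vCPU from vm2
state = rebalance_vcpus({"vm1": 4, "vm2": 6, "vm3": 2}, "vm3", True)
```

Keeping memory and disk I/O fixed, as the paper does, is what makes each such adjustment cheap enough (a few milliseconds) to run inside the scheduling loop.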
The solution is evaluated on a physical cluster of 20 machines (each 8 CPU cores, 32 GB RAM, 10 Gbps network) running a KVM‑based virtual layer that hosts 20 VMs (one per host). The authors run a mix of benchmark workloads (WordCount, Sort, TeraSort) and a real‑world log‑analysis job, assigning each job a deadline ranging from 5 to 30 minutes. Compared with Hadoop’s Fair Scheduler, the proposed approach yields a 12 % increase in overall throughput and a 9 % reduction in average job completion time. Network traffic drops by roughly 15 % because more tasks run on nodes that already contain the required data. The dynamic CPU reallocation incurs an average overhead of 4.8 ms per adjustment, confirming that the approach scales to real‑time scheduling. Additional experiments scaling the cluster to 40 and 80 physical nodes show that the performance gains persist, and overall CPU utilization improves by about 18 %, indicating better resource efficiency for cloud providers.
In summary, the paper demonstrates that virtualization‑enabled, fine‑grained CPU reallocation can effectively bridge the gap between deadline adherence and data locality in MapReduce environments. The authors suggest future work extending the reconfiguration to memory and I/O resources and incorporating deep‑learning‑based prediction models to further refine slot estimation and resource allocation decisions.