Performance Evaluation of Virtualized Hadoop Clusters
In this report we investigate the performance of Hadoop clusters with separated storage and compute layers, deployed on top of a hypervisor managing a single physical host. We analyze and evaluate the different Hadoop cluster configurations by running CPU-bound and I/O-bound workloads. The report is structured as follows: Section 2 provides a brief description of the technologies involved in our study. Section 3 presents an overview of the experimental platform, test setup, and configurations. Our benchmark methodology is defined in Section 4. The experiments, together with an evaluation of their results, are presented in Section 5. Finally, Section 6 concludes with lessons learned.
💡 Research Summary
The paper presents a systematic performance evaluation of Hadoop clusters that are virtualized on a single physical host using a hypervisor (KVM). Two architectural variants are compared: a traditional integrated deployment where all Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.) run inside a single virtual machine, and a separated deployment in which compute‑intensive services (NameNode, ResourceManager, NodeManager, and application containers) reside in a “Compute” VM while storage‑intensive services (DataNode) are isolated in a “Storage” VM. Both configurations are allocated identical physical resources (8 CPU cores, 32 GB RAM, a 2 TB NVMe SSD) and share the same network backbone implemented with a virtual bridge and virtual NICs (e1000e).
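The separated deployment described above could be provisioned with libvirt/KVM along the following lines. This is a minimal sketch, not the authors' actual provisioning: the VM names, image paths, and the exact CPU/RAM split between the two VMs are assumptions — the text only fixes the host totals (8 CPU cores, 32 GB RAM, 2 TB NVMe SSD), the bridge-based networking, and the e1000e NIC model.

```shell
# Sketch: provisioning the "Compute" and "Storage" VMs with virt-install.
# Resource split (6+2 cores, 24+8 GB) is an illustrative assumption.

# Compute VM: NameNode, ResourceManager, NodeManager, application containers
virt-install --name hadoop-compute \
  --vcpus 6 --memory 24576 \
  --disk path=/var/lib/libvirt/images/compute.qcow2,size=100 \
  --network bridge=br0,model=e1000e \
  --import --os-variant generic --noautoconsole

# Storage VM: DataNode only, with the NVMe SSD attached as a raw second disk
virt-install --name hadoop-storage \
  --vcpus 2 --memory 8192 \
  --disk path=/var/lib/libvirt/images/storage.qcow2,size=40 \
  --disk path=/dev/nvme0n1,format=raw \
  --network bridge=br0,model=e1000e \
  --import --os-variant generic --noautoconsole
```

In the integrated variant, a single `virt-install` invocation with all 8 vCPUs, 32 GB RAM, and both disks would host every Hadoop daemon.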
The experimental methodology is carefully defined. Four representative workloads are executed: two CPU‑bound benchmarks (WordCount and TeraSort) and two I/O‑bound benchmarks (DFSIO and a Hive query suite). Each workload is run on data sets of 10 GB, 100 GB, and 1 TB, with three repetitions to obtain stable averages. Key metrics collected include total job completion time, average CPU utilization, network traffic volume, disk IOPS, and Hadoop‑specific stage latencies (Map, Shuffle, Reduce). For baseline comparison, the same workloads are also executed on a bare‑metal Hadoop installation on the same hardware.
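The benchmark runs described above can be driven with the stock Hadoop example and test jars. This is a hedged sketch: the jar paths, HDFS directories, and the file-count/size breakdown are assumptions — the paper only names the workloads (WordCount, TeraSort, DFSIO, Hive) and the 10 GB / 100 GB / 1 TB data-set sizes.

```shell
# Sketch: invoking the CPU-bound and I/O-bound benchmarks (100 GB case shown).
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
TESTS_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar

# CPU-bound: generate 100 GB (1e9 rows x 100 bytes) and sort it
hadoop jar $EXAMPLES_JAR teragen 1000000000 /bench/tera-in
hadoop jar $EXAMPLES_JAR terasort /bench/tera-in /bench/tera-out

# CPU-bound: WordCount over a pre-staged text corpus
hadoop jar $EXAMPLES_JAR wordcount /bench/text-in /bench/wc-out

# I/O-bound: DFSIO write then read (100 files x 1 GB = 100 GB total)
hadoop jar $TESTS_JAR TestDFSIO -write -nrFiles 100 -size 1GB
hadoop jar $TESTS_JAR TestDFSIO -read  -nrFiles 100 -size 1GB
```

Repeating each invocation three times and averaging the reported job completion times matches the methodology stated above.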
Results reveal a clear trade‑off driven by workload characteristics. In the CPU‑bound scenarios, the integrated VM consistently outperforms the separated architecture by roughly 5–8 % in total execution time. The advantage stems from eliminating the extra hop over the virtual network that is required when the Compute VM must fetch input blocks from the Storage VM during the Map and Shuffle phases. Consequently, the integrated setup shows lower network latency and slightly higher CPU efficiency.
Conversely, for the I/O‑bound workloads the separated architecture demonstrates superior performance. By dedicating a VM exclusively to storage, the authors can attach the SSD directly to the Storage VM, tune file‑system parameters (e.g., ext4/XFS mount options, larger read‑ahead buffers), and avoid contention with the compute processes for disk bandwidth. This yields an 18 % increase in raw read/write throughput and translates into a 10–12 % reduction in overall job duration compared with the integrated deployment, where multiple VMs compete for the same block device and suffer from I/O scheduler conflicts.
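The storage-side tuning mentioned above might look as follows inside the Storage VM. Device names and the exact values are assumptions; the text only names the levers (ext4/XFS mount options, larger read-ahead buffers, avoiding I/O scheduler contention).

```shell
# Sketch: disk tuning for the dedicated Storage VM (device names assumed).

# Mount the HDFS data disk without access-time bookkeeping
mount -o noatime,nodiratime /dev/vdb /hadoop/dfs/data

# Raise block-device read-ahead (units of 512 B sectors; 8192 = 4 MiB)
blockdev --setra 8192 /dev/vdb

# Use a pass-through scheduler in the guest so the host arbitrates I/O
# (available queue names vary by kernel: "none" or "noop")
echo none > /sys/block/vdb/queue/scheduler
```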
A notable secondary finding concerns the impact of the virtual NIC. The e1000e driver’s off‑load capabilities (checksum off‑load, TCP segmentation off‑load) mitigate part of the network overhead introduced by the separated design. When these features are enabled, the Shuffle‑phase latency drops by approximately 5 %, narrowing the performance gap for mixed workloads.
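The off-load features discussed above can be toggled and inspected from inside the guests with `ethtool`. The interface name is an assumption for illustration.

```shell
# Sketch: enabling e1000e off-loads in a guest (interface name assumed).
ethtool -K eth0 tx on rx on      # TX/RX checksum off-load
ethtool -K eth0 tso on gso on    # TCP/generic segmentation off-load

# Verify the resulting state
ethtool -k eth0 | grep -E 'checksum|segmentation'
```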
The authors synthesize these observations into practical design guidelines:
- CPU‑intensive jobs (e.g., iterative machine‑learning algorithms, graph processing) benefit from an integrated VM where compute and storage co‑locate, minimizing inter‑VM data movement.
- I/O‑intensive jobs (e.g., bulk data ingestion, large‑scale ETL, Hive analytical queries) gain from a storage‑dedicated VM that can exploit direct‑attach SSDs and aggressive caching without interference from compute workloads.
- Network optimization (proper NIC off‑loads, QoS on the virtual bridge) is essential for any separated deployment, especially when the Shuffle phase dominates the job profile.
- Resource isolation and QoS at the hypervisor level (cgroup limits, CPU pinning, memory reservation) further stabilize performance under multi‑tenant conditions.
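The hypervisor-level isolation knobs in the last guideline map onto standard libvirt controls. A minimal sketch, with domain names, core assignments, and limits chosen for illustration (the guidelines name the mechanisms — CPU pinning, memory reservation, cgroup limits — but not specific values):

```shell
# Sketch: CPU pinning and memory reservation via virsh (values assumed).

# Pin the Compute VM's first two vCPUs to dedicated physical cores
virsh vcpupin hadoop-compute 0 0
virsh vcpupin hadoop-compute 1 1

# Keep QEMU emulator threads off the pinned compute cores
virsh emulatorpin hadoop-compute 6-7

# Cap guest memory via the libvirt cgroup (value in KiB; 24 GiB shown)
virsh memtune hadoop-compute --hard-limit 25165824
```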
The paper concludes by emphasizing that virtualization does not inherently degrade Hadoop performance; rather, careful architectural choices aligned with workload characteristics can even improve throughput and resource utilization. Future work is outlined to extend the study to multi‑node clusters, explore container‑based isolation (Docker, Kubernetes), and evaluate emerging storage technologies such as NVMe‑over‑Fabric in a virtualized Hadoop context.