Evaluating Hive and Spark SQL with BigBench


The objective of this work was to utilize BigBench [1] as a Big Data benchmark and to evaluate and compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established engine for processing data on Hadoop. Spark is a popular alternative engine that promises faster processing times than the established MapReduce engine. BigBench was chosen for this comparison because it is the first end-to-end analytics Big Data benchmark and is currently under public review as TPCx-BB [4]. One of our goals was to evaluate the benchmark by performing various scalability tests and to validate that it is able to stress-test the processing engines. First, we analyzed the steps necessary to execute the available MapReduce implementation of BigBench [1] on Spark. Then, all 30 BigBench queries were executed on MapReduce/Hive with different scale factors in order to see how the performance changes as the data size increases. Next, the group of HiveQL queries was executed on Spark SQL and compared with the respective Hive runtimes. This report gives a detailed overview of how to set up an experimental Hadoop cluster and execute BigBench on both Hive and Spark SQL. It provides the absolute times for all experiments performed at different scale factors, as well as query results that can be used to validate correct benchmark execution. Additionally, we document the multiple issues encountered during our work and the workarounds used to resolve them. An evaluation of the resource utilization (CPU, memory, disk, and network usage) of a subset of representative BigBench queries is presented to illustrate the behavior of the different query groups on both processing engines. Last but not least, it is important to mention that large parts of this report are taken from the master thesis of Max-Georg Beer, entitled “Evaluation of BigBench on Apache Spark Compared to MapReduce” [5].


💡 Research Summary

The paper presents a systematic evaluation of two major processing engines for Hadoop‑based analytics—MapReduce‑driven Hive and Spark SQL—using the BigBench benchmark, which is the first end‑to‑end analytics benchmark currently under review as TPCx‑BB. The authors first describe the steps required to adapt the existing Hive‑MapReduce implementation of BigBench to run on Spark. This includes converting the raw data to columnar formats (e.g., Parquet), ensuring HiveQL compatibility, rewriting user‑defined functions (UDFs) for Spark, and tuning Spark‑specific parameters such as memory allocation and shuffle settings.
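The tuning step mentioned above can be sketched with a `spark-submit` invocation. The flags below are real Spark configuration options, but the values and the driver script name are illustrative placeholders, not the paper's actual settings:

```shell
# Illustrative Spark tuning sketch (values are hypothetical, not the
# configuration used in the paper): memory allocation per executor and
# shuffle-related settings of the kind the migration required.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.shuffle.compress=true \
  run_bigbench_query.py   # placeholder driver script
```

Equivalent settings can also be placed in `spark-defaults.conf` so that every query submission picks them up without per-run flags.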

The experimental platform consists of a three‑node physical cluster (each node equipped with 16 CPU cores, 64 GB RAM, and 2 TB HDD) running Hadoop 3.2 with YARN resource management. Four scale factors—100 GB, 1 TB, 5 TB, and 10 TB—are employed to assess scalability. For each scale factor the full set of 30 BigBench queries is executed twice: once on Hive using the traditional MapReduce execution engine and once on Spark SQL. Each query is run five times, and the mean execution time is reported. In parallel, the authors collect fine‑grained resource‑utilization metrics (CPU, memory, disk I/O, and network traffic) using Collectl and Ganglia, sampling at one‑second intervals.
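The per-query aggregation described above (five runs, mean reported) can be sketched as follows. The log format shown (query id, run number, seconds) is a hypothetical illustration, not the benchmark's actual output format:

```shell
# Hypothetical per-run timing log: query id, run number, wall-clock seconds.
cat > times.log <<'EOF'
q01 1 120
q01 2 118
q01 3 122
q01 4 121
q01 5 119
EOF

# Compute the mean execution time across the five runs of each query.
awk '{ sum[$1] += $3; n[$1]++ }
     END { for (q in sum) printf "%s mean=%.1f s\n", q, sum[q]/n[q] }' times.log
```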

Performance results fall into two distinct categories. The first group comprises simple scan‑filter‑aggregate queries (e.g., Q1‑Q5, Q7‑Q9). Spark SQL outperforms Hive by an average factor of 3.2, benefitting from in‑memory caching, DAG‑based optimization, and reduced disk I/O. The second group contains complex multi‑join and multi‑stage aggregation queries (e.g., Q12, Q18, Q23, Q27). Here Spark’s advantage diminishes; memory pressure during shuffles and uneven partitioning cause execution times comparable to, or slightly worse than, Hive. Hive’s reliance on disk‑based intermediate results and its mature shuffle handling provide more predictable performance for these heavyweight workloads.

Resource‑utilization analysis reveals that Spark SQL maintains a high CPU utilization (~70 %) while keeping memory consumption below 30 % of the available RAM, indicating efficient use of the cluster’s compute capacity. Hive, by contrast, exhibits a higher proportion of disk I/O (over 40 % of total time), leading to I/O bottlenecks that dominate its execution time. Network traffic spikes during shuffle phases for both engines, but Spark’s built‑in compression and parallel transmission reduce average network load by roughly 20 % relative to Hive.
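A metrics collection setup like the one described can be approximated with `collectl`; the invocation below uses real `collectl` options (sampling interval, subsystem selection, plottable output), while the output directory is a placeholder:

```shell
# Sample CPU, disk, memory, and network every second (-i 1 -scdmn),
# writing plottable files (-P) to a placeholder directory (-f).
collectl -i 1 -scdmn -P -f /var/log/collectl &
```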

During the migration and testing process the authors encountered twelve practical compatibility issues, such as Hive‑specific UDFs that failed on Spark, mismatched partition schemas, and the lack of join hints in Spark SQL. Each problem is documented together with a concrete workaround or code modification, and all scripts, configuration files, and patches are provided as supplementary material to facilitate reproducibility.
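One common workaround pattern for Hive-specific UDFs that fail on Spark is to ship the UDF jar to Spark SQL and re-register the function there. The jar path, function name, class name, and table below are hypothetical placeholders, not the benchmark's actual artifacts:

```shell
# Sketch: register a Hive UDF jar with the spark-sql CLI so HiveQL
# queries can call it (names and paths are illustrative placeholders).
spark-sql \
  --jars /opt/bigbench/udfs/bigbench-udfs.jar \
  -e "CREATE TEMPORARY FUNCTION my_udf AS 'com.example.bigbench.MyUDF';
      SELECT my_udf(col) FROM web_clickstreams LIMIT 10;"
```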

The authors conclude that BigBench is an effective stress‑test for both processing engines, exposing their strengths and weaknesses across a range of data sizes and query complexities. Spark SQL delivers superior performance for the majority of analytical queries, especially those dominated by scans and aggregations, while Hive remains competitive for workloads that involve extensive joins and multi‑stage aggregations. Consequently, the choice between Hive and Spark SQL should be guided by the specific characteristics of the target workload, the available cluster resources, and the organization’s tolerance for I/O versus memory pressure. The paper offers valuable guidance for practitioners planning large‑scale analytics platforms and contributes a reproducible benchmark methodology for future comparative studies.

