Comparing Spark vs MPI/OpenMP On Word Count MapReduce

Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

Spark provides an in-memory implementation of MapReduce that is widely used in the big data industry. MPI/OpenMP is a popular framework for high-performance parallel computing. This paper presents a high-performance MapReduce design in MPI/OpenMP and compares it with Spark on the classic word-count MapReduce task. The results show that the MPI/OpenMP MapReduce outperforms Apache Spark by about 300%.


💡 Research Summary

The paper presents a custom MapReduce implementation built on MPI and OpenMP and compares its performance to Apache Spark on a classic word‑count benchmark. The authors first motivate the need for a high‑performance MapReduce engine in the high‑performance computing (HPC) domain, noting that while Spark offers an in‑memory data‑flow model popular in industry, it runs on the Java Virtual Machine and includes fault‑tolerance mechanisms that add overhead. In contrast, MPI and OpenMP are the de facto standards for tightly coupled parallel programs on clusters, yet there is no mature MapReduce library for them.

The core of the proposed system consists of three data structures: DistRange, DistHashMap, and ConcurrentHashMap. DistRange partitions a global integer range across nodes and threads, invoking a user‑provided mapper for each element. The mapper extracts words from a line of text and emits (word, 1) pairs into a DistHashMap. DistHashMap is a simplified distributed hash table: each node holds a main ConcurrentHashMap for its own keys and a set of auxiliary maps for keys that belong to other nodes but are inserted locally. ConcurrentHashMap implements a segment‑based linear‑probing hash table. When a thread attempts to update a segment that is already locked, the update is temporarily stored in a thread‑local linear‑probing buffer, avoiding blocking. Periodic or end‑of‑map synchronization flushes these buffers into the main hash table, and inter‑node synchronization shuffles keys to their owning nodes after the map phase.
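The segment-locking and thread-local buffering described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the class and function names are invented, and each segment is backed by a `std::unordered_map` for brevity rather than the paper's open-addressing linear-probing table.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a segment-locked hash map in the spirit of the
// paper's ConcurrentHashMap. A key hashes to one of N segments; when a
// segment's lock is already held, the update is parked in the caller's
// thread-local buffer instead of blocking.
class SegmentedCountMap {
public:
    explicit SegmentedCountMap(std::size_t segments = 16)
        : segs_(segments), locks_(segments) {}

    // Try to add `delta` to `word`'s count; on contention, stash the
    // update in `pending` (the calling thread's private buffer).
    void add(const std::string& word, long delta,
             std::unordered_map<std::string, long>& pending) {
        std::size_t s = std::hash<std::string>{}(word) % segs_.size();
        std::unique_lock<std::mutex> lk(locks_[s], std::try_to_lock);
        if (lk.owns_lock())
            segs_[s][word] += delta;   // fast path: segment was free
        else
            pending[word] += delta;    // contended: buffer locally
    }

    // Flush a thread-local buffer into the main table (blocking); this
    // mirrors the periodic / end-of-map synchronization in the paper.
    void flush(std::unordered_map<std::string, long>& pending) {
        for (const auto& kv : pending) {
            std::size_t s = std::hash<std::string>{}(kv.first) % segs_.size();
            std::lock_guard<std::mutex> lk(locks_[s]);
            segs_[s][kv.first] += kv.second;
        }
        pending.clear();
    }

    // Unsynchronized read, intended for use after all flushes complete.
    long count(const std::string& word) const {
        std::size_t s = std::hash<std::string>{}(word) % segs_.size();
        auto it = segs_[s].find(word);
        return it == segs_[s].end() ? 0 : it->second;
    }

private:
    std::vector<std::unordered_map<std::string, long>> segs_;
    std::vector<std::mutex> locks_;
};

// DistHashMap-style ownership: each key belongs to exactly one node,
// so the post-map shuffle can route keys to their owners.
inline int owner_node(const std::string& word, int num_nodes) {
    return static_cast<int>(std::hash<std::string>{}(word) % num_nodes);
}
```

In the full system, `flush` would run per thread at synchronization points, and an MPI exchange would then move each key to the node `owner_node` designates.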

A notable optimization is “local reduce” during the map phase: each thread aggregates counts for identical words before the shuffle, dramatically reducing the volume of data transferred across the network. The use of linear probing rather than chained buckets reduces memory allocation churn and improves cache locality, which is especially beneficial on shared‑memory nodes.
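The local-reduce idea can be illustrated with a short sketch: instead of emitting one (word, 1) pair per occurrence, each thread combines counts in a private map first, so the shuffle carries at most one (word, count) pair per distinct word per thread. Names here are hypothetical, and the `merge` step stands in for the inter-node shuffle.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using Counts = std::unordered_map<std::string, long>;

// Map phase with local reduce: tokenize lines and aggregate in place,
// so no stream of (word, 1) pairs is ever materialized.
Counts local_reduce(const std::vector<std::string>& lines) {
    Counts local;
    for (const auto& line : lines) {
        std::istringstream in(line);
        std::string word;
        while (in >> word)
            ++local[word];   // combine immediately in the thread-local map
    }
    return local;
}

// Shuffle/reduce stand-in: merge per-thread maps into a global table.
// In the real system this is where data crosses the network, which is
// why shrinking the per-thread maps first pays off.
Counts merge(const std::vector<Counts>& per_thread) {
    Counts global;
    for (const auto& c : per_thread)
        for (const auto& kv : c)
            global[kv.first] += kv.second;
    return global;
}
```

For a word-frequency-heavy corpus like repeated Bible text, this shrinks shuffle volume from one pair per word occurrence to one pair per distinct word per thread.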

Experimental methodology: both systems were run on identical hardware—four AWS r5.xlarge instances (4 vCPU, 32 GB RAM each). Spark was deployed via Amazon EMR 5.20.0 with Spark 2.4.0 using default settings. The MPI/OpenMP code was compiled with GCC 7.2 and linked against MPICH 3.2. The input dataset consisted of the Bible and Shakespeare’s works repeated 200 times, yielding roughly 2 GB of plain‑text data. Performance was measured in words processed per second.

Results: the MPI/OpenMP implementation achieved roughly ten times the throughput of Spark (the abstract more conservatively claims an advantage of about 300%). The authors attribute this gap to three factors: (1) native C++ execution avoids JVM and Scala interpreter overhead; (2) Spark’s built‑in fault tolerance (lineage logging, checkpointing) incurs extra I/O and memory usage; (3) the combination of local reduction and the custom hash‑table design reduces network traffic during the shuffle phase.

The paper acknowledges several limitations. The MPI/OpenMP version deliberately omits fault tolerance, assuming that the mean time between failures (MTBF) on modern hardware is on the order of a million core‑hours, making retries acceptable for short batch jobs. Consequently, the approach may not be suitable for long‑running or mission‑critical pipelines where automatic recovery is required. Moreover, the benchmark is limited to a single I/O‑heavy workload with many repeated keys; performance on more complex pipelines involving joins, iterative machine‑learning algorithms, or streaming data is not evaluated. Reproducibility is partially supported by a public GitHub repository, but detailed build flags, MPI environment variables, and tuning parameters are not exhaustively documented.

In conclusion, the authors argue that for offline analytics where fault tolerance is not a primary concern, an MPI/OpenMP‑based MapReduce engine can outperform Spark by an order of magnitude, offering cost and time savings. They suggest future work to integrate lightweight fault‑tolerance mechanisms, explore hybrid architectures that combine Spark’s DAG scheduler with MPI’s high‑performance kernels, and broaden the benchmark suite to include diverse data‑processing patterns. The paper thus contributes a concrete case study showing that low‑level parallel programming models can be leveraged to build efficient MapReduce‑style systems, challenging the prevailing assumption that Spark is the default high‑performance solution for all big‑data workloads.

