Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science


Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance through better in-memory processing, but also provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms (Apache Hadoop MapReduce, Apache Spark, and Apache Flink) from a usability perspective. We report on the design, execution and results of a usability study with a cohort of master's students, who learned and worked with all three platforms in order to solve different use cases set in a data science context. Our findings show that Spark and Flink are preferred over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make big data platforms more, or less, effective for users in data science.


💡 Research Summary

The paper presents a systematic usability study of three major distributed data-processing platforms: Apache Hadoop MapReduce, Apache Spark, and Apache Flink. The study was conducted with master's students who were learning cloud-based big-data analytics. The authors motivate the work by noting that while Hadoop MapReduce has been the de facto standard for large-scale batch processing, its low-level programming model (explicit map/reduce functions, job chaining, and Hadoop Streaming scripts) imposes a steep learning curve on non-computer-science users. Spark and Flink were introduced to provide higher-level abstractions, in-memory computation, and richer built-in operators, but their actual usability for data-science practitioners had not been empirically measured.

Study design. The experiment was conducted in a semester‑long cloud‑computing course at the University of Sydney. A cohort of roughly 30 master’s students from diverse backgrounds (computer science, data science, biology, etc.) participated. Prior to the tasks, participants completed a questionnaire capturing programming experience (novice, intermediate, advanced) and preferred language (Python, Java, Scala). The usability evaluation comprised three data‑analysis assignments derived from immunology and genomics use cases. Assignment 1 required a classic MapReduce implementation (word‑count‑style statistics). Assignments 2 and 3 were identical in functionality but were split in an A/B fashion: half the class used Spark, the other half used Flink. After each assignment participants reported (1) the actual time spent coding, (2) their System Usability Scale (SUS) score, and (3) a subjective preference rating on a 5‑point Likert scale.
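The A/B split described above yields two independent groups whose self-reported outcomes (coding time, SUS score, preference) can be compared. This summary does not say which statistical test the authors used, so the following is only a sketch of one common option, a two-sided permutation test on the difference of group means. The function name `permutation_test` and all input values are hypothetical placeholders, not data from the study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign participants to groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical placeholder values (minutes), NOT measurements from the study.
spark_minutes = [90, 100, 85, 110, 95]
flink_minutes = [92, 88, 105, 97, 91]

p_value = permutation_test(spark_minutes, flink_minutes)
print(f"estimated p-value: {p_value:.3f}")
```

A permutation test is chosen here only because it makes few distributional assumptions, which suits a small cohort; a rank-based or t-test would be an equally plausible choice.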

Key findings.

  • Development time: Mean coding time for MapReduce exceeded 180 minutes, whereas Spark and Flink averaged about 95 and 92 minutes, respectively.
  • Usability (SUS): MapReduce received an average SUS of 58 (below average), while Spark scored 78 and Flink 76, both falling into the “good/excellent” range. Statistical tests showed no significant difference between Spark and Flink (p > 0.05).
  • Preference: Participants overwhelmingly preferred Spark or Flink over MapReduce; the two modern platforms were rated similarly.
  • Influence of experience: More experienced programmers completed tasks faster across all systems, yet SUS scores for Spark and Flink remained high regardless of experience level, indicating that the higher‑level APIs mitigate expertise gaps.
  • Language effect: The availability of a Python API was repeatedly cited as a major factor for ease of learning, especially for students from non‑CS backgrounds.
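For readers unfamiliar with how SUS values such as 58 or 78 arise, here is a minimal sketch of the standard System Usability Scale scoring rule: ten items answered on a 1 to 5 scale, odd-numbered items contributing (response − 1), even-numbered items contributing (5 − response), and the sum multiplied by 2.5 to give a score from 0 to 100. The function name `sus_score` and the sample responses are illustrative assumptions, not the authors' code or data.

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 responses.

    Odd-numbered items are positively worded and contribute (response - 1);
    even-numbered items are negatively worded and contribute (5 - response).
    The summed contributions (0-40) are scaled by 2.5 to a 0-100 range.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            total += response - 1
        else:
            total += 5 - response
    return total * 2.5


# Hypothetical questionnaire from one participant (not study data).
example_responses = [4, 2, 4, 1, 5, 2, 4, 2, 4, 3]
example_score = sus_score(example_responses)
print(example_score)
```

The per-platform averages reported above would then be the mean of such scores over all participants who used that platform.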

Interpretation. The authors attribute Spark and Flink’s superior usability to several design aspects: (i) high‑level DataFrame/Dataset (Spark) and DataSet/DataStream (Flink) APIs that hide shuffle, partitioning, and fault‑tolerance details; (ii) extensive documentation, tutorials, and community examples; (iii) interactive monitoring dashboards (Spark UI, Flink Dashboard) that provide immediate feedback; and (iv) in‑memory processing that reduces the need for explicit disk‑I/O handling. Conversely, MapReduce forces developers to manage low‑level details such as key/value serialization, custom partitioners, and explicit job sequencing, which proved cumbersome for the target audience.
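The contrast between explicit job structure and built-in operators can be illustrated with a small, single-machine sketch. It uses only the Python standard library and mimics the shape of the two programming styles; it is not the Hadoop, Spark, or Flink API, and all function names are hypothetical. The first version spells out separate map, shuffle, and reduce steps for a word count, while the second expresses the same computation through one high-level aggregation.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Group values by key; a real framework does this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values per key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_low_level(lines):
    """Word count with every stage written out by the developer."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_high_level(lines):
    """Word count via a single built-in aggregation."""
    words = (word.lower() for line in lines for word in line.split())
    return dict(Counter(words))


sample = ["Spark and Flink", "Flink and MapReduce"]
print(word_count_low_level(sample))
print(word_count_high_level(sample))
```

Both functions return the same counts; the difference lies in how much of the pipeline the developer has to write and wire together by hand, which is the kind of overhead the participants reported for MapReduce.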

Limitations. The study is confined to a single university, a single semester, and a cohort of master’s students, limiting generalizability to industry practitioners or larger, more heterogeneous user groups. Only batch‑oriented workloads were examined; streaming scenarios—where Flink’s native streaming model could show distinct advantages—were not evaluated. The software versions (Hadoop 2.7.2, Spark 2.1.1, Flink 1.2.1) are now superseded, so performance and usability improvements in newer releases are not captured.

Conclusion and future work. The evidence suggests that for data‑science tasks requiring rapid prototyping and moderate scalability, Spark and Flink are far more user‑friendly than traditional MapReduce, regardless of a user’s prior programming expertise. MapReduce may still be relevant for teaching low‑level distributed concepts or for niche cases demanding fine‑grained control. The authors recommend extending the research to cover diverse industry domains, longer‑term adoption studies, and comprehensive streaming‑oriented evaluations to fully map the usability landscape of modern big‑data platforms.

