A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigms. We propose a basis of common terminology and functional factors upon which to analyze the two paradigms. We discuss the concept of “Big Data Ogres” and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations of these paradigms, shed light upon the reasons for their current “architecture,” and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre, K-means clustering, and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.
💡 Research Summary
The paper presents a systematic comparison of the two dominant paradigms for data‑intensive scientific computing: traditional high‑performance computing (HPC) and the Apache‑Hadoop‑based Big Data Stack (ABDS). After outlining the common challenges of large‑scale data distribution, co‑placement of data and compute, and massive I/O, the authors introduce a unified terminology and a set of functional factors to evaluate both ecosystems. Central to the analysis is the concept of “Big Data Ogres,” a classification scheme inspired by the Berkeley Dwarfs, which groups workloads along three facets: problem architecture (e.g., pleasingly parallel, local vs. global machine learning, data fusion), data source characteristics (SQL, NoSQL, file collections, IoT streams, simulation outputs), and core analytic kernels (K‑means, PageRank, LDA, SVD, graph algorithms, etc.). This taxonomy provides a common language for describing workloads that appear in both HPC and ABDS environments.
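To make the three-facet classification concrete, the facets and the example values named above can be sketched as a small lookup structure. This is a hypothetical encoding for illustration only; the names paraphrase examples from the paper and are not an official or exhaustive schema:

```python
# Hypothetical encoding of the three Ogre facets; entries paraphrase
# examples given in the summary, not an exhaustive vocabulary.
OGRE_FACETS = {
    "problem_architecture": ["pleasingly parallel", "local machine learning",
                             "global machine learning", "data fusion"],
    "data_source": ["SQL", "NoSQL", "file collections", "IoT streams",
                    "simulation outputs"],
    "core_kernel": ["K-means", "PageRank", "LDA", "SVD", "graph algorithms"],
}

def classify(problem_architecture, data_source, core_kernel):
    """Validate a workload description against the facet vocabulary
    and return it as a facet -> value mapping."""
    workload = {"problem_architecture": problem_architecture,
                "data_source": data_source,
                "core_kernel": core_kernel}
    for facet, value in workload.items():
        if value not in OGRE_FACETS[facet]:
            raise ValueError(f"unknown value for facet {facet!r}: {value!r}")
    return workload
```

Under this toy schema, the K-means experiment discussed later would be described as, e.g., `classify("global machine learning", "file collections", "K-means")`.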
The authors decompose each paradigm into five architectural layers—resource fabric, resource management, communication, higher‑level runtime, and data processing/analytics—and map the dominant abstractions and implementations onto these layers. In HPC, compute and storage are physically separated; parallel file systems such as Lustre or GPFS serve many compute nodes, while centralized batch schedulers (SLURM, Torque, SGE) allocate cores without regard to data locality. Data movement across the network often becomes a bottleneck for data‑intensive jobs. In contrast, ABDS tightly couples storage (HDFS) and compute on the same nodes, and YARN provides multi‑level scheduling that allows applications to run their own schedulers on top of the cluster manager. This design enables a rich ecosystem of higher‑level engines—MapReduce, Spark, Tez, Flink, Giraph, Hive, Impala, etc.—that address iterative, streaming, and graph‑oriented workloads.
To ground the discussion, the paper uses K‑means clustering as a representative Ogre and evaluates several implementations: MPI‑based MapReduce, native MPI, Hadoop‑MapReduce, Spark, and Tez, across both HPC clusters and ABDS clusters. The experiments reveal that HPC excels when the network bandwidth and parallel file system can sustain high I/O throughput, delivering superior raw FLOP performance for large, static datasets. However, when data locality can be exploited, Spark and Hadoop‑based runtimes achieve higher efficiency due to lower scheduling overhead and in‑memory data reuse, especially for iterative algorithms with modest data partitions. YARN’s multi‑level scheduling improves overall cluster utilization by about 15 % in mixed‑workload scenarios.
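The K-means Ogre's defining trait is its iterative assign-then-update structure, which is why in-memory runtimes such as Spark benefit from data reuse across iterations. A minimal single-node NumPy sketch of Lloyd's algorithm (the serial kernel that the paper's MPI-, Hadoop-, Spark-, and Tez-based implementations parallelize; this code is an illustration, not any of those implementations) looks like:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain single-node Lloyd's algorithm for K-means clustering.

    Each iteration re-reads the full dataset, which is the access
    pattern that rewards in-memory reuse in distributed runtimes.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In a distributed setting, the assignment step is pleasingly parallel over data partitions, while the update step requires a global reduction of per-partition sums and counts; the cost of that reduction (MPI allreduce vs. a shuffle stage) is one axis along which the two paradigms differ.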
The authors argue that despite distinct software stacks, the two paradigms share a common architectural skeleton and can be integrated. They suggest hybrid solutions that combine HPC‑style data management services (iRODS, SRM) with ABDS storage layers (HDFS, Hive) to reduce data movement costs while preserving high‑performance compute capabilities. Finally, the paper proposes that the set of Ogres become a standard benchmark suite for evaluating future hardware and software platforms across multiple dimensions—performance, scalability, expressivity, and ease of integration—thereby guiding the development of interoperable, next‑generation data‑intensive scientific infrastructures.