LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis


📝 Abstract

The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce parallel programming model to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming capability in one line of code. LLMapReduce supports all programming languages and many schedulers. LLMapReduce can work with any application without the need to modify the application. Furthermore, LLMapReduce can overcome scaling limits in the map-reduce parallel programming model via options that allow the user to switch to the more efficient single-program-multiple-data (SPMD) parallel programming model. These features allow users to reduce the computational overhead by more than 10x compared to standard map-reduce for certain applications. LLMapReduce is widely used by hundreds of users at MIT. Currently LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF.
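The map-reduce model the abstract refers to can be sketched in plain Python. This is a generic illustration of the programming model itself, not the LLMapReduce interface: a "map" function is applied independently to each input record, and a "reduce" function combines the mapped results.

```python
from functools import reduce

# Map phase: apply an independent function to every input record.
records = [1, 2, 3, 4, 5]
mapped = list(map(lambda x: x * x, records))  # → [1, 4, 9, 16, 25]

# Reduce phase: combine the mapped results with an associative operator.
total = reduce(lambda a, b: a + b, mapped)  # → 55
```

Because each map invocation is independent, a scheduler can run the map phase as many parallel tasks, which is the property LLMapReduce exploits on a supercomputer.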

📄 Content

LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis

Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther

MIT Lincoln Laboratory, Lexington, MA, U.S.A.

Keywords—LLMapReduce; map-reduce; performance; scheduler; Grid Engine; SLURM; LSF

I. INTRODUCTION

Large scale computing is currently dominated by four ecosystems: supercomputing, database, enterprise, and big data [1]. Each of these ecosystems has its strengths. The supercomputing ecosystem provides the highest performing computing capabilities in the world via a range of well-established, highly optimized, specially designed high performance technologies. Among these technologies are high performance messaging libraries (e.g., MPI [2, 3]) that utilize high performance interconnects (e.g., InfiniBand [4], IBM Blue Gene interconnects [5], Cray interconnects [6]), high performance math libraries (e.g., BLAS [7, 8], LAPACK [9], ScaLAPACK [10]) designed to exploit special processing hardware, high performance parallel file systems (e.g., Lustre [11], GPFS [12]), and high performance schedulers (e.g., LSF [13], Grid Engine [14, 15, 16], SLURM [17, 18]). Combined, these technologies consistently achieve near-peak speedups on well-established benchmarks (e.g., HPC Challenge [19]) and many applications. On the largest systems in the world these speedups can be in the millions. Relational or SQL (Structured Query Language) database ecosystems [20, 21] have been the de facto interface to databases since the 1980s and are the backbone of electronic transactions around the world.
Enterprise ecosystems often exploit virtualization technologies (e.g., VMware) to deliver a vast range of web services such as e-mail, calendar, document sharing, videos, and product information. Big data ecosystems represent non-traditional, relaxed consistency, triple store databases which provide high performance on commodity computing hardware to I/O intensive data mining applications with low data modification requirements. These databases, which are the backbone of many web companies, include Google Big Table [22], Amazon Dynamo [23], Cassandra [24], and HBase [25].
The MIT Lincoln Laboratory Supercomputing Center (LLSC) provides supercomputing capabilities to over 1000 users at MIT [26, 27]. Increasingly, these users require capabilities that are found in all four ecosystems. LLSC has developed the MIT SuperCloud environment that allows all four ecosystems to run on the same hardware without sacrificing performance [1]. The MIT SuperCloud has spurred the development of a number of cross-ecosystem innovations in high performance databases [28, 29], database management [30], database federation [31, 32, 33], data analytics [34], data protection [35], and system monitoring [36, 37]. One of the most impactful MIT SuperCloud innovations has been the LLMapReduce environment that is used by a large fraction of the LLSC user base to perform map-reduce computations on LLSC supercomputers.

The map-reduce parallel programming model is one of the oldest parallel programming models. The map-reduce name derives from the "map" and "reduce" functions found in Common Lisp since the 1990s. Popularized by Google [38] and Apache Hadoop [39], map-reduce has become a staple technology of the ever-growing big data community. Traditionally, the big data community has relied on inexpensive clusters for the majority of its computational needs. Recently, it has become apparent that many big data workloads can benefit from the superior performance and ease-of-use offered by supercomputers [29, 30]. Many LLSC users are recent graduates and have prio
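The abstract notes that LLMapReduce can switch from standard map-reduce to an SPMD-style model to cut overhead. One intuition behind that saving can be sketched generically (this is an illustration of the idea, not LLMapReduce's actual implementation): instead of the scheduler launching one task per input record, each of a fixed number of long-lived workers processes a contiguous chunk of the inputs, so launch overhead is paid per worker rather than per record.

```python
def chunks(items, nworkers):
    """Split items into nworkers nearly equal contiguous chunks,
    one chunk per long-lived worker process."""
    k, r = divmod(len(items), nworkers)
    out, start = [], 0
    for i in range(nworkers):
        # The first r chunks absorb one extra item each.
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return out

# 10 input records, 3 workers: 3 task launches instead of 10.
work = list(range(10))
print(chunks(work, 3))  # → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With N inputs and P workers, per-task scheduler overhead drops from N launches to P, which is where an order-of-magnitude reduction can come from when launch cost dominates short map tasks.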
