LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis
📝 Abstract
The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce parallel programming model to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming capability in one line of code. LLMapReduce supports all programming languages and many schedulers. LLMapReduce can work with any application without the need to modify the application. Furthermore, LLMapReduce can overcome scaling limits in the map-reduce parallel programming model via options that allow the user to switch to the more efficient single-program-multiple-data (SPMD) parallel programming model. These features allow users to reduce the computational overhead by more than 10x compared to standard map-reduce for certain applications. LLMapReduce is widely used by hundreds of users at MIT. Currently LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF.
📄 Content
LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis
Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther
MIT Lincoln Laboratory, Lexington, MA, U.S.A.
Keywords—LLMapReduce; map-reduce; performance; scheduler; Grid Engine; SLURM; LSF
I. INTRODUCTION
Large scale computing is currently dominated by four ecosystems: supercomputing, database, enterprise, and big data [1]. Each of these ecosystems has its strengths.

The supercomputing ecosystem provides the highest performing computing capabilities in the world via a range of well-established, highly-optimized, specially-designed high performance technologies. Among these technologies are high performance messaging libraries (e.g., MPI [2, 3]) that utilize high performance interconnects (e.g., InfiniBand [4], IBM Blue Gene interconnects [5], Cray interconnects [6]), high performance math libraries (e.g., BLAS [7, 8], LAPACK [9], ScaLAPACK [10]) designed to exploit special processing hardware, high performance parallel file systems (e.g., Lustre [11], GPFS [12]), and high performance schedulers (e.g., LSF [13], Grid Engine [14, 15, 16], SLURM [17, 18]). Combined, these technologies consistently achieve near-peak speedups on well-established benchmarks (e.g., HPC Challenge [19]) and many applications. On the largest systems in the world these speedups can be in the millions.
Relational or SQL (Structured Query Language) database ecosystems [20, 21] have been the de facto interface to databases since the 1980s and are the backbone of electronic transactions around the world. Enterprise ecosystems often exploit virtualization technologies (e.g., VMware) to deliver a vast range of web services such as e-mail, calendar, document sharing, videos, and product information. Big data ecosystems represent non-traditional, relaxed-consistency, triple-store databases which provide high performance on commodity computing hardware to I/O-intensive data mining applications with low data modification requirements. These databases, which are the backbone of many web companies, include Google Big Table [22], Amazon Dynamo [23], Cassandra [24], and HBase [25].
The MIT Lincoln Laboratory Supercomputing Center (LLSC) provides supercomputing capabilities to over 1000 users at MIT [26, 27]. Increasingly, these users require capabilities that are found in all four ecosystems. LLSC has developed the MIT SuperCloud environment that allows all four ecosystems to run on the same hardware without sacrificing performance [1]. The MIT SuperCloud has spurred the development of a number of cross-ecosystem innovations in high performance databases [28, 29], database management [30], database federation [31, 32, 33], data analytics [34], data protection [35], and system monitoring [36, 37].
One of the most impactful MIT SuperCloud innovations has been the LLMapReduce environment, which is used by a large fraction of the LLSC user base to perform map-reduce computations on LLSC supercomputers. The map-reduce parallel programming model is one of the oldest parallel programming models. The map-reduce name derives from the "map" and "reduce" functions found in Common Lisp since the 1990s. Popularized by Google [38] and Apache Hadoop [39], map-reduce has become a staple technology of the ever-growing big data community. Traditionally, the big data community has relied on inexpensive clusters for the majority of its computational needs. Recently, it has become apparent that many big data workloads can benefit from the superior performance and ease-of-use offered by supercomputers [29, 30]. Many LLSC users are recent graduates and have prio
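As a concrete illustration of the map-reduce model described above, the sketch below implements a word count in plain Python: each input record is mapped to partial counts independently (the phase that a scheduler would run in parallel), and a reducer merges the partial results. The input lines are invented for the example, and this is a generic illustration of the programming model, not LLMapReduce's actual interface.

```python
from functools import reduce
from collections import Counter

# Hypothetical input records; in a real map-reduce job each record would
# be a line (or file) handed to a separate mapper task.
lines = [
    "map reduce on a supercomputer",
    "map functions run in parallel",
    "reduce combines the partial results",
]

def mapper(line):
    """Map phase: transform one record into partial word counts.

    Each call depends only on its own record, which is what makes the
    map phase embarrassingly parallel.
    """
    return Counter(line.split())

# Here the map phase runs sequentially; a scheduler would fan these
# calls out across many compute nodes.
partial_counts = [mapper(line) for line in lines]

def reducer(a, b):
    """Reduce phase: merge two partial results into one."""
    return a + b  # Counter addition sums the counts per word

total = reduce(reducer, partial_counts)
print(total["map"], total["reduce"])  # prints: 2 2
```

Because the reducer is associative, the merge step can itself be organized as a tree across nodes rather than a single sequential fold, which is how map-reduce systems keep the reduce phase from becoming a bottleneck.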