Introducing Molly: Distributed Memory Parallelization with LLVM
Programming for distributed memory machines has always been a tedious task, but a necessary one, because compilers have not been able to optimize sufficiently for such machines themselves. Molly is an extension to the LLVM compiler toolchain that can distribute and reorganize workload and data if the program's loop control flow can be determined statically. Such loops are represented as polyhedral integer-point sets, to which program transformations can be applied. Memory distribution and layout can be declared by the programmer as needed, and the necessary asynchronous MPI communication is generated automatically. The primary motivation is to run Lattice QCD simulations on IBM Blue Gene/Q supercomputers, but since the implementation is not yet complete, this paper demonstrates the capabilities on Conway’s Game of Life.
💡 Research Summary
The paper introduces “Molly,” an extension to the LLVM compiler infrastructure that automatically distributes computation and data across a distributed‑memory system and generates the necessary asynchronous MPI communication code. The motivation stems from the observation that, despite the prevalence of MPI, programmers still have to manually restructure code for each target architecture, a process that is both error‑prone and time‑consuming. Molly targets programs whose loop control flow can be expressed statically; such loops are represented as polyhedral integer‑point sets (SCoPs) which enable powerful affine transformations and dependence analysis.
Molly’s workflow begins with the programmer declaring distributed arrays using a custom C++ template, molly::array. These arrays are annotated with metadata describing element type, dimensions, and intended distribution. The compiler front‑end (Clang) translates array accesses into a new LLVM intrinsic, llvm.molly.ptr, which returns a “remote‑aware” pointer. This pointer can only be used in load and store instructions; any pointer arithmetic is prohibited, ensuring that the compiler retains full control over where data resides.
The core of Molly is a module‑level LLVM pass that intercepts Polly’s SCoP detection. Polly identifies static control parts, extracts iteration domains, and builds an initial schedule. Molly then augments the SCoP with two virtual statements – a prologue that writes out all data needed before the SCoP and an epilogue that reads back results after the SCoP. Using the Integer Set Library (ISL), Molly computes read‑after‑write (RAW) dependencies between statements, ignoring write‑after‑write and write‑after‑read because it does not reorder instructions, only decides where they execute.
Execution placement is decided by a “where‑mapping” function π_S. By default a statement is placed on the same node that owns its input data; statements that produce data required by the epilogue are placed on the data’s home node, while statements that generate intermediate scalar values are placed on all nodes that may consume them. This placement information, together with the dependency graph, drives the generation of communication code.
Communication is handled by grouping dependent data elements into “chunks.” A chunk is a set of array elements that can be packed into a single MPI buffer and sent together, provided there is no direct or indirect dependency between elements in the same chunk. The chunking function ϕ maps each statement instance to a representative instance, forming equivalence classes that define the chunks. For each pair of producer‑consumer statements that execute on different nodes, Molly inserts non‑blocking MPI_Isend and MPI_Irecv calls for the appropriate chunks, followed by the necessary MPI_Wait or MPI_Test to ensure correctness. This approach reduces the overhead of sending many tiny messages and exploits the bandwidth of high‑performance interconnects.
The current implementation supports only a block distribution with a constant block size: the mapping i = p·(N/P) + l is affine only when the block length N/P is a compile‑time constant, because a product of two unknowns (p and N/P) is not affine. Consequently, the number of processors and the geometry of the compute nodes must be known at compile time. The authors acknowledge this limitation and propose future work to parameterize SCoPs, allowing runtime determination of processor topology and supporting more sophisticated distributions such as block‑cyclic, 2‑D block, or torus layouts.
To demonstrate functionality, the authors apply Molly to a reduced version of Conway’s Game of Life, a 2‑D cellular automaton with a 5‑point stencil. The program declares two distributed boolean arrays, front and back, and iterates 100 steps, swapping the arrays after each step. Molly automatically transforms the nested loops into a distributed SCoP, inserts the appropriate MPI communication for halo exchanges, and generates executable code that runs correctly on a multi‑node cluster. Although performance results are preliminary, the experiment validates that Molly can correctly generate communication code and respect data dependencies without manual MPI programming.
In summary, Molly represents a novel integration of polyhedral compilation techniques with automatic MPI code generation. By handling data distribution, execution placement, and communication chunking within the compiler, it promises to reduce the programmer’s burden in high‑performance scientific applications such as Lattice QCD. The paper outlines current constraints—compile‑time topology, limited distribution schemes, and lack of support for dynamic pointer swaps—and suggests a roadmap for extending the framework to more general use cases. If fully realized, Molly could become a valuable tool for achieving portable performance across heterogeneous supercomputing platforms.