Coalesced communication: a design pattern for complex parallel scientific software
We present a new design pattern for high-performance parallel scientific software, named coalesced communication. The pattern provides a structured way to improve communication performance by coalescing multiple communication needs through two communication management components. We apply the pattern to several simulations of a lattice-Boltzmann blood flow solver with streaming visualisation, yielding a reduction in communication overhead of approximately 40%.
Research Summary
The paper introduces a novel design pattern called coalesced communication, aimed at reducing communication overhead in high-performance parallel scientific applications. The authors begin by diagnosing a common performance bottleneck: in many MPI-based codes, different components (e.g., solvers, I/O modules, visualisation pipelines) issue independent point-to-point messages. These messages are often small, frequent, and scattered across the execution timeline, leading to poor bandwidth utilisation, increased latency due to many round-trip exchanges, and a fragmented communication schedule that hampers scalability.
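The latency penalty of many small messages can be made concrete with a standard alpha-beta cost model. This is a back-of-envelope sketch, not taken from the paper; the `alpha` (per-message latency) and `beta` (per-byte transfer time) values are illustrative placeholders, not measured figures.

```python
def separate_cost(n_messages, bytes_each, alpha=1e-6, beta=1e-9):
    """Time to send n_messages of bytes_each bytes individually.

    alpha: per-message latency (s); beta: per-byte transfer time (s/B).
    Each message pays the latency term alpha once.
    """
    return n_messages * (alpha + bytes_each * beta)

def coalesced_cost(n_messages, bytes_each, alpha=1e-6, beta=1e-9):
    """Time to send the same total payload as one packed message.

    The latency term alpha is paid only once, saving (n-1) * alpha.
    """
    return alpha + n_messages * bytes_each * beta

# Ten 1 KiB messages per iteration, as in a fragmented schedule:
separate = separate_cost(10, 1024)
merged = coalesced_cost(10, 1024)
assert merged < separate  # coalescing always wins under this model
```

The model also shows when coalescing matters most: for large messages the bandwidth term dominates and the saving shrinks, which is consistent with the gains being concentrated in codes dominated by small, frequent exchanges.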
To address this, the pattern defines two dedicated management components: a Communication Scheduler and a Message Coalescer. The Scheduler acts as a declarative registry for all communication intents. At program start-up or at well-defined synchronisation points, each module registers its send/receive requirements together with metadata such as destination rank, data size, datatype, priority, and desired execution phase. The Scheduler aggregates these intents, groups them by target rank and compatible timing, and constructs a global communication plan that determines when each aggregated phase will be executed.
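The registration-and-grouping step can be sketched as follows. This is a minimal illustrative model of the Scheduler's role, not the paper's API; the names `CommIntent`, `Scheduler`, and `build_plan` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CommIntent:
    """One module's declared communication need (illustrative fields)."""
    module: str     # registering component, e.g. "lbm" or "viz"
    dest_rank: int  # target MPI rank
    size: int       # payload size in bytes
    phase: str      # desired execution phase, e.g. "halo"

class Scheduler:
    def __init__(self):
        self._intents = []

    def register(self, intent: CommIntent):
        """Modules declare intents instead of sending immediately."""
        self._intents.append(intent)

    def build_plan(self):
        """Group intents by (phase, destination rank): each group can
        later be packed into one coalesced message by the Coalescer."""
        plan = defaultdict(list)
        for intent in self._intents:
            plan[(intent.phase, intent.dest_rank)].append(intent)
        return dict(plan)

sched = Scheduler()
sched.register(CommIntent("lbm", dest_rank=1, size=4096, phase="halo"))
sched.register(CommIntent("viz", dest_rank=1, size=512, phase="halo"))
plan = sched.build_plan()
# Two intents targeting rank 1 in the same phase collapse into one group.
assert len(plan[("halo", 1)]) == 2
```

A real implementation would also order the groups by priority and compatible timing; the grouping key is the essential mechanism.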
The Message Coalescer then implements the plan. For each aggregated phase, it gathers all pending payloads destined for the same rank, packs them into a contiguous buffer, performs any necessary datatype conversion or alignment, and issues a single non-blocking MPI_Isend/MPI_Irecv pair. Upon completion, a lightweight callback or future resolves the original logical requests, allowing the application to continue as if the individual messages had been sent separately. The coalescer also employs buffer pooling to minimise memory allocation overhead and reduces data copies by allowing modules to write directly into the coalesced buffer when possible.
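The pack-and-split step can be illustrated with a simple length-prefixed layout. This wire format is an assumption for illustration, not the paper's actual scheme: each payload is prefixed with a 4-byte length header so the receiver can split the contiguous buffer back into the original logical messages.

```python
import struct

def pack(payloads):
    """Merge several byte payloads into one contiguous send buffer."""
    buf = bytearray()
    for p in payloads:
        buf += struct.pack("<I", len(p))  # little-endian length header
        buf += p
    return bytes(buf)

def unpack(buf):
    """Recover the individual logical payloads on the receiving side."""
    payloads, offset = [], 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        payloads.append(buf[offset:offset + length])
        offset += length
    return payloads

msgs = [b"lbm-halo-east", b"lbm-halo-west", b"viz-frame-slice"]
wire = pack(msgs)            # one buffer -> one send in the pattern
assert unpack(wire) == msgs  # round-trips losslessly
```

In the real pattern the packed buffer would be handed to a single MPI_Isend, and `unpack` on the receiving rank would resolve each logical request's callback or future.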
The authors validate the pattern with a concrete case study: a lattice-Boltzmann method (LBM) blood-flow solver coupled with an on-the-fly streaming visualisation component. In a traditional implementation, the LBM step requires eight neighbour-exchange messages per time step, while the visualisation thread issues additional messages to a dedicated rendering process, resulting in a total of ten distinct MPI calls per iteration. By applying coalesced communication, the Scheduler merges the LBM neighbour exchanges and the visualisation transfers into two global phases, and the Coalescer reduces the ten small messages to two large ones. Benchmarks on a Cray XC30 system (up to 2048 cores) show an average 40% reduction in total communication time, with the most pronounced gains at high core counts where network contention is severe. Strong-scaling plots reveal that the pattern not only cuts raw communication time but also improves overall application scalability, as the reduced number of synchronisation points lessens idle time across ranks.
Beyond the case study, the paper argues for the pattern's generality. Many scientific domains (computational fluid dynamics, molecular dynamics, astrophysical N-body simulations) exhibit similar patterns of scattered, fine-grained communication. By abstracting communication intent into a declarative API, developers can retrofit existing codes with minimal intrusion: a thin wrapper around MPI calls registers intents, while the Scheduler and Coalescer are linked as separate libraries. This separation of concerns also enhances maintainability; new communication requirements can be added by updating the registration calls without touching the core scheduler logic.
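The retrofit idea can be sketched as a thin wrapper whose call shape mirrors an immediate point-to-point send but which records an intent instead. The names `deferred_send` and `pending` are hypothetical stand-ins for the wrapper and the Scheduler's registry.

```python
pending = []  # stands in for the Scheduler's intent registry

def deferred_send(dest_rank, payload, phase="default"):
    """Drop-in replacement for an immediate send: record the intent
    and return a handle (here simply an index) that is resolved when
    the aggregated phase actually executes."""
    pending.append({"dest": dest_rank, "data": payload, "phase": phase})
    return len(pending) - 1

# Existing call sites change minimally, e.g.
#   send(1, halo_buf)  ->  deferred_send(1, halo_buf, phase="halo")
h1 = deferred_send(1, b"halo-east", phase="halo")
h2 = deferred_send(1, b"frame", phase="halo")
assert pending[h1]["dest"] == pending[h2]["dest"] == 1
```

The point of the wrapper is that the solver and visualisation modules need no knowledge of each other: both register through the same narrow interface, and the coalescing happens entirely behind it.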
The authors discuss implementation details such as handling heterogeneous data types, ensuring thread safety when multiple modules register concurrently, and integrating with existing performance tools (e.g., TAU, HPCToolkit) to visualise the aggregated communication schedule. They also outline potential extensions, including dynamic re-scheduling based on runtime load imbalance, hierarchical coalescing for multi-level networks (e.g., intra-node vs. inter-node), and automatic tuning of buffer sizes using machine-learning models.
In conclusion, the coalesced communication design pattern provides a systematic, reusable framework for consolidating disparate communication needs in parallel scientific software. By centralising intent registration, globally planning communication phases, and physically merging messages, the pattern achieves substantial reductions in communication overhead (approximately 40% in the presented LBM-visualisation benchmark) and improves scalability without imposing heavy code refactoring. The work positions coalesced communication as a practical tool for developers seeking to extract more performance from modern HPC systems while preserving code readability and extensibility.