pMR: A high-performance communication library


On many parallel machines, the time LQCD applications spend in communication is a significant contribution to the total wall-clock time, especially in the strong-scaling limit. We present a novel high-performance communication library that can be used as a de facto drop-in replacement for MPI in existing software. Its lightweight design avoids some of the unnecessary overhead introduced by MPI, allowing us to improve the communication performance of applications without any algorithmic or complicated implementation changes. As a first real-world benchmark, we make use of the pMR library in the coarse-grid solve of the Regensburg implementation of the DD-$\alpha$AMG algorithm. On realistic lattices, we see an improvement of a factor of 2 in pure communication time and total execution time savings of up to 20%.


💡 Research Summary

The paper introduces pMR (pico Message Passing for RDMA), a lightweight communication library designed to replace MPI in high‑performance computing applications that are dominated by off‑chip communication. pMR targets InfiniBand FDR networks and exploits Remote Direct Memory Access (RDMA) directly, avoiding the overhead associated with typical MPI implementations such as internal threading, polymorphic dispatch, and frequent context switches. Implemented in C++11 with an optional C API, the library deliberately eliminates polymorphism and internal threads, and it adopts a persistent‑communication model where network resources (queues, memory registrations, etc.) are created once and reused throughout the program’s lifetime.

The core API consists of a Connection object that defines the target rank and tags, and two handle classes—SendWindow and RecvWindow—that manage registered send and receive buffers. Communication follows an init‑post‑wait sequence: buffers are initialized, the transfer is posted, and the program waits for completion. Because data must be copied from the application’s native buffers into RDMA‑registered buffers (and back after reception), the authors introduced multithreaded copy kernels to hide this overhead, especially on Intel Xeon Phi (KNC) coprocessors where memory bandwidth scales with the number of active cores.

To demonstrate practicality, the authors ported the coarse‑grid solver of the DD‑α AMG algorithm (used in Lattice QCD) from MPI to pMR. Two common MPI usage patterns are examined: persistent point‑to‑point (MPI_Send_init/MPI_Recv_init followed by MPI_Start/MPI_Wait) and non‑persistent point‑to‑point (MPI_Isend/MPI_Irecv followed by MPI_Wait). In both cases the migration requires only modest code changes: creation of a Connection, instantiation of SendWindow/RecvWindow objects, and replacement of the MPI calls with the init‑post‑wait sequence. The non‑persistent case demands an extra step because pMR is inherently persistent, but the authors show that the additional buffering can be minimized if necessary.

Benchmarking was performed on two clusters, QPACE 2 and QPACE B, each equipped with Intel Xeon Phi (KNC) coprocessors and InfiniBand interconnects. The coarse‑grid solve is communication‑bound, with halo exchanges and global reductions being the critical operations. By substituting MPI halo exchanges with pMR, the wall‑clock time spent in halo communication dropped by roughly a factor of two on both clusters, leading to overall runtime reductions of up to 20 % for the coarse‑grid solve. Scaling tests revealed that as the number of KNCs increases (16 → 256), the per‑node message size shrinks, further reducing halo time, but global reductions become the dominant cost, indicating a next target for optimization. An interesting observation is that on QPACE 2, which uses a Flexible Block Torus topology unknown to Intel MPI, pMR still delivers consistent gains, highlighting its topology‑agnostic design.

The authors conclude that pMR can deliver substantial performance improvements for communication‑intensive stencil‑type applications without requiring major algorithmic changes. Future work includes adding efficient global‑sum primitives, extending support to Intel Omni‑Path (used in the upcoming QPACE 3), and further refining the library to exploit new network topologies. In summary, pMR offers a pragmatic, high‑performance alternative to MPI for RDMA‑capable systems, demonstrating that a carefully engineered, lightweight communication layer can significantly accelerate real‑world scientific codes.

