A novel algorithm for GPU-accelerated particle-mesh interactions implemented in the QUOKKA code
We present a novel, GPU-optimized algorithm for particle-mesh interactions in grid-based hydrodynamics simulations, designed for massively parallel architectures. This approach overcomes the inefficiency of particle neighbour searches or sorts across multiple GPU nodes by using a new “particle-mesh-particle” interaction scheme, which extends the particle-mesh method for self-gravity. The algorithm proceeds in two main stages: first, quantities exchanged between particles and the mesh – such as mass, energy, and momentum added by stellar feedback or removed by accretion onto a sink – are deposited into a buffer mesh equipped with ghost zones, where multiple contributions per cell are accumulated using atomic additions and then communicated across distributed memory ranks. In the second stage, the buffer states are applied to real mesh states, incorporating cell-wise limiters to enforce physical constraints such as positive density. We implement this scheme in the GPU-native radiation-magnetohydrodynamics code QUOKKA and validate it through a comprehensive suite of tests, including Bondi and Bondi-Hoyle accretion, and single and multiple supernova remnant evolution at varying spatial resolutions. We show that the algorithm achieves $\approx 50\%$ weak-scaling efficiency running on up to 8192 GPUs on the Frontier supercomputer. This scheme enables efficient, scalable particle-mesh coupling for GPU-optimized simulations of star formation and feedback in galaxies.
💡 Research Summary
The paper introduces a new GPU‑friendly algorithm for particle‑mesh coupling, termed the “particle‑mesh‑particle” (PMP) scheme, and demonstrates its implementation in the radiation‑magnetohydrodynamics code QUOKKA. Traditional particle‑mesh approaches on CPUs rely on neighbor searches or ordered processing of particles, both of which become bottlenecks on modern GPU architectures and especially across multiple GPU nodes where synchronization and communication are costly. The PMP algorithm eliminates the need for explicit neighbor lists and guarantees order‑independent results by separating the interaction into five deterministic stages.
1. Particle → Buffer Mesh – Each GPU allocates a buffer mesh that mirrors the hydrodynamic grid but includes enough ghost zones to cover the interaction kernel radius $r_K$. For every locally stored particle, the deposition function $\Delta U_{s,ijk}$ (mass, momentum, energy, etc.) is evaluated for all cells within the kernel and added to the buffer using atomic additions. This step is fully parallel on the GPU and avoids race conditions by design.
2. Inter‑GPU Summation – The buffer’s ghost zones are exchanged once per update using a high‑performance collective (e.g., NCCL all‑reduce or halo exchange). After this communication, each cell’s buffer contains the sum of contributions from all particles in the entire domain, $\Delta U_{\text{sum},ijk}$.
3. Limiter Application – Because multiple particles may modify the same cell, physical constraints (non‑negative density, maximum supernova energy budget, etc.) are enforced by a cell‑wise limiter function $f_{\text{lim}}$. The limiter produces a scaled change $\Delta U_{\text{lim},ijk}=f_{\text{lim}}(U,\Delta U_{\text{sum}})$ and stores the ratio $\eta_{ijk}=\Delta U_{\text{lim},ijk}/\Delta U_{\text{sum},ijk}$ for later use.
4. Particle Update – Using the same deposition kernel, each particle recomputes its contribution but now multiplies it by the stored $\eta_{ijk}$ values. This guarantees that the particle’s mass, momentum, or other conserved quantities are updated consistently with the limited changes applied to the mesh, preserving total conservation.
5. Real Mesh Update – Finally, the limited changes are applied to the actual hydrodynamic state: $U_{ijk}\leftarrow U_{ijk}+\Delta U_{\text{lim},ijk}$. Because the limiter has already ensured physical admissibility, the mesh remains stable and the overall scheme is deterministic.
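The five stages can be condensed into a toy, single-rank sketch. The code below (plain Python/NumPy, not QUOKKA's actual API; the 1-D mesh, one-cell deposition kernel, and all names are illustrative assumptions) deposits accretion from two overlapping sink particles, limits the change so density stays non-negative, and feeds the limited fraction $\eta$ back to the particles:

```python
import numpy as np

# Toy 1-D, single-rank version of the five-stage PMP update. All names
# are illustrative, not QUOKKA's API. Two sinks try to accrete more mass
# than their shared cell holds; the limiter scales both back consistently.

NCELL = 8
rho = np.full(NCELL, 1.0)          # real mesh state (density)

# Stage 1: particle -> buffer mesh. Here the kernel covers one cell with
# weight 1; on the GPU the += would be an atomicAdd.
particles = [
    {"cell": 3, "dU": -0.7},       # sink removing 0.7 from cell 3
    {"cell": 3, "dU": -0.6},       # second sink, same cell (overlap!)
]
buffer = np.zeros(NCELL)           # stands in for the ghost-zone buffer
for p in particles:
    buffer[p["cell"]] += p["dU"]

# Stage 2: inter-GPU summation (halo exchange / reduction) would go here;
# with one rank the buffer already holds dU_sum.

# Stage 3: cell-wise limiter -- never remove more mass than the cell has.
dU_lim = np.maximum(buffer, -rho)  # keeps rho + dU_lim >= 0
eta = np.ones(NCELL)
nz = buffer != 0.0
eta[nz] = dU_lim[nz] / buffer[nz]  # ratio eta_ijk = dU_lim / dU_sum

# Stage 4: particle update -- re-apply the same kernel, scaled by eta,
# so particles gain exactly what the mesh loses.
for p in particles:
    p["accreted"] = -p["dU"] * eta[p["cell"]]

# Stage 5: apply the limited changes to the real mesh.
rho += dU_lim

total_accreted = sum(p["accreted"] for p in particles)
print(rho[3], total_accreted)      # cell emptied but never negative
```

Note that the result is independent of the order in which the two sinks are processed: both deposit into the buffer first, and both see the same $\eta$.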
Implementation details are carefully described. The authors use CUDA atomic operations for the deposition step, NCCL for the single‑communication reduction, and a deterministic reduction order to achieve bit‑wise reproducibility across runs. Ghost‑zone handling is generalized to adaptive‑mesh refinement, and the buffer size is modest (typically four ghost layers).
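The need for a fixed reduction order can be seen in a few lines of Python (a generic floating-point illustration with arbitrary values, not QUOKKA code):

```python
# Floating-point addition is not associative, so the same set of per-cell
# contributions summed in a different order can differ in the last bits
# (or, as here, dramatically). Fixing the traversal order is what makes
# runs bit-wise reproducible. The values below are arbitrary.
vals = [0.1, 0.2, 0.3, 1e16, -1e16]

forward = 0.0
for v in vals:                 # one traversal order...
    forward += v

backward = 0.0
for v in reversed(vals):       # ...and the reverse order
    backward += v

print(forward, backward)       # the two "sums" disagree
```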
The algorithm is validated with a suite of astrophysical tests. Bondi and Bondi‑Hoyle accretion problems show convergence to analytic solutions with errors below 1% across resolutions. Supernova feedback tests – both single and multiple explosions – demonstrate correct shock propagation, energy conservation, and, crucially, independence from the order in which exploding particles are processed. The limiter successfully prevents negative densities when several sink particles accrete from the same cell.
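For context, the analytic rate the Bondi tests converge to is $\dot{M}=4\pi\lambda\,\rho_\infty (GM)^2/c_\infty^3$ (Bondi 1952). A hedged sketch in code units; the choice $\lambda=1/4$ (the $\gamma=5/3$ value) and the "simulated" rate are illustrative assumptions, not numbers from the paper:

```python
import math

def bondi_rate(M, rho_inf, c_inf, G=1.0, lam=0.25):
    """Analytic Bondi rate, Mdot = 4*pi*lam*rho_inf*(G*M)^2 / c_inf^3.
    lam depends on the equation of state; lam = 1/4 assumes gamma = 5/3."""
    return 4.0 * math.pi * lam * rho_inf * (G * M) ** 2 / c_inf ** 3

# In code units (G = M = rho_inf = c_inf = 1) the rate reduces to pi.
analytic = bondi_rate(M=1.0, rho_inf=1.0, c_inf=1.0)

# A hypothetical measured rate within the paper's stated 1% tolerance:
simulated = 0.995 * analytic
rel_err = abs(simulated - analytic) / analytic
print(analytic, rel_err)
```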
Performance scaling is measured on the Frontier supercomputer. Using up to 8192 GPUs, the PMP scheme achieves roughly 50% weak‑scaling efficiency, a 3–5× speed‑up compared with traditional CPU‑based particle‑mesh implementations. Communication overhead remains low because only one halo exchange per timestep is required, and the algorithm’s compute‑to‑communication ratio improves with problem size. Memory overhead is limited to the additional buffer mesh and ghost zones (≈20% extra).
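As a reminder of the metric, weak scaling holds the work per GPU fixed and compares wall time at scale against a reference run; the timings below are placeholders, not measurements from the paper:

```python
# Weak scaling: an ideal run keeps wall time constant as GPUs are added,
# so efficiency at N GPUs is E(N) = t_ref / t_N. Placeholder timings only.
def weak_scaling_efficiency(t_ref, t_n):
    return t_ref / t_n

print(weak_scaling_efficiency(10.0, 20.0))  # doubled wall time -> 50%
```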
In summary, the paper contributes (i) a novel, order‑independent particle‑mesh coupling method that avoids costly neighbor searches, (ii) a practical GPU‑native implementation that scales to thousands of GPUs, and (iii) a demonstration that complex, non‑linear feedback processes (sink accretion, supernova energy injection) can be handled robustly in large‑scale galaxy formation simulations. The authors argue that the PMP framework is broadly applicable to other GPU‑intensive domains such as particle‑in‑cell plasma simulations, cosmological N‑body gravity solvers, and any scenario where particles act as sources or sinks for mesh‑based conserved fields.