Towards accelerating Smoothed Particle Hydrodynamics simulations for free-surface flows on multi-GPU clusters

Starting from the single graphics processing unit (GPU) version of the Smoothed Particle Hydrodynamics (SPH) code DualSPHysics, a multi-GPU SPH program is developed for free-surface flows. The approach is based on a spatial decomposition technique, whereby different portions (sub-domains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with the use of Message Passing Interface (MPI) application programming interface (API) routines. The use of the sorting algorithm radix sort for inter-GPU particle migration and sub-domain halo building (which enables interaction between SPH particles of different sub-domains) is described in detail. With the resulting scheme it is possible, on the one hand, to carry out simulations that could also be performed on a single GPU, but now faster than on one of these devices alone. On the other hand, accelerated simulations can be performed with up to 32 million particles on the current architecture, which is beyond the limitations of a single GPU due to memory constraints. A study of weak and strong scaling behaviour, speedups and efficiency of the resulting program is presented, including an investigation to elucidate the computational bottlenecks. Last, possibilities for reduction of the effects of overhead on computational efficiency in future versions of our scheme are discussed.


💡 Research Summary

The paper presents a multi‑GPU extension of the SPH code DualSPHysics for simulating free‑surface flows, addressing the memory and performance limits of a single GPU. The authors adopt a spatial domain decomposition strategy: the physical domain is split into equally sized sub‑domains, each assigned to a separate GPU. Inter‑GPU communication is handled with MPI, while particle migration and halo‑region construction—critical for interactions across sub‑domain boundaries—are accelerated using a GPU‑native radix‑sort algorithm.
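The decomposition step above can be sketched as follows. This is an illustrative stand-in, not the authors' code: the function name `assign_subdomains` and the choice of splitting along the x-axis are assumptions made here to show how equally sized slabs map particles to GPU ranks.

```python
import numpy as np

def assign_subdomains(x, x_min, x_max, n_gpus):
    """Map each particle's x-coordinate to the index of the GPU whose
    equally sized slab [x_min + i*w, x_min + (i+1)*w) contains it."""
    w = (x_max - x_min) / n_gpus          # slab width
    idx = ((x - x_min) / w).astype(int)   # slab index per particle
    return np.clip(idx, 0, n_gpus - 1)    # keep particles at x_max in range

# Example: six particles spread over [0, 1), split across 3 GPUs
x = np.array([0.05, 0.2, 0.4, 0.55, 0.7, 0.95])
print(assign_subdomains(x, 0.0, 1.0, 3))  # → [0 0 1 1 2 2]
```

In an MPI program each rank would then keep only the particles whose slab index matches its own rank, exchanging the rest during the migration phase.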

Radix sort is employed to reorder particles by their IDs and spatial coordinates in O(N) time, enabling rapid identification of particles that cross sub‑domain borders and those that belong to the halo. After sorting, the halo particles are packed and exchanged with neighboring GPUs via non‑blocking MPI_Isend/MPI_Irecv calls. The authors overlap communication with computation by using CUDA streams and events, thereby reducing idle time. To improve memory access patterns, particle data are stored in a Structure‑of‑Arrays (SoA) layout, and only the minimal set of physical quantities needed for the halo exchange is transmitted.
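The sort-then-pack idea can be illustrated with a small sketch. On the GPU the reordering is done with a radix sort on integer cell keys; here NumPy's stable `argsort` stands in for it, and the function name `build_halo` and the flat `data` array are assumptions made for this example, not the paper's API.

```python
import numpy as np

def build_halo(cell, data, border_cells):
    """Reorder particles by cell key (radix sort on the GPU; a stable
    argsort stands in here), then gather the contiguous runs belonging
    to border cells into a packed send buffer for the halo exchange."""
    order = np.argsort(cell, kind="stable")    # radix-sort stand-in
    cell_sorted = cell[order]
    data_sorted = data[order]
    mask = np.isin(cell_sorted, border_cells)  # particles in halo cells
    return data_sorted[mask]                   # contiguous buffer to send

cell = np.array([2, 0, 1, 2, 0])               # cell key per particle
data = np.array([20.0, 0.0, 10.0, 21.0, 1.0])  # one SoA field, e.g. density
print(build_halo(cell, data, border_cells=[2]))  # → [20. 21.]
```

Because sorting makes each cell's particles contiguous in memory, the halo can be packed with a single gather per SoA field rather than a scattered copy, which is what makes the O(N) sort pay for itself.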

Performance is evaluated through both weak and strong scaling studies. In weak scaling, the number of particles per GPU is kept constant (≈4 million) while the number of GPUs is increased up to eight, achieving near‑linear speed‑up and allowing simulations of up to 32 million particles—far beyond the memory capacity of a single GPU. In strong scaling, a fixed problem of 64 million particles is solved on 1 to 16 GPUs; efficiency remains above 70 % even at 16 GPUs, with communication overhead becoming increasingly dominant as the GPU count grows. Detailed profiling reveals that the halo exchange, especially the radix‑sort and data packing/unpacking phases, consumes up to 30 % of total runtime when the free‑surface dynamics generate a high proportion of boundary particles (≈15 % of the total). Network latency and bandwidth limitations also contribute to the observed bottlenecks.

The authors discuss several avenues for further improvement. First, leveraging GPU‑to‑GPU direct communication technologies such as NVLink could eliminate many MPI calls and reduce latency. Second, a more aggressive overlap of communication and computation—using multiple CUDA streams and asynchronous MPI—could hide communication costs. Third, dynamic load balancing through adaptive sub‑domain repartitioning would address imbalances caused by highly non‑uniform particle distributions. Finally, they suggest exploring alternative halo construction methods (e.g., ghost‑particle replication) that might reduce the amount of data exchanged.

In conclusion, the study demonstrates that a carefully engineered combination of spatial domain decomposition, MPI, and radix‑sort based particle migration enables large‑scale SPH simulations on current multi‑GPU clusters. The framework scales efficiently up to 32 million particles on eight GPUs and retains reasonable efficiency on larger node counts, providing a practical pathway for researchers in coastal engineering, hydraulic power, and impact dynamics to perform high‑resolution free‑surface flow simulations that were previously infeasible on single‑GPU systems.