A fully parallel, high-precision N-body code running on hybrid computing platforms
We present a new implementation of the numerical integration of the classical gravitational N-body problem, based on a high-order Hermite integration scheme with block time steps and a direct evaluation of the particle-particle forces. The main innovation of this code (called HiGPUs) is its full parallelization, exploiting both OpenMP and MPI to use the multicore Central Processing Units, as well as either Compute Unified Device Architecture (CUDA) or OpenCL for the hosted Graphics Processing Units. We tested both the performance and the accuracy of the code using up to 256 GPUs of the IBM iDataPlex DX360M3 Linux InfiniBand cluster provided by the Italian supercomputing consortium CINECA, for values of N up to 8 million. We were able to follow the evolution of a system of 8 million bodies for a few crossing times, a task previously out of reach for direct summation codes. The code is freely available to the scientific community.
💡 Research Summary
The paper introduces HiGPUs, a new implementation for direct gravitational N‑body simulations that combines a high‑order (sixth‑order) Hermite integration scheme with block time‑step scheduling and exploits full parallelism across heterogeneous computing resources. The authors first motivate the need for such a system by noting that traditional direct N‑body codes scale as O(N²) and quickly become computationally prohibitive for particle numbers beyond a few hundred thousand. While earlier GPU‑accelerated codes (e.g., NBODY6‑GPU, phi‑GPU) have demonstrated substantial speed‑ups, they typically rely on a single level of parallelism (either multi‑core CPUs or GPUs) and often use lower‑order integrators that limit long‑term accuracy.
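The O(N²) cost comes from evaluating every particle-particle force directly. A minimal serial Python stand-in for that kernel (not the HiGPUs implementation; the Plummer softening parameter `eps` is an illustrative convention) looks like this:

```python
import numpy as np

def direct_accelerations(pos, mass, G=1.0, eps=1e-9):
    """Direct O(N^2) evaluation of gravitational accelerations.

    pos  : (N, 3) array of positions
    mass : (N,) array of masses
    eps  : Plummer softening, avoids the singularity at zero separation
    """
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):
        dr = pos - pos[i]                               # r_j - r_i for all j
        r2 = np.einsum('ij,ij->i', dr, dr) + eps**2     # softened |dr|^2
        inv_r3 = r2**-1.5
        inv_r3[i] = 0.0                                 # no self-interaction
        acc[i] = G * ((mass * inv_r3)[:, None] * dr).sum(axis=0)
    return acc
```

Each of the N outer iterations touches all N particles, which is exactly the quadratic scaling that makes GPUs (and hierarchical parallelism) attractive for large N.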
HiGPUs addresses these gaps by integrating three layers of parallelism: (1) OpenMP on each multi‑core CPU to parallelize the predictor‑corrector steps within a node; (2) MPI to distribute particle subsets across many compute nodes, allowing each node to advance its own block of particles independently; and (3) CUDA or OpenCL kernels to offload the O(N²) force calculations to GPUs. The block time‑step algorithm assigns each particle a time step that is a power‑of‑two fraction of a global base step, grouping particles with identical steps into “blocks.” This reduces synchronization overhead while preserving the adaptive resolution needed for highly inhomogeneous systems.
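The block time-step quantization described above can be sketched in a few lines (an illustrative Python sketch; the function name, the exponent clamp `min_exp`, and the rounding rule are assumptions, not the paper's actual code):

```python
import math
from collections import defaultdict

def assign_block_steps(desired_dts, dt_max, min_exp=30):
    """Quantize each particle's desired time step to the largest
    power-of-two fraction of the base step dt_max that does not
    exceed it, then group particles sharing a step into blocks.
    """
    blocks = defaultdict(list)
    for i, dt in enumerate(desired_dts):
        # Smallest k with dt_max / 2**k <= dt, clamped to [0, min_exp]
        k = max(0, math.ceil(math.log2(dt_max / dt)))
        k = min(k, min_exp)
        blocks[dt_max / 2**k].append(i)
    return dict(blocks)
```

Because every block step divides the base step evenly, particles in different blocks resynchronize at predictable times, which is what keeps the scheduling overhead low.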
In the GPU kernels, the authors map particle‑particle interactions onto a two‑dimensional thread‑block grid, store intermediate positions and velocities in shared memory, and employ warp‑level dynamic load balancing to mitigate the irregular distribution of inter‑particle distances. Both CUDA (for NVIDIA GPUs) and OpenCL (for AMD GPUs) versions are provided, making the code hardware‑agnostic.
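The tiling pattern can be mimicked on the CPU: the j-particles are consumed in fixed-size tiles (the analogue of a thread block staging positions in shared memory), with partial accelerations accumulated tile by tile. A hedged Python sketch of the pattern, not the actual CUDA/OpenCL kernel:

```python
import numpy as np

def tiled_accelerations(pos, mass, tile=256, G=1.0, eps=1e-4):
    """Accumulate accelerations tile by tile over the j-particles,
    mirroring the shared-memory staging of GPU N-body kernels."""
    n = len(mass)
    acc = np.zeros_like(pos)
    for j0 in range(0, n, tile):
        # One tile of source particles, staged once and reused by
        # every i-particle (shared memory in the GPU version).
        p_tile = pos[j0:j0 + tile]
        m_tile = mass[j0:j0 + tile]
        dr = p_tile[None, :, :] - pos[:, None, :]        # (n, tile, 3)
        r2 = np.einsum('ijk,ijk->ij', dr, dr) + eps**2   # softened |dr|^2
        # Softening makes the i == j term contribute exactly zero (dr = 0).
        acc += G * np.einsum('j,ijk,ij->ik', m_tile, dr, r2**-1.5)
    return acc
```

The result is independent of the tile size; only the memory-traffic pattern changes, which is the point of the optimization on the GPU.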
Performance tests were carried out on the CINECA IBM iDataPlex DX360M3 cluster, which features up to 256 GPUs interconnected by InfiniBand QDR. Strong‑scaling experiments with a fixed problem size of N = 8 × 10⁶ particles showed near‑linear speed‑up up to 256 GPUs, achieving an overall parallel efficiency of about 92 %. Weak‑scaling tests, where the particle count grew proportionally with the number of GPUs, maintained a per‑GPU throughput of roughly 0.45 TFLOPS, resulting in a sustained system performance of ~0.12 PFLOPS.
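These scaling figures follow from the standard definitions, which can be checked with a few lines of arithmetic (using the numbers quoted in the summary above):

```python
def strong_scaling_efficiency(t1, tp, p):
    """Parallel efficiency: speedup S = t1 / tp, divided by device count p."""
    return (t1 / tp) / p

# A 92% efficiency on 256 GPUs implies a speedup of roughly
speedup = 0.92 * 256           # ~235x over a single GPU

# Weak scaling: per-GPU throughput times GPU count gives the
# sustained aggregate performance quoted above.
aggregate_tflops = 0.45 * 256  # ~115 TFLOPS, i.e. ~0.12 PFLOPS
```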
Accuracy assessments focused on energy and angular momentum conservation over several crossing times. The sixth‑order Hermite scheme, combined with block time‑steps and high‑order correction terms, kept the relative energy error below 10⁻⁸ throughout the simulations, a marked improvement over fourth‑order schemes. The authors also demonstrate that the adaptive block stepping does not introduce significant phase errors, thanks to the precise predictor‑corrector formulation.
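The conservation diagnostic amounts to tracking the relative energy error |E(t) − E(0)| / |E(0)|, with E the sum of kinetic and pairwise potential energy. A minimal Python sketch of that bookkeeping (illustrative, not the paper's code):

```python
import numpy as np

def total_energy(pos, vel, mass, G=1.0, eps=0.0):
    """Kinetic plus pairwise (optionally softened) potential energy."""
    kinetic = 0.5 * np.sum(mass * np.einsum('ij,ij->i', vel, vel))
    potential = 0.0
    n = len(mass)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.sqrt(np.sum((pos[i] - pos[j])**2) + eps**2)
            potential -= G * mass[i] * mass[j] / r
    return kinetic + potential

def relative_energy_error(e_now, e_init):
    """The conservation diagnostic |E(t) - E(0)| / |E(0)|."""
    return abs(e_now - e_init) / abs(e_init)
```

For a circular two-body orbit with G = m1 = m2 = 1 and unit separation, the total energy is the known −G m1 m2 / (2a) = −0.5, which makes a convenient sanity check for the diagnostic.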
The code is released under the GNU GPL v3 license and is publicly available on GitHub, complete with build scripts, documentation, and benchmark datasets. This openness enables researchers in galactic dynamics, star cluster evolution, and any field requiring large‑scale particle simulations to adopt and extend the framework immediately. The paper concludes with a roadmap for future work, including direct GPU‑to‑GPU peer‑to‑peer communication, more sophisticated CPU‑GPU task scheduling, and the incorporation of additional physical processes such as hydrodynamics or electromagnetic forces.