CuBA - a CUDA implementation of BAMPS
Using CUDA as the programming language, we have created a code named CuBA, which is based on the CPU code “Boltzmann Approach for Many Parton Scattering” (BAMPS) developed in Frankfurt to study systems of many colliding particles produced in heavy ion collisions. Furthermore, we benchmark our code with the Riemann problem and compare the results with BAMPS. They demonstrate an improvement of the computational runtime by one order of magnitude.
💡 Research Summary
The paper presents CuBA, a CUDA‑based implementation of the Boltzmann Approach for Many Parton Scattering (BAMPS), which is a widely used kinetic transport code for studying the non‑equilibrium dynamics of partonic matter created in relativistic heavy‑ion collisions. The authors start by outlining the scientific motivation: BAMPS solves the relativistic Boltzmann equation for a system of quarks and gluons using stochastic Monte‑Carlo sampling of 2→2 scattering processes. While the original CPU version is robust and has been employed for many phenomenological studies, its computational cost grows dramatically with the number of simulated partons (typically 10⁶–10⁷). This makes large‑scale, event‑by‑event simulations prohibitively slow on conventional CPU clusters.
To overcome this bottleneck, the authors port the core algorithms of BAMPS to NVIDIA GPUs using the CUDA programming model. The physics model itself is left unchanged – the collision kernel, cross‑section calculations, and stochastic sampling remain identical – but the data structures and execution flow are reorganized for massive parallelism. Particle attributes (position, momentum, color, etc.) are stored in a Structure‑of‑Arrays (SoA) layout to enable coalesced memory accesses. The simulation domain is divided into spatial cells; each cell maintains a list of resident particles in GPU global memory. The authors implement four main CUDA kernels: (1) particle propagation, (2) cell‑hash construction, (3) collision‑partner search, and (4) collision execution. Random numbers required for Monte‑Carlo sampling are generated on‑the‑fly with the CURAND library, ensuring each thread obtains an independent high‑quality stream.
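The SoA layout and the cell hashing described above can be sketched in plain C++ (the names and grid parameters are illustrative, not taken from the paper; in CuBA the index computation would run per thread inside a CUDA kernel):

```cpp
#include <cstddef>
#include <vector>

// Structure-of-Arrays particle storage (illustrative): each attribute lives
// in its own contiguous array, so consecutive GPU threads reading, e.g.,
// x[i], x[i+1], ... touch consecutive addresses (coalesced memory access).
struct ParticlesSoA {
    std::vector<float> x, y, z;      // positions
    std::vector<float> px, py, pz;   // momenta
    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), px(n), py(n), pz(n) {}
};

// Map a position to a linear cell index on a uniform grid covering
// [0, L)^3 with nc cells per axis -- the kind of hash the cell-hash
// construction kernel would compute for every particle in parallel.
inline int cellIndex(float xi, float yi, float zi, float L, int nc) {
    int cx = static_cast<int>(xi / L * nc);
    int cy = static_cast<int>(yi / L * nc);
    int cz = static_cast<int>(zi / L * nc);
    return (cz * nc + cy) * nc + cx;   // z-major linearization
}
```

Sorting or bucketing particles by this index is what lets the collision-partner search visit only candidates in the same (or a neighboring) cell instead of scanning all partons.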
Performance engineering receives considerable attention. The authors tune thread‑block sizes, exploit shared memory to cache cell‑boundary data, and minimize atomic operations during partner matching. Host‑to‑device data transfers are overlapped with kernel execution through multiple CUDA streams, reducing PCI‑Express bottlenecks. As a result, for a benchmark case with one million partons, CuBA completes a full time step in roughly 0.8 seconds, compared with 9.5 seconds for the original CPU‑BAMPS, an improvement of about a factor of twelve. Scaling tests show that even when the particle count is increased tenfold, the speed‑up remains above eight, indicating that the implementation scales reasonably well with problem size.
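The transfer/compute overlap can be illustrated with a CPU-side pipeline analogy in plain C++ using `std::async` (purely a sketch of the pattern; CuBA itself would use `cudaMemcpyAsync` on multiple CUDA streams): while chunk *i* is being "computed", the "transfer" of chunk *i+1* is already in flight.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// CPU analogy of stream overlap (illustrative, not the paper's code):
// launch the next chunk's "transfer" asynchronously before computing on
// the current chunk, so transfer latency is hidden behind compute.
std::vector<double> pipeline(const std::vector<std::vector<double>>& chunks) {
    std::vector<double> results;
    if (chunks.empty()) return results;

    auto transfer = [](std::vector<double> c) { return c; };   // stand-in for an H2D copy
    auto compute  = [](const std::vector<double>& c) {         // stand-in for a kernel launch
        return std::accumulate(c.begin(), c.end(), 0.0);
    };

    // Prefetch chunk 0, then keep one transfer in flight at all times.
    auto pending = std::async(std::launch::async, transfer, chunks[0]);
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        std::vector<double> staged = pending.get();
        if (i + 1 < chunks.size())
            pending = std::async(std::launch::async, transfer, chunks[i + 1]);
        results.push_back(compute(staged));  // overlaps with the async transfer
    }
    return results;
}
```

The same double-buffering idea applies on the GPU: each CUDA stream owns a staging buffer, and copies on one stream run concurrently with kernels on another.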
Accuracy is validated using the one‑dimensional Riemann problem, a standard test for relativistic hydrodynamics. The initial condition consists of a discontinuity in pressure and energy density, leading to the formation of a shock wave and a rarefaction fan. The authors compare pressure, energy density, and flow velocity profiles obtained with CuBA against those from the CPU version at the same physical time. The two sets of results are virtually indistinguishable; the L2 norm of the difference is below 10⁻³, demonstrating that the single‑ or mixed‑precision arithmetic used on the GPU does not compromise the physical fidelity of the simulation.
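The kind of agreement check quoted here can be sketched as a discrete, point-count-normalized L2 norm of the difference between two sampled profiles (an illustrative helper, not the paper's actual validation code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Discrete L2 norm of the difference between two equally sampled profiles
// (e.g., pressure vs. position from CuBA and from CPU-BAMPS at the same
// physical time), normalized by the number of sample points.
double l2Difference(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum / static_cast<double>(a.size()));
}
```

Applied to the shock-tube profiles, a value below 10⁻³ in this measure corresponds to the "virtually indistinguishable" agreement reported above.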
The paper also discusses current limitations. At present CuBA only implements elastic 2→2 scatterings; inelastic processes such as 2→3 and 3→2, which are essential for a realistic description of parton number changing reactions, are not yet supported. Memory consumption scales linearly with the number of particles (≈48 bytes per parton), which restricts single‑GPU simulations to roughly a few tens of millions of partons on contemporary hardware (≤16 GB of global memory). Multi‑GPU parallelism, which would require a hybrid MPI‑CUDA approach, is identified as a future development direction.
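The linear memory scaling can be made concrete with a small estimator (the 48 bytes per parton is from the text; the overhead factor is a hypothetical knob standing in for auxiliary structures such as cell lists, RNG states, and staging buffers, which enlarge the real footprint beyond the bare payload):

```cpp
#include <cstdint>

// Estimated GPU memory footprint for n partons: payload bytes per parton
// (48 B in the text) times an overhead factor for auxiliary structures
// (cell lists, CURAND states, double buffers -- assumed, not from the paper).
std::uint64_t footprintBytes(std::uint64_t n, std::uint64_t bytesPerParton,
                             double overheadFactor) {
    return static_cast<std::uint64_t>(n * bytesPerParton * overheadFactor);
}
```

For example, 10⁷ partons at 48 B each occupy 0.48 GB of payload; the gap between that and the quoted practical limit on a 16 GB card is absorbed by these auxiliary allocations.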
In conclusion, CuBA demonstrates that a faithful GPU port of a sophisticated kinetic transport code can achieve an order‑of‑magnitude reduction in wall‑clock time while preserving the scientific accuracy of the original CPU implementation. This performance gain opens the door to event‑by‑event simulations with realistic parton multiplicities, systematic parameter scans, and potentially real‑time analysis pipelines. The authors suggest that extending the code to include inelastic scattering, implementing multi‑GPU scaling, and coupling the transport output to hydrodynamic or hadronization modules will further enhance its utility for the heavy‑ion community.