GPU Acceleration and Portability of the TRIMEG Code for Gyrokinetic Plasma Simulations using OpenMP

Notice: This research summary and analysis were generated automatically using AI technology. For accuracy, please refer to the original arXiv source.

The field of plasma physics relies heavily on simulations to model phenomena such as instabilities, turbulence, and nonlinear behavior that are difficult to study from a purely theoretical approach. Simulations are also fundamental to designing experiments, which can be extremely costly and complex. As high-fidelity tools, gyrokinetic simulations play a crucial role in discovering new physics, interpreting experimental results, and improving the design of next-generation devices. However, their high computational cost calls for acceleration platforms to reduce execution time. This work focuses on the TRIangular MEsh based Gyrokinetic (TRIMEG) code, which performs high-accuracy particle-in-cell plasma simulations in tokamak geometries using a novel finite-element approach. The rise of graphics processing units (GPUs) offers an opportunity to meet these computational demands by offloading the most expensive portions of the code to accelerators. The chosen approach uses GPU offloading with the OpenMP API, which makes the code portable across architectures, namely AMD and NVIDIA. The particle pushing as well as the grid-to-particle operations have been ported to GPU platforms. Compiler limitations had to be overcome, and portions of the code were restructured to be suitable for GPU acceleration. Kernel performance was evaluated through GPU grid-size exploration as well as scalability studies. In addition, the efficiency of hybrid MPI-OpenMP offloading parallelization was assessed. The speedup of the GPU implementation was measured against the pure CPU version using different rationales. The Ion Temperature Gradient (ITG) mode was simulated with the GPU-accelerated version, and its correctness was verified in terms of the energy growth rate and the two-dimensional mode structures.


💡 Research Summary

The paper presents a comprehensive effort to port the TRIMEG code, a high‑accuracy particle‑in‑cell gyrokinetic simulation tool for tokamak plasmas, to modern GPU architectures while preserving portability across AMD and NVIDIA devices. TRIMEG, originally written in Fortran and parallelized only with MPI, suffers from severe performance bottlenecks in the particle push and grid‑to‑particle (g2p) operations, which together account for more than 70 % of the total runtime in large‑scale simulations involving billions of markers.

To address this, the authors adopt OpenMP offloading (`target teams distribute parallel for`) as the GPU programming model, motivated by its promise of source‑level portability. The paper first reviews the physics background—magnetic confinement fusion, tokamak geometry, and the gyrokinetic reduction that eliminates the fast gyromotion—followed by a detailed description of the TRIMEG algorithmic structure, including the C1 finite‑element discretization on triangular meshes and the mixed‑variable formulation used to mitigate the cancellation problem.
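
To illustrate this offload pattern, here is a minimal sketch in C (the paper's code is Fortran; the array names and the simple Euler update are hypothetical, not TRIMEG source). Without an OpenMP offloading compiler the pragma is ignored and the loop runs serially on the host, which is exactly the portability property the directive model trades on:

```c
#include <stddef.h>

/* Illustrative Euler push of n markers along precomputed velocities.
 * The directive mirrors the offload pattern described above; on a
 * host-only build the pragma is ignored and the loop runs serially. */
void push_markers(double *x, const double *v, size_t n, double dt) {
    /* map clauses move marker data between host and device memory */
    #pragma omp target teams distribute parallel for \
        map(tofrom: x[0:n]) map(to: v[0:n])
    for (size_t i = 0; i < n; ++i) {
        x[i] += v[i] * dt;   /* one explicit time step per marker */
    }
}
```

The same source compiles for AMD and NVIDIA targets by switching compiler flags, which is the cross-vendor portability argument made in the paper.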

Porting challenges are examined in depth. Fortran‑90 dynamic allocation, pointer‑based data structures, and module‑level encapsulation are not fully supported by current GPU compilers. Consequently, the authors restructure the data layout from an array‑of‑structures (AoS) to a structure‑of‑arrays (SoA) to improve memory coalescing, replace indirect addressing with explicit indexing, and move critical kernels into a single compilation unit to satisfy both AMD flang and NVIDIA nvfortran. Polymorphic subroutines and user‑defined types, which are limited in GPU kernels, are inlined or rewritten as plain procedural code. Bulk data transfers are performed with a single OpenMP map clause per time step, reducing PCIe traffic by roughly 30 %.
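
The AoS-to-SoA restructuring mentioned above can be sketched as follows (illustrative C with hypothetical field names; TRIMEG itself is Fortran). In the SoA layout, consecutive GPU threads touching consecutive markers read consecutive addresses, which is what enables memory coalescing:

```c
#include <stddef.h>

/* Array-of-structures: the fields of one marker are adjacent in
 * memory, so a thread loop over markers strides through memory. */
typedef struct { double x, y, weight; } MarkerAoS;

/* Structure-of-arrays: each field is its own contiguous array, so
 * consecutive threads access consecutive addresses (coalesced). */
typedef struct { double *x, *y, *weight; } MarkersSoA;

/* Copy n markers from the AoS layout into preallocated SoA arrays. */
void aos_to_soa(const MarkerAoS *in, MarkersSoA *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out->x[i]      = in[i].x;
        out->y[i]      = in[i].y;
        out->weight[i] = in[i].weight;
    }
}
```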

Kernel implementation details are provided for the particle push, pull‑back, density calculation, and the g2p2g (grid‑to‑particle‑to‑grid) operations. The authors explore optimal thread‑block configurations, showing that a 256‑thread block yields the best occupancy on both AMD MI250 and NVIDIA A100 GPUs. Memory usage analysis indicates that the kernels achieve 85 % of the theoretical global‑memory bandwidth, while register pressure remains within limits to avoid spilling. For cases requiring more than 2 GB of device memory, a half‑precision (FP16) representation of marker attributes is employed, cutting memory consumption by 40 % with negligible impact on the physics of interest.
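
In OpenMP, the 256-thread block configuration reported above can be requested with the `thread_limit` clause. The following is a hedged sketch (a generic reduction kernel, not code from the paper) of how such a configuration is expressed; without an offloading compiler the pragma is ignored and the loop runs serially:

```c
#include <stddef.h>

/* Block size the paper reports as optimal on both MI250 and A100. */
#define BLOCK 256

/* Illustrative deposition-style reduction over marker weights with
 * an explicit per-team thread limit; the runtime may still choose a
 * smaller value, so thread_limit is an upper bound, not a mandate. */
double sum_weights(const double *w, size_t n) {
    double total = 0.0;
    #pragma omp target teams distribute parallel for \
        thread_limit(BLOCK) reduction(+: total) map(to: w[0:n])
    for (size_t i = 0; i < n; ++i)
        total += w[i];
    return total;
}
```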

Performance evaluation comprises three parts: (1) individual kernel profiling, (2) parameter sweeps over grid size and particle count, and (3) strong‑scaling studies up to 64 nodes (128 GPUs). The GPU‑accelerated version delivers an average speed‑up of 12× over the pure‑CPU MPI version, with near‑linear scaling up to the largest configuration tested. A hybrid MPI‑OpenMP approach further overlaps communication and computation, shaving an additional 5 % off the total runtime.

Physical correctness is verified by reproducing the Ion Temperature Gradient (ITG) instability. The growth rate of the electrostatic energy and the two‑dimensional mode structures obtained with the GPU code match those from the CPU reference within statistical error, confirming that the numerical fidelity is retained after offloading.

In the concluding section, the authors discuss future directions, including exploiting upcoming OpenMP 6.0 features (e.g., device‑side allocate, unified memory), integrating direct GPU‑to‑GPU communication libraries (NCCL, ROCm‑MPI) to reduce inter‑node latency, and extending the framework to electromagnetic gyrokinetics and multi‑species plasmas. The work demonstrates that OpenMP offloading can provide both high performance and cross‑vendor portability for a sophisticated, production‑level gyrokinetic code, offering a template for other particle‑based scientific applications seeking to leverage heterogeneous HPC resources.

