Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

We have developed GPU versions of two major high-performance-computing (HPC) applications originating from two different scientific domains. GENE is a plasma microturbulence code employed for simulations of nuclear fusion plasmas. VERTEX is a neutrino-radiation hydrodynamics code for first-principles simulations of core-collapse supernova explosions. The codes are considered state of the art in their respective scientific domains, both in terms of scientific scope and functionality and in terms of achievable compute performance, in particular parallel scalability, on all relevant HPC platforms. We ported GENE and VERTEX to HPC cluster architectures with two NVIDIA Kepler GPUs mounted in each node in addition to two Intel Xeon CPUs of the Sandy Bridge family. On such platforms we achieve up to twofold gains in overall application performance, in the sense of a reduced time to solution for a given setup, relative to a pure CPU cluster. The paper describes our basic porting strategies and benchmarking methodology, and details the main algorithmic and technical challenges we faced on the new, heterogeneous architecture.


💡 Research Summary

The paper presents a detailed case study of porting two state‑of‑the‑art high‑performance computing (HPC) applications, the plasma micro‑turbulence code GENE and the neutrino‑radiation hydrodynamics code VERTEX, to a heterogeneous GPU‑CPU cluster architecture. The target hardware consists of nodes equipped with two Intel Xeon E5‑2670 (Sandy Bridge) 8‑core CPUs and two NVIDIA Kepler K20X GPUs. The authors adopt a CUDA‑C programming model for the GPU portions while retaining the original Fortran‑MPI (and MPI/OpenMP for VERTEX) structure for the CPU side. They deliberately avoid CUDA‑Fortran and OpenACC because the PGI compiler required for CUDA‑Fortran underperforms the Intel compiler on the CPU, which would diminish the overall speed‑ups.

For GENE, profiling shows that more than 60 % of the runtime is spent computing the nonlinear term of the Vlasov‑Maxwell equations. This term is implemented as a sequence of five stages (pre‑processing, forward FFT, real‑space multiplication, inverse FFT, post‑processing), each mapped to a separate CUDA kernel. Data are split into contiguous xy‑plane chunks (typically four per array) and transferred to the GPU in two asynchronous CUDA streams, allowing PCIe transfers to overlap with kernel execution. Multiple MPI ranks sharing a single GPU further increase concurrency. The authors use the CUFFT library for FFTs; these alone account for 53 % of the GPU time. A roofline‑style performance model is constructed from two throughput metrics: xy‑planes computed per second and xy‑planes transferred per second. The measured PCIe 2.0 bandwidth (5.7 GB/s) yields a transfer ceiling of about 374 k planes/s, while kernel benchmarks give a compute ceiling of about 190 k planes/s on the K20X. The model shows that the K20X implementation is bandwidth‑limited; the algorithm would become compute‑bound only if the transfer volume were reduced or PCIe 3.0 were used. Consequently, the overall speed‑up achieved for GENE is modest (≈1.8× on the K20X, ≈1.4× on the older Fermi M2090).
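The bandwidth-versus-compute comparison above can be sketched as a back-of-envelope calculation. The Python snippet below is our illustration, not the authors' model: the two ceilings are taken from the paper, but the assumption that each xy-plane crosses the PCIe bus twice (host→device before the FFT chain and device→host afterwards) is ours, introduced to make the transfer and compute paths comparable per processed plane.

```python
# Back-of-envelope roofline comparison for the GENE nonlinear-term offload.
# Ceilings from the paper (K20X over PCIe 2.0 at 5.7 GB/s); the
# two-transfers-per-plane assumption is ours, not stated in the text.

TRANSFER_CEILING = 374e3   # xy-planes/s movable over the PCIe 2.0 link
COMPUTE_CEILING = 190e3    # xy-planes/s through the five-kernel chain

def pipeline_rate(transfer_ceiling, compute_ceiling, transfers_per_plane=2):
    """Sustained planes/s with full transfer/compute overlap:
    the slower of the compute path and the effective PCIe path."""
    effective_transfer = transfer_ceiling / transfers_per_plane
    return min(effective_transfer, compute_ceiling)

rate = pipeline_rate(TRANSFER_CEILING, COMPUTE_CEILING)
limited_by = "PCIe" if rate < COMPUTE_CEILING else "compute"
print(f"{rate / 1e3:.0f} k planes/s, {limited_by}-limited")
```

Under this assumption the effective transfer path (187 k planes/s) sits just below the compute ceiling, which matches the paper's conclusion that the K20X run is bandwidth-limited; halving the transfer volume or doubling the link bandwidth (PCIe 3.0) would flip the bottleneck to compute.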

VERTEX, in contrast, spends roughly half of its runtime on a “rate kernel” that evaluates neutrino absorption and emission rates for each grid cell. This kernel exhibits high data parallelism and arithmetic intensity, making it an ideal candidate for GPU offloading. The code uses a hybrid MPI/OpenMP domain decomposition: each MPI rank corresponds to a socket, and within a socket OpenMP threads (called “rays”) handle angular sub‑domains. The authors map one GPU per socket and offload only the rate kernel (sub‑step C2) to the GPU, while the surrounding steps (hydrodynamics, transport, other rate calculations) continue on the CPU. Overlap is achieved by scheduling CPU work that does not depend on the kernel’s output concurrently with GPU execution. In weak‑scaling tests up to 128 CPU sockets, the GPU‑accelerated version shows roughly a factor‑two reduction in total time‑step duration, and the scaling trend suggests similar benefits would persist on larger GPU‑enabled systems.
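The overlap scheme described above follows a generic asynchronous-offload pattern: launch the offloaded kernel, run CPU work that does not depend on its result, then synchronize before the dependent step. The Python sketch below mimics this pattern with a worker thread standing in for the GPU stream; the function names (`rate_kernel_c2`, `independent_cpu_work`) and their bodies are hypothetical stand-ins, not the actual VERTEX code.

```python
from concurrent.futures import ThreadPoolExecutor

# Schematic of the VERTEX overlap pattern (our sketch): the rate kernel
# (sub-step C2) runs on the "device" while the CPU executes work that does
# not depend on its output; results are joined before the dependent step.

def rate_kernel_c2(cells):
    """Stand-in for the GPU rate kernel: one rate value per grid cell."""
    return [c * 2.0 for c in cells]        # hypothetical per-cell rate

def independent_cpu_work(n):
    """Stand-in for CPU-side steps that do not need the kernel's output."""
    return sum(range(n))

cells = [1.0, 2.0, 3.0]
with ThreadPoolExecutor(max_workers=1) as gpu_stream:  # models one GPU per socket
    future = gpu_stream.submit(rate_kernel_c2, cells)  # asynchronous "offload"
    cpu_result = independent_cpu_work(1000)            # overlapped CPU work
    rates = future.result()                            # synchronize before dependent step
```

In the real code the offload is a CUDA kernel launch per socket rather than a thread pool, but the dependency structure, asynchronous launch, independent CPU work, explicit synchronization, is the same.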

The paper concludes that substantial performance gains can be realized for large, production‑grade scientific codes by targeting the most computationally intensive, data‑parallel kernels. However, on current Kepler hardware the PCIe 2.0 interconnect is the dominant bottleneck for GENE, while VERTEX’s kernel is already compute‑bound and benefits more directly from GPU acceleration. Future improvements could involve moving additional algorithmic components onto the GPU, employing higher‑bandwidth interconnects such as PCIe 3.0 or NVLink, and refining kernel implementations. The authors’ experience demonstrates that even highly tuned legacy codes can be successfully ported to heterogeneous architectures with modest development effort, providing a roadmap for other scientific domains seeking to exploit GPU‑accelerated HPC platforms.

