Accelerating QDP++ using GPUs
Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture for controlling and exploiting the compute power of GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain. QDP++ is a C++ vector class library suited for quantum field theory which provides vector data types and expressions and forms the basis of the lattice QCD software suite Chroma. In this work, GPU acceleration of QDP++ expression evaluation was implemented by leveraging the ET technique and Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API for CUDA kernel arguments were used to build the bridge between host and device memory domains. This makes it possible to off-load to the GPU those Chroma routines that are typically not subject to special optimisation. As an application example, a smearing routine was ported to execute on a GPU, and a significant speed-up compared to CPU execution was measured.
💡 Research Summary
The paper addresses the growing importance of graphics processing units (GPUs) in scientific high‑performance computing (HPC) and demonstrates how to extend the QDP++ library – a C++ vector‑class framework that underpins the lattice QCD software suite Chroma – to run its expression‑template (ET) evaluations on NVIDIA GPUs. The authors exploit the C++ support introduced in CUDA 4.0, the Portable Expression Template Engine (PETE), and Just‑In‑Time (JIT) compilation to bridge the host and device memory domains without rewriting the high‑level QDP++ code.
The methodology consists of four tightly coupled components. First, when a QDP++ expression is evaluated, PETE’s “pretty‑printing” facility emits a C++ type description of the expression tree. An external code generator consumes this description, emits a CUDA kernel source that reconstructs the same expression on the device, and invokes NVCC to compile the source into a shared library. The library is dynamically loaded (dlopen) and the kernel is launched immediately. Because each distinct expression results in a single compiled kernel, subsequent calls reuse the same binary, minimizing compilation overhead.
Second, the expression tree is flattened on the host side. PETE traverses the tree, extracts all runtime‑configurable data (plain‑old‑data, POD) from each operator (e.g., scalar coefficients, shift directions, site‑table indices) and packs them into a linear buffer. This buffer is passed to the CUDA kernel via the C‑only kernel argument interface. On the device, the kernel reconstructs the original expression tree from the POD buffer, thereby restoring the full semantics of the host expression. Special operators that require auxiliary data (such as the shift operator’s site table) trigger a one‑time copy of the required table to device memory; subsequent invocations reuse the cached copy, avoiding repeated transfers.
Third, a mixed‑memory model is adopted to alleviate the scarcity of device memory. Lattice‑wide objects remain allocated in host memory; when a computation is about to be launched, the user calls pushToDevice(), which pins the host memory region (using CUDA’s page‑lock facility) and copies it directly to the GPU. This eliminates the need for intermediate staging buffers and reduces host‑to‑device transfer latency. After the kernel finishes, popFromDevice() copies results back and frees the device allocation. Explicit deallocation functions (freeDeviceMem(), theDeviceStorage::freeAll()) give the programmer fine‑grained control over device resources.
Fourth, the authors expose two tunable parameters that control the mapping of lattice sites to CUDA threads: N_threads (threads per block) and N_site (lattice sites processed per thread). The total grid geometry is derived from the lattice volume, N_threads, and N_site. Because each distinct expression generates its own kernel, performance depends on the interplay of these parameters, the expression’s computational intensity, and the GPU’s architectural characteristics (warp size, shared memory, registers). The paper suggests that an auto‑tuning phase could be added to discover optimal values before production runs.
To validate the approach, the authors port a Jacobi smearing routine – a commonly used, iterative, nearest‑neighbour operation in Chroma that is typically not hand‑optimised – to the GPU. The original routine consists of five QDP++ expressions, two of which dominate the floating‑point workload. By inserting a few API calls (pushToDevice(), popFromDevice(), and memory‑free calls) around the existing code, the entire smearing loop executes on the GPU without altering the high‑level algorithmic structure.
Benchmarking on an NVIDIA GeForce GTX 480 (1.5 GB GDDR5, compute capability 2.0) shows that for small lattices (8³) the GPU achieves roughly 1.5 GFLOP/s, comparable to the CPU. As the lattice size grows, the GPU's advantage becomes pronounced: for a 32³ lattice the GPU reaches 21 GFLOP/s versus 1.5 GFLOP/s on a single CPU core, a speed‑up of more than an order of magnitude. The performance gain scales with lattice volume because the computation becomes increasingly bandwidth‑bound and the GPU can hide memory latency with many concurrent warps.
The paper’s contributions can be summarised as follows:
- A generic, template‑based GPU acceleration path for QDP++ that requires only minimal changes to existing Chroma code (essentially data‑movement calls).
- A JIT compilation pipeline that automatically generates and caches device kernels for arbitrary QDP++ expressions, eliminating the need for hand‑written CUDA kernels.
- A flatten‑and‑reconstruct strategy for expression trees that works around CUDA’s C‑only kernel argument interface while preserving the full semantics of complex lattice operators.
- A mixed host/device memory model that leverages page‑locked host memory to reduce transfer overhead and keep the host memory footprint low.
- Empirical evidence that even non‑solver parts of lattice QCD codes can achieve substantial speed‑ups when off‑loaded to GPUs, thereby alleviating the “solver‑dominant” bottleneck that has traditionally motivated GPU adoption.
In conclusion, the authors demonstrate that by combining modern C++ features in CUDA, PETE’s expression‑template infrastructure, and JIT compilation, it is possible to retrofit an existing, heavily templated scientific library with GPU acceleration in a largely automated fashion. This approach not only benefits the Chroma/QDP++ ecosystem but also provides a blueprint for other domain‑specific libraries that rely on expression templates, opening the door to broader, low‑effort GPU adoption across computational science.