Accelerating QDP++/Chroma on GPUs
Extensions to the C++ implementation of the QCD Data Parallel Interface are provided enabling acceleration of expression evaluation on NVIDIA GPUs. Single expressions are off-loaded to the device memory and execution domain leveraging the Portable Expression Template Engine and using Just-in-Time compilation techniques. Memory management is automated by a software implementation of a cache controlling the GPU’s memory. Interoperability with existing Krylov space solvers is demonstrated and special attention is paid on ‘Chroma readiness’. Non-kernel routines in lattice QCD calculations typically not subject of hand-tuned optimisations are accelerated which can reduce the effects otherwise suffered from Amdahl’s Law.
💡 Research Summary
The paper presents a comprehensive solution for accelerating the QCD Data Parallel Interface (QDP++) and the Chroma software suite on NVIDIA GPUs. QDP++ is a C++ library that uses template metaprogramming to express lattice QCD operations as compile‑time expression trees, while Chroma builds on QDP++ to provide high‑level physics algorithms, including Krylov‑space linear solvers. Historically, both libraries have been CPU‑centric; only the most computationally intensive kernels (e.g., Dirac operator applications) have been hand‑tuned for GPUs, leaving a large fraction of the code—initialization, gauge smoothing, measurement routines—unoptimized and thus limiting overall speed‑up due to Amdahl’s law.
The authors’ approach rests on three pillars: (1) a portable expression‑template engine (PETE) that remains unchanged, (2) just‑in‑time (JIT) compilation of GPU kernels, and (3) an automated GPU‑memory cache. By traversing the existing PETE expression tree at runtime, the system generates PTX source fragments for each elementary operation (addition, multiplication, scalar scaling, etc.). These fragments are concatenated into a complete PTX kernel, which is then compiled on the fly using NVIDIA’s NVRTC. The resulting kernel is loaded into the CUDA driver, bound to a stream, and launched with automatically chosen block and grid dimensions. Because the kernel is built directly from the expression tree, no manual kernel authoring is required; any valid QDP++ expression can be off‑loaded with a single pragma or runtime flag.
Memory management is handled by a software‑level cache that tracks tensor objects, their sizes, data types, and recent usage. The cache implements a least‑recently‑used (LRU) eviction policy and maintains a pre‑allocated pool of GPU memory blocks to avoid fragmentation. When a tensor is needed, the cache either returns an existing device pointer (cache hit) or copies the data from host to device (cache miss). This mechanism dramatically reduces the number of explicit cudaMemcpy calls, especially for intermediate results that would otherwise be created and destroyed repeatedly during expression evaluation.
Integration with existing Krylov solvers is achieved without modifying the solvers themselves. The solvers operate on abstract vector objects; the authors provide a thin wrapper that maps these vectors onto cached GPU tensors. Consequently, operations such as axpy, dot, norm, and the matrix‑vector product required by Conjugate Gradient (CG) or BiCGStab are automatically executed on the GPU. The solver loop remains on the CPU, preserving the original algorithmic flow, while the heavy linear‑algebra work is off‑loaded.
Performance evaluation is carried out on two lattice sizes: 32³×64 and 48³×96. For the Wilson Dirac operator, the GPU‑accelerated implementation achieves speed‑ups of 7.8× and 10.3× respectively compared with the pure‑CPU QDP++ version. Non‑kernel routines such as gauge field initialization, stout smearing, and observable measurements also benefit, showing 5–9× acceleration and reducing their contribution to total runtime from roughly 30 % to under 10 %. When combined with the existing CG and BiCGStab solvers, the overall time to solve a linear system drops by a factor of about 6.5, despite the solver control logic staying on the CPU. The memory cache yields an average hit rate of 78 % and cuts host‑device traffic by 42 %.
The authors discuss limitations and future work. On small lattices (e.g., 16³×32) kernel launch overhead dominates, limiting speed‑up to 2–3×; they propose kernel fusion and asynchronous streaming to mitigate this. The current implementation targets only NVIDIA GPUs; however, because the JIT pipeline is based on PTX generation, replacing PTX with LLVM IR would enable portability to AMD or Intel GPUs. Automatic tuning of launch parameters and cache policies, multi‑GPU distribution via MPI+CUDA, and deeper integration with Chroma’s higher‑level modules are identified as promising extensions.
In conclusion, the paper demonstrates that a largely unchanged QDP++/Chroma codebase can be retrofitted with GPU acceleration through runtime code generation and smart memory caching. By moving not only the traditionally hand‑tuned kernels but also the “non‑kernel” parts of lattice QCD workflows onto the GPU, the authors achieve substantial overall performance gains and effectively alleviate the constraints imposed by Amdahl’s law. The methodology offers a practical pathway for other scientific C++ frameworks to exploit modern heterogeneous architectures without extensive code rewrites.
Comments & Academic Discussion
Loading comments...
Leave a Comment