Accelerating QDP++/Chroma on GPUs

Reading time: 5 minutes

📝 Original Info

  • Title: Accelerating QDP++/Chroma on GPUs
  • ArXiv ID: 1111.5596
  • Date: 2011-11
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Extensions to the C++ implementation of the QCD Data Parallel Interface are provided, enabling acceleration of expression evaluation on NVIDIA GPUs. Single expressions are off-loaded to the device memory and execution domain, leveraging the Portable Expression Template Engine and using Just-in-Time compilation techniques. Memory management is automated by a software implementation of a cache controlling the GPU's memory. Interoperability with existing Krylov space solvers is demonstrated, and special attention is paid to 'Chroma readiness'. Non-kernel routines in lattice QCD calculations, typically not subject to hand-tuned optimisation, are accelerated, which can reduce the penalty otherwise suffered from Amdahl's Law.

📄 Full Content

Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). Their massively parallel architecture for floating-point arithmetic, together with very high bandwidth to device-local memory, makes GPUs interesting not only for compute-intensive but also for data-intensive applications.

NVIDIA established the Compute Unified Device Architecture (CUDA) as a parallel computing architecture for controlling and exploiting the compute power of its GPUs. Now in its 4th major software iteration, it provides mature support for most C++ language features (such as templates), making it an interesting platform also for software projects employing meta-programming techniques.

Within the U.S. SciDAC initiative a unified programming environment was developed: the QCD Application Programming Interface (API) [1]. This API enables lattice QCD scientists to implement portable software, achieving a high level of software sustainability. Part of this API is QDP++, the C++ implementation of the QCD Data Parallel Interface, which provides data-parallel types and expressions suitable for lattice field theory. The very successful lattice QCD software suite Chroma builds on top of QDP++, and implementations exist for a large range of hardware architectures [2]. High efficiency is achieved through a flexible interface that permits specialised compute kernels to be plugged in [3]. QDP++ makes substantial use of template meta-programming to provide Domain Specific Language (DSL) abstractions for this problem domain. Through its use of the Portable Expression Templates Engine (PETE), QDP++ offers user expressions that look similar to their mathematical counterparts.

PETE is a portable implementation of the Expression Template (ET) technique [4,5], a technique for implementing vector expressions without relying on vector-sized temporaries. PETE's portability concepts include abstractions for a flexible return-type system and user-defined expression tree traversals. However, neither PETE nor QDP++ supports heterogeneous multicore architectures with separate memory and execution domains.

This work extends QDP++/Chroma to make use of NVIDIA's CUDA as the target architecture for expression evaluation. A single expression is off-loaded to the device memory and execution domain by dynamically generating a CUDA kernel and using Just-in-Time (JIT) compilation techniques. Special attention is paid to 'Chroma readiness', meaning that a successful build of Chroma on top of the extended QDP++ is possible.

Similar efforts are under way at Jefferson Lab [6], which underlines the relevance of this approach.

A similar ET reconstruction on GPUs using CUDA was previously reported [7]. Pointers to the vectors' data are passed to the CUDA kernel as function arguments, and the NVIDIA compiler is called just-in-time. Memory management was not addressed directly but was circumvented by using the Thrust library shipped with CUDA.

In previous work QDP++/Chroma was extended in a similar way, targeting a different heterogeneous multicore architecture: the Cell processor and QPACE [8,9].

Lattice QCD calculations divide into two parts: the generation of background field configurations, and the computation of observables on these configurations, the so-called analysis part. The computationally most intensive step of the analysis is the inversion of the fermion matrix. Although heavily dependent on the simulation parameters, the vast majority of the floating-point operations carried out during the analysis is spent inverting this large sparse matrix. The remainder is spent on non-kernel routines such as smearing, quark contractions, etc.

It is natural to spend most optimisation work on the inverter. A highly optimised library (QUDA) for fermion matrix inversion on NVIDIA GPUs is available [10-12]. These inverters provide speedup factors of S ≥ 30 compared to an inversion carried out on the CPU; see a later section for benchmark results.

However, Amdahl's law states that accelerating a program fraction P by a factor S yields a total speedup of S_total = 1/((1 - P) + P/S) (3.1).

In the limit of a very high speedup factor S → ∞ the total speedup is bounded by S_total = 1/(1 - P). Fig. 1 shows S_total as a function of S for P = 0.8. To further increase S_total one needs to increase P.

To further increase P in the case of Chroma one can either implement more hand-tuned versions of non-kernel routines or target the underlying library QDP++. Targeting QDP++ by adding design extensions for GPU support is advantageous, since this approach yields a more general solution: the user is not restricted to a specific set of non-kernel routines.

The bandwidth between the host and device memory domains represents a major bottleneck. Since lattice objects are typically referenced more than once within a given set of expressions, minimising

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
