Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries

We present a comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL. The comparison focuses on the solution of ordinary differential equations and is based on odeint, a framework for the solution of systems of ordinary differential equations. Odeint is designed in a very flexible way and may be easily adapted for effective use of libraries such as Thrust, MTL4, VexCL, or ViennaCL, using CUDA or OpenCL technologies. We found that CUDA and OpenCL work equally well for large problem sizes, while OpenCL incurs higher overhead for smaller problems. Furthermore, we show that modern high-level libraries allow one to use the computational resources of many-core GPUs or multi-core CPUs effectively, without much knowledge of the underlying technologies.


💡 Research Summary

The paper presents a systematic comparison of several modern C++ libraries that provide high‑level interfaces for programming multi‑ and many‑core architectures on top of CUDA or OpenCL. The authors focus on the solution of ordinary differential equations (ODEs) using the Boost.Odeint framework, which is deliberately designed to be highly flexible through its policy‑based architecture (stepper, algebra, operations). By specializing the algebra and operations classes, the authors integrate four representative libraries—Thrust, MTL4, VexCL, and ViennaCL—into Odeint, thereby allowing the same ODE model code to run on either CUDA or OpenCL without substantial rewrites.
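To make the policy-based design concrete, the following is a minimal, self-contained sketch of the idea: a stepper templated on a state type, an algebra (how to traverse states), and operations (the element-wise arithmetic). The names (`container_algebra`, `scale_sum2`, `euler_stepper`) mirror the spirit of odeint's concepts but are illustrative, not odeint's actual API; a real back-end integration would substitute, e.g., a Thrust-based algebra iterating over device vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative algebra policy: knows how to iterate three states in lockstep.
// A GPU back-end would replace this loop with, e.g., thrust::for_each.
struct container_algebra {
    template <class S1, class S2, class S3, class Op>
    static void for_each3(S1& s1, const S2& s2, const S3& s3, Op op) {
        for (std::size_t i = 0; i < s1.size(); ++i) op(s1[i], s2[i], s3[i]);
    }
};

// Illustrative operations policy: the element-wise arithmetic of a step.
struct default_operations {
    struct scale_sum2 {
        double a1, a2;
        void operator()(double& y, double x, double dxdt) const {
            y = a1 * x + a2 * dxdt;
        }
    };
};

// A simple explicit-Euler stepper built only from the two policies above.
// Swapping Algebra/Ops retargets it without touching the stepper logic.
template <class State, class Algebra = container_algebra,
          class Ops = default_operations>
struct euler_stepper {
    template <class System>
    void do_step(System sys, State& x, double t, double dt) const {
        State dxdt(x.size());
        sys(x, dxdt, t);  // evaluate the right-hand side of the ODE
        Algebra::for_each3(x, x, dxdt, typename Ops::scale_sum2{1.0, dt});
    }
};
```

The point of the split is that the model code (the `System` functor) and the stepper never mention the compute back-end; only the algebra and operations classes do.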

Each library is first examined in terms of its design philosophy and API. Thrust offers an STL‑like interface for CUDA, providing device vectors and parallel algorithms. MTL4 is a heavily templated linear algebra library that supports both CUDA and OpenCL back‑ends. VexCL is a pure OpenCL solution that automatically generates kernels from expression templates and can schedule work across multiple devices. ViennaCL, also OpenCL‑based, supplies high‑level BLAS‑style operations. The integration work consists mainly of mapping these constructs to Odeint’s generic algebra and operations, which the authors accomplish with relatively few lines of code thanks to C++ template metaprogramming.
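The expression-template mechanism behind VexCL's kernel generation can be sketched in a few lines of host-only C++. An expression such as `2.0 * x + y` is captured as a lightweight type tree and evaluated in a single fused loop; VexCL translates an analogous tree into OpenCL kernel source instead. All class names below (`vec`, `sum_expr`, `scaled_expr`) are invented for this illustration and are not part of VexCL's API.

```cpp
#include <cstddef>
#include <vector>

// Node types of the expression tree: they store references and evaluate lazily.
template <class L, class R>
struct sum_expr {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class E>
struct scaled_expr {
    double a; const E& e;
    double operator[](std::size_t i) const { return a * e[i]; }
};

// Vector type whose assignment operator walks the tree in one loop,
// with no temporaries -- the host-side analogue of one generated kernel.
struct vec {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
    template <class E>
    vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Operators build the tree instead of computing anything immediately.
template <class L, class R>
sum_expr<L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <class E>
scaled_expr<E> operator*(double a, const E& e) { return {a, e}; }
```

The same technique is what lets VexCL fuse an entire Runge-Kutta stage update into one kernel launch rather than one launch per arithmetic operation.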

Performance experiments are carried out on two benchmark ODE problems: a non‑stiff rotational system solved with classic Runge‑Kutta‑4 and Dormand‑Prince, and a stiff robotic‑joint model solved with Rosenbrock and implicit Euler methods. Problem sizes range from 10³ to 10⁶ state variables, and the hardware platform includes an NVIDIA GTX 1080 Ti, an AMD Radeon VII, and an Intel Xeon E5‑2670 v3 CPU. The results show that for large‑scale problems (≥10⁵ variables) both CUDA and OpenCL achieve comparable execution times, indicating that the GPU’s computational pipeline is fully utilized regardless of the underlying API. Thrust and MTL4 on CUDA exhibit the lowest overhead, while VexCL demonstrates excellent scalability across multiple GPUs, achieving near‑linear speed‑up when two devices are employed. In contrast, for small‑scale problems (≤10³ variables) OpenCL incurs a noticeable runtime penalty due to driver initialization and kernel launch latency, leading to 15–30 % longer runtimes compared to CUDA. ViennaCL offers the best code readability and maintainability but lags behind hand‑tuned CUDA kernels by roughly 5–10 % in raw performance.
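For reference, the classic Runge-Kutta-4 scheme used in the non-stiff benchmark can be sketched for a host-side `std::vector` state as below. This is a plain textbook implementation, not the paper's code; the GPU versions route the same per-element arithmetic through Thrust, VexCL, or ViennaCL vector operations instead of raw loops.

```cpp
#include <cstddef>
#include <vector>

using state = std::vector<double>;

// Classic fourth-order Runge-Kutta: four right-hand-side evaluations per step,
// combined with weights 1/6, 2/6, 2/6, 1/6.
template <class System>
void rk4_step(System f, state& x, double t, double dt) {
    const std::size_t n = x.size();
    state k1(n), k2(n), k3(n), k4(n), tmp(n);
    f(x, k1, t);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k1[i];
    f(tmp, k2, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k2[i];
    f(tmp, k3, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + dt * k3[i];
    f(tmp, k4, t + dt);
    for (std::size_t i = 0; i < n; ++i)
        x[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}
```

Because each stage is a simple element-wise combination, the method maps naturally onto the vector primitives all four libraries expose, which is why the large-problem results converge to similar throughput.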

Beyond raw numbers, the authors assess development productivity, maintainability, and portability. High‑level abstractions eliminate the need for explicit memory management, thread synchronization, and host–device data transfers, allowing scientists to concentrate on the mathematical model. Odeint’s policy‑based design further simplifies the addition of new back‑ends: swapping a header and providing a few specialization classes is sufficient to move from a CPU‑only implementation to a GPU‑accelerated version. This abstraction layer is especially valuable for interdisciplinary teams where domain experts may lack deep GPU programming expertise.

The paper concludes that modern C++ libraries enable developers to exploit the computational power of both CUDA and OpenCL while shielding them from low‑level details. The authors suggest extending the approach to higher‑order ODEs, partial differential equations, and spectral methods, as well as integrating MPI for multi‑node clusters and coupling with deep‑learning frameworks for hybrid simulation pipelines. Overall, the study demonstrates that high‑level C++ abstractions can deliver performance comparable to hand‑crafted kernels while dramatically improving code portability and developer productivity in scientific computing contexts.