GPGPU Computing

Since the first idea of using GPUs for general-purpose computing, things have evolved considerably, and there are now several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA and Stream by AMD, APIs designed by the GPU vendors to be used with the hardware they provide. An emerging standard, OpenCL (Open Computing Language), tries to unify the different GPU computing API implementations and provides a framework for writing programs that execute across heterogeneous platforms consisting of both CPUs and GPUs, supporting both task-based and data-based parallelism. In this paper we focus on the CUDA parallel computing architecture and programming model introduced by NVIDIA and present its benefits. We also compare the two main vendor approaches, CUDA and AMD APP (Stream), with OpenCL, the new framework that tries to unify the GPGPU computing models.


💡 Research Summary

The paper provides a comprehensive overview of the evolution of general‑purpose computing on graphics processing units (GPGPU) and focuses on the three dominant programming models: NVIDIA’s CUDA, AMD’s APP (formerly STREAM), and the open standard OpenCL. It begins with a historical context, describing how GPUs transitioned from fixed‑function graphics pipelines to highly parallel, programmable processors capable of accelerating a wide range of scientific and engineering workloads. The introduction of CUDA in 2006 marked the first vendor‑specific, fully supported API that exposed the GPU’s internal architecture—streaming multiprocessors (SMs), warps, and a hierarchical memory system (global, shared, constant, and texture memory). The paper details CUDA’s execution model, including grids, thread blocks, streams, and events, and explains how asynchronous data transfers and kernel launches can be overlapped to hide PCI‑Express latency.

The discussion then shifts to AMD’s APP (STREAM) model. AMD GPUs organize execution into work‑groups and work‑items, with a 64‑thread wavefront as the basic SIMD unit. Their memory hierarchy consists of global, local, and private memory, with local memory playing a role analogous to CUDA’s shared memory but with a smaller capacity. By comparing matrix multiplication and image‑filter kernels on comparable hardware, the authors show that CUDA typically achieves 1.3–1.8× higher throughput, largely due to more efficient coalesced memory accesses and warp‑level scheduling.

OpenCL is presented as an attempt to unify heterogeneous computing under a single, vendor‑agnostic framework. The standard defines platforms, devices, contexts, command queues, and kernels, allowing the same source code to be compiled and executed on CPUs, GPUs, FPGAs, and other accelerators. The paper walks through the OpenCL workflow—source compilation, kernel object creation, buffer management, and command‑queue submission—and maps its concepts to CUDA equivalents. Experimental results indicate that, despite its portability, OpenCL generally lags behind CUDA by 15–30 % in raw performance on identical kernels, owing to differences in driver optimizations, just‑in‑time compilation overhead, and less mature profiling/debugging tools.

A side‑by‑side comparison evaluates the three models across several dimensions: performance, portability, ecosystem maturity, and tooling. CUDA benefits from a rich set of optimized libraries (cuBLAS, cuDNN, Thrust), sophisticated profiling utilities (NVProf, Nsight), and extensive community support, making it the de‑facto choice for high‑performance applications on NVIDIA hardware. AMD’s APP offers cost‑effective solutions and an open‑source orientation but requires careful tuning of memory access patterns and work‑group sizes to approach CUDA‑level efficiency. OpenCL’s strength lies in its ability to target heterogeneous platforms with a single code base, a feature increasingly valuable as workloads span CPUs, GPUs, and specialized accelerators. However, the current performance gap and limited debugging infrastructure mean that developers often resort to native APIs for performance‑critical sections.

In conclusion, the authors argue that GPGPU will continue to drive advances in artificial intelligence, scientific simulation, and big‑data analytics. While OpenCL’s vision of a unified programming model is compelling for long‑term portability, vendor‑specific APIs like CUDA remain essential for extracting maximal performance from contemporary hardware. The paper suggests that future research should focus on improving OpenCL driver optimizations, enhancing cross‑platform tooling, and developing higher‑level abstractions that can automatically adapt kernels to the strengths of each underlying architecture.

