OpenCL 2.0 for FPGAs using OCLAcc
Designing hardware is a time-consuming and complex process. Realization of both, embedded and high-performance applications can benefit from a design process on a higher level of abstraction. This helps to reduce development time and allows to iteratively test and optimize the hardware design during development, as common in software development. We present our tool, OCLAcc, which allows the generation of entire FPGA-based hardware accelerators from OpenCL and discuss the major novelties of OpenCL 2.0 and how they can be realized in hardware using OCLAcc.
💡 Research Summary
The paper presents OCLAcc, an open‑source tool that automatically generates FPGA‑based hardware accelerators from OpenCL kernels, with a focus on exploiting the new features introduced in OpenCL 2.0. The authors first motivate the need for higher‑level abstraction in FPGA design, noting that traditional RTL or vendor‑specific HLS flows are time‑consuming and often force early architectural decisions. OCLAcc addresses this by allowing designers to stay in the software domain (OpenCL‑C) while still obtaining custom hardware that can be iteratively refined.
The compilation flow consists of three main stages. In the first stage, a modified Clang front‑end translates the OpenCL kernel into SPIR (Standard Portable Intermediate Representation), which is based on LLVM‑VM‑IR. This SPIR binary is then fed into OCLAcc’s internal representation called OCLAccHW. OCLAccHW models the kernel as a data‑flow graph composed of basic blocks—maximal instruction sequences with a single entry and exit. For each block, the tool analyses inputs, outputs, and memory indices (both static and dynamic) to identify streams to/from external memory. Built‑in OpenCL functions (e.g., barriers, atomics) are mapped to dedicated hardware components and control signals, and classic hardware optimizations such as common‑subexpression elimination are applied.
In the second stage, called HWMap, OCLAcc interacts with vendor tools (Altera, Xilinx, etc.) to generate the final RTL or IP cores. Depending on the target, the tool may instantiate pre‑defined components, generate custom IP, or rely on the vendor’s inference engine. Timing information (latency, maximum clock frequency) is only known after synthesis, so OCLAcc synchronizes arithmetic units within basic blocks using clock‑driven handshaking, while inter‑block communication uses a simple ready/ack protocol. This hybrid synchronization strategy ensures correct operation even when the exact timing of blocks is unknown at compile time.
The core contribution of the paper is the discussion of how four OpenCL 2.0 extensions can be realized on FPGA hardware.
-
Work‑group functions (e.g., broadcast, reduction) are implemented by writing each work‑item’s data to local SRAM, waiting until all items reach the synchronization point, then performing the collective operation and feeding the result back into the next block.
-
Pipes provide FIFO‑style communication without explicit indexing. OCLAcc maps pipes to SRAM‑based FIFOs whose depth and dimensionality are supplied by the host. This enables seamless data streaming between kernels, and also allows direct interfacing with external peripherals (e.g., image sensors) without host intervention.
-
Device‑side enqueue allows kernels to launch other kernels autonomously. OCLAcc implements a hardware work‑queue as a writable FIFO that kernels can push new work‑groups onto, eliminating the need for host‑mediated kernel launches and reducing latency.
-
Shared Virtual Memory (SVM) is partially supported. Coarse‑grained SVM is realized by mapping host‑allocated buffers into the device address space, requiring explicit map/unmap operations. Fine‑grained SVM (pointer‑level sharing) is not yet implemented, as it would demand more complex coherence mechanisms.
The authors compare OCLAcc with vendor‑provided OpenCL solutions (Altera’s OpenCL SDK, Xilinx SDAccel). While the commercial tools generate a fixed data‑path that may underutilize the FPGA’s flexibility, OCLAcc tailors the hardware to the specific kernel characteristics (data width, parallelism), potentially achieving better performance‑per‑watt. However, OCLAcc does not yet implement the full OpenCL 2.0 specification, and its reliance on vendor synthesis tools may introduce portability issues when toolchains evolve.
In conclusion, OCLAcc demonstrates that the newer OpenCL 2.0 constructs—work‑group functions, pipes, device‑side enqueue, and SVM—can be mapped efficiently onto FPGA fabrics, offering software developers a more productive path to custom accelerators. Future work includes extending support for fine‑grained SVM, improving vendor‑independent back‑ends, and broadening the set of OpenCL 2.0 features covered.
Comments & Academic Discussion
Loading comments...
Leave a Comment