Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels
Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators available for embedded computer vision (e.g., multi-core CPUs, GPUs, and FPGAs) and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid in determining which embedded platform is most suitable for a given application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study covers three commonly used hardware accelerators for embedded vision applications: the ARM57 CPU, the Jetson TX2 GPU, and the ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks, and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1-3.2x compared to the others for simple kernels, while for more complicated kernels and complete vision pipelines the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3x. We also observe that the FPGA's advantage grows as a vision application's pipeline complexity increases.
💡 Research Summary
The paper presents a systematic, reproducible benchmark that evaluates the runtime performance, energy consumption, and Energy‑Delay Product (EDP) of three widely used embedded vision accelerators: an ARM‑based multicore CPU (ARM57), an Nvidia Jetson TX2 GPU (Pascal architecture), and a Xilinx ZCU102 UltraScale+ FPGA. Using only publicly available, vendor‑optimized vision libraries—OpenCV for the CPU, Nvidia VisionWorks for the GPU, and xfOpenCV for the FPGA—the authors implement a broad set of vision kernels grouped into six functional categories: Input Processing, Image Arithmetic, Image Filters, Image Analysis, Geometric Transformations, and Composite Kernels (e.g., feature extraction, stereo block matching, optical flow).
Methodologically, each kernel is executed on 1080p grayscale frames (1000 frames per run) while high‑resolution timers record execution time, and on‑board power measurement ICs capture dynamic and static power on the relevant rails (CPU cores, GPU cores, programmable logic, etc.). The authors explicitly exclude data transfer and configuration overhead to isolate pure compute energy. Energy per frame is derived from measured power during the active period, and EDP is calculated as Energy × Delay, providing a balanced metric that penalizes both high power and long latency.
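The two metrics described above can be sketched in a few lines. This is a minimal illustration of the arithmetic only; the power and timing figures below are made-up placeholders, not measurements from the paper.

```python
def energy_per_frame(avg_power_w, active_time_s, num_frames):
    """Energy (joules) spent per frame: average measured power over the
    active compute window, divided across the processed frames."""
    return avg_power_w * active_time_s / num_frames

def edp(energy_j, delay_s):
    """Energy-Delay Product: Energy x Delay, penalizing both high power
    draw and long latency."""
    return energy_j * delay_s

# Hypothetical run: 1000 frames processed in 2.0 s at an average of 5 W.
frames = 1000
e = energy_per_frame(5.0, 2.0, frames)   # joules per frame
d = 2.0 / frames                         # seconds per frame
print(e, edp(e, d))
```

A platform with low energy per frame but a long delay (or vice versa) can still score poorly on EDP, which is why the paper uses it as the balanced tie-breaker between the three accelerators.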
Results reveal two clear trends. For simple, highly parallel kernels (e.g., channel conversion, basic arithmetic), the GPU achieves the best energy efficiency, delivering a 1.1‑3.2× reduction in energy per frame compared with the CPU and FPGA. This advantage stems from the GPU’s massive SIMD core count, high memory bandwidth, and efficient handling of regular, data‑parallel workloads. However, as kernel complexity grows—particularly for operations that involve irregular memory access, branching, or multi‑stage pipelines—the FPGA outperforms both the CPU and GPU. Energy reductions of 1.2‑22.3× are observed for composite pipelines such as stereo block matching and dense optical flow. The FPGA’s superiority is attributed to custom data‑path creation, on‑chip BRAM for data locality, and the ability to stream pixels directly between processing elements without external memory traffic. Moreover, the FPGA’s advantage scales with pipeline depth: each additional stage amplifies the benefit of on‑chip data reuse, leading to dramatically lower EDP values.
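The pipeline-depth scaling argument can be made concrete with a back-of-envelope model. Assume (this simplification is ours, not the paper's) that a staged implementation writes and re-reads the full frame in external memory between every stage, while a fused streaming pipeline touches external memory only at the input and the output:

```python
def external_traffic_bytes(frame_bytes, num_stages, streaming):
    """Rough external-memory traffic for an n-stage pipeline."""
    if streaming:
        # One read of the input frame, one write of the final result;
        # intermediate pixels stream between stages on-chip.
        return 2 * frame_bytes
    # Each stage reads its input frame and writes its output frame.
    return 2 * frame_bytes * num_stages

frame = 1920 * 1080  # 1080p grayscale, 1 byte/pixel
for stages in (1, 4, 8):
    staged = external_traffic_bytes(frame, stages, streaming=False)
    fused = external_traffic_bytes(frame, stages, streaming=True)
    print(stages, staged // fused)  # traffic ratio equals pipeline depth
```

Under this model the traffic ratio grows linearly with pipeline depth, which is consistent with the paper's observation that each additional stage amplifies the FPGA's on-chip data-reuse advantage.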
The CPU, while benefiting from NEON SIMD extensions, is limited by fewer cores and lower memory bandwidth, resulting in moderate performance but relatively high static power. The GPU, despite its high throughput, suffers from substantial static power draw and reduced efficiency on branching‑heavy kernels. The FPGA, though requiring more design effort (timing closure, resource budgeting), delivers the most favorable trade‑off between power and latency for complex, real‑world vision pipelines.
The authors contribute (1) a publicly available benchmark suite (https://github.com/isu-rcl/cvBench) that can be reused for future studies, (2) detailed insights into why each architecture excels or falters for specific kernel categories, and (3) a comparative analysis based on the Energy‑Delay Product, which highlights that the “best” accelerator depends on the application’s computational characteristics rather than a one‑size‑fits‑all metric.
In conclusion, the paper advises practitioners to select GPUs for simple, highly parallel tasks where raw throughput is paramount, and to opt for FPGAs when dealing with complex, multi‑stage vision pipelines under strict energy budgets. Future work is suggested to explore hybrid CPU‑GPU‑FPGA co‑execution models, broader image formats, and higher resolution workloads to further refine the decision framework for embedded vision system designers.