Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs
Deep Convolutional Neural Networks have become a Swiss Army knife for solving critical artificial-intelligence tasks. However, deploying deep CNN models in latency-critical applications remains challenging because of the complex nature of CNNs. Recently, FPGAs have become favorable devices for accelerating deep CNNs thanks to their high parallel-processing capability and energy efficiency. In this work, we explore different fast convolution algorithms, including Winograd and the Fast Fourier Transform (FFT), and find an optimal strategy for applying them together across different types of convolutions. We also propose an optimization scheme to exploit parallelism in novel CNN architectures such as the Inception modules in GoogLeNet. We implement a configurable, IP-based face-recognition acceleration system based on FaceNet using High-Level Synthesis. Our implementation on a Xilinx UltraScale device achieves a 3.75x latency speedup over a high-end NVIDIA GPU and significantly surpasses previous FPGA results.
💡 Research Summary
This paper addresses the challenge of accelerating latency‑critical deep convolutional neural networks (CNNs) on field‑programmable gate arrays (FPGAs) by combining two fast convolution algorithms—Winograd minimal filtering and Fast Fourier Transform (FFT) based convolution—and by introducing a resource‑allocation scheme that exploits the intra‑module parallelism of modern multi‑branch CNN architectures such as the Inception modules used in GoogLeNet and FaceNet.
The authors begin by analyzing the theoretical and practical characteristics of the two algorithms. Winograd excels for small kernels (3×3, 5×5) because it reduces the number of multiplications dramatically while keeping transformation overhead modest. However, as kernel size grows, the transformation matrices become larger and the overhead grows quadratically, eroding its advantage. FFT‑based convolution, on the other hand, transforms both the input feature map and the kernels to the frequency domain, performs element‑wise complex multiplication, and then applies an inverse FFT. Its computational complexity scales favorably with larger kernels (≥5×5) and larger feature maps (≥12×12), but it consumes more LUTs and BRAM for the transform stages.
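The multiplication-count trade-off behind this analysis can be made concrete with a short sketch. A 2-D Winograd transform F(m×m, r×r) produces an m×m output tile with (m+r−1)² multiplications, versus m²r² for direct convolution. The functions below compute these counts; they illustrate the general arithmetic argument, not figures taken from the paper.

```python
def winograd_mults(m, r):
    """Multiplications for a 2-D Winograd tile F(m x m, r x r):
    one multiply per element of the (m + r - 1)^2 transformed tile."""
    return (m + r - 1) ** 2

def direct_mults(m, r):
    """Direct convolution: m^2 outputs, each needing r^2 multiplies."""
    return (m * r) ** 2

# F(2x2, 3x3): 16 multiplies vs. 36 direct -> a 2.25x reduction.
print(winograd_mults(2, 3), direct_mults(2, 3))  # 16 36
```

As the kernel size r grows, the transform tile (m+r−1)² grows quadratically and the transformation matrices become dense, which is exactly why the advantage erodes for large kernels and FFT-based convolution takes over.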
To decide which algorithm to use for each layer, the authors conduct a systematic design‑space exploration on a Xilinx VU9P UltraScale FPGA using Vivado HLS. They evaluate latency for combinations of kernel sizes (3, 5, 7), feature‑map sizes (6, 12, 24), and input/output channel counts (16, 32, 64, 128). The empirical results are distilled into a decision table that maps kernel‑size/feature‑map‑size pairs to the preferred algorithm. For example, all 3×3 convolutions use Winograd; 5×5 convolutions use Winograd unless the feature map is 24×24, in which case FFT is better; and 7×7 convolutions use FFT when the channel depth is large.
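The decision table described above can be captured as a simple selection function. The sketch below encodes the pattern as summarized (3×3 → Winograd; 5×5 → Winograd unless the feature map is 24×24; 7×7 → FFT at large channel depth); the channel-depth threshold is illustrative, not the paper's exact table.

```python
def choose_algorithm(kernel, fmap, channels):
    """Pick a fast-convolution algorithm per layer, following the
    decision-table pattern from the design-space exploration.
    The channel threshold (64) is an illustrative assumption."""
    if kernel == 3:
        return "winograd"                       # Winograd always wins at 3x3
    if kernel == 5:
        return "fft" if fmap >= 24 else "winograd"
    if kernel == 7:
        return "fft" if channels >= 64 else "winograd"
    raise ValueError(f"kernel size {kernel} not in explored design space")
```

In a real flow this lookup would be queried once per layer at design time to instantiate the matching Winograd or FFT engine.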
Beyond algorithm selection, the paper tackles the resource‑allocation problem that arises in Inception‑style modules, where several convolution branches run in parallel and are merged by concatenation or summation. Prior work (e.g., Zhang et al.) used a global Cauchy‑inequality based partition that ignores intra‑module parallelism, leading to sub‑optimal latency because the overall execution time is bounded by the slowest branch. The authors propose a branch‑aware allocation method: they compute the normalized computation complexity of each branch, derive an ideal resource share proportional to this complexity, and then map the ideal share to a realistic hardware share that respects HLS constraints (parallel factors must be powers of two). An iterative greedy algorithm doubles the allocation of the branch with the largest gap until the total allocated resources match the available budget. This scheme is expressed as Algorithm 1 and is implemented by controlling HLS UNROLL and ARRAY_PARTITION directives, ensuring consistent partition factors across adjacent layers to avoid memory‑port contention.
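The iterative greedy step can be sketched as follows: each branch starts with a parallel factor of 1, and the branch whose current (power-of-two) factor falls furthest below its complexity-proportional ideal share is doubled until the budget is exhausted. This is an illustrative reconstruction of the allocation idea in Algorithm 1, not the authors' code.

```python
def allocate_parallel_factors(complexities, budget):
    """Branch-aware resource allocation sketch.
    complexities: per-branch normalized computation complexity.
    budget: total parallel-factor budget across all branches.
    Factors stay powers of two, matching the HLS UNROLL constraint."""
    total = sum(complexities)
    ideal = [budget * c / total for c in complexities]  # ideal shares
    factors = [1] * len(complexities)
    while True:
        # gap between each branch's ideal share and its current factor
        gaps = [i - f for i, f in zip(ideal, factors)]
        best = max(range(len(factors)), key=lambda k: gaps[k])
        # stop when no branch lags its ideal, or doubling would overflow
        if gaps[best] <= 0 or sum(factors) + factors[best] > budget:
            break
        factors[best] *= 2
    return factors

# Four branches with relative complexities 4:2:1:1 and a budget of 16
print(allocate_parallel_factors([4, 2, 1, 1], 16))  # [8, 4, 2, 2]
```

Note how the heaviest branch ends up with the largest factor, equalizing per-branch latency so the module is no longer bounded by its slowest branch.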
Hardware implementation details include separate Winograd and FFT engines with comparable feature‑map reuse factors, allowing a fair comparison of resource usage. FFT engines consume roughly 60 % of the DSPs but up to 2.2 × the LUTs of a baseline design, while Winograd engines save DSPs at the cost of additional adders. Both engines follow a transform‑compute‑transform pattern, and the authors deliberately keep the reuse factor constant across algorithms to isolate algorithmic effects.
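The transform-compute-transform pattern is easiest to see in the smallest 1-D Winograd instance, F(2, 3), which computes two outputs of a 3-tap convolution with four multiplications instead of six. The matrices below are the standard F(2, 3) transforms, shown here as a minimal reference sketch rather than the paper's hardware engine.

```python
# Standard 1-D Winograd F(2, 3) transform matrices
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G  = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def winograd_f23(d, g):
    """d: 4-sample input tile, g: 3-tap kernel -> 2 convolution outputs."""
    U = matvec(G, g)                    # kernel transform (precomputable)
    V = matvec(BT, d)                   # input-tile transform
    M = [u * v for u, v in zip(U, V)]   # compute: 4 element-wise multiplies
    return matvec(AT, M)                # output (inverse) transform

# Box filter over [1, 2, 3, 4]: outputs 1+2+3 and 2+3+4
print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # [6.0, 9.0]
```

An FFT engine follows the same three-stage shape, with the forward/inverse FFTs playing the role of the transforms and element-wise complex multiplication as the compute stage, which is why holding the reuse factor constant isolates the purely algorithmic differences.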
The complete system is built as a configurable IP core targeting the Inception‑V2 module of FaceNet, a state‑of‑the‑art face‑recognition network that outputs 128‑dimensional embeddings trained with triplet loss. Using HLS, the C++ description of the network is synthesized into RTL, placed on a Xilinx UltraScale VU9P, and evaluated against a high‑end NVIDIA RTX 2080 Ti GPU. Results show a 3.75× reduction in inference latency compared with the GPU, while power consumption is roughly 30 % of the GPU’s, yielding more than a two‑fold improvement in energy efficiency over prior FPGA implementations of GoogLeNet. An ablation study confirms that removing either the hybrid algorithm selection or the branch‑aware resource allocation degrades performance by up to 1.8×, underscoring the importance of both contributions.
In summary, the paper demonstrates that (1) a hybrid convolution strategy that selects Winograd or FFT based on layer‑specific characteristics can capture the best of both worlds, and (2) a fine‑grained, branch‑aware resource partitioning algorithm can fully exploit the parallelism inherent in modern multi‑branch CNNs on FPGAs. The proposed framework automates the otherwise labor‑intensive process of algorithm and hardware co‑design, enabling designers to achieve low‑latency, energy‑efficient deep‑learning inference on reconfigurable hardware without extensive manual tuning.