Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
📝 Abstract
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks: the many convolutional and fully-connected layers demand a large amount of communication for parallel computation. Multi-core CPU-based solutions have proven inadequate for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume high power and also face memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored; they allow the memory hierarchy to be implemented using embedded BlockRAM, which promotes the parallel use of shared memory elements between multiple processing units while avoiding data replication and inconsistency. This makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework, originally from the GPU world, as a pseudo-automatic development solution for FPGA designs. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance, and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization, and more compact boards. Altera provides multi-platform tools, a mature design community, and better execution times.
📄 Content
R. Tapiador, A. Rios-Navarro, A. Linares-Barranco. Robotic and Technology of Computers Lab, University of Seville, Seville, Spain. alinares@atc.us.es
Minkyu Kim, Deepak Kadetotad, Jae-sun Seo. School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA. jaesun.seo@asu.edu
Abstract—Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks: the many convolutional and fully-connected layers demand a large amount of communication for parallel computation. Multi-core CPU-based solutions have proven inadequate for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume high power and also face memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored; they allow the memory hierarchy to be implemented using embedded BlockRAM, which promotes the parallel use of shared memory elements between multiple processing units while avoiding data replication and inconsistency. This makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework, originally from the GPU world, as a pseudo-automatic development solution for FPGA designs. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance, and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization, and more compact boards. Altera provides multi-platform tools, a mature design community, and better execution times.
Keywords—Deep Learning; Convolutional Neural Network; Hardware Acceleration; OpenCL; FPGA; Caffe; Xilinx; Altera.
I. INTRODUCTION
In recent years, through a series of breakthrough algorithms [1-5], convolutional neural networks have significantly improved the state of the art in large-scale image recognition tasks. Driven by such success, CNNs have become widespread across a broad range of applications including vision, object detection, speech recognition, autonomous driving, image captioning, etc. Typically, CNNs consist of a large number of deep layers and can involve hundreds of millions of parameters. Using high-end GPGPUs (General-Purpose Graphics Processing Units), the networks are trained iteratively with the back-propagation algorithm for days or weeks, and the networks with trained weights can then be deployed onto hardware for classification tasks.
There have been a number of prior works [6-12] that built hardware on different platforms for efficient CNN implementation (as accelerators or as complete architectures on hardware), such as FPGA [6-9] and ASIC (application-specific integrated circuit) [10-12]. ASIC or custom chip designs show better energy efficiency, but cannot easily map various CNN algorithms onto their rigid circuits. On the other hand, FPGA platforms are much more flexible and can easily map any given CNN algorithm with hardware optimizations. For FPGAs, designers can perform manual RTL design [7], but using high-level synthesis (HLS) tools can prove effective [8-9] in terms of design time and wide design-space exploration. The authors in [8] employed HLS tools in the Xilinx framework to optimize a CNN implementation, while the authors in [9] explored an Open Computing Language (OpenCL) based implementation in the Altera framework for throughput optimization of CNNs.
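To make concrete what such accelerators compute, the following is a minimal C reference of the loop nest at the heart of a CNN convolutional layer: each output feature map accumulates 2-D convolutions over all input maps. A kernel with this structure is the kind of code that OpenCL-based HLS flows pipeline and unroll into hardware; the dimensions and names below are illustrative assumptions, not taken from the paper.

```c
#include <assert.h>

/* Illustrative layer dimensions (assumptions, not from the paper). */
#define IN_MAPS  1
#define OUT_MAPS 2
#define IN_SIZE  6
#define K        3
#define OUT_SIZE (IN_SIZE - K + 1)   /* valid convolution, stride 1 */

/* Reference loop nest for one convolutional layer: for every output
 * map and output pixel, accumulate the KxK window over all input maps. */
void conv_layer(const float in[IN_MAPS][IN_SIZE][IN_SIZE],
                const float w[OUT_MAPS][IN_MAPS][K][K],
                const float bias[OUT_MAPS],
                float out[OUT_MAPS][OUT_SIZE][OUT_SIZE])
{
    for (int o = 0; o < OUT_MAPS; o++)
        for (int y = 0; y < OUT_SIZE; y++)
            for (int x = 0; x < OUT_SIZE; x++) {
                float acc = bias[o];
                for (int i = 0; i < IN_MAPS; i++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[o][i][ky][kx] * in[i][y + ky][x + kx];
                out[o][y][x] = acc;   /* activation omitted for clarity */
            }
}
```

In an OpenCL kernel the two outer spatial loops typically become the work-item index space, while the HLS compiler pipelines or unrolls the inner multiply-accumulate loops.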
Since the high-level synthesis tools are developed differently within the Xilinx and Altera frameworks, it is difficult to determine from the designer's point of view which option or FPGA chip would be the best candidate for certain objectives (area, speed, etc.). In this paper, we provide a comprehensive evaluation and comparison of the same CNN using both Xilinx's and Altera's OpenCL-based high-level synthesis tool flows. The remainder of the paper is organized as follows. In Section II, the OpenCL programming models are described. In Section III, Altera's OpenCL design flow and hardware system are discussed, while Xilinx's SDAccel design flow and hardware platform are presented in Section IV. The LeNet-5 ConvNet [20] for the MNIST handwritten-digit classification scenario is presented in Section V. In Section VI, the hardware results and impl
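As background for the LeNet-5 network evaluated in Section V, the feature-map sizes of its layers follow from the standard output-size formula for an unpadded convolution or pooling window. The sketch below encodes that formula in C; the layer dimensions checked against it (32x32 input, 5x5 convolutions, 2x2 subsampling) are the classic LeNet-5 values assumed here, not figures quoted from this paper.

```c
#include <assert.h>

/* Output size of an unpadded convolution or pooling layer:
 * out = floor((in - k) / stride) + 1
 * where `in` is the input width/height, `k` the kernel/window size,
 * and `stride` the step between window positions. */
static int out_size(int in, int k, int stride)
{
    return (in - k) / stride + 1;
}
```

Applying it layer by layer with the classic LeNet-5 parameters yields the familiar sequence of map sizes: 32 -> 28 (C1, 5x5 conv) -> 14 (S2, 2x2 pool) -> 10 (C3, 5x5 conv) -> 5 (S4, 2x2 pool) -> 1 (C5, 5x5 conv), after which the fully connected classifier takes over.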