A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things
📝 Abstract
Convolutional neural networks (CNNs) offer significant accuracy in image detection. To implement image detection using CNNs in internet of things (IoT) devices, a streaming hardware accelerator is proposed. The proposed accelerator optimizes energy efficiency by avoiding unnecessary data movement. With a unique filter decomposition technique, the accelerator can support an arbitrary convolution window size. In addition, the max pooling function can be computed in parallel with convolution by using a separate pooling unit, thus improving throughput. A prototype accelerator was implemented in TSMC 65 nm technology with a core size of 5 mm². The accelerator can support major CNNs and achieves 152 GOPS peak throughput and 434 GOPS/W energy efficiency at 350 mW, making it a promising hardware accelerator for intelligent IoT devices.
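The headline figures quoted above are mutually consistent: energy efficiency in GOPS/W is simply throughput divided by power in watts. A minimal sketch of that arithmetic (the function name is illustrative, not from the paper):

```python
# Sanity check of the abstract's figures: 152 GOPS peak throughput
# at 350 mW should correspond to the reported 434 GOPS/W.

def energy_efficiency_gops_per_watt(throughput_gops: float, power_mw: float) -> float:
    """Return energy efficiency in GOPS/W given throughput (GOPS) and power (mW)."""
    return throughput_gops / (power_mw / 1000.0)

eff = energy_efficiency_gops_per_watt(152.0, 350.0)
print(round(eff))  # 434, matching the reported energy efficiency
```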
📄 Content
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Index Terms—Convolutional Neural Network, Deep Learning, Hardware Accelerator, IoT
I. INTRODUCTION
MACHINE LEARNING offers many innovative applications in
the IoT devices, such as face recognition, smart security
and object detection [1-3]. State-of-the-art machine-learning
computation mostly relies on cloud servers [4-5]. Benefiting
from the graphics processing unit (GPU)'s powerful computation
ability, the cloud can process high throughput video data
coming from the devices and use CNN to achieve
unprecedented accuracy on most AI applications [6]. However,
this approach has its own drawbacks. Since the network
connectivity is necessary for cloud-based AI applications, those
applications cannot run in the areas where there is no network
coverage. In addition, data transfer through network induces
significant latency, which is not acceptable for real-time AI
applications such as security systems. Finally, most of the IoT
applications have a tough power and cost budget which could
tolerate neither local GPU solutions nor transmitting massive
amounts of image and audio data to data center servers [7].
To address these challenges, a localized AI processing
scheme is proposed. The localized AI processing scheme aims
at processing the acquired data at the client side and finishes the
whole AI computation without communication network access.
Conventionally, this is done through local GPU or DSP.
However, this results in a limited computation ability and
relatively large power consumption, making it unsuitable for
running computation-hungry neural networks such as CNNs on
power-limited IoT devices [8]. Consequently, it is crucial to
design a dedicated CNN accelerator inside the IoT devices that
can support high-performance AI computation with minimal
power consumption. Some reported works on neural
network acceleration focus on providing an architecture
for computing general neural networks. For example, in [9], an
efficient hardware architecture is proposed based on the
sparsity of the neural network through pruning the network
properly. However, it is a general architecture for computing
fully-connected deep neural networks without considering
parameter reuse. On the contrary, the CNN has its unique
feature that the filters’ weights will be largely reused
throughout each image during scanning. Benefiting from this
feature, many dedicated CNN hardware accelerators are
reported [10-12]. Most reported CNN accelerators focus only
on accelerating the convolution part while ignoring the
implementation of the pooling function, which is a common
layer in CNNs. In [10], a CNN hardware accelerator
using a spatial architecture with 168 processing elements is
demonstrated. In [11], another dedicated convolution
accelerator with loop-unfolding optimization is reported. Since
the pooling function is not implemented in those accelerators, the
convolution results must be transferred to CPU/GPU to run
pooling function and then fed back to the accelerator to compute
the next layer. This data movement not only consumes much
power but also limits overall performance. On the other hand,
some works report highly configurable neural network
processors, but they require complicated data-flow control. This
adds hardware overhead to IoT devices. For example, [12]
reports a CNN processor occupying 16 mm2 silicon area in
65nm CMOS technology, which can be intolerable for low-cost
IoT chips. In addition, several recent reports, such as [13],
propose using memristors to perform neuromorphic
computing for CNNs.
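The weight-reuse property noted above (the same filter weights are applied at every window position as the filter scans an image) can be illustrated with a minimal 2-D convolution sketch. This is a generic, dependency-free illustration of the reuse opportunity, not the proposed accelerator's actual dataflow:

```python
# Minimal 2-D "valid" convolution: the identical kernel weights are
# reused at every output position, which is the reuse opportunity
# dedicated CNN accelerators exploit to avoid refetching weights.

def conv2d_valid(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # Every (r, c) position reuses the same kernel weights.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]  # the same 4 weights are reused at all 9 positions
print(conv2d_valid(image, kernel))  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
```

For a 2x2 kernel sliding over a 4x4 image, the four weights are fetched once and reused nine times; for realistic layer sizes this reuse factor is orders of magnitude larger, which is why keeping weights stationary on-chip saves so much data movement.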