A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things

Reading time: 5 minutes

📝 Abstract

Convolutional neural network (CNN) offers significant accuracy in image detection. To implement image detection using CNN in the internet of things (IoT) devices, a streaming hardware accelerator is proposed. The proposed accelerator optimizes the energy efficiency by avoiding unnecessary data movement. With unique filter decomposition technique, the accelerator can support arbitrary convolution window size. In addition, max pooling function can be computed in parallel with convolution by using separate pooling unit, thus achieving throughput improvement. A prototype accelerator was implemented in TSMC 65nm technology with a core size of 5mm2. The accelerator can support major CNNs and achieve 152GOPS peak throughput and 434GOPS/W energy efficiency at 350mW, making it a promising hardware accelerator for intelligent IoT devices.
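The filter-decomposition idea mentioned above can be illustrated in software: because convolution is linear, an arbitrary K×K filter can be zero-padded and split into fixed-size tiles (3×3 here), and the partial sums of the tile convolutions accumulate to the full result. The sketch below is a minimal pure-Python illustration of that principle, not the paper's hardware implementation; the tile size, function names, and padding scheme are illustrative assumptions.

```python
def corr2d(img, f):
    """Direct 2-D correlation with a square K x K filter (valid padding)."""
    K = len(f)
    oh, ow = len(img) - K + 1, len(img[0]) - K + 1
    return [[sum(img[i + u][j + v] * f[u][v]
                 for u in range(K) for v in range(K))
             for j in range(ow)] for i in range(oh)]

def decomposed_corr2d(img, f, tile=3):
    """Same result, computed as an accumulated sum of tile x tile sub-filters.

    Models mapping an arbitrary window size onto fixed-size convolution
    hardware: the filter is zero-padded up to a multiple of `tile`, split
    into tiles, and each tile's partial convolution is accumulated.
    """
    K = len(f)
    P = -(-K // tile) * tile                 # filter size rounded up to a tile multiple
    pad = P - K
    # Zero-pad the filter to P x P and the image on the bottom/right edges.
    fp = [[f[u][v] if u < K and v < K else 0 for v in range(P)] for u in range(P)]
    imgp = [row + [0] * pad for row in img] + \
           [[0] * (len(img[0]) + pad) for _ in range(pad)]
    oh, ow = len(img) - K + 1, len(img[0]) - K + 1
    out = [[0] * ow for _ in range(oh)]
    for r0 in range(0, P, tile):             # one sub-filter tile at a time
        for c0 in range(0, P, tile):
            sub = [[fp[r0 + u][c0 + v] for v in range(tile)] for u in range(tile)]
            for i in range(oh):
                for j in range(ow):
                    out[i][j] += sum(imgp[i + r0 + u][j + c0 + v] * sub[u][v]
                                     for u in range(tile) for v in range(tile))
    return out
```

Since the zero-padded tiles contribute nothing outside the original filter support, the accumulated output matches the direct convolution exactly for any window size.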


📄 Content

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Index Terms—Convolutional Neural Network, Deep Learning, Hardware Accelerator, IoT

I. INTRODUCTION

Machine learning offers many innovative applications in IoT devices, such as face recognition, smart security, and object detection [1-3]. State-of-the-art machine-learning computation mostly relies on cloud servers [4-5]. Benefiting from the graphics processing unit (GPU)'s powerful computation ability, the cloud can process high-throughput video data coming from the devices and use CNNs to achieve unprecedented accuracy on most AI applications [6]. However, this approach has its own drawbacks. Since network connectivity is necessary for cloud-based AI applications, those applications cannot run in areas without network coverage. In addition, data transfer through the network induces significant latency, which is not acceptable for real-time AI applications such as security systems. Finally, most IoT applications have a tight power and cost budget that can tolerate neither local GPU solutions nor transmitting massive amounts of image and audio data to data-center servers [7].
To address these challenges, a localized AI processing scheme is proposed. This scheme processes the acquired data at the client side and finishes the whole AI computation without access to a communication network.

Conventionally, this is done with a local GPU or DSP. However, that yields limited computation ability and relatively large power consumption, making it unsuitable for running computation-hungry neural networks such as CNNs on power-limited IoT devices [8]. Consequently, it is crucial to design a dedicated CNN accelerator inside IoT devices that can support high-performance AI computation with minimal power consumption.

Some reported works on neural-network acceleration focus on providing an architecture for computing general neural networks. For example, in [9], an efficient hardware architecture is proposed based on the sparsity obtained by properly pruning the network. However, it is a more general architecture for computing fully-connected deep neural networks and does not consider parameter reuse. In contrast, a CNN has the unique feature that the filters' weights are largely reused across each image during scanning. Benefiting from this feature, many dedicated CNN hardware accelerators have been reported [10-12]. Most reported CNN accelerators focus only on accelerating the convolution part while ignoring the implementation of the pooling function, which is a common layer in CNNs. In [10], a CNN hardware accelerator using a spatial architecture with 168 processing elements is demonstrated. In [11], another dedicated convolution accelerator with loop-unfolding optimization is reported. Since the pooling function is not implemented in those accelerators, the convolution results must be transferred to a CPU/GPU to run pooling and then fed back to the accelerator to compute the next layer. This data movement not only consumes considerable power but also limits overall performance. On the other hand, some works report highly configurable neural-network processors, but they require complicated data-flow control, which adds hardware overhead to IoT devices.
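The data-movement cost described above can be made concrete with a small sketch: if pooling runs in a separate unit alongside convolution, each pooling window can be reduced as soon as its convolution rows are produced, so the full convolution map never has to be written out to a CPU/GPU and read back. This is a minimal pure-Python model of that streaming behavior, assuming 2×2 max pooling with stride 2; the function names are illustrative, not from the paper.

```python
def conv_rows(img, f):
    """Generator yielding convolution output one row at a time (valid padding)."""
    K = len(f)
    for i in range(len(img) - K + 1):
        yield [sum(img[i + u][j + v] * f[u][v]
                   for u in range(K) for v in range(K))
               for j in range(len(img[0]) - K + 1)]

def conv_then_pool(img, f, p=2):
    """Fused convolution + p x p max pooling (stride p).

    Only `p` rows of convolution output are ever buffered; once a pooling
    window of rows is complete it is reduced and the buffer is discarded,
    modeling a pooling unit that runs in parallel with convolution instead
    of a round trip through a host CPU/GPU.
    """
    buf, pooled = [], []
    for row in conv_rows(img, f):
        buf.append(row)
        if len(buf) == p:                    # a full band of rows is ready
            pooled.append([max(buf[u][j + v] for u in range(p) for v in range(p))
                           for j in range(0, len(row) - p + 1, p)])
            buf = []                         # free the band; no off-chip traffic
    return pooled
```

The design point being modeled is that the intermediate buffer stays at a constant p rows regardless of image size, whereas a pool-on-host flow must move the entire convolution map off the accelerator and back.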
For example, [12] reports a CNN processor occupying 16 mm2 of silicon area in 65nm CMOS technology, which can be intolerable for low-cost IoT chips. In addition, several recent reports, such as [13], propose using memristors to perform neuromorphic computing for CNNs.

This content is AI-processed based on ArXiv data.
