A Reconfigurable Low Power High Throughput Architecture for Deep Network Training
📝 Abstract
General purpose computing systems are used for a large variety of applications. Extensive support for flexibility in these systems limits their energy efficiency. Neural networks, including deep networks, are widely used for signal processing and pattern recognition applications. In this paper we propose a multicore architecture for deep neural network based processing. Memristor crossbars are utilized to provide low power, high throughput execution of neural networks. The system has both training and recognition (evaluation of new input) capabilities. The proposed system could be used for classification, dimensionality reduction, feature extraction, and anomaly detection applications. The system level area and power benefits of the specialized architecture are compared with the NVIDIA Tesla K20 GPGPU. Our experimental evaluations show that the proposed architecture can provide up to five orders of magnitude greater energy efficiency than GPGPUs for deep neural network processing.
📄 Content
Keywords—Low power architecture; memristor crossbars; autoencoder; on-chip training; deep network.

I. INTRODUCTION

Reliability and power consumption are among the main obstacles to continued performance improvement of future multicore computing systems [1]. As a result, several research groups are investigating the design of energy efficient processors from different angles. These include architectures for approximate computation utilizing dynamic voltage scaling, dynamic precision control, and inexact hardware [2,3]. Emerging non-volatile memory technologies are being investigated as low power on-chip caches [4]. Application specific architectures have also been proposed for several domains, such as signal processing and video processing. Interest in specialized architectures for accelerating neural networks has increased significantly because of their ability to reduce power, increase performance, and allow fault tolerant computing. IBM recently developed the TrueNorth chip [5], consisting of 4,096 neurosynaptic cores interconnected via an intra-chip network; its synapse element is SRAM based, and off-chip training is utilized. DaDianNao [6] is an accelerator for deep neural networks (DNNs) and convolutional neural networks (CNNs). In this system, neuron synaptic weights are stored in eDRAM and later brought into the Neural Functional Units for execution.
Recently, deep neural networks (or deep networks) have gained significant attention because of their superior performance in classification and recognition applications. Training and evaluation of a deep network are both computationally and data intensive tasks. This paper presents a generic multicore architecture for training and recognition with deep network applications. The system has both unsupervised and supervised learning capabilities. The proposed system could be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications.
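The combination of unsupervised and supervised learning referred to above (and suggested by the "autoencoder" keyword) can be sketched as unsupervised autoencoder pretraining that learns a compressed feature representation, whose encoder output would then feed a supervised stage. The following is a minimal NumPy toy sketch, not the paper's memristor architecture; the network dimensions, synthetic data, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic unlabeled data with a 3-dimensional latent structure,
# so a 3-unit bottleneck can reconstruct it well.
Z = rng.normal(size=(64, 3))
M = rng.normal(size=(3, 8))
X = sigmoid(Z @ M)

n_in, n_hid, lr = 8, 3, 0.5                # illustrative sizes
W1 = rng.normal(0.0, 0.1, (n_hid, n_in)); b1 = np.zeros(n_hid)  # encoder
W2 = rng.normal(0.0, 0.1, (n_in, n_hid)); b2 = np.zeros(n_in)   # decoder

for _ in range(2000):                      # unsupervised pretraining loop
    H = sigmoid(X @ W1.T + b1)             # encode to 3 features
    Xr = sigmoid(H @ W2.T + b2)            # decode / reconstruct
    d_out = (Xr - X) * Xr * (1 - Xr)       # backprop of squared error
    d_hid = (d_out @ W2) * H * (1 - H)
    W2 -= lr * d_out.T @ H / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * d_hid.T @ X / len(X); b1 -= lr * d_hid.mean(axis=0)

err = np.mean((sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2) - X) ** 2)
print(f"reconstruction MSE after pretraining: {err:.4f}")
# The encoder output H would then feed a supervised classifier
# (or the next autoencoder layer) for fine-tuning.
```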
The memristor [7] is a novel non-volatile device with a large, variable resistance range. Physical memristors can be laid out in a high density grid known as a crossbar [8]. A memristor crossbar can evaluate many multiply-add operations, the dominant operations in neural networks, in parallel in the analog domain. We use memristor crossbars in the proposed system because they provide high synaptic weight density and parallel analog processing at very low energy. In this system, processing happens at the physical location of the data; thus, data transfer energy and functional unit energy consumption are reduced significantly.
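The crossbar's parallel multiply-add follows from basic circuit laws: each cross-point conductance multiplies its row's input voltage (Ohm's law), and each column wire sums the resulting currents (Kirchhoff's current law), so one read of the array yields a full matrix-vector product. A minimal numerical sketch, with made-up conductance and voltage values:

```python
import numpy as np

# G[i, j] is the conductance (siemens) at the cross-point of input
# row i and output column j; V[i] is the voltage driven on row i.
G = np.array([[1.0e-4, 5.0e-5],
              [2.0e-4, 1.0e-4],
              [5.0e-5, 2.0e-4]])
V = np.array([0.2, 0.1, 0.3])

# Column j collects I_j = sum_i G[i, j] * V[i]: every multiply-add
# in the product happens concurrently in the analog array.
I = G.T @ V
print(I)  # column output currents in amperes
```

Since conductances are non-negative, signed synaptic weights are typically realized as the difference of a conductance pair (w = G+ - G-), which simply subtracts two such column currents.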
Both the training and the recognition phases of the neural networks were examined. As deep networks involve large layers, efficient approaches to simulate and implement large memristor crossbars for these networks are important. We present a novel method to accurately simulate large crossbars at high speed. Detailed circuit level simulations of memristor crossbars were carried out to verify the neural operations. We evaluated the power, area, and performance of the proposed multicore system and compared them with a GPU based system. Our results indicate that the memristor based architecture can provide up to five orders of magnitude greater energy efficiency than a GPU for the selected benchmarks.
The most closely related memristor core designs in this area are [9,10], where the impact on area, power, and throughput is examined for systems that carry out recognition tasks only. Unsupervised training and deep network training are not examined in these studies. These systems are based on ex-situ training and do not examine on-chip training and
Raqibul Hasan and Tarek M. Taha
Department of Electrical and Computer Engineering