Introduction

Although the architectural design and implementation of accelerators for Artificial Intelligence (AI) is a very popular topic, a careful review of papers in this area indicates that both architectures and their circuit implementations are routinely evaluated on AlexNet, a deep neural net (DNN) architecture that has fallen out of use and whose fat (in model parameters) and shallow (in layers) architecture bears little resemblance to typical DNN models for computer vision. This initial error is compounded by other problems in the procedures used to evaluate results. As a result, the utility of many of these NN accelerators on real application workloads is largely unproven. At the same time, contemporary DNN design focuses principally on accuracy on target benchmarks, with little consideration of speed and even less of energy. Moreover, the implications of DNN design choices for hardware execution are not always understood.

Thus, a significant gap exists between state-of-the-art NN accelerator design and state-of-the-art DNN model design. This problem will be carefully reviewed in a longer version of this paper. In this paper, we simply present the results of a coarse-grain co-design approach for closing the gap and demonstrate that careful tuning of the accelerator architecture to a DNN model can lead to a $`1.9-6.3\times`$ improvement in speed when running that model. We also show that integrating hardware considerations into the design of a neural net model can yield an improvement of $`2.6\times`$ in speed and $`2.25\times`$ in energy compared to SqueezeNet ($`8.3\times`$ and $`7.5\times`$ compared to AlexNet), while improving the accuracy of the model.

The remainder of this paper is broadly organized as follows. In Section 10, we begin with a brief introduction to applications in embedded computer vision, and their natural constraints in speed, power, and energy. In Section 7, we discuss the design of NN accelerators for these embedded vision applications. In Section 8, we turn our focus to the co-design of DNN and NN accelerators. We end with our conclusions.

Design of NN Accelerators for Embedded Vision

The power, energy, and speed constraints for embedded vision applications discussed in the previous section naturally motivate a specialized accelerator for NN inference. The typical approach to micro-architectural design of accelerators is to find a representative workload, extract its characteristics, and tailor the micro-architecture to that workload. However, as DNN models are evolving quickly, we feel that co-design of DNN models and NN accelerators is especially well motivated.

Per-layer inference time (bar) and utilization efficiency (dotted and solid lines) of SqueezeNet v1.0 on the reference WS/OS architectures and Squeezelerator.

Key Elements of NN Accelerators

Spatial architectures are a class of accelerator architectures that exploit high computational parallelism using direct communication between an array of relatively simple processing elements (PEs). Compared to SIMD architectures, spatial architectures have relatively low on-chip memory bandwidth per PE, but they scale well in terms of routing resources and memory bandwidth. Convolutions constitute 90% or more of the computation in DNNs for embedded vision, which are therefore called convolutional neural networks (CNNs). Thanks to the high degree of parallelism and data reusability of the convolution, the spatial architecture is a popular option for accelerating these CNNs/DNNs. Hereafter, we restrict the NN accelerators we consider to spatial architectures.

To exploit this massive parallelism, NN accelerators contain a large number of PEs that run in parallel. A typical PE consists of a MAC unit and a small buffer or register file for local data storage. Many accelerators employ a two-dimensional array of PEs, ranging in size from as small as $`8\times8`$ to as large as $`256\times256`$. However, an increase in the number of PEs requires an increase in memory bandwidth. A MAC operation has three input operands and one output operand, and supplying these operands to hundreds of PEs using only DRAM is limited in terms of bandwidth and energy consumption. Thus, NN accelerators provide several levels of memory hierarchy to feed the MAC unit of each PE, and each level is designed to exploit the data reuse of the convolutional layer to minimize accesses to the level above it. This hierarchy includes global buffers (on-chip SRAMs) ranging from tens of KBs to tens of MBs, interconnections between PEs, and local register files in the PE. The memory hierarchy and the data reuse scheme are among the most important features that distinguish NN accelerators. Some accelerators also have dedicated blocks to process NN layers other than convolutional layers. Since these layers have very low computational complexity, they are usually processed in a 1D SIMD manner.
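The bandwidth argument above can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, not a model of any particular accelerator: the function name, the PE count, the operand width, and the clock frequency are all hypothetical values chosen for illustration. Only the count of four operands per MAC (three inputs, one output) comes from the text.

```python
def operand_bandwidth_bytes_per_s(num_pes, bytes_per_operand, clock_hz,
                                  operands_per_mac=4):
    """Bandwidth needed if every operand of every MAC came from DRAM.

    operands_per_mac=4 reflects three input operands plus one output
    operand per MAC; all other arguments are illustrative assumptions.
    """
    return num_pes * operands_per_mac * bytes_per_operand * clock_hz


# Hypothetical configuration: 16x16 PEs, 16-bit operands, 1 GHz clock.
demand = operand_bandwidth_bytes_per_s(
    num_pes=16 * 16, bytes_per_operand=2, clock_hz=1_000_000_000)
print(demand / 1e9, "GB/s")  # 2048.0 GB/s with no on-chip reuse at all
```

Under these assumed numbers, the demand without any reuse is on the order of terabytes per second, which illustrates why each level of the hierarchy is designed to absorb as many operand accesses as possible.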

A Taxonomy of NN Accelerator Architectures

There are several features that distinguish NN accelerators, and the following are some examples.

PE: data format (log, linear, floating-point), bit width, implementation of arithmetic unit (bit-parallel, bit-serial), data to reuse (input, weight, partial sum)
PE array: size, interconnection topology, data reuse, algorithm mapping
global buffer: configuration (unified, dedicated), memory type (SRAM, eDRAM)
data compression, sparsity exploitation, multi-core configuration

Eyeriss proposed a useful taxonomy that classifies NN accelerators according to the type of data each PE locally reuses. Since the degree of data reuse increases toward the lower levels of the memory hierarchy, this classification captures the characteristic reuse scheme of each NN accelerator. Of the four dataflows, weight stationary (WS), output stationary (OS), row stationary (RS), and no local reuse (NLR), two are introduced here: WS and OS.

Weight Stationary

The weight stationary (WS) dataflow is designed to minimize the bandwidth and energy required to read model weights by maximizing the number of weight accesses served from the register file in the PE. The execution process is as follows. The PE preloads one weight of the convolution filters into its register. Then, it performs MAC operations with that weight over the whole input feature map. The result of the MAC is sent out of the PE in each cycle. Afterwards, the PE moves to the next weight element, and so forth.
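The WS execution order can be sketched for a single PE and a single-channel 2D convolution. This is a minimal functional sketch under stated assumptions (stride 1, no padding, one input and one output channel); the function names and the `weight_loads` counter are hypothetical and are not taken from any accelerator described here. The point is the loop order: the weight loops are outermost, so each weight is loaded once and reused for every output position, while partial sums leave the PE after every MAC.

```python
def conv2d_reference(x, w):
    """Direct 2D convolution (stride 1, no padding) used as a reference."""
    out_h, out_w = len(x) - len(w) + 1, len(x[0]) - len(w[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    for oy in range(out_h):
        for ox in range(out_w):
            for fy in range(len(w)):
                for fx in range(len(w[0])):
                    out[oy][ox] += w[fy][fx] * x[oy + fy][ox + fx]
    return out


def conv2d_weight_stationary(x, w):
    """Single-PE weight-stationary loop order (weight loops outermost)."""
    out_h, out_w = len(x) - len(w) + 1, len(x[0]) - len(w[0]) + 1
    psum = [[0] * out_w for _ in range(out_h)]  # partial sums stored outside the PE
    weight_loads = 0
    for fy in range(len(w)):
        for fx in range(len(w[0])):
            weight_reg = w[fy][fx]  # preload one weight into the PE register
            weight_loads += 1
            for oy in range(out_h):
                for ox in range(out_w):
                    # One MAC per cycle; the updated partial sum leaves the PE.
                    psum[oy][ox] += weight_reg * x[oy + fy][ox + fx]
    return psum, weight_loads


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]
result, loads = conv2d_weight_stationary(x, w)
print(result, loads)  # [[6, 8], [12, 14]] 4
```

Each of the four weights is fetched exactly once, at the cost of reading and writing a partial sum on every MAC.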

There are several ways to map the computation to multiple PEs. One example is to map the weight matrix between the input and output channels to the PE array. Such hardware takes the form of a general matrix-vector multiplier. TPU has a $`256\times256`$ PE array, which performs matrix-vector multiplications over a stream of input vectors in a systolic way. The input vectors are passed to each column in the horizontal direction, and the partial sums of PEs are propagated and accumulated in the vertical direction. In this way, TPU can also reuse inputs up to 256 times and reduce partial sums up to 256 times at the PE array level.
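The matrix-vector mapping described above can be illustrated with a small functional model. This is a sketch under assumptions, not a cycle-accurate model of the TPU: the function name and the array layout (rows indexed by input channel, columns by output channel) are hypothetical choices made for illustration, and the pipelining of a real systolic array is ignored.

```python
def ws_array_matvec(weights, x):
    """Functional model of a weight-stationary PE array.

    weights[i][j] is the weight held by the PE in row i (input channel)
    and column j (output channel). x[i] travels horizontally along row i;
    partial sums travel vertically down each column.
    """
    rows, cols = len(weights), len(weights[0])
    psum = [0] * cols  # partial sums entering the top row
    for i in range(rows):          # partial sums move down one row per step
        for j in range(cols):      # x[i] is reused by every PE in row i
            psum[j] += weights[i][j] * x[i]
    return psum  # one accumulated value leaves the bottom of each column


def ws_array_stream(weights, inputs):
    """Apply the stationary weights to a stream of input vectors."""
    return [ws_array_matvec(weights, x) for x in inputs]


weights = [[1, 2], [3, 4]]
print(ws_array_stream(weights, [[1, 1], [2, 0]]))  # [[4, 6], [2, 4]]
```

In this model, each input element is reused once per column and each output accumulates one product per row, which corresponds to the reuse and reduction factors of up to 256 in a $`256\times256`$ array.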

Output Stationary

The output stationary (OS) dataflow is designed to maximize the accesses of the partial sums within the PE. In each cycle, the PE computes parts of the convolution that will contribute to one output pixel, and accumulates the results. Once all the computations for that pixel are finished, the final result is sent out of the PE and the PE moves to work on a new pixel.
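For comparison with the weight stationary case, the OS execution order for a single PE can be sketched as follows. This is a minimal functional sketch under the same assumptions as before (single channel, stride 1, no padding); the function name and the `output_writes` counter are hypothetical. Here the output loops are outermost, so the partial sum stays in a local accumulator until the pixel is complete.

```python
def conv2d_output_stationary(x, w):
    """Single-PE output-stationary loop order (output loops outermost)."""
    out_h, out_w = len(x) - len(w) + 1, len(x[0]) - len(w[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    output_writes = 0
    for oy in range(out_h):
        for ox in range(out_w):
            acc = 0  # partial sum kept in the PE register
            for fy in range(len(w)):
                for fx in range(len(w[0])):
                    # One MAC per cycle; weights and inputs stream in.
                    acc += w[fy][fx] * x[oy + fy][ox + fx]
            out[oy][ox] = acc  # only the final result leaves the PE
            output_writes += 1
    return out, output_writes


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]
result, writes = conv2d_output_stationary(x, w)
print(result, writes)  # [[6, 8], [12, 14]] 4
```

Each output pixel is written exactly once, whereas a weight must be supplied to the PE on every MAC.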

One example of the OS dataflow architecture is ShiDianNao, which maps a 2D block of the output feature map to the PE array. It has an $`8\times8`$ PE array, and each PE processes a different activation of the same output feature map. The PE array performs $`F_x \times F_y`$ filtering on a $`(F_x+7) \times (F_y+7)`$ block of the input feature map over $`F_x \times F_y`$ cycles. In the first cycle, the top left $`8\times8`$ pixels of the input block are loaded into the PE array. In the following cycles, most of the input pixels are reused via mesh-like inter-PE connections, and only a small part of the input block is read from the global buffer. The corresponding weight is broadcast to all PEs every cycle.
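The block-level schedule described above can be sketched as a functional model. This is an illustration under assumptions, not a description of the actual ShiDianNao implementation: the function name and the parameter `P` (array side, 8 here) are hypothetical, and the inter-PE shifting of input pixels is abstracted into direct indexing of the input block.

```python
def os_array_block(in_block, w, P=8):
    """Functional model of an output-stationary P x P PE array.

    in_block is a (Fy + P - 1) x (Fx + P - 1) block of the input feature
    map and w is an Fy x Fx filter. Each PE (py, px) owns one output pixel.
    """
    Fy, Fx = len(w), len(w[0])
    assert len(in_block) == Fy + P - 1 and len(in_block[0]) == Fx + P - 1
    acc = [[0] * P for _ in range(P)]  # one accumulator per PE
    cycles = 0
    for fy in range(Fy):
        for fx in range(Fx):
            weight = w[fy][fx]  # the same weight is broadcast to all PEs
            for py in range(P):
                for px in range(P):
                    # In hardware, this pixel mostly arrives from a
                    # neighbouring PE rather than from the global buffer.
                    acc[py][px] += weight * in_block[py + fy][px + fx]
            cycles += 1  # all P*P MACs above happen in one cycle
    return acc, cycles


ones_block = [[1] * 10 for _ in range(10)]   # (3 + 7) x (3 + 7) input block
ones_filter = [[1] * 3 for _ in range(3)]    # 3 x 3 filter
acc, cycles = os_array_block(ones_block, ones_filter)
print(acc[0][0], cycles)  # 9 9
```

In the first cycle (`fy = 0`, `fx = 0`), the PEs read the top left $`8\times8`$ pixels of the input block, and the whole $`8\times8`$ output block is complete after $`F_x \times F_y`$ cycles.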