DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration

Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O'Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C. Ling, Gordon R. Chiu
Programmable Solutions Group, Intel
Toronto, Canada
{firstname.lastname}@intel.com

Abstract—Overlays have shown significant promise for field-programmable gate arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden, resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain-specific graph compiler that compiles deep learning languages such as Caffe or TensorFlow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs) – we demonstrate a 3× improvement on ResNet-101 and a 12× improvement for long short-term memory (LSTM) cells, compared to naïve implementations. Finally, we describe how we can tailor our hardware overlay, and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150 – the fastest ever reported on comparable FPGAs.
I. INTRODUCTION

Creating custom high-performance hardware designs on field-programmable gate arrays (FPGAs) is difficult and time-consuming when compared to software-programmable devices such as CPUs. A hardware designer must describe their system in a cycle-accurate manner, and worry about low-level hardware considerations such as timing closure to memory interfaces. Over the past decade, significant progress has been made in easing the use of FPGAs through high-level languages such as OpenCL, making it easier to implement high-performance designs [9]. However, even when using high-level design, one must still carefully describe an efficient parallel hardware architecture that leverages the FPGA's capabilities, such as the massive on-chip memory bandwidth or configurable multiplier blocks. Additionally, the designer must optimize both area and frequency through long compilations to realize performance gains versus other programmable platforms. Compared to writing a software algorithm targeting a CPU, designing for FPGAs is still drastically more difficult.

Our goal in this paper is to present a software-programmable hardware overlay on FPGAs that combines the ease-of-use of software programmability with the efficiency of custom hardware design. We introduce a domain-specific approach to overlays that leverages both software and hardware optimizations to achieve state-of-the-art performance on the FPGA for neural network (NN) acceleration. For hardware, we partition configurable parameters into runtime and compile-time parameters, such that the architecture can be tuned for performance at compile time, and the overlay programmed at runtime to accelerate different NNs. We do this through a lightweight very-long instruction word (VLIW) network that delivers full reprogrammability to our overlay without incurring any performance or efficiency overhead (typical overlays have large overhead [4]).
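The split between the two parameter classes can be sketched as follows. This is an illustrative sketch only – the names and fields are ours, not from the paper: compile-time parameters are baked into the FPGA bitstream, while runtime parameters are delivered as VLIW instructions between networks.

```python
# Illustrative sketch (names are assumptions, not DLA's actual API) of
# the compile-time vs. runtime parameter split described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class CompileTimeParams:   # fixed when the overlay bitstream is built
    q_vec: int             # width parallelism
    k_vec: int             # output-depth parallelism
    stream_buffer_kb: int  # on-chip scratchpad size
    aux_kernels: tuple     # e.g. ("maxpool", "lrn")

@dataclass
class RuntimeParams:       # reloaded per subgraph via the VLIW network
    image_dims: tuple      # (height, width, depth)
    pool_window: int
    pool_type: str         # "max" or "average"
```

Changing a `CompileTimeParams` field implies a new FPGA compile; changing `RuntimeParams` only requires streaming new instructions to the overlay.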
Additionally, we create a flexible architecture where only the core functions required by a NN are connected to a parameterizable interconnect (called the Xbar). This avoids the need to include all possible functions in our overlay at runtime; rather, we can pick from our library of optimized kernels based on the group of NNs that will run on our system. Our approach is unlike previous work that created hardware that can only run a single, specific NN [1], [7], [8].

On the software side, we introduce an architecture-aware graph compiler that efficiently maps a NN to the overlay. This both maximizes hardware efficiency when running the design and simplifies the usability of the end application: users are only required to enter domain-specific deep learning languages, such as Caffe or TensorFlow, to program the overlay. Our compiler generates VLIW instructions that are loaded into the FPGA and used for reprogramming the overlay in tens of clock cycles, thus incurring no performance overhead. Compared to fixed-function accelerators that can only execute one NN per application run, our approach opens the door to running multiple NNs consecutively in a single application run [12] by simply reprogramming our overlay instead of recompiling or reconfiguring the FPGA.

The rest of this paper is organized as follows. Section II introduces our hardware architecture. We describe how we target specific NNs using our compile-time parameters and Xbar interconnect. Importantly, we describe our lightweight VLIW network, used for programming the overlay, in Section II-A. Next, we describe our NN graph compiler in Section III, and detail some of our architecture-driven optimizations that allow the efficient implementation of NNs on architecture variants of different sizes. Sections IV and V detail how our graph compiler and hardware overlay work together for efficient implementation of CNNs and RNNs.
We walk through hardware and software optimizations in implementing both the ResNet and GoogLeNet CNNs, allowing us to achieve record-setting performance on GoogLeNet. Finally, we discuss the implementation of a long short-term memory (LSTM) cell by simply adding an additional kernel to our overlay, and relying on our graph compiler to mutate the LSTM cell graph to fit within our overlay. In this paper, we refer to our system as "DLA" – our Deep Learning Accelerator.

[Fig. 1: System-level diagram of our neural network inference accelerator (DLA): a PE array (PE-0 to PE-N, each with a filter cache), fed by an on-chip stream buffer, with auxiliary kernels (MaxPool, LRN, width adaptation, activation) connected through the Xbar, and external DDRx/HBM memory. Vectorization widths shown include C_VEC × Q_VEC × P_VEC into the PE array, K_VEC across PEs, and DRAIN_VEC/AUX_VEC on the output paths.]

II. HARDWARE ARCHITECTURE

Our domain-specific overlay aims to be general enough to implement any NN, yet remain customizable so that it can be optimized for a specific NN. Fig. 1 shows an overview of our overlay. At its core is a 1D systolic processing element (PE) array that performs dot-product operations in each PE to implement general matrix math such as convolutions or multiplications. We omit the discussion of numerics in this paper, but we support different floating-point formats such as FP32/16/11/10/9/8, which have been shown to work well with inference [2] – these could easily be modified to support nascent innovations in data-type precision, such as bfloat [15] and other unique fixed- or floating-point representations, thanks to the flexible FPGA fabric. As Fig. 1 shows, our Xbar interconnect can augment the functionality of our overlay with different auxiliary functions (also referred to as kernels in this paper).
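A simplified functional model of one cycle of the PE array described above: each of the K_VEC PEs holds one filter slice (from its filter cache) and computes a dot product against the same input feature vector, producing K_VEC output-channel partial sums in parallel. This is a behavioral sketch, not the hardware implementation.

```python
# Behavioral sketch of one PE-array cycle: K_VEC PEs each dot-product
# their cached filter slice against the shared C_VEC-wide feature vector.
def pe_array_cycle(feature_vec, filters):
    """feature_vec: C_VEC values; filters: K_VEC rows of C_VEC weights."""
    return [sum(w * x for w, x in zip(f, feature_vec)) for f in filters]
```

A real convolution accumulates these partial sums over the filter window and input-depth iterations; this sketch shows only the per-cycle parallel dot products.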
This section goes through the different parts of our hardware architecture and highlights the built-in compile-time flexibility and run-time programmability of our overlay.

A. VLIW Network

To implement a NN on DLA, our graph compiler breaks it into units called "subgraphs" that fit within the overlay's buffers and compute elements. For example, with convolutional neural networks (CNNs), a subgraph is typically a single convolution with an optional pooling layer afterwards. We deliver new VLIW instructions for each subgraph to program DLA correctly for the subgraph's execution.

[Fig. 2: VLIW network distributes instructions to each kernel: a VLIW reader fetches from DDR4 and feeds a chain of "Transport" kernels connected to the stream buffer, Xbar, and pool kernels.]

Our novel VLIW network distributes instructions to each kernel as shown in Fig. 2. The VLIW reader continuously fetches the instructions for the next subgraph from external memory and sends them down an 8-bit unidirectional ring network that is connected to all of the kernels in DLA. The VLIW instruction sequence is divided into portions for each kernel. A special header packet identifies the kernel, followed by a series of programming instructions destined for that kernel. The "Transport" kernels parse the header packet and redirect the instructions that follow to the correct kernel, as shown in Fig. 2. The transport kernels also assemble the 8-bit packets into 32-bit-wide instructions for direct kernel consumption. Our instructions are actually counter end-values and control flags that are directly loaded into registers within each kernel to govern its operation – this avoids the need for any instruction-decode units. For example, the pool kernel receives approximately a dozen instructions: the image height/width/depth, the pool window size, and the type of pooling (maxpool or average pool).
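The routing behavior of the transport kernels can be modeled with a short sketch. The paper does not specify the exact packet format, so the header layout here (a kernel-ID byte followed by a byte count) is an assumption for illustration; only the two behaviors named above – header-based redirection and assembly of 8-bit packets into 32-bit instructions – are from the text.

```python
# Model of the 8-bit VLIW ring network: header-based routing plus
# assembly of byte packets into 32-bit instruction words per kernel.
# The (kernel_id, nbytes) header format is an assumption of this sketch.
import struct

def route_vliw_stream(stream):
    """Split an 8-bit instruction stream into per-kernel 32-bit words."""
    kernels = {}
    i = 0
    while i < len(stream):
        kernel_id, nbytes = stream[i], stream[i + 1]  # assumed header
        payload = bytes(stream[i + 2:i + 2 + nbytes])
        # Transport kernels assemble 8-bit packets into 32-bit instructions.
        words = [struct.unpack_from("<I", payload, off)[0]
                 for off in range(0, nbytes, 4)]
        kernels.setdefault(kernel_id, []).extend(words)
        i += 2 + nbytes
    return kernels
```

For example, a pool kernel's portion of the stream would carry its image dimensions and window size as a handful of such 32-bit register values.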
Before executing each subgraph, the pool kernel reads each of its ~12 instructions serially, consuming 12 clock cycles – this has no material impact on the performance of a subgraph that typically takes thousands of cycles. However, it ensures that the entire VLIW network can remain only 8 bits wide, with a minimal area overhead of only ~3000 LUTs – about 1% of an Arria 10 1150 FPGA device, as shown in Table I. Adding new auxiliary programmable functions (kernels) to DLA is simple and has little overhead – we extend the VLIW network with an additional transport kernel, and connect the new kernel to the Xbar without affecting existing kernels or instructions.

TABLE I: Area overhead of the VLIW network for DLA with 10 kernels at a frequency of 450 MHz on Arria 10.

              LUTs   FFs    ALMs
VLIW Reader   1832   1841   1473
Transport      126    139     73
Total         3092   3231   2046

B. Xbar Interconnect

Machine learning is a fast-developing field – we increasingly see new functions implemented by the machine learning research community. For example, new activation functions such as "Swish" [10] are constantly being evaluated. A quick look at TensorFlow shows that there are more than 100 different layer types that users can experiment with in building different NNs [15]. We aim to use the Xbar for extensibility of DLA, so that users can easily add or remove functions to implement different types of NNs. Fig. 1 shows an example Xbar interconnect used to connect pool/LRN kernels for CNNs. As the diagram shows, the Xbar is a custom interconnect built around exactly what is needed to connect the auxiliary kernels. For example, the SqueezeNet graph has no local response normalization (LRN) layers, so we can remove that kernel completely. From a prototxt architecture description, the Xbar (including width adaptation) is automatically created to connect the auxiliary kernels.
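The automatic Xbar generation can be sketched as follows: from the auxiliary-op sequences that actually appear in the target graphs (e.g., parsed from a prototxt description), keep only the kernels used and record which kernel-to-kernel connections the interconnect must support. The function name and the fixed `pe_array`/`drain` endpoints are assumptions of this sketch.

```python
# Hypothetical sketch of Xbar generation: derive the required kernel set
# and point-to-point connections from observed auxiliary-op orderings.
def build_xbar(aux_sequences):
    kernels, edges = set(), set()
    for seq in aux_sequences:
        path = ["pe_array", *seq, "drain"]  # assumed fixed endpoints
        kernels.update(seq)
        edges.update(zip(path, path[1:]))
    return kernels, edges
```

In this model, a graph set where LRN always precedes MaxPool yields one connection between them, while a graph set with both orders (as in GoogLeNet) yields both connections – the extra multiplexing the text mentions.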
We use width adapters to control the throughput of each auxiliary kernel – for example, we can decrease the width of infrequently used kernels such as LRN to conserve logic resources. The interconnection pattern within the Xbar is also customizable based on the order of the auxiliary operations. For example, the AlexNet graph has both MaxPool and LRN layers, but LRN always comes first; whereas the GoogLeNet graph has some layers in which MaxPool precedes LRN, which is supported by adding more multiplexing logic. To demonstrate the power of our extensible architecture (and compiler, presented in Section III), we add a single kernel to the Xbar in Section V that extends our architecture to also implement LSTM cells alongside CNNs – this allows implementing video-based RNNs commonly used, for instance, for gesture recognition [16].

C. Vectorization

To ensure our overlay can be customized to different neural network models and FPGA devices, we support vectorization – degrees of parallelism – across different axes. Fig. 1 shows some of the degrees of parallelism available in the accelerator, configurable via vectorization. Q_VEC and P_VEC refer to the parallelism in the width and height dimensions, while C_VEC and K_VEC refer to the input and output depth parallelism, respectively. Every clock cycle, we process the product of Q_VEC × P_VEC × C_VEC × K_VEC feature values in parallel.

[Fig. 3: Normalized throughput/area on two architectures with different vectorization (P_VEC=1, K_VEC=64 vs. P_VEC=4, K_VEC=16) across AlexNet, GoogLeNet, SqueezeNet, VGG-16, and ResNet-101.]

[Fig. 4: Impact of the stream buffer memory vs. compute tradeoff (normalized throughput vs. normalized on-chip memory size) on AlexNet, GoogLeNet, and ResNet-101.]
Initially, our design was scaled by increasing K_VEC; however, this method of scaling saw diminishing returns, since quantization inefficiencies become more pronounced as vectorization dimensions increase. For example, if the output depth (K) of a layer is 96 and K_VEC is 64, this requires 2 complete iterations through the PE array, with only 96/128 (75%) useful computations. On the other hand, if K_VEC is 32, the output depth divides perfectly into 3 iterations at 100% efficiency. To mitigate this quantization effect, it is possible to balance the scaling of the design across multiple dimensions besides just K_VEC (e.g., P_VEC, Q_VEC, C_VEC). The optimal balance of vectorization depends on the graph's layer dimensions. Fig. 3 demonstrates this point by comparing the throughput of two architectures of similar area on different graphs. As the figure shows, the optimal balance of scaling the design between P_VEC and K_VEC varies based on the neural network topology being used. This is an example of how we tune our overlay to get top performance on specific NNs.

D. Stream Buffer and Filter Caches

A single Arria 10 FPGA contains ~4 TB/s of on-chip memory bandwidth, interspersed within the FPGA in configurable 20 Kbit memory blocks. This powerful FPGA resource is pivotal in determining the performance of FPGA compute operations – DLA leverages these block RAMs to buffer both activation and filter tensors. As Fig. 1 shows, filters are stored in a double-buffered "filter cache" contained in each PE, allowing the PEs to compute data while filters are pre-loaded from external memory for the next subgraph. The "stream buffer" is a flexible scratchpad used to store intermediate tensors on-chip. Many of our graph compiler passes are dedicated to efficient use of this stream buffer, as Section III will show.
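The K_VEC quantization effect described in Section II-C reduces to a one-line calculation: a layer of output depth K occupies ceil(K / K_VEC) passes through the PE array, of which only K of the ceil(K / K_VEC) × K_VEC computed outputs are useful.

```python
import math

# Sketch of the K_VEC quantization effect: fraction of useful PE-array
# computations for a layer of output depth k on a K_VEC-wide array.
def pe_efficiency(k, k_vec):
    iterations = math.ceil(k / k_vec)
    return k / (iterations * k_vec)
```

This reproduces the paper's example: `pe_efficiency(96, 64)` gives 0.75 (two iterations, 96/128 useful), while `pe_efficiency(96, 32)` gives 1.0 (three full iterations).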
When presented with an intermediate tensor larger than the stream buffer or filter caches, our graph compiler slices the tensor into multiple pieces that fit within our on-chip caches; the remaining pieces are stored in slower off-chip memory and require higher latency to fetch and compute. To limit this slicing, we can increase the size of the stream buffer and/or filter caches, but this decreases the number of RAM blocks available for increasing PE array vectorization. Therefore, there is a memory-vs-compute tradeoff for each NN that balances the size of the caches against the number of PEs – Fig. 4 illustrates this tradeoff for different NNs. As the figure shows, a tradeoff that is optimal for one NN can cause 40% or more performance degradation for a second NN.

III. GRAPH COMPILER

The previous section focused on the hardware overlay architecture and how to configure it at compile time to maximize performance for a specific NN graph. This section describes our NN graph compiler, which takes advantage of the overlay's VLIW instructions to decompose, optimize, and run a NN model on the overlay. The graph compiler breaks down a NN into subgraphs, schedules subgraph execution, and, importantly, allocates explicit cache buffers to optimize the use of our stream buffer and filter caches. This section goes through our core compiler "passes" (slicing, scheduling, and allocation), and shows examples of how smart graph compilation allows more efficient hardware implementations. Besides these general core passes, our compiler implements more specific algorithms that target and optimize specific NN patterns, as we show in Sections IV and V.

A. Slicing

To achieve the highest possible throughput for a given DLA architecture, it is desirable to size the stream buffer and filter caches such that the entire input feature tensor and filter tensor fit on-chip.
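The fit-or-slice decision made by the slicing pass can be sketched as follows, here for height-wise slicing only. The overlap rule – slices must share (R − 1) rows when the filter window R is larger than 1 – is from Section III-A; everything else (function name, capacity units) is an illustrative assumption, and the real compiler also considers width- and depth-wise slicing.

```python
import math

# Sketch of the slicing pass's fit check: does an H x W x D feature
# tensor fit in the stream buffer, and if not, how many height-wise
# slices (with (R - 1)-row halo overlap) are needed?
def num_height_slices(h, w, d, capacity, filter_r=1):
    if h * w * d <= capacity:
        return 1  # whole tensor fits on-chip
    rows_per_slice = capacity // (w * d)
    usable = rows_per_slice - (filter_r - 1)  # rows of new output per slice
    return math.ceil(h / usable)
```

The halo term shows why the compiler prefers slice shapes that minimize overlapped computation: a larger filter window wastes more rows per slice.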
However, as the resolution of images increases and graph topologies for NNs become deeper, on-chip allocation for these tensors may not be feasible. To overcome this constraint, slices of the input tensor are fetched from external memory into the stream buffer and processed independently by DLA. The 3D input feature tensor can be sliced along the height, width, or depth to fit in the on-chip stream buffer. When slicing along the width and height, the slices must overlap if the filter window size is greater than 1x1. The graph compiler tries to pick slices that minimize the overlapped computation for the sliced tensor. Alternatively, slicing across the depth does not require overlapped computations, but requires an additive operation to combine the results of the depth-wise slices.

To boost performance and minimize the number of DDR4 spill-points, we enhance our slicing algorithm to slice multiple sequential convolutions together (called "Group Slicing"). Instead of completing all slices within a layer, we compute several sequential convolutions on a single slice using the stream buffer before moving on to the next slice. Fig. 5 illustrates how group slicing reduces the number of external-memory spill-points for a sample NN. For ResNet-101 with 1080p (HD) image resolution, our Group Slicing algorithm improves throughput by 19% compared to simple slicing.

[Fig. 5: Group slicing minimizes external memory spill-points by computing multiple sequential convolutions (Conv1–Conv3) for each slice, compared to normal slicing, which spills every layer's slices to DDR4.]

[Fig. 6: Double-buffering in the stream buffer: the input buffer and output buffer grow toward each other, leaving a contiguous buffer available for use in the middle.]

B. Allocation

The allocation pass manages reading from and writing to the stream buffer.
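A minimal sketch of the double-buffered allocation scheme used by this pass (per Section III-B): inputs are allocated upward from address 0 and outputs downward from the top of the buffer, leaving a contiguous free region in the middle. The class name and interface are illustrative, not DLA's actual allocator.

```python
# Sketch of double-buffered stream-buffer allocation: input addresses
# count up from 0, output addresses count down from the top, and the
# contiguous middle region remains free for extra data slices.
class StreamBufferAllocator:
    def __init__(self, size):
        self.size, self.in_top, self.out_bottom = size, 0, size

    def alloc_input(self, n):
        addr = self.in_top
        self.in_top += n
        assert self.in_top <= self.out_bottom, "stream buffer overflow"
        return addr

    def alloc_output(self, n):
        self.out_bottom -= n
        assert self.in_top <= self.out_bottom, "stream buffer overflow"
        return self.out_bottom

    def free_space(self):
        return self.out_bottom - self.in_top  # contiguous middle region
```

Growing the two buffers toward each other is what avoids fragmentation: the free region is always a single contiguous span, available for the branch tensors of graphs like GoogLeNet.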
Allocation calculates the read and write addresses for each slice, and computes the total stream buffer memory used by a graph. One of its main goals is to reduce fragmentation – gaps between allocated memory blocks in the stream buffer. In its simplest operation, the stream buffer is used as a double buffer to store both the input and the output of a subgraph. To achieve this double-buffering while reducing fragmentation, the input buffer starts at address 0 and counts up, while the output buffer starts at the end of the stream buffer and counts down. As Fig. 6 shows, this leaves a contiguous space in the middle of the stream buffer that can be used to allocate more data slices; this is especially useful for graphs that have multiple branches, as demonstrated by the GoogLeNet example in Section III-C. Note that the allocation pass must keep track of the lifetime of each buffer so it can free/overwrite its memory in the stream buffer once it is no longer used. Additionally, our allocation pass assigns addresses in external memory when the stream buffer isn't large enough; since external memory size is not a constraint, this is simply done left-to-right, in the first available space.

C. Scheduling

The DLA compiler partitions NNs into subgraphs, where a subgraph is a list of functions that can be chained together and implemented on DLA without writing to a buffer except at the very end of the subgraph's execution – scheduling decides when each subgraph is executed. In the case of early CNN models such as AlexNet [5] or VGG16 [17], there is very little need for a scheduler, as there are no decisions to be made about which subgraph to execute next. When considering CNNs
with branching nodes such as GoogLeNet [6] or ResNet [14], or graphs that require slicing, the order of subgraph execution heavily influences the stream buffer size required for a given graph to avoid external memory spill-points.

[Fig. 7: Scheduling one of the GoogLeNet [6] inception modules: (a) inception module with relative output sizes for each subgraph; (b) stream buffer usage over time steps 0–8 with a depth-first schedule; (c) stream buffer usage with an improved schedule.]

Fig. 7a illustrates an example of an inception module from GoogLeNet, partitioned into DLA subgraphs with the relative output sizes of each subgraph. We show the stream buffer allocation corresponding to two possible schedules of the inception module. Both are depth-first schedules, but in Fig. 7b we start with the leftmost branch, while Fig. 7c starts with the rightmost branch. This simple change in schedule results in a 30% reduction in the size of the stream buffer required for this inception module. When considering large graphs with many branching nodes that either converge to a single output, such as GoogLeNet, or diverge to several outputs, such as those used for single-shot multibox detection [12], an exhaustive search of all possible schedules may be infeasible without incurring large compile-time penalties. Our scheduling is conducted using a priority-queue-based approach, where the cost of executing a given node is determined by the ratio of its output size to its effective input size (the size of the input multiplied by the number of users of the input tensor). This approach achieves the stream buffer savings of Fig. 7c with minimal impact on compiler runtime.
IV. CNN IMPLEMENTATION

This section focuses on two popular CNNs: ResNet [14] and GoogLeNet [6]. We explain different hardware/software co-optimizations that are possible because of our runtime-reconfigurable and software-programmable overlay. These allow us to significantly boost the performance of these CNNs on DLA at runtime with little effort, as we show in our results.

A. ResNet Convolution Merging

ResNet-101 is a large graph that can be targeted at high-definition image resolutions, creating intermediate tensors that require significant slicing to run on the DLA overlay on Arria 10. ResNet is composed of three types of resmodules, as shown in Fig. 8. Each type has two convolution branches, merged through an element-wise addition operation (eltwise).

[Fig. 8: Types of resmodules in ResNet: (a) Type 1 – branch A (1x1 conv) and branch B (B1: 1x1 conv + ReLU, B2: 3x3 conv + ReLU, B3: 1x1 conv) joined by an eltwise; (b) Type 2 – the same, with stride-2 convolutions A and B1; (c) Type 3 – the input feeds the eltwise directly, with no convolution A.]

We present a resmodule optimization (implemented automatically in our compiler) that eliminates the eltwise operation by merging it with the preceding convolution(s). This reduces the total number of arithmetic operations in DLA and, more importantly, decreases the number of slices and DDR4 spill-points. Instead of storing intermediate tensors between the convolution and the eltwise addition operations, we combine them in a single convolution operation whose tensor size is at least half as big as the eltwise input. Consider the computation that produces every output element of the eltwise in a Type 1 resmodule (Fig. 8a) – it is the sum of the corresponding output elements of convolutions A and B3.
As illustrated in Fig. 9a, this sequence of operations is equivalent to a single convolution after inputs A and B3 (and the corresponding filters A and B3) are merged depth-wise. This effectively absorbs the eltwise addition operation into the dot-product operation of the preceding convolutions. Fig. 9b shows the Type 1 resmodule after convolutions A and B3 are merged with the eltwise layer. Since this optimization converts the explicit eltwise operations into a convolution, outputs A and B3, which would usually reside in DDR4 or on-chip memory, become intermediate results of the merged convolution and are stored in on-chip registers. This reduction in memory traffic is especially prominent in resmodules where outputs A and B3 are 4× the size of inputs A and B3.

In order for Type 2 and Type 3 resmodules to benefit from this optimization, we convert them to Type 1. For Type 2 (Fig. 10a), we push the stride-2 convolutions A and B1 upstream to the layer before the input. Not only does this convert the resmodule to Type 1, it also cuts the amount of computation in the upstream layer and reduces the input traffic to convolutions A and B1. For Type 3 (Fig. 10b), we introduce an identity convolution – which creates an identical output tensor from an input tensor – in the left branch.

[Fig. 9: Convolution merging optimization: (a) eltwise elimination by depth-concatenating inputs A and B and their corresponding filters; (b) the optimized Type 1 resmodule, with a merged A + B3 1x1 convolution.]
[Fig. 10: Resmodule type conversion to benefit from the convolution-merging optimization: (a) Type 2, with the stride-2 convolutions pushed upstream of the input; (b) Type 3, with an identity convolution added in the left branch.]

B. Non-convolution Primitives

While almost all layers in ResNet are convolutions, there are a couple of exceptions – a single Global Average Pooling (GAP) layer and a single Fully-Connected (FC) layer at the end. This is also true for GoogLeNet, where there is a single FC layer and a single average pooling layer. Given the extremely low frequency of these non-convolution layers (e.g., 2 out of 147 for ResNet-101), it is best to map them to convolutions. In this way, we can reuse the powerful convolution engine (PE array) instead of adding dedicated auxiliary kernels that would be under-utilized over time.

An FC layer performs a multiplication between a vector (input) and a matrix (weights). It can be mapped to a convolution as follows: 1) the 1D FC input of length N is mapped to a 3D convolution input of shape 1 × 1 × N, and 2) the 2D FC weight matrix of shape N × M is mapped to M 3D convolution filters of shape 1 × 1 × N. With this mapping, the computation of each FC output is assigned to a PE.

Average pooling with a window of H × W on a 2D image is equivalent to a 2D convolution with a filter of size H × W, where each filter element has value 1/(H × W). For a 3D input of depth D, average pooling is applied to each 2D input surface, producing the corresponding output surface. In this case, the equivalent convolution filter for the output surface at depth d is of shape H × W × D, with all filter values zero except the surface at depth d, which holds the average pooling filter.

C. Sparse Filter Shortening

Even though they save area, the identity and average pooling convolutions introduced in the previous optimizations could come at a high cost to throughput, due to the large but sparse filters involved.
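The FC-to-convolution mapping above can be checked with a few lines of plain Python: each of the M "filters" is a depth column of the N × M weight matrix, and the FC output is simply M dot products – one per PE. This is a functional reference, not the DLA implementation.

```python
# Reference for the FC-to-convolution mapping of Section IV-B: a length-N
# input times an N x M weight matrix becomes M filters of shape 1x1xN.
def fc_as_1x1_convolution(x, weights):
    """x: length-N input; weights: N x M matrix (list of rows)."""
    n, m = len(weights), len(weights[0])
    # Each of the M "filters" is a depth column of the weight matrix.
    filters = [[weights[i][j] for i in range(n)] for j in range(m)]
    # One dot product per filter = one FC output assigned to one PE.
    return [sum(f[i] * x[i] for i in range(n)) for f in filters]
```

The same check works for the average-pooling mapping: a filter whose only non-zero surface holds 1/(H × W) values reproduces the mean of the corresponding input surface.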
For an identity conv olution of input Number of filte r s = output depth = D Non-zer o f i lt e rs Zero f il t ers – pruned awa y d ue t o sparsit y V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V K_VEC K_VEC DLA c omp utation after pruning Filte r d e pt h = input de pt h = D Filter f ace s i ze e qual s a v e ra ge pool wi n dow size o r 1x 1 in c a s e of ide nti t y f i l t e r 1/4 1/4 1/4 1/4 1 2x 2 a v g . pool I d e n tity Fig. 11: Sparse filter shortening with identity and av erage-pooling con volution filters. and output shape H × W × D , there are D filters, each of shape 1 × 1 × D . Since each filter is responsible for copying input surface at depth d to the output surface at the same depth, the values of this filter are all zeros except 1 at depth d . Fig. 11 illustrates both the identity and av erage pooling con volution filters, and how we can lev erage their sparsity to conserve operations on DLA. W e improve performance by skipping the computation with filter entries that are filled with zeros. Since the PEs process K V E C filters at a time, we trim the filters size K V E C to fit perfectly in the PE array . This effecti vely reduces the filter depth from D to K V E C , saving both compute time and filter data loading time. W e call this optimization sparse filter shortening , which can also be applied to the av erage pooling con v olution as sho wn in Fig. 11, due to the same filter sparsity . D. 1x1 Filter s Optimization T o efficiently compute con volutions using 3x3 filters, the DLA architecture is often tuned to be vectorized in the filter width dimension by setting S VEC=3. Increasing the filter width vectorization increases PE throughput as well as filter prefetch bandwidth for large (eg. 3x3) filters. Howe ver , many of the latest CNNs hav e a mix of 3x3 and 1x1 filters. 
Convolutions using 1x1 filters do not benefit from filter width vectorization, and would thus achieve low DSP efficiency and filter prefetch bandwidth. To avoid this, the DLA architecture has been optimized for 1x1 filters in two ways. First, the DSPs that would have been used in a 3x3-filter convolution to process the second and third filter values in the filter width direction are instead used to calculate two additional output pixels in a 1x1-filter convolution. This allows the PEs to maintain the same DSP efficiency for both 3x3 and 1x1 filters. Second, the filter prefetch bandwidth added to load a 3-wide filter is used to simply load more 1-wide filters in parallel. Overall, these two optimizations allow DLA to achieve high throughput through vectorization for 3x3-filter convolutions without suffering any additional quantization loss for 1x1-filter convolutions.

E. Optimization Impact on ResNet

Table II summarizes the impact of each optimization on the throughput of ResNet-101 with 1080p image resolution. The number in each row is the normalized throughput after applying all optimizations listed up to that row. Here, we apply the mapping of GAP and FC layers to convolutions unconditionally (i.e., in the baseline). The huge speedup of sparse filter shortening comes from the filters of the identity convolutions introduced by the convolution-merging optimization on Type 3 resmodules, which account for 87% of all resmodules in ResNet-101.

TABLE II: Optimization impact on ResNet-101.

Optimization               Relative Throughput
Baseline                   1.0
1x1 Filter Opt             1.3
Conv. merging (Type 3)     1.7
Sparse filter shortening   2.8
Group slicing              3.1

F. Optimization Impact on GoogLeNet

Two of the described CNN optimizations are used to improve throughput on GoogLeNet: (1) the 1x1 filter optimizations and (2) the mapping of average pooling to convolution – the latter allowed DLA to fit a larger PE array instead of spending dedicated resources on an average-pooling kernel.
As shown in Table III, GoogLeNet saw a 17% throughput improvement from these two optimizations. The following row in the table shows the throughput improvement from increasing the PE array vectorization (from {P_VEC, K_VEC} = {1,48} to {2,32}). Finally, the last row in the table points to an accurate model of external memory optimizations that allows DLA to achieve ~900 fps on GoogLeNet on Intel's Arria 10 1150 device, which to our knowledge is the most efficient acceleration of GoogLeNet on FPGAs. This optimization entails continuously fetching filters for subsequent NN layers until the filter cache becomes full, instead of limiting filter prefetch to only one layer ahead. While this slightly complicates the filter prefetch logic, it has a negligible area cost but allows hiding external memory latency when fetching the NN model.

TABLE III: Optimization impact on GoogLeNet.

  Optimization               Relative Throughput   Raw Throughput (Intel Arria 10 1150)
  Baseline                   1.0                   469 fps
  1x1 Filter Opt             1.1                   506 fps
  Avg Pool Mapped to Conv    1.2                   550 fps
  Additional Vectorization   1.7                   777 fps
  External Memory Opt        1.9                   ~900 fps

V. LSTM CELL IMPLEMENTATION

LSTM cells are a widely-used variant of RNNs, commonly used in speech recognition [3], translation [18] and motion detection [16]. DLA is designed to be a flexible NN accelerator for all relevant deep learning workloads, including LSTM-based networks. As such, this section discusses how our graph compiler mutates an LSTM cell to map well to the DLA overlay with high performance.

Fig. 12: Graph/Matrix view of an LSTM cell, and how we combine its matrix-multiplications into one big matrix.
A. Mapping an LSTM Cell to DLA

Most of the computation in an LSTM cell occurs in 8 matrix multiplications to compute the 3 LSTM gates (input/forget/output) [11]. Fig. 12 illustrates how we combine those 8 matrices into one big matrix; this reduces DLA execution from 12 subgraphs to a single subgraph, which runs at least ~12× faster. First, the 4 matrices that are multiplied by the input/history are each height-concatenated, as shown in the example in Fig. 12. This is a generic optimization that can be applied to any matrix-vector multiplications that share the same input vector. We end up with two large matrix multiplications: one matrix for the input (x_t), and another for the history (h_{t-1}). Next, we combine those two matrices, and the element-wise addition that follows, into one larger matrix through width concatenation of the matrices and height concatenation of the input and history vectors, as shown in Fig. 12. This gives us one large matrix multiplication for the entire LSTM cell. Depending on the LSTM cell size, our compiler may later decide to slice this large matrix if it does not fit on the FPGA, as described in Section III-A.
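The two concatenation steps can be verified with a small sketch (our own NumPy code, not the DLA compiler; the weight names follow standard LSTM gate notation and the sizes are arbitrary):

```python
import numpy as np

n = 4  # hidden/input size (illustrative)
rng = np.random.default_rng(0)
Wx = {g: rng.random((n, n)) for g in 'igfo'}  # input weights per gate
Wh = {g: rng.random((n, n)) for g in 'igfo'}  # history weights per gate
x, h = rng.random(n), rng.random(n)

# Naive: 8 separate matrix-vector multiplications plus element-wise adds.
naive = {g: Wx[g] @ x + Wh[g] @ h for g in 'igfo'}

# Step 1 (height concatenation): stack the 4 input-weight matrices into one
# tall matrix multiplied by x; likewise for the history-weight matrices and h.
Wx_all = np.vstack([Wx[g] for g in 'igfo'])
Wh_all = np.vstack([Wh[g] for g in 'igfo'])

# Step 2 (width concatenation): fuse the two matmuls and the element-wise
# addition into a single matrix times the concatenated [x; h] vector.
W_big = np.hstack([Wx_all, Wh_all])
combined = W_big @ np.concatenate([x, h])

# Each n-row slice of the combined result matches one gate's naive output.
for k, g in enumerate('igfo'):
    assert np.allclose(combined[k*n:(k+1)*n], naive[g])
```

The sketch shows why the fusion is exact: concatenation only reorders where each dot product lives, so the accelerator can replace many small subgraphs with one large, well-vectorized matrix multiplication.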
With the combined matrix, each of the LSTM gates is computed one after the other, since we compute the matrix rows in order. However, this is not FPGA-friendly, as each of the input/forget/output gate values would then need to be buffered (using costly on-chip RAM or slow external memory) so that they can be combined in the second half of the LSTM cell. Instead, by interleaving the rows of the large matrix (so that the first row contains the filters for the input gate, the second row for the 'g' gate, the third row for the forget gate, and the fourth row for the output gate), we can compute one output from each gate in each time step, as shown in Fig. 13. This removes the need for buffering large intermediate gate outputs [13], and allows us to directly stream the gate values into the dedicated LSTM hardware block shown in Fig. 14. This demonstrates the flexibility of the DLA overlay, and the power of our graph compiler in implementing different NNs. By simply attaching the LSTM kernel to the Xbar, we can leverage our powerful multi-precision PE array to compute the matrix-multiplication portion of the LSTM cell, then stream data directly into the dedicated LSTM block.

Fig. 13: Matrix row interleaving allows streaming different LSTM gate values simultaneously instead of buffering each gate separately.

Fig. 14: Streaming LSTM hardware block to compute the element-wise operations of an LSTM cell.

B. External-Memory-Bound RNNs

Non-convolutional neural networks are effectively a matrix-vector multiplication when computed with batch=1. Most of the applications that use RNNs are real-time applications such as speech/gesture recognition or translation; therefore, they require low-batch, low-latency processing that is ideal for FPGAs. However, external memory bandwidth is often a bottleneck, since a large matrix has to be fetched from external memory only to be multiplied with one vector; compute time is lower than memory fetch time, so it is impossible to hide memory fetch latency. Intel's Stratix 10 devices have 2 HBM2 devices integrated on some of their boards, providing up to 500 GB/s of peak memory bandwidth, which is 20× higher than a DDR4-2400 memory.
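A back-of-envelope sketch (our own model, not code from the paper; the byte width and bandwidth figures are assumptions) shows why such workloads are memory-bound: per-step latency is bounded below by the model size divided by external memory bandwidth.

```python
# Hypothetical memory-bound latency model for batch-1 RNN inference: the full
# weight matrix must be streamed from external memory for every input vector,
# so per-step latency >= model_bytes / bandwidth.

def min_latency_per_step(model_bytes, bandwidth_bytes_per_s):
    """Lower bound on time to process one input vector (seconds)."""
    return model_bytes / bandwidth_bytes_per_s

# 4-layer stacked LSTM with input = output = hidden = 2048 (as in Fig. 15).
# Per layer, the combined LSTM matrix is (4*2048) x (2048 + 2048); we assume
# 2-byte weights here (an assumption, not the paper's stated precision).
layer_bytes = (4 * 2048) * (2048 + 2048) * 2
model_bytes = 4 * layer_bytes  # ~268 MB for the whole model

ddr4_bw = 19.2e9  # one DDR4-2400 channel, peak
hbm2_bw = 500e9   # two HBM2 stacks, peak (figure quoted above)

print(min_latency_per_step(model_bytes, ddr4_bw))  # bound with DDR4
print(min_latency_per_step(model_bytes, hbm2_bw))  # bound with HBM2
```

Since the compute for one vector finishes faster than the weight fetch, raising peak bandwidth translates almost directly into lower latency, which is what the modeled DDR4-to-HBM2 sweep illustrates.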
Fig. 15: Latency of an LSTM NN when varying external memory bandwidth.

In Fig. 15, we look towards the future and model the performance of a 4-layer stacked LSTM NN (with size of input = output = hidden = 2048) used for speech recognition. As the figure shows, with more external memory bandwidth, going from DDR4 to HBM2, the latency for processing a speech segment goes down by more than 5×.

VI. CONCLUSION

We presented a methodology to achieve software ease-of-use with hardware efficiency by implementing a domain-specific customizable overlay architecture. We described the hardware tradeoffs involved with NN acceleration, and delved into our graph compiler that maps NNs to our overlay. We then showed that, using both our hardware and software, we can achieve a 3× improvement on ResNet-101 HD, 12× on LSTM cells, and ~900 fps on GoogLeNet on Intel's Arria 10 FPGAs. We will further develop DLA to encompass more use-cases such as multi-FPGA deployment [2]. In the future, we also aim to implement similar overlays for different application domains such as genomics, packet processing, compression and encryption to further make FPGAs accessible for high-throughput computation.

REFERENCES

[1] U. Aydonat et al. An OpenCL deep learning accelerator on Arria 10. In FPGA '17, pages 55-64, New York, NY, USA, 2017. ACM.
[2] E. Chung et al. Accelerating persistent neural networks at datacenter scale. HotChips, 2017.
[3] A. G. et al. Speech recognition with deep recurrent neural networks. ICASSP, 2013.
[4] A. J. et al. Efficient overlay architecture based on DSP blocks. FCCM, 2015.
[5] A. K. et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[6] C. S. et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[7] H. Z. et al.
A framework for generating high throughput CNN implementations on FPGAs. FPGA, 2018.
[8] J. S. et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. FPGA, 2018.
[9] M. A. et al. Gzip on a chip: High performance lossless data compression on FPGAs using OpenCL. IWOCL, 2014.
[10] P. R. et al. Swish: a self-gated activation function. AISTATS, 2017.
[11] S. H. et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. FPGA, 2017.
[12] W. L. et al. SSD: Single shot multibox detector. ECCV, 2016.
[13] Y. G. et al. FPGA-based accelerator for long short-term memory recurrent neural networks. ASP-DAC, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[15] Google Inc. TensorFlow, 2018.
[16] K. Murakami and H. Tagushi. Gesture recognition using recurrent neural networks. SIGCHI, 1991.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.
[18] Y. Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.