DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration

Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O'Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C. Ling, Gordon R. Chiu
Programmable Solutions Group, Intel
Toronto, Canada
{firstname.lastname}@intel.com

Abstract—Overlays have shown significant promise for field-programmable gate arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden, resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain-specific graph compiler that compiles deep learning languages such as Caffe or TensorFlow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs) – we demonstrate a 3× improvement on ResNet-101 and a 12× improvement for long short-term memory (LSTM) cells, compared to naïve implementations. Finally, we describe how we can tailor our hardware overlay, and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150 – the fastest ever reported on comparable FPGAs.
I. INTRODUCTION

Creating custom high-performance hardware designs on field-programmable gate arrays (FPGAs) is difficult and time-consuming when compared to software-programmable devices such as CPUs. A hardware designer must describe their system in a cycle-accurate manner, and worry about low-level hardware considerations such as timing closure to memory interfaces. Over the past decade, significant progress has been made in easing the use of FPGAs through high-level languages such as OpenCL, making it easier to implement high-performance designs [9]. However, even when using high-level design, one must still carefully describe an efficient parallel hardware architecture that leverages the FPGA's capabilities, such as the massive on-chip memory bandwidth or configurable multiplier blocks. Additionally, the designer must optimize both area and frequency through long compilations to realize performance gains versus other programmable platforms. Compared to writing a software algorithm targeting a CPU, designing for FPGAs is still drastically more difficult.

Our goal in this paper is to present a software-programmable hardware overlay on FPGAs that combines the ease-of-use of software programmability with the efficiency of custom hardware design. We introduce a domain-specific approach to overlays that leverages both software and hardware optimizations to achieve state-of-the-art performance on the FPGA for neural network (NN) acceleration. For hardware, we partition configurable parameters into runtime and compile-time parameters, such that the architecture can be tuned for performance at compile time, and the overlay programmed at runtime to accelerate different NNs. We do this through a lightweight very-long instruction word (VLIW) network that delivers full reprogrammability to our overlay without incurring any performance or efficiency overhead (typical overlays have large overhead [4]).
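The split between the two parameter classes can be sketched as follows. This is an illustrative sketch only – the names and fields are ours, not from the paper: compile-time parameters are baked into the FPGA bitstream, while runtime parameters are delivered as VLIW instructions between networks.

```python
# Illustrative sketch (names are assumptions, not DLA's actual API) of
# the compile-time vs. runtime parameter split described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class CompileTimeParams:   # fixed when the overlay bitstream is built
    q_vec: int             # width parallelism
    k_vec: int             # output-depth parallelism
    stream_buffer_kb: int  # on-chip scratchpad size
    aux_kernels: tuple     # e.g. ("maxpool", "lrn")

@dataclass
class RuntimeParams:       # reloaded per subgraph via the VLIW network
    image_dims: tuple      # (height, width, depth)
    pool_window: int
    pool_type: str         # "max" or "average"
```

Changing a `CompileTimeParams` field implies a new FPGA compile; changing `RuntimeParams` only requires streaming new instructions to the overlay.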
Additionally, we create a flexible architecture where only the core functions required by a NN are connected to a parameterizable interconnect (called the Xbar). This avoids the need to include all possible functions in our overlay at runtime; rather, we can pick from our library of optimized kernels based on the group of NNs that will run on our system. Our approach is unlike previous work that created hardware that can only run a single, specific NN [1], [7], [8].

On the software side, we introduce an architecture-aware graph compiler that efficiently maps a NN to the overlay. This both maximizes hardware efficiency when running the design and simplifies the usability of the end application: users are only required to enter domain-specific deep learning languages, such as Caffe or TensorFlow, to program the overlay. Our compiler generates VLIW instructions that are loaded into the FPGA and used for reprogramming the overlay in tens of clock cycles, thus incurring no performance overhead. Compared to fixed-function accelerators that can only execute one NN per application run, our approach opens the door to running multiple NNs consecutively in a single application run [12] by simply reprogramming our overlay instead of recompiling or reconfiguring the FPGA.

The rest of this paper is organized as follows. Section II introduces our hardware architecture. We describe how we target specific NNs using our compile-time parameters and Xbar interconnect. Importantly, we describe our lightweight VLIW network, used for programming the overlay, in Section II-A. Next, we describe our NN graph compiler in Section III, and detail some of our architecture-driven optimizations that allow the efficient implementation of NNs on architecture variants of different sizes. Sections IV and V detail how our graph compiler and hardware overlay work together for efficient implementation of CNNs and RNNs.
We walk through hardware and software optimizations in implementing both the ResNet and GoogLeNet CNNs, allowing us to achieve record-setting performance on GoogLeNet. Finally, we discuss the implementation of a long short-term memory (LSTM) cell by simply adding an additional kernel to our overlay, and relying on our graph compiler to mutate the LSTM cell graph to fit within our overlay. In this paper, we refer to our system as "DLA" – our Deep Learning Accelerator.

[Fig. 1: System-level diagram of our neural network inference accelerator (DLA): a PE array (PE-0 to PE-N, each with a filter cache), fed by an on-chip stream buffer, with auxiliary kernels (MaxPool, LRN, width adaptation, activation) connected through the Xbar, and external DDRx/HBM memory. Vectorization widths shown include C_VEC × Q_VEC × P_VEC into the PE array, K_VEC across PEs, and DRAIN_VEC/AUX_VEC on the output paths.]

II. HARDWARE ARCHITECTURE

Our domain-specific overlay aims to be general enough to implement any NN, yet remain customizable so that it can be optimized for a specific NN. Fig. 1 shows an overview of our overlay. At its core is a 1D systolic processing element (PE) array that performs dot-product operations in each PE to implement general matrix math such as convolutions or multiplications. We omit the discussion of numerics in this paper, but we support different floating-point formats such as FP32/16/11/10/9/8, which have been shown to work well with inference [2] – these could easily be modified to support nascent innovations in data-type precision, such as bfloat [15] and other unique fixed- or floating-point representations, thanks to the flexible FPGA fabric. As Fig. 1 shows, our Xbar interconnect can augment the functionality of our overlay with different auxiliary functions (also referred to as kernels in this paper).
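A simplified functional model of one cycle of the PE array described above: each of the K_VEC PEs holds one filter slice (from its filter cache) and computes a dot product against the same input feature vector, producing K_VEC output-channel partial sums in parallel. This is a behavioral sketch, not the hardware implementation.

```python
# Behavioral sketch of one PE-array cycle: K_VEC PEs each dot-product
# their cached filter slice against the shared C_VEC-wide feature vector.
def pe_array_cycle(feature_vec, filters):
    """feature_vec: C_VEC values; filters: K_VEC rows of C_VEC weights."""
    return [sum(w * x for w, x in zip(f, feature_vec)) for f in filters]
```

A real convolution accumulates these partial sums over the filter window and input-depth iterations; this sketch shows only the per-cycle parallel dot products.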
This section goes through the different parts of our hardware architecture and highlights the built-in compile-time flexibility and run-time programmability of our overlay.

A. VLIW Network

To implement a NN on DLA, our graph compiler breaks it into units called "subgraphs" that fit within the overlay's buffers and compute elements. For example, with convolutional neural networks (CNNs), a subgraph is typically a single convolution with an optional pooling layer afterwards. We deliver new VLIW instructions for each subgraph to program DLA correctly for the subgraph's execution.

[Fig. 2: VLIW network distributes instructions to each kernel: a VLIW reader fetches from DDR4 and feeds a chain of "Transport" kernels connected to the stream buffer, Xbar, and pool kernels.]

Our novel VLIW network distributes instructions to each kernel as shown in Fig. 2. The VLIW reader continuously fetches the instructions for the next subgraph from external memory and sends them down an 8-bit unidirectional ring network that is connected to all of the kernels in DLA. The VLIW instruction sequence is divided into portions for each kernel. A special header packet identifies the kernel, followed by a series of programming instructions destined for that kernel. The "Transport" kernels parse the header packet and redirect the instructions that follow to the correct kernel, as shown in Fig. 2. The transport kernels also assemble the 8-bit packets into 32-bit-wide instructions for direct kernel consumption. Our instructions are actually counter end-values and control flags that are directly loaded into registers within each kernel to govern its operation – this avoids the need for any instruction-decode units. For example, the pool kernel receives approximately a dozen instructions: the image height/width/depth, the pool window size, and the type of pooling (maxpool or average pool).
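The routing behavior of the transport kernels can be modeled with a short sketch. The paper does not specify the exact packet format, so the header layout here (a kernel-ID byte followed by a byte count) is an assumption for illustration; only the two behaviors named above – header-based redirection and assembly of 8-bit packets into 32-bit instructions – are from the text.

```python
# Model of the 8-bit VLIW ring network: header-based routing plus
# assembly of byte packets into 32-bit instruction words per kernel.
# The (kernel_id, nbytes) header format is an assumption of this sketch.
import struct

def route_vliw_stream(stream):
    """Split an 8-bit instruction stream into per-kernel 32-bit words."""
    kernels = {}
    i = 0
    while i < len(stream):
        kernel_id, nbytes = stream[i], stream[i + 1]  # assumed header
        payload = bytes(stream[i + 2:i + 2 + nbytes])
        # Transport kernels assemble 8-bit packets into 32-bit instructions.
        words = [struct.unpack_from("<I", payload, off)[0]
                 for off in range(0, nbytes, 4)]
        kernels.setdefault(kernel_id, []).extend(words)
        i += 2 + nbytes
    return kernels
```

For example, a pool kernel's portion of the stream would carry its image dimensions and window size as a handful of such 32-bit register values.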
Before executing each subgraph, the pool kernel reads each of its ~12 instructions serially, consuming 12 clock cycles – this has no material impact on the performance of a subgraph that typically takes thousands of cycles. However, it ensures that the entire VLIW network can remain only 8 bits wide, with a minimal area overhead of only ~3000 LUTs – about 1% of an Arria 10 1150 FPGA device, as shown in Table I. Adding new auxiliary programmable functions (kernels) to DLA is simple and has little overhead – we extend the VLIW network with an additional transport kernel, and connect the new kernel to the Xbar without affecting existing kernels or instructions.

TABLE I: Area overhead of the VLIW network for DLA with 10 kernels at a frequency of 450 MHz on Arria 10.

              LUTs   FFs    ALMs
VLIW Reader   1832   1841   1473
Transport      126    139     73
Total         3092   3231   2046

B. Xbar Interconnect

Machine learning is a fast-developing field – we increasingly see new functions implemented by the machine learning research community. For example, new activation functions such as "Swish" [10] are constantly being evaluated. A quick look at TensorFlow shows that there are more than 100 different layer types that users can experiment with in building different NNs [15]. We aim to use the Xbar for extensibility of DLA, so that users can easily add or remove functions to implement different types of NNs. Fig. 1 shows an example Xbar interconnect used to connect pool/LRN kernels for CNNs. As the diagram shows, the Xbar is a custom interconnect built around exactly what is needed to connect the auxiliary kernels. For example, the SqueezeNet graph has no local response normalization (LRN) layers, so we can remove that kernel completely. From a prototxt architecture description, the Xbar (including width adaptation) is automatically created to connect the auxiliary kernels.
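The automatic Xbar generation can be sketched as follows: from the auxiliary-op sequences that actually appear in the target graphs (e.g., parsed from a prototxt description), keep only the kernels used and record which kernel-to-kernel connections the interconnect must support. The function name and the fixed `pe_array`/`drain` endpoints are assumptions of this sketch.

```python
# Hypothetical sketch of Xbar generation: derive the required kernel set
# and point-to-point connections from observed auxiliary-op orderings.
def build_xbar(aux_sequences):
    kernels, edges = set(), set()
    for seq in aux_sequences:
        path = ["pe_array", *seq, "drain"]  # assumed fixed endpoints
        kernels.update(seq)
        edges.update(zip(path, path[1:]))
    return kernels, edges
```

In this model, a graph set where LRN always precedes MaxPool yields one connection between them, while a graph set with both orders (as in GoogLeNet) yields both connections – the extra multiplexing the text mentions.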
We use width adapters to control the throughput of each auxiliary kernel – for example, we can decrease the width of infrequently used kernels such as LRN to conserve logic resources. The interconnection pattern within the Xbar is also customizable based on the order of the auxiliary operations. For example, the AlexNet graph has both MaxPool and LRN layers, but LRN always comes first; whereas the GoogLeNet graph has some layers in which MaxPool precedes LRN, which is supported by adding more multiplexing logic. To demonstrate the power of our extensible architecture (and compiler, presented in Section III), we add a single kernel to the Xbar in Section V that extends our architecture to also implement LSTM cells alongside CNNs – this allows implementing video-based RNNs commonly used, for instance, for gesture recognition [16].

C. Vectorization

To ensure our overlay can be customized to different neural network models and FPGA devices, we support vectorization – degrees of parallelism – across different axes. Fig. 1 shows some of the degrees of parallelism available in the accelerator, configurable via vectorization. Q_VEC and P_VEC refer to the parallelism in the width and height dimensions, while C_VEC and K_VEC refer to the input and output depth parallelism, respectively. Every clock cycle, we process the product of Q_VEC × P_VEC × C_VEC × K_VEC feature values in parallel.

[Fig. 3: Normalized throughput/area on two architectures with different vectorization (P_VEC=1, K_VEC=64 vs. P_VEC=4, K_VEC=16) across AlexNet, GoogLeNet, SqueezeNet, VGG-16, and ResNet-101.]

[Fig. 4: Impact of the stream buffer memory vs. compute tradeoff (normalized throughput vs. normalized on-chip memory size) on AlexNet, GoogLeNet, and ResNet-101.]
Initially, our design was scaled by increasing K_VEC; however, this method of scaling saw diminishing returns, since quantization inefficiencies become more pronounced as vectorization dimensions increase. For example, if the output depth (K) of a layer is 96 and K_VEC is 64, this requires 2 complete iterations through the PE array, with only 96/128 (75%) useful computations. On the other hand, if K_VEC is 32, the output depth divides perfectly into 3 iterations at 100% efficiency. To mitigate this quantization effect, it is possible to balance the scaling of the design across multiple dimensions besides just K_VEC (e.g., P_VEC, Q_VEC, C_VEC). The optimal balance of vectorization depends on the graph's layer dimensions. Fig. 3 demonstrates this point by comparing the throughput of two architectures of similar area on different graphs. As the figure shows, the optimal balance of scaling the design between P_VEC and K_VEC varies based on the neural network topology being used. This is an example of how we tune our overlay to get top performance on specific NNs.

D. Stream Buffer and Filter Caches

A single Arria 10 FPGA contains ~4 TB/s of on-chip memory bandwidth, interspersed within the FPGA in configurable 20 Kbit memory blocks. This powerful FPGA resource is pivotal in determining the performance of FPGA compute operations – DLA leverages these block RAMs to buffer both activation and filter tensors. As Fig. 1 shows, filters are stored in a double-buffered "filter cache" contained in each PE, allowing the PEs to compute data while filters are pre-loaded from external memory for the next subgraph. The "stream buffer" is a flexible scratchpad used to store intermediate tensors on-chip. Many of our graph compiler passes are dedicated to efficient use of this stream buffer, as Section III will show.
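The K_VEC quantization effect described in Section II-C reduces to a one-line calculation: a layer of output depth K occupies ceil(K / K_VEC) passes through the PE array, of which only K of the ceil(K / K_VEC) × K_VEC computed outputs are useful.

```python
import math

# Sketch of the K_VEC quantization effect: fraction of useful PE-array
# computations for a layer of output depth k on a K_VEC-wide array.
def pe_efficiency(k, k_vec):
    iterations = math.ceil(k / k_vec)
    return k / (iterations * k_vec)
```

This reproduces the paper's example: `pe_efficiency(96, 64)` gives 0.75 (two iterations, 96/128 useful), while `pe_efficiency(96, 32)` gives 1.0 (three full iterations).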
When presented with an intermediate tensor larger than the stream buffer or filter caches, our graph compiler slices the tensor into multiple pieces that fit within our on-chip caches; the remaining pieces are stored in slower off-chip memory and require higher latency to fetch and compute. To limit this slicing, we can increase the size of the stream buffer and/or filter caches, but this decreases the number of RAM blocks available for increasing PE array vectorization. Therefore, there is a memory-vs-compute tradeoff for each NN that balances the size of the caches against the number of PEs – Fig. 4 illustrates this tradeoff for different NNs. As the figure shows, a tradeoff that is optimal for one NN can cause 40% or more performance degradation for a second NN.

III. GRAPH COMPILER

The previous section focused on the hardware overlay architecture and how to configure it at compile time to maximize performance for a specific NN graph. This section describes our NN graph compiler, which takes advantage of the overlay's VLIW instructions to decompose, optimize, and run a NN model on the overlay. The graph compiler breaks down a NN into subgraphs, schedules subgraph execution, and, importantly, allocates explicit cache buffers to optimize the use of our stream buffer and filter caches. This section goes through our core compiler "passes" (slicing, scheduling, and allocation), and shows examples of how smart graph compilation allows more efficient hardware implementations. Besides these general core passes, our compiler implements more specific algorithms that target and optimize specific NN patterns, as we show in Sections IV and V.

A. Slicing

To achieve the highest possible throughput for a given DLA architecture, it is desirable to size the stream buffer and filter caches such that the entire input feature tensor and filter tensor fit on-chip.
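The fit-or-slice decision made by the slicing pass can be sketched as follows, here for height-wise slicing only. The overlap rule – slices must share (R − 1) rows when the filter window R is larger than 1 – is from Section III-A; everything else (function name, capacity units) is an illustrative assumption, and the real compiler also considers width- and depth-wise slicing.

```python
import math

# Sketch of the slicing pass's fit check: does an H x W x D feature
# tensor fit in the stream buffer, and if not, how many height-wise
# slices (with (R - 1)-row halo overlap) are needed?
def num_height_slices(h, w, d, capacity, filter_r=1):
    if h * w * d <= capacity:
        return 1  # whole tensor fits on-chip
    rows_per_slice = capacity // (w * d)
    usable = rows_per_slice - (filter_r - 1)  # rows of new output per slice
    return math.ceil(h / usable)
```

The halo term shows why the compiler prefers slice shapes that minimize overlapped computation: a larger filter window wastes more rows per slice.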
However, as the resolution of images increases and graph topologies for NNs become deeper, on-chip allocation for these tensors may not be feasible. To overcome this constraint, slices of the input tensor are fetched from external memory into the stream buffer and processed independently by DLA. The 3D input feature tensor can be sliced along the height, width, or depth to fit in the on-chip stream buffer. When slicing along the width and height, the slices must overlap if the filter window size is greater than 1x1. The graph compiler tries to pick slices that minimize the overlapped computation for the sliced tensor. Alternatively, slicing across the depth does not require overlapped computations, but requires an additive operation to combine the results of the depth-wise slices.

To boost performance and minimize the number of DDR4 spill-points, we enhance our slicing algorithm to slice multiple sequential convolutions together (called "Group Slicing"). Instead of completing all slices within a layer, we compute several sequential convolutions on a single slice using the stream buffer before moving on to the next slice. Fig. 5 illustrates how group slicing reduces the number of external-memory spill-points for a sample NN. For ResNet-101 with 1080p (HD) image resolution, our Group Slicing algorithm improves throughput by 19% compared to simple slicing.

[Fig. 5: Group slicing minimizes external memory spill-points by computing multiple sequential convolutions (Conv1–Conv3) for each slice, compared to normal slicing, which spills every layer's slices to DDR4.]

[Fig. 6: Double-buffering in the stream buffer: the input buffer and output buffer grow toward each other, leaving a contiguous buffer available for use in the middle.]

B. Allocation

The allocation pass manages reading from and writing to the stream buffer.
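A minimal sketch of the double-buffered allocation scheme used by this pass (per Section III-B): inputs are allocated upward from address 0 and outputs downward from the top of the buffer, leaving a contiguous free region in the middle. The class name and interface are illustrative, not DLA's actual allocator.

```python
# Sketch of double-buffered stream-buffer allocation: input addresses
# count up from 0, output addresses count down from the top, and the
# contiguous middle region remains free for extra data slices.
class StreamBufferAllocator:
    def __init__(self, size):
        self.size, self.in_top, self.out_bottom = size, 0, size

    def alloc_input(self, n):
        addr = self.in_top
        self.in_top += n
        assert self.in_top <= self.out_bottom, "stream buffer overflow"
        return addr

    def alloc_output(self, n):
        self.out_bottom -= n
        assert self.in_top <= self.out_bottom, "stream buffer overflow"
        return self.out_bottom

    def free_space(self):
        return self.out_bottom - self.in_top  # contiguous middle region
```

Growing the two buffers toward each other is what avoids fragmentation: the free region is always a single contiguous span, available for the branch tensors of graphs like GoogLeNet.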
Allocation calculates the read and write addresses for each slice, and computes the total stream buffer memory used by a graph. One of its main goals is to reduce fragmentation – gaps between allocated memory blocks in the stream buffer. In its simplest operation, the stream buffer is used as a double buffer to store both the input and the output of a subgraph. To achieve this double-buffering while reducing fragmentation, the input buffer starts at address 0 and counts up, while the output buffer starts at the end of the stream buffer and counts down. As Fig. 6 shows, this leaves a contiguous space in the middle of the stream buffer that can be used to allocate more data slices; this is especially useful for graphs that have multiple branches, as demonstrated by the GoogLeNet example in Section III-C. Note that the allocation pass must keep track of the lifetime of each buffer so it can free/overwrite its memory in the stream buffer once it is no longer used. Additionally, our allocation pass assigns addresses in external memory when the stream buffer isn't large enough; since external memory size is not a constraint, this is simply done left-to-right, in the first available space.

C. Scheduling

The DLA compiler partitions NNs into subgraphs, where a subgraph is a list of functions that can be chained together and implemented on DLA without writing to a buffer except at the very end of the subgraph's execution – scheduling decides when each subgraph is executed. In the case of early CNN models such as AlexNet [5] or VGG16 [17], there is very little need for a scheduler, as there are no decisions to be made about which subgraph to execute next. When considering CNNs
with branching nodes such as GoogLeNet [6] or ResNet [14], or graphs that require slicing, the order of subgraph execution heavily influences the stream buffer size required for a given graph to avoid external memory spill-points.

[Fig. 7: Scheduling one of the GoogLeNet [6] inception modules: (a) inception module with relative output sizes for each subgraph; (b) stream buffer usage over time steps 0–8 with a depth-first schedule; (c) stream buffer usage with an improved schedule.]

Fig. 7a illustrates an example of an inception module from GoogLeNet, partitioned into DLA subgraphs with the relative output sizes of each subgraph. We show the stream buffer allocation corresponding to two possible schedules of the inception module. Both are depth-first schedules, but in Fig. 7b we start with the leftmost branch, while Fig. 7c starts with the rightmost branch. This simple change in schedule results in a 30% reduction in the size of the stream buffer required for this inception module. When considering large graphs with many branching nodes that either converge to a single output, such as GoogLeNet, or diverge to several outputs, such as those used for single-shot multibox detection [12], an exhaustive search of all possible schedules may be infeasible without incurring large compile-time penalties. Our scheduling is conducted using a priority-queue-based approach, where the cost of executing a given node is determined by the ratio of its output size to its effective input size (the size of the input multiplied by the number of users of the input tensor). This approach achieves the stream buffer savings of Fig. 7c with minimal impact on compiler runtime.
IV. CNN IMPLEMENTATION

This section focuses on two popular CNNs: ResNet [14] and GoogLeNet [6]. We explain different hardware/software co-optimizations that are possible because of our runtime-reconfigurable and software-programmable overlay. These allow us to significantly boost the performance of these CNNs on DLA at runtime with little effort, as we show in our results.

A. ResNet Convolution Merging

ResNet-101 is a large graph that can be targeted at high-definition image resolutions, creating intermediate tensors that require significant slicing to run on the DLA overlay on Arria 10. ResNet is composed of three types of resmodules, as shown in Fig. 8. Each type has two convolution branches, merged through an element-wise addition operation (eltwise).

[Fig. 8: Types of resmodules in ResNet: (a) Type 1 – branch A (1x1 conv) and branch B (B1: 1x1 conv + ReLU, B2: 3x3 conv + ReLU, B3: 1x1 conv) joined by an eltwise; (b) Type 2 – the same, with stride-2 convolutions A and B1; (c) Type 3 – the input feeds the eltwise directly, with no convolution A.]

We present a resmodule optimization (implemented automatically in our compiler) that eliminates the eltwise operation by merging it with the preceding convolution(s). This reduces the total number of arithmetic operations in DLA and, more importantly, decreases the number of slices and DDR4 spill-points. Instead of storing intermediate tensors between the convolution and the eltwise addition operations, we combine them in a single convolution operation whose tensor size is at least half as big as the eltwise input. Consider the computation that produces every output element of the eltwise in a Type 1 resmodule (Fig. 8a) – it is the sum of the corresponding output elements of convolutions A and B3.
As illustrated in Fig. 9a, this sequence of operations is equivalent to a single convolution after inputs A and B3 (and the corresponding filters A and B3) are merged depth-wise. This effectively absorbs the eltwise addition operation into the dot-product operation of the preceding convolutions. Fig. 9b shows the Type 1 resmodule after convolutions A and B3 are merged with the eltwise layer. Since this optimization converts the explicit eltwise operations into a convolution, outputs A and B3, which would usually reside in DDR4 or on-chip memory, become intermediate results of the merged convolution and are stored in on-chip registers. This reduction in memory traffic is especially prominent in resmodules where outputs A and B3 are 4× the size of inputs A and B3.

In order for Type 2 and Type 3 resmodules to benefit from this optimization, we convert them to Type 1. For Type 2 (Fig. 10a), we push the stride-2 convolutions A and B1 upstream to the layer before the input. Not only does this convert the resmodule to Type 1, it also cuts the amount of computation in the upstream layer and reduces the input traffic to convolutions A and B1. For Type 3 (Fig. 10b), we introduce an identity convolution – which creates an identical output tensor from an input tensor – in the left branch.

[Fig. 9: Convolution merging optimization: (a) eltwise elimination by depth-concatenating inputs A and B and their corresponding filters; (b) the optimized Type 1 resmodule, with a merged A + B3 1x1 convolution.]
[Fig. 10: Resmodule type conversion to benefit from the convolution-merging optimization: (a) Type 2, with the stride-2 convolutions pushed upstream of the input; (b) Type 3, with an identity convolution added in the left branch.]

B. Non-convolution Primitives

While almost all layers in ResNet are convolutions, there are a couple of exceptions – a single Global Average Pooling (GAP) layer and a single Fully-Connected (FC) layer at the end. This is also true for GoogLeNet, where there is a single FC layer and a single average pooling layer. Given the extremely low frequency of these non-convolution layers (e.g., 2 out of 147 for ResNet-101), it is best to map them to convolutions. In this way, we can reuse the powerful convolution engine (PE array) instead of adding dedicated auxiliary kernels that would be under-utilized over time.

An FC layer performs a multiplication between a vector (input) and a matrix (weights). It can be mapped to a convolution as follows: 1) the 1D FC input of length N is mapped to a 3D convolution input of shape 1 × 1 × N, and 2) the 2D FC weight matrix of shape N × M is mapped to M 3D convolution filters of shape 1 × 1 × N. With this mapping, the computation of each FC output is assigned to a PE.

Average pooling with a window of H × W on a 2D image is equivalent to a 2D convolution with a filter of size H × W, where each filter element has value 1/(H × W). For a 3D input of depth D, average pooling is applied to each 2D input surface, producing the corresponding output surface. In this case, the equivalent convolution filter for the output surface at depth d is of shape H × W × D, with all filter values zero except the surface at depth d, which holds the average pooling filter.

C. Sparse Filter Shortening

Even though they save area, the identity and average pooling convolutions introduced in the previous optimizations could come at a high cost to throughput, due to the large but sparse filters involved.
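The FC-to-convolution mapping above can be checked with a few lines of plain Python: each of the M "filters" is a depth column of the N × M weight matrix, and the FC output is simply M dot products – one per PE. This is a functional reference, not the DLA implementation.

```python
# Reference for the FC-to-convolution mapping of Section IV-B: a length-N
# input times an N x M weight matrix becomes M filters of shape 1x1xN.
def fc_as_1x1_convolution(x, weights):
    """x: length-N input; weights: N x M matrix (list of rows)."""
    n, m = len(weights), len(weights[0])
    # Each of the M "filters" is a depth column of the weight matrix.
    filters = [[weights[i][j] for i in range(n)] for j in range(m)]
    # One dot product per filter = one FC output assigned to one PE.
    return [sum(f[i] * x[i] for i in range(n)) for f in filters]
```

The same check works for the average-pooling mapping: a filter whose only non-zero surface holds 1/(H × W) values reproduces the mean of the corresponding input surface.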
For an identity conv olution of input Number of filte r s = output depth = D Non-zer o f i lt e rs Zero f il t ers – pruned awa y d ue t o sparsit y V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V K_VEC K_VEC DLA c omp utation after pruning Filte r d e pt h = input de pt h = D Filter f ace s i ze e qual s a v e ra ge pool wi n dow size o r 1x 1 in c a s e of ide nti t y f i l t e r 1/4 1/4 1/4 1/4 1 2x 2 a v g . pool I d e n tity Fig. 11: Sparse filter shortening with identity and av erage-pooling con volution filters. and output shape H × W × D , there are D filters, each of shape 1 × 1 × D . Since each filter is responsible for copying input surface at depth d to the output surface at the same depth, the values of this filter are all zeros except 1 at depth d . Fig. 11 illustrates both the identity and av erage pooling con volution filters, and how we can lev erage their sparsity to conserve operations on DLA. W e improve performance by skipping the computation with filter entries that are filled with zeros. Since the PEs process K V E C filters at a time, we trim the filters size K V E C to fit perfectly in the PE array . This effecti vely reduces the filter depth from D to K V E C , saving both compute time and filter data loading time. W e call this optimization sparse filter shortening , which can also be applied to the av erage pooling con v olution as sho wn in Fig. 11, due to the same filter sparsity . D. 1x1 Filter s Optimization T o efficiently compute con volutions using 3x3 filters, the DLA architecture is often tuned to be vectorized in the filter width dimension by setting S VEC=3. Increasing the filter width vectorization increases PE throughput as well as filter prefetch bandwidth for large (eg. 3x3) filters. Howe ver , many of the latest CNNs hav e a mix of 3x3 and 1x1 filters. 
Convolutions using 1x1 filters do not benefit from filter width vectorization, and would thus achieve low DSP efficiency and filter prefetch bandwidth. To avoid this, the DLA architecture has been optimized for 1x1 filters in two ways. First, the DSPs that would have been used in a 3x3-filter convolution to process the second and third filter values in the filter width direction are instead used to calculate two additional output pixels in a 1x1-filter convolution. This allows the PEs to maintain the same DSP efficiency for both 3x3 and 1x1 filters. Second, the filter prefetch bandwidth added to load a 3-wide filter is used to simply load more 1-wide filters in parallel. Overall, these two optimizations allow DLA to achieve high throughput through vectorization for 3x3-filter convolutions without suffering any additional quantization loss for 1x1-filter convolutions.

E. Optimization Impact on ResNet

Table II summarizes the impact of each optimization on the throughput of ResNet-101 with 1080p image resolution. The number in each row is the normalized throughput after applying all optimizations listed up to that row. Here, we apply the mapping of GAP and FC layers to convolutions unconditionally (i.e., in the baseline). The huge speedup of sparse filter shortening comes from the filters of the identity convolutions introduced by the convolution-merging optimization on Type 3 resmodules, which account for 87% of all resmodules in ResNet-101.

TABLE II: Optimization impact on ResNet-101.

Optimization               Relative Throughput
Baseline                   1.0
1x1 Filter Opt             1.3
Conv. merging (Type 3)     1.7
Sparse filter shortening   2.8
Group slicing              3.1

F. Optimization Impact on GoogLeNet

Two of the described CNN optimizations are used to improve throughput on GoogLeNet: (1) the 1x1 filter optimizations and (2) the mapping of average pooling to convolution – the latter allowed DLA to fit a larger PE array instead of spending dedicated resources on an average-pooling kernel.
As shown in Table III, GoogLeNet saw a 17% throughput improvement from these two optimizations. The following row in the table shows the throughput improvement from increasing the PE array vectorization (from {P_VEC, K_VEC} = {1,48} to {2,32}). Finally, the last row in the table points to an accurate model of external memory optimizations that allows DLA to achieve ~900 fps on GoogLeNet on Intel's Arria 10 1150 device, which to our knowledge is the most efficient acceleration of GoogLeNet on FPGAs. This optimization entails continuously fetching filters for subsequent NN layers until the filter cache becomes full, instead of limiting filter prefetch to only one layer ahead. While this slightly complicates the filter prefetch logic, it has a negligible area cost but allows hiding external memory latency when fetching the NN model.

TABLE III: Optimization impact on GoogLeNet.

  Optimization               Relative Throughput   Raw Throughput (Intel Arria 10 1150)
  Baseline                   1.0                   469 fps
  1x1 Filter Opt             1.1                   506 fps
  Avg Pool Mapped to Conv    1.2                   550 fps
  Additional Vectorization   1.7                   777 fps
  External Memory Opt        1.9                   ~900 fps

V. LSTM CELL IMPLEMENTATION

LSTM cells are a widely-used variant of RNNs, commonly used in speech recognition [3], translation [18] and motion detection [16]. DLA is designed to be a flexible NN accelerator for all relevant deep learning workloads, including LSTM-based networks. As such, this section discusses how our graph compiler mutates an LSTM cell to map well to the DLA overlay with high performance.

Fig. 12: Graph/Matrix view of an LSTM cell, and how we combine its matrix-multiplications into one big matrix.
A. Mapping an LSTM Cell to DLA

Most of the computation in an LSTM cell occurs in 8 matrix multiplications to compute the 3 LSTM gates (input/forget/output) [11]. Fig. 12 illustrates how we combine those 8 matrices into one big matrix; this reduces DLA execution from 12 subgraphs to a single subgraph, which runs at least ~12× faster. First, the 4 matrices that are multiplied by the input/history are each height-concatenated, as shown in the example in Fig. 12. This is a generic optimization that can be applied to any matrix-vector multiplications that share the same input vector. We end up with two large matrix multiplications: one matrix for the input (x_t), and another for the history (h_{t-1}). Next, we combine those two matrices, and the element-wise addition that follows, into one larger matrix through width concatenation of the matrices and height concatenation of the input and history vectors, as shown in Fig. 12. This gives us one large matrix multiplication for the entire LSTM cell. Depending on the LSTM cell size, our compiler may later decide to slice this large matrix if it does not fit on the FPGA, as described in Section III-A.
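The two concatenation steps can be verified with a small sketch (our own NumPy code, not the DLA compiler; the weight names follow standard LSTM gate notation and the sizes are arbitrary):

```python
import numpy as np

n = 4  # hidden/input size (illustrative)
rng = np.random.default_rng(0)
Wx = {g: rng.random((n, n)) for g in 'igfo'}  # input weights per gate
Wh = {g: rng.random((n, n)) for g in 'igfo'}  # history weights per gate
x, h = rng.random(n), rng.random(n)

# Naive: 8 separate matrix-vector multiplications plus element-wise adds.
naive = {g: Wx[g] @ x + Wh[g] @ h for g in 'igfo'}

# Step 1 (height concatenation): stack the 4 input-weight matrices into one
# tall matrix multiplied by x; likewise for the history-weight matrices and h.
Wx_all = np.vstack([Wx[g] for g in 'igfo'])
Wh_all = np.vstack([Wh[g] for g in 'igfo'])

# Step 2 (width concatenation): fuse the two matmuls and the element-wise
# addition into a single matrix times the concatenated [x; h] vector.
W_big = np.hstack([Wx_all, Wh_all])
combined = W_big @ np.concatenate([x, h])

# Each n-row slice of the combined result matches one gate's naive output.
for k, g in enumerate('igfo'):
    assert np.allclose(combined[k*n:(k+1)*n], naive[g])
```

The sketch shows why the fusion is exact: concatenation only reorders where each dot product lives, so the accelerator can replace many small subgraphs with one large, well-vectorized matrix multiplication.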
With the combined matrix, each of the LSTM gates is computed one after the other, since we compute the matrix rows in order. However, this is not FPGA-friendly, as each of the input/forget/output gate values would then need to be buffered (using costly on-chip RAM or slow external memory) so that they can be combined in the second half of the LSTM cell. Instead, by interleaving the rows of the large matrix (so that the first row contains the filters for the input gate, the second row for the 'g' gate, the third row for the forget gate, and the fourth row for the output gate), we can compute one output from each gate in each time step, as shown in Fig. 13. This removes the need for buffering large intermediate gate outputs [13], and allows us to directly stream the gate values into the dedicated LSTM hardware block shown in Fig. 14. This demonstrates the flexibility of the DLA overlay, and the power of our graph compiler in implementing different NNs. By simply attaching the LSTM kernel to the Xbar, we can leverage our powerful multi-precision PE array to compute the matrix-multiplication portion of the LSTM cell, then stream data directly into the dedicated LSTM block.

Fig. 13: Matrix row interleaving allows streaming different LSTM gate values simultaneously instead of buffering each gate separately.

Fig. 14: Streaming LSTM hardware block to compute the element-wise operations of an LSTM cell.

B. External-Memory-Bound RNNs

Non-convolutional neural networks are effectively a matrix-vector multiplication when computed with batch=1. Most of the applications that use RNNs are real-time applications such as speech/gesture recognition or translation; therefore, they require low-batch, low-latency processing that is ideal for FPGAs. However, external memory bandwidth is often a bottleneck, since a large matrix has to be fetched from external memory only to be multiplied with one vector; compute time is lower than memory fetch time, so it is impossible to hide memory fetch latency. Intel's Stratix 10 devices have 2 HBM2 devices integrated on some of their boards, providing up to 500 GB/s of peak memory bandwidth, which is 20× higher than a DDR4-2400 memory.
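A back-of-envelope sketch (our own model, not code from the paper; the byte width and bandwidth figures are assumptions) shows why such workloads are memory-bound: per-step latency is bounded below by the model size divided by external memory bandwidth.

```python
# Hypothetical memory-bound latency model for batch-1 RNN inference: the full
# weight matrix must be streamed from external memory for every input vector,
# so per-step latency >= model_bytes / bandwidth.

def min_latency_per_step(model_bytes, bandwidth_bytes_per_s):
    """Lower bound on time to process one input vector (seconds)."""
    return model_bytes / bandwidth_bytes_per_s

# 4-layer stacked LSTM with input = output = hidden = 2048 (as in Fig. 15).
# Per layer, the combined LSTM matrix is (4*2048) x (2048 + 2048); we assume
# 2-byte weights here (an assumption, not the paper's stated precision).
layer_bytes = (4 * 2048) * (2048 + 2048) * 2
model_bytes = 4 * layer_bytes  # ~268 MB for the whole model

ddr4_bw = 19.2e9  # one DDR4-2400 channel, peak
hbm2_bw = 500e9   # two HBM2 stacks, peak (figure quoted above)

print(min_latency_per_step(model_bytes, ddr4_bw))  # bound with DDR4
print(min_latency_per_step(model_bytes, hbm2_bw))  # bound with HBM2
```

Since the compute for one vector finishes faster than the weight fetch, raising peak bandwidth translates almost directly into lower latency, which is what the modeled DDR4-to-HBM2 sweep illustrates.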
Fig. 15: Latency of an LSTM NN when varying external memory bandwidth.

In Fig. 15, we look towards the future and model the performance of a 4-layer stacked LSTM NN (with size of input = output = hidden = 2048) used for speech recognition. As the figure shows, with more external memory bandwidth, going from DDR4 to HBM2, the latency for processing a speech segment goes down by more than 5×.

VI. CONCLUSION

We presented a methodology to achieve software ease-of-use with hardware efficiency by implementing a domain-specific customizable overlay architecture. We described the hardware tradeoffs involved with NN acceleration, and delved into our graph compiler that maps NNs to our overlay. We then showed that, using both our hardware and software, we can achieve a 3× improvement on ResNet-101 HD, 12× on LSTM cells, and ~900 fps on GoogLeNet on Intel's Arria 10 FPGAs. We will further develop DLA to encompass more use-cases such as multi-FPGA deployment [2]. In the future, we also aim to implement similar overlays for different application domains such as genomics, packet processing, compression and encryption to further make FPGAs accessible for high-throughput computation.

REFERENCES

[1] U. Aydonat et al. An OpenCL deep learning accelerator on Arria 10. In FPGA '17, pages 55-64, New York, NY, USA, 2017. ACM.
[2] E. Chung et al. Accelerating persistent neural networks at datacenter scale. HotChips, 2017.
[3] A. G. et al. Speech recognition with deep recurrent neural networks. ICASSP, 2013.
[4] A. J. et al. Efficient overlay architecture based on DSP blocks. FCCM, 2015.
[5] A. K. et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[6] C. S. et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[7] H. Z. et al.
A framework for generating high throughput CNN implementations on FPGAs. FPGA, 2018.
[8] J. S. et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. FPGA, 2018.
[9] M. A. et al. Gzip on a chip: High performance lossless data compression on FPGAs using OpenCL. IWOCL, 2014.
[10] P. R. et al. Swish: a self-gated activation function. AISTATS, 2017.
[11] S. H. et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. FPGA, 2017.
[12] W. L. et al. SSD: Single shot multibox detector. ECCV, 2016.
[13] Y. G. et al. FPGA-based accelerator for long short-term memory recurrent neural networks. ASP-DAC, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[15] Google Inc. TensorFlow, 2018.
[16] K. Murakami and H. Tagushi. Gesture recognition using recurrent neural networks. SIGCHI, 1991.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.
[18] Y. Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.