CompRRAE: RRAM-based Convolutional Neural Network Accelerator with Reduced Computations through a Runtime Activation Estimation
Xizi Chen, Jingyang Zhu, Jingbo Jiang and Chi-Ying Tsui
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong

Abstract– Recently the Resistive-RAM (RRAM) crossbar has been used in the design of accelerators for convolutional neural networks (CNNs) to solve the memory wall issue. However, the intensive multiply-accumulate computations (MACs) executed at the crossbars during the inference phase are still the bottleneck for further improvement of energy efficiency and throughput. In this work, we explore several methods to reduce the computations for RRAM-based CNN accelerators. First, the output sparsity resulting from the widely employed Rectified Linear Unit is exploited, and a significant portion of the computations is bypassed through an early detection of the negative output activations. Second, an adaptive approximation is proposed to terminate the MAC early when the sum of the partial results of the remaining computations is considered to be within a certain range of the intermediate accumulated result and thus has an insignificant contribution to the inference. In order to determine these redundant computations, a novel runtime estimation of the maximum and minimum values of each output activation is developed and used during the MAC operation. Experimental results show that around 70% of the computations can be reduced during inference with a negligible accuracy loss smaller than 0.2%. As a result, the energy efficiency and the throughput are improved by over 2.9 and 2.8 times, respectively, compared with the state-of-the-art RRAM-based accelerators.
I. INTRODUCTION

Convolutional neural networks (CNNs) have demonstrated impressive performance in various machine learning tasks such as visual recognition [11] and visual tracking [16]. At the same time, due to the nature of the convolutional operation, the inference of a CNN usually involves intensive computations which are energy consuming and become a big deterrent to deploying CNNs in embedded systems. Besides the high computation cost, conventional accelerators also face the memory wall issue, where the massive memory accesses for fetching the weights and activations vastly limit the performance. Therefore, it is necessary to deliver a more efficient implementation with fewer computations.

The Rectified Linear Unit (ReLU) [5] has become the most widely used activation function in neural networks in recent years. Due to the application of ReLU, a high activation sparsity can be achieved during the inference [2]. Since the negative MAC results will be clamped to zero by ReLU, their actual magnitudes are irrelevant for the cascading layers. Thus, a large portion of the computations corresponding to the negative output activations can be bypassed once the sign can be determined early. In addition, the inherent resilience of CNNs makes the activation values error-tolerant to some degree, hence making it possible to reduce the computations by approximation without affecting the classification accuracy. To trigger the above computation bypass, we propose a runtime estimation of the maximum and minimum values of each output activation. During each MAC, the contribution of the intermediate accumulated result is evaluated continually against the estimated values of the remaining partial results.
Once the contribution of the current accumulated result is considered to be large enough to dictate the final result value, the MAC will be terminated to improve the energy efficiency and the throughput. The proposed methods are implemented in a specialized architecture based on the resistive random access memory (RRAM) crossbar [18], which utilizes in-situ computation as an approach to address the high power density and the memory wall issue of the conventional CMOS-based design. In summary, the contributions of this work are as follows:

• A runtime estimation of the maximum and minimum values of the output activation is proposed and implemented during each MAC.

• By detecting the negative output activations through the estimation, the corresponding MACs are terminated in advance in the convolutional layers followed by ReLU. According to the experimental results, over 99.98% of the negative outputs are detected and over 71.5% of their computations are bypassed without inducing accuracy loss.

• An adaptive approximation is proposed to bypass the remaining computations during the MAC when the estimated values of the remaining partial results are determined to have a negligible contribution to the inference.

• A dedicated RRAM-based architecture is proposed for implementing the CNNs with reduced computations. A total computation reduction of around 70% is achieved for the general 16-bit fixed-point implementation, and a 40% reduction is achieved for the 8-bit implementation, which demonstrates the effectiveness of the proposed methods under an aggressive quantization scheme. The induced accuracy loss is smaller than 0.2%. Experimental results show significant improvement in the energy efficiency and throughput.

II. RELATED WORKS

Various techniques have been proposed to reduce the intensive computations in CNN accelerators.
A natural way to reduce the memory footprint and the number of multiplications in the CMOS-based accelerators is to utilize the activation sparsity [1, 2, 19]. Since a large portion of the input activations are zero, the corresponding multiplications can be bypassed to save energy and time [2]. A recent work [1] focuses on the layers with only non-negative inputs and exploits the output sparsity by reordering the weights to calculate the sum of the positive products first. The later calculation of the negative products is terminated as soon as the accumulated result becomes smaller than zero. As a result, a 16% energy saving and a 28% speedup can be achieved. In a more aggressive mode, an empirical value is compared with the accumulated result after a specific number of multiplications. If the accumulated result is smaller, the output activation is considered likely to be negative. By bypassing the remaining multiplications, a higher reduction in complexity can be achieved, but a relatively large accuracy loss (3.0%) is induced. Another work [19] exploits the output sparsity by using a low-rank approximation of the weight matrix to predict the output sparsity and disabling the actual computation if the predicted output is negative. In this case, each MAC needs to be broadcast to all the processing elements to improve the throughput. Such methods, however, are not suitable for the RRAM-based architecture. Since the computations are executed at the RRAM crossbars, where the weight matrix is programmed into the memristors before the classification, the multiplication-accumulation has to be done in a regular pattern. Therefore, it is difficult to irregularly skip the zero inputs, independently reorder the weights of each kernel, or broadcast different weight matrices to the crossbars at runtime.
A way to reduce the computations in the RRAM-based architecture is to structurally compress the weights through training and then exploit the weight sparsity [7]. However, to the best of our knowledge, the output sparsity has not been efficiently exploited in the RRAM-based architecture. Due to the resilience of CNNs, reducing the quantization bit-width of the weights and activations is another method for reducing computations [14]. A dynamic quantization scheme is proposed in [3] to change the bit-width of the weights when multiplying with the different bits of the activations to reduce the computations for the RRAM-based MAC.

III. PRELIMINARIES

A. Convolutional Neural Networks

A convolutional neural network (CNN) [11] is a machine learning model inspired by the structure of the human brain. It is usually comprised of a series of cascading layers, including the convolutional (CONV) layers, pooling layers, and fully-connected (FC) layers. The CONV and FC layers consist of neurons to extract the features of the image. Inside each layer, the input activations from the previous layer are first multiplied with the corresponding weights in the current layer, and then accumulated to generate the output activations for the next layer. The computation of the CONV layer is shown in Fig. 1(a) and can be expressed as follows:

a_{out}(x, y, z) = f\left( \sum_{l=0}^{c-1} \sum_{m=0}^{h-1} \sum_{n=0}^{w-1} a_{in}(x+m, y+n, l) \times K_z(m, n, l) \right)    (1)

where a_{out} and a_{in} represent the output and the input activations, respectively; K_z represents the z-th kernel; h, w and c represent the height, width, and depth of the kernel; (x, y, z), (x+m, y+n, l) and (m, n, l) represent the positions of the activations and weights in height, width, and depth. f is a non-linear activation function.
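As a concrete illustration, Eq. (1) can be written as a direct nested-loop sketch. This is a minimal NumPy version for a single image with unit stride and no padding; the shapes and names are our own, not the paper's:

```python
import numpy as np

def conv_layer(a_in, kernels, f=lambda x: np.maximum(0, x)):
    """Direct (unoptimized) CONV layer following Eq. (1).

    a_in    : input activations, shape (H, W, c)
    kernels : weights, shape (Z, h, w, c), one (h, w, c) kernel per
              output channel z
    f       : non-linear activation function (ReLU by default, Eq. (2))
    """
    Z, h, w, c = kernels.shape
    H, W, _ = a_in.shape
    a_out = np.zeros((H - h + 1, W - w + 1, Z))
    for z in range(Z):                       # each kernel -> one output channel
        for x in range(H - h + 1):
            for y in range(W - w + 1):
                acc = 0.0
                for l in range(c):           # accumulate over the kernel volume
                    for m in range(h):
                        for n in range(w):
                            acc += a_in[x + m, y + n, l] * kernels[z, m, n, l]
                a_out[x, y, z] = f(acc)
    return a_out
```

The six nested loops mirror the three summations of Eq. (1) plus the sweep over output positions and channels; real accelerators reorder and parallelize exactly this loop nest.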
The most commonly used activation function is ReLU, given by:

f(x) = \max(0, x)    (2)

where x is the input to the function. FC layers are similar to the CONV layers, but have much fewer computations, which can be simplified to a single vector-matrix multiplication. Pooling layers usually follow the CONV layers for down-sampling. In this work we will focus on the CONV layers, since they account for most of the computations in a CNN.

B. RRAM Crossbar and In-Situ Computation

The RRAM crossbar has aroused great research interest due to its high density, non-volatility, and the potential for parallel in-situ analog computation [17, 18, 6]. While the CMOS-based accelerators face the difficulty of scaling down and the issue of the memory wall, RRAM-based computation provides a promising approach to achieve substantial improvement in energy efficiency and throughput. For instance, the RRAM-based accelerator in [18] demonstrates a 22 times energy saving compared with the CMOS-based counterpart. A hierarchical RRAM-based architecture proposed in [17] improves the energy efficiency and throughput by 5.5 and 14.8 times, respectively, compared with the state-of-the-art CMOS-based DaDianNao architecture [4].

Fig. 1. (a) Computation of the CONV Layer; (b) In-Situ Computation based on the RRAM Crossbar.

The RRAM crossbar for the vector-matrix multiplication is shown in Fig. 1(b). The elements of the weight matrix are stored as the conductance values of the memristors at the crosspoints connecting the horizontal wordlines and the vertical bitlines. When the computation starts, the input activation vector is applied on the wordlines as voltages.
The current flowing through each memristor is equal to the product of the memristor conductance and the wordline voltage. Currents on the same bitline are accumulated and output as the computation result. Since the computation is done in the analog domain, digital-to-analog converters (DACs) are needed at the wordlines to convert the input activations to voltages, and analog-to-digital converters (ADCs) are needed at the bitlines to convert the results back to digital values. These interfacing circuits are the most energy-consuming part during the computation [17, 3]. Since hundreds of products are accumulated vertically, the resolution requirement of the ADC can easily go beyond the acceptable range and induce a huge energy overhead. As a common solution, the multi-bit multiplication, e.g., a 16-bit multiplication, is broken into a series of low bit-width multiplications to limit the ADC resolution within a reasonable range [6, 17, 3]. For example, for a crossbar with 128 wordlines, the resolution of each crosspoint should be no more than 2 bits to keep the ADC resolution less than 10 bits. Thus, each weight takes multiple memristors to store. At the same time, since the multi-bit DAC is expensive to implement and hundreds of DAC operations are needed for one MAC, it is more efficient and preferable to use a single-bit DAC to minimize the overhead [17, 3]. Thus, each bit of the input activation is sent into the crossbar sequentially to finish the MAC. The whole MAC operation will take multiple iterations. At each iteration, a partial result corresponding to the current 1-bit input vector is generated, and then accumulated with the existing results of previous iterations. This bit-level slicing of activations creates a special scheduling for the MAC, and we will utilize this characteristic to reduce the computations.

IV. RRAM-BASED COMPUTATION-REDUCED ACCELERATOR DESIGN
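The 128-wordline example above can be checked with a quick back-of-the-envelope calculation. This is our own sketch of the digital resolution needed to read a bitline sum exactly; real ADC sizing also depends on analog noise margins:

```python
import math

def adc_bits(wordlines, cell_bits, input_bits=1):
    """Minimum ADC resolution to digitize a bitline dot-product exactly:
    each of `wordlines` rows contributes at most
    (2**cell_bits - 1) * (2**input_bits - 1) to the accumulated sum."""
    max_sum = wordlines * (2**cell_bits - 1) * (2**input_bits - 1)
    return math.ceil(math.log2(max_sum + 1))

print(adc_bits(128, 2))  # 2-bit cells, 1-bit inputs -> 9 bits (< 10)
print(adc_bits(128, 4))  # 4-bit cells would already need 11 bits
```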
A. Algorithms for Computation Reduction

In the CMOS-based accelerators, the MAC is usually done by accumulating the corresponding activation-weight products. However, since the activations are sliced to the bit level in the RRAM-based architecture, only a partial result is obtained at each iteration by accumulating the input-bit-weight products. As the MAC sequentially proceeds from the most-significant bit (MSB) to the least-significant bit (LSB) of the input activations, the generated partial results also become less and less significant. Each newly generated partial result is accumulated with the sum of the partial results of the previous iterations. An example is given in Fig. 2 to illustrate this computation process, where the inner product of the activations [4, 12, 10] and the weights [4, −8, −5] is computed in 4 iterations. The intermediate accumulated result is updated at each iteration, represented as Accu [−104, −120, −130, −130] in Fig. 2. Such bit-level processing makes it possible to terminate the MAC in advance once the remaining iterations are considered to be redundant, based on the possible values of their partial results estimated beforehand. Specifically, two schemes are exploited to identify the redundant iterations, as described below.

Fig. 2. Computation Reduction in the RRAM-based Accelerator.

Fig. 3. Ideal Performance of the Adaptive Approximation for the 16-bit and 8-bit Implementations.

1) ReLU-based Computation Bypass: For an early identification of the negative outputs followed by ReLU, the maximum value of the sum of the partial results of the remaining iterations is estimated beforehand and represented as Max in Fig. 2. (The method for estimating Max will be elaborated later.) If at any iteration the sum of Accu and Max is not larger than zero (Accu + Max ≤ 0), the final output is considered to be non-positive, and hence the MAC can be terminated in advance. Otherwise, the MAC will continue. For instance, the MAC in Fig. 2 will be terminated after the second iteration, since −120 + 51 ≤ 0. It will be shown in the experimental results that the negative outputs of the CONV layers account for 57.5% of the total computations in the CifarQuick model [8] on Cifar-10 [10], and 71.5% of their computations can be bypassed based on the estimation.

2) Adaptive Approximation: For the activations not supported by the ReLU-based bypass, adaptive approximation is proposed for the early termination of the MAC. In general, the resilience of the network allows the activations to deviate from their actual values within a certain range without affecting the classification result. Based on this, an adaptive approximation scheme is proposed, as shown in Fig. 2. If the magnitudes of Max and Min (Min: minimum value of the sum of the remaining partial results) are not larger than a certain threshold (T) of the magnitude of Accu (|Max|, |Min| ≤ |Accu| × T), which reflects the allowable deviation from the actual result, the MAC can be terminated.
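The Fig. 2 walk-through and the ReLU-based bypass can be sketched as follows. For Max we use the loose worst-case bound Σ|w| × 2^i introduced in Section IV-B (the statistics-based estimate of Eq. (3) is tighter); the function name and structure are our own:

```python
def bit_serial_mac_relu_bypass(acts, weights, bits=4):
    """Bit-serial MAC, MSB first, with the ReLU-based early termination.
    Reproduces the Fig. 2 example: acts [4, 12, 10], weights [4, -8, -5],
    Accu sequence [-104, -120, -130, -130] when run to completion.
    Max is the worst-case bound sum(|w|) * sum(2**j for remaining j)."""
    abs_w = sum(abs(w) for w in weights)
    accu = 0
    for it, pos in enumerate(range(bits - 1, -1, -1)):
        # Partial result of the current 1-bit input slice.
        partial = sum(w for a, w in zip(acts, weights) if (a >> pos) & 1)
        accu += partial << pos
        remaining_max = abs_w * ((1 << pos) - 1)   # bound on iterations left
        if accu + remaining_max <= 0:
            return 0, it + 1        # output will be clamped by ReLU; stop
    return max(0, accu), bits

out, iters = bit_serial_mac_relu_bypass([4, 12, 10], [4, -8, -5])
print(out, iters)   # -> 0 2: terminated after iteration 2 (-120 + 51 <= 0)
```

After the first iteration Accu = −104 and the bound is 17 × 7 = 119, so the MAC continues; after the second, Accu = −120 against a bound of 51, and the sum is non-positive, matching the termination point in the text.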
For instance, the last two iterations can be bypassed if T is set as 0.5 in Fig. 2. The allowable deviation for triggering the bypass varies adaptively with |Accu|: the larger |Accu| is, the larger the allowable deviation. T is an empirical tunable parameter to balance the accuracy and the complexity saving. A larger amount of computation reduction can be achieved by increasing T, but the accuracy loss will also increase at the same time. To obtain the ideal performance, i.e., the upper bound of the complexity saving of the adaptive approximation, we assume we know the exact value of the output activation beforehand, and so the actual partial result at each iteration is used instead of the estimated values. Based on this, we can know exactly which iterations will not be needed, and the ideal maximum amount of bypass can be obtained. This ideal performance, i.e., the upper bound of saving at different T values for the general 16-bit fixed-point implementation of CifarQuick, is shown in Fig. 3. A sweet spot is observed where 80.9% of the computations can be reduced with an accuracy loss of 1% for an optimum threshold value.

Fig. 4. (a) An Example for Extracting the Probabilities from the Distribution of the 4-bit Input Activations; (b) Average Probabilities and the Spans Extracted for the 16-bit Implementation of CifarQuick.
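The adaptive check itself is a one-line comparison. A sketch, using symmetric worst-case bounds on the remaining sum purely for illustration (the paper's Max and Min need not be symmetric):

```python
def adaptive_bypass(accu, rem_max, rem_min, T=0.5):
    """Adaptive-approximation test: terminate the MAC once both bounds
    on the sum of the remaining partial results lie within T * |Accu|,
    i.e. the remaining iterations cannot move the output outside the
    allowed deviation."""
    return max(abs(rem_max), abs(rem_min)) <= abs(accu) * T

# Fig. 2 state after iteration 1: Accu = -120, worst-case remaining +/-51.
print(adaptive_bypass(-120, 51, -51))    # True -> last two iterations skipped
print(adaptive_bypass(-104, 119, -119))  # False -> too early to terminate
```

With T = 0.5, the bound 51 is within 0.5 × 120 = 60, so the last two iterations are skipped, consistent with the Fig. 2 example.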
Similarly, for the 8-bit fixed-point implementation, ideally a computation reduction of 60.5% can be achieved by the adaptive approximation at the same threshold, as shown in Fig. 3. The savings shown in Fig. 3 assume a perfect knowledge of the partial result at each iteration. However, in a real situation, the actual partial results will not be known beforehand. Therefore, we propose a method to obtain an accurate estimate of the maximum and minimum values of the partial result at each iteration.

B. Making a Runtime Estimation on the Output Activation

Considering the MAC operation for an output activation, the worst-case maximum value of each partial result can be estimated by assuming all the input-bit-weight products to be accumulated are as large as possible. In this case, for the layers with both positive and negative inputs, the maximum value of the partial result is equal to \sum |w| \times 2^i, where w represents a weight in the kernel and i is the position of the input bit, from bit-width − 1 down to 0. Similarly, the worst-case minimum value of each partial result is equal to −\sum |w| \times 2^i. For the hidden layers following ReLU, since there is no negative input, the maximum and minimum values of the partial result become \sum w^{+} \times 2^i and \sum w^{-} \times 2^i, where w^{+} and w^{-} represent the positive and the negative weights, respectively. The computation bypass based on this loose bound ensures no accuracy loss, but the amount of complexity reduction is small, since the worst-case estimated values usually have much larger magnitudes than the actual result, for several reasons. First, since the multi-bit input activation has been broken into multiple iterations, it is highly likely that some input bits equal zero even when the activation is positive. Thus, a considerable portion of the input-bit-weight products are actually zero.
Moreover, in the worst case, all the input-bit-weight products to be accumulated are assumed to have the same sign. However, in practice each non-zero product can have either a positive or a negative impact on the partial result. To obtain a more practical estimation, a tighter bound based on the actual input activation statistics is proposed. Before the classification, the empirical activation statistics of each layer are obtained from the training images, and the probabilities of the input bit at each iteration being +1 and −1 are calculated accordingly. As an example, Fig. 4(a) illustrates how to extract the corresponding probabilities for the MSB and LSB from the distribution of the 4-bit input activations in a CONV layer of CifarQuick. Specifically, the probability of the MSB being +1 is equal to the occurrence probability of the activations with values no smaller than 8. This can easily be extended to implementations with different bit-widths. For a larger bit-width such as 16-bit, the activations are partitioned into smaller bins. The average value and the span of the probability of each input bit being +1, extracted for the 16-bit implementation of CifarQuick, are shown in Fig. 4(b). For the hidden layers, the probability of the MSB being +1 is small, since most of the activations have small values. For the LSB, the probability is within the range from 25% to 40%, since a large portion of the input activations are zero. The probability of having a −1 input is zero for the layers following ReLU. CONV1 is different from the others due to the mean subtraction for image pre-processing. It is worth noting that the activation distribution normally does not change much for different images in the dataset, and thus the empirical probabilities can be generally utilized for the estimation.
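The probability extraction of Fig. 4(a) amounts to reading bit frequencies off a sample of activations. A small sketch with a made-up activation list (for a real layer these values would come from the training images):

```python
import numpy as np

def bit_one_prob(activations, bits=4):
    """Empirical probability of each input bit being 1, extracted from a
    sample of (non-negative) layer activations, as in Fig. 4(a).
    For 4-bit values, Prob(MSB = 1) is simply the fraction of
    activations that are >= 8."""
    a = np.asarray(activations, dtype=np.int64)
    return [float(np.mean((a >> i) & 1)) for i in range(bits)]  # index 0 = LSB

# Toy distribution: many zeros (typical after ReLU), a few large values.
acts = [0, 0, 0, 0, 1, 2, 3, 5, 9, 12]
probs = bit_one_prob(acts)
print(probs)   # index 3 is the MSB: 2 of the 10 values are >= 8
```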
The maximum and minimum values of the partial result estimated at a specific iteration i (i = 0 for the LSB) are given by:

max = (max^{+} + max^{-}) \times 2^i, \quad min = (min^{+} + min^{-}) \times 2^i
max^{+} = \sum w^{+} \times Prob_{+1,max} + \sum |w^{-}| \times Prob_{-1,max}
max^{-} = \sum (-w^{+}) \times Prob_{-1,min} + \sum w^{-} \times Prob_{+1,min}
min^{+} = \sum w^{+} \times Prob_{+1,min} + \sum |w^{-}| \times Prob_{-1,min}
min^{-} = \sum (-w^{+}) \times Prob_{-1,max} + \sum w^{-} \times Prob_{+1,max}    (3)

where max and min represent the estimated maximum and minimum values of the partial result, respectively. To estimate max, the input-bit-weight products with different signs need to be considered separately. We first consider the case where the input bits and the corresponding weights have the same signs, to estimate the sum of the positive products (max^{+}). \sum w^{+} and \sum w^{-} represent the sums of the positive weights and the negative weights in the kernel, respectively. For better accuracy, a conservative bound should be used for the estimation, and thus we use Prob_{+1,max} and Prob_{-1,max} to estimate max^{+}, where Prob_{+1,max} and Prob_{-1,max} represent the maximum probabilities of the input bit being +1 and −1, respectively. We also need to estimate the sum of the negative products (max^{-}), where the input bits and weights are of opposite signs. Again, to obtain a conservative bound, we use Prob_{+1,min} and Prob_{-1,min}, which represent the minimum probabilities of the input bit being +1 and −1, respectively. The minimum value of the partial result (min) can be estimated in a similar way. This statistics-based estimation is more precise than the worst-case estimation.

C. Hardware Architecture

The analog computation is done inside the in-situ processing units (IPUs), similar to Fig. 1(b).
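The estimation of Eq. (3) above can be sketched directly (our own naming; for after-ReLU layers the −1-bit probabilities are simply zero):

```python
def partial_bounds(weights, p1_max, p1_min, pm1_max, pm1_min, i):
    """Statistics-based bounds of Eq. (3) on the partial result at bit
    position i.  p1_max/p1_min (pm1_max/pm1_min) are the maximum/minimum
    empirical probabilities of an input bit being +1 (-1)."""
    wp = sum(w for w in weights if w > 0)     # sum of positive weights
    wn = sum(-w for w in weights if w < 0)    # sum of |negative weights|
    max_p = wp * p1_max + wn * pm1_max        # same-sign products, maximized
    max_n = -wp * pm1_min - wn * p1_min       # opposite-sign products, minimized
    min_p = wp * p1_min + wn * pm1_min
    min_n = -wp * pm1_max - wn * p1_max
    return (max_p + max_n) * 2**i, (min_p + min_n) * 2**i

# Hypothetical after-ReLU layer (no -1 inputs), kernel [4, -8, -5]:
hi, lo = partial_bounds([4, -8, -5], 0.4, 0.2, 0.0, 0.0, i=2)
print(hi, lo)   # the (max, min) pair for bit position 2
```

Summing these per-iteration pairs over the remaining bit positions yields the Max and Min values stored in the LUT, as described in Section IV-C.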
Each IPU contains a group of 1-bit DACs at the input of the wordlines, a pair of differential RRAM crossbars to store the positive and negative weights, respectively, the sample-and-hold units to hold the bitline currents, a single ADC which is time-shared by the bitlines, and a shift-add unit to aggregate the partial results after the ADC for the MAC operation. In order to make a comparison with the state-of-the-art RRAM-based accelerator, we adopt a hierarchical architecture similar to ISAAC presented in [17] as the baseline, and compare the proposed CompRRAE with ISAAC in terms of energy efficiency, throughput, and area cost. Similar to ISAAC, each accelerator contains multiple tiles connected with a concentrated mesh at the top level. The architecture inside a tile is shown in Fig. 5(a). Each tile contains multiple in-situ multiply-accumulate modules (IMAs) sharing the same centralized memories and digital processing units, which are used to execute digital operations such as shift-add, ReLU and pooling.

Fig. 5. (a) Hardware Architecture inside a Tile; (b) The Pipeline of CompRRAE.

Inside the tile, the IMAs are connected through a shared bus. Inside each IMA, there are multiple IPUs sharing a local input buffer and a local output buffer, which hold the input and output activations, respectively, during the MAC. In order to implement the proposed runtime estimation, first, the probabilities are extracted offline. Based on Eq. (3), the estimated maximum and minimum values of the partial result at each iteration are computed independently for each output channel. Then, the corresponding estimated partial results are summed up to get the Max and Min for each iteration. As in the example shown in Fig. 2, there are in total N−1 Max and Min values for each output channel, where N is the number of iterations, i.e., the bit-width of the activation. This process is done offline, and the estimated Max and Min values are stored in a look-up table (LUT) in the tile. During runtime, at each iteration, the estimated value of the output activation is the sum of the actual accumulated result and the estimated value for the remaining iterations, read from the LUT. The evaluation logic that computes the formulas in Fig. 2 for deciding whether to skip the remaining iterations includes an adder for the ReLU-based bypass and a multiplier and a comparator for the approximation-based bypass.
The size of the tile is normally large enough for mapping a complete kernel of the CONV layers. However, each kernel may occupy multiple IMAs, and thus the local results in the IMAs are required to be sent out through the shared bus and aggregated in the tile. To minimize the overhead of data transfer, when mapping the network each kernel preferably occupies the IPUs in one IMA fully before occupying others, and the calculated results are first aggregated locally inside the IMAs before being sent out through the shared bus.

We use the same configuration as that in ISAAC, where each memristor has 2-bit precision in the 128 × 128 RRAM crossbar. Since each 16-bit weight takes 8 memristors to store, there are 16 output channels mapped to one IPU. The pipeline schedule of CompRRAE is shown in Fig. 5(b). After finishing each crossbar computation, the bitline results are first latched in the sample-and-hold circuits. In the next stage, a 1.28 GHz ADC sequentially converts the 8 bitline currents for an output activation in 6.25 ns in the IPU. The 8 bitline results are processed by the shift-add unit to generate a local partial result in the IPU in the next 6.25 ns. Then, in the next stage, the local partial results of the different IMAs are aggregated in the tile and update the intermediate accumulated result of the MAC. This result, together with the estimated values stored in the LUT, is sent to the evaluation logic to decide whether the computation can be terminated early based on the algorithms presented in Section IV, and the control signals are generated and sent back to the IMAs. All these operations are finished within the 6.25 ns time frame. At the beginning of the MAC operation, i.e., the first iteration, it takes 16 × 6.25 ns to finish the iteration for the 16 output activations in the IPU.
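The stage timings quoted above are mutually consistent, which can be verified with a quick calculation (our arithmetic, not the paper's code):

```python
adc_rate = 1.28e9                 # 1.28 GHz ADC, one bitline per conversion
t_act_ns = 8 / adc_rate * 1e9     # 8 bitlines per 16-bit output activation
print(t_act_ns)                   # 6.25 ns ADC stage per output activation
print(16 * t_act_ns)              # 100 ns for a full first iteration (16 channels)
```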
As the computation goes on, some of the output activations may be bypassed and the time for each iteration in the IPU may become shorter. After all the IPUs finish the computation, the results are sent to the next layer, and the next MAC operation of the current layer starts. Since the iterations may take less time in CompRRAE due to the computation bypass, the bandwidth of the input memories that provide the necessary input activations to the IPUs has to be increased. If the input memory bandwidth is to be kept the same as that in ISAAC, the number of IPUs in each tile needs to be reduced to make sure the memory bandwidth can support the IPUs, which are now running faster. At the same time, since fewer IPUs are used in each tile, more tiles are needed for mapping the same network, which causes area overhead. Compared with ISAAC, the energy overhead mainly comes from the evaluation logic, the additional data transfers through the shared bus, and the extra memory accesses of the LUT and the centralized output buffer. Extra area overhead is required for the evaluation logic and the LUT. The detailed analysis is discussed in the next section.

V. EXPERIMENTAL RESULTS

A. Models of Energy, Area, and Throughput

The operation parameters and the corresponding energy and area data for the major components of CompRRAE are summarized in Table I. All the memories and the shared buses are modeled at 32nm in CACTI 6.5 [15]. The centralized input memory is implemented using eDRAM. The local buffers, the centralized output memory and the LUT are implemented using SRAM. The conductance range and the area of the RRAM are taken from the Stanford-PKU RRAM model [9], and the corresponding power is simulated using a device-level simulator implemented in C++.
The parameters of the 1-bit DAC are obtained through a real design implemented in Cadence at TSMC 65nm and scaled down to the 32nm process. Same as ISAAC, an 8-bit SAR ADC is adopted, and its power and area are taken from [12]. The evaluation logic is designed and implemented in Verilog and synthesized using TSMC 65nm; the power and area are obtained and scaled down to the 32nm process. The parameters of other digital processing units such as the shift-add and the sample-and-hold are adapted from ISAAC [17]. The time for the IPUs to finish the MAC operation of each layer is used to model the execution time, and a simulation-based throughput model is built in SystemC. We also calculate the corresponding energy, throughput, and area of ISAAC as a baseline for comparison.

B. Benchmarks

We use two benchmarks to compare CompRRAE with ISAAC. The first benchmark is LeNet-5 [13], which has two CONV layers and two FC layers and is trained on the handwritten digit dataset MNIST. The second benchmark is the medium-sized CifarQuick model [8] with three CONV layers and two FC layers, trained on the color image dataset Cifar-10 [10]. The proposed schemes are first tested with a 16-bit quantization for comparison with ISAAC, and then with an 8-bit quantization to demonstrate the effectiveness of CompRRAE under an aggressive quantization scheme. The accuracy of the fixed-point implementations is summarized in Table II.

C. Results of the ReLU-based Computation Reduction

The negative output activations in the CONV layers of CifarQuick account for 57.5% of the total computations during the inference. On average, over 99.9% of the negative outputs are detected based on the runtime estimation, and over 71.5% of the computations corresponding to these negative outputs are reduced for the 16-bit implementation. For the 8-bit implementation, over 98.3% of the negative outputs are detected and 44.1% of their computations are reduced. Thus, the overall ReLU-based computation reductions for a complete inference are 40.2% and 23.8% for the 16-bit and 8-bit implementations, respectively. No accuracy is compromised in either implementation. Since the CONV layers of LeNet-5 are not followed by ReLU, their computation is only reduced by the adaptive approximation. The performance of the ReLU-based computation bypass in CifarQuick is summarized in Table III.

Table I. Power and Area Estimation

  Centralized Memories and Buses
  Component              Spec                           Energy (nJ)       Area (um^2)
  Input Memory (eDRAM)   size: 64KB, bus width: 256bit  0.0188            46000
                         (0.38mW leakage)
  Output Memory (SRAM)   size: 1KB, bus width: 128bit   0.0008            3900
                         (0.13mW leakage)
  Estimation LUT (SRAM)  size: 5KB, bus width: 160bit   0.0035            9600
                         (0.002mW leakage)
  Bus of Input Path      num: 256, delay: 0.44ns        0.0042            80000
  Bus of Output Path     num: 128, delay: 0.43ns        0.0020            39100

  Local Memories (Shared among 8 IPUs)
  Component              Spec                           R/W Energy (nJ)   Area (um^2)
  Input Buffer           size: 2KB, bus width: 256bit   0.0019            7400
                         (0.42mW leakage)
  Output Buffer          size: 256B, bus width: 128bit  0.0005            2600
                         (0.05mW leakage)

  IPU Parameters at 1.28GHz (80 MACs per Tile)
  Component              Spec                           Power (mW)        Area (um^2)
  DAC                    resolution: 1 bit, num: 256    0.25              668
  ADC                    resolution: 8 bit, num: 1      3.1               1500
  Memristor Crossbar     resolution: 2 bit, num: 2      1.5 (Cifar-10)    264
                                                        0.7 (MNIST)
  Sample-Hold            num: 128                       0.001             5
  Shift-Add              num: 1                         0.05              60

  Other Tile Parameters at 1.28GHz
  Component              Spec                           Power (mW)        Area (um^2)
  Evaluation Logic       num: 8                         0.79              320
  Shift-Add              num: 8                         0.4               480

Table II. Accuracy of the Fixed-Point Implementations

  Benchmarks    16-bit Representation   8-bit Representation
  CifarQuick    75.57%                  75.15%
  LeNet-5       99.13%                  99.09%

D.
Results of the Adaptive Approximation

The results of the computation bypass based on the adaptive approximation are shown in Fig.6. To maximize the amount of computation reduction while maintaining a high accuracy, the optimal threshold for the approximation is empirically found to be 0.8 for the 16-bit implementation of CifarQuick, where 67.4% of the computations can be reduced for the inference with an accuracy loss as small as 0.13%. For the 8-bit implementation of CifarQuick, a 35.8% computation reduction is achieved at the same threshold with only 0.16% accuracy loss. A similar trend is observed in LeNet-5, where 78.5% and 45.4% computation reductions are obtained for the 16-bit and 8-bit implementations, respectively. The accuracy loss is smaller than 0.19%.

Table III. Results of the ReLU-based Computation Reduction in CifarQuick

  Computation Reduction                        16-bit Implementation   8-bit Implementation
  For the Negative Outputs followed by ReLU    71.5%                   44.1%
  For a Complete Inference                     40.2%                   23.8%

Table IV. The Overall Computation Reduction, Energy Efficiency, Throughput, and Area Efficiency

                                       16-bit CifarQuick    8-bit CifarQuick     16-bit LeNet-5       8-bit LeNet-5
  Performance                          Baseline  CompRRAE   Baseline  CompRRAE   Baseline  CompRRAE   Baseline  CompRRAE
  Accuracy                             75.57%    75.44%     75.15%    75.00%     99.13%    98.97%     99.09%    98.90%
  Computation Reduction (inference)    -         69.4%      -         39.1%      -         78.5%      -         45.4%
  Energy Consumption (mJ/frame)        5.80e-2   2.00e-2    1.41e-2   1.00e-2    1.71e-2   5.65e-3    4.31e-3   2.61e-3
  Energy Efficiency (frames/J)         1.72e+4   5.00e+4    7.10e+4   1.00e+5    5.83e+4   1.77e+5    2.32e+5   3.83e+5
  Throughput (frames/s)                603.0     1715.6     1205.2    1978.3     1082.0    4897.1     2158.0    4163.1
  Area (mm^2)                          0.5125    0.5816     0.3022    0.3468     1.5375    1.7252     0.8204    0.9243
  Area Efficiency (frames/s/mm^2)      1176.8    2949.8     3988.1    5704.4     703.7     2838.5     2630.4    4504.0

[Fig. 6. Results of the Adaptive Approximation: accuracy versus computation reduction for an inference, for CifarQuick (16-bit and 8-bit) and LeNet-5 (16-bit and 8-bit), comparing the ideal performance based on the real partial results against the performance based on the runtime estimation.]

E. Overall Performance and Overhead Analysis

The overall computation reduction, energy efficiency improvement, and throughput improvement after combining the two proposed schemes are summarized in Table IV. The overall computation reduction achieved for the 16-bit implementation of CifarQuick is 69.4% with a 0.13% induced accuracy loss. As a result, the energy efficiency and throughput are improved by 2.9 times and 2.8 times, respectively. The energy overhead caused by the runtime estimation (i.e., the evaluation logic, the extra data transfers through the shared bus, and the extra memory accesses) accounts for 3.4% of the overall energy consumption. Compared with ISAAC, there is a 13.5% area overhead due to the evaluation logic, the LUT, and the extra tiles occupied.
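The headline improvements for 16-bit CifarQuick can be reproduced directly from the Table IV entries. A quick arithmetic check (variable names are mine; all values are from the table):

```python
# Baseline (ISAAC) vs. CompRRAE, 16-bit CifarQuick column of Table IV
baseline_eff, comprrae_eff = 1.72e4, 5.00e4     # energy efficiency, frames/J
baseline_tput, comprrae_tput = 603.0, 1715.6    # throughput, frames/s
baseline_area, comprrae_area = 0.5125, 0.5816   # area, mm^2

print(round(comprrae_eff / baseline_eff, 1))    # 2.9x energy efficiency
print(round(comprrae_tput / baseline_tput, 1))  # 2.8x throughput
print(round((comprrae_area / baseline_area - 1) * 100, 1))  # 13.5% area overhead
```

These ratios match the 2.9x, 2.8x, and 13.5% figures quoted in the text.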
However, due to the improvement of throughput, the area efficiency is improved by 2.5 times. For the 8-bit implementation of CifarQuick, 39.1% of the computations are reduced. Thus the energy efficiency and throughput are improved by 1.4 times and 1.6 times, respectively, at a cost of 0.15% accuracy loss and 14.8% area overhead. Similar results are observed for LeNet-5. The improvements in energy efficiency and throughput are 3.0 times and 4.5 times for the 16-bit implementation. For the 8-bit implementation, the corresponding improvements are 1.6 and 1.9 times, respectively. Around 12% area overhead is induced for the implementations of LeNet-5.

VI. CONCLUSIONS

In this paper, an RRAM-based CNN accelerator is proposed to reduce the computations during the inference. The computations are reduced by exploiting the output sparsity and an adaptive approximation based on the runtime estimation of the maximum and minimum values of the output activation. It is implemented under different quantization schemes, and the corresponding energy efficiency and throughput are significantly improved.

REFERENCES

[1] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh. SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 662-673, June 2018.
[2] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. SIGARCH Comput. Archit. News, 44(3):1-13, June 2016.
[3] X. Chen, J. Jiang, J. Zhu, and C.-Y. Tsui. A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference (ASP-DAC '18), pages 123-128, 2018.
[4] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), pages 609-622, 2014.
[5] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8609-8613, May 2013.
[6] B. Feinberg, S. Wang, and E. Ipek. Making memristive neural network accelerators reliable. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 52-65, Feb 2018.
[7] H. Ji, L. Song, L. Jiang, H. Li, and Y. Chen. ReCom: An efficient resistive accelerator for compressed deep neural networks. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 237-240, March 2018.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14), pages 675-678, 2014.
[9] Z. Jiang and H.-S. P. Wong. Stanford University resistive-switching random access memory (RRAM) Verilog-A model, Oct 2014.
[10] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. 2009.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS '12), pages 1097-1105, 2012.
[12] L. Kull, T. Toifl, M. L. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. A. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici. A 3.1mW 8b 1.2GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS. In 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 468-469, Feb 2013.
[13] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998.
[14] B. Moons, B. De Brabandere, L. J. Van Gool, and M. Verhelst. Energy-efficient ConvNets through approximate computing. CoRR, abs/1603.06777, 2016.
[15] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), pages 3-14, 2007.
[16] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. CoRR, abs/1510.07945, 2015.
[17] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16), pages 14-26, 2016.
[18] Y. Wang, B. Li, R. Luo, Y. Chen, N. Xu, and H. Yang. Energy efficient neural networks for big data analytics. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE '14), pages 345:1-345:2, 2014.
[19] J. Zhu, J. Jiang, X. Chen, and C.-Y. Tsui. SparseNN: An energy-efficient neural network accelerator exploiting input and output sparsity. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 241-244, March 2018.