Training DNN IoT Applications for Deployment On Analog NVM Crossbars



Fernando García-Redondo, Arm Ltd., Cambridge (fernando.garciaredondo@arm.com); Shidhartha Das, Arm Ltd., Cambridge (shidhartha.das@arm.com); Glen Rosendale, Arm Ltd., San José (glen.rosendale@arm.com)

Abstract—A trend towards energy-efficiency, security and privacy has led to a recent focus on deploying deep neural networks (DNNs) on microcontrollers. However, limits on compute and memory resources restrict the size and the complexity of the machine-learning (ML) models deployable in these systems. Computation-in-memory architectures based on resistive non-volatile memory (NVM) technologies hold great promise of satisfying the compute and memory demands, of high performance and low power, inherent in modern DNNs. Nevertheless, these technologies are still immature and suffer both from intrinsic analog-domain noise problems and from the inability to represent negative weights in the NVM structures, incurring larger crossbar sizes with concomitant impact on analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). In this paper, we provide a training framework for addressing these challenges and quantitatively evaluate the circuit-level efficiency gains thus accrued. We make two contributions. Firstly, we propose a training algorithm that eliminates the need for tuning individual layers of a DNN, ensuring uniformity across layer weights and activations. This enables analog blocks to be reused and peripheral hardware to be substantially reduced. Secondly, using network architecture search (NAS) methods, we propose the use of unipolar-weighted (either all-positive or all-negative weights) matrices/sub-matrices. Weight unipolarity obviates the need for doubling crossbar area, leading to simplified analog periphery. We validate our methodology with CIFAR10 and HAR applications by mapping to crossbars using 4-bit and 2-bit devices. We achieve up to 92.91% accuracy (95% floating-point) using 2-bit only-positive weights for HAR. A combination of the proposed techniques leads to 80% area improvement and up to 45% energy reduction.

Index Terms—Deep Neural Networks, DNN, memristor, RRAM, MRAM, PCRAM

I. INTRODUCTION

The deployment of DNN applications on always-ON IoT devices suffers from stringent limitations on memory, computation capabilities and memory bandwidth [1], leading to complex trade-offs between model size, performance and energy consumption. This trade-off is addressed in the prior art through a combination of algorithmic approaches (network pruning and weight elision, quantization [2], [3]), system-design solutions (NN accelerators, vectorized instructions [2], [4]–[6]) and process-technology innovations (3D integration, emerging non-volatile memory technologies [4], [7]).

Computation-in-memory (CIM) architectures drastically reduce memory-bandwidth requirements [1], [8], while taking advantage of quantization and pruning solutions. Emerging resistive switching technologies such as phase-change memory (PCM) and memristors or resistive random-access memories (RRAM) behave as analog synapses placed in crossbar arrays [8]–[13]. By executing the multiply-accumulate (MAC) operation in these highly integrated structures in the analog domain, NVM crossbars naturally provide the parallelization of the matrix-vector multiplication, the kernel behind most DNN operations. Moreover, as the weights are encoded in the non-volatile resistive elements and not transferred from off-chip external memory, the energy consumed by data movement is significantly reduced, potentially enabling orders-of-magnitude improvements in energy and computing efficiency.
Despite the promise of performance and energy-consumption improvements, the stark reality remains that emerging NVM technologies are still relatively immature and suffer from intrinsic analog-domain shortcomings such as device variability, sensitivity to temperature variations and no intrinsic capability for representing negative values. These constraints are considerable technological challenges that need to be overcome before CIM architectures using NVM can be deployed in the mass market.

Addressing these substantial challenges entirely through innovative circuit design can severely dent potential efficiency gains. As an illustrative example, mapping negative weights onto NVM devices requires independent analog-domain accumulators for positive and negative weights, thereby potentially doubling the analog-domain circuit overheads, including energy-expensive ADCs and DACs [9], [13]. Similarly, differences in the dynamic range of the activations across DNN layers require per-layer tuning of the analog periphery. This limits the opportunities for reusable analog macros [12], and creates a significant gap between laboratory prototypes that rely upon external probes and analyzers [9], [11], [14], and deployable market products.

In this work we introduce a framework to train the DNN and efficiently map it to the NVM hardware, providing two main contributions. Firstly, we enable the reuse of smaller, less power-hungry, uniform analog modules across different layers in the DNN, removing the need for per-layer full-custom periphery design. By ensuring uniform scaling through layers, independently of their morphology and size, area and power benefits come together with shorter circuit-design time, closing the gap between reconfigurable blocks [9], [12] and truly reconfigurable solutions.
Secondly, we investigate how relaxing the bipolar-weight-matrix requirement can lead to additional periphery area savings, while reducing the crossbar area by 2×. We prove that certain applications can be retrained and directly mapped to unipolar weight matrices. Conversely, for those deeper convolutional NNs that cannot be made completely unipolar, we analyze the trade-off between the number of unipolar channels (and therefore the energy/area savings) and the accuracy loss.

Figure 1. NVM accelerator architecture and the principle behind NVM-based vector-matrix multiplication. Inputs are encoded as voltages, and drive the crossbar through the horizontal row lines. Bitline currents naturally accumulate the individual products of the input voltages and the NVM conductances, and are later digitized. In very deep neural networks (NNs), crossbars are interconnected digitally.

The paper is structured as follows. Section II covers the motivation and related work. Section III describes the proposed methodology and framework, followed by the achieved results in Section IV. Finally, we summarize and conclude the paper.

II. RELATED WORK

A. NVM Crossbars as MAC Engines

Machine-learning applications, and more specifically DNNs, use vector-matrix multiplication operations (also called multiply-accumulate, or MAC, operations) as a common underlying primitive for most algorithmic operations.
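As an illustrative numerical sketch (plain NumPy, with hypothetical conductance and voltage values), an ideal crossbar evaluates this primitive as a single matrix product of input voltages and programmed conductances:

```python
import numpy as np

def crossbar_mac(v, g):
    """Ideal crossbar: bitline current i_b = sum_a v_a * g_ab."""
    return v @ g  # analog accumulation along each column (bitline)

# Hypothetical 3x2 crossbar: conductances in siemens, inputs in volts.
g = np.array([[1e-6, 2e-6],
              [3e-6, 1e-6],
              [2e-6, 2e-6]])
v = np.array([0.1, 0.2, 0.1])
i = crossbar_mac(v, g)  # two bitline currents, in amperes
```

Each column accumulates its partial products in the analog domain; only the accumulated currents are later digitized by the ADCs.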
Resistive NVM elements (PCM, RRAM, MRAM) arranged in crossbar topologies compute MAC operations in constant time with significantly improved energy efficiency [8], [10], [11], [13] by mapping the fundamental MAC operation to the analog domain. Thus, an individual element in the weight matrix is mapped to an individual NVM element whose conductance is programmed to a discrete value g within a known range g ∈ [g_ON, g_OFF]. By encoding the second operand as a voltage v, the current through the device becomes the multiplication of both operands: i = v g.

The crossbar architecture automatically computes the addition of the individual dot products. As depicted in Figure 1, the set of accurately programmed conductances in the NVM devices forms the matrix G = {g_ab}, a = 1,…,N, b = 1,…,M. Encoding the input vector as voltages V = {v_a}, the current flowing through each bitline, I = {i_b}, corresponds to the accumulation of the partial products i_b = Σ_{a=1}^{N} v_a g_ab.

B. NVM Technology-Level Challenges

García-Redondo et al. [15] provide a more detailed perspective on resistive switching NVMs. Three key challenges are endurance, variability, and device non-linearity. Limited endurance after many writes is one of the main problems that remains unsolved. As an example, in-crossbar training methods have been proposed [11], but their applicability to real products is still unclear due to this reduced lifetime. Always-ON inference applications heavily rely on analog read operations, and rarely re-write the weights encoded in the crossbar; consequently, limited write endurance does not affect normal operation. Second, variability and crossbar-related errors heavily affect NVM-CMOS hybrid circuits. Nevertheless, architectures such as DNNs are naturally robust against the noise these problems may cause. Moreover, these defects can be taken into account at training time, working around device faults [16].
Third, NVM elements suffer from non-linearities that lead to errors during the weight-to-conductance mapping. However, this problem can be overcome by engineering the physical device.

As seen, the technology problems related to inference ML accelerators built with NVM technologies can be overcome. The next section describes the challenges specific to the deployment of ML algorithms on NVM crossbars.

C. NVM for Analog ML Accelerators: Circuit Challenges

This section covers the three main challenges still to be addressed at the circuit level: operand precision, dynamic-range control of the analog signals, and negative-weight representation.

Challenges Related to Precision: For both digital and emerging analog accelerators, the precision of the operands and operations involved in the DNN determines the accuracy, latency and energy consumption of the inference operation. Consequently, the quantization of both weights and activations is critical in the design of the accelerated system. Though 6–8-bit NVM devices have been demonstrated [9], variability or analog noise may compromise the encoding of more than 2 to 4 bits per cell/weight. On the other hand, the current accumulation taking place on the column bitline is not quantized and does not suffer from precision-related problems. Finally, the precision of the involved DACs and ADCs greatly influences the total area and power consumption [9], [17]. Thereby, the selection of the periphery precision (or the design/use of multiple DACs/ADCs exhibiting different bit-widths) is critical for system accuracy and efficiency.

Challenges Related to Dynamic Range: The dynamic range of the analog signals determines the periphery design and reconfigurability. Figure 2 describes how NN layers are deployed in an NVM crossbar. Convolution kernels are decomposed into sets of channels and mapped to different columns. Then, the kernels are unrolled and grouped to compute the convolution operation in parallel.
However, the input activations, weights, accumulation signals and output activations do not share common ranges across different channels and layers. This non-uniform scaling across DNN layers imposes full-custom blocks per stage (Figure 2 a) and b)).

Figure 2. Common problems in the deployment of NN layers on NVM crossbars: scaling issues in a) convolutional layers and b) fully connected layers are often solved using full-custom periphery, while the handling of c) negative weights incurs area and power overheads (double NVM element count and additional, per-layer tuned current subtractors).

The number of elements
in a layer and the range of the involved signals determine the currents flowing through the bitlines. Different voltage or current signals require per-layer full-custom periphery.

To the best of our knowledge, every offline-learning work in the literature that trains the DNN externally to the NVM crossbar dynamically scales each DNN layer to the available set of conductances to which the device can be programmed [18]. This process is independent of the input, weight and activation value ranges of the layers. However, a reconfigurable system deals with variable voltage/current signals with very different ranges: though the crossbar weights can be reprogrammed, the full-custom periphery constrains the deployment of different NN graphs to only one.

Consider the deployment of two different fully connected layers, A and B, of the NN described in Figure 2 b). The number of inputs n(X_A) and neurons n(Y_A) differs from those in layer B. Similarly, their ranges [x_A0, x_A1], [y_A0, y_A1] will differ from the respective ones in layer B. And more importantly, the weight matrices W_A and W_B differ in their ranges [w_A0, w_A1], [w_B0, w_B1]. However, both matrices W_A and W_B need to be mapped [18] to the same available set of conductances, G, to which the devices can be programmed. Thus, to translate the i-th weight matrix to the conductance range [g_0, g_1], the periphery generating the required voltage amplitudes v_i and sensing the output currents i_i needs to be scaled accordingly, and therefore be different per layer. Full-custom blocks require longer design time [19], and limit the deployment of different NNs on the same HW. Moreover, should the NN weights be updated with varying voltage/current ranges, DACs/ADCs would require extra calibration processes.
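The per-layer scaling just described can be sketched in a few lines (illustrative NumPy; the linear weight-to-conductance map is an assumption in the style of offline-learning schemes such as [18]). Two layers with different weight ranges end up with different scale factors, and hence need differently calibrated periphery:

```python
import numpy as np

def weights_to_conductances(w, g0, g1):
    """Stretch a layer's weight range [w0, w1] onto the single
    programmable conductance range [g0, g1] (linear map)."""
    w0, w1 = w.min(), w.max()
    scale = (g1 - g0) / (w1 - w0)       # layer-dependent scale factor
    return g0 + (w - w0) * scale, scale

g0, g1 = 1e-6, 10e-6                    # shared device range, siemens
w_a = np.array([-0.5, 0.0, 0.5])        # layer A weights
w_b = np.array([-2.0, 0.0, 2.0])        # layer B weights

g_a, s_a = weights_to_conductances(w_a, g0, g1)
g_b, s_b = weights_to_conductances(w_b, g0, g1)
# s_a != s_b: each layer needs its own DAC/ADC calibration.
```

Because the scale factor differs per layer, the voltage generation and current sensing cannot be shared, which is exactly the reconfigurability problem the proposed training framework removes.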
Similarly, to take full advantage of the crossbar, the deployment of convolutional layers requires mapping different filters of the same layer to different columns of the tile. As described in Figure 2 a), per-channel quantization methods [20] lead to different weight/activation ranges. As voltage inputs and S/H or ADC elements are shared across the filters, analog/digital scaling stages would be required. Different S/H and ADC designs lead to additional area, power consumption, and longer design times [19].

Challenges Related to Weight Polarity: The conductance of a passive NVM element can only be a positive number g in the range [g_OFF, g_ON]. However, the NN weights, whether W ∈ R or W ∈ Z, contain both positive and negative values. Consequently, the use of bipolar weights poses a problem when mapped to an only-positive conductance set. Traditionally, positive and negative weights are deployed separately in different areas of the crossbar [9], [13]. This approach comes with the duplication of crossbar area and energy consumption, and the addition of current subtractors or a highly tuned differential ADC, hindering the reconfigurability of the accelerator. As depicted in Figure 2 c), using this scheme we double the crossbar area: per weight, one column computes the positive contributions while another column computes the negative ones [7], [9], [13]. Moreover, additional current-subtraction blocks are required before or at the ADC stages [7], [13]. Alternative solutions such as [18], shifting the weight matrices, usually involve the use of biases dependent on the inputs and additional periphery. Nevertheless, both alternatives involve considerable area and energy overheads.

Status of NVM-Based Reconfigurable Accelerators: A naive deployment of a particular algorithm onto a given crossbar requires the periphery surrounding it to be full-custom designed.
Therefore, despite the many efforts devoted to designing NVM-based accelerators, most works in the literature describing different NN experiments rely on HW external to the chip to assist the crossbar as supporting periphery [11], [13], [14], [18]. To solve the issue, the first reconfigurable CMOS-NVM processor includes per-column current dividers as scaling stages, interfacing the high-precision ADCs before the conversion takes place [7]. While achieving reconfigurability, the system is penalized in terms of area and power.

III. HARD-CONSTRAINED HW QUANTIZED TRAINING

To address reconfigurability versus full-custom periphery design, and its dependence on the weight/activation precision, we have developed a framework to aid mapping the DNN to the NVM hardware at training time. The main idea behind it is the use of hard constraints when computing the forward and back-propagation passes. These constraints, related to the HW capabilities, impose the precision used in the quantization of each layer, and guarantee that the weight, bias and activation values each layer can take are shared across the NN. This methodology allows, after training is finished, mapping each hidden layer L_i to uniform HW blocks sharing:

• a single DAC/ADC design performing V()/act()
• a single weight-to-conductance mapping function f()
• a global set of activation values Y_g = [y_0, y_1]
• a global set of input values X_g = [x_0, x_1]
• a global set of weight values W_g = [w_0, w_1]
• a global set of bias values B_g = [b_0, b_1].

With the crossbar behavior defined by

i_ij = Σ_k v_ik g_ikj + b_ij,    (1)
v_ik = V(x_ik),    (2)
g_ikj = f(w_ikj),    (3)
y_ij = act(i_ij),    (4)

and every system variable within the sets Y_g, X_g, W_g and B_g, every DAC/ADC performing V() and act() will share a design and can potentially be reused.
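Under global sets like these, the forward pass of Eqs. (1)–(4) can be sketched with a single shared DAC transfer V(), a single weight-to-conductance map f() and a single activation act() for every hidden layer. The concrete ranges and the relu-style act() below are illustrative assumptions, not the paper's trained values:

```python
import numpy as np

X_G = (0.0, 1.0)          # hypothetical global input range
W_G = (-1.0, 1.0)         # hypothetical global weight range
G_RANGE = (1e-6, 10e-6)   # hypothetical device conductance range, siemens

def V(x):                  # shared DAC transfer, Eq. (2); 1 V full scale
    return np.clip(x, *X_G) * 1.0

def f(w):                  # shared weight-to-conductance map, Eq. (3)
    w0, w1 = W_G
    g0, g1 = G_RANGE
    return g0 + (np.clip(w, w0, w1) - w0) * (g1 - g0) / (w1 - w0)

def act(i):                # shared ADC + activation, Eq. (4) (relu-like)
    return np.maximum(i, 0.0)

def layer(x, w, b):        # Eq. (1): i_j = sum_k v_k g_kj + b_j
    return act(V(x) @ f(w) + b)

x = np.array([0.5, 0.5])
w = np.array([[1.0, -1.0],
              [1.0, 1.0]])
y = layer(x, w, np.zeros(2))   # same V/f/act reused by every hidden layer
```

Because every layer's values fall in the same global sets, the same three functions, and hence the same DAC/ADC designs, serve the whole network.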
To achieve the desired behavior we need to ensure at training time that the following equations are met for each hidden layer L_i present in the NN:

Y_i = {y_ij}, y_ij ∈ [y_0, y_1]    (5)
X_i = {x_ik}, x_ik ∈ [x_0, x_1]    (6)
W_i = {w_ikj}, w_ikj ∈ [w_0, w_1]    (7)
B_i = {b_ij}, b_ij ∈ [b_0, b_1].    (8)

Commonly, the output-layer activation (sigmoid, softmax) does not match the hidden layers' activation. Therefore, for the DNN to learn, the output layer should be quantized using an independent set of values Y_o, X_o, W_o, B_o that may or may not match Y_g, X_g, W_g, B_g. Consequently, the output layer is the only layer that, once mapped to the crossbar, requires full-custom periphery.

Figure 3. Simplified version of the proposed quantized graph for crossbar-aware training, automatically handling the global variables involved in the quantization process (weight/bias, activation and operation quantization, inputs, custom regularizers, and global control of X, Y, W and B) and achieving uniform scaling across layers.

Algorithm 1: Quantized training aided by Differentiable Architecture Search [21]
Input: set of global variables V_g = {X_g, Y_g, W_g, B_g}
Initialize V_g
while not converged do
    Update weights W
    Compute non-differentiable vars in V_g
    Update layer quantization parameters
end while

A. HW-Aware Graph Definition

The NN graphs are generated by TensorFlow Keras libraries. In order to perform the HW-aware training, elements controlling the quantization, accumulation clippings, and additional losses are added to the graph. Figure 3 describes these additional elements, denoted as global variables. For this purpose, the global variable control blocks manage the definition, updating and later propagation of the global variables.
A global variable is a variable used to compute a global set of values V_g composed of the previously introduced Y_g, X_g, W_g, B_g or others. Custom regularizer blocks may also be added to help the training converge when additional objectives are present.

B. HW-Aware NN Training

1) Differentiable Architecture and Variable Updating During Training: Each global variable can either be left un-updated during training, fixing the value of the corresponding global set in V_g, or be dynamically controlled using the related global variable control. If fixed, a design-space exploration is required in order to find the best set of global-variable hyperparameters for the given problem. Otherwise, we propose the use of a Differentiable Architecture (DA) [21] to automatically find the best set of global-variable values using back-propagation. To do so, we make use of DA to explore the NN design space, defining the global variables as functions of each layer's characteristics (mean, max, min, deviations, etc.). If they comply with the DA requirements, the global control elements automatically update the related variables, descending through the gradient computed in the back-propagation stage. Conversely, should a specific variable not be directly computable by gradient descent, it is updated in a later step, as depicted in Algorithm 1.

We also propose the use of DA in the definition of inference networks that target extremely low-precision layers (i.e. 2-bit weights and 2–4-bit activations), to explore the design space and to find the most suitable activation functions to be shared across the network's hidden layers. In the experiments of Section IV we explore the use (globally, in every hidden layer) of a traditional relu versus a customized tanh defined as tanh(x − th_g). Our NN training is able to choose the most appropriate activation, as well as to find the optimal parameter th_g.
The parameter th_g is automatically computed through gradient descent. However, to determine which kind of activation to use, we first define the continuous activation design space as

act(x) = a_0 relu(x) + a_1 tanh(x − th_g),    (9)

where {a_i} = {a_0, a_1} = A_g. The selected activation a_s is obtained after applying the softmax function to A_g:

a_s = softmax(A_g),    (10)

which forces either a_0 or a_1 to a value of 0 once the training converges [21].

2) Loss Definition: As introduced before, additional objectives/constraints related to the final HW characteristics may lead to non-convergence issues (see Section III-C). In order to help the convergence towards a valid solution, we introduce extra L_C terms in the loss computation that may depend on the training step. The final loss L_F is then defined as

L_F = L + L_L2 + L_L1 + L_C,    (11)

where L refers to the standard training loss, {L_L1, L_L2} refer to the standard L1 and L2 regularization losses, and L_C is the custom penalization. An example of such a regularization term is the penalization of weight values beyond a threshold W_T after training step N. This loss term can be formulated as

L_C = α_C Σ_W max(W − W_T, 0) H_V(step − N),    (12)

where α_C is a preset constant and H_V the Heaviside function. If the training still produced weights whose values surpass W_T, the H_V function can be substituted by a non-clipped function relu(step − N). In particular, this L_C function was used in the unipolarity experiments of Section IV.

3) Implemented Quantization Scheme: The implemented quantization stage takes as input a tensor T = {t_t}, t_t ∈ R, and projects it onto the quantized space Q = {q_q+, q_q−}, where q_q+ = α q/2^Q, q_q− = −α q/2^Q, and α ∈ R. The projection is denoted q(T) = T_q, where T_q = {t_q}, t_q ∈ Q.
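A minimal sketch of such a uniform projection follows; the symmetric level spacing α/2^q and the hard clip are assumptions for illustration (the paper's implementation uses TensorFlow fake-quant nodes [20], whose straight-through estimator additionally passes gradients through this projection unchanged):

```python
import numpy as np

def fake_quant(t, alpha, q):
    """Project tensor t onto uniform levels k * (alpha / 2^q),
    k integer in [-(2^q), 2^q]; the set always includes 0."""
    step = alpha / (2 ** q)
    return np.clip(np.round(t / step), -(2 ** q), 2 ** q) * step

t = np.array([-0.8, -0.33, 0.0, 0.26, 0.9])
tq = fake_quant(t, alpha=1.0, q=2)   # levels at multiples of 0.25
```

In the framework, the minimum and maximum of this level set come from the global variables V_g, so every layer is projected onto the same quantized space.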
For its implementation we use fake-quant operations [20] with a straight-through estimator as the quantization scheme, which provides the uniformly distributed set Q, always including 0. However, the quantization nodes shown in Figure 3 also allow the use of non-uniform quantization schemes. The definition of the quantized space Q is determined by the minimum and maximum values given by the global variables V_g.

Figure 4. Structure of the CIFAR10 and HAR classification NNs. CIFAR10: three pairs of conv2d blocks (32x32x32 through 8x8x128) with max pooling, plus the output layer; 310K parameters. HAR: 9 input channels of 100 samples, two FC layers plus the output layer; 133K parameters.

Algorithm 1 can consider either max/min functions or stochastic quantization schemes [20]. Similarly, the quantization stage is dynamically activated/deactivated using the global variable do_Q ∈ {0, 1}, which could easily be substituted to support incremental approaches [22]. In particular, and as shown in Section III-C, the use of an alpha-blending scheme [23] proves useful when the weight precision is very limited.

C. Unipolar Weight-Matrix Quantized Training

Mapping positive/negative weights to the same crossbar involves doubling the crossbar resources and introducing additional periphery. Using the proposed training scheme we can further restrict the characteristics of the DNN graph, obtaining unipolar weight matrices, by redefining some global variables as

W_g ∈ [0, w_1]    (13)

and introducing the L_C function defined by Equation 12. Moreover, for certain activations (relu, tanh, etc.) the maximum and/or minimum values are already known, and so the sets of parameters in V_g can be constrained even further. These maximum and minimum values can easily be mapped to specific parameters in the activation-function circuit interfacing the crossbar [19].
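The penalty of Eq. (12), as used for the unipolarity constraint of Eq. (13), can be sketched as follows. The values of α_C and N, and the choice of applying the term to −W with W_T = 0 so that the negative part of the weights is penalized, are illustrative assumptions:

```python
import numpy as np

def l_c(w, step, w_t, alpha_c, n):
    """Eq. (12): penalize weight values beyond threshold w_t,
    activated only after training step n (Heaviside gate)."""
    h = 1.0 if step - n >= 0 else 0.0       # H_V(step - n)
    return alpha_c * np.sum(np.maximum(w - w_t, 0.0)) * h

w = np.array([-0.4, -0.1, 0.3, 0.8])
# Unipolar training: penalize the negative part of w (apply to -w, w_t = 0).
penalty = l_c(-w, step=150, w_t=0.0, alpha_c=1.0, n=100)
```

Before step N the gate keeps the penalty at zero, so the weight distribution is only gradually pushed from the bipolar space towards the only-positive space, which is what helps convergence at very low precision.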
Finally, in cases where the weight precision is very limited (i.e. 2 bits), additional loss terms such as L_C gradually move the weight distributions from a bipolar space to an only-positive space, helping the training to converge. In summary, by applying the mechanisms described in Section III, we open the possibility of obtaining NN graphs containing only unipolar weights.

IV. EXPERIMENTS AND RESULTS

We have evaluated the presented methodology using the CIFAR10 and Human Activity Recognition (HAR) applications. CIFAR10 [24] comprises the classification of 32x32 images into 10 different categories. HAR classifies incoming data from different sensors (accelerometer, gyroscope, magnetometer, 3 channels each) into 12 different activities (run, jump, etc.). To mimic a smartwatch scenario we used real data from sensors placed on only one limb, from the dataset of [25]. Figure 4 describes the architectures used in each case: the CIFAR10 problem represents a good example of always-ON medium-sized DNNs, including multiple convolutional layers and 310K parameters. The HAR NN interfaces 9 input channels, each with a time-series input of 100 samples, followed by 2 fully connected layers, with a total of 133K parameters.

Figure 5. CIFAR10 CNN. Comparison between 4-bit quantized training with TensorFlow STE quantization and the proposed solution. 8-bit is shown as a baseline.

Table I. CIFAR10 CNN quantization-scheme comparison. Our proposal brings a 5.7× reduction in the number of different weights.

Scheme          | Accuracy | # Different Weights | Uniform DACs/ADCs
TF, 8-bit       | 88.10%   | 1372                | No
TF, 4-bit       | 84.43%   | 91                  | No
Proposed, 4-bit | 83.7%    | 16                  | Yes

A.
Accuracy vs Uniform Scaling T rade-of f Results: CIF AR10 After performing a quantization hyperparameter design ex- ploration we conducted the quantized training of the use case NN using both the standard STE approach in T ensorFlow library [20] and the proposed scheme. It is to be noted that T ensorFlow’s scheme does not quantize the bias. and thus when mapped to the crossbars, additional quantization studies would be needed. Figure 5 shows the ev olution of the Deep con volutional NN learning through the training process. When quantized with 4 -bit (weights and activ ations) our solution giv es accuracies only 0 . 7% away of the state of the art. Moreov er , and as described in T able I our solution provides a significant reduction in the number of full-custom circuit modules in volved in the algorithm-to-HW mapping. B. Unipolar W eights vs Accuracy T rade-of f 1) FC DNN: HAR: In this experiment we apply the pro- posed mechanisms to obtain a NN classifying different HAR activities whose weights take only positi ve v alues. The D A Algorithm 1 conducted the exploration of the NN design space, determining the NN architecture and parameters set that provided the best accuracy while using only positive weights within the NN. T o help the NN training to con verge, and following graph structure shown in Figure 3, custom regularizers were required to penalize negativ e weights, and a variation of alpha-blending quantization scheme [23] was introduced. Figure 6 summarizes the experiment results. NNs with bipolar weight matrices use relu as the hidden-layers HAR 3FC Standard STE Prop osed Bip olar Prop osed Unip olar Figure 6. Comparison between T ensorFlow STE quantization and proposed solution. Accuracy lev els are indistinguishable even with unipolar weight matrices. 0 20 40 60 80 100 P ercen tage of c hannels unip olar 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 
9 Accuracy CIF AR10 Unip olarit y Study not-quan tized 4b-Uniform Scaling 8b-Uniform Scaling 4b-STE 8b-STE Figure 7. CIF AR10 NN study on variable percent of unipolar channels. Reasonable accuracies are achiev ed with 40 − 50% of bipolar weights. activ ation. On the contrary , our design exploration algorithm found act = tanh ( x − th g ) as a function best suiting NNs based on unipolar weight matrices. The introduction of th g shift on the tanh function allows the network to map a small valued positiv e (negati ve) input to the activ ation as a small valued negati ve (positive) at the output, . The proposed solu- tion is as competitiv e as the standard one, while obtaining the significant benefit of a reduced set of weights and uniform HW . But more importantly , we demonstrate that small NN using only unipolar weight matrices /ADC can correctly perform classifications, aiding the deployment to NVM crossbars. 2) Deeper CNN: CIF AR10: For larger con volutional net- works the imposed unipolarity constraint can be to restrictiv e for the NN to correctly learn. W e propose imposing the constraint only to a certain amount of channels in each layer . The ratio of unipolar/bipolar channels will determine the final accuracy and the power and area savings. Figure 7 describes the results of applying hard unipolar weights constraint to the same CIF AR10 application, varying the percentage of unipolar channels in each conv olutional stage from 0% (bipolar weights) to 100% (completely unipolar weights), for the standard STE and uniform-scaling quanti- zation approaches. It can be seen how a minimum number V 1 V 2 V N F 0+ F 0- D A C I 1+ I 1- I 1 ADC - I M+ I M- I M - ADC D A C D A C V 1 V 2 V N F 0+ F 0- I 1+ I 1- I 1 ADC - I M+ I M- I M - ADC D A C D A C D A C T raditional H W implementati on D A Cs= C0 + C1 + ... ADCs= R0 + R1 + ... I 1+ I 1- I 1 ADC - I M+ I M- I M - ADC F 0+ F 0- F 0+ F 0- F 0+ F 0- F 0+ F 0- D A C D A C D A C D A Cs= max(C0, C1...) ADCs= ma x(R0, R1...) 
Figure 8. HW implementation differences between the traditional and the proposed approach, highlighting the periphery saved through a multiplexing scheme. (Traditional HW: DACs = C0 + C1 + ..., ADCs = R0 + R1 + ...; proposed HW with shared, MUXed resources: DACs = max(C0, C1, ...), ADCs = max(R0, R1, ...).)

Table II. Characteristics of designed ADC and DAC

Device    Power @ 10 MHz   Power @ 100 MHz   Area
DAC 4 b   3.2 µW           11.7 µW           101 µm²
DAC 8 b   4.4 µW           13.6 µW           440 µm²
ADC 4 b   1.28 µW          12.56 µW          1030 µm²
ADC 8 b   1.64 µW          16.39 µW          7920 µm²

It can be seen that a minimum number of channels in each convolutional layer is required for the NN to learn. Unipolar percentages above 60% impose a hard limitation, especially when the uniform-scaling training scheme is used. However, for the 8-bit STE quantization scheme, imposing 50% unipolar channels reduces the crossbar area/energy by 25% with a small 2% accuracy penalty. For the 4-bit scheme, a 20% area/energy saving comes with a 4% accuracy reduction. Therefore, even for more complex problems, we can greatly simplify the NN deployment by forcing a percentage of channels to be unipolar.

C. Energy and Area Benefits

Figure 8 describes the comparison of HW implementations, where we consider [8], in which each PCM NVM element (each parameter in our NN) consumes 0.2 µW and occupies a 25 F² area, equivalent to 0.075 µm². For the DAC/ADC characteristics, we designed in-house 4-bit and 8-bit elements using a 55 nm CMOS technology. Simulated power consumption and area are gathered in Table II. A 5%/10% power/area overhead over the ADC figures is added for an integrated adapted current subtractor whenever bipolar weights are present, and an additional 5% power penalty is applied for ADCs using current scaling. For each NN layer, DACs and ADCs are multiplexed only if the layer maintains uniform scaling with the rest of the system.
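The periphery accounting behind Figure 8 reduces to a sum versus a max over per-layer column/row counts. A minimal sketch (the layer shapes below are made-up placeholders, not the paper's actual networks):

```python
# DAC/ADC budget: traditional per-layer periphery vs. the proposed
# multiplexed scheme (cf. Figure 8): sum over layers vs. max over layers.
layer_cols = [128, 448, 256]   # DACs needed per layer (crossbar columns)
layer_rows = [256, 896, 512]   # ADCs needed per layer (crossbar rows)

dacs_traditional = sum(layer_cols)   # a dedicated DAC bank per layer
adcs_traditional = sum(layer_rows)
dacs_proposed = max(layer_cols)      # one shared, MUXed DAC bank
adcs_proposed = max(layer_rows)

print(dacs_traditional, dacs_proposed)   # 832 448
print(adcs_traditional, adcs_proposed)   # 1664 896
```

The max-based budget is only reachable when all layers share input ranges and scaling, which is exactly what the proposed uniform training scheme guarantees.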
With our proposed approach, all layers share the same input ranges, and only the last layer's ADCs differ from the rest of the system.

Table III. Estimated energy per inference: number of NVM cell reads (positive (+) and negative (-) weights) and DAC/ADC operations.

CIFAR10              TF 8 bits         TF 4 bits         Ours 4 bits
                     10/100 MHz        10/100 MHz        10/100 MHz
Total                1.6 / 0.19 µJ     1.59 / 0.18 µJ    1.58 / 0.18 µJ
NVM (±) ≈ 77e6       1.55 / 0.16 µJ    1.55 / 0.16 µJ    1.55 / 0.16 µJ
DAC ops ≈ 75e3       32.9 / 10.1 nJ    23.9 / 8.7 nJ     23.9 / 8.7 nJ
ADC* ops ≈ 115e3     22.6 / 22.6 nJ    18.4 / 18.1 nJ    16.5 / 16.2 nJ

HAR                  TF 8 bits         TF 4 bits         Ours 4 bits
                     10/100 MHz        10/100 MHz        10/100 MHz
Total                1.6 / 0.24 nJ     1.54 / 0.22 nJ    0.84 / 0.15 nJ
NVM (+) ≈ 34e3       0.7 / 0.07 nJ     0.7 / 0.07 nJ     0.7 / 0.07 nJ
NVM (-) ≈ 34e3       0.7 / 0.07 nJ     0.7 / 0.07 nJ     0 nJ
DAC ops ≈ 384        0.17 / 0.05 nJ    0.12 / 0.04 nJ    0.12 / 0.04 nJ
ADC* ops ≈ 268       0.052 / 0.05 nJ   0.04 / 0.03 nJ    0.03 / 0.03 nJ

1) Energy Estimation: To maximize the throughput per NN layer we consider one DAC (ADC) per column (row). From the power perspective, for each layer the number of NVM cell reads performing the multiplications (and, automatically, the additions) is F_i K_i X_i for the convolutional layers and X_i Y_i for the fully connected ones, where X_i, Y_i, and K_i refer to the sizes of the inputs, outputs, and kernel, respectively, and F_i refers to the number of filters of the i-th layer. Regarding DAC and ADC utilization, a total of X_i digital-to-analog and Y_i F_i analog-to-digital conversions are required. No analog scaling system is required. The resulting energy estimation per inference for both the bipolar CIFAR10 and the bipolar/unipolar HAR applications is displayed in Table III.
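Under the counting rules above, a per-layer energy estimate is a short calculation. The sketch below derives per-operation energies as power / clock frequency from Table II-style figures; the helper names and the E = P/f assumption are ours, and the device powers are the 4-bit entries of Table II plus the 0.2 µW PCM read figure from [8]:

```python
def layer_energy(nvm_reads, dac_ops, adc_ops, f_clk=10e6):
    """Rough energy of one layer, taking energy-per-op as power/clock.
    Powers assumed: 0.2 uW per PCM cell read [8]; 3.2 uW (4-bit DAC)
    and 1.28 uW (4-bit ADC) at 10 MHz, as in Table II."""
    e_nvm = nvm_reads * 0.2e-6 / f_clk
    e_dac = dac_ops * 3.2e-6 / f_clk
    e_adc = adc_ops * 1.28e-6 / f_clk
    return e_nvm + e_dac + e_adc

def conv_layer_counts(F, K, X, Y):
    """Operation counts for one convolutional layer:
    F*K*X NVM reads, X DAC conversions, Y*F ADC conversions."""
    return F * K * X, X, Y * F

# ~34e3 positive-weight reads (the HAR unipolar case in Table III)
# land near the reported ~0.7 nJ NVM energy at 10 MHz.
e = layer_energy(nvm_reads=34_000, dac_ops=0, adc_ops=0)
print(round(e * 1e9, 2))   # 0.68 (nJ)
```

This also makes the headline result easy to see: dropping the negative-weight reads entirely, as the unipolar HAR deployment does, removes one of the two dominant NVM terms.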
It can be seen that, since bipolar weights were needed in the image solution, and given the number of multiplications (over 38 million per inference), the energy saved is almost negligible. However, in very-low-power IoT applications, the proposed solution requires only 55% of the energy of traditional schemes, mainly thanks to the unipolar weight matrices encoded in the NVM crossbar.

2) Area Estimation: In traditional deployments, with F_i the number of filters present in a given layer L_i, F_i different full-custom ADCs would be designed and placed for that layer, freezing the applicability to a particular application. With our proposed scheme, however, we can deploy different NN applications on the same hardware, using many smaller, fixed-size crossbars. We can feed the incoming inputs in batches, reusing the kernels unrolled in the crossbar. Adopting this second scheme for the CIFAR10 example, the largest unrolled CNN layer requires an input of size 32 x 32 x 32. For example, if the crossbar size available in our reconfigurable system is 128 x 128, the layer can be batched in 256 operations; if the hardware blocks were composed of 512 x 128 elements, the layer could be batched in 64 operations. On the other hand, for smaller NNs this same hardware can fit entire layers: in the HAR benchmark each layer fits in a 128 x 128 crossbar. For both crossbar-size examples, every layer but the last would reuse the 128 DAC/ADC pairs during inference.

Table IV summarizes the area estimation when considering crossbars composed of 128 x 128 elements (a very conservative choice to avoid technology problems), assisted by 128 DACs, 128 ADCs, and additional periphery.

Table IV. Estimated area using 128 x 128 basic crossbar blocks.

CIFAR10              TF 8 bits   TF 4 bits   Ours 4 bits
Reconfigurable       No          No          Yes
Crossbars            44          44          44
DACs                 448         448         128
ADCs                 896         896         256
Current subtractors  896         896         256
Total area           8.05 mm²    1.1 mm²     0.22 mm²

HAR                  TF 8 bits   TF 4 bits   Ours 4 bits
Reconfigurable       No          No          Yes
Crossbars            6           6           3
DACs                 384         384         128
ADCs                 268         268         256
Current subtractors  268         268         0
Total area           2.51 mm²    0.35 mm²    0.28 mm²

For the traditional approaches, we follow the deployment schemes in the literature and consider that the number of ADCs in each layer does not need to match the crossbar column size, saving a considerable amount of area but precluding reconfigurability. On the contrary, with the proposed solution the DACs and ADCs are multiplexed. The benefits are noticeable. First, we guarantee that the HW is uniform across the NN, ensuring reconfigurability. Second, in the CIFAR10 benchmark the solution leads to up to 80% area saving (0.22 mm² vs. 1.1 mm²) for 4-bit accelerators; for the HAR benchmark, up to 20% area saving is achieved. When comparing against the traditional 8-bit deployment schemes, the area savings rise to 97% for the CIFAR10 CNN benchmark and 89% for the HAR NN.

V. CONCLUSIONS

ML at the edge requires accelerators that efficiently compute inference in constrained devices, and NVM-based analog accelerators are promising candidates thanks to their low-power capabilities. However, the full-custom per-layer design of the periphery interacting with the crossbars hinders the reconfigurability of the whole system. This work has presented the first solution that aids, at training time, the deployment of algorithms onto uniform crossbar/periphery blocks. With no accuracy penalty, the method simplifies the design of the crossbar periphery, significantly reducing the overall area and power consumption and enabling real re-usability and reconfigurability.
Moreover, we have demonstrated that DNNs with unipolar weight matrices can correctly perform bio-signal classification tasks while solving the negative/positive weights problem inherent to NVM crossbars, thereby reducing the crossbar area/energy by half and significantly simplifying the periphery design. We validated our solution against two different always-on sensing applications, CIFAR10 and HAR, obtaining competitive accuracies while simplifying the whole system design.

VI. ACKNOWLEDGEMENTS

This research on CIM architectures is supported by the EC Horizon 2020 Research and Innovation Program through the MNEMOSENE project under Grant 780215.

REFERENCES

[1] M. A. Zidan et al., "The future of electronics based on memristive systems," Nature Electronics, vol. 1, no. 1, pp. 22-29, Jan. 2018.
[2] S. Kodali et al., "Applications of Deep Neural Networks for Ultra Low Power IoT," in 2017 IEEE International Conference on Computer Design (ICCD). IEEE, Nov. 2017, pp. 589-592.
[3] I. Fedorov et al., "SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers," pp. 1-26.
[4] H. Huang et al., "A Highly-parallel and Energy-efficient 3D Multi-layer CMOS-RRAM Accelerator for Tensorized Neural Network," IEEE Transactions on Nanotechnology, no. c, pp. 1-1, 2017.
[5] Y. Zhang et al., "Hello Edge: Keyword Spotting on Microcontrollers," pp. 1-14, Nov. 2017.
[6] P. N. Whatmough et al., "FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning," vol. 1, Feb. 2019.
[7] F. Cai et al., "A fully integrated reprogrammable memristor-CMOS system for efficient multiply-accumulate operations," Nature Electronics, vol. 2, no. 7, pp. 290-299, 2019.
[8] S. Hamdioui et al., "Applications of Computation-In-Memory Architectures based on Memristive Devices," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, Mar. 2019, pp. 486-491.
[9] C. Li et al.,
"Analogue signal and image processing with large memristor crossbars," Nature Electronics, vol. 1, no. 1, pp. 52-59, Jan. 2018.
[10] A. Rahimi et al., "High-Dimensional Computing as a Nanoscalable Paradigm," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 9, pp. 2508-2521, Sep. 2017.
[11] S. Ambrogio et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, vol. 558, no. 7708, pp. 60-67, 2018.
[12] A. Serb et al., "Seamlessly fused digital-analogue reconfigurable computing using memristors," Nature Communications, vol. 9, no. 1, p. 2170, Dec. 2018.
[13] V. Joshi et al., "Accurate deep neural network inference using computational phase-change memory," pp. 1-10, Jun. 2019.
[14] C. Li et al., "Efficient and self-adaptive in-situ learning in multilayer memristor neural networks," Nature Communications, vol. 9, no. 1, p. 2385, Dec. 2018.
[15] F. Garcia-Redondo and M. Lopez-Vallejo, "On the Design and Analysis of Reliable RRAM-CMOS Hybrid Circuits," IEEE Transactions on Nanotechnology, vol. 16, no. 3, pp. 514-522, May 2017.
[16] R. Hasan et al., "A fast training method for memristor crossbar based multi-layer neural networks," Analog Integrated Circuits and Signal Processing, vol. 66, pp. 31-40, Oct. 2017.
[17] L. Ni et al., "Distributed In-Memory Computing on Binary RRAM Crossbar," ACM Journal on Emerging Technologies in Computing Systems, vol. 13, no. 3, pp. 1-18, Mar. 2017.
[18] M. Hu et al., "Dot-product engine for neuromorphic computing," in Proceedings of the 53rd Annual Design Automation Conference (DAC '16). New York, NY, USA: ACM Press, 2016, pp. 1-6.
[19] M. Giordano et al., "Analog-to-Digital Conversion With Reconfigurable Function Mapping for Neural Networks Activation Function Acceleration," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 367-376, Jun. 2019.
[20] R.
Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," Jun. 2018.
[21] H. Liu et al., "DARTS: Differentiable Architecture Search," Jun. 2018.
[22] A. Zhou et al., "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," pp. 1-14, Feb. 2017.
[23] Z.-G. Liu and M. Mattina, "Learning low-precision neural networks without Straight-Through Estimator (STE)," Mar. 2019.
[24] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," Ph.D. dissertation, 2009.
[25] O. Banos et al., "mHealthDroid: A Novel Framework for Agile Development of Mobile Health Applications," 2014, pp. 91-98.