Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices


Authors: Tayfun Gokmen, O. Murat Onen, Wilfried Haensch

Affiliations: IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA
Correspondence to: tgokmen@us.ibm.com

Abstract

In a previous work we detailed the requirements for obtaining a maximal performance benefit by implementing fully connected deep neural networks (DNNs) in the form of arrays of resistive devices for deep learning. Here we extend this concept of Resistive Processing Unit (RPU) devices towards convolutional neural networks (CNNs). We show how to map the convolutional layers to RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. We find that the noise and bound limitations imposed by the analog nature of the computations performed on the arrays affect the training accuracy of the CNNs. Noise and bound management techniques are presented that mitigate these problems without introducing any additional complexity in the analog circuits, as they can be addressed by the digital circuits. In addition, we discuss digitally programmable update management and device variability reduction techniques that can be used selectively for some of the layers in a CNN. We show that the combination of all these techniques enables a successful application of the RPU concept to training CNNs. The techniques discussed here are general and can be applied beyond CNN architectures, and therefore extend the applicability of the RPU approach to a large class of neural network architectures.

INTRODUCTION

Deep neural network (DNN) [1] based models have demonstrated unprecedented accuracy, in some cases exceeding human-level performance, in cognitive tasks such as object recognition [2][3][4][5], speech recognition [6], and natural language processing [7].
These accomplishments are made possible by advances in computing architectures and the availability of large amounts of labelled training data. Furthermore, network architectures are adjusted to take advantage of data properties such as spatial or temporal correlation. For instance, convolutional neural networks (CNNs) provide superior results for image recognition, and recurrent neural networks (RNNs) for speech and natural language processing. Therefore, the application space of the traditional fully connected deep learning network is diminishing. In a recent paper we introduced the concept of a resistive processing unit (RPU) as an architecture solution for fully connected DNNs. We will now show that the RPU concept is equally applicable to CNNs.

Training large DNNs is an extremely computationally intensive task that may take weeks even on distributed parallel computing frameworks utilizing many computing nodes [8][9][10]. There have been many attempts to accelerate DNN training by designing and using specialized hardware such as GPUs [11][12], FPGAs [13], or ASICs [14] that rely on conventional CMOS technology. All of these approaches share the common objective of packing a larger number of computing units into a fixed area and power budget by using optimized multiply-and-add units so that an acceleration over a CPU can be achieved. Although various microarchitectures and data formats are considered for different accelerator designs [15][16][17], all of these digital approaches use a similar underlying transistor technology, and therefore the acceleration factors will eventually be limited by scaling limitations.
In order to achieve even larger acceleration factors beyond conventional CMOS, novel nanoelectronic device concepts based on non-volatile memory (NVM) technologies [18], such as phase change memory (PCM) [19], resistive random access memory (RRAM) [20], and memristors [21][22][23], have been explored for implementing DNN training. Large acceleration factors compared to the conventional CPU/GPU based approaches, along with significant reductions in power and area, have been proposed [24][25][26]. However, for these bottom-up approaches the acceleration factors are still limited by device specifications intrinsic to their application as non-volatile memory (NVM) cells. Instead, using a top-down approach it is possible to develop a new class of devices, so-called Resistive Processing Unit (RPU) devices [27], that are free from these limitations and can therefore promise ultimate acceleration factors of 30,000x while providing a power efficiency of 84,000 GigaOps/s/W.

The concept of using resistive cross-point device arrays [27][28][29] as DNN accelerators has been tested to some extent by performing simulations using fully connected neural networks. The effect of various device features and system parameters on training performance has been evaluated to derive the device and system level specifications for a successful implementation of an accelerator chip for compute-efficient DNN training [27][30]. It is shown that resistive devices, which are analog in nature, need to respond symmetrically in up and down conductance changes when provided with the same but opposite pulse stimulus. Indeed, these specifications differ significantly from parameters typically used for memory elements and therefore require a systematic search for new physical mechanisms, materials, and device designs to realize an ideal resistive element for DNN training.
It is important to note, however, that these resistive cross-point arrays perform the multiply-and-add operations in the analog domain, in contrast to the CMOS based digital approaches. Therefore, it is not yet clear whether the proposed device specifications, which are sufficient to train a fully connected neural network, generalize to a broader set of network architectures, and hence their applicability requires further validation.

Fully Connected Neural Networks

Deep fully connected neural networks are composed of stacks of multiple fully connected layers such that the signal propagates from the input layer to the output layer through a series of linear and non-linear transformations. The whole network expresses a single differentiable error function that maps the input data onto class scores at the output layer. Most commonly the network is trained with simple stochastic gradient descent (SGD), in which the error gradient with respect to each parameter is calculated using the backpropagation algorithm [31]. The backpropagation algorithm is composed of three cycles, forward, backward, and weight update, that are repeated many times until a convergence criterion is met. For a single fully connected layer where n input neurons are connected to m output (or hidden) neurons, the forward cycle involves computing a vector-matrix multiplication (y = Wx), where the vector x of length n represents the activities of the input neurons and the matrix W of size m x n stores the weight values between each pair of input and output neurons. The resulting vector y of length m is further processed by performing a non-linear activation on each of its elements and is then passed to the next layer. Once the information reaches the final output layer, the error signal is calculated and backpropagated through the network.
The backward cycle on a single layer also involves a vector-matrix multiplication, this time on the transpose of the weight matrix (z = W^T δ), where the vector δ of length m represents the error calculated by the output neurons and the vector z of length n is further processed using the derivative of the neuron non-linearity and then passed down to the previous layers. Finally, in the update cycle the weight matrix W is updated by performing an outer product of the two vectors used in the forward and backward cycles, usually expressed as W ← W + η (δ x^T), where η is a global learning rate.

Mapping Fully Connected Layers to Resistive Device Arrays

All of the above operations performed on the weight matrix W can be implemented with a 2D crossbar array of two-terminal resistive devices with m rows and n columns, where the stored conductance values in the crossbar array form the matrix W. In the forward cycle, the input vector x is transmitted as voltage pulses through each of the columns and the resulting vector y can be read as current signals from the rows [32]. Similarly, when voltage pulses are supplied from the rows as an input in the backward cycle, a vector-matrix product is computed on the transpose of the weight matrix, W^T. Finally, in the update cycle voltage pulses representing the vectors x and δ are simultaneously supplied from the columns and the rows. In this setting each cross-point device performs a local multiplication and summation operation by processing the voltage pulses coming from its column and row, and hence achieves an incremental weight update. All three operating modes described above allow the arrays of cross-point devices that constitute the network to be active in all three cycles and hence enable a very efficient implementation of the backpropagation algorithm.
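As a point of reference for what the array computes, the three cycles for a single fully connected layer can be sketched in NumPy. This is a floating point sketch of the mathematics only, not of the analog hardware; the layer sizes and random values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                       # input and output neuron counts (illustrative)
W = rng.standard_normal((m, n)) * 0.1
eta = 0.01                        # global learning rate

x = rng.standard_normal(n)        # input neuron activities
delta = rng.standard_normal(m)    # error signals from the layer above

# Forward cycle: vector-matrix multiplication, y = W x
y = W @ x
# Backward cycle: multiplication on the transpose, z = W^T delta
z = W.T @ delta
# Update cycle: rank-one outer-product update, W <- W + eta * delta x^T
W_new = W + eta * np.outer(delta, x)
```

On an RPU array all three lines map to a single parallel array operation each, which is the source of the O(1) time complexity discussed below.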
Because of their local weight storage and processing capability, these resistive cross-point devices are called Resistive Processing Unit (RPU) devices [27]. An array of RPU devices can perform the operations involving the weight matrix W locally and in parallel, and hence achieves O(1) time complexity in all three cycles, independent of the array size.

Here, we extend the RPU device concept towards convolutional neural networks (CNNs). First we show how to map the convolutional layers to RPU device arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. Next, we show that identical RPU device specifications hold for CNNs. Our study shows, however, that CNNs are more sensitive to noise and bounds arising from the analog nature of the computations on RPU arrays. We discuss noise and bound management techniques that mitigate these problems without introducing any additional complexity in the analog circuits, as they can be addressed by the digital circuits. In addition, we discuss digitally programmable update management and device variability reduction techniques that can be used selectively for some of the layers in a CNN. We show that the combination of all these techniques enables a successful application of the RPU concept to training CNNs, and that a network trained with all the RPU device imperfections can yield a classification error indistinguishable from a network trained with high precision floating point numbers.

MATERIALS AND METHODS

Convolutional Layers

The input to a convolutional layer can be an image or the output of the previous convolutional layer and is generally considered as a volume with dimensions (n x n x d), with a width and height of n pixels and a depth of d channels corresponding to different input components (e.g. red, green, and blue components of an image), as illustrated in Figure 1A.
The kernels of a convolutional layer also form a volume that is spatially small along the width and height but extends through the full depth of the input volume, with dimensions (k x k x d). During the forward cycle, each kernel slides over the input volume across the width and height, and a dot product is computed between the parameters of the kernel and the input pixels at each position. Assuming no zero padding and single pixel sliding (stride equal to one), this 2D convolution operation results in a single output plane with dimensions ((n-k+1) x (n-k+1) x 1) per kernel. Since there exist m different kernels, the output becomes a volume with dimensions ((n-k+1) x (n-k+1) x m) and is passed to the following layers for further processing.

During the backward cycle of a convolutional layer similar operations are performed, but this time the spatially flipped kernels slide over the error signals that are backpropagated from the upper layers. The error signals form a volume with the same dimensions as the output, ((n-k+1) x (n-k+1) x m). The results of this backward convolution are organized into a volume with dimensions (n x n x d) and are further backpropagated for the error calculations in the previous layers. Finally, in the update cycle, the gradient with respect to each parameter is computed by convolving the input volume with the error volume used in the forward and backward cycles. This gradient information, which has the same dimensions as the kernels, is added to the kernel parameters after being scaled with a learning rate.

Figure 1. (A) Schematics of a convolutional layer showing the input volume, kernels, and the output volume. (B) Schematics of a convolutional layer mapped to an RPU array, showing the input and output matrices and their propagation through the kernel matrix during the forward, backward, and update cycles.
Mapping Convolutional Layers to Resistive Device Arrays

For an efficient implementation of a convolutional layer using an RPU array, all the input/output volumes as well as the kernel parameters need to be rearranged in a specific way. The convolution operation essentially performs a dot product between the kernel parameters and a local region of the input volume and hence can be formulated as a matrix-matrix multiplication. By collapsing the parameters of a single kernel to a column vector of length k^2 d and stacking all m different kernels as separate rows, a parameter matrix K of size m x k^2 d is formed that stores all of the trainable parameters associated with a single convolutional layer, as shown in Figure 1B. After this rearrangement, in the forward cycle the outputs corresponding to a specific location along the width and height can be calculated by performing a vector-matrix multiplication y = Kx, where the vector x of length k^2 d is a local region of the input volume and the vector y of length m has all the results along the depth of the output volume. By repeating this vector-matrix multiplication for different local regions, the full output volume can be computed. Indeed, this repeated vector-matrix multiplication is equivalent to a matrix-matrix multiplication Y = KX, where the matrix X with dimensions k^2 d x (n-k+1)^2 has the input neuron activities (with some repetition) and the resulting matrix Y with dimensions m x (n-k+1)^2 has all the results corresponding to the output volume. Similarly, using the transpose of the parameter matrix K, the backward cycle of a convolutional layer can also be expressed as a matrix-matrix multiplication Z = K^T D, where the matrix D with dimensions m x (n-k+1)^2 has the error signals corresponding to the error volume.
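This rearrangement of local input regions into columns is a standard flattening trick (often called im2col). A minimal NumPy sketch, with illustrative sizes rather than the paper's networks, verifying that the matrix-matrix product Y = KX reproduces a direct stride-one, no-padding convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 6, 3, 3, 4                     # input size, depth, kernel size, kernel count
image = rng.standard_normal((d, n, n))
kernels = rng.standard_normal((m, d, k, k))
out = n - k + 1                             # output width/height, (n - k + 1)

# Direct convolution: slide each kernel over the input and take dot products.
direct = np.zeros((m, out, out))
for i in range(out):
    for j in range(out):
        patch = image[:, i:i + k, j:j + k]
        direct[:, i, j] = np.tensordot(kernels, patch, axes=3)

# Rearrangement: each local region becomes a column of X (k^2 d rows),
# each flattened kernel becomes a row of K (m rows).
X = np.stack([image[:, i:i + k, j:j + k].ravel()
              for i in range(out) for j in range(out)], axis=1)
K = kernels.reshape(m, -1)
Y = K @ X                                   # m x (n-k+1)^2 output matrix

assert np.allclose(Y.reshape(m, out, out), direct)
```

The same flattening gives the backward pass as Z = K^T D, with D holding the error volume in the same column layout as Y.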
Furthermore, in this setting the update cycle also simplifies to a matrix multiplication, where the gradient information for the whole parameter matrix K can be computed using the matrices X and D, and the update rule can be written as K ← K + η (D X^T).

The rearrangement of the trainable parameters into a single matrix K by flattening of the kernels enables an efficient implementation of a convolutional layer using an RPU array. After this rearrangement, all the matrix operations performed on K can be computed as a series of vector operations on an RPU array. Analogous to the fully connected layers, the matrix K is mapped to an RPU array with m rows and k^2 d columns, as shown in Figure 1B. In the forward cycle, the input vector corresponding to a single column of X is transmitted as voltage pulses from the columns and the results are read from the rows. Repetition of this operation for all (n-k+1)^2 columns of X completes all the computations required for the forward cycle. Similarly, in the backward cycle the input vectors corresponding to the columns of D are serially fed to the rows of the array. The update rule shown above can be viewed as a series of updates, each involving the computation of an outer product between two matching columns from X and D. This can be achieved by serially feeding the columns of X and D simultaneously to the RPU array. During the update cycle each RPU device performs a series of local multiplication and summation operations and hence calculates the product of the two matrices.

We note that for a single input the total number of multiplication and summation operations that need to be computed in each of the three cycles of a convolutional layer is m k^2 d (n-k+1)^2, and this number is independent of the method of computation.
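The serial update just described relies on the identity that a matrix product is the sum of outer products of matching columns. A small NumPy sketch, with illustrative sizes, of accumulating the column pairs of X and D one at a time, as the array would during the update cycle:

```python
import numpy as np

rng = np.random.default_rng(1)
m, kd, cols = 4, 27, 9               # kernels, k^2 d, (n-k+1)^2 locations
D = rng.standard_normal((m, cols))   # error matrix (one column per location)
X = rng.standard_normal((kd, cols))  # rearranged input matrix
eta = 0.01

# Serial updates: one outer product per pair of matching columns.
K_inc = np.zeros((m, kd))
for t in range(cols):
    K_inc += eta * np.outer(D[:, t], X[:, t])

# Equivalent single matrix-product update, K <- K + eta * D X^T
assert np.allclose(K_inc, eta * D @ X.T)
```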
The proposed RPU mapping described above achieves this number as follows. Due to the inherent parallelism of the RPU array, m k^2 d operations are performed simultaneously for each vector operation performed on the array. Since there are (n-k+1)^2 vector operations performed serially on the array, the total number of computations is accounted for. Alternatively, one can note that there exist m k^2 d trainable parameters and that each parameter is used (n-k+1)^2 times thanks to the parameter sharing in a convolutional layer. Since each RPU device in an array can perform only a single computation at any given time, parameter sharing is achieved by accessing the array (n-k+1)^2 times. For fully connected layers each weight is used only once, and therefore all the computations can be completed with a single vector operation on the array. The end result of mapping a convolutional layer onto the RPU array is thus very similar to the mapping of a fully connected layer and does not change the fundamental operations performed on the array. We also emphasize that the convolutional layer with no zero padding and single pixel sliding described here is used for illustration purposes only. The proposed mapping is more general and can be applied to convolutional layers with zero padding, strides larger than a single pixel, dilated convolutions, or convolutions with non-square inputs or kernels.

RESULTS

In order to test the validity of this method we performed deep neural network training simulations for the MNIST dataset using a CNN architecture similar to LeNet-5 [33]. It comprises two convolutional layers with 5 x 5 kernels and hyperbolic tangent (tanh) activation functions. The first layer has 16 kernels while the second layer has 32 kernels.
Each convolutional layer is followed by a subsampling layer that implements the max pooling function over non-overlapping pooling windows of size 2 x 2. The output of the second pooling layer, consisting of 512 neuron activations, feeds into a fully connected layer of 128 tanh neurons, which is then connected to a 10-way softmax output layer. Training is performed repeatedly using a mini-batch size of unity for all 60,000 images in the training dataset, which constitutes a single training epoch. A learning rate of η = 0.01 is used throughout the training for all 30 epochs. Following the proposed mapping above, the trainable parameters (including the biases) of this architecture are stored in 4 separate arrays with dimensions of 16 x 26 and 32 x 401 for the first two convolutional layers, and 128 x 513 and 10 x 129 for the following two fully connected layers. We name these arrays C1, C2, F3, and F4, where the subscript denotes the layer's location and C and F denote convolutional and fully connected layers, respectively. When all four arrays are considered as simple matrices and the operations are performed with floating point (FP) numbers, the network achieves a classification error of 0.8% on the test data. This is the FP-baseline model that we compare against the RPU based simulations for the rest of the paper.

RPU Baseline Model

The influence of various RPU device features, variations, and non-idealities on the training accuracy of a deep fully connected network has been tested in Ref [27]. We follow the same methodology here, and as a baseline for all our RPU models we use the device specifications that resulted in an acceptable test error on the fully connected network. The RPU-baseline model uses the stochastic update scheme, in which the numbers encoded from the neurons (x_i and δ_j) are translated to stochastic bit streams so that each RPU device can perform a stochastic multiplication [34][35][36] through simple coincidence detection, as illustrated in Figure 2. In this update scheme the expected weight change can be written as

    E(Δw_ij) = BL Δw_min (C_x x_i)(C_δ δ_j)    (1)

where BL is the length of the stochastic bit stream, Δw_min is the change in the weight value due to a single coincidence event, and C_x and C_δ are the gain factors used during the stochastic translation for the columns and the rows, respectively. The RPU-baseline has BL = 10, C_x = C_δ = sqrt(η / (BL Δw_min)) = 1.0, and Δw_min = 0.001. The change in weight values is achieved by a conductance change in the RPU devices; therefore, in order to capture device imperfections, Δw_min is assumed to have cycle-to-cycle and device-to-device variations of 30%. The fabricated RPU devices may also show different amounts of change for positive and negative weight updates. This is taken into account by using a separate Δw_min+ for the positive updates and Δw_min− for the negative updates of each RPU device. The average value of the ratio Δw_min+/Δw_min− among all devices is assumed to be unity, as this can be achieved by a global adjustment of the voltage pulse durations/heights. However, device-to-device mismatch is unavoidable, and therefore a 2% variation is introduced for this parameter. To take conductance saturation into account, which is expected to be present in realistic RPU devices, the bound on the weight values, |w_ij|, is assumed to be 0.6 on average with a 30% device-to-device variation. We did not introduce any non-linearity in the weight update, as this effect is shown to be unimportant as long as the updates are reasonably balanced (symmetric) between up and down changes.
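A minimal Monte-Carlo sketch of the expected weight change of Eq. (1) for a single ideal cross-point device (no device variations; BL, Δw_min, and the gain factors follow the baseline values, while the x and δ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
BL = 10              # length of the stochastic bit stream
dw_min = 0.001       # weight change per single coincidence event
C_x = C_d = 1.0      # translation gain factors for columns and rows
x, delta = 0.4, 0.3  # illustrative neuron values

# Encode x and delta as Bernoulli pulse streams; a coincidence of a
# column pulse and a row pulse triggers one dw_min increment.
trials = 200_000
bx = rng.random((trials, BL)) < C_x * x
bd = rng.random((trials, BL)) < C_d * delta
mean_dw = dw_min * np.count_nonzero(bx & bd) / trials

expected = BL * dw_min * (C_x * x) * (C_d * delta)  # Eq. (1)
assert abs(mean_dw - expected) / expected < 0.05
```

The average simulated change converges to the analog product BL Δw_min C_x x C_δ δ, which is why coincidence detection implements the multiplication needed for the update cycle.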
During the forward and backward cycles the vector-matrix multiplications performed on an RPU array are prone to analog noise and signal saturation due to the peripheral circuitry. The array operations, including the input and output signals, are illustrated in Figure 2 for these two cycles. Because of the analog current integration during the measurement time (t_meas), the output voltage (V_out) generated over the integrating capacitor (C_int) will have noise contributions from various sources. These noise sources are taken into account by introducing an additional Gaussian noise, with zero mean and standard deviation σ = 0.06, to the results of the vector-matrix multiplications computed on an RPU array. In addition, these results are bounded to a value of |α| = 12 to account for signal saturation of the output voltage corresponding to the supply voltage of the op-amp. Table 1 summarizes all of the RPU-baseline model parameters used in our simulations, which are also consistent with the specifications derived in Ref [27].

Figure 2. Schematics of an RPU array operation during the backward and update cycles. The forward cycle operations are identical to the backward cycle operations except that the inputs are supplied from the columns and the outputs are read from the rows.

Table 1. Summary of the RPU-baseline model parameters

  BL: 10
  C_x = C_δ: 1.0
  Δw_min: average 0.001, device-to-device variation 30%, cycle-to-cycle variation 30%
  Δw_min+/Δw_min−: average 1.0, device-to-device variation 2%
  |w_ij|: average 0.6, device-to-device variation 30%
  Analog noise σ: 0.06
  Signal bound |α|: 12

The CNN training results for various RPU variations are shown in Figure 3A. Interestingly, the RPU-baseline model of Table 1 performs poorly and only achieves a test error between 10% and 20%, as shown by the black curve.
Not only is this value significantly higher than the FP-baseline value of 0.8%, but it is also higher than the 2.3% error rate achieved with the same RPU model for a fully connected network on the same dataset. Our investigation shows that this test error is due to the simultaneous contributions of the analog noise introduced during the backward cycle and the signal bounds introduced in the forward cycle only on the final RPU array, F4. As shown by the green curve, the model without analog noise in the backward cycle and with infinite bounds on F4 reaches a respectable test error of about 1.0%. When we only eliminate the noise while keeping the bounds, the model can follow a modest training up to about the 8th epoch, but then the error rate suddenly increases and reaches a value of about 10%. Similarly, if we only eliminate the bounds while keeping the noise, the model, shown by the red curve, performs poorly and the error rate stays around the 10% level.

Figure 3. Test error of the CNN on the MNIST dataset. Open white circles correspond to the model with the training performed using floating point (FP) numbers. (A) Lines with different colors correspond to RPU-baseline models with different noise terms in the backward cycle and signal bounds on the last classification layer, as given by the legend. (B) All lines marked with different colors correspond to RPU-baseline models including both the noise and the bound terms; however, the noise management and the bound management techniques are applied selectively, as given by the legend.

Noise and Bound Management Techniques

It is clear that the noise in the backward cycle and the signal bounds on the output layer need to be addressed for a successful application of the RPU approach to CNN training. The complete elimination of analog noise and signal bounds is not realistic for a real hardware implementation of RPU arrays.
Designing very low noise read circuitry with very large signal bounds is not an option, because it would introduce unrealistic area and power constraints on the analog circuits. Below we describe noise and bound management techniques that can easily be implemented in the digital domain without changing the design considerations of the RPU arrays and the supporting analog peripheral circuits.

During a vector-matrix multiplication on an RPU array, the input vector (x or δ) is transmitted as voltage pulses with a fixed amplitude and tunable durations, as illustrated in Figure 2. In a naive implementation, the maximal pulse duration represents unity, and all pulse durations are scaled accordingly depending on the values of x_i or δ_j. This scheme works optimally for the forward cycle with tanh (or sigmoid) activations, as all x_i in x, including a bias term, are between [-1, 1]. However, this assumption becomes problematic for the backward cycle, as there are no guarantees on the range of the error signals in δ. For instance, all δ_j in δ may become significantly smaller than unity (|δ| << 1) as the training progresses and the classification error gets smaller. In this scenario the results of a vector-matrix multiplication in the backward cycle, as shown by Eq (2) below,

    z = W^T δ + ξ    (2)

are dominated by the noise term ξ, as the signal term W^T δ does not generate enough voltage at the output. This is indeed why the noise introduced in the backward cycle brings the learning to a halt at around the 10% error rate, as shown by the models in Figure 3A. In order to get a better signal at the output even when |δ| << 1, we divide all δ_j in δ by the maximum value δ_max before the vector-matrix multiplication is performed on the RPU array.
We note that this division operation is performed in digital circuits and ensures that at least one signal at the input of the RPU array exists for the whole integration time, corresponding to unity. After the results of the vector-matrix multiplication are read from the RPU array and converted back to digital signals, we rescale the results by the same amount δ_max. In this noise management scheme, the result of a vector-matrix multiplication can be written as

    z = (W^T (δ/δ_max) + ξ) δ_max.    (3)

The result, z = W^T δ + ξ δ_max, effectively reduces the impact of the noise significantly for small error rates, for which δ_max << 1. This noise management scheme allows error signals that are arbitrarily small to be propagated, and maintains a fixed signal to noise ratio independent of the range of the numbers in δ.

In addition to the noise, the results of a vector-matrix multiplication are bounded by the term |α|, which corresponds to the maximum allowed voltage during the integration time. This value, |α| = 12, is not critical while calculating activations for hidden layers with tanh (or sigmoid) non-linearity, because the error introduced due to the bound is negligible for a value that is otherwise much larger. However, for the output layer with softmax activations the error introduced due to the bound may become significant. For instance, if there exist two outputs that are above the bounded value, they would be treated as equally strong and the classification task would yield two equally probable classes, even if one of the outputs is significantly larger than the other. This results in a significant error (major information loss) in estimating the class label and hence limits the performance of the network.
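A small NumPy sketch of the noise management scheme of Eq. (3), modeling the analog read as the ideal product plus additive Gaussian noise with σ = 0.06 as in Table 1 (the signal bound is ignored here; the weight matrix and the tiny late-training error magnitudes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.06                            # analog noise std from Table 1
W = rng.uniform(-0.6, 0.6, (8, 8))
delta = 1e-4 * rng.standard_normal(8)   # tiny late-training error signals

def analog_matvec(M, v):
    """Backward-cycle read: ideal product plus additive output noise."""
    return M.T @ v + sigma * rng.standard_normal(M.shape[1])

exact = W.T @ delta
naive = analog_matvec(W, delta)         # noise swamps the ~1e-4 signal

# Noise management: scale delta to full range digitally, multiply on
# the array, then rescale the digital result by the same delta_max.
d_max = np.max(np.abs(delta))
managed = analog_matvec(W, delta / d_max) * d_max

assert np.linalg.norm(managed - exact) < np.linalg.norm(naive - exact)
```

After rescaling, the residual noise is ξ δ_max rather than ξ, so the signal to noise ratio no longer degrades as the error signals shrink during training.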
Similar to the noise, the bounded signals start to become an issue in the later stages of the training, as the network starts to produce good test results and the decision boundaries between classes become more distinct. As shown by the blue curve in Figure 3A, at the beginning of the training the network successfully learns and test errors as low as 2% can be achieved; however, right around the 8th epoch the misleading error signals due to signal bounding force the network to learn unwanted features, and hence the error rate suddenly increases. In order to eliminate the error introduced by bounded signals, we propose repeating the vector-matrix multiplication after reducing the input strength by half whenever a signal saturation is detected. This guarantees that after a few iterations (n) the unbounded signals can be read reliably and then properly rescaled in the digital domain. In this bound management scheme, the effective vector-matrix multiplication on an RPU array can be written as

    y = (W (x/2^n) + ξ) 2^n    (4)

with a new effective bound of 2^n |α|. Note that the noise term is also amplified by the same factor; however, the signal to noise ratio remains fixed, at only a few percent, for the largest numbers that contribute most to the calculation of the softmax activations.

In order to test the validity of the proposed noise management (NM) and bound management (BM) techniques, we performed simulations using the RPU-baseline model of Table 1 with and without enabling NM and BM. The summary of these simulations is presented in Figure 3B. When both NM and BM are off, the model using the RPU baseline of Table 1, shown as the black curve, performs poorly, similar to the black curve in Figure 3A. Similarly, turning on either NM or BM alone (as shown by the red and blue curves) is not sufficient, and the models reach test errors of about 10%. However, when both NM and BM are enabled the model achieves a test error of about 1.
7% as shown by the green curve. This is very simi lar to the model with no analog n oise and infinite bound s presented in Fi gure 3A and shows the succ ess of t he noise and bound management techniques. Si mply rescaling the numbers in the digital domain these techniques mitigate both the noise and the bound problems inherent to analog computations performed on RPU arrays. The additional comput ations int roduced in the digital domain due to NM a nd BM are not si gnifican t and can be addressed with a proper digital design. For the NM technique, > HRK needs to be searched in  and each element in  (and  ) value needs to be divided and multiplied with t his > HRK value. Al l of these computations require addi tional  #  $ comparison, div ision and mul tiplication operations that ar e performed in the digital domain. However, given that t he same circuits need to compute  # $ error signals usin g the derivative of the activation function s, these additional operations do es not chan ge the Page 12 of 22 complexit y of the operat ions that needs to b e performed b y th e digital circuits. Therefore, wit h pro per design these additional operations c an be perfo rmed with onl y a sl ight overhe ad without causi ng significant sl owdown on t he d igital circuits . Similarly , BM can be handled in the dig ital domain by performing  # $ computations onl y when a signal saturation is detected. Sensitivity to Device Variation s The RPU-baseline model with NM and BM perform s reasonable well and a chieves a test error o f %6j< , however, this is still above the 6< value achieved with a FP-baseline model. In order t o identi fy the dominating factors and l ayers contribut ing to this addi tional classi fication error, we performed si mulation s while selectivel y eliminating various devic e imperfections from d ifferent la yers. 
The summary of these results is shown in Figure 4, where the average test error achieved between the 25th and 30th epochs is reported on the y-axis, along with an error bar that represents the standard deviation over the same interval. The black data points in Figure 4 correspond to experiments where the device-to-device and cycle-to-cycle variations corresponding to the parameters Δw_min, the imbalance Δw_min+/Δw_min−, and the weight bounds |w_ij| are completely eliminated for different layers while the average values are kept unaltered. The model that is free from device variations for all four layers achieves a test error of about 1%. We note that most of this improvement comes from the convolutional layers, as a very similar test error of 1.1% is achieved for the model that does not have device variations for C1 and C2, whereas the model without any device variations for the fully connected layers F1 and F2 remains close to the baseline level. Among the convolutional layers, it is clear that C2 is more influential than C1, as a lower test error is achieved for the model with device variations eliminated only for C2 than for the one with device variations eliminated only for C1. Interestingly, when we repeated similar experiments by eliminating only the device-to-device variation of the imbalance parameter Δw_min+/Δw_min− from different layers, a very similar trend is observed, as shown by the red data points. These results highlight the importance of the device imbalance and show that even a few percent of device imbalance can still be harmful while training a network.

Figure 4. Average test error achieved between the 25th and 30th epochs for various RPU models with varying device variations. Black data points correspond to experiments where the device-to-device and cycle-to-cycle variations corresponding to the parameters Δw_min, Δw_min+/Δw_min−, and |w_ij| are all completely eliminated from different layers.
Red data points correspond to experiments where only the device-to-device variation of the imbalance parameter Δw_min+/Δw_min− is eliminated from different layers. Green points correspond to models where multiple RPU devices are mapped per weight for the second convolutional layer C2. The RPU-baseline model with noise and bound management as well as the FP-baseline model are also included for comparison.

It is clear that reducing the device variations in some layers can further boost the network performance; however, for realistic technological implementations of the crossbar arrays, variations are controlled by the fabrication tolerances of a given technology. Therefore, complete or even partial elimination of any device variation is not a realistic option. Instead, in order to get better training results, the effects of the device variations can be mitigated by mapping more than one RPU device per weight, which averages out the device variations and reduces the variability [37]. Here, we propose a flexible multi-device mapping that can be realized in the digital domain by repeating the input signals going to the columns (or rows) of an RPU array, and/or summing (averaging) the output signals generated from the rows (or columns). Since the same signal propagates through many devices and the results are summed in the digital domain, this technique allows averaging out the device variations in the array without physically hardwiring the lines corresponding to different columns or rows. To test the validity of this digitally controlled multi-device mapping approach, we performed simulations using models where the mapping of the most influential layer, C2, is repeated on 4 or 16 devices. Indeed, this multi-device mapping approach reduces the test error to 1.45% and 1.35% for the 4- and 16-device mapping cases, respectively, as shown by the green data points in Figure 4.
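The averaging effect of the multi-device mapping can be emulated with a short numerical experiment: the input vector is conceptually repeated across several device copies of the same weight matrix and the outputs are averaged digitally. The 30% device-to-device variation, the matrix dimensions, and the 16-device tiling below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_in, n_dev = 32, 64, 16          # 16-device mapping (illustrative)

W_target = rng.standard_normal((n_out, n_in))
# Each programmed device deviates from its target conductance;
# a 30% device-to-device variation is an illustrative assumption.
copies = W_target + 0.3 * rng.standard_normal((n_dev, n_out, n_in))

x = rng.standard_normal(n_in)
y_single = copies[0] @ x                 # one device per weight
y_multi = copies.mean(axis=0) @ x        # inputs repeated, outputs averaged digitally
y_exact = W_target @ x

err_single = np.linalg.norm(y_single - y_exact)
err_multi = np.linalg.norm(y_multi - y_exact)
# Averaging over n_dev devices shrinks the error roughly as sqrt(n_dev).
```

Because the repetition and summation happen in the digital domain, the same array can realize different numbers of device copies for different layers without any rewiring.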
The number of devices (n_d) used per weight effectively reduces the device variations by a factor proportional to √n_d. Note that the 16-device mapping of C2 reduces the device variations by a factor of 4, at the cost of a corresponding increase in the array dimensions. However, assuming RPU arrays are fabricated with equal numbers of columns and rows, multi-device mapping of rectangular matrices such as C2 does not introduce any operational (or circuit) overhead as long as the mapping fits within the physical dimensions of a square array. In addition, this method allows flexible control of the number of devices used while mapping different layers, and it is therefore a viable approach to mitigate the effects of device variability for realistic technological implementations of the crossbar arrays.

Update Management

All RPU models presented so far use the stochastic update scheme with a bit length of BL = 10 and amplification factors that are equally distributed between the columns and the rows, with values C_x = C_δ = √(η/(BL·Δw_min)) = 1.0. The choice of these values is dictated by the learning rate, which is a hyper-parameter of the training algorithm; therefore the hardware should be able to handle any value without imposing restrictions on it. The learning rate for the stochastic model is the product of four terms: Δw_min, BL, C_x and C_δ. Among them, Δw_min corresponds to the incremental conductance change on an RPU device due to a single coincidence event; therefore the control of this parameter may be restricted by the hardware. For instance, Δw_min may be controlled only by shaping the voltage pulses used during the update cycle, which requires programmable analog circuits. In contrast, the control of C_x, C_δ and BL is much easier and can be implemented in the digital domain.

Figure 5. Average test error achieved between the 25th and 30th epochs for various RPU models with varying update schemes. Black data points correspond to updates with amplification factors that are equally distributed between the columns and the rows. Red data points correspond to models that use the update management scheme. The RPU-baseline model with noise and bound management as well as the FP-baseline model are also included for comparison.

To test the effect of C_x, C_δ and BL on the training accuracy, we performed simulations using the RPU-baseline model with the noise and bound management techniques described above while varying those parameters. For all models we used the same fixed learning rate η = 0.01 and Δw_min = 0.001, and the summary of these results is shown in Figure 5. For the first set of models we varied BL, and the value √(η/(BL·Δw_min)) is used equally for the amplification factors C_x and C_δ. Interestingly, increasing BL to 40 did not improve the network performance, whereas reducing it to 1 boosted the performance, and a test error of about 1.3% is achieved. These results may seem counterintuitive, as one might expect the larger-BL case to be less noisy and hence to perform better. However, for the BL = 40 case the amplification factors are smaller (C_x = C_δ = 0.5) in order to satisfy the same learning rate on average. This reduces the probability of generating a pulse, but since there exist longer pulse streams during the update, the average update and its variance do not change. In contrast, for BL = 1 the amplification factors are larger, with a value of 3.16, and therefore pulse generation becomes more likely. Indeed, for cases where the amplified numbers are larger than unity (C_x·x_i ≥ 1 or C_δ·δ_j ≥ 1) a single update pulse is generated with certainty. This makes the updates more deterministic, but with an earlier onset of clipping.
Note that for BL = 1 the weight value can only move by a single Δw_min per update cycle. However, also note that the convolutional layers C1 and C2 receive 576 and 64 single-bit stochastic updates per image, respectively, due to weight reuse, while the fully connected layers F1 and F2 receive only one single-bit stochastic update per image. Although the interaction of all of these terms and the resulting tradeoffs are non-trivial, it is clear that the above CNN architecture favors BL = 1, whereas the DNN used in Ref. [27] favored BL = 10. These results emphasize the importance of designing flexible hardware that can control the number of pulses used for the update cycle. We note that this flexibility can be achieved seamlessly for the stochastic update scheme without changing the design considerations for the peripheral circuits generating the random pulses.

In addition to BL, for the second set of experiments the amplification factors C_x and C_δ used during the update cycle are also varied to some extent while keeping the average learning rate fixed. The models above all assume that the same C_x and C_δ are used during updates; however, it is possible to use arbitrarily larger or smaller C_x and C_δ as long as the product satisfies C_x·C_δ = η/(BL·Δw_min). In our update management scheme, we choose C_x and C_δ such that the probabilities of generating pulses from the columns (x) and the rows (δ) are of roughly the same order. This is achieved by rescaling the amplification factors with a ratio m = √(δ_max/x_max), so that the amplification factors can be written as C_x = m·√(η/(BL·Δw_min)) and C_δ = (1/m)·√(η/(BL·Δw_min)). Although for BL = 10 this method did not yield any improvement, for BL = 1 an error rate as low as 1.1% is achieved, which shows that the proposed update management scheme can yield better training results.
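The balancing above can be illustrated numerically. The helper below (a hypothetical name, not part of any hardware interface) computes the per-line pulse probabilities of the stochastic update scheme with and without the rescaling, using η = 0.01 and Δw_min = 0.001 as in the text; the example vectors mimic activations near unity and error signals near zero.

```python
import numpy as np

def pulse_probs(x, delta, lr=0.01, bl=1, dw_min=0.001, managed=True):
    """Probabilities of emitting an update pulse on each column (x) and
    row (delta) line. A weight changes by dw_min on a coincidence of
    pulses, so the expected update is lr * x_i * delta_j per cycle
    whenever no probability clips at 1."""
    k = np.sqrt(lr / (bl * dw_min))        # sqrt(eta / (BL * dw_min))
    m = np.sqrt(np.max(np.abs(delta)) / np.max(np.abs(x))) if managed else 1.0
    p_col = np.clip(m * k * np.abs(x), 0.0, 1.0)        # C_x = m * k
    p_row = np.clip((k / m) * np.abs(delta), 0.0, 1.0)  # C_delta = k / m
    return p_col, p_row

x = np.array([0.9, -1.0, 0.8])        # activations close to unity
delta = np.array([1e-3, -2e-3])       # error signals close to zero

pc_u, pr_u = pulse_probs(x, delta, managed=False)  # columns clip to 1
pc_m, pr_m = pulse_probs(x, delta, managed=True)   # balanced, no clipping
```

Without management, every column fires deterministically (its probability clips at 1), which produces exactly the row-wise correlated updates described above; with the rescaling, all probabilities stay below unity and the coincidence probabilities reproduce the target expected update.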
This proposed update scheme does not alter the expected change in the weight value, so its benefits may not be obvious. Note that towards the end of training it is very likely that the ranges of numbers in the columns (x) and rows (δ) are very different; i.e., x has many elements close to 1 (or −1) whereas δ has elements very close to zero (δ_max ≪ 1). In this case, if the same C_x and C_δ are used, the updates become row-wise correlated. Although the generation of a pulse for a given δ_j is unlikely, when it happens it results in many coincidences along row j, as there are many pulses generated by the different columns, since many x_i values are close to unity. Our update management scheme eliminates these correlated updates by shifting the pulse probabilities from the columns to the rows through a simple rescaling of the numbers used during the update. One may view this scheme as using rescaled vectors (m·x and δ/m) for the updates, which are composed of numbers of roughly the same order. This update management scheme relies on a simple rescaling of the numbers that is performed in the digital domain; therefore, it does not change the design of the analog circuits needed for the update cycle. The additional computations introduced in the digital domain are not significant either, and only require additional O(n) operations, similar to the noise management technique.

Results Summary

The summary of the CNN training results for the various RPU models that use the above management techniques is shown in Figure 6. When all management techniques are disabled, the RPU-baseline model can only achieve a test error above 10%. This large error rate is reduced significantly, to a value of about 1.7%, by the noise and bound management techniques, which address the noise issue in the backward cycle and the signal saturation in the forward cycle.
Additionally, when the update management scheme is enabled with a reduced bit length during the updates, the model reaches a test error of 1.1%. Finally, the combination of all of those management techniques with the 16-device mapping on the second convolutional layer brings the model's test error down to the FP-baseline level. The performance of this final RPU model is almost indistinguishable from the FP-baseline model, which demonstrates the successful application of the RPU approach for training CNNs. We note that all those methods can be turned on selectively, simply by programming the operations performed in the digital circuits; therefore, they can be applied to any network architecture beyond CNNs without changing the design considerations for realistic technological implementations of the crossbar arrays and the analog peripheral circuits.

Figure 6. Test error of the CNN on the MNIST dataset. Open white circles correspond to the model trained using floating point numbers. Lines with different colors correspond to the RPU-baseline model with the different management techniques enabled progressively.

DISCUSSION AND CONCLUSIONS

The application of the RPU device concept to training CNNs requires a rearrangement of the kernel parameters, and only after this rearrangement can the inherent parallelism of the RPU array be fully utilized for the convolutional layers. A single vector operation performed on the RPU array takes constant time, O(1), independent of the array size; however, because of the weight sharing in convolutional layers, the RPU arrays are accessed several times, resulting in a series of vector operations performed on the array in all three cycles. These repeated vector operations introduce interesting challenges and opportunities while training CNNs on RPU-based hardware.
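The kernel rearrangement can be made concrete with a toy im2col-style sketch, which is one standard way to realize such a mapping and is used here as an illustrative assumption with MNIST-like dimensions: each kernel is flattened into one row of the weight matrix and each image patch is presented as an input vector, so the convolution becomes a series of vector-matrix products on a single array, with the number of patches equal to the weight reuse factor.

```python
import numpy as np

def im2col(image, k):
    """Rearrange all k x k patches of a 2-D image into columns so that
    the convolution becomes a single matrix product on the array."""
    H, W = image.shape
    cols = [image[i:i + k, j:j + k].ravel()
            for i in range(H - k + 1) for j in range(W - k + 1)]
    return np.stack(cols, axis=1)        # shape (k*k, n_patches)

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))    # MNIST-sized input (illustrative)
kernels = rng.standard_normal((16, 5, 5))

K = kernels.reshape(16, -1)              # 16 kernels as a 16 x 25 matrix
patches = im2col(image, 5)               # 25 x 576 patch matrix
out = K @ patches                        # 576 column reads of one array
# The 24 * 24 = 576 patches per image equal the weight reuse factor,
# i.e. the number of repeated vector operations on the array.
```

In hardware the 576 columns of `patches` would be streamed through the array one vector at a time, which is exactly the repeated access pattern discussed in the text.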
The array sizes, weight sharing factors (k) and the number of multiply-and-add (MAC) operations performed at the different layers of a relatively simple but respectable CNN architecture, AlexNet [2], are shown in Table 2. This architecture won the large-scale ImageNet competition by a large margin in 2012. We understand that there has been significant progress since 2012, and we choose the AlexNet architecture only for its simplicity and to illustrate the interesting possibilities that RPU-based hardware enables while designing new network architectures.

Table 2. Array sizes, weight sharing factors and number of MACs performed for each layer of the AlexNet* [2] architecture

Layer | RPU Array Size (Matrix Size) | Weight Sharing Factor (k) | MACs
C1 | 364 × 96 | 3025 | ≈105 M
C2 | 2401 × 256 | 729 | ≈448 M
C3 | 2305 × 384 | 169 | ≈150 M
C4 | 3457 × 384 | 169 | ≈224 M
C5 | 3457 × 256 | 169 | ≈150 M
F6 | 9217 × 4096 | 1 | ≈38 M
F7 | 4097 × 4096 | 1 | ≈17 M
F8 | 4097 × 1000 | 1 | ≈4 M
Total | | | ≈1.1 G

*The table assumes that the weights originally distributed over two GPUs are contained in a single RPU array for each layer.

When the AlexNet architecture runs on conventional hardware (such as a CPU, GPU or ASIC), the time to process a single image is dictated by the total number of MACs; therefore, the contributions of the different layers to the total workload are additive, with C2 consuming about 40% of the workload. The total number of MACs is usually considered the main metric that scales the training time, and hence practitioners deliberately construct network architectures while keeping the total number of MACs below a budget to avoid infeasible training times. This constrains the choice of the number of kernels and their dimensions for each convolutional layer as well as the sizes of the pooling layers.
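The per-layer MAC counts in Table 2 follow directly from the product of array size and weight sharing factor. The sketch below recomputes them from the standard AlexNet dimensions, with the two GPU halves merged and a +1 bias row per the table's assumption; treat the exact values as approximate reconstructions.

```python
# (array_rows, array_cols, sharing) per layer; rows include a +1 bias term.
layers = {
    "C1": (364, 96, 3025),   # 11*11*3+1 inputs, 96 kernels, 55*55 outputs
    "C2": (2401, 256, 729),  # 5*5*96+1 inputs, 256 kernels, 27*27 outputs
    "C3": (2305, 384, 169),  # 3*3*256+1 inputs, 384 kernels, 13*13 outputs
    "C4": (3457, 384, 169),
    "C5": (3457, 256, 169),
    "F6": (9217, 4096, 1),
    "F7": (4097, 4096, 1),
    "F8": (4097, 1000, 1),
}
macs = {name: r * c * s for name, (r, c, s) in layers.items()}
total = sum(macs.values())
# C2 dominates the MAC count (~40% of the total), while C1 has the
# largest weight sharing factor and therefore dominates the RPU runtime.
```

The two metrics single out different layers: a MAC-bound machine spends most of its time in C2, whereas a pipelined RPU machine is limited by C1's reuse factor, as discussed next.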
Assuming a compute-bound system, the time to process a single image on conventional hardware can be estimated as the ratio of the total number of MACs to the performance metric of the corresponding hardware (Total MACs / Throughput).

In contrast to conventional hardware, when the same architecture runs on RPU-based hardware, the time to process a single image is not dictated by the total number of MACs; instead, it is dominated by the largest weight reuse factor in the network. For the above example, the operations performed on the first convolutional layer C1 take the longest time among all layers because of its large weight reuse factor of k = 3025, even though this layer has the smallest array size and only 10% of the total number of MACs. Assuming an RPU-based accelerator with many RPU arrays and pipeline stages between them, the average time to process a single image can be estimated as k × t_meas using the values from layer C1, where t_meas is the measurement time corresponding to a single vector-matrix multiplication on the RPU array. First of all, this metric emphasizes the constant-time operation on RPU arrays, as the training time is independent of the array sizes, the number of trainable parameters in the network, and the total number of MACs. This would encourage practitioners to use increasing numbers of kernels with larger dimensions without worrying about training times, which is otherwise impossible with conventional hardware. However, the same metric also highlights the importance of t_meas and k for layer C1, which causes a bottleneck; it is therefore desirable to come up with strategies that reduce both parameters. In order to reduce t_meas, we first discuss designing small RPU arrays that can operate faster.
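The contrast between the two estimates can be made explicit with back-of-the-envelope arithmetic. The conventional-hardware throughput below is an assumed figure for illustration; t_meas = 80 ns is the large-array read time reported in [27], and the reuse factor of 3025 comes from layer C1 of Table 2.

```python
total_macs = 1.14e9      # approximate AlexNet MACs per image (Table 2)
throughput = 5e12        # assumed effective MAC/s of conventional hardware
t_conventional = total_macs / throughput   # compute-bound estimate

t_meas = 80e-9           # single vector-matrix read on a large array [27]
k_c1 = 3025              # largest weight reuse factor (layer C1)
t_rpu = k_c1 * t_meas    # pipeline-limited RPU estimate per image
# t_rpu (~242 us) depends only on C1's reuse factor and t_meas,
# not on the array sizes or the total MAC count.
```

Note that enlarging every layer of the network leaves t_rpu unchanged as long as the largest reuse factor stays fixed, whereas t_conventional grows with every added MAC.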
It is clear that designing large arrays is beneficial for achieving a high degree of parallelism in the vector operations performed on the arrays; however, due to the parasitic resistance and capacitance of the transmission lines, the largest array size is limited to 4096 × 4096 [27]. For an array of this size, the measurement time of t_meas = 80 ns is derived by considering the acceptable noise threshold, which is dominated by the thermal noise of the RPU devices. However, for a much smaller array the acceptable noise threshold is not dominated by the thermal noise, and hence t_meas can be reduced for faster computations. It is not desirable to build an accelerator chip composed entirely of small arrays, since for a small array the power and area are dominated by the peripheral circuits (mainly by the ADCs); therefore, a small array has worse power and area efficiency metrics than a large array. However, a bimodal design consisting of both large and small arrays achieves better hardware utilization and provides a speed advantage when mapping architectures with significantly varying matrix dimensions. While the large arrays are used to map fully connected layers or large convolutional layers, for a convolutional layer such as C1 a small array is the better solution, providing a reduction in t_meas.

In order to reduce the weight reuse factor of C1, we next discuss allocating two (or more) arrays to the first convolutional layer. When more than one array is allocated to the first convolutional layer, the network can be forced to learn separate features on the different arrays by properly directing the upper (left) and lower (right) portions of the image to separate arrays and by computing the error signals and the updates independently.
Not only does this allow the network to learn independent features for the separate portions of the image, but for each array the weight reuse factor is also reduced by a factor of 2. This reduces the time to process a single image while making the architecture more expressive. Alternatively, one could synchronize the two arrays by randomly shuffling which portions of the images are processed by which array. This approach would force the network to learn the same features on the two arrays, with the same factor-of-2 reduction in the weight reuse factor. These subtle changes in the network architecture provide no speed advantage when run on conventional hardware, which highlights the interesting possibilities that an RPU-based architecture offers.

In summary, we show that the RPU concept can be applied beyond fully connected networks and that RPU-based accelerators are a natural fit for training convolutional neural networks as well. These accelerators promise unprecedented speed and power benefits, with hardware-level parallelism that grows with the number of trainable parameters in the network. Because of the constant-time operation on RPU arrays, RPU-based accelerators offer interesting network architecture choices without increasing the training time. However, all of the benefits of an RPU array are tied to the analog nature of the computations performed on it, which introduces some challenges. We show that digitally programmable management techniques are sufficient to eliminate the noise and bound limitations imposed on the array. Furthermore, their combination with the update management and device variability reduction techniques enables a successful application of the RPU concept for training CNNs.
All the management techniques discussed in this paper are handled in the digital domain without changing the design considerations for the array and the supporting analog peripheral circuits. These techniques enable the applicability of the RPU approach to a wide variety of networks beyond convolutional or fully connected networks.

References

[1] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[2] A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet classification with deep convolutional neural networks," NIPS, pp. 1097-1105, 2012.
[3] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR, 2015.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going deeper with convolutions," CVPR, 2015.
[5] K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[6] G. Hinton, L. Deng, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, pp. 82-97, 2012.
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa, "Natural Language Processing (Almost) from Scratch," Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[8] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean and A. Ng, "Building high-level features using large scale unsupervised learning," International Conference on Machine Learning, 2012.
[9] S. Gupta, W. Zhang and F. Wang, "Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study," IEDM, 2016.
[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang and A. Ng, "Large scale distributed deep networks," in NIPS'12, 2012.
[11] A. Coates, B. Huval, T. Wang, D. Wu and A. Ng, "Deep learning with COTS HPC systems," ICML, 2013.
[12] R. Wu, S. Yan, Y. Shan, Q. Dang and G. Sun, "Deep Image: Scaling up Image Recognition," 2015.
[13] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," 2015.
[14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609-622, 2014.
[15] J. Emer, V. Sze and Y. Chen, "Tutorial on Hardware Architectures for Deep Neural Networks," in IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.
[16] Y. Arima, K. Mashiko, K. Okada, T. Yamada, A. Maeda, H. Notani, H. Kondoh and S. Kayano, "A 336-neuron, 28K-synapse, self-learning neural network chip with branch-neuron-unit architecture," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1637-1644, 1991.
[17] C. Lehmann, M. Viredaz and F. Blayo, "A generic systolic array building block for neural networks with on-chip learning," IEEE Transactions on Neural Networks, vol. 4, pp. 400-407, 1993.
[18] G. Burr, R. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, L. Sanches, I. Boybat, M. Le Gallo, K. Moon, J. Woo, H. Hwang and Y. Leblebici, "Neuromorphic computing using non-volatile memory," Advances in Physics: X, pp. 89-124, 2017.
[19] D. Kuzum, S. Yu and H. Wong, "Synaptic electronics: materials, devices and applications," Nanotechnology, vol. 24, 2013.
[20] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
[21] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. Likharev and D. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, pp. 61-64, 2015.
[22] E. Merced-Grafals, N. Davila, N. Ge, S. Williams and J. Strachan, "Repeatable, accurate, and high speed multi-level programming of memristor 1T1R arrays for power efficient analog computing applications," Nanotechnology, 2016.
[23] D. Soudry, D. Di Castro, A. Gal, A. Kolodny and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, 2015.
[24] Z. Xu, A. Mohanty, P. Chen, D. Kadetotad, B. Lin, J. Ye, S. Vrudhula, S. Yu, J. Seo and Y. Cao, "Parallel Programming of Resistive Cross-point Array for Synaptic Plasticity," Procedia Computer Science, vol. 41, pp. 126-133, 2014.
[25] G. Burr, P. Narayanan, R. Shelby, S. Sidler, I. Boybat, C. di Nolfo and Y. Leblebici, "Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power)," IEDM (International Electron Devices Meeting), 2015.
[26] J. Seo, B. Lin, M. Kim, P. Chen, D. Kadetotad, Z. Xu, A. Mohanty, S. Vrudhula, S. Yu, J. Ye and Y. Cao, "On-Chip Sparse Learning Acceleration With CMOS and Resistive Synaptic Devices," IEEE Transactions on Nanotechnology, 2015.
[27] T. Gokmen and Y. Vlasov, "Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices," Frontiers in Neuroscience, 2016.
[28] S. Agarwal, T. Quach, O. Parekh, A. Hsia, E. DeBenedictis, C. James, M. Marinella and J. Aimone, "Energy Scaling Advantages of Resistive Memory Crossbar Computation and Its Application to Sparse Coding," Frontiers in Neuroscience, 2016.
[29] E. Fuller, F. El Gabaly, F. Leonard, S. Agarwal, S. Plimpton, R. Jacobs-Gedrim, C. James, M. Marinella and A. Talin, "Li-Ion Synaptic Transistor for Low Power Analog Computing," Advanced Science News, vol. 29, 2017.
[30] S. Agarwal, S. Plimpton, D. Hughart, A. Hsia, I. Richter, J. Cox, C. James and M. Marinella, "Resistive Memory Device Requirements for a Neural Network Accelerator," IJCNN, 2016.
[31] D. Rumelhart, G. Hinton and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[32] K. Steinbuch, "Die Lernmatrix," Kybernetik, 1961.
[33] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[34] B. Gaines, "Stochastic computing," in Proceedings of the AFIPS Spring Joint Computer Conference, 1967.
[35] W. Poppelbaum, C. Afuso and J. Esch, "Stochastic computing elements and systems," in Proceedings of the AFIPS Fall Joint Computer Conference, 1967.
[36] C. Merkel and D. Kudithipudi, "A stochastic learning algorithm for neuromemristive systems," in 27th IEEE International System-on-Chip Conference (SOCC), 2014.
[37] P. Chen, D. Kadetotad, Z. Xu, A. Mohanty, B. Lin, J. Ye, S. Vrudhula, J. Seo, Y. Cao and S. Yu, "Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip," in DATE, 2015.
