Cheetah: Mixed Low-Precision Hardware & Software Co-Design Framework for DNNs on the Edge
Authors: Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, Dhireesha Kudithipudi
(Preprint)

Abstract—Low-precision DNNs have been extensively explored in order to reduce the size of DNN models for edge devices. Recently, the posit numerical format has shown promise for DNN data representation and compute with ultra-low precision in [5..8] bits. However, previous studies were limited to studying posits for DNN inference only. In this paper, we propose the Cheetah framework, which supports both DNN training and inference using posits, as well as other commonly used formats. Additionally, the framework is amenable to different quantization approaches and supports mixed-precision floating point and fixed-point numerical formats. Cheetah is evaluated on three datasets: MNIST, Fashion-MNIST, and CIFAR-10. Results indicate that 16-bit posits outperform 16-bit floating point in DNN training. Furthermore, performing inference with [5..8]-bit posits improves the trade-off between performance and energy-delay product over both [5..8]-bit float and fixed-point.

Index Terms—Deep neural networks, low-precision arithmetic, posit numerical format

I. INTRODUCTION

Edge computing is an emerging design paradigm that offers intelligence at the edge of mobile networks while addressing some of the shortcomings of cloud datacenters [1]. Edge nodes host the computing, storage, and communication capabilities that provide on-demand learning for several applications, such as intelligent transportation, smart cities, and industrial robotics. Inherent characteristics of edge devices include low latency, reduced data-movement cost, low communication bandwidth, and decentralized real-time processing [2], [3]. However, deploying intelligence at the edge is a formidable challenge for several of the deep neural network (DNN) models.
For instance, DNN inference with AlexNet requires ∼61 M parameters and ∼1.4 gigaFLOPs [4]. Moreover, the cost of the multiply-and-accumulate (MAC) unit, a fundamental DNN operation, is non-trivial. In a 45 nm CMOS process, energy consumption doubles from 16-bit to 32-bit floats for addition and increases by ∼4x for multiplication [5]. Memory access cost increases by ∼10x from an 8 kB to a 1 MB memory with a 64-bit cache access [5]. In general, there is a gap between the memory storage, bandwidth, compute requirements, and energy consumption of today's DNN models and the hardware resources available on edge devices [6], [7]. An apparent solution to this gap is to compress the networks and reduce their computation requirements to match putative edge resources. Several groups have proposed compressed DNN models with new compute- and memory-efficient neural networks [8]–[10] and parameter-efficient techniques, such as DNN pruning [11], distillation [12], and low-precision arithmetic [13], [14].

Among these approaches to compressing DNN models, low-precision arithmetic is noted for its ability to reduce the memory capacity, bandwidth, latency, and energy consumption associated with MAC units in DNNs, and to increase the level of data parallelism [13], [15], [16]. For instance, DNN inference with compressed models, such as MobileNet with 8-bit fixed-point parameters, utilizes only ∼4.2 M parameters and ∼1.1 megaFLOPs [8]. While this alleviates some of the design constraints for the edge, DNN models must still run quickly with high accuracy for complex visual or video recognition tasks on-device. Therefore, a conflicting design constraint here is that the network's precision cannot compromise a DNN's overall performance.

(Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, and Dhireesha Kudithipudi are with the Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY, USA.)
For instance, there is a ∼10% gap between the performance of low-precision DNN models (e.g., MobileNet with 8-bit fixed-point DNN parameters) and high-precision DNN models (e.g., MobileNet with 32-bit floating point DNN parameters) for real-time (30 FPS) classification on ImageNet data with a Snapdragon 835 LITTLE core [13]. The ultimate goal of designing a low-precision DNN is to reduce the hardware complexity of the high-precision DNN model such that it can be ported onto edge devices with performance similar to the high-precision DNN. The hardware complexity and performance of low-precision DNNs rely heavily on the quantization approach and the numerical format. Prevailing techniques, such as complex vector quantization or hardware-friendly numerical formats, lead to undesirable hardware complexity or performance penalties [17], [18].

To understand the correlation between hardware complexity and performance of low-precision neural networks for the edge, a hardware and software co-design framework is required. Previous studies have addressed this by proposing low-precision frameworks [13]–[16], [19]–[22]. However, the scope of these studies is limited, as highlighted below:

1) None of the previous works explore the propriety of the posit numerical format for both DNN training and inference through comprehensive comparison with the fixed-point and float formats [19]–[22].

2) There is a lack of comparison between the efficacy of quantization approaches, numerical formats, and the associated hardware complexity.

3) In most of the previous works, the comparisons across numerical formats are conducted for varying bit-widths (e.g., 32-bit floating point compared to 8-bit fixed-point [15]). Such comparisons do not offer insight into the viability of utilizing the same bit-precision across numerical formats for a particular task.
To address the gaps in previous studies, we are motivated to propose Cheetah as a comprehensive hardware and software co-design framework to explore the advantage of low precision for both DNN training and inference. The current version of Cheetah supports three numerical formats (fixed-point, floating point, and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward neural networks and convolutional neural networks).

II. BACKGROUND

A. Deep Neural Network

Deep neural networks (DNNs) [23] are artificial neural networks that are used for various tasks, such as classification, regression, and prediction, by learning the correlation between examples from a corpus of data called the training set [24]. These networks are capable of learning a non-linear input-to-output mapping in a supervised, unsupervised, or semi-supervised manner. DNN models contain a sequence of layers, each comprising a set of nodes. The connectivity between layers depends on the DNN architecture (e.g., globally connected in a feedforward neural network or locally connected in a convolutional neural network). A major computation in a DNN node is the MAC operation. Specifically, a node in a feedforward or convolutional neural network computes (1), where B indicates the bias vector, W is the weight tensor whose numerical values are associated with each connection, A represents the activation vector of input values to each node, Y is the feature vector at the output of each node, and N equals either the number of nodes for a feedforward neural network or the product of the (C, R, S) filter parameters (the number of filter channels, the filter height, and the filter width, respectively) for a convolutional neural network.
$Y_j = B_j + \sum_{i=0}^{N} A_i \times W_{ij}$  (1)

In a supervised learning scenario for all of these networks, the correctness of classifications is given by the distance between Y and the desired output as calculated by E_i, a cost function with respect to the weights. Then, during training, the weights are learned through stochastic gradient descent (SGD) to minimize E_i, as given by (2).

$\Delta W_{ij} = -\alpha \frac{\partial E_i}{\partial W_{ij}}$  (2)

B. Posit Numerical Format

The posit, a Type III unum, is a new numerical format with a tapered-precision characteristic that was proposed as an alternative to the IEEE-754 floating point format for representing real numbers [25]. Posits revamp the IEEE-754 floating point format and address complaints about Type I and Type II unums [26]. Posits provide better accuracy, dynamic range, and program reproducibility than IEEE floating point. The essential advantage of posits is their capability to represent non-linearly distributed numbers in a specific dynamic range around 1 with maximum accuracy. The value of a posit number is given by (3), where s represents the sign, es and fs represent the maximum number of bits allocated for the exponent and fraction, respectively, e and f indicate the exponent and fraction values, respectively, and k, as computed by (4), represents the regime value.

$x = \begin{cases} 0, & \text{if } (00...0) \\ \text{NaR}, & \text{if } (10...0) \\ (-1)^s \times 2^{2^{es} \times k} \times 2^{e} \times \left(1 + \frac{f}{2^{fs}}\right), & \text{otherwise} \end{cases}$  (3)

The regime bit-field is encoded based on the run-length m of identical bits (r...r) terminated by either a regime-terminating bit $\bar{r}$ or the end of the n-bit value. Note that there is no requirement to distinguish between negative and positive zero, since only a single bit pattern (00...0) represents zero. Furthermore, instead of denoting exceptional values and infinity by various NaN bit patterns, a single bit pattern (10...0), "Not-a-Real" (NaR), represents exceptional values and infinity.
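As a concrete illustration, Eqs. (1) and (2) for a single fully-connected layer can be sketched in NumPy. This is a minimal sketch for exposition; the function names `forward` and `sgd_update` are ours and not part of Cheetah.

```python
import numpy as np

def forward(A, W, B):
    """Eq. (1): Y_j = B_j + sum_i A_i * W_ij for one layer."""
    return B + A @ W

def sgd_update(W, dE_dW, alpha=0.01):
    """Eq. (2): W_ij <- W_ij - alpha * dE_i/dW_ij."""
    return W - alpha * dE_dW

A = np.array([1.0, 2.0])        # activations
W = np.array([[0.5], [0.25]])   # weights, shape (N, 1)
B = np.array([0.1])             # bias
Y = forward(A, W, B)            # 0.1 + 1*0.5 + 2*0.25 = 1.1
```

With `dE_dW` supplied by backpropagation, repeated calls to `sgd_update` implement the weight-update loop described above.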
More details about the posit number format can be found in [25].

$k = \begin{cases} -m, & \text{if } r = 0 \\ m - 1, & \text{if } r = 1 \end{cases}$  (4)

III. RELATED WORK

As early as the 1980s, low-precision arithmetic was studied for shallow neural networks to reduce compute and memory complexity for training and inference without sacrificing performance [27]–[30]. In some scenarios, it also improves the performance of training and inference, since the quantization noise generated from the use of low-precision parameters in shallow neural networks acts as a regularizer [30], [31]. The outcome of these studies indicates that 16- and 8-bit-precision DNN parameters are sufficient for training and inference on shallow networks [28]–[30]. The capability of low-precision arithmetic has been reevaluated in the deep learning era to reduce the memory footprint and energy consumption during training and inference [14]–[16], [19]–[22], [32]–[38].

A. Low-Precision DNN Training

Several of the previous studies have shown that, to perform DNN training, either variants of low-precision block floating point (BFP), where a block of floating point DNN parameters shares an exponent [39], such as Flexpoint [35] (16-bit fraction with a 5-bit shared exponent for DNN parameters), or mixed-precision floating point (16-bit weights, activations, and gradients with 32-bit accumulators in the SGD weight-update process) are sufficient to maintain performance similar to 32-bit high-precision floating point. For instance, Courbariaux et al. trained low-precision DNNs on the MNIST, CIFAR-10, and SVHN datasets with the floating point, fixed-point, and BFP numerical formats [32]. They demonstrate that BFP is the most suitable choice for low-precision training due to the variability between the dynamic range and precision of DNN parameters [32]. Following this work, Köster et al.
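The decode step defined by Eqs. (3) and (4) can be sketched as a from-scratch Python function. This is a reference sketch for illustration only (the name `decode_posit` is ours); hardware posit decoders realize the same steps with shifters and leading-bit detectors.

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit bit pattern (unsigned int) per Eqs. (3)-(4)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0                      # pattern 00...0
    if bits == 1 << (n - 1):
        return float("nan")             # NaR, pattern 10...0
    sign = bits >> (n - 1)
    if sign:                            # negative posits are two's complement
        bits = (-bits) & mask
    rest = bits & ((1 << (n - 1)) - 1)  # bits after the sign
    r = (rest >> (n - 2)) & 1
    m, i = 0, n - 2
    while i >= 0 and ((rest >> i) & 1) == r:
        m, i = m + 1, i - 1             # run-length m of the regime
    k = m - 1 if r == 1 else -m         # Eq. (4)
    i -= 1                              # skip the regime-terminating bit
    rem = max(i + 1, 0)                 # bits left for exponent + fraction
    e_bits = min(es, rem)
    e = ((rest >> (rem - e_bits)) & ((1 << e_bits) - 1)) if e_bits else 0
    e <<= es - e_bits                   # truncated exponent bits read as zeros
    fs = rem - e_bits
    f = rest & ((1 << fs) - 1) if fs > 0 else 0
    frac = 1.0 + (f / (1 << fs) if fs > 0 else 0.0)
    return (-1.0) ** sign * 2.0 ** (2 ** es * k) * 2.0 ** e * frac  # Eq. (3)

# 8-bit posit, es = 1: 0b01000000 -> 1.0, 0b01100000 -> 4.0
```

For the 8-bit, es = 1 configuration, the pattern 0b01111111 decodes to maxpos = 4096 and its two's complement negation of 0b01000000 decodes to -1.0.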
proposed the Flexpoint numerical format and a new algorithm called Autoflex to automatically predict the optimal shared exponents for DNN parameters in each iteration of SGD by statistically analyzing the values of DNN parameters in previous iterations [35]. Aside from managing the shared exponent in the BFP numerical format, Narang et al. used mixed-precision floating point [34]. They used 16-bit floating point to represent weights, activations, and gradients to perform the forward and backward passes. To prevent the accuracy loss caused by underflow in the product of the learning rate and gradients in (2) with 16-bit floating point, the weights are updated in 32-bit floating point. Additionally, to prevent gradients of very small magnitude from becoming zero when represented by 16-bit float, a new loss-scaling approach is proposed [34]. Recently, Wang et al. and Mellempudi et al. reduced the bit-precision required to represent weights, activations, and gradients to 8 bits by exhaustively analyzing DNN training parameters [14], [36]. Additionally, in [36], a new chunk-based addition is presented to solve the truncation issue caused by the addition of large- and small-magnitude numbers, thereby reducing the number of bits demanded for accumulation and weight updates to 16. To obviate the loss scaling required by mixed-precision floating point, Kalamkar et al. [37] proposed the brain floating point (BFLOAT16) half-precision format with a dynamic range similar to that of 32-bit floating point (8-bit exponent) and less precision (7-bit fraction). The matching dynamic range between BFLOAT16 and 32-bit floating point reduces the conversion complexity between the two formats in DNN training. In training a ResNet model on the ImageNet dataset, BFLOAT16 achieves the same performance as 32-bit floating point.
B. Low-Precision DNN Inference

The performance of DNN inference without retraining is more robust to the noise generated from low-precision DNN parameters, as the DNN parameters during inference are static; several groups have demonstrated that either 8-bit BFP or 8-bit fixed-point, coupled with linear quantization, is adequate to represent weights and activations without significantly degrading the performance yielded by 32-bit floating point. Note that the accumulation bit-width is selected to be 32 bits to preserve accuracy when performing the, in general, thousands of additions in the MAC operations. For instance, Gysel et al. demonstrate that 8-bit block floating point for representing weights and activations, 8-bit multipliers, and 32-bit accumulation results in <1% accuracy loss on AlexNet with the ImageNet corpus [16]. Following this work, Hashemi et al. introduced low-precision DNN inference networks to better understand the impact of numerical formats on the energy consumption and performance of DNNs [15], [16]. For instance, performing inference on AlexNet with the 8-bit fixed-point format yields a 6x improvement in energy consumption over 32-bit fixed-point for the CIFAR-10 dataset [15]. Chung et al. proposed the Brainwave accelerator, which uses 8-bit block floating point with a 5-bit exponent to classify the ImageNet dataset on ResNet-50 with <2% accuracy loss [38]. However, the scaling-factor parameter in the block floating point numerical format needs to be updated according to the DNN parameter statistics, increasing the computational complexity of inference. To alleviate this problem, researchers have used posits in DNNs [19]–[22]. Posits represent numbers more accurately around ±1 and less accurately for very small and large numbers, unlike the uniform precision of the floating point numerical format [40].
This characteristic of posits arises from their tapered precision and suits the distribution of DNN parameters well [19], [25]. For instance, Langroudi et al. explored the efficacy of posits for representing DNN weights and showed that it is possible to keep the accuracy loss within <1% on the AlexNet and ImageNet corpora with 7-bit weight representation [19]. They also demonstrate that posits have a 30% smaller memory footprint than fixed-point for multiple DNNs while maintaining a <1% drop in accuracy. However, in that work, the 7-bit posit quantized weights are converted to 32-bit floats for compute, limiting the posit numerical format to memory storage only. To take full advantage of the posit numerical format, Carmichael et al. proposed the Deep Positron DNN accelerator, which employs the posit numerical format to represent weights and activations, combined with an FPGA soft core for ≤8-bit-precision exact-MAC operations [20], [21]. They demonstrate that 8-bit posits outperform 8-bit fixed-point and floating point on low-dimensional datasets, such as Iris [41]. Following these works, most recently, Jeff Johnson proposed a log float format as a combination of the posit numerical format and the exact log-linear multiply-add (ELMA), a logarithmic version of the exact MAC operation. This work shows that it is possible to classify ImageNet with a ResNet DNN architecture with <1% accuracy degradation [22]. This research builds on these earlier studies [19]–[22] and extends low-precision arithmetic to both DNN training and DNN inference with different quantization approaches for both feedforward and convolutional neural networks on various datasets.

IV. PROPOSED FRAMEWORK

The Cheetah framework, shown in Fig. 1, comprises a two-level software component and a single-level hardware component.
The software framework is used to evaluate the performance of various numerical formats and quantization approaches by emulating low-precision DNN training and inference. The hardware framework is a soft core implemented on an FPGA, used for evaluating the hardware characteristics of the MAC (multiply-and-accumulate) operation, a fundamental computation in DNN models, coupled with various quantization techniques.

Figure 1: The Cheetah high-level hardware & software co-design framework for DNNs on the edge. EDP: energy-delay product. (The figure shows the high-level software framework, with DNN models in Keras & TensorFlow; the low-level software framework, with an arithmetic library in C & C++; and the hardware framework, with an EMAC soft core in VHDL, each feeding accuracy and EDP analyses. An example customer request, a 3x EDP reduction over a 32-bit floating point DNN model with similar performance, is answered by Cheetah with the configuration "8-bit posit, linear quantization".)

For each level, two optimization stages are considered to convert the baseline DNN model with 32-bit high-precision floating point soft-core MACs to a low-precision DNN model with either posit, floating point, or fixed-point arithmetic soft-core exact MACs (EMACs). This optimization is performed iteratively, reducing the bit-precision by one at each step; the performance degradation and hardware complexity reduction achieved by a numerical format in both DNN training and inference are computed and compared with the specified design constraints (e.g., 3x EDP reduction with similar performance). This iterative process is repeated for the next numerical format after one of the design constraints is violated. Essentially, Cheetah approximates the optimal bit-width for each numerical format based on the performance and hardware complexity constraints. Note that there is a priority among the optimization approaches; the numerical format parameter has higher precedence in the optimization process.
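The iterative bit-reduction search described above can be sketched as a simple driver loop. This is a hypothetical sketch: `evaluate` stands in for Cheetah's accuracy and EDP measurements, and the function name, thresholds, and toy evaluation below are ours, not part of the framework's API.

```python
def search_bit_widths(formats, evaluate, max_bits=8, min_bits=5,
                      edp_target=3.0, acc_drop_limit=0.01):
    """Per format, reduce precision one bit at a time and keep the lowest
    bit-width that still satisfies the EDP and accuracy constraints."""
    chosen = {}
    for fmt in formats:
        for bits in range(max_bits, min_bits - 1, -1):
            edp_gain, acc_drop = evaluate(fmt, bits)
            if edp_gain >= edp_target and acc_drop <= acc_drop_limit:
                chosen[fmt] = bits   # constraints still satisfied
            else:
                break                # constraint violated; try the next format
    return chosen

# Toy evaluation: pretend EDP gain grows and accuracy drops as bits shrink.
toy = lambda fmt, bits: (32 / bits, 0.0 if bits >= 6 else 0.05)
search_bit_widths(["posit", "float", "fixed"], toy)
```

With the toy evaluator, every format settles at 6 bits: 5-bit precision violates the accuracy constraint and terminates that format's search.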
This design decision is made to limit the search space and the hardware complexity overhead of the quantization approaches. In performing DNN inference, the current version of Cheetah supports three low-precision numerical formats (fixed-point, floating point, and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward and convolutional neural networks). To perform DNN training on feedforward neural networks, Cheetah supports two numerical formats (floating point and posit) with 32-bit and 16-bit precision. For brevity, the architecture explained here is based on single-hidden-layer feedforward neural network training and inference with the posit numerical format for both the rounding and linear quantization approaches, as shown in Fig. 2.

A. Software Design and Exploration

In emulating feedforward and convolutional DNNs, the output of each layer Y is calculated as in (5),

$Y_j = B_j + \frac{1}{\alpha_1 \times \alpha_2} \left( \sum_{i}^{N} [Q(\alpha_1 \times A_i)] \times [Q(\alpha_2 \times W_{ij})] \right)$  (5)

where α1 and α2 are scale factors, B_j is the bias term, A_i is the activation vector, W_ij is the weight matrix, N indicates the number of MAC operations, and Q(·) is the quantization function. First, the feedforward or convolutional neural network is trained with either 32- or 16-bit floating point or posit numbers, as shown in Fig. 5. To perform DNN inference, the 32-bit floating point high-precision learned weights and 32-bit floating point high-precision activations are quantized to n-bit low-precision fixed-point, floating point, or posit numbers (n ≤ 8). In the quantization procedure, the values of α1 and α2 depend on the quantization approach. To perform rounding quantization, α1 and α2 are both set to 1, and the 32-bit high-precision floating point values that lie outside the dynamic range of the target low-precision posit numerical format (e.g., 8-bit posit) are clipped appropriately to either the format's maximum or minimum.
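Equation (5) can be sketched in NumPy as follows. The helper name `quantized_layer` and the identity quantizer are illustrative, not Cheetah's implementation; note that with Q the identity and α1 = α2 = 1, the expression reduces to the exact MAC of Eq. (1).

```python
import numpy as np

def quantized_layer(A, W, B, Q, alpha1=1.0, alpha2=1.0):
    """Eq. (5): scale, quantize, MAC, then descale by 1/(alpha1*alpha2)."""
    acc = Q(alpha1 * A) @ Q(alpha2 * W)   # N quantized MAC operations
    return B + acc / (alpha1 * alpha2)

A = np.array([0.5, -1.0, 2.0])
W = np.ones((3, 2)) * 0.5
B = np.zeros(2)
exact = quantized_layer(A, W, B, Q=lambda x: x)   # identity Q -> Eq. (1)
```

Because the accumulated products are descaled by 1/(α1 · α2), any nonzero choice of scale factors leaves the result unchanged when Q is the identity; quantization error enters only through Q itself.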
During quantization by rounding, a value that falls between two representable numbers is rounded to the nearest one. To perform linear quantization, the activations and weights are quantized to the range [−β, β] by setting $\alpha_1 = \frac{\beta}{\mathrm{Max}(A_i)}$ and $\alpha_2 = \frac{2\beta}{\mathrm{Max}(W_i) - \mathrm{Min}(W_i)}$. In the next step, the MAC operation is employed to calculate Y_j.

Figure 2: The Cheetah software framework for feedforward neural networks with one hidden layer. The framework scales to any DNN architecture. (The figure depicts a fully-connected layer in which 32-bit float weights W and activations A are scaled by α2 and α1, quantized to posit, fixed, or float, processed by an exact MAC, and then dequantized by 1/(α1 × α2) for classification.)

To minimize arithmetic error, the MAC operation in this paper is calculated using the EMAC algorithm [20]. In the EMAC, to preserve precision in computing the products, the posit weights and activations are multiplied in a posit format without truncation or rounding at the end of multiplication. To avoid rounding during accumulation, the products are stored in a wide register, or quire in the posit literature, with a width given by (6). The products are converted to the fixed-point format FX(m_k, n_k), where $m_k = 2^{es+1} \times (n-2) + 2 + \lceil \log_2(N_{op}) \rceil$ is the integer bit-width and $n_k = 2^{es+1} \times (n-2)$ is the fraction bit-width. Finally, the N_op fixed-point products are accumulated, and the result is descaled (in linear quantization, again using α1 and α2) and converted back to posit.
$w_q = \lceil \log_2(N_{op}) \rceil + 2^{es+2} \times (n-2) + 2$  (6)

Algorithm 1: Posit DOT operation for n-bit inputs, each with es exponent bits [20]

 1: procedure POSITDOT(weight, activation)
 2:   sign_w, reg_w, exp_w, frac_w ← DECODE(weight)
 3:   sign_a, reg_a, exp_a, frac_a ← DECODE(activation)
 4:   sf_w ← {reg_w, exp_w}                          ▷ Gather scale factors
 5:   sf_a ← {reg_a, exp_a}
      ▷ Multiplication
 6:   sign_mult ← sign_w ⊕ sign_a
 7:   frac_mult ← frac_w × frac_a
 8:   ovf_mult ← frac_mult[MSB]                      ▷ Adjust for overflow
 9:   normfrac_mult ← frac_mult >> ovf_mult
10:   sf_mult ← sf_w + sf_a + ovf_mult
      ▷ Accumulation
11:   fracs_mult ← sign_mult ? −frac_mult : frac_mult
12:   sf_biased ← sf_mult + bias                     ▷ Bias the scale factor
13:   fracs_fixed ← fracs_mult << sf_biased          ▷ Shift to fixed-point
14:   sum_quire ← fracs_fixed + sum_quire            ▷ Accumulate
      ▷ Fraction & SF extraction
15:   sign_quire ← sum_quire[MSB]
16:   mag_quire ← sign_quire ? −sum_quire : sum_quire
17:   zc ← LEADINGZEROSDETECTOR(mag_quire)
18:   frac_quire ← mag_quire[2 × (n − 2 − es) − 1 + zc : zc]
19:   sf_quire ← zc − bias
      ▷ Convergent rounding & encoding
20:   nzero ← |frac_quire
21:   sign_sf ← sf_quire[MSB]
22:   exp ← sf_quire[es − 1 : 0]                     ▷ Unpack scale factor
23:   reg_tmp ← sf_quire[MSB − 1 : es]
24:   reg ← sign_sf ? −reg_tmp : reg_tmp
25:   ovf_reg ← reg[MSB]                             ▷ Check for overflow
26:   reg_f ← ovf_reg ? {(⌈log2(n)⌉ − 2){1}, 0} : reg
27:   exp_f ← (ovf_reg | ∼nzero | (&reg_f)) ? {es{0}} : exp
28:   tmp1 ← {nzero, 0, exp_f, frac_quire[MSB − 1 : 0], {(n − 1){0}}}
29:   tmp2 ← {0, nzero, exp_f, frac_quire[MSB − 1 : 0], {(n − 1){0}}}
30:   ovf_regf ← &reg_f
31:   if ovf_regf then
32:     shift_neg ← reg_f − 2
33:     shift_pos ← reg_f − 1
34:   else
35:     shift_neg ← reg_f − 1
36:     shift_pos ← reg_f
37:   end if
38:   tmp ← sign_sf ? tmp2 >> shift_neg : tmp1 >> shift_pos
39:   lsb, guard ← tmp[MSB − (n − 2) : MSB − (n − 1)]
40:   round ← ∼(ovf_reg | ovf_regf) ? (guard & (lsb | (|tmp[MSB − n : 0]))) : 0
41:   result_tmp ← tmp[MSB : MSB − n + 1] + round
42:   result ← sign_quire ? −result_tmp : result_tmp
43:   return result
44: end procedure

B. Hardware Framework

The MAC operation, introduced above as the fundamental DNN operation, calculates the weighted sum of a set of inputs. In many implementations this operation is inexact, i.e., arithmetic error grows due to iterative rounding and truncation. The EMAC mitigates this concern by adapting the concept of the Kulisch accumulator [42]: the error due to rounding is deferred until after the accumulation of all products, which further benefits low-precision arithmetic. In the EMAC, as mentioned beforehand, the fixed-point values of the N_op products are accumulated in a wide register sized as given by (6).

Figure 3: A parameterized (n total bits, es exponent bits) FPGA soft-core design of the posit exact multiply-and-accumulate (EMAC) operation [20]. (The figure shows the decode, multiply, shift/scale, quire-accumulate, and round/normalize/clip/encode stages of the datapath.)

The posit EMAC, illustrated by Fig. 3, is parameterized by n, the bit-width, and es, the number of exponent bits. "NaR" is not considered, as posits do not overflow or underflow and all DNN parameters and data are real numbers. Algorithm 1 describes the bitwise operation of the EMAC dot product. Each EMAC is pipelined into three stages: multiplication, accumulation, and rounding. For further details on EMACs and the exact dot product, we suggest reviewing [20], [21], [42].
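The quire width of Eq. (6) is straightforward to compute. The helper name below is ours; the 8-bit, es = 1 configuration matches the posits evaluated in this paper, while the 1024-operation accumulation length is an arbitrary example.

```python
import math

def quire_width(n, es, n_op):
    """Eq. (6): w_q = ceil(log2(N_op)) + 2^(es+2) * (n - 2) + 2."""
    return math.ceil(math.log2(n_op)) + 2 ** (es + 2) * (n - 2) + 2

quire_width(8, 1, 1024)   # 10 + 48 + 2 = 60 bits
```

The additive $\lceil \log_2(N_{op}) \rceil$ term grows the register just enough that $N_{op}$ worst-case products cannot overflow it, which is what lets the EMAC defer all rounding to the very end.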
V. SIMULATION RESULTS & ANALYSIS

The Cheetah software is implemented in the Keras [43] and TensorFlow [44] frameworks. Rounding quantization, linear quantization, and the EMAC operations with [5..32]-bit-precision fixed-point, floating point, and posit numbers for DNN inference, and {16, 32}-bit floating point and posit numbers for DNN training, are extended to these frameworks via software emulation. To reduce the search space of the α1 and α2 parameters, β is selected from {1, 2, 4, 8}, which still provides, on average, wide coverage (∼82%) of the dynamic range of each numerical format, as shown in Table I.

Table I: The dynamic range coverage of ≤8-bit posit, floating point, and fixed-point numerical formats. The percentages are calculated without considering NaR, infinity, and "Not-a-Number" (NaN) values.

    Format (≤8-bit)      Dynamic Range Coverage
    Posit (es=0)         94.12%
    Posit (es=1)         81.57%
    Posit (es=2)         69.02%
    Float (we=4)         66.66%
    Float (we=3)         85.71%
    Fixed-point (nk=4)   100.0%

A. Exploiting Numerical Formats for DNN Inference

To evaluate Cheetah's performance on DNN inference, a feedforward neural network and several convolutional neural networks are trained on three benchmarks with 32-bit floating point. The specifications of these tasks and their inference performance are summarized in Table II. The accuracies of performing DNN inference on these tasks with the [5..8]-bit-precision version of Cheetah are presented in Table III. The results show that posit with [5..8]-bit precision (mostly es = 1) outperforms the fixed-point and floating point formats (mostly we = 4 exponent bits). For instance, the accuracy of DNN inference on Fashion-MNIST is improved by 5.14% and 4.17% with 5-bit posits in comparison to 5-bit floating point and fixed-point, respectively. On the CIFAR-10 dataset, these performance gains are even more noticeable, with 5-bit posits yielding 28.5% and 31.62% improvements over floating point and fixed-point, respectively. The benefits of the posit numerical format are intuitively explained by the nonlinear distribution of its values, similar to that of DNN inference parameters. This hypothesis is explored empirically by calculating the distortion rate of the DNN inference parameters with respect to each numerical format. The distortion rate is described by (7), where P indicates the high-precision parameters and Quant(P) represents the quantized parameters. The results, as shown in Fig. 4, validate the hypothesis, especially at 5-bit precision, where the distortion rate of posit is significantly less than that of the other numerical formats.

$d(R) = d(P, \mathrm{Quant}(P)) = \frac{1}{n} \sum_{i}^{n} \lVert P_i - \mathrm{Quant}(P_i) \rVert_2$  (7)

B. Exploiting Numerical Formats with Quantization Approaches for DNN Inference

As mentioned before, quantization with rounding has less overhead than the other quantization approaches, but it cannot achieve DNN inference performance with 5-bit posits similar to that of 32-bit floating point. To improve the performance of DNN inference, the [5..8]-bit posit numerical format is combined with linear quantization approaches and evaluated for a 4-layer feedforward neural network on the MNIST and Fashion-MNIST datasets. The α1 × A_i and α2 × W_ij in (5) can be implemented either by constant multiplication or by a shift operation where the

Table II: Specifications of the benchmark tasks and performance of a baseline 32-bit floating point network.

    Dataset         Layers 1                   # Parameters   # EMAC Ops 2   Memory    Accuracy
    MNIST           4 FC                       0.34 M         0.78 k         1.34 MB   98.46%
                    2 Conv, 2 FC, 1 PL         1.40 M         58.7 k         5.84 MB   99.32%
    Fashion-MNIST   4 FC                       0.34 M         0.78 k         1.34 MB   89.51%
                    2 Conv, 3 FC, 2 PL, 1 BN   1.88 M         69.8 k         7.77 MB   92.54%
    CIFAR-10        7 Conv, 1 FC, 3 PL         0.95 M         312.6 k        6.23 MB   81.37%

    1 Conv: 2D convolutional layer; FC: fully-connected layer; PL: max/avg.
pooling layer; BN: batch normalization layer.
    2 The number of EMAC operations for a single sample.

Table III: Cheetah accuracy on three datasets with [5..8]-bit precision compared to fixed and float (respective best results are when posit has es ∈ {0, 1, 2} and floating point has exponent bit-width we ∈ {3, 4}). All values in %.

                            Posit                            Float                            Fixed
    Dataset        DNN     8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit
    MNIST          FC      98.45  98.39  98.37  98.30    98.42  98.39  98.33  93.91    98.31  97.95  97.87  97.88
                   Conv    99.35  99.33  99.20  98.94    99.34  99.25  99.12  92.27    99.18  97.14  97.08  96.96
    Fashion-MNIST  FC      89.59  89.44  89.24  88.14    89.56  89.36  88.92  83.00    89.16  87.27  85.20  83.97
                   Conv    92.70  92.60  91.64  88.92    92.63  92.22  89.58  68.21    89.59  88.63  85.31  83.46
    CIFAR-10       Conv    80.40  76.90  68.51  41.33    79.75  76.09  53.68  12.83    24.27  17.43  12.54   9.71

Figure 4: Layer-wise delta distortion rate ∆(d(R)) heatmaps compare the precision (rates) of [5..8]-bit numerical formats for representing 32-bit floating point DNN parameters. The average ∆(d(R)) among all weights in a DNN is shown in the final column of each heatmap. (a) d(R)_posit − d(R)_fixed for the MNIST task; (b) d(R)_posit − d(R)_fixed for the Fashion-MNIST task; (c) d(R)_posit − d(R)_fixed for the CIFAR-10 task; (d) d(R)_posit − d(R)_float for the MNIST task; (e) d(R)_posit − d(R)_float for the Fashion-MNIST task; (f) d(R)_posit − d(R)_float for the CIFAR-10 task.

Table IV: Comparison of different quantization approaches: accuracy on MNIST (top) and Fashion-MNIST (bottom) with {5..8}-bit precision for posit with es ∈ {0, 1, 2}, fixed-point, and floating point with exponent bit-width we ∈ {3, 4}. All values in %.

    MNIST                     Rounding Quantization            Linear-Quant. w/ Multiplication   Linear-Quant. w/ Shift
    Numerical Format         8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit
    Posit (es=0)             98.42  98.37  98.30  91.05    98.46  98.48  98.46  98.19    98.48  98.46  98.39  98.28
    Posit (es=1)             98.45  98.39  98.34  98.30    98.49  98.47  98.42  98.34    98.48  98.42  98.38  98.42
    Posit (es=2)             98.44  98.39  98.37  98.16    98.45  98.49  98.38  97.96    98.46  98.41  98.41  98.13
    Fixed-point              98.31  97.95  97.87  97.88    98.47  98.32  98.11  96.41    98.42  98.29  98.16  97.17
    Floating point           98.42  98.39  98.33  93.91    98.46  98.42  98.36  98.02    98.46  98.45  98.38  98.06
    32-bit Floating point    98.46                         98.46                         98.46

    Fashion-MNIST             Rounding Quantization            Linear-Quant. w/ Multiplication   Linear-Quant. w/ Shift
    Numerical Format         8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit    8-bit  7-bit  6-bit  5-bit
    Posit (es=0)             89.57  89.21  88.46  76.87    89.64  89.58  89.36  88.17    89.59  89.61  88.31  88.10
    Posit (es=1)             89.59  89.44  89.22  88.14    89.58  89.52  89.35  88.98    89.58  89.45  89.48  89.07
    Posit (es=2)             89.56  89.33  89.24  87.07    89.53  89.55  88.98  87.06    89.49  89.52  89.18  87.06
    Fixed-point              89.16  87.27  85.20  83.97    89.52  88.83  87.46  76.58    89.40  88.93  87.10  82.10
    Floating point           89.56  89.36  88.92  83.00    89.59  89.45  89.00  87.25    89.73  89.32  88.86  87.37
32-bit Floating point 89.51% 89.51% 89.51% α 1 and α 2 values are approximated by a power of two. The results, as sho wn in T able IV, exhibit that 5-bit lo w-precision DNN inference achie ves similar performance to 32-bit floating point DNN inference on the MNIST data set. Essentially , by deploying this approach, the quantization error produced by the values that lie outside of posit’ s dynamic range is zeroed out. The linear quantization approach also plays a ke y role in reducing the hardware complexity of posit EMA Cs used for DNN inference. Notably, the accuracy of DNN inference with posits is significantly enhanced by using the linear quantization approach in comparison to quantization with rounding. There- fore, the o verhead of adding linear quantization is of fset by reducing the hardware complexity , i.e. carrying out the posit EMA C operation with es = 0 instead of es = 1 , which is explained in depth in the ne xt section. C. Exploiting P osit and Floating P oint for DNN T raining T o explore the efficac y of the posit numerical format ov er the floating point numerical format, a 4-layer feedforward neural network is trained with each number system on the MNIST and Fashion-MNIST datasets. The results indicate that the posit numerical format has a slightly better accuracy in comparison to the floating point number system, as sho wn in T able V. 16-bit posits outperform 16-bit floats in terms of accuracy . Although Cheetah is ev aluated on small datasets, there are two adv antages compared to [14], [36]. Mellempudi et al. [36] use 32-bit numbers for accumulation to reduce the hardware cost of stochastic rounding. W ang et al. [14] reduce the accumulation bit-precision to 16 by using stochastic rounding. 
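As a concrete illustration of the shift-based linear quantization described in Section B, the following NumPy sketch chooses a power-of-two scale for a tensor so that the multiplication reduces to a bit shift. The helper names (`pow2_scale`, `scale_by_shift`) and the sample weights are illustrative assumptions, not the paper's implementation of α1 and α2 from (5).

```python
import numpy as np

def pow2_scale(x, target_max=1.0):
    """Choose a power-of-two scale 2**shift so that |2**shift * x| <= target_max.
    Restricting the scale to a power of two lets the multiply in the linear
    quantizer be realized as a plain bit shift in hardware."""
    alpha = target_max / np.max(np.abs(x))
    return int(np.floor(np.log2(alpha)))    # 2**shift <= alpha, so the range is safe

def scale_by_shift(x, shift):
    # Multiply by 2**shift; in fixed-point hardware this is wiring, not a multiplier.
    return np.ldexp(x, shift)

w = np.array([0.03, -0.8, 2.5, -3.1])       # hypothetical layer weights
s = pow2_scale(w)                           # -> -2, i.e. scale by 1/4
w_scaled = scale_by_shift(w, s)
assert np.max(np.abs(w_scaled)) <= 1.0      # now inside posit's dense [-1, 1] region
```

Rounding the scale down to a power of two (rather than to the nearest one) guarantees the scaled values stay within the target range, at the cost of a slightly coarser scale.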
However, in this paper, we show the potential of using 16-bit posits for all DNN parameters with a simple and hardware-friendly round-to-nearest algorithm, achieving less than 1% accuracy degradation without exhaustively analyzing the DNN training parameters.

Table V: Average accuracy over 10 independent runs on the test set of the respective dataset. Networks are trained using only the specified numerical format.

Task            Format     Accuracy
MNIST           Posit-32   98.131%
MNIST           Float-32   98.087%
MNIST           Posit-16   96.535%
MNIST           Float-16   90.646%
Fashion-MNIST   Posit-32   89.263%
Fashion-MNIST   Float-32   89.105%
Fashion-MNIST   Posit-16   87.400%
Fashion-MNIST   Float-16   81.725%

D. EMAC Soft-Core FPGA Implementation

To show the effectiveness of the posit numerical format over floating point and fixed-point, we evaluate the trade-off between the energy-delay-product and latency of the EMAC operation vs. the average accuracy degradation from 32-bit floating point per bit-width across the three datasets (two for the linear-quantization experiment) with the Cheetah framework, as shown in Figs. 5, 6, 7, 8, and 9. The energy-delay-product, a combined measure of the latency and resource cost of the EMAC operation, is measured for the EMAC coupled with quantization with rounding [20] and for the EMAC coupled with linear quantization, for all numerical formats, on a Virtex-7 FPGA (xc7vx485t-2ffg1761c) synthesized with Vivado 2017.2. Note that the average accuracy degradation per bit-width is computed using the accuracy results in Table IV.

Figure 5: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the energy-delay-product of the respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding parameter bit-width, as indicated in the legend. A star (⋆) denotes the lowest accuracy degradation for a numerical format and bit-width.

Figure 6: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the latency of the respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding parameter bit-width, as indicated in the legend. A star (⋆) denotes the lowest accuracy degradation for a numerical format and bit-width.

Figure 7: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the cost of the respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding parameter bit-width (fractional bits or exponent), as labeled along the x-axis.

The results, as shown in Fig. 5, indicate that posit coupled with rounding quantization achieves up to a 23% average accuracy improvement over fixed-point. However, this accuracy enhancement is gained at the cost of a 0.41 × 10⁻¹⁰ increase in the energy-delay-product of the EMAC unit. Posit also consistently shows better performance than the floating point number system, especially at 5-bit precision, at a comparable energy-delay-product. The posit EMAC operation achieves lower latencies, as shown in Fig. 6, due to the lack of subnormal detection and other exception cases, but exhibits resource-hungry encoding and decoding due to the variable-length regime of the posit numerical format, as shown in Fig. 7. Overall, the 6-bit posit shows the best trade-off between energy-delay-product and average accuracy degradation from 32-bit floating point on the two benchmarks (when analyzed across the [5..8]-bit range). Looking at the posit numerical format in terms of classification performance and EMAC energy-delay-product, posits with es = 1 provide a better trade-off than posits with es ∈ {0, 2}. At [5..7]-bit precision, the average performance of DNN inference with es = 1 among the three datasets is 2% and 4% better than with es = 2 and es = 0, respectively.
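Because the regime field is a variable-length run of identical bits, a posit decoder must count that run before it can even locate the exponent and fraction fields, which is why posit encoding/decoding is comparatively resource-hungry in hardware. The following Python sketch (our own illustration, not the paper's EMAC implementation) shows the steps such a decoder performs:

```python
def decode_posit(p, nbits, es):
    """Decode an nbits-wide posit with es exponent bits into a Python float."""
    mask = (1 << nbits) - 1
    p &= mask
    if p == 0:
        return 0.0
    if p == 1 << (nbits - 1):
        return float("nan")                       # NaR (not a real)
    sign = -1.0 if (p >> (nbits - 1)) & 1 else 1.0
    if sign < 0:
        p = (-p) & mask                           # two's complement for negatives
    bits = (p << 1) & mask                        # payload bits, left-aligned
    r0 = (bits >> (nbits - 1)) & 1                # first regime bit
    m = 0
    while m < nbits - 1 and ((bits >> (nbits - 1 - m)) & 1) == r0:
        m += 1                                    # regime run length (variable!)
    k = (m - 1) if r0 else -m                     # regime value
    rem = max(nbits - 2 - m, 0)                   # bits left after the terminator
    body = (bits >> 1) & ((1 << rem) - 1)
    ebits = min(es, rem)                          # truncated exponent bits read as 0
    e = (body >> (rem - ebits)) << (es - ebits)
    fbits = rem - ebits
    frac = (body & ((1 << fbits) - 1)) / (1 << fbits)
    return sign * (1 + frac) * 2.0 ** ((k << es) + e)

assert decode_posit(0b01000000, 8, 0) == 1.0      # posit<8,0> for one
assert decode_posit(0b01010000, 8, 0) == 1.5
assert decode_posit(0b01111111, 8, 0) == 64.0     # maxpos of posit<8,0>
```

The while-loop over the regime is what becomes a leading-bit counter / priority encoder in hardware, the main source of the encode/decode overhead noted above.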
These accuracy benefits are coupled with 2.1× less energy-delay-product and 1.4× more energy-delay-product in comparison to es = 2 and es = 0, respectively. These results are measured when rounding quantization is used. Linear quantization with the shift operation requires similar hardware overhead across all of the numerical formats, as shown in Figs. 8 and 9. However, the accuracy of performing DNN inference with linear quantization with posits (es = 0) is similar to the accuracy when es = 1. Therefore, it is possible to use EMACs with es = 0 instead of es = 1 and thereby achieve 18% energy-delay-product savings.

A summary of previous studies that propose low-precision frameworks is shown in Table VI. Several research groups have explored the efficacy of floats and fixed-point on the performance and hardware complexity of DNNs with multiple image classification tasks [14]–[16], [32], [34], [35]. However, none of these works analyze the appropriateness of the posit numerical format for both DNN training and inference. Additionally, current work does not offer insight on the impact of the quantization approach vs. the numerical format on both accuracy and hardware complexity, as investigated in this paper.

VI. CONCLUSIONS

A low-precision DNN framework, Cheetah, for edge devices is proposed in this work. We explored the capacity of various numerical formats, including floating point, fixed-point, and posit, for both DNN training and inference. We show that the recent posit numerical format has high efficacy for DNN training at {16, 32}-bit precision and inference at ≤ 8-bit precision. Moreover, we show that it is possible to achieve better performance and reduce energy consumption by using linear quantization with the posit numerical format.
The success of low-precision posits in reducing DNN hardware complexity with negligible accuracy degradation motivates us to evaluate ultra-low-precision training in future work.

Figure 8: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the energy-delay-product of the respective EMAC with linear quantization. Each <x, y> pair indicates the number of bits and corresponding parameter bit-width, as indicated in the legend. A star (⋆) denotes the lowest accuracy degradation for a numerical format and bit-width.

Figure 9: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the latency of the respective EMAC with linear quantization. Each <x, y> pair indicates the number of bits and corresponding parameter bit-width, as indicated in the legend. A star (⋆) denotes the lowest accuracy degradation for a numerical format and bit-width.

Table VI: High-level summary of Cheetah and other low-precision frameworks. All datasets are image classification tasks. WI BC: Wisconsin Breast Cancer; FMNIST: Fashion MNIST; FP: floating point; FX: fixed-point; PS: posit; SW: software; HW: hardware.

Work                      Dataset                                Format           Bits     Utility                Quantization        Implementation   DNN Library        Device          Node
Courbariaux et al. [45]   MNIST, CIFAR-10, SVHN                  FP, FX, BFP      12       Training               -                   SW               Theano             -               -
Gysel et al. [16]         ImageNet                               FP, FX, BFP      8        Inference              Rounding            SW & HW          Caffe              ASIC            65 nm
Hashemi et al. [15]       MNIST, CIFAR-10, SVHN                  FP, FX, Binary   All      Inference              Rounding            SW & HW          Caffe              ASIC            65 nm
Carmichael et al. [20]    WI BC, Iris, Mushroom, MNIST, FMNIST   FP, FX, PS       [5..8]   Inference              Rounding            SW & HW          Keras/TensorFlow   Virtex-7 FPGA   28 nm
Wang et al. [14]          ImageNet                               FP               All      Training               -                   SW & HW          Home Suite         ASIC            14 nm
Johnson et al. [22]       ImageNet                               FX, FP, PS       8        Inference              Log                 SW & HW          PyTorch            ASIC            28 nm
This work                 MNIST, FMNIST, CIFAR-10                FX, FP, PS       [5..8]   Inference & Training   Rounding & Linear   SW & HW          Keras/TensorFlow   Virtex-7 FPGA   28 nm

REFERENCES

[1] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications. ACM, 2018, pp. 31–36.
[2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[3] M. Satyanarayanan, "The emergence of edge computing," Computer, vol. 50, no. 1, pp. 30–39, 2017.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (NeurIPS), Lake Tahoe, Nevada, USA, Dec. 2012, pp. 1106–1114.
[5] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014, pp. 10–14.
[6] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong et al., "Scaling for edge inference of deep neural networks," Nature Electronics, vol. 1, no. 4, p. 216, 2018.
[7] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury et al., "Machine learning at Facebook: Understanding inference at the edge," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[9] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis et al., "Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution," arXiv preprint arXiv:1904.05049, 2019.
[10] M. Cho and D. Brand, "MEC: Memory-efficient convolution for deep neural network," in Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70. Sydney, NSW, Australia: PMLR, Aug. 2017, pp. 815–824.
[11] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun, "SBNet: Sparse blocks network for fast inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8711–8720.
[12] B. Zhou, Y. Sun, D. Bau, and A. Torralba, "Revisiting the importance of individual units in CNNs via ablation," arXiv preprint, 2018.
[13] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[14] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers," in Advances in Neural Information Processing Systems, 2018, pp. 7686–7695.
[15] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda, "Understanding the impact of precision quantization on the accuracy and energy of neural networks," in Design, Automation & Test in Europe Conference & Exhibition (DATE). Lausanne, Switzerland: IEEE, Mar. 2017, pp. 1474–1479.
[16] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, "Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[17] Y. Guo, "A survey on methods and theories of quantized neural networks," arXiv preprint arXiv:1808.04752, 2018.
[18] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint, 2018.
[19] S. H. F. Langroudi, T. Pandit, and D. Kudithipudi, "Deep learning inference on embedded devices: Fixed-point vs posit," in 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), March 2018, pp. 19–23.
[20] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi, "Deep Positron: A deep neural network using the posit number system," in Design, Automation & Test in Europe Conference & Exhibition (DATE). Florence, Italy: IEEE, Mar. 2019, pp. 1421–1426.
[21] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi, "Performance-efficiency trade-off of low-precision numerical formats in deep neural networks," in Proceedings of the Conference for Next Generation Arithmetic (CoNGA '19). Singapore: ACM, 2019, pp. 3:1–3:9.
[22] J. Johnson, "Rethinking floating point for deep learning," arXiv preprint arXiv:1811.01721, 2018.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[25] J. L. Gustafson and I. T. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, 2017.
[26] W. Tichy, "Unums 2.0: An interview with John L. Gustafson," Ubiquity, vol. 2016, no. September, p. 1, 2016.
[27] H. P. Graf, L. D. Jackel, and W. E. Hubbard, "VLSI implementation of a neural network model," IEEE Computer, vol. 21, no. 3, pp. 41–49, 1988.
[28] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and N. Suzumura, "An artificial neural network accelerator using general purpose 24 bits floating point digital signal processors," in International Joint Conference on Neural Networks (IJCNN), vol. 2, 1989, pp. 171–175.
[29] D. W. Hammerstrom, "A VLSI architecture for high-performance, low-cost, on-chip learning," in IJCNN 1990, International Joint Conference on Neural Networks. San Diego, CA, USA: IEEE, Jun. 1990, pp. 537–544.
[30] K. Asanovic and N. Morgan, "Experimental determination of precision requirements for back-propagation training of artificial neural networks," in Proceedings of the 2nd International Conference on Microelectronics for Neural Networks, 1991, pp. 9–15.
[31] C. M. Bishop, "Training with noise is equivalent to Tikhonov regularization," Neural Computation, vol. 7, no. 1, pp. 108–116, 1995.
[32] M. Courbariaux, Y. Bengio, and J. David, "Low precision arithmetic for deep learning," in Workshop Track Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 2015.
[33] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning (ICML), vol. 37. Lille, France: JMLR.org, Jul. 2015, pp. 1737–1746.
[34] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen et al., "Mixed precision training," in Conference Track Proceedings of the 6th International Conference on Learning Representations (ICLR). Vancouver, BC, Canada: OpenReview.net, 2018.
[35] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal et al., "Flexpoint: An adaptive numerical format for efficient training of deep neural networks," in Advances in Neural Information Processing Systems, 2017, pp. 1742–1752.
[36] N. Mellempudi, S. Srinivasan, D. Das, and B. Kaul, "Mixed precision training with 8-bit floating point," arXiv preprint, 2019.
[37] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee et al., "A study of BFLOAT16 for deep learning training," arXiv preprint arXiv:1905.12322, 2019.
[38] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield et al., "Serving DNNs in real time at datacenter scale with Project Brainwave," IEEE Micro, vol. 38, no. 2, pp. 8–20, 2018.
[39] J. H. Wilkinson, "Rounding errors in algebraic processes," in IFIP Congress, 1959, pp. 44–53.
[40] F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, "Posits: the good, the bad and the ugly," Dec. 2018, working paper or preprint. [Online]. Available: https://hal.inria.fr/hal-01959581
[41] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[42] U. Kulisch, Computer Arithmetic and Validity: Theory, Implementation, and Applications, 1st ed., ser. de Gruyter Studies in Mathematics, vol. 33. Berlin, New York, USA: Walter de Gruyter, 2008.
[43] F. Chollet et al., "Keras," https://github.com/keras-team/keras, 2015.
[44] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: https://www.tensorflow.org/
[45] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.