Design Automation for Binarized Neural Networks: A Quantum Leap Opportunity?
Authors: Manuele Rusci, Lukas Cavigelli, Luca Benini
Manuele Rusci∗, Lukas Cavigelli†, Luca Benini†
∗Energy-Efficient Embedded Systems Laboratory, University of Bologna, Italy – manuele.rusci@unibo.it
†Integrated Systems Laboratory, ETH Zurich, Switzerland – {cavigelli, benini}@iis.ee.ethz.ch

Abstract—Design automation in general, and logic synthesis in particular, can play a key role in enabling the design of application-specific Binarized Neural Networks (BNN). This paper presents the hardware design and synthesis of a purely combinational BNN for ultra-low power near-sensor processing. We leverage the major opportunities raised by BNN models, which consist mostly of logical bit-wise operations, integer counting and comparisons, to push ultra-low power deep learning circuits close to the sensor and to couple them with binarized mixed-signal image sensor data. We analyze area, power and energy metrics of BNNs synthesized as combinational networks. Our synthesis results in GlobalFoundries 22 nm SOI technology show a silicon area of 2.61 mm² for a combinational BNN with a 32×32 binary input sensor receptive field and weight parameters fixed at design time. This is 2.2× smaller than a synthesized network with re-configurable parameters. With respect to other comparable techniques for deep learning near-sensor processing, our approach features a 10× higher energy efficiency.

I. INTRODUCTION

Bringing intelligence close to the sensors is an effective strategy to meet the energy requirements of battery-powered devices for always-ON applications [1]. Power-optimized solutions for near-sensor processing aim at reducing the amount of data dispatched out of the sensors. Local data analysis can compress the data down to even a single bit in the case of a binary classifier, hence massively reducing the output bandwidth and the energy consumption compared to raw sensor data communication [2].
In the context of visual sensing, novel computer vision chips feature embedded processing capabilities to reduce the overall energy consumption [3]. By placing computational modules within the sensor, mid- and low-level visual features can be extracted directly and transferred to a processing unit for further computation, or used to feed a first-stage classifier. Moreover, by integrating analog processing circuits on the focal plane, the amount of data crossing the costly analog-to-digital border is reduced [4]. Compared with a camera-based system featuring traditional imaging technology, this approach has a lower energy consumption because of (a) a reduced sensor-to-processor bandwidth and (b) a lower demand for digital computation [5]. Relevant examples of mixed-signal smart capabilities include the extraction of spatial and temporal features, such as edges or frame-difference maps, or a combination of them [6]. Because of the highly optimized architectures employed, the power consumption of smart visual chips is more than one order of magnitude lower than that of off-the-shelf traditional image sensors [7].

However, to bring together smart ultra-low power sensing and deep learning, which is nowadays the leading technique for data analytics, a further step is required. At present, the high computational and memory requirements of deep learning inference models have prevented a full integration of these approaches close to the sensor at an ultra-low power cost [4], [8]. A big opportunity for pushing deep learning into low-power sensing comes from the recently proposed Binarized Neural Networks (BNNs) [9], [10]. When looking at the inference task, a BNN consists of logical XNOR operations, binary popcounts and integer thresholding.

This project was supported in part by the EU's H2020 programme under grant no. 732631 (OPRECOMP) and by the Swiss National Science Foundation under grant 162524 (MicroLearn).
Therefore, major opportunities arise for hardware implementations of these models as part of the smart sensing pipeline [11].

In this paper, we explore the feasibility of deploying BNNs as a front-end for an ultra-low power smart vision chip. The combination of mixed-signal processing and a hardware BNN implementation represents an extremely energy-efficient and powerful solution for always-ON sensing, serving as an early detector of interesting events. Therefore, we design and synthesize a purely combinational hard-wired BNN, which is fed with the binary data produced by a mixed-signal ultra-low power imager [7]. The main contributions of this paper are:
• The hardware design and logic synthesis of a combinational BNN architecture for always-ON near-sensor processing.
• The area and energy evaluation of the proposed approach, for varying network models and configurations.

We evaluate two BNN models with 16×16 and 32×32 binary input size, either with fixed or variable parameters. In the case of a combinational BNN with 32×32 input data and hardwired parameters, our synthesis results in GlobalFoundries 22 nm SOI technology show an area occupancy of 2.61 mm², which is 2.2× smaller than the model with variable parameters, and a 10× higher energy efficiency with respect to comparable techniques for deep learning-based near-sensor processing. Moreover, our study paves the way for exploring a new generation of logic synthesis tools, aimed at aggressively optimizing deep binarized networks and enabling focal-plane processing of images with higher resolution.

II. RELATED WORK

Several proposed smart imaging chips for always-ON applications embed mixed-signal processing circuits for extracting basic spatial and temporal features directly on the sensor die [6], [12], [13]. Recent approaches have tried to push deep learning circuits to the analog sensor side to exploit the benefits of focal-plane processing [3].
The work presented in [14] makes use of angle-sensitive pixels, integrating diffraction gratings on the focal plane. Based on the different orientations of the pixel-level filters, multiple feature maps are locally computed as the first layer of a convolutional network. A sensing front-end supporting analog multiplication is proposed in [15]. They introduce a MAC unit composed of only passive switches and capacitors to realize a switched-capacitor matrix multiplier, which achieves an energy efficiency of 8.7 TOp/s/W when running convolution operations. RedEye [4] embeds column-wise processing pipelines in the analog domain to perform 3D convolutions before the digital conversion. The chip is implemented in 0.18 µm technology and needs 1.4 mJ to process the initial 5 layers of GoogLeNet, leading to an energy efficiency of less than 2 TOp/s/W. With respect to these focal-plane analog approaches, we leverage the potential of BNNs to deploy an optimized, purely combinational digital network that notably increases the energy efficiency of near-sensor processing circuits.

Many neural network accelerators have been reported in the literature, most of them with an energy efficiency in the range of a few TOp/s/W [16]–[18]. Several recent approaches have focused on quantizing the weights down to binarization in order to gain a significant advantage in memory usage and energy efficiency [8], [18], pushing it up to around 60 TOp/s/W, while advances in training methods have achieved accuracy losses of less than 1% for this setup. A newer approach is to quantize also the activations down to binary: initial accuracy losses of up to 30% on the ILSVRC dataset have improved to around 11% over the last two years, and even less for smaller networks on datasets such as CIFAR-10 and SVHN [9], [10], [18]. During this time, some VLSI implementations have been published, most of them targeting FPGAs, such as the FINN framework [11], [19].
Only a few ASIC implementations exist [19]–[21], of which XNOR-POP uses in-memory processing and reports the highest energy efficiency of 21.1 TOp/s/W, which is still less than the best binary-weight-only implementation.

III. COMBINATIONAL HARDWARE BNN DESIGN

BNNs feature single-bit precision for both the weights and the activation layers when performing inference. This makes the approach promising for resource-constrained devices, also considering the intrinsic 32× memory footprint reduction with respect to baseline full-precision models. When applying the binarization scheme to a Convolutional Neural Network (CNN), a BNN features a stacked architecture of binary convolutional layers. Every layer transforms the IF binary input feature maps into the OF binary output feature maps through the well-known convolution operation.

Fig. 1. Binary convolution flow for every convolutional layer. For any of the OF output feature maps, the binary value at position (x, y) is produced by overlapping the m-th weight filter to the array of the receptive field of the input feature map centered at the spatial position (x, y).

Because of the binary domain {0, 1} of both the input data and the weight filters, the convolution kernel can be rewritten as

    φ(m, x, y) = popcount( weights(m) XNOR recField(x, y) ),    (1)

where φ(m, x, y) is the result of the convolution, weights(m) is the array of binary filter weights and recField(x, y) is the receptive field of the output neuron located at position (x, y) of the m-th output feature map. The popcount(·) function returns the number of asserted bits of its argument. Note that the convolution output φ(m, x, y) is an integer value. As presented by [9], the popcount result is binarized after a batch normalization layer.
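The kernel in (1) maps directly onto machine words. As a bit-level sketch in plain Python (the receptive field and weights are packed into integers; the names follow (1), but the function itself is our illustration, not the paper's HDL):

```python
def bin_conv(weights, rec_field, n_rf):
    """Binary convolution of (1): popcount of the bitwise XNOR
    of the packed weight filter and the packed receptive field.
    n_rf is the receptive-field size IF*kw*kh in bits."""
    mask = (1 << n_rf) - 1                 # keep only the n_rf valid bits
    xnor = ~(weights ^ rec_field) & mask   # XNOR = NOT(XOR), masked
    return bin(xnor).count("1")            # popcount: number of asserted bits

# A 3x3 single-channel receptive field (n_rf = 9):
phi = bin_conv(0b101101011, 0b101101011, 9)   # identical operands
print(phi)  # 9: all bits match
```

The result φ is the integer fed to the thresholding stage described next.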
However, the normalization operation can be reduced to a comparison with an integer threshold:

    outMap(m, x, y) = { φ(m, x, y) ≥ thresh(m)   if γ > 0
                      { φ(m, x, y) ≤ thresh(m)   if γ < 0
                      { 1                        if γ = 0 and β ≥ 0
                      { 0                        if γ = 0 and β < 0 ,    (2)

where thresh(m) is the integer threshold that depends on the convolution bias b and on the parameters learned by the batch normalization layer, µ, γ, σ and β. After training the network, the thresh(m) parameters are computed offline as ⌊µ − b − β·σ/γ⌋ if γ > 0, or ⌈µ − b − β·σ/γ⌉ if γ < 0.

Fig. 1 graphically schematizes the binary convolution kernel. The BinConv module applies (1) and (2) over the receptive field values of the output neuron outMap(m, x, y). To build a convolutional layer, the BinConv module is replicated for every output neuron. The hardware architecture of a BinConv element is shown in Fig. 2. The input signals recField(x,y), weights(m) and thresh(m) and the output signal outMap(m,x,y) of the block refer to (1) and (2). Additionally, the sign(m) signal drives the selection of the correct output neuron's value depending on the batch normalization parameters (eq. (2)). The network parameters weights, thresh and sign, highlighted in red, can be stored in a memory block to allow online reconfiguration, or can be fixed at design time.

Fig. 2. Hardware architecture of the combinational building block for computing binary convolutions. Every binConv(m,x,y) module instantiated within a convolutional layer produces the binary value of the output neuron at location (x, y) of the m-th output feature map.

In total, the memory footprint required to store the parameters of a convolutional layer is OF · (IF·kw·kh + ⌊log2(IF·kw·kh)⌋ + 3) bits. Despite the reduced reconfigurability, relevant benefits in terms of silicon occupation arise when hard-wiring the binary weights.
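The offline threshold computation, the comparison of (2), and the per-layer parameter footprint can be sketched as follows (plain Python; the floor/ceil choice follows the sign of γ as derived above, and the function names are ours):

```python
import math

def compute_thresh(mu, b, beta, sigma, gamma):
    """Offline integer threshold from the learned batch-norm
    parameters (mu, gamma, sigma, beta) and convolution bias b."""
    t = mu - b - beta * sigma / gamma
    return math.floor(t) if gamma > 0 else math.ceil(t)

def binarize(phi, thresh, gamma, beta):
    """Output binarization of (2) for one output neuron."""
    if gamma > 0:
        return int(phi >= thresh)
    if gamma < 0:
        return int(phi <= thresh)
    return int(beta >= 0)  # gamma == 0: constant output

def layer_param_bits(IF, OF, kw=3, kh=3):
    """Memory footprint of one layer's parameters:
    OF * (IF*kw*kh + floor(log2(IF*kw*kh)) + 3) bits."""
    n_rf = IF * kw * kh
    return OF * (n_rf + math.floor(math.log2(n_rf)) + 3)
```

Summing layer_param_bits over the five layers of the 16×16 model of Tbl. I (calling it with kw = kh = 1 and IF equal to the input count for the fully-connected layers) yields roughly 33 kbit, consistent with the parameter count reported in Sec. IV-B.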
In this case, the synthesis tool plays a major role in enabling the implementability of the model: it has to exploit the optimizations based on a high-level abstract HDL description of the network.

To explore the feasibility of deep combinational BNNs, we focus on VGG-like network topologies as in [9]. These networks include convolutional layers with a small filter size (typically kw = kh = 3) and a feature dimension that increases going deeper into the network. The spatial dimension tends to decrease by means of strided pooling operations placed after the binary convolution of (1). Following the intuition of [11], a MaxPooling layer can be moved behind the binarization by replacing the MAX with an OR operation among the binary values passing through the pooling filter.

The VGG-like topology features multiple fully-connected layers. Their hardware implementation is similar to the binConv module of Fig. 2, where the convolutional receptive field contains all the input neurons of the layer. The last fully-connected layer generates a confidence score for every class. Differently from the original BNN scheme, our network architecture is fed with a binary single-layer signal coming from a mixed-signal imager [7]. However, the presented approach also holds for multi-channel imagers.

A. Estimating Area

Before looking at synthesis results, we estimate the area of a binary convolutional layer. For each output value (output pixel and feature map, Nout = H · W · OF), we have a receptive field of size NRF = IF · kw · kh and thus need a total of Nout · NRF XNOR gates.
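Returning briefly to the pooling transformation above: once the activations are binary, a 2×2 MAX pooling window reduces to a 4-input OR. A minimal functional sketch of this equivalence (plain Python, our own naming):

```python
def or_pool_2x2(binmap):
    """2x2 max pooling on a binary feature map: since max over
    {0,1} equals logical OR, each output bit ORs a 2x2 window."""
    H, W = len(binmap), len(binmap[0])
    return [[binmap[2*i][2*j] | binmap[2*i][2*j + 1] |
             binmap[2*i + 1][2*j] | binmap[2*i + 1][2*j + 1]
             for j in range(W // 2)]
            for i in range(H // 2)]

fmap = [[0, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0]]
print(or_pool_2x2(fmap))  # [[0, 1], [1, 0]]
```

In hardware this is simply one OR gate per pooled output, which is why moving the pooling behind the binarization is so cheap.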
The XNOR gates are followed by popcount units: adder trees summing over all NRF values in the receptive field. The resulting full-precision adder trees require Σ_{i=1}^{log2(NRF)} NRF·2^(−i) = NRF − 1 half-adders and Σ_{i=1}^{log2(NRF)} (i − 1)·NRF·2^(−i) = NRF − log2(NRF) − 1 full-adders each, and are replicated for every output value. The subsequent threshold/compare unit is insignificant for the total area.

To provide an example, we look at the first layer of the network for 16×16 pixel images, with 1 input and 16 output feature maps and a 3×3 filter (NRF = 9, Nout = 4096). Evaluating this for the GF22 technology with A_XNOR = 0.73 µm², A_HA = 1.06 µm² and A_FA = 1.60 µm², we obtain an area of A_XNOR,tot = 0.027 mm², A_HA,tot = 0.033 mm² and A_FA,tot = 0.029 mm², a total of 0.089 mm². Note that this implies that the area scales faster than linearly with the size of the receptive field NRF, since the word width in the adder tree increases rapidly. This is not accounted for in the widely used GOp/img complexity measure for NNs, as it only becomes an issue in this very low word-width regime.

IV. EXPERIMENTAL RESULTS

A. BNN Training

The experimental analysis focuses on the two VGG-like network topologies described in Tbl. I, to also investigate the impact of different input and network sizes.

TABLE I
VGG-LIKE BNN MODELS¹

layer   Model with a 16×16 input map    Model with a 32×32 input map
1       bConvLyr3x3( 1,16)+MaxP2x2      bConvLyr3x3( 1,16)+MaxP2x2
2       bConvLyr3x3(16,32)+MaxP2x2      bConvLyr3x3(16,32)+MaxP2x2
3       bConvLyr3x3(32,48)+MaxP2x2      bConvLyr3x3(32,48)+MaxP2x2
4       bFcLyr(192,64)                  bConvLyr3x3(48,64)+MaxP2x2
5       bFcLyr( 64, 4)                  bFcLyr(256,64)
6       —                               bFcLyr( 64, 4)

As a case study, we trained the networks with labelled patches from the MIO-TCD dataset [22], belonging to one of the following classes: cars, pedestrians, cyclists and background.
The images from the dataset are resized to fit the input dimension before applying a non-linear binarization, which simulates the mixed-signal preprocessing of the sensor [7]. Training the BNNs with ADAM over a training set of about 10k samples/class (original images are augmented by random rotation), the classification accuracy on the test set reaches 64.7% for the model with 32×32 input data, while 50% is measured for the 16×16 model because of the smaller input size and network. Since this work focuses on hardware synthesis issues of BNN inference engines, we do not explore advanced training approaches for NNs with non-traditional input data, which have been discussed in the literature [23].

¹ bConvLyr3x3(x, y) indicates a binary convolutional layer with a 3×3 filter, x input and y output feature maps; MaxP2x2 is a max pooling layer of size 2×2; bFcLyr(x, y) is a binary fully-connected layer with x binary inputs and y binary output neurons.

TABLE II
SYNTHESIS AND POWER RESULTS FOR DIFFERENT CONFIGURATIONS

                 ——— area ———    — time/img —    E/img   leak.   E-eff.
netw. type     [mm²]   [MGE]†    [ns]    [FO4]‡  [nJ]    [µW]    [TOp/J]
16×16 var.      1.17    5.87     12.82    560     2.40     945   470.8
16×16 fixed     0.46    2.32     12.40    541     1.68     331   672.6
32×32 var.      5.80   29.14     17.27    754    11.14    4810   479.4
32×32 fixed     2.61   13.13     21.02    918    11.67    1830   457.6

† Two-input NAND-gate size equivalent: 1 GE = 0.199 µm²
‡ Fanout-4 delay: 1 FO4 = 22.89 ps

TABLE III
AREA BREAKDOWN FOR THE 16×16 NETWORK

          compute        area estim.   var. weights    fixed weights
layer   [kOp/img]        [mm²]         area [mm²]      area [mm²]
1        74 ( 6.5%)      0.093         0.077 ( 6.6%)   0.008 ( 1.7%)
2       590 (52.2%)      0.971         0.647 (55.4%)   0.204 (44.3%)
3       442 (39.1%)      0.738         0.417 (35.8%)   0.241 (52.3%)
4        25 ( 2.2%)      0.041         0.026 ( 2.2%)   0.008 ( 1.7%)
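The closed-form area model of Sec. III-A (the "area estim." column of Tbl. III) can be reproduced with a short script; the cell areas are the GF22 figures quoted earlier, while the helper name is ours:

```python
import math

# GF22 cell areas from Sec. III-A, in µm²
A_XNOR, A_HA, A_FA = 0.73, 1.06, 1.60

def conv_layer_area_mm2(IF, OF, H, W, kw=3, kh=3):
    """Estimated area of one binary convolutional layer:
    Nout*NRF XNOR gates plus one popcount adder tree per output,
    with NRF-1 half-adders and NRF-log2(NRF)-1 full-adders each."""
    n_rf = IF * kw * kh
    n_out = H * W * OF
    n_ha = n_rf - 1
    n_fa = n_rf - math.log2(n_rf) - 1
    return n_out * (n_rf * A_XNOR + n_ha * A_HA + n_fa * A_FA) / 1e6

# First layer of the 16x16 network: 1 input map, 16 output maps
print(round(conv_layer_area_mm2(1, 16, 16, 16), 3))  # ≈0.09 mm²
```

The result, about 0.09 mm², lies between the layer-1 estimate in Tbl. III (0.093 mm²) and the hand calculation of Sec. III-A (0.089 mm²); the small differences plausibly stem from rounding the adder counts for the non-power-of-two NRF = 9.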
B. Synthesis Results

We analyze both aforementioned networks for two configurations: with weights fixed at synthesis time, and with variable weights (excluding storage, modeled as inputs). The fixed weights are taken from the aforementioned trained models.

We provide an overview of synthesis results for different configurations in Tbl. II. We synthesized both networks listed in Tbl. I in GlobalFoundries 22 nm SOI technology with LVT cells in the typical case corner at 0.65 V and 25 °C. The configuration with variable weights scales with the computational effort associated with the network (1.13 MOp/img and 5.34 MOp/img for the 16×16 and 32×32 networks) at 0.97 and 0.92 MOp/cycle/mm², respectively. The variable parameters/weights configuration does not include the storage of the parameters themselves, which would add 1.60 µm² (8.0 GE) per flip-flop; these could be loaded through a scan-chain without additional logic cells (from some flash memory elsewhere on the device). Alternatively, non-volatile memory cells could be used to store them. The number of parameters is 33 and 65 kbit, and thus 0.05 mm² (264 kGE) and 0.10 mm² (520 kGE) for the 16×16 and 32×32 network, respectively.

Looking at the more detailed area breakdown in Tbl. III, we can see that there is a massive reduction when fixing the weights before synthesis. Clearly, this eliminates all the XNOR operations, which become either an inverter or a wire, and even the inverter can be shared among all units having this particular input value in their receptive field. However, based on the estimates described in Sec. III-A, this cannot explain all the savings. Additional cells can be saved through the reuse of identical partial results, which not only can occur randomly but must occur frequently. For example, consider 16 parallel popcount units summing over 8 values each. We can split the values into 4 groups with 2 values each.
Two binary values can generate 2² = 4 output combinations. Since we have 16 units, each needing one of these combinations, each partial result will on average be reused 4 times. This is only possible with fixed weights; otherwise, the values to reuse would have to be multiplexed, thereby losing all the savings. Generally, we can observe that these already small networks for low-resolution images require a sizable amount of area, such that more advanced ad-hoc synthesis tools exploiting the sharing of weights and intermediate results are needed.

TABLE IV
ENERGY AND LEAKAGE BREAKDOWN FOR THE 16×16 NETWORK

        ——— var. weights ———       ——— fixed weights ———
layer   energy/img [pJ]  leakage   energy/img [pJ]  leakage
1         38 ( 1.6%)      68 µW       9 ( 0.5%)       8 µW
2        806 (33.7%)     547 µW     478 (28.5%)     152 µW
3       1440 (60.2%)     310 µW    1037 (61.9%)     163 µW
4        107 ( 4.5%)      20 µW     151 ( 9.0%)       7 µW

C. Energy Efficiency Evaluations

We have performed post-synthesis power simulations using 100 randomly selected real images from the dataset as stimuli. The results are also reported in Tbl. II, while a detailed per-layer breakdown is shown in Tbl. IV. We see that the model with 32×32 input has a lower energy efficiency and a higher latency when fixing the weights, while the opposite is observed for the smaller model. We attribute this to the fact that synthesis is set to optimize for area, while both the critical path length and the target power are unconstrained.

These energy efficiency numbers are on the order of 10× higher than those of the next competitor, YodaNN [8]. However, the two are fundamentally different in the sense that YodaNN (a) runs the more complex binary-weight networks, (b) requires additional off-chip memory for the weights and intermediate results, (c) can run large networks with a fixed-size accelerator, and (d) is in an older technology but applies aggressive voltage scaling.
Given these major differences, a more in-depth comparison would require a redesign of YodaNN in 22 nm and a re-tuning to the single-channel input architecture we are using for comparison. Nevertheless, it is clear that these combinational BNNs are by far more efficient.

When heavily duty-cycling a device, leakage can become a problem. In this case, we see 945 µW and 331 µW of leakage power, which might be significant enough in case of low utilization to require mitigation through power-gating or using HVT cells. Generally, voltage scaling can also be applied, reducing not only leakage but also active power dissipation. The throughput we observe, in the range of 50 Mframe/s, is far in excess of what is meaningful for most applications. Thus, aggressive voltage scaling, power gating and the reverse body biasing available in this FD-SOI technology should be optimally combined to reach the minimum energy point, where leakage and dynamic power are equal while the supply is ON.

We expect these values to be highly dependent on the input data, since energy is consumed only when values toggle. While a single pixel toggling at the input might affect many values later in the network, it has been shown that rather the opposite effect can be seen: changes at the input tend to vanish deeper into the network [24]. A purely combinational implementation fully leverages this, and BNNs naturally have a threshold that keeps small changes from propagating; they might thus perform even better for many real-world applications.

Fig. 3. Silicon area estimation (in red) and measurements with variable (green) and fixed (blue) weights of three BNNs featuring a model complexity which scales depending on the imager resolution. The area occupation of the 64×64 model is not reported because the synthesis tool is not able to handle such a complex and large design.
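The duty-cycling concern above is easy to quantify: compare the active energy per frame with the leakage energy accumulated between frames. Using the 16×16 fixed-weight figures from Tbl. II (1.68 nJ/img, 331 µW leakage) and an assumed 30 fps always-ON frame rate (our illustrative choice, not a figure from this work):

```python
E_ACTIVE = 1.68e-9   # J per image, 16x16 fixed-weight network (Tbl. II)
P_LEAK = 331e-6      # W of leakage for the same configuration (Tbl. II)
FPS = 30             # assumed always-ON frame rate (illustrative)

e_leak_per_frame = P_LEAK / FPS      # leakage energy per frame period
ratio = e_leak_per_frame / E_ACTIVE  # leakage vs. active energy

print(f"leakage energy/frame: {e_leak_per_frame * 1e6:.1f} uJ")  # ~11.0 uJ
print(f"leakage/active ratio: {ratio:.0f}x")
```

At such low utilization, leakage would exceed the active energy by more than three orders of magnitude, which is exactly why power gating, HVT cells or voltage scaling are called for above.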
D. Scaling to Larger Networks

Our results show an area requirement in the range of 2.05 to 2.46 GE/Op and an average of 1.9 fJ/Op. Scaling this up to 0.5 cm² (250 MGE) of silicon and an energy consumption of only 210 nJ/img, we could map networks of around 110 MOp/img; this is already more than optimized high-quality ImageNet classification networks such as ShuffleNet require [25].

Fig. 3 shows the estimation and measurements of the silicon area corresponding to the synthesized BNNs for fixed and variable weights. We also consider a model with a larger 64×64 input imager receptive field and a higher complexity (5 convolutional and 2 fully-connected layers, 23.05 GOp/img). Such a model is more accurate on the considered classification task (73.6%), but current synthesis tools cannot handle the high complexity of the design, using in excess of 256 GB of memory. When estimating the area occupancy, the 64×64 BNN results to be 4.3× larger than the area estimated for the 32×32 model. A direct optimization of such large designs is out of the scope of today's EDA tools, clearly showing the need for specialized design automation tools for BNNs.

V. CONCLUSION

We have presented a purely combinational design and synthesis of BNNs for near-sensor processing. Our results demonstrate the suitability and the energy efficiency benefits of the proposed solution, fitting in a silicon area of 2.61 mm² when considering a BNN model with 32×32 binary input data and weight parameters fixed at design time. Our study also highlighted the need for novel synthesis tools able to deal with very large and complex network designs, which are not easily handled by current tools.

REFERENCES

[1] M. Alioto, Enabling the Internet of Things: From Integrated Circuits to Integrated Systems. Springer, 2017.
[2] M. Rusci, D. Rossi et al., "An event-driven ultra-low-power smart visual sensor," IEEE Sensors Journal, vol. 16, no. 13, pp. 5344–5353, 2016.
[3] Á. Rodríguez-Vázquez, R. Carmona-Galán et al., "In the quest of vision-sensors-on-chip: Pre-processing sensors for data reduction," Electronic Imaging, vol. 2017, no. 11, pp. 96–101, 2017.
[4] R. LiKamWa, Y. Hou et al., "RedEye: Analog convnet image sensor architecture for continuous mobile vision," in Proc. IEEE ISCA, 2016, pp. 255–266.
[5] S. Zhang, M. Kang et al., "Reducing the energy cost of inference via in-sensor information processing," 2016.
[6] J. Fernández-Berni, R. Carmona-Galán et al., "Focal-plane sensing-processing: A power-efficient approach for the implementation of privacy-aware networked visual sensors," Sensors, vol. 14, no. 8, pp. 15203–15226, 2014.
[7] M. Gottardi, N. Massari, and S. A. Jawed, "A 100 µW 128×64 pixels contrast-based asynchronous binary vision sensor for sensor networks applications," IEEE Journal of Solid-State Circuits, vol. 44, no. 5, pp. 1582–1592, 2009.
[8] R. Andri, L. Cavigelli et al., "YodaNN: An architecture for ultra-low power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017.
[9] M. Courbariaux, I. Hubara et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016.
[10] M. Rastegari, V. Ordonez et al., "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. ECCV. Springer, 2016, pp. 525–542.
[11] Y. Umuroglu, N. J. Fraser et al., "FINN: A framework for fast, scalable binarized neural network inference," in Proc. ACM/SIGDA FPGA, 2017, pp. 65–74.
[12] J. Choi, S. Park et al., "A 3.4-µW object-adaptive CMOS image sensor with embedded feature extraction algorithm for motion-triggered object-of-interest imaging," IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 289–300, 2014.
[13] G. Kim, M. Barangi et al.,
"A 467 nW CMOS visual motion sensor with temporal averaging and pixel aggregation," in Proc. IEEE ISSCC, 2013, pp. 480–481.
[14] H. G. Chen, S. Jayasuriya et al., "ASP vision: Optically computing the first layer of convolutional neural networks using angle sensitive pixels," in Proc. IEEE CVPR, 2016, pp. 903–912.
[15] E. H. Lee and S. S. Wong, "Analysis and design of a passive switched-capacitor matrix multiplier for approximate computing," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 261–271, 2017.
[16] Z. Du, R. Fasthuber et al., "ShiDianNao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, no. 3, 2015, pp. 92–104.
[17] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W convolutional network accelerator," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, 2017.
[18] V. Sze, Y.-H. Chen et al., "Efficient processing of deep neural networks: A tutorial and survey," 2017.
[19] E. Nurvitadhi, D. Sheffield et al., "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," in Proc. FPT, 2016, pp. 77–84.
[20] K. Ando, K. Ueyoshi et al., "BRein memory: A 13-layer 4.2k neuron/0.8M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS," in Proc. VLSI Symposium, 2017.
[21] L. Jiang, M. Kim et al., "XNOR-POP: A processing-in-memory architecture for binary convolutional neural networks in Wide-IO2 DRAMs," in Proc. IEEE/ACM ISLPED, 2017.
[22] "The traffic surveillance workshop and challenge 2017 (TSWC-2017)," 2017, MIO-TCD: MIOvision Traffic Camera Dataset. [Online]. Available: http://podoce.dinf.usherbrooke.ca
[23] S. Jayasuriya, O. Gallo et al., "Deep learning with energy-efficient binary gradient cameras," 2016.
[24] L. Cavigelli, P. Degen, and L.
Benini, "CBinfer: Change-based inference for convolutional neural networks on video data," 2017.
[25] X. Zhang, X. Zhou et al., "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," 2017.