A Single-MOSFET MAC for Confidence and Resolution (CORE) Driven Machine Learning Classification

1 A Single-MOSFET MA C for Conﬁdence and Resolution (CORE) Dri v en Machine Learning Classiﬁcation Farid K enarangi, Student Member , IEEE and Inna Partin-V aisband, Member , IEEE Abstract —Mixed-signal machine-learning classiﬁcation has r e- cently been demonstrated as an efﬁcient alternativ e for classiﬁca- tion with power expensiv e digital circuits. In this paper , a high- COnﬁdence high-REsolution (CORE) mixed-signal classiﬁer is proposed for classifying high-dimensional input data into multi- class output space with less power and area than state-of-the-art classiﬁers. A high-r esolution multiplication is facilitated within a single-MOSFET by feeding the features and feature weights into, respecti vely , the body and gate inputs. High-resolution classiﬁer that considers the conﬁdence of the individual pr edictors is designed at 45 nm technology node and operates at 100 MHz in subthreshold r egion. T o ev aluate the performance of the classiﬁer , a reduced MNIST dataset is generated by downsampling the MNIST digit images fr om 28 × 28 features to 9 × 9 features. The system is simulated across a wide range of PVT variations, exhibiting nominal accuracy of 90%, energy consumption of 6.2 pJ per classiﬁcation (over 45 times lower than state-of-the-art classiﬁers), area of 2,179 µ m 2 (over 7.3 times lower than state-of- the-art classiﬁers), and a stable response under PVT variations. Index T erms —machine lear ning hard ware, mixed-signal clas- siﬁers, conﬁdence-level, high resolution, high-dimensional data, multi-class classiﬁcation, linear classiﬁers, logistic regr ession, subthreshold. I . I N T R O D U C T I O N L OW power machine learning (ML) classiﬁers are playing an important role in enabling edge computation of ML algorithms. W ide variety of applications beneﬁt from these compact classiﬁers, such as, Internet of Things (IoT) de vices, wireless sensor networks (WSNs), smart home technologies, autonomous transportation, and security systems. Existing on-chip classiﬁers can be categorized into two major domains: digital and mixed-signal [1]. A digital clas- siﬁer is typically fed with binary inputs ( i.e. , features) and uses binary feature weights, all obtained by sampling and quantizing corresponding analog signals. The classiﬁcation accuracy with digital classiﬁers increases with the increasing number of bits assigned for features and weights. These highly accurate digital classiﬁers howe ver exhibit signiﬁcant power consumption and physical size and are often not suitable for power limited applications, such as battery powered sensors and those other IoT devices that are wirelessly powered and powered from harv ested energy . Alternati vely , mixed signal classiﬁers aim to reduce the area and po wer consumption of the conv entional digital classiﬁers by directly using the analog input data for classiﬁcation [2]. The inherent need for data con version with power hungry analog-to-digital con verters (ADCs) is therefore mitigated with mix ed-signal classiﬁers [2]. Another concern in modern classiﬁers is the high dimen- sionality of data. Classifying data in a high-dimensional space The authors are with the Department of Electrical and Computer Engi- neering, University of Illinois at Chicago, Chicago, IL 60607 USA (e-mail: fkenar2@uic.edu; vaisband@uic.edu). Fig. 1. Schematic of the proposed linear binary classiﬁer . Features and feature weights are fed into, respecti vely , the body and gate input of the individual MOS transistors. Separate lines are used for positiv e and negativ e weights. often results in a prohibiti vely high data movement among memory and computing circuit components. This, in turn, signiﬁcantly increases power consumption in both, mixed- signal and digital classiﬁers [3]–[15]. T o reduce the data mov ement, several approaches for in-memory ML computa- tion have recently been proposed. Recent state-of-the-art in- memory classiﬁers typically exhibit accuracy between 90% and 96% and energy dissipation in the range of 210 pJ/decision to 879 pJ/decision [16]–[19] for typical image recognition datasets. Emerging de vice technologies are also being con- sidered for providing power and area ef ﬁcient alternatives for the con ventional CMOS based classiﬁers. Accuracy of 90% and energy of 25 pJ per decision has been recently reported in [20], [21]. T o enable high-resolution feature-weight multiplication, a theoretical framework that comprises circuits, models, and algorithms is proposed. T o the best of the authors knowledge, this paper is the ﬁrst to report a mixed-signal high-resolution classiﬁer , utilizing MOSFET body terminals. The schematic of the proposed conﬁguration is shown in Fig. 1. W ith this approach, body bias of the MOSFETs is controlled by the individual ML features, the gate inputs are fed by the absolute value of a corresponding feature weight, and the sign of the weights is taken into account by considering separate lines for the positiv e and negati ve feature and feature weight product. A K -class classiﬁcation with N features is therefore realized with N rows and K columns multiplication and accumulation (MA C) array , where each column serves as an independent 2 binary classiﬁer . These individual binary classiﬁers are com- bined using one-versus-all technique [22], requiring 1 2 ( K − 1) times less binary classiﬁers than the state-of-the-art classiﬁer in [16] and orders of magnitude less transistors, as compared with other existing mixed-signal classiﬁers [17], [19]. Another primary contribution is in decision making domain. W ith analog classiﬁers, binary predictions are typically made based on the relati ve magnitude of signals between the positi ve and negativ e sensing lines. W ith this traditional approach, the sensing line with the highest voltage drop is assumed to exhibit the correct classiﬁcation and the conﬁdence le vel of the decision is not considered. Alternati vely , a small dif ference in these voltage drop values indicates low conﬁdence le vel of the individual binary predictions and thus, a higher probability of an erroneous ﬁnal decision. T o address this ML integrity issue, a conﬁdence driv en classiﬁcation is proposed in this paper . W ith this approach, the difference among the magnitudes of the sensing line voltage drops is considered for capturing the conﬁdence of the individual predictions. Finally , the proposed system is designed in subthreshold re- gion, e xhibiting a po wer efﬁcient alternative for the traditional classiﬁers. The classiﬁer is demonstrated on Modiﬁed National Institute of Standards and T echnology (MNIST) dataset [23] of 10-class digit images. Based on SPICE circuit le vel simulation results, MNIST data is classiﬁed with 90% accuracy using 81 × 10 transistors and exhibits power consumption of only 6.2 pJ per decision. The rest of the paper is organized as follo ws. In Section II, the proposed high-resolution binary classiﬁer and linearization technique are described. Fabrication considerations are also discussed in this section. Based on the proposed binary clas- siﬁer , a multi-class high-resolution classiﬁer is designed and demonstrated with MNIST dataset, as described in Section III. Conﬁdence driven classiﬁcation is also explained in Section III. Circuit design and simulation results of the multi-class high-COnﬁdence high-REsolution (CORE) classiﬁer using one-versus-all technique are presented in Section IV. The paper is summarized in Section V. I I . T H E P RO P O S E D L I N E A R B I NA RY C L A S S I F I E R In this section, the proposed linear binary classiﬁer is described. The software lev el design frame work is provided in Section II-A. The circuit, and fabrication lev el considerations are presented in, respectively , Sections II-B and II-C. A. Design F ramework Reliability , power consumption, and physical size of on- chip classiﬁers are all primary concerns in modern ML ICs. The proposed framework is designed to meet accuracy speci- ﬁcations of modern classiﬁcation problems in a cost effecti ve manner . Linear algorithms are exploited in this paper for training a supervised binary classiﬁer , optimizing the system for linearly separable input data. Fig. 2. Handwritten digits from MNIST dataset, (a) original image (default resolution with 28 × 28 features), (b) selected features shown with light squares, and (c) downsampled image with 9 × 9 features. W ith a multiv ariate linear classiﬁer , the system response Z is a linear combination of N input features x = ( x 1 , x 2 , ..., x N ) and model weights w = ( w 1 , w 2 , ..., w N ) , Z = N X i =1 w i · x i , Z ∈ R . (1) The model weights are determined during supervised training by minimizing the prediction error between the system re- sponse, Z , and a corresponding true v alue in the labeled train- ing dataset. Logistic regression (LR) – a common supervised linear ML model – is used for training the proposed classiﬁer based on gradient descent algorithm [24]. LR is preferred due to its simple implementation and superior performance on MNIST dataset as compared with other classiﬁers. In inference, a probability threshold of 0 . 5 is used for predicting system response to input data, exhibiting a simple on-chip implementation, ˆ y = sign ( Z ) = sign ( P N i =1 w i · x i ) =  1 , Z ≥ 0 − 1 , Z < 0 . (2) The described logistic regressor with the probability threshold of 0.5 is referred to as logistic classiﬁer . The accuracy of the proposed logistic classiﬁer is e valuated as a percentage of all the correct predictions out of the total number of test predictions. The proposed ML ﬂow and the preprocessing steps are explained below . 1) Dataset: MNIST database is a large image dataset, com- monly used for ev aluating the effecti veness of ML hardware. MNIST contains images of 70,000 handwritten digits, ranging between 0 to 9. Each digit comprises 784 ( 28 × 28 ) image pixels. The training and test datasets comprise, respectiv ely , 60,000 and 10,000 digits. Out of the 60,000 training obser- vations, 45,000 and 15,000 digits are used for , respectively , training and v alidating the proposed system. 2) F eatur e selection and downsampling: Each image pixel of the individual digits in the training set is considered as an ML feature and used for training the classiﬁer . T o reduce the power and area overheads, those redundant features that are not essential for digit classiﬁcation are eliminated. T o deter- mine the preferred number of observed features, the dataset is downsampled to N ≤ 784 features ( N = 6 2 , 7 2 , . . . . 28 2 ) and classiﬁcation accuracy is obtained for the downsampled data in Python. T o ef ﬁciently classify MNIST digits with the proposed classiﬁer , N = 81 (9 × 9) is preferred, corresponding to 90% accuracy . The original and downsampled ( N = 81 ) digit images are shown in Fig. 2. 3 B. Cir cuit Level Considerations The primary goal in a linear binary classiﬁcation is to accurately and efﬁciently perform the dot product operation of the features and feature weights, as described in (2). T o simplify the circuit lev el design, the result sho wn in (2) is formulated as the signed addition of positiv e, Z + , and negati ve, Z − , feature-weight products, ˆ y = sign ( Z + z }| { X w i > 0 x i · w i + Z − z }| { X w j < 0 x j · w j ) =  1 , Z + ≥ Z − − 1 , Z + < Z − . (3) The individual positiv e and negati ve feature-weight prod- ucts are accumulated within the positive, V + sen and negati ve V − sen sensing lines, yielding the basic ML multiplication and accumulation (MAC) operation, as sho wn in Fig. 1. For each feature-weight multiplication, a single-MOSFET is connected to either the positive or ne gative sensing line, as determined by the sign of the corresponding feature weight. For example, for w 1 < 0 , the corresponding multiplier MOSFET is connected to the negati ve sensing line. T o capture the v oltage drops across the sensing lines, sensing capacitors ( i.e. , C sen ) are connected to the indi vidual sensing lines. The size of a sensing capacitor is determined proportionally to sensing line current, increasing with the number of transistors ( i.e. , features) connected to the line. T o classify tasks with higher number of features, larger sensing capacitors are therefore required, limiting the scalability of the system. T o provide a power efﬁcient and scalable solution, the transistors are biased in near/subthreshold operation region, signiﬁcantly limiting the current through the sensing lines. In the case of MNIST classiﬁcation with 9 × 9 features, capacitance of only 50 fF is utilized per sensing line. A primary concern with near/subthreshold operation is the exponential dependence of the drain current on the body and gate biases [25], I sub = W L I t exp ( V g s − V th nV T )[1 − exp ( − V DS V T )] , (4) where I t is the sub-threshold current at V g s = V th , n is the sub-threshold slope, and V T is the thermal v oltage. Note that body v oltage dependence is embedded in the threshold v oltage, V th ∝ √ V bs . Considering that V DS >> V T , the expression in (4) can be simpliﬁed as, I sub = W L I t exp ( V g s − V th nV T ) , V DS >> V T . (5) T o mitigate the non-linear dependence of the drain current on the weight-feature dot product (see (1)), a novel training ﬂow is proposed (see Fig. 3). T o account for the non-linear dependence of the drain current on the bias voltage, the model is trained with square root values of the default features ( x i → √ x i ) ). Thus, the extracted feature weights, w , are optimized for classifying the MNIST dataset transformed into half-order polynomial space. Alternatively , to account for the non-linear dependence of the drain current on the gate voltage, the feature weights are logarithmically adjusted ( w i → ln( w i ) ), exhibiting V g s ∝ ln( w ) . Using the ﬁrst-order approximation of exp( √ V bs ) ≈ 1 + √ V bs , the current in this Fig. 3. Linearization ﬂow . T o account for the non-linear dependence of the drain current on the body bias, I sub ∝ √ V bs , the model is trained to make predictions based on the square root values of the original features. The optimized weights are logarithmically adjusted, mitigating the exponential dependence of the subthreshold current on the gate bias. case is expressed as, I sub ∝ exp(ln( V g s )) · exp( p V bs ) ∝ V g s · p V bs ∝ w · √ x. (6) In inference, the current model is exploited for making predic- tion based on the square root v alues of the original features, as trained ofﬂine, yielding 90% accuracy across the MNIST test set, as detailed in the follo wing sections. 4 C. F abrication Costs In the proposed linear binary classiﬁer, the body and gate terminals are fed by , respectively , the input features and corresponding feature weights. Each multiplication is, there- fore, executed by a single-MOSFET , signiﬁcantly reducing the power and area costs (despite the o verhead of the triple- well technology) and complexity (as determined by number of transistors) of the classiﬁer in comparison to the existing state-of-the-art mixed-signal classiﬁers [2], [16]. Con ventional twin-well fabrication process is illustrated in Fig. 4(a). This process is designed to provide a single voltage connection to all the n-type and p-type body terminals. Alter- nativ ely , to connect body terminals of the indi vidual multiplier transistors to different voltage le vels, a specialized fabrication process is required. One way to indi vidually bias numerous body terminals, is by fabricating with triple-well process (see Fig. 4(b)), which is commonly used in high-performance, low-po wer ICs [26], [27] and for reducing substrate noise in mixed-signal circuits [28]. In a p-substrate triple-well process, an additional deep n-well is used to isolate the p-well of each MOSFET from the p-substrate, allo wing an independent body terminal connection for each MOSFET . The triple-well structure has been demonstrated to provide better noise char- acteristics as compared with the traditional twin-well structure, without increasing the gate leakage [29]. Alternativ ely , the triple-well structure, exhibits additional fabrication costs and area ov erheads that needs to be considered. Layout of a four- transistor block in twin-well and triple-well process is shown in, respectiv ely , Fig. 4(c) and Fig. 4(d). W ith the triple-well conﬁguration, the area is increased 3 times as compared with the twin-well process in 45 nm CMOS technology . I I I . C O R E BA S E D M U L T I - C L A S S C L A S S I F I E R A multi-class classiﬁer is designed based on multiple lin- ear binary classiﬁers, as presented in Section II. Conﬁdence driv en approach that addresses the integrity of multi-class classiﬁcation is described in Section III.A. The transistor lev el implementation of the proposed CORE classiﬁer is presented in Section III.B. A. Conﬁdence Driven Classiﬁcation: O VO versus O V A T wo typical approaches for designing a multi-class classiﬁer based on multiple binary classiﬁers are one-versus-one (O VO) and one-versus-all (O V A) [30]. With O VO approach, all pair- wise combinations of the output classes are e valuated with the individual binary classiﬁers. Thus, a K -class classiﬁcation with O V O approach requires 1 2 K ( K − 1) binary classiﬁers, increasing the system complexity and power and area costs quadratically with the number of classes. For example, for classifying 10 digits with O V O approach, 45 i -verus- j , ( i, j ∈ { 0 , 1 , 2 , .., 9 } ) binary classiﬁers are required. Alternatively , in case of a 1,000 class classiﬁcation, about half a million of binary classiﬁers are required. The ﬁnal decision with O V O technique is extracted using majority voting approach [31]: each binary classiﬁer v otes independently for a certain class and the ﬁnal decision is made based on the class with highest number of votes. Fig. 4. T wo commonly used fabrication processes, (a) twin-well process, and (b) triple-well process with p-substrate. Layout of four transistors with W = 200 nm and L = 50 nm in the (c) twin-well, and (d) triple-well conﬁgurations. Alternativ ely , with O V A approach, each binary classiﬁer discriminates between a single class and the rest of the classes. The required number of binary classiﬁers with O V A increases linearly with the number of classes, facilitating a more power and area efﬁcient classiﬁcation of high-dimensional data. In addition, a probability score of correct discrimination is in- herently provided with O V A and can be extracted for the individual binary classiﬁers. Thus, a K -class O V A classiﬁer can be designed with as few as K binary classiﬁers, seam- lessly accounting for the conﬁdence lev el of the individual predictions. The conﬁdence driv en decision, d , with the O V A classiﬁer is determined as, d = i | p i = max ( p i ) , i = 0 , 1 , ..., K − 1 , (7) where p i is the conﬁdence level of the i th binary classiﬁer . From circuit level perspecti ve, the conﬁdence lev el with O V A approach can be determined as the difference of the voltage drops across the positiv e and negati ve sensing lines, e.g . , p i ∝ ∆ V + sen ( i ) − ∆ V − sen ( i ) , ∆ V sen ( i ) . d = i | (∆ V sen ( i )) = K − 1 max j =0 { (∆ V sen ( j ) } . (8) T o extract the class with highest conﬁdence lev el, a light- weight comparator is designed, as described in the next subsection, yielding a po wer and area efﬁcient alternativ e for the traditional voter circuits. B. Cir cuit Level Design and Simulation Results The proposed multi-class classiﬁer is designed in SPICE and demonstrated based on the reduced MNIST dataset. The O V A 5 Fig. 5. V oltage wav eforms of the sensing lines during the precharge ( i.e. , ( 0 − 2 . 5 ns ), ( 10 − 12 . 5 ns ), ( 20 − 22 . 5 ns ), ( 30 − 32 . 5 ns ), ( 40 − 42 . 5 ns ), and ( 50 − 52 . 5 ns ) ) and classiﬁcation stages ( i.e. , ( 2 . 5 ns − 10 ns ), ( 12 . 5 ns − 20 ns ), ( 22 . 5 ns − 30 ns ), ( 32 . 5 ns − 40 ns ), ( 42 . 5 ns − 50 ns ), and ( 52 . 5 ns − 60 ns ) for six consecutive input features ( i.e. , 7, 2, 1, 0, 4, and 1). circuits and architecture of the MOSFET array are described in this section. 1) MA C Array: T o classify the do wnsampled 9 × 9 MNIST digits, ten binary classiﬁers (see Section II) are co-designed in SPICE, yielding a 81 × 10 MOSFET array . Each of the transistors within the MA C array is exploited for generating a single feature-weight product. During inference, the V + sen and V − sen lines are precharged to V DD prior to each prediction. Fig. 6. Schematic of the proposed conﬁdence level extractor . The conﬁdence le vel of the i th and j th binary classiﬁers is extracted and the prediction of the most conﬁdent classiﬁer is used as the ﬁnal decision. All the input features and feature weights are connected simultaneously to, respectiv ely , the body and gate terminals of the indi vidual multiplier transistors, facilitating a parallel classiﬁcation process within all the ten binary classiﬁers. As a result, 20 different voltage drop v alues ( i.e . , V + sen ( i ) , V − sen ( i ) , i = 0 , 1 , . . . , 9 ) are generated on the individual sensing lines, as sho wn in Fig. 5. The voltage waveforms of the positi ve and negati ve sensing lines are illustrated by , respectiv ely , the blue dotted and solid red lines. These voltage drops are exploited by the conﬁdence level extractor , generating the ﬁnal classiﬁer decision. The conﬁdence le vel of each binary classiﬁer (as determined based on (8)) is sho wn in Fig. 5, as noted on top of each plot. For e xample, for digit 7, the 7 th (7-vs-all) binary classiﬁer has as expected the highest conﬁdence lev el. The ﬁnal decision is generated by the conﬁdence level extractor . 2) Conﬁdence Level Extractor: The schematic of a single conﬁdence driven selector for a multi-class classiﬁcation is presented in Fig. 6(a). For a K -class classiﬁcation, 1 2 K ( K − 1) conﬁdence driv en selectors are required, yielding a total of 6 Fig. 7. Schematic of the resistive voltage divider . No sensitivity to process variations is observed with 1K-run Monte-Carlo simulation. Fig. 8. Schematic of the proposed classiﬁer, comprising voltage divider , MOSFET array , and conﬁdence lev el extractors. 45 selectors for the MNIST dataset ( K = 10 ). The circuit is designed to compare the conﬁdence le vels of the binary classiﬁers ( i th and j th classiﬁer) and determine the classiﬁer with higher conﬁdence level, ∆ V + sen ( i ) − ∆ V − sen ( i ) <> ∆ V + sen ( j ) − ∆ V − sen ( j ) . (9) T o simplify the circuit le vel implementation the subtraction in (9) is replaced with summation, ∆ V + sen ( i ) + ∆ V − sen ( j ) <> ∆ V + sen ( j ) + ∆ V − sen ( i ) . (10) Each summation in (10) is captured with two parallel NMOS transistors, as sho wn in Fig. 6(a). T o capture the result of ∆ V + sen ( i ) + ∆ V − sen ( j ) , the gate terminals of the transistors M 1 and M 2 are connected to, respectiv ely , the positive sensing line of the i th classiﬁer , V + sen ( i ) , and negati ve sensing line of the j th classiﬁer , V − sen ( j ) . As a result, the drain current at M 1 and M 2 is proportional to ∆ V + sen ( i ) + ∆ V − sen ( j ) . Similarly , the gate terminals of the transistors M 3 and M 4 are connected to, respectiv ely , V + sen ( j ) and V − sen ( i ) , generating a drain current proportional to ∆ V + sen ( j ) + ∆ V − sen ( i ) . T o determine which side exhibits higher conﬁdence lev el ( i.e. , sinks lower current), two back to back in verters are utilized. V oltage wav eforms of the sensing lines, EN signal which enables the back-to-back in verters, Reset signal which resets the voltages stored on both sides of the back-to-back in verters and output signals are illustrated in Fig. 6(b) for six consecutiv e classiﬁcation periods. During the second, third, and ﬁfth classiﬁcations, the left side of the in verters sink higher current than the right side. The left and right sides of the conﬁdence selector, are, therefore, forced to, respectively , the low ( i.e. , D ( i ) = 0) and high ( i.e. , D ( j ) = 1) output voltage. Alternativ ely , during the ﬁrst, fourth, and sixth classiﬁcations, the right side sinks higher current than the left side. Thus, left and right sides are forced to, respecti vely , the high ( i.e. , D ( i ) = 1) and low ( D ( j ) = 0) output v oltage. The correct functionality of the conﬁdence lev el extractor depends on its symmetric structure and is highly sensitiv e to process v ariations. T o mitigate process variations, larger pull-down ( W = 5 W min ), pull-up ( W = 15 W min ), and conﬁdence e xtractor ( W = 5 W min ) transistors are utilized for the conﬁdence le vel extractor . W ith upsized transistors, the av erage accuracy degradation of the classiﬁer is limited to 2% under process variations, as described in the next section. Note that con ventional methods such as extracting ﬁnal classiﬁcation results with K analog-to-digital (ADCs) can also be lev eraged, trading-off po wer efﬁciency for scalability ( i.e. , 1 2 ( K − 1) times less conﬁdence extractors). 3) Resistive V oltage Divider: The trained, quantized feature weights are generated using a resistive voltage divider (see Fig. 7). In this conﬁguration, the preferred voltage range ( V low , V hig h ) is divided into 2 n equal steps, where n is the preferred quantization resolution. The gate bias range ( i.e. , ( 300 mV , 610 mV ) is quantized with 5-bit resolution into 31 equal steps of 10 mV , as illustrated in Fig. 7. Poly resistors with sheet resistance of 7.8 Ω /  are utilized. The resilience of the voltage divider to process variations is ev aluated with 1K- run Monte-Carlo simulation. Based on the simulation results, the circuit is highly resilient to process v ariations, exhibiting the average deviation of 0.2 mV from the nominal wight values. The results of the 1K-run Monte-Carlo simulation are also shown in Fig. 7. W ith this topology , no memory and data con version units are required for storing and quantizing the weights. Alternati vely , the weights provided with the voltage divider are not reconﬁg- urable. Using a 32-to-1 multiplex er and a memory unit ( e.g. , SRAM), the circuit can be updated to provide reconﬁgurable feature weights [32]. The overall area of the classiﬁer with reconﬁgurable weights is expected to increase by a factor of 4.2. I V . R E S U LT S A. System Characteristics A schematic of the integrated system is illustrated in Fig. 8, comprising v oltage divider , MOSFET array , and conﬁdence 7 (a) (b) Fig. 9. Confusion matrix of the MNIST classiﬁcation obtained in (a) Python (90% accuracy), and (b) SPICE (90% accuracy), exhibiting no accuracy degradation in SPICE as compared with Python results. lev el extractors. The proposed CORE classiﬁer is designed at 45 nm technology node with the nominal power supply voltage of 0.9 V and threshold voltage of 0.61 V. The body terminals of the MA C array are biased at voltage lev els between 0.2 V and 0.8 V. The area occupied by the classiﬁer is 2,179 µ m 2 , as estimated based on transistor count in SPICE. The MAC array comprises a total of 81 × 10 = 810 MOSFETs. The reduced, 81-feature MNIST dataset is classiﬁed with accuracy of 90% within a single 10 ns clock cycle of the system operation, exhibiting no accuracy degradation as compared with the v alidation accuracy in Python. The confusion matrices obtain based on Python and SPICE simulations are shown in, respectiv ely , Fig. 9(a) and Fig. 9(b), exhibiting equal accuracy of 90%. The ML classiﬁer generates predictions at 100 MHz frequency , exhibiting an average energy consumption of 6.2 pJ per classiﬁcation of a single digit. T o maintain high prediction accuracy , 5 bits and 6 bits are assigned for quantizing, respec- tiv ely , the feature weights and input features. By increasing the dimensionality of the proposed classiﬁer, lower power and area ov erheads can be traded off for higher prediction accuracy , approaching the theoretical limit of 92% for MNIST classiﬁcation with linear ML algorithms and O V A decisioning scheme. The existing tradeoffs between dimensionality of the data and the accuracy , po wer, and area overheads are sho wn in Fig. 10. The knee point of N = 81 is selected in this paper to provide satisfactory accuracy results in an power and area efﬁcient manner . B. Simulation Results Performance characteristics are listed in T able I for the proposed system along with the existing state-of-the-art mix ed- signal classiﬁers [2], [16], [17]. Note the different dataset, MIT -CBCL, used in [17]. Classiﬁcation accuracy is a strong function of data. The accuracy comparison among [17] and other classiﬁers in T able I is therefore less valuable, albeit the excellent accuracy demonstrated in [17]. Alternatively , beneﬁting from the high-resolution multiplications and con- ﬁdence driven predictions, CORE classiﬁer exhibits signiﬁ- cantly less transistor count, and thus lower power consumption and smaller IC area, as compared with the other state-of-the- art classiﬁers. For fair comparison, current time per decision Fig. 10. Existing tradeoffs between the selected number of features, accuracy and (a) area, and (b) power overheads. T ABLE I: System characteristics of the proposed and other existing mixed-signal ML classiﬁers. [2] [16] [17] Current work Dataset MNIST MNIST MIT -CBCL MNIST T echnology 130 nm 130 nm 65 nm 45 nm Algorithm Ada-boost Ada-boost SVM LR Accuracy 90% 90% 96% 90% Feature weight resolution 4b 1b 8b 5b Feature resolution 5b 5b 8b 6b Number of features 48 81 256 81 Supply voltage (V) 1.2 1.2 0.675 0.9 Speed (MHz) 1.3 50 32 100 Costs Energy (pJ) 543 633.4 210 6.2 Current time per decision (pA · sec) 452.5 527.8 311.1 6.9 Area ( µ m 2 ) 246,792 133,736 401,300 2,179 Normalized area ( µ m 2 /nm 2 ) 14.6 7.91 94.9 1.07 and the system area normalized by squared form factor are also shown in T able I. The current time per decision with the proposed classiﬁer is approximately 45 times lower as com- pared with other approaches. Similarly , area savings (albeit the triple-well technology) range between × 7.3 and × 89 as compared to other classiﬁers, as shown in T able I. Note that the operational frequency is scalable and can be adjusted based on the application needs and constrains. T o ev aluate the effect of v oltage and temperature variations on the CORE classiﬁer, the supply v oltage is varied between 0.6 volts and 1.2 volt and the temperature is varied between −30 ◦ C and 125 ◦ C. The effect of process variations on the classiﬁer performance is ev aluated based on a 1K-run Monte- Carlo simulation on a randomly selected 100-observ ation (10 images per digit) balanced test set with nominal accuracy of 90%. Note that a 1K-run Monte-Carlo simulation on the whole test set takes 1 , 000 × 2 . 5 hours on Intel Core i7-7700 CPU. The results of the simulations are sho wn in Fig. 11. The 8 Fig. 11. Classiﬁer performance under PVT variations, (a) under temperature variations at typical process corner, TT , and V DD = 0.9 V, (b) under v oltage v ariations at room temperature, T = 27 ◦ C, and at TT , and (c) process and mismatch variations with T = 27 ◦ C and V DD = 0.9 V . classiﬁer exhibits no sensitivity within wide range of voltage variations from 0.8 V to 1.2 V. Less than 2% accuracy degrada- tion is observed at lo w temperatures − 30 ◦ C ≤ T ≤ 0 ◦ C . No sensitivity to temperature v ariations for 0 ◦ C ≤ T ≤ 125 ◦ C is observed. An av erage of 2% accuracy degradation is exhibited due to process variations, as extracted from the 1K-run Monte- Carlo simulation. Conﬁdence histograms of the correct and incorrect clas- siﬁcations are shown in, respectiv ely , Fig. 12(a) and Fig. 12(b). W ith the proposed conﬁdence dri ven approach, incorrect classiﬁcations often exhibit lower conﬁdence as compared with typically conﬁdent, correct predictions. The odds of an incorrect classiﬁcation to be corrected under process variations are therefore high, fav orably affecting the resilience of the system to variations. Based on the simulation results, the accuracy is improved for nearly one-third of the Monte-Carlo runs. V . C O N C L U S I O N Sev eral state-of-the-art mixed-signal classiﬁers ha ve re- cently been demonstrated for po wer efﬁcient classiﬁcation. Accurate classiﬁcation of multi-dimensional data under the tight power and area constraints is the primary objective in Fig. 12. Conﬁdence lev el distribution for (a) correct, and (b) incorrect classiﬁcations. modern on-chip classiﬁers. A novel circuit topology is pro- posed in this paper for high-COnﬁdence and high-REsolution (CORE) classiﬁcation. W ith this topology , body terminals of the MOSFETs are exploited to encode input features, enabling the high-resolution classiﬁcation. T o enhance the ML integrity in multi-class classiﬁers, O V A technique is exploited for ef ﬁciently extracting a ﬁnal decision based on the conﬁdence le vel of the individual predictors. For a K -class classiﬁcation, ( K − 1) / 2 times fewer binary classiﬁers are required with the O V A approach as compared with the traditional O VO method [16]. T o further reduce area and power consumption of the O V A-based CORE classiﬁer , a light- weight conﬁdence extractor is designed, generating the ﬁnal decision based on the conﬁdence level of the individual binary classiﬁers. T o the best of the authors knowledge, the proposed CORE classiﬁer is the ﬁrst integrated system to successfully classify MNIST dataset in subthreshold region using a single- MOSFET MA C. Biasing transistors in subthreshold region signiﬁcantly decreases the leakage and dynamic currents as well as ov erall load on the sensing lines. The proposed CORE classiﬁer is designed in SPICE and simulated in 45 nm standard CMOS process. The performance and functionality of the proposed approach is validated with simulation results, exhibiting 90% classiﬁcation accuracy with 6.2 pJ energy consumption per prediction across the MNIST dataset. Each prediction is ﬁnalized within a single clock cycle of 10 ns. The unique topology of CORE classiﬁer supports the ML integrity under a wide range of PVT v ariations, as well as system scalability across technology nodes. R E F E R E N C E S [1] V . Sze, “Designing hardware for machine learning: The important role played by circuit designers, ” IEEE Solid State Cir cuits Mag. , vol. 9, no. 4, pp. 46–54, Nov 2017. [2] Z. W ang and N. V erma, “A low-energy machine-learning classiﬁer based on clocked comparators for direct inference on analog sensors, ” IEEE T rans. Circuits Syst. I, Re g. P apers , v ol. 64, no. 11, pp. 2954–2965, Jun 2017. 9 [3] Y .-H. Chen et al ., “Eyeriss: An energy-ef ﬁcient reconﬁgurable accelera- tor for deep con volutional neural networks, ” IEEE J . Solid-State Cir cuits , vol. 52, no. 1, pp. 127–138, Jan 2017. [4] R. Hameed et al ., “Understanding sources of inefﬁciency in general- purpose chips, ” in Pr oc. Int. Symp. Comput. Ar chit. , vol. 38, no. 3, Jun 2010, pp. 37–47. [5] M. Horowitz, “1.1 computing’ s energy problem (and what we can do about it), ” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2014, pp. 10–14. [6] P . Whatmough et al ., “A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with > 0.1 timing error rate tolerance for IoT applications, ” in IEEE Int. Solid-State Cir cuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2017. [7] D. Bankman et al ., “An always-on 3.8 µ J/86% CIF AR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS, ” in IEEE Int. Solid-State Cir cuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2018, pp. 222–224. [8] D. Jeon et al ., “A 23-mW face recognition processor with mostly-read 5T memory in 40-nm CMOS, ” IEEE J. Solid-State Circuits , vol. 52, no. 6, pp. 1628–1642, Feb 2017. [9] S. Bang et al ., “14.7 a 288 µ W programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hier- archy for mobile intelligence, ” in IEEE Int. Solid-State Cir cuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2017, pp. 250–251. [10] G. Desoli et al ., “14.1 a 2.9 TOPS/W deep conv olutional neural network SOC in FD-SOI 28nm for intelligent embedded systems, ” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2017, pp. 238–239. [11] K. Bong et al ., “14.6 A 0.62 mW ultra-low-power conv olutional-neural- network face-recognition processor and a CIS integrated with always-on haar-like face detector, ” in IEEE Int. Solid-State Cir cuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2017, pp. 248–249. [12] D. Shin et al ., “14.2 DNPU: An 8.1 TOPS/W reconﬁgurable CNN-RNN processor for general-purpose deep neural networks, ” in IEEE Int. Solid- State Circuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2017, pp. 240–241. [13] B. Moons et al ., “14.5 envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accurac y-frequency-scalable conv olutional neural net- work processor in 28nm FD-SOI, ” in IEEE Int. Solid-State Circuits Conf . (ISSCC) Dig. T ech. P apers , Feb 2017, pp. 246–247. [14] S. Park et al ., “4.6 A1. 93TOPS/W scalable deep learning/inference pro- cessor with tetra-parallel MIMD architecture for big-data applications, ” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2015, pp. 1–3. [15] F . Kenarangi and I. P .-V aisband, “Exploiting machine learning against on-chip power analysis attacks: tradeof fs and design considerations, ” IEEE T rans. Circuits Syst. I, Reg. P apers , vol. 66, no. 2, pp. 1396–1402, Oct 2018. [16] J. Zhang, Z. W ang, and N. V erma, “In-memory computation of a machine-learning classiﬁer in a standard 6T SRAM array , ” IEEE J. Solid-State Circuits , vol. 52, no. 4, pp. 915–924, Apr 2017. [17] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42pJ/decision 3.12 TOPS/W rob ust in-memory machine learning classiﬁer with on-chip training, ” in IEEE Int. Solid-State Cir cuits Conf. (ISSCC) Dig. T ech. P apers , Feb 2018, pp. 490–492. [18] M. Kang et al., “An in-memory VLSI architecture for con volutional neural networks, ” IEEE J . T rans. Emer g. Sel. T opics Cir cuits Syst. , vol. 8, no. 3, pp. 494–505, Apr 2018. [19] Y . T ang, J. Zhang, and N. V erma, “Scaling up in-memory-computing classiﬁers via boosted feature subsets in banked architectures, ” IEEE T rans. Circuits Syst. II, Exp. Briefs , July 2018. [20] D. Querlioz, O. Bichler, and C. Gamrat, “Simulation of a memristor- based spiking neural network immune to device variations, ” in Neural Networks (IJCNN), The 2011 International Joint Conference on . IEEE, 2011, pp. 1775–1781. [21] F . Kenarangi and I. Partin-V aisband, “Leveraging independent double- gate FinFET devices for machine learning classiﬁcation, ” IEEE T rans. Cir cuits Syst. I, Reg . P apers , v ol. PP , no. 99, pp. 1–12, Jul 2019. [22] R. Rifkin and A. Klautau, “In defense of one-vs-all classiﬁcation, ” J. of machine learning r esearc h (JMLR) , v ol. 5, pp. 101–141, Jan 2004. [23] Y . LeCun. (1998) The MNIST database of handwritten digits. [Online]. A vailable: http://yann.lecun.com/exdb/mnist/ [24] J. A. Nelder and R. J. Baker, “Generalized linear models, ” Encyclopedia of statistical sciences , v ol. 4, Jul 2004. [25] P . R. Gray , P . Hurst, R. G. Meyer , and S. Le wis, Analysis and design of analog integr ated cir cuits , 4th ed. W iley , Ne w Y ork, USA, 2001. [26] J. W . Tschanz et al., “ Adaptiv e body bias for reducing impacts of die- to-die and within-die parameter variations on microprocessor frequency and leakage, ” IEEE J. Solid-State Cir cuits , vol. 37, no. 11, pp. 1396– 1402, Dec 2002. [27] R. T aco et al., “Low voltage logic circuits exploiting gate level dynamic body biasing in 28 nm UTBB FD-SOI, ” Solid-State Electr onics , vol. 117, pp. 185–192, march 2016. [28] K. T o et al., “Comprehensive study of substrate noise isolation for mixed-signal circuits, ” in Int. Electr on Dev . Meeting. T ech. Digest , Dec 2001, pp. 22–7. [29] Y . Ogasahara et al., “Supply noise suppression by triple-well structure, ” EEE T rans. V ery Lar ge Scale Inte gr . (VLSI) Syst. , vol. 21, no. 4, pp. 781–785, Apr 2013. [30] M. Aly , “Survey on multiclass classiﬁcation methods, ” Neur al Netw , vol. 19, pp. 1–9, 2005. [31] J. Levin and B. Nalebuff, “ An introduction to vote-counting schemes, ” J. of Economic P erspectives , vol. 9, no. 1, pp. 3–26, Mar 1995. [32] C.-W . Lu et al., “A 10-bit resistor-ﬂoating-resistor-string D AC (RFR- D AC) for high color-depth LCD driver ICs, ” IEEE J. Solid-State Cir cuits , vol. 47, no. 10, pp. 2454–2466, Sep 2012. Farid Kenarangi (S’18) received the B.Sc. de- gree in electrical engineering from the University of T abriz, T abriz, Iran, in 2015. He is currently pursuing the Ph.D. degree in electrical engineer- ing with The University of Illinois at Chicago, Chicago, IL, USA, under the supervision of Prof. I. Partin-V aisband. His current research interests include hardware security , machine learning inte- grated circuits, analog design, and on-chip power deliv ery and management. He was a recipient of the 2017 Univ ersity of Illinois at Chicago Chancellor’ s Graduate Research A ward. Inna Partin-V aisband (S’12–M’15) recei ved the Bachelor of Science degree in computer engineer- ing and the Master of Science degree in electrical engineering from the T echnion-Israel Institute of T echnology , Haifa, Israel, in, respectively , 2006 and 2009, and the Ph.D. degree in electrical engineering from the Univ ersity of Rochester, Rochester , NY in 2015. She is currently an Assistant Professor with the Department of Electrical and Computer Engineering at the Uni versity of Illinois at Chicago. Between 2003 and 2009, she held a variety of software and hardware R&D positions at T ower Semiconductor Ltd., G- Connect Ltd., and IBM Ltd., all in Israel. Her primary interests lay in the area of high performance integrated circuit design. Her research is currently focused on innov ation in the areas of AI hardware and hardware security . Y et another primary focus is on distributed power delivery and locally intelligent power management that facilitates performance scalability in heterogeneous ultra-large scale integrated systems. Special emphasis is placed on developing robust frameworks across levels of design abstraction for complex heterogeneous integrated systems. Dr . P .-V aisband is an Associate Editor of the Micr oelectronics Journal and has served on the T echnical Program and Organization Committees of various conferences.

A Single-MOSFET MAC for Confidence and Resolution (CORE) Driven Machine Learning Classification

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment