Threshold Logic in a Flash

Ankit Wagle∗, Gian Singh∗, Jinghua Yang∗, Sunil Khatri†, Sarma Vrudhula∗
∗(awagle1,gsingh58,jinghua.yang,vrudhula)@asu.edu, †sunil.khatri@tamu.edu
∗School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe AZ 85281
†Dept. of Electrical and Computer Engineering, Texas A&M University, College Station TX

Abstract: This paper describes a novel design of a threshold logic gate (a binary perceptron) and its implementation as a standard cell. This new cell structure, referred to as flash threshold logic (FTL), uses floating gate (flash) transistors to realize the weights associated with a threshold function. The threshold voltages of the flash transistors serve as a proxy for the weights. An FTL cell can be equivalently viewed as a multi-input, edge-triggered flip-flop which computes a threshold function on a clock edge. Consequently, it can be used in the automatic synthesis of ASICs. The use of flash transistors in the FTL cell allows programming of the weights after fabrication, thereby preventing discovery of its function by a foundry or by reverse engineering. This paper focuses on the design and characteristics of the FTL cell. We present a novel method for programming the weights of an FTL cell for a specified threshold function using a modified perceptron learning algorithm. The algorithm is further extended to select weights to maximize the robustness of the design in the presence of process variations. The FTL circuit was designed in 40nm technology, and simulations with layout-extracted parasitics included demonstrate significant improvements in the area (79.7%), power (61.1%), and performance (42.5%) when compared to the equivalent implementations of the same function in conventional static CMOS design. Weight selection targeting robustness is demonstrated using Monte Carlo simulations.
The paper also shows how FTL cells can be used for fixing timing errors after fabrication.

Index Terms: Threshold Logic, Floating Gate, Flash, Low Power, High Performance, Perceptron

I. INTRODUCTION AND MOTIVATION

Methods to optimize the performance, power and area (PPA) of static CMOS circuits have continuously improved over three decades, leaving few opportunities, if any, for further improvements. This suggests that if there are to be any further advances in improving PPA at the logic and circuit levels, the conventional way of computing logic functions has to be revisited. Although several nanotechnologies are being investigated as alternatives or enhancements to static CMOS (e.g. [1]–[4]), they remain at the research stage and large scale adoption is still far in the future.

This paper introduces a new programmable ASIC primitive, referred to as a flash threshold logic (FTL) cell, that can be used to substantially improve all three PPA metrics of an ASIC. An FTL cell and its use in an ASIC is different from any other type of ASIC component previously reported. However, it is designed as a standard cell, so that it is fully compatible with the conventional ASIC design flow, and can be processed by commercial design tools without any changes. In other words, it can easily be combined with conventional CMOS logic during synthesis, technology mapping, and place-and-route. However, it is functionally and structurally very different from a complex standard cell.

∗ The research was supported in part by NSF PFI award 1701241.

An FTL cell of $n$ inputs can realize any threshold function of $n$ or fewer variables. A threshold function $f(x_1, \cdots, x_n)$ [5] is a unate Boolean function whose on-set and off-set are linearly separable, i.e.
there exists a vector of weights $W = (w_1, w_2, \cdots, w_n)$ ¹ and a threshold $T$ such that

$$f(x_1, x_2, \cdots, x_n) = 1 \Leftrightarrow \sum_{i=1}^{n} w_i x_i \geq T, \tag{1}$$

where $\sum$ here denotes the arithmetic sum. A threshold function can be equivalently represented by $(W, T) = (w_1, w_2, \cdots, w_n; T)$.

Fig. 1: FTL Schematic (inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$, clock $C$, output $Q = f(x_1, \ldots, x_n)$).

Figure 1 shows the schematic of the FTL cell, in which the weights $W$ are internal parameters of the cell. The schematic is meant to convey that the input-output behavior of an FTL cell may be viewed as an edge-triggered, multi-input flip-flop, whose output is a threshold function, registered at the rising edge of the clock signal $C$.

A distinctive characteristic of the FTL cell design is that the actual threshold function realized by an FTL instance within an ASIC is programmed after the circuit is manufactured. An FTL based ASIC integrates flash or floating gate [7] transistors along with conventional MOSFETs within the FTL cell. Thus, unlike many of the emerging technologies [2], [3], [8], [9], an FTL cell employs mature IC technologies (CMOS and Flash) that can be commercially manufactured and integrated today.

A. FTL in ASIC Design: A Valuable Use Case

The focus of this paper is on the design of the FTL cell. Before proceeding to that, it will be instructive to understand its use in ASIC design [10]. The fact that an FTL is a programmable, multi-input flip-flop provides a unique and significant new opportunity to improve the PPA of ASICs.

Consider the logic netlist shown in Figure 2a which has two registered outputs $F$ and $G$. Suppose that the transitive fan-in (TFI) cones of $F$ and $G$ are traversed and two subcircuits $A$ and $B$ (see Figure 2b) are found that are threshold functions of their inputs. The remaining subcircuit is labeled as $C$.
Suppose that subcircuits $A$ and $B$ are each replaced by an FTL cell, programmed to realize $A$ and $B$. This replacement is shown in Figure 2c, where the FTL cells are shown as black boxes. Now, subcircuit $C$ would be re-synthesized to account for the changes in the delay of FTL cells and the new loads that they present to the outputs of $C$.

¹ W.L.O.G., weights can be assumed to be positive integers [6], and for a given truth table of a threshold function, there is a weight vector whose sum is minimum [6].

Fig. 2: Use of FTL in ASIC design. (a) A logic netlist. (b) Identifying threshold functions in TFI cones of flip-flops. (c) An FTL-CMOS logic hybrid.

The circuit in Figure 2c would substantially improve the PPA of an ASIC for two reasons:
1) Subcircuits $A$ and $B$ and the two flip-flops are each replaced by an FTL cell which has far fewer transistors, resulting in a significant reduction in area and power.
2) The clock-to-Q delay of an FTL cell is typically about 30% to 40% smaller than the delay of the standard cell realization of subcircuits $A$ and $B$ plus the clock-to-Q delay of regular flip-flops. In the FTL-CMOS hybrid design, this results in a substantial amount of slack (required time minus arrival time) on the outputs of subcircuit $C$, which in turn will allow synthesis and technology mapping tools to drastically reduce the logic area of subcircuit $C$.

FTL based ASICs also offer several other equally significant advantages not possible with conventional CMOS logic.
1) IP Protection: A CMOS ASIC with embedded FTL cells cannot be reverse-engineered by a foundry or any third party because the functions of the FTL cells are unknown (black boxes) at manufacturing time, as shown in Figure 2c.
2) Correcting Timing Errors: The fine-grained, post-manufacture flash threshold voltage programmability allows precise speed binning, and correction of timing errors. This is not possible in traditional CMOS design.
3) Mitigating Aging Effects: By re-programming the flash devices in-field, our scheme allows for mitigating the effects of aging. This is also not possible in CMOS design.
4) High Endurance: Unlike flash memory, the FTL cell does not suffer from endurance issues. Flash transistors can endure a finite number of write cycles (1K to 100K) [11], [12]. In our approach, the flash devices will be programmed a few times (at most) after fabrication, and then possibly again in the field to adjust for aging effects.

B. Main Contributions

The remainder of the paper will focus on the design of the FTL cell and demonstrate its key characteristics through extensive and detailed electrical simulations using state-of-the-art device and circuit models and commercial tools. The main contributions of this work are summarized below.
• This paper introduces a novel circuit design of the FTL cell to realize all threshold functions of $n$ or fewer variables ². The new design incorporates both flash transistors and conventional MOSFETs in a unique way to realize highly robust threshold logic circuits.
• The set of threshold voltages ($V_t$) of the flash transistors in the FTL cell serves as a proxy for the $[W, T]$ that defines the threshold function realized by an FTL cell. Since the threshold voltages of the flash transistors can be programmed with high precision [7], an FTL cell can implement weights with great fidelity. We introduce an algorithm that maps the weights of a given threshold function $f = [W, T]$ to the threshold voltages of the flash transistors. This is a complex, non-linear, multi-valued mapping. That is, several different $V_t$ vectors may correspond to a given $[W, T]$, each determined by the complex electrical and layout characteristics of the MOSFETs and flash transistors.
Given a layout extracted netlist of an FTL cell, we present a novel modification of the classical perceptron learning algorithm (PLA) [13] that works in concert with HSPICE to determine one $V_t$ of an FTL cell that computes $f = [W, T]$. This algorithm accounts for layout parasitics and process variations. Like the original PLA, the modified PLA is guaranteed to converge, ensuring that a solution ($V_t$) for the given layout of an FTL cell will be found in a finite number of steps if a solution exists.
• The fine-grained programmability of threshold voltages of the flash transistors in an FTL cell is exploited to improve its robustness. Given that the mapping $[W, T] \Rightarrow V_t$ is multi-valued, we show how to direct our modified PLA to find a $V_t$ that will ensure that the FTL cell reliably computes the given threshold function in the presence of local and global process and environmental variations. Using this approach, substantial improvement in the robustness of the FTL cell is demonstrated using Monte Carlo simulations. This also shows how post-fabrication tuning of the threshold voltages can correct failures due to process variations, or modify the delay to correct timing errors, improve a circuit's performance, or improve the performance characteristics of a design to alter the speed binning distribution in a manner that maximizes profit.

² In the experimental results, we find that $n = 5$ is a sufficiently good choice to demonstrate substantial improvements in PPA, since there are a large number (117) of threshold functions of $n$ or fewer variables.

C. Organization of the Paper

Section II gives a very brief overview of threshold logic and flash transistor technology. Sections III, IV and V contain the main body of this work. The architecture and operation of the FTL cell are described in Section III. This is followed by a description in Section IV of the modified PLA used to program an FTL to implement a given threshold function.
Section V contains an extensive set of experimental results, demonstrating the significant improvements in PPA of FTL cells over their CMOS equivalents, and validating several of the uses of post-fabrication programming/tuning of the flash devices. Before concluding the paper in Section VII, we present a brief and partial review of the prior art related to this paper in Section VI.

II. BACKGROUND

A. Threshold Logic

Equation (1) defines an $n$-input threshold function. An example of a 5-input threshold function is the 3-out-of-5 majority function:

$$f(a, b, c, d, e) = abc \lor abd \lor abe \lor acd \lor ace \lor ade \lor bcd \lor bce \lor bde \lor cde \equiv a + b + c + d + e \geq 3 \equiv [w_a, w_b, w_c, w_d, w_e; T] = [1, 1, 1, 1, 1; 3].$$

An XOR is a simple example of a non-threshold function.

The importance of threshold logic stems from the fact that many Boolean functions that require exponential size AND/OR networks can be realized by polynomial sized, fixed depth threshold networks [6]. From a practical perspective, nearly 70% of the functions in standard cell libraries are threshold functions. We will demonstrate that implementing threshold functions using conventional CMOS logic primitives is very inefficient, as compared to the FTL cell. In our approach, Equation (1) must be translated into a comparison of an electrical quantity such as charge, current or voltage. This is the basis of many other threshold gate implementations as well [14].

B. Flash Transistors

Fig. 3: Flash Transistor Cross Section (labels: control gate, floating gate, dielectric, drain, source, N+ regions).

Flash or floating gate transistors are dual-gate field effect transistors (DGFETs). The first gate is called a control gate and the second is a floating gate (see Figure 3). The control gate is similar to the gate of a traditional MOSFET. The floating gate is inserted between the substrate and the control gate, and is electrically and physically isolated.
Hence, current cannot flow into (out of) the floating gate, unless electrons are forced to enter (leave) the floating gate from (to) the substrate by a phenomenon known as Fowler-Nordheim (FN) tunneling [15].

A flash device is programmed by holding its body, source and drain nodes at ground and applying a high voltage (10-20 Volts) to the control gate. The resulting electric field forces electrons to tunnel from the substrate into the floating gate, increasing the threshold voltage of the flash transistor. The resulting threshold voltage depends on the number of electrons that tunnel into the floating gate, which depends on the duration of the programming pulse. Significantly, the threshold voltage of a flash transistor can be adjusted with a fine granularity [7]. Once electrons are trapped in the floating gate, they remain trapped for many years [11], [12], or until removed by an erase operation. A flash transistor can be erased by holding the control gate at ground, floating the drain and source nodes, and applying a high voltage at the body node. Erasing is simultaneously performed on all the transistors which share a common body node.

III. FLASH THRESHOLD LOGIC (FTL) CELL

Figure 4 shows the architecture of the FTL cell. It has five main components: the left input network LIN, the right input network RIN, a sense amplifier (SA), an output latch (LA) and the flash transistor programming logic (P). The LIN and RIN consist of two sets of inputs $(\ell_1, \cdots, \ell_n)$ and $(r_1, \cdots, r_n)$, respectively, with each input in series with a flash transistor. In our implementation, $\ell_i = r_i$ for all $i$. The conductivity of these two networks is determined by the state of the inputs and the threshold voltages of the flash transistors. The assignment of signals to the LIN and RIN is done to ensure sufficient difference in conductivity across all minterm pairs $(m_i, m_j)$ such that $f(m_i) \neq f(m_j)$.
The FTL cell has two differential signals $N_1$ and $N_2$, which serve as inputs to an SR latch. When $[N_1, N_2] = [0, 1]$ ($[1, 0]$), the latch is set (reset) and the output $Y = 1$ ($0$). The magnitudes of the two sides of the inequality (1) in the definition of a threshold function are mapped to the conductance $G_L$ of the LIN and $G_R$ of the RIN, such that $[N_1, N_2] = [0, 1] \Leftrightarrow G_L > G_R$ and $[N_1, N_2] = [1, 0] \Leftrightarrow G_L < G_R$. As stated earlier, the flash transistor threshold voltages serve as a proxy to the weights of the threshold function: the higher the weight, the lower the threshold voltage. For a given threshold function, this non-linear monotonic relationship is learnt using a modified perceptron learning algorithm described in Section IV.

Fig. 4: FTL Cell Architecture: Input networks LIN and RIN drive the sense amplifier with current based on weighted inputs. Input weights are implemented by modulating the conductivity of LIN and RIN using flash transistors. The sense amplifier evaluates the threshold function and drives the latch to produce the output $Y$. An FTL cell is programmed by sending high voltage pulses to the flash transistors' gates via the Programming Logic.

The FTL cell has three modes: regular, erase and programming mode. The $V_t$ values of the flash transistors are set in the programming mode and erased in the erase mode. The evaluation takes place in regular mode.

FTL Regular Mode: In this mode PROG = ERASE = 0. Assume that the $V_t$s of the flash transistors have been set to appropriate values corresponding to the weights of the threshold function, and their gates are being driven to 1 by setting HiV to VDD, $FC_j$ to 0V and all $FT_i$ to 0V. When $CLK = 0$, the circuit is reset. In this phase, the nodes $N_5$ and $N_6$ of LIN and RIN are connected to the supply, $N_5 = N_6 = 0$, and $N_1 = N_2 = 1$. Therefore, the output $Y$ remains unchanged.

Assume now that an on-set minterm is applied to the inputs in the LIN and RIN. With properly assigned $V_t$ values to the flash transistors, suppose that $G_L > G_R$ for the given minterm. When $CLK: 0 \to 1$, both the LIN and RIN will conduct, and $N_5$ and $N_6$ will both transition from $0 \to 1$. Assuming $G_L > G_R$, $N_5$ rises faster than $N_6$, and hence $N_5$ will make $M_7$ active before $N_6$ makes $M_8$ active. This will start to discharge $N_1$ before $N_2$. When $N_1$ falls below the $V_t$ of $M_6$, it will stop further discharge of $N_2$, and turn on $M_3$, resulting in $N_2: 0 \to 1$. Finally, $[N_1, N_2] = [0, 1]$ sets the SR latch, resulting in $Y = 1$. For an off-set minterm, $G_L < G_R$, and $[N_1, N_2] = [1, 0]$, resulting in $Y = 0$.

The conventional circuit structures used in flash memories are not suitable for programming an FTL cell because it has to also perform logic operations. Consequently, we present a new programming interface for an off-chip programming circuit to set the $V_t$ values of any FTL cell. During flash-programming, this interface uses the $FC_j$ signal to select the $j$-th FTL cell and the $FT_i$ signal to select the $i$-th flash transistor of the selected FTL cell.

FTL Programming Mode: (ERASE=0, PROG=1, CLK=0, $FT_i$=0, $FC_j$=0, HiV=20V). The ERASE and PROG signals turn on M12 and M13 and turn off M14. In this state, the source of the flash transistor is floating while the drain and bulk are connected to ground. By activating the appropriate transistors using the $FT_i$ and $FC_j$ signals, high voltage pulses are passed on the HiV line through $MC_j$ and $MT_i$ to the gate of the flash transistor to set the desired threshold voltage ($V_t$).

FTL Erase Mode: (ERASE=1, PROG=1, CLK=0, $FT_i$=0, $FC_j$=0, HiV=-20V). M12 is turned off by the ERASE signal. Both the source and drain of the flash transistors are floating in this state, while the bulk is connected to ground.
A negative HiV pulse at the gate terminal of all the flash transistors in this state will tunnel the charge out of the floating gate, thereby erasing the flash transistor.

IV. MODIFIED PERCEPTRON LEARNING ALGORITHM

In this section, we describe an algorithm to determine the vector of flash transistor threshold voltages for a given threshold function $f = [W, T]$. The problem is to find a mapping between the Boolean space $B^n$ and the conductivity space $(G_L, G_R)$ such that $G_L > G_R$ iff $\sum w_i x_i \geq T$ (i.e. for an on-set minterm), and $G_L < G_R$ iff $\sum w_i x_i < T$ (i.e. for an off-set minterm). This mapping is depicted in Figure 5.

Fig. 5: Transformation from Boolean space to conductivity space; the hyperplane gets converted into a line.

$G_L$ and $G_R$ are non-linear functions of the flash transistor threshold voltages, the time-varying drain and source voltages of the input transistors, and the layout parasitics that vary from instance to instance. To account for these dependencies, $G_L$ and $G_R$, in principle, must be obtained by solving a set of differential equations, an approach that is not practical. We next show how to simultaneously solve the differential equations numerically and perform the binary classification by a modified version of the classical perceptron learning algorithm (PLA) [13].

The PLA starts with an initial hyperplane in the Boolean space and iteratively adjusts it until all the on-set and off-set minterms fall on opposite sides of the hyperplane. Each minterm corresponds to some point in the $(G_L, G_R)$ space. Our modified PLA iteratively adjusts the $V_t$(s) of the flash transistors such that points in the conductivity space that correspond to the on-set and off-set minterms fall on the appropriate side of the line $G_L = G_R$ (Fig. 5). We use HSPICE to determine whether any point falls above or below this line. A description of the modified PLA follows.
The threshold voltages of the flash transistors associated with the input transistors in the LIN and RIN are labeled $V_1, V_2, \cdots, V_n$. The $i$-th transistor in both LIN and RIN has a threshold voltage $V_i$. In addition, there are two special flash transistors, whose threshold voltages are $V_L$ and $V_R$, associated with the LIN and RIN, respectively. For a threshold function $f = (w_1, w_2, \cdots, w_n; T)$, the $V_i$, $1 \leq i \leq n$, correspond to the weights $w_i$ of the threshold function, whereas only one of $V_L$ or $V_R$ is associated with the threshold $T$ of $f$. If $V_L$ is associated with $T$, then $V_R = V_{DD}$, effectively turning it off. If $V_R$ is associated with $T$, then $V_L = V_{DD}$. The use of additional flash devices on both sides of the FTL cell allows for extra programming flexibility. The induced symmetry also balances the parasitics of the LIN and the RIN.

For the truth table ($TT$) of $f$, the modified PLA applies all the minterms of $f$ to the FTL cell, and records the HSPICE response in an array called $OT$ (output table). For a given minterm $m_i$, if $TT(m_i) = OT(m_i)$ then the response is called a correct response, otherwise it is called an incorrect response. An FTL cell is completely programmed if the recorded response for every minterm is correct. Until the FTL cell is completely programmed, at least one minterm would generate an incorrect response. In the event of an incorrect response associated with minterm $m_i$, the modified PLA adjusts the threshold voltages of all flash transistors associated with the ON input transistors within the interval $[\delta, V_{DD} - \delta]$, by a minimum increment $\delta$, using the following equations ($k$ denotes the iteration number of the algorithm):

$$V_i^{k+1} = \begin{cases} V_i^{k} - \delta m_i & m_i \cdot W \geq T \\ V_i^{k} + \delta m_i & m_i \cdot W < T. \end{cases} \tag{2}$$

Equation (2) is quite easy to understand. The term $\delta m_i$ is simply a vector which has a value $\delta$ at all locations where $m_i$ is 1, and zero elsewhere. For instance, $\delta(1, 0, 1, 1, 0) = (\delta, 0, \delta, \delta, 0)$.
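As a rough illustration of Equation (2) and of the sweep over the truth table described above, the following Python sketch replaces the HSPICE simulation with a deliberately simple stand-in. Everything specific to the sketch is an assumption made for illustration and is not taken from the FTL design: the millivolt units, the 900 mV supply, the 20 mV step, the names `update`, `toy_response` and `train`, the linear model $G_L = \sum_i x_i (V_{DD} - V_i)$, and the fixed reference `g_ref` that stands in for $G_R$.

```python
from itertools import product

VDD_MV = 900    # assumed supply voltage, in millivolts
DELTA_MV = 20   # assumed minimum increment (delta), in millivolts


def threshold_fn(weights, threshold, x):
    """Equation (1): output 1 iff the weighted sum of inputs reaches T."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)


def update(v, minterm, on_set):
    """Equation (2): move V_i by delta for each input that is ON.

    Incorrect on-set minterm  -> lower V_i (raise conductance).
    Incorrect off-set minterm -> raise V_i (lower conductance).
    Results are clamped to the interval [delta, VDD - delta].
    """
    step = -DELTA_MV if on_set else DELTA_MV
    lo, hi = DELTA_MV, VDD_MV - DELTA_MV
    return [min(max(vi + step * xi, lo), hi) for vi, xi in zip(v, minterm)]


def toy_response(v, x, g_ref):
    """Hypothetical stand-in for HSPICE: compare a linear G_L with g_ref."""
    g_left = sum(xi * (VDD_MV - vi) for vi, xi in zip(v, x))
    return int(g_left > g_ref)


def train(weights, threshold, v0, g_ref=450, max_passes=1000):
    """Sweep all minterms until every response matches the truth table."""
    v, updates = list(v0), 0
    for _ in range(max_passes):
        mistakes = 0
        for x in product((0, 1), repeat=len(v)):
            expected = threshold_fn(weights, threshold, x)
            if toy_response(v, x, g_ref) != expected:
                v = update(v, x, on_set=bool(expected))
                mistakes += 1
        updates += mistakes
        if mistakes == 0:
            return v, updates  # completely programmed
    raise RuntimeError("no feasible assignment found within max_passes")


if __name__ == "__main__":
    print(update([500] * 5, (1, 0, 1, 1, 0), on_set=True))
    print(train([2, 1, 1], 3, [450, 450, 450]))
```

For the three-input function $[2, 1, 1; 3]$ this toy model settles after 12 updates at thresholds of 450, 690 and 690 mV, so the input with the larger weight ends up with the lower threshold voltage. In this toy solution the minterm $(1, 0, 0)$ sits exactly on the decision boundary, which is the kind of zero-margin outcome that robustness-oriented training is meant to avoid.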
Suppose $m_i$ is an on-set minterm for which the response was incorrect. This means that $G_L < G_R$. Therefore $G_L$ needs to be increased for minterm $m_i$. Hence the threshold voltages of all flash transistors that are connected to the input transistors that are ON for minterm $m_i$ should be decreased by $\delta$. Similarly, if $m_i$ is an off-set minterm, then the threshold voltages of the same flash transistors must be increased by $\delta$. This is what is expressed in Equation (2).

Since the $V_i$ values are bounded above and below, it might not be possible to satisfy the truth table using the $V_i$ alone. In such cases, the algorithm will resort to adjusting $V_L$ and $V_R$ using the same principle as in Equation (2). If $m_i$ is an on-set minterm that was incorrect, then $G_R$ should be reduced. Therefore, $V_R$ is incremented by $\delta$, until its upper bound is reached. If this is not sufficient, then $G_L$ has to be increased. Hence, $V_L$ is decremented.

Given a threshold function and a sufficiently small $\delta$, the modified PLA will converge to a feasible threshold voltage assignment $V_t^*$ for the FTL cell [13]. For an $n$-input threshold function, a pessimistic upper bound on the number of iterations is given by $k_{max} = 2n \|V_t^*\|^2 / \delta^2$. For $n = 5$ and $\delta = 0.02$ V, $k_{max} = 2500 \|V_t^*\|^2$.

A. Training for Robustness

The modified PLA does not consider the relative location of the points with respect to the metastability region around the line $G_L = G_R$ (see Figure 5b). Even though minterms are classified correctly, they can be arbitrarily close to the line. The further away a minterm is from the line, the easier (and faster and more robust) it will be for the sense amplifier to detect the difference between $N_5$ and $N_6$, and discharge the appropriate side ($N_1$ or $N_2$) first.
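To make the notion of distance from the line $G_L = G_R$ concrete, the sketch below computes a signed margin for each minterm and reports the worst case. The conductance values are made-up numbers in arbitrary units and the function names are hypothetical; in the flow described here, such points would come from HSPICE rather than from a list literal.

```python
import math


def signed_margin(g_left, g_right, on_set):
    """Perpendicular distance of (G_L, G_R) from the line G_L = G_R.

    Positive when the point lies on the correct side for its class
    (G_L > G_R for on-set, G_L < G_R for off-set), negative otherwise.
    """
    distance = (g_left - g_right) / math.sqrt(2)
    return distance if on_set else -distance


def worst_case_margin(points):
    """Smallest signed margin over (g_left, g_right, on_set) tuples."""
    return min(signed_margin(gl, gr, on) for gl, gr, on in points)


if __name__ == "__main__":
    # Hypothetical conductances (arbitrary units), not simulation results.
    points = [(3.0, 1.0, True), (2.5, 2.0, True), (1.0, 2.0, False)]
    print(worst_case_margin(points))
```

A solution whose worst-case margin is small is correctly classified but fragile; a negative value would indicate at least one misclassified minterm.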
Our approach to making the FTL cell highly robust is to introduce an additional capacitance $C_1$ on node $N_1$ when classifying an on-set minterm, and to determine the maximum value of $C_1$ for which the modified PLA converges. This handicaps node $N_1$ and directs the algorithm to find a solution that increases $G_L$ more than it increases $G_R$. Similarly, we add a capacitance $C_0$ on node $N_2$ when classifying an off-set minterm. The corresponding threshold voltages found by the modified PLA will increase the gap between $G_L$ and $G_R$, which makes the cell much more robust and, as a direct consequence, also improves its speed. Note that $C_0$ and $C_1$ are introduced in the simulations for improving the training solution only, and are not part of the FTL cell.

V. EXPERIMENTAL RESULTS

A. Experiment Setup

A 5-input FTL cell was designed and a complete layout (including the programming devices) was created using the TSMC 40nm LP library. The flash transistor models were obtained from [16] and were suitably modified to reflect the characteristics and variations of the TSMC 40nm library. The design rules for the flash transistors were obtained from ITRS. The layout of the FTL cell was created as a standard cell with an area of 15.6 µm². For reference, if X represents the drive strength, an X4 DFF and an X4 NAND gate have an area of 5.6 µm² and 2.8 µm² respectively, while their delay optimized X8 counterparts have an area of 14.347 µm² and 7.3 µm² respectively. The {setup, C2Q} of an X4 DFF is {67ps, 168ps}.

There are a total of 117 distinct threshold functions of 5 or fewer variables. A numbered list of these is given in [17] and can also be accessed at [18]. In this section, we use the same numbering scheme as in [17] to identify the functions.
In the sequel, the FTL cell trained to implement the threshold function numbered $n$ in [17] will be referred to as $FTL_n$, and the corresponding CMOS implementation will be denoted as $CMOS_n$. The threshold function itself will be denoted as $F_n$.

B. Training Iterations

The modified PLA was used to train the FTL cell for robustness (see Section IV-A) for all 117 functions. Figure 6 shows the number of iterations needed for training for each of the 117 functions. The actual number of iterations was about 10X lower than the theoretical upper bound presented in Section IV.

Fig. 6: Iteration count for the modified perceptron learning algorithm for all 117 functions of 5 or fewer variables.

C. Area, Delay and Power Comparison

Each of the 117 functions was implemented as an FTL cell, and also synthesized by Cadence Genus© and placed and routed using Cadence Innovus©, using the TSMC 40nm LP standard cells. The total delay (logic delay + setup time + clock-to-Q delay) and power values were determined by simulating the circuits at 25°C at 20% input switching activity. Figure 7 shows that each of the FTL implementations of the 117 functions has substantially smaller area, power and delay when compared to the CMOS equivalent. The averaged improvements of FTL over CMOS are: area (79.5%), delay (42.5%) and power (61.1%).

Fig. 7: PPA improvements of FTL over CMOS implementations.

Figure 8 compares the leakage power of the FTL and CMOS implementations of the 117 functions. The functions are arranged in ascending order of their CMOS leakage values. Unlike the CMOS implementations, the leakage power of the FTL implementations is nearly constant. Also plotted is the area trend line of CMOS implementations, to illustrate the strong correlation of leakage power with area. The few FTL implementations that had higher leakage (shown circled) were all small logic primitives.
Nevertheless, the total power (see Figure 7) of FTL implementations of even these functions is far less than the CMOS implementations. These functions can be avoided if leakage minimization is the primary design goal.

Fig. 8: Leakage power of FTL versus CMOS implementations.

D. Experiments on Training for Robustness

This experiment demonstrates the robust PPA training method, described in Section IV-A, to improve yield. The test function chosen was $F_{115} = [W; T] = [4, 1, 1, 1, 1; 5] = ab + ac + ad + ae$. The experiment consisted of training multiple versions of $FTL_{115}$ for various values of the parasitic capacitances $C_1$ and $C_0$, and for each solution, performing 100K Monte Carlo simulations with local and global process variations ³, and checking if the truth table was correctly realized. Table I shows the delay and yield for various values of $C_1$ and $C_0$. The functional yield was improved from 13% to 100% (i.e. truth tables of all 100K instances were verified to be correct) by increasing the values of $C_1$ and $C_0$.

There are two important observations to be made here. First, even though the weights of $b$, $c$, $d$, $e$ are equal, the corresponding flash transistors received different threshold voltages ($V_2$, $V_3$, $V_4$, $V_5$). This shows that the perceptron learning algorithm, working in concert with HSPICE, accounts for the layout parasitics. Second, the delay improves with increasing robustness, due to the increase in the difference between the voltages at nodes $N_5$ and $N_6$ (see Section IV).
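The equivalence between the weight-threshold form of $F_{115}$ and its sum-of-products form can be checked exhaustively over all 32 minterms. The short Python sketch below does this using Equation (1); the function names are chosen here purely for illustration.

```python
from itertools import product


def threshold_fn(weights, threshold, x):
    """Equation (1): 1 iff sum(w_i * x_i) >= T."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)


def f115_sop(a, b, c, d, e):
    """Sum-of-products form: ab + ac + ad + ae."""
    return int((a and b) or (a and c) or (a and d) or (a and e))


def on_set(weights, threshold):
    """All minterms for which the threshold function evaluates to 1."""
    n = len(weights)
    return [x for x in product((0, 1), repeat=n)
            if threshold_fn(weights, threshold, x)]


if __name__ == "__main__":
    w, t = [4, 1, 1, 1, 1], 5
    # The two representations agree on every one of the 32 minterms.
    assert all(threshold_fn(w, t, x) == f115_sop(*x)
               for x in product((0, 1), repeat=5))
    print(len(on_set(w, t)))  # number of on-set minterms
```

Because input $a$ carries weight 4 and the threshold is 5, $a$ alone is not sufficient, but $a$ together with any one of $b$, $c$, $d$, $e$ is, which gives 15 on-set minterms.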
TABLE I: Multi-corner Monte Carlo results with 100K simulations of $FTL_{115}$, trained for robustness using various capacitor values.

C1, C0 (fF) | Average Vt values (V): V1, V2, V3, V4, V5; Vl0, Vr0 | Yield (%) | Delay (ps)
0.00        | 0.64, 0.74, 0.72, 0.74, 0.72; 1.00, 0.74            | 13        | 244
0.01        | 0.62, 0.72, 0.70, 0.74, 0.74; 1.00, 0.70            | 20        | 220
0.02        | 0.58, 0.74, 0.72, 0.74, 0.72; 1.00, 0.64            | 43        | 204
0.05        | 0.48, 0.68, 0.66, 0.70, 0.66; 1.00, 0.56            | 59        | 162
0.10        | 0.34, 0.56, 0.54, 0.60, 0.62; 1.00, 0.46            | 100       | 138

Fig. 9: Conductivity $G_L$ and $G_R$ of $FTL_{115}$ [TT, 0.9V, 25°C]. In the conductivity space, the gap between off-set minterms and on-set minterms increases when the training is done for robustness.

In Section IV-A we argued that training an FTL cell with a handicap in the form of a parasitic capacitance on $N_1$ and $N_2$ will improve the robustness by increasing the smallest gap in conductance between the LIN and the RIN. Figure 9 demonstrates this very important characteristic of the robust PPA algorithm for FTL. It is a plot of the conductivity space, i.e., $G_R$ versus $G_L$, of an FTL cell trained for the test function $F_{115}$, with and without the parasitic (handicap) capacitances. The blue points (orange points) correspond to the $G_L$ and $G_R$ values of the on-set and off-set minterms of $F_{115}$ in the absence (presence) of the parasitic capacitances $C_1$ and $C_0$ ($C_1 = C_0 = 0.1$ fF). Recall that for an on-set minterm $G_L > G_R$ and for an off-set minterm $G_R > G_L$. The plot clearly demonstrates that training with the parasitic capacitances dramatically improves the robustness in two ways. First, there is a significant increase (by 21%) in the shortest distance between the two closest on-set and off-set minterms, as indicated in Figure 9.

³ Several dozen parameters are varied in the HSPICE models provided by the vendor.

Second, the increase in $G_L$ is greater than the increase in $G_R$, i.e.
$\Delta G_L / \Delta G_R > 1$ for the on-set minterms, and vice-versa for the off-set minterms. Both of these effects contribute to reducing the contention in the sense amplifier in deciding the function output, which in turn directly improves the speed as well, resulting in higher robustness and higher performance.

E. Delay Distributions

This experiment compares the distributions of delays of FTL and CMOS implementations. We show the results for the function $F_{115} = [W; T] = [4, 1, 1, 1, 1; 5]$. The PVT corner setting was $[P, V, T] = [TT, 0.9V, 25°C]$. 100K Monte Carlo instances were generated for both $FTL_{115}$ and $CMOS_{115}$. The function of each of the 100K instances was verified against the truth table for correctness, for both $FTL_{115}$ and $CMOS_{115}$. The histograms of delays are shown in Figure 10. These clearly demonstrate the delay advantage of the FTL cell over its CMOS equivalent, even in the presence of process variations. The difference in standard deviation between the two is insignificant. Note that the FTL instances with large delays can be re-programmed to further reduce the delay. This capability is not available for the CMOS versions.

Fig. 10: Delay histogram of $FTL_{115}$ and $CMOS_{115}$ with 100K Monte Carlo simulations. $PVT = [TT, 0.9V, 25°C]$.

TABLE II: Delay, total power and power-delay-product (PDP) of $FTL_{115}$, trained at $V_{DD} = 0.9$ V and $C_0 = C_1 = 0.1$ fF.

Supply Voltage (V) | Flash Gate Voltage (V) | Power (u) | Delay (ps) | PDP
0.80               | 0.800                  | 14.3      | 198.1      | 2837.1
0.85               | 0.825                  | 20.5      | 157.6      | 3228.7
0.90               | 0.850                  | 26.1      | 130.2      | 3396.9
0.95               | 0.875                  | 40.3      | 111.2      | 4482.7
1.00               | 0.900                  | 53.1      | 97.0       | 5148.6
1.05               | 0.925                  | 76.0      | 86.4       | 6562.9
1.10               | 0.950                  | 85.0      | 78.2       | 6644.0

F. Dynamic Voltage Scaling

Voltage scaling is a common mechanism to trade off performance against power. Table II shows the results of training $FTL_{115}$ at 0.9 V.
The FTL was programmed with the resulting set of flash threshold voltages, and then operated over the voltage range [0.8 V, 1.1 V]. To ensure proper operation across all voltages, the gate voltages of the flash transistors were also scaled in this experiment. This result demonstrates how a single $V_T$ assignment can be used for dynamic voltage scaling. Note that the delay varies by 2.5X, the power varies by 5.9X and the PDP (energy) varies by 2.3X as the supply voltage varies over [0.8 V, 1.1 V]. This shows that FTL cells offer a healthy power, delay, and energy tradeoff through voltage scaling.

G. Post-fabrication Timing Correction

The experiments described in Sections V-D, V-E and V-F all point to the flexibility of FTL, which stems from its unique ability to have the flash transistor threshold voltages programmed after fabrication. It should come as no surprise, then, that this can also be used to correct timing errors.

Fig. 11: Datapath to demonstrate post-fab timing corrections.

Fig. 12: Correcting setup time violation with an FTL cell after fabrication. C2Q of FTL cell reduced from 180 ps to 142 ps.

Fig. 13: Correcting hold time violation with an FTL cell after fabrication. C2Q of FTL cell increased from 142 ps to 180 ps.

Figure 11 shows a small datapath that was constructed to demonstrate how setup time and hold time violations can be corrected after fabrication in an FTL design. The datapath is characterized by its clock-to-Q (C2Q) delay, its combinational delay (D2D), and the DFF specifications for setup ($DFF_{setup}$) and hold ($DFF_{hold}$) times. The clock is skewed by an appropriate amount $\Delta$ to generate either a setup time or a hold time violation. The violations are corrected by reprogramming the FTL cell to produce different C2Q values. Figure 12 shows how the data launched from FTL X misses the target clock edge at DFF Y, thereby violating setup time. To fix the setup time, the C2Q of FTL X is decreased.
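The effect of changing C2Q can be illustrated with the usual single-path timing checks. The sketch below is a minimal example that uses the textbook setup and hold inequalities with the skew $\Delta$ applied to the capture clock; only the two C2Q values (180 ps and 142 ps) are taken from Figs. 12 and 13, while the clock period, D2D, skew, and DFF setup and hold numbers are hypothetical values chosen so that each violation appears and is then removed.

```python
def setup_slack(c2q, d2d, t_setup, period, skew):
    """Setup slack (ps): data must arrive t_setup before the capture edge at period + skew."""
    return (period + skew) - (c2q + d2d + t_setup)


def hold_slack(c2q, d2d, t_hold, skew):
    """Hold slack (ps): data must not change until t_hold after the capture edge at skew."""
    return (c2q + d2d) - (t_hold + skew)


# C2Q values reported in Figs. 12 and 13; every other number is hypothetical.
C2Q_SLOW, C2Q_FAST = 180, 142

# Setup scenario: a long combinational path and a capture clock that arrives early.
setup_before = setup_slack(C2Q_SLOW, d2d=790, t_setup=40, period=1000, skew=-10)
setup_after = setup_slack(C2Q_FAST, d2d=790, t_setup=40, period=1000, skew=-10)

# Hold scenario: a short combinational path and a capture clock that arrives late.
hold_before = hold_slack(C2Q_FAST, d2d=20, t_hold=30, skew=150)
hold_after = hold_slack(C2Q_SLOW, d2d=20, t_hold=30, skew=150)

print(f"setup slack: {setup_before} ps -> {setup_after} ps")
print(f"hold slack:  {hold_before} ps -> {hold_after} ps")
```

A negative slack marks a violation. In this toy setting, lowering C2Q recovers the setup margin and raising it recovers the hold margin, which mirrors the direction of the corrections described for Figs. 12 and 13.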
Similarly, Figure 13 shows how the data launched at FTL X gets captured by DFF Y one cycle early, thereby overwriting the old value at DFF Y. By increasing the C2Q value of FTL X, the old value at the input of Y is retained for a longer time, which satisfies the hold time condition. Since the FTL cells are programmed post-fabrication, the delay can also be modified after fabrication. Using the same idea of post-fabrication $V_T$ adjustment, an FTL cell can be reprogrammed to mitigate delay increases due to aging.

H. Chip-level programming architecture

Although the full architecture for programming the FTL cells in an ASIC is not presented here, this section describes it in brief. An on-chip decoder is used to address the flash transistors of the FTL cells during programming. The address for the decoder is sent into the chip using a serial communication protocol along with a programming clock. The high voltage needed for sending programming pulses to the flash transistors is generated by an off-chip voltage source and routed into the chip. The pin count overhead for programming is low (only 3 pins are needed). When the address is received, the decoder activates a specific flash transistor of a specific FTL cell for programming.

VI. RELATED WORK

A. Threshold Logic

The study of threshold functions and the development of threshold gates date back to the 1960s, culminating in the authoritative book by Muroga [17]. Since then, an extensive body of theoretical work, new circuit architectures and implementations has been published. References [24] and [25] provide a detailed survey of work prior to 2003. One of the earliest works demonstrating the operation of threshold logic gates using flash transistors was reported in [26], [27]. It was an analog design of a single cell to demonstrate proof of concept.
The focus has shifted to exploring the use of emerging devices such as RRAMs, STT-MTJs, and others, to implement threshold gates [9], [28], [29]. Several recent works have devised efficient algorithms for determining weights aimed at robust threshold gates [29], [30]. However, until recently, due to the lack of design tools and incompatibility with existing design methodologies, threshold logic remained outside mainstream VLSI design.

Recently, [10] reported an architecture of a threshold gate and showed how it can be integrated with the standard-cell ASIC design methodology using commercial tools. In addition, they reported significant improvements in PPA of an actual silicon implementation of an ASIC with threshold gates [31]. Their architecture, however, severely limits the number of threshold functions that can be implemented. This is because the weight $w_i$ associated with input $x_i$ is implemented by using $w_i$ transistors, each driven by signal $x_i$. Hence, their circuit has severe fan-in limitations. For instance, the design in [10] can only realize 11 of the 5-input threshold functions, whereas, as demonstrated here, the FTL-5 cell can realize all 117 functions. In addition, representing weights using multiple transistors significantly reduces the robustness and prevents it from scaling to lower geometries. Finally, the FTL cell is programmed after fabrication, which prevents copying by a foundry and provides numerous opportunities to correct failures and to tune for high performance and aging effects.

B. Flash Technology

Many research efforts have studied flash devices and their use in memory. A short list includes [32], [33]. These papers report details of flash devices and their characterization. However, they do not describe the use of flash transistors for logic circuits. A good deal of work in flash has been reported in the area of architectural techniques to increase flash memory endurance.
Some representative works include wear leveling techniques, which are used in flash-based memory blocks [34] to compensate for the fact that flash transistors can typically be written only a finite number of times (10k to 100k) [11], [12]. In traditional flash memory, wear leveling is performed at the architectural level to spread the wear of the cells. The authors of [35] present a design flow to implement flash-based digital circuits at the block level. These efforts present results for a programmable logic array style cell design and illustrate its use in a modified standard-cell style VLSI design flow. In contrast, the work of this paper focuses on threshold logic and is envisioned for use in a traditional standard-cell based flow. An FTL cell can replace a D flip-flop and part of its logic cone in any CMOS netlist. To the best of our knowledge, there has been no work prior to this paper that describes the synthesis and detailed electrical characterization of sequential flash-based threshold logic cells.

VII. CONCLUSION

In this paper, we proposed a novel threshold logic cell (FTL) using flash transistors. A modified perceptron learning algorithm was also proposed to program the FTL cell. Substantial improvements in area (79.7%), power (61.1%) and performance (42.5%) were demonstrated for the FTL cells relative to conventional 40nm standard-cell based designs of the same functions. By adding a capacitor to introduce a handicap in the FTL cell during simulation, this paper shows that the learning algorithm counters the effect of the handicap by generating more robust solutions. Robustness against PVT variations was demonstrated using 100K Monte Carlo simulations, which showed a 100% yield. We also demonstrated that FTL cells are amenable to dynamic voltage scaling and to post-silicon correction of setup and hold time violations.

REFERENCES

[1] M. J. Avedillo and J. M. Quintana. A Threshold Logic Synthesis Tool for RTD Circuits.
In Euromicro Symposium on Digital System Design, DSD '04, pages 624–627, Washington, DC, USA, 2004. IEEE Computer Society.
[2] Krzysztof Berezowski and Sarma Vrudhula. Automatic design of binary multiple-valued logic gates on the RTD series. In Eighth Euromicro Conf. on Digital System Design, Porto, Portugal, Aug. 2005.
[3] P. Gupta and N. K. Jha. An algorithm for nanopipelining of RTD-based circuits and architectures. IEEE Transactions on Nanotechnology, 4(2):159–167, March 2005.
[4] R. Zhang, P. Gupta, and N. K. Jha. Synthesis of Majority and Minority Networks and Its Applications to QCA, TPL and SET Based Nanotechnologies. International Conference on VLSI Design, pages 229–234, 2005.
[5] S. Muroga. Threshold Logic and its Applications. 1971.
[6] K. Siu, V. Roychowdhury, and T. Kailath. Discrete Neural Computation: A Theoretical Foundation. Prentice-Hall, Inc., 1995.
[7] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling. In IEEE DATE, March 2013.
[8] R. Perricone, I. Ahmed, Z. Liang, M. G. Mankalale, X. S. Hu, C. H. Kim, M. Niemier, S. S. Sapatnekar, and J. Wang. Advanced spintronic memory and logic for non-volatile processors. In DATE, March 2017.
[9] J. Yang, N. Kulkarni, S. Yu, and S. Vrudhula. Integration of threshold logic gates with RRAM devices for energy efficient and robust operation. In IEEE/ACM NANOARCH, July 2014.
[10] N. Kulkarni, J. Yang, J. S. Seo, and S. Vrudhula. Reducing Power, Leakage, and Area of Standard-Cell ASICs Using Threshold Logic Flip-Flops. IEEE TVLSI, 24(9), Sept 2016.
[11] D. Jung et al. A Group-based Wear-leveling Algorithm for Large-capacity Flash Memory Storage Systems. In ACM CASES, 2007.
[12] S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In ACM FAST, Feb. 2010.
[13] F. Rosenblatt.
The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[14] V. Beiu, J. M. Quintana, M. J. Avedillo, and R. Andonie. Differential Implementations of Threshold Logic Gates. In Proceedings of the IEEE International Symposium on Signals, Circuits and Systems, 2003.
[15] R. Fowler and L. Nordheim. Electron Emission in Intense Electric Fields. Proc. Royal Soc. of London. Series A, 119(781), May 1928.
[16] M. Abusultan and S. P. Khatri. Implementing Low Power Digital Circuits using Flash Devices. In IEEE/ACM ICCD, October 2016.
[17] Saburo Muroga. Threshold Logic and its Applications. Wiley-Interscience, New York, 1971.
[18] https://sites.google.com/view/5-input-threshold-functions/
[19] A. Neutzling, J. M. Matos, A. I. Reis, R. P. Ribas, and A. Mishchenko. Threshold logic synthesis based on cut pruning. In IEEE/ACM ICCAD, Nov 2015.
[20] Dimitri Kagaris and Spyros Tragoudas. Maximum Weighted Independent Sets on Transitive Graphs and Applications. Integr. VLSI J., 27:77–86, January 1999.
[21] Sandeep Dechu, Manoj Kumar Goparaju, and Spyros Tragoudas. A metric of tolerance for the manufacturing defects of threshold logic gates. 21st IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT'06), pages 318–326, October 2006.
[22] Manoj Kumar Goparaju and Spyros Tragoudas. An ATPG methodology using parametric fault model for defects in threshold logic gate networks. WSEAS Transactions on Circuits and Systems, 5(8):1206–1211, August 2006.
[23] A. Neutzling, J. M. Matos, A. Mishchenko, A. Reis, and R. P. Ribas. Effective logic synthesis for threshold logic circuit design. IEEE TCAD, 2018.
[24] V. Beiu. A survey of perceptron circuit complexity results. In IEEE IJCNN, volume 2, pages 989–994, July 2003.
[25] P. Celinski, S. D. Cotofana, J. F. Lopez, S. F. Al-Sarawi, and D. Abbott.
State of the art in CMOS threshold logic VLSI gate implementations and systems. In IEEE VCAL, April 2003.
[26] V. Bohossian, P. Hasler, and J. Bruck. Programmable neural logic. In IEEE ISIS, Oct 1997.
[27] E. Rodriguez-Villegas, J. M. Quintana, M. J. Avedillo, and A. Rueda. High-speed low-power logic gates using floating gates. In IEEE ISCAS, volume 5, May 2002.
[28] S. Savas, H. Hesham, T. Darwin, and C. Gregory. Reconfigurable threshold logic gates with nanoscale DG-MOSFETs. Elsevier Solid-State Electronics, 51(10), 2007.
[29] S. N. Mozaffari and S. Tragoudas. Maximizing the number of threshold logic functions using resistive memory. IEEE TNANO, 17(5), Sep. 2018.
[30] S. N. Mozaffari, S. Tragoudas, and T. Haniotakis. A Generalized Approach to Implement Efficient CMOS-Based Threshold Logic Functions. IEEE TCSI, 65(3), March 2018.
[31] Jinghua Yang, Joseph Davis, Niranjan Kulkarni, Jae-sun Seo, and Sarma Vrudhula. Dynamic and Leakage Power Reduction of ASICs Using Configurable Threshold Logic Gates. In Proc. IEEE Custom Integrated Circuits Conf. (CICC), San Jose, CA, Sept. 2015.
[32] H. An, K. Kim, S. Jung, H. Yang, K. Kim, and Y. Song. The threshold voltage fluctuation of one memory cell for the scaling-down NOR flash. In IEEE ICNIDC, Sep. 2010.
[33] E. Choi and S. Park. Device considerations for high density and highly reliable 3D NAND flash cell in near future. In IEEE IEDM, Dec 2012.
[34] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling. In IEEE/ACM MICRO, Dec 2009.
[35] M. Abusultan and S. P. Khatri. A Flash-based Digital Circuit Design Flow. In IEEE/ACM ICCAD, Nov 2016.
[36] J. Rajendran, H. Manem, R. Karri, and G. S. Rose. Memristor based programmable threshold logic array. In 2010 IEEE/ACM International Symposium on Nanoscale Architectures, June 2010.
[37] G. S. Rose, J. Rajendran, H. Manem, R.
Karri, and R. E. Pino. Leveraging memristive systems in the construction of digital logic circuits. Proceedings of the IEEE, 100(6):2033–2049, June 2012.
[38] M. Soltiz, D. Kudithipudi, C. Merkel, G. S. Rose, and R. E. Pino. Memristor-based neural logic blocks for nonlinearly separable functions. IEEE Transactions on Computers, 62(8):1597–1606, Aug 2013.
[39] M. Soltiz, C. Merkel, D. Kudithipudi, and G. S. Rose. RRAM-based adaptive neural logic block for implementing non-linearly separable functions in a single layer. In 2012 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), July 2012.
[40] Georgios Detorakis, Sadique Sheik, Charles Augustine, Somnath Paul, Bruno U. Pedroni, Nikil D. Dutt, Jeffrey L. Krichmar, Gert Cauwenberghs, and Emre Neftci. Neural and synaptic array transceiver: A brain-inspired computing framework for embedded learning. In Front. Neurosci., 2017.
[41] M. Uddin and G. Rose. A practical sense amplifier design for memristive crossbar circuits (PUF). 2018.
[42] S. Sayyaparaju, G. Chakma, S. Amer, and G. S. Rose. Circuit techniques for online learning of memristive synapses in CMOS-memristor neuromorphic systems. In Proceedings of the Great Lakes Symposium on VLSI 2017, GLSVLSI '17, 2017.
[43] X. Yao, J. Harms, A. Lyle, F. Ebrahimi, Y. Zhang, and J. Wang. Magnetic tunnel junction-based spintronic logic units operated by spin transfer torque. IEEE Transactions on Nanotechnology, Jan.
[44] S. Patil, A. Lyle, J. Harms, D. J. Lilja, and J. Wang. Spintronic logic gates for spintronic data using magnetic tunnel junctions. In 2010 IEEE International Conference on Computer Design, Oct 2010.
