Threshold Logic in a Flash

Authors: Ankit Wagle, Gian Singh, Jinghua Yang

Ankit Wagle*, Gian Singh*, Jinghua Yang*, Sunil Khatri†, Sarma Vrudhula*
*(awagle1, gsingh58, jinghua.yang, vrudhula)@asu.edu, †sunil.khatri@tamu.edu
*School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe AZ 85281
†Dept. of Electrical and Computer Engineering, Texas A&M University, College Station TX

Abstract: This paper describes a novel design of a threshold logic gate (a binary perceptron) and its implementation as a standard cell. This new cell structure, referred to as flash threshold logic (FTL), uses floating gate (flash) transistors to realize the weights associated with a threshold function. The threshold voltages of the flash transistors serve as a proxy for the weights. An FTL cell can be equivalently viewed as a multi-input, edge-triggered flip-flop which computes a threshold function on a clock edge. Consequently, it can be used in the automatic synthesis of ASICs. The use of flash transistors in the FTL cell allows programming of the weights after fabrication, thereby preventing discovery of its function by a foundry or by reverse engineering. This paper focuses on the design and characteristics of the FTL cell. We present a novel method for programming the weights of an FTL cell for a specified threshold function using a modified perceptron learning algorithm. The algorithm is further extended to select weights that maximize the robustness of the design in the presence of process variations. The FTL circuit was designed in 40nm technology, and simulations that include layout-extracted parasitics demonstrate significant improvements in area (79.7%), power (61.1%), and performance (42.5%) when compared to equivalent implementations of the same functions in conventional static CMOS design. Weight selection targeting robustness is demonstrated using Monte Carlo simulations.
The paper also shows how FTL cells can be used for fixing timing errors after fabrication.

Index Terms: Threshold Logic, Floating Gate, Flash, Low Power, High Performance, Perceptron

I. INTRODUCTION AND MOTIVATION

Methods to optimize the performance, power and area (PPA) of static CMOS circuits have continuously improved over three decades, leaving few opportunities, if any, for further improvements. This suggests that if there are to be any further advances in improving PPA at the logic and circuit levels, the conventional way of computing logic functions has to be revisited. Although several nanotechnologies are being investigated as alternatives or enhancements to static CMOS (e.g. [1]–[4]), they remain at the research stage and large-scale adoption is still far in the future.

This paper introduces a new programmable ASIC primitive, referred to as a flash threshold logic (FTL) cell, that can be used to substantially improve all three PPA metrics of an ASIC. An FTL cell and its use in an ASIC are different from any other type of ASIC component previously reported. However, it is designed as a standard cell, so that it is fully compatible with the conventional ASIC design flow and can be processed by commercial design tools without any changes. In other words, it can easily be combined with conventional CMOS logic during synthesis, technology mapping, and place-and-route. However, it is functionally and structurally very different from a complex standard cell.

* The research was supported in part by NSF PFI award 1701241.

An FTL cell of $n$ inputs can realize any threshold function of $n$ or fewer variables. A threshold function $f(x_1, \dots, x_n)$ [5] is a unate Boolean function whose on-set and off-set are linearly separable, i.e.
there exists a vector of weights $W = (w_1, w_2, \dots, w_n)$¹ and a threshold $T$ such that

$$f(x_1, x_2, \dots, x_n) = 1 \iff \sum_{i=1}^{n} w_i x_i \ge T, \tag{1}$$

where $\sum$ here denotes the arithmetic sum. A threshold function can be equivalently represented by $(W, T) = (w_1, w_2, \dots, w_n; T)$.

Fig. 1: FTL Schematic (inputs $x_1, \dots, x_n$ with internal weights $w_1, \dots, w_n$, clock C, output Q = $f(x_1, \dots, x_n)$)

Figure 1 shows the schematic of the FTL cell, in which the weights $W$ are internal parameters of the cell. The schematic is meant to convey that the input-output behavior of an FTL cell may be viewed as an edge-triggered, multi-input flip-flop, whose output is a threshold function, registered at the rising edge of the clock signal C.

A distinctive characteristic of the FTL cell design is that the actual threshold function realized by an FTL instance within an ASIC is programmed after the circuit is manufactured. An FTL-based ASIC integrates flash or floating gate [7] transistors along with conventional MOSFETs within the FTL cell. Thus, unlike many of the emerging technologies [2], [3], [8], [9], an FTL cell employs mature IC technologies (CMOS and Flash) that can be commercially manufactured and integrated today.

A. FTL in ASIC Design – A Valuable Use Case

The focus of this paper is on the design of the FTL cell. Before proceeding to that, it will be instructive to understand its use in ASIC design [10]. The fact that an FTL is a programmable, multi-input flip-flop provides a unique and significant new opportunity to improve the PPA of ASICs.

Consider the logic netlist shown in Figure 2a, which has two registered outputs F and G. Suppose that the transitive fan-in (TFI) cones of F and G are traversed and two subcircuits A and B (see Figure 2b) are found that are threshold functions of their inputs. The remaining subcircuit is labeled as C.
Suppose that subcircuits A and B are each replaced by an FTL cell, programmed to realize A and B. This replacement is shown in Figure 2c, where the FTL cells are shown as black boxes. Now, subcircuit C would be re-synthesized to account for the changes in the delay of the FTL cells and the new loads that they present to the outputs of C.

¹ Without loss of generality, weights can be assumed to be positive integers [6], and for a given truth table of a threshold function, there is a weight vector whose sum is minimum [6].

Fig. 2: Use of FTL in ASIC design. (a) A logic netlist. (b) Identifying threshold functions in TFI cones of flip-flops. (c) An FTL-CMOS logic hybrid.

The circuit in Figure 2c would substantially improve the PPA of an ASIC for two reasons:
1) Subcircuits A and B and the two flip-flops are each replaced by an FTL cell, which has far fewer transistors, resulting in a significant reduction in area and power.
2) The clock-to-Q delay of an FTL cell is typically about 30% to 40% smaller than the delay of the standard cell realization of subcircuits A and B plus the clock-to-Q delay of regular flip-flops. In the FTL-CMOS hybrid design, this results in a substantial amount of slack (required time minus arrival time) on the outputs of subcircuit C, which in turn allows synthesis and technology mapping tools to drastically reduce the logic area of subcircuit C.

FTL-based ASICs also offer several other equally significant advantages not possible with conventional CMOS logic.
1) IP Protection: A CMOS ASIC with embedded FTL cells cannot be reverse-engineered by a foundry or any third party because the functions of the FTL cells are unknown (black boxes) at manufacturing time, as shown in Figure 2c.
2) Correcting Timing Errors: The fine-grained, post-manufacture programmability of the flash threshold voltages allows precise speed binning and correction of timing errors. This is not possible in traditional CMOS design.
3) Mitigating Aging Effects: By re-programming the flash devices in the field, our scheme allows for mitigating the effects of aging. This is also not possible in CMOS design.
4) High Endurance: Unlike flash memory, the FTL cell does not suffer from endurance issues. Flash transistors can endure a finite number of write cycles (1K to 100K) [11], [12]. In our approach, the flash devices will be programmed a few times (at most) after fabrication, and then possibly again in the field to adjust for aging effects.

B. Main Contributions

The remainder of the paper will focus on the design of the FTL cell and demonstrate its key characteristics through extensive and detailed electrical simulations using state-of-the-art device and circuit models and commercial tools. The main contributions of this work are summarized below.

• This paper introduces a novel circuit design of the FTL cell to realize all threshold functions of $n$ or fewer variables². The new design incorporates both flash transistors and conventional MOSFETs in a unique way to realize highly robust threshold logic circuits.
• The set of threshold voltages ($V_t$) of the flash transistors in the FTL cell serves as a proxy for the $[W, T]$ that defines the threshold function realized by an FTL cell. Since the threshold voltages of the flash transistors can be programmed with high precision [7], an FTL cell can implement weights with great fidelity. We introduce an algorithm that maps the weights of a given threshold function $f = [W, T]$ to the threshold voltages of the flash transistors. This is a complex, non-linear, multi-valued mapping. That is, several different $V_t$ assignments may correspond to a given $[W, T]$, each determined by the complex electrical and layout characteristics of the MOSFETs and flash transistors.
Given a layout-extracted netlist of an FTL cell, we present a novel modification of the classical perceptron learning algorithm (PLA) [13] that works in concert with HSPICE to determine one $V_t$ assignment of an FTL cell that computes $f = [W, T]$. This algorithm accounts for layout parasitics and process variations. Like the original PLA, the modified PLA is guaranteed to converge, ensuring that a solution ($V_t$) for the given layout of an FTL cell will be found in a finite number of steps if a solution exists.
• The fine-grained programmability of the threshold voltages of the flash transistors in an FTL cell is exploited to improve its robustness. Given that the mapping $[W, T] \Rightarrow V_t$ is multi-valued, we show how to direct our modified PLA to find a $V_t$ that will ensure that the FTL cell reliably computes the given threshold function in the presence of local and global process and environmental variations. Using this approach, substantial improvement in the robustness of the FTL cell is demonstrated using Monte Carlo simulations. This also shows how post-fabrication tuning of the threshold voltages can correct failures due to process variations, modify the delay to correct timing errors, improve a circuit's performance, or improve the performance characteristics of a design to alter the speed binning distribution in a manner that maximizes profit.

² In the experimental results, we find that $n = 5$ is a sufficiently good choice to demonstrate substantial improvements in PPA, since there are a large number (117) of threshold functions of $n$ or fewer variables.

C. Organization of the Paper

Section II gives a very brief overview of threshold logic and flash transistor technology. Sections III, IV and V contain the main body of this work. The architecture and operation of the FTL cell are described in Section III. This is followed by a description in Section IV of the modified PLA used to program an FTL to implement a given threshold function.
Section V contains an extensive set of experimental results, demonstrating the significant improvements in PPA of FTL cells over their CMOS equivalents, and validating several of the uses of post-fabrication programming/tuning of the flash devices. Before concluding the paper in Section VII, we present a brief and partial review of the prior art related to this paper in Section VI.

II. BACKGROUND

A. Threshold Logic

Equation (1) defines an $n$-input threshold function. An example of a 5-input threshold function is the 3-out-of-5 majority function:

$$f(a, b, c, d, e) = abc \lor abd \lor abe \lor acd \lor ace \lor ade \lor bcd \lor bce \lor bde \lor cde \equiv a + b + c + d + e \ge 3 \equiv [w_a, w_b, w_c, w_d, w_e; T] = [1, 1, 1, 1, 1; 3].$$

An XOR is a simple example of a non-threshold function.

The importance of threshold logic stems from the fact that many Boolean functions that require exponential size AND/OR networks can be realized by polynomial sized, fixed depth threshold networks [6]. From a practical perspective, nearly 70% of the functions in standard cell libraries are threshold functions. We will demonstrate that implementing threshold functions using conventional CMOS logic primitives is very inefficient compared to the FTL cell. In our approach, Equation (1) must be translated into a comparison of electrical quantities such as charge, current or voltage. This is the basis of many other threshold gate implementations as well [14].

B. Flash Transistors

Fig. 3: Flash Transistor Cross Section (labels: Control Gate, Floating Gate, Dielectric, Source, Drain, N+ regions)

Flash or floating gate transistors are dual-gate field effect transistors (DGFETs). The first gate is called a control gate and the second is a floating gate (see Figure 3). The control gate is similar to the gate of a traditional MOSFET. The floating gate is inserted between the substrate and the control gate, and is electrically and physically isolated.
Hence, current cannot flow into (or out of) the floating gate unless electrons are forced to enter (or leave) the floating gate from (or to) the substrate by a phenomenon known as Fowler-Nordheim (FN) tunneling [15].

A flash device is programmed by holding its body, source and drain nodes at ground and applying a high voltage (10-20 Volts) to the control gate. The resulting electric field forces electrons to tunnel from the substrate into the floating gate, increasing the threshold voltage of the flash transistor. The resulting threshold voltage depends on the number of electrons that tunnel into the floating gate, which depends on the duration of the programming pulse. Significantly, the threshold voltage of a flash transistor can be adjusted with a fine granularity [7]. Once electrons are trapped in the floating gate, they remain trapped for many years [11], [12], or until removed by an erase operation. A flash transistor can be erased by holding the control gate at ground, floating the drain and source nodes, and applying a high voltage at the body node. Erasing is performed simultaneously on all the transistors which share a common body node.

III. FLASH THRESHOLD LOGIC (FTL) CELL

Figure 4 shows the architecture of the FTL cell. It has five main components: the left input network (LIN), the right input network (RIN), a sense amplifier (SA), an output latch (LA) and the flash transistor programming logic (P).

The LIN and RIN consist of two sets of inputs $(\ell_1, \dots, \ell_n)$ and $(r_1, \dots, r_n)$, respectively, with each input in series with a flash transistor. In our implementation, $\ell_i = r_i$ for all $i$. The conductivity of these two networks is determined by the state of the inputs and the threshold voltages of the flash transistors. The assignment of signals to the LIN and RIN is done to ensure sufficient difference in conductivity across all minterm pairs $(m_i, m_j)$ such that $f(m_i) \ne f(m_j)$.
The FTL cell has two differential signals $N1$ and $N2$, which serve as inputs to an SR latch. When $[N1, N2] = [0, 1]$, the latch is set and the output $Y = 1$; when $[N1, N2] = [1, 0]$, the latch is reset and $Y = 0$. The magnitudes of the two sides of the inequality (1) in the definition of a threshold function are mapped to the conductance $G_L$ of the LIN and $G_R$ of the RIN, such that $[N1, N2] = [0, 1] \iff G_L > G_R$ and $[N1, N2] = [1, 0] \iff G_L < G_R$. As stated earlier, the flash transistor threshold voltages serve as a proxy for the weights of the threshold function: the higher the weight, the lower the threshold voltage. For a given threshold function, this non-linear monotonic relationship is learnt using a modified perceptron learning algorithm described in Section IV.

The FTL cell has three modes: regular, erase and programming. The $V_t$ values of the flash transistors are set in the programming mode and erased in the erase mode. The evaluation takes place in regular mode.

FTL Regular Mode: In this mode PROG = ERASE = 0. Assume that the $V_t$ values of the flash transistors have been set to appropriate values corresponding to the weights of the threshold function, and that their gates are being driven to 1 by setting HiV to VDD, $FC_j$ to 0V and all $FT_i$ to 0V. When $CLK = 0$, the circuit is reset. In this phase, the nodes $N5$ and $N6$ of LIN and RIN are connected to the supply, $N5 = N6 = 0$, and $N1 = N2 = 1$. Therefore, the output $Y$ remains unchanged.

Assume now that an on-set minterm is applied to the inputs in the LIN and RIN. With properly assigned $V_t$ values to the flash transistors, suppose that $G_L > G_R$ for the given minterm. When $CLK: 0 \to 1$, both the LIN and RIN will conduct, and $N5$ and $N6$ will both transition from $0 \to 1$. Assuming $G_L > G_R$, $N5$ rises faster than $N6$, and hence $N5$ will make $M7$ active before $N6$ makes $M8$ active. This will start to discharge $N1$ before $N2$. When $N1$ falls below the $V_t$ of $M6$, it will stop further discharge of $N2$ and turn on $M3$, resulting in $N2: 0 \to 1$. Finally, $[N1, N2] = [0, 1]$ sets the SR latch, resulting in $Y = 1$. For an off-set minterm, $G_L < G_R$, and $[N1, N2] = [1, 0]$, resulting in $Y = 0$.

Fig. 4: FTL Cell Architecture: Input networks LIN and RIN drive the sense amplifier with current based on weighted inputs. Input weights are implemented by modulating the conductivity of LIN and RIN using flash transistors. The sense amplifier evaluates the threshold function and drives the latch to produce the output Y. An FTL cell is programmed by sending high voltage pulses to the flash transistors' gates via the Programming Logic.

The conventional circuit structures used in flash memories are not suitable for programming an FTL cell because it also has to perform logic operations. Consequently, we present a new programming interface for an off-chip programming circuit to set the $V_t$ values of any FTL cell. During flash programming, this interface uses the $FC_j$ signal to select the $j$th FTL cell and the $FT_i$ signal to select the $i$th flash transistor of the selected FTL cell.

FTL Programming Mode: (ERASE=0, PROG=1, CLK=0, $FT_i$=0, $FC_j$=0, HiV=20V). The ERASE and PROG signals turn on M12 and M13 and turn off M14. In this state, the source of the flash transistor is floating while the drain and bulk are connected to ground. By activating the appropriate transistors using the $FT_i$ and $FC_j$ signals, high voltage pulses are passed on the HiV line through $MC_j$ and $MT_i$ to the gate of the flash transistor to set the desired threshold voltage ($V_t$).

FTL Erase Mode: (ERASE=1, PROG=1, CLK=0, $FT_i$=0, $FC_j$=0, HiV=-20V). M12 is turned off by the ERASE signal. Both the source and drain of the flash transistors are floating in this state, while the bulk is connected to the ground.
A negative HiV pulse at the gate terminal of all the flash transistors in this state will tunnel the charge from the floating gate, thereby erasing the flash transistor.

IV. MODIFIED PERCEPTRON LEARNING ALGORITHM

In this section, we describe an algorithm to determine the vector of flash transistor threshold voltages for a given threshold function $f = [W, T]$. The problem is to find a mapping between the Boolean space $B^n$ and the conductivity space $(G_L, G_R)$ such that $G_L > G_R$ iff $\sum w_i x_i \ge T$ (i.e. for an on-set minterm), and $G_L < G_R$ iff $\sum w_i x_i < T$ (i.e. for an off-set minterm). This mapping is depicted in Figure 5.

Fig. 5: Transformation from Boolean space to conductivity space; the hyperplane gets converted into a line.

$G_L$ and $G_R$ are non-linear functions of the flash transistor threshold voltages, the time-varying drain and source voltages of the input transistors, and the layout parasitics that vary from instance to instance. To account for these dependencies, $G_L$ and $G_R$ must, in principle, be obtained by solving a set of differential equations, an approach that is not practical. We next show how to simultaneously solve the differential equations numerically and perform the binary classification by a modified version of the classical perceptron learning algorithm (PLA) [13].

The PLA starts with an initial hyperplane in the Boolean space and iteratively adjusts it until all the on-set and off-set minterms fall on opposite sides of the hyperplane. Each minterm corresponds to some point in the $(G_L, G_R)$ space. Our modified PLA iteratively adjusts the $V_t$ values of the flash transistors such that points in the conductivity space that correspond to the on-set and off-set minterms fall on the appropriate side of the line $G_L = G_R$ (Fig. 5). We use HSPICE to determine whether any point falls above or below this line. A description of the modified PLA follows.
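Before that description, the classical PLA on which it builds can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `perceptron_learn`, the integer unit-step update, the zero initialization and the sweep limit are all choices made for this sketch. It learns an integer pair $(W, T)$ for the 3-out-of-5 majority function and reports failure for XOR, which is not a threshold function.

```python
from itertools import product

def perceptron_learn(truth_table, n, max_sweeps=1000):
    """Classical PLA sketch: return integer (w, T) such that
    f(x) = 1 iff sum(w_i * x_i) >= T, or None if no separating
    hyperplane is found within max_sweeps passes over the minterms."""
    w, T = [0] * n, 0
    for _ in range(max_sweeps):
        mistakes = 0
        for x, want in truth_table.items():
            got = int(sum(wi * xi for wi, xi in zip(w, x)) >= T)
            if got != want:
                # Move the hyperplane toward a misclassified on-set minterm,
                # or away from a misclassified off-set minterm.
                sign = 1 if want == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                T -= sign
                mistakes += 1
        if mistakes == 0:
            return w, T
    return None

# 3-out-of-5 majority is a threshold function; 2-input XOR is not.
majority = {x: int(sum(x) >= 3) for x in product((0, 1), repeat=5)}
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

learned = perceptron_learn(majority, 5)
print("majority:", learned)
print("xor:", perceptron_learn(xor, 2))
```

The learned weights need not equal the minimum-sum vector $[1, 1, 1, 1, 1; 3]$; any $(W, T)$ that reproduces the truth table is an acceptable result of the procedure.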
The threshold voltages of the flash transistors associated with the input transistors in the LIN and RIN are labeled $V_1, V_2, \dots, V_n$. The $i$th transistor in both LIN and RIN has a threshold voltage $V_i$. In addition, there are two special flash transistors, whose threshold voltages are $V_L$ and $V_R$, associated with the LIN and RIN, respectively. For a threshold function $f = (w_1, w_2, \dots, w_n; T)$, the $V_i$, $1 \le i \le n$, correspond to the weights $w_i$ of the threshold function, whereas only one of $V_L$ or $V_R$ is associated with the threshold $T$ of $f$. If $V_L$ is associated with $T$, then $V_R = V_{DD}$, effectively turning it off. If $V_R$ is associated with $T$, then $V_L = V_{DD}$. The use of additional flash devices on both sides of the FTL cell allows for extra programming flexibility. The induced symmetry also balances the parasitics of the LIN and the RIN.

For the truth table ($TT$) of $f$, the modified PLA applies all the minterms of $f$ to the FTL cell, and records the HSPICE response in an array called $OT$ (output table). For a given minterm $m_i$, if $TT(m_i) = OT(m_i)$ then the response is called a correct response; otherwise it is called an incorrect response. An FTL cell is completely programmed if the recorded response for every minterm is correct. Until the FTL cell is completely programmed, at least one minterm would generate an incorrect response.

In the event of an incorrect response associated with minterm $m_i$, the modified PLA adjusts the threshold voltages of all flash transistors associated with the ON input transistors within the interval $[\delta, V_{DD} - \delta]$, by a minimum increment $\delta$, using the following equations ($k$ denotes the iteration number of the algorithm):

$$V_i^{k+1} = \begin{cases} V_i^{k} - \delta m_i, & m_i \cdot W \ge T \\ V_i^{k} + \delta m_i, & m_i \cdot W < T. \end{cases} \tag{2}$$

Equation (2) is quite easy to understand. The term $\delta m_i$ is simply a vector which has a value $\delta$ at all locations where $m_i$ is 1, and zero elsewhere.
For instance, $\delta(1, 0, 1, 1, 0) = (\delta, 0, \delta, \delta, 0)$. Suppose $m_i$ is an on-set minterm for which the response was incorrect. This means that $G_L < G_R$. Therefore $G_L$ needs to be increased for minterm $m_i$. Hence the threshold voltages of all flash transistors that are connected to the input transistors that are ON for minterm $m_i$ should be decreased by $\delta$. Similarly, if $m_i$ is an off-set minterm, then the threshold voltages of the same flash transistors must be increased by $\delta$. This is what is expressed in Equation (2).

Since the $V_i$ values are bounded above and below, it might not be possible to satisfy the truth table using the $V_i$ alone. In such cases, the algorithm will resort to adjusting $V_L$ and $V_R$ using the same principle as in Equation (2). If $m_i$ is an on-set minterm that was incorrect, then $G_R$ should be reduced. Therefore, $V_R$ is incremented by $\delta$, until its upper bound is reached. If this is not sufficient, then $G_L$ has to be increased. Hence, $V_L$ is decremented.

Given a threshold function and a sufficiently small $\delta$, the modified PLA will converge to a feasible threshold voltage assignment $V_t^*$ for the FTL cell [13]. For an $n$-input threshold function, a pessimistic upper bound on the number of iterations is given by $k_{max} = 2n \|V_t^*\|^2 / \delta^2$. For $n = 5$ and $\delta = 0.02\,V$, $k_{max} = 2500 \|V_t^*\|^2$.

A. Training for Robustness

The modified PLA does not consider the relative location of the points with respect to the metastability region around the line $G_L = G_R$ (see Figure 5b). Even though minterms are classified correctly, they can be arbitrarily close to the line. The further away a minterm is from the line, the easier (and faster and more robust) it will be for the sense amplifier to detect the difference between $N5$ and $N6$, and discharge the appropriate side ($N1$ or $N2$) first.
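The update rule of Equation (2), and the distance from the line $G_L = G_R$ discussed above, can be made concrete with a small sketch. It rests on loudly stated assumptions: the function `toy_response` is a hypothetical linear stand-in for the HSPICE simulation (each ON input contributes a conductance proportional to $V_{DD} - V_i$ to $G_L$, and $G_R$ is a fixed reference set by a single `v_ref`), the adjustment of $V_L$ and $V_R$ is omitted, and voltages are kept as integer millivolts to avoid rounding effects. Only the clamped $\pm\delta$ update on the ON inputs follows Equation (2).

```python
from itertools import product

VDD_MV = 900    # supply voltage in mV (0.9 V, as in the experiments)
DELTA_MV = 20   # minimum Vt increment in mV (0.02 V, as in the text)

def threshold_fn(weights, T):
    """Truth table of [W; T]: f(x) = 1 iff sum(w_i * x_i) >= T."""
    return {m: int(sum(w * x for w, x in zip(weights, m)) >= T)
            for m in product((0, 1), repeat=len(weights))}

def toy_response(m, vt, v_ref):
    """Hypothetical stand-in for HSPICE (an assumption of this sketch):
    returns (output, G_L, G_R) in arbitrary conductance units."""
    g_left = sum(VDD_MV - v for x, v in zip(m, vt) if x)
    g_right = VDD_MV - v_ref
    return int(g_left > g_right), g_left, g_right

def train(truth_table, n, v_ref, max_sweeps=1000):
    """Sweep over all minterms, applying Equation (2) to each incorrect one."""
    lo, hi = DELTA_MV, VDD_MV - DELTA_MV
    vt = [hi] * n  # start with every flash transistor nearly off
    for _ in range(max_sweeps):
        errors = 0
        for m, want in truth_table.items():
            got, _, _ = toy_response(m, vt, v_ref)
            if got == want:
                continue
            errors += 1
            # On-set minterm wrong: lower Vt of ON inputs to raise G_L.
            # Off-set minterm wrong: raise Vt of ON inputs to lower G_L.
            step = -DELTA_MV if want == 1 else DELTA_MV
            vt = [min(hi, max(lo, v + step)) if x else v
                  for x, v in zip(m, vt)]
        if errors == 0:
            return vt
    raise RuntimeError("no feasible Vt assignment found in the toy model")

tt = threshold_fn([1, 1, 1, 1, 1], 3)   # 3-out-of-5 majority
vt = train(tt, 5, v_ref=400)
margin = min(abs(gl - gr) for _, gl, gr in
             (toy_response(m, vt, 400) for m in tt))
print("Vt (mV):", vt, "smallest |G_L - G_R|:", margin)
```

In this toy model nothing pushes the solution away from the line, so the smallest $|G_L - G_R|$ it reports can be very small, which is the weakness that training for robustness is meant to address.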
Our approach to making the FTL cell highly robust is to introduce an additional capacitance $C_1$ on node $N1$ when classifying an on-set minterm, and to determine the maximum value of $C_1$ for which the modified PLA converges. This handicaps node $N1$ and directs the algorithm to find a solution that increases $G_L$ more than it increases $G_R$. Similarly, we add a capacitance $C_0$ on node $N2$ when classifying an off-set minterm. The corresponding threshold voltages found by the modified PLA will increase the gap between $G_L$ and $G_R$, which makes the cell much more robust and, as a direct consequence, also improves its speed. Note that $C_0$ and $C_1$ are introduced in the simulations for improving the training solution only, and are not part of the FTL cell.

V. EXPERIMENTAL RESULTS

A. Experiment Setup

A 5-input FTL cell was designed and a complete layout (including the programming devices) was created using the TSMC 40nm LP library. The flash transistor models were obtained from [16] and were suitably modified to reflect the characteristics and variations of the TSMC 40nm library. The design rules for the flash transistors were obtained from ITRS. The layout of the FTL cell was created as a standard cell with an area of 15.6 µm². For reference, if X represents the drive strength, an X4 DFF and an X4 NAND gate have areas of 5.6 µm² and 2.8 µm² respectively, while their delay-optimized X8 counterparts have areas of 14.347 µm² and 7.3 µm² respectively. The {setup, C2Q} of an X4 DFF is {67ps, 168ps}.

There are a total of 117 distinct threshold functions of 5 or fewer variables. A numbered list of these is given in [17] and can also be accessed at [18]. In this section, we use the same numbering scheme as in [17] to identify the functions.
In the sequel, the FTL cell trained to implement the threshold function numbered $n$ in [17] will be referred to as $FTL_n$, and the corresponding CMOS implementation will be denoted as $CMOS_n$. The threshold function itself will be denoted as $F_n$.

B. Training Iterations

The modified PLA was used to train the FTL cell for robustness (see Section IV-A) for all 117 functions. Figure 6 shows the number of iterations needed for training for each of the 117 functions. The actual number of iterations was about 10X lower than the theoretical upper bound presented in Section IV.

Fig. 6: Iteration count for the modified perceptron learning algorithm for all 117 functions of 5 or fewer variables.

C. Area, Delay and Power Comparison

Each of the 117 functions was implemented as an FTL cell, and also synthesized by Cadence Genus and placed and routed using Cadence Innovus, using the TSMC 40nm LP standard cells. The total delay (logic delay + setup time + clock-to-Q delay) and power values were determined by simulating the circuits at 25°C at 20% input switching activity. Figure 7 shows that each of the FTL implementations of the 117 functions has substantially smaller area, power and delay when compared to the CMOS equivalent. The averaged improvements of FTL over CMOS are: area (79.5%), delay (42.5%) and power (61.1%).

Fig. 7: PPA improvements of FTL over CMOS implementations.

Figure 8 compares the leakage power of the FTL and CMOS implementations of the 117 functions. The functions are arranged in ascending order of their CMOS leakage values. Unlike the CMOS implementations, the leakage power of the FTL implementations is nearly constant. Also plotted is the area trend line of the CMOS implementations, to illustrate the strong correlation of leakage power with area. The few FTL implementations that had higher leakage (shown circled) were all small logic primitives.
Nevertheless, the total power (see Figure 7) of the FTL implementations of even these functions is far less than that of the CMOS implementations. These functions can be avoided if leakage minimization is the primary design goal.

Fig. 8: Leakage power of FTL versus CMOS implementations.

D. Experiments on Training for Robustness

This experiment demonstrates the robust PPA training method, described in Section IV-A, to improve yield. The test function chosen was $F_{115} = [W; T] = [4, 1, 1, 1, 1; 5] = ab + ac + ad + ae$. The experiment consisted of training multiple versions of $FTL_{115}$ for various values of the parasitic capacitances $C_1$ and $C_0$, and for each solution, performing 100K Monte Carlo simulations with local and global process variations³, and checking if the truth table was correctly realized.

Table I shows the delay and yield for various values of $C_1$ and $C_0$. The functional yield was improved from 13% to 100% (i.e. truth tables of all 100K instances were verified to be correct) by increasing the values of $C_1$ and $C_0$. There are two important observations to be made here. First, even though the weights of $b$, $c$, $d$, $e$ are equal, the corresponding flash transistors received different threshold voltages ($V_2$, $V_3$, $V_4$, $V_5$). This shows that the perceptron learning algorithm, working in concert with HSPICE, accounts for the layout parasitics. Second, the delay improves with increasing robustness, due to the increase in the difference between the voltages at nodes N5 and N6 (see Section IV).
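As a sanity check on the test function, the equivalence between the weight-threshold form $[4, 1, 1, 1, 1; 5]$ and the Boolean form $ab + ac + ad + ae$ stated above can be verified exhaustively over all 32 minterms. The short sketch below does this; the helper names `f115_threshold` and `f115_boolean` are chosen for this illustration only.

```python
from itertools import product

def f115_threshold(a, b, c, d, e):
    """Weight-threshold form [4, 1, 1, 1, 1; 5] via Equation (1)."""
    return int(4 * a + b + c + d + e >= 5)

def f115_boolean(a, b, c, d, e):
    """Sum-of-products form ab + ac + ad + ae."""
    return int((a and b) or (a and c) or (a and d) or (a and e))

minterms = list(product((0, 1), repeat=5))
equivalent = all(f115_threshold(*m) == f115_boolean(*m) for m in minterms)
on_set_size = sum(f115_threshold(*m) for m in minterms)
print(equivalent, on_set_size)
```

The two forms agree on every minterm because the weight of 4 on $a$ makes $a = 1$ necessary, after which any one of $b$, $c$, $d$, $e$ reaches the threshold of 5.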
TABLE I: Multi-Corner Monte Carlo results with 100K simulations of $FTL_{115}$, trained for robustness using various capacitor values

C1, C0 (fF)   Average Vt Values (V): V1, V2, V3, V4, V5; Vl0, Vr0   Yield (%)   Delay (ps)
0.00          0.64, 0.74, 0.72, 0.74, 0.72; 1.00, 0.74              13          244
0.01          0.62, 0.72, 0.70, 0.74, 0.74; 1.00, 0.70              20          220
0.02          0.58, 0.74, 0.72, 0.74, 0.72; 1.00, 0.64              43          204
0.05          0.48, 0.68, 0.66, 0.70, 0.66; 1.00, 0.56              59          162
0.10          0.34, 0.56, 0.54, 0.60, 0.62; 1.00, 0.46              100         138

Fig. 9: Conductivity $G_L$ and $G_R$ of $FTL_{115}$ [TT, 0.9V, 25°C]. In the conductivity space, the gap between off-set minterms and on-set minterms increases when the training is done for robustness.

³ Several dozen parameters are varied in the HSPICE models provided by the vendor.

In Section IV-A we argued that training an FTL cell with a handicap in the form of a parasitic capacitance on N1 and N2 will improve the robustness by increasing the smallest gap in conductance between the LIN and the RIN. Figure 9 demonstrates this very important characteristic of the robust PPA algorithm for FTL. It is a plot of the conductivity space, i.e., $G_R$ versus $G_L$, of an FTL cell trained for the test function $F_{115}$, with and without the parasitic (handicap) capacitances. The blue points correspond to the $G_L$ and $G_R$ values of the on-set and off-set minterms of $F_{115}$ in the absence of the parasitic capacitances $C_1$ and $C_0$, and the orange points to the values in their presence ($C_1 = C_0 = 0.1$ fF).

Recall that for an on-set minterm $G_L > G_R$ and for an off-set minterm $G_R > G_L$. The plot clearly demonstrates that training with the parasitic capacitances dramatically improves the robustness in two ways. First, there is a significant increase (by 21%) in the shortest distance between the two closest on-set and off-set minterms, as indicated in Figure 9. Second, the increase in $G_L$ is greater than the increase in $G_R$, i.e.
$\Delta G_L / \Delta G_R > 1$ for the on-set minterms, and vice versa for the off-set minterms. Both of these effects reduce the contention in the sense amplifier when it decides the function output, which in turn directly improves the speed, resulting in higher robustness and higher performance.

E. Delay Distributions

This experiment compares the distributions of delays of FTL and CMOS implementations. We show the results for the function $F_{115} = [W; T] = [4, 1, 1, 1, 1; 5]$. The PVT corner setting was [P, V, T] = [TT, 0.9 V, 25 °C]. 100K Monte Carlo instances were generated for both $FTL_{115}$ and $CMOS_{115}$. The function of each of the 100K instances was verified against the truth table for correctness, for both $FTL_{115}$ and $CMOS_{115}$. The histograms of delays are shown in Figure 10. These clearly demonstrate the delay advantage of the FTL cell over its CMOS equivalent, even in the presence of process variations. The difference in standard deviation between the two is insignificant. Note that the FTL instances with large delays can be re-programmed to further reduce the delay. This capability is not available in the CMOS versions.

Fig. 10: Delay histogram of $FTL_{115}$ and $CMOS_{115}$ with 100K Monte Carlo simulations. PVT = [TT, 0.9 V, 25 °C].

TABLE II: Delay, total power and power-delay-product (PDP) of $FTL_{115}$, trained at $V_{DD} = 0.9$ V and $C_0 = C_1 = 0.1$ fF.

| Supply Voltage (V) | Flash Gate Voltage (V) | Power (µW) | Delay (ps) | PDP |
|---|---|---|---|---|
| 0.8 | 0.8 | 14.3 | 198.1 | 2837.1 |
| 0.85 | 0.825 | 20.5 | 157.6 | 3228.7 |
| 0.9 | 0.85 | 26.1 | 130.2 | 3396.9 |
| 0.95 | 0.875 | 40.3 | 111.2 | 4482.7 |
| 1 | 0.9 | 53.1 | 97.0 | 5148.6 |
| 1.05 | 0.925 | 76.0 | 86.4 | 6562.9 |
| 1.1 | 0.95 | 85.0 | 78.2 | 6644.0 |

F. Dynamic Voltage Scaling

Voltage scaling is a common mechanism to trade off performance against power. Table II shows the results of training $FTL_{115}$ at 0.9 V.
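As a quick cross-check of Table II, the sketch below recomputes the power-delay product from the power and delay columns and reports how much each quantity spreads between the lowest and highest supply voltage. The row values are copied from the table; the helper names are illustrative assumptions, and the small mismatch between the recomputed and reported PDP is presumably due to rounding of the published power and delay figures.

```python
# Rows of Table II: (supply V, flash gate V, power, delay in ps, reported PDP)
TABLE_II = [
    (0.80, 0.800, 14.3, 198.1, 2837.1),
    (0.85, 0.825, 20.5, 157.6, 3228.7),
    (0.90, 0.850, 26.1, 130.2, 3396.9),
    (0.95, 0.875, 40.3, 111.2, 4482.7),
    (1.00, 0.900, 53.1, 97.0, 5148.6),
    (1.05, 0.925, 76.0, 86.4, 6562.9),
    (1.10, 0.950, 85.0, 78.2, 6644.0),
]


def recomputed_pdp(row):
    """Power-delay product recomputed from the rounded table entries."""
    _, _, power, delay, _ = row
    return power * delay


def spread(values):
    """Ratio of the largest to the smallest value in a column."""
    return max(values) / min(values)


delay_ratio = spread([row[3] for row in TABLE_II])
power_ratio = spread([row[2] for row in TABLE_II])
pdp_ratio = spread([row[4] for row in TABLE_II])

# Largest relative deviation between recomputed and reported PDP.
max_rel_err = max(abs(recomputed_pdp(r) - r[4]) / r[4] for r in TABLE_II)

print(f"delay spread: {delay_ratio:.1f}X")
print(f"power spread: {power_ratio:.1f}X")
print(f"PDP spread:   {pdp_ratio:.1f}X")
print(f"max PDP deviation: {max_rel_err:.2%}")
```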
The FTL was programmed with the resulting set of flash threshold voltages, and then operated over the voltage range [0.8 V, 1.1 V]. To ensure proper operation across all voltages, the gate voltages of the flash transistors were also scaled in this experiment. This result demonstrates how a single $V_T$ assignment can be used for dynamic voltage scaling. Note that the delay varies by 2.5X, the power varies by 5.9X and the PDP (energy) varies by 2.3X, as the supply voltage varies over [0.8 V, 1.1 V]. This shows that the FTL cells offer a healthy power, delay, and energy tradeoff by voltage scaling.

G. Post-fabrication Timing Correction

The experiments described in Sections V-D, V-E and V-F all point to the flexibility of FTL due to its unique characteristic of allowing the flash transistor threshold voltages to be programmed after fabrication. It should come as no surprise then that this can also be used to correct timing errors.

Fig. 11: Datapath to demonstrate post-fab timing corrections.

Fig. 12: Correcting a setup time violation with an FTL cell after fabrication. C2Q of the FTL cell reduced from 180 ps to 142 ps.

Fig. 13: Correcting a hold time violation with an FTL cell after fabrication. C2Q of the FTL cell increased from 142 ps to 180 ps.

Figure 11 shows a small datapath that was constructed to demonstrate how setup time and hold time violations can be corrected after fabrication in an FTL design. The datapath is characterized by the clock-to-Q (C2Q) delay, the combinational delay (D2D) and the DFF specifications for setup ($DFF_{setup}$) and hold ($DFF_{hold}$) times. The clock is skewed by an appropriate amount $\Delta$ to generate either a setup time or a hold time violation. The violations are corrected by reprogramming the FTL cell to produce different C2Q values. Figure 12 shows how the data launched from FTL X misses the target clock edge at DFF Y, thereby violating setup time. To fix the setup time violation, the C2Q of FTL X is decreased.
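The effect of changing C2Q on timing slack can be illustrated with the textbook setup and hold inequalities. This is only a sketch under stated assumptions: the two C2Q values (180 ps and 142 ps) come from the captions of Figures 12 and 13, while the combinational delays, setup and hold times, clock period and skews are hypothetical numbers chosen to produce a violation, not values from the paper.

```python
def setup_slack(c2q, d2d, t_setup, period, skew):
    """Setup slack in ps; negative means the data misses the capturing edge.

    skew is the capture clock arrival time minus the launch clock arrival time.
    """
    return (period + skew) - (c2q + d2d + t_setup)


def hold_slack(c2q, d2d, t_hold, skew):
    """Hold slack in ps; negative means new data overwrites the old value too early."""
    return (c2q + d2d) - (t_hold + skew)


# C2Q values (ps) taken from the captions of Figures 12 and 13.
C2Q_SLOW, C2Q_FAST = 180, 142

# Hypothetical long path for the setup case (all values in ps).
SETUP_CASE = dict(d2d=300, t_setup=40, period=500, skew=-10)
# Hypothetical short path with a late capture clock for the hold case.
HOLD_CASE = dict(d2d=20, t_hold=30, skew=150)

setup_before = setup_slack(C2Q_SLOW, **SETUP_CASE)  # violation expected
setup_after = setup_slack(C2Q_FAST, **SETUP_CASE)   # fixed by a smaller C2Q
hold_before = hold_slack(C2Q_FAST, **HOLD_CASE)     # violation expected
hold_after = hold_slack(C2Q_SLOW, **HOLD_CASE)      # fixed by a larger C2Q

print("setup slack (ps):", setup_before, "->", setup_after)
print("hold slack (ps): ", hold_before, "->", hold_after)
```

With these assumed numbers, reducing C2Q turns a negative setup slack positive, and increasing C2Q does the same for the hold slack, which is the qualitative behavior the two figures describe.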
Similarly, Figure 13 shows how the data launched at FTL X gets captured by DFF Y one cycle early, thereby overwriting the old value at DFF Y. By increasing the C2Q value of FTL X, the old value at the input of Y is retained for a longer time, which satisfies the hold time condition. Since the FTL cells are programmed post-fabrication, the delay can also be modified after fabrication. Using the same idea of post-fabrication $V_T$ adjustment, an FTL cell can be reprogrammed to mitigate delay increases due to aging.

H. Chip-level Programming Architecture

Although the full architecture for programming the FTL cells in an ASIC is not presented here, this section describes the programming architecture in brief. An on-chip decoder architecture is used to address the flash transistors of the FTL cells during programming. The address for the decoder is sent into the chip using a serial communication protocol along with a programming clock. The high voltage needed for sending programming pulses to the flash transistors is generated by an off-chip voltage source and sent into the chip. The pin count overhead for programming is low (only 3 pins are needed). When the address is received, the decoder activates a specific flash transistor of a specific FTL cell for programming.

VI. RELATED WORK

A. Threshold Logic

The study of threshold functions and the development of threshold gates date back to the 1960s, culminating in the authoritative book by Muroga [17]. Since then, an extensive body of theoretical work, new circuit architectures and implementations has been published. References [24] and [25] provide a detailed survey of work prior to 2003. One of the earliest demonstrations of the operation of threshold logic gates using flash transistors was reported in [26], [27]. It was an analog design of a single cell to demonstrate proof of concept.
The focus has shifted to exploring the use of emerging devices such as RRAMs, STT-MTJs, and others, to implement threshold gates [9], [28], [29]. Several recent works have devised efficient algorithms for determining weights aimed at robust threshold gates [29], [30]. However, until recently, due to the lack of design tools and incompatibility with existing design methodologies, threshold logic remained outside mainstream VLSI design.

Recently, [10] reported an architecture of a threshold gate and showed how it can be integrated with the standard-cell ASIC design methodology using commercial tools. In addition, they reported significant improvements in PPA of an actual silicon implementation of an ASIC with threshold gates [31]. Their architecture, however, severely limits the number of threshold functions that can be implemented. This is because the weight $w_i$ associated with input $x_i$ is implemented by using $w_i$ transistors, each driven by signal $x_i$. Hence, their circuit has severe fan-in limitations. For instance, the design in [10] can only realize 11 of the 5-input threshold functions, whereas, as demonstrated here, the FTL-5 cell can realize all 117 functions. In addition, representing weights using multiple transistors significantly reduces the robustness and prevents it from scaling to lower geometries. Finally, the FTL cell is programmed after fabrication, which prevents copying by a foundry and offers numerous opportunities to correct failures and to tune for high performance and aging effects.

B. Flash Technology

Many research efforts have studied flash devices and their use in memory. A short list includes [32], [33]. These papers report details of flash devices and their characterization. However, they do not describe the use of flash transistors for logic circuits. A good deal of work in flash has been reported in the area of architectural techniques to increase flash memory endurance.
Some representative works include wear leveling techniques, which are used in flash-based memory blocks [34] to compensate for the fact that flash transistors can typically be written only a finite number of times (10K to 100K) [11], [12]. In traditional flash memory, wear leveling is performed at the architectural level to spread the wear of the cells. The authors of [35] present a design flow to implement flash-based digital circuits at the block level. These efforts present results for a programmable logic array style cell design and illustrate its use in a modified standard-cell style VLSI design flow. In contrast, the work of this paper focuses on threshold logic and is envisioned for use in a traditional standard-cell based flow. An FTL cell can replace a D flip-flop and some or part of its logic cone in any CMOS netlist. To the best of our knowledge, there has been no work prior to this paper which describes the synthesis and detailed electrical characterization of sequential flash-based threshold logic cells.

VII. CONCLUSION

In this paper, we proposed a novel threshold logic cell (FTL) using flash transistors. A modified perceptron learning algorithm was also proposed to program the FTL cell. Substantial area (79.7%), power (61.1%) and performance (42.5%) improvements of the FTL cells were demonstrated against the conventional 40nm standard-cell based designs of the same functions. By adding a capacitor to introduce a handicap in the FTL cell during simulation, this paper shows that the learning algorithm counters the effect of the handicap by generating more robust solutions. Robustness against PVT variations was demonstrated using 100K Monte Carlo simulations, which showed a 100% yield. We also demonstrated that FTL cells are amenable to dynamic voltage scaling, and to post-silicon tuning of setup and hold time violations.

REFERENCES

[1] M.J. Avedillo and J.M. Quintana.
A Threshold Logic Synthesis Tool for RTD Circuits. In Euromicro Symposium on Digital System Design, DSD ’04, pages 624–627, Washington, DC, USA, 2004. IEEE Computer Society.
[2] Krzysztof Berezowski and Sarma Vrudhula. Automatic design of binary multiple-valued logic gates on the rtd series. In Eight Euromicro Conf. on Digital System Design, Porto, Portugal, Aug. 2005.
[3] P. Gupta and N.K. Jha. An algorithm for nanopipelining of rtd-based circuits and architectures. Nanotechnology, IEEE Transactions on, 4(2):159–167, March 2005.
[4] R. Zhang, P. Gupta, and N. K. Jha. Synthesis of Majority and Minority Networks and Its Applications to QCA, TPL and SET Based Nanotechnologies. International Conference on VLSI Design, 0:229–234, 2005.
[5] S. Muroga. Threshold Logic and its Applications. 1971.
[6] K. Siu, V. Roychowdhury, and T. Kailath. Discrete Neural Computation: A Theoretical Foundation. Prentice-Hall, Inc., 1995.
[7] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling. In IEEE DATE, March 2013.
[8] R. Perricone, I. Ahmed, Z. Liang, M. G. Mankalale, X. S. Hu, C. H. Kim, M. Niemier, S. S. Sapatnekar, and J. Wang. Advanced spintronic memory and logic for non-volatile processors. In DATE, 2017, March 2017.
[9] J. Yang, N. Kulkarni, S. Yu, and S. Vrudhula. Integration of threshold logic gates with RRAM devices for energy efficient and robust operation. In IEEE/ACM NANOARCH, July 2014.
[10] N. Kulkarni, J. Yang, J. S. Seo, and S. Vrudhula. Reducing Power, Leakage, and Area of Standard-Cell ASICs Using Threshold Logic Flip-Flops. IEEE TVLSI, 24(9), Sept 2016.
[11] D. Jung et al. A Group-based Wear-leveling Algorithm for Large-capacity Flash Memory Storage Systems. In ACM CASES, 2007.
[12] S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In ACM FAST, Feb. 2010.
[13] F. Rosenblatt.
The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[14] V. Beiu, J.M. Quintana, M.J. Avedillo, and R. Andonie. Differential Implementations of Threshold Logic Gates. In Proceedings of the IEEE International Symposium on Signals, Circuits and Systems, 2003.
[15] R. Fowler and L. Nordheim. Electron Emission in Intense Electric Fields. Proc. Royal Soc. of London. Series A, 119(781), May 1928.
[16] M. Abusultan and S.P. Khatri. Implementing Low Power Digital Circuits using Flash Devices. In IEEE/ACM ICCD, October 2016.
[17] Saburo Muroga. Threshold Logic and its Applications. Wiley-Interscience New York, 1971.
[18] https://sites.google.com/view/5-input-threshold-functions/.
[19] A. Neutzling, J. M. Matos, A. I. Reis, R. P. Ribas, and A. Mishchenko. Threshold logic synthesis based on cut pruning. In IEEE/ACM ICCAD, Nov 2015.
[20] Dimitri Kagaris and Spyros Tragoudas. Maximum Weighted Independent Sets on Transitive Graphs and Applications. Integr. VLSI J., 27:77–86, January 1999.
[21] Sandeep Dechu, Manoj Kumar Goparaju, and Spyros Tragoudas. A metric of tolerance for the manufacturing defects of threshold logic gates. 21st IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT’06), pages 318–326, October 2006.
[22] Manoj Kumar Goparaju and Spyros Tragoudas. An atpg methodology using parametric fault model for defects in threshold logic gate networks. WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS, 5(8):1206–1211, August 2006.
[23] A. Neutzling, J. M. Matos, A. Mishchenko, A. Reis, and R. P. Ribas. Effective logic synthesis for threshold logic circuit design. IEEE TCAD, 2018.
[24] V. Beiu. A survey of perceptron circuit complexity results. In IEEE IJCNN, volume 2, pages 989–994, July 2003.
[25] P. Celinski, S. D. Cotofana, J. F. Lopez, S. F. Al-Sarawi, and D. Abbott.
State of the art in CMOS threshold logic VLSI gate implementations and systems. In IEEE VCAL, April 2003.
[26] V. Bohossian, P. Hasler, and J. Bruck. Programmable neural logic. In IEEE ISIS, Oct 1997.
[27] E. Rodriguez-Villegas, J. M. Quintana, M. J. Avedillo, and A. Rueda. High-speed low-power logic gates using floating gates. In IEEE ISCAS, volume 5, May 2002.
[28] S. Savas, H. Hesham, T. Darwin, and C. Gregory. Reconfigurable threshold logic gates with nanoscale DG-MOSFETs. Elsevier Solid-State Electronics, 51(10), 2007.
[29] S. N. Mozaffari and S. Tragoudas. Maximizing the number of threshold logic functions using resistive memory. IEEE TNANO, 17(5), Sep. 2018.
[30] S. N. Mozaffari, S. Tragoudas, and T. Haniotakis. A Generalized Approach to Implement Efficient CMOS-Based Threshold Logic Functions. IEEE TCSI, 65(3), March 2018.
[31] Jinghua Yang, Joseph Davis, Niranjan Kulkarni, Jae sun Seo, and Sarma Vrudhula. Dynamic and Leakage Power Reduction of ASICs Using Configurable Threshold Logic Gates. In Proc. IEEE Custom Integrated Circuits Conf. (CICC), San Jose, CA, Sept. 2015.
[32] H. An, K. Kim, S. Jung, H. Yang, K. Kim, and Y. Song. The threshold voltage fluctuation of one memory cell for the scaling-down NOR flash. In IEEE ICNIDC, Sep. 2010.
[33] E. Choi and S. Park. Device considerations for high density and highly reliable 3D NAND flash cell in near future. In IEEE IEDM, Dec 2012.
[34] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling. In IEEE/ACM MICRO, Dec 2009.
[35] M. Abusultan and S.P. Khatri. A Flash-based Digital Circuit Design Flow. In IEEE/ACM ICCAD, Nov 2016.
[36] J. Rajendran, H. Manem, R. Karri, and G. S. Rose. Memristor based programmable threshold logic array. In 2010 IEEE/ACM International Symposium on Nanoscale Architectures, June 2010.
[37] G. S. Rose, J. Rajendran, H. Manem, R.
Karri, and R. E. Pino. Leveraging memristive systems in the construction of digital logic circuits. Proceedings of the IEEE, 100(6):2033–2049, June 2012.
[38] M. Soltiz, D. Kudithipudi, C. Merkel, G. S. Rose, and R. E. Pino. Memristor-based neural logic blocks for nonlinearly separable functions. IEEE Transactions on Computers, 62(8):1597–1606, Aug 2013.
[39] M. Soltiz, C. Merkel, D. Kudithipudi, and G. S. Rose. Rram-based adaptive neural logic block for implementing non-linearly separable functions in a single layer. In 2012 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), July 2012.
[40] Georgios Detorakis, Sadique Sheik, Charles Augustine, Somnath Paul, Bruno U. Pedroni, Nikil D. Dutt, Jeffrey L. Krichmar, Gert Cauwenberghs, and Emre Neftci. Neural and synaptic array transceiver: A brain-inspired computing framework for embedded learning. In Front. Neurosci., 2017.
[41] M. Uddin and G. Rose. A practical sense amplifier design for memristive crossbar circuits (puf). 2018.
[42] S. Sayyaparaju, G. Chakma, S. Amer, and G. S. Rose. Circuit techniques for online learning of memristive synapses in cmos-memristor neuromorphic systems. In Proceedings of the on Great Lakes Symposium on VLSI 2017, GLSVLSI ’17, 2017.
[43] X. Yao, J. Harms, A. Lyle, F. Ebrahimi, Y. Zhang, and J. Wang. Magnetic tunnel junction-based spintronic logic units operated by spin transfer torque. IEEE Transactions on Nanotechnology, Jan.
[44] S. Patil, A. Lyle, J. Harms, D. J. Lilja, and J. Wang. Spintronic logic gates for spintronic data using magnetic tunnel junctions. In 2010 IEEE International Conference on Computer Design, Oct 2010.
