On Complexity, Energy- and Implementation-Efficiency of Channel Decoders

IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. X, NO. X, XXXX

Frank Kienle, Member, IEEE, Norbert Wehn, Senior Member, IEEE, and Heinrich Meyr, Fellow, IEEE

Abstract

Future wireless communication systems require efficient and flexible baseband receivers. Meaningful efficiency metrics are key for design space exploration to quantify the algorithmic and the implementation complexity of a receiver. Most of the currently established efficiency metrics are based on counting operations, thus neglecting important issues like data and storage complexity. In this paper we introduce suitable energy and area efficiency metrics which resolve the aforementioned disadvantages. These are decoded information bit per energy and throughput per area unit. The efficiency metrics are assessed using various implementations of turbo decoders, LDPC decoders and convolutional decoders. New exploration methodologies are presented which permit an appropriate benchmarking of implementation efficiency, communications performance, and flexibility trade-offs. These exploration methodologies are based on efficiency trajectories rather than on a single snapshot metric, as done in state-of-the-art approaches.

Index Terms

Channel coding, algorithmic complexity, energy efficiency, design space exploration, design methodology.

I. MOTIVATION

Today, high-end smart phones have to support multiple radio standards, advanced graphics and media applications, and many other applications, resulting in a workload of about 100 giga operations per second within a power budget of 1 Watt [1]. The baseband processing in the radio part (mainly front-end processing, demodulation and decoding) accounts for more than 50% of the overall workload in a state-of-the-art 3.5G smart phone. To achieve higher spectral efficiency, new transmission techniques like MIMO will be established.
However, this will increase the workload even further. Thus there is a strong need for efficient wireless baseband receivers.

The overall efficiency of a baseband receiver depends on
• communications efficiency: expressed by the spectral efficiency and the signal-to-noise ratio (SNR). The requirements on the communications efficiency have the largest impact on the selected baseband processing algorithms.
• implementation efficiency: related to silicon area, power and energy. Here, the energy efficiency is the biggest challenge due to the limited available battery power in many devices.
• flexibility: in software defined radio, receivers have to support multiple standards and should be configurable at run-time.

There are various silicon implementation styles, ranging from general purpose architectures, over DSPs and ASIPs, down to fully physically optimized IP blocks, which differ strongly in their implementation efficiency but also in their flexibility. For each building block of the receiver a detailed analysis of flexibility requirements has to be carried out to find the best flexibility/cost trade-off. Thus, advanced baseband receivers are heterogeneous multi-core architectures implemented in different design styles. System requirements are very often specified by communication standards like UMTS, LTE and WiMAX, which define different services in terms of required communications performance and system data throughput, i.e. information bits per second.

This work has been partly supported by the UMIC Research Center, RWTH Aachen University. F. Kienle and N. Wehn are with the Microelectronic Systems Design Research Group, University of Kaiserslautern, Kaiserslautern, Germany, e-mail: {kienle,wehn}@eit.uni-kl.de. H. Meyr is with the Institute for Integrated Signal Processing Systems, RWTH Aachen University, Aachen, Germany.

October 30, 2018 DRAFT
To obtain an efficient baseband implementation, a careful and elaborate design space exploration has to be performed. This is a very challenging task due to the size and the multi-dimensionality of the space. Therefore it is mandatory to prune the design space at an early stage of the design process. In this process the algorithms have to be selected and quantitatively compared to each other with respect to their system performance and implementation efficiency.

Appropriate metrics to measure the algorithmic and the implementation complexity, respectively, are key for efficient design space exploration.

A. Algorithmic Complexity

There exists no universal measure for complexity. In computer science the O-notation describes the asymptotic complexity behavior of an algorithm. In information theory the Kolmogorov complexity is defined by the minimum description length of a string. These measures are inadequate for implementation purposes. A useful description of complexity for our purpose is the number of "algorithmic" operations which have to be performed per received sample by the algorithms of a baseband receiver. This complexity metric has the advantage of being independent of a specific implementation of the algorithms. Based on this complexity definition a two-dimensional graph can be set up in which the horizontal axis corresponds to the sample rate of the receiver (which is proportional to the data rate) and the vertical axis corresponds to the operations per sample which have to be carried out. Typically, both axes are scaled logarithmically. An example of such a graph is shown in Figure 1 for the digital baseband processing of a 384 kbit/s UMTS receiver. The off-diagonal lines describe points of equal number of operations per second, often expressed in million operations per second (MOPs) or giga operations per second (GOPs). Several important conclusions can be drawn from this figure.
First, the receiver task is heterogeneous. There exists a large variety of algorithms, ranging from the complex MUSIC algorithm to the simple root raised cosine matched filtering. Note that the largest number of operations is performed in simple algorithms such as filtering and correlation, while the most complex algorithm in this example, MUSIC, requires fewer MOPs. From this it follows that counting only operations per second is entirely misleading. The second conclusion is that the heterogeneity of the algorithms points to architectural features for implementation. Simple algorithms such as filtering and correlation can be implemented very efficiently in architectures requiring little flexibility in the form of parameterizability. Complex algorithms must be programmable and thus require high flexibility.

[Figure 1: log-log plot "384 kbps UMTS receiver, digital BB complexity ... and heterogeneity"; horizontal axis: sampling rate [1/s]; vertical axis: OPs per sample; diagonal lines mark 1, 10, 100 and 1000 MOPS]
Fig. 1: Operations per sample over sampling rate [1/s]

Recently, Van Berkel determined the complexity of various algorithms in baseband processing based on a similar metric. In his remarkable and comprehensive paper [1] he has shown the number of "algorithmic" operations which have to be performed per received bit by the algorithms of a baseband receiver for different communication standards. Eberli from ETH Zürich uses a similar metric [2] for measuring complexity in baseband processing, calling these operations "atomic" operations.
From a communication system point of view we can separate digital processing in the baseband into two parts: the so-called "inner modem" and "outer modem" [3]. The task of the inner modem is the extraction of symbols from the received signal waveform, i.e., equalization, channel estimation, interference cancellation and synchronization. The outer modem performs demodulation, de-interleaving and channel decoding on the received symbols. Thus algorithmic complexity in baseband processing is normally plotted separately for the inner and the outer modem. A large diversity exists in the various baseband processing algorithms with respect to operation types, operation complexity, and data types, especially between the inner and outer modem. Figure 2 in [1] shows the algorithmic complexity for the inner and outer modem measured in giga operations per second (GOPs). It can be seen that sophisticated decoding schemes like turbo and LDPC codes utilized in advanced services like LTE require many more operations than the algorithms of the inner modem.

B. Implementation Complexity

On the implementation side a strong emphasis has to be put on the energy efficiency. Implementation complexity and algorithmic complexity are strongly interrelated in wireless baseband processing. Thus they have to be related to each other. Often it is argued that the implementation complexity is directly related to the algorithmic complexity. E.g. Eberli [2] considers the implementation complexity by introducing a cost factor for each atomic operation which reflects its implementation cost. For design space exploration, graph representations are commonly used:
• A two-dimensional energy efficiency graph: one axis corresponds to the algorithmic complexity, e.g. measured in GOPs, and the other axis to the power, e.g. measured in mW, consumed when providing the corresponding operations/second. Each point in this graph describes the energy efficiency metric, i.e. operations/second/power unit, usually measured in GOPs/mW. Since energy corresponds to power multiplied by execution time, each point equivalently gives the operations/energy, measured in operations/Joule.
• In a similar way we can set up an area efficiency graph in which one axis represents the needed area. Each point in this graph yields the area efficiency metric, i.e. operations/second/area unit, usually measured in GOPs/mm².

Note that the energy and area efficiency for the same algorithmic complexity can vary by several orders of magnitude, depending on the selected implementation style. By far the highest energy efficiency is achieved by physically optimized circuits, however, at the expense of no flexibility. The highest flexibility, via software programmability, is achieved by digital signal processors at the expense of low energy efficiency. The designer has to find a compromise between the two conflicting goals by trading off flexibility vs. energy efficiency. Flexibility is hard to quantify. The optimum design point is thus to be understood qualitatively. It depends on the application and a large number of economic and technical considerations. We can combine energy and area efficiency in a two-dimensional design space in which the two axes correspond to area and energy efficiency, respectively. This is a well-known representation of the design space.

C. Assessment of Metrics

Area, throughput and especially energy in many system-on-chip implementations are dominated by data-transfers and storage schemes [4] and not by the computations themselves. However, common metrics as described above focus solely on operations and do not consider data-transfer and storage issues at all.
Thus, these metrics are only valid if the operations dominate the implementation complexity. This is the case in data-flow dominated algorithms like FFT calculation, correlation or filtering. Most algorithms in the inner modem of baseband receivers belong to this class.

However, the channel decoding algorithms in the outer modem differ largely from the algorithms used in the inner modem. Here, the operations to be performed are non-standard operations (e.g. tanh) using non-standard data types (e.g. 7 bits). More importantly, the overall implementation complexity, especially energy, is dominated by data-transfers and storage schemes. A change in the algorithm with respect to computation, e.g. optimal versus suboptimal algorithm by approximating computations, has only a minor impact on the energy efficiency, as shown later.

The transition from 3G to LTE Advanced requires an improvement in energy efficiency of two orders of magnitude. Only a small part of this improvement will come from technology scaling [5]. Efficient system-on-chip implementations are feasible when channel coding schemes and the corresponding decoding algorithms are co-designed together with the architecture (architecture-aware code design) [6] [7] [8]. This is in accordance with a general trend towards co-design of algorithm and architecture in receiver design, realizing that the traditional separation of algorithm and architecture design leads to suboptimal results. In channel decoding the co-design focuses on data-transfer and storage schemes. Examples are special interleavers for turbo codes (e.g. LTE standard [9]) and special structures of the parity check matrix for LDPC codes (e.g. DVB-S2 standard [10]). These special structures allow an efficient parallel implementation of the decoding algorithm with small overhead in data-transfer and storage. GOPs-based metrics do not reflect such specific structures at all.

Another important issue is flexibility.
Flexibility on the algorithmic side, e.g. code rates and block sizes in the case of channel decoding, has a large impact on the implementation complexity. By looking only at the operations in the algorithm, flexibility is normally not considered.

In summary, efficiency metrics based on GOPs are questionable, particularly for non-data-flow dominated algorithms, since they entirely neglect important issues like data and storage complexity, algorithm/architecture co-design and flexibility. In this paper we focus on channel decoding as application. The contributions of this paper are:
• we show that the GOPs metric yields wrong conclusions.
• we introduce suitable metrics for energy and area efficiency.
• we present a methodology for design space exploration based on these metrics.

II. REFERENCE DESIGNS

Reference designs are key to assess various metrics. Thus, we selected 5 different channel decoder implementations which our research group has designed in the last couple of years. Using our own designs has the advantage that all data are available. The decoders differ in services (throughput, block sizes, code rates), decoding algorithms, flexibility and implementation styles. Selected codes are convolutional codes, turbo codes and LDPC codes. The 5 different decoders are:
• An application specific instruction set processor (ASIP) [11] capable of processing binary turbo codes, duo-binary turbo codes and various convolutional codes with different throughputs dependent on code rate and decoding scheme.
• A turbo decoder which is LTE [9] compliant. The maximum throughput is 150 Mbit/s at 6.5 decoding iterations.
• An LDPC decoder optimized for flexibility, supporting two different decoding algorithms, code rates from R=1/4 to R=9/10 and a maximum block length of 16384.
• An LDPC decoder which is WiMedia 1.5 compliant and optimized for throughput, supporting code rates from R=1/2 to R=4/5 with two block lengths, N=1200 and N=1320 bits [13].
• A convolutional decoder with 64 states which is WiFi [14] compliant.

All decoders are synthesized in a 65 nm CMOS technology under worst case conditions with Vdd = 1.0 V, 120 °C. Power estimations are based on nominal case Vdd = 1.1 V. Table I gives an overview of the key parameters of the different decoders. P&R indicates that the corresponding data are post-layout data. The payload (information bits) throughput depends on the number of decoding iterations for turbo and LDPC codes, which also impacts the communications performance. Thus, the throughput is specified dependent on the number of iterations.

Decoder | Flexibility | Max. block size | Throughput [Mbit/s] | Frequency [MHz] | Area [mm²] | Dynamic power [mW]
ASIP [11] | Conv. codes; binary TC; duo-binary TC | N=16k | 40; 14 (6 iter); 28 (6 iter) | 385 (P&R) | 0.7 (P&R) | ~100
LTE turbo [12] | R=1/3 to R=9/10 by puncturing | N=18k | 150 (6.5 iter) | 300 (P&R) | 2.1 (P&R) | ~300
LDPC flexible | R=1/4 to R=9/10 | N=16k | 30 (R=1/3, 40 iter); 100 (R=1/2, 20 iter); 300 (R=0.83, 10 iter) | 385 (P&R) | 1.172 (P&R) | ~389
LDPC WiMedia 1.5 [13] | R=1/2 to R=4/5 | N=1.3k | 640 (R=1/2, 5 iter); 960 (R=0.75, 5 iter) | 265 | 0.51 | ~193
CC decoder | 64-state NSC | - | 500 | 500 | 0.1 | ~37

TABLE I: Reference decoders: service parameters and implementation results in 65 nm technology

In Table II we show the number of algorithmic operations required to process the different types of convolutional codes, turbo codes and LDPC codes.
Bit-true C reference models are used for operation counting. All operations were normalized to an 8 bit addition. The number of operations is related to one information bit which has to be decoded, i.e., operations/information bit. The total number of operations which have to be performed per second, i.e. GOPs, depends on the code rate R and on the throughput, which in turn depends on the number of iterations for LDPC and turbo codes. Two different algorithms for LDPC codes were investigated. Both are suboptimal algorithms approximating the belief propagation algorithm: the Min-Sum algorithm with a scaling factor and the λ-3-Min algorithm [15], which is a more accurate approximation. However, the latter needs about 3.3 times more operations. This more accurate approximation is mandatory if lower code rates R < 0.5 have to be supported, as in DVB-S2 decoders [16]. In Table II we have only listed the operations for the Min-Sum algorithm. To obtain the values for the λ-3-Min algorithm, operations and GOPs have to be multiplied by 3.3. The flexible LDPC decoder in our reference design was designed for both decoding algorithms; the WiMedia LDPC decoder is based on the Min-Sum algorithm only.

Code | #iter | op/bit | GOPs at 100 Mbit/s | GOPs at 300 Mbit/s | GOPs at 1 Gbit/s
CC (states=64) | - | 200 | 20 | 60 | 200
LDPC Min-Sum | 5 | 75/R | 7.5/R | 22.5/R | 75/R
LDPC Min-Sum | 10 | 150/R | 15/R | 45/R | 150/R
LDPC Min-Sum | 20 | 300/R | 30/R | 90/R | 300/R
LDPC Min-Sum | 40 | 600/R | 60/R | 180/R | 600/R
Turbo (Max-Log) | 2 | 280 | 28 | 84 | 280
Turbo (Max-Log) | 4 | 560 | 56 | 168 | 560
Turbo (Max-Log) | 6 | 840 | 84 | 252 | 840

TABLE II: Number of normalized algorithmic operations per decoded information bit for different channel decoders dependent on throughput and code rate R.

III. SUITABLE METRICS

The energy efficiency graph for our reference designs is shown in Figure 2.
It can be seen that the energy efficiency, measured in GOPs/mW, varies largely between the different decoders. The two-dimensional design space, covering area and energy efficiency, is illustrated in Figure 3. In this graph, architectures that are efficient w.r.t. area and energy are located in the upper right corner. Less efficient architectures are placed in the lower left corner. We see that the convolutional decoder appears to be the most efficient decoder, while the ASIP is the decoder with the lowest efficiency.

[Figure 2: log-log plot with axes GOPs and power (mW); reference lines at 0.1 and 1 GOPs/mW; points: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s), CC (500 Mbit/s)]
Fig. 2: Operations/second versus power

One interesting observation is the efficiency of the flexible LDPC decoder. The efficiency increases largely in both directions (area, energy) when replacing the Min-Sum algorithm by the λ-3-Min algorithm, which is the more complex algorithm in terms of operations. As described in the previous section, the GOPs for this algorithm increase by a factor of about 3.3. We would expect a large increase in area and power accordingly. However, the area and power only increase by about 10% for the λ-3-Min algorithm. This is due to the fact that area and energy in both decoders are dominated by the data-transfer and storage scheme, and the change in the operations of the algorithm has only a small impact on it. In other words, the number of operations increases much more than the implementation complexity. This is a hint that a GOPs-based metric is not suited. Moreover, we see that the λ-3-Min based flexible decoder has nearly the same efficiency as the less flexible WiMedia decoder, which is optimized for throughput. We would expect that such a less flexible, throughput-optimized decoder has a higher efficiency than the flexible one.
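The operation-based positions discussed above can be roughly reproduced from Tables I and II. The following is a minimal sketch under stated assumptions, not the tooling used for the figures: it assumes that the operations per information bit grow linearly with the iteration count (15/R per LDPC Min-Sum iteration and 140 per Max-Log turbo iteration, as read off Table II), that the 6.5 turbo iterations can be extrapolated from that table, that the ASIP turbo mode has the same operation count as the Max-Log turbo decoder, and that the approximate dynamic power values of Table I apply at the listed throughputs. All function names are hypothetical.

```python
# Sketch: operation-based energy efficiency (GOPs/mW) from Table I / Table II values.
# Assumption: operations per information bit scale linearly with the iteration count.

def ops_per_info_bit(code, iterations=0.0, rate=1.0):
    """Normalized operations per decoded information bit (read off Table II)."""
    if code == "cc":        # 64-state convolutional code, non-iterative
        return 200.0
    if code == "ldpc":      # Min-Sum: 75/R for 5 iterations -> 15/R per iteration
        return 15.0 * iterations / rate
    if code == "turbo":     # Max-Log: 280 for 2 iterations -> 140 per iteration
        return 140.0 * iterations
    raise ValueError(f"unknown code type: {code}")


def gops(ops_per_bit, throughput_mbit_s):
    """Giga operations per second at a given information-bit throughput."""
    return ops_per_bit * throughput_mbit_s * 1e6 / 1e9


def gops_per_mw(ops_per_bit, throughput_mbit_s, power_mw):
    """Operation-based energy efficiency metric."""
    return gops(ops_per_bit, throughput_mbit_s) / power_mw


# (name, ops/bit, throughput [Mbit/s], approx. dynamic power [mW]) from Table I
designs = [
    ("CC 64-state", ops_per_info_bit("cc"), 500, 37),
    ("LDPC WiMedia, R=0.75, 5 iter", ops_per_info_bit("ldpc", 5, 0.75), 960, 193),
    ("LDPC flexible Min-Sum, R=1/2, 20 iter", ops_per_info_bit("ldpc", 20, 0.5), 100, 389),
    ("LTE turbo, 6.5 iter", ops_per_info_bit("turbo", 6.5), 150, 300),
    ("ASIP binary TC, 6 iter", ops_per_info_bit("turbo", 6), 14, 100),
]

for name, opb, tput, power in designs:
    print(f"{name}: {gops(opb, tput):.1f} GOPs, "
          f"{gops_per_mw(opb, tput, power):.2f} GOPs/mW")
```

Under these assumptions the convolutional decoder comes out at roughly 2.7 GOPs/mW and the ASIP at roughly 0.12 GOPs/mW, which is consistent with the ordering described above.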
In the following we introduce metrics to resolve the aforementioned anomalies. Instead of using the operations which have to be carried out per task, we normalize to the number of information bits per task. Metrics normalized to the number of information bits have the following properties. They allow comparing
• competing architectures for a given algorithm, since the efficiency metrics are independent of the specific operations and data types used to execute the algorithm. All implementation issues like data-transfer and storage are taken into account, since the metrics are oblivious to how the task has been executed.
• different coding schemes as a function of the communication parameters (modulation, signal-to-noise ratio, bandwidth). In particular, iterative decoding algorithms can be compared in a meaningful way to non-iterative algorithms.

Energy efficiency is a multidimensional problem. The performance of the physical layer of a communication system depends on the transmit energy via the SNR at the receiver, denoted communication energy, and on the processing energy in the receiver to retrieve the information, denoted computation energy. There exists an interesting trade-off between communication and computation energy which has to be exploited in advanced, adaptive systems. For example, iterative algorithms operate at much lower SNR than convolutional codes. Decreasing the SNR, however, results in an exponentially increasing complexity and energy consumption due to the large number of iterations required for decoding.

[Figure 3: log-log plot with axes area efficiency (GOPs/mm²) and energy efficiency, operation/energy (op/pJ); points: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s, Min-Sum), CC (500 Mbit/s)]
Fig. 3: Energy efficiency versus area efficiency based on operations
We define the two suitable metrics for implementation efficiency as follows:
• energy efficiency metric: decoded information bit per energy, measured in bit/nJ
• area efficiency metric: information bit throughput per area unit, measured in (Mbit/s)/mm²

We have mapped our decoders to the design space which is based on these metrics, see Figure 4. Again, efficient architectures are placed in the upper right corner, inefficient architectures in the lower left corner. A large change in the relative and absolute positions can be observed for some decoders when comparing them with Figure 3.
• The difference in efficiency between the two instances of the flexible LDPC decoder (Min-Sum and λ-3-Min decoder, respectively) is now much smaller. Moreover, the Min-Sum decoder is now more efficient than the other one, which was not the case in the conventional design space. This matches our expectations, since the data-transfer and storage scheme in both decoders is nearly identical and the increase in computation results in only a small energy and area increase, as described above. Both decoders target the same throughput.
• The efficiency of the WiMedia decoder, which is optimized for throughput and offers less flexibility, is now much larger than the efficiency of the flexible LDPC decoder, which again matches what we expected.

[Figure 4: log-log plot with axes area efficiency ((Mbit/s)/mm²) and energy efficiency, decoded bit/energy (bit/nJ); points: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s), CC (500 Mbit/s)]
Fig. 4: Design space based on suitable metrics. Decoded information bit per energy over information bit throughput per area unit.

So far we have focused on the implementation complexity but have not discussed the important aspects of flexibility and communication performance.
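To make the two information-bit based metrics concrete, the sketch below evaluates them for some of the reference designs. It is an illustration under stated assumptions rather than the procedure used to produce Figure 4: the throughput, area and approximate dynamic power values are taken from Table I, the power is assumed to apply at the listed throughput, and the function names are hypothetical. Note that Mbit/s divided by mW is numerically equal to bit/nJ.

```python
# Sketch: information-bit based efficiency metrics from Table I values.

def energy_efficiency_bit_per_nj(throughput_mbit_s, power_mw):
    """Decoded information bits per nanojoule.

    (1e6 bit/s) / (1e-3 J/s) = 1e9 bit/J = 1 bit/nJ, so the plain ratio
    of Mbit/s to mW is already expressed in bit/nJ.
    """
    return throughput_mbit_s / power_mw


def area_efficiency_mbit_s_per_mm2(throughput_mbit_s, area_mm2):
    """Information bit throughput per silicon area, in (Mbit/s)/mm^2."""
    return throughput_mbit_s / area_mm2


# (name, throughput [Mbit/s], area [mm^2], approx. dynamic power [mW]) from Table I
designs = [
    ("ASIP binary TC, 6 iter", 14, 0.7, 100),
    ("LTE turbo, 6.5 iter", 150, 2.1, 300),
    ("LDPC flexible, R=1/2, 20 iter", 100, 1.172, 389),
    ("LDPC WiMedia, R=0.75, 5 iter", 960, 0.51, 193),
    ("CC 64-state", 500, 0.1, 37),
]

for name, tput, area, power in designs:
    e_eff = energy_efficiency_bit_per_nj(tput, power)
    a_eff = area_efficiency_mbit_s_per_mm2(tput, area)
    print(f"{name}: {e_eff:.2f} bit/nJ, {a_eff:.0f} (Mbit/s)/mm^2")
```

With these inputs the convolutional decoder evaluates to about 13.5 bit/nJ and 5000 (Mbit/s)/mm², and the LTE turbo decoder to 0.5 bit/nJ and about 71 (Mbit/s)/mm².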
In the following we will investigate the relationship between communication performance, flexibility and implementation efficiency.

IV. METHODOLOGY

In the previous section we investigated the absolute and relative positions of the different decoders. However, equally important in this space is the trajectory traced when specific parameters are changed, since such a trajectory represents the impact of a specific parameter on the decoder efficiency. The following parameters will be considered: the frame error rate (FER), i.e. communications performance, coding technique, code rate, number of iterations and throughput. The resulting trajectories illustrate well the strong dependency between communications performance and implementation efficiency. We present two design space exploration methodologies. The first one is driven by implementation efficiency and compares non-iterative decoding techniques (convolutional codes) with iterative decoding techniques (LDPC codes). The second exploration compares two different iterative decoding techniques (LDPC and turbo codes) with code rate flexibility.

A. Implementation driven design space exploration

We use the current WiMedia 1.5 standard for demonstration. WiMedia features low complexity devices for UWB communication. Thus the WiMedia 1.2 standard used convolutional codes as channel coding technique. In the definition phase of the next generation standard, WiMedia 1.5, LDPC codes were considered as a promising candidate due to their much better communication performance. A throughput of 960 Mbit/s at code rate R = 0.75 was defined in the standard. A code/architecture co-design approach [13] resulted in an LDPC decoder which has a much higher efficiency than the flexible LDPC decoder. Note that its efficiency is lower in both dimensions compared to a convolutional decoder.
However, this comparison completely neglects the communications performance. At most five decoding iterations can be performed by the LDPC decoder to comply with the throughput fixed in the standard. As already pointed out, the number of iterations strongly impacts the performance of the LDPC decoder. The frame error rate as a function of the number of iterations is contrasted with implementation efficiency in Figure 5. Point 3 in the design space figure corresponds to the WiMedia 1.5 decoder when performing 5 iterations (this was the decoder assumption in the previous figures when we referred to the WiMedia LDPC decoder). The communication figure shows that this decoder has a 4 dB better communication performance than the convolutional decoder. The communication performance is comparable to that of the convolutional decoder if the LDPC decoder performs only two iterations instead of five (case 2 in Figure 5). Finally, executing only one iteration in the LDPC decoder results in a communication performance which is about 4 dB worse than that of the convolutional decoder (case 1 in Figure 5).

What matters is the resulting trajectory in the design space for the different cases. Two cases have to be distinguished:
• The system throughput is not changed w.r.t. the WiMedia 1.5 constraint (scenario a in Figure 5). In this scenario only the energy efficiency is improved (points 3 → 2a → 1a). Obviously the decoding time decreases when the decoder executes a smaller number of iterations, so decoding finishes ahead of the deadline. This time slack can be exploited to improve energy efficiency. For example, the clock and the power supply could be completely switched off when decoding is finished. This reduces energy and leakage current. Another possibility is to lower the clock frequency (frequency scaling). This reduces the energy by the same amount as in the previous case, but the peak power consumption during decoding, instead of the leakage, is minimized. The most efficient technique is voltage scaling, in which the supply voltage is reduced, resulting in the highest energy efficiency.
• The system throughput is changed (scenario b in Figure 5). In this scenario the area efficiency increases by the same amount as the throughput increases due to the smaller number of iterations (points 3 → 2b → 1b).

We see that the efficiency of the LDPC decoder increases with decreasing communication requirements, i.e. number of iterations. Thus, the decoder efficiency is represented by a trajectory instead of a single point in the design space. This trajectory results from varying communication performance requirements. We also see that the LDPC decoder outperforms the convolutional decoder in efficiency at the same communication performance.

B. Communications performance driven exploration

In the previous exploration we compared the implementation efficiency of non-iterative and iterative decoding techniques dependent on throughput and frame error rate behavior for fixed code rates. In this section we compare two iterative decoding techniques and put emphasis on code rate flexibility and dependencies. The reference is an LTE turbo decoder implementation. This LTE turbo decoder is compared with a flexible LDPC decoder which supports code rate flexibility. The right graph in Figure 6 shows the communication performance for the two decoding schemes dependent on code rates (R = 0.5 and R = 0.83) and iteration numbers. The number of information bits is K = 6140 in all cases. Frame error rates are based on fixed point simulations matching the hardware implementation. We use the communications performance of the turbo decoder with 6.5 iterations as reference point for both code rates. The 6.5 iterations result from the throughput constraint of 150 Mbit/s which is specified in the LTE standard.
The 6.5 iterations fulfill the LTE communications performance requirements for all code rates. It is well known that the communication performance in LDPC decoding depends on the number of iterations and the code rate. The LDPC decoder under investigation provides large code rate flexibility, i.e., the hardware can support various code rates. The LDPC decoder requires 10 iterations for R = 0.83 and 20 iterations for R = 0.5 to match the performance of the turbo decoder. For a code rate of R = 1/3 even 40 iterations are mandatory (this is not shown in Figure 6b).

What matters are the corresponding trajectories in the implementation space. The turbo decoder efficiency is identical for all code rates (see left graph in Figure 6). Thus we have no trajectory. This is due to the fact that the code rate flexibility is implemented by puncturing, which has negligible impact on throughput, area and energy. However, the situation is completely different for the flexible LDPC decoder. For a given communications performance the code rate has a strong impact on the number of required iterations. This iteration number influences the implementation efficiency, as we have seen in the previous exploration case. But besides this impact via the iteration number, there is also a direct impact of the code rate on the implementation efficiency, since lower code rates also require a more accurate decoding algorithm (λ-3-Min algorithm instead of the less complex Min-Sum algorithm). The resulting trajectory is shown in the left graph of Figure 6. We see that the efficiency increases in both directions with increasing code rate (points 1 → 2 → 3).

The important observation in this exploration is the varying implementation efficiency of the flexible LDPC decoder, represented by the trajectory.
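As a rough illustration of such a trajectory, the sketch below evaluates the two information-bit based metrics for the three operating points of the flexible LDPC decoder listed in Table I and contrasts them with the single point of the LTE turbo decoder. It rests on simplifying assumptions that are not stated in the paper: area (1.172 mm²) and dynamic power (~389 mW) are treated as constant over all code rates, so the roughly 10% overhead of the λ-3-Min algorithm at low code rates is ignored, and the function name is hypothetical. The numbers are therefore only indicative and are not the points plotted in Figure 6.

```python
# Sketch: efficiency trajectory of the flexible LDPC decoder over code rate.
# Assumption: area and dynamic power are identical at every operating point.

LDPC_AREA_MM2 = 1.172   # Table I, post-layout
LDPC_POWER_MW = 389.0   # Table I, approximate; assumed constant here

# (code rate label, iterations, throughput [Mbit/s]) from Table I
ldpc_points = [
    ("R=1/3", 40, 30),
    ("R=1/2", 20, 100),
    ("R=0.83", 10, 300),
]


def efficiency_point(throughput_mbit_s, area_mm2, power_mw):
    """Return (area efficiency [(Mbit/s)/mm^2], energy efficiency [bit/nJ])."""
    return throughput_mbit_s / area_mm2, throughput_mbit_s / power_mw


trajectory = [
    (label, iters) + efficiency_point(tput, LDPC_AREA_MM2, LDPC_POWER_MW)
    for label, iters, tput in ldpc_points
]

# The LTE turbo decoder reaches its rates by puncturing, so it is a single point.
turbo_point = efficiency_point(150, 2.1, 300)

for label, iters, a_eff, e_eff in trajectory:
    print(f"LDPC {label} ({iters} iter): {a_eff:.1f} (Mbit/s)/mm^2, {e_eff:.3f} bit/nJ")
print(f"LTE turbo (all rates): {turbo_point[0]:.1f} (Mbit/s)/mm^2, {turbo_point[1]:.3f} bit/nJ")
```

Under these assumptions both metrics grow by a factor of ten between R=1/3 and R=0.83, while the turbo decoder stays at one point, which mirrors the qualitative behavior described above.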
This trajectory results from the code rate flexibility required of the LDPC decoder in order to match the communications performance of a competitive turbo code decoder. We see that analyzing only one code rate, and thus one snapshot, could lead to wrong efficiency conclusions.

The two explorations have shown that the implementation efficiency of advanced iterative decoders often results in a trajectory instead of a single point in the design space. These trajectories result from the strong interrelation between communications performance, flexibility and implementation efficiency.

V. CONCLUSION

Understanding the trade-offs between implementation efficiency, communications performance and flexibility will be key for designing efficient baseband receivers. Meaningful efficiency metrics are mandatory to explore and evaluate the resulting huge design space. We introduced and discussed suitable energy and area efficiency metrics, which are based on decoded information bit per energy and throughput per area unit. Various channel decoder implementations were used to examine these efficiency metrics with respect to the achieved communications performance and the decoder flexibility. The presented methodology allows different realizations to be compared systematically by jointly considering implementation efficiency, communications performance and flexibility.

REFERENCES

[1] C. H. van Berkel, "Multi-core for mobile phones," in Proc. DATE '09. Design, Automation & Test in Europe Conference & Exhibition, Apr. 20–24, 2009, pp. 1260–1265.
[2] S. C. Eberli, Ph.D. dissertation, ETH Zurich, Integrated Systems Laboratory, 2009.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication Receivers. John Wiley & Sons Inc, 1998.
[4] M. Miranda, C. Ghez, E. Brockmeyer, P. Op De Beeck, and F. Catthoor, "Data transfer and storage exploration for real-time implementation of a digital audio broadcast receiver on a Trimedia processor," in Proc. 15th Symposium on Integrated Circuits and Systems Design, 9–14 Sept. 2002, pp. 373–378.
[5] C. Rowen, "Energy-Efficient LTE Baseband with Extensible Dataplane Processor Units," in Proc. 9th International Symposium on Multiprocessor Systems-on-Chips (MPSoC'09), Savannah, USA, August 2009.
[6] E. Boutillon, J. Castura, and F. Kschischang, "Decoder-first code design," in Proc. 2nd International Symposium on Turbo Codes & Related Topics, Brest, France, Sep. 2000, pp. 459–462.
[7] M. Mansour and N. Shanbhag, "Architecture-Aware Low-Density Parity-Check Codes," in Proc. 2003 IEEE International Symposium on Circuits and Systems (ISCAS '03), Bangkok, Thailand, May 2003.
[8] J. Kwak and K. Lee, "Design of dividable interleaver for parallel decoding in turbo codes," Electronics Letters, vol. 38, no. 22, pp. 1362–1364, Oct. 2002.
[9] "3GPP LTE (Long Term Evolution) Homepage." [Online]. Available: http://www.3gpp.org/Highlights/LTE/LTE.htm
[10] European Telecommunications Standards Institute (ETSI), "Digital Video Broadcasting (DVB) Second generation framing structure, channel coding and modulation systems for Broadcasting, Interactive Services, News Gathering and other broadband satellite applications; TM 2860r1 DVBS2-74r8," www.dvb.org.
[11] T. Vogt and N. Wehn, "A Reconfigurable ASIP for Convolutional and Turbo Decoding in a SDR Environment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1309–1320, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1109/TVLSI.2008.2002428
[12] M. May, T. Ilnseher, N. Wehn, and W. Raab, "A 150Mbit/s 3GPP LTE Turbo Code Decoder," in Proc. Design, Automation and Test in Europe, 2010 (DATE '10), 2010, accepted for publication.
[13] M. Alles, F. Berens, and N. Wehn, "A Synthesizable IP Core for WiMedia 1.5 UWB LDPC Code Decoding," in Proc. IEEE International Conference on Ultra-Wideband ICUWB 2009, Vancouver, Canada, September 2009, pp. 597–601.
[14] IEEE 802.11, "Wireless Fidelity (Wireless LAN)," http://grouper.ieee.org/groups/802/11/.
[15] F. Guilloud, E. Boutillon, and J. Danger, "λ-Min Decoding Algorithm of Regular and Irregular LDPC Codes," in Proc. 3rd International Symposium on Turbo Codes & Related Topics, Brest, France, Sep. 2003, pp. 451–454.
[16] S. Müller, M. Schreger, M. Kabutz, M. Alles, F. Kienle, and N. Wehn, "A novel LDPC decoder for DVB-S2 IP," in Proc. DATE '09. Design, Automation & Test in Europe Conference & Exhibition, Apr. 20–24, 2009, pp. 1308–1313.

Fig. 5: Implementation efficiency and communications performance for the WiMedia 1.5 standard.
(a) Implementation efficiency. Axes: area efficiency in (Mbit/s)/mm² and energy efficiency in decoded bit per energy (bit/nJ). Data sets: LDPC WiMedia 1.5, CC WiMedia 1.5. Marked points: R, 1a, 1b, 2a, 2b, 3.
(b) Communications performance (16-QAM, CM1 channel according to IEEE 802.15.3a [13]). Axes: FER versus Eb/N0 [dB]. Curves: LDPC, R = 3/4, with 5, 2 and 1 iterations; CC, R = 3/4. Marked points: R, 1, 2, 3.
R: Reference convolutional decoder with code rate R = 0.75 and fixed throughput of 1 Gbit/s.
1: LDPC code performing 1 iteration.
   a) Identical throughput (1 Gbit/s), ∼4 dB worse communications performance.
   b) 5 times higher throughput (5 Gbit/s), ∼4 dB worse communications performance.
2: LDPC code performing 2 iterations.
   a) Identical throughput (1 Gbit/s), ∼1 dB better communications performance.
   b) 2.5 times higher throughput (2.5 Gbit/s), ∼1 dB better communications performance.
3: LDPC code performing 5 iterations. Identical throughput (1 Gbit/s), ∼4 dB better communications performance.

Fig. 6: Implementation efficiency and communications performance of an LTE turbo code decoder and a flexible LDPC decoder.
(a) Implementation efficiency. Axes: area efficiency in (Mbit/s)/mm² and energy efficiency in decoded bit per energy (bit/nJ). Data sets: TC LTE, flexible LDPC. Marked points: R, 1, 2, 3.
(b) Communications performance (BPSK, AWGN channel). Axes: FER versus Eb/N0 [dB]. Curves for R = 0.5 and R = 0.83: TC with 6.5 and 3 iterations; LDPC with 20, 10 and 5 iterations.
R: Reference LTE turbo decoder. 150 Mbit/s throughput for all code rates; reference TC performance at 6.5 iterations.
1: LDPC code at code rate R = 1/3. Max throughput ∼30 Mbit/s; ∼40 iterations to match TC performance.
2: LDPC code at code rate R = 1/2. Max throughput ∼90 Mbit/s; ∼20 iterations to match TC performance.
3: LDPC code at code rate R = 0.83. Max throughput ∼300 Mbit/s; ∼10 iterations to match TC performance.
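The LDPC operating points annotated for Fig. 6 can be tabulated to make the trajectory explicit. The following Python sketch is illustrative only: the class and function names are hypothetical, and comparing points by throughput alone assumes that the flexible LDPC decoder occupies the same area at every code rate, which the annotations do not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One annotated LDPC point from Fig. 6 (approximate values)."""
    label: str
    code_rate: float
    max_throughput_mbps: float
    iterations: int


# Values as annotated for Fig. 6; all are marked as approximate there.
LDPC_POINTS = [
    OperatingPoint("1", 1 / 3, 30.0, 40),
    OperatingPoint("2", 1 / 2, 90.0, 20),
    OperatingPoint("3", 0.83, 300.0, 10),
]


def relative_area_efficiency(point, reference):
    """Area efficiency of `point` relative to `reference`.

    Assumption: both points use the same silicon area, so the ratio of
    area efficiencies reduces to the ratio of throughputs.
    """
    return point.max_throughput_mbps / reference.max_throughput_mbps


def iteration_throughput_product(point):
    """Throughput times iterations, a rough proxy for work per second."""
    return point.max_throughput_mbps * point.iterations


reference = LDPC_POINTS[-1]  # point 3, R = 0.83
for point in LDPC_POINTS:
    ratio = relative_area_efficiency(point, reference)
    product = iteration_throughput_product(point)
    print(f"point {point.label}: relative efficiency {ratio:.2f}, "
          f"throughput x iterations = {product:.0f}")
```

The throughput-iteration product is not constant across the three points, so the iteration count alone does not account for the throughput differences. This is in line with the remark in the text that the code rate also affects the implementation efficiency directly.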
