On Complexity, Energy- and Implementation-Efficiency of Channel Decoders

Future wireless communication systems require efficient and flexible baseband receivers. Meaningful efficiency metrics are key for design space exploration to quantify the algorithmic and the implementation complexity of a receiver. Most of the curre…

Authors: Frank Kienle, Norbert Wehn, Heinrich Meyr

IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. X, NO. X, XXXX

Frank Kienle, Member, IEEE, Norbert Wehn, Senior Member, IEEE, and Heinrich Meyr, Fellow, IEEE

Abstract

Future wireless communication systems require efficient and flexible baseband receivers. Meaningful efficiency metrics are key for design space exploration to quantify the algorithmic and the implementation complexity of a receiver. Most of the current established efficiency metrics are based on counting operations, thus neglecting important issues like data and storage complexity. In this paper we introduce suitable energy and area efficiency metrics which resolve the aforementioned disadvantages. These are decoded information bit per energy and throughput per area unit. Efficiency metrics are assessed by various implementations of turbo decoders, LDPC decoders and convolutional decoders. New exploration methodologies are presented, which permit an appropriate benchmarking of implementation efficiency, communications performance, and flexibility trade-offs. These exploration methodologies are based on efficiency trajectories rather than a single snapshot metric as done in state-of-the-art approaches.

Index Terms

Channel coding, algorithmic complexity, energy efficiency, design space exploration, design methodology.

I. MOTIVATION

Today, high-end smart phones have to support multiple radio standards, advanced graphic and media applications and many other applications, resulting in a workload of about 100 giga operations per second in a power budget of 1 Watt [1]. The baseband processing in the radio part (mainly front-end processing, demodulation and decoding) requires more than 50% of the overall workload in a state-of-the-art 3.5G smart phone. To achieve higher spectral efficiency, new transmission techniques like MIMO will be established.
However, this will increase the workload even further. Thus there is a strong need for efficient wireless baseband receivers. The overall efficiency of a baseband receiver depends on

• communications efficiency: expressed by the spectral efficiency and signal-to-noise ratio (SNR). The requirements on the communications efficiency have the largest impact on the selected baseband processing algorithms.
• implementation efficiency: related to silicon area, power and energy. Here, the energy efficiency is the biggest challenge due to the limited available battery power in many devices.
• flexibility: in software defined radio, receivers have to support multiple standards and should be configurable at run-time.

There are various silicon implementation styles ranging from general purpose architectures, over DSPs and ASIPs, down to fully physically optimized IP blocks, which strongly differ in their implementation efficiency but also in their flexibility. For each building block of the receiver a detailed analysis of flexibility requirements has to be carried out to find the best flexibility/cost trade-off. Thus, advanced baseband receivers are heterogeneous multi-core architectures implemented in different design styles.

This work has been partly supported by the UMIC Research Center, RWTH Aachen University. F. Kienle and N. Wehn are with the Microelectronic Systems Design Research Group, University of Kaiserslautern, Kaiserslautern, Germany, e-mail: {kienle,wehn}@eit.uni-kl.de. H. Meyr is with the Institute for Integrated Signal Processing System, RWTH Aachen University, Aachen, Germany. October 30, 2018 DRAFT

System requirements are very often specified by communication standards like UMTS, LTE and WiMAX, which define different services in terms of required communications performance and system data throughput, i.e.
information bits per second.

To obtain an efficient baseband implementation, a careful and elaborate design space exploration has to be performed. This is a very challenging task due to the size and the multi-dimensionality of the space. Therefore it is mandatory to prune the design space at an early stage of the design process. In this process the algorithms have to be selected and quantitatively compared to each other with respect to their system performance and implementation efficiencies.

Appropriate metrics are key for efficient design space exploration to measure the algorithmic and the implementation complexity, respectively.

A. Algorithmic Complexity

There exists no universal measure for complexity. In computer science the O-notation describes the asymptotic complexity behavior of an algorithm. In information theory the Kolmogorov complexity is defined by the minimum description length of a string. These measures are inadequate for implementation purposes. A useful description of complexity for our purpose is the number of "algorithmic" operations which have to be performed per received sample by the algorithms of a baseband receiver. This complexity metric has the advantage of being independent of a specific implementation of the algorithms.

Based on this complexity definition a two-dimensional graph can be set up in which the horizontal axis corresponds to the sample rate of the receiver (which is proportional to the data rate) and the vertical axis corresponds to the operations per sample which have to be carried out. Typically, both axes are scaled logarithmically. An example of such a graph is shown in Figure 1 for the digital baseband processing of a 384 kbit/s UMTS receiver. The off-diagonal lines describe points of equal number of operations per second, often expressed in millions of operations per second (MOPs) or giga operations per second (GOPs).
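The relation between the two axes and the lines of equal workload can be made concrete with a small calculation. The following is a minimal sketch only: the function names and the two example blocks are hypothetical, and the numbers are illustrative rather than read from Figure 1.

```python
import math


def ops_per_second(ops_per_sample: float, sample_rate_hz: float) -> float:
    """Workload of one block: operations per sample times samples per second."""
    return ops_per_sample * sample_rate_hz


def iso_line_ops_per_sample(target_ops_per_second: float, sample_rate_hz: float) -> float:
    """Operations per sample that place a block exactly on a given equal-workload line."""
    return target_ops_per_second / sample_rate_hz


# Hypothetical blocks (illustrative numbers, not taken from Figure 1).
simple_filter = ops_per_second(ops_per_sample=20, sample_rate_hz=15.36e6)
complex_estimator = ops_per_second(ops_per_sample=50_000, sample_rate_hz=100)

# On log-log axes, log10(ops/s) = log10(ops/sample) + log10(sample rate),
# so each equal-workload line is a straight diagonal.
for rate in (1e2, 1e4, 1e6):
    y = iso_line_ops_per_sample(100e6, rate)  # points on the 100 MOPs line
    log_sum = math.log10(y) + math.log10(rate)
    print(f"rate={rate:.0e}/s  ops/sample={y:.0e}  log10 sum={log_sum:.1f}")

print(f"simple filter:     {simple_filter / 1e6:.1f} MOPs")
print(f"complex estimator: {complex_estimator / 1e6:.1f} MOPs")
```

In this made-up example the per-sample-simple filter carries the larger total workload, which is the kind of situation the graph is meant to expose.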
Several important conclusions can be drawn from this figure. First, the receiver task is heterogeneous. There exists a large variety of algorithms ranging from the complex MUSIC algorithm to the simple root raised cosine matched filtering. Note that the largest number of operations is performed in simple algorithms such as filtering and correlation, while the most complex MUSIC algorithm in this example requires fewer MOPs. From this follows that counting only operations/sec is entirely misleading. The second conclusion is that the heterogeneity of the algorithms points to architectural features for implementation. Simple algorithms such as filtering and correlation can be implemented very efficiently in architectures requiring little flexibility in the form of parameterizability. Complex algorithms must be programmable and thus require high flexibility.

Fig. 1: Operations per sample over sampling rate [1/s] for the digital baseband of a 384 kbps UMTS receiver (baseband complexity and heterogeneity). Both axes are logarithmic; diagonal lines mark 1, 10, 100 and 1000 MOPS. Plotted blocks: MUSIC, delay acq., AGC, AFC, RRC pulse MF, interpolation/decimation, correlators, max. ratio combining, timing tracking, channel estimation, SIR estimation, path searcher, turbo decoder.

Recently, Van Berkel determined the complexity of various algorithms in baseband processing based on a similar metric. In his remarkable and comprehensive paper [1] he has shown the number of "algorithmic" operations which have to be performed per received bit by the algorithms of a baseband receiver for different communication standards. Eberli from ETH Zürich uses a similar metric [2] for measuring complexity in baseband processing, calling these operations "atomic" operations.
From a communication system point of view we can separate digital processing in the baseband into two parts: the so-called "inner modem" and "outer modem" [3]. The task of the inner modem is the extraction of symbols from the received signal waveform, i.e., equalization, channel estimation, interference cancellation and synchronization. The outer modem performs demodulation, de-interleaving and channel decoding on the received symbols. Thus algorithmic complexity in baseband processing is normally plotted separately for the inner and the outer modem. A large diversity exists in the various baseband processing algorithms with respect to operation types, operation complexity, and data types, especially between the inner and outer modem. Figure 2 in [1] shows the algorithmic complexity for the inner and outer modem measured in giga operations per second (GOPs). It can be seen that sophisticated decoding schemes like turbo and LDPC codes utilized in advanced services like LTE require many more operations than the algorithms of the inner modem.

B. Implementation Complexity

On the implementation side a strong emphasis has to be put on the energy efficiency. Implementation complexity and algorithmic complexity are strongly interrelated in wireless baseband processing. Thus they have to be related to each other. Often it is argued that the implementation complexity is directly related to the algorithmic complexity. For example, Eberli [2] considers the implementation complexity by introducing a cost factor for each atomic operation which reflects its implementation cost. For design space exploration, graph representations are commonly used:

• A two-dimensional energy efficiency graph: one axis corresponds to the algorithmic complexity, e.g. measured in GOPs, and the other axis to the power, e.g.
measured in mW, consumed when providing the corresponding operations/second. Each point in this graph describes the energy efficiency metric, i.e. operations/second/power unit, usually measured in GOPs/mW. Since energy corresponds to power multiplied by execution time, each point gives the operations/energy measured in operations/Joule.
• In a similar way we can set up an area efficiency graph in which one axis represents the needed area. Each point in this graph yields the area efficiency metric, i.e. operations/second/area unit, usually measured in GOPs/mm².

Note that the energy and area efficiency for the same algorithmic complexity can vary by several orders of magnitude, dependent on the selected implementation style. By far the highest energy efficiency is achieved by physically optimized circuits, however, at the expense of no flexibility. The highest flexibility via software programmability, at the expense of low energy efficiency, is achieved by digital signal processors. The designer has to find a compromise between the two conflicting goals by trading off flexibility vs. energy efficiency. Flexibility is hard to quantify. The optimum design point is thus to be understood qualitatively. It depends on the application and a large number of economic and technical considerations.

We can combine energy and area efficiency in a two-dimensional design space in which the two axes correspond to area and energy efficiency, respectively. This is a well-known representation of the design space.

C. Assessment of Metrics

Area, throughput and especially energy in many system-on-chip implementations are dominated by data-transfers and storage schemes [4] and not by the computations themselves. However, common metrics as described above focus solely on operations and do not consider data-transfer and storage issues at all.
Thus, these metrics are only valid if the operations dominate the implementation complexity. This is the case in data-flow dominated algorithms like an FFT calculation, correlation or filtering. Most algorithms in the inner modem of baseband receivers belong to this class of algorithms. However, the channel decoding algorithms in the outer modem differ largely from the algorithms used in the inner modem. Here, the operations to be performed are non-standard operations (e.g. tanh) using non-standard data types (e.g. 7 bits). But more importantly, the overall implementation complexity, especially energy, is dominated by data-transfers and storage schemes. A change in the algorithm with respect to computation, e.g. optimal versus suboptimal algorithm by approximating computations, has only a minor impact on the energy efficiency, as shown later.

The transition from 3G to LTE Advanced requires 2 orders of magnitude improvement in energy efficiency. This improvement will come to a small extent from technology scaling [5]. Efficient system-on-chip implementations are feasible when channel coding schemes and the corresponding decoding algorithms are co-designed together with the architecture (architecture aware code design) [6] [7] [8]. This is in accordance with a general trend towards co-design of algorithm and architecture in receiver design, realizing that the traditional separation of algorithm and architecture design leads to suboptimal results. In channel decoding the co-design focuses on data-transfer and storage schemes. Examples are special interleavers for turbo codes (e.g. LTE standard [9]) and special structures of the parity check matrix for LDPC codes (e.g. DVB-S2 standard [10]). These special structures allow an efficient parallel implementation of the decoding algorithm with small overhead in data-transfer and storage. GOPs based metrics do not reflect such specific structures at all.

Another important issue is flexibility.
Flexibility on the algorithmic side, e.g., code rates and block sizes in the case of channel decoding, has a large impact on the implementation complexity. By looking only at the operations in the algorithm, flexibility is normally not considered.

In summary, efficiency metrics based on GOPs are questionable, particularly for non-data-flow dominated algorithms, since they entirely neglect important issues like data and storage complexity, algorithm/architecture co-design and flexibility.

In this paper we focus on channel decoding as application. The contributions of this paper are:

• we show that the GOPs metric yields wrong conclusions.
• we introduce suitable metrics for energy and area efficiency.
• we present a methodology for design space exploration based on these metrics.

II. REFERENCE DESIGNS

Reference designs are key to assess various metrics. Thus, we selected 5 different channel decoder implementations which our research group has designed in the last couple of years. Using our own designs has the advantage that all data are available. The decoders differ in services (throughput, block sizes, code rates), decoding algorithms, flexibility and implementation styles. Selected codes are convolutional codes, turbo codes and LDPC codes. The 5 different decoders are:

• An application specific instruction set processor (ASIP) [11] capable of processing binary turbo codes, duo-binary turbo codes and various convolutional codes with different throughputs dependent on code rate and decoding scheme.
• A turbo decoder which is LTE [9] compliant. The maximum throughput is 150 Mbit/s at 6.5 decoding iterations.
• An LDPC decoder optimized for flexibility, supporting two different decoding algorithms, code rates from R=1/4 to R=9/10 and a maximum block length of 16384.
• An LDPC decoder which is WiMedia 1.5 compliant and optimized for throughput, supporting code rates from R=1/2 to R=4/5 with two block lengths, N=1200 and N=1320 bits [13].
• A convolutional decoder with 64 states which is WiFi [14] compliant.

All decoders are synthesized on a 65nm CMOS technology under worst case conditions with Vdd = 1.0 V, 120 °C. Power estimations are based on the nominal case Vdd = 1.1 V. Table I gives an overview of the key parameters of the different decoders. P&R indicates that the corresponding data are post-layout data. The payload (information bits) throughput depends on the number of decoding iterations for turbo and LDPC codes, which also impacts the communications performance. Thus, the throughput is specified dependent on the number of iterations.

TABLE I: Reference decoders: service parameters and implementation results in 65nm technology

Decoder | Flexibility | Max. block size | Throughput [Mbit/s] | Frequency [MHz] | Area [mm²] | Dynamic power [mW]
ASIP [11] | Conv. codes / Binary TC / Duo-binary TC | N=16k | 40 / 14 (6 iter) / 28 (6 iter) | 385 (P&R) | 0.7 (P&R) | ~100
LTE turbo [12] | R=1/3 to R=9/10 by puncturing | N=18k | 150 (6.5 iter) | 300 (P&R) | 2.1 (P&R) | ~300
LDPC flexible | R=1/4 to R=9/10 | N=16k | 30 (R=1/3, 40 iter) / 100 (R=1/2, 20 iter) / 300 (R=0.83, 10 iter) | 385 (P&R) | 1.172 (P&R) | ~389
LDPC WiMedia 1.5 [13] | R=1/2 to R=4/5 | N=1.3k | 640 (R=1/2, 5 iter) / 960 (R=0.75, 5 iter) | 265 | 0.51 | ~193
CC Decoder | 64-state NSC | | 500 | 500 | 0.1 | ~37

In Table II we show the number of algorithmic operations required to process the different types of convolutional codes, turbo codes and LDPC codes.
Bit-true C-reference models are used for operation counting. All operations were normalized to an 8 bit addition. The number of operations is related to one information bit which has to be decoded, i.e., operations/information bit. The total number of operations which have to be performed per second, i.e. GOPs, depends on the code rate R and on the throughput, which in turn depends on the number of iterations for LDPC and turbo codes. Two different algorithms for LDPC codes were investigated. Both are suboptimal algorithms approximating the belief propagation algorithm: the Min-Sum algorithm with a scaling factor and the λ-3-Min algorithm [15], which is a more accurate approximation. However, the latter needs about 3.3 times more operations. This more accurate approximation is mandatory if lower code rates R < 0.5 have to be supported, as in DVB-S2 decoders [16]. In Table II we have only listed the operations for the Min-Sum algorithm. To obtain the operations for the λ-3-Min algorithm, operations and GOPs have to be multiplied by 3.3. The flexible LDPC decoder in our reference design was designed for both decoding algorithms; the WiMedia LDPC decoder is based on the Min-Sum algorithm only.

TABLE II: Number of normalized algorithmic operations per decoded information bit for different channel decoders dependent on throughput and code rate R.

Code | #iter | operations/bit | GOPs at 100 Mbit/s | GOPs at 300 Mbit/s | GOPs at 1 Gbit/s
CC (states=64) | | 200 | 20 | 60 | 200
LDPC Min-Sum | 5 | 75/R | 7.5/R | 22.5/R | 75/R
LDPC Min-Sum | 10 | 150/R | 15/R | 45/R | 150/R
LDPC Min-Sum | 20 | 300/R | 30/R | 90/R | 300/R
LDPC Min-Sum | 40 | 600/R | 60/R | 180/R | 600/R
Turbo (Max-Log) | 2 | 280 | 28 | 84 | 280
Turbo (Max-Log) | 4 | 560 | 56 | 168 | 560
Turbo (Max-Log) | 6 | 840 | 84 | 252 | 840

III. SUITABLE METRICS

The energy efficiency graph for our reference designs is shown in Figure 2.
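The GOPs values behind such an operation-based graph follow from Table II as operations per information bit times information bit throughput. The sketch below is a hedged illustration: the function names are hypothetical, the per-iteration constants (15/R for Min-Sum, 140 for Max-Log turbo) are simply read off the linear rows of Table II, and applying them to other iteration counts is an assumption. Power values are the approximate dynamic power figures from Table I.

```python
def ops_per_bit(code: str, iterations: int = 1, rate: float = 1.0,
                lambda_3_min: bool = False) -> float:
    """Normalized operations per decoded information bit (constants from Table II)."""
    if code == "cc":
        return 200.0                      # 64-state convolutional code
    if code == "turbo":
        return 140.0 * iterations         # 280 ops/bit at 2 iterations
    if code == "ldpc":
        base = 15.0 * iterations / rate   # 75/R ops/bit at 5 iterations (Min-Sum)
        return base * 3.3 if lambda_3_min else base
    raise ValueError(f"unknown code type: {code}")


def gops(ops_bit: float, throughput_mbit_s: float) -> float:
    """Operations per second in GOPs: (ops/bit) * (Mbit/s) / 1000."""
    return ops_bit * throughput_mbit_s / 1e3


def gops_per_mw(ops_bit: float, throughput_mbit_s: float, power_mw: float) -> float:
    """Operation-based energy efficiency; 1 GOPs/mW equals 1 operation per pJ."""
    return gops(ops_bit, throughput_mbit_s) / power_mw


# Convolutional decoder: 500 Mbit/s at roughly 37 mW (Table I).
cc_eff = gops_per_mw(ops_per_bit("cc"), 500, 37)
# WiMedia LDPC decoder: 960 Mbit/s, R=0.75, 5 iterations, roughly 193 mW (Table I).
wimedia_eff = gops_per_mw(ops_per_bit("ldpc", iterations=5, rate=0.75), 960, 193)

print(f"CC:           {cc_eff:.2f} GOPs/mW")
print(f"LDPC WiMedia: {wimedia_eff:.2f} GOPs/mW")
```

Note how switching `lambda_3_min=True` multiplies the GOPs by 3.3 without any change to throughput, area or power in this model, which is exactly why an operation-based metric can shift a design point even when the hardware barely changes.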
It can be seen that the energy efficiency, measured in GOPs/mW, varies largely for the different decoders. The two-dimensional design space, covering area and energy efficiency, is illustrated in Figure 3. In this graph, efficient architectures w.r.t. area and energy are located in the upper right corner. Less efficient architectures are placed in the lower left corner. We see that the convolutional decoder appears to be the most efficient decoder, while the ASIP is the decoder with the lowest efficiency.

Fig. 2: Operations/second versus power. Axes: power (mW) and GOPs, both logarithmic from 1 to 1000, with reference lines at 1 GOPs/mW and 0.1 GOPs/mW. Plotted decoders: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s), CC (500 Mbit/s).

One interesting observation is the efficiency of the flexible LDPC decoder. The efficiency largely increases in both directions (area, energy) when replacing the Min-Sum by the λ-3-Min algorithm, which is the more complex algorithm in terms of operations. As described in the previous section, the GOPs for this algorithm increase by a factor of about 3.3. We would expect a large increase in area and power accordingly. However, the area and power only increase by about 10% for the λ-3-Min algorithm. This is due to the fact that area and energy in both decoders are dominated by the data-transfer and storage scheme, and the change in the operations of the algorithm has only a small impact on it. In other words, the number of operations increases much more than the implementation complexity. This is a hint that a GOPs based metric is not suited. Moreover, we see that the λ-3-Min based flexible decoder has nearly the same efficiency as the less flexible WiMedia decoder which is optimized for throughput.
We would expect that such a less flexible, throughput-optimized decoder has a higher efficiency compared to the flexible one.

Fig. 3: Energy efficiency versus area efficiency based on operations. Axes: area efficiency (GOPs/mm²) and energy efficiency as operations per energy (op/pJ), both logarithmic. Plotted decoders: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s, Min-Sum), CC (500 Mbit/s).

In the following we introduce metrics to resolve the aforementioned anomalies. Instead of using the operations which have to be carried out per task, we normalize to the number of information bits per task. Metrics normalized to the number of information bits have the following properties. They allow comparing

• competing architectures for a given algorithm, since the efficiency metrics are independent of the specific operations and data types used to execute the algorithm. All implementation issues like data-transfer and storage are taken into account since the metric is oblivious to how the task has been executed.
• different coding schemes as a function of the communication parameters (modulation, signal to noise ratio, bandwidth). In particular, iterative decoding algorithms can be compared in a meaningful way to non-iterative algorithms.

Energy efficiency is a multidimensional problem. The performance of the physical layer of a communication system depends on the transmit energy via the SNR at the receiver, denoted communication energy, and the processing energy in the receiver to retrieve the information, denoted computation energy. There exists an interesting trade-off between communication and computation energy which has to be exploited in advanced, adaptive systems. For example, iterative algorithms operate at much lower SNR than convolutional codes.
Decreasing the SNR, however, results in an exponentially increasing complexity and energy consumption due to the large number of iterations required for decoding.

We define the two suitable metrics for implementation efficiency as follows:

• energy efficiency metric: decoded information bit per energy, measured in bit/nJ
• area efficiency metric: information bit throughput per area unit, measured in Mbit/s/mm²

We have mapped our decoders to the design space which is based on these metrics, see Figure 4. Again efficient architectures are placed in the upper right corner, inefficient architectures in the lower left corner. A large change in the relative and absolute positions can be observed for some decoders when comparing them with Figure 3.

Fig. 4: Design space based on suitable metrics. Decoded information bit per energy over information bit throughput per area unit. Axes: area efficiency ((Mbit/s)/mm²) and energy efficiency as decoded bit per energy (bit/nJ), both logarithmic. Plotted decoders: ASIP TC (14 Mbit/s), LTE TC (150 Mbit/s), LDPC flexible (~100 Mbit/s, Min-Sum and λ-3-Min), LDPC WiMedia (~1 Gbit/s), CC (500 Mbit/s).

• The difference in the efficiency between the two instances of the flexible LDPC decoder (Min-Sum and λ-3-Min decoder, respectively) is now much smaller. Moreover, the Min-Sum decoder is now more efficient than the λ-3-Min one, which was not the case in the conventional design space. This matches our expectations since the data-transfer and storage scheme in both decoders is nearly identical and the increase in computation results in only a small energy and area increase as described above. Both decoders are targeting the same throughput.
• The efficiency of the WiMedia decoder, which is optimized for throughput and less flexibility, is now much larger than the efficiency of the flexible LDPC decoder, which again matches what we expected.
So far we have focused on the implementation complexity but have not discussed the important aspects of flexibility and communication performance. In the following we will investigate the relationship between communication performance, flexibility and implementation efficiency.

IV. METHODOLOGY

In the previous section we investigated the absolute and the relative positions of the different decoders to each other. However, equally important in this space is the trajectory when specific parameters are changed, since such a trajectory represents the impact of a specific parameter on the decoder efficiency. The following parameters will be considered: the frame error rate (FER), i.e. communications performance, coding techniques, code rates, number of iterations and throughput. The resulting trajectories illustrate well the strong dependency between communications performance and implementation efficiency.

We present two design space exploration methodologies. The first one is driven by implementation efficiency and compares non-iterative decoding techniques (convolutional codes) with iterative decoding techniques (LDPC codes). The second exploration compares two different iterative decoding techniques (LDPC and turbo codes) with code rate flexibility.

A. Implementation driven design space exploration

We use the current WiMedia 1.5 standard for demonstration. WiMedia features low complexity devices for UWB communication. Thus the WiMedia 1.2 standard used convolutional codes as channel coding technique. In the definition phase of the next generation standard, WiMedia 1.5, LDPC codes were considered as a promising candidate due to their much better communication performance. A throughput of 960 Mbit/s at code rate R = 0.75 was defined in the standard.
A code/architecture co-design approach [13] resulted in an LDPC decoder which has a much higher efficiency than the flexible LDPC decoder. Note that its efficiency is lower in both dimensions compared to a convolutional decoder. However, this comparison completely neglects the communications performance. At most five decoding iterations can be performed by the LDPC decoder to comply with the throughput fixed in the standard. As already pointed out, the number of iterations strongly impacts the performance of the LDPC decoder. The frame error rate as a function of the number of iterations is contrasted with implementation efficiency in Figure 5. Point 3 in the design space figure corresponds to the WiMedia 1.5 decoder when performing 5 iterations (this was the decoder assumption in the previous figures when we referred to the WiMedia LDPC decoder). The communication figure shows that this decoder has a 4 dB better communication performance than the convolutional decoder. The communication performance is comparable to that of the convolutional decoder if the LDPC decoder performs only two iterations instead of five (case 2 in Figure 5). Finally, executing only one iteration in the LDPC decoder results in a communication performance which is about 4 dB worse than the convolutional decoder (case 1 in Figure 5).

The resulting trajectory in the design space for the different cases is important. Two cases have to be distinguished:

• The system throughput is not changed w.r.t. the WiMedia 1.5 constraint (scenario a in Figure 5). In this scenario only the energy efficiency is improved (points 3 → 2a → 1a). Obviously the decoding time decreases when the decoder executes a smaller number of iterations, resulting in a negative time lag. This time lag can be exploited for energy efficiency improvement. For example, the clock and the power supply could be completely switched off when decoding is finished. This reduces energy and leakage current.
Another possibility is to slow down the frequency (frequency scaling). This reduces the energy by the same amount as in the previous case, but the peak power consumption during decoding, instead of leakage, is minimized. The most efficient technique is voltage scaling, in which the voltage is reduced, resulting in the highest energy efficiency.
• The system throughput is changed (scenario b in Figure 5). In this scenario the area efficiency increases by the same amount as the throughput increases due to the smaller number of iterations (points 3 → 2b → 1b).

We see that the efficiency of the LDPC decoder increases with decreasing communication requirements, i.e. number of iterations. Thus, the decoder efficiency is represented by a trajectory instead of a single point in the design space. This trajectory results from varying communication performance requirements. We also see that the efficiency of the LDPC decoder outperforms the convolutional decoder at the same communication performance.

B. Communications performance driven exploration

In the previous exploration we have compared the implementation efficiency of non-iterative and iterative decoding techniques dependent on throughput and frame error rate behavior for fixed code rates. In this section we compare two iterative decoding techniques and put emphasis on code rate flexibility and dependencies. The reference is an LTE turbo decoder implementation. This LTE turbo decoder is compared with a flexible LDPC decoder which supports code rate flexibility. The right graph in Figure 6 shows the communication performance for the two decoding schemes dependent on code rates (R = 0.5 and R = 0.83) and iteration numbers. The number of information bits is K = 6140 in all cases. Frame error rates are based on fixed point simulations matching the hardware implementation.
We use the communications performance of the turbo decoder with 6.5 iterations as reference point for both code rates. The 6.5 iterations result from the throughput constraint of 150 Mbit/s which is specified in the LTE standard. The 6.5 iterations fulfill the LTE communications performance requirements for all code rates. It is well known that the communication performance in LDPC decoding depends on the number of iterations and the code rate. The LDPC decoder under investigation provides large code rate flexibility, i.e., the hardware can support various code rates. The LDPC decoder requires 10 iterations for R = 0.83 and 20 iterations for R = 0.5 to match the performance of the turbo decoder. For a code rate of R = 1/3 even 40 iterations are mandatory (this is not shown in Figure 6b).

The corresponding trajectories in the implementation space are important. The turbo decoder efficiency is identical for all code rates (see left graph in Figure 6). Thus we have no trajectory. This is due to the fact that the code rate flexibility is implemented by puncturing, which has negligible impact on throughput, area and energy. However, the situation is completely different for the flexible LDPC decoder. For a given communications performance the code rate has a strong impact on the number of required iterations. This iteration number influences the implementation efficiency as we have seen in the previous exploration case. But besides this impact via the iteration number, there is also a direct impact of the code rate on the implementation efficiency, since lower code rates also require a more accurate decoding algorithm (λ-3-Min algorithm instead of the less complex Min-Sum algorithm). The resulting trajectory is shown in the left graph of Figure 6. We see that the efficiency increases in both directions with increasing code rate (points 1 → 2 → 3).
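Such a trajectory can be approximated from the three operating points that Table I lists for the flexible LDPC decoder. The sketch below is illustrative only: it assumes that the single area and power figures from Table I apply at every operating point (the roughly 10% overhead of the λ-3-Min algorithm mentioned earlier is not modeled), it makes no claim about which entry corresponds to which numbered point in Figure 6, and all names are hypothetical.

```python
AREA_MM2 = 1.172   # flexible LDPC decoder, post-layout (Table I)
POWER_MW = 389.0   # approximate dynamic power (Table I)

# (code rate, iterations, information bit throughput in Mbit/s) from Table I
OPERATING_POINTS = [
    (1 / 3, 40, 30.0),
    (1 / 2, 20, 100.0),
    (0.83, 10, 300.0),
]


def efficiency(throughput_mbit_s: float, area_mm2: float, power_mw: float):
    """Return (energy efficiency in bit/nJ, area efficiency in Mbit/s/mm^2)."""
    return throughput_mbit_s / power_mw, throughput_mbit_s / area_mm2


trajectory = [
    (rate, iters, *efficiency(tp, AREA_MM2, POWER_MW))
    for rate, iters, tp in OPERATING_POINTS
]

for rate, iters, bit_per_nj, mbit_s_per_mm2 in trajectory:
    print(f"R={rate:.2f}  iter={iters:2d}  "
          f"{bit_per_nj:.3f} bit/nJ  {mbit_s_per_mm2:6.1f} Mbit/s/mm^2")
```

Under these simplifying assumptions both metrics grow by the same factor as the throughput, so the three points move toward the upper right of the design space as the code rate increases.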
The important observation in this exploration is the varying implementation efficiency of the flexible LDPC decoder, represented by the trajectory. This trajectory results from the code rate flexibility required of the LDPC decoder to match the communications performance of a competitive turbo code decoder. We see that analyzing only one code rate, and thus one snapshot, could lead to wrong conclusions about efficiency.

The two explorations have shown that the implementation efficiency of advanced iterative decoders often results in a trajectory instead of a single point in the design space. These trajectories result from the strong interrelation between communications performance, flexibility and implementation efficiency.

V. CONCLUSION

Understanding the trade-offs between implementation efficiency, communications performance and flexibility will be key for designing efficient baseband receivers. Meaningful efficiency metrics are mandatory to explore and evaluate the resulting huge design space. We introduced and discussed suitable energy and area efficiency metrics, which are based on decoded information bit per energy and throughput per area unit. Various channel decoder implementations were used to examine these efficiency metrics with respect to the achieved communications performance and the decoder flexibility. The presented methodology allows a systematic comparison of different realizations by jointly considering implementation efficiency, communications performance and flexibility.

REFERENCES

[1] C. H. van Berkel, "Multi-core for mobile phones," in Proc. DATE '09. Design, Automation & Test in Europe Conference & Exhibition, Apr. 20–24, 2009, pp. 1260–1265.
[2] S. C. Eberli, Ph.D. dissertation, ETH Zurich, Integrated Systems Laboratory, 2009.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication Receivers. John Wiley & Sons Inc, 1998.
[4] M. Miranda, C. Ghez, E. Brockmeyer, P. Op De Beeck, and F. Catthoor, "Data transfer and storage exploration for real-time implementation of a digital audio broadcast receiver on a Trimedia processor," in Proc. 15th Symposium on Integrated Circuits and Systems Design, Sept. 9–14, 2002, pp. 373–378.
[5] C. Rowen, "Energy-Efficient LTE Baseband with Extensible Dataplane Processor Units," in Proc. 9th International Symposium on Multiprocessor Systems-on-Chips (MPSoC'09), Savanna, USA, August 2009.
[6] E. Boutillon, J. Castura, and F. Kschischang, "Decoder-first code design," in Proc. 2nd International Symposium on Turbo Codes & Related Topics, Brest, France, Sep. 2000, pp. 459–462.
[7] M. Mansour and N. Shanbhag, "Architecture-Aware Low-Density Parity-Check Codes," in Proc. 2003 IEEE International Symposium on Circuits and Systems (ISCAS '03), Bangkok, Thailand, May 2003.
[8] J. Kwak and K. Lee, "Design of dividable interleaver for parallel decoding in turbo codes," Electronics Letters, vol. 38, no. 22, pp. 1362–1364, Oct. 2002.
[9] "3GPP LTE (Long Term Evolution) Homepage." [Online]. Available: http://www.3gpp.org/Highlights/LTE/LTE.htm
[10] European Telecommunications Standards Institute (ETSI), "Digital Video Broadcasting (DVB) Second generation framing structure, channel coding and modulation systems for Broadcasting, Interactive Services, News Gathering and other broadband satellite applications; TM 2860r1 DVBS2-74r8," www.dvb.org.
[11] T. Vogt and N. Wehn, "A Reconfigurable ASIP for Convolutional and Turbo Decoding in a SDR Environment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1309–1320, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1109/TVLSI.2008.2002428
[12] M. May, T. Ilnseher, N. Wehn, and W. Raab, "A 150Mbit/s 3GPP LTE Turbo Code Decoder," in Proc. Design, Automation and Test in Europe, 2010 (DATE '10), 2010, accepted for publication.
[13] M. Alles, F. Berens, and N. Wehn, "A Synthesizable IP Core for WiMedia 1.5 UWB LDPC Code Decoding," in Proc. IEEE International Conference on Ultra-Wideband ICUWB 2009, Vancouver, Canada, September 2009, pp. 597–601.
[14] IEEE 802.11, "Wireless Fidelity (Wireless LAN)," http://grouper.ieee.org/groups/802/11/.
[15] F. Guilloud, E. Boutillon, and J. Danger, "λ-Min Decoding Algorithm of Regular and Irregular LDPC Codes," in Proc. 3rd International Symposium on Turbo Codes & Related Topics, Brest, France, Sep. 2003, pp. 451–454.
[16] S. Müller, M. Schreger, M. Kabutz, M. Alles, F. Kienle, and N. Wehn, "A novel LDPC decoder for DVB-S2 IP," in Proc. DATE '09. Design, Automation & Test in Europe Conference & Exhibition, Apr. 20–24, 2009, pp. 1308–1313.

[Figure 5, panel (a) Implementation efficiency: area efficiency in (Mbit/s)/mm² and energy efficiency in decoded bit/energy (bit/nJ) for the WiMedia 1.5 LDPC and convolutional (CC) decoders, with points R, 1a, 2a, 3, 2b and 1b. Panel (b) Communications performance (16-QAM, CM1 channel according to IEEE 802.15.3a [13]): FER over E_B/N_0 [dB] for the CC at R = 3/4 and for the LDPC code at R = 3/4 with 1, 2 and 5 iterations.]

R: Reference convolutional decoder with code rate R = 0.75 and fixed throughput of 1 Gbit/s
1: LDPC code performing 1 iteration
   a) identical throughput (1 Gbit/s), ∼4 dB worse communications performance
   b) 5 times higher throughput (5 Gbit/s), ∼4 dB worse communications performance
2: LDPC code performing 2 iterations
   a) identical throughput (1 Gbit/s), ∼1 dB better communications performance
   b) 2.5 times higher throughput (2.5 Gbit/s), ∼1 dB better communications performance
3: LDPC code performing 5 iterations
   identical throughput (1 Gbit/s), ∼4 dB better communications performance

Fig. 5: Implementation efficiency and communications performance for the WiMedia 1.5 standard.

[Figure 6, panel (a) Implementation efficiency: area efficiency in (Mbit/s)/mm² and energy efficiency in decoded bit/energy (bit/nJ) for the LTE turbo decoder (TC LTE) and the flexible LDPC decoder, with points R, 1, 2 and 3. Panel (b) Communications performance (BPSK, AWGN channel): FER over E_b/N_0 [dB]; curve labels are TC with 3 and 6.5 iterations and LDPC with 5, 10 and 20 iterations, for R = 0.5 and R = 0.83.]

R: Reference LTE turbo decoder
   150 Mbit/s throughput for all code rates
   reference TC performance at 6.5 iterations
1: LDPC code at code rate R = 1/3
   ∼max throughput 30 Mbit/s
   ∼40 iterations to match TC performance
2: LDPC code at code rate R = 1/2
   ∼max throughput 90 Mbit/s
   ∼20 iterations to match TC performance
3: LDPC code at code rate R = 0.83
   ∼max throughput 300 Mbit/s
   ∼10 iterations to match TC performance

Fig. 6: Implementation efficiency and communications performance of the LTE turbo decoder and a flexible LDPC decoder.
