Application of a design space exploration tool to enhance interleaver generation



CHAVET Cyrille 1, COUSSY Philippe 2, URARD Pascal 1, MARTIN Eric 2
1 STMicroelectronics, Crolles, FRANCE. {firstname.lastname@st.com}
2 LESTER Lab, UBS University, CNRS FRE 2734. {firstname.lastname@univ-ubs.fr}

ABSTRACT

This paper presents a methodology to efficiently explore the design space of communication adapters. In most digital signal processing (DSP) applications, the overall performance of the system is significantly affected by the communication architecture; as a consequence, designers need specifically optimized adapters. By explicitly modeling these communications within an effective graph-theoretic model and analysis framework, we automatically generate an optimized architecture, named Space-Time AdapteR (STAR). Our design flow takes as input a C description of the Input/Output data scheduling and user requirements (throughput, latency, parallelism…), and formalizes communication constraints through a Resource Constraints Graph (RCG). Design space exploration is then performed through associated tools to synthesize a STAR component under time-to-market constraints. The proposed approach has been tested on an industrial data-mixing block example: an Ultra-Wideband interleaver.

1. INTRODUCTION

In the multimedia and telecommunications domain, continuously emerging customer services impose severe performance requirements (computing power, timing performance, and memory bandwidth/capacity) to implement the new communication standards. Indeed, communication system applications require high throughput (on the order of several hundred Mb/s) together with both low latency and severe bit error rate (BER) constraints (e.g. wireless and fiber-optic communication).
Owing to their impressive near-Shannon-limit error-correcting performance, turbo-like codes in their parallel or serially concatenated versions [2], originally dedicated to channel coding, are currently being reused across a large set of digital communication systems (e.g. equalization, demodulation, synchronization, MIMO). These codes are formed by two or more processing elements (PEs), encoders/decoders, and a communication network that interleaves the data blocks exchanged by the PEs. The turbo decoding principle is based on an iterative algorithm in which decoders exchange information to improve the error-correction performance over the iterations. The iterative nature of these algorithms makes it difficult to satisfy the aforementioned requirements with an affordable implementation complexity. A widespread solution is to realize the turbo decoder in a parallel fashion. On the one hand, this solution increases the throughput, since the latency of the system becomes the latency of the constituent sub-blocks [6]. On the other hand, the complexity and the cost of the system increase due to the parallel nature of the architecture. Moreover, for the sub-blocks to be able to work in parallel, each one must exchange data with a Random Access Memory (RAM) block. Depending on the specific permutation law, different modules may then try to simultaneously access the same RAM; as a consequence, none of them is able to retrieve its data. This problem is known as the "collision" problem [9]. In this case, the memory accesses have to be postponed and carefully arbitrated, which slows down the decoding process. The solution consists in designing an adapted interleaver and/or modifying the decoder architecture. In this paper, we propose to use the formal approach presented in [3] to tackle the interleaver design problem.
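The collision problem described above can be made concrete with a small sketch. The following Python fragment (an assumed illustration, not an algorithm from this paper) assigns interleaved data to memory banks with a simple modulo mapping and reports the time steps at which two parallel PEs would target the same bank:

```python
# Sketch of the "collision" problem: P parallel PEs each fetch one
# interleaved datum per step; a collision occurs when two PEs target
# the same RAM bank in the same step. The modulo bank mapping below
# is an assumption for illustration only.
def collisions(perm, num_pe):
    n = len(perm)
    cols = n // num_pe
    clashing_steps = []
    for step in range(cols):
        # PE p fetches the interleaved index perm[p*cols + step] at this step
        banks = [perm[p * cols + step] % num_pe for p in range(num_pe)]
        if len(set(banks)) < num_pe:   # at least two PEs hit the same bank
            clashing_steps.append(step)
    return clashing_steps

# A permutation keeping even/odd indices apart is collision-free here
print(collisions([0, 2, 4, 6, 1, 3, 5, 7], 2))   # -> []
# The identity order, split over 2 PEs, collides at every step
print(collisions([0, 1, 2, 3, 4, 5, 6, 7], 2))   # -> [0, 1, 2, 3]
```

With a collision-free pairing, every step accesses distinct banks; with an unfortunate one, every access must be arbitrated, which is exactly the slowdown the text describes.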
The paper is organized as follows: the second section presents the state of the art of interleaving architectures. The third section is dedicated to the formulation of the interleaver design problem. In the fourth section we briefly present our design flow. Finally, the last section presents experimental results and the design exploration offered by our design flow on an industrial example.

2. INTERLEAVING ARCHITECTURE

Interleaving is a permutation rule that scrambles data to break up neighbourhood relations. It is a key factor for turbo-code performance, and it varies from one standard to another. Moreover, within a given standard, different interleaving rules can be used for different modes, through varying frame lengths and/or data rates. In this context, taking into account the aforementioned constraints and the collision problem, hardware implementations of parallel turbo decoders require the integration of complex topologies supporting the intensive interleaved memory accesses. Indeed, in state-of-the-art parallel turbo decoding, interleaving is considered a limiting factor for both the overall system performance and the architectural cost. To tackle these problems, different solutions have been proposed recently. A first class of solutions, which avoids collisions with non-prunable interleavers, consists in designing a specific interleaver rule. In [9], the authors propose a deterministic methodology to design collision-free interleavers. In [10] and [8], the authors define collision-free permutations through the combination of a spatial and a temporal permutation. The authors of [12] simply integrate the collision-free constraint into the design of their interleaver. However, multi-mode architectures (depending on frame length, data rate…) cannot be handled by such approaches.
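The combination of a spatial and a temporal permutation mentioned above can be sketched as follows. This is a hedged illustration of the general idea, not the specific constructions of [10] or [8]: the spatial permutation decides which bank each concurrent access goes to, the temporal permutation decides when, and collision freedom follows by construction because, at every step, the spatial part sends the concurrent accesses to distinct banks:

```python
# Compose a spatial permutation (over banks/PEs) with a temporal
# permutation (over time steps) into one flat interleaver. The rotation
# (p + t) % num_pe is an assumed choice that guarantees distinct banks
# per step; the construction itself is illustrative, not from the paper.
def build_interleaver(spatial, temporal, num_pe):
    cols = len(temporal)
    perm = [0] * (num_pe * cols)
    for t in range(cols):
        for p in range(num_pe):
            bank = spatial[(p + t) % num_pe]       # distinct for each p
            perm[p * cols + t] = bank * cols + temporal[t]
    return perm

num_pe, cols = 4, 8
spatial = [2, 0, 3, 1]                    # assumed permutation of banks
temporal = [5, 2, 7, 0, 3, 6, 1, 4]       # assumed permutation of steps
perm = build_interleaver(spatial, temporal, num_pe)

assert sorted(perm) == list(range(num_pe * cols))   # a valid permutation
for t in range(cols):                               # collision-free per step
    banks = {perm[p * cols + t] // cols for p in range(num_pe)}
    assert len(banks) == num_pe
```

The appeal of this family of solutions is that collision freedom is structural, with no runtime arbitration needed; its limitation, as the text notes, is that the permutation law is then fixed by the construction rather than by the standard.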
Another solution consists in defining a collision-free interleaver that preserves this property even when pruned. In [7], the authors describe a design rule to obtain such interleavers, with an incremental algorithm that generates collision-free interleavers by adding new elements in successive steps to a small permutation. Of course, all these solutions are viable only if the designer is free to choose the permutation law to be used in the system. As a consequence, the resulting architecture may not be standard compliant. In [16], the authors propose, in case of a collision, to store the conflicting information in the communication network until the targeted sub-block can process it. Of course, the additional network buffering resources, and consequently the time needed to interleave the information, increase with the number of parallel processors. This is a suboptimal strategy in terms of latency, and thus throughput, which avoids collisions at the expense of area and memory. Moreover, the communication is based on a Benes network [5], which might be suboptimal compared to a dedicated and optimized architecture. Unlike these implementations, in [13] the authors propose a solution based on software and/or reconfigurable parts to achieve the required flexibility, but it achieves lower throughput. In [14], an advanced heterogeneous communication network implementation was proposed: two multistage interconnection network architectures are presented to handle on-chip communications in multiprocessor parallel turbo decoders. They are based on a dedicated network and associated routers. The main feature of these network architectures (Butterfly- and Benes-based topologies) is their supposed scalability, enabling a seamless trade-off between hardware complexity and available bandwidth for turbo decoding. The Butterfly network, which lacks path diversity, is a multistage interconnection network with 2-input, 2-output routers.
There is a unique path between each source and destination. As a consequence, the risk of conflict is increased, and the authors have to add queues to store conflicting information. The second network architecture proposed is based on a Benes network. In this case, the latency is constant for all (source, destination) couples, but this network avoids conflicts if and only if all the paths have a different destination. Unfortunately, this is not true for turbo-decoding applications, because interleaving (respectively de-interleaving) leads to potential conflicts. Moreover, as already mentioned, Benes networks are costly and under-optimized solutions. Finally, the authors of [15] describe a system that avoids collisions for every interleaver and any degree of parallelism. This solution consists in automatically finding a collision-free data memory mapping respecting the interleaving rule, thanks to a simulated-annealing algorithm. As a consequence, the user cannot predict when the algorithm will end. Moreover, the proposed approach does not target the optimization of the storage elements. In this paper, we propose to use the formal approach presented in [3] to tackle the interleaver design problem. This approach, which originally targets interface synthesis, is shown to be also suited to interleaver design space exploration. Our design flow can take as input timing diagrams (constraints file) or C descriptions of I/O data scheduling (e.g. an interleaving formula), together with user requirements (throughput, latency…). We formalize communication constraints through a formal Resource Constraints Graph (RCG), whose properties enable an efficient architecture exploration. By using our design flow, any user can generate an architecture optimized in terms of latency, network architecture and memory, from any interleaving standard.

3. PROBLEM FORMULATION

First, from the throughput and parallelism constraints and an interleaver permutation pattern, we can formalize data communication as timing diagrams. Let us consider a simple architecture example composed of two components exchanging a set of data S = {a, b, c, d, e, f}. S is produced by a block #1 and is consumed by a block #2 through a single point-to-point link. The write access sequence into the communication link is S_w = (a, c, b, e, f, d), i.e. t_w(a)
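The timing formalization introduced here can be illustrated with a short sketch. The write order S_w is taken from the text; the read order S_r and the assumption that reads start after all writes are hypothetical choices made only for this illustration (the source is truncated before giving them). The sketch derives each datum's write time t_w and read time t_r, the quantities the RCG-based flow reasons about, and checks the basic precedence constraint that every datum is written before it is read:

```python
# Illustrative sketch (not the STAR tool itself): derive access times
# from I/O sequences. S_r and the read offset are assumptions for the
# sake of the example.
def access_times(seq, offset=0):
    """Map each datum to the cycle of its access, starting at `offset`."""
    return {d: offset + t for t, d in enumerate(seq)}

S_w = ["a", "c", "b", "e", "f", "d"]   # write order, from the text
S_r = ["a", "b", "c", "d", "e", "f"]   # read order, assumed in-order here

t_w = access_times(S_w)
t_r = access_times(S_r, offset=len(S_w))   # assume reads follow the writes

# Precedence constraint: every datum must be written before it is read
assert all(t_w[d] < t_r[d] for d in S_w)
print({d: (t_w[d], t_r[d]) for d in sorted(S_w)})
```

From such (t_w, t_r) pairs one can read off, for instance, how long each datum must be buffered, which is the kind of communication constraint the RCG formalizes before architecture exploration.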
