Lossless data compression on GPGPU architectures


Authors: Axel Eirola

Aalto University School of Science

Abstract. Modern graphics processors provide exceptional computational power, but only for certain computational models. While they have revolutionized computation in many fields, compression has been largely unaffected. This paper aims to explain the current issues and possibilities in GPGPU compression. This is done by a high-level overview of the GPGPU computational model in the context of compression algorithms, along with a more in-depth analysis of how one would implement bzip2 on a GPGPU architecture.

1 Introduction

Ever since PCs became common household items and computer gaming hit mainstream attention during the 90s, the graphics processing unit (GPU) industry has been growing steadily alongside the CPU industry. The need for faster and ever more visually stunning graphics has led to a head-on battle between nVidia and AMD for more powerful GPUs, and for the consumers' money. The colossal computational power of GPU chips has until lately been utilized to its full capacity only in 3D gaming, but with the advent of new architectures it is now possible to exploit these computational resources for more general purposes.

General-purpose computing on graphics processing units (GPGPU) gives software designers the massive parallel computational power and memory bandwidth of modern GPUs, at the cost of more limited programming models. Some tasks, such as scientific computing and audio/video processing, are able to utilize these resources without large complications due to their parallelizable nature. Other tasks, such as data compression, and especially lossless data compression, are not as lucky and have more difficulty exploiting the power of GPUs.
This paper focuses on lossless data compression and attempts to build a picture of the incompatibilities between GPGPU computation and data compression. On the other hand, it also takes a look at what data-compression-related computations have successfully been ported to GPGPU platforms.

The rest of the paper is split into three sections. The next section 2 describes the GPGPU platforms a bit closer, relating their characteristics to the needs of data compression. Section 3 shows some successful applications of GPGPU on compression algorithms, and describes what parts of a bzip2 implementation could be accelerated by GPGPUs. Finally, in section 4 we discuss the current trends in the field of GPGPU compression.

2 GPGPU

Current leading-edge GPU chips provide far more computations per second than similarly priced leading-edge CPUs. The computational power of the GPUs is distributed over hundreds of computational cores, all working in parallel. To simplify synchronization of their concurrent computations, the GPGPU platforms run code in clusters of 32 threads called warps. Each of the threads in one warp executes the same instruction, on different data, at every given time. For fast performance, the data the processors operate on should be coalesced in contiguous and aligned blocks in memory, to enable a single memory transfer for all data. Multiple warps are run in blocks of hundreds of threads, which combine into a grid consisting of all executing threads. Until recently all threads would always run the same program, but newer architectures have enabled simultaneous execution of multiple programs.

This programming model fits nicely in image processing, where each computing core can process a single pixel, comparing it to nearby pixels in temporal or spatial dimensions, producing reductions or sensory compensations of them.
On the other hand, serial computations are slow on GPUs, as are many algorithms that parallelize efficiently on symmetric multiprocessing CPU systems. The limitations created by the tightly synchronized threads, and the memory access patterns, restrict what is feasible to do on a GPGPU platform.

The split system, with the CPUs on the motherboard and the GPUs as separate devices, also leads to some obstacles. Firstly, all data that is to be processed on the GPGPU needs to be copied over the PCIe bus. Although the PCIe 2.0 bus is fast (8 GB/s) in comparison to many other transport media, it does fall short of the parallel memory bandwidth on GPU cards (>200 GB/s), which can become a bottleneck for high-throughput applications. Although the memory transfer from device (GPU) to host (CPU) can be performed in parallel with executing code, it always adds some extra latency to the overall run time.

2.1 GPGPU Data Compression

The fact that lossless data compression is largely about finding redundancy within a set of data leads to compression algorithms that process the data in one or many passes, comparing incoming data with previously encountered data. This makes many of them highly serial in nature, and they are usually challenging to parallelize.

The most common way of parallelizing compression algorithms is thus to split the data into smaller blocks and execute the normal serial algorithm on each one. As a downside, this separates the redundancies, giving a slightly poorer compression ratio. There are multiple bzip2 implementations doing this with efficient scaling on multiple CPU cores [6]. This simple trick meant that little research had been done on more efficiently parallelizable algorithms before GPGPU came along.
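As a CPU-side illustration of this block-splitting trick, the following sketch compresses fixed-size blocks independently using Python's bz2 module. This is illustrative only, not the pbzip2 code itself; the point is that each block is self-contained and could be handed to a separate core, at the cost of redundancy that spans block boundaries.

```python
import bz2

def compress_blocks(data: bytes, block_size: int = 64 * 1024) -> list:
    """Split data into fixed-size blocks and compress each independently.

    Each block could be processed by a separate core; the price is that
    redundancy spanning block boundaries is lost, slightly hurting ratio.
    """
    return [bz2.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks: list) -> bytes:
    # Blocks are self-contained, so they can also be decompressed in parallel.
    return b"".join(bz2.decompress(b) for b in blocks)

# Round trip: splitting does not lose any data, only cross-block redundancy.
data = b"abracadabra" * 1000
assert decompress_blocks(compress_blocks(data)) == data
```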
The fact that GPGPUs have hundreds of processors and a non-general memory accessing scheme means that the method of splitting the data into smaller chunks isn't always feasible [16]. On the other hand, if one aims at very trivial and simple compression, by processing single fields or integers into more compact form using for example null suppression or Elias delta codes, one can achieve amazing parallel performance [12].

The real promise of GPGPU compression, though, is to be found in exploiting more general computational methods available for GPGPU platforms. Firstly, being such a widely used general tool, sorting has several efficient parallel implementations, such as merge sort executing O(n log n) steps in O(log n) time for an input of size n. On modern hardware it gives about a tenfold speedup compared to CPU implementations [13].

Perhaps more interesting is the parallel implementation of counting prefix sums. This algorithm computes intermediate values in a tree-like fashion, producing an output with O(n) steps in O(log n) time, and is around 20x faster than CPU implementations on modern hardware [11]. This has interesting applications in many GPGPU algorithms such as sorting [10] and lexicographic naming [14]. In the next chapter we will look into some parallel implementations of classic compression methods, where we can see more uses for the prefix sum.

3 Current compression algorithms

This chapter looks at some widely used compression algorithms, also used in bzip2, to see what parts of them have been successfully parallelized on GPGPU platforms. Notice beforehand that although the algorithms in this chapter are successful at compressing the incoming data, an efficient GPGPU decompressor might not always be available.
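Before looking at the individual algorithms, the prefix-sum primitive can be made concrete. The following Python sketch simulates the tree-like scan (in the Hillis-Steele style) sequentially; on a GPU, every position within one round would be handled by its own thread, giving the O(log n) time bound. The function name and structure are illustrative and not taken from any of the cited implementations.

```python
def inclusive_scan(values):
    """Inclusive prefix sum computed in a tree-like fashion.

    Runs ceil(log2 n) rounds; within a round every position is
    independent of the others, which is what makes the scheme
    GPU-friendly despite doing O(n) work per round.
    """
    out = list(values)
    stride = 1
    while stride < len(out):
        # On a GPU, this whole comprehension is one parallel step.
        out = [out[i] + out[i - stride] if i >= stride else out[i]
               for i in range(len(out))]
        stride *= 2
    return out

# The run lengths from the RLE figures: scanning [3, 2, 3] yields the
# boundary positions [3, 5, 8].
assert inclusive_scan([3, 2, 3]) == [3, 5, 8]
```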
3.1 Burrows-Wheeler Transform (BWT)

The bzip2 compression stack starts with BWT [4] to reorder the characters in the data into a more compressible form. Since the transform is executed simply by sorting the rotations of the input data, we can use any efficient sorting algorithm available for GPGPU. As stated above, there are many available that would make the encoding possible in parallel, giving reasonable speedups. Although no research papers seem to have been published on the subject yet, preliminary figures in the community claim reasonable speedups [15, 9].

Unfortunately the inverse Burrows-Wheeler transform isn't as simple to parallelize. One possibility for parallelization would be to utilize the property of the transform that the inverse can be started from any point of the output. Given this, each parallel processor can start at a point in the transformed string and continue until it hits a character from which decoding has already been started by another processor. Problems would possibly arise from the poor GPU performance of the very random memory accesses caused by the scattering of characters throughout the string.

Uncompressed input:         A A A B B C C C
Step 1: Run boundary search
  Run boundaries:           0 0 1 0 1 0 0 1
Step 2: Calculate output position with prefix sum
  Output positions:         0 0 0 1 1 2 2 2
Step 3: Copy to output position
  Data value:               A B C
  Data index:               3 5 8
Step 4: Calculate run length
  Run length:               3 2 3

Fig. 1. Run-length encoding. Empty cells are not part of the array, but are added to retain the straight flow of values. Thus the last three arrays are of size 3.

3.2 Run-Length Encoding (RLE)

Moving on to the next step of the bzip2 compression, we encounter RLE, which compresses long runs of the same character into a single character along with an integer specifying the number of times the character was repeated.
Since RLE only focuses on very local redundancies between the characters, it is easier to design parallel algorithms for it. One such algorithm has been researched by W. Fang et al. [5], and similar work has been done by A. Balevic [1].

The algorithm described by Fang utilizes as a central part the parallel implementation of counting prefix sums, mentioned above. The encoding works in four parallel steps, as can be seen in figure 1. In the first step the boundaries between the runs are identified by having each processor compare its designated character with the neighboring one, outputting 1s in an auxiliary array to mark the boundaries. In the second step the prefix sum of the auxiliary array is computed to get the output positions for the characters and run lengths. In step 3 the algorithm simply copies the index of the boundary in the original data, along with the character, into the output array. The final step, executed on the now smaller output array, computes the lengths of the runs by comparing the differences of the indices.

RLE is one of the few GPGPU compression algorithms for which an efficient decoding has been found and implemented. Working similarly to the encoding, in reverse, figure 2 shows the four steps of the algorithm. First the indices of the boundaries are calculated from the run lengths using the prefix sum algorithm. Secondly, 1s are written into an auxiliary array at the boundaries found in the previous result. In step 3 the prefix sum is run on the auxiliary array to compute an array containing indices into the compressed input. In the last step characters are copied to the output array according to the indices computed in the previous step.
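The four encoding steps, and the decoding, can be sketched sequentially in Python. Each per-element loop below corresponds to one parallel step of the algorithm described above; the helper names are illustrative, and the decoder is collapsed into a simple gather rather than the full four-step pipeline of figure 2.

```python
def rle_encode(data):
    n = len(data)
    # Step 1: mark run boundaries (1 where a run ends).
    boundaries = [1 if i == n - 1 or data[i] != data[i + 1] else 0
                  for i in range(n)]
    # Step 2: exclusive prefix sum gives each run its output slot.
    positions, total = [], 0
    for b in boundaries:
        positions.append(total)
        total += b
    # Step 3: scatter each run's character and (1-based) end index.
    values, ends = [None] * total, [0] * total
    for i in range(n):
        if boundaries[i]:
            values[positions[i]] = data[i]
            ends[positions[i]] = i + 1
    # Step 4: run length = difference between adjacent end indices.
    lengths = [ends[j] - (ends[j - 1] if j > 0 else 0)
               for j in range(total)]
    return values, lengths

def rle_decode(values, lengths):
    # Sequential stand-in for figure 2: a prefix sum over the lengths
    # would give the boundary positions; here we gather directly.
    out = []
    for v, l in zip(values, lengths):
        out.extend([v] * l)
    return out

# The worked example from figure 1.
values, lengths = rle_encode("AAABBCCC")
assert (values, lengths) == (["A", "B", "C"], [3, 2, 3])
assert rle_decode(values, lengths) == list("AAABBCCC")
```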
Character input:       A B C
Run length input:      3 2 3
Step 1: Compute boundary positions with prefix sum
  Boundary positions:  3 5 8
Step 2: Set run boundaries
  Run boundaries:      0 0 1 0 1 0 0 1
Step 3: Calculate character index with prefix sum
  Scatter positions:   0 0 0 1 1 2 2 2
Step 4: Copy characters from calculated indexes
  Decompressed output: A A A B B C C C

Fig. 2. Run-length decoding.

3.3 Move-to-front

Although no published research could be found on MTF encoding, there seems to be a possibility for trivial parallelization. By parallelizing over the stack of recently used symbols to find the index of the read character, we could reduce the time of looking up the index of the character. In practice, each processor would check whether the character at the processor's index matches the input character, and output this index to some shared register. Since each character occurs in the stack only once, no concurrency race conditions will occur for the write. Updating the state of the stack can be parallelized similarly. This would lead to constant-time lookup of the index, provided that the GPU has at least as many processors as there are characters in the alphabet. Although every core in the GPU is easily utilized, one must note that this implementation would execute more redundant computations than a serial implementation, which would stop after finding the first, and only, occurrence.

Speeding up decoding on GPGPU platforms might be more challenging, since the character lookup is already constant-time in serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at those places.

3.4 Variable-Length Encoding (VLE)

The last step of bzip2 covered in this paper is the Huffman coding [8].
Although no GPGPU algorithms are currently available for constructing the actual Huffman tree, there exists at least one algorithm, by A. Balevic [2], focusing on the substitution from uncompressed data to the variable-length encoding. Although the substitution seems like a trivial task for a serial algorithm, this isn't the case when we want to parallelize. The problem arises from the fact that, due to the variable-length codewords, we don't know where in the output array the converted codes should be copied.

Uncompressed input:          xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
Step 1: Codeword lookup
  Codewords:                 yyy  yyy  yyy  yyy  yyy  yyy  yyy  yyy
  Codeword lengths:          1    8    16   7    14   9    4    4
Step 2: Parallel prefix sum
  Codeword output positions: 0    1    9    25   32   46   55   59

Fig. 3. Variable-length encoding. xxxx represents one generic uncompressed character, and yyy represents its generic codeword.

The encoding algorithm works as in figure 3. First the input characters are converted to codewords in parallel, also storing the corresponding codeword length into an auxiliary array. Secondly, the prefix sum is calculated from the auxiliary array to get the output positions of the codewords. Finally, the codewords are copied into the calculated indices.

Here again, decompression is harder. This is due to the fact that the decoder doesn't know where one codeword ends and another begins before it has decoded the whole prior input. Some work on parallelizing variable-length codes using error-resilience methods has been done [3, Ch. 3.10], and could possibly be used on GPGPU platforms.

3.5 Other compression usages

In addition to general lossless data compression focusing on large-scale redundancies, as in the case of bzip2, successful work has also been done in other types of GPGPU data compression applications.
Firstly, a team at Texas State University has implemented a high-speed floating-point compression algorithm called GFC [12] that touts 75 GB/s compression speeds, for usage in scientific computation where massive amounts of data are generated and need to be processed in real time. The compression methods used are quite simple, mostly focusing on storing only deltas between floating-point values and removing unnecessary zeroes from the result. Unfortunately, the practical processing speed of the compression is limited by the PCIe bus to 8 GB/s.

Other work has been done in database data compression [5], where multiple different lightweight compression methods are combined to speed up GPU co-processing of database queries. Compressing the data makes it faster to move the data over the PCIe bus, and faster to read from disk. The compression techniques focus on compression of single database fields such as integers, using null suppression, dictionaries, RLE and more.

4 Discussion

The compression algorithms described above compose the central parts of bzip2. Given proper implementations of each, it shouldn't be hard to combine them together into a bzip2-compatible compression tool that could deliver high performance on modern GPUs. The efficiency of the MTF algorithm is debatable, but the benefit of having all code running on the GPU, removing unnecessary data transfers over the PCIe bus, might still make the overall compression speed faster. As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible. But this doesn't mean that pre-calculated ones couldn't be used, combined with some data analysis for choosing a fitting one for the current input data.
Although most parallel compression methods depend on splitting the compressible data into small chunks, cutting the dependencies between the chunks, we have here taken a look at some more sophisticated attempts at parallel compression on GPGPU platforms. GPGPU compression focusing on local changes can give very high compression bandwidths, while at the same time more wide-scale algorithms use general computational primitives for efficient parallelizable implementations.

Future work will most likely be affected by the different general algorithms discovered for GPGPU platforms and made easily available by libraries such as Thrust [7].

References

1. Balevic, A.: Fine-grain parallelization of entropy coding on GPGPUs. (2009)
2. Balevic, A.: Parallel variable-length encoding on GPGPUs. (2010)
3. Biskup, M.: Error Resilience in Compressed Data: Selected Topics. (2008)
4. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. (1994)
5. Fang, W., He, B., Luo, Q.: Database compression on graphics processors. (2010)
6. Gilchrist, J.: Pbzip2: Parallel bzip2 data compression software. (2009)
7. Hoberock, J., Bell, N.: Thrust: A parallel template library. (2010)
8. Huffman, D.: A method for the construction of minimum-redundancy codes. (1952)
9. inikep: CUDA/GPU-based BWT/ST4 sorting. http://encode.ru/threads/1208-CUDA-GPU-based-BWT-ST4-sorting (2011)
10. Leischner, N., Osipov, V., Sanders, P.: GPU sample sort. (2009)
11. Nguyen, H.: GPU Gems 3. (2007)
12. O'Neil, M., Burtscher, M.: Floating-point data compression at 75 Gb/s on a GPU. (2011)
13. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. (2009)
14. Sun, W., Ma, Z.: Parallel lexicographic names construction with CUDA. (2009)
15. WaveAccess: Breakthrough in CUDA data compression. http://waveaccessllc.blogspot.com/2011/04/breakthrough-in-cuda-data-compression.html (2011)
16. Wu, L., Storus, M., Cross, D.: CS315A Final Project, CUDA WUDA SHUDA: CUDA Compression Project. (2009)
