Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays

The well-known dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are the basis of several universal lossless compression techniques. These algorithms are asymmetric regarding encoding/decoding time and memory requirements, with the former …

Authors: Artur Ferreira, Arlindo Oliveira, Mario Figueiredo

Artur J. Ferreira (1, 3, 4), Arlindo L. Oliveira (2, 4), Mário A. T. Figueiredo (3, 4)

1 Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL
2 Instituto de Engenharia de Sistemas e Computadores, Lisboa, PORTUGAL
3 Instituto de Telecomunicações, Lisboa, PORTUGAL
4 Instituto Superior Técnico, Lisboa, PORTUGAL

Contact email: arturj@cc.isel.ipl.pt

Abstract

The well-known dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are the basis of several universal lossless compression techniques. These algorithms are asymmetric regarding encoding/decoding time and memory requirements, with the former being much more demanding, since it involves repeated pattern searching. In the past years, considerable attention has been devoted to the problem of finding efficient data structures to support these searches, aiming at optimizing the encoders in terms of speed and memory. Hash tables, binary search trees and suffix trees have been widely used for this purpose, as they allow fast search at the expense of memory. Some recent research has focused on suffix arrays (SA), due to their low memory requirements and linear construction algorithms. Previous work has shown how the LZ77 decomposition can be computed using a single SA or an SA with an auxiliary array with the longest common prefix information. The SA-based algorithms use less memory than the tree-based encoders, allocating the strictly necessary amount of memory, regardless of the contents of the text to search/encode. In this paper, we improve on previous work by proposing faster SA-based algorithms for LZ77 encoding and sub-string search, keeping their low memory requirements. For some compression settings, on a large set of benchmark files, our low-memory SA-based encoders are also faster than tree-based encoders.
This provides time and memory efficient LZ77 encoding, being a possible replacement for trees on well-known encoders like LZMA. Our algorithm is also suited for text classification, because it provides a compact way to describe text in a bag-of-words representation, as well as a fast indexing mechanism for quickly finding all the sets of words that start with a given symbol, over a static dictionary.

Keywords: Lempel-Ziv compression, suffix arrays, time-efficiency, memory-efficiency, pattern search.

1 Introduction

The Lempel-Ziv 77 (LZ77) and its variant Lempel-Ziv-Storer-Szymanski (LZSS) [14, 16, 19] lossless compression algorithms are the basis of a wide variety of universal source coders, such as GZip, WinZip, PkZip, WinRar, and 7-Zip, among others. Those algorithms are asymmetric in terms of time and memory requirements, with encoding being much more demanding than decoding.

The LZ-based encoders use efficient data structures, like binary trees (BT) [6, 11], suffix trees (ST) [5, 7, 9, 13, 17] and hash tables, thus allowing fast search at the expense of higher memory requirements. The use of a Bayer-tree along with special binary searches, on a sorted sliding window, to speed up the encoding procedure, has been addressed [6]. Suffix arrays (SA) [7, 10, 15], due to their simplicity, space efficiency, and linear time construction algorithms [8, 12, 18], have been a focus of research; for instance, the linear time SA construction algorithm suffix array induced sorting (SA-IS) has been recently proposed [12]. SA have been used in encoding data with anti-dictionaries [4] and to find repeating sub-sequences [1] for data deduplication, among other applications. Recently, space-efficient algorithms for computing the LZ77 factorization of a string, based on SA and auxiliary arrays, have been proposed to replace trees [2, 3].
These SA-based encoders require less memory than ST-based encoders, with some penalty on the encoding time, for roughly the same compression ratio. The amount of memory for the SA-based encoder is constant, independent of the contents of the sequence to encode, as opposed to tree-based encoders, for which a maximum amount of memory has to be allocated. In this paper, we improve on previous approaches [2, 3], proposing faster SA-based algorithms for LZ77/LZSS encoding, without requiring any modifications on the decoder side. These low-memory encoders are faster than the tree-based ones, like 7-Zip, being close to GZip in encoding time on several standard benchmark files.

The rest of this paper is organized as follows. Section 2 presents the basic concepts of LZ77/LZSS encoding using suffix arrays. Section 3 describes our proposed algorithm. The experimental results are discussed in Section 4 and some concluding remarks are made in Section 5.

2 Lempel-Ziv Compression using Suffix Arrays

The LZ77 and LZSS [14, 16, 19] lossless compression techniques use a sliding window over the sequence of symbols to be encoded, which has two sub-windows: the dictionary (holding symbols already encoded) and the look-ahead-buffer (LAB, containing the next symbols to be encoded). As the string in the LAB is encoded, the window slides to include it in the dictionary (this string is said to slide in); consequently, the symbols at the far end of the dictionary are dropped (slide out).

At each step of the LZ77/LZSS encoding algorithm, the longest prefix of the LAB which can be found anywhere in the dictionary is determined and its position stored. For these two algorithms, encoding of a string consists in describing it by a token.
The LZ77 token is a triplet of fields, (pos, len, sym), with the following meanings:

• pos - location of the longest prefix of the LAB found in the current dictionary; this field uses $\log_2(|dictionary|)$ bits, where |dictionary| denotes the length (number of bytes) of the dictionary;
• len - length of the matched string; this requires $\log_2(|LAB|)$ bits;
• sym - the first symbol in the LAB that does not belong to the matched string (i.e., that breaks the match); for ASCII symbols, this uses 8 bits.

In the absence of a match, the LZ77 token is (0, 0, sym). Each LZ77 token uses $\log_2(|dictionary|) + \log_2(|LAB|) + 8$ bits; usually, |dictionary| ≫ |LAB|. In LZSS, the token has the format (bit, code), with the structure of code depending on the value of bit as follows:

\[
\begin{cases}
bit = 0 \Rightarrow code = (sym), \\
bit = 1 \Rightarrow code = (pos, len).
\end{cases}
\tag{1}
\]

In the absence of a match, LZSS produces (0, sym). The idea is that, when a match exists, there is no need to explicitly encode the next symbol. Besides this modification, Storer and Szymanski [16] also proposed keeping the LAB in a circular queue and the dictionary in a binary search tree, to optimize the search. LZSS is widely used in practice since it typically achieves higher compression ratios than LZ77. In LZSS, the token uses either 9 bits, when it has the form (0, sym), or $1 + \log_2(|dictionary|) + \log_2(|LAB|)$ bits, when it has the form (1, (pos, len)).

The fundamental and most expensive component of these encoding algorithms is the search for the longest match between LAB prefixes and the dictionary.
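The token sizes above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and rounding the logarithm up to a whole number of bits is an assumption for lengths that are not powers of two (the text only states $\log_2$).

```python
def field_bits(size: int) -> int:
    """Bits needed to index `size` distinct values, i.e. ceil(log2(size))."""
    return (size - 1).bit_length()


def lz77_token_bits(dict_len: int, lab_len: int, sym_bits: int = 8) -> int:
    """Size in bits of one LZ77 token (pos, len, sym)."""
    return field_bits(dict_len) + field_bits(lab_len) + sym_bits


def lzss_token_bits(dict_len: int, lab_len: int, is_match: bool,
                    sym_bits: int = 8) -> int:
    """Size in bits of one LZSS token: 1 flag bit plus (pos, len) or sym."""
    if is_match:
        return 1 + field_bits(dict_len) + field_bits(lab_len)
    return 1 + sym_bits


if __name__ == "__main__":
    dict_len, lab_len = 4096, 1024
    print(lz77_token_bits(dict_len, lab_len))         # 12 + 10 + 8 = 30
    print(lzss_token_bits(dict_len, lab_len, True))   # 1 + 12 + 10 = 23
    print(lzss_token_bits(dict_len, lab_len, False))  # 1 + 8 = 9
```

With these sizes, an LZSS match token always costs 8 bits less than the LZ77 token for the same window, at the price of a separate 9-bit token whenever a literal symbol has to be sent.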
Assuming that the decoder and encoder are initialized with equal dictionaries, the decoding of each LZ77 token (pos, len, sym) proceeds as follows: 1) len symbols are copied from the dictionary to the output, starting at position pos of the dictionary; 2) the symbol sym is appended to the output; 3) the string just produced at the output is slid into the dictionary. For LZSS decoding, we have: 1) if the bit field is 1, len symbols, starting at position pos of the dictionary, are copied to the output; otherwise sym is copied to the output; 2) the string just produced at the output is slid into the dictionary. Both LZ77 and LZSS decoding are low complexity procedures, and thus decoding is much faster than encoding, because it involves no search. In this work, we address only the encoder side data structures and algorithms, with no effect on the decoder.

2.1 Suffix Arrays

A suffix array (SA) is the lexicographically sorted array of the suffixes of a string [7, 10]. For a string D of length m (with m suffixes), the suffix array P is the array of integers from 1 to m, sorted by the lexicographic order of the suffixes of D. For instance, if we consider dictionary D = mississippi (with m = 11), its SA is P = {11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3} and we get the suffixes shown in Fig. 1, along with the use of SA for LZ77/LZSS encoding:

• with LAB = issia, the LZ77 encoder outputs (5, 4, a) or (2, 4, a), depending on how we search P and how we choose the match; for LZSS, we have (1, (5, 4)) or (1, (2, 4)) followed by (0, (a));
• with LAB = bsia, the LZ77 tokens are (0, 0, b) followed by (7, 2, a) or (4, 2, a); LZSS produces (0, (b)) followed by (1, (7, 2)) or (1, (4, 2)) and finally (0, (a)).

Each integer in P is a suffix number, that is, the position in D at which the corresponding suffix starts.
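The mississippi example can be reproduced with a small Python sketch. It is not the encoder evaluated in this paper: the suffix array is built with a naive sort rather than SA-IS, the search is a plain linear scan that keeps the first longest match in SA order, the dictionary is kept static while one LAB is processed, and all function names are hypothetical.

```python
def suffix_array(d):
    """1-based suffix numbers of d in lexicographic order (naive sort)."""
    return sorted(range(1, len(d) + 1), key=lambda k: d[k - 1:])


def match_length(d, pos, lab, i):
    """Length of the common prefix of d[pos-1:] and lab[i:]."""
    n = 0
    while pos - 1 + n < len(d) and i + n < len(lab) and d[pos - 1 + n] == lab[i + n]:
        n += 1
    return n


def encode_lab(d, p, lab):
    """LZSS tokens for one LAB against a static dictionary d with SA p."""
    tokens, i = [], 0
    while i < len(lab):
        best_pos, best_len = 0, 0
        for pos in p:  # scan suffixes in SA order, keep the first longest match
            if d[pos - 1] == lab[i]:
                n = match_length(d, pos, lab, i)
                if n > best_len:
                    best_pos, best_len = pos, n
        if best_len == 0:
            tokens.append((0, lab[i]))  # no suffix starts with lab[i]
            i += 1
        else:
            tokens.append((1, (best_pos, best_len)))
            i += best_len
    return tokens


def decode_lab(d, tokens):
    """Inverse of encode_lab: copy from the dictionary or emit the symbol."""
    out = []
    for bit, code in tokens:
        if bit == 0:
            out.append(code)
        else:
            pos, length = code
            out.append(d[pos - 1:pos - 1 + length])
    return "".join(out)


if __name__ == "__main__":
    D = "mississippi"
    P = suffix_array(D)
    print(P)                          # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(encode_lab(D, P, "issia"))  # [(1, (5, 4)), (0, 'a')]
    print(encode_lab(D, P, "bsia"))   # [(0, 'b'), (1, (7, 2)), (0, 'a')]
```

Because ties are broken in SA order, the sketch returns positions 5 and 7; an encoder that scans in a different order could equally return 2 and 4, as noted in the text. Decoding needs no search at all, which is the asymmetry discussed above.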
Finding a sub-string of D, as in LZ77/LZSS, can be done by searching array P; for instance, the set of sub-strings of D that start with symbol 's' can be found at indexes 7, 4, 6, and 3 of D, ranging from index 7 to 10 on P. There are several linear time algorithms for SA construction [8, 12, 18]; we have used the suffix array induced sorting (SA-IS) algorithm [12].

Figure 1: LZ77 and LZSS encoding with SA, with dictionary D = mississippi. In part a), with LAB = issia, we have four possible matches delimited by left and right. In part b), with LAB = bsia, there is no suffix that starts with 'b' (which is encoded as a single symbol), but after 'b' we find four suffixes whose first symbol is 's'; two of these suffixes start with 'si'.

3 Proposed Algorithm

We have adopted the following encoded file format. The header has 48 bits: the first 8 bits (np) represent the number of bits used to represent the pos field of the token; these are followed by another 8 bits (nl) with the number of bits used by the len field; the following 32 bits are the original file size. The header is followed by |LAB| ASCII symbols and the remainder of the file consists of a sequence of LZSS tokens. Our decoding algorithm does not need any special data structure and follows standard LZSS decoding, as described in Section 2.

The encoding algorithm uses two SA to represent the dictionary and an auxiliary array of 256 integers named LI (LeftIndex). This array holds, for each ASCII symbol, the first index of the suffix array where we can find a suffix that starts with that symbol (the left index for each symbol, as shown in Fig. 1). For the symbols that are not the start of any suffix, the corresponding entry is marked with -1, meaning that we have an empty match for those symbols. Fig. 2 shows the LI for dictionary D = mississippi; for instance, the first suffix that starts with symbol 'i' is at index 0 in P, and the suffixes starting with 'p' begin at index 5 of P. Using LI, we don't have to search for the left index for each sub-string in the LAB that we need to encode.

Figure 2: The LI (LeftIndex) auxiliary array: for each symbol that starts a suffix it holds the index of the SA P in which that suffix starts. For the symbols that are not the start of any suffix, the corresponding entry is marked with -1, meaning that we have an empty match for sub-strings that start with that symbol.

As shown in Algorithm 1, the encoder starts by reading input symbols into the LAB. The first |LAB| symbols are written directly to the output file, because the dictionary is empty at that stage. We then slide the LAB into the dictionary and proceed by computing the SA for the dictionary while it is not yet full. When the dictionary is full, on every subsequent iteration, after each full LAB encoding, the corresponding SA and LI indexes are updated.

Algorithm 1 LZSS Encoding using Suffix Array
Input: In, input stream to encode; m, length of dictionary; n, length of LAB.
Output: Out, output stream with LZSS description of In.
1: Write 48-bit header: np, nl and FileSize (as described above).
2: Read the first look-ahead-buffer LAB, with |LAB| symbols, from In.
3: Write LAB into Out.
4: Initialize every position of LI to -1.
5: Do coded ← 0.
6: while coded < FileSize do
7:   Slide in LAB into dictionary D and read next LAB.
8:   if coded < m then
9:     Build SA, using SA-IS algorithm [12], for D and name it P. /* Dictionary is filling. */
10:   else
11:     Update P (as in algorithm UD and Fig. 3). /* Runs after each LAB encoding. */
12:   end if
13:   Scan P and update LI (as described in Fig. 2).
14:   Do i ← 0.
15:   while i < n do
16:     left = LI[LAB[i]]. /* Loop to encode n symbols in the LAB. */
17:     if (left == -1) then
18:       output (0, LAB[i]); i ← i + 1; continue. /* Empty match. No suffix starts with LAB[i]. */
19:     end if
20:     Find right, such that D[P[right]] = LAB[i]. /* Get left and right as in Fig. 1. */
21:     From the set of suffixes between P[left] and P[right], choose the suffix at index pos, such that left ≤ pos ≤ right. /* Choose between "fast" and "best" compression. */
22:     Do len ← the match-length of sub-strings starting at D[P[pos]] and LAB[i].
23:     Output (1, (pos, len)) into Out; i ← i + len.
24:   end while
25:   coded = coded + n.
26: end while

The update, described in Algorithm 2 (named UD), runs when the dictionary is full, after each LAB encoding, performing the following actions:

• remove from P the suffixes in the range {1, ..., |LAB|}, because they slide out;
• update in P the suffixes in the range {|LAB| + 1, ..., |dictionary|} to the range {1, ..., |dictionary| − |LAB|}, subtracting |LAB| from each suffix number;
• insert into P the slide in suffixes, in the range {|dictionary| − |LAB| + 1, ..., |dictionary|}, after their proper sorting; this sorting is done by computing an SA for the LAB.

Performing these actions on a single array is time consuming. To speed up the update we use two SA of length |dictionary|, named P_A and P_B, and a pointer P (to P_A or P_B). After each LAB encoding, we toggle pointer P between P_A and P_B, to avoid unnecessary removals, copies, and (slow) displacement of the elements of the working SA. After the update procedure, P points to the newly updated array. If the previous LAB encoding was done with P_A, the following will be carried out using P_B and vice-versa.

Fig. 3 illustrates this procedure with (|dictionary|, |LAB|) = (16, 4) and the dictionary contents "this is the file", with pointer P set to P_A. We compute the SA for the LAB = _the and insert the new suffixes at indexes {2, 4, 8, 14} of P_B; all other positions of P_B are updated from P_A, subtracting |LAB| from the P_A values. After encoding LAB = enco, the update process is repeated using P_A as destination.

Figure 3: Update step with pointer P set to P_A initially; the first update is done using P_B as destination. Array I holds the indexes where to insert the new suffixes; 'U' and 'R' are the Update and Remove indexes, respectively. On the right hand side, we have the initial and updated dictionary contents.

Algorithm 2 UD - Update Dictionary (line 11 of Algorithm 1)
Input: P_A, P_B, m-length SA; P, pointer to P_A or P_B; P_dst, pointer to P_B or P_A; LAB, look-ahead buffer; LI, 256-position LeftIndex array.
Output: P_A or P_B updated; P pointing to the recently updated SA.
1: if P points to P_A then
2:   Set P_dst to P_B.
3: else
4:   Set P_dst to P_A.
5: end if
6: Compute the SA P_LAB for the encoded LAB. /* Sorts the suffixes in the LAB. */
7: Using LI and P_LAB, fill the |LAB|-length array I with the insertion indexes (slide in suffixes).
8: for j = 0 to |LAB| − 1 do
9:   P_dst[I[j]] = P_LAB[j] + |dictionary| − |LAB|. /* The |LAB| insert suffixes. */
10: end for
11: Do updateCounter = |dictionary| − |LAB|.
12: for j = 0 to |dictionary| − 1 do
13:   if (P[j] − |LAB|) > 0 then
14:     P_dst[j] = P[j] − |LAB|. /* The |dictionary| − |LAB| update suffixes. */
15:     updateCounter = updateCounter − 1;
16:     if (updateCounter == 0) then
17:       break; /* Break immediately if |dictionary| − |LAB| updates have been done. */
18:     end if
19:   end if
20: end for
21: Set P to P_dst. /* P points to recently updated SA. */

In line 6 of Algorithm 2, we get the sorted suffixes corresponding to the recently encoded LAB. After line 7, array I contains |LAB| integers with the indexes where these new suffixes are to be inserted (in lexicographical order) into P_dst; this search finds |LAB| positions, being quite fast because we use array LI to get the index of P at which to start searching. The loop in lines 8 to 10 performs the sorted insertion at the corresponding indexes given by I on the target SA pointed to by P_dst. The loop in lines 12 to 20 updates the suffixes in the range {|LAB| + 1, ..., |dictionary|} to the range {1, ..., |dictionary| − |LAB|}. With the use of two SA, we don't have to explicitly (slowly) remove the suffixes from the old SA.

We have also developed another version of Algorithm 1, which updates the SA at each and every token, thus being a smooth sliding window suffix array. The update procedure is divided into two situations, depending on the length of the token (1 or len symbols). For len symbols, we have the same procedure as described above, using len instead of |LAB|. When we have a match of a single symbol, we subtract one from each position of P, and remove the suffix corresponding to the single slide out symbol; finally, we insert the single suffix number |dictionary|, corresponding to the new symbol, at its proper position. This version turned out to be 2 to 3 times slower than Algorithm 1, achieving about the same compression ratio.

4 Experimental Results

Our experimental tests were carried out on a laptop with a 2 GHz Intel Core2Duo T7300 CPU and 2 GB of RAM, using a single core. The code¹ was written in C, using Microsoft Visual Studio 2008. The linear time SA construction algorithm SA-IS [12] (available at http://yuta.256.googlepages.com/sais) was used.
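Since the C implementation is not reproduced in the text, the following Python sketch illustrates two ingredients of Section 3: filling the LI array from a suffix array, and the double-buffered dictionary update. It is a simplified stand-in, not Algorithm 2 itself: instead of computing the insertion-index array I with the help of LI, it merges the renumbered surviving suffixes with the sorted LAB suffixes. It also assumes that the surviving suffixes keep their relative order after the slide, which holds in the example below because the last symbol of the old dictionary occurs only once; the general case needs extra care that is not addressed here. All function names are hypothetical.

```python
import heapq


def suffix_array(d):
    """1-based suffix numbers of d in lexicographic order (naive sort, not SA-IS)."""
    return sorted(range(1, len(d) + 1), key=lambda k: d[k - 1:])


def left_index(d, p):
    """LI array: for each byte value, the first index of p whose suffix
    starts with that symbol, or -1 if no suffix does (ASCII input assumed)."""
    li = [-1] * 256
    for idx, k in enumerate(p):
        c = ord(d[k - 1])
        if li[c] == -1:
            li[c] = idx
    return li


def update_dictionary(d, lab, p_src, p_dst):
    """Slide lab into d and write the updated SA into p_dst (p_src is untouched).

    Assumption: surviving suffixes keep their relative order after the slide.
    """
    m, n = len(d), len(lab)
    d_new = d[n:] + lab
    # Drop the n slid-out suffixes and renumber the survivors.
    survivors = [k - n for k in p_src if k - n > 0]
    # Sort the slid-in suffixes via an SA of the LAB, then shift their numbers.
    inserted = [k + m - n for k in suffix_array(lab)]
    # Merge both sorted runs into the destination buffer.
    p_dst[:] = heapq.merge(survivors, inserted, key=lambda k: d_new[k - 1:])
    return d_new


if __name__ == "__main__":
    D = "mississippi"
    li = left_index(D, suffix_array(D))
    print(li[ord("i")], li[ord("p")], li[ord("s")], li[ord("b")])  # 0 5 7 -1

    d = "abracadabra!"
    p_a, p_b = suffix_array(d), [0] * len(d)
    d = update_dictionary(d, "cada", p_a, p_b)  # P now "points" to p_b
    print(d, p_b)
```

Writing into the other buffer mirrors the toggling between P_A and P_B: the source array is only read, so no element has to be removed or shifted in place.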
For comparison purposes, we also present the results of a BT-encoder [11], GZip², and the Lempel-Ziv Markov chain algorithm (LZMA³). The test files are from the standard corpora Calgary (18 files, 3 MB) and Silesia (12 files, 211 MB), available at http://www.data-compression.info. We use the "best" compression option (choice of the longest match, at line 21 of Algorithm 1).

4.1 Performance Indicators and Measures

In our tests, we used the Calgary and Silesia Corpus files to assess the following measures: encoding time (in seconds, measured by the C function clock); compression ratio (in bits per byte, bpb); amount of memory for encoder data structures (in bytes). This amount for our encoder data structures is M_SA = |dictionary| + |LAB| + 2|P| + |LI| + |P_LAB|. The integer array LI has 256 positions, regardless of the length of the dictionary. P_LAB is the SA for the LAB, with |LAB| integers. The BT-encoder [11] uses 3 integers per tree node with |dictionary| + 1 nodes, occupying M_BT = 13 × |dictionary| + 12 bytes, using 4-byte integers. A suffix tree algorithm⁴ uses 3 integers and a symbol for each node, occupying 16 bytes, placed in a hash table [9], using the maximum amount of memory M_ST = 25 × |dictionary| + 4 × hashsz + 16 bytes, where hashsz is the hash table size. The GZip encoder occupies M_GZIP = 313408 bytes, as measured with the C sizeof operator.
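The memory expressions above are easy to evaluate for a given window configuration. The sketch below is only a calculator for those formulas; it assumes 4-byte integers for the SA-based structures as well (the text states this explicitly only for the BT-encoder), and the function names and the hash table size used in the example are illustrative assumptions.

```python
INT_BYTES = 4  # assumption: 4-byte integers, as stated for the BT-encoder


def mem_sa(dict_len, lab_len, int_bytes=INT_BYTES):
    """M_SA = |dictionary| + |LAB| + 2|P| + |LI| + |P_LAB|, in bytes."""
    p_bytes = int_bytes * dict_len      # one suffix array over the dictionary
    li_bytes = int_bytes * 256          # LeftIndex array
    p_lab_bytes = int_bytes * lab_len   # suffix array of the LAB
    return dict_len + lab_len + 2 * p_bytes + li_bytes + p_lab_bytes


def mem_bt(dict_len):
    """M_BT = 13 * |dictionary| + 12, in bytes."""
    return 13 * dict_len + 12


def mem_st(dict_len, hashsz):
    """M_ST = 25 * |dictionary| + 4 * hashsz + 16, in bytes."""
    return 25 * dict_len + 4 * hashsz + 16


if __name__ == "__main__":
    print(mem_sa(2048, 1024))     # 24576
    print(mem_bt(2048))           # 26636
    print(mem_st(65536, 65536))   # 1900560, with an assumed hashsz of 65536
```

Because every term depends only on |dictionary| and |LAB|, M_SA is known before encoding starts, independently of the data.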
The LZMA encoder data structures occupy⁵

\[
M_{LZMA} = 4194304 +
\begin{cases}
9.5\,|dictionary|, & \text{if MF = BT2} \\
11.5\,|dictionary|, & \text{if MF = BT3} \\
11.5\,|dictionary|, & \text{if MF = BT4} \\
7.5\,|dictionary|, & \text{if MF = HC4}
\end{cases}
\tag{2}
\]

bytes, depending on the match finder (MF) used as well as on |dictionary|, with BT# denoting binary tree with # bytes hashing and HC4 denoting hash chain with 4 bytes hashing.

¹ Available at http://www.deetc.isel.ipl.pt/sistemastele/docentes/AF/AF.htm
² http://www.gzip.org/
³ http://www.7-zip.org
⁴ http://www.larsson.dogma.net/research.html
⁵ As reported in http://mancubus.net/svn/hosted/gzdoom/trunk/lzma/lzma.txt

Table 1: Amount of memory (in bytes), total encoding time (in seconds), and average compression ratio (in bpb), for several lengths of (|dictionary|, |LAB|) on the Calgary Corpus, using "best" compression. GZip "fast" obtains Time=0.5 and bpb=3.20 while GZip "best" obtains Time=1.2 and bpb=2.79.

| # | Dictionary length | LAB length | SA Memory | SA Time | SA bpb | BT Memory | BT Time | BT bpb | LZMA Memory | LZMA Time | LZMA bpb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2048 | 1024 | 24576 | 2.2 | 5.77 | 26636 | 3.92 | 5.65 | 4217856 | 4.7 | 2.99 |
| 2 | 4096 | 1024 | 43008 | 2.5 | 5.40 | 53260 | 4.3 | 4.98 | 4241408 | 4.8 | 2.82 |
| 3 | 4096 | 2048 | 48128 | 2.4 | 5.75 | 53260 | 11.1 | 5.48 | 4241408 | 4.8 | 2.82 |
| 4 | 8192 | 2048 | 84992 | 3.8 | 5.49 | 106508 | 11.7 | 4.88 | 4288512 | 5.1 | 2.69 |
| 5 | 16384 | 256 | 149760 | 9.1 | 4.36 | 213004 | 4.5 | 4.12 | 4382720 | 5.2 | 2.61 |
| 6 | 32768 | 256 | 297216 | 18.4 | 4.31 | 425996 | 5.5 | 4.08 | 4571136 | 4.9 | 2.54 |
| 7 | 32768 | 1024 | 301056 | 11.1 | 4.86 | 425996 | 7.5 | 4.40 | 4571136 | 4.9 | 2.54 |
| 8 | 32768 | 2048 | 306176 | 9.5 | 5.16 | 425996 | 15.8 | 4.57 | 4571136 | 4.9 | 2.54 |

For instance, with (|dictionary|, |LAB|) = (65536, 4096) we have M_SA = 611328, M_BT = 851980, M_ST = 1900560, and M_LZMA = 4816896 bytes.
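Expression (2) can be evaluated in the same way. The sketch below is a hypothetical helper that only tabulates the per-match-finder factors of (2); it shows that the M_LZMA value quoted for a 65536-byte dictionary corresponds to the BT2 factor of 9.5.

```python
LZMA_BASE_BYTES = 4194304
LZMA_FACTOR = {"BT2": 9.5, "BT3": 11.5, "BT4": 11.5, "HC4": 7.5}


def mem_lzma(dict_len, match_finder):
    """M_LZMA in bytes for a given dictionary length and match finder, as in (2)."""
    return int(LZMA_BASE_BYTES + LZMA_FACTOR[match_finder] * dict_len)


if __name__ == "__main__":
    for mf in ("BT2", "BT3", "BT4", "HC4"):
        print(mf, mem_lzma(65536, mf))
    # BT2 gives 4816896 bytes, the value quoted for (65536, 4096)
```

For this configuration the constant 4 MB term dominates: the SA-based structures need 611328 bytes, roughly one eighth of the LZMA figure.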
If we consider an application in which we only have a low fixed amount of memory, such as the internal memory of an embedded device, it may not be possible to instantiate a tree or a hash table based encoder. The GZip and LZMA⁶ encoders perform entropy encoding of the dictionary tokens, achieving better compression ratio than our LZSS encoding algorithms. GZip is built upon the deflate algorithm, while LZMA is the default compression method of the 7z format in the 7-Zip program; both combine LZ77-style dictionary coding with entropy coding. These encoders are useful as a benchmark comparison, regarding encoding time and the amount of memory occupied. For both compression techniques, we have compiled their C/C++ sources using the same compiler settings as for our encoders. The compression ratio of our encoders, as well as that of the BT-encoder, can be easily improved by entropy-encoding the tokens, like in GZip and LZMA. Our purpose is to focus only on the construction of the dictionary and searching over it, using less memory than the conventional solutions with trees and hash tables.

⁶ LZMA SDK, version 4.65, released 3 February 2009, available at http://www.7-zip.org/sdk.html

4.2 Comparison with other encoders

We encode each file of the two corpora and compute the total encoding time as well as the average compression ratio, for different configurations of (|dictionary|, |LAB|), using the "best" compression option. Table 1 shows the results of these tests on the Calgary Corpus. Our SA-encoder is faster than BT, except on tests 5 to 7; on test 6 (the GZip-like scenario), the BT-encoder is about 3.5 times faster than SA. Table 2 shows the results for the Silesia Corpus. In these tests, the SA-encoder is the fastest except on tests 5 and 6. On test 3, the SA-encoder is about 6 times faster than the BT-encoder (112.9 versus 694.9 seconds), achieving about the same compression ratio. We see that when |LAB| is not too small (as compared to the dictionary), the SA-encoder is faster than the BT-encoder. Fig. 4 shows the trade-off between time and memory on the encoding of the Calgary and Silesia corpora, on the tests shown in Tables 1 and 2, for SA and BT-encoders, including GZip test results for comparison; the SA-encoder offers a good trade-off, especially on tests 1 to 5, using (much) less memory than GZip.

Table 2: Amount of memory (in bytes), total encoding time (in seconds) and average compression ratio (in bpb), for several lengths of (|dictionary|, |LAB|) on the Silesia Corpus, using "best" compression. GZip "fast" obtains Time=19.5 and bpb=3.32 while GZip "best" obtains Time=74.4 and bpb=2.98.

| # | Dictionary length | LAB length | SA Memory | SA Time | SA bpb | BT Memory | BT Time | BT bpb | LZMA Memory | LZMA Time | LZMA bpb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2048 | 1024 | 24576 | 118.7 | 5.66 | 26636 | 249.5 | 5.65 | 4217856 | 333.53 | 3.05 |
| 2 | 4096 | 1024 | 43008 | 116.9 | 5.41 | 53260 | 303.4 | 5.25 | 4241408 | 349.05 | 2.90 |
| 3 | 4096 | 2048 | 48128 | 112.9 | 5.68 | 53260 | 694.9 | 5.63 | 4241408 | 349.05 | 2.90 |
| 4 | 8192 | 2048 | 84992 | 143.4 | 5.44 | 106508 | 668.9 | 5.27 | 4288512 | 356.77 | 2.76 |
| 5 | 16384 | 256 | 149760 | 319.1 | 4.55 | 213004 | 254.6 | 4.44 | 4382720 | 366.47 | 2.62 |
| 6 | 32768 | 256 | 297216 | 542.7 | 4.41 | 425996 | 318.1 | 4.31 | 4571136 | 356.34 | 2.52 |
| 7 | 32768 | 1024 | 301056 | 322.2 | 4.80 | 425996 | 382.6 | 4.64 | 4571136 | 356.34 | 2.52 |
| 8 | 32768 | 2048 | 306176 | 302.3 | 5.02 | 425996 | 979.8 | 4.81 | 4571136 | 356.34 | 2.52 |

Figure 4: Time-memory trade-off between SA (A#) and BT (T#) for the Calgary and Silesia Corpus, on the 8 encoding tests of Tables 1 and 2 (two panels, "Time-Memory Trade-off on Calgary Corpus" and "Time-Memory Trade-off on Silesia Corpus"; axes: Time [seconds] and Memory [kB]). We include GZip for comparison: Gf is GZip "fast", Gb is GZip "best".

For all these encoders, searching and updating the dictionary are the most time-consuming tasks. A high compression ratio like those of LZMA and GZip can be attained only when we use entropy encoding with appropriate models for the tokens. The SA encoder is faster than the BT encoder when the LAB is not too small. Our algorithms (without entropy encoding) are thus positioned in a trade-off between time and memory that can make them suitable to replace binary trees in LZMA or in sub-string search.

5 Conclusions

In this paper, we have proposed a new Lempel-Ziv encoding algorithm based on suffix arrays, improving on earlier work in terms of encoding time, being faster than previous approaches, with similar low memory requirements. The proposed algorithm uses an auxiliary array as an accelerator to the encoding procedure, as well as a fast update of the dictionary based on two suffix arrays. This algorithm has considerably lower memory requirements than binary/suffix trees and hash tables.

The proposed algorithm allows computing a priori the exact amount of memory necessary for the encoder data structures; usually this may not be the case when using (binary/suffix) trees, because the number of nodes and branches to allocate depends on the contents of the text, or when we allocate a memory block that is larger than needed, as happens with hash tables.

We have compared our algorithm (on benchmark files from standard corpora) against tree-based encoders, including GZip and LZMA. The experimental tests showed that in some (typical) compression settings, our encoders occupy less memory and are faster than tree-based encoders, thus being time and memory efficient LZ77 and LZSS encoders, based on suffix arrays. The tree-based encoders can only be faster at the expense of memory usage.
Our algorithm is positioned in a trade-off between time and memory that can make it suitable to replace the use of trees, like in LZMA (7-Zip), reducing the amount of memory and encoding time, keeping the same compression ratio, which is better than that of GZip. These encoders also provide a more compact way to represent the dictionary, which is suited for text categorization based on bag-of-words representations. Using a single suffix array and the 256-position auxiliary array, we have a fast indexing mechanism to quickly find all the sets of words that start with a given symbol, on a static dictionary. This will be a topic of future research.

References

[1] C. Constantinescu, J. Pieper, and Tiancheng Li. Block size optimization in deduplication systems. In DCC '09: Proc. of the IEEE Conference on Data Compression, page 442, 2009.
[2] M. Crochemore, L. Ilie, and W. Smyth. A simple algorithm for computing the Lempel-Ziv factorization. In DCC '08: Proc. of the IEEE Conference on Data Compression, pages 482–488, 2008.
[3] A. Ferreira, A. Oliveira, and M. Figueiredo. On the use of suffix arrays for memory-efficient Lempel-Ziv data compression. In DCC '09: Proc. of the IEEE Conference on Data Compression, page 444, 2009.
[4] M. Fiala and J. Holub. DCA using suffix arrays. In DCC '08: Proc. of the IEEE Conference on Data Compression, page 516, Washington, DC, USA, 2008. IEEE Computer Society.
[5] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. Information retrieval: data structures and algorithms, pages 66–82, 1992.
[6] U. Gräf. Sorted sliding window compression. In DCC '99: Proc. of the IEEE Conference on Data Compression, page 527, Washington, DC, USA, 1999.
[7] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.
[8] J. Karkainen, P. Sanders, and S. Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918–936, 2006.
[9] N. Larsson. Structures of String Matching and Data Compression. PhD thesis, Department of Computer Science, Lund University, Sweden, 1999.
[10] U. Manber and G. Myers. Suffix Arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
[11] M. Nelson and J. Gailly. The Data Compression Book. M & T Books, New York, 2nd edition, 1995.
[12] G. Nong, S. Zhang, and W. Chan. Linear suffix array construction by almost pure induced-sorting. In DCC '09: Proc. of the IEEE Conference on Data Compression, pages 193–202, 2009.
[13] M. Rodeh, V. Pratt, and S. Even. Linear algorithm for data compression via string matching. Journal of the ACM, 28(1):16–24, 1981.
[14] D. Salomon. Data Compression - The complete reference. Springer-Verlag London Ltd, London, 2007.
[15] M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms, In Press, Corrected Proof, 2009.
[16] J. Storer and T. Szymanski. Data compression via textual substitution. J. of the ACM, 29(4):928–951, 1982.
[17] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
[18] S. Zhang and G. Nong. Fast and space efficient linear suffix array construction. In DCC '08: Proc. of the IEEE Conference on Data Compression, page 553, 2008.
[19] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337–343, 1977.
