Low-Memory Adaptive Prefix Coding


Authors: Travis Gagie, Marek Karpinski, Yakov Nekrich

Abstract. In this paper we study the adaptive prefix coding problem in cases where the size of the input alphabet is large. We present an online prefix coding algorithm that uses $O(\sigma^{1/\lambda + \epsilon})$ bits of space for any constants $\epsilon > 0$, $\lambda > 1$, and encodes the string of symbols in $O(\log\log\sigma)$ time per symbol in the worst case, where $\sigma$ is the size of the alphabet. The upper bound on the encoding length is $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits.

1 Introduction

In this paper we present an algorithm for adaptive prefix coding that uses sublinear space in the size of the alphabet. Space usage can be an important issue in situations where the available memory is small, e.g., in mobile computing, when the alphabet is very large, and when we want the data used by the algorithm to fit into first-level cache memory. For instance, Version 5.0 of the Unicode Standard [14] provides code points for 99,089 characters, covering "all the major languages written today". The Standard itself may be the only document to contain quite that many distinct characters, but there are over 50,000 Chinese characters, of which everyday Chinese uses several thousand [15]. One reason there are so many Chinese characters is that each conveys more information than an English character does; if we consider syllables, morphemes or words as basic units of text, then the English 'alphabet' is comparably large. Compressing strings over such alphabets can be awkward; the problem can be severely aggravated if we have only a small amount of (cache) memory at our disposal. Static and adaptive prefix encoding algorithms that use linear space in the size of the alphabet have been extensively studied.
The classical algorithm of Huffman [8] enables us to construct an optimal prefix-free code and encode a text in two passes in $O(n)$ time. Henceforth in this paper, $n$ denotes the number of characters in the text, and $\sigma$ denotes the size of the alphabet; $H(s) = \sum_{i=1}^{\sigma} \frac{f_{a_i}}{n} \log_2 \frac{n}{f_{a_i}}$ is the zeroth-order entropy¹ of $s$, where $f_a$ denotes the number of occurrences of character $a$ in $s$. The length of the encoding is $(H + d)n$ bits, and the redundancy $d$ can be estimated as $d \le p_{\max} + 0.086$, where $p_{\max}$ is the probability of the most frequent character [6]. The drawback of static Huffman coding is the need to make two passes over the data: we collect the frequencies of the different characters during the first pass, and then construct the code and encode the string during the second pass. Adaptive coding avoids this by maintaining a code for the prefix of the input string that has already been read and encoded. When a new character $s_i$ is read, it is encoded with the code for $s_1 \ldots s_{i-1}$; then the code is updated. The FGK algorithm [11] for adaptive Huffman coding encodes the string in $(H + 2 + d)n + O(\sigma \log \sigma)$ bits, while the adaptive Huffman algorithm of Vitter [16] guarantees that the string is encoded in $(H + 1 + d)n + O(\sigma \log \sigma)$ bits. The adaptive Shannon coding algorithms of Gagie [4] and Karpinski and Nekrich [10] encode the string in $(H + 1)n + O(\sigma \log \sigma)$ bits and $(H + 1)n + O(\sigma \log^2 \sigma)$ bits, respectively.

∗ Department of Computer Science, University of Eastern Piedmont. Email: travis@mfn.unipmn.it. Supported by the Italy-Israel FIRB grant "Pattern Discovery Algorithms in Discrete Structures, with Applications to Bioinformatics".
† Department of Computer Science, University of Bonn. Email: {marek,yasha}@cs.uni-bonn.de.
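As a concrete illustration of the classical two-pass scheme described above, here is a minimal static Huffman coder in Python. This is our own sketch, not code from any of the cited papers: the first pass counts frequencies and builds an optimal prefix-free code with a heap; the second pass encodes.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """First pass: count frequencies, then build an optimal prefix-free
    code by repeatedly merging the two lightest subtrees."""
    freq = Counter(text)
    if len(freq) == 1:                     # degenerate one-symbol alphabet
        return {text[0]: "0"}
    # Heap entries are (weight, tie-break id, {symbol: codeword-so-far});
    # the unique id keeps tuple comparison from ever reaching the dict.
    heap = [(f, i, {a: ""}) for i, (a, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {a: "0" + w for a, w in c1.items()}
        merged.update({a: "1" + w for a, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text):
    """Second pass: replace every character by its codeword."""
    code = huffman_code(text)
    return "".join(code[a] for a in text), code
```

The two passes are exactly the drawback noted above: the code cannot be built until the whole text has been scanned once.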
All of the above algorithms use space at least linear in the size of the alphabet, to count how often each distinct character occurs. All algorithms for adaptive prefix coding, with the exception of [10], encode and decode in $\Theta(nH)$ time, i.e., the time to process the string depends on $H$ and hence on the size of the input alphabet. The algorithm of [10] encodes a string in $O(n)$ time, and decoding takes $O(n \log H)$ time. Compression with sublinear space usage was studied by Gagie and Manzini [5], who proved the following lower bound: for any $g$ independent of $n$ and any constants $\epsilon > 0$ and $\lambda > 1$, in the worst case we cannot encode $s$ in $\lambda H(s) n + o(n \log \sigma) + g$ bits if, during the single pass in which we write the encoding, we use $O(\sigma^{1/\lambda - \epsilon})$ bits of memory. In [5] the authors also presented an algorithm that divides the input string into chunks of length $O(\sigma^{1/\lambda} \log \sigma)$ and encodes each individual chunk with a modification of arithmetic coding, so that the string is encoded with $(\lambda H(s) + \mu)n + O(\sigma^{1/\lambda} \log \sigma)$ bits. However, their algorithm is quite complicated and uses arithmetic coding; hence, codewords are not self-delimiting and the encoding is not 'instantaneously decodable'. Besides that, their algorithm is based on static encoding of parts of the input string. In this paper we present an adaptive prefix coding algorithm that uses $O(\sigma^{1/\lambda + \epsilon})$ bits of memory and encodes a string $s$ with $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda} \log^2 \sigma)$ bits. The encoding and decoding work in $O(\log\log\sigma)$ time per symbol in the worst case, and the whole string $s$ is encoded/decoded in $O(n \log H(s))$ time. A randomized implementation of our algorithm uses $O(\sigma^{1/\lambda} \log^2 \sigma)$ bits of memory and works in $O(n \log H)$ expected time.
Our method is based on a simple but effective form of alphabet partitioning (see, e.g., [1] and references therein) to trade off the size of a code and the compression it achieves: we split the alphabet into frequent and infrequent characters; we preface each occurrence of a frequent character with a 1, and each occurrence of an infrequent one with a 0; we replace each occurrence of a frequent character by a codeword, and replace each occurrence of an infrequent character by that character's index in the alphabet. We make the natural assumption that unencoded files consist of characters represented by their indices in the alphabet (cf. ASCII codes), so we can simply copy the representation of an infrequent character from the original file. One difficulty is that we cannot identify the frequent characters using a low-memory one-pass algorithm: according to the lower bound of [9], any online algorithm that identifies a set of characters $F$, such that each $a \in F$ occurs at least $\theta n$ times for some parameter $\theta$, needs $\Omega(\sigma \log \frac{n}{\sigma})$ bits of memory in the worst case. We overcome this difficulty by maintaining the frequencies of symbols that occur in a sliding window. In Section 2 we review the data structures that are used by our algorithm. In Section 3 we present a novel encoding method, henceforth called sliding-window Shannon coding. Analysis of sliding-window Shannon coding is given in Section 4.

¹ For ease of description, we sometimes simply denote the entropy by $H$ if the string $s$ is clear from the context.

2 Preliminaries

The dictionary data structure contains a set $S \subset U$, so that for any element $x \in U$ we can determine whether $x$ belongs to $S$. We assume that $|S| = m$.
The following dictionary data structure is described in [7].

Lemma 1. There exists an $O(m)$-space dictionary data structure that can be constructed in $O(m \log m)$ time and supports membership queries in $O(1)$ time.

In the case of a polynomial-size universe, we can easily construct a data structure that uses more space but also supports updates. The following lemma is folklore.

Lemma 2. If $|U| = m^{O(1)}$, then there exists an $O(m^{1+\epsilon})$-space dictionary data structure that can be constructed in $O(m^{1+\epsilon})$ time and supports membership queries and updates in $O(1)$ time.

Proof: We regard $S$ as a set of binary strings of length $\log U$. All strings can be stored in a trie $T$ with node degree $2^{\epsilon' \log U} = m^{\epsilon}$, where $\epsilon' = (\log U / \log m) \cdot \epsilon$. The height of $T$ is $O(1)$, and the total number of internal nodes is $O(m)$. Each internal node uses $O(m^{\epsilon})$ space; hence, the data structure uses $O(m^{1+\epsilon})$ space and can be constructed in $O(m^{1+\epsilon})$ time. Clearly, queries and updates are supported in $O(1)$ time. □

If we allow randomization, then a dynamic $O(m)$-space dictionary can be maintained. We can use the result of [3]:

Lemma 3. There exists a randomized $O(m)$-space dictionary data structure that supports membership queries in $O(1)$ time and updates in $O(1)$ expected time.

All of the above dictionary data structures can be augmented so that one or more additional records are associated with each element of $S$; the record(s) associated with element $a \in S$ can be accessed in $O(1)$ time. In Section 3 we also use the following dynamic searchable partial-sums data structure, due to Moffat [12]:

Lemma 4. There is a dynamic searchable partial-sums data structure that stores a sequence of $O(\log\sigma)$-bit real numbers $p_1, \ldots, p_k$ in $O(k \log \sigma)$ bits and supports the following operations in $O(\log i)$ time:
• given an index $i$, return the $i$-th partial sum $p_1 + \cdots + p_i$;
• given a real number $b$, return the index $i$ of the largest partial sum $p_1 + \cdots + p_i \le b$;
• given an index $i$ and a real number $d$, add $d$ to $p_i$.

3 Adaptive coding

The adaptive Shannon coding algorithm we present in this section combines ideas from Karpinski and Nekrich's algorithm [10] with the sliding-window approach, to encode $s$ in $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda} \log^2 \sigma)$ bits using $O(n \log H)$ time overall and $O(\log\log\sigma)$ time for any character, $O(\sigma^{1/\lambda + \epsilon})$ bits of memory and one pass, for any given constants $\lambda \ge 1$ and $\epsilon > 0$. Whereas Karpinski and Nekrich's algorithm considers the whole prefix already encoded, our new algorithm encodes each character $s[i]$ of $s$ based only on the window $w_i = s[\max(i - \ell, 1)..(i-1)]$, where $\ell = \lceil c \sigma^{1/\lambda} \log \sigma \rceil$ and $c$ is a constant we will define later in terms of $\lambda$ and $\epsilon$. (With $c = 10$, for example, we produce an encoding of fewer than $\lambda n H(s) + (2\lambda + 2)n + O(\sigma^{1/\lambda} \log^2 \sigma)$ bits; with $c = 100$, the bound is $\lambda n H(s) + (0.9\lambda + 2)n + O(\sigma^{1/\lambda} \log^2 \sigma)$ bits.) Let $f(a, s[i..j])$ denote the number of occurrences of $a$ in $s[i..j]$. For $1 \le i \le n$, if $f(s[i], w_i) \ge \ell / \sigma^{1/\lambda}$, then we write a 1 followed by $s[i]$'s codeword in our adaptive Shannon code; otherwise, we write a 0 followed by $s[i]$'s $\lceil \log \sigma \rceil$-bit index in the alphabet. As in the case of quantized Shannon coding [10], our algorithm maintains a canonical Shannon code. In a canonical code [13, 2], each codeword can be characterized by its length and its position among codewords of the same length, henceforth called its offset. The codeword of length $j$ with offset $k$ can be computed as $\sum_{h=1}^{j-1} n_h / 2^h + (k-1)/2^j$.
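The offset formula can be checked directly: scaled by $2^j$, the codeword of length $j$ with offset $k$ is the $j$-bit binary expansion of $\sum_{h=1}^{j-1} n_h 2^{j-h} + (k-1)$. A small sketch of our own, with `counts[h]` playing the role of $n_h$:

```python
def canonical_codeword(counts, j, k):
    """Return the codeword of length j with offset k (1-based) in a
    canonical code, where counts[h] is the number of codewords of length h.
    The codeword is the first j bits of sum_{h<j} counts[h]/2**h + (k-1)/2**j;
    we scale the fraction by 2**j and work with integers to avoid
    floating-point error."""
    value = sum(counts.get(h, 0) * 2 ** (j - h) for h in range(1, j)) + (k - 1)
    return format(value, f"0{j}b")
```

For instance, with one codeword of length 2 and two of length 3, the codewords come out consecutive and prefix-free, as a canonical code requires.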
We maintain four dynamic data structures: a queue $Q$, an augmented dictionary $D$, an array $A[0..\lceil \log \sigma^{1/\lambda} \rceil, 0..\lfloor \sigma^{1/\lambda} \rfloor]$ and a searchable partial-sums data structure $P$. (We actually use $A$ only while decoding but, to emphasize the symmetry between the two procedures, we refer to it in our explanation of encoding as well.) When we come to encode or decode $s[i]$,

• $Q$ stores $w_i$;
• $D$ stores each character $a$ that occurs in $w_i$, its frequency $f(a, w_i)$ there and, if $f(a, w_i) \ge \ell / \sigma^{1/\lambda}$, its position in $A$;
• $A$ is an array of doubly-linked lists. The list $A[j]$, $0 \le j \le \lceil \log \sigma^{1/\lambda} \rceil$, contains all characters with codeword length $j$ sorted by the codeword offsets; we denote by $A[j].l$ the pointer to the last element in $A[j]$;
• $C[j]$ stores the number of codewords of length $j$;
• $P$ stores $C[j]/2^j$ for each $j$ and supports prefix sum queries.

We implement $Q$ in $O(\ell \log \sigma) = O(\sigma^{1/\lambda} \log^2 \sigma)$ bits of memory, $A$ in $O(\sigma^{1/\lambda} \log^2 \sigma)$ bits, and $P$ in $O(\log^2 \sigma)$ bits by Lemma 4. The dictionary $D$ uses $O(\sigma^{1/\lambda + \epsilon})$ bits and supports queries and updates in $O(1)$ worst-case time by Lemma 2; if we allow randomization, we can apply Lemma 3 and reduce the space usage to $O(\sigma^{1/\lambda} \log^2 \sigma)$ bits, but updates are supported in $O(1)$ expected time. Therefore, altogether we use $O(\sigma^{1/\lambda + \epsilon})$ bits of memory; if randomization is allowed, the space usage is reduced to $O(\sigma^{1/\lambda} \log^2 \sigma)$ bits.

To encode $s[i]$, we first search in $D$ and, if $f(s[i], w_i) < \ell / \sigma^{1/\lambda}$, we simply write a 0 followed by $s[i]$'s index in the alphabet, update the data structures as described below, and proceed to $s[i+1]$; if $f(s[i], w_i) \ge \ell / \sigma^{1/\lambda}$, we use $P$ and $s[i]$'s position $A[j,k]$ in $A$ to compute
$$\sum_{h=0}^{j-1} C[h]/2^h + (k-1)/2^j \le 1.$$
The first $j = \lceil \log(\ell / f(s[i], w_i)) \rceil$ bits of this sum's binary representation are enough to uniquely identify $s[i]$ because, if a character $a \ne s[i]$ is stored at $A[j', k']$, then
$$\left| \left( \sum_{h=0}^{j-1} C[h]/2^h + \frac{k-1}{2^j} \right) - \left( \sum_{h=0}^{j'-1} C[h]/2^h + \frac{k'-1}{2^{j'}} \right) \right| \ge \frac{1}{2^j};$$
therefore, we write a 1 followed by these bits as the codeword for $s[i]$.

To decode $s[i]$, we read the next bit in the encoding; if it is a 0, we simply interpret the following $\lceil \log \sigma \rceil$ bits as $s[i]$'s index in the alphabet, update the data structures, and proceed to $s[i+1]$; if it is a 1, we interpret the following $\lceil \log \sigma^{1/\lambda} \rceil$ bits (of which $s[i]$'s codeword is a prefix) as a binary fraction $b$ and search in $P$ for the index $j$ of the largest partial sum $\sum_{h=0}^{j-1} C[h]/2^h \le b$. Knowing $j$ tells us the length of $s[i]$'s codeword or, equivalently, its row in $A$; we can also compute its offset,
$$k = \left\lfloor \left( b - \sum_{h=0}^{j-1} C[h]/2^h \right) 2^j \right\rfloor + 1;$$
thus, we can find and write $s[i]$. Encoding or decoding $s[i]$ takes $O(1)$ time for querying $D$ and $A$ and, if $f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$, then
$$O\left(\log\log \frac{\ell}{f(s[i], w_i)}\right) = O(\log\log\sigma)$$
time to query $P$.
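For illustration, the decoding search can be sketched as follows. This is our own toy version: a plain scan over the partial sums stands in for $P$ (Moffat's structure, or any searchable partial-sums tree, gives the stated $O(\log\log\sigma)$ bound), and the names `counts`, `bits` and `max_len` are ours.

```python
def decode_codeword(counts, bits, max_len):
    """Given counts[h] = the number of codewords of length h and `bits`
    holding the next max_len bits of the stream (of which one codeword is
    a prefix), return (j, k): the codeword's length and 1-based offset.
    A linear scan over the sums counts[h]/2**h stands in for the
    searchable partial-sums structure P."""
    b = int(bits, 2) / 2 ** max_len   # read the bits as a binary fraction
    partial = 0.0                      # partial sum counts[1]/2 + ...
    for j in range(1, max_len + 1):
        width = counts.get(j, 0) / 2 ** j
        if partial <= b < partial + width:
            # offset within the length-j block of the canonical code
            return j, int((b - partial) * 2 ** j) + 1
        partial += width
    raise ValueError("bits do not begin with a valid codeword")
```

With one codeword of length 2 and two of length 3 (as in the canonical-code example earlier), the stream bits "011" decode to length 3, offset 2, and "001" to length 2, offset 1 (the trailing bit is ignored, since the codeword is only a prefix of the bits read).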
After encoding or decoding $s[i]$, we update the data structures as follows:

• we dequeue $s[i-\ell]$ (if it exists) from $Q$ and enqueue $s[i]$; we decrement $s[i-\ell]$'s frequency in $D$ and delete it if it does not occur in $w_{i+1}$; we insert $s[i]$ into $D$ if it does not occur in $w_i$ or, if it does, increment its frequency;
• we remove $s[i-\ell]$ from $A$ (by replacing it with the last character in its list $A[j]$, decrementing $C[j]$, and updating $D$) if $f(s[i-\ell], w_{i+1}) < \ell/\sigma^{1/\lambda} \le f(s[i-\ell], w_i)$;
• we move $s[i-\ell]$ from list $A[j]$ to list $A[j+1]$ if $\lceil \log \sigma^{1/\lambda} \rceil \ge \lceil \log \frac{\ell}{f(s[i-\ell], w_{i+1})} \rceil > \lceil \log \frac{\ell}{f(s[i-\ell], w_i)} \rceil$; this is done by replacing $s[i-\ell]$ with $A[j].l$ and appending $s[i-\ell]$ at the end of $A[j+1]$; the pointers $A[j].l$ and $A[j+1].l$ and the counters $C[j]$ and $C[j+1]$ are also updated;
• if necessary, we insert $s[i]$ into $A$ or move it from $A[j]$ to $A[j-1]$; these procedures are symmetric to deleting $s[i-\ell]$ and to moving $s[i-\ell]$ from $A[j]$ to $A[j+1]$;
• finally, if we have changed $C$, the data structure $P$ is updated.

All of these updates, except the last one, take $O(1)$ time, and updating $P$ takes $O(\log\log\sigma)$ time in the worst case. When we insert a new element $s[i]$ into $Q$, this may lead to updating $P$ as described above. We may decrement the codeword length of $s[i]$ or insert a new codeword for the symbol $s[i]$. In both cases, we can update $P$ in $O(\log \mathrm{length}(s[i]))$ time, where $\mathrm{length}(s[i])$ is the current codeword length of $s[i]$. When we delete an element $s[i-\ell]$, we may increment the codeword length of $s[i-\ell]$ or remove it from the code. If the codeword length is incremented, then we update $P$ in $O(\log \mathrm{length}(s[i-\ell]))$ time.
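The window bookkeeping in the first bullet above can be sketched as follows; this is our own minimal illustration, with a deque playing the role of $Q$ and a hash-based counter standing in for the dictionary $D$ of Lemma 2 (not the constant-time structure itself, and without the codeword table $A$).

```python
from collections import Counter, deque

class SlidingWindow:
    """Maintain the frequencies of the last ell characters and report
    whether a character currently qualifies as frequent, i.e. occurs
    at least threshold = ell / sigma**(1/lam) times in the window."""
    def __init__(self, ell, threshold):
        self.ell = ell
        self.threshold = threshold
        self.q = deque()      # the queue Q, storing the window w_i
        self.freq = Counter() # the dictionary D (frequencies only)

    def push(self, a):
        if len(self.q) == self.ell:   # dequeue s[i - ell], decrement its count
            old = self.q.popleft()
            self.freq[old] -= 1
            if self.freq[old] == 0:   # delete it if it leaves the window
                del self.freq[old]
        self.q.append(a)
        self.freq[a] += 1

    def is_frequent(self, a):
        return self.freq[a] >= self.threshold
```

A character's status can change in both directions as the window slides, which is exactly what the moves between the lists $A[j]$ and $A[j \pm 1]$ track.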
If we remove the codeword for $s[i-\ell]$, then we also update $P$ in $O(\log \mathrm{length}(s[i-\ell]))$ time; in the last case we can charge the cost of updating $P$ to the previous occurrence of $s[i-\ell]$ in the string $s$, when $s[i-\ell]$ was encoded with $\mathrm{length}(s[i-\ell])$ bits. The times to update the codewords of the symbols $s[i]$ and $s[i-\ell]$ are $O\left(\log\log\frac{\ell}{f(s[i], w_i)}\right)$ and $O\left(\log\log\frac{\ell}{f(s[i-\ell], w_i)}\right)$ respectively. Hence, by Jensen's inequality, in total we encode $s$ in $O(n \log H')$ time, where $H'$ is the average number of bits per character in our encoding. In the next section we will prove that sliding-window Shannon coding encodes $s$ in $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits. Since we can assume that $\sigma$ is not vastly larger than $n$, our method works in $O(n \log H)$ time. If the dictionary $D$ is implemented as in Lemma 3, the analysis is exactly the same, but the string $s$ is processed in $O(n \log H)$ expected time.

Lemma 5. Sliding-window Shannon coding can be implemented in $O(n \log H)$ time overall and $O(\log\log\sigma)$ time for any character, using $O(\sigma^{1/\lambda+\epsilon})$ bits of memory and one pass. If randomization is allowed, sliding-window Shannon coding can be implemented in $O(\sigma^{1/\lambda}\log^2\sigma)$ bits of memory and $O(n \log H)$ expected time.

4 Analysis

In this section we prove the upper bound on the encoding length of sliding-window Shannon coding and obtain the following theorem.

Theorem 1. We encode $s$ in, and later decode it from, $\lambda n H(s) + (\lambda\ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits using $O(n\log H)$ time overall and $O(\log\log\sigma)$ time for any character, $O(\sigma^{1/\lambda+\epsilon})$ bits of memory and one pass. If randomization is allowed, the memory usage can be reduced to $O(\sigma^{1/\lambda}\log^2\sigma)$ bits and $s$ can be encoded and decoded in $O(n\log H)$ expected time.

Proof: Consider any substring $s' = s[k..(k+\ell-1)]$ of $s$ with length $\ell$, and let $F$ be the set of characters $a$ such that
$$f(a, s[\max(k-\ell,1)..(k+\ell-1)]) \ge \frac{\ell}{\sigma^{1/\lambda}};$$
notice $|F| \le 2\sigma^{1/\lambda}$. For $k \le i \le k+\ell-1$, if $s[i] \in F$ but $f(s[i], w_i) < \ell/\sigma^{1/\lambda}$, then we encode $s[i]$ using
$$\lceil\log\sigma\rceil + 1 < \lambda\log\sigma^{1/\lambda} + 2 < \lambda\log\frac{\ell}{\max(f(s[i], w_i), 1)} + 2 \le \lambda\log\frac{\ell}{\max(f(s[i], s[k..(i-1)]), 1)} + 2$$
bits; if $f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$, then we encode $s[i]$ using
$$\left\lceil\log\frac{\ell}{f(s[i], w_i)}\right\rceil + 1 < \lambda\log\frac{\ell}{\max(f(s[i], s[k..(i-1)]), 1)} + 2$$
bits; finally, if $s[i] \notin F$, then we again encode $s[i]$ using
$$\lceil\log\sigma\rceil + 1 < \lambda\log\sigma^{1/\lambda} + 2 < \lambda\log\frac{\ell}{f(s[i], s[\max(k-\ell,1)..(k+\ell-1)])} + 2 \le \lambda\log\frac{\ell}{f(s[i], s')} + 2$$
bits. Therefore, the total number of bits we use to encode $s'$ is less than
$$\lambda \sum_{a\in F}\ \sum_{s[i]=a,\, k\le i\le k+\ell-1} \log\frac{\ell}{\max(f(a, s[k..(i-1)]), 1)} + \lambda \sum_{a\notin F} f(a,s')\log\frac{\ell}{f(a,s')} + 2\ell$$
$$= \lambda\ell\log\ell - \lambda\sum_{a\in F}\ \sum_{s[i]=a,\, k\le i\le k+\ell-1} \log\max(f(a, s[k..(i-1)]), 1) - \lambda\sum_{a\notin F} f(a,s')\log f(a,s') + 2\ell;$$
since
$$\sum_{s[i]=a,\, k\le i\le k+\ell-1} \log\max(f(a, s[k..(i-1)]), 1) = \sum_{j=1}^{f(a,s')-1}\log j,$$
we can rewrite our bound as
$$\lambda\left(\ell\log\ell - \sum_{a\in F}\sum_{j=1}^{f(a,s')-1}\log j - \sum_{a\notin F} f(a,s')\log f(a,s')\right) + 2\ell$$
$$= \lambda\left(\ell\log\ell - \sum_{a\in F}\log((f(a,s')-1)!) - \sum_{a\notin F} f(a,s')\log f(a,s')\right) + 2\ell;$$
by Stirling's formula,
$$\ell\log\ell - \sum_{a\in F}\log((f(a,s')-1)!) = \ell\log\ell - \sum_{a\in F}\log(f(a,s')!) + \sum_{a\in F}\log f(a,s')$$
$$\le \ell\log\ell - \sum_{a\in F}\left(f(a,s')\log f(a,s') - f(a,s')\ln 2\right) + |F|\log\ell$$
$$\le \ell\log\ell - \sum_{a\in F} f(a,s')\log f(a,s') + \ell\ln 2 + 2\sigma^{1/\lambda}\log\ell,$$
so we can again rewrite our bound as
$$\lambda\left(\ell\log\ell - \sum_a f(a,s')\log f(a,s') + \ell\ln 2 + 2\sigma^{1/\lambda}\log\ell\right) + 2\ell$$
$$= \lambda\sum_a f(a,s')\log\frac{\ell}{f(a,s')} + \left(\lambda\ln 2 + 2 + \frac{2\lambda\sigma^{1/\lambda}\log\ell}{\ell}\right)\ell = \lambda\ell H(s') + \left(\lambda\ln 2 + 2 + \frac{2\lambda\sigma^{1/\lambda}\log\ell}{\ell}\right)\ell.$$
Recall $\ell = \lceil c\sigma^{1/\lambda}\log\sigma\rceil$, so
$$\frac{2\lambda\sigma^{1/\lambda}\log\ell}{\ell} = \frac{2\lambda\sigma^{1/\lambda}\log\lceil c\sigma^{1/\lambda}\log\sigma\rceil}{\lceil c\sigma^{1/\lambda}\log\sigma\rceil} \le \frac{2\lambda\left(\log c + (1/\lambda)\log\sigma + \log\log\sigma + 1\right)}{c\log\sigma} \le \frac{2\lambda(\log c + 3)}{c}$$
(we will give tighter inequalities in the full paper, but use these here for simplicity); for any constants $\lambda \ge 1$ and $\epsilon > 0$, we can choose a constant $c$ large enough that $2\lambda(\log c + 3)/c < \epsilon$, so the number of bits we use to encode $s'$ is less than $\lambda\ell H(s') + (\lambda\ln 2 + 2 + \epsilon)\ell$. With $c = 10$, for example, $2\lambda(\log c + 3)/c < (2 - \ln 2)\lambda$, so our bound is less than $\lambda\ell H(s') + (2\lambda + 2)\ell$; with $c = 100$, it is less than $\lambda\ell H(s') + (0.9\lambda + 2)\ell$.

Since the product of length and empirical entropy is superadditive, i.e., $|s_1|H(s_1) + |s_2|H(s_2) \le |s_1 s_2|H(s_1 s_2)$, we have
$$\ell \sum_{j=0}^{\lfloor n/\ell\rfloor - 1} H\left(s[(j\ell+1)..(j+1)\ell]\right) \le nH(s)$$
so, by the bound above, we encode the first $\ell\lfloor n/\ell\rfloor$ characters of $s$ using fewer than $\lambda nH(s) + (\lambda\ln 2 + 2 + \epsilon)n$ bits. We encode the last $\ell$ characters of $s$ using fewer than $\lambda\ell H(s[(n-\ell+1)..n]) + (\lambda\ln 2 + 2 + \epsilon)\ell = O(\ell\log\sigma) = O(\sigma^{1/\lambda}\log^2\sigma)$ bits so, even counting the bits we use for $s[(n-\ell+1)..\ell\lfloor n/\ell\rfloor]$ twice, in total we encode $s$ using fewer than $\lambda nH(s) + (\lambda\ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits. □

If the most common $\sigma^{1/\lambda}$ characters in the alphabet make up much more than half of $s$ (in particular, when $\lambda = 1$) then, instead of using an extra bit for each character, we can keep a special escape codeword and use it to indicate occurrences of characters not in the code. The analysis becomes somewhat complicated, however, so we leave discussion of this modification for the full paper.
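The superadditivity step used in the proof can be sanity-checked numerically; `nH` below is our own helper computing $|s| \cdot H(s)$ directly from the definition of the zeroth-order empirical entropy.

```python
from collections import Counter
from math import log2

def nH(s):
    """Return |s| * H(s): length times zeroth-order empirical entropy."""
    freq = Counter(s)
    return sum(f * log2(len(s) / f) for f in freq.values())

# Superadditivity: |s1| H(s1) + |s2| H(s2) <= |s1 s2| H(s1 s2),
# so summing the per-chunk bounds cannot overshoot n H(s).
s1, s2 = "aabb", "abcc"
assert nH(s1) + nH(s2) <= nH(s1 + s2)
```

Intuitively, concatenation can only mix the chunk distributions, and the mixture's entropy weighted by total length is at least the length-weighted sum of the parts.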
5 Summary

In this paper we presented an algorithm that uses space sublinear in the alphabet size and achieves an encoding length that is close to the lower bound of [5]. Our algorithm processes each symbol in $O(\log\log\sigma)$ worst-case time, whereas linear-space prefix coding algorithms can encode a string of $n$ symbols in $O(n)$ time, i.e., in time independent of the alphabet size $\sigma$. It is an interesting open problem whether our algorithm (or one with the same space bound) can be made to run in $O(n)$ time.

References

[1] D. Chen, Y.-J. Chiang, N. D. Memon, and X. Wu. Alphabet partitioning for semi-adaptive Huffman coding of large alphabets. IEEE Transactions on Communications, 55:436–443, 2007.
[2] J. B. Connell. A Huffman-Shannon-Fano code. Proceedings of the IEEE, 61:1046–1047, 1973.
[3] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM Journal on Computing, 23:738–761, 1994.
[4] T. Gagie. Dynamic Shannon coding. Information Processing Letters, 102:113–117, 2007.
[5] T. Gagie and G. Manzini. Space-conscious compression. In Proceedings of the 32nd Symposium on Mathematical Foundations of Computer Science, pages 206–217, 2007.
[6] R. G. Gallager. Variations on a theme by Huffman. IEEE Transactions on Information Theory, 24:668–674, 1978.
[7] T. Hagerup, P. B. Miltersen, and R. Pagh. Deterministic dictionaries. Journal of Algorithms, 41:69–85, 2001.
[8] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40:1098–1101, 1952.
[9] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, 28:51–55, 2003.
[10] M. Karpinski and Y. Nekrich. A fast algorithm for adaptive prefix coding. Algorithmica, to appear.
[11] D. E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163–180, 1985.
[12] A. Moffat. Linear time adaptive arithmetic coding. IEEE Transactions on Information Theory, 36:401–406, 1990.
[13] E. S. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the ACM, 7:166–169, 1964.
[14] Unicode Consortium. The Unicode Standard, Version 5.0. Addison-Wesley Professional, 2006.
[15] P. Vines and J. Zobel. Compression techniques for Chinese text. Software: Practice and Experience, 28(12):1299–1314, 1998.
[16] J. S. Vitter. Design and analysis of dynamic Huffman codes. Journal of the ACM, 34(4):825–845, 1987.
