Grammar-Based Compression in a Streaming Model
We show that, given a string $s$ of length $n$, with constant memory and logarithmic passes over a constant number of streams we can build a context-free grammar that generates $s$ and only $s$ and whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of the minimum $g$. This stands in contrast to our previous result that, with polylogarithmic memory and polylogarithmic passes over a single stream, we cannot build such a grammar whose size is within any polynomial of $g$.
Authors: Travis Gagie, Paweł Gawrychowski
Travis Gagie¹ and Paweł Gawrychowski²

¹ Department of Computer Science, University of Chile, travis.gagie@gmail.com
² Institute of Computer Science, University of Wrocław, Poland, gawry1@gmail.com

1 Introduction

In the past decade, the ever-increasing amount of data to be stored and manipulated has inspired intense interest in both grammar-based compression and streaming algorithms, resulting in many practical algorithms and upper and lower bounds for both problems. Nevertheless, there has been relatively little study of grammar-based compression in a streaming model. In a previous paper [1] we proved limits on the quality of the compression we can achieve with polylogarithmic memory and polylogarithmic passes over a single stream. In this paper we show how to achieve better compression with constant memory and logarithmic passes over a constant number of streams.

For grammar-based compression of a string $s$ of length $n$, we try to build a small context-free grammar (CFG) that generates $s$ and only $s$. This is useful not only for compression but also for, e.g., indexing [2,3] and speeding up dynamic programs [4]. (It is sometimes desirable for the CFG to be in Chomsky normal form (CNF), in which case it is also known as a straight-line program.)
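As a small, self-contained illustration of these definitions (the grammar and names here are our own toy example, not from the paper), a straight-line program gives each nonterminal exactly one rule, so the grammar generates exactly one string:

```python
def expand(rules, symbol):
    """Expand a straight-line program: every nonterminal has exactly one
    production, so the grammar generates one string and only that string."""
    if symbol in rules:
        return "".join(expand(rules, x) for x in rules[symbol])
    return symbol  # a terminal character

# A toy SLP in CNF for "abababab"; its size (symbols on righthand sides) is 6.
rules = {"S": ["A", "A"], "A": ["B", "B"], "B": ["a", "b"]}
print(expand(rules, "S"))  # abababab
```

Here the grammar has size 6 while the string has length 8; for highly repetitive strings the gap between the smallest grammar and the string itself can be exponential.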
We can measure our success in terms of universality [5], empirical entropy [6] or the ratio between the size of our CFG and the size $g = \Omega(\log n)$ of the smallest such grammar. In this paper we consider the third and last measure.

(⋆ The first author was funded by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile.)

Storer and Szymanski [7] showed that determining the size of the smallest grammar is NP-complete; Charikar et al. [8] showed it cannot be approximated to within a small constant factor in polynomial time unless P = NP, and that even approximating it to within a factor of $o(\log n / \log \log n)$ in polynomial time would require progress on a well-studied algebraic problem. Charikar et al. and Rytter [9] independently gave $O(\log(n/g))$-approximation algorithms, both based on turning the LZ77 [10] parse of $s$ into a CFG and both initially presented at conferences in 2002; Sakamoto [11] then proposed another $O(\log(n/g))$-approximation algorithm, based on Re-Pair [12]. Sakamoto, Kida and Shimozono [13] gave a linear-time $O((\log g)(\log n))$-approximation algorithm that uses $O(g \log g)$ workspace, again based on LZ77; together with Maruyama, they [14] recently modified their algorithm to run in $O(n \log^* n)$ time but achieve an $O((\log n)(\log^* n))$ approximation ratio.

A few years before Charikar et al.'s and Rytter's papers sparked a surge of interest in grammar-based compression, a paper by Alon, Matias and Szegedy [15] did the same for streaming algorithms. We refer the reader to Babcock et al. [16] and Muthukrishnan [17] for a thorough introduction to streaming algorithms. In this paper, however, we are most concerned with more powerful streaming models than these authors consider, ones that allow the use of multiple streams.
A number of recent papers have considered such models, beginning with Grohe and Schweikardt's [18] definition of $(r, s, t)$-bounded Turing machines, which use at most $r$ reversals over $t$ "external-memory" tapes and a total of at most $s$ space on "internal-memory" tapes to which they have unrestricted access. While Munro and Paterson [19] proved tight bounds for sorting with one tape three decades ago, Grohe and Schweikardt proved the first tight bounds for sorting with multiple tapes. Grohe, Hernich and Schweikardt [20] proved lower bounds in this model for randomized algorithms with one-sided error, and Beame, Jayram and Rudra [21] proved lower bounds for algorithms with two-sided error (renaming the model "read/write streams"). Beame and Huynh [22] revisited the problem considered by Alon, Matias and Szegedy, i.e., approximating frequency moments, and proved lower bounds for read/write stream algorithms. Hernich and Schweikardt [23] related results for read/write stream algorithms to results in classical complexity theory, including results by Chen and Yap [24] on reversal complexity. Hernich and Schweikardt's paper drew our attention to a theorem by Chen and Yap implying that, if a problem can be solved deterministically with read-only access to the input and logarithmic workspace, then, in theory, it can be solved with constant memory and logarithmic passes (in either direction) over a constant number of read/write streams. This theorem is the key to our main result in this paper. Unfortunately, the constants involved in Chen and Yap's construction are enormous; we leave as future work finding a more practical proof of our results.

The study of compression in something like a streaming model goes back at least a decade, to work by Sheinwald, Lempel and Ziv [25] and De Agostino and Storer [26].
As far as we know, however, our joint paper with Manzini [27] was the first to give nearly tight bounds in a standard streaming model. In that paper we proved nearly matching bounds on the compression achievable with a constant amount of internal memory and one pass over the input, as well as upper and lower bounds for LZ77 with a sliding window whose size grows as a function of the number of characters encoded. (Our upper bound for LZ77 used a theorem due to Kosaraju and Manzini [28] about quasi-distinct parsings, a subject recently revisited by Amir, Aumann, Levy and Roshko [29].) Shortly thereafter, Albert, Mayordomo, Moser and Perifel [30] showed the compression achieved by LZ78 [31] is incomparable to that achievable by pushdown transducers; Mayordomo and Moser [32] then extended their result to show both kinds of compression are incomparable with that achievable by online algorithms with polylogarithmic memory. (A somewhat similar subject, recognition of the context-sensitive Dyck languages in a streaming model, was recently broached by Magniez, Mathieu and Nayak [33], who gave a one-pass algorithm with one-sided error that uses polylogarithmic time per character and $O(n^{1/2} \log n)$ space.) In a recent paper with Ferragina and Manzini [34] we demonstrated the practicality of streaming algorithms for compression in external memory.
In a recent paper [1] we proved several lower bounds for compression algorithms that use a single stream, all based on an automata-theoretic lemma: suppose a machine implements a lossless compression algorithm using sequential accesses to a single tape that initially holds the input; then we can reconstruct any substring given, for every pass, the machine's configurations when it reaches and leaves the part of the tape that initially holds that substring, together with all the output it generates while over that part. (We note similar arguments appear in computational complexity, where they are referred to as "crossing sequences", and in communication complexity.) It follows that, if a streaming compression algorithm is restricted to using polylogarithmic memory and polylogarithmic passes over one stream, then there are periodic strings with polylogarithmic periods such that, even though the strings are very compressible as, e.g., CFGs, the algorithm must encode them using a linear number of bits; therefore, no such algorithm can approximate the smallest-grammar problem to within any polynomial of the minimum size. Such arguments cannot prove lower bounds for algorithms with multiple streams, however, and we left open the question of whether extra streams allow us to achieve a polynomial approximation. In this paper we use Chen and Yap's result to confirm they do: we show how, with logarithmic workspace, we can compute the LZ77 parse and turn that into a CFG in CNF while increasing the size by a factor of $O(\min(g \log g, \sqrt{n/\log n}))$, i.e., at most polynomially in $g$. It follows that we can achieve that approximation ratio while using constant memory and logarithmic passes over a constant number of streams.
2 LZ77 in a Streaming Model

Our starting point is the same as that of Charikar et al., Rytter and Sakamoto, Kida and Shimozono, but we pay even more attention to workspace than the last set of authors. Specifically, we begin by considering the variant of LZ77 considered by Charikar et al. and Rytter, which does not allow self-referencing phrases but still produces a parse whose size is at most as large as that of the smallest grammar. Each phrase in this parse is either a single character or a substring of the prefix of $s$ already parsed. For example, the parse of "how-much-wood-would-a-woodchuck-chuck-if-a-woodchuck-could-chuck-wood?" is "h|o|w|-|m|u|c|h|-|w|o|o|d|-wo|u|l|d-|a|-wood|ch|uc|k|-|chuck-|i|f|-a-woodchuck-c|ould-|chuck-|wood|?".

Lemma 1 (Charikar et al., 2002; Rytter, 2002). The number of phrases in the LZ77 parse is a lower bound on the size of the smallest grammar.

As an aside, we note that Lemma 1 and results from our previous paper [1] together imply we cannot compute the LZ77 parse with one stream when the product of the memory and passes is sublinear in $n$.

It is not difficult to show that this LZ77 parse, like the original, can be computed with read-only access to the input and logarithmic workspace. Pseudocode for doing this appears as Algorithm 1. On the example above, this pseudocode produces "how-much-wood(9,3)ul(13,2)a(9,5)(7,2)(6,2)k-(27,6)if(20,14)(16,5)(27,6)(10,4)?".

Lemma 2. We can compute the LZ77 parse with logarithmic workspace.

Proof. The first phrase in the parse is the first letter in $s$; after outputting this, we always keep a pointer $t$ to the division between the prefix already parsed and the suffix yet to be parsed.
To compute each later phrase in turn, we check the length of the longest common prefix of $s[i..t-1]$ and $s[t..n]$, for $1 \le i < t$; if the longest match has length 0 or 1, we output $s[t]$; otherwise, we output the value of the minimal $i$ that maximizes the length of the longest common prefix, together with that length. This takes a constant number of pointers into $s$ and a constant number of $O(\log n)$-bit counters. ⊓⊔

Algorithm 1: pseudocode for computing the LZ77 parse in logarithmic workspace

    t ← 1
    while t ≤ n do
        max_match ← 0; max_length ← 0
        for i ← 1 ... t − 1 do
            j ← 0
            while i + j < t and t + j ≤ n and s[i + j] = s[t + j] do
                j ← j + 1
            if j > max_length then
                max_match ← i; max_length ← j
        if max_length ≤ 1 then
            print s[t]; t ← t + 1
        else
            print (max_match, max_length); t ← t + max_length

Combined with Chen and Yap's theorem below, Lemma 2 implies that we can compute the LZ77 parse with constant workspace and logarithmic passes over a constant number of streams.

Theorem 1 (Chen and Yap, 1991). If a function can be computed with logarithmic workspace, then it can be computed with constant workspace and logarithmic passes over a constant number of streams.

As an aside, we note that Chen and Yap's theorem is actually much stronger than what we state here: they proved that, if $f(n) = \Omega(\log n)$ is reversal-computable (see [24] or [23] for an explanation) and a problem can be solved deterministically in $f(n)$ time, then it can be solved with constant workspace and $O(f(n))$ passes over a constant number of tapes. Chen and Yap showed how a reversal-bounded Turing machine can simulate a space-bounded Turing machine by building a table of the possible configurations of the space-bounded machine. Schweikardt [35] pointed out that "this is of no practical use, since the resulting algorithm produces huge intermediate results, but it is of major theoretical interest" because it implies that a number of lower bounds are tight. We leave as future work finding a more practical proof of our results.

In the next section we prove the following lemma, which is the most technical part of this paper. By the size of a CFG, we mean the number of symbols on the righthand sides of the productions; notice this is at most a logarithmic factor less than the number of bits needed to express the CFG.

Lemma 3. With logarithmic workspace we can turn the LZ77 parse into a CFG whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of minimum.

Together with Lemma 2 and Theorem 1, Lemma 3 immediately implies our main result.

Theorem 2. With constant workspace and logarithmic passes over a constant number of streams, we can build a CFG generating $s$ and only $s$ whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of minimum.

3 Logspace CFG Construction

Unlike the LZ78 parse, the LZ77 parse cannot normally be viewed as a CFG, because the substring to which a phrase matches may begin or end in the middle of a preceding phrase. We note this obstacle has been considered by other authors in other circumstances, e.g., by Navarro and Raffinot [36] for pattern matching. Fortunately, we can remove this obstacle in logarithmic workspace, without increasing the number of phrases more than quadratically. To do this, for each phrase for which we output $(i, \ell)$, we ensure $s[i]$ is the first character in a phrase and $s[i + \ell - 1]$ is the last character in a phrase (by breaking phrases in two, if necessary).
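As a sanity check, Algorithm 1 and the breaking step just described can be sketched in Python (a direct, quadratic-time transliteration for illustration only; the function names are ours, and this sketch uses explicit lists rather than the constant number of pointers used in the logspace versions):

```python
def lz77_parse(s):
    """The LZ77 variant of Section 2 (no self-referencing phrases): each
    phrase is a single character or the leftmost longest substring of the
    already-parsed prefix, output as a 1-indexed (start, length) pair."""
    n, t, phrases = len(s), 0, []
    while t < n:
        max_match = max_length = 0
        for i in range(t):                    # candidate earlier start points
            j = 0
            # longest common prefix of s[i..t-1] and s[t..n-1]; stopping
            # at t keeps the phrase non-self-referencing
            while i + j < t and t + j < n and s[i + j] == s[t + j]:
                j += 1
            if j > max_length:
                max_match, max_length = i, j
        if max_length <= 1:
            phrases.append(s[t])
            t += 1
        else:
            phrases.append((max_match + 1, max_length))
            t += max_length
    return phrases


def break_phrases(s, phrases):
    """Sketch of the breaking step: add breaks until, for every copied
    phrase, the matching source substring begins and ends on phrase
    boundaries.  Returns the sorted 0-indexed cut positions."""
    spans, pos = [], 0                        # (start, length, 0-indexed source)
    for ph in phrases:
        if isinstance(ph, tuple):
            spans.append((pos, ph[1], ph[0] - 1))
            pos += ph[1]
        else:
            spans.append((pos, 1, None))      # literal character
            pos += 1
    cuts = {st for st, _, _ in spans} | {len(s)}
    changed = True
    while changed:                            # iterate to a fixpoint
        changed = False
        for st, length, src in spans:
            if src is None:
                continue
            for c in [c for c in cuts if st <= c <= st + length]:
                mirrored = src + (c - st)     # matching position in the source
                if mirrored not in cuts:
                    cuts.add(mirrored)
                    changed = True
    return sorted(cuts)
```

On the running example, `lz77_parse` reproduces the parse "how-much-wood(9,3)ul(13,2)a(9,5)(7,2)(6,2)k-(27,6)if(20,14)(16,5)(27,6)(10,4)?" given above, and `break_phrases` adds cuts until every copied substring starts and ends on phrase boundaries, staying far below the squared bound.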
For example, the parse "h|o|w|-|m|u|c|h|-|w|o|o|d|-wo|u|l|d-|a|-wood|ch|uc|k|-|chuck-|i|f|-a-woodchuck-c|ould-|chuck-|wood|?" becomes "h|o|w|-|m|u|c|h|-|w|o|o|d|-w¹|o|u|l|d₂|-|a|-wood|c₃|h|uc|k|-|c³₄|huck-|i|f|²-a-woodchuck-c|¹,⁴ould-|chuck-|wood|?", where the thick lines of the original figure indicate the new breaks and superscripts indicate which breaks cause the new ones (which are subscripted). Notice the break "a-woodchuck-c|¹,⁴ould" causes both "w¹ould" (matching "ould") and "a-woodchuck-c³₄huck" (matching "a-woodchuck-c"); in turn, the latter new break causes "woodc₃huck" (matching "huck"), which is why it has a superscript 3.

Lemma 4. Breaking the phrases takes at most logarithmic workspace and at most squares the number of phrases. Afterwards, every phrase is either a single character or the concatenation of complete, consecutive, preceding phrases.

Proof. Since the phrases' start points are the partial sums of their lengths, we can compute them with logarithmic workspace; therefore, we can assume without loss of generality that the start points are stored with the phrases. We start with the rightmost phrase and work left. For each phrase's endpoints, we compute the corresponding positions in the matching, preceding substring (notice that the position corresponding to one phrase's finish may not be the one corresponding to the start of the next phrase to the right) and insert a new break there, if there is not one already. If we have inserted a new break, then we iterate, computing the position corresponding to the new break; eventually, we will reach a point where there is already a break, so the iteration will stop. This process requires only a constant number of pointers, so we can perform it with logarithmic workspace. Also, since each phrase is broken at most twice for each of the phrases that initially follow it in the parse, the final number of phrases is at most the square of the initial number. By inspection, after the process is complete every phrase is the concatenation of complete, consecutive, preceding phrases. ⊓⊔

Notice that, after we break the phrases as described above, we can view the parse as a CFG. For example, the parse for our running example corresponds to

    $X_0 \to X_1 \cdots X_{35}$
    $X_1 \to \text{h}$
    ...
    $X_{13} \to \text{d}$
    $X_{14} \to X_9 X_{10}$
    $X_{15} \to \text{o}$
    ...
    $X_{31} \to X_{19} \cdots X_{27}$
    ...
    $X_{34} \to X_{10} \cdots X_{13}$
    $X_{35} \to \text{?}$

where $X_0$ is the starting nonterminal. Unfortunately, while the number of productions is polynomial in the number of phrases in the LZ77 parse, it is not clear the size is and, moreover, the grammar is not in CNF. Since all the righthand sides of the productions are either terminals or sequences of consecutive nonterminals, we could put the grammar into CNF by squaring the number of nonterminals, giving us an approximation ratio cubic in $g$. This would still be enough for us to prove our main result but, fortunately, such a large increase is not necessary.

Lemma 5. Putting the CFG into CNF takes logarithmic workspace and increases the number of productions by at most a logarithmic factor. Afterwards, the size of the grammar is proportional to the number of productions.

Proof.
We build a forest of complete binary trees whose leaves are the nonterminals: if we consider the trees in order by size, the nonterminals appear in order from the leftmost leaf of the first tree to the rightmost leaf of the last tree; each tree is as large as possible, given the number of nonterminals remaining after we build the trees to its left. Notice there are $O(\log g)$ such trees, of total size at most $O(g^2)$. We then assign a new nonterminal to each internal node and output a production which takes that nonterminal to its children. This takes logarithmic workspace and increases the number of productions by a constant factor.

Notice any sequence of consecutive nonterminals that spans at least two trees can be written as the concatenation of two consecutive sequences, one of which ends with the rightmost leaf in one tree and the other of which starts with the leftmost leaf in the next tree. Consider a sequence ending with the rightmost leaf in a tree; dealing with one that starts with a leftmost leaf is symmetric. If the sequence completely contains that tree, we can write a binary production that splits the sequence into the prefix in the preceding trees, which is the expansion of a new nonterminal, and the leaves in that tree, which are the expansion of its root. We need do this $O(\log g)$ times before the remaining subsequence is contained within a single tree. After that, we repeatedly produce new binary productions that split the subsequence into prefixes, again the expansions of new nonterminals, and suffixes, the expansions of roots of the largest possible complete subtrees. Since the size of the largest possible complete subtree shrinks by a factor of two at each step (or, equivalently, the height of its root decreases by 1), we need repeat $O(\log g)$ times.
Again, this takes logarithmic workspace (we will give more details in the full version of this paper). In summary, we may replace each production with $O(\log g)$ new, binary productions. Since the productions are binary, the number of symbols on the righthand sides is linear in the number of productions themselves. ⊓⊔

Lemma 5 is our most detailed result, and the diagram below showing the construction with our running example is also somewhat detailed. On the left are modifications of the original productions, now made binary; in the middle are productions for the internal nodes of the binary trees; and on the right are productions breaking down the consecutive subsequences that appear on the righthand sides of the productions in the left column, until the subsequences are single, original nonterminals or nonterminals for nodes in the binary trees (i.e., those on the lefthand sides of the productions in the middle column).

Left column: $X_0 \to X_{1,32} X_{33,35}$; $X_1 \to \text{h}$; ...; $X_{13} \to \text{d}$; $X_{14} \to X_9 X_{10}$; $X_{15} \to \text{o}$; ...; $X_{31} \to X_{19,24} X_{25,27}$; ...; $X_{34} \to X_{10,12} X_{13}$; $X_{35} \to \text{?}$

Middle column: $X_{1,32} \to X_{1,16} X_{17,32}$; $X_{1,16} \to X_{1,8} X_{9,16}$; $X_{1,8} \to X_{1,4} X_{5,8}$; ...; $X_{17,32} \to X_{17,24} X_{25,32}$; $X_{17,24} \to X_{17,20} X_{21,24}$; ...; $X_{29,32} \to X_{29,30} X_{31,32}$; $X_{33,35} \to X_{33,34} X_{35}$; $X_{1,2} \to X_1 X_2$; ...; $X_{33,34} \to X_{33} X_{34}$

Right column: $X_{19,24} \to X_{19,20} X_{21,24}$; $X_{25,27} \to X_{25,26} X_{27}$; ...; $X_{10,12} \to X_{10} X_{11,12}$

Combined with Lemma 1, Lemmas 4 and 5 imply that with logarithmic workspace we can build a CFG in CNF whose size is $O(g^2 \log g)$. We can use a similar approach with binary trees to build a CFG in CNF of size $O(n)$ that generates $s$ and only $s$, still using logarithmic workspace.
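The forest-of-complete-binary-trees decomposition underlying Lemma 5 can be sketched as follows (our own illustration; the function names are ours, and the greedy dyadic covering below is one standard way to realize the $O(\log g)$ splitting in the proof):

```python
def forest_sizes(m):
    """Leaf counts of the forest of complete binary trees over m leaves:
    each tree is as large as possible given the leaves remaining."""
    sizes = []
    while m:
        sizes.append(1 << (m.bit_length() - 1))  # largest power of two <= m
        m -= sizes[-1]
    return sizes


def dyadic_cover(lo, hi):
    """Cover the 0-indexed interval [lo, hi) with maximal aligned dyadic
    blocks (subtrees of a complete binary tree); O(log(hi - lo)) blocks."""
    blocks = []
    while lo < hi:
        size = 1
        while size * 2 <= hi - lo and lo % (size * 2) == 0:
            size *= 2
        blocks.append((lo, lo + size))
        lo += size
    return blocks


def cover(i, j, m):
    """Write the run of nonterminals X_i .. X_j (1-indexed, inclusive) as
    a logarithmic number of subtree intervals of the forest over m leaves."""
    pieces, offset = [], 0
    for t in forest_sizes(m):
        lo, hi = max(i - 1, offset), min(j, offset + t)
        if lo < hi:                              # the run overlaps this tree
            for a, b in dyadic_cover(lo - offset, hi - offset):
                pieces.append((offset + a + 1, offset + b))  # back to 1-indexed
        offset += t
    return pieces
```

For the running example with 35 nonterminals this yields trees of 32, 2 and 1 leaves, and covering the run $X_{19} \ldots X_{24}$ produces exactly the intervals $X_{19,20}$ and $X_{21,24}$ that appear in the diagram above.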
If we combine all nonterminals that have the same expansion, which also takes logarithmic workspace, then this becomes Kieffer, Yang, Nelson and Cosman's [37] Bisection algorithm, which gives an $O(\sqrt{n/\log n})$-approximation [8]. By taking the smaller of these two CFGs we achieve an $O(\min(g \log g, \sqrt{n/\log n}))$-approximation. Therefore, as we claimed in Lemma 3, with logarithmic workspace we can turn the LZ77 parse into a CFG whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of minimum.

4 Recent Work

We recently improved the bound on the approximation ratio in Lemma 3 from $O(\min(g \log g, \sqrt{n/\log n}))$ to $O(\min(g, \sqrt{n/\log n}))$. The key observation is that, by the definition of the LZ77 parse, the first occurrence of any substring must touch or cross a break between phrases. Consider any phrase in the parse obtained by applying Lemma 4 to the LZ77 parse. By the observation above, that phrase can be written as the concatenation of some consecutive new phrases (all contained within one old phrase and ending at that old phrase's right end), some consecutive old phrases, and some more consecutive new phrases (all contained within one old phrase and starting at that old phrase's left end). Since there are $O(g)$ old phrases, there are $O(g^2)$ sequences of consecutive old phrases; since there are $O(g^2)$ new phrases, there are $O(g^2)$ sequences of consecutive new phrases that are contained in one old phrase and either end at that old phrase's right end or start at that old phrase's left end.

While working on the improvement above, we realized how to improve the bound further, to $O(\min(g, 4^{\sqrt{\log n}}))$.
To do this, we choose a value $b$ between 2 and $n$ and, for $0 \le i \le \log_b n$, we associate a nonterminal to each of the $b$ blocks of $\lceil n/b^i \rceil$ characters to the left and right of each break; we thus start building the grammar with $O(bg \log_b n)$ nonterminals. We then add $O(bg \log_b n)$ binary productions such that any sequence of nonterminals associated with a consecutive sequence of blocks can be derived from $O(1)$ nonterminals. Notice any substring is the concatenation of 0 or 1 partial blocks, some number of full blocks to the left of a break, some number of blocks to the right of a break, and 0 or 1 more partial blocks.

We now add more binary productions as follows: we start with $s$ (the only block of length $n/b^0 = n$); find the first break it touches or crosses (in this case it is the start of $s$); consider $s$ as the concatenation of blocks of size $n/b^1$ (in this case only the rightmost block can be partial); associate nonterminals to the partial blocks (if they exist); add $O(1)$ productions to take the symbol associated to $s$ (in this case, the start symbol) to the sequence of nonterminals associated with the smaller blocks in order from left to right; and recurse on each of the smaller blocks. To guarantee each smaller block touches or crosses a break, we work on the first occurrence in $s$ of the substring contained in that block. We stop recursing when the block size is 1, and add $O(bg)$ productions taking those blocks' nonterminals to the appropriate characters. Analysis shows that the number of productions we add during the recursion is proportional to the number of blocks involved, either full or partial.
Since the number of distinct full blocks in any level of recursion is $O(bg)$ and the number of partial blocks is at most twice the number of blocks (full or partial) in the previous level of recursion, the number of productions we add during the recursion is $O(2^{\log_b n} bg)$. Therefore, the grammar has size $O(2^{\log_b n} bg)$; when $b = 2^{\sqrt{\log n}}$, this is $O(4^{\sqrt{\log n}} g)$.

The first of the two key observations that let us build the grammar in logarithmic workspace is that we can store the index of a block (full or partial) with respect to the associated break in $O(\sqrt{\log n})$ bits; therefore, we can store $O(1)$ indices for each of the $O(\sqrt{\log n})$ levels of recursion, in a total of $O(\log n)$ bits. The second key observation is that, given the indices of the blocks we are working on in each level of recursion, with respect to the appropriate breaks, we can compute the start point and end point of the block we are currently working on in the deepest level of recursion. We will give details of these two improvements in the full version of this paper.

While working on this second improvement, we realized that we can use the same ideas to build a compressed representation that allows efficient random access. We refer the reader to the recent papers by Kreft and Navarro [38] and Bille, Landau and Weimann [39] for background on this problem. Suppose that, for each of the $O(2^{\sqrt{\log n}} g \sqrt{\log n})$ full blocks described above, we store a pointer to the first occurrence in $s$ of the substring in that block, as well as a pointer to the first break that first occurrence touches or crosses. Notice this takes a total of $O(2^{\sqrt{\log n}} g (\log n)^{3/2})$ bits.
Then, given a block's index and an offset in that block, in $O(1)$ time we can compute a smaller block's index and an offset in that smaller block, such that the characters in those two positions are equal; if the larger block has size 1, in $O(1)$ time we can return the character. Since $s$ itself is a block, an offset in it is just a character's position, and there are $O(\sqrt{\log n})$ levels of recursion, it follows that we can access any character in $O(\sqrt{\log n})$ time. Further analysis shows that it takes $O(\sqrt{\log n} + \ell)$ time to access a substring of length $\ell$. Of course, for any positive constant $\epsilon$, if we are willing to use $O(n^\epsilon g)$ bits of space, then we can access any character in constant time. If we make the data structure slightly larger and more complicated, e.g., storing searchable partial sums at the block boundaries, then, at least for strings over fairly small alphabets, we can also support fast rank and select queries.

This implementation makes it easy to see the data structure's relation to LZ77 and grammar-based compression. We can use a simpler implementation, however, and use LZ77 only in the analysis. Suppose that, for $0 \le i \le \log_b n$, we break $s$ into consecutive blocks of length $\lceil n/b^i \rceil$ (the last block may be shorter), always starting from the first character of $s$. For each block, we store a pointer to the first occurrence in $s$ of that block's substring. Given a block's index and an offset in that block, in $O(1)$ time we can again compute a smaller block's index and an offset in that smaller block, such that the characters in those two positions are equal: the new block's index is the sum of the pointer and the old offset, divided by the new block length and rounded down; the new offset is the sum of the pointer and the old offset, modulo the new block length.
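A sketch of this simpler representation in Python (the class and method names are ours; this illustration keeps every block's pointer, whereas, as noted next, blocks that no query can visit may be discarded):

```python
class BlockAccess:
    """Sketch of the simpler random-access structure described above: for
    each level i, s is cut into consecutive blocks of length ceil(n/b^i)
    (the last block may be shorter), and each block stores a pointer to
    the first occurrence in s of its substring."""

    def __init__(self, s, b):
        assert b >= 2 and len(s) > 0
        n = len(s)
        self.levels = []                     # list of (block_length, pointers)
        i = 0
        while True:
            length = (n + b ** i - 1) // b ** i          # ceil(n / b^i)
            ptrs = [s.find(s[st:st + length]) for st in range(0, n, length)]
            self.levels.append((length, ptrs))
            if length == 1:
                self.chars = list(s)         # at block length 1, store characters
                break
            i += 1

    def access(self, pos):
        """Return s[pos] (0-indexed), following one pointer per level."""
        block, offset = 0, pos               # s itself is the single level-0 block
        for (length, ptrs), (next_length, _) in zip(self.levels, self.levels[1:]):
            # the first occurrence of this block's substring, shifted by our
            # offset, is a position in s holding the same character ...
            p = ptrs[block] + offset
            # ... which we re-express one level down: new index is the sum
            # divided by the new block length (rounded down), new offset is
            # the sum modulo the new block length
            block, offset = p // next_length, p % next_length
        return self.chars[block]
```

Each access follows one pointer per level, so it takes $O(\log_b n)$ pointer steps; with the choice $b = 2^{\sqrt{\log n}}$ from the analysis above, this is $O(\sqrt{\log n})$.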
We can discard any block that cannot be visited during a query, so this data structure takes at most a constant factor more space than the one described above. Indeed, this data structure seems likely to be smaller in practice, because blocks of the same size can overlap in the previous data structure but cannot in this one. We plan to implement this data structure and report the results in a future paper.

References

1. Gagie, T.: On the value of multiple read/write streams for data compression. In: Proceedings of the Symposium on Combinatorial Pattern Matching. (2009) 68–77
2. Claude, F., Navarro, G.: Self-indexed text compression using straight-line programs. In: Proceedings of the Symposium on Mathematical Foundations of Computer Science. (2009) 235–246
3. Lifshits, Y.: Processing compressed texts: A tractability border. In: Proceedings of the Symposium on Combinatorial Pattern Matching. (2007) 228–240
4. Lifshits, Y., Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. Algorithmica 54(3) (2009) 379–399
5. Kieffer, J.C., Yang, E.: Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory 46(3) (2000) 737–754
6. Navarro, G., Russo, L.M.S.: Re-Pair achieves high-order entropy. In: Proceedings of the Data Compression Conference. (2008) 537
7. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4) (1982) 928–951
8. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Transactions on Information Theory 51(7) (2005) 2554–2576
9. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1–3) (2003) 211–222
10.
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3) (1977) 337–343
11. Sakamoto, H.: A fully linear-time approximation algorithm for grammar-based compression. Journal of Discrete Algorithms 3(2–4) (2005) 416–430
12. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11) (2000) 1722–1732
13. Sakamoto, H., Kida, T., Shimozono, S.: A space-saving linear-time algorithm for grammar-based compression. In: Proceedings of the Symposium on String Processing and Information Retrieval. (2004) 218–229
14. Sakamoto, H., Maruyama, S., Kida, T., Shimozono, S.: A space-saving approximation algorithm for grammar-based compression. IEICE Transactions 92-D(2) (2009) 158–165
15. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58(1) (1999) 137–147
16. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the Symposium on Database Systems. (2002) 1–16
17. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Volume 1(2) of Foundations and Trends in Theoretical Computer Science. now Publishers (2005)
18. Grohe, M., Schweikardt, N.: Lower bounds for sorting with few random accesses to external memory. In: Proceedings of the Symposium on Database Systems. (2005) 238–249
19. Munro, J.I., Paterson, M.: Selection and sorting with limited storage. Theoretical Computer Science 12 (1980) 315–323
20. Grohe, M., Hernich, A., Schweikardt, N.: Lower bounds for processing data with few random accesses to external memory. Journal of the ACM 56(3) (2009) 1–58
21. Beame, P., Jayram, T.S., Rudra, A.: Lower bounds for randomized read/write stream algorithms. In: Proceedings of the Symposium on Theory of Computing. (2007) 689–698
22. Beame, P., Huynh, T.
: On the v alue of m ultiple read/write streams for appro xi- mating frequency moments. In: Proceedings of the Symp osium on F oundations of Computer Science. (2008) 499–508 23. Hern ic h, A., Sc hw eik ardt, N.: Reversal complexit y revisi ted. Theoretical Computer Science 401 (1–3) (2008) 191–205 24. Chen, J., Y ap, C.: Reversal co mplexity . SIAM Journal on Computing 20 (4) (1991) 622–638 25. Sh einw ald, D., Lemp el, A., Ziv, J.: On enco ding and deco ding with tw o-w a y head mac hines. Information and Computation 116 (1) (1995) 128–133 26. De Agostino, S., Storer, J.A.: On-line versus off- line computation in dynamic tex t compression. I nformation Processing Letters 59 (3) ( 1996) 169–174 27. Gagie, T., Manzini, G.: Sp ace-conscious compression. I n: Pro ceedings of the Sym- p osium on Mathematical F oun dations of Compu ter Science. (2007) 206–217 28. Kosara ju, S .R., Manzini, G.: Compression of lo w entropy strings with Lemp el-Ziv algorithms. SIAM Journal on Computing 29 (3) (19 99) 893–911 12 T. Gagie and P . Gawryc how ski 29. Amir, A ., Aumann , Y ., Levy , A ., R oshko, Y.: Quasi-distinct p arsing and optimal compression methods. I n: Pro ceedings of th e Sy mp osium on Com binatorial P attern Matc hing. (2009) 12–25 30. Alb ert, P ., May ordomo, E., Moser, P ., P erifel, S .: Pushdown compression. In: Proceedings of the S ymp osium on Theoretica l Asp ects of Computer Science. (2008 ) 39–48 31. Ziv, J., Lemp el, A.: Compression of individu al sequ ences via v ariable-rate co ding. IEEE T ransactions on In formation Theory 24 (5) (1978) 530–536 32. May ordomo, E., Moser, P .: P olylog space compression is incomparable with Lemp el-Ziv and p ushdown compression. In: Proceedings of the Conference on Current T rends in Theory and Practice of Informatics . (20 09) 633–644 33. Magniez, F., Mathieu, C., Nay ak, A .: Recognizing w ell-parenthesiz ed expressions in the streaming mo del. 
T echnical Rep ort TR09-119, Electronic Colloquium on Computational Complexity (2009) 34. F errag ina, P ., Gagie, T., Manzini, G.: Light wei ght data index ing and compression in external memory . In : Proceedings of the L atin American Theoretical Informatics Symp osium. (2010) T o app ear. 35. Schw eik ardt , N.: Mac hine models and low er b ounds for q uery pro cessing. In: Proceedings of the Sym p osium on Principles of Database Sy stems. ( 2007) 41–52 36. Nav arro, G., Raffinot, M .: Practical and flexible pattern matc hing ov er Ziv-Lempel compressed text. Journal of Discrete Algorithms 2 (3) ( 2004) 347–371 37. Kieffer, J.C., Y ang, E., Nelson, G.J., Cosman, P .C.: Univ ersal lossless compression via multilev el p attern matching. IEEE T ransactio ns on Information Theory 46 (4) (2000) 1227–124 5 38. Kreft, S., N a v arro, G.: LZ77-lik e compression with fast random access. In: Pro- ceedings of the Data Compression Conference. (2010) T o app ear. 39. Bille, P ., Landau, G., W eimann, O.: Random access to grammar comp ressed strings. http://arxiv.org/ab s/1001.1565 (2010)