A Faster Grammar-Based Self-Index
Travis Gagie (Aalto University), Paweł Gawrychowski (Max Planck Institute), Juha Kärkkäinen (University of Helsinki), Yakov Nekrich (University of Chile), Simon J. Puglisi (University of Helsinki)

Abstract. To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + occ log log n) time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the O(m^2) term to O(m log m). All previous self-indexes are larger or slower in the worst case.

Keywords: compressed self-indexes, grammar-based compression, Lempel-Ziv compression

1. Introduction

With the advance of DNA-sequencing technologies comes the problem of how to store many individuals' genomes compactly but such that we can search them quickly. Any two human genomes are 99.9% the same, but compressed self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform or LZ78 (see [1] for a survey) do not take full advantage of this similarity. Researchers have recently started building self-indexes based on context-free grammars (CFGs) and LZ77 [2], which better compress highly repetitive strings.
A compressed self-index stores a string S[1..n] in compressed form such that, first, given a position i and a length ℓ, we can quickly extract S[i..i+ℓ−1] and, second, given a pattern P[1..m], we can quickly list the occ occurrences of P in S.

Preprint submitted to Elsevier, October 31, 2018.

Figure 1: A balanced SLP for abaababaabaab (left) and the corresponding parse tree (right). The SLP consists of the rules X7 → X6 X5, X6 → X5 X4, X5 → X4 X3, X4 → X3 X2, X3 → X2 X1, X2 → a, X1 → b.

Claude and Navarro [3] gave the first compressed self-index based on grammars or, more precisely, straight-line programs (SLPs). An SLP is a context-free grammar (CFG) in Chomsky normal form that generates only one string. Figure 1 shows an example. They showed how, given an SLP with r rules for a string S, we can build a self-index that takes O(r) space and supports extraction in O((ℓ + h) log r) time and pattern matching in O((m(m + h) + h occ) log r) time, where h is the height of the parse tree. Our model is the word RAM with Θ(log n)-bit words; except where stated otherwise, by log we mean log_2 and we measure space in words.

The same authors [4] recently gave a self-index that has better time bounds and can be based on any CFG generating S and only S. Specifically, they showed how, given such a CFG with r′ distinct terminal and non-terminal symbols and R symbols on the righthand sides of the rules, we can build a self-index that takes O(R) space and supports extraction in O(ℓ + h log(R/h)) time and pattern matching in O(m^2 log(log n / log r′) + occ log r′) time. If we are not concerned about the constant coefficient in the space bound, we can improve Claude and Navarro's time bound for extraction.
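To make the grammar representation concrete, here is a small sketch (our own illustration, not the authors' data structure) using the SLP of Figure 1. It extracts a substring of the generated string by descending the parse tree and skipping, via stored expansion lengths, subtrees that lie entirely to the left of the requested position; the sophisticated O(ℓ + log n) machinery of Bille et al. is not reproduced here.

```python
from functools import lru_cache

# The SLP of Figure 1; terminal rules map to characters.
rules = {'X7': ('X6', 'X5'), 'X6': ('X5', 'X4'), 'X5': ('X4', 'X3'),
         'X4': ('X3', 'X2'), 'X3': ('X2', 'X1'), 'X2': 'a', 'X1': 'b'}

@lru_cache(maxsize=None)
def exp_len(sym):
    """Number of characters the symbol expands to."""
    r = rules[sym]
    return 1 if isinstance(r, str) else exp_len(r[0]) + exp_len(r[1])

def extract(sym, i, l):
    """Extract l characters starting at 0-based position i of sym's
    expansion, descending the parse tree and skipping whole subtrees
    that lie entirely left of position i."""
    r = rules[sym]
    if isinstance(r, str):
        return r
    left, right = r
    out = ''
    if i < exp_len(left):
        out += extract(left, i, min(l, exp_len(left) - i))
    if len(out) < l:
        out += extract(right, max(0, i - exp_len(left)), l - len(out))
    return out

print(extract('X7', 0, 13))  # the whole string: abaababaabaab
```

Each call touches only the path down to position i plus the extracted leaves, which is the intuition behind the O(ℓ + h)-type extraction bounds.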
Calculation shows that h log(R/h) ≥ log n. Given a CFG generating S and only S with R symbols on the righthand sides of the rules, we can turn it into an SLP with O(R) rules (although the number of distinct symbols and the height of the parse tree can each increase by a factor of O(log n)). Bille et al. [5] showed how we can store such an SLP in O(R) space and support extraction in O(ℓ + log n) time. Combining their result with Claude and Navarro's improved one, we obtain an index that still takes O(R) space and O(m^2 log(log n / log r′) + occ log r′) time for pattern matching but only O(ℓ + log n) time for extraction.

In this paper we show how, given an SLP for S with r rules, we can build a self-index that takes O(r + z log log n) space, where z is the number of phrases in the LZ77 parse of S, and supports extraction in O(ℓ + log n) time and pattern matching in O(m^2 + occ log log n) time. Therefore, by the observations above, given a CFG generating S and only S with R symbols on the righthand sides of the rules, we can build an index with the same time bounds that takes O(R + z log log n) space. If we are given a balanced SLP for S — i.e., one for which the parse tree is height- or weight-balanced [6] — and we accept a small probability of building a faulty index, then we do not need Bille et al.'s result to extract in O(ℓ + log n) time and we can reduce the time bound for pattern matching to O(m log m + occ log log n). Rytter [7] showed how we can build such an SLP with O(z log(n/z)) rules, and proved that no SLP for S has fewer than z rules. His algorithm still has the best known approximation ratio even when the SLP need not be balanced, but performs badly in practice. Recently, however, Maruyama, Sakamoto and Takeda [8] gave a practical online algorithm that produces a balanced SLP with O(z log^2 n) rules.
In other words, requiring the SLP to be balanced is a reasonable restriction both in theory and in practice.

Table 1 summarizes Claude and Navarro's bounds and our own. Since all the self-indexes mentioned can be made to support extraction in O(ℓ + log n) time without increasing their space usage by more than a constant factor, we do not include this bound in the table. As noted above, given a CFG generating S and only S with R symbols on the righthand sides of the rules, we can turn it into an SLP with O(R) rules, so our first result is as general as Claude and Navarro's; the r in the second row of the table can be replaced by R. By Rytter's result, we can assume z ≤ r = O(z log(n/z)).

Table 1: Claude and Navarro's bounds and our own. In the first row, R is the number of symbols on the righthand sides of the rules in a given CFG generating S and only S, and r′ is the number of distinct terminal and non-terminal symbols in that CFG. In the second and third rows, r is the number of rules in a given SLP for S — which must be balanced in the third row — and z is the number of phrases in the LZ77 parse of S.

  source      space               search time
  [4]         O(R)                O(m^2 log(log n / log r′) + occ log r′)
  Theorem 3   O(r + z log log n)  O(m^2 + occ log log n)
  Theorem 7   O(r + z log log n)  O(m log m + occ log log n)

There are other self-indexes optimized for highly repetitive strings but comparing ours against them directly is difficult.
For example, Do et al.'s [9] space bound is in terms of the number of phrases in a new variant of the LZ77 parse [10], which can be much larger than z; Huang et al.'s [11] is bounded in terms of the number and length of common and distinct regions in the text; Maruyama et al.'s [12] time bound for pattern matching depends on "the number of occurrences of a maximal common subtree in [the edit-sensitive parse] trees of P and S"; Kreft and Navarro's [13] time bound depends on the depth of nesting in the LZ77 parse. We still use many ideas from Kreft and Navarro's work, which we describe in Section 2.

In Section 3 we show how, given an SLP for S with r rules, we can build a self-index that takes O(r + z log log n) space and supports extraction in O(ℓ + log n) time and pattern matching in O(m^2 + occ log log n) time. We also show how, with the same self-index, in O(m^2 log log z) time we can compute all cyclic shifts and maximal substrings of P that occur in S. In Section 4 we show how, if the SLP is balanced and we accept a small probability of building a faulty index, then we can reduce the time bound for pattern matching to O(m log m + occ log log n). Finally, in Section 5 we discuss directions for future work.

Figure 2: The LZ77 parse of "abaababaabaab" (left) and the locations of the phrase sources plotted as points on a grid (right). In the parse, horizontal lines indicate phrases' sources, with arrows leading to the boxes containing the phrases themselves. On the grid, a point's horizontal coordinate is where the corresponding source starts, and its vertical coordinate is where the source ends. Notice that a phrase source S[i..j] covers a substring S[i′..i′+m−1] if and only if the point (i, j) is above and to the left of the point (i′, i′+m−1).
Kreft and Na v arro’s Self-Index The LZ77 compression algorithm w orks b y parsing S from left to righ t into z phrases: after parsing S [1 ..i − 1], it finds the longest prefix S [ i..j − 1] of S [ i..m ] that has o ccurred b efore and selects S [ i..j ] as the next phrase. If j = 1 then the phrases consists only of the first occurrence of a c haracter; otherwise, the leftmost o ccurrence of S [ i..j − 1] is called the phrase’s source. Figure 2 sho ws an example. Kreft and Na v arro [13] gav e the first (and, so far, only) compressed self-index based on LZ77. Their index takes O ( z log n ) + o ( n ) bits and supports extraction in O ( `d ) time and pattern matc hing in O m 2 d + ( m + o cc) log z time, where d ≤ z is the depth of nesting in the parse. They considered only the non-self- referen tial v ersion of LZ77, so z ≥ log n ; for ease of comparison, we do the same. They also gav e a v ariant of LZ77 called LZ-End, with whic h they can reduce the extraction time to O ( ` + d ). Although they sho w ed that LZ-End p erforms 5 w ell in practice, how ever, they were unable to b ound the worst-case size of the LZ-End parse in terms of z . Kreft and Nav arro start b y building tw o P atricia trees, one for the rev erses of the phrases in the LZ77 parse and the other for the suffixes of S that start at phrase b oundaries. A P atricia tree [14] is a compacted trie for substrings of a stored string, in whic h w e store only the first character and length of eac h edge lab el; the leav es store p oin ters into the string itself suc h that, after finishing a searc h at a node in the tree, we can v erify that the node’s path lab el matc hes the string we seek. The total size of the t w o Patricia trees is O ( z ). Since Kreft and Na v arro store S in compressed form, they extract no des’ path lab els in order to verify them. 
For example, if S = abaababaabaab, then the reverses of the phrases are shown below on the left with the phrase numbers, in order by phrase number on the left and in lexicographic order on the right; the suffixes starting at phrase boundaries are shown on the right. When building the Patricia trees, we treat S as ending with a special character $ lexicographically less than any in the alphabet, and each reversed phrase as ending with another special character #. Figure 3 shows the Patricia trees for this example.

  reversed phrases           suffixes at phrase boundaries
  1) a#       6) $b#         1) baababaabaab$   6) $
  2) b#       1) a#          2) aababaabaab$    4) aabaab$
  3) aa#      3) aa#         3) babaabaab$      2) aababaabaab$
  4) bab#     5) aabaa#      4) aabaab$         5) b$
  5) aabaa#   2) b#          5) b$              1) baababaabaab$
  6) $b#      4) bab#        6) $               3) babaabaab$

Figure 3: The Patricia trees for the reversed phrases (left) and suffixes starting at phrase boundaries (right) in the LZ77 parse of "abaababaabaab".

Their next component is a data structure for four-sided range reporting on a z × z grid storing z points, with each point (i, j) indicating that the lexicographically ith reversed phrase is followed in S by the lexicographically jth suffix starting at a phrase boundary. Figure 4 shows the grid for our running example S = abaababaabaab. Kreft and Navarro use a wavelet tree, which takes O(z) space and answers queries in O((p + 1) log z) time, where p is the number of points reported [15]. Many other data structures are known for this problem, however, with different time-space tradeoffs.

Figure 4: A grid showing how, in the LZ77 parse of "abaababaabaab", reversed phrases precede suffixes starting at phrase boundaries.
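A compacted trie of the kind underlying a Patricia tree can be sketched as follows (our own illustration; as in a Patricia tree, each edge stores only a (position, length) pointer into S rather than an explicit label, so labels must be verified against S). We build it over the suffixes of S$ that start at phrase boundaries.

```python
class Node:
    def __init__(self, leaf=None):
        self.leaf = leaf   # suffix start position, set only at leaves
        self.edges = {}    # first char of edge label -> [start, length, child]

def insert(root, s, p):
    """Insert the suffix s[p:] into the compacted trie."""
    node, i = root, p
    while True:
        c = s[i]
        if c not in node.edges:
            node.edges[c] = [i, len(s) - i, Node(leaf=p)]
            return
        start, length, child = node.edges[c]
        j = 1
        while j < length and s[start + j] == s[i + j]:
            j += 1
        if j == length:                       # consumed the whole edge
            node, i = child, i + j
        else:                                 # split the edge at the mismatch
            mid = Node()
            mid.edges[s[start + j]] = [start + j, length - j, child]
            node.edges[c] = [start, j, mid]
            node, i = mid, i + j

def collect(node, prefix, s, out):
    """Recover (leaf position, path label) pairs by expanding edge pointers."""
    if not node.edges:
        out.append((node.leaf, prefix))
    for start, length, child in node.edges.values():
        collect(child, prefix + s[start:start + length], s, out)

s = "abaababaabaab$"
starts = [1, 2, 4, 7, 12, 13]    # 0-indexed phrase boundaries of the example
root = Node()
for p in starts:
    insert(root, s, p)
out = []
collect(root, "", s, out)
```

Because edges are pointers into S, the tree itself stores O(z) words; the real index additionally truncates labels to (first character, length) and verifies matches by extraction, exactly as described above.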
Their final component is (essentially) a data structure for two-sided range reporting on an n × n grid storing at most z − 1 points, with each point (i, j) indicating that S[i..j] is a phrase's source. The grid for S = abaababaabaab is shown beside the LZ77 parse in Figure 2. They implement this data structure with a compressed bitvector (as a predecessor data structure) and a range-minimum data structure, which take O(z log n) + o(n) bits of space and answer queries in O(p + 1) time, where p is again the number of points reported. Again, however, other time-space tradeoffs are available.

Given a pattern P[1..m], Kreft and Navarro use the two Patricia trees to find, for 1 ≤ i ≤ m, the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of the suffixes starting with P[i+1..m] at phrase boundaries. This takes a total of O(m^2) time to descend the Patricia trees and O(m^2 d) time to extract nodes' path labels. They then use the wavelet tree to find all the phrase boundaries preceded by P[1..i] and followed by P[i+1..m], which takes a total of O((m + occ) log z) time. After these steps, they know the locations of all occurrences of P that cross phrase boundaries in S, which are called primary occurrences. An occurrence of P that is completely contained within a phrase is called a secondary occurrence.

By the definition of LZ77, the first occurrence must be primary and any secondary occurrence must be copied from an earlier occurrence. We can find all secondary occurrences by finding all primary occurrences and then recursively finding all phrase sources that cover occurrences we have already found. Notice that, if a phrase source S[i..j] covers an occurrence S[i′..i′+m−1], then i ≤ i′ and j ≥ i′+m−1, so the point (i, j) is above and to the left of the point (i′, i′+m−1).
It follows that, after finding all primary occurrences of P, Kreft and Navarro can find all secondary occurrences in O(occ) time using one two-sided range-reporting query per occurrence. Therefore, their self-index supports pattern matching in a total of O(m^2 d + (m + occ) log z) time.

We can use a new data structure by Chan, Larsen and Pătrașcu [16] for four-sided range reporting, instead of a wavelet tree, and a y-fast trie [17] for predecessor queries, instead of a compressed bitvector. Calculation shows that Kreft and Navarro's space bound then changes to O(z log log z) words and their time bound improves to O(m^2 d + m log log z + occ log log n). Bille and Gørtz [18] showed how, by storing one-dimensional range-reporting data structures at each node in the top log log z levels of the Patricia trees, we can eliminate the O(m log log z) term: if m ≤ log log z then instead of the data structure for four-sided range reporting, we can use the one-dimensional range-reporting data structures, which are faster; otherwise, the O(m^2) term dominates the O(m log log z) term anyway. Thus, by implementing the components differently in Kreft and Navarro's self-index, we obtain one that takes O(z log log z) space and supports pattern matching in O(m^2 d + occ log log n) time.

If we are given an SLP for S with r rules then we can also combine Bille et al.'s [5] data structure with our modification of Kreft and Navarro's. We can use Bille et al.'s data structure for extracting nodes' path labels while pattern matching, so we obtain a self-index that takes O(r + z log log z) space and supports extraction in O(ℓ + log n) time and pattern matching in O(m^2 + m log n + occ log log n) time. In Section 3 we explain how to remove the O(m log n) term by taking advantage of the fact that, while pattern matching, we extract nodes' path labels only from phrase boundaries.
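The recursive copying rule for secondary occurrences can be sketched as follows (a brute-force illustration on the running example; the real index replaces the linear scan over sources with a two-sided range-reporting query per occurrence found). Here we treat as primary the occurrences that contain some phrase's last, explicit character, so every remaining occurrence lies entirely inside a copied region and is re-created by some source; the source triples below are hand-derived from the parse of Figure 2 and use 0-indexed positions.

```python
def secondary_occurrences(primary, sources, m):
    """Each source (s, e, t) means S[s..e] was copied to position t.
    A source that covers an already-found occurrence of length m
    re-creates that occurrence inside its phrase."""
    found = set(primary)
    stack = list(primary)
    while stack:
        a = stack.pop()
        for s, e, t in sources:   # range reporting in the real index
            if s <= a and a + m - 1 <= e:
                b = t + (a - s)
                if b not in found:
                    found.add(b)
                    stack.append(b)
    return sorted(found)

# S = abaababaabaab parsed as a|b|aa|bab|aabaa|b; primary occurrences
# of P = "ab" are at 0, 3, 5 and 11 (0-indexed).
sources = [(0, 0, 2), (1, 2, 4), (2, 5, 7), (1, 1, 12)]
print(secondary_occurrences([0, 3, 5, 11], sources, 2))  # [0, 3, 5, 8, 11]
```

The occurrence at position 8 is found because the source S[2..5], copied to position 7, covers the primary occurrence at position 3.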
3. Self-Indexing with an Unbalanced SLP

Suppose we are given an SLP for S with r rules and a list of t specified positions from which we want to support linear-time extraction, e.g., from the phrase boundaries in the LZ77 parse. We can build an instance of Bille et al.'s [5] data structure and support extraction from any position in O(ℓ + log n) time, where ℓ is the length of the substring extracted. When ℓ = Ω(log n) we have O(ℓ + log n) = O(ℓ), i.e., the extraction is linear-time. Therefore, we need worry only about extracting substrings of length o(log n) from around the t specified positions. Consider each substring that starts log n characters to the left of a specified position and ends log n characters to the right of that position. By the definition of LZ77, the first occurrence of that substring crosses a phrase boundary. If we store a pointer to the first occurrence of each such substring, which takes O(t) space, then we need worry only about extracting substrings of length o(log n) from around the phrase boundaries.

Now consider the string S′[1..n′] obtained from S by removing any character at distance more than log n from the nearest phrase boundary. Notice that S′ can be parsed into O(z) substrings, each of which

• occurs in S,
• has length at most log n,
• is either a single character or does not touch a phrase boundary in the LZ77 parse of S.

We claim that any such substring S′[i..j] is split between at most 2 phrases in the LZ77 parse of S′. To see why, consider that the first copy of S′[i..j] in S must touch a phrase boundary and is completely within distance log n of that phrase boundary, so it remains intact in S′. Therefore, either S′[i..j] is a single character — which is obviously contained within only 1 phrase in the LZ77 parse of S′ — or S′[i..j] is not the first occurrence of that substring in S′.
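The construction of S′ can be sketched by brute force (our own illustration; we use a small window w in place of log n so that the trimming is visible on the running example, and give boundaries as 0-indexed phrase starts).

```python
def trim_around_boundaries(s, boundaries, w):
    """Build S' by keeping only the characters within distance w of the
    nearest phrase boundary; everything else is removed."""
    keep = [any(abs(i - b) <= w for b in boundaries) for i in range(len(s))]
    return ''.join(c for c, k in zip(s, keep) if k)

s = "abaababaabaab"
boundaries = [0, 1, 2, 4, 7, 12]   # phrase starts of a|b|aa|bab|aabaa|b
print(trim_around_boundaries(s, boundaries, 1))
```

With w = 1, positions 9 and 10 (deep inside the phrase aabaa) are the only characters dropped, so S′ is only slightly shorter than S; with w = log n and z phrases, |S′| = O(z log n) as claimed below.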
It follows that the LZ77 parse of S′ consists of O(z) phrases. Clearly n′ = O(z log n), so we can apply Rytter's algorithm to build a balanced SLP for S′ that has r′ = O(z log(n′/z)) = O(z log log n) rules. Since this SLP is balanced, its parse tree has height O(log z + log log n) and so we can store it in O(r′) = O(z log log n) space and support extraction from any position in S′ in O(ℓ + log n′) = O(ℓ + log z) time.

We now have a data structure that takes O(r + t + z log log n) space and supports extraction from any position in S in O(ℓ + log n) time and extraction from the t specified positions in O(ℓ + log z) time. If we choose the specified positions to be the phrase boundaries in the LZ77 parse of S, then we can combine it with our modification of Kreft and Navarro's index from Section 2 and obtain a self-index that takes O(r + z log log n) space and supports extraction in O(ℓ + log n) time and pattern matching in O(m^2 + m log z + occ log log n) time. We next eliminate the O(m log z) term by taking advantage of the fact that the SLP for S′ is balanced. As noted in Section 1, an SLP is balanced if the corresponding parse tree is height- or weight-balanced.

Suppose we are given a position i in S′ and a bound L and asked to add O(1) space to our balanced SLP for S′ such that we can support extraction of any substring S′[i..i+ℓ−1] with ℓ ≤ L in O(ℓ + log L) time. Supporting extraction of any substring S′[i−ℓ+1..i] in O(ℓ + log L) time is symmetric. We find the lowest common ancestor u of the ith and (i + L)th leaves of the parse tree T for S′. We then find the deepest node v in u's left subtree such that v's subtree contains the ith leaf of T, and the deepest node w in u's right subtree such that w's subtree contains the (i + L)th leaf. Since our SLP for S′ is balanced, v and w have height O(log L).
To see why, consider that the ancestors of the rightmost leaf in u's left subtree and of the leftmost leaf in its right subtree have exponentially many leaves in their height. Without loss of generality, assume v's subtree contains the ith leaf. We store the non-terminals at v and w in O(log r′) bits, and O(log L) bits indicating the path from v to the ith leaf; together these take O(1) words. Figure 5 shows an example. We can view the symbols at nodes of T as pointers to those nodes and use the rules of the grammar to navigate in the tree. To extract S′[i..i+ℓ−1], we start at v, descend to the ith leaf in T, and then traverse the leaves to the right until we have either reached the (i+ℓ−1)st leaf in T or the rightmost leaf in v's subtree; in the latter case, we perform a depth-first traversal of w's subtree until we reach the (i+ℓ−1)st leaf in T. During both traversals we output the terminal symbol at each leaf when we visit it. If we store the size of each non-terminal's expansion (i.e., the number of leaves in the corresponding subtree of the parse tree) then, after descending from v to the ith leaf, in O(log L) time we can compute a list of O(log ℓ) terminal and non-terminal symbols such that the concatenation of their expansions is S′[i..i+ℓ−1]. This operation will prove useful in Section 4.

Since we can extract any substring S′[i..i+ℓ−1] in O(ℓ + log L) time and extracting any substring S′[i−ℓ+1..i] in O(ℓ + log L) time is symmetric, we can extract any substring of length ℓ that crosses position i in S′ in O(ℓ + log L) time. We can already extract any substring in O(ℓ + log n′) time, so we first choose L = log n′ and store O(1) words to be able to extract any substring that crosses position i in O(ℓ + log log n′) time.
We then choose L = log log n′ and store another O(1) words to be able to extract any such substring in O(ℓ + log log log n′) time. After log* n′ iterations, we have stored O(log* n′) words and can extract any such substring in O(ℓ) time.

Figure 5: To support fast extraction from position i, we store the non-terminals at v and w and the path from v to the ith leaf in the parse tree.

Lemma 1. Given a balanced SLP for a string S′[1..n′] and a specified position in S′, we can add O(log* n′) words to the SLP such that, if a substring of length ℓ crosses that position, then we can extract that substring in O(ℓ) time.

Applying Lemma 1 to each of the positions in S′ of the phrase boundaries in the LZ77 parse of S, then combining the resulting data structure with our instance of Bille et al.'s data structure for S, we obtain the following corollary.

Corollary 2. Given an SLP for S with r rules and a list of t specified positions, we can store S in O(r + t + z log log n) space such that, if a substring of length ℓ crosses a specified position, then we can extract that substring in O(ℓ) time.

Applying Corollary 2 to S and choosing the t specified positions to be the phrase boundaries in the LZ77 parse, we obtain a data structure that takes O(r + z log log n) space and supports extraction in O(ℓ + log n) time and extraction from around phrase boundaries in O(ℓ) time. Combining that with our modification of Kreft and Navarro's self-index from Section 2, we obtain our first main result.

Theorem 3.
Given a straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that we can extract any substring of length ℓ in O(ℓ + log n) time and, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + occ log log n) time.

We note that this self-index supports fast circular pattern matching (see, e.g., [19]), for which we want to find all the cyclic shifts P[j+1..m] P[1..j] of P that occur in S. Listing the occurrences can be handled in the same way as listing occurrences of P, so we ignore that subproblem here. We modify our searching algorithm such that, when we would search in the first Patricia tree for the reverse (P[1..i])^R of a prefix of P and in the second Patricia tree for the corresponding suffix P[i+1..m], we instead search for (P[i+1..m] P[1..i])^R and P[i+1..m] P[1..i], respectively. We record which nodes we visit in the Patricia trees and, when we stop descending (possibly because there is no edge whose label starts with the correct character), we extract the path label for the last node we visit in either tree and compute how many nodes' path labels match prefixes of (P[i+1..m] P[1..i])^R and P[i+1..m] P[1..i]. For each node v we visit in the first Patricia tree whose path label matches a prefix of (P[i+1..m] P[1..i])^R, we find the first node w (if one exists) that we visit in the second Patricia tree whose path label matches a prefix of P[i+1..m] P[1..i] and such that the sum of the lengths of the path labels of v and w is at least m.
For each such pair, we perform a range-emptiness query (i.e., a range-reporting query that we stop early, determining only whether there are any points in the range) to check whether there are any phrase boundaries that are immediately preceded by the reverse of v's path label and immediately followed by w's path label. These phrase boundaries are precisely those that are crossed by cyclic shifts of P with the boundary between P[i] and P[i+1]. This takes a total of O(m^2 log log z) time.

A similar idea works for finding the maximal substrings of P that occur in S. For each 1 ≤ i ≤ m, we can use doubling search — with a range-emptiness query at each step — to find the longest suffix P[h..i] of P[1..i] such that some phrase boundary is immediately preceded by P[h..i] and immediately followed by P[i+1]. We then use doubling search to find the longest prefix P[i+1..j] of P[i+1..m] such that some phrase boundary is immediately preceded by P[h..i] and immediately followed by P[i+1..j]. Notice that P[h..j] is the leftmost maximal substring of P crossing a phrase boundary at position i, and we record it as a candidate maximal substring of P occurring in S.

We now use doubling search to find the longest suffix P[h′..i] of P[h+1..i] such that some phrase boundary is immediately preceded by P[h′..i] and immediately followed by P[i+1..j+1], then we use doubling search to find the longest prefix P[i+1..j′] such that some phrase boundary is immediately preceded by P[h′..i] and immediately followed by P[i+1..j′]. Notice that h′ > h, j′ > j and P[h′..j′] is the second maximal substring of P crossing a phrase boundary at position i, and we record it as another candidate maximal substring of P occurring in S.
We repeat this procedure until we have recorded all the candidate maximal substrings crossing a phrase boundary at position i. While we work, the left endpoints of the prefixes and right endpoints of the suffixes we consider do not move left, so we use a total of O(m log log z) time to find the candidates associated with each position i. Since two candidates associated with the same position cannot contain each other, there are at most m of them. Once we have all the candidates for every position i, finding the true maximal substrings of P that occur in S takes O(m^2) time. In total we use O(m^2 log log z) time.

Corollary 4. Given a pattern P[1..m], we can use the self-index described in Theorem 3 to compute in O(m^2 log log z) time all the cyclic shifts and maximal substrings of P that occur in S.

4. Self-Indexing with a Balanced SLP

In this section we describe how, if the SLP we are given for S happens to be balanced, then we can improve the time bound in Theorem 3 using Karp-Rabin hashes [20]. A Karp-Rabin hash function

  f(T[1..ℓ]) = (Σ_{j=1}^{ℓ} σ^{ℓ−j} T[j]) mod q

maps strings to numbers, where σ is the size of the alphabet, q is a prime and we interpret each character T[j] as a number between 0 and σ − 1. If we choose q uniformly at random from among the primes at most n^c then, for any two distinct strings T[1..ℓ] and T′[1..ℓ] with ℓ ≤ n, the probability that f(T) = f(T′) is O(c log σ / n^{c−1}). Therefore, we can use Karp-Rabin hashes that fit in O(1) words with almost no chance of collisions. Notice that, once we have computed the Karp-Rabin hashes of all the prefixes of a string, we can compute the Karp-Rabin hash of any substring in O(1) time. Moreover, given the Karp-Rabin hashes of two strings, we can compute the Karp-Rabin hash of their concatenation in O(1) time.
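The prefix-hash bookkeeping behind these two O(1)-time operations can be sketched as follows (the modulus and alphabet size below are example choices; the paper picks q at random among the primes at most n^c).

```python
q = (1 << 61) - 1   # example prime modulus (the paper chooses q randomly)
sigma = 256         # example alphabet size

def prefix_hashes(t):
    """h[k] is the Karp-Rabin hash of t[:k]; pows[k] is sigma^k mod q."""
    h, pows = [0], [1]
    for c in t:
        h.append((h[-1] * sigma + ord(c)) % q)
        pows.append((pows[-1] * sigma) % q)
    return h, pows

def substring_hash(h, pows, i, j):
    """Hash of t[i:j], computed in O(1) from the prefix hashes."""
    return (h[j] - h[i] * pows[j - i]) % q

def concat_hash(ha, hb, len_b, pows):
    """Hash of the concatenation AB from hash(A), hash(B) and |B|."""
    return (ha * pows[len_b] + hb) % q
```

The substring identity follows from h[j] = h[i]·σ^{j−i} + f(t[i:j]) mod q, and the concatenation identity is the same equation read in the other direction.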
We can replace Karp-Rabin hashing by deterministic alternatives [21, 22] at the cost of increasing our bounds by polylogarithmic factors.

Consider the problem of finding the lexicographic range of the suffixes starting with P[i+1..m] at phrase boundaries. The problem of finding the lexicographic range of reversed phrases ending with P[1..i] is symmetric. Suppose we augment the Patricia tree for the suffixes by storing at each node u the Karp-Rabin hash of u's path label. This takes O(z) extra space and, assuming our Karp-Rabin hash causes no collisions and we have already computed the Karp-Rabin hashes of all the prefixes of P, lets us find the deepest node v whose path label is a prefix of P[i+1..m] in time proportional to v's depth. In the worst case, however, v's depth could be as large as m − i. Fortunately, while studying a related problem, Belazzougui, Boldi, Pagh and Vigna [23, 24] showed how, by storing one Karp-Rabin hash for each edge, we can use a kind of binary search to find v in O(log m) time. Ferragina [25] gave a somewhat simpler solution in which he balanced the Patricia tree by a centroid decomposition. His solution also takes O(z) space but with O(log z) search time.

If the length of v's path label is exactly m − i then, again assuming our Karp-Rabin hash causes no collisions, v's path label is P[i+1..m]. Otherwise, v's path label is a proper prefix of P[i+1..m] and in O(1) time we can find the edge descending from v (if one exists) whose label begins with the next character of P[i+1..m]. Let w be the child of v below this edge. If the length of w's path label is at most m − i then we know by our choice of v that no suffix starts at a phrase boundary with P[i+1..m]. Assume w's path label has length at least m − i + 1.
If any suffix starts at a phrase boundary with P[i+1..m], then those that do correspond to the leaves in w's subtree. We cannot determine from looking at Karp-Rabin hashes stored in the Patricia tree, however, whether there are any such suffixes. In order to determine this, we use the balanced SLP to compute the Karp-Rabin hash of the first m − i characters of w's path label.

Recall from Section 3 that, given a balanced SLP with r rules for a string S, a specified position i in S and a bound L, we can add O(1) space such that later, given a length ℓ ≤ L, in O(log L) time we can compute a list of O(log ℓ) terminal and non-terminal symbols such that the concatenation of their expansions is S[i..i+ℓ−1]. (This is the same information we store to extract S[i..i+ℓ−1].) It follows that, if we store the Karp-Rabin hash of the expansion of every non-terminal symbol, then we can compute the Karp-Rabin hash of S[i..i+ℓ−1] in O(log L) time. Symmetrically, we can add O(1) space such that we can compute in O(log L) time the Karp-Rabin hash of any substring of length at most L that ends at position i. Therefore, we can add O(1) space such that we can compute in O(log L) time the Karp-Rabin hash of any substring of length at most L that crosses position i in S. As long as L is polynomial in the length ℓ of the substring whose Karp-Rabin hash we want, log L = O(log ℓ). If we fix ε > 0 and apply this construction with L set to each of the O(log log n) values n, n^ε, n^{ε^2}, n^{ε^3}, ..., 2, then we obtain the following result.

Lemma 5. Given a balanced SLP for a string S[1..n] and a specified position in S, we can add O(log log n) words to the SLP such that, if a substring of length ℓ crosses that position, then we can compute its Karp-Rabin hash in O(log ℓ) time.

Applying Lemma 5 to each of the phrase boundaries in the LZ77 parse of S, we obtain the following corollary.

Corollary 6.
Given a balanced SLP for S with r rules, we can store S in O(r + z log log n) space such that, if a substring of length ℓ crosses a phrase boundary, then we can compute its Karp-Rabin hash in O(log ℓ) time.

Combining Corollary 6 with Belazzougui et al.'s construction, we obtain a data structure that takes O(r + z log log n) space and allows us to find in O(log m) time the lexicographic range of the suffixes starting with P[i + 1..m] at phrase boundaries, assuming our Karp-Rabin hash causes no collisions and we have already computed the Karp-Rabin hashes of all the prefixes of P. Since computing the Karp-Rabin hashes of all the prefixes of P takes O(m) time and we need do it only once, it follows that we can find in a total of O(m log m) time the lexicographic range of the suffixes starting with P[i + 1..m] for every value of i and, symmetrically, the lexicographic range of the reversed prefixes ending with P[1..i]. Combining this data structure with Lemma 1, we can also support extraction in O(log n + ℓ) time and extraction from around phrase boundaries in O(ℓ) time. Combining this data structure with our modification of Kreft and Navarro's self-index from Section 2, we obtain our second main result, below, except that our search time is O(m log m + m log log z + occ log log n) instead of O(m log m + occ log log n).

Theorem 7. Given a balanced straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that we can extract any substring of length ℓ in O(ℓ + log n) time and, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m log m + occ log log n) time. Our construction is randomized but, given any constant c, we can bound by 1/n^c the probability that we build a faulty index.
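The step of hashing a substring from its O(log ℓ) grammar symbols rests on the fact that, with a stored (hash, length) pair per symbol, the hash of a concatenation follows from h(xy) = h(x)·BASE^|y| + h(y) (mod MOD). A minimal sketch, where the symbol names, the table layout, and the base and modulus are illustrative assumptions rather than details fixed by the paper:

```python
# Folding Karp-Rabin hashes of concatenated expansions: given the grammar
# symbols whose expansions concatenate to S[i..i+l-1], and a precomputed
# (hash, length) pair per symbol, combine the pairs left to right.
BASE = 1_000_003
MOD = (1 << 61) - 1

def kr_hash(s):
    """Direct Karp-Rabin hash of a string, used to build the symbol table."""
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

def hash_of_concatenation(symbols, table):
    """Hash and length of the concatenated expansions of `symbols`.

    table maps each symbol to (hash of its expansion, expansion length);
    each step costs one modular exponentiation, so the whole fold runs in
    time roughly proportional to the number of symbols.
    """
    h, n = 0, 0
    for sym in symbols:
        hs, ls = table[sym]
        h = (h * pow(BASE, ls, MOD) + hs) % MOD
        n += ls
    return h, n
```

For example, with the SLP of Figure 1 (X2 expanding to "a", X3 to "ab", X4 to "aba"), folding the list [X4, X3, X2] yields the hash of "abaabaa"'s prefix "abaaba" without ever materializing the string.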
Unfortunately, this time we cannot use Bille and Gørtz' [18] approach alone to eliminate the O(m log log z) term. When m ≤ log log z, storing one-dimensional range-reporting data structures at nodes in the top log log z levels of the Patricia trees means we use O(r + z log log n) space and O(m log m + occ log log n) search time; when m ≥ log z, the O(m log m) term dominates the O(m log log z) term anyway. To deal with the case log log z < m < log z, we build a Patricia tree for the set of O(z log z) substrings of S that cross a phrase boundary, start at most log z characters before the first phrase boundary they cross, and end exactly log z characters after it (or at S[n], whichever comes first). At the leaf corresponding to each such substring, we store O(log log z) bits indicating the position in the substring where it first crosses a phrase boundary. In total this Patricia tree takes O(z log log z) words. If log log z < m < log z, we search for P in this new Patricia tree, which takes O(m) time. Suppose our search ends at node v. If P occurs in S, then the leaves in v's subtree store the distinct positions in P's primary occurrences where they first cross phrase boundaries. To determine whether P occurs in S, it suffices for us to choose any one of those positions, say i, and check whether there is a phrase boundary immediately preceded by P[1..i] and immediately followed by P[i + 1..m]. To do this, we search in our first two augmented Patricia trees and perform a range-emptiness query. If m ≤ log log z, then we can perform the range-emptiness query with the one-dimensional range-reporting data structures in O(1) time; otherwise, we perform the range-emptiness query with our data structure for four-sided range reporting in O(log log z) ⊆ O(m) time. If we learn that P does not occur in S, then we stop here, having used a total of O(m) time.
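The one-dimensional range-emptiness queries above ask whether any phrase-boundary point falls in a given interval. A binary-search stand-in conveys the idea (it answers in O(log z) time rather than the O(1) achieved by the range-reporting structures in the text):

```python
import bisect

def range_nonempty(points, lo, hi):
    """True iff the sorted list `points` contains some p with lo <= p <= hi.

    A bisect-based stand-in for the constant-time one-dimensional
    range-reporting structures: find the first point >= lo and check
    whether it is still <= hi.
    """
    i = bisect.bisect_left(points, lo)
    return i < len(points) and points[i] <= hi
```

The emptiness check used during the search is exactly this predicate applied to the lexicographic range of suffixes (and, symmetrically, of reversed prefixes) at phrase boundaries.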
If we learn that P does occur in S, then in O(occ) time we traverse v's subtree to obtain the full list of distinct positions in P's primary occurrences where they first cross phrase boundaries. For each such position, we search in our first two augmented Patricia trees and perform a range-reporting query. This takes O(m log m + occ log log z) time and gives us the positions of all P's primary occurrences in S.

5. Future Work

We are currently working on a practical implementation of our self-index. We believe the most promising avenue is to use Maruyama, Sakamoto and Takeda's [8] algorithm to build a balanced SLP, a wavelet tree as the range-reporting data structure [26, 15] and Ferragina's [25] restructuring to balance the Patricia trees. When m ≤ z, which is the case of most interest for many applications in bioinformatics, this implementation should take O(r) space and support location of all occ1 primary occurrences in O((m + occ1) log z) time, with reasonable coefficients. As we have explained, finding all secondary occurrences is relatively easy once we have found all the primary occurrences.

Approximate pattern matching is often more useful than exact pattern matching, especially in bioinformatics. Fortunately, Russo, Navarro, Oliveira and Morales [27] showed how to support practical approximate pattern matching using indexes for exact pattern matching, and we believe most of their techniques are applicable to our self-index. One potential problem is how to perform backtracking using Patricia trees augmented with Karp-Rabin hashes, without storing or extracting edge labels. This is because comparing hashes tells us (with high probability) when strings differ, but it does not tell us by how much they differ. We are currently investigating a new variant of Karp-Rabin hashes by Policriti, Tomescu and Vezzi [28] that roughly preserves Hamming distance.
Finally, we have shown elsewhere [29] that supporting extraction from specified positions has applications to, e.g., sequential approximate pattern matching. In that paper we developed a different data structure to support such extraction, which we have now implemented and found to be faster and more space-efficient than Kreft and Navarro's solutions. Nevertheless, we expect the solutions we have given here to be even better.

Acknowledgments

Many thanks to Djamal Belazzougui, Francisco Claude, Veli Mäkinen, Gonzalo Navarro and Jorma Tarhio, for helpful discussions.

References

[1] G. Navarro, V. Mäkinen, Compressed full-text indexes, ACM Computing Surveys 39 (2007).
[2] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory 23 (1977) 337–343.
[3] F. Claude, G. Navarro, Self-indexed grammar-based compression, Fundamenta Informaticae 111 (2011) 313–337.
[4] F. Claude, G. Navarro, Improved grammar-based self-indexes, Technical Report 1110.4493, arxiv.org, 2011.
[5] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, O. Weimann, Random access to grammar-compressed strings, in: Proceedings of the 22nd Symposium on Discrete Algorithms (SODA), pp. 373–389.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, MIT Press, 2001.
[7] W. Rytter, Application of Lempel-Ziv factorization to the approximation of grammar-based compression, Theoretical Computer Science 302 (2003) 211–222.
[8] S. Maruyama, H. Sakamoto, M. Takeda, An online algorithm for lightweight grammar-based compression, Algorithms 5 (2012) 214–235.
[9] H. H. Do, J. Jansson, K. Sadakane, W. Sung, Fast relative Lempel-Ziv self-index for similar sequences, in: Proceedings of the Joint Conference on Frontiers in Algorithmics and Algorithmic Aspects in Information and Management (FAW-AAIM), pp. 291–302.
[10] S. Kuruppu, S. J. Puglisi, J.
Zobel, Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval, in: Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pp. 201–206.
[11] S. Huang, T. W. Lam, W. Sung, S. Tam, S. Yiu, Indexing similar DNA sequences, in: Proceedings of the 6th Conference on Algorithmic Aspects in Information and Management (AAIM), pp. 180–190.
[12] S. Maruyama, M. Nakahara, N. Kishiue, H. Sakamoto, ESP-index: A compressed index based on edit-sensitive parsing, in: Proceedings of the 18th Symposium on String Processing and Information Retrieval (SPIRE), pp. 398–409.
[13] S. Kreft, G. Navarro, On compressing and indexing repetitive sequences, Theoretical Computer Science (2012). To appear.
[14] D. R. Morrison, PATRICIA - Practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM 15 (1968) 514–534.
[15] G. Navarro, Wavelet trees for all, in: Proceedings of the 23rd Symposium on Combinatorial Pattern Matching (CPM). To appear.
[16] T. M. Chan, K. G. Larsen, M. Pătraşcu, Orthogonal range searching on the RAM, revisited, in: Proceedings of the 27th Symposium on Computational Geometry (SoCG), pp. 1–10.
[17] D. E. Willard, Log-logarithmic worst-case range queries are possible in space Θ(N), Information Processing Letters 17 (1983) 81–84.
[18] P. Bille, I. L. Gørtz, Substring range reporting, in: Proceedings of the 22nd Symposium on Combinatorial Pattern Matching (CPM), pp. 299–308.
[19] C. S. Iliopoulos, M. S. Rahman, Indexing circular patterns, in: Proceedings of the 2nd Workshop on Algorithms and Computation (WALCOM), pp. 46–57.
[20] R. M. Karp, M. O. Rabin, Efficient randomized pattern-matching algorithms, IBM Journal of Research and Development 31 (1987) 249–260.
[21] K. Mehlhorn, R. Sundar, C. Uhrig, Maintaining dynamic sequences under equality tests in polylogarithmic time, Algorithmica 17 (1997) 183–198.
[22] S. Alstrup, G. S.
Brodal, T. Rauhe, Pattern matching in dynamic texts, in: Proceedings of the 11th Symposium on Discrete Algorithms (SODA), pp. 819–828.
[23] D. Belazzougui, P. Boldi, R. Pagh, S. Vigna, Monotone minimal perfect hashing: searching a sorted table with O(1) accesses, in: Proceedings of the 20th Symposium on Discrete Algorithms (SODA), pp. 785–794.
[24] D. Belazzougui, P. Boldi, R. Pagh, S. Vigna, Fast prefix search in little space, with applications, in: Proceedings of the 18th European Symposium on Algorithms (ESA), pp. 427–438.
[25] P. Ferragina, On the weak prefix-search problem, in: Proceedings of the 22nd Symposium on Combinatorial Pattern Matching (CPM), pp. 261–272.
[26] R. Grossi, A. Gupta, J. S. Vitter, High-order entropy-compressed text indexes, in: Proceedings of the 14th Symposium on Discrete Algorithms (SODA), pp. 841–850.
[27] L. M. S. Russo, G. Navarro, A. L. Oliveira, P. Morales, Approximate string matching with compressed indexes, Algorithms 2 (2009) 1105–1136.
[28] A. Policriti, A. I. Tomescu, F. Vezzi, A randomized numerical aligner (rNA), Journal of Computer and System Sciences (2011). To appear.
[29] T. Gagie, P. Gawrychowski, S. J. Puglisi, Faster approximate pattern matching in compressed repetitive texts, in: Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC), pp. 653–662.