Substring Range Reporting

Philip Bille
phbi@imm.dtu.dk

Inge Li Gørtz
ilg@imm.dtu.dk

October 30, 2018

Abstract

We revisit various string indexing problems with range reporting features, namely, position-restricted substring searching, indexing substrings with gaps, and indexing substrings with intervals. We obtain the following main results.

• We give efficient reductions for each of the above problems to a new problem, which we call substring range reporting. Hence, we unify the previous work by showing that we may restrict our attention to a single problem rather than studying each of the above problems individually.

• We show how to solve substring range reporting with optimal query time and little space. Combined with our reductions this leads to significantly improved time-space trade-offs for the above problems. In particular, for each problem we obtain the first solutions with optimal query time and $O(n \log^{O(1)} n)$ space, where $n$ is the length of the indexed string.

• We show that our techniques for substring range reporting generalize to substring range counting and substring range emptiness variants. We also obtain non-trivial time-space trade-offs for these problems.

Our bounds for substring range reporting are based on a novel combination of suffix trees and range reporting data structures. The reductions are simple and general and may apply to other combinations of string indexing with range reporting.

1 Introduction

Given a string $S$ of length $n$, the string indexing problem is to preprocess $S$ into a compact representation that efficiently supports substring queries, that is, given another string $P$ of length $m$, report all occurrences of substrings in $S$ that match $P$.
Combining the classic suffix tree data structure [14] with perfect hashing [13] leads to an optimal time-space trade-off for string indexing, i.e., an $O(n)$ space representation that supports queries in $O(m + \mathrm{occ})$ time, where $\mathrm{occ}$ is the number of occurrences of $P$ in $S$.

In recent years, several extensions of string indexing problems that add range reporting features have been proposed. For instance, Mäkinen and Navarro proposed the position-restricted substring searching problem [21, 22]. Here, queries take an additional range $[a, b]$ of positions in $S$ and the goal is to report the occurrences of $P$ within $S[a, b]$. For such extensions of string indexing no optimal time-space trade-off is known. For instance, for position-restricted substring searching one can either get $O(n \log^{\varepsilon} n)$ space (for any constant $\varepsilon > 0$) and $O(m + \log\log n + \mathrm{occ})$ query time or $O(n^{1+\varepsilon})$ space with $O(m + \mathrm{occ})$ query time [8, 21, 22]. Hence, removing the $\log\log n$ term in the query time comes at the cost of significantly increasing the space.

(An extended abstract of this paper appeared at the 22nd Conference on Combinatorial Pattern Matching.)

In this paper, we revisit a number of string indexing problems with range reporting features, namely position-restricted substring searching, indexing substrings with gaps, and indexing substrings with intervals. We achieve the following results.

• We give efficient reductions for each of the above problems to a new problem, which we call substring range reporting. Hence, we unify the previous work by showing that we may restrict our attention to a single problem rather than studying each of the above problems individually.

• We show how to solve substring range reporting with optimal query time and little space. Combined with our reductions this leads to significantly improved time-space trade-offs for all of the above problems.
For instance, we show how to solve position-restricted substring searching in $O(n \log^{\varepsilon} n)$ space and $O(m + \mathrm{occ})$ query time.

• We show that our techniques for substring range reporting generalize to substring range counting and substring range emptiness variants. We also obtain non-trivial time-space trade-offs for these problems.

Our bounds for substring range reporting are based on a novel combination of suffix trees and range reporting data structures. The reductions are simple and general and may apply to other combinations of string indexing with range reporting.

1.1 Substring Range Reporting

Let $S$ be a string where each position is associated with an integer value in the range $[0, u]$. The integer associated with position $i$ in $S$ is the label of position $i$, denoted $\mathrm{label}(i)$, and we call $S$ a labeled string. Given a labeled string $S$, the substring range reporting problem is to compactly represent $S$ while supporting substring range reporting queries, that is, given a string $P$ and a pair of integers $a$ and $b$, $0 \le a \le b \le u$, report all starting positions in $S$ that match $P$ and whose labels are in the range $[a, b]$.

We assume a standard unit-cost RAM model with word size $w$ and a standard instruction set including arithmetic operations, bitwise boolean operations, and shifts. We assume that a label can be stored in a constant number of words and therefore $w = \Theta(\log u)$. The space complexity is the number of words used by the algorithm. All bounds mentioned in this paper are valid in this model of computation.

To solve substring range reporting a basic approach is to combine a suffix tree with a 2D range reporting data structure. A query for a pattern $P$ and range $[a, b]$ consists of a search in the suffix tree and then a 2D range reporting query with $[a, b]$ and the lexicographic range of suffixes defined by $P$.
This is essentially the overall approach used in the known solutions for position-restricted substring searching [4, 8, 9, 21, 22, 31], which is a special case of substring range reporting (see the next section). Depending on the choice of the 2D range reporting data structure this approach leads to different trade-offs. In particular, if we plug in the 2D range reporting data structure of Alstrup et al. [2], we get a solution with $O(n \log^{\varepsilon} n)$ space and $O(m + \log\log u + \mathrm{occ})$ query time (see Mäkinen and Navarro [21, 22]). The $\log\log u$ term in the query time is from the range reporting query. Alternatively, if we use a fast data structure for the range successor problem [8, 31] to do the range reporting, we get optimal $O(m + \mathrm{occ})$ query time but increase the space to at least $\Omega(n^{1+\varepsilon})$. Indeed, since any 2D range reporting data structure with $O(n \log^{O(1)} n)$ space must use $\Omega(\log\log u)$ query time [26], we cannot hope to avoid this blowup in space with this approach.

Our first main contribution is a new and simple technique that overcomes the inherent problem of the previous approach. We show the following result.

Theorem 1 Let $S$ be a labeled string of length $n$ with labels in the range $[0, u]$. For any constants $\varepsilon, \delta > 0$, we can solve substring range reporting using $O(n(\log^{\varepsilon} n + \log\log u))$ space, $O(n(\log n + \log^{\delta} u))$ expected preprocessing time, and $O(m + \mathrm{occ})$ query time, for a pattern string of length $m$.

Compared to the previous results we achieve optimal query time with an additional $O(n \log\log u)$ term in the space. For the applications considered here, we have that $u = O(n)$ and therefore the space bound simplifies to $O(n(\log^{\varepsilon} n + \log\log u)) = O(n \log^{\varepsilon} n)$. Hence, in this case there is no asymptotic space overhead.
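The simplification of the space bound can be spelled out in one intermediate step. Assuming $u = O(n)$, as in the applications, we have $\log\log u = O(\log\log n)$, and $\log\log n = O(\log^{\varepsilon} n)$ for every constant $\varepsilon > 0$, so

\[
O\bigl(n(\log^{\varepsilon} n + \log\log u)\bigr) = O\bigl(n(\log^{\varepsilon} n + \log\log n)\bigr) = O(n \log^{\varepsilon} n).
\]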
The key idea to obtain Theorem 1 is a new and simple combination of suffix trees with multiple range reporting data structures for both 1 and 2 dimensions. Our solution handles queries differently depending on the length of the input pattern such that the overall query is optimized accordingly.

Interestingly, the idea of using different query algorithms depending on the length of the pattern is closely related to the concept of filtering search introduced for the standard range reporting problem by Chazelle as early as 1986 [6]. Our new results show that this idea is also useful in combinatorial pattern matching.

Finally, we also consider substring range counting and substring range emptiness variants. Here, the goal is to count the number of occurrences in the range and to determine whether or not the range is empty, respectively. Similar to substring range reporting, these problems can also be solved in a straightforward way by combining a suffix tree with a 2D range counting or emptiness data structure. We show how to extend our techniques to obtain improved time-space trade-offs for both of these problems.

1.2 Applications

Our second main contribution is to show that substring range reporting actually captures several other string indexing problems. In particular, we show how to reduce the following problems to substring range reporting.

• Position-restricted substring searching: Given a string $S$ of length $n$, construct a data structure supporting the following query: Given a string $P$ and query interval $[a, b]$, with $1 \le a \le b \le n$, return the positions of substrings in $S$ matching $P$ whose positions are in the interval $[a, b]$.

• Indexing substrings with intervals: Given a string $S$ of length $n$, and a set of intervals $\pi = \{[s_1, f_1], [s_2, f_2], \ldots, [s_{|\pi|}, f_{|\pi|}]\}$ such that $s_i, f_i \in [1, n]$ and $s_i \le f_i$, for all $1 \le i \le |\pi|$, construct a data structure supporting the following query: Given a string $P$ and query interval $[a, b]$, with $1 \le a \le b \le n$, return the positions of substrings in $S$ matching $P$ whose positions are in $[a, b]$ and in one of the intervals in $\pi$.

• Indexing substrings with gaps: Given a string $S$ of length $n$ and an integer $d$, the problem is to construct a data structure supporting the following query: Given two strings $P_1$ and $P_2$, return all positions of substrings in $S$ matching $P_1 \circ ?^d \circ P_2$. Here $\circ$ denotes concatenation and $?$ is a wildcard matching all characters in the alphabet.

Previous results. Let $m$ be the length of $P$. Mäkinen and Navarro [21, 22] introduced the position-restricted substring searching problem. Their fastest solution uses $O(n \log^{\varepsilon} n)$ space, $O(n \log n)$ expected preprocessing time, and $O(m + \log\log n + \mathrm{occ})$ query time. Crochemore et al. [8] proposed another solution using $O(n^{1+\varepsilon})$ space, $O(n^{1+\varepsilon})$ preprocessing time, and $O(m + \mathrm{occ})$ query time (see also Section 1.1). Using techniques from range non-overlapping indexing [7] it is possible to improve these bounds for small alphabet sizes [27]. Several succinct versions of the problem have also been proposed [4, 21, 22, 31]. All of these have significantly worse query time since they require superconstant time per reported occurrence. Finally, Crochemore et al. [10] studied a restricted version of the problem with $a = 1$ or $b = n$.

For the indexing substrings with intervals problem, Crochemore et al. [8, 9] gave a solution with $O(n \log^{2} n)$ space, $O(|\pi| + n \log^{3} n)$ expected preprocessing time, and $O(m + \log\log n + \mathrm{occ})$ query time. They also showed how to achieve $O(n^{1+\varepsilon})$ space, $O(n^{1+\varepsilon} + |\pi|)$ preprocessing time, and $O(m + \mathrm{occ})$ query time.
Several papers [3, 17, 20] have studied the property matching problem, which is similar to the indexing substrings with intervals problem, but where both the start and the end point of the match must be in the same interval.

Iliopoulos and Rahman [18] studied the problem of indexing substrings with gaps. They gave a solution using $O(n \log^{\varepsilon} n)$ space, $O(n \log n)$ expected preprocessing time, and $O(m + \log\log n + \mathrm{occ})$ query time, where $m$ is the total length of the two input strings. Crochemore and Tischler recently proposed a variant of the problem [11].

Our results. We reduce position-restricted substring searching, indexing substrings with intervals, and indexing substrings with gaps to substring range reporting. Applying Theorem 1 with our new reductions, we get the following result.

Theorem 2 Let $S$ be a string of length $n$ and let $m$ be the length of the query. For any constant $\varepsilon > 0$, we can solve

(i) Position-restricted substring searching using $O(n \log^{\varepsilon} n)$ space, $O(n \log n)$ expected preprocessing time, and $O(m + \mathrm{occ})$ query time.

(ii) Indexing substrings with intervals using $O(n \log^{\varepsilon} n)$ space, $O(|\pi| + n \log n)$ expected preprocessing time, and $O(m + \mathrm{occ})$ query time.

(iii) Indexing substrings with gaps using $O(n \log^{\varepsilon} n)$ space, $O(n \log n)$ expected preprocessing time, and $O(m + \mathrm{occ})$ query time ($m$ is the total size of the two input strings).

This improves the best known time-space trade-offs for all three problems, which all suffer from the trade-off inherent in 2D range reporting. The reductions are simple and general and may apply to other combinations of string indexing with range reporting.

2 Basic Concepts

2.1 Strings and Suffix Trees

Throughout the section we will let $S$ be a labeled string of length $|S| = n$ with labels in $[0, u]$. We denote the character at position $i$ by $S[i]$ and the substring from position $i$ to $j$ by $S[i, j]$.
The substrings $S[1, j]$ and $S[i, n]$ are the prefixes and suffixes of $S$, respectively. The reverse of $S$ is $S^R$. We denote the label of position $i$ by $\mathrm{label}_S(i)$. The order of suffix $S[i, n]$, denoted $\mathrm{order}_S(i)$, is the lexicographic order of $S[i, n]$ among the suffixes of $S$.

The suffix tree for $S$, denoted $T_S$, is the compacted trie storing all suffixes of $S$ [14]. The depth of a node $v$ in $T_S$ is the number of edges on the path from $v$ to the root. Each of the edges in $T_S$ is associated with some substring of $S$. The children of each node are sorted from left to right in increasing alphabetic order of the first character of the substring associated with the edge leading to them. The concatenation of substrings from the root to $v$ is denoted $\mathrm{str}_S(v)$. The string depth of $v$, denoted $\mathrm{strdepth}_S(v)$, is the length of $\mathrm{str}_S(v)$. The locus of a string $P$, denoted $\mathrm{locus}_S(P)$, is the minimum depth node $v$ such that $P$ is a prefix of $\mathrm{str}_S(v)$. If $P$ is not a prefix of a substring in $S$ we define $\mathrm{locus}_S(P)$ to be $\bot$.

Each leaf $\ell$ in $T_S$ uniquely corresponds to a suffix in $S$, namely, the suffix $\mathrm{str}_S(\ell)$. Hence, we will use $\mathrm{label}_S(\ell)$ and $\mathrm{order}_S(\ell)$ to refer to the label and order of the corresponding suffix. For an internal node $v$ we extend the notation such that

\[
\mathrm{label}_S(v) = \{\mathrm{label}_S(\ell) \mid \ell \text{ is a descendant leaf of } v\},
\]
\[
\mathrm{order}_S(v) = \{\mathrm{order}_S(\ell) \mid \ell \text{ is a descendant leaf of } v\}.
\]

Since the children of a node are sorted, the left to right order of the leaves in $T_S$ corresponds to the lexicographic order of the suffixes of $S$. Hence, for any node $v$, $\mathrm{order}_S(v)$ is an interval. We denote the left and right endpoints of this interval by $l_v$ and $r_v$. When the underlying string $S$ is clear from the context we will often drop the subscript $S$ for brevity.
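The notions of order and of the interval $[l_v, r_v]$ can be illustrated with a minimal Python sketch. It is not a suffix tree: it assumes a naively sorted list of suffixes as a stand-in, uses 1-based positions as in the text, and omits any terminator character. The function names `suffix_orders` and `order_interval` and the example string are chosen here for illustration only.

```python
def suffix_orders(s):
    """Return (sa, order) for string s, using 1-based positions.

    sa lists the suffix start positions in lexicographic order of the suffixes;
    order[i] is the 1-based lexicographic rank of the suffix starting at i.
    Naive sorting is used for clarity, not efficiency.
    """
    n = len(s)
    sa = sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    order = {i: rank for rank, i in enumerate(sa, start=1)}
    return sa, order


def order_interval(s, sa, p):
    """Return (l, r), the ranks of all suffixes of s that start with p, or None.

    This plays the role of [l_v, r_v] for v = locus(p).
    """
    ranks = [rank for rank, i in enumerate(sa, start=1) if s.startswith(p, i - 1)]
    if not ranks:
        return None  # corresponds to locus(p) being undefined
    l, r = ranks[0], ranks[-1]
    # Suffixes sharing the prefix p are consecutive in sorted order.
    assert ranks == list(range(l, r + 1))
    return l, r
```

For the string `acabcababa`, the two suffixes starting with `ba` (positions 9 and 7) receive the consecutive orders 6 and 7, so `order_interval` returns `(6, 7)`.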
The suffix tree for $S$ uses $O(n)$ space and can be constructed in $O(\mathrm{sort}(n))$ time, where $\mathrm{sort}(n)$ is the time for sorting $n$ values in the model of computation [12]. We only need a standard comparison-based $O(n \log n)$ suffix tree construction in our results.

Let $P$ be a string of length $m$. If $\mathrm{locus}_S(P) = \bot$ then $P$ does not occur as a substring in $S$. Otherwise, the substrings in $S$ that match $P$ are the suffixes in $\mathrm{order}_S(\mathrm{locus}_S(P))$. Hence, we can compute all occurrences of $P$ in $S$ by traversing the suffix tree from the root to $\mathrm{locus}_S(P)$ and then reporting all suffixes stored in the subtree. Using perfect hashing [13] to represent the outgoing edges of each node in $T_S$ we achieve an $O(n)$ space solution to string indexing that supports queries in $O(m + \mathrm{occ})$ time (here $\mathrm{occ}$ is the total number of occurrences of $P$ in $S$).

2.2 Range Reporting

Let $X \subseteq \{0, \ldots, u\}^d$ be a set of points in a $d$-dimensional grid. The range reporting problem in $d$ dimensions is to compactly represent $X$ while supporting range reporting queries, that is, given a rectangle $R = [a_1, b_1] \times \cdots \times [a_d, b_d]$, report all points in the set $R \cap X$. We use the following results for range reporting in 1 and 2 dimensions.

Lemma 1 (Alstrup et al. [1], Mortensen et al. [24]) For a set of $n$ points in $[0, u]$ and any constant $\gamma > 0$, we can solve 1D range reporting using $O(n)$ space, $O(n \log^{\gamma} u)$ expected preprocessing time, and $O(1 + \mathrm{occ})$ query time.

Lemma 2 (Alstrup et al. [2]) For a set of $n$ points in $[0, u] \times [0, u]$ and any constant $\varepsilon > 0$, we can solve 2D range reporting using $O(n \log^{\varepsilon} n)$ space, $O(n \log n)$ expected preprocessing time, and $O(\log\log u + \mathrm{occ})$ query time.

3 Substring Range Reporting

We now show Theorem 1. Recall that $S$ is a labeled string of length $n$ with labels from $[0, u]$.
3.1 The Data Structure

Our substring range reporting data structure consists of the following components.

• The suffix tree $T_S$ for $S$. For each node $v$ in $T_S$ we also store $l_v$ and $r_v$. We partition $T_S$ into a top tree and a number of bottom trees. The top tree consists of all nodes in $T_S$ whose string depth is at most $\log\log u$ and all their children. The trees induced by the remaining nodes are the forest of bottom trees.

• A 2D range reporting data structure on the set of points $\{(\mathrm{order}_S(i), \mathrm{label}_S(i)) \mid i \in \{1, \ldots, n\}\}$.

• For each node $v$ in the top tree, a 1D range reporting data structure on the set $\{\mathrm{label}_S(i) \mid i \in \mathrm{order}_S(v)\}$.

We analyze the space and preprocessing time for the data structure. We use the range reporting data structures from Lemmas 1 and 2. The space for the suffix tree is $O(n)$ and the space for the 2D range reporting data structure is $O(n \log^{\varepsilon} n)$, for any constant $\varepsilon > 0$. We bound the space for the (potentially $\Omega(n)$) 1D range reporting data structures stored for the top tree. Let $V_d$ be the set of nodes in the top tree with depth $d$. Since the sets $\mathrm{order}_S(v)$, $v \in V_d$, partition the set of descendant leaves of nodes in $V_d$, the total size of these sets is at most $n$. Hence, the total size of the 1D range reporting data structures for the nodes in $V_d$ is $O(n)$. Since there are at most $\log\log u + 1$ levels in the top tree, the space for all 1D range reporting data structures is $O(n \log\log u)$. Hence, the total space for the data structure is $O(n(\log^{\varepsilon} n + \log\log u))$.

We can construct the suffix tree in $O(\mathrm{sort}(n))$ time and the 2D range reporting data structure in $O(n \log n)$ expected time. For any constant $\gamma > 0$, the expected preprocessing time for all 1D range reporting data structures is

\[
O\Bigl(\sum_{v \text{ in top tree}} |\mathrm{order}_S(v)| \log^{\gamma} u\Bigr) = O(n \log\log u \, \log^{\gamma} u) = O(n \log^{2\gamma} u).
\]
Setting $\delta = 2\gamma$ we use $O(n(\log n + \log^{\delta} u))$ expected preprocessing time in total.

3.2 Substring Range Queries

Let $P$ be a string of length $m$, and let $a$ and $b$ be a pair of integers, $0 \le a \le b \le u$. To answer a substring range query we want to compute the set of starting positions for $P$ whose labels are in $[a, b]$. First, we compute the node $v = \mathrm{locus}_S(P)$. If $v = \bot$ then $P$ is not a substring of $S$, and we return the empty set. Otherwise, we compute the set of descendant leaves of $v$ with labels in $[a, b]$. There are two cases to consider.

(i) If $v$ is in the top tree we query the 1D range reporting data structure for $v$ with the interval $[a, b]$.

(ii) If $v$ is in a bottom tree, we query the 2D range reporting data structure with the rectangle $[l_v, r_v] \times [a, b]$.

Given the points returned by the range reporting data structures, we output the starting positions of the corresponding suffixes. From the definition of the data structure it follows that these are precisely the occurrences of $P$ within the range $[a, b]$.

Next consider the time complexity. We find $\mathrm{locus}_S(P)$ in $O(m)$ time (see Section 2). In case (i) we use $O(1 + \mathrm{occ})$ time to compute the result by Lemma 1. Hence, the total time for a substring range query for case (i) is $O(m + \mathrm{occ})$. In case (ii) we use $O(\log\log u + \mathrm{occ})$ time to compute the result by Lemma 2. We have that $v = \mathrm{locus}_S(P)$ is in a bottom tree and therefore $m \ge \mathrm{strdepth}(\mathrm{parent}(v)) > \log\log u$. Hence, the total time to answer a substring range query in case (ii) is $O(m + \log\log u + \mathrm{occ}) = O(m + \mathrm{occ})$. Thus, in both cases we use $O(m + \mathrm{occ})$ time.

Summing up, our solution uses $O(n(\log^{\varepsilon} n + \log\log u))$ space, $O(n(\log n + \log^{\delta} u))$ expected preprocessing time, and $O(m + \mathrm{occ})$ query time. This completes the proof of Theorem 1.
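The case split of Sections 3.1 and 3.2 can be sketched in Python as follows. This is a toy model under several assumptions, not the data structure of Theorem 1: a naively sorted suffix array stands in for the suffix tree, the top tree is modeled by a dictionary keyed by every substring of length at most `cutoff` (which plays the role of $\log\log u$), sorted lists with binary search stand in for the 1D structures of Lemma 1, and a linear scan over the suffix array interval stands in for the 2D structure of Lemma 2. The sketch therefore reproduces the answers and the dispatch on pattern length, but not the stated time and space bounds. Positions are 0-based, and the class name `SubstringRangeIndex` and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class SubstringRangeIndex:
    """Toy substring range reporting index (0-based positions)."""

    def __init__(self, s, labels, cutoff):
        assert len(s) == len(labels)
        self.s, self.labels, self.cutoff = s, labels, cutoff
        n = len(s)
        # Stand-in for the suffix tree: suffix array built by naive sorting.
        self.sa = sorted(range(n), key=lambda i: s[i:])
        # Stand-in for the top tree: for every substring of length <= cutoff,
        # keep its occurrences as (label, position) pairs sorted by label.
        top = defaultdict(list)
        for i in range(n):
            for k in range(1, min(cutoff, n - i) + 1):
                top[s[i:i + k]].append((labels[i], i))
        self.top = {p: sorted(pairs) for p, pairs in top.items()}

    def _sa_interval(self, p):
        """Half-open suffix array interval of suffixes having p as a prefix."""
        m, sa, s = len(p), self.sa, self.s
        lo, hi = 0, len(sa)
        while lo < hi:  # first suffix whose length-m prefix is >= p
            mid = (lo + hi) // 2
            if s[sa[mid]:sa[mid] + m] < p:
                lo = mid + 1
            else:
                hi = mid
        left, hi = lo, len(sa)
        while lo < hi:  # first suffix whose length-m prefix is > p
            mid = (lo + hi) // 2
            if s[sa[mid]:sa[mid] + m] <= p:
                lo = mid + 1
            else:
                hi = mid
        return left, lo

    def report(self, p, a, b):
        """Sorted start positions of p in s whose labels lie in [a, b]."""
        if len(p) <= self.cutoff:
            # Case (i): short pattern, 1D query on the labels stored for p.
            pairs = self.top.get(p, [])
            lo = bisect_left(pairs, (a, -1))
            hi = bisect_right(pairs, (b, len(self.s)))
            return sorted(pos for _, pos in pairs[lo:hi])
        # Case (ii): long pattern, query on [l_v, r_v] x [a, b];
        # a scan replaces the 2D range reporting structure here.
        l, r = self._sa_interval(p)
        return sorted(i for i in self.sa[l:r] if a <= self.labels[i] <= b)
```

With `cutoff=2`, a query such as `report("ba", 0, 9)` is answered from the dictionary alone, while `report("bac", 0, 9)` goes through the suffix array interval.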
4 Applications

In this section we show how to improve the results for the three problems position-restricted substring searching, indexing substrings with intervals, and indexing substrings with gaps, using our data structure for substring range reporting. Let $\mathrm{report}_S(P, a, b)$ denote a substring range reporting query on string $S$ with parameters $P$, $a$, and $b$.

4.1 Position-Restricted Substring Searching

We can reduce position-restricted substring searching to substring range reporting by setting $\mathrm{label}(i) = i$ for all $i = 1, \ldots, n$. To answer a query we return the result of the substring range query $\mathrm{report}_S(P, a, b)$. Since each label is equal to the position, it follows that the solution to the substring range reporting instance immediately gives a solution to position-restricted substring searching. Applying Theorem 1 with $u = n$, this proves Theorem 2(i).

4.2 Indexing Substrings with Intervals

We can reduce indexing substrings with intervals to substring range reporting by setting

\[
\mathrm{label}(i) =
\begin{cases}
i & \text{if } i \in \varphi \text{ for some } \varphi \in \pi, \\
0 & \text{otherwise.}
\end{cases}
\]

To answer a query we return the result of the substring range reporting query $\mathrm{report}_S(P, a, b)$.

Let $I$ be the solution to the indexing substrings with intervals instance and let $I'$ be the solution to the substring range reporting instance derived by the above reduction. Then $i \in I \Leftrightarrow i \in I'$. To prove this assume $i \in I$. Then $i \in \varphi$ for some $\varphi \in \pi$ and $i \in [a, b]$. From $i \in \varphi$ and the definition of $\mathrm{label}(i)$ it follows that $\mathrm{label}(i) = i$. Thus, $\mathrm{label}(i) = i \in [a, b]$ and thus $i \in I'$. Assume $i \in I'$. Then $\mathrm{label}(i) \in [a, b]$. Since $a > 0$ also $\mathrm{label}(i) > 0$, and it follows that $\mathrm{label}(i) = i$. By the reduction this means that $i \in \varphi$ for some $\varphi \in \pi$. Since $i = \mathrm{label}(i)$, we have $i \in [a, b]$ and therefore $i \in I$.
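The two labelings above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `report_naive` is a brute-force stand-in for $\mathrm{report}_S(P, a, b)$, not the data structure of Theorem 1, and positions are 1-based as in the text. The function names are hypothetical, and the difference array used to mark covered positions is a choice made here for brevity.

```python
def report_naive(s, labels, p, a, b):
    """Brute-force report_S(p, a, b): 1-based start positions i of p in s
    with a <= label(i) <= b, where labels[i - 1] holds label(i)."""
    return [i for i in range(1, len(s) - len(p) + 2)
            if s.startswith(p, i - 1) and a <= labels[i - 1] <= b]


def position_labels(n):
    """Labeling of Section 4.1: label(i) = i for i = 1, ..., n."""
    return list(range(1, n + 1))


def interval_labels(n, intervals):
    """Labeling of Section 4.2: label(i) = i if some interval [start, end]
    in intervals contains i, and 0 otherwise (1-based, inclusive)."""
    diff = [0] * (n + 2)          # difference array over positions 1..n
    for start, end in intervals:
        diff[start] += 1
        diff[end + 1] -= 1
    labels, covered = [], 0
    for i in range(1, n + 1):
        covered += diff[i]        # number of intervals containing i
        labels.append(i if covered > 0 else 0)
    return labels
```

For example, with `s = "ababacbaca"` and intervals `[(1, 2), (6, 8)]`, the pattern `ba` occurs at positions 2, 4, and 7, and the query `report_naive(s, interval_labels(10, [(1, 2), (6, 8)]), "ba", 1, 10)` keeps only positions 2 and 7, since position 4 lies in no interval and has label 0.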
We can construct the labeling in $O(n + |\pi|)$ time if the intervals are sorted by start point or end point. Otherwise additional time for sorting is needed. A similar approach is used in the solution by Crochemore et al. [8]. Applying Theorem 1 with $u = n$, this proves Theorem 2(ii).

4.3 Indexing Substrings with Gaps

We can reduce the indexing substrings with gaps problem to substring range reporting as follows. Construct the suffix tree of the reverse of $S$, i.e., the suffix tree $T_{S^R}$ for $S^R$. For each node $v$ in $T_{S^R}$ also store $l_v$ and $r_v$. Set

\[
\mathrm{label}_S(i) =
\begin{cases}
\mathrm{order}_{S^R}(n - i + d + 2) & \text{for } i \ge d + 2, \\
0 & \text{otherwise.}
\end{cases}
\]

To answer a query, find the locus node $v$ of $P_1^R$ in $T_{S^R}$. Then use the substring range reporting data structure to return all positions of substrings in $S$ matching $P_2$ whose labels are in the range $[l_v, r_v]$. For each position $i$ returned by $\mathrm{report}_S(P_2, l_v, r_v)$, return $i - |P_1| - d$. See Fig. 1 for an example.

[Figure 1: A string $S$, the labeling for $d = 2$ (below the string), and the suffix tree $T_{S^R}$. Given a query $P_1 = ab$ and $P_2 = bac$ we find $v = \mathrm{locus}_{S^R}(ba)$ (marked in the suffix tree). We have $l_v = 6$ and $r_v = 7$ from the left-to-right order in $T_{S^R}$. The substring range reporting query $\mathrm{report}_S(P_2, 6, 7)$ returns 7. Hence, we report the occurrence at position $7 - 2 - 2 = 3$.]

Correctness of the reduction. We will now show that the reduction is correct. Let $I$ be the solution to the indexing substrings with gaps instance and let $I'$ be the solution to the substring range reporting instance derived by the above reduction. We will show $i \in I \Leftrightarrow i \in I'$. Let $m_i = |P_i|$ for $i = 1, 2$.

If $i \in I$ then there is an occurrence of $P_1$ at position $i$ in $S$ and an occurrence of $P_2$ at position $i' = i + m_1 + d$ in $S$.
It follows directly that there is an occurrence of $P_1^R$ at position $i'' = n - (i + m_1) + 2$ in $S^R$. By definition,

\[
\mathrm{label}_S(i') = \mathrm{label}_S(i + m_1 + d) = \mathrm{order}_{S^R}(n - (i + m_1 + d) + d + 2) = \mathrm{order}_{S^R}(i''),
\]

where the second equality follows from the fact that $i + m_1 + d \ge d + 2$. Since there is an occurrence of $P_1^R$ at position $i''$ in $S^R$, we have

\[
\mathrm{label}_S(i') = \mathrm{order}_{S^R}(i'') \in \mathrm{order}_{S^R}(\mathrm{locus}_{S^R}(P_1^R)).
\]

Thus, $\mathrm{label}_S(i') \in [l_v, r_v]$, and since there is an occurrence of $P_2$ at position $i'$ in $S$, we have $i' - m_1 - d = i \in I'$.

If $i \in I'$ then there is an occurrence of $P_2$ at position $i' = i + m_1 + d$ with $\mathrm{label}(i')$ in the range $[l_v, r_v]$, where $v = \mathrm{locus}_{S^R}(P_1^R)$. We need to show that this implies that there is an occurrence of $P_1$ at position $i$ in $S$. By definition,

\[
\mathrm{label}_S(i') = \mathrm{order}_{S^R}(n - i' + d + 2) = \mathrm{order}_{S^R}(n - i - m_1 + 2).
\]

Let $i'' = n - i - m_1 + 2$. Since $\mathrm{order}_{S^R}(i'') = \mathrm{label}_S(i') \in [l_v, r_v]$, there is an occurrence of $P_1^R$ at position $i''$ in $S^R$. It follows directly that there is an occurrence of $P_1$ at position

\[
n - i'' - m_1 + 2 = n - (n - i - m_1 + 2) - m_1 + 2 = i
\]

in $S$. Therefore, $i \in I$.

Complexity. Construction of the suffix tree $T_{S^R}$ takes time $O(n \log n)$ and the labeling can be constructed in time $O(n)$. Both use space $O(n)$. It takes $O(m_1)$ time to find the locus node of $P_1^R$ in $T_{S^R}$. The substring range reporting query takes time $O(m_2 + \mathrm{occ})$. Thus the total query time is $O(m + \mathrm{occ})$. Applying Theorem 1 with $u = n$, this completes the proof of Theorem 2(iii).

5 Substring Range Counting and Emptiness

We now show how to apply our techniques to substring range counting and substring range emptiness. Analogous to substring range reporting, the goal here is to count the number of occurrences in the range and to determine whether or not the range is empty, respectively.
A straightforward way to solve these problems is to combine a suffix tree with a 2D range counting data structure and a 2D range emptiness data structure, respectively. Using the techniques from Section 3 we show how to significantly improve the bounds of this approach in both cases. We note that by the reductions in Section 4 the bounds for substring range counting and substring range emptiness also immediately imply results for counting and emptiness versions of position-restricted substring searching, indexing substrings with intervals, and indexing substrings with gaps.

5.1 Preliminaries

Let $X \subseteq \{0, \ldots, u\}^d$ be a set of points in a $d$-dimensional grid. Given a query rectangle $R = [a_1, b_1] \times \cdots \times [a_d, b_d]$, a range counting query computes $|R \cap X|$, and a range emptiness query determines whether $R \cap X = \emptyset$. Given $X$, the range counting problem and the range emptiness problem are to compactly represent $X$ while supporting range counting queries and range emptiness queries, respectively. Note that any solution for range reporting or range counting implies a solution for range emptiness with the same complexity (ignoring the $\mathrm{occ}$ term for range reporting queries). We will need the following additional geometric data structures.

Lemma 3 (JáJá et al. [19]) For a set of $n$ points in $[0, u] \times [0, u]$ we can solve 2D range counting in $O(n)$ space, $O(n \log n)$ preprocessing time, and $O(\log n / \log\log n + \log\log u)$ query time.

Lemma 4 (van Emde Boas et al. [29, 30], Mehlhorn and Näher [23]) For a set of $n$ points in $[0, u]$ we can solve 1D range counting in $O(n)$ space, $O(n \log\log n)$ preprocessing time, and $O(\log\log u)$ query time.

To achieve the result of Lemma 4 we use a van Emde Boas data structure [29, 30] implemented in linear space [23] using perfect hashing. This data structure supports predecessor queries in $O(\log\log u)$ time.
By also storing for each point its rank in the sorted order of the points, we can compute a range counting query by two predecessor queries. To build the data structure efficiently we need to sort the points and build suitable perfect hash tables. We can sort deterministically in $O(n \log\log n)$ time [16], and we can build the needed hash tables in $O(n)$ time using deterministic hashing [15] combined with a standard two-level approach (see e.g., Thorup [28]).

Lemma 5 (Chan et al. [5]) For a set of $n$ points in $[0, u] \times [0, u]$ we can solve 2D range emptiness in $O(n \log\log n)$ space, $O(n \log n)$ preprocessing time, and $O(\log\log u)$ query time.

5.2 The Data Structures

We now show how to efficiently solve substring range counting and substring range emptiness. Recall that $S$ is a labeled string of length $n$ with labels from $[0, u]$.

We can directly solve substring range counting by combining a suffix tree with the 2D range counting result from Lemma 3. This leads to a solution using $O(n)$ space and $O(m + \log n / \log\log n + \log\log u)$ query time. We show how to improve the query time to $O(m + \log\log u)$ at the cost of increasing the space to $O(n \log n / \log\log n)$. Hence, we remove the $\log n / \log\log n$ term from the query time at the cost of increasing the space by a $\log n / \log\log n$ factor. We cannot hope to achieve such a bound using a suffix tree combined with a 2D range counting data structure since any 2D range counting data structure using $O(n \log^{O(1)} n)$ space requires $\Omega(\log n / \log\log n)$ query time [25].

We can also directly solve substring range emptiness by combining a suffix tree with the 2D range emptiness result from Lemma 5. This solution uses $O(n \log\log n)$ space and $O(m + \log\log u)$ query time. We show how to achieve optimal $O(m)$ query time with space $O(n \log\log u)$.

Our data structure for substring range counting and emptiness follows the construction in Section 3.
We partition the suffix tree into a top tree and a number of bottom trees and store a 1D data structure for each node in the top tree and a single 2D data structure. To answer a query for a pattern string $P$ of length $m$, we search the suffix tree with $P$ and use the 1D data structure if the search ends in the top tree and otherwise use the 2D data structure. We describe the specific details for each problem.

First we consider substring range counting. In this case the top tree consists of all nodes of string depth at most $\log n / \log\log n$. The 1D and 2D data structures used are the ones from Lemmas 4 and 3, respectively. By the same arguments as in Section 3 the total space used for the 1D data structures for all nodes in the top tree at depth $d$ is at most $O(n)$ and hence the total space for all 1D data structures is $O(n \log n / \log\log n)$. Since the 2D data structure uses $O(n)$ space, the total space is $O(n \log n / \log\log n)$. The time to build all 1D data structures is $O(n (\log n / \log\log n) \cdot \log\log n) = O(n \log n)$. Since the suffix tree and the 2D data structure can be built within the same bound, the total preprocessing time is $O(n \log n)$. Given a pattern of length $m$, a query uses $O(m + \log\log u)$ time if the search ends in the top tree, and $O(m + \log n / \log\log n + \log\log u)$ time if the search ends in a bottom tree. Since bottom trees consist of nodes of string depth more than $\log n / \log\log n$, the time to answer a query in both cases is $O(m + \log\log u)$. In summary, we have the following result.

Theorem 3 Let $S$ be a labeled string of length $n$ with labels in the range $[0, u]$. We can solve substring range counting using $O(n \log n / \log\log n)$ space, $O(n \log n)$ preprocessing time, and $O(m + \log\log u)$ query time, for a pattern string of length $m$.

Next we consider substring range emptiness. In this case the top tree consists of all nodes of string depth at most $\log\log u$.
We use the 1D and 2D data structures from Lemma 1 and Lemma 5. The total space for all 1D data structures is $O(n \log\log u)$. Since the 2D data structure uses $O(n \log\log n)$ space, the total space is $O(n \log\log u)$. For any constant $\gamma > 0$, the expected time to build all 1D data structures is $O(n \log\log u \log^{\gamma} u) = O(n \log^{\delta} u)$ for a suitable constant $\delta > 0$. The suffix tree and the 2D data structure can be built in $O(n \log n)$ time, and hence the total expected preprocessing time is $O(n(\log n + \log^{\delta} u))$. If the search for a pattern string ends in the top tree the query time is $O(m)$, and if the search ends in a bottom tree the query time is $O(m + \log\log u)$. As above, the partition into top and bottom trees ensures that the query time in both cases is $O(m)$. In summary, we have the following result.

Theorem 4 Let $S$ be a labeled string of length $n$ with labels in the range $[0,u]$. For any constant $\delta > 0$ we can solve substring range emptiness using $O(n \log\log u)$ space, $O(n(\log n + \log^{\delta} u))$ expected preprocessing time, and $O(m)$ query time, for a pattern string of length $m$.

6 Acknowledgments

We thank Christian Worm Mortensen and Kasper Green Larsen for clarifications on the preprocessing times for the results in Lemma 3 and Lemma 5.

References

[1] S. Alstrup, G. Brodal, and T. Rauhe. Optimal static range reporting in one dimension. In Proc. 33rd STOC, pages 476–482, 2001.
[2] S. Alstrup, G. Stølting Brodal, and T. Rauhe. New data structures for orthogonal range searching. In Proc. 41st FOCS, pages 198–207, 2000.
[3] A. Amir, E. Chencinski, C. S. Iliopoulos, T. Kopelowitz, and H. Zhang. Property matching and weighted matching. Theoret. Comput. Sci., 395(2-3):298–310, 2008.
[4] P. Bose, M. He, A. Maheshwari, and P. Morin. Succinct orthogonal range search structures on a grid with applications to text indexing. In Proc. 11th WADS, pages 98–109, 2009.
[5] T. M. Chan, K. Larsen, and M. Pătrașcu. Orthogonal range searching on the RAM, revisited. In Proc. 27th SoCG, pages 354–363, 2011.
[6] B. Chazelle. Filtering search: A new approach to query-answering. SIAM J. Comput., 15(3):703–724, 1986.
[7] H. Cohen and E. Porat. Range non-overlapping indexing. In Proc. 20th ISAAC, pages 1044–1053, 2009.
[8] M. Crochemore, C. S. Iliopoulos, M. Kubica, M. S. Rahman, and T. Walen. Improved algorithms for the range next value problem and applications. In Proc. 25th STACS, pages 205–216, 2008.
[9] M. Crochemore, C. S. Iliopoulos, M. Kubica, M. S. Rahman, and T. Walen. Finding patterns in given intervals. Fundam. Inform., 101(3):173–186, 2010.
[10] M. Crochemore, C. S. Iliopoulos, and M. S. Rahman. Optimal prefix and suffix queries on texts. Inf. Process. Lett., 108(5):320–325, 2008.
[11] M. Crochemore and G. Tischler. The gapped suffix array: A new index structure. In Proc. 17th SPIRE, pages 359–364, 2010.
[12] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. J. ACM, 47(6):987–1011, 2000.
[13] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31:538–544, 1984.
[14] D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge, 1997.
[15] T. Hagerup, P. B. Miltersen, and R. Pagh. Deterministic dictionaries. J. Algorithms, 41(1):69–85, 2001.
[16] Y. Han. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms, 50(1):96–105, 2004.
[17] C. S. Iliopoulos and M. S. Rahman. Faster index for property matching. Inf. Process. Lett., 105(6):218–223, 2008.
[18] C. S. Iliopoulos and M. S. Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009.
[19] J. JáJá, C. W. Mortensen, and Q. Shi. Space-efficient and fast algorithms for multidimensional dominance reporting and counting. In Proc. 15th ISAAC, pages 558–568, 2004.
[20] M. Juan, J. Liu, and Y. Wang. Errata for "Faster index for property matching". Inf. Process. Lett., 109(18):1027–1029, 2009.
[21] V. Mäkinen and G. Navarro. Position-restricted substring searching. In Proc. 7th LATIN, pages 703–714, 2006.
[22] V. Mäkinen and G. Navarro. Rank and select revisited and extended. Theoret. Comput. Sci., 387(3):332–347, 2007.
[23] K. Mehlhorn and S. Näher. Bounded ordered dictionaries in O(log log N) time and O(n) space. Inf. Process. Lett., 35(4):183–189, 1990.
[24] C. W. Mortensen, R. Pagh, and M. Pătrașcu. On dynamic range reporting in one dimension. In Proc. 37th STOC, pages 104–111, 2005.
[25] M. Pătrașcu. Lower bounds for 2-dimensional range counting. In Proc. 39th STOC, pages 40–46, 2007.
[26] M. Pătrașcu and M. Thorup. Time-space trade-offs for predecessor search. In Proc. 38th STOC, pages 232–240, 2006.
[27] E. Porat, 2011. Personal communication.
[28] M. Thorup. Space efficient dynamic stabbing with fast queries. In Proc. 35th STOC, pages 649–658, 2003.
[29] P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Inf. Process. Lett., 6(3):80–82, 1977.
[30] P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99–127, 1977. Announced at FOCS 1975.
[31] C.-C. Yu, W.-K. Hon, and B.-F. Wang. Improved data structures for the orthogonal range successor problem. Comput. Geometry, 44(3):148–159, 2011.
