String Matching with Variable Length Gaps

Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, David Kofoed Wind

October 30, 2018

Abstract

We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern.

1 Introduction

Given integers $a$ and $b$, $0 \le a \le b$, a variable length gap $g\{a, b\}$ is an arbitrary string over $\Sigma$ of length between $a$ and $b$, both inclusive. A variable length gap pattern (abbreviated VLG pattern) $P$ is the concatenation of a sequence of strings and variable length gaps, that is, $P$ is of the form

$$P = P_1 \cdot g\{a_1, b_1\} \cdot P_2 \cdot g\{a_2, b_2\} \cdots g\{a_{k-1}, b_{k-1}\} \cdot P_k.$$

A VLG pattern $P$ matches a substring $S$ of $T$ iff $S = P_1 \cdot G_1 \cdots G_{k-1} \cdot P_k$, where $G_i$ is any string of length between $a_i$ and $b_i$, $i = 1, \ldots, k-1$.
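The matching definition can be checked directly by brute force. The following Python sketch is only an illustration of the definition, not the algorithm of this paper; the function name `vlg_end_positions` and the encoding of a pattern as a list of strings plus a list of `(a, b)` gap bounds are assumptions made for the example. The instance used is the one from Example 1.

```python
def vlg_end_positions(text, strings, gaps):
    """Return the 1-based end positions of substrings of `text` matching the VLG pattern.

    `strings` is [P_1, ..., P_k]; `gaps` is [(a_1, b_1), ..., (a_{k-1}, b_{k-1})].
    Brute force, straight from the definition (illustrative only).
    """
    first = strings[0]
    # 1-based end positions of matches of the pattern prefix P_1.
    reach = {s + len(first) for s in range(len(text)) if text.startswith(first, s)}
    for (a, b), p in zip(gaps, strings[1:]):
        extended = set()
        for end in reach:
            for gap_len in range(a, b + 1):
                start = end + gap_len  # 0-based start index of the next string
                if text.startswith(p, start):
                    extended.add(start + len(p))
        reach = extended
    return sorted(reach)


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(vlg_end_positions(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))  # [17, 28, 31]
```

The running time of this sketch grows with the gap widths, which is what the algorithms discussed in the paper avoid.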
Given a string $T$ and a VLG pattern $P$, the variable length gap problem (VLG problem) is to find all ending positions of substrings in $T$ that match $P$.

Example 1 As an example, consider the problem instance over the alphabet $\Sigma = \{A, G, C, T\}$:

$$T = \mathtt{ATCGGCTCCAGACCAGTACCCGTTCCGTGGT}$$
$$P = \mathtt{A} \cdot g\{6, 7\} \cdot \mathtt{CC} \cdot g\{2, 6\} \cdot \mathtt{GT}$$

The solution to the problem instance is the set of positions $\{17, 28, 31\}$. For example, the solution contains 17, since the substring ATCGGCTCCAGACCAGT, ending at position 17 in $T$, matches $P$.

∗ An extended abstract of this paper appeared in the proceedings of the 17th Symposium on String Processing and Information Retrieval.

Variable length gaps are frequently used in computational biology applications [7, 8, 14, 16, 17]. For instance, the PROSITE database [5, 10] supports searching for proteins specified by VLG patterns.

1.1 Previous Work

We briefly review the main worst-case bounds for the VLG problem. As above, let $P = P_1 \cdot g\{a_1, b_1\} \cdot P_2 \cdot g\{a_2, b_2\} \cdots g\{a_{k-1}, b_{k-1}\} \cdot P_k$ be a VLG pattern consisting of $k$ strings, and let $T$ be a string. To state the bounds, let $m = \sum_{i=1}^{k} |P_i|$ be the sum of the lengths of the strings in $P$ and let $n$ be the length of $T$.

The simplest approach to solve the VLG problem is to translate $P$ into a regular expression and then use an algorithm for regular expression matching. Unfortunately, the translation produces a regular expression significantly longer than $P$, resulting in an inefficient algorithm. Specifically, suppose that the alphabet $\Sigma$ contains $\sigma$ characters, that is, $\Sigma = \{c_1, \ldots, c_\sigma\}$. Using standard regular expression operators (union and concatenation), we can translate $g\{a, b\}$ into the expression

$$g\{a, b\} = \overbrace{C \cdots C}^{a} \; \overbrace{(C \mid \epsilon) \cdots (C \mid \epsilon)}^{b - a},$$

where $C$ is shorthand for the expression $(c_1 \mid c_2 \mid \ldots \mid c_\sigma)$.
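The translation above can be sketched with Python's `re` module, using only union and concatenation for the gaps. This is an illustration under assumptions: the helper names `gap_to_regex`, `vlg_to_regex` and `end_positions` are hypothetical, and Python's backtracking matcher stands in for the regular expression matching algorithms cited in this section.

```python
import re


def gap_to_regex(a, b, alphabet):
    """Translate g{a,b} into C...C (C|eps)...(C|eps) using union and concatenation only."""
    any_char = "(?:" + "|".join(re.escape(c) for c in alphabet) + ")"
    optional = "(?:" + "|".join(re.escape(c) for c in alphabet) + "|)"  # C | epsilon
    return any_char * a + optional * (b - a)


def vlg_to_regex(strings, gaps, alphabet):
    """Concatenate the strings of the pattern with the translated gaps."""
    parts = [re.escape(strings[0])]
    for (a, b), p in zip(gaps, strings[1:]):
        parts.append(gap_to_regex(a, b, alphabet))
        parts.append(re.escape(p))
    return "".join(parts)


def end_positions(text, regex):
    """1-based positions e such that some substring of text ending at e matches regex."""
    anchored = re.compile("(?:" + regex + r")\Z")
    return [e for e in range(1, len(text) + 1) if anchored.search(text[:e])]


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
rx = vlg_to_regex(["A", "CC", "GT"], [(6, 7), (2, 6)], "ACGT")
print(end_positions(T, rx))  # [17, 28, 31]
print(len(gap_to_regex(6, 7, "ACGT")))  # 78
```

The printed length shows the effect discussed in the text: every position of a gap contributes a full copy of the alphabet expression.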
Hence, a variable length gap $g\{a, b\}$, represented by a constant length expression in $P$, is translated into a regular expression of length $\Omega(\sigma b)$. Consequently, a regular expression $R$ corresponding to $P$ has length $\Omega(B\sigma + m)$, where $B = \sum_{i=1}^{k-1} b_i$ is the sum of the upper bounds of the gaps in $P$. Using Thompson's textbook regular expression matching algorithm [20] this leads to an algorithm for the VLG problem using $O(n(B\sigma + m))$ time. Even with the fastest known algorithms for regular expression matching this bound can only be improved by at most a polylogarithmic factor [2, 3, 15, 18].

Several algorithms that improve upon the direct translation to a regular expression matching problem have been proposed [4, 6–8, 12–14, 16, 17, 19]. Some of these are able to solve more general versions of the problem, such as searching for patterns that also contain character classes and variable length gaps with negative length. Most of the algorithms are based on fast simulations of non-deterministic finite automata. In particular, Navarro and Raffinot [17] gave an algorithm using $O\left(n\left(\frac{m + B}{w} + 1\right)\right)$ time, where $w$ is the number of bits in a memory word. Fredriksson and Grabowski [7, 8] improved this bound for the case when all variable length gaps have lower bound 0 and identical upper bound $b$. Their fastest algorithm achieves $O\left(n\left(\frac{m \log \log b}{w} + 1\right)\right)$ time. Very recently, Bille and Thorup [4] gave an algorithm using $O\left(n\left(k \frac{\log w}{w} + \log k\right) + m \log m + A\right)$ time and $O(m + A)$ space, where $A = \sum_{i=1}^{k-1} a_i$ is the sum of the lower bounds on the lengths of the gaps. Note that if we assume that the $nk$ term dominates and ignore the $w / \log w$ factor, the time bound reduces to $O(nk)$.

An alternative approach, suggested independently by Morgante et al. [13] and Rahman et al.
[19], is to design algorithms that are efficient in terms of the total number of occurrences of the $k$ strings $P_1, \ldots, P_k$ within $T$. Let $\alpha$ be this number, e.g., in Example 1 A, CC, and GT occur 5, 5, and 4 times in $T$. Hence, $\alpha = 5 + 5 + 4 = 14$. Rahman et al. [19] gave an algorithm using O(n log k + m + α log(max 1 ≤ i […] 1, that is reported at position $\tau$. There are two cases to consider.

1. $y$ is relevant. By definition there is a relevant occurrence $x$ of $P_{i-1}$ in $T$, such that $\mathrm{startpos}(y) = \tau - |P_i| \in R(x)$. By the induction hypothesis $x$ was correctly determined to be relevant by the algorithm. Since $\mathrm{endpos}(x) < \tau$, $R(x)$ was appended to $L_i$ earlier in the execution of the algorithm. It remains to show that the range containing $\mathrm{startpos}(y)$ is the first range in $L_i$ in step 2b. When removing the dead ranges in $L_i$ in step 2a, all ranges $[a, b]$ where $b < \tau - |P_i|$ are removed. Therefore the range containing $\tau - |P_i| = \mathrm{startpos}(y)$ is the first range in $L_i$ after step 2a. It follows that the algorithm correctly determines that $y$ is relevant.

2. $y$ is not relevant. Then there exists no relevant occurrence $x$ of $P_{i-1}$ such that $\mathrm{startpos}(y) \in R(x)$. By the induction hypothesis there is no range in $L_i$ containing $\mathrm{startpos}(y)$, since the algorithm only appends ranges when a relevant occurrence is found. Consequently, the algorithm correctly determines that $y$ is not relevant.

Figure 2: The occurrences of the subpatterns $P_1 = \mathtt{A}$, $P_2 = \mathtt{CC}$ and $P_3 = \mathtt{GT}$ and the ranges they define in the text $T$ from Example 1. Occurrences which are not relevant are crossed out.
The bold occurrences of $P_3$ are the relevant occurrences of $P_k$ and their end positions 17, 28 and 31 constitute the solution to the VLG problem. Consider the point in the execution of the algorithm when the occurrence $x$ of $P_2$ at position $\tau = 26$ is reported by the Aho-Corasick automaton. At this time $L_2 = \big[[17; 20], [22; 23], [25; 26]\big]$ and $L_3 = \big[[23; 28]\big]$. The ranges $[17; 20]$ and $[22; 23]$ are now dead and are removed from $L_2$ in step 2a. In step 2b the algorithm determines that $x$ is relevant and $R(x) = [29; 33]$ is appended to $L_3$: $L_3 = \big[[23; 33]\big]$.

3.2 Time and Space Complexity

The AC automaton for the subpatterns $P_1, P_2, \ldots, P_k$ can be built in time $O(m \log k)$ using $O(m)$ space, where $m = \sum_{i=1}^{k} |P_i|$. In the trivial case when $m > n$ we do not need to build the automaton. Hence, we will assume that $m \le n$ in the following analysis. For each of the $\alpha$ occurrences of the strings $P_1, P_2, \ldots, P_k$, Algorithm 1 first removes the dead ranges from $L_i$ and $L_{i+1}$ and performs a number of constant-time operations. Since both lists are sorted, the dead ranges can be removed by traversing the lists from the beginning. At most $\alpha$ ranges are ever added to the lists, and therefore the algorithm spends $O(\alpha)$ time in total on removing dead ranges. The total time is therefore $O((n + m) \log k + \alpha) = O(n \log k + m + \alpha)$.

To prove the space bound, we first show the following lemma.

Lemma 2 At any time during the execution of the algorithm we have

$$|L_i| \le \left\lfloor \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rfloor = O\left(\frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right),$$

for $i = 2, 3, \ldots, k$, where $c_i = b_i - a_i + 1$.

Proof. Consider list $L_i$ for some $i = 2, \ldots, k$.
Referring to Algorithm 1, the size of the list $L_i$ is only increased in step 2(b)i, when a range $R(x_j)$ defined by a relevant occurrence $x_j$ of $P_{i-1}$ is reported and $R(x_j)$ does not adjoin or overlap the last range in $L_i$. Let $R(x_1) = [s, t]$ be the first range in $L_i$ at an arbitrary time in the execution of the algorithm. We bound the number of additional ranges that can be added to $L_i$ from the time $R(x_1)$ became the first range in $L_i$ until $R(x_1)$ is removed.

Figure 3: The worst-case situation where $\ell$, the maximum number of ranges, are present in $L_i$. The figure only shows the first and the last occurrence of $P_{i-1}$ ($x_1$ and $x_\ell$) defining the $\ell$ ranges.

The last position where $R(x_1)$ is still alive is $\tau_a = t + |P_i| - 1$. If a relevant occurrence $x_\ell$ of $P_{i-1}$ ends at this position, then the range $R(x_\ell) = [\tau_a + a_{i-1} + 1; \tau_a + b_{i-1} + 1]$ is appended to $L_i$. Hence, the maximum number of positions $d$ from $t$ to the end of $R(x_\ell)$ is

$$d = \tau_a + b_{i-1} + 1 - t = (t + |P_i| - 1) + b_{i-1} + 1 - t = |P_i| + b_{i-1} = |P_i| + a_{i-1} + c_{i-1} - 1.$$

In the worst case, all the ranges in $L_i$ are separated by exactly one position as illustrated in Fig. 3. Therefore at most $\lfloor d / (c_{i-1} + 1) \rfloor$ additional ranges can be added to $L_i$ before $R(x_1)$ is removed. Counting in $R(x_1)$ yields the following bound on the size of $L_i$:

$$|L_i| \le \left\lfloor \frac{d}{c_{i-1} + 1} \right\rfloor + 1 = \left\lfloor \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rfloor = O\left(\frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right). \qquad \square$$

By Lemma 2 the total number of ranges stored at any time during the processing of $T$ is at most

$$O\left(\sum_{i=2}^{k} \frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right)$$
$$= O\left(\sum_{i=1}^{k-1} \frac{|P_{i+1}|}{b_i - a_i + 2} + \sum_{i=1}^{k-1} \frac{a_i}{b_i - a_i + 2}\right) = O(m + A).$$

Each range can be stored using $O(1)$ space, so this is an upper bound on the space needed to store the lists $L_2, \ldots, L_k$. The AC-automaton uses $O(m)$ space, so the total space required by our algorithm is $O(m + A)$. In summary, the algorithm uses $O(n \log k + m + \alpha)$ time and $O(m + A)$ space. This completes the proof of Theorem 1.

4 Complete Characterization of Occurrences

In this section we show how our algorithm can be extended to report not only the end position of $P_k$, but also the positions of $P_1, P_2, \ldots, P_{k-1}$ for each occurrence of $P$ in $T$. The main idea is to construct a graph that encodes all occurrences of the VLG-pattern using $O(\alpha)$ space. For each occurrence of the VLG-pattern, the positions of the individual subpatterns can be reported by traversing this graph. This approach was also used by Rahman et al. [19] and Morgante et al. [13]. We give a fast new algorithm for constructing this graph and show a black-box solution that can report the occurrences of the VLG-pattern without storing the complete graph.

We introduce the following simple definitions. If $P$ occurs in the text $T$, then a match combination is a sequence $e_1, \ldots, e_k$ of end positions of $P_1, \ldots, P_k$ in $T$ corresponding to the match. The total number of match combinations of $P$ in $T$ is denoted $\beta$. Note that there can be many match combinations corresponding to a single match. See Fig. 4.

Figure 4: The five match combinations. The text sequence is the same as in the previous examples.
The substring $S$ from position 5 to 17 (highlighted in bold) matches the VLG-pattern $Q = \mathtt{G} \cdot g\{0, 3\} \cdot \mathtt{C} \cdot g\{1, 6\} \cdot \mathtt{A} \cdot g\{2, 7\} \cdot \mathtt{T}$. As the figure shows, this match contains the following five match combinations: [5,9,12,17], [5,8,12,17], [5,8,10,17], [5,6,12,17], [5,6,10,17].

Due to the inequality of arithmetic and geometric means, the total number of match combinations $\beta$ is maximized when the $\alpha$ occurrences are distributed evenly over $P_1, P_2, \ldots, P_k$ and each occurrence of $P_i$ is compatible with all occurrences of $P_{i-1}$ for $i = 2, \ldots, k$. So in the worst case $\beta = \Theta\left(\left(\frac{\alpha}{k}\right)^k\right)$, which is exponential in the number of gaps.

All these match combinations can be encoded in a directed graph using $O\left(\frac{\alpha^2}{k}\right)$ space as follows. The nodes in the graph are the relevant occurrences of $P_1, P_2, \ldots, P_k$ in $T$. Two nodes $x$ of $P_{i-1}$ and $y$ of $P_i$ are connected by an edge from $y$ to $x$ if and only if $\mathrm{startpos}(y) \in R(x)$. In that case we also say that $x$ and $y$ are compatible. We denote this graph as the gap graph for $P$ and $T$. See Fig. 5. Since the number of nodes in the gap graph is at most $\alpha$, and there are $O\left(\left(\frac{\alpha}{k}\right)^2\right)$ edges between the $k$ layers in the worst case, we can store the graph using $O\left(\frac{\alpha^2}{k}\right)$ space.

Figure 5: The gap graph for the VLG-pattern $R = \mathtt{C} \cdot g\{0, 3\} \cdot \mathtt{G} \cdot g\{3, 10\} \cdot \mathtt{A}$ and the text $T = \mathtt{CTGGCCCCGCTCCACGTTGAGCGGCGCTGAG}$.

If the $j$ occurrences $x_1, x_2, \ldots, x_j$ of $P_i$ (appearing in that order in $T$) are all compatible with the same occurrence $y$ of $P_{i+1}$, then the $j$ edges $(y, x_1), (y, x_2), \ldots, (y, x_j)$ are all present in the gap graph.
Due to the following lemma, the edges $(y, x_2), \ldots, (y, x_{j-1})$ are redundant.

Lemma 3 Let $x_1$ and $x_2$ be two occurrences of $P_i$, $i = 1, \ldots, k-1$, both compatible with the same occurrence $y$ of $P_{i+1}$. Assume without loss of generality that $\mathrm{startpos}(x_1) < \mathrm{startpos}(x_2)$ and let $x'$ be another occurrence of $P_i$ such that $\mathrm{startpos}(x_1) \le \mathrm{startpos}(x') \le \mathrm{startpos}(x_2)$, then $x'$ is also compatible with $y$.

Proof. Since $\mathrm{startpos}(y) \in R(x_1)$ and $\mathrm{startpos}(y) \in R(x_2)$, we have that $\mathrm{startpos}(y) \in R(x_1) \cap R(x_2)$. Furthermore, since $\mathrm{startpos}(x_1) \le \mathrm{startpos}(x') \le \mathrm{startpos}(x_2)$, it holds that $R(x_1) \cap R(x_2) \subseteq R(x')$, so $\mathrm{startpos}(y) \in R(x')$. $\square$

Leaving out the redundant edges in the gap graph, we get a new graph, which we denote the implicit gap graph. For an example, see Fig. 6. In this graph the out-degree of each node is at most two, so the number of edges is now linear in the number of nodes, and consequently we can store the implicit gap graph using $O(\alpha)$ space.

Figure 6: The implicit gap graph for the VLG-pattern $R = \mathtt{C} \cdot g\{0, 3\} \cdot \mathtt{G} \cdot g\{3, 10\} \cdot \mathtt{A}$ and the text $T = \mathtt{CTGGCCCCGCTCCACGTTGAGCGGCGCTGAG}$. The out-degree of each node is at most two. Compare to Fig. 5.

In the context of these new definitions, we are interested in solving the two following problems:

The reporting variable length gaps problem (RVLG problem) is to output all match combinations of $P$ in $T$.

The implicit reporting variable length gaps problem (IRVLG problem) is to output the implicit gap graph of all match combinations of $P$ in $T$.
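To make the RVLG output concrete, the following Python sketch enumerates match combinations by brute force and checks the five combinations listed for Fig. 4. It is an illustration only and does not use the gap graph; the names `occurrences` and `match_combinations` and the list-based pattern encoding are assumptions made for the example.

```python
def occurrences(text, p):
    """1-based (start, end) positions of all occurrences of p in text."""
    return [(s + 1, s + len(p))
            for s in range(len(text) - len(p) + 1)
            if text.startswith(p, s)]


def match_combinations(text, strings, gaps):
    """All sequences e_1, ..., e_k of end positions forming a match (brute force)."""
    combos = [[occ] for occ in occurrences(text, strings[0])]
    for (a, b), p in zip(gaps, strings[1:]):
        occs = occurrences(text, p)
        # y is compatible with the last chosen occurrence x if the number of
        # characters strictly between them lies in [a, b].
        combos = [c + [y] for c in combos for y in occs
                  if a <= y[0] - c[-1][1] - 1 <= b]
    return [[end for _, end in c] for c in combos]


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
Q = (["G", "C", "A", "T"], [(0, 3), (1, 6), (2, 7)])
five = sorted(c for c in match_combinations(T, *Q) if c[0] == 5 and c[-1] == 17)
print(five)
# [[5, 6, 10, 17], [5, 6, 12, 17], [5, 8, 10, 17], [5, 8, 12, 17], [5, 9, 12, 17]]
```

The list of combinations can grow exponentially in the number of gaps, which is why the implicit gap graph is the preferred encoding.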
Figure 7: Example showing how the two lists $L^f_2$ and $L^\ell_2$ store the first and most recent range to cover a position in the text, respectively. The VLG-pattern is $\mathtt{AC} \cdot g\{1, 5\} \cdot \mathtt{T}$. When the occurrence $y$ of $P_2 = \mathtt{T}$ at position 9 is reported, we can check the two lists to see that $x_1$ is the first and $x_3$ is the last occurrence of $P_1$ compatible with $y$.

4.1 Constructing the Implicit Gap Graph

Algorithm 2 describes how to build the implicit gap graph. Recall that in Algorithm 1 the ranges in $L_i$ allowed us to determine the relevancy of a newly reported occurrence $x$ of $P_i$ by inspecting the first range in $L_i$ (after the dead ranges had been removed). To build the implicit gap graph, we need to determine not only the relevancy of $x$, but also the first and last occurrence of $P_{i-1}$ compatible with $x$. This information allows us to add the correct edges to the implicit gap graph. To do this, we replace the list $L_i$ with two lists $L^f_i$ and $L^\ell_i$, for $i = 2, \ldots, k$. The idea is that when a position in the text is covered by multiple ranges, $L^f_i$ contains the first range and $L^\ell_i$ contains the most recent range to cover that position. See Fig. 7. Each range $[s, t]$ in $L^f_i$ or $L^\ell_i$ now also has a reference to the occurrence $x$ of $P_{i-1}$ that defined it, and we will denote the range $[s, t]_x$ to indicate this.

When an occurrence $x$ of $P_i$ is reported, we first remove dead ranges from the lists $L^f_i$, $L^\ell_i$, $L^f_{i+1}$ and $L^\ell_{i+1}$ as was done in Algorithm 1. If $x$ is relevant, a node representing $x$ is added to the implicit gap graph in step 2(b)i.
In step 2(b)ii, provided that $x$ is not an occurrence of $P_1$, the two out-going edges of $x$ are added by inspecting $L^f_i$ and $L^\ell_i$ to determine the first and last occurrence of $P_{i-1}$ compatible with $x$. Unless $x$ is an occurrence of $P_k$, the range $R(x) = [\tau + a_i + 1; \tau + b_i + 1]$ is added to the lists $L^f_{i+1}$ and $L^\ell_{i+1}$ in step 2(b)iii as described in the following section.

4.1.1 Maintaining the Range Lists

When adding a range $[s, t]_x$ defined by an occurrence $x$ of $P_i$ to $L^f_{i+1}$ and $L^\ell_{i+1}$, we simply append it to the end of the list if it does not overlap the last range in the list. Otherwise, to avoid overlapping ranges, we appropriately shorten either the newly added range $[s, t]_x$ (for $L^f_{i+1}$) or the last range in the list (for $L^\ell_{i+1}$). The way $L^f_i$ is maintained ensures that the first range that covers some position $\tau$ in $T$ will remain the only range covering this position in $L^f_i$. Conversely, $L^\ell_i$ will store the most recent range covering $\tau$. In Algorithm 2 the steps 2(b)iii, A, B and C append and possibly shorten the ranges according to this strategy.

Algorithm 2 Algorithm solving the IRVLG problem for a VLG pattern $P$ and a string $T$.

1. Build the AC-automaton for the subpatterns $P_1, P_2, \ldots, P_k$.
2. Process $T$ using the automaton and each time an occurrence $x$ of $P_i$ is reported at position $\tau = \mathrm{endpos}(x)$ in $T$ do:
   (a) Remove any dead ranges from the lists $L^f_i$, $L^\ell_i$, $L^f_{i+1}$ and $L^\ell_{i+1}$.
   (b) If $i = 1$ or if $\tau - |P_i| = \mathrm{startpos}(x)$ is contained in the first range in $L^f_i$ (i.e., $x$ is a relevant occurrence) do:
      i. Add the node $x$ to the implicit gap graph.
      ii. If $i > 1$: Add the edges $(x, y)$ and $(x, z)$ to the implicit gap graph, where $y$ and $z$ are the occurrences of $P_{i-1}$ defining the first range in $L^f_i$ and $L^\ell_i$, respectively.
      iii.
          If $i < k$: Let $[q, r]_w$ and $[q', r']_{w'}$ denote the last range in $L^f_{i+1}$ and $L^\ell_{i+1}$, respectively.
          A. Append the range $[\max(r + 1, \tau + a_i + 1); \tau + b_i + 1]_x$ to the end of $L^f_{i+1}$.
          B. Change the last range in $L^\ell_{i+1}$ to $[q'; \min(r', \tau + a_i)]_{w'}$.
          C. Append the range $[\tau + a_i + 1; \tau + b_i + 1]_x$ to the end of $L^\ell_{i+1}$.

4.1.2 Time and Space Analysis

As for Algorithm 1, the time spent for each of the at most $\alpha$ relevant occurrences reported by the AC automaton is amortized constant. Hence the implicit gap graph can be built in $O(n \log k + m + \alpha)$ time. Storing the implicit gap graph for the entire text takes space $O(\alpha)$, since each of the at most $\alpha$ nodes has at most two out-going edges.

We now consider the space needed to store the lists $L^f_i$ and $L^\ell_i$. The ranges in $L^f_i$ and $L^\ell_i$ are no longer guaranteed to have size $c_{i-1} = b_{i-1} - a_{i-1} + 1$, nor to be separated by at least one position, so the bound of Lemma 2 needs to be revised, resulting in a slightly increased space bound for storing the lists. Referring to Fig. 3, the number of ranges in $L^f_i$ or $L^\ell_i$ at any point in time is at most

$$d + 1 = c_{i-1} + |P_i| - 1 + a_{i-1} + 1 = |P_i| + b_{i-1} + 1.$$

Summing up, the total space required to store the lists increases from $O(m + A)$ to $O(m + B)$, where $B = \sum_{i=1}^{k-1} b_i$ is the sum of the upper bounds of the lengths of the gaps. Recapitulating, we have the following theorem.

Theorem 2 The IRVLG problem can be solved in time $O(n \log k + m + \alpha)$ and space $O(m + B + \alpha)$.

4.2 A Black-Box Solution for Reporting Match Combinations

The number of match combinations, $\beta$, can be exponential in the number of gaps. The implicit gap graph space-efficiently encodes all of these match combinations in a graph of size $O(\alpha)$.
Thus, a straightforward solution to the RVLG problem is to construct the implicit gap graph and subsequently traverse it to report the match combinations. Each of the $\beta$ match combinations is a sequence of $k$ integers, so this solution to the RVLG problem takes time $O(n \log k + m + \alpha + k\beta)$ and space $O(m + B + \alpha)$.

We now show that the RVLG problem can be solved space-efficiently using any black-box algorithm for the IRVLG problem. The main idea is a simple splitting of $T$ into overlapping smaller substrings of suitable size. We solve the problem for each substring individually and combine the solutions to solve the full problem. By carefully organizing the computation we can efficiently reuse the space needed for the subproblems.

Let $A_I$ be any algorithm that solves the IRVLG problem in time $t(n, m, k, \alpha)$ and space $s(n, m, k)$, where $n$, $m$, $k$, and $\alpha$ are the parameters of the input as above. We build a new algorithm $A_R$ from $A_I$ that solves the RVLG problem as follows. Assume without loss of generality that $n$ is a multiple of $2(m + B)$. Divide $T$ into $z = \frac{n}{m + B} - 1$ substrings $C_1, \ldots, C_z$, called chunks. Each chunk has length $2(m + B)$ and overlaps in $m + B$ characters with each neighbor. We run $A_I$ on each chunk $C_1, \ldots, C_z$ in sequence to compute the implicit gap graph for each chunk. By traversing the implicit gap graph for each chunk we output the union of the corresponding match combinations. Since each match combination of $P$ in $T$ occurs in at most two neighboring chunks, it suffices to only store the implicit gap graph for two chunks at any time.

Next we consider the complexity of $A_R$. Let $\alpha_i$ denote the number of occurrences of the strings of $P$ in $C_i$. For each chunk we run $A_I$ to produce the implicit gap graph. Given these, we compute the union of match combinations in $O(k\beta)$ time.
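The chunk layout described above can be sketched as follows. This Python fragment only computes the chunk boundaries; running the black-box algorithm $A_I$ on each chunk is left out. The function name `chunk_bounds` and the 1-based inclusive interval convention are assumptions made for the illustration.

```python
def chunk_bounds(n, m, B):
    """Return the 1-based inclusive (start, end) intervals of the chunks C_1, ..., C_z.

    Assumes, as in the text, that n is a multiple of 2(m + B). Each chunk has
    length 2(m + B) and overlaps each neighbor in m + B characters.
    """
    w = m + B
    if n % (2 * w) != 0:
        raise ValueError("n must be a multiple of 2(m + B)")
    z = n // w - 1
    return [((i - 1) * w + 1, (i + 1) * w) for i in range(1, z + 1)]


print(chunk_bounds(12, 2, 1))  # [(1, 6), (4, 9), (7, 12)]
```

Since a match of $P$ spans at most $m + B$ characters, every match lies entirely inside at least one of these chunks, which is why processing the chunks one at a time loses no match combinations.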
Hence, algorithm $A_R$ uses time

$$O\left(\sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta\right).$$

Next consider the space. We only need to store the implicit gap graphs for two chunks at any time. Since the space required for each chunk is $O((m + B)k)$, the total space becomes

$$O((m + B)k + s(2(m + B), m, k)).$$

The black-box algorithm efficiently converts algorithms for the IRVLG problem to the RVLG problem, resulting in the following theorem.

Theorem 3 Given an algorithm solving the IRVLG problem in time $t(n, m, k, \alpha)$ and space $s(n, m, k)$, there is an algorithm solving the RVLG problem in time $O\left(\sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta\right)$ and space $O((m + B)k + s(2(m + B), m, k))$.

If we use the result from Theorem 2, we obtain an algorithm that uses time

$$O\left(\sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta\right) = O\left(\frac{n}{m + B}\big(2(m + B) \log k + m\big) + \alpha + k\beta\right) = O(n \log k + m + \alpha + k\beta),$$

where the term $m$ in the last expression is needed for the case where $m > n$. The space usage is

$$O((m + B)k + s(2(m + B), m, k)) = O\left((m + B)k + m + B + \max_{i = 1, \ldots, z} \alpha_i\right) = O((m + B)k),$$

where the last equality holds since $\alpha_i \le 2(m + B)k$ for all $i$, as each of the $2(m + B)$ positions of a chunk is the end position of at most one occurrence of each of the $k$ strings. In summary, we have the following result for the RVLG problem.

Theorem 4 The RVLG problem can be solved in time $O(n \log k + m + \alpha + k\beta)$ and space $O((m + B)k)$.

4.3 Reporting Match Combinations On the Fly

We now show how a simple extension of our algorithm provides an alternative solution to the RVLG problem achieving the same space and time complexity as the black-box solution. The idea is to use Algorithm 2 and report the match combinations on the fly, while continually removing nodes from the implicit gap graph that can no longer be part of a match combination. We remove the nodes using a method similar to that for removing dead ranges in the lists $L^f_i$ and $L^\ell_i$.
We say that a node $x$ of $P_i$ in the implicit gap graph is dead if $x$ cannot be part of a future match combination. This happens when

$$\tau > \mathrm{endpos}(x) + \sum_{j = i+1}^{k} \left(b_{j-1} + |P_j|\right).$$

Like dead ranges, we can remove dead nodes from the implicit gap graph in amortized constant time. Consequently, all match combinations can be reported in time $O(n \log k + m + \alpha + k\beta)$. Removing the dead nodes ensures that the number of $P_i$ nodes in the implicit gap graph at any time is at most $1 + \sum_{j = i+1}^{k} \left(b_{j-1} + |P_j|\right)$. Thus, the total number of nodes never exceeds

$$\sum_{i=1}^{k} \left(1 + \sum_{j = i+1}^{k} \left(b_{j-1} + |P_j|\right)\right) = O((m + B)k).$$

In summary, the algorithm solves the RVLG problem in time $O(n \log k + m + \alpha + k\beta)$ and space $O((m + B)k)$, so it provides an alternative proof of Theorem 4.

References

[1] A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.
[2] P. Bille. New algorithms for regular expression matching. In Proc. 33rd ICALP, pages 643–654, 2006.
[3] P. Bille and M. Thorup. Faster regular expression matching. In Proc. 36th ICALP, pages 171–182, 2009.
[4] P. Bille and M. Thorup. Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, 2010.
[5] P. Bucher and A. Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53–61, 1994.
[6] M. Crochemore, C. Iliopoulos, C. Makris, W. Rytter, A. Tsakalidis, and K. Tsichlas. Approximate string matching with gaps. Nordic J. of Computing, 9(1):54–65, 2002.
[7] K. Fredriksson and S. Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr., 11(4):335–357, 2008.
[8] K. Fredriksson and S. Grabowski.
Nested counters in bit-parallel string matching. In Proc. 3rd LATA, pages 338–349, 2009.
[9] T. Haapasalo, P. Silvasti, S. Sippu, and E. Soisalon-Soininen. Online dictionary matching with variable-length gaps. In Proc. 10th SEA, pages 76–87, 2011.
[10] K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch. The prosite database, its status in. Nucleic Acids Res, (27):215–219, 1999.
[11] D. E. Knuth, J. James H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
[12] I. Lee, A. Apostolico, C. S. Iliopoulos, and K. Park. Finding approximate occurrences of a pattern that contains gaps. In Proc. 14th AWOCA, pages 89–100, 2003.
[13] M. Morgante, A. Policriti, N. Vitacolonna, and A. Zuccolo. Structured motifs search. J. Comput. Bio., 12(8):1065–1082, 2005.
[14] E. W. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 3(1):33–51, 1992.
[15] E. W. Myers. A four-russian algorithm for regular expression pattern matching. J. ACM, 39(2):430–448, 1992.
[16] G. Myers and G. Mehldau. A system for pattern matching applications on biosequences. CABIOS, 9(3):299–314, 1993.
[17] G. Navarro and M. Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003.
[18] G. Navarro and M. Raffinot. New techniques for regular expression searching. Algorithmica, 41(2):89–116, 2004.
[19] M. S. Rahman, C. S. Iliopoulos, I. Lee, M. Mohamed, and W. F. Smyth. Finding patterns with variable length gaps or don't cares. In Proc. 12th COCOON, pages 146–155, 2006.
[20] K. Thompson. Regular expression search algorithm. Commun. ACM, 11:419–422, 1968.
