Parameterized Matching in the Streaming Model

Markus Jalsenius∗, Benny Porat†, Benjamin Sach‡

Abstract

We study the problem of parameterized matching in a stream where we want to output matches between a pattern of length $m$ and the last $m$ symbols of the stream before the next symbol arrives. Parameterized matching is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of pattern symbols is allowed. We show how this problem can be solved in constant time per arriving stream symbol and sublinear, near optimal space with high probability. Our results are surprising and important: it has been shown that almost no streaming pattern matching problems can be solved (not even randomised) in less than $\Theta(m)$ space, with exact matching as the only known problem to have a sublinear, near optimal space solution. Here we demonstrate that a similar sublinear, near optimal space solution is achievable for an even more challenging problem. The proof is considerably more complex than that for exact matching.

1 Introduction

We consider the problem of pattern matching in a stream where we want to output matches between a pattern of length $m$ and the last $m$ symbols of the stream. Each answer must be reported before the next symbol arrives. The problem we consider in this paper is known as parameterized matching and is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of the pattern symbols is allowed (one per alignment). For example, if the pattern is abbca then there is a parameterized match with bddcb, as we can apply the relabelling a → b, b → d, c → c. There is however no parameterized match with bddbb. We show how this streaming pattern matching problem can be solved in near constant time per arriving stream symbol and sublinear, near optimal, space with high probability.
The space used is reduced even further when only a small subset of the symbols are allowed to be relabelled. As discussed in the next section, our results demonstrate a serious push forward in understanding which pattern matching problems can be solved in sublinear space.

∗ Department of Computer Science, University of Bristol, U.K.
† Department of Computer Science, Bar-Ilan University, Israel.
‡ Department of Computer Science, University of Warwick, U.K.

1.1 Background

Streaming algorithms are a well-studied area, and finding patterns in a stream specifically is a fundamental problem that has received increasing attention over the past few years. It was shown in [8] that many offline algorithms can be made online (streaming) and deamortised with a $\log m$ factor overhead in the time complexity per arriving symbol in the stream, where $m$ is the length of the pattern. There have also been improvements for specific pattern matching problems, but they all have one property in common: space usage is $\Theta(m)$ words. It is not difficult to show that we in fact need as much as $\Theta(m)$ space to do pattern matching, unless errors are allowed. The field of pattern matching in a stream took a significant step forwards in 2009 when it was shown to be possible to solve exact matching using only $O(\log m)$ words of space and $O(\log m)$ time per new stream symbol [15]. This method, which is based on fingerprints, correctly finds all matches with high probability. The initial approach was subsequently somewhat simplified [10] and then finally improved to run in constant time [7] within the same space requirements.

Being able to do exact matching in sublinear space raised the question of what other streaming pattern matching problems can be solved in small space. In 2011 this question was answered for a large set of such problems [9].
The result was rather gloomy: almost no streaming pattern matching problems can be solved in sublinear space, not even using randomised algorithms. An $\Omega(m)$ space lower bound was given for $L_1$, $L_2$, $L_\infty$, Hamming, edit distance and pattern matching with wildcards, as well as for any algorithm that computes the cross-correlation/convolution.

So what other pattern matching problems could possibly be solved in small space? It seems that the only hope of finding any is by imposing various restrictions on the problem definition. This was indeed done in [15], where a solution to $k$-mismatch (exact matching where up to $k$ mismatches are allowed) was given which uses $O(k^2\,\mathrm{poly}(\log m))$ time per arriving stream symbol and $O(k^3\,\mathrm{poly}(\log m))$ words of space. The solution involves multiple instances of the exact matching algorithm run in parallel. Note that the space bound approaches $\Theta(m)$ as $k$ increases, so the algorithm is only interesting for sufficiently small $k$. Further, the space bound is very far from the known $\Omega(k)$ lower bound. We also note that it is straightforward to show that exact matching with $k$ wildcards in the pattern can be solved with the $k$-mismatch algorithm. To our knowledge, no other streaming pattern matching problems have been solved in sublinear space so far.

In this paper we present the first push forward since exact matching by giving a sublinear, near optimal space and near constant time (or constant with a mild restriction on the alphabet) algorithm for parameterized matching in a stream. This natural problem turns out to be significantly more complicated to solve than exact matching, and our results provide the first demonstration that small space and time bounds are achievable for a more challenging problem. Note that our space bound, as opposed to that for $k$-mismatch, is essentially optimal, as for exact matching.
One could easily argue that our results are surprising, and yet again the question of what other problems are solvable in sublinear space calls for an answer. In particular, given that restrictions on the problem have to be made, what restrictions should one make to break the $\Omega(m)$ space barrier?

1.2 Problem definition and related work

A pattern $P$ of length $m$ is said to parameterize match, or p-match for short, an $m$-length string $S$ if there is an injective (one-to-one) function $f$ such that $S[j] = f(P[j])$ for all $j \in \{0, \ldots, m-1\}$. In our streaming setting, the pattern is known in advance and the symbols of the stream $T$ arrive one at a time. We use the letter $i$ to denote the index of the latest symbol in the stream. Our task is to output whether there is a p-match between $P$ and $T[(i-m+1), i]$ before $T[i+1]$ arrives. The mapping $f$ may be distinct for each $i$.

One may view this matching problem as that of finding matches in a stream encrypted using a substitution cipher. In offline settings, parameterized matching has its origin in finding duplication and plagiarism in software code, although it has since found numerous other applications. Since the first introduction of the problem, a great deal of work has gone into its study in both theoretical and practical settings (see e.g. [1, 3–6, 12]). Notably, in an offline setting, the exact parameterized matching problem can be solved in near linear time using a variant [1] of the classic linear time exact matching algorithm KMP [14].

When the sublinear space algorithm for exact matching was given in [15], properties of the periods of strings formed a crucial part of the analysis. However, when considering parameterized matching, the period of a string is a much less straightforward concept than it is for exact matching.
For example, it is no longer true that consecutive matches must either be separated by the period of the pattern or be at least $m/2$ symbols apart. This property, which holds for exact but not parameterized matching, allows for an efficient encoding of the positions of the matches. This was crucial to reducing the space requirements of the previous streaming algorithms. Unfortunately, parameterized matches can occur at arbitrary positions in the stream, requiring new insights.

This is not the only challenge that we face. A natural way to match two strings under parameterization is to consider their predecessor strings. For a string $S$, the predecessor string, denoted $\mathrm{pred}(S)$, is a string of length $|S|$ such that $\mathrm{pred}(S)[j]$ is the distance, counted in numbers of symbols, to the previous occurrence of the symbol $S[j]$ in $S$. In other words, $\mathrm{pred}(S)[j] = d$, where $d$ is the smallest positive value for which $S[j] = S[j-d]$. Whenever no such $d$ exists, we set $\mathrm{pred}(S)[j] = 0$. As an example, if $S$ = aababcca then $\mathrm{pred}(S)$ = 01022014. We can perform parameterized matching offline by only considering predecessor strings, using the fundamental fact [3] that two equal length strings $S$ and $S'$ p-match iff $\mathrm{pred}(S) = \mathrm{pred}(S')$.

A plausible approach for our streaming problem would now be to translate the problem of parameterized matching in a stream to that of exact matching. This could be achieved by converting both pattern and stream into their corresponding predecessor strings and maintaining fingerprints of a sliding window of the translated input. However, consider the effect on the predecessor string, and hence its fingerprint, of sliding a window in the stream along by one.
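To make the predecessor-string definition concrete, here is a minimal Python sketch. The function names `pred`, `p_match` and `window_pred` are illustrative and do not come from the paper. It computes $\mathrm{pred}(S)$, tests two strings for a p-match by comparing predecessor strings, and shows that the predecessor string of a window equals the corresponding slice of $\mathrm{pred}(T)$ once every value pointing before the window start is reset to 0.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence of s[j], or 0 if none."""
    last = {}          # symbol -> index of its most recent occurrence
    out = []
    for j, ch in enumerate(s):
        out.append(j - last[ch] if ch in last else 0)
        last[ch] = j
    return out


def p_match(a, b):
    """Two equal-length strings p-match iff their predecessor strings are equal."""
    return len(a) == len(b) and pred(a) == pred(b)


def window_pred(pred_t, start, length):
    """Predecessor string of T[start, start+length-1], derived from pred(T).

    A value d at window offset k is kept only if it points inside the
    window (d <= k); otherwise the earlier occurrence has slid out and
    the value must become 0.
    """
    window = pred_t[start:start + length]
    return [d if d <= k else 0 for k, d in enumerate(window)]


if __name__ == "__main__":
    t = "aababcca"
    print(pred(t))                      # predecessor string of the example above
    print(p_match("abbca", "bddcb"))    # the matching example from the introduction
    print(p_match("abbca", "bddbb"))    # the non-matching example
    print(window_pred(pred(t), 1, 4), pred(t[1:5]))
```

The last line prints the same list twice, but producing it from `pred(t)` required overwriting an entry. That overwrite is the update a sliding window forces on the fingerprint.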
The leftmost symbol $x$, say, will move out of the window, and so the predecessor value of the new leftmost occurrence of $x$ in the new window will need to be set to 0 and the corresponding fingerprint updated. We cannot afford to store the positions of all characters in a $\Theta(m)$ length window.

We will show a matching algorithm that solves these problems, and others we encounter en route, using minimal space and in near constant time per arriving symbol. A number of technical innovations are required, including new uses of fingerprinting, a new compressed encoding of the positions of potential matches, a separate deterministic algorithm designed for prefixes of the pattern with small parameterized period, as well as the deamortisation of the entire matching process. Section 2 gives a more detailed overview of these main hurdles.

1.3 Our new results

Our main result is a fast and space efficient algorithm to solve the streaming parameterized matching problem. It applies to dense alphabets where we assume that both the pattern and streaming text alphabets are $\Sigma = \{0, \ldots, |\Sigma|-1\}$. The following theorem is proved over the subsequent sections of this paper.

Theorem 1. Suppose the pattern and text alphabets are both $\Sigma = \{0, \ldots, |\Sigma|-1\}$ and the pattern has length $m$. There is a randomised algorithm for streaming parameterized matching that takes $O(1)$ worst-case time per character and uses $O(|\Sigma| \log m)$ words of space. The probability that the algorithm outputs correctly at all alignments of an $n$ length text is at least $1 - 1/n^c$, where $c$ is any constant.

To fully appreciate this theorem we also give a nearly matching space lower bound which shows that our solution is optimal within logarithmic factors. The proof is based on communication complexity arguments and is deferred to Appendix A.

Theorem 2.
There is a randomised space lower bound of $\Omega(|\Sigma|)$ bits for the streaming parameterized matching problem, where $\Sigma$ is the pattern alphabet.

Parameterized matching is often specified under the assumption that only some symbols are variable (allowed to be relabelled). The mapping $f$ we used in Section 1.2 has to reflect this constraint. More precisely, let the pattern alphabet be partitioned into fixed symbols $\Sigma_{\mathrm{fixed}}$ and variable symbols $\Pi$. For $\sigma \in \Sigma_{\mathrm{fixed}}$, we require that $f(\sigma) = \sigma$. The result from Theorem 1 can be extended to handle general alphabets with arbitrary fixed symbols. The idea is to apply a suitable reduction that was given in [1] (Lemma 2.2) together with the streaming exact matching algorithm of Breslauer and Galil [7], as well as applying a "filter" on the text stream, using for instance the dictionary of Andersson and Thorup [2] based on exponential search trees. The dictionary is used to map text symbols to the variable pattern symbols in $\Pi$. The proof of the following theorem is given in Appendix C.

Theorem 3. Suppose $\Pi$ is the set of pattern symbols that can be relabelled under parameterized matching. All other pattern symbols are fixed. Without any constraints on the text alphabet, there is a randomised algorithm for streaming parameterized matching that takes $O\big(\sqrt{\log |\Pi| / \log\log |\Pi|}\big)$ worst-case time per character and uses $O(|\Pi| \log m)$ words of space, where $m$ is the length of the pattern. The probability that the algorithm outputs correctly at all alignments of an $n$ length text is at least $1 - 1/n^c$, where $c$ is any constant.

As part of the proof of Theorem 1 we had to develop an algorithm that efficiently solves streaming parameterized matching for patterns with small parameterized period, defined as follows.
The parameterized period (p-period) of the pattern $P$, denoted $\rho$, is the smallest positive integer such that $P[0, (m-1-\rho)]$ p-matches $P[\rho, m-1]$. That is, $\rho$ is the shortest distance that $P$ must be slid by to parameterized match itself. Our algorithm is deterministic and is interesting in its own right (see Section 4 for details). We also provide a matching space lower bound which is detailed in Appendix A.

Theorem 4. Suppose the pattern and text alphabets are both $\Sigma = \{0, \ldots, |\Sigma|-1\}$ and the pattern has p-period $\rho$. There is a deterministic algorithm for streaming parameterized matching that takes $O(1)$ worst-case time per character and uses $O(|\Sigma| + \rho)$ words of space. Further, there is a deterministic space lower bound of $\Omega(|\Sigma| + \rho)$ bits.

1.4 Fingerprints

We will make extensive use of Rabin-Karp style fingerprints of strings, which are defined as follows. Let $S$ be a string over the alphabet $\Sigma$. Let $p > |\Sigma|$ be a prime and choose $r \in \mathbb{Z}_p$ uniformly at random. The fingerprint $\phi(S)$ is given by

$$\phi(S) \stackrel{\text{def}}{=} \sum_{k=0}^{|S|-1} S[k]\, r^k \bmod p .$$

A critical property of the fingerprint function $\phi$ is that the probability of a false positive, $\Pr(\phi(S) = \phi(S') \wedge S \neq S')$, is at most $|S|/(p-1)$ (see [13, 15] for proofs). Let $n$ denote the total length of the stream. Our randomised algorithm will make $o(n^2)$ (in fact near linear) fingerprint comparisons in total. Therefore, by applying the union bound, for any constant $c$ we can choose $p \in \Theta(n^{c+3})$ so that with probability at least $1 - 1/n^c$ there will be no false positive matches. As we assume the RAM model with word size $\Theta(\log n)$, a fingerprint fits in a constant number of words. We assume that all fingerprint arithmetic is performed within $\mathbb{Z}_p$. In particular we will take advantage of two fingerprint operations.

$\ominus$ Splitting: Given $\phi(S[0,a])$, $\phi(S[0,b])$ (where $b > a$) and the value of $r^{-a} \bmod p$, we can compute $\phi(S[a+1,b]) = \phi(S[0,b]) \ominus \phi(S[0,a])$ in $O(1)$ time.

$\circledcirc$ Zeroing: Let $S, S'$ be two equal length strings such that $S'$ is identical to $S$ except in positions $z \in Z \subseteq [0, s-1]$, where $s = |S|$, at which $S'[z] = 0$. We write $\phi(S) \circledcirc Z$ to denote $\phi(S')$. Given $\phi(S)$ and $(S[z], r^z \bmod p)$ for all $z \in Z$, computing $\phi(S) \circledcirc Z$ takes $O(|Z|)$ time.

[Figure 1: The key fingerprints used by the randomised algorithm. Characters that contribute differently to $\Phi'_\ell(i')$ and $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ are highlighted.]

2 Overview, key properties and notation

The overall idea of our algorithm in Theorem 1 follows that of previous work on streaming exact matching in small space. For parameterized matching, however, the situation is much more complex and calls not only for more involved details and methods but also for a deeper understanding of the nature of parameterized matching. We will now describe the overall idea and introduce some important notation, and at the end of this section we will highlight key facts about parameterized matching that are crucial for our solution.

The main algorithm will try to match the streaming text with various prefixes of the pattern $P$. Let $\Sigma_P$ denote the pattern alphabet. We define $\delta = |\Sigma_P| \log m$ and let $P_0$ denote the shortest prefix of $P$ that has p-period greater than $3\delta$ (recall the definition of p-period given above Theorem 4). We define $s$ prefixes $P_\ell$ of increasing length so that $|P_\ell| = 2^\ell |P_0|$ for $\ell \in \{1, \ldots, s-1\}$, where $s \leq \lceil \log m \rceil$ is the largest value such that $|P_{s-1}| \leq m/2$. The final prefix $P_s$ has length $m - 4\delta$. For all $\ell$, we define $m_\ell = |P_\ell|$, hence $m_\ell = 2m_{\ell-1}$ for $1 \leq \ell \leq s-1$.

In order to determine if there is a p-match between the text and a pattern prefix, we will compare the fingerprints of their predecessor strings (recall that two strings p-match iff their predecessor strings are the same). We will need two related (but typically distinct) fingerprint definitions to achieve this. Figure 1 will be helpful when reading the following definitions, which are discussed in an example below. For any index $i'$ and $\ell \in \{0, \ldots, s\}$,

$$\Phi_\ell(i') \stackrel{\text{def}}{=} \phi\big(\mathrm{pred}(T[0, (i'+m_\ell-1)])\big),$$

$$\Phi'_\ell(i') \stackrel{\text{def}}{=} \phi\big(\mathrm{pred}(T[i', (i'+m_\ell-1)])[m_{\ell-1}, m_\ell-1]\big).$$

For each $\ell \in \{1, \ldots, s\}$ the main algorithm runs a process whose responsibility is finding p-matches between the text and $P_\ell$ ($P_0$ is handled separately, as will be discussed later). The process responsible for $P_\ell$ will ask the process responsible for $P_{\ell-1}$ if it has found any p-matches, and if so it will try to extend the matches to $P_\ell$. As an example, suppose that the process for $P_{\ell-1}$ finds a match at position $i'$ of the text (refer to Figure 1). The process will then store this match along with the fingerprint $\Phi_{\ell-1}(i')$, which has been built up as new symbols arrive. The process for $P_\ell$ will be handed this information when the symbol at position $i' + m_\ell - 1$ arrives. The task is now to work out if $i'$ is also a matching position with $P_\ell$. With the fingerprint $\Phi_\ell(i')$ available (built up as new symbols arrive), the process for $P_\ell$ can use fingerprint arithmetic to determine if $i'$ is a matching position. This is one instance where the situation becomes trickier than one might first think.
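The fingerprint arithmetic referred to here consists of the splitting and zeroing operations of Section 1.4. The following Python sketch shows one way they could be realised. The names `fingerprint`, `split` and `zero`, the fixed modulus `P_MOD` and the choice to pass $r^{-(a+1)}$ directly (one multiplication by $r^{-1}$ away from the $r^{-a}$ mentioned in Section 1.4) are assumptions made for illustration, not details taken from the paper. Computing a modular inverse with `pow(r, -k, p)` requires Python 3.8 or later.

```python
P_MOD = (1 << 61) - 1  # a fixed Mersenne prime standing in for the paper's prime p


def fingerprint(values, r, p=P_MOD):
    """phi(S) = sum_k S[k] * r^k mod p for a sequence of non-negative integers."""
    fp, r_pow = 0, 1
    for v in values:
        fp = (fp + v * r_pow) % p
        r_pow = r_pow * r % p
    return fp


def split(fp_0b, fp_0a, r_inv, p=P_MOD):
    """Splitting: phi(S[a+1, b]) from phi(S[0, b]) and phi(S[0, a]).

    r_inv must be r^{-(a+1)} mod p, which shifts the suffix back to exponent 0.
    """
    return (fp_0b - fp_0a) * r_inv % p


def zero(fp, entries, p=P_MOD):
    """Zeroing: remove the contribution of each (S[z], r^z mod p) pair in entries."""
    for value, r_pow_z in entries:
        fp = (fp - value * r_pow_z) % p
    return fp


if __name__ == "__main__":
    r = 123456789
    s = [0, 1, 0, 2, 2, 0, 1, 4]          # pred("aababcca")
    a, b = 2, 6
    suffix = split(fingerprint(s[:b + 1], r), fingerprint(s[:a + 1], r),
                   pow(r, -(a + 1), P_MOD))
    print(suffix == fingerprint(s[a + 1:b + 1], r))

    zeroed = zero(fingerprint(s, r), [(s[z], pow(r, z, P_MOD)) for z in (1, 7)])
    print(zeroed == fingerprint([0, 0, 0, 2, 2, 0, 1, 0], r))
```

Each zeroed position costs one multiplication and one subtraction, which is where the $O(|Z|)$ bound for zeroing comes from.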
As position $i'$ is a p-match with $P_{\ell-1}$, it suffices to compare the second half of the predecessor string of $P_\ell$ with the second half of the predecessor string of $T[i', i'+m_\ell-1]$. Fingerprints are used for this comparison. It is crucial to understand that $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ cannot be used directly here; some predecessor values of the text might point very far back, namely to some position before index $i'$. In Figure 1 we have shaded the three symbols for which this is true and we have drawn arrows indicating their predecessors. Thus, in order to do the fingerprint comparison correctly we need to set those positions to zero (we want the fingerprint of the predecessor string of the text substring starting at position $i'$, not at the beginning of $T$). The fingerprint we defined as $\Phi'_\ell(i')$ above is the fingerprint we want to compare to the fingerprint of the second half of the predecessor string of $P_\ell$. Using fingerprint operations, we have from the definitions that

$$\Phi'_\ell(i') = \big(\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')\big) \circledcirc \Delta_\ell(i'),$$

where $\Delta_\ell(i')$ is the set of positions that have to be set to zero. For a substring of $T$ of length $\Theta(m_{\ell-1})$, consider the subset of positions which occur in $\Delta_\ell(i')$ for at least one value of $i'$. Any such position has a predecessor value greater than $m_{\ell-1}$. Therefore, by summing over all distinct symbols, we have that the size of this subset is crucially only $O(|\Sigma_P|)$. Thus, we can maintain in small space every position in a suitable length window that will ever have to be set to zero.

Let us go back to the example where the process for $P_{\ell-1}$ had found a p-match at position $i'$. The process stores $i'$ along with the fingerprint $\Phi_{\ell-1}(i')$. This information is not needed by the process for $P_\ell$ until $m_{\ell-1}$ text symbols later. During the arrival of these symbols, the process for $P_{\ell-1}$ might detect more p-matches, in fact many more matches.
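As a concrete illustration of the zeroing set $\Delta_\ell(i')$ introduced above, the following Python sketch works on plain lists rather than fingerprints. The names `delta` and `second_half_pred`, and the parameters `m_prev` and `m_cur` (standing for $m_{\ell-1}$ and $m_\ell$), are hypothetical. A position in the second half belongs to the set exactly when its predecessor in $T$ lies before $i'$, and zeroing those entries of $\mathrm{pred}(T)$ recovers the second half of the predecessor string of the window.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence of s[j], or 0 if none."""
    last, out = {}, []
    for j, ch in enumerate(s):
        out.append(j - last[ch] if ch in last else 0)
        last[ch] = j
    return out


def delta(pred_t, i0, m_prev, m_cur):
    """Offsets (relative to i0 + m_prev) whose predecessor points before i0."""
    return {j - i0 - m_prev
            for j in range(i0 + m_prev, i0 + m_cur)
            if pred_t[j] > j - i0}


def second_half_pred(pred_t, i0, m_prev, m_cur):
    """pred(T[i0, i0+m_cur-1])[m_prev, m_cur-1], obtained by zeroing pred(T)."""
    zeroed = delta(pred_t, i0, m_prev, m_cur)
    return [0 if k in zeroed else pred_t[i0 + m_prev + k]
            for k in range(m_cur - m_prev)]


if __name__ == "__main__":
    t = "abcabdcab"
    pt = pred(t)
    i0, m_prev, m_cur = 3, 3, 6
    print(delta(pt, i0, m_prev, m_cur))
    print(second_half_pred(pt, i0, m_prev, m_cur) == pred(t[i0:i0 + m_cur])[m_prev:m_cur])
```

In the streaming algorithm the same zeroing is applied to the fingerprint $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ rather than to an explicit list, so that only the $O(|\Sigma_P|)$ candidate positions need to be stored.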
Their positions and corresponding fingerprints have to be stored until needed by the process for $P_\ell$. We now have a space issue: how do we store this information in small space?

To appreciate this question, first consider exact matching. Here matches are known to be either an exact period length apart or very far apart. The matching positions can therefore be represented by an arithmetic progression. Further, the fingerprints associated with the matches in an arithmetic progression can easily be stored succinctly, as one can work out each one of the fingerprints from the first one. For parameterized matching the situation is much more complex: matches can occur more chaotically and, as we have seen above, fingerprints must be updated dynamically to reflect that symbols could be mapped differently in two distinct alignments. Handling these difficulties in small space (and small time complexity) is a main hurdle and is one point at which our work differs significantly from all previous work on streaming matching in small space. We cope with this space issue in the next section.

[Figure 2: Partitioning of positions (×) at which $P$ p-matches in a $3m/2$ length substring of $T$ into an explicit set $Y$ followed by an arithmetic progression $A$ with common difference $\rho$.]

2.1 The structure of parameterized matches

First recall that an arithmetic progression is a sequence of numbers such that the (common) difference between any two successive numbers is constant. We can specify an arithmetic progression by its start number, the common difference and the length of the sequence. In the next lemma we will see that the positions at which a string $P$ of length $m$ parameterize matches a longer string of length $3m/2$ can be stored in small memory: either a matching position belongs to an arithmetic progression or it is one of relatively few positions that can be listed explicitly in $O(|\Sigma_P|)$ space. The proof of the lemma (consult Figure 2) is deferred to Section 5.
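The representation just described, an explicit list followed by one arithmetic progression, can be sketched in Python as follows. This is a naive illustration under stated assumptions: `encode` and `decode` are hypothetical names, the input is assumed to be a sorted list of match positions, and the common difference `rho` is assumed to be known. The sketch simply peels off the longest trailing run with difference `rho`; it does not itself guarantee that the explicit part is small, which is what the lemma below supplies.

```python
def encode(positions, rho):
    """Split sorted positions into (explicit list Y, progression A).

    A is a triple (start, rho, count) describing the longest suffix of
    positions whose consecutive differences all equal rho, or None if
    positions is empty.
    """
    if not positions:
        return [], None
    k = len(positions)
    # Walk backwards while consecutive positions differ by exactly rho.
    while k > 1 and positions[k - 1] - positions[k - 2] == rho:
        k -= 1
    start = k - 1
    return positions[:start], (positions[start], rho, len(positions) - start)


def decode(explicit, progression):
    """Rebuild the full sorted list of positions from the compressed form."""
    out = list(explicit)
    if progression is not None:
        first, rho, count = progression
        out.extend(first + t * rho for t in range(count))
    return out


if __name__ == "__main__":
    matches = [3, 10, 14, 20, 24, 28, 32]
    y, a = encode(matches, rho=4)
    print(y, a)
    print(decode(y, a) == matches)
```

The progression costs three integers regardless of how many matches it covers, so the total space is governed by the size of the explicit list.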

Lemma 5. Let $X$ be the set of positions at which $P$ p-matches within a $3m/2$ length substring of $T$. The set $X$ can be partitioned into two sets $Y$ and $A$ such that $|Y| \leq 6|\Sigma_P|$, $\max(Y) < \min(A)$ and $A$ is an arithmetic progression with common difference $\rho$, where $\rho$ is the p-period of $P$.

The lemma is central to the algorithm, as it allows us to store all partial matches (that need to be kept in memory before being discarded) in a total of $O(|\Sigma_P| \log m)$ space across all processes. The question of how to store their associated fingerprints remains, but is resolved by the corollary below, which follows immediately from the proof of Lemma 5. We can afford to store fingerprints explicitly for the positions that are identified as belonging to the set $Y$ from Lemma 5, and for the matching positions in the arithmetic progression $A$ we can, as for exact matching, work out every fingerprint given the first one.

Corollary 6. For pattern $P$, text $T$ and arithmetic progression $A$ as specified in Lemma 5, $\mathrm{pred}(T)[(i+m-\rho), (i+m-1)]$ is the same for all $i \in A$.

2.2 Deamortisation

So far we have described the overall approach, but it is of course a major concern how to carry out computations in constant time per arriving symbol. In order to deamortise the algorithm, we run a separate process responsible for the pattern prefix $P_0$ that uses the deterministic algorithm of Section 4 (i.e. Theorem 4). As $P_0$ has p-period greater than $3\delta$, the p-matches it outputs are at least this far apart. This enables the other processes to operate with a small delay: process $P_\ell$ expects process $P_{\ell-1}$ to hand over matches and fingerprints with a small delay, and it will itself hand over matches and fingerprints to $P_{\ell+1}$ with a small delay. One of the reasons for the delays is that processes operate in a round-robin scheme: one process per arriving symbol.
The process that is responsible for $P_s$ (which has length $m - 4\delta$) returns matches with a delay of up to $3\delta$ arriving symbols. Hence there is a gap of length $\delta$ in which we can work out if the whole of $P$ matches. To do this we have another process that runs in parallel with all other processes and explicitly checks if any match with $P_s$ can be extended with the remaining $4\delta$ symbols by directly comparing their predecessor values with the last $4\delta$ predecessor values of the pattern. This job is spread out over $\delta$ arriving symbols, hence matches with $P$ are outputted in constant time.

3 The main algorithm

We are now in a position to describe the full algorithm of Theorem 1. Recall that the algorithm will find p-matches with each of the pattern prefixes $P_0, \ldots, P_s$ defined in the previous section. If a shorter prefix fails to match at a given position then there is no need to check matches for longer prefixes. Our algorithm runs three main processes concurrently, which we label A, B and C. The term process had a slightly different meaning in the previous section, but hopefully this will cause no confusion. Each process takes $O(1)$ time per arriving symbol. Recall that both the pattern and text alphabets are $\Sigma_P = \{0, \ldots, |\Sigma_P|-1\}$.

Process A finds p-matches with prefix $P_0$, which are inserted as they occur into a match queue $M_0$. Process B finds p-matches for prefixes $P_1, \ldots, P_s$, which are inserted into the match queues $M_1, \ldots, M_s$, respectively. The p-matches are inserted with a delay of up to $3\delta$ symbol arrivals after they occur. Process C finds p-matches with the whole pattern $P$, which are outputted in constant time as they occur, as described in Section 2.2. It is crucial for the space usage that the match queues $M_0, M_1, \ldots, M_s$ are stored in a compressed fashion.
The delay in detecting p-matches with $P_\ell$ in Process B is a consequence of deamortising the work required to find a prefix match, which we spread out over $\Theta(\delta)$ arriving symbols. We can afford to spread out the work in this way because the p-period of $P_{\ell-1}$ is at least $\delta$, so any p-matches are at least this far apart.

Throughout this section we assume that $m > 14\delta$ so that $m_\ell - m_{\ell-1} > 3\delta$ for $\ell \in \{1, \ldots, s\}$. If $m \leq 14\delta$, or the p-period of $P$ is $3\delta$ or less, we use the deterministic algorithm presented in Section 4 to solve the problem within the required bounds.

3.1 Process A (finding matches with $P_0$)

From the definition of $P_0$ we have that if we remove the final character (giving the string $P[0, m_0-2]$) then its p-period is at most $3\delta$. The p-period of $P_0$ itself could be much larger. As part of Process A we run the deterministic pattern matching algorithm from Section 4 (see Theorem 4) on $P[0, m_0-2]$. It returns p-matches in constant time and uses $O(|\Sigma_P| + 3\delta) = O(|\Sigma_P| \log m)$ space. In order to establish matches with the whole of $P_0$ we handle the final character separately. If the deterministic subroutine reports a match that ends in $T[i-1]$, then when $T[i]$ arrives we have a p-match with $P_0$ if and only if $\mathrm{pred}(T)[i] = \mathrm{pred}(P_0)[m_0-1]$ (or $\mathrm{pred}(T)[i] \geq m_0$ if $\mathrm{pred}(P_0)[m_0-1] = 0$, which is the case where the previous occurrence of $T[i]$ lies outside the window $T[i-m_0+1, i]$).

As the alphabet is of the form $\Sigma_P = \{0, \ldots, |\Sigma_P|-1\}$, we can compute the value of $\mathrm{pred}(T)[i]$ in $O(1)$ time by maintaining an array $A$ of length $|\Sigma_P|$ such that for all $\sigma \in \Sigma_P$, $A[\sigma]$ gives the index of the most recent occurrence of symbol $\sigma$. Whenever Process A finds a match with $P_0$ at position $i'$ of the text, the pair $(i', \Phi_0(i'))$ is added to a (FIFO) queue $M_0$, which is queried by Process B when handling prefix $P_1$.

3.2 Process B (finding matches with $P_1, \ldots, P_s$)

We split the discussion of the execution of Process B into $s$ levels, $1, \ldots, s$. For each level $\ell$ the fingerprint $\Phi'_\ell(i')$ is computed for each position $i'$ at which $P_{\ell-1}$ p-matches. Then, as discussed in Section 2, if $\Phi'_\ell(i') = \phi(\mathrm{pred}(P_\ell)[m_{\ell-1}, (m_\ell-1)])$, there is also a match with $P_\ell$ at $i'$. The algorithm will in this case add the pair $(i', \Phi_\ell(i'))$ to the queue $M_\ell$, which is subject to queries by level $\ell+1$. To this end we compute $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ and $\Delta_\ell(i')$, where $\Delta_\ell(i')$ contains all the positions which should be zeroed in order to obtain $\Phi'_\ell(i')$. In the example of Figure 1, $\Delta_\ell(i') = \{1, 5, 7\}$ (the d, e and f, respectively).

In order for Process B to spend only constant time per arriving symbol, all its work must be scheduled carefully. The preparation of the $\Delta_\ell(i')$ values takes place in a subprocess we name B1. Computing $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ and establishing matches takes place in another subprocess named B2. The two subprocesses are run in sequence for each arriving symbol. We now give their details.

Subprocess B1 (prepare zeroing)

We use a queue $D_\ell$ associated with each level $\ell$ which contains the most recent $O(|\Sigma_P|)$ positions with predecessor values greater than $m_{\ell-1}$. We will see below that $\Delta_\ell(i')$ is a subset of the positions in $D_\ell$ (adjusted to the offset $i'$). Unfortunately, in the worst case, for an arriving symbol $T[i]$, $i$ could belong to all of the $D_\ell$ queues. Since we can only afford constant time per arriving symbol, we cannot insert $i$ into more than a constant number of queues. The solution is to buffer arriving symbols. When some $T[i]$ arrives we first check whether $\mathrm{pred}(T)[i] > m_0$. If so, the pair $(i, \mathrm{pred}(T)[i])$ is added to a buffer $B$ to be dealt with later.
Together with the pair we also store the value $r^i \bmod p$, which will be needed to perform the required zeroing operations.

In addition to adding a new element to the buffer $B$, Subprocess B1 will also process elements from $B$. If it is currently not in the state of processing an element, it will now start doing so by removing an element from $B$ (unless $B$ is empty). Call this element $(j, \mathrm{pred}(T)[j])$. Over the next $s$ arriving symbols Subprocess B1 will do the following. For each of the $s$ levels $\ell$, if $\mathrm{pred}(T)[j] > m_{\ell-1}$, add $(j, \mathrm{pred}(T)[j])$ to the queue $D_\ell$. If $D_\ell$ contains more than $12|\Sigma_P|$ elements, discard the oldest.

Subprocess B2 (establish matches)

This subprocess schedules the work across the levels in a round-robin fashion by only considering level $\ell = 1 + (i \bmod s)$ when the symbol $T[i]$ arrives. Potential matches may not be reported by this subprocess until up to $3\delta$ arriving symbols after they occur. As $P_{\ell-1}$ has p-period at least $3\delta$, the processing of potential matches does not overlap.

Subprocess B2 for level $\ell$ is always in one of two states: either it is checking whether a matching position $i'$ for $P_{\ell-1}$ is also a match with $P_\ell$, or it is idle. If idle, level $\ell$ looks into queue $M_{\ell-1}$, which holds matches with $P_{\ell-1}$. If $M_{\ell-1}$ is non-empty, level $\ell$ removes an element from $M_{\ell-1}$, call this element $(i', \Phi_{\ell-1}(i'))$, and enters the checking state. Whenever $i > i' + m_\ell + \delta$, level $\ell$ will start checking if $i'$ is also a matching position with $P_\ell$. It does so by first computing the fingerprint $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$, which by definition equals $\big(\Phi_\ell(i') - \Phi_{\ell-1}(i')\big)\, r^{-i'-m_{\ell-1}} \bmod p$. We can ensure the fingerprint $\Phi_\ell(i')$ is always available when needed by maintaining a circular buffer of the most recent $\Theta(\delta)$ fingerprints of the text.
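One possible shape for such a circular buffer is sketched below in Python. The class name `PredFingerprintStream`, its method names and the `capacity` parameter (standing in for the $\Theta(\delta)$ bound) are assumptions made for illustration. The sketch combines the last-occurrence array of Section 3.1 with a running fingerprint of $\mathrm{pred}(T)$, and keeps only the most recent `capacity` prefix fingerprints.

```python
class PredFingerprintStream:
    """Running fingerprint of pred(T) with a ring of recent prefix fingerprints."""

    def __init__(self, sigma, r, p, capacity):
        self.last = [-1] * sigma        # last[c]: index of most recent occurrence of c
        self.r, self.p = r, p
        self.r_pow = 1                  # r^i mod p for the next index i
        self.fp = 0                     # phi(pred(T)[0, i-1]) so far
        self.i = 0                      # number of symbols consumed
        self.ring = [0] * capacity      # ring[j % capacity] = phi(pred(T)[0, j])

    def push(self, symbol):
        """Consume one symbol in O(1) time and return its predecessor value."""
        prev = self.last[symbol]
        d = self.i - prev if prev >= 0 else 0
        self.last[symbol] = self.i
        self.fp = (self.fp + d * self.r_pow) % self.p
        self.ring[self.i % len(self.ring)] = self.fp
        self.r_pow = self.r_pow * self.r % self.p
        self.i += 1
        return d

    def prefix_fp(self, j):
        """phi(pred(T)[0, j]) if index j is still inside the buffered window."""
        if j < 0 or j >= self.i or j < self.i - len(self.ring):
            raise IndexError("fingerprint for index %d is not buffered" % j)
        return self.ring[j % len(self.ring)]


if __name__ == "__main__":
    stream = PredFingerprintStream(sigma=3, r=987654321, p=(1 << 61) - 1, capacity=4)
    for ch in "aababcca":
        stream.push(ord(ch) - ord("a"))   # map a, b, c to the dense alphabet {0, 1, 2}
    print(stream.prefix_fp(7), stream.prefix_fp(4))
```

Asking for an index that has already been overwritten raises `IndexError`, which mirrors the requirement that each level must consume a fingerprint within the buffered window.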
Similarly we can obtain $r^{-i' - m_{\ell-1}} \bmod p$ in $O(1)$ time by keeping a buffer of the most recent $\Theta(\delta)$ values of $r^{-i} \bmod p$ along with $r^{-m_\ell} \bmod p$ for all $\ell$.

Over the next at most $|\Sigma_P|$ arriving symbols for which Subprocess B2 is considering level $\ell$ (i.e. those with $\ell = 1 + (i \bmod s)$), $\Phi'_\ell(i')$ will be computed from $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ by stepping through the elements of the queue $D_\ell$. For any element $(j, \mathrm{pred}(T)[j]) \in D_\ell$, we have that $(j - i' - m_{\ell-1}) \in \Delta_\ell(i')$ if and only if $\mathrm{pred}(T)[j] > j - i'$. Further, as Subprocess B1 stored $r^j \bmod p$ with the element in $D_\ell$ and $r^{i'} \bmod p$ is obtained through the circular buffer as above, we can perform the zeroing in $O(1)$ time.

Having computed $\Phi'_\ell(i')$, we then compare it to $\phi(\mathrm{pred}(P_\ell)[m_{\ell-1}, (m_\ell - 1)])$. If they are equal, we have a p-match with $P_\ell$ at position $i'$ of the text, and the pair $(i', \Phi_\ell(i'))$ is added to the queue $M_\ell$. This occurs before $T[i' + m_\ell + 3\delta]$ arrives.

3.3 Correctness, time and space analysis

The time and space complexity follow almost immediately from the description of our algorithm, but a little more attention is required to verify that the algorithm actually works. In particular one has to show that buffers do not overflow, that elements in queues are dealt with before being discarded, and that every possible match will be found (disregarding the probabilistic error in the fingerprint comparisons). The proof of the next lemma is given in Appendix B.

Lemma 7. The algorithm described above proves Theorem 1.

4 The deterministic matching algorithm

We now describe the deterministic algorithm of Theorem 4. Its running time is $O(1)$ per character and it uses $O(|\Sigma_P| + \rho)$ words of space, where $\rho$ is the parameterized period of $P$. We require that both the pattern and text alphabets are $\Sigma_P = \{0, \ldots, |\Sigma_P| - 1\}$.
We first briefly summarise the overall approach of [1], which our algorithm follows. It resembles the classic KMP algorithm. When $T[i]$ arrives, the overall goal is to calculate the largest $r$ such that $P[0, r-1]$ p-matches $T[(i - r + 1), i]$. A p-match occurs iff $r = m$. When a new text character $T[i+1]$ arrives the algorithm compares $\mathrm{pred}(P)[r]$ to $\mathrm{pred}(T)[i+1]$ in $O(1)$ time to determine whether $P[0, r]$ p-matches $T[(i - r + 1), i + 1]$. More precisely, the algorithm checks whether either $\mathrm{pred}(P)[r] = \mathrm{pred}(T)[i+1]$, or $\mathrm{pred}(P)[r] = 0 \wedge \mathrm{pred}(T)[i+1] > r$. The second case covers the possibility that the previous occurrence in the text was outside the window. If there is a match, we set $r \leftarrow r + 1$ and $i \leftarrow i + 1$ and continue with the next text character. If not, we shift the pattern prefix $P[0, r-1]$ along by its p-period, denoted $\rho_{r-1}$, so that it is aligned with $T[(i - r + \rho_{r-1} + 1), i]$. This is the next candidate for a p-match. In the original algorithm, the p-periods of all prefixes are stored in an array of length $m$ called a prefix table.

The main hurdle we must tackle is to store both a prefix table suitable for parameterized matching and an encoding of the pattern in only $O(|\Sigma_P| + \rho)$ space, while still allowing efficient access to both. It is well known that any string $P$ can be stored in space proportional to its exact period. In Lemma 9, which follows from Lemma 8, we show an analogous result for $\mathrm{pred}(P)$. See Appendix D for proofs.

Lemma 8. For any $j \in [\rho]$ there is a constant $k_j$ such that $\mathrm{pred}(P)[j + k\rho]$ is $0$ for $k < k_j$, and $c_j$ for $k \ge k_j$, where $c_j > 0$ is a constant that depends on $j$.

Lemma 9. The predecessor string $\mathrm{pred}(P)$ can be stored in $O(\rho)$ space, where $\rho$ is the p-period of $P$. Further, for any $j \in [m]$ we can obtain $\mathrm{pred}(P)[j]$ from this representation in $O(1)$ time.
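The comparison step in the summary of [1] above can be illustrated with a minimal Python sketch. It assumes the usual definition of the predecessor string (the distance to the previous occurrence of the same symbol, or 0 if there is none) and that two equal-length strings p-match exactly when their predecessor strings are equal. The function names `pred`, `p_matches` and `extends` are hypothetical and are not taken from the paper or from [1].

```python
def pred(s):
    """Predecessor string (assumed definition): for each position, the
    distance to the previous occurrence of the same symbol, or 0 if none."""
    last_seen = {}
    out = []
    for i, symbol in enumerate(s):
        out.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return out


def p_matches(p, t):
    """Equal-length strings p-match iff their predecessor strings agree."""
    return len(p) == len(t) and pred(p) == pred(t)


def extends(pred_p_r, pred_t_next, r):
    """O(1) extension test from the text.

    Assuming P[0, r-1] already p-matches T[i-r+1, i], return True iff
    P[0, r] p-matches T[i-r+1, i+1], where pred_p_r = pred(P)[r] and
    pred_t_next = pred(T)[i+1] (computed over the whole text).
    """
    # Case 1: both symbols repeat at the same distance (or both are new).
    # Case 2: the pattern symbol is new and the previous occurrence of the
    #         text symbol lies outside the current window of length r + 1.
    return pred_p_r == pred_t_next or (pred_p_r == 0 and pred_t_next > r)
```

For the example from the introduction, `p_matches("abbca", "bddcb")` holds while `p_matches("abbca", "bddbb")` does not. The sketch covers only the extension test; the pattern shifts and the compressed prefix table discussed next are not modelled.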
We now explain how to store the parameterized prefix table in only $O(\rho)$ space, in contrast to the $\Theta(m)$ space which a standard prefix table would require. The p-period $\rho_r$ of $P[0, r]$ is, as a function of $r$, non-decreasing in $r$. This property enables us to run-length encode the prefix table and store it as a doubly linked list with at most $\rho$ elements, hence using only $O(\rho)$ space. Each element corresponds to an interval of prefix lengths with the same p-period, and the elements are linked together in increasing order (of the common p-period). This representation does not allow $O(1)$ time random access to the p-period of any prefix; however, for our purposes it will suffice to perform sequential access. To accelerate computation we also store a second linked list of the indices of the first occurrences of each symbol in $P$ in ascending order, i.e. every $j$ such that $\mathrm{pred}(P)[j] = 0$. This uses $O(|\Sigma_P|)$ space.

There is a crucial second advantage to compressing the prefix table, which is that it allows us to upper bound the number of prefixes of $P$ we need to inspect when a mismatch occurs. When a mismatch occurs in our algorithm, we repeatedly shift the pattern until a p-match between a text suffix and pattern prefix occurs. Naively it seems that we might have to check many prefixes within the same run. However, as a consequence of Lemma 8 we are assured that if some prefix does not p-match, every prefix in the same run with $\mathrm{pred}(P)[j] \ne 0$ will also mismatch (except possibly the longest). Therefore we can skip inspecting these prefixes. This can be seen by observing (using Lemma 8) that for $j$ such that $\rho_j = \rho_{j+1}$, we have $\mathrm{pred}(P)[j - \rho_j] \in \{0, \mathrm{pred}(P)[j]\}$.

By keeping pointers into both linked lists, it is straightforward to find the next prefix to check in $O(1)$ time. Whenever we perform a pattern shift we move at least one of the pointers to the left.
Therefore the total number of pattern shifts inspected while processing $T[i]$ is $O(|\Sigma_P| + \rho)$. As each pointer only moves to the right by at most one when each $T[i]$ arrives, an amortised time complexity of $O(1)$ per character follows. The space usage is $O(|\Sigma_P| + \rho)$ as claimed, dominated by the linked lists.

We now briefly discuss how to deamortise our solution by applying Galil's KMP deamortisation argument [11]. The main idea is to restrict the algorithm to shift the pattern at most twice when each text character arrives, giving a constant time algorithm. If we have not finished processing $T[i]$ by this point we accept $T[i+1]$ but place it on the end of a buffer, output 'no match' and continue processing $T[i]$. The key property is that the number of text arrivals until the next p-match occurs is at least the length of the buffer. As we shift the pattern up to twice during each arrival we always clear the buffer before (or as) the next p-match occurs. Further, the size of the buffer is always $O(|\Sigma_P| + \rho)$. This follows from the observation above that the number of pattern shifts required to process a single text character is $O(|\Sigma_P| + \rho)$. This concludes the algorithm of Theorem 4. Combining this result with the lower bound result of Appendix A proves Theorem 4.

5 The proof of Lemma 5

In this section we prove the important Lemma 5. Let $i_{\mathrm{left}}$ denote an arbitrary position in $T$ where $P$ p-matches. Let $X$ be the set of positions at which $P$ p-matches within $T[i_{\mathrm{left}}, (i_{\mathrm{left}} + 3m/2 - 1)]$. We now prove that there exist disjoint sets $Y$ and $A$ with the properties set out in the statement of the lemma.

Let $\alpha$ be the smallest integer such that all distinct symbols in $P$ occur in the prefix $P[0, \alpha]$. We begin by showing that $\rho$, the p-period of $P$, is at least $\alpha / |\Sigma|$. From the minimality of $\alpha$, we have that $P[\alpha]$ is the leftmost occurrence of some symbol.
By the definition of the p-period, we have that $P[0, (m - 1 - \rho)]$ p-matches $P[\rho, m - 1]$. Under this shift, $P[\alpha]$ (in $P[\rho, m - 1]$) is aligned with $P[\alpha - \rho]$ (in $P[0, (m - 1 - \rho)]$). Assume that $P[\alpha - \rho]$ is not a leftmost occurrence and let $j$ be the position of the previous occurrence, so that $P[j] = P[\alpha - \rho]$. As a parameterized match occurs, we have that $P[j + \rho] = P[\alpha]$ with $j + \rho < \alpha$, which contradicts the fact that $P[\alpha]$ is a leftmost occurrence. By repeating this argument we have found distinct symbols at positions $\alpha - k\rho$ for all $k > 0$. This immediately implies that $\rho \ge \alpha / |\Sigma|$.

We first deal with two simple cases: $\rho \ge m/8$ or $\alpha \ge m/4$ (which implies that $\rho \ge m / (4|\Sigma|)$). In these two cases the number of p-matches is easily upper bounded by $6|\Sigma|$, so all positions can be stored in the set $Y$. We therefore continue under the assumption that $\alpha < m/4$ and $\rho < m/8$.

As $\rho \ge \alpha / |\Sigma|$, there are at most $(\alpha + 1) / (\alpha / |\Sigma|) \le 2|\Sigma|$ positions from the range $[i_{\mathrm{left}}, i_{\mathrm{left}} + \alpha]$ at which $P$ can parameterize match $T$. We can store these positions in the set $Y$. Next we will show that the positions from the range $[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + 3m/2 - 1)]$ at which $P$ parameterize matches $T$ can be represented by the arithmetic progression $A$.

First we show that $\rho$ is an exact period (not p-period) of $\mathrm{pred}(P)[\alpha + 1, m - 1]$ (but not necessarily the shortest period). Consider arbitrary positions $P[j]$ and $P[j - \rho]$ where $\alpha < j < m - \rho$. By the definition of the p-period, we have that $P[\rho, m - 1]$ p-matches $P[0, (m - 1 - \rho)]$ and hence that $\mathrm{pred}(P[\rho, m - 1]) = \mathrm{pred}(P[0, (m - 1 - \rho)])$. In particular, $\mathrm{pred}(P[\rho, m - 1])[j] = \mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j]$, where the second equality follows because we take the predecessor string of a prefix of $P$. Also observe that $\mathrm{pred}(P[\rho, m - 1])[j]$ either equals $0$ or $\mathrm{pred}(P)[j - \rho]$ by definition.
Further, $\mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j] \ne 0$ as $j > \alpha$ and all leftmost occurrences are before $\alpha$. This implies that $\mathrm{pred}(P[\rho, m - 1])[j] \ne 0$, hence, as required, $\mathrm{pred}(P)[j - \rho] = \mathrm{pred}(P[\rho, m - 1])[j] = \mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j]$.

Recall that $P$ p-matches $T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1]$, so $\mathrm{pred}(P) = \mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])$ and hence $\rho$ is an exact period of $\mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])[\alpha + 1, m - 1]$. Let $j \in \{\alpha + 1, \ldots, m - 2\}$ and observe that by definition, $\mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])[j] \in \{0, \mathrm{pred}(T)[i_{\mathrm{left}} + j]\}$. However, $\mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[j] = \mathrm{pred}(P)[j] > 0$ because $j > \alpha$ and all leftmost occurrences are in $P[0, \alpha]$. This implies that $\mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[j] = \mathrm{pred}(T)[i_{\mathrm{left}} + j]$. As $j$ was arbitrary, we have that $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)] = \mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[\alpha + 1, m - 1]$ and hence $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)]$.

Let $i_{\mathrm{right}}$ be the rightmost position in $T[i_{\mathrm{left}}, i_{\mathrm{left}} + 3m/2 - 1]$ where $P$ p-matches. By the same argument as for $i_{\mathrm{left}}$, we have that $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{right}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$. Thus, both $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)]$ and $\mathrm{pred}(T)[(i_{\mathrm{right}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$ have an exact period of $\rho$. As these two strings overlap by at least $\rho$ characters, we have that $\rho$ is also an exact period of $\mathrm{pred}(T)[i_{\mathrm{left}} + \alpha + 1, i_{\mathrm{right}} + m - 1]$.

Let $i \in \{(i_{\mathrm{left}} + \alpha + 1), \ldots, i_{\mathrm{right}} - 1\}$ be arbitrary such that $P$ p-matches $T[i, (i + m - 1)]$. We now prove that if $i + \rho < i_{\mathrm{right}}$ then $P$ p-matches $T[i + \rho, (i + \rho + m - 1)]$.
As p-matches must be at least $\rho$ characters apart, this is sufficient to conclude that all remaining matches form an arithmetic progression with common difference $\rho$. As $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$, we have that $\mathrm{pred}(T)[i, (i + m - 1)] = \mathrm{pred}(T)[i + \rho, (i + \rho + m - 1)]$. By definition, this implies that $\mathrm{pred}(T[i, (i + m - 1)]) = \mathrm{pred}(T[i + \rho, (i + \rho + m - 1)])$ and hence a p-match also occurs at $i + \rho$. This concludes the proof of Lemma 5.

6 Acknowledgements

The work described in this paper was supported by EPSRC. The authors would like to thank Raphaël Clifford for many helpful and encouraging discussions.

References

[1] A. Amir, M. Farach, and S. Muthukrishnan. “Alphabet dependence in parameterized matching”. In: IPL 49.3 (1994), pp. 111–115.
[2] A. Andersson and M. Thorup. “Tight(er) worst-case bounds on dynamic searching and priority queues”. In: STOC '00. 2000, pp. 335–342.
[3] B. S. Baker. “A theory of parameterized pattern matching: algorithms and applications”. In: STOC '93. 1993, pp. 71–80.
[4] B. S. Baker. “Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance”. In: SIAM J. on Comp. 26.5 (1997), pp. 1343–1362.
[5] B. S. Baker. “Parameterized Pattern Matching: Algorithms and Applications”. In: JCSS 52.1 (1996), pp. 28–42.
[6] B. S. Baker. “Parameterized Pattern Matching by Boyer-Moore-Type Algorithms”. In: SODA '95. 1995, pp. 541–550.
[7] D. Breslauer and Z. Galil. “Real-Time Streaming String-Matching”. In: CPM '11. 2011, pp. 162–172.
[8] R. Clifford, K. Efremenko, B. Porat, and E. Porat. “A Black Box for Online Approximate Pattern Matching”. In: CPM '08. 2008, pp. 143–151.
[9] R. Clifford, M. Jalsenius, E. Porat, and B. Sach. “Space Lower Bounds for Online Pattern Matching”. In: CPM '11. 2011, pp. 184–196.
[10] F. Ergun, H. Jowhari, and M. Sağlam. “Periodicity in streams”.
In: RANDOM '10. 2010, pp. 545–559.
[11] Z. Galil. “String Matching in Real Time”. In: J. ACM 28.1 (1981), pp. 134–149.
[12] C. Hazay, M. Lewenstein, and D. Sokol. “Approximate parameterized matching”. In: ACM Trans. Algorithms 3.3 (2007).
[13] R. M. Karp and M. O. Rabin. “Efficient randomized pattern-matching algorithms”. In: IBM J. Res Dev 31.2 (1987), pp. 249–260.
[14] D. E. Knuth, J. H. Morris, and V. B. Pratt. “Fast pattern matching in strings”. In: SIAM J. on Comp. 6 (1977), pp. 323–350.
[15] B. Porat and E. Porat. “Exact And Approximate Pattern Matching In The Streaming Model”. In: FOCS '09. 2009, pp. 315–323.

A Space lower bounds

To complete the picture we give nearly matching space lower bounds which show that our solutions are optimal to within log factors. The proof is by a communication complexity argument. In essence one can show that in the randomised case Alice is able to transmit any string of length $\Theta(|\Sigma_P|)$ bits to Bob using a solution to the matching problem by selecting a suitable pattern and streaming text. Similarly in the deterministic case (see below) one can show that she can send $\Theta(|\Sigma_P| + \rho)$ bits.

Proof of Theorem 2. Consider first a pattern where all symbols are distinct, e.g. $P$ = 123456. Now let us assume Alice would like to send a bit-string to Bob. She can encode the bit-string as an instance of the parameterized matching problem in the following way. As an example, assume the bit-string is 01011. She first creates the first half of a text stream aBcDE, where we choose capitals to correspond to 1 and lower case symbols to correspond to 0 from the original bit-string. She starts the matching algorithm and runs it until the pattern and the first half of the text have been processed and then sends a snapshot of the memory to Bob. Bob then continues with the second half of the text, which is fixed to be the sorted lower case symbols, in this case abcde.
Where Bob finds a parameterized match he outputs a 1 and where he does not, he outputs a 0. Thus Alice's bit-string is reproduced by Bob. In general, if we restrict the alphabet size of the pattern to be $|\Sigma_P|$ then Alice can similarly encode a bit-string of length $|\Sigma_P| - 1$, and successfully transmit it to Bob, giving us an $\Omega(|\Sigma_P|)$ bit lower bound on the space requirements of any streaming algorithm.

If randomisation is not allowed, the lower bound increases to $\Omega(|\Sigma_P| + \rho)$ bits of space. Here $\rho$ is the parameterized period of the pattern. This bound follows by a similar argument by devising a one-to-one encoding of bit-strings of length $\Theta(\rho)$ into $P[0 \ldots \rho - 1]$. The key difference is that with a deterministic algorithm, Bob can enumerate all possible $m$-length texts to recover Alice's bit-string from $P$.

B Correctness proof of the main algorithm

Proof of Lemma 7. Coupled with the discussion in Section 2, the time and space complexity follow almost immediately from the description. It only remains to show that, at any time, $|\mathcal{B}| \le |\Sigma_P|$. First observe that any symbol $\sigma \in \Sigma_T$ is only inserted into $\mathcal{B}$ when $\mathrm{pred}(T)[i] > m_0 > \delta$, which can only happen at most once in every $\delta = |\Sigma_P| \log m$ arriving symbols. Further we remove one element every $s \le \lceil \log m \rceil$ arrivals and in particular remove the $\sigma$ occurrence after at most $|\mathcal{B}| \lceil \log m \rceil$ arrivals. As $\mathcal{B}$ is initially empty, by induction it follows that no symbol occurs more than once in $\mathcal{B}$.

For correctness, it remains to show that we correctly obtain the positions of $\Phi'_\ell(i')$ from $D_\ell$. It follows from the description that all positions of $\Phi'_\ell(i')$ correspond to elements inserted into $D_\ell$ at some point. However we need to prove that these elements are present in $D_\ell$ while $\Phi'_\ell(i')$ is calculated.
Any element inserted into $\mathcal{B}$ during $T[i', (i' + m_\ell - 1)]$ has cleared the buffer by the end of interval B (which has length $\delta$) by the argument above. Therefore any relevant element has been inserted into $D_\ell$ by the start of interval C, during which we calculate $\Phi'_\ell(i')$. Any element inserted into $D_\ell$ is at least $m_{\ell-1}$ characters from its predecessor. Therefore, summing over all symbols in the alphabet, there are at most $4|\Sigma_P|$ positions in $T[i', (i' + 2m_\ell - 1)]$ which are inserted into $D_\ell$. As $D_\ell$ is a FIFO queue of size $12|\Sigma|$, the relevant elements are still present after interval C.

As commented earlier, potential matches in $M_\ell$ are separated by more than $3\delta$ arrivals because $P_{\ell-1}$ has p-period more than $3\delta$. They are processed within $3\delta$ arrivals, so $M_\ell$ does not overflow. This completes the correctness proof.

C Proof of Theorem 3 (general alphabets)

Let $\Sigma_T$ denote the text alphabet. In order to handle general alphabets we perform two reductions in sequence on each arriving text symbol (and on $P$ during preprocessing). The first reduces $\Sigma_P$ and $\Sigma_T$ to each contain only symbols from $\Pi$ and one additional variable symbol (which is different for $P$ and $T$). A suitable such reduction is given in [1] (Lemma 2.2). The reduction is presented for the offline version but immediately generalises by using the constant time exact matching algorithm of Breslauer and Galil [7].

We now define $\Sigma'_P$ to be the pattern alphabet after the first reduction (and $\Sigma'_T$ respectively). Note that $|\Sigma'_P| = |\Sigma'_T| = |\Pi| + 1$ and all pattern symbols are variables. However we have no guarantee on the bit representations of the alphabet symbols. Let $T'$ and $P'$ denote the text and pattern after the first reduction. The second reduction now maps each $T'[i]$ into the range $\{0, \ldots, |\Sigma'_P|\}$ as it arrives.
The equivalent reduction for the pattern is a simplification which can be performed in preprocessing.

Let the strings $S$ and $S_{\mathrm{filt}}$ denote the last $m$ characters of the unfiltered (post first reduction) and filtered (post second reduction) stream, respectively. Let $\Sigma_{\mathrm{last}} \subseteq \Sigma'_T$ denote the up to $|\Sigma'_P| + 1$ last distinct symbols in $S$, hence $|\Sigma_{\mathrm{last}}|$ is never more than $|\Sigma'_P| + 1$. Let $\mathcal{T}$ be a dynamic dictionary on $\Sigma_{\mathrm{last}}$ such that a symbol in $\Sigma'_T$ can be looked up, deleted and added in $O\left(\sqrt{\log |\Sigma'_P| / \log\log |\Sigma'_P|}\right)$ time [2]. Every symbol that arrives in the stream is associated with its "arrival time", which is an integer that increases by one for every new symbol arriving in the stream. Let $L$ be an ordered list of the symbols in $\Sigma_{\mathrm{last}}$ (together with their most recent arrival time) such that $L$ is ordered according to the most recent arrival time. For example,

$$L = (d, 25), (b, 33), (g, 58), (e, 102) \tag{1}$$

means that the symbols b, d, e and g are the last four distinct symbols that appear in $S$ (for this example, $|\Sigma'_P| + 1 \ge 4$), where the last e arrived at time 102, the last g arrived at time 58, and so on.

By using appropriate pointers between elements of the dictionary $\mathcal{T}$ and elements of $L$ (which could be implemented as a linked list), we can maintain $\mathcal{T}$ and $L$ in $O(1)$ time per arriving symbol. To see this, take the example in Equation (1) and consider the arrival of a new symbol $x$ at time 103 (following the last symbol e). First we look up $x$ in $\mathcal{T}$ and, if $x$ already exists in $\Sigma_{\mathrm{last}}$, move it to the right end of $L$ by deleting and inserting where needed and update the element to $(x, 103)$. We also check whether the leftmost element of $L$ is a symbol that has been pushed outside of $S$ when $x$ arrived. We use its arrival time to determine this and remove that element if so.
If the arriving symbol $x$ does not already exist in $\Sigma_{\mathrm{last}}$, then we add $(x, 103)$ to the right end of $L$. To ensure that $L$ does not contain more than $|\Sigma'_P| + 1$ elements, we remove the leftmost element of $L$ if necessary. We also remove the leftmost symbol if it has been pushed outside of $S$. The dictionary $\mathcal{T}$ is of course updated accordingly as well.

Let $\Sigma_{\mathrm{filt}} = \{0, \ldots, |\Sigma'_P|\}$ denote the symbols output by the filter. We augment the elements of $L$ to maintain a mapping $M$ from the symbols in $\Sigma_{\mathrm{last}}$ to distinct symbols in $\Sigma_{\mathrm{filt}}$ as follows. Whenever a new symbol is added to $\Sigma_{\mathrm{last}}$, map it to an unused symbol in $\Sigma_{\mathrm{filt}}$. If no such symbol exists, then use the symbol that is associated with the symbol of $\Sigma_{\mathrm{last}}$ that is to be removed from $\Sigma_{\mathrm{last}}$ (note that $|\Sigma_{\mathrm{last}}| \le |\Sigma_{\mathrm{filt}}|$). The mapping $M$ specifies the filtered stream: when a symbol $x$ arrives, the filter outputs $M(x)$. Finding $M(x)$ and updating $\mathcal{T}$ is done in $O(1)$ time per arriving character, and both the dictionary $\mathcal{T}$ and the list $L$ can be stored in $O(|\Sigma'_P|)$ space.

It remains to show that the filtered stream does not induce any false matches or miss a potential match. Suppose first that the number of distinct symbols in $S$ is $|\Sigma'_P|$ or fewer. That is, $\Sigma_{\mathrm{last}}$ contains all distinct symbols in $S$. Every symbol $x$ in $S$ has been replaced by a unique symbol in $\Sigma_{\mathrm{filt}}$ and the construction of the filter ensures that the mapping is one-to-one. Thus, $\mathrm{pred}(S_{\mathrm{filt}}) = \mathrm{pred}(S)$. Suppose second that the number of distinct symbols in $S$ is $|\Sigma'_P| + 1$ or more. That is, $|\Sigma_{\mathrm{last}}| = |\Sigma'_P| + 1$ and therefore $S_{\mathrm{filt}}$ contains $|\Sigma'_P| + 1$ distinct symbols. Thus, $\mathrm{pred}(S_{\mathrm{filt}})$ cannot equal $\mathrm{pred}(P')$. The claimed result then follows from Theorem 1.

D Proofs omitted from Section 4

Proof of Lemma 8. Let $\rho$ be the p-period of $P$. We prove the lemma by contradiction.
Suppose, for some $j$ and $k$, that $i = j + k\rho$ is a position such that $\mathrm{pred}(P)[i] = c \ge 1$ and $\mathrm{pred}(P)[i + \rho] = c' \ne c$. Consider Figure 3 for a concrete example, where $\rho = 5$, $i = 12$, $\mathrm{pred}(P)[12] = c = 4$ and $\mathrm{pred}(P)[12 + 5] = c' = 3$. Since $\rho$ is a p-period of $P$, we have that $\mathrm{pred}(P[\rho, m - 1]) = \mathrm{pred}(P[0, (m - 1 - \rho)])$. Consider the alignment of positions $i + \rho$ and $i$ (positions 17 and 12 in Figure 3). We have that $\mathrm{pred}(P[\rho, m - 1])[i]$ is either $c'$ or $0$. In either case, it is certainly not $\mathrm{pred}(P[0, m - 1 - \rho])[i]$, which is $c$. Thus, $\rho$ cannot be a p-period of $P$.

[Figure 3: An example demonstrating the structure of $\mathrm{pred}(P')$ used in the proof of Lemma 8. The figure shows two copies of $\mathrm{pred}(P')$ aligned at a shift of $\rho = 5$, with positions indexed 0 to 18 and 0 to 13.]

Proof of Lemma 9. By Lemma 8 we can encode $\mathrm{pred}(P)$ by storing the two values $k_j$ and $c_j$ for each $j \in [\rho]$. This takes $O(\rho)$ space. The value $\mathrm{pred}(P)[i]$ is $0$ if $\lfloor i / \rho \rfloor < k_{(i \bmod \rho)}$, otherwise it is $c_{(i \bmod \rho)}$.
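The encoding in the proof of Lemma 9 can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes the usual predecessor-string definition (distance to the previous occurrence, 0 if none), takes $k_j$ to count whole periods as in the statement of Lemma 8, and computes the p-period naively in quadratic time purely for demonstration. The names `p_period`, `compress_pred` and `lookup` are hypothetical, and a sentinel value is used for residue classes that never become non-zero.

```python
def pred(s):
    """Predecessor string (assumed definition): distance to the previous
    occurrence of the same symbol, or 0 if there is none."""
    last_seen = {}
    out = []
    for i, symbol in enumerate(s):
        out.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return out


def p_period(s):
    """Smallest rho >= 1 such that s[rho:] p-matches s[:len(s) - rho].
    Naive check for illustration only; it is not the streaming algorithm."""
    m = len(s)
    for rho in range(1, m + 1):
        if pred(s[rho:]) == pred(s[:m - rho]):
            return rho
    return m


def compress_pred(s):
    """Return (rho, k, c), storing one pair (k[j], c[j]) per residue j < rho."""
    if not s:
        raise ValueError("pattern must be non-empty")
    rho = p_period(s)
    k = [len(s)] * rho  # sentinel: residue class j never becomes non-zero
    c = [0] * rho
    for i, value in enumerate(pred(s)):
        j = i % rho
        if value != 0 and c[j] == 0:
            # First non-zero entry in class j; by Lemma 8 later ones equal it.
            k[j], c[j] = i // rho, value
    return rho, k, c


def lookup(rho, k, c, i):
    """Recover pred(P)[i] in O(1) time from the O(rho)-size encoding."""
    j = i % rho
    return 0 if i // rho < k[j] else c[j]
```

For example, for the pattern abcabcabc the p-period is 1 and the whole predecessor string is captured by the single pair $k_0 = 3$, $c_0 = 3$. Only the returned triple has size $O(\rho)$; building it here reads the full pattern, which the sketch does not try to avoid.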
