Parameterized Matching in the Streaming Model
We study the problem of parameterized matching in a stream where we want to output matches between a pattern of length m and the last m symbols of the stream before the next symbol arrives. Parameterized matching is a natural generalisation of exact …
Authors: Markus Jalsenius, Benny Porat, Benjamin Sach
Abstract

We study the problem of parameterized matching in a stream where we want to output matches between a pattern of length m and the last m symbols of the stream before the next symbol arrives. Parameterized matching is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of pattern symbols is allowed. We show how this problem can be solved in constant time per arriving stream symbol and sublinear, near optimal space with high probability. Our results are surprising and important: it has been shown that almost no streaming pattern matching problems can be solved (not even randomised) in less than Θ(m) space, with exact matching as the only known problem to have a sublinear, near optimal space solution. Here we demonstrate that a similar sublinear, near optimal space solution is achievable for an even more challenging problem. The proof is considerably more complex than that for exact matching.

1 Introduction

We consider the problem of pattern matching in a stream where we want to output matches between a pattern of length m and the last m symbols of the stream. Each answer must be reported before the next symbol arrives. The problem we consider in this paper is known as parameterized matching and is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of the pattern symbols is allowed (one per alignment). For example, if the pattern is abbca then there is a parameterized match with bddcb as we can apply the relabelling a → b, b → d, c → c. There is however no parameterized match with bddbb. We show how this streaming pattern matching problem can be solved in near constant time per arriving stream symbol and sublinear, near optimal, space with high probability.
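The relabelling example above can be checked with a direct, non-streaming test. The following is a minimal sketch only: the function name `p_match` is hypothetical, and it stores both strings in full, so it illustrates the definition of a parameterized match rather than the small-space algorithm of this paper.

```python
def p_match(pattern: str, window: str) -> bool:
    """Return True if a one-to-one relabelling maps pattern onto window."""
    if len(pattern) != len(window):
        return False
    forward = {}   # pattern symbol -> window symbol
    backward = {}  # window symbol -> pattern symbol (enforces injectivity)
    for p, w in zip(pattern, window):
        # setdefault records the pairing on first sight and returns the
        # earlier pairing otherwise, so any conflict shows up as inequality.
        if forward.setdefault(p, w) != w:
            return False
        if backward.setdefault(w, p) != p:
            return False
    return True


print(p_match("abbca", "bddcb"))  # relabelling a->b, b->d, c->c exists
print(p_match("abbca", "bddbb"))  # c and a would both map to b
```

The second dictionary is what makes the relabelling one-to-one; dropping it would accept bddbb.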
The space used is reduced even further when only a small subset of the symbols are allowed to be relabelled. As discussed in the next section, our results demonstrate a serious push forward in understanding which pattern matching problems can be solved in sublinear space.

∗ Markus Jalsenius: Department of Computer Science, University of Bristol, U.K.
† Benny Porat: Department of Computer Science, Bar-Ilan University, Israel.
‡ Benjamin Sach: Department of Computer Science, University of Warwick, U.K.

1.1 Background

Streaming algorithms are a well-studied area, and finding patterns in a stream specifically is a fundamental problem that has received increasing attention over the past few years. It was shown in [8] that many offline algorithms can be made online (streaming) and deamortised with a log m factor overhead in the time complexity per arriving symbol in the stream, where m is the length of the pattern. There have also been improvements for specific pattern matching problems but they all have one property in common: space usage is Θ(m) words. It is not difficult to show that we in fact need as much as Θ(m) space to do pattern matching, unless errors are allowed. The field of pattern matching in a stream took a significant step forwards in 2009 when it was shown to be possible to solve exact matching using only O(log m) words of space and O(log m) time per new stream symbol [15]. This method, which is based on fingerprints, correctly finds all matches with high probability. The initial approach was subsequently somewhat simplified [10] and then finally improved to run in constant time [7] within the same space requirements.

Being able to do exact matching in sublinear space raised the question of what other streaming pattern matching problems can be solved in small space. In 2011 this question was answered for a large set of such problems [9].
The result was rather gloomy: almost no streaming pattern matching problems can be solved in sublinear space, not even using randomised algorithms. An Ω(m) space lower bound was given for L_1, L_2, L_∞, Hamming, edit distance and pattern matching with wildcards as well as for any algorithm that computes the cross-correlation/convolution. So what other pattern matching problems could possibly be solved in small space? It seems that the only hope to find any is by imposing various restrictions on the problem definition. This was indeed done in [15] where a solution to k-mismatch (exact matching where up to k mismatches are allowed) was given which uses O(k^2 poly(log m)) time per arriving stream symbol and O(k^3 poly(log m)) words of space. The solution involves multiple instances of the exact matching algorithm run in parallel. Note that the space bound approaches Θ(m) as k increases, so the algorithm is only interesting for sufficiently small k. Further, the space bound is very far from the known Ω(k) lower bound. We also note that it is straightforward to show that exact matching with k wildcards in the pattern can be solved with the k-mismatch algorithm. To our knowledge, no other streaming pattern matching problems have been solved in sublinear space so far.

In this paper we present the first push forward since exact matching by giving a sublinear, near optimal space and near constant time (or constant with a mild restriction on the alphabet) algorithm for parameterized matching in a stream. This natural problem turns out to be significantly more complicated to solve than exact matching and our results provide the first demonstration that small space and time bounds are achievable for a more challenging problem. Note that our space bound, as opposed to that for k-mismatch, is essentially optimal, as it is for exact matching.
One could easily argue that our results are surprising, and yet again the question of what other problems are solvable in sublinear space calls for an answer. In particular, given that restrictions to the problem have to be made, what restrictions should one make to break the Ω(m) space barrier?

1.2 Problem definition and related work

A pattern P of length m is said to parameterize match, or p-match for short, an m length string S if there is an injective (one-to-one) function f such that S[j] = f(P[j]) for all j ∈ {0, . . . , m − 1}. In our streaming setting, the pattern is known in advance and the symbols of the stream T arrive one at a time. We use the letter i to denote the index of the latest symbol in the stream. Our task is to output whether there is a p-match between P and T[(i − m + 1), i] before T[i + 1] arrives. The mapping f may be distinct for each i.

One may view this matching problem as that of finding matches in a stream encrypted using a substitution cipher. In offline settings, parameterized matching has its origin in finding duplication and plagiarism in software code, although it has since found numerous other applications. Since the first introduction of the problem, a great deal of work has gone into its study in both theoretical and practical settings (see e.g. [1, 3–6, 12]). Notably, in an offline setting, the exact parameterized matching problem can be solved in near linear time using a variant [1] of the classic linear time exact matching algorithm KMP [14].

When the sublinear space algorithm for exact matching was given in [15], properties of the periods of strings formed a crucial part of the analysis. However, when considering parameterized matching the period of a string is a much less straightforward concept than it is for exact matching.
For example, it is no longer true that consecutive matches must either be separated by the period of the pattern or be at least m/2 symbols apart. This property, which holds for exact but not parameterized matching, allows for an efficient encoding of the positions of the matches. This was crucial to reducing the space requirements of the previous streaming algorithms. Unfortunately, parameterized matches can occur at arbitrary positions in the stream, requiring new insights.

This is not the only challenge that we face. A natural way to match two strings under parameterization is to consider their predecessor strings. For a string S, the predecessor string, denoted pred(S), is a string of length |S| such that pred(S)[j] is the distance, counted in numbers of symbols, to the previous occurrence of the symbol S[j] in S. In other words, pred(S)[j] = d, where d is the smallest positive value for which S[j] = S[j − d]. Whenever no such d exists, we set pred(S)[j] = 0. As an example, if S = aababcca then pred(S) = 01022014. We can perform parameterized matching offline by only considering predecessor strings using the fundamental fact [3] that two equal length strings S and S′ p-match iff pred(S) = pred(S′).

A plausible approach for our streaming problem would now be to translate the problem of parameterized matching in a stream to that of exact matching. This could be achieved by converting both pattern and stream into their corresponding predecessor strings and maintaining fingerprints of a sliding window of the translated input. However, consider the effect on the predecessor string, and hence its fingerprint, of sliding a window in the stream along by one.
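To make the predecessor string and the sliding-window difficulty concrete, here is a minimal sketch. The function name `pred` mirrors the notation above, but the dictionary of last occurrences is an illustrative choice that uses space proportional to the number of distinct symbols seen; it is not the data structure of the streaming algorithm.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence, or 0 if none."""
    last = {}  # symbol -> index of its most recent occurrence
    out = []
    for j, symbol in enumerate(s):
        out.append(j - last[symbol] if symbol in last else 0)
        last[symbol] = j
    return out


s = "aababcca"
print(pred(s))        # predecessor string of the whole of s
print(pred(s)[1:])    # the same values with the first entry dropped
print(pred(s[1:]))    # predecessor string of the window slid along by one
print(pred("abbca") == pred("bddcb"))  # p-match iff predecessor strings agree
```

Comparing the second and third printed lists shows the issue discussed here: after the window slides, the entry for the new leftmost occurrence of a must become 0, so the predecessor string of the window is not simply a suffix of the old one.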
The leftmost symbol x, say, will move out of the window and so the predecessor value of the new leftmost occurrence of x in the new window will need to be set to 0 and the corresponding fingerprint updated. We cannot afford to store the positions of all characters in a Θ(m) length window.

We will show a matching algorithm that solves these problems and others we encounter en route using minimal space and in near constant time per arriving symbol. A number of technical innovations are required, including new uses of fingerprinting, a new compressed encoding of the positions of potential matches, a separate deterministic algorithm designed for prefixes of the pattern with small parameterized period as well as the deamortisation of the entire matching process. Section 2 gives a more detailed overview of these main hurdles.

1.3 Our new results

Our main result is a fast and space efficient algorithm to solve the streaming parameterized matching problem. It applies to dense alphabets where we assume that both the pattern and streaming text alphabets are Σ = {0, . . . , |Σ| − 1}. The following theorem is proved over the subsequent sections of this paper.

Theorem 1. Suppose the pattern and text alphabets are both Σ = {0, . . . , |Σ| − 1} and the pattern has length m. There is a randomised algorithm for streaming parameterized matching that takes O(1) worst-case time per character and uses O(|Σ| log m) words of space. The probability that the algorithm outputs correctly at all alignments of an n length text is at least 1 − 1/n^c, where c is any constant.

To fully appreciate this theorem we also give a nearly matching space lower bound which shows that our solution is optimal within logarithmic factors. The proof is based on communication complexity arguments and is deferred to Appendix A.

Theorem 2.
There is a randomised space lower bound of Ω(|Σ|) bits for the streaming parameterized problem, where Σ is the pattern alphabet.

Parameterized matching is often specified under the assumption that only some symbols are variable (allowed to be relabelled). The mapping f we used in Section 1.2 has to reflect this constraint. More precisely, let the pattern alphabet be partitioned into fixed symbols Σ_fixed and variable symbols Π. For σ ∈ Σ_fixed, we require that f(σ) = σ. The result from Theorem 1 can be extended to handle general alphabets with arbitrary fixed symbols. The idea is to apply a suitable reduction that was given in [1] (Lemma 2.2) together with the streaming exact matching algorithm of Breslauer and Galil [7], as well as applying a "filter" on the text stream, using for instance the dictionary of Andersson and Thorup [2] based on exponential search trees. The dictionary is used to map text symbols to the variable pattern symbols in Π. The proof of the following theorem is given in Appendix C.

Theorem 3. Suppose Π is the set of pattern symbols that can be relabelled under parameterized matching. All other pattern symbols are fixed. Without any constraints on the text alphabet, there is a randomised algorithm for streaming parameterized matching that takes O(√(log |Π| / log log |Π|)) worst-case time per character and uses O(|Π| log m) words of space, where m is the length of the pattern. The probability that the algorithm outputs correctly at all alignments of an n length text is at least 1 − 1/n^c, where c is any constant.

As part of the proof of Theorem 1 we had to develop an algorithm that efficiently solves streaming parameterized matching for patterns with small parameterized period, defined as follows.
The parameterized period (p-period) of the pattern P, denoted ρ, is the smallest positive integer such that P[0, (m − 1 − ρ)] p-matches P[ρ, m − 1]. That is, ρ is the shortest distance that P must be slid by to parameterized match itself. Our algorithm is deterministic and is interesting in its own right (see Section 4 for details). We also provide a matching space lower bound which is detailed in Appendix A.

Theorem 4. Suppose the pattern and text alphabets are both Σ = {0, . . . , |Σ| − 1} and the pattern has p-period ρ. There is a deterministic algorithm for streaming parameterized matching that takes O(1) worst-case time per character and uses O(|Σ| + ρ) words of space. Further, there is a deterministic space lower bound of Ω(|Σ| + ρ) bits.

1.4 Fingerprints

We will make extensive use of Rabin-Karp style fingerprints of strings, which are defined as follows. Let S be a string over the alphabet Σ. Let p > |Σ| be a prime and choose r ∈ Z_p uniformly at random. The fingerprint φ(S) is given by

φ(S) = Σ_{k=0}^{|S|−1} S[k] r^k mod p.

A critical property of the fingerprint function φ is that the probability of achieving a false positive, Pr(φ(S) = φ(S′) ∧ S ≠ S′), is at most |S|/(p − 1) (see [13, 15] for proofs). Let n denote the total length of the stream. Our randomised algorithm will make o(n^2) (in fact near linear) fingerprint comparisons in total. Therefore, by applying the union bound, for any constant c, we can choose p ∈ Θ(n^{c+3}) so that with probability at least 1 − 1/n^c there will be no false positive matches. As we assume the RAM model with word size Θ(log n), a fingerprint fits in a constant number of words. We assume that all fingerprint arithmetic is performed within Z_p. In particular we will take advantage of two fingerprint operations.
Splitting: Given φ(S[0, a]), φ(S[0, b]) (where b > a) and the value of r^{−a} mod p, we can compute φ(S[a + 1, b]) = φ(S[0, b]) ⊖ φ(S[0, a]) in O(1) time.

Zeroing: Let S, S′ be two equal length strings such that S′ is identical to S except for in positions z ∈ Z ⊆ [0, s − 1] at which S′[z] = 0. We write φ(S) ⊘ Z to denote φ(S′). Given φ(S) and (S[z], r^z mod p) for all z ∈ Z, computing φ(S) ⊘ Z takes O(|Z|) time.

[Figure 1 omitted: an example text T with pred(T) and pred(T[i′, i′ + m_ℓ − 1]) aligned beneath it, marking the spans covered by Φ_{ℓ−1}(i′), Φ_ℓ(i′), Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′) and Φ′_ℓ(i′).]

Figure 1: The key fingerprints used by the randomised algorithm. Characters that contribute differently to Φ′_ℓ(i′) and Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′) are highlighted.

2 Overview, key properties and notation

The overall idea of our algorithm in Theorem 1 follows that of previous work on streaming exact matching in small space. However, for parameterized matching the situation is much more complex and calls for not only more involved details and methods but also a deep fundamental understanding of the nature of parameterized matching. We will now describe the overall idea, introduce some important notation and at the end of this section we will highlight key facts about parameterized matching that are crucial for our solution.

The main algorithm will try to match the streaming text with various prefixes of the pattern P. Let Σ_P denote the pattern alphabet. We define δ = |Σ_P| log m and let P_0 denote the shortest prefix of P that has p-period greater than 3δ (recall the definition of p-period given above Theorem 4). We define s prefixes P_ℓ of increasing length so that |P_ℓ| = 2^ℓ |P_0| for ℓ ∈ {1, . . .
, s − 1}, where s ≤ ⌈log m⌉ is the largest value such that |P_{s−1}| ≤ m/2. The final prefix P_s has length m − 4δ. For all ℓ, we define m_ℓ = |P_ℓ|, hence m_ℓ = 2m_{ℓ−1}.

In order to determine if there is a p-match between the text and a pattern prefix, we will compare the fingerprints of their predecessor strings (recall that two strings p-match iff their predecessor strings are the same). We will need two related (but typically distinct) fingerprint definitions to achieve this. Figure 1 will be helpful when reading the following definitions, which are discussed in an example below. For any index i′ and ℓ ∈ {0, . . . , s}, we define

Φ_ℓ(i′) = φ( pred(T[0, (i′ + m_ℓ − 1)]) ),
Φ′_ℓ(i′) = φ( pred(T[i′, (i′ + m_ℓ − 1)]) [m_{ℓ−1}, m_ℓ − 1] ).

For each ℓ ∈ {1, . . . , s} the main algorithm runs a process whose responsibility is finding p-matches between the text and P_ℓ (P_0 is handled separately as will be discussed later). The process responsible for P_ℓ will ask the process responsible for P_{ℓ−1} if it has found any p-matches, and if so it will try to extend the matches to P_ℓ. As an example, suppose that the process for P_{ℓ−1} finds a match at position i′ of the text (refer to Figure 1). The process will then store this match along with the fingerprint Φ_{ℓ−1}(i′) which has been built up as new symbols arrive. The process for P_ℓ will be handed this information when the symbol at position i′ + m_ℓ − 1 arrives. The task is now to work out if i′ is also a matching position with P_ℓ. With the fingerprint Φ_ℓ(i′) available (built up as new symbols arrive), the process for P_ℓ can use fingerprint arithmetic to determine if i′ is a matching position. This is one instance where the situation becomes more tricky than one might first think.
As position i′ is a p-match with P_{ℓ−1} it suffices to compare the second half of the predecessor string of P_ℓ with the second half of the predecessor string of T[i′, i′ + m_ℓ − 1]. Fingerprints are used for this comparison. It is crucial to understand that Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′) cannot be used directly here; some predecessor values of the text might point very far back, namely to some position before index i′. In Figure 1 we have shaded the three symbols for which this is true and we have drawn arrows indicating their predecessors. Thus, in order to correctly do the fingerprint comparison we need to set those positions to zero (we want the fingerprint of the predecessor string of the text substring starting at position i′, not the beginning of T). The fingerprint we defined as Φ′_ℓ(i′) above is the fingerprint we want to compare to the fingerprint of the second half of the predecessor string of P_ℓ. Using fingerprint operations, we have from the definitions that

Φ′_ℓ(i′) = (Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′)) ⊘ ∆_ℓ(i′),

where ∆_ℓ(i′) is the set of positions that have to be set to zero.

For a substring of T of length Θ(m_{ℓ−1}) consider the subset of positions which occur in ∆_ℓ(i′) for at least one value of i′. Any such position has a predecessor value greater than m_{ℓ−1}. Therefore, by summing over all distinct symbols we have that the size of this subset is crucially only O(|Σ_P|). Thus, we can maintain in small space every position in a suitable length window that will ever have to be set to zero.

Let us go back to the example where the process for P_{ℓ−1} had found a p-match at position i′. The process stores i′ along with the fingerprint Φ_{ℓ−1}(i′). This information is not needed by the process for P_ℓ until m_{ℓ−1} text symbols later. During the arrival of these symbols, the process for P_{ℓ−1} might detect more p-matches, in fact many more matches.
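The identity Φ′_ℓ(i′) = (Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′)) ⊘ ∆_ℓ(i′) can be checked numerically on a small example. The sketch below rests on several assumptions that are not part of the paper: the modulus `P_MOD` and base `R` are fixed constants chosen for reproducibility (the paper draws r at random and picks p ∈ Θ(n^{c+3})), all function names are hypothetical, and powers of r are recomputed on demand rather than cached, so the constant-time bounds of the streaming algorithm are not reflected here.

```python
P_MOD = (1 << 61) - 1  # a prime modulus (illustrative assumption)
R = 1_000_003          # fixed base; the paper chooses r uniformly at random


def pred(s):
    """Predecessor string: distance to the previous occurrence, or 0 if none."""
    last, out = {}, []
    for j, symbol in enumerate(s):
        out.append(j - last[symbol] if symbol in last else 0)
        last[symbol] = j
    return out


def phi(values):
    """Fingerprint: sum of values[k] * r^k, taken mod p."""
    return sum(v * pow(R, k, P_MOD) for k, v in enumerate(values)) % P_MOD


def split(phi_0b, phi_0a, a):
    """Splitting: fingerprint of S[a+1..b] from those of S[0..b] and S[0..a]."""
    # The difference equals r^(a+1) * phi(S[a+1..b]), so divide that factor out.
    inverse = pow(pow(R, a + 1, P_MOD), P_MOD - 2, P_MOD)  # Fermat inverse
    return (phi_0b - phi_0a) * inverse % P_MOD


def zero(phi_s, entries):
    """Zeroing: fingerprint of S with S[z] set to 0 for each (z, S[z]) given."""
    for z, value in entries:
        phi_s = (phi_s - value * pow(R, z, P_MOD)) % P_MOD
    return phi_s


def phi_prime_via_operations(text, i0, m_prev, m_cur):
    """Compute Phi'_l(i') from prefix fingerprints by splitting, then zeroing."""
    pt = pred(text)
    big = phi(pt[: i0 + m_cur])      # Phi_l(i')
    small = phi(pt[: i0 + m_prev])   # Phi_{l-1}(i')
    second_half = split(big, small, i0 + m_prev - 1)
    # Delta_l(i'): offsets whose predecessor lies before position i'.
    delta = [(j - i0 - m_prev, pt[j])
             for j in range(i0 + m_prev, i0 + m_cur) if pt[j] > j - i0]
    return zero(second_half, delta), [z for z, _ in delta]


def phi_prime_direct(text, i0, m_prev, m_cur):
    """Compute Phi'_l(i') straight from its definition."""
    return phi(pred(text[i0: i0 + m_cur])[m_prev:m_cur])


value, zeroed = phi_prime_via_operations("dcabacbdca", 2, 3, 6)
print(zeroed)                                             # offsets in Delta
print(value == phi_prime_direct("dcabacbdca", 2, 3, 6))   # the two routes agree
```

In this example the window is abacbd, and the symbols c and d in its second half have predecessors before the window start, so exactly those two offsets are zeroed.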
Their positions and corresponding fingerprints have to be stored until needed by the process for P_ℓ. We now have a space issue: how do we store this information in small space? To appreciate this question, first consider exact matching. Here matches are known to be either an exact period length apart or very far apart. The matching positions can therefore be represented by an arithmetic progression. Further, the fingerprints associated with the matches in an arithmetic progression can easily be stored succinctly as one can work out each one of the fingerprints from the first one. For parameterized matching the situation is much more complex: matches can occur more chaotically and, as we have seen above, fingerprints must be updated dynamically to reflect that symbols could be mapped differently in two distinct alignments. Handling these difficulties in small space (and small time complexity) is a main hurdle and is one point at which our work differs significantly from all previous work on streaming matching in small space. We cope with this space issue in the next section.

[Figure 2 omitted: a 3m/2 length substring of T with matching positions marked, split into an initial group Y followed by a progression A whose consecutive positions are ρ apart.]

Figure 2: Partitioning of positions (×) at which P p-matches in a 3m/2 length substring of T.

2.1 The structure of parameterized matches

First recall that an arithmetic progression is a sequence of numbers such that the (common) difference between any two successive numbers is constant. We can specify an arithmetic progression by its start number, the common difference and the length of the sequence. In the next lemma we will see that the positions at which a string P of length m parameterize matches a longer string of length 3m/2 can be stored in small memory: either a matching position belongs to an arithmetic progression or it is one of relatively few positions that can be listed explicitly in O(|Σ_P|) space.
The proof of the lemma (consult Figure 2) is deferred to Section 5.

Lemma 5. Let X be the set of positions at which P p-matches within a 3m/2 length substring of T. The set X can be partitioned into two sets Y and A such that |Y| ≤ 6|Σ_P|, max(Y) < min(A) and A is an arithmetic progression with common difference ρ, where ρ is the p-period of P.

The lemma is central to the algorithm as it allows us to store all partial matches (that need to be kept in memory before being discarded) in a total of O(|Σ_P| log m) space across all processes. The question of how to store their associated fingerprints remains, but is resolved by the corollary below, which follows immediately from the proof of Lemma 5. We can afford to store fingerprints explicitly for the positions that are identified as belonging to the set Y from Lemma 5, and for the matching positions in the arithmetic progression A we can, as for exact matching, work out every fingerprint given the first one.

Corollary 6. For pattern P, text T and arithmetic progression A as specified in Lemma 5, pred(T)[(i + m − ρ), (i + m − 1)] is the same for all i ∈ A.

2.2 Deamortisation

So far we have described the overall approach but it is of course a major concern how to carry out computations in constant time per arriving symbol. In order to deamortise the algorithm, we run a separate process responsible for the pattern prefix P_0 that uses the deterministic algorithm of Section 4 (i.e. Theorem 4). As P_0 has p-period greater than 3δ, the p-matches it outputs are at least this far apart. This enables the other processes to operate with a small delay: process P_ℓ expects process P_{ℓ−1} to hand over matches and fingerprints with a small delay, and it will itself hand over matches and fingerprints to P_{ℓ+1} with a small delay.
One of the reasons for the delays is that processes operate in a round-robin scheme: one process per arriving symbol. The process that is responsible for P_s (which has length m − 4δ) returns matches with a delay of up to 3δ arriving symbols. Hence there is a gap of length δ in which we can work out if the whole of P matches. To do this we have another process that runs in parallel with all other processes and explicitly checks if any match with P_s can be extended with the remaining 4δ symbols by directly comparing their predecessor values with the last 4δ predecessor values of the pattern. This job is spread out over δ arriving symbols, hence matches with P are outputted in constant time.

3 The main algorithm

We are now in a position to describe the full algorithm of Theorem 1. Recall that the algorithm will find p-matches with each of the pattern prefixes P_0, . . . , P_s defined in the previous section. If a shorter prefix fails to match at a given position then there is no need to check matches for longer prefixes. Our algorithm runs three main processes concurrently which we label A, B and C. The term process had a slightly different meaning in the previous section, but hopefully this will cause no confusion. Each process takes O(1) time per arriving symbol. Recall that both the pattern and text alphabets are Σ_P = {0, . . . , |Σ_P| − 1}.

Process A finds p-matches with prefix P_0 which are inserted as they occur into a match queue M_0. Process B finds p-matches for prefixes P_1, . . . , P_s which are inserted into the match queues M_1, . . . , M_s, respectively. The p-matches are inserted with a delay of up to 3δ symbol arrivals after they occur. Process C finds p-matches with the whole pattern P which are outputted in constant time as they occur as described in Section 2.2. It is crucial for the space usage that the match queues M_0, M_1, . . .
, M_s will be stored in a compressed fashion. The delay in detecting p-matches with P_ℓ in Process B is a consequence of deamortising the work required to find a prefix match, which we spread out over Θ(δ) arriving symbols. We can afford to spread out the work in this way because the p-period of P_{ℓ−1} is at least δ so any p-matches are at least this far apart.

Throughout this section we assume that m > 14δ so that m_ℓ − m_{ℓ−1} > 3δ for ℓ ∈ {1, . . . , s}. If m ≤ 14δ, or the p-period of P is 3δ or less, we use the deterministic algorithm presented in Section 4 to solve the problem within the required bounds.

3.1 Process A (finding matches with P_0)

From the definition of P_0 we have that if we remove the final character (giving the string P[0, m_0 − 2]) then its p-period is at most 3δ. The p-period of P_0 itself could be much larger. As part of Process A we run the deterministic pattern matching algorithm from Section 4 (see Theorem 4) on P[0, m_0 − 2]. It returns p-matches in constant time and uses O(|Σ_P| + 3δ) = O(|Σ_P| log m) space. In order to establish matches with the whole of P_0 we handle the final character separately. If the deterministic subroutine reports a match that ends in T[i − 1], when T[i] arrives we have a p-match with P_0 if and only if pred(T)[i] = pred(P_0)[m_0 − 1] (or pred(T)[i] ≥ m_0 if pred(P_0)[m_0 − 1] = 0). As the alphabet is of the form Σ_P = {0, . . . , |Σ_P| − 1}, we can compute the value of pred(T)[i] in O(1) time by maintaining an array A of length |Σ_P| such that for all σ ∈ Σ_P, A[σ] gives the index of the most recent occurrence of symbol σ.

Whenever Process A finds a match with P_0 at position i′ of the text, the pair (i′, Φ_0(i′)) is added to a (FIFO) queue M_0, which is queried by Process B when handling prefix P_1.

3.2 Process B (finding matches with P_1, . . .
, P_s)

We split the discussion of the execution of Process B into s levels, 1, . . . , s. For each level ℓ the fingerprint Φ′_ℓ(i′) is computed for each position i′ at which P_{ℓ−1} p-matches. Then, as discussed in Section 2, if Φ′_ℓ(i′) = φ(pred(P_ℓ)[m_{ℓ−1}, (m_ℓ − 1)]), there is also a match with P_ℓ at i′. The algorithm will in this case add the pair (i′, Φ_ℓ(i′)) to the queue M_ℓ which is subject to queries by level ℓ + 1. To this end we compute Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′) and ∆_ℓ(i′), where ∆_ℓ(i′) contains all the positions which should be zeroed in order to obtain Φ′_ℓ(i′). In the example of Figure 1, ∆_ℓ(i′) = {1, 5, 7} (the d, e and f, respectively).

In order for Process B to spend only constant time per arriving symbol, all its work must be scheduled carefully. The preparation of the ∆_ℓ(i′) values takes place as a subprocess we name B1. Computing Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′) and establishing matches takes place in another subprocess named B2. The two subprocesses are run in sequence for each arriving symbol. We now give their details.

Subprocess B1 (prepare zeroing). We use a queue D_ℓ associated with each level ℓ which contains the most recent O(|Σ_P|) positions with predecessor values greater than m_{ℓ−1}. We will see below that ∆_ℓ(i′) is a subset of the positions in D_ℓ (adjusted to the offset i′). Unfortunately, in the worst case, for an arriving symbol T[i], i could belong to all of the D_ℓ queues. Since we can only afford constant time per arriving symbol, we cannot insert i into more than a constant number of queues. The solution is to buffer arriving symbols. When some T[i] arrives we first check whether pred(T)[i] > m_0. If so, the pair (i, pred(T)[i]) is added to a buffer B to be dealt with later.
Together with the pair we also store the value r^i mod p which will be needed to perform the required zeroing operations.

In addition to adding a new element to the buffer B, Subprocess B1 will also process elements from B. If it is not currently processing an element, it will now start doing so by removing an element from B (unless B is empty). Call this element (j, pred(T)[j]). Over the next s arriving symbols Subprocess B1 will do the following. For each of the s levels ℓ, if pred(T)[j] > m_{ℓ−1}, add (j, pred(T)[j]) to the queue D_ℓ. If D_ℓ contains more than 12|Σ_P| elements, discard the oldest.

Subprocess B2 (establish matches). This subprocess schedules the work across the levels in a round-robin fashion by only considering level ℓ = 1 + (i mod s) when the symbol T[i] arrives. Potential matches may not be reported by this subprocess until up to 3δ arriving symbols after they occur. As P_{ℓ−1} has p-period at least 3δ, the processing of potential matches does not overlap.

Subprocess B2 for level ℓ is always in one of two states: either it is checking whether a matching position i′ for P_{ℓ−1} is also a match with P_ℓ, or it is idle. If idle, level ℓ looks into queue M_{ℓ−1} which holds matches with P_{ℓ−1}. If M_{ℓ−1} is non-empty, level ℓ removes an element from M_{ℓ−1}, call this element (i′, Φ_{ℓ−1}(i′)), and enters the checking state. Whenever i > i′ + m_ℓ + δ, level ℓ will start checking if i′ is also a matching position with P_ℓ. It does so by first computing the fingerprint Φ_ℓ(i′) ⊖ Φ_{ℓ−1}(i′), which by definition equals (Φ_ℓ(i′) − Φ_{ℓ−1}(i′)) r^{−i′−m_{ℓ−1}} mod p. We can ensure the fingerprint Φ_ℓ(i′) is always available when needed by maintaining a circular buffer of the most recent Θ(δ) fingerprints of the text.
Similarly we can obtain $r^{-i' - m_{\ell-1}} \bmod p$ in $O(1)$ time by keeping a buffer of the most recent $\Theta(\delta)$ values of $r^{-i} \bmod p$ along with $r^{-m_\ell} \bmod p$ for all $\ell$.

Over the next at most $|\Sigma_P|$ arriving symbols for which Subprocess B2 is considering level $\ell$ (i.e. those with $\ell = 1 + (i \bmod s)$), $\Phi'_\ell(i')$ will be computed from $\Phi_\ell(i') \ominus \Phi_{\ell-1}(i')$ by stepping through the elements of the queue $D_\ell$. For any element $(j, \mathrm{pred}(T)[j]) \in D_\ell$, we have that $(j - i' - m_{\ell-1}) \in \Delta_\ell(i')$ if and only if $\mathrm{pred}(T)[j] > j - i'$. Further, as Subprocess B1 stored $r^{j} \bmod p$ with the element in $D_\ell$ and $r^{i'} \bmod p$ is obtained through the circular buffer as above, we can perform the zeroing in $O(1)$ time. Having computed $\Phi'_\ell(i')$, we then compare it to $\phi(\mathrm{pred}(P_\ell)[m_{\ell-1}, (m_\ell - 1)])$. If they are equal, we have a p-match with $P_\ell$ at position $i'$ of the text, and the pair $(i', \Phi_\ell(i'))$ is added to the queue $M_\ell$. This occurs before $T[i' + m_\ell + 3\delta]$ arrives.

3.3 Correctness, time and space analysis

The time and space complexity follow almost immediately from the description of our algorithm, but a little more attention is required to verify that the algorithm actually works. In particular one has to show that buffers do not overflow, that elements in queues are dealt with before being discarded, and that every possible match will be found (disregarding the probabilistic error in the fingerprint comparisons). The proof of the next lemma is given in Appendix B.

Lemma 7. The algorithm described above proves Theorem 1.

4 The deterministic matching algorithm

We now describe the deterministic algorithm of Theorem 4. Its running time is $O(1)$ per character and it uses $O(|\Sigma_P| + \rho)$ words of space, where $\rho$ is the parameterized period of $P$. We require that both the pattern and text alphabets are $\Sigma_P = \{0, \dots, |\Sigma_P| - 1\}$.
We first briefly summarise the overall approach of [1], which our algorithm follows. It resembles the classic KMP algorithm. When $T[i]$ arrives, the overall goal is to calculate the largest $r$ such that $P[0, r-1]$ p-matches $T[(i - r + 1), i]$. A p-match occurs iff $r = m$. When a new text character $T[i+1]$ arrives the algorithm compares $\mathrm{pred}(P)[r]$ to $\mathrm{pred}(T)[i+1]$ in $O(1)$ time to determine whether $P[0, r]$ p-matches $T[(i - r + 1), i + 1]$. More precisely, the algorithm checks whether either $\mathrm{pred}(P)[r] = \mathrm{pred}(T)[i+1]$, or $\mathrm{pred}(P)[r] = 0 \wedge \mathrm{pred}(T)[i+1] > r$. The second case covers the possibility that the previous occurrence in the text was outside the window. If there is a match, we set $r \leftarrow r + 1$ and $i \leftarrow i + 1$ and continue with the next text character. If not, we shift the pattern prefix $P[0, r-1]$ along by its p-period, denoted $\rho_{r-1}$, so that it is aligned with $T[(i - r + \rho_{r-1} + 1), i]$. This is the next candidate for a p-match. In the original algorithm, the p-periods of all prefixes are stored in an array of length $m$ called a prefix table.

The main hurdle we must tackle is to store both a prefix table suitable for parameterized matching as well as an encoding of the pattern in only $O(|\Sigma_P| + \rho)$ space, while still allowing efficient access to both. It is well-known that any string $P$ can be stored in space proportional to its exact period. In Lemma 9, which follows from Lemma 8, we show an analogous result for $\mathrm{pred}(P)$. See Appendix D for proofs.

Lemma 8. For any $j \in [\rho]$ there is a constant $k_j$ such that $\mathrm{pred}(P)[j + k\rho]$ is $0$ for $k < k_j$, and $c_j$ for $k \ge k_j$, where $c_j > 0$ is a constant that depends on $j$.

Lemma 9. The predecessor string $\mathrm{pred}(P)$ can be stored in $O(\rho)$ space, where $\rho$ is the p-period of $P$. Further, for any $j \in [m]$ we can obtain $\mathrm{pred}(P)[j]$ from this representation in $O(1)$ time.
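To make the comparison rule summarised at the start of this section concrete, the following Python sketch computes predecessor strings and applies the rule at every alignment. It is only a naive $O(nm)$-time illustration, not the streaming algorithm: it assumes that $\mathrm{pred}(S)[j]$ is the distance from $j$ to the previous occurrence of $S[j]$ (and $0$ if there is none), and the function names are chosen purely for illustration.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence of each
    symbol, or 0 if the symbol has not occurred before (assumed definition)."""
    last_seen = {}
    out = []
    for idx, symbol in enumerate(s):
        out.append(idx - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = idx
    return out


def p_match_positions(pattern, text):
    """Return every start position at which `pattern` p-matches `text`.

    Naive check of all alignments (hypothetical helper, for illustration).
    """
    m = len(pattern)
    pp = pred(pattern)
    pt = pred(text)
    hits = []
    for start in range(len(text) - m + 1):
        ok = True
        for r in range(m):
            d = pt[start + r]
            # Either the predecessor distances agree, or the pattern symbol
            # is a first occurrence and the previous occurrence in the text
            # lies outside the current window (distance greater than r).
            if not (pp[r] == d or (pp[r] == 0 and d > r)):
                ok = False
                break
        if ok:
            hits.append(start)
    return hits
```

With the example from the introduction, `p_match_positions("abbca", "bddcb")` returns `[0]`, whereas `p_match_positions("abbca", "bddbb")` returns `[]`.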
We now explain how to store the parameterized prefix table in only $O(\rho)$ space, in contrast to the $\Theta(m)$ space which a standard prefix table would require. The p-period $\rho_r$ of $P[0, r]$ is, as a function of $r$, non-decreasing in $r$. This property enables us to run-length encode the prefix table and store it as a doubly linked list with at most $\rho$ elements, hence using only $O(\rho)$ space. Each element corresponds to an interval of prefix lengths with the same p-period, and the elements are linked together in increasing order (of the common p-period). This representation does not allow $O(1)$ time random access to the p-period of any prefix; however, for our purposes it will suffice to perform sequential access. To accelerate computation we also store a second linked list of the indices of the first occurrences of each symbol in $P$ in ascending order, i.e. every $j$ such that $\mathrm{pred}(P)[j] = 0$. This uses $O(|\Sigma_P|)$ space.

There is a crucial second advantage to compressing the prefix table, which is that it allows us to upper bound the number of prefixes of $P$ we need to inspect when a mismatch occurs. When a mismatch occurs in our algorithm, we repeatedly shift the pattern until a p-match between a text suffix and pattern prefix occurs. Naively it seems that we might have to check many prefixes within the same run. However, as a consequence of Lemma 8 we are assured that if some prefix does not p-match, every prefix in the same run with $\mathrm{pred}(P)[j] \neq 0$ will also mismatch (except possibly the longest). Therefore we can skip inspecting these prefixes. This can be seen by observing (using Lemma 8) that for $j$ such that $\rho_j = \rho_{j+1}$, we have $\mathrm{pred}(P)[j - \rho_j] \in \{0, \mathrm{pred}(P)[j]\}$. By keeping pointers into both linked lists, it is straightforward to find the next prefix to check in $O(1)$ time. Whenever we perform a pattern shift we move at least one of the pointers to the left.
Therefore the total number of pattern shifts inspected while processing $T[i]$ is at most $O(|\Sigma_P| + \rho)$. As each pointer only moves to the right by at most one when each $T[i]$ arrives, an amortised time complexity of $O(1)$ per character follows. The space usage is $O(|\Sigma_P| + \rho)$ as claimed, dominated by the linked lists.

We now briefly discuss how to deamortise our solution by applying Galil's KMP deamortisation argument [11]. The main idea is to restrict the algorithm to shift the pattern at most twice when each text character arrives, giving a constant time algorithm. If we have not finished processing $T[i]$ by this point we accept $T[i+1]$ but place it on the end of a buffer, output 'no match' and continue processing $T[i]$. The key property is that the number of text arrivals until the next p-match occurs is at least the length of the buffer. As we shift the pattern up to twice during each arrival we always clear the buffer before (or as) the next p-match occurs. Further, the size of the buffer is always $O(|\Sigma_P| + \rho)$. This follows from the observation above that the number of pattern shifts required to process a single text character is $O(|\Sigma_P| + \rho)$. This concludes the algorithm of Theorem 4. Combining this result with the lower bound result of Appendix A proves Theorem 4.

5 The proof of Lemma 5

In this section we prove the important Lemma 5. Let $i_{\mathrm{left}}$ denote an arbitrary position in $T$ where $P$ p-matches. Let $X$ be the set of positions at which $P$ p-matches within $T[i_{\mathrm{left}}, (i_{\mathrm{left}} + 3m/2 - 1)]$. We now prove that there exist disjoint sets $Y$ and $A$ with the properties set out in the statement of the lemma.

Let $\alpha$ be the smallest integer such that all distinct symbols in $P$ occur in the prefix $P[0, \alpha]$. We begin by showing that $\rho$, the p-period of $P$, is at least $\alpha/|\Sigma|$.
From the minimality of $\alpha$, we have that $P[\alpha]$ is the leftmost occurrence of some symbol. By the definition of the p-period, we have that $P[0, (m - 1 - \rho)]$ p-matches $P[\rho, m - 1]$. Under this shift, $P[\alpha]$ (in $P[\rho, m - 1]$) is aligned with $P[\alpha - \rho]$ (in $P[0, (m - 1 - \rho)]$). Assume that $P[\alpha - \rho]$ is not a leftmost occurrence and let $j$ be the position of the previous occurrence of $P[j] = P[\alpha - \rho]$. As a parameterized match occurs, we have that $P[j] = P[j + \alpha] \neq P[\alpha]$, a contradiction. By repeating this argument we have found distinct symbols at positions $\alpha - k\rho$ for all $k > 0$. This immediately implies that $\rho \ge \alpha/|\Sigma|$.

We first deal with two simple cases: $\rho \ge m/8$ or $\alpha \ge m/4$ (which implies that $\rho \ge m/(4|\Sigma|)$). In these two cases the number of p-matches is easily upper bounded by $6|\Sigma|$, so all positions can be stored in the set $Y$. We therefore continue under the assumption that $\alpha < m/4$ and $\rho < m/8$.

As $\rho \ge \alpha/|\Sigma|$, there are at most $(\alpha + 1)/(\alpha/|\Sigma|) \le 2|\Sigma|$ positions from the range $[i_{\mathrm{left}}, i_{\mathrm{left}} + \alpha]$ at which $P$ can parameterize match $T$. We can store these positions in the set $Y$. Next we will show that the positions from the range $[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + 3m/2 - 1)]$ at which $P$ parameterize matches $T$ can be represented by the arithmetic progression $A$.

First we show that $\rho$ is an exact period (not p-period) of $\mathrm{pred}(P)[\alpha + 1, m - 1]$ (but not necessarily the shortest period). Consider arbitrary positions $P[j]$ and $P[j - \rho]$ where $\alpha < j < m - \rho$. By the definition of the p-period, we have that $P[\rho, m - 1]$ p-matches $P[0, (m - 1 - \rho)]$ and hence that $\mathrm{pred}(P[\rho, m - 1]) = \mathrm{pred}(P[0, (m - 1 - \rho)])$. In particular, $\mathrm{pred}(P[\rho, m - 1])[j] = \mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j]$, where the second equality follows because we take the predecessor string of a prefix of $P$.
Also observe that $\mathrm{pred}(P[\rho, m - 1])[j]$ either equals $0$ or $\mathrm{pred}(P)[j - \rho]$ by definition. Further, $\mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j] \neq 0$ as $j > \alpha$ and all leftmost occurrences are before $\alpha$. This implies that $\mathrm{pred}(P[\rho, m - 1])[j] \neq 0$, hence, as required, $\mathrm{pred}(P)[j - \rho] = \mathrm{pred}(P[\rho, m - 1])[j] = \mathrm{pred}(P[0, (m - 1 - \rho)])[j] = \mathrm{pred}(P)[j]$.

Recall that $P$ p-matches $T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1]$, so $\mathrm{pred}(P) = \mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])$ and hence $\rho$ is an exact period of $\mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])[\alpha + 1, m - 1]$. Let $j \in \{\alpha + 1, \dots, m - 2\}$ and observe that by definition, $\mathrm{pred}(T[i_{\mathrm{left}}, i_{\mathrm{left}} + m - 1])[j] \in \{0, \mathrm{pred}(T)[i_{\mathrm{left}} + j]\}$. However, $\mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[j] = \mathrm{pred}(P)[j] > 0$ because $j > \alpha$ and all leftmost occurrences are in $P[0, \alpha]$. This implies that $\mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[j] = \mathrm{pred}(T)[i_{\mathrm{left}} + j]$. As $j$ was arbitrary, we have that $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)] = \mathrm{pred}(T[i_{\mathrm{left}}, (i_{\mathrm{left}} + m - 1)])[\alpha + 1, m - 1]$ and hence $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)]$.

Let $i_{\mathrm{right}}$ be the rightmost position in $T[i_{\mathrm{left}}, i_{\mathrm{left}} + 3m/2 - 1]$ where $P$ p-matches. By the same argument as for $i_{\mathrm{left}}$, we have that $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{right}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$. Thus, both $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{left}} + m - 1)]$ and $\mathrm{pred}(T)[(i_{\mathrm{right}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$ have an exact period of $\rho$. As these two strings overlap by at least $\rho$ characters, we have that $\rho$ is also an exact period of $\mathrm{pred}(T)[i_{\mathrm{left}} + \alpha + 1, i_{\mathrm{right}} + m - 1]$.

Let $i \in \{(i_{\mathrm{left}} + \alpha + 1), \dots, i_{\mathrm{right}} - 1\}$ be arbitrary such that $P$ p-matches $T[i, (i + m - 1)]$. We now prove that if $i + \rho < i_{\mathrm{right}}$ then $P$ p-matches $T[i + \rho, (i + \rho + m - 1)]$.
As p-matches must be at least $\rho$ characters apart this is sufficient to conclude that all remaining matches form an arithmetic progression with common difference $\rho$. As $\rho$ is an exact period of $\mathrm{pred}(T)[(i_{\mathrm{left}} + \alpha + 1), (i_{\mathrm{right}} + m - 1)]$, we have that $\mathrm{pred}(T)[i, (i + m - 1)] = \mathrm{pred}(T)[i + \rho, (i + \rho + m - 1)]$. By definition, this implies that $\mathrm{pred}(T[i, (i + m - 1)]) = \mathrm{pred}(T[i + \rho, (i + \rho + m - 1)])$ and hence a p-match also occurs at $i + \rho$. This concludes the proof of Lemma 5.

6 Acknowledgements

The work described in this paper was supported by EPSRC. The authors would like to thank Raphaël Clifford for many helpful and encouraging discussions.

References

[1] A. Amir, M. Farach, and S. Muthukrishnan. “Alphabet dependence in parameterized matching”. In: IPL 49.3 (1994), pp. 111–115.
[2] A. A. Andersson and M. Thorup. “Tight(er) worst-case bounds on dynamic searching and priority queues”. In: STOC '00. 2000, pp. 335–342.
[3] B. S. Baker. “A theory of parameterized pattern matching: algorithms and applications”. In: STOC '93. 1993, pp. 71–80.
[4] B. S. Baker. “Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance”. In: SIAM J. on Comp. 26.5 (1997), pp. 1343–1362.
[5] B. S. Baker. “Parameterized Pattern Matching: Algorithms and Applications”. In: JCSS 52.1 (1996), pp. 28–42.
[6] B. S. Baker. “Parameterized Pattern Matching by Boyer-Moore-Type Algorithms”. In: SODA '95. 1995, pp. 541–550.
[7] D. Breslauer and Z. Galil. “Real-Time Streaming String-Matching”. In: CPM '11. 2011, pp. 162–172.
[8] R. Clifford, K. Efremenko, B. Porat, and E. Porat. “A Black Box for Online Approximate Pattern Matching”. In: CPM '08. 2008, pp. 143–151.
[9] R. Clifford, M. Jalsenius, E. Porat, and B. Sach. “Space Lower Bounds for Online Pattern Matching”. In: CPM '11. 2011, pp. 184–196.
[10] F. Ergun, H. Jowhari, and M. Sağlam. “Periodicity in streams”.
In: RANDOM '10. 2010, pp. 545–559.
[11] Z. Galil. “String Matching in Real Time”. In: J. ACM 28.1 (1981), pp. 134–149.
[12] C. Hazay, M. Lewenstein, and D. Sokol. “Approximate parameterized matching”. In: ACM Trans. Algorithms 3.3 (2007).
[13] R. M. Karp and M. O. Rabin. “Efficient randomized pattern-matching algorithms”. In: IBM J. Res. Dev. 31.2 (1987), pp. 249–260.
[14] D. E. Knuth, J. H. Morris, and V. B. Pratt. “Fast pattern matching in strings”. In: SIAM J. on Comp. 6 (1977), pp. 323–350.
[15] B. Porat and E. Porat. “Exact And Approximate Pattern Matching In The Streaming Model”. In: FOCS '09. 2009, pp. 315–323.

A Space lower bounds

To complete the picture we give nearly matching space lower bounds which show that our solutions are optimal to within log factors. The proof is by a communication complexity argument. In essence one can show that in the randomised case Alice is able to transmit any string of length $\Theta(|\Sigma_P|)$ bits to Bob using a solution to the matching problem by selecting a suitable pattern and streaming text. Similarly in the deterministic case (see below) one can show that she can send $\Theta(|\Sigma_P| + \rho)$ bits.

Proof of Theorem 2. Consider first a pattern where all symbols are distinct, e.g. $P = \mathtt{123456}$. Now let us assume Alice would like to send a bit-string to Bob. She can encode the bit-string as an instance of the parameterized matching problem in the following way. As an example, assume the bit-string is 01011. She first creates the first half of a text stream aBcDE, where we choose capitals to correspond to 1 and lower case symbols to correspond to 0 from the original bit-string. She starts the matching algorithm and runs it until the pattern and the first half of the text have been processed and then sends a snapshot of the memory to Bob. Bob then continues with the second half of the text, which is fixed to be the sorted lower case symbols, in this case abcde.
Where Bob finds a parameterized match he outputs a 1 and where he does not, he outputs a 0. Thus Alice's bit-string is reproduced by Bob.

In general, if we restrict the alphabet size of the pattern to be $|\Sigma_P|$ then Alice can similarly encode a bit-string of length $|\Sigma_P| - 1$, and successfully transmit it to Bob, giving us an $\Omega(|\Sigma_P|)$ bit lower bound on the space requirements of any streaming algorithm.

If randomisation is not allowed, the lower bound increases to $\Omega(|\Sigma_P| + \rho)$ bits of space. Here $\rho$ is the parameterized period of the pattern. This bound follows by a similar argument by devising a one-to-one encoding of bit-strings of length $\Theta(\rho)$ into $P[0 \dots \rho - 1]$. The key difference is that with a deterministic algorithm, Bob can enumerate all possible $m$-length texts to recover Alice's bit-string from $P$.

B Correctness proof of the main algorithm

Proof of Lemma 7. Coupled with the discussion in Section 2, the time and space complexity follow almost immediately from the description. It only remains to show that, at any time, $|\mathcal{B}| \le |\Sigma_P|$. First observe that any symbol $\sigma \in \Sigma_T$ is only inserted into $\mathcal{B}$ when $\mathrm{pred}(T)[i] > m_0 > \delta$, which can happen at most once in every $\delta = |\Sigma_P| \log m$ arriving symbols. Further, we remove one element every $s \le \lceil \log m \rceil$ arrivals and in particular remove the $\sigma$ occurrence after at most $|\mathcal{B}| \lceil \log m \rceil$ arrivals. As $\mathcal{B}$ is initially empty, by induction it follows that no symbol occurs more than once in $\mathcal{B}$.

For correctness, it remains to show that we correctly obtain the positions of $\Phi'_\ell(i')$ from $D_\ell$. It follows from the description that all positions of $\Phi'_\ell(i')$ correspond to elements inserted into $D_\ell$ at some point. However we need to prove that these elements are present in $D_\ell$ while $\Phi'_\ell(i')$ is calculated.
Any element inserted into $\mathcal{B}$ during $T[i', (i' + m_\ell - 1)]$ has cleared the buffer by the end of interval B (which has length $\delta$) by the argument above. Therefore any relevant element has been inserted into $D_\ell$ by the start of interval C, during which we calculate $\Phi'_\ell(i')$. Any element inserted into $D_\ell$ is at least $m_{\ell-1}$ characters from its predecessor. Therefore, summing over all symbols in the alphabet, there are at most $4|\Sigma_P|$ positions in $T[i', (i' + 2m_\ell - 1)]$ which are inserted into $D_\ell$. As $D_\ell$ is a FIFO queue of size $12|\Sigma|$, the relevant elements are still present after interval C.

As commented earlier, potential matches in $M_\ell$ are separated by more than $3\delta$ arrivals because $P_{\ell-1}$ has p-period more than $3\delta$. They are processed within $3\delta$ arrivals so $M_\ell$ does not overflow. This completes the correctness proof.

C Proof of Theorem 3 (general alphabets)

Let $\Sigma_T$ denote the text alphabet. In order to handle general alphabets we perform two reductions in sequence on each arriving text symbol (and on $P$ during preprocessing). The first reduces $\Sigma_P$ and $\Sigma_T$ to each contain only symbols from $\Pi$ and one additional variable symbol (which is different for $P$ and $T$). A suitable such reduction is given in [1] (Lemma 2.2). The reduction is presented for the offline version but immediately generalises by using the constant time exact matching algorithm of Breslauer and Galil [7].

We now define $\Sigma'_P$ to be the pattern alphabet after the first reduction (and $\Sigma'_T$ respectively). Note that $|\Sigma'_P| = |\Sigma'_T| = |\Pi| + 1$ and all pattern symbols are variables. However we have no guarantee on the bit representations of the alphabet symbols. Let $T'$ and $P'$ denote the text and pattern after the first reduction. The second reduction now maps each $T'[i]$ into the range $\{0, \dots, |\Sigma'_P|\}$ as it arrives.
The equivalent reduction for the pattern is a simplification which can be performed in preprocessing.

Let the strings $S$ and $S_{\mathrm{filt}}$ denote the last $m$ characters of the unfiltered (post first reduction) and filtered (post second reduction) stream, respectively. Let $\Sigma_{\mathrm{last}} \subseteq \Sigma'_T$ denote the up to $|\Sigma'_P| + 1$ last distinct symbols in $S$, hence $|\Sigma_{\mathrm{last}}|$ is never more than $|\Sigma'_P| + 1$. Let $\mathcal{T}$ be a dynamic dictionary on $\Sigma_{\mathrm{last}}$ such that a symbol in $\Sigma'_T$ can be looked up, deleted and added in $O(\sqrt{\log|\Sigma'_P| / \log\log|\Sigma'_P|})$ time [2]. Every symbol that arrives in the stream is associated with its “arrival time”, which is an integer that increases by one for every new symbol arriving in the stream. Let $\mathcal{L}$ be an ordered list of the symbols in $\Sigma_{\mathrm{last}}$ (together with their most recent arrival time) such that $\mathcal{L}$ is ordered according to the most recent arrival time. For example,

$$\mathcal{L} = (\mathtt{d}, 25), (\mathtt{b}, 33), (\mathtt{g}, 58), (\mathtt{e}, 102) \qquad (1)$$

means that the symbols b, d, e and g are the last four distinct symbols that appear in $S$ (for this example, $|\Sigma'_P| + 1 \ge 4$), where the last e arrived at time 102, the last g arrived at time 58, and so on.

By using appropriate pointers between elements of the hash table $\mathcal{T}$ and elements of $\mathcal{L}$ (which could be implemented as a linked list), we can maintain $\mathcal{T}$ and $\mathcal{L}$ in $O(1)$ time per arriving symbol. To see this, take the example in Equation (1) and consider the arrival of a new symbol $x$ at time 103 (following the last symbol e). First we look up $x$ in $\mathcal{T}$ and if $x$ already exists in $\Sigma_{\mathrm{last}}$, move it to the right end of $\mathcal{L}$ by deleting and inserting where needed and update the element to $(x, 103)$. Also check that the leftmost element of $\mathcal{L}$ is not a symbol that has been pushed outside of $S$ when $x$ arrived. We use its arrival time to determine this and remove that element accordingly.
If the arriving symbol $x$ does not already exist in $\Sigma_{\mathrm{last}}$, then we add $(x, 103)$ to the right end of $\mathcal{L}$. To ensure that $\mathcal{L}$ does not contain more than $|\Sigma'_P| + 1$ elements, we remove the leftmost element of $\mathcal{L}$ if necessary. We also remove the leftmost symbol if it has been pushed outside of $S$. The hash table $\mathcal{T}$ is of course updated accordingly as well.

Let $\Sigma_{\mathrm{filt}} = \{0, \dots, |\Sigma'_P|\}$ denote the symbols outputted by the filter. We augment the elements of $\mathcal{L}$ to maintain a mapping $M$ from the symbols in $\Sigma_{\mathrm{last}}$ to distinct symbols in $\Sigma_{\mathrm{filt}}$ as follows. Whenever a new symbol is added to $\Sigma_{\mathrm{last}}$, map it to an unused symbol in $\Sigma_{\mathrm{filt}}$. If no such symbol exists, then use the symbol that is associated with the symbol of $\Sigma_{\mathrm{last}}$ that is to be removed from $\Sigma_{\mathrm{last}}$ (note that $|\Sigma_{\mathrm{last}}| \le |\Sigma_{\mathrm{filt}}|$). The mapping $M$ specifies the filtered stream: when a symbol $x$ arrives, the filter outputs $M(x)$. Finding $M(x)$ and updating $\mathcal{T}$ is done in $O(1)$ time per arriving character, and both the tree $\mathcal{T}$ and the list $\mathcal{L}$ can be stored in $O(|\Sigma'_P|)$ space.

It remains to show that the filtered stream does not induce any false matches or miss a potential match. Suppose first that the number of distinct symbols in $S$ is $|\Sigma'_P|$ or fewer. That is, $\Sigma_{\mathrm{last}}$ contains all distinct symbols in $S$. Every symbol $x$ in $S$ has been replaced by a unique symbol in $\Sigma_{\mathrm{filt}}$ and the construction of the filter ensures that the mapping is one-to-one. Thus, $\mathrm{pred}(S_{\mathrm{filt}}) = \mathrm{pred}(S)$. Suppose second that the number of distinct symbols in $S$ is $|\Sigma'_P| + 1$ or more. That is, $|\Sigma_{\mathrm{last}}| = |\Sigma'_P| + 1$ and therefore $S_{\mathrm{filt}}$ contains $|\Sigma'_P| + 1$ distinct symbols. Thus, $\mathrm{pred}(S_{\mathrm{filt}})$ cannot equal $\mathrm{pred}(P')$. The claimed result then follows from Theorem 1.

D Proofs omitted from Section 4

Proof of Lemma 8. Let $\rho$ be the p-period of $P$. We prove the lemma by contradiction.
Suppose, for some $j$ and $k$, that $i = j + k\rho$ is a position such that $\mathrm{pred}(P)[i] = c \ge 1$ and $\mathrm{pred}(P)[i + \rho] = c' \neq c$. Consider Figure 3 for a concrete example, where $\rho = 5$, $i = 12$, $\mathrm{pred}(P)[12] = c = 4$ and $\mathrm{pred}(P)[12 + 5] = c' = 3$. Since $\rho$ is a p-period of $P$, we have that $\mathrm{pred}(P[\rho, m - 1]) = \mathrm{pred}(P[0, (m - 1 - \rho)])$. Consider the alignment of positions $i + \rho$ and $i$ (positions 17 and 12 in Figure 3). We have that $\mathrm{pred}(P[\rho, m - 1])[i]$ is either $c'$ or $0$. In either case, it is certainly not $\mathrm{pred}(P[0, m - 1 - \rho])[i]$, which is $c$. Thus, $\rho$ cannot be a p-period of $P$.

Figure 3: An example demonstrating the structure of $\mathrm{pred}(P')$ used in the proof of Lemma 8.

Proof of Lemma 9. By Lemma 8 we can encode $\mathrm{pred}(P)$ by storing the two values $k_j$ and $c_j$ for each $j \in [\rho]$. This takes $O(\rho)$ space. The value $\mathrm{pred}(P)[i]$ is $0$ if $i < k_{(i \bmod \rho)}$, otherwise it is $c_{(i \bmod \rho)}$.
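The encoding used in the proof of Lemma 9 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation behind Theorem 4: it assumes that $\mathrm{pred}(S)[j]$ is the distance to the previous occurrence of $S[j]$ (and $0$ if there is none), it computes the p-period naively from its definition, and it compares the multiple $k = \lfloor i/\rho \rfloor$ against the threshold $k_j$, following the statement of Lemma 8. All function names are hypothetical.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence of each
    symbol, or 0 if the symbol has not occurred before (assumed definition)."""
    last_seen = {}
    out = []
    for idx, symbol in enumerate(s):
        out.append(idx - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = idx
    return out


def p_period(s):
    """Smallest rho >= 1 such that s[rho:] p-matches s[:len(s) - rho].

    Naive computation, used only to set up the example.
    """
    m = len(s)
    for rho in range(1, m + 1):
        if pred(s[rho:]) == pred(s[:m - rho]):
            return rho
    return m


def encode(s):
    """Store one pair (k_j, c_j) per residue j modulo the p-period."""
    rho = p_period(s)
    ps = pred(s)
    table = []
    for j in range(rho):
        column = ps[j::rho]  # pred(s)[j], pred(s)[j + rho], ...
        nonzero = [k for k, value in enumerate(column) if value != 0]
        if nonzero:
            table.append((nonzero[0], column[nonzero[0]]))
        else:
            # The column is all zeros: the threshold lies past the end of s
            # and c_j is never used (0 is stored as a placeholder).
            table.append((len(column), 0))
    return rho, table


def lookup(rho, table, i):
    """Recover pred(s)[i] from the O(rho)-size table in O(1) time."""
    k_j, c_j = table[i % rho]
    return 0 if i // rho < k_j else c_j
```

For instance, for the pattern abbca the sketch finds p-period 3 and stores the pairs `(2, 0)`, `(1, 4)` and `(0, 1)`, from which `lookup` reproduces the predecessor string `[0, 0, 1, 0, 4]`.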