Approximating General Metric Distances Between a Pattern and a Text
Authors: Klim Efremenko, Ely Porat
Ely Porat∗   Klim Efremenko†‡

∗ Department of Computer Science, Bar-Ilan University, 52900 Ramat-Gan, Israel; porately@cs.biu.ac.il
† Department of Computer Science, Bar-Ilan University, 52900 Ramat-Gan, Israel
‡ Weizmann Institute, Rehovot 76100, Israel

October 30, 2018

Abstract

Let $T = t_0 \ldots t_{n-1}$ be a text and $P = p_0 \ldots p_{m-1}$ a pattern taken from some finite alphabet set $\Sigma$, and let $d$ be a metric on $\Sigma$. We consider the problem of calculating the sum of distances between the symbols of $P$ and the symbols of substrings of $T$ of length $m$, for all possible offsets. We present an $\varepsilon$-approximation algorithm for this problem which runs in time $O(\frac{1}{\varepsilon^2} n \cdot \mathrm{polylog}(n, |\Sigma|))$.

1 Introduction

String matching, the problem of finding all occurrences of a given pattern in a given text, is a classical problem in computer science. The problem has pleasing theoretical features and a number of direct applications to "real world" problems. Advances in multimedia, digital libraries, and computational biology have shown that a much more generalized theoretical basis of string matching could be of tremendous benefit [?, ?]. To this end, string matching has had to adapt itself to increasingly broader definitions of "matching". Two types of problems need to be addressed: generalized matching and approximate matching. In generalized matching, one seeks all exact occurrences of the pattern in the text, but the "matching" relation is defined differently. The output is all locations in the text where the pattern "matches" according to the new definition of a match. The different applications define the matching relation. Examples can be seen in Baker's parameterized matching ([?]) or Amir and Farach's less-than matching ([?]). The second model, and the one we are concerned with in this paper, is that of approximate matching. In approximate matching, one defines a distance metric between the objects (e.g. strings, matrices) and seeks to calculate this distance for all text locations. Usually we seek locations where this distance is small enough. One of the earliest and most natural metrics is the Hamming distance, where the distance between two strings is the number of mismatching characters. There exist algorithms calculating this distance exactly [?] and approximating it [?]. Levenshtein [?] identified three types of errors: mismatches, insertions, and deletions. These operations are traditionally used to define the edit distance between two strings. The edit distance is the minimum number of edit operations one needs to perform on the pattern in order to achieve an exact match at the given text location. Lowrance and Wagner [?, ?] added the swap operation to the set of operations defining the distance metric. Much of the recent research in string matching concerns itself with understanding the inherent "hardness" of the various distance metrics, by seeking upper and lower bounds for string matching under these conditions. A natural subset of these problems is when the distance is defined only on the alphabet, and therefore the distance between two strings is the sum of the distances between the corresponding characters in both strings. It is possible to solve this problem in time $O(n \cdot m)$ by employing the naive approach of summing the distances between each character of the pattern and its corresponding character in the text, for each possible alignment of the pattern. This problem was first defined by Muthukrishnan in [?] and has been open since. In this paper we present an approximation algorithm for this problem.
This algorithm consists of two parts. The first part is a preprocessing phase in which random hash functions on the alphabet are constructed. We use the same hashing that Bartal used for tree embeddings in [?]. We use this hashing in order to separate the places where the distance between letters is large from the places where this distance is small. The second part of the algorithm is an application of sampling ([?]), which allows us to give an approximation of the distance between the text and the pattern, in time $O(\frac{1}{\varepsilon^2} n \cdot \mathrm{polylog}(n, |\Sigma|))$. The contributions of this paper are twofold. On the technical side, we have solved a problem that has been open for over a decade, by presenting the fastest known approximation algorithm for many metrics. Additionally, and this is perhaps the more important contribution of this paper, we have identified and exploited a new technique, sampling, that has been used in some recent papers ([?]) only implicitly. We employ sampling in a much more sophisticated manner and show how to use this important tool for approximating distances. We also present a novel way of using embeddings and geometry tools in pattern matching. This technique possesses a wide range of applications. For example, one can easily extend it to calculate the $\ell_2$-norm distance (in other words, when

$$\mathrm{Dist}(T[i \ldots i+m-1], P) = \sqrt{\sum_{j=0}^{m-1} d(t_{i+j}, p_j)^2} \qquad (1)$$

is the distance measure), or it can be extended to many infinite metrics. Our algorithm also allows symbols in the text to be wildcards. We believe that this new method for solving approximate string matching problems, embedding the metric in some suitable space and sampling, may actually yield efficient algorithms for many more problems in the future.

2 Problem definition and Preliminaries

Definition 2.1.
A metric space is a pair $(X, d)$ where $X$ is a set of points and $d : X \times X \to \mathbb{R}^+$ is a metric satisfying the following axioms:

$$d(x_1, x_2) = d(x_2, x_1), \qquad d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3), \qquad d(x_1, x_2) = 0 \Leftrightarrow x_1 = x_2. \qquad (2)$$

Let $(\Sigma, d)$ be a metric space, and let $A = a_0 \ldots a_{m-1}$, $B = b_0 \ldots b_{m-1}$ be any two strings of the same length with symbols from $\Sigma$. We define the distance of $A$ and $B$ by $\mathrm{Dist}(A, B) = \sum_{i=0}^{m-1} d(a_i, b_i)$. Given a text $T = t_0 \ldots t_{n-1}$ and a pattern $P = p_0 \ldots p_{m-1}$, our goal is to calculate the array $S[i] = \mathrm{Dist}(T[i \ldots i+m-1], P)$ for each possible offset $i = 0, \ldots, n-m$. Calculating the exact values of $S$ can be done in $O(nm)$ time using the naive approach. In most cases it is enough to know only an approximation of the distance; we therefore present an efficient algorithm which approximates the values of $S$. Convolutions can be used in the standard fashion to improve the time for finite fixed alphabets.

Definition 2.2. Let $A[0], \ldots, A[n-1]$ and $B[0], \ldots, B[m-1]$ be arrays of natural numbers. The discrete convolution (polynomial multiplication) of $A$ and $B$ is $V$, where

$$V[i] = \sum_{j=0}^{m-1} A[i-j] B[j], \qquad (3)$$

for $i = 0, \ldots, n-m$. We denote $V$ as $A * B$.

We will choose $A = T$ and $B = P$, i.e. we will treat $T$ and $P$ as the coefficients of polynomials of degrees $n-1$ and $m-1$, respectively. By standard tricks, namely (1) reversing the text to obtain $T^R$; (2) calculating $T^R * P$; (3) reversing the result; (4) discarding the first $m-1$ values and the last $m-1$ values of the result, we obtain an array $V = (T^R * P)^R[m-1 \ldots n-1]$ where for each $i$,

$$V[i] = \sum_{j=0}^{m-1} t_{i+j} \, p_j. \qquad (4)$$

In other words, for every possible offset $i$, $V[i]$ is the sum of the pattern symbols multiplied, each with its corresponding text symbol.
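As a sanity check, the reverse-convolve-reverse recipe above can be sketched in NumPy; this is a hedged illustration in which `np.convolve` stands in for the FFT-based polynomial multiplication:

```python
import numpy as np

def cross_correlate(t, p):
    """V[i] = sum_j t[i+j] * p[j] for i = 0..n-m, via reverse-convolve-reverse."""
    n, m = len(t), len(p)
    # (1)-(2) reverse the text and convolve (polynomial multiplication)
    full = np.convolve(t[::-1], p)          # length n + m - 1
    # (3)-(4) reverse the result and keep entries m-1 .. n-1
    return full[::-1][m - 1:n]

t = np.array([1, 2, 3, 4, 5])
p = np.array([1, 0, 2])
print(cross_correlate(t, p))  # → [ 7 10 13], i.e. V[i] = t[i] + 2*t[i+2]
```

Replacing `np.convolve` with an FFT-based convolution (e.g. `scipy.signal.fftconvolve`, with rounding back to integers) gives the $O(n \log m)$ bound quoted below.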
For convenience, we define $T \otimes P = (T^R * P)^R[m-1 \ldots n-1]$. A convolution can be computed in time $O(n \log m)$, in a computational model with word size $O(\log m)$, by using the Fast Fourier Transform (FFT) [?].

Remark 2.3. Using the FFT we can compute general pattern distances in time $O(|\Sigma| n \log m)$, by the following method: for every $a \in \Sigma$, we define an array $\chi_a(P)$ by setting $\chi_a(P)[i] = \chi_a(P[i])$, where $\chi_a(x) = 1$ if $x = a$ and $0$ otherwise. Set $T_a[i] = d(a, t_i)$. Computing $T_a \otimes \chi_a(P)$ gives us the sum of the distances of the letter $a$ from the text. The sum of all convolution results is the desired distance.

We provide some required definitions regarding metric spaces:

Definition 2.4. Let $\rho$ be a mapping $\rho : X \to Y$, where $(X, d_1)$ and $(Y, d_2)$ are metric spaces. $\rho$ is an isometry if $\forall x, y \in X$, $d_1(x, y) = d_2(\rho(x), \rho(y))$.

Unfortunately, we cannot always construct an isometry. Therefore we consider weaker conditions:

Definition 2.5. Given two metric spaces $(X, d_1)$ and $(Y, d_2)$, and a value $c \ge 1$, a mapping $\rho : X \to Y$ is called a c-distortion embedding if for all $x, y \in X$,

$$d_1(x, y) \le d_2(\rho(x), \rho(y)) \le c \cdot d_1(x, y). \qquad (5)$$

3 The One-Mismatch Algorithm

In this section we describe the one-mismatch algorithm, in itself a very useful general tool in pattern matching. The one-mismatch algorithm has been described before in [?, ?]. We are given a numeric text $T$ and a numeric pattern $P$, and we want to find exact matches of $P$ in $T$. One way to do so is by calculating for each location $0 \le i \le n-m$ the value

$$\sum_{j=0}^{m-1} (p_j - t_{i+j})^2 = \sum_{j=0}^{m-1} \left( p_j^2 - 2 p_j t_{i+j} + t_{i+j}^2 \right).$$

Notice this sum will be zero iff there is an exact match at location $i$. Furthermore, this sum can be computed efficiently for all $i$'s in $O(n \log m)$ time using convolutions.
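Concretely, of the three terms in the expansion, $\sum_j p_j^2$ is independent of $i$ and the other two are correlations, so all offsets can be handled at once. A sketch, again with `np.convolve` standing in for the FFT-based convolution:

```python
import numpy as np

def squared_distances(t, p):
    """A0[i] = sum_j (p[j] - t[i+j])^2, split into three convolution-friendly terms."""
    n, m = len(t), len(p)
    # sum_j t[i+j] * p[j] for every offset i (reverse-convolve-reverse)
    corr = np.convolve(t[::-1], p)[::-1][m - 1:n]
    # sliding-window sum of t^2, as a correlation against the all-ones pattern
    win_t2 = np.convolve((t * t)[::-1], np.ones(m, dtype=int))[::-1][m - 1:n]
    return np.sum(p * p) - 2 * corr + win_t2

t = np.array([3, 1, 4, 1, 5])
p = np.array([4, 1])
print(squared_distances(t, p))  # → [ 1 18  0 25]; zero exactly where p matches t
```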
Notice that if $P$ and $T$ are not numeric, then an arbitrary one-to-one mapping can be chosen from the alphabet to the set of positive integers $\mathbb{N}$. This method can be extended to the case of matching with "don't cares" [?], by simply calculating instead

$$A_0[i] = \sum_{j=0}^{m-1} p'_j \, t'_{i+j} \, (p_j - t_{i+j})^2,$$

where $p'_j = 0$ (resp. $t'_j = 0$) if $p_j$ (resp. $t_j$) is a "don't care" symbol, and $1$ otherwise. Wherever there is an exact match with "don't cares", this sum will be exactly $0$. This can also be computed with convolutions in time $O(n \log m)$. Again, this scheme can be further extended to the one-mismatch problem, which is to determine if $P$ matches $T[i \ldots i+m-1]$ with at most one mismatch. Furthermore, we can identify the location of the mismatch for each such $i$. This is done by also computing, for each $i$,

$$A_1[i] = \sum_{j=0}^{m-1} j \, p'_j \, t'_{i+j} \, (p_j - t_{i+j})^2,$$

by using the convolution. If $P$ matches the text at offset $i$ with exactly one mismatch, then $A_0[i] = (p_r - t_{i+r})^2$ and $A_1[i] = r (p_r - t_{i+r})^2$, where $r$ is the location of the mismatch. Therefore, by calculating $A_1[i] / A_0[i]$, we find the supposed mismatch location and verify it. Finally, locations where exact matches occur will be labeled "match", locations where a single mismatch occurs will be labeled with its location, and locations where more than one mismatch occurs will be labeled $\perp$. The one-mismatch algorithm is therefore as follows:

1. Compute the array $A_0[i] = \sum_{j=0}^{m-1} p'_j t'_{i+j} (p_j - t_{i+j})^2$ using the FFT.
2. Compute the array $A_1[i] = \sum_{j=0}^{m-1} j \, p'_j t'_{i+j} (p_j - t_{i+j})^2$ using the FFT.
3. If $A_0[i] = 0$, set $B[i] \leftarrow$ "match"; else set $B[i] \leftarrow A_1[i] / A_0[i]$.
4. For each $i$ s.t. $B[i] \ne$ "match", check whether $(p_{B[i]} - t_{B[i]+i})^2 = A_0[i]$. If this is not the case, set $B[i] \leftarrow \perp$.
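The steps above can be sketched as follows. This is a hedged illustration: the wildcard encoding `wild=-1` is an assumption of the sketch, not from the paper, and the masked products are expanded into correlations exactly as in the formulas for $A_0$ and $A_1$:

```python
import numpy as np

def _corr(a, b):
    """sum_j a[i+j] * b[j] for every offset i, via reverse-convolve-reverse."""
    n, m = len(a), len(b)
    return np.convolve(a[::-1], b)[::-1][m - 1:n]

def one_mismatch(t, p, wild=-1):
    """Per offset: 'match', the single mismatch position, or None (> 1 mismatch)."""
    t, p = np.asarray(t), np.asarray(p)
    tm = (t != wild).astype(int)            # t' mask: 0 at don't-care symbols
    pm = (p != wild).astype(int)            # p' mask
    tt, pp = t * tm, p * pm                 # symbols with wildcards zeroed out
    j = np.arange(len(p))
    # A0[i] = sum_j p'_j t'_{i+j} (p_j - t_{i+j})^2, expanded into 3 correlations
    a0 = _corr(tm, pm * pp * pp) - 2 * _corr(tm * tt, pm * pp) + _corr(tm * tt * tt, pm)
    # A1[i] = sum_j j p'_j t'_{i+j} (p_j - t_{i+j})^2
    a1 = (_corr(tm, j * pm * pp * pp) - 2 * _corr(tm * tt, j * pm * pp)
          + _corr(tm * tt * tt, j * pm))
    out = []
    for i, (x0, x1) in enumerate(zip(a0, a1)):
        if x0 == 0:
            out.append('match')
            continue
        r = x1 // x0                        # supposed mismatch location
        if 0 <= r < len(p) and pm[r] * tm[i + r] * (p[r] - t[i + r]) ** 2 == x0:
            out.append(int(r))              # verified: exactly one mismatch, at r
        else:
            out.append(None)                # more than one mismatch (the paper's ⊥)
    return out

print(one_mismatch([1, 2, 3, 2, 1], [1, 2, 4]))   # [2, None, None]
```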
The running time of this algorithm is $O(n \log m)$.

4 The Sampling Method

In this section we present a general method referred to as the sampling method. It allows us, for every possible offset, to sample (i.e. choose) a random mismatch from the set of all mismatches w.r.t. this offset. We show how to utilize the previously described algorithm for this purpose. First fix some probability $0 < q \le 1$, and define the subpattern $P^*$ of $P$ by:

$$p^*_i = \begin{cases} p_i & \text{with probability } q; \\ \phi \text{ (don't care)} & \text{otherwise.} \end{cases} \qquad (6)$$

In the algorithm referred to as Sample($q, T, P$), we simply create $P^*$ as defined in (6) and run the one-mismatch algorithm on $P^*$ and $T$. Now, for every offset $i$, let $m_i$ be the number of mismatches between $T$ and $P$ w.r.t. this offset. The following lemma trivially follows:

Lemma 4.1. Let $B$ be the array returned by Sample($q, T, P$). For some location $i$, $\Pr(B[i] = \text{"match"}) = (1-q)^{m_i}$, and $\Pr(B[i] \text{ is a mismatch location}) = m_i \, q \, (1-q)^{m_i - 1}$.

Another important property of this algorithm is that the mismatch returned w.r.t. offset $i$ is uniformly distributed over the set of all mismatches w.r.t. offset $i$. We will show how to use this algorithm to sample a random location for which the distance is not $0$. Notice that for $q \approx \frac{1}{m_i}$, the probability of finding a mismatch w.r.t. offset $i$ is $\Omega(1)$. Therefore, we can enumerate over $q = \{2^{-j}\}_{j=0}^{\log m}$; then for every location $i$ there exists some $q$ which is $\approx \frac{1}{m_i}$. Therefore, the next algorithm finds for each location, with constant probability, a mismatch which is uniformly distributed over all the mismatches:

1. for $q = 1$; $q \ge 1/m$; $q = q/2$
2.     Sample($q, T, P$)
3.     For every offset $i$, if a mismatch is found, return it.

5 Motivation for the Algorithm

Remark 5.1. In this paper we assume that the ratio of the maximal and minimal distances is bounded by $B_d$. Therefore, w.l.o.g.
we can assume that the minimal nonzero distance is $1$, i.e. $\forall x, y \in \Sigma$, $d(x, y) \le B_d$ and $d(x, y) > 0 \iff d(x, y) \ge 1$. That is because if $B_{\min}$ is the minimal nonzero distance, then we can use the metric $\frac{d}{B_{\min}}$ instead.

A first naive approach to approximating the distance is as follows. Say we wish to provide an approximation only for some offset $i$, and let $X$ be a random variable equal to $d(t_{i+j}, p_j)$, where $j$ is chosen uniformly from $0, 1, \ldots, m-1$. We can sample $X$ by choosing a random $j$ and calculating $d(t_{i+j}, p_j)$. The expectation of $m \cdot X$ is the desired sum. Therefore, the way to compute $E(X)$ is to sample $X$ several times and return the average. The problem with this approach is that the variance of $X$ may be very large: for example, if $P$ matches $T$ except for a few mismatches, then w.h.p. we will not sample even a single mismatch. The second attempt to reduce the variance of the variable $X$ is to use the sampling algorithm described in Sect. 4. As a result, $X$ will be distributed only over locations where $d(t_{i+j}, p_j) > 0$, because the sampling algorithm returns only relative locations $j$ for which $t_{i+j} \ne p_j$. This sampling approach reduces the variance of $X$, but it may still happen that, for some offset $i$, all distances are very small except for a single one which is even greater than the sum of all others. With high probability, only the smaller distances will be sampled, thus affecting the final outcome. This approach can provide us with an algorithm which runs in $O(n \cdot B_d \cdot \mathrm{polylog}(n, |\Sigma|))$ time; however, $B_d$ may be very large. All of the above leads us to search for a way to sample only locations $j$ for which, when some $D$ is fixed, $D \le d(t_{i+j}, p_j) < 2D$.
Then, with an additional multiplicative factor of $\log(B_d)$, we can enumerate over $D = \{2^i\}_{i=0}^{\log B_d}$, at each step approximating the expectation of the variable $X_D$ which ranges uniformly over $\{d(t_{i+j}, p_j) \mid D \le d(t_{i+j}, p_j) < 2D\}$. Notice that for any value $a$,

$$\Pr(X_D = a) = \begin{cases} \dfrac{\Pr(X = a)}{\Pr(D \le X < 2D)} & D \le a < 2D, \\ 0 & \text{else.} \end{cases}$$

Hence it follows that

$$E(X) = \sum_D \Pr(D \le X < 2D) \, E(X_D).$$

A hypothetical way to sample $X_D$ would be to design a mapping $\pi_D$ on $\Sigma$ for which

$$D < d(x, y) < 2D \iff \pi_D(x) \ne \pi_D(y). \qquad (7)$$

If so, we could run the sampling algorithm on $\pi_D(T)$ and $\pi_D(P)$, applying $\pi_D$ symbol-wise in the obvious way, and obtain samples of $X_D$. Then the average of the sampled distances would approximate $E(X_D)$. The approximation of $\Pr(D \le X < 2D)$ would also be simple: it is the number of approximated mismatches divided by $m$ (i.e. the length of the pattern). Unfortunately, we cannot design such a mapping. However, we can design a set of mappings such that, for a random mapping, this condition holds with high probability.

6 Probabilistically Separating Hashing

In this section our goal is to construct, for a given $D$, a random hash $\pi$ such that (7) holds with good probability. A set of hash functions $\mathcal{H}_D$ is called a C-Probabilistically Separating Hashing if it satisfies the following two conditions:

1. If the distance between $x$ and $y$ is at least $D$, then their hashes differ, i.e. $d(x, y) \ge D \Rightarrow \forall \pi \in \mathcal{H}_D,\ \pi(x) \ne \pi(y)$.
2. $\forall x, y \in \Sigma$, $\Pr_{\pi \in \mathcal{H}_D}(\pi(x) \ne \pi(y)) \le C \frac{d(x, y)}{D}$.

Bartal [?] gave a construction of a $\log|\Sigma|$-Probabilistically Separating Hashing for finite metrics, which was later extended to graphs embedded in real normed spaces in [?]. In Section 8 we will give a simple construction for the case when the alphabet is a normed space $\mathbb{R}^d$ with small $d$.
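The level decomposition $E(X) = \sum_D \Pr(D \le X < 2D)\, E(X_D)$ used above can be checked numerically; the distance values below are made up purely for illustration:

```python
import random

random.seed(0)
# toy multiset of distances d(t_{i+j}, p_j); zeros play the role of matches
dists = [random.choice([0, 0, 1, 1.5, 3, 7, 12]) for _ in range(10000)]

lhs = sum(dists) / len(dists)                       # E(X) computed directly

rhs, D = 0.0, 1.0
while D <= max(dists):
    level = [x for x in dists if D <= x < 2 * D]    # values at dyadic level D
    if level:
        # Pr(D <= X < 2D) * E(X_D) for this level
        rhs += (len(level) / len(dists)) * (sum(level) / len(level))
    D *= 2

print(abs(lhs - rhs) < 1e-9)  # True: the levels partition the nonzero values
```

The levels $[1, 2), [2, 4), \ldots$ cover every nonzero value exactly once (minimal nonzero distance $1$, as in Remark 5.1), so the two sides agree exactly, not just approximately.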
Notice that we only need to build such a hashing once for every alphabet; therefore it can be done as a preprocessing step. We are able to use $\pi_D$ in order to sample a subset of indices for which the distance is not too small. We will now show that this also allows us to sample $X_D$ as we desire.

Lemma 6.1. Fix an offset $i$. Let $A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$ be the set of mismatches under $\pi_D$, and let $B = \{j \mid D < d(t_{i+j}, p_j) \le 2D\}$ be the set of indices we are really interested in sampling from. Then:

1. $E_{\pi_D \in \mathcal{H}_D}(|A|) = O\left(\frac{S \cdot C}{D}\right)$ (where the expectation is over the choice of $\pi_D$);
2. $B \subseteq A$ and $\frac{S_D}{2D} \le |B| \le \frac{S_D}{D}$,

where $S = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$ and $S_D = \sum_{j \in B} d(p_j, t_{i+j})$.

Proof. By linearity of expectation:

$$E(|A|) = \sum_j \Pr(\pi_D(t_{i+j}) \ne \pi_D(p_j)).$$

By the definition of Probabilistically Separating Hashing we have:

$$\sum_j \Pr(\pi_D(t_{i+j}) \ne \pi_D(p_j)) \le \sum_j C \, \frac{d(p_j, t_{i+j})}{D} = \frac{C \cdot S}{D}.$$

This proves (1). $B \subseteq A$ follows from Theorem 8.1. Moreover, $\frac{S_D}{D} = \sum_{j \in B} \frac{d(p_j, t_{i+j})}{D}$, and $1 \le \frac{d(p_j, t_{i+j})}{D} \le 2$ for $j \in B$, so $\frac{S_D}{2D} \le |B| \le \frac{S_D}{D}$. $\square$

7 The Algorithm

At this point we have all the tools necessary to describe the algorithm. The algorithm is based upon the application of the sampling algorithm, described previously, to the C-probabilistically separating hashing provided in Sect. 6. As a preprocessing phase, we construct, for the metric space $(\Sigma, d)$, samples of hashings $\pi_D \in \mathcal{H}_D$ for $D = 2^i$. The preprocessing algorithm therefore gets a metric space $(\Sigma, d)$, where $\Sigma$ is the alphabet and $d$ is the metric on it, and produces $O(\frac{1}{\varepsilon^2} \log |\Sigma| \log m)$ hash functions $\pi_D$ chosen at random from $\mathcal{H}_D$. The main (i.e. query) algorithm gets a text $T = t_0 t_1 \ldots t_{n-1}$ and a pattern $P = p_0 \ldots p_{m-1}$ over the alphabet $\Sigma$.
For a fixed offset $i$, the result will be in $((1-\varepsilon) S_i, (1+\varepsilon) S_i)$ with probability $1 - e^{-t}$. The output of the algorithm is an array $R[0 \ldots n-m]$ where $R[i]$ is an $\varepsilon$-approximation to $S[i] = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$, i.e.:

$$\forall i \quad \Pr\big( |R[i] - S[i]| \ge \varepsilon S[i] \big) \le e^{-t}. \qquad (8)$$

We will now outline the idea of the algorithm. We want to approximate

$$m E(X) = \sum_D m \Pr(D \le X < 2D) \, E(X_D).$$

We will enumerate over $D$, increasing it each time by a factor of $2$, and approximate $m \Pr(D \le X < 2D) E(X_D)$. Fix some $D$ and some offset $i$, and let, as before, $A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$, where $A$ depends on the random mapping $\pi_D$, and $B = \{j \mid D < d(t_{i+j}, p_j) \le 2D\}$. Recall that $B \subseteq A$, and that $\frac{|B|}{|A|}$ is not too small. In order to approximate $E(X_D)$ we will use the sampling algorithm on $\pi_D(P)$ and $\pi_D(T)$. We get a random element of $A$, and we check whether this element is also in $B$. In order to approximate $E(X_D)$ we average the distances of the elements found in $B$. In order to approximate $m \Pr(D \le X < 2D) = |B|$ we use Lemma 4.1. The probability that the sampling algorithm returns "match" is $q_0 = E(1-q)^{|A|}$, and the probability that it returns a mismatch from the set $B$ is $q_1 = |B| \, q \, E(1-q)^{|A|-1}$. So

$$|B| = \frac{q_1 (1-q)}{q \, q_0}.$$

Assume that we run the sampling algorithm $K$ times; then the total number of matches is $m_0 \approx K q_0$ and the total number of elements found in $B$ is $m_1 \approx K q_1$. Let $M_1$ be an array of the elements of $B$ which were found, including repetitions of elements of $B$; $|M_1| = m_1$. We will approximate $|B|$ by $\frac{m_1 (1-q)}{q \, m_0}$, because

$$|B| = \frac{E(m_1)(1-q)}{q \, E(m_0)}, \qquad (9)$$

and approximate $E(X_D)$ by $\frac{\sum_{j \in M_1} d(p_j, t_{i+j})}{m_1}$. Therefore:

$$m \Pr(D \le X < 2D) \, E(X_D) \approx \frac{(1-q)}{q \, m_0} \sum_{j \in M_1} d(p_j, t_{i+j}). \qquad (10)$$

We will need to show that this approximation is narrow, i.e.
that the variance of the approximation is small. In order to do so, we will need to choose $q$ s.t. $q \approx \frac{1}{E|A|}$. In order to find such a $q$, we try a series of $q$'s, decreasing by a factor of $2$ each time, and choose a $q$ s.t. $m_0$ is large enough and $q m_0$ is maximal. We prove that this produces a good $q$ w.h.p. We now write the complete algorithm. Set $K = O(\frac{1}{\varepsilon^2} \cdot C \cdot t)$.

1: for $D = B_d$; $D \ge 1$; $D = D/2$ do
2:   for $q = 1/2$; $q > \frac{1}{m}$; $q = q/2$ do
3:     for iter $= 1$; iter $\le K$; iter $=$ iter $+ 1$ do
4:       Choose a random $\pi \in \mathcal{H}_D$
5:       Run Sample($q, \pi(T), \pi(P)$). Save the result as the iter-th result for this $q$.
6:     end for
7:   end for
8:   for all offsets $i$ in the text do
9:     Calculate $m_0(i, q)$, the number of matches, for all $q$'s
10:    Among all $q$ such that $m_0 \ge e^{-4} \cdot K$, choose $q(i)$ s.t. $q(i) \, m_0(i, q(i))$ is maximal
11:    Set $M_1$ to be the set of distances between $D$ and $2D$ for this $q$
12:    Calculate $S_D(i) = \frac{(1-q)}{q \, m_0} \sum_{j \in M_1} d(t_{i+j}, p_j)$
13:   end for
14: end for
15: for every offset $i$ return $R(i) = \sum_D S_D(i)$

Algorithm 1: General distance algorithm

The running time of this algorithm is $O(\frac{1}{\varepsilon^2} n \log^2 m \log |\Sigma| \log B_d)$. This is because the running time is mostly dominated by the Sample function, which takes $O(n \log m)$ time and is executed $O(\frac{1}{\varepsilon^2} \log |\Sigma| \log m \log B_d)$ times.

Theorem 7.1. For every offset $i$,

$$\Pr\left( \left| R(i) - \sum_{j=0}^{m-1} d(p_j, t_{i+j}) \right| \ge \varepsilon R(i) \right) \le e^{-t}, \qquad (11)$$

or in other words, our algorithm returns an $\varepsilon$-approximation w.h.p. The proof of this theorem appears in the appendix.

8 Explicit hashing constructions for normed spaces

In this section we construct an explicit $d$-Probabilistically Separating Hashing for the normed space $\mathbb{R}^d$ with the $L_p$ norm, $1 \le p \le \infty$. The main problem with previous constructions [?
] is that the hashing cannot be calculated efficiently; usually it takes $O(n^2)$ time to calculate one hash function. In the case that the points are not given in advance, this may be a bottleneck of the algorithm. Another reason why this construction is important is the following: if we have a $d$-Probabilistically Separating Hashing for a space $X$ and an embedding $f : Y \to X$ with distortion $c$, then we can construct a $cd$-Probabilistically Separating Hashing for the space $Y$. The problem of embedding metric spaces into real normed spaces has been deeply investigated. Our construction is the same for every norm $L_p$. Let $\vec{\varepsilon}$ be a vector of $d$ independent random variables with uniform distribution on $[0, 1]$. Define:

$$\pi_D(\vec{x}) = \left\lfloor \frac{\vec{x}}{D} - \vec{\varepsilon} \right\rfloor = \left( \left\lfloor \frac{x_1}{D} - \varepsilon_1 \right\rfloor, \ldots, \left\lfloor \frac{x_d}{D} - \varepsilon_d \right\rfloor \right). \qquad (12)$$

Theorem 8.1. The above mapping $\pi_D$ satisfies the following properties:

1. If the distance between $x$ and $y$ is at least $D \cdot d^{1/p}$, then their mappings differ, i.e. $\forall x, y \in \mathbb{R}^d$, $\|x - y\|_p \ge D \cdot d^{1/p} \Rightarrow \pi_D(x) \ne \pi_D(y)$.
2. $\forall x, y \in \mathbb{R}^d$, $\Pr(\pi_D(x) \ne \pi_D(y)) \le \frac{d \cdot \|x - y\|_p}{d^{1/p} \, D}$.

Proof.

1. This is trivial: $D d^{1/p} \le \|x - y\|_p \Rightarrow \exists i,\ |x_i - y_i| \ge D \Rightarrow \forall \vec{\varepsilon},\ \pi_D(x) \ne \pi_D(y)$.

2. If $\frac{d \, \|x - y\|_p}{d^{1/p} D} \ge 1$, the inequality is trivial. Otherwise, by a union bound over the coordinates:

$$\Pr(\pi_D(x) \ne \pi_D(y)) \le \sum_{i=1}^{d} \Pr\big( \pi_D(x)_i \ne \pi_D(y)_i \big) \le \sum_{i=1}^{d} \Pr\left( \left\lfloor \frac{x_i}{D} - \varepsilon_i \right\rfloor \ne \left\lfloor \frac{y_i}{D} - \varepsilon_i \right\rfloor \right) \le \sum_{i=1}^{d} \frac{|x_i - y_i|}{D} = \frac{\|\vec{x} - \vec{y}\|_1}{D} \le \frac{d \, \|\vec{x} - \vec{y}\|_p}{d^{1/p} \, D}. \quad \square$$

We choose to use $\pi$ as our hashing, and notice that $\pi$ is easy to calculate, assuming we already have the $c$-embedding $\sigma$. This calculation can be done in $O(d |\Sigma|)$ time.

Remark 8.2. Consider the set $\{\pi_D(x) \mid x \in \Sigma\}$. While each of its members is a vector of length $d$, when comparing these vectors we are only interested in checking equality.
Therefore, in order to save space, we can replace each vector with a unique number in $\{1, \ldots, |\Sigma|\}$.

9 Conclusions

We have presented the first non-trivial algorithm for the approximation of a large class of distances between text and pattern. We believe that the techniques we have presented here have a wide range of applications. A further interesting open question is to generalize these techniques to the case where the distance is not necessarily a metric.

A The proof of the algorithm

Remark A.1. Here w.h.p. means with probability more than $1 - e^{-t}$.

We will now prove that the algorithm indeed approximates the distances for each $i$ w.h.p. We only sketch the proof.

Proof (of Algorithm 1). Fix some offset $i$. Then for every $D$ we define the two sets $B = \{j \mid D < d(t_{i+j}, p_j) \le 2D\}$ and $A = \{j \mid \pi_D(t_{i+j}) \ne \pi_D(p_j)\}$. Notice that $|A|$ is a random variable.

Claim A.1. W.h.p., for every $D$ there exists $q(D)$ s.t. $m_0(q) \ge e^{-4} K$ and $q \cdot m_0 \ge \frac{e^{-4} K}{E(|A|)}$.

Proof. There exists $q$ s.t. $\frac{1}{E(|A|)} \le q \le \frac{2}{E(|A|)}$. The probability of a match for this $q$ is $q_0 = E(1-q)^{|A|}$; by Jensen's inequality, $E(1-q)^{|A|} \ge (1-q)^{E(|A|)} \ge e^{-3}$. Now $m_0(q)$ has binomial distribution $B(q_0, K)$, and so w.h.p. $m_0 \ge e^{-4} K$. For this $q$ it also holds that $q \cdot m_0 \ge \frac{e^{-4} K}{E(|A|)}$, and therefore the same holds for the $q$ that the algorithm chose. $\square$

Claim A.2. Let $\tilde{m}_0(D) = E(m_0) = K E(1-q)^{|A|}$. Then w.h.p. $(1-\varepsilon)\, m_0(D) \le \tilde{m}_0(D) \le (1+\varepsilon)\, m_0(D)$.

Proof. This follows from the Chernoff bound, because $m_0$ is a binomially distributed variable $B(K, p)$ with $p \ge e^{-4}$. $\square$

Let $c_D = \frac{(1 - q(D))\, D}{q(D)\, \tilde{m}_0(D)}$ be a constant, and let

$$\tilde{S}(i) = \sum_D c_D \sum_{j \in M_1(D)} \frac{d(t_{i+j}, p_j)}{D}.$$

Claim A.3. $\tilde{S}(i)$ is close to $R(i)$ w.h.p., i.e.
$(1-\varepsilon)\, R(i) \le \tilde{S}(i) \le (1+\varepsilon)\, R(i)$.

Proof. We can represent $R(i)$ as:

$$R(i) = \sum_D \sum_{j \in M_1(D)} \frac{(1 - q(D))\, D}{q(D)\, m_0(D)} \cdot \frac{d(t_{i+j}, p_j)}{D}.$$

By the previous claim, $m_0(D)$ is close to $\tilde{m}_0(D)$. $\square$

Claim A.4. $E(\tilde{S}(i)) = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$.

Proof.

$$E(\tilde{S}(i)) = \sum_D \frac{c_D}{D} \, E\left( \sum_{j \in M_1(D)} d(t_{i+j}, p_j) \right) = \sum_D \frac{c_D}{D} \, E|M_1(D)| \, E(X_D). \qquad (13)$$

But we know that:

$$\frac{c_D \, E(|M_1(D)|)}{D} = \frac{(1 - q(D))\, E(m_1(D))}{q(D)\, \tilde{m}_0(D)} = \frac{(1 - q(D))\, E(m_1(D))}{q(D)\, E(m_0(D))}. \qquad (14)$$

By (9) we have:

$$\frac{c_D \, E(|M_1(D)|)}{D} = |B| = m \Pr(D \le X < 2D).$$

So we have:

$$E(\tilde{S}(i)) = \sum_D m \Pr(D \le X < 2D) \, E(X_D) = m E(X) = \sum_{j=0}^{m-1} d(p_j, t_{i+j}). \qquad (15) \quad \square$$

Claim A.5. There exists a universal constant $C$ s.t. w.h.p.

$$c_D \le \frac{C \varepsilon^2}{t} \sum_D c_D |M_1(D)|.$$

Proof. The previous claim showed that $S(i) \le \sum_D 2\, c_D |M_1(D)|$, because $\frac{d(p_j, t_{i+j})}{D} \le 2$ for $j \in M_1(D)$. Therefore it is enough to prove that $c_D \le \frac{C \varepsilon^2}{t} S(i)$. By Claim A.1 we know that $q m_0 \ge \frac{e^{-4} K}{E(|A|)}$. So we have:

$$c_D = \frac{(1 - q(D))\, D}{q(D)\, \tilde{m}_0(D)} \le \frac{C \, E(|A|)\, D}{K}. \qquad (16)$$

By Lemma 6.1, $E(|A|)\, D = O(S \cdot c \cdot d)$, and $K = O(\frac{1}{\varepsilon^2} \cdot c \cdot d \cdot t)$, so

$$\frac{C \, E(|A|)\, D}{K} = O\left( \frac{\varepsilon^2}{t} S \right). \qquad (17) \quad \square$$

We state the following lemma without proof, as it follows from the Chernoff bound:

Lemma A.2. There exists a universal constant $C$ s.t. for every sequence of independent random variables $X_1, X_2, \ldots, X_n$ with $1 \le X_i \le 2$, and every sequence of positive constants $c_1, c_2, \ldots, c_n$ s.t. $c_i < \frac{C \varepsilon^2}{t} \sum_{i=1}^{n} c_i$:

$$\Pr\left( \left| \sum_{i=1}^{n} c_i X_i - E\Big( \sum_{i=1}^{n} c_i X_i \Big) \right| \ge \varepsilon \, E\Big( \sum_{i=1}^{n} c_i X_i \Big) \right) \le e^{-t}.$$

Claim A.6.

$$\Pr\left( \left| \tilde{S} - \sum_{j=0}^{m-1} d(p_j, t_{i+j}) \right| \ge \varepsilon \tilde{S} \right) \le e^{-t}. \qquad (18)$$

Proof.
By definition, $\tilde{S} = \sum c_i \frac{X_D}{D}$ with $1 \le \frac{X_D}{D} \le 2$; by Claim A.4, $E(\tilde{S}) = \sum_{j=0}^{m-1} d(p_j, t_{i+j})$; by Claim A.5, $c_i < \frac{C \varepsilon^2}{t} \sum_{i=1}^{n} c_i$. Therefore Lemma A.2 proves the claim. $\square$
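To make the construction of Section 8 concrete, here is a pure-Python sketch of the grid hashing (12), with an empirical check of both properties of Theorem 8.1 for $p = 1$ (where the bound in property 2 reduces to $\|x - y\|_1 / D$); the points and parameters are illustrative only:

```python
import math
import random

def make_hash(D, d, rng):
    """Draw pi_D from the family (12): coordinate-wise floor of a randomly shifted grid."""
    eps = [rng.random() for _ in range(d)]
    return lambda x: tuple(math.floor(x[i] / D - eps[i]) for i in range(d))

rng = random.Random(1)
D, d, trials = 4.0, 2, 20000
x, y = (0.0, 0.0), (1.0, 0.5)        # a close pair: ||x - y||_1 = 1.5 < D

# Property 1: a coordinate gap of D forces different buckets for every eps
far = (D, 0.0)
for _ in range(200):
    h = make_hash(D, d, rng)
    assert h(x) != h(far)

# Property 2 (p = 1): estimate Pr(pi_D(x) != pi_D(y)) and compare to ||x - y||_1 / D
hits = 0
for _ in range(trials):
    h = make_hash(D, d, rng)
    hits += h(x) != h(y)
bound = sum(abs(a - b) for a, b in zip(x, y)) / D
print(hits / trials <= bound)  # True (estimate ≈ 0.34 vs bound 0.375)
```

Per coordinate, the floors differ exactly when a grid boundary falls between $x_i$ and $y_i$, which happens with probability $|x_i - y_i| / D$; the union bound over coordinates gives the bound checked above.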