Efficient Pattern Matching on Binary Strings

Simone Faro (1) and Thierry Lecroq (2)

(1) Dipartimento di Matematica e Informatica, Università di Catania, Italy
(2) University of Rouen, LITIS EA 4108, 76821 Mont-Saint-Aignan Cedex, France
faro@dmi.unict.it, thierry.lecroq@univ-rouen.fr

Abstract. The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover, the problem also finds applications in the field of image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem, designed to completely avoid any reference to bits, so that pattern and text are processed byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.

Keywords: string matching, binary strings, experimental algorithms, compressed text processing, text processing.

1 Introduction

Given a text $t$ and a pattern $p$ over some alphabet $\Sigma$ of size $\sigma$, the string matching problem consists in finding all occurrences of the pattern $p$ in the text $t$. It is a very extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, information retrieval, computational biology and chemistry, etc.

In this article we consider the problem of searching for a pattern $p$ of length $m$ in a text $t$ of length $n$, where both strings are built over a binary alphabet and each character of $p$ and $t$ is represented by a single bit.
Thus the memory space needed to represent $t$ and $p$ is, respectively, $\lceil n/8 \rceil$ and $\lceil m/8 \rceil$ bytes.

This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Many formats for data exchange between nodes in distributed computer systems, as well as most network protocols, use binary representations. Binary images often arise in digital image processing as masks or as the results of certain operations such as segmentation, thresholding and dithering. Moreover, some input/output devices, such as laser printers and fax machines, can only handle binary images.

The main reason for using binaries is size. A binary is a much more compact format than the symbolic or textual representation of the same information. Consequently, fewer resources are required to transmit binaries over the network. For this reason the binary string matching problem also finds applications in pattern matching on compressed texts, when using the Huffman compression strategy [KS05,SD06,FG06].

Observe that the text $t$, and the pattern $p$ to search for, cannot be directly processed as strings with a super-alphabet [Fre02,FG06], i.e., where each group of 8 bits is considered as a character of the text. This is because an occurrence of the pattern can be found starting at the middle of a group. Suppose, for instance, that $t$ = 011001001000100110100101000101001001 and $p$ = 0100110100. If we write text and pattern as groups of 8 bits then we obtain

    t = 01100100 10001001 10100101 00010100 1001
    p = 01001101 00

The occurrence of the pattern at position 11 of the text cannot be located by a classical pattern matching algorithm based on a super-alphabet.

It is possible to simply adapt classical efficient algorithms for exact pattern matching to binary matching with minor modifications.
In these algorithms we can replace each reference to the character at position $i$ with a reference to the bit at position $i$. Roughly speaking, we can substitute occurrences of $t[i]$ with $getBit(t, i)$, which returns the $i$-th bit of the text $t$. This transformation does not affect the time complexity of the algorithm, but it is time consuming in practice and in general not very efficient.

In [KBN07] Klein and Ben-Nissan proposed an efficient variant of the Boyer-Moore algorithm for the binary case that does not refer to bits. The algorithm is designed to process only entire blocks, such as bytes or words, and achieves a significant reduction in the number of text character inspections. In [KBN07] the authors also showed by empirical results that the new variant performs better than the regular binary Boyer-Moore algorithm and even than binary versions of the most effective algorithms for classical pattern matching.

In this note we present two new efficient algorithms for matching on binary strings which, despite their $O(nm)$ worst case time complexity, obtain very good results in practical cases. The first algorithm is an adaptation of the $q$-Hash algorithm [Lec07], which is among the most efficient algorithms for the standard pattern matching problem. We show how the technique adopted by the algorithm can be naturally translated to allow for blocks of bits. The second solution can be seen as an adaptation to binary string matching of the Skip-Search algorithm [CLP98]. This algorithm can be efficiently adapted to completely avoid any reference to bits, so that pattern and text are processed byte by byte.

The paper is organized as follows. In Section 2 we introduce basic definitions and the terminology used throughout the article. In Section 3 we introduce a high level model used to process binary strings avoiding any reference to bits.
Next, in Section 4, we introduce the new solutions. Experimental data obtained by running all the reviewed algorithms under various conditions are presented and compared in Section 5. Finally, we draw our conclusions in Section 6.

2 Preliminaries and basic definitions

A string $p$ of length $m \ge 0$ is represented as a finite array $p[0 \,..\, m-1]$ of characters from a finite alphabet $\Sigma$. In particular, for $m = 0$ we obtain the empty string, also denoted by $\varepsilon$. By $p[i]$ we denote the $(i+1)$-th character of $p$, for $0 \le i < m$. Likewise, by $p[i \,..\, j]$ we denote the substring of $p$ contained between the $(i+1)$-th and the $(j+1)$-th characters of $p$, for $0 \le i \le j < m$. Moreover, for any $i, j \in \mathbb{Z}$, we put $p[i \,..\, j] = \varepsilon$ if $i > j$ and $p[i \,..\, j] = p[\max(i, 0) \,..\, \min(j, m-1)]$ if $i \le j$. A substring of $p$ is also called a factor of $p$. A substring of the form $p[0 \,..\, i]$ is called a prefix of $p$ and a substring of the form $p[i \,..\, m-1]$ is called a suffix of $p$, for $0 \le i \le m-1$. For any two strings $u$ and $w$, we write $w \sqsupseteq u$ to indicate that $w$ is a suffix of $u$. Similarly, we write $w \sqsubseteq u$ to indicate that $w$ is a prefix of $u$.

Let $t$ be a text of length $n$ and let $p$ be a pattern of length $m$. When the character $p[0]$ is aligned with the character $t[s]$ of the text, so that the character $p[i]$ is aligned with the character $t[s+i]$, for $i = 0, \ldots, m-1$, we say that the pattern $p$ has shift $s$ in $t$. In this case the substring $t[s \,..\, s+m-1]$ is called the current window of the text. If $t[s \,..\, s+m-1] = p$, we say that the shift $s$ is valid.

Most string matching algorithms have the following general structure. First, during a preprocessing phase, they calculate useful mappings, generally in the form of tables, which later are accessed to determine nontrivial shift advancements.
Next, starting with shift $s = 0$, they look for all valid shifts by executing a matching phase, which determines whether the shift $s$ is valid and computes a positive shift increment, $\Delta s$. Such an increment $\Delta s$ is used to produce the new shift $s + \Delta s$ to be fed to the subsequent matching phase.

For instance, in the case of the Naive string matching algorithm, there is no preprocessing phase and the matching phase always returns a unitary shift increment, i.e. all possible shifts are actually processed.

3 A High Level Model for Matching on Binary Strings

A string $p$ over the binary alphabet $\Sigma = \{0, 1\}$ is said to be a binary string and is represented as a binary vector $p[0 \,..\, m-1]$, whose elements are bits. Binary vectors are usually structured in blocks of $k$ bits, typically bytes ($k = 8$), halfwords ($k = 16$) or words ($k = 32$), which can be processed at the cost of a single operation. If $p$ is a binary string of length $m$ we use the symbol $P[i]$ to indicate the $(i+1)$-th block of $p$ and use $p[i]$ to indicate the $(i+1)$-th bit of $p$. If $B$ is a block of $k$ bits we indicate with the symbol $B_j$ the $j$-th bit of $B$, with $0 \le j < k$. Thus, for $i = 0, \ldots, m-1$ we have $p[i] = P[\lfloor i/k \rfloor]_{i \bmod k}$.

In this section we present a high level model to process binary strings which exploits the block structure of text and pattern to speed up the searching phase, avoiding bitwise operations on single bits. We suppose that the block size $k$ is fixed, so that all references to both text and pattern will only be to entire blocks of $k$ bits. We refer to a $k$-bit block as a byte, though values larger than $k = 8$ could be supported as well.
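
The relation $p[i] = P[\lfloor i/k \rfloor]_{i \bmod k}$ can be made concrete with a small sketch. The block below is only an illustration under stated assumptions: it takes $k = 8$, numbers bits from the left of each block, and uses the hypothetical helper names `pack_bits`, `get_bit` and `naive_bit_search`, none of which come from the paper. It runs the bit-by-bit adaptation of the naive algorithm on the example of the Introduction.

```python
K = 8  # block size k in bits; one byte per block is assumed here


def pack_bits(bits):
    """Pack a '0'/'1' string into blocks, leftmost bit first; unused trailing bits are 0."""
    padded = bits + "0" * (-len(bits) % K)
    return bytes(int(padded[i:i + K], 2) for i in range(0, len(padded), K))


def get_bit(blocks, i):
    """p[i] = P[i // k]_(i mod k): bit (i mod k), counted from the left, of block i // k."""
    return (blocks[i // K] >> (K - 1 - i % K)) & 1


def naive_bit_search(t, n, p, m):
    """All shifts s with t[s .. s+m-1] = p, found by comparing single bits."""
    return [s for s in range(n - m + 1)
            if all(get_bit(t, s + i) == get_bit(p, i) for i in range(m))]


# Example from the Introduction: the only occurrence starts in the middle of a byte.
t_bits = "011001001000100110100101000101001001"
p_bits = "0100110100"
T, P = pack_bits(t_bits), pack_bits(p_bits)
print(naive_bit_search(T, len(t_bits), P, len(p_bits)))  # [11]
print(P[:1] in T)  # False: the first pattern byte never appears as a whole text byte
```

Every comparison here costs a shift and a mask per bit, which is the overhead that the block-oriented model of this section is meant to avoid.
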
The idea of eliminating any reference to bits and proceeding block by block was first suggested in [CKP85] for fast decoding of binary Huffman encoded texts. A similar approach has also been adopted in [KBN07,Fre02]. For the sake of uniformity we use in the following, whenever possible, the same terminology adopted in [KBN07].

    (A) Patt
    i   h=0       h=1       h=2       h=3
    0   11001011  00101100  10110000
    1   01100101  10010110  01011000
    2   00110010  11001011  00101100
    3   00011001  01100101  10010110
    4   00001100  10110010  11001011  00000000
    5   00000110  01011001  01100101  10000000
    6   00000011  00101100  10110010  11000000
    7   00000001  10010110  01011001  01100000

    (B) Mask
    i   h=0       h=1       h=2       h=3
    0   11111111  11111111  11111000
    1   01111111  11111111  11111100
    2   00111111  11111111  11111110
    3   00011111  11111111  11111111
    4   00001111  11111111  11111111  10000000
    5   00000111  11111111  11111111  11000000
    6   00000011  11111111  11111111  11100000
    7   00000001  11111111  11111111  11110000

    (C) Last
    i        0  1  2  3  4  5  6  7
    Last[i]  2  2  2  2  3  3  3  3

Fig. 1. Let $P$ = 110010110010110010110. (A) The matrix Patt. (B) The matrix Mask. (C) The array Last. The bits of Patt that belong to $P$ are exactly those whose corresponding bit in Mask is 1.

Let $T[i]$ and $P[i]$ denote, respectively, the $(i+1)$-th byte of the text and of the pattern, starting for $i = 0$ with both text and pattern aligned at the leftmost bit of the first byte. Since the lengths in bits of both text and pattern are not necessarily multiples of $k$, the last byte may be only partially defined. In particular, if the pattern has length $m$ then its last byte is the one at position $\lceil m/k \rceil - 1$ and, when $m$ is not a multiple of $k$, only the leftmost $(m \bmod k)$ bits of the last byte are defined. We suppose that the undefined bits of the last byte are set to 0.

In our high level model we define a sequence of several copies of the pattern memorized in the form of a matrix of bytes, Patt, of size $k \times (\lceil m/k \rceil + 1)$.
Each row $i$ of the matrix Patt contains a copy of the pattern shifted by $i$ positions to the right. The $i$ leftmost bits of the first byte remain undefined and are set to 0. Similarly, when $(m+i)$ is not a multiple of $k$, the rightmost $k - ((m+i) \bmod k)$ bits of the last byte are set to 0. Formally, the $j$-th bit of the byte $Patt[i, h]$ is defined by

$$Patt[i, h]_j = \begin{cases} p[kh - i + j] & \text{if } 0 \le kh - i + j < m \\ 0 & \text{otherwise} \end{cases}$$

for $0 \le i < k$ and $0 \le h < \lceil (m+i)/k \rceil$.

Observe that each factor of length $k$ of the pattern appears once in the table Patt. In particular, the factor of length $k$ starting at position $j$ of $p$ is memorized in $Patt[(k - (j \bmod k)) \bmod k, \lceil j/k \rceil]$.

    (A) Preprocess(P, m)
        M ← 1^m 0^(k − plast)
        for i = 0 to k − 1 do
            Last[i] ← ⌈(m + i)/k⌉ − 1
            for h = 0 to Last[i] do
                Patt[i, h] ← (P[h] ≫ i)
                Mask[i, h] ← (M[h] ≫ i)
                if h > 0 then
                    Patt[i, h] ← Patt[i, h] | (P[h − 1] ≪ (k − i))
                    Mask[i, h] ← Mask[i, h] | (M[h − 1] ≪ (k − i))
        return (Patt, Last, Mask)

    (B) Binary-Naive(P, m, T, n)
        (Patt, L, M) ← Preprocess(P, m)
        s ← i ← w ← 0
        while s ≤ n − m do
            j ← 0
            while j ≤ L[i] and Patt[i, j] = (T[w + j] & M[i, j]) do
                j ← j + 1
            if j > L[i] then Output(s)
            i ← i + 1
            if i = k then
                w ← w + 1
                i ← 0
            s ← s + 1

Fig. 2. (A) The Preprocess procedure for the computation of the tables Patt, Mask and Last. (B) The Binary-Naive algorithm for the binary string matching problem.

The high level model uses bytes in the matrix Patt to compare the pattern block by block against the text for any possible shift of the pattern. However, when comparing the first or last byte of $P$ against its counterpart in the text, the bit positions not belonging to the pattern have to be neutralized. For this purpose we define a matrix of bytes, Mask, of size $k \times (\lceil m/k \rceil + 1)$, containing binary masks of length $k$.
In particular, a bit in the mask $Mask[i, h]$ is set to 1 if and only if the corresponding bit of $Patt[i, h]$ belongs to $P$. More formally

$$Mask[i, h]_j = \begin{cases} 1 & \text{if } 0 \le kh - i + j < m \\ 0 & \text{otherwise} \end{cases}$$

for $0 \le i < k$ and $0 \le h < \lceil (m+i)/k \rceil$.

Finally we need to compute an array, Last, of size $k$, where $Last[i]$ is defined to be the index of the last byte in the row $Patt[i]$. Formally, for $0 \le i < k$ we define $Last[i] = \lceil (m+i)/k \rceil - 1$.

The procedure Preprocess used to precompute the tables defined above is presented in Figure 2(A). It requires $O(k \times \lceil m/k \rceil) = O(m)$ time and $O(m)$ extra space. Figure 1 shows the precomputed tables defined above for a pattern $P$ = 110010110010110010110 of length $m = 21$ and $k = 8$.

The model uses the precomputed tables to check whether $s$ is a valid shift without making use of operations on single bits, but processing pattern and text byte by byte. In particular, for a given shift position $s$ (the pattern is aligned with the $s$-th bit of the text), we report a match if

$$Patt[i, h] = T[j + h] \;\&\; Mask[i, h], \quad \text{for } h = 0, 1, \ldots, Last[i] \qquad (1)$$

where $j = \lfloor s/k \rfloor$ is the starting byte position in the text and $i = (s \bmod k)$.

A simple Binary-Naive algorithm, obtained with this high level model, is shown in Figure 2(B). The algorithm starts by aligning the left ends of the pattern and text. Then, for each value of the shift $s = 0, 1, \ldots, n - m$, it checks whether $p$ occurs in $t$ by simply comparing each byte of the pattern with its corresponding byte in the text, proceeding from left to right. At the end of the matching phase, the shift is advanced by one position to the right. In the worst case, the Binary-Naive algorithm requires $O(\lceil m/k \rceil n)$ comparisons.

4 New Efficient Binary String Matching Algorithms

In this section we present two new efficient algorithms for matching on binary strings based on the high level model presented above.
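
As a concrete reading of that high level model, the following sketch builds the tables Patt, Mask and Last and runs the Binary-Naive search of Section 3. It is an illustration, not the authors' implementation: it assumes $k = 8$, takes the pattern as a '0'/'1' string, builds the tables directly from their bit-level definitions rather than with the shift operations of Figure 2(A), and the names `pack_bits`, `preprocess` and `binary_naive` are hypothetical.

```python
K = 8  # block size k in bits (assumed: one byte)


def pack_bits(bits):
    """Pack a '0'/'1' string into blocks, leftmost bit first; unused trailing bits are 0."""
    padded = bits + "0" * (-len(bits) % K)
    return bytes(int(padded[i:i + K], 2) for i in range(0, len(padded), K))


def preprocess(p_bits):
    """Build Patt, Mask and Last directly from their bit-level definitions."""
    m = len(p_bits)
    patt, mask, last = [], [], []
    for i in range(K):
        row = pack_bits("0" * i + p_bits)    # pattern shifted right by i bits
        ones = pack_bits("0" * i + "1" * m)  # 1 exactly where a pattern bit sits
        patt.append(list(row))
        mask.append(list(ones))
        last.append(len(row) - 1)            # ceil((m + i) / k) - 1
    return patt, mask, last


def binary_naive(p_bits, t, n):
    """Every bit shift s (0 <= s <= n - m) satisfying condition (1), tested block by block."""
    m = len(p_bits)
    patt, mask, last = preprocess(p_bits)
    out = []
    for s in range(n - m + 1):
        w, i = divmod(s, K)  # w: first text block touched, i: bit offset inside it
        if all(patt[i][h] == (t[w + h] & mask[i][h]) for h in range(last[i] + 1)):
            out.append(s)
    return out


patt, mask, last = preprocess("110010110010110010110")  # pattern of Figure 1
print(last)                                  # [2, 2, 2, 2, 3, 3, 3, 3]
print([format(b, "08b") for b in patt[4]])   # ['00001100', '10110010', '11001011', '00000000']
T = pack_bits("011001001000100110100101000101001001")
print(binary_naive("0100110100", T, 36))     # [11]
```

The inner test touches at most `last[i] + 1` blocks per shift, which matches the $O(\lceil m/k \rceil n)$ bound stated for Binary-Naive.
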
The first algorithm is an adaptation of the $q$-Hash algorithm [Lec07], which is among the most efficient algorithms for the standard pattern matching problem. We show how the technique adopted by the algorithm can be naturally translated to allow for blocks of bits. The second solution can be seen as an adaptation to binary string matching of the Skip-Search algorithm [CLP98]. This algorithm can be efficiently adapted to completely avoid any reference to bits, so that pattern and text are processed byte by byte.

4.1 The Binary-Hash-Matching Algorithm

Algorithms in the $q$-Hash family for exact pattern matching were introduced in [Lec07], where the author presented an adaptation of the Wu and Manber multiple string matching algorithm [WM94] to the single string matching problem.

The idea of the $q$-Hash algorithm is to consider factors of the pattern of length $q$. Each substring $w$ of such a length $q$ is hashed using a function $hash$ into integer values between 0 and 255. Then the algorithm computes in a preprocessing phase a function $Hs : \{0, 1, \ldots, 255\} \to \{0, 1, \ldots, m - q\}$, such that for each $0 \le c \le 255$ the value $Hs(c)$ is defined by

$$Hs(c) = \min\big( \{ 0 \le u < m - q \mid hash(p[m-u-q \,..\, m-u-1]) = c \} \cup \{ m - q \} \big).$$

The searching phase of the algorithm consists of reading, for each shift $s$ of the pattern in the text, the substring $w = t[s+m-q \,..\, s+m-1]$ of length $q$. If $Hs(hash(w)) > 0$ then a shift of length $Hs(hash(w))$ is applied. Otherwise, when $Hs(hash(w)) = 0$, the pattern $p$ is naively checked in the text. In this case a shift of length $(m - 1 - i)$ is applied, where $i$ is the starting position of the rightmost occurrence in $p$ of a factor $p[j \,..\, j+q-1]$ such that $hash(p[j \,..\, j+q-1]) = hash(p[m-q \,..\, m-1])$.
If the pattern $p$ is a binary string we can directly associate each substring of length $q$ with its numeric value in the range $[0, 2^q - 1]$ without making use of the hash function. In order to exploit the block structure of the text we take into account substrings of length $q = k$. This means that, if $k = 8$, each block $B$ of $k$ bits can be considered as a value $0 \le B \le 255$. Thus we define a function $Hs : \{0, 1, \ldots, 2^k - 1\} \to \{0, 1, \ldots, m\}$, such that for each byte $0 \le B < 2^k$

$$Hs(B) = \min\big( \{ 0 \le u < m \mid p[m-u-k \,..\, m-u-1] \sqsupseteq B \} \cup \{ m \} \big).$$

Observe that if $B = p[m-k \,..\, m-1]$ then $Hs[B]$ is defined to be 0.

    Compute-Hash(Patt, Last, Mask, m)
        for B ← 0 to 2^k − 1 do
            Hs[B] ← m
        for i ← k − 1 downto 1 do
            for B ← 0 to 2^k − 1 do
                if Patt[i, 0] = B & Mask[i, 0]
                    then Hs[B] ← m − k + i
        i ← h ← 0
        for j ← 0 to m − k − 1 do
            Hs[Patt[i, h]] ← m − k − j
            i ← i − 1
            if i < 0 then
                i ← k − 1
                h ← h + 1
        return Hs

    Binary-Hash-Matching(P, m, T, n)
    1.  (Patt, Last, Mask) ← Preprocess(P, m)
    2.  Hs ← Compute-Hash(Patt, Last, Mask, m)
    3.  gap ← k − (m mod k)
    4.  B ← Patt[i, Last[i]]
    5.  shift ← Hs[B], Hs[B] ← 0
    6.  j ← 0, sℓ ← m − 1
    7.  while j < ⌈n/k⌉ do
    8.      while sℓ ≥ k do
    9.          sℓ ← sℓ − k
    10.         j ← j + 1
    11.     B ← T[j] ≫ (k − sℓ)
    12.     B ← B | (T[j − 1] ≪ (sℓ + 1))
    13.     if Hs[B] = 0 then
    14.         i ← (sℓ + gap) mod k
    15.         h ← Last[i], q ← 0
    16.         while h ≥ 0 and Patt[i, h] = (T[j − q] & Mask[i, h])
    17.             do h ← h − 1, q ← q + 1
    18.         if h < 0 then Output(j × k + sℓ)
    19.         sℓ ← sℓ + shift
    20.     else sℓ ← sℓ + Hs[B]

Fig. 3. The Binary-Hash-Matching algorithm for the binary string matching problem.

For example, in the case of the pattern $P$ = 110010110010110010110 presented in Figure 1, we have $Hs[01100101] = 2$, $Hs[11001011] = 1$, and moreover $Hs[10010110] = 0$.
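
The values in this example can be reproduced directly from the definition of $Hs$. The sketch below does so and then uses the table in a search loop that follows the $q$-Hash scheme described above with $q = k$. It is a simplified illustration under stated assumptions: $k = 8$, $m \ge k$, both strings are given as '0'/'1' strings for readability (so it does not reproduce the byte-by-byte processing of Figure 3), and the names `compute_hs` and `binary_hash_search` are hypothetical.

```python
K = 8  # block size k in bits (assumed: one byte)


def compute_hs(p_bits):
    """Hs[B] = min({u : 0 <= u < m, p[m-u-k .. m-u-1] is a suffix of B} | {m})."""
    m = len(p_bits)
    hs = [m] * (1 << K)
    for u in range(m - 1, -1, -1):                # larger u first, smaller u overwrites
        factor = p_bits[max(m - u - K, 0):m - u]  # clipped at the left end of p
        width, value = len(factor), int(factor, 2)
        for high in range(1 << (K - width)):      # every block B ending with this factor
            hs[(high << width) | value] = u
    return hs


def binary_hash_search(p_bits, t_bits):
    """q-Hash style search with q = k on '0'/'1' strings (requires m >= k)."""
    m, n = len(p_bits), len(t_bits)
    hs = compute_hs(p_bits)
    tail = p_bits[m - K:]
    # Advance applied after a verification: smallest u > 0 whose factor is a suffix of the tail.
    shift = next((u for u in range(1, m)
                  if tail.endswith(p_bits[max(m - u - K, 0):m - u])), m)
    out, s = [], 0
    while s <= n - m:
        block = int(t_bits[s + m - K:s + m], 2)   # last k bits of the current window
        if hs[block] == 0:
            if t_bits[s:s + m] == p_bits:         # naive check of the whole window
                out.append(s)
            s += shift
        else:
            s += hs[block]
    return out


P = "110010110010110010110"  # pattern of Figure 1
hs = compute_hs(P)
print(hs[0b01100101], hs[0b11001011], hs[0b10010110])  # 2 1 0
print(binary_hash_search(P, "110010" * 5 + "110"))     # [0, 6, 12]
```

For this pattern the advance after a verification is 6, the period of $P$, which is why the three overlapping occurrences in the usage example are all reported.
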
The code of the Binary-Hash-Matching algorithm and its preprocessing phase are presented in Figure 3. The preprocessing phase of the algorithm consists in computing the function $Hs$ defined above and requires $O(m + k\,2^{k+1})$ time and $O(m + 2^k)$ extra space.

During the search phase the algorithm reads, for each shift position $s$ of the pattern in the text, the block $B = t[s+m-k \,..\, s+m-1]$ of $k$ bits (lines 11-12). If $Hs(B) > 0$ then a shift of length $Hs(B)$ is applied (line 20). Otherwise, when $Hs(B) = 0$, the pattern $p$ is naively checked in the text block by block (lines 15-18). After the test an advancement of length $shift$ is applied (line 19), where

$$shift = \min\big( \{ 0 < u < m \mid p[m-u-k \,..\, m-u-1] \sqsupset p[m-k \,..\, m-1] \} \cup \{ m \} \big).$$

Observe that if the block $B$ has its $s_\ell$ rightmost bits in the $j$-th block of $T$ and the $(k - s_\ell)$ leftmost bits in the block $T[j-1]$, then it is computed by performing the following bitwise operation

$$B = \big( T[j] \gg (k - s_\ell) \big) \;|\; \big( T[j-1] \ll (s_\ell + 1) \big).$$

The Binary-Hash-Matching algorithm has an $O(\lceil m/k \rceil n)$ time complexity and requires $O(m + 2^k)$ extra space.

For blocks of length $k$ the size of the $Hs$ table is $2^k$, which seems reasonable for $k = 8$ or even 16. For greater values of $k$ it is possible to adapt the algorithm to choose the desired time/space tradeoff by introducing a new parameter $K \le k$, representing the number of bits taken into account for computing the shift advancement. Roughly speaking, only the $K$ rightmost bits of the current window of the text are taken into account, reducing the total size of the tables to $2^K$ at the cost of sometimes shifting the pattern less than could be done if the full length of a block had been considered.

4.2 The Binary-Skip-Search Algorithm

The Skip-Search algorithm was presented in [CLP98] by Charras, Lecroq and Pehoushek.
The idea of the algorithm is straightforward. Let $p$ be a pattern of length $m$ and $t$ a text of length $n$, both over a finite alphabet $\Sigma$. For each character $c$ of the alphabet, a bucket collects all the positions of that character in the pattern. When a character occurs $\ell$ times in the pattern, there are $\ell$ corresponding positions in the bucket of that character. Formally, for $c \in \Sigma$ the Skip-Search algorithm computes the table $S[c]$ where

$$S[c] = \{ i \mid 0 \le i < m \wedge p[i] = c \}.$$

Notice that when the pattern is much shorter than the alphabet, many buckets are empty. The main loop of the search phase consists in examining every $m$-th text character, $t[j]$ (so there will be $n/m$ main iterations). For each character $t[j]$, it uses each position in the bucket $S[t[j]]$ to obtain all possible starting positions of $p$ in $t$. For each position the algorithm performs a comparison of $p$ with $t$, character by character, until there is a mismatch, or until an occurrence is found.

In the binary variant, for each possible block $B$ of $k$ bits, a bucket collects all pairs $(i, h)$ in the table Patt such that $Patt[i, h] = B$. When a block of bits occurs several times in the pattern, there are different corresponding pairs in the bucket of that block. Observe that for a pattern of length $m$ there are $m - k + 1$ blocks of length $k$, one for each starting position, corresponding to the blocks $Patt[i, h]$ such that $kh - i \ge 0$ and $k(h+1) - i - 1 < m$.

However, to take advantage of the block structure of the text, we can compute buckets only for blocks contained in the suffix of the pattern of length $m' = k \lfloor m/k \rfloor$. In this way $m'$ is a multiple of $k$ and it suffices to examine one block out of every $m'/k$ blocks of the text. Formally, for $0 \le B < 2^k$

$$S_k[B] = \{ (i, h) : (m \bmod k) \le kh - i \le m - k \;\wedge\; Patt[i, h] = B \}.$$
For example, in the case of the pattern $P$ = 110010110010110010110 of Figure 1 we have $S_k[01011001] = \{(7, 2)\}$, $S_k[01100101] = \{(3, 1), (5, 2)\}$, $S_k[11001011] = \{(2, 1), (4, 2)\}$, $S_k[10010110] = \{(1, 1), (3, 2)\}$, $S_k[10110010] = \{(6, 2)\}$ and $S_k[00101100] = \{(0, 1)\}$.

    Precompute-Skip-Table(Patt, m)
        for b = 0 to 2^k − 1 do Sk[b] ← ∅
        i ← h ← 0
        for j = 0 to m − k do
            if j ≥ (m mod k) then
                b ← Patt[i, h]
                Sk[b] ← Sk[b] ∪ {(i, h)}
            i ← i − 1
            if i < 0 then
                i ← k − 1
                h ← h + 1
        return Sk

    Binary-Skip-Search(P, m, T, n)
    1.  (Patt, Last, Mask) ← Preprocess(P, m)
    2.  Sk ← Precompute-Skip-Table(Patt, m)
    3.  shift ← ⌊m/k⌋ − 1
    4.  j ← shift − 1
    5.  while j < ⌈n/k⌉ do
    6.      for each (i, pos) ∈ Sk[T[j]] do
    7.          h ← 0
    8.          while h ≤ Last[i] and Patt[i, h] = (T[j − pos + h] & Mask[i, h])
    9.              do h ← h + 1
    10.         if h > Last[i] then Output((j − pos) × k + i)
    11.     j ← j + shift

Fig. 4. The Binary-Skip-Search algorithm for the binary string matching problem.

The Binary-Skip-Search algorithm is shown in Figure 4. Its preprocessing phase consists in computing the buckets for all possible blocks of $k$ bits. The space and time complexity of this preprocessing phase is $O(m + 2^k)$. The main loop of the search phase consists in examining every $(m'/k)$-th text block. For each block $T[j]$ examined in the main loop, the algorithm inspects each pair $(i, pos)$ in the bucket $S_k[T[j]]$ to obtain a possible alignment of the pattern against the text (line 6). For each pair $(i, pos)$ the algorithm checks whether $p$ occurs in $t$ by comparing $Patt[i, h]$ with the masked block $T[j - pos + h]$, for $h = 0, \ldots, Last[i]$ (lines 7-10).

The Binary-Skip-Search algorithm has an $O(\lceil m/k \rceil n)$ quadratic worst case time complexity and requires $O(m + 2^k)$ extra space.
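
The bucket construction and the sampling search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it takes $k = 8$ and $m \ge 2k$, stores the buckets in a dictionary instead of a table of size $2^k$, derives each pair $(i, h)$ from the starting position $kh - i$ of the factor, and collects results in a set because two sampled blocks may lead to the same occurrence. The names `pack_bits`, `preprocess`, `skip_table` and `binary_skip_search` are hypothetical.

```python
K = 8  # block size k in bits (assumed: one byte)


def pack_bits(bits):
    """Pack a '0'/'1' string into blocks, leftmost bit first; unused trailing bits are 0."""
    padded = bits + "0" * (-len(bits) % K)
    return bytes(int(padded[i:i + K], 2) for i in range(0, len(padded), K))


def preprocess(p_bits):
    """Patt, Mask and Last built from their bit-level definitions."""
    rows = [pack_bits("0" * i + p_bits) for i in range(K)]
    masks = [pack_bits("0" * i + "1" * len(p_bits)) for i in range(K)]
    return rows, masks, [len(r) - 1 for r in rows]


def skip_table(p_bits):
    """S_k[B] = {(i, h) : m mod k <= k*h - i <= m - k and Patt[i][h] = B}."""
    m = len(p_bits)
    table = {}
    for start in range(m % K, m - K + 1):  # start = k*h - i, position of the factor in p
        h = -(-start // K)                 # ceil(start / k)
        i = K * h - start
        table.setdefault(int(p_bits[start:start + K], 2), []).append((i, h))
    return table


def binary_skip_search(p_bits, t, n):
    """Sample one text block every floor(m/k) - 1 blocks and verify each bucket entry."""
    m = len(p_bits)                        # assumes m >= 2 * K
    patt, mask, last = preprocess(p_bits)
    table = skip_table(p_bits)
    step = m // K - 1
    found = set()
    for j in range(step - 1, len(t), step):
        for i, h in table.get(t[j], ()):
            first = j - h                  # text block aligned with Patt[i][0]
            s = first * K + i              # candidate bit shift of the pattern
            if first < 0 or s + m > n:
                continue
            if all(patt[i][x] == (t[first + x] & mask[i][x]) for x in range(last[i] + 1)):
                found.add(s)
    return sorted(found)


P = "110010110010110010110"  # pattern of Figure 1
print(skip_table(P)[0b11001011])   # [(2, 1), (4, 2)]
t_bits = "110010" * 5 + "110"
print(binary_skip_search(P, pack_bits(t_bits), len(t_bits)))  # [0, 6, 12]
```

The step of $\lfloor m/k \rfloor - 1$ blocks follows line 3 of Figure 4: any occurrence fully contains at least that many consecutive aligned text blocks inside the suffix of length $m'$, so one of them is always sampled.
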
In practice, if the block size is $k$, the Binary-Skip-Search algorithm requires a table of size $2^k$ to compute the function $S_k$. This is just 256 for $k = 8$, but for $k = 16$ or even 32, such a table might be too large. In particular, for growing values of $k$ there will be many cache misses, with a strong impact on the performance of the algorithm. Thus for values of $k$ greater than 8 it may be suitable to compute the function on the fly, still using a table for single bytes. Suppose for example that $k = 32$ and suppose $B$ is a block of $k$ bits. Let $B_j$ be the $j$-th byte of $B$, with $j = 1, \ldots, 4$. The set of all possible pairs associated with the block $B$ can be computed as

$$S_k[B] = S_k[B_1] \cap S_k^1[B_2] \cap S_k^2[B_3] \cap S_k^3[B_4]$$

where we have set $S_k^q[B_j] = \{ (i, h) \mid (i, h + q) \in S_k[B_j] \}$.

If we suppose that the distribution of zeros and ones in the input string is like that of a randomly generated one, then the probability of occurrence in the text of any binary string of length $k$ is $2^{-k}$. This is a reasonable assumption for compressed text [KBD89]. Then, for the single-byte table ($k = 8$), the expected cardinality of the set $S_k[B]$, for a pattern $p$ of length $m$, is $(m - 7) \times 2^{-8}$, which is less than 2 if $m < 500$. Thus in practical cases the set $S_k[B]$ can be computed in constant expected time.

5 Experimental Results

Here we present experimental data which allow us to compare, in terms of running time and number of text character inspections, the following string matching algorithms under various conditions: the Binary-Naive algorithm (BNAIVE) of Figure 2, the Binary-Boyer-Moore algorithm by Klein and Ben-Nissan (BBM) presented in [KBN07], the Binary-Hash-Matching algorithm (BHM) of Figure 3, and the Binary-Skip-Search algorithm (BSKS) of Figure 4.
For the sake of completeness, for the experimental results on running times we have also tested the following algorithms for standard pattern matching: the $q$-Hash algorithm [Lec07] with $q = 8$ (HASH8) and the Extended-BOM algorithm [FL08] (EBOM). These are among the most efficient in practical cases. The $q$-Hash and Extended-BOM algorithms have been tested on the same texts and patterns but in their standard form, i.e. each character is an 8-bit ASCII value, thus obtaining a comparison between methods on standard and binary strings.

To simulate the different conditions which can arise when processing binary data we have performed our tests on texts with different distributions of zeros and ones. For the case of compressed strings it is quite reasonable to assume a uniform distribution of characters. For compression schemes using Huffman coding, such randomness has been shown to hold in [KBD89]. In contrast, when processing binary images we expect a non-uniform distribution of characters. For instance, in a fax image usually more than 90% of the total number of bits is set to zero.

All algorithms have been implemented in the C programming language and were used to search for the same binary strings in large fixed text buffers on a PC with an Intel Core2 processor of 1.66GHz. In particular, the algorithms have been tested on three $Rand(0/1)_\gamma$ problems, for $\gamma = 50$, 70 and 90. Searches have been performed for binary patterns, of length $m$ from 20 to 500, which have been taken as substrings of the text at random starting positions.

In particular each $Rand(0/1)_\gamma$ problem consists of searching a set of 1000 random patterns of a given length in a random binary text of $4 \times 10^6$ bits. The distribution of characters depends on the value of the parameter $\gamma$. In particular, bit 0 appears with a percentage equal to $\gamma\%$.
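
The test data described above can be imitated with a few lines of code. The sketch below is only one plausible reading of a $Rand(0/1)_\gamma$ problem: the paper does not describe its generator, so the independent per-bit draw, the fixed seeds and the names `rand_binary_text` and `random_patterns` are all assumptions made for illustration.

```python
import random


def rand_binary_text(n_bits, gamma, seed=0):
    """Random '0'/'1' string in which each bit is 0 with probability gamma / 100."""
    rng = random.Random(seed)  # fixed seed (an assumption) so that runs are repeatable
    return "".join("0" if rng.random() * 100 < gamma else "1" for _ in range(n_bits))


def random_patterns(text, m, count, seed=1):
    """Patterns of length m taken as substrings of the text at random starting positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(text) - m + 1) for _ in range(count)]
    return [text[s:s + m] for s in starts]


# A scaled-down instance; the paper uses 4 * 10**6 bits and 1000 patterns per length.
text = rand_binary_text(40_000, gamma=90)
patterns = random_patterns(text, m=60, count=10)
print(round(text.count("0") / len(text), 2))
print(all(p in text for p in patterns))  # True: every pattern occurs at least once
```

Because the patterns are cut from the text itself, every search has at least one valid shift, which matches the setup described for the experiments.
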
Moreover, for each test, the average number of character inspections has been computed by taking the total number of times a text byte is accessed (either to perform a comparison with the pattern, or to perform a shift) and dividing it by the length of the text buffer.

In the following tables, running times (left block of columns) are expressed in hundredths of seconds. The average numbers of text character inspections are reported in the right block of columns.

           Running times                                  Text character inspections
      m    BNAIVE    BBM   BSKS    BHM  HASH8   EBOM      BNAIVE    BBM   BSKS    BHM
     20     41.53  13.53   3.66   3.40   5.12   8.89        9.00   1.82   1.04   0.90
     60     41.72   7.77   1.16   1.60   1.72   3.85        9.00   0.85   0.20   0.31
    100     41.68   6.80   0.70   1.44   1.64   3.06        9.00   0.63   0.13   0.20
    140     42.11   6.21   0.89   1.24   1.54   2.67        9.00   0.54   0.10   0.15
    180     41.95   5.76   0.66   1.10   1.80   2.25        9.00   0.47   0.08   0.13
    220     41.93   5.36   0.74   1.24   1.79   1.87        9.00   0.44   0.07   0.11
    260     41.95   5.08   0.54   1.05   1.47   2.09        9.00   0.41   0.07   0.10
    300     41.74   5.07   0.54   1.11   1.82   1.48        9.00   0.39   0.06   0.09
    340     41.93   4.86   0.39   1.07   1.56   1.56        9.00   0.38   0.06   0.09
    380     41.97   4.59   0.46   0.97   1.87   1.43        9.00   0.37   0.06   0.08
    420     42.07   4.52   0.31   1.23   1.59   1.23        9.00   0.36   0.05   0.08
    460     41.99   4.68   0.23   1.04   1.52   1.19        9.00   0.35   0.05   0.08
    500     42.06   4.61   0.37   0.81   1.53   1.32        9.00   0.35   0.05   0.07

Experimental results for a $Rand(0/1)_{50}$ problem

           Running times                                  Text character inspections
      m    BNAIVE    BBM   BSKS    BHM  HASH8   EBOM      BNAIVE    BBM   BSKS    BHM
     20     43.26  17.25   4.01   4.21   4.86  10.92        9.41   2.27   1.12   1.01
     60     43.15  10.26   1.66   2.09   2.03   4.27        9.40   1.14   0.29   0.38
    100     43.80   8.44   1.60   2.26   1.95   2.54        9.38   0.89   0.21   0.26
    140     43.70   8.13   1.28   1.61   1.52   2.68        9.38   0.77   0.18   0.21
    180     43.22   7.37   1.02   1.67   2.08   2.33        9.37   0.71   0.17   0.18
    220     43.29   6.82   1.08   1.34   1.94   2.50        9.39   0.65   0.16   0.16
    260     42.93   6.67   1.07   1.53   1.79   1.94        9.38   0.61   0.15   0.15
    300     43.66   6.46   0.89   1.22   1.59   1.94        9.39   0.59   0.15   0.14
    340     43.53   6.35   0.97   1.23   1.28   1.86        9.39   0.57   0.15   0.13
    380     43.76   6.15   0.70   1.42   1.31   1.65        9.38   0.55   0.14   0.12
    420     43.29   6.03   0.85   1.34   1.67   1.48        9.38   0.54   0.14   0.12
    460     43.45   6.00   0.92   1.27   1.37   1.43        9.38   0.53   0.14   0.11
    500     43.31   6.00   0.70   1.28   1.41   1.48        9.37   0.51   0.14   0.11

Experimental results for a $Rand(0/1)_{70}$ problem

           Running times                                  Text character inspections
      m    BNAIVE    BBM   BSKS    BHM  HASH8   EBOM      BNAIVE    BBM   BSKS    BHM
     20     50.61  41.19  18.51  21.00  24.30  24.95       12.46   6.88   3.79   4.87
     60     51.68  30.62  13.61  14.32  13.65   7.23       12.53   5.14   2.82   3.28
    100     53.00  28.44  12.22  12.64  11.64   5.15       12.72   4.70   2.78   2.76
    140     51.78  27.16  11.86  11.44  10.31   4.09       12.46   4.47   2.63   2.53
    180     51.51  24.78  11.80  10.21   9.83   3.21       12.45   4.11   2.59   2.22
    220     52.54  24.60  11.50   9.63   9.13   3.12       12.69   4.02   2.65   2.09
    260     52.38  23.59  11.85   8.74   8.61   2.31       12.55   3.87   2.58   1.97
    300     52.00  22.68  11.15   8.73   8.11   2.63       12.59   3.67   2.64   1.80
    340     51.98  21.72  11.30   8.02   7.24   2.29       12.53   3.53   2.60   1.70
    380     52.33  21.79  11.39   7.66   7.57   2.17       12.56   3.53   2.64   1.69
    420     52.35  21.16  10.94   7.58   7.43   1.82       12.60   3.46   2.56   1.60
    460     52.29  20.54  11.09   6.75   6.45   2.12       12.51   3.28   2.59   1.48
    500     51.68  20.68  11.12   7.42   6.99   1.79       12.34   3.39   2.57   1.55

Experimental results for a $Rand(0/1)_{90}$ problem

Experimental results show that, among the algorithms operating on binary strings, the Binary-Skip-Search and the Binary-Hash-Matching algorithms obtain the best run-time performance in all cases. In particular it turns out that the Binary-Skip-Search algorithm is the best choice when the distribution of characters is uniform. In this case, for long patterns, the algorithm is about 10 times faster than Binary-Boyer-Moore and about 100 times faster than Binary-Naive. Moreover it performs fewer than 50% of the inspections performed by the Binary-Boyer-Moore algorithm, especially for long patterns.

For non-uniform distributions of characters the Binary-Hash-Matching algorithm obtains the best results in terms of both running time and number of character inspections.
In this non-uniform setting, the Binary-Hash-Matching algorithm turns out to be at least two times faster than the Binary-Boyer-Moore algorithm and to perform a number of text character inspections which is less than 50% of that performed by the Binary-Boyer-Moore algorithm.

6 Conclusion

Efficient variants of the q-Hash and Skip-Search pattern matching algorithms have been presented for the case in which both text and pattern are over a binary alphabet. The algorithms exploit the block structure of the binary strings and process text and pattern without any bit manipulations. Both algorithms have an O(n/m) time complexity. However, from our experimental results it turns out that the presented algorithms are the most effective in practical cases.

References

[CKP85] Y. Choueka, S. T. Klein, and Y. Perl. Efficient variants of Huffman codes in high level languages. In Proc. 8th ACM-SIGIR Conf., Montreal, pages 122–130, 1985.
[CLP98] C. Charras, T. Lecroq, and J. D. Pehoushek. A very fast string matching algorithm for small alphabets and long patterns. In M. Farach-Colton, editor, Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, volume 1448 of LNCS, pages 55–64, Piscataway, NJ, 1998. Springer-Verlag.
[FG06] K. Fredriksson and S. Grabowski. Efficient algorithms for pattern matching with general gaps and character classes. In F. Crestani, P. Ferragina, and M. Sanderson, editors, Proc. Conference on String Processing and Information Retrieval SPIRE06, volume 4209 of LNCS, pages 267–278. Springer-Verlag, 2006.
[FL08] S. Faro and T. Lecroq. Efficient variants of the Backward-Oracle-Matching algorithm. In J. Holub and J. Žďárek, editors, Proc. of the Prague Stringology Conference, pages 146–160, 2008.
[Fre02] K. Fredriksson. String matching with super-alphabets. In A. H. F. Laender and A. L. Oliveira, editors, Proc. Symposium on String Processing and Information Retrieval SPIRE02, volume 2476 of LNCS, pages 44–57. Springer-Verlag, 2002.
[KBD89] S. T. Klein, A. Bookstein, and S. Deerwester. Storing text retrieval systems on cdrom: Compression and encryption considerations. ACM Trans. on Information Systems, 7:230–245, 1989.
[KBN07] S. T. Klein and M. K. Ben-Nissan. Accelerating Boyer Moore searches on binary texts. In J. Holub and J. Žďárek, editors, Proc. of the 12th International Conference on Implementation and Application of Automata, CIAA07, volume 4783 of LNCS, pages 130–143, Prague, Czech Republic, 2007. Springer-Verlag.
[KS05] S. T. Klein and D. Shapira. Pattern matching in Huffman encoded texts. Inf. Process. Manage., 41(4):829–841, 2005.
[Lec07] T. Lecroq. Fast exact string matching algorithms. Inf. Process. Lett., 102(6):229–235, 2007.
[SD06] D. Shapira and A. H. Daptardar. Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts. Inf. Process. Manage., 42(2):429–439, 2006.
[WM94] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
