A Fast Generic Sequence Matching Algorithm

David R. Musser
Gor V. Nishanov
Computer Science Department
Rensselaer Polytechnic Institute, Troy, NY 12180
{musser,gorik}@cs.rpi.edu
December 1, 1997

Abstract

A string matching (and more generally, sequence matching) algorithm is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons ($2n$), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems of practical interest, including traditional string matching; large-alphabet problems (as in Unicode strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is expressed as a generic algorithm for searching in sequences over an arbitrary type T, it is well suited for use in generic software libraries such as the C++ Standard Template Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision for use of hashing in this technique. In situations in which a hash function or random access to the sequences is not available, the algorithm falls back to an optimized version of the Knuth-Morris-Pratt algorithm.
Key words: String search, String matching, Pattern matching, Sequence matching, Generic algorithms, Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm, DNA pattern matching, C++, Standard Template Library, STL, Ada, Literate programming

Contents

1 Introduction
2 Linear and Accelerated Linear Algorithms
3 Benchmarking with English Texts
4 Hashed Accelerated Linear Algorithm
5 Searching for DNA Patterns
6 Large Alphabet Case
7 Generic Search Algorithms
8 How to Obtain the Appendices and Code
9 Conclusion
A Tests of Expository Versions of the Algorithms
  A.1 Algorithm Declarations
  A.2 Simple Tests
  A.3 Large Tests
  A.4 Timed Tests
B C++ Library Versions and Test Programs
  B.1 Generic Library Interfaces
    B.1.1 Library Files
    B.1.2 Search Traits
    B.1.3 Search Functions
    B.1.4 Skip Table Computation
    B.1.5 Next Table Procedure and Call
  B.2 Experimental Version for Large Alphabet Case
  B.3 DNA Search Functions and Traits
  B.4 Simple Tests
  B.5 Large Tests
  B.6 Timed Tests
  B.7 Timed Tests (Large Alphabet)
  B.8 Counted Tests
  B.9 Application to Matching Sequences of Words
    B.9.1 Large Tests
    B.9.2 Timed Tests
C Index of Part Names

1 Introduction

The traditional string matching problem is to find an occurrence of a pattern (a string) in a text (another string), or to decide that none exists. Two of the best known algorithms for the problem of string matching are the Knuth-Morris-Pratt [KMP77] and Boyer-Moore [BM77] algorithms (for short, we will refer to these as KMP and BM). Although KMP has a low worst-case bound on the number of comparisons ($2n$, where $n$ is the length of the text), it is often considered impractical, since the number of comparisons it performs in the average case is not significantly smaller than that of the straightforward (SF) algorithm [Sm82], and the overhead for initialization is higher. On the other hand, despite the fact that BM has a higher worst-case bound on the number of comparisons ($\approx 3n$ [Cole96]), it has excellent sublinear behavior in the average case. This fact often makes BM the algorithm of choice in practical applications.

In [BM77], Boyer and Moore described both a basic version of their algorithm and an optimized version based on use of a "skip loop." We will refer to the latter algorithm as Accelerated Boyer-Moore, or ABM for short. Unfortunately, this version remained unnoticed by most researchers despite its much better performance. For example, ABM outperforms the Quick Search [Su90] and Boyer-Moore-Horspool [Horspool88] improvements of the basic BM algorithm. This state of affairs was highlighted by Hume and Sunday in 1991 in [HS91], in which they introduced two algorithms, LC (least cost) and TBM (Tuned BM) [HS91], that perform faster than ABM in the average case.
These two algorithms use the skip loop of ABM combined with variants of the straightforward algorithm that use information on character frequency distribution in the target text. For traditional string matching LC and TBM have excellent average case behavior, but in the worst case they behave like SF, taking time proportional to the product of the text and pattern lengths.

Even in the average case, the skip loop as it is used in ABM and other algorithms performs poorly with small alphabets and long patterns, as happens, for example, in the problem of DNA pattern matching. And if the alphabet is large, as for example with Unicode strings, initialization overhead and memory requirements for the skip loop weigh against its use.

This article describes a new linear string-matching algorithm and its generalization to searching in sequences over an arbitrary type T. The algorithm is based on KMP and has the same low $2n$ worst-case bound on the number of comparisons, but it is better than ABM (comparable with TBM) in average case performance on English text strings. It employs a hash-coded form of the skip loop making it suitable even for cases with large alphabets or with small alphabets and long patterns. Since it is expressed as a generic algorithm for searching in sequences over an arbitrary type T, the new algorithm is well suited for use in generic software libraries such as the C++ Standard Template Library (STL).

We present the algorithm in the following sections by starting with the basic KMP algorithm and transforming it with optimizations and addition of the skip loop in several alternative forms. The optimized form of KMP without the skip loop also serves well in cases in which access to the sequences is restricted to forward, one-element-at-a-time iteration rather than random access.
We also discuss experimental results and some of the issues in including the new algorithm in a generic algorithm library.

2 Linear and Accelerated Linear Algorithms

Let $m \ge a \ge 0$ and $n \ge b \ge 0$ and suppose $p_a \ldots p_{m-1}$ is a pattern of size $m - a$ to be searched for in the text $t_b \ldots t_{n-1}$ of size $n - b$. Characters of the pattern and the text are drawn from an alphabet $\Sigma$. The Knuth-Morris-Pratt algorithm can be viewed as an extension of the straightforward search algorithm. It starts comparing symbols of the pattern and the text from left to right. However, when a mismatch occurs, instead of shifting the pattern by one symbol and repeating matching from the beginning of the pattern, KMP shifts the pattern to the right in such a way that the scan can be restarted at the point of mismatch in the text. The amount of shift is determined by a precomputed function next, defined by

\[
\mathit{next}(j) = \max \{\, i : a \le i < j,\ p_a \ldots p_{i-1} = p_{a+j-i} \ldots p_{j-1},\ p_i \ne p_j \,\},
\]

with $\mathit{next}(j) = a - 1$ if no $i$ satisfies the conditions. The basic KMP algorithm, expressed in Ada 95 (see footnote 1), is as follows:

⟨Basic KMP 2⟩ ≡
    pattern_size := m - a; j := a; k := b;
    while j < m and then k < n loop
      while j >= a and then text(k) /= pattern(j) loop
        j := next(j);
      end loop;
      k := k + 1; j := j + 1;
    end loop;
    if j = m then
      return k - pattern_size;
    else
      return n;
    end if;
Used in part 27c.

Footnote 1: Although most authors use pseudocode for expository purposes, we prefer to be able to check all code with a compiler. The expository versions of algorithms in this paper are expressed in Ada 95, which has a syntax similar to that of most pseudocode languages (at least if one omits the details of subprogram headers and package declarations, which we include only in an appendix that deals with actual compilation of the code). The generic library components developed later in the paper are written in C++.
Throughout the paper we present expository and production code in a variant of Knuth's literate programming style [Knuth84], in which code is presented in "parts" numbered according to the page number on which they appear (with parts on the same page distinguished by appending a letter to the number). This form of presentation is supported by Briggs' Nuweb tool [Briggs] (slightly modified, as discussed in a later section), with which we also generate all code files directly from the paper's source file.

A return value $i$ between $b$ and $n - \mathit{pattern\_size}$ indicates a match found beginning at position $i$, while a return value of $n$ means there was no match. Although elegantly short, this algorithm does redundant operations along the expected execution path. That is, text(k) is usually not equal to pattern(j) and next(j) is usually $a - 1$, so the inner loop usually sets j to $a - 1$, redundantly tests it against a, and terminates. k and j are then both incremented and tested against their bounds, then j is again redundantly compared with a. Knuth, Morris, and Pratt discussed a set of optimizations to the basic algorithm that required extending the text and pattern with additional characters, which is possible only under extra assumptions about the way the inputs are stored. We must avoid such assumptions when the goal is a generic algorithm. Instead, we eliminate the redundant operations by rewriting the algorithm in the following form, which we will call Algorithm L (for Linear) in this paper:

⟨Algorithm L, optimized linear pattern search 3a⟩ ≡
    pattern_size := m - a; k := b;
    ⟨Handle pattern size = 1 as a special case 3b⟩
    while k <= n - pattern_size loop
      ⟨Scan the text for a possible match 3c⟩
      ⟨Verify whether a match is possible at the position found 4a⟩
      ⟨Recover from a mismatch using the next table 4b⟩
    end loop;
    return n;
Used in part 27c.
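As a point of comparison with the Ada parts, the basic KMP search described earlier in this section can be sketched in C++. This is a minimal sketch under stated assumptions, not the library code of the appendices: the names `compute_next` and `kmp_search` are ours, the index ranges are fixed at a = b = 0, and `std::string` stands in for an arbitrary sequence type. The return convention follows the text: the match position, or n when there is no match.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: the KMP "next" table for a 0-based pattern.
// next[j] is the largest i < j such that pattern[0..i-1] is a suffix of
// pattern[0..j-1] and pattern[i] != pattern[j], or -1 if no such i exists.
std::vector<long> compute_next(const std::string& pattern) {
    const long m = static_cast<long>(pattern.size());
    std::vector<long> next(pattern.size());
    if (m == 0) return next;
    next[0] = -1;
    long j = 0, t = -1;
    while (j < m - 1) {
        while (t >= 0 && pattern[j] != pattern[t]) t = next[t];
        ++t; ++j;
        next[j] = (pattern[j] == pattern[t]) ? next[t] : t;
    }
    return next;
}

// Basic KMP: returns the start of the first match, or text.size() if none.
long kmp_search(const std::string& text, const std::string& pattern) {
    const long n = static_cast<long>(text.size());
    const long m = static_cast<long>(pattern.size());
    const std::vector<long> next = compute_next(pattern);
    long j = 0, k = 0;
    while (j < m && k < n) {
        // On a mismatch, fall back through the next table without moving k.
        while (j >= 0 && text[k] != pattern[j]) j = next[j];
        ++k; ++j;
    }
    return (j == m) ? k - m : n;
}
```

The text index k never moves backward, which is what allows this form to work with forward, one-element-at-a-time access to the text.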
The following code allows us to eliminate a test in the main loop:

⟨Handle pattern size = 1 as a special case 3b⟩ ≡
    if pattern_size = 1 then
      while k /= n and then text(k) /= pattern(a) loop
        k := k + 1;
      end loop;
      return k;
    end if;
Used in parts 3a, 5c, 7b, 12b, 27c.

The three parts of the body of the main loop are defined as follows:

⟨Scan the text for a possible match 3c⟩ ≡
    while text(k) /= pattern(a) loop
      k := k + 1;
      if k > n - pattern_size then
        return n;
      end if;
    end loop;
Used in parts 3a, 27c.

⟨Verify whether a match is possible at the position found 4a⟩ ≡
    j := a + 1; k := k + 1;
    while text(k) = pattern(j) loop
      k := k + 1; j := j + 1;
      if j = m then
        return k - pattern_size;
      end if;
    end loop;
Used in parts 3a, 27c.

⟨Recover from a mismatch using the next table 4b⟩ ≡
    loop
      j := next(j);
      if j < a then
        k := k + 1; exit;
      end if;
      exit when j = a;
      while text(k) = pattern(j) loop
        k := k + 1; j := j + 1;
        if j = m then
          return k - pattern_size;
        end if;
        if k = n then
          return n;
        end if;
      end loop;
    end loop;
Used in parts 3a, 5c.

This last part guarantees linear worst-case behavior. Notice that if we simply replace the last part with the code k := k - (j - a) + 1 we obtain (an optimized form of) the straightforward algorithm.

Algorithm L can be further improved by incorporating a skip loop similar to the one that accounts for the excellent sublinear average time behavior of ABM. The idea of this technique is demonstrated in the following pair of examples:

    Text:          ......uuuuuuuuuua....
    Before Shift:        bcdabcdabcd
    After Shift:            bcdabcdabcd

    Text:          ......uuuuuuuuuue....
    Before Shift:        bcdabcdabcd
    After Shift:                    bcdabcdabcd

We inspect the text character $t_j$ that corresponds to the last character of the pattern, and if $t_j \ne p_{m-1}$ we shift the pattern by the amount determined by the skip function, which maps any character of the alphabet to the range $[0, m - a]$ and is defined as follows:

\[
\mathit{skip}(x) =
\begin{cases}
m - a & \text{if } \forall j : a \le j < m \Rightarrow p_j \ne x \\
m - 1 - i & \text{otherwise, where } i = \max \{\, j : a \le j < m \wedge p_j = x \,\}
\end{cases}
\]

This is the same function as Boyer and Moore's $\delta_1$ [BM77]. The following code replaces the scan part of Algorithm L:

⟨Scan the text using the skip loop 5a⟩ ≡
    loop
      d := skip(text(k + pattern_size - 1));
      exit when d = 0;
      k := k + d;
      if k > n - pattern_size then
        return n;
      end if;
    end loop;
Used in part 5c.

If the exit is taken from this loop then text(k + pattern_size - 1) = pattern(m - 1). We also change the verifying part of Algorithm L to the following:

⟨Verify the match for positions a through m - 2 5b⟩ ≡
    j := a;
    while text(k) = pattern(j) loop
      k := k + 1; j := j + 1;
      if j = m - 1 then
        return k - pattern_size + 1;
      end if;
    end loop;
Used in part 5c.

The algorithm incorporating these changes will be called the Accelerated Linear algorithm, or AL for short. In preliminary form, the algorithm is as follows:

⟨Accelerated Linear algorithm, preliminary version 5c⟩ ≡
    pattern_size := m - a; k := b;
    ⟨Handle pattern size = 1 as a special case 3b⟩
    ⟨Compute next table 31a⟩
    ⟨Compute skip table and mismatch shift 6a⟩
    while k <= n - pattern_size loop
      ⟨Scan the text using the skip loop 5a⟩
      ⟨Verify the match for positions a through m - 2 5b⟩
      if mismatch_shift > j - a then
        k := k + (mismatch_shift - (j - a));
      else
        ⟨Recover from a mismatch using the next table 4b⟩
      end if;
    end loop;
    return n;
Used in part 29c.
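To make the definition of the skip function concrete, the sketch below tabulates it for 8-bit characters and can be checked against the pair of examples above with the pattern bcdabcdabcd. The function name `make_skip_table` and the fixed table size of 256 are assumptions made for illustration, and the pattern index range is taken to start at a = 0.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: skip(x) as defined in the text, for a = 0.
// Characters absent from the pattern get the full pattern size m;
// a character whose last occurrence is at index i gets m - 1 - i.
std::array<std::size_t, 256> make_skip_table(const std::string& pattern) {
    const std::size_t m = pattern.size();
    std::array<std::size_t, 256> skip;
    skip.fill(m);
    // Later (rightmost) occurrences overwrite earlier ones, which yields
    // the maximum index i required by the definition.
    for (std::size_t j = 0; j < m; ++j)
        skip[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    return skip;
}
```

For the pattern bcdabcdabcd this gives a shift of 3 when the inspected text character is a, and a shift of 11 (the whole pattern size) when it is e, in agreement with the two examples.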
Following the verification part, we know that the last character of the pattern and the corresponding character of the text are equal, so we can choose whether to proceed to the recovery part that uses the next table or to shift the pattern by the amount mismatch_shift, predefined as

\[
\mathit{mismatch\_shift} =
\begin{cases}
m - a & \text{if } \forall j : a \le j < m - 1 \Rightarrow p_j \ne p_{m-1} \\
m - 1 - i & \text{otherwise, where } i = \max \{\, j : a \le j < m - 1 \wedge p_j = p_{m-1} \,\}
\end{cases}
\]

This value can be most easily computed if it is done during the computation of the skip table:

⟨Compute skip table and mismatch shift 6a⟩ ≡
    for i in Character'Range loop
      skip(i) := pattern_size;
    end loop;
    for j in a .. m - 2 loop
      skip(pattern(j)) := m - 1 - j;
    end loop;
    mismatch_shift := skip(pattern(m - 1));
    skip(pattern(m - 1)) := 0;
Used in parts 5c, 7b.

The skip loop as described above performs two tests for exit during each iteration. As suggested in [BM77], we can eliminate one of the tests by initializing skip(pattern(m - 1)) to some value large, chosen large enough to force an exit based on the size of the index. Upon exit, we can then perform another test to distinguish whether a match of a text character with the last pattern character was found or the pattern was shifted off the end of the text string. We also add pattern_size - 1 to k outside the loop and precompute $\mathit{adjustment} = \mathit{large} + \mathit{pattern\_size} - 1$.

⟨Scan the text using a single-test skip loop 6b⟩ ≡
    loop
      k := k + skip(text(k));
      exit when k >= n;
    end loop;
    if k < n + pattern_size then
      return n;
    end if;
    k := k - adjustment;
Not used.

We can further optimize the skip loop by translating k by n (by writing k := k - n before the main loop), which allows the exit test to be written as k >= 0.
⟨Scan the text using a single-test skip loop, with k translated 7a⟩ ≡
    loop
      k := k + skip(text(n + k));
      exit when k >= 0;
    end loop;
    if k < pattern_size then
      return n;
    end if;
    k := k - adjustment;
Used in part 7b.

This saves an instruction over testing k >= n, and a good compiler will compile text(n + k) with only one instruction in the loop since the computation of text + n can be moved outside. (In the C++ version we make sure of this optimization by putting it in the source code.) With this form of the skip loop, some compilers are able to translate it into only three instructions.

How large is large? At the top of the loop we have

\[
k \ge b - n + \mathit{pattern\_size} - 1 .
\]

In the case in which k is incremented by large, we must have

\[
\mathit{large} + b - n + \mathit{pattern\_size} - 1 \ge \mathit{pattern\_size} .
\]

Hence it suffices to choose $\mathit{large} = n - b + 1$.

⟨Accelerated Linear algorithm 7b⟩ ≡
    pattern_size := m - a; text_size := n - b; k := b;
    ⟨Handle pattern size = 1 as a special case 3b⟩
    ⟨Compute next table 31a⟩
    ⟨Compute skip table and mismatch shift 6a⟩
    large := text_size + 1;
    skip(pattern(m - 1)) := large;
    adjustment := large + pattern_size - 1;
    k := k - n;
    loop
      k := k + pattern_size - 1;
      exit when k >= 0;
      ⟨Scan the text using a single-test skip loop, with k translated 7a⟩
      ⟨Verify match or recover from mismatch 8a⟩
    end loop;
    return n;
Used in part 27c.

We can also optimize the verification of a match by handling as a special case the frequently occurring case in which the first characters do not match.

⟨Verify match or recover from mismatch 8a⟩ ≡
    if text(n + k) /= pattern(a) then
      k := k + mismatch_shift;
    else
      ⟨Verify the match for positions a + 1 through m - 1, with k translated 8b⟩
      if mismatch_shift > j - a then
        k := k + (mismatch_shift - (j - a));
      else
        ⟨Recover from a mismatch using the next table, with k translated 8c⟩
      end if;
    end if;
Used in parts 7b, 12b.
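The single-test skip loop with the sentinel value large and the translated index k can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the AL algorithm itself: the name `skip_loop_search` is ours, a = b = 0, characters are 8-bit, and a candidate position is verified by a straightforward comparison followed by a shift of mismatch_shift, with no next table. The sketch therefore lacks AL's linear worst-case guarantee and is meant only to show how the sentinel and the translation of k interact.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the start of the first match, or text.size() if there is none.
std::size_t skip_loop_search(const std::string& text, const std::string& pattern) {
    const long n = static_cast<long>(text.size());
    const long m = static_cast<long>(pattern.size());
    if (m == 0) return 0;
    if (m > n) return text.size();

    std::vector<long> skip(256, m);
    for (long j = 0; j + 1 < m; ++j)
        skip[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    const unsigned char last = static_cast<unsigned char>(pattern[m - 1]);
    const long mismatch_shift = skip[last];  // shift to use after a failed verification
    const long large = n + 1;                // sentinel: forces k >= m on a last-character hit
    skip[last] = large;

    // k is translated by n: text[n + k] is aligned with the last pattern character.
    long k = (m - 1) - n;
    while (k < 0) {
        do {                                 // one exit test per iteration
            k += skip[static_cast<unsigned char>(text[n + k])];
        } while (k < 0);
        if (k < m) break;                    // ordinary shifts ran off the end of the text
        k -= large;                          // undo the sentinel jump
        const long start = n + k - (m - 1);
        if (text.compare(static_cast<std::size_t>(start),
                         static_cast<std::size_t>(m), pattern) == 0)
            return static_cast<std::size_t>(start);
        k += mismatch_shift;
    }
    return text.size();
}
```

Because k is at least m - 1 - n on entry to the inner loop and ordinary shifts are at most m, an exit with k < m can only come from an ordinary shift, while an exit caused by the sentinel always leaves k >= m. This is the same distinction that the test following the loop in part 7a relies on.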
The verification loop used in part 8a doesn't really need to check position $m - 1$, but we write it that way in preparation for the hashed version to be described later.

⟨Verify the match for positions a + 1 through m - 1, with k translated 8b⟩ ≡
    j := a + 1;
    loop
      k := k + 1;
      exit when text(n + k) /= pattern(j);
      j := j + 1;
      if j = m then
        return n + k - pattern_size + 1;
      end if;
    end loop;
Used in part 8a.

⟨Recover from a mismatch using the next table, with k translated 8c⟩ ≡
    loop
      j := next(j);
      if j < a then
        k := k + 1; exit;
      end if;
      exit when j = a;
      while text(n + k) = pattern(j) loop
        k := k + 1; j := j + 1;
        if j = m then
          return n + k - pattern_size;
        end if;
        if k = 0 then
          return n;
        end if;
      end loop;
    end loop;
Used in part 8a.

The AL algorithm thus obtained retains the same $2n$ upper bound on the number of comparisons as the original KMP algorithm and acquires sublinear average time behavior equal or superior to ABM.

3 Benchmarking with English Texts

Before generalizing AL by introducing a hash function, let us consider its use as-is for traditional string matching. We benchmarked five algorithms with English text searches: a C++ version of SF used in the Hewlett-Packard STL implementation; L and AL in their C++ versions as given later in the paper and appendices; and the C versions of ABM [BM77] and TBM as given by Hume and Sunday [HS91]. (The version of AL actually used is the hashed version, HAL, discussed in the next section, but using the identity function as the hash function.) We searched for patterns of size ranging from 2 to 18 in Lewis Carroll's Through the Looking Glass.
The text is composed of 171,556 characters, and the test set included up to 800 different patterns for each pattern size: 400 text strings chosen at evenly spaced positions in the target text and up to 400 words chosen from the Unix spell-check dictionary (for longer pattern sizes there were fewer than 400 words). Table 1 shows search speeds of the five algorithms with code compiled and executed on three different systems:

1. g++ compiler, version 2.7.2.2, 60-MHz Pentium processor;
2. SGI CC compiler, version 7.10, SGI O2 with MIPS R5000 2.1 processor;
3. Apogee apCC compiler, version 3.0, 200 MHz UltraSPARC processor.

These results show that HAL, ABM, and TBM are quite close in performance and are substantially better than the SF or L algorithms. On System 1, TBM is slightly faster than HAL on the longer strings, but not enough to outweigh two significant drawbacks: first, like SF, it takes $\Omega(mn)$ time in the worst case; and, second, it achieves its slightly better average case performance through the use of character frequency distribution information that might need to be changed in applications of the algorithm other than English text searches. For both of these reasons, TBM is not a good candidate for inclusion in a library of generic algorithms.

For more machine independent performance measures, we show in a later section the number of operations per character searched, for various kinds of operations.

4 Hashed Accelerated Linear Algorithm

The skip loop produces a dramatic effect on the algorithm when we search for a word or a phrase in an ordinary English text or in the text in some other natural or programming language with a mid-sized alphabet (26-256 characters, say). However, algorithms that use this technique are dependent on the alphabet size.
In case of a large alphabet, the result is increased storage requirements and overhead for initialization of the occurrence table. Secondary effects are also possible due to architectural reasons such as cache performance. Performance of the skip loop is also diminished in cases in which the pattern size is much greater than the size of the alphabet. A good example of this case is searching for DNA patterns, which could be relatively long, say 250 characters, whereas the alphabet contains only four characters. In this section we show how to generalize the skip loop to handle such adverse cases.

Table 1: Algorithm Speed (Characters Per Microsecond) in English Text Searches on Three Systems

    Pattern Size  Algorithm  System 1  System 2  System 3
    2             ABM        8.89665   24.6946   32.9261
                  HAL        8.26117   24.6946   32.9261
                  L          6.08718   24.6946   32.9261
                  SF         4.28357   9.87784   24.6946
                  TBM        10.5142   32.9261   32.9261
    4             ABM        20.4425   46.7838   68.9446
                  HAL        23.3995   51.0369   83.6137
                  L          6.52724   27.8712   38.9093
                  SF         4.29622   9.84923   23.3919
                  TBM        21.2602   49.123    71.4517
    6             ABM        28.1637   60.2832   89.4829
                  HAL        31.2569   63.6323   108.055
                  L          6.45279   27.4015   37.9265
                  SF         4.28142   9.84005   22.1973
                  TBM        29.2294   62.249    93.8837
    8             ABM        33.7463   69.2828   106.674
                  HAL        37.0999   73.0482   126.801
                  L          6.34086   26.6684   36.5241
                  SF         4.23323   9.78229   22.0342
                  TBM        35.3437   72.2627   112.007
    10            ABM        39.6329   76.2308   117.47
                  HAL        42.5986   80.5134   135.202
                  L          6.32525   26.6383   36.1904
                  SF         4.22537   9.74924   21.9134
                  TBM        41.1973   78.7439   125.714
    14            ABM        47.7986   89.1214   129.631
                  HAL        49.8997   92.9962   147.511
                  L          6.22037   25.9262   33.6837
                  SF         4.189     9.72233   21.1774
                  TBM        49.3573   92.9962   142.594
    18            ABM        50.1514   97.859    141.352
                  HAL        50.1514   101.773   159.021
                  L          5.86185   24.7023   31.4115
                  SF         4.05173   9.63763   21.0275
                  TBM        51.2912   97.859    149.667

The key idea of the generalization is to apply a hash function to the current position in the text to obtain an argument for the skip function.
⟨Scan the text using a single-test skip loop with hashing 11⟩ ≡
    loop
      k := k + skip(hash(text, n + k));
      exit when k >= 0;
    end loop;
    if k < pattern_size then
      return n;
    end if;
    k := k - adjustment;
Used in part 12b.

We have seen that the skip loop works well when the cardinality of the domain of the skip function is of moderate size, say $\sigma = 256$, as it is in most conventional string searches. When used with sequences over a type T with large (even infinite) cardinality, hash can be chosen so that it maps T values to the range $[0, \sigma)$. Conversely, if the cardinality of T is smaller than $\sigma$, we can use more than one element of the text sequence to compute the hash value in order to obtain $\sigma$ distinct values. In the context in which the skip loop appears, we always have available at least pattern_size elements; whenever pattern_size is too small to yield $\sigma$ different hash values, we can either make do with fewer values or resort to an algorithm that does not use a skip loop, such as Algorithm L. (The skip loop is not very effective for small pattern lengths anyway.)

Of course, the skip table itself and the mismatch shift value must be computed using the hash function. Let suffix_size be the number of sequence elements used in computing the hash function, where $1 \le \mathit{suffix\_size} \le \mathit{pattern\_size}$.

⟨Compute skip table and mismatch shift using the hash function 12a⟩ ≡
    for i in hash_range loop
      skip(i) := pattern_size - suffix_size + 1;
    end loop;
    for j in a + suffix_size - 1 .. m - 2 loop
      skip(hash(pattern, j)) := m - 1 - j;
    end loop;
    mismatch_shift := skip(hash(pattern, m - 1));
    skip(hash(pattern, m - 1)) := 0;
Used in part 12b.

The remainder of the computation can remain the same, so we have the following algorithm in which it is assumed that the hash function uses up to suffix_size elements, where $1 \le \mathit{suffix\_size} \le \mathit{pattern\_size}$.
⟨Hashed Accelerated Linear algorithm 12b⟩ ≡
    pattern_size := m - a; text_size := n - b; k := b;
    ⟨Handle pattern size = 1 as a special case 3b⟩
    ⟨Compute next table 31a⟩
    ⟨Compute skip table and mismatch shift using the hash function 12a⟩
    large := text_size + 1;
    skip(hash(pattern, m - 1)) := large;
    adjustment := large + pattern_size - 1;
    k := k - n;
    loop
      k := k + pattern_size - 1;
      exit when k >= 0;
      ⟨Scan the text using a single-test skip loop with hashing 11⟩
      ⟨Verify match or recover from mismatch 8a⟩
    end loop;
    return n;
Used in part 29b.

This algorithm will be called HAL. Note that AL is itself a special case of HAL, obtained using $\mathit{hash}(\mathit{text}, k) = \mathit{text}(k)$ and (hence) $\mathit{suffix\_size} = 1$. By inlining hash, we can use HAL instead of AL with minimal performance penalty (none with a good compiler).

It is also noteworthy that in this application of hashing, a "bad" hash function causes no great harm, unlike the situation with associative table searching in which hashing methods usually have excellent average case performance (constant time) but with a bad hash function can degrade terribly to linear time. (Thus, in a table with thousands of elements, searching might take thousands of times longer than expected.) Here the worst that can happen (with, say, a hash function that maps every element to the same value) is that a sublinear algorithm degrades to linearity. As a consequence, in choosing hash functions we can lean toward ease of computation rather than uniform distribution of the hash values.

There is, however, an essential requirement on the hash function that must be observed when performing sequence matching in terms of an equivalence relation $\equiv$ on sequence elements that is not an equality relation. In this case, we must require that equivalent values hash to the same value:

\[
x \equiv y \supset \mathit{hash}(x) = \mathit{hash}(y) \quad \text{for all } x, y \in T .
\]
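As an illustration of this requirement, consider case-insensitive matching of 8-bit characters, where two characters are equivalent when they differ only in case. The sketch below is hypothetical (the names `equivalent` and `hash_ci` are ours and are not part of the library described in this paper); it satisfies the requirement by hashing a canonical representative of each equivalence class.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical equivalence relation: equality up to letter case.
inline bool equivalent(char x, char y) {
    return std::tolower(static_cast<unsigned char>(x)) ==
           std::tolower(static_cast<unsigned char>(y));
}

// Hypothetical hash: computed from the lower-case form, so any two
// equivalent characters necessarily receive the same hash value.
inline std::size_t hash_ci(char x) {
    return static_cast<std::size_t>(std::tolower(static_cast<unsigned char>(x)));
}
```

A hash that used the raw character code instead would give 'A' and 'a' different skip table entries, and the skip loop could then shift past a position at which the pattern matches under the equivalence relation.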
We discuss this requirement on the hash function further in a later section on generic library versions of the algorithms.

5 Searching for DNA Patterns

As an important example of the use of the HAL algorithm, consider DNA searches, in which the alphabet has only four characters and patterns may be very long, say 250 characters. For this application we experimented with hash functions $h_{c,k}$ that map a string of $c$ characters into the integer range $[0, k)$. We chose four such functions, $h_{2,64}$, $h_{3,512}$, $h_{4,256}$, and $h_{5,256}$, all of which add up the results of various shifts of the characters they inspect. For example (see footnote 2),

\[
h_{4,256}(t, k) = \bigl( t(k-3) + 2^2\, t(k-2) + 2^4\, t(k-1) + 2^6\, t(k) \bigr) \bmod 256 .
\]

The algorithms that use these hash functions bear the names HAL2, HAL3, HAL4, and HAL5, respectively. The other contestants were SF, L, ABM and the Giancarlo-Boyer-Moore algorithm (GBM), which was described in [HS91] and was considered to be the fastest for DNA pattern matching.

We searched for patterns of size ranging from 20 to 200 in a text of DNA strings obtained from [DNAsource]. The text is composed of 997,642 characters, and the test set included up to 80 different patterns for each pattern size: 40 strings chosen at evenly spaced positions in the target text and up to 40 patterns chosen from another file from [DNAsource] (for longer pattern sizes there were fewer than 40 patterns). Table 2 shows search speeds of the algorithms with code compiled and executed on the same three systems as in the English text experiments (the systems described preceding Table 1).

We see that each of the versions of HAL is significantly faster than any of the other algorithms, and the speed advantage increases with longer patterns: for patterns of size 200, HAL5 is over 3.5 to 5.5 times faster than its closest competitor, GBM, depending on the system.
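The hash function $h_{4,256}$ given above can be sketched in C++ using shifts in place of the multiplications and a mask in place of the division. This is only an illustrative sketch and not the code of the appendix: the name `h4_256` and its signature are ours, and we assume that each text element contributes its 8-bit character code.

```cpp
#include <cstddef>
#include <string>

// Sketch of h_{4,256}: combines the four characters ending at position k.
// Precondition (assumed): 3 <= k < t.size().
inline unsigned h4_256(const std::string& t, std::size_t k) {
    const unsigned c0 = static_cast<unsigned char>(t[k - 3]);
    const unsigned c1 = static_cast<unsigned char>(t[k - 2]);
    const unsigned c2 = static_cast<unsigned char>(t[k - 1]);
    const unsigned c3 = static_cast<unsigned char>(t[k]);
    // Shifts by 2, 4, 6 multiply by 2^2, 2^4, 2^6; the mask reduces mod 256.
    return (c0 + (c1 << 2) + (c2 << 4) + (c3 << 6)) & 0xFFu;
}
```

Since the function reads four elements, the corresponding suffix_size in the hashed skip table computation would be 4.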
It appears that HAL4 is slightly faster than HAL5 or HAL3, but further experiments with different hash functions might yield even better performance.

Footnote 2: In the C++ coding of this computation we use shifts in place of multiplication and masking in place of division. The actual C++ versions are shown in an appendix.

Table 2: Algorithm Speed (Characters Per Microsecond) in DNA Searches on Three Systems

    Pattern Size  Algorithm  System 1  System 2  System 3
    20            ABM        16.0893   37.3422   55.6074
                  GBM        25.8827   62.3888   138.267
                  HAL        13.1493   32.5853   53.8514
                  HAL2       28.7208   67.3143   146.168
                  HAL3       22.8165   63.9486   131.177
                  HAL4       21.9008   58.135    113.686
                  L          4.33091   16.3971   18.9477
                  SF         3.28732   8.31851   16.94
                  TBM        18.0395   42.6324   63.1591
    50            ABM        19.5708   44.693    70.6633
                  GBM        33.6343   78.046    193.67
                  HAL        13.6876   33.736    60.8033
                  HAL2       49.5795   106.716   275.215
                  HAL3       43.4625   106.716   290.505
                  HAL4       42.632    96.8349   249.004
                  L          4.3519    16.6003   19.439
                  SF         3.28906   8.31333   16.7599
                  TBM        24.0764   54.4696   87.1514
    100           ABM        21.2655   49.6781   73.4371
                  GBM        36.6439   85.8841   220.311
                  HAL        12.946    32.0706   56.3018
                  HAL2       70.4997   163.457   389.782
                  HAL3       71.2744   187.673   460.651
                  HAL4       70.4997   168.905   460.651
                  L          4.24474   16.0862   19.1938
                  SF         3.24623   8.23929   16.5054
                  TBM        27.8368   66.6732   105.566
    150           ABM        24.2269   56.1366   86.667
                  GBM        37.6383   86.667    205.834
                  HAL        14.0205   34.5456   60.9879
                  HAL2       84.3097   197.601   548.891
                  HAL3       91.641    247.001   548.891
                  HAL4       90.3318   235.239   494.002
                  L          4.33395   16.3037   19.5258
                  SF         3.28992   8.33056   17.0935
                  TBM        29.1393   72.6474   107.392
    200           ABM        23.9786   55.3853   86.3636
                  GBM        37.0578   90.9902   212.31
                  HAL        13.3106   33.3036   57.9028
                  HAL2       89.3449   221.541   509.545
                  HAL3       103.527   283.081   636.931
                  HAL4       105.196   283.081   566.161
                  L          4.26565   16.2275   19.0841
                  SF         3.25946   8.28528   16.8167
                  TBM        28.7321   73.8471   113.232

6 Large Alphabet Case

Suppose the alphabet $\Sigma$ has $2^{16} = 65{,}536$ symbols, as in Unicode for example.
To use the skip loop directly we must initialize a table with 65,536 entries. If we are only going to search, say, for a short pattern in a 10,000 character text, the initialization overhead dominates the rest of the computation.

One way to eliminate dependency on the alphabet size is to use a large zero-filled global table $\mathit{skip1}(x) = \mathit{skip}(x) - m$, so that the algorithm fills at most $m$ positions with pattern-dependent values at the beginning, performs a search, and then restores zeroes. This approach makes the algorithm non-reentrant and therefore not suitable for multi-threaded applications, but it seems worth investigating for single-threaded applications.

Another approach is to use HAL with, say, $H(t, k) = t(k) \bmod 256$ as the hash function. In order to compare these two approaches we implemented a non-hashed version of AL using skip1, called NHAL, and benchmarked it against HAL with $H$ as the hash function. The C++ code for NHAL is shown in an appendix.

We searched for patterns of size ranging from 2 to 18 in randomly generated texts of size 1,000,000 characters, with each character being an integer chosen with uniform distribution from 0 to 65,535. Patterns were chosen from the text at random positions. The test set included 500 different patterns for each pattern size. Table 3 summarizes the timings obtained using the same three systems as described preceding Table 1. We can see that HAL demonstrates significantly better performance than NHAL. On systems 1 and 2 the ratio of HAL's speed to NHAL's is much higher than on system 3, and we attribute this disparity to poor optimization abilities of the compilers we used on systems 1 and 2.

We conclude that the hashing technique presents a viable and efficient way to eliminate alphabet-size dependency of search algorithms that use the skip loop.
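The effect of hashing on table size can be seen in the following C++ sketch, which searches 16-bit character strings with a 256-entry skip table indexed by $H(t, k) = t(k) \bmod 256$. It is a simplified illustration under stated assumptions rather than HAL itself: the names `hash256` and `hashed_skip_search` are ours, the plain two-test form of the skip loop is used, and a candidate position is verified by direct comparison with no next table, so the linear worst-case bound of HAL is not preserved.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical hash: the low 8 bits of a 16-bit character (t(k) mod 256).
inline std::size_t hash256(char16_t c) {
    return static_cast<std::size_t>(c) & 0xFFu;
}

// Returns the start of the first match, or text.size() if there is none.
std::size_t hashed_skip_search(const std::u16string& text,
                               const std::u16string& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0) return 0;
    if (m > n) return n;

    std::array<std::size_t, 256> skip;   // 256 entries regardless of alphabet size
    skip.fill(m);
    for (std::size_t j = 0; j + 1 < m; ++j)
        skip[hash256(pattern[j])] = m - 1 - j;
    const std::size_t mismatch_shift = skip[hash256(pattern[m - 1])];
    skip[hash256(pattern[m - 1])] = 0;

    std::size_t k = 0;                   // start of the current alignment
    while (k <= n - m) {
        const std::size_t d = skip[hash256(text[k + m - 1])];
        if (d != 0) { k += d; continue; }
        // A zero skip only means the hashes agree, so the match must be verified.
        if (text.compare(k, m, pattern) == 0) return k;
        k += mismatch_shift;
    }
    return n;
}
```

Distinct characters that collide under the hash (for example, codes 0x0061, 0x0161, and 0x0261) merely cause shorter shifts or extra verifications; they cannot cause a match to be missed, which is why a cheap hash is acceptable here.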
Table 3: Algorithm Speed (Characters Per Microsecond) in Large Alphabet Case on Three Systems

Pattern Size  Algorithm  System 1  System 2  System 3
2             HAL        10.4369   27.3645   39.5632
              L          7.11972   28.5543   41.5664
              NHAL       6.63479   12.3915   26.4818
              SF         4.48334   9.83157   23.125
4             HAL        18.7455   44.375    64.3873
              L          7.125     28.3082   41.5665
              NHAL       10.9539   21.0497   47.5906
              SF         4.48611   9.86112   22.8038
6             HAL        25.3206   54.9243   86.7225
              L          7.1179    28.4091   41.7146
              NHAL       14.2358   28.1663   63.3741
              SF         4.505     9.86663   23.0451
8             HAL        31.1354   67.1919   94.0686
              L          7.12946   28.6296   41.676
              NHAL       16.9606   33.5959   80.3025
              SF         4.49445   9.88709   22.7062
10            HAL        35.7717   72.4895   112.484
              L          7.09913   28.3655   41.2915
              NHAL       19.1634   38.3768   98.8494
              SF         4.49017   9.85507   22.8114
14            HAL        42.9195   78.1701   149.234
              L          7.1132    28.5491   41.0393
              NHAL       23.5262   47.5818   136.798
              SF         4.48911   9.80043   22.7996
18            HAL        47.51862  96.9324   173.458
              L          7.144521  28.1684   41.1963
              NHAL       26.4274   56.8225   164.785
              SF         4.48312   9.80864   22.8868

7 Generic Search Algorithms

As we have seen, the HAL algorithm retains its efficiency advantages in a wide variety of search problems of practical interest, including traditional string searching with small or large alphabets, and short or long patterns. These qualities make it a good candidate for abstraction to searching in sequences over an arbitrary type T, for inclusion in generic software libraries such as the C++ Standard Template Library (STL) [StepanovLee, MS96].

By some definitions of genericity, HAL is already a generic algorithm, since the hash function can be made a parameter and thus the algorithm can be adapted to work with any type T. In STL, however, another important issue bearing on the generality of operations on linear sequences is the kind of access to the sequences assumed: random access or something weaker, such as forward, single-step advances only.
STL generic algorithms are specified to access sequences via iterators, which are generalizations of ordinary C/C++ pointers. STL defines five categories of iterators, the most powerful being random-access iterators, for which computing i + n or i − n, for iterator i and integer n, is a constant time operation. Forward iterators allow scanning a sequence with only single-step advances and only in a forward direction. AL and HAL require random access for most efficient operation of the skip loop, whereas Algorithm L, with only minor modifications to the expository version, can be made to work efficiently with forward iterators.

The efficiency issue is considered crucial in the STL approach to genericity. STL is not a set of specific software components but a set of requirements which components must satisfy. By making time complexity part of the requirements for components, STL ensures that compliant components not only have the specified interfaces and semantics but also meet certain computing time bounds. The requirements on most components are stated in terms of inputs and outputs that are linear sequences over some type T. The requirements are stated as generally as possible, but balanced against the goal of using efficient algorithms. In fact, the requirements were generally chosen based on knowledge of existing efficient concrete algorithms, by finding the weakest assumptions (about T and about how the sequence elements are accessed) under which those algorithms could still be used without losing their efficiency. In most cases, the computing time requirements are stated as worst-case bounds, but exceptions are made when the concrete algorithms with the best worst-case bounds are not as good in the average case as other algorithms, provided the worst cases occur very infrequently in practice.
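The difference between the two iterator categories can be made concrete with a small sketch. It uses the standard iterator_traits and category tag types; the function name step_forward is hypothetical and chosen only for this illustration. The forward version needs n single steps, while the random-access version computes i + n directly.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <vector>

// Forward iterators: n single-step advances, so the cost grows with n.
template <class ForwardIterator>
ForwardIterator step_forward(ForwardIterator i, std::ptrdiff_t n,
                             std::forward_iterator_tag)
{
    while (n-- > 0)
        ++i;
    return i;
}

// Random-access iterators: i + n is a constant time operation.
template <class RandomAccessIterator>
RandomAccessIterator step_forward(RandomAccessIterator i, std::ptrdiff_t n,
                                  std::random_access_iterator_tag)
{
    return i + n;
}

// Compile-time selection: the category tag picks the overload.
template <class Iterator>
Iterator step_forward(Iterator i, std::ptrdiff_t n)
{
    typedef typename std::iterator_traits<Iterator>::iterator_category Category;
    return step_forward(i, n, Category());
}
```

The same tag-based selection lets a library offer one interface while meeting the best time bound each iterator category permits.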
In the case of sequence search algorithms, the concrete algorithms considered for generalization to include in STL were various string-search algorithms, including BM, KMP, and SF. Although KMP has the lowest worst-case bound, it was stated in the original STL report [StepanovLee] that SF was superior in the average case.³ And although BM has excellent average time behavior, it was evidently ruled out as a generic algorithm because of its alphabet size dependency. Thus the generic search algorithm requirements were written with an O(mn) time bound, to allow its implementation by SF.⁴

³ The original STL requirements included the following statement (which has been dropped in more recent versions of the Draft C++ Standard): "... The Knuth-Morris-Pratt algorithm is not used here. While the KMP algorithm guarantees linear time, it tends to be slower in most practical cases than the naive algorithm with worst-case quadratic behavior ...." As we have already seen from Table 1, however, a suitably optimized version of KMP (Algorithm L) is significantly faster than SF.

⁴ This did not preclude library implementors from also supplying a specialization of the search operation for the string search case, implemented with BM. The original requirements statement for the search operation noted this possibility but more recent drafts fail to mention it. We are not aware of any currently available STL implementations that do provide such a specialization.

Thus in the Draft C++ Standard dated December 1996 [DraftCPP], two sequence search functions are required, with the specifications:

    template <class ForwardIterator1, class ForwardIterator2>
    ForwardIterator1 search(ForwardIterator1 first1,
                            ForwardIterator1 last1,
                            ForwardIterator2 first2,
                            ForwardIterator2 last2);

    template <class ForwardIterator1, class ForwardIterator2,
              class BinaryPredicate>
    ForwardIterator1 search(ForwardIterator1 first1,
                            ForwardIterator1 last1,
                            ForwardIterator2 first2,
                            ForwardIterator2 last2,
                            BinaryPredicate pred);

Effects: Finds a subsequence of equal values in a sequence.

Returns: The first iterator i in the range [first1, last1 − (last2 − first2)) such that for any non-negative integer n less than last2 − first2 the following corresponding conditions hold: *(i + n) == *(first2 + n), pred(*(i + n), *(first2 + n)) != false. Returns last1 if no such iterator is found.

Complexity: At most (last1 − first1) * (last2 − first2) applications of the corresponding predicate.

Before going further, we note that the results of the present article would allow the complexity requirement to be replaced with the much stronger requirement that the computing time be O((last1 − first1) + (last2 − first2)).

We will base our discussion on the first interface, which assumes operator== is used for testing sameness of two sequence elements; the only added issue for the binary predicate case is the requirement mentioned earlier, that for HAL we must choose a hash function compatible with the binary predicate, in the sense that any two values that are equivalent according to the predicate must be mapped to the same value by the hash function. Fortunately, for a given predicate it is usually rather easy to choose a hash function that guarantees this property. (A heuristic guide is to choose a hash function that uses less information than the predicate.)

The fact that this standard interface only assumes forward iterators would seem to preclude HAL, since the skip loop requires random access.
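The compatibility requirement between hash function and binary predicate can be sketched with a case-insensitive search. The sketch below uses only the standard std::search with a predicate; the names equal_ignoring_case, case_folding_hash, and find_ignoring_case are hypothetical, and the hash is shown only to illustrate the property, since std::search itself does not take a hash function.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive equality, usable as the BinaryPredicate of std::search.
inline bool equal_ignoring_case(char a, char b)
{
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// Hypothetical compatible hash: it folds case before hashing, so any two
// characters the predicate treats as equivalent receive the same value.
inline unsigned char case_folding_hash(char c)
{
    return static_cast<unsigned char>(
        std::tolower(static_cast<unsigned char>(c)));
}

// Index of the first case-insensitive match, or text.size() if none.
inline std::size_t find_ignoring_case(const std::string& text,
                                      const std::string& pattern)
{
    std::string::const_iterator hit =
        std::search(text.begin(), text.end(),
                    pattern.begin(), pattern.end(), equal_ignoring_case);
    return static_cast<std::size_t>(hit - text.begin());
}
```

Here the hash uses no more information than the predicate does (the case-folded character), which follows the heuristic mentioned above.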
There are, however, many cases of sequence searching in which we do have random access, and we do not want to miss the speedup afforded by the skip loop in those cases. Fortunately, it is possible to provide for easy selection of the most appropriate algorithm under different actual circumstances, including whether random access or only forward access is available, and whether type T has a small or large number of distinct values. For this purpose we use traits, a programming device for compile-time selection of alternative type and code definitions. Traits are supported in C++ by the ability to give definitions of function templates or class templates for specializations of their template parameters.⁵

⁵ Limited forms of the trait device were used in defining some iterator operations in the first implementations of STL. More recently the trait device has been adopted more broadly in other parts of the library, particularly to provide different definitions of floating point and other parameters used in numeric algorithms. The most elaborate uses of the device employ the recently added C++ feature of partial specialization, in which new definitions can be given with some template parameters specialized while others are left unspecialized. Few C++ compilers currently support partial specialization, but we do not need it here anyway.

Algorithm L is not difficult to adapt to work with iterators instead of array indexing. The most straightforward translation would require random access iterators, but with a few adjustments we can express the algorithm entirely with forward iterator operations, making it fit the STL search function interface.

⟨User level search function 19a⟩ ≡
    template <class ForwardIterator1, class ForwardIterator2>
    inline ForwardIterator1 search(ForwardIterator1 text,
                                   ForwardIterator1 textEnd,
                                   ForwardIterator2 pattern,
                                   ForwardIterator2 patternEnd)
    {
      typedef iterator_traits<ForwardIterator1> T;
      // Dispatch on the iterator category of the text sequence.
      return __search(text, textEnd, pattern, patternEnd,
                      typename T::iterator_category());
    }
Used in part 38b.

When we only have forward iterators, we use Algorithm L.

⟨Forward iterator case 19b⟩ ≡
    template <class ForwardIterator1, class ForwardIterator2>
    inline ForwardIterator1 __search(ForwardIterator1 text,
                                     ForwardIterator1 textEnd,
                                     ForwardIterator2 pattern,
                                     ForwardIterator2 patternEnd,
                                     forward_iterator_tag)
    {
      return __search_L(text, textEnd, pattern, patternEnd);
    }

    template <class ForwardIterator1, class ForwardIterator2>
    ForwardIterator1 __search_L(ForwardIterator1 text,
                                ForwardIterator1 textEnd,
                                ForwardIterator2 pattern,
                                ForwardIterator2 patternEnd)
    {
      typedef typename iterator_traits<ForwardIterator2>::difference_type
        Distance2;
      ForwardIterator1 advance, hold;
      ForwardIterator2 p, p1;
      Distance2 j, m;
      vector<Distance2> next;
      vector<ForwardIterator2> pattern_iterator;
      ⟨Compute next table (C++ forward) 20a⟩
      m = next.size();
      ⟨Algorithm L, optimized linear pattern search (C++) 21a⟩
    }
Used in part 38b.

We store the next table in an STL vector, which provides random access to the integral next values; to be able to get from them back to the correct positions in the pattern sequence we also store iterators in another vector, pattern_iterator.

⟨Compute next table (C++ forward) 20a⟩ ≡
    compute_next(pattern, patternEnd, next, pattern_iterator);
Used in part 19b.
⟨Define procedure to compute next table (C++ forward) 20b⟩ ≡
    template <class ForwardIterator, class Distance>
    void compute_next(ForwardIterator pattern,
                      ForwardIterator patternEnd,
                      vector<Distance>& next,
                      vector<ForwardIterator>& pattern_iterator)
    {
      Distance t = -1;
      next.reserve(32);
      pattern_iterator.reserve(32);
      next.push_back(-1);
      pattern_iterator.push_back(pattern);
      for (;;) {
        ForwardIterator advance = pattern;
        ++advance;
        if (advance == patternEnd)
          break;
        while (t >= 0 && *pattern != *pattern_iterator[t])
          t = next[t];
        ++pattern; ++t;
        if (*pattern == *pattern_iterator[t])
          next.push_back(next[t]);
        else
          next.push_back(t);
        pattern_iterator.push_back(pattern);
      }
    }
Used in part 38b.

Returning to the search algorithm itself, the details are as follows:

⟨Algorithm L, optimized linear pattern search (C++) 21a⟩ ≡
    ⟨Handle pattern size = 1 as a special case (C++) 21b⟩
    p1 = pattern; ++p1;
    while (text != textEnd) {
      ⟨Scan the text for a possible match (C++) 21c⟩
      ⟨Verify whether a match is possible at the position found (C++) 21d⟩
      ⟨Recover from a mismatch using the next table (C++ forward) 22a⟩
    }
    return textEnd;
Used in part 19b.

For the case of pattern size 1, we use the STL generic linear search algorithm, find.

⟨Handle pattern size = 1 as a special case (C++) 21b⟩ ≡
    if (next.size() == 1)
      return find(text, textEnd, *pattern);
Used in parts 21a, 42a, 46a.

The three parts of the body of the main loop are direct translations from the Ada versions given earlier, using pointer manipulation in place of array indexing.

⟨Scan the text for a possible match (C++) 21c⟩ ≡
    while (*text != *pattern)
      if (++text == textEnd)
        return textEnd;
Used in part 21a.

⟨Verify whether a match is possible at the position found (C++) 21d⟩ ≡
    p = p1; j = 1;
    hold = text;
    if (++text == textEnd)
      return textEnd;
    while (*text == *p) {
      if (++p == patternEnd)
        return hold;
      if (++text == textEnd)
        return textEnd;
      ++j;
    }
Used in part 21a.
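The compute_next part above can be exercised on its own. The following sketch repeats its steps with std:: qualification and runs them over a std::forward_list, which offers only forward iterators. The wrapper next_table is hypothetical, the reserve calls are omitted for brevity, and a non-empty pattern is assumed. The pattern "abcabcacab" is the one used in the KMP paper example that appears in the small tests.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <string>
#include <vector>

// Same steps as the compute_next part, with std:: qualification.
// Precondition (an assumption of this sketch): the pattern is non-empty.
template <class ForwardIterator, class Distance>
void compute_next(ForwardIterator pattern, ForwardIterator patternEnd,
                  std::vector<Distance>& next,
                  std::vector<ForwardIterator>& pattern_iterator)
{
    Distance t = -1;
    next.push_back(-1);
    pattern_iterator.push_back(pattern);
    for (;;) {
        ForwardIterator advance = pattern;
        ++advance;
        if (advance == patternEnd)
            break;
        // Fall back through shorter borders until the symbols agree.
        while (t >= 0 && *pattern != *pattern_iterator[t])
            t = next[t];
        ++pattern; ++t;
        // Optimized table: skip positions that would repeat the mismatch.
        if (*pattern == *pattern_iterator[t])
            next.push_back(next[t]);
        else
            next.push_back(t);
        pattern_iterator.push_back(pattern);
    }
}

// Hypothetical helper: builds the table for a string held in a forward-only
// container and returns the integral next values.
inline std::vector<std::ptrdiff_t> next_table(const std::string& s)
{
    std::forward_list<char> pattern(s.begin(), s.end());
    std::vector<std::ptrdiff_t> next;
    std::vector<std::forward_list<char>::iterator> where;
    compute_next(pattern.begin(), pattern.end(), next, where);
    return next;
}
```

For "abcabcacab" the resulting values are -1 0 0 -1 0 0 -1 4 -1 0, which is the table of the KMP paper shifted from 1-based to 0-based indexing.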
⟨Recover from a mismatch using the next table (C++ forward) 22a⟩ ≡
    for (;;) {
      j = next[j];
      if (j < 0) {
        ++text;
        break;
      }
      if (j == 0)
        break;
      p = pattern_iterator[j];
      while (*text == *p) {
        ++text; ++p; ++j;
        if (p == patternEnd) {
          ⟨Compute and return position of match 22b⟩
        }
        if (text == textEnd)
          return textEnd;
      }
    }
Used in part 21a.

Returning the match position requires use of the hold iterator saved for that purpose.

⟨Compute and return position of match 22b⟩ ≡
    advance = hold;
    for (int i = m; --i >= 0;)
      ++advance;
    while (advance != text)
      ++advance, ++hold;
    return hold;
Used in part 22a.

Through the use of traits, we provide for automatic selection of either the above version of algorithm L in the case of forward or bidirectional iterators, or the faster HAL algorithm when random access to the sequences is available. STL random access iterators permit the use of either array index notation very similar to that in the expository version of the algorithm, or pointer notation as shown above for algorithm L, but with additional operations such as p + k. Although it is commonplace to use pointer notation for efficiency reasons, we avoid it in this case because the calculation of the large value cannot be guaranteed to be valid in pointer arithmetic. The advantage of the single-test skip loop outweighs any disadvantage due to array notation calculations.

The trait interface also allows the user to supply the hash function, but various useful default hash functions can be provided. The full details, including complete source code, are shown in an appendix. The code is available from http://www.cs.rpi.edu/~musser/gp.
The code supplied includes a set of operation counting components [Mu96] that permit easy gathering of statistics on many different kinds of operations, including data element accesses and comparisons, iterator operations, and "distance operations," which are arithmetic operations on integer results of iterator subtractions. These counts are obtained without modifying the source code of the algorithms at all, by specializing their type parameters with classes whose operations have counters built into them. Table 4 shows counts of data comparisons and other data accesses, iterator "big jumps" and other iterator operations, and distance operations. In each case the counts are divided by the number of characters searched. These statistics come from searches of the same English text, Through the Looking Glass, with the same selection of patterns, as discussed earlier. For ABM and TBM, not all operations were counted because the algorithms are from Hume and Sunday's original C code and therefore could not be specialized with the counting components. For these algorithms a manually instrumented version (supplied as part of the code distribution [HS91]) kept count of data comparisons and accesses.

The table shows that HAL, like ABM and TBM, does remarkably few equality comparison operations on sequence elements: only about 1 per 100 elements for the longer patterns, and no more than twice that for the shorter ones. They do access the elements substantially more often than that, in their respective skip loops, but still always sublinearly. With string matching, the comparisons and accesses are inexpensive, but in other applications of sequence matching they might cost substantially more than iterator or distance operations. In such applications the savings in execution time over SF or L could be even greater.
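The counting technique described above (specializing an algorithm's type parameters with a class whose operations carry counters) can be sketched in a few lines. This is not the counting component set of [Mu96]; CountedChar, counted, SearchStats, and counted_search are hypothetical names, only equality comparisons are counted, and the algorithm measured is the standard library's std::search rather than HAL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical element type: a char whose equality test counts its calls.
struct CountedChar {
    char value;
    static std::size_t comparisons;
};
std::size_t CountedChar::comparisons = 0;

inline bool operator==(const CountedChar& a, const CountedChar& b)
{
    ++CountedChar::comparisons;
    return a.value == b.value;
}

inline bool operator!=(const CountedChar& a, const CountedChar& b)
{
    return !(a == b);
}

inline std::vector<CountedChar> counted(const std::string& s)
{
    std::vector<CountedChar> result;
    for (std::size_t i = 0; i < s.size(); ++i) {
        CountedChar c = { s[i] };
        result.push_back(c);
    }
    return result;
}

struct SearchStats {
    std::size_t position;     // index of the match, or text size if none
    std::size_t comparisons;  // equality comparisons performed
};

// The search algorithm's source is untouched; only the element type changes.
inline SearchStats counted_search(const std::string& text,
                                  const std::string& pattern)
{
    const std::vector<CountedChar> t = counted(text);
    const std::vector<CountedChar> p = counted(pattern);
    CountedChar::comparisons = 0;
    std::vector<CountedChar>::const_iterator hit =
        std::search(t.begin(), t.end(), p.begin(), p.end());
    SearchStats stats = { static_cast<std::size_t>(hit - t.begin()),
                          CountedChar::comparisons };
    return stats;
}
```

Dividing the comparison count by the text length gives a per-character figure of the kind reported in Table 4.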
As an example of such costlier comparisons, an appendix shows one experiment in which the text of Through the Looking Glass was stored as a sequence of words, each word being a character string, and the patterns were word sequences of different lengths chosen from evenly spaced positions in the target word sequence. In this case, element comparisons were word comparisons, which could be significantly more costly than iterator or distance operations. HAL was again substantially faster than the other contestants, SF and L. The ABM and TBM algorithms from [HS91] were not considered because they are only applicable to string matching, but it was easy to specialize the three generic algorithms to this case of sequence matching, just by plugging in the appropriate types and, in the case of HAL, defining a suitable hash function. (We used a function that returns the first character of a word.)

8 How to Obtain the Appendices and Code

An expanded version of this paper, including appendices that contain and document the complete source code for all benchmark experiments described in the paper, will be maintained indefinitely for public access on the Internet at http://www.cs.rpi.edu/~musser/gp/. By downloading the Nuweb source file, gensearch.w, and using Briggs' Nuweb tool [Briggs],⁶ readers can also easily generate all of the source code described in the paper and appendices.

⁶ We started with a version of the Nuweb tool previously modified by Ramsdell and Mengel and made additional small changes in terminology in the LaTeX file the tool produces: "part" is used in place of "scrap" and "definition" in place of "macro." This version, called Nuweb 0.91, is available from http://www.cs.rpi.edu/~musser/gp/. The new version does not differ from previous versions in the way it produces code files from Nuweb source files.

9 Conclusion

When we began this research, our main goal was to develop a generic sequence search algorithm with a linear worst-case time bound and with better average case performance than KMP and SF, so that it could be used in generic software libraries such as the C++ Standard Template Library. We expected that for most of the useful special cases, such as English text or DNA substring matching, it would probably be better to provide separate algorithms tailored to those cases. It was therefore surprising to discover that for the substring matching problem itself a new, superior algorithm could be obtained by combining Boyer and Moore's skip loop with the Knuth-Morris-Pratt algorithm. By also developing a hashed version of the skip loop and providing for selection of different variants of the technique using traits, we obtained a generic algorithm, HAL, with all of the attributes we originally sought. Moreover, when specialized to the usual string matching cases of the most practical interest, such as English text matching and DNA string matching, the new algorithm beats most of the existing string matching algorithms.

Since HAL has a linear upper bound on the number of comparisons, it can be used even in mission-critical applications where the potential O(mn) behavior of the straightforward algorithm or Hume and Sunday's TBM algorithm would be a serious concern. In such applications, as well as in less-critical applications, HAL's performance in the average case is not only linear, but sublinear, beating even the best versions of the Boyer-Moore algorithm. Since we have provided it in a generic form (in particular, in the framework of the C++ Standard Template Library), the new algorithm is easily reusable in many different contexts.

Acknowledgement: This work was partially supported by a grant from IBM Corporation.

Table 4: Average Number of Operations Per Character in English Text Searches

Pattern  Algorithm  Comparisons  Other     Big    Other     Distance  Total
Size                             Accesses  Jumps  Iter Ops  Ops       Ops
2        SF         1.036        0.001     0.000  4.192     2.002     7.231
         L          1.028        0.001     0.000  4.095     0.177     5.301
         HAL        0.018        0.513     0.551  1.104     2.431     4.617
         ABM        0.017        0.528     -      -         -         -
         TBM        0.021        0.511     -      -         -         -
4        SF         1.034        0.000     0.000  4.170     2.000     7.203
         L          1.031        0.000     0.000  4.098     0.159     5.288
         HAL        0.013        0.266     0.291  0.583     0.658     1.811
         ABM        0.013        0.277     -      -         -         -
         TBM        0.014        0.266     -      -         -         -
6        SF         1.042        0.000     0.000  4.211     2.000     7.254
         L          1.037        0.000     0.000  4.119     0.194     5.350
         HAL        0.011        0.189     0.211  0.422     0.482     1.315
         ABM        0.012        0.198     -      -         -         -
         TBM        0.012        0.189     -      -         -         -
8        SF         1.048        0.000     0.000  4.243     2.000     7.291
         L          1.042        0.000     0.000  4.135     0.220     5.396
         HAL        0.010        0.150     0.170  0.339     0.392     1.060
         ABM        0.011        0.157     -      -         -         -
         TBM        0.011        0.150     -      -         -         -
10       SF         1.052        0.000     0.000  4.263     2.000     7.315
         L          1.044        0.000     0.000  4.142     0.233     5.418
         HAL        0.009        0.126     0.144  0.289     0.337     0.905
         ABM        0.010        0.132     -      -         -         -
         TBM        0.010        0.126     -      -         -         -
14       SF         1.077        0.000     0.000  4.384     2.000     7.460
         L          1.060        0.000     0.000  4.197     0.328     5.585
         HAL        0.010        0.105     0.125  0.250     0.305     0.796
         ABM        0.010        0.109     -      -         -         -
         TBM        0.011        0.105     -      -         -         -
18       SF         1.105        0.000     0.000  4.525     2.000     7.629
         L          1.077        0.000     0.000  4.257     0.436     5.770
         HAL        0.011        0.096     0.117  0.234     0.295     0.753
         ABM        0.010        0.099     -      -         -         -
         TBM        0.011        0.096     -      -         -         -

References

[BM77] R. Boyer and S. Moore. A fast string matching algorithm. CACM 20 (1977), 762–772.
[Briggs] P. Briggs. Nuweb, a simple literate programming tool, Version 0.87, 1989.
[Cole96] R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM Journal on Computing 5 (1994), 1075–1091.
[CGG90] L. Colussi, Z. Galil, R. Giancarlo.
On the Exact Complexity of String Matching. Proceedings of the Thirty First Annual IEEE Symposium on the Foundations of Computer Science, 1990, 135–143.
[DNAsource] H.S. Bilofsky, C. Burks. The GenBank(r) genetic sequence data bank. Nucl. Acids Res. 16 (1988), 1861–1864.
[Ga79] Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. CACM 22 (1979), 505–508.
[GS83] Z. Galil, J. Seiferas. Time space optimal string matching. JCSS 26 (1983), 280–294.
[DraftCPP] Accredited Standards Committee X3 (American National Standards Institute), Information Processing Systems. Working paper for draft proposed international standard for information systems - programming language C++. Doc No. X3J16/95-0185, WG21/N0785. [[Check for most recent version.]]
[GO77] L.J. Guibas, A.M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. Proc. 18th Ann. IEEE Symp. Foundations of Comp. Sci., 1977, 189–195.
[Horspool88] R.N. Horspool. Practical fast searching in strings. Soft.-Prac. and Exp. 10 (March 1980), 501–506.
[Hume88] A. Hume. A tale of two greps. Soft.-Prac. and Exp. 18 (November 1988), 1063–1072.
[HS91] A. Hume, S. Sunday. Fast string searching. Soft.-Prac. and Exp. 21 (November 1991), 1221–1248.
[Knuth84] D.E. Knuth. Literate programming. Computer Journal 27 (1984), 97–111.
[KMP77] D.E. Knuth, J. Morris, V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing 6 (1977), 323–350.
[Mu96] D.R. Musser. Measuring Computing Times and Operation Counts, http://www.cs.rpi.edu/musser/gp/timing.html.
[MS96] D.R. Musser, A. Saini. STL Tutorial and Reference Guide: C++ Programming with Standard Template Library. Addison-Wesley, Reading, MA, 1996.
[SGI96] Silicon Graphics Standard Template Library Programming Guide, online guide, http://www.sgi.com/Technology/STL/.
[Sm82] G.V. Smit. A comparison of three string matching algorithms. Soft.-Prac. and Exp. 12, 1 (Jan 1982), 57–66.
[StepanovLee] A.A. Stepanov, M. Lee. The Standard Template Library, Tech. Report HPL-94-34, April 1994, revised October 31, 1995.
[Su90] D.M. Sunday. A very fast substring search algorithm. CACM 33 (August 1990), 132–142.

A Tests of Expository Versions of the Algorithms

To help ensure against errors in the expository versions of the algorithms in this paper, we compiled them as part of several Ada test programs, using both the GNAT Ada 95 compiler, version 3.09, and the Aonix ObjectAda compiler, Special Edition, version 7.1.

A.1 Algorithm Declarations

We have not attempted to develop Ada generic subprograms based on the expository versions of the algorithms; instead we encapsulate them here with non-generic interfaces we can use in simple test programs based on string (Ada character array) searches.

⟨Sequence declarations 27a⟩ ≡
    type Character_Sequence is array(Integer range <>) of Character;
    type Integer_Sequence is array(Integer range <>) of Integer;
    type Skip_Sequence is array(Character range <>) of Integer;
Used in parts 31b, 35a, 36e.

⟨Algorithm subprogram declarations 27b⟩ ≡
    ⟨Define procedure to compute next table 30⟩
    ⟨Non-hashed algorithms 27c⟩
    ⟨Simple hash function declarations 29a⟩
    ⟨HAL declaration 29b⟩
Used in parts 31b, 35a, 36e.

⟨Non-hashed algorithms 27c⟩ ≡
    function KMP(text, pattern: Character_Sequence;
                 b, n, a, m: Integer) return Integer is
      pattern_size, j, k: Integer;
      next: Integer_Sequence(a .. m - 1);
    begin
      ⟨Compute next table 31a⟩
      ⟨Basic KMP 2⟩
    end KMP;

    function L(text, pattern: Character_Sequence;
               b, n, a, m: Integer) return Integer is
      pattern_size, j, k: Integer;
      next: Integer_Sequence(a .. m - 1);
    begin
      pattern_size := m - a;
      ⟨Compute next table 31a⟩
      ⟨Algorithm L, optimized linear pattern search 3a⟩
    end L;

    function SF(text, pattern: Character_Sequence;
                b, n, a, m: Integer) return Integer is
      pattern_size, j, k: Integer;
    begin
      pattern_size := m - a; k := b;
      ⟨Handle pattern size = 1 as a special case 3b⟩
      while k <= n - pattern_size loop
        ⟨Scan the text for a possible match 3c⟩
        ⟨Verify whether a match is possible at the position found 4a⟩
        k := k - (j - a) + 1;
      end loop;
      return n;
    end SF;

    function AL(text, pattern: Character_Sequence;
                b, n, a, m: Integer) return Integer is
      pattern_size, text_size, j, k, large, adjustment, mismatch_shift: Integer;
      next: Integer_Sequence(a .. m - 1);
      skip: Skip_Sequence(Character'Range);
    begin
      ⟨Accelerated Linear algorithm 7b⟩
    end AL;
Used in part 27b.

The following is a sample hash function definition that makes HAL essentially equivalent to AL.

⟨Simple hash function declarations 29a⟩ ≡
    subtype hash_range is Integer range 0..255;

    function hash(text: Character_Sequence; k: Integer) return hash_range;
    pragma inline(hash);

    function hash(text: Character_Sequence; k: Integer) return hash_range is
    begin
      return hash_range(character'pos(text(k)));
    end hash;

    suffix_size: constant Integer := 1;
Used in part 27b.

⟨HAL declaration 29b⟩ ≡
    function HAL(text, pattern: Character_Sequence;
                 b, n, a, m: Integer) return Integer is
      pattern_size, text_size, j, k, large, adjustment, mismatch_shift: Integer;
      next: Integer_Sequence(a .. m - 1);
      skip: Integer_Sequence(hash_range);
    begin
      ⟨Hashed Accelerated Linear algorithm 12b⟩
    end HAL;
Used in part 27b.
For comparison of HAL with other algorithms we also compose the following declarations:

⟨Additional algorithms 29c⟩ ≡
    function AL0(text, pattern: Character_Sequence;
                 b, n, a, m: Integer) return Integer is
      pattern_size, j, k, d, mismatch_shift: Integer;
      next: Integer_Sequence(a .. m - 1);
      skip: Skip_Sequence(Character'Range);
    begin
      ⟨Accelerated Linear algorithm, preliminary version 5c⟩
    end AL0;

    function SF1(text, pattern: Character_Sequence;
                 b, n, a, m: Integer) return Integer is
      pattern_size, j, k, k0: Integer;
    begin
      pattern_size := m - a;
      if n < m then
        return n;
      end if;
      j := a; k := b; k0 := k;
      while j /= m loop
        if text(k) /= pattern(j) then
          if k = n - pattern_size then
            return n;
          else
            k0 := k0 + 1; k := k0; j := a;
          end if;
        else
          k := k + 1; j := j + 1;
        end if;
      end loop;
      return k0;
    end SF1;

    function SF2(text, pattern: Character_Sequence;
                 b, n, a, m: Integer) return Integer is
      pattern_size, j, k, k0, n0: Integer;
    begin
      pattern_size := m - a;
      if n - b < pattern_size then
        return n;
      end if;
      j := a; k := b; k0 := k; n0 := n - b;
      while j /= m loop
        if text(k) = pattern(j) then
          k := k + 1; j := j + 1;
        else
          if n0 = pattern_size then
            return n;
          else
            k0 := k0 + 1; k := k0; j := a; n0 := n0 - 1;
          end if;
        end if;
      end loop;
      return k0;
    end SF2;
Used in parts 31b, 35a, 36e.

For computing the KMP next table we provide the following procedure and calling code:

⟨Define procedure to compute next table 30⟩ ≡
    procedure Compute_Next(pattern: Character_Sequence;
                           a, m: Integer;
                           next: out Integer_Sequence) is
      j: Integer := a;
      t: Integer := a - 1;
    begin
      next(a) := a - 1;
      while j < m - 1 loop
        while t >= a and then pattern(j) /= pattern(t) loop
          t := next(t);
        end loop;
        j := j + 1; t := t + 1;
        if pattern(j) = pattern(t) then
          next(j) := next(t);
        else
          next(j) := t;
        end if;
      end loop;
    end Compute_Next;
Used in part 27b.
⟨Compute next table 31a⟩ ≡
    Compute_Next(pattern, a, m, next);
Used in parts 5c, 7b, 12b, 27c.

A.2 Simple Tests

The first test program simply reads short test sequences from a file and reports the results of running the different search algorithms on them.

"Test_Search.adb" 31b ≡
    with Text_Io; use Text_Io;
    with Ada.Integer_Text_Io; use Ada.Integer_Text_Io;
    with Io_Exceptions;
    procedure Test_Search is
      ⟨Sequence declarations 27a⟩
      ⟨Variable declarations 31c⟩
      ⟨Algorithm subprogram declarations 27b⟩
      ⟨Additional algorithms 29c⟩
      ⟨Define procedure to read string into sequence 33b⟩
      ⟨Define procedure to output sequence 33c⟩
      ⟨Define algorithm enumeration type, names, and selector function 31e⟩
      ⟨Define Report procedure 33e⟩
    begin
      ⟨Set file small.txt as input file 31d⟩
      loop
        ⟨Read test sequences from file 32⟩
        ⟨Run tests and report results 33d⟩
      end loop;
    end Test_Search;

⟨Variable declarations 31c⟩ ≡
    Comment, S1, S2: Character_Sequence(1 .. 100);
    Base_Line, S1_Length, S2_Length, Last: Integer;
    File: Text_Io.File_Type;
Used in part 31b.

⟨Set file small.txt as input file 31d⟩ ≡
    Text_Io.Open(File, Text_IO.In_File, "small.txt");
    Text_Io.Set_Input(File);
Used in part 31b.

⟨Define algorithm enumeration type, names, and selector function 31e⟩ ≡
    type Algorithm_Enumeration is (Dummy, SF, SF1, SF2, L, AL, HAL);

    Algorithm_Names: array(Algorithm_Enumeration) of String(1 .. 17) :=
      ("selection code   ",
       "SF               ",
       "HP SF            ",
       "SGI SF           ",
       "L                ",
       "AL               ",
       "HAL              ");

    function Algorithm(k: Algorithm_Enumeration;
                       text, pattern: Character_Sequence;
                       b, n, a, m: Integer) return Integer is
    begin
      case k is
        when Dummy => return b;
        when SF => return SF(text, pattern, b, n, a, m);
        when SF1 => return SF1(text, pattern, b, n, a, m);
        when SF2 => return SF2(text, pattern, b, n, a, m);
        when L => return L(text, pattern, b, n, a, m);
        when AL => return AL(text, pattern, b, n, a, m);
        when HAL => return HAL(text, pattern, b, n, a, m);
      end case;
    end Algorithm;
Used in parts 31b, 35a, 36e.

Test sequences are expected to be found in a file named small.txt. Each test set is contained on three lines, the first line being a comment or blank, the second line containing the text string to be searched, and the third the pattern to search for.

⟨Read test sequences from file 32⟩ ≡
    exit when Text_Io.End_Of_File;
    Get(Comment, Last);
    Put(Comment, Last); New_Line;
    ⟨Check for unexpected end of file 33a⟩
    Get(S1, Last);
    ⟨Check for unexpected end of file 33a⟩
    Put("Text sequence: "); Put(S1, Last);
    S1_Length := Last;
    Get(S2, Last);
    Put("Pattern sequence: "); Put(S2, Last);
    S2_Length := Last;
Used in part 31b.

⟨Check for unexpected end of file 33a⟩ ≡
    if Text_Io.End_Of_File then
      Put_Line("**** Unexpected end of file."); New_Line;
      raise Program_Error;
    end if;
Used in part 32.

⟨Define procedure to read string into sequence 33b⟩ ≡
    procedure Get(S: out Character_Sequence; Last: out Integer) is
      Ch: Character;
      I: Integer := 0;
    begin
      while not Text_Io.End_Of_File loop
        Text_Io.Get_Immediate(Ch);
        I := I + 1;
        S(I) := Ch;
        exit when Text_Io.End_Of_Line;
      end loop;
      Last := I;
      Text_Io.Get_Immediate(Ch);
    end Get;
Used in part 31b.

⟨Define procedure to output sequence 33c⟩ ≡
    procedure Put(S: Character_Sequence; Last: Integer) is
    begin
      for I in 1 .. Last loop
        Put(S(I));
      end loop;
      New_Line;
    end Put;
Used in part 31b.

⟨Run tests and report results 33d⟩ ≡
    Base_Line := 0;
    for K in Algorithm_Enumeration'Succ(Algorithm_Enumeration'First) ..
             Algorithm_Enumeration'Last loop
      Put(" Using "); Put(Algorithm_Names(k)); New_Line;
      Report(K, S1, S2, 1, S1_Length + 1, 1, S2_Length + 1);
    end loop;
    New_Line;
Used in part 31b.

⟨Define Report procedure 33e⟩ ≡
    procedure Report(K: Algorithm_Enumeration;
                     S1, S2: Character_Sequence;
                     b, n, a, m: Integer) is
      P: Integer;
    begin
      P := Algorithm(K, S1, S2, b, n, a, m);
      Put(" String "); Put('"');
      ⟨Output S2 34a⟩
      if P = n then
        Put(" not found");
        New_Line;
      else
        Put('"'); Put(" found at position ");
        Put(P);
        New_Line;
      end if;
      if Base_Line = 0 then
        Base_Line := P - b;
      else
        if P - b /= Base_Line then
          Put("*****Incorrect result!"); New_Line;
        end if;
      end if;
    end Report;
Used in parts 31b, 35a.

⟨Output S2 34a⟩ ≡
    for I in a .. m - 1 loop
      Put(S2(I));
    end loop;
Used in part 33e.

Here are a few small tests.

"small.txt" 34b ≡
    #
    Now's the time for all good men and women to come to the aid of their country.
    time
    #
    Now's the time for all good men and women to come to the aid of their country.
    timid
    #
    Now's the time for all good men and women to come to the aid of their country.
    try.
    # The following example is from the KMP paper.
    babcbabcabcaabcabcabcacabc
    abcabcacab
    #
    aaaaaaabcabcadefg
    abcad
    #
    aaaaaaabcabcadefg
    ab

A.3 Large Tests

This Ada test program can read a long character sequence from a file and run extensive search tests on it. Patterns to search for, of a user-specified length, are selected from evenly-spaced positions in the long sequence.

"Test_Long_Search.adb" 35a ≡
    with Text_Io; use Text_Io;
    with Ada.Integer_Text_Io; use Ada.Integer_Text_Io;
    procedure Test_Long_Search is
      F: Integer;
      Number_Of_Tests: Integer;
      Pattern_Size: Integer;
      Increment: Integer;
      ⟨Sequence declarations 27a⟩
      ⟨Data declarations 35b⟩
      ⟨Algorithm subprogram declarations 27b⟩
      ⟨Additional algorithms 29c⟩
      ⟨Define algorithm enumeration type, names, and selector function 31e⟩
      ⟨Define Report procedure 33e⟩
      S2: Character_Sequence(0 .. 100);
    begin
      ⟨Read test parameters 35d⟩
      ⟨Set file long.txt as input file 35c⟩
      ⟨Read character sequence from file 36a⟩
      Increment := (S1_Length - S2_Length) / Number_Of_Tests;
      ⟨Run tests searching for selected subsequences 36b⟩
    end Test_Long_Search;

⟨Data declarations 35b⟩ ≡
    Max_Size: constant Integer := 200_000;
    C: Character;
    S1: Character_Sequence(0 .. Max_Size);
    Base_Line, I, S1_Length, S2_Length: Integer;
    File: Text_Io.File_Type;
Used in parts 35a, 36e.

⟨Set file long.txt as input file 35c⟩ ≡
    Text_Io.Open(File, Text_IO.In_File, "long.txt");
    Text_Io.Set_Input(File);
Used in parts 35a, 36e.

⟨Read test parameters 35d⟩ ≡
    Put("Input Number of tests and pattern size: "); Text_Io.Flush;
    Get(Number_Of_Tests);
    Get(Pattern_Size);
    New_Line; Put("Number of tests: "); Put(Number_Of_Tests); New_Line;
    Put("Pattern size: "); Put(Pattern_Size); New_Line;
    S2_Length := Pattern_Size;
Used in parts 35a, 36e.

⟨Read character sequence from file 36a⟩ ≡
    I := 0;
    while not Text_Io.End_Of_File loop
      Text_Io.Get_Immediate(C);
      S1(I) := C;
      I := I + 1;
    end loop;
    S1_Length := I;
    Put(S1_Length); Put(" characters read."); New_Line;
Used in parts 35a, 36e.

⟨Run tests searching for selected subsequences 36b⟩ ≡
    F := 0;
    for K in 1 .. Number_Of_Tests loop
      ⟨Select sequence S2 to search for in S1 36c⟩
      ⟨Run tests 36d⟩
    end loop;
Used in part 35a.

⟨Select sequence S2 to search for in S1 36c⟩ ≡
    for I in 0 ..
Patter n_Siz e - 1 loop S2(I) := S1(F + I); end loop; F := F + Increm ent; Used in parts 36b, 37b. h Run tests 36d i ≡ Base_L ine := 0 ; for K in Algori thm_E numeration’Succ(Algorithm_Enumeration’First) .. Algori thm_E numeration’Last loop Put(" Using "); Put(Al gorith m_Names(k)); Ne w_Lin e; Report (K, S1, S2, 0, S1_Len gth, 0, S 2_Len gth); end loop; New_Li ne; Used in part 36b. A.4 Timed T ests This Ada test program reads a c h aracter sequence from a ﬁle and times searc h es for selected strings. "Time_ Long_ Searc h.adb" 36e ≡ with Text_ Io; use Text_Io ; with Ada.I nteger _Text_Io; use Ada.In teger _Text_Io; with Ada.R eal_Ti me; proced ure Ti me_Lo ng_Se arch i s use Ada.Re al_Tim e; 36 packag e M y_Flo at is n ew Text _IO.F loat_I O(Long_Float); Base_T ime: L ong_F loat; Number _Of_T ests, Patte rn_Si ze, Incr ement: Intege r; pragma Suppre ss(Al l_Checks); h Sequence declara tions 27a i h Algorithm subprogra m declara tions 27b i h Additional algor ithms 29c i h Deﬁne algorithm enumeration type, names, and selector function 31e i h Data dec la rations 35b i h Deﬁne run pro cedure 37b i begin h Read tes t pa rameters 35d i h Set ﬁle long.txt as input ﬁle 35c i h Read character sequence from ﬁle 36a i Increm ent := ( S1_Le ngth - S2_Le ngth) / Numbe r_Of_ Tests; Base_T ime := 0 .0; h Run and time tests s earching for selected subsequences 37a i end Time_L ong_Se arch; h Run and time tests searching for selected subsequences 37a i ≡ for K in Algori thm_E numeration’Range l oop Put("T iming "); Put(Al gorith m_Names(K)); New_L ine; Run(K, S1, S1_Leng th, S2_Length); end loop; New_Li ne; Used in part 36e. F or a giv en algorithm, the Run pro cedure condu cts a requested num b er of searc h es in s equ ence S1 for patterns of a requested size , selecti ng the patterns from eve nly spaced p ositions in S1 . It rep orts the total searc h length, time tak en, and sp eed (total search length d ivided by time tak en) of the searc hes. 
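As a rough illustration of the statistics that Run reports, the following C++ sketch computes the same quantities under stated assumptions: each search adds its match position plus the pattern size to the total search length, the overhead measured for the Dummy entry is subtracted from the elapsed time, and speed is the total length in millions of elements divided by the adjusted time. The names `search_stats`, `record_search`, and `speed_mbytes_per_second` are hypothetical and do not appear in the original programs; the guard against a non-positive time is also an addition of this sketch.

```cpp
// Hypothetical accumulator mirroring Total_Search and Base_Time.
struct search_stats {
    long total_search = 0;   // sum of (match position + pattern size)
    double base_time = 0.0;  // loop overhead measured with the Dummy entry
};

// Record one search: the distance scanned is taken to be the match
// position plus the pattern size, as in the Run procedure.
inline void record_search(search_stats& s, long match_position, long pattern_size) {
    s.total_search += match_position + pattern_size;
}

// Speed in millions of elements per second, after subtracting the
// measured overhead. Returns 0.0 when the adjusted time is not positive
// (an assumption of this sketch, to avoid dividing by zero).
inline double speed_mbytes_per_second(const search_stats& s, double elapsed_seconds) {
    const double time_taken = elapsed_seconds - s.base_time;
    if (time_taken <= 0.0) return 0.0;
    return static_cast<double>(s.total_search) / 1000000.0 / time_taken;
}
```

Subtracting the Dummy overhead keeps the cost of selecting patterns and looping out of the reported speed, so that the figures compare the search algorithms themselves.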
⟨Define run procedure 37b⟩ ≡
    procedure Run(K: Algorithm_Enumeration;
                  S1: Character_Sequence; Text_Size, Pattern_Size: Integer) is
      P, F: Integer;
      Start_Time, Finish_Time: Time;
      Total_Search: Integer;
      Time_Taken: Long_Float;
      S2: Character_Sequence(0 .. Pattern_Size - 1);
    begin
      F := 0;
      Total_Search := 0;
      Start_Time := Clock;
      for I in 1 .. Number_Of_Tests loop
        ⟨Select sequence S2 to search for in S1 36c⟩
        P := Algorithm(K, S1, S2, 0, Text_Size, 0, Pattern_Size);
        Total_Search := Total_Search + P + Pattern_Size;
      end loop;
      Finish_Time := Clock;
      ⟨Output statistics 38a⟩
    end Run;
Used in part 36e.

⟨Output statistics 38a⟩ ≡
    Time_Taken := Long_Float((Finish_Time - Start_Time) / Milliseconds(1))
                  / 1000.0 - Base_Time;
    Put("Total search length: ");
    Put(Total_Search); Put(" bytes."); New_Line;
    Put("Time: "); My_Float.Put(Time_Taken, 5, 4, 0);
    Put(" seconds."); New_Line;
    if K /= Dummy then
      Put("Speed: ");
      My_Float.Put(Long_Float(Total_Search) / 1_000_000.0 / Time_Taken, 5, 2, 0);
      Put(" MBytes/second."); New_Line;
    else
      Base_Time := Time_Taken;
    end if;
    New_Line;
Used in part 37b.

B C++ Library Versions and Test Programs

The code presented in this section is packaged in files that can be added to the standard C++ library and included in user programs with #include directives. (A few adjustments may be necessary depending on how well the target compiler conforms to the C++ standard.) With only minor changes, library maintainers should be able to incorporate the code into the standard library header files, replacing whatever search implementations they currently contain. The only significant work involved would be to construct the predicate versions of the search functions, which are not given here.
B.1 Generic Library Interfaces

B.1.1 Library Files

"new_search.h" 38b ≡
    #ifndef NEW_SEARCH
    # define NEW_SEARCH
    # include
    # include "search_traits.h"
    # include
    using namespace std;
    ⟨Define procedure to compute next table (C++) 43d⟩
    ⟨Define procedure to compute next table (C++ forward) 20b⟩
    ⟨User level search function 19a⟩
    ⟨Forward iterator case 19b⟩
    ⟨Bidirectional iterator case 40b⟩
    ⟨HAL with random access iterators, no trait passed 41a⟩
    ⟨User level search function with trait argument 41b⟩
    #endif

B.1.2 Search Traits

"search_traits.h" 39a ≡
    #ifndef SEARCH_HASH_TRAITS
    # define SEARCH_HASH_TRAITS
    ⟨Generic search trait 39b⟩
    ⟨Search traits for character sequences 40a⟩
    #endif

The generic search trait class is used when there is no search trait specifically defined, either in the library or by the user, for the type of values in the sequences being searched, and when no search trait is explicitly passed to the search function.

⟨Generic search trait 39b⟩ ≡
    template <class T>
    struct search_trait {
      enum {hash_range_max = 0};
      enum {suffix_size = 0};
      template <class RandomAccessIterator>
      inline static unsigned int hash(RandomAccessIterator i) {
        return 0;
      }
    };
Used in part 39a.

The "hash" function used in this trait maps everything to 0; it would be a source of poor performance if it were actually used in the HAL algorithm. In fact it is not, because the code in the search function checks for suffix_size = 0 and uses algorithm L in that case. This definition of hash permits compilation to succeed even if the compiler fails to recognize that the code segment containing the call of hash is dead code.
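The fallback described above can be sketched in isolation. The following is a minimal illustration, not the library code: `demo_search_trait` and `algorithm_for_element_type` are hypothetical names, and the function only reports which algorithm the test on `suffix_size` would select rather than running a search.

```cpp
#include <cstddef>
#include <string>

// Generic trait: no usable hash, signalled by suffix_size == 0.
template <class T>
struct demo_search_trait {
    enum { hash_range_max = 0 };
    enum { suffix_size = 0 };
    template <class RandomAccessIterator>
    static unsigned int hash(RandomAccessIterator) { return 0; }
};

// Specialized trait for unsigned char: hash on the last character.
template <>
struct demo_search_trait<unsigned char> {
    enum { hash_range_max = 256 };
    enum { suffix_size = 1 };
    template <class RandomAccessIterator>
    static unsigned int hash(RandomAccessIterator i) { return *i; }
};

// Reports which algorithm the suffix_size test would select for
// element type T and the given pattern size.
template <class T>
std::string algorithm_for_element_type(std::size_t pattern_size) {
    typedef demo_search_trait<T> Trait;
    if (Trait::suffix_size == 0 ||
        pattern_size < static_cast<std::size_t>(Trait::suffix_size))
        return "L";    // no hash available, or pattern shorter than the suffix
    return "HAL";
}
```

Because the test involves only compile-time constants of the trait, a compiler can usually remove the unused branch, which is why the generic `hash` never needs to be efficient.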
For traditional string searches, the following specialized search traits are provided:

⟨Search traits for character sequences 40a⟩ ≡
    template <> struct search_trait {
      enum {hash_range_max = 256};
      enum {suffix_size = 1};
      template <class RandomAccessIterator>
      inline static unsigned int hash(RandomAccessIterator i) {
        return *i;
      }
    };

    typedef unsigned char unsigned_char;
    template <> struct search_trait {
      enum {hash_range_max = 256};
      enum {suffix_size = 1};
      template <class RandomAccessIterator>
      inline static unsigned int hash(RandomAccessIterator i) {
        return *i;
      }
    };
Used in part 39a.

B.1.3 Search Functions

The main user-level search function interface and an auxiliary function __search_L for the forward iterator case were given in the body of the paper. With bidirectional iterators we again use the forward iterator version.

⟨Bidirectional iterator case 40b⟩ ≡
    template <class BidirectionalIterator1, class BidirectionalIterator2>
    inline BidirectionalIterator1 __search(BidirectionalIterator1 text,
                                           BidirectionalIterator1 textEnd,
                                           BidirectionalIterator2 pattern,
                                           BidirectionalIterator2 patternEnd,
                                           bidirectional_iterator_tag)
    {
      return __search_L(text, textEnd, pattern, patternEnd);
    }
Used in part 38b.

When we have random access iterators and no search trait is passed as an argument, we use a search trait associated with V = RandomAccessIterator1::value_type to obtain the hash function and related parameters. Then we use the user-level search function that takes a search trait argument and uses HAL. If no search trait has been specifically defined for type V, then the generic search_hash_trait is used, causing the search_hashed algorithm to resort to algorithm L.
⟨HAL with random access iterators, no trait passed 41a⟩ ≡
    template <class RandomAccessIterator1, class RandomAccessIterator2>
    inline RandomAccessIterator1 __search(RandomAccessIterator1 text,
                                          RandomAccessIterator1 textEnd,
                                          RandomAccessIterator2 pattern,
                                          RandomAccessIterator2 patternEnd,
                                          random_access_iterator_tag)
    {
      typedef typename iterator_traits<RandomAccessIterator1>::value_type V;
      typedef search_trait<V> Trait;
      return search_hashed(text, textEnd, pattern, patternEnd,
                           static_cast<Trait*>(0));
    }
Used in part 38b.

Finally, we have a user-level search function for the case of random access iterators and an explicitly passed search trait.

⟨User level search function with trait argument 41b⟩ ≡
    template <class RandomAccessIterator1, class RandomAccessIterator2, class Trait>
    RandomAccessIterator1 search_hashed(RandomAccessIterator1 text,
                                        RandomAccessIterator1 textEnd,
                                        RandomAccessIterator2 pattern,
                                        RandomAccessIterator2 patternEnd,
                                        Trait*)
    {
      typedef typename iterator_traits<RandomAccessIterator1>::difference_type Distance1;
      typedef typename iterator_traits<RandomAccessIterator2>::difference_type Distance2;
      if (pattern == patternEnd) return text;
      Distance2 pattern_size, j, m;
      pattern_size = patternEnd - pattern;
      if (Trait::suffix_size == 0 || pattern_size < Trait::suffix_size)
        return __search_L(text, textEnd, pattern, patternEnd);
      Distance1 i, k, large, adjustment, mismatch_shift, text_size;
      vector<Distance1> next, skip;
      ⟨Hashed Accelerated Linear algorithm (C++) 42a⟩
    }
Used in part 38b.

The C++ version of HAL is built from parts corresponding to those expressed in Ada in the body of the paper. Note that in place of text(n + k) we can write textEnd + k for the location and textEnd[k] for the value at that location.
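The indexing convention in the note above can be tried on its own. In this small sketch, `value_at_offset_from_end` and `count_char` are hypothetical helpers; they assume a contiguous character buffer and a negative offset k in the range -n to -1, where n is the text length.

```cpp
#include <cstddef>
#include <string>

// Returns textEnd[k], the element k positions from the end of the text
// (k is negative: -1 is the last element, -n the first).
inline char value_at_offset_from_end(const std::string& text, std::ptrdiff_t k) {
    const char* textEnd = text.data() + text.size();
    return textEnd[k];
}

// Scans the whole text with a counter that runs from -n up to 0, so the
// end-of-text test is a comparison against zero, as in the HAL loops.
inline std::size_t count_char(const std::string& text, char c) {
    const char* textEnd = text.data() + text.size();
    std::size_t count = 0;
    for (std::ptrdiff_t k = -static_cast<std::ptrdiff_t>(text.size()); k < 0; ++k)
        if (textEnd[k] == c) ++count;
    return count;
}
```

Keeping k relative to textEnd is what lets the skip loop in the next chunk detect the end of the text with the single test k < 0.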
⟨Hashed Accelerated Linear algorithm (C++) 42a⟩ ≡
    k = 0;
    text_size = textEnd - text;
    ⟨Compute next table (C++) 44a⟩
    ⟨Handle pattern_size = 1 as a special case (C++) 21b⟩
    ⟨Compute skip table and mismatch shift using the hash function (C++) 43c⟩
    large = text_size + 1;
    adjustment = large + pattern_size - 1;
    skip[Trait::hash(pattern + pattern_size - 1)] = large;
    k -= text_size;
    for (;;) {
      k += pattern_size - 1;
      if (k >= 0) break;
      ⟨Scan the text using a single-test skip loop with hashing (C++) 42b⟩
      ⟨Verify match or recover from mismatch (C++) 42c⟩
    }
    return textEnd;
Used in part 41b.

⟨Scan the text using a single-test skip loop with hashing (C++) 42b⟩ ≡
    do {
      k += skip[Trait::hash(textEnd + k)];
    } while (k < 0);
    if (k < pattern_size)
      return textEnd;
    k -= adjustment;
Used in part 42a.

⟨Verify match or recover from mismatch (C++) 42c⟩ ≡
    if (textEnd[k] != pattern[0])
      k += mismatch_shift;
    else {
      ⟨Verify the match for positions 1 through pattern_size - 1 (C++) 43a⟩
      if (mismatch_shift > j)
        k += mismatch_shift - j;
      else
        ⟨Recover from a mismatch using the next table (C++) 43b⟩
    }
Used in parts 42a, 46a.

⟨Verify the match for positions 1 through pattern_size - 1 (C++) 43a⟩ ≡
    j = 1;
    for (;;) {
      ++k;
      if (textEnd[k] != pattern[j])
        break;
      ++j;
      if (j == pattern_size)
        return textEnd + k - pattern_size + 1;
    }
Used in part 42c.

⟨Recover from a mismatch using the next table (C++) 43b⟩ ≡
    for (;;) {
      j = next[j];
      if (j < 0) {
        ++k;
        break;
      }
      if (j == 0)
        break;
      while (textEnd[k] == pattern[j]) {
        ++k; ++j;
        if (j == pattern_size) {
          return textEnd + k - pattern_size;
        }
        if (k == 0)
          return textEnd;
      }
    }
Used in part 42c.
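To see how the `large` sentinel lets the skip loop run with a single test, the following simplified sketch may help. It is not the HAL algorithm itself: it assumes `std::string` inputs, uses the character value directly as the table index (no trait or multi-character hash), and replaces the next-table recovery with a plain comparison followed by a shift of one position. The name `skip_loop_find` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first occurrence of pattern in text,
// or text.size() if there is none.
std::size_t skip_loop_find(const std::string& text, const std::string& pattern) {
    typedef std::ptrdiff_t diff;
    const diff n = static_cast<diff>(text.size());
    const diff m = static_cast<diff>(pattern.size());
    if (m == 0) return 0;
    if (m > n) return text.size();

    const diff large = n + 1;               // sentinel: larger than any real skip
    const diff adjustment = large + m - 1;  // undoes the sentinel and the m - 1 offset
    std::vector<diff> skip(256, m);         // default: shift by the whole pattern
    for (diff j = 0; j < m - 1; ++j)
        skip[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    skip[static_cast<unsigned char>(pattern[m - 1])] = large;

    const char* textEnd = text.data() + n;
    diff k = -n;                            // offset of the alignment start from textEnd
    for (;;) {
        k += m - 1;                         // move to the last position of the alignment
        if (k >= 0) break;                  // alignment no longer fits in the text
        do {                                // single-test skip loop
            k += skip[static_cast<unsigned char>(textEnd[k])];
        } while (k < 0);
        if (k < m) return text.size();      // ran off the end without hitting the sentinel
        k -= adjustment;                    // back to the start of the candidate alignment
        if (std::equal(pattern.begin(), pattern.end(), textEnd + k))
            return static_cast<std::size_t>(n + k);
        ++k;                                // simplification: shift by one and rescan
    }
    return text.size();
}
```

A hit on the pattern's last character adds `large` to k and pushes it to at least `pattern_size`, while running off the end of the text leaves k below `pattern_size`; that is why one comparison after the loop is enough to tell the two cases apart.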
B.1.4 Skip Table Computation

⟨Compute skip table and mismatch shift using the hash function (C++) 43c⟩ ≡
    m = next.size();
    for (i = 0; i < Trait::hash_range_max; ++i)
      skip.push_back(m - Trait::suffix_size + 1);
    for (j = Trait::suffix_size - 1; j < m - 1; ++j)
      skip[Trait::hash(pattern + j)] = m - 1 - j;
    mismatch_shift = skip[Trait::hash(pattern + m - 1)];
    skip[Trait::hash(pattern + m - 1)] = 0;
Used in part 42a.

B.1.5 Next Table Procedure and Call

When we have random access to the pattern, we take advantage of it in computing the next table (we do not need to create the pattern_iterator table used in the forward iterator version).

⟨Define procedure to compute next table (C++) 43d⟩ ≡
    template <class RandomAccessIterator, class Distance>
    void compute_next(RandomAccessIterator pattern,
                      RandomAccessIterator patternEnd,
                      vector<Distance>& next)
    {
      Distance pattern_size = patternEnd - pattern, j = 0, t = -1;
      next.reserve(32);
      next.push_back(-1);
      while (j < pattern_size - 1) {
        while (t >= 0 && pattern[j] != pattern[t])
          t = next[t];
        ++j; ++t;
        if (pattern[j] == pattern[t])
          next.push_back(next[t]);
        else
          next.push_back(t);
      }
    }
Used in part 38b.

⟨Compute next table (C++) 44a⟩ ≡
    compute_next(pattern, patternEnd, next);
Used in parts 42a, 46a.

B.2 Experimental Version for Large Alphabet Case

For comparison with HAL in the large alphabet case we also implemented the experimental version that uses a large skip table and no hashing, as described in the body of the paper.

"experimental_search.h" 44b ≡
    ⟨Experimental search function with skip loop without hashing 44c⟩

In our experiments, we assume that the element type is a 2-byte unsigned short.
⟨Experimental search function with skip loop without hashing 44c⟩ ≡
    #include
    using namespace std;

    struct large_alphabet_trait {
      typedef unsigned short T;
      enum {suffix_size = 1};
      enum {hash_range_max = (1u << (sizeof(T) * 8)) - 1};
    };

    template <> struct search_trait {
      enum {hash_range_max = 256};
      enum {suffix_size = 1};
      template <class RandomAccessIterator>
      inline static unsigned int hash(RandomAccessIterator i) {
        return (unsigned char)(*i);
      }
    };

    template <class T>
    class skewed_value {
      static T skew;
      T value;
    public:
      skewed_value() : value(0) {}
      skewed_value(T val) : value(val - skew) {}
      operator T () { return value + skew; }
      static void setSkew(T askew) { skew = askew; }
      void clear() { value = 0; }
    };

    template <class T> T skewed_value<T>::skew;

    template class skewed_array {
      typedef skewed_value<T> value_type;
      static value_type array[size];
      RandomAccessIterator pattern, patternEnd;
    public:
      skewed_array(T skew, RandomAccessIterator pat, RandomAccessIterator patEnd):
        pattern(pat), patternEnd(patEnd) { value_type::setSkew(skew); }
      ~skewed_array() {
        while (pattern != patternEnd)
          array[*pattern++].clear();
      }
      value_type operator[](int index) const { return array[index]; }
      value_type& operator[](int index) { return array[index]; }
    };

    template skewed_value skewed_array::array[size];

    template <class RandomAccessIterator1, class RandomAccessIterator2>
    RandomAccessIterator1 search_no_hashing(RandomAccessIterator1 text,
                                            RandomAccessIterator1 textEnd,
                                            RandomAccessIterator2 pattern,
                                            RandomAccessIterator2 patternEnd)
    {
      typedef typename iterator_traits<RandomAccessIterator1>::difference_type Distance1;
      typedef typename iterator_traits<RandomAccessIterator2>::difference_type Distance2;
      typedef large_alphabet_trait Trait;
      if (pattern == patternEnd) return text;
      Distance1 k, text_size, large, adjustment, mismatch_shift;
      Distance2 j, m, pattern_size;
      pattern_size = patternEnd - pattern;
      if (pattern_size < Trait::suffix_size)
        return __search_L(text, textEnd, pattern, patternEnd);
      vector<Distance1> next;
      skewed_array skip(pattern_size - Trait::suffix_size + 1, pattern, patternEnd);
      ⟨Accelerated Linear algorithm, no hashing (C++) 46a⟩
    }
Used in part 44b.

⟨Accelerated Linear algorithm, no hashing (C++) 46a⟩ ≡
    k = 0;
    text_size = textEnd - text;
    ⟨Compute next table (C++) 44a⟩
    ⟨Handle pattern_size = 1 as a special case (C++) 21b⟩
    ⟨Compute skip table and mismatch shift, no hashing (C++) 46b⟩
    large = text_size + 1;
    adjustment = large + pattern_size - 1;
    skip[*(pattern + m - 1)] = large;
    k -= text_size;
    for (;;) {
      k += pattern_size - 1;
      if (k >= 0) break;
      ⟨Scan the text using a single-test skip loop, no hashing (C++) 46c⟩
      ⟨Verify match or recover from mismatch (C++) 42c⟩
    }
    return textEnd;
Used in part 44c.

⟨Compute skip table and mismatch shift, no hashing (C++) 46b⟩ ≡
    m = next.size();
    for (j = Trait::suffix_size - 1; j < m - 1; ++j)
      skip[*(pattern + j)] = m - 1 - j;
    mismatch_shift = skip[*(pattern + m - 1)];
    skip[*(pattern + m - 1)] = 0;
Used in part 46a.
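The skip table and mismatch shift computed above can be checked on a small pattern. The sketch below makes simplifying assumptions: a `std::string` pattern, a plain 256-entry vector in place of the skewed array, and suffix_size = 1, so that every entry starts at the pattern size. `skip_info` and `make_skip_table` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct skip_info {
    std::vector<long> skip;   // indexed directly by character value
    long mismatch_shift;      // shift to apply after a failed verification
};

// Precondition: pattern is not empty.
skip_info make_skip_table(const std::string& pattern) {
    const long m = static_cast<long>(pattern.size());
    skip_info info;
    info.skip.assign(256, m);                          // default shift: whole pattern
    for (long j = 0; j < m - 1; ++j)                   // later occurrences overwrite earlier ones
        info.skip[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    const unsigned char last = static_cast<unsigned char>(pattern[m - 1]);
    info.mismatch_shift = info.skip[last];             // distance to the previous occurrence of the last character
    info.skip[last] = 0;                               // the search later stores `large` here
    return info;
}
```

For the pattern "abcab", for example, the entry for 'a' is 1, for 'c' is 2, for any character not in the pattern is 5, and the mismatch shift is 3 because the previous 'b' lies three positions before the final one.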
⟨Scan the text using a single-test skip loop, no hashing (C++) 46c⟩ ≡
    do {
      k += skip[*(textEnd + k)];
    } while (k < 0);
    if (k < pattern_size)
      return textEnd;
    k -= adjustment;
Used in part 46a.

B.3 DNA Search Functions and Traits

The following definitions are for use in DNA search experiments. Four different search functions are defined using 2, 3, 4, or 5 characters as arguments to hash functions.

"DNA_search.h" 47a ≡
    ⟨Define DNA search traits 47b⟩

    template <class RandomAccessIterator1, class RandomAccessIterator2>
    inline RandomAccessIterator1 hal2(RandomAccessIterator1 text,
                                      RandomAccessIterator1 textEnd,
                                      RandomAccessIterator2 pattern,
                                      RandomAccessIterator2 patternEnd)
    {
      return search_hashed(text, textEnd, pattern, patternEnd,
                           static_cast<search_trait_dna2*>(0));
    }

    template <class RandomAccessIterator1, class RandomAccessIterator2>
    inline RandomAccessIterator1 hal3(RandomAccessIterator1 text,
                                      RandomAccessIterator1 textEnd,
                                      RandomAccessIterator2 pattern,
                                      RandomAccessIterator2 patternEnd)
    {
      return search_hashed(text, textEnd, pattern, patternEnd,
                           static_cast<search_trait_dna3*>(0));
    }

    template <class RandomAccessIterator1, class RandomAccessIterator2>
    inline RandomAccessIterator1 hal4(RandomAccessIterator1 text,
                                      RandomAccessIterator1 textEnd,
                                      RandomAccessIterator2 pattern,
                                      RandomAccessIterator2 patternEnd)
    {
      return search_hashed(text, textEnd, pattern, patternEnd,
                           static_cast<search_trait_dna4*>(0));
    }

    template <class RandomAccessIterator1, class RandomAccessIterator2>
    inline RandomAccessIterator1 hal5(RandomAccessIterator1 text,
                                      RandomAccessIterator1 textEnd,
                                      RandomAccessIterator2 pattern,
                                      RandomAccessIterator2 patternEnd)
    {
      return search_hashed(text, textEnd, pattern, patternEnd,
                           static_cast<search_trait_dna5*>(0));
    }

⟨Define DNA search traits 47b⟩ ≡
    struct search_trait_dna2 {
      enum {hash_range_max = 64};
      enum {suffix_size = 2};
      template <class RAI>
      inline static unsigned int hash(RAI i) {
        return (*(i-1) + ((*i) << 3)) & 63;
      }
    };

    struct search_trait_dna3 {
      enum {hash_range_max = 512};
      enum {suffix_size = 3};
      template <class RAI>
      inline static unsigned int hash(RAI i) {
        return (*(i-2) + (*(i-1) << 3) + ((*i) << 6)) & 511;
      }
    };

    struct search_trait_dna4 {
      enum {hash_range_max = 256};
      enum {suffix_size = 4};
      template <class RAI>
      inline static unsigned int hash(RAI i) {
        return (*(i-3) + (*(i-2) << 2) + (*(i-1) << 4)
                + ((*i) << 6)) & 255;
      }
    };

    struct search_trait_dna5 {
      enum {hash_range_max = 256};
      enum {suffix_size = 5};
      template <class RAI>
      inline static unsigned int hash(RAI i) {
        return (*(i-4) + (*(i-3) << 2) + (*(i-2) << 4)
                + (*(i-1) << 6) + ((*i) << 8)) & 255;
      }
    };
Used in part 47a.

B.4 Simple Tests

In the test programs we want to compare the new search functions with the existing search function from an STL algorithm library implementation, so we rename the existing one.

⟨Include algorithms header with existing search function renamed 48⟩ ≡
    #define search stl_search
    #define __search __stl_search
    #include
    #undef search
    #undef __search
Used in parts 49a, 52, 54f, 57b, 59b, 63b, 65b.

As in the Ada version of the code, the first test program simply reads short test sequences from a file and reports the results of running the different search algorithms on them.

"test_search.cpp" 49a ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include
    #include
    #include "new_search.h"
    #include "hume.hh"
    #include "DNA_search.h"
    using namespace std;
    int Base_Line;
    ⟨Define procedure to read string into sequence (C++) 51b⟩
    typedef unsigned char data;
    ⟨Define algorithm enumeration type, names, and selector function (C++) 49b⟩
    ⟨Define Report procedure (C++) 51d⟩
    int main()
    {
      ostream_iterator out(cout, "");
      ifstream ifs("small.txt");
      vector<data> Comment, S1, S2;
      const char* separator = "";
      for (;;) {
        ⟨Read test sequences from file (C++) 50⟩
        ⟨Run tests and report results (C++) 51c⟩
      }
      return 0;
    }

⟨Define algorithm enumeration type, names, and selector function (C++) 49b⟩ ≡
    enum algorithm_enumeration {
      Dummy, SF, L, HAL, ABM, TBM, GBM, HAL2, HAL3, HAL4, HAL5
    };
    const char* algorithm_names[] = {
      "selection code", "SF", "L", "HAL", "ABM", "TBM", "GBM",
      "HAL2", "HAL3", "HAL4", "HAL5"
    };
    #ifndef DNA_TEST
      algorithm_enumeration alg[] = {Dummy, SF, L, HAL, ABM, TBM};
      const char textFileName[] = "long.txt";
      const char wordFileName[] = "words.txt";
    #else
      algorithm_enumeration alg[] = {Dummy, SF, L, HAL, ABM, GBM,
                                     HAL2, HAL3, HAL4, HAL5};
      const char textFileName[] = "dnatext.txt";
      const char wordFileName[] = "dnaword.txt";
    #endif
    const int number_of_algorithms = sizeof(alg)/sizeof(alg[0]);

    template <class Container>
    inline void
    Algorithm(int k, const Container& x, const Container& y,
              Container__const_iterator& result)
    {
      switch (alg[k]) {
      case Dummy:
        // does nothing, used for timing overhead of test loop
        result = x.begin(); return;
      case SF:
        result = stl_search(x.begin(), x.end(), y.begin(), y.end()); return;
      case L:
        result = __search_L(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL:
        result = search(x.begin(), x.end(), y.begin(), y.end()); return;
      case ABM:
        result = fbm(x.begin(), x.end(), y.begin(), y.end()); return;
      case TBM:
        result = hume(x.begin(), x.end(), y.begin(), y.end()); return;
      case GBM:
        result = gdbm(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL2:
        result = hal2(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL3:
        result = hal3(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL4:
        result = hal4(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL5:
        result = hal5(x.begin(), x.end(), y.begin(), y.end()); return;
      }
      result = x.begin(); return;
    }
Used in parts 49a, 52, 54f.

⟨Read test sequences from file (C++) 50⟩ ≡
    get(ifs, Comment);
    if (ifs.eof())
      break;
    copy(Comment.begin(), Comment.end(), out); cout << endl;
    get(ifs, S1);
    ⟨Check for unexpected end of file (C++) 51a⟩
    cout << "Text string:......";
    copy(S1.begin(), S1.end(), out);
    cout << endl;
    get(ifs, S2);
    ⟨Check for unexpected end of file (C++) 51a⟩
    cout << "Pattern string:...";
    copy(S2.begin(), S2.end(), out); cout << endl;
Used in part 49a.

⟨Check for unexpected end of file (C++) 51a⟩ ≡
    if (ifs.eof()) {
      cout << "**** Unexpected end of file." << endl;
      exit(1);
    }
Used in part 50.

⟨Define procedure to read string into sequence (C++) 51b⟩ ≡
    template <class Container>
    void get(istream& is, Container& S) {
      S.erase(S.begin(), S.end());
      char ch;
      while (is.get(ch)) {
        if (ch == '\n')
          break;
        S.push_back(ch);
      }
    }
Used in part 49a.

⟨Run tests and report results (C++) 51c⟩ ≡
    Base_Line = 0;
    for (int k = 1; k < number_of_algorithms; ++k) {
      cout << "Using " << algorithm_names[k] << ":" << endl;
      Report(algorithm_enumeration(k), S1, S2, separator);
    }
    cout << endl;
Used in parts 49a, 54ce.
⟨Define Report procedure (C++) 51d⟩ ≡
    template <class Container>
    void Report(algorithm_enumeration k, const Container& S1,
                const Container& S2, const char* separator)
    {
      typename Container::const_iterator P;
      typedef typename Container::value_type value_t;
      Algorithm(k, S1, S2, P);
      cout << " String " << '"';
      copy(S2.begin(), S2.end(), ostream_iterator<value_t>(cout, separator));
      if (P == S1.end())
        cout << '"' << " not found" << endl;
      else
        cout << '"' << " found at position " << P - S1.begin() << endl;
      if (Base_Line == 0)
        Base_Line = P - S1.begin();
      else
        if (P - S1.begin() != Base_Line)
          cout << "*****Incorrect result!" << endl;
    }
Used in parts 49a, 52, 63b.

B.5 Large Tests

The following program for conducting tests on a long text sequence performs the same tests as the Ada version, plus searches for words of the requested pattern size selected from a given dictionary (which is read from a file).

"test_long_search.cpp" 52 ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include "new_search.h"
    #include "hume.hh"
    #include "DNA_search.h"
    #include
    #include
    #include
    #include
    #include
    #include
    using namespace std;
    typedef unsigned char data;
    typedef vector<data> sequence;
    sequence S1, S2;
    int Base_Line;
    unsigned int Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    ⟨Define algorithm enumeration type, names, and selector function (C++) 49b⟩
    ⟨Define Report procedure (C++) 51d⟩
    int main()
    {
      unsigned int F, K, j;
      ⟨Read test parameters (C++) 53a⟩
      ⟨Read dictionary from file, placing words of size j in dictionary[j] 53b⟩
      ⟨Read character sequence from file (C++) 54a⟩
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        ⟨Trim dictionary[Pattern_Size[j]] to have at most Number_Of_Tests words 53c⟩
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        cerr << Pattern_Size[j] << " " << flush;
        const char* separator = "";
        ⟨Output header (C++) 54b⟩
        ⟨Run tests searching for selected subsequences (C++) 54c⟩
        ⟨Run tests searching for dictionary words (C++) 54e⟩
      }
    }

⟨Read test parameters (C++) 53a⟩ ≡
    cout << "Input number of tests (for each pattern size): " << flush;
    cin >> Number_Of_Tests;
    cout << "Input number of pattern sizes: " << flush;
    cin >> Number_Of_Pattern_Sizes;
    cout << "Input pattern sizes: " << flush;
    vector Pattern_Size(Number_Of_Pattern_Sizes);
    for (j = 0; j < Number_Of_Pattern_Sizes; ++j)
      cin >> Pattern_Size[j];
    cout << "\nNumber of tests: " << Number_Of_Tests << endl;
    cout << "Pattern sizes: ";
    for (j = 0; j < Number_Of_Pattern_Sizes; ++j)
      cout << Pattern_Size[j] << " ";
    cout << endl;
Used in parts 52, 54f, 57b, 59b, 63b, 65b.
⟨Read dictionary from file, placing words of size j in dictionary[j] 53b⟩ ≡
    ifstream dictfile(wordFileName);
    typedef istream_iterator string_input;
    typedef map<int, vector<sequence>, less<int> > map_type;
    map_type dictionary;
    sequence S;
    string S0;
    string_input si(dictfile);
    while (si != string_input()) {
      S0 = *si++;
      sequence S(S0.begin(), S0.end());
      dictionary[S.size()].push_back(S);
    }
Used in parts 52, 54f, 59b.

⟨Trim dictionary[Pattern_Size[j]] to have at most Number_Of_Tests words 53c⟩ ≡
    vector<sequence>& diction = dictionary[Pattern_Size[j]];
    if (diction.size() > Number_Of_Tests) {
      vector<sequence> temp;
      int Skip_Amount = diction.size() / Number_Of_Tests;
      for (unsigned int T = 0; T < Number_Of_Tests; ++T) {
        temp.push_back(diction[T * Skip_Amount]);
      }
      diction = temp;
    }
Used in parts 52, 54f, 59b.

⟨Read character sequence from file (C++) 54a⟩ ≡
    ifstream ifs(textFileName);
    char C;
    while (ifs.get(C))
      S1.push_back(C);
    cout << S1.size() << " characters read." << endl;
Used in parts 52, 54f, 59b.

⟨Output header (C++) 54b⟩ ≡
    cout << "\n\n-----------------------------------------------------------\n"
         << "Searching for patterns of size " << Pattern_Size[j]
         << "..." << endl;
    cout << "(" << Number_Of_Tests << " patterns from the text, "
         << dictionary[Pattern_Size[j]].size() << " from the dictionary)" << endl;
Used in parts 52, 54f, 57b, 59b, 63b, 65b.

⟨Run tests searching for selected subsequences (C++) 54c⟩ ≡
    F = 0;
    for (K = 1; K <= Number_Of_Tests; ++K) {
      ⟨Select sequence S2 to search for in S1 (C++) 54d⟩
      ⟨Run tests and report results (C++) 51c⟩
    }
Used in parts 52, 63b.

⟨Select sequence S2 to search for in S1 (C++) 54d⟩ ≡
    S2.erase(S2.begin(), S2.end());
    copy(S1.begin() + F, S1.begin() + F + Pattern_Size[j], back_inserter(S2));
    F += Increment;
Used in part 54c.
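The selection of evenly spaced patterns, using the same increment formula as the test programs, can be sketched as a standalone function. `select_patterns` is a hypothetical helper that works on `std::string` rather than the `sequence` type, and the early return for degenerate inputs is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns number_of_tests patterns of length pattern_size taken from
// evenly spaced positions in text, starting at position 0.
std::vector<std::string> select_patterns(const std::string& text,
                                         std::size_t pattern_size,
                                         std::size_t number_of_tests) {
    std::vector<std::string> patterns;
    if (number_of_tests == 0 || pattern_size > text.size())
        return patterns;  // assumption: nothing to select in these cases
    // Same formula as Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests.
    const std::size_t increment = (text.size() - pattern_size) / number_of_tests;
    std::size_t f = 0;
    for (std::size_t t = 0; t < number_of_tests; ++t) {
        patterns.push_back(text.substr(f, pattern_size));
        f += increment;
    }
    return patterns;
}
```

Because the increment is computed from the text length minus the pattern size, the last selected position plus the pattern size never runs past the end of the text.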
⟨Run tests searching for dictionary words (C++) 54e⟩ ≡
    for (K = 0; K < dictionary[Pattern_Size[j]].size(); ++K) {
      S2 = dictionary[Pattern_Size[j]][K];
      ⟨Run tests and report results (C++) 51c⟩
    }
Used in part 52.

B.6 Timed Tests

Again, the following program for timing searches conducts the same searches as in the Ada version, plus searches for words of the requested pattern size selected from a given dictionary.

"time_long_search.cpp" 54f ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include "new_search.h"
    #include "hume.hh"
    #include "DNA_search.h"
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    using namespace std;
    typedef unsigned char data;
    typedef vector<data> sequence;
    sequence S1;
    int Base_Line;
    unsigned int Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    double Base_Time = 0.0;
    ⟨Define algorithm enumeration type, names, and selector function (C++) 49b⟩
    ⟨Define run procedure (C++ forward) 55⟩
    int main()
    {
      int j;
      ⟨Read test parameters (C++) 53a⟩
      ⟨Read character sequence from file (C++) 54a⟩
      ⟨Read dictionary from file, placing words of size j in dictionary[j] 53b⟩
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        ⟨Trim dictionary[Pattern_Size[j]] to have at most Number_Of_Tests words 53c⟩
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        ⟨Output header (C++) 54b⟩
        cerr << Pattern_Size[j] << " " << flush;
        ⟨Run and time tests searching for selected subsequences (C++) 56b⟩
      }
      cerr << endl;
    }

The following test procedure is programmed using forward iterator operations only, so that it can be applied to a non-random access container (e.g., list), assuming the designated algorithm works with forward iterators.
⟨Define run procedure (C++ forward) 55⟩ ≡
    template <class Container>
    void Run(int k, const Container& S1,
             const vector<Container>& dictionary, int Pattern_Size)
    {
      typename Container::const_iterator P;
      int F = 0, d, K;
      double Start_Time, Finish_Time, Time_Taken;
      long Total_Search = 0;
      Start_Time = clock();
      Container S2;
      for (K = 1; K <= Number_Of_Tests; ++K) {
        typename Container::const_iterator u = S1.begin();
        advance(u, F);
        S2.erase(S2.begin(), S2.end());
        for (int I = 0; I < Pattern_Size; ++I)
          S2.push_back(*u++);
        F += Increment;
        ⟨Run algorithm and record search distance 56a⟩
      }
      for (K = 0; K < dictionary.size(); ++K) {
        S2 = dictionary[K];
        ⟨Run algorithm and record search distance 56a⟩
      }
      Finish_Time = clock();
      ⟨Output statistics (C++) 56c⟩
    }
Used in parts 54f, 57b, 65b.

⟨Run algorithm and record search distance 56a⟩ ≡
    Algorithm(k, S1, S2, P);
    d = 0;
    distance(S1.begin(), P, d);
    Total_Search += d + Pattern_Size;
Used in parts 55, 62b.

⟨Run and time tests searching for selected subsequences (C++) 56b⟩ ≡
    Base_Time = 0.0;
    for (int k = 0; k < number_of_algorithms; ++k) {
      if (k != 0)
        cout << "Timing " << algorithm_names[k] << ":" << endl;
      Run(k, S1, dictionary[Pattern_Size[j]], Pattern_Size[j]);
    }
    cout << endl;
Used in parts 54f, 57b, 65b.

⟨Output statistics (C++) 56c⟩ ≡
    Time_Taken = (Finish_Time - Start_Time)/CLOCKS_PER_SEC - Base_Time;
    if (k == 0)
      Base_Time = Time_Taken;
    else {
      cout << "Total search length: " << Total_Search << " elements" << endl;
      cout << "Time: " << Time_Taken << " seconds." << endl;
      double Speed = Time_Taken == 0.0 ? 0.0 :
        static_cast<double>(Total_Search) / 1000000 / Time_Taken;
      cout << "Speed: " << Speed << " elements/microsecond." << endl << endl;
    }
Used in part 55.
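As a minimal sketch of what "forward iterator operations only" means in practice, the following stand-alone fragment selects a pattern from a std::list and measures the match position using only advance, ++, * and distance. The helper names select_pattern and match_position are hypothetical and are not part of the paper's code; std::search stands in for the designated algorithm, and the standard two-argument form of distance is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Hypothetical helper: copy `size` elements starting `offset` positions into
// `text`, using only forward-iterator operations (advance, ++, *).
template <typename Container>
Container select_pattern(const Container& text, std::size_t offset,
                         std::size_t size) {
    assert(offset + size <= text.size());
    typename Container::const_iterator u = text.begin();
    std::advance(u, offset);   // linear walk for std::list, O(1) for std::vector
    Container pattern;
    for (std::size_t i = 0; i < size; ++i)
        pattern.push_back(*u++);
    return pattern;
}

// Hypothetical helper: position of the first match of `pattern` in `text`,
// or text.size() if there is none. std::search needs only forward iterators.
template <typename Container>
std::size_t match_position(const Container& text, const Container& pattern) {
    typename Container::const_iterator p =
        std::search(text.begin(), text.end(), pattern.begin(), pattern.end());
    return static_cast<std::size_t>(std::distance(text.begin(), p));
}
```

Because nothing here subtracts or indexes iterators, the same two helpers compile unchanged for std::vector, where advance and distance run in constant time.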
B.7 Timed Tests (Large Alphabet)

Again, the following program for timing searches conducts the same searches as in the Ada version, plus searches for words of the requested pattern size selected from a given dictionary.

⟨Define algorithm enumeration type, names, and selector function (C++ large alphabet) 57a⟩ ≡
    enum algorithm_enumeration {
      Dummy, SF, L, HAL, NHAL
    };
    const char* algorithm_names[] = {
      "selection code", "SF", "L", "HAL", "NHAL"
    };
    const int number_of_algorithms = 5;

    template <class Container, class Container__const_iterator>
    inline void
    Algorithm(int k, const Container& x, const Container& y,
              Container__const_iterator& result)
    {
      switch (algorithm_enumeration(k)) {
      case Dummy:
        // does nothing, used for timing overhead of test loop
        result = x.begin(); return;
      case SF:
        result = stl_search(x.begin(), x.end(), y.begin(), y.end()); return;
      case L:
        result = __search_L(x.begin(), x.end(), y.begin(), y.end()); return;
      case HAL:
        result = search(x.begin(), x.end(), y.begin(), y.end()); return;
      case NHAL:
        result = search_no_hashing(x.begin(), x.end(), y.begin(), y.end()); return;
      }
      result = x.begin(); return;
    }
Used in part 57b.

"experimental_search.cpp" 57b ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include "new_search.h"
    #include "experimental_search.h"
    // Standard library headers required by this program
    #include <iostream>
    #include <iterator>
    #include <vector>
    #include <map>
    #include <algorithm>
    #include <cstdlib>
    #include <ctime>
    using namespace std;

    typedef unsigned short data;
    typedef vector<data> sequence;

    sequence S1;
    int Base_Line, Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    double Base_Time = 0.0;
    ⟨Define algorithm enumeration type, names, and selector function (C++ large alphabet) 57a⟩
    ⟨Define run procedure (C++ forward) 55⟩
    ⟨Define RandomNumberGenerator class 58a⟩

    int main()
    {
      int j;
      ⟨Read test parameters (C++) 53a⟩
      ⟨Generate data sequence 58b⟩
      ⟨Generate dictionary 59a⟩
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        ⟨Output header (C++) 54b⟩
        cerr << Pattern_Size[j] << " " << flush;
        ⟨Run and time tests searching for selected subsequences (C++) 56b⟩
      }
      cerr << endl;
    }

⟨Define RandomNumberGenerator class 58a⟩ ≡
    int random(int max_value) { return rand() % max_value; }

    template <int MAX_VALUE> struct RandomNumberGenerator {
      int operator() () { return random(MAX_VALUE); }
    };
Used in part 57b.

⟨Generate data sequence 58b⟩ ≡
    generate_n(back_inserter(S1), 100000, RandomNumberGenerator<65535>());
Used in part 57b.

⟨Generate dictionary 59a⟩ ≡
    typedef map<int, vector<sequence>, less<int> > map_type;
    map_type dictionary;
    for (int i = 0; i < Number_Of_Pattern_Sizes; ++i) {
      int pattern_size = Pattern_Size[i];
      for (int j = 0; j < Number_Of_Tests; ++j) {
        int position = random(S1.size() - pattern_size);
        dictionary[pattern_size].push_back(sequence());
        copy(S1.begin() + position, S1.begin() + position + pattern_size,
             back_inserter(dictionary[pattern_size].back()));
      }
    }
Used in part 57b.
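The timing programs run the do-nothing Dummy case first and subtract its time from every later measurement, so that the reported speed reflects the search itself rather than the test loop. The arithmetic can be isolated as below; this is only an illustrative sketch, and the function names net_seconds and elements_per_microsecond are hypothetical, not taken from the paper.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: CPU seconds between two clock() readings, minus the
// loop overhead previously measured with the do-nothing "Dummy" algorithm.
inline double net_seconds(std::clock_t start, std::clock_t finish,
                          double base_time) {
    return static_cast<double>(finish - start) / CLOCKS_PER_SEC - base_time;
}

// Hypothetical helper: elements scanned per microsecond. A measured time of
// exactly zero is reported as speed 0 to avoid dividing by zero.
inline double elements_per_microsecond(long total_search, double seconds) {
    return seconds == 0.0
               ? 0.0
               : static_cast<double>(total_search) / 1000000 / seconds;
}
```

One consequence of this scheme is that a very fast run can yield a net time of zero (or slightly below) at the resolution of clock(), which is why the zero case is handled explicitly.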
B.8 Counted Tests

The following program runs the same searches as in the timing program, but in addition to times it records and reports counts of many different types of operations, including equality comparisons on data, iterator operations, and "distance operations," which are arithmetic operations on integer results of iterator subtractions. These counts are obtained without modifying the source code of the algorithms at all, by specializing their type parameters with classes whose operations have counters built into them.

"count_long_search.cpp" 59b ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #define LIST_TEST
    #include "new_search.h"
    #include "hume.hh"
    // Standard library headers required by this program
    #include <iostream>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <map>
    #include <string>
    #include <algorithm>
    #include <ctime>
    using namespace std;

    ⟨Define types needed for counting operations 60a⟩

    typedef vector<data> sequence;

    sequence S1;
    int Base_Line;
    unsigned int Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    double Base_Time = 0.0;
    ⟨Define algorithm enumeration type, names, and selector function (C++ counter) 60b⟩
    ⟨Define run procedure (C++ counter) 62b⟩

    int main()
    {
      recorder<> stats[number_of_algorithms];
      int j;
      ⟨Read test parameters (C++) 53a⟩
      ⟨Read character sequence from file (C++) 54a⟩
      ⟨Read dictionary from file, placing words of size j in dictionary[j] 53b⟩
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        ⟨Trim dictionary[Pattern_Size[j]] to have at most Number_Of_Tests words 53c⟩
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        ⟨Output header (C++) 54b⟩
        cerr << Pattern_Size[j] << " " << flush;
        ⟨Run and time tests searching for selected subsequences (C++ counter) 62a⟩
      }
      cerr << endl;
    }

⟨Define types needed for counting operations 60a⟩ ≡
    #include <utility>
    using namespace std;
    namespace std { namespace rel_ops {} };
    using namespace std::rel_ops;

    #define DEFAULT_COUNTER_TYPE double
    #include "counters.h"

    typedef unsigned char basedata;
    typedef value_counter<basedata> data;
    typedef iterator_counter<vector<data>::const_iterator> citer;

    struct search_trait_for_counting {
      enum {hash_range_max = 256};
      enum {suffix_size = 1};
      inline static unsigned int hash(const citer& i) {return (*i).base();}
    };
Used in part 59b.

⟨Define algorithm enumeration type, names, and selector function (C++ counter) 60b⟩ ≡
    enum algorithm_enumeration {
      Dummy, STL_search, L, HAL, ABM, TBM
    };
    const char* algorithm_names[] = {
      "selection code", "SF", "L", "HAL", "ABM", "TBM"
    };
    #ifndef LIST_TEST
      const int number_of_algorithms = 6;
    #else
      const int number_of_algorithms = 3;
    #endif
    const char textFileName[] = "long.txt";
    const char wordFileName[] = "words.txt";

    template <class Container, class Container__const_iterator>
    void Algorithm(int k, const Container& x, const Container& y,
                   Container__const_iterator& result)
    {
      switch (algorithm_enumeration(k)) {
      case Dummy:
        // does nothing, used for timing overhead of test loop
        result = x.begin();
        return;
      case STL_search:
        result = stl_search(citer(x.begin()), citer(x.end()),
                            citer(y.begin()), citer(y.end())).base();
        return;
      case L:
        result = __search_L(citer(x.begin()), citer(x.end()),
                            citer(y.begin()), citer(y.end())).base();
        return;
    #ifndef LIST_TEST
      case HAL:
        result = search_hashed(citer(x.begin()), citer(x.end()),
                               citer(y.begin()), citer(y.end()),
                               static_cast<search_trait_for_counting*>(0)).base();
        return;
      case ABM:
        fbmprep((const basedata*)y.begin(), y.size());
        result = (typename Container::const_iterator)
                   fbmexec_cnt((const basedata*)x.begin(), x.size());
        // data::accesses += ::pat.accs;
        // data::equal_comparisons += ::pat.cmps;
        return;
      case TBM:
        humprep((const basedata*)y.begin(), y.size());
        result = (typename Container::const_iterator)
                   humexec_cnt((const basedata*)x.begin(), x.size());
        // data::accesses += ::pat.accs;
        // data::equal_comparisons += ::pat.cmps;
        result = result;
        return;
    #endif
      }
      result = x.begin();
      return;
    }
Used in part 59b.

⟨Run and time tests searching for selected subsequences (C++ counter) 62a⟩ ≡
    Base_Time = 0.0;
    for (int k = 0; k < number_of_algorithms; ++k) {
      if (k != 0)
        cout << "Timing " << algorithm_names[k] << ":" << endl;
      Run(k, S1, dictionary[Pattern_Size[j]], Pattern_Size[j], stats);
    }
    cout << endl;
Used in part 59b.

⟨Define run procedure (C++ counter) 62b⟩ ≡
    template <class Container>
    void Run(int k, const Container& S1,
             const vector<Container>& dictionary, int Pattern_Size,
             recorder<>* stats)
    {
      typename Container::const_iterator P;
      int F = 0, K, d;
      double Start_Time, Finish_Time, Time_Taken;
      long Total_Search = 0;
      stats[k].reset();
      Start_Time = clock();
      Container S2;
      for (K = 1; K <= Number_Of_Tests; ++K) {
        typename Container::const_iterator u = S1.begin();
        advance(u, F);
        S2.erase(S2.begin(), S2.end());
        for (int I = 0; I < Pattern_Size; ++I)
          S2.push_back(*u++);
        F += Increment;
        ⟨Run algorithm and record search distance 56a⟩
      }
      for (K = 0; K < dictionary.size(); ++K) {
        S2 = dictionary[K];
        ⟨Run algorithm and record search distance 56a⟩
      }
      stats[k].record();
      Finish_Time = clock();
      ⟨Output statistics (C++ counter) 63a⟩
    }
Used in part 59b.
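The idea of counting operations by substituting instrumented types, rather than editing the algorithm, can be illustrated with a much smaller stand-in than the counters.h classes. In the sketch below, counted_value and count_search_comparisons are hypothetical names invented for this illustration; only equality comparisons are counted, and the unmodified std::search plays the role of the algorithm under test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a counting value type: it behaves
// like T but increments a static counter on every equality comparison.
template <typename T>
class counted_value {
public:
    counted_value(const T& v = T()) : value_(v) {}
    const T& base() const { return value_; }
    static long equality_comparisons;
    friend bool operator==(const counted_value& a, const counted_value& b) {
        ++equality_comparisons;
        return a.value_ == b.value_;
    }
    friend bool operator!=(const counted_value& a, const counted_value& b) {
        return !(a == b);
    }
private:
    T value_;
};
template <typename T> long counted_value<T>::equality_comparisons = 0;

// Runs the unmodified std::search on sequences of counted values, stores the
// match position (text.size() if no match) and returns the comparison count.
template <typename T>
long count_search_comparisons(const std::vector<T>& text,
                              const std::vector<T>& pattern,
                              std::size_t& position) {
    typedef counted_value<T> data;
    const std::vector<data> x(text.begin(), text.end());
    const std::vector<data> y(pattern.begin(), pattern.end());
    data::equality_comparisons = 0;
    typename std::vector<data>::const_iterator p =
        std::search(x.begin(), x.end(), y.begin(), y.end());
    position = static_cast<std::size_t>(p - x.begin());
    return data::equality_comparisons;
}
```

The exact count depends on the standard library's implementation of std::search, but a successful match must have compared every pattern element at least once, so the count is never smaller than the pattern length.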
⟨Output statistics (C++ counter) 63a⟩ ≡
    Time_Taken = (Finish_Time - Start_Time)/CLOCKS_PER_SEC - Base_Time;
    if (k == 0)
      Base_Time = Time_Taken;
    else {
      operator_type group0[] = { LESS_THAN, EQUALITY };
      operator_type group1[] = { DEFAULT_CTOR, COPY_CTOR, OTHER_CTOR, ASSIGNMENT };
      operator_type group2[] = {
        CONVERSION,
        POST_INCREMENT, PRE_INCREMENT, POST_DECREMENT, PRE_DECREMENT,
        UNARY_PLUS, UNARY_MINUS,
        PLUS, MINUS, TIMES, DIVIDE, MODULO, LEFT_SHIFT, RIGHT_SHIFT,
        PLUS_ASSIGN, MINUS_ASSIGN, TIMES_ASSIGN, DIVIDE_ASSIGN, MODULO_ASSIGN,
        LEFT_SHIFT_ASSIGN, RIGHT_SHIFT_ASSIGN,
        DEREFERENCE, MEMBER, SUBSCRIPT, FUNCTION_CALL,
        NEGATE,
        AND, AND_ASSIGN, OR, OR_ASSIGN, XOR, XOR_ASSIGN,
      };
      scale_recorder_visitor > visitor(.001);
      visitor.add("Comps", group0, group0 + sizeof(group0) / sizeof(group0[0]));
      visitor.add("Asmts", group1, group1 + sizeof(group1) / sizeof(group1[0]));
      visitor.add("Other", group2, group2 + sizeof(group2) / sizeof(group2[0]));
      operator_type group3[] = { BASE, DTOR, GENERATION, };
      visitor.hide(group3, group3 + sizeof(group3) / sizeof(group3[0]));
      stats[k].report(cout, visitor);
      cout << "Total search length: " << Total_Search << " elements" << endl;
      cout << "Time: " << Time_Taken << " seconds." << endl;
      double Speed = Time_Taken == 0.0 ? 0.0 :
        (double)Total_Search / 1000000 / Time_Taken;
      cout << "Speed: " << Speed << " elements/microsecond." << endl << endl;
    }
Used in part 62b.

B.9 Application to Matching Sequences of Words

B.9.1 Large Tests

This C++ program specializes the generic search functions to work with sequences of words (character strings). It reads a text file in as a sequence of words, and for each of a specified set of pattern sizes, it searches for word sequences of that size selected from evenly spaced positions in the target sequence.
These searches are the counterpart of the first kind of searches done in the previous programs on character sequences; the dictionary word searches of the previous programs are omitted here.

"test_word_search.cpp" 63b ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include "new_search.h"
    // Standard library headers required by this program
    #include <iostream>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <map>
    #include <string>
    #include <algorithm>
    using namespace std;

    typedef string data;
    typedef vector<data> sequence;

    sequence S1, S2;
    int Base_Line, Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    ⟨Define search trait for word searches 64a⟩
    ⟨Define algorithm enumeration type, names, and selector function (C++ word) 64b⟩
    ⟨Define Report procedure (C++) 51d⟩

    int main()
    {
      int F, K, j;
      ⟨Read test parameters (C++) 53a⟩
      typedef map<int, vector<sequence>, less<int> > map_type;
      map_type dictionary;
      ⟨Read word sequence from file (C++) 65a⟩
      cout << S1.size() << " words read." << endl;
      const char* separator = " ";
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        ⟨Output header (C++) 54b⟩
        ⟨Run tests searching for selected subsequences (C++) 54c⟩
      }
    }

For a hash function the program uses a mapping of a word to its first character. Although this would not be a good hash function in hashed associative table lookup, it works satisfactorily here because there is less need for uniformity of hash value distribution.

⟨Define search trait for word searches 64a⟩ ≡
    struct search_word_trait {
      typedef vector<string>::const_iterator RAI;
      enum {hash_range_max = 256};
      enum {suffix_size = 1};
      inline static unsigned int hash(RAI i) {
        return (*i)[0];
      }
    };
Used in parts 63b, 65b.
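To see why a crude first-character hash is adequate for skipping, consider the following sketch. It is not the paper's HAL algorithm: it is a plain Horspool-style search over word sequences whose skip table is indexed by a hash of the text word aligned with the end of the pattern. The names first_char_hash and hashed_skip_search are hypothetical, and the hash casts to unsigned char and maps an empty word to 0 so that the table index always stays below hash_range_max.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> word_sequence;

// Hypothetical trait: hash a word to its first character, in [0, 256).
struct first_char_hash {
    enum { hash_range_max = 256 };
    static unsigned int hash(const std::string& w) {
        return w.empty() ? 0u : static_cast<unsigned char>(w[0]);
    }
};

// Horspool-style search with a hashed skip table. Returns the index of the
// first match of `pattern` in `text`, or text.size() if there is none.
template <typename Trait>
std::size_t hashed_skip_search(const word_sequence& text,
                               const word_sequence& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0) return 0;
    if (m > n) return n;
    // skip[h]: distance from the last pattern word (other than the final one)
    // hashing to h to the end of the pattern; m if no such word exists.
    std::vector<std::size_t> skip(Trait::hash_range_max, m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[Trait::hash(pattern[i])] = m - 1 - i;
    std::size_t k = 0;
    while (k + m <= n) {
        std::size_t j = 0;
        while (j < m && text[k + j] == pattern[j]) ++j;   // verify alignment
        if (j == m) return k;
        k += skip[Trait::hash(text[k + m - 1])];          // always >= 1
    }
    return n;
}
```

Hash collisions can only make a skip shorter than it might have been, never longer, so correctness does not depend on the quality of the hash; a poor hash costs some speed but no matches.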
⟨Define algorithm enumeration type, names, and selector function (C++ word) 64b⟩ ≡
    enum algorithm_enumeration {
      Dummy, STL_search, L, HAL
    };
    const char* algorithm_names[] = {
      "selection code", "SF", "L", "HAL"
    };
    #ifndef LIST_TEST
      const int number_of_algorithms = 4;
    #else
      const int number_of_algorithms = 3;
    #endif

    template <class Container, class Container__const_iterator>
    inline void
    Algorithm(int k, const Container& x, const Container& y,
              Container__const_iterator& result)
    {
      switch (algorithm_enumeration(k)) {
      case Dummy:
        // does nothing, used for timing overhead of test loop
        result = x.begin(); return;
      case STL_search:
        result = stl_search(x.begin(), x.end(), y.begin(), y.end()); return;
      case L:
        result = __search_L(x.begin(), x.end(), y.begin(), y.end()); return;
    #ifndef LIST_TEST
      case HAL:
        result = search_hashed(x.begin(), x.end(), y.begin(), y.end(),
                               (search_word_trait*)0); return;
    #endif
      }
      result = x.begin(); return;
    }
Used in parts 63b, 65b.

⟨Read word sequence from file (C++) 65a⟩ ≡
    ifstream ifs("long.txt");
    typedef istream_iterator<string> string_input;
    copy(string_input(ifs), string_input(), back_inserter(S1));
Used in parts 63b, 65b.

B.9.2 Timed Tests

We also omit the dictionary searches in the following program, which times searches for selected subsequences, in this case by defining a map from ints to empty dictionaries (in order to reuse some of the previous code).

"time_word_search.cpp" 65b ≡
    ⟨Include algorithms header with existing search function renamed 48⟩
    #include "new_search.h"
    // Standard library headers required by this program
    #include <iostream>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <map>
    #include <string>
    #include <algorithm>
    #include <ctime>
    //#include <list>
    //#define LIST_TEST
    using namespace std;

    typedef string data;
    typedef vector<data> sequence;

    sequence S1, S2;
    int Base_Line, Number_Of_Tests, Number_Of_Pattern_Sizes, Increment;
    double Base_Time = 0.0;
    ⟨Define search trait for word searches 64a⟩
    ⟨Define algorithm enumeration type, names, and selector function (C++ word) 64b⟩
    ⟨Define run procedure (C++ forward) 55⟩

    int main()
    {
      int j;
      ⟨Read test parameters (C++) 53a⟩
      typedef map<int, vector<sequence>, less<int> > map_type;
      map_type dictionary;
      ⟨Read word sequence from file (C++) 65a⟩
      cout << S1.size() << " words read." << endl;
      for (j = 0; j < Number_Of_Pattern_Sizes; ++j) {
        Increment = (S1.size() - Pattern_Size[j]) / Number_Of_Tests;
        ⟨Output header (C++) 54b⟩
        ⟨Run and time tests searching for selected subsequences (C++) 56b⟩
      }
    }

C Index of Part Names

⟨Accelerated Linear algorithm, no hashing (C++) 46a⟩ Referenced in part 44c.
⟨Accelerated Linear algorithm, preliminary version 5c⟩ Referenced in part 29c.
⟨Accelerated Linear algorithm 7b⟩ Referenced in part 27c.
⟨Additional algorithms 29c⟩ Referenced in parts 31b, 35a, 36e.
⟨Algorithm L, optimized linear pattern search (C++) 21a⟩ Referenced in part 19b.
⟨Algorithm L, optimized linear pattern search 3a⟩ Referenced in part 27c.
⟨Algorithm subprogram declarations 27b⟩ Referenced in parts 31b, 35a, 36e.
⟨Basic KMP 2⟩ Referenced in part 27c.
⟨Bidirectional iterator case 40b⟩ Referenced in part 38b.
⟨Check for unexpected end of file (C++) 51a⟩ Referenced in part 50.
⟨Check for unexpected end of file 33a⟩ Referenced in part 32.
⟨Compute and return position of match 22b⟩ Referenced in part 22a.
⟨Compute next table (C++ forward) 20a⟩ Referenced in part 19b.
⟨Compute next table (C++) 44a⟩ Referenced in parts 42a, 46a.
⟨Compute next table 31a⟩ Referenced in parts 5c, 7b, 12b, 27c.
⟨Compute skip table and mismatch shift using the hash function (C++) 43c⟩ Referenced in part 42a.
⟨Compute skip table and mismatch shift using the hash function 12a⟩ Referenced in part 12b.
⟨Compute skip table and mismatch shift, no hashing (C++) 46b⟩ Referenced in part 46a.
⟨Compute skip table and mismatch shift 6a⟩ Referenced in parts 5c, 7b.
⟨Data declarations 35b⟩ Referenced in parts 35a, 36e.
⟨Define DNA search traits 47b⟩ Referenced in part 47a.
⟨Define RandomNumberGenerator class 58a⟩ Referenced in part 57b.
⟨Define Report procedure (C++) 51d⟩ Referenced in parts 49a, 52, 63b.
⟨Define Report procedure 33e⟩ Referenced in parts 31b, 35a.
⟨Define algorithm enumeration type, names, and selector function (C++ counter) 60b⟩ Referenced in part 59b.
⟨Define algorithm enumeration type, names, and selector function (C++ large alphabet) 57a⟩ Referenced in part 57b.
⟨Define algorithm enumeration type, names, and selector function (C++ word) 64b⟩ Referenced in parts 63b, 65b.
⟨Define algorithm enumeration type, names, and selector function (C++) 49b⟩ Referenced in parts 49a, 52, 54f.
⟨Define algorithm enumeration type, names, and selector function 31e⟩ Referenced in parts 31b, 35a, 36e.
⟨Define procedure to compute next table (C++ forward) 20b⟩ Referenced in part 38b.
⟨Define procedure to compute next table (C++) 43d⟩ Referenced in part 38b.
⟨Define procedure to compute next table 30⟩ Referenced in part 27b.
⟨Define procedure to output sequence 33c⟩ Referenced in part 31b.
⟨Define procedure to read string into sequence (C++) 51b⟩ Referenced in part 49a.
⟨Define procedure to read string into sequence 33b⟩ Referenced in part 31b.
⟨Define run procedure (C++ counter) 62b⟩ Referenced in part 59b.
⟨Define run procedure (C++ forward) 55⟩ Referenced in parts 54f, 57b, 65b.
⟨Define run procedure 37b⟩ Referenced in part 36e.
⟨Define search trait for word searches 64a⟩ Referenced in parts 63b, 65b.
⟨Define types needed for counting operations 60a⟩ Referenced in part 59b.
⟨Experimental search function with skip loop without hashing 44c⟩ Referenced in part 44b.
⟨Forward iterator case 19b⟩ Referenced in part 38b.
⟨Generate data sequence 58b⟩ Referenced in part 57b.
⟨Generate dictionary 59a⟩ Referenced in part 57b.
⟨Generic search trait 39b⟩ Referenced in part 39a.
⟨HAL declaration 29b⟩ Referenced in part 27b.
⟨HAL with random access iterators, no trait passed 41a⟩ Referenced in part 38b.
⟨Handle pattern size = 1 as a special case (C++) 21b⟩ Referenced in parts 21a, 42a, 46a.
⟨Handle pattern size = 1 as a special case 3b⟩ Referenced in parts 3a, 5c, 7b, 12b, 27c.
⟨Hashed Accelerated Linear algorithm (C++) 42a⟩ Referenced in part 41b.
⟨Hashed Accelerated Linear algorithm 12b⟩ Referenced in part 29b.
⟨Include algorithms header with existing search function renamed 48⟩ Referenced in parts 49a, 52, 54f, 57b, 59b, 63b, 65b.
⟨Non-hashed algorithms 27c⟩ Referenced in part 27b.
⟨Output S2 34a⟩ Referenced in part 33e.
⟨Output header (C++) 54b⟩ Referenced in parts 52, 54f, 57b, 59b, 63b, 65b.
⟨Output statistics (C++ counter) 63a⟩ Referenced in part 62b.
⟨Output statistics (C++) 56c⟩ Referenced in part 55.
⟨Output statistics 38a⟩ Referenced in part 37b.
⟨Read character sequence from file (C++) 54a⟩ Referenced in parts 52, 54f, 59b.
⟨Read character sequence from file 36a⟩ Referenced in parts 35a, 36e.
⟨Read dictionary from file, placing words of size j in dictionary[j] 53b⟩ Referenced in parts 52, 54f, 59b.
⟨Read test parameters (C++) 53a⟩ Referenced in parts 52, 54f, 57b, 59b, 63b, 65b.
⟨Read test parameters 35d⟩ Referenced in parts 35a, 36e.
⟨Read test sequences from file (C++) 50⟩ Referenced in part 49a.
⟨Read test sequences from file 32⟩ Referenced in part 31b.
⟨Read word sequence from file (C++) 65a⟩ Referenced in parts 63b, 65b.
⟨Recover from a mismatch using the next table (C++ forward) 22a⟩ Referenced in part 21a.
⟨Recover from a mismatch using the next table (C++) 43b⟩ Referenced in part 42c.
⟨Recover from a mismatch using the next table, with k translated 8c⟩ Referenced in part 8a.
⟨Recover from a mismatch using the next table 4b⟩ Referenced in parts 3a, 5c.
⟨Run algorithm and record search distance 56a⟩ Referenced in parts 55, 62b.
⟨Run and time tests searching for selected subsequences (C++ counter) 62a⟩ Referenced in part 59b.
⟨Run and time tests searching for selected subsequences (C++) 56b⟩ Referenced in parts 54f, 57b, 65b.
⟨Run and time tests searching for selected subsequences 37a⟩ Referenced in part 36e.
⟨Run tests and report results (C++) 51c⟩ Referenced in parts 49a, 54ce.
⟨Run tests and report results 33d⟩ Referenced in part 31b.
⟨Run tests searching for dictionary words (C++) 54e⟩ Referenced in part 52.
⟨Run tests searching for selected subsequences (C++) 54c⟩ Referenced in parts 52, 63b.
⟨Run tests searching for selected subsequences 36b⟩ Referenced in part 35a.
⟨Run tests 36d⟩ Referenced in part 36b.
⟨Scan the text for a possible match (C++) 21c⟩ Referenced in part 21a.
⟨Scan the text for a possible match 3c⟩ Referenced in parts 3a, 27c.
⟨Scan the text using a single-test skip loop with hashing (C++) 42b⟩ Referenced in part 42a.
⟨Scan the text using a single-test skip loop with hashing 11⟩ Referenced in part 12b.
⟨Scan the text using a single-test skip loop, no hashing (C++) 46c⟩ Referenced in part 46a.
⟨Scan the text using a single-test skip loop, with k translated 7a⟩ Referenced in part 7b.
⟨Scan the text using a single-test skip loop 6b⟩ Not referenced.
⟨Scan the text using the skip loop 5a⟩ Referenced in part 5c.
⟨Search traits for character sequences 40a⟩ Referenced in part 39a.
⟨Select sequence S2 to search for in S1 (C++) 54d⟩ Referenced in part 54c.
⟨Select sequence S2 to search for in S1 36c⟩ Referenced in parts 36b, 37b.
⟨Sequence declarations 27a⟩ Referenced in parts 31b, 35a, 36e.
⟨Set file long.txt as input file 35c⟩ Referenced in parts 35a, 36e.
⟨Set file small.txt as input file 31d⟩ Referenced in part 31b.
⟨Simple hash function declarations 29a⟩ Referenced in part 27b.
⟨Trim dictionary[Pattern_Size[j]] to have at most Number_Of_Tests words 53c⟩ Referenced in parts 52, 54f, 59b.
⟨User level search function with trait argument 41b⟩ Referenced in part 38b.
⟨User level search function 19a⟩ Referenced in part 38b.
⟨Variable declarations 31c⟩ Referenced in part 31b.
⟨Verify match or recover from mismatch (C++) 42c⟩ Referenced in parts 42a, 46a.
⟨Verify match or recover from mismatch 8a⟩ Referenced in parts 7b, 12b.
⟨Verify the match for positions 1 through pattern size - 1 (C++) 43a⟩ Referenced in part 42c.
⟨Verify the match for positions a + 1 through m - 1, with k translated 8b⟩ Referenced in part 8a.
⟨Verify the match for positions a through m - 2 5b⟩ Referenced in part 5c.
⟨Verify whether a match is possible at the position found (C++) 21d⟩ Referenced in part 21a.
⟨Verify whether a match is possible at the position found 4a⟩ Referenced in parts 3a, 27c.
