Fast Packed String Matching for Short Patterns
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science, with applications in many other fields such as natural language processing, information retrieval, and computational biology. Over the last two decades, a general trend has emerged of exploiting the power of the word-RAM model to speed up the performance of classical string matching algorithms. In this model an algorithm operates on words of length w, grouping blocks of characters, and arithmetic and logic operations on words take one unit of time. In this paper we use specialized word-size packed string matching instructions, based on the Intel Streaming SIMD Extensions (SSE) technology, to design very fast string matching algorithms for short patterns. Our experimental results show that, despite their quadratic worst-case time complexity, the newly presented algorithms are the clear winners on average for short patterns when compared against the most effective algorithms known in the literature.
💡 Research Summary
The paper addresses the classic problem of finding all occurrences of a pattern P of length m in a text T of length n, focusing on the regime where m is short relative to the machine word size w. Building on the word‑RAM model, the authors exploit Intel’s Streaming SIMD Extensions (SSE) to pack multiple characters into a single 128‑bit register and perform parallel comparisons with a constant‑time cost per word. The core contribution is a set of “packed string matching” primitives that combine XOR‑based mismatch detection, bit‑population counting, and trailing‑zero extraction to locate mismatches within a word in a few CPU cycles.
Algorithmically, the pattern is first loaded into an SSE register, padded to the full word width. The text is scanned with a sliding window of the same width; each window is XOR‑ed with the pattern register. If the result is zero, a full match is reported. Otherwise, the non‑zero bits form a mask that pinpoints the first mismatching byte. The algorithm then shifts the pattern appropriately and repeats the comparison, effectively implementing a byte‑wise shift‑or mechanism at the word level. For the residual bytes that do not fill a whole word, a conventional byte‑wise verification is performed.
Although the worst‑case time complexity remains O(m·n) (a pathological text can force a mismatch at every position), the average‑case behavior is dramatically better: each word‑level comparison eliminates up to w characters, so the expected number of iterations approaches n/w, yielding near‑linear performance in practice. The authors implement the algorithm with compiler intrinsics such as _mm_loadu_si128, _mm_xor_si128, and _mm_cmpeq_epi8, carefully aligning memory accesses and inserting prefetch instructions to minimize cache misses.
Experimental evaluation covers three representative corpora: English literary text, human genomic DNA, and system log files. Pattern lengths from 2 to 32 bytes are tested, and the new method is benchmarked against Boyer‑Moore, Knuth‑Morris‑Pratt, Shift‑Or, BNDM, and the most recent SIMD‑based variants (SIMD‑BNDM, SIMD‑Shift‑Or). Results show that for patterns up to 8 bytes, the packed approach outperforms all competitors, achieving speed‑ups of 2.3× to 5.1× on average. The advantage diminishes as m approaches w/8, at which point traditional SIMD‑BNDM regains superiority, confirming that the proposed technique is specifically optimized for short patterns.
The paper also provides a theoretical analysis linking the ratio w/m to the expected number of word‑level operations, demonstrating that larger ratios lead to fewer mismatch checks and thus higher throughput. Finally, the authors discuss extensions to wider registers (AVX‑512), multi‑core parallelism, and potential applications to regular‑expression matching.
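The scaling argument above can be written compactly. This is a back-of-the-envelope restatement of the summary's claims, not the paper's derivation, with n the text length, m the pattern length, and w the number of characters per machine word:

```latex
\[
  \underbrace{O(m \cdot n)}_{\text{worst case}}
  \qquad\text{vs.}\qquad
  \mathbb{E}[\text{word-level comparisons}] \approx \frac{n}{w}
  \quad \text{(average case)},
\]
% so, for a fixed text, throughput grows roughly with the ratio w/m
% as long as the pattern fits in a word, i.e. m <= w.
```

Larger w/m ratios mean each packed comparison covers a bigger fraction of the work, which matches the experimental observation that the advantage is greatest for short patterns.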
In summary, the work presents a practical, high‑performance solution for short‑pattern string matching by marrying word‑size packing with SIMD instructions, showing that even algorithms with quadratic worst‑case bounds can dominate the average case when modern processor features are fully leveraged.