Improved Extended Regular Expression Matching
An extended regular expression $R$ specifies a set of strings formed by characters from an alphabet combined with concatenation, union, intersection, complement, and star operators. Given an extended regular expression $R$ and a string $Q$, the extended regular expression matching problem is to decide if $Q$ matches any of the strings specified by $R$. Extended regular expression matching was introduced by Hopcroft and Ullman in the 1970s, who gave a simple dynamic programming solution using $O(n^3m)$ time and $O(n^2m)$ space, where $n$ is the length of $Q$ and $m$ is the length of $R$. The current state-of-the-art solution, by Yamamoto and Miyazaki, uses $O(\frac{n^3k + n^2m}{w} + n + m)$ time and $O(\frac{n^2k + nm}{w} + n + m)$ space, where $k$ is the number of intersection and complement operators in $R$ and $w$ is the number of bits in a machine word. This roughly replaces the $m$ factor with $k$ in the dominant terms of both the space and time bounds of the classical Hopcroft and Ullman algorithm. In this paper, we present a new solution that solves extended regular expression matching in $O\left(n^{\omega}k + \frac{n^2m}{\max(w/\log w, \log n)} + m\right)$ time and $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 + m)$ space, where $\omega \approx 2.3716$ is the exponent of matrix multiplication. Essentially, this replaces the dominant $n^3k$ term with $n^{\omega}k$ in the time bound, while simultaneously improving the $n^2k$ term in the space to $O(n^2)$. To achieve our result, we develop several new insights and techniques of independent interest, including a new compact representation to store and efficiently combine substring matches, a new clustering technique for parse trees of extended regular expressions, and a new efficient combination of finite automaton simulation with substring match representation to speed up the classic dynamic programming solution.
💡 Research Summary
The paper addresses the classic problem of deciding whether a given string Q matches an extended regular expression R, where R may contain the usual concatenation, union, and star operators as well as the extended intersection (∩) and complement (¬) operators. While the original Hopcroft‑Ullman dynamic‑programming algorithm from the 1970s runs in O(n³ m) time and O(n² m) space (n = |Q|, m = |R|), later work by Yamamoto and Miyazaki replaced the factor m in the dominant terms with the number k of extended operators, achieving O((n³ k + n² m)/w + n + m) time and O((n² k + nm)/w + n + m) space (w = machine word size in bits). Recent conditional lower bounds based on the k‑clique hypothesis suggest that the exponent ω≈2.3716 of fast matrix multiplication is essentially optimal for the dominant term.
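To make the classical dynamic program concrete, here is a minimal Python sketch, not the paper's algorithm. It computes, for every node of the expression, a Boolean matrix whose entry (i, j) records whether the substring Q[i:j] matches that node. The tuple-based expression encoding and the names `match_matrix` and `matches` are hypothetical choices made for this illustration. Concatenation and star each cost O(n³) per node here, which is where the O(n³ m) bound of the classical approach comes from.

```python
from typing import List

Matrix = List[List[bool]]

# Hypothetical AST encoding (an assumption of this sketch):
#   ("chr", c), ("cat", A, B), ("or", A, B), ("and", A, B), ("not", A), ("star", A)


def match_matrix(expr: tuple, q: str) -> Matrix:
    """Return M where M[i][j] is True iff q[i:j] is in the language of expr (i <= j)."""
    n = len(q)
    size = n + 1
    op = expr[0]

    if op == "chr":
        # A single character matches exactly the length-1 substrings equal to it.
        m = [[False] * size for _ in range(size)]
        for i in range(n):
            m[i][i + 1] = q[i] == expr[1]
        return m

    if op in ("not", "star"):
        a = match_matrix(expr[1], q)
        if op == "not":
            # Complement: flip every entry that denotes a valid substring (i <= j).
            return [[i <= j and not a[i][j] for j in range(size)] for i in range(size)]
        # Star: q[i:j] matches if it is empty, or splits into a non-empty
        # match of A followed by a star match of the remainder.
        s = [[False] * size for _ in range(size)]
        for i in range(n, -1, -1):
            s[i][i] = True
            for j in range(i + 1, size):
                s[i][j] = any(a[i][t] and s[t][j] for t in range(i + 1, j + 1))
        return s

    if op in ("or", "and", "cat"):
        a = match_matrix(expr[1], q)
        b = match_matrix(expr[2], q)
        if op == "or":
            return [[a[i][j] or b[i][j] for j in range(size)] for i in range(size)]
        if op == "and":
            return [[a[i][j] and b[i][j] for j in range(size)] for i in range(size)]
        # Concatenation is a Boolean matrix product over split points t.
        return [
            [any(a[i][t] and b[t][j] for t in range(i, j + 1)) for j in range(size)]
            for i in range(size)
        ]

    raise ValueError(f"unknown operator: {op!r}")


def matches(expr: tuple, q: str) -> bool:
    """Decide whether the whole string q matches expr."""
    return match_matrix(expr, q)[0][len(q)]


# Example: strings over {a, b} that do not contain "bb".
any_ab = ("star", ("or", ("chr", "a"), ("chr", "b")))
has_bb = ("cat", any_ab, ("cat", ("chr", "b"), ("cat", ("chr", "b"), any_ab)))
no_bb = ("and", any_ab, ("not", has_bb))
```

With these definitions, `matches(no_bb, "abab")` evaluates to `True` and `matches(no_bb, "abba")` to `False`. The sketch recomputes shared subexpressions and uses plain nested lists; it is meant only to show how union, intersection, and complement become entrywise Boolean operations while concatenation becomes a matrix product.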
The authors present a new algorithm that improves both bounds: the dominant n³ k term in the time becomes n^ω k, and the n² k term in the space drops to O(n²). Their main result (Theorem 1) states that extended regular expression matching can be solved in
O( n^ω k + n² m / max(w/ log w, log n) + m ) time
and
O( n² log k / w + n + m ) = O(n² + m) space.
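The simplification of the space bound can be checked with a short calculation. The sketch below assumes that k ≤ m and the standard word-RAM convention w ≥ log(n + m); neither assumption is stated explicitly in this summary.

```latex
% Assumptions (not stated in the summary): k \le m and w \ge \log(n + m).
\[
  \frac{n^2 \log k}{w} \;\le\; \frac{n^2 \log m}{w} \;\le\; n^2
  \qquad\text{and}\qquad n \le n^2 \ \text{for } n \ge 1,
\]
\[
  \text{hence}\quad
  O\!\left(\frac{n^2 \log k}{w} + n + m\right) = O\!\left(n^2 + m\right).
\]
```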
The key technical contributions are threefold:
- Match Graphs – a novel representation of substring matches as Boolean matrices. For each sub‑expression v of R, a matrix M_v