Regular Expression Matching and Operational Semantics
Many programming languages and tools, ranging from grep to the Java String library, contain regular expression matchers. Rather than first translating a regular expression into a deterministic finite automaton, such implementations typically match the regular expression on the fly. Thus they can be seen as virtual machines interpreting the regular expression much as if it were a program with some non-deterministic constructs such as the Kleene star. We formalize this implementation technique for regular expression matching using operational semantics. Specifically, we derive a series of abstract machines, moving from the abstract definition of matching to increasingly realistic machines. First a continuation is added to the operational semantics to describe what remains to be matched after the current expression. Next, we represent the expression as a data structure using pointers, which enables redundant searches to be eliminated via testing for pointer equality. From there, we arrive both at Thompson’s lockstep construction and a machine that performs some operations in parallel, suitable for implementation on a large number of cores, such as a GPU. We formalize the parallel machine using process algebra and report some preliminary experiments with an implementation on a graphics processor using CUDA.
💡 Research Summary
The paper presents a formal operational‑semantics treatment of regular‑expression (regex) matching, tracing a path from the abstract big‑step definition of matching to concrete abstract machines that can be directly implemented. The authors begin by recalling the standard syntax for regexes (ε, literal a, concatenation, alternation, and Kleene star) and the usual big‑step relation e ↓ w, which states that the string w matches the expression e. This relation captures the nondeterminism of alternation and star but does not describe how a matcher should explore the search space.
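The big‑step relation admits a direct, if naive, executable reading: try every way a string can satisfy an expression. The following Python sketch illustrates this (the constructor names `Eps`, `Lit`, `Cat`, `Alt`, `Star` and the function `matches` are our naming for illustration, not the paper's notation):

```python
from dataclasses import dataclass

class Re: pass

@dataclass
class Eps(Re): pass          # ε: matches the empty string

@dataclass
class Lit(Re):               # a single literal character
    ch: str

@dataclass
class Cat(Re):               # concatenation e1 · e2
    left: Re
    right: Re

@dataclass
class Alt(Re):               # alternation e1 | e2
    left: Re
    right: Re

@dataclass
class Star(Re):              # Kleene star e*
    body: Re

def matches(e: Re, w: str) -> bool:
    """Decide e ↓ w by exhaustive search over the big-step rules."""
    if isinstance(e, Eps):
        return w == ""
    if isinstance(e, Lit):
        return w == e.ch
    if isinstance(e, Cat):
        # w matches e1·e2 iff some split w = u v has e1 ↓ u and e2 ↓ v
        return any(matches(e.left, w[:i]) and matches(e.right, w[i:])
                   for i in range(len(w) + 1))
    if isinstance(e, Alt):
        return matches(e.left, w) or matches(e.right, w)
    if isinstance(e, Star):
        if w == "":
            return True
        # peel off a non-empty prefix matching the body, then recurse;
        # requiring a non-empty prefix keeps the search terminating
        return any(matches(e.body, w[:i]) and matches(e, w[i:])
                   for i in range(1, len(w) + 1))
    raise TypeError(e)
```

This declarative reading explores all splits of the input, which is exactly the inefficiency that the machines developed in the rest of the paper are designed to tame.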
To make the process explicit, the authors introduce the EKW machine, named after its three components: Expression (E), Continuation stack (K), and the remaining Word (W). A configuration ⟨e; k; w⟩ consists of the current sub‑expression e, a stack k that records what must be matched after e, and the yet‑unconsumed input w. Transition rules push and pop the continuation stack for concatenation, alternation, and star, and consume one character when a literal matches. The machine is nondeterministic: a transition for e₁ | e₂ can take either branch, and a successful branch suffices. However, the authors point out that the EKW machine can loop forever on expressions such as a** when the input contains an a, because a star can be re‑entered without consuming any input, so the continuation stack is rebuilt indefinitely without progress.
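A backtracking search over EKW configurations can be sketched as follows. The encoding is ours, not the paper's: continuations are Python tuples used as stacks, and the cycle check on configurations is our addition to keep the sketch terminating on expressions like a** (the bare machine, as noted, can loop forever there):

```python
from dataclasses import dataclass

# frozen dataclasses so configurations ⟨e; k; w⟩ are hashable
@dataclass(frozen=True)
class Eps: pass

@dataclass(frozen=True)
class Lit:
    ch: str

@dataclass(frozen=True)
class Cat:
    left: object
    right: object

@dataclass(frozen=True)
class Alt:
    left: object
    right: object

@dataclass(frozen=True)
class Star:
    body: object

def ekw_accepts(e, w):
    path = set()              # configurations on the current search path

    def run(e, k, w):
        cfg = (e, k, w)
        if cfg in path:       # a cycle consumed no input: prune this branch
            return False
        path.add(cfg)
        try:
            if isinstance(e, Eps):
                if not k:
                    return w == ""          # nothing left: accept iff input gone
                return run(k[0], k[1:], w)  # pop the continuation stack
            if isinstance(e, Lit):
                if w and w[0] == e.ch:
                    return run(Eps(), k, w[1:])  # consume one character
                return False
            if isinstance(e, Cat):
                return run(e.left, (e.right,) + k, w)  # push e2, do e1 first
            if isinstance(e, Alt):
                return run(e.left, k, w) or run(e.right, k, w)
            if isinstance(e, Star):
                # exit the star (behave like ε), or match the body once more
                return run(Eps(), k, w) or run(e.body, (e,) + k, w)
            raise TypeError(e)
        finally:
            path.discard(cfg)

    return run(e, (), w)
```

With the cycle check removed, calling this on `Star(Star(Lit('a')))` and an input of a's reproduces the divergence the authors describe.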
To eliminate such infinite loops and redundant work, the paper moves to a pointer‑based representation of the regex as a heap π. The heap is a finite partial function from addresses to nodes; each node stores an operator (·, |, *, ε, or a literal) together with pointers to its children. A separate “continuation” function cont maps each node to the address of the next node to be processed after the current one. Crucially, equality of sub‑expressions is decided by pointer (address) equality, not structural equality, which enables the matcher to recognise that it has already visited a particular sub‑tree and to avoid re‑exploring it.
The PWπ machine operates on configurations ⟨p; w⟩ where p is a pointer into the heap and w is the remaining input. Two kinds of transitions are defined: (1) ε‑transitions p → q that move through the syntax tree without consuming input (e.g., following a concatenation or star edge), and (2) character transitions p →ᵃ q that match a literal a against the first character of w and advance both the pointer and the input. Acceptance occurs when the pointer reaches the distinguished null address and the input is empty. The authors prove a simulation theorem: a run of the EKW machine exists exactly when a corresponding run of the PWπ machine exists, with the continuation stack of EKW being reconstructible from the chain of cont pointers in PWπ.
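The heap and the PWπ machine can be sketched together. The node encoding below, `('lit', ch, cont)` / `('alt', p1, p2)` / `('star', body, cont)`, is our own: for brevity the continuation address is stored inside each node rather than kept as a separate cont function, `None` plays the role of the distinguished null address, and the compilation scheme is an assumed layout, not the paper's exact construction:

```python
from dataclasses import dataclass

@dataclass
class Eps: pass

@dataclass
class Lit:
    ch: str

@dataclass
class Cat:
    left: object
    right: object

@dataclass
class Alt:
    left: object
    right: object

@dataclass
class Star:
    body: object

def compile_re(e, heap, k):
    """Lay e out in the heap; k is the address reached after matching e
    (None, the null address, means accept).  Returns the entry address."""
    if isinstance(e, Eps):
        return k                      # ε needs no node of its own
    if isinstance(e, Cat):
        # enter e.left first, continuing into e.right, then into k
        return compile_re(e.left, heap, compile_re(e.right, heap, k))
    addr = len(heap)
    heap[addr] = None                 # reserve: a star's body points back here
    if isinstance(e, Lit):
        heap[addr] = ('lit', e.ch, k)
    elif isinstance(e, Alt):
        heap[addr] = ('alt', compile_re(e.left, heap, k),
                             compile_re(e.right, heap, k))
    elif isinstance(e, Star):
        heap[addr] = ('star', compile_re(e.body, heap, addr), k)
    else:
        raise TypeError(e)
    return addr

def pw_accepts(heap, start, w):
    """Run the PWπ machine over configurations (pointer, input position).
    The visited set uses pointer (address) equality, not structural
    equality, to cut off redundant searches, as the paper emphasises."""
    seen, stack = set(), [(start, 0)]
    while stack:
        p, i = stack.pop()
        if (p, i) in seen:
            continue
        seen.add((p, i))
        if p is None:                 # null address reached
            if i == len(w):
                return True           # accept: input exhausted too
            continue
        node = heap[p]
        if node[0] == 'lit':          # character transition
            if i < len(w) and w[i] == node[1]:
                stack.append((node[2], i + 1))
        else:                         # 'alt'/'star': two ε-transitions
            stack.append((node[1], i))
            stack.append((node[2], i))
    return False
```

Note that a** now terminates: the two unfoldings of the star reach the same address, so pointer equality detects the repeat that structural comparison of rebuilt continuation stacks would miss.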
Having a pointer‑based representation, the authors define a “lockstep” construction that mimics Thompson’s classic NFA simulation. They introduce the ε‑closure operator 𝔈(S) that, given a set S of pointers, expands it to all pointers reachable by zero or more ε‑transitions that are ready to consume a character. A macro‑step S ⇒ S′ first computes 𝔈(S) and then, for a given input character a, moves each pointer that can consume a to its continuation, discarding those that cannot. This macro‑step corresponds exactly to the ε‑closure plus character transition step of a nondeterministic finite automaton (NFA).
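The ε‑closure and the macro‑step can be sketched over a small hand‑built heap. The heap below encodes (a|b)*c; the addresses and the node layout `('lit', ch, cont)` / `('alt', p1, p2)` / `('star', body, cont)` are our own encoding, with `None` as the accepting null address:

```python
# Hand-laid-out heap π for (a|b)*c (encoding ours, for illustration)
HEAP = {
    0: ('star', 1, 4),      # (a|b)*: enter the alternation, or skip to 'c'
    1: ('alt', 2, 3),
    2: ('lit', 'a', 0),     # after 'a', loop back to the star
    3: ('lit', 'b', 0),
    4: ('lit', 'c', None),  # after 'c', reach the null address: accept
}

def eps_closure(S):
    """𝔈(S): follow ε-edges until only literals and null remain."""
    seen, out, stack = set(), set(), list(S)
    while stack:
        p = stack.pop()
        if p in seen:
            continue
        seen.add(p)
        if p is None or HEAP[p][0] == 'lit':
            out.add(p)            # ready to consume a character (or accept)
        else:                     # 'alt'/'star' each offer two ε-successors
            stack.extend(HEAP[p][1:3])
    return out

def macro_step(S, a):
    """S ⇒ S′: ε-close, then advance every pointer that can consume a,
    discarding those that cannot."""
    return {HEAP[p][2] for p in eps_closure(S)
            if p is not None and HEAP[p][1] == a}

def lockstep_match(w, start=0):
    S = {start}
    for a in w:
        S = macro_step(S, a)
    return None in eps_closure(S)   # null pointer survived: match
```

Each macro‑step is exactly one ε‑closure plus one character step of the corresponding NFA, so the whole match costs at most one pass over the heap per input character.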
From the generic lockstep semantics the authors derive a sequential lockstep machine. It maintains two lists of pointers: c (the current set being ε‑expanded) and n (the set that will be processed after the next input character). An auxiliary list t records pointers already seen in the current macro‑step to avoid duplication. The helper function ψ(p, l₁, l₂) inserts p into l₁ only if it does not already appear in l₁ or l₂. The sequential algorithm repeatedly (i) ε‑expands c, (ii) consumes the next input character a by moving each matching pointer to n, (iii) swaps c←n and clears n, and (iv) repeats until the input is exhausted. If null appears in c at that point, the match succeeds. This algorithm is essentially Thompson’s lockstep simulation but expressed as a small‑step operational semantics, making the role of each data structure explicit.
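The sequential machine's lists c, n, t and the helper ψ can be sketched directly, again over a hand‑built heap for (a|b)*c (the node layout and addresses are our own encoding, with `None` as the null address):

```python
# Hand-built heap for (a|b)*c (encoding ours, for illustration)
HEAP = {
    0: ('star', 1, 4),
    1: ('alt', 2, 3),
    2: ('lit', 'a', 0),
    3: ('lit', 'b', 0),
    4: ('lit', 'c', None),
}

def psi(p, l1, l2):
    """ψ(p, l1, l2): append p to l1 unless it already occurs in l1 or l2."""
    if p not in l1 and p not in l2:
        l1.append(p)

def expand(c):
    """ε-expand the list c; t records pointers already seen in this
    macro-step, so no pointer is processed twice."""
    t, ready = [], []
    while c:
        p = c.pop(0)
        if p in t:
            continue
        t.append(p)
        if p is None or HEAP[p][0] == 'lit':
            psi(p, ready, [])
        else:                         # alt/star: schedule both ε-successors
            psi(HEAP[p][1], c, t)
            psi(HEAP[p][2], c, t)
    return ready

def seq_lockstep(w, start=0):
    c = expand([start])
    for a in w:
        n = []                        # pointers live after consuming a
        for p in c:
            if p is not None and HEAP[p][1] == a:
                psi(HEAP[p][2], n, [])
        c = expand(n)                 # swap c ← n and clear n
    return None in c                  # null pointer present: match
```

Because ψ never inserts a duplicate, each list holds at most one entry per heap address, which is what bounds each macro‑step by the size of the regex.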
The paper then explores a parallel version of the lockstep machine. Using a simple process algebra (a fragment of the π‑calculus), each pointer in the current set is handled by an independent thread that can perform either an ε‑transition or a character transition. The results of all threads are synchronised by a parallel composition operator that merges the resulting pointer sets. This formulation naturally maps onto massively parallel hardware such as GPUs. The authors implement the parallel lockstep matcher in CUDA, assigning each pointer to a GPU thread, performing ε‑expansion in parallel, and then processing the input characters in lockstep across all active threads.
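The paper's parallel implementation targets CUDA; as a language‑uniform stand‑in, the sketch below uses Python thread‑pool tasks, one per pointer, with set union playing the role of the merging parallel composition. The hand‑built heap encodes (a|b)*c; the node layout, addresses, and helper names are our own:

```python
from concurrent.futures import ThreadPoolExecutor

# Hand-built heap for (a|b)*c (encoding ours, for illustration)
HEAP = {
    0: ('star', 1, 4),
    1: ('alt', 2, 3),
    2: ('lit', 'a', 0),
    3: ('lit', 'b', 0),
    4: ('lit', 'c', None),
}

def closure(p):
    """ε-expand a single pointer down to literals and null."""
    seen, out, stack = set(), set(), [p]
    while stack:
        q = stack.pop()
        if q in seen:
            continue
        seen.add(q)
        if q is None or HEAP[q][0] == 'lit':
            out.add(q)
        else:
            stack.extend(HEAP[q][1:3])
    return out

def fire(p, a):
    """One worker's unit of work: advance a single pointer on character a,
    returning the ε-closure of its successor (empty set if it cannot fire)."""
    if p is not None and HEAP[p][1] == a:
        return closure(HEAP[p][2])
    return set()

def parallel_match(w):
    # each pointer in the current set is handled by an independent task;
    # the set union merges their results, standing in for the parallel
    # composition operator of the process-algebra formulation
    with ThreadPoolExecutor() as pool:
        S = closure(0)
        for a in w:
            S = set().union(*pool.map(lambda p: fire(p, a), S))
    return None in S
```

On a GPU the same structure would assign one thread per pointer and synchronise between input characters; the Python threads here only model that lockstep shape, not its performance.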
Experimental results compare the CUDA implementation with a classic backtracking matcher and with a sequential lockstep matcher. For patterns that contain many nondeterministic constructs (e.g., nested stars), the backtracking matcher exhibits exponential blow‑up, while both the sequential and parallel lockstep matchers run in time proportional to the product of the input length and the size of the regex. The GPU version shows substantial speed‑ups on large inputs, especially when the number of active pointers grows large enough to keep many cores busy. The authors also note that memory consumption remains comparable across implementations because the pointer heap is shared and only a modest amount of per‑thread state is needed.
In conclusion, the paper provides a rigorous operational‑semantics foundation for regex matching, shows how pointer equality can be used to eliminate redundant searches, reconstructs Thompson’s classic lockstep NFA simulation as a small‑step abstract machine, and extends the model to a parallel setting that can exploit modern many‑core architectures. The work bridges the gap between theoretical semantics and practical, high‑performance regex engines, and it offers a clean framework that can be extended to richer regex features such as back‑references or look‑ahead assertions in future research.