Deterministic Regular Expressions With Back-References
Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references.
Research Summary
The paper investigates a middle ground between two seemingly opposite variants of regular expressions: (i) regexes with back-references, which dramatically increase expressive power at the cost of intractable decision problems, and (ii) deterministic regular expressions (DREs), used in XML DTDs and XML Schema, which sacrifice expressive power for efficient analysis. The authors introduce a new class called deterministic regexes with back-references (DRX). To capture the semantics of back-references while preserving determinism, they define a variant of memory automata called trap-state memory automata (TMFA), together with a deterministic counterpart (DTMFA). TMFA extend classical memory automata with a special trap state that represents “no valid variable binding”. In a DTMFA every transition is uniquely determined, which makes the model suitable for complementation and other closure operations.
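To make the automaton model more concrete, here is a minimal sketch of how a deterministic memory automaton could be simulated in a single left-to-right pass. Everything in it is an assumption made for illustration: the dictionary encoding, the names `accepts` and `MARKED_COPY`, the convention that memory actions apply before a symbol is consumed, and the example language { w c w : w ∈ {a,b}* }. The trap state is modelled simply as immediate rejection, which is coarser than the paper's formal definition.

```python
def accepts(automaton, word):
    """Run a deterministic memory automaton on word without backtracking."""
    state, pos = automaton["start"], 0
    memory, open_vars = {}, set()
    while True:
        kind, data = automaton["states"][state]
        if kind == "recall":
            var, target = data
            chunk = memory.get(var, "")      # unset memories hold the empty word
            if not word.startswith(chunk, pos):
                return False                 # trap: the recall cannot succeed
        else:                                # kind == "read"
            if pos == len(word):
                break                        # input exhausted
            step = data.get(word[pos])
            if step is None:
                return False                 # no transition for this symbol
            target, actions = step
            for op, var in actions:          # applied before consuming the symbol
                if op == "open":
                    memory[var] = ""
                    open_vars.add(var)
                else:                        # op == "close"
                    open_vars.discard(var)
            chunk = word[pos]
        for var in open_vars:                # open memories record consumed input
            memory[var] += chunk
        pos += len(chunk)
        state = target
    return state in automaton["accepting"]


# Hypothetical automaton for { w c w : w in {a,b}* } (no cycles of recall states).
MARKED_COPY = {
    "start": 0,
    "accepting": {3},
    "states": {
        0: ("read", {"a": (1, [("open", "x")]),
                     "b": (1, [("open", "x")]),
                     "c": (2, [])}),
        1: ("read", {"a": (1, []), "b": (1, []),
                     "c": (2, [("close", "x")])}),
        2: ("recall", ("x", 3)),
        3: ("read", {}),
    },
}

print(accepts(MARKED_COPY, "abcab"))   # True
print(accepts(MARKED_COPY, "abcba"))   # False
```

Because each state offers either at most one transition per symbol or a single recall, the simulation never has to guess or backtrack; a failed recall ends the run at once.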
A Glushkov‑style construction is generalized to translate any DRX into a deterministic TMFA. This construction treats each occurrence of a terminal symbol or a variable binding as a distinct position, and the resulting automaton can be built in time O(|Σ|·|α|·k), where k is the number of variables. Consequently, checking whether a given regex α is deterministic (i.e., its Glushkov automaton is deterministic) can be done in the same time bound, and the automaton itself is produced simultaneously.
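For orientation, the classical Glushkov position construction and the resulting determinism test can be sketched as follows. This covers only ordinary regular expressions without back-references; the paper's generalization additionally handles variable bindings and references, which are omitted here. The tuple-based AST encoding and the function names `glushkov` and `is_deterministic` are assumptions made for this sketch.

```python
from itertools import count


def glushkov(ast):
    """Return (first, follow, symbol_of) for a regex AST.

    AST nodes: ("sym", a), ("cat", r, s), ("alt", r, s), ("star", r).
    """
    symbol_of = {}      # position -> terminal symbol
    follow = {}         # position -> positions that may come directly after it
    fresh = count(1)

    def walk(node):
        """Return (nullable, first, last) and fill in the follow sets."""
        kind = node[0]
        if kind == "sym":
            p = next(fresh)
            symbol_of[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "star":
            _, first, last = walk(node[1])
            for p in last:
                follow[p] |= first          # loop back to the start of the body
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                        # kind == "cat"
            follow[p] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last

    _, first, _ = walk(ast)
    return first, follow, symbol_of


def is_deterministic(ast):
    """True iff no candidate set holds two positions with the same symbol."""
    first, follow, symbol_of = glushkov(ast)
    for candidates in [first, *follow.values()]:
        symbols = [symbol_of[p] for p in candidates]
        if len(symbols) != len(set(symbols)):
            return False
    return True


A, B = ("sym", "a"), ("sym", "b")
BAD = ("cat", ("star", ("alt", A, B)), A)                  # (a|b)*a
BLOCK = ("cat", ("star", B), A)                            # b*a
GOOD = ("cat", BLOCK, ("star", BLOCK))                     # b*a(b*a)*

print(is_deterministic(BAD))    # False
print(is_deterministic(GOOD))   # True
```

The two example expressions describe the same language, which shows that determinism is a property of the expression and not of the language alone.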
The paper presents three main algorithmic results. First, the determinism test and automaton construction run in polynomial time as described above. Second, the membership problem for DRX (deciding whether a word w belongs to the language of a given DRX α) can be solved in O(|Σ|·|α|·n + k·|w|) time, where n is the total number of terminal and variable-reference occurrences in α and k is the number of variables. This is a dramatic improvement over the NP-complete membership problem for unrestricted regexes with back-references. Third, while the general intersection-emptiness problem for DRX is undecidable, the authors identify a natural restriction called variable-star-free DRX (no variable occurs inside a Kleene star). For this subclass, intersection-emptiness, inclusion, and equivalence are PSPACE-complete, and minimization, likewise PSPACE-complete, can be carried out by enumerating smaller candidates and testing equivalence.
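The variable-star-free restriction is purely syntactic, so it can be checked by a single traversal of the expression. The sketch below follows the informal definition given here (no variable binding or reference under a Kleene star); the AST encoding with `("bind", name, body)` and `("ref", name)` nodes and the function name `is_variable_star_free` are assumptions for illustration, not the paper's notation.

```python
def is_variable_star_free(node, under_star=False):
    """True iff no variable binding or reference occurs beneath a star.

    AST nodes: ("sym", a), ("cat", r, s), ("alt", r, s), ("star", r),
    ("bind", name, body), ("ref", name).
    """
    kind = node[0]
    if kind == "sym":
        return True
    if kind == "ref":
        return not under_star
    if kind == "bind":
        # The binding itself must not sit under a star; its body is checked too.
        return not under_star and is_variable_star_free(node[2], under_star)
    if kind == "star":
        return is_variable_star_free(node[1], True)
    # "cat" or "alt": both children must pass
    return all(is_variable_star_free(child, under_star) for child in node[1:])


A, B, C = ("sym", "a"), ("sym", "b"), ("sym", "c")

# Binds x to a word over {a, b}, reads c, then recalls x.
# The star sits inside the binding, but no variable sits inside a star.
MARKED = ("cat", ("bind", "x", ("star", ("alt", A, B))), ("cat", C, ("ref", "x")))

# Here both the binding and the reference are repeated under a star.
LOOPED = ("star", ("cat", ("bind", "x", A), ("ref", "x")))

print(is_variable_star_free(MARKED))   # True
print(is_variable_star_free(LOOPED))   # False
```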
From an expressive-power perspective, the class of DRX languages properly contains the deterministic regular languages and contains every unary regular language, yet it does not capture every regular language. Moreover, DRX can define non-regular languages such as the copy language {ww | w ∈ {a,b}+} and the square language {a^(n²) | n ≥ 0} by using variable bindings and back-references in a deterministic fashion. The authors show that DRX is not closed under union, concatenation, reversal, complement, homomorphism, inverse homomorphism, or intersection (even with deterministic regular languages). This lack of closure reflects the trade-off between determinism and expressive power.
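As a point of reference, the copy language is easy to express with a back-reference in an everyday regex library. The sketch below uses Python's standard `re` module; the helper name `in_copy_language` is an assumption. Python's engine is a backtracking matcher, so this only illustrates what a back-reference expresses and says nothing about the deterministic model or its running time.

```python
import re

# Group 1 captures a non-empty word over {a, b};
# \1 must then match exactly the same word again.
COPY = re.compile(r"([ab]+)\1")


def in_copy_language(word: str) -> bool:
    """Return True iff word has the form ww with w in {a,b}+."""
    return COPY.fullmatch(word) is not None


print(in_copy_language("abab"))   # True
print(in_copy_language("abba"))   # False
```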
The paper also explores a relaxed notion of determinism that allows a constant amount of look‑ahead. Under this weaker definition, the determinism check remains decidable (still in polynomial time for the variable‑star‑free case), indicating that the core ideas extend beyond the strict “no look‑ahead” model.
In summary, the authors provide a solid theoretical foundation for deterministic regexes with back‑references: a suitable automaton model (TMFA/DTMFA), a constructive Glushkov‑style translation, efficient algorithms for determinism testing and membership, and a detailed analysis of expressive power and static‑analysis problems. The work bridges the gap between the practical need for back‑references in programming languages and the desire for tractable analysis in XML schema technologies, opening new avenues for research on tractable extensions of regular expressions.