Small NFAs from Regular Expressions: Some Experimental Results


Regular expressions (REs), because of their succinctness and clear syntax, are the common choice for representing regular languages. However, the efficiency of pattern matching and word recognition depends on the size of the equivalent nondeterministic finite automaton (NFA). We present the implementation, within the FAdo system, of several algorithms for constructing small epsilon-free NFAs from REs, and a comparison of regular expression measures and NFA sizes based on experimental results obtained from uniformly randomly generated REs. For this analysis, nonredundant REs and reduced REs in star normal form were considered.


💡 Research Summary

The paper addresses a practical problem in the use of regular expressions (REs): while REs are compact and human‑readable, the nondeterministic finite automata (NFAs) generated from them for pattern matching can be unnecessarily large, which directly impacts the speed and memory consumption of matching engines. To investigate how different construction algorithms affect NFA size, the authors implemented several algorithms within the FAdo system, including: (1) the classic Thompson construction, (2) the Glushkov construction, (3) a hybrid approach that first converts the RE into star‑normal form (SNF) and then applies Glushkov, and (4) an enhanced SNF‑Glushkov method that adds a post‑processing step to merge redundant states and eliminate unnecessary transitions. All generated NFAs are epsilon‑free, and a uniform post‑processing routine removes isolated states and duplicate edges, ensuring a fair comparison across methods.
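The Thompson construction listed above can be sketched compactly. This is an illustrative toy version, not FAdo's implementation: each RE node yields a fragment with one start and one accept state, glued together with epsilon transitions (the `NFA` class and helper names are assumptions made for this sketch).

```python
EPS = None  # label used for epsilon transitions in this sketch

class NFA:
    def __init__(self):
        self.delta = {}   # state -> list of (symbol, state) edges
        self.n = 0        # number of states allocated so far
    def new_state(self):
        s = self.n
        self.n += 1
        self.delta[s] = []
        return s
    def add(self, p, a, q):
        self.delta[p].append((a, q))

def symbol(nfa, a):
    # one fresh start and accept state, joined by a single a-edge
    s, f = nfa.new_state(), nfa.new_state()
    nfa.add(s, a, f)
    return s, f

def concat(nfa, f1, f2):
    # link accept of the first fragment to start of the second
    (s1, e1), (s2, e2) = f1, f2
    nfa.add(e1, EPS, s2)
    return s1, e2

def union(nfa, f1, f2):
    # new start/accept states branching into both alternatives
    s, f = nfa.new_state(), nfa.new_state()
    for (si, ei) in (f1, f2):
        nfa.add(s, EPS, si)
        nfa.add(ei, EPS, f)
    return s, f

def star(nfa, frag):
    # new start/accept states with skip and loop-back epsilon edges
    si, ei = frag
    s, f = nfa.new_state(), nfa.new_state()
    nfa.add(s, EPS, si); nfa.add(s, EPS, f)
    nfa.add(ei, EPS, si); nfa.add(ei, EPS, f)
    return s, f

# Build (a+b)* a: Thompson adds 2 states per symbol occurrence and
# 2 per union/star node, so size grows linearly with RE length.
nfa = NFA()
ab = union(nfa, symbol(nfa, 'a'), symbol(nfa, 'b'))
frag = concat(nfa, star(nfa, ab), symbol(nfa, 'a'))
print(nfa.n)  # 10 states for this small RE
```

The epsilon edges are precisely what the paper's epsilon-free constructions avoid, which is why Thompson serves as the size baseline.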

For the experimental evaluation, two families of REs were generated uniformly at random. The first family consists of non‑redundant REs, meaning that no sub‑expression appears more than once, which reduces internal duplication. The second family is obtained by converting each RE of the first family into reduced SNF, a canonical form that pushes Kleene stars outward and eliminates nested stars where possible. Both families cover lengths from 5 to 30 symbols, with 1,000 instances per length, yielding a total of 10,000 REs per family. The authors measured three size metrics for each resulting NFA: the number of states, the number of transitions, and the overall memory footprint.
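The three size metrics mentioned above can be illustrated with a small helper. This is not FAdo code: the dict-based transition table is an assumption of this sketch, and the memory footprint is approximated here as states plus transitions, one common proxy.

```python
def nfa_metrics(delta):
    """delta: dict mapping state -> list of (symbol, state) edges."""
    states = set(delta)
    for targets in delta.values():
        states.update(q for _, q in targets)       # include sink-only states
    transitions = sum(len(v) for v in delta.values())
    return {"states": len(states),
            "transitions": transitions,
            "size": len(states) + transitions}     # crude memory proxy

# Epsilon-free NFA for a(a+b)*: state 0 reads an 'a', state 1 loops.
delta = {0: [('a', 1)], 1: [('a', 1), ('b', 1)]}
print(nfa_metrics(delta))  # {'states': 2, 'transitions': 3, 'size': 5}
```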

The results show a clear linear relationship between RE length and NFA size for all algorithms, confirming the expected theoretical bound. However, the choice of construction algorithm and the preprocessing of the RE have a substantial impact on the constant factor. On average, the SNF‑based constructions reduce the number of states by 12 %–18 % compared with the raw Thompson construction. The reduction is most pronounced for REs with a high proportion of nested Kleene stars, where the SNF transformation eliminates many redundant loops. Glushkov’s method, which already avoids ε‑transitions, yields NFAs about 9 % smaller than Thompson’s but still lags behind the SNF‑enhanced approaches for complex REs. The hybrid SNF‑Glushkov method consistently outperforms the others: it never produces an NFA larger than the Thompson baseline and often beats it by more than 15 %.
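Glushkov's ε-free behavior comes from the position-automaton idea: each symbol occurrence gets a unique position, and the NFA has exactly one state per position plus an initial state. The sketch below is a compact textbook version, not FAdo's implementation; the tuple-based AST encoding is an assumption made for illustration.

```python
def glushkov(re):
    # re is a tuple AST: ('sym', a), ('cat', l, r), ('alt', l, r), ('star', e)
    pos, syms = [0], {}

    def mark(t):
        # give each symbol occurrence a unique position number
        if t[0] == 'sym':
            pos[0] += 1
            syms[pos[0]] = t[1]
            return ('sym', t[1], pos[0])
        if t[0] == 'star':
            return ('star', mark(t[1]))
        return (t[0], mark(t[1]), mark(t[2]))

    def nullable(t):
        if t[0] == 'sym':  return False
        if t[0] == 'star': return True
        if t[0] == 'alt':  return nullable(t[1]) or nullable(t[2])
        return nullable(t[1]) and nullable(t[2])          # cat

    def first(t):
        # positions that can begin a match
        if t[0] == 'sym':  return {t[2]}
        if t[0] == 'star': return first(t[1])
        if t[0] == 'alt':  return first(t[1]) | first(t[2])
        f = first(t[1])
        return f | first(t[2]) if nullable(t[1]) else f

    def last(t):
        # positions that can end a match
        if t[0] == 'sym':  return {t[2]}
        if t[0] == 'star': return last(t[1])
        if t[0] == 'alt':  return last(t[1]) | last(t[2])
        l = last(t[2])
        return l | last(t[1]) if nullable(t[2]) else l

    follow = {}
    def fill(t):
        # follow(p): positions that may come right after position p
        if t[0] == 'sym':
            return
        if t[0] == 'star':
            fill(t[1])
            for p in last(t[1]):
                follow.setdefault(p, set()).update(first(t[1]))
        else:
            fill(t[1]); fill(t[2])
            if t[0] == 'cat':
                for p in last(t[1]):
                    follow.setdefault(p, set()).update(first(t[2]))

    t = mark(re)
    fill(t)
    return t, syms, first(t), last(t), nullable(t), follow

# (a+b)* a has 3 positions, hence a 4-state epsilon-free NFA.
ast = ('cat', ('star', ('alt', ('sym', 'a'), ('sym', 'b'))), ('sym', 'a'))
t, syms, F, L, nul, follow = glushkov(ast)
print(len(syms) + 1)  # 4
```

Because the state count is fixed at positions + 1, the SNF preprocessing in the paper saves size mainly by simplifying the RE (and hence the follow sets) before this construction runs.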

Statistical analysis of operator distribution reveals that the fraction of Kleene stars in an RE is a strong predictor of NFA size reduction when SNF is applied. When the star‑operator accounts for more than 30 % of the operators, the SNF‑based algorithms achieve the greatest savings. Conversely, REs dominated by concatenation and alternation show smaller differences between the algorithms, indicating that the benefit of SNF is context‑dependent.
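The star-operator fraction used as a predictor above is straightforward to compute. The helper below is an illustration, not code from the paper, and assumes REs written with explicit operators (`*` for star, `+` for union, `.` for concatenation):

```python
def star_fraction(re_str):
    """Fraction of Kleene stars among the explicit operators of an RE string."""
    ops = [c for c in re_str if c in '*+.']
    return re_str.count('*') / len(ops) if ops else 0.0

print(star_fraction('(a+b)*.a*'))  # 2 stars out of 4 operators -> 0.5
```

By the paper's finding, REs scoring above roughly 0.3 on this measure are the ones where SNF preprocessing pays off most.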

Based on these findings, the authors recommend incorporating a star‑normal‑form preprocessing step into any RE‑to‑NFA pipeline, especially when the target language is expected to contain many repetitions. They also discuss the limitations of their current implementation: all transformations are performed sequentially, which may become a bottleneck for very large RE corpora. Future work is outlined in three directions: (a) developing a unified optimization framework that simultaneously minimizes the RE and the resulting NFA, (b) exploring state‑reduction techniques that can be applied after NFA construction but before determinization, and (c) evaluating the impact of these smaller NFAs in real‑time streaming and embedded environments where memory is at a premium.

In summary, the paper provides a thorough experimental comparison of several RE‑to‑NFA constructions, demonstrates the practical advantage of star‑normal‑form preprocessing, and supplies a reproducible implementation within FAdo that can serve both researchers and practitioners aiming to build more efficient pattern‑matching engines.

