State Elimination Ordering Strategies: Some Experimental Results
Recently, the problem of obtaining a short regular expression equivalent to a given finite automaton has been intensively investigated. Algorithms for converting finite automata to regular expressions have an exponential blow-up in the worst case. To overcome this, simple heuristic methods have been proposed. In this paper we analyse some of the heuristics presented in the literature and propose new ones. We also present some experimental comparative results based on uniformly random generated deterministic finite automata.
💡 Research Summary
The paper addresses the long‑standing problem of converting finite automata (FA) into equivalent regular expressions (RE) without incurring the exponential blow‑up that classic state‑elimination algorithms can cause. The authors focus on the ordering in which states are eliminated, a factor known to dramatically affect the size of the resulting RE. They begin by reviewing the most common heuristic ordering strategies that have appeared in the literature: minimum in‑degree, minimum out‑degree, minimum total degree, minimum weight (where weight combines degree and label length), and minimum path length. While these heuristics have been shown to improve average‑case performance, their behavior on different classes of automata has not been systematically evaluated.
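The elimination procedure these heuristics plug into can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transition map `{(p, q): label}`, the function names, and the single heuristic shown (minimum in-degree) are all assumptions chosen to make the ordering idea concrete.

```python
# A minimal sketch of state elimination with a pluggable ordering heuristic.
# The dict-of-edges representation and all names here are illustrative.

def _union(old, new):
    # Merge a new label into an existing parallel edge, if any.
    return new if old is None else f"({old}+{new})"

def eliminate(trans, s):
    """Remove state s, rerouting every p -> s -> q path as a direct edge."""
    loop = trans.pop((s, s), None)
    star = f"({loop})*" if loop else ""
    ins = {p: a for (p, q), a in trans.items() if q == s}
    outs = {q: b for (p, q), b in trans.items() if p == s}
    for key in [k for k in trans if s in k]:
        del trans[key]
    for p, a in ins.items():
        for q, b in outs.items():
            trans[(p, q)] = _union(trans.get((p, q)), a + star + b)

def min_in_degree(trans, s):
    """Classic heuristic: score a state by its in-degree (loops excluded)."""
    return sum(1 for (p, q) in trans if q == s and p != s)

def to_regex(trans, internal, heuristic):
    """Eliminate all internal states, greedily picking the lowest-scoring one."""
    trans = dict(trans)
    internal = set(internal)
    while internal:
        s = min(internal, key=lambda x: heuristic(trans, x))
        internal.remove(s)
        eliminate(trans, s)
    return trans
```

For example, on the three-state automaton `{("I", 1): "a", (1, 1): "b", (1, "F"): "a"}`, eliminating the internal state 1 yields the single edge `("I", "F")` labeled `a(b)*a`. Swapping `min_in_degree` for any other scoring function changes only the elimination order, which is exactly the degree of freedom the surveyed heuristics exploit.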
To fill this gap, the authors first re‑implement the five classic heuristics and then introduce two novel strategies. The first, called the Combined‑Score heuristic, assigns each state a score that is a weighted sum of its in‑degree, out‑degree, and the total length of the transition labels incident on that state. The weights (α, β, γ) are set experimentally to 1 : 1 : 0.5, reflecting a slightly lower emphasis on label length. The state with the lowest score is eliminated next. The second, the Look‑Ahead heuristic, performs a one‑step prediction: for each candidate state it estimates how much the total label length would increase if that state were removed, then chooses the state that yields the smallest estimated increase. This anticipatory approach is more computationally intensive but aims to avoid the “cascade” effect where a single elimination creates very long intermediate labels.
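The two new scoring rules can be sketched over the same `{(p, q): label}` transition map. The weights default to the reported 1 : 1 : 0.5 ratio; the character-based length estimate in the look-ahead (treating `(L)*` as adding three characters) is an assumption reconstructed from the description, not the authors' exact formula.

```python
# Illustrative sketches of the Combined-Score and Look-Ahead scoring rules.
# Weights and the length estimate are assumptions, not the paper's code.

def combined_score(trans, s, alpha=1.0, beta=1.0, gamma=0.5):
    """Weighted sum of in-degree, out-degree, and incident label length."""
    indeg = sum(1 for (p, q) in trans if q == s and p != s)
    outdeg = sum(1 for (p, q) in trans if p == s and q != s)
    llen = sum(len(l) for (p, q), l in trans.items() if s in (p, q))
    return alpha * indeg + beta * outdeg + gamma * llen

def lookahead_delta(trans, s):
    """Estimated change in total label length if s were eliminated now."""
    loop = trans.get((s, s))
    star = len(loop) + 3 if loop else 0          # "(L)*" adds 3 characters
    ins = [a for (p, q), a in trans.items() if q == s and p != s]
    outs = [b for (p, q), b in trans.items() if p == s and q != s]
    added = sum(len(a) + star + len(b) for a in ins for b in outs)
    removed = sum(len(l) for (p, q), l in trans.items() if s in (p, q))
    return added - removed
```

The contrast between the two is visible in their inputs: `combined_score` reads only local degree and label statistics, while `lookahead_delta` simulates the arithmetic of an elimination (one term per in/out pair), which is where its extra per-step cost comes from.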
The experimental methodology is thorough. Uniform random deterministic finite automata (DFA) are generated for three alphabet sizes (|Σ| = 2, 3, 4) and five state counts (n = 10, 20, 30, 40, 50). For each (|Σ|, n) pair, 500 DFA instances are created, resulting in 7,500 test cases. Seven ordering strategies are evaluated: the five classic heuristics, the two new heuristics, and a baseline random ordering. After applying each strategy, the resulting RE is measured by two metrics: total character length (L) and total number of operators (+, ·, ∗) (O). The product L·O is used as a composite quality indicator, and both average values and the worst‑5 % percentile are reported to capture typical performance and stability.
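One plausible reading of the two metrics, for an RE encoded as a string: L counts all characters and O counts explicit operator symbols. Treating concatenation as implicit in the string encoding (and therefore uncounted) is an assumption of this sketch, not a detail taken from the paper.

```python
# Hypothetical sketch of the size metrics L, O, and the composite L*O,
# for a regular expression given as a plain string.

def re_metrics(expr):
    L = len(expr)                                # total character length
    O = sum(expr.count(op) for op in "+*")       # explicit operators only
    return L, O, L * O
```

For instance, `re_metrics("(a+b)*")` gives L = 6 and O = 2, so the composite indicator is 12.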
Results reveal several key insights. First, the Look‑Ahead heuristic consistently yields the smallest average L·O, especially for larger alphabets and denser automata (transition density > 0.6). In the most demanding setting (|Σ| = 4, n = 50) it reduces the composite size by roughly 18 % compared with the best classic heuristic and by more than 30 % relative to random ordering. Moreover, it also improves the worst‑case 5 % dramatically, indicating higher robustness. The Combined‑Score heuristic ranks second overall; its advantage is most pronounced for sparse automata and smaller alphabets, where it matches the Look‑Ahead performance while incurring far lower computational overhead. Classic minimum‑degree heuristics perform adequately on low‑density DFA but degrade sharply as density increases, often producing the largest blow‑up among the evaluated methods. The weight‑based and minimum‑path heuristics sit in the middle, offering modest gains but no clear dominance. Random ordering is the clear baseline with the poorest results, confirming that an uninformed elimination order is detrimental.
Runtime analysis shows a trade‑off. The Look‑Ahead heuristic requires O(n²) additional work per elimination because it must recompute the projected label increase for each candidate state, leading to execution times roughly two to three times longer than the simpler heuristics. The Combined‑Score heuristic, by contrast, operates in O(n) time per step and is therefore suitable for large‑scale or time‑critical applications. The authors therefore recommend the Combined‑Score heuristic as a practical default, reserving the Look‑Ahead approach for scenarios where the smallest possible RE is essential and extra computation is acceptable.
In conclusion, the paper demonstrates that the choice of state‑elimination ordering is a decisive factor for the size of the generated regular expression, and that its impact is strongly modulated by automaton characteristics such as alphabet size and transition density. The newly proposed heuristics, particularly the Look‑Ahead method, provide measurable improvements over existing strategies, while the Combined‑Score method offers a favorable balance between quality and efficiency. The authors suggest future work on adaptive ordering schemes that dynamically switch heuristics based on intermediate RE growth, and on extending the study to nondeterministic automata and ε‑transitions, where the state‑elimination landscape is even richer.