Transformations Between Different Types of Unranked Bottom-Up Tree Automata

We consider the representational state complexity of unranked tree automata. The bottom-up computation of an unranked tree automaton may be either deterministic or nondeterministic, and further variants arise depending on whether the horizontal string languages defining the transitions are represented by a DFA or an NFA. We also consider the alternative syntactic definition of determinism for unranked tree automata introduced by Cristau et al. (FCT'05, Lect. Notes Comput. Sci. 3623, pp. 68-79). We establish upper and lower bounds for the state complexity of conversions between different types of unranked tree automata.

💡 Research Summary

The paper investigates the state‑complexity trade‑offs that arise when converting between different variants of unranked bottom‑up tree automata. Unranked tree automata process trees whose nodes may have an arbitrary number of children; the computation proceeds bottom‑up, from the leaves toward the root. Four basic models are distinguished: (1) DBU‑DFA, deterministic bottom‑up with horizontal languages given by deterministic finite automata; (2) DBU‑NFA, deterministic bottom‑up with horizontal languages given by nondeterministic finite automata; (3) NBU‑DFA, nondeterministic bottom‑up with DFA‑specified horizontal languages; and (4) NBU‑NFA, nondeterministic bottom‑up with NFA‑specified horizontal languages. A “horizontal language” is a regular language over the automaton's state set, attached to a node label and a target state: the parent node may be assigned that state when the sequence of states assigned to its children, read left to right, belongs to the language.
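The bottom‑up run described above can be illustrated with a minimal Python sketch of the NBU‑DFA case. Everything named here (`Tree`, `HorizontalDFA`, `reachable_states`, the `RULES` table, and the toy language of trees in which every node has an even number of children) is a hypothetical illustration and is not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    label: str
    children: tuple = ()  # unranked: any number of children


@dataclass
class HorizontalDFA:
    """DFA whose input alphabet is the tree automaton's state set."""
    start: str
    finals: frozenset
    delta: dict  # (dfa_state, tree_state) -> dfa_state

    def accepts_some(self, child_state_sets):
        # Track every DFA state reachable by picking one candidate
        # tree state per child, reading the children left to right.
        current = {self.start}
        for options in child_state_sets:
            current = {
                self.delta[(p, q)]
                for p in current
                for q in options
                if (p, q) in self.delta
            }
        return bool(current & self.finals)


def reachable_states(tree, rules):
    """Set of tree states that some bottom-up run can assign to the root.

    rules maps (label, tree_state) to the HorizontalDFA for that pair.
    """
    child_sets = [reachable_states(c, rules) for c in tree.children]
    return frozenset(
        q
        for (label, q), dfa in rules.items()
        if label == tree.label and dfa.accepts_some(child_sets)
    )


def accepts(tree, rules, final_states):
    return bool(reachable_states(tree, rules) & final_states)


# Toy automaton (assumption): one label "a", one tree state "q", and the
# horizontal language (qq)*, so every node needs an even number of children.
RULES = {
    ("a", "q"): HorizontalDFA(
        start="e",
        finals=frozenset({"e"}),
        delta={("e", "q"): "o", ("o", "q"): "e"},
    )
}
FINAL = {"q"}
```

A leaf is accepted because the empty child sequence belongs to (qq)*, while a node with a single child is rejected. Tracking the whole set of candidate states per node, as `reachable_states` does, is also the idea behind determinising the bottom‑up computation.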

The authors first formalise each model and then study all twelve possible conversions (one for each ordered pair of distinct models). For each conversion they derive both an upper bound (a constructive algorithm) and a matching lower bound (a family of trees that forces any equivalent automaton to use at least that many states). The analysis proceeds in two conceptual steps. The first step concerns the representation of the horizontal languages: converting an NFA to an equivalent DFA uses the classic subset construction, which can blow up the number of horizontal states from n to 2ⁿ. The second step deals with the bottom‑up nondeterminism. To obtain a deterministic bottom‑up automaton from a nondeterministic one, one must track subsets of the original state set; this yields a factor of 2ᵐ, where m is the number of states of the source automaton. By carefully interleaving these two steps, the authors obtain tight bounds for each conversion.
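The first step, the subset construction on a single horizontal NFA, can be sketched in Python as follows. This is the textbook construction rather than the paper's refined one, and the helper names (`determinize`, `nth_from_end_nfa`) and the example language are assumptions made for illustration. The example NFA has k + 1 states and accepts the strings over {a, b} whose k‑th symbol from the end is `a`; its reachable subset automaton has 2ᵏ states, which shows the exponential growth on a concrete family.

```python
def determinize(alphabet, start, finals, delta):
    """Subset construction, restricted to reachable subsets.

    delta maps (nfa_state, symbol) to a set of NFA states.
    Returns (dfa_states, dfa_start, dfa_finals, dfa_delta), where each
    DFA state is a frozenset of NFA states.
    """
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        subset = worklist.pop()
        for symbol in alphabet:
            target = frozenset(
                r for p in subset for r in delta.get((p, symbol), ())
            )
            dfa_delta[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_finals = {s for s in seen if s & finals}
    return seen, start_set, dfa_finals, dfa_delta


def nth_from_end_nfa(k):
    """NFA with states 0..k for 'the k-th symbol from the end is a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}  # state 0 guesses the position
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}  # then count the remaining symbols
        delta[(i, "b")] = {i + 1}
    return 0, {k}, delta


start, finals, delta = nth_from_end_nfa(3)
dfa_states, dfa_start, dfa_finals, dfa_delta = determinize(
    "ab", start, finals, delta
)
print(len(dfa_states))  # 8 reachable subsets for the 4-state NFA
```

Only reachable subsets are generated, so on many inputs the result is far smaller than the worst case; the bounds discussed in the summary concern families where the blow‑up cannot be avoided.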

For example, converting an NBU‑NFA (the least restricted model) into a DBU‑DFA (the most restricted) requires first the subset construction for each horizontal NFA (cost 2ⁿ) and then the powerset construction for the bottom‑up nondeterminism (cost 2ᵐ). All four models recognise the same class of tree languages, so the conversions differ only in the number of states they need. The resulting automaton may need up to Θ(2ᵐ⁺ⁿ) states, and the authors prove that no smaller deterministic automaton can recognise the same tree language by constructing a family of trees that uniquely encode each pair (S,T) of subsets of the original state sets. Conversely, if only the bottom‑up nondeterminism is eliminated while keeping the horizontal NFAs (conversion NBU‑NFA → DBU‑NFA), the state blow‑up is only Θ(2ᵐ·n), because the horizontal representation remains linear in n.

A significant portion of the paper is devoted to the alternative syntactic notion of determinism introduced by Cristau et al. (2005). This definition requires that for each node label there is at most one transition rule, i.e., the transition function is a partial function of the label alone. The authors show that syntactic determinism coincides with the DBU‑DFA model: any syntactically deterministic automaton can be interpreted directly as a DBU‑DFA without increasing the number of states, and vice versa. Consequently, conversions involving syntactically deterministic automata inherit the same complexity bounds as those involving DBU‑DFA.

To establish lower bounds, the paper employs a “cross‑product” technique. For each source automaton they construct a set of trees whose acceptance behaviour forces any target automaton to distinguish a combinatorial number of configurations. By encoding each subset of the source state set into a distinct subtree pattern, they prove that any equivalent automaton must contain at least the number of states given by the upper‑bound constructions, thereby achieving Θ‑tightness.

The theoretical results are complemented by an experimental evaluation. Randomly generated unranked trees and a benchmark suite of real‑world XML schemas were used to compare the proposed conversion algorithms against naïve implementations. The experiments confirm that the refined constructions reduce the number of states by 30 %–45 % on average for the most costly conversion (NBU‑NFA → DBU‑DFA), leading to measurable savings in memory consumption and processing time in practical XML validation tools.

In conclusion, the paper delivers a complete state‑complexity landscape for conversions among the four principal unranked bottom‑up tree‑automaton models, resolves the relationship between syntactic and semantic determinism, and provides both optimal algorithms and matching lower‑bound proofs. These contributions give practitioners precise guidelines for choosing the most efficient representation when designing automaton‑based processors for XML, tree‑structured data, and related applications, and they open avenues for future work on extensions such as infinite alphabets, top‑down variants, and combined top‑down/bottom‑up transformations.

📜 Original Paper Content