On the Complexity of Maximal/Closed Frequent Tree Mining for Bounded Height Trees

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this paper, we address the problem of enumerating all frequent maximal/closed trees. This is a classical and central problem in data mining. Although many practical algorithms have been developed for this problem, its complexity under “realistic assumptions” on tree height has not been clarified. More specifically, while it was known that the mining problem becomes hard when the tree height is at least 60, the complexity for cases where the tree height is smaller has not yet been clarified. We resolve this gap by establishing results for these tree mining problems under several settings, including ordered and unordered trees, as well as maximal and closed variants.


💡 Research Summary

The paper investigates the computational complexity of enumerating all frequent maximal and closed trees when the height of the input trees is bounded. While previous work showed that the problem becomes hard when tree height reaches 60, the authors focus on the practically relevant regime where tree height is much smaller (often ≤ 30 in real‑world XML or JSON data). They consider four variants of the problem: (i) closed frequent trees on unordered trees, (ii) closed frequent trees on ordered trees, (iii) maximal frequent trees on unordered trees, and (iv) maximal frequent trees on ordered trees. For each variant they study the effect of a height bound h on the existence of output‑polynomial‑time (OP) or polynomial‑delay (PD) algorithms, and they relate some cases to the well‑known Dualization problem (also called Minimal Transversal or Maximal Independent Set enumeration in hypergraphs).
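For readers unfamiliar with Dualization, the following is a minimal brute-force sketch of what the problem asks for: list every minimal transversal (minimal hitting set) of a hypergraph. It is an illustration only, not the paper's method; the function name `minimal_transversals` and the set-based input format are assumptions made here, and the running time is exponential in the number of vertices.

```python
from itertools import combinations

def minimal_transversals(vertices, hyperedges):
    """Enumerate all minimal hitting sets by brute force (exponential time).

    Candidates are visited in order of increasing size, so a hitting set
    that contains no previously found transversal must itself be minimal.
    """
    edges = [set(e) for e in hyperedges]
    found = []
    for k in range(len(vertices) + 1):
        for cand in combinations(sorted(vertices), k):
            s = set(cand)
            hits_all = all(s & e for e in edges)
            if hits_all and not any(t <= s for t in found):
                found.append(s)
    return found

# Hypergraph on {1, 2, 3} with hyperedges {1, 2} and {2, 3}.
result = minimal_transversals({1, 2, 3}, [{1, 2}, {2, 3}])
print(result)
```

The open question referred to in the summary is whether this list can be produced in time polynomial in the combined size of the input and the output, rather than by exhaustive search as above.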

The main contributions are summarized in a table (Table 1) that classifies the complexity for height intervals h = 1, 2, 3–4, 5–59, and ≥ 60. The key results are:

  • Theorem 1 (Positive result). For unordered rooted trees of height at most 2, closed frequent trees can be enumerated with polynomial delay. The algorithm maps each tree to a multiset of integers via a function χ that records, for each child of the root, the number of its own children plus one. The partial order ⊑ on integer multisets captures the sub‑tree isomorphism relation for height‑2 trees. Using this representation, the algorithm can test closedness and generate the next solution in time polynomial in the input size, which is what polynomial delay requires.

  • Theorem 2 (Dual‑hardness). If an output‑polynomial‑time algorithm existed for the same setting (height ≤ 2, unordered, closed), then the Dualization problem could also be solved in output‑polynomial time. Whether Dualization admits an output‑polynomial‑time algorithm is a long‑standing open question (the best known algorithms run in quasi‑polynomial time), so such an algorithm would settle that question.

  • Theorem 3 (Negative results).

    • For ordered trees of height ≤ 2, enumerating maximal frequent trees cannot be done in output‑polynomial time unless P = NP.
    • For unordered trees of height ≤ 5, the same hardness holds for maximal frequent trees. The proofs reduce from classic NP‑complete problems (e.g., Maximum Clique) to the tree‑mining setting, preserving the height bound.
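The χ‑encoding described under Theorem 1 can be sketched as follows. This is a hedged illustration rather than the paper's code: the nested-list tree representation, the function names `chi` and `is_subtree`, and the assumption that a sub-tree embedding maps root to root are all choices made here. Under those assumptions, comparing the encodings sorted in descending order position by position decides ⊑.

```python
def chi(tree):
    """Encode a height <= 2 unordered tree as a descending integer list.

    `tree` is the list of the root's children; each child is the list of
    its own (leaf) children. Each child contributes (#children + 1).
    """
    return sorted((len(child) + 1 for child in tree), reverse=True)

def is_subtree(a, b):
    """Return True if encoding `a` is below encoding `b` in the order.

    Both inputs must be sorted in descending order. `a` embeds into `b`
    exactly when `a` is no longer than `b` and is dominated position by
    position (assumes the root is mapped to the root).
    """
    return len(a) <= len(b) and all(x <= y for x, y in zip(a, b))

# Root with two children: a node with one leaf child, and a leaf.
small = [[[]], []]
# Root with three children having 2, 1, and 0 leaf children.
large = [[[], []], [[]], []]

print(chi(small), chi(large))
print(is_subtree(chi(small), chi(large)))
print(is_subtree(chi(large), chi(small)))
```

Adding one to each child count keeps a leaf child (encoded as 1) distinguishable from an absent child.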

The authors also study the “maximal common tree” problem, which is the special case of maximal frequent tree mining in which the support threshold equals the total number of input trees. They prove that when every input tree has height ≤ 2, the maximal common tree is uniquely determined. The uniqueness follows from the χ‑representation: taking the component‑wise minimum of the trees' integer encodings and then applying χ⁻¹ yields the maximal common subtree. This property underpins the polynomial‑delay algorithm for closed trees, because any closed tree must be a supertree of the maximal common tree, and checking whether adding a child violates closedness reduces to a simple integer comparison.
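A minimal sketch of the component‑wise minimum construction, under assumptions made here: encodings are lists of integers (one entry per child of the root, child count plus one), each list is sorted in descending order before comparison, and positions beyond the shortest list are dropped. The names `maximal_common` and `chi_inverse` are hypothetical helpers, not identifiers from the paper.

```python
def maximal_common(encodings):
    """Component-wise minimum of the descending-sorted encodings.

    zip() stops at the shortest encoding, so the result has as many
    entries as the tree with the fewest children of the root.
    """
    ordered = [sorted(e, reverse=True) for e in encodings]
    return [min(column) for column in zip(*ordered)]

def chi_inverse(encoding):
    """Rebuild a nested-list tree: value v becomes a child with v - 1 leaves."""
    return [[[] for _ in range(v - 1)] for v in encoding]

# Three input trees, given directly by their encodings.
inputs = [[3, 2, 1], [2, 2], [1, 4, 1]]
common = maximal_common(inputs)
print(common)
print(chi_inverse(common))
```

Sorting first matters: the third input is treated as [4, 1, 1], so the first column compares the largest child of every tree.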

The paper’s technical development proceeds as follows: after a thorough introduction to pattern mining (itemsets, sequences, graphs) and a review of known hardness results, the authors formalize rooted ordered and unordered trees, sub‑tree isomorphism, support, and the definitions of frequent, closed, and maximal patterns. They then present the complexity table and prove Theorem 1 by constructing the χ‑mapping, showing that the partial order ⊑ forms a lattice on the integer multisets, and showing that closedness can be tested by checking whether any strictly larger multiset (in the ⊑ order) has the same support. Theorem 2’s reduction to Dualization uses the same χ‑encoding to simulate hypergraph transversals. Theorem 3’s hardness proofs employ gadget constructions that embed a graph’s adjacency structure into a tree of bounded height, ensuring that any maximal frequent tree corresponds to a solution of the original NP‑hard problem.
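One standard way to test closedness in a lattice is via the closure operator: a pattern is closed exactly when it equals the meet of all input trees that contain it. The sketch below applies that idea to the integer encodings. It is an assumption-laden illustration, not the paper's algorithm: encodings are descending-sorted lists, embeddings map root to root, and the names `is_subtree`, `meet`, and `is_closed_frequent` are hypothetical.

```python
def is_subtree(a, b):
    """Order test on descending-sorted encodings (root mapped to root)."""
    return len(a) <= len(b) and all(x <= y for x, y in zip(a, b))

def meet(encodings):
    """Greatest common lower bound: position-wise minimum, truncated."""
    return [min(column) for column in zip(*encodings)]

def is_closed_frequent(pattern, database, min_support):
    """A pattern is closed iff it equals the meet of its occurrences."""
    occurrences = [t for t in database if is_subtree(pattern, t)]
    if len(occurrences) < max(min_support, 1):
        return False  # not frequent (or occurs nowhere)
    return meet(occurrences) == pattern

# Hypothetical database of three encoded trees.
database = [[3, 2, 1], [3, 2], [2, 2, 1]]

print(is_closed_frequent([2, 1], database, 2))  # closure is [2, 2]
print(is_closed_frequent([2, 2], database, 2))
print(is_closed_frequent([3, 2], database, 2))
```

Here [2, 1] occurs in all three trees but so does the strictly larger [2, 2], so [2, 1] is not closed, while [2, 2] (support 3) and [3, 2] (support 2) are.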

The authors discuss practical implications: many real‑world tree‑structured data sets (e.g., XML documents) have depth well below the hard thresholds, so the positive result (polynomial delay for closed trees of height ≤ 2) offers a feasible algorithmic foundation for such applications. Conversely, the hardness results warn that even modest increases in height (to 5 for unordered trees) already push maximal tree mining into the realm of intractability unless P = NP.

In conclusion, the paper fills a gap in the theoretical understanding of tree‑mining complexity by providing a fine‑grained analysis based on tree height. It shows that while certain cases admit efficient enumeration (closed trees of height ≤ 2 with polynomial delay), other cases are tied to longstanding open problems in enumeration complexity (Dualization) or admit no output‑polynomial‑time algorithm unless P = NP. This work thus connects practical algorithm design for shallow trees with the theoretical limits of pattern mining on tree structures.

