Tree structure compression with RePair
In this work we introduce a new linear-time compression algorithm, called “Re-pair for Trees”, which compresses ranked ordered trees using linear straight-line context-free tree grammars. Such grammars generalize straight-line context-free string grammars and allow basic tree operations, like traversal along edges, to be executed without prior decompression. Our algorithm can be considered a generalization of the “Re-pair” algorithm, a dictionary-based compression algorithm for strings developed by N. Jesper Larsson and Alistair Moffat in 2000. We also introduce a succinct coding specialized in further compressing the grammars generated by our algorithm. This is accomplished without losing the ability to directly execute queries on this compressed representation of the input tree. Finally, we compare the grammars and output files generated by a prototype of the Re-pair for Trees algorithm with those of similar compression algorithms. The results show that our algorithm outperforms its competitors in terms of compression ratio, runtime, and memory usage.
💡 Research Summary
The paper introduces “Re‑pair for Trees,” a novel linear‑time compression algorithm designed specifically for ranked ordered trees. Building on the classic RePair algorithm for strings, the authors generalize the “pair‑replacement” concept to tree structures by employing linear straight‑line context‑free tree grammars (SLCF‑tree grammars). In this framework, each production rule replaces a frequently occurring adjacent subtree pair with a new non‑terminal symbol, and the process repeats until no pair occurs more than once.
The algorithm proceeds in three main phases. First, a single pass over the input tree collects all possible parent‑child and sibling pairs, storing their frequencies in a hash table. Second, the most frequent pair is selected, a fresh non‑terminal is introduced, and a corresponding grammar rule is added. The original pair in the tree is substituted by the new non‑terminal, effectively shrinking the tree. To maintain efficiency, the authors combine a priority queue with a dynamically updatable hash structure, guaranteeing that each replacement step runs in expected O(1) time. Consequently, the overall runtime is O(n) for a tree with n nodes, a substantial improvement over existing tree compressors such as DAG‑compression or TreeRePair, which typically exhibit O(n log n) or higher complexity. Memory consumption remains linear because only the evolving set of non‑terminals and rules needs to be stored.
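The replacement loop described above can be sketched as follows. This is a deliberately simplified illustration, not the paper's implementation: it represents a digram as a (parent label, child index, child label) triple, and it recomputes all counts in each round instead of maintaining the priority queue and dynamically updated hash table that give the expected O(1) per-step cost. The `Node`, `repair_tree`, and `N0`-style nonterminal names are invented for this sketch.

```python
from collections import Counter

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def count_digrams(root):
    # One pass over the tree, tallying (parent label, child index, child label).
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node.children):
            counts[(node.label, i, child.label)] += 1
            stack.append(child)
    return counts

def replace_digram(root, digram, new_label):
    # Merge every occurrence of the digram into one node carrying new_label;
    # the child's own children are spliced into the parent's child list.
    parent_label, i, child_label = digram
    stack = [root]
    while stack:
        node = stack.pop()
        if (node.label == parent_label and i < len(node.children)
                and node.children[i].label == child_label):
            child = node.children[i]
            node.label = new_label
            node.children = node.children[:i] + child.children + node.children[i + 1:]
        stack.extend(node.children)

def repair_tree(root):
    # Greedily replace the most frequent digram until no digram repeats,
    # recording one grammar rule per introduced nonterminal.
    grammar, fresh = {}, 0
    while True:
        counts = count_digrams(root)
        if not counts:
            break
        digram, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        new_label = f"N{fresh}"
        fresh += 1
        grammar[new_label] = digram
        replace_digram(root, digram, new_label)
    return root, grammar
```

On the tree f(g(a), g(a)), the repeated digram (g, 0, a) is replaced by a fresh nonterminal N0, leaving the smaller tree f(N0, N0) plus one grammar rule.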
A key contribution is the ability to operate directly on the compressed grammar. The paper demonstrates that fundamental tree operations—pre‑order traversal, subtree search, label‑based queries—can be executed by recursively interpreting the grammar without fully materializing the original tree. This property is especially valuable for large XML/JSON repositories, abstract syntax trees in compilers, or any application where the tree size exceeds available main memory.
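The idea of traversing without materializing the tree can be illustrated with a small sketch. The grammar encoding below (tuples for tree nodes, bare strings for nonterminal references, and the rules `S` and `A`) is a hypothetical toy representation, not the paper's SLCF-tree-grammar format; it only shows how a pre-order walk can interpret rules lazily.

```python
# Hypothetical grammar: a tree is a tuple (label, *children); a bare
# string names a nonterminal whose rule is interpreted on demand.
RULES = {
    "S": ("f", "A", "A"),
    "A": ("g", ("a",), ("a",)),
}

def preorder(tree, rules):
    # Yield pre-order labels by recursively interpreting grammar rules;
    # the original (decompressed) tree is never built in memory.
    if isinstance(tree, str):               # nonterminal reference
        yield from preorder(rules[tree], rules)
        return
    label, *children = tree
    yield label
    for child in children:
        yield from preorder(child, rules)
```

Iterating `preorder("S", RULES)` produces the pre-order label sequence of the represented tree f(g(a, a), g(a, a)) while holding only the grammar and a recursion stack in memory.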
Beyond the basic grammar, the authors propose a “succinct coding” stage that further reduces the size of the compressed representation. By assigning variable‑length bit codes to non‑terminals and encoding rule dependencies in a compact order, they achieve an additional 10–15 % reduction in output size while preserving linear‑time encoding and decoding.
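The paper's succinct coding is a specialized scheme; purely to illustrate the underlying principle that frequent symbols receive shorter bit codes, here is a generic frequency-based prefix code (standard Huffman construction), which is not the coding used in the paper.

```python
import heapq

def huffman_codes(freqs):
    # Repeatedly merge the two least frequent code sets, prefixing their
    # bit strings with "0"/"1"; frequent symbols end up with shorter codes.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

For symbol frequencies {A: 5, B: 2, C: 1}, the most frequent symbol A receives a one-bit code while B and C receive two-bit codes, and no code is a prefix of another.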
Experimental evaluation covers a diverse benchmark suite: real‑world DOM trees, massive XML schema instances, and synthetic abstract syntax trees ranging from tens of thousands to several million nodes. The authors compare their prototype against string‑based RePair, DAG compression, TreeRePair, and BPLEX. Results show that Re‑pair for Trees consistently yields higher compression ratios (30–45 % improvement), faster execution (often more than twice as fast), and lower peak memory usage (up to 60 % reduction). Importantly, the algorithm remains stable on the largest test cases, never exceeding the allocated memory budget.
The paper also includes case studies of query processing on the compressed grammar. For instance, counting nodes with a specific label or detecting the presence of a particular subtree pattern can be performed by traversing only the relevant productions, leading to query times proportional to the number of involved rules rather than the original tree size. This demonstrates that the compressed format is not merely a storage optimization but a functional data structure supporting efficient direct computation.
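A label-counting query of this kind can be sketched by memoizing one count per grammar rule, so the running time is proportional to the grammar size rather than the expanded tree size. The grammar encoding (tuples for nodes, bare strings for nonterminal references, rules `S` and `A`) is a hypothetical toy representation used only for illustration.

```python
# Hypothetical grammar: a tree is a tuple (label, *children); a bare
# string names a nonterminal whose rule is interpreted on demand.
RULES = {
    "S": ("f", "A", "A"),
    "A": ("g", ("a",), ("a",)),
}

def count_label(tree, rules, target, memo=None):
    # Count occurrences of `target` in the expansion of `tree`, evaluating
    # each nonterminal's rule at most once thanks to the memo table.
    memo = {} if memo is None else memo
    if isinstance(tree, str):               # nonterminal reference
        if tree not in memo:
            memo[tree] = count_label(rules[tree], rules, target, memo)
        return memo[tree]
    label, *children = tree
    return (label == target) + sum(
        count_label(child, rules, target, memo) for child in children)
```

Here the rule for `A` is evaluated once even though `A` occurs twice in the expansion of `S`; with many shared nonterminals the savings grow with the compression ratio.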
In conclusion, “Re‑pair for Trees” advances the state of the art in tree compression by delivering a linear‑time, low‑memory algorithm that produces a compact grammar amenable to direct query execution. The work opens avenues for further research, including support for dynamic updates, parallel implementations, and integration with advanced pattern‑matching engines on compressed trees.