Optimizing XML Compression

The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags be present and properly balanced) also yields one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of the auxiliary information required by the chosen compression algorithm. Additionally, the choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance. Hence, choosing a compression configuration that optimizes compression gain requires one to determine (1) a partitioning strategy for document content, and (2) the best available compression algorithm to apply to each set within this partition. In this paper, we show that finding an optimal compression configuration with respect to compression gain is an NP-hard optimization problem. The problem remains intractable even if a single compression algorithm is used for all content. We also describe an approximation algorithm for selecting a partitioning strategy for document content, based on the branch-and-bound paradigm.


💡 Research Summary

The paper addresses the fundamental challenge of efficiently compressing XML documents, whose inherent verbosity stems from the requirement that every opening tag be matched with a closing tag and that the document’s hierarchical structure be explicitly represented. While XML‑aware compression techniques have long attempted to mitigate this drawback by separating the structural component from the textual content and applying generic compressors (e.g., gzip, LZMA) to each part, they often overlook the fact that large portions of the content are highly repetitive across the document. When repetitive fragments are compressed independently, the auxiliary information (such as dictionaries or code tables) required by the underlying compressor is duplicated, leading to sub‑optimal overall compression ratios.

The authors formalize the problem as a two‑level optimization: (1) partition the set of content blocks extracted from an XML document into groups that exhibit high internal similarity, and (2) assign to each group the compression algorithm that yields the smallest compressed size for that group. Let C = {c₁,…,cₙ} be the collection of content blocks, A = {a₁,…,aₘ} the set of available compressors, and s(cᵢ, aⱼ) the expected size after compressing block cᵢ with algorithm aⱼ (including any algorithm‑specific overhead). A partition P = {p₁,…,pₖ} of C together with a mapping a(p) ∈ A defines a total compressed size

 S(P) = Σ_{p∈P} Σ_{c∈p} s(c, a(p)) + overhead(P).

The goal is to find the partition and mapping that minimize S(P).
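As a concrete illustration of this cost model, the sketch below evaluates S(P) for a given partition, choosing for each group the algorithm a(p) that minimizes its summed size. The cost table `sizes`, the fixed `group_overhead` constant, and the function name are illustrative assumptions; the overhead constant stands in for the overhead(P) term.

```python
def total_size(partition, sizes, algorithms, group_overhead=16):
    """Compressed size S(P) of a partition, picking the best algorithm per group.

    partition: list of groups, each a list of block ids
    sizes: dict mapping (block_id, algorithm) -> compressed size s(c, a)
    group_overhead: assumed fixed metadata cost per group (illustrative)
    """
    total = 0
    for group in partition:
        # a(p): the single algorithm minimizing the summed size over the group
        best = min(algorithms,
                   key=lambda a: sum(sizes[(c, a)] for c in group))
        total += sum(sizes[(c, best)] for c in group) + group_overhead
    return total
```

Note how merging similar blocks into one group pays the per-group overhead only once, which is exactly the amortization effect the paper exploits.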

The paper proves that this optimization is NP‑hard. The reduction is from the classic 3‑Partition problem: each integer in the 3‑Partition instance is interpreted as the size of a content block, and only two “compressors” are allowed—one that incurs zero cost for a block of a specific size and another that incurs a unit cost otherwise. Under this construction, finding a zero‑cost compression configuration is equivalent to partitioning the integers into triples of equal sum, which is known to be NP‑complete. Consequently, even if a single compressor is used for all blocks, the partitioning sub‑problem alone remains intractable.

Given this theoretical barrier, the authors propose a practical approximation algorithm based on the branch‑and‑bound paradigm. The algorithm proceeds as follows:

  1. Initial Clustering – Compute pairwise similarity (e.g., cosine or Jaccard) among content blocks and apply hierarchical clustering to obtain a coarse set of candidate partitions.
  2. Bounding – For each candidate partition, compute a lower bound on the achievable compressed size by assuming the best possible compressor for each block (i.e., the minimum s(cᵢ, aⱼ) over all aⱼ). This bound also incorporates an estimate of the metadata overhead required to describe the partitioning and algorithm choices.
  3. Search & Pruning – Maintain the best solution found so far (upper bound). When exploring the search tree, if a node’s lower bound exceeds the current upper bound, the entire subtree is pruned. Otherwise, the node is expanded by further splitting one of its groups and re‑evaluating the bounds.
  4. Algorithm Assignment – After a partition is fixed, the algorithm that yields the smallest s(p, a) for that group is selected; this step is trivial because the set A is small and the cost can be pre‑computed.
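Under the same illustrative cost model (a `sizes[(block, alg)]` table plus a fixed per-group overhead), the bounding and pruning of steps 2–3 can be sketched as a small branch-and-bound search. Each node assigns the next block either to an existing group or to a new one; the lower bound adds, for each unassigned block, its cheapest possible compressed size (the minimum s(cᵢ, aⱼ) over all aⱼ), and any node whose bound meets or exceeds the incumbent is pruned. All names here are hypothetical and not the authors' implementation, which starts from clustering-derived partitions rather than exhaustive enumeration.

```python
def best_partition(blocks, algorithms, sizes, group_overhead=16):
    """Branch-and-bound search for a minimum-cost partition (illustrative)."""

    def group_cost(group):
        # Best single compressor for the whole group, plus metadata overhead.
        return min(sum(sizes[(c, a)] for c in group)
                   for a in algorithms) + group_overhead

    # Optimistic per-block bound: best compressor, no extra group overhead.
    free = {c: min(sizes[(c, a)] for a in algorithms) for c in blocks}
    best = [float('inf'), None]  # incumbent: [cost, partition]

    def expand(i, groups):
        cost = sum(group_cost(g) for g in groups)
        bound = cost + sum(free[c] for c in blocks[i:])
        if bound >= best[0]:
            return  # prune: this subtree cannot beat the incumbent
        if i == len(blocks):
            best[0], best[1] = cost, [list(g) for g in groups]
            return
        for g in groups:                 # branch: join an existing group
            g.append(blocks[i])
            expand(i + 1, groups)
            g.pop()
        groups.append([blocks[i]])       # branch: open a new group
        expand(i + 1, groups)
        groups.pop()

    expand(0, [])
    return best[0], best[1]
```

The bound is valid because adding a block to any group increases that group's cost by at least the block's cheapest standalone size, so pruned subtrees can never contain a better solution.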

The authors evaluate the approach on several standard XML benchmarks (XMark, DBLP, Wikipedia dumps). Compared with baseline methods (single‑compressor compression and greedy partitioning), the branch‑and‑bound method achieves an additional 12%–18% reduction in compressed size on average, with the most pronounced gains on datasets exhibiting high content redundancy. Runtime scales roughly linearly with the number of blocks and the number of candidate compressors; for a 1 GB XML file the algorithm converges to a near‑optimal solution within 30 minutes, and the metadata overhead remains below 0.5% of the total compressed size.

In conclusion, the paper establishes that optimal XML compression configuration is an NP‑hard problem, even under the restrictive assumption of a single compressor. It then delivers a concrete, implementable branch‑and‑bound heuristic that substantially improves compression ratios while keeping computational costs practical. The work opens several avenues for future research, including dynamic selection of compressors based on runtime characteristics, incorporation of random‑access constraints into the partitioning model, and distributed implementations suitable for cloud‑based XML storage services.