Polylog space compression, pushdown compression, and Lempel-Ziv are incomparable
The pressing need for efficient compression schemes for XML documents has recently been focused on stack computation, and in particular calls for a formulation of information-lossless stack or pushdown compressors that allows a formal analysis of their performance and a more ambitious use of the stack in XML compression, where so far it is mainly connected to parsing mechanisms. In this paper we introduce the model of pushdown compressor, based on pushdown transducers that compute a single injective function while keeping the widest generality regarding stack computation. We also consider online compression algorithms that use at most polylogarithmic space (plogon). These algorithms correspond to compressors in the data stream model. We compare the performance of these two families of compressors with each other and with the general purpose Lempel-Ziv algorithm. This comparison is made without any a priori assumption on the data’s source and considering the asymptotic compression ratio for infinite sequences. We prove that in all cases they are incomparable.
💡 Research Summary
The paper addresses the problem of designing compression schemes that operate under very limited computational resources, a situation that arises in XML document compression and data-stream processing. It introduces two families of compressors that are formally defined and analyzed: (i) push-down compressors (PDCs), which are based on push-down transducers equipped with a stack, and (ii) polylogarithmic-space online compressors (plogons), which belong to the data-stream model and are restricted to polylogarithmic memory, that is O((log n)^c) for some constant c, while reading the input once. In addition, the classic Lempel-Ziv 78 (LZ78) algorithm is taken as a benchmark for general-purpose compression.
The authors first formalize a very general notion of an information‑lossless push‑down compressor (ILPDC). An ILPDC is a deterministic push‑down transducer that may use a bounded number of λ‑rules (transitions that manipulate the stack without consuming an input symbol) and may be equipped with an end‑marker symbol. The model is deliberately permissive: it allows arbitrary stack operations, λ‑rules, and end‑markers, but requires that the mapping from input strings to the pair (compressed output, final state) be injective. To guarantee that decompression is feasible, a stricter subclass called invertible push‑down compressors (invPD) is defined; an invPD must have a companion push‑down transducer that, given the compressed string and the final state, reconstructs the original input.
Next, the paper defines plogon compressors. A plogon transducer reads its input left-to-right, writes output left-to-right, and uses at most (log |w|)^c cells of work tape on input w, for some constant c. The class plogon consists of all such transducers, and an IL-plogon compressor is simply a plogon transducer that is injective. This model captures the constraints of streaming algorithms where the input is massive, memory is far smaller than the input, and only a single pass is allowed.
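To make the one-pass, small-memory constraint concrete, here is a minimal sketch of an online lossless encoder whose working state is a single symbol and a single counter, so its memory is logarithmic in the input length. Run-length encoding is used only as a familiar stand-in; it is not a construction from the paper, and the name `rle_stream` is hypothetical.

```python
def rle_stream(symbols):
    """One-pass run-length encoder (illustrative stand-in, not from the paper).

    Working memory: the current symbol and a run counter, i.e. O(log n) bits
    for an input of length n. Output is produced left to right.
    """
    current, count = None, 0
    for s in symbols:
        if s == current:
            count += 1
        else:
            if current is not None:
                yield (current, count)  # emit the finished run
            current, count = s, 1
    if current is not None:
        yield (current, count)  # emit the last run


def rle_decode(pairs):
    """Inverse of rle_stream, showing that the encoding is lossless."""
    return "".join(sym * n for sym, n in pairs)


runs = list(rle_stream("0001100000"))
print(runs)
```

Because the input can be recovered exactly from the emitted pairs, the mapping is injective, which is the property the IL-plogon definition requires.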
The LZ78 algorithm is recalled for completeness. LZ78 parses the input into phrases, builds a dictionary incrementally, and outputs a sequence of (index, symbol) pairs. It is known that LZ78 asymptotically dominates any finite‑state compressor, but the paper investigates its relationship with the two more powerful models above.
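The phrase parsing described above can be sketched as follows. This is a minimal illustration of textbook LZ78, not code from the paper; the function names are hypothetical, and the bit-level encoding of the (index, symbol) pairs is omitted.

```python
def lz78_parse(text):
    """Parse text into LZ78 (index, symbol) pairs.

    Index 0 denotes the empty phrase; each new phrase is a previously seen
    phrase extended by one symbol.
    """
    dictionary = {"": 0}
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover input that is already a dictionary phrase: emit it as
        # (index of its prefix, last symbol). The dictionary is prefix-closed.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text from (index, symbol) pairs."""
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])


pairs = lz78_parse("abababab")
print(pairs)
```

On `"abababab"` the parse is a, b, ab, aba, b, so phrase lengths grow as the dictionary fills; this growth is what lets LZ78 exploit long repetitions.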
The central contribution is a set of incomparability results. For each pair among {PDC, plogon, LZ78} the authors construct an infinite binary sequence S such that one of the two compressors achieves optimal compression (compression ratio tending to 0) on S while the other fails to compress S at all (compression ratio tending to 1). Moreover, the “optimal” compression is achieved in the strongest possible sense: almost every prefix of S is compressed to its Kolmogorov complexity, while the “non‑compressible” side compresses only finitely many prefixes. The three separations are:
- plogon beats both PDC and LZ78: A sequence is built where a plogon compressor can exploit a regular pattern that can be described with O(log n) bits, but any push-down compressor (even with end-markers) cannot keep enough information in its stack, and LZ78 cannot form a useful dictionary because the pattern repeats at distances larger than any feasible dictionary entry.
- PDC beats both plogon and LZ78: A sequence with deeply nested, well-balanced parentheses (or similar Dyck-language structure) is used. A push-down compressor can match the nesting using its stack and compress each matched pair, achieving optimal compression. A plogon compressor cannot store the unbounded nesting depth with only logarithmic memory, and LZ78 cannot capture the non-local matching, so its compression ratio stays near 1.
- LZ78 beats both PDC and plogon: A sequence consisting of long repetitions of previously seen blocks, spaced far enough that a push-down automaton cannot reuse stack information and a plogon machine cannot retain the whole dictionary, but LZ78's growing dictionary can reference earlier blocks efficiently, yielding optimal compression.
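The stack-based idea behind the push-down separation can be illustrated with a toy encoder. This is a sketch under stated assumptions, not the paper's construction: the alphabet `a`, `b`, `A`, `B` (where `A` closes `a` and `B` closes `b`) is hypothetical, and the run counter is kept outside the stack, so the code is not a formal push-down transducer. It only shows how a stack makes a matching closing run predictable and therefore cheap to encode.

```python
CLOSE = {"a": "A", "b": "B"}  # hypothetical bracket pairs: 'A' closes 'a', 'B' closes 'b'


def stack_compress(text):
    """Replace each run of closers predicted by the stack with a count '#k;'."""
    if set(text) - set("abAB"):
        raise ValueError("input must use only the symbols a, b, A, B")
    stack, out, run = [], [], 0
    for ch in text:
        if stack and CLOSE[stack[-1]] == ch:
            stack.pop()  # predicted closer: emit nothing, just count it
            run += 1
            continue
        if run:
            out.append(f"#{run};")  # flush the pending run of matched closers
            run = 0
        if ch in CLOSE:
            stack.append(ch)  # opener: remember it
        out.append(ch)  # openers and unmatched closers are copied literally
    if run:
        out.append(f"#{run};")
    return "".join(out)


def stack_decompress(code):
    """Inverse of stack_compress: replay the stack to regenerate the closers."""
    stack, out, i = [], [], 0
    while i < len(code):
        ch = code[i]
        if ch == "#":
            j = code.index(";", i)
            for _ in range(int(code[i + 1:j])):
                out.append(CLOSE[stack.pop()])
            i = j + 1
        else:
            if ch in CLOSE:
                stack.append(ch)
            out.append(ch)
            i += 1
    return "".join(out)


opening = "abba" * 5
nested = opening + "".join(CLOSE[c] for c in reversed(opening))
packed = stack_compress(nested)
print(len(nested), len(packed))
```

On a fully nested input of length 2n the output has length n plus the digits of the count, while a device with no stack has no way to predict the closing half from what it remembers.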
The proofs rely on Kolmogorov complexity arguments: for each constructed sequence the authors show that any “bad” compressor would have to solve an undecidable or information‑theoretic task (e.g., predict far‑future symbols from limited memory), which contradicts the injectivity requirement. Conversely, they explicitly design compressors in the winning class that achieve the desired compression ratio.
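For reference, the compression ratios used in the separations can be written as follows. The notation is an assumption, chosen to match common usage rather than copied from the paper: $C$ is a compressor, $S$ an infinite binary sequence, and $S[1..n]$ its prefix of length $n$.

```latex
\[
  \rho_C(S) \;=\; \liminf_{n \to \infty} \frac{|C(S[1..n])|}{n},
  \qquad
  R_C(S) \;=\; \limsup_{n \to \infty} \frac{|C(S[1..n])|}{n}.
\]
% Strongest form of a separation between compressors C and C' on S:
% C compresses almost every prefix, C' compresses only finitely many.
\[
  R_C(S) = 0
  \quad\text{and}\quad
  \rho_{C'}(S) = 1 .
\]
```

Under this reading, "ratio tending to 0" means that even the worst prefixes of $S$ are compressed by $C$, and "ratio tending to 1" means that even the best prefixes are not compressed by $C'$.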
The paper also discusses variants such as visibly push‑down compressors (where the input alphabet is partitioned into calls, returns, and internals, forcing stack operations to be visible) and notes that these are strictly weaker than the general ILPDC model, providing upper bounds on what can be achieved with XML‑specific grammars.
In conclusion, the authors demonstrate that push‑down compressors, plogon compressors, and LZ78 are pairwise incomparable with respect to asymptotic compression ratio on arbitrary infinite sequences. This result has practical implications: XML‑oriented compression should consider push‑down methods to exploit hierarchical structure; streaming environments with severe memory limits should employ plogon‑type algorithms; and for data with long repeated substrings, classic dictionary methods like LZ78 remain optimal. The work also opens avenues for further research on the trade‑offs between stack depth, memory size, and dictionary growth in lossless compression.