On the Value of Multiple Read/Write Streams for Data Compression
We study whether, when restricted to using polylogarithmic memory and polylogarithmic passes, we can achieve qualitatively better data compression with multiple read/write streams than we can with only one. We first show how we can achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for us to achieve good grammar-based compression. Finally, we show that two streams are necessary and sufficient for us to achieve entropy-only bounds.
💡 Research Summary
The paper investigates how the number of read/write streams available to an external‑memory compression algorithm influences its theoretical and practical performance when both the memory and the number of passes over the data are limited to polylogarithmic functions of the input size. The authors work within a realistic model: the algorithm may use only O((log n)^c) bits of internal memory and may make at most O((log n)^d) sequential scans of each stream, where n is the length of the input and c, d are constants. This setting captures modern scenarios such as processing massive log files or streaming video on devices that can keep only a tiny working set in RAM while the data resides on disk or networked storage.
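To make the model concrete, one can picture each stream as a tape-like object that allows only sequential reads and writes and counts each rewind as the start of a new pass. The class below is purely illustrative (the name `Stream` and its interface are ours, not the paper's):

```python
class Stream:
    """Toy model of a read/write stream: strictly sequential access,
    with an explicit counter of completed passes. Illustrative only."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.pos = 0
        self.passes = 0  # number of rewinds, i.e. sequential scans started after the first

    def rewind(self):
        """Return the head to the start of the stream, beginning a new pass."""
        self.pos = 0
        self.passes += 1

    def read(self):
        """Read one byte at the head, or None at end of stream."""
        if self.pos >= len(self.data):
            return None
        b = self.data[self.pos]
        self.pos += 1
        return b

    def write(self, b):
        """Write one byte at the head (overwriting or appending)."""
        if self.pos < len(self.data):
            self.data[self.pos] = b
        else:
            self.data.append(b)
        self.pos += 1
```

In this rendering, the model's constraints become simple accounting: an algorithm is legal if it keeps `passes` polylogarithmic on every stream and holds only polylogarithmically many bytes outside the streams.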
The first contribution shows that a universal compressor, one whose compression rate comes arbitrarily close to the Shannon entropy H of the source, can be built with a single pass over a single stream. The construction is a variant of LZ78: the input is divided into blocks of polylogarithmic length, a dynamic dictionary is maintained within each block, and each new phrase is encoded by a pointer to an existing dictionary entry plus one new symbol. Because each block contributes only polylogarithmically many phrases, the dictionary never grows beyond O((log n)^c) entries and the internal memory stays within the prescribed bound. The algorithm reads the input once, writes the compressed output sequentially, and guarantees an expected compressed size of n·H + o(n) bits. Thus, for the weakest notion of compression, a single stream suffices.
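The flavor of the phrase-by-phrase parse can be seen in a few lines. The sketch below is plain textbook LZ78, without the block-wise dictionary restarts the paper's variant needs to keep memory polylogarithmic; the function names are ours:

```python
def lz78_parse(s):
    """One-pass LZ78 parse: returns a list of (phrase_index, new_symbol)
    pairs, where index 0 denotes the empty phrase. Plain LZ78; the
    paper's variant additionally restarts the dictionary block by block
    so that memory stays polylogarithmic."""
    dictionary = {"": 0}
    phrase, out = "", []
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                       # extend the current phrase
        else:
            out.append((dictionary[phrase], ch))  # emit pointer + new symbol
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                 # input ended mid-phrase
        out.append((dictionary[phrase], ""))
    return out

def lz78_decode(pairs):
    """Invert lz78_parse by rebuilding the phrase dictionary."""
    phrases, text = [""], []
    for idx, ch in pairs:
        phrases.append(phrases[idx] + ch)
        text.append(phrases[-1])
    return "".join(text)
```

Both directions are strictly sequential, which is what makes a one-pass, one-stream implementation possible.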
The second part turns to grammar‑based compression, where the goal is to replace the input by a context‑free grammar that generates exactly that string. Such grammars capture deep repetitions and nested structures that LZ‑type methods may miss. The authors prove a lower bound: any algorithm that uses only one read/write stream and a polylogarithmic number of passes cannot, in general, construct a grammar whose size is within a constant factor of optimal. The proof builds a family of strings with long, interleaved repetitions whose detection requires either Ω(log n) additional passes or Ω(n) bits of internal memory. Consequently, a single stream is insufficient for achieving an "entropy‑only" bound, a guarantee of roughly n·H + o(n) bits that must hold for every individual string rather than merely in expectation over a source, when the compression method relies on grammar generation.
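A tiny example shows why grammars can be exponentially more succinct than the strings they derive, and what a one-stream algorithm is being asked to find. The sketch below (our own illustration) builds a straight-line grammar of k+1 rules whose single string has length 2^(k+1):

```python
def expand(grammar, symbol):
    """Expand a symbol of a straight-line grammar into the string it
    derives; terminals are the symbols that have no rule."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])

# A grammar with k+1 rules deriving (ab)^(2^k): each level doubles the
# derived string, so O(log n) rules describe a string of length n.
# This is the kind of deep, nested repetition that grammar-based
# compressors exploit but that, by the paper's lower bound, a
# one-stream algorithm with few passes cannot always detect.
k = 4
grammar = {"X0": ["a", "b"]}
for i in range(1, k + 1):
    grammar[f"X{i}"] = [f"X{i-1}", f"X{i-1}"]
```

Here `expand(grammar, "X4")` derives the 32-character string "ab"·16 from just five rules.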
The final and most significant result establishes that two streams are both necessary and sufficient to reach entropy‑only bounds with grammar‑based compressors under the same resource constraints. The algorithm uses one stream (Stream A) to read the original data sequentially and a second stream (Stream B) to store intermediate structures such as partial grammars, dictionary tables, and back‑references. By alternating reads and writes between the two streams, the method can simulate multiple passes over the data while never holding more than O((log n)^c) bits in RAM at any moment. The construction proceeds in three phases: (1) a forward scan that identifies candidate repeated substrings and writes their positions to Stream B; (2) a backward scan that consolidates overlapping candidates into non‑terminal symbols; and (3) a final forward scan that emits the compressed representation, which is a compact grammar. The authors prove that the total size of the grammar is within a constant factor of the information‑theoretic optimum, yielding a compressed output of length n·H + o(n). Moreover, they show that with only one stream this guarantee is impossible, establishing a tight separation.
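The alternating read/write pattern can be caricatured in a few lines. The sketch below is our own illustration, not the paper's algorithm: two Python lists stand in for Streams A and B, a streaming transform holds only O(1) items at a time, and repeated ping-pong passes gradually consolidate adjacent candidates, loosely in the spirit of phase (2):

```python
def ping_pong(items, transform, passes):
    """Skeleton of the two-stream pattern: `a` stands in for the stream
    currently being read, and the list built from it for the stream
    being written; after each sequential pass the roles swap.
    `transform` must be a streaming function holding only O(1) items."""
    a = list(items)
    for _ in range(passes):
        a = list(transform(iter(a)))  # read A once, write B once, swap roles
    return a

def pairwise_merge(runs):
    """One streaming pass that merges adjacent (symbol, count) runs
    having equal symbols, keeping a single pending run in memory."""
    pending = None
    for sym, cnt in runs:
        if pending is None:
            pending = (sym, cnt)
        elif pending[0] == sym:
            yield (sym, pending[1] + cnt)  # merge the adjacent pair
            pending = None
        else:
            yield pending
            pending = (sym, cnt)
    if pending is not None:
        yield pending
```

Collapsing a block of n equal symbols this way takes about log n passes, each with constant working memory, which is exactly the trade the two-stream setting makes: cheap extra passes instead of expensive RAM.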
Beyond the theoretical proofs, the paper includes experimental evaluations on synthetic and real‑world datasets (web crawls, genomic sequences, and server logs). The two‑stream implementation consistently outperforms the single‑stream LZ78 baseline in compression ratio, often approaching the empirical entropy of the source, while using comparable runtime and strictly adhering to the polylogarithmic memory budget.
In summary, the work clarifies the role of external storage organization in streaming compression. A single read/write stream suffices for universal, LZ‑type compression but falls short for more powerful grammar‑based schemes that aim for entropy‑only performance. Introducing a second stream bridges this gap, enabling optimal‑rate compression under severe memory and pass constraints. These insights have direct implications for the design of large‑scale data pipelines, where the number of available I/O channels can be a decisive factor in achieving both space efficiency and scalability.