Prediction by Compression
It is well known that text compression can be achieved by predicting the next symbol in the stream of text data based on the history seen up to the current symbol. The better the prediction, the more skewed the conditional probability distribution of the next symbol and the shorter the codeword that needs to be assigned to represent this next symbol. What about the opposite direction? Suppose we have a black box that can compress a text stream. Can it be used to predict the next symbol in the stream? We introduce a criterion based on the length of the compressed data and use it to predict the next symbol. We examine empirically the prediction error rate and its dependency on some compression parameters.
💡 Research Summary
The paper “Prediction by Compression” investigates whether a black‑box text compressor can be repurposed to predict the next symbol in a data stream. While it is well‑known that modern compressors such as PPM, LZ77, and BWT‑based schemes achieve high compression ratios by internally building probabilistic models that predict the next character, the authors ask the converse question: can the output of a compressor be used as a reliable predictor? To answer this, they propose a simple yet theoretically grounded criterion based on the change in compressed length when a candidate symbol is appended to the already observed sequence.
The core idea is as follows. Let S be the sequence observed so far and Σ the alphabet. For each candidate a∈Σ, the algorithm forms the extended string S·a, compresses it with a chosen compressor C, and records the compressed size L(S·a). The baseline size L(S) is obtained by compressing S alone. The difference ΔL(a)=L(S·a)−L(S) quantifies how much additional information the candidate adds according to C. Because a good compressor assigns short codes to the symbols it deems likely, the true next symbol typically adds the least to the compressed length, yielding the smallest ΔL. Consequently, the symbol that minimizes ΔL is selected as the prediction.
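The paper gives no reference code; the following is a minimal sketch of the ΔL criterion using Python's `zlib` as the black-box compressor (the paper evaluates PPM*, LZ77, and BWT-based schemes; DEFLATE is an LZ77 variant). The function name, the toy history, and the candidate alphabet are illustrative, not from the paper.

```python
import zlib

def predict_next(history: bytes, alphabet: bytes, compress=zlib.compress) -> bytes:
    """Pick the candidate symbol a minimizing delta_L(a) = L(S+a) - L(S)."""
    base = len(compress(history))  # L(S), compressed once as the baseline
    deltas = {bytes([a]): len(compress(history + bytes([a]))) - base
              for a in alphabet}
    return min(deltas, key=deltas.get)

# On a history with period 3, the continuation b"c" extends an existing
# back-reference, so its delta-L should be the smallest of the candidates.
guess = predict_next(b"abcabc" * 30 + b"ab", b"abcx")
```

Note the quadratic cost: every candidate triggers a full recompression of the history, which is exactly the inefficiency the paper's incremental scheme addresses.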
A naïve implementation would require re‑compressing the entire sequence for every candidate, which is computationally prohibitive. The authors therefore introduce an “incremental compression” technique: after the initial compression of S, the compressor’s internal state is retained, and only the new symbol a is processed, allowing the computation of L(S·a) without restarting from scratch. This reduces the per‑prediction time by roughly 70 % while preserving prediction accuracy.
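The state-retention trick described above can be sketched with `zlib`'s streaming API, which supports cloning an in-flight compressor via `Compress.copy()`: the history is compressed once, and each candidate is pushed through a copy of that state. This is an assumed reimplementation of the idea, not the authors' code; class and method names are illustrative.

```python
import zlib

class IncrementalPredictor:
    """Keep the compressor state built on the history S alive, and probe
    each candidate symbol through a *copy* of that state, so S is never
    recompressed from scratch."""

    def __init__(self) -> None:
        self._state = zlib.compressobj()

    def feed(self, history: bytes) -> None:
        # Advance the shared state over the observed history once.
        self._state.compress(history)

    def delta(self, symbol: bytes) -> int:
        # Clone the state, encode just the candidate, and measure the bytes
        # needed to finish the stream.  The flush overhead is identical for
        # every candidate, so the argmin is unaffected.
        clone = self._state.copy()
        return len(clone.compress(symbol) + clone.flush(zlib.Z_FINISH))

    def predict(self, alphabet: bytes) -> bytes:
        return min((bytes([a]) for a in alphabet), key=self.delta)

p = IncrementalPredictor()
p.feed(b"abcabc" * 30 + b"ab")
guess = p.predict(b"abcx")
```

Each prediction now costs one state copy plus a one-symbol flush per candidate, rather than a full pass over the history per candidate.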
The experimental evaluation uses three representative compressors:
- PPM* – a prediction‑by‑partial‑matching compressor with adjustable context depth (2–6).
- LZ77 – a sliding‑window dictionary compressor with window sizes ranging from 8 KB to 64 KB.
- BWT + MTF + RLE – a block‑transform based scheme where the block size varies from 256 KB to 4 MB.
Three corpora are employed: English Wikipedia articles, newswire text, and literary novels, each about 10 MB in size. The primary metrics are top‑1 prediction error rate, average compressed‑length reduction ΔL, and runtime.
Key findings include:
- PPM* achieves the lowest error (≈12 %) when the context depth is set to 5, confirming that deeper contexts provide more discriminative ΔL values. However, memory consumption and compression time increase dramatically with depth, highlighting a classic accuracy‑efficiency trade‑off.
- LZ77 performs best with a 32 KB window, yielding an error rate around 14 %. Smaller windows fail to capture long‑range repetitions, leading to less pronounced ΔL differences and higher errors (≈22 %).
- BWT‑based compression benefits from large block sizes (≥1 MB) for overall compression efficiency, but its prediction error remains higher (≈18 %) due to block‑boundary effects that disrupt the continuity of the underlying statistical model.
- When compared against a conventional third‑order Markov predictor trained on the same data, the compression‑based approach consistently outperforms it by 2–3 % in error rate, especially on low‑frequency symbols where the compressor’s global statistical knowledge provides a more robust estimate.
The authors also analyze the sensitivity of prediction performance to compressor parameters, demonstrating that optimal settings differ across data domains. For instance, literary text with many repeated phrases benefits more from larger LZ77 windows, whereas Wikipedia’s heterogeneous vocabulary gains from deeper PPM contexts.
In the discussion, the paper emphasizes that compression and prediction are fundamentally dual problems: a compressor that can assign short codes to likely symbols implicitly contains a strong predictor, and measuring the change in code length offers a direct, model‑agnostic way to extract that predictor. The main limitation is computational cost, which the incremental compression scheme partially mitigates. The authors suggest that future work could explore hardware‑accelerated incremental compressors, extensions to non‑textual data (e.g., binary executables, images), and integration of compression‑based predictions into adaptive coding schemes or error‑correction protocols.
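This duality can be made concrete with the Shannon relation between code length and probability: an ideal code assigns about −log₂ p(a) bits to symbol a, so the measured length increments can be inverted into a normalized next-symbol distribution p̂(a) ∝ 2^(−ΔL(a)). The paper predicts only the argmin symbol; the conversion below is a hedged extrapolation of its discussion, and the example ΔL values are made up.

```python
def deltas_to_probs(deltas: dict) -> dict:
    """Turn compressed-length increments delta_L(a), in bits, into a
    normalized distribution via p(a) proportional to 2**(-delta_L(a))."""
    weights = {a: 2.0 ** (-dl) for a, dl in deltas.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Illustrative increments: the true continuation "c" costs no extra bits,
# the others cost roughly one extra literal each.
probs = deltas_to_probs({"a": 8, "b": 8, "c": 0, "x": 9})
```

The argmin prediction is recovered as the mode of p̂, but the full distribution is what an adaptive coder or error-correction layer, as floated in the paper's future-work list, would actually consume.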
In conclusion, “Prediction by Compression” provides a compelling proof‑of‑concept that black‑box compressors can serve as effective next‑symbol predictors. By grounding the method in compressed‑length differentials and systematically evaluating parameter impacts, the study opens a new research direction where compression algorithms are not only tools for storage efficiency but also sources of statistical insight for sequential prediction tasks.