The Sample Complexity of Lossless Data Compression
A new framework is introduced for examining and evaluating the fundamental limits of lossless data compression that emphasizes genuinely non-asymptotic results. The {\em sample complexity} of compressing a given source is defined as the smallest blocklength at which it is possible to compress that source at a specified rate and to within a specified excess-rate probability. This formulation parallels corresponding developments in statistics and computer science, and it facilitates the use of existing results on the sample complexity of various hypothesis testing problems. For arbitrary sources, the sample complexity of general variable-length compressors is shown to be tightly coupled with the sample complexity of prefix-free codes and fixed-length codes. For memoryless sources, it is shown that the sample complexity is characterized not by the source entropy, but by its Rényi entropy of order~$1/2$. Non-asymptotic bounds on the sample complexity are obtained, with explicit constants. Generalizations to Markov sources are established, showing that the sample complexity is determined by the source’s Rényi entropy rate of order~$1/2$. Finally, bounds on the sample complexity of universal data compression are developed for arbitrary families of memoryless sources. There, the sample complexity is characterized by the minimum Rényi divergence of order~$1/2$ between elements of the family and the uniform distribution. The connection of this problem with identity testing and with the associated separation rates is explored and discussed.
💡 Research Summary
The paper introduces a non‑asymptotic framework for lossless data compression by defining the sample complexity of a source. For a given excess‑rate probability ε, the sample complexity n⁎(X, ε) is the smallest blocklength n for which there exist a variable‑length compressor fₙ and a rate R such that both the probability of exceeding the rate, P(ℓ(fₙ(Xⁿ)) > nR), and the normalized rate term 2^{nR}/|A|ⁿ are bounded by ε. This definition simultaneously controls the compression rate and its reliability, providing a concrete design guideline for practitioners.
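Paraphrasing the summary's definition in display form (a sketch of the formulation; the paper's exact statement may differ in constants or quantifier placement):

```latex
n^{*}(X, \epsilon) \;=\; \min\Big\{\, n \ge 1 \;:\; \exists\, f_n,\; R
\ \text{ such that }\
\mathbb{P}\big(\ell(f_n(X^n)) > nR\big) \le \epsilon
\ \text{ and }\
\frac{2^{nR}}{|A|^{n}} \le \epsilon \,\Big\}
```

Here ℓ(·) denotes the length in bits of the compressor's output and A is the source alphabet, matching the notation used above.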
The authors first establish that the sample complexity of general variable‑length compressors is tightly linked to that of prefix‑free and fixed‑length codes (Theorems 4.1 and 4.2). Consequently, the problem can be reduced to a simple hypothesis‑testing task: find a set Cₙ ⊆ Aⁿ that minimizes the sum Pₙ(Cₙᶜ) + |Cₙ|/|A|ⁿ, where Cₙᶜ denotes the complement of Cₙ in Aⁿ. This reduction enables the direct use of existing non‑asymptotic results from the hypothesis testing literature.
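As an illustration (not code from the paper), the minimizing set consists exactly of the strings whose probability exceeds the uniform mass |A|⁻ⁿ: excluding such a string costs more in missed probability than it saves in normalized size. A brute-force sketch over a small alphabet, where `p`, `A`, and `n` are hypothetical inputs:

```python
import itertools


def optimal_set_objective(p, A, n):
    """Minimize P_n(C^c) + |C| / |A|^n over sets C of length-n strings.

    The pointwise minimizer keeps exactly those strings x with
    p(x) > |A|^{-n}; we enumerate A^n and apply that rule directly.
    """
    uniform_mass = 1.0 / len(A) ** n
    best_C = [x for x in itertools.product(A, repeat=n)
              if p(x) > uniform_mass]
    covered = sum(p(x) for x in best_C)  # P_n(C)
    # Objective: missed probability plus normalized set size.
    return (1.0 - covered) + len(best_C) * uniform_mass
```

For a Bernoulli(0.1) source over {0, 1} at blocklength 4, only the all-zeros string and the four single-one strings exceed the uniform mass 1/16, giving an objective of (1 − 0.9477) + 5/16 = 0.3648.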
For memoryless (i.i.d.) sources, the central result (Theorem 4.3) shows that the sample complexity is governed not by the Shannon entropy of the source but by its Rényi entropy of order 1/2, with non‑asymptotic bounds carrying explicit constants.
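For reference, the Rényi entropy of order 1/2 that appears in this characterization is H_{1/2}(P) = 2 log₂ Σₓ √P(x); it coincides with the Shannon entropy for uniform distributions and strictly exceeds it otherwise. A minimal sketch:

```python
import math


def renyi_entropy_half(probs):
    """Rényi entropy of order 1/2, in bits:
    H_{1/2}(P) = 2 * log2( sum over x of sqrt(P(x)) )."""
    return 2.0 * math.log2(sum(math.sqrt(q) for q in probs))
```

For example, a uniform distribution on four symbols gives H_{1/2} = 2 bits (equal to its Shannon entropy), while a Bernoulli(0.1) source gives H_{1/2} ≈ 0.678 bits versus a Shannon entropy of ≈ 0.469 bits.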