A simple sketching algorithm for entropy estimation
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant alpha. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an alpha-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to alpha are lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as alpha approaches 1: the maximally skewed, strictly stable distribution with alpha = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant zeta > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with zeta = 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
💡 Research Summary
The paper tackles the challenge of estimating the empirical Shannon entropy of a high‑frequency data stream when memory is too limited to store exact frequency counts. The setting is the relaxed strict‑turnstile model, where both insertions and deletions are allowed and each item's count need only be nonnegative at the time the entropy is evaluated (the ordinary strict‑turnstile model requires this at all times). Existing approaches approximate the Shannon entropy by first estimating the Rényi entropy H_α = (1/(1−α)) log ∑ p_i^α, where p_i is the relative frequency of item i, for a constant α close to 1, using α‑stable sketches and the technique known as compressed counting. However, these methods suffer from two practical drawbacks: (i) the choice of α is ad hoc, and (ii) as α approaches 1, the variance of the estimator blows up and numerical instability appears, making it hard to calibrate the algorithm for real‑world streams.
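To make the relationship between the two entropies concrete, the following minimal sketch computes both quantities exactly from a small table of item counts and shows the Rényi entropy approaching the Shannon entropy as α approaches 1. It works on exact counts rather than on a sketch, and the function names and example counts are illustrative assumptions, not taken from the paper.

```python
import math


def shannon_entropy(counts):
    """Empirical Shannon entropy (in nats) of a list of nonnegative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Empirical Renyi entropy H_alpha = log(sum p_i^alpha) / (1 - alpha)."""
    if alpha == 1:
        raise ValueError("alpha = 1 is the Shannon limit; use shannon_entropy")
    total = sum(counts)
    power_sum = sum((c / total) ** alpha for c in counts if c > 0)
    return math.log(power_sum) / (1 - alpha)


# Hypothetical item counts for a tiny stream.
counts = [5, 3, 1, 1]
print("Shannon:", shannon_entropy(counts))
for alpha in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    print("Renyi, alpha =", alpha, ":", renyi_entropy(counts, alpha))
```

The gap between the two values shrinks roughly in proportion to |α − 1|, which is why a sketch‑based method must push α very close to 1 to reach a given additive accuracy.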
The authors observe that the family of α‑stable distributions has a well‑defined limit as α → 1. Specifically, the maximally skewed, strictly stable distribution with α = 1 (often denoted S(1,1,γ,δ)) is supported on the whole real line and possesses a closed‑form characteristic function. By mapping each stream update through a pair of hash functions (one for bucket selection, one for random weight), they obtain k independent sketch values X₁,…,X_k that follow this limiting distribution. This transformation removes the need to pick a value of α; the sketch itself already embodies the α → 1 limit.
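For readers who want to see what the limiting distribution looks like, the following minimal sketch draws variates from a maximally skewed (β = 1) stable law with α = 1 using the Chambers‑Mallows‑Stuck formula. It is an illustration only: the unit scale, zero shift, and the function name are assumptions, and the paper's own parameterization and sketching procedure may differ.

```python
import math
import random


def sample_skewed_stable_alpha1(rng):
    """Draw one variate from a stable law with alpha = 1 and beta = 1.

    Chambers-Mallows-Stuck formula for alpha = 1; unit scale and zero
    shift are assumed here for illustration.
    """
    half_pi = math.pi / 2
    while True:
        v = math.pi * (rng.random() - 0.5)  # uniform on [-pi/2, pi/2)
        w = rng.expovariate(1.0)            # standard exponential
        if v > -half_pi and w > 0.0:        # skip the measure-zero edge cases
            break
    shifted = half_pi + v                   # lies in (0, pi)
    return (2 / math.pi) * (
        shifted * math.tan(v) - math.log(half_pi * w * math.cos(v) / shifted)
    )


rng = random.Random(0)
samples = [sample_skewed_stable_alpha1(rng) for _ in range(20000)]
print("smallest draw:", min(samples))
print("largest draw:", max(samples))
```

The draws are spread over both negative and positive values, but the two tails behave very differently: the right tail is heavy, while the left tail decays extremely fast, which is the sense in which the distribution is maximally skewed.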
From these sketch values the paper introduces a family of “log‑mean” estimators indexed by a positive constant ζ:
L_ζ = (1/ζ) log