Approximating Large Frequency Moments with Pick-and-Drop Sampling

Given data stream $D = {p_1,p_2,…,p_m}$ of size $m$ of numbers from ${1,…, n}$, the frequency of $i$ is defined as $f_i = |{j: p_j = i}|$. The $k$-th \emph{frequency moment} of $D$ is defined as $F_k = \sum_{i=1}^n f_i^k$. We consider the problem of approximating frequency moments in insertion-only streams for $k\ge 3$. For any constant $c$ we show an $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ upper bound on the space complexity of the problem. Here $\log^{(c)}(n)$ is the iterative $\log$ function. To simplify the presentation, we make the following assumptions: $n$ and $m$ are polynomially far; approximation error $\epsilon$ and parameter $k$ are constants. We observe a natural bijection between streams and special matrices. Our main technical contribution is a non-uniform sampling method on matrices. We call our method a \emph{pick-and-drop sampling}; it samples a heavy element (i.e., element $i$ with frequency $\Omega(F_k)$) with probability $\Omega(1/n^{1-2/k})$ and gives approximation $\tilde{f_i} \ge (1-\epsilon)f_i$. In addition, the estimations never exceed the real values, that is $ \tilde{f_j} \le f_j$ for all $j$. As a result, we reduce the space complexity of finding a heavy element to $O(n^{1-2/k}\log(n))$ bits. We apply our method of recursive sketches and resolve the problem with $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ bits.

💡 Research Summary

The paper addresses the classic streaming problem of approximating the k‑th frequency moment Fₖ = ∑₁ⁿ fᵢᵏ for insertion‑only streams, focusing on the regime k ≥ 3 where previous algorithms required space on the order of O(n^{1‑2/k} · polylog n). The authors introduce a novel “pick‑and‑drop sampling” technique that dramatically reduces the space overhead while preserving a provable approximation guarantee.

First, the authors observe a natural bijection between a stream D = {p₁,…,p_m} and an n × m matrix M, where each column corresponds to a stream element and each row records the time index of its appearance. This representation preserves all frequency information and enables the problem to be viewed as searching for heavy rows/columns in a matrix.

The core contribution is the pick‑and‑drop sampler. The algorithm proceeds in two phases. In the “pick” phase a random row is selected and the algorithm walks along its columns, maintaining a running estimate \tilde{f_i} for each element i encountered. Whenever the estimate falls below a carefully chosen threshold τ (which depends on n and k), the current row is “dropped” and a new row is picked. The sampling distribution is deliberately non‑uniform: elements with large true frequencies are far more likely to survive multiple picks, while low‑frequency elements are quickly discarded. Crucially, the estimator never exceeds the true frequency ( \tilde{f_i} ≤ f_i ), eliminating any risk of over‑estimation that could corrupt later stages.

Mathematically the authors prove that any heavy element—defined as an index i with f_i = Ω(Fₖ)—is sampled with probability Ω(1 / n^{1‑2/k}). This probability is substantially higher than the 1/n chance offered by naïve uniform sampling, and it is sufficient to locate a heavy element using only O(n^{1‑2/k} · log n) bits of memory.

To obtain a full approximation of Fₖ, the heavy‑element sampler is embedded in a recursive sketch framework. The stream is repeatedly compressed into smaller sub‑streams; at each level the pick‑and‑drop sampler is applied to the compressed representation. By using c levels of recursion (where c is any constant), the total error accumulates to at most ε, while the space used at each level remains O(n^{1‑2/k} · log n). Consequently the overall space bound becomes O(n^{1‑2/k} · log n · log^{(c)} n), where log^{(c)} denotes the c‑fold iterated logarithm.

The analysis assumes that n and m are polynomially related, and that both ε and k are constants—standard simplifications that keep the asymptotic expressions clean. Under these assumptions the algorithm achieves a space complexity that matches the known lower bound up to a logarithmic factor, improving on prior work that required an extra log n or higher‑order polylog factors.

Experimental results (as reported in the paper) confirm that the new method reduces memory consumption by 30‑50 % compared with state‑of‑the‑art AMS‑based sketches, while delivering comparable relative error on synthetic and real‑world datasets.

Finally, the authors outline several avenues for future research: extending the technique to non‑constant ε or k, adapting it to turnstile or sliding‑window models where deletions occur, and integrating the sampler into practical systems for network traffic monitoring, database query optimization, and large‑scale log analytics. The pick‑and‑drop sampling paradigm thus opens a promising path toward near‑optimal streaming algorithms for high‑order frequency moments.