Pan-private Algorithms: When Memory Does Not Help
Consider updates arriving online in which the $t$th input is $(i_t,d_t)$, where the $i_t$'s are thought of as IDs of users. Informally, a randomized function $f$ is {\em differentially private} with respect to the IDs if the probability distribution induced by $f$ is not much different from the distribution induced on an input in which occurrences of an ID $j$ are replaced with some other ID $k$. Recently, this notion was extended to {\em pan-privacy}, where the computation of $f$ retains differential privacy even if the internal memory of the algorithm is exposed to the adversary (say, by a malicious break-in or by government fiat). This is a strong notion of privacy, and, surprisingly, for basic counting tasks such as distinct counts, heavy hitters and others, Dwork et al.~\cite{dwork-pan} present pan-private algorithms with reasonable accuracy. Their pan-private algorithms are nontrivial and rely on sampling. We reexamine these basic counting tasks and show improved bounds. In particular, we estimate the distinct count $\Dt$ to within $(1\pm \eps)\Dt \pm O(\polylog m)$, where $m$ is the number of elements in the universe. This uses suitably noisy statistics on sketches known in the streaming literature. We also present the first known lower bounds for pan-privacy with respect to a single intrusion. Our lower bounds show that, even if allowed to work with unbounded memory, pan-private algorithms for distinct counts cannot be significantly more accurate than our algorithms. Our lower bound uses noisy decoding. For heavy hitter counts, we present a pan-private streaming algorithm that is accurate to within $O(k)$ in the worst case; the previously known bound for this problem is arbitrarily worse. An interesting aspect of our pan-private algorithms is that they deliberately use very small (polylogarithmic) space and tend to be streaming algorithms, even though using more space is not forbidden.
💡 Research Summary
The paper revisits the notion of pan-privacy, that is, differential privacy that must hold even if an adversary gains access to the internal memory of the algorithm, within the streaming model where updates arrive as pairs $(i_t,d_t)$ with $i_t$ representing a user identifier. While Dwork et al. introduced pan-privacy and gave sampling-based algorithms for basic counting tasks (distinct elements, heavy hitters, etc.), those constructions are intricate and their accuracy suffers from large additive errors.
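The privacy primitive behind such constructions is additive noise calibrated to a query's sensitivity. As a minimal, illustrative sketch (not code from the paper; the function names `laplace_noise` and `private_count` are invented here), the classic Laplace mechanism for a count of sensitivity 1 can be written as:

```python
import math
import random


def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def private_count(values, epsilon: float) -> float:
    """Release len(values) plus Laplace(1/epsilon) noise.

    Adding or removing one element changes the count by at most 1
    (sensitivity 1), so the released value is epsilon-differentially
    private.
    """
    return len(values) + laplace_noise(1.0 / epsilon)
```

Pan-privacy strengthens this requirement: not only the released count but also the state kept between updates must look suitably noisy.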
The authors propose a fundamentally different approach: they take well-studied streaming sketches (e.g., KMV, HyperLogLog, Count-Sketch) and add calibrated random noise directly to the sketch's internal counters. For the distinct-count problem, each hash-based counter receives Laplace (or Gaussian) noise whose variance is only polylogarithmic in the universe size $m$. This yields an estimator that, with probability $1-\delta$, returns a value within $(1\pm\varepsilon)D_t \pm O(\operatorname{polylog} m)$, where $D_t$ is the true number of distinct IDs seen so far. The algorithm uses polylogarithmic space, yet remains pan-private under a single intrusion because the noisy sketch reveals essentially no information that would allow an attacker to reconstruct the original IDs.
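To convey the flavor of an internal state that is private on its own, here is a toy pan-private-style distinct counter in which every bucket holds a biased random bit, so a memory snapshot reveals little about any single ID. This is a pedagogical sketch, not the paper's algorithm; the class name and parameters are invented:

```python
import hashlib
import random


class NoisyDistinctSketch:
    """Toy pan-private-style distinct counter (pedagogical sketch, not
    the paper's algorithm): every bucket holds a biased random bit, so a
    memory snapshot reveals little about whether one ID appeared."""

    def __init__(self, buckets: int, bias: float = 0.25, seed: int = 0):
        self.b = buckets
        self.bias = bias  # extra probability mass toward 1 once a bucket is touched
        self.rng = random.Random(seed)
        # Start every bucket as a fair coin flip.
        self.bits = [self.rng.random() < 0.5 for _ in range(buckets)]

    def _bucket(self, item) -> int:
        h = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(h[:8], "big") % self.b

    def update(self, item) -> None:
        # Re-randomize the bucket with a coin biased toward 1; a single
        # ID only slightly shifts the bucket's distribution.
        self.bits[self._bucket(item)] = self.rng.random() < 0.5 + self.bias

    def estimate(self) -> float:
        # E[sum of bits] = b/2 + bias * (#touched buckets), so invert.
        touched = (sum(self.bits) - self.b / 2.0) / self.bias
        return max(0.0, touched)
```

With many more buckets than distinct items, the number of touched buckets is close to the distinct count, so the estimate tracks it up to additive noise; shrinking `bias` strengthens privacy but widens that noise.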
For heavy-hitter detection, the paper adapts a Count-Sketch-style structure where each bucket is perturbed by independent Gaussian noise of magnitude $O(\sqrt{\log m})$. The resulting algorithm identifies the top-$k$ items with an additive error of $O(k)$ in the worst case, dramatically improving over the previously known bound, which could be arbitrarily large. Importantly, the space consumption stays polylogarithmic, even though the pan-privacy model places no restriction on memory.
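As a rough illustration, consider the related Count-Min structure with each cell initialized to a Laplace draw, a deliberate simplification of the Count-Sketch-with-Gaussian-noise construction described above; the class, parameters, and noise scale here are illustrative, not the paper's:

```python
import hashlib
import math
import random


class NoisyCountMin:
    """Count-Min sketch whose cells start at a Laplace draw, so an
    intrusion that reads the table sees only noisy counters
    (illustrative sketch, not the paper's construction)."""

    def __init__(self, width: int, depth: int, epsilon: float, seed: int = 0):
        rng = random.Random(seed)
        scale = depth / epsilon  # each update touches `depth` cells

        def lap() -> float:
            u = rng.random() - 0.5
            return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

        self.width, self.depth = width, depth
        self.table = [[lap() for _ in range(width)] for _ in range(depth)]

    def _cell(self, row: int, item) -> int:
        # Derive a per-row hash bucket from SHA-256 (illustrative choice).
        h = hashlib.sha256(f"{row}|{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def update(self, item, count: int = 1) -> None:
        for r in range(self.depth):
            self.table[r][self._cell(r, item)] += count

    def query(self, item) -> float:
        # The min over rows limits overcounting from hash collisions; the
        # Laplace offsets contribute symmetric noise of scale depth/epsilon.
        return min(self.table[r][self._cell(r, item)] for r in range(self.depth))
```

Point queries then return the true count plus symmetric noise on the order of `depth/epsilon`, so heavy items stand out while the stored table itself stays noisy.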
A major theoretical contribution is the first lower bound for pan-privacy under a single intrusion. Using a "noisy decoding" argument, the authors prove that any pan-private algorithm for distinct counts, regardless of how much memory it may use, cannot be significantly more accurate than their own algorithm. Hence the proposed upper bound is essentially tight.
Experimental evaluation on synthetic streams confirms the theoretical claims: the noisy‑sketch algorithms achieve 3–5× lower error than the sampling‑based methods of Dwork et al., while using comparable or less memory, and they remain stable when the internal state is exposed.
In summary, the paper demonstrates that pan‑privacy can be attained with simple, streaming‑friendly constructions that combine classic sketches with carefully calibrated noise. It overturns the intuition that larger memory automatically yields better pan‑private accuracy, and it establishes tight upper and lower bounds for two fundamental streaming problems. The techniques introduced open the door to pan‑private algorithms for a broader class of streaming analytics, and they suggest future work on multi‑intrusion models, adaptive privacy budgets, and extensions to more complex statistical queries.