Accurate and Efficient Private Release of Datacubes and Contingency Tables
A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of the output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a “strategy” set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise.
💡 Research Summary
The paper tackles a central challenge in differential privacy (DP): how to release high‑dimensional aggregate statistics—such as data cubes and contingency tables—accurately while guaranteeing privacy. The dominant paradigm in recent DP literature is to first select a “strategy” set of linear queries, answer these with added noise, and then reconstruct the answers to the target queries from the noisy strategy answers. This approach suffers from two major drawbacks. First, it typically allocates the same amount of noise to every strategy query, ignoring the fact that some strategy queries are more important for the reconstruction of the target set. Second, finding a good strategy often requires an exhaustive or heuristic search over a combinatorial space, which is computationally expensive.
The authors propose a fundamentally different viewpoint: optimal noise allocation. Given a fixed strategy matrix A (the set of strategy queries) and a target query matrix B, they formulate the problem of minimizing the total mean‑squared error (MSE) of the reconstructed target answers as a convex optimization over the variances σ_q² assigned to each strategy query q. By introducing Lagrange multipliers they derive a closed‑form solution that depends only on the spectral properties of A and the relationship between A and B. Crucially, for many popular strategies—wavelet bases, hierarchical (tree) decompositions, and Fourier coefficients—the matrices have orthogonal or highly sparse structure, which makes the optimal variance computation linear in the number of strategy queries, i.e., O(|S|).
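The paper's exact formulation is not reproduced here, but the shape of such an optimization can be sketched for an orthogonal strategy. If each strategy query q contributes to the target answers with importance weight w_q (e.g., its squared reconstruction coefficient), one minimizes the total error Σ_q w_q·σ_q² under a fixed budget on Σ_q 1/σ_q²; Lagrange multipliers give σ_q² ∝ 1/√w_q. A minimal NumPy sketch, with illustrative weights and budget that are not taken from the paper:

```python
import numpy as np

def optimal_variances(weights, budget):
    """Allocate per-query noise variances sigma_q^2 minimizing the weighted
    total error sum_q w_q * sigma_q^2 subject to the privacy constraint
    sum_q 1/sigma_q^2 <= budget.  By Lagrange multipliers the optimum is
    sigma_q^2 = (sum_p sqrt(w_p)) / (budget * sqrt(w_q))."""
    w = np.asarray(weights, dtype=float)
    return np.sqrt(w).sum() / (budget * np.sqrt(w))

# Illustrative weights: how much each strategy query matters for
# reconstructing the target set (hypothetical values, not from the paper).
weights = np.array([9.0, 4.0, 1.0, 1.0])
budget = 1.0

sigma2 = optimal_variances(weights, budget)
weighted_error = (weights * sigma2).sum()   # = (sum_q sqrt(w_q))^2 / budget
uniform_sigma2 = len(weights) / budget      # uniform split of the same budget
uniform_error = (weights * uniform_sigma2).sum()

print(weighted_error <= uniform_error)  # → True: weighted never loses
```

Uniform allocation is the special case where all w_q are equal, which is why answering every strategy query with the same noise level can only match, never beat, the weighted optimum.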
For marginal (subset‑sum) queries, which are a cornerstone of contingency‑table analysis, the paper shows analytically that the weighted noise allocation strictly dominates the uniform‑noise approaches used in prior work such as “Private Marginals” and hyperbolic scaling methods. Empirically, on synthetic and real datasets ranging from 2‑ to 5‑dimensional cubes, the proposed method reduces average absolute error by roughly 15‑30 % compared with the best existing baselines. The improvement grows with dimensionality because the disparity in importance among strategy queries becomes more pronounced in higher dimensions.
A second major contribution is a consistency projection step that enforces that the released answers could have originated from some (unknown) true dataset. After the optimal noisy reconstruction, the authors solve a simple least‑squares projection onto the linear subspace defined by the natural constraints of the query family (e.g., marginal sums must be consistent across overlapping subsets). Because the projection uses the same weighted norm derived from the optimal noise allocation, it adds no extra privacy loss and incurs only linear‑time overhead. In practice, the projection completes in under a tenth of a second for the evaluated data cubes.
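A weighted least‑squares projection of this kind has a closed form: to enforce homogeneous consistency constraints Cx = 0 on noisy answers y with per‑answer variances σ², one computes x = y − Σ Cᵀ(C Σ Cᵀ)⁻¹ C y, where Σ = diag(σ²). A small sketch under assumed inputs (the 2×2 example and its numbers are illustrative, not from the paper):

```python
import numpy as np

def consistency_projection(y, sigma2, C):
    """Project noisy answers y onto the subspace {x : C x = 0}, measuring
    distance in the norm weighted by inverse noise variances.  This is pure
    post-processing, so it consumes no additional privacy budget."""
    S = np.diag(sigma2)              # covariance of the per-answer noise
    G = C @ S @ C.T                  # Gram matrix of the constraints
    lam = np.linalg.solve(G, C @ y)  # Lagrange multipliers
    return y - S @ C.T @ lam

# Illustrative example: noisy row and column marginals of a 2x2 table
# must agree on the grand total (row0 + row1 == col0 + col1).
y = np.array([10.3, 4.9, 9.1, 6.7])      # [row0, row1, col0, col1]
C = np.array([[1.0, 1.0, -1.0, -1.0]])   # row sum minus column sum = 0
sigma2 = np.array([1.0, 1.0, 2.0, 2.0])  # per-answer noise variances

x = consistency_projection(y, sigma2, C)
print(np.allclose(C @ x, 0))  # → True: the marginals are now consistent
```

Answers with larger variance absorb more of the correction, which matches the intuition that less reliable answers should move more to restore consistency.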
Overall, the paper delivers four key advances:
- A general framework for allocating different amounts of Gaussian noise to each strategy query, minimizing the overall reconstruction error.
- Efficient closed‑form solutions for a wide class of strategies, eliminating the need for costly strategy‑search procedures.
- Demonstrated superiority for marginal queries, both in theory (tighter error bounds) and in extensive experiments.
- A low‑cost, post‑processing consistency step that yields answers that are both private and internally coherent.
The authors discuss practical implications for statistical agencies, health‑care data portals, and commercial data marketplaces that must publish high‑dimensional aggregates under DP. The method’s simplicity, linear computational complexity, and strong accuracy gains make it attractive for integration into existing DP pipelines, whether in batch analytics or near‑real‑time dashboards. Future work suggested includes extending the framework to non‑linear query families, adaptive strategy selection based on data‑dependent importance, and multi‑party settings where joint DP guarantees are required.