Privately Releasing Conjunctions and the Statistical Query Barrier
Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better?
We show that the number of statistical queries necessary and sufficient for this task is, up to polynomial factors, equal to the agnostic learning complexity of C in Kearns’ statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern.
We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions.
While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis. Here, our second result leads to the first algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents significant progress on a key open problem in privacy-preserving data analysis. Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ model as a new barrier in the design of differentially private algorithms.
💡 Research Summary
The paper tackles a fundamental problem at the intersection of learning theory and differential privacy: given only access to a data set through statistical queries (SQs), how many such queries are needed to accurately answer an entire class C of statistical queries, and can this be done efficiently? The authors provide two complementary results that together give a near‑complete picture of the query‑complexity and algorithmic landscape.
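To make the access model concrete, the following is a minimal sketch of a statistical query oracle together with the trivial baseline from the abstract, which asks one query per element of C. The names `make_sq_oracle`, `conjunction`, and `release_exhaustively`, the tolerance value, and the random toy data set are illustrative assumptions and do not come from the paper.

```python
import random
from itertools import combinations

def make_sq_oracle(dataset, tolerance, rng):
    """Return an oracle that answers a statistical query up to +/- tolerance."""
    def oracle(predicate):
        true_value = sum(1 for row in dataset if predicate(row)) / len(dataset)
        # The oracle may perturb the true fraction by at most `tolerance`.
        return true_value + rng.uniform(-tolerance, tolerance)
    return oracle

def conjunction(indices):
    """Predicate that holds when every attribute in `indices` equals 1."""
    return lambda row: all(row[i] == 1 for i in indices)

def release_exhaustively(oracle, d, max_width):
    """Baseline: spend one statistical query per conjunction of width <= max_width."""
    answers = {}
    for k in range(max_width + 1):
        for subset in combinations(range(d), k):
            answers[subset] = oracle(conjunction(subset))
    return answers

rng = random.Random(0)
d = 4
dataset = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(200)]
oracle = make_sq_oracle(dataset, tolerance=0.01, rng=rng)
answers = release_exhaustively(oracle, d, max_width=d)  # uses 2**d queries
```

The baseline needs one query per conjunction, so its cost grows as 2^d for monotone conjunctions over d attributes. The question the paper asks is when fewer queries suffice.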
1. Query‑complexity characterization via agnostic SQ learning.
The first contribution shows that the minimum number of SQs required to recover the answers in C up to error ε (allowing a small fraction of failures) is, up to polynomial factors, the agnostic learning complexity of C in Kearns’ SQ model. Informally, if an agnostic SQ learner for C exists that uses Q(ε,δ) queries, then there is an SQ-only algorithm that, with a query budget polynomial in Q(ε,δ), produces ε-accurate estimates for the queries in C. Conversely, any SQ-only algorithm that recovers the whole class can be transformed into an agnostic learner with comparable query complexity. The proof hinges on constructing an ε-cover of the hypothesis space and exploiting the linearity of statistical queries. This result is unconditional and does not depend on running time; it tells us that, in the SQ world, agnostic learning and query release are essentially the same task.
2. Efficient release when answers are submodular.
The second, algorithmic, contribution exploits structural properties of the answer function. If the mapping from subsets of the domain to query answers can be expressed as a submodular function, the authors design a polynomial-time algorithm that releases accurate answers on all but a small fraction of the queries, while tolerating arbitrary error on the remaining fraction. Submodularity (diminishing returns) enables the use of convex relaxations and Lagrangian methods; the algorithm iteratively solves a smooth surrogate problem, injecting Laplace noise at each step to satisfy differential privacy. Crucially, many natural query families (graph cuts, Boolean disjunctions, and especially all Boolean conjunctions) have answer functions that can be described by a submodular function. As a concrete outcome, the paper presents the first efficient differentially private mechanism that releases all Boolean conjunctions with 1% average error. Prior work could only handle conjunctions of bounded size or required exponential time.
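The submodularity claim can be checked by brute force on a small example. The sketch below assumes monotone disjunctions and conjunctions indexed by a set S of attributes, and uses hypothetical helper names and a random toy data set. The disjunction answer is a coverage function and hence submodular. The conjunction answer is one minus a disjunction answer on the bit-flipped data, so it is its complement that satisfies the submodular inequality f(A) + f(B) >= f(A ∪ B) + f(A ∩ B).

```python
import random
from itertools import combinations

def disjunction_answer(dataset, subset):
    """Fraction of rows with at least one attribute in `subset` set to 1."""
    return sum(1 for row in dataset if any(row[i] for i in subset)) / len(dataset)

def conjunction_answer(dataset, subset):
    """Fraction of rows with every attribute in `subset` set to 1."""
    return sum(1 for row in dataset if all(row[i] for i in subset)) / len(dataset)

def all_subsets(d):
    return [frozenset(c) for k in range(d + 1) for c in combinations(range(d), k)]

def is_submodular(f, d, tol=1e-12):
    """Check f(A) + f(B) >= f(A | B) + f(A & B) for all pairs of subsets."""
    subsets = all_subsets(d)
    return all(
        f(a) + f(b) >= f(a | b) + f(a & b) - tol
        for a in subsets
        for b in subsets
    )

rng = random.Random(1)
d = 4
dataset = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(50)]
flipped = [tuple(1 - bit for bit in row) for row in dataset]

disj_is_submodular = is_submodular(lambda s: disjunction_answer(dataset, s), d)
# 1 - conjunction(S) equals the disjunction answer on the bit-flipped data.
conj_complement_is_submodular = is_submodular(
    lambda s: 1.0 - conjunction_answer(dataset, s), d
)
```

The check enumerates all pairs of subsets, so it is only feasible for very small d. It illustrates the structural property and is not part of any release algorithm.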
3. Implications for differential privacy and lower bounds.
The authors observe that most known differentially private algorithms can be implemented using only statistical queries (e.g., the Laplace mechanism, private ERM, private multiplicative weights). By combining this observation with the SQ agnostic learning lower bound, they obtain unconditional lower bounds for any differentially private algorithm that admits an SQ implementation, even one that is not itself privacy-preserving. In other words, the SQ model itself becomes a barrier: any private algorithm that can be expressed as an SQ routine cannot use fewer queries than the agnostic learning complexity of the target query class allows. This isolates a new, learning-theoretic source of hardness for private data analysis.
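As an illustration of what an SQ implementation means, here is a minimal sketch of the Laplace mechanism written so that it touches the data only through a statistical query oracle. The function names, the exact (zero-tolerance) oracle, and the toy data are assumptions made for this example. The noise scale 1/(n·ε) is the standard calibration for a fraction-valued counting query, whose value changes by at most 1/n when one row changes.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def exact_sq_oracle(dataset):
    """Oracle returning the exact fraction of rows that satisfy a predicate."""
    def oracle(predicate):
        return sum(1 for row in dataset if predicate(row)) / len(dataset)
    return oracle

def private_answer(oracle, predicate, n, epsilon, rng):
    """Laplace mechanism: one statistical query plus noise of scale 1 / (n * epsilon)."""
    sensitivity = 1.0 / n  # one changed row moves the fraction by at most 1/n
    return oracle(predicate) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(2)
dataset = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(100)]
oracle = exact_sq_oracle(dataset)
first_two_set = lambda row: row[0] == 1 and row[1] == 1
noisy = private_answer(oracle, first_two_set, n=len(dataset), epsilon=1.0, rng=rng)
```

Because `private_answer` never reads a row directly, any lower bound on the number of statistical queries applies to algorithms assembled from calls like this one.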
4. Technical highlights and methodology.
- The reduction from private release to agnostic SQ learning uses a careful “guess‑and‑check” scheme: a candidate hypothesis is evaluated on a random subset of queries via SQs; if it passes, it is output as the estimate.
- The submodular release algorithm builds on the “private submodular optimization” framework, employing the multilinear extension of a submodular function and a projected gradient descent with privacy‑preserving noise.
- The lower‑bound construction adapts the classic SQ hardness of learning parity with noise, embedding it into a query‑release setting to show that any SQ‑only private algorithm must incur error proportional to the SQ learning difficulty.
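The classic SQ hardness of parities mentioned in the last bullet rests on a simple fact: under the uniform distribution, distinct parity functions are pairwise uncorrelated, so a single statistical query reveals little about which parity is the target. The sketch below checks only that fact by brute force for a small dimension. It does not reproduce the paper's lower-bound construction, and the helper names are hypothetical.

```python
from itertools import combinations, product

def parity(subset):
    """Return the +/-1 valued parity function on the coordinates in `subset`."""
    return lambda x: -1 if sum(x[i] for i in subset) % 2 else 1

def correlation(f, g, d):
    """Average of f(x) * g(x) over all points of {0, 1}^d (uniform distribution)."""
    points = list(product((0, 1), repeat=d))
    return sum(f(x) * g(x) for x in points) / len(points)

d = 4
subsets = [c for k in range(1, d + 1) for c in combinations(range(d), k)]

# Largest absolute correlation between two different parities.
max_cross_correlation = max(
    abs(correlation(parity(a), parity(b), d))
    for a in subsets
    for b in subsets
    if a != b
)
# Correlation of a parity with itself, for comparison.
self_correlation = correlation(parity(subsets[0]), parity(subsets[0]), d)
```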
5. Broader impact and future directions.
The paper’s dual perspective—complexity characterization and efficient algorithm design—clarifies the limits of SQ‑based private data analysis. It suggests that to break the identified barrier, one must either (a) move beyond pure SQ access (e.g., use interactive sampling or cryptographic primitives) or (b) identify new structural properties beyond submodularity that permit efficient private release. Moreover, the techniques for submodular release may be adapted to other combinatorial families such as matroid rank functions or influence maximization objectives.
In summary, the work establishes that (i) the SQ query complexity of releasing a class C is, up to polynomial factors, the agnostic SQ learning complexity of C, (ii) when the answer function is submodular, one can achieve polynomial-time, differentially private release with small average error (1% for all Boolean conjunctions), and (iii) this yields a concrete, learning-theoretic lower bound that applies to most known private algorithms, since they can be implemented using only statistical queries. The results bridge a gap between theoretical learning models and practical privacy-preserving data analysis, opening new avenues for both tighter lower bounds and more powerful release mechanisms.