A mechanism for releasing information about a statistical database with sensitive data must resolve a trade-off between utility and privacy. Privacy can be rigorously quantified using the framework of {\em differential privacy}, which requires that a mechanism's output distribution is nearly the same whether or not a given database row is included or excluded. The goal of this paper is strong and general utility guarantees, subject to differential privacy. We pursue mechanisms that guarantee near-optimal utility to every potential user, independent of its side information (modeled as a prior distribution over query results) and preferences (modeled via a loss function). Our main result is: for each fixed count query and differential privacy level, there is a {\em geometric mechanism} $M^*$ -- a discrete variant of the simple and well-studied Laplace mechanism -- that is {\em simultaneously expected loss-minimizing} for every possible user, subject to the differential privacy constraint. This is an extremely strong utility guarantee: {\em every} potential user $u$, no matter what its side information and preferences, derives as much utility from $M^*$ as from interacting with a differentially private mechanism $M_u$ that is optimally tailored to $u$.
Organizations including the census bureau, medical establishments, and Internet companies collect and publish statistical information [6,19]. The census bureau may, for instance, publish the result of a query such as: "How many individuals have incomes that exceed \$100,000?". An implicit hope in this approach is that aggregate information is sufficiently anonymous so as not to breach the privacy of any individual. Unfortunately, publication schemes initially thought to be "private" have succumbed to privacy attacks [19,17,1], highlighting the urgent need for mechanisms that are provably private. The differential privacy literature [10,8,16,18,5,7,12,4] has proposed a rigorous and quantifiable definition of privacy, as well as provably privacy-preserving mechanisms for diverse applications including statistical queries, machine learning, and pricing.
Informally, for $\alpha \in [0, 1]$, a randomized mechanism is $\alpha$-differentially private if changing a row of the underlying database (the data of a single individual) changes the probability of every mechanism output by at most an $\alpha$ factor. Larger values of $\alpha$ correspond to stronger privacy guarantees, and differential privacy is typically achieved by injecting random noise whose magnitude grows with $\alpha$. While it is trivially possible to achieve any level of differential privacy, for instance by always returning random noise, doing so completely defeats the original purpose of providing useful information. On the other hand, returning fully accurate results can lead to privacy disclosures [8]. The goal of this paper is to identify, for each $\alpha \in [0, 1]$, the optimal (i.e., utility-maximizing) $\alpha$-differentially private mechanism.
We consider databases with $n$ rows drawn from a finite domain $D$; every row corresponds to an individual. Two databases are {\em neighbors} if they coincide in $n-1$ rows. A {\em count query} $f$ takes a database $d \in D^n$ as input and returns the result $f(d) \in N = \{0, \ldots, n\}$, the number of rows that satisfy a fixed, non-trivial predicate on the domain $D$. Such queries are also called predicate or subset-sum queries; they have been extensively studied in their own right [7,4,12,5], and form a basic primitive from which more complex queries can be constructed [4].
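As a concrete illustration (ours, not part of the original formalism), a count query is straightforward to evaluate; the rows below are made up and the predicate is the income example from the introduction.

\begin{verbatim}
def count_query(database, predicate):
    """Return f(d): the number of rows satisfying the predicate."""
    return sum(1 for row in database if predicate(row))

# Hypothetical database d with n = 4 rows.
d = [{"income": 120_000}, {"income": 45_000},
     {"income": 101_000}, {"income": 99_000}]

print(count_query(d, lambda row: row["income"] > 100_000))  # f(d) = 2
\end{verbatim}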
A randomized mechanism with a (countable) range $R$ is a randomized function $x$ from $D^n$ to $R$, specified by the probabilities $x_{dr}$ of outputting the response $r$ when the underlying database is $d$. For $\alpha \in [0, 1]$, a mechanism $x$ is {\em $\alpha$-differentially private} if the ratio $x_{d_1 r}/x_{d_2 r}$ lies in the interval $[\alpha, 1/\alpha]$ for every possible output $r \in R$ and every pair $d_1, d_2$ of neighboring databases. (We interpret $0/0$ as $1$.) Intuitively, the probability of every response of the privacy mechanism (and hence the probability of a successful privacy attack following an interaction with the mechanism) is, up to a controllable $\alpha$ factor, independent of whether a given user “opts in” or “opts out” of the database.
A mechanism is {\em oblivious} if, for all $r \in R$, $x_{d_1 r} = x_{d_2 r}$ whenever $f(d_1) = f(d_2)$; that is, if the output distribution depends only on the query result. Most of this paper considers only oblivious mechanisms; for optimal privacy mechanism design, this is without loss of generality in a precise sense (see Section 6.2). The notation and definitions above simplify for oblivious mechanisms and count queries. We can specify an oblivious mechanism via the probabilities $x_{ir}$ of outputting a response $r \in R$ for each query result $i \in N$; $\alpha$-differential privacy is then equivalent to the constraint that the ratio $x_{ir}/x_{(i+1)r}$ lies in the interval $[\alpha, 1/\alpha]$ for every possible output $r \in R$ and query result $i \in N \setminus \{n\}$.
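For intuition, the following sketch (ours, not from the paper) represents an oblivious mechanism as a table of probabilities $x_{ir}$ and checks the adjacent-result ratio condition above; it assumes $0 < \alpha \le 1$ and a finite response set indexed by position.

\begin{verbatim}
def is_alpha_dp_oblivious(x, alpha):
    """Check alpha-differential privacy of an oblivious mechanism
    for a count query, assuming 0 < alpha <= 1.

    x[i][r] is the probability of outputting response index r when the
    query result is i.  Since neighboring databases change a count query
    result by at most one, it suffices that x[i][r] / x[i+1][r] lies in
    [alpha, 1/alpha] for every response r (interpreting 0/0 as 1).
    """
    for row_i, row_next in zip(x, x[1:]):
        for p, q in zip(row_i, row_next):
            if p == 0 and q == 0:
                continue  # 0/0 is read as 1
            if q == 0 or not (alpha <= p / q <= 1 / alpha):
                return False
    return True
\end{verbatim}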
Example 2.1 (Geometric Mechanism) The {\em $\alpha$-geometric mechanism} is defined as follows. When the true query result is $f(d)$, the mechanism outputs $f(d) + Z$, where $Z$ is a random variable with a two-sided geometric distribution: $\Pr[Z = z] = \frac{1-\alpha}{1+\alpha}\,\alpha^{|z|}$ for every integer $z$. This (oblivious) mechanism is $\alpha$-differentially private because the probabilities of adjacent points in its range differ by an $\alpha$ factor and because the true answer to a count query differs by at most one on neighboring databases.
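A minimal sampling sketch follows; it relies only on the standard fact that the difference of two i.i.d. one-sided geometric variables is two-sided geometric. The function names are ours and the sketch assumes $0 \le \alpha < 1$.

\begin{verbatim}
import random

def two_sided_geometric(alpha, rng=random):
    """Sample Z with Pr[Z = z] = ((1 - alpha) / (1 + alpha)) * alpha**|z|.

    The difference of two i.i.d. one-sided geometric variables
    (support 0, 1, 2, ... with Pr[G = k] = (1 - alpha) * alpha**k)
    has exactly this two-sided distribution.
    """
    def one_sided():
        # Inverse-CDF sampling: Pr[G <= k] = 1 - alpha**(k + 1).
        u, k = rng.random(), 0
        while u > 1 - alpha ** (k + 1):
            k += 1
        return k
    return one_sided() - one_sided()

def geometric_mechanism(true_count, alpha, rng=random):
    """alpha-geometric mechanism: the true count plus two-sided noise."""
    return true_count + two_sided_geometric(alpha, rng)

print(geometric_mechanism(2, 0.5))  # e.g., perturb f(d) = 2 at alpha = 0.5
\end{verbatim}

Note that the sampled output can fall outside $N$, since $Z$ ranges over all integers.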
This paper pursues strong and general utility guarantees. Just as differential privacy guarantees protection against every potential attacker, independent of its side information, we seek mechanisms that guarantee near-optimal utility to every potential user, independent of its side information and preferences.
We now formally define preferences and side information. We model the preferences of a user via a loss function $l$; $l(i, r)$ denotes the user's loss when the query result is $i$ and the mechanism's (perturbed) output is $r$. We allow $l$ to be arbitrary, subject only to being nonnegative and nondecreasing in $|i - r|$ for each fixed $i$. For example, the loss function $|i - r|$ measures the mean error, the implicit measure of (dis)utility in most previous literature on differential privacy. Two among the many other natural possibilities are $(i - r)^2$, which essentially measures the variance of the error, and a binary loss that is $0$ if $r = i$ and $1$ otherwise.
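To make the expected-loss objective concrete, the following sketch (ours; the prior, mechanism, and names are hypothetical) computes a user's expected loss for an oblivious mechanism, given a prior over query results and a loss function.

\begin{verbatim}
def expected_loss(prior, x, loss):
    """Expected loss of an oblivious mechanism under a user's prior.

    prior[i]   -- the user's prior probability that the query result is i
    x[i][r]    -- probability the mechanism outputs response r on result i
    loss(i, r) -- the user's loss when the result is i and the output is r
    """
    return sum(prior[i] * x[i][r] * loss(i, r)
               for i in range(len(prior))
               for r in range(len(x[i])))

# Example: uniform prior over results {0, 1, 2}, mean-error loss |i - r|,
# and a made-up oblivious mechanism over responses {0, 1, 2}.
prior = [1/3, 1/3, 1/3]
x = [[0.6, 0.3, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
print(expected_loss(prior, x, lambda i, r: abs(i - r)))
\end{verbatim}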