Boosting the Accuracy of Differentially-Private Histograms Through Consistency


Authors: Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu

Michael Hay†, Vibhor Rastogi‡, Gerome Miklau†, Dan Suciu‡
†University of Massachusetts Amherst, {mhay, miklau}@cs.umass.edu
‡University of Washington, {vibhor, suciu}@cs.washington.edu

ABSTRACT

We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.

1. INTRODUCTION

Recent work in differential privacy [9] has shown that it is possible to analyze sensitive data while ensuring strong privacy guarantees. Differential privacy is typically achieved through random perturbation: the analyst issues a query and receives a noisy answer. To ensure privacy, the noise is carefully calibrated to the sensitivity of the query. Informally, query sensitivity measures how much a small change to the database—such as adding or removing a person's private record—can affect the query answer. Such query mechanisms are simple, efficient, and often quite accurate. In fact, one mechanism has recently been shown to be optimal for a single counting query [10]—i.e., there is no better noisy answer to return under the desired privacy objective. However, analysts typically need to compute multiple statistics on a database.
Differentially private algorithms extend nicely to a set of queries, but there can be difficult trade-offs among alternative strategies for answering a workload of queries. Consider the analyst of a private student database who requires answers to the following queries: the total number of students, x_t; the number of students x_A, x_B, x_C, x_D, x_F receiving grades A, B, C, D, and F, respectively; and the number of passing students, x_p (grade D or higher).

[Figure 1: Our approach to querying private data. Step 1: the analyst sends the query sequence Q to the data owner. Step 2: the data owner evaluates Q(I) on the private data and returns the noisy answer q̃ through a differentially private interface. Step 3: constrained inference, using the constraints γ_Q, produces the consistent answer q̄.]

Using a differentially private interface, a first alternative is to request noisy answers for just (x_A, x_B, x_C, x_D, x_F) and use those answers to compute answers for x_t and x_p by summation. The sensitivity of this set of queries is 1 because adding or removing one tuple changes exactly one of the five outputs by a value of one. Therefore, the noise added to individual answers is low and the noisy answers are accurate estimates of the truth. Unfortunately, the noise accumulates under summation, so the estimates for x_t and x_p are worse.

A second alternative is to request noisy answers for all queries (x_t, x_p, x_A, x_B, x_C, x_D, x_F). This query set has sensitivity 3 (one change could affect three return values, each by a value of one), and the privacy mechanism must add more noise to each component. This means the estimates for x_A, x_B, x_C, x_D, x_F are worse than above, but the estimates for x_t and x_p may be more accurate. There is another concern, however: inconsistency. The noisy answers are likely to violate the following constraints, which one would naturally expect to hold: x_t = x_p + x_F and x_p = x_A + x_B + x_C + x_D.
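A back-of-the-envelope comparison of the two alternatives can make the trade-off concrete. The sketch below uses the fact (from the Laplace mechanism reviewed in Section 2) that each noisy answer carries variance 2(∆/ε)², where ∆ is the sensitivity; the value of ε is illustrative, not from the paper:

```python
# Per-answer variance of Laplace noise Lap(b) is 2*b**2, with b = sensitivity/eps.
eps = 1.0  # illustrative privacy parameter

# Alternative 1: query only the five grade counts (sensitivity 1).
err_grade = 2 * (1 / eps) ** 2        # error on each of x_A, ..., x_F
err_total_by_sum = 5 * err_grade      # x_t = sum of five noisy counts

# Alternative 2: query all seven counts (sensitivity 3).
err_direct = 2 * (3 / eps) ** 2       # error on every answer, incl. x_t

print(err_grade, err_total_by_sum, err_direct)  # 2.0 10.0 18.0
```

In this small example, even the summed estimate of x_t (error 10/ε²) beats the direct one (error 18/ε²); the point of the approach below is that constrained inference reconciles the redundant estimates rather than forcing a choice between them.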
This means the analyst must find a way to reconcile the fact that there are two different estimates for the total number of students and two different estimates for the number of passing students. We propose a technique for resolving inconsistency in a set of noisy answers, and show that doing so can actually increase accuracy. As a result, we show that strategies inspired by the second alternative can be superior in many cases.

Overview of Approach. Our approach, shown pictorially in Figure 1, involves three steps. First, given a task—such as computing a histogram over student grades—the analyst chooses a set of queries Q to send to the data owner. The choice of queries will depend on the particular task, but in this work they are chosen so that constraints hold among the answers. For example, rather than issue (x_A, x_B, x_C, x_D, x_F), the analyst would formulate the query as (x_t, x_p, x_A, x_B, x_C, x_D, x_F), which has consistency constraints. The query set Q is sent to the data owner.

[Figure 2: (a) Illustration of sample data representing a bipartite graph of network connections between sources 000, 001, 010, 011 and destinations 1.00, 1.01, 3.00, 3.01, 3.10, ...; (b) Definitions and sample values for alternative query sequences: L counts the number of connections for each source, H provides a hierarchy of range counts, and S returns an ordered degree sequence for the implied graph.

Query definitions:
L: ⟨C_000, C_001, C_010, C_011⟩
H: ⟨C_0**, C_00*, C_01*, C_000, C_001, C_010, C_011⟩
S: sort(L)

True answer L(I) = ⟨2, 0, 10, 2⟩; private output L̃(I) = ⟨3, 1, 11, 1⟩.
True answer H(I) = ⟨14, 2, 12, 2, 0, 10, 2⟩; private output H̃(I) = ⟨13, 3, 11, 4, 1, 12, 1⟩; inferred answer H̄(I) = ⟨14, 3, 11, 3, 0, 11, 0⟩.
True answer S(I) = ⟨0, 2, 2, 10⟩; private output S̃(I) = ⟨1, 2, 0, 11⟩; inferred answer S̄(I) = ⟨1, 1, 1, 11⟩.]
In the second step, the data owner answers the set of queries, using a standard differentially private mechanism [9], as follows. The queries are evaluated on the private database and the true answer Q(I) is computed. Then random independent noise is added to each answer in the set, where the data owner scales the noise based on the sensitivity of the query set. The set of noisy answers q̃ is sent to the analyst. Importantly, because this step is unchanged from [9], it offers the same differential privacy guarantee.

The above step ensures privacy, but the set of noisy answers returned may be inconsistent. In the third and final step, the analyst post-processes the set of noisy answers to resolve inconsistencies among them. We propose a novel approach for resolving inconsistencies, called constrained inference, that finds a new set of answers q̄ that is the "closest" set to q̃ that also satisfies the consistency constraints. For two histogram tasks, our main technical contributions are efficient techniques for the third step and a theoretical and empirical analysis of the accuracy of q̄. The surprising finding is that q̄ can be significantly more accurate than q̃.

We emphasize that the constrained inference step has no impact on the differential privacy guarantee. The analyst performs this step without access to the private data, using only the constraints and the noisy answers, q̃. The noisy answers q̃ are the output of a differentially private mechanism; any post-processing of the answers cannot diminish this rigorous privacy guarantee. The constraints are properties of the query, not the database, and therefore known by the analyst a priori. For example, the constraint x_p = x_A + x_B + x_C + x_D is simply the definition of x_p. Intuitively, however, it would seem that if noise is added for privacy and then constrained inference reduces the noise, some privacy has been lost.
In fact, our results show that existing techniques add more noise than is strictly necessary to ensure differential privacy. The extra noise provides no quantifiable gain in privacy but does have a significant cost in accuracy. We show that constrained inference can be an effective strategy for boosting accuracy.

The increase in accuracy we achieve depends on the input database and the privacy parameters. For instance, for some databases and levels of noise, the perturbation may tend to produce answers that do not violate the constraints. In this case the inference step would not improve accuracy. But we show that our inference process never reduces accuracy and give conditions under which it will boost accuracy. In practice, we find that many real datasets have data distributions for which our techniques significantly improve accuracy.

Histogram tasks. We demonstrate this technique on two specific tasks related to histograms. For relational schema R(A, B, ...), we choose one attribute A on which histograms are built (called the range attribute). We assume the domain of A, dom, is ordered.

We explain these tasks using sample data that will serve as a running example throughout the paper, and is also the basis of later experiments. The relation R(src, dst), shown in Fig. 2, represents a trace of network communications between a source IP address (src) and a destination IP address (dst). It is bipartite because it represents flows through a gateway router from internal to external addresses.

In a conventional histogram, we form disjoint intervals for the range attribute and compute counting queries for each specified range. In our example, we use src as the range attribute. There are four source addresses present in the table.
If we ask for counts of all unit-length ranges, then the histogram is simply the sequence ⟨2, 0, 10, 2⟩ corresponding to the (out) degrees of the source addresses ⟨000, 001, 010, 011⟩.

Our first histogram task is an unattributed histogram, in which the intervals themselves are irrelevant to the analysis and so we report only a multiset of frequencies. For the example histogram, the multiset is {0, 2, 2, 10}. An important instance of an unattributed histogram is the degree sequence of a graph, a crucial measure that is widely studied [18]. If the tuples of R represent queries submitted to a search engine, and A is the search term, then an unattributed histogram shows the frequency of occurrence of all terms (but not the terms themselves), and can be used to study the distribution.

For our second histogram task, we consider more conventional sequences of counting queries in which the intervals studied may be irregular and overlapping. In this case, simply returning unattributed counts is insufficient. And because we cannot predict ahead of time all the ranges of interest, our goal is to compute privately a set of statistics sufficient to support arbitrary interval counts and thus any histogram. We call this a universal histogram.

Continuing the example, a universal histogram allows the analyst to count the number of packets sent from any single address (e.g., the count from source address 010 is 10), or from any range of addresses (e.g., the total number of packets is 14, and the number of packets from a source address matching prefix 01* is 12). While a universal histogram can be used to compute an unattributed histogram, we distinguish between the two because we show the latter can be computed much more accurately.

Contributions. For both unattributed and universal histograms, we propose a strategy for boosting the accuracy of existing differentially private algorithms.
For each task, (1) we show that there is an efficiently computable, closed-form expression for the consistent query answer closest to a private randomized output; (2) we prove bounds on the error of the inferred output, showing under what conditions inference boosts accuracy; (3) we demonstrate significant improvements in accuracy through experiments on real data sets. Unattributed histograms are extremely accurate, with error at least an order of magnitude lower than existing techniques. Our approach to universal histograms can reduce error for larger ranges by 45–98%, and improves on all ranges in some cases.

2. BACKGROUND

In this section, we introduce the concept of query sequences and how they can be used to support histograms. Then we review differential privacy and show how queries can be answered under differential privacy. Finally, we formalize our constrained inference process.

All of the tasks considered in this paper are formulated as query sequences where each element of the sequence is a simple count query on a range. We write intervals as [x, y] for x, y ∈ dom, and abbreviate interval [x, x] as [x]. A counting query on range attribute A is:

c([x, y]) = Select count(*) From R Where x ≤ R.A ≤ y

We use Q to denote a generic query sequence (please see Appendix A for an overview of notational conventions). When Q is evaluated on a database instance I, the output, Q(I), includes one answer to each counting query, so Q(I) is a vector of non-negative integers. The i-th query in Q is Q[i].

We consider the common case of a histogram over unit-length ranges. The conventional strategy is to simply compute counts for all unit-length ranges. This query sequence is denoted L:

L = ⟨c([x_1]), ..., c([x_n])⟩, x_i ∈ dom

Example 1. Using the example in Fig 2, we assume the domain of src contains just the 4 addresses shown.
Query L is ⟨c([000]), c([001]), c([010]), c([011])⟩ and L(I) = ⟨2, 0, 10, 2⟩.

2.1 Differential Privacy

Informally, an algorithm is differentially private if it is insensitive to small changes in the input. Formally, for any input database I, let nbrs(I) denote the set of neighboring databases, each differing from I by at most one record; i.e., if I′ ∈ nbrs(I), then |(I − I′) ∪ (I′ − I)| = 1.

Definition 2.1 (ε-differential privacy). Algorithm A is ε-differentially private if for all instances I, any I′ ∈ nbrs(I), and any subset of outputs S ⊆ Range(A), the following holds:

Pr[A(I) ∈ S] ≤ exp(ε) × Pr[A(I′) ∈ S]

where the probability is taken over the randomness of A.

Differential privacy has been defined inconsistently in the literature. The original concept, called ε-indistinguishability [9], defines neighboring databases using hamming distance rather than symmetric difference (i.e., I′ is obtained from I by replacing a tuple rather than adding/removing a tuple). The choice of definition affects the calculation of query sensitivity. We use the above definition (from Dwork [7]) but observe that our results also hold under indistinguishability, due to the fact that ε-differential privacy (as defined above) implies 2ε-indistinguishability.

To answer queries under differential privacy, we use the Laplace mechanism [9], which achieves differential privacy by adding noise to query answers, where the noise is sampled from the Laplace distribution. The magnitude of the noise depends on the query's sensitivity, defined as follows (adapting the definition to the query sequences considered in this paper).

Definition 2.2 (Sensitivity). Let Q be a sequence of counting queries.
The sensitivity of Q, denoted ∆Q, is

∆Q = max_{I, I′ ∈ nbrs(I)} ||Q(I) − Q(I′)||_1

Throughout the paper, we use ||X − Y||_p to denote the L_p distance between vectors X and Y.

Example 2. The sensitivity of query L is 1. Database I′ can be obtained from I by adding or removing a single record, which affects exactly one of the counts in L by exactly 1.

Given query Q, the Laplace mechanism first computes the query answer Q(I) and then adds random noise independently to each answer. The noise is drawn from a zero-mean Laplace distribution with scale σ. As the following proposition shows, differential privacy is achieved if the Laplace noise is scaled appropriately to the sensitivity of Q and the privacy parameter ε.

Proposition 1 (Laplace mechanism [9]). Let Q be a query sequence of length d, and let ⟨Lap(σ)⟩^d denote a d-length vector of i.i.d. samples from a Laplace distribution with scale σ. The randomized algorithm Q̃ that takes as input database I and outputs the following vector is ε-differentially private:

Q̃(I) = Q(I) + ⟨Lap(∆Q/ε)⟩^d

We apply this technique to the query L. Since L has sensitivity 1, the following algorithm is ε-differentially private:

L̃(I) = L(I) + ⟨Lap(1/ε)⟩^n

We rely on Proposition 1 to ensure privacy for the query sequences we propose in this paper. We emphasize that the proposition holds for any query sequence Q, regardless of correlations or constraints among the queries in Q. Such dependencies are accounted for in the calculation of sensitivity. (For example, consider the correlated sequence Q that consists of the same query repeated k times; then the sensitivity of Q is k times the sensitivity of the query.)

We present the case where the analyst issues a single query sequence Q.
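Proposition 1 translates directly into code; a minimal sketch (the function name and use of NumPy are our choices, not the paper's):

```python
import numpy as np

def laplace_mechanism(true_answers, sensitivity, eps, rng=None):
    """Return Q(I) + <Lap(sensitivity/eps)>^d, as in Proposition 1."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps, size=len(true_answers))
    return np.asarray(true_answers, dtype=float) + noise

# L(I) = <2, 0, 10, 2> from the running example; L has sensitivity 1.
noisy_L = laplace_mechanism([2, 0, 10, 2], sensitivity=1, eps=1.0)
```

Larger ε means a smaller noise scale and answers closer to the truth, which is the privacy/accuracy trade-off the parameter controls.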
To support multiple query sequences, the protocol that computes an ε_i-differentially private response to the i-th sequence is (Σ_i ε_i)-differentially private.

To analyze the accuracy of the randomized query sequences proposed in this paper, we quantify their error. Q̃ can be considered an estimator for the true value Q(I). We use the common Mean Squared Error as a measure of accuracy.

Definition 2.3 (Error). For a randomized query sequence Q̃ whose input is Q(I), the error(Q̃) is Σ_i E(Q̃[i] − Q[i])². Here E is the expectation taken over the possible randomness in generating Q̃.

For example, error(L̃) = Σ_i E(L̃[i] − L[i])², which simplifies to: n E[Lap(1/ε)²] = n Var(Lap(1/ε)) = 2n/ε².

2.2 Constrained Inference

While L̃ can be used to support unattributed and universal histograms under differential privacy, the main contribution of this paper is the development of more accurate query strategies based on the idea of constrained inference. The specific strategies are described in the next sections. Here, we formulate the constrained inference problem.

Given a query sequence Q, let γ_Q denote a set of constraints which must hold among the (true) answers. The constrained inference process takes the randomized output of the query, denoted q̃ = Q̃(I), and finds the sequence of query answers q̄ that is "closest" to q̃ and also satisfies the constraints of γ_Q. Here closest is determined by L_2 distance, and the result is the minimum L_2 solution:

Definition 2.4 (Minimum L_2 solution). Let Q be a query sequence with constraints γ_Q. Given a noisy query sequence q̃ = Q̃(I), a minimum L_2 solution, denoted q̄, is a vector that satisfies the constraints γ_Q and at the same time minimizes ||q̃ − q̄||_2.
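When the constraints γ_Q are linear equalities, the minimum L_2 solution of Definition 2.4 is an orthogonal projection. A sketch using the student-grades constraints from the introduction (the variable ordering and helper name are our choices):

```python
import numpy as np

def min_l2_solution(q_noisy, A):
    """Project q_noisy onto {q : A q = 0}: the closest vector in L2
    distance that satisfies the linear constraints encoded by A."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(q_noisy, dtype=float)
    return q - A.T @ np.linalg.solve(A @ A.T, A @ q)

# Variables ordered (x_t, x_p, x_A, x_B, x_C, x_D, x_F); rows encode
# x_t - x_p - x_F = 0 and x_p - x_A - x_B - x_C - x_D = 0.
A = [[1, -1, 0, 0, 0, 0, -1],
     [0, 1, -1, -1, -1, -1, 0]]
q_bar = min_l2_solution([15.2, 11.9, 5.3, 2.8, 2.1, 1.7, 2.4], A)
```

The projection leaves an already-consistent vector unchanged, which matches the intuition that inference only corrects constraint violations introduced by the noise. The closed-form solutions in Sections 3 and 4 compute the same kind of projection without the linear algebra.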
We use Q̄ to denote the two-step randomized process in which the data owner first computes q̃ = Q̃(I) and then computes the minimum L_2 solution from q̃ and γ_Q. (Alternatively, the data owner can release q̃ and the latter step can be done by the analyst.) Importantly, the inference step has no impact on privacy, as stated below. (Proofs appear in the Appendix.)

Proposition 2. If Q̃ satisfies ε-differential privacy, then Q̄ satisfies ε-differential privacy.

3. UNATTRIBUTED HISTOGRAMS

To support unattributed histograms, one could use the query sequence L. However, it contains "extra" information—the attribution of each count to a particular range—which is irrelevant for an unattributed histogram. Since the association between L[i] and i is not required, any permutation of the unit-length counts is a correct response for the unattributed histogram. We formulate an alternative query that asks for the counts of L in rank order. As we will show, ordering does not increase sensitivity, but it does introduce inequality constraints that can be exploited by inference.

Formally, let a_i refer to the answer to L[i] and let U = {a_1, ..., a_n} be the multiset of answers to the queries in L. We write rank_i(U) to refer to the i-th smallest answer in U. Then the query S is defined as

S = ⟨rank_1(U), ..., rank_n(U)⟩

Example 3. In the example in Fig 2, we have L(I) = ⟨2, 0, 10, 2⟩ while S(I) = ⟨0, 2, 2, 10⟩. Thus, the answer S(I) contains the same counts as L(I) but in sorted order.

To answer S under differential privacy, we must determine its sensitivity.

Proposition 3. The sensitivity of S is 1.

By Propositions 1 and 3, the following algorithm is ε-differentially private:

S̃(I) = S(I) + ⟨Lap(1/ε)⟩^n

Since the same magnitude of noise is added to S as to L, the accuracy of S̃ and L̃ is the same. However, S implies a powerful set of constraints.
Notice that the ordering occurs before noise is added. Thus, the analyst knows that the returned counts are ordered according to the true rank order. If the returned answer contains out-of-order counts, this must be caused by the addition of random noise, and they are inconsistent. Let γ_S denote the set of inequalities S[i] ≤ S[i + 1] for 1 ≤ i < n. We show next how to exploit these constraints to boost accuracy.

3.1 Constrained Inference: Computing S̄

As outlined in the introduction, the analyst sends query S to the data owner and receives a noisy answer s̃ = S̃(I), the output of the differentially private algorithm S̃ evaluated on the private database I. We now describe a technique for post-processing s̃ to find an answer that is consistent with the ordering constraints.

Formally, given s̃, the objective is to find an s̄ that minimizes ||s̃ − s̄||_2 subject to the constraints s̄[i] ≤ s̄[i + 1] for 1 ≤ i < n. The solution has a surprisingly elegant closed form. Let s̃[i, j] be the subsequence of j − i + 1 elements: ⟨s̃[i], s̃[i + 1], ..., s̃[j]⟩. Let M̃[i, j] be the mean of these elements, i.e., M̃[i, j] = Σ_{k=i}^{j} s̃[k] / (j − i + 1).

Theorem 1. Denote L_k = min_{j ∈ [k,n]} max_{i ∈ [1,j]} M̃[i, j] and U_k = max_{i ∈ [1,k]} min_{j ∈ [i,n]} M̃[i, j]. The minimum L_2 solution s̄ is unique and given by: s̄[k] = L_k = U_k.

Since we first stated this result in a technical report [13], we have learned that this problem is an instance of isotonic regression (i.e., least squares regression under ordering constraints on the estimands). The statistics literature gives several characterizations of the solution, including the above min-max formulas (cf. Barlow et al. [3]), as well as linear time algorithms for computing it (cf. Barlow et al. [2]).

Example 4. We give three examples of s̃ and its closest ordered sequence s̄.
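(The linear-time computation mentioned above can be sketched as a pool-adjacent-violators pass, which merges out-of-order runs into blocks and replaces each block by its mean; this is our illustration of the standard technique, not code from the paper, and the examples that follow can be checked against it.)

```python
def closest_ordered(s_noisy):
    """Minimum-L2 nondecreasing fit via pool adjacent violators."""
    blocks = []  # each block is (sum, count); its fitted value is sum / count
    for v in s_noisy:
        s, c = float(v), 1
        # Merge with the previous block while the ordering is violated.
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    return [s / c for s, c in blocks for _ in range(c)]
```

Applied to the three sequences of the example, it returns ⟨9, 10, 14⟩, ⟨9, 12, 12⟩, and ⟨11, 11, 11, 15⟩, agreeing with the closed form of Theorem 1.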
First, suppose s̃ = ⟨9, 10, 14⟩. Since s̃ is already ordered, s̄ = s̃. Second, s̃ = ⟨9, 14, 10⟩, where the last two elements are out of order. The closest ordered sequence is s̄ = ⟨9, 12, 12⟩. Finally, let s̃ = ⟨14, 9, 10, 15⟩. The sequence is in order except for s̃[1]. While changing the first element from 14 to 9 would make it ordered, its squared distance from s̃ would be (14 − 9)² = 25. In contrast, s̄ = ⟨11, 11, 11, 15⟩ and ||s̃ − s̄||₂² = 14.

3.2 Utility Analysis: the Accuracy of S̄

Prior work in isotonic regression has shown inference cannot hurt, i.e., the accuracy of S̄ is no lower than S̃ [14].

[Figure 3: Example of how s̄ reduces the error of s̃ (ε = 1.0), plotting S(I), s̃, and s̄ by index.]

However, we are not aware of any results that give conditions for which S̄ is more accurate than S̃. Before presenting a theoretical statement of such conditions, we first give an illustrative example.

Example 5. Figure 3 shows a sequence S(I) along with a sampled s̃ and inferred s̄. While the values in s̃ deviate considerably from S(I), s̄ lies very close to the true answer. In particular, for subsequence [1, 20], the true sequence S(I) is uniform and the constrained inference process effectively averages out the noise of s̃. The twenty-first position holds a unique count in S(I), and there constrained inference does not refine the noisy answer, i.e., s̄[21] = s̃[21].

Fig 3 suggests that error(S̄) will be low for sequences in which many counts are the same (Fig 7 in Appendix C gives another intuitive view of the error reduction). The following theorem quantifies the accuracy of S̄ precisely. Let n and d denote the number of values and the number of distinct values in S(I), respectively. Let n_1, n_2, ..., n_d be the number of times each of the d distinct values occurs in S(I) (thus Σ_i n_i = n).

Theorem 2. There exist constants c_1 and c_2 independent of n and d such that

error(S̄) ≤ Σ_{i=1}^{d} (c_1 log³ n_i + c_2) / ε²

Thus error(S̄) = O(d log³ n / ε²) whereas error(S̃) = Θ(n / ε²).

The above theorem shows that constrained inference can boost accuracy, and the improvement depends on properties of the input S(I). In particular, if the number of distinct elements d is 1, then error(S̄) = O(log³ n / ε²), while error(S̃) = Θ(n / ε²). On the other hand, if d = n, then error(S̄) = O(n / ε²) and thus both error(S̄) and error(S̃) scale linearly in n. For many practical applications, d ≪ n, which makes error(S̄) significantly lower than error(S̃). In Sec. 5, experiments on real data demonstrate that the error of S̄ can be orders of magnitude lower than that of S̃.

4. UNIVERSAL HISTOGRAMS

While the query sequence L is the conventional strategy for computing a universal histogram, this strategy has limited utility under differential privacy. While accurate for small ranges, the noise in the unit-length counts accumulates under summation, so for larger ranges, the estimates can easily become useless.

We propose an alternative query sequence that, in addition to asking for unit-length intervals, asks for interval counts of larger granularity. To ensure privacy, slightly more noise must be added to the counts. However, this strategy has the property that any range query can be answered via a linear combination of only a small number of noisy counts, and this makes it much more accurate for larger ranges.

Our alternative query sequence, denoted H, consists of a sequence of hierarchical intervals. Conceptually, these intervals are arranged in a tree T.
Each node v ∈ T corresponds to an interval, and each node has k children, corresponding to k equally sized subintervals. The root of the tree is the interval [x_1, x_n], which is recursively divided into subintervals until, at the leaves of the tree, the intervals are unit-length: [x_1], [x_2], ..., [x_n]. For notational convenience, we define the height of the tree ℓ as the number of nodes, rather than edges, along the path from a leaf to the root. Thus, ℓ = log_k n + 1. To transform the tree into a sequence, we arrange the interval counts in the order given by a breadth-first traversal of the tree.

[Figure 4: The tree T associated with query H for the example in Fig. 2 for k = 2, with root C_0** over internal nodes C_00*, C_01* and leaves C_000, C_001, C_010, C_011.]

Example 6. Continuing from the example in Fig 2, we describe H for the src domain. The intervals are arranged into a binary (k = 2) tree, as shown in Fig 4. The root is associated with the interval [0**], which is evenly subdivided among its children. The unit-length intervals at the leaves are [000], [001], [010], [011]. The height of the tree is ℓ = 3. The intervals of the tree are arranged into the query sequence H = ⟨C_0**, C_00*, C_01*, C_000, C_001, C_010, C_011⟩. Evaluated on instance I from Fig. 2, the answer is H(I) = ⟨14, 2, 12, 2, 0, 10, 2⟩.

To answer H under differential privacy, we must determine its sensitivity. As the following proposition shows, H has a larger sensitivity than L.

Proposition 4. The sensitivity of H is ℓ.

By Propositions 1 and 4, the following algorithm is ε-differentially private:

H̃(I) = H(I) + ⟨Lap(ℓ/ε)⟩^m

where m is the length of sequence H, equal to the number of counts in the tree.

To answer a range query using H̃, a natural strategy is to sum the fewest number of sub-intervals such that their union equals the desired range.
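This "fewest sub-intervals" strategy is a standard dyadic decomposition of the range over the tree. A minimal sketch for k = 2, assuming the counts are stored in a breadth-first array (the layout and function name are our choices):

```python
def range_sum(h, n_leaves, lo, hi):
    """Answer c([lo, hi]) by summing the fewest tree counts whose
    intervals exactly cover leaves lo..hi (BFS array, k = 2)."""
    def rec(i, a, b):  # node i covers the leaf interval [a, b]
        if hi < a or b < lo:          # disjoint from the query range
            return 0.0
        if lo <= a and b <= hi:       # fully contained: take this node's count
            return h[i]
        mid = (a + b) // 2            # otherwise, descend to both children
        return rec(2 * i + 1, a, mid) + rec(2 * i + 2, mid + 1, b)
    return rec(0, 0, n_leaves - 1)

# On the true counts H(I) = <14, 2, 12, 2, 0, 10, 2> from Example 6:
H = [14, 2, 12, 2, 0, 10, 2]
range_sum(H, 4, 2, 3)   # uses the single node C_01* -> 12.0
```

At most two nodes per level are taken, which is what bounds the number of summed noisy counts by 2ℓ in the error analysis below.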
How ever, one challenge with this approac h is inconsistency: in the corresponding tree of noisy answ ers, there may b e a parent coun t that do es not equal to the sum of its children. This can b e problematic: for example, an analyst migh t ask one interv al query and then ask for a sub-interv al and receive a lar ger count. W e next look at how to use the arithmetic constraints b et ween parent and c hild counts (denoted γ H ) to derive a consisten t, and more accurate, estimate H . 5 4.1 Constrained Inference: Computing H The analyst receives ˜ h = ˜ H ( I ), the noisy output from the differentially priv ate algorithm ˜ H . W e now consider the problem of finding the minimum L 2 solution: to find the h that minimizes || ˜ h − h || 2 and also satisfies the consistency constrain ts γ H . This problem can be view ed as an instance of linear regres- sion. The unkno wns are the true counts of the unit-length in terv als. Each answer in ˜ h is a fixed linear combination of the unknowns, plus random noise. Finding h is equiv alen t to finding an estimate for the unit-length interv als. In fact, h is the familiar least squares solution. Although the least squares solution can b e computed via linear algebra, the hierarchical structure of this problem in- stance allows us to derive an intuitiv e closed form solution that is also more efficien t to compute, requiring only linear time. Let T be the tree corresp onding to ˜ h ; abusing nota- tion, for no de v ∈ T , we write ˜ h [ v ] to refer to the interv al asso ciated with v . First, w e define a p ossibly inconsistent estimate z [ v ] for eac h no de v ∈ T . The consistent estimate h [ v ] is then de- scrib ed in terms of the z [ v ] estimates. z [ v ] is defined recur- siv ely from the leav es to the ro ot. Let l denote the height of no de v and succ ( v ) denote the set of v ’s children. z [ v ] = ( ˜ h [ v ] , if v is a leaf no de k l − k l − 1 k l − 1 ˜ h [ v ] + k l − 1 − 1 k l − 1 P u ∈ succ ( v ) z [ u ] , o.w. 
The intuition behind $z[v]$ is that it is a weighted average of two estimates for the count at $v$; in fact, the weights are inversely proportional to the variances of the estimates.

The consistent estimate $\overline{h}$ is defined recursively from the root to the leaves. At the root $r$, $\overline{h}[r]$ is simply $z[r]$. As we descend the tree, if at some node $u$ we have $\overline{h}[u] \neq \sum_{w \in succ(u)} z[w]$, then we adjust the descendants' values by dividing the difference $\overline{h}[u] - \sum_{w \in succ(u)} z[w]$ equally among the $k$ children. The following theorem states that this is the minimum $L_2$ solution.

Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the unique minimum $L_2$ solution, $\overline{h}$, is given by the following recurrence relation, where $u$ is $v$'s parent:

$$\overline{h}[v] = \begin{cases} z[v], & \text{if } v \text{ is the root} \\ z[v] + \dfrac{1}{k}\Big(\overline{h}[u] - \displaystyle\sum_{w \in succ(u)} z[w]\Big), & \text{otherwise} \end{cases}$$

Theorem 3 shows that the overhead of computing $\overline{H}$ is minimal, requiring only two linear scans of the tree: a bottom-up scan to compute $z$, and then a top-down scan to compute the solution $\overline{h}$ given $z$.

4.2 Utility Analysis: the Accuracy of $\overline{H}$

We measure utility as the accuracy of range queries, and we compare three strategies: $\tilde{L}$, $\tilde{H}$, and $\overline{H}$. We start by comparing $\tilde{L}$ and $\tilde{H}$. Given range query $q = c([x, y])$, we derive an estimate for the answer as follows. For $\tilde{L}$, the estimate is simply the sum of the noisy unit-length intervals in the range: $\tilde{L}_q = \sum_{i=x}^{y} \tilde{L}[i]$. The error of each count is $2/\epsilon^2$, and so the error for the range is $error(\tilde{L}_q) = O((y - x)/\epsilon^2)$.

For $\tilde{H}$, we choose the natural strategy of summing the fewest sub-intervals of $\tilde{H}$. Let $r_1, \ldots, r_t$ be the roots of disjoint subtrees of $T$ such that the union of their ranges equals $[x, y]$. Then $\tilde{H}_q$ is defined as $\tilde{H}_q = \sum_{i=1}^{t} \tilde{H}[r_i]$.
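The decomposition of a range into disjoint subtree roots can be sketched as follows. This is a minimal Python sketch over a dyadic tree, with our own function name and interval representation; it is not code from the paper.

```python
def subtree_roots(lo, hi, x, y):
    """Return the ranges of the maximal disjoint subtrees of the dyadic
    tree over [lo, hi] whose union tiles the query range [x, y]
    (all bounds inclusive). At most two subtrees are used per level."""
    if y < lo or hi < x:
        return []                 # this subtree does not overlap the query
    if x <= lo and hi <= y:
        return [(lo, hi)]         # this subtree lies entirely inside [x, y]
    mid = (lo + hi) // 2
    return subtree_roots(lo, mid, x, y) + subtree_roots(mid + 1, hi, x, y)

# The estimate for q = c([x, y]) is then the sum of the noisy counts
# stored at these roots: sum(h_tilde[r] for r in subtree_roots(lo, hi, x, y)).
```

For example, `subtree_roots(0, 7, 1, 6)` yields `[(1, 1), (2, 3), (4, 5), (6, 6)]`: four subtree counts instead of six unit-length counts.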
Each noisy count has error equal to $2\ell^2/\epsilon^2$ (the variance of the added noise), and the number of subtrees is at most $2\ell$ (at most two per level of the tree); thus $error(\tilde{H}_q) = O(\ell^3/\epsilon^2)$.

There is clearly a tradeoff between these two strategies. While $\tilde{L}$ is accurate for small ranges, its error grows linearly with the size of the range. In contrast, the error of $\tilde{H}$ is poly-logarithmic in the size of the domain (recall that $\ell = \Theta(\log n)$). Thus, while $\tilde{H}$ is less accurate for small ranges, it is much more accurate for large ranges. If the goal of a universal histogram is to bound worst-case or total error for all range queries, then $\tilde{H}$ is the preferred strategy.

We now compare $\tilde{H}$ to $\overline{H}$. Since $\overline{H}$ is consistent, range queries can be computed simply by summing the unit-length counts. In addition to being consistent, it is also more accurate. In fact, it is in some sense optimal: among the class of strategies that (a) produce unbiased estimates for range queries and (b) derive the estimate from linear combinations of the counts in $\tilde{h}$, there is no strategy with lower mean squared error than $\overline{H}$.

Theorem 4. (i) $\overline{H}$ is a linear unbiased estimator; (ii) $error(\overline{H}_q) \le error(E_q)$ for all $q$ and all linear unbiased estimators $E$; (iii) $error(\overline{H}_q) = O(\ell^3/\epsilon^2)$ for all $q$; and (iv) there exists a query $q$ such that

$$error(\overline{H}_q) \le \frac{3}{2(\ell-1)(k-1)-k}\; error(\tilde{H}_q).$$

Part (iv) of the theorem shows that $\overline{H}$ can be more accurate than $\tilde{H}$ on some range queries. For example, in a height-16 binary tree (the kind of tree used in the experiments), there is a query $q$ where $\overline{H}_q$ is more accurate than $\tilde{H}_q$ by a factor of $\frac{2(\ell-1)(k-1)-k}{3} = 9.33$. Furthermore, the fact that $\overline{H}$ is consistent can lead to additional advantages when the domain is sparse.
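Putting Sec 4.1 together, the two scans of Theorem 3 can be sketched in a few lines. This is our own minimal rendering, assuming a complete $k$-ary tree stored in a heap-style array; the layout, function name, and example data are assumptions, but the recurrences are those given above.

```python
def constrained_inference(h_tilde, k=2):
    """h_tilde: noisy counts of a complete k-ary tree in heap layout
    (index 0 unused, root at 1, children of v at k*(v-1)+2 .. k*(v-1)+k+1).
    Returns the minimum-L2 consistent estimate in the same layout."""
    m = len(h_tilde) - 1
    children = lambda v: [u for u in range(k*(v-1)+2, k*(v-1)+k+2) if u <= m]

    # Bottom-up scan: z[v] is a weighted average of v's own noisy count
    # and the sum of its children's z values (weights as in Sec 4.1).
    z, height = [0.0] * (m + 1), [0] * (m + 1)
    for v in range(m, 0, -1):
        ch = children(v)
        if not ch:
            height[v], z[v] = 1, h_tilde[v]
        else:
            l = height[ch[0]] + 1
            height[v] = l
            z[v] = ((k**l - k**(l-1)) * h_tilde[v]
                    + (k**(l-1) - 1) * sum(z[u] for u in ch)) / (k**l - 1)

    # Top-down scan: split each node's residual equally among its children.
    h = [0.0] * (m + 1)
    h[1] = z[1]
    for v in range(1, m + 1):
        ch = children(v)
        if ch:
            resid = (h[v] - sum(z[u] for u in ch)) / k
            for u in ch:
                h[u] = z[u] + resid
    return h
```

After the two scans, every parent estimate equals the sum of its children's estimates, so any range query can be answered by summing leaf estimates.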
We propose a simple extension to $\overline{H}$: after computing $\overline{h}$, if there is a subtree rooted at $v$ such that $\overline{h}[v] \le 0$, we simply set the count at $v$, and at every node in $v$'s subtree, to zero. This is a heuristic strategy; incorporating non-negativity constraints into inference is left for future work. Nevertheless, we show in experiments that this small change can greatly reduce error in sparse regions and can lead to $\overline{H}$ being more accurate than $\tilde{L}$ even at small ranges.

5. EXPERIMENTS

We evaluate our techniques on three real datasets (explained in detail in Appendix C): NetTrace is derived from an IP-level network trace collected at a major university; Social Network is a graph derived from friendship relations on an online social network site; Search Logs is a dataset of search query logs over time from Jan. 1, 2004 to the present. Source code for the algorithms is available at the first author's website.

5.1 Unattributed Histograms

The first set of experiments evaluates the accuracy of constrained inference on unattributed histograms. We compare $\overline{S}$ to the baseline approach $\tilde{S}$. Since $\tilde{s} = \tilde{S}(I)$ is likely to be inconsistent (out of order, non-integral, and possibly negative), we consider a second baseline technique, denoted $\tilde{S}_r$, which enforces consistency by sorting $\tilde{s}$ and rounding each count to the nearest non-negative integer.

Figure 5: Error across varying datasets and $\epsilon$. Each triplet of bars represents the three estimators: $\tilde{S}$ (light gray), $\tilde{S}_r$ (gray), and $\overline{S}$ (black).

We evaluate the performance of these estimators on three queries from the three datasets. On NetTrace, the query returns the number of internal hosts to which each external host is connected ($\approx$65K external hosts). On Social Network, the query returns the degree sequence of the graph ($\approx$11K nodes).
On Search Logs, the query returns the search frequency over a 3-month period of the top 20K keywords; position $i$ in the answer vector is the number of times the $i$th ranked keyword was searched.

To evaluate the utility of an estimator, we measure its squared error. Results report the average squared error over 50 random samples from the differentially private mechanism (each sample produces a new $\tilde{s}$). We also show results for three settings of $\epsilon \in \{1.0, 0.1, 0.01\}$; smaller $\epsilon$ means more privacy, hence more random noise.

Fig 5 shows the results of the experiment. Each bar represents average performance for a single combination of dataset, $\epsilon$, and estimator. The bars represent, from left to right, $\tilde{S}$ (light gray), $\tilde{S}_r$ (gray), and $\overline{S}$ (black). The vertical axis is average squared error on a log scale. The results indicate that the proposed approach reduces the error by at least an order of magnitude across all datasets and settings of $\epsilon$. Also, the difference between $\tilde{S}_r$ and $\overline{S}$ suggests that the improvement is due not simply to enforcing integrality and non-negativity, but to the way consistency is enforced through constrained inference (though $\overline{S}$ and $\tilde{S}_r$ are comparable on Social Network at large $\epsilon$). Finally, the relative accuracy of $\overline{S}$ improves with decreasing $\epsilon$ (more noise). Appendix C provides intuition for how $\overline{S}$ reduces error.

5.2 Universal Histograms

We now evaluate the effectiveness of constrained inference for the more general task of computing a universal histogram and arbitrary range queries. We evaluate three techniques for supporting universal histograms: $\tilde{L}$, $\tilde{H}$, and $\overline{H}$. For all three approaches, we enforce integrality and non-negativity by rounding to the nearest non-negative integer. With $\overline{H}$, rounding is done as part of the inference process, using the approach described in Sec 4.2.
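The Sec 4.2 step referred to above, zeroing any subtree whose inferred root count is non-positive, can be sketched as follows. This is our own minimal rendering (the 0-indexed heap layout and function name are assumptions), and it is a heuristic applied after inference, not part of the closed-form solution.

```python
def zero_nonpositive_subtrees(h, k=2):
    """h: inferred counts of a complete k-ary tree in heap layout
    (root at index 0, children of v at k*v+1 .. k*v+k). Any subtree
    whose root estimate is <= 0 is zeroed out entirely."""
    h = list(h)
    m = len(h)
    for v in range(m):
        if h[v] <= 0:
            stack = [v]                      # zero v and its whole subtree
            while stack:
                u = stack.pop()
                h[u] = 0
                stack.extend(c for c in range(k*u + 1, k*u + k + 1) if c < m)
    return h
```

For example, on the binary tree `[5, -1, 6, 2, 3, 4, 2]`, the subtree rooted at the node with count -1 (covering its two leaves) is zeroed, giving `[5, 0, 6, 0, 0, 4, 2]`.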
We evaluate the accuracy over a set of range queries of varying size and location. The range sizes are $2^i$ for $i = 1, \ldots, \ell - 2$, where $\ell$ is the height of the tree. For each fixed size, we select the location uniformly at random. We report the average error over 50 random samples of $\tilde{l}$ and $\tilde{h}$, and, for each sample, 1000 randomly chosen ranges.

We evaluate the following histogram queries. On NetTrace: the number of connections for each external host. This is similar to the query in Sec 5.1 except that here the association between IP address and count is retained. On Search Logs, the query reports the temporal frequency of the query term "Obama" from Jan. 1, 2004 to the present. (A day is evenly divided into 16 units of time.)

Figure 6: A comparison of estimators $\tilde{L}$ (circles), $\tilde{H}$ (diamonds), and $\overline{H}$ (squares) on two real-world datasets (top: NetTrace; bottom: Search Logs).

Fig 6 shows the results for both datasets and varying $\epsilon$. The top row corresponds to NetTrace, the bottom to Search Logs. Within a row, each plot shows a different setting of $\epsilon \in \{1.0, 0.1, 0.01\}$. For all plots, the x-axis is the size of the range query and the y-axis is the error, averaged over sampled counts and intervals. Both axes are in log scale.

First, we compare $\tilde{L}$ and $\tilde{H}$. For unit-length ranges, $\tilde{L}$ yields more accurate estimates. This is unsurprising since it is a lower-sensitivity query and thus less noise is added for privacy.
However, the error of $\tilde{L}$ increases linearly with the size of the range. The average error of $\tilde{H}$ increases slowly with the size of the range, as larger ranges typically require summing a greater number of subtrees. For ranges larger than about 2000 units, the error of $\tilde{L}$ is higher than that of $\tilde{H}$; for the largest ranges, the error of $\tilde{L}$ is 4-8 times larger than that of $\tilde{H}$ (note the log scale).

Comparing $\overline{H}$ against $\tilde{H}$, the error of $\overline{H}$ is uniformly lower across all range sizes, settings of $\epsilon$, and datasets. The relative performance of the estimators depends on $\epsilon$. At smaller $\epsilon$, the estimates of $\overline{H}$ are more accurate relative to $\tilde{H}$ and $\tilde{L}$. Recall that as $\epsilon$ decreases, noise increases. This suggests that the relative benefit of statistical inference increases with the uncertainty in the observed data.

Finally, the results show that $\overline{H}$ can have lower error than $\tilde{L}$ over small ranges, even for leaf counts. This may be surprising since, for unit-length counts, the scale of the noise of $\overline{H}$ is larger than that of $\tilde{L}$ by a factor of $\log n$. The reduction in error is because these histograms are sparse. When the histogram contains sparse regions, $\overline{H}$ can effectively identify them because it has noisy observations at higher levels of the tree. In contrast, $\tilde{L}$ has only the leaf counts; thus, even if a range contains no records, $\tilde{L}$ will assign a positive count to roughly half of the leaves in the range.

6. RELATED WORK

Dwork has written comprehensive reviews of differential privacy [7, 8]; below we highlight the results closest to this work. The idea of post-processing the output of a differentially private mechanism to ensure consistency was introduced by Barak et al. [1], who proposed a linear program for making a set of marginals consistent, non-negative, and integral. However, unlike the present work, the post-processing is not shown to improve accuracy. Blum et al.
[4] propose an efficient algorithm to publish synthetic data that is useful for range queries. In comparison with our hierarchical histogram, the technique of Blum et al. scales slightly better (logarithmic versus poly-logarithmic) in terms of domain size (with all else fixed). However, our hierarchical histogram achieves lower error for a fixed domain, and, importantly, its error does not grow as the size of the database increases, whereas the error of Blum et al. grows as $O(N^{2/3})$, where $N$ is the number of records (details in Appendix E).

The present work first appeared as an arXiv preprint [13], and since then a number of related works have emerged, including additional work by the authors. The technique for unattributed histograms has been applied to accurately and efficiently estimate the degree sequences of large social networks [12]. Several techniques for histograms over hierarchical domains have been developed. Xiao et al. [22] propose an approach based on the Haar wavelet, which is conceptually similar to the $H$ query in that it is based on a tree of queries where each level in the tree is an increasingly fine-grained summary of the data. In fact, that technique has error equivalent to a binary $H$ query, as shown by Li et al. [15], who represent both techniques as applications of the matrix mechanism, a framework for computing workloads of linear counting queries under differential privacy. We are aware of ongoing work by McSherry et al. [17] that combines hierarchical querying with statistical inference, but differs from $\overline{H}$ in that it is adaptive. Chan et al. [5] consider the problem of continual release of aggregate statistics over streaming private data, and propose a differentially private counter that is similar to $H$, in which items are hierarchically aggregated by arrival time. The $H$ and wavelet strategies are specifically designed to support range queries.
Strategies for answering more general workloads of queries under differential privacy are emerging, in both the offline [11, 15] and online [20] settings.

7. CONCLUSIONS

Our results show that by transforming a differentially private output so that it is consistent, we can boost accuracy. Part of the innovation is devising a query set so that useful constraints hold. The challenge is then to apply the constraints by searching for the closest consistent solution. Our query strategies for histograms have closed-form solutions for efficiently computing a consistent answer.

Our results show that conventional differential privacy approaches can add more noise than is strictly required by the privacy condition. We believe that using constraints may be an important part of finding optimal strategies for query answering under differential privacy. More discussion of the implications of our results, and possible extensions, is included in Appendix B.

8. ACKNOWLEDGMENTS

Hay was supported by the Air Force Research Laboratory (AFRL) and IARPA, under agreement number FA8750-07-2-0158. Hay and Miklau were supported by NSF CNS-0627642, NSF DUE-0830876, and NSF IIS-0643681. Rastogi and Suciu were supported by NSF IIS-0627585. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and IARPA, or the U.S. Government.

9. REFERENCES

[1] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In PODS, 2007.
[2] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk.
Statistical Inference Under Order Restrictions. John Wiley and Sons Ltd, 1972.
[3] R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. JASA, 67(337):140-147, 1972.
[4] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, 2008.
[5] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. In ICALP, 2010.
[6] F. R. K. Chung and L. Lu. Survey: Concentration inequalities and martingale inequalities. Internet Mathematics, 2006.
[7] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
[8] C. Dwork. A firm foundation for private data analysis. CACM, to appear, 2010.
[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[10] A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In STOC, 2009.
[11] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, 2010.
[12] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In ICDM, 2009.
[13] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. CoRR, abs/0904.0942, April 2009.
[14] J. T. G. Hwang and S. D. Peddada. Confidence interval estimation subject to order restrictions. Annals of Statistics, 1994.
[15] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing histogram queries under differential privacy. In PODS, 2010.
[16] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, 2009.
[17] F. McSherry, K. Talwar, and O. Williams. Maximum likelihood data synthesis. Manuscript, 2009.
[18] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.
[19] G. Owen. Game Theory.
Academic Press Ltd, 1982.
[20] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC, 2010.
[21] S. D. Silvey. Statistical Inference. Chapman-Hall, 1975.
[22] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In ICDE, 2010.

APPENDIX

A. NOTATIONAL CONVENTIONS

The table below summarizes the notational conventions used in the paper.

  $Q$ — Sequence of counting queries
  $L$ — Unit-Length query sequence
  $H$ — Hierarchical query sequence
  $S$ — Sorted query sequence
  $\gamma_Q$ — Constraint set for query $Q$
  $\tilde{Q}, \tilde{L}, \tilde{H}, \tilde{S}$ — Randomized query sequences
  $\overline{H}, \overline{S}$ — Randomized query sequences, returning the minimum $L_2$ solution
  $I$ — Private database instance
  $L(I), H(I), S(I)$ — Output sequences (truth)
  $\tilde{l} = \tilde{L}(I),\ \tilde{h} = \tilde{H}(I),\ \tilde{s} = \tilde{S}(I)$ — Output sequences (noisy)
  $\overline{h} = \overline{H}(I),\ \overline{s} = \overline{S}(I)$ — Output sequences (inferred)

B. DISCUSSION OF MAIN RESULTS

Here we provide a supplementary discussion of the results, review the insights gained, and discuss future directions.

Unattributed histograms. The choice of the sorted query $S$, instead of $L$, is an unqualified benefit, because we gain from the inequality constraints on the output, while the sensitivity of $S$ is no greater than that of $L$. Among other applications, this allows for extremely accurate estimation of the degree sequences of a graph, improving error by an order of magnitude over the baseline technique. The accuracy of the estimate depends on the input sequence. It works best for sequences with duplicate counts, which matches well the degree sequences of social networks encountered in practice. Future work specifically oriented towards degree sequence estimation could include a constraint enforcing that the output sequence is graphical, i.e., the degree sequence of some graph.

Universal histograms. The choice of the hierarchical counting query $H$, instead of $L$, offers a tradeoff because the sensitivity of $H$ is greater than that of $L$.
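The sensitivity gap can be seen concretely in a toy example (our own sketch; `L_counts` and `H_counts` are hypothetical helpers, not code from the paper): adding one record changes a single count of $L$, but one count of $H$ per level of the tree, matching Propositions 3 and 4.

```python
def L_counts(records, n):
    """Unit-length counts over the domain [0, n-1]."""
    counts = [0] * n
    for r in records:
        counts[r] += 1
    return counts

def H_counts(records, n):
    """Counts for every dyadic interval over [0, n-1] (n a power of two)."""
    out = {}
    def rec(lo, hi):
        out[(lo, hi)] = sum(1 for r in records if lo <= r <= hi)
        if lo < hi:
            mid = (lo + hi) // 2
            rec(lo, mid)
            rec(mid + 1, hi)
    rec(0, n - 1)
    return out

# Adding one record (here, value 6 on a domain of size 8) changes
# exactly one count of L, but one count of H at each of the tree's
# 4 levels: the intervals (0,7), (4,7), (6,7), and (6,6).
a, b = H_counts([2, 5], 8), H_counts([2, 5, 6], 8)
changed = sum(1 for key in a if a[key] != b[key])
```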
It is interesting that, for some datasets and privacy levels, the effect of the $H$ constraints outweighs the increased noise that must be added. In other cases, the algorithms based on $H$ provide greater accuracy for all but the smallest ranges. We note that in many practical settings, domains are large and sparse. The sparsity implies that no differentially private technique can yield meaningful answers for unit-length queries, because the noise necessary for privacy will drown out the signal. So while $\tilde{L}$ sometimes has higher accuracy for small range queries, this may not have practical relevance, since the relative error of the answers renders them useless.

In future work we hope to extend the technique for universal histograms to multi-dimensional range queries, and to investigate optimizations such as higher branching factors.

Across both histogram tasks, our results clearly show that it is possible to achieve greater accuracy without sacrificing privacy. The existence of our improved estimators $\overline{S}$ and $\overline{H}$ shows that there are other differentially private noise distributions that are more accurate than independent Laplace noise. This does not contradict existing results, because the original differential privacy work showed only that calibrating Laplace noise to the sensitivity of a query is sufficient for privacy, not that it is necessary. Only recently has the optimality of this construction been studied (and proven only for single queries) [10]. Finding the optimal strategy for answering a set of queries under differential privacy is an important direction for future work, especially in light of emerging private query interfaces [16].

A natural goal is to describe directly the improved noise distributions implied by $\overline{S}$ and $\overline{H}$, and to build a privacy mechanism that samples from them. This could, in theory, avoid the inference step altogether.
But it seems quite difficult to discover, describe, and sample these improved noise distributions, which will be highly dependent on the particular query of interest. Our approach suggests that constraints and constrained inference can be an effective path to discovering new, more accurate noise distributions that satisfy differential privacy. As a practical matter, our approach does not necessarily burden the analyst with the constrained inference process, because the server can implement the post-processing step. In that case it would appear to the analyst as if the server were sampling from the improved distribution.

While our focus has been on histogram queries, the techniques are probably not limited to histograms and could have broader impact. However, a general formulation may be challenging to develop. There is a subtle relationship between constraints and sensitivity: reformulating a query so that it becomes highly constrained may similarly increase its sensitivity. One challenge is finding queries, such as $S$ and $H$, that have useful constraints but remain low sensitivity. Another challenge is the computational efficiency of constrained inference, which is posed here as a constrained optimization problem with a quadratic objective function. The complexity of solving this problem depends on the nature of the constraints and is NP-hard in general. Our analysis shows that the constraint sets of $S$ and $H$ admit closed-form solutions that are efficient to compute.

C. ADDITIONAL EXPERIMENTS

This section provides detailed descriptions of the datasets, and additional results for unattributed histograms.

NetTrace is derived from an IP-level network trace collected at a major university. The trace monitors traffic at the gateway between internal IP addresses and external IP addresses.
From this data, we derived a bipartite connection graph where the nodes are hosts, labeled by their IP address, and an edge connotes the transmission of at least one data packet. Here, differential privacy ensures that individual connections remain private.

Social Network is a graph derived from friendship relations on an online social network site. The graph is limited to a population of roughly 11,000 students from a single university. Differential privacy implies that friendships will not be disclosed. The size of the graph (the number of students) is assumed to be public knowledge. (This is not a critical assumption: in fact, the number of students can be estimated privately within $\pm 1/\epsilon$ in expectation, and our techniques can be applied directly to either the true count or a noisy estimate.)

Figure 7: On NetTrace, $S(I)$ (solid gray), the average error of $\overline{S}$ (solid black) and $\tilde{S}$ (dotted gray), for $\epsilon = 1.0$.

Search Logs is a dataset of search query logs over time from Jan. 1, 2004 to the present. For privacy reasons, it is difficult to obtain such data. Our dataset is derived from a search engine interface that publishes summary statistics for specified query terms. We combined these summary statistics with a second dataset, which contains actual search query logs but for a much shorter time period, to produce a synthetic dataset. In the experiments, ground truth refers to this synthetic dataset. Differential privacy guarantees that the output will prevent the association of an individual entity (user, host) with a particular search term.

Unattributed histograms. Figure 7 provides some intuition for how inference is able to reduce error. Shown is a portion of the unattributed histogram of NetTrace: the sequence is sorted in descending order along the x-axis, and the y-axis indicates the count.
The solid gray line corresponds to ground truth: a long horizontal stretch indicates a subsequence of uniform counts, and a vertical drop indicates a decrease in count. The graphic shows only the middle portion of the unattributed histogram; some very large and very small counts are omitted to improve legibility. The solid black lines indicate the error of $\overline{S}$ averaged over 200 random samples of $\tilde{S}$ (with $\epsilon = 1.0$); the dotted gray lines indicate the expected error of $\tilde{S}$.

The inset graph on the left reveals larger error at the beginning of the sequence, where each count occurs once or only a few times. However, as the counts become more concentrated (longer subsequences of uniform count), the error diminishes, as shown in the right inset. Some error remains around the points in the sequence where the count changes, but the error is reduced to zero for positions in the middle of uniform subsequences.

Figure 7 illustrates that our approach reduces or eliminates noise in precisely the parts of the sequence where the noise is unnecessary for privacy. Changing a tuple in the database cannot change a count in the middle of a uniform subsequence, only at the end points. These experimental results also align with Theorem 2, which states that the error of $\overline{S}$ is a function of the number of distinct counts in the sequence. In fact, the experimental results suggest that the theorem also holds locally for subsequences with a small number of distinct counts. This is an important result, since the typical degree sequences that arise in real data, such as power-law distributions, contain very large uniform subsequences.

D. PROOFS

Proof of Proposition 2.
For any output $q$, let $S(q)$ denote the set of noisy answers such that if $\tilde{q} \in S(q)$, then the minimum $L_2$ solution given $\tilde{q}$ and $\gamma_Q$ is $q$. For any $I$ and $I' \in nbrs(I)$, the following shows that $\overline{Q}$ is $\epsilon$-differentially private:

$$Pr[\overline{Q}(I) = q] = Pr[\tilde{Q}(I) \in S(q)] \le \exp(\epsilon)\, Pr[\tilde{Q}(I') \in S(q)] = \exp(\epsilon)\, Pr[\overline{Q}(I') = q]$$

where the inequality holds because $\tilde{Q}$ is $\epsilon$-differentially private.

Proof of Proposition 3. Given a database $I$, suppose we add a record to it to obtain $I'$. The added record affects one count in $L$; i.e., there is exactly one $i$ such that $L(I)[i] = x$ and $L(I')[i] = x + 1$, and all other counts are the same. The added record affects $S$ as follows. Let $j$ be the largest index such that $S(I)[j] = x$; then the added record increases the count at $j$ by one: $S(I')[j] = x + 1$. Notice that this change does not affect the sort order, i.e., in $S(I')$, the $j$th value remains in sorted order: $S(I')[j-1] \le x$, $S(I')[j] = x + 1$, and $S(I')[j+1] \ge x + 1$. All other counts in $S$ are the same, and thus the $L_1$ distance between $S(I)$ and $S(I')$ is 1.

Proof of Proposition 4. If a tuple is added to or removed from the relation, this affects the count for every range that includes it. There are exactly $\ell$ ranges that include a given tuple: the range of a single leaf, and the ranges of the nodes along the path from that leaf to the root. Therefore, adding or removing a tuple changes exactly $\ell$ counts, each by exactly 1. Thus, the sensitivity is equal to $\ell$, the height of the tree.

D.1 Proof of Theorem 1

We first restate the theorem below. Recall that $\tilde{s}[i, j]$ denotes the subsequence of $j - i + 1$ elements $\langle \tilde{s}[i], \tilde{s}[i+1], \ldots, \tilde{s}[j] \rangle$. Let $\tilde{M}[i, j]$ denote the mean of these elements, i.e., $\tilde{M}[i, j] = \sum_{k=i}^{j} \tilde{s}[k] / (j - i + 1)$.

Theorem 1.
Denote $L_k = \min_{j \in [k,n]} \max_{i \in [1,j]} \tilde{M}[i,j]$ and $U_k = \max_{i \in [1,k]} \min_{j \in [i,n]} \tilde{M}[i,j]$. The minimum $L_2$ solution $\overline{s}$ is unique and given by: $\overline{s}[k] = L_k = U_k$.

Proof. In the proof, we abbreviate the notation and implicitly assume that the range of $i$ is $[1, n]$, or $[1, j]$ when $j$ is specified. Similarly, the range of $j$ is $[1, n]$, or $[i, n]$ when $i$ is specified.

We start with the easy part, showing that $U_k \le L_k$. Define an $n \times n$ matrix $A^k$ as follows:

$$A^k_{ij} = \begin{cases} \tilde{M}[i,j] & \text{if } i \le j \\ \infty & \text{if } j < i \le k \\ -\infty & \text{otherwise} \end{cases}$$

Then $\min_j \max_i A^k_{ij} = L_k$ and $\max_i \min_j A^k_{ij} = U_k$. In any matrix $A^k$, $\max_i \min_j A^k_{ij} \le \min_j \max_i A^k_{ij}$; this is a simple fact that can be checked directly, or see [19]. Hence $U_k \le L_k$.

We show next that if $\overline{s}$ is the minimum $L_2$ solution, then $L_k \le \overline{s}[k] \le U_k$. Once this is shown, the proof of the theorem is complete, since together with $U_k \le L_k$ it gives $\overline{s}[k] = L_k = U_k$. The proof relies on the following lemma.

Lemma 1. Let $\overline{s}$ be the minimum $L_2$ solution. Then (i) $\overline{s}[1] \le U_1$, (ii) $\overline{s}[n] \ge L_n$, and (iii) for all $k$,
$$\min\big(\overline{s}[k+1], \max_i \tilde{M}[i,k]\big) \le \overline{s}[k] \le \max\big(\overline{s}[k-1], \min_j \tilde{M}[k,j]\big).$$

The proof of the lemma appears below, but first we use it to complete the proof of Theorem 1. We begin by showing that $\overline{s}[k] \le U_k$, using induction on $k$. The base case, $k = 1$, is stated in part (i) of the lemma. For the inductive step, assume $\overline{s}[k-1] \le U_{k-1}$. From (iii), we have

$$\overline{s}[k] \le \max\big(\overline{s}[k-1], \min_j \tilde{M}[k,j]\big) \le \max\big(U_{k-1}, \min_j \tilde{M}[k,j]\big) = U_k.$$

The last step follows from the definition of $U_k$. A similar induction argument shows that $\overline{s}[k] \ge L_k$, except the order is reversed: the base case is $k = n$, and the inductive step assumes $\overline{s}[k+1] \ge L_{k+1}$. The only remaining step is to prove the lemma.

Proof of Lemma 1. For (i), it is sufficient to prove that $\overline{s}[1] \le \tilde{M}[1,j]$ for all $j \in [1,n]$. Assume the contrary.
Thus there exists a $j$ such that $\overline{s}[1] > \tilde{M}[1,j]$. Let $\delta = \overline{s}[1] - \tilde{M}[1,j]$; thus $\delta > 0$. Further, for all $i$, denote $\delta_i = \overline{s}[i] - \overline{s}[1]$. Consider the sequence $s'$ defined as follows:

$$s'[i] = \begin{cases} \overline{s}[i] - \delta & \text{if } i \le j \\ \overline{s}[i] & \text{otherwise} \end{cases}$$

Since $\overline{s}$ is a sorted sequence, so is $s'$. We now claim that $\|s' - \tilde{s}\|_2 < \|\overline{s} - \tilde{s}\|_2$. For this, note that since the sequence $s'[j+1, n]$ is identical to the sequence $\overline{s}[j+1, n]$, it is sufficient to prove $\|s'[1,j] - \tilde{s}[1,j]\|_2 < \|\overline{s}[1,j] - \tilde{s}[1,j]\|_2$. To prove that, expand the squared norm as:

$$\|\overline{s}[1,j] - \tilde{s}[1,j]\|_2^2 = \sum_{i=1}^{j} (\overline{s}[i] - \tilde{s}[i])^2 = \sum_{i=1}^{j} (\overline{s}[1] + \delta_i - \tilde{s}[i])^2 = \sum_{i=1}^{j} (\tilde{M}[1,j] + \delta + \delta_i - \tilde{s}[i])^2$$

Suppose for a moment that we fix $\tilde{M}[1,j]$ and the $\delta_i$'s, and treat $\|\overline{s}[1,j] - \tilde{s}[1,j]\|_2^2$ as a function $f$ of $\delta$. The derivative of $f(\delta)$ is:

$$f'(\delta) = 2\sum_{i=1}^{j} (\tilde{M}[1,j] + \delta + \delta_i - \tilde{s}[i]) = 2\Big(j\tilde{M}[1,j] - \sum_{i=1}^{j} \tilde{s}[i]\Big) + 2j\delta + 2\sum_{i=1}^{j} \delta_i = 2j\delta + 2\sum_{i=1}^{j} \delta_i$$

Since $\delta_i \ge 0$ for all $i$, the derivative is strictly greater than zero for any $\delta > 0$, which implies that $f$ is a strictly increasing function of $\delta$ with its minimum at $\delta = 0$. Therefore, $f(\delta) > f(0) = \|s'[1,j] - \tilde{s}[1,j]\|_2^2$. This is a contradiction, since $\overline{s}$ was assumed to be the minimum solution. This completes the proof of (i).

For (ii), the proof of $\overline{s}[n] \ge \max_i \tilde{M}[i,n]$ follows from a similar argument: if $\overline{s}[n] < \tilde{M}[i,n]$ for some $i$, define $\delta = \tilde{M}[i,n] - \overline{s}[n]$ and the sequence $s'$ with elements $s'[j] = \overline{s}[j] + \delta$ for $j \ge i$. Then $s'$ can be shown to be a strictly better solution than $\overline{s}$, proving (ii).

For the proof of (iii), we first show that $\overline{s}[k] \le \max(\overline{s}[k-1], \min_j \tilde{M}[k,j])$. Assume the contrary, i.e.,
there exists a $k$ such that $\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > \min_j \tilde{M}[k,j]$. In other words, we assume there exist $k$ and $j$ such that $\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > \tilde{M}[k,j]$. Denote $\delta = \bar{s}[k] - \max(\bar{s}[k-1], \tilde{M}[k,j])$. By the assumption above, $\delta > 0$. Define the sequence
$$s'[i] = \begin{cases} \bar{s}[i] - \delta & \text{if } k \le i \le j \\ \bar{s}[i] & \text{otherwise} \end{cases}$$
Note that by construction, $s'[k] = \bar{s}[k] - \delta = \bar{s}[k] - (\bar{s}[k] - \max(\bar{s}[k-1], \tilde{M}[k,j])) = \max(\bar{s}[k-1], \tilde{M}[k,j])$. It is easy to see that $s'$ is sorted: the only inversion in the sort order could have occurred if $s'[k-1] > s'[k]$, but it does not, as $s'[k-1] = \bar{s}[k-1] \le \max(\bar{s}[k-1], \tilde{M}[k,j]) = s'[k]$. Now an argument similar to the proof of (i), applied to the subsequence $\tilde{s}[k,j]$, yields $\|s'[k,j] - \tilde{s}[k,j]\|_2 < \|\bar{s}[k,j] - \tilde{s}[k,j]\|_2$. Thus $\|s' - \tilde{s}\|_2 < \|\bar{s} - \tilde{s}\|_2$, and $s'$ is a strictly better solution than $\bar{s}$. This yields a contradiction, as $\bar{s}$ is the minimum $L_2$ solution. Hence $\bar{s}[k] \le \max(\bar{s}[k-1], \min_j \tilde{M}[k,j])$. A similar argument in the reverse direction shows that $\bar{s}[k] \ge \min(\bar{s}[k+1], \max_i \tilde{M}[i,k])$, completing the proof of (iii).

D.2 Proof of Theorem 2

We first restate the theorem below. Denote by $n$ and $d$ the number of values and the number of distinct values in $S(I)$, respectively. Let $n_1, n_2, \ldots, n_d$ be the number of times each of the $d$ distinct values occurs in $S(I)$ (thus $\sum_i n_i = n$).

Theorem 2. There exist constants $c_1$ and $c_2$ independent of $n$ and $d$ such that
$$error(\bar{S}) \le \sum_{i=1}^{d} \frac{c_1 \log^3 n_i + c_2}{\epsilon^2}$$
Thus $error(\bar{S}) = O(d \log^3 n / \epsilon^2)$, whereas $error(\tilde{S}) = \Theta(n / \epsilon^2)$.

Before giving the proof, we prove the following lemma.

Lemma 2. Let $s = S(I)$ be the input sequence. Call a translation of $s$ the operation of subtracting from each element of $s$ a fixed amount $\delta$.
Then $error(\bar{S}[i])$ is invariant under translation, for all $i$.

Proof. Denote by $Pr(\bar{s}\,|\,s)$ (resp. $Pr(\tilde{s}\,|\,s)$) the probability that $\bar{s}$ (resp. $\tilde{s}$) is output on the input sequence $s$. Denote by $s'$, $\bar{s}'$, and $\tilde{s}'$ the sequences obtained by translating $s$, $\bar{s}$, and $\tilde{s}$ by $\delta$, respectively. First observe that $Pr(\tilde{s}\,|\,s) = Pr(\tilde{s}'\,|\,s')$, as $\tilde{s}$ and $\tilde{s}'$ are obtained by adding the same Laplacian noise to $s$ and $s'$, respectively. Using Theorem 1 (since all the $U_k$'s and $L_k$'s shift by $\delta$ on translating $\tilde{s}$ by $\delta$), we get that if $\bar{s}$ is the minimum $L_2$ solution given $\tilde{s}$, then $\bar{s}'$ is the minimum $L_2$ solution given $\tilde{s}'$. Thus $Pr(\bar{s}\,|\,s) = Pr(\bar{s}'\,|\,s')$ for all sequences $\bar{s}$. Further, since $\bar{s}[i]$ and $\bar{s}'[i]$ yield the same $L_2$ error with respect to $s[i]$ and $s'[i]$, respectively, the expected $error(\bar{S}[i])$ is the same for both inputs $s$ and $s'$.

Lemma 3. Let $X$ be any positive random variable such that $\lim_{x\to\infty} x\,Pr(X > x)$ exists. Then
$$E(X) \le \int_0^\infty Pr(X > x)\, dx$$

Proof. The proof follows from the chain below.
$$E(X) = \int_0^\infty x \frac{\partial}{\partial x}\big(Pr(X \le x)\big)\,dx = -\int_0^\infty x \frac{\partial}{\partial x}\big(Pr(X > x)\big)\,dx$$
$$= -\big[x\,Pr(X > x)\big]_0^\infty + \int_0^\infty Pr(X > x)\,dx \quad \text{(by parts)}$$
$$= -\lim_{x\to\infty} x\,Pr(X > x) + \int_0^\infty Pr(X > x)\,dx \le \int_0^\infty Pr(X > x)\,dx$$
The last inequality follows because the limit exists and, since $X$ is positive, is nonnegative. This completes the proof.

We next state a theorem that was shown in [6].

Theorem 5 (Theorem 3.4 [6]). Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables satisfying $X_i \le E(X_i) + M$, for $1 \le i \le n$. We consider the sum $X = \sum_{i=1}^n X_i$ with expectation $E(X) = \sum_{i=1}^n E(X_i)$ and $Var(X) = \sum_{i=1}^n Var(X_i)$. Then we have
$$Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(Var(X) + M\lambda/3)}}$$

For a random variable $X$, denote by $I_X$ the indicator that $X \ge 0$ (thus $I_X = 1$ if $X \ge 0$ and $0$ otherwise).
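The closed form of Theorem 1, on which the proof of Lemma 2 relies, is easy to check numerically. The sketch below is a brute-force $O(n^3)$ evaluation of the min-max formulas (an illustration only, not an efficient implementation); it checks that $L_k = U_k$, that the result is sorted and $L_2$-optimal among sorted candidates, and that translating $\tilde{s}$ by $\delta$ translates the solution by $\delta$, as used in the proof of Lemma 2.

```python
import random

def minmax_solution(ts):
    """Evaluate Theorem 1's closed form directly (brute force, O(n^3)).
    ts is the noisy sequence s~; mean(i, j) is M~[i, j], 0-indexed inclusive."""
    n = len(ts)
    pre = [0.0]
    for v in ts:
        pre.append(pre[-1] + v)
    mean = lambda i, j: (pre[j + 1] - pre[i]) / (j - i + 1)
    L = [min(max(mean(i, j) for i in range(j + 1)) for j in range(k, n))
         for k in range(n)]
    U = [max(min(mean(i, j) for j in range(i, n)) for i in range(k + 1))
         for k in range(n)]
    return L, U

random.seed(0)
ts = [random.uniform(-5, 5) for _ in range(8)]
L, U = minmax_solution(ts)
assert all(abs(l - u) < 1e-9 for l, u in zip(L, U))            # L_k = U_k
sbar = U
assert all(sbar[k] <= sbar[k + 1] for k in range(7))           # sorted
# Lemma 2: translating ts by delta translates the solution by delta
delta = 3.7
_, U2 = minmax_solution([x - delta for x in ts])
assert all(abs(u2 - (u - delta)) < 1e-9 for u2, u in zip(U2, U))
# L2-optimality among sorted sequences (spot check against random candidates)
err = sum((a - b) ** 2 for a, b in zip(sbar, ts))
for _ in range(200):
    cand = sorted(random.uniform(-6, 6) for _ in range(8))
    assert err <= sum((a - b) ** 2 for a, b in zip(cand, ts)) + 1e-9
```

Note that $U_{k+1}$ maximizes over a superset of the candidates of $U_k$, so monotonicity of the output holds exactly, not merely up to rounding.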
Using Theorem 5, we prove the following lemma.

Lemma 4. Suppose $i, j$ are indices such that $s[k] \le 0$ for all $k \in [i,j]$. Then there exists a constant $c$ such that for all $\tau \ge 1$ the following holds.
$$Pr\left( \tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \ge c\, \frac{\log^2((j-i+1)\tau)}{(j-i+1)\,\epsilon^2} \right) \le \frac{1}{(j-i+1)^2 \tau^2}$$

Proof. We apply Theorem 5 to $\tilde{s}[k]$ for $k \in [i,j]$. First note that $E(\tilde{s}[k]) = s[k] \le 0$. Further, $Var(\tilde{s}[k]) = 2/\epsilon^2$, as $\tilde{s}[k]$ is obtained by adding Laplace noise of this variance to $s[k]$. We also know that $\tilde{s}[k] \ge M + s[k]$ happens with probability at most $e^{-M\epsilon}/2$. For simplicity, write $n$ for $j - i + 1$. Denoting $X = \sum_{k \in [i,j]} \tilde{s}[k]$, we see that $E(X) \le 0$ and $Var(X) = 2n/\epsilon^2$. Further, set $M = 3\log(n\tau)/\epsilon$. Denote by $B$ the event that $\tilde{s}[k] \ge M + s[k]$ for some $k$. Thus $Pr(B) \le n e^{-M\epsilon}/2 \le \frac{1}{2n^2\tau^3}$. If $B$ does not happen, we know that $\tilde{s}[k] \le M + s[k]$ for all $k \in [i,j]$, and we can then apply Theorem 5 to get:
$$Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda \log(n\tau)/\epsilon)}} + Pr(B) = e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda \log(n\tau)/\epsilon)}} + \frac{1}{2n^2\tau^3}$$
Setting $\lambda = \frac{8}{\epsilon}\sqrt{n}\log(n\tau)$ gives us
$$Pr\left(X \ge E(X) + \tfrac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$
Since $E(X) \le 0$, we get
$$Pr\left(X \ge \tfrac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$
Also, we observe that $\tilde{M}[i,j] = X/n$, which yields
$$Pr\left(\tilde{M}[i,j] \ge \frac{8\log(n\tau)}{\epsilon\sqrt{n}}\right) \le \frac{1}{n^2\tau^2}$$
Finally, observe that $\tilde{M}[i,j] \le c$ implies $\tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \le c^2$. Thus we get
$$Pr\left(\tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \ge \frac{64\log^2(n\tau)}{n\,\epsilon^2}\right) \le \frac{1}{n^2\tau^2}$$
Putting $n = j - i + 1$ and using $c = 64$ gives the required result.

Now we can give the proof of Theorem 2.

Proof of Theorem 2. The claim $error(\tilde{S}) = \Theta(n/\epsilon^2)$ is immediate, since
$$error(\tilde{S}) = \sum_{k=1}^{n} error(\tilde{s}[k]) = n \cdot \frac{2}{\epsilon^2}$$
In the rest of the proof, we bound $error(\bar{S})$. Let $s = S(I)$ be the input sequence.
We know that $s$ consists of $d$ distinct elements. Denote by $s_r$ the $r$th distinct element of $s$, and by $[l_r, u_r]$ the set of indices corresponding to $s_r$, i.e., $s[i] = s_r$ for all $i \in [l_r, u_r]$ and $s[i] \ne s_r$ for all $i \notin [l_r, u_r]$. Let $M[i,j]$ record the mean of the elements in $s[i,j]$, i.e., $M[i,j] = \sum_{k=i}^{j} s[k]/(j-i+1)$.

To bound $error(\bar{S})$, we bound $error(\bar{S}[i])$ separately for each $i$. To bound $error(\bar{S}[i])$, we can assume w.l.o.g. that $s[i]$ is $0$: if $s[i] \ne 0$, we can translate the sequence $s$ by $s[i]$, which, as shown in Lemma 2, preserves $error(\bar{S}[i])$ while making $s[i] = 0$.

Let $k \in [l_r, u_r]$ be any index for the $r$th distinct element of $s$. By definition, $error(\bar{S}[k]) = E(\bar{s}[k] - s[k])^2 = E(\bar{s}[k]^2)$ (as we can assume w.l.o.g. that $s[k] = 0$). From Theorem 1, we know that $\bar{s}[k] = U_k$. Thus $error(\bar{S}[k]) = E(U_k^2)$. Here we treat $U_k = \max_{i \le k} \min_j \tilde{M}[i,j]$ as a random variable. Now, by definition of $E$, we have
$$E(U_k^2) = E(U_k^2\, I_{U_k}) + E(U_k^2 (1 - I_{U_k})) = A + B \text{ (say)}$$
We shall bound $A$ and $B$ separately. For bounding $A$, denote $\hat{U}_k = \max_{i \le k} \tilde{M}[i, u_r]$. It is apparent that $\hat{U}_k \ge U_k$, and thus $\hat{U}_k^2\, I_{\hat{U}_k} \ge U_k^2\, I_{U_k}$. To bound $A$, we observe that
$$A = E(U_k^2\, I_{U_k}) \le E(\hat{U}_k^2\, I_{\hat{U}_k})$$
Further, since $\hat{U}_k = \max_{i \le k} \tilde{M}[i,u_r]$, we know that $\hat{U}_k^2\, I_{\hat{U}_k} = \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]}$. Thus we can write:
$$A \le E(\hat{U}_k^2\, I_{\hat{U}_k}) = E\left( \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \right)$$
Let $\tau > 1$ be any number and $c$ the constant used in Lemma 4. Denote by $e_i$ the event that
$$\tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \ge c\, \frac{\log^2((u_r - i + 1)\tau)}{(u_r - i + 1)\,\epsilon^2}$$
We can apply Lemma 4 to compute the probability of $e_i$, since $s[j] \le 0$ for all $j \le u_r$ (as we assumed w.l.o.g. that $s[k] = 0$). Thus we get $Pr(e_i) \le \frac{1}{(u_r - i + 1)^2 \tau^2}$. Define $e = \vee_{i=1}^{u_r} e_i$.
Then $Pr(e) \le \sum_{i=1}^{u_r} Pr(e_i) \le 2/\tau^2$ (as $\sum_{i=1}^{u_r} 1/i^2 \le 2$). If the event $e$ does not happen, then it is easy to see that
$$\hat{U}_k^2\, I_{\hat{U}_k} = \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \le c\, \frac{\log^2((u_r - k + 1)\tau)}{(u_r - k + 1)\,\epsilon^2}$$
Thus with probability at least $1 - 2/\tau^2$ (which is $Pr(\neg e)$), $\hat{U}_k^2\, I_{\hat{U}_k}$ is bounded as above. This yields that there exist constants $c_1$ and $c_2$ such that $E(\hat{U}_k^2\, I_{\hat{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2}$; the proof is by an application of Lemma 3 (as $\hat{U}_k$ satisfies its hypothesis) and a simple integration over $\tau$ ranging from $1$ to $\infty$. Finally, we get $A \le E(\hat{U}_k^2\, I_{\hat{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2}$.

Recall that $B = E(U_k^2(1 - I_{U_k}))$. We can write $B$ as $E(L_k^2(1 - I_{L_k}))$, as $L_k = U_k$. Using exactly the same arguments as above for $L_k$, but on the sequence $-S$, yields $B \le \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2}$. Thus $error(\bar{S}[k]) = A + B$, which is at most
$$\frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2} + \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2}$$
Summing over all indices, we obtain a bound on the total $error(\bar{S})$:
$$error(\bar{S}) = \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} error(\bar{S}[k]) \le \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2} + \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2} \le \sum_{r=1}^{d} \frac{c_1 \log^3(u_r - l_r + 1) + c_2}{\epsilon^2}$$
Finally, noting that $u_r - l_r + 1$ is just $n_r$, the number of occurrences of $s_r$ in $s$, we get $error(\bar{S}) = \sum_r \frac{c_1 \log^3 n_r + c_2}{\epsilon^2} = O(d \log^3 n / \epsilon^2)$. This completes the proof of the theorem.

D.3 Proof of Theorem 3

We first restate the theorem below.

Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the unique minimum $L_2$ solution $\bar{h}$ is given by the following recurrence relation. Let $u$ be $v$'s parent:
$$\bar{h}[v] = \begin{cases} z[v] & \text{if } v \text{ is the root} \\ z[v] + \frac{1}{k}\big(\bar{h}[u] - \sum_{w \in succ(u)} z[w]\big) & \text{otherwise} \end{cases}$$

Proof.
We first show that $\bar{h}[r] = z[r]$ for the root node $r$. By definition of a minimum $L_2$ solution, the sequence $\bar{h}$ satisfies the following constrained optimization problem (where $succZ[u] = \sum_{w \in succ(u)} z[w]$):
$$\text{minimize } \sum_v (\bar{h}[v] - \tilde{h}[v])^2 \quad \text{subject to } \forall v,\ \sum_{u \in succ(v)} \bar{h}[u] = \bar{h}[v]$$
Denote by $leaves(v)$ the set of leaf nodes in the subtree rooted at $v$. The above optimization problem can be rewritten as the following unconstrained minimization problem.
$$\text{minimize } \sum_v \Big( \big(\textstyle\sum_{l \in leaves(v)} \bar{h}[l]\big) - \tilde{h}[v] \Big)^2$$
To find the minimum, we take the derivative w.r.t. $\bar{h}[l]$ for each $l$ and equate it to $0$. We thus get the following set of equations for the minimum solution.
$$\forall l,\ \sum_{v : l \in leaves(v)} 2\Big( \big(\textstyle\sum_{l' \in leaves(v)} \bar{h}[l']\big) - \tilde{h}[v] \Big) = 0$$
Since $\sum_{l' \in leaves(v)} \bar{h}[l'] = \bar{h}[v]$, the above set of equations can be rewritten as:
$$\forall l,\ \sum_{v : l \in leaves(v)} \bar{h}[v] = \sum_{v : l \in leaves(v)} \tilde{h}[v]$$
For a leaf node $l$, we can think of the above equation for $l$ as corresponding to the path from $l$ to the root $r$ of the tree: the equation states that the sums of the sequences $\bar{h}$ and $\tilde{h}$ over the nodes along the path are the same. We can sum all the equations to obtain the following equation.
$$\sum_v \sum_{l \in leaves(v)} \bar{h}[v] = \sum_v \sum_{l \in leaves(v)} \tilde{h}[v]$$
Denote by $level(i)$ the set of nodes at height $i$ of the tree; thus the root belongs to $level(\ell - 1)$ and the leaves to $level(0)$. Abbreviating $LHS$ ($RHS$) for the left (right) hand side of the above equation, we observe the following.
$$LHS = \sum_v \sum_{l \in leaves(v)} \bar{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \bar{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i\, \bar{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \bar{h}[v] = \sum_{i=0}^{\ell-1} k^i\, \bar{h}[r] = \frac{k^\ell - 1}{k - 1}\, \bar{h}[r]$$
Here we use the fact that $\sum_{v \in level(i)} \bar{h}[v] = \bar{h}[r]$ for any level $i$, which holds because $\bar{h}$ satisfies the constraints of the tree.
In a similar way, we also simplify the RHS.
$$RHS = \sum_v \sum_{l \in leaves(v)} \tilde{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \tilde{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i\, \tilde{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Note that we cannot simplify the RHS further, as $\tilde{h}[v]$ may not satisfy the constraints of the tree. Finally, equating $LHS$ and $RHS$, we get the following equation.
$$\bar{h}[r] = \frac{k-1}{k^\ell - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Further, it is easy to expand $z[r]$ and check that
$$z[r] = \frac{k-1}{k^\ell - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Thus we get $\bar{h}[r] = z[r]$.

For a node $v$ other than the root $r$, assume that we have computed $\bar{h}[u]$ for the parent $u$ of $v$, and denote $H = \bar{h}[u]$. Once $H$ is fixed, we can argue that the value of $\bar{h}[v]$ is independent of the values of $\tilde{h}[w]$ for any $w$ not in the subtree of $u$. For nodes $w \in subtree(u)$, the $L_2$ minimization problem is equivalent to the following one.
$$\text{minimize } \sum_{w \in subtree(u)} (\bar{h}[w] - \tilde{h}[w])^2 \quad \text{subject to } \forall w \in subtree(u),\ \sum_{w' \in succ(w)} \bar{h}[w'] = \bar{h}[w] \ \text{ and } \sum_{v' \in succ(u)} \bar{h}[v'] = H$$
Again using the nodes $l \in leaves(u)$, we convert this minimization into the following one.
$$\text{minimize } \sum_{w \in subtree(u)} \Big( \big(\textstyle\sum_{l \in leaves(w)} \bar{h}[l]\big) - \tilde{h}[w] \Big)^2 \quad \text{subject to } \sum_{l \in leaves(u)} \bar{h}[l] = H$$
We can now use the method of Lagrange multipliers to find the solution of the above constrained minimization problem. Using $\lambda$ as the Lagrange parameter for the constraint $\sum_{l \in leaves(u)} \bar{h}[l] = H$, we get the following set of equations.
$$\forall l \in leaves(u),\ \sum_{w : l \in leaves(w)} 2\big(\bar{h}[w] - \tilde{h}[w]\big) = -\lambda$$
Adding the equations for all $l \in leaves(u)$ and solving for $\lambda$, we get $\lambda = -\frac{H - succZ[u]}{n(u) - 1}$, where $n(u)$ is the number of nodes in $subtree(u)$.
Finally, adding the above equations for only the leaf nodes $l \in leaves(v)$, we get
$$\bar{h}[v] = z[v] - (n(v) - 1)\cdot\lambda = z[v] + \frac{n(v)-1}{n(u)-1}\big(H - succZ[u]\big) = z[v] + \frac{1}{k}\big(\bar{h}[u] - succZ[u]\big)$$
This completes the proof.

D.4 Proof of Theorem 4

First, the theorem is restated.

Theorem 4. (i) $\bar{H}$ is a linear unbiased estimator; (ii) $error(\bar{H}_q) \le error(E_q)$ for all $q$ and for all linear unbiased estimators $E$; (iii) $error(\bar{H}_q) = O(\ell^3/\epsilon^2)$ for all $q$; and (iv) there exists a query $q$ such that
$$error(\bar{H}_q) \le \frac{3}{2(\ell-1)(k-1) - k}\, error(\tilde{H}_q)$$

Proof. For (i), the linearity of $\bar{H}$ is obvious from the definitions of $z$ and $\bar{h}$. To show $\bar{H}$ is unbiased, we first show that $z$ is unbiased, i.e., $E(z[v]) = h[v]$. We use induction: in the base case $v$ is a leaf node, in which case $E(z[v]) = E(\tilde{h}[v]) = h[v]$. If $v$ is not a leaf node, assume we have shown that $z$ is unbiased for all nodes $u \in succ(v)$. Thus
$$E(succZ[v]) = \sum_{u \in succ(v)} E(z[u]) = \sum_{u \in succ(v)} h[u] = h[v]$$
Thus $succZ[v]$ is an unbiased estimator for $h[v]$. Since $z[v]$ is a linear combination of $\tilde{h}[v]$ and $succZ[v]$, which are both unbiased estimators, $z[v]$ is also unbiased. This completes the induction step, proving that $z$ is unbiased for all nodes. Finally, we note that $\bar{h}[v]$ is a linear combination of $\tilde{h}[v]$, $z[v]$, and $succZ[v]$, all of which are unbiased estimators; thus $\bar{h}[v]$ is also unbiased, proving (i).

For (ii), we shall use the Gauss-Markov theorem [21]. We treat the sequence $\tilde{h}$ as the set of observed variables, and $l$, the sequence of original leaf counts, as the set of unobserved variables. It is easy to see that for all nodes $v$,
$$\tilde{h}[v] = \sum_{u \in leaves(v)} l[u] + noise(v)$$
Here $noise(v)$ is a Laplacian random variable, independent across nodes $v$ but with the same variance for all nodes.
Hence $\tilde{h}$ satisfies the hypothesis of the Gauss-Markov theorem. Part (i) shows that $\bar{h}$ is a linear unbiased estimator. Further, $\bar{h}$ has been obtained by minimizing the $L_2$ distance to $\tilde{h}$. Hence $\bar{h}$ is the ordinary least squares (OLS) estimator, which by the Gauss-Markov theorem has the least error. Since it is the OLS estimator, it minimizes the error for estimating any linear combination of the original counts, which includes in particular the given range query $q$.

For (iii), we note that any query $q$ can be answered by summing at most $k\ell$ nodes in the tree. Since for any node $v$, $error(\bar{H}[v]) \le error(\tilde{H}[v]) = 2\ell^2/\epsilon^2$, we get
$$error(\bar{H}_q) \le k\ell \cdot (2\ell^2/\epsilon^2) = O(\ell^3/\epsilon^2)$$
For (iv), denote by $l_1$ and $l_2$ the leftmost and rightmost leaf nodes in the tree, and by $r$ the root. We consider the query $q$ that asks for the sum of all leaf nodes except $l_1$ and $l_2$. Then, by (ii), $error(\bar{H}_q)$ is at most the error of the estimate $\tilde{h}[r] - \tilde{h}[l_1] - \tilde{h}[l_2]$, which is $6\ell^2/\epsilon^2$. On the other hand, $\tilde{H}$ requires summing $2(k-1)(\ell-1) - k$ noisy counts in total: $2(k-1)$ at each level below the root, except at the level just below the root, where only $k - 2$ counts are summed. Thus $error(\tilde{H}_q) = 2\big(2(k-1)(\ell-1) - k\big)\ell^2/\epsilon^2$, and therefore
$$error(\bar{H}_q) \le \frac{3}{2(\ell-1)(k-1) - k}\, error(\tilde{H}_q)$$
This completes the proof.

E. COMPARISON WITH BLUM ET AL.

We compare a binary $\tilde{H}_q$ against the binary-search equi-depth histogram of Blum et al. [4] in terms of $(\epsilon,\delta)$-usefulness as defined by Blum et al. Since $\epsilon$ is used in the usefulness definition, we use $\alpha$ as the parameter for $\alpha$-differential privacy. Let $N$ be the number of records in the database. An algorithm is $(\epsilon,\delta)$-useful for a class of queries if, with probability at least $1 - \delta$, the absolute error is at most $\epsilon N$ for every query in the class.
For any range query $q$, the absolute error of $\tilde{H}_q$ is $|\tilde{H}_q(I) - H_q(I)| = |Y|$, where $Y = \sum_{i=1}^{c} \gamma_i$, each $\gamma_i \sim Lap(\ell/\alpha)$, and $c$ is the number of subtrees in $\tilde{H}_q$, which is at most $2\ell$. We use Corollary 1 from [5] to bound the error of a sum of Laplace random variables. With $\nu = \sqrt{c\ell^2/\alpha^2}\sqrt{2\ln\tfrac{2}{\delta'}}$, we obtain
$$Pr\left[ |Y| \le \frac{\sqrt{16\,\ell^3 \ln\tfrac{2}{\delta'}}}{\alpha} \right] \ge 1 - \delta'$$
The above is for a single range query. To bound the error for all $\binom{n}{2}$ range queries, we use a union bound: set $\delta' = \delta / \binom{n}{2}$. Then $\tilde{H}$ is $(\epsilon,\delta)$-useful provided that
$$\epsilon \ge \frac{\sqrt{16\,\ell^3 \ln\tfrac{2\binom{n}{2}}{\delta}}}{\alpha N}$$
As in Blum et al., we can also fix $\epsilon$ and bound the size of the database: $\tilde{H}$ is $(\epsilon,\delta)$-useful when
$$N \ge \frac{\sqrt{16\,\ell^3 \ln\tfrac{2\binom{n}{2}}{\delta}}}{\epsilon\,\alpha} = O\left( \frac{\log^{3/2} n \sqrt{\log n + \log\tfrac{1}{\delta}}}{\epsilon\,\alpha} \right)$$
In comparison, the technique of Blum et al. is $(\epsilon,\delta)$-useful for range queries when
$$N \ge O\left( \frac{\log n \sqrt{\log\log n + \log\tfrac{1}{\delta}}}{\epsilon\,\alpha^3} \right)$$
Both techniques scale at most poly-logarithmically with the size of the domain. However, $\tilde{H}$ scales better with $\alpha$, achieving the same utility guarantee with a database that is smaller by a factor of $O(1/\alpha^2)$.

The above comparison reveals a distinction between the techniques: for $\tilde{H}_q$, the bound on the absolute error is independent of the database size, i.e., it depends only on $\delta$, $\alpha$, and the size of the range. However, for the Blum et al. approach, the absolute error increases with the size of the database at a rate of $O(N^{2/3})$.
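The recurrence of Theorem 3 is easy to evaluate on a small example. In the sketch below, the definition of the intermediate estimate $z$ comes from earlier in the paper; the weighting used here, $z[v] = \big((k^h - k^{h-1})\,\tilde{h}[v] + (k^{h-1} - 1)\,succZ[v]\big)/(k^h - 1)$ for a node of height $h$ (leaves having height 1), is a reconstruction chosen to be consistent with the root case $\bar{h}[r] = z[r]$ derived in the proof of Theorem 3. The assertions check that $\bar{h}$ satisfies the tree constraints and that the root matches the closed form derived for $\bar{h}[r]$.

```python
import random

k, levels = 2, 3                      # complete k-ary tree with `levels` levels (l in the paper)
n_nodes = (k ** levels - 1) // (k - 1)

def children(v):                      # heap-style layout: children of v are k*v+1 .. k*v+k
    return [u for u in range(k * v + 1, k * v + 1 + k) if u < n_nodes]

def height(v):                        # leaves have height 1, the root has height `levels`
    h = 1
    while children(v):
        v = children(v)[0]
        h += 1
    return h

random.seed(1)
ht = [random.uniform(0, 10) for _ in range(n_nodes)]   # noisy counts h~[v]

# Bottom-up: z[v] combines h~[v] with the sum of the children's z values.
z = [0.0] * n_nodes
for v in reversed(range(n_nodes)):
    c = children(v)
    if not c:
        z[v] = ht[v]                  # leaf: z = h~
    else:
        h = height(v)
        succZ = sum(z[u] for u in c)
        z[v] = ((k**h - k**(h-1)) * ht[v] + (k**(h-1) - 1) * succZ) / (k**h - 1)

# Top-down: the recurrence of Theorem 3.
hbar = [0.0] * n_nodes
hbar[0] = z[0]                        # root: h_bar[r] = z[r]
for v in range(1, n_nodes):
    u = (v - 1) // k                  # parent of v
    succZ_u = sum(z[w] for w in children(u))
    hbar[v] = z[v] + (hbar[u] - succZ_u) / k

# Consistency: the children's values sum to the parent's at every internal node.
for v in range(n_nodes):
    if children(v):
        assert abs(sum(hbar[u] for u in children(v)) - hbar[v]) < 1e-9

# Root matches h_bar[r] = (k-1)/(k^l - 1) * sum_i k^i * sum_{v in level(i)} h~[v].
root = (k - 1) / (k**levels - 1) * sum(k ** (height(v) - 1) * ht[v] for v in range(n_nodes))
assert abs(hbar[0] - root) < 1e-9
```

Note that the consistency check holds for any choice of $z$, by construction of the recurrence; it is the root check that exercises the reconstructed weighting of $z$.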
