Boosting the Accuracy of Differentially-Private Histograms Through Consistency


Authors: Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu

Michael Hay†, Vibhor Rastogi‡, Gerome Miklau†, Dan Suciu‡
†University of Massachusetts Amherst, {mhay, miklau}@cs.umass.edu
‡University of Washington, {vibhor, suciu}@cs.washington.edu

ABSTRACT

We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.

1. INTRODUCTION

Recent work in differential privacy [9] has shown that it is possible to analyze sensitive data while ensuring strong privacy guarantees. Differential privacy is typically achieved through random perturbation: the analyst issues a query and receives a noisy answer. To ensure privacy, the noise is carefully calibrated to the sensitivity of the query. Informally, query sensitivity measures how much a small change to the database—such as adding or removing a person's private record—can affect the query answer. Such query mechanisms are simple, efficient, and often quite accurate. In fact, one mechanism has recently been shown to be optimal for a single counting query [10]—i.e., there is no better noisy answer to return under the desired privacy objective. However, analysts typically need to compute multiple statistics on a database.
Differentially private algorithms extend nicely to a set of queries, but there can be difficult trade-offs among alternative strategies for answering a workload of queries. Consider the analyst of a private student database who requires answers to the following queries: the total number of students, x_t; the number of students x_A, x_B, x_C, x_D, x_F receiving grades A, B, C, D, and F, respectively; and the number of passing students, x_p (grade D or higher).

[Figure 1: Our approach to querying private data. Step 1: the analyst sends the query sequence Q to the data owner. Step 2: the data owner evaluates Q(I) on the private data and returns the noisy answer q̃ through a differentially private interface. Step 3: constrained inference, using the constraints γ_Q, produces the consistent answer q̄.]

Using a differentially private interface, a first alternative is to request noisy answers for just (x_A, x_B, x_C, x_D, x_F) and use those answers to compute answers for x_t and x_p by summation. The sensitivity of this set of queries is 1 because adding or removing one tuple changes exactly one of the five outputs by a value of one. Therefore, the noise added to individual answers is low and the noisy answers are accurate estimates of the truth. Unfortunately, the noise accumulates under summation, so the estimates for x_t and x_p are worse.

A second alternative is to request noisy answers for all queries (x_t, x_p, x_A, x_B, x_C, x_D, x_F). This query set has sensitivity 3 (one change could affect three return values, each by a value of one), and the privacy mechanism must add more noise to each component. This means the estimates for x_A, x_B, x_C, x_D, x_F are worse than above, but the estimates for x_t and x_p may be more accurate. There is another concern, however: inconsistency. The noisy answers are likely to violate the following constraints, which one would naturally expect to hold: x_t = x_p + x_F and x_p = x_A + x_B + x_C + x_D.
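A back-of-the-envelope comparison of the two alternatives can make the trade-off concrete. The sketch below uses the fact (from the Laplace mechanism reviewed in Section 2) that each noisy answer carries variance 2(∆/ε)², where ∆ is the sensitivity; the value of ε is illustrative, not from the paper:

```python
# Per-answer variance of Laplace noise Lap(b) is 2*b**2, with b = sensitivity/eps.
eps = 1.0  # illustrative privacy parameter

# Alternative 1: query only the five grade counts (sensitivity 1).
err_grade = 2 * (1 / eps) ** 2        # error on each of x_A, ..., x_F
err_total_by_sum = 5 * err_grade      # x_t = sum of five noisy counts

# Alternative 2: query all seven counts (sensitivity 3).
err_direct = 2 * (3 / eps) ** 2       # error on every answer, incl. x_t

print(err_grade, err_total_by_sum, err_direct)  # 2.0 10.0 18.0
```

In this small example, even the summed estimate of x_t (error 10/ε²) beats the direct one (error 18/ε²); the point of the approach below is that constrained inference reconciles the redundant estimates rather than forcing a choice between them.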
This means the analyst must find a way to reconcile the fact that there are two different estimates for the total number of students and two different estimates for the number of passing students. We propose a technique for resolving inconsistency in a set of noisy answers, and show that doing so can actually increase accuracy. As a result, we show that strategies inspired by the second alternative can be superior in many cases.

Overview of Approach. Our approach, shown pictorially in Figure 1, involves three steps. First, given a task—such as computing a histogram over student grades—the analyst chooses a set of queries Q to send to the data owner. The choice of queries will depend on the particular task, but in this work they are chosen so that constraints hold among the answers. For example, rather than issue (x_A, x_B, x_C, x_D, x_F), the analyst would formulate the query as (x_t, x_p, x_A, x_B, x_C, x_D, x_F), which has consistency constraints. The query set Q is sent to the data owner.

[Figure 2: (a) Illustration of sample data representing a bipartite graph of network connections between sources 000, 001, 010, 011 and destinations 1.00, 1.01, 3.00, 3.01, 3.10, ...; (b) Definitions and sample values for alternative query sequences: L counts the number of connections for each source, H provides a hierarchy of range counts, and S returns an ordered degree sequence for the implied graph.

Query definitions:
L: ⟨C_000, C_001, C_010, C_011⟩
H: ⟨C_0**, C_00*, C_01*, C_000, C_001, C_010, C_011⟩
S: sort(L)

True answer L(I) = ⟨2, 0, 10, 2⟩; private output L̃(I) = ⟨3, 1, 11, 1⟩.
True answer H(I) = ⟨14, 2, 12, 2, 0, 10, 2⟩; private output H̃(I) = ⟨13, 3, 11, 4, 1, 12, 1⟩; inferred answer H̄(I) = ⟨14, 3, 11, 3, 0, 11, 0⟩.
True answer S(I) = ⟨0, 2, 2, 10⟩; private output S̃(I) = ⟨1, 2, 0, 11⟩; inferred answer S̄(I) = ⟨1, 1, 1, 11⟩.]
In the second step, the data owner answers the set of queries, using a standard differentially private mechanism [9], as follows. The queries are evaluated on the private database and the true answer Q(I) is computed. Then random independent noise is added to each answer in the set, where the data owner scales the noise based on the sensitivity of the query set. The set of noisy answers q̃ is sent to the analyst. Importantly, because this step is unchanged from [9], it offers the same differential privacy guarantee.

The above step ensures privacy, but the set of noisy answers returned may be inconsistent. In the third and final step, the analyst post-processes the set of noisy answers to resolve inconsistencies among them. We propose a novel approach for resolving inconsistencies, called constrained inference, that finds a new set of answers q̄ that is the "closest" set to q̃ that also satisfies the consistency constraints. For two histogram tasks, our main technical contributions are efficient techniques for the third step and a theoretical and empirical analysis of the accuracy of q̄. The surprising finding is that q̄ can be significantly more accurate than q̃.

We emphasize that the constrained inference step has no impact on the differential privacy guarantee. The analyst performs this step without access to the private data, using only the constraints and the noisy answers, q̃. The noisy answers q̃ are the output of a differentially private mechanism; any post-processing of the answers cannot diminish this rigorous privacy guarantee. The constraints are properties of the query, not the database, and therefore known by the analyst a priori. For example, the constraint x_p = x_A + x_B + x_C + x_D is simply the definition of x_p. Intuitively, however, it would seem that if noise is added for privacy and then constrained inference reduces the noise, some privacy has been lost.
In fact, our results show that existing techniques add more noise than is strictly necessary to ensure differential privacy. The extra noise provides no quantifiable gain in privacy but does have a significant cost in accuracy. We show that constrained inference can be an effective strategy for boosting accuracy.

The increase in accuracy we achieve depends on the input database and the privacy parameters. For instance, for some databases and levels of noise, the perturbation may tend to produce answers that do not violate the constraints. In this case the inference step would not improve accuracy. But we show that our inference process never reduces accuracy and give conditions under which it will boost accuracy. In practice, we find that many real datasets have data distributions for which our techniques significantly improve accuracy.

Histogram tasks. We demonstrate this technique on two specific tasks related to histograms. For relational schema R(A, B, ...), we choose one attribute A on which histograms are built (called the range attribute). We assume the domain of A, dom, is ordered.

We explain these tasks using sample data that will serve as a running example throughout the paper, and is also the basis of later experiments. The relation R(src, dst), shown in Fig. 2, represents a trace of network communications between a source IP address (src) and a destination IP address (dst). It is bipartite because it represents flows through a gateway router from internal to external addresses.

In a conventional histogram, we form disjoint intervals for the range attribute and compute counting queries for each specified range. In our example, we use src as the range attribute. There are four source addresses present in the table.
If we ask for counts of all unit-length ranges, then the histogram is simply the sequence ⟨2, 0, 10, 2⟩ corresponding to the (out) degrees of the source addresses ⟨000, 001, 010, 011⟩.

Our first histogram task is an unattributed histogram, in which the intervals themselves are irrelevant to the analysis and so we report only a multiset of frequencies. For the example histogram, the multiset is {0, 2, 2, 10}. An important instance of an unattributed histogram is the degree sequence of a graph, a crucial measure that is widely studied [18]. If the tuples of R represent queries submitted to a search engine, and A is the search term, then an unattributed histogram shows the frequency of occurrence of all terms (but not the terms themselves), and can be used to study the distribution.

For our second histogram task, we consider more conventional sequences of counting queries in which the intervals studied may be irregular and overlapping. In this case, simply returning unattributed counts is insufficient. And because we cannot predict ahead of time all the ranges of interest, our goal is to compute privately a set of statistics sufficient to support arbitrary interval counts and thus any histogram. We call this a universal histogram.

Continuing the example, a universal histogram allows the analyst to count the number of packets sent from any single address (e.g., the count from source address 010 is 10), or from any range of addresses (e.g., the total number of packets is 14, and the number of packets from a source address matching prefix 01* is 12). While a universal histogram can be used to compute an unattributed histogram, we distinguish between the two because we show the latter can be computed much more accurately.

Contributions. For both unattributed and universal histograms, we propose a strategy for boosting the accuracy of existing differentially private algorithms.
For each task, (1) we show that there is an efficiently computable, closed-form expression for the consistent query answer closest to a private randomized output; (2) we prove bounds on the error of the inferred output, showing under what conditions inference boosts accuracy; (3) we demonstrate significant improvements in accuracy through experiments on real data sets. Unattributed histograms are extremely accurate, with error at least an order of magnitude lower than existing techniques. Our approach to universal histograms can reduce error for larger ranges by 45–98%, and improves on all ranges in some cases.

2. BACKGROUND

In this section, we introduce the concept of query sequences and how they can be used to support histograms. Then we review differential privacy and show how queries can be answered under differential privacy. Finally, we formalize our constrained inference process.

All of the tasks considered in this paper are formulated as query sequences where each element of the sequence is a simple count query on a range. We write intervals as [x, y] for x, y ∈ dom, and abbreviate interval [x, x] as [x]. A counting query on range attribute A is:

c([x, y]) = Select count(*) From R Where x ≤ R.A ≤ y

We use Q to denote a generic query sequence (please see Appendix A for an overview of notational conventions). When Q is evaluated on a database instance I, the output, Q(I), includes one answer to each counting query, so Q(I) is a vector of non-negative integers. The i-th query in Q is Q[i].

We consider the common case of a histogram over unit-length ranges. The conventional strategy is to simply compute counts for all unit-length ranges. This query sequence is denoted L:

L = ⟨c([x_1]), ..., c([x_n])⟩, x_i ∈ dom

Example 1. Using the example in Fig 2, we assume the domain of src contains just the 4 addresses shown.
Query L is ⟨c([000]), c([001]), c([010]), c([011])⟩ and L(I) = ⟨2, 0, 10, 2⟩.

2.1 Differential Privacy

Informally, an algorithm is differentially private if it is insensitive to small changes in the input. Formally, for any input database I, let nbrs(I) denote the set of neighboring databases, each differing from I by at most one record; i.e., if I′ ∈ nbrs(I), then |(I − I′) ∪ (I′ − I)| = 1.

Definition 2.1 (ε-differential privacy). Algorithm A is ε-differentially private if for all instances I, any I′ ∈ nbrs(I), and any subset of outputs S ⊆ Range(A), the following holds:

Pr[A(I) ∈ S] ≤ exp(ε) × Pr[A(I′) ∈ S]

where the probability is taken over the randomness of A.

Differential privacy has been defined inconsistently in the literature. The original concept, called ε-indistinguishability [9], defines neighboring databases using hamming distance rather than symmetric difference (i.e., I′ is obtained from I by replacing a tuple rather than adding/removing a tuple). The choice of definition affects the calculation of query sensitivity. We use the above definition (from Dwork [7]) but observe that our results also hold under indistinguishability, due to the fact that ε-differential privacy (as defined above) implies 2ε-indistinguishability.

To answer queries under differential privacy, we use the Laplace mechanism [9], which achieves differential privacy by adding noise to query answers, where the noise is sampled from the Laplace distribution. The magnitude of the noise depends on the query's sensitivity, defined as follows (adapting the definition to the query sequences considered in this paper).

Definition 2.2 (Sensitivity). Let Q be a sequence of counting queries.
The sensitivity of Q, denoted ∆Q, is

∆Q = max_{I, I′ ∈ nbrs(I)} ||Q(I) − Q(I′)||_1

Throughout the paper, we use ||X − Y||_p to denote the L_p distance between vectors X and Y.

Example 2. The sensitivity of query L is 1. Database I′ can be obtained from I by adding or removing a single record, which affects exactly one of the counts in L by exactly 1.

Given query Q, the Laplace mechanism first computes the query answer Q(I) and then adds random noise independently to each answer. The noise is drawn from a zero-mean Laplace distribution with scale σ. As the following proposition shows, differential privacy is achieved if the Laplace noise is scaled appropriately to the sensitivity of Q and the privacy parameter ε.

Proposition 1 (Laplace mechanism [9]). Let Q be a query sequence of length d, and let ⟨Lap(σ)⟩^d denote a d-length vector of i.i.d. samples from a Laplace distribution with scale σ. The randomized algorithm Q̃ that takes as input database I and outputs the following vector is ε-differentially private:

Q̃(I) = Q(I) + ⟨Lap(∆Q/ε)⟩^d

We apply this technique to the query L. Since L has sensitivity 1, the following algorithm is ε-differentially private:

L̃(I) = L(I) + ⟨Lap(1/ε)⟩^n

We rely on Proposition 1 to ensure privacy for the query sequences we propose in this paper. We emphasize that the proposition holds for any query sequence Q, regardless of correlations or constraints among the queries in Q. Such dependencies are accounted for in the calculation of sensitivity. (For example, consider the correlated sequence Q that consists of the same query repeated k times; then the sensitivity of Q is k times the sensitivity of the query.)

We present the case where the analyst issues a single query sequence Q.
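Proposition 1 translates directly into code; a minimal sketch (the function name and use of NumPy are our choices, not the paper's):

```python
import numpy as np

def laplace_mechanism(true_answers, sensitivity, eps, rng=None):
    """Return Q(I) + <Lap(sensitivity/eps)>^d, as in Proposition 1."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps, size=len(true_answers))
    return np.asarray(true_answers, dtype=float) + noise

# L(I) = <2, 0, 10, 2> from the running example; L has sensitivity 1.
noisy_L = laplace_mechanism([2, 0, 10, 2], sensitivity=1, eps=1.0)
```

Larger ε means a smaller noise scale and answers closer to the truth, which is the privacy/accuracy trade-off the parameter controls.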
To support multiple query sequences, the protocol that computes an ε_i-differentially private response to the i-th sequence is (Σ_i ε_i)-differentially private.

To analyze the accuracy of the randomized query sequences proposed in this paper, we quantify their error. Q̃ can be considered an estimator for the true value Q(I). We use the common Mean Squared Error as a measure of accuracy.

Definition 2.3 (Error). For a randomized query sequence Q̃ whose input is Q(I), the error(Q̃) is Σ_i E(Q̃[i] − Q[i])². Here E is the expectation taken over the possible randomness in generating Q̃.

For example, error(L̃) = Σ_i E(L̃[i] − L[i])², which simplifies to: n E[Lap(1/ε)²] = n Var(Lap(1/ε)) = 2n/ε².

2.2 Constrained Inference

While L̃ can be used to support unattributed and universal histograms under differential privacy, the main contribution of this paper is the development of more accurate query strategies based on the idea of constrained inference. The specific strategies are described in the next sections. Here, we formulate the constrained inference problem.

Given a query sequence Q, let γ_Q denote a set of constraints which must hold among the (true) answers. The constrained inference process takes the randomized output of the query, denoted q̃ = Q̃(I), and finds the sequence of query answers q̄ that is "closest" to q̃ and also satisfies the constraints of γ_Q. Here closest is determined by L_2 distance, and the result is the minimum L_2 solution:

Definition 2.4 (Minimum L_2 solution). Let Q be a query sequence with constraints γ_Q. Given a noisy query sequence q̃ = Q̃(I), a minimum L_2 solution, denoted q̄, is a vector that satisfies the constraints γ_Q and at the same time minimizes ||q̃ − q̄||_2.
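When the constraints γ_Q are linear equalities, the minimum L_2 solution of Definition 2.4 is an orthogonal projection. A sketch using the student-grades constraints from the introduction (the variable ordering and helper name are our choices):

```python
import numpy as np

def min_l2_solution(q_noisy, A):
    """Project q_noisy onto {q : A q = 0}: the closest vector in L2
    distance that satisfies the linear constraints encoded by A."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(q_noisy, dtype=float)
    return q - A.T @ np.linalg.solve(A @ A.T, A @ q)

# Variables ordered (x_t, x_p, x_A, x_B, x_C, x_D, x_F); rows encode
# x_t - x_p - x_F = 0 and x_p - x_A - x_B - x_C - x_D = 0.
A = [[1, -1, 0, 0, 0, 0, -1],
     [0, 1, -1, -1, -1, -1, 0]]
q_bar = min_l2_solution([15.2, 11.9, 5.3, 2.8, 2.1, 1.7, 2.4], A)
```

The projection leaves an already-consistent vector unchanged, which matches the intuition that inference only corrects constraint violations introduced by the noise. The closed-form solutions in Sections 3 and 4 compute the same kind of projection without the linear algebra.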
We use Q̄ to denote the two-step randomized process in which the data owner first computes q̃ = Q̃(I) and then computes the minimum L_2 solution from q̃ and γ_Q. (Alternatively, the data owner can release q̃ and the latter step can be done by the analyst.) Importantly, the inference step has no impact on privacy, as stated below. (Proofs appear in the Appendix.)

Proposition 2. If Q̃ satisfies ε-differential privacy, then Q̄ satisfies ε-differential privacy.

3. UNATTRIBUTED HISTOGRAMS

To support unattributed histograms, one could use the query sequence L. However, it contains "extra" information—the attribution of each count to a particular range—which is irrelevant for an unattributed histogram. Since the association between L[i] and i is not required, any permutation of the unit-length counts is a correct response for the unattributed histogram. We formulate an alternative query that asks for the counts of L in rank order. As we will show, ordering does not increase sensitivity, but it does introduce inequality constraints that can be exploited by inference.

Formally, let a_i refer to the answer to L[i] and let U = {a_1, ..., a_n} be the multiset of answers to the queries in L. We write rank_i(U) to refer to the i-th smallest answer in U. Then the query S is defined as

S = ⟨rank_1(U), ..., rank_n(U)⟩

Example 3. In the example in Fig 2, we have L(I) = ⟨2, 0, 10, 2⟩ while S(I) = ⟨0, 2, 2, 10⟩. Thus, the answer S(I) contains the same counts as L(I) but in sorted order.

To answer S under differential privacy, we must determine its sensitivity.

Proposition 3. The sensitivity of S is 1.

By Propositions 1 and 3, the following algorithm is ε-differentially private:

S̃(I) = S(I) + ⟨Lap(1/ε)⟩^n

Since the same magnitude of noise is added to S as to L, the accuracy of S̃ and L̃ is the same. However, S implies a powerful set of constraints.
Notice that the ordering occurs before noise is added. Thus, the analyst knows that the returned counts are ordered according to the true rank order. If the returned answer contains out-of-order counts, this must be caused by the addition of random noise, and they are inconsistent. Let γ_S denote the set of inequalities S[i] ≤ S[i + 1] for 1 ≤ i < n. We show next how to exploit these constraints to boost accuracy.

3.1 Constrained Inference: Computing S̄

As outlined in the introduction, the analyst sends query S to the data owner and receives a noisy answer s̃ = S̃(I), the output of the differentially private algorithm S̃ evaluated on the private database I. We now describe a technique for post-processing s̃ to find an answer that is consistent with the ordering constraints.

Formally, given s̃, the objective is to find an s̄ that minimizes ||s̃ − s̄||_2 subject to the constraints s̄[i] ≤ s̄[i + 1] for 1 ≤ i < n. The solution has a surprisingly elegant closed form. Let s̃[i, j] be the subsequence of j − i + 1 elements: ⟨s̃[i], s̃[i + 1], ..., s̃[j]⟩. Let M̃[i, j] be the mean of these elements, i.e., M̃[i, j] = Σ_{k=i}^{j} s̃[k] / (j − i + 1).

Theorem 1. Denote L_k = min_{j ∈ [k,n]} max_{i ∈ [1,j]} M̃[i, j] and U_k = max_{i ∈ [1,k]} min_{j ∈ [i,n]} M̃[i, j]. The minimum L_2 solution s̄ is unique and given by: s̄[k] = L_k = U_k.

Since we first stated this result in a technical report [13], we have learned that this problem is an instance of isotonic regression (i.e., least squares regression under ordering constraints on the estimands). The statistics literature gives several characterizations of the solution, including the above min-max formulas (cf. Barlow et al. [3]), as well as linear time algorithms for computing it (cf. Barlow et al. [2]).

Example 4. We give three examples of s̃ and its closest ordered sequence s̄.
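(The linear-time computation mentioned above can be sketched as a pool-adjacent-violators pass, which merges out-of-order runs into blocks and replaces each block by its mean; this is our illustration of the standard technique, not code from the paper, and the examples that follow can be checked against it.)

```python
def closest_ordered(s_noisy):
    """Minimum-L2 nondecreasing fit via pool adjacent violators."""
    blocks = []  # each block is (sum, count); its fitted value is sum / count
    for v in s_noisy:
        s, c = float(v), 1
        # Merge with the previous block while the ordering is violated.
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    return [s / c for s, c in blocks for _ in range(c)]
```

Applied to the three sequences of the example, it returns ⟨9, 10, 14⟩, ⟨9, 12, 12⟩, and ⟨11, 11, 11, 15⟩, agreeing with the closed form of Theorem 1.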
First, suppose s̃ = ⟨9, 10, 14⟩. Since s̃ is already ordered, s̄ = s̃. Second, s̃ = ⟨9, 14, 10⟩, where the last two elements are out of order. The closest ordered sequence is s̄ = ⟨9, 12, 12⟩. Finally, let s̃ = ⟨14, 9, 10, 15⟩. The sequence is in order except for s̃[1]. While changing the first element from 14 to 9 would make it ordered, its squared distance from s̃ would be (14 − 9)² = 25. In contrast, s̄ = ⟨11, 11, 11, 15⟩ and ||s̃ − s̄||₂² = 14.

3.2 Utility Analysis: the Accuracy of S̄

Prior work in isotonic regression has shown inference cannot hurt, i.e., the accuracy of S̄ is no lower than S̃ [14].

[Figure 3: Example of how s̄ reduces the error of s̃ (ε = 1.0), plotting S(I), s̃, and s̄ by index.]

However, we are not aware of any results that give conditions for which S̄ is more accurate than S̃. Before presenting a theoretical statement of such conditions, we first give an illustrative example.

Example 5. Figure 3 shows a sequence S(I) along with a sampled s̃ and inferred s̄. While the values in s̃ deviate considerably from S(I), s̄ lies very close to the true answer. In particular, for subsequence [1, 20], the true sequence S(I) is uniform and the constrained inference process effectively averages out the noise of s̃. The twenty-first position holds a unique count in S(I), and there constrained inference does not refine the noisy answer, i.e., s̄[21] = s̃[21].

Fig 3 suggests that error(S̄) will be low for sequences in which many counts are the same (Fig 7 in Appendix C gives another intuitive view of the error reduction). The following theorem quantifies the accuracy of S̄ precisely. Let n and d denote the number of values and the number of distinct values in S(I), respectively. Let n_1, n_2, ..., n_d be the number of times each of the d distinct values occurs in S(I) (thus Σ_i n_i = n).

Theorem 2. There exist constants c_1 and c_2 independent of n and d such that

error(S̄) ≤ Σ_{i=1}^{d} (c_1 log³ n_i + c_2) / ε²

Thus error(S̄) = O(d log³ n / ε²) whereas error(S̃) = Θ(n / ε²).

The above theorem shows that constrained inference can boost accuracy, and the improvement depends on properties of the input S(I). In particular, if the number of distinct elements d is 1, then error(S̄) = O(log³ n / ε²), while error(S̃) = Θ(n / ε²). On the other hand, if d = n, then error(S̄) = O(n / ε²) and thus both error(S̄) and error(S̃) scale linearly in n. For many practical applications, d ≪ n, which makes error(S̄) significantly lower than error(S̃). In Sec. 5, experiments on real data demonstrate that the error of S̄ can be orders of magnitude lower than that of S̃.

4. UNIVERSAL HISTOGRAMS

While the query sequence L is the conventional strategy for computing a universal histogram, this strategy has limited utility under differential privacy. While accurate for small ranges, the noise in the unit-length counts accumulates under summation, so for larger ranges, the estimates can easily become useless.

We propose an alternative query sequence that, in addition to asking for unit-length intervals, asks for interval counts of larger granularity. To ensure privacy, slightly more noise must be added to the counts. However, this strategy has the property that any range query can be answered via a linear combination of only a small number of noisy counts, and this makes it much more accurate for larger ranges.

Our alternative query sequence, denoted H, consists of a sequence of hierarchical intervals. Conceptually, these intervals are arranged in a tree T.
Each node v ∈ T corresponds to an interval, and each node has k children, corresponding to k equally sized subintervals. The root of the tree is the interval [x_1, x_n], which is recursively divided into subintervals until, at the leaves of the tree, the intervals are unit-length: [x_1], [x_2], ..., [x_n]. For notational convenience, we define the height of the tree ℓ as the number of nodes, rather than edges, along the path from a leaf to the root. Thus, ℓ = log_k n + 1. To transform the tree into a sequence, we arrange the interval counts in the order given by a breadth-first traversal of the tree.

[Figure 4: The tree T associated with query H for the example in Fig. 2 for k = 2, with root C_0** over internal nodes C_00*, C_01* and leaves C_000, C_001, C_010, C_011.]

Example 6. Continuing from the example in Fig 2, we describe H for the src domain. The intervals are arranged into a binary (k = 2) tree, as shown in Fig 4. The root is associated with the interval [0**], which is evenly subdivided among its children. The unit-length intervals at the leaves are [000], [001], [010], [011]. The height of the tree is ℓ = 3. The intervals of the tree are arranged into the query sequence H = ⟨C_0**, C_00*, C_01*, C_000, C_001, C_010, C_011⟩. Evaluated on instance I from Fig. 2, the answer is H(I) = ⟨14, 2, 12, 2, 0, 10, 2⟩.

To answer H under differential privacy, we must determine its sensitivity. As the following proposition shows, H has a larger sensitivity than L.

Proposition 4. The sensitivity of H is ℓ.

By Propositions 1 and 4, the following algorithm is ε-differentially private:

H̃(I) = H(I) + ⟨Lap(ℓ/ε)⟩^m

where m is the length of sequence H, equal to the number of counts in the tree.

To answer a range query using H̃, a natural strategy is to sum the fewest number of sub-intervals such that their union equals the desired range.
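This "fewest sub-intervals" strategy is a standard dyadic decomposition of the range over the tree. A minimal sketch for k = 2, assuming the counts are stored in a breadth-first array (the layout and function name are our choices):

```python
def range_sum(h, n_leaves, lo, hi):
    """Answer c([lo, hi]) by summing the fewest tree counts whose
    intervals exactly cover leaves lo..hi (BFS array, k = 2)."""
    def rec(i, a, b):  # node i covers the leaf interval [a, b]
        if hi < a or b < lo:          # disjoint from the query range
            return 0.0
        if lo <= a and b <= hi:       # fully contained: take this node's count
            return h[i]
        mid = (a + b) // 2            # otherwise, descend to both children
        return rec(2 * i + 1, a, mid) + rec(2 * i + 2, mid + 1, b)
    return rec(0, 0, n_leaves - 1)

# On the true counts H(I) = <14, 2, 12, 2, 0, 10, 2> from Example 6:
H = [14, 2, 12, 2, 0, 10, 2]
range_sum(H, 4, 2, 3)   # uses the single node C_01* -> 12.0
```

At most two nodes per level are taken, which is what bounds the number of summed noisy counts by 2ℓ in the error analysis below.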
How ever, one challenge with this approac h is inconsistency: in the corresponding tree of noisy answ ers, there may b e a parent coun t that do es not equal to the sum of its children. This can b e problematic: for example, an analyst migh t ask one interv al query and then ask for a sub-interv al and receive a lar ger count. W e next look at how to use the arithmetic constraints b et ween parent and c hild counts (denoted γ H ) to derive a consisten t, and more accurate, estimate H . 5 4.1 Constrained Inference: Computing H The analyst receives ˜ h = ˜ H ( I ), the noisy output from the differentially priv ate algorithm ˜ H . W e now consider the problem of finding the minimum L 2 solution: to find the h that minimizes || ˜ h − h || 2 and also satisfies the consistency constrain ts γ H . This problem can be view ed as an instance of linear regres- sion. The unkno wns are the true counts of the unit-length in terv als. Each answer in ˜ h is a fixed linear combination of the unknowns, plus random noise. Finding h is equiv alen t to finding an estimate for the unit-length interv als. In fact, h is the familiar least squares solution. Although the least squares solution can b e computed via linear algebra, the hierarchical structure of this problem in- stance allows us to derive an intuitiv e closed form solution that is also more efficien t to compute, requiring only linear time. Let T be the tree corresp onding to ˜ h ; abusing nota- tion, for no de v ∈ T , we write ˜ h [ v ] to refer to the interv al asso ciated with v . First, w e define a p ossibly inconsistent estimate z [ v ] for eac h no de v ∈ T . The consistent estimate h [ v ] is then de- scrib ed in terms of the z [ v ] estimates. z [ v ] is defined recur- siv ely from the leav es to the ro ot. Let l denote the height of no de v and succ ( v ) denote the set of v ’s children. z [ v ] = ( ˜ h [ v ] , if v is a leaf no de k l − k l − 1 k l − 1 ˜ h [ v ] + k l − 1 − 1 k l − 1 P u ∈ succ ( v ) z [ u ] , o.w. 
The intuition behind $z[v]$ is that it is a weighted average of two estimates for the count at $v$; in fact, the weights are inversely proportional to the variances of the estimates.

The consistent estimate $\overline{h}$ is defined recursively from the root to the leaves. At the root $r$, $\overline{h}[r]$ is simply $z[r]$. As we descend the tree, if at some node $u$ we have $\overline{h}[u] \neq \sum_{w \in succ(u)} z[w]$, then we adjust the descendants' values by dividing the difference $\overline{h}[u] - \sum_{w \in succ(u)} z[w]$ equally among the $k$ children. The following theorem states that this is the minimum $L_2$ solution.

Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the unique minimum $L_2$ solution, $\overline{h}$, is given by the following recurrence relation, where $u$ is $v$'s parent:

$$\overline{h}[v] = \begin{cases} z[v], & \text{if } v \text{ is the root} \\ z[v] + \dfrac{1}{k}\Big(\overline{h}[u] - \displaystyle\sum_{w \in succ(u)} z[w]\Big), & \text{otherwise} \end{cases}$$

Theorem 3 shows that the overhead of computing $\overline{H}$ is minimal, requiring only two linear scans of the tree: a bottom-up scan to compute $z$, and then a top-down scan to compute the solution $\overline{h}$ given $z$.

4.2 Utility Analysis: the Accuracy of $\overline{H}$

We measure utility as the accuracy of range queries, and we compare three strategies: $\tilde{L}$, $\tilde{H}$, and $\overline{H}$. We start by comparing $\tilde{L}$ and $\tilde{H}$. Given range query $q = c([x, y])$, we derive an estimate for the answer as follows. For $\tilde{L}$, the estimate is simply the sum of the noisy unit-length intervals in the range: $\tilde{L}_q = \sum_{i=x}^{y} \tilde{L}[i]$. The error of each count is $2/\epsilon^2$, and so the error for the range is $error(\tilde{L}_q) = O((y - x)/\epsilon^2)$.

For $\tilde{H}$, we choose the natural strategy of summing the fewest sub-intervals of $\tilde{H}$. Let $r_1, \ldots, r_t$ be the roots of disjoint subtrees of $T$ such that the union of their ranges equals $[x, y]$. Then $\tilde{H}_q$ is defined as $\tilde{H}_q = \sum_{i=1}^{t} \tilde{H}[r_i]$.
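The decomposition of a range into disjoint subtree roots can be sketched as follows. This is a minimal Python sketch over a dyadic tree, with our own function name and interval representation; it is not code from the paper.

```python
def subtree_roots(lo, hi, x, y):
    """Return the ranges of the maximal disjoint subtrees of the dyadic
    tree over [lo, hi] whose union tiles the query range [x, y]
    (all bounds inclusive). At most two subtrees are used per level."""
    if y < lo or hi < x:
        return []                 # this subtree does not overlap the query
    if x <= lo and hi <= y:
        return [(lo, hi)]         # this subtree lies entirely inside [x, y]
    mid = (lo + hi) // 2
    return subtree_roots(lo, mid, x, y) + subtree_roots(mid + 1, hi, x, y)

# The estimate for q = c([x, y]) is then the sum of the noisy counts
# stored at these roots: sum(h_tilde[r] for r in subtree_roots(lo, hi, x, y)).
```

For example, `subtree_roots(0, 7, 1, 6)` yields `[(1, 1), (2, 3), (4, 5), (6, 6)]`: four subtree counts instead of six unit-length counts.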
Each noisy count has error equal to $2\ell^2/\epsilon^2$ (the variance of the added noise), and the number of subtrees is at most $2\ell$ (at most two per level of the tree); thus $error(\tilde{H}_q) = O(\ell^3/\epsilon^2)$.

There is clearly a tradeoff between these two strategies. While $\tilde{L}$ is accurate for small ranges, its error grows linearly with the size of the range. In contrast, the error of $\tilde{H}$ is poly-logarithmic in the size of the domain (recall that $\ell = \Theta(\log n)$). Thus, while $\tilde{H}$ is less accurate for small ranges, it is much more accurate for large ranges. If the goal of a universal histogram is to bound worst-case or total error for all range queries, then $\tilde{H}$ is the preferred strategy.

We now compare $\tilde{H}$ to $\overline{H}$. Since $\overline{H}$ is consistent, range queries can be computed simply by summing the unit-length counts. In addition to being consistent, it is also more accurate. In fact, it is in some sense optimal: among the class of strategies that (a) produce unbiased estimates for range queries and (b) derive the estimate from linear combinations of the counts in $\tilde{h}$, there is no strategy with lower mean squared error than $\overline{H}$.

Theorem 4. (i) $\overline{H}$ is a linear unbiased estimator; (ii) $error(\overline{H}_q) \le error(E_q)$ for all $q$ and all linear unbiased estimators $E$; (iii) $error(\overline{H}_q) = O(\ell^3/\epsilon^2)$ for all $q$; and (iv) there exists a query $q$ such that

$$error(\overline{H}_q) \le \frac{3}{2(\ell-1)(k-1)-k}\; error(\tilde{H}_q).$$

Part (iv) of the theorem shows that $\overline{H}$ can be more accurate than $\tilde{H}$ on some range queries. For example, in a height-16 binary tree (the kind of tree used in the experiments), there is a query $q$ where $\overline{H}_q$ is more accurate than $\tilde{H}_q$ by a factor of $\frac{2(\ell-1)(k-1)-k}{3} = 9.33$. Furthermore, the fact that $\overline{H}$ is consistent can lead to additional advantages when the domain is sparse.
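Putting Sec 4.1 together, the two scans of Theorem 3 can be sketched in a few lines. This is our own minimal rendering, assuming a complete $k$-ary tree stored in a heap-style array; the layout, function name, and example data are assumptions, but the recurrences are those given above.

```python
def constrained_inference(h_tilde, k=2):
    """h_tilde: noisy counts of a complete k-ary tree in heap layout
    (index 0 unused, root at 1, children of v at k*(v-1)+2 .. k*(v-1)+k+1).
    Returns the minimum-L2 consistent estimate in the same layout."""
    m = len(h_tilde) - 1
    children = lambda v: [u for u in range(k*(v-1)+2, k*(v-1)+k+2) if u <= m]

    # Bottom-up scan: z[v] is a weighted average of v's own noisy count
    # and the sum of its children's z values (weights as in Sec 4.1).
    z, height = [0.0] * (m + 1), [0] * (m + 1)
    for v in range(m, 0, -1):
        ch = children(v)
        if not ch:
            height[v], z[v] = 1, h_tilde[v]
        else:
            l = height[ch[0]] + 1
            height[v] = l
            z[v] = ((k**l - k**(l-1)) * h_tilde[v]
                    + (k**(l-1) - 1) * sum(z[u] for u in ch)) / (k**l - 1)

    # Top-down scan: split each node's residual equally among its children.
    h = [0.0] * (m + 1)
    h[1] = z[1]
    for v in range(1, m + 1):
        ch = children(v)
        if ch:
            resid = (h[v] - sum(z[u] for u in ch)) / k
            for u in ch:
                h[u] = z[u] + resid
    return h
```

After the two scans, every parent estimate equals the sum of its children's estimates, so any range query can be answered by summing leaf estimates.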
We propose a simple extension to $\overline{H}$: after computing $\overline{h}$, if there is a subtree rooted at $v$ such that $\overline{h}[v] \le 0$, we simply set the count at $v$, and at every node in $v$'s subtree, to zero. This is a heuristic strategy; incorporating non-negativity constraints into inference is left for future work. Nevertheless, we show in experiments that this small change can greatly reduce error in sparse regions and can lead to $\overline{H}$ being more accurate than $\tilde{L}$ even at small ranges.

5. EXPERIMENTS

We evaluate our techniques on three real datasets (explained in detail in Appendix C): NetTrace is derived from an IP-level network trace collected at a major university; Social Network is a graph derived from friendship relations on an online social network site; Search Logs is a dataset of search query logs over time from Jan. 1, 2004 to the present. Source code for the algorithms is available at the first author's website.

5.1 Unattributed Histograms

The first set of experiments evaluates the accuracy of constrained inference on unattributed histograms. We compare $\overline{S}$ to the baseline approach $\tilde{S}$. Since $\tilde{s} = \tilde{S}(I)$ is likely to be inconsistent (out of order, non-integral, and possibly negative), we consider a second baseline technique, denoted $\tilde{S}_r$, which enforces consistency by sorting $\tilde{s}$ and rounding each count to the nearest non-negative integer.

Figure 5: Error across varying datasets and $\epsilon$. Each triplet of bars represents the three estimators: $\tilde{S}$ (light gray), $\tilde{S}_r$ (gray), and $\overline{S}$ (black).

We evaluate the performance of these estimators on three queries from the three datasets. On NetTrace, the query returns the number of internal hosts to which each external host is connected ($\approx$65K external hosts). On Social Network, the query returns the degree sequence of the graph ($\approx$11K nodes).
On Search Logs, the query returns the search frequency over a 3-month period of the top 20K keywords; position $i$ in the answer vector is the number of times the $i$th ranked keyword was searched.

To evaluate the utility of an estimator, we measure its squared error. Results report the average squared error over 50 random samples from the differentially private mechanism (each sample produces a new $\tilde{s}$). We also show results for three settings of $\epsilon \in \{1.0, 0.1, 0.01\}$; smaller $\epsilon$ means more privacy, hence more random noise.

Fig 5 shows the results of the experiment. Each bar represents average performance for a single combination of dataset, $\epsilon$, and estimator. The bars represent, from left to right, $\tilde{S}$ (light gray), $\tilde{S}_r$ (gray), and $\overline{S}$ (black). The vertical axis is average squared error on a log scale. The results indicate that the proposed approach reduces the error by at least an order of magnitude across all datasets and settings of $\epsilon$. Also, the difference between $\tilde{S}_r$ and $\overline{S}$ suggests that the improvement is due not simply to enforcing integrality and non-negativity, but to the way consistency is enforced through constrained inference (though $\overline{S}$ and $\tilde{S}_r$ are comparable on Social Network at large $\epsilon$). Finally, the relative accuracy of $\overline{S}$ improves with decreasing $\epsilon$ (more noise). Appendix C provides intuition for how $\overline{S}$ reduces error.

5.2 Universal Histograms

We now evaluate the effectiveness of constrained inference for the more general task of computing a universal histogram and arbitrary range queries. We evaluate three techniques for supporting universal histograms: $\tilde{L}$, $\tilde{H}$, and $\overline{H}$. For all three approaches, we enforce integrality and non-negativity by rounding to the nearest non-negative integer. With $\overline{H}$, rounding is done as part of the inference process, using the approach described in Sec 4.2.
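The Sec 4.2 step referred to above, zeroing any subtree whose inferred root count is non-positive, can be sketched as follows. This is our own minimal rendering (the 0-indexed heap layout and function name are assumptions), and it is a heuristic applied after inference, not part of the closed-form solution.

```python
def zero_nonpositive_subtrees(h, k=2):
    """h: inferred counts of a complete k-ary tree in heap layout
    (root at index 0, children of v at k*v+1 .. k*v+k). Any subtree
    whose root estimate is <= 0 is zeroed out entirely."""
    h = list(h)
    m = len(h)
    for v in range(m):
        if h[v] <= 0:
            stack = [v]                      # zero v and its whole subtree
            while stack:
                u = stack.pop()
                h[u] = 0
                stack.extend(c for c in range(k*u + 1, k*u + k + 1) if c < m)
    return h
```

For example, on the binary tree `[5, -1, 6, 2, 3, 4, 2]`, the subtree rooted at the node with count -1 (covering its two leaves) is zeroed, giving `[5, 0, 6, 0, 0, 4, 2]`.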
We evaluate the accuracy over a set of range queries of varying size and location. The range sizes are $2^i$ for $i = 1, \ldots, \ell - 2$, where $\ell$ is the height of the tree. For each fixed size, we select the location uniformly at random. We report the average error over 50 random samples of $\tilde{l}$ and $\tilde{h}$, and, for each sample, 1000 randomly chosen ranges.

We evaluate the following histogram queries. On NetTrace: the number of connections for each external host. This is similar to the query in Sec 5.1 except that here the association between IP address and count is retained. On Search Logs, the query reports the temporal frequency of the query term "Obama" from Jan. 1, 2004 to the present. (A day is evenly divided into 16 units of time.)

Figure 6: A comparison of estimators $\tilde{L}$ (circles), $\tilde{H}$ (diamonds), and $\overline{H}$ (squares) on two real-world datasets (top: NetTrace; bottom: Search Logs).

Fig 6 shows the results for both datasets and varying $\epsilon$. The top row corresponds to NetTrace, the bottom to Search Logs. Within a row, each plot shows a different setting of $\epsilon \in \{1.0, 0.1, 0.01\}$. For all plots, the x-axis is the size of the range query and the y-axis is the error, averaged over sampled counts and intervals. Both axes are in log scale.

First, we compare $\tilde{L}$ and $\tilde{H}$. For unit-length ranges, $\tilde{L}$ yields more accurate estimates. This is unsurprising since it is a lower-sensitivity query and thus less noise is added for privacy.
However, the error of $\tilde{L}$ increases linearly with the size of the range. The average error of $\tilde{H}$ increases slowly with the size of the range, as larger ranges typically require summing a greater number of subtrees. For ranges larger than about 2000 units, the error of $\tilde{L}$ is higher than that of $\tilde{H}$; for the largest ranges, the error of $\tilde{L}$ is 4-8 times larger than that of $\tilde{H}$ (note the log scale).

Comparing $\overline{H}$ against $\tilde{H}$, the error of $\overline{H}$ is uniformly lower across all range sizes, settings of $\epsilon$, and datasets. The relative performance of the estimators depends on $\epsilon$. At smaller $\epsilon$, the estimates of $\overline{H}$ are more accurate relative to $\tilde{H}$ and $\tilde{L}$. Recall that as $\epsilon$ decreases, noise increases. This suggests that the relative benefit of statistical inference increases with the uncertainty in the observed data.

Finally, the results show that $\overline{H}$ can have lower error than $\tilde{L}$ over small ranges, even for leaf counts. This may be surprising since, for unit-length counts, the scale of the noise of $\overline{H}$ is larger than that of $\tilde{L}$ by a factor of $\log n$. The reduction in error is because these histograms are sparse. When the histogram contains sparse regions, $\overline{H}$ can effectively identify them because it has noisy observations at higher levels of the tree. In contrast, $\tilde{L}$ has only the leaf counts; thus, even if a range contains no records, $\tilde{L}$ will assign a positive count to roughly half of the leaves in the range.

6. RELATED WORK

Dwork has written comprehensive reviews of differential privacy [7, 8]; below we highlight the results closest to this work. The idea of post-processing the output of a differentially private mechanism to ensure consistency was introduced by Barak et al. [1], who proposed a linear program for making a set of marginals consistent, non-negative, and integral. However, unlike the present work, the post-processing is not shown to improve accuracy. Blum et al.
[4] propose an efficient algorithm to publish synthetic data that is useful for range queries. In comparison with our hierarchical histogram, the technique of Blum et al. scales slightly better (logarithmic versus poly-logarithmic) in terms of domain size (with all else fixed). However, our hierarchical histogram achieves lower error for a fixed domain, and, importantly, its error does not grow as the size of the database increases, whereas the error of Blum et al. grows as $O(N^{2/3})$, where $N$ is the number of records (details in Appendix E).

The present work first appeared as an arXiv preprint [13], and since then a number of related works have emerged, including additional work by the authors. The technique for unattributed histograms has been applied to accurately and efficiently estimate the degree sequences of large social networks [12]. Several techniques for histograms over hierarchical domains have been developed. Xiao et al. [22] propose an approach based on the Haar wavelet, which is conceptually similar to the $H$ query in that it is based on a tree of queries where each level in the tree is an increasingly fine-grained summary of the data. In fact, that technique has error equivalent to a binary $H$ query, as shown by Li et al. [15], who represent both techniques as applications of the matrix mechanism, a framework for computing workloads of linear counting queries under differential privacy. We are aware of ongoing work by McSherry et al. [17] that combines hierarchical querying with statistical inference, but differs from $\overline{H}$ in that it is adaptive. Chan et al. [5] consider the problem of continual release of aggregate statistics over streaming private data, and propose a differentially private counter that is similar to $H$, in which items are hierarchically aggregated by arrival time. The $H$ and wavelet strategies are specifically designed to support range queries.
Strategies for answering more general workloads of queries under differential privacy are emerging, in both the offline [11, 15] and online [20] settings.

7. CONCLUSIONS

Our results show that by transforming a differentially private output so that it is consistent, we can boost accuracy. Part of the innovation is devising a query set so that useful constraints hold. The challenge is then to apply the constraints by searching for the closest consistent solution. Our query strategies for histograms have closed-form solutions for efficiently computing a consistent answer.

Our results show that conventional differential privacy approaches can add more noise than is strictly required by the privacy condition. We believe that using constraints may be an important part of finding optimal strategies for query answering under differential privacy. More discussion of the implications of our results, and possible extensions, is included in Appendix B.

8. ACKNOWLEDGMENTS

Hay was supported by the Air Force Research Laboratory (AFRL) and IARPA, under agreement number FA8750-07-2-0158. Hay and Miklau were supported by NSF CNS-0627642, NSF DUE-0830876, and NSF IIS-0643681. Rastogi and Suciu were supported by NSF IIS-0627585. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and IARPA, or the U.S. Government.

9. REFERENCES

[1] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In PODS, 2007.
[2] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk.
Statistical Inference Under Order Restrictions. John Wiley and Sons Ltd, 1972.
[3] R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. JASA, 67(337):140-147, 1972.
[4] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, 2008.
[5] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. In ICALP, 2010.
[6] F. R. K. Chung and L. Lu. Survey: Concentration inequalities and martingale inequalities. Internet Mathematics, 2006.
[7] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
[8] C. Dwork. A firm foundation for private data analysis. CACM, to appear, 2010.
[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[10] A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In STOC, 2009.
[11] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, 2010.
[12] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In ICDM, 2009.
[13] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. CoRR, abs/0904.0942, April 2009.
[14] J. T. G. Hwang and S. D. Peddada. Confidence interval estimation subject to order restrictions. Annals of Statistics, 1994.
[15] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing histogram queries under differential privacy. In PODS, 2010.
[16] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, 2009.
[17] F. McSherry, K. Talwar, and O. Williams. Maximum likelihood data synthesis. Manuscript, 2009.
[18] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.
[19] G. Owen. Game Theory.
Academic Press Ltd, 1982.
[20] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC, 2010.
[21] S. D. Silvey. Statistical Inference. Chapman-Hall, 1975.
[22] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In ICDE, 2010.

APPENDIX

A. NOTATIONAL CONVENTIONS

The table below summarizes the notational conventions used in the paper.

  $Q$ — Sequence of counting queries
  $L$ — Unit-Length query sequence
  $H$ — Hierarchical query sequence
  $S$ — Sorted query sequence
  $\gamma_Q$ — Constraint set for query $Q$
  $\tilde{Q}, \tilde{L}, \tilde{H}, \tilde{S}$ — Randomized query sequences
  $\overline{H}, \overline{S}$ — Randomized query sequences, returning the minimum $L_2$ solution
  $I$ — Private database instance
  $L(I), H(I), S(I)$ — Output sequences (truth)
  $\tilde{l} = \tilde{L}(I),\ \tilde{h} = \tilde{H}(I),\ \tilde{s} = \tilde{S}(I)$ — Output sequences (noisy)
  $\overline{h} = \overline{H}(I),\ \overline{s} = \overline{S}(I)$ — Output sequences (inferred)

B. DISCUSSION OF MAIN RESULTS

Here we provide a supplementary discussion of the results, review the insights gained, and discuss future directions.

Unattributed histograms. The choice of the sorted query $S$, instead of $L$, is an unqualified benefit, because we gain from the inequality constraints on the output, while the sensitivity of $S$ is no greater than that of $L$. Among other applications, this allows for extremely accurate estimation of the degree sequences of a graph, improving error by an order of magnitude over the baseline technique. The accuracy of the estimate depends on the input sequence. It works best for sequences with duplicate counts, which matches well the degree sequences of social networks encountered in practice. Future work specifically oriented towards degree sequence estimation could include a constraint enforcing that the output sequence is graphical, i.e., the degree sequence of some graph.

Universal histograms. The choice of the hierarchical counting query $H$, instead of $L$, offers a tradeoff because the sensitivity of $H$ is greater than that of $L$.
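The sensitivity gap can be seen concretely in a toy example (our own sketch; `L_counts` and `H_counts` are hypothetical helpers, not code from the paper): adding one record changes a single count of $L$, but one count of $H$ per level of the tree, matching Propositions 3 and 4.

```python
def L_counts(records, n):
    """Unit-length counts over the domain [0, n-1]."""
    counts = [0] * n
    for r in records:
        counts[r] += 1
    return counts

def H_counts(records, n):
    """Counts for every dyadic interval over [0, n-1] (n a power of two)."""
    out = {}
    def rec(lo, hi):
        out[(lo, hi)] = sum(1 for r in records if lo <= r <= hi)
        if lo < hi:
            mid = (lo + hi) // 2
            rec(lo, mid)
            rec(mid + 1, hi)
    rec(0, n - 1)
    return out

# Adding one record (here, value 6 on a domain of size 8) changes
# exactly one count of L, but one count of H at each of the tree's
# 4 levels: the intervals (0,7), (4,7), (6,7), and (6,6).
a, b = H_counts([2, 5], 8), H_counts([2, 5, 6], 8)
changed = sum(1 for key in a if a[key] != b[key])
```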
It is interesting that, for some datasets and privacy levels, the effect of the $H$ constraints outweighs the increased noise that must be added. In other cases, the algorithms based on $H$ provide greater accuracy for all but the smallest ranges. We note that in many practical settings, domains are large and sparse. The sparsity implies that no differentially private technique can yield meaningful answers for unit-length queries, because the noise necessary for privacy will drown out the signal. So while $\tilde{L}$ sometimes has higher accuracy for small range queries, this may not have practical relevance, since the relative error of the answers renders them useless.

In future work we hope to extend the technique for universal histograms to multi-dimensional range queries, and to investigate optimizations such as higher branching factors.

Across both histogram tasks, our results clearly show that it is possible to achieve greater accuracy without sacrificing privacy. The existence of our improved estimators $\overline{S}$ and $\overline{H}$ shows that there are other differentially private noise distributions that are more accurate than independent Laplace noise. This does not contradict existing results, because the original differential privacy work showed only that calibrating Laplace noise to the sensitivity of a query is sufficient for privacy, not that it is necessary. Only recently has the optimality of this construction been studied (and proven only for single queries) [10]. Finding the optimal strategy for answering a set of queries under differential privacy is an important direction for future work, especially in light of emerging private query interfaces [16].

A natural goal is to describe directly the improved noise distributions implied by $\overline{S}$ and $\overline{H}$, and to build a privacy mechanism that samples from them. This could, in theory, avoid the inference step altogether.
But it seems quite difficult to discover, describe, and sample these improved noise distributions, which will be highly dependent on the particular query of interest. Our approach suggests that constraints and constrained inference can be an effective path to discovering new, more accurate noise distributions that satisfy differential privacy. As a practical matter, our approach does not necessarily burden the analyst with the constrained inference process, because the server can implement the post-processing step. In that case it would appear to the analyst as if the server were sampling from the improved distribution.

While our focus has been on histogram queries, the techniques are probably not limited to histograms and could have broader impact. However, a general formulation may be challenging to develop. There is a subtle relationship between constraints and sensitivity: reformulating a query so that it becomes highly constrained may similarly increase its sensitivity. One challenge is finding queries, such as $S$ and $H$, that have useful constraints but remain low sensitivity. Another challenge is the computational efficiency of constrained inference, which is posed here as a constrained optimization problem with a quadratic objective function. The complexity of solving this problem depends on the nature of the constraints and is NP-hard in general. Our analysis shows that the constraint sets of $S$ and $H$ admit closed-form solutions that are efficient to compute.

C. ADDITIONAL EXPERIMENTS

This section provides detailed descriptions of the datasets, and additional results for unattributed histograms.

NetTrace is derived from an IP-level network trace collected at a major university. The trace monitors traffic at the gateway between internal IP addresses and external IP addresses.
From this data, we derived a bipartite connection graph where the nodes are hosts, labeled by their IP address, and an edge connotes the transmission of at least one data packet. Here, differential privacy ensures that individual connections remain private.

Social Network is a graph derived from friendship relations on an online social network site. The graph is limited to a population of roughly 11,000 students from a single university. Differential privacy implies that friendships will not be disclosed. The size of the graph (the number of students) is assumed to be public knowledge. (This is not a critical assumption: in fact, the number of students can be estimated privately within $\pm 1/\epsilon$ in expectation, and our techniques can be applied directly to either the true count or a noisy estimate.)

Figure 7: On NetTrace, $S(I)$ (solid gray), the average error of $\overline{S}$ (solid black) and $\tilde{S}$ (dotted gray), for $\epsilon = 1.0$.

Search Logs is a dataset of search query logs over time from Jan. 1, 2004 to the present. For privacy reasons, it is difficult to obtain such data. Our dataset is derived from a search engine interface that publishes summary statistics for specified query terms. We combined these summary statistics with a second dataset, which contains actual search query logs but for a much shorter time period, to produce a synthetic dataset. In the experiments, ground truth refers to this synthetic dataset. Differential privacy guarantees that the output will prevent the association of an individual entity (user, host) with a particular search term.

Unattributed histograms. Figure 7 provides some intuition for how inference is able to reduce error. Shown is a portion of the unattributed histogram of NetTrace: the sequence is sorted in descending order along the x-axis, and the y-axis indicates the count.
The solid gray line corresponds to ground truth: a long horizontal stretch indicates a subsequence of uniform counts, and a vertical drop indicates a decrease in count. The graphic shows only the middle portion of the unattributed histogram; some very large and very small counts are omitted to improve legibility. The solid black lines indicate the error of $\overline{S}$ averaged over 200 random samples of $\tilde{S}$ (with $\epsilon = 1.0$); the dotted gray lines indicate the expected error of $\tilde{S}$.

The inset graph on the left reveals larger error at the beginning of the sequence, where each count occurs once or only a few times. However, as the counts become more concentrated (longer subsequences of uniform count), the error diminishes, as shown in the right inset. Some error remains around the points in the sequence where the count changes, but the error is reduced to zero for positions in the middle of uniform subsequences.

Figure 7 illustrates that our approach reduces or eliminates noise in precisely the parts of the sequence where the noise is unnecessary for privacy. Changing a tuple in the database cannot change a count in the middle of a uniform subsequence, only at the end points. These experimental results also align with Theorem 2, which states that the error of $\overline{S}$ is a function of the number of distinct counts in the sequence. In fact, the experimental results suggest that the theorem also holds locally for subsequences with a small number of distinct counts. This is an important result, since the typical degree sequences that arise in real data, such as power-law distributions, contain very large uniform subsequences.

D. PROOFS

Proof of Proposition 2.
For any output $q$, let $S(q)$ denote the set of noisy answers such that if $\tilde{q} \in S(q)$, then the minimum $L_2$ solution given $\tilde{q}$ and $\gamma_Q$ is $q$. For any $I$ and $I' \in nbrs(I)$, the following shows that $\overline{Q}$ is $\epsilon$-differentially private:

$$Pr[\overline{Q}(I) = q] = Pr[\tilde{Q}(I) \in S(q)] \le \exp(\epsilon)\, Pr[\tilde{Q}(I') \in S(q)] = \exp(\epsilon)\, Pr[\overline{Q}(I') = q]$$

where the inequality holds because $\tilde{Q}$ is $\epsilon$-differentially private.

Proof of Proposition 3. Given a database $I$, suppose we add a record to it to obtain $I'$. The added record affects one count in $L$; i.e., there is exactly one $i$ such that $L(I)[i] = x$ and $L(I')[i] = x + 1$, and all other counts are the same. The added record affects $S$ as follows. Let $j$ be the largest index such that $S(I)[j] = x$; then the added record increases the count at $j$ by one: $S(I')[j] = x + 1$. Notice that this change does not affect the sort order, i.e., in $S(I')$, the $j$th value remains in sorted order: $S(I')[j-1] \le x$, $S(I')[j] = x + 1$, and $S(I')[j+1] \ge x + 1$. All other counts in $S$ are the same, and thus the $L_1$ distance between $S(I)$ and $S(I')$ is 1.

Proof of Proposition 4. If a tuple is added to or removed from the relation, this affects the count for every range that includes it. There are exactly $\ell$ ranges that include a given tuple: the range of a single leaf, and the ranges of the nodes along the path from that leaf to the root. Therefore, adding or removing a tuple changes exactly $\ell$ counts, each by exactly 1. Thus, the sensitivity is equal to $\ell$, the height of the tree.

D.1 Proof of Theorem 1

We first restate the theorem below. Recall that $\tilde{s}[i, j]$ denotes the subsequence of $j - i + 1$ elements $\langle \tilde{s}[i], \tilde{s}[i+1], \ldots, \tilde{s}[j] \rangle$. Let $\tilde{M}[i, j]$ denote the mean of these elements, i.e., $\tilde{M}[i, j] = \sum_{k=i}^{j} \tilde{s}[k] / (j - i + 1)$.

Theorem 1.
Denote $L_k = \min_{j \in [k,n]} \max_{i \in [1,j]} \tilde{M}[i,j]$ and $U_k = \max_{i \in [1,k]} \min_{j \in [i,n]} \tilde{M}[i,j]$. The minimum $L_2$ solution $\overline{s}$ is unique and given by: $\overline{s}[k] = L_k = U_k$.

Proof. In the proof, we abbreviate the notation and implicitly assume that the range of $i$ is $[1, n]$, or $[1, j]$ when $j$ is specified. Similarly, the range of $j$ is $[1, n]$, or $[i, n]$ when $i$ is specified.

We start with the easy part, showing that $U_k \le L_k$. Define an $n \times n$ matrix $A^k$ as follows:

$$A^k_{ij} = \begin{cases} \tilde{M}[i,j] & \text{if } i \le j \\ \infty & \text{if } j < i \le k \\ -\infty & \text{otherwise} \end{cases}$$

Then $\min_j \max_i A^k_{ij} = L_k$ and $\max_i \min_j A^k_{ij} = U_k$. In any matrix $A^k$, $\max_i \min_j A^k_{ij} \le \min_j \max_i A^k_{ij}$; this is a simple fact that can be checked directly, or see [19]. Hence $U_k \le L_k$.

We show next that if $\overline{s}$ is the minimum $L_2$ solution, then $L_k \le \overline{s}[k] \le U_k$. Once this is shown, the proof of the theorem is complete, since together with $U_k \le L_k$ it gives $\overline{s}[k] = L_k = U_k$. The proof relies on the following lemma.

Lemma 1. Let $\overline{s}$ be the minimum $L_2$ solution. Then (i) $\overline{s}[1] \le U_1$, (ii) $\overline{s}[n] \ge L_n$, and (iii) for all $k$,
$$\min\big(\overline{s}[k+1], \max_i \tilde{M}[i,k]\big) \le \overline{s}[k] \le \max\big(\overline{s}[k-1], \min_j \tilde{M}[k,j]\big).$$

The proof of the lemma appears below, but first we use it to complete the proof of Theorem 1. We begin by showing that $\overline{s}[k] \le U_k$, using induction on $k$. The base case, $k = 1$, is stated in part (i) of the lemma. For the inductive step, assume $\overline{s}[k-1] \le U_{k-1}$. From (iii), we have

$$\overline{s}[k] \le \max\big(\overline{s}[k-1], \min_j \tilde{M}[k,j]\big) \le \max\big(U_{k-1}, \min_j \tilde{M}[k,j]\big) = U_k.$$

The last step follows from the definition of $U_k$. A similar induction argument shows that $\overline{s}[k] \ge L_k$, except the order is reversed: the base case is $k = n$, and the inductive step assumes $\overline{s}[k+1] \ge L_{k+1}$. The only remaining step is to prove the lemma.

Proof of Lemma 1. For (i), it is sufficient to prove that $\overline{s}[1] \le \tilde{M}[1,j]$ for all $j \in [1,n]$. Assume the contrary.
Thus there exists a $j$ such that $\overline{s}[1] > \tilde{M}[1,j]$. Let $\delta = \overline{s}[1] - \tilde{M}[1,j]$; thus $\delta > 0$. Further, for all $i$, denote $\delta_i = \overline{s}[i] - \overline{s}[1]$. Consider the sequence $s'$ defined as follows:

$$s'[i] = \begin{cases} \overline{s}[i] - \delta & \text{if } i \le j \\ \overline{s}[i] & \text{otherwise} \end{cases}$$

Since $\overline{s}$ is a sorted sequence, so is $s'$. We now claim that $\|s' - \tilde{s}\|_2 < \|\overline{s} - \tilde{s}\|_2$. For this, note that since the sequence $s'[j+1, n]$ is identical to the sequence $\overline{s}[j+1, n]$, it is sufficient to prove $\|s'[1,j] - \tilde{s}[1,j]\|_2 < \|\overline{s}[1,j] - \tilde{s}[1,j]\|_2$. To prove that, expand the squared norm as:

$$\|\overline{s}[1,j] - \tilde{s}[1,j]\|_2^2 = \sum_{i=1}^{j} (\overline{s}[i] - \tilde{s}[i])^2 = \sum_{i=1}^{j} (\overline{s}[1] + \delta_i - \tilde{s}[i])^2 = \sum_{i=1}^{j} (\tilde{M}[1,j] + \delta + \delta_i - \tilde{s}[i])^2$$

Suppose for a moment that we fix $\tilde{M}[1,j]$ and the $\delta_i$'s, and treat $\|\overline{s}[1,j] - \tilde{s}[1,j]\|_2^2$ as a function $f$ of $\delta$. The derivative of $f(\delta)$ is:

$$f'(\delta) = 2\sum_{i=1}^{j} (\tilde{M}[1,j] + \delta + \delta_i - \tilde{s}[i]) = 2\Big(j\tilde{M}[1,j] - \sum_{i=1}^{j} \tilde{s}[i]\Big) + 2j\delta + 2\sum_{i=1}^{j} \delta_i = 2j\delta + 2\sum_{i=1}^{j} \delta_i$$

Since $\delta_i \ge 0$ for all $i$, the derivative is strictly greater than zero for any $\delta > 0$, which implies that $f$ is a strictly increasing function of $\delta$ with its minimum at $\delta = 0$. Therefore, $f(\delta) > f(0) = \|s'[1,j] - \tilde{s}[1,j]\|_2^2$. This is a contradiction, since $\overline{s}$ was assumed to be the minimum solution. This completes the proof of (i).

For (ii), the proof of $\overline{s}[n] \ge \max_i \tilde{M}[i,n]$ follows from a similar argument: if $\overline{s}[n] < \tilde{M}[i,n]$ for some $i$, define $\delta = \tilde{M}[i,n] - \overline{s}[n]$ and the sequence $s'$ with elements $s'[j] = \overline{s}[j] + \delta$ for $j \ge i$. Then $s'$ can be shown to be a strictly better solution than $\overline{s}$, proving (ii).

For the proof of (iii), we first show that $\overline{s}[k] \le \max(\overline{s}[k-1], \min_j \tilde{M}[k,j])$. Assume the contrary, i.e.,
there exists a $k$ such that $\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > \min_j \tilde{M}[k,j]$. In other words, we assume there exist $k$ and $j$ such that $\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > \tilde{M}[k,j]$. Denote $\delta = \bar{s}[k] - \max(\bar{s}[k-1], \tilde{M}[k,j])$. By the assumption above, $\delta > 0$. Define the sequence
$$s'[i] = \begin{cases} \bar{s}[i] - \delta & \text{if } k \le i \le j \\ \bar{s}[i] & \text{otherwise} \end{cases}$$
Note that by construction, $s'[k] = \bar{s}[k] - \delta = \bar{s}[k] - (\bar{s}[k] - \max(\bar{s}[k-1], \tilde{M}[k,j])) = \max(\bar{s}[k-1], \tilde{M}[k,j])$. It is easy to see that $s'$ is sorted: the only inversion in the sort order could have occurred if $s'[k-1] > s'[k]$, but it does not, as $s'[k-1] = \bar{s}[k-1] \le \max(\bar{s}[k-1], \tilde{M}[k,j]) = s'[k]$. Now an argument similar to the proof of (i), applied to the subsequence $\tilde{s}[k,j]$, yields $\|s'[k,j] - \tilde{s}[k,j]\|_2 < \|\bar{s}[k,j] - \tilde{s}[k,j]\|_2$. Thus $\|s' - \tilde{s}\|_2 < \|\bar{s} - \tilde{s}\|_2$, and $s'$ is a strictly better solution than $\bar{s}$. This yields a contradiction, as $\bar{s}$ is the minimum $L_2$ solution. Hence $\bar{s}[k] \le \max(\bar{s}[k-1], \min_j \tilde{M}[k,j])$. A similar argument in the reverse direction shows that $\bar{s}[k] \ge \min(\bar{s}[k+1], \max_i \tilde{M}[i,k])$, completing the proof of (iii).

D.2 Proof of Theorem 2

We first restate the theorem below. Denote by $n$ and $d$ the number of values and the number of distinct values in $S(I)$, respectively. Let $n_1, n_2, \ldots, n_d$ be the number of times each of the $d$ distinct values occurs in $S(I)$ (thus $\sum_i n_i = n$).

Theorem 2. There exist constants $c_1$ and $c_2$ independent of $n$ and $d$ such that
$$error(\bar{S}) \le \sum_{i=1}^{d} \frac{c_1 \log^3 n_i + c_2}{\epsilon^2}$$
Thus $error(\bar{S}) = O(d \log^3 n / \epsilon^2)$, whereas $error(\tilde{S}) = \Theta(n / \epsilon^2)$.

Before giving the proof, we prove the following lemma.

Lemma 2. Let $s = S(I)$ be the input sequence. Call a translation of $s$ the operation of subtracting from each element of $s$ a fixed amount $\delta$.
Then $error(\bar{S}[i])$ is invariant under translation, for all $i$.

Proof. Denote by $Pr(\bar{s}\,|\,s)$ (resp. $Pr(\tilde{s}\,|\,s)$) the probability that $\bar{s}$ (resp. $\tilde{s}$) is output on the input sequence $s$. Denote by $s'$, $\bar{s}'$, and $\tilde{s}'$ the sequences obtained by translating $s$, $\bar{s}$, and $\tilde{s}$ by $\delta$, respectively. First observe that $Pr(\tilde{s}\,|\,s) = Pr(\tilde{s}'\,|\,s')$, as $\tilde{s}$ and $\tilde{s}'$ are obtained by adding the same Laplacian noise to $s$ and $s'$, respectively. Using Theorem 1 (since all the $U_k$'s and $L_k$'s shift by $\delta$ on translating $\tilde{s}$ by $\delta$), we get that if $\bar{s}$ is the minimum $L_2$ solution given $\tilde{s}$, then $\bar{s}'$ is the minimum $L_2$ solution given $\tilde{s}'$. Thus $Pr(\bar{s}\,|\,s) = Pr(\bar{s}'\,|\,s')$ for all sequences $\bar{s}$. Further, since $\bar{s}[i]$ and $\bar{s}'[i]$ yield the same $L_2$ error with respect to $s[i]$ and $s'[i]$, respectively, the expected $error(\bar{S}[i])$ is the same for both inputs $s$ and $s'$.

Lemma 3. Let $X$ be any positive random variable such that $\lim_{x\to\infty} x\,Pr(X > x)$ exists. Then
$$E(X) \le \int_0^\infty Pr(X > x)\, dx$$

Proof. The proof follows from the chain below.
$$E(X) = \int_0^\infty x \frac{\partial}{\partial x}\big(Pr(X \le x)\big)\,dx = -\int_0^\infty x \frac{\partial}{\partial x}\big(Pr(X > x)\big)\,dx$$
$$= -\big[x\,Pr(X > x)\big]_0^\infty + \int_0^\infty Pr(X > x)\,dx \quad \text{(by parts)}$$
$$= -\lim_{x\to\infty} x\,Pr(X > x) + \int_0^\infty Pr(X > x)\,dx \le \int_0^\infty Pr(X > x)\,dx$$
The last inequality follows because the limit exists and, since $X$ is positive, is nonnegative. This completes the proof.

We next state a theorem that was shown in [6].

Theorem 5 (Theorem 3.4 [6]). Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables satisfying $X_i \le E(X_i) + M$, for $1 \le i \le n$. We consider the sum $X = \sum_{i=1}^n X_i$ with expectation $E(X) = \sum_{i=1}^n E(X_i)$ and $Var(X) = \sum_{i=1}^n Var(X_i)$. Then we have
$$Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(Var(X) + M\lambda/3)}}$$

For a random variable $X$, denote by $I_X$ the indicator that $X \ge 0$ (thus $I_X = 1$ if $X \ge 0$ and $0$ otherwise).
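The closed form of Theorem 1, on which the proof of Lemma 2 relies, is easy to check numerically. The sketch below is a brute-force $O(n^3)$ evaluation of the min-max formulas (an illustration only, not an efficient implementation); it checks that $L_k = U_k$, that the result is sorted and $L_2$-optimal among sorted candidates, and that translating $\tilde{s}$ by $\delta$ translates the solution by $\delta$, as used in the proof of Lemma 2.

```python
import random

def minmax_solution(ts):
    """Evaluate Theorem 1's closed form directly (brute force, O(n^3)).
    ts is the noisy sequence s~; mean(i, j) is M~[i, j], 0-indexed inclusive."""
    n = len(ts)
    pre = [0.0]
    for v in ts:
        pre.append(pre[-1] + v)
    mean = lambda i, j: (pre[j + 1] - pre[i]) / (j - i + 1)
    L = [min(max(mean(i, j) for i in range(j + 1)) for j in range(k, n))
         for k in range(n)]
    U = [max(min(mean(i, j) for j in range(i, n)) for i in range(k + 1))
         for k in range(n)]
    return L, U

random.seed(0)
ts = [random.uniform(-5, 5) for _ in range(8)]
L, U = minmax_solution(ts)
assert all(abs(l - u) < 1e-9 for l, u in zip(L, U))            # L_k = U_k
sbar = U
assert all(sbar[k] <= sbar[k + 1] for k in range(7))           # sorted
# Lemma 2: translating ts by delta translates the solution by delta
delta = 3.7
_, U2 = minmax_solution([x - delta for x in ts])
assert all(abs(u2 - (u - delta)) < 1e-9 for u2, u in zip(U2, U))
# L2-optimality among sorted sequences (spot check against random candidates)
err = sum((a - b) ** 2 for a, b in zip(sbar, ts))
for _ in range(200):
    cand = sorted(random.uniform(-6, 6) for _ in range(8))
    assert err <= sum((a - b) ** 2 for a, b in zip(cand, ts)) + 1e-9
```

Note that $U_{k+1}$ maximizes over a superset of the candidates of $U_k$, so monotonicity of the output holds exactly, not merely up to rounding.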
Using Theorem 5, we prove the following lemma.

Lemma 4. Suppose $i, j$ are indices such that $s[k] \le 0$ for all $k \in [i,j]$. Then there exists a constant $c$ such that for all $\tau \ge 1$ the following holds.
$$Pr\left( \tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \ge c\, \frac{\log^2((j-i+1)\tau)}{(j-i+1)\,\epsilon^2} \right) \le \frac{1}{(j-i+1)^2 \tau^2}$$

Proof. We apply Theorem 5 to $\tilde{s}[k]$ for $k \in [i,j]$. First note that $E(\tilde{s}[k]) = s[k] \le 0$. Further, $Var(\tilde{s}[k]) = 2/\epsilon^2$, as $\tilde{s}[k]$ is obtained by adding Laplace noise of this variance to $s[k]$. We also know that $\tilde{s}[k] \ge M + s[k]$ happens with probability at most $e^{-M\epsilon}/2$. For simplicity, write $n$ for $j - i + 1$. Denoting $X = \sum_{k \in [i,j]} \tilde{s}[k]$, we see that $E(X) \le 0$ and $Var(X) = 2n/\epsilon^2$. Further, set $M = 3\log(n\tau)/\epsilon$. Denote by $B$ the event that $\tilde{s}[k] \ge M + s[k]$ for some $k$. Thus $Pr(B) \le n e^{-M\epsilon}/2 \le \frac{1}{2n^2\tau^3}$. If $B$ does not happen, we know that $\tilde{s}[k] \le M + s[k]$ for all $k \in [i,j]$, and we can then apply Theorem 5 to get:
$$Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda \log(n\tau)/\epsilon)}} + Pr(B) = e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda \log(n\tau)/\epsilon)}} + \frac{1}{2n^2\tau^3}$$
Setting $\lambda = \frac{8}{\epsilon}\sqrt{n}\log(n\tau)$ gives us
$$Pr\left(X \ge E(X) + \tfrac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$
Since $E(X) \le 0$, we get
$$Pr\left(X \ge \tfrac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$
Also, we observe that $\tilde{M}[i,j] = X/n$, which yields
$$Pr\left(\tilde{M}[i,j] \ge \frac{8\log(n\tau)}{\epsilon\sqrt{n}}\right) \le \frac{1}{n^2\tau^2}$$
Finally, observe that $\tilde{M}[i,j] \le c$ implies $\tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \le c^2$. Thus we get
$$Pr\left(\tilde{M}[i,j]^2\, I_{\tilde{M}[i,j]} \ge \frac{64\log^2(n\tau)}{n\,\epsilon^2}\right) \le \frac{1}{n^2\tau^2}$$
Putting $n = j - i + 1$ and using $c = 64$ gives the required result.

Now we can give the proof of Theorem 2.

Proof of Theorem 2. The claim $error(\tilde{S}) = \Theta(n/\epsilon^2)$ is immediate, since
$$error(\tilde{S}) = \sum_{k=1}^{n} error(\tilde{s}[k]) = n \cdot \frac{2}{\epsilon^2}$$
In the rest of the proof, we bound $error(\bar{S})$. Let $s = S(I)$ be the input sequence.
We know that $s$ consists of $d$ distinct elements. Denote by $s_r$ the $r$th distinct element of $s$, and by $[l_r, u_r]$ the set of indices corresponding to $s_r$, i.e., $s[i] = s_r$ for all $i \in [l_r, u_r]$ and $s[i] \ne s_r$ for all $i \notin [l_r, u_r]$. Let $M[i,j]$ record the mean of the elements in $s[i,j]$, i.e., $M[i,j] = \sum_{k=i}^{j} s[k]/(j-i+1)$.

To bound $error(\bar{S})$, we bound $error(\bar{S}[i])$ separately for each $i$. To bound $error(\bar{S}[i])$, we can assume w.l.o.g. that $s[i]$ is $0$: if $s[i] \ne 0$, we can translate the sequence $s$ by $s[i]$, which, as shown in Lemma 2, preserves $error(\bar{S}[i])$ while making $s[i] = 0$.

Let $k \in [l_r, u_r]$ be any index for the $r$th distinct element of $s$. By definition, $error(\bar{S}[k]) = E(\bar{s}[k] - s[k])^2 = E(\bar{s}[k]^2)$ (as we can assume w.l.o.g. that $s[k] = 0$). From Theorem 1, we know that $\bar{s}[k] = U_k$. Thus $error(\bar{S}[k]) = E(U_k^2)$. Here we treat $U_k = \max_{i \le k} \min_j \tilde{M}[i,j]$ as a random variable. Now, by definition of $E$, we have
$$E(U_k^2) = E(U_k^2\, I_{U_k}) + E(U_k^2 (1 - I_{U_k})) = A + B \text{ (say)}$$
We shall bound $A$ and $B$ separately. For bounding $A$, denote $\hat{U}_k = \max_{i \le k} \tilde{M}[i, u_r]$. It is apparent that $\hat{U}_k \ge U_k$, and thus $\hat{U}_k^2\, I_{\hat{U}_k} \ge U_k^2\, I_{U_k}$. To bound $A$, we observe that
$$A = E(U_k^2\, I_{U_k}) \le E(\hat{U}_k^2\, I_{\hat{U}_k})$$
Further, since $\hat{U}_k = \max_{i \le k} \tilde{M}[i,u_r]$, we know that $\hat{U}_k^2\, I_{\hat{U}_k} = \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]}$. Thus we can write:
$$A \le E(\hat{U}_k^2\, I_{\hat{U}_k}) = E\left( \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \right)$$
Let $\tau > 1$ be any number and $c$ the constant used in Lemma 4. Denote by $e_i$ the event that
$$\tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \ge c\, \frac{\log^2((u_r - i + 1)\tau)}{(u_r - i + 1)\,\epsilon^2}$$
We can apply Lemma 4 to compute the probability of $e_i$, since $s[j] \le 0$ for all $j \le u_r$ (as we assumed w.l.o.g. that $s[k] = 0$). Thus we get $Pr(e_i) \le \frac{1}{(u_r - i + 1)^2 \tau^2}$. Define $e = \vee_{i=1}^{u_r} e_i$.
Then $Pr(e) \le \sum_{i=1}^{u_r} Pr(e_i) \le 2/\tau^2$ (as $\sum_{i=1}^{u_r} 1/i^2 \le 2$). If the event $e$ does not happen, then it is easy to see that
$$\hat{U}_k^2\, I_{\hat{U}_k} = \max_{i \le k} \tilde{M}[i,u_r]^2\, I_{\tilde{M}[i,u_r]} \le c\, \frac{\log^2((u_r - k + 1)\tau)}{(u_r - k + 1)\,\epsilon^2}$$
Thus with probability at least $1 - 2/\tau^2$ (which is $Pr(\neg e)$), $\hat{U}_k^2\, I_{\hat{U}_k}$ is bounded as above. This yields that there exist constants $c_1$ and $c_2$ such that $E(\hat{U}_k^2\, I_{\hat{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2}$; the proof is by an application of Lemma 3 (as $\hat{U}_k$ satisfies its hypothesis) and a simple integration over $\tau$ ranging from $1$ to $\infty$. Finally, we get $A \le E(\hat{U}_k^2\, I_{\hat{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2}$.

Recall that $B = E(U_k^2(1 - I_{U_k}))$. We can write $B$ as $E(L_k^2(1 - I_{L_k}))$, as $L_k = U_k$. Using exactly the same arguments as above for $L_k$, but on the sequence $-S$, yields $B \le \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2}$. Thus $error(\bar{S}[k]) = A + B$, which is at most
$$\frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2} + \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2}$$
Summing over all indices, we obtain a bound on the total $error(\bar{S})$:
$$error(\bar{S}) = \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} error(\bar{S}[k]) \le \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\,\epsilon^2} + \sum_{r=1}^{d} \sum_{k \in [l_r,u_r]} \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\,\epsilon^2} \le \sum_{r=1}^{d} \frac{c_1 \log^3(u_r - l_r + 1) + c_2}{\epsilon^2}$$
Finally, noting that $u_r - l_r + 1$ is just $n_r$, the number of occurrences of $s_r$ in $s$, we get $error(\bar{S}) = \sum_r \frac{c_1 \log^3 n_r + c_2}{\epsilon^2} = O(d \log^3 n / \epsilon^2)$. This completes the proof of the theorem.

D.3 Proof of Theorem 3

We first restate the theorem below.

Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the unique minimum $L_2$ solution $\bar{h}$ is given by the following recurrence relation. Let $u$ be $v$'s parent:
$$\bar{h}[v] = \begin{cases} z[v] & \text{if } v \text{ is the root} \\ z[v] + \frac{1}{k}\big(\bar{h}[u] - \sum_{w \in succ(u)} z[w]\big) & \text{otherwise} \end{cases}$$

Proof.
We first show that $\bar{h}[r] = z[r]$ for the root node $r$. By definition of a minimum $L_2$ solution, the sequence $\bar{h}$ satisfies the following constrained optimization problem (where $succZ[u] = \sum_{w \in succ(u)} z[w]$):
$$\text{minimize } \sum_v (\bar{h}[v] - \tilde{h}[v])^2 \quad \text{subject to } \forall v,\ \sum_{u \in succ(v)} \bar{h}[u] = \bar{h}[v]$$
Denote by $leaves(v)$ the set of leaf nodes in the subtree rooted at $v$. The above optimization problem can be rewritten as the following unconstrained minimization problem.
$$\text{minimize } \sum_v \Big( \big(\textstyle\sum_{l \in leaves(v)} \bar{h}[l]\big) - \tilde{h}[v] \Big)^2$$
To find the minimum, we take the derivative w.r.t. $\bar{h}[l]$ for each $l$ and equate it to $0$. We thus get the following set of equations for the minimum solution.
$$\forall l,\ \sum_{v : l \in leaves(v)} 2\Big( \big(\textstyle\sum_{l' \in leaves(v)} \bar{h}[l']\big) - \tilde{h}[v] \Big) = 0$$
Since $\sum_{l' \in leaves(v)} \bar{h}[l'] = \bar{h}[v]$, the above set of equations can be rewritten as:
$$\forall l,\ \sum_{v : l \in leaves(v)} \bar{h}[v] = \sum_{v : l \in leaves(v)} \tilde{h}[v]$$
For a leaf node $l$, we can think of the above equation for $l$ as corresponding to the path from $l$ to the root $r$ of the tree: the equation states that the sums of the sequences $\bar{h}$ and $\tilde{h}$ over the nodes along the path are the same. We can sum all the equations to obtain the following equation.
$$\sum_v \sum_{l \in leaves(v)} \bar{h}[v] = \sum_v \sum_{l \in leaves(v)} \tilde{h}[v]$$
Denote by $level(i)$ the set of nodes at height $i$ of the tree; thus the root belongs to $level(\ell - 1)$ and the leaves to $level(0)$. Abbreviating $LHS$ ($RHS$) for the left (right) hand side of the above equation, we observe the following.
$$LHS = \sum_v \sum_{l \in leaves(v)} \bar{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \bar{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i\, \bar{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \bar{h}[v] = \sum_{i=0}^{\ell-1} k^i\, \bar{h}[r] = \frac{k^\ell - 1}{k - 1}\, \bar{h}[r]$$
Here we use the fact that $\sum_{v \in level(i)} \bar{h}[v] = \bar{h}[r]$ for any level $i$, which holds because $\bar{h}$ satisfies the constraints of the tree.
In a similar way, we also simplify the RHS.
$$RHS = \sum_v \sum_{l \in leaves(v)} \tilde{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \tilde{h}[v] = \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i\, \tilde{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Note that we cannot simplify the RHS further, as $\tilde{h}[v]$ may not satisfy the constraints of the tree. Finally, equating $LHS$ and $RHS$, we get the following equation.
$$\bar{h}[r] = \frac{k-1}{k^\ell - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Further, it is easy to expand $z[r]$ and check that
$$z[r] = \frac{k-1}{k^\ell - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$
Thus we get $\bar{h}[r] = z[r]$.

For a node $v$ other than the root $r$, assume that we have computed $\bar{h}[u]$ for the parent $u$ of $v$, and denote $H = \bar{h}[u]$. Once $H$ is fixed, we can argue that the value of $\bar{h}[v]$ is independent of the values of $\tilde{h}[w]$ for any $w$ not in the subtree of $u$. For nodes $w \in subtree(u)$, the $L_2$ minimization problem is equivalent to the following one.
$$\text{minimize } \sum_{w \in subtree(u)} (\bar{h}[w] - \tilde{h}[w])^2 \quad \text{subject to } \forall w \in subtree(u),\ \sum_{w' \in succ(w)} \bar{h}[w'] = \bar{h}[w] \ \text{ and } \sum_{v' \in succ(u)} \bar{h}[v'] = H$$
Again using the nodes $l \in leaves(u)$, we convert this minimization into the following one.
$$\text{minimize } \sum_{w \in subtree(u)} \Big( \big(\textstyle\sum_{l \in leaves(w)} \bar{h}[l]\big) - \tilde{h}[w] \Big)^2 \quad \text{subject to } \sum_{l \in leaves(u)} \bar{h}[l] = H$$
We can now use the method of Lagrange multipliers to find the solution of the above constrained minimization problem. Using $\lambda$ as the Lagrange parameter for the constraint $\sum_{l \in leaves(u)} \bar{h}[l] = H$, we get the following set of equations.
$$\forall l \in leaves(u),\ \sum_{w : l \in leaves(w)} 2\big(\bar{h}[w] - \tilde{h}[w]\big) = -\lambda$$
Adding the equations for all $l \in leaves(u)$ and solving for $\lambda$, we get $\lambda = -\frac{H - succZ[u]}{n(u) - 1}$, where $n(u)$ is the number of nodes in $subtree(u)$.
Finally, adding the above equations for only the leaf nodes $l \in leaves(v)$, we get
$$\bar{h}[v] = z[v] - (n(v) - 1)\cdot\lambda = z[v] + \frac{n(v)-1}{n(u)-1}\big(H - succZ[u]\big) = z[v] + \frac{1}{k}\big(\bar{h}[u] - succZ[u]\big)$$
This completes the proof.

D.4 Proof of Theorem 4

First, the theorem is restated.

Theorem 4. (i) $\bar{H}$ is a linear unbiased estimator; (ii) $error(\bar{H}_q) \le error(E_q)$ for all $q$ and for all linear unbiased estimators $E$; (iii) $error(\bar{H}_q) = O(\ell^3/\epsilon^2)$ for all $q$; and (iv) there exists a query $q$ such that
$$error(\bar{H}_q) \le \frac{3}{2(\ell-1)(k-1) - k}\, error(\tilde{H}_q)$$

Proof. For (i), the linearity of $\bar{H}$ is obvious from the definitions of $z$ and $\bar{h}$. To show $\bar{H}$ is unbiased, we first show that $z$ is unbiased, i.e., $E(z[v]) = h[v]$. We use induction: in the base case $v$ is a leaf node, in which case $E(z[v]) = E(\tilde{h}[v]) = h[v]$. If $v$ is not a leaf node, assume we have shown that $z$ is unbiased for all nodes $u \in succ(v)$. Thus
$$E(succZ[v]) = \sum_{u \in succ(v)} E(z[u]) = \sum_{u \in succ(v)} h[u] = h[v]$$
Thus $succZ[v]$ is an unbiased estimator for $h[v]$. Since $z[v]$ is a linear combination of $\tilde{h}[v]$ and $succZ[v]$, which are both unbiased estimators, $z[v]$ is also unbiased. This completes the induction step, proving that $z$ is unbiased for all nodes. Finally, we note that $\bar{h}[v]$ is a linear combination of $\tilde{h}[v]$, $z[v]$, and $succZ[v]$, all of which are unbiased estimators; thus $\bar{h}[v]$ is also unbiased, proving (i).

For (ii), we shall use the Gauss-Markov theorem [21]. We treat the sequence $\tilde{h}$ as the set of observed variables, and $l$, the sequence of original leaf counts, as the set of unobserved variables. It is easy to see that for all nodes $v$,
$$\tilde{h}[v] = \sum_{u \in leaves(v)} l[u] + noise(v)$$
Here $noise(v)$ is a Laplacian random variable, independent across nodes $v$ but with the same variance for all nodes.
Hence $\tilde{h}$ satisfies the hypothesis of the Gauss-Markov theorem. Part (i) shows that $\bar{h}$ is a linear unbiased estimator. Further, $\bar{h}$ has been obtained by minimizing the $L_2$ distance to $\tilde{h}$. Hence $\bar{h}$ is the ordinary least squares (OLS) estimator, which by the Gauss-Markov theorem has the least error. Since it is the OLS estimator, it minimizes the error for estimating any linear combination of the original counts, which includes in particular the given range query $q$.

For (iii), we note that any query $q$ can be answered by summing at most $k\ell$ nodes in the tree. Since for any node $v$, $error(\bar{H}[v]) \le error(\tilde{H}[v]) = 2\ell^2/\epsilon^2$, we get
$$error(\bar{H}_q) \le k\ell \cdot (2\ell^2/\epsilon^2) = O(\ell^3/\epsilon^2)$$
For (iv), denote by $l_1$ and $l_2$ the leftmost and rightmost leaf nodes in the tree, and by $r$ the root. We consider the query $q$ that asks for the sum of all leaf nodes except $l_1$ and $l_2$. Then, by (ii), $error(\bar{H}_q)$ is at most the error of the estimate $\tilde{h}[r] - \tilde{h}[l_1] - \tilde{h}[l_2]$, which is $6\ell^2/\epsilon^2$. On the other hand, $\tilde{H}$ requires summing $2(k-1)(\ell-1) - k$ noisy counts in total: $2(k-1)$ at each level below the root, except at the level just below the root, where only $k - 2$ counts are summed. Thus $error(\tilde{H}_q) = 2\big(2(k-1)(\ell-1) - k\big)\ell^2/\epsilon^2$, and therefore
$$error(\bar{H}_q) \le \frac{3}{2(\ell-1)(k-1) - k}\, error(\tilde{H}_q)$$
This completes the proof.

E. COMPARISON WITH BLUM ET AL.

We compare a binary $\tilde{H}_q$ against the binary-search equi-depth histogram of Blum et al. [4] in terms of $(\epsilon,\delta)$-usefulness as defined by Blum et al. Since $\epsilon$ is used in the usefulness definition, we use $\alpha$ as the parameter for $\alpha$-differential privacy. Let $N$ be the number of records in the database. An algorithm is $(\epsilon,\delta)$-useful for a class of queries if, with probability at least $1 - \delta$, the absolute error is at most $\epsilon N$ for every query in the class.
For any range query $q$, the absolute error of $\tilde{H}_q$ is $|\tilde{H}_q(I) - H_q(I)| = |Y|$, where $Y = \sum_{i=1}^{c} \gamma_i$, each $\gamma_i \sim Lap(\ell/\alpha)$, and $c$ is the number of subtrees in $\tilde{H}_q$, which is at most $2\ell$. We use Corollary 1 from [5] to bound the error of a sum of Laplace random variables. With $\nu = \sqrt{c\ell^2/\alpha^2}\sqrt{2\ln\tfrac{2}{\delta'}}$, we obtain
$$Pr\left[ |Y| \le \frac{\sqrt{16\,\ell^3 \ln\tfrac{2}{\delta'}}}{\alpha} \right] \ge 1 - \delta'$$
The above is for a single range query. To bound the error for all $\binom{n}{2}$ range queries, we use a union bound: set $\delta' = \delta / \binom{n}{2}$. Then $\tilde{H}$ is $(\epsilon,\delta)$-useful provided that
$$\epsilon \ge \frac{\sqrt{16\,\ell^3 \ln\tfrac{2\binom{n}{2}}{\delta}}}{\alpha N}$$
As in Blum et al., we can also fix $\epsilon$ and bound the size of the database: $\tilde{H}$ is $(\epsilon,\delta)$-useful when
$$N \ge \frac{\sqrt{16\,\ell^3 \ln\tfrac{2\binom{n}{2}}{\delta}}}{\epsilon\,\alpha} = O\left( \frac{\log^{3/2} n \sqrt{\log n + \log\tfrac{1}{\delta}}}{\epsilon\,\alpha} \right)$$
In comparison, the technique of Blum et al. is $(\epsilon,\delta)$-useful for range queries when
$$N \ge O\left( \frac{\log n \sqrt{\log\log n + \log\tfrac{1}{\delta}}}{\epsilon\,\alpha^3} \right)$$
Both techniques scale at most poly-logarithmically with the size of the domain. However, $\tilde{H}$ scales better with $\alpha$, achieving the same utility guarantee with a database that is smaller by a factor of $O(1/\alpha^2)$.

The above comparison reveals a distinction between the techniques: for $\tilde{H}_q$, the bound on the absolute error is independent of the database size, i.e., it depends only on $\delta$, $\alpha$, and the size of the range. However, for the Blum et al. approach, the absolute error increases with the size of the database at a rate of $O(N^{2/3})$.
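The recurrence of Theorem 3 is easy to evaluate on a small example. In the sketch below, the definition of the intermediate estimate $z$ comes from earlier in the paper; the weighting used here, $z[v] = \big((k^h - k^{h-1})\,\tilde{h}[v] + (k^{h-1} - 1)\,succZ[v]\big)/(k^h - 1)$ for a node of height $h$ (leaves having height 1), is a reconstruction chosen to be consistent with the root case $\bar{h}[r] = z[r]$ derived in the proof of Theorem 3. The assertions check that $\bar{h}$ satisfies the tree constraints and that the root matches the closed form derived for $\bar{h}[r]$.

```python
import random

k, levels = 2, 3                      # complete k-ary tree with `levels` levels (l in the paper)
n_nodes = (k ** levels - 1) // (k - 1)

def children(v):                      # heap-style layout: children of v are k*v+1 .. k*v+k
    return [u for u in range(k * v + 1, k * v + 1 + k) if u < n_nodes]

def height(v):                        # leaves have height 1, the root has height `levels`
    h = 1
    while children(v):
        v = children(v)[0]
        h += 1
    return h

random.seed(1)
ht = [random.uniform(0, 10) for _ in range(n_nodes)]   # noisy counts h~[v]

# Bottom-up: z[v] combines h~[v] with the sum of the children's z values.
z = [0.0] * n_nodes
for v in reversed(range(n_nodes)):
    c = children(v)
    if not c:
        z[v] = ht[v]                  # leaf: z = h~
    else:
        h = height(v)
        succZ = sum(z[u] for u in c)
        z[v] = ((k**h - k**(h-1)) * ht[v] + (k**(h-1) - 1) * succZ) / (k**h - 1)

# Top-down: the recurrence of Theorem 3.
hbar = [0.0] * n_nodes
hbar[0] = z[0]                        # root: h_bar[r] = z[r]
for v in range(1, n_nodes):
    u = (v - 1) // k                  # parent of v
    succZ_u = sum(z[w] for w in children(u))
    hbar[v] = z[v] + (hbar[u] - succZ_u) / k

# Consistency: the children's values sum to the parent's at every internal node.
for v in range(n_nodes):
    if children(v):
        assert abs(sum(hbar[u] for u in children(v)) - hbar[v]) < 1e-9

# Root matches h_bar[r] = (k-1)/(k^l - 1) * sum_i k^i * sum_{v in level(i)} h~[v].
root = (k - 1) / (k**levels - 1) * sum(k ** (height(v) - 1) * ht[v] for v in range(n_nodes))
assert abs(hbar[0] - root) < 1e-9
```

Note that the consistency check holds for any choice of $z$, by construction of the recurrence; it is the root check that exercises the reconstructed weighting of $z$.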
