Composition Attacks and Auxiliary Information in Data Privacy


Authors: Srivatsava Ranjit Ganta, Shiva Prasad Kasiviswanathan, Adam Smith

Penn State University, University Park, PA 16802
ranjit@cse.psu.edu, kasivisw@cse.psu.edu, asmith@cse.psu.edu

ABSTRACT

Privacy is an increasingly important aspect of data publishing. Reasoning about privacy, however, is fraught with pitfalls. One of the most significant is the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary gleans from other channels such as the web, public records, or domain knowledge. This paper explores how one can reason about privacy in the face of rich, realistic sources of auxiliary information. Specifically, we investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data about overlapping populations.

1. We investigate composition attacks, in which an adversary uses independent anonymized releases to breach privacy. We explain why recently proposed models of limited auxiliary information fail to capture composition attacks. Our experiments demonstrate that even a simple instance of a composition attack can breach privacy in practice, for a large class of currently proposed techniques. The class includes k-anonymity and several recent variants.

2. On a more positive note, certain randomization-based notions of privacy (such as differential privacy) provably resist composition attacks and, in fact, the use of arbitrary side information. This resistance enables "stand-alone" design of anonymization schemes, without the need for explicitly keeping track of other releases. We provide a precise formulation of this property, and prove that an important class of relaxations of differential privacy also satisfies the property. This significantly enlarges the class of protocols known to enable modular design.

1. INTRODUCTION

Privacy is an increasingly important aspect of data publishing. The potential social benefits of analyzing large collections of personal information (census data, medical records, social networks) are significant. At the same time, the release of information from such repositories can be devastating to the privacy of individuals or organizations [5]. The challenge is therefore to discover and release the global characteristics of these databases without compromising the privacy of the individuals whose data they contain.

Reasoning about privacy, however, is fraught with pitfalls. One of the most significant difficulties is the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary gleans from other channels such as the web or public records.
For example, simply removing obviously identifying information such as names and addresses does not suffice to protect privacy, since the remaining information (such as zip code, gender, and date of birth [30]) may still identify a person uniquely when combined with auxiliary information (such as voter registration records). Schemes that resist such linkage have been the focus of extensive investigation, starting with work on publishing contingency tables [1] and, more recently, in a line of techniques based on "k-anonymity" [30].

This paper explores how one can reason about privacy in the face of rich, realistic sources of auxiliary information. This follows lines of work in both the data mining [26, 27, 9] and cryptography [10, 12] communities that have sought principled ways to incorporate unknown auxiliary information into anonymization schemes. Specifically, we investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data about overlapping populations. We show new attacks on some schemes and also deepen the current understanding of schemes known to resist such attacks. Our results and their relation to previous work are discussed below.

Schemes that retain privacy guarantees in the presence of independent releases are said to compose securely. The terminology, borrowed from cryptography (which borrowed, in turn, from software engineering), stems from the fact that schemes which compose securely can be designed in a stand-alone fashion without explicitly taking other releases into account. Thus, understanding independent releases is essential for enabling modular design. In fact, one would like schemes that compose securely not only with independent instances of themselves, but with arbitrary external knowledge. We discuss both types of composition in this paper.

The dual problem to designing schemes with good composition properties is the design of attacks that exploit such information. We call these composition attacks. A simple example of such an attack, in which two hospitals with overlapping patient populations publish anonymized medical data, is presented below. Composition attacks highlight a realistic and important class of vulnerabilities. As privacy-preserving data publishing becomes more commonly deployed, it is increasingly difficult to keep track of all the organizations that publish anonymized summaries involving a given individual or entity, and schemes that are vulnerable to composition attacks will become increasingly difficult to use safely.

1.1 Contributions

Our contributions are summarized briefly in the abstract, above, and discussed in more detail in the following subsections.

1.1.1 Composition Attacks on Partition-based Schemes

We introduce composition attacks and study their effect on a popular class of partitioning-based anonymization schemes. Very roughly, computer scientists have worked on two broad classes of anonymization techniques. Randomization-based schemes introduce uncertainty either by randomly perturbing the raw data (techniques called input perturbation or randomized response, e.g., [34, 2, 16], and post-randomization, e.g., [32]), or by injecting randomness into the algorithm used to analyze the data (e.g., [6, 28]).
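For concreteness, here is a minimal sketch (ours, not code from the cited papers) of Warner-style randomized response [34] on a binary attribute; the parameter names and the 0.75 truth probability are illustrative:

```python
import random

def randomized_response(true_bit: int, p_truth: float = 0.75) -> int:
    """Report the true bit with probability p_truth, else its complement.

    Each respondent's answer is perturbed before collection, so no single
    reported bit reveals the true value with certainty.
    """
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_frequency(reports: list, p_truth: float = 0.75) -> float:
    """Unbiased estimate of the true fraction of 1s from noisy reports.

    If f is the true fraction, E[reported fraction] = f*p + (1-f)*(1-p),
    so we invert that affine map.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Example: 10,000 respondents, 30% of whom truly hold the sensitive bit.
truth = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
reports = [randomized_response(b) for b in truth]
print(round(estimate_frequency(reports), 3))  # close to 0.30
```

The point of the perturbation is that aggregate statistics remain estimable while each individual retains plausible deniability about his or her reported value.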
Partition-based schemes cluster the individuals in the database into disjoint groups satisfying certain criteria (for example, in k-anonymity [30], each group must have size at least k). For each group, certain exact statistics are calculated and published. Partition-based schemes include k-anonymity [30] as well as several recent variants, e.g., [26, 23, 36, 27, 9].

Because they release exact information, partition-based schemes seem especially vulnerable to composition attacks. In the first part of this paper we study a simple instance of a composition attack called an intersection attack. We observe that the specific properties of current anonymization schemes make this attack possible, and we evaluate its success empirically.

Example. Suppose two hospitals H1 and H2 in the same city release anonymized patient-discharge information. Because they are in the same city, some patients may visit both hospitals with similar ailments. Tables 1(a) and 1(b) represent (hypothetical) independent k-anonymizations of the discharge data from H1 and H2 using k = 4 and k = 6, respectively. The sensitive attribute here is the patient's medical condition. It is left untouched. The other attributes, deemed non-sensitive, are generalized (that is, replaced with aggregate values) so that within each group of rows, the vectors of non-sensitive attributes are identical. If Alice's employer knows that she is 28 years old, lives in zip code 13012, and recently visited both hospitals, then he can attempt to locate her in both anonymized tables. Alice matches four potential records in H1's data and six potential records in H2's. However, the only disease that appears in both matching lists is AIDS, and so Alice's employer learns the reason for her visit.

Table 1: A simple example of a composition attack. Tables (a) and (b) are 4-anonymous (respectively, 6-anonymous) patient data from two hypothetical hospitals. If Alice's employer knows that she is 28, lives in zip code 13012, and visits both hospitals, he learns that she has AIDS.

(a)
         Non-Sensitive                       Sensitive
     Zip code   Age    Nationality   Condition
  1  130**      <30    *             AIDS
  2  130**      <30    *             Heart Disease
  3  130**      <30    *             Viral Infection
  4  130**      <30    *             Viral Infection
  5  130**      ≥40    *             Cancer
  6  130**      ≥40    *             Heart Disease
  7  130**      ≥40    *             Viral Infection
  8  130**      ≥40    *             Viral Infection
  9  130**      3*     *             Cancer
 10  130**      3*     *             Cancer
 11  130**      3*     *             Cancer
 12  130**      3*     *             Cancer

(b)
         Non-Sensitive                       Sensitive
     Zip code   Age    Nationality   Condition
  1  130**      <35    *             AIDS
  2  130**      <35    *             Tuberculosis
  3  130**      <35    *             Flu
  4  130**      <35    *             Tuberculosis
  5  130**      <35    *             Cancer
  6  130**      <35    *             Cancer
  7  130**      ≥35    *             Cancer
  8  130**      ≥35    *             Cancer
  9  130**      ≥35    *             Cancer
 10  130**      ≥35    *             Tuberculosis
 11  130**      ≥35    *             Viral Infection
 12  130**      ≥35    *             Viral Infection

Intersection Attacks. The above example relies on two properties of the partition-based anonymization schemes: (i) Exact sensitive value disclosure: the "sensitive" value corresponding to each member of the group is published exactly; and (ii) Locatability: given any individual's non-sensitive values (non-sensitive values are exactly those that are assumed to be obtainable from other, public information sources), one can locate the group into which the individual has been placed. Based on these properties, an adversary can narrow down the set of possible sensitive values for an individual by intersecting the sets of sensitive values present in his or her groups from multiple anonymized releases.

Properties (i) and (ii) turn out to be widespread. The exact disclosure of sensitive value lists is a design feature common to all the schemes based on k-anonymity: preserving the exact distribution of sensitive values is important, and so no recoding is usually applied. Locatability is less universal, since it depends on the exact choice of clustering algorithm (used to form groups) and the recoding applied to the non-sensitive attributes. However, some schemes always satisfy locatability by virtue of their structure (e.g., schemes that recursively partition the data set along the lines of a hierarchy that is subsequently used for generalization [21, 22]).
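To make the attack concrete, the following sketch (ours, with Table 1's sensitive values hard-coded) carries out the intersection for Alice:

```python
# Sensitive values in the groups that Alice (age 28, zip 13012) matches:
# the "<30" group of Table 1(a) and the "<35" group of Table 1(b).
group_h1 = ["AIDS", "Heart Disease", "Viral Infection", "Viral Infection"]
group_h2 = ["AIDS", "Tuberculosis", "Flu", "Tuberculosis", "Cancer", "Cancer"]

# Property (i): each group's sensitive values are published exactly.
# Property (ii): the attacker can locate the target's group in each release.
candidates = set(group_h1) & set(group_h2)
print(candidates)  # {'AIDS'} -- the intersection pins down Alice's condition
```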
For other schemes, locatability is not perfect, but our experiments suggest that using simple heuristics one can locate an individual's group with high probability.

Even with these properties, it is difficult to come up with a theoretical model for intersection attacks, because the partitioning techniques generally create dependencies that are hard to model analytically. However, if the sensitive values of the members of a group could be assumed to be statistically independent of their non-sensitive attribute values, then a simple birthday-paradox-style analysis would yield reasonable bounds.

Experimental Results. Instead, we evaluated the success of intersection attacks empirically. We ran the intersection attack on two popular census databases anonymized using partition-based schemes. We evaluated the severity of such an attack by measuring the number of individuals who had their sensitive value revealed. Our experimental results confirm that partitioning-based anonymization schemes, including k-anonymity and its recent variants ℓ-diversity and t-closeness, are indeed vulnerable to intersection attacks. Section 3 elaborates our methodology and results.

Related Work on Modeling Background Knowledge. It is important to point out that the partition-based schemes in the literature were not designed to be used in contexts where independent releases are available. Thus, we do not view our results as pointing out a flaw in these schemes, but rather as directing the community's attention to an important direction for future work.

It is equally important to highlight the progress that has already been made on modeling sophisticated background knowledge in partition-based schemes. One line has focused on taking into account other, known releases, such as previous publications by the same organization ("sequential" releases, [33, 7, 36]) and multiple views of the same data set [37]. Another line has considered incorporating knowledge of the clustering algorithm used to group individuals [35]. Most relevant to this paper are works that have sought to model unknown background knowledge. Martin et al. [27] and Chen et al. [9] provide complexity measures for an adversary's side information (roughly, they measure the size of the smallest formula within a CNF-like class that can encode the side information). Both works design schemes that provably resist attacks based on side information whose complexity is below a given threshold.
Independent releases (and hence composition attacks) fall outside the models proposed by these works. The sequential release models do not fit because they assume the other releases are known to the anonymization algorithm. The complexity-based measures do not fit because independent releases appear to have complexity that is linear in the size of the data set.

1.1.2 Composing Randomization-based Schemes

Composition attacks appear to be difficult to reason about, and it is not initially clear whether it is possible at all to design schemes that resist such attacks. Even defining composition properties precisely is tricky in the presence of malicious behavior (for example, see [24] for a recent survey about composability of cryptographic protocols).

Nevertheless, a significant family of anonymization definitions do provide guarantees against composition attacks, namely schemes that satisfy differential privacy [14]. Recent work has greatly expanded the applicability of differential privacy and its relaxations, both in the theoretical [15, 6, 14, 4, 28] and applied [17, 3, 25] literature. However, certain recently developed techniques, such as sampling [8], instance-based noise addition [29], and data synthesis [25], appear to require relaxations of the definition.

It is simple to prove that both the strict and relaxed variants of differential privacy compose well (see [13, 29, 28]). Less trivially, however, one can prove that strictly differentially private algorithms also provide meaningful privacy in the presence of arbitrary side information (Dwork and McSherry, [12]). In particular, these schemes compose well even with completely different anonymization schemes.

It is natural to ask if there are weaker definitions which provide similar guarantees. Certainly not all of them do: one natural relaxation of differential privacy, which replaces the multiplicative distance used in differential privacy with total variation distance, fails completely to protect privacy (see Example 2 in [14]).

In this paper, we prove that two important relaxations of differential privacy do, indeed, resist arbitrary side information. First, we provide a Bayesian formulation of differential privacy which makes its resistance to arbitrary side information explicit. Second, we prove that the relaxed definitions of [13, 25] still imply the Bayesian formulation. The proof is non-trivial, and relies on the "continuity" of Bayes' rule with respect to certain distance measures on probability distributions. Our result means that the recent techniques mentioned above [13, 8, 29, 25] can be used modularly with the same sort of assurances as in the case of strictly differentially private algorithms.

2. PARTITION-BASED SCHEMES

Let D be a multiset of tuples, where each tuple corresponds to an individual in the database. Let R be an anonymized version of D. From this point on, we use the terms tuple and individual interchangeably, unless the context leads to ambiguity. Let A = {A1, A2, ..., Ar} be a collection of attributes and t be a tuple in R; we use the notation t[A] to denote (t[A1], ..., t[Ar]), where each t[Ai] denotes the value of attribute Ai in table R for t. In partitioning-based anonymization approaches, the data attributes are divided into two classes, sensitive attributes and non-sensitive attributes.
A sensitive attribute is one whose value, and an individual's association with that value, should not be disclosed. All attributes other than the sensitive attributes are non-sensitive attributes.

DEFINITION 1 (QUASI-IDENTIFIER). A set of non-sensitive attributes {Q1, ..., Qr} is called a quasi-identifier if there is at least one individual in the original sensitive database D who can be uniquely identified by linking these attributes with auxiliary data.

Previous work in this line typically assumed that all the attributes in the database other than the sensitive attribute form the quasi-identifier.

DEFINITION 2 (EQUIVALENCE CLASS). An equivalence class for a table R with respect to attributes in A is the set of all tuples t1, t2, ..., ti ∈ R for which the projection of each tuple onto attributes in A is the same, i.e., t1[A] = t2[A] = ... = ti[A].

Partition-based schemes cluster individuals into groups, and then recode (i.e., generalize or change) the non-sensitive values so that each group forms an equivalence class with respect to the quasi-identifiers. Sensitive values are not recoded. Different criteria are used to decide how, exactly, the groups should be structured. The most common rule is k-anonymity, which requires that each equivalence class contain at least k individuals.

DEFINITION 3 (k-ANONYMITY). A release R is k-anonymous if for every tuple t ∈ R, there exist at least k − 1 other tuples t1, t2, ..., t_{k−1} ∈ R such that t[A] = t1[A] = ... = t_{k−1}[A] for every collection A of attributes in the quasi-identifier.

In our experiments we also consider two extensions to k-anonymity.

DEFINITION 4 (ENTROPY ℓ-DIVERSITY). For an equivalence class E, let S denote the domain of the sensitive attributes, and let p(E, s) be the fraction of records in E that have sensitive value s. Then E is ℓ-diverse if

    −Σ_{s∈S} p(E, s) log p(E, s) ≥ log ℓ.

A table is ℓ-diverse if all its equivalence classes are ℓ-diverse.

DEFINITION 5 (t-CLOSENESS). An equivalence class E is t-close if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is t-close if all its equivalence classes are t-close.
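As a quick illustration of Definition 4, here is a small checker (ours; it uses the natural logarithm, matching the log on both sides of the inequality):

```python
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l):
    """Check entropy ℓ-diversity of one equivalence class (Definition 4)."""
    n = len(sensitive_values)
    entropy = 0.0
    for count in Counter(sensitive_values).values():
        p = count / n  # p(E, s): fraction of records with sensitive value s
        entropy -= p * math.log(p)
    return entropy >= math.log(l)

# The "<30" group of Table 1(a): three distinct values, but skewed.
group = ["AIDS", "Heart Disease", "Viral Infection", "Viral Infection"]
print(is_entropy_l_diverse(group, 3))  # False: entropy ≈ 1.04 < log 3
print(is_entropy_l_diverse(group, 2))  # True:  entropy ≈ 1.04 ≥ log 2
```

Note that a group with three distinct values need not be entropy 3-diverse; the skew toward "Viral Infection" is what fails the check above.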
Locatability. As mentioned in the introduction, many anonymization algorithms satisfy locatability; that is, they output tables in which one can locate an individual's group based only on his or her non-sensitive values.

DEFINITION 6 (LOCATABILITY). Let Q be the set of quasi-identifier values of an individual in the original database D. Given the k-anonymized release R of D, the locatability property allows an adversary to identify the set of tuples {t1, ..., tK} in R (where K ≥ k) that correspond to Q.

Locatability does not necessarily hold for all partition-based schemes, since it depends on the exact choice of clustering algorithm (used to form groups) and the recoding applied to the non-sensitive attributes. However, it is widespread. Some schemes always satisfy locatability by virtue of their structure (e.g., schemes that recursively partition the data set along the lines of a hierarchy always provide locatability if the attributes are then generalized using the same hierarchy, or if (min, max) summaries are used [21, 22]). For other schemes, locatability is not perfect, but our experiments suggest that simple heuristics can locate a person's group with good probability. For example, microaggregation [11] clusters individuals based on Euclidean distance. The vectors of non-sensitive values in each group are replaced by the centroid (i.e., average) of the vectors. The simplest heuristic for locating an individual's group is to choose the group with the closest centroid vector. In experiments on census data, this correctly located approximately 70% of individuals. In our attacks, we always assume locatability. This assumption was also made in previous studies [30, 27].

2.1 Intersection Attack

Armed with these basic definitions, we now proceed to formalize the intersection attack (Algorithm 1).

Algorithm 1: Intersection attack
 1: R_1, ..., R_n ← n independent anonymized releases
 2: P ← set of overlapping population
 3: for each individual i in P do
 4:     for j = 1 to n do
 5:         e_ij ← Get_equivalence_class(R_j, i)
 6:         s_ij ← Sensitive_value_set(e_ij)
 7:     end for
 8:     S_i ← s_i1 ∩ s_i2 ∩ ... ∩ s_in
 9: end for
10: return S_1, ..., S_|P|

Let R_1, ..., R_n be n independent anonymized releases with minimum partition sizes k_1, ..., k_n, respectively. Let P be the overlapping population occurring in all the releases. The function Get_equivalence_class returns the equivalence class into which an individual falls in a given anonymized release. The function Sensitive_value_set returns the set of (distinct) sensitive values for the members of a given equivalence class.

DEFINITION 7 (ANONYMITY). For each individual i in P, the anonymity factor promised by each release R_j is equal to the corresponding minimum partition size k_j.

However, as pointed out in [26], the actual anonymity offered is less than this ideal value: it equals the number of distinct sensitive values in the individual's equivalence class. We call this the effective anonymity.

DEFINITION 8 (EFFECTIVE ANONYMITY). For an individual i in P, the effective anonymity offered by a release R_j is equal to the number of distinct sensitive values of the partition into which the individual falls. Let e_ij be the equivalence class or partition into which i falls with respect to the release R_j, and let s_ij denote the sensitive value set for e_ij. The effective anonymity for i with respect to the release R_j is

    EA_ij = |s_ij|.

For each target individual i, EA_ij is the effective prior anonymity with respect to R_j (anonymity before the intersection attack). In the intersection attack, the list of possible sensitive values associated with the target is the intersection of all sensitive value sets s_ij, j = 1, ..., n. So the effective posterior anonymity ÊA_i for i is

    ÊA_i = |∩_{j=1,...,n} s_ij|.

The difference between the effective prior anonymity and the effective posterior anonymity quantifies the drop in effective anonymity:

    AnonDrop_i = min_{j=1,...,n} {EA_ij} − ÊA_i.

The vulnerable population (VP) is the set of individuals (among the overlapping population) for whom the intersection attack leads to a positive drop in effective anonymity:

    VP = {i ∈ P : AnonDrop_i > 0}.
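Algorithm 1 and the anonymity metrics above translate directly into code. A minimal sketch (ours), where each release is modeled as a map from an individual's id to the sensitive value set of his or her equivalence class (locatability is assumed):

```python
def intersection_attack(releases, population):
    """Algorithm 1: intersect each target's sensitive value sets.

    releases: list of dicts, one per anonymized release, mapping an
              individual's id to the sensitive value set s_ij of his or
              her equivalence class.
    """
    results = {}
    for i in population:
        s = set(releases[0][i])
        for release in releases[1:]:
            s &= set(release[i])  # S_i = s_i1 ∩ ... ∩ s_in
        results[i] = s
    return results

def anonymity_drop(releases, population):
    """AnonDrop_i = min_j EA_ij - |∩_j s_ij| for each individual i."""
    posterior = intersection_attack(releases, population)
    return {i: min(len(set(r[i])) for r in releases) - len(posterior[i])
            for i in population}

# Toy run with Alice (id 0) from Table 1:
r1 = {0: {"AIDS", "Heart Disease", "Viral Infection"}}
r2 = {0: {"AIDS", "Tuberculosis", "Flu", "Cancer"}}
print(anonymity_drop([r1, r2], [0]))  # {0: 2}: prior EA = 3, posterior = 1
```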
After performing the sensitive-value-set intersection, the adversary knows only a set of possible values that each individual's sensitive attribute can take. So the adversary deduces that, with equal probability (under the assumption that the adversary does not have any further auxiliary information), the individual's actual sensitive value is one of the values in the set ∩_{j=1,...,n} s_ij. The adversary's confidence level for an individual i can therefore be defined as follows:

DEFINITION 9 (CONFIDENCE LEVEL C_i). For each individual i, the confidence level C_i of the adversary in identifying the individual's true sensitive value through the intersection attack is defined as C_i = 1/ÊA_i.

Now, given some confidence level C, we denote by VP_C and PVP_C the set and the percentage of overlapping individuals for whom the adversary can deduce the sensitive attribute value with a confidence level of at least C:

    VP_C = {i ∈ P : C_i ≥ C},    PVP_C = |VP_C| · 100 / |P|.

3. EXPERIMENTAL RESULTS

In this section we describe our experimental study.¹ The primary goal is to quantify the severity of such an attack on existing schemes. Although earlier works address problems with k-anonymization and adversarial background knowledge, to the best of our knowledge, none of these studies deal with attacks resulting from auxiliary independent releases. Furthermore, none of the studies so far have quantified the severity of such an attack.

¹The code, parameter settings, and complete results are made available at: http://www.cse.psu.edu/~ranjit/kdd08.

3.1 Setup

We use three different partitioning-based anonymization techniques to demonstrate the intersection attack: k-anonymity, ℓ-diversity, and t-closeness. For k-anonymity, we use the Mondrian multidimensional approach proposed in [21] and the microaggregation technique proposed in [11]. For ℓ-diversity and t-closeness, we use the definitions of entropy ℓ-diversity and t-closeness proposed in [26] and [23], respectively.

We use two census-based databases from the UCI Machine Learning repository [31]. The first one is the Adult database, which has been used extensively in k-anonymity-based studies. The database was prepared in a similar manner to previous studies [21, 26] (see Table 2). The resulting database contained individual records corresponding to 30162 people. The second database is the IPUMS database, which contains individual information from the 1997 census studies. We use only a subset of the attributes, similar to the attributes present in the Adult database, to maintain uniformity and to maintain quasi-identifiers. The IPUMS database contains individual records corresponding to a total of 70187 people. This data set was prepared as explained in Table 3.

Table 2: Description of the Adult census database.

    Attribute        Domain Size   Class
    Age              74            Quasi ID
    Work Class       7             Quasi ID
    Education        16            Quasi ID
    Marital Status   7             Quasi ID
    Race             5             Quasi ID
    Gender           2             Quasi ID
    Native Country   41            Quasi ID
    Occupation       14            Sensitive

Table 3: Description of the IPUMS census database.

    Attribute        Domain Size   Class
    Age              100           Quasi ID
    Work Class       5             Quasi ID
    Education        10            Quasi ID
    Marital Status   6             Quasi ID
    Race             7             Quasi ID
    Sex              2             Quasi ID
    Birth Place      113           Quasi ID
    Occupation       247           Sensitive

From both the Adult and IPUMS databases, we generate two overlapping subsets (Subset 1 and Subset 2) by randomly sampling individuals without replacement from the total population. We fixed the overlap size to |P| = 5000.
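A sketch (ours) of how such overlapping subsets can be drawn, assuming the records sit in a pandas DataFrame; the function name, column handling, and subset sizes are illustrative, not the paper's exact preparation code:

```python
import pandas as pd

def overlapping_subsets(df: pd.DataFrame, overlap: int = 5000,
                        extra: int = 10000, seed: int = 0):
    """Draw two subsets sharing exactly `overlap` individuals.

    A common core of `overlap` records goes into both subsets; the
    remaining records are split into two disjoint pools, one per subset,
    so the intersection of the two populations is exactly the core.
    """
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle w/o replacement
    core = shuffled.iloc[:overlap]
    rest = shuffled.iloc[overlap:]
    subset1 = pd.concat([core, rest.iloc[:extra]])
    subset2 = pd.concat([core, rest.iloc[extra:2 * extra]])
    return subset1, subset2
```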
For each of the databases, the two subsets are anonymized independently, and the intersection attack is run on the anonymization results. All the experiments were run on a Pentium 4 system running Windows XP with 1 GB of RAM.

3.2 Severity of the Attack

Our first goal is to quantify the extent of damage possible through the intersection attack. For this, we consider two possible situations: (i) perfect breach and (ii) partial breach.

3.2.1 Perfect Breach

A perfect breach occurs when the adversary can deduce the exact sensitive value of an individual. In other words, a perfect breach is when the adversary has a confidence level of 100% about the individual's sensitive data. To estimate the probability of a perfect breach, we compute the percentage of the overlapping population for whom the intersection attack leads to a final sensitive value set of size 1. Figure 1 plots this result.

We consider three scenarios for anonymizing the two overlapping subsets: (i) Mondrian on both data subsets, (ii) microaggregation on both data subsets, and (iii) Mondrian on the first subset and microaggregation on the second. (k_1, k_2) represents the pair of k values used to anonymize the first and the second subset, respectively. In the experiments, we use the same k values for both subsets (k_1 = k_2). Note that, for simplicity, from now on we define confidence levels in terms of percentages.

In the case of the Adult database, we found that around 12% of the population is vulnerable to a perfect breach for k_1 = k_2 = 5. For the IPUMS database, this value is much more severe: around 60%. As the degree of anonymization increases, i.e., as the value of k increases, the percentage of vulnerable population goes down. The reason is that as k increases, the partition sizes in each subset increase; this leads to larger intersection sets and thus a lower probability of obtaining an intersection set of size 1.

3.2.2 Partial Breach

Our next experiment aims to compute a more practical quantification of the severity of the intersection attack. In most cases, to inflict a privacy breach, all the adversary needs to do is narrow the possible sensitive values down to a few values, which by itself can reveal a lot of information. For example, in a hospital discharge database, narrowing the sensitive disease/diagnosis values down to a few values, say "Flu", "Fever", or "Cold", would support the conclusion that the individual is suffering from a viral infection. In this case, the adversary's confidence level is 1/3 ≈ 33%.

Figure 2 plots the percentage of vulnerable population for whom the intersection attack leads to a partial breach for the Adult and IPUMS databases. Here, we use only the first anonymization scenario described earlier, in which both overlapping subsets of the database are anonymized using the Mondrian multidimensional technique. Observe that the severity of the attack increases alarmingly for a slight relaxation of the required confidence level. For example, in the case of the IPUMS database, around 95% of the population was vulnerable at a confidence level of 25% for k_1 = k_2 = 5. For the Adult database, although this value is not as alarming, more than 60% of the population was affected.
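The breach statistics above reduce to a few lines given the attack's output. A sketch (ours) of how PVP_C can be computed from the posterior sets:

```python
def pvp(posterior_sets, confidence=1.0):
    """Percentage of the overlap population breached at confidence >= C.

    posterior_sets: dict mapping individual -> intersected sensitive set.
    confidence = 1.0 corresponds to a perfect breach (set of size 1);
    confidence = 0.33 to a partial breach with three candidates left.
    """
    vulnerable = [i for i, s in posterior_sets.items()
                  if s and 1.0 / len(s) >= confidence]
    return 100.0 * len(vulnerable) / len(posterior_sets)

posteriors = {1: {"AIDS"}, 2: {"Flu", "Fever", "Cold"}, 3: {"Cancer", "Flu"}}
print(pvp(posteriors, 1.0))   # 33.3...%: only individual 1 is fully breached
print(pvp(posteriors, 0.33))  # 100.0%: all are breached at the 33% level
```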
3.3 Drop in Anonymity

Our next goal is to measure the drop in anonymity occurring due to the intersection attack. To achieve this, we first take a closer look at the way these schemes work. As described in the earlier sections, the basic paradigm in partitioning-based anonymization schemes is to partition the data such that each partition size is at least k. The methodology behind partitioning and then summarizing varies from scheme to scheme. The minimum partition size (k) is thus used as a measure of the anonymity offered by these solutions. However, the effective (or true) anonymity supported by these solutions is far less than the presumed anonymity k (see the discussion in Section 2.1).

Figure 3 plots the average partition sizes and the average effective anonymities for the overlapping population. Here again, we consider only the scenario where both overlapping subsets are anonymized using the Mondrian multidimensional technique. Observe that the effective anonymity is much less than the partition size for both data subsets. Also note that these techniques result in partition sizes much larger than the minimum requirement of k. For example, the average partition size observed in the IPUMS database for k = 5 is close to 40. To satisfy the k-anonymity definition, there is no need for any partition to be larger than 2k + 1: splitting a partition of size greater than 2k + 1 into two yields partitions of size at least k, and such a split only preserves more information. The culprit behind the larger average partition sizes is generalization based on user-defined hierarchies. Since generalization-based partitioning cannot be controlled at finer levels, the resulting partition sizes tend to be much larger than the minimum required value.

For each individual in the overlapping population, the effective prior anonymity is equal to the effective anonymity. We define the average effective prior anonymity with respect to a release as the effective prior anonymities averaged over the individuals in the overlapping population. Similarly, the average effective posterior anonymity is the effective posterior anonymities averaged over the individuals in the overlapping population. The difference between the average effective prior anonymity and the average effective posterior anonymity gives the average drop in effective anonymity occurring due to the intersection attack. Figure 4 plots the average effective prior anonymities and the average effective posterior anonymities for the overlapping population. Observe that the average effective posterior anonymity is much less than the average effective prior anonymity for both subsets. Also note that we measure drops in anonymity using effective anonymities instead of presumed anonymities; the situation only gets worse (drops get larger) when presumed anonymities are used.

[Figure 1: Severity of the intersection attack, perfect breach (PVP_100% versus (k_1, k_2)) for the scenarios Mondrian/Mondrian, Microaggregation/Microaggregation, and Mondrian/Microaggregation. (a) Adult database. (b) IPUMS database.]
[Figure 2: Severity of the intersection attack, partial breach (PVP_C for confidence levels C = 100%, 50%, 33%, 25%, versus (k_1, k_2)). (a) Adult database. (b) IPUMS database.]

[Figure 3: Comparison of presumed anonymity, actual partition sizes, and effective anonymity. (a) Adult database. (b) IPUMS database.]

3.4 ℓ-diversity and t-closeness

We now consider the ℓ-diversity and t-closeness extensions of the original k-anonymity definition. The goal, again, is to quantify the severity of the intersection attack by measuring the extent to which a partial breach occurs at varying adversary confidence levels. Figure 5 plots the percentage of vulnerable population for whom the intersection attack leads to a partial breach for the Adult and IPUMS databases. Here, we anonymize both subsets of the database with the same definition of privacy: we use Mondrian multidimensional k-anonymity with the additional constraints defined by ℓ-diversity and t-closeness. Figure 5(a) plots the result for ℓ-diversity, using the same ℓ value for both subsets (ℓ_1 = ℓ_2) and with k = 10. Figure 5(b) plots the same for t-closeness.

[Figure 4: Average drop in effective anonymity due to the intersection attack. (a) Adult database. (b) IPUMS database.]

Even though these extended definitions seem to perform better than the original k-anonymity definition, they still lead to a considerable breach in the case of an intersection attack. This result is fairly intuitive for ℓ-diversity. Consider its definition: the sensitive value set corresponding to each partition should be "well" (ℓ) diverse. However, there is no guarantee that the intersection of two well-diverse sets is itself well diverse; for instance, two groups whose equally frequent sensitive values are {AIDS, Flu, Cancer} and {AIDS, Tuberculosis, Hepatitis} are each entropy 3-diverse, yet their intersection is the singleton {AIDS}. t-closeness fares similarly. Also, both these definitions tend to force larger partition sizes, thus resulting in heavy information loss.

[Figure 5: Severity of the intersection attack for ℓ-diversity and t-closeness (PVP_C versus (ℓ_1, ℓ_2) and (t_1, t_2), with k = 10). (a), (b) Adult database. (c), (d) IPUMS database.]
Figure 6 plots the average partition sizes of the individuals corresponding to the overlapping population. It compares the partition sizes observed for k-anonymity, ℓ-diversity, and t-closeness. For the IPUMS database, with a value of k = 10, k-anonymity produces partitions with an average partition size of 45, while for the same k = 10 with ℓ = 5, the average partition size obtained was close to 450. The partition sizes for t-closeness get even worse: a combination of k = 10 and t = 0.4 yields partitions of average size close to 1300. We observe similar results for the Adult database.

[Figure 6: Average partition sizes for ℓ-diversity and t-closeness. (a) Adult database. (b) IPUMS database.]

3.5 Role of the Sensitive Attribute Domain

In all of the above experiments we use "Occupation" (the occupation code of the individual) as the sensitive attribute for both the Adult and IPUMS databases, as shown in Tables 2 and 3. The domain size of the Occupation attribute in the Adult database was 14, whereas the domain size in the IPUMS database was 247. One plausible reason for the attack being more severe in the case of the IPUMS database is the size of the sensitive attribute domain. Because most partition sizes are much larger than the minimum required value k, in the case of the Adult database it is possible that the sensitive value set corresponding to every partition contains all the possible values in the domain. This implies that an intersection of two sensitive value sets results in a set of size close to the size of the domain. Thus, the intersection attack may be less effective in cases where the sensitive attribute domain size is smaller than the average partition size. Intuitively, it seems that in cases where the sensitive attribute domain size is large (on the order of several hundreds), the intersection attack would be more severe. Also, most real-life databases have sensitive attributes with large domain sizes. For example, in a typical hospital discharge database, an ICD-9 code is used to describe the diagnosis given to the patient; the possible values for this code are numbers from 1 to 999 [19]. In other cases, the sensitive attribute domain sizes tend to be larger than this. The conjecture is that as the number of possible sensitive values increases, the intersection of two different sets results in a less diverse set.

In order to examine this, we constructed two new versions of the IPUMS database by replacing each individual's sensitive attribute "Occupation" with "Industry" (corresponding to the individual's work) and with "Income" (corresponding to the total income of the individual). The domain sizes corresponding to these attributes are summarized in Table 4: the domain size of the "Industry" attribute is 145, that of the original "Occupation" attribute is 247, and that of "Income" is 471.

Table 4: IPUMS database versions (the non-sensitive attributes remain the same as in the original).

    Sensitive Attribute   Domain Size   Entropy
    Occupation            247           4.30
    Industry              145           4.35
    Income                471           5.56

We ran the intersection attack on these new versions of the IPUMS database and compared the results with the original. Figure 7 plots the average drop in effective anonymity for the overlapping population. Based on our conjecture, the drop in effective anonymity should increase with the sensitive attribute domain size. Surprisingly, we did not observe the trend we were expecting.
The drop in effective anonymity in the case of "Occupation" was less than that for "Industry". It turns out that the reason is that the number of values actually occurring for a sensitive attribute need not match the domain size, i.e., the total number of possible values. So a large sensitive attribute domain size does not guarantee that the number of values actually occurring is large. Instead, a simple measure such as Shannon entropy can be used to gauge the effective number of possible values. The entropy value for each of these attributes is listed in Table 4. Although the domain size of the "Occupation" attribute is larger, its entropy is less than that of the "Industry" attribute. The revised conjecture is that as the entropy (or information content) of the sensitive attribute increases, so does the severity of the intersection attack. Our result in Figure 7 confirms this: the average drop in effective anonymity increases with the entropy of the corresponding sensitive attribute (the non-sensitive attributes are kept the same for all the data sets).
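The entropy values in Table 4 are empirical Shannon entropies of the attribute columns; a minimal sketch (ours) of how such a value is computed, with an illustrative skewed column:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Empirical Shannon entropy (natural log) of an attribute column.

    e^entropy can be read as the "effective" number of distinct values,
    which is what matters for the intersection attack, rather than the
    raw domain size.
    """
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A skewed attribute: several distinct values, small effective support.
column = ["clerk"] * 90 + ["nurse"] * 9 + ["judge"] * 1
print(round(shannon_entropy(column), 3))            # ≈ 0.358
print(round(math.exp(shannon_entropy(column)), 2))  # effective values ≈ 1.43
```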
3.6 Number of Databases

In the above experiments we considered the scenario in which two anonymized releases contain information about an overlapping population. As data publishing becomes more prevalent among organizations that would like to share data for research and collaborative purposes, the number of available anonymized releases containing information about the same subset of people may well exceed two. The adversary could use as many anonymized releases as possible to gather information about a target population, and use the intersection attack to deduce the sensitive attribute values. In such a scenario, it is interesting to see how the intersection attack performs in the presence of multiple (more than 2) overlapping anonymized releases. We first consider the percentage of vulnerable population with a confidence level of 100% (PVP_100%). Figure 8(a) plots this for a varying number (n = 2, 3, 4) of anonymized releases available to the adversary. Here again, we build n overlapping subsets of the IPUMS database by fixing the overlapping population at 5000. It can be observed that the severity of the intersection attack increases with the number of anonymized releases available to the adversary. There is a significant increase in the percentage of vulnerable population as n increases, for small values of k. However, there is no such significant increase for larger values of k. The reason is that the partition sizes for larger values of k tend to be large enough that additional anonymized releases no longer help the intersection attack. As an alternative to the severity of the attack, we can study the effect of the number of anonymized releases on the drop in effective anonymity. Figure 8(b) plots the average drop in effective anonymity for a varying number (n = 2, 3, 4) of anonymized releases. Here again we observe that the drop in effective anonymity increases with the number of anonymized releases. These results indicate that if the anonymized releases correspond to fairly large values of k, the adversary gains only limited information by collecting additional releases.

[Figure 7: Effect of the sensitive attribute domain: average drop in effective anonymity for Occupation (domain size 247, entropy 4.3), Industry (145, 4.35), and Income (471, 5.56), IPUMS database.]

4. DIFFERENTIAL PRIVACY

In this section we give a precise formulation of "resistance to arbitrary side information" and show that several relaxations of differential privacy imply it. The formulation follows ideas originally due to Dwork and McSherry, stated implicitly in [12]. This is, to our knowledge, the first place such a formulation appears explicitly. The proof that relaxed definitions (and hence the schemes of [13, 29, 25]) satisfy the Bayesian formulation is new. These results are explained in greater detail in a separate technical report [20]; here we reproduce the relevant parts from [20].

We represent databases as vectors in D^n for some domain D (for example, in the case of the relational databases above, D is the product of the attribute domains). There is no distinction between "sensitive" and "insensitive" information. Given a randomized algorithm A, we let A(D) be the random variable (or, probability distribution on outputs) corresponding to input D.

DEFINITION 10 (DIFFERENTIAL PRIVACY). A randomized algorithm A is ε-differentially private if for all databases D1, D2 ∈ D^n that differ in one individual, and for all subsets S of outputs,

    Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S].

This definition states that changing a single individual's data in the database leads to a small change in the distribution on outputs.
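The canonical example of an ε-differentially private algorithm is the Laplace mechanism of [14], which adds noise calibrated to a query's sensitivity. A minimal sketch (ours) for a count query, whose sensitivity is 1:

```python
import random

def laplace_count(db, predicate, epsilon):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so this noise suffices for epsilon-DP: the output
    density changes by a factor of at most e^epsilon between
    neighboring databases.
    """
    true_count = sum(1 for row in db if predicate(row))
    # Difference of two iid Exp(epsilon) draws is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Example: noisy count of AIDS patients at epsilon = 0.1.
db = [("13012", 28, "AIDS"), ("13015", 45, "Flu"), ("13020", 33, "AIDS")]
print(laplace_count(db, lambda r: r[2] == "AIDS", epsilon=0.1))
```

Because the guarantee is a property of the algorithm's output distribution alone, it holds no matter what other releases exist; this is the root of the composition behavior discussed in this section.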
Unlike more standard measures of distance such as total variation (also called statistical difference) or Kullback-Leibler divergence, the metric here is multiplicative, and so even very unlikely events must have approximately the same probability under the distributions A(D1) and A(D2). This condition was relaxed somewhat in other papers [10, 15, 6, 13, 8, 29, 25]. The schemes in all those papers, however, satisfy the following relaxation [13]:

DEFINITION 11. A randomized algorithm A is (ε, δ)-differentially private if for all databases D1, D2 ∈ D^n that differ in one individual, and for all subsets S of outputs,

    Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S] + δ.

The relaxations used in [15, 6, 25] are in fact stronger (i.e., less relaxed) than Definition 11. One consequence of the results below is that all the definitions are equivalent up to polynomial changes in the parameters, and so, given the space constraints, we work only with the simplest notion.²

4.1 Semantics of Differential Privacy

There is a crisp, semantically flavored interpretation of differential privacy, due to Dwork and McSherry and explained in [12]: Regardless of external knowledge, an adversary with access to the sanitized database draws the same conclusions whether or not my data is included in the original data. (The use of the term "semantic" for such definitions dates back to semantic security of encryption [18].)

We require a mathematical formulation of "arbitrary external knowledge" and of "drawing conclusions". The first is captured via a prior probability distribution b on D^n (b is a mnemonic for "beliefs"). Conclusions are modeled by the corresponding posterior distribution: given a transcript t, the adversary updates his belief about the database D using Bayes' rule to obtain a posterior b̂:

    b̂[D | t] = Pr[A(D) = t] · b[D] / Σ_{D'} Pr[A(D') = t] · b[D'].    (1)

In an interactive scheme, the definition of A depends on the adversary's choices; for simplicity we omit the dependence on the adversary in the notation. Also, for simplicity, we discuss only discrete probability distributions. Our results extend directly to the interactive, continuous case.

For a database D, define D_{−i} to be the vector obtained by replacing position i by some default value in D (any value in D will do). This corresponds to "removing" person i's data. We consider n + 1 related scenarios ("games", in the language of cryptography), numbered 0 through n. In Game 0, the adversary interacts with A(D). This is the interaction that takes place in the real world. In Game i (for 1 ≤ i ≤ n), the adversary interacts with A(D_{−i}). Game i describes the hypothetical scenario where person i's data is not included.

For a particular belief distribution b and transcript t, we consider the n + 1 corresponding posterior distributions b̂_0, ..., b̂_n. The posterior b̂_0 is the same as b̂ (defined in Eq. (1)). For larger i, the i-th posterior distribution b̂_i represents the conclusions drawn in Game i, that is,

    b̂_i[D | t] = Pr[A(D_{−i}) = t] · b[D] / Σ_{D'} Pr[A(D'_{−i}) = t] · b[D'].

Given a particular transcript t, privacy has been breached if there exists an index i such that the adversary would draw different conclusions depending on whether or not i's data was used. It turns out that the exact measure of "different" here does not matter much.
We choose the weakest notion that applies, namely statistical difference. If P and Q are probability measures on the set X, the statistical difference between P and Q is defined as

    SD(P, Q) = max_{S ⊆ X} |P[S] − Q[S]|.

²That said, some of the other relaxations, such as the probabilistic differential privacy of [25], could lead to better parameters in Theorem 15.

[Figure 8: Effect of the number of anonymized releases (n = 2, 3, 4), IPUMS database. (a) Percentage of vulnerable population (PVP_100%). (b) Average drop in effective anonymity.]

DEFINITION 12. An algorithm A is ε-semantically private if for all prior distributions b on D^n, for all databases D ∈ D^n, for all possible transcripts t, and for all i = 1, ..., n,

    SD(b̂_0[D | t], b̂_i[D | t]) ≤ ε.

This can be relaxed to allow a probability δ of failure.

DEFINITION 13. An algorithm is (ε, δ)-semantically private if, for all prior distributions b, with probability at least 1 − δ over pairs (D, t), where the database D ← b (D is drawn according to b) and the transcript t ← A(D) (t is drawn according to A(D)), for all i = 1, ..., n:

    SD(b̂_0[D | t], b̂_i[D | t]) ≤ ε.

Dwork and McSherry proposed the notion of semantic privacy informally and observed that it is equivalent to differential privacy.

PROPOSITION 14 (DWORK-MCSHERRY). ε-differential privacy implies ε̂-semantic privacy, where ε̂ = e^ε − 1.

We show that this implication holds much more generally:

THEOREM 15 (MAIN RESULT). (ε, δ)-differential privacy implies (ε′, δ′)-semantic privacy, where ε′ = e^{3ε} − 1 + 2√δ and δ′ = O(n√δ).

Theorem 15 states that the relaxed notions of differential privacy used in some previous work still imply privacy in the face of arbitrary side information. This is not the case for all possible relaxations, even very natural ones. For example, if one replaced the multiplicative notion of distance used in differential privacy with total variation distance, then the following "sanitizer" would be deemed private: choose an index i ∈ {1, ..., n} uniformly at random and publish the entire record of individual i together with his or her identity (Example 2 in [14]). Such a "sanitizer" would not be meaningful at all, regardless of side information.

Finally, the techniques used to prove Theorem 15 can also be used to analyze schemes which do not provide privacy for all pairs of neighboring databases D1 and D2, but rather only for most such pairs (neighboring databases are ones that differ in one individual). Specifically, it is sufficient that the databases where the "indistinguishability" condition fails occur with small probability.

DEFINITION 16 ((ε, δ)-INDISTINGUISHABILITY). Two random variables X, Y taking values in a set X are (ε, δ)-indistinguishable if for all sets S ⊆ X,

    Pr[X ∈ S] ≤ e^ε · Pr[Y ∈ S] + δ  and  Pr[Y ∈ S] ≤ e^ε · Pr[X ∈ S] + δ.
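To make Definitions 12 and 13 concrete, here is a toy sketch (ours, not from [20]): the mechanism is bitwise randomized response over databases in {0,1}^n, the prior is uniform, and we compare the Game 0 and Game i posteriors by statistical difference.

```python
import itertools
import math

eps = 0.5
p = math.exp(eps) / (1 + math.exp(eps))  # prob. each bit is reported truthfully

def pr_transcript(db, t):
    """Pr[A(db) = t] for bitwise randomized response (eps-DP)."""
    return math.prod(p if b == o else 1 - p for b, o in zip(db, t))

def posterior(t, i=None, n=3):
    """Adversary's posterior over databases given transcript t (Eq. (1)).

    i = None is Game 0 (the real mechanism); otherwise position i is
    replaced by the default value 0 before the mechanism runs (Game i).
    """
    def mech_input(db):
        return db if i is None else db[:i] + (0,) + db[i + 1:]
    prior = {db: 1 / 2 ** n for db in itertools.product((0, 1), repeat=n)}
    joint = {db: pr_transcript(mech_input(db), t) * prior[db] for db in prior}
    z = sum(joint.values())
    return {db: w / z for db, w in joint.items()}

def statistical_difference(P, Q):
    """SD(P, Q) = max_S |P[S] - Q[S]| = (1/2) * sum |P - Q|."""
    return 0.5 * sum(abs(P[x] - Q[x]) for x in P)

t = (1, 0, 1)
b0, b1 = posterior(t), posterior(t, i=0)
print(statistical_difference(b0, b1))  # small, as Proposition 14 guarantees
```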
THEOREM 17. Let A be a randomized algorithm. Let E = {D1 ∈ D^n : for all neighbors D2 of D1, A(D1) and A(D2) are (ε, δ)-indistinguishable}. Then A satisfies (ε′, δ′)-semantic privacy for any prior distribution b such that b[E] = Pr_{D3←b}[D3 ∈ E] ≥ 1 − δ, with ε′ = e^{3ε} − 1 + 2√δ and δ′ = O(n√δ).

4.2 Proof Sketch for the Main Results

The complete proofs are given in [20]; here we sketch the main ideas behind both proofs. Let Y | X = a denote the conditional distribution of Y given that X = a, for jointly distributed random variables X and Y. The following lemma (proof omitted) plays an important role in our proofs.

LEMMA 18 (MAIN LEMMA). Suppose two pairs of random variables (X, A(X)) and (Y, A′(Y)) are (ε, δ)-indistinguishable (for some randomized algorithms A and A′). Then with probability at least 1 − δ″ over t ← A(X) (equivalently, t ← A′(Y)), the random variables X | A(X) = t and Y | A′(Y) = t are (ε̂, δ̂)-indistinguishable with ε̂ = 3ε, δ̂ = 2√δ, and δ″ = √δ + 2δe^ε = O(√δ).

Let A be a randomized algorithm (in the setting of Theorem 15, A is an (ε, δ)-differentially private algorithm). Let b be a belief distribution (in the setting of Theorem 17, b is a belief with b[E] ≥ 1 − δ). The main idea behind both proofs is to use Lemma 18 to show that with probability at least 1 − O(√δ) over pairs (D, t), where D ← b and t ← A(D),

    SD(b | A(D) = t, b | A(D_{−i}) = t) ≤ ε′.

Taking a union bound over all coordinates i implies that with probability at least 1 − O(n√δ) over pairs (D, t), where D ← b and t ← A(D), for all i = 1, ..., n, we have SD(b | A(D) = t, b | A(D_{−i}) = t) ≤ ε′. For Theorem 17, this shows that A satisfies (ε′, δ′)-semantic privacy for b. In the Theorem 15 setting, where A is (ε, δ)-differentially private and b is arbitrary, it shows that (ε, δ)-differential privacy implies (ε′, δ′)-semantic privacy.

5. CONCLUDING REMARKS

In this paper we explored how one can reason about privacy in the presence of independent anonymized releases of overlapping populations. Our experimental study indicates that several currently proposed partition-based anonymization schemes, including k-anonymity and its variants, are vulnerable to composition attacks. On the positive side, we gave a precise formulation of the property "resistance to arbitrary side information" and showed that several relaxations of differential privacy satisfy it.

The most striking question that arises from this work is whether randomness in the anonymization algorithm is necessary to resist complex side information such as independent releases. Another interesting direction would be to study other settings where composition attacks are realistic and effective. A natural candidate for future investigation is the release of overlapping contingency tables, often considered in the statistical literature.

6. REFERENCES

[1] Special issue on disclosure limitation methods for protecting the confidentiality of statistical data, 1998.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450. ACM Press, 2000.
[3] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193–204. IEEE Computer Society, 2005.
[4] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar.
5. CONCLUDING REMARKS
In this paper we explored how one can reason about privacy in the presence of independent anonymized releases of overlapping populations. Our experimental study indicates that several currently proposed partition-based anonymization schemes, including k-anonymity and its variants, are vulnerable to composition attacks. On the positive side, we gave a precise formulation of the property "resistance to arbitrary side information" and showed that several relaxations of differential privacy satisfy it.

The most striking question that arises from this work is whether randomness in the anonymization algorithm is necessary to resist complex side information such as independent releases. Another interesting direction is to study other settings where composition attacks are realistic and effective. A natural candidate for future investigation is the release of overlapping contingency tables, which are often considered in the statistical literature.

6. REFERENCES
[1] Special issue on disclosure limitation methods for protecting the confidentiality of statistical data, 1998.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450. ACM Press, 2000.
[3] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193–204. IEEE Computer Society, 2005.
[4] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, pages 273–282. ACM Press, 2007.
[5] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, Aug. 2006.
[6] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128–138. ACM Press, 2005.
[7] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In Secure Data Management, pages 48–63. Springer, 2006.
[8] K. Chaudhuri and N. Mishra. When random sampling preserves privacy. In CRYPTO, pages 198–213, 2006.
[9] B.-C. Chen, R. Ramakrishnan, and K. LeFevre. Privacy skyline: Privacy with multidimensional adversarial knowledge. In VLDB, pages 770–781. VLDB Endowment, 2007.
[10] I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS, pages 202–210. ACM Press, 2003.
[11] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1):189–201, 2002.
[12] C. Dwork. Differential privacy. In ICALP, pages 1–12. Springer, 2006.
[13] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. Springer, 2006.
[14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006.
[15] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO, pages 528–544. Springer, 2004.
[16] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD, pages 217–228. ACM Press, 2002.
[17] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222. ACM Press, 2003.
[18] S. Goldwasser and S. Micali. Probabilistic encryption. Journal of Computer and System Sciences, 28(2):270–299, 1984.
[19] International classification of diseases, http://www.cdc.gov/nchs/about/otheract/icd9/abticd9.htm.
[20] S. P. Kasiviswanathan and A. Smith. A note on differential privacy: Defining resistance to arbitrary side information. CoRR, arXiv:0803.3946 [cs.CR], 2008.
[21] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE, page 25. IEEE Computer Society, 2006.
[22] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Workload-aware anonymization. In KDD, pages 277–286. ACM Press, 2006.
[23] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, pages 106–115. IEEE Computer Society, 2007.
[24] Y. Lindell. Composition of Secure Multi-Party Protocols: A Comprehensive Study. Springer-Verlag, 2003.
[25] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber. Privacy: From theory to practice on the map. In ICDE, 2008.
[26] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.
[27] D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Y. Halpern. Worst-case background knowledge for privacy-preserving data publishing. In ICDE, pages 126–135. IEEE Computer Society, 2007.
[28] F. McSherry and K. Talwar. Differential privacy in mechanism design. In FOCS, pages 94–103. IEEE Computer Society, 2007.
[29] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84. ACM Press, 2007.
[30] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, 2002.
[31] UCI machine learning repository, http://www.ics.uci.edu/~mlearn/databases/.
[32] A. van den Hout and P. van der Heijden. Randomized response, statistical disclosure control and misclassification: A review. International Statistical Review, 70:269–288, 2002.
[33] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In KDD, pages 414–423. ACM Press, 2006.
[34] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[35] R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving data publishing. In VLDB, pages 543–554. VLDB Endowment, 2007.
[36] X. Xiao and Y. Tao. m-invariance: towards privacy preserving re-publication of dynamic datasets. In SIGMOD, pages 689–700. ACM Press, 2007.
[37] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, pages 910–921. VLDB Endowment, 2005.
