Anonymizing Unstructured Data


Authors: Rajeev Motwani (Stanford University, Computer Science), Shubha U. Nabar (Stanford University, Computer Science)

Anonymizing Unstructured Data

Rajeev Motwani, Department of Computer Science, Stanford University, Stanford, CA, USA (rajeev@cs.stanford.edu)
Shubha U. Nabar, Department of Computer Science, Stanford University, Stanford, CA, USA (sunabar@cs.stanford.edu)

ABSTRACT

In this paper we consider the problem of anonymizing datasets in which each individual is associated with a set of items that constitute private information about the individual. Illustrative datasets include market-basket datasets and search engine query logs. We formalize the notion of k-anonymity for set-valued data as a variant of the k-anonymity model for traditional relational datasets. We define an optimization problem that arises from this definition of anonymity and provide O(k log k)- and O(1)-approximation algorithms for the same. We demonstrate the applicability of our algorithms to the America Online query log dataset.

1. INTRODUCTION

Consider a dataset containing detailed information about the private actions of individuals, e.g., a market-basket dataset or a dataset of search engine query logs. Market-basket datasets contain information about items bought by individuals, and search engine query logs contain detailed information about the queries posed by users and the results that were clicked on. There is often a need to publish such data for research purposes. Market-basket data, for instance, could be used for association rule mining and for the design and testing of recommendation systems. Query logs could be used to study patterns of query refinement, develop algorithms for query suggestion and improve the overall quality of search. The publication of such data, however, poses a challenge as far as the privacy of individual users is concerned.
Even after removing all personal characteristics of individuals, such as actual usernames and IP addresses, the publication of such data is still subject to privacy attacks from attackers with partial knowledge of the private actions of individuals. Our work in this paper is motivated by two such recent data releases and the privacy attacks on them.

In August of 2006, America Online (AOL) released a large portion of its search engine query logs for research purposes. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL ran a simplistic anonymization procedure wherein every username was replaced by a random identifier. Despite this basic protective measure, the New York Times [6] demonstrated how the queries themselves could essentially reveal the identities of users. For example, user 4417749 revealed herself to be a resident of Gwinnett County in Lilburn, GA, by querying for businesses and services in the area. She further revealed her last name by querying for relatives. There were only 14 citizens with her last name in Gwinnett County, and the user was quickly revealed to be Thelma Arnold, a 62-year-old woman living in Georgia. From this point on, researchers at the New York Times could look at all of the queries posed by Ms. Arnold over the 3-month period. The publication of the query log data thus constituted a very serious privacy breach.

In October of 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system. As a part of the contest, Netflix publicly released a dataset containing 100 million movie ratings created by 500,000 Netflix subscribers over a period of 6 years. Once again, a simplistic anonymization procedure of replacing usernames with random identifiers was used prior to the release.
Nevertheless, it was shown that 84% of the subscribers could be uniquely identified by an attacker who knew 6 out of 8 movies that the subscriber had rated outside of the top 500 [19].

The commonality between the AOL and Netflix datasets is that each individual's data is essentially a set of items. Further, this set of items is both identifying of the individual and private information about the individual, and partial knowledge of this set of items is used in the privacy attack. In the case of the Netflix data (representative of market-basket data), for instance, it is the set of movies that a subscriber rated, and in the case of the AOL data, it is the set of queries that a user posed, also called the user session.

Motivated by these examples, as well as by the very real need for releasing such datasets for research purposes, we propose a notion of anonymity for set-valued data in this paper. Informally, a dataset is said to be k-anonymous if every individual's "set of items" is identical to those of at least k − 1 other individuals. So a user in the Netflix dataset would be k-anonymous if at least k − 1 other users rated exactly the same set of movies; a user in the AOL query logs would be k-anonymous if at least k − 1 other users posed exactly the same set of queries.

One simple way to achieve k-anonymity for a dataset would be to remove every item from every user's set, or to add every item from the universe of items to every single set. Naturally, this would radically distort the dataset, rendering it useless for analyses. So instead, to provide greater utility than such a simplistic scheme, we seek to make the minimal number of changes possible to the dataset in order to achieve the anonymity requirements. We provide O(k log k)- and O(1)-approximation algorithms for this optimization problem.
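Checking this informal definition mechanically is straightforward: a dataset is k-anonymous exactly when every set occurs at least k times. The following sketch (ours, not from the paper) does this count, using the small dataset that the paper later presents in Figure 1 as a sanity check:

```python
from collections import Counter

def is_k_anonymous(records, k):
    """A list of item-sets is k-anonymous iff every set is identical
    to at least k-1 other sets, i.e., every set occurs >= k times."""
    counts = Counter(frozenset(r) for r in records)
    return all(counts[frozenset(r)] >= k for r in records)

# Original dataset and its 2-anonymous transformation (Figure 1).
original = [{'e1','e2','e3'}, {'e1','e2'}, {'e1','e3'},
            {'e4','e5','e6'}, {'e4','e5'}]
anonymized = [{'e1','e2','e3'}, {'e1','e2','e3'}, {'e1','e2','e3'},
              {'e4','e5'}, {'e4','e5'}]
```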
Further, we demonstrate how these algorithms can be scaled for application to massive modern-day datasets such as the AOL query logs. To summarize our contributions:

• We define the notion of k-anonymity for set-valued data and introduce an optimization problem for minimally achieving k-anonymity in Section 3.
• We provide algorithms with approximation factors of O(k log k) and O(1) for the optimization problem in Section 4.
• In Section 5, we demonstrate how our algorithms can be scaled for application to massive datasets and experiment on the AOL logs.

Before proceeding further, note that the illustrative datasets used as motivating examples above also contain further user information: time stamp information for when a rating was given and the actual rating itself in the Netflix data; time stamp information for when a query was posed and the query result that was clicked on in the AOL data. However, for the purposes of this paper, we ignore these other attributes of the dataset and discuss how they could potentially be dealt with in Section 5.5. Indeed, the privacy attacks mentioned above did not involve knowledge of these other attributes, and therefore the anonymization problem on even just the reduced set of attributes is important to study. We will next briefly review related work, where we distinguish our problem from the traditional k-anonymity problem that has been studied for relational datasets.

2. RELATED WORK

There has been considerable prior work on anonymizing traditional relational datasets such as medical records. The most widely studied anonymity definitions for such datasets are k-anonymity [3, 18, 20, 23, 15] and its variants, l-diversity [17] and t-closeness [16]. In all these definitions, certain public attributes of the dataset are initially determined to be "quasi-identifiers".
For instance, in a dataset of medical records, attributes such as Date-of-Birth, Gender and Zip code would qualify as quasi-identifiers, since in combination they can be used to uniquely identify 87% of the U.S. population [23]. A dataset is then said to be k-anonymous if every record in the dataset is identical to at least k − 1 other records on its quasi-identifying attribute values. The idea is that privacy is achieved if every individual is hidden in a crowd of size at least k. Anonymization algorithms achieve the k-anonymity requirement by suppressing and generalizing the quasi-identifying attribute values of records. A trivial way to achieve k-anonymity would be to simply suppress every single attribute value in the dataset, but this would completely destroy the utility of the dataset. Instead, in order to preserve utility, the algorithms attempt to achieve the anonymity requirement with a minimum number of suppressions and generalizations.

The kinds of datasets that we consider in this paper differ from traditional relational datasets in two ways. First, each database record in our scenario essentially corresponds to a set of items. The database records could thus be of variable length and high dimensionality. Further, there is no longer a clear distinction between private attributes and quasi-identifiers. A user's queries are both private information about the user as well as identifying of the user himself. Similarly, in the case of market-basket data, the set of items bought by an individual is private information about the individual and at the same time can be used to identify the individual. Our definition of anonymity and our anonymization algorithms are applicable for such set-valued data.

In [24] the authors study the problem of anonymizing market-basket data.
They propose a notion of anonymity similar to k-anonymity where a limit is placed on the number of private items of any individual that could be known to an attacker beforehand. The authors provide generalization algorithms to achieve the anonymity requirements. For example, an item 'milk' in a user's basket may be generalized to 'dairy product' in order to protect it. In contrast, the techniques we propose consider additions and deletions to the dataset instead of generalizations. Further, we demonstrate the applicability of our algorithms to search engine query log data as well, where there is no obvious underlying hierarchy that can be used to generalize queries.

Our O(1)-approximation algorithm is derived by reducing the anonymization problem to a clustering problem. Clustering techniques for achieving anonymity have also been studied in [2]; however, there the authors seek to minimize the maximum radius of the clustering, whereas we wish to minimize the sum of the Hamming distances of points to their cluster centers.

In [25] the authors propose the notion of (h, k, p)-coherence for anonymizing transactional data. Here once again there is a division of items into public and private items. The goal of the anonymization is to ensure that for any set of p public items, either no transaction contains this set, or at least k transactions contain it, and no more than h percent of these transactions contain a common private item. The authors consider the minimal number of suppressions required to achieve these anonymity goals; however, no theoretical guarantees are given.

Besides the k-anonymization based techniques, there has also been considerable work on anonymizing datasets by the addition of noise or perturbation [4, 9, 5]. We do not consider perturbation-based approaches in this paper.
With regards to search engine query logs, there has been work on identifying privacy attacks both on users [14] as well as on companies whose websites appear in query results and get clicked on [21]. We do not consider the latter kind of privacy attack in this paper. [14] considers an anonymization procedure wherein keywords in queries are replaced by secure hashes. The authors show that such a procedure is susceptible to statistical attacks on the hashed keywords, leading to privacy breaches. There has also been work on defending against privacy attacks on users in [1]. This line of work considers heuristics such as the removal of infrequent queries and develops methods to apply such techniques on the fly as new queries are posed. In contrast, we consider a static scenario wherein a search engine would like to publicly release an existing set of query logs.

3. DEFINITIONS

Let D = {S_1, ..., S_n} be a dataset containing n records. Each record S_i is a set of items. Formally, S_i is a non-empty subset of a universe of items, U = {e_1, e_2, ..., e_m}. We can then define an anonymous dataset as follows.

Definition 1. (k-Anonymity for Set-Valued Data) We say that D is k-anonymous if every record S_i ∈ D is identical to at least k − 1 other records.

  ID | Contents                 ID | Contents
  S1 | {e1, e2, e3}             S1 | {e1, e2, e3}
  S2 | {e1, e2}                 S2 | {e1, e2, e3}
  S3 | {e1, e3}                 S3 | {e1, e2, e3}
  S4 | {e4, e5, e6}             S4 | {e4, e5}
  S5 | {e4, e5}                 S5 | {e4, e5}
  (a) Original Dataset          (b) 2-Anonymous Transformation

Figure 1: 2-Anonymization

Given this definition, we can now define an optimization problem that asks for the minimum number of transformations to be made to a dataset to obtain an anonymized dataset.

Definition 2. (The k-Anonymization Problem for Set-Valued Data) Given a dataset D = {S_1, . . .
, S_n}, find the minimum number of items that need to be added to or deleted from the sets S_1, ..., S_n to ensure that the resulting dataset D′ is k-anonymous.

We illustrate the k-anonymization problem with an example.

Example 1. Consider the dataset in Figure 1(a). The dataset in Figure 1(b) represents a 2-anonymous transformation that is obtained by making 2 additions and 1 deletion. The items e_3 and e_2 are added to records S_2 and S_3 respectively, while the item e_6 is deleted from record S_4. The resulting dataset consists of two 2-anonymous groups: {S_1, S_2, S_3} and {S_4, S_5}.

As a more concrete example, in the case of market-basket data, the dataset consists of records where each record is a basket of items purchased by an individual. The k-anonymization problem then is to add or delete items to individuals' baskets so that every basket is identical to at least k − 1 other baskets.

In the case of search engine query logs, the records correspond to user sessions. Instead of treating each user session as a set of queries, we consider a relaxed problem and treat each user session as a set of query terms or keywords. See Section 5 for the details. The k-anonymization problem then becomes one of adding or deleting keywords to or from user sessions to ensure that each user session becomes identical to at least k − 1 other user sessions. Since no two user sessions are likely to be similar on all the queries, we consider a slightly modified problem in our experiments. Each user session is first separated into "topic-based" threads, and our goal becomes one of anonymizing these threads instead of the original sessions. The result is an increase in the utility of the released dataset. Again, Section 5 elaborates on the details.
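The cost in Definition 2 is simply the total number of per-record additions and deletions. A short sketch (ours; records are matched by position) confirms the "2 additions and 1 deletion" claimed for the transformation in Example 1:

```python
def anonymization_cost(original, anonymized):
    """Count item additions and deletions needed to turn each original
    record (a set) into its anonymized counterpart."""
    additions = sum(len(b - a) for a, b in zip(original, anonymized))
    deletions = sum(len(a - b) for a, b in zip(original, anonymized))
    return additions, deletions

# Figure 1(a) and its 2-anonymous transformation, Figure 1(b).
original = [{'e1','e2','e3'}, {'e1','e2'}, {'e1','e3'},
            {'e4','e5','e6'}, {'e4','e5'}]
anonymized = [{'e1','e2','e3'}, {'e1','e2','e3'}, {'e1','e2','e3'},
              {'e4','e5'}, {'e4','e5'}]
```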
More generally, the dataset can be thought of as a bipartite graph, with sets (user sessions/baskets/individuals) represented as nodes on the left-hand side and items of the universe (keywords searched for/items purchased/movies rated) as nodes on the right-hand side. The k-anonymization problem then is to add or delete edges in the bipartite graph so that every node on the left-hand side is identical to at least k − 1 other nodes.

  ID | e1 e2 e3 e4 e5 e6
  S1 |  1  1  1  0  0  0
  S2 |  1  1  0  0  0  0
  S3 |  1  0  1  0  0  0
  S4 |  0  0  0  1  1  1
  S5 |  0  0  0  1  1  0

Figure 2: Dataset from Figure 1(a) as a relational dataset

Depending on the application, it may make sense to restrict the set of permissible operations to only additions or only deletions; however, in this paper we consider the most general version of the problem, which permits both.

4. APPROXIMATION ALGORITHMS

Given these definitions, we are now ready to devise algorithms for optimally achieving k-anonymity. We first draw connections between the k-anonymization problem for set-valued data and other optimization problems that have previously been studied in the literature, namely, the suppression-based k-anonymization problem for relational data and the load-balanced facility location problem. The reductions to these problems automatically give us the approximation algorithms we desire. In what follows we do not describe the algorithms themselves, only the reductions. The algorithms can be found in [18, 3, 20, 10, 13, 22].

A natural question that arises is whether traditional k-anonymity algorithms that involve suppressions and generalizations can be used for the k-anonymization problem for set-valued data as defined in Section 3. To this end, we first translate the set-valued dataset to a traditional relational dataset.

Transforming D to R_D

A dataset D = {S_1, . . .
, S_n} can be transformed to a traditional relational dataset R_D by creating a binary attribute for every item e_i in the universe and a tuple for every set S_i. Each tuple will then be a vector in {0, 1}^m. The 1's correspond to items in the universe that a set contains, and the 0's correspond to those that it does not.¹ For example, the dataset from Figure 1(a) translates to the dataset in Figure 2. The k-anonymization problem over D now translates to the following problem over R_D:

Definition 3. (k-Anonymization via Flips) Given a dataset R_D over a binary alphabet {0, 1}, flip as few 0's to 1's and 1's to 0's in R_D as possible so that every tuple is identical to at least k − 1 other tuples.

It is trivial to see that there is a one-to-one correspondence between feasible solutions for the k-anonymization problem over D and the flip-based k-anonymization problem over R_D.

Proposition 1. Any feasible solution, S_flip, to the flip-based k-anonymization problem over R_D can be converted to a feasible solution, S_±, of the same cost for the k-anonymization problem over D, and vice versa.

Proof Sketch. For every 0 that is flipped to a 1 in S_flip, simply add the corresponding item to the corresponding set in S_±, and for every 1 that is flipped to a 0, delete the item from the set.

Now the flip-based k-anonymization problem can be solved using suppression-based k-anonymization techniques for traditional relational datasets studied in [18, 3, 20].

¹ Note that at no point do our approximation algorithms ever explicitly construct these bit vectors. Rather, they operate directly on the set representations of the tuples, computing intersections of pairs of sets. The algorithms therefore scale with the maximum set size rather than m. The bit vector representations have only been used here for ease of exposition.
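The D-to-R_D translation itself is a one-liner over a fixed item order. The sketch below (ours, and purely for exposition: as the footnote stresses, the paper's algorithms never materialize these bit vectors) reproduces the matrix of Figure 2 from the sets of Figure 1(a):

```python
def to_relational(records, universe):
    """Render set-valued records as 0/1 tuples over a fixed item order,
    mirroring the D -> R_D translation."""
    order = sorted(universe)
    return [[1 if e in s else 0 for e in order] for s in records]

D = [{'e1','e2','e3'}, {'e1','e2'}, {'e1','e3'},
     {'e4','e5','e6'}, {'e4','e5'}]
U = {'e1','e2','e3','e4','e5','e6'}
R_D = to_relational(D, U)
```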
The problem studied in these works essentially boils down to the following.

Definition 4. (k-Anonymization via Suppressions) Given a dataset R_D over a binary alphabet {0, 1}, what is the minimum number of 0's and 1's in R_D that need to be converted to *'s to ensure that every tuple is identical to at least k − 1 other tuples?

Now it is easy to see that the following holds.

Proposition 2. Any feasible solution S* to the suppression-based k-anonymization problem can be converted to a feasible flip-based solution S_flip using Algorithm 1.

Algorithm 1 Converting S* to S_flip
 1: // input: R_D, S*
 2: for every k-anonymous group of tuples G in S* do
 3:   for every column C do
 4:     // C_G = C values for rows in G in R_D
 5:     if number of 1's in C_G > number of 0's then
 6:       flip the 0's in C_G to 1's
 7:     else
 8:       flip the 1's in C_G to 0's
 9:     end if
10:   end for
11: end for

The algorithm essentially takes every k-anonymous group of tuples in S*. Then, for any column in the group that is suppressed (*ed out), it replaces the column for that group entirely with 1's or entirely with 0's, depending on which action would involve the fewer number of flips in the original dataset R_D.

Example 2. Figure 3 shows an example of an original dataset, a 2-anonymous dataset S* obtained via suppressions, and a flip-based 2-anonymous dataset S_flip obtained by applying Algorithm 1 to S*. In both solutions, the two 2-anonymous groups are {S_1, S_4, S_5} and {S_2, S_3, S_6}.

Now we can show the following about Algorithm 1.

Theorem 1. For a given dataset R_D, let the cost of a feasible solution S* to the suppression-based k-anonymization problem be within a factor α of the cost of the optimal solution. Then the cost of S_flip obtained by applying Algorithm 1 to S* is within a factor of O(kα) of the cost of the optimal solution for the flip-based k-anonymization problem.
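Since every column of a group either agrees on all rows or is suppressed, Algorithm 1 amounts to replacing each group with its column-wise majority row. One executable reading of it (ours; ties go to 0, matching the else-branch of the pseudocode) is:

```python
def suppressions_to_flips(groups):
    """For each k-anonymous group of original 0/1 rows, set every
    column to the group's majority value, as in Algorithm 1."""
    out = []
    for rows in groups:
        m = len(rows[0])
        center = [1 if sum(r[c] for r in rows) > len(rows) / 2 else 0
                  for c in range(m)]
        out.append([center[:] for _ in rows])
    return out

# The two groups from Example 2 / Figure 3: {S1, S4, S5} and {S2, S3, S6}.
g1 = [[1,1,0], [1,0,0], [1,0,0]]
g2 = [[0,0,1], [1,0,1], [1,0,1]]
flipped = suppressions_to_flips([g1, g2])
```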
  ID | e1 e2 e3      ID | e1 e2 e3      ID | e1 e2 e3
  S1 |  1  1  0      S1 |  1  *  0      S1 |  1  0  0
  S2 |  0  0  1      S2 |  *  0  1      S2 |  1  0  1
  S3 |  1  0  1      S3 |  *  0  1      S3 |  1  0  1
  S4 |  1  0  0      S4 |  1  *  0      S4 |  1  0  0
  S5 |  1  0  0      S5 |  1  *  0      S5 |  1  0  0
  S6 |  1  0  1      S6 |  *  0  1      S6 |  1  0  1
  (a) Original dataset   (b) S*         (c) S_flip

Figure 3: S_flip is obtained from S* via Algorithm 1

Proof. Let OPT* and OPT_flip be the optimal solutions to the suppression-based and flip-based k-anonymization problems over R_D, respectively. Then it is easy to see that Cost(OPT*) ≤ (2k − 1)·Cost(OPT_flip). This is because every k-anonymous group of tuples in OPT_flip consists of at most 2k − 1 tuples. Further, this group can be converted to a k-anonymous group obtained by suppressions by *-ing out any column that contains a flip (essentially the reverse of Algorithm 1). It is also easy to see that the cost of any solution S_flip obtained by applying Algorithm 1 to a solution S* is less than the cost of S*. This gives us the following chain of inequalities and our desired result:

  Cost(S_flip) ≤ Cost(S*) ≤ α·Cost(OPT*) ≤ α(2k − 1)·Cost(OPT_flip)

The best possible suppression-based k-anonymization algorithm thus gives us a good flip-based anonymization algorithm through the application of Algorithm 1. Since the suppression-based algorithm from [20] has an approximation ratio of O(log k), Theorem 1 together with Proposition 1 gives us the following result.

Corollary 1. There exists an O(k log k)-approximation algorithm to the k-anonymization problem for set-valued data.

The suppression algorithm from [20] essentially considers all possible partitions of the dataset into k-anonymous groups and chooses a good one using a set-cover type greedy algorithm.

The translation of D to R_D also enables the insight that the k-anonymization problem over set-valued data is essentially a clustering problem.
Each set can be viewed as a vector in {0, 1}^m. The optimal solution to the following clustering problem then gives us an optimal solution to the k-anonymization problem for set-valued data.

Definition 5. (The k-Group Clustering Problem) Given a set of points in {0, 1}^m, cluster the points into groups of size at least k and assign cluster centers in {0, 1}^m so that the sum of the Hamming distances of the points to their cluster centers is minimized.

The following proposition tells us that there is a one-to-one correspondence between feasible solutions to the k-group clustering problem and the k-anonymization problem for set-valued data.

Proposition 3. Given a solution, S_group, to the k-group clustering problem over a dataset R_D, we can obtain a solution S_± of the same cost to the k-anonymization problem over D, and vice versa.

Proof Sketch. For every cluster in S_group, create a k-anonymous group of the sets corresponding to the cluster points in S_±. k-anonymity is achieved by adding or deleting items as necessary so that every set in the group becomes identical to the set corresponding to the cluster center. The sum of the Hamming distances of points to their cluster centers in S_group thus corresponds to the total number of additions and deletions of items to obtain the solution S_±.

Given Proposition 3, we can now focus on solving the k-group clustering problem from here on. In this regard, the following result tells us that it suffices to consider potential cluster centers from amongst the data points themselves.

Theorem 2. The cost of the optimal solution to the k-group clustering problem when the cluster centers are chosen from amongst the set of data points themselves is at most twice the cost of the optimal solution to the k-group clustering problem when the cluster centers are allowed to be arbitrary points in {0, 1}^m.

Proof.
Let OPT be the optimal solution to the k-group clustering problem when the cluster centers are allowed to be arbitrary points in {0, 1}^m. Now consider a solution S_rand that maintains the same cluster groups as OPT, but replaces each cluster center with a randomly chosen data point from within the cluster. The expected cost of this solution is given below:

  E[Cost(S_rand)] = Σ_{G ∈ 𝒢} Σ_{C ∈ 𝒞} 2 · N^C_{G,1} · N^C_{G,0} / (N^C_{G,1} + N^C_{G,0})

Here 𝒢 is the set of all clusters in S_rand (which is the same as the set of clusters in OPT), 𝒞 is the set of columns/dimensions of the dataset R_D, and N^C_{G,1} and N^C_{G,0} are the number of 1's and the number of 0's, respectively, that the points in a cluster G have in column C. The cost of the optimal solution, on the other hand, is given by

  Cost(OPT) = Σ_{G ∈ 𝒢} Σ_{C ∈ 𝒞} min(N^C_{G,1}, N^C_{G,0}).

By simple algebraic manipulation, it is easy to see that E[Cost(S_rand)] ≤ 2·Cost(OPT). Since the expected cost of S_rand is at most twice the cost of OPT, there must exist some clustering solution whose cluster centers are chosen from the data points themselves and whose cost is at most twice the cost of OPT. This completes the proof of the theorem.

Theorem 2 considerably simplifies the clustering problem, since there is now only a linear number of potential cluster centers that need be considered (as opposed to 2^m). We can now frame this modified k-group clustering problem as an integer program:

  minimize    Σ_{i,j} x_ij · d_ij
  subject to  x_ij ≤ y_j             for all i, j
              Σ_i x_ij ≥ k · y_j     for all j
              x_ij, y_j ∈ {0, 1}     for all i, j

Here y_j is an indicator variable that indicates whether or not data point S_j is chosen as a cluster center, x_ij is an indicator variable that indicates whether or not data point S_i is assigned to cluster center S_j, and d_ij is the Hamming distance between data points S_i and S_j.
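The key per-column step in the proof, that a random in-cluster center pays expected cost 2·N1·N0/(N1 + N0) while an arbitrary center pays min(N1, N0), and that the former is at most twice the latter, can be checked numerically with a small sketch (function names are ours):

```python
def expected_column_cost(n1, n0):
    """Expected Hamming cost of one column of one cluster when the
    center is a uniformly random cluster member: with probability
    n1/(n1+n0) the center's bit is 1 (cost n0), else 0 (cost n1)."""
    return 2 * n1 * n0 / (n1 + n0)

def optimal_column_cost(n1, n0):
    """An unrestricted center picks the majority bit, paying min(n1, n0)."""
    return min(n1, n0)
```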
This integer programming formulation is exactly equivalent to the load-balanced facility location problem studied in [10, 13, 22]. The cluster centers can be thought of as facilities, and the data points as demand points. The task then is to open facilities and assign demand points to opened facilities so that the sum of the distances to the facilities is minimized and every facility has at least k demand points assigned to it. The algorithms for this problem work by solving a modified instance of a regular facility location problem (without the load-balancing constraints), and then grouping together facilities that have fewer than k demand points assigned to them. The result from [22], in conjunction with Theorem 2 and Proposition 3, gives us the following result.

Theorem 3. There exists an O(1)-approximation algorithm for the k-anonymization problem for set-valued data.

To re-emphasize the earlier footnote, the approximation algorithms for suppression-based anonymization or load-balanced facility location never need to explicitly compute and operate on the bit vector representations of the records. They can operate directly on the set representations, computing distances between pairs of sets. Algorithm 1 need not operate on the bit-vector representations either. It can simply take every k-anonymous group of sets and add every majority item in the group to all the sets in the group, while deleting the other items.

5. EXPERIMENTS

In this section we experimentally demonstrate the applicability of our anonymization algorithms to the AOL query log dataset. Recall that in this dataset, records correspond to user sessions and items correspond to the query terms/keywords. As mentioned earlier, the query log dataset also contains other attributes that we ignore in this paper (see Section 5.5 for a discussion).
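The claim that distances can be computed on the set representations is easy to make concrete: the Hamming distance between the bit vectors of two sets equals the size of their symmetric difference, so it scales with the set sizes rather than with m. A minimal sketch (ours):

```python
def hamming_set(a, b):
    """Hamming distance between the bit vectors of two sets, computed
    directly from the sets: |a| + |b| - 2|a & b|, i.e., the size of the
    symmetric difference. Never touches the m-dimensional vectors."""
    return len(a) + len(b) - 2 * len(a & b)
```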
Our goal then is to add or delete keywords from user sessions so that every session becomes identical to at least k − 1 others. The anonymization algorithms from Section 4 cannot be directly applied to the AOL dataset for several reasons: (1) No two users in the dataset are likely to be similar on all their queries, since each user session is fairly large, representing 3 months of queries. The algorithms, when directly applied to the user sessions, would thus result in a large number of additions and deletions. (2) The dataset consists of millions of users. The algorithms from Section 4 have a quadratic running time and therefore cannot be practically applied to such real-world datasets directly. And (3) different keywords from different users could often be misspellings of each other or derivations from a common stem. The conditions for considering two user sessions to be "identical" thus need to be relaxed. We describe below the steps we took to overcome these three problems.

5.1 Separating User Sessions into Threads

To deal with the issue of large user sessions, we considered a relaxed problem definition: Each user session was first divided into smaller threads, and a different random identifier was assigned to each thread. We then considered the anonymization problem over these threads instead of the original sessions. Each user thread was treated as a set of keywords, and our goal was to add or delete keywords from user threads so that every user thread became identical to threads from at least k − 1 other users.

One trivial way to divide sessions into threads is to treat every single query from a user as a thread of its own and assign a random identifier to it. However, this would render the data nearly useless for many forms of analysis (e.g., studying patterns of query refinement). Instead, "topic-based" threads were determined on the basis of the similarity of constituent queries.
For this purpose we employed two simple measures to determine query similarity:

• Edit distance: Two queries were deemed similar if the edit distance between them was less than a threshold.
• Overlapping result sets: Two queries were deemed similar if the result sets returned for each query by a search engine had a large overlap in the top 50 results.

Using these similarity measures, each user session was separated into multiple threads: Queries in a user session were considered in the order of their time stamps. A query that was similar to one seen before was assigned the same identifier as the previous query. A query that was very different from any of the previously seen queries was assigned a new identifier. This was followed by another round where consecutive threads that contained similar queries were collapsed, and so on. This algorithm for determining threads was run on a random sample of ∼82K users who posed a total of ∼412K queries. The 82K user sessions were split into ∼165K threads. Each thread had on average 2.55 unique keywords.

There may of course exist more sophisticated techniques for separating sessions into topic-based threads; however, this is not the focus of this paper. Note that the shift in goal from anonymizing sessions to anonymizing threads enhances the utility of the released dataset (anonymizing entire sessions would require far too many additions and deletions), without affecting privacy too much. In fact, as we shall see in Section 5.4, the separation into threads itself helps in anonymization.

5.2 Pre-clustering User Threads

As mentioned earlier, the algorithms from Section 4 have a quadratic running time and cannot be practically applied to our dataset of user threads.
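The first pass of the thread-separation procedure (edit-distance variant) can be sketched as follows. This is our own minimal reading, with a hypothetical threshold of 3 and without the result-set-overlap measure or the later collapsing round:

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,         # deletion
                           cur[j - 1] + 1,      # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def assign_threads(queries, threshold=3):
    """Scan a session's queries in time order; a query joins the thread
    of the first earlier query within the edit-distance threshold,
    otherwise it starts a new thread. Returns one thread id per query."""
    thread_of, seen = [], []
    for q in queries:
        for idx, prev_q in enumerate(seen):
            if edit_distance(q, prev_q) < threshold:
                thread_of.append(thread_of[idx])
                break
        else:
            thread_of.append(len(set(thread_of)))  # next unused id
        seen.append(q)
    return thread_of
```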
To make them more scalable, we first performed a preliminary clustering step in which we clustered similar user threads together using a simple, fast clustering algorithm, and then applied the k-anonymization algorithms from Section 4 to the threads within each cluster. If a cluster had fewer than k user threads, we simply deleted these threads altogether. Running the k-anonymization algorithms within these small clusters was much more efficient than running them directly on all the user threads at once.

To do the preliminary clustering, we used the Jaccard coefficient as a similarity measure for user threads. Recall that each thread S_i is a subset of the universe of keywords U = {e_1, ..., e_m}. Under the Jaccard measure, the similarity of two user threads S_i and S_j is given by

Sim(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|

A straightforward clustering algorithm would involve a comparison between every pair of user threads and would thus be very inefficient. Instead, to quickly cluster all the user threads, we used Locality Sensitive Hashing (LSH). The LSH technique was introduced in [12] to efficiently solve the nearest-neighbour search problem. The key idea is to hash each user thread using several different hash functions, ensuring that for each function, the probability of collision is much higher for threads that are similar to each other than for those that are different. The Jaccard coefficient as a similarity measure admits an LSH scheme called Min-Hashing [8, 7]. The basic idea in the Min-Hashing scheme is to randomly permute the universe of keywords U and, for each user thread S_i, to compute its hash value MH(S_i) as the index of the first item under the permutation that belongs to S_i. It can be shown [8, 7] that for a random permutation, the probability that two user threads have the same hash value is exactly equal to their Jaccard coefficient.
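This property — equal Min-Hash values with probability exactly the Jaccard coefficient — can be checked empirically with a small sketch. The toy keyword universe and thread contents below are our own illustrative assumptions:

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash(thread, perm):
    # MH(S): index of the first item of the permutation that belongs to S
    return min(perm[e] for e in thread)

# hypothetical toy universe of keywords U and two user threads
U = ["pine", "straw", "mulch", "delivery", "shelter", "humane"]
S1 = {"pine", "straw", "mulch"}
S2 = {"pine", "straw", "delivery"}

random.seed(0)
trials = 20000
collisions = 0
for _ in range(trials):
    order = random.sample(U, len(U))            # a random permutation of U
    perm = {e: i for i, e in enumerate(order)}  # keyword -> index under it
    collisions += (minhash(S1, perm) == minhash(S2, perm))

# the collision frequency concentrates around Jaccard(S1, S2) = 2/4 = 0.5
print(collisions / trials, jaccard(S1, S2))
```

Over 20,000 random permutations the observed collision rate lands very close to the Jaccard coefficient of 0.5.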
Thus Min-Hashing is a probabilistic clustering algorithm, where each hash bucket corresponds to a cluster that puts together two user threads with probability proportional to their Jaccard coefficient. The LSH algorithm [12] concatenates p hash-keys for users, so that the probability that any two users S_i and S_j agree on their concatenated hash-keys is equal to Sim(S_i, S_j)^p. The concatenation of hash-keys thus creates refined clusters with high precision. Typical values for p that we tried were in the range 2–4.

Clearly, generating random permutations over the universe of keywords and storing them to compute Min-Hash values is not feasible. So instead, we generated a set of p independent, random seed values, one for each Min-Hash function, and mapped each user thread to a hash value computed using the seed. This hash value serves as a proxy for the index in the random permutation. The approximate Min-Hash values thus computed have properties similar to the ideal Min-Hash values [11]. See [11] for more details on this technique.

As a result of running the LSH-based clustering algorithm on our user threads, we obtained a total of ∼84K clusters. Each cluster contained an average of 2 user threads. The largest cluster contained ∼2800 threads and corresponded to the queries that searched for 'Google'! Again, there may exist more sophisticated techniques for clustering similar user threads together; however, this is not the focus of this paper, which is meant to be more of a proof of concept.

[Figure 4: Cost of achieving k-anonymity — number of keyword additions, deletions, and total cost (×10³) plotted against k, for k = 2 to 10.]

5.3 k-Anonymity within Clusters

Now within each cluster generated using the LSH scheme above, we ran the k-anonymization algorithm from Section 4 (i.e., the suppression algorithm from [20] followed by the application of Algorithm 1).
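The seeded-hash pre-clustering of Section 5.2 can be sketched as follows. As a stand-in for a keyword's index under a random permutation we use a seeded SHA-256 hash; the seed values, the choice of hash function, and all names here are our own assumptions, not the paper's implementation:

```python
import hashlib
from collections import defaultdict

def approx_minhash(thread, seed):
    # a seeded hash value stands in for a keyword's index under a random
    # permutation; the minimum over the thread approximates the Min-Hash
    def h(keyword):
        return int(hashlib.sha256(f"{seed}:{keyword}".encode()).hexdigest(), 16)
    return min(h(e) for e in thread)

def lsh_clusters(threads, seeds=(17, 42)):
    # bucket threads on the concatenation of p = len(seeds) Min-Hash values;
    # two threads land in the same bucket with probability ~ Sim(Si, Sj)^p
    buckets = defaultdict(list)
    for tid, keywords in threads.items():
        key = tuple(approx_minhash(keywords, s) for s in seeds)
        buckets[key].append(tid)
    return list(buckets.values())

# hypothetical user threads (thread id -> set of keywords)
threads = {
    "t1": {"pine", "straw", "mulch"},
    "t2": {"pine", "straw", "mulch"},   # identical to t1: always collides
    "t3": {"nicotine", "effects"},      # disjoint from t1: never collides
}
clusters = lsh_clusters(threads)
```

Identical threads always share a bucket, while threads with no keywords in common end up in separate buckets; the k-anonymization algorithm is then run inside each bucket.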
Before proceeding further, we need to clarify the criterion that was used for deeming two user threads to be identical. As mentioned earlier, different user threads might contain keywords that are actually just misspellings of each other or derivations from a common stem. To deal with this issue, we once again resorted to LSH. We treated each user thread as a set of Locality Sensitive Hashes [8, 7] of its constituent keywords, i.e., a user thread S_i = {e_1, ..., e_ℓ} now became S_i = {LSH(e_1), ..., LSH(e_ℓ)}, where LSH(e_j) is a concatenation of Min-Hashes of the keyword e_j². Two user threads were considered identical if they had the same set of hashes. Now if a k-anonymous solution for a particular cluster deemed that a certain LSH value must be deleted from a particular user thread, we simply deleted all the keywords from the user thread that generated that LSH value. If the solution asked for an LSH value to be added to a user thread, we added to the thread one of the keywords from its cluster that generated the LSH value. Threads in clusters of size less than k were entirely deleted.

Figure 4 shows the total number of additions and deletions of keywords that were made for different values of k. As would be expected, as k increases, the total number of additions and deletions that need to be made to achieve k-anonymity increases. The number of additions is a small fraction of the total cost, and surprisingly goes down as k increases.

5.4 Case Study

As anecdotal evidence of the effectiveness of our algorithms in anonymizing query logs, we looked at the query logs of user 4417749, who had previously been identified as Ms. Thelma Arnold from Lilburn, Georgia. Figure 5(a) shows a sample of user 4417749's query logs. Misspellings have been maintained; however, repeated queries have been removed.
² Each keyword can be treated as a multiset of characters.

Figure 5(a): User 4417749's session

4417749  pine straw lilburn delivery
4417749  pine straw delivery in gwinnett county
4417749  pine straw in lilburn ga.
4417749  atlant humane society
4417749  atlanta humane society
4417749  dekalb animal shelter
4417749  dekalb humane society
4417749  gwinnett animal shelter
4417749  doraville animal shelter
4417749  humane society
4417749  gwinnett humane society
4417749  seffects of nicotine
4417749  effects of nicotine
4417749  nicotine effects on the body
4417749  jarrett arnold
4417749  jarrett t. arnold
4417749  jarrett t. arnold eugene oregon
4417749  eugene oregon jaylene arnold
4417749  jaylene and jarrett arnold eugene or.
...

Figure 5(b): User 4417749's anonymized threads

1  4417749  pine straw lilburn delivery mulch
1  4417749  pine straw delivery in gwinnett county
1  4417749  pine straw in lilburn ga.
————————————————
2  4417749  atlant humane society county
2  4417749  atlanta humane society
2  4417749  dekalb animal shelter
2  4417749  dekalb humane society
2  4417749  gwinnett animal shelter
2  4417749  doraville animal shelter
2  4417749  humane society
2  4417749  gwinnett humane society
————————————————
3  4417749  seffects of nicotine
3  4417749  effects of nicotine
3  4417749  nicotine effects on the body
————————————————
4  4417749  jarrett arnold
4  4417749  jarrett t. arnold
4  4417749  jarrett t. arnold eugene oregon
4  4417749  eugene oregon jaylene arnold
4  4417749  jaylene and jarrett arnold eugene or.
...

Figure 5: User 4417749's query logs

As can be seen, the user searched for some fairly generic queries such as the "effects of nicotine on the body". However, she also posed several identifying queries. For instance, she queried for humane societies and animal shelters in Gwinnett county, Georgia, revealing herself to be an animal lover in Gwinnett county.
Further, she queried for pine straw delivery in Lilburn, Gwinnett, thereby revealing herself to be a resident of Lilburn, Gwinnett. Finally, her queries for relatives in Oregon revealed that her last name was "Arnold".

Figure 5(b) shows the result of running our k-anonymization algorithm for k = 3. Notice first that the division of Ms. Arnold's session into threads itself goes some way towards anonymization by de-correlating her various query topics. The session sample was divided into a thread for pine straw delivery, a thread for animal shelters and humane societies, a thread for the effects of nicotine, and a thread for the queries about relatives in Oregon. Each thread was assigned a separate identifier.

The threads were treated as sets of unique keywords (not depicted in the figure) and were then clustered with the threads of other users using LSH. The anonymization algorithms were run within the resulting clusters. If a particular keyword was to be deleted from a particular thread, we deleted every occurrence of that keyword from the original queries of the thread. If a keyword was to be added to a thread, we added it to one of the original queries of the thread. The result was that some threads, such as the nicotine thread, were left relatively untouched. In the thread for pine straw delivery, the keywords 'lilburn', 'delivery', 'gwinnett', 'county' and 'ga.' were deleted, and the keyword 'mulch' was added instead. This is because other users in the thread's cluster, querying for 'pine straw', queried for it in conjunction with the keyword 'mulch'. Similarly, in the thread for animal shelters and humane societies, the keywords 'gwinnett' and 'doraville' were removed, while the keyword 'county' was added, since many users searched for animal shelters in 'dekalb county'.
Finally, the thread for the relatives in Oregon was deleted altogether because not a sufficient number of threads from other users got clustered with it. Many users queried for 'arnold schwarzenegger'; however, none of their threads fell in the same cluster!

This example shows that our algorithm does the intuitively right thing. Identifying keywords are removed, and keywords that commonly occur in conjunction with other keywords are added to a user's threads. The guarantee is that every user thread will look like the threads of at least k − 1 other users, and this guarantee is achieved while making a close to minimal number of additions and deletions.

5.5 Discussion

While the example of Ms. Thelma Arnold seems to indicate that our anonymization algorithms do the right thing for query logs, our experimental work here is in reality a first step, due to the complex nature of the dataset. Several points require further discussion.

Other Attributes: As mentioned earlier, query logs contain other information about user activity, namely timestamp information for when a query was posed and the query result that was clicked on. Our algorithm focused on anonymizing just the queries themselves, whereas it is conceivable that these other attributes of the dataset may also be used in launching privacy attacks. One possible anonymization approach is to treat these other attribute values as items of the universe as well and proceed as before. So, for example, if a majority of users queried for the Indiana Jones movie on the day that it was released, then this day would be added as part of the timestamp to all user threads on the Indiana Jones movie. The drawback to this approach could be a loss of very fine-grained timestamp information, and a better understanding of utility is required before this approach can be recommended.
Privacy: In adapting our algorithm to the query logs, we considered a relaxation of the original problem statement: instead of anonymizing entire sessions, we anonymized threads. The privacy implications of this relaxation need to be further examined. At first glance, it seems that the division of the user sessions into threads only helps in our privacy goals by de-correlating a user's query topics. However, there is no "proof" that a user's threads could not somehow be stitched together to reconstruct his session, which would then no longer be k-anonymous. An experimental or theoretical study of the implications of our problem relaxation would be an interesting avenue for future work.

Utility: Our approach of treating a thread as a set of keywords affects the utility of the released dataset. For example, in Figure 5(b), the keyword 'county' was added to the query for 'atlant humane society', since it was to be indiscriminately added to any one of the queries in the thread. In reality, it should have been added to the query for 'dekalb animal shelter', and that too in the semantically correct position, as 'dekalb county animal shelter'. Thus, by treating threads as sets of keywords, we lose potentially important information about the ordering of keywords within queries. Another point regarding utility is our criterion for measuring the utility of the released dataset. As in traditional k-anonymity work, the criterion we used was to minimize the total number of changes made to the dataset. A better metric for measuring the utility of the released dataset would be to measure the impact of the anonymization on algorithms that actually use the dataset. For example, how well does a search engine's query suggestion algorithm work when run on the released dataset instead of the original?
This is a very interesting question that would ultimately need to be answered for evaluating the utility of any anonymization scheme.

6. SUMMARY AND FUTURE WORK

In this paper we introduced the k-anonymization problem for set-valued data. Algorithms with approximation factors of O(k log k) and O(1) for the problem were developed. We applied our anonymization algorithms to the AOL query log dataset. In order to scale the algorithms to deal with the size of the dataset, we proposed a division of the dataset into clusters, followed by the application of anonymization algorithms within the clusters. Besides the problems mentioned in Section 5.5, there are several other avenues for future work. For instance, one interesting research direction would be to develop scalable anonymization algorithms for massive modern-day datasets with provable approximation guarantees. Another important research question is how such algorithms can be applied to anonymize datasets on the fly as new records get added to them. For example, as a search engine receives new queries, how should it anonymize them in an online fashion before storing them?

7. ACKNOWLEDGEMENTS

The authors would like to thank Tomas Feder, Evimaria Terzi and An Zhu for many useful discussions.

8. REFERENCES

[1] E. Adar. User 4XXXXX9: anonymizing query logs. In Proceedings of the Workshop on Query Log Analysis: Social and Technological Challenges, 2007.
[2] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 153–162, 2006.
[3] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In Proceedings of the 10th International Conference on Database Theory, pages 246–258, 2005.
[4] R. Agrawal and R. Srikant. Privacy-preserving data mining. ACM SIGMOD Record, 29(2):439–450, 2000.
[5] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving OLAP. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 251–262, 2005.
[6] M. Barbaro and T. Z. Jr. A face is exposed for AOL searcher no. 4417749. New York Times, Aug 2006.
[7] A. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences, page 21, 1997.
[8] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441–453, 1997.
[9] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 211–222, 2003.
[10] S. Guha, A. Meyerson, and K. Munagala. Hierarchical placement and network design problems. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science, page 603, 2000.
[11] P. Indyk. A small approximately min-wise independent family of hash functions. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 454–456, 1999.
[12] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, 1998.
[13] D. R. Karger and M. Minkoff. Building Steiner trees with incomplete global knowledge. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science, page 613, 2000.
[14] R. Kumar, J. Novak, B. Pang, and A. Tomkins. On anonymizing query logs via token-based hashing.
In Proceedings of the 16th International Conference on World Wide Web, pages 629–638, 2007.
[15] K. R. LeFevre. Anonymity in Data Publishing and Distribution. Ph.D. thesis, University of Wisconsin at Madison, 2007.
[16] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd IEEE International Conference on Data Engineering, pages 106–115, 2007.
[17] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. ℓ-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1):3, 2007.
[18] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 223–228, 2004.
[19] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy, pages 111–125, 2008.
[20] H. Park and K. Shim. Approximate algorithms for k-anonymity. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 67–78, 2007.
[21] B. Poblete, M. Spiliopoulou, and R. Baeza-Yates. Website privacy preservation for query log publishing. In Proceedings of the 1st International Workshop on Privacy, Security and Trust in KDD, 2007.
[22] Z. Svitkina. Lower-bounded facility location. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1154–1163, 2008.
[23] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[24] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of unstructured data. In Proceedings of the 34th International Conference on Very Large Data Bases, 2008.
[25] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 767–775, 2008.
