Towards Utility-driven Anonymization of Transactions

Authors: Grigorios Loukides, Aris Gkoulalas-Divanis, Bradley Malin

{grigorios.loukides, aris.gkoulalas, b.malin}@vanderbilt.edu
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA

ABSTRACT

Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published transactions to persons' identities, but all are limited in application because they incorporate coarse privacy requirements (e.g., protecting a certain set of m diagnosis codes requires protecting all m-sized sets), do not integrate utility requirements, and tend to explore a small portion of the solution space. In this paper, we propose a more general framework for anonymizing transactional data under specific privacy and utility requirements. We model such requirements as constraints, investigate how these constraints can be specified, and propose COAT (COnstraint-based Anonymization of Transactions), an algorithm that anonymizes transactions using a flexible hierarchy-free generalization scheme to meet the specified constraints. Experiments with benchmark datasets verify that COAT significantly outperforms the current state-of-the-art algorithm in terms of data utility, while being comparable in terms of efficiency. The effectiveness of our approach is also demonstrated in a real-world scenario, which requires disseminating a private, patient-specific transactional dataset in a way that preserves both privacy and utility in intended studies.

1. INTRODUCTION

Organizations in various domains, ranging from healthcare to government, increasingly share person-specific data, devoid of explicit identifiers (e.g., names), to enable research and comply with regulations. For instance, the U.S. National Institutes of Health (NIH) recently mandated that NIH-sponsored investigators disclose data collected or studied in a manner that is "free of identifiers that could lead to deductive disclosure of the identity of individual subjects" [?]. Numerous studies demonstrate that de-identification (i.e., the removal of explicit identifiers) is insufficient for privacy protection of transactional data (i.e., data in which a set of items corresponds to an individual) [13, 25, 10]. This is because published transactions can disclose the identity of an individual associated with a transaction, if an attacker knows some of the items this individual is associated with. Imagine, for example, that Alice was diagnosed with the diseases contained in the first transaction of Fig. 1(a) and told her neighbor Bob that she suffers from a, b and c, which are relatively common. Publishing a de-identified version of the data of Fig. 1(a) allows Bob to find out that the first transaction corresponds to Alice, since there is only one transaction containing a, b and c in this dataset. This problem, referred to as identity disclosure, must be addressed to comply with regulations and to protect individuals' privacy. Having identified Alice's transaction, for example, Bob can infer that Alice also suffers from the diseases d, e, f, g and h.
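The attack behind identity disclosure amounts to counting supporting transactions. The following minimal sketch (ours, in Python; the data is that of Fig. 1(a), with patient names shown only to illustrate the attack) makes this concrete:

```python
# Dataset of Fig. 1(a); names would not appear in a published release.
dataset = {
    "Alice":  {"a", "b", "c", "d", "e", "f", "g", "h"},
    "Mary":   {"a", "c", "e", "f", "g"},
    "George": {"c", "d", "e", "f", "h"},
    "Jack":   {"a", "c", "e", "f"},
    "Anne":   {"e", "f", "g", "h"},
    "Tom":    {"d", "e", "f", "g"},
    "Jim":    {"a", "b", "d", "e"},
    "Steve":  {"a", "c", "f"},
    "David":  {"a", "c"},
    "Ellen":  {"b", "h"},
}

def supporting(itemset, data):
    """Transactions that support (i.e., contain) the given itemset."""
    return [name for name, t in data.items() if itemset <= t]

# Bob knows Alice has diagnoses a, b and c; a unique match re-identifies her.
print(supporting({"a", "b", "c"}, dataset))  # ['Alice'] -> identity disclosure
```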
1.1 Motivation

To prevent identity disclosure, portions of transactions that are potentially linkable to identifying information need to be explicitly specified and protected prior to data release. This involves formulating a set of privacy constraints, which are satisfied by transforming data so that each individual is linked to a sufficiently large number of transactions with respect to these constraints. This process achieves privacy because an attacker must distinguish an individual's real transaction among the transformed ones to identify him/her. However, data transformation may harm data utility when usability requirements are unaccounted for, resulting in released data that is subpar for intended applications. Thus, it is essential to balance privacy constraints with utility constraints to ensure meaningful analysis. To show the importance of both types of constraints, consider Example 1.

Patient   Diagnosis Codes
Alice     a, b, c, d, e, f, g, h
Mary      a, c, e, f, g
George    c, d, e, f, h
Jack      a, c, e, f
Anne      e, f, g, h
Tom       d, e, f, g
Jim       a, b, d, e
Steve     a, c, f
David     a, c
Ellen     b, h
(a) Original dataset

p1 = {a, b, c}
p2 = {d, e, f, g, h}
(b) Privacy constraints

u1 = {a, b}
u2 = {c}
u3 = {d}
u4 = {e, f, g, h}
(c) Utility constraints

Figure 1: Example dataset and constraints.

Example 1. Imagine that a hospital needs to publish the dataset of Fig. 1(a), where each transaction corresponds to a patient (patients' names are not released) and consists of a set of diagnosis codes (items). Certain item combinations (itemsets), such as the combinations of diagnosis codes abc and defgh of Fig. 1(b), are regarded as potentially linkable and must be associated with at least 5 transactions to prevent identity disclosure. At the same time, the published data must be able to support a study in which the number of patients diagnosed with cold, denoted with c, needs to be accurately determined. These requirements can be modeled via the privacy constraints in Fig. 1(b) and a utility constraint {c}. An anonymization that satisfies them is given in Fig. 2(c). Observe that each patient can be linked to no less than 5 transactions using the combinations abc or defgh, while the number of patients suffering from cold can still be accurately computed after anonymization.

Patient   Diagnosis Codes
Alice     (a, b, c), (d, e, f, g, h)
Mary      (a, b, c), (d, e, f, g, h)
George    (a, b, c), (d, e, f, g, h)
Jack      (a, b, c), (d, e, f, g, h)
Anne      (d, e, f, g, h)
Tom       (d, e, f, g, h)
Jim       (a, b, c), (d, e, f, g, h)
Steve     (a, b, c), (d, e, f, g, h)
David     (a, b, c)
Ellen     (a, b, c), (d, e, f, g, h)
(a) Apriori (5^5-anonymity) [20]

Patient   Diagnosis Codes
Alice     e, f
Mary      e, f
George    e, f
Jack      e, f
Anne      e, f
Tom       e, f
Jim       e
Steve     f
David     -
Ellen     -
(b) Greedy algorithm ((1, 5, 5)-coherence) [25]

Patient   Diagnosis Codes
Alice     (a, b), c, e, f, (g, h)
Mary      (a, b), c, e, f, (g, h)
George    c, e, f, (g, h)
Jack      (a, b), c, e, f
Anne      e, f, (g, h)
Tom       e, f, (g, h)
Jim       (a, b), e
Steve     (a, b), c, f
David     (a, b), c
Ellen     (a, b), (g, h)
(c) COAT (k = 5, s = 15%, constraints of Figs. 1(b) and 1(c))

Figure 2: Three possible anonymizations of the dataset of Fig. 1(a).
1.2 Limitations of Existing Methodologies

Preventing identity disclosure in transactional data was investigated recently [20, 25]; however, existing approaches inadequately deal with the scenario of Example 1 for several reasons. First, they support a limited class of privacy requirements. While they are effective at protecting all itemsets comprised of a certain number of items (i.e., itemsets of a certain size), potentially linkable itemsets may involve certain items only and vary in size. For instance, in the presence of the privacy constraints of Example 1, the approaches of [20] and [25] would protect all 56 combinations of 5 diagnosis codes (e.g., abcde, bcdef, etc.). Unnecessarily protecting itemsets significantly distorts data because the number of itemsets that require protection rapidly increases with their size.

Second, prior approaches neglect specific data utility requirements. Thus, they do not guarantee generating practically useful solutions for environments where usability is based on well-defined policies, such as in epidemiology (where combinations of diagnosis codes form syndromes) [12]. For instance, when applied to the dataset of Fig. 1(a), [20] and [25] produce the anonymizations of Figs. 2(a) and 2(b), respectively. These anonymizations do not support the study of Example 1 because they do not allow the number of patients suffering from cold to be accurately computed, as a result of violating the utility constraint {c}.

Third, the existing literature considers only a small number of possible transformations to meet privacy constraints. For example, [20] protects a, b and c by generalizing them to their closest common ascendant (a, b, c) according to the hierarchy of Fig. 3, while [25] does so by eliminating them from the released dataset. This incurs excessive information loss, but, as we show in this paper, is unnecessary because items can be generalized together in a hierarchy-free manner to reduce the amount of information loss incurred.

Figure 3: Hierarchy for the dataset of Fig. 1(a).

1.3 Contributions

This paper proposes an innovative approach to anonymize transactional data under privacy and utility constraints. Given a dataset and a set of constraints, our approach prevents identity disclosure by ensuring that each transaction is indistinguishable from at least k − 1 other transactions with respect to privacy constraints, while satisfying utility constraints. For instance, when applied to anonymize the data in Fig. 1(a) using the constraints in Figs. 1(b) and 1(c), our approach generates the anonymization of Fig. 2(c), which satisfies the imposed constraints. Our work makes the following specific contributions.

First, we propose a novel constraint specification model that enables data owners to express detailed privacy and utility requirements. We introduce privacy constraints that allow protecting the required itemsets only, thereby reducing information loss, and utility constraints that ensure the utility of anonymized data in practice. Acknowledging that constraint specification may be difficult in the absence of specific domain knowledge, we also propose an algorithm to extract privacy constraints from the data and a recipe for specifying utility constraints.

Second, we propose an item generalization model that eliminates the requirement for hierarchies.
This is important because many domains do not fit into rigid hierarchies [6, 15]. In doing so, our framework can produce fine-grained anonymizations with substantially better utility than that of [20] and [25]. For example, to meet the privacy constraint p1 (Fig. 1(b)), our solution generalizes a and b to (a, b) and leaves c intact, retaining more information than [20], which releases (a, b, c), and [25], which eliminates a, b and c.

Third, we develop COnstraint-based Anonymization of Transactions (COAT), an algorithm that iteratively selects privacy constraints and transforms data to satisfy them. For each privacy constraint, COAT generalizes items in accordance with the specified utility constraints and attempts to minimize information loss. When a privacy constraint cannot be satisfied through generalization, COAT suppresses the least number of items required to meet this constraint.

Fourth, we investigate the effectiveness of our approach through experiments on widely-used benchmark datasets and a case study on patient-specific data extracted from the Electronic Medical Record system [18] of Vanderbilt University Medical Center, a large healthcare provider in the U.S. Our results verify that the proposed methodology is able to anonymize transactions under various privacy and utility constraints with less information loss than the state-of-the-art method [20] and to generate high-quality anonymizations in a real-world data dissemination scenario.

1.4 Paper Organization

The rest of the paper is organized as follows. Related work is reviewed in Section 2. We formally define our constraint specification model and the problem of anonymizing transactions under constraints in Section 3. COAT is presented in Section 4. Section 5 presents methods for specifying constraints. Sections 6 and 7 report experimental results and a case study of applying COAT, respectively. Finally, we conclude the paper in Section 8.

2. RELATED WORK

The problem of identity disclosure in relational data publishing has been studied extensively [4, 8, 9, 16, 19]. A well-established principle that can prevent this type of threat is k-anonymity [16, 19]. A relational table is k-anonymous when each record is indistinguishable from at least k − 1 others with respect to potentially identifying attributes (termed quasi-identifiers or QIDs). K-anonymity can be achieved by generalization, a process in which QID values are replaced by more general ones specified by a generalization (recoding) model, or via suppression, a technique that removes values or records from anonymized data [8]. In this work, we anonymize transactions to thwart identity disclosure by applying item generalization and suppression. Beyond identity disclosure, another threat in relational data publishing is attribute disclosure (i.e., the inference of an individual's sensitive values), which can be guarded against by several principles, such as l-diversity [11]. We note that our approach can be extended to prevent attribute disclosure as well, but this is beyond the scope of this paper.

Privacy-preserving publication of transactions was recently investigated [5, 20, 25]. First, in [20], k^m-anonymity was proposed to prevent attackers with the knowledge of at most m items from linking an identified individual to less than k published transactions.
The authors of [20] designed three algorithms to enforce k^m-anonymity, but the Apriori algorithm is the only one that is sufficiently scalable for use in practice. It operates in a bottom-up fashion, beginning with itemsets comprised of one item and subsequently considering incrementally larger itemsets. In each iteration, k^m-anonymity is enforced using a hierarchy-based generalization model. Second, [25] proposed (h, k, p)-coherence, a privacy principle that addresses both identity and attribute disclosure. This principle assumes a fixed classification of items into potentially linkable and sensitive, treats potentially linkable items similarly to k^m-anonymity, and additionally limits the probability of inferring sensitive items. Following [20], we do not adopt such a classification, and allow any item to be treated as potentially linkable for the remaining (sensitive) items. To satisfy (h, k, p)-coherence, [25] proposed an algorithm that discovers all unprotected itemsets of minimal size and protects them by iteratively suppressing the item contained in the greatest number of those itemsets. The primary differences between our work and the approaches of [20] and [25] were discussed in the Introduction. Finally, [5] developed a method that eliminates attribute disclosure based on bucketization [23] and l-diversity. Our work is orthogonal to [5]; we do not aim to thwart attribute disclosure, but rather apply item generalization and suppression to guard against identity disclosure.

Preserving privacy has also been considered in contexts related to knowledge sharing, where the goal is to prevent the inference of rules or patterns [2, 21]. Our method is fundamentally different from this line of research, as we aim to publish data that prevents the disclosure of individuals' identities instead.

3. BACKGROUND AND PROBLEM FORMULATION

Let I = {i_1, ..., i_M} be a finite set of literals, called items. Any subset I ⊆ I is called an itemset over I, and is represented as the concatenation of the items it contains. An itemset that has m items, or equivalently a size of m, is called an m-itemset. A dataset D = {T_1, ..., T_N} is a set of N transactions. Each transaction T_n, n = 1, ..., N, over I corresponds to a unique individual and is a pair T_n = (tid, I), where I is the itemset and tid is a unique identifier. A transaction T_n = (tid, J) supports an itemset I over I if I ⊆ J. Given an itemset I over I in D, we use sup(I, D) to represent the number of transactions T_n ∈ D that support I. This set of transactions, called the supporting transactions of I in D, is denoted as D_I.

3.1 Set-based Anonymization

We propose a set-based anonymization model for transactional data, formally defined as follows:

Definition 3.1. (Set-Based Anonymization). A set-based anonymization of I is a set Ĩ = {ĩ_1, ..., ĩ_M̃} with the following properties: (1) each item in I is mapped to a unique item ĩ_m ∈ Ĩ, m ∈ [1, M̃], that is a subset of I, using an anonymization function Φ : I → Ĩ, (2) ∪_{m=1}^{M̃} ĩ_m = I − S, where S is the set of items mapped to the empty subset of I, and (3) ĩ_r ∩ ĩ_s = ∅, for any ĩ_r, ĩ_s ∈ Ĩ, r ≠ s.

ĩ is a generalized item when it contains at least one i_r ∈ I that is mapped to a non-empty subset of I. We use the notation ĩ = (i_1, ..., i_m) to refer to its elements (items) from I.
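To make Definition 3.1 concrete, here is a minimal sketch (ours, not the authors' code) of one way to encode a set-based anonymization: the anonymization function Φ is a dict mapping each original item to a frozenset, with suppressed items mapped to the empty set. The mapping shown mirrors the example of Fig. 4, discussed next.

```python
from itertools import chain

# Phi for the anonymization of Fig. 4: a, b -> (a, b); g, h -> (g, h);
# c, e, f stay intact; d is suppressed (mapped to the empty subset).
I = {"a", "b", "c", "d", "e", "f", "g", "h"}
phi = {
    "a": frozenset({"a", "b"}), "b": frozenset({"a", "b"}),
    "c": frozenset({"c"}),
    "d": frozenset(),  # suppressed
    "e": frozenset({"e"}), "f": frozenset({"f"}),
    "g": frozenset({"g", "h"}), "h": frozenset({"g", "h"}),
}

generalized = {g for g in phi.values() if g}       # the items of I~
suppressed = {i for i, g in phi.items() if not g}  # the set S

# Property (2): generalized items cover exactly I - S ...
assert set(chain.from_iterable(generalized)) == I - suppressed
# ... and property (3): distinct generalized items are disjoint.
assert all(g1 == g2 or not (g1 & g2) for g1 in generalized for g2 in generalized)
```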
Any item from I that is mapped to the empty subset of I, denoted as ( ), is called suppressed, and is contained in the set S. An example of a set-based anonymization is shown in Fig. 4. Notice that items a and b are mapped to the same generalized item (a, b), whereas item d is suppressed.

Figure 4: An example of set-based anonymization.

The set-based anonymization model is very flexible because it does not force any items to be generalized together, as formally shown in Corollary 3.1. This is different from the full-subtree generalization model [7] adopted in [20], which forces all siblings of an original (leaf-level) item to be mapped to an intermediate node in the hierarchy when this item is generalized to the intermediate node.

Corollary 3.1. In the set-based anonymization model, mapping an item i_r ∈ I to a generalized item ĩ does not force any other item i_s ∈ I to be mapped to ĩ.

Additionally, as explained in Corollary 3.2, the set-based anonymization model contains the generalization model used in [20] as a special case. Thus, our model enables exploring a much larger set of possible anonymizations, which have the potential to incur less information loss.

Corollary 3.2. The full-subtree recoding model is a special case of the set-based anonymization model, where each ĩ_m, m ∈ [1, M̃], is mapped to an intermediate node of the considered hierarchy.

Our anonymization model transforms a dataset D into a new dataset D̃ that helps prevent identity disclosure, since the number of transactions of D̃ that can be associated with an individual is increased, as proven in Theorem 3.1.

Theorem 3.1. (Generalization Principle). Given two items i_r, i_s that appear in transactions of D and are mapped to the same generalized item ĩ after anonymizing D to D̃, and an itemset i_r i_s, it holds that

sup(ĩ, D̃) = sup(i_r, D) + sup(i_s, D) − sup(i_r i_s, D)

Proof. The proof follows directly from the fact that the items i_r and i_s, and the itemset i_r i_s, are mapped to a common literal ĩ ∈ D̃ in all transactions of D that support i_r, i_s or i_r i_s.

We illustrate Theorem 3.1 using Figs. 1(a) and 2(c). Consider items a, b and the itemset ab in Fig. 1(a), which have supports of 6, 3 and 2 respectively, and are mapped to the same generalized item (a, b) in Fig. 2(c). Observe that (a, b) has a support of 7, which is equal to the sum of the supports of a and b minus that of ab.

3.2 Privacy Constraints

The integration of privacy constraints is central to our framework because they allow for the explicit definition of which itemsets are potentially linkable and require protection. In what follows, we formally define the notion of privacy constraints and their satisfiability.

Definition 3.2. (Privacy Constraint Set). A privacy constraint p is a non-empty set of items in I that are specified as potentially linkable. The union of all privacy constraints formulates a privacy constraint set P.
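Theorem 3.1 is just inclusion-exclusion over supporting transactions; a short sketch (ours, reusing the `dataset` from the earlier example) checks the arithmetic on Fig. 1(a):

```python
transactions = list(dataset.values())  # the ten transactions of Fig. 1(a)

def sup(itemset, data):
    """Number of transactions supporting the itemset."""
    return sum(1 for t in data if set(itemset) <= t)

# A transaction supports (a, b) iff it contained a, b, or both, so:
merged = sup({"a"}, transactions) + sup({"b"}, transactions) \
         - sup({"a", "b"}, transactions)
print(merged)  # 6 + 3 - 2 = 7, the support of (a, b) in Fig. 2(c)
```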
Definition 3.3. (Privacy Constraint Satisfiability). Given a parameter k, a privacy constraint p = {i_1, ..., i_r} ∈ P is satisfied when the corresponding itemset ∪_{m=1}^{r} Φ(i_m) is: (1) supported by at least k transactions in D̃, or (2) not supported in D̃ and each of its proper subsets is either supported by at least k transactions in D̃ or not supported in D̃. P is satisfied when every p ∈ P is satisfied.

To illustrate these definitions, consider the privacy constraint p1 = {a, b, c} in Fig. 1(b). This privacy constraint is satisfied for k = 5 in the dataset of Fig. 2(c) because 5 transactions support the itemset Φ(a) ∪ Φ(b) ∪ Φ(c) = (a, b)c.

Satisfying a privacy constraint p prevents identity disclosure because the number of transactions that can be linked to an individual using any subset of items in p is either at least k, or zero, as shown in Theorem 3.2.

Theorem 3.2. (Monotonicity). For a given k, the satisfaction of a privacy constraint p in D̃ implies that each privacy constraint p_j ⊂ p is satisfied in D̃.

Proof. Assume that a privacy constraint p_j ⊂ p is not satisfied in D̃ for this value of k. Then, according to Definition 3.3, the satisfaction of p implies that the itemset I = ∪_{∀ i_m ∈ p} Φ(i_m) is supported by either at least k or 0 transactions in D̃. Now, consider the itemset J = ∪_{∀ i_m ∈ p_j} Φ(i_m), which is derived by applying Φ to each item in p_j. Since J ⊂ I, we have sup(J, D̃) ≥ sup(I, D̃) ≥ k when sup(I, D̃) ≥ k, due to the monotonicity principle [1]. When sup(I, D̃) = 0, p_j is satisfied by Definition 3.3. In either case p_j is satisfied in D̃ for the given k, which contradicts the assumption and proves that the theorem holds.

Our privacy constraint specification model offers two benefits. First, it allows data owners to specify a range of different privacy requirements. For instance, it can be used to protect specific itemsets of various sizes, or to provide the same privacy guarantees as k^m-anonymity (by formulating a privacy constraint set that consists of all itemsets of size m). Second, our model allows protecting any set of itemsets without forcing any additional itemsets to be unnecessarily protected. This is important because unnecessarily protecting itemsets may significantly increase the amount of information loss incurred to anonymize data.
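A direct transcription of Definition 3.3 (ours; it reuses `phi`, `transactions` and `sup` from the earlier sketches and applies Φ transaction-wise) might look as follows:

```python
from itertools import combinations

def apply_phi(data, phi):
    """Anonymize: replace each item by its generalized item; suppressed items drop out."""
    return [{phi[i] for i in t if phi[i]} for t in data]

def privacy_satisfied(p, phi, anon, k):
    """Definition 3.3 for a single privacy constraint p."""
    itemset = {phi[i] for i in p if phi[i]}
    s = sup(itemset, anon)
    if s >= k:
        return True
    if s > 0:
        return False
    # Unsupported: every proper subset must have support >= k or exactly 0.
    return all(sup(set(sub), anon) >= k or sup(set(sub), anon) == 0
               for r in range(1, len(itemset))
               for sub in combinations(itemset, r))

anon = apply_phi(transactions, phi)
print(privacy_satisfied({"a", "b", "c"}, phi, anon, k=5))  # True: (a, b)c has support 5
```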
3.3 Utility Constraints

Privacy protection is offered at the expense of data utility [8, 19], so it is important to ensure that anonymized data is not overly distorted. Existing approaches attempt to do so by minimizing the amount of information loss incurred when anonymizing transactions [5, 20], but do not guarantee furnishing a useful result for intended applications. By contrast, our methodology offers such guarantees through the introduction of utility constraints. Before formally defining such constraints, we make the following important observations related to data usefulness.

Observation 1. Mapping a set of items in D to the same generalized item in the anonymized dataset D̃ introduces distortion because these items become indistinguishable in D̃. When there is no control over how specific items are generalized, D̃ may not be practically useful.

Observation 2. Suppressing an item in D introduces distortion because this item is not contained in D̃, and the amount of distortion increases with the number of suppressions.

Based on these observations, we introduce a utility constraint set U to limit the amount of generalization items are allowed to receive based on application requirements, and bound the number of items that can be suppressed using a threshold s. Definitions 3.4 and 3.5 formalize a utility constraint set and its satisfiability, respectively.

Definition 3.4. (Utility Constraint Set). A utility constraint set U is a partition of I that declares the set of allowable mappings of the items from I to those of Ĩ through Φ. Each element of U is called a utility constraint.

Definition 3.5. (Utility Constraint Set Satisfiability). Given sets I, Ĩ, a utility constraint set U, and a parameter s, U is satisfied if and only if (1) for each non-empty ĩ_m ∈ Ĩ, there exists u_j ∈ U such that ĩ_m ⊆ u_j, and (2) the fraction of items in I contained in the set of suppressed items S is at most s%.

The first condition limits the maximum amount of generalization each item is allowed to receive in a set-based anonymization Ĩ, while the second condition ensures that the number of suppressed items is controlled by a threshold specified by data owners. When both of these conditions hold, U is satisfied, and Ĩ corresponds to a dataset that can be meaningfully analyzed. Example 2 illustrates the above definitions.

Example 2. Consider Fig. 5, in which ĩ_1 ⊆ u_1, ĩ_2 ⊆ u_2, and ĩ_4, ĩ_5, ĩ_6 are subsets of u_4. The percentage of suppressed items is 12.5% because I consists of 8 items (see Fig. 4) and only d is suppressed. Thus, U is satisfied.

Figure 5: Utility constraint set satisfiability example for the set-based anonymization of Fig. 4.

We also observe that the number of supporting transactions of a generalized item in the anonymized dataset D̃ is equal to the number of transactions supporting any item in the original dataset D that is mapped to this generalized item, as stated in Theorem 3.3.

Theorem 3.3. Given a generalized item ĩ_m ∈ Ĩ such that ĩ_m = {i_1, ..., i_r}, it holds that

|D̃_{ĩ_m}| = |D_{i_1} ∪ ... ∪ D_{i_r}|

where |D̃_{ĩ_m}| denotes the size of the set of supporting transactions of ĩ_m in D̃, and |D_{i_1} ∪ ... ∪ D_{i_r}| the size of the set of transactions supporting at least one of the items in {i_1, ..., i_r} in D.

Proof. The proof is omitted, because it is similar to that of Theorem 3.1.

Based on Theorem 3.3, we provide the following corollary, which highlights the importance of anonymizing data while satisfying utility constraints.

Corollary 3.3. Given a utility constraint set U that is satisfied, a utility constraint u_j = {i_1, ..., i_r} ∈ U, and a set of generalized items {ĩ_1, ..., ĩ_s} constructed by mapping each element of u_j to one of these items, it holds that

|D̃_{ĩ_1} ∪ ... ∪ D̃_{ĩ_s}| = |D_{i_1} ∪ ... ∪ D_{i_r}|

Thus, the number of transactions of D supporting any item contained in a utility constraint u_j ∈ U can be accurately computed from the anonymized dataset D̃, when U is satisfied and all items in u_j have been generalized.
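Definition 3.5 translates into two simple checks. The sketch below (ours, reusing the dict-based `phi`) tests the anonymization of Fig. 4 against the utility constraints of Fig. 1(c) with s = 15%:

```python
def utility_satisfied(phi, U, s):
    generalized = {g for g in phi.values() if g}
    # Condition (1): each generalized item fits inside some utility constraint.
    fits = all(any(g <= u for u in U) for g in generalized)
    # Condition (2): at most s% of the items in I are suppressed.
    suppressed = sum(1 for g in phi.values() if not g)
    return fits and 100 * suppressed / len(phi) <= s

U = [frozenset({"a", "b"}), frozenset({"c"}),
     frozenset({"d"}), frozenset({"e", "f", "g", "h"})]
print(utility_satisfied(phi, U, s=15))  # True: only d is suppressed (12.5%)
```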
The ability to compute such counts exactly is crucial in many data analysis tasks (e.g., in generalized association rule mining [17]) where the support of itemsets corresponding to aggregate concepts (i.e., itemsets with a more general meaning than the items they are comprised of) needs to be determined, as illustrated below.

Example 3. Consider that the dataset of Fig. 1(a) has to be anonymized to support a study in which the number of patients diagnosed with diabetes needs to be accurately computed. Assume also that diagnosis codes a and b correspond to two different forms of diabetes. Observe that the number of patients suffering from diabetes (i.e., transactions having a, b or ab) in the dataset of Fig. 1(a) is the same as in the anonymization of this dataset shown in Fig. 2(c), because this anonymization satisfies the utility constraints of Fig. 1(c) and both a and b in u1 = {a, b} have been generalized.

3.4 Information Loss

There may be many anonymizations that satisfy the privacy and utility constraint sets, but they may not be equally useful. Since discovering the one that least harms data utility is important, we propose a measure to capture data utility based on information loss.

Definition 3.6. (Utility Loss for a Generalized Item). The Utility Loss (UL) for a generalized item ĩ_m is defined as

UL(ĩ_m) = (2^{|ĩ_m|} − 1) / (2^M − 1) × w(ĩ_m) × sup(ĩ_m, D̃) / N

where |ĩ_m| denotes the number of items from I mapped to ĩ_m using Φ, and w : Ĩ → [0, 1] is a function assigning a weight to ĩ_m.

UL measures the amount of information loss caused by generalizing a set of items as a product of three terms. The first term penalizes a generalized item based on the number of items from I mapped to it. This is because a generalized item can be interpreted as any of the 2^{|ĩ_m|} − 1 non-empty subsets of the set of items mapped to it [5], and there are 2^M − 1 possible non-empty subsets that can be formed using items from I. The second term is a weight specified by data owners to quantify the harm to data utility caused by a generalized item, according to the items mapped to it. Weights need to be between 0 and 1 for normalization purposes; larger weights are assigned to generalized items comprised of items that are more semantically distant, since such generalized items harm data utility more [20]. The semantic distance of items can be computed in many ways (e.g., based on the height of a hierarchy [16], the number of leaves of the closest common ascendant of these items in a hierarchy [24], with the aid of ontologies [22], or by expert knowledge). The third term is the support of a generalized item in the anonymized dataset, normalized by the number of transactions. Items that appear often in the anonymized dataset are penalized more, since they introduce more data distortion. Example 4 illustrates how UL can be computed.

Example 4. Consider Fig. 1(a) and the anonymized dataset of Fig. 2(c). Items a and b are generalized to (a, b), which is assigned a weight of 0.375 specified by the data owner, and has a support of 7 in Fig. 2(c). The UL for (a, b) is computed as (2^2 − 1)/(2^8 − 1) × 0.375 × 7/10 ≈ 0.003.
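A one-function sketch (ours) reproduces the arithmetic of Example 4:

```python
def utility_loss(gen_item, weight, support, M, N):
    """UL(i~) = (2^|i~| - 1) / (2^M - 1) * w(i~) * sup(i~, D~) / N."""
    return (2 ** len(gen_item) - 1) / (2 ** M - 1) * weight * support / N

# (a, b) in Fig. 2(c): |i~| = 2, M = 8 items, support 7 of N = 10 transactions.
print(utility_loss({"a", "b"}, weight=0.375, support=7, M=8, N=10))  # ~0.003
```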
Based on Definition 3.6, we quantify the total amount of information loss for an anonymized dataset D̃ as follows.

Definition 3.7. (Utility Loss for an Anonymized Dataset). The Utility Loss for an anonymized dataset D̃ is given by

UL(D̃) = Σ_{∀ ĩ_m ≠ ∅} UL(ĩ_m) + Σ_{∀ i_m ∈ S} Y(i_m)

where Y : I → ℝ is a function assigning a penalty to each suppressed item i_m from D.

The above definition captures data utility loss caused by both generalization and suppression. Specifically, for suppression, similar to [25], we allow data owners to assign a penalty to each suppressed item, according to the perceived importance of retaining this item in the anonymized result. For instance, each suppressed item could receive a penalty equal to its support, based on the fact that this defines the number of transactions from which it is eliminated.

3.5 Problem Statement

Given a transactional dataset D, a privacy constraint set P, a utility constraint set U, and parameters k, s, construct an anonymized version D̃ of D using the set-based anonymization model such that: (1) P and U are both satisfied, and (2) the amount of utility loss UL(D̃) is minimal.

4. ANONYMIZATION ALGORITHM

We now present COAT (COnstraint-based Anonymization of Transactions), a heuristic algorithm that solves the aforementioned problem using item generalization and suppression. Given D, P, U, k and s, COAT selects a privacy constraint p ∈ P, and applies the item generalizations that are specified by U and incur the smallest amount of information loss to satisfy p. When p cannot be satisfied by generalization, COAT suppresses the minimum number of items in p required to satisfy it. The process is repeated for all privacy constraints until P is satisfied. Pseudocode for COAT is provided in Algorithm 1.

Algorithm 1 COAT(D, P, U, k, s)
1.  D̃ ← D
2.  while P is not satisfied do
3.    find p corresponding to I s.t. I ← argmax_{∀ p_j ∈ P : p_j is not satisfied} sup(∪_{i_r ∈ p_j} i_r, D̃)
4.    while p is not satisfied ∧ |I| > 1 do
5.      i_m ← argmin_{∀ i_r ∈ I} sup(i_r, D̃)
6.      u_l ← {u_j ∈ U | i_m ∈ u_j}
7.      if |u_l| > 1
8.        Generalize(i_m, u_l, P)
9.      else if sup(i_m, D̃) < k
10.       Suppress(i_m, u_l, P, s)
11.   if p is not satisfied ∧ |I| = 1
12.     while p is not satisfied do
13.       i_m ← argmin_{∀ i_r ∈ I} sup(i_r, D̃)
14.       Suppress(i_m, u_l, P, s)
15. return D̃

Since the anonymized dataset D̃ is produced by transforming items in transactions of the original dataset D, in step 1 we initialize D̃ to D. Steps 2 to 14 comprise the main iteration of COAT, which aims to satisfy the privacy constraint set P (step 2). In step 3, COAT selects the unsatisfied privacy constraint p that corresponds to the itemset I with maximum support in D̃, since satisfying this constraint incurs minimal distortion of D: the minimum number of transactions in D̃ have to be distorted to augment the support of I to at least k. Next, while p remains unsatisfied in D̃ and p contains more than one generalized item (step 4), COAT performs steps 5 to 10. In step 5, the item i_m from p with the minimum support in D̃ is selected to be generalized. Selecting i_m this way attempts to minimize the number of generalizations required to satisfy p, as items with "low" support need to be generalized to meet the specified k. Subsequently, we identify the utility constraint u_l from U for which i_m ∈ u_l, in order to retrieve the items that are allowed to be generalized with i_m (step 6).
If at least one item apart from i_m is contained in u_l (step 7), item i_m is generalized in a way that minimizes information loss (step 8), as illustrated in Algorithm 2. Otherwise, we suppress i_m through the function Suppress, given in Algorithm 3, since applying generalization to increase the support of i_m to k would violate u_l (steps 9-10). Steps 11 to 14 aim to satisfy p by suppressing the minimum number of items in D̃ required to satisfy this constraint. When I consists of one (generalized) item only and p is not satisfied (step 11), we iteratively suppress items in I, starting with the one having the minimum support, until p is met (steps 12-14). Last, D̃ is released (step 15).

Algorithm 2 Generalize(i_m, u_l, P)
1. i_s ← argmin_{∀ i_r ∈ u_l \ {i_m}} UL((i_m, i_r))
2. ĩ ← (i_m, i_s)
3. foreach p ∈ P : i_m ∈ p ∨ i_s ∈ p
4.   p ← (p ∪ {ĩ}) \ {i_m, i_s}
5. u_l ← (u_l ∪ {ĩ}) \ {i_m, i_s}
6. Update transactions of D̃ based on ĩ

Algorithm 3 Suppress(i_m, u_l, P, s)
1. u_l ← u_l \ {i_m}
2. foreach p ∈ P : i_m ∈ p
3.   p ← p \ {i_m}
4. Remove i_m from all transactions of D̃
5. if more than s% of items are suppressed
6.   Error: U is violated

Algorithms 2 and 3 indicate how COAT performs generalization and suppression, respectively. Each of these operations involves updating the privacy and utility constraint sets P and U, as well as selected transactions of D̃.

Specifically, Generalize (Algorithm 2) operates as follows. In step 1, it identifies the item i_s that can be generalized together with i_m in a way that incurs the least possible information loss according to the UL measure. Step 2 maps the two items to a common generalized item ĩ. Following that, steps 3-5 update the privacy and the utility constraints to reflect this generalization. Finally, the transactions of D̃ that supported either of i_m, i_s are updated to support the generalized item ĩ instead.

Suppress (Algorithm 3) involves removing an item i_m from the privacy and utility constraint sets (steps 1-3), and from the transactions supporting it in D̃ (step 4). Finally, it checks whether the imposed suppression threshold s has been surpassed (step 5). This happens when utility constraints are overly restrictive (e.g., they require all items to remain intact in the anonymized dataset) and a "low" suppression threshold is used. In this case, data owners are notified that the utility constraint set U has been violated and the anonymization process terminates (step 6).

Example 5. We apply COAT on the dataset D of Fig. 1(a), using the constraints of Figs. 1(b) and 1(c), k = 5, and s = 15%. Since P contains p1, p2 whose itemsets are equally supported in D, COAT arbitrarily considers p1 = {a, b, c}. Then, it selects b, which has the minimum support among a, b and c, and generalizes it together with a, as required by u1. This increases the support of (a, b)c to 5, satisfying p1. Subsequently, COAT considers p2 = {d, e, f, g, h}. Item d is minimally supported among the items of p2, thus it is considered for generalization. However, d cannot be generalized due to u3, and it is suppressed, since its support is below k. After suppressing d, p2 is still not met. Since in p2 both g and h have minimum support, g is arbitrarily selected to be generalized. Item g can be generalized with any of e, f or h, but it is generalized with h, since (g, h) incurs the minimum information loss. This satisfies p2, and P is now satisfied. U is also satisfied, as shown in Example 2.
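To tie Algorithms 1-3 together, here is a deliberately simplified Python sketch of COAT's control flow (ours, not the authors' implementation). Items are modeled as frozensets so that generalization is just a set union; the sketch ignores the weights w, the suppression budget s, and the zero-support branch of Definition 3.3, approximates the UL-minimizing choice of Algorithm 2 by the support of the merged item, and breaks ties arbitrarily.

```python
def coat(raw_data, raw_P, raw_U, k):
    wrap = lambda xs: {frozenset([x]) for x in xs}
    data = [wrap(t) for t in raw_data]   # D~, initialized to D (step 1)
    P = [wrap(p) for p in raw_P]
    U = [wrap(u) for u in raw_U]

    def sup(itemset):
        return sum(1 for t in data if itemset <= t)

    def suppress(item):                  # Algorithm 3, without the s% check
        for coll in data + P + U:
            coll.discard(item)

    def merge(a, b):                     # Algorithm 2: map a and b to (a, b)
        for coll in data + P + U:
            if a in coll or b in coll:
                coll.discard(a); coll.discard(b); coll.add(a | b)

    def pending():
        return [p for p in P if 0 < sup(p) < k]

    while pending():
        p = max(pending(), key=sup)                      # step 3
        while 0 < sup(p) < k and len(p) > 1:             # steps 4-10
            im = min(p, key=lambda i: sup({i}))          # step 5
            ul = next(u for u in U if im in u)           # step 6
            if len(ul) > 1:                              # steps 7-8
                i_s = min(ul - {im},
                          key=lambda c: sum(1 for t in data if c in t or im in t))
                merge(im, i_s)
            elif sup({im}) < k:                          # steps 9-10
                suppress(im)
            else:
                break                                    # im alone is already safe
        while 0 < sup(p) < k:                            # steps 11-14
            suppress(min(p, key=lambda i: sup({i})))
    return data
```

On the running example, calling `coat` with the transactions of Fig. 1(a), the privacy constraints of Fig. 1(b), the utility constraints of Fig. 1(c), and k = 5 reproduces an anonymization equivalent to Fig. 2(c), up to tie-breaking.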
5. SPECIFYING PRIVACY AND UTILITY CONSTRAINTS

The notions of privacy and utility constraints, which reflect itemsets deemed potentially linkable and itemsets important for intended data analysis tasks respectively, are central to our anonymization approach. Our constraint specification framework allows data owners to formulate detailed constraints based on their specific privacy and utility requirements, which are given as input to COAT. However, acknowledging that constraint specification may be challenging for data owners who lack domain knowledge, we present simple methods that aim to help such data owners formulate constraints.

Section 5.1 discusses our Privacy constraint set generation (Pgen) algorithm, which constructs a privacy constraint set automatically, assuming that attackers can use any part of any transaction to link published data to individuals. Pgen works by searching the original dataset for itemsets with "low" support, each of which is treated as potentially linkable and is modeled as a privacy constraint. Although the resultant privacy constraint set corresponds to a stringent privacy policy, we believe that adopting this policy is a safe choice when data owners are unable to specify which items are potentially linkable. Section 5.2 provides a recipe to reduce the effort of specifying utility constraints.

5.1 Constructing a Privacy Constraint Set

Before presenting Pgen, we capture the largest part of a transaction that can be used in linking attacks using Definition 5.1.

Definition 5.1. (Maximal Infrequent Itemsets). Given a transactional dataset D and a parameter k, we define the set of maximal infrequent itemsets in D as those itemsets that have a support in the interval (0, k) in D, and none of whose proper supersets is supported in D.

Example 6 illustrates the above definition.

Example 6. Consider a dataset comprised of the last three transactions of the dataset of Fig. 1(a) (associated with the itemsets {a, c, f}, {a, c} and {b, h} respectively), and let k be set to 2. The lattice of itemsets in this dataset is illustrated in Fig. 6, in which the support of each supported itemset is shown next to it. As can be seen, the set of maximal infrequent itemsets in this dataset contains only acf and bh, as each of these itemsets is supported in the dataset and all of its proper supersets have a support of zero.

Figure 6: An example of maximal infrequent itemsets.

Given a transactional dataset D and a parameter k, Pgen constructs a privacy constraint set P that contains all the maximal infrequent itemsets in D. As mentioned above, the generated P can be given as input to COAT to ensure that anonymized data can prevent linking attacks based on any part of any transaction in D. The pseudocode of Pgen is provided in Algorithm 4.

Algorithm 4 Pgen(D, k)
1. P ← transactions of D, sorted with respect to their size in decreasing order
2. foreach T_r ∈ P, r = 1, ..., N
3.   foreach T_s ∈ P, s = (r + 1), ..., N
4.     if T_s ⊆ T_r: Remove T_s from P
5.   Itemset I ← T_r
6.   if sup(I, D) ≥ k
7.     Remove T_r from P
8. return P

Pgen starts by creating a privacy constraint set P, which is initialized to the set of transactions of the original dataset D, each of which is treated as a privacy constraint. Clearly, this set may contain redundant itemsets, which would result in unnecessary computational overhead if used as input to COAT, because COAT works by satisfying each privacy constraint iteratively. Therefore, Pgen implements a simple pruning strategy that removes redundant privacy constraints from P to reduce its size without affecting the privacy guarantees provided when P is satisfied. The first step of this strategy is to populate P with the set of transactions of D, sorted in decreasing size. Subsequently, in steps 2-4, transactions (T_s) that are subsets of other transactions (T_r) are identified and removed from P, because these transactions cannot correspond to maximal infrequent itemsets, according to Definition 5.1. Next, steps 6 and 7 ensure that privacy constraints that do not require protection (i.e., itemsets induced by transactions having a support of at least k in D) are not included in P. Finally, P, which contains the set of maximal infrequent itemsets in D, is returned in step 8. This privacy constraint set can be given as input to COAT. Notice that Pgen has a quadratic run-time complexity, as it involves sorting, pairwise comparison, and support computation for transactions.
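A compact re-implementation of Pgen (ours; it collapses duplicate candidate transactions up front, which Algorithm 4's subset test also achieves) is shown below:

```python
def pgen(data, k):
    """Privacy constraints = maximal infrequent itemsets induced by D's transactions."""
    data = [frozenset(t) for t in data]
    candidates = sorted(set(data), key=len, reverse=True)  # dedup, largest first
    P = []
    for t in candidates:
        if any(t <= kept for kept in P):          # subset of a kept constraint
            continue
        if sum(1 for u in data if t <= u) < k:    # support in (0, k): protect it
            P.append(t)
    return P

# Example 7: the last three transactions of Fig. 1(a), with k = 2.
print(pgen([{"a", "c", "f"}, {"a", "c"}, {"b", "h"}], k=2))
# [frozenset({'a', 'c', 'f'}), frozenset({'b', 'h'})]  (element order may vary)
```

Subsets of frequent transactions are themselves frequent, so filtering on support alone yields the same constraint set as Algorithm 4's explicit subset pruning.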
To illustrate how Pgen works, we provide Example 7.

Example 7. Consider applying Pgen on the dataset of Example 6, using k = 2. This results in initializing P with three privacy constraints p1 = {a, c, f}, p2 = {a, c} and p3 = {b, h} (one for each transaction), which are sorted in decreasing size. Subsequently, p2 is removed from P, because {a, c} is a subset of p1 = {a, c, f}. Next, Pgen checks the support of p1 and retains it in P, as the support of p1 in this dataset is 1 ∈ (0, 2). In the final iteration, Pgen examines p3 and retains it in P for the same reason. Thus, Pgen returns P = {p1, p3}.

5.2 Formulating a Utility Constraint Set

While privacy constraints can be extracted automatically as discussed above, this is difficult for utility constraints, because they model application-specific data analysis requirements. Thus, we assume that data owners are able to specify utility constraints to avoid distorting itemsets that need to be used in intended applications.

When interested in generating anonymized data that allows the counts of aggregate concepts to be accurately determined, for example, data users can formulate a utility constraint for each of these concepts (itemsets), as explained in Section 3.3. These itemsets may be selected with the help of hierarchies or ontologies, which are specified by domain experts or constructed in an automated fashion [14]. A utility constraint containing the remaining items (i.e., those not contained in the selected itemsets) should also be specified, to ensure that the utility constraint set is a partition of I (see Definition 3.4). We emphasize that the way all items are generalized is governed by the utility loss function (see Definition 3.6), which forces semantically related items to be generalized together. Example 8 illustrates how utility constraints may be specified.
Example 8. Consider that the dataset of Fig. 1(a) has to be anonymized to support the study of Example 3, in which the number of patients diagnosed with diabetes (i.e., transactions having a, b, or ab) needs to be accurately computed. To support this study, the hospital can specify a utility constraint {a, b}, and include all the remaining diagnosis codes in a second constraint {c, d, e, f, g, h}.

6. EXPERIMENTAL EVALUATION

In this section, we compare COAT to Apriori [20] using four series of experiments. In the first series, we compare the amount of information loss the algorithms incur to achieve k^m-anonymity. The second and third series of experiments examine whether the algorithms can meet detailed privacy and utility requirements without harming data utility, and the last series evaluates their efficiency.

6.1 Experimental setup and metrics

We use two real-world transactional datasets, BMS-WebView-1 (BMS1) and BMS-WebView-2 (BMS2), which contain click-stream data from two e-commerce sites. The datasets have been used in evaluating prior work [5, 20] and also as benchmarks in the 2000 KDD-Cup competition. Table 1 summarizes their characteristics.

Dataset  N      |I|   Max. |T|  Avg. |T|
BMS1     59602  497   267       2.5
BMS2     77512  3340  161       5.0

Table 1: Description of used datasets.

To ensure a fair comparison between COAT and Apriori, we configured the latter with the same hierarchies as in [20] and set the weights w(ĩ_m) used in COAT based on a notion of semantic distance computed according to the aforementioned hierarchies [24]. We did not compare our approach to the two other algorithms proposed by the authors of Apriori in [20], because these algorithms have been shown to be comparable to Apriori in terms of effectiveness, while they are only applicable to datasets with a small domain of less than 50 items [20] (typically, transactional datasets have a domain size in the order of hundreds or thousands). We also did not compare our approach to those of [25] and [5], since these approaches require a fixed categorization of items into potentially linkable and sensitive, a classification that is not applicable to the problem we tackle. Both COAT and Apriori were implemented in C++. All experiments were performed on an Intel 2.8GHz machine equipped with 4GB of RAM.

To quantify information loss, we considered aggregate query answering as an indicative application, and measured the accuracy of answering workloads of queries on anonymized data produced by the tested algorithms. This is a widely-used approach to characterizing information loss [5, 9, 23] and is invariant of the way the tested algorithms work. Consider the COUNT() query Q shown in Fig. 7. We obtain an accurate answer a(Q) when this query is applied to the original data D, but not in the case of the generalized data D̃, as original items from I are mapped to generalized ones in Ĩ. Therefore, we can only estimate the answer for Q.

Q: SELECT COUNT(T_n (or T̃_n))
   FROM D (or D̃)
   WHERE i_1 ∈ T_n ∧ i_2 ∈ T_n ∧ ... ∧ i_q ∈ T_n
   (or Φ(i_1) ∈ T̃_n ∧ ... ∧ Φ(i_q) ∈ T̃_n)

Figure 7: COUNT() query example.

This estimation can be performed by computing the probability that a transaction of D̃ satisfies Q as Π_{r=1}^{q} p(i_r), where p(i_r) is the probability of mapping an item i_r, r = 1, ..., q, in the query to a generalized item ĩ_m, assuming that ĩ_m can include any possible subset of the items mapped to it with equal probability, and that there are no correlations among generalized items [5, 9, 23]. An estimated answer e(Q) of Q is then derived by summing the corresponding probabilities across all transactions T̃_n of D̃. To measure the accuracy of estimating Q, we use the Relative Error (RE) measure, computed as RE(Q) = |a(Q) − e(Q)| / a(Q). Given a workload of queries, the Average Relative Error (AvgRE) over all queries reflects how well anonymized data supports query answering [9, 23]. To measure AvgRE, we constructed workloads comprised of 1000 COUNT() queries similar to Q. The items participating in these queries were selected randomly from the generalized items.
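The estimation procedure can be sketched as follows (ours). We read p(i_r) as the chance that a uniformly random non-empty subset of the generalized item Φ(i_r) contains i_r, i.e., 2^(n−1)/(2^n − 1) for an n-item generalized item; the wording above admits other readings, so treat this as one plausible instantiation:

```python
from math import prod

def p_contains(gen_item):
    """P(a random non-empty subset of gen_item contains a fixed member)."""
    n = len(gen_item)
    return 2 ** (n - 1) / (2 ** n - 1)

def estimate_count(query_items, phi, anon):
    """e(Q): sum the satisfaction probability over supporting transactions."""
    gens = [phi[i] for i in query_items]
    return sum(prod(p_contains(g) for g in gens)
               for t in anon if all(g in t for g in gens))

def relative_error(a, e):
    return abs(a - e) / a
```

For instance, on the running example the query {a} maps to (a, b), giving e(Q) = 7 × 2/3 ≈ 4.67 against a(Q) = 6, i.e., RE ≈ 0.22.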
6.2 Achieving k^m-anonymity

In this section, we empirically confirm that COAT not only satisfies k^m-anonymity, but does so with up to 9 times less information loss than Apriori. Specifically, we ran COAT by including all m-itemsets in the privacy constraint set P, considering a utility constraint set U that contains all items (effectively allowing all possible generalizations), and setting s = 0.5%. Both algorithms used the same k and m values. The results with respect to the AvgRE and UL measures are summarized in Sections 6.2.1 and 6.2.2, respectively. In these experiments, COAT did not suppress any items.

6.2.1 Capturing data utility using AvgRE

Figs. 8(a) and 8(b) report AvgRE scores for BMS1, where the number of items q included in a query was 1 and 3 respectively, m was set to 2, and k was selected over the range [2, 50]. As expected, increasing k induced more information loss due to the utility/privacy trade-off. Increasing q had a similar effect, because accurately answering queries involving many items is more difficult. COAT outperformed Apriori in both cases, achieving up to 9 times better AvgRE scores. This is because, as k increases, the recoding model of Apriori forces an increasingly large number of items to be generalized together, while the model in COAT generalizes no more items than required to protect an itemset. Similar results were achieved for BMS2 (omitted for brevity).

Figure 8: AvgRE vs. k for (a) q = 1 and (b) q = 3 in BMS1.

We also executed COAT and Apriori using k = 5, and varied m between 1 and 3. The AvgRE scores for BMS1 are shown in Fig. 9(a). Apriori incurred 7 times more information loss than COAT to anonymize BMS1 when m = 3. This is because the number of items that Apriori forces to be generalized together to protect m-itemsets increases substantially as m grows. The impact of this generalization strategy on data utility was even more evident in the case of BMS2, as shown in Fig. 9(b).

Figure 9: AvgRE vs. m for (a) BMS1 and (b) BMS2.
6.2.2 Capturing data utility using UL

We compared the two algorithms with respect to the UL measure. Fig. 10(a) shows the result of running these algorithms on BMS2 using m = 2 and k values between 2 and 50. Observe that Apriori was fairly insensitive to k up to 25. In fact, Apriori over-generalized itemsets, increasing their support to much larger values than k due to its recoding strategy. On the other hand, COAT achieved a much better result for all tested k values, due to the fine-grained generalization model it employs. We also examined how the algorithms fared with respect to UL when m varies between 1 and 3 and k = 5 (Fig. 10(b)). Observe that Apriori incurred substantially more information loss than COAT for all tested m values. This again suggests that the generalization scheme of Apriori distorts data much more than our set-based anonymization strategy. Similar results were obtained for BMS1 (omitted for brevity). We do not report additional results with respect to UL because COAT is designed to optimize this measure, and thus outperformed Apriori in all tested cases.

Figure 10: (a) UL vs. k and (b) UL vs. m for BMS2.

6.3 Privacy constraints vs. data utility

In this section, we experimentally confirm that our approach can generate anonymizations with a high level of data utility through the specification of detailed privacy constraints by data owners. The impact of constraints generated by Pgen on data utility will be examined in Section 7.

We constructed two types of privacy policies to simulate different privacy requirements: one in which the itemsets that require protection are all of the same size and comprised of certain items from I, and another in which such itemsets differ in size. The utility constraint set U used in COAT was set as in Section 6.2.

6.3.1 Protecting itemsets comprised of certain items

We considered 5 privacy policies of the first type, PP1, ..., PP5, each of which assumes that all 2-itemsets containing a certain percentage of randomly selected items require protection with k = 5. The mappings between privacy policies and the percentage of such items are as follows: PP1 → 2%, PP2 → 5%, PP3 → 10%, PP4 → 25%, PP5 → 50%. These policies are taken into account by COAT, but not by Apriori, which needs to protect all 2-itemsets to satisfy them.

We first studied how privacy policies affect data utility, as captured by AvgRE. Figs. 11(a) and 11(b) illustrate the results for q = 1 and q = 3, respectively. As expected, because it avoids unnecessarily protecting itemsets that are not specified by these policies, COAT distorted data significantly less than Apriori, as reflected in its significantly better AvgRE scores. Furthermore, as policies became stricter (i.e., requiring protection of itemsets induced by a larger percentage of items from I), the AvgRE scores for COAT became slightly worse due to the utility/privacy trade-off. Nevertheless, these scores remain substantially better than those of Apriori in all cases. We repeated the same experiments for BMS2, and obtained similar results, shown in Figs. 11(c) and 11(d), respectively.
Figure 11: AvgRE vs. privacy policy (a) for q = 1 and (b) for q = 3 (in BMS1), and (c) for q = 1 and (d) for q = 3 (in BMS2).

6.3.2 Protecting itemsets of varying size

We simulated 4 privacy policies of the second type: PP6, ..., PP9. In each of these policies, P consisted of itemsets with sizes 1 to 4, as shown in Table 2, and k = 5. To account for these policies, Apriori had to protect all possible 4-itemsets, and thus it was configured with m = 4.

Privacy Policy  % of items  % of 2-itemsets  % of 3-itemsets  % of 4-itemsets
PP6             33%         33%              33%              1%
PP7             30%         30%              30%              10%
PP8             25%         25%              25%              25%
PP9             16.7%       16.7%            16.7%            50%

Table 2: Summary of privacy policies PP6, ..., PP9.

The AvgRE scores for BMS1 and BMS2, for a workload comprised of queries with q = 2, are depicted in Figs. 12(a) and 12(b), respectively. Notice that COAT achieved better AvgRE scores in both datasets, permitting queries to be answered up to 40 times more accurately than Apriori. This is because COAT applies generalization to each privacy constraint separately, thereby applying the minimum level of generalization required to satisfy the specified constraint.

Figure 12: AvgRE vs. privacy policy for q = 2 (a) in BMS1 and (b) in BMS2.

6.4 Utility constraints vs. data utility

The experiments reported in this section examine the effect of utility constraints on data utility. We assumed 4 utility policies: UP1, ..., UP4. Each policy contains groups of a certain number of semantically close items (i.e., sibling items in the hierarchy). The mappings between utility policies and the size of these groups are as follows: UP1 → 25, UP2 → 50, UP3 → 250, and UP4 → 500. Items in each group are allowed to be generalized together. Note that UP1 and UP2, which have smaller group sizes, are very stringent and may require suppression to be satisfied. For this reason, we configured COAT with a small suppression threshold s of 0.5%. Apriori does not address these policies, because its item generalization is not guided by utility constraints. Also, the privacy constraint set P included all 2-itemsets, and Apriori was run with m = 2.

AvgRE scores for workloads of queries with q = 1 and q = 3 are shown in Figs. 13(a) and 13(b) respectively, for BMS1. Observe that COAT significantly outperformed Apriori for all utility policies. Furthermore, the number of suppressed items was very small (0.01%), and suppression occurred only in the case of UP1. This illustrates the effectiveness of COAT, which suppresses the minimum number of items required, and only when utility constraints cannot otherwise be met. We also note that COAT was able to satisfy the imposed utility policies in all cases, unlike Apriori, which was unable to meet any of them. Interestingly, the AvgRE scores for COAT were not substantially affected by the utility policies. This is because COAT applied a much lower level of generalization than that specified by the utility constraints.
Figure 13: AvgRE vs. Utility Policy (a) for q = 1 and (b) for q = 3 (in BMS1)

6.5 Efficiency of Computation

We compared COAT and Apriori in terms of efficiency. We first examined the scalability of these algorithms with respect to dataset cardinality, by applying them to datasets constructed by randomly selecting transactions of BMS1. COAT was configured by setting P and U as in Section 6.2, m = 2, and k = 5; Apriori was run with the same k and m values. Fig. 14(a) reports run-time as cardinality varies from 1K to 50K transactions. COAT scales better than Apriori with the size of the dataset, being up to 2.5 times faster. This is because COAT prunes the search space by discarding protected itemsets as cardinality increases, whereas Apriori considers all m-itemsets as well as their possible generalizations.

Figure 14: Efficiency vs. (a) dataset size |D| and (b) k

Last, we evaluated the impact of k on the run-time of COAT and Apriori on BMS1. We used k values between 2 and 50, and set all other parameters as in the previous experiment. As can be seen in Fig. 14(b), COAT is slightly less efficient than Apriori. This is because COAT generalizes one item at a time, exploiting the flexibility of the set-based anonymization model, whereas Apriori generalizes entire subtrees of items and thus reaches the specified k faster. Nevertheless, the computation cost of COAT was less than half a minute, remaining sub-linear for all tested values of k.

7. CASE STUDY: DIAGNOSIS CODES

In this section, we examine whether COAT can produce anonymized data that permits accurate analysis in a real-world scenario involving detailed, application-specific utility requirements. In this context, a transactional dataset (referred to as EMR) derived from the Electronic Medical Record system of the Vanderbilt University Medical Center [18] needs to be published to enable certain biomedical studies. Each transaction of EMR corresponds to a distinct patient and contains his/her diagnoses in the form of ICD-9 codes (ICD-9 is the official system of assigning codes to diagnoses in the U.S.). Table 3 summarizes the characteristics of EMR.

Dataset   N      |I|    Max. |T|   Avg. |T|
EMR       1336   5830   25         3.1

Table 3: Description of the EMR dataset

The studies that the anonymized data needs to support focus on 20 different disorders, each of which is modeled as a set of ICD-9 codes. For instance, pancreatic cancer is represented as a set of 7 ICD-9 codes, which correspond to different forms of pancreatic cancer and indicate that a patient suffers from this disorder. To support these studies, the number of patients suffering from each of these disorders needs to be accurately computed (a sketch of this statistic follows). At the same time, the linkage of transactions to patients' identities based on any combination of ICD-9 codes must be prevented, because the vast majority of ICD-9 codes contained in EMR can be found in other sources, as verified in our previous study [10].
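To make this utility requirement concrete, the following sketch computes the statistic that the anonymized data must preserve for each disorder: the number of patients whose transaction contains at least one of the disorder's ICD-9 codes. The example transactions and the code set shown are hypothetical placeholders, not the actual code sets used in the study.

```python
def patients_with_disorder(transactions, disorder_codes):
    """Count patients (transactions) containing at least one ICD-9
    code from the disorder's code set; a utility constraint over
    `disorder_codes` must keep this count accurately computable
    after anonymization."""
    disorder_codes = set(disorder_codes)
    return sum(1 for t in transactions if disorder_codes & set(t))

# Hypothetical example; the study's disorders map to specific code sets.
transactions = [{"157.0", "401.9"}, {"250.00"}, {"157.1"}]
pancreatic_cancer = {"157.0", "157.1", "157.2"}  # illustrative subset
assert patients_with_disorder(transactions, pancreatic_cancer) == 2
```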
To achieve both privacy and utility, we used our Pgen algorithm to construct a privacy constraint set, and formulated a utility constraint set comprised of 20 utility constraints, one for each disorder (e.g., we specified a utility constraint that contains the 7 ICD-9 codes corresponding to pancreatic cancer). Furthermore, we configured COAT by setting the weights w(ĩm) used in it based on a notion of semantic similarity [24] computed according to the hierarchy for ICD-9 codes (http://www.cdc.gov/nchs/icd/icd9cm.htm); a sketch of such a hierarchy-based similarity appears at the end of this section. We also limited the maximum allowable fraction of suppressed items by setting s to 0.5%. Apriori was also applied to anonymize EMR, although it provides no guarantees that utility constraints are satisfied.

We evaluated the utility of the anonymizations produced by both COAT and Apriori in two ways. First, we examined whether the produced anonymizations satisfied the specified utility constraint set. In fact, the anonymizations constructed by COAT satisfy this set for all tested k values (namely 2, 5, 10, 25, and 50). Thus, COAT managed to generate practically useful anonymizations that allow the number of patients having any of the 20 disorders used in the intended studies to be accurately computed (see Corollary 3.3). On the other hand, the anonymizations constructed by the Apriori algorithm did not satisfy the specified utility constraint set for any of the tested k values. Therefore, we did not evaluate the data utility of the anonymizations produced by Apriori using other criteria.

In addition to satisfying the specified utility constraints, it is also important to generate anonymized data with "low" information loss that can support general data analysis tasks. Therefore, we investigated whether our method can generate anonymizations that are useful in aggregate query answering. To capture the amount of information loss, we used the AvgRE measure, discussed in Section 6.1. AvgRE was computed using two different workloads, referred to as W1 and W2. W1 is comprised of COUNT() queries that retrieve combinations of ICD-9 codes supported by at least 10% of the transactions of EMR. These combinations correspond to frequently co-occurring disorders (e.g., diabetes and hypertension) that are important in the context of biomedical data analysis, and they are different from the 20 disorders contained in the utility constraint set. W2 is similar to the workload considered in Section 6.1: it is comprised of 1000 COUNT() queries similar to the query shown in Fig. 7, each involving 2 ICD-9 codes randomly selected among generalized items. This workload models a scenario in which the anonymized data is queried by users with various data analysis requirements.

Fig. 15(a) reports the AvgRE scores for EMR, where k was selected over the range [2, 50] and W1 was used. As can be seen, the AvgRE scores indicate that the anonymized data permits queries that are common in biomedical data analysis tasks to be answered fairly accurately, even when a strict privacy policy is adopted. The corresponding result for W2 is reported in Fig. 15(b). Again, the AvgRE scores confirm that a low level of information loss was incurred to anonymize EMR, particularly when k is 5 or lower, as is commonly the case when publishing biomedical data [3].

Figure 15: AvgRE vs. k for EMR-D computed using (a) W1 and (b) W2
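As a rough illustration of how a hierarchy-based semantic similarity can be computed over ICD-9 codes, the sketch below implements a Wu–Palmer-style measure, under which two codes with a deep common ancestor are considered more similar. This particular formula and the child-to-parent tree encoding are our assumptions for exposition; see [24] for the measure actually used to set the weights w(ĩm).

```python
def depth(parent, node):
    """Depth of `node` in a tree given as a child -> parent map (root has depth 0)."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

def ancestors(parent, node):
    """`node` together with all of its ancestors up to the root."""
    out = {node}
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def wu_palmer(parent, a, b):
    """Wu-Palmer-style similarity: twice the depth of the deepest common
    ancestor of a and b, divided by the sum of their depths."""
    common = ancestors(parent, a) & ancestors(parent, b)
    lca_depth = max(depth(parent, n) for n in common)
    return 2.0 * lca_depth / (depth(parent, a) + depth(parent, b))

# Toy ICD-9-like fragment: 157.0 and 157.1 are sibling pancreatic cancer
# codes under 157, while 401.9 (hypertension) sits in another chapter.
parent = {"157.0": "157", "157.1": "157", "157": "150-159",
          "150-159": "ICD9", "401.9": "401", "401": "390-459",
          "390-459": "ICD9"}
print(wu_palmer(parent, "157.0", "157.1"))  # ~0.67: close in the hierarchy
print(wu_palmer(parent, "157.0", "401.9"))  # 0.0: only the root in common
```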
In summary, our case study confirms the effectiveness of our anonymization framework when there are specific utility requirements: it allows the EMR dataset to be published in a way that prevents linking attacks with respect to any portion of any transaction of this dataset, supports biomedical studies focusing on specific disorders, and permits accurate data analysis.

8. CONCLUSIONS AND FUTURE WORK

Existing approaches for anonymizing transactional data often produce excessively distorted data that is of limited utility, because they incorporate coarse privacy requirements, are agnostic with respect to data utility requirements, and search only a fraction of the solution space. In response, we developed a novel approach that overcomes these limitations by allowing fine-grained privacy and utility requirements to be specified as constraints, and COAT (COnstraint-based Anonymization of Transactions), an algorithm that transforms data using item generalization and suppression to satisfy the specified constraints while minimally distorting the data. We also demonstrated the effectiveness of our approach using extensive experiments on benchmark datasets and a case study on patient-specific data containing diagnosis codes. Our results demonstrate that COAT is able to satisfy a wide range of privacy and utility requirements with less information loss than the state-of-the-art method, and to anonymize data in a way that prevents identity disclosure and retains data utility for intended applications.

This work also opens up several directions for future investigation. First, although experimentally shown to be both effective and efficient in practice, the COAT algorithm is heuristic in nature and, as such, does not guarantee optimal anonymizations in terms of minimum information loss. To address the growing size of datasets and domains, we intend to develop approximation algorithms that can offer such guarantees. Second, we aim to extend our framework to deal with the problem of attribute disclosure, based on the l-diversity privacy principle [11].

9. REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487–499, 1994.
[2] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. Blocking anonymity threats raised by frequent itemset mining. In ICDM '05, pages 561–564, 2005.
[3] K. El Emam and F. K. Dankar. Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association, 15(5):627–637, 2008.
[4] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In ICDE '05, pages 205–216, 2005.
[5] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE '08, pages 715–724, 2008.
[6] M. Hepp. Possible ontologies: How reality constrains the development of relevant ontologies. IEEE Internet Computing, 11(1):90–96, 2007.
[7] V. S. Iyengar. Transforming data to satisfy privacy constraints. In KDD '02, pages 279–288, 2002.
[8] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In SIGMOD '05, pages 49–60, 2005.
[9] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE '06, page 25, 2006.
[10] G. Loukides, J. C. Denny, and B. Malin. Do clinical profiles constitute privacy risks for research participants? To appear in the Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, 2009.
[11] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE '06, page 24, 2006.
[12] N. Marsden-Haug, V. B. Foster, P. L. Gould, E. Elbert, H. Wang, and J. A. Pavlin. Code-based syndromic surveillance for influenzalike illness by International Classification of Diseases, Ninth Revision. Emerging Infectious Diseases, 13(2):207–216, 2007.
[13] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy '08, pages 111–125, 2008.
[14] K. Punera, S. Rajan, and J. Ghosh. Automatic construction of n-ary tree based taxonomies. In ICDMW '06, pages 75–79, 2006.
[15] J. Rodgers. Quality assurance and medical ontologies. Methods of Information in Medicine, 45(3):267–274, 2006.
[16] P. Samarati. Protecting respondents' identities in microdata release. IEEE TKDE, 13(6):1010–1027, 2001.
[17] R. Srikant and R. Agrawal. Mining generalized association rules. In VLDB '95, pages 407–419, 1995.
[18] W. W. Stead, R. A. Bates, J. Byrd, D. A. Giuse, R. A. Miller, and E. K. Shultz. Case study: The Vanderbilt University Medical Center information management architecture. In Clinical Information Systems: A Component-Based Approach. Springer, 2003.
[19] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10:557–570, 2002.
[20] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. PVLDB, 1(1):115–125, 2008.
[21] V. S. Verykios and A. Gkoulalas-Divanis. A survey of association rule hiding methods for privacy. In Privacy-Preserving Data Mining: Models and Algorithms, chapter 11, pages 267–289. Springer, 2008.
[22] L. Wang and X. Liu. A new model of evaluating concept similarity. Knowledge-Based Systems, 21(8):842–846, 2008.
[23] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In VLDB '06, pages 139–150, 2006.
[24] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization using local recoding. In KDD '06, pages 785–790, 2006.
[25] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In KDD '08, pages 767–775, 2008.