On Breaching Enterprise Data Privacy Through Adversarial Information Fusion


Authors: Srivatsava Ranjit Ganta, Raj Acharya

Dept. of Computer Science, Penn State University, University Park, PA, USA
ranjit@cse.psu.edu, acharya@cse.psu.edu

Abstract — Data privacy is one of the key challenges faced by enterprises today. Anonymization techniques address this problem by sanitizing sensitive data such that individual privacy is preserved while allowing enterprises to maintain and share sensitive data. However, existing work on this problem makes inherent assumptions about the data that are impractical in day-to-day enterprise data management scenarios. Further, applying existing anonymization schemes to enterprise data could open the door to adversarial attacks in which an intruder uses information fusion techniques to inflict a privacy breach. In this paper, we shed light on the shortcomings of current anonymization schemes in the context of enterprise data. We define and experimentally demonstrate the Web-Based Information-Fusion Attack on anonymized enterprise data. We formulate the problem of Fusion Resilient Enterprise Data Anonymization and propose a prototype solution to address this problem.

I. INTRODUCTION

Data privacy is one of the key challenges faced by enterprises today. Sensitive individual-specific information such as customer data and employee data is maintained and used for various purposes. Several instances of data privacy breaches [1] in the recent past have resulted in financial as well as reputation losses for enterprises. Anonymization techniques address this problem by sanitizing sensitive data such that individual privacy is preserved while allowing enterprises to maintain and share sensitive data. Recently, there has been a lot of work [2] [3] [4] [5] [6] on data anonymization schemes.
These techniques can be broadly classified into two types:

• Partitioning-based anonymization schemes: The first class of techniques guarantees privacy by partitioning the data such that an adversary cannot uniquely identify the individuals falling in each partition. The basic idea behind these techniques is "blending in the crowd," which guarantees that an individual or entity cannot be distinguished from a minimum number of other people. K-anonymity [2], l-diversity [4], and other work in this line [7] achieve partitioning through generalization and suppression techniques. Techniques such as [8], [9], on the other hand, achieve this by clustering the data. Partitioning-based solutions are mainly applied to non-interactive scenarios where the data needs to be published/released after anonymization.

• Perturbation-based anonymization schemes: The other class of techniques guarantees data privacy by adding noise to the sensitive data and thus preventing identification. Solutions in this category can be further classified based on whether the setting considered is interactive or not. Solutions such as [5] [6] add noise to perform specific data mining tasks in a non-interactive setting. More recent solutions such as [10] add randomized noise in an interactive setting wherein the particular function to be evaluated on the data is known a priori.

In this paper, we consider a non-interactive setting where the data needs to be released/published. We focus on partitioning-based schemes, as they are readily applicable to generic databases, including data with categorical attributes. Table I depicts a typical individual-specific dataset considered in the partitioning-based anonymization literature.
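The generalization step behind partitioning-based schemes can be sketched concretely. The following is an illustrative toy implementation, not the paper's own: it coarsens one quasi-identifier (Age) into progressively wider ranges and truncates Zipcode to a prefix until every combination of generalized values is shared by at least k records. The candidate widths and the fixed zipcode truncation are assumptions for illustration.

```python
from collections import Counter

def generalize_age(age, width):
    lo = (age // width) * width          # e.g. width=10 maps 28 -> [20-29]
    return f"[{lo}-{lo + width - 1}]"

def k_anonymize(records, k, widths=(1, 5, 10, 20, 50)):
    """Coarsen the Age quasi-identifier until each (Age, Zipcode prefix)
    group contains at least k records. Zipcode is truncated to its
    3-digit prefix as a second, fixed generalization."""
    for width in widths:
        groups = Counter(
            (generalize_age(r["Age"], width), r["Zipcode"][:3])
            for r in records
        )
        if min(groups.values()) >= k:
            return [
                {**r,
                 "Age": generalize_age(r["Age"], width),
                 "Zipcode": r["Zipcode"][:3] + "**"}
                for r in records
            ]
    raise ValueError("no generalization level achieved k-anonymity")

# Quasi-identifiers of the four individuals from Table I.
records = [
    {"Age": 28, "Zipcode": "13053"},
    {"Age": 29, "Zipcode": "13068"},
    {"Age": 21, "Zipcode": "13068"},
    {"Age": 23, "Zipcode": "13053"},
]
released = k_anonymize(records, k=2)
```

Each released record is now indistinguishable from at least one other on the generalized quasi-identifiers, which is exactly the "blending in the crowd" guarantee.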
Observe that there exists a classification of data attributes (as shown in Table I) into three different types:

  Name       SSN            Zipcode   Age   Nationality   Condition
  Alice      111-111-1111   13053     28    Russian       AIDS
  Bob        222-222-2222   13068     29    American      Flu
  Christine  333-333-3333   13068     21    Japanese      Cancer
  Robert     444-444-4444   13053     23    American      Meningitis

  (Identifier attributes: Name, SSN. Quasi-identifier attributes: Zipcode, Age, Nationality. Sensitive attribute: Condition.)

                     TABLE I: SENSITIVE DATABASE

1) Identifier Attributes: Attributes carrying explicit identifiers, such as Name, SSN, etc.

2) Quasi-Identifier Attributes: Attributes that could indirectly lead to identification of individuals in the database, such as Age, Zipcode, and Gender. These are also sometimes referred to as Non-Sensitive attributes.

3) Sensitive Attributes: Attributes carrying the sensitive information about the individuals, such as Disease, Income, etc.

Based on this classification, existing solutions assume that the Identifier Attributes in the database are stripped prior to the anonymization process. This rests on the implicit assumption that the identifier attributes are necessary neither for the release nor for the intended purpose of the release. We believe that this assumption is too restrictive and is even impossible in some scenarios where the presence of explicit identifiers is necessary for the intended purpose of the anonymized release [11]. Consider the following scenario:

Enterprise Data - Example: Table II depicts a customer database in a typical financial institution. The data contains the names of all the customers along with certain non-sensitive and sensitive information. The non-sensitive attributes are: Investment Volume Index (Invst Vol), to indicate the volume of investment (number of shares traded, etc.) made by the customer in the past; Investment Amount Index (Invst Amt), to indicate the amount of investment (amount involved in previous trades, etc.)
made by the customer in the past; and Customer Valuation (Valuation), to indicate the assigned value of the customer. The only sensitive attribute, Customer Personal Income (Income), corresponds to the customer's personal income. Databases such as this are an integral part of enterprises and are maintained and used for key operations every day. In this paper, we shall refer to them as Enterprise Databases.

  Name       Invst Vol   Invst Amt   Valuation   Income
  Alice      8           7           4           91,250
  Bob        5           4           4           74,340
  Christine  4           5           5           75,123
  Robert     9           8           9           98,230

                     TABLE II: ENTERPRISE DATA

The internal release of such data along with explicit identifiers (Customer Names) is a necessity for several enterprise operations such as accounting, record keeping, etc. However, at the same time, such a release should not compromise the privacy of sensitive information (Customer Personal Income). Note that trivial solutions such as removal of identifiers or use of pseudonyms are not viable in such scenarios. The key properties here are:

• The inclusion of identifying information is necessary for the release to serve the intended purpose.

• Sensitive data should not be disclosed even in the presence of explicit identifiers.

In the enterprise database scenario described above, anonymizing data using existing techniques falls short in providing adequate protection against adversarial attacks. This is because existing techniques [2] [3] [4] make the assumption that Identifier Attributes are stripped prior to the anonymization process. Consider the possibility in which an adversary (possibly an insider) is given (or otherwise acquires) access to the anonymized release of an enterprise database.
Now, the adversary can use the identifiers present in the release to collect auxiliary information about the individuals present in the database from a multitude of sources such as the web (homepages, blogs, etc.). The adversary could then fuse the auxiliary information with the anonymized release to estimate sensitive data.

Web-Based Information-Fusion Attack: Consider the enterprise data example described earlier, as shown in Table II. One way to internally release this table is to remove the customer salary information and publish the non-sensitive data as it is. The problem with this approach is that one can estimate the sensitive data based on the non-sensitive information present in the release. The solution is to anonymize the non-sensitive information and remove the sensitive information. Table III shows the anonymized release of this data using a partitioning-based anonymization scheme, namely K-anonymity as proposed by Sweeney et al. [2]. We use K-anonymization as a representative of partitioning-based solutions for data anonymization, as other solutions in this category produce similar results.

  Name       Invst Vol   Invst Amt   Valuation   Income
  Alice      [5-10]      [5-10]      [1-5]       -
  Bob        [5-10]      [1-5]       [1-5]       -
  Christine  [1-5]       [1-5]       [1-5]       -
  Robert     [5-10]      [5-10]      [5-10]      -

                TABLE III: ANONYMIZED ENTERPRISE DATA

Table III is now deemed safe and is released internally within the enterprise. Now, consider the scenario in which an adversary employee, Bob, is granted access to this anonymized release. Note that the release does not give Bob the sensitive information, i.e., the customer personal income data. However, he has access to non-sensitive information such as the customer valuation, investment volume, etc. Bob's goal is to use the anonymized release to estimate the customer personal income values.
To achieve this, he uses the customer names present in the release to search for additional information about the customers available on the web that will help him estimate their personal income. For example, he collects information about each customer's Employment, Property Holdings, etc. An example of such data collected from the web is shown in Table IV. Now, by fusing this information with the anonymized release, the adversary can estimate the sensitive customer personal income information.

  Name       Employment           Property Holdings
  Alice      CEO, Deutsche Bank   3560
  Bob        Manager, Verizon     1200
  Christine  Assistant, NYU       720
  Robert     CEO, Microsoft       5430

          TABLE IV: AUXILIARY DATA COLLECTED BY THE ADVERSARY

In this example, let's say the income range for all the customers is [$40000 - $100000] and can be divided into three classes: Low [$40000 - $60000], Medium [$60000 - $80000], and High [$80000 - $100000]. Now, consider the customer Robert. With an estimated valuation falling in the highest range [5-10], Bob concludes that Robert falls into the highest income category [$80000 - $100000]. By looking at his employment and property holdings (and possibly other auxiliary information), Bob can further improve his estimate and conclude that Robert falls into the upper category [$90000 - $100000] of the High income class. Based on this, he estimates that Robert's salary is the average of the range [$90000 - $100000], i.e., $95000.

This example demonstrates how, by using auxiliary information obtained from the web, an adversary can obtain a close estimate of Robert's actual income. Although in the above example the attacker uses his own understanding of the data to fuse the anonymized release with web data, in practice he could use various Information Fusion techniques for this purpose.
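Bob's manual fusion step from the running example can be written down directly. The sketch below is hypothetical: the mapping of the [1-5] valuation range to the Medium class and the property-holdings cutoff of 3000 units are assumptions added for illustration; only the income classes, Robert's inputs, and the $95000 result come from the example above.

```python
INCOME_CLASSES = {            # class -> (low, high) in dollars
    "Low":    (40_000, 60_000),
    "Medium": (60_000, 80_000),
    "High":   (80_000, 100_000),
}

def estimate_income(valuation_range, property_holdings):
    # Step 1: the highest valuation range [5-10] maps to the High
    # income class; [1-5] is assumed to map to Medium.
    cls = ("High" if valuation_range == "[5-10]"
           else "Medium" if valuation_range == "[1-5]" else "Low")
    lo, hi = INCOME_CLASSES[cls]
    # Step 2: large property holdings push the estimate into the upper
    # half of the class (assumed cutoff of 3000 units).
    if property_holdings > 3000:
        lo = (lo + hi) // 2            # e.g. High narrows to [90k-100k]
    # Step 3: the estimate is the midpoint of the remaining range.
    return (lo + hi) // 2

robert = estimate_income("[5-10]", property_holdings=5430)
```

For Robert this reproduces the walk-through above: High class, narrowed to its upper sub-range, midpoint $95000.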
Information Fusion is a well-studied paradigm in which multiple data sources are used to improve knowledge extraction. In the attack demonstrated above, an adversary with access to anonymized enterprise data gleans auxiliary information from the web and uses information fusion techniques to inflict a privacy breach. In this paper, we refer to such an attack as the Web-Based Information-Fusion Attack on enterprise data. This is illustrated in Figure 1. Note that this attack is an example of an attack model in which a human-in-the-loop inflicts a privacy breach.

[Fig. 1. Web-Based Information-Fusion Attack: the anonymized release is fused with auxiliary data from the web to produce estimated sensitive data.]

A. Contributions and Organization

In this paper, we demonstrate the shortcomings of existing anonymization schemes when applied to enterprise data through the Web-Based Information-Fusion Attack. Our main contribution is the formulation of the Fusion Resilient Enterprise Data Anonymization problem. We propose an iterative scheme to find an optimal anonymization that offers maximum protection against such attacks for a given dataset. The rest of the document is organized as follows: Section 2 provides the related work on this problem. Section 3 elaborates on the Web-Based Information-Fusion Attack and discusses the assumptions made regarding the attack. In Section 4, we formulate the problem of Fusion Resilient Enterprise Data Anonymization.
We then present our solution strategy to address the problem through incremental anonymization in Section 5. Section 6 presents experimental results, demonstrating the attack on a real dataset and presenting the prototype solution. Section 7 provides the conclusion and future work.

II. RELATED WORK

Data privacy has received a lot of attention from both the computer science and statistical research communities. In the statistical literature, studies on data confidentiality [12] [13] propose the use of matrix masks for anonymizing data. In the computer science literature, several recent studies [2] [3] have been done in the context of K-anonymity. Ferrer [9] proposed heuristic algorithms for optimal K-anonymization on quantitative data. Several problems with k-anonymity based partitioning techniques have been studied in [4] [7] and elsewhere. In [4], Machanavajjhala et al. pointed to the possibility of attacks on k-anonymized data because of a lack of diversity in the sensitive values corresponding to each partition. Later, in [7], Li et al. argued that l-diversity is neither a sufficient nor a necessary condition to guard against attacks on k-anonymized data. They proposed a scheme in which the distribution of sensitive values within each partition should not be far from the distribution of sensitive values in the original data.

One of the primary challenges in data anonymization is to take into consideration the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary can glean from other channels. Recent work on partitioning-based techniques [4] [14] [15] has attempted to define the adversary's background knowledge and the possible privacy breach based on it. Martin et al. [14] provide a first formal treatment of adversarial background knowledge.
They propose a language for expressing the adversary's knowledge based on conjunctive propositions. More recently, Chen et al. [15] have attempted to fill this gap by proposing an extension to the same language-based framework. However, these models do not consider auxiliary information obtained using identifying information present in the anonymized release. On the other hand, there has been some work [16] [17] [18] on addressing the problem of anonymizing sequential releases. The problem here is to ensure that the current release of a particular dataset does not lead to a disclosure with respect to previous releases of the same dataset. Orthogonal to these works, in [19] Wong et al. prove that the adversary's knowledge of the anonymization algorithm could lead to a privacy breach. In [20], Aggarwal et al. pose the problem of an adversarial rule mining attack on anonymized data. Our work is critically different from these studies, as we consider inferential attribute disclosure based on Information Fusion using external information sources.

III. WEB-BASED INFORMATION-FUSION ATTACK

A. Information Fusion

In this paper, we use fuzzy inferencing to build an Information Fusion system. This section provides a brief introduction to fuzzy inferencing and how it can be used by the adversary to fuse the anonymized release with web-based auxiliary information. Fuzzy Inference is a well-studied paradigm based on fuzzy logic, fuzzy if-then rules, and fuzzy reasoning. Basically, it provides a mechanism to map a set of inputs to a set of outputs using a set of rules. We refer the reader to [21] for an introduction to fuzzy inference systems. The first step involved in creating a fuzzy inference system is to determine the inputs and outputs.
In the web-based information-fusion attack, the inputs include all the data attributes available to the adversary through: 1) the anonymized release and 2) the auxiliary data collected through the web. In our running example from Section 1, the attributes Investment Volume Index, Investment Amount Index, and Customer Valuation from the anonymized release in Table III form the first half of the inputs to the information fusion system. The attributes Employment and Property Holdings collected from the web form the second half of the inputs. The output consists of a single attribute, Customer Personal Income, which the adversary intends to estimate. In the second step, the adversary defines fuzzy-set definitions for each of the input and output attributes. He then uses domain knowledge to formulate a set of rules mapping the input fuzzy sets to the output fuzzy sets. Figure 2 illustrates the system.

[Fig. 2. Fuzzy Inference System: input fuzzy sets (e.g., Low [Below 30] / Med [30-60] / High [Above 70]; Level 1 [1-3] / Level 2 [4-7] / Level 3 [8-10]; Low [500-1000] / Med [1000-2500] / High [2500-6000]) are mapped through inference rules and a de-fuzzifier to the output fuzzy sets Low [$80000 - $95000], Med [$95000 - $120000], High [$120000 - $160000].]

B. Attacker Capability

We assume that the intruder is an insider who is given, or otherwise acquires, access to the anonymized data. Thus, the intruder has access to individual identifiers that can be used to index into the web and other data sources.
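The fuzzy-inference pipeline described above can be sketched minimally. Everything below is an illustrative assumption rather than the paper's actual system: one triangular membership function stands in for the input fuzzy sets, a single if-then rule replaces the full rule base, and the defuzzification is a crude weighted average instead of a proper de-fuzzifier.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# One illustrative input fuzzy set (Valuation is "High") and one
# output anchor (peak of an assumed "High" income set).
valuation_high = lambda v: tri(v, 5, 10, 15)
INCOME_HIGH_PEAK = 95_000
INCOME_DEFAULT = 70_000          # assumed mid-range fallback

def infer_income(valuation):
    # Single rule: IF Valuation is High THEN Income is High.
    firing = valuation_high(valuation)   # rule firing strength in [0, 1]
    # Crude defuzzification: blend the High-income peak with the
    # default, weighted by the firing strength.
    return firing * INCOME_HIGH_PEAK + (1 - firing) * INCOME_DEFAULT
```

A real system would fuzzify every input attribute, fire many rules in parallel, aggregate the resulting output fuzzy sets, and defuzzify (e.g., by centroid); the sketch only shows the map-inputs-through-rules-to-output shape.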
The intruder is assumed to have the domain knowledge about the data needed to perform information fusion.

IV. PROBLEM FORMULATION

In this section we formulate the problem of Fusion Resilient Enterprise Data Anonymization to address web-based information-fusion attacks. Since it is not possible to quantify the amount of auxiliary information the adversary can collect, it is not practical to completely prevent such attacks. However, by estimating the auxiliary information that an adversary could collect, we can minimize the extent of the privacy breach in case of such an attack. This forms the primary goal of our problem formulation: for a given sensitive dataset, we need to find an anonymization such that the release causes minimum breach in case of a fusion attack.

On the other hand, one of the important factors involved in data anonymization is the utility of the release [22] [3]. The utility of an anonymized release is a measure of the usefulness of the release for the intended purpose, such as a specific task (e.g., classification) to be performed on the data. Several standard measures such as [22] have been proposed in the literature to compute data utility. Hence, the secondary goal of our problem formulation is to maximize data utility. With these goals in hand, we proceed to formulate the overall goal as follows:

Let P = {p_ij}_{m×n} be a sensitive private dataset defined over a finite set of attributes {P_1, P_2, ..., P_n}. Let Q = {q_ij}_{r×s} be the auxiliary data gathered by the intruder from the web over a set of attributes {Q_1, Q_2, ..., Q_s}. Now, let P′ be a candidate anonymization of P. Let F be an information fusion system that takes P′ and Q as inputs and produces P̂, an estimate of P. Let U be a measure of the utility of P′.
Goal: The goal of Fusion Resilient Enterprise Data Anonymization is to compute a P′ from P such that:

1) P′ is resilient to Web-Based Information-Fusion Attacks.

2) The utility U offered by P′ meets the release requirements.

To formulate the problem based on the above goal, we need to quantify the resilience to web-based information-fusion attacks. We define this using the following definitions:

Definition 1 (Dissimilarity, D_1 ∘ D_2): For two datasets D_1 and D_2 representing the same set of individuals and the same set of attributes, D_1 ∘ D_2 is a measure of dissimilarity between them.

For two datasets {D_1}_{m×n} and {D_2}_{m×n} representing the same set of individuals, we compute the dissimilarity between D_1 and D_2 using the mean square distance:

    D_1 ∘ D_2 = (1/m) * Tr((D_1 − D_2)^T (D_1 − D_2))

where m is the total number of records in each database and Tr(A) is the trace of the matrix A, i.e., the sum of the elements of its main diagonal.

As defined earlier, P̂ is the estimate of P made by the adversary based on a candidate release P′ and web-based auxiliary data Q using the information fusion system F:

    P̂ = F(P′, Q)

In order for the privacy of P to be protected, the dissimilarity between P and the adversary's estimate P̂ needs to be large. The greater the dissimilarity P ∘ P̂, the better protected P is. The dissimilarity between P and P̂ thus quantifies the protection offered by the corresponding P′ against information fusion attacks. Based on this, we now define a Fusion Resilient Anonymization as:

Definition 2 (Fusion Resilient Anonymization): An anonymization P′ of a given sensitive dataset P is resilient to fusion attacks if the dissimilarity (P ∘ P̂) between P̂ and P is above a certain threshold value T_p.

So, for a candidate anonymization P′ to be a safe release, the corresponding (P ∘ P̂) needs to be above the threshold value T_p.
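The dissimilarity of Definition 1 can be computed without any matrix machinery: Tr((D_1 − D_2)^T (D_1 − D_2)) is simply the sum of squared entry-wise differences, so dividing by m gives the mean squared distance per record. The true incomes below come from Table II; the adversary's estimates are assumed values in the spirit of the running example.

```python
def dissimilarity(d1, d2):
    """Mean square distance (1/m) * Tr((D1 - D2)^T (D1 - D2)) between
    two m x n datasets given as lists of equal-length numeric rows."""
    m = len(d1)
    total = 0.0
    for row1, row2 in zip(d1, d2):
        total += sum((a - b) ** 2 for a, b in zip(row1, row2))
    return total / m

P     = [[91_250], [74_340], [75_123], [98_230]]   # true incomes (Table II)
P_hat = [[95_000], [70_000], [70_000], [95_000]]   # assumed adversary estimates
protection = dissimilarity(P, P_hat)               # larger = better protected
```

A dissimilarity of zero means the adversary recovered the data exactly; the publisher wants releases whose induced estimate keeps this value above the threshold T_p.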
It is obvious to note that, among all the possible anonymizations P′ that satisfy this property, the one with the maximum value of (P ∘ P̂) offers maximum protection. So, for the anonymization P′ to offer maximum resilience to web-based information fusion attacks, the dissimilarity (P ∘ P̂) needs to be maximized. Recall that, in addition to maximizing the protection against information-fusion attacks, the utility of the release, U, should be maximized. Let W_1 and W_2 be the weights assigned by the publisher to privacy protection against information fusion attacks and to data utility, respectively. Now, the final objective can be stated as a weighted sum of protection and utility of the form:

    W_1 * (P ∘ P̂) + W_2 * U

Now, the problem can be stated as:

Problem: Given a private dataset P, web-based data Q, and an information-fusion system F, find the fusion resilient anonymization P′ that maximizes H = W_1 * (P ∘ P̂) + W_2 * U, where P̂ represents the estimate of P based on P′ and Q using F.

In order to solve the above optimization problem, we need to find the optimal anonymization P′ in the solution space containing all possible anonymizations P′ that satisfy the fusion-resilient-anonymization property defined earlier. One way to view this solution space is as the set of all anonymizations obtained by anonymizing P to different levels. Note that the definition of Anonymization Level depends on the specific anonymization scheme to be employed. For example, in K-anonymization, the value of k represents the anonymization level: the larger the value of k, the higher the anonymization level. As mentioned in Section 1, in our work we use K-anonymization as the basic anonymization scheme. For a given dataset P, let i denote the anonymization level and P′_i denote the release obtained by anonymizing P to level i.
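The search over anonymization levels implied by this formulation can be sketched as a simple sweep: for each level k, produce the candidate release, simulate the adversary's fusion estimate, score H = W_1 * (P ∘ P̂) + W_2 * U, and keep the best. The `anonymize`, `fuse`, `utility`, and `dissimilarity` callables are placeholders the publisher would supply (e.g., a K-anonymizer, a fuzzy-inference fusion system, the discernibility-based utility, and the mean square distance); the default weights are assumptions.

```python
def best_release(P, Q, anonymize, fuse, utility, dissimilarity,
                 levels, w1=0.7, w2=0.3):
    """Sweep anonymization levels and return the (score, level, release)
    triple maximizing H = w1 * (P o P_hat) + w2 * U."""
    best = None
    for k in levels:
        P_prime = anonymize(P, k)        # candidate release at level k
        P_hat = fuse(P_prime, Q)         # adversary's simulated estimate
        h = w1 * dissimilarity(P, P_hat) + w2 * utility(P_prime)
        if best is None or h > best[0]:
            best = (h, k, P_prime)
    return best
```

In practice higher levels of k tend to raise the dissimilarity term while lowering utility, so the sweep is trading the two off under the publisher's weights.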
We use the discernibility metric defined in [22] to measure the utility of a k-anonymized dataset. The metric can be mathematically stated as follows:

    C_DM(g, k) = Σ_{∀E : |E| ≥ k} |E|^2 + Σ_{∀E : |E| < k} |D| * |E|

where E ranges over the equivalence classes of records induced by the generalization g, and |D| is the total number of records in the dataset. Each record in an equivalence class of size at least k incurs a penalty equal to the class size, so such a class contributes |E|^2, while each record in a class smaller than k (and hence suppressed) incurs a penalty equal to the size of the entire dataset.
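The discernibility metric above translates directly into code: count the size of each equivalence class of generalized quasi-identifier tuples, then charge |E|^2 for classes of size at least k and |D| * |E| for suppressed classes.

```python
from collections import Counter

def discernibility(quasi_rows, k):
    """Compute C_DM for anonymity parameter k.

    quasi_rows: one generalized quasi-identifier tuple per record;
    records sharing a tuple form an equivalence class."""
    D = len(quasi_rows)
    classes = Counter(quasi_rows)          # equivalence class sizes
    cost = 0
    for size in classes.values():
        cost += size * size if size >= k else D * size
    return cost
```

Lower values mean the release is more useful: a class of three indistinguishable records costs 9, while a lone suppressed record in a 4-record dataset costs the full 4.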
