On Breaching Enterprise Data Privacy Through Adversarial Information Fusion


Authors: Srivatsava Ranjit Ganta, Raj Acharya

Dept. of Computer Science, Penn State University, University Park, PA, USA
ranjit@cse.psu.edu, acharya@cse.psu.edu

Abstract — Data privacy is one of the key challenges faced by enterprises today. Anonymization techniques address this problem by sanitizing sensitive data such that individual privacy is preserved while allowing enterprises to maintain and share sensitive data. However, existing work on this problem makes inherent assumptions about the data that are impractical in day-to-day enterprise data management scenarios. Further, applying existing anonymization schemes to enterprise data could open the door to adversarial attacks in which an intruder uses information fusion techniques to inflict a privacy breach. In this paper, we shed light on the shortcomings of current anonymization schemes in the context of enterprise data. We define and experimentally demonstrate the Web-Based Information-Fusion Attack on anonymized enterprise data. We formulate the problem of Fusion Resilient Enterprise Data Anonymization and propose a prototype solution to address this problem.

I. INTRODUCTION

Data privacy is one of the key challenges faced by enterprises today. Sensitive individual-specific information such as customer data and employee data is maintained and used for various purposes. Several instances of data privacy breaches [1] in the recent past have resulted in financial as well as reputation losses for enterprises. Anonymization techniques address this problem by sanitizing sensitive data such that individual privacy is preserved while allowing enterprises to maintain and share sensitive data. Recently, there has been a lot of work [2] [3] [4] [5] [6] on data anonymization schemes.
These techniques can be broadly classified into two types:

• Partitioning-based anonymization schemes: The first class of techniques guarantees privacy by partitioning the data such that an adversary cannot uniquely identify the individuals falling in each partition. The basic idea behind these techniques is "blending in the crowd," which guarantees that an individual or entity cannot be distinguished from a minimum number of other people. K-anonymity [2], l-diversity [4], and other work in this line [7] achieve partitioning through generalization and suppression techniques. Techniques such as [8], [9], on the other hand, achieve this by clustering the data. Partitioning-based solutions are mainly applied to non-interactive scenarios where the data needs to be published/released after anonymization.

• Perturbation-based anonymization schemes: The other class of techniques guarantees data privacy by adding noise to the sensitive data and thus preventing identification. Solutions in this category can be further classified based on whether the setting considered is interactive or not. Solutions such as [5] [6] add noise to perform specific data mining tasks in a non-interactive setting. More recent solutions such as [10] add randomized noise in an interactive setting wherein the particular function to be evaluated on the data is known a priori.

In this paper, we consider a non-interactive setting where the data needs to be released/published. We focus on partitioning-based schemes, as they are readily applicable to generic databases, including data with categorical attributes. Table I depicts a typical individual-specific dataset considered in the partitioning-based anonymization literature.
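The generalization step behind partitioning-based schemes can be sketched concretely. The following is an illustrative toy implementation, not the paper's own: it coarsens one quasi-identifier (Age) into progressively wider ranges and truncates Zipcode to a prefix until every combination of generalized values is shared by at least k records. The candidate widths and the fixed zipcode truncation are assumptions for illustration.

```python
from collections import Counter

def generalize_age(age, width):
    lo = (age // width) * width          # e.g. width=10 maps 28 -> [20-29]
    return f"[{lo}-{lo + width - 1}]"

def k_anonymize(records, k, widths=(1, 5, 10, 20, 50)):
    """Coarsen the Age quasi-identifier until each (Age, Zipcode prefix)
    group contains at least k records. Zipcode is truncated to its
    3-digit prefix as a second, fixed generalization."""
    for width in widths:
        groups = Counter(
            (generalize_age(r["Age"], width), r["Zipcode"][:3])
            for r in records
        )
        if min(groups.values()) >= k:
            return [
                {**r,
                 "Age": generalize_age(r["Age"], width),
                 "Zipcode": r["Zipcode"][:3] + "**"}
                for r in records
            ]
    raise ValueError("no generalization level achieved k-anonymity")

# Quasi-identifiers of the four individuals from Table I.
records = [
    {"Age": 28, "Zipcode": "13053"},
    {"Age": 29, "Zipcode": "13068"},
    {"Age": 21, "Zipcode": "13068"},
    {"Age": 23, "Zipcode": "13053"},
]
released = k_anonymize(records, k=2)
```

Each released record is now indistinguishable from at least one other on the generalized quasi-identifiers, which is exactly the "blending in the crowd" guarantee.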
Observe that there exists a classification of data attributes (as shown in Table I) into three different types:

  Name       SSN            Zipcode   Age   Nationality   Condition
  Alice      111-111-1111   13053     28    Russian       AIDS
  Bob        222-222-2222   13068     29    American      Flu
  Christine  333-333-3333   13068     21    Japanese      Cancer
  Robert     444-444-4444   13053     23    American      Meningitis

  (Identifier attributes: Name, SSN. Quasi-identifier attributes: Zipcode, Age, Nationality. Sensitive attribute: Condition.)

                     TABLE I: SENSITIVE DATABASE

1) Identifier Attributes: Attributes carrying explicit identifiers, such as Name, SSN, etc.

2) Quasi-Identifier Attributes: Attributes that could indirectly lead to identification of individuals in the database, such as Age, Zipcode, and Gender. These are also sometimes referred to as Non-Sensitive attributes.

3) Sensitive Attributes: Attributes carrying the sensitive information about the individuals, such as Disease, Income, etc.

Based on this classification, existing solutions assume that the Identifier Attributes in the database are stripped prior to the anonymization process. This rests on the implicit assumption that the identifier attributes are necessary neither for the release nor for the intended purpose of the release. We believe that this assumption is too restrictive and is even impossible in some scenarios where the presence of explicit identifiers is necessary for the intended purpose of the anonymized release [11]. Consider the following scenario:

Enterprise Data - Example: Table II depicts a customer database in a typical financial institution. The data contains the names of all the customers along with certain non-sensitive and sensitive information. The non-sensitive attributes are: Investment Volume Index (Invst Vol), to indicate the volume of investment (number of shares traded, etc.) made by the customer in the past; Investment Amount Index (Invst Amt), to indicate the amount of investment (amount involved in previous trades, etc.)
made by the customer in the past; and Customer Valuation (Valuation), to indicate the assigned value of the customer. The only sensitive attribute, Customer Personal Income (Income), corresponds to the customer's personal income. Databases such as this are an integral part of enterprises and are maintained and used for key operations every day. In this paper, we shall refer to them as Enterprise Databases.

  Name       Invst Vol   Invst Amt   Valuation   Income
  Alice      8           7           4           91,250
  Bob        5           4           4           74,340
  Christine  4           5           5           75,123
  Robert     9           8           9           98,230

                     TABLE II: ENTERPRISE DATA

The internal release of such data along with explicit identifiers (Customer Names) is a necessity for several enterprise operations such as accounting, record keeping, etc. However, at the same time, such a release should not compromise the privacy of sensitive information (Customer Personal Income). Note that trivial solutions such as removal of identifiers or use of pseudonyms are not viable in such scenarios. The key properties here are:

• The inclusion of identifying information is necessary for the release to serve the intended purpose.

• Sensitive data should not be disclosed even in the presence of explicit identifiers.

In the enterprise database scenario described above, anonymizing data using existing techniques falls short in providing adequate protection against adversarial attacks. This is because existing techniques [2] [3] [4] make the assumption that Identifier Attributes are stripped prior to the anonymization process. Consider the possibility in which an adversary (possibly an insider) is given (or otherwise acquires) access to the anonymized release of an enterprise database.
Now, the adversary can use the identifiers present in the release to collect auxiliary information about the individuals present in the database from a multitude of sources such as the web (homepages, blogs, etc.). The adversary could then fuse the auxiliary information with the anonymized release to estimate sensitive data.

Web-Based Information-Fusion Attack: Consider the enterprise data example described earlier, as shown in Table II. One way to internally release this table is to remove the customer salary information and publish the non-sensitive data as it is. The problem with this approach is that one can estimate the sensitive data based on the non-sensitive information present in the release. The solution is to anonymize the non-sensitive information and remove the sensitive information. Table III shows the anonymized release of this data using a partitioning-based anonymization scheme, namely K-anonymity as proposed by Sweeney et al. [2]. We use K-anonymization as a representative of partitioning-based solutions for data anonymization, as other solutions in this category produce similar results.

  Name       Invst Vol   Invst Amt   Valuation   Income
  Alice      [5-10]      [5-10]      [1-5]       -
  Bob        [5-10]      [1-5]       [1-5]       -
  Christine  [1-5]       [1-5]       [1-5]       -
  Robert     [5-10]      [5-10]      [5-10]      -

                TABLE III: ANONYMIZED ENTERPRISE DATA

Table III is now deemed safe and is released internally within the enterprise. Now, consider the scenario in which an adversary employee, Bob, is granted access to this anonymized release. Note that the release does not give Bob the sensitive information, i.e., the customer personal income data. However, he has access to non-sensitive information such as the customer valuation, investment volume, etc. Bob's goal is to use the anonymized release to estimate the customer personal income values.
To achieve this, he uses the customer names present in the release to search for additional information about the customers available on the web that will help him estimate their personal income. For example, he collects information about each customer's Employment, Property Holdings, etc. An example of such data collected from the web is shown in Table IV. Now, by fusing this information with the anonymized release, the adversary can estimate the sensitive customer personal income information.

  Name       Employment           Property Holdings
  Alice      CEO, Deutsche Bank   3560
  Bob        Manager, Verizon     1200
  Christine  Assistant, NYU       720
  Robert     CEO, Microsoft       5430

          TABLE IV: AUXILIARY DATA COLLECTED BY THE ADVERSARY

In this example, let's say the income range for all the customers is [$40000 - $100000] and can be divided into three classes: Low [$40000 - $60000], Medium [$60000 - $80000], and High [$80000 - $100000]. Now, consider the customer Robert. With an estimated valuation falling in the highest range [5-10], Bob concludes that Robert falls into the highest income category [$80000 - $100000]. By looking at his employment and property holdings (and possibly other auxiliary information), Bob can further improve his estimate and conclude that Robert falls into the upper category [$90000 - $100000] of the High income class. Based on this, he estimates that Robert's salary is the average of the range [$90000 - $100000], i.e., $95000.

This example demonstrates how, by using auxiliary information obtained from the web, an adversary can obtain a close estimate of Robert's actual income. Although in the above example the attacker uses his own understanding of the data to fuse the anonymized release with web data, in practice he could use various Information Fusion techniques for this purpose.
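Bob's manual fusion step from the running example can be written down directly. The sketch below is hypothetical: the mapping of the [1-5] valuation range to the Medium class and the property-holdings cutoff of 3000 units are assumptions added for illustration; only the income classes, Robert's inputs, and the $95000 result come from the example above.

```python
INCOME_CLASSES = {            # class -> (low, high) in dollars
    "Low":    (40_000, 60_000),
    "Medium": (60_000, 80_000),
    "High":   (80_000, 100_000),
}

def estimate_income(valuation_range, property_holdings):
    # Step 1: the highest valuation range [5-10] maps to the High
    # income class; [1-5] is assumed to map to Medium.
    cls = ("High" if valuation_range == "[5-10]"
           else "Medium" if valuation_range == "[1-5]" else "Low")
    lo, hi = INCOME_CLASSES[cls]
    # Step 2: large property holdings push the estimate into the upper
    # half of the class (assumed cutoff of 3000 units).
    if property_holdings > 3000:
        lo = (lo + hi) // 2            # e.g. High narrows to [90k-100k]
    # Step 3: the estimate is the midpoint of the remaining range.
    return (lo + hi) // 2

robert = estimate_income("[5-10]", property_holdings=5430)
```

For Robert this reproduces the walk-through above: High class, narrowed to its upper sub-range, midpoint $95000.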
Information Fusion is a well-studied paradigm in which multiple data sources are used to improve knowledge extraction. In the attack demonstrated above, an adversary with access to anonymized enterprise data gleans auxiliary information from the web and uses information fusion techniques to inflict a privacy breach. In this paper, we refer to such an attack as the Web-Based Information-Fusion Attack on enterprise data. This is illustrated in Figure 1. Note that this attack is an example of an attack model in which a human-in-the-loop inflicts a privacy breach.

[Fig. 1. Web-Based Information-Fusion Attack: the anonymized release is fused with auxiliary data from the web to produce estimated sensitive data.]

A. Contributions and Organization

In this paper, we demonstrate the shortcomings of existing anonymization schemes when applied to enterprise data through the Web-Based Information-Fusion Attack. Our main contribution is the formulation of the Fusion Resilient Enterprise Data Anonymization problem. We propose an iterative scheme to find an optimal anonymization that offers maximum protection against such attacks for a given dataset. The rest of the document is organized as follows: Section 2 provides the related work on this problem. Section 3 elaborates on the Web-Based Information-Fusion Attack and discusses the assumptions made regarding the attack. In Section 4, we formulate the problem of Fusion Resilient Enterprise Data Anonymization.
We then present our solution strategy to address the problem through incremental anonymization in Section 5. Section 6 presents experimental results, demonstrating the attack on a real dataset and presenting the prototype solution. Section 7 provides the conclusion and future work.

II. RELATED WORK

Data privacy has received a lot of attention from both the computer science and statistical research communities. In the statistical literature, studies on data confidentiality [12] [13] propose the use of matrix masks for anonymizing data. In the computer science literature, several recent studies [2] [3] have been done in the context of K-anonymity. Ferrer [9] proposed heuristic algorithms for optimal K-anonymization on quantitative data. Several problems with k-anonymity based partitioning techniques have been studied in [4] [7] and elsewhere. In [4], Machanavajjhala et al. pointed to the possibility of attacks on k-anonymized data because of a lack of diversity in the sensitive values corresponding to each partition. Later, in [7], Li et al. argued that l-diversity is neither a sufficient nor a necessary condition to guard against attacks on k-anonymized data. They proposed a scheme in which the distribution of sensitive values within each partition should not be far from the distribution of sensitive values in the original data.

One of the primary challenges in data anonymization is to take into consideration the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary can glean from other channels. Recent work on partitioning-based techniques [4] [14] [15] has attempted to define the adversary's background knowledge and the possible privacy breach based on it. Martin et al. [14] provide a first formal treatment of adversarial background knowledge.
They propose a language for expressing the adversary's knowledge based on conjunctive propositions. More recently, Chen et al. [15] have attempted to fill this gap by proposing an extension to the same language-based framework. However, these models do not consider auxiliary information obtained using identifying information present in the anonymized release. On the other hand, there has been some work [16] [17] [18] on addressing the problem of anonymizing sequential releases. The problem here is to ensure that the current release of a particular dataset does not lead to a disclosure with respect to previous releases of the same dataset. Orthogonal to these works, in [19] Wong et al. prove that the adversary's knowledge of the anonymization algorithm could lead to a privacy breach. In [20], Aggarwal et al. pose the problem of an adversarial rule mining attack on anonymized data. Our work is critically different from these studies, as we consider inferential attribute disclosure based on Information Fusion using external information sources.

III. WEB-BASED INFORMATION-FUSION ATTACK

A. Information Fusion

In this paper, we use fuzzy inferencing to build an Information Fusion system. This section provides a brief introduction to fuzzy inferencing and how it can be used by the adversary to fuse the anonymized release with web-based auxiliary information. Fuzzy Inference is a well-studied paradigm based on fuzzy logic, fuzzy if-then rules, and fuzzy reasoning. Basically, it provides a mechanism to map a set of inputs to a set of outputs using a set of rules. We refer the reader to [21] for an introduction to fuzzy inference systems. The first step involved in creating a fuzzy inference system is to determine the inputs and outputs.
In the web-based information-fusion attack, the inputs include all the data attributes available to the adversary through: 1) the anonymized release and 2) the auxiliary data collected through the web. In our running example from Section 1, the attributes Investment Volume Index, Investment Amount Index, and Customer Valuation from the anonymized release in Table III form the first half of the inputs to the information fusion system. The attributes Employment and Property Holdings collected from the web form the second half of the inputs. The output consists of a single attribute, Customer Personal Income, which the adversary intends to estimate. In the second step, the adversary defines fuzzy-set definitions for each of the input and output attributes. He then uses domain knowledge to formulate a set of rules mapping the input fuzzy sets to the output fuzzy sets. Figure 2 illustrates the system.

[Fig. 2. Fuzzy Inference System: input fuzzy sets (e.g., Low [Below 30] / Med [30-60] / High [Above 70]; Level 1 [1-3] / Level 2 [4-7] / Level 3 [8-10]; Low [500-1000] / Med [1000-2500] / High [2500-6000]) are mapped through inference rules and a de-fuzzifier to the output fuzzy sets Low [$80000 - $95000], Med [$95000 - $120000], High [$120000 - $160000].]

B. Attacker Capability

We assume that the intruder is an insider who is given, or otherwise acquires, access to the anonymized data. Thus, the intruder has access to individual identifiers that can be used to index into the web and other data sources.
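The fuzzy-inference pipeline described above can be sketched minimally. Everything below is an illustrative assumption rather than the paper's actual system: one triangular membership function stands in for the input fuzzy sets, a single if-then rule replaces the full rule base, and the defuzzification is a crude weighted average instead of a proper de-fuzzifier.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# One illustrative input fuzzy set (Valuation is "High") and one
# output anchor (peak of an assumed "High" income set).
valuation_high = lambda v: tri(v, 5, 10, 15)
INCOME_HIGH_PEAK = 95_000
INCOME_DEFAULT = 70_000          # assumed mid-range fallback

def infer_income(valuation):
    # Single rule: IF Valuation is High THEN Income is High.
    firing = valuation_high(valuation)   # rule firing strength in [0, 1]
    # Crude defuzzification: blend the High-income peak with the
    # default, weighted by the firing strength.
    return firing * INCOME_HIGH_PEAK + (1 - firing) * INCOME_DEFAULT
```

A real system would fuzzify every input attribute, fire many rules in parallel, aggregate the resulting output fuzzy sets, and defuzzify (e.g., by centroid); the sketch only shows the map-inputs-through-rules-to-output shape.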
The intruder is assumed to have the domain knowledge about the data needed to perform information fusion.

IV. PROBLEM FORMULATION

In this section we formulate the problem of Fusion Resilient Enterprise Data Anonymization to address web-based information-fusion attacks. Since it is not possible to quantify the amount of auxiliary information the adversary can collect, it is not practical to completely prevent such attacks. However, by estimating the auxiliary information that an adversary could collect, we can minimize the extent of the privacy breach in case of such an attack. This forms the primary goal of our problem formulation: for a given sensitive dataset, we need to find an anonymization such that the release causes minimum breach in case of a fusion attack.

On the other hand, one of the important factors involved in data anonymization is the utility of the release [22] [3]. The utility of an anonymized release is a measure of the usefulness of the release for the intended purpose, such as a specific task (e.g., classification) to be performed on the data. Several standard measures such as [22] have been proposed in the literature to compute data utility. Hence, the secondary goal of our problem formulation is to maximize data utility. With these goals in hand, we proceed to formulate the overall goal as follows:

Let P = {p_ij}_{m×n} be a sensitive private dataset defined over a finite set of attributes {P_1, P_2, ..., P_n}. Let Q = {q_ij}_{r×s} be the auxiliary data gathered by the intruder from the web over a set of attributes {Q_1, Q_2, ..., Q_s}. Now, let P′ be a candidate anonymization of P. Let F be an information fusion system that takes P′ and Q as inputs and produces P̂, an estimate of P. Let U be a measure of the utility of P′.
Goal: The goal of Fusion Resilient Enterprise Data Anonymization is to compute a P′ from P such that:

1) P′ is resilient to Web-Based Information-Fusion Attacks.

2) The utility U offered by P′ meets the release requirements.

To formulate the problem based on the above goal, we need to quantify the resilience to web-based information-fusion attacks. We define this using the following definitions:

Definition 1 (Dissimilarity, D_1 ∘ D_2): For two datasets D_1 and D_2 representing the same set of individuals and the same set of attributes, D_1 ∘ D_2 is a measure of dissimilarity between them.

For two datasets {D_1}_{m×n} and {D_2}_{m×n} representing the same set of individuals, we compute the dissimilarity between D_1 and D_2 using the mean square distance:

    D_1 ∘ D_2 = (1/m) * Tr((D_1 − D_2)^T (D_1 − D_2))

where m is the total number of records in each database and Tr(A) is the trace of the matrix A, i.e., the sum of the elements of its main diagonal.

As defined earlier, P̂ is the estimate of P made by the adversary based on a candidate release P′ and web-based auxiliary data Q using the information fusion system F:

    P̂ = F(P′, Q)

In order for the privacy of P to be protected, the dissimilarity between P and the adversary's estimate P̂ needs to be large. The greater the dissimilarity P ∘ P̂, the better protected P is. The dissimilarity between P and P̂ thus quantifies the protection offered by the corresponding P′ against information fusion attacks. Based on this, we now define a Fusion Resilient Anonymization as:

Definition 2 (Fusion Resilient Anonymization): An anonymization P′ of a given sensitive dataset P is resilient to fusion attacks if the dissimilarity (P ∘ P̂) between P̂ and P is above a certain threshold value T_p.

So, for a candidate anonymization P′ to be a safe release, the corresponding (P ∘ P̂) needs to be above the threshold value T_p.
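The dissimilarity of Definition 1 can be computed without any matrix machinery: Tr((D_1 − D_2)^T (D_1 − D_2)) is simply the sum of squared entry-wise differences, so dividing by m gives the mean squared distance per record. The true incomes below come from Table II; the adversary's estimates are assumed values in the spirit of the running example.

```python
def dissimilarity(d1, d2):
    """Mean square distance (1/m) * Tr((D1 - D2)^T (D1 - D2)) between
    two m x n datasets given as lists of equal-length numeric rows."""
    m = len(d1)
    total = 0.0
    for row1, row2 in zip(d1, d2):
        total += sum((a - b) ** 2 for a, b in zip(row1, row2))
    return total / m

P     = [[91_250], [74_340], [75_123], [98_230]]   # true incomes (Table II)
P_hat = [[95_000], [70_000], [70_000], [95_000]]   # assumed adversary estimates
protection = dissimilarity(P, P_hat)               # larger = better protected
```

A dissimilarity of zero means the adversary recovered the data exactly; the publisher wants releases whose induced estimate keeps this value above the threshold T_p.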
It is obvious to note that, among all the possible anonymizations P′ that satisfy this property, the one with the maximum value of (P ∘ P̂) offers maximum protection. So, for the anonymization P′ to offer maximum resilience to web-based information fusion attacks, the dissimilarity (P ∘ P̂) needs to be maximized. Recall that, in addition to maximizing the protection against information-fusion attacks, the utility of the release, U, should be maximized. Let W_1 and W_2 be the weights assigned by the publisher to privacy protection against information fusion attacks and to data utility, respectively. Now, the final objective can be stated as a weighted sum of protection and utility of the form:

    W_1 * (P ∘ P̂) + W_2 * U

Now, the problem can be stated as:

Problem: Given a private dataset P, web-based data Q, and an information-fusion system F, find the fusion resilient anonymization P′ that maximizes H = W_1 * (P ∘ P̂) + W_2 * U, where P̂ represents the estimate of P based on P′ and Q using F.

In order to solve the above optimization problem, we need to find the optimal anonymization P′ in the solution space containing all possible anonymizations P′ that satisfy the fusion-resilient-anonymization property defined earlier. One way to view this solution space is as the set of all anonymizations obtained by anonymizing P to different levels. Note that the definition of Anonymization Level depends on the specific anonymization scheme to be employed. For example, in K-anonymization, the value of k represents the anonymization level: the larger the value of k, the higher the anonymization level. As mentioned in Section 1, in our work we use K-anonymization as the basic anonymization scheme. For a given dataset P, let i denote the anonymization level and P′_i denote the release obtained by anonymizing P to level i.
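The search over anonymization levels implied by this formulation can be sketched as a simple sweep: for each level k, produce the candidate release, simulate the adversary's fusion estimate, score H = W_1 * (P ∘ P̂) + W_2 * U, and keep the best. The `anonymize`, `fuse`, `utility`, and `dissimilarity` callables are placeholders the publisher would supply (e.g., a K-anonymizer, a fuzzy-inference fusion system, the discernibility-based utility, and the mean square distance); the default weights are assumptions.

```python
def best_release(P, Q, anonymize, fuse, utility, dissimilarity,
                 levels, w1=0.7, w2=0.3):
    """Sweep anonymization levels and return the (score, level, release)
    triple maximizing H = w1 * (P o P_hat) + w2 * U."""
    best = None
    for k in levels:
        P_prime = anonymize(P, k)        # candidate release at level k
        P_hat = fuse(P_prime, Q)         # adversary's simulated estimate
        h = w1 * dissimilarity(P, P_hat) + w2 * utility(P_prime)
        if best is None or h > best[0]:
            best = (h, k, P_prime)
    return best
```

In practice higher levels of k tend to raise the dissimilarity term while lowering utility, so the sweep is trading the two off under the publisher's weights.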
We use the discernibility metric defined in [22] to measure the utility of a k-anonymized dataset. The metric can be mathematically stated as follows:

    C_DM(g, k) = Σ_{∀E : |E| ≥ k} |E|^2 + Σ_{∀E : |E| < k} |D| * |E|

where E ranges over the equivalence classes of records induced by the generalization g, and |D| is the total number of records in the dataset. Each record in an equivalence class of size at least k incurs a penalty equal to the class size, so such a class contributes |E|^2, while each record in a class smaller than k (and hence suppressed) incurs a penalty equal to the size of the entire dataset.
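The discernibility metric above translates directly into code: count the size of each equivalence class of generalized quasi-identifier tuples, then charge |E|^2 for classes of size at least k and |D| * |E| for suppressed classes.

```python
from collections import Counter

def discernibility(quasi_rows, k):
    """Compute C_DM for anonymity parameter k.

    quasi_rows: one generalized quasi-identifier tuple per record;
    records sharing a tuple form an equivalence class."""
    D = len(quasi_rows)
    classes = Counter(quasi_rows)          # equivalence class sizes
    cost = 0
    for size in classes.values():
        cost += size * size if size >= k else D * size
    return cost
```

Lower values mean the release is more useful: a class of three indistinguishable records costs 9, while a lone suppressed record in a 4-record dataset costs the full 4.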
