Privacy Preserving ID3 over Horizontally, Vertically and Grid Partitioned Data

Bart Kuijpers, Vanessa Lemmens, Bart Moelans
Theoretical Computer Science, Hasselt University & Transnational University Limburg, Belgium

Karl Tuyls
Department of Industrial Design, Eindhoven University of Technology, The Netherlands

Email addresses: bart.kuijpers@uhasselt.be (Bart Kuijpers), bart.moelans@uhasselt.be (Bart Moelans), k.p.tuyls@tue.nl (Karl Tuyls).

Abstract

We consider privacy preserving decision tree induction via ID3 in the case where the training data is horizontally or vertically distributed. Furthermore, we consider the same problem in the case where the data is both horizontally and vertically distributed, a situation we refer to as grid partitioned data. We give an algorithm for privacy preserving ID3 over horizontally partitioned data involving more than two parties. For grid partitioned data, we discuss two different evaluation methods for preserving privacy ID3, namely, first merging horizontally and developing vertically or first merging vertically and next developing horizontally. Next to introducing privacy preserving data mining over grid partitioned data, the main contribution of this paper is that we show, by means of a complexity analysis, that the former evaluation method is the more efficient.

1 Introduction

1.1 Privacy preserving data mining

In recent years privacy preserving data mining has emerged as a very active research area in data mining. The application possibilities of data mining, combined with the Internet, have attracted and inspired many scientists from different research areas, such as computer science, bioinformatics and economics, to actively participate in this relatively young field. Over the last few years this has naturally led to a growing interest in security and privacy issues in data mining.
More precisely, it became clear that discovering knowledge through a combination of different databases raises important security issues. Despite the fact that a centralized warehouse approach allows one to discover knowledge which would not have emerged when the sites were mined individually, privacy of data cannot be guaranteed in the context of data warehousing. Although data mining results usually do not violate the privacy of individuals, it cannot be assured that an unauthorized person will not access the centralized warehouse with malevolent intentions to misuse the gathered information for his own purposes during the data mining process. Neither can it be guaranteed that, when data is partitioned over different sites and not encrypted, it is impossible to derive new knowledge about the other sites. Data mining techniques try to identify regularities in data which are unknown and hard to discover by individuals. Regularities or patterns are to be understood as revelations over the entire data, rather than over individuals. However, to find such revelations the mining process has to access and use individual information. More formally, this problem is recognized as the inference problem [3,5]. Originally this problem dates back to research in database theory during the 70s and early 80s, acknowledged back then as access control. Models were developed offering protection against unauthorized use or access of the database. However, such models seemed unable to sufficiently protect sensitive information. More precisely, indirect accesses (through a different database and metadata) still allowed one to attain information one was not authorized for. Here, metadata consists, e.g., of dependencies between different databases, integrity constraints, domain knowledge, etc.
In other words, the inference problem occurs when one can obtain vital information through metadata, violating individuals' (or companies') privacy. With the elaboration of different network technologies and growing interest in pattern recognition this problem naturally carries over to data mining. In fact it gets even worse, as illustrated by Sweeney in [13]. In her work Sweeney shows that the typical de-identification technique applied on data sets for public use does not render the result anonymous. More precisely, she demonstrated that combinations of characteristics (or attributes) can construct a unique or near-unique identifier of tuples, which means that information can be gained on individuals even when their identifiers are distorted.

Over the past few years state of the art research in privacy preserving data mining has concentrated itself along two major lines: data which is horizontally distributed and data which is vertically distributed. Horizontally partitioned data is data which is homogeneously distributed, meaning that all data tuples range over the same item or feature set. Essentially this boils down to different data sites collecting the same kind of information over different individuals. Consider for instance a supermarket chain which gathers information on the buying behavior of its customers. Typically, such a company has different branches, implying the data to be horizontally distributed. Vertically distributed data is data which is heterogeneously distributed. Basically this means that data is collected by different sites or parties on the same individuals but with differing item or feature sets. Consider for instance financial institutions such as banks and credit card companies: they both collect data on customers having a credit card, but with differing item sets.
1.2 Our contribution

In this paper, we also consider data which is both horizontally and vertically distributed, which we will call grid partitioned data. To our knowledge, there has been no research up till now in privacy preserving data mining that considers grid distributed data. However, this kind of situation seems highly relevant and significant. Consider for instance the situation where different financial institutions gather data on clients concerning savings accounts, investments, credit cards and others. This situation clearly concerns data which is grid partitioned, since some institutions deal with credit cards and not with investments and vice versa, and since financial institutions typically have data emerging from different branches of a bank. For a more thorough elaboration of this example, we refer to the end of this Introduction (Section 1.3). In this paper, we propose a new algorithm to preserve privacy when constructing a decision tree for classification over grid partitioned data using ID3, involving multiple parties. Most closely related to this work is that of Lindell and Pinkas [8], who introduced a secure multi-party computation technique for classification using the ID3 algorithm over horizontally partitioned data, and that of Du and Zhan [4], who introduced a protocol for making ID3 secure over vertically partitioned data. An important contribution of our work is to consider horizontally and vertically distributed data at the same time. Furthermore, we also believe this is highly significant, as most real life vital data mining situations involving multiple parties consist of grid partitioned data. A motivating example is discussed in Section 1.3.
For grid partitioned data, we discuss two different evaluation methods for preserving privacy ID3, namely, first merging horizontally and developing vertically or first merging vertically and next developing horizontally. We show in Section 4, by means of a complexity analysis, that the former is the more efficient.

In the context of secure multiparty computation, the "semi-honest" and the "malicious" model [15,6] are considered. In the former, all parties follow the protocol strictly, but are allowed to remember everything they encounter while executing the protocol and to use this information to compute information about the other parties, whereas in the malicious model the parties are allowed to cheat. We assume the semi-honest model in this paper.

The rest of this paper is structured as follows. We end this Introduction with a motivating example for grid partitioned data, illustrating the importance of efficiently dealing with grid partitioned data in data mining applications. In Section 2, we sketch the preliminaries, such as the ID3 algorithm and definitions of horizontally, vertically and grid distributed data. Section 3 introduces our algorithm. We discuss the two different evaluation methods for preserving privacy ID3 in Section 4. Section 5 concludes the paper.

1.3 Motivating example for grid partitioned data

Typical for financial institutions such as banks is that they offer their clients different services, such as a savings account, choice of credit card, Maestro, and all kinds of investment possibilities such as mortgages, stock investments, fund orders and so on. Of course a bank is interested in knowing which are good customers, which are bad ones and which are possible defrauders. The reasons are obvious: making profit and avoiding losses because of clients which are not creditworthy and show unreliable behavior.
Gathering all kinds of financial data about their customers and their transactions can help them in identifying risky clients and possible defrauders, preventing huge financial losses. More precisely, by using good data mining techniques it becomes possible to generalize over these gathered data sets and identify possible risks for future cases or transactions. Typical for different branches is to gather the same kind (i.e., item sets) of data on different clients, implying that the data is horizontally partitioned. A possible item set re-occurring at banks is illustrated in Table 1.

Table 1: A possible item set.

Cust. nr. | mortgage | account   | salary | stock | neg. saldo | Fraudulent?
A11       | 25.000   | 104.200   | 2.200  | 0     | no         | no
B12       | 0        | 1.001.020 | 4.4000 | 1.000 | yes        | yes
...       | ...      | ...       | ...    | ...   | ...        | ...

However, by combining their data sets, it would become possible to derive knowledge leading to a high level of precision in triggering fraudulent behavior, which they would not have reached individually. Consider for instance a simplified rule X, Y → F, meaning that if features X and Y are satisfied, there is a high chance the transaction is fraudulent. It is not imaginary that this association rule is known to bank A and unknown to bank B, simply because B does not have enough (local) tuples to support this rule. There is a reasonable chance that by combining their databases, sites A and B would have discovered association rules which hold globally and which they would not have discovered individually, implying a greater accuracy in identifying defrauders. Although such cooperative behavior could save them a great deal of money, none of them, as they are competitors, would be willing to share all its transactions and itemsets with one another, for obvious reasons.
Although it is possible for banks to gather substantial data on their clients, there is still room for more improvement. More precisely, a bank does not typically manage all the services it offers. For instance, credit card transactions are managed by separate companies collaborating with banks. Despite this cooperation, neither of them will be too happy to exchange data and item or feature sets on their customers. Still, if they would be willing to collaborate, this could lead to a higher precision in identifying fraudulent cases and all parties would benefit. In other words, the group of people having a credit card is usually involved in investments of all possible kinds: mortgages, stock market, order funds, etc. This implies that the group of individuals on which the credit card companies gather data is more or less the same as the group on which banks gather data concerning investments. This boils down to data which is vertically partitioned.

Summarizing, this example shows that it is not imaginary at all that data appears to be both horizontally and vertically distributed in real life situations, which we call grid partitioned data. Note that we can easily extend the above example to contain more parties. We could for instance add tax services, interested in tracking people cheating on their taxes. Most importantly, the example illustrates that we do not only need to consider privacy preserving techniques for horizontally or vertically distributed data, but that it is highly significant for real life applications to consider the combination of both.

2 Preliminaries

We start this section with a subsection summarizing the ID3 algorithm. Then we continue with a subsection describing definitions and examples of horizontally, vertically and grid distributed data. We continue with preliminaries on multi-party computation.
2.1 The ID3 algorithm

The ID3 algorithm (Inducing Decision Trees) was originally introduced by Quinlan in [11] and is described below in Algorithm 1. Here we briefly recall the steps involved in the algorithm. For a thorough discussion of the algorithm we refer the interested reader to [10].

The input of ID3 is a finite data set of tuples containing (discrete or nominal) values for a finite number of attributes, one of which is called the class attribute (also called target class). ID3 induces a decision tree from an example set in a top-down manner. More precisely, the algorithm starts at the root node, each time choosing the attribute which separates the data most efficiently according to their target class. Then the algorithm creates a branch for each value of this attribute and continues from there by repeating the above process until all attributes are used.

To determine which attribute is best in classifying the given data set, a measure from information theory is used, namely information gain. Information gain is defined as the expected reduction in entropy. Entropy measures the homogeneity of a data set. More formally, the entropy of a data set of tuples S is defined as:

    entropy(S) = Σ_{i=1}^{d} −p_i log_2 p_i        (1)

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i. The information gain of an attribute A is then defined as:

    gain(S, A) = entropy(S) − Σ_v (|S_v| / |S|) entropy(S_v)        (2)

with S_v the subset of S with tuples having value v for attribute A.

Algorithm 1 The ID3 Algorithm
Require: R, a set of attributes.
Require: C, the class attribute.
Require: S, data set of tuples.
1: if R is empty then
2:   Return the leaf having the most frequent value in data set S.
3: else if all tuples in S have the same class value then
4:   Return a leaf with that specific class value.
5: else
6:   Determine attribute A with the highest information gain in S.
7:   Partition S in m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
8:   Return a tree with root A and m branches labelled a_1, ..., a_m, such that branch i contains ID3(R − {A}, C, S(a_i)).
9: end if

2.2 Horizontally, vertically and grid partitioned data

In this section we provide a formal definition of horizontally, vertically and grid partitioned data. We will use the projection operation as defined in relational algebra in database theory. Suppose we have:

(1) A relation (or data set) S over the schema I, A_1, ..., A_{|R|}, C consisting of a finite number of tuples. The attribute I is supposed to be a key (i.e., to contain identifiers) and is not considered as an attribute to build the decision tree. The only purpose of the attribute I is to be able to join vertically distributed data. The attribute C will be referred to as the class attribute.
(2) Parties P_ij with i = 1, ..., v, j = 1, ..., h and v smaller than the number of attributes (i.e., |R| + 1).
(3) Each party P_ij holds a part S_ij containing information about certain attributes (including I) and certain tuples. The S_ij are such that
  • the S_ij form a partition of S; more precisely, S = ∪_{j=1}^{h} (S_{1j} ⋈_I ... ⋈_I S_{vj}), where the joins are taken on the key I;
  • S_ij and S_ij′ have the same attributes but (parts of) different tuples of S when j ≠ j′;
  • S_ij and S_i′j have disjoint attributes but contain information about the same tuples of S.

Definition 1 We call S
  • horizontally distributed if and only if v = 1;
  • vertically distributed if and only if h = 1; and
  • grid distributed if and only if v, h ≥ 2.

Examples of horizontally, vertically and grid distributed databases can be found in Figures 1(a), 1(b) and 2.
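The entropy and gain computations of Equations (1) and (2) can be sketched in a few lines of plain Python; the attribute and class names below are made up for illustration and do not come from the paper:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Equation (1): sum over the class values of -p_i * log2(p_i).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, class_attr):
    # Equation (2): entropy(S) minus the size-weighted entropy of each S_v.
    labels = [r[class_attr] for r in rows]
    total = entropy(labels)
    for v in {r[attr] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attr] == v]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# A toy data set in the spirit of Table 1 (hypothetical values).
rows = [
    {"neg_saldo": "no",  "fraud": "no"},
    {"neg_saldo": "yes", "fraud": "yes"},
    {"neg_saldo": "no",  "fraud": "no"},
    {"neg_saldo": "yes", "fraud": "no"},
]
print(round(gain(rows, "neg_saldo", "fraud"), 3))   # 0.311
```

ID3 would pick, at each node, the attribute maximizing this gain, as in line 6 of Algorithm 1.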
2.3 Preliminaries on multiparty computation

In this section we recall some results from multiparty computation that will be needed as building blocks in the algorithms in the next section. Basically, secure multi-party computation (SMPC) makes sure that the different parties involved in a computation process do not learn anything more than the result(s) of the computation process and whatever is derivable from those results in a polynomial amount of time (without cheating). More precisely, SMPC is of great interest to the inference problem. In the case of horizontally, vertically and grid partitioned data in data mining, the mining process requires a lot of communication between the different parties. The SMPC techniques prevent any party from deriving new knowledge about the other parties involved. It is not our intention to give a complete overview of SMPC here; we refer to [9,2,15,8,12,1,3,14]. Here we provide the security protocols necessary for our purposes, i.e.,

  • the secure sum protocol,
  • the Yao circuit,
  • the secure union protocol,
  • the secure size of set intersection protocol; and
  • the x ln(x) protocol.

Fig. 1: (left) Horizontally distributed data. (right) Vertically distributed data. [Figure: on the left, parties P_11 and P_12 hold the same attributes I, A_1, ..., A_5, C over different tuples; on the right, parties P_11 and P_21 hold the disjoint attribute sets I, A_1, A_2, A_3 and I, A_4, A_5, C over the same tuples.]

We remark that in the context of secure multiparty computation, two models, that we already mentioned before, are considered, namely the "semi-honest" and the "malicious" model [6,15]. In the semi-honest model, all parties follow the protocol strictly. They are allowed to remember everything they encounter while executing the protocol and to use this information to compute (in polynomial time) information about the other parties. In the malicious model the parties are allowed to cheat.
They may for example falsify their inputs in order to learn more about the inputs of other parties.

Fig. 2: Grid distributed data. [Figure: nine parties P_11 through P_33 arranged in a 3×3 grid; parties P_1j hold I, A_1, A_2, parties P_2j hold I, A_3, and parties P_3j hold I, A_4, A_5, C, while parties with the same index j hold the same tuples.]

We will assume the semi-honest model in the description of our algorithms.

2.3.1 Secure sum protocol

The goal of this protocol is that k > 2 parties can compute the sum of the values each party holds in such a way that no party can learn anything about the values of the other parties. The protocol of Kantarcioglu and Clifton [7] protects individual values by using a random number. Party 0 adds a random number to its own value and sends the result to Party 1. Party 1 cannot learn anything from this value due to the random number. Party 1 adds his value to this number and sends it along to Party 2. This process continues until the last party has been reached. This party adds his number to the number it received and sends it to Party 0. Party 0 can now compute the sum by subtracting the random number from the sum it received from the last party. Now Party 0 reveals the sum to the other parties.

How safe is this protocol? It can be shown, by means of a polynomial time simulator, that, in the semi-honest model, this protocol is safe. Actually, to show safety it is necessary that all values remain within a finite domain [0, m] and all computations are done modulo m.

It should be remarked that the protocol can be broken if we assume the malicious model. For instance, it is clear that when Party i − 1 and Party i + 1 collaborate, the value of Party i can be discovered. As a remedy for this problem, each party can split its value into n parts.
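Leaving aside the splitting refinement just mentioned, the basic ring protocol can be sketched as follows; the counts and the bound m are illustrative, and a single function stands in for the message passing between the k parties:

```python
import random

def secure_sum(values, m):
    # Party 0 masks its value with a random number r; each subsequent
    # party adds its own value modulo m and passes the running total on.
    r = random.randrange(m)
    running = (r + values[0]) % m
    for v in values[1:]:          # Parties 1 .. k-1 along the ring
        running = (running + v) % m
    # Party 0 subtracts its random mask to recover the actual sum.
    return (running - r) % m

# Hypothetical local counts held by k = 4 parties.
print(secure_sum([17, 3, 25, 9], m=1000))   # 54
```

Because every intermediate value is offset by the mask r (and reduced modulo m), no single party along the ring learns anything about the others' inputs.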
The sum of all parts is then calculated. To avoid that Party i − 1 and Party i + 1 can collaborate to discover the value of Party i, a different path through the parties is followed during each of these n computations. In this way, more parties have to collaborate to discover individual values.

2.3.2 Yao circuit

Yao introduced in [15] the concept of secure two party computation. He showed that any function f(x, y), where x is the input of Party 1 and y the input of Party 2, can be evaluated in a secure way. To formalize the concept of security, we concentrate on functions f (Yao makes use of Boolean circuits to represent a function f) of the form

    f(x, y) = (f_1(x, y), f_2(x, y)).

This function receives a part of its input, namely x, from Party 1 and the other part of its input, namely y, from Party 2. Party 1 wants to learn f_1(x, y) and Party 2 wants to learn f_2(x, y). Suppose that protocol Π is used to learn f. View_i^Π is what Party i learns by executing protocol Π and Output_i^Π is the output of Party i (i = 1, 2). Finally, let S_i be an algorithm that can be executed in polynomial time. Yao defines

    {(S_1(x, f_1(x, y)), f_2(x, y))} = {(View_1^Π(x, y), Output_2^Π(x, y))}

and

    {(f_1(x, y), S_2(y, f_2(x, y)))} = {(Output_1^Π(x, y), View_2^Π(x, y))},

meaning that any party can learn from f, by executing protocol Π, only those facts that can be learned in polynomial time from his/her input and his/her output. Executing the protocol does therefore not provide any extra information. We remark that Goldreich et al. [6] generalized the results of Yao to more than two parties. Goldreich et al. also gave the composition theorem, which states that if a function g can be reduced safely to a function f, and if there is a protocol to safely compute f, then g can also be computed safely.
In this paper, we will refer to this type of circuits as Yao circuits, even if they concern more than two parties.

2.3.3 Secure union protocol

When there are only two parties, computing the union of two sets belonging to each of those parties can lead to security problems. Indeed, the knowledge about one's own set and about the union gives (at least partial) knowledge about the other party's set. In this section, we outline a method to compute the union of k itemsets, belonging to k parties, k > 2. The goal is that all parties should learn the union, without learning about the itemsets of other parties. The algorithm is from Kantarcioglu and Clifton [7] and consists of four phases that we sketch below. These authors also show its security.

Phase 1: All parties generate a commutative, deterministic encryption key E_i and a decryption key D_i. Each itemset is augmented with fake or dummy items (this is done to prevent the determination of the cardinality of the itemset). At the end of Phase 1, each party has an itemset of the same size (which is agreed upon at the start).

Phase 2: Each party encrypts its items and communicates them to the next party (the communication is cyclic as in the case of secure sum computation, i.e., Party i sends information to Party (i + 1) mod k). Each party encrypts what it receives and passes it on to the next party. This continues until each Party i is in the possession of the completely encrypted items of Party (i + 1) mod k. We remark that to continue one more step would no longer be secure.

Phase 3: The even-numbered parties send all items in their possession to Party 0 and the odd-numbered parties do the same to Party 1 (the last party always has to send to Party 1 to avoid that a party gets its own fully encrypted itemset).
Parties 0 and 1 take the union of what they received and remove the doubles. Party 1 sends everything he has to Party 0, who removes the doubles. At this point the union is in the possession of Party 0, be it fully encrypted.

Phase 4: The encrypted union is sent to all parties to be decrypted. Finally the fake items are removed and the result is announced to all parties.

2.3.4 Secure size of set intersection protocol

When there are only two parties, computing the size of the intersection of two sets belonging to each of the parties can lead to security problems. Indeed, the knowledge about one's own set and about the size of the intersection of the two sets gives (at least partial) knowledge about the set of the other party. So, we are interested in computing the size of the set intersection of k itemsets, belonging to k parties, k > 2. The goal is that all parties should learn the size of the set intersection, without learning about the itemsets of other parties. Jaideep Vaidya [14] proposed a protocol for the secure computation of the size of set intersection. It is similar to the secure union protocol, and we will not repeat the details here but refer to [14].

2.3.5 x ln(x) protocol

The x ln(x) protocol, due to Lindell and Pinkas [9], is different from the previous protocols. It uses Yao circuits, as mentioned earlier in this section. Because these circuits are only suitable for two parties, this protocol too is only suitable for two parties. Assume we have two parties, called Alice and Bob. Alice has a value x_a and Bob has a value x_b. The goal of the x ln(x) protocol is to give Alice and Bob both a share, s_a and s_b respectively, such that

    s_a + s_b = (x_a + x_b) ln(x_a + x_b).

The x ln(x) protocol makes use of two subprotocols.
The first receives two values x_a and x_b as input and returns two random shares of ln(x_a + x_b) as output (using a Taylor series). The second, called the multiplication protocol, receives two values u_a and u_b as input and returns two random shares of u_a · u_b as output. Alice and Bob run the ln(x) protocol and obtain shares u_a and u_b. Next, the multiplication protocol is executed twice. First it is called with u_a and x_b as input; this gives Alice and Bob respectively shares v_a and v_b. The second time it is called with x_a and u_b, giving Alice and Bob respectively shares w_a and w_b. Alice now has x_a, u_a, v_a and w_a, with which she can compute s_a = x_a u_a + v_a + w_a. Bob can construct s_b = x_b u_b + v_b + w_b in a similar way. Since x_a u_a + x_b u_b + x_a u_b + x_b u_a = (x_a + x_b)(u_a + u_b) = (x_a + x_b) ln(x_a + x_b), Alice and Bob both have their share of (x_a + x_b) ln(x_a + x_b).

3 Privacy preserving ID3: Grid partitioned data

In the present section, we introduce our algorithms, preserving privacy over grid partitioned data. Basically, we will study the following dilemma: when data is grid partitioned we can first merge it horizontally and then further develop the process vertically, or the other way around. Obviously other ways of doing this are possible as well, but we consider only the two straightforward ones. Of course, while building the decision tree we need to preserve privacy and use some well known protocols for this. In this paper we consider privacy as protecting individual data tuples as well as protecting attributes and values of attributes. So each party will reveal as little as possible about its data while still constructing an applicable distributed decision tree. The only thing that is known about the tree by all parties is its structure and which party is responsible for each decision node.
More precisely, it is known which party possesses the attribute used to make each decision, but not which attribute (and value) it is. We assume that only a limited number of parties know the class attribute and that no party knows the entire set of attributes, which is obvious as we use grid partitioned data.

Once the tree is constructed, instance classification proceeds as follows. The party that wishes to classify a new unseen instance knows the root node of the tree: either the node resides at his site or he knows the root node identification (nodeID). A node identification contains a code identifying the party possessing that particular node. Basically, when classifying a new instance, control passes from party to party, depending on the decision nodes that are visited. Every party knows the tuple's attribute values for the nodes at its site but knows nothing about the other attribute values. The classification then happens as in Algorithm 2.

Algorithm 2 The classification algorithm, called classify(t, nodeID). A site wishes to classify a new instance t. Control starts at the root node (which every party knows).
1: if the nodeID is a leaf node then
2:   its classification value (or distribution) is returned.
3: else if the nodeID is an interior node then
4:   node = local node with nodeID
5:   value = value of attribute node.A (used as decision attribute) for the tuple t we are classifying
6:   childID = node.value
7:   return childID.classify(t, childID)
8: end if

Before introducing our new algorithms for grid partitioned data, we introduce a minor side result which has not been dealt with so far in the literature, i.e., horizontally partitioned data with more than two parties.

3.1 Privacy preserving ID3 over horizontally partitioned data involving more than two parties

In this section we extend the result of Lindell and Pinkas [8], i.e.,
preserving privacy for decision tree learning with two parties, to more than two parties. Recall the ID3 algorithm from Section 2.1. We will separately consider its three basic steps, i.e., the emptiness test of the attribute set R, the test whether all transactions have the same class label (the class label test), and the default case. We explain in detail how the algorithm preserves privacy in each of them.

Emptiness test. Thanks to the horizontal distribution of the data and the fact that all parties know the intermediate tree, they can easily determine whether R is empty. In case R is empty they have to determine the most frequent class value. This can be calculated by using the secure sum protocol for each class value. Each party inputs to the protocol the number of tuples at his data site having the particular class value. In this way they can safely compute, for each class value, the total number of tuples over all sites having this value for the class attribute. Now a leaf node can be constructed containing the most frequent class value.

Class label test. To securely determine whether all tuples in S have the same class value, a variant of the secure union protocol can be adopted. More precisely, all parties provide a value as input to the protocol. If a party has only one class value in all of its tuples, it provides this value as input. Otherwise, a fixed symbol ⊥ is provided as input. The protocol then runs analogously to the standard secure union protocol until the step in which the data has to be decrypted. The first party, i.e., Party 0, has all the values in its possession, of which he deletes all doubles. Now there are two possibilities: only one value remains or not. In case of the former, this must be the value which was provided as input by Party 0. This is a class value or the ⊥ symbol.
If it is a class value, this means that all parties have provided the protocol with this same value; otherwise, it means that all parties still have more than one class value. In case of the latter, it is certain that there is still more than one class value. The protocol then has to be stopped, or else values are learned by different parties which they should not learn.

Default case. In this case it should be determined which attribute classifies the data tuples in S most accurately. To calculate the information gain, x ln(x) has to be calculated a number of times, with x partitioned over the different parties (being counts of data tuples). This can be solved by the secure sum protocol. The different parties provide their share of x as input to the protocol. The sum known to the last party and the random value of the first party multiplied by −1 are input to the x ln(x) protocol. This protocol provides shares of the result as output, which can then be used as input to a circuit which securely computes their sum and outputs which attribute classifies the tuples best.

The complete description can be found in Algorithm 3.

Algorithm 3 The privacy preserving ID3 algorithm for more than two parties over horizontally distributed data
Require: R, the set of attributes.
Require: C, the class attribute.
Require: S, the horizontally distributed data set.
1: if the parties test that R is empty then
2:   The secure sum protocol is used to calculate which class value c_i is most frequent.
3:   Return a leaf with class value c_i.
4: else if all parties, using the secure union protocol, test that all tuples in S have the same class value c_i then
5:   Return a leaf with class value c_i.
6: else
7:   Determine the attribute A classifying the tuples in S most accurately: use the secure sum and x ln(x) protocols.
8:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
9:   Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
10: end if

3.2 Grid partitioned data

We will introduce the grid partitioned privacy preserving algorithm by running through the different steps of ID3 informally. It is important to realize that no site knows the complete attribute set S, and that only a limited number of parties know the class attribute, more particularly as many as there are horizontal distributions. Note that these algorithms only consider the cases in which the parties are denoted by P_ij with i = 1, ..., v and j = 1, ..., h.

3.2.1 Horizontal merge and vertical development

Recall the ID3 algorithm from Section 2.1. We will separately consider its three basic steps, i.e., the emptiness test of the attribute set S, the test whether all transactions have the same class label (the class label test), and the default case. Here we consider the case where we first merge the data horizontally and continue vertically. A horizontal merge means that we eliminate the horizontal distribution, leaving only a vertical distribution.

Emptiness test. To determine whether there are any attributes left, as many parties as there are vertical distributions need to cooperate with one another, as we need to know all attributes to compute this test. This can be easily understood from Figure 2. More precisely, in the example of that figure, Party_11, Party_21 and Party_31 can together determine whether there are any attributes left. These parties check how many possible attributes they still possess as candidate decision nodes and pass this value as input to the secure sum protocol. At the end of the protocol, the sum and the random value are passed to a Yao circuit, which tests whether the sum equals zero (meaning that S is empty) or not. In case the sum is zero, the most frequent class value has to be determined.
This is done in the following manner: first, all parties determine the tuples reaching the current node in the tree. Then these tuples are merged horizontally by constructing a union over the vertical groups (over index i), i.e., for each vertical group a secure union protocol is applied. In case of Figure 2 we have the following groups computing a union: Party_11, Party_12 and Party_13; Party_21, Party_22 and Party_23; and Party_31, Party_32 and Party_33. Parties located on the same horizontal layer (meaning that they have the same index j), and which are thus vertically distributed, will use the same encryption key to compute the vertical unions (i.e., the horizontal merge). In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. At this stage only a vertical distribution is left over the entire distributed database, as we merged the data horizontally. Now we will continue by developing vertically. The intersections of these different sets with the tuples in a particular class give the number of tuples that reach that point in the tree. This can be done for each class value; note that it is not necessary to use the secure size of set intersection protocol because the unions are already encrypted. This eventually gives us the most frequent class value. Note that it is not necessary to decrypt the values again to compute the intersections. The reason is that we used the same encryption keys for parties at the same horizontal level, implying that equal values in the encrypted unions are also equal in the real unions. Now a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.
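The shared-key trick just described, where equal plaintexts encrypt to equal ciphertexts so that encrypted unions can be intersected without decryption, can be sketched as follows. This is only an illustration: HMAC stands in for the deterministic encryption the protocol assumes, and the layer key, group names and tuple identifiers are all hypothetical.

```python
import hmac, hashlib

def encrypt(key: bytes, value: str) -> str:
    """Deterministic keyed 'encryption' of a tuple identifier.

    Stand-in for the deterministic encryption assumed by the protocol:
    parties sharing `key` map equal plaintexts to equal ciphertexts.
    """
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# Hypothetical key shared by all parties on the same horizontal layer.
layer_key = b"shared-layer-1-key"

# Each vertical group contributes the (encrypted) IDs of tuples reaching
# the current node; the horizontal merge is then a plain set union.
union_group_1 = {encrypt(layer_key, t) for t in ["t1", "t2", "t3"]}
union_group_2 = {encrypt(layer_key, t) for t in ["t2", "t3", "t4"]}

# Tuples of one particular class, encrypted under the same layer key by a
# party holding the class attribute.
class_pos = {encrypt(layer_key, t) for t in ["t2", "t4"]}

# Because equal plaintexts encrypt to equal ciphertexts, the intersection
# can be computed directly on the encrypted sets -- no decryption needed.
reaching_node = union_group_1 | union_group_2
count_pos = len(reaching_node & class_pos)
print(count_pos)  # → 2
```

Counting this intersection per class value yields the most frequent class value, exactly as in the text above.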
Class label test. Checking whether all transactions in the training set S have the same class value happens analogously to determining the most frequent class value. More precisely, one party knowing the class attribute can compute the possible intersections, i.e., the intersections of the sets of tuples which might reach the current level or node of the tree with the tuples in a particular class. If all intersections equal zero besides one, all tuples in S have that particular class value. Since they are all the same, a leaf node is now constructed. The parties which were joined in one vertical group and know the class attribute all know the id of the node (nodeID in the algorithms) and the specific class value. All other parties just get to know the nodeID of the node, which they need in case they have to classify a new tuple leading to this node in the tree.

Default case. In this case, the best classifying attribute has to be determined. To do this, transactions or tuples need to be counted. To learn these numbers, recall that entropy and information gain were defined as follows:

entropy(S) = Σ_{i=1}^{d} −p_i log_2(p_i),

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i, i.e., N_i/N, where N is the total number of tuples reaching the current node and N_i is the number of tuples having target value i. The information gain of an attribute A is then defined as:

gain(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) entropy(S_v).

For each attribute, the information gain needs to be computed. First, all parties determine the tuples reaching the current node in the tree, i.e., N. Then these tuples are merged horizontally by constructing a union over the vertical groups, i.e., for each vertical group a secure union protocol is applied.
In case of Figure 2, we have the following groups computing a union: Party_11, Party_12 and Party_13; Party_21, Party_22 and Party_23; Party_31, Party_32 and Party_33. For every vertical group, one party will iterate over its attributes. For each such attribute, the numbers of tuples need to be counted for every value of this attribute. Thus, for every value of the attribute (which is encrypted), an intersection is computed over all the sets resulting from the horizontal merge. When we know all these numbers, the information gain of this attribute can be computed. This step is repeated for all attributes. The process described so far is called vertical development. One of the parties of each vertical group saves the information gain of its attributes. These parties can then cooperate to compute the best classifying one. Finally, the party which possesses the best classifying attribute constructs a decision node, which is given a node identifier nodeID. The value of the node is communicated to the other parties that also possess this attribute. The other parties only get to know the identifier. The complete description can be found in Algorithm 4.

3.2.2 Vertical merge and horizontal development

Emptiness test. To determine whether there are any attributes left, as many parties as there are vertical distributions need to cooperate with one another, as we need to know all attributes to compute this test. So essentially this is done in the same manner as with the horizontal merge, i.e., the previous algorithm.

Algorithm 4 The privacy preserving ID3 algorithm over grid partitioned data when data is merged horizontally and further developed vertically
Require: R, the set of attributes distributed among the parties P_ij with i = 1, ..., v, j = 1, ..., h.
Require: C, the class attribute with d class values c_1, ..., c_d.
Require: S, the grid distributed data set over parties P_ij with i = 1, ..., v, j = 1, ..., h, and parties P_v,j holding the class attribute.
1: if (Emptiness test) the parties test that R is empty then
2:   The secure sum protocol and a Yao circuit are used to test whether R is empty.
3:   In case the attribute set is empty, the secure union protocol is used to merge the data horizontally. For the vertical development, the secure size of set intersection protocol does NOT have to be used. A Yao circuit is used to calculate which class value c_i is most frequent. A leaf node with class value c_i is returned.
4: else if all parties test whether all tuples have the same class value c_i then
5:   The secure union protocol and a Yao circuit are used to calculate this.
6:   In case the test is TRUE, a leaf with class value c_i is returned.
7: else
8:   Determine the attribute A classifying the tuples in S most accurately: use the secure union and secure sum protocols.
9:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
10:  Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
11: end if

To determine the most frequent class value, we will merge the data vertically. More precisely, every party first determines the number of tuples that reach the current level or node of the tree. For this, the parties only use the attributes they possess. Then we merge vertically by letting the parties at the same horizontal level cooperate. In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. The parties that possess the class attribute now need to compute a set per class value. In our example these are parties Party_31, Party_32 and Party_33.
They execute as many secure size of set intersection protocols as there are class values. In this manner they compute, per horizontal group (or vertical merge), the number of transactions per class value. Once the parties possessing the class attribute have computed these intersections, they have to cooperate to find out the total number of tuples per class value. They compute this by using a secure sum protocol per class value. Then these values are passed on to a Yao circuit in order to learn the most frequent class value. Now a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.

Class label test. Determining whether all tuples have the same class value is analogous to the previous step. The difference lies in the Yao circuit, which will test whether all sums equal zero except for one. Again a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.

Default case. We need to compute the best classifying attribute. To do this, transactions or tuples need to be counted. To learn these numbers, recall that entropy and information gain were defined as follows:

entropy(S) = Σ_{i=1}^{d} −p_i log_2(p_i),

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i, i.e., N_i/N, where N is the total number of tuples reaching the current node and N_i is the number of tuples having target value i. The information gain of an attribute A is then defined as:

gain(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) entropy(S_v).

First, we merge the data vertically.
More precisely, every party first determines the number of tuples that reach the current level or node of the tree. For this, the parties only use the attributes they possess. Then the data is merged vertically by parties at the same horizontal layer (having the same index j in P_ij) via a secure size of set intersection protocol, to obtain exactly those tuples that are in the current dataset associated to the node under consideration in the tree. In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. Through a secure sum protocol these numbers can now be added to learn the number of tuples that reach the current node of the tree, which is denoted by N in the entropy formula. Now we need to compute, for each remaining attribute, its information gain. This is done by computing, for each value of an attribute over each horizontal layer, the number of tuples having this attribute value (we compute N_i). This is done by using the secure size of set intersection protocol. Then these numbers can be added over all horizontal layers by using a secure sum protocol. This is called horizontal development. The secure sum (to which the random value was added) and the random value itself multiplied by −1 are provided to the x ln(x) protocol. This circuit will then output shares of the result, which can be used as input to a circuit which securely computes their sum and outputs which attribute classifies the tuples best. A decision node can now be constructed for the best classifying attribute. The description of the algorithm is summarized in Algorithm 5.

Algorithm 5 The privacy preserving ID3 algorithm over grid partitioned data when data is merged vertically and further developed horizontally
Require: R, the set of attributes distributed among the parties P_ij with i = 1, ..., v, j = 1, ..., h.
Require: C, the class attribute with d class values c_1, ..., c_d.
Require: S, the grid distributed data set over parties P_ij with i = 1, ..., v, j = 1, ..., h, and parties P_v,j holding the class attribute.
1: if (Emptiness test) the parties test that R is empty then
2:   The secure sum protocol and a Yao circuit are used to test whether R is empty.
3:   In case the attribute set is empty, the secure size of set intersection protocol, the secure sum protocol and a Yao circuit are used to calculate which class value c_i is most frequent. A leaf node with class value c_i is returned.
4: else if all parties test whether all tuples have the same class value c_i then
5:   The secure size of set intersection protocol, the secure sum protocol and a Yao circuit are used to calculate this.
6:   In case the test is TRUE, a leaf with class value c_i is returned.
7: else
8:   Determine the attribute A classifying the tuples in S most accurately: use the secure size of set intersection, secure sum and x ln(x) protocols.
9:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
10:  Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
11: end if

4 Complexity analysis of privacy preserving ID3 over grid-partitioned data

In this section we analyse the complexity of the two computation strategies proposed in the previous section: first merging horizontally and developing vertically, or first merging vertically and next developing horizontally. The different quantities h, v, k, |T|, |R|, d, m, t and n that play a role in this analysis are explained in the next table.
Notation and meaning:
h    the number of horizontal groups
v    the number of vertical groups
k    the number of parties (= h × v)
|T|  the number of tuples in the data set
|R|  the number of attributes
d    the number of values for the class attribute C
m    the maximal number of values for an attribute
t    the maximal length of encryption keys
n    the maximal length of the Taylor series

The predominant task in the ID3 algorithm is to determine the attribute with the highest information gain, and we will base our analysis mainly on this component.

4.1 The complexity of the components from SMPC

In discussing the complexity of the building blocks described in Section 2.3, usually two components are considered: the computational complexity and the communication complexity. The former considers the cost of computations in the classical sense; the latter considers the cost of passing messages, e.g., between different parties.

4.1.1 The complexity of the secure sum protocol

For the secure sum protocol with k parties, the computation and communication costs are both O(k log(|T|)). Each of the parties never outputs values larger than |T| and the messages passed are never larger than |T|. Assuming binary encoding of numbers, this gives the above result.

4.1.2 The complexity of the secure union protocol and secure size of intersection protocol

For the secure union protocol and the secure size of intersection protocol with k parties, the computation cost is O(k² |T| t³) and the communication cost is O(k² |T| t). Indeed, the parties send sets of at most size |T| and make use of encryption keys of length t, hence the factor t³ in the computation cost, and every party has to encrypt k² sets. The factor t in the communication cost reflects the size of the sets that are transmitted: in total, k² messages are sent of size |T| t.
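As an illustration of the protocol analysed in Section 4.1.1, a minimal honest-but-curious secure sum can be sketched as follows. This is a single-process simulation in the style of Clifton et al. [2]; the modulus and the per-site counts are illustrative, and in the real protocol the masked running total is physically passed around a ring of parties.

```python
import random

def secure_sum(local_values, modulus):
    """Ring-based secure sum: party 0 masks the running total with a
    random offset, each subsequent party adds its private value modulo
    `modulus`, and party 0 removes the mask at the end.

    `modulus` must exceed the true total for the result to be exact.
    """
    r = random.randrange(modulus)           # party 0's random mask
    running = (r + local_values[0]) % modulus
    for v in local_values[1:]:              # message passed around the ring
        running = (running + v) % modulus
    return (running - r) % modulus          # party 0 unmasks the total

# Example: per-site counts of tuples with a given class value.
counts = [12, 7, 30]                        # three parties' private counts
print(secure_sum(counts, modulus=10**6))    # → 49
```

No party other than party 0 sees anything but a uniformly masked partial sum, which is what keeps the individual counts private.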
4.1.3 The complexity of the secure x ln(x) protocol and Yao circuits

The secure x ln(x) protocol for two parties has a computational cost of O(log(|T|)) and a communication cost of O(n log(|T|) t). It takes input values of at most |T|. The protocol also depends on a value n that determines how far a Taylor series is developed. The protocol contains a Yao circuit, created by one of the parties, who also gives his input to the circuit and passes it to the other party. This explains the communication cost, in which n obviously plays a role since it determines the size of the circuit. The second party receives the circuit and feeds his input to it. For this, one oblivious transfer is performed per bit. This step explains the computational cost.

4.2 The complexity of first horizontal merging

To determine the attribute that best classifies the data, unions have to be computed over h parties for each attribute. The exact number of unions may depend on the attribute under consideration. If it is an attribute that belongs to the party that possesses the class attribute, there are v + d + m − 1 unions. If the attribute belongs to another party, there are v + d + m − 2 unions. The number of unions to be transmitted also depends on the attribute. For attributes in the possession of the owner of the class attribute, there are v − 1 unions to be transmitted; for other attributes, v + d − 2. So we conclude:

computation cost = O(|R| (v + d + m) (h² |T| t³)),
communication cost = O(|R| (v + d) (h² |T| t)).

We end this section with a remark on how this protocol could be made more efficient. Remark that the strength of this protocol resides in the fact that adjacent parties may use the same encryption key. For this reason it is not necessary to use the secure size of set intersection protocol to calculate intersections.
The first phase of this protocol can be skipped because the same keys are used when computing unions. This is possible here because the data is both horizontally and vertically distributed. When the data is only vertically distributed, it would also be possible to let the parties agree on some encryption keys and to simplify the protocol in this way.

4.3 The complexity of first vertical merging

Per attribute, 1 + d + m + dm values have to be computed, namely:
• the number of transactions reaching the current node: 1;
• the number of transactions reaching the current node per class value: d;
• the number of transactions reaching the current node per attribute value: m;
• the number of transactions reaching the current node per class value and per attribute value: dm.

All these values can be computed via a secure size of set intersection protocol that is each time executed by v parties. Since we have to count this for each horizontal group, this gives in total h(1 + d + m + dm) calls to the secure size of set intersection protocol. With these values the computation continues. In case of two horizontal groups this is with the x ln(x) protocol; in case of more horizontal groups this is with the secure sum protocol, followed by the x ln(x) protocol. So we conclude:

computation cost = O(|R| (h(1 + d + m + dm)) (v² |T| t³) + |R| (1 + d + m + dm) log(|T|) + |R| log(|T|)) [+ O(h log(|T|))],
communication cost = O(|R| (h(1 + d + m + dm)) (v² |T| t) + |R| (1 + d + m + dm) n log(|T|) t + |R| log(|T|) t) [+ O(h log(|T|))].

4.4 Conclusion on the complexity analysis

We start by remarking that it looks more logical to merge the data first horizontally and then to further develop it vertically. The emptiness test can be implemented more efficiently in the former case.
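To get a feel for the two computation-cost expressions above, one can plug in sample parameter values. This sketch keeps only each strategy's dominant term, ignores the constants hidden by the O-notation and the lower-order terms, so only the relative magnitudes are meaningful; the parameter values are purely illustrative.

```python
def cost_horizontal_first(R, v, h, d, m, T, t):
    """Dominant computation-cost term for horizontal merge followed by
    vertical development: O(|R| (v + d + m) h^2 |T| t^3)."""
    return R * (v + d + m) * (h ** 2) * T * (t ** 3)

def cost_vertical_first(R, v, h, d, m, T, t):
    """Dominant computation-cost term for vertical merge followed by
    horizontal development: O(|R| h (1 + d + m + d*m) v^2 |T| t^3)."""
    return R * h * (1 + d + m + d * m) * (v ** 2) * T * (t ** 3)

# Illustrative parameters: a 3x3 grid of parties and a modest data set.
params = dict(R=10, v=3, h=3, d=2, m=4, T=10_000, t=128)
print(cost_horizontal_first(**params) < cost_vertical_first(**params))  # → True
```

For this 3×3 grid the horizontal-first term is 810 · |T| · t³ versus 4050 · |T| · t³ for vertical-first, consistent with the conclusion that merging horizontally first is the more efficient strategy.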
The secure x ln(x) protocol gives an approximated result, but the difference from the real result is small. This protocol also makes heavy use of circuit computations, which in practice it is preferable to avoid.

Concerning complexity, the expressions obtained above also show that first merging horizontally is advantageous. And, as remarked before, it can be improved by an optimal use of encryption. Indeed, by giving different parties the same encryption key, it is not necessary to perform the secure size of set intersection protocols after the secure union protocols have been executed.

5 Conclusions

In this paper we first discussed the significance of extending the current state of the art in privacy preserving data mining to grid partitioned data, i.e., data which is both horizontally and vertically partitioned. Our motivating example shows that this situation is of great interest for real world situations and applications. We then formally defined horizontally, vertically and grid partitioned data. To our knowledge, we are the first to formalize the concept of grid partitioned data.

We continued by introducing three new privacy preserving data mining algorithms. We started by extending the result of Lindell and Pinkas [8], i.e., preserving privacy for decision tree learning with two parties when data is horizontally distributed, to more than two parties. However, the main contribution of this paper is the two algorithms to securely induce a distributed decision tree when data is grid partitioned. More precisely, we considered two possible solutions: one in which data is first merged horizontally and then further developed vertically, and vice versa. The complexity analysis of both algorithms shows that it is more efficient to first merge data horizontally and further develop it vertically than the other way around.

References

[1] R.
Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.
[2] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M. Y. Zhu, Tools for privacy preserving data mining, SIGKDD Explorations 4 (2) (2002) 28–34.
[3] C. Clifton, D. Marks, Security and privacy implications of data mining, in: Proceedings of the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, 1996.
[4] W. Du, Z. Zhan, Building decision tree classifier on private data, in: IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, 2002.
[5] C. Farkas, S. Jajodia, The inference problem: A survey, SIGKDD Explorations 4 (2) (2002) 6–11.
[6] O. Goldreich, S. Micali, A. Wigderson, How to play any mental game or a completeness theorem for protocols with honest majority, in: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC), 1987.
[7] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, in: Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), 2002.
[8] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proceedings of the 20th Annual International Cryptology Conference (CRYPTO), 2000.
[9] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proceedings of the 20th Annual International Cryptology Conference, vol. 1880 of Lecture Notes in Computer Science, 2000.
[10] T. Mitchell, Machine Learning, McGraw-Hill Series in Computer Science.
[11] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106.
[12] C. Su, K. Sakurai, Secure computation over distributed databases, IPSJ Journal.
[13] L. Sweeney, A primer on data privacy protection.
PhD thesis, Massachusetts Institute of Technology, 2001.
[14] J. Vaidya, Privacy Preserving Data Mining over Vertically Partitioned Data, PhD thesis, Purdue University, 2004.
[15] A. C.-C. Yao, How to generate and exchange secrets (extended abstract), in: Proceedings of the 27th IEEE Symposium on Foundations of Computer Science (FOCS), 1986.
