Privacy Preserving ID3 over Horizontally, Vertically and Grid Partitioned Data

Bart Kuijpers, Vanessa Lemmens, Bart Moelans
Theoretical Computer Science, Hasselt University & Transnational University Limburg, Belgium

Karl Tuyls
Department of Industrial Design, Eindhoven University of Technology, The Netherlands

Email addresses: bart.kuijpers@uhasselt.be (Bart Kuijpers), bart.moelans@uhasselt.be (Bart Moelans), k.p.tuyls@tue.nl (Karl Tuyls).

Abstract

We consider privacy preserving decision tree induction via ID3 in the case where the training data is horizontally or vertically distributed. Furthermore, we consider the same problem in the case where the data is both horizontally and vertically distributed, a situation we refer to as grid partitioned data. We give an algorithm for privacy preserving ID3 over horizontally partitioned data involving more than two parties. For grid partitioned data, we discuss two different evaluation methods for preserving privacy ID3, namely, first merging horizontally and developing vertically or first merging vertically and next developing horizontally. Next to introducing privacy preserving data mining over grid partitioned data, the main contribution of this paper is that we show, by means of a complexity analysis, that the former evaluation method is the more efficient.

1 Introduction

1.1 Privacy preserving data mining

In recent years privacy preserving data mining has emerged as a very active research area in data mining. The application possibilities of data mining, combined with the Internet, have attracted and inspired many scientists from different research areas, such as computer science, bioinformatics and economics, to actively participate in this relatively young field. Over the last few years this has naturally led to a growing interest in security and privacy issues in data mining.
More precisely, it became clear that discovering knowledge through a combination of different databases raises important security issues. Despite the fact that a centralized warehouse approach allows one to discover knowledge which would not have emerged when the sites were mined individually, privacy of data cannot be guaranteed in the context of data warehousing. Although data mining results usually do not violate the privacy of individuals, it cannot be assured that an unauthorized person will not access the centralized warehouse with malevolent intentions to misuse the gathered information for his own purposes during the data mining process. Neither can it be guaranteed that, when data is partitioned over different sites and not encrypted, it is impossible to derive new knowledge about the other sites. Data mining techniques try to identify regularities in data which are unknown and hard to discover by individuals. Regularities or patterns are to be understood as revelations over the entire data, rather than over individuals. However, to find such revelations the mining process has to access and use individual information. More formally, this problem is recognized as the inference problem [3,5]. Originally this problem dates back to research in database theory during the 70s and early 80s, acknowledged back then as access control. Models were developed offering protection against unauthorized use or access of the database. However, such models seemed unable to sufficiently protect sensitive information. More precisely, indirect accesses (through a different database and metadata) still allowed one to attain information one was not authorized for. Here, metadata consists, e.g., of dependencies between different databases, integrity constraints, domain knowledge, etc.
In other words, the inference problem occurs when one can obtain vital information through metadata, violating individuals' (or companies') privacy. With the elaboration of different network technologies and growing interest in pattern recognition this problem naturally carries over to data mining. In fact it gets even worse, as illustrated by Sweeney in [13]. In her work Sweeney shows that the typical de-identification technique applied on data sets for public use does not render the result anonymous. More precisely, she demonstrated that combinations of characteristics (or attributes) can construct a unique or near-unique identifier of tuples, which means that information can be gained on individuals even when their identifiers are distorted.

Over the past few years state of the art research in privacy preserving data mining has concentrated itself along two major lines: data which is horizontally distributed and data which is vertically distributed. Horizontally partitioned data is data which is homogeneously distributed, meaning that all data tuples range over the same item or feature set. Essentially this boils down to different data sites collecting the same kind of information over different individuals. Consider for instance a supermarket chain which gathers information on the buying behavior of its customers. Typically, such a company has different branches, implying the data to be horizontally distributed. Vertically distributed data is data which is heterogeneously distributed. Basically this means that data is collected by different sites or parties on the same individuals but with differing item or feature sets. Consider for instance financial institutions such as banks and credit card companies: they both collect data on customers having a credit card, but with differing item sets.
1.2 Our contribution

In this paper, we also consider data which is both horizontally and vertically distributed, which we will call grid partitioned data. To our knowledge, there has been no research up till now in privacy preserving data mining that considers grid distributed data. However, this kind of situation seems highly relevant and significant. Consider for instance the situation where different financial institutions gather data on clients concerning savings accounts, investments, credit cards and others. This situation clearly concerns data which is grid partitioned, since some institutions deal with credit cards and not with investments and vice versa, and since financial institutions typically have data emerging from different branches of a bank. For a more thorough elaboration of this example, we refer to the end of this Introduction (Section 1.3). In this paper, we propose a new algorithm to preserve privacy when constructing a decision tree for classification over grid partitioned data using ID3, involving multiple parties. Most closely related to this work is that of Lindell and Pinkas [8], who introduced a secure multi-party computation technique for classification using the ID3 algorithm over horizontally partitioned data, and that of Du and Zhan [4], who introduced a protocol for making ID3 secure over vertically partitioned data. An important contribution of our work is to consider horizontally and vertically distributed data at the same time. Furthermore, we also believe this is highly significant, as most real life vital data mining situations involving multiple parties consist of grid partitioned data. A motivating example is discussed in Section 1.3.
For grid partitioned data, we discuss two different evaluation methods for preserving privacy ID3, namely, first merging horizontally and developing vertically or first merging vertically and next developing horizontally. We show in Section 4, by means of a complexity analysis, that the former is the more efficient.

In the context of secure multiparty computation, the "semi-honest" and the "malicious" model [15,6] are considered. In the former, all parties follow the protocol strictly, but are allowed to remember everything they encounter while executing the protocol and to use this information to compute information about the other parties, whereas in the malicious model the parties are allowed to cheat. We assume the semi-honest model in this paper.

The rest of this paper is structured as follows. We end this Introduction with a motivating example for grid partitioned data, illustrating the importance of efficiently dealing with grid partitioned data in data mining applications. In Section 2, we sketch the preliminaries, such as the ID3 algorithm and definitions of horizontally, vertically and grid distributed data. Section 3 introduces our algorithm. We discuss the two different evaluation methods for preserving privacy ID3 in Section 4. Section 5 concludes the paper.

1.3 Motivating example for grid partitioned data

Typical for financial institutions such as banks is that they offer their clients different services, such as a savings account, choice of credit card, Maestro, and all kinds of investment possibilities such as mortgages, stock investments, fund orders and so on. Of course a bank is interested in knowing which are good customers, which are bad ones and which are possible defrauders. The reasons are obvious: making profit and avoiding losses because of clients which are not creditworthy and show unreliable behavior.
Gathering all kinds of financial data about their customers and their transactions can help them in identifying risky clients and possible defrauders, preventing huge financial losses. More precisely, by using good data mining techniques it becomes possible to generalize over these gathered data sets and identify possible risks for future cases or transactions. Typical for different branches is to gather the same kind (i.e., item sets) of data on different clients, implying that the data is horizontally partitioned. A possible item set re-occurring at banks is illustrated in Table 1.

Table 1: A possible item set.

Cust. nr. | mortgage | account   | salary | stock | neg. saldo | Fraudulent?
A11       | 25.000   | 104.200   | 2.200  | 0     | no         | no
B12       | 0        | 1.001.020 | 4.4000 | 1.000 | yes        | yes
...       | ...      | ...       | ...    | ...   | ...        | ...

However, by combining their data sets, it would become possible to derive knowledge leading to a high level of precision in triggering fraudulent behavior, which they would not have reached individually. Consider for instance a simplified rule X, Y → F, meaning that if features X and Y are satisfied, there is a high chance the transaction is fraudulent. It is not imaginary that this association rule is known to bank A and unknown to bank B, simply because B does not have enough (local) tuples to support this rule. There is a reasonable chance that by combining their databases, sites A and B would have discovered association rules which hold globally and which they would not have discovered individually, implying a greater accuracy in identifying defrauders. Although such cooperative behavior could save them a great deal of money, none of them, as they are competitors, would be willing to share all its transactions and itemsets with one another, for obvious reasons.
Although it is possible for banks to gather substantial data on their clients, there is still room for more improvement. More precisely, a bank does not typically manage all the services it offers. For instance, credit card transactions are managed by separate companies collaborating with banks. Despite this cooperation, neither of them will be too happy to exchange data and item or feature sets on their customers. Still, if they would be willing to collaborate, this could lead to a higher precision in identifying fraudulent cases and all parties would benefit. In other words, the group of people having a credit card is usually involved in investments of all possible kinds: mortgages, stock market, order funds, etc. This implies that the group of individuals on which the credit card companies gather data is more or less the same as the group on which banks gather data concerning investments. This boils down to data which is vertically partitioned.

Summarizing, this example shows that it is not imaginary at all that data appears to be both horizontally and vertically distributed in real life situations, which we call grid partitioned data. Note that we can easily extend the above example to contain more parties. We could for instance add tax services, interested in tracking people cheating on their taxes. Most importantly, the example illustrates that we do not only need to consider privacy preserving techniques for horizontally or vertically distributed data, but that it is highly significant for real life applications to consider the combination of both.

2 Preliminaries

We start this section with a subsection summarizing the ID3 algorithm. Then we continue with a subsection describing definitions and examples of horizontally, vertically and grid distributed data. We continue with preliminaries on multi-party computation.
2.1 The ID3 algorithm

The ID3 algorithm (Inducing Decision Trees) was originally introduced by Quinlan in [11] and is described below in Algorithm 1. Here we briefly recall the steps involved in the algorithm. For a thorough discussion of the algorithm we refer the interested reader to [10].

The input of ID3 is a finite data set of tuples containing (discrete or nominal) values for a finite number of attributes, one of which is called the class attribute (also called target class). ID3 induces a decision tree from an example set in a top-down manner. More precisely, the algorithm starts at the root node, each time choosing the attribute which separates the data most efficiently according to their target class. Then the algorithm creates a branch for each value of this attribute and continues from there by repeating the above process until all attributes are used.

To determine which attribute is best in classifying the given data set, a measure from information theory is used, namely information gain. Information gain is defined as the expected reduction in entropy. Entropy measures the homogeneity of a data set. More formally, the entropy of a data set of tuples S is defined as:

    entropy(S) = Σ_{i=1}^{d} −p_i log_2 p_i        (1)

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i. The information gain of an attribute A is then defined as:

    gain(S, A) = entropy(S) − Σ_v (|S_v| / |S|) entropy(S_v)        (2)

with S_v the subset of S with tuples having value v for attribute A.

Algorithm 1 The ID3 Algorithm
Require: R, a set of attributes.
Require: C, the class attribute.
Require: S, data set of tuples.
1: if R is empty then
2:   Return the leaf having the most frequent value in data set S.
3: else if all tuples in S have the same class value then
4:   Return a leaf with that specific class value.
5: else
6:   Determine attribute A with the highest information gain in S.
7:   Partition S in m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
8:   Return a tree with root A and m branches labelled a_1, ..., a_m, such that branch i contains ID3(R − {A}, C, S(a_i)).
9: end if

2.2 Horizontally, vertically and grid partitioned data

In this section we provide a formal definition of horizontally, vertically and grid partitioned data. We will use the projection operation as defined in relational algebra in database theory. Suppose we have:

(1) A relation (or data set) S over the schema I, A_1, ..., A_{|R|}, C consisting of a finite number of tuples. The attribute I is supposed to be a key (i.e., to contain identifiers) and is not considered as an attribute to build the decision tree. The only purpose of the attribute I is to be able to join vertically distributed data. The attribute C will be referred to as the class attribute.
(2) Parties P_ij with i = 1, ..., v, j = 1, ..., h and v smaller than the number of attributes (i.e., |R| + 1).
(3) Each party P_ij holds a part S_ij containing information about certain attributes (including I) and certain tuples. The S_ij are such that
  • the S_ij form a partition of S; more precisely, S = ∪_{j=1}^{h} (S_{1j} ⋈_I ... ⋈_I S_{vj}), where the joins are taken on the key I;
  • S_ij and S_ij′ have the same attributes but (parts of) different tuples of S when j ≠ j′;
  • S_ij and S_i′j have disjoint attributes but contain information about the same tuples of S.

Definition 1 We call S
  • horizontally distributed if and only if v = 1;
  • vertically distributed if and only if h = 1; and
  • grid distributed if and only if v, h ≥ 2.

Examples of horizontally, vertically and grid distributed databases can be found in Figures 1(a), 1(b) and 2.
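The entropy and gain computations of Equations (1) and (2) can be sketched in a few lines of plain Python; the attribute and class names below are made up for illustration and do not come from the paper:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Equation (1): sum over the class values of -p_i * log2(p_i).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, class_attr):
    # Equation (2): entropy(S) minus the size-weighted entropy of each S_v.
    labels = [r[class_attr] for r in rows]
    total = entropy(labels)
    for v in {r[attr] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attr] == v]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# A toy data set in the spirit of Table 1 (hypothetical values).
rows = [
    {"neg_saldo": "no",  "fraud": "no"},
    {"neg_saldo": "yes", "fraud": "yes"},
    {"neg_saldo": "no",  "fraud": "no"},
    {"neg_saldo": "yes", "fraud": "no"},
]
print(round(gain(rows, "neg_saldo", "fraud"), 3))   # 0.311
```

ID3 would pick, at each node, the attribute maximizing this gain, as in line 6 of Algorithm 1.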
2.3 Preliminaries on multiparty computation

In this section we recall some results from multiparty computation that will be needed as building blocks in the algorithms in the next section. Basically, secure multi-party computation (SMPC) makes sure that the different parties involved in a computation process do not learn anything more than the result(s) of the computation process and whatever is derivable from those results in a polynomial amount of time (without cheating). More precisely, SMPC is of great interest to the inference problem. In the case of horizontally, vertically and grid partitioned data in data mining, the mining process requires a lot of communication between the different parties. The SMPC techniques prevent any party from deriving new knowledge about the other parties involved. It is not our intention to give a complete overview of SMPC here; we refer to [9,2,15,8,12,1,3,14]. Here we provide the security protocols necessary for our purposes, i.e.,

  • the secure sum protocol,
  • the Yao circuit,
  • the secure union protocol,
  • the secure size of set intersection protocol; and
  • the x ln(x) protocol.

Fig. 1: (left) Horizontally distributed data. (right) Vertically distributed data. [Figure: on the left, parties P_11 and P_12 hold the same attributes I, A_1, ..., A_5, C over different tuples; on the right, parties P_11 and P_21 hold the disjoint attribute sets I, A_1, A_2, A_3 and I, A_4, A_5, C over the same tuples.]

We remark that in the context of secure multiparty computation, two models, that we already mentioned before, are considered, namely the "semi-honest" and the "malicious" model [6,15]. In the semi-honest model, all parties follow the protocol strictly. They are allowed to remember everything they encounter while executing the protocol and to use this information to compute (in polynomial time) information about the other parties. In the malicious model the parties are allowed to cheat.
They may for example falsify their inputs in order to learn more about the inputs of other parties.

Fig. 2: Grid distributed data. [Figure: nine parties P_11 through P_33 arranged in a 3×3 grid; parties P_1j hold I, A_1, A_2, parties P_2j hold I, A_3, and parties P_3j hold I, A_4, A_5, C, while parties with the same index j hold the same tuples.]

We will assume the semi-honest model in the description of our algorithms.

2.3.1 Secure sum protocol

The goal of this protocol is that k > 2 parties can compute the sum of the values each party holds in such a way that no party can learn anything about the values of the other parties. The protocol of Kantarcioglu and Clifton [7] protects individual values by using a random number. Party 0 adds a random number to its own value and sends the result to Party 1. Party 1 cannot learn anything from this value due to the random number. Party 1 adds his value to this number and sends it along to Party 2. This process continues until the last party has been reached. This party adds his number to the number it received and sends it to Party 0. Party 0 can now compute the sum by subtracting the random number from the sum it received from the last party. Now Party 0 reveals the sum to the other parties.

How safe is this protocol? It can be shown, by means of a polynomial time simulator, that, in the semi-honest model, this protocol is safe. Actually, to show safety it is necessary that all values remain within a finite domain [0, m] and all computations are done modulo m.

It should be remarked that the protocol can be broken if we assume the malicious model. For instance, it is clear that when Party i − 1 and Party i + 1 collaborate, the value of Party i can be discovered. As a remedy for this problem, each party can split its value into n parts.
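Leaving aside the splitting refinement just mentioned, the basic ring protocol can be sketched as follows; the counts and the bound m are illustrative, and a single function stands in for the message passing between the k parties:

```python
import random

def secure_sum(values, m):
    # Party 0 masks its value with a random number r; each subsequent
    # party adds its own value modulo m and passes the running total on.
    r = random.randrange(m)
    running = (r + values[0]) % m
    for v in values[1:]:          # Parties 1 .. k-1 along the ring
        running = (running + v) % m
    # Party 0 subtracts its random mask to recover the actual sum.
    return (running - r) % m

# Hypothetical local counts held by k = 4 parties.
print(secure_sum([17, 3, 25, 9], m=1000))   # 54
```

Because every intermediate value is offset by the mask r (and reduced modulo m), no single party along the ring learns anything about the others' inputs.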
The sum of all parts is then calculated. To avoid that Party i − 1 and Party i + 1 can collaborate to discover the value of Party i, a different path through the parties is followed during each of these n computations. In this way, more parties have to collaborate to discover individual values.

2.3.2 Yao circuit

Yao introduced in [15] the concept of secure two party computation. He showed that any function f(x, y), where x is the input of Party 1 and y the input of Party 2, can be evaluated in a secure way. To formalize the concept of security, we concentrate on functions f (Yao makes use of Boolean circuits to represent a function f) of the form

    f(x, y) = (f_1(x, y), f_2(x, y)).

This function receives a part of its input, namely x, from Party 1 and the other part of its input, namely y, from Party 2. Party 1 wants to learn f_1(x, y) and Party 2 wants to learn f_2(x, y). Suppose that protocol Π is used to learn f. View_i^Π is what Party i learns by executing protocol Π and Output_i^Π is the output of Party i (i = 1, 2). Finally, let S_i be an algorithm that can be executed in polynomial time. Yao defines

    {(S_1(x, f_1(x, y)), f_2(x, y))} = {(View_1^Π(x, y), Output_2^Π(x, y))}

and

    {(f_1(x, y), S_2(y, f_2(x, y)))} = {(Output_1^Π(x, y), View_2^Π(x, y))},

meaning that any party can learn from f, by executing protocol Π, only those facts that can be learned in polynomial time from his/her input and his/her output. Executing the protocol does therefore not provide any extra information. We remark that Goldreich et al. [6] generalized the results of Yao to more than two parties. Goldreich et al. also gave the composition theorem, which states that if a function g can be reduced safely to a function f, and if there is a protocol to safely compute f, then g can also be computed safely.
In this paper, we will refer to this type of circuits as Yao circuits, even if they concern more than two parties.

2.3.3 Secure union protocol

When there are only two parties, computing the union of two sets belonging to each of those parties can lead to security problems. Indeed, the knowledge about one's own set and about the union gives (at least partial) knowledge about the other party's set. In this section, we outline a method to compute the union of k itemsets, belonging to k parties, k > 2. The goal is that all parties should learn the union, without learning about the itemsets of other parties. The algorithm is from Kantarcioglu and Clifton [7] and consists of four phases that we sketch below. These authors also show its security.

Phase 1: All parties generate a commutative, deterministic encryption key E_i and a decryption key D_i. Each itemset is augmented with fake or dummy items (this is done to prevent the determination of the cardinality of the itemset). At the end of Phase 1, each party has an itemset of the same size (which is agreed upon at the start).

Phase 2: Each party encrypts its items and communicates them to the next party (the communication is cyclic as in the case of secure sum computation, i.e., Party i sends information to Party (i + 1) mod k). Each party encrypts what it receives and passes it on to the next party. This continues until each Party i is in the possession of the completely encrypted items of Party (i + 1) mod k. We remark that to continue one more step would no longer be secure.

Phase 3: The even-numbered parties send all items in their possession to Party 0 and the odd-numbered parties do the same to Party 1 (the last party always has to send to Party 1 to avoid that a party gets its own fully encrypted itemset).
Parties 0 and 1 take the union of what they received and remove the doubles. Party 1 sends everything he has to Party 0, who removes the doubles. At this point the union is in the possession of Party 0, be it fully encrypted.

Phase 4: The encrypted union is sent to all parties to be decrypted. Finally the fake items are removed and the result is announced to all parties.

2.3.4 Secure size of set intersection protocol

When there are only two parties, computing the size of the intersection of two sets belonging to each of the parties can lead to security problems. Indeed, the knowledge about one's own set and about the size of the intersection of the two sets gives (at least partial) knowledge about the set of the other party. So, we are interested in computing the size of the set intersection of k itemsets, belonging to k parties, k > 2. The goal is that all parties should learn the size of the set intersection, without learning about the itemsets of other parties. Jaideep Vaidya [14] proposed a protocol for the secure computation of the size of set intersection. It is similar to the secure union protocol, and we will not repeat the details here but refer to [14].

2.3.5 x ln(x) protocol

The x ln(x) protocol, due to Lindell and Pinkas [9], is different from the previous protocols. It uses Yao circuits, as mentioned earlier in this section. Because these circuits are only suitable for two parties, this protocol too is only suitable for two parties. Assume we have two parties, called Alice and Bob. Alice has a value x_a and Bob has a value x_b. The goal of the x ln(x) protocol is to give Alice and Bob both a share, s_a and s_b respectively, such that

    s_a + s_b = (x_a + x_b) ln(x_a + x_b).

The x ln(x) protocol makes use of two subprotocols.
The first receives two values x_a and x_b as input and returns two random shares of ln(x_a + x_b) as output (using a Taylor series). The second, called the multiplication protocol, receives two values u_a and u_b as input and returns two random shares of u_a · u_b as output. Alice and Bob run the ln(x) protocol and obtain shares u_a and u_b. Next, the multiplication protocol is executed twice. First it is called with u_a and x_b as input; this gives Alice and Bob respectively shares v_a and v_b. The second time it is called with x_a and u_b, giving Alice and Bob respectively shares w_a and w_b. Alice now has x_a, u_a, v_a and w_a, with which she can compute s_a = x_a u_a + v_a + w_a. Bob can construct s_b = x_b u_b + v_b + w_b in a similar way. Since x_a u_a + x_b u_b + x_a u_b + x_b u_a = (x_a + x_b)(u_a + u_b) = (x_a + x_b) ln(x_a + x_b), Alice and Bob both have their share of (x_a + x_b) ln(x_a + x_b).

3 Privacy preserving ID3: Grid partitioned data

In the present section, we introduce our algorithms, preserving privacy over grid partitioned data. Basically, we will study the following dilemma: when data is grid partitioned we can first merge it horizontally and then further develop the process vertically, or the other way around. Obviously other ways of doing this are possible as well, but we consider only the two straightforward ones. Of course, while building the decision tree we need to preserve privacy and use some well known protocols for this. In this paper we consider privacy as protecting individual data tuples as well as protecting attributes and values of attributes. So each party will reveal as little as possible about its data while still constructing an applicable distributed decision tree. The only thing that is known about the tree by all parties is its structure and which party is responsible for each decision node.
More precisely, it is known which party possesses the attribute used to make each decision, but not which attribute (and value) it is. We assume that only a limited number of parties know the class attribute and that no party knows the entire set of attributes, which is obvious as we use grid partitioned data.

Once the tree is constructed, instance classification proceeds as follows. The party that wishes to classify a new unseen instance knows the root node of the tree: either the node resides at his site or he knows the root node identification (nodeID). A node identification contains a code identifying the party possessing that particular node. Basically, when classifying a new instance, control passes from party to party, depending on the decision nodes that are visited. Every party knows the tuple's attribute values for the nodes at its site but knows nothing about the other attribute values. The classification then happens as in Algorithm 2.

Algorithm 2 The classification algorithm, called classify(t, nodeID). A site wishes to classify a new instance t. Control starts at the root node (which every party knows).
1: if the nodeID is a leaf node then
2:   its classification value (or distribution) is returned.
3: else if the nodeID is an interior node then
4:   node = local node with nodeID
5:   value = value of attribute node.A (used as decision attribute) for the tuple t we are classifying
6:   childID = node.value
7:   return childID.classify(t, childID)
8: end if

Before introducing our new algorithms for grid partitioned data, we introduce a minor side result which has not been dealt with so far in the literature, i.e., horizontally partitioned data with more than two parties.

3.1 Privacy preserving ID3 over horizontally partitioned data involving more than two parties

In this section we extend the result of Lindell and Pinkas [8], i.e.,
preserving privacy for decision tree learning with two parties, to more than two parties. Recall the ID3 algorithm from Section 2.1. We will separately consider its three basic steps, i.e., the emptiness test of the attribute set R, the test whether all transactions have the same class label (the class label test), and the default case. We explain in detail how the algorithm preserves privacy in each of them.

Emptiness test. Thanks to the horizontal distribution of the data and the fact that all parties know the intermediate tree, they can easily determine whether R is empty. In case R is empty they have to determine the most frequent class value. This can be calculated by using the secure sum protocol for each class value. Each party inputs to the protocol the number of tuples at his data site having the particular class value. In this way they can safely compute, for each class value, the total number of tuples over all sites having this value for the class attribute. Now a leaf node can be constructed containing the most frequent class value.

Class label test. To securely determine whether all tuples in S have the same class value, a variant of the secure union protocol can be adopted. More precisely, all parties provide a value as input to the protocol. If a party has only one class value in all of its tuples, it provides this value as input. Otherwise, a fixed symbol ⊥ is provided as input. The protocol then runs analogously to the standard secure union protocol until the step in which the data has to be decrypted. The first party, i.e., Party 0, has all the values in its possession, of which he deletes all doubles. Now there are two possibilities: only one value remains or not. In case of the former, this must be the value which was provided as input by Party 0. This is a class value or the ⊥ symbol.
If it is a class value, this means that all parties have provided the protocol with this same value; otherwise, it means that all parties still have more than one class value. In case of the latter, it is certain that there is still more than one class value. The protocol then has to be stopped, or else values are learned by different parties which they should not learn.

Default case. In this case it should be determined which attribute classifies the data tuples in S most accurately. To calculate the information gain, x ln(x) has to be calculated a number of times, with x partitioned over the different parties (being counts of data tuples). This can be solved by the secure sum protocol. The different parties provide their share of x as input to the protocol. The sum known to the last party and the random value of the first party multiplied by −1 are input to the x ln(x) protocol. This protocol provides shares of the result as output, which can then be used as input to a circuit which securely computes their sum and outputs which attribute classifies the tuples best.

The complete description can be found in Algorithm 3.

Algorithm 3 The privacy preserving ID3 algorithm for more than two parties over horizontally distributed data
Require: R, the set of attributes.
Require: C, the class attribute.
Require: S, the horizontally distributed data set.
1: if the parties test that R is empty then
2:   The secure sum protocol is used to calculate which class value c_i is most frequent.
3:   Return a leaf with class value c_i.
4: else if all parties, using the secure union protocol, test that all tuples in S have the same class value c_i then
5:   Return a leaf with class value c_i.
6: else
7:   Determine the attribute A classifying the tuples in S most accurately: use the secure sum and x ln(x) protocols.
8:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
9:   Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
10: end if

3.2 Grid partitioned data

We will introduce the grid partitioned privacy preserving algorithm by running through the different steps of ID3 informally. It is important to realize that no site knows the complete attribute set S, and that only a limited number of parties know the class attribute, more particularly as many as there are horizontal distributions. Note that these algorithms only consider the cases in which the parties are denoted by P_ij with i = 1, ..., v and j = 1, ..., h.

3.2.1 Horizontal merge and vertical development

Recall the ID3 algorithm from Section 2.1. We will separately consider its three basic steps, i.e., the emptiness test of the attribute set S, the test whether all transactions have the same class label (the class label test), and the default case. Here we consider the case where we first merge the data horizontally and continue vertically. A horizontal merge means that we eliminate the horizontal distribution, leaving only a vertical distribution.

Emptiness test. To determine whether there are any attributes left, as many parties as there are vertical distributions need to cooperate with one another, as we need to know all attributes to compute this test. This can be easily understood from Figure 2. More precisely, in the example of that figure, Party_11, Party_21 and Party_31 can together determine whether there are any attributes left. These parties check how many possible attributes they still possess as candidate decision nodes and pass this value as input to the secure sum protocol. At the end of the protocol, the sum and the random value are passed to a Yao circuit, which tests whether the sum equals zero (meaning that S is empty) or not. In case the sum is zero, the most frequent class value has to be determined.
This is done in the following manner: first, all parties determine the tuples reaching the current node in the tree. Then these tuples are merged horizontally by constructing a union over the vertical groups (over index i), i.e., for each vertical group a secure union protocol is applied. In case of Figure 2 we have the following groups computing a union: Party_11, Party_12 and Party_13; Party_21, Party_22 and Party_23; and Party_31, Party_32 and Party_33. Parties located on the same horizontal layer (meaning that they have the same index j), and which are thus vertically distributed, will use the same encryption key to compute the vertical unions (i.e., the horizontal merge). In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. At this stage only a vertical distribution is left over the entire distributed database, as we merged the data horizontally. Now we will continue by developing vertically. The intersections of these different sets with the tuples in a particular class give the number of tuples that reach that point in the tree. This can be done for each class value; note that it is not necessary to use the secure size of set intersection protocol because the unions are already encrypted. This eventually gives us the most frequent class value. Note that it is not necessary to decrypt the values again to compute the intersections. The reason is that we used the same encryption keys for parties at the same horizontal level, implying that equal values in the encrypted unions are also equal in the real unions. Now a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.
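The shared-key trick just described, where equal plaintexts encrypt to equal ciphertexts so that encrypted unions can be intersected without decryption, can be sketched as follows. This is only an illustration: HMAC stands in for the deterministic encryption the protocol assumes, and the layer key, group names and tuple identifiers are all hypothetical.

```python
import hmac, hashlib

def encrypt(key: bytes, value: str) -> str:
    """Deterministic keyed 'encryption' of a tuple identifier.

    Stand-in for the deterministic encryption assumed by the protocol:
    parties sharing `key` map equal plaintexts to equal ciphertexts.
    """
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# Hypothetical key shared by all parties on the same horizontal layer.
layer_key = b"shared-layer-1-key"

# Each vertical group contributes the (encrypted) IDs of tuples reaching
# the current node; the horizontal merge is then a plain set union.
union_group_1 = {encrypt(layer_key, t) for t in ["t1", "t2", "t3"]}
union_group_2 = {encrypt(layer_key, t) for t in ["t2", "t3", "t4"]}

# Tuples of one particular class, encrypted under the same layer key by a
# party holding the class attribute.
class_pos = {encrypt(layer_key, t) for t in ["t2", "t4"]}

# Because equal plaintexts encrypt to equal ciphertexts, the intersection
# can be computed directly on the encrypted sets -- no decryption needed.
reaching_node = union_group_1 | union_group_2
count_pos = len(reaching_node & class_pos)
print(count_pos)  # → 2
```

Counting this intersection per class value yields the most frequent class value, exactly as in the text above.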
Class label test. Checking whether all transactions in the training set S have the same class value happens analogously to determining the most frequent class value. More precisely, one party knowing the class attribute can compute the possible intersections, i.e., the intersections of the sets of tuples which might reach the current level or node of the tree with the tuples in a particular class. If all intersections equal zero besides one, all tuples in S have that particular class value. Since they are all the same, a leaf node is now constructed. The parties which were joined in one vertical group and know the class attribute all know the id of the node (nodeID in the algorithms) and the specific class value. All other parties just get to know the nodeID of the node, which they need in case they have to classify a new tuple leading to this node in the tree.

Default case. In this case, the best classifying attribute has to be determined. To do this, transactions or tuples need to be counted. To learn these numbers, recall that entropy and information gain were defined as follows:

entropy(S) = Σ_{i=1}^{d} −p_i log_2(p_i),

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i, i.e., N_i/N, where N is the total number of tuples reaching the current node and N_i is the number of tuples having target value i. The information gain of an attribute A is then defined as:

gain(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) entropy(S_v).

For each attribute, the information gain needs to be computed. First, all parties determine the tuples reaching the current node in the tree, i.e., N. Then these tuples are merged horizontally by constructing a union over the vertical groups, i.e., for each vertical group a secure union protocol is applied.
In case of Figure 2, we have the following groups computing a union: Party_11, Party_12 and Party_13; Party_21, Party_22 and Party_23; Party_31, Party_32 and Party_33. For every vertical group, one party will iterate over its attributes. For each such attribute, the numbers of tuples need to be counted for every value of this attribute. Thus, for every value of the attribute (which is encrypted), an intersection is computed over all the sets resulting from the horizontal merge. When we know all these numbers, the information gain of this attribute can be computed. This step is repeated for all attributes. The process described so far is called vertical development. One of the parties of each vertical group saves the information gain of its attributes. These parties can then cooperate to compute the best classifying one. Finally, the party which possesses the best classifying attribute constructs a decision node, which is given a node identifier nodeID. The value of the node is communicated to the other parties that also possess this attribute. The other parties only get to know the identifier. The complete description can be found in Algorithm 4.

3.2.2 Vertical merge and horizontal development

Emptiness test. To determine whether there are any attributes left, as many parties as there are vertical distributions need to cooperate with one another, as we need to know all attributes to compute this test. So essentially this is done in the same manner as with the horizontal merge, i.e., the previous algorithm.

Algorithm 4 The privacy preserving ID3 algorithm over grid partitioned data when data is merged horizontally and further developed vertically
Require: R, the set of attributes distributed among the parties P_ij with i = 1, ..., v, j = 1, ..., h.
Require: C, the class attribute with d class values c_1, ..., c_d.
Require: S, the grid distributed data set over parties P_ij with i = 1, ..., v, j = 1, ..., h, and parties P_v,j holding the class attribute.
1: if (Emptiness test) the parties test that R is empty then
2:   The secure sum protocol and a Yao circuit are used to test whether R is empty.
3:   In case the attribute set is empty, the secure union protocol is used to merge the data horizontally. For the vertical development, the secure size of set intersection protocol does NOT have to be used. A Yao circuit is used to calculate which class value c_i is most frequent. A leaf node with class value c_i is returned.
4: else if all parties test whether all tuples have the same class value c_i then
5:   The secure union protocol and a Yao circuit are used to calculate this.
6:   In case the test is TRUE, a leaf with class value c_i is returned.
7: else
8:   Determine the attribute A classifying the tuples in S most accurately: use the secure union and secure sum protocols.
9:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
10:  Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
11: end if

To determine the most frequent class value, we will merge the data vertically. More precisely, every party first determines the number of tuples that reach the current level or node of the tree. For this, the parties only use the attributes they possess. Then we merge vertically by letting the parties at the same horizontal level cooperate. In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. The parties that possess the class attribute now need to compute a set per class value. In our example these are parties Party_31, Party_32 and Party_33.
They execute as many secure size of set intersection protocols as there are class values. In this manner they compute, per horizontal group (or vertical merge), the number of transactions per class value. Once the parties possessing the class attribute have computed these intersections, they have to cooperate to find out the total number of tuples per class value. They compute this by using a secure sum protocol per class value. Then these values are passed on to a Yao circuit in order to learn the most frequent class value. Now a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.

Class label test. Determining whether all tuples have the same class value is analogous to the previous step. The difference lies in the Yao circuit, which will test whether all sums equal zero except for one. Again a leaf can be constructed with a certain leaf identifier. The value of the leaf is known by the parties that have the class attribute. The others only know the identifier.

Default case. We need to compute the best classifying attribute. To do this, transactions or tuples need to be counted. To learn these numbers, recall that entropy and information gain were defined as follows:

entropy(S) = Σ_{i=1}^{d} −p_i log_2(p_i),

where d is the total number of different values the target class can take on and p_i is the proportion of tuples of the data set having target value i, i.e., N_i/N, where N is the total number of tuples reaching the current node and N_i is the number of tuples having target value i. The information gain of an attribute A is then defined as:

gain(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) entropy(S_v).

First, we merge the data vertically.
More precisely, every party first determines the number of tuples that reach the current level or node of the tree. For this, the parties only use the attributes they possess. Then the data is merged vertically by parties at the same horizontal layer (having the same index j in P_ij) via a secure size of set intersection protocol, to obtain exactly those tuples that are in the current dataset associated to the node under consideration in the tree. In our example these are Party_11, Party_21 and Party_31; Party_12, Party_22 and Party_32; Party_13, Party_23 and Party_33. Through a secure sum protocol these numbers can now be added to learn the number of tuples that reach the current node of the tree, which is denoted by N in the entropy formula. Now we need to compute, for each remaining attribute, its information gain. This is done by computing, for each value of an attribute over each horizontal layer, the number of tuples having this attribute value (we compute N_i). This is done by using the secure size of set intersection protocol. Then these numbers can be added over all horizontal layers by using a secure sum protocol. This is called horizontal development. The secure sum (to which the random value was added) and the random value itself multiplied by −1 are provided to the x ln(x) protocol. This circuit will then output shares of the result, which can be used as input to a circuit which securely computes their sum and outputs which attribute classifies the tuples best. A decision node can now be constructed for the best classifying attribute. The description of the algorithm is summarized in Algorithm 5.

Algorithm 5 The privacy preserving ID3 algorithm over grid partitioned data when data is merged vertically and further developed horizontally
Require: R, the set of attributes distributed among the parties P_ij with i = 1, ..., v, j = 1, ..., h.
Require: C, the class attribute with d class values c_1, ..., c_d.
Require: S, the grid distributed data set over parties P_ij with i = 1, ..., v, j = 1, ..., h, and parties P_v,j holding the class attribute.
1: if (Emptiness test) the parties test that R is empty then
2:   The secure sum protocol and a Yao circuit are used to test whether R is empty.
3:   In case the attribute set is empty, the secure size of set intersection protocol, the secure sum protocol and a Yao circuit are used to calculate which class value c_i is most frequent. A leaf node with class value c_i is returned.
4: else if all parties test whether all tuples have the same class value c_i then
5:   The secure size of set intersection protocol, the secure sum protocol and a Yao circuit are used to calculate this.
6:   In case the test is TRUE, a leaf with class value c_i is returned.
7: else
8:   Determine the attribute A classifying the tuples in S most accurately: use the secure size of set intersection, secure sum and x ln(x) protocols.
9:   Partition S into m parts S(a_1), ..., S(a_m) such that a_1, ..., a_m are the different values of A.
10:  Return a tree with root A and m branches a_1, ..., a_m such that branch i contains ID3(R − {A}, C, S(a_i)).
11: end if

4 Complexity analysis of privacy preserving ID3 over grid-partitioned data

In this section we analyse the complexity of the two computation strategies proposed in the previous section: first merging horizontally and developing vertically, or first merging vertically and next developing horizontally. The different quantities h, v, k, |T|, |R|, d, m, t and n that play a role in this analysis are explained in the next table.
Notation and meaning:
h    the number of horizontal groups
v    the number of vertical groups
k    the number of parties (= h × v)
|T|  the number of tuples in the data set
|R|  the number of attributes
d    the number of values for the class attribute C
m    the maximal number of values for an attribute
t    the maximal length of encryption keys
n    the maximal length of the Taylor series

The predominant task in the ID3 algorithm is to determine the attribute with the highest information gain, and we will base our analysis mainly on this component.

4.1 The complexity of the components from SMPC

In discussing the complexity of the building blocks described in Section 2.3, usually two components are considered: the computational complexity and the communication complexity. The former considers the cost of computations in the classical sense; the latter considers the cost of passing messages, e.g., between different parties.

4.1.1 The complexity of the secure sum protocol

For the secure sum protocol with k parties, the computation and communication costs are both O(k log(|T|)). Each of the parties never outputs values larger than |T| and the messages passed are never larger than |T|. Assuming binary encoding of numbers, this gives the above result.

4.1.2 The complexity of the secure union protocol and secure size of intersection protocol

For the secure union protocol and the secure size of intersection protocol with k parties, the computation cost is O(k² |T| t³) and the communication cost is O(k² |T| t). Indeed, the parties send sets of at most size |T| and make use of encryption keys of length t, hence the factor t³ in the computation cost, and every party has to encrypt k² sets. The factor t in the communication cost reflects the size of the sets that are transmitted: in total, k² messages are sent of size |T| t.
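As an illustration of the protocol analysed in Section 4.1.1, a minimal honest-but-curious secure sum can be sketched as follows. This is a single-process simulation in the style of Clifton et al. [2]; the modulus and the per-site counts are illustrative, and in the real protocol the masked running total is physically passed around a ring of parties.

```python
import random

def secure_sum(local_values, modulus):
    """Ring-based secure sum: party 0 masks the running total with a
    random offset, each subsequent party adds its private value modulo
    `modulus`, and party 0 removes the mask at the end.

    `modulus` must exceed the true total for the result to be exact.
    """
    r = random.randrange(modulus)           # party 0's random mask
    running = (r + local_values[0]) % modulus
    for v in local_values[1:]:              # message passed around the ring
        running = (running + v) % modulus
    return (running - r) % modulus          # party 0 unmasks the total

# Example: per-site counts of tuples with a given class value.
counts = [12, 7, 30]                        # three parties' private counts
print(secure_sum(counts, modulus=10**6))    # → 49
```

No party other than party 0 sees anything but a uniformly masked partial sum, which is what keeps the individual counts private.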
4.1.3 The complexity of the secure x ln(x) protocol and Yao circuits

The secure x ln(x) protocol for two parties has a computational cost of O(log(|T|)) and a communication cost of O(n log(|T|) t). It takes input values of at most |T|. The protocol also depends on a value n that determines how far a Taylor series is developed. The protocol contains a Yao circuit, created by one of the parties, who also gives his input to the circuit and passes it to the other party. This explains the communication cost, in which n obviously plays a role since it determines the size of the circuit. The second party receives the circuit and feeds his input to it. For this, one oblivious transfer is performed per bit. This step explains the computational cost.

4.2 The complexity of first horizontal merging

To determine the attribute that best classifies the data, unions have to be computed over h parties for each attribute. The exact number of unions may depend on the attribute under consideration. If it is an attribute that belongs to the party that possesses the class attribute, there are v + d + m − 1 unions. If the attribute belongs to another party, there are v + d + m − 2 unions. The number of unions to be transmitted also depends on the attribute. For attributes in the possession of the owner of the class attribute, there are v − 1 unions to be transmitted; for other attributes, v + d − 2. So we conclude:

computation cost = O(|R| (v + d + m) (h² |T| t³)),
communication cost = O(|R| (v + d) (h² |T| t)).

We end this section with a remark on how this protocol could be made more efficient. Remark that the strength of this protocol resides in the fact that adjacent parties may use the same encryption key. For this reason it is not necessary to use the secure size of set intersection protocol to calculate intersections.
The first phase of this protocol can be skipped because the same keys are used when computing unions. This is possible here because the data is both horizontally and vertically distributed. When the data is only vertically distributed, it would also be possible to let the parties agree on some encryption keys and to simplify the protocol in this way.

4.3 The complexity of first vertical merging

Per attribute, 1 + d + m + dm values have to be computed, namely:
• the number of transactions reaching the current node: 1;
• the number of transactions reaching the current node per class value: d;
• the number of transactions reaching the current node per attribute value: m;
• the number of transactions reaching the current node per class value and per attribute value: dm.

All these values can be computed via a secure size of set intersection protocol that is each time executed by v parties. Since we have to count this for each horizontal group, this gives in total h(1 + d + m + dm) calls to the secure size of set intersection protocol. With these values the computation continues. In case of two horizontal groups this is with the x ln(x) protocol; in case of more horizontal groups this is with the secure sum protocol, followed by the x ln(x) protocol. So we conclude:

computation cost = O(|R| (h(1 + d + m + dm)) (v² |T| t³) + |R| (1 + d + m + dm) log(|T|) + |R| log(|T|)) [+ O(h log(|T|))],
communication cost = O(|R| (h(1 + d + m + dm)) (v² |T| t) + |R| (1 + d + m + dm) n log(|T|) t + |R| log(|T|) t) [+ O(h log(|T|))].

4.4 Conclusion on the complexity analysis

We start by remarking that it looks more logical to merge the data first horizontally and then to further develop it vertically. The emptiness test can be implemented more efficiently in the former case.
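To get a feel for the two computation-cost expressions above, one can plug in sample parameter values. This sketch keeps only each strategy's dominant term, ignores the constants hidden by the O-notation and the lower-order terms, so only the relative magnitudes are meaningful; the parameter values are purely illustrative.

```python
def cost_horizontal_first(R, v, h, d, m, T, t):
    """Dominant computation-cost term for horizontal merge followed by
    vertical development: O(|R| (v + d + m) h^2 |T| t^3)."""
    return R * (v + d + m) * (h ** 2) * T * (t ** 3)

def cost_vertical_first(R, v, h, d, m, T, t):
    """Dominant computation-cost term for vertical merge followed by
    horizontal development: O(|R| h (1 + d + m + d*m) v^2 |T| t^3)."""
    return R * h * (1 + d + m + d * m) * (v ** 2) * T * (t ** 3)

# Illustrative parameters: a 3x3 grid of parties and a modest data set.
params = dict(R=10, v=3, h=3, d=2, m=4, T=10_000, t=128)
print(cost_horizontal_first(**params) < cost_vertical_first(**params))  # → True
```

For this 3×3 grid the horizontal-first term is 810 · |T| · t³ versus 4050 · |T| · t³ for vertical-first, consistent with the conclusion that merging horizontally first is the more efficient strategy.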
The secure x ln(x) protocol gives an approximated result, but the difference from the real result is small. This protocol also makes heavy use of circuit computations, which in practice it is preferable to avoid.

Concerning complexity, the expressions obtained above also show that first merging horizontally is advantageous. And, as remarked before, it can be improved by an optimal use of encryption. Indeed, by giving different parties the same encryption key, it is not necessary to perform the secure size of set intersection protocols after the secure union protocols have been executed.

5 Conclusions

In this paper we first discussed the significance of extending the current state of the art in privacy preserving data mining to grid partitioned data, i.e., data which is both horizontally and vertically partitioned. Our motivating example shows that this situation is of great interest for real world situations and applications. We then formally defined horizontally, vertically and grid partitioned data. To our knowledge, we are the first to formalize the concept of grid partitioned data.

We continued by introducing three new privacy preserving data mining algorithms. We started by extending the result of Lindell and Pinkas [8], i.e., preserving privacy for decision tree learning with two parties when data is horizontally distributed, to more than two parties. However, the main contribution of this paper is the two algorithms to securely induce a distributed decision tree when data is grid partitioned. More precisely, we considered two possible solutions: one in which data is first merged horizontally and then further developed vertically, and vice versa. The complexity analysis of both algorithms shows that it is more efficient to first merge data horizontally and further develop it vertically than the other way around.

References

[1] R.
Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.
[2] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M. Y. Zhu, Tools for privacy preserving data mining, SIGKDD Explorations 4 (2) (2002) 28–34.
[3] C. Clifton, D. Marks, Security and privacy implications of data mining, in: Proceedings of the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, 1996.
[4] W. Du, Z. Zhan, Building decision tree classifier on private data, in: IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, 2002.
[5] C. Farkas, S. Jajodia, The inference problem: A survey, SIGKDD Explorations 4 (2) (2002) 6–11.
[6] O. Goldreich, S. Micali, A. Wigderson, How to play any mental game or a completeness theorem for protocols with honest majority, in: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC), 1987.
[7] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, in: Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), 2002.
[8] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proceedings of the 20th Annual International Cryptology Conference (CRYPTO), 2000.
[9] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proceedings of the 20th Annual International Cryptology Conference, vol. 1880 of Lecture Notes in Computer Science, 2000.
[10] T. Mitchell, Machine Learning, McGraw-Hill Series in Computer Science.
[11] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106.
[12] C. Su, K. Sakurai, Secure computation over distributed databases, IPSJ Journal.
[13] L. Sweeney, A primer on data privacy protection.
PhD thesis, Massachusetts Institute of Technology, 2001.
[14] J. Vaidya, Privacy Preserving Data Mining over Vertically Partitioned Data, PhD thesis, Purdue University, 2004.
[15] A. C.-C. Yao, How to generate and exchange secrets (extended abstract), in: Proceedings of the 27th IEEE Symposium on Foundations of Computer Science (FOCS), 1986.
