Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers

Hac king Smart Mac hines with Smarte r Ones: Ho w to Extract Meaningful Data from Mac hine Learning Classiﬁers Giusepp e A tenies e 1 , Giov anni F elici 2 , Luigi V. Mancini 1 , Angelo Spo g nardi 1 , An tonio Villani 3 , and Domenico Vitali 1 1 Universit` a di Roma La Sapienza, Dipartimento di Informatica { ateniese ,mancini,spognardi , vitali } @di.uniroma1.it 2 Consiglio Nazionale delle Ricerche, Istituto di Analisi dei Sistemi ed Informatica Roma giovanni.fe lici@iasi.cnr.it 3 Universit` a di Roma T re, Dipartimen to d i Matematic a villani@mat.uniro ma3.it Abstract. Mac hine Learning (ML) algorithms are used to train com- puters to p erform a va riety of complex tasks and improv e with exp eri- ence. Computers learn how to recognize patterns, make unintended deci- sions, or rea ct to a dy namic environment. Certain trained mac hines ma y b e more eﬀective than others b ecause th ey are based on more suitable ML algorithms or b ecause they were trained through sup erior training sets. Although ML algorithms are known and publicly rel eased, training sets may n ot b e reasonably ascertainable and , indeed, may b e guarded as trade secrets. While muc h researc h has b een p erformed about th e priv acy of the elements of training sets, in this p aper we fo cus our at- tention on ML classiﬁers and on the statistical information that can b e unconsciously or maliciously revealed from them. W e show that it is p os- sible to infer unexp ected but useful information from ML classiﬁers. In particular, we build a nov el meta-classiﬁer and train it to hack other clas- siﬁers, obtaining meaningful information abou t their training sets. This kind of information leak age can b e ex ploited, for examp le, by a vendor to build m ore eﬀective classiﬁers or to simply acquire trade secrets from a comp etitor’s ap p aratus, p otentially violating its intellectual prop erty righ ts. 1 In tro duction Machine learning cla ssiﬁers ar e des igned to make eﬀective and eﬃcient predic- tion of “ patterns” fro m lar ge data sets. Many applications have b een prop ose d in the literatur e (e.g., [27, 54, 49, 23, 25]) a nd machine lea rning alg o rithms p erv ade several contexts of informa tio n technology . ML approaches (such as Supp ort V ector machines, Clustering, Ba yesian net work, Hidden Marko v models, etc.) rely o n quite distinct ma thematical c o ncepts but generally they are employed to so lve similar proble ms . A machine learning alg orithm consists of tw o pha ses: tr aining and classiﬁc ation . During the training, th e ML algorithm is fed with a tr aining set of samples. In this phase, the relationships a nd the corr elations implied in the tr aining samples are gather ed ins ide the mo del . Afterwards, the mo del is used during the class iﬁc a tion phas e to classify and ev alua te new data. ML c lassiﬁers a r e usually able to mana ge a larg e amount o f data and to ada pt to dynamic environment s . Their versatility makes them suitable for several im- po rtant tasks. F or example, classiﬁcation a nd regression mo dels are employ e d to a nalyze curren t and historical trends to make predictions in ﬁnancial mar- kets [24, 33, 8], to study bio lo gical problems [54], to s uppor t medica l dia gnosis [30, 42, 57], to cla ssify netw or k traﬃc or detect anomalies [22, 28, 39, 12, 49]. One may think tha t it is safe to relea se a cla ssiﬁer, whether in hardware or softw are, s ince in tellec tual prop erty laws would preven t any o ne from pro ducing a similar appar atus, fo r example, b y cop ying its code or its design principles. How ever, rele a sing a trained class iﬁe r ma y be sub ject to unexp ected informa- tion lea k ag es that ma ke it pos sible to produce a comp etitive pro duct witho ut violating any in tellectual pr op erty rights. Let us consider , for ins tance, a classiﬁer C a that is less eﬀective than a cla ssiﬁer C b pro duced b y a competitor. The ML alg orithms us ed in C b may b e publicly a v ail- able or be infer r ed through r everse enginee r ing. F or example, c o mmercial soft- ware pro ducts for sp e ech recognition, such a s Nuance Dragon NaturallySp eak- ing [1], utilize widely studied Hidden Mar ko v Mo dels . These a lgorithms, alo ng with their optimizations , ar e well-understo o d a nd quite standard. Thus, the co m- mon assumption is that a n yone c a n easily replicate them. In par ticular, we could assume that the tra ining set used for C b is sup erior , in the sense tha t makes C b more eﬀective than C a even t ho ugh bo th implemen t essentially the same ML algorithms. What mak es C b better than C a is the sp eciﬁc knowledge for med dur- ing the training phase, inferr e d by the training s e t. F or instance, a classiﬁer that makes sto ck market predic tio ns base d o n neural net work holds its p ow er in the weigh ts at its hidden la yer (see A). But those w eights depend exclus iv e ly on the training set, hence v alua ble informatio n that must b e treasur ed. Thu s , it is fair to as k: Is it safe to r elease a pro ﬁtable ML classiﬁer? W ould selling a softw are/ha rdware clas siﬁer reveal concre te hin ts about its training set, uncov ering the secrets of its eﬀectiveness a nd jeopar dizing the vendor? W e s how that a classiﬁer c a n be hack ed and that it is pos sible to extract from it meaningful information about its training set. This can b e a ccomplished bec ause a typical ML classiﬁe r learns by changing its internal structur e to a b- sorb the information con ta ine d in the training da ta. In particular, w e devise and train a meta- classiﬁer tha t can succe ssfully detect and class ify these changes and deduce v aluable information. How ever, we could no t rep ort on pro ducts r eleased by co mmercial vendors b ecause w e did not get legal p ermission to ha ck a propri- etary pro duct. Nevertheless, we analyzed the same ML algo r ithms emplo yed by commercial pro ducts. F or example, we c onsidered the HMM-based sp eech recog- nition engine of the op en-source pack ag e V o xF org e which is similar to the o ne s employ ed b y commercial pro ducts , such as Nuance Dr a gon NaturallySpeking . W e no te, in addition, that using op en-sour ce s oft ware makes our exp eriments easily repro ducible by others. It is impor ta n t to obser ve that we a re not in terested in priv acy leaks, but rather in discov er ing anything that makes classiﬁer s b e tter than others . In partic- ular, we do not c are ab out pr o tecting the elements of the training set. Cons ide r the following exa mple: a sp eech recognition s o ft ware recognizes spoken words better than comp eting pro ducts, even though they a ll implemen t the sa me ML algorithms. The training s et is comp osed of commonly sp oken w ords, thus it do es not ma ke sense to talk ab out priv acy pro tection. How ever, we show how to build a meta-class iﬁe r trained to re veal that, for instance, the ma jority o f training samples came fr om female voices or from voices of p eo ple with marked accents (e.g., Indian, Br itish, American, etc.). Then, we can extra pola te certain hidden att ributes whic h are somehow absorb ed by the learning algor ithm, thus po ssibly uncov ering the secret sauce that makes the sp e ech reco gnition softw are stay ahead of the comp etition. Therefore the type of leak age w e are in ter e sted in is quite diﬀerent than that considered in pr iv acy preserving data mining and s ta tistical databases [14] or diﬀerential pr iv ac y [9, 19]. Indeed, in Section 4, we show that a system providing Diﬀerent ia l Priv acy is utterly insecur e in our mo del. Remark: W e intro duce a nov el type of informatio n leak ag e and sho w t ha t it is inherent to lea rning. This is far from obvious a nd, indeed, quite unexp ected: Clearly , a ll lea rning algo rithms must reco gnize patterns in their datas e t. Thus, classiﬁers will inheren tly reveal some information. The open question is whether this information has a ny meaning. Indeed, classiﬁe r s are v ery o pa que ob jects and make it diﬃcult to infer a n ything useful a t all. What we show here is that it is s till p ossible to extra c t something meaningful relating to pr op erties o f the training s et. This is surpris ing and achiev able through a meta-clas siﬁer that is sp ecially trained to exp ose this information. How ever, we do not attempt to formally deﬁne this new type o f information leak ag e nor provide mec hanisms to preven t it. 1.1 Con tributi o ns Our res ults evince rea lis tic issues facing machine learning algo r ithms. In partic- ular, the main contributions of our work a re: 1. W e put forward a new type o f infor mation leak age that, to the b est of our knowledge, has not been cons idered b efore. W e show that it is unsafe to release trained cla ssiﬁers since v a luable information ab out the t r aining set can b e extr a cted from them. 2. W e propose a wa y to leverage the ab ove infor mation leak age, devising a gen- eral attack strateg y that ca n b e used to ha ck ML classiﬁers . In particular , we deﬁne a mo del for a meta - classiﬁer that can b e trained to extract meaningful data fr om targeted cla ssiﬁers. 3. W e describ e s everal attacks against exis ting ML cla s siﬁers: we success fully attack ed a n Internet traﬃc classiﬁer implemented via Supp ort V ector Ma- chines (SVMs) and a sp eech reco gnition softw ar e based o n Hidden Mar ko v Mo dels (HMMs). W e be lieve existing clas siﬁers, whether commercia l pro ducts or pr ototypes re- leased to the rese a rch communit y , are susceptible to our general attack str ategy . W e put forward the imp ortance of pr otecting the tr aining set and of the need fo r nov el ma chine le a rning techniques tha t would pre ven t determined co mpetitor s from pr obing a ML cla s siﬁcator and lear ning trade secrets from it. 1.2 Organization of this pap er The r e s t of the pap er is org anized as follows: Section 2 descr ib es the pro blem and int r o duces an attack methodo logy that makes use of a ML mo del. Se c tion 3 shows how we success fully applied our prop os e d metho dology to hack tra ined SVM and HMM c lassiﬁers. In Section 4 we analyze the b ehavior of our attack metho do logy when the training set is provided through diﬀerential pr iv acy . Section 5 contains some related w o rks. Section 6 concludes our work with some remark s . 2 Hac king Mac hine Learning classiﬁers In this pap er we a r e interested in Machine Learning a lgorithms used for class i- ﬁcation purp oses, such as Int e r net traﬃc classiﬁer s, speech recognitio n s y stems, or for ﬁnancial mark et predictions. Our goal is to hack a tr ained classiﬁer to obtain informatio n that w as implicitly a bsorb ed from the element s the classiﬁer received as input. Consider for instance the Artiﬁcia l Neur al Networks (ANNs) based o n Multi- layer p er c eptr on (please refer to A for details ab out this a lgorithm). Consider a simple neur al netw or k tha t has to lea rn the iden tity function over a vector of eight bits, only one of them set to 1 ( this example is ta ken fr om the p o pular bo o k of Mitchell [47]). The net work has a ﬁxed structure with eight input neu- rons, three hidden units and eig ht output neurons. Using the ba ckpropagation algorithm ov er the eigh t p ossible input sequences, the net work ev entually learns the target function. By examining the weigh ts of the three hidden units, it is po ssible to observe how they actually enco de (in bina ry) eight distinct v alues, namely all po ssible sequences ov er three bits (000 , 001 , 0 10 , . . . , 111). The exact v a lues of the hidden units for one typical r un of the backpropagation algorithm are s hown in T able 1. Basically , the hidden units of the net work were able to capture the essential infor ma tion from the eight inputs, automatica lly discov er - ing a way to repr esent the inputs. Thus, it is p ossible to ex tract the (p oss ibly sensitive) cardinality of the training set by just lo o k ing at the trained netw ork. In the following sec tio n, we des crib e a metho d to ex tract this type of sensitive information. Namely , we show in Section 3.2 tha t it is po ssible to determine if a cer tain t y p e of netw ork tr a ﬃc was included in the tra ining set of an Internet classiﬁer trained on Cis co net work data ﬂows [53]. Similarly , w e hack ed a sp eech recognition system a nd were able to determine the accent of s pea kers emplo yed during its training. This ca se study is rep orted in Section 3.1. Input Hidden V alues Output 10000000 → . 89 . 04 . 08 → 10000000 01000000 → . 15 . 99 . 99 → 01000000 00100000 → . 01 . 97 . 27 → 00100000 00010000 → . 99 . 97 . 71 → 00010000 00001000 → . 03 . 05 . 02 → 00001000 00000100 → . 01 . 11 . 88 → 00000100 00000010 → . 80 . 01 . 98 → 00000010 00000001 → . 60 . 94 . 01 → 00000001 T able 1. The wei ghts of the hidd en states , taken fro m Figure 4.7 of [47] 2.1 An attac k strategy In this section we devise a general attac k strateg y a gainst a tra ined clas siﬁer that can make an attac ker able to disco ver some statistica l infor mation ab out the training set. W e deﬁne the training data s et D as a m ultiset wher e all the elements ar e couples of the form { ( a , l ) | a = h a 1 , a 2 , . . . , a n i} ; to simplify , we can assume without loss of genera lit y that a i ∈ { 0 , 1 } m , and l ∈ { 0 , 1 } ν . Each training element a is represented as a vector o f n fe atur es (the v alues a i of the v ecto r) a nd has an asso ciated classiﬁcation lab el l . C is a generic machine learning classiﬁer trained on D : it could be a n Artiﬁcia l Neu r al Net work (AN N), a Hidden Markov Mo del (HMM) or a s imple De cision T r e e (DT). W e as s ume that C is disclosed after the end of the tr a ining phase. This means that in our mo del the adversary cannot taint C during the learning pr o cess. Instead, we a ssume that the adv er sary is a ble to arbitrarily modify the b ehavior o f C during the c la ssiﬁcation pr o c e ss. In fact, when C is disclose d , it includes the set of instructions for the classiﬁcatio n task as well a s the mo del deﬁnition; hence, bo th the data structures and the instruction sequences ar e completely in the hand of the adversary . The assumption that the adversary has co mplete access to the cla ssiﬁer is reasonable s inc e it is p ossible to extract the plain classiﬁer a lso from a binary executable thro ugh, for instance, dyna mic ana ly sis tec hniq ue s [13]. Each classiﬁer C ca n be e nc o ded in a se t of feature vectors tha t can b e used as input to train a meta-clas siﬁer MC . The set of feature vectors that repre sent s C are denoted by F C . F or example, in the case of an SVM, the set F C would contain the list of all the s upp or t vectors of the class iﬁe r C . In Figure 1, C x is the trained class iﬁe r that the adversary want s to ex a mine in o rder to infer so me statistical informa tion ab out the training set D x . Le t P be the prop er ty that the adversary wan ts to learn about the undisclosed D x . W e write P ≈ D to say that the pr o per t y P is pres e rved b y the datase t D . F o r instance, in the con text o f medical diag nosis applications, P could b e: the entries of the tr aining set ar e e qu al ly b alanc e d b etwe en males and female s . T o discern whether P ≈ D x , the adversary ca n build a meta-classiﬁer MC , that is a classiﬁer trained o ver a particula r data s et D C comp osed of the elements a ∈ F C i lab eled with l ∈ { P , P } . The label is assigned according to the nature of the Fig. 1. A ttack metho dology: the target training set D x prod uced C x . Using seve ral training sets D 1 , . . . , D n with or without a sp eciﬁc prop erty , we build C 1 , . . . , C n , namely the training set for the meta-classiﬁer MC that will classify C x . dataset used to tr a in the class iﬁer C i . T o train MC the a dversar y has to build the tr aining set ﬁrst. F or this purp ose, the adversary ge nerates a vector of sp eciﬁc datasets D = ( D 1 , . . . , D n ) in such wa y that D co ntains a (pos s ibly) balanced amount of instances reﬂecting P and P . After this step, he tr ains the meta-classiﬁer MC as describ ed in Alg orithm 1. The algor ithm takes as input the crea ted training se ts D a nd their cor r esp onding lab els. It starts with an empt y data set (line 3). Then, it trains a classiﬁer C i on each cr eated data set (line 5 ) and ge ts the repr esentation of the classiﬁer as a set of feature vectors (line 6 ). Then, it adds each feature vector to the datas et D C (line 8). Finally , it trains the meta-clas siﬁer using the res ulting data s et D C (line 11). Next, the adversary uses the meta-classiﬁer MC o n F C x to predict which class l x the classiﬁer C x belo ngs to. This is already a new form of information leak age since the adversary learns whether the orig inal tra ining data D x preserves P o r not. In practice, thanks to our attack, we ar e able to infer an y key statistical prop erty P preser ved by the training set performing a sort of brute-force a ttack on the set of pr ope rties. It is important to remark that with this metho dolog y the adv ers ary extra cts external information , NOT in th e form of a ttributes o f the dataset D x . Thes e are essentially statistical prop erties inferred from the rela tio nship amo ng data set ent r ies. F or exa mple, in Section 3 .1 w e show ho w to attack a speech recog nitio n classiﬁer b y extracting information abo ut the a ccent o f the spea kers. This in- formation is not supp osed to be captured explicitly b y the model nor it is an attribute of the tra ining se t. T o further impr ov e the quality o f the clas siﬁcation pr o cess, some ﬁlt ers can be applied to the set D C of mo dels resulting from the tra ining phase. The ﬁlters depe nd on the problem domain a nd ar e used to ﬁnd optimal mo dels for the Input : D : the arra y of training sets l : the arra y of labels, where each l i ∈ { P , P } Output : The meta-classiﬁer MC 1 T rainMC ( D , l ) 2 b egin 3 D C = {∅} 4 foreac h D i ∈ D do 5 C i ← train( D i ) 6 F C i ← getF eatureV ectors( C i ) 7 foreac h a ∈ F C i do 8 D C = D C ∪ { a , l i } 9 end 10 end 11 MC ← t rain( D C ) 12 return MC 13 end Algorithm 1: T r aining of the meta-cla ssiﬁer prop erty P and get rid o f les s signiﬁcant entries. In so me c ases (as the example in Section 3.2), this step can be simply assimilated int o the tra ining pha s e of the meta-classiﬁer . In other cases, as the ex ample in Section 3.1, we will discuss a ﬁlter realized with the Kullback-Leibler divergence [43]. 3 Case studies In this se c tion we provide tw o examples of attac ks performed according with the metho dology introduce d in Section 2.1. W e prob e tw o co mplex systems, one of which is la rgely used by so ft ware vendors and research communities. As our ﬁrst example, we attack a Speech Recognition sys tem realized by Hidden Markov Mo dels; later, we co nsider a netw ork traﬃc cla ssiﬁer implemented by Supp ort V ecto r Machines. O ur exp eriments ar e p erfor med using W ek a ([5 6]). In ea c h experiment, we us e Decision T ree as meta-class iﬁe r M C (more details on Decisio n T ree are rep or ted in B); we a lwa ys use the C 4 . 5’s implementation, namely J 48 mo dule, included within the W ek a framework. Clear ly , the a ttack could b e replic a ted using meta- classiﬁers base d on other ML algor ithms. The ev aluation o f our exp eriments is per formed using standard metrics : (1 ) r e- c al l , that is the true p ositive rate, and (2) pr e cision , that is the ra tio of true po sitive and the total num b er of p ositive predictions of the mo del. F ur thermore, (3) ac cur acy , namely the r ate of corre ct predictions made by the classiﬁer ov er the num ber of instances o f the entire data set, can b e eas ily derived from the co nfusion matric es in Sections 3.1 and 3.2. In order to ev aluate the eﬀectiveness of our attack stra tegy , we crafted several classiﬁers trained on str ongly biased training sets. Thes e clas siﬁers would prob- ably o bta in very low p erformance dur ing the classiﬁcation pha se; as s uc h, they would b e unlikely employ ed in a commercial pro duct. Moreover, in our e x per i- men ts , we decided to focus on simple binary pro per ties. Our aims are to pr ovide an attack strateg y that could b e easily gener alized a nd to demostrate that it is po ssible to infer information o n the tra ining s et lo o king at the weigh ts lea rned by a classiﬁer. A ttacking commerc ial pro ducts is o nly a ma tter of tuning the genera tion of the sets D 1 , . . . , D n according to mor e co mplex pro p er ties . T o ev aluate our a ttack strategy we ma ke tw o assumptions: 1) the adversary knows which machine learning algorithm is emplo yed by the tar g et 2) the adver- sary ha s complete a ccess to the clas siﬁer. W e claim that these tw o assumptions are rea sonable. In fact, the information a bo ut wha t algor ithms are employ ed is not considered a sensitive information, and sometimes it is adv er tised b y the vendor itself; for instance, the newest version of the NaturallySp eaking engine (whic h is the version 12 at the time of wr iting) leverages HMM and ﬁve-grams to per form speech recognition and this info r mation can b e ga thered from Nuance’s website and patents. F o r wha t concerns the second assumption, note that in ma n y cases vendors need to hand o ut their classiﬁers to end-user s e mbedding them within the soft ware executable or appara tuse s; as such, an adversary would b e able to extrac t the classiﬁer using, for instance, tec hniq ue s based on dyna mic binary analy sis. Per- forming this t yp e of analysis is orthogona l to our a ttack metho dology and is out of the scop e o f this work. It is worth remar king that the structure o f the tra ining set (e.g., the list of at- tributes) is no t necessary to p erform our a ttack; indeed, w e are interested o n the external information ab out the training data and w e do not consider the attribute v a lues. 3.1 Hidden Mark ov Mo del s Bac kground A Marko v Mo del is a sto chastic pr o cess tha t can b e r epresented as a ﬁnite state mac hine in whic h the transitio n probability depends only on the curren t state and is indep endent from any pr ior (and future) state of the pro cess. An Hidden Markov Mo del , introduced in [16], is a particular type of Marko v Mo del for mo deling seque nc e s that can be characterized b y an under ly - ing pro cess genera ting an obse r v a ble sequence. Indeed, o nly the o utputs of the states are o bs erved (the actual sequence o f the states of the pro cess canno t b e directly observed). One of the most elegant examples to describ e HMMs was conceived b y Jaso n Eisner [26]: Supp ose tha t, in the year 2799 , a climate scien- tist is studying the weather in Ba ltimore Ma ryland for the summer of 20 07 by examining a diar y , which ha d rec o rded how many ice cr eams were eaten by Ja son every day of that summer. O nly using this rec o rd (the observ able sequence ), is it p ossible to estimate with a goo d a ppr oximation the daily temp erature (t he hidden sequence). HMMs s o lve the se quential le arning pr oblem that is a sp e c ial learning problem whe r e the da ta domain is sequential b y its natur e (e.g. sp eech recognition pro blem). In Figure 2, a simple mo del M is repr e sent e d that can b e 2 q 3 q 1 q Fig. 2. An example of Hidd en Ma rkov Model with th ree states. describ ed by: – a set of hidden states Q = q 1 , q 2 , ..., q m – a transition probability matrix A =    a 11 a 12 . . . a 1 m a 21 a 22 . . . a 2 m . . . . . . . . . . . .    where the element a i,j represents the probability of moving fr om s tate i to state j – an emissio n proba bilit y matrix B ( m × n ), where the element b j,k is the probability to pro duce the o bs erv a ble o k from the state j , that is b j,k = B j ( k ) = P ( o k | q j ) The HMM mo del is bas e d on t wo main assumptions. The ﬁrst is the Marko v assumption, namely that given a sequence x 1 , . . . , x i − 1 of transitions b etw een states, the pr obability of the next sta te dep ends o nly on the present state: P ( x i = q j | x 1 , x 2 , . . . , x i − 1 ) = P ( x i = q j | x i − 1 ) The se c ond is the output indep endence assumption, namely that given a se- quence x 1 , . . . , x T of tra nsitions b etw een states, where x i = q j , and the o bserved sequence y 1 , . . . , y T , the emission pro bability of any observ able o k depe nds only on the pre sent state a nd not on any other state or observ able: P ( y i = o k | x 1 , . . . , x i , . . . , x T , y 1 , . . . , y T ) = P ( o k | q j ) In Figure 2, thr e e s ta tes ( q 1 , q 2 and q 3 ) are shown: the tr ansition probabilities a ij , and, for the three states, the emission pro babilities ( B 1 , B 2 , B 3 resp ectively) of the three o bserv able ( o 1 , o 2 , o 3 ). The HMM mo dels are well-suited to s olve three types of problems: likeliho o d, deco ding and learning [38]. Likeliho o d problems are rela ted to ev aluating the probability of observing a given obs e rv a ble sequence y 1 , . . . , y T , given a complete HMM mo del, where bo th matrices A and B a re known. Decoding pr oblems call for the ev aluation of the bes t sequence of hidden states x 1 , . . . , x T that can hav e pro duced a given observ able sequence y 1 , . . . , y T . Le a rning problems consist of reconstructing the t wo matrices A and B o f an HMM, given the set of states Q and one (or more) observ ation seq uence Y . F or this task , the Viterbi and the Baum-Welch algorithms are used r esp ectively to train and tune the HMM. HMM for sp eec h recognition In this section we descr ib e the attack to the HMM in the sp eciﬁc case of Sp e e ch R e c o gnition Engines (SRE). Sp e e ch R e c o gni- tion (SR) is the pr o cess of conv er ting a so und r e corded through an acquis itio n hardware to a seq uence of written words. The applica tions of SR are manyfold: dictation, voice search, hands-fr e e co mma nd execution, audio archiv e searching, etc. The predo minant technology us ed to p erfor m this t a sk is the HMM [37], many to o ls a r e nowada ys av ailable ([5, 41]). W e ex plo ited o ur metho dolog y to verify whether the HMM was trained with a biase d training set: acco rding to the metho dology describ ed in 2.1, we are able to detect with high co nﬁdence whether the HMM was trained only with p eople from the sa me nationality . T o r ecognize a sp eech, SREs require tw o types of input: – an A c oustic Mo del , whic h is created b y taking speech audio ﬁles, i.e., the sp e e ch c orpus , and their transcr iptions, and co m bing them into a s tatistical representation of the sounds that make up ea ch word; – and either a L anguage Mo del or a Gr ammar File . Both describ e the set of words that the statistica l mo del will b e able to classify . How ever, the ﬁrst mo del contains the probabilities o f sequences of w ords, while the second con- tains a set of predeﬁned com bina tions of w or ds. In the following exp eriment, this pap er use s only the La nguage Mo del. Let us brieﬂy in tr o duce the typical SRE workﬂow. An unknown sp eech wav eform is captured by the a cquisition hardware, the Pulse Co de Mo dula tion provides the digital r epresentation of the ana logical audio signal. This bitstream is now con- verted in mel-fr e quency c epstr al c o eﬃcients (MF CCs), namely a r e pr esentation of the short-term p ow er spe c trum of sounds. The MFCCs a re the observables o f a Hidden Marko v Mo del that changes state o ver time and that generates one (or more) observables once it enters in to a new sta te. In this scenar io, the states of the HMM are all the p ossible subphonemes o f the la nguage while the tr ansition matr ix contains the proba bility for each sub- phoneme to cycle ov er itself or to move to the next subphoneme. The emission probabilities are the pro babilit y to obser ve a cer tain MFCC fr o m each sub- phoneme. The o nly p ossible tra nsitions be t ween the states of each phonemes are to themse lves o r to success ive s ta tes, in a left-to-r ig ht fashion; the self- lo ops makes it p ossible to deal with the v aria ble length of each pho neme with ease. Both tr ansition and emission pr obabilities are built using the Viterbi alg o - rithm [32] ov er a lar ge sp eech corpus. Since the MFCC ﬁles are v ector s of real-v a lued num b ers, they are approximated by the m ultiv ar iate Gaussians distribution (note that the probability to have exactly the same v ector w ould b e nearly 0 ). F or an y diﬀeren t state (i.e., sub- phoneme), each dimension of the vector ha s a cer tain me an and varianc e that represent the likeliho o d of a n individual aco ustic obs e rv a tion from that state. F o r the sake of our ex p er iments, we build the Hidden Markov Models using the Hidden Markov Mo del T o olkit (HTK) [60] to olkit. HTK consists o f a set of librar y mo dules and to ols a v ailable in C. The HTK to o lk it provides a high level of modularity and is organized through a set of libra ries with functions (e.g., HMem for memor y manag ement , H SigP for sig nal pro ces s ing ,. . . ) a nd a small core. The MFCC ﬁles w er e gathere d from the V oxF orge pro ject [2], the most imp ortant spe ech co rpus and acous tic model rep osito r y for op en-source sp eech reco gnition engines. Mo r eov er, ea c h sp eech ﬁle relea sed by V oxF o rge is asso ciated with several categor ies such as gender, age range, and pronunciation dialect. The aim of our e xper iment is to extract this infor mation, which is im- plicitly cor r elated with the conten ts, even if it do es no t a ppea r as a n a ttribute in our data s et. A ttac k description The main ob jective of this attack is to build a meta- classiﬁer for the following prop erty P : the classiﬁer was tr aine d only with p e ople who sp e ak an Indi an english diale ct . W e emph a size that this is external infor- mation as intro duced in Sectio n 2.1: the sp eech dialect is NOT explicitly used during the tra ining pro c ess, but in practice it inﬂuences the o utput of the clas - siﬁer. The ﬁrst part of the exp eriment des crib es the enco ding of the HMMs; next, we describ e the decisio n tree o f the meta-class iﬁer; ﬁnally , we present an improv ed version of the classiﬁer that uses a ﬁlter to improve the clas siﬁcation. T o ca rry out the a ttac k , we retrieved 11 , 137 re c ordings from the V oxF orge cor- pus. In par ticular, for our exp eriment, we to ok only the MFCC ﬁles in the English language. Each tr a ck co mes with a form con ta ining some meta-info r mation (e.g . gender, a ge, pronunciation dialec t). W e ha ve par titioned the corpus according to this meta-information; for this exp eriment, we hav e c onsidered the pa rtition containing the recor dings ma de with the same pronunciation diale c t and similar recording equimen ts. W e prepro cess ed the corpus with the HTK too lkit in order to minimize the environment a l noise. Sta rting from this pa rtition, we have cre- ated D accor ding to the rule deﬁned in Section 2 .1. Then, we ha ve tra ined each classiﬁer C i as describ ed in Algo rithm 1. After that, we started with the enco ding phase which is describ ed b elow. Ea c h classiﬁer C i , is r epresented in the HTK to olkit by a n ASCII ﬁle containing an HMM for ea ch phoneme b elong ing to the English language . E ach HMM is comp osed of: a transition probability matrix A ( n × n ) whic h descr ibes the tra n- sition b etw een hidden states and the tw o vectors M = ( µ 1 , µ 2 , . . . , µ m ) and V = ( σ 1 , σ 2 , . . . , σ m ) that are resp ectively mean and v ariance of the o utput probability distribution from a given hidden state (see Sections 3.1 a nd 3.1). In our exp eriments w e took the default HTK v alues during the training s tep (i.e. m = 25 and n = 5). T o enco de a single HMM w e chose to f o cus only on the output distributions, that is, the co uple o f vectors ( M , V ). The idea is that all these v alue s are initialized in the early steps of the training, accor ding to a mean computed ov er the entire MF CC dataset: since all the v alues ar e iter- atively r eﬁned thro ugh the HTK to olk it, then we exp ect that these v alues a re correla ted in some w ay with the voices of the learning set a nd, b y extension, with the pronunciation dialects. F or this r eason we s et the feature vector a ∈ F C as follows: a = ( ph, µ 1 , µ 2 , . . . , µ m , σ 1 , σ 2 , . . . , σ m , l i ) where ph is a string v a lue represe n ting a phoneme, µ 1 , µ 2 , . . . , µ m and σ 1 , σ 2 , . . . , σ m are the output probability vectors and l i ∈ { Indian , not Indian } is the lab el of the curr en t row. It is imp ortant to no tice that this e nc o ding gives a row in D C for each phoneme of the aco ustic mo del. O ur training set was comp osed of 5 , 4 20 tuples eq ually balanced ov er the tw o class iﬁcations considered for this exp eriment (i.e. the 50% of training data were gener ated by Indian p eople a nd the remaining 50% by p eople sp eak ing with diﬀerent a ccent ). The test set was comp osed of 1 , 01 6 insta nces: 774 of these are classiﬁed as not Indian and the re- maining 242 a re cla ssiﬁed as Indian . The tr a ining ended up with a very complex meta-classiﬁer : the decision tree was co mp ose d of more than 811 no des with 610 leav es. Indian not I n dian classiﬁed as 220 22 Indian 72 702 not Indian T able 2. The confusion matrix of the meta-classiﬁer Precision R ecall NotIndian 0.97 0.91 Indian 0.75 0.91 T able 3. The precision and recall summary of th e meta-classiﬁer T a ble 2 rep orts the co nfusion ma tr ix obtained from this e x per imen t (w e recall that the co nfusio n matrix shows how correctly a classiﬁer a ssigned the lab els to the element s of the input s et). The n ot Indian classiﬁers ar e cor rectly class iﬁed with precision o f 0.97 whereas the Indian classiﬁers are r ecognized with pr ecision 0.75. (Sp eciﬁcally: r ecall Indian: 0.909 and r ecall not Indian: 0.9 07.) One of the most interesting features provided by the C4.5 algorithm consists of the order in which the attributes decis io n tree appear. In fact C4.5 puts the most 0 50 100 150 200 250 300 350 400 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Number of occurences indian not indian Fig. 3. The frequency of th e v alues of σ 2 for all p honemes in the training data of the meta-classiﬁer. representative attributes at the hig her level of the tree. In our exp eriment, one of the most representativ e no des is σ 2 . The frequencies of ea c h v alue of σ 2 in the training data of the meta-cla ssiﬁer are repres en ted in ﬁgure 3. It is easy to notice that the mean v a lues of ea c h distribution are co nsiderably shifted and can b e easily re c o gnized with r esp ect to the c la ss. Our meta -classiﬁer is v ery eﬀectiv e in catching those diﬀ er ences; hence, as our exp eriments show, it correctly classiﬁes the most par t of the test set. T o further impr ove the quality o f MC , we hav e applied a ﬁlter to the tra ining se t D C . Our goal w as to extract the pho nemes that b etter diﬀer entiate the langua ge dialect. T o perfor m this task, we employ ed the Kul lb ack-L eibler (KL) div ergenc e betw e en the o utput probability distributions of the mo dels. The KL divergence is deﬁned as follows: D K L ( P || Q ) = X i P ( i ) l og P ( i ) Q ( i ) (1) A low D K L v a lue means a high similarity o f the tw o pro ba bilit y dis tr ibutions, while on the other hand, high divergence v a lue s cor resp ond to an inferior simi- larity . T his means that the phonemes with the hig he s t divergence are the ones which b etter discriminate the Indian a ccent from o thers. Since the output proba bilities fo llow a Normal distributio n, we used the following equation to compute the K L divergence: D K L ( X i || X j ) = ( µ i − µ j ) 2 2 σ 2 i + 1 2 σ 2 i σ 2 j − 1 − ln σ 2 i σ 2 j ! (2) where X i ∼ N ( µ i , σ i ) and X j ∼ N ( µ j , σ j ). W e built 100 diﬀer e nt training sets without India n reco rds, obtaining the relative acoustic mo dels C = ( C 1 , C 2 , . . . , C 100 ). Then, w e built the r eference le a rning set containing only Indian recor ds , obtaining the relative aco ustic mo del C r . Then, we compared the distance b etw een the output pro bability dis tributions of C r with every C i ∈ C , o btaining the summed v alue of the divergence. Since the same phoneme state has 25 p ossible output distributio ns , we have just computed the mean distance v alue across all the distributions. Finally , we to ok the ﬁv e phonemes with the highest divergence and w e re built MC using only the en tr ies relative to these phonemes. Indian not I n dian classiﬁed as 169 6 Indian 2 137 not Indian T able 4. The confusion matrix of the ﬁlter e d meta-class iﬁ er Precision R ecall NotIndian 0.98 0.96 Indian 0.95 0.98 T able 5. The precision and recall summary of th e ﬁlter e d meta-classi ﬁ er T a ble 4 shows the confusion matrix of the ﬁ lter e d classiﬁer. The new r esults are noticeably improved: the pr ecision for the not Indian class is 0 . 9 8 a s b efore whereas the precis io n for the In dian clas s is increased to 0 . 95. (Sp eciﬁcally: reca ll Indian: 0.986 and r ecall no t Indian: 0.9 66.) Also, the size o f the decision tree has dropp ed down sig niﬁcantly (the res ulting decision tree is comp osed o nly of 2 1 no des with 11 leaves). 3.2 Supp ort V ector M ac hines Bac kground Supp ort V e c tor Machines (SVM) are sup ervised lea rning metho ds related to st atistic al le arning the ory and ﬁr st in tr o duce d by Boser et al. in [1 7]. SVMs ar e la rgely used fo r class iﬁc a tion a nd regr ession a na lysis. In their basic form, SVMs are ﬁrst trained with sets of input da ta classiﬁed in t wo cla sses and are then used t o guess the clas s for eac h new given input. This a spec t makes SVM a non-pr ob abilistic binary line ar classiﬁer . Suppo r t V ector classiﬁers a re based on the c o ncept of sep ar ating hyp erplanes , that a re the hyperpla ne s in the attribute space that deﬁnes the decis ion b oundar ie s b etw een sets of ob jects belo nging to diﬀerent cla sses. During the training phase, the SVM receives a set of lab eled exa mples, each of them descr ib ed by n numerical a ttributes ( fe atur es ) and thus r epresented as a s et of p oints in a n -dimensio nal s pace. F or the sa ke of simplicity , we brieﬂy introduce how an SVM works with data repr esented by t wo attributes a nd mapp ed into t wo cla sses. The en try i of the training datas et is represented by a 2-dimensio na l vector x i = h x i 1 , x i 2 i and b elongs to one and only one class y i : ( y 1 , x 1 ) , ( y 2 , x 2 ) , . . . ( y n , x m ) y j ∈ − 1 , 1 (3) Let us supp o se that the training data is linearly separable, na mely there ex is ts a vector w a nd a scalar v a lue b such that: w · x i + b ≥ 1 if y i = 1 , w · x i + b ≤ 1 if y i = − 1 (4) In order to deal with sets that are not linear ly separ able, the tra ining vectors x i can be mapp ed into a higher dimensional space by the function φ , the s o called kernel function : many k ernel functions have b een pr o po sed, but the most used are linear K ( x i , x j ) = x T i x j , p oly no mial K ( x i , x j ) = ( γ x T i x j + r ) d , γ ≥ 0 , radial basis function, RBF, K ( x i , x j ) = exp ( − γ k x i − x j k 2 ) , γ ≥ 0 a nd sig moid K ( x i , x j ) = tanh ( γ x T i x j + r ). The Suppo r t V ector classiﬁer ﬁnds the optimal hyperplanes that separa te the training data with a maximal mar gin in this higher dimensional space; formally it resolves the sy stem of equa tio ns: y i ( w 0 · x + b 0 ) = 0 (5) It must b e p ointed out that, thanks to the natur e of the tra ining a lgorithm adopted by SVM, the solution of (5) can be o btained a t a re a sonable compu- tational cost reg ardless of the kernel function adopted. Intuitiv ely , a g o o d s e p- aration is achiev ed by the hyperplane tha t has the la rgest distance - o r mar gin - b etw een the nearest tra ining data p oints of diﬀerent classes: these p oints ar e called the supp ort ve ctors . R o ughly sp eaking, the lar ger the margin, the lower the generaliza tion error o f the cla ssiﬁer. It is easy to notice how the functional mar gin p oints determine the hyperplane of separa tion. This informatio n is tr iv ially fea tured by the attribute v alues in the training sets. F urthermore, we highlight that SVM can disclose more informa- tion when sev er al classiﬁers tr ained with diﬀeren t kernel functions a re provided. Since a trained SVM is repres en ted b y a set of w eights and a subset of the tra in- ing sample, it is not ea sy to obta in use ful informa tion on the characteristics of the complete training se t dir ectly fro m the SVM repr esentation. SVMs g enerated a signiﬁcant res earch activit y which extends across the limits of data mining area . Although SVMs were initially int r o duced to solve pattern recognition problems in an eﬃcie nt w ay ([2 0]), no wadays they ar e suitable in several co ntexts. In fact, SV Ms a re used for in trusion detection and anomaly detection ([39, 34, 22]) or as pa rt of co mplex sys tems for s imilar ta sks ([4 8, 6]). Other author s prop os e SVM-based s y stems for priv acy-cr itical tasks, such a s cancer diagnostic [30, 7], text ca tegorizatio n [36 ], or face recog nition [4]. SVM for ne tw ork traﬃc class iﬁcation As s hown by the extensive literature on this topic [28, 27, 49, 15], netw or k tra ﬃc classiﬁc a tion is commonly r ealized by means of Machine Learning algorithms, like K-Means, HMM, decision tre e s , and SVM. In order to ev aluate the informatio n leak ag e of SVM class iﬁers, we s et up a simple Netw ork T raﬃc Clas siﬁer able to distinguish b etw een DNS and WEB traﬃc. In particular, we c onsidered an SVM c lassiﬁer bas ed on the SMO mo dule (Sequent ia l Minimal Optimiza tion [50]) o f the W ek a framework. Our exper iment uses a r e al netﬂow dataset, gathered b y a national tier 2 Autonomous System. NetFlow is a Cisco TM proto col used by netw or k adminis- trators for gather ing traﬃc sta tis tics [5 3]. NetFlow is used to monitor data a t Lay ers 2-4 of the netw ork ing proto col sta ck and to pr ovide an aggr egated v iew of the netw o rk status. In particula r, NetFlo w eﬃcient ly supp orts many net work tasks such as tr a ﬃc acc o un ting , net work billing a nd planning, a s well as Denial of Service monito r ing. A netﬂo w-enabled router pro duces o ne new reco rd for each newly established connection, collecting selec ted ﬁelds from its IP hea der. More precisely , a single netﬂow r ecord is deﬁned as a unidir e ct ional sequenc e of packet s all sharing the following v alues : so urce and destination IP addres s es, source a nd destinatio n po rts (for UDP or TCP , 0 for other proto c o ls), IP pr oto col, Ingress in terface index and IP Type o f Service. Other v aluable information a sso ciated with the ﬂow, suc h a s timestamp, du- ration, num b er of packets and t r ansmitted bytes, a re also recorded. Then, w e consider a single netﬂow as a record tha t repres en ts the da ta exchanged b etw een t wo hosts only in one dir ection. W e co nsider a netw o rk traﬃc cla ssiﬁer aimed at corr ectly distinguishing the WE B and DNS traﬃc. The classiﬁer was trained using a balanced s et of netﬂo ws of WEB a nd DNS tr a ﬃc. It is worth noting that the WEB data set includes s everal tr a ﬃc patterns. Namely , it co n ta ins the ﬂo ws directed to national newspap ers, advertising websites, and the Go ogle search engine website. During the training phase o f the exp eriment, we used all the ﬁelds of the netﬂow ent r ies, except t he source a nd destination IP addresses o f the tr a ck ed connec- tions. In the literature there a re exa mples of SVM Class iﬁe r s for traﬃc detec- tion [28] able to distinguis h a grea ter v a riety of netw ork proto cols; the metho d- ology used in o ur exp eriment is s imilar, and can b e co nsidered appropr iate to highlight the statistical information leak ag e issues that ar e the tar get of our resear ch. No tice that the ac cur acy and the p re cision of the obtained classiﬁer is optimal, thanks to the simplicity of the training samples: indeed, WEB and DNS co nnections hav e well-separated traﬃc patterns, pro ducing a large marg in for c lassiﬁcation. A ttac k description In our exp eriment we in vestigate whether it is p ossible to extrap olate the t y p e of tra ﬃc that was used during the construction of the SVM mo del. F or example: Can we infer wh et her Go o gle web tr aﬃc was use d in the tr aining samples? (As be fo re, Go o g le traﬃc do es NOT app ear in the a ttributes of the training set.) W e pro ceed with our attac k by c r eating sev e ral a d-ho c data sets with well-deﬁned statistica l pro per ties and use them to build o ur meta- classiﬁer MC . Namely , we created 70 ad-ho c data sets, selecting 20.0 00 ﬂows of net work traﬃc, distinct from the original tr aining set. While all 70 c lassiﬁers were trained with a non-sp eciﬁc DNS traﬃc, the ﬁrst half of the classiﬁer s w er e trained using WEB traﬃc directed only to Go og le sear c h eng ine (prop erty P ). F o r the r emaining 3 5 classiﬁers , w e us ed WEB traﬃc without any netﬂow dire c ted to Go ogle sea rch engine (prop erty P ). Each classiﬁer was trained us ing a po lynomial k er nel function of degree 3 and was enco ded by the list of the s uppor t vectors it contains, namely a set of po in ts ( y , x ) in the n − dimensional spa ce ( x = { x 1 , x 2 , . . . , x n } ). The training samples o f the c la ssiﬁer MC are comp ose d of all the supp ort vectors of the 70 classiﬁers , lab eled according to the pr o pe r t y P or P used for training: D C = [ C i { ( y , h x i , l ab el ) } W e ev a luate the pe rformance of MC using the cr oss v alida tion s tr ategy , a metho d that divides the data into k mutually ex clusive subsets (namely , the “folds”) of approximately equal size. With cross v alidation, the accuracy estimate is the av erag e accuracy for the k folds. Google not Google classiﬁed as 2312 101 Google 92 2786 not Google T able 6. The confusion matrix of the meta-classiﬁer Precision R ecall Google 0. 95 0.93 not Google 0.94 0.96 T able 7. The precision and recall summary of th e meta-classiﬁer T a ble 6 summarizes the exp eriment r e s ults: with respect to the Go o gle class, we a ch ie v e a pr e cision of 0.954 and a r e c al l of 0.932. On the other hand, w e correctly classify not Go o gle instances with a pr e cision of 0.9 43 and a r e c al l of 0.962. As in the example with the HMMs, the exp erimental r esults show tha t we were a ble to build a n eﬀective meta-cla ssiﬁer that infers whether the t r aining set given as input includes als o a sp eciﬁc type of traﬃc. 4 Diﬀeren tial priv acy In this section we sho w that diﬀere ntial priv acy is ineﬀective against our attack strategy . More sp eciﬁcally , the information leak ag e we are after sits o utside the adversary mo del considered by diﬀerential priv a cy . Diﬀerent ia l priv acy [1 9, 1 4, 9] pr otects ag a inst uninten tional disclo sure of po ten tia lly sensitive infor mation rela ted to a single recor d of a databa s e D . In other words, diﬀerential pr iv ac y maximizes the accura cy of queries fro m statis- tical databases a nd, at the same time, minimizes the ability to identifying single records. T o protect the priv acy of data base reco r ds, diﬀeren tia l priv acy opts for basically three approaches: 1. The ﬁr st is to o bfusca te the or iginal database D and transform it into D ′ . This strateg y is completely ineﬀective in our mo del since D ′ is the database actually used during training and it is exactly what the a dversary in our mo del is after . That is, our adversary is not interested in D , or any of its records, but it is ra ther eager for any information on D ′ , i.e., anything that is the result of the trans formations applied by diﬀerential priv acy . 2. Another appro ach is to train a classiﬁe r and then add noise to the output. This is also ineﬀective since, in o ur mo del, the adversary has complete acces s to the classiﬁer and co uld just disa ble the instruction that adds nois e . 3. The thir d a pproach is more subtle. I t consists of adding no ise during train- ing, thus eﬀectively obfuscating the lear ning pr o cess. This appro a ch is still ineﬀective aga inst our adv er sary since, in tuitively , the ﬁnal classiﬁer m ust anyw ay conv erge to classify corr ectly the training set. Thus, the noise must be s omehow restrained and its eﬀect can easily b e mitigated (see b elow). It may be unclear why the third appro ach ab ove fails to provide an y protec- tion a gainst our adversary . Hence, we p erformed nex t an exp eriment sho wing how to extract sensitive information from a cla ssiﬁer tra ined within the fra mework SuLQ, in tr o duce d in [9]. The SulQ a utho r s impro ved sev era l standar d classiﬁers to provide diﬀ er ent ia l priv acy . The main idea co nsists of adding a small amoun t of noise, according to a Normal Distribution N (0 , σ ), to any access to the train- ing s e t. The v ariance of N regula tes the priv acy prop erty provided by diﬀerential priv a cy . Before intro ducing the exp eriment, we br ieﬂy recall some concepts of K - Means, whic h is the most p opular cluster ing algor ithm. 4.1 K-Means: the clusteri zation algorithm Clustering is the task o f par titio ning unstructured data in such a way that ob jects with an hig h level of similarity fall into the sa me par tition. Clustering is a t y pic a l example of unsup ervised lear ning mo dels where examples a re unlab eled, i.e., they are no t pre- c lassiﬁed. The K-Me ans a lgorithm [3] is one of the most common metho ds in this family and it has b een used in many applications (e.g., [12, 59, 1 0, 11]). F o r example, in [12] the author s developed a re a l-time tra ﬃc classiﬁcatio n metho d, base d o n K-Me ans , to identify SSH ﬂows from statistical behavior of IP tr aﬃc par ameters, such as length, ar riv a l times and dir ection of pa c kets. In K -Me ans b oth tra ining and classiﬁcatio n phases a r e very intu itive. Dur- ing the lear ning pro c e ss, the algor ithm par titions a set of n obs erv a tions into k clusters. Then, the a lgorithm s elects the c entro id (i.e., the bar ycenter, o r geo - metric midp oint) of every cluster as a repr esentativ e for that set of ob jects. More formally , given a set of obs erv a tions ( x 1 , x 2 , . . . , x n ), wher e ea ch observ ation is a d - dimensional re a l vector, K-Me ans partitions the n observ atio ns into k sets ( k ≤ n ) S = { S 1 , S 2 , . . . , S k } in o rder to minimize the within-cluster function: argmin S k X i =1 X x j ∈ S i k x j − µ i k (6) where µ i is the mean o f p oints in S i . T o classify a given data set of d - dimensional elemen ts with respect to k clus- ters, K-Me ans r uns a learning pro cess that can b e summarized by the following steps: 1. Randomly pick k initial cluster centroids; 2. Assign each instance x to the cluster tha t ha s a centroid near est to x ; 3. Recompute each cluster’s cent r oid bas ed o n whic h element s are con tained in it; 4. Rep eat Steps 2 and 3 unt il co nv ergence is achiev ed; 4.2 Hac king mo de ls secured b y Diﬀerential Priv acy W e implemented tw o v ariants of a netw o r k traﬃc classiﬁer that makes use of K-Me ans . W e trained both classiﬁer s with the same data set of the SVM e x per i- men t of Section 3.2. The ﬁr st implementation directly uses the euclidian distance as metric to revise the c entr oids in the iterative r eﬁnement phase (equa tion 6). The seco nd version implements a priv acy pres erving version of K-Me ans , provid- ing diﬀerential priv acy . W e implemented the latter within the SulQ framework, int r o duced by Blum et al. [9]. W e ran the t wo cla s siﬁers on 70 training sets , obtaining 7 0 distinct centroids. Recall that our ob jective is to recog nize whether there was Go ogle traﬃc within the traces. With resp ect to t he classiﬁer with no diﬀerential pr iv ac y , we represe n t the centroids when there is traﬃc to Go ogle.com in ﬁgure 4(a), and no t r aﬃc to Go ogle.com in ﬁg ur e 4 (b). It is ea sy to see that the pos itio ns o f the centroids are quite diﬀerent, allowing us to eas ily distinguish b etw een these tw o cases. Similar r esults app ear when w e picture the cent r oids of the classiﬁer pr oviding diﬀerential priv acy in ﬁgur es 5(a) and 5(b), resp ectively . Even in this case, an adversary can easily distinguish whether there is Go ogle.com tr aﬃc or not. ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ☎✥ ✥ ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✁✥ ✂✥ ✥ ✂✥ ✄✥ ✥ ✄✥ ☎✥ ✥ ❜ ✆ ✝ ✞ ✟ ♣✠✡☛☞✌ ✍ ❑ ✎ ☞✠ ✏✍ ✑☞ ✏ ✌ ✒ ✓✔ ✕ ✍ ✖ ✔ ✌ ✗ ✏✓ ✘ ✔ ✙ ✙ ☞✒ ☞ ✏ ✌ ✔ ✠ ✚ ✛ ✒ ✔ ✜✠✡✢ ✣✍✔ ✏ ✤ ✖ ☞✦ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✖ ✔ ✌ ✗ ✧✓✓ ✤ ✚ ☞ ✙ ✚ ✓✖ ✍ ✘❉★ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✡☞ ✏ ✌ ✒ ✓✔ ✕ ✍ ❲✩✪ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ♣ ✓✔ ✏ ✌ ✍ (a) T raining set conta ins W eb traﬃc directed to Google.com ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ☎✥ ✥ ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✁✥ ✂✥ ✥ ✂✥ ✄✥ ✥ ✄✥ ☎✥ ✥ ❜ ✆ ✝ ✞ ✟ ♣✠✡☛☞✌ ✍ ❑✎ ☞✠ ✏✍ ✑☞ ✏ ✌ ✒ ✓✔ ✕✍ ✖✔ ✌ ✗ ✏ ✓ ✘✔ ✙ ✙ ☞✒ ☞ ✏ ✌ ✔ ✠ ✚ ✛✒ ✔ ✜✠✡✢ ✣✍ ✔ ✏ ✤ ✖ ☞ ✦ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✖✔ ✌ ✗ ✏ ✓ ✧ ✓✓✤ ✚ ☞ ✙ ✚ ✓✖ ✍ ✘❉★ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✡☞ ✏ ✌ ✒ ✓✔ ✕✍ ❲✩✪ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ♣✓✔ ✏ ✌ ✍ (b) T raining set does not contain W eb traﬃc directed t o Google.com Fig. 4. Centroids of the K-Means traﬃc classiﬁer without diﬀerential priv acy . ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ☎✥ ✥ ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✁✥ ✂✥ ✥ ✂✥ ✄✥ ✥ ✄✥ ☎✥ ✥ ❜ ✆ ✝ ✞ ✟ ♣✠✡☛☞✌ ✍ ❑ ✎ ☞✠✏ ✍ ✑☞✏✌ ✒ ✓✔ ✕ ✍ ✖ ✔ ✌ ✗ ✘ ✔ ✙ ✙ ☞ ✒ ☞✏✌ ✔ ✠ ✚ ✛✒ ✔ ✜✠✡✢ ✣ ✜✠✒ ✂ ✤ ✥ ✦ ✧✍✔ ✏★ ✖☞✩ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✖✔ ✌ ✗ ✪✓✓★ ✚ ☞ ✙ ✚ ✓✖✍ ✘❉✫ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ✡☞✏✌ ✒ ✓✔ ✕ ✍ ❲✬✭ ✌ ✒ ✠ ✙ ✙ ✔ ✡ ♣ ✓✔ ✏✌ ✍ (a) T raining set conta ins W eb traﬃc directed to Google.com ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✥ ✂✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ✥ ✄✥ ✥ ✥ ✥ ☎✥ ✥ ✥ ✥ ✥ ✥ ✥ ✁✥ ✥ ✁✥ ✂✥ ✥ ✂✥ ✄✥ ✥ ✄✥ ☎✥ ✥ ❜ ✆ ✝ ✞ ✟ ♣✠✡☛☞✌ ✍ ❑✎ ☞✠ ✏✍ ✑☞ ✏ ✌ ✒ ✓✔ ✕✍ ✖✔ ✌ ✗ ✘✔ ✙ ✙ ☞✒ ☞ ✏✌ ✔ ✠✚ ✛✒ ✔ ✜ ✠✡ ✢ ✣ ✜ ✠✒ ✂✤ ✥ ✦ ✧ ✍ ✔ ✏★ ✖☞ ✩ ✌ ✒ ✠✙ ✙ ✔ ✡ ✖ ✔ ✌ ✗ ✏ ✓ ✪ ✓✓ ★ ✚ ☞ ✙ ✚ ✓✖ ✍ ✘ ❉✫ ✌ ✒ ✠✙ ✙ ✔ ✡ ✡☞ ✏ ✌ ✒ ✓✔ ✕✍ ❲✬✭ ✌ ✒ ✠✙ ✙ ✔ ✡ ♣✓✔ ✏ ✌ ✍ (b) T raining set does not contain W eb traﬃc directed t o Google.com Fig. 5. Centroids of a K-Means T raﬃc Classiﬁer with diﬀerential priv acy . 5 Related works The re s earch ar ea closest to the iss ues addresse d in our pap er app ears to be In- formation Disclosure co nsidered in priv acy pre s erving da ta mining and statistica l databases. It is worth descr ibing some o f these r elated results, even though we stress that the type of lea k a g e we co nsider in this paper has not been considered befo re. As formalized by Dwork in [19], diﬀer ential privacy deals with the g eneral problem of pr iv acy pre s erving analysis o f data. More for mally , a ra n domize d me chanism M pr ovides ǫ − diﬀer ential privacy if, for a datab ase D 1 and D 2 , which diﬀer by at most one element, and for any t: P r [ M ( D 1 ) = t ] P r [ M ( D 2 ) = t ] ≤ e ǫ In the diﬀeren tia l priv acy mo del, a trusted server holds a database with sen- sitive information. Answers to q ue r ies are p erturb ed b y the addition of random noise generated a ccording to a r andom distribution (usually a L aplac e distribu- tion). Two settings are deﬁned: n on inter active , where the trusted server computes and publishes statistics on the original data, and inter active , where the server sits in the middle and dir e ctly alters the answers to user queries to gua rantee sp eciﬁc priv a cy pr op e r ties. Chaudhuri et al. [40] design a priv acy preserving logistic regression a lg orithm which w orks in the ǫ − diﬀerential priv acy mo del ([31]). The idea is q uite simple: the res ult of the trained clas s iﬁer is p erturb ed with a dynamic amount of noise. This approach do es not conside r the secur ity is sues due to the exp osure of the mo del generated dur ing the lear ning phas e of the linear regre ssion algo rithm. Other machine lear ning a lg orithms, such as Decision T rees, Ar tiﬁcia l Neura l Net works, Clustering, hav e b een re - engineered to provide diﬀerential priv acy and se veral a re deﬁned within the SulQ fra mew o rk [9]. Privacy Pr eserving Data Mining (PPDM) [1 4] is a novel resear c h ar e a aimed at developing techniques that p erfor m data mining pr imitiv e s while pr otecting the priv acy of individual data r ecords. In [55], V er ykios et al. class iﬁed PPDM techn iq ues in ﬁv e clas ses. Amo ng them, we men tion the Privacy pr eservation class which refers to techniques used to preserve pr iv acy for sele ct ive mo diﬁ- cations of da ta recor ds. This c an b e achieved through heuristic v a lues (e.g., selecting the v alues that minimize the utility loss of the data), cry pto graphic pro- to cols (e.g., via Sec ur e Multipar t y Computation [44]), o r reconstruction-based techn iq ues (e.g., s tr ategy aimed at reconstructing the original data dis tribution using r andomized data). Some previous work exists r elated to extra ction of informa tion from clas siﬁers. F o r instance, in [2 9], the authors show how a bay esia n learning a lgorithm ca n be used to learn which words are employ ed by a classiﬁer to cla ssify messages as spam and ham. Similarly , in [45, 58], the a utho r s de s crib e some statistica l attacks against spam ﬁlters aimed a t unders ta nding mes sage fea tures that a re not correc tly classiﬁed by the ﬁlter s. Although using learning alg orithms a gainst other learning algor ithms is not unprecedented([29, 46]), o ur approa ch is diﬀerent since w e uncov e r ed a new class of information leak age that is inherent to the learning pro cess and that ha s never bee n dis c ussed b efore. 6 Conclusions In this pap er we in tr o duced a nov el a pproach to extract meaning ful data fr o m machine learning clas siﬁers using a meta-cla ssiﬁer. While previous works in ves- tigated priv acy concer ns of a single da tabase record, our appr oach fo cuses on the statistical informa tion stric tly correlated to the tra ining samples used during the learning phase. W e show ed that se veral ML classiﬁer s suﬀer from a new class of information leak age that is not captured b y priv acy- preserving mo dels, such as PPDM o r diﬀere n tial priv acy . W e devised a meta-cla s siﬁer to successfully distinguish the accent o f users inv olved in deﬁning the corpus of a sp eech recognition engine. F urthermor e, we attack ed an Internet traﬃc classiﬁer to infer whether a speciﬁc t r aﬃc pattern was used dur ing tra ining. Our results evince r ealistic issues facing machine learning alg orithms as we put forward the imp ortance o f protecting the tra ining s et—the alluring recip e that makes a class iﬁer b etter than the comp etition and that should b e gua rded as a tra de secr et. Bibliograph y [1] http ://www .nuance.com/dragon/index.htm . [2] http ://www .voxforge.org . [3] Some metho ds of classiﬁc ation and analysis of multivariate observations , 1967. [4] F ac e R e c o gnition by Su pp ort V e ctor Machines , FG ’00 , W ashington, DC, USA, 20 00. IEEE Computer Society . ISBN 0 -7695 -0580- 5. URL http:/ /dl.ac m.org/citation.cfm?id=795661.796198 . [5] Julius — an op en sour c e r e al-time lar ge vo c abulary r e c o gnition engine. , Aal- bo rg, Denmark , 2 001. [6] Intru s ion dete ction using neur al networks and su pp ort ve ctor machines , vol- ume 2, 2 002. [7] Morpholo gic al Classiﬁc ation of Me dic al Images using Nonline ar Supp ort V e ctor Machines , 200 4. IEEE . [8] Applic ation of mo diﬁe d n eur al network weights’ matric es explaining determinants of for eign investment p attern s in the emer ging mar- kets , MICAI’05 , Berlin, Heidelb erg , 20 05. Springer-V erlag . ISBN 3- 540-2 9896- 7 , 978 -3-540 -2989 6-0. doi: 1 0.1007 / 11579 427 73. URL http:/ /dx.do i.org/10.1007/11579427_73 . [9] Pr actic al priva cy: the SuLQ fr amework , PODS ’05 , New Y ork, NY, USA, 2005. ACM. ISBN 1-5 9593- 0 62-0. doi: 10.11 45/10 65167.1065184. URL http:/ /doi.a cm.org/10.1145/1065167.1065184 . [10] T r aﬃc classiﬁc ation using clustering algorithms , MineNet ’06, New Y o rk, NY, USA, 2006 . ACM. ISBN 1-595 93-569 - X. doi: ht tp:// doi.acm.org /10.11 45/1162678.1162679. URL http:/ /doi.a cm.org/10.1145/1162678.1162679 . [11] T r aﬃc classiﬁc ation on the ﬂ y , volume 36, New Y ork, NY, USA, April 2006. ACM. doi: h ttp://do i.acm.org/ 10.114 5/1129582.1129589. URL http:/ /doi.a cm.org/10.1145/1129582.1129589 . [12] R e al Time Identiﬁc ation of SS H Encrypte d A pplic ation Flows by Using Cluster Analysis T e chniques , NETWORKING ’09, Berlin, Heidelber g, 2009. Springer-V erlag . ISBN 978-3 -642-0 1398- 0. doi: http://dx.doi.org/ 10.100 7 /978- 3 - 642- 01399- 7 15. URL http:/ /dx.do i.org/10.1007/978- 3 - 642- 01399- 7_15 . [13] Automatic Reverse Engine ering of Data Struct ur es fr om Binary Exe cut ion , 2010. The Internet So cie t y . [14] Rakesh Agraw al and Ra makrishnan Srik ant. Priv acy- preserving data mining. SIGMOD R e c. , 29(2):439 –450, May 2000. ISSN 01 63-580 8. doi: 10 .1 145/3 35191 .335438. UR L http:/ /doi.a cm.org/10.1145/335191.335438 . [15] T om Auld, Andrew W. Mo ore, and Stephen F. Gull. Bay esia n neural ne t- works for internet traﬃc classiﬁca tion. IEEE T r ansactions on N eu r al Net- works , 1 8(1):223– 239, 2007. [16] L. E . Baum and T. Petrie. Statistical inf er ence for pr o babilistic functions of ﬁnite state Marko v chains. Annals of Mathematic al Statistics , 3 7:1554 –1563 , 1966. [17] Ber nha rd E. Boser, Isa belle M. Guyon, and Vladimir N. V apnik. A training algorithm for optimal margin classiﬁers . In Pr o c e e dings o f the 5th Annual ACM Workshop on Computational L e arning The ory , pages 144 –152. ACM Press, 1992. [18] Leo Breiman, J. H. F riedman, R. A . Ols he n, and C. J. Stone. Classiﬁc ation and Re gr ession T r e es . Statistics/Pr obability Series. W adsworth Publishing Company , Belmont, C a lifornia, U.S.A., 1 984. [19] Michele Bugliesi, Bar t Preneel, Vladimiro Sas sone, a nd Ingo W eg ener, e di- tors. Diﬀer ential Privacy , volume 4052 of L e ctu r e N otes in Computer Sci- enc e , 20 06. Springer . ISBN 3 -540-3 5907- 9. [20] Christo pher J. C. Burges. A tutorial on supp ort vector machines for pattern recognition. Data Min. Know l. Disc ov. , 2 (2):121–1 67, June 1998. ISSN 13 84-58 1 0. doi: 10.10 2 3/A:100 9 715923555. URL http:/ /dx.do i.org/10.1023/A:1009715923555 . [21] Yves Chauvin and David E . Rumelha r t, editors. Backpr op agation: the ory, ar chite ctu re s , a n d applic ations . L. Er lbaum As s o ciates Inc., Hillsdale, NJ, USA, 1 995. ISBN 0 -8058 -1259- 8. [22] Rung Ching Chen, Kai-F an Cheng, and Chia-F en Hsieh. Using roug h set and supp or t vector machine for netw ork intrusion detection. CoRR , abs/10 04.056 7 , 2010 . [23] Edith Cohen and Carsten Lund. Pack et classiﬁca tion in la rge isps: de- sign and ev alua tion of decis ion tree classiﬁer s. SIGMETRICS Perform. Eval. R ev. , 33(1):73– 84, June 2005. ISSN 016 3 -5999 . doi: 10.1145 /1071 690. 10642 22. URL ht tp://d oi.acm. org/10.1145/1071690.1064222 . [24] V asant Dhar. Prediction in ﬁnancial markets: The case for small disjuncts. ACM T r ans. Int el l. Syst. T e chnol. , 2(3):19:1– 19:22, May 2011. ISSN 2157- 6904. doi: 10.114 5/196 1189.1961191. URL http:/ /doi.a cm.org/10.1145/1961189.1961191 . [25] Ruisheng Dia o , Kai Sun, Vijay Vittal, Rober t J. O’Keefe, Michael R. Richardson, Navin Bhatt, Dw ayne Stradford, and Sanjoy K. Sarawgi. Decision T ree- B ased Online V oltage Security Assessment Using PMU Measurements. IEEE T r ansactions on Power Systems , 24(2):832 – 839, May 20 09. ISSN 0885 -8950. doi: 10 .1109/ TPWRS.2009.2 016528. URL http:/ /dx.do i.org/10.1109/TPWRS.2009.2016528 . [26] Ja s on Eisner . An interactive sprea dsheet for teaching the for ward-backw ard algorithm. In Pr o c e e dings of t he ACL-02 Workshop on Eﬀe ctive t o ols and metho dolo gies for te aching natur al language pr o c essing and c omputational linguistics - V olume 1 , ETMTNLP ’02, pages 1 0 –18, Stro uds bur g, P A, USA, 2002. Asso cia tio n for Computationa l L inguistics. do i: 10.3 115/1 118108. 11181 10. URL ht tp://d x.doi.o rg/10.3115/1118108.1118110 . [27] Jeﬀrey Er man, Anirban Mahanti, Martin Arlitt, Ir a Cohen, and Carey Williamson. Oﬄine/realtime traﬃc cla s siﬁcation using semi-sup ervised learning . P erform. Eval. , 64 :1194– 1213, Octo- ber 2007 . ISSN 0166- 5316. doi: 10.10 16/j.p ev a.2007 .06.014. URL http:/ /dl.ac m.org/citation.cfm?id=1284907.1285040 . [28] Alice Este, F ra ncesco Gring oli, and Luca Sa lgarelli. Suppo r t vector machines for tcp tr aﬃc classiﬁcatio n. Computer Networks , 53(14 ):2476 – 2490, 2009. ISSN 1389-1 286. doi: 10.1016/ j.comnet.2009.0 5.003. URL http:/ /www.s ciencedirect.com/science/article/pii/S1389128609001649 . [29] J. Graham-Cumming. How to b eat a n a da ptiv e spam ﬁlter. In The MIT Sp am Confer enc e , 200 4. [30] Isab elle Guyon, Ja son W eston, Stephen Ba rnhill, and Vladimir V apnik. Gene selection for cance r classiﬁca tion using supp ort vector machines. Mach. L e arn. , 46(1-3):389– 422, Ma rch 2 002. ISSN 0885-6 125. doi: 10.1023 / A:10124 87302 7 97. URL http ://dx. doi.or g/10.1023/A:1012487302797 . [31] Shai Halevi a nd T al Rabin, editors. Calibr ating Noise to Sensitivity in Private Data Analy sis , volume 3876 of L e ct ur e Notes in Computer Scienc e , 2006. Springe r. ISBN 3 -540- 32731 - 2. [32] J.F. Hay es. The viter bi algo rithm applied to dig ital data transmissio n. Communic ations Magazine, IEEE , 40(5):2 6 – 32, may 20 02. ISSN 01 63- 6804. doi: 10.110 9/MCOM.20 02.1006969. [33] Ypke Hiemstra . Linear re g ression versus backpropagation netw orks to predict quarter ly s to ck market exces s r eturns. Comput. Ec on. , 9(1):67– 76, F ebruary 199 6. ISSN 0 927-70 99. doi: 10.100 7/BF001 15692. URL http:/ /dx.do i.org/10.1007/BF00115692 . [34] W enjie Hu, Yihua Liao, and V. Rao V em ur i. Robust ano ma ly detection using support vector mac hines. In In Pr o c e e dings of the International Con- fer enc e on Machine L e arning . Morga n Kaufmann P ublishers Inc. [35] Anil K. Jain, Jianchang Mao , and K. Mohiuddin. Artiﬁcial neural netw or ks: A tutor ial. IEEE Computer , 29:31– 44, 1996. [36] Thor sten Joa chims. T ext categor ization with suppo rt vector machines: Learning with many r elev a n t features. In Cla ire N´ edellec and C´ eline Rou- veirol, editor s, Ma chine L e arning: ECML-98 , volume 1 398 of L e ctu r e Notes in Computer Scienc e , pages 13 7–142 . Springer Berlin / Heidelberg , 19 98. ISBN 978-3 -540- 6 4417-0. URL http://d x.doi. org/10.1007/BFb0026683 . 10.100 7/BFb002 6683. [37] B. H. Juang a nd L. R. Rabiner. Hidden markov mo dels for sp eech reco g- nition. T e chnometrics , 33 (3):251–2 72, August 199 1. ISSN 0 040-1 706. doi: 10.230 7/126 8779. URL ht tp://d x.doi. org/10.2307/1268779 . [38] Daniel Jur a fsky and James H. Martin. Sp e e ch and L anguage Pr o- c essing (2nd Edition) ( Pr ent ic e Hal l Series in Artiﬁcial Intel li- genc e) . Prentice Hall, 2 e dition, 2008 . ISBN 01 31873 210. URL http:/ /www.a mazon.com/Language- Processing- Prentice- Artificial- In telligence/dp/013187321 [39] Latifur Kha n, Mamoun Awad, and Bhav ani Thuraisingham. A new int r usion detection system using supp ort vector machines and hi- erarchical cluster ing. The VLD B Journ al , 16(4):50 7 –521, Octob er 2007. ISSN 10 66-888 8. doi: 10.1 007/s 00778 - 006- 0002- 5. URL http:/ /dx.do i.org/10.1007/s00778- 006- 0002- 5 . [40] Daphne Koller , Dale Sch uurmans, Y oshua Bengio, a nd L´ eon Bottou, editors. Privacy-pr eserving lo gistic r e gr ession , 20 08. Curr a n Asso ciates, Inc. [41] K.- F. Lee a nd H.-W. Hon. Large -vocabular y sp eaker-indep endent cont inu- ous sp eech re c o gnition using hmm. In Ac oustics, S p e e ch, and Signal Pr o- c essing, 1988. ICASS P- 88., 1988 In ternational Confer enc e on , pages 123 –126 vol.1, a pr 1 9 88. doi: 10 .1109/ ICASSP .1 988.19 6527. [42] Ming Li and Zhi-Hua Z hou. Impr ov e Co mputer-Aided Diagnosis With Ma- chine Lear ning T echniques Using Undiagnosed Sa mples. IEEE T r ansactions on Systems, Man, and Cyb ernetics - Part A: Systems and Hum ans , 37(6 ): 1088– 1098, Novem b er 2007 . ISSN 1083 -4427 . doi: 10 .1109/ TSMCA.2007. 90474 5. U RL h ttp:// dx.doi .org/10.1109/TSMCA.2007.904745 . [43] Jia nhua Lin. Divergence mea sures ba s ed on the shannon entropy . IEEE T r ansactions on Information the ory , 37:1 45–15 1, 19 91. [44] Y ehuda Lindell and Be nn y P ink as . Secure mult ipa rty computation for priv a cy-preser ving data mining. IACR Cryptolo gy ePrint Ar chive , 2008 : 197, 2 008. [45] D. Lowd and C. Meek. Go o d w o r d attacks on statistical spam ﬁlters. In In Pr o c e e dings of the 2nd Confer enc e on Email and Anti-Sp am , 200 5. [46] Daniel Lowd a nd Christopher Meek . Adv e r sarial learning. In Pr o c e e d- ings of the eleventh ACM SIGKDD international c onfer enc e on Kn ow le dge disc overy in data mining , KDD ’0 5, pag e s 641–6 47, New Y ork, NY, USA, 2005. ACM. ISBN 1 -5959 3 -135- X. doi: 1 0.1145 /1081 870.1081950. URL http:/ /doi.a cm.org/10.1145/1081870.1081950 . [47] T. Mitchell. Ma chine L e arning . McGraw-Hill Education (ISE Editions), 1st edition, Oc to ber 1 997. ISB N 00711 54671 . URL http:/ /www.a mazon.com/exec/obidos/redirect?tag=citeulike07- 2 0&path=ASIN/0071154671 . [48] Snehal A. Mulay , P .R. Dev ale , and G.V. Ga rje. Article:intrusion detec- tion system using supp o rt vector machine and decisio n tree. In t ernational Journal of Computer A pplic ations , 3 (3):40–43 , June 2010 . P ublis hed By F o undation o f Co mputer Science. [49] T.T.T. Nguy en and G. Armitage. A survey of tech niq ues for in ternet traﬃc classiﬁcation using machine lear ning. Communic ations Surveys T ut orials, IEEE , 10(4 ):5 6 –76, quar ter 2 008. ISSN 1553-877 X. doi: 10 .1109/ SUR V. 2008.0 80406 . [50] J. Platt. F ast training of support v e c tor mac hines using sequential min- imal optimization. In B. Schoelkopf, C. Burg es, a nd A. Smola, editors , A dvanc es in Kernel Metho ds - Supp ort V e ctor L e arning . MIT Pr ess, 19 98. URL h ttp:// resear ch.microsoft.com/ ~ jplatt /smo.h tml . [51] J. R. Quinlan. Induction of dec ision trees. Mach. L e arn. , 1(1):8 1–106 , March 19 8 6. ISSN 088 5-612 5 . doi: 1 0 .1023/ A:10226 4 3204877. URL http:/ /dx.do i.org/10.1023/A:1022643204877 . [52] J. Ro ss Quinlan. C4.5: pr o gr ams for machine le arning . Morgan Kaufmann Publishers Inc., Sa n F r ancisco, CA, USA, 199 3 . ISBN 1-5 5860-2 38-0. [53] Cisco Systems. Cisco Systems NetFlow Services Exp or t V ersion 9. http:/ /tools .ietf.org/html/rfc3954 , 2004 . [54] Adi L T arca , Vincen t J Ca rey , Xue-wen Chen, Rob erto Romer o, and Sorin Dr˘ aghici. Machine learning and its applications to bio lo gy . PL oS Com- put Biol , 3(6):e116, 06 20 07. doi: 10.13 71/jour na l.pcbi.0 03011 6. URL http:/ /dx.do i.org/10.1371%2Fjournal.pcbi.0030116 . [55] V assilios S. V erykio s , Elisa Bertino, Igor Nai F ovino, Loredana Parasil- iti Prov enza, Y ucel Saygin, and Y annis Theo doridis. State-of-the- art in pr iv acy preser v ing da ta mining. SIGMOD R e c. , 33(1):50–5 7 , March 2004 . ISSN 016 3-580 8 . doi: 10 .1145/ 9 74121 .974131. URL http:/ /doi.a cm.org/10.1145/974121.974131 . [56] W ek a Machine L e arning Pro ject. W ek a. URL ht tp:// www.cs.waik ato.ac .nz / ˜ ml/wek a. [57] M. W ernick, Y ongyi Y ang, J. Br anko v, G. Y our ganov, a nd S. Strother. Ma- chine lear ning in medical ima ging. Signal Pr o c essing M agazine, IEEE , 2 7 (4):25 –38, july 2 010. ISSN 1 0 53-58 88. doi: 10.1 109/MSP .20 10.936730. [58] Greg ory L. Wittel a nd S. F elix W u. O n attacking statistical spa m ﬁlters. In IN PR OC. OF THE CONFERENCE ON EMAIL AND ANTI-S P AM (CEAS), MOU NT AIN VIEW , 200 4. [59] Guowu Xie, M. Iliofotou, R. K eralapura , M. F aloutsos , and A. Nucci. Sub- ﬂow: T ow a rds practical ﬂow-lev e l tr aﬃc classiﬁca tion. In INFOCOM, 2012 Pr o c e e dings IEEE , pa ges 254 1 –254 5, mar ch 2012 . doi: 10.11 09/INFCOM. 2012.6 19564 9. [60] Steve Y oung. A review of la rge-voca bulary contin uo us-sp eech. Signal Pr o c essing Magazine, IEEE , 13(5):4 5, sept. 1 9 96. ISSN 105 3-588 8. do i: 10.110 9/79.5 36824. A Artiﬁcial Neural Netw or ks The Artiﬁcial Neur al Networks (ANNs) are a categor y of machine learning a l- gorithms able to solve a v a riety of problems in decision making , o ptimization, prediction, and co ntrol, lear ning functions from real, discr ete and vector v a lued examples. The ANNs obtain go o d p erfor mances in pro blems where the training data is retrieved by complex senso r, such as camer as or micropho nes. These a l- gorithms are also resilient to the presence of noise in the da taset. Several t y pes of ANN hav e b een pro po sed [35]. W e fo cus on a par ticular family of ANNs, the ones based on Multilayer Per c eptr ons , a nd the r elated Backpr op agation algorithm ([21]) us e d fo r their tr aining. The basic unit o f an ANN is the Per c eptr on (or neur o n), a unit that ta kes a v ector of real-v a lued inputs, calcula tes a linear combination of these inputs and then o utputs 1 if the re s ult is greater than some threshold and -1 otherwise. More formally a pe r ceptron can b e re pr esented as a function o ( x 1 , . . . , x n ) =  1 if P n i =0 w i x i > 0 − 1 otherwise where we consider x 0 to b e always set to 1 to simplify the notation, a nd we call net = P n i =1 w i x i . Obser ve that − w 0 is the threshold that makes the neur on to output 1. A sing le p erceptron r e presents a n hyperplane decision surface in the n − dimensional space of instances. This kind of p erceptr on can only discr iminate b etw een lin- early separable instances. T o overcome this limitation, the sigmoid function σ is used to decide the output v alue: σ ( net ) = 1 1 − exp net An ANN is a multi-lay er netw o rk of neurons: a ﬁr st input lay er r eceives the input bits and provides mo diﬁed inputs to a fo llowing lay er , that, in tur n, ela bo rates them and feeds a new lay er, and so on. The las t layer outputs the result o f the ANN. The neuro ns that form the internal lay ers are ca lled the hidden units. The core function of the net work resides in the weigh t of the hidden units in the int er nal lay e r s which are set through the bac kpr o pagation algor ithm. Starting from random weigh ts, the algorithm tunes them using a training s et of input- output pairs: the inputs go forward to the netw or k until they b ecome output, while the error s (namely , the diﬀerence b etw een a ctual and ex pected outputs) are back-propag ated to correct the weights. The err or is r educed iteratively un til a minimal and toler able err or is obtained. The backpropaga tion of the er ror is inspired by the principle of gra dien t descent: in a nutshell, if the weigh t sig niﬁ- cantly contributes to the error then its a djustmen t will b e grea ter. B Classiﬁcation and Regression T r ees A classiﬁca tion or r egressio n tree (introduce by Breiman et al. in [18] in 1 984) is a prediction mo del which maps o bserv ations in a de cision tr e e . The obser v atio ns L = ( x 1 , y 1 ) , ( x 2 , y 2 ) , . . . , ( x N , y N ) constitute the training set and are used to le arn a decis ion tree. Both class iﬁc a tion a nd reg ression tr ees dea l with the prediction of a resp onse v a riable y (let Y be the domain of y ), given the v a lues of a vector of predictor v ar iables x (let X b e the domain of x ). If y is a contin uo us or disc r ete v ariable tak ing rea l v alues (e.g., the size o f an o b ject, the num ber o f occur r ences of certain ev ents), the pr oblem is called r e gr ession ; if Y is a ﬁnite set of unorde r ed v alue s (e.g., the t y p e of Iris plants), the pr oblem is called classiﬁc ation . The training pha se pro duces a tree str uctur e in which the leav es represe nt the class lab els and the branches repr esent c onjun ct ions of fe atur es that lea d to the cla ss lab els of their leaves. Decisio n tre e s can b e considered as disjunction of conjunctions of constraints on the attr ibute-v alues of instances. Ea c h path from the tre e ro ot to a leaf cor resp onds to a c onjunction of attribute tests , a nd the tree itself to a disju nc tio n of these conjunctions [47]. Decision tr ees w or k better whe n the t a rget function has discr ete output (for example “yes or no ”) and the data instances are represented by attribute-v alue pairs. F urthermo re, decision trees p erform well ev en when the training da taset con tains er rors or missing v alues. These characteristics make decision tree a s uitable s olution for many classiﬁca tio n problems and in a great v a r iety of contexts. The most p opular implemen ta tion of decisio n trees is the C 4 . 5 [52] algo r ithm, which is an extended version of the I D 3 algorithm [51]. top-down, greedy search through the space of all p oss ible decisio n trees. In detail, ID3 algor ithm starts the search of decision tree answering the question: which attribute should b e use d at the r o ot of t he tr e e? Once the ro ot is found, a desce nden t no de of the ro ot is created for each po ssible v alue, then the same question is ask ed recursively at each new no de, un til: (i) each attribute has b een considered in the path through the tree, or (ii) the training exa mples related to a speciﬁc leaf has the same a ttribute v alues. The s election of the b est attribute in ea ch level of the tr e e is p erfor med using the concept of information gain . In fact, the information ga in measures how well a given attribute sepa rates the training examples. Giv en a collectio n S o f items, for each attribute A , I D 3 alg orithm ev aluates the ga in o f A with resp ect to S via the eq uation: Gain ( S, A ) = H ( S ) − X v ∈ V alues ( A ) | S v | | S | H ( S v ) where H ( S ) repr esents the Entropy of the entire dataset and S v is the subset of S fo r which a ttribute A ha s v alue v .

Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment