Differentially Private Empirical Risk Minimization



Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

October 25, 2018

Abstract

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
∗ This work will appear in the Journal of Machine Learning Research.
† K. Chaudhuri is with the Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA, kchaudhuri@ucsd.edu.
‡ C. Monteleoni is with the Center for Computational Learning Systems, Columbia University, New York, NY 10115, USA, cmontel@ccls.columbia.edu.
§ A.D. Sarwate is with the Information Theory and Applications Center, University of California, San Diego, La Jolla, CA 92093-0447, USA, asarwate@ucsd.edu.

1 Introduction

Privacy has become a growing concern, due to the massive increase in personal information stored in electronic databases, such as medical records, financial records, web search histories, and social network data. Machine learning can be employed to discover novel population-wide patterns, but the results of such algorithms may reveal certain individuals' sensitive information, thereby violating their privacy. Thus, an emerging challenge for machine learning is how to learn from datasets that contain sensitive personal information.

At first glance, it may appear that simple anonymization of private information is enough to preserve privacy. However, this is often not the case; even if obvious identifiers, such as names and addresses, are removed from the data, the remaining fields can still form unique "signatures" that can help re-identify individuals. Such attacks have been demonstrated by various works, and are possible in many realistic settings, such as when an adversary has side information (Sweeney, 1997; Narayanan and Shmatikov, 2008; Ganta et al., 2008), and when the data has structural properties (Backstrom et al., 2007), among others.
Moreover, even releasing statistics on a sensitive dataset may not be sufficient to preserve privacy, as illustrated on genetic data (Homer et al., 2008; Wang et al., 2009). Thus, there is a great need for designing machine learning algorithms that also preserve the privacy of individuals in the datasets on which they train and operate.

In this paper we focus on the problem of classification, one of the fundamental problems of machine learning, when the training data consists of sensitive information of individuals. Our work addresses the empirical risk minimization (ERM) framework for classification, in which a classifier is chosen by minimizing the average over the training data of the prediction loss (with respect to the label) of the classifier in predicting each training data point. In this work, we focus on regularized ERM, in which there is an additional term in the optimization, called the regularizer, which penalizes the complexity of the classifier with respect to some metric. Regularized ERM methods are widely used in practice, for example in logistic regression and support vector machines (SVMs), and many also have theoretical justification in the form of generalization error bounds with respect to independently, identically distributed (i.i.d.) data (see Vapnik (1998) for further details).

For our privacy measure, we use a definition due to Dwork et al. (2006b), who have proposed a measure of quantifying the privacy risk associated with computing functions of sensitive data. Their ε-differential privacy model is a strong, cryptographically-motivated definition of privacy that has recently received a significant amount of research attention for its robustness to known attacks, such as those involving side information (Ganta et al., 2008).
Algorithms satisfying ε-differential privacy are randomized; the output is a random variable whose distribution is conditioned on the dataset. A statistical procedure satisfies ε-differential privacy if changing a single data point does not shift the output distribution by too much. Therefore, from looking at the output of the algorithm, it is difficult to infer the value of any particular data point.

In this paper, we develop methods for approximating ERM while guaranteeing ε-differential privacy. Our results hold for loss functions and regularizers satisfying certain differentiability and convexity conditions. An important aspect of our work is that we develop methods for end-to-end privacy; each step in the learning process can cause additional risk of privacy violation, and we provide algorithms with quantifiable privacy guarantees for training as well as parameter tuning. For training, we provide two privacy-preserving approximations to ERM. The first is output perturbation, based on the sensitivity method proposed by Dwork et al. (2006b). In this method noise is added to the output of the standard ERM algorithm. The second method is novel, and involves adding noise to the regularized ERM objective function prior to minimizing. We call this second method objective perturbation. We show theoretical bounds for both procedures; the theoretical performance of objective perturbation is superior to that of output perturbation for most problems. However, for our results to hold we require that the regularizer be strongly convex (ruling out ℓ1 regularizers) and impose additional constraints on the loss function and its derivatives. In practice, these additional constraints do not affect the performance of the resulting classifier; we validate our theoretical results on data sets from the UCI repository.
In practice, parameters in learning algorithms are chosen via a holdout dataset. In the context of privacy, we must guarantee the privacy of the holdout data as well. We exploit results from the theory of differential privacy to develop a privacy-preserving parameter tuning algorithm, and demonstrate its use in practice. Together with our training algorithms, this parameter tuning algorithm guarantees privacy to all data used in the learning process.

Guaranteeing privacy incurs a cost in performance; because the algorithms must cause some uncertainty in the output, they increase the loss of the output predictor. Because the ε-differential privacy model requires robustness against all data sets, we make no assumptions on the underlying data for the purposes of making privacy guarantees. However, to analyze the impact of privacy constraints on the generalization error, we assume the data is i.i.d. according to a fixed but unknown distribution, as is standard in the machine learning literature. Although many of our results hold for ERM in general, we provide specific results for classification using logistic regression and support vector machines. Some of these results were reported in Chaudhuri and Monteleoni (2008); here we generalize them to ERM, extend the results to kernel methods, and provide experiments on real datasets. More specifically, the contributions of this paper are as follows:

• We derive a computationally efficient algorithm for ERM classification, based on the sensitivity method due to Dwork et al. (2006b). We analyze the accuracy of this algorithm, and provide an upper bound on the number of training samples required by this algorithm to achieve a fixed generalization error.
• We provide a general technique, objective perturbation, for providing computationally efficient, differentially private approximations to regularized ERM algorithms. This extends the work of Chaudhuri and Monteleoni (2008), which follows as a special case, and corrects an error in the arguments made there. We apply the general results on the sensitivity method and objective perturbation to logistic regression and support vector machine classifiers. In addition to privacy guarantees, we also provide generalization bounds for this algorithm.

• For kernel methods with nonlinear kernel functions, the optimal classifier is a linear combination of kernel functions centered at the training points. This form is inherently non-private because it reveals the training data. We adapt a random projection method due to Rahimi and Recht (Rahimi and Recht, 2007, 2008b) to develop privacy-preserving kernel-ERM algorithms. We provide theoretical results on generalization performance.

• Because the holdout data is used in the process of training and releasing a classifier, we provide a privacy-preserving parameter tuning algorithm based on a randomized selection procedure (McSherry and Talwar, 2007) applicable to general machine learning algorithms. This guarantees end-to-end privacy during the learning procedure.

• We validate our results using experiments on two datasets from the UCI Machine Learning Repository (Asuncion and Newman, 2007) and KDD Cup (Hettich and Bay, 1999). Our results show that objective perturbation is generally superior to output perturbation. We also demonstrate the impact of end-to-end privacy on generalization error.

1.1 Related Work

There has been a significant amount of literature on the ineffectiveness of simple anonymization procedures.
For example, Narayanan and Shmatikov (2008) show that a small amount of auxiliary information (knowledge of a few movie ratings, and approximate dates) is sufficient for an adversary to re-identify an individual in the Netflix dataset, which consists of anonymized data about Netflix users and their movie ratings. The same phenomenon has been observed in other kinds of data, such as social network graphs (Backstrom et al., 2007), search query logs (Jones et al., 2007) and others. Releasing statistics computed on sensitive data can also be problematic; for example, Wang et al. (2009) show that releasing R²-values computed on high-dimensional genetic data can lead to privacy breaches by an adversary who is armed with a small amount of auxiliary information.

There has also been a significant amount of work on privacy-preserving data mining (Agrawal and Srikant, 2000; Evfimievski et al., 2003; Sweeney, 2002; Machanavajjhala et al., 2006), spanning several communities, that uses privacy models other than differential privacy. Many of the models used have been shown to be susceptible to composition attacks, attacks in which the adversary has some reasonable amount of prior knowledge (Ganta et al., 2008). Other work (Mangasarian et al., 2008) considers the problem of privacy-preserving SVM classification when separate agents have to share private data, and provides a solution that uses random kernels, but does not provide any formal privacy guarantee.

An alternative line of privacy work is in the Secure Multiparty Computation setting due to Yao (1982), where the sensitive data is split across multiple hostile databases, and the goal is to compute a function on the union of these databases. Zhan and Matwin (2007) and Laur et al.
(2006) consider computing privacy-preserving SVMs in this setting, and their goal is to design a distributed protocol to learn a classifier. This is in contrast with our work, which deals with a setting where the algorithm has access to the entire dataset.

Differential privacy, the formal privacy definition used in our paper, was proposed by the seminal work of Dwork et al. (2006b), and has been used since in numerous works on privacy (Chaudhuri and Mishra, 2006; McSherry and Talwar, 2007; Nissim et al., 2007; Barak et al., 2007; Chaudhuri and Monteleoni, 2008; Machanavajjhala et al., 2008). Unlike many other privacy definitions, such as those mentioned above, differential privacy has been shown to be resistant to composition attacks (attacks involving side information) (Ganta et al., 2008). Some follow-up work on differential privacy includes work on differentially private combinatorial optimization, due to Gupta et al. (2010), and differentially private contingency tables, due to Barak et al. (2007) and Kasiviswanathan et al. (2010). Wasserman and Zhou (2010) provide a more statistical view of differential privacy, and Zhou et al. (2009) provide a technique of generating synthetic data using compression via random linear or affine transformations.

Previous literature has also considered learning with differential privacy. One of the first such works is Kasiviswanathan et al. (2008), which presents a general, although computationally inefficient, method for PAC-learning finite concept classes. Blum et al. (2008) present a method for releasing a database in a differentially private manner, so that certain fixed classes of queries can be answered accurately, provided the class of queries has a bounded VC-dimension. Their methods can also be used to learn classifiers with a fixed VC-dimension – see Kasiviswanathan et al.
(2008); however, the resulting algorithm is also computationally inefficient. Some sample complexity lower bounds in this setting have been provided by Beimel et al. (2010). In addition, Dwork and Lei (2009) explore a connection between differential privacy and robust statistics, and provide an algorithm for privacy-preserving regression using ideas from robust statistics. However, their algorithm also requires a running time which is exponential in the data dimension, and is hence computationally inefficient.

This work builds on our preliminary work in Chaudhuri and Monteleoni (2008). We first show how to extend the sensitivity method, a form of output perturbation, due to Dwork et al. (2006b), to classification algorithms. In general, output perturbation methods alter the output of the function computed on the database before releasing it; in particular, the sensitivity method makes an algorithm differentially private by adding noise to its output. In the classification setting, the noise protects the privacy of the training data, but increases the prediction error of the classifier. Recently, independent work by Rubinstein et al. (2009) has reported an extension of the sensitivity method to linear and kernel SVMs. Their utility analysis differs from ours, and thus the analogous generalization bounds are not comparable. Because Rubinstein et al. use techniques from algorithmic stability, their utility bounds compare the private and non-private classifiers using the same value for the regularization parameter. In contrast, our approach takes into account how the value of the regularization parameter might change due to privacy constraints. In contrast to these output perturbation approaches, we propose the objective perturbation method, in which noise is added to the objective function before optimizing over the space of classifiers.
Both the sensitivity method and objective perturbation result in computationally efficient algorithms for our specific case. In general, our theoretical bounds on sample requirement are incomparable with the bounds of Kasiviswanathan et al. (2008) because of the difference between their setting and ours.

Our approach to privacy-preserving tuning uses the exponential mechanism of McSherry and Talwar (2007) by training classifiers with different parameters on disjoint subsets of the data and then randomizing the selection of which classifier to release. This bears a superficial resemblance to sample-and-aggregate (Nissim et al., 2007) and V-fold cross-validation, but only in the sense that only a part of the data is used to train the classifier. One drawback is that our approach requires significantly more data in practice. Other approaches to selecting the regularization parameter could benefit from a more careful analysis of the regularization parameter, as in Hastie et al. (2004).

2 Model

We will use ‖x‖, ‖x‖∞, and ‖x‖_H to denote the ℓ2-norm, the ℓ∞-norm, and the norm in a Hilbert space H, respectively. For an integer n we will use [n] to denote the set {1, 2, . . . , n}. Vectors will typically be written in boldface and sets in calligraphic type. For a matrix A, we will use the notation ‖A‖₂ to denote the L2 norm of A.

2.1 Empirical Risk Minimization

In this paper we develop privacy-preserving algorithms for regularized empirical risk minimization, a special case of which is learning a classifier from labeled examples. We will phrase our problem in terms of classification and indicate when more general results hold. Our algorithms take as input training data D = {(x_i, y_i) ∈ X × Y : i = 1, 2, . . . , n} of n data-label pairs. In the case of binary classification the data space X = R^d and the label set Y = {−1, +1}.
We will assume throughout that X is the unit ball so that ‖x_i‖₂ ≤ 1.

We would like to produce a predictor f : X → Y. We measure the quality of our predictor on the training data via a nonnegative loss function ℓ : Y × Y → R. In regularized empirical risk minimization (ERM), we choose a predictor f that minimizes the regularized empirical loss:

J(f, D) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) + Λ N(f).   (1)

This minimization is performed over f in a hypothesis class H. The regularizer N(·) prevents over-fitting. For the first part of this paper we will restrict our attention to linear predictors and, with some abuse of notation, we will write f(x) = f^T x.

2.2 Assumptions on loss and regularizer

The conditions under which we can prove results on privacy and generalization error depend on analytic properties of the loss and regularizer. In particular, we will require certain forms of convexity (see Rockafellar and Wets (1998)).

Definition 1. A function H(f) over f ∈ R^d is said to be strictly convex if for all α ∈ (0, 1), f, and g,

H(αf + (1 − α)g) < αH(f) + (1 − α)H(g).   (2)

It is said to be λ-strongly convex if for all α ∈ (0, 1), f, and g,

H(αf + (1 − α)g) ≤ αH(f) + (1 − α)H(g) − (1/2) λ α(1 − α) ‖f − g‖₂².   (3)

A strictly convex function has a unique minimum – see Boyd and Vandenberghe (2004). Strong convexity plays a role in guaranteeing our privacy and generalization requirements. For our privacy results to hold we will also require that the regularizer N(·) and loss function ℓ(·, ·) be differentiable functions of f. This excludes certain classes of regularizers, such as the ℓ1-norm regularizer N(f) = ‖f‖₁, and classes of loss functions such as the hinge loss ℓ_SVM(f^T x, y) = (1 − y f^T x)₊.
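For a loss that does satisfy these conditions, the objective (1) is easy to write down explicitly. The following minimal sketch (our own illustration, not the authors' code) instantiates it for the differentiable logistic loss ℓ(z) = log(1 + e^{−z}) with the 1-strongly convex regularizer N(f) = (1/2)‖f‖²; the function names are ours:

```python
import numpy as np

def logistic_loss(z):
    # log(1 + exp(-z)), computed stably for large |z|
    return np.logaddexp(0.0, -z)

def erm_objective(f, X, y, Lam):
    """Regularized empirical risk J(f, D) from Eq. (1), with linear
    predictors f(x) = f^T x and N(f) = 0.5 * ||f||^2."""
    margins = y * (X @ f)                      # y_i * f^T x_i
    avg_loss = np.mean(logistic_loss(margins)) # (1/n) sum of losses
    return avg_loss + Lam * 0.5 * np.dot(f, f) # + Lambda * N(f)
```

At f = 0 every margin is zero, so the objective equals log 2; both terms are convex and differentiable in f, as the privacy results below require.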
In some cases we can prove privacy guarantees for approximations to these non-differentiable functions.

2.3 Privacy model

We are interested in producing a classifier in a manner that preserves the privacy of individual entries of the dataset D that is used in training the classifier. The notion of privacy we use is the ε-differential privacy model, developed by Dwork et al. (2006b); Dwork (2006), which defines a notion of privacy for a randomized algorithm A(D). Suppose A(D) produces a classifier, and let D′ be another dataset that differs from D in one entry (which we assume is the private value of one person). That is, D′ and D have n − 1 points (x_i, y_i) in common. The algorithm A provides differential privacy if for any set S, the likelihood that A(D) ∈ S is close to the likelihood that A(D′) ∈ S (where the likelihood is over the randomness in the algorithm). That is, any single entry of the dataset does not affect the output distribution of the algorithm by much; dually, this means that an adversary, who knows all but one entry of the dataset, cannot gain much additional information about the last entry by observing the output of the algorithm.

The following definition of differential privacy is due to Dwork et al. (2006b), paraphrased from Wasserman and Zhou (2010).

Definition 2. An algorithm A(B) taking values in a set T provides ε_p-differential privacy if

sup_S sup_{D, D′} µ(S | B = D) / µ(S | B = D′) ≤ e^{ε_p},   (4)

where the first supremum is over all measurable S ⊆ T, the second is over all datasets D and D′ differing in a single entry, and µ(· | B) is the conditional distribution (measure) on T induced by the output A(B) given a dataset B. The ratio is interpreted to be 1 whenever the numerator and denominator are both 0.

[Figure 1: An algorithm which is differentially private. When datasets which are identical except for a single entry are input to the algorithm A, the two distributions on the algorithm's output are close. For a fixed measurable S the ratio of the measures (or densities) should be bounded.]

Note that if S is a set of measure 0 under the conditional measures induced by D and D′, the ratio is automatically 1. A more measure-theoretic definition is given in Zhou et al. (2009). An illustration of the definition is given in Figure 1.

The following form of the definition is due to Dwork et al. (2006a).

Definition 3. An algorithm A provides ε_p-differential privacy if for any two datasets D and D′ that differ in a single entry and for any set S,

exp(−ε_p) P(A(D′) ∈ S) ≤ P(A(D) ∈ S) ≤ exp(ε_p) P(A(D′) ∈ S),   (5)

where A(D) (resp. A(D′)) is the output of A on input D (resp. D′).

We observe that an algorithm A that satisfies Equation (4) also satisfies Equation (5), and as a result, Definition 2 is stronger than Definition 3.

From this definition, it is clear that the A(D) that outputs the minimizer of the ERM objective (1) does not provide ε_p-differential privacy for any ε_p. This is because an ERM solution is a linear combination of some selected training samples "near" the decision boundary. If D and D′ differ in one of these samples, then the classifier will change completely, making the likelihood ratio in (5) infinite. Regularization helps by penalizing the L2 norm of the change, but does not account for how the direction of the minimizer is sensitive to changes in the data.

Dwork et al. (2006b) also provide a standard recipe for computing privacy-preserving approximations to functions by adding noise with a particular distribution to the output of the function.
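Before describing that recipe in detail, the ratio bound in Definitions 2 and 3 can be checked numerically for the classical Laplace-noise approach on a simple counting query, whose sensitivity is 1. This is our own toy illustration, not from the paper; all function names are ours:

```python
import math

def laplace_density(t, mu, scale):
    # Density of the Laplace distribution with mean mu and scale parameter.
    return math.exp(-abs(t - mu) / scale) / (2 * scale)

def count_positives(dataset):
    # A simple counting query: how many labels equal +1. Its sensitivity is 1.
    return sum(1 for (_, y) in dataset if y == 1)

def max_density_ratio(D, D_prime, eps, grid):
    """Largest pointwise ratio of output densities when the mechanism
    releases count_positives(.) plus Laplace(1/eps) noise."""
    mu1, mu2 = count_positives(D), count_positives(D_prime)
    scale = 1.0 / eps
    return max(laplace_density(t, mu1, scale) / laplace_density(t, mu2, scale)
               for t in grid)
```

For two datasets differing in a single label, the counts differ by at most 1, and the density ratio never exceeds e^ε at any output value, matching the bound in (5).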
We call this recipe the sensitivity method. Let g : (R^m)^n → R be a scalar function of z_1, . . . , z_n, where z_i ∈ R^m corresponds to the private value of individual i; then the sensitivity of g is defined as follows.

Definition 4. The sensitivity of a function g : (R^m)^n → R is the maximum difference between the values of the function when one input changes. More formally, the sensitivity S(g) of g is defined as:

S(g) = max_i max_{z_1,...,z_n, z′_i} |g(z_1, . . . , z_{i−1}, z_i, z_{i+1}, . . . , z_n) − g(z_1, . . . , z_{i−1}, z′_i, z_{i+1}, . . . , z_n)|.   (6)

To compute a function g on a dataset D = {z_1, . . . , z_n}, the sensitivity method outputs g(z_1, . . . , z_n) + η, where η is a random variable drawn according to the Laplace distribution with mean 0 and standard deviation S(g)/ε_p. It is shown in Dwork et al. (2006b) that such a procedure is ε_p-differentially private.

3 Privacy-preserving ERM

Here we describe two approaches for creating privacy-preserving algorithms from (1).

3.1 Output perturbation: the sensitivity method

Algorithm 1 is derived from the sensitivity method of Dwork et al. (2006b), a general method for generating a privacy-preserving approximation to any function A(·). In this section the norm ‖·‖ is the L2-norm unless otherwise specified. For the function A(D) = argmin J(f, D), Algorithm 1 outputs a vector A(D) + b, where b is random noise with density

ν(b) = (1/α) e^{−β‖b‖},   (7)

where α is a normalizing constant. The parameter β is a function of ε_p and the L2-sensitivity of A(·), which is defined as follows.

Definition 5. The L2-sensitivity of a vector-valued function is defined as the maximum change in the L2 norm of the value of the function when one input changes. More formally,

S(A) = max_i max_{z_1,...,z_n, z′_i} ‖A(z_1, . . . , z_i, . . .) − A(z_1, . . .
, z′_i, . . .)‖.   (8)

The interested reader is referred to Dwork et al. (2006b) for further details. Adding noise to the output of A(·) has the effect of masking the effect of any particular data point. However, in some applications the sensitivity of the minimizer argmin J(f, D) may be quite high, which would require the sensitivity method to add noise with high variance.

Algorithm 1 ERM with output perturbation (sensitivity)
Inputs: Data D = {z_i}, parameters ε_p, Λ.
Output: Approximate minimizer f_priv.
Draw a vector b according to (7) with β = nΛε_p/2.
Compute f_priv = argmin J(f, D) + b.

3.2 Objective perturbation

A different approach, first proposed by Chaudhuri and Monteleoni (2008), is to add noise to the objective function itself and then produce the minimizer of the perturbed objective. That is, we can minimize

J_priv(f, D) = J(f, D) + (1/n) b^T f,   (9)

where b has density given by (7), with β = ε_p. Note that the privacy parameter here does not depend on the sensitivity of the classification algorithm.

Algorithm 2 ERM with objective perturbation
Inputs: Data D = {z_i}, parameters ε_p, Λ, c.
Output: Approximate minimizer f_priv.
Let ε′_p = ε_p − log(1 + 2c/(nΛ) + c²/(n²Λ²)).
If ε′_p > 0, then ∆ = 0; else ∆ = c/(n(e^{ε_p/4} − 1)) − Λ, and ε′_p = ε_p/2.
Draw a vector b according to (7) with β = ε′_p/2.
Compute f_priv = argmin J_priv(f, D) + (1/2)∆‖f‖².

The algorithm requires a certain slack, log(1 + 2c/(nΛ) + c²/(n²Λ²)), in the privacy parameter. This is due to additional factors in bounding the ratio of the densities. The "If" statement in the algorithm comes from having to consider two cases in the proof of Theorem 2, which shows that the algorithm is differentially private.
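For concreteness, both algorithms can be sketched in a few lines for regularized logistic regression (N(f) = (1/2)‖f‖², ℓ(z) = log(1 + e^{−z}), so |ℓ′| ≤ 1 and |ℓ″| ≤ c = 1/4). This is our own illustrative sketch under those assumptions, using scipy for the minimization, and it draws the noise norm from the Gamma(d, 1/β) distribution implied by the density (7); it is not the authors' released code:

```python
import numpy as np
from scipy.optimize import minimize

def sample_noise(d, beta, rng):
    """Draw b in R^d with density proportional to exp(-beta * ||b||):
    uniform random direction, norm distributed as Gamma(shape=d, scale=1/beta)."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=1.0 / beta) * direction

def J(f, X, y, Lam, b=None, Delta=0.0):
    """Objective (1) with logistic loss, optionally plus the
    objective-perturbation terms (1/n) b^T f and (Delta/2) ||f||^2."""
    n = X.shape[0]
    obj = np.mean(np.logaddexp(0.0, -y * (X @ f))) + 0.5 * Lam * (f @ f)
    if b is not None:
        obj += (b @ f) / n + 0.5 * Delta * (f @ f)
    return obj

def output_perturbation(X, y, Lam, eps, rng):
    """Algorithm 1: minimize J, then add noise with beta = n*Lam*eps/2."""
    n, d = X.shape
    f_star = minimize(J, np.zeros(d), args=(X, y, Lam)).x
    return f_star + sample_noise(d, n * Lam * eps / 2.0, rng)

def objective_perturbation(X, y, Lam, eps, rng, c=0.25):
    """Algorithm 2: minimize J perturbed by a random linear term."""
    n, d = X.shape
    eps_prime = eps - np.log(1 + 2 * c / (n * Lam) + c**2 / (n * Lam)**2)
    Delta = 0.0
    if eps_prime <= 0:
        Delta = c / (n * (np.exp(eps / 4.0) - 1.0)) - Lam
        eps_prime = eps / 2.0
    b = sample_noise(d, eps_prime / 2.0, rng)
    return minimize(J, np.zeros(d), args=(X, y, Lam, b, Delta)).x
```

Note how the noise scale in Algorithm 1 grows as n and Λ shrink (through the sensitivity 2/(nΛ) established below), whereas in Algorithm 2 the noise level depends only on ε_p and the slack term.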
3.3 Privacy guarantees

In this section, we establish the conditions under which Algorithms 1 and 2 provide ε_p-differential privacy. First, we establish guarantees for Algorithm 1.

3.3.1 Privacy Guarantees for Output Perturbation

Theorem 1. If N(·) is differentiable and 1-strongly convex, and ℓ is convex and differentiable with |ℓ′(z)| ≤ 1 for all z, then Algorithm 1 provides ε_p-differential privacy.

The proof of Theorem 1 follows from Corollary 1 and Dwork et al. (2006b). The proof is provided here for completeness.

Proof. From Corollary 1, if the conditions on N(·) and ℓ hold, then the L2-sensitivity of ERM with regularization parameter Λ is at most 2/(nΛ). We observe that when we pick ‖b‖ from the distribution in Algorithm 1, for a specific vector b_0 ∈ R^d, the density at b_0 is proportional to e^{−(nΛε_p/2)‖b_0‖}. Let D and D′ be any two datasets that differ in the value of one individual. Then, for any f,

g(f | D) / g(f | D′) = ν(b_1) / ν(b_2) = e^{−(nΛε_p/2)(‖b_1‖ − ‖b_2‖)},   (10)

where b_1 and b_2 are the corresponding noise vectors chosen in Step 1 of Algorithm 1, and g(f | D) (respectively g(f | D′)) is the density of the output of Algorithm 1 at f, when the input is D (respectively D′). If f_1 and f_2 are the solutions respectively to non-private regularized ERM when the input is D and D′, then b_2 − b_1 = f_2 − f_1. From Corollary 1, and using a triangle inequality,

‖b_1‖ − ‖b_2‖ ≤ ‖b_1 − b_2‖ = ‖f_2 − f_1‖ ≤ 2/(nΛ).   (11)

Moreover, by symmetry, the density of the directions of b_1 and b_2 are uniform. Therefore, by construction, ν(b_1)/ν(b_2) ≤ e^{ε_p}. The theorem follows.

The main ingredient of the proof of Theorem 1 is a result about the sensitivity of regularized ERM, which is provided below.

Lemma 1. Let G(f) and g(f) be two vector-valued functions, which are continuous, and differentiable at all points. Moreover, let G(f) and G(f) + g(f) be λ-strongly convex. If f_1 = argmin_f G(f) and f_2 = argmin_f G(f) + g(f), then

‖f_1 − f_2‖ ≤ (1/λ) max_f ‖∇g(f)‖.   (12)

Proof. Using the definition of f_1 and f_2, and the fact that G and g are continuous and differentiable everywhere,

∇G(f_1) = ∇G(f_2) + ∇g(f_2) = 0.   (13)

As G(f) is λ-strongly convex, it follows from Lemma 14 of Shalev-Shwartz (2007) that:

(∇G(f_1) − ∇G(f_2))^T (f_1 − f_2) ≥ λ‖f_1 − f_2‖².   (14)

Combining this with (13) and the Cauchy-Schwarz inequality, we get that

‖f_1 − f_2‖ · ‖∇g(f_2)‖ ≥ (f_1 − f_2)^T ∇g(f_2) = (∇G(f_1) − ∇G(f_2))^T (f_1 − f_2) ≥ λ‖f_1 − f_2‖².   (15)

The conclusion follows from dividing both sides by λ‖f_1 − f_2‖.

Corollary 1. If N(·) is differentiable and 1-strongly convex, and ℓ is convex and differentiable with |ℓ′(z)| ≤ 1 for all z, then the L2-sensitivity of J(f, D) is at most 2/(nΛ).

Proof. Let D = {(x_1, y_1), . . . , (x_n, y_n)} and D′ = {(x_1, y_1), . . . , (x′_n, y′_n)} be two datasets that differ in the value of the n-th individual. We let G(f) = J(f, D), g(f) = J(f, D′) − J(f, D), f_1 = argmin_f J(f, D), and f_2 = argmin_f J(f, D′); explicitly, g(f) = (1/n)(ℓ(y′_n f^T x′_n) − ℓ(y_n f^T x_n)). We observe that due to the convexity of ℓ, and the 1-strong convexity of N(·), G(f) = J(f, D) is Λ-strongly convex. Moreover, G(f) + g(f) = J(f, D′) is also Λ-strongly convex. Finally, due to the differentiability of N(·) and ℓ, G(f) and g(f) are also differentiable at all points. We have:

∇g(f) = (1/n)(y_n ℓ′(y_n f^T x_n) x_n − y′_n ℓ′(y′_n f^T x′_n) x′_n).   (16)
(16) As y i ∈ [ − 1 , 1], | ℓ ′ ( z ) | ≤ 1, for all z , and || x i || ≤ 1, for an y f , ||∇ g ( f ) || ≤ 1 n ( || x n − x ′ n || ) ≤ 1 n ( || x n || + || x ′ n || ) ≤ 2 n . The pro of now follo ws by an applicatio n of Lemm a 1. 10 3.3.2 Priv acy Guaran te e s for Ob jectiv e P erturbation In this section, we sh o w that Algorithm 2 is ǫ p -differen tially priv ate. This p ro of requires stronger assumptions on the loss fun ction than we re r equ ired in Theorem 1. In certain cases, some of these assumptions can b e w eak ened; for such an example, see Section 3.4.2. Theorem 2. If N ( · ) is 1 -str ongly c onvex and doubly differ entiable, and ℓ ( · ) is c onvex and doubly differ entiable, with | ℓ ′ ( z ) | ≤ 1 and | ℓ ′′ ( z ) | ≤ c for al l z , then Algorithm 2 is ǫ p -differ ential ly priva te. Pr o of. Consider an f priv output by Algorithm 2. W e observ e th at giv en any fixed f priv and a fixed dataset D , there alw a ys exists a b su c h that Algorithm 2 outputs f priv on inp ut D . Because ℓ is d ifferentiable and con vex, and N ( · ) is differen tiable, w e can tak e the gradien t of the ob ject iv e function and set it to 0 at f priv . Therefore, b = − n Λ ∇ N ( f priv ) − n X i =1 y i ℓ ′ ( y i f T priv x i ) x i − n ∆ f priv . (17) Note that (17) holds b ecause for an y f , ∇ ℓ ( f T x ) = ℓ ′ ( f T x ) x . W e claim that as ℓ is differen tiable and J ( f , D ) + ∆ 2 || f || 2 is strongly con ve x, giv en a dataset D = ( x 1 , y 1 ) , . . . , ( x n , y n ), there is a bijection b et w een b and f priv . The r elation (17 ) sho ws that t wo different b v alues cannot r esu lt in the same f priv . F urtherm ore, since the ob jectiv e is strictly con vex, for a fixed b and D , there is a unique f priv ; therefore the map from b to f priv is injectiv e. The r elation (17) also sho w s that f or any f priv there exists a b for wh ic h f priv is the minimizer, so the map from b to f priv is surjectiv e. 
To show $\epsilon_p$-differential privacy, we need to compute the ratio $g(f_{\mathrm{priv}} \,|\, \mathcal{D}) / g(f_{\mathrm{priv}} \,|\, \mathcal{D}')$ of the densities of $f_{\mathrm{priv}}$ under the two datasets $\mathcal{D}$ and $\mathcal{D}'$. This ratio can be written as (Billingsley, 1995)
\[
\frac{g(f_{\mathrm{priv}} \,|\, \mathcal{D})}{g(f_{\mathrm{priv}} \,|\, \mathcal{D}')} = \frac{\mu(b \,|\, \mathcal{D})}{\mu(b' \,|\, \mathcal{D}')} \cdot \frac{|\det(J(f_{\mathrm{priv}} \to b \,|\, \mathcal{D}))|^{-1}}{|\det(J(f_{\mathrm{priv}} \to b' \,|\, \mathcal{D}'))|^{-1}},
\]
where $J(f_{\mathrm{priv}} \to b \,|\, \mathcal{D})$ and $J(f_{\mathrm{priv}} \to b' \,|\, \mathcal{D}')$ are the Jacobian matrices of the mappings from $f_{\mathrm{priv}}$ to $b$, and $\mu(b \,|\, \mathcal{D})$ and $\mu(b' \,|\, \mathcal{D}')$ are the densities of $b$ given the output $f_{\mathrm{priv}}$, when the datasets are $\mathcal{D}$ and $\mathcal{D}'$ respectively.

First, we bound the ratio of the Jacobian determinants. Let $b^{(j)}$ denote the $j$-th coordinate of $b$. From (17) we have
\[
b^{(j)} = -n\Lambda \nabla N(f_{\mathrm{priv}})^{(j)} - \sum_{i=1}^{n} y_i \ell'(y_i f_{\mathrm{priv}}^T x_i) x_i^{(j)} - n\Delta f_{\mathrm{priv}}^{(j)}.
\]
Given a dataset $\mathcal{D}$, the $(j,k)$-th entry of the Jacobian matrix $J(f \to b \,|\, \mathcal{D})$ is
\[
\frac{\partial b^{(j)}}{\partial f_{\mathrm{priv}}^{(k)}} = -n\Lambda \nabla^2 N(f_{\mathrm{priv}})^{(j,k)} - \sum_i y_i^2 \ell''(y_i f_{\mathrm{priv}}^T x_i) x_i^{(j)} x_i^{(k)} - n\Delta \, \mathbf{1}(j = k),
\]
where $\mathbf{1}(\cdot)$ is the indicator function. We note that the Jacobian is defined for all $f_{\mathrm{priv}}$ because $N(\cdot)$ and $\ell$ are globally doubly differentiable.

Let $\mathcal{D}$ and $\mathcal{D}'$ be two datasets which differ in the value of the $n$-th item, such that $\mathcal{D} = \{(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y_n)\}$ and $\mathcal{D}' = \{(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x'_n, y'_n)\}$. Moreover, we define matrices $A$ and $E$ as follows:
\begin{align*}
A &= n\Lambda \nabla^2 N(f_{\mathrm{priv}}) + \sum_{i=1}^{n} y_i^2 \ell''(y_i f_{\mathrm{priv}}^T x_i) x_i x_i^T + n\Delta I_d, \\
E &= -y_n^2 \ell''(y_n f_{\mathrm{priv}}^T x_n) x_n x_n^T + (y'_n)^2 \ell''(y'_n f_{\mathrm{priv}}^T x'_n) x'_n (x'_n)^T.
\end{align*}
Then $J(f_{\mathrm{priv}} \to b \,|\, \mathcal{D}) = -A$ and $J(f_{\mathrm{priv}} \to b' \,|\, \mathcal{D}') = -(A + E)$. Let $\lambda_1(M)$ and $\lambda_2(M)$ denote the largest and second largest eigenvalues of a matrix $M$.
As $E$ has rank at most $2$, from Lemma 2,
\[
\frac{|\det(J(f_{\mathrm{priv}} \to b' \,|\, \mathcal{D}'))|}{|\det(J(f_{\mathrm{priv}} \to b \,|\, \mathcal{D}))|} = \frac{|\det(A + E)|}{|\det A|} = \left| 1 + \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E) \right|.
\]
For a $1$-strongly convex function $N$, the Hessian $\nabla^2 N(f_{\mathrm{priv}})$ has eigenvalues greater than $1$ (Boyd and Vandenberghe, 2004). Since we have assumed $\ell$ is doubly differentiable and convex, any eigenvalue of $A$ is therefore at least $n\Lambda + n\Delta$; therefore, for $j = 1, 2$,
\[
|\lambda_j(A^{-1}E)| \le \frac{|\lambda_j(E)|}{n(\Lambda + \Delta)}.
\]
Applying the triangle inequality to the trace norm:
\[
|\lambda_1(E)| + |\lambda_2(E)| \le |y_n^2 \ell''(y_n f_{\mathrm{priv}}^T x_n)| \cdot \|x_n\|^2 + |(y'_n)^2 \ell''(y'_n f_{\mathrm{priv}}^T x'_n)| \cdot \|x'_n\|^2.
\]
Then upper bounds on $|y_i|$, $\|x_i\|$, and $|\ell''(z)|$ yield
\[
|\lambda_1(E)| + |\lambda_2(E)| \le 2c.
\]
Therefore, $|\lambda_1(E)| \cdot |\lambda_2(E)| \le c^2$, and
\[
\frac{|\det(A + E)|}{|\det(A)|} \le 1 + \frac{2c}{n(\Lambda + \Delta)} + \frac{c^2}{n^2(\Lambda + \Delta)^2} = \left( 1 + \frac{c}{n(\Lambda + \Delta)} \right)^2.
\]
We now consider two cases. In the first case, $\Delta = 0$, and by definition, in that case,
\[
1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2} \le e^{\epsilon_p - \epsilon'_p}.
\]
In the second case, $\Delta > 0$, and in this case, by definition of $\Delta$, $\left(1 + \frac{c}{n(\Lambda + \Delta)}\right)^2 = e^{\epsilon_p/2} = e^{\epsilon_p - \epsilon'_p}$.

Next, we bound the ratio of the densities of $b$. We observe that as $|\ell'(z)| \le 1$ for all $z$, and $|y_i|, \|x_i\| \le 1$, for datasets $\mathcal{D}$ and $\mathcal{D}'$ which differ by one value,
\[
b' - b = y_n \ell'(y_n f_{\mathrm{priv}}^T x_n) x_n - y'_n \ell'(y'_n f_{\mathrm{priv}}^T x'_n) x'_n.
\]
This implies that
\[
\|b\| - \|b'\| \le \|b - b'\| \le 2.
\]
We can write:
\[
\frac{\mu(b \,|\, \mathcal{D})}{\mu(b' \,|\, \mathcal{D}')} = \frac{\|b\|^{d-1} e^{-\epsilon'_p \|b\|/2} \cdot \frac{1}{\mathrm{surf}(\|b\|)}}{\|b'\|^{d-1} e^{-\epsilon'_p \|b'\|/2} \cdot \frac{1}{\mathrm{surf}(\|b'\|)}} = e^{\epsilon'_p(\|b'\| - \|b\|)/2} \le e^{\epsilon'_p},
\]
where $\mathrm{surf}(x)$ denotes the surface area of the sphere in $d$ dimensions with radius $x$.
Here the last step follows from the fact that $\mathrm{surf}(x) = s(1) x^{d-1}$, where $s(1)$ is the surface area of the unit sphere in $\mathbb{R}^d$.

Finally, we are ready to bound the ratio of densities:
\[
\frac{g(f_{\mathrm{priv}} \,|\, \mathcal{D})}{g(f_{\mathrm{priv}} \,|\, \mathcal{D}')} = \frac{\mu(b \,|\, \mathcal{D})}{\mu(b' \,|\, \mathcal{D}')} \cdot \frac{|\det(J(f_{\mathrm{priv}} \to b' \,|\, \mathcal{D}'))|}{|\det(J(f_{\mathrm{priv}} \to b \,|\, \mathcal{D}))|} = \frac{\mu(b \,|\, \mathcal{D})}{\mu(b' \,|\, \mathcal{D}')} \cdot \frac{|\det(A + E)|}{|\det A|} \le e^{\epsilon'_p} \cdot e^{\epsilon_p - \epsilon'_p} = e^{\epsilon_p}.
\]
Thus, Algorithm 2 satisfies Definition 2.

Lemma 2. If $A$ is full rank, and if $E$ has rank at most $2$, then
\[
\frac{\det(A + E) - \det(A)}{\det(A)} = \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E), \tag{18}
\]
where $\lambda_j(Z)$ is the $j$-th eigenvalue of the matrix $Z$.

Proof. Note that $E$ has rank at most $2$, so $A^{-1}E$ also has rank at most $2$. Using the fact that $\lambda_i(I + A^{-1}E) = 1 + \lambda_i(A^{-1}E)$,
\begin{align*}
\frac{\det(A + E) - \det(A)}{\det A} &= \det(I + A^{-1}E) - 1 \\
&= (1 + \lambda_1(A^{-1}E))(1 + \lambda_2(A^{-1}E)) - 1 \\
&= \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E).
\end{align*}

3.4 Application to classification

In this section, we show how to use our results to provide privacy-preserving versions of logistic regression and support vector machines.

3.4.1 Logistic Regression

One popular ERM classification algorithm is regularized logistic regression. In this case, $N(f) = \frac{1}{2}\|f\|^2$, and the loss function is $\ell_{\mathrm{LR}}(z) = \log(1 + e^{-z})$. Taking derivatives and double derivatives,
\begin{align*}
\ell'_{\mathrm{LR}}(z) &= \frac{-1}{1 + e^{z}}, \\
\ell''_{\mathrm{LR}}(z) &= \frac{1}{(1 + e^{-z})(1 + e^{z})}.
\end{align*}
Note that $\ell_{\mathrm{LR}}$ is continuous, differentiable, and doubly differentiable, with $c \le \frac{1}{4}$. Therefore, we can plug the logistic loss directly into Theorems 1 and 2 to get the following result.

Corollary 2. The output of Algorithm 1 with $N(f) = \frac{1}{2}\|f\|^2$ and $\ell = \ell_{\mathrm{LR}}$ is an $\epsilon_p$-differentially private approximation to logistic regression.
The output of Algorithm 2 with $N(f) = \frac{1}{2}\|f\|^2$, $c = \frac{1}{4}$, and $\ell = \ell_{\mathrm{LR}}$ is an $\epsilon_p$-differentially private approximation to logistic regression.

We quantify how well the outputs of Algorithms 1 and 2 approximate (non-private) logistic regression in Section 4.

3.4.2 Support Vector Machines

Another very commonly used classifier is the $L_2$-regularized support vector machine. In this case, again, $N(f) = \frac{1}{2}\|f\|^2$, and
\[
\ell_{\mathrm{SVM}}(z) = \max(0, 1 - z). \tag{19}
\]
Notice that this loss function is continuous but not differentiable, and thus it does not satisfy the conditions of Theorems 1 and 2. There are two alternative solutions to this. First, we can approximate $\ell_{\mathrm{SVM}}$ by a different loss function which is doubly differentiable, as follows (see also Chapelle (2007)):
\[
\ell_s(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{(1-z)^4}{16h^3} + \frac{3(1-z)^2}{8h} + \frac{1-z}{2} + \frac{3h}{16} & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases} \tag{20}
\]
As $h \to 0$, this loss approaches the hinge loss. Taking derivatives, we observe that:
\[
\ell'_s(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{(1-z)^3}{4h^3} - \frac{3(1-z)}{4h} - \frac{1}{2} & \text{if } |1 - z| \le h \\ -1 & \text{if } z < 1 - h \end{cases} \tag{21}
\]
Moreover,
\[
\ell''_s(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{3(1-z)^2}{4h^3} + \frac{3}{4h} & \text{if } |1 - z| \le h \\ 0 & \text{if } z < 1 - h \end{cases} \tag{22}
\]
Observe that this implies $|\ell''_s(z)| \le \frac{3}{4h}$ for all $h$ and $z$. Moreover, $\ell_s$ is convex, as $\ell''_s(z) \ge 0$ for all $z$. Therefore, $\ell_s$ can be used in Theorems 1 and 2, which gives us privacy-preserving approximations to regularized support vector machines.

Corollary 3. The output of Algorithm 1 with $N(f) = \frac{1}{2}\|f\|^2$ and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines. The output of Algorithm 2 with $N(f) = \frac{1}{2}\|f\|^2$, $c = \frac{3}{4h}$, and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines.
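As a sanity check on the formulas above, the smoothed loss $\ell_s$ and its second derivative can be implemented and verified numerically in a few lines. The following is a minimal numpy sketch (the function names are ours, not part of the paper):

```python
import numpy as np

def smoothed_hinge(z, h):
    """The doubly differentiable approximation l_s to the hinge loss, eq. (20)."""
    z = np.asarray(z, dtype=float)
    u = 1.0 - z
    mid = -u**4 / (16 * h**3) + 3 * u**2 / (8 * h) + u / 2 + 3 * h / 16
    return np.where(z > 1 + h, 0.0, np.where(z < 1 - h, u, mid))

def smoothed_hinge_dd(z, h):
    """Second derivative l_s'' from eq. (22); bounded in magnitude by c = 3/(4h)."""
    z = np.asarray(z, dtype=float)
    u = 1.0 - z
    return np.where(np.abs(u) <= h, -3 * u**2 / (4 * h**3) + 3.0 / (4 * h), 0.0)

zs = np.linspace(-2.0, 3.0, 2001)
hinge = np.maximum(0.0, 1.0 - zs)
# largest deviation from the hinge loss; it is 3h/16, attained at z = 1
gap = np.max(np.abs(smoothed_hinge(zs, 0.5) - hinge))
```

Outside the smoothing band $|1 - z| \le h$ the function coincides exactly with the hinge loss, its curvature never exceeds $c = \frac{3}{4h}$, and the maximum gap to the hinge loss shrinks linearly in $h$, illustrating the claim that $\ell_s$ approaches the hinge loss as $h \to 0$.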
The second solution is to use the Huber loss, as suggested by Chapelle (2007), which is defined as follows:
\[
\ell_{\mathrm{Huber}}(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{1}{4h}(1 + h - z)^2 & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases} \tag{23}
\]
Observe that the Huber loss is convex and differentiable, and piecewise doubly differentiable, with $c = \frac{1}{2h}$. However, it is not globally doubly differentiable, and hence the Jacobian in the proof of Theorem 2 is undefined for certain values of $f$. However, we can show that in this case, Algorithm 2, when run with $c = \frac{1}{2h}$, satisfies Definition 3.

Let $G$ denote the map from $f_{\mathrm{priv}}$ to $b$ in (17) under $\mathcal{B} = \mathcal{D}$, and $H$ denote the map under $\mathcal{B} = \mathcal{D}'$. By definition, the probability $P(f_{\mathrm{priv}} \in S \,|\, \mathcal{B} = \mathcal{D}) = P_b(b \in G(S))$.

Corollary 4. Let $f_{\mathrm{priv}}$ be the output of Algorithm 2 with $\ell = \ell_{\mathrm{Huber}}$, $c = \frac{1}{2h}$, and $N(f) = \frac{1}{2}\|f\|_2^2$. For any set $S$ of possible values of $f_{\mathrm{priv}}$, and any pair of datasets $\mathcal{D}$, $\mathcal{D}'$ which differ in the private value of one person $(x_n, y_n)$,
\[
e^{-\epsilon_p} P(S \,|\, \mathcal{B} = \mathcal{D}') \le P(S \,|\, \mathcal{B} = \mathcal{D}) \le e^{\epsilon_p} P(S \,|\, \mathcal{B} = \mathcal{D}'). \tag{24}
\]

Proof. Consider the event $f_{\mathrm{priv}} \in S$. Let $T = G(S)$ and $T' = H(S)$. Because $G$ is a bijection, we have
\[
P(f_{\mathrm{priv}} \in S \,|\, \mathcal{B} = \mathcal{D}) = P_b(b \in T \,|\, \mathcal{B} = \mathcal{D}), \tag{25}
\]
and a similar expression holds when $\mathcal{B} = \mathcal{D}'$. Now note that $\ell'_{\mathrm{Huber}}(z)$ is non-differentiable only at a finite number of values of $z$. Let $Z$ be the set of these values of $z$, and let
\[
C = \{ f : y f^T x = z, \; z \in Z, \; (x, y) \in \mathcal{D} \cup \mathcal{D}' \}. \tag{26}
\]
Pick a tuple $(z, (x, y)) \in Z \times (\mathcal{D} \cup \mathcal{D}')$. The set of $f$ such that $y f^T x = z$ is a hyperplane in $\mathbb{R}^d$. Since $\nabla N(f) = f$ and $\ell'$ is piecewise linear, from (17) we see that the set of corresponding $b$'s is also piecewise linear, and hence has Lebesgue measure $0$. Since the measure corresponding to $b$ is absolutely continuous with respect to the Lebesgue measure, this hyperplane has probability $0$ under $b$ as well.
Since $C$ is a finite union of such hyperplanes, we have $P(b \in G(C)) = 0$. Thus we have $P_b(T \,|\, \mathcal{B} = \mathcal{D}) = P_b(G(S \setminus C) \,|\, \mathcal{B} = \mathcal{D})$, and similarly for $\mathcal{D}'$. From the definition of $G$ and $H$, for $f \in S \setminus C$,
\[
H(f) = G(f) + y_n \ell'(y_n f^T x_n) x_n - y'_n \ell'(y'_n f^T x'_n) x'_n. \tag{27}
\]
Since $f \notin C$, this mapping shows that if $P_b(G(S \setminus C) \,|\, \mathcal{B} = \mathcal{D}) = 0$, then we must also have $P_b(H(S \setminus C) \,|\, \mathcal{B} = \mathcal{D}') = 0$. Thus the result holds for sets of measure $0$. If $S \setminus C$ has positive measure, we can calculate the ratio of the probabilities for the $f_{\mathrm{priv}}$ at which the loss is twice differentiable. For such $f_{\mathrm{priv}}$ the Jacobian is also defined, and we can use a method similar to Theorem 2 to prove the result.

Remark: Because the privacy proof for Algorithm 1 does not require the analytic properties needed by Theorem 2, we can also use the Huber loss in Algorithm 1 to get an $\epsilon_p$-differentially private approximation to the SVM. We quantify how well the outputs of Algorithms 1 and 2 approximate (non-private) support vector machines in Section 4.

These approximations to the hinge loss are necessary because of the analytic requirements of Theorems 1 and 2 on the loss function. Because the requirements of Theorem 2 are stricter, it may be possible to use an approximate loss in Algorithm 1 that would not be admissible in Algorithm 2.

4 Generalization performance

In this section, we provide guarantees on the performance of the privacy-preserving ERM algorithms of Section 3. We provide these bounds for $L_2$-regularization. To quantify this performance, we will assume that the $n$ entries in the dataset $\mathcal{D}$ are drawn i.i.d. according to a fixed distribution $P(x, y)$. We measure the performance of these algorithms by the number of samples $n$ required to achieve error $L^* + \epsilon_g$, where $L^*$ is the loss of a reference ERM predictor $f_0$.
The resulting bound on $\epsilon_g$ will depend on the norm $\|f_0\|$ of this predictor. By choosing an upper bound $\nu$ on the norm, we can interpret the result as saying that the privacy-preserving classifier will have error $\epsilon_g$ more than that of any predictor with $\|f_0\| \le \nu$.

Given a distribution $P$, the expected loss $L(f)$ for a classifier $f$ is
\[
L(f) = \mathbb{E}_{(x,y) \sim P}\left[ \ell(f^T x, y) \right]. \tag{28}
\]
The sample complexity for generalization error $\epsilon_g$ against a classifier $f_0$ is the number of samples $n$ required to achieve error $L(f_0) + \epsilon_g$ under any data distribution $P$. We would like the sample complexity to be low.

For a fixed $P$ we define the following function, which will be useful in our analysis:
\[
\bar{J}(f) = L(f) + \frac{\Lambda}{2}\|f\|^2. \tag{29}
\]
The function $\bar{J}(f)$ is the expectation (over $P$) of the non-private $L_2$-regularized ERM objective evaluated at $f$.

For non-private ERM, Shalev-Shwartz and Srebro (2008) show that for a given $f_0$ with loss $L(f_0) = L^*$, if the number of data points satisfies
\[
n > C \frac{\|f_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2} \tag{30}
\]
for some constant $C$, then the excess loss of the $L_2$-regularized SVM solution $f_{\mathrm{svm}}$ satisfies $L(f_{\mathrm{svm}}) \le L(f_0) + \epsilon_g$. This order of growth will hold for our results as well. It also serves as a reference against which we can compare the additional burden on the sample complexity imposed by the privacy constraints.

For most learning problems, we require the generalization error $\epsilon_g < 1$. Moreover, it is also typically the case that for more difficult learning problems, $\|f_0\|$ is higher. For example, for regularized SVM, $\frac{1}{\|f_0\|}$ is the margin of classification, and as a result, $\|f_0\|$ is higher for learning problems with smaller margin.
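The analysis in this section repeatedly tracks the noise vector $b$ added by Algorithm 1, whose density is proportional to $e^{-n\Lambda\epsilon_p\|b\|/2}$: its direction is uniform on the sphere, and its norm follows a $\Gamma(d, \frac{2}{n\Lambda\epsilon_p})$ distribution, the random variable to which Lemma 4 is later applied. The sampling step can be sketched in a few lines of numpy (the function name and the zero vector standing in for the non-private minimizer are ours):

```python
import numpy as np

def erm_noise(d, n, Lam, eps_p, rng):
    """Sample b in R^d with density proportional to exp(-beta * ||b||),
    where beta = n * Lam * eps_p / 2: a uniformly random direction
    scaled by a Gamma(shape=d, scale=1/beta) norm."""
    beta = n * Lam * eps_p / 2.0
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=1.0 / beta) * direction

rng = np.random.default_rng(0)
f_star = np.zeros(5)  # stand-in for the non-private regularized ERM minimizer
f_priv = f_star + erm_noise(5, n=1000, Lam=0.01, eps_p=0.5, rng=rng)
```

The expected norm of the noise is $d \cdot \frac{2}{n\Lambda\epsilon_p}$, which makes visible why the bounds below degrade with the dimension $d$ and improve with $n$, $\Lambda$, and $\epsilon_p$.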
From the bounds provided in this section, we note that the dominating term in the sample requirement for objective perturbation has a better dependence on $\|f_0\|$ as well as on $\frac{1}{\epsilon_g}$; as a result, for more difficult learning problems, we expect objective perturbation to perform better than output perturbation.

4.1 Output perturbation

First, we provide performance guarantees for Algorithm 1, by providing a bound on the number of samples required for Algorithm 1 to produce a classifier with low error.

Definition 6. A function $g(z) : \mathbb{R} \to \mathbb{R}$ is $c$-Lipschitz if for all pairs $(z_1, z_2)$ we have $|g(z_1) - g(z_2)| \le c|z_1 - z_2|$.

Recall that if a function $g(z)$ is differentiable with $|g'(z)| \le r$ for all $z$, then $g(z)$ is also $r$-Lipschitz.

Theorem 3. Let $N(f) = \frac{1}{2}\|f\|^2$, let $f_0$ be a classifier such that $L(f_0) = L^*$, and let $\delta > 0$. If $\ell$ is differentiable and continuous with $|\ell'(z)| \le 1$, the derivative $\ell'$ is $c$-Lipschitz, and the data $\mathcal{D}$ is drawn i.i.d. according to $P$, then there exists a constant $C$ such that if the number of training samples satisfies
\[
n > C \max\left( \frac{\|f_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) c^{1/2} \|f_0\|^2}{\epsilon_g^{3/2} \epsilon_p} \right), \tag{31}
\]
where $d$ is the dimension of the data space, then the output $f_{\mathrm{priv}}$ of Algorithm 1 satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - 2\delta. \tag{32}
\]

Proof. Let
\[
f_{\mathrm{rtr}} = \operatorname{argmin}_f \bar{J}(f), \qquad f^* = \operatorname{argmin}_f J(f, \mathcal{D}),
\]
and let $f_{\mathrm{priv}}$ denote the output of Algorithm 1. Using the analysis method of Shalev-Shwartz and Srebro (2008) shows
\[
L(f_{\mathrm{priv}}) = L(f_0) + (\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}})) + (\bar{J}(f_{\mathrm{rtr}}) - \bar{J}(f_0)) + \frac{\Lambda}{2}\|f_0\|^2 - \frac{\Lambda}{2}\|f_{\mathrm{priv}}\|^2. \tag{33}
\]
We will bound the terms on the right-hand side of (33). For the regularizer $N(f) = \frac{1}{2}\|f\|^2$, the Hessian satisfies $\|\nabla^2 N(f)\|_2 \le 1$.
Therefore, from Lemma 3, with probability $1 - \delta$ over the privacy mechanism,
\[
J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D}) \le \frac{8 d^2 \log^2(d/\delta)(c + \Lambda)}{\Lambda^2 n^2 \epsilon_p^2}.
\]
Furthermore, the results of Sridharan et al. (2008) show that with probability $1 - \delta$ over the choice of the data,
\[
\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}}) \le 2(J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D})) + O\left( \frac{\log(1/\delta)}{\Lambda n} \right).
\]
The constant in the last term depends on the derivative of the loss and the norms of the data points, which by assumption are bounded. Combining the preceding two statements, with probability $1 - 2\delta$ over the noise in the privacy mechanism and the data distribution, the second term on the right-hand side of (33) is at most
\[
\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}}) \le \frac{16 d^2 \log^2(d/\delta)(c + \Lambda)}{\Lambda^2 n^2 \epsilon_p^2} + O\left( \frac{\log(1/\delta)}{\Lambda n} \right). \tag{34}
\]
By definition of $f_{\mathrm{rtr}}$, the difference $(\bar{J}(f_{\mathrm{rtr}}) - \bar{J}(f_0)) \le 0$. Setting $\Lambda = \frac{\epsilon_g}{\|f_0\|^2}$ in (33) and using (34), we obtain
\[
L(f_{\mathrm{priv}}) \le L(f_0) + \frac{16 \|f_0\|^4 d^2 \log^2(d/\delta)(c + \epsilon_g/\|f_0\|^2)}{n^2 \epsilon_g^2 \epsilon_p^2} + O\left( \frac{\|f_0\|^2 \log(1/\delta)}{n \epsilon_g} \right) + \frac{\epsilon_g}{2}. \tag{35}
\]
Solving for $n$ to make the total excess error equal to $\epsilon_g$ yields (31).

Lemma 3. Suppose $N(\cdot)$ is doubly differentiable with $\|\nabla^2 N(f)\|_2 \le \eta$ for all $f$, and suppose that $\ell$ is differentiable and has a continuous, $c$-Lipschitz derivative. Given training data $\mathcal{D}$, let $f^*$ be a classifier that minimizes $J(f, \mathcal{D})$ and let $f_{\mathrm{priv}}$ be the classifier output by Algorithm 1. Then
\[
P_b\left( J(f_{\mathrm{priv}}, \mathcal{D}) \le J(f^*, \mathcal{D}) + \frac{2 d^2 (c + \Lambda\eta) \log^2(d/\delta)}{\Lambda^2 n^2 \epsilon_p^2} \right) \ge 1 - \delta, \tag{36}
\]
where the probability is taken over the randomness in the noise $b$ of Algorithm 1.

Note that when $\ell$ is doubly differentiable, $c$ is an upper bound on the second derivative of $\ell$, and is the same as the constant $c$ in Theorem 2.

Proof. Let $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, and recall that $\|x_i\| \le 1$ and $|y_i| \le 1$.
As $N(\cdot)$ and $\ell$ are differentiable, we use the Mean Value Theorem to show that for some $t$ between $0$ and $1$,
\begin{align*}
J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D}) &= (f_{\mathrm{priv}} - f^*)^T \nabla J(t f^* + (1-t) f_{\mathrm{priv}}) \\
&\le \|f_{\mathrm{priv}} - f^*\| \cdot \|\nabla J(t f^* + (1-t) f_{\mathrm{priv}})\|, \tag{37}
\end{align*}
where the second step follows by an application of the Cauchy-Schwartz inequality. Recall that
\[
\nabla J(f, \mathcal{D}) = \Lambda \nabla N(f) + \frac{1}{n} \sum_i y_i \ell'(y_i f^T x_i) x_i.
\]
Moreover, recall that $\nabla J(f^*, \mathcal{D}) = 0$, from the optimality of $f^*$. Therefore,
\begin{align*}
\nabla J(t f^* + (1-t) f_{\mathrm{priv}}, \mathcal{D}) = \nabla J(f^*, \mathcal{D}) &- \Lambda\left( \nabla N(f^*) - \nabla N(t f^* + (1-t) f_{\mathrm{priv}}) \right) \\
&- \frac{1}{n} \sum_i y_i \left( \ell'(y_i (f^*)^T x_i) - \ell'(y_i (t f^* + (1-t) f_{\mathrm{priv}})^T x_i) \right) x_i. \tag{38}
\end{align*}
Now, from the Lipschitz condition on $\ell'$, for each $i$ we can upper bound each term in the summation above:
\begin{align*}
\left\| y_i \left( \ell'(y_i (f^*)^T x_i) - \ell'(y_i (t f^* + (1-t) f_{\mathrm{priv}})^T x_i) \right) x_i \right\|
&\le |y_i| \cdot \|x_i\| \cdot \left| \ell'(y_i (f^*)^T x_i) - \ell'(y_i (t f^* + (1-t) f_{\mathrm{priv}})^T x_i) \right| \\
&\le |y_i| \cdot \|x_i\| \cdot c \cdot \left| y_i (1-t) (f^* - f_{\mathrm{priv}})^T x_i \right| \\
&\le c(1-t) |y_i|^2 \cdot \|x_i\|^2 \cdot \|f^* - f_{\mathrm{priv}}\| \\
&\le c(1-t) \|f^* - f_{\mathrm{priv}}\|. \tag{39}
\end{align*}
The second step follows because $\ell'$ is $c$-Lipschitz, and the last step follows from the bounds on $|y_i|$ and $\|x_i\|$. Because $N$ is doubly differentiable, we can apply the Mean Value Theorem again to conclude that
\[
\|\nabla N(t f^* + (1-t) f_{\mathrm{priv}}) - \nabla N(f^*)\| \le (1-t) \|f_{\mathrm{priv}} - f^*\| \cdot \|\nabla^2 N(f'')\|_2 \tag{40}
\]
for some $f'' \in \mathbb{R}^d$.

As $0 \le t \le 1$, we can combine (38), (39), and (40) to obtain
\begin{align*}
\|\nabla J(t f^* + (1-t) f_{\mathrm{priv}}, \mathcal{D})\| &\le \left\| \Lambda\left( \nabla N(f^*) - \nabla N(t f^* + (1-t) f_{\mathrm{priv}}) \right) \right\| \\
&\quad + \left\| \frac{1}{n} \sum_i y_i \left( \ell'(y_i (f^*)^T x_i) - \ell'(y_i (t f^* + (1-t) f_{\mathrm{priv}})^T x_i) \right) x_i \right\| \\
&\le (1-t) \|f_{\mathrm{priv}} - f^*\| \cdot \left( \Lambda\eta + \frac{1}{n} \cdot n \cdot c \right) \\
&\le \|f_{\mathrm{priv}} - f^*\| (\Lambda\eta + c). \tag{41}
\end{align*}
From the definition of Algorithm 1, $f_{\mathrm{priv}} - f^* = b$, where $b$ is the noise vector. Now we can apply Lemma 4 to $\|f_{\mathrm{priv}} - f^*\|$, with parameters $k = d$ and $\theta = \frac{2}{\Lambda n \epsilon_p}$. From Lemma 4, with probability $1 - \delta$,
\[
\|f_{\mathrm{priv}} - f^*\| \le \frac{2 d \log(\frac{d}{\delta})}{\Lambda n \epsilon_p}.
\]
The lemma follows by combining this with equations (41) and (37).

Lemma 4. Let $X$ be a random variable drawn from the distribution $\Gamma(k, \theta)$, where $k$ is an integer. Then
\[
P\left( X < k\theta \log\left(\frac{k}{\delta}\right) \right) \ge 1 - \delta. \tag{42}
\]

Proof. Since $k$ is an integer, we can decompose $X$ distributed according to $\Gamma(k, \theta)$ as a summation
\[
X = X_1 + \ldots + X_k, \tag{43}
\]
where $X_1, X_2, \ldots, X_k$ are independent exponential random variables with mean $\theta$. For each $i$ we have $P(X_i \ge \theta \log(k/\delta)) = \delta/k$. Now,
\begin{align*}
P(X < k\theta \log(k/\delta)) &\ge P(X_i < \theta \log(k/\delta), \; i = 1, 2, \ldots, k) \tag{44} \\
&= (1 - \delta/k)^k \tag{45} \\
&\ge 1 - \delta. \tag{46}
\end{align*}

4.2 Objective perturbation

We now establish performance bounds on Algorithm 2. The bound can be summarized as follows.

Theorem 4. Let $N(f) = \frac{1}{2}\|f\|^2$, and let $f_0$ be a classifier with expected loss $L(f_0) = L^*$. Let $\ell$ be convex and doubly differentiable, and let its derivatives satisfy $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$ for all $z$. Then there exists a constant $C$ such that for $\delta > 0$, if the $n$ training samples in $\mathcal{D}$ are drawn i.i.d. according to $P$, and if
\[
n > C \max\left( \frac{\|f_0\|^2 \log(1/\delta)}{\epsilon_g^2}, \; \frac{c \|f_0\|^2}{\epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p} \right), \tag{47}
\]
then the output $f_{\mathrm{priv}}$ of Algorithm 2 satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - 2\delta. \tag{48}
\]

Proof. Let
\[
f_{\mathrm{rtr}} = \operatorname{argmin}_f \bar{J}(f), \qquad f^* = \operatorname{argmin}_f J(f, \mathcal{D}),
\]
and let $f_{\mathrm{priv}}$ denote the output of Algorithm 2. As in Theorem 3, the analysis of Shalev-Shwartz and Srebro (2008) shows
\[
L(f_{\mathrm{priv}}) = L(f_0) + (\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}})) + (\bar{J}(f_{\mathrm{rtr}}) - \bar{J}(f_0)) + \frac{\Lambda}{2}\|f_0\|^2 - \frac{\Lambda}{2}\|f_{\mathrm{priv}}\|^2. \tag{49}
\]
We will bound each of the terms on the right-hand side. If $n > \frac{4c\|f_0\|^2}{\epsilon_g \epsilon_p}$ and $\Lambda \ge \frac{\epsilon_g}{\|f_0\|^2}$, then $n\Lambda > \frac{4c}{\epsilon_p}$, so from the definition of $\epsilon'_p$ in Algorithm 2,
\[
\epsilon'_p = \epsilon_p - 2\log\left(1 + \frac{c}{n\Lambda}\right) \ge \epsilon_p - 2\log\left(1 + \frac{\epsilon_p}{4}\right) \ge \epsilon_p - \frac{\epsilon_p}{2}, \tag{50}
\]
where the last step follows because $\log(1 + x) \le x$ for $x \in [0, 1]$. Note that for these values of $\Lambda$ we have $\epsilon'_p > 0$. Therefore, we can apply Lemma 5 to conclude that with probability at least $1 - \delta$ over the privacy mechanism,
\[
J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D}) \le \frac{4 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2}. \tag{51}
\]
From Sridharan et al. (2008),
\begin{align*}
\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}}) &\le 2(J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D})) + O\left( \frac{\log(1/\delta)}{\Lambda n} \right) \tag{52} \\
&\le \frac{8 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} + O\left( \frac{\log(1/\delta)}{\Lambda n} \right). \tag{53}
\end{align*}
By definition of $f_{\mathrm{rtr}}$, we have $\bar{J}(f_{\mathrm{rtr}}) - \bar{J}(f_0) \le 0$. If $\Lambda$ is set to be $\frac{\epsilon_g}{\|f_0\|^2}$, then the fourth quantity in Equation (49) is at most $\frac{\epsilon_g}{2}$. The theorem follows by solving for $n$ to make the total excess error at most $\epsilon_g$.

The following lemma is analogous to Lemma 3; it establishes a bound on the distance between the output of Algorithm 2 and non-private regularized ERM. We note that this bound holds when Algorithm 2 has $\epsilon'_p > 0$, that is, when $\Delta = 0$. Ensuring that $\Delta = 0$ requires an additional condition on $n$, which is stated in Theorem 4.

Lemma 5. Let $\epsilon'_p > 0$. Let $f^* = \operatorname{argmin}_f J(f, \mathcal{D})$, and let $f_{\mathrm{priv}}$ be the classifier output by Algorithm 2. If $N(\cdot)$ is $1$-strongly convex and globally differentiable, and if $\ell$ is convex and differentiable at all points with $|\ell'(z)| \le 1$ for all $z$, then
\[
P_b\left( J(f_{\mathrm{priv}}, \mathcal{D}) \le J(f^*, \mathcal{D}) + \frac{4 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} \right) \ge 1 - \delta, \tag{54}
\]
where the probability is taken over the randomness in the noise $b$ of Algorithm 2.

Proof.
By the assumption $\epsilon'_p > 0$, the classifier $f_{\mathrm{priv}}$ minimizes the objective function $J(f, \mathcal{D}) + \frac{1}{n} b^T f$, and therefore
\[
J(f_{\mathrm{priv}}, \mathcal{D}) \le J(f^*, \mathcal{D}) + \frac{1}{n} b^T (f^* - f_{\mathrm{priv}}). \tag{55}
\]
First, we bound $\|f^* - f_{\mathrm{priv}}\|$. Recall that $\Lambda N(\cdot)$ is $\Lambda$-strongly convex and globally differentiable, and $\ell$ is convex and differentiable. We can therefore apply Lemma 1 with $G(f) = J(f, \mathcal{D})$ and $g(f) = \frac{1}{n} b^T f$ to obtain the bound
\[
\|f^* - f_{\mathrm{priv}}\| \le \frac{1}{\Lambda} \left\| \nabla\left( \frac{1}{n} b^T f \right) \right\| \le \frac{\|b\|}{n\Lambda}. \tag{56}
\]
Therefore, by the Cauchy-Schwartz inequality,
\[
J(f_{\mathrm{priv}}, \mathcal{D}) - J(f^*, \mathcal{D}) \le \frac{\|b\|^2}{n^2 \Lambda}. \tag{57}
\]
Since $\|b\|$ is drawn from a $\Gamma(d, \frac{2}{\epsilon_p})$ distribution, from Lemma 4, with probability $1 - \delta$, $\|b\| \le \frac{2d \log(d/\delta)}{\epsilon_p}$. The lemma follows by plugging this into the previous equation.

4.3 Applications

In this section, we examine the sample requirement of privacy-preserving regularized logistic regression and support vector machines. Recall that in both these cases, $N(f) = \frac{1}{2}\|f\|^2$.

Corollary 5 (Logistic Regression). Let training data $\mathcal{D}$ be generated i.i.d. according to a distribution $P$ and let $f_0$ be a classifier with expected loss $L(f_0) = L^*$. Let the loss function be $\ell = \ell_{\mathrm{LR}}$ defined in Section 3.4.1. Then the following two statements hold:

1. There exists a $C_1$ such that if
\[
n > C_1 \max\left( \frac{\|f_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|^2}{\epsilon_g^{3/2} \epsilon_p} \right), \tag{58}
\]
then the output $f_{\mathrm{priv}}$ of Algorithm 1 satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - \delta. \tag{59}
\]

2. There exists a $C_2$ such that if
\[
n > C_2 \max\left( \frac{\|f_0\|^2 \log(1/\delta)}{\epsilon_g^2}, \; \frac{\|f_0\|^2}{\epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p} \right), \tag{60}
\]
then the output $f_{\mathrm{priv}}$ of Algorithm 2 with $c = \frac{1}{4}$ satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - \delta. \tag{61}
\]

Proof.
Since $\ell_{\mathrm{LR}}$ is convex and doubly differentiable, for any $z_1, z_2$,
\[
\ell'_{\mathrm{LR}}(z_1) - \ell'_{\mathrm{LR}}(z_2) = \ell''_{\mathrm{LR}}(z^*)(z_1 - z_2) \tag{62}
\]
for some $z^* \in [z_1, z_2]$. Moreover, $|\ell''_{\mathrm{LR}}(z^*)| \le c = \frac{1}{4}$, so $\ell'_{\mathrm{LR}}$ is $\frac{1}{4}$-Lipschitz. The corollary now follows from Theorems 3 and 4.

For SVMs we state results with $\ell = \ell_{\mathrm{Huber}}$, but a similar bound can be shown for $\ell_s$ as well.

Corollary 6 (Huber Support Vector Machines). Let training data $\mathcal{D}$ be generated i.i.d. according to a distribution $P$ and let $f_0$ be a classifier with expected loss $L(f_0) = L^*$. Let the loss function be $\ell = \ell_{\mathrm{Huber}}$ defined in (23). Then the following two statements hold:

1. There exists a $C_1$ such that if
\[
n > C_1 \max\left( \frac{\|f_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|^2}{h^{1/2} \epsilon_g^{3/2} \epsilon_p} \right), \tag{63}
\]
then the output $f_{\mathrm{priv}}$ of Algorithm 1 satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - \delta. \tag{64}
\]

2. There exists a $C_2$ such that if
\[
n > C_2 \max\left( \frac{\|f_0\|^2 \log(1/\delta)}{\epsilon_g^2}, \; \frac{\|f_0\|^2}{h \epsilon_g \epsilon_p}, \; \frac{d \log(\frac{d}{\delta}) \|f_0\|}{\epsilon_g \epsilon_p} \right), \tag{65}
\]
then the output $f_{\mathrm{priv}}$ of Algorithm 2 with $c = \frac{1}{2h}$ satisfies
\[
P(L(f_{\mathrm{priv}}) \le L^* + \epsilon_g) \ge 1 - \delta. \tag{66}
\]

Proof. The Huber loss is convex and differentiable with a continuous derivative. Moreover, since the derivative of the Huber loss is piecewise linear with slope $0$ or at most $\frac{1}{2h}$, for any $z_1, z_2$,
\[
|\ell'_{\mathrm{Huber}}(z_1) - \ell'_{\mathrm{Huber}}(z_2)| \le \frac{1}{2h}|z_1 - z_2|, \tag{67}
\]
so $\ell'_{\mathrm{Huber}}$ is $\frac{1}{2h}$-Lipschitz. The first part of the corollary follows from Theorem 3. For the second part, we observe from Corollary 4 that we do not need $\ell$ to be globally doubly differentiable, and that the bound on $|\ell''(z)|$ in Theorem 4 is only needed to ensure that $\epsilon'_p > 0$; since $\ell_{\mathrm{Huber}}$ is doubly differentiable except on a set of Lebesgue measure $0$, with $|\ell''_{\mathrm{Huber}}(z)| \le \frac{1}{2h}$, the corollary follows by an application of Theorem 4.
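For concreteness, the Huber loss (23) and its derivative can be written out directly; the two properties used in the proof above ($|\ell'_{\mathrm{Huber}}| \le 1$ and a $\frac{1}{2h}$-Lipschitz derivative) are then easy to check numerically. A numpy sketch (function names ours):

```python
import numpy as np

def huber_hinge(z, h):
    """Huber approximation to the hinge loss, eq. (23)."""
    z = np.asarray(z, dtype=float)
    mid = (1 + h - z) ** 2 / (4 * h)
    return np.where(z > 1 + h, 0.0, np.where(z < 1 - h, 1.0 - z, mid))

def huber_hinge_d(z, h):
    """First derivative: 0, -(1 + h - z)/(2h), or -1 on the three pieces."""
    z = np.asarray(z, dtype=float)
    mid = -(1 + h - z) / (2 * h)
    return np.where(z > 1 + h, 0.0, np.where(z < 1 - h, -1.0, mid))
```

The derivative is continuous at both breakpoints ($0$ at $z = 1 + h$ and $-1$ at $z = 1 - h$), and a finite-difference scan of `huber_hinge_d` confirms that its slope never exceeds $\frac{1}{2h}$.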
5 Kernel methods

A powerful methodology in learning problems is the "kernel trick," which allows the efficient construction of a predictor $f$ that lies in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated to a positive definite kernel function $k(\cdot, \cdot)$. The representer theorem (Kimeldorf and Wahba, 1970) shows that the regularized empirical risk in (1) is minimized by a function $f(x)$ that is given by a linear combination of kernel functions centered at the data points:
\[
f^*(x) = \sum_{i=1}^{n} a_i k(x(i), x). \tag{68}
\]
This elegant result is important for both theoretical and computational reasons. Computationally, one releases the values $a_i$ corresponding to the $f$ that minimizes the empirical risk, along with the data points $x(i)$; the user classifies a new $x$ by evaluating the function in (68).

A crucial difficulty in terms of privacy is that this directly releases the private values $x(i)$ of some individuals in the training set. Thus, even if the classifier is computed in a privacy-preserving way, any classifier released by this process requires revealing the data. We provide an algorithm that avoids this problem, using an approximation method (Rahimi and Recht, 2007, 2008b) that approximates the kernel function using random projections.

5.1 Mathematical preliminaries

Our approach works for kernel functions which are translation invariant, so that $k(x, x') = k(x - x')$. The key idea in the random projection method comes from Bochner's Theorem, which states that a continuous translation-invariant kernel is positive definite if and only if it is the Fourier transform of a nonnegative measure. This means that the Fourier transform $K(\theta)$ of a translation-invariant kernel function $k(t)$ can be normalized so that $\bar{K}(\theta) = K(\theta)/\|K(\theta)\|_1$ is a probability measure on the transform space $\Theta$.
We will assume $\bar{K}(\theta)$ is uniformly bounded over $\theta$. In this representation,
\[
k(x, x') = \int_{\Theta} \phi(x; \theta) \phi(x'; \theta) \bar{K}(\theta) \, d\theta, \tag{69}
\]
where we will assume the feature functions $\phi(x; \theta)$ are bounded:
\[
|\phi(x; \theta)| \le \zeta \qquad \forall x \in \mathcal{X}, \; \forall \theta \in \Theta. \tag{70}
\]
A function $f \in \mathcal{H}$ can be written as
\[
f(x) = \int_{\Theta} a(\theta) \phi(x; \theta) \bar{K}(\theta) \, d\theta. \tag{71}
\]
To prove our generalization bounds we must show that bounded classifiers $f$ induce bounded functions $a(\theta)$. Writing the evaluation functional as an inner product with $k(x, x')$ and using (69) shows
\[
f(x) = \int_{\Theta} \left( \int_{\mathcal{X}} f(x') \phi(x'; \theta) \, dx' \right) \phi(x; \theta) \bar{K}(\theta) \, d\theta. \tag{72}
\]
Thus we have
\begin{align*}
a(\theta) &= \int_{\mathcal{X}} f(x') \phi(x'; \theta) \, dx', \tag{73} \\
|a(\theta)| &\le \mathrm{Vol}(\mathcal{X}) \cdot \zeta \cdot \|f\|_{\infty}. \tag{74}
\end{align*}
This shows that $a(\theta)$ is bounded uniformly over $\Theta$ when $f(x)$ is bounded uniformly over $\mathcal{X}$. The volume of the unit ball is $\mathrm{Vol}(\mathcal{X}) = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}$ (see Ball (1997) for more details). For large $d$ this is $\left( \sqrt{\frac{2\pi e}{d}} \right)^d$ by Stirling's formula. Furthermore, we have
\[
\|f\|_{\mathcal{H}}^2 = \int_{\Theta} a(\theta)^2 \bar{K}(\theta) \, d\theta. \tag{75}
\]

5.2 A reduction to the linear case

We now describe how to apply Algorithms 1 and 2 to classification with kernels, by transforming to linear classification. Given $\{\theta_j\}$, let $R : \mathcal{X} \to \mathbb{R}^D$ be the map that sends $x(i)$ to a vector $v(i) \in \mathbb{R}^D$, where $v_j(i) = \phi(x(i); \theta_j)$ for $j \in [D]$. We then use Algorithm 1 or Algorithm 2 to compute a privacy-preserving linear classifier $\tilde{f}$ in $\mathbb{R}^D$. The algorithm releases $R$ and $\tilde{f}$. The overall classifier is $f_{\mathrm{priv}}(x) = \tilde{f}(R(x))$.

Algorithm 3 Private ERM for nonlinear kernels
Inputs: Data $\{(x_i, y_i) : i \in [n]\}$, positive definite kernel function $k(\cdot, \cdot)$, sampling function $\bar{K}(\theta)$, parameters $\epsilon_p$, $\Lambda$, $D$.
Outputs: Predictor $f_{\mathrm{priv}}$ and pre-filter $\{\theta_j : j \in [D]\}$.
Draw $\{\theta_j : j = 1, 2, \ldots, D\}$ i.i.d. according to $\bar{K}(\theta)$.
Set $\mathbf{v}(i) = \sqrt{2/D}\, [\phi(\mathbf{x}(i); \theta_1) \cdots \phi(\mathbf{x}(i); \theta_D)]^T$ for each $i$.
Run Algorithm 1 or Algorithm 2 with data $\{(\mathbf{v}(i), y(i))\}$ and parameters $\epsilon_p$, $\Lambda$.

As an example, consider the Gaussian kernel

  $k(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\|_2^2 \right)$.   (76)

The Fourier transform of a Gaussian is a Gaussian, so we can sample $\theta_j = (\boldsymbol{\omega}, \psi)$ according to the distribution $\mathrm{Uniform}[-\pi, \pi] \times \mathcal{N}(0, 2\gamma I_d)$ and compute $v_j = \cos(\boldsymbol{\omega}^T \mathbf{x} + \psi)$. The random phase is used to produce a real-valued mapping. The paper of Rahimi and Recht (2008a) has more examples of transforms for other kernel functions.

5.3 Privacy guarantees

Because the workhorse of Algorithm 3 is a differentially private version of ERM for linear classifiers (either Algorithm 1 or Algorithm 2), and the points $\{\theta_j : j \in [D]\}$ are independent of the data, the privacy guarantees for Algorithm 3 follow directly from Theorems 1 and 2.

Theorem 5. Given data $\{(\mathbf{x}(i), y(i)) : i = 1, 2, \ldots, n\}$ with $\|\mathbf{x}(i)\| \le 1$, the outputs $(f_{\mathrm{priv}}, \{\theta_j : j \in [D]\})$ of Algorithm 3 guarantee $\epsilon_p$-differential privacy.

The proof follows immediately by combining Theorems 1 and 2 with the fact that the $\theta_j$'s are drawn independently of the input dataset.

5.4 Generalization performance

We now turn to generalization bounds for Algorithm 3. We will prove results using objective perturbation (Algorithm 2) within Algorithm 3, but analogous results for output perturbation (Algorithm 1) are simple to prove. Our comparisons will be against arbitrary predictors $f_0$ whose norm is bounded in some sense. That is, given an $f_0$ with certain properties, we will choose the regularization parameter $\Lambda$, dimension $D$, and number of samples $n$ so that the predictor $f_{\mathrm{priv}}$ has expected loss close to that of $f_0$.
In this section we will assume $N(f) = \frac{1}{2}\|f\|^2$, so that $N(\cdot)$ is 1-strongly convex, and that the loss function $\ell$ is convex and differentiable with $|\ell'(z)| \le 1$ for all $z$. Our first generalization result is the simplest, since it assumes a strong condition that gives easy guarantees on the projections. We would like the predictor produced by Algorithm 3 to be competitive against an $f_0$ such that

  $f_0(\mathbf{x}) = \int_\Theta a_0(\theta)\, \phi(\mathbf{x}; \theta)\, \bar{K}(\theta)\, d\theta$,   (77)

with $|a_0(\theta)| \le C$ (see Rahimi and Recht (2008b)). Our first result provides the technical building block for our other generalization results. The proof makes use of ideas from Rahimi and Recht (2008b) and techniques from Sridharan et al. (2008) and Shalev-Shwartz and Srebro (2008).

Lemma 6. Let $f_0$ be a predictor such that $|a_0(\theta)| \le C$ for all $\theta$, where $a_0(\theta)$ is given by (77), and suppose $L(f_0) = L^*$. Moreover, suppose that $\ell'(\cdot)$ is $c$-Lipschitz. If the data $\mathcal{D}$ is drawn i.i.d. according to $P$, then there exists a constant $C_0$ such that if

  $n > C_0 \cdot \max\left( \frac{C^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \cdot \log\frac{C \log(1/\delta)}{\epsilon_g \delta},\ \frac{c\,\epsilon_g}{\epsilon_p \log(1/\delta)} \right)$,   (78)

then $\Lambda$ and $D$ can be chosen such that the output $f_{\mathrm{priv}}$ of Algorithm 3 using Algorithm 2 satisfies

  $P(L(f_{\mathrm{priv}}) - L^* \le \epsilon_g) \ge 1 - 4\delta$.   (79)

Proof. Since $|a_0(\theta)| \le C$ and $\bar{K}(\theta)$ is bounded, we have (Rahimi and Recht, 2008b, Theorem 1) that with probability $1 - 2\delta$ there exists an $f_p \in \mathbb{R}^D$ such that

  $L(f_p) \le L(f_0) + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right)$.   (80)

We will choose $D$ to make this loss small. Furthermore, $f_p$ is guaranteed to have $\|f_p\|_\infty \le C/D$, so

  $\|f_p\|_2^2 \le \frac{C^2}{D}$.   (81)

Now, given such an $f_p$, we must show that $f_{\mathrm{priv}}$ will have true risk close to that of $f_p$ as long as there are enough data points. This can be shown using the techniques in Shalev-Shwartz and Srebro (2008).
Let

  $\bar{J}(f) = L(f) + \frac{\Lambda}{2}\|f\|_2^2$,

and let $f_{\mathrm{rtr}} = \mathrm{argmin}_{f \in \mathbb{R}^D}\ \bar{J}(f)$ minimize the regularized true risk. Then

  $\bar{J}(f_{\mathrm{priv}}) = \bar{J}(f_p) + (\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}})) + (\bar{J}(f_{\mathrm{rtr}}) - \bar{J}(f_p))$.

Since $\bar{J}(\cdot)$ is minimized by $f_{\mathrm{rtr}}$, the last term is negative and we can disregard it. Then we have

  $L(f_{\mathrm{priv}}) - L(f_p) \le (\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}})) + \frac{\Lambda}{2}\|f_p\|_2^2 - \frac{\Lambda}{2}\|f_{\mathrm{priv}}\|_2^2$.   (82)

From Lemma 5, with probability at least $1 - \delta$ over the noise $\mathbf{b}$,

  $J(f_{\mathrm{priv}}) - J\left( \mathrm{argmin}_f J(f) \right) \le \frac{4 D^2 \log^2(D/\delta)}{\Lambda n^2 \epsilon_p^2}$.   (83)

Now, using (Sridharan et al., 2008, Corollary 2), we can bound the term $\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}})$ by twice the gap in the regularized empirical risk in (83) plus an additional term. That is, with probability $1 - \delta$:

  $\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}}) \le 2(J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}})) + O\left( \frac{\log(1/\delta)}{\Lambda n} \right)$.   (84)

If we set $n > \frac{c}{4 \epsilon_p \Lambda}$, then $\epsilon_p' > 0$, and we can plug Lemma 5 into (84) to obtain:

  $\bar{J}(f_{\mathrm{priv}}) - \bar{J}(f_{\mathrm{rtr}}) \le \frac{8 D^2 \log^2(D/\delta)}{\Lambda n^2 \epsilon_p^2} + O\left( \frac{\log(1/\delta)}{\Lambda n} \right)$.   (85)

Plugging (85) into (82), discarding the negative term involving $\|f_{\mathrm{priv}}\|_2^2$, and setting $\Lambda = \epsilon_g / \|f_p\|^2$ gives

  $L(f_{\mathrm{priv}}) - L(f_p) \le \frac{8 \|f_p\|_2^2 D^2 \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{\|f_p\|_2^2 \log(1/\delta)}{n \epsilon_g} \right) + \frac{\epsilon_g}{2}$.   (86)

Now we have, using (80) and (86), that with probability $1 - 4\delta$:

  $L(f_{\mathrm{priv}}) - L(f_0) \le (L(f_{\mathrm{priv}}) - L(f_p)) + (L(f_p) - L(f_0))$
  $\le \frac{8 \|f_p\|_2^2 D^2 \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{\|f_p\|_2^2 \log(1/\delta)}{n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right)$.

Substituting (81), we have

  $L(f_{\mathrm{priv}}) - L(f_0) \le \frac{8 C^2 D \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{C^2 \log(1/\delta)}{D n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right)$.
To set the remaining parameters, we will choose $D < n$, so that

  $L(f_{\mathrm{priv}}) - L(f_0) \le \frac{8 C^2 D \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{C^2 \log(1/\delta)}{D n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \frac{C \sqrt{\log(1/\delta)}}{\sqrt{D}} \right)$.

We set $D = O(C^2 \log(1/\delta) / \epsilon_g^2)$ to make the last term $\epsilon_g / 6$, and:

  $L(f_{\mathrm{priv}}) - L(f_0) \le O\left( \frac{C^4 \log\frac{1}{\delta}\, \log^2\frac{C^2 \log(1/\delta)}{\epsilon_g^2 \delta}}{n^2 \epsilon_p^2 \epsilon_g^3} \right) + O\left( \frac{\epsilon_g}{n} \right) + \frac{2\epsilon_g}{3}$.

Setting $n$ as in (78) proves the result. Moreover, setting $n > \frac{c \|f_p\|^2}{4 \epsilon_p \epsilon_g} = C_0 \cdot \frac{c\,\epsilon_g}{\epsilon_p \log(1/\delta)}$ ensures that $n > \frac{c}{4 \Lambda \epsilon_p}$.

We can adapt the proof procedure to show that Algorithm 3 is competitive against any classifier $f_0$ with a given bound on $\|f_0\|_\infty$. It can be shown that for some constant $\zeta$ we have $|a_0(\theta)| \le \mathrm{Vol}(\mathcal{X})\, \zeta\, \|f_0\|_\infty$. We can then set this as $C$ in (78) to obtain the following result.

Theorem 6. Let $f_0$ be a classifier with norm $\|f_0\|_\infty$, and let $\ell'(\cdot)$ be $c$-Lipschitz. Then for any distribution $P$, there exists a constant $C_0$ such that if

  $n > C_0 \cdot \max\left( \frac{\|f_0\|_\infty^2\, \zeta^2\, (\mathrm{Vol}(\mathcal{X}))^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \cdot \log\frac{\|f_0\|_\infty\, \mathrm{Vol}(\mathcal{X})\, \zeta \log(1/\delta)}{\epsilon_g \delta},\ \frac{c\,\epsilon_g}{\epsilon_p \log(1/\delta)} \right)$,   (87)

then $\Lambda$ and $D$ can be chosen such that the output $f_{\mathrm{priv}}$ of Algorithm 3 with Algorithm 2 satisfies $P(L(f_{\mathrm{priv}}) - L(f_0) \le \epsilon_g) \ge 1 - 4\delta$.

Proof. Substituting $C = \mathrm{Vol}(\mathcal{X})\, \zeta\, \|f_0\|_\infty$ in Lemma 6 gives the result.

We can also derive a generalization result with respect to classifiers with bounded $\|f_0\|_{\mathcal{H}}$.

Theorem 7. Let $f_0$ be a classifier with norm $\|f_0\|_{\mathcal{H}}$, and let $\ell'$ be $c$-Lipschitz. Then for any distribution $P$, there exists a constant $C_0$ such that if

  $n = C_0 \cdot \max\left( \frac{\|f_0\|_{\mathcal{H}}^4\, \zeta^2\, (\mathrm{Vol}(\mathcal{X}))^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^4} \cdot \log\frac{\|f_0\|_{\mathcal{H}}\, \mathrm{Vol}(\mathcal{X})\, \zeta \log(1/\delta)}{\epsilon_g \delta},\ \frac{c\,\epsilon_g}{\epsilon_p \log(1/\delta)} \right)$,   (88)

then $\Lambda$ and $D$ can be chosen such that the output of Algorithm 3 run with Algorithm 2 satisfies $P(L(f_{\mathrm{priv}}) - L(f_0) \le \epsilon_g) \ge 1 - 4\delta$.

Proof.
Let $f_0$ be a classifier with norm $\|f_0\|_{\mathcal{H}}$ and expected loss $L(f_0)$. Now consider

  $f_{\mathrm{rtr}} = \mathrm{argmin}_f\ L(f) + \frac{\Lambda_{\mathrm{rtr}}}{2}\|f\|_{\mathcal{H}}^2$,

for some $\Lambda_{\mathrm{rtr}}$ to be specified later. We will first need a bound on $\|f_{\mathrm{rtr}}\|_\infty$ in order to use our previous sample complexity results. Since $f_{\mathrm{rtr}}$ is a minimizer, we can take the derivative of the regularized expected loss and set it to 0 to get:

  $f_{\mathrm{rtr}}(\mathbf{x}') = -\frac{1}{\Lambda_{\mathrm{rtr}}} \frac{\partial}{\partial f(\mathbf{x}')} \int_{\mathcal{X}} \ell(f(\mathbf{x}), y)\, dP(\mathbf{x}, y) = -\frac{1}{\Lambda_{\mathrm{rtr}}} \int_{\mathcal{X}} \frac{\partial\, \ell(f(\mathbf{x}), y)}{\partial f(\mathbf{x})} \cdot \frac{\partial f(\mathbf{x})}{\partial f(\mathbf{x}')}\, dP(\mathbf{x}, y)$,

where $P(\mathbf{x}, y)$ is a distribution on pairs $(\mathbf{x}, y)$. Now, using the representer theorem, $\frac{\partial f(\mathbf{x})}{\partial f(\mathbf{x}')} = k(\mathbf{x}', \mathbf{x})$. Since the kernel function is bounded and the derivative of the loss is always upper bounded by 1, the integrand can be upper bounded by a constant. Since $P(\mathbf{x}, y)$ is a probability distribution, we have for all $\mathbf{x}'$ that $|f_{\mathrm{rtr}}(\mathbf{x}')| = O(1/\Lambda_{\mathrm{rtr}})$. Now we set $\Lambda_{\mathrm{rtr}} = \epsilon_g / \|f_0\|_{\mathcal{H}}^2$ to get

  $\|f_{\mathrm{rtr}}\|_\infty = O\left( \frac{\|f_0\|_{\mathcal{H}}^2}{\epsilon_g} \right)$.

We now have two cases to consider, depending on whether $L(f_0) < L(f_{\mathrm{rtr}})$ or $L(f_0) > L(f_{\mathrm{rtr}})$.

Case 1: Suppose that $L(f_0) < L(f_{\mathrm{rtr}})$. Then by the definition of $f_{\mathrm{rtr}}$,

  $L(f_{\mathrm{rtr}}) + \frac{\epsilon_g}{2} \cdot \frac{\|f_{\mathrm{rtr}}\|_{\mathcal{H}}^2}{\|f_0\|_{\mathcal{H}}^2} \le L(f_0) + \frac{\epsilon_g}{2}$.

Since $\frac{\epsilon_g}{2} \cdot \frac{\|f_{\mathrm{rtr}}\|_{\mathcal{H}}^2}{\|f_0\|_{\mathcal{H}}^2} \ge 0$, we have $L(f_{\mathrm{rtr}}) - L(f_0) \le \frac{\epsilon_g}{2}$.

Case 2: Suppose that $L(f_0) > L(f_{\mathrm{rtr}})$. Then the regularized classifier has better generalization performance than the original, so we have trivially that $L(f_{\mathrm{rtr}}) - L(f_0) \le \frac{\epsilon_g}{2}$.

Therefore in both cases we have a bound on $\|f_{\mathrm{rtr}}\|_\infty$ and a generalization gap of $\epsilon_g/2$. We can now apply Theorem 6, with $\|f_{\mathrm{rtr}}\|_\infty$ playing the role of $\|f_0\|_\infty$ in (87), to show that for $n$ satisfying (88) we have $P(L(f_{\mathrm{priv}}) - L(f_0) \le \epsilon_g) \ge 1 - 4\delta$.
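Before turning to parameter tuning, the random-feature reduction of Section 5.2 is easy to implement. Below is a minimal NumPy sketch of the pre-filter and feature map for the Gaussian kernel of (76); the function name `rff_map` and its interface are our own, not from the paper, and a full implementation would pass the resulting features to a private linear ERM solver as in Algorithm 3.

```python
import numpy as np

def rff_map(X, D, gamma, rng=None):
    """Map an n x d data matrix to the D random features used in
    Algorithm 3, for the Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2).
    Following the text, each theta_j = (omega_j, psi_j) is drawn with
    omega_j ~ N(0, 2*gamma*I_d) and psi_j ~ Uniform[-pi, pi], and
    v_j(x) = sqrt(2/D) * cos(omega_j^T x + psi_j)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))
    psi = rng.uniform(-np.pi, np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega.T + psi)

# Inner products of the mapped points approximate the kernel:
X = np.array([[0.0, 0.0], [1.0, 0.0]])
V = rff_map(X, D=20000, gamma=0.5, rng=0)
# V[0] @ V[1] is close to k(x, x') = exp(-0.5)
```

Since the $\theta_j$ are drawn independently of the data, releasing the sampled $\boldsymbol{\omega}$ and $\psi$ alongside the private linear classifier costs no additional privacy, which is exactly why Theorem 5 follows from the linear-case guarantees.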
6 Parameter tuning

The privacy-preserving learning algorithms presented so far in this paper assume that the regularization constant $\Lambda$ is provided as an input, and is independent of the data. In actual applications of ERM, $\Lambda$ is selected based on the data itself. In this section, we address this issue: how to design an ERM algorithm with end-to-end privacy, which selects $\Lambda$ based on the data itself. Our solution is to present a privacy-preserving parameter tuning technique that is applicable to general machine learning algorithms, beyond ERM.

In practice, one typically tunes parameters (such as the regularization parameter $\Lambda$) as follows: using data held out for validation, train predictors $f(\cdot; \Lambda)$ for multiple values of $\Lambda$, and select the one which provides the best empirical performance. However, even though the output of an algorithm preserves $\epsilon_p$-differential privacy for a fixed $\Lambda$ (as is the case with Algorithms 1 and 2), choosing a $\Lambda$ based on empirical performance on a validation set may violate the $\epsilon_p$-differential privacy guarantee. That is, if the procedure that picks $\Lambda$ is not private, then an adversary may use the released classifier to infer the value of $\Lambda$, and therefore something about the values in the database.

We suggest two ways of resolving this issue. First, if we have access to a smaller, publicly available dataset from the same distribution, then we can use it as a holdout set to tune $\Lambda$. This $\Lambda$ can subsequently be used to train a classifier on the private data. Since the value of $\Lambda$ does not depend on the values in the private dataset, this procedure still preserves the privacy of individuals in the private data. If no such public data is available, then we need a differentially private tuning procedure. We provide such a procedure below.
The main idea is to train for different values of $\Lambda$ on separate subsets of the training dataset, so that the total training procedure still maintains $\epsilon_p$-differential privacy. We score each of these predictors on a validation set, and choose a $\Lambda$ (and hence $f(\cdot; \Lambda)$) using a randomized privacy-preserving comparison procedure (McSherry and Talwar, 2007). The last step is needed to guarantee $\epsilon_p$-differential privacy for individuals in the validation set. This final algorithm provides an end-to-end guarantee of differential privacy, and renders our privacy-preserving ERM procedure complete. We observe that both these procedures can be used for tuning multiple parameters as well.

6.1 Tuning algorithm

Algorithm 4: Privacy-preserving parameter tuning
Inputs: Database $\mathcal{D}$, parameters $\{\Lambda_1, \ldots, \Lambda_m\}$, $\epsilon_p$.
Output: Predictor $f_{\mathrm{priv}}$.
Divide $\mathcal{D}$ into $m + 1$ equal portions $\mathcal{D}_1, \ldots, \mathcal{D}_{m+1}$, each of size $\frac{|\mathcal{D}|}{m+1}$.
For each $i = 1, 2, \ldots, m$, apply a privacy-preserving learning algorithm (e.g., Algorithm 1, 2, or 3) on $\mathcal{D}_i$ with parameters $\Lambda_i$ and $\epsilon_p$ to get output $f_i$.
Evaluate $z_i$, the number of mistakes made by $f_i$ on $\mathcal{D}_{m+1}$.
Set $f_{\mathrm{priv}} = f_i$ with probability

  $q_i = \frac{e^{-\epsilon_p z_i / 2}}{\sum_{j=1}^m e^{-\epsilon_p z_j / 2}}$.   (89)

We note that the list of potential $\Lambda$ values input to this procedure should not be a function of the private dataset. It can be shown that, provided $|\mathcal{D}|$ is large enough, the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by this procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$ on $\mathcal{D}_{m+1}$.

6.2 Privacy and utility

Theorem 8. The output of the tuning procedure of Algorithm 4 is $\epsilon_p$-differentially private.

Proof. To show that Algorithm 4 preserves $\epsilon_p$-differential privacy, we first consider an alternative procedure $\mathcal{M}$. Let $\mathcal{M}$ be the procedure that releases the values $(f_1, \ldots, f_m, i)$, where $f_1, \ldots, f_m$ are the intermediate values computed in the second step of Algorithm 4, and $i$ is the index selected by the exponential mechanism step.

We first show that $\mathcal{M}$ preserves $\epsilon_p$-differential privacy. Let $\mathcal{D}$ and $\mathcal{D}'$ be two datasets that differ in the value of one individual, so that $\mathcal{D} = \bar{\mathcal{D}} \cup \{(\mathbf{x}, y)\}$ and $\mathcal{D}' = \bar{\mathcal{D}} \cup \{(\mathbf{x}', y')\}$. Recall that the datasets $\mathcal{D}_1, \ldots, \mathcal{D}_{m+1}$ are disjoint; moreover, the sources of randomness in the privacy mechanisms are independent. Therefore,

  $P(f_1 \in S_1, \ldots, f_m \in S_m, i = i^* \mid \mathcal{D})$
  $= \int_{S_1 \times \cdots \times S_m} P(i = i^* \mid f_1, \ldots, f_m, \mathcal{D}_{m+1})\, \mu(f_1, \ldots, f_m \mid \mathcal{D})\, df_1 \cdots df_m$
  $= \int_{S_1 \times \cdots \times S_m} P(i = i^* \mid f_1, \ldots, f_m, \mathcal{D}_{m+1}) \prod_{j=1}^m \mu_j(f_j \mid \mathcal{D}_j)\, df_1 \cdots df_m$,   (90)

where $\mu_j(f)$ is the density at $f$ induced by the classifier run with parameter $\Lambda_j$, and $\mu(f_1, \ldots, f_m)$ is the joint density at $f_1, \ldots, f_m$ induced by $\mathcal{M}$.

Now suppose that $(\mathbf{x}, y) \in \mathcal{D}_{m+1}$. Then $\mathcal{D}_k = \mathcal{D}_k'$ and $\mu_k(f_k \mid \mathcal{D}_k) = \mu_k(f_k \mid \mathcal{D}_k')$ for $k \in [m]$. Moreover, given any fixed set $f_1, \ldots, f_m$,

  $P(i = i^* \mid \mathcal{D}_{m+1}', f_1, \ldots, f_m) \le e^{\epsilon_p}\, P(i = i^* \mid \mathcal{D}_{m+1}, f_1, \ldots, f_m)$.   (91)

Instead, if $(\mathbf{x}, y) \in \mathcal{D}_j$ for some $j \in [m]$, then $\mathcal{D}_k = \mathcal{D}_k'$ for $k \in [m+1]$, $k \ne j$. Thus, for fixed $f_1, \ldots, f_m$,

  $P(i = i^* \mid \mathcal{D}_{m+1}', f_1, \ldots, f_m) = P(i = i^* \mid \mathcal{D}_{m+1}, f_1, \ldots, f_m)$,   (92)
  $\mu_j(f_j \mid \mathcal{D}_j) \le e^{\epsilon_p}\, \mu_j(f_j \mid \mathcal{D}_j')$.   (93)

The claim for $\mathcal{M}$ follows by combining (90)-(93). Now, an adversary who has access to the output of $\mathcal{M}$ can compute the output of Algorithm 4 itself, without any further access to the dataset. Therefore, by a simulatability argument, as in Dwork et al. (2006b), Algorithm 4 also preserves $\epsilon_p$-differential privacy.
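For concreteness, the tuning procedure of Algorithm 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `train_private(X, y, lam, eps_p)` is a hypothetical stand-in for any $\epsilon_p$-differentially private learner (such as Algorithm 1 or 2) that returns a predictor mapping a data matrix to real-valued scores.

```python
import numpy as np

def tune_private(X, y, lambdas, eps_p, train_private, rng=None):
    """Sketch of Algorithm 4: split the data into m+1 disjoint portions,
    train one candidate per Lambda_i on portion i, count each candidate's
    mistakes z_i on the held-out portion, and select one candidate via the
    exponential mechanism with probabilities proportional to
    exp(-eps_p * z_i / 2), as in (89)."""
    rng = np.random.default_rng(rng)
    m = len(lambdas)
    parts = np.array_split(np.arange(len(y)), m + 1)
    candidates = [train_private(X[p], y[p], lam, eps_p)
                  for p, lam in zip(parts[:-1], lambdas)]
    holdout, y_hold = X[parts[-1]], y[parts[-1]]
    z = np.array([np.sum(np.sign(f(holdout)) != y_hold) for f in candidates],
                 dtype=float)
    # Shifting by z.min() rescales numerator and denominator identically,
    # so the selection probabilities match (89) while avoiding underflow.
    w = np.exp(-eps_p * (z - z.min()) / 2.0)
    return candidates[rng.choice(m, p=w / w.sum())]
```

Note how the sketch mirrors the privacy argument above: each candidate touches only its own disjoint portion, and the held-out portion is touched only through the exponential-mechanism selection.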
In the theorem above, we assume that the individual algorithms for privacy-preserving classification satisfy Definition 2; a similar theorem can also be shown when they satisfy a guarantee as in Corollary 4. The following theorem shows that the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by the tuning procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$. The proof of this theorem follows from Lemma 7 of McSherry and Talwar (2007).

Theorem 9. Let $z_{\min} = \min_i z_i$, and let $z$ be the number of mistakes made on $\mathcal{D}_{m+1}$ by the classifier output by our tuning procedure. Then, with probability $1 - \delta$,

  $z \le z_{\min} + \frac{2 \log(m/\delta)}{\epsilon_p}$.   (94)

Proof. In the notation of McSherry and Talwar (2007), $z_{\min} = \mathrm{OPT}$, the base measure $\mu$ is uniform on $[m]$, and $S_t = \{i : z_i < z_{\min} + t\}$. Their Lemma 7 shows that

  $P\left( \bar{S}_{2t} \right) \le \frac{\exp(-\epsilon_p t)}{\mu(S_t)}$,   (95)

where $\mu$ is the uniform measure on $[m]$. Using $\min \mu(S_t) = \frac{1}{m}$ to upper bound the right side, and setting the bound equal to $\delta$, we obtain

  $t = \frac{1}{\epsilon_p} \log\frac{m}{\delta}$.   (96)

From this we have

  $P\left( z \ge z_{\min} + \frac{2}{\epsilon_p} \log\frac{m}{\delta} \right) \le \delta$,   (97)

and the result follows.

7 Experiments

In this section we give experimental results for training linear classifiers with Algorithms 1 and 2 on two real datasets. Imposing privacy requirements necessarily degrades classifier performance. Our experiments show that, provided there is sufficient data, objective perturbation (Algorithm 2) typically outperforms the sensitivity method (Algorithm 1) significantly, and achieves error rates close to that of the analogous non-private ERM algorithm. We first demonstrate how the accuracy of the classification algorithms varies with $\epsilon_p$, the privacy requirement. We then show how the performance of privacy-preserving classification varies with increasing training data size.
The first dataset we consider is the Adult dataset from the UCI Machine Learning Repository (Asuncion and Newman, 2007). This moderately-sized dataset contains demographic information about approximately 47,000 individuals, and the classification task is to predict whether the annual income of an individual is below or above $50,000, based on variables such as age, sex, occupation, and education. In our experiments, the average fraction of positive labels is about 0.25; therefore, a trivial classifier that always predicts $-1$ will achieve this error rate, and only error rates below 0.25 are interesting.

The second dataset we consider is the KDDCup99 dataset (Hettich and Bay, 1999); the task here is to predict whether a network connection is a denial-of-service attack or not, based on several attributes. The dataset includes about 5,000,000 instances. For this data the average fraction of positive labels is 0.20.

To implement the convex minimization procedure, we use the convex optimization library provided by Okazaki (2009).

7.1 Preprocessing

In order to process the Adult dataset into a form amenable to classification, we removed all entries with missing values, and converted each categorical attribute to a binary vector. For example, an attribute such as (Male, Female) was converted into 2 binary features. Each column was normalized to ensure that the maximum value is 1, and then each row was normalized to ensure that the norm of any example is at most 1. After preprocessing, each example was represented by a 105-dimensional vector of norm at most 1.

For the KDDCup99 dataset, the instances were preprocessed by converting each categorical attribute to a binary vector. Each column was normalized to ensure that the maximum value is 1, and finally each row was normalized to ensure that the norm of any example is at most 1.
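The column and row normalization just described can be sketched as follows (a minimal NumPy sketch; the function name `normalize_examples` is ours, and one-hot encoding of the categorical attributes is assumed to have been done already):

```python
import numpy as np

def normalize_examples(X):
    """Scale each column so its maximum absolute value is 1, then scale
    each row so its Euclidean norm is at most 1 (rows already inside the
    unit ball are left unchanged)."""
    X = np.asarray(X, dtype=float)
    col_max = np.abs(X).max(axis=0)
    col_max[col_max == 0] = 1.0              # leave all-zero columns alone
    X = X / col_max
    row_norm = np.linalg.norm(X, axis=1)
    return X / np.maximum(row_norm, 1.0)[:, None]
```

Bounding every example inside the unit ball is what makes the earlier analyses apply, since both the sensitivity method and objective perturbation assume $\|\mathbf{x}(i)\| \le 1$.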
After preprocessing, each example was represented by a 119-dimensional vector of norm at most 1.

7.2 Privacy-Accuracy Tradeoff

In our first set of experiments, we study the tradeoff between the privacy requirement on the classifier and its classification accuracy, when the classifier is trained on data of a fixed size. The privacy requirement is quantified by the value of $\epsilon_p$; increasing $\epsilon_p$ implies a larger change in the belief of the adversary when one entry in $\mathcal{D}$ changes, and thus lower privacy. To measure accuracy, we use the classification (test) error; namely, the fraction of times the classifier predicts a label with the wrong sign.

To study the privacy-accuracy tradeoff, we compare objective perturbation with the sensitivity method for logistic regression and Huber SVM. For the Huber SVM, we picked the Huber constant $h = 0.5$, a typical value (Chapelle, 2007).[1]

[Figure 2: Privacy-accuracy tradeoff for the Adult dataset. Panels (a) and (b) plot misclassification error rate against the privacy parameter $\epsilon_p$ for regularized logistic regression and regularized SVM, respectively, comparing the sensitivity method, objective perturbation, and the non-private baseline.]

[Figure 3: Privacy-accuracy tradeoff for the KDDCup99 dataset, with the same panels and methods as Figure 2.]
For each dataset we trained classifiers for several fixed values of $\Lambda$ and tested the error of these classifiers. For each algorithm we chose the value of $\Lambda$ that minimizes the error rate at $\epsilon_p = 0.1$.[2] We then plotted the error rate against $\epsilon_p$ for the chosen value of $\Lambda$. The results are shown in Figures 2 and 3 for both logistic regression and support vector machines.[3] The optimal values of $\Lambda$ are shown in Tables 1 and 2.

For non-private logistic regression and SVM, each presented error rate is an average over 10-fold cross-validation; for the sensitivity method as well as objective perturbation, the presented error rate is an average over 10-fold cross-validation and 50 runs of the randomized training procedure. For Adult, the privacy-accuracy tradeoff is computed over the entire dataset, which consists of 45,220 examples; for KDDCup99 we use a randomly chosen subset of 70,000 examples.

[1] Chapelle (2007) recommends using $h$ between 0.01 and 0.5; we use $h = 0.5$ as we found that a higher value typically leads to more numerical stability, as well as better performance for both privacy-preserving methods.
[2] For KDDCup99 the error of the non-private algorithms did not increase with decreasing $\Lambda$.
[3] The slight kink in the SVM curve on Adult is due to a switch to the second phase of the algorithm.

Table 1: Error for different regularization parameters on Adult at $\epsilon_p = 0.1$. The best error per algorithm is marked with an asterisk.

  Λ              10^-10.0  10^-7.0  10^-4.0  10^-3.5  10^-3.0  10^-2.5  10^-2.0  10^-1.5
  Logistic
   Non-Private    0.1540   0.1533*  0.1654   0.1694   0.1758   0.1895   0.2322   0.2478
   Output         0.5318   0.5318   0.5175   0.4928   0.4310   0.3163   0.2395*  0.2456
   Objective      0.8248   0.8248   0.8248   0.2694   0.2369   0.2161*  0.2305   0.2475
  Huber
   Non-Private    0.1527   0.1521*  0.1632   0.1669   0.1719   0.1793   0.2454   0.2478
   Output         0.5318   0.5318   0.5211   0.5011   0.4464   0.3352   0.2376*  0.2476
   Objective      0.2585   0.2585   0.2585   0.2582   0.2559   0.2046*  0.2319   0.2478

Table 2: Error for different regularization parameters on KDDCup99 at $\epsilon_p = 0.1$. The best error per algorithm is marked with an asterisk.

  Λ              10^-9.0  10^-7.0  10^-5.0  10^-3.5  10^-3.0  10^-2.5  10^-2.0  10^-1.5
  Logistic
   Non-Private   0.0016*  0.0016   0.0021   0.0038   0.0037   0.0037   0.0325   0.0594
   Output        0.5245   0.5245   0.5093   0.3518   0.1114   0.0359   0.0304*  0.0678
   Objective     0.2084   0.2084   0.2084   0.0196   0.0118   0.0113*  0.0285   0.0591
  Huber
   Non-Private   0.0013*  0.0013   0.0013   0.0029   0.0051   0.0056   0.0061   0.0163
   Output        0.5245   0.5245   0.5229   0.4611   0.3353   0.0590   0.0092*  0.0179
   Objective     0.0191   0.0191   0.0191   0.1827   0.0123   0.0066   0.0064*  0.0157

For the Adult dataset, the constant classifier that labels all examples negative achieves a classification error of about 0.25. The sensitivity method thus does only slightly better than this constant classifier for most values of $\epsilon_p$, for both logistic regression and support vector machines. Objective perturbation outperforms the sensitivity method, and objective perturbation for support vector machines achieves lower classification error than objective perturbation for logistic regression. Non-private logistic regression and support vector machines both have classification error of about 0.15.

For the KDDCup99 dataset, the constant classifier that labels all examples negative has error 0.19. Again, objective perturbation outperforms the sensitivity method for both logistic regression and support vector machines; however, for SVM at high values of $\epsilon_p$ (low privacy), the sensitivity method performs almost as well as objective perturbation. In the low privacy regime, logistic regression under objective perturbation is better than support vector machines.
In contrast, in the high privacy regime (low $\epsilon_p$), support vector machines with objective perturbation outperform logistic regression. For this dataset, non-private logistic regression and support vector machines both have a classification error of about 0.001.

For SVMs on both Adult and KDDCup99, for large $\epsilon_p$ (0.25 onwards), the error of either private method can increase slightly with increasing $\epsilon_p$. This seems counterintuitive, but appears to be due to the imbalance in the fraction of the two labels. As the labels are imbalanced, the optimal classifier is trained to perform better on the negative labels than on the positives. As $\epsilon_p$ increases, for a fixed training data size, so does the perturbation from the optimal classifier induced by either of the private methods. Thus, as the perturbation increases, the number of false positives increases, whereas the number of false negatives decreases (as we verified by measuring the average false positive and false negative rates of the private classifiers). Therefore, the total error may increase slightly with decreasing privacy.

7.3 Accuracy vs. Training Data Size Tradeoffs

Next we examine how classification accuracy varies as we increase the size of the training set. We measure classification accuracy as the accuracy of the classifier produced by the tuning procedure of Section 6. As the Adult dataset is not sufficiently large to allow us to do privacy-preserving tuning, for these experiments we restrict our attention to the KDDCup99 dataset. Figures 4 and 5 present the learning curves for objective perturbation, non-private ERM, and the sensitivity method for the logistic loss and Huber loss, respectively. Experiments are shown for $\epsilon_p = 0.01$ and $\epsilon_p = 0.05$ for both loss functions.
The training sets (for each of 5 values of $\Lambda$) are chosen to be of size $n = 60{,}000$ to $n = 120{,}000$, and the validation and test sets are each of size 25,000. Each presented value is an average over 5 random permutations of the data and 50 runs of the randomized classification procedure. For objective perturbation we performed experiments in the regime where $\epsilon_p' > 0$, so $\Delta = 0$ in Algorithm 2.[4] For non-private ERM, we present results for training sets from $n = 300{,}000$ to $n = 600{,}000$. The non-private algorithms are tuned by comparing 5 values of $\Lambda$ on the same training set, and the test set is of size 25,000. Each reported value is an average over 5 random permutations of the data.

We see from the figures that for non-private logistic regression and support vector machines, the error remains constant with increasing data size. For the private methods, the error usually decreases as the data size increases. In all cases, objective perturbation outperforms the sensitivity method, and support vector machines generally outperform logistic regression.

[4] This was chosen for a fair comparison with the non-private as well as the output perturbation method, both of which had access to only 5 values of $\Lambda$.

[Figure 4: Learning curves for logistic regression on the KDDCup99 dataset. Panels (a) and (b) plot misclassification error rate against training set size for $\epsilon_p = 0.05$ and $\epsilon_p = 0.01$, comparing the sensitivity method, objective perturbation, and non-private logistic regression.]

[Figure 5: Learning curves for SVM on the KDDCup99 dataset, with the same panels and methods as Figure 4.]

8 Discussions and Conclusions

In this paper we study the problem of learning classifiers with regularized empirical risk minimization in a privacy-preserving manner. We consider privacy in the $\epsilon_p$-differential privacy model of Dwork et al. (2006b) and provide two algorithms for privacy-preserving ERM. The first is based on the sensitivity method due to Dwork et al. (2006b), in which the output of the non-private algorithm is perturbed by adding noise. We introduce a second algorithm based on the new paradigm of objective perturbation. We provide bounds on the sample requirement of these algorithms for achieving generalization error $\epsilon_g$. We show how to apply these algorithms with kernels, and finally, we provide experiments with both algorithms on two real datasets. Our work is, to our knowledge, the first to propose computationally efficient classification algorithms satisfying differential privacy, together with validation on standard datasets.

In general, for classification, the error rate increases as the privacy requirements are made more stringent. Our generalization guarantees formalize this "price of privacy." Our experiments, as well as our theoretical results, indicate that objective perturbation usually outperforms the sensitivity method at managing the tradeoff between privacy and learning performance. Both algorithms perform better with more training data, and when abundant training data is available, the performance of both algorithms can be close to non-private classification.
The conditions on the loss function and regularizer required by output perturbation and objective perturbation are somewhat different. As Theorem 1 shows, output perturbation requires strong convexity of the regularizer, and convexity as well as a bounded-derivative condition on the loss function. The last condition can be replaced by a Lipschitz condition instead. However, the other two conditions appear to be required, unless we impose some further restrictions on the loss and regularizer. Objective perturbation, on the other hand, requires strong convexity of the regularizer, and convexity, differentiability, and bounded double derivatives of the loss function. Sometimes it is possible to construct a differentiable approximation to the loss function, even if the loss function is not itself differentiable, as shown in Section 3.4.2.

Our experimental as well as theoretical results indicate that, in general, objective perturbation provides more accurate solutions than output perturbation. Thus, if the loss function satisfies the conditions of Theorem 2, we recommend using objective perturbation. In some situations, such as for SVMs, objective perturbation may not apply directly, but does apply to an approximation of the target loss function. In our experiments, the loss of statistical efficiency due to such approximation has been small compared to the loss of efficiency due to privacy, and we suspect that this is the case in many practical situations as well.

Finally, our work does not address the question of finding private solutions to regularized ERM when the regularizer is not strongly convex. For example, neither the output perturbation nor the objective perturbation method works for $L_1$-regularized ERM.
However, in L1-regularized ERM, one can find a dataset in which a change in one training point can significantly change the solution. As a result, it is possible that such problems are inherently difficult to solve privately.

An open question in this work is to extend objective perturbation methods to more general convex optimization problems. Currently, the objective perturbation method applies to strongly convex regularization functions and differentiable losses. Convex optimization problems appear in many contexts within and outside machine learning: density estimation, resource allocation for communication systems and networking, social welfare optimization in economics, and elsewhere. In some cases these algorithms will also operate on sensitive or private data. Extending the ideas and analysis here to those settings would provide a rigorous foundation for privacy analysis.

A second open question is to find a better solution for privacy-preserving classification with kernels. Our current method is based on a reduction to the linear case, using the algorithm of Rahimi and Recht (2008b); however, this method can be statistically inefficient and can require a lot of training data, particularly when coupled with our privacy mechanism. The reason is that the algorithm of Rahimi and Recht (2008b) requires the dimension D of the projected space to be very high for good performance, whereas most differentially private algorithms perform worse as the dimensionality of the data grows. Is there a better linearization method, possibly data-dependent, that would provide a more statistically efficient solution to privacy-preserving learning with kernels?

A final question is to provide better upper and lower bounds on the sample requirement of privacy-preserving linear classification.
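The linearization step discussed above can be sketched as follows: a random Fourier feature map in the style of Rahimi and Recht, approximating the RBF kernel, after which either private linear ERM algorithm can be run on the projected data. The function name is ours; the construction is the standard one for this kernel.

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """Random feature map z(.) with z(x).z(x') approximating the RBF
    kernel k(x, x') = exp(-gamma * ||x - x'||^2). For this kernel the
    frequencies are drawn from N(0, 2*gamma*I); each row of the output
    is the D-dimensional feature vector for the corresponding row of X."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + phases)
```

The approximation error decays like 1/sqrt(D), which is exactly the tension noted above: a faithful kernel approximation wants D large, while the privacy mechanisms degrade as D grows.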
The main open question here is to provide a computationally efficient algorithm for linear classification which has better statistical efficiency.

Privacy-preserving machine learning is the endeavor of designing private analogues of widely used machine learning algorithms. We believe the present study is a starting point for further study of the differential privacy model in this relatively new subfield of machine learning. The work of Dwork et al. (2006b) set up a framework for assessing the privacy risks associated with publishing the results of data analyses. Demanding high privacy requires sacrificing utility, which in the context of classification and prediction is excess loss or regret. In this paper we demonstrate the privacy-utility tradeoff for ERM, which is but one corner of the machine learning world. Applying these privacy concepts to other machine learning problems will lead to new and interesting tradeoffs and towards a set of tools for practical privacy-preserving learning and inference. We hope that our work provides a benchmark of the current price of privacy, and inspires improvements in future work.

Acknowledgments

The authors would like to thank Sanjoy Dasgupta and Daniel Hsu for several pointers, and to acknowledge Adam Smith, Dan Kifer, and Abhradeep Guha Thakurta, who helped point out an error in the previous version of the paper. The work of K. Chaudhuri and A.D. Sarwate was supported in part by the California Institute for Telecommunications and Information Technologies (CALIT2) at UC San Diego. K. Chaudhuri was also supported by National Science Foundation IIS-0713540. Part of this work was done while C. Monteleoni was at UC San Diego, with support from National Science Foundation IIS-0713540.
The experimental results were made possible by support from the UCSD FWGrid Project, NSF Research Infrastructure Grant Number EIA-0303622.

References

R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD Record, 29(2):439–450, 2000.

A. Asuncion and D.J. Newman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International World Wide Web Conference, 2007.

K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, volume 31 of Mathematical Sciences Research Institute Publications, pages 1–58. Cambridge University Press, 1997.

B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 273–282, 2007.

A. Beimel, S.P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. In Proceedings of the 7th IACR Theory of Cryptography Conference (TCC), pages 437–454, 2010.

P. Billingsley. Probability and Measure. Wiley-Interscience, New York, 3rd edition, 1995. ISBN 0471007102.

A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In R.E. Ladner and C. Dwork, editors, Proceedings of the 40th ACM Symposium on Theory of Computing (STOC), pages 609–618. ACM, 2008. ISBN 978-1-60558-047-0.

S. Boyd and L. Vandenberghe.
Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178, May 2007.

K. Chaudhuri and N. Mishra. When random sampling preserves privacy. In C. Dwork, editor, CRYPTO, volume 4117 of Lecture Notes in Computer Science, pages 198–213. Springer, 2006. ISBN 3-540-37432-9.

K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, ICALP (2), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, 2006. ISBN 3-540-35907-9.

C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the 41st ACM Symposium on Theory of Computing (STOC), 2009.

C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, our selves: Privacy via distributed noise generation. In S. Vaudenay, editor, EUROCRYPT, volume 4004 of Lecture Notes in Computer Science, pages 486–503. Springer, 2006a.

C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In 3rd IACR Theory of Cryptography Conference (TCC), pages 265–284, 2006b.

A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 211–222, 2003.

S.R. Ganta, S.P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 265–273, 2008.

A. Gupta, K. Ligett, F. McSherry, A.
Roth, and K. Talwar. Differentially private approximation algorithms. In Proceedings of the 2010 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.

S. Hettich and S.D. Bay. The UCI KDD Archive. University of California, Irvine, Department of Information and Computer Science, 1999. URL http://kdd.ics.uci.edu.

N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J.V. Pearson, D.A. Stephan, S.F. Nelson, and D.W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):e1000167, 2008.

R. Jones, R. Kumar, B. Pang, and A. Tomkins. "I know what you did last summer": query logs and user privacy. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 909–914, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-803-9.

S.P. Kasivishwanathan, M. Rudelson, A. Smith, and J. Ullman. The price of privately releasing contingency tables and the spectra of random matrices with correlated rows. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), 2010.

S.A. Kasiviswanathan, H.K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In Proceedings of FOCS, 2008.

G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495–502, 1970.

S. Laur, H. Lipmaa, and T. Mielikäinen. Cryptographically private support vector machines.
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 618–624, 2006.

A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-Diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

A. Machanavajjhala, D. Kifer, J.M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In Proceedings of the 24th International Conference on Data Engineering (ICDE), pages 277–286, 2008.

O.L. Mangasarian, E.W. Wild, and G. Fung. Privacy-preserving classification of vertically partitioned data via random kernels. ACM Transactions on Knowledge Discovery from Data, 2(3), 2008.

F. McSherry and K. Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 94–103, 2007.

A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets (how to break anonymity of the Netflix Prize dataset). In Proceedings of the 29th IEEE Symposium on Security and Privacy, pages 111–125, 2008.

K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In D.S. Johnson and U. Feige, editors, Proceedings of the 39th ACM Symposium on the Theory of Computing (STOC), pages 75–84. ACM, 2007. ISBN 978-1-59593-631-8.

N. Okazaki. liblbfgs: a library of limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). 2009. URL http://www.chokkan.org/software/liblbfgs/index.html.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS), 2007.

A. Rahimi and B. Recht. Uniform approximation of functions with random bases.
In Proceedings of the 46th Allerton Conference on Communication, Control, and Computing, 2008a.

A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008b.

R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Springer, Berlin, 1998.

B.I.P. Rubinstein, P.L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. http://arxiv.org/abs/0911.5708, 2009.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, July 2007.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In The 25th International Conference on Machine Learning (ICML), 2008.

K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

L. Sweeney. k-Anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.

L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics, 25:98–110, 1997.

V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

R. Wang, Y.F. Li, X. Wang, H. Tang, and X. Zhou. Learning your identity and disease from research papers: information leaks in genome wide association study. In ACM Conference on Computer and Communications Security, pages 534–544, 2009.

L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.
A.C.-C. Yao. Protocols for secure computations (extended abstract). In 23rd Annual Symposium on Foundations of Computer Science (FOCS), pages 160–164, 1982.

J.Z. Zhan and S. Matwin. Privacy-preserving support vector machine classification. International Journal of Intelligent Information and Database Systems, 1(3/4):356–385, 2007.

S. Zhou, K. Ligett, and L. Wasserman. Differential privacy with compression. In Proceedings of the 2009 International Symposium on Information Theory, Seoul, South Korea, 2009.
