Efficient differentially private learning improves drug sensitivity prediction
Authors: Antti Honkela 1,2,3,*,†, Mrinal Das 4,*, Arttu Nieminen 1, Onur Dikmen 1 and Samuel Kaski 4,†

1 Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland
2 Department of Mathematics and Statistics, University of Helsinki, Finland
3 Department of Public Health, University of Helsinki, Finland
4 Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland

Abstract

Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if the other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but the genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy [7, 8] has emerged as a potentially promising solution: privacy is considered sufficient if the presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities [10]. Here we show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately sized data sets, by demonstrating significant improvements with a new robust private regression method in the accuracy of private drug sensitivity prediction [4]. The method combines two key properties not present even in recent proposals [26, 9], which can be generalised to other predictors: we prove it is asymptotically consistently and efficiently private, and demonstrate that it performs well on finite data.
Good finite-data performance is achieved by limiting the sharing of private information: by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. As even this simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields, such as mobile sensing and social media, in addition to the badly needed precision medicine solutions.

* These authors contributed equally to this work.
† These authors jointly supervised the work.

1 Introduction

The widespread collection of private data, both by individuals and by hospitals in the health domain, creates a major opportunity to develop new services by learning predictive models from the data. Privacy-preserving algorithms are required and have been proposed, but for instance anonymisation approaches [1, 18, 17] cannot guarantee privacy against adversaries with additional side information, and are poorly suited for genomic data, where the entire data set is identifying [13]. Guarantees of differential privacy [7, 8] remain valid even under these conditions [8], and differential privacy has become the most popularly studied strong privacy mechanism for learning from data.

2 Efficient differentially private learning

Differential privacy [7, 8] is a formulation of reasonable privacy guarantees for privacy-preserving computation. It gives guarantees about the output of a computation and can be combined with complementary cryptographic approaches such as homomorphic encryption [12] if the computation process needs protection too. An algorithm M operating on a data set D is said to be differentially private if, for any two data sets D and D', differing only by one sample, the ratio of the probabilities of obtaining any specific result c is bounded as

p(M(D) = c) / p(M(D') = c) ≤ exp(ε).
(1)

Because of the symmetry between D and D', the probabilities need to be similar to satisfy the condition. Differential privacy is preserved under post-processing, which makes it flexible to use in complex algorithms. The ε is a privacy parameter interpretable as a privacy budget, with higher values corresponding to less privacy preservation. Differentially private learning algorithms are usually based on perturbing either the input [2, 7], the output [7, 26] or the objective [3, 28].

Here we apply differential privacy to regression. The aim is to learn a model to predict the scalar target y_i from d-dimensional inputs x_i (Fig. 1a) as

y_i = f(x_i) + η_i,

where f is an unknown mapping and η_i represents noise and modelling error. We wish to design a suitable structure for f and a differentially private mechanism for efficiently learning an accurate private f from a data set D = {(x_i, y_i)}_{i=1}^n.

We argue that a practical differentially private algorithm needs to combine two things: (i) it needs to provide asymptotically efficiently private estimators, so that the excess loss incurred from preserving privacy diminishes as the number of samples n in the data set increases; and (ii) it needs to perform well on moderately sized data. While the first requirement of asymptotic efficiency or consistency seems obvious, it is non-trivial to implement in practice and rules out some mechanisms published even quite recently [29]. The requirement was addressed in the Bayesian setting very recently [9], but the method failed to cover the second, equally important criterion. Asymptotically consistently private methods always allow reaching stronger privacy with more samples.

It is difficult to prove optimality of a method on finite data, so good performance needs to be demonstrated empirically. A design strategy for good methods controls the amount of shared private information.
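The guarantee of Eq. (1) is most commonly realised with the Laplace mechanism: add Laplace noise whose scale is the query's worst-case sensitivity divided by ε. A minimal sketch with made-up data (the helper name and the example query are illustrative, not code from the paper):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with eps-differential privacy: Laplace noise with
    scale sensitivity/epsilon masks any single sample's influence."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
# Mean of n values bounded in [0, 1]: replacing one sample moves the mean
# by at most 1/n, so the (bounded-DP) sensitivity of the mean is 1/n.
x = rng.uniform(0.0, 1.0, size=1000)
private_mean = laplace_mechanism(x.mean(), sensitivity=1.0 / len(x),
                                 epsilon=1.0, rng=rng)
```

Note how the noise scale shrinks as 1/n: with more samples, the same ε costs less accuracy, which is the intuition behind the asymptotic efficiency discussed below.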
This has two components: (a) the dimensionality needs to be reduced, to avoid the inherent incompatibility of privacy and high dimensionality, which has been discussed previously [6]; and (b) robustness is introduced by bounding and transforming each variable (feature) to a tighter interval. Controlling the amount of shared information also introduces a trade-off: compared to the non-private setting, decreasing the dimensionality a lot may degrade the performance of the non-private approach, while a corresponding low-dimensional private algorithm may attain higher performance than a higher-dimensional one (see the results and Fig. 3a).

The essence of differential privacy is to inject a sufficient amount of noise to mask the differences between the computation results obtained from neighbouring data sets (differing by only one entry). The definition depends on the worst-case behaviour, which implies that suitably limiting the space of allowed results will reduce the amount of noise needed and potentially improve the results. In the output perturbation framework this can be achieved by bounding the possible outputs [26]. Here we propose a more powerful approach of bounding the data by projecting outliers to tighter bounds. The current standard practice in private learning is to linearly transform the data to desired bounds [28]. This is clearly sub-optimal, as a few outliers can force a very small scale for the other points. A significantly higher signal-to-privacy-noise ratio can be achieved by setting the bounds to cover the essential variation in the data and projecting the outliers separately inside these bounds. This approach also robustifies the analysis against outliers, as the projection can be made independent of the outlier scale. In linear regression we call the resulting model robust private linear regression. It is illustrated in Fig. 1b, c.
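The gain from projecting outliers rather than linearly rescaling can be illustrated numerically (a sketch; the data and the bound B are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1000)
x[:5] = 50.0  # a few extreme outliers

B = 2.5  # bound covering the essential variation of the inliers

# Standard practice: linearly rescale everything into [-B, B];
# the outliers force a very small scale for the remaining points.
x_rescaled = x / np.abs(x).max() * B

# Proposed alternative: project (clip) outliers to the bounds;
# the inliers keep their original scale.
x_projected = np.clip(x, -B, B)

inliers = x[5:]
# Spread of the inliers after each treatment: clipping preserves it,
# rescaling shrinks it by the outlier magnitude.
spread_projected = np.std(np.clip(inliers, -B, B))
spread_rescaled = np.std(inliers / np.abs(x).max() * B)
```

With both treatments the data fit the same interval [-B, B] and hence need the same privacy noise, but the projected version retains far more of the signal.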
3 Results

Genomics is an important domain for privacy-aware modelling, in particular for precision medicine. Many people wish to keep their own and also their relatives' genomes private [19], and simple anonymisation is not sufficient to protect privacy, since a genome is inherently identifiable [13]. Furthermore, individual genomes can be recovered from summary statistics [15] as well as from phenotype data such as gene expression data [14]. On the other hand, previous research has shown that poorly implemented private models may put a patient at severe risk [10].

We apply the robust private linear regression model to predict drug sensitivity given gene expression data, in a setup where a small internal data set can be complemented by a larger set only available under privacy protection (Fig. 1a). We use data from the Genomics of Drug Sensitivity in Cancer (GDSC) project [27], and the setting and evaluation are similar to those in the recent DREAM-NCI drug sensitivity prediction challenge [4].

Figure 1: Differentially private learning of a predictive model. a, The modelling setup; most data (top) are available for learning only if their privacy can be protected. b, Bounding the data increasingly tightly (B; green square) brings 1D robust private linear regression models (blue lines illustrating the distribution of results of the randomised algorithm) closer to the non-private model (black line), as less noise needs to be injected. Blue points: data. c, The data are bounded in robust private linear regression by projecting outliers within the bounds (shown only for a subset of the points).
The sensitivity of each drug is predicted with Bayesian linear regression based on the expression of known cancer genes identified by the GDSC project [27], to limit the dimensionality. We achieve differential privacy by injecting noise into the sufficient statistics computed from the data, using the Laplace mechanism [7]. Full details are presented in Methods. Unlike with previous approaches, prediction accuracy (ranking of new cell lines [4] into sensitive vs insensitive, measured by Spearman's rank correlation; Fig. 2) now improves when more privacy-protected data are received. The proposed non-linear projection of the data to tighter bounds is the key to this success, as without it the method performs as poorly as the earlier ones.

To improve prediction performance in differentially private learning, trade-offs need to be made between dimensionality and amount of data (Fig. 3a), and between strength of privacy guarantees and amount of data (Fig. 3c), but the amount of optional non-private data matters significantly only when there is very little private data (Fig. 3b).

In Secs. A-B in the Supplementary Information we define asymptotic consistency and efficiency of private estimators relative to non-private ones, and prove that the optimal convergence rate of differentially private Bayesian estimators to the corresponding non-private ones is O(1/n) for n samples, which is matched by our method. Unlike existing approaches [22, 24, 23], we compare the private estimators to the corresponding non-private ones, making the theory more easily accessible and more broadly applicable. Robust private linear regression treats non-private and scrambled private data similarly in the model learning.
An interesting next step for further improving the accuracy on very small private data would be to give different weights to the clean and privacy-scrambled data by incorporating knowledge of the injected noise in the Bayesian inference, as has been proposed for generative models [25], but which is non-trivial in regression.

4 Methods

4.1 Linear regression model

The Bayesian linear regression model for scalar target y_i, with d-dimensional input x_i and fixed noise precision λ, is defined by

y_i | x_i ~ N(x_i^T β, λ)
β ~ N(0, λ_0 I), (2)

where β is the unknown parameter to be learnt. The λ and λ_0 are the precision parameters of the corresponding Gaussian distributions, and act as regularisers. Given an observed data set D = {(x_i, y_i)}_{i=1}^n with sufficient statistics nxx = Σ_{i=1}^n x_i x_i^T and nxy = Σ_{i=1}^n x_i y_i, the posterior distribution of β is Gaussian, p(β | D) = N(β; μ_*, Λ_*), with precision

Λ_* = λ_0 I + λ nxx (3)

Figure 2: Accuracy of drug sensitivity prediction in terms of Spearman's rank correlation coefficient over ranking cell lines by sensitivity to a drug (higher is better) increases with the size of private data for the proposed robust private linear regression. (Methods shown: LR non-private data (d = 10); LR non-private data (d = 64); Linear Regression (LR); Private LR; Robust private LR; Output perturbed LR; Functional mechanism LR.) The state-of-the-art methods fail to utilise private data under strict privacy conditions.
The baselines (horizontal dashed lines) are learned on 10 non-private data points; the private algorithms additionally have privacy-protected data (x-axis). The non-private algorithm (LR) has the same amount of additional non-privacy-protected data. All methods use 10-dimensional data, except the purple baseline showing the best performance with 10 non-private data points. Private methods use ε = 2; corresponding results for ε = 1 are in Fig. 6. The results are averaged over all drugs and 50-fold Monte Carlo cross-validation; error bars denote standard deviation over the 50 Monte Carlo repeats. (See Methods for details.)

Figure 3: Key trade-offs in differentially private learning. Relative improvements over the baseline (10 non-private data points). a, As the dimensionality increases, the models without private data improve, whereas more data are needed to improve the performance of the private methods. b, With enough private data, adding more non-private data does not significantly increase the performance. c, More data are needed if privacy guarantees are tighter (ε is smaller). The size of the non-private data is 10 and ε = 2 (except when otherwise noted).

and mean

μ_* = Λ_*^{-1} (λ nxy) (4)

After learning with the training data set, the prediction of y_i using x_i is computed as

ŷ_i = x_i^T μ_*.
(5)

A more robust alternative is to define prior distributions for the precision parameters. In our case, a Gamma prior is assigned to both:

λ ~ Gamma(a, b)
λ_0 ~ Gamma(a_0, b_0). (6)

The posterior can be sampled using computational methods such as automatic differentiation variational inference (ADVI) [16], where we fit a variational distribution to the posterior. The precision parameters and regression coefficients β are then sampled from the fitted distribution. For this purpose, the data likelihood in Eq. (2) needs to be expressed in terms of the sufficient statistics nxx, nxy, and nyy = Σ_{i=1}^n y_i^2, which results in

p(y | X, β, λ) = (λ / 2π)^{n/2} exp( -(λ/2) (β^T nxx β - 2 β^T nxy + nyy) ). (7)

The prediction of y_i is computed using x_i and averaging over a sufficiently large number m of sampled regression coefficients β^{(k)} as

ŷ_i = ∫ p(y | β, X_{test,i}) p(β | D_train) dβ ≈ (1/m) Σ_{k=1}^m x_{test,i}^T β^{(k)}. (8)

For evaluation we keep a part of the data set D aside (not used for training) and, after predicting ŷ_i, we evaluate the error between the actual y_i and ŷ_i. In this paper we do this using Spearman's rank correlation coefficient, to evaluate how well the predictions separate sensitive and insensitive cell lines.

4.2 Differential privacy and efficiency

We apply differential privacy as defined in Eq. (1). We use bounded differential privacy, where two data sets are considered neighbouring if they contain the same number of elements n, with n - 1 equal elements. Compared to the other common alternative of unbounded differential privacy, in which two data sets are considered neighbouring if one is obtained from the other by adding or removing an element, bounded differential privacy makes it clear that the number of samples is not private, which simplifies parameter tuning.
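With fixed precisions, the posterior of Eqs. (3)-(4) and the prediction of Eq. (5) depend on the data only through the sufficient statistics nxx and nxy, which is what makes sufficient-statistic perturbation possible. A minimal sketch with synthetic data (this is the closed-form fixed-precision case, not the paper's PyMC3/ADVI implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
lam, lam0 = 1.0, 1.0  # fixed noise and prior precisions
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Sufficient statistics of the linear regression model (Eq. 2).
nxx = X.T @ X
nxy = X.T @ y

# Posterior precision and mean, Eqs. (3)-(4).
Lambda_star = lam0 * np.eye(d) + lam * nxx
mu_star = np.linalg.solve(Lambda_star, lam * nxy)

# Point prediction for a new input, Eq. (5).
x_new = rng.normal(size=d)
y_hat = x_new @ mu_star
```

The same computation goes through unchanged if nxx and nxy are replaced by their noise-perturbed private counterparts, which is how the mechanism of Sec. 4.3 reuses the non-private learning machinery.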
The privacy parameter values are not directly comparable between the two formalisms, although an ε = k unbounded differentially private mechanism is always an ε = 2k bounded differentially private mechanism.

We define a private parameter estimation mechanism to be asymptotically consistently private if the private estimate converges in probability to the corresponding non-private estimate as the number of samples increases. We show that the optimal rate of convergence of the private estimate to the corresponding non-private Bayesian estimate is O(1/n). Mechanisms reaching this convergence rate are called asymptotically efficiently private. A mechanism for estimating a model is called asymptotically consistently private with respect to a utility function if the utility of the private model converges in probability to the utility of the corresponding non-private model. For full details of these definitions see Supplementary Information sections 1.1-1.2.

4.3 Robust private linear regression

Robust private linear regression is based on perturbing the sufficient statistics nxx = Σ_{i=1}^n x_i x_i^T, nxy = Σ_{i=1}^n x_i y_i, and nyy = Σ_{i=1}^n y_i^2. We use independent ε_i-differentially private Laplace mechanisms [7] for perturbing each statistic, with ε_i = ε p_i for each i = 1, 2, 3 and p_1 + p_2 + p_3 = 1. Together, they provide an ε-differentially private mechanism. We project the outliers in the private data sets to fit the data into the intervals [-B_x, B_x] and [-B_y, B_y] as

x_ij = max(-B_x, min(x_ij, B_x))
y_i = max(-B_y, min(y_i, B_y)). (9)

After the projection, ||x_i||_∞ ≤ B_x and |y_i| ≤ B_y, and we add noise to nxx distributed as Laplace(0, b_xx), to nxy distributed as Laplace(0, b_xy), and to nyy distributed as Laplace(0, b_yy), where the scale parameters are

b_xx = d(d+1) B_x^2 / (ε p_1), b_xy = 2 d B_x B_y / (ε p_2), b_yy = B_y^2 / (ε p_3).
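The perturbation step described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the helper name, the default budget split, and the choice to draw symmetric noise for the symmetric matrix nxx are assumptions made for the example.

```python
import numpy as np

def robust_private_suff_stats(X, y, Bx, By, eps, p=(0.35, 0.60, 0.05), rng=None):
    """Project outliers into [-Bx, Bx] / [-By, By] (Eq. 9) and release
    Laplace-perturbed sufficient statistics under budget split p."""
    rng = rng if rng is not None else np.random.default_rng()
    d = X.shape[1]
    Xc = np.clip(X, -Bx, Bx)  # projection of outliers inside the bounds
    yc = np.clip(y, -By, By)
    nxx = Xc.T @ Xc
    nxy = Xc.T @ yc
    nyy = yc @ yc
    # Laplace scales from the bounded-DP sensitivities stated above.
    b_xx = d * (d + 1) * Bx**2 / (eps * p[0])
    b_xy = 2 * d * Bx * By / (eps * p[1])
    b_yy = By**2 / (eps * p[2])
    # Perturb the unique entries of the symmetric matrix nxx, mirroring
    # the noise so the released matrix stays symmetric.
    noise = rng.laplace(0.0, b_xx, size=(d, d))
    noise = np.triu(noise) + np.triu(noise, 1).T
    return (nxx + noise,
            nxy + rng.laplace(0.0, b_xy, size=d),
            nyy + rng.laplace(0.0, b_yy))

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
y = rng.normal(size=500)
nxx_p, nxy_p, nyy_p = robust_private_suff_stats(X, y, Bx=2.0, By=2.0, eps=2.0, rng=rng)
```

The released triple can then be plugged into the posterior of Eqs. (3)-(4) or the likelihood of Eq. (7) in place of the exact statistics.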
This generalises earlier work on bounded variables [9] to the unbounded case by introducing the projection. A proof that this yields a valid, asymptotically consistent and efficient differentially private mechanism is given in Supplementary Information section 2. We also show that a similar algorithm, applied to the estimation of a Gaussian mean, leads to an asymptotically consistent and efficient private estimate of the posterior mean, while the simpler input perturbation that perturbs the entire data set is not asymptotically consistently private.

The privacy budget proportions p_1, p_2, p_3 and the projection thresholds B_x, B_y are important parameters for good model performance. As illustrated in Fig. 4, the projection thresholds depend strongly on the size of the data set. We propose finding the optimal parameter values on an auxiliary synthetic data set of the same size, which was found to be effective in our case. We generate the auxiliary data set of n samples using a generative model similar to the one specified in Eq. (2):

x_i ~ N(0, I_d)
y_i | x_i ~ N(x_i^T β, λ)
β ~ N(0, λ_0 I), (10)

where d is the dimension.

First we find the optimal budget split p_1, p_2, p_3. For all possible combinations of (p_1, p_2, p_3) ∈ {0.05, 0.1, ..., 0.90}^3 with p_1 + p_2 + p_3 = 1, we project the data using the clipping thresholds for the current split, and we perturb the sufficient statistics according to the current budget split. We compute the prediction as in Eq. (8) using samples drawn from the variational distribution fitted with ADVI, and compute the error with respect to the original values. The error measure we use is Spearman's rank correlation between the original and predicted values. The split (p_1, p_2, p_3) which gives the minimum error is used in all test settings. As illustrated in Fig.
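The budget-split search above enumerates every split on a 0.05 grid summing to one. A sketch of the enumeration (working in integer multiples of 0.05 keeps the sum constraint exact despite floating point):

```python
# Candidate privacy-budget splits (p1, p2, p3) on the grid
# {0.05, 0.10, ..., 0.90}^3 with p1 + p2 + p3 = 1.
splits = [(i * 0.05, j * 0.05, k * 0.05)
          for i in range(1, 19)
          for j in range(1, 19)
          for k in range(1, 19)
          if i + j + k == 20]
```

Each candidate split then drives one projection-and-perturbation run on the auxiliary data, and the split with the best Spearman correlation is kept.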
5, in our experiments the optimal split gives the largest proportion of the privacy budget to the term nxy (60%), the second largest proportion to the term nxx (35%), and the smallest possible proportion to the term nyy (5%).

We parameterise the projection thresholds as a function of the data standard deviation as

B_x = ω_x σ_x, B_y = ω_y σ_y, (11)
ω_x, ω_y ∈ {0.1 ω}_{ω=1}^{20}, (12)

where σ_x and σ_y are the standard deviations of x (considering all dimensions) and y, respectively. With all 400 pairs of (B_x, B_y) as specified above, we apply the outlier projection method of Eq. (9). We perturb the sufficient statistics according to the chosen optimal privacy budget split, fit the model as in Eqs. (3)-(4) using the projected values, and then compute the error with respect to the original values. The pair (ω_x, ω_y) which gives the minimum error is used to define (B_x, B_y) for the real data as in Eq. (11). As the error measure we used Spearman's rank correlation between the original y_{1:n} and the predicted ȳ_{1:n} based on the model learnt with the projected values.

4.4 Data and pre-processing

We used the gene expression and drug sensitivity data from the Genomics of Drug Sensitivity in Cancer (GDSC) project [27, 11] (release 6.1, March 2017, http://www.cancerrxgene.org), consisting of 265 drugs and a panel of 985 human cancer cell lines. The dimensionality of the RMA-normalised gene expression data was reduced from d = 17490 down to 64 based on prior knowledge about genes that are frequently mutated in cancer, provided by the GDSC project at http://www.cancerrxgene.org/translation/Gene. We further ordered the genes based on their mutation counts as reported at http://cancer.sanger.ac.uk/cosmic/curation. Drug responses were quantified by log-transformed IC50 values (the drug concentration yielding a 50% response) from the dose-response data measured at 9 different concentrations.
The mean was first removed from each gene, x_ij := x_ij - mean(x_{1:n,j}), and each data point was normalised to have unit L2 norm, ||x_i||_2 = 1, which focuses the analysis on the relative expression of the selected genes and equalises the contribution of each data point. The mean was removed from the drug sensitivities, y_i := y_i - mean(y_{1:n}). Data with missing drug responses were ignored, making the number of cell lines different across different drugs.

4.5 Experimental setup

We carried out a 50-fold Monte Carlo cross-validation process for different splits of the data set into train and test sets using different random seeds. For each repeat, we randomly split the 985 cell lines into 100 for testing and the rest for training. We further randomly partitioned the training set into 30 non-private cell lines and used the rest as the private data set. In the experiments, we tested non-private data sizes from 0 to 30, and private data sizes from 100 to 800. The hyperparameters for the Gamma priors of the precision parameters λ, λ_0 in Eq. (6) were set to a = b = a_0 = b_0 = 2. The Gamma(2, 2) distribution has mean 1 and variance 1/2, and defines a realistic distribution over sensible values of the precision parameters, which should be larger than zero. We implemented the model and carried out the inference with the PyMC3 Python module [20]. Using ADVI, we fitted a normal distribution with uncorrelated variables to the posterior distribution. We computed the drug response predictions using m = 5000 samples from the fitted variational distribution. We used ADVI because it gives similar results to Hamiltonian Monte Carlo sampling but significantly faster.
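The pre-processing of Sec. 4.4 (gene-wise centring, per-sample unit L2 norm, centred responses) can be sketched as follows; the matrix sizes and values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 8))  # made-up expression matrix
y = rng.normal(loc=-1.0, size=100)                 # made-up log-IC50 responses

X = X - X.mean(axis=0)                              # remove the mean of each gene
X = X / np.linalg.norm(X, axis=1, keepdims=True)    # unit L2 norm per data point
y = y - y.mean()                                    # centre the drug sensitivities
```

Normalising each row to unit norm also conveniently bounds every feature to [-1, 1] before the projection thresholds of Eq. (11) are applied.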
The optimal privacy budget split was based on prediction performance averaged over five auxiliary data sets of 500 synthetic samples (approximately half of the GDSC data set size) and five generated noise samples; for each split, the optimal projection thresholds were chosen similarly, based on average performance over five auxiliary data sets and five noise samples. The prediction for each split was computed using m = 5000 samples drawn from the variational distribution fitted with ADVI. The final optimal projection thresholds for each test case were chosen using the optimal budget split and based on average prediction performance over 20 auxiliary data sets and 20 noise samples. All auxiliary data sets were generated by fixing the precision parameter values to the prior means, λ = λ_0 = 1. The prediction for each pair of projection thresholds was also computed using fixed precision parameters as in Eqs. (3) and (4), as generating samples from the fitted variational distribution for all test cases would have been infeasible in practice.

4.6 Alternative methods used in comparisons

We compared five models: (i) linear regression (LR), as defined in Eq. (2); (ii) robust private LR, the proposed method; (iii) private LR, the proposed method without projection of the outliers; (iv) output perturbed LR [26]; and (v) functional mechanism LR [28]. Output perturbed LR learns the parameters β using the same LR model in Eq. (2), but instead of the statistics, the parameters are perturbed, in a data-independent manner. Our implementation of output perturbed LR makes use of the minConf optimisation package [21]. For functional mechanism LR we used the code publicly available at https://sourceforge.net/projects/functionalmecha/.
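All comparisons are scored with Spearman's rank correlation between measured and predicted sensitivities. For tie-free data it is simply the Pearson correlation of the ranks, which can be written in a few lines (a minimal sketch ignoring ties; a production run would use a ties-aware library implementation):

```python
import numpy as np

def spearman_rank_corr(a, b):
    """Spearman's rank correlation for tie-free vectors:
    Pearson correlation of the rank vectors."""
    ra = np.argsort(np.argsort(a)).astype(float)  # rank of each element of a
    rb = np.argsort(np.argsort(b)).astype(float)  # rank of each element of b
    return np.corrcoef(ra, rb)[0, 1]

# Any monotone relationship scores 1, regardless of its shape.
r = spearman_rank_corr(np.array([0.1, 0.5, 2.0, 3.0]),
                       np.array([1.0, 2.0, 30.0, 40.0]))
```

Ranking-based scoring is a natural fit here, since the task is to separate sensitive from insensitive cell lines rather than to predict exact IC50 values.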
4.7 Alternative interpretation: transformed linear regression

The outlier projection mechanism can also be interpreted as producing a transformed linear regression problem,

φ_y(y_i) | x_i ~ N(φ_x(x_i)^T β, λ), (13)

where the functions φ_y() and φ_x() implementing the outlier projection can be defined as

φ_y(y_i) = max(-B_y, min(B_y, y_i)) (14)
φ_x(x_i) = max(-B_x, min(B_x, x_i)). (15)

The normalisation of the data can also be included as a transformation. This interpretation makes explicit the flexibility in designing the transformations: the differential privacy guarantees will remain valid as long as the transformations obey the bounds

φ_y(y_i) ∈ [-B_y, B_y], φ_x(x_i) ∈ [-B_x, B_x]. (16)

Acknowledgements

We would like to thank Muhammad Ammad-ud-din for assistance in data processing and Otte Heinävaara for assistance in the theoretical analysis. We acknowledge the computational resources provided by the Aalto Science-IT project. This work was funded by the Academy of Finland (Centre of Excellence COIN; and grants 283193 (S.K. and M.D.), 294238 and 292334 (S.K.), 278300 (A.H. and O.D.), 259440 and 283107 (A.H.)).

References

[1] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. 21st Int. Conf. Data Eng. (ICDE 2005), 2005.
[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proc. PODS 2005, 2005.
[3] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Adv. Neural Inf. Process. Syst. 21, 2008.
[4] J. C. Costello, L. M. Heiser, E. Georgii, M. Gönen, M. P. Menden, N. J. Wang, M. Bansal, M. Ammad-ud-din, P. Hintsanen, S. A. Khan, J.-P. Mpindi, O. Kallioniemi, A. Honkela, T. Aittokallio, K. Wennerberg, NCI DREAM Community, J. J. Collins, D. Gallahan, D. Singer, J. Saez-Rodriguez, S. Kaski, J. W. Gray, and G. Stolovitzky.
A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol., 32(12):1202-1212, Dec 2014.
[5] P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. Ann. Stat., 7(2):269-281, Mar 1979.
[6] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. J. ACM, 61(6):1-57, Dec 2014.
[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. TCC 2006, 2006.
[8] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211-407, Aug 2014.
[9] J. Foulds, J. Geumlek, M. Welling, and K. Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. In Proc. UAI 2016, Mar 2016.
[10] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In Proc. 23rd USENIX Security Symp. (USENIX Security 2014), pages 17-32, 2014.
[11] M. J. Garnett et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483(7391):570-575, Mar 2012.
[12] C. Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.
[13] M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science, 339(6117):321-324, Jan 2013.
[14] A. Harmanci and M. Gerstein. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods, 13(3):251-256, Mar 2016.
[15] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet., 4(8):e1000167, Aug 2008.
[16] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference.
J. Mach. Learn. Res., 18(14):1-45, 2017.
[17] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proc. ICDE 2007, 2007.
[18] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. TKDD, 1(1):3, Mar 2007.
[19] M. Naveed et al. Privacy in the genomic era. ACM Comput. Surv., 48(1):1-44, Aug 2015.
[20] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, Apr 2016.
[21] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Proc. AISTATS 2009, 2009.
[22] A. Smith. Efficient, differentially private point estimators. Sept 2008. arXiv:0809.4794 [cs.CR].
[23] Y.-X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. Feb 2015. arXiv:1502.06309 [stat.ML].
[24] L. Wasserman and S. Zhou. A statistical framework for differential privacy. J. Am. Stat. Assoc., 105(489):375-389, Mar 2010.
[25] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In Adv. Neural Inf. Process. Syst. 23, 2010.
[26] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton. Revisiting differentially private regression: Lessons from learning theory and their consequences. Dec 2015. arXiv:1512.06388 [cs.CR].
[27] W. Yang et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res., 41(Database issue):D955-D961, Jan 2013.
[28] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. PVLDB, 5(11):1364-1375, 2012.
[29] Z. Zhang, B. Rubinstein, and C. Dimitrakakis.
On the differential privacy of Bayesian inference. In Proc. AAAI 2016, 2016.

Supplementary Information

A Theoretical background

We argue that effective differentially private predictive modelling methods can be developed by a combination of:

i. An asymptotically efficiently private mechanism for which the effect of the noise added to guarantee privacy vanishes as the number of samples increases; and

ii. A way to limit the amount of private information to be shared. This yields better performance on finite data, as less noise needs to be added for equivalent privacy. This can be achieved through a combination of two things:

a. An approach to decrease the dimensionality of the data prior to the application of the private algorithm; and

b. A method to focus the privacy guarantees on relevant variation in the data.

Criterion i can be formally stated through the additional loss in accuracy or utility of the estimates because of privacy. Our main asymptotic result is that the optimal convergence rate of a differentially private mechanism to a Bayesian estimate is $O(1/n)$, which can be reached by our proposed mechanism. Criterion ii is non-asymptotic and thus more difficult to address theoretically. It manifests itself in the constants in the convergence rates, as well as in the empirical findings on the effect of dimensionality reduction and of projecting outliers to tighter bounds, as discussed in the main text and in Fig. 4.

A.1 Definition of asymptotic efficiency

We begin by formalising the theory behind Criterion i.

Definition 1. A differentially private mechanism $M$ is asymptotically consistent with respect to an estimated parameter $\theta$ if the private estimates $\hat{\theta}_M$ given a data set $\mathcal{D}$ converge in probability to the corresponding non-private estimates $\hat{\theta}_{NP}$ as the number of samples, $n = |\mathcal{D}|$, grows without bound, i.e., if for any $\alpha > 0$,
$$\lim_{n \to \infty} \Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > \alpha\} = 0.$$

Definition 2.
A differentially private mechanism $M$ is asymptotically efficiently private with respect to an estimated parameter $\theta$ if the mechanism is asymptotically consistent and the private estimates $\hat{\theta}_M$ converge to the corresponding non-private estimates $\hat{\theta}_{NP}$ at the rate $O(1/n)$, i.e., if for any $\alpha > 0$ there exist constants $C, N$ such that
$$\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > C/n\} < \alpha$$
for all $n \ge N$.

(Footnote: We use $\alpha$ in limit expressions instead of the usual $\epsilon$ to avoid confusion with $\epsilon$-differential privacy.)

The term asymptotically efficiently private in the above definition is justified by the following theorem, which shows that the rate $O(1/n)$ is optimal for estimating expectation parameters of exponential family distributions. As it seems unlikely that better rates could be obtained for more difficult problems, we conjecture that this rate cannot be beaten for Bayesian estimates in general.

Theorem 1. The private estimates $\hat{\theta}_M$ of an exponential family posterior expectation parameter $\theta$, generated by a differentially private mechanism $M$ that achieves $\epsilon$-differential privacy for any $\epsilon > 0$, cannot converge to the corresponding non-private estimates $\hat{\theta}_{NP}$ at a rate faster than $1/n$. That is, assuming $M$ is $\epsilon$-differentially private, there exists no function $f(n)$ such that $\limsup n f(n) = 0$ and for all $\alpha > 0$ there exists a constant $N$ such that $\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > f(n)\} < \alpha$ for all $n \ge N$.

Proof. The non-private estimate of an expectation parameter of an exponential family is [5]
$$\hat{\theta}_{NP} \mid x_1, \ldots, x_n = \frac{n_0 x_0 + \sum_{i=1}^n x_i}{n_0 + n}. \tag{17}$$
The difference of the estimates from two neighbouring data sets differing by one element is
$$(\hat{\theta}_{NP} \mid \mathcal{D}) - (\hat{\theta}_{NP} \mid \mathcal{D}') = \frac{x - y}{n_0 + n}, \tag{18}$$
where $x$ and $y$ are the corresponding mismatched elements. Let $\Delta = \max(\|x - y\|)$, and let $\mathcal{D}$ and $\mathcal{D}'$ be neighbouring data sets including these maximally different elements.
Let us assume that there exists a function $f(n)$ such that $\limsup n f(n) = 0$ and for all $\alpha > 0$ there exists a constant $N$ such that $\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > f(n)\} < \alpha$ for all $n \ge N$. Fix $\alpha > 0$ and choose $M \ge \max(N, n_0)$ such that $f(n) \le \Delta/(4n)$ for all $n \ge M$. This implies that
$$\|(\hat{\theta}_{NP} \mid \mathcal{D}) - (\hat{\theta}_{NP} \mid \mathcal{D}')\| = \frac{\Delta}{n_0 + n} \ge \frac{\Delta}{2n} \ge 2 f(n). \tag{19}$$
Let us define the region $C_{\mathcal{D}} = \{ t \mid \|(\hat{\theta}_{NP} \mid \mathcal{D}) - t\| < f(n) \}$. Based on our assumptions we have
$$\Pr(\hat{\theta}_M \mid \mathcal{D} \in C_{\mathcal{D}}) > 1 - \alpha \tag{20}$$
$$\Pr(\hat{\theta}_M \mid \mathcal{D}' \in C_{\mathcal{D}}) < \alpha, \tag{21}$$
which implies that
$$\frac{\Pr(\hat{\theta}_M \mid \mathcal{D} \in C_{\mathcal{D}})}{\Pr(\hat{\theta}_M \mid \mathcal{D}' \in C_{\mathcal{D}})} > \frac{1 - \alpha}{\alpha}, \tag{22}$$
which means that $M$ cannot be differentially private with $\epsilon < \log((1 - \alpha)/\alpha) \to \infty$ as $\alpha \to 0$.

A.2 Different utility functions

Definition 3. Let $U(\hat{\theta}_{NP}(\mathcal{D}))$ measure the utility of the non-private model $\hat{\theta}_{NP}$ estimated from data set $\mathcal{D}$, and let $U(\hat{\theta}_M(\mathcal{D}))$ measure the corresponding utility of the private model $\hat{\theta}_M$ obtained using a differentially private mechanism $M$. The mechanism $M$ is asymptotically consistent with respect to a bounded utility $U$ if the random variables $U(\hat{\theta}_M(\mathcal{D}))$ converge in probability to $U(\hat{\theta}_{NP}(\mathcal{D}))$ as the number of samples, $n = |\mathcal{D}|$, grows without bound, i.e., if for any $\alpha > 0$,
$$\lim_{n \to \infty} \Pr\{|U(\hat{\theta}_M(\mathcal{D})) - U(\hat{\theta}_{NP}(\mathcal{D}))| > \alpha\} = 0.$$

Theorem 2. A differentially private mechanism $M$ that is asymptotically consistent with respect to a set of parameters is asymptotically consistent with respect to any continuous utility that only depends on those parameters.

Proof. If $\hat{\theta}_M$ converges in probability to $\hat{\theta}_{NP}$, then by the continuous mapping theorem the value of $U(\hat{\theta}_M)$ converges in probability to $U(\hat{\theta}_{NP})$.

A.3 Example: Gaussian mean

Theorem 3.
Differentially private inference of the mean of a Gaussian variable, with the Laplace mechanism used to perturb the sufficient statistics, is asymptotically consistent with respect to the posterior mean.

Proof. Let us consider the model
$$x_i \sim \mathcal{N}(\mu, \Lambda), \quad \mu \sim \mathcal{N}(\mu_0, \Lambda_0),$$
with $\mu$ as the unknown parameter and $\Lambda$ and $\Lambda_0$ denoting the fixed prior precision matrices of the noise and the mean, respectively. We assume $\|x_i\|_1 \le B$ and enforce this by projecting the larger elements to satisfy this bound. Let the observed data set be $\mathcal{D} = \{x_i\}_{i=1}^n$ with sufficient statistic $n\bar{x} = \sum_{i=1}^n x_i$. The non-private posterior mean is
$$\mu_{NP} = (\Lambda_0 + n\Lambda)^{-1}(\Lambda n\bar{x} + \Lambda_0 \mu_0).$$
The corresponding private posterior mean is obtained by replacing $n\bar{x}$ with the perturbed version $n\bar{x}' = n\bar{x} + \delta$, where $\delta = (\delta_1, \ldots, \delta_d)^T \in \mathbb{R}^d$ with $\delta_j \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$ and $d = \dim(x_i)$, yielding
$$\mu_{DP} = (\Lambda_0 + n\Lambda)^{-1}(\Lambda(n\bar{x} + \delta) + \Lambda_0 \mu_0).$$
The difference of the private and non-private means is
$$\|\mu_{DP} - \mu_{NP}\|_1 = \|(\Lambda_0 + n\Lambda)^{-1}(\Lambda\delta)\|_1 = \|(\Lambda^{-1}\Lambda_0 + n I)^{-1}\delta\|_1 \le \frac{c}{n}\|\delta\|_1,$$
which is valid for all $c > 1$ for large enough $n$. This implies that
$$\Pr\{\|\mu_{DP} - \mu_{NP}\|_1 \ge \alpha\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 \ge \alpha\right\} \to 0$$
as $n \to \infty$ for all $\alpha > 0$.

Theorem 4. Differentially private inference of the mean of a Gaussian variable with the Laplace mechanism used to perturb the input data set (naive input perturbation) is not asymptotically consistent with respect to the posterior mean.

Proof. The mechanism is almost the same as in Theorem 3, but we now have $n\bar{x}' = n\bar{x} + \sum_{i=1}^n \delta_i$, where $\delta_i = (\delta_{i1}, \ldots, \delta_{id})^T \in \mathbb{R}^d$ with $\delta_{ij} \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$. A similar computation as above yields
$$\|\mu_{DP} - \mu_{NP}\|_1 = \left\|(\Lambda_0 + n\Lambda)^{-1}\left(\Lambda \sum_{i=1}^n \delta_i\right)\right\|_1 = \left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + I\right)^{-1} \tfrac{1}{n}\sum_{i=1}^n \delta_i\right\|_1 \ge \frac{1}{2}\left\|\tfrac{1}{n}\sum_{i=1}^n \delta_i\right\|_1$$
for sufficiently large $n$.
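The contrast between perturbing the sufficient statistics once (Theorem 3) and perturbing every input point (Theorem 4) can be checked numerically. The sketch below is our illustration, not the authors' code: it fixes identity precision matrices with $\mu_0 = 0$, uses coordinate-wise clipping as a crude stand-in for the $\ell_1$ projection, and picks illustrative values of $B$ and $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_mean_suff_stats(x, B, eps):
    """Perturb the sufficient statistic sum(x) once with Laplace noise
    of scale 2*B*d/eps, then form the posterior mean with identity
    precisions and zero prior mean: (I + n I)^-1 (sum + noise)."""
    n, d = x.shape
    x = np.clip(x, -B, B)  # stand-in for the l1 projection of the paper
    s = x.sum(axis=0) + rng.laplace(0.0, 2 * B * d / eps, size=d)
    return s / (1.0 + n)

def private_mean_input_perturbation(x, B, eps):
    """Naive input perturbation: add Laplace noise to every data point."""
    n, d = x.shape
    x = np.clip(x, -B, B) + rng.laplace(0.0, 2 * B * d / eps, size=(n, d))
    return x.sum(axis=0) / (1.0 + n)

errors = {}
for n in (100, 10000):
    x = rng.normal(0.0, 1.0, size=(n, 3))
    nonprivate = np.clip(x, -5.0, 5.0).sum(axis=0) / (1.0 + n)
    errors[n] = (
        np.abs(private_mean_suff_stats(x, 5.0, 1.0) - nonprivate).sum(),
        np.abs(private_mean_input_perturbation(x, 5.0, 1.0) - nonprivate).sum(),
    )
print(errors)
```

With this seed the sufficient-statistics error shrinks roughly as $1/n$, while the input-perturbation error decays much more slowly, matching the asymptotic picture above.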
By the central limit theorem, the distribution of $\frac{1}{\sqrt{n}} \sum_{i=1}^n \delta_i$ converges to a Gaussian with non-zero variance. Hence $\mu_{DP}$ does not converge to $\mu_{NP}$ for large $n$, and the method is not asymptotically consistent.

A.3.1 Asymptotic efficiency

Theorem 5. The $\epsilon$-differentially private estimate of the mean of a $d$-dimensional Gaussian variable $x$ bounded by $\|x_i\|_1 \le B$, in which the Laplace mechanism is used to perturb the sufficient statistics, is asymptotically efficiently private.

Proof. In the proof of Theorem 3 we showed that
$$\|\mu_{DP} - \mu_{NP}\|_1 \le \frac{c}{n}\|\delta\|_1,$$
where $\delta = (\delta_1, \ldots, \delta_d)^T \in \mathbb{R}^d$ with $\delta_j \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$. Because $\delta_j$ is Laplace, $|\delta_j|$ is exponential with $|\delta_j| \sim \mathrm{Exponential}(\epsilon/(2Bd))$ and
$$\|\delta\|_1 = \sum_{j=1}^d |\delta_j| \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{2Bd}\right).$$
Given $\alpha > 0$ we can choose $C > c F^{-1}(1 - \alpha;\, d,\, \epsilon/(2Bd))$, where $F^{-1}(x; a, b)$ is the inverse cumulative distribution function of the Gamma distribution with shape $a$ and rate $b$, to ensure that
$$\Pr\left\{\|\mu_{DP} - \mu_{NP}\|_1 > \frac{C}{n}\right\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 > \frac{C}{n}\right\} = \Pr\{\|\delta\|_1 > C/c\} < \alpha. \tag{23}$$

A.3.2 Convergence rate

We can further study the probability of making an error of at least a given magnitude as
$$\Pr\{\|\mu_{DP} - \mu_{NP}\|_1 \ge \phi\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 \ge \phi\right\} = \Pr\left\{\mathrm{Gamma}\left(d, \frac{n\epsilon}{2Bcd}\right) \ge \phi\right\} = 1 - F\left(\phi;\, d,\, \frac{n\epsilon}{2Bcd}\right) = 1 - \frac{\gamma\left(d, \frac{n\phi\epsilon}{2Bcd}\right)}{\Gamma(d)}, \tag{24}$$
where $F(x; a, b)$ is the cumulative distribution function of the Gamma distribution with shape $a$ and rate $b$. The formula in Eq. (24) unfortunately has no simple closed-form expression. The result shows, however, that the $n$ required to reach a certain level of performance is linear in $B$ and $1/\epsilon$. The dependence on $d$ is complicated, but it is in general super-linear, as suggested by the mean of the gamma distribution in Eq. (24), $\frac{2Bcd^2}{\epsilon n}$.

A.4 Example: Zhang et al., AAAI 2016

In their paper, Zhang et al. [29] derive utility bounds for a number of mechanisms.
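Although Eq. (24) has no simple closed form, for integer shape $d$ the Gamma survival function is an Erlang survival function and can be evaluated exactly with a short Poisson sum. The sketch below (our illustration; the parameter values are arbitrary) evaluates the error-probability bound of Eq. (24) and shows how it tightens with growing $n$.

```python
import math

def gamma_tail(phi, d, rate):
    """Pr[Gamma(shape=d, rate=rate) >= phi] for integer shape d,
    via the closed-form Erlang survival function
    sum_{k=0}^{d-1} exp(-rate*phi) (rate*phi)^k / k!."""
    x = rate * phi
    return math.exp(-x) * sum(x**k / math.factorial(k) for k in range(d))

def error_prob(phi, n, d, B, eps, c=1.01):
    """Right-hand side of Eq. (24): the Gamma rate is n*eps/(2*B*c*d)."""
    return gamma_tail(phi, d, n * eps / (2 * B * c * d))

# The bound on Pr[error >= 0.1] shrinks as the sample size n grows,
# and n enters only through the product n*eps/B, as noted in the text.
p1 = error_prob(0.1, n=1000, d=10, B=1.0, eps=1.0)
p2 = error_prob(0.1, n=4000, d=10, B=1.0, eps=1.0)
print(p1, p2)
```

The same routine can be used to read off the $n$ needed for a target error probability at given $B$, $d$, and $\epsilon$.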
The bounds are clearly insufficient to demonstrate the asymptotic efficiency of the corresponding methods. For the Laplace mechanism applied to Bayesian network inference, their bound on the excess KL-divergence as a function of the data set size $n$ is
$$O\left(\sqrt{mn \ln n}\right)\left(1 - \exp\left(-\frac{n\epsilon}{2|\mathcal{I}|}\right)\right) + \sqrt{-O\left(\sqrt{mn \ln n}\right) \ln \delta}.$$

B Differentially private linear regression

Let us next consider the linear regression model with fixed noise precision $\Lambda$,
$$y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \Lambda), \quad \boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\beta}_0, \Lambda_0),$$
with $\boldsymbol{\beta}$ as the unknown parameter and $\Lambda$ and $\Lambda_0$ denoting the precision matrices of the corresponding distributions. Let the observed data set be $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with sufficient statistics $nxx = \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T$ and $nxy = \sum_{i=1}^n \mathbf{x}_i y_i$. The non-private posterior precision of $\boldsymbol{\beta}$ is
$$\Lambda_{NP} = \Lambda_0 + \Lambda\, nxx$$
and the corresponding posterior mean is
$$\mu_{NP} = \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0 \boldsymbol{\beta}_0). \tag{25}$$
The corresponding private posterior precision is obtained by replacing $nxx$ with the perturbed version $nxx' = nxx + \Delta$, where $\Delta$ follows the Laplace distribution according to the Laplace mechanism, yielding
$$\Lambda_{DP} = \Lambda_0 + \Lambda(nxx + \Delta).$$
Similarly, using $nxy' = nxy + \delta$ with $\delta$ following the Laplace mechanism, we obtain
$$\mu_{DP} = \Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0 \boldsymbol{\beta}_0). \tag{26}$$
As presented in Methods, a more robust alternative is to assign prior distributions to the precision parameters and then sample the posterior. This requires using the three sufficient statistics $nxx$, $nxy$, and $nyy$, which are perturbed with suitable noise. The mechanism is presented in detail in Algorithm 1 and proven to guarantee differential privacy in Theorem 6. For the theoretical analysis, we study the model with fixed precision parameters and an even privacy budget split between the two needed sufficient statistics. In Algorithm 1 and Theorem 6, this corresponds to setting $p_1 = p_2 = 0.5$ and leaving out the unnecessary term $S_{yy}$.
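Eqs. (25)–(26) amount to one linear solve each. The sketch below is our illustration of the private posterior-mean computation, not the authors' code: it fixes identity precisions and a zero prior mean, draws toy data under an assumed true weight vector, and uses arbitrary illustrative noise scales rather than the calibrated scales of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_mean(nxx, nxy, Lam, Lam0, beta0):
    """Posterior mean of the regression weights, Eqs. (25)/(26):
    (Lam0 + Lam @ nxx)^-1 (Lam @ nxy + Lam0 @ beta0)."""
    return np.linalg.solve(Lam0 + Lam @ nxx, Lam @ nxy + Lam0 @ beta0)

# Toy data: y = x^T beta + noise, d = 3 (beta_true is our assumption)
n, d = 2000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

nxx = X.T @ X
nxy = X.T @ y
Lam, Lam0 = np.eye(d), np.eye(d)   # fixed precisions, as in the theory section
beta0 = np.zeros(d)

mu_np = posterior_mean(nxx, nxy, Lam, Lam0, beta0)

# Laplace-perturbed sufficient statistics; Delta is made symmetric
# because nxx is symmetric (scales here are illustrative only)
T = rng.laplace(scale=2.0, size=(d, d))
Delta = np.triu(T) + np.triu(T, 1).T
delta = rng.laplace(scale=2.0, size=d)
mu_dp = posterior_mean(nxx + Delta, nxy + delta, Lam, Lam0, beta0)

print(np.abs(mu_dp - mu_np).sum())
```

Because the noise is added once to statistics that grow linearly in $n$, the gap between the private and non-private posterior means is small at this sample size, in line with Theorem 7 below.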
B.1 The detailed mechanism

The function Project in Algorithm 1 projects the data points into a useful space, from which the sufficient statistics are computed.

Algorithm 1: Differentially private statistics release ($p_1 + p_2 + p_3 = 1$)

function DiffPriSS(X, Y, ε, B_x, B_y):
    n = |Y|, d = dim(X)
    (C, D) = Project(X, Y, B_x, B_y)
    for i in {1, ..., d}:
        for j in {i, ..., d}:
            P_ij = P_ji ~ Laplace(0, d(d+1)B_x² / (p_1 ε))
    for i in I:
        Q_i ~ Laplace(0, 2 d B_x B_y / (p_2 ε))
    R ~ Laplace(0, B_y² / (p_3 ε))
    S_xx = C C' + P
    S_xy = C D + Q
    S_yy = D D' + R
    return (S_xx, S_xy, S_yy)

function Project(X, Y, B_x, B_y):
    for j = 1 to n:
        for i = 1 to d:
            C_ij = max(−B_x, min(B_x, X_ij))
        D_j = max(−B_y, min(B_y, Y_j))
    return (C, D)

Theorem 6. Algorithm DiffPriSS in Algorithm 1 is $\epsilon$-differentially private.

Proof. (i) $S_{xx} = CC' + P$ is $p_1\epsilon$-differentially private. $S_{xx}$ is a symmetric $d \times d$ matrix with $\frac{d(d+1)}{2}$ degrees of freedom. After Project, $|C|_\infty \le B_x$ and the sensitivity of each element is $\Delta(S_{xx})_{ij} = \sup|c_i c_j - c_i' c_j'| \le 2B_x^2$. Adding Laplace-distributed noise to $(S_{xx})_{ij}$ with scale $b = \frac{d(d+1)B_x^2}{p_1 \epsilon}$ yields an $\epsilon'$-DP mechanism with $\epsilon' = \frac{2 p_1 \epsilon}{d(d+1)}$. Using basic composition [8] over the $\frac{d(d+1)}{2}$ independent dimensions shows that $S_{xx} = CC' + P$ is $p_1\epsilon$-differentially private.

(ii) $CD$ is a $d \times 1$ vector, where $d$ is the cardinality of $\mathcal{I}$, and each element of $CD$ is computed as
$$\forall i \in \mathcal{I}: \quad (CD)_i = \sum_{j=1}^n C_{ij} D_j, \tag{27}$$
where $|C_{ij}| \le B_x$ and $|D_j| \le B_y$, and thus the sensitivity of $CD$ is $2 d B_x B_y$. Thus, $S_{xy} = CD + Q$ is $p_2\epsilon$-differentially private.

(iii) $DD'$ is a scalar computed as
$$DD' = \sum_{j=1}^n D_j^2,$$
where $|D_j| \le B_y$, and thus the sensitivity of $DD'$ is $B_y^2$. Thus, $S_{yy} = DD' + R$ is $p_3\epsilon$-differentially private.

Therefore, releasing $S_{xx}$, $S_{xy}$, and $S_{yy}$ together by DiffPriSS is $\epsilon$-differentially private.
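Algorithm 1 translates quite directly into code. The sketch below is our reading of the mechanism, not a reference implementation: identifier names are ours, the set $\mathcal{I}$ is taken to be all $d$ input dimensions, the matrix orientation follows the numpy convention (rows are data points, so $CC'$ becomes `C.T @ C`), and the default budget split follows the one found best in Fig. 5.

```python
import numpy as np

def project(X, Y, Bx, By):
    """Clamp inputs and outputs to the bounds [-Bx, Bx] and [-By, By]."""
    return np.clip(X, -Bx, Bx), np.clip(Y, -By, By)

def diffpriss(X, Y, eps, Bx, By, p=(0.35, 0.60, 0.05), rng=None):
    """Release Laplace-perturbed sufficient statistics (S_xx, S_xy, S_yy).
    p = (p1, p2, p3) is the privacy budget split with p1 + p2 + p3 = 1."""
    if rng is None:
        rng = np.random.default_rng()
    p1, p2, p3 = p
    n, d = X.shape
    C, D = project(X, Y, Bx, By)

    # Symmetric noise matrix: d(d+1)/2 independent draws for the
    # free entries of the symmetric matrix C^T C
    T = rng.laplace(0.0, d * (d + 1) * Bx**2 / (p1 * eps), size=(d, d))
    P = np.triu(T) + np.triu(T, 1).T

    Q = rng.laplace(0.0, 2 * d * Bx * By / (p2 * eps), size=d)
    R = rng.laplace(0.0, By**2 / (p3 * eps))

    return C.T @ C + P, C.T @ D + Q, D @ D + R

# Example release on toy data
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Y = X @ np.ones(4) + rng.normal(size=500)
Sxx, Sxy, Syy = diffpriss(X, Y, eps=1.0, Bx=3.0, By=6.0, rng=rng)
```

Downstream, the released triple $(S_{xx}, S_{xy}, S_{yy})$ is all that a modeller needs; the raw data never leave the data holder.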
B.2 Asymptotic consistency and efficiency

Theorem 7. Differentially private inference of the posterior mean of the weights of linear regression, with the Laplace mechanism used to perturb the sufficient statistics, is asymptotically consistent with respect to the posterior mean.

Proof. In the following, $\overline{xx} = \frac{1}{n} nxx$ and $\overline{xy} = \frac{1}{n} nxy$ denote the averaged sufficient statistics. Using Eqs. (25)–(26) we can evaluate
$$\begin{aligned}
\|\mu_{DP} - \mu_{NP}\|_1 &= \left\|\Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0\beta_0) - \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&\le \left\|\Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0\beta_0) - \Lambda_{DP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 + \left\|\Lambda_{DP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0) - \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|\Lambda_{DP}^{-1}\Lambda\delta\right\|_1 + \left\|(\Lambda_{DP}^{-1} - \Lambda_{NP}^{-1})(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[(\Lambda_0 + \Lambda(nxx + \Delta))^{-1} - (\Lambda_0 + \Lambda\, nxx)^{-1}\right](\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1.
\end{aligned}$$
Assuming $\overline{xx} > 0$, the first term clearly approaches 0 as $n \to \infty$. For the second term, as $n \to \infty$, $(\tfrac{1}{n}\Lambda_0 + \Lambda(\overline{xx} + \tfrac{1}{n}\Delta))^{-1} \to (\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx})^{-1}$, and as $(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0)$ is bounded, the second term also approaches 0 as $n \to \infty$. This shows that $\mu_{DP}$ converges in probability to $\mu_{NP}$.

Theorem 8. The $\epsilon$-differentially private inference of the posterior mean of the weights of linear regression, with the Laplace mechanism of Algorithm 1 used to perturb the sufficient statistics, is asymptotically efficiently private.

Proof. From the proof of Theorem 7 we have
$$\|\mu_{DP} - \mu_{NP}\|_1 \le \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1. \tag{28}$$
The first term can be bounded easily as
$$\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 = \left\|(\Lambda^{-1}\Lambda_0 + \Delta + nxx)^{-1}\delta\right\|_1 \le \left\|(\Lambda^{-1}\Lambda_0 + \Delta + nxx)^{-1}\right\|_1 \|\delta\|_1 \le \frac{c_1}{n}\left\|(\overline{xx})^{-1}\right\|_1 \|\delta\|_1, \tag{29}$$
where $c_1 > 1$. The bound is valid for any $c_1 > 1$ as $n$ gets large enough. Similarly as in the proof of Theorem 5,
$$\|\delta\|_1 \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{4 d B_x B_y}\right).$$
(30)

Given $\alpha > 0$ we can choose, similarly as in the proof of Theorem 5,
$$C_1 > c_1 F^{-1}\left(1 - \alpha/2;\; d,\; \frac{\epsilon}{4 d B_x B_y}\right) \left\|(\overline{xx})^{-1}\right\|_1,$$
where $F^{-1}(x; a, b)$ is the inverse distribution function of the Gamma distribution with shape $a$ and rate $b$, to ensure that
$$\Pr\left\{\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 > \frac{C_1}{n}\right\} < \frac{\alpha}{2}. \tag{31}$$
The second term can be bounded as
$$\begin{aligned}
&\left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1 \\
&= \left\|\left[\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1} - \left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\right]\left(\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right)\right\|_1 \\
&= \frac{1}{n}\left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1} \Delta \left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\left(\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right)\right\|_1 \\
&\le \frac{1}{n}\left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1}\right\|_1 \|\Delta\|_1 \left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\right\|_1 \left\|\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right\|_1 \\
&\le \frac{c_2}{n}\left\|(\overline{xx})^{-1}\right\|_1 \|\Delta\|_1 \left\|(\overline{xx})^{-1}\right\|_1 \|\overline{xy}\|_1 =: \frac{c_2}{n} B_2,
\end{aligned}$$
where, similarly as in Eq. (29), the bound is valid for any $c_2 > 1$ as $n$ gets large enough. Here $\|\Delta\|_1$ is the $\ell_1$-norm of the matrix $\Delta$, whose elements follow the Laplace distribution $\Delta_{ij} \sim \mathrm{Laplace}(0, 2d(d+1)B_x^2/\epsilon)$. We can bound it as $\|\Delta\|_1 = \max_i \|\Delta_{:i}\|_1$, where the $\Delta_{:i}$ are the row vectors of $\Delta$ and the latter norm is the vector $\ell_1$-norm. Similarly as in Eq. (30) we have
$$\|\Delta_{:i}\|_1 \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{2 d (d + 1) B_x^2}\right) \tag{32}$$
and, as above, given $\alpha > 0$ we can choose
$$C_2 > c_2 F^{-1}\left(1 - \alpha/2;\; d,\; \frac{\epsilon}{2 d (d + 1) B_x^2}\right) \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1,$$
where $F^{-1}(x; a, b)$ is the inverse distribution function of the Gamma distribution, to ensure that
$$\Pr\left\{\frac{c_2}{n} B_2 > \frac{C_2}{n}\right\} < \frac{\alpha}{2}. \tag{33}$$
Combining Eqs. (31) and (33) shows that
$$\Pr\left\{\|\mu_{DP} - \mu_{NP}\|_1 > \frac{C_1 + C_2}{n}\right\} < \alpha. \tag{34}$$

B.3 Convergence rate

Using Chebyshev's inequality together with Eq. (30) we can show that with high probability $\|\delta\|_1 = O\left(\frac{d^2 B_x B_y}{\epsilon}\right)$ and thus
$$\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 = O\left(\frac{d^2 B_x B_y \left\|(\overline{xx})^{-1}\right\|_1}{\epsilon n}\right). \tag{35}$$
Similarly, for the second term we obtain
$$B_2 = O\left(\frac{d^3 B_x^2 \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1}{\epsilon}\right). \tag{36}$$
Combining Eqs.
(28)–(36) yields
$$\|\mu_{DP} - \mu_{NP}\|_1 = O\left(\frac{d^2 B_x B_y \left\|(\overline{xx})^{-1}\right\|_1 + d^3 B_x^2 \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1}{\epsilon n}\right)$$
with high probability.

[Figure 4 comprises four panels for n = 300, 1000, 3000, and 10000; each panel shows Spearman's rank correlation as a function of the projection thresholds $B_x/\sigma_x$ and $B_y/\sigma_y$.]

Figure 4: Illustration of the effect of projecting the outliers in linear regression, for different sample sizes $n$ with 10-dimensional synthetic data, evaluated by Spearman's rank correlation between the predicted and true values. The x and y axes denote the projection thresholds as a function of standard deviations of the data. The top right corner illustrates a projection threshold at 2 standard deviations; no outlier projection would be further to the top right. Higher values (yellow) are better. The result illustrates a clear benefit from the projection for moderate sample sizes, but the benefit decreases for very large sample sizes.

[Figure 5 is a heat map of Spearman's rank correlation coefficient over the privacy budget shares assigned to $nxy$ (x-axis) and $nyy$ (y-axis).]

Figure 5: Optimal privacy budget split between sufficient statistics. Accuracy on a synthetic data set improves as a bigger proportion of the fixed privacy budget is assigned to $nxy$. The best performance is achieved by assigning the term $nyy$ the smallest proportion (5%), the term $nxy$ a large proportion (60%), and the term $nxx$ the remaining 35% of the privacy budget.

[Figure 6 plots Spearman's rank correlation coefficient against the size of the data set (internal + external).]
[Legend of Figure 6: LR non-private data (d = 10); LR non-private data (d = 64); Linear Regression (LR); Private LR; Robust private LR; Output perturbed LR; Functional mechanism LR.]

Figure 6: This is a complement to Figure 2 with more stringent privacy. Here we show Spearman's rank correlation coefficients ($\rho$) between the measured ranking of the cell lines and the ranking predicted by the models using $\epsilon = 1$. The baselines (horizontal dashed lines) are learned on 10 non-private data points; the private algorithms additionally have privacy-protected data (x-axis). The non-private algorithm (LR) has the same amount of additional non-privacy-protected data. All methods use 10-dimensional data except the purple baseline, which shows the best performance achievable with 10 non-private data points. The results are averaged over all drugs and 50-fold Monte Carlo cross-validation; error bars denote the standard deviation over 50 Monte Carlo repeats. The result shows that more data are needed for good prediction performance under more stringent privacy.