Efficient differentially private learning improves drug sensitivity prediction
Authors: Antti Honkela 1,2,3,*,†, Mrinal Das 4,*, Arttu Nieminen 1, Onur Dikmen 1 and Samuel Kaski 4,†

1 Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland
2 Department of Mathematics and Statistics, University of Helsinki, Finland
3 Department of Public Health, University of Helsinki, Finland
4 Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland

Abstract

Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if the other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but the genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy [7, 8] has emerged as a potentially promising solution: privacy is considered sufficient if the presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities [10]. Here we show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately sized data sets, by demonstrating significant improvements with a new robust private regression method in the accuracy of private drug sensitivity prediction [4]. The method combines two key properties not present even in recent proposals [26, 9], which can be generalised to other predictors: we prove it is asymptotically consistently and efficiently private, and demonstrate that it performs well on finite data.
Good finite-data performance is achieved by limiting the sharing of private information: by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. As even this simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields, such as mobile sensing and social media, in addition to the badly needed precision medicine solutions.

* These authors contributed equally to this work.
† These authors jointly supervised the work.

1 Introduction

The widespread collection of private data, both by individuals and by hospitals in the health domain, creates a major opportunity to develop new services by learning predictive models from the data. Privacy-preserving algorithms are required and have been proposed, but for instance anonymisation approaches [1, 18, 17] cannot guarantee privacy against adversaries with additional side information, and are poorly suited for genomic data, where the entire data set is identifying [13]. Guarantees of differential privacy [7, 8] remain valid even under these conditions [8], and differential privacy has become the most popularly studied strong privacy mechanism for learning from data.

2 Efficient differentially private learning

Differential privacy [7, 8] is a formulation of reasonable privacy guarantees for privacy-preserving computation. It gives guarantees about the output of a computation and can be combined with complementary cryptographic approaches such as homomorphic encryption [12] if the computation process needs protection too. An algorithm M operating on a data set D is said to be differentially private if, for any two data sets D and D', differing only by one sample, the ratio of the probabilities of obtaining any specific result c is bounded as

p(M(D) = c) / p(M(D') = c) ≤ exp(ε).
(1)

Because of the symmetry between D and D', the probabilities need to be similar to satisfy the condition. Differential privacy is preserved under post-processing, which makes it flexible to use in complex algorithms. The ε is a privacy parameter interpretable as a privacy budget, with higher values corresponding to less privacy preservation. Differentially private learning algorithms are usually based on perturbing either the input [2, 7], the output [7, 26] or the objective [3, 28].

Here we apply differential privacy to regression. The aim is to learn a model to predict the scalar target y_i from d-dimensional inputs x_i (Fig. 1a) as

y_i = f(x_i) + η_i,

where f is an unknown mapping and η_i represents noise and modelling error. We wish to design a suitable structure for f and a differentially private mechanism for efficiently learning an accurate private f from a data set D = {(x_i, y_i)}_{i=1}^n.

We argue that a practical differentially private algorithm needs to combine two things: (i) it needs to provide asymptotically efficiently private estimators, so that the excess loss incurred from preserving privacy diminishes as the number of samples n in the data set increases; and (ii) it needs to perform well on moderately sized data. While the first requirement of asymptotic efficiency or consistency seems obvious, it is non-trivial to implement in practice and rules out some mechanisms published even quite recently [29]. The requirement was addressed in the Bayesian setting very recently [9], but the method failed to cover the second, equally important criterion. Asymptotically consistently private methods always allow reaching stronger privacy with more samples.

It is difficult to prove optimality of a method on finite data, so good performance needs to be demonstrated empirically. A design strategy for good methods controls the amount of shared private information.
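The guarantee of Eq. (1) is most commonly realised with the Laplace mechanism: add Laplace noise whose scale is the query's worst-case sensitivity divided by ε. A minimal sketch with made-up data (the helper name and the example query are illustrative, not code from the paper):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with eps-differential privacy: Laplace noise with
    scale sensitivity/epsilon masks any single sample's influence."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
# Mean of n values bounded in [0, 1]: replacing one sample moves the mean
# by at most 1/n, so the (bounded-DP) sensitivity of the mean is 1/n.
x = rng.uniform(0.0, 1.0, size=1000)
private_mean = laplace_mechanism(x.mean(), sensitivity=1.0 / len(x),
                                 epsilon=1.0, rng=rng)
```

Note how the noise scale shrinks as 1/n: with more samples, the same ε costs less accuracy, which is the intuition behind the asymptotic efficiency discussed below.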
This has two components: (a) the dimensionality needs to be reduced, to avoid the inherent incompatibility of privacy and high dimensionality, which has been discussed previously [6]; and (b) robustness is introduced by bounding and transforming each variable (feature) to a tighter interval. Controlling the amount of shared information also introduces a trade-off: compared to the non-private setting, decreasing the dimensionality a lot may degrade the performance of the non-private approach, while a corresponding low-dimensional private algorithm may attain higher performance than a higher-dimensional one (see the results and Fig. 3a).

The essence of differential privacy is to inject a sufficient amount of noise to mask the differences between the computation results obtained from neighbouring data sets (differing by only one entry). The definition depends on the worst-case behaviour, which implies that suitably limiting the space of allowed results will reduce the amount of noise needed and potentially improve the results. In the output perturbation framework this can be achieved by bounding the possible outputs [26]. Here we propose a more powerful approach of bounding the data by projecting outliers to tighter bounds. The current standard practice in private learning is to linearly transform the data to desired bounds [28]. This is clearly sub-optimal, as a few outliers can force a very small scale for the other points. A significantly higher signal-to-privacy-noise ratio can be achieved by setting the bounds to cover the essential variation in the data and projecting the outliers separately inside these bounds. This approach also robustifies the analysis against outliers, as the projection can be made independent of the outlier scale. In linear regression we call the resulting model robust private linear regression. It is illustrated in Fig. 1b, c.
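The gain from projecting outliers rather than linearly rescaling can be illustrated numerically (a sketch; the data and the bound B are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1000)
x[:5] = 50.0  # a few extreme outliers

B = 2.5  # bound covering the essential variation of the inliers

# Standard practice: linearly rescale everything into [-B, B];
# the outliers force a very small scale for the remaining points.
x_rescaled = x / np.abs(x).max() * B

# Proposed alternative: project (clip) outliers to the bounds;
# the inliers keep their original scale.
x_projected = np.clip(x, -B, B)

inliers = x[5:]
# Spread of the inliers after each treatment: clipping preserves it,
# rescaling shrinks it by the outlier magnitude.
spread_projected = np.std(np.clip(inliers, -B, B))
spread_rescaled = np.std(inliers / np.abs(x).max() * B)
```

With both treatments the data fit the same interval [-B, B] and hence need the same privacy noise, but the projected version retains far more of the signal.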
3 Results

Genomics is an important domain for privacy-aware modelling, in particular for precision medicine. Many people wish to keep their own and also their relatives' genomes private [19], and simple anonymisation is not sufficient to protect privacy, since a genome is inherently identifiable [13]. Furthermore, individual genomes can be recovered from summary statistics [15] as well as from phenotype data such as gene expression data [14]. On the other hand, previous research has shown that poorly implemented private models may put a patient at severe risk [10].

We apply the robust private linear regression model to predict drug sensitivity given gene expression data, in a setup where a small internal data set can be complemented by a larger set only available under privacy protection (Fig. 1a). We use data from the Genomics of Drug Sensitivity in Cancer (GDSC) project [27], and the setting and evaluation are similar to those in the recent DREAM-NCI drug sensitivity prediction challenge [4].

Figure 1: Differentially private learning of a predictive model. a, The modelling setup; most data (top) are available for learning only if their privacy can be protected. b, Bounding the data increasingly tightly (B; green square) brings 1D robust private linear regression models (blue lines illustrating the distribution of results of the randomised algorithm) closer to the non-private model (black line), as less noise needs to be injected. Blue points: data. c, The data are bounded in robust private linear regression by projecting outliers within the bounds (shown only for a subset of the points).
The sensitivity of each drug is predicted with Bayesian linear regression based on the expression of known cancer genes identified by the GDSC project [27], to limit the dimensionality. We achieve differential privacy by injecting noise into the sufficient statistics computed from the data, using the Laplace mechanism [7]. Full details are presented in Methods. Unlike with previous approaches, prediction accuracy (ranking of new cell lines [4] into sensitive vs insensitive, measured by Spearman's rank correlation; Fig. 2) now improves when more privacy-protected data are received. The proposed non-linear projection of the data to tighter bounds is the key to this success, as without it the method performs as poorly as the earlier ones.

To improve prediction performance in differentially private learning, trade-offs need to be made between dimensionality and amount of data (Fig. 3a), and between strength of privacy guarantees and amount of data (Fig. 3c), but the amount of optional non-private data matters significantly only when there is very little private data (Fig. 3b).

In Secs. A-B in the Supplementary Information we define asymptotic consistency and efficiency of private estimators relative to non-private ones, and prove that the optimal convergence rate of differentially private Bayesian estimators to the corresponding non-private ones is O(1/n) for n samples, which is matched by our method. Unlike existing approaches [22, 24, 23], we compare the private estimators to the corresponding non-private ones, making the theory more easily accessible and more broadly applicable. Robust private linear regression treats non-private and scrambled private data similarly in the model learning.
An interesting next step for further improving the accuracy on very small private data would be to give different weights to the clean and privacy-scrambled data by incorporating knowledge of the injected noise in the Bayesian inference, as has been proposed for generative models [25], but which is non-trivial in regression.

4 Methods

4.1 Linear regression model

The Bayesian linear regression model for scalar target y_i, with d-dimensional input x_i and fixed noise precision λ, is defined by

y_i | x_i ~ N(x_i^T β, λ)
β ~ N(0, λ_0 I), (2)

where β is the unknown parameter to be learnt. The λ and λ_0 are the precision parameters of the corresponding Gaussian distributions, and act as regularisers. Given an observed data set D = {(x_i, y_i)}_{i=1}^n with sufficient statistics nxx = Σ_{i=1}^n x_i x_i^T and nxy = Σ_{i=1}^n x_i y_i, the posterior distribution of β is Gaussian, p(β | D) = N(β; μ_*, Λ_*), with precision

Λ_* = λ_0 I + λ nxx (3)

Figure 2: Accuracy of drug sensitivity prediction in terms of Spearman's rank correlation coefficient over ranking cell lines by sensitivity to a drug (higher is better) increases with the size of private data for the proposed robust private linear regression. (Methods shown: LR non-private data (d = 10); LR non-private data (d = 64); Linear Regression (LR); Private LR; Robust private LR; Output perturbed LR; Functional mechanism LR.) The state-of-the-art methods fail to utilise private data under strict privacy conditions.
The baselines (horizontal dashed lines) are learned on 10 non-private data points; the private algorithms additionally have privacy-protected data (x-axis). The non-private algorithm (LR) has the same amount of additional non-privacy-protected data. All methods use 10-dimensional data, except the purple baseline showing the best performance with 10 non-private data points. Private methods use ε = 2; corresponding results for ε = 1 are in Fig. 6. The results are averaged over all drugs and 50-fold Monte Carlo cross-validation; error bars denote standard deviation over the 50 Monte Carlo repeats. (See Methods for details.)

Figure 3: Key trade-offs in differentially private learning. Relative improvements over the baseline (10 non-private data points). a, As the dimensionality increases, the models without private data improve, whereas more data are needed to improve the performance of the private methods. b, With enough private data, adding more non-private data does not significantly increase the performance. c, More data are needed if privacy guarantees are tighter (ε is smaller). The size of the non-private data is 10 and ε = 2 (except when otherwise noted).

and mean

μ_* = Λ_*^{-1} (λ nxy) (4)

After learning with the training data set, the prediction of y_i using x_i is computed as

ŷ_i = x_i^T μ_*.
(5)

A more robust alternative is to define prior distributions for the precision parameters. In our case, a Gamma prior is assigned to both:

λ ~ Gamma(a, b)
λ_0 ~ Gamma(a_0, b_0). (6)

The posterior can be sampled using computational methods such as automatic differentiation variational inference (ADVI) [16], where we fit a variational distribution to the posterior. The precision parameters and regression coefficients β are then sampled from the fitted distribution. For this purpose, the data likelihood in Eq. (2) needs to be expressed in terms of the sufficient statistics nxx, nxy, and nyy = Σ_{i=1}^n y_i^2, which results in

p(y | X, β, λ) = (λ / 2π)^{n/2} exp( -(λ/2) (β^T nxx β - 2 β^T nxy + nyy) ). (7)

The prediction of y_i is computed using x_i and averaging over a sufficiently large number m of sampled regression coefficients β^{(k)} as

ŷ_i = ∫ p(y | β, X_{test,i}) p(β | D_train) dβ ≈ (1/m) Σ_{k=1}^m x_{test,i}^T β^{(k)}. (8)

For evaluation we keep a part of the data set D aside (not used for training) and, after predicting ŷ_i, we evaluate the error between the actual y_i and ŷ_i. In this paper we do this using Spearman's rank correlation coefficient, to evaluate how well the predictions separate sensitive and insensitive cell lines.

4.2 Differential privacy and efficiency

We apply differential privacy as defined in Eq. (1). We use bounded differential privacy, where two data sets are considered neighbouring if they contain the same number of elements n, with n - 1 equal elements. Compared to the other common alternative of unbounded differential privacy, in which two data sets are considered neighbouring if one is obtained from the other by adding or removing an element, bounded differential privacy makes it clear that the number of samples is not private, which simplifies parameter tuning.
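With fixed precisions, the posterior of Eqs. (3)-(4) and the prediction of Eq. (5) depend on the data only through the sufficient statistics nxx and nxy, which is what makes sufficient-statistic perturbation possible. A minimal sketch with synthetic data (this is the closed-form fixed-precision case, not the paper's PyMC3/ADVI implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
lam, lam0 = 1.0, 1.0  # fixed noise and prior precisions
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Sufficient statistics of the linear regression model (Eq. 2).
nxx = X.T @ X
nxy = X.T @ y

# Posterior precision and mean, Eqs. (3)-(4).
Lambda_star = lam0 * np.eye(d) + lam * nxx
mu_star = np.linalg.solve(Lambda_star, lam * nxy)

# Point prediction for a new input, Eq. (5).
x_new = rng.normal(size=d)
y_hat = x_new @ mu_star
```

The same computation goes through unchanged if nxx and nxy are replaced by their noise-perturbed private counterparts, which is how the mechanism of Sec. 4.3 reuses the non-private learning machinery.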
The privacy parameter values are not directly comparable between the two formalisms, although an ε = k unbounded differentially private mechanism is always an ε = 2k bounded differentially private mechanism.

We define a private parameter estimation mechanism to be asymptotically consistently private if the private estimate converges in probability to the corresponding non-private estimate as the number of samples increases. We show that the optimal rate of convergence of the private estimate to the corresponding non-private Bayesian estimate is O(1/n). Mechanisms reaching this convergence rate are called asymptotically efficiently private. A mechanism for estimating a model is called asymptotically consistently private with respect to a utility function if the utility of the private model converges in probability to the utility of the corresponding non-private model. For full details of these definitions see Supplementary Information sections 1.1-1.2.

4.3 Robust private linear regression

Robust private linear regression is based on perturbing the sufficient statistics nxx = Σ_{i=1}^n x_i x_i^T, nxy = Σ_{i=1}^n x_i y_i, and nyy = Σ_{i=1}^n y_i^2. We use independent ε_i-differentially private Laplace mechanisms [7] for perturbing each statistic, with ε_i = ε p_i for each i = 1, 2, 3 and p_1 + p_2 + p_3 = 1. Together, they provide an ε-differentially private mechanism. We project the outliers in the private data sets to fit the data into the intervals [-B_x, B_x] and [-B_y, B_y] as

x_ij = max(-B_x, min(x_ij, B_x))
y_i = max(-B_y, min(y_i, B_y)). (9)

After the projection, ||x_i||_∞ ≤ B_x and |y_i| ≤ B_y, and we add noise to nxx distributed as Laplace(0, b_xx), to nxy distributed as Laplace(0, b_xy), and to nyy distributed as Laplace(0, b_yy), where the scale parameters are

b_xx = d(d+1) B_x^2 / (ε p_1), b_xy = 2 d B_x B_y / (ε p_2), b_yy = B_y^2 / (ε p_3).
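The perturbation step described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the helper name, the default budget split, and the choice to draw symmetric noise for the symmetric matrix nxx are assumptions made for the example.

```python
import numpy as np

def robust_private_suff_stats(X, y, Bx, By, eps, p=(0.35, 0.60, 0.05), rng=None):
    """Project outliers into [-Bx, Bx] / [-By, By] (Eq. 9) and release
    Laplace-perturbed sufficient statistics under budget split p."""
    rng = rng if rng is not None else np.random.default_rng()
    d = X.shape[1]
    Xc = np.clip(X, -Bx, Bx)  # projection of outliers inside the bounds
    yc = np.clip(y, -By, By)
    nxx = Xc.T @ Xc
    nxy = Xc.T @ yc
    nyy = yc @ yc
    # Laplace scales from the bounded-DP sensitivities stated above.
    b_xx = d * (d + 1) * Bx**2 / (eps * p[0])
    b_xy = 2 * d * Bx * By / (eps * p[1])
    b_yy = By**2 / (eps * p[2])
    # Perturb the unique entries of the symmetric matrix nxx, mirroring
    # the noise so the released matrix stays symmetric.
    noise = rng.laplace(0.0, b_xx, size=(d, d))
    noise = np.triu(noise) + np.triu(noise, 1).T
    return (nxx + noise,
            nxy + rng.laplace(0.0, b_xy, size=d),
            nyy + rng.laplace(0.0, b_yy))

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
y = rng.normal(size=500)
nxx_p, nxy_p, nyy_p = robust_private_suff_stats(X, y, Bx=2.0, By=2.0, eps=2.0, rng=rng)
```

The released triple can then be plugged into the posterior of Eqs. (3)-(4) or the likelihood of Eq. (7) in place of the exact statistics.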
This generalises earlier work on bounded variables [9] to the unbounded case by introducing the projection. A proof that this yields a valid, asymptotically consistent and efficient differentially private mechanism is given in Supplementary Information section 2. We also show that a similar algorithm, applied to the estimation of a Gaussian mean, leads to an asymptotically consistent and efficient private estimate of the posterior mean, while the simpler input perturbation that perturbs the entire data set is not asymptotically consistently private.

The privacy budget proportions p_1, p_2, p_3 and the projection thresholds B_x, B_y are important parameters for good model performance. As illustrated in Fig. 4, the projection thresholds depend strongly on the size of the data set. We propose finding the optimal parameter values on an auxiliary synthetic data set of the same size, which was found to be effective in our case. We generate the auxiliary data set of n samples using a generative model similar to the one specified in Eq. (2):

x_i ~ N(0, I_d)
y_i | x_i ~ N(x_i^T β, λ)
β ~ N(0, λ_0 I), (10)

where d is the dimension.

First we find the optimal budget split p_1, p_2, p_3. For all possible combinations of (p_1, p_2, p_3) ∈ {0.05, 0.1, ..., 0.90}^3 with p_1 + p_2 + p_3 = 1, we project the data using the clipping thresholds for the current split, and we perturb the sufficient statistics according to the current budget split. We compute the prediction as in Eq. (8) using samples drawn from the variational distribution fitted with ADVI, and compute the error with respect to the original values. The error measure we use is Spearman's rank correlation between the original and predicted values. The split (p_1, p_2, p_3) which gives the minimum error is used in all test settings. As illustrated in Fig.
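The budget-split search above enumerates every split on a 0.05 grid summing to one. A sketch of the enumeration (working in integer multiples of 0.05 keeps the sum constraint exact despite floating point):

```python
# Candidate privacy-budget splits (p1, p2, p3) on the grid
# {0.05, 0.10, ..., 0.90}^3 with p1 + p2 + p3 = 1.
splits = [(i * 0.05, j * 0.05, k * 0.05)
          for i in range(1, 19)
          for j in range(1, 19)
          for k in range(1, 19)
          if i + j + k == 20]
```

Each candidate split then drives one projection-and-perturbation run on the auxiliary data, and the split with the best Spearman correlation is kept.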
5, in our experiments the optimal split gives the largest proportion of the privacy budget to the term nxy (60%), the second largest proportion to the term nxx (35%), and the smallest possible proportion to the term nyy (5%).

We parameterise the projection thresholds as a function of the data standard deviation as

B_x = ω_x σ_x, B_y = ω_y σ_y, (11)
ω_x, ω_y ∈ {0.1 ω}_{ω=1}^{20}, (12)

where σ_x and σ_y are the standard deviations of x (considering all dimensions) and y, respectively. With all 400 pairs of (B_x, B_y) as specified above, we apply the outlier projection method of Eq. (9). We perturb the sufficient statistics according to the chosen optimal privacy budget split, fit the model as in Eqs. (3)-(4) using the projected values, and then compute the error with respect to the original values. The pair (ω_x, ω_y) which gives the minimum error is used to define (B_x, B_y) for the real data as in Eq. (11). As the error measure we used Spearman's rank correlation between the original y_{1:n} and the predicted ȳ_{1:n} based on the model learnt with the projected values.

4.4 Data and pre-processing

We used the gene expression and drug sensitivity data from the Genomics of Drug Sensitivity in Cancer (GDSC) project [27, 11] (release 6.1, March 2017, http://www.cancerrxgene.org), consisting of 265 drugs and a panel of 985 human cancer cell lines. The dimensionality of the RMA-normalised gene expression data was reduced from d = 17490 down to 64 based on prior knowledge about genes that are frequently mutated in cancer, provided by the GDSC project at http://www.cancerrxgene.org/translation/Gene. We further ordered the genes based on their mutation counts as reported at http://cancer.sanger.ac.uk/cosmic/curation. Drug responses were quantified by log-transformed IC50 values (the drug concentration yielding a 50% response) from the dose-response data measured at 9 different concentrations.
The mean was first removed from each gene, x_ij := x_ij - mean(x_{1:n,j}), and each data point was normalised to have unit L2 norm, ||x_i||_2 = 1, which focuses the analysis on the relative expression of the selected genes and equalises the contribution of each data point. The mean was removed from the drug sensitivities, y_i := y_i - mean(y_{1:n}). Data with missing drug responses were ignored, making the number of cell lines different across different drugs.

4.5 Experimental setup

We carried out a 50-fold Monte Carlo cross-validation process for different splits of the data set into train and test sets using different random seeds. For each repeat, we randomly split the 985 cell lines into 100 for testing and the rest for training. We further randomly partitioned the training set into 30 non-private cell lines and used the rest as the private data set. In the experiments, we tested non-private data sizes from 0 to 30, and private data sizes from 100 to 800. The hyperparameters for the Gamma priors of the precision parameters λ, λ_0 in Eq. (6) were set to a = b = a_0 = b_0 = 2. The Gamma(2, 2) distribution has mean 1 and variance 1/2, and defines a realistic distribution over sensible values of the precision parameters, which should be larger than zero. We implemented the model and carried out the inference with the PyMC3 Python module [20]. Using ADVI, we fitted a normal distribution with uncorrelated variables to the posterior distribution. We computed the drug response predictions using m = 5000 samples from the fitted variational distribution. We used ADVI because it gives similar results to Hamiltonian Monte Carlo sampling but significantly faster.
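The pre-processing of Sec. 4.4 (gene-wise centring, per-sample unit L2 norm, centred responses) can be sketched as follows; the matrix sizes and values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 8))  # made-up expression matrix
y = rng.normal(loc=-1.0, size=100)                 # made-up log-IC50 responses

X = X - X.mean(axis=0)                              # remove the mean of each gene
X = X / np.linalg.norm(X, axis=1, keepdims=True)    # unit L2 norm per data point
y = y - y.mean()                                    # centre the drug sensitivities
```

Normalising each row to unit norm also conveniently bounds every feature to [-1, 1] before the projection thresholds of Eq. (11) are applied.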
The optimal privacy budget split was based on prediction performance averaged over five auxiliary data sets of 500 synthetic samples (approximately half of the GDSC data set size) and five generated noise samples; for each split, the optimal projection thresholds were chosen similarly, based on average performance over five auxiliary data sets and five noise samples. The prediction for each split was computed using m = 5000 samples drawn from the variational distribution fitted with ADVI. The final optimal projection thresholds for each test case were chosen using the optimal budget split and based on average prediction performance over 20 auxiliary data sets and 20 noise samples. All auxiliary data sets were generated by fixing the precision parameter values to the prior means, λ = λ_0 = 1. The prediction for each pair of projection thresholds was also computed using fixed precision parameters as in Eqs. (3) and (4), as generating samples from the fitted variational distribution for all test cases would have been infeasible in practice.

4.6 Alternative methods used in comparisons

We compared five models: (i) linear regression (LR), as defined in Eq. (2); (ii) robust private LR, the proposed method; (iii) private LR, the proposed method without projection of the outliers; (iv) output perturbed LR [26]; and (v) functional mechanism LR [28]. Output perturbed LR learns the parameters β using the same LR model in Eq. (2), but instead of the statistics, the parameters are perturbed, in a data-independent manner. Our implementation of output perturbed LR makes use of the minConf optimisation package [21]. For functional mechanism LR we used the code publicly available at https://sourceforge.net/projects/functionalmecha/.
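All comparisons are scored with Spearman's rank correlation between measured and predicted sensitivities. For tie-free data it is simply the Pearson correlation of the ranks, which can be written in a few lines (a minimal sketch ignoring ties; a production run would use a ties-aware library implementation):

```python
import numpy as np

def spearman_rank_corr(a, b):
    """Spearman's rank correlation for tie-free vectors:
    Pearson correlation of the rank vectors."""
    ra = np.argsort(np.argsort(a)).astype(float)  # rank of each element of a
    rb = np.argsort(np.argsort(b)).astype(float)  # rank of each element of b
    return np.corrcoef(ra, rb)[0, 1]

# Any monotone relationship scores 1, regardless of its shape.
r = spearman_rank_corr(np.array([0.1, 0.5, 2.0, 3.0]),
                       np.array([1.0, 2.0, 30.0, 40.0]))
```

Ranking-based scoring is a natural fit here, since the task is to separate sensitive from insensitive cell lines rather than to predict exact IC50 values.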
4.7 Alternative interpretation: transformed linear regression

The outlier projection mechanism can also be interpreted as producing a transformed linear regression problem,

φ_y(y_i) | x_i ~ N(φ_x(x_i)^T β, λ), (13)

where the functions φ_y() and φ_x() implementing the outlier projection can be defined as

φ_y(y_i) = max(-B_y, min(B_y, y_i)) (14)
φ_x(x_i) = max(-B_x, min(B_x, x_i)). (15)

The normalisation of the data can also be included as a transformation. This interpretation makes explicit the flexibility in designing the transformations: the differential privacy guarantees will remain valid as long as the transformations obey the bounds

φ_y(y_i) ∈ [-B_y, B_y], φ_x(x_i) ∈ [-B_x, B_x]. (16)

Acknowledgements

We would like to thank Muhammad Ammad-ud-din for assistance in data processing and Otte Heinävaara for assistance in the theoretical analysis. We acknowledge the computational resources provided by the Aalto Science-IT project. This work was funded by the Academy of Finland (Centre of Excellence COIN; and grants 283193 (S.K. and M.D.), 294238 and 292334 (S.K.), 278300 (A.H. and O.D.), 259440 and 283107 (A.H.)).

References

[1] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. 21st Int. Conf. Data Eng. (ICDE 2005), 2005.
[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proc. PODS 2005, 2005.
[3] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Adv. Neural Inf. Process. Syst. 21, 2008.
[4] J. C. Costello, L. M. Heiser, E. Georgii, M. Gönen, M. P. Menden, N. J. Wang, M. Bansal, M. Ammad-ud-din, P. Hintsanen, S. A. Khan, J.-P. Mpindi, O. Kallioniemi, A. Honkela, T. Aittokallio, K. Wennerberg, NCI DREAM Community, J. J. Collins, D. Gallahan, D. Singer, J. Saez-Rodriguez, S. Kaski, J. W. Gray, and G. Stolovitzky.
A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol., 32(12):1202-1212, Dec 2014.
[5] P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. Ann. Stat., 7(2):269-281, Mar 1979.
[6] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. J. ACM, 61(6):1-57, Dec 2014.
[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. TCC 2006, 2006.
[8] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211-407, Aug 2014.
[9] J. Foulds, J. Geumlek, M. Welling, and K. Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. In Proc. UAI 2016, Mar 2016.
[10] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In Proc. 23rd USENIX Security Symp. (USENIX Security 2014), pages 17-32, 2014.
[11] M. J. Garnett et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483(7391):570-575, Mar 2012.
[12] C. Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.
[13] M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science, 339(6117):321-324, Jan 2013.
[14] A. Harmanci and M. Gerstein. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods, 13(3):251-256, Mar 2016.
[15] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet., 4(8):e1000167, Aug 2008.
[16] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference.
J. Mach. Learn. Res., 18(14):1-45, 2017.
[17] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proc. ICDE 2007, 2007.
[18] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. TKDD, 1(1):3, Mar 2007.
[19] M. Naveed et al. Privacy in the genomic era. ACM Comput. Surv., 48(1):1-44, Aug 2015.
[20] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, Apr 2016.
[21] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Proc. AISTATS 2009, 2009.
[22] A. Smith. Efficient, differentially private point estimators. Sept 2008. arXiv:0809.4794 [cs.CR].
[23] Y.-X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. Feb 2015. arXiv:1502.06309 [stat.ML].
[24] L. Wasserman and S. Zhou. A statistical framework for differential privacy. J. Am. Stat. Assoc., 105(489):375-389, Mar 2010.
[25] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In Adv. Neural Inf. Process. Syst. 23, 2010.
[26] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton. Revisiting differentially private regression: Lessons from learning theory and their consequences. Dec 2015. arXiv:1512.06388 [cs.CR].
[27] W. Yang et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res., 41(Database issue):D955-D961, Jan 2013.
[28] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. PVLDB, 5(11):1364-1375, 2012.
[29] Z. Zhang, B. Rubinstein, and C. Dimitrakakis.
On the differential privacy of Bayesian inference. In Proc. AAAI 2016, 2016.

Supplementary Information

A Theoretical background

We argue that effective differentially private predictive modelling methods can be developed by a combination of:

i. An asymptotically efficiently private mechanism for which the effect of the noise added to guarantee privacy vanishes as the number of samples increases; and

ii. A way to limit the amount of private information to be shared. This yields better performance on finite data, as less noise needs to be added for equivalent privacy. This can be achieved through a combination of two things:

a. An approach to decrease the dimensionality of the data prior to the application of the private algorithm; and

b. A method to focus the privacy guarantees on relevant variation in the data.

Criterion i can be formally stated through the additional loss in accuracy or utility of the estimates because of privacy. Our main asymptotic result is that the optimal convergence rate of a differentially private mechanism to a Bayesian estimate is $O(1/n)$, which can be reached by our proposed mechanism. Criterion ii is non-asymptotic and thus more difficult to address theoretically. It manifests itself in the constants in the convergence rates, as well as in the empirical findings on the effect of dimensionality reduction and of projecting outliers to tighter bounds, as discussed in the main text and in Fig. 4.

A.1 Definition of asymptotic efficiency

We begin by formalising the theory behind Criterion i.

Definition 1. A differentially private mechanism $M$ is asymptotically consistent with respect to an estimated parameter $\theta$ if the private estimates $\hat{\theta}_M$ given a data set $\mathcal{D}$ converge in probability to the corresponding non-private estimates $\hat{\theta}_{NP}$ as the number of samples, $n = |\mathcal{D}|$, grows without bound, i.e., if for any $\alpha > 0$,
$$\lim_{n \to \infty} \Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > \alpha\} = 0.$$

Definition 2.
A differentially private mechanism $M$ is asymptotically efficiently private with respect to an estimated parameter $\theta$ if the mechanism is asymptotically consistent and the private estimates $\hat{\theta}_M$ converge to the corresponding non-private estimates $\hat{\theta}_{NP}$ at the rate $O(1/n)$, i.e., if for any $\alpha > 0$ there exist constants $C, N$ such that
$$\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > C/n\} < \alpha$$
for all $n \ge N$.

(Footnote: We use $\alpha$ in limit expressions instead of the usual $\epsilon$ to avoid confusion with $\epsilon$-differential privacy.)

The term asymptotically efficiently private in the above definition is justified by the following theorem, which shows that the rate $O(1/n)$ is optimal for estimating expectation parameters of exponential family distributions. As it seems unlikely that better rates could be obtained for more difficult problems, we conjecture that this rate cannot be beaten for Bayesian estimates in general.

Theorem 1. The private estimates $\hat{\theta}_M$ of an exponential family posterior expectation parameter $\theta$, generated by a differentially private mechanism $M$ that achieves $\epsilon$-differential privacy for any $\epsilon > 0$, cannot converge to the corresponding non-private estimates $\hat{\theta}_{NP}$ at a rate faster than $1/n$. That is, assuming $M$ is $\epsilon$-differentially private, there exists no function $f(n)$ such that $\limsup n f(n) = 0$ and for all $\alpha > 0$ there exists a constant $N$ such that $\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > f(n)\} < \alpha$ for all $n \ge N$.

Proof. The non-private estimate of an expectation parameter of an exponential family is [5]
$$\hat{\theta}_{NP} \mid x_1, \ldots, x_n = \frac{n_0 x_0 + \sum_{i=1}^n x_i}{n_0 + n}. \tag{17}$$
The difference of the estimates from two neighbouring data sets differing by one element is
$$(\hat{\theta}_{NP} \mid \mathcal{D}) - (\hat{\theta}_{NP} \mid \mathcal{D}') = \frac{x - y}{n_0 + n}, \tag{18}$$
where $x$ and $y$ are the corresponding mismatched elements. Let $\Delta = \max(\|x - y\|)$, and let $\mathcal{D}$ and $\mathcal{D}'$ be neighbouring data sets including these maximally different elements.
Let us assume that there exists a function $f(n)$ such that $\limsup n f(n) = 0$ and for all $\alpha > 0$ there exists a constant $N$ such that $\Pr\{\|\hat{\theta}_M - \hat{\theta}_{NP}\| > f(n)\} < \alpha$ for all $n \ge N$. Fix $\alpha > 0$ and choose $M \ge \max(N, n_0)$ such that $f(n) \le \Delta/(4n)$ for all $n \ge M$. This implies that
$$\|(\hat{\theta}_{NP} \mid \mathcal{D}) - (\hat{\theta}_{NP} \mid \mathcal{D}')\| = \frac{\Delta}{n_0 + n} \ge \frac{\Delta}{2n} \ge 2 f(n). \tag{19}$$
Let us define the region $C_{\mathcal{D}} = \{ t \mid \|(\hat{\theta}_{NP} \mid \mathcal{D}) - t\| < f(n) \}$. Based on our assumptions we have
$$\Pr(\hat{\theta}_M \mid \mathcal{D} \in C_{\mathcal{D}}) > 1 - \alpha \tag{20}$$
$$\Pr(\hat{\theta}_M \mid \mathcal{D}' \in C_{\mathcal{D}}) < \alpha, \tag{21}$$
which implies that
$$\frac{\Pr(\hat{\theta}_M \mid \mathcal{D} \in C_{\mathcal{D}})}{\Pr(\hat{\theta}_M \mid \mathcal{D}' \in C_{\mathcal{D}})} > \frac{1 - \alpha}{\alpha}, \tag{22}$$
which means that $M$ cannot be differentially private with $\epsilon < \log((1 - \alpha)/\alpha) \to \infty$ as $\alpha \to 0$.

A.2 Different utility functions

Definition 3. Let $U(\hat{\theta}_{NP}(\mathcal{D}))$ measure the utility of the non-private model $\hat{\theta}_{NP}$ estimated from data set $\mathcal{D}$, and let $U(\hat{\theta}_M(\mathcal{D}))$ measure the corresponding utility of the private model $\hat{\theta}_M$ obtained using a differentially private mechanism $M$. The mechanism $M$ is asymptotically consistent with respect to a bounded utility $U$ if the random variables $U(\hat{\theta}_M(\mathcal{D}))$ converge in probability to $U(\hat{\theta}_{NP}(\mathcal{D}))$ as the number of samples, $n = |\mathcal{D}|$, grows without bound, i.e., if for any $\alpha > 0$,
$$\lim_{n \to \infty} \Pr\{|U(\hat{\theta}_M(\mathcal{D})) - U(\hat{\theta}_{NP}(\mathcal{D}))| > \alpha\} = 0.$$

Theorem 2. A differentially private mechanism $M$ that is asymptotically consistent with respect to a set of parameters is asymptotically consistent with respect to any continuous utility that only depends on those parameters.

Proof. If $\hat{\theta}_M$ converges in probability to $\hat{\theta}_{NP}$, then by the continuous mapping theorem the value of $U(\hat{\theta}_M)$ converges in probability to $U(\hat{\theta}_{NP})$.

A.3 Example: Gaussian mean

Theorem 3.
Differentially private inference of the mean of a Gaussian variable, with the Laplace mechanism used to perturb the sufficient statistics, is asymptotically consistent with respect to the posterior mean.

Proof. Let us consider the model
$$x_i \sim \mathcal{N}(\mu, \Lambda), \quad \mu \sim \mathcal{N}(\mu_0, \Lambda_0),$$
with $\mu$ as the unknown parameter and $\Lambda$ and $\Lambda_0$ denoting the fixed prior precision matrices of the noise and the mean, respectively. We assume $\|x_i\|_1 \le B$ and enforce this by projecting the larger elements to satisfy this bound. Let the observed data set be $\mathcal{D} = \{x_i\}_{i=1}^n$ with sufficient statistic $n\bar{x} = \sum_{i=1}^n x_i$. The non-private posterior mean is
$$\mu_{NP} = (\Lambda_0 + n\Lambda)^{-1}(\Lambda n\bar{x} + \Lambda_0 \mu_0).$$
The corresponding private posterior mean is obtained by replacing $n\bar{x}$ with the perturbed version $n\bar{x}' = n\bar{x} + \delta$, where $\delta = (\delta_1, \ldots, \delta_d)^T \in \mathbb{R}^d$ with $\delta_j \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$ and $d = \dim(x_i)$, yielding
$$\mu_{DP} = (\Lambda_0 + n\Lambda)^{-1}(\Lambda(n\bar{x} + \delta) + \Lambda_0 \mu_0).$$
The difference of the private and non-private means is
$$\|\mu_{DP} - \mu_{NP}\|_1 = \|(\Lambda_0 + n\Lambda)^{-1}(\Lambda\delta)\|_1 = \|(\Lambda^{-1}\Lambda_0 + n I)^{-1}\delta\|_1 \le \frac{c}{n}\|\delta\|_1,$$
which is valid for all $c > 1$ for large enough $n$. This implies that
$$\Pr\{\|\mu_{DP} - \mu_{NP}\|_1 \ge \alpha\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 \ge \alpha\right\} \to 0$$
as $n \to \infty$ for all $\alpha > 0$.

Theorem 4. Differentially private inference of the mean of a Gaussian variable with the Laplace mechanism used to perturb the input data set (naive input perturbation) is not asymptotically consistent with respect to the posterior mean.

Proof. The mechanism is almost the same as in Theorem 3, but we now have $n\bar{x}' = n\bar{x} + \sum_{i=1}^n \delta_i$, where $\delta_i = (\delta_{i1}, \ldots, \delta_{id})^T \in \mathbb{R}^d$ with $\delta_{ij} \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$. A similar computation as above yields
$$\|\mu_{DP} - \mu_{NP}\|_1 = \left\|(\Lambda_0 + n\Lambda)^{-1}\left(\Lambda \sum_{i=1}^n \delta_i\right)\right\|_1 = \left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + I\right)^{-1} \tfrac{1}{n}\sum_{i=1}^n \delta_i\right\|_1 \ge \frac{1}{2}\left\|\tfrac{1}{n}\sum_{i=1}^n \delta_i\right\|_1$$
for sufficiently large $n$.
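The contrast between perturbing the sufficient statistics once (Theorem 3) and perturbing every input point (Theorem 4) can be checked numerically. The sketch below is our illustration, not the authors' code: it fixes identity precision matrices with $\mu_0 = 0$, uses coordinate-wise clipping as a crude stand-in for the $\ell_1$ projection, and picks illustrative values of $B$ and $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_mean_suff_stats(x, B, eps):
    """Perturb the sufficient statistic sum(x) once with Laplace noise
    of scale 2*B*d/eps, then form the posterior mean with identity
    precisions and zero prior mean: (I + n I)^-1 (sum + noise)."""
    n, d = x.shape
    x = np.clip(x, -B, B)  # stand-in for the l1 projection of the paper
    s = x.sum(axis=0) + rng.laplace(0.0, 2 * B * d / eps, size=d)
    return s / (1.0 + n)

def private_mean_input_perturbation(x, B, eps):
    """Naive input perturbation: add Laplace noise to every data point."""
    n, d = x.shape
    x = np.clip(x, -B, B) + rng.laplace(0.0, 2 * B * d / eps, size=(n, d))
    return x.sum(axis=0) / (1.0 + n)

errors = {}
for n in (100, 10000):
    x = rng.normal(0.0, 1.0, size=(n, 3))
    nonprivate = np.clip(x, -5.0, 5.0).sum(axis=0) / (1.0 + n)
    errors[n] = (
        np.abs(private_mean_suff_stats(x, 5.0, 1.0) - nonprivate).sum(),
        np.abs(private_mean_input_perturbation(x, 5.0, 1.0) - nonprivate).sum(),
    )
print(errors)
```

With this seed the sufficient-statistics error shrinks roughly as $1/n$, while the input-perturbation error decays much more slowly, matching the asymptotic picture above.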
By the central limit theorem, the distribution of $\frac{1}{\sqrt{n}} \sum_{i=1}^n \delta_i$ converges to a Gaussian with non-zero variance. Hence $\mu_{DP}$ does not converge to $\mu_{NP}$ for large $n$, and the method is not asymptotically consistent.

A.3.1 Asymptotic efficiency

Theorem 5. The $\epsilon$-differentially private estimate of the mean of a $d$-dimensional Gaussian variable $x$ bounded by $\|x_i\|_1 \le B$, in which the Laplace mechanism is used to perturb the sufficient statistics, is asymptotically efficiently private.

Proof. In the proof of Theorem 3 we showed that
$$\|\mu_{DP} - \mu_{NP}\|_1 \le \frac{c}{n}\|\delta\|_1,$$
where $\delta = (\delta_1, \ldots, \delta_d)^T \in \mathbb{R}^d$ with $\delta_j \sim \mathrm{Laplace}(0, 2Bd/\epsilon)$. Because $\delta_j$ is Laplace, $|\delta_j|$ is exponential with $|\delta_j| \sim \mathrm{Exponential}(\epsilon/(2Bd))$ and
$$\|\delta\|_1 = \sum_{j=1}^d |\delta_j| \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{2Bd}\right).$$
Given $\alpha > 0$ we can choose $C > c F^{-1}(1 - \alpha;\, d,\, \epsilon/(2Bd))$, where $F^{-1}(x; a, b)$ is the inverse cumulative distribution function of the Gamma distribution with shape $a$ and rate $b$, to ensure that
$$\Pr\left\{\|\mu_{DP} - \mu_{NP}\|_1 > \frac{C}{n}\right\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 > \frac{C}{n}\right\} = \Pr\{\|\delta\|_1 > C/c\} < \alpha. \tag{23}$$

A.3.2 Convergence rate

We can further study the probability of making an error of at least a given magnitude as
$$\Pr\{\|\mu_{DP} - \mu_{NP}\|_1 \ge \phi\} \le \Pr\left\{\frac{c}{n}\|\delta\|_1 \ge \phi\right\} = \Pr\left\{\mathrm{Gamma}\left(d, \frac{n\epsilon}{2Bcd}\right) \ge \phi\right\} = 1 - F\left(\phi;\, d,\, \frac{n\epsilon}{2Bcd}\right) = 1 - \frac{\gamma\left(d, \frac{n\phi\epsilon}{2Bcd}\right)}{\Gamma(d)}, \tag{24}$$
where $F(x; a, b)$ is the cumulative distribution function of the Gamma distribution with shape $a$ and rate $b$. The formula in Eq. (24) unfortunately has no simple closed-form expression. The result shows, however, that the $n$ required to reach a certain level of performance is linear in $B$ and $1/\epsilon$. The dependence on $d$ is complicated, but it is in general super-linear, as suggested by the mean of the gamma distribution in Eq. (24), $\frac{2Bcd^2}{\epsilon n}$.

A.4 Example: Zhang et al., AAAI 2016

In their paper, Zhang et al. [29] derive utility bounds for a number of mechanisms.
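Although Eq. (24) has no simple closed form, for integer shape $d$ the Gamma survival function is an Erlang survival function and can be evaluated exactly with a short Poisson sum. The sketch below (our illustration; the parameter values are arbitrary) evaluates the error-probability bound of Eq. (24) and shows how it tightens with growing $n$.

```python
import math

def gamma_tail(phi, d, rate):
    """Pr[Gamma(shape=d, rate=rate) >= phi] for integer shape d,
    via the closed-form Erlang survival function
    sum_{k=0}^{d-1} exp(-rate*phi) (rate*phi)^k / k!."""
    x = rate * phi
    return math.exp(-x) * sum(x**k / math.factorial(k) for k in range(d))

def error_prob(phi, n, d, B, eps, c=1.01):
    """Right-hand side of Eq. (24): the Gamma rate is n*eps/(2*B*c*d)."""
    return gamma_tail(phi, d, n * eps / (2 * B * c * d))

# The bound on Pr[error >= 0.1] shrinks as the sample size n grows,
# and n enters only through the product n*eps/B, as noted in the text.
p1 = error_prob(0.1, n=1000, d=10, B=1.0, eps=1.0)
p2 = error_prob(0.1, n=4000, d=10, B=1.0, eps=1.0)
print(p1, p2)
```

The same routine can be used to read off the $n$ needed for a target error probability at given $B$, $d$, and $\epsilon$.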
The bounds are clearly insufficient to demonstrate the asymptotic efficiency of the corresponding methods. For the Laplace mechanism applied to Bayesian network inference, their bound on the excess KL-divergence as a function of the data set size $n$ is
$$O\left(\sqrt{mn \ln n}\right)\left(1 - \exp\left(-\frac{n\epsilon}{2|\mathcal{I}|}\right)\right) + \sqrt{-O\left(\sqrt{mn \ln n}\right) \ln \delta}.$$

B Differentially private linear regression

Let us next consider the linear regression model with fixed noise precision $\Lambda$,
$$y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \Lambda), \quad \boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\beta}_0, \Lambda_0),$$
with $\boldsymbol{\beta}$ as the unknown parameter and $\Lambda$ and $\Lambda_0$ denoting the precision matrices of the corresponding distributions. Let the observed data set be $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with sufficient statistics $nxx = \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T$ and $nxy = \sum_{i=1}^n \mathbf{x}_i y_i$. The non-private posterior precision of $\boldsymbol{\beta}$ is
$$\Lambda_{NP} = \Lambda_0 + \Lambda\, nxx$$
and the corresponding posterior mean is
$$\mu_{NP} = \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0 \boldsymbol{\beta}_0). \tag{25}$$
The corresponding private posterior precision is obtained by replacing $nxx$ with the perturbed version $nxx' = nxx + \Delta$, where $\Delta$ follows the Laplace distribution according to the Laplace mechanism, yielding
$$\Lambda_{DP} = \Lambda_0 + \Lambda(nxx + \Delta).$$
Similarly, using $nxy' = nxy + \delta$ with $\delta$ following the Laplace mechanism, we obtain
$$\mu_{DP} = \Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0 \boldsymbol{\beta}_0). \tag{26}$$
As presented in Methods, a more robust alternative is to assign prior distributions to the precision parameters and then sample the posterior. This requires using the three sufficient statistics $nxx$, $nxy$, and $nyy$, which are perturbed with suitable noise. The mechanism is presented in detail in Algorithm 1 and proven to guarantee differential privacy in Theorem 6. For the theoretical analysis, we study the model with fixed precision parameters and an even privacy budget split between the two needed sufficient statistics. In Algorithm 1 and Theorem 6, this corresponds to setting $p_1 = p_2 = 0.5$ and leaving out the unnecessary term $S_{yy}$.
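Eqs. (25)–(26) amount to one linear solve each. The sketch below is our illustration of the private posterior-mean computation, not the authors' code: it fixes identity precisions and a zero prior mean, draws toy data under an assumed true weight vector, and uses arbitrary illustrative noise scales rather than the calibrated scales of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_mean(nxx, nxy, Lam, Lam0, beta0):
    """Posterior mean of the regression weights, Eqs. (25)/(26):
    (Lam0 + Lam @ nxx)^-1 (Lam @ nxy + Lam0 @ beta0)."""
    return np.linalg.solve(Lam0 + Lam @ nxx, Lam @ nxy + Lam0 @ beta0)

# Toy data: y = x^T beta + noise, d = 3 (beta_true is our assumption)
n, d = 2000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

nxx = X.T @ X
nxy = X.T @ y
Lam, Lam0 = np.eye(d), np.eye(d)   # fixed precisions, as in the theory section
beta0 = np.zeros(d)

mu_np = posterior_mean(nxx, nxy, Lam, Lam0, beta0)

# Laplace-perturbed sufficient statistics; Delta is made symmetric
# because nxx is symmetric (scales here are illustrative only)
T = rng.laplace(scale=2.0, size=(d, d))
Delta = np.triu(T) + np.triu(T, 1).T
delta = rng.laplace(scale=2.0, size=d)
mu_dp = posterior_mean(nxx + Delta, nxy + delta, Lam, Lam0, beta0)

print(np.abs(mu_dp - mu_np).sum())
```

Because the noise is added once to statistics that grow linearly in $n$, the gap between the private and non-private posterior means is small at this sample size, in line with Theorem 7 below.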
B.1 The detailed mechanism

The function Project in Algorithm 1 projects the data points into a useful space, from which the sufficient statistics are computed.

Algorithm 1: Differentially private statistics release ($p_1 + p_2 + p_3 = 1$)

function DiffPriSS(X, Y, ε, B_x, B_y):
    n = |Y|, d = dim(X)
    (C, D) = Project(X, Y, B_x, B_y)
    for i in {1, ..., d}:
        for j in {i, ..., d}:
            P_ij = P_ji ~ Laplace(0, d(d+1)B_x² / (p_1 ε))
    for i in I:
        Q_i ~ Laplace(0, 2 d B_x B_y / (p_2 ε))
    R ~ Laplace(0, B_y² / (p_3 ε))
    S_xx = C C' + P
    S_xy = C D + Q
    S_yy = D D' + R
    return (S_xx, S_xy, S_yy)

function Project(X, Y, B_x, B_y):
    for j = 1 to n:
        for i = 1 to d:
            C_ij = max(−B_x, min(B_x, X_ij))
        D_j = max(−B_y, min(B_y, Y_j))
    return (C, D)

Theorem 6. Algorithm DiffPriSS in Algorithm 1 is $\epsilon$-differentially private.

Proof. (i) $S_{xx} = CC' + P$ is $p_1\epsilon$-differentially private. $S_{xx}$ is a symmetric $d \times d$ matrix with $\frac{d(d+1)}{2}$ degrees of freedom. After Project, $|C|_\infty \le B_x$ and the sensitivity of each element is $\Delta(S_{xx})_{ij} = \sup|c_i c_j - c_i' c_j'| \le 2B_x^2$. Adding Laplace-distributed noise to $(S_{xx})_{ij}$ with scale $b = \frac{d(d+1)B_x^2}{p_1 \epsilon}$ yields an $\epsilon'$-DP mechanism with $\epsilon' = \frac{2 p_1 \epsilon}{d(d+1)}$. Using basic composition [8] over the $\frac{d(d+1)}{2}$ independent dimensions shows that $S_{xx} = CC' + P$ is $p_1\epsilon$-differentially private.

(ii) $CD$ is a $d \times 1$ vector, where $d$ is the cardinality of $\mathcal{I}$, and each element of $CD$ is computed as
$$\forall i \in \mathcal{I}: \quad (CD)_i = \sum_{j=1}^n C_{ij} D_j, \tag{27}$$
where $|C_{ij}| \le B_x$ and $|D_j| \le B_y$, and thus the sensitivity of $CD$ is $2 d B_x B_y$. Thus, $S_{xy} = CD + Q$ is $p_2\epsilon$-differentially private.

(iii) $DD'$ is a scalar computed as
$$DD' = \sum_{j=1}^n D_j^2,$$
where $|D_j| \le B_y$, and thus the sensitivity of $DD'$ is $B_y^2$. Thus, $S_{yy} = DD' + R$ is $p_3\epsilon$-differentially private.

Therefore, releasing $S_{xx}$, $S_{xy}$, and $S_{yy}$ together by DiffPriSS is $\epsilon$-differentially private.
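Algorithm 1 translates quite directly into code. The sketch below is our reading of the mechanism, not a reference implementation: identifier names are ours, the set $\mathcal{I}$ is taken to be all $d$ input dimensions, the matrix orientation follows the numpy convention (rows are data points, so $CC'$ becomes `C.T @ C`), and the default budget split follows the one found best in Fig. 5.

```python
import numpy as np

def project(X, Y, Bx, By):
    """Clamp inputs and outputs to the bounds [-Bx, Bx] and [-By, By]."""
    return np.clip(X, -Bx, Bx), np.clip(Y, -By, By)

def diffpriss(X, Y, eps, Bx, By, p=(0.35, 0.60, 0.05), rng=None):
    """Release Laplace-perturbed sufficient statistics (S_xx, S_xy, S_yy).
    p = (p1, p2, p3) is the privacy budget split with p1 + p2 + p3 = 1."""
    if rng is None:
        rng = np.random.default_rng()
    p1, p2, p3 = p
    n, d = X.shape
    C, D = project(X, Y, Bx, By)

    # Symmetric noise matrix: d(d+1)/2 independent draws for the
    # free entries of the symmetric matrix C^T C
    T = rng.laplace(0.0, d * (d + 1) * Bx**2 / (p1 * eps), size=(d, d))
    P = np.triu(T) + np.triu(T, 1).T

    Q = rng.laplace(0.0, 2 * d * Bx * By / (p2 * eps), size=d)
    R = rng.laplace(0.0, By**2 / (p3 * eps))

    return C.T @ C + P, C.T @ D + Q, D @ D + R

# Example release on toy data
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Y = X @ np.ones(4) + rng.normal(size=500)
Sxx, Sxy, Syy = diffpriss(X, Y, eps=1.0, Bx=3.0, By=6.0, rng=rng)
```

Downstream, the released triple $(S_{xx}, S_{xy}, S_{yy})$ is all that a modeller needs; the raw data never leave the data holder.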
B.2 Asymptotic consistency and efficiency

Theorem 7. Differentially private inference of the posterior mean of the weights of linear regression, with the Laplace mechanism used to perturb the sufficient statistics, is asymptotically consistent with respect to the posterior mean.

Proof. In the following, $\overline{xx} = \frac{1}{n} nxx$ and $\overline{xy} = \frac{1}{n} nxy$ denote the averaged sufficient statistics. Using Eqs. (25)–(26) we can evaluate
$$\begin{aligned}
\|\mu_{DP} - \mu_{NP}\|_1 &= \left\|\Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0\beta_0) - \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&\le \left\|\Lambda_{DP}^{-1}(\Lambda(nxy + \delta) + \Lambda_0\beta_0) - \Lambda_{DP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 + \left\|\Lambda_{DP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0) - \Lambda_{NP}^{-1}(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|\Lambda_{DP}^{-1}\Lambda\delta\right\|_1 + \left\|(\Lambda_{DP}^{-1} - \Lambda_{NP}^{-1})(\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[(\Lambda_0 + \Lambda(nxx + \Delta))^{-1} - (\Lambda_0 + \Lambda\, nxx)^{-1}\right](\Lambda\, nxy + \Lambda_0\beta_0)\right\|_1 \\
&= \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1.
\end{aligned}$$
Assuming $\overline{xx} > 0$, the first term clearly approaches 0 as $n \to \infty$. For the second term, as $n \to \infty$, $(\tfrac{1}{n}\Lambda_0 + \Lambda(\overline{xx} + \tfrac{1}{n}\Delta))^{-1} \to (\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx})^{-1}$, and as $(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0)$ is bounded, the second term also approaches 0 as $n \to \infty$. This shows that $\mu_{DP}$ converges in probability to $\mu_{NP}$.

Theorem 8. The $\epsilon$-differentially private inference of the posterior mean of the weights of linear regression, with the Laplace mechanism of Algorithm 1 used to perturb the sufficient statistics, is asymptotically efficiently private.

Proof. From the proof of Theorem 7 we have
$$\|\mu_{DP} - \mu_{NP}\|_1 \le \left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 + \left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1. \tag{28}$$
The first term can be bounded easily as
$$\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 = \left\|(\Lambda^{-1}\Lambda_0 + \Delta + nxx)^{-1}\delta\right\|_1 \le \left\|(\Lambda^{-1}\Lambda_0 + \Delta + nxx)^{-1}\right\|_1 \|\delta\|_1 \le \frac{c_1}{n}\left\|(\overline{xx})^{-1}\right\|_1 \|\delta\|_1, \tag{29}$$
where $c_1 > 1$. The bound is valid for any $c_1 > 1$ as $n$ gets large enough. Similarly as in the proof of Theorem 5,
$$\|\delta\|_1 \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{4 d B_x B_y}\right).$$
(30)

Given $\alpha > 0$ we can choose, similarly as in the proof of Theorem 5,
$$C_1 > c_1 F^{-1}\left(1 - \alpha/2;\; d,\; \frac{\epsilon}{4 d B_x B_y}\right) \left\|(\overline{xx})^{-1}\right\|_1,$$
where $F^{-1}(x; a, b)$ is the inverse distribution function of the Gamma distribution with shape $a$ and rate $b$, to ensure that
$$\Pr\left\{\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 > \frac{C_1}{n}\right\} < \frac{\alpha}{2}. \tag{31}$$
The second term can be bounded as
$$\begin{aligned}
&\left\|\left[\left(\tfrac{1}{n}\Lambda_0 + \Lambda\left(\overline{xx} + \tfrac{1}{n}\Delta\right)\right)^{-1} - \left(\tfrac{1}{n}\Lambda_0 + \Lambda\,\overline{xx}\right)^{-1}\right]\left(\Lambda\,\overline{xy} + \tfrac{1}{n}\Lambda_0\beta_0\right)\right\|_1 \\
&= \left\|\left[\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1} - \left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\right]\left(\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right)\right\|_1 \\
&= \frac{1}{n}\left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1} \Delta \left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\left(\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right)\right\|_1 \\
&\le \frac{1}{n}\left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx} + \tfrac{1}{n}\Delta\right)^{-1}\right\|_1 \|\Delta\|_1 \left\|\left(\tfrac{1}{n}\Lambda^{-1}\Lambda_0 + \overline{xx}\right)^{-1}\right\|_1 \left\|\overline{xy} + \tfrac{1}{n}\Lambda^{-1}\Lambda_0\beta_0\right\|_1 \\
&\le \frac{c_2}{n}\left\|(\overline{xx})^{-1}\right\|_1 \|\Delta\|_1 \left\|(\overline{xx})^{-1}\right\|_1 \|\overline{xy}\|_1 =: \frac{c_2}{n} B_2,
\end{aligned}$$
where, similarly as in Eq. (29), the bound is valid for any $c_2 > 1$ as $n$ gets large enough. Here $\|\Delta\|_1$ is the $\ell_1$-norm of the matrix $\Delta$, whose elements follow the Laplace distribution $\Delta_{ij} \sim \mathrm{Laplace}(0, 2d(d+1)B_x^2/\epsilon)$. We can bound it as $\|\Delta\|_1 = \max_i \|\Delta_{:i}\|_1$, where the $\Delta_{:i}$ are the row vectors of $\Delta$ and the latter norm is the vector $\ell_1$-norm. Similarly as in Eq. (30) we have
$$\|\Delta_{:i}\|_1 \sim \mathrm{Gamma}\left(d, \frac{\epsilon}{2 d (d + 1) B_x^2}\right) \tag{32}$$
and, as above, given $\alpha > 0$ we can choose
$$C_2 > c_2 F^{-1}\left(1 - \alpha/2;\; d,\; \frac{\epsilon}{2 d (d + 1) B_x^2}\right) \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1,$$
where $F^{-1}(x; a, b)$ is the inverse distribution function of the Gamma distribution, to ensure that
$$\Pr\left\{\frac{c_2}{n} B_2 > \frac{C_2}{n}\right\} < \frac{\alpha}{2}. \tag{33}$$
Combining Eqs. (31) and (33) shows that
$$\Pr\left\{\|\mu_{DP} - \mu_{NP}\|_1 > \frac{C_1 + C_2}{n}\right\} < \alpha. \tag{34}$$

B.3 Convergence rate

Using Chebyshev's inequality together with Eq. (30) we can show that with high probability $\|\delta\|_1 = O\left(\frac{d^2 B_x B_y}{\epsilon}\right)$ and thus
$$\left\|(\Lambda_0 + \Lambda(nxx + \Delta))^{-1}\Lambda\delta\right\|_1 = O\left(\frac{d^2 B_x B_y \left\|(\overline{xx})^{-1}\right\|_1}{\epsilon n}\right). \tag{35}$$
Similarly, for the second term we obtain
$$B_2 = O\left(\frac{d^3 B_x^2 \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1}{\epsilon}\right). \tag{36}$$
Combining Eqs.
(28)–(36) yields
$$\|\mu_{DP} - \mu_{NP}\|_1 = O\left(\frac{d^2 B_x B_y \left\|(\overline{xx})^{-1}\right\|_1 + d^3 B_x^2 \left\|(\overline{xx})^{-1}\right\|_1^2 \|\overline{xy}\|_1}{\epsilon n}\right)$$
with high probability.

[Figure 4 comprises four panels for n = 300, 1000, 3000, and 10000; each panel shows Spearman's rank correlation as a function of the projection thresholds $B_x/\sigma_x$ and $B_y/\sigma_y$.]

Figure 4: Illustration of the effect of projecting the outliers in linear regression, for different sample sizes $n$ with 10-dimensional synthetic data, evaluated by Spearman's rank correlation between the predicted and true values. The x and y axes denote the projection thresholds as a function of standard deviations of the data. The top right corner illustrates a projection threshold at 2 standard deviations; no outlier projection would be further to the top right. Higher values (yellow) are better. The result illustrates a clear benefit from the projection for moderate sample sizes, but the benefit decreases for very large sample sizes.

[Figure 5 is a heat map of Spearman's rank correlation coefficient over the privacy budget shares assigned to $nxy$ (x-axis) and $nyy$ (y-axis).]

Figure 5: Optimal privacy budget split between sufficient statistics. Accuracy on a synthetic data set improves as a bigger proportion of the fixed privacy budget is assigned to $nxy$. The best performance is achieved by assigning the term $nyy$ the smallest proportion (5%), the term $nxy$ a large proportion (60%), and the term $nxx$ the remaining 35% of the privacy budget.

[Figure 6 plots Spearman's rank correlation coefficient against the size of the data set (internal + external).]
[Legend of Figure 6: LR non-private data (d = 10); LR non-private data (d = 64); Linear Regression (LR); Private LR; Robust private LR; Output perturbed LR; Functional mechanism LR.]

Figure 6: This is a complement to Figure 2 with more stringent privacy. Here we show Spearman's rank correlation coefficients ($\rho$) between the measured ranking of the cell lines and the ranking predicted by the models using $\epsilon = 1$. The baselines (horizontal dashed lines) are learned on 10 non-private data points; the private algorithms additionally have privacy-protected data (x-axis). The non-private algorithm (LR) has the same amount of additional non-privacy-protected data. All methods use 10-dimensional data except the purple baseline, which shows the best performance achievable with 10 non-private data points. The results are averaged over all drugs and 50-fold Monte Carlo cross-validation; error bars denote the standard deviation over 50 Monte Carlo repeats. The result shows that more data are needed for good prediction performance under more stringent privacy.