Wishart Mechanism for Differentially Private Principal Components Analysis


Authors: Wuxuan Jiang, Cong Xie, Zhihua Zhang

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
jiangwuxuan@gmail.com, {xcgoner, zhihua}@sjtu.edu.cn

Abstract

We propose a new input perturbation mechanism for publishing a covariance matrix to achieve $(\epsilon, 0)$-differential privacy. Our mechanism uses a Wishart distribution to generate matrix noise. In particular, we apply this mechanism to principal component analysis (PCA). Our mechanism is able to keep the positive semi-definiteness of the published covariance matrix. Thus, our approach gives rise to a general publishing framework for input perturbation of a symmetric positive semidefinite matrix. Moreover, compared with the classic Laplace mechanism, our method has a better utility guarantee. To the best of our knowledge, the Wishart mechanism is the best input perturbation approach for $(\epsilon, 0)$-differentially private PCA. We also compare our work with previous exponential mechanism algorithms in the literature and provide a near-optimal bound while having more flexibility and less computational cost.

1 Introduction

Plenty of machine learning tasks deal with sensitive information such as financial and medical data. A common concern regarding data security arises on account of the rapid development of data mining techniques. Several data privacy definitions have been proposed in the literature. Among them, differential privacy (DP) has been widely used (Dwork et al. 2006). Differential privacy controls the fundamental quantity of information that can be revealed by changing one individual. Beyond a concept in database security, differential privacy has been used by many researchers to develop privacy-preserving learning algorithms (Chaudhuri and Monteleoni 2009; Chaudhuri, Monteleoni, and Sarwate 2011; Bojarski et al. 2014).
Indeed, this class of algorithms has been applied to a large number of machine learning models, including logistic regression (Chaudhuri and Monteleoni 2009), support vector machines (Chaudhuri, Monteleoni, and Sarwate 2011), random decision trees (Bojarski et al. 2014), etc. Accordingly, these methods can protect the raw data even though the output and the algorithm itself are published. Differential privacy (DP) aims to hide individual information while keeping basic statistics of the whole dataset.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

A simple idea to achieve this purpose is to add some special noise to the original model. After that, an attacker who has two outputs generated by slightly different inputs cannot distinguish whether the output change comes from the artificial noise or from the input difference. However, the noise might degrade the regular performance of the model. Thus, we should carefully trade off between privacy and utility. Whatever the procedure is (a query, a learning algorithm, a game strategy, or something else), we can define differential privacy as long as the procedure takes a dataset as input and returns a corresponding output.

In this paper, we study the problem of designing differentially private principal component analysis (PCA). PCA reduces the data dimension while keeping the optimal variance. More specifically, it finds a projection matrix by computing a low-rank approximation to the sample covariance matrix of the given data points. Privacy-preserving PCA is a well-studied problem in the literature (Dwork et al. 2014; Hardt and Roth 2012; 2013; Hardt and Price 2014; Blum et al. 2005; Chaudhuri, Sarwate, and Sinha 2012; Kapralov and Talwar 2013). It outputs a noisy projection matrix for dimension reduction while preserving the privacy of any single data point.
The extant privacy-preserving PCA algorithms have been devised based on two major features: the notion of differential privacy and the stage of randomization. Accordingly, the privacy-preserving PCA algorithms can be divided into distinct categories.

The notion of differential privacy has two types: $(\epsilon, 0)$-DP (also called pure DP) and $(\epsilon, \delta)$-DP (also called approximate DP). $(\epsilon, \delta)$-DP is a weaker version of $(\epsilon, 0)$-DP, as the former allows the privacy guarantee to be broken with tiny probability (more precisely, $\delta$). In the seminal work on privacy-preserving PCA (Dwork et al. 2014; Hardt and Roth 2012; 2013; Hardt and Price 2014; Blum et al. 2005), the authors used the notion of $(\epsilon, \delta)$-DP. In contrast, there is only a little work (Chaudhuri, Sarwate, and Sinha 2012; Kapralov and Talwar 2013) based on $(\epsilon, 0)$-DP.

In terms of the stage of randomization, there are two mainstream classes of approaches. The first randomly computes the eigenspace (Hardt and Roth 2013; Hardt and Price 2014; Chaudhuri, Sarwate, and Sinha 2012; Kapralov and Talwar 2013); the noise is added during the computing procedure. An alternative way is to add noise directly to the covariance matrix and then run a non-private eigenspace computation algorithm to produce the output. This class of approaches is called input perturbation (Blum et al. 2005; Dwork et al. 2014). Input perturbation algorithms publish a noisy sample covariance matrix before computing the eigenspace; thus, any further operation on the noisy covariance matrix does not violate the privacy guarantee. As far as flexibility is concerned, input perturbation performs better because it is not limited to computing the eigenspace. Besides, the input perturbation approach is efficient because the only extra effort is generating the noise. In view of these advantages, our mechanism for privacy-preserving PCA is also based on input perturbation.
Related Work

Blum et al. (2005) proposed an early input perturbation framework (named SULQ), and the parameters of its noise were refined by Dwork et al. (2006). Dwork et al. (2014) proved the state-of-the-art utility bounds for $(\epsilon, \delta)$-DP. Hardt and Roth (2012) provided a better bound under a coherence assumption. In (Hardt and Roth 2013; Hardt and Price 2014), the authors used a noisy power method to produce the principal eigenvectors iteratively, removing the previously generated ones. Hardt and Price (2014) provided a special case for $(\epsilon, 0)$-DP as well.

Chaudhuri, Sarwate, and Sinha (2012) proposed the first useful privacy-preserving PCA algorithm for $(\epsilon, 0)$-DP based on the exponential mechanism (McSherry and Talwar 2007). Kapralov and Talwar (2013) argued that the algorithm in (Chaudhuri, Sarwate, and Sinha 2012) lacks a convergence-time guarantee and uses heuristic tests to check convergence of the sampling chain, which may affect the privacy guarantee. They also devised a mixed algorithm for low-rank matrix approximation. However, their algorithm is quite complicated to implement and takes $O(d^6/\epsilon)$ running time, where $d$ is the dimension of the data points.

Our work is mainly inspired by Dwork et al. (2014). Since they provided algorithms for $(\epsilon, \delta)$-DP, we seek a similar approach for $(\epsilon, 0)$-DP with a different noise matrix design. As input perturbation methods, Blum et al. (2005) and Dwork et al. (2014) both used a symmetric Gaussian noise matrix for privately publishing a noisy covariance matrix. A reasonable worry is that the published matrix might no longer be positive semidefinite, a natural attribute of a covariance matrix.

Contribution and Organization

In this paper we propose a new mechanism for privacy-preserving PCA that we call the Wishart mechanism. The key idea is to add a Wishart noise matrix to the original sample covariance matrix.
A Wishart matrix is always positive semidefinite, which in turn makes the perturbed covariance matrix positive semidefinite. Additionally, a Wishart matrix can be regarded as the scatter matrix of some random Gaussian vectors (Gupta and Nagar 2000). Consequently, our Wishart mechanism equivalently adds Gaussian noise to the original data points.

Setting appropriate parameters of the Wishart distribution, we derive the $(\epsilon, 0)$-privacy guarantee (Theorem 4). Compared to the classic Laplace mechanism, our Wishart mechanism adds less noise (Section 4), which implies that our mechanism always has a better utility bound. We also provide a general framework for choosing Laplace or Wishart input perturbation for $(\epsilon, 0)$-DP in Section 4.

Beyond using the Laplace mechanism as a baseline, we also conduct a theoretical analysis comparing our work with other privacy-preserving PCA algorithms based on $(\epsilon, 0)$-DP. With respect to the different criteria, we provide a sample complexity bound (Theorem 7) for comparison with Chaudhuri, Sarwate, and Sinha (2012) and derive the low-rank approximation closeness when comparing to Kapralov and Talwar (2013). Beyond the principal eigenvector guarantee in (Chaudhuri, Sarwate, and Sinha 2012), we have a guarantee for rank-$k$ subspace closeness (Theorem 6). Using a stronger definition of adjacent matrices, we achieve a $k$-free utility bound (Theorem 9). Converting the lower bound construction in (Chaudhuri, Sarwate, and Sinha 2012; Kapralov and Talwar 2013) into our setting, we can see that the Wishart mechanism is near-optimal.

The remainder of the paper is organized as follows. Section 2 gives the notation and definitions used in our paper. Section 3 lists the baseline and our designed algorithms. Section 4 provides a thorough analysis of the privacy and utility guarantees of our mechanism, together with comparisons to several highly related works. Finally, we conclude the work in Section 5.
Note that we put some proofs and more explanation in the supplementary material.

2 Preliminaries

We first give some notation that will be used in this paper. Let $I_m$ denote the $m \times m$ identity matrix. Given an $m \times n$ real matrix $Z = [Z_{ij}]$, let its full singular value decomposition (SVD) be $Z = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal (i.e., $U^T U = I_m$ and $V^T V = I_n$), and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the $i$th diagonal entry $\sigma_i$ being the $i$th largest singular value of $Z$. Assume that the rank of $Z$ is $\rho$ ($\leq \min(m, n)$). This implies that $Z$ has $\rho$ nonzero singular values. Let $U_k$ and $V_k$ be the first $k$ ($< \rho$) columns of $U$ and $V$, respectively, and $\Sigma_k$ be the top $k \times k$ sub-block of $\Sigma$. Then the $m \times n$ matrix $Z_k = U_k \Sigma_k V_k^T$ is the best rank-$k$ approximation to $Z$.

The Frobenius norm of $Z$ is defined as $\|Z\|_F = \sqrt{\sum_{i,j} Z_{ij}^2} = \sqrt{\sum_{i=1}^{\rho} \sigma_i^2}$, the spectral norm is defined as $\|Z\|_2 = \max_{x \neq 0} \frac{\|Zx\|_2}{\|x\|_2} = \sigma_1$, the nuclear norm is defined as $\|Z\|_* = \sum_{i=1}^{\rho} \sigma_i$, and the $\ell_{1,1}$ norm is defined as $\|Z\|_{1,1} = \sum_{i,j} |Z_{ij}|$.

Given a set of $n$ raw data points $X = [x_1, \cdots, x_n]$ where $x_i \in \mathbb{R}^d$, we consider the problem of publishing a noisy empirical sample covariance matrix for doing PCA. Following previous work on privacy-preserving PCA, we also assume $\|x_i\|_2 \leq 1$. Standard PCA computes the sample covariance matrix of the raw data, $A = \frac{1}{n} X X^T = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T$. Since $A$ is a $d \times d$ symmetric positive semidefinite matrix, its SVD is equivalent to its spectral decomposition; that is, $A = V \Sigma V^T$. PCA uses $V_k$ as the projection matrix to compute the low-dimensional representation of the raw data: $Y \triangleq V_k^T X$.

In this work we use the Laplace and Wishart distributions, which are defined as follows.

Definition 1. A random variable $z$ is said to have a Laplace distribution $z \sim \mathrm{Lap}(\mu, b)$ if its probability density function is
$$p(z) = \frac{1}{2b} \exp\Big(-\frac{|z - \mu|}{b}\Big).$$
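To make the notation of this section concrete, the following NumPy sketch (our own illustration, not part of the paper) computes the sample covariance $A = \frac{1}{n} X X^T$, its spectral decomposition, the PCA projection $Y = V_k^T X$, and the best rank-$k$ approximation; all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 100, 2

# Raw data points x_i with ||x_i||_2 <= 1, stored as columns of X.
X = rng.standard_normal((d, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))

# Sample covariance A = (1/n) X X^T; symmetric PSD, so its SVD
# coincides with its spectral decomposition.
A = X @ X.T / n
eigvals, V = np.linalg.eigh(A)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]    # re-sort descending
eigvals, V = eigvals[order], V[:, order]

V_k = V[:, :k]                       # top-k projection matrix
Y = V_k.T @ X                        # low-dimensional representation

# Best rank-k approximation A_k = V_k Sigma_k V_k^T.
A_k = V_k @ np.diag(eigvals[:k]) @ V_k.T
```

As a sanity check, the spectral norm of $A - A_k$ equals the $(k{+}1)$th largest eigenvalue, matching the optimality of the rank-$k$ truncation.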
Definition 2 ((Gupta and Nagar 2000)). A $d \times d$ random symmetric positive definite matrix $W$ is said to have a Wishart distribution $W \sim W_d(m, C)$ if its probability density function is
$$p(W) = \frac{|W|^{\frac{m-d-1}{2}}}{2^{\frac{md}{2}} |C|^{\frac{m}{2}} \Gamma_d(\frac{m}{2})} \exp\Big(-\frac{1}{2} \mathrm{tr}(C^{-1} W)\Big),$$
where $m > d - 1$ and $C$ is a $d \times d$ positive definite matrix.

Now we introduce the formal definition of differential privacy.

Definition 3. A randomized mechanism $M$ takes a dataset $D$ as input and outputs a structure $s \in \mathcal{R}$, where $\mathcal{R}$ is the range of $M$. For any two adjacent datasets $D$ and $\hat{D}$ (with only one distinct entry), $M$ is said to be $(\epsilon, 0)$-differentially private if for all $S \subseteq \mathcal{R}$ we have
$$\Pr\{M(D) \in S\} \leq e^{\epsilon} \Pr\{M(\hat{D}) \in S\},$$
where $\epsilon > 0$ is a small parameter controlling the strength of the privacy requirement.

This definition sets a limit on the similarity of the output probability distributions for similar inputs. Here adjacent datasets can have several different interpretations. In the scenario of privacy-preserving PCA, our definition is as follows. Two datasets $X$ and $\hat{X}$ are adjacent provided $X = [x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n]$ and $\hat{X} = [x_1, \ldots, x_{i-1}, \hat{x}_i, x_{i+1}, \ldots, x_n]$ for $x_i \neq \hat{x}_i$. It should be pointed out that our definition of adjacent datasets is slightly different from that of (Kapralov and Talwar 2013), which leads to a significant difference in utility bounds. We will give a more specific discussion in Section 4.

We also give the definition of $(\epsilon, \delta)$-differential privacy. This notion requires less privacy protection, so it often brings a better utility guarantee.

Definition 4. A randomized mechanism $M$ takes a dataset as input and outputs a structure $s \in \mathcal{R}$, where $\mathcal{R}$ is the range of $M$. For any two adjacent datasets $D$ and $\hat{D}$ (with only one distinct entry), $M$ is said to be $(\epsilon, \delta)$-differentially private if for all $S \subseteq \mathcal{R}$ we have
$$\Pr[M(D) \in S] \leq e^{\epsilon} \Pr[M(\hat{D}) \in S] + \delta.$$
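The Wishart distribution of Definition 2 connects to the scatter-matrix view mentioned in the introduction: a draw $W \sim W_d(m, C)$ can be generated as $G G^T$, where the $m$ columns of $G$ are i.i.d. $\mathcal{N}(0, C)$ vectors. A minimal sketch (the function name is ours):

```python
import numpy as np

def sample_wishart(m, C, rng):
    """Draw W ~ W_d(m, C) as the scatter matrix of m i.i.d. N(0, C) vectors."""
    d = C.shape[0]
    L = np.linalg.cholesky(C)            # C = L L^T
    G = L @ rng.standard_normal((d, m))  # columns are i.i.d. N(0, C)
    return G @ G.T

rng = np.random.default_rng(1)
d = 4
C = 0.5 * np.eye(d)                      # positive definite scale matrix
W = sample_wishart(d + 1, C, rng)        # m = d + 1 > d - 1, as Definition 2 requires
```

By construction $W$ is symmetric and positive semidefinite (positive definite almost surely when $m \geq d$), which is the property the Wishart mechanism exploits.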
Sensitivity analysis is a general approach to achieving differential privacy. The following definitions show the two typical kinds of sensitivity.

Definition 5. The $\ell_1$ sensitivity is defined as
$$s_1(M) = \max_{d(D, \hat{D}) = 1} \big\| M(D) - M(\hat{D}) \big\|_1.$$
The $\ell_2$ sensitivity is defined as
$$s_2(M) = \max_{d(D, \hat{D}) = 1} \big\| M(D) - M(\hat{D}) \big\|_2.$$

The sensitivity describes the largest possible change resulting from the replacement of an individual data entry. The $\ell_1$ sensitivity is used in the Laplace mechanism for $(\epsilon, 0)$-differential privacy, while the $\ell_2$ sensitivity is used in the Gaussian mechanism for $(\epsilon, \delta)$-differential privacy. We list the two mechanisms for comparison.

Theorem 1 (Laplace Mechanism). Let $\lambda > s_1(M)/\epsilon$. Adding Laplace noise $\mathrm{Lap}(0, \lambda)$ to each dimension of $M(D)$ provides $(\epsilon, 0)$-differential privacy.

Theorem 2 (Gaussian Mechanism). For $c^2 > 2 \ln(1.25/\delta)$, let $\sigma > c \, s_2(M)/\epsilon$. Adding Gaussian noise $\mathcal{N}(0, \sigma^2)$ to each dimension of $M(D)$ provides $(\epsilon, \delta)$-differential privacy.

The above mechanisms are both perturbation methods. Another widely used method is the exponential mechanism (McSherry and Talwar 2007), which is based on sampling techniques.

3 Algorithms

First we look at a general framework for privacy-preserving PCA. According to the definition of differential privacy, a privacy-preserving PCA algorithm takes the raw data matrix $X$ as input and then calculates the sample covariance matrix $A = \frac{1}{n} X X^T$. Finally, it computes the top-$k$ subspace of $A$ as the output.

The traditional approach adds noise in the computing procedure. For example, Chaudhuri, Sarwate, and Sinha (2012) and Kapralov and Talwar (2013) used a sampling-based mechanism while computing eigenvectors to obtain approximate results. Our mechanism instead adds noise in the first stage, publishing $A$ in a differentially private manner. Thus, our mechanism takes $X$ as input and outputs a noisy version of $A$.
Afterwards, we follow standard PCA to compute the top-$k$ subspace. This can be seen as a differentially private preprocessing procedure.

Our baseline is the Laplace mechanism (Algorithm 1 and Theorem 1). To the best of our knowledge, the Laplace mechanism is the only existing input perturbation method for $(\epsilon, 0)$-DP PCA. Since this private procedure ends before computing the subspace, we have $M(D) = \frac{1}{n} D D^T$ in the sensitivity definition.

Note that to make $\hat{A}$ symmetric, we use a symmetric matrix-variate Laplace distribution in Algorithm 1. However, this mechanism cannot guarantee the positive semi-definiteness of $\hat{A}$, a desirable attribute for a covariance matrix. This motivates us to use Wishart noise instead, giving rise to the Wishart mechanism in Algorithm 2.

Algorithm 1 Laplace input perturbation
Input: Raw data matrix $X \in \mathbb{R}^{d \times n}$; privacy parameter $\epsilon$; number of data points $n$.
1: Draw $\frac{d^2 + d}{2}$ i.i.d. samples from $\mathrm{Lap}(0, \frac{2d}{n\epsilon})$ and form a symmetric matrix $L$: the samples fill the upper triangle, and each entry in the lower triangle is copied from the opposite position.
2: Compute $A = \frac{1}{n} X X^T$.
3: Add noise: $\hat{A} = A + L$.
Output: $\hat{A}$.

Algorithm 2 Wishart input perturbation
Input: Raw data matrix $X \in \mathbb{R}^{d \times n}$; privacy parameter $\epsilon$; number of data points $n$.
1: Draw a sample $W$ from $W_d(d + 1, C)$, where $C$ has $d$ identical eigenvalues equal to $\frac{3}{2n\epsilon}$.
2: Compute $A = \frac{1}{n} X X^T$.
3: Add noise: $\hat{A} = A + W$.
Output: $\hat{A}$.

4 Analysis

In this section we conduct a theoretical analysis of Algorithms 1 and 2 under the framework of differentially private matrix publishing. The theoretical support has two parts: the privacy guarantee and the utility guarantee. The former is the essential requirement for privacy-preserving algorithms, and the latter tells how well the algorithm works compared with a non-private version. Chiefly, we list the valuable theorems and analysis.
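Before the analysis, the two input perturbation algorithms above can be sketched in NumPy as follows (function names are ours; the Wishart draw uses the Gaussian scatter-matrix construction, and $C = \frac{3}{2n\epsilon} I_d$ as in Algorithm 2):

```python
import numpy as np

def laplace_input_perturbation(X, eps, rng):
    """Algorithm 1 sketch: A = (1/n) X X^T plus a symmetric Laplace noise matrix."""
    d, n = X.shape
    L = rng.laplace(scale=2.0 * d / (n * eps), size=(d, d))
    L = np.triu(L)                     # keep the upper triangle (incl. diagonal)...
    L = L + L.T - np.diag(np.diag(L))  # ...and mirror it into the lower triangle
    return X @ X.T / n + L

def wishart_input_perturbation(X, eps, rng):
    """Algorithm 2 sketch: A = (1/n) X X^T plus W ~ W_d(d + 1, (3/(2 n eps)) I_d)."""
    d, n = X.shape
    scale = 3.0 / (2.0 * n * eps)                         # eigenvalues of C
    G = np.sqrt(scale) * rng.standard_normal((d, d + 1))  # columns ~ N(0, C)
    return X @ X.T / n + G @ G.T

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 200))
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))   # enforce ||x_i||_2 <= 1
A_lap = laplace_input_perturbation(X, eps=1.0, rng=rng)
A_wis = wishart_input_perturbation(X, eps=1.0, rng=rng)
```

Note that the Wishart output is always positive semidefinite (a sum of two PSD matrices), while the Laplace output may lose this property.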
All the technical proofs omitted here can be found in the supplementary material.

Privacy guarantee

We first show that both algorithms satisfy the privacy guarantee. Suppose there are two adjacent datasets $X = [x_1, \ldots, v, \ldots, x_n] \in \mathbb{R}^{d \times n}$ and $\hat{X} = [x_1, \ldots, \hat{v}, \ldots, x_n] \in \mathbb{R}^{d \times n}$ where $v \neq \hat{v}$ (i.e., only $v$ and $\hat{v}$ are distinct). Without loss of generality, we further assume that each data vector has $\ell_2$ norm at most 1.

Theorem 3. Algorithm 1 provides $(\epsilon, 0)$-differential privacy.

This theorem can be quickly proved by some simple derivations, so we put the proof in the supplementary material.

Theorem 4. Algorithm 2 provides $(\epsilon, 0)$-differential privacy.

Proof. Assume the outputs for the adjacent inputs $X$ and $\hat{X}$ are identical (denoted $A + W_0$). Here $A = \frac{1}{n} X X^T$ and $\hat{A} = \frac{1}{n} \hat{X} \hat{X}^T$. We define the difference matrix $\Delta \triangleq A - \hat{A} = \frac{1}{n}(v v^T - \hat{v} \hat{v}^T)$. The privacy guarantee amounts to bounding the following ratio:
$$\frac{p(A + W = A + W_0)}{p(\hat{A} + W = A + W_0)} = \frac{p(W = W_0)}{p(W = A + W_0 - \hat{A})} = \frac{p(W = W_0)}{p(W = W_0 + \Delta)}.$$
As $W \sim W_d(d + 1, C)$, we have
$$\frac{p(W = W_0)}{p(W = W_0 + \Delta)} = \frac{\exp[-\frac{1}{2} \mathrm{tr}(C^{-1} W_0)]}{\exp[-\frac{1}{2} \mathrm{tr}(C^{-1}(W_0 + \Delta))]} = \exp\Big[\frac{1}{2}\big(\mathrm{tr}(C^{-1}(W_0 + \Delta)) - \mathrm{tr}(C^{-1} W_0)\big)\Big] = \exp\Big[\frac{1}{2} \mathrm{tr}(C^{-1} \Delta)\Big].$$
Then we apply Von Neumann's trace inequality: for matrices $A, B \in \mathbb{R}^{d \times d}$ with $i$th largest singular values denoted $\sigma_i(\cdot)$, it holds that $|\mathrm{tr}(AB)| \leq \sum_{i=1}^{d} \sigma_i(A) \sigma_i(B)$. Hence
$$\exp\Big[\frac{1}{2} \mathrm{tr}(C^{-1} \Delta)\Big] \leq \exp\Big[\frac{1}{2} \sum_{i=1}^{d} \sigma_i(C^{-1}) \sigma_i(\Delta)\Big] \leq \exp\Big[\frac{1}{2} \|C^{-1}\|_2 \|\Delta\|_*\Big].$$
Since $\Delta = A - \hat{A} = \frac{1}{n}(v v^T - \hat{v} \hat{v}^T)$ has rank at most 2, by the singular value inequality $\sigma_{i+j-1}(A + B) \leq \sigma_i(A) + \sigma_j(B)$ we can bound $\|\Delta\|_*$:
$$n \|\Delta\|_* \leq \sigma_1(v v^T) + \sigma_1(-\hat{v}\hat{v}^T) + \max\{\sigma_1(v v^T) + \sigma_2(-\hat{v}\hat{v}^T), \; \sigma_2(v v^T) + \sigma_1(-\hat{v}\hat{v}^T)\} = \sigma_1(v v^T) + \sigma_1(\hat{v}\hat{v}^T) + \max\{\sigma_1(v v^T), \sigma_1(\hat{v}\hat{v}^T)\} \leq 3 \max\{\|v\|_2^2, \|\hat{v}\|_2^2\} \leq 3.$$
In Algorithm 2, the scale matrix $C$ of the Wishart distribution has $d$ identical eigenvalues equal to $\frac{3}{2n\epsilon}$, which implies $\|C^{-1}\|_2 = \frac{2n\epsilon}{3}$. Substituting these terms into the bound above yields
$$\frac{p(A + W = A + W_0)}{p(\hat{A} + W = A + W_0)} \leq \exp\Big[\frac{1}{2} \|C^{-1}\|_2 \|\Delta\|_*\Big] \leq \exp\Big[\frac{1}{2} \cdot \frac{2n\epsilon}{3} \cdot \frac{3}{n}\Big] = e^{\epsilon}.$$

Utility guarantee

Next we give bounds on how far the noisy results are from optimal. Since the Laplace and Wishart mechanisms are both input perturbation methods, their analyses are similar. To ensure the privacy guarantee, we add a noise matrix to the input data. Such noise may affect the properties of the original matrix. For input perturbation methods, the magnitude of the noise matrix directly determines how large the effect is; for example, if the magnitude of the noise matrix is even larger than that of the data, the matrix after perturbation is surely dominated by noise. A better utility bound means less noise added. We choose the spectral norm of the noise matrix to measure its magnitude. Since we are investigating the privacy-preserving PCA problem, we mainly care about the usefulness of the subspace of the top-$k$ singular vectors.

The noise matrix in the Laplace mechanism is constructed with $\frac{d^2 + d}{2}$ i.i.d. random variables drawn from $\mathrm{Lap}(0, 2d/(n\epsilon))$. Using the tail bound for a matrix ensemble in (Tao 2012), the spectral norm of the noise matrix in Algorithm 1 satisfies $\|L\|_2 = O(2d\sqrt{d}/(n\epsilon))$ with high probability. Then we turn to analyzing the Wishart mechanism.
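The key quantity in the privacy proof above is the nuclear-norm bound $n\|\Delta\|_* \leq 3$. A quick numerical sanity check (our own sketch, not part of the paper) over random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 6
for _ in range(100):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    v_hat = rng.standard_normal(d)
    v_hat /= np.linalg.norm(v_hat)
    # Difference matrix Delta = (1/n)(v v^T - v_hat v_hat^T) from the proof.
    Delta = (np.outer(v, v) - np.outer(v_hat, v_hat)) / n
    nuc = np.linalg.norm(Delta, ord='nuc')     # nuclear norm: sum of singular values
    assert n * nuc <= 3.0 + 1e-9               # the bound n * ||Delta||_* <= 3
    assert np.linalg.matrix_rank(Delta) <= 2   # Delta has rank at most 2
```

In fact the difference of two rank-one unit projections always has nuclear norm at most 2, so the bound of 3 holds with room to spare.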
We use the tail bound of the Wishart distribution given in (Zhu 2012):

Lemma 1 (Tail Bound of the Wishart Distribution). Let $W \sim W_d(m, C)$. Then for $\theta \geq 0$, with probability at most $d \exp(-\theta)$,
$$\lambda_1(W) \geq \big(m + \sqrt{2m\theta(r + 2)} + 2\theta r\big) \lambda_1(C),$$
where $r = \mathrm{tr}(C)/\|C\|_2$.

In our setting, $r = d$ and $m = d + 1$, so with probability at most $d \exp(-\theta)$,
$$\lambda_1(W) \geq \big(d + 1 + \sqrt{2(d + 1)(d + 2)\theta} + 2\theta d\big) \lambda_1(C).$$
Let $\theta = c \log d$ ($c > 1$). Then $d \exp(-\theta) = d^{1 - c}$, so we can say that with high probability
$$\lambda_1(W) = O\big(\big[d + 1 + \sqrt{2(d + 1)(d + 2)\theta} + 2\theta d\big] \lambda_1(C)\big).$$
For convenience, we write $\lambda_1(W) = O(d \log d \cdot \lambda_1(C)) = O(3 d \log d/(2 n \epsilon))$.

We can see that the spectral norm of the noise matrix generated by the Wishart mechanism is $O(d \log d/(n\epsilon))$, while the Laplace mechanism requires $O(d\sqrt{d}/(n\epsilon))$. This implies that the Wishart mechanism adds less noise to obtain the privacy guarantee. We list the present four input perturbation approaches for comparison in Table 1. Compared to the state-of-the-art results for the $(\epsilon, \delta)$ case (Dwork et al. 2014), our noise magnitude of $O(\frac{d \log d}{n\epsilon})$ is obviously worse than their $O(\frac{\sqrt{d}}{n\epsilon})$. This can be seen as the utility gap between $(\epsilon, \delta)$-DP and $(\epsilon, 0)$-DP.

Table 1: Spectral norm of the noise matrix in input perturbation.

Approach              Noise magnitude                          Privacy
Laplace               $O(d\sqrt{d}/(n\epsilon))$               $(\epsilon, 0)$
(Blum et al. 2005)    $O(d\sqrt{d \log d}/(n\epsilon))$        $(\epsilon, \delta)$
Wishart               $O(d \log d/(n\epsilon))$                $(\epsilon, 0)$
(Dwork et al. 2014)   $O(\sqrt{d}/(n\epsilon))$                $(\epsilon, \delta)$

General framework

Let us now discuss the intrinsic difference between the Laplace and Wishart mechanisms. The key element is the difference matrix $\Delta$ of two adjacent matrices. The Laplace mechanism adds a noise matrix calibrated to the $\ell_1$ sensitivity, which equals $\max \|\Delta\|_{1,1}$; thus, the spectral norm of its noise matrix is $O(\max \|\Delta\|_{1,1} \sqrt{d}/\epsilon)$. When it comes to the Wishart mechanism, the magnitude of the noise is determined by $\|C\|_2$.
To satisfy the privacy guarantee, we take $\|C\|_2 = \omega(\max \|\Delta\|_*/\epsilon)$. Then the spectral norm of the noise matrix is $O(\max \|\Delta\|_* \, d \log d/\epsilon)$. Consequently, we obtain the following theorem.

Theorem 5. Let $M$ be a $d \times d$ symmetric matrix generated by some input, and for two arbitrary adjacent inputs let the generated matrices be $M$ and $\hat{M}$, with $\Delta = M - \hat{M}$. Using the Wishart mechanism to publish $M$ in a differentially private manner works better if
$$\frac{\max \|\Delta\|_{1,1}}{\max \|\Delta\|_*} = \omega(\sqrt{d} \log d);$$
otherwise the Laplace mechanism works better.

Top-$k$ subspace closeness

We now compare our mechanism with the algorithm in (Chaudhuri, Sarwate, and Sinha 2012). Chaudhuri, Sarwate, and Sinha (2012) proposed an exponential-mechanism-based method, which outputs the top-$k$ subspace by drawing a sample from the matrix Bingham-von Mises-Fisher distribution. Wang, Wu, and Wu (2013) applied this algorithm to private spectral analysis on graphs and showed that it outperforms the Laplace mechanism for output perturbation.

Because of the scoring function used, it is hard to sample directly from the original Bingham-von Mises-Fisher distribution. Instead, Chaudhuri, Sarwate, and Sinha (2012) used Gibbs sampling to reach an approximate solution. However, there is no guarantee of convergence. They check convergence heuristically, which may affect the basic privacy guarantee.

First we provide our result on top-$k$ subspace closeness:

Theorem 6. Let $\hat{V}_k$ be the top-$k$ subspace of $A + W$ in Algorithm 2, and denote the non-noisy subspace corresponding to $A$ by $V_k$. Assume $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d$ are the singular values of $A$. If $\sigma_k - \sigma_{k+1} \geq 2\|W\|_2$, then with high probability
$$\big\|V_k V_k^T - \hat{V}_k \hat{V}_k^T\big\|_F \leq \frac{2\sqrt{k}\, \|W\|_2}{\sigma_k - \sigma_{k+1}}.$$

We apply the well-known Davis-Kahan $\sin\theta$ theorem (Davis 1963) to obtain this result. This theorem characterizes the usefulness of our noisy top-$k$ subspace.
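Theorem 6 can be exercised numerically. The sketch below is entirely our own construction: a synthetic covariance with a large eigengap (so the gap condition holds) and a small PSD perturbation standing in for the Wishart noise, compared against the stated bound.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 6, 2

# Synthetic symmetric PSD matrix with a large gap sigma_2 - sigma_3 = 0.5.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = np.array([1.0, 0.8, 0.3, 0.2, 0.1, 0.05])
A = Q @ np.diag(sigma) @ Q.T

# Small PSD perturbation playing the role of the Wishart noise W.
G = 0.001 * rng.standard_normal((d, d + 1))
W = G @ G.T

def top_k_projector(M, k):
    """Projector V_k V_k^T onto the top-k eigenspace of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    Vk = vecs[:, np.argsort(vals)[::-1][:k]]
    return Vk @ Vk.T

dist = np.linalg.norm(top_k_projector(A, k) - top_k_projector(A + W, k), 'fro')
bound = 2.0 * np.sqrt(k) * np.linalg.norm(W, 2) / (sigma[k - 1] - sigma[k])
gap_ok = sigma[k - 1] - sigma[k] >= 2.0 * np.linalg.norm(W, 2)
```

With the gap condition satisfied, the observed projector distance stays below the bound of Theorem 6 by a comfortable margin.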
Nevertheless, Chaudhuri, Sarwate, and Sinha (2012) only provided a utility guarantee for the principal eigenvector, so we can only compare the top-1 subspace closeness correspondingly. Before the comparison, we introduce the measure used in (Chaudhuri, Sarwate, and Sinha 2012).

Definition 6. A randomized algorithm $\mathcal{A}(\cdot)$ is a $(\rho, \eta)$-close approximation to the top eigenvector if for all datasets $D$ of $n$ points it outputs a vector $\hat{v}_1$ such that $\Pr(\langle \hat{v}_1, v_1 \rangle \geq \rho) \geq 1 - \eta$.

Under this measure, we derive the sample complexity of the Wishart mechanism.

Theorem 7. If
$$n \geq \frac{3\Big(d + 1 + \sqrt{2(d + 1)(d + 2) \log\frac{d}{\eta}} + 2d \log\frac{d}{\eta}\Big)}{2\epsilon (1 - \rho^2)(\lambda_1 - \lambda_2)}$$
and $\rho \geq \frac{\sqrt{2}}{2}$, then the Wishart mechanism is a $(\rho, \eta)$-close approximation to PCA.

Because a useful algorithm should output an eigenvector with $\rho$ close to 1, our condition $\rho \geq \frac{\sqrt{2}}{2}$ is quite weak. We compare with the sample complexity bound of the algorithm in (Chaudhuri, Sarwate, and Sinha 2012):

Theorem 8. If
$$n \geq \frac{d}{\epsilon (1 - \rho)(\lambda_1 - \lambda_2)} \Big( \log\frac{d}{\eta} + \log\frac{4\lambda_1}{(1 - \rho^2)(\lambda_1 - \lambda_2)} \Big),$$
then the algorithm in (Chaudhuri, Sarwate, and Sinha 2012) is a $(\rho, \eta)$-close approximation to PCA.

Our result carries an extra factor of up to $\log d$ but drops the term $\log\frac{\lambda_1}{\lambda_1 - \lambda_2}$. The relationship between $d$ and $\frac{\lambda_1}{\lambda_1 - \lambda_2}$ heavily depends on the data. Thus, as a special case of top-$k$ subspace closeness, our bound for the top-1 subspace is comparable to that of Chaudhuri, Sarwate, and Sinha (2012).

Low-rank approximation

Here we compare the Wishart mechanism with the privacy-preserving rank-$k$ approximation algorithms proposed in (Kapralov and Talwar 2013; Hardt and Price 2014). PCA can be seen as a special case of the low-rank approximation problem. Kapralov and Talwar (2013) combined the exponential and Laplace mechanisms to design a low-rank approximation algorithm for symmetric matrices, providing a strict guarantee of convergence.
However, the implementation of their algorithm involves many approximation techniques and takes $O(d^6/\epsilon)$ time, while our algorithm takes $O(kd^2)$ running time. Hardt and Price (2014) proposed an efficient meta-algorithm that can be applied to $(\epsilon, \delta)$-differentially private PCA; additionally, they provided an $(\epsilon, 0)$-differentially private version.

We need to point out that the definition of adjacent matrices in privacy-preserving low-rank approximation differs from ours (our definition is the same as in (Dwork et al. 2014; Chaudhuri, Sarwate, and Sinha 2012)). In the definition of (Kapralov and Talwar 2013; Hardt and Price 2014), two matrices $A$ and $B$ are called adjacent if $\|A - B\|_2 \leq 1$, while we restrict the difference to the specific form $v v^T - \hat{v} \hat{v}^T$. In effect, we make a stronger assumption, so we are dealing with a case of lower sensitivity. This difference affects the lower bound provided in (Kapralov and Talwar 2013).

For consistency of comparison, we remove the $\frac{1}{n}$ factor in Algorithm 2, which means we use $X X^T$ for PCA instead of $\frac{1}{n} X X^T$. This convention is also used by Dwork et al. (2014). Applying Lemma 1 of (Achlioptas and McSherry 2001), we immediately have the following theorem:

Theorem 9. Suppose the original matrix is $A = X X^T$ and $\hat{A}_k$ is the rank-$k$ approximation of the output of the Wishart mechanism. Denote the $k$-th largest eigenvalue of $A$ by $\lambda_k$. Then
$$\big\|A - \hat{A}_k\big\|_2 \leq \lambda_{k+1} + O\Big(\frac{d \log d}{\epsilon}\Big).$$

Kapralov and Talwar (2013) provided a bound of $O(\frac{k^3 d}{\epsilon})$ and Hardt and Price (2014) provided $O(\frac{k^{3/2} d \log^2 d}{\epsilon})$ for the same scenario. If $k^3$ is larger than $\log d$, our algorithm works better. Moreover, our mechanism has a better bound than that of Hardt and Price (2014) while both algorithms are computationally efficient. Kapralov and Talwar (2013) established a lower bound of $O(\frac{kd}{\epsilon})$ under their definition of adjacent matrices.
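Underlying Theorem 9 is a deterministic fact: for any symmetric perturbation $W$, the triangle inequality plus Weyl's inequality give $\|A - \hat{A}_k\|_2 \leq \lambda_{k+1} + 2\|W\|_2$. A numerical illustration on our own synthetic data (not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 8, 3

# Synthetic symmetric PSD matrix A standing in for X X^T.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(2.0, 0.1, d)) @ Q.T

# Wishart-style PSD noise.
G = 0.05 * rng.standard_normal((d, d + 1))
W = G @ G.T

# Rank-k approximation of the perturbed output A + W.
vals, vecs = np.linalg.eigh(A + W)
order = np.argsort(vals)[::-1]
Vk = vecs[:, order[:k]]
A_hat_k = Vk @ np.diag(vals[order[:k]]) @ Vk.T

err = np.linalg.norm(A - A_hat_k, 2)
lam_k1 = np.sort(np.linalg.eigvalsh(A))[::-1][k]   # lambda_{k+1}(A)
w_norm = np.linalg.norm(W, 2)
```

Here `err` stays below `lam_k1 + 2 * w_norm`, matching the deterministic inequality behind the theorem.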
If replaced with our definition, the lower bound becomes $O(\frac{d}{\epsilon})$; the details are given in the supplementary material. So our mechanism is near-optimal.

5 Concluding Remarks

We have studied the problem of privately publishing a symmetric matrix and provided an approach for properly choosing between Laplace and Wishart noise. In the scenario of PCA, our Wishart mechanism adds less noise than the Laplace mechanism, which leads to a better utility guarantee. Compared with the privacy-preserving PCA algorithm in (Chaudhuri, Sarwate, and Sinha 2012), our mechanism has a reliable rank-$k$ utility guarantee, while the former only has a rank-1 guarantee; for rank-1 approximation we have comparable performance in sample complexity. Compared with the low-rank approximation algorithm in (Kapralov and Talwar 2013), the bound of our mechanism does not depend on $k$; moreover, our method is more computationally tractable. Compared with the tractable algorithm in (Hardt and Price 2014), our utility bound is better.

Since input perturbation only publishes the matrix for PCA, any other procedure can take the noisy matrix as input. Thus, our approach has more flexibility. While other entry-wise input perturbation techniques may make the covariance matrix no longer positive semidefinite, in our case the noisy covariance matrix still preserves this property.

Acknowledgments

We thank Luo Luo for the meaningful technical discussions. We also thank Yujun Li and Tianfan Fu for support in the early stage of this work. This work is supported by the National Natural Science Foundation of China (No. 61572017) and the Natural Science Foundation of Shanghai City (No. 15ZR1424200).

References

Achlioptas, D., and McSherry, F. 2001. Fast computation of low rank matrix approximations. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, 611-618. ACM.

Blum, A.; Dwork, C.; McSherry, F.; and Nissim, K. 2005.
Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 128-138. ACM.

Bojarski, M.; Choromanska, A.; Choromanski, K.; and LeCun, Y. 2014. Differentially- and non-differentially-private random decision trees. arXiv preprint arXiv:1410.6973.

Chaudhuri, K., and Monteleoni, C. 2009. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, 289-296.

Chaudhuri, K.; Monteleoni, C.; and Sarwate, A. D. 2011. Differentially private empirical risk minimization. The Journal of Machine Learning Research 12:1069-1109.

Chaudhuri, K.; Sarwate, A.; and Sinha, K. 2012. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems, 989-997.

Davis, C. 1963. The rotation of eigenvectors by a perturbation. Journal of Mathematical Analysis and Applications 6(2):159-173.

Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography. Springer. 265-284.

Dwork, C.; Talwar, K.; Thakurta, A.; and Zhang, L. 2014. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, 11-20. ACM.

Gupta, A. K., and Nagar, D. K. 2000. Matrix Variate Distributions. Chapman & Hall/CRC.

Hardt, M., and Price, E. 2014. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, 2861-2869.

Hardt, M., and Roth, A. 2012. Beating randomized response on incoherent matrices. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, 1255-1268. ACM.

Hardt, M., and Roth, A. 2013. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, 331-340. ACM.
Kapralov, M., and Talwar, K. 2013. On differentially private low rank approximation. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 1395–1414. SIAM.

McSherry, F., and Talwar, K. 2007. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS'07. 48th Annual IEEE Symposium on, 94–103. IEEE.

Tao, T. 2012. Topics in Random Matrix Theory, volume 132. American Mathematical Society.

Wang, Y.; Wu, X.; and Wu, L. 2013. Differential privacy preserving spectral graph analysis. In Advances in Knowledge Discovery and Data Mining. Springer. 329–340.

Zhu, S. 2012. A short note on the tail bound of Wishart distribution. arXiv preprint.

Supplementary material

A Proof of privacy guarantee

The basic settings are the same as in Section 4.

Proof of Theorem 3

In order to prove Theorem 3, we first give the following lemma.

Lemma 2. For the mechanism $M(D) = \frac{1}{n} D D^T$, the $\ell_1$ sensitivity $s_1$ satisfies $\frac{d}{n} \le s_1(M) \le \frac{2d}{n}$.

Proof. Suppose the two data points in which the adjacent datasets differ are $v = (p_1, \dots, p_d)^T$ and $\hat{v} = (q_1, \dots, q_d)^T$. Then the $\ell_1$ sensitivity of $M(D)$ can be converted into the following optimization problem:
$$s_1(M) = \max \frac{1}{n} \sum_{1 \le i,j \le d} |p_i p_j - q_i q_j|, \quad \text{subject to } \sum_{i=1}^d p_i^2 \le 1, \ \sum_{i=1}^d q_i^2 \le 1.$$
Setting $p_i = \frac{1}{\sqrt{d}}$ and $q_i = 0$ for $i = 1, \dots, d$ gives the lower bound $s_1(M) \ge \frac{d}{n}$. Applying the triangle inequality and then the Cauchy–Schwarz inequality (so that $(\sum_i |p_i|)^2 \le d \sum_i p_i^2 \le d$), we have the upper bound:
$$\sum_{1 \le i,j \le d} |p_i p_j - q_i q_j| \le \sum_{1 \le i,j \le d} \big(|p_i p_j| + |q_i q_j|\big) = \Big(\sum_{i=1}^d |p_i|\Big)^2 + \Big(\sum_{i=1}^d |q_i|\Big)^2 \le 2d,$$
so that $s_1(M) \le \frac{2d}{n}$.

Now applying Lemma 2 to Theorem 1 immediately yields the privacy guarantee for the Laplace mechanism.

B Proof of utility guarantee

Proof of Theorem 6

Proof. We use the following two lemmas.

Lemma 3 (Davis–Kahan $\sin\theta$ theorem (Davis 1963)). Let the $i$-th eigenvectors of $A$ and $\hat{A}$ be $v_i$ and $\hat{v}_i$. Denote $P_k = \sum_{i=1}^k v_i v_i^T$ and $\hat{P}_k = \sum_{i=1}^k \hat{v}_i \hat{v}_i^T$.
If $\lambda_k(A) > \lambda_{k+1}(\hat{A})$, then
$$\big\|P_k - \hat{P}_k\big\|_2 \le \frac{\|A - \hat{A}\|_2}{\lambda_k(A) - \lambda_{k+1}(\hat{A})}.$$

Lemma 4 (Weyl's inequality). Let $M$, $H$ and $P$ be $d \times d$ Hermitian matrices such that $M = H + P$, and let the $k$-th eigenvalues of $M$, $H$ and $P$ be $\mu_k$, $\nu_k$ and $\rho_k$, respectively. For $i \in [d]$, we have
$$\nu_i + \rho_d \le \mu_i \le \nu_i + \rho_1.$$

In our case, $A$ and $\hat{A} = A + W$ are both symmetric positive semidefinite (because of the property of the Wishart distribution), so their eigenvalues equal their singular values. We use Lemma 4 with $H = A$ and $P = W$ and obtain
$$\sigma_i(A + W) \le \sigma_i(A) + \sigma_1(W) = \sigma_i(A) + \|W\|_2.$$
Applying Lemma 3 with $\hat{A} = A + W$ leads to
$$\big\|P_k - \hat{P}_k\big\|_2 = \big\|V_k V_k^T - \hat{V}_k \hat{V}_k^T\big\|_2 \le \frac{\|W\|_2}{\lambda_k(A) - \lambda_{k+1}(A + W)} \le \frac{\|W\|_2}{\lambda_k(A) - \lambda_{k+1}(A) - \|W\|_2} = \frac{\|W\|_2}{\sigma_k - \sigma_{k+1} - \|W\|_2}.$$
Under the assumption $\sigma_k - \sigma_{k+1} \ge 2\|W\|_2$, the denominator is at least $\frac{1}{2}(\sigma_k - \sigma_{k+1})$, so we finally have
$$\big\|V_k V_k^T - \hat{V}_k \hat{V}_k^T\big\|_2 \le \frac{2\|W\|_2}{\sigma_k - \sigma_{k+1}}.$$
Using the property $\big\|V_k V_k^T - \hat{V}_k \hat{V}_k^T\big\|_F \le \sqrt{k}\, \big\|V_k V_k^T - \hat{V}_k \hat{V}_k^T\big\|_2$, we finish the proof.

Proof of Theorem 7

We are going to find the condition on the sample complexity that suffices for a $(\rho, \eta)$-close approximation.

Proof. Set $k = 1$ in Theorem 6. Then
$$\big\|V_1 V_1^T - \hat{V}_1 \hat{V}_1^T\big\|_F^2 = \big\|v_1 v_1^T - \hat{v}_1 \hat{v}_1^T\big\|_F^2 = \operatorname{tr}\big[(v_1 v_1^T - \hat{v}_1 \hat{v}_1^T)(v_1 v_1^T - \hat{v}_1 \hat{v}_1^T)^T\big] = 2 - 2\,(v_1^T \hat{v}_1)^2 \le \frac{2\|W\|_2}{\lambda_1 - \lambda_2}.$$
The condition $\lambda_1 - \lambda_2 \ge 2\|W\|_2$ makes the last term at most 1, which implies $|v_1^T \hat{v}_1| \ge \frac{\sqrt{2}}{2}$. Let $\eta = d \exp(-\theta)$, that is, $\theta = \log\frac{d}{\eta}$. Then with probability $1 - \eta$,
$$(v_1^T \hat{v}_1)^2 \ge 1 - \frac{\|W\|_2}{\lambda_1 - \lambda_2} = 1 - \frac{\big(d + 1 + \sqrt{2(d+1)(d+2)\theta} + 2d\theta\big)\,\lambda_1(C)}{\lambda_1 - \lambda_2} = 1 - \frac{3\big(d + 1 + \sqrt{2(d+1)(d+2)\log\frac{d}{\eta}} + 2d\log\frac{d}{\eta}\big)}{2 n \epsilon\,(\lambda_1 - \lambda_2)}.$$
Under the condition
$$n \ge \frac{3\big(d + 1 + \sqrt{2(d+1)(d+2)\log\frac{d}{\eta}} + 2d\log\frac{d}{\eta}\big)}{2\epsilon\,(1 - \rho^2)(\lambda_1 - \lambda_2)},$$
we have
$$(v_1^T \hat{v}_1)^2 \ge 1 - (1 - \rho^2) = \rho^2,$$
which yields $\Pr(v_1^T \hat{v}_1 \ge \rho) \ge 1 - \eta$.

C Lower bound for low rank approximation

We mainly follow the construction of Kapralov and Talwar (2013) and make a slight modification to fit our definition of adjacent matrices.

Lemma 5. Define $C_\delta^k(Y) = \{S \in G_{k,d} : \|Y Y^T - S S^T\|_2 \le \delta\}$. For each $\delta > 0$ there exists a family $F = \{Y^1, \dots, Y^N\}$ with $N = 2^{\Omega(k(d-k)\log(1/\delta))}$, where $Y^i \in G_{k,d}$, such that $C_\delta^k(Y^i) \cap C_\delta^k(Y^j) = \emptyset$ for $i \ne j$.

Theorem 10. Suppose the original matrix is $A = X X^T$ and $\hat{A}_k$ is the rank-$k$ approximation output by any $\epsilon$-differentially private mechanism. Denote the $k$-th largest eigenvalue of $A$ by $\lambda_k$. Then, in the worst case,
$$\big\|A - \hat{A}_k\big\|_2 \ge \lambda_{k+1} + \Omega(d/\epsilon).$$

Proof. Take the family $F = \{Y^1, \dots, Y^N\}$ from Lemma 5 and construct a series of matrices $A^i = \gamma Y^i (Y^i)^T$, where $i \in [N]$. Suppose for contradiction that
$$\mathbb{E}\big[\|A^i - \hat{A}^i_k\|_2\big] \le \delta\gamma.$$
Let $\hat{A}^i_k = \hat{Y}^i \hat{\Sigma} (\hat{Y}^i)^T$. Then, letting $\tilde{A}^i_k = \gamma \hat{Y}^i (\hat{Y}^i)^T$, we have
$$\mathbb{E}\big[\|A^i - \tilde{A}^i_k\|_2\big] \le 2\delta\gamma.$$
Using Markov's inequality leads to
$$\Pr\big[\|A^i - \tilde{A}^i_k\|_2 \le 4\delta\gamma\big] > \tfrac{1}{2}.$$
Here is the main difference between our definition and that of Kapralov and Talwar (2013). They consider the distance from $A^i$ to $A^j$ to be at most $2\gamma$, since $\|A^i\|_2 \le \gamma$. In our framework, $A^i$ is a dataset consisting of $\gamma$ data groups, each a copy of $Y^i$; changing $A^i$ to $A^j$ means replacing $\gamma k$ data points with brand-new ones, so we take the distance to be at most $2\gamma k$. The algorithm should therefore put at least half of its probability mass into $C^k_{4\delta}(Y^i)$.
Meanwhile, to satisfy the privacy guarantee,
$$\frac{\Pr\{M(A^i) \in C^k_{4\delta}(Y^i)\}}{\Pr\{M(A^j) \in C^k_{4\delta}(Y^i)\}} \le e^{2\gamma k \epsilon},$$
so $\Pr\{M(A^j) \in C^k_{4\delta}(Y^i)\} \ge \frac{1}{2} e^{-2\gamma k \epsilon}$. Since the regions $C^k_{4\delta}(Y^i)$ are pairwise disjoint, we have
$$\frac{1}{2} e^{-2\gamma k \epsilon} \cdot 2^{\Omega(k(d-k)\log(1/\delta))} \le 1,$$
which implies $\gamma = \Omega(d \log(1/\delta)/\epsilon)$ and completes the proof.
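As an illustrative aside, the input-perturbation step analyzed above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' reference implementation: it assumes data columns with $\|x_i\|_2 \le 1$, Wishart noise $W \sim W_d(d+1, C)$ with scale matrix $C = \frac{3}{2n\epsilon} I_d$ (the calibration $\lambda_1(C) = \frac{3}{2n\epsilon}$ appearing in the proof of Theorem 7), and the function and variable names are ours.

```python
import numpy as np

def wishart_mechanism(X, eps, rng=None):
    """Publish a noisy covariance matrix (1/n) X X^T under (eps, 0)-DP.

    Sketch (names and calibration are our assumptions): draw
    W ~ Wishart_d(d+1, C) with C = 3/(2*n*eps) * I_d and release A + W.
    Columns of X are data points with ||x_i||_2 <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    A = X @ X.T / n                        # empirical covariance
    scale = 3.0 / (2.0 * n * eps)          # C = scale * I_d
    # W = G G^T with d+1 i.i.d. N(0, C) columns gives W ~ Wishart_d(d+1, C)
    G = np.sqrt(scale) * rng.standard_normal((d, d + 1))
    W = G @ G.T
    return A + W

# The released matrix stays symmetric positive semidefinite (A and W both
# are), so any downstream PCA routine can consume it unchanged.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
X /= np.maximum(np.linalg.norm(X, axis=0), 1.0)   # enforce ||x_i||_2 <= 1
A_noisy = wishart_mechanism(X, eps=0.5, rng=rng)
assert np.allclose(A_noisy, A_noisy.T)
assert np.linalg.eigvalsh(A_noisy).min() >= -1e-10
```

The positive semidefiniteness of the output, checked by the final assertion, is exactly the property that entry-wise Laplace perturbation fails to preserve.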
