Differentially Private Data Releasing for Smooth Queries with Synthetic Database Output
Chi Jin, Ziteng Wang, Junliang Huang, Yiqiao Zhong, and Liwei Wang∗

September 18, 2018

Abstract

We consider accurately answering smooth queries while preserving differential privacy. A query is said to be $K$-smooth if it is specified by a function defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all bounded. We develop an $\epsilon$-differentially private mechanism for the class of $K$-smooth queries. The major advantage of the algorithm is that it outputs a synthetic database. In real applications, a synthetic database output is appealing. Our mechanism achieves an accuracy of $O(n^{-\frac{K}{2d+K}}/\epsilon)$ and runs in polynomial time. We also generalize the mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy. Extensive experiments on benchmark datasets demonstrate that the mechanisms have good accuracy and are efficient.

Keywords: Differential privacy, smooth queries, synthetic database.

1 Introduction

Machine learning is often conducted on datasets containing sensitive information, such as medical records and commercial data. The benefit of learning from such data is tremendous, but when releasing sensitive data one must take privacy into consideration and trade off accuracy against the privacy loss of the individuals in the database.

In this paper we study differential privacy [11], which has become a standard notion of privacy. Differential privacy guarantees that almost nothing new can be learned from a database containing one specific individual's information compared with the same database without that individual's information.
More concretely, a mechanism which releases information about the database is said to preserve differential privacy if changing a single database element does not significantly affect the probability distribution of the output. Differential privacy therefore provides strong guarantees against attacks: the risk incurred by any individual who submits her information to the database is very small. Recently there have been extensive studies of machine learning [6, 21, 35, 5, 7, 9], statistical estimation [34, 25, 10], and data mining [23, 24, 22] under the differential privacy framework.

∗Chi Jin is with Dept. of EECS, University of California, Berkeley. Email: chijin@cs.berkeley.edu. Ziteng Wang, Junliang Huang, Yiqiao Zhong and Liwei Wang are with Key Laboratory of Machine Perception, MOE, School of EECS, Peking University. Email: wangzt2012@gmail.com, huangjunliang@pku.edu.cn, yiqiaozhong@pku.edu.cn, wanglw@cis.pku.edu.cn

One of the most well studied problems in differential privacy is query answering: how to answer a set of queries differentially privately, accurately, and efficiently. A simple and efficient method is the Laplace mechanism [11], which adds Laplace noise to the true answers of the queries, with the amount of noise proportional to the sensitivity of the query function. The Laplace mechanism thus performs well on queries of low sensitivity. A typical class of low-sensitivity queries is linear queries, whose sensitivity is $O(1/n)$, where $n$ is the size of the database. Although simple and efficient, the Laplace mechanism has a limitation: it can answer at most $O(n^2)$ queries with nontrivial privacy and accuracy guarantees. In real applications there can be many users, each of whom may submit a set of queries; limiting the total number of queries to at most $n^2$ is too restrictive.
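To make the Laplace mechanism concrete, here is a minimal Python sketch for a single bounded linear query (an illustration under our own naming, not the paper's implementation; the toy database and query function are made up):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Lap(scale) by inverse CDF; density is (1/(2*scale)) * exp(-|x|/scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def linear_query(db, f):
    """q_f(D) = (1/|D|) * sum_{x in D} f(x). Changing one data point moves the
    answer by at most 2*max|f|/n, which is the O(1/n) sensitivity."""
    return sum(f(x) for x in db) / len(db)

def laplace_mechanism(db, f, eps, f_range=1.0):
    """eps-differentially private answer: add Lap(sensitivity/eps) noise,
    where the sensitivity of a bounded linear query is 2*f_range/n."""
    sensitivity = 2.0 * f_range / len(db)
    return linear_query(db, f) + laplace_noise(sensitivity / eps)

db = [0.2, -0.5, 0.9, 0.1]      # a toy 1-dimensional database
f = lambda x: x * x             # a query function with |f| <= 1 on [-1, 1]
private_answer = laplace_mechanism(db, f, eps=1.0)
```

Since the noise scale shrinks like $1/n$, a single query on a large database is answered almost exactly; the limitation discussed above only appears when the noise must be shared across many queries.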
A remarkable result due to Blum, Ligett and Roth [4] shows that information-theoretically it is possible for a mechanism to answer far more than $n^2$ linear queries while preserving differential privacy and nontrivial accuracy simultaneously. Specifically, their mechanism (referred to as BLR below) can answer exponentially many linear queries with good accuracy. A series of works [12, 14, 27, 18, 17] improves the result of [4]. All these mechanisms are very powerful in the sense that they can answer general and adversarially chosen queries.

Among the mechanisms mentioned above, BLR differs from all the others in the output of the algorithm. The output of BLR is a synthetic database, while the outputs of the other mechanisms are answers to the queries. From a practical point of view, the synthetic database output is very appealing. In fact, before the notion of differential privacy was proposed, almost all practical techniques developed to preserve privacy against certain types of attacks output a synthetic database obtained by modifying the raw dataset (see the survey [1] and the references therein).

However, outputting a synthetic database while preserving differential privacy is much more difficult, in terms of computational complexity, than outputting answers to the queries. Compare the running time of BLR with that of the Private Multiplicative Weights updating (PMW) mechanism [18], one of the best mechanisms that output answers: BLR runs in time super-polynomial in both the size of the data universe and the number of queries, while the running time of PMW is linear in these two factors. More generally, if the data universe is $\{0,1\}^d$, there are strong hardness results for differentially privately outputting a synthetic database.
In particular, it can be shown that there is no differentially private algorithm which can output a synthetic database, accurately answer general queries, and run in polynomial time¹ [30]. Given the hardness result against general queries, there has recently been growing interest in studying efficient and differentially private mechanisms for restricted classes of queries. From a practical point of view, if there exists a class of queries which is rich enough to contain most queries used in applications and allows one to develop fast mechanisms, then the hardness result is not a serious barrier for differential privacy.

Blum et al. [4] consider rectangle queries in the setting where the data universe is $[-1,1]^d$ and $d$ is a constant. A rectangle query is specified by an axis-aligned rectangle; the answer to the query is the fraction of the data points that lie in the rectangle. They show that if $[-1,1]^d$ is discretized to poly($n$) bits of precision, then there is an efficient mechanism which outputs a synthetic database and is accurate for the class of all rectangle queries.

Another class of queries that attracts a lot of attention is the $k$-way conjunctions (or $k$-way marginals). The data universe for this problem is $\{0,1\}^d$; each individual record has $d$ binary attributes. A $k$-way conjunction query is specified by $k$ features and asks what fraction of the individual records in the database has all these $k$ features equal to 1. A series of works attacks this problem using several different techniques [3, 15, 8, 19, 29, 13]. They propose elegant mechanisms which run in time poly($n$) when $k$ is a constant, even if the size of the data universe is exponentially large.

¹This hardness result assumes the existence of one-way functions.
Thus these algorithms are more efficient than the best general-query-answering mechanisms in the large-data-universe setting. However, the outputs of these mechanisms are not synthetic databases².

In this paper we study smooth queries, also defined on the data universe $[-1,1]^d$ for a constant $d$. We say a query is $K$-smooth if it is specified by a smooth function which has bounded partial derivatives up to the $K$th order. The answer to the query is the average of the function values on the data points in the database. Smooth functions are widely used in machine learning and data analysis; there are extensive studies on the relation between smoothness, regularization, reproducing kernels and generalization ability [32, 28].

Our main result is an $\epsilon$-differentially private mechanism for the class of all $K$-smooth queries. The output of the mechanism is a synthetic database. The mechanism has $(\alpha,\beta)$-accuracy, where $\alpha = O(n^{-\frac{K}{2d+K}}/\epsilon)$ for $\beta$ exponentially small. The running time of the mechanism is $O(n^{\frac{3dK+5d}{4d+2K}})$, polynomial in the size of the database. Note that if the order of smoothness $K$ is large compared to the dimension $d$, the error of the mechanism can be close to $n^{-1}$. In contrast, if we employ BLR to solve this problem and output a synthetic database, the accuracy guarantee is $O(n^{-\frac{K}{d+3K}})$, which is at best $n^{-1/3}$ for large $K$; moreover, to achieve this accuracy, the running time of BLR is super-exponential in the size of the database (see Section 3.3 for a detailed analysis). We also generalize our mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy.

Our work is related to [33], which proposes an efficient algorithm able to answer smooth queries differentially privately. However, that mechanism outputs a (private) synopsis of the database.
In order to obtain the answer to a query, the user has to run an evaluation algorithm which involves complicated numerical integration procedures. In contrast, the mechanism given in this paper simply outputs a synthetic database, which is friendly to users in applications.

We conduct extensive experiments to evaluate the performance of the proposed mechanism on benchmark datasets (which contain sensitive information such as medical records of individuals). We also develop simple techniques to improve the efficiency of the algorithm. Experimental results demonstrate that the algorithms achieve good accuracy and are practically efficient on datasets of various sizes and numbers of attributes.

The rest of the paper is organized as follows. Section 2 briefly describes the background of data privacy and gives the basic definitions. In Section 3 we propose the private mechanisms that output a synthetic database and accurately answer smooth queries; Section 3 also contains the main theoretical results, analyzing the performance of the algorithms. All the experimental results are given in Section 4. Finally, we conclude in Section 5. All proofs are given in the appendix.

²The hardness result in [30] shows that for $k$-way marginals, efficiently outputting a synthetic database is not possible.

2 Preliminaries

Let $D$ be a database containing $n$ data points in the data universe $\mathcal{X}$. In this paper we consider the case that $\mathcal{X} \subset \mathbb{R}^d$, where $d$ is a constant; typically, we assume that the data universe is $\mathcal{X} = [-1,1]^d$. Two databases $D$ and $D'$ are called neighbors if $|D| = |D'| = n$ and they differ in exactly one data point. The following is the formal definition of differential privacy.

Definition 2.1 ($(\epsilon,\delta)$-differential privacy).
A sanitizer $S$, which is a randomized algorithm that maps an input database into some range $R$, is said to preserve $(\epsilon,\delta)$-differential privacy if for all pairs of neighbor databases $D$, $D'$ and for any subset $A \subset R$, it holds that
$$P(S(D) \in A) \le P(S(D') \in A) \cdot e^{\epsilon} + \delta,$$
where the probability is taken over the random coins of $S$. If $S$ preserves $(\epsilon,0)$-differential privacy, we say $S$ is $\epsilon$-differentially private.

We consider linear queries. Each linear query $q_f$ is specified by a function $f$ which maps the data universe $[-1,1]^d$ to $\mathbb{R}$; $q_f$ is defined as $q_f(D) := \frac{1}{|D|}\sum_{x \in D} f(x)$. Let $Q$ be a set of queries. The accuracy of a mechanism with respect to $Q$ is defined as follows.

Definition 2.2 ($(\alpha,\beta)$-accuracy). Let $Q$ be a set of queries. A sanitizer $S$ is said to have $(\alpha,\beta)$-accuracy for size-$n$ databases with respect to $Q$ if for every database $D$ with $|D| = n$ the following holds:
$$P(\exists q \in Q,\ |S(D,q) - q(D)| \ge \alpha) \le \beta,$$
where $S(D,q)$ is the answer to $q$ given by $S$, and the probability is over the internal randomness of the mechanism $S$.

$(\alpha,\beta)$-accuracy is a strong notion of accuracy: it requires that with high probability all the queries are accurately answered by the mechanism (i.e., it is a worst-case accuracy with respect to queries). Some authors also consider a slightly weaker definition, $(\alpha,\beta,\gamma)$-accuracy [12].

Definition 2.3 ($(\alpha,\beta,\gamma)$-accuracy). Let $Q$ be a set of queries. A sanitizer $S$ is said to have $(\alpha,\beta,\gamma)$-accuracy for size-$n$ databases with respect to $Q$ if for every database $D$ with $|D| = n$ the following holds:
$$P(S \text{ is } (\alpha,\gamma)\text{-accurate for } D) \ge 1 - \beta,$$
where the probability is over the internal randomness of the mechanism $S$, and $(\alpha,\gamma)$-accurate means that $|S(D,q) - q(D)| \le \alpha$ holds for at least a $1-\gamma$ fraction of $q \in Q$.

We will make use of the Laplace mechanism [11] in our algorithm.
The Laplace mechanism adds Laplace noise to the output. We denote by $\mathrm{Lap}(\sigma)$ the random variable distributed according to the Laplace distribution with parameter $\sigma$: $P(\mathrm{Lap}(\sigma) = x) = \frac{1}{2\sigma}\exp(-|x|/\sigma)$.

We will design a differentially private mechanism which outputs a synthetic database $\tilde{D}$. Each element of $\tilde{D}$ is a data point in the data universe. $|\tilde{D}|$ and $|D|$ can be different, i.e., the synthetic database and the original database may contain different numbers of data points. For any query $q_f \in Q$, the user simply calculates $q_f(\tilde{D}) := \frac{1}{|\tilde{D}|}\sum_{x \in \tilde{D}} f(x)$ as an approximation of $q_f(D)$. Our differentially private mechanism guarantees accuracy with respect to the set of smooth queries.

Next we formally define smooth queries. Since each query $q_f$ is specified by a function $f$, a set of queries $Q_F$ can be specified by a set of functions $F$; recall that each $f \in F$ maps $[-1,1]^d$ to $\mathbb{R}$. For any point $x = (x_1,\ldots,x_d) \in [-1,1]^d$ and any $d$-tuple of nonnegative integers $k = (k_1,\ldots,k_d)$, define
$$D^k := D_1^{k_1} \cdots D_d^{k_d} := \frac{\partial^{k_1}}{\partial x_1^{k_1}} \cdots \frac{\partial^{k_d}}{\partial x_d^{k_d}},$$
and let $|k| := k_1 + \ldots + k_d$. Define the $K$-norm as
$$\|f\|_K := \sup_{|k| \le K}\ \sup_{x \in [-1,1]^d} |D^k f(x)|.$$
We will study the set $C_B^K$ which contains all smooth functions whose derivatives up to order $K$ have $\infty$-norm upper bounded by a constant $B > 0$. Formally, $C_B^K := \{f : \|f\|_K \le B\}$. The set of queries specified by $C_B^K$, denoted $Q_{C_B^K}$, is our focus.

Smooth functions have been studied in depth in machine learning [31, 32, 28], and many functions widely used in machine learning are smooth. An example is the Gaussian kernel function
$$f(x) = \exp\left(-\frac{\|x - x_0\|^2}{2\sigma^2}\right),$$
where $x_0 \in \mathbb{R}^d$ is a constant vector.
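In code, answering such a query from a released synthetic database requires nothing beyond averaging $f$ over its points; a minimal sketch with the Gaussian kernel above (the toy synthetic database and all names here are illustrative, not part of the mechanism itself):

```python
import math

def gaussian_kernel(x, x0, sigma):
    """f(x) = exp(-||x - x0||^2 / (2*sigma^2)), a smooth function on [-1, 1]^d."""
    sq = sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0))
    return math.exp(-sq / (2.0 * sigma ** 2))

def query_on_db(db, f):
    """q_f(D~) = (1/|D~|) * sum_{x in D~} f(x); the user needs only the
    synthetic database, never the original one."""
    return sum(f(x) for x in db) / len(db)

# A toy 2-d synthetic database, standing in for the output of a private mechanism.
synthetic_db = [(0.0, 0.0), (0.5, -0.5), (-1.0, 1.0)]
f = lambda x: gaussian_kernel(x, x0=(0.0, 0.0), sigma=1.0)
answer = query_on_db(synthetic_db, f)
```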
A linear combination of Gaussian kernels is one of the most popular functions used in machine learning:
$$f(x) = \sum_{j=1}^{J} \alpha_j \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right),$$
where $x_j$, $j = 1,2,\ldots,J$, are constant vectors. The smoothness of this type of function is characterized in the following proposition.

Proposition 2.1. Let $f(x) = \sum_{j=1}^{J} \alpha_j \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right)$, where $x \in \mathbb{R}^d$. Let $\alpha = (\alpha_1,\ldots,\alpha_J)$ and suppose $\|\alpha\|_1 \le 1$. Then for every $K \le \sigma^2$, $\|f\|_K \le 1$.

The proof is given in the appendix, Section A.3.

3 Theoretical Results

This section contains the main theoretical results of the paper. In Section 3.1 we give an $\epsilon$-differentially private mechanism which outputs a synthetic database and guarantees good accuracy for smooth queries. Section 3.2 generalizes the mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy. In Section 3.3 we compare the performance of our algorithms to well-known differentially private mechanisms on this problem.

3.1 The $\epsilon$-differentially Private Mechanism

The following theorem is our main result. It says that if the query class is specified by smooth functions, then there is a polynomial-time mechanism which preserves $\epsilon$-differential privacy and good accuracy. The output of the mechanism is a synthetic dataset. A formal description of the mechanism is given in Algorithm 1.

Theorem 3.1. Let the query set be $Q_{C_B^K} := \{q_f(D) = \frac{1}{n}\sum_{x \in D} f(x) : f \in C_B^K\}$, where $K \in \mathbb{N}$ and $B > 0$ are constants. Let the data universe be $[-1,1]^d$, where $d$ is a constant. Then the mechanism described in Algorithm 1 satisfies, for any $\epsilon > 0$, the following:

1) The mechanism preserves $\epsilon$-differential privacy.
2) There is an absolute constant $c$ such that for every $\beta \ge c \cdot e^{-n^{\frac{1}{2d+K}}}$ the mechanism is $(\alpha,\beta)$-accurate, where $\alpha = O(n^{-\frac{K}{2d+K}}/\epsilon)$, and the hidden constant depends only on $d$, $K$ and $B$.

3) The running time of the mechanism is $O(n^{\frac{3dK+5d}{4d+2K}})$. (This is dominated by solving the linear programming problem in step 20 of the algorithm.)

4) The size of the output synthetic database is $O(n^{1+\frac{K+1}{2d+K}})$.

The proof of Theorem 3.1 is given in the appendix, Section A.1.

Before explaining the ideas of the algorithm, let us first take a closer look at the results in Theorem 3.1. To get a better view of how the performance depends on the order of smoothness, consider three cases. The first case is $K = 1$, i.e., the query functions only have first-order derivatives. The other extreme case is $K/d = M \gg 1$, i.e., very smooth queries. We also consider a case in the middle by assuming $K = 2d$. Table 1 gives simplified upper bounds for the error, the running time of the algorithm, and the size of the output synthetic database in these cases.

From Table 1 we can see that the accuracy $\alpha$ improves dramatically, from roughly $O(n^{-\frac{1}{2d}})$ to nearly $O(n^{-1})$, as $K$ increases. For $K > 2d$, the error is smaller than the sampling error $O(\frac{1}{\sqrt{n}})$. On the other hand, the running time of the mechanism increases if one wants better accuracy for highly smooth queries (see Section 4 for how to improve the efficiency of the algorithm in practice). Finally, the size of the output synthetic database also increases in order to achieve better accuracy: roughly, $O(n^{-1})$ accuracy requires an $O(n^2)$-size synthetic database.

Now we explain the mechanism in detail. The first idea is that all smooth functions in $C_B^K$ can be approximated by linear combinations of a small set of basis functions.
In fact, approximation of smooth functions by polynomials, radial basis functions, wavelets, etc., has been well studied for decades. However, for the differential privacy problem our requirement on the approximation is quite different from the typical results in approximation theory. Specifically, we require that all smooth functions in $C_B^K$ can be approximated by linear combinations of a set of basis functions with small coefficients: the coefficients corresponding to all smooth functions must be uniformly bounded by a constant. (The reason will soon be clear.) It is not clear from standard approximation theory whether any of the above-mentioned basis function sets satisfies such a requirement. Instead, we make a change of variables $\theta_i = \arccos(x_i)$ and consider approximation of the transformed function $g_f(\theta_1,\ldots,\theta_d) = f(\cos\theta_1,\ldots,\cos\theta_d)$ by linear combinations of trigonometric polynomials. It can be shown that the trigonometric polynomial basis satisfies the small-coefficient requirement. It is worth pointing out that here we consider $L_\infty$ approximation, different from the $L_2$ approximation which, for the trigonometric basis, is simply Fourier analysis.

Algorithm 1 Private Synthetic DB for Smooth Queries
Notations: $T_t^d := \{0,1,\ldots,t-1\}^d$, $a_k := \frac{2k+1-N}{N}$, $\mathcal{A} := \{a_k \mid k = 0,1,\ldots,N-1\}$, $\mathcal{L} := \{\frac{i}{L} \mid i = -L, -L+1, \ldots, L-1, L\}$, $x := (x_1,\ldots,x_d)$, $\theta_i(x) := \arccos(x_i)$.
Parameters: Privacy parameters $\epsilon, \delta > 0$, failure probability $\beta > 0$, smoothness order $K \in \mathbb{N}$.
Input: Database $D \in ([-1,1]^d)^n$
Output: Synthetic database $\tilde{D} \in ([-1,1]^d)^m$
1: Set $t = \lceil n^{\frac{1}{2d+K}} \rceil$, $N = \lceil n^{\frac{K}{2d+K}} \rceil$, $m = \lceil n^{1+\frac{K+1}{2d+K}} \rceil$, $L = \lceil n^{\frac{d+K}{2d+K}} \rceil$.
2: Initialize: $D' \leftarrow \emptyset$, $\tilde{D} \leftarrow \emptyset$, $u \leftarrow 0^{N^d}$
3: for all $z = (z_1,\ldots,z_d) \in D$ do
4:   $x_i \leftarrow \arg\min_{a \in \mathcal{A}} |z_i - a|$, $i = 1,\ldots,d$
5:   Add $x = (x_1,\ldots,x_d)$ to $D'$
6: end for
7: for all $r = (r_1,\ldots,r_d) \in T_t^d$ do
8:   $b_r \leftarrow \frac{1}{n}\sum_{x \in D'} \cos(r_1 \theta_1(x)) \cdots \cos(r_d \theta_d(x))$
9:   $\hat{b}_r \leftarrow b_r + \mathrm{Lap}(\frac{t^d}{n\epsilon})$
10:  $\hat{b}'_r \leftarrow \arg\min_{l \in \mathcal{L}} |\hat{b}_r - l|$
11: end for
12: for all $k = (k_1,\ldots,k_d) \in T_N^d$ do
13:   for all $r = (r_1,\ldots,r_d) \in T_t^d$ do
14:     $W_{rk} \leftarrow \cos(r_1 \arccos(a_{k_1})) \cdots \cos(r_d \arccos(a_{k_d}))$
15:     $W'_{rk} \leftarrow \arg\min_{l \in \mathcal{L}} |W_{rk} - l|$
16:   end for
17: end for
18: $\hat{b}' \leftarrow (\hat{b}'_r)_{\|r\|_\infty \le t-1}$ ($\hat{b}'$ is a $t^d$-dimensional vector)
19: $W' \leftarrow (W'_{rk})_{\|r\|_\infty \le t-1,\ \|k\|_\infty \le N-1}$ (a $t^d \times N^d$ matrix)
20: Solve the following LP problem: $\min_u \|W'u - \hat{b}'\|_1$, subject to $u \succeq 0$, $\|u\|_1 = 1$. Obtain the optimal solution $u^*$.
21: repeat
22:   Sample $y$ according to distribution $u^*$
23:   Add $y$ to $\tilde{D}$
24: until $|\tilde{D}| = m$
25: return $\tilde{D}$

Next we view the trigonometric polynomial functions as a set of basis queries. We compute the answers of the basis queries (step 8 in Algorithm 1) and add Laplace noise to the answers (step 9). These noisy answers guarantee differential privacy. Note that if, for a smooth query, we knew the coefficients of the linear combination of basis functions that approximates the smooth function, then we could easily obtain a differentially private answer to the smooth query by simply combining the noisy answers of the basis queries with these coefficients. Moreover, because all the coefficients are small, the error of the answer to the smooth query is small.

Table 1: Performance vs. order of smoothness

Order of smoothness | Accuracy $\alpha$ | Running time | Size of synthetic DB
$K = 1$ | $O(n^{-\frac{1}{2d+1}})$ | $O(n^2)$ | $O(n^{1+\frac{2}{2d+1}})$
$K = 2d$ | $O(n^{-\frac{1}{2}})$ | $O(n^{\frac{3}{4}d+\frac{5}{8}})$ | $O(n^{\frac{3}{2}+\frac{1}{4d}})$
$K/d = M \gg 1$ | $O(n^{-(1-\frac{2}{M})})$ | $O(n^{d(\frac{3}{2}-\frac{1}{2M})})$ | $O(n^{2-\frac{1}{2M}})$
However, an important advantage of our mechanism is that we do not even need to know the linear coefficients; we merely need to know that there exist coefficients which lead to a good approximation of the smooth function.

Finally, our goal is to generate a synthetic dataset (without using any information of the original database) such that if we evaluate all the basis queries on this synthetic database, all the answers will be close to the noisy answers obtained from the original dataset. The key observation is that if we have such a synthetic dataset, then the evaluation of any smooth query on it is an answer that is both differentially private and accurate. To generate such a dataset, we first learn a probability distribution over $[-1,1]^d$ so that the answers of the basis queries with respect to this distribution are close to the noisy answers. Observe that such a distribution must exist, because the uniform distribution over the original dataset satisfies this requirement. However, learning a continuous distribution is computationally intractable, so we discretize the domain (as well as the original data, in step 4) and consider distributions over the discretized data universe. Because the queries are smooth, the error introduced by discretization can be controlled. Learning the distribution can be formulated as a linear programming problem (step 20). Note that in the LP problem we minimize the $\ell_1$ error instead of the $\ell_\infty$ error because this results in slightly better accuracy. Finally, we randomly draw a sufficiently large number of data points from this probability distribution, and these points form the output synthetic database.

The running time of the mechanism is dominated by the linear programming step.
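The pipeline just described (discretize, release noisy basis answers, fit a distribution by the LP of step 20, and sample) can be sketched end to end for $d = 1$. This is an illustration under assumptions of our own, not the authors' implementation: the parameters $t$, $N$, $m$ are set by hand rather than by step 1, the data are synthetic, the rounding steps 10 and 15 (which only matter for worst-case bit complexity) are omitted, and `scipy.optimize.linprog` stands in for whatever LP solver one prefers:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy setup for d = 1: n points in [-1, 1], t basis queries, N grid points.
n, t, N, m, eps = 5000, 8, 64, 50000, 1.0
data = np.clip(rng.normal(0.3, 0.4, size=n), -1.0, 1.0)  # the private database D

# Step 4: snap each point to the grid A = {(2k + 1 - N) / N : k = 0..N-1}.
grid = (2.0 * np.arange(N) + 1.0 - N) / N
snapped = grid[np.argmin(np.abs(data[:, None] - grid[None, :]), axis=1)]

# Steps 7-9: noisy answers of the basis queries cos(r * arccos(x)).
theta = np.arccos(snapped)
b_hat = np.array([np.cos(r * theta).mean() for r in range(t)])
b_hat += rng.laplace(scale=t / (n * eps), size=t)

# Step 14: W[r, k] = cos(r * arccos(a_k)), the basis evaluated on the grid.
W = np.cos(np.arange(t)[:, None] * np.arccos(grid)[None, :])

# Step 20: min_u ||W u - b_hat||_1 s.t. u >= 0, sum(u) = 1, written as an LP
# with one slack variable s_r per basis query: minimize sum(s) subject to
# -s <= W u - b_hat <= s.
c = np.concatenate([np.zeros(N), np.ones(t)])
A_ub = np.vstack([np.hstack([W, -np.eye(t)]), np.hstack([-W, -np.eye(t)])])
b_ub = np.concatenate([b_hat, -b_hat])
A_eq = np.concatenate([np.ones(N), np.zeros(t)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], method="highs")
u = np.clip(res.x[:N], 0.0, None)
u /= u.sum()

# Steps 21-24: sample the synthetic database from the learned distribution.
synthetic = rng.choice(grid, size=m, p=u)

# Any smooth query can now be answered from the synthetic data alone.
f = lambda x: np.exp(-(x - 0.2) ** 2 / 2.0)
true_answer, private_answer = f(data).mean(), f(synthetic).mean()
```

With these toy parameters the private answer typically lands within a few hundredths of the true one; note that the data are touched only through `b_hat`, whose Laplace noise is what makes everything derived from it, including `synthetic`, differentially private.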
It is known that the worst-case time complexity of the interior point method is upper bounded in terms of the number of variables, the number of constraints, and the number of bits needed to encode the problem. It is easy to see that there are only poly($n$) variables and constraints. To control the number of bits, we round each number in the linear programming problem to a certain precision level (steps 10 and 15). Because all the numbers after rounding are uniformly bounded by a constant, the number of bits is not too large.

3.2 Generalization to $(\epsilon,\delta)$-differential Privacy

It is easy to generalize the previous $\epsilon$-differentially private mechanism to an $(\epsilon,\delta)$-differentially private mechanism which achieves slightly better accuracy. The $(\epsilon,\delta)$-differentially private mechanism differs from Algorithm 1 only in steps 1 and 9, which are replaced by the following:

1) Step 1. Set $t = \lceil n^{\frac{2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{1}{3d+2K}} \rceil$, $N = \lceil n^{\frac{2K}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{K}{3d+2K}} \rceil$, $m = \lceil n^{\frac{4d+4K+2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{2d+2K+1}{3d+2K}} \rceil$, and $L = \lceil n^{\frac{2d+2K}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{d+K}{3d+2K}} \rceil$.

2) Step 9. $\hat{b}_r = b_r + \mathrm{Lap}\left(\frac{(t^d \log\frac{1}{\delta})^{1/2}}{n\epsilon}\right)$.

We have the following theorem for this mechanism.

Theorem 3.2. Let the query set $Q_{C_B^K}$ be defined as in Theorem 3.1, and let the data universe be $[-1,1]^d$, where $d \in \mathbb{N}$ is a constant. Then the mechanism described above satisfies, for any $\epsilon > 0$, $\delta > 0$, the following:

1) The mechanism is $(\epsilon,\delta)$-differentially private.

2) There is an absolute constant $c$ such that for any $\beta \ge c \cdot e^{-n^{\frac{2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{1}{3d+2K}}}$ the mechanism is $(\alpha,\beta)$-accurate, where $\alpha = O\left(n^{-\frac{2K}{3d+2K}} (\log\frac{1}{\delta})^{\frac{K}{3d+2K}} / \epsilon\right)$, and the hidden constant depends only on $d$, $K$ and $B$.

3) The running time of the mechanism is $O\left(n^{\frac{3dK+5d}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{3dK+5d}{6d+4K}}\right)$.
4) The size of the synthetic database is $O\left(n^{\frac{4d+4K+2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{2d+2K+1}{3d+2K}}\right)$.

The proof of Theorem 3.2 follows from the standard use of the composition theorem [14]; we omit the details. Note that the running time and the size of the output synthetic database of this $(\epsilon,\delta)$-differentially private mechanism are similar to those of the $\epsilon$-differentially private one.

3.3 Comparison to Existing Algorithms

Here we study the performance of an existing differentially private mechanism which can output a synthetic database for accurately answering smooth queries. In particular, we analyze a simple variant of the BLR mechanism. Note that the original BLR mechanism applies to the setting where the data universe is $\{0,1\}^d$ and the query set contains a finite number of linear queries. Given the query set, BLR outputs a synthetic database and preserves $\epsilon$-differential privacy. Let $|Q|$ be the number of queries in the query set $Q$ and let $|\mathcal{X}|$ be the size of the data universe; the accuracy of BLR is $\tilde{O}\left(\left(\frac{\log|Q| \log|\mathcal{X}|}{n}\right)^{1/3}\right)$ [4]. (In this subsection we ignore the dependence on all other factors for clarity.)

For the smooth query problem, the data universe is the continuous domain $[-1,1]^d$, and the query set contains infinitely many elements, as the number of smooth functions is infinite. In order to apply BLR to this problem, one must discretize both the data universe and the range of the smooth functions. It is easy to see that to achieve an accuracy of $\alpha$ for all smooth queries, it is necessary and sufficient to discretize the data universe $[-1,1]^d$ to $\Omega(\frac{1}{\alpha})$ grid points along each dimension, and to discretize the range to $\Omega(\frac{1}{\alpha})$ precision. After this discretization, the data universe $\mathcal{X}$ has size $\Omega((\frac{1}{\alpha})^d)$, and the query set $Q$ contains only a finite number of queries.
The following proposition gives the performance of BLR for the discretized smooth queries.

Proposition 3.3. The accuracy guarantee of the BLR mechanism (implemented as described above) on the set of $K$-smooth queries is $O\left(n^{-\frac{K}{d+3K}}\right)$. The running time for achieving such an accuracy is super-exponential in $n$.

The proof of Proposition 3.3 is given in the appendix, Section A.2. Note that even for highly smooth queries, the accuracy guarantee of BLR is at best $O(n^{-1/3})$. In contrast, our mechanism has an accuracy close to $n^{-1}$ if $K$ is large compared to $d$. More importantly, our mechanism runs in polynomial time, which is much more efficient than BLR on the smooth query problem.

3.4 Practical Acceleration via Private PCA

Theoretically, the worst-case time complexity of our $\epsilon$-differentially private mechanism can be nearly $n^{\frac{3d}{2}}$ to achieve $n^{-1}$ accuracy for highly smooth queries. In real applications such a running time is unacceptable. We thus consider a simple variant of Algorithm 1 which turns out to be very efficient in our experiments and suffers only a minor loss in accuracy.

Note that the running time of Algorithm 1 is dominated by the linear programming step. This LP problem has $O(N^d)$ variables and $O(t^d)$ constraints, where $N^d$ is the number of discretized grid points in $[-1,1]^d$ and $t^d$ is the number of trigonometric polynomial basis functions. To make our algorithm practical, we consider a subset $M$ of the $N^d$ grid points with size $C := |M| \ll N^d$ and restrict the probability distribution $u$ to this subset of grid points. Similarly, we use a subset of size $R$ of the $t^d$ trigonometric polynomial basis functions, preferring lower degrees. By doing this, the LP problem has $C$ variables and $R$ constraints. The simplest approach to obtaining $M$ is to sample uniformly from the $N^d$ grid points in $[-1,1]^d$.
However, this approach suffers from a substantial loss in accuracy (see the appendix for experimental results): since $|\mathcal{M}|$ is extremely small compared to $N^d$, the probability that $\mathcal{M}$ contains data points in $D$ (or close to $D$) is very small. To reduce the size of the LP problem while preserving accuracy, we need a better way to obtain $\mathcal{M}$.

Formally, the problem of choosing a subset $\mathcal{M}$ for our purpose can be stated as follows. We want a subset $\mathcal{M}$ such that 1) $\mathcal{M}$ is differentially private; 2) $|\mathcal{M}|$ is small; 3) for almost every data point $x$ in $D$, there is a point in $\mathcal{M}$ close to $x$. Note that without the privacy concern one could simply let $\mathcal{M} = D$; under the requirement of privacy, however, this problem is highly non-trivial.

Here we adopt private PCA to obtain a low-dimensional ellipsoid. The ellipsoid is spanned by the (private) top eigenvectors of the data covariance matrix, with the square roots of the (private) eigenvalues as the radii. In particular, we use a slightly modified version of the Private Subspace Iteration (PSI) mechanism due to Hardt [16] to compute the private eigenpairs; the mechanism is described in Algorithm 2. Finally, we uniformly sample $C$ points from the ellipsoid to form $\mathcal{M}$.

In the following three results, we show that the PSI mechanism is differentially private and accurate for the top eigenvectors and eigenvalues, respectively. Hardt [16] shows that, with high probability, the tangent of the angle between the space spanned by the top-$k$ leading eigenvectors of the true data covariance matrix and the space spanned by the output column vectors is small. However, this does not suffice to conclude that the output private ellipsoid converges to the true PCA ellipsoid. Our results slightly strengthen the result in [16]: we show column-wise convergence between the eigenvectors and the output columns, which can be concluded from the simultaneous convergence between the increasing sequence of eigenspaces and the increasing sequence of output spaces.

Algorithm 2 Private Subspace Iteration
Input: Database $D \in ([-1,1]^d)^n$.
Output: Top-$k$ private eigenvectors and eigenvalues of $X^{(L)}$.
Parameters: Number of iterations $L \in \mathbb{N}$, dimension $k$, privacy parameters $\epsilon, \delta > 0$. Denote by GS the Gram-Schmidt orthonormalization procedure.
1: Set $\sigma = \frac{5d\sqrt{4kL\log(1/\delta)}}{n\epsilon}$ and $A = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$, where $\bar{D}$ is the mean of $D$.
2: Initialize: $G^{(0)} \sim N(0,1)^{d\times k}$, $X^{(0)} \leftarrow \mathrm{GS}(G^{(0)})$.
3: for $l = 1, 2, \ldots, L$ do
4:   Sample $G^{(l)} \sim N(0,\sigma^2)^{d\times k}$.
5:   $W^{(l)} = AX^{(l-1)} + \|X^{(l-1)}\|_\infty G^{(l)}$
6:   $X^{(l)} \leftarrow \mathrm{GS}(W^{(l)})$
7: end for

Theorem 3.4 (Accuracy of the eigenvectors). Given a database $D$ with $|D| = n$, let $A = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d$, and let $\gamma_k = \lambda_k/\lambda_{k+1} - 1$ for some $k \le d/2$. Let $U = (u_1, \ldots, u_k) \in \mathbb{R}^{d\times k}$ be a basis for the space spanned by the top $k$ eigenvectors. The matrix $X^{(L)} = (x^{(L)}_1, \ldots, x^{(L)}_k) \in \mathbb{R}^{d\times k}$ returned by Algorithm 2 on input $D$, with parameters $k$, $L \ge C(\min_{s\le k}\gamma_s)^{-1}\log d$ for a sufficiently large constant $C$, $L \in \mathbb{N}$, and privacy parameter $\sigma$, satisfies with probability $1 - o(1)$,
$$\sin\theta(u_s, x^{(L)}_s) \le O\Big(\sigma\,\omega_s\sqrt{d\,\max_l \|X^{(l)}\|_\infty^2 \log L}\Big),$$
where
$$\omega_s = \begin{cases} \max\big\{\frac{1}{\gamma_s\lambda_s}, \frac{1}{\gamma_{s-1}\lambda_{s-1}}\big\} & 2 \le s \le k, \\ \frac{1}{\gamma_1\lambda_1} & s = 1. \end{cases}$$

Corollary 3.5 (Accuracy of the eigenvalues). Under the assumptions of Theorem 3.4, let $\hat{\lambda}_s = \sqrt{x_s^T A^2 x_s}$. With probability $1 - o(1)$, we have
$$|\hat{\lambda}_s - \lambda_s| \le O\Big(\frac{\sigma^2 d \max_l\|X^{(l)}\|_\infty^2 \log L}{\gamma_s^2 \lambda_s^2}\Big).$$
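The iteration in Algorithm 2 can be sketched in NumPy as follows. This is a simplified illustration, not a vetted private implementation: Gram-Schmidt is done via a QR factorization, $\|X\|_\infty$ is taken as the maximum absolute entry, and the function name and interface are our own.

```python
import numpy as np

def private_subspace_iteration(D, k, L, eps, delta, rng=None):
    """Sketch of Algorithm 2 (Private Subspace Iteration).

    D : (d, n) array, each column a record in [-1, 1]^d.
    Returns a (d, k) matrix of private top-k eigenvector estimates and the
    eigenvalue estimates lambda_hat_s = sqrt(x_s^T A^2 x_s) of Corollary 3.5.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = D.shape
    sigma = 5 * d * np.sqrt(4 * k * L * np.log(1 / delta)) / (n * eps)
    mean = D.mean(axis=1, keepdims=True)
    A = D @ D.T / n - mean @ mean.T              # centered covariance matrix

    # Gram-Schmidt orthonormalization via QR
    X, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(L):
        G = sigma * rng.standard_normal((d, k))  # Gaussian noise G^(l)
        W = A @ X + np.abs(X).max() * G          # noisy power-iteration step
        X, _ = np.linalg.qr(W)

    # ||A x_s||_2 = sqrt(x_s^T A^2 x_s) since A is symmetric
    lam = np.linalg.norm(A @ X, axis=0)
    return X, lam
```

The subset $\mathcal{M}$ would then be formed by sampling $C$ points uniformly from the ellipsoid with axes given by the returned columns and radii given by the square roots of the returned eigenvalue estimates.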
Theorem 3.6 (Privacy). If Algorithm 2 is executed with each $G^{(l)}$ independently sampled as $G^{(l)} \sim N(0,\sigma^2)^{d\times k}$ with $\sigma = \frac{5d\sqrt{4kL\log(1/\delta)}}{n\epsilon}$, then Algorithm 2 satisfies $(\epsilon,\delta)$-differential privacy. If Algorithm 2 is executed with each $G^{(l)}$ independently sampled as $G^{(l)} \sim \mathrm{Lap}(\sigma)^{d\times k}$ with $\sigma = \frac{50d^{3/2}kL}{n\epsilon}$, then Algorithm 2 satisfies $\epsilon$-differential privacy.

The proofs of Theorem 3.4, Corollary 3.5, and Theorem 3.6 are given in the appendix, Section A.4.

4 Experiments

We evaluate our mechanisms on five datasets, all from the UCI repository: 1) CRM: the Communities and Crime dataset, which combines socio-economic data, law enforcement data, and crime data. 2) CTG: a Cardiotocography dataset consisting of measurements of fetal heart rate and uterine contraction features on cardiotocograms. 3) PAM: a Physical Activity Monitoring dataset consisting of inertial measurements and heart rate data. 4) PKS: a dataset consisting of a series of biomedical voice measurements of a group of people, some of whom have Parkinson's disease. 5) WDBC: the Breast Cancer Wisconsin Diagnostic dataset, consisting of characteristics of cell nuclei.

Table 2: Summary of the datasets

Dataset   Size (n)   # Attributes (d)
CRM       1993       100
CTG       2126       20
PAM       20000      40
PKS       5875       20
WDBC      569        30

A summary of the size and the number of attributes3 of these datasets is given in Table 2. Since the data universe considered in this paper is $[-1,1]^d$, we normalize each attribute to $[-1,1]$.

We conduct two groups of experiments. In one group we use the mechanism which guarantees $\epsilon$-differential privacy, and in the other we use the algorithm which guarantees $(\epsilon,\delta)$-differential privacy. In both groups of experiments we set $\epsilon = 1$; we set $\delta = 10^{-10}$ in the experiments with $(\epsilon,\delta)$-differential privacy.

The queries employed in the experiments are linear combinations of Gaussian kernel functions.
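Concretely, a query of this form can be generated and evaluated as in the following sketch (the function name and interface are our own; parameter choices follow the experimental setup described here):

```python
import numpy as np

def random_gaussian_query(d, J=10, sigma=4.0, rng=None):
    """Draw a random query f(x) = sum_j alpha_j exp(-||x - x_j||^2 / (2 sigma^2)),
    with alpha_j ~ U[0,1] and centers x_j ~ U([-1,1]^d), as in the experiments."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.uniform(0.0, 1.0, size=J)
    centers = rng.uniform(-1.0, 1.0, size=(J, d))

    def f(X):
        # X: (m, d) array of points; returns the (m,) vector of f values.
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2)) @ alpha

    return f
```

The query value on a database $D$ (rows in $[-1,1]^d$) is then the average `f(D).mean()`, i.e., $q_f(D)$.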
We use this type of function because 1) these functions possess the good smoothness properties stated in Section 2, and 2) linear combinations of Gaussians are universal approximators. The detailed parameter setting of the query functions is as follows. We consider
$$f(x) = \sum_{j=1}^{J} \alpha_j \exp\Big(-\frac{\|x - x_j\|^2}{2\sigma^2}\Big).$$
In all experiments we set $J = 10$; $\alpha_j$ is chosen uniformly at random from $[0,1]$, and $x_j$ is chosen uniformly at random from $[-1,1]^d$. We test various values of $\sigma$ to see how the smoothness of the query function affects the performance of the algorithm (see below for detailed results).

We use several performance measures to evaluate the algorithm, the goal being a comprehensive understanding of the performance of the mechanism. We consider the worst-case error of the mechanism over the set of queries. Because our query set, i.e., linear combinations of Gaussian kernels, contains infinitely many functions, we randomly choose 10000 queries in each experiment; the worst-case error is taken over these 10000 queries. We give both absolute error and relative error for all experiments. The absolute error of a query $q_f$ is defined as $|q_f(D) - q_f(\tilde{D})|$, and the relative error is defined as $\big|\frac{q_f(D) - q_f(\tilde{D})}{q_f(D)}\big|$. We present relative error because in certain cases (e.g., when $\sigma$ is small) $f(x)$ is very small for most $x \in D$; in such cases a small absolute error does not necessarily imply good performance, and relative error is more informative.4

3 Because we study smooth queries defined on Euclidean space, we only use the continuous attributes.

4 We point out that one also needs to be careful when using relative error. In our experiments, we deliberately set $\alpha_j \in [0,1]$, so $f(x) \ge 0$ for all $x$. If instead we set $\alpha_j \in [-1,1]$, then $f(x)$ can be either positive or negative, and it is possible that $q_f(D)$ is close to zero while $f(x)$ is not small for most $x \in D$. In such a case, a large relative error does not necessarily imply bad performance.

We report the running time of the mechanism for outputting the synthetic database in each experiment. The computer used in all experiments is a workstation with 2 Intel Xeon X5650 processors at 2.67 GHz and 32 GB RAM. We use the CPLEX package for solving the linear programming problems in our algorithms.

Table 3: Worst-case error under $\epsilon$-differential privacy ($C = 10^4$)

Dataset  Error   σ=2     σ=4     σ=6     σ=8     σ=10    Time(s)
CRM      Abs     0.001   0.035   0.033   0.022   0.020   1.1
         Rel     1.084   0.256   0.083   0.037   0.027
CTG      Abs     0.046   0.041   0.027   0.014   0.005   1.1
         Rel     0.209   0.063   0.033   0.015   0.006
PAM      Abs     0.007   0.006   0.004   0.001   0.004   1.2
         Rel     0.058   0.011   0.006   0.001   0.004
PKS      Abs     0.006   0.007   0.001   0.007   0.004   0.9
         Rel     0.059   0.013   0.002   0.008   0.004
WDBC     Abs     0.037   0.059   0.039   0.011   0.012   1.0
         Rel     0.329   0.110   0.053   0.013   0.014

We present the performance of the $\epsilon$-differentially private algorithm in Table 3. For each dataset, both the absolute error and the relative error, averaged over 20 rounds, are reported. We use linear combinations of Gaussians with different values of $\sigma$ as the query functions. The last column of the table lists the running time, for the worst $\sigma$, of the algorithm for outputting the synthetic database.

We now analyze the experimental results in Table 3 in greater detail. In this set of experiments we set $C = 10^4$. First, the algorithm is quite efficient: on all datasets the mechanism outputs the synthetic database in less than ten seconds. Next, consider the accuracy. As explained earlier, the relative error is more meaningful in our experiments. It can be seen that, except for the case $\sigma = 2$ (recall that in Proposition 2.1 we show $f \in C_1^K$ for $K \le \sigma^2$), the accuracy is reasonably good. The relative errors decrease monotonically as the order of smoothness of the queries increases.

In Table 4, we present the results for the $(\epsilon,\delta)$-differentially private mechanism. Compared to Table 3, the performance of the two algorithms is similar for $\delta = 10^{-10}$.

Table 4: Worst-case error under $(\epsilon,\delta)$-differential privacy ($C = 10^4$)

Dataset  Error   σ=2     σ=4     σ=6     σ=8     σ=10    Time(s)
CRM      Abs     0.001   0.018   0.034   0.020   0.020   7.5
         Rel     0.631   0.126   0.083   0.034   0.028
CTG      Abs     0.042   0.030   0.023   0.014   0.008   1.1
         Rel     0.192   0.045   0.028   0.016   0.008
PAM      Abs     0.012   0.020   0.006   0.003   0.001   6.9
         Rel     0.089   0.033   0.007   0.003   0.002
PKS      Abs     0.015   0.002   0.003   0.001   0.006   1.4
         Rel     0.109   0.003   0.003   0.001   0.007
WDBC     Abs     0.045   0.032   0.019   0.018   0.011   1.2
         Rel     0.388   0.061   0.026   0.021   0.013

5 Conclusion

Outputting a synthetic database while preserving differential privacy is very appealing from a practical viewpoint. In this paper, we propose differentially private mechanisms which output a synthetic database. The user can obtain accurate answers to all smooth queries from the synthetic database. The mechanisms run in polynomial time, while existing algorithms run in super-exponential time. For queries of high-order smoothness, the mechanisms achieve an accuracy of nearly $O(n^{-1})$, much better than the sampling error $O(n^{-1/2})$, which is inherent to differentially private mechanisms answering general queries.

There are a few future directions we think worth exploring. Smooth and non-smooth queries: as mentioned in the Introduction, there exists an efficient, differentially private algorithm which outputs a synthetic database and is accurate for the class of rectangle queries defined on $[-1,1]^d$ [4]. Note that rectangle queries are not smooth; these queries are specified by indicator functions, which are not even continuous.
The mechanism proposed in [4] is completely different from the mechanism for smooth queries given in this paper. Thus an immediate question is: can we develop efficient mechanisms which output synthetic databases and preserve differential privacy for a natural class of queries containing both smooth and important non-smooth functions?

A Proofs of the theorems and auxiliary experiment results

In this appendix, we give the proof of the main theorem in Section A.1; the analysis of BLR on the smooth query problem in Section A.2; the analysis of the smoothness of linear combinations of Gaussian kernel functions in Section A.3; the proofs of the private estimation of eigenvectors and eigenvalues in Section A.4; and the auxiliary experimental results for the simple subset-selection approach in Section A.5.

A.1 Proof of the Main Theorem

In this section we prove Theorem 3.1.

Proof of Theorem 3.1. We first define some notation used repeatedly in the proof. Let the input database be
$$D = (z^{(1)}, z^{(2)}, \cdots, z^{(n)}).$$
Let the discretized dataset be (see step 5 in Algorithm 1)
$$D' = (x^{(1)}, x^{(2)}, \cdots, x^{(n)}).$$
Also let the output synthetic dataset be
$$\tilde{D} = (y^{(1)}, y^{(2)}, \cdots, y^{(m)}).$$
Let $b = (b_r)_{\|r\|_\infty \le t-1}$ be a $t^d$-dimensional vector, where $b_r$ is defined in step 8 of the algorithm. Similarly, let $\hat{b} = (\hat{b}_r)_{\|r\|_\infty \le t-1}$ and let $W = (W_{rk})_{\|r\|_\infty \le t-1,\, \|k\|_\infty \le N-1}$, where $\hat{b}_r$ and $W_{rk}$ are defined in steps 9 and 14 of the algorithm, respectively. Let $\Delta = \hat{b} - b$ be the $t^d$-dimensional Laplace noise, where $\hat{b}$ is defined in step 16 of the algorithm. Finally, let $\tilde{b} = (\tilde{b}_r)_{\|r\|_\infty \le t-1}$, where
$$\tilde{b}_r = \frac{1}{m}\sum_{y \in \tilde{D}} \cos(r_1\theta_1(y))\cdots\cos(r_d\theta_d(y)).$$
(Recall that $\theta_i(y) = \arccos(y_i)$; see also the Notations in Algorithm 1.)

We now prove the four results in the theorem one by one.
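As an aside, the cosine-moment vectors $b$ and $\tilde{b}$ defined above are straightforward to compute directly; a minimal sketch (the function name is our own):

```python
import itertools
import numpy as np

def cosine_moments(data, t):
    """Compute the t^d-dimensional vector (b_r), ||r||_inf <= t - 1, with
    b_r = (1/m) * sum_y prod_i cos(r_i * arccos(y_i)) for rows y in [-1,1]^d.
    Note that cos(r * arccos(y)) is the degree-r Chebyshev polynomial T_r(y)."""
    theta = np.arccos(data)                        # (m, d) array of angles
    d = data.shape[1]
    b = {}
    for r in itertools.product(range(t), repeat=d):
        # product over coordinates of cos(r_i * theta_i), averaged over rows
        b[r] = np.cos(theta * np.asarray(r)).prod(axis=1).mean()
    return b
```

Since each factor is bounded by 1 in absolute value, every $b_r$ lies in $[-1,1]$, and $b_{(0,\ldots,0)} = 1$ always.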
A.1.1 Differential Privacy

That the mechanism preserves $\epsilon$-differential privacy is straightforward. Note that the output synthetic database $\tilde{D}$ contains no private information other than that obtained from $\hat{b}$, so we only need to show that $\hat{b}$ is differentially private. But this is immediate from the privacy of the Laplace mechanism.

A.1.2 Accuracy

Let $\theta = (\theta_1, \ldots, \theta_d)$. For any $f(x) \in C_B^K$, where $x \in [-1,1]^d$, let
$$g_f(\theta) := f(\cos\theta_1, \ldots, \cos\theta_d).$$
Denote by $c = (c_{r_1,\ldots,r_d})_{\|r\|_\infty \le t-1}$ a $t^d$-dimensional vector, and let
$$h_f^t(c, \theta) := \sum_{0 \le r_1,\ldots,r_d \le t-1} c_{r_1,\ldots,r_d}\cos(r_1\theta_1)\cdots\cos(r_d\theta_d).$$
For a constant $M$ (we will specify how to choose the value of $M$ later), let
$$c^* := \arg\inf_{\|c\|_\infty \le M}\ \sup_{\theta \in [-\pi,\pi]^d} \big|h_f^t(c,\theta) - g_f(\theta)\big|,$$
$$h_f^{M,t}(\theta) := \sum_{0 \le r_1,\ldots,r_d \le t-1} c^*_{r_1,\ldots,r_d}\cos(r_1\theta_1)\cdots\cos(r_d\theta_d).$$
Thus $h_f^{M,t}$ is the best $t$-th order small-coefficient approximation of $g_f$. Moreover, for any $x = (x_1,\ldots,x_d) \in [-1,1]^d$, let
$$\theta(x) := (\arccos x_1, \ldots, \arccos x_d).$$

Now we decompose the error of the mechanism into several terms:
$$
\big|q_f(\tilde{D}) - q_f(D)\big| = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{n}\sum_{z\in D} f(z)\Big|
\le \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big|
+ \Big|\frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y)) - \frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x))\Big|
+ \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big|
+ \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big|. \quad (1)
$$

We further decompose the second term on the right-hand side of the above inequality.
We have
$$
\Big|\frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y)) - \frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x))\Big| = \big|c^* \cdot (\tilde{b} - b)\big|
\le \big(\|\tilde{b} - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \big(\|\tilde{b} - Wu^*\|_1 + \|Wu^* - W'u^*\|_1 + \|W'u^* - \hat{b}'\|_1 + \|\hat{b}' - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \big(\|\tilde{b} - Wu^*\|_1 + \|(W - W')u^*\|_1 + \|Wu' - \hat{b}\|_1 + \|(W - W')u'\|_1 + 2\|\hat{b}' - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \Big(\|\tilde{b} - Wu^*\|_1 + \frac{4t^d}{L} + 4\|\Delta\|_1\Big)\|c^*\|_\infty, \quad (2)
$$
where $u'$ is the uniform distribution on $D'$. Note that the second-to-last inequality holds because $\|W'u^* - \hat{b}'\|_1 \le \|W'u' - \hat{b}'\|_1$. Also, the last inequality in (2) follows from
$$\|\hat{b}' - \hat{b}\|_1 \le \frac{t^d}{L} + \|\Delta\|_1, \quad \text{and} \quad \|Wu' - \hat{b}\|_1 \le \|Wu' - b\|_1 + \|\Delta\|_1 = \|\Delta\|_1,$$
where the last equality holds since $Wu' = b$.

Define
$$\eta_d = \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big|, \qquad \eta_n = 4\|\Delta\|_1\|c^*\|_\infty,$$
$$\eta_a = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big| + \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big|,$$
$$\eta_s = \|\tilde{b} - Wu^*\|_1\|c^*\|_\infty, \qquad \eta_r = \frac{4t^d}{L}\|c^*\|_\infty,$$
where $\eta_d, \eta_n, \eta_a, \eta_s, \eta_r$ correspond to the discretization error, noise error, approximation error, sampling error, and rounding error, respectively. Combining (1), (2), and the equations above, the error of the mechanism is bounded by the sum of these five types of errors:
$$\big|q_f(\tilde{D}) - q_f(D)\big| \le \eta_d + \eta_n + \eta_a + \eta_s + \eta_r.$$
We now bound the five errors separately.

Discretization error $\eta_d$: Since $f \in C_B^K$ ($K \ge 1$), the first-order derivatives of $f$ are all bounded by $B$. Also, the discretization precision of $[-1,1]^d$ is $\frac{1}{N}$, so the distance between each point of $D$ and the corresponding point of $D'$ is $O(\frac{1}{N})$. Thus we have
$$\eta_d = \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big| \le \frac{dB}{N} = O\big(n^{-\frac{K}{2d+K}}\big).$$

Noise error $\eta_n$: Let $M$ be a constant depending on $d$, $K$, and $B$, chosen sufficiently large.5 Since $M$ is a constant, $\|c^*\|_\infty = O(1)$.

5 $M = 2^K B(\pi(K+1))^d$ suffices for this and all later requirements on $M$.
Thus, to bound $\eta_n = 4\|\Delta\|_1\|c^*\|_\infty$, we only need to bound the $\ell_1$ norm of the $t^d$-dimensional vector $\Delta$, which contains i.i.d. $\mathrm{Lap}\big(\frac{t^d}{n\epsilon}\big)$ random variables; equivalently, we need to bound the sum of $t^d$ i.i.d. exponentially distributed random variables. It is well known that such a sum follows a gamma distribution. A simple calculation yields
$$P\Big(\|\Delta\|_1 \le \frac{2t^{2d}}{n\epsilon}\Big) \ge 1 - 10e^{-\frac{t^d}{5}}.$$
Thus, with probability $1 - 10e^{-t^d/5}$, we have
$$\eta_n = 4\|\Delta\|_1\|c^*\|_\infty \le O\Big(\frac{t^{2d}}{n\epsilon}\Big).$$

Approximation error $\eta_a$: Recall that for any $x$, $g_f(\theta(x)) = f(x)$. We have
$$\eta_a = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big| + \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big| \le 2\big\|g_f - h_f^{M,t}\big\|_{[-\pi,\pi]^d}.$$
To bound $\eta_a$, we need the following result.

Theorem A.1 ([33]). For any $K, d, B$, there is $M$ such that for every $f \in C_B^K$,
$$\big\|g_f - h_f^{M,t}\big\|_{[-\pi,\pi]^d} \le O\Big(\frac{1}{t^{K+1}}\Big).$$
According to this theorem, we have $\eta_a \le O\big(\frac{1}{t^{K+1}}\big)$.

Sampling error $\eta_s$: It is easy to bound the sampling error. Let $W_r$ be the row vector of the matrix $W$ indexed by $r$. Recall that $-1 \le W_{rk} \le 1$. Thus for each $r$, by the Chernoff bound we have that for any $\tau > 0$,
$$P\big(|\tilde{b}_r - W_r u^*| \ge \tau\big) \le 2e^{-\frac{m\tau^2}{2}},$$
since $\tilde{b}_r$ is just the average of $m$ i.i.d. samples and $W_r u^*$ is its expectation. Next, by the union bound,
$$P\big(\|\tilde{b} - Wu^*\|_\infty \ge \tau\big) \le 2t^d e^{-\frac{m\tau^2}{2}},$$
and therefore
$$P\big(\|\tilde{b} - Wu^*\|_1 \ge t^d\tau\big) \le 2t^d e^{-\frac{m\tau^2}{2}}.$$
Setting $\tau$ such that $2t^d e^{-\frac{m\tau^2}{2}} = e^{-t}$, we have that with probability $1 - e^{-t}$,
$$\|\tilde{b} - Wu^*\|_1 \le O\Big(\frac{t^{d+1/2}}{\sqrt{m}}\Big).$$

Rounding error $\eta_r$: Since $\|c^*\|_\infty$ is upper bounded by a constant, we have $\eta_r \le O\big(\frac{t^d}{L}\big)$.
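The tail bound for $\|\Delta\|_1$ used in the noise-error step above can be checked numerically; a Monte Carlo sketch with toy parameter values:

```python
import numpy as np

# Monte Carlo check of P(||Delta||_1 <= 2 t^(2d) / (n eps)).
# Delta has t^d i.i.d. Lap(t^d/(n eps)) coordinates, so ||Delta||_1 is a sum
# of t^d i.i.d. exponential variables (a gamma variable) with mean
# t^(2d)/(n eps); the threshold below is twice that mean.
rng = np.random.default_rng(1)
t, d, n, eps = 4, 3, 10_000, 1.0
dim = t ** d                                   # number of coordinates, t^d
scale = dim / (n * eps)                        # Laplace scale t^d / (n eps)
samples = rng.laplace(0.0, scale, size=(5000, dim))
frac = (np.abs(samples).sum(axis=1) <= 2 * t ** (2 * d) / (n * eps)).mean()
# frac should be very close to 1, consistent with 1 - 10 exp(-t^d / 5).
```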
Putting it together: Combining the five types of errors, we have that with probability $1 - e^{-t} - 10e^{-t^d/5}$, the error of the mechanism satisfies
$$\Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{n}\sum_{z\in D} f(z)\Big| \le O\Big(\frac{1}{N} + \frac{1}{t^{K+1}} + \frac{t^{2d}}{n\epsilon} + \frac{t^{d+\frac{1}{2}}}{\sqrt{m}} + \frac{t^d}{L}\Big). \quad (3)$$
Recall that the mechanism sets
$$t = \big\lceil n^{\frac{1}{2d+K}} \big\rceil, \quad N = \big\lceil n^{\frac{K}{2d+K}} \big\rceil, \quad m = \big\lceil n^{1 + \frac{K+1}{2d+K}} \big\rceil, \quad L = \big\lceil n^{\frac{d+K}{2d+K}} \big\rceil.$$
The theorem follows after some simple calculation.

A.1.3 Running time

It is not difficult to see that the running time of the mechanism is dominated by solving the linear programming problem in step 20. (Because the time complexity of linear programming is stated in terms of arithmetic operations, all running times discussed here should be understood in this sense.) To analyze the running time of the LP, observe that it can be rewritten in the following standard form:
$$\min_{\bar{x}}\ \bar{c}^T\bar{x} \quad \text{s.t.} \quad \bar{A}\bar{x} = \bar{b}, \ \bar{x} \ge 0, \quad (4)$$
where
$$\bar{A} = \begin{pmatrix} L\cdot W' & L\cdot I_{t^d} & -L\cdot I_{t^d} \\ \mathbf{1}_{N^d}^T & 0 & 0 \end{pmatrix}, \quad \bar{b} = \begin{pmatrix} L\cdot\hat{b}' \\ 1 \end{pmatrix}, \quad \bar{c} = \begin{pmatrix} 0 \\ \mathbf{1}_{t^d} \\ \mathbf{1}_{t^d} \end{pmatrix}, \quad \bar{x} = \begin{pmatrix} u \\ v \\ w \end{pmatrix}.$$
Here $\bar{A}$ is an $\bar{m}\times\bar{n}$ matrix with $\bar{m} = t^d + 1$ and $\bar{n} = N^d + 2t^d$. Note that 1) each element of $W'$ is in $[-1,1]$; 2) each element of $\hat{b}'$ is in $[-1,1]$; and 3) each element of $W'$ and $\hat{b}'$ is rounded to precision $1/L$. So we have in fact reduced to an LP problem (4) in which the elements of $\bar{A}$, $\bar{b}$, $\bar{c}$ are all integers bounded by $L$.

The best-known worst-case complexity of the interior point algorithm for linear programming with integer parameters is $O(\bar{n}^3\tilde{L})$, where $\bar{n}$ is the number of variables and $\tilde{L}$ is the number of bits needed to encode the linear programming problem. Here we use a more refined bound given in [2]. Using this bound, we are able to prove a much better time complexity for our algorithm, because in the linear programming problem (4) the number of constraints is much smaller than the number of variables.
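For illustration, an LP of the form (4) can be assembled and solved with an off-the-shelf solver. The sketch below uses SciPy with toy sizes and random stand-ins for the rounded moment matrix $W'$ and the noisy moments $\hat{b}'$; the $v - w$ split turns the $\ell_1$ residual into a linear objective.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of LP (4): find a distribution u over grid points minimizing
# ||W u - b_hat||_1 via the v - w split of the residual. W and b_hat are
# random stand-ins for the rounded cosine-moment matrix and noisy moments.
rng = np.random.default_rng(2)
n_grid, n_basis = 60, 8                      # stand-ins for N^d and t^d

W = rng.uniform(-1.0, 1.0, size=(n_basis, n_grid))
b_hat = rng.uniform(-1.0, 1.0, size=n_basis)

A_eq = np.block([
    [W, np.eye(n_basis), -np.eye(n_basis)],             # W u + v - w = b_hat
    [np.ones((1, n_grid)), np.zeros((1, 2 * n_basis))], # sum(u) = 1
])
b_eq = np.concatenate([b_hat, [1.0]])
c = np.concatenate([np.zeros(n_grid), np.ones(2 * n_basis)])  # min sum(v + w)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
u = res.x[:n_grid]   # the distribution from which the synthetic data is sampled
```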
The bound we use for the complexity of linear programming is $O(\bar{n}^{1.5}\bar{m}^{1.5}\ln\bar{m}\,\bar{L})$ [2]. Here $\bar{L}$ is the size of the LP problem in standard form, defined as follows [26]:
$$\bar{L} = \big\lceil \log\big(1 + |\det(\bar{A}_{\max})|\big) \big\rceil + \big\lceil \log(1 + \|\bar{c}\|_\infty) \big\rceil + \big\lceil \log(1 + \|\bar{b}\|_\infty) \big\rceil + \big\lceil \log(\bar{m} + \bar{n}) \big\rceil,$$
where
$$\bar{A}_{\max} = \arg\max_{X \text{ a square submatrix of } \bar{A}} |\det(X)|.$$
Note that $\bar{m} < \bar{n}$, so the size of $\bar{A}_{\max}$ is at most $\bar{m}\times\bar{m}$. Therefore,
$$|\det(\bar{A}_{\max})| \le \bar{m}!\,L^{\bar{m}}, \quad \text{and} \quad \bar{L} = O\big(\bar{m}(\log\bar{m} + \log L) + \log\bar{n}\big).$$
Given $\bar{m} = O(t^d)$ and $\bar{n} = O(N^d)$, a simple calculation shows that the total time complexity is
$$O\big(\bar{n}^{1.5}\bar{m}^{1.5}\ln\bar{m}\,\bar{L}\big) = O\big(N^{1.5d}t^{2.5d}\big) = O\big(n^{\frac{3dK+5d}{4d+2K}}\big).$$

A.1.4 Size of the output synthetic database

The size $m$ of the synthetic dataset is set in step 1 of the algorithm.

A.2 Analysis of the Performance of BLR on the Smooth Query Problem

In this section we prove Proposition 3.3.

Proof of Proposition 3.3. As stated in Section 3.3, the accuracy of BLR is
$$\tilde{O}\Bigg(\bigg(\frac{\log|Q|\log|\mathcal{X}|}{n}\bigg)^{1/3}\Bigg).$$
So here we only need to analyze the size of the query set obtained after discretization. For every $K \in \mathbb{N}$, let $Q_\alpha(C_B^K)$ be the set of queries obtained by discretizing both the domain $[-1,1]^d$ and the range $[-B,B]$ of the smooth functions in $C_B^K$ with precision $\alpha$, as described in Section 3.3. We use the following result.

Lemma A.2. There is an absolute constant $c$ such that
$$\log\big|Q_\alpha(C_B^K)\big| \ge c\Big(\frac{1}{\alpha}\Big)^{d/K}.$$

Since the discretization precision is $\alpha$, and the first-order derivatives of the functions are bounded by the constant $B$, the total error induced by discretization of the domain and range is at least $\alpha$. Thus the error of the discretized BLR is
$$\max\Bigg\{B\alpha,\ \tilde{O}\Bigg(\bigg(\frac{(1/\alpha)^{d/K}}{n}\bigg)^{1/3}\Bigg)\Bigg\}.$$
The proposition follows by choosing the optimal $\alpha$.

Proof of Lemma A.2. Without loss of generality, we consider the case $B = 1$.
Define $h(x)$ ($x \in \mathbb{R}^d$) as follows:
$$h(x) = \begin{cases} \exp\Big(1 - \frac{1}{1 - \|x\|_2^2}\Big), & \|x\|_2 \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
It is well known that $h(x) \in C^\infty(\mathbb{R}^d)$, $h(x) \in [0,1]$, and for every $d$-tuple of nonnegative integers $k = (k_1,\ldots,k_d)$, $D^k h(x) = 0$ when $\|x\| \ge 1$. Since the partial derivatives of $h$ are continuous and $h$ has bounded support, we may define
$$M_K := \max_{|k| \le K}\max_x |D^k h(x)|.$$
Since $K$ is a constant, $M_K$ is also a constant.

Let $N_0 = 1/\alpha$. For simplicity we assume $N_0$ is an integer. First we partition $[-1,1]^d$ into hypercubes with equal side length. Let $n_0$ be an integer whose value will be determined later, let $l = n_0/N_0$ be the side length of the hypercubes, and let $m_0 = 1/l$ be the number of hypercubes along each dimension. Denote the centers of the $m_0^d$ hypercubes by $x_1, \ldots, x_{m_0^d}$.

Consider the set
$$\mathcal{F} = \Big\{ z = (z_1, \ldots, z_{m_0^d}) : z_1 \in \Big\{-\tfrac{N_0-1}{N_0}, -\tfrac{N_0-2}{N_0}, \ldots, \tfrac{N_0-1}{N_0}\Big\},\ z_i \in \{-1, 0, 1\},\ i = 2, 3, \ldots, m_0^d \Big\}.$$
Clearly $|\mathcal{F}| = (2N_0 - 1)\,3^{m_0^d - 1}$.

For every $z \in \mathcal{F}$, we will construct a $K$-smooth function $f_z$ so that for every pair $z, z' \in \mathcal{F}$, $f_z$ and $f_{z'}$ are still different after discretization over the domain and the range. In particular, we require that $f_z$ and $f_{z'}$ are different as long as the discretization precision is $\alpha$; it does not matter where the discretization thresholds are set. If this can be done, then
$$\log\big|Q_\alpha(C_B^K)\big| \ge \Omega(m_0^d).$$
Below, we will show that $m_0$ can be as large as $\Omega\big(N_0^{1/K}\big)$. Once this is proved, the proposition follows.

To do this, define
$$f_z(x) = z_1 + \frac{1}{N_0}\sum_{j=2}^{m_0^d} h\big(2(x - x_j)/l\big)\cdot z_j.$$
Now let us look at some simple properties of the function $f_z$: it perturbs the constant function $z_1$ with linear combinations of the infinitely smooth function $h$ shifted to each $x_j$ (the centers of the hypercubes).
Moreover, $z_j \in \{-1, 0, 1\}$ ($j = 2, 3, \ldots, m_0^d$) controls the perturbation at $x_j$: it can be a positive or negative perturbation, or no perturbation. The magnitude of the perturbation is $1/N_0$. Note that $h(2(x - x_j)/l)$ is supported on the set $\{x \in \mathbb{R}^d : \|x - x_j\|_2 \le l/2\}$. For any $x \in \mathbb{R}^d$, there exists at most one $j$ such that $h(2(x - x_j)/l) \ne 0$; therefore, for any fixed $x$, at most one term in the summation defining $f_z$ does not vanish. Also note that $\frac{1}{N_0}h(2(x - x_j)/l)\cdot z_j$ can contribute $1/N_0$ to the magnitude of $f_z$. Thus for different $z, z'$, the functions $f_z$ and $f_{z'}$ are always different, no matter where the discretization thresholds are placed.

Furthermore, if $|k| \le K$, then for every $z$,
$$|D^k f_z(x)| = \frac{1}{N_0}\Big|\sum_{j=2}^{m_0^d} D^k h\big(2(x - x_j)/l\big)\Big| \le \Big(\frac{2}{l}\Big)^K M_K / N_0,$$
since the supports of the perturbations $h$ do not overlap. In order that all the functions $f_z$ have $K$-norm bounded by 1, we need
$$\Big(\frac{2}{l}\Big)^K M_K / N_0 \le 1.$$
This inequality can be satisfied by setting
$$n_0 = 2M_K^{1/K} N_0^{\frac{K-1}{K}},$$
and thus
$$m_0 = \frac{N_0}{n_0} = \Omega\big(N_0^{1/K}\big).$$
The lemma follows.

A.3 Smoothness of Linear Combinations of Gaussian Kernel Functions

In this section we prove Proposition 2.1. First, we state a well-known inequality for Hermite polynomials; Proposition 2.1 follows almost immediately from it.

Lemma A.3 ([20]). The Hermite polynomial of degree $k$, defined as
$$H_k(x) = (-1)^k e^{x^2}\frac{d^k}{dx^k}e^{-x^2},$$
where $k \in \mathbb{N}$ and $x \in (-\infty,\infty)$, satisfies the inequality
$$|H_k(x)| \le (2^k k!)^{\frac{1}{2}} e^{\frac{1}{2}x^2}.$$

Proof of Proposition 2.1. Since $\|\alpha\|_1 \le 1$, we only need to show that the $K$-norm of the Gaussian kernel function is bounded by 1. Let $g(x) = e^{-x^2}$. From Lemma A.3 we directly have
$$\Big|\frac{d^k}{dx^k}g(x)\Big| = |H_k(x)|\,e^{-x^2} \le (2^k k!)^{\frac{1}{2}}.$$
Let $k = (k_1, \ldots, k_d)$ with $|k| = K$.
Therefore, for $f(x)$ defined in Proposition 2.1, we have
$$|D^k f(x)| = \prod_{j=1}^d \Big|\frac{d^{k_j}}{dx_j^{k_j}}\,g\Big(\frac{x_j - y_j}{\sqrt{2}\sigma}\Big)\Big| \le \Big(\frac{1}{\sqrt{2}\sigma}\Big)^K \prod_{j=1}^d (2^{k_j}k_j!)^{\frac{1}{2}} \le \frac{(K!)^{\frac{1}{2}}}{\sigma^K}.$$
Obviously, when $K \le \sigma^2$,
$$|D^k f(x)| \le \frac{K^{\frac{K}{2}}}{\sigma^K} \le 1.$$
The proposition follows.

A.4 Private Estimation of Eigenvectors and Eigenvalues

In this section we prove Theorem 3.4 and the privacy guarantee (Theorem 3.6). For simplicity we denote by $\|X\|$ the spectral norm of a matrix $X$.

Before stating the proofs formally, let us take a closer look at Theorem 3.3 in [16]: with high probability, given regularity conditions, the tangent of the angle between the space spanned by the top-$k$ leading eigenvectors (the eigenspace) and the space spanned by the output columns (the output space) is small. Our goal is the column-wise convergence between eigenvectors and output columns, which can be concluded from the simultaneous convergence between the increasing sequence of eigenspaces and the increasing sequence of output spaces, given that they share the same dimension. This constraint leads us to use a weaker version of Theorem 3.3 by specifying $r = k$, but the resulting column-wise convergence at least compensates for the loss of the tuning parameter $r$. Note that simply applying Theorem 3.3 consecutively to the sequence will not ensure the high convergence probability $1 - o(1)$. Our analysis extends to the case $k = O(d)$, where the dimension $d$ can grow with the size of the database, provided the magnitude of the added noise is adequate.

Lemma A.4. Assume the data universe $\mathcal{X} = [-1,1]^d$. For all pairs of neighboring databases $D, D'$ with $|D| = |D'| = n$, let $A(D) = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$, where $\bar{D}$ is the mean of $D$. It holds that
$$\|A(D) - A(D')\| \le \frac{5d}{n}.$$

Lemma A.5. Let $A = (a_{ij}) \in \mathbb{R}^{n\times d}$, and denote by $A_{kl} = (a_{ij})_{i\le k, j\le l}$ the $(k,l)$-submatrix of $A$ for any $k \le n$ and $l \le d$. Then $\|A_{kl}\| \le \|A\|$.

Lemma A.6. Let $U \in \mathbb{R}^{d\times k}$ be a matrix with orthonormal columns. Let $G^{(1)}, \ldots, G^{(L)} \sim N(0,\sigma^2)^{d\times k}$ with $k \le d$, and assume $L \le d$. Let $G^{(l)}_s$ and $U_s$ be the $(d,s)$-submatrices of $G^{(l)}$ and $U$, respectively, for $s \in [k]$. Then, with probability $1 - o(1)$,
$$\max_{l\in[L]} \big\|U_s^T G^{(l)}_s\big\| \le O\big(\sigma\sqrt{k\log L}\big), \quad \forall s \in [k].$$

Lemma A.7. Let $U \in \mathbb{R}^{d\times k}$ be a matrix with orthonormal columns. Let $G^{(1)}, \ldots, G^{(L)} \sim \mathrm{Lap}(\sigma)^{d\times k}$ with $k \le d$, and assume $L \le d$. Let $G^{(l)}_s$ and $U_s$ be the $(d,s)$-submatrices of $G^{(l)}$ and $U$, respectively, for $s \in [k]$. Then, with probability $1 - o(1)$,
$$\max_{l\in[L]} \big\|U_s^T G^{(l)}_s\big\| \le O\big(\sigma k\sqrt{\log(Lk^2)}\big), \quad \forall s \in [k].$$

Proof of Theorem 3.4. Let $m = \max_l \|X^{(l)}\|_\infty$, assume the spectral decomposition $A = Z\Lambda Z^{-1}$, and write
$$\Lambda = \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix}, \qquad Z = \begin{pmatrix} Z_1 & Z_2 \end{pmatrix},$$
where $\Lambda_1 \in \mathbb{R}^{s\times s}$ and $Z_1 \in \mathbb{R}^{d\times s}$. Denote $U_s = Z_1$ and $V_s = Z_2$; we then have $A = U_s\Lambda_1 U_s^T + V_s\Lambda_2 V_s^T$. Let $\Delta(U_s) \ge \max_{l\in[L]}\|U_s^T G^{(l)}_s\|$ and $\Delta(V_s) \ge \max_{l\in[L]}\|V_s^T G^{(l)}_s\|$, where $G^{(l)}_s$ is the $(d,s)$-submatrix of $G^{(l)}$. By Lemma A.6, we conclude that with probability $1 - o(1)$ the following events occur simultaneously:

1. $\forall s \in [k]$, $\Delta(U_s) \le O(\sigma m\sqrt{k\log L})$;
2. $\forall s \in [k]$, $\Delta(V_s) \le O(\sigma m\sqrt{d\log L})$.

Notice that for all $s \le k$ we have $\Delta(U_s) \le \Delta(V_s)$, since we set $s \le k \le d/2$. Since $\arccos\theta(U_s, X^{(0)}_s)$ is bounded, where $X^{(0)}_s$ is the $(d,s)$-submatrix of $X^{(0)}$, we have for all $s \le k$
$$\frac{\Delta(U_s)}{\arccos\theta(U_s, X^{(0)}_s)} \le O\big(\sigma m\sqrt{k\log L}\big).$$
Applying Theorem 2.9 in [16], we have with probability $1 - o(1)$, for all $s \le k$,
$$\tan\theta(U_s, X^{(L)}_s) \le O\Big(\frac{\sigma}{\gamma_s\lambda_s}\sqrt{d\max_l\|X^{(l)}\|_\infty^2\log L}\Big). \quad (5)$$
For the case $s = 1$, the theorem is proved. Now, for any fixed $1 < s \le k$, notice that $u_s$ lies in the space spanned by $(u_1,\ldots,u_s)$ as well as in the orthogonal complement of the space spanned by $(u_1,\ldots,u_{s-1})$, so we have
$$
\sin^2\theta(u_s, x^{(L)}_s) = \big\|U_{s-1}U_{s-1}^T x^{(L)}_s + (I - U_s U_s^T)x^{(L)}_s\big\|^2
= \big\|U_{s-1}U_{s-1}^T x^{(L)}_s\big\|^2 + \big\|(I - U_s U_s^T)x^{(L)}_s\big\|^2
$$
$$
\le \sin^2\theta(U_{s-1}, X^{(L)}_{s-1}) + \sin^2\theta(U_s, X^{(L)}_s)
\le 2\max\big\{\sin^2\theta(U_{s-1}, X^{(L)}_{s-1}),\ \sin^2\theta(U_s, X^{(L)}_s)\big\}
$$
$$
\le 2\max\big\{\tan^2\theta(U_{s-1}, X^{(L)}_{s-1}),\ \tan^2\theta(U_s, X^{(L)}_s)\big\}. \quad (6)
$$
The theorem, for the case $s \ge 2$, is proved by substituting (5) into (6).

Proof of Corollary 3.5. Write $x_s = x^{(L)}_s$ and $\theta^{(L)} = \theta(U_s, X^{(L)}_s)$ for short. Let $x_s = u + u^\perp$, where $u$ is in the direction of the eigenvector corresponding to $\lambda_s$. Then, since $\|u\| = \cos\phi$ and $\|u^\perp\| = \sin\phi$ for some $\phi \le \theta^{(L)}$, we have
$$
\hat{\lambda}_s^2 = x_s^T A^2 x_s = u^T A^2 u + (u^\perp)^T A^2 u^\perp = \lambda_s^2 u^T u + (u^\perp)^T A^2 u^\perp
\le \lambda_s^2\|u\|^2 + \lambda_1^2\|u^\perp\|^2
$$
$$
\le \lambda_s^2\cos^2\theta^{(L)} + \lambda_1^2\sin^2\theta^{(L)} = \lambda_s^2(1 - \sin^2\theta^{(L)}) + \lambda_1^2\sin^2\theta^{(L)} = \lambda_s^2 + (\lambda_1^2 - \lambda_s^2)\sin^2\theta^{(L)}.
$$
Thus,
$$|\hat{\lambda}_s - \lambda_s| \le \frac{\lambda_1^2 - \lambda_s^2}{\hat{\lambda}_s + \lambda_s}\sin^2\theta^{(L)} = O\Big(\frac{\sigma^2 d\max_l\|X^{(l)}\|_\infty^2\log L}{\gamma_s^2\lambda_s^2}\Big).$$
The corollary follows.

Proof of Theorem 3.6. This follows from Lemma 3.6 in [16].

A.5 Experimental Results: A Simple Approach to Obtaining the Subset

In this section we give the setting of the number of basis functions $R$ used in our experiments, and provide the experimental results with the subset $S$ sampled uniformly from the $N^d$ grid points. Let
$$R = \begin{cases} \tilde{C}\,n^{\frac{d}{2d+\sigma^2}} & \epsilon\text{-differential privacy}, \\ \tilde{C}\,n^{\frac{2d}{3d+2\sigma^2}} & (\epsilon,\delta)\text{-differential privacy}, \end{cases}$$
where $\tilde{C}$ is a constant; we chose $\tilde{C} = 0.5$. All results in Table 5 and Table 6 are averages over 20 independent experiment rounds.
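The worst-case errors reported in the tables can be computed with a routine like the following sketch (the function name and interface are our own; query generation is as described in Section 4):

```python
import numpy as np

def worst_case_errors(D, D_syn, queries):
    """Worst-case absolute and relative error over a batch of queries, as
    reported in the tables. D, D_syn: (n, d) and (m, d) arrays; queries: a
    list of callables f mapping an (m, d) array to an (m,) array of values."""
    abs_errs, rel_errs = [], []
    for f in queries:
        q_true, q_syn = f(D).mean(), f(D_syn).mean()   # q_f(D) and q_f(D~)
        abs_errs.append(abs(q_true - q_syn))
        rel_errs.append(abs((q_true - q_syn) / q_true))
    return max(abs_errs), max(rel_errs)
```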
Comp are to T able 3 and T able 4, the wo rst-case error obtained th r ough PS I red u ced significan tly . 24 T able 5: W orst-case error of ǫ -differential p riv acy (hyp ercub e) Dataset Erro r σ Time(s) 2 4 6 8 10 CRM Abs 0.001 0.02 8 0.035 0.031 0.031 7.2 Rel 1.721 0.22 6 0.101 0.051 0.046 CTG Abs 0.089 0.07 5 0.050 0.028 0.017 1.8 Rel 0.796 0.13 9 0.066 0.033 0.019 P AM Abs 0.111 0.16 0 0.097 0.062 0.043 9.7 Rel 0.646 0.25 5 0.121 0.070 0.047 PKS Abs 0.071 0.07 9 0.050 0.027 0.017 3.4 Rel 0.655 0.15 4 0.068 0.032 0.019 WDBC Abs 0.040 0.06 2 0.029 0.019 0.015 2.7 Rel 0.309 0.13 7 0.037 0.022 0.017 T able 6: W orst-case error of ( ǫ, δ )-different ial priv acy (h yp ercub e) Dataset Erro r σ Time(s) 2 4 6 8 10 CRM Abs 0.001 0.02 7 0.041 0.034 0.027 14.5 Rel 1.773 0.25 8 0.093 0.054 0.039 CTG Abs 0.103 0.07 5 0.042 0.024 0.019 2.6 Rel 0.884 0.14 0 0.055 0.028 0.021 P AM Abs 0.101 0.15 8 0.104 0.067 0.042 15.3 Rel 0.595 0.25 3 0.128 0.076 0.046 PKS Abs 0.099 0.08 6 0.048 0.027 0.022 3.2 Rel 0.924 0.16 5 0.065 0.032 0.025 WDBC Abs 0.040 0.04 6 0.040 0.021 0.019 3.3 Rel 0.340 0.09 9 0.057 0.026 0.021 25 References [1] C. Aggarw al and P . Y u. A ge neral surv ey of pr iv acy preservin g data mining models and algorithms. I n Privacy-Pr eserving Data Mi ni ng , c hapter 2, pages 11–52. S pringer, 2008. [2] K. M. Anstreic her. Linear pr ogramming in O ( n 3 ln n L ) op erations. SIAM J. on Optimization , 9(4):8 03–81 2, Ap r . 1999. [3] B. Barak, K. Chaudhuri, C. Dwork, S. K ale, F. McSh erry , and K. T alwar. Priv acy , accuracy , and consistency to o: a holistic solution to continge ncy table release. In POD S , p ages 273–282 . A CM, 2007. [4] A. Blum, K. Ligett, and A. Roth. A learning theory approac h to non-inte ractiv e database priv acy . In STOC , pages 609–618 . A CM, 2008. [5] K. Ch audhuri and D. Hsu . Sample complexit y b ounds for different ially priv ate learning. In COL T , 2011. [6] K. Ch audh uri, C. Mon teleoni, and A. Sarwa te. 
Differentially private empirical risk minimization. JMLR, 12:1069, 2011.
[7] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal components. In NIPS, pages 998–1006, 2012.
[8] M. Cheraghchi, A. Klivans, P. Kothari, and H. Lee. Submodular functions are noise stable. In SODA, pages 1586–1592. SIAM, 2012.
[9] K. Choromanski, G. Jagannathan, A. Choromanska, and C. Monteleoni. Differentially-private learning of low dimensional manifolds. In ALT, 2013.
[10] J. Duchi, M. Jordan, and M. Wainwright. Privacy aware learning. In NIPS, 2012.
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, pages 265–284, 2006.
[12] C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381–390. ACM, 2009.
[13] C. Dwork, A. Nikolov, and K. Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. arXiv preprint arXiv:1308.1385, 2013.
[14] C. Dwork, G. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE, 2010.
[15] A. Gupta, M. Hardt, A. Roth, and J. Ullman. Privately releasing conjunctions and the statistical query barrier. In STOC, pages 803–812. ACM, 2011.
[16] M. Hardt. Robust subspace iteration and privacy-preserving spectral analysis. arXiv preprint arXiv:1311.2495, 2013.
[17] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In NIPS, pages 2348–2356, 2012.
[18] M. Hardt and G. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70. IEEE Computer Society, 2010.
[19] M. Hardt, G. N. Rothblum, and R. A. Servedio. Private data release via learning thresholds.
In SODA, pages 168–187. SIAM, 2012.
[20] J. Indritz. An inequality for Hermite polynomials. Proceedings of the American Mathematical Society, 12(6):981–983, 1961.
[21] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, 2012.
[22] D. Kifer and B. Lin. Towards an axiomatization of statistical privacy and utility. In PODS, pages 147–158. ACM, 2010.
[23] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In KDD, pages 193–204. ACM, 2011.
[24] J. Lee and C. Clifton. Differential identifiability. In KDD, pages 1041–1049. ACM, 2012.
[25] J. Lei. Differentially private M-estimators. In NIPS, 2011.
[26] R. D. Monteiro and I. Adler. Interior path following primal-dual algorithms. Part I: Linear programming. Math. Program., 44(1):27–41, June 1989.
[27] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774. ACM, 2010.
[28] A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637–649, 1998.
[29] J. Thaler, J. Ullman, and S. Vadhan. Faster algorithms for privately releasing marginals. In ICALP, pages 810–821. Springer, 2012.
[30] J. Ullman and S. Vadhan. PCPs and the hardness of generating private synthetic data. In TCC, pages 400–416. Springer, 2011.
[31] A. W. Van Der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[32] G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods-Support Vector Learning, 6:69–87, 1999.
[33] Z. Wang, K. Fan, J. Zhang, and L. Wang. Efficient algorithm for privately releasing smooth queries. In NIPS, 2013.
[34] L. Wasserman and S. Zhou. A statistical framework for differential privacy.
Journal of the American Statistical Association, 105(489):375–389, 2010.
[35] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In NIPS, 2010.