Differentially Private Data Releasing for Smooth Queries with Synthetic Database Output
Chi Jin, Ziteng Wang, Junliang Huang, Yiqiao Zhong, and Liwei Wang∗

September 18, 2018

Abstract

We consider accurately answering smooth queries while preserving differential privacy. A query is said to be $K$-smooth if it is specified by a function defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all bounded. We develop an $\epsilon$-differentially private mechanism for the class of $K$-smooth queries. The major advantage of the algorithm is that it outputs a synthetic database. In real applications, a synthetic database output is appealing. Our mechanism achieves an accuracy of $O(n^{-\frac{K}{2d+K}}/\epsilon)$ and runs in polynomial time. We also generalize the mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy. Extensive experiments on benchmark datasets demonstrate that the mechanisms have good accuracy and are efficient.

Keywords: Differential privacy, smooth queries, synthetic database.

1 Introduction

Machine learning is often conducted on datasets containing sensitive information, such as medical records and commercial data. The benefit of learning from such data is tremendous, but when releasing sensitive data one must take privacy into consideration and trade off accuracy against the privacy loss of the individuals in the database.

In this paper we study differential privacy [11], which has become a standard notion of privacy. Differential privacy guarantees that almost nothing new can be learned from a database containing one specific individual's information compared with the same database without that individual's information.
More concretely, a mechanism which releases information about the database is said to preserve differential privacy if changing a single database element does not significantly affect the probability distribution of the output. Differential privacy therefore provides strong guarantees against attacks: the risk incurred by any individual who submits her information to the database is very small. Recently there have been extensive studies of machine learning [6, 21, 35, 5, 7, 9], statistical estimation [34, 25, 10], and data mining [23, 24, 22] under the differential privacy framework.

∗Chi Jin is with Dept. of EECS, University of California, Berkeley. Email: chijin@cs.berkeley.edu. Ziteng Wang, Junliang Huang, Yiqiao Zhong and Liwei Wang are with Key Laboratory of Machine Perception, MOE, School of EECS, Peking University. Email: wangzt2012@gmail.com, huangjunliang@pku.edu.cn, yiqiaozhong@pku.edu.cn, wanglw@cis.pku.edu.cn

One of the most well studied problems in differential privacy is query answering: how to answer a set of queries differentially privately, accurately, and efficiently. A simple and efficient method is the Laplace mechanism [11], which adds Laplace noise to the true answers of the queries, with the amount of noise proportional to the sensitivity of the query function. The Laplace mechanism thus performs well on queries of low sensitivity. A typical class of low-sensitivity queries is linear queries, whose sensitivity is $O(1/n)$, where $n$ is the size of the database. Although simple and efficient, the Laplace mechanism has a limitation: it can answer at most $O(n^2)$ queries with nontrivial privacy and accuracy guarantees. In real applications there can be many users, each of whom may submit a set of queries; limiting the total number of queries to at most $n^2$ is too restrictive.
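To make the Laplace mechanism concrete, here is a minimal Python sketch for a single bounded linear query (an illustration under our own naming, not the paper's implementation; the toy database and query function are made up):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Lap(scale) by inverse CDF; density is (1/(2*scale)) * exp(-|x|/scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def linear_query(db, f):
    """q_f(D) = (1/|D|) * sum_{x in D} f(x). Changing one data point moves the
    answer by at most 2*max|f|/n, which is the O(1/n) sensitivity."""
    return sum(f(x) for x in db) / len(db)

def laplace_mechanism(db, f, eps, f_range=1.0):
    """eps-differentially private answer: add Lap(sensitivity/eps) noise,
    where the sensitivity of a bounded linear query is 2*f_range/n."""
    sensitivity = 2.0 * f_range / len(db)
    return linear_query(db, f) + laplace_noise(sensitivity / eps)

db = [0.2, -0.5, 0.9, 0.1]      # a toy 1-dimensional database
f = lambda x: x * x             # a query function with |f| <= 1 on [-1, 1]
private_answer = laplace_mechanism(db, f, eps=1.0)
```

Since the noise scale shrinks like $1/n$, a single query on a large database is answered almost exactly; the limitation discussed above only appears when the noise must be shared across many queries.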
A remarkable result due to Blum, Ligett and Roth [4] shows that information-theoretically it is possible for a mechanism to answer far more than $n^2$ linear queries while preserving differential privacy and nontrivial accuracy simultaneously. Specifically, their mechanism (referred to as BLR below) can answer exponentially many linear queries with good accuracy. A series of works [12, 14, 27, 18, 17] improves the result of [4]. All these mechanisms are very powerful in the sense that they can answer general and adversarially chosen queries.

Among the mechanisms mentioned above, BLR differs from all the others in the output of the algorithm. The output of BLR is a synthetic database, while the outputs of the other mechanisms are answers to the queries. From a practical point of view, the synthetic database output is very appealing. In fact, before the notion of differential privacy was proposed, almost all practical techniques developed to preserve privacy against certain types of attacks output a synthetic database obtained by modifying the raw dataset (see the survey [1] and the references therein).

However, outputting a synthetic database while preserving differential privacy is much more difficult, in terms of computational complexity, than outputting answers to the queries. Compare the running time of BLR with that of the Private Multiplicative Weights updating (PMW) mechanism [18], one of the best mechanisms that output answers: BLR runs in time super-polynomial in both the size of the data universe and the number of queries, while the running time of PMW is linear in these two factors. More generally, if the data universe is $\{0,1\}^d$, there are strong hardness results for differentially privately outputting a synthetic database.
In particular, it can be shown that there is no differentially private algorithm which can output a synthetic database, accurately answer general queries, and run in polynomial time¹ [30]. Given the hardness result against general queries, there has recently been growing interest in studying efficient and differentially private mechanisms for restricted classes of queries. From a practical point of view, if there exists a class of queries which is rich enough to contain most queries used in applications and allows one to develop fast mechanisms, then the hardness result is not a serious barrier for differential privacy.

Blum et al. [4] consider rectangle queries in the setting where the data universe is $[-1,1]^d$ and $d$ is a constant. A rectangle query is specified by an axis-aligned rectangle; the answer to the query is the fraction of the data points that lie in the rectangle. They show that if $[-1,1]^d$ is discretized to poly($n$) bits of precision, then there is an efficient mechanism which outputs a synthetic database and is accurate for the class of all rectangle queries.

Another class of queries that attracts a lot of attention is the $k$-way conjunctions (or $k$-way marginals). The data universe for this problem is $\{0,1\}^d$; each individual record has $d$ binary attributes. A $k$-way conjunction query is specified by $k$ features and asks what fraction of the individual records in the database has all these $k$ features equal to 1. A series of works attacks this problem using several different techniques [3, 15, 8, 19, 29, 13]. They propose elegant mechanisms which run in time poly($n$) when $k$ is a constant, even if the size of the data universe is exponentially large.

¹This hardness result assumes the existence of one-way functions.
Thus these algorithms are more efficient than the best general-query-answering mechanisms in the large-data-universe setting. However, the outputs of these mechanisms are not synthetic databases².

In this paper we study smooth queries, also defined on the data universe $[-1,1]^d$ for a constant $d$. We say a query is $K$-smooth if it is specified by a smooth function which has bounded partial derivatives up to the $K$th order. The answer to the query is the average of the function values on the data points in the database. Smooth functions are widely used in machine learning and data analysis; there are extensive studies on the relation between smoothness, regularization, reproducing kernels and generalization ability [32, 28].

Our main result is an $\epsilon$-differentially private mechanism for the class of all $K$-smooth queries. The output of the mechanism is a synthetic database. The mechanism has $(\alpha,\beta)$-accuracy, where $\alpha = O(n^{-\frac{K}{2d+K}}/\epsilon)$ for $\beta$ exponentially small. The running time of the mechanism is $O(n^{\frac{3dK+5d}{4d+2K}})$, polynomial in the size of the database. Note that if the order of smoothness $K$ is large compared to the dimension $d$, the error of the mechanism can be close to $n^{-1}$. In contrast, if we employ BLR to solve this problem and output a synthetic database, the accuracy guarantee is $O(n^{-\frac{K}{d+3K}})$, which is at best $n^{-1/3}$ for large $K$; moreover, to achieve this accuracy, the running time of BLR is super-exponential in the size of the database (see Section 3.3 for a detailed analysis). We also generalize our mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy.

Our work is related to [33], which proposes an efficient algorithm able to answer smooth queries differentially privately. However, that mechanism outputs a (private) synopsis of the database.
In order to obtain the answer to a query, the user has to run an evaluation algorithm which involves complicated numerical integration procedures. In contrast, the mechanism given in this paper simply outputs a synthetic database, which is friendly to users in applications.

We conduct extensive experiments to evaluate the performance of the proposed mechanism on benchmark datasets (which contain sensitive information such as medical records of individuals). We also develop simple techniques to improve the efficiency of the algorithm. Experimental results demonstrate that the algorithms achieve good accuracy and are practically efficient on datasets of various sizes and numbers of attributes.

The rest of the paper is organized as follows. Section 2 briefly describes the background of data privacy and gives the basic definitions. In Section 3 we propose the private mechanisms that output a synthetic database and accurately answer smooth queries; Section 3 also contains the main theoretical results, analyzing the performance of the algorithms. All the experimental results are given in Section 4. Finally, we conclude in Section 5. All proofs are given in the appendix.

²The hardness result in [30] shows that for $k$-way marginals, efficiently outputting a synthetic database is not possible.

2 Preliminaries

Let $D$ be a database containing $n$ data points in the data universe $\mathcal{X}$. In this paper we consider the case that $\mathcal{X} \subset \mathbb{R}^d$, where $d$ is a constant; typically, we assume that the data universe is $\mathcal{X} = [-1,1]^d$. Two databases $D$ and $D'$ are called neighbors if $|D| = |D'| = n$ and they differ in exactly one data point. The following is the formal definition of differential privacy.

Definition 2.1 ($(\epsilon,\delta)$-differential privacy).
A sanitizer $S$, which is a randomized algorithm that maps an input database into some range $R$, is said to preserve $(\epsilon,\delta)$-differential privacy if for all pairs of neighbor databases $D$, $D'$ and for any subset $A \subset R$, it holds that
$$P(S(D) \in A) \le P(S(D') \in A) \cdot e^{\epsilon} + \delta,$$
where the probability is taken over the random coins of $S$. If $S$ preserves $(\epsilon,0)$-differential privacy, we say $S$ is $\epsilon$-differentially private.

We consider linear queries. Each linear query $q_f$ is specified by a function $f$ which maps the data universe $[-1,1]^d$ to $\mathbb{R}$; $q_f$ is defined as $q_f(D) := \frac{1}{|D|}\sum_{x \in D} f(x)$. Let $Q$ be a set of queries. The accuracy of a mechanism with respect to $Q$ is defined as follows.

Definition 2.2 ($(\alpha,\beta)$-accuracy). Let $Q$ be a set of queries. A sanitizer $S$ is said to have $(\alpha,\beta)$-accuracy for size-$n$ databases with respect to $Q$ if for every database $D$ with $|D| = n$ the following holds:
$$P(\exists q \in Q,\ |S(D,q) - q(D)| \ge \alpha) \le \beta,$$
where $S(D,q)$ is the answer to $q$ given by $S$, and the probability is over the internal randomness of the mechanism $S$.

$(\alpha,\beta)$-accuracy is a strong notion of accuracy: it requires that with high probability all the queries are accurately answered by the mechanism (i.e., it is a worst-case accuracy with respect to queries). Some authors also consider a slightly weaker definition, $(\alpha,\beta,\gamma)$-accuracy [12].

Definition 2.3 ($(\alpha,\beta,\gamma)$-accuracy). Let $Q$ be a set of queries. A sanitizer $S$ is said to have $(\alpha,\beta,\gamma)$-accuracy for size-$n$ databases with respect to $Q$ if for every database $D$ with $|D| = n$ the following holds:
$$P(S \text{ is } (\alpha,\gamma)\text{-accurate for } D) \ge 1 - \beta,$$
where the probability is over the internal randomness of the mechanism $S$, and $(\alpha,\gamma)$-accurate means that $|S(D,q) - q(D)| \le \alpha$ holds for at least a $1-\gamma$ fraction of $q \in Q$.

We will make use of the Laplace mechanism [11] in our algorithm.
The Laplace mechanism adds Laplace noise to the output. We denote by $\mathrm{Lap}(\sigma)$ the random variable distributed according to the Laplace distribution with parameter $\sigma$: $P(\mathrm{Lap}(\sigma) = x) = \frac{1}{2\sigma}\exp(-|x|/\sigma)$.

We will design a differentially private mechanism which outputs a synthetic database $\tilde{D}$. Each element of $\tilde{D}$ is a data point in the data universe. $|\tilde{D}|$ and $|D|$ can be different, i.e., the synthetic database and the original database may contain different numbers of data points. For any query $q_f \in Q$, the user simply calculates $q_f(\tilde{D}) := \frac{1}{|\tilde{D}|}\sum_{x \in \tilde{D}} f(x)$ as an approximation of $q_f(D)$. Our differentially private mechanism guarantees accuracy with respect to the set of smooth queries.

Next we formally define smooth queries. Since each query $q_f$ is specified by a function $f$, a set of queries $Q_F$ can be specified by a set of functions $F$; recall that each $f \in F$ maps $[-1,1]^d$ to $\mathbb{R}$. For any point $x = (x_1,\ldots,x_d) \in [-1,1]^d$ and any $d$-tuple of nonnegative integers $k = (k_1,\ldots,k_d)$, define
$$D^k := D_1^{k_1} \cdots D_d^{k_d} := \frac{\partial^{k_1}}{\partial x_1^{k_1}} \cdots \frac{\partial^{k_d}}{\partial x_d^{k_d}},$$
and let $|k| := k_1 + \ldots + k_d$. Define the $K$-norm as
$$\|f\|_K := \sup_{|k| \le K}\ \sup_{x \in [-1,1]^d} |D^k f(x)|.$$
We will study the set $C_B^K$ which contains all smooth functions whose derivatives up to order $K$ have $\infty$-norm upper bounded by a constant $B > 0$. Formally, $C_B^K := \{f : \|f\|_K \le B\}$. The set of queries specified by $C_B^K$, denoted $Q_{C_B^K}$, is our focus.

Smooth functions have been studied in depth in machine learning [31, 32, 28], and many functions widely used in machine learning are smooth. An example is the Gaussian kernel function
$$f(x) = \exp\left(-\frac{\|x - x_0\|^2}{2\sigma^2}\right),$$
where $x_0 \in \mathbb{R}^d$ is a constant vector.
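In code, answering such a query from a released synthetic database requires nothing beyond averaging $f$ over its points; a minimal sketch with the Gaussian kernel above (the toy synthetic database and all names here are illustrative, not part of the mechanism itself):

```python
import math

def gaussian_kernel(x, x0, sigma):
    """f(x) = exp(-||x - x0||^2 / (2*sigma^2)), a smooth function on [-1, 1]^d."""
    sq = sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0))
    return math.exp(-sq / (2.0 * sigma ** 2))

def query_on_db(db, f):
    """q_f(D~) = (1/|D~|) * sum_{x in D~} f(x); the user needs only the
    synthetic database, never the original one."""
    return sum(f(x) for x in db) / len(db)

# A toy 2-d synthetic database, standing in for the output of a private mechanism.
synthetic_db = [(0.0, 0.0), (0.5, -0.5), (-1.0, 1.0)]
f = lambda x: gaussian_kernel(x, x0=(0.0, 0.0), sigma=1.0)
answer = query_on_db(synthetic_db, f)
```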
A linear combination of Gaussian kernels is one of the most popular functions used in machine learning:
$$f(x) = \sum_{j=1}^{J} \alpha_j \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right),$$
where $x_j$, $j = 1,2,\ldots,J$, are constant vectors. The smoothness of this type of function is characterized in the following proposition.

Proposition 2.1. Let $f(x) = \sum_{j=1}^{J} \alpha_j \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right)$, where $x \in \mathbb{R}^d$. Let $\alpha = (\alpha_1,\ldots,\alpha_J)$ and suppose $\|\alpha\|_1 \le 1$. Then for every $K \le \sigma^2$, $\|f\|_K \le 1$.

The proof is given in the appendix, Section A.3.

3 Theoretical Results

This section contains the main theoretical results of the paper. In Section 3.1 we give an $\epsilon$-differentially private mechanism which outputs a synthetic database and guarantees good accuracy for smooth queries. Section 3.2 generalizes the mechanism to preserve $(\epsilon,\delta)$-differential privacy with slightly improved accuracy. In Section 3.3 we compare the performance of our algorithms to well-known differentially private mechanisms on this problem.

3.1 The $\epsilon$-differentially Private Mechanism

The following theorem is our main result. It says that if the query class is specified by smooth functions, then there is a polynomial-time mechanism which preserves $\epsilon$-differential privacy and good accuracy. The output of the mechanism is a synthetic dataset. A formal description of the mechanism is given in Algorithm 1.

Theorem 3.1. Let the query set be $Q_{C_B^K} := \{q_f(D) = \frac{1}{n}\sum_{x \in D} f(x) : f \in C_B^K\}$, where $K \in \mathbb{N}$ and $B > 0$ are constants. Let the data universe be $[-1,1]^d$, where $d$ is a constant. Then the mechanism described in Algorithm 1 satisfies, for any $\epsilon > 0$, the following:

1) The mechanism preserves $\epsilon$-differential privacy.
2) There is an absolute constant $c$ such that for every $\beta \ge c \cdot e^{-n^{\frac{1}{2d+K}}}$ the mechanism is $(\alpha,\beta)$-accurate, where $\alpha = O(n^{-\frac{K}{2d+K}}/\epsilon)$, and the hidden constant depends only on $d$, $K$ and $B$.

3) The running time of the mechanism is $O(n^{\frac{3dK+5d}{4d+2K}})$. (This is dominated by solving the linear programming problem in step 20 of the algorithm.)

4) The size of the output synthetic database is $O(n^{1+\frac{K+1}{2d+K}})$.

The proof of Theorem 3.1 is given in the appendix, Section A.1.

Before explaining the ideas of the algorithm, let us first take a closer look at the results in Theorem 3.1. To get a better view of how the performance depends on the order of smoothness, consider three cases. The first case is $K = 1$, i.e., the query functions only have first-order derivatives. The other extreme case is $K/d = M \gg 1$, i.e., very smooth queries. We also consider a case in the middle by assuming $K = 2d$. Table 1 gives simplified upper bounds for the error, the running time of the algorithm, and the size of the output synthetic database in these cases.

From Table 1 we can see that the accuracy $\alpha$ improves dramatically, from roughly $O(n^{-\frac{1}{2d}})$ to nearly $O(n^{-1})$, as $K$ increases. For $K > 2d$, the error is smaller than the sampling error $O(\frac{1}{\sqrt{n}})$. On the other hand, the running time of the mechanism increases if one wants better accuracy for highly smooth queries (see Section 4 for how to improve the efficiency of the algorithm in practice). Finally, the size of the output synthetic database also increases in order to achieve better accuracy: roughly, $O(n^{-1})$ accuracy requires an $O(n^2)$-size synthetic database.

Now we explain the mechanism in detail. The first idea is that all smooth functions in $C_B^K$ can be approximated by linear combinations of a small set of basis functions.
In fact, approximation of smooth functions by polynomials, radial basis functions, wavelets, etc., has been well studied for decades. However, for the differential privacy problem our requirement on the approximation is quite different from the typical results in approximation theory. Specifically, we require that all smooth functions in $C_B^K$ can be approximated by linear combinations of a set of basis functions with small coefficients: the coefficients corresponding to all smooth functions must be uniformly bounded by a constant. (The reason will soon be clear.) It is not clear from standard approximation theory whether any of the above-mentioned basis function sets satisfies such a requirement. Instead, we make a change of variables $\theta_i = \arccos(x_i)$ and consider approximation of the transformed function $g_f(\theta_1,\ldots,\theta_d) = f(\cos\theta_1,\ldots,\cos\theta_d)$ by linear combinations of trigonometric polynomials. It can be shown that the trigonometric polynomial basis satisfies the small-coefficient requirement. It is worth pointing out that here we consider $L_\infty$ approximation, different from the $L_2$ approximation which, for the trigonometric basis, is simply Fourier analysis.

Algorithm 1 Private Synthetic DB for Smooth Queries
Notations: $T_t^d := \{0,1,\ldots,t-1\}^d$, $a_k := \frac{2k+1-N}{N}$, $\mathcal{A} := \{a_k \mid k = 0,1,\ldots,N-1\}$, $\mathcal{L} := \{\frac{i}{L} \mid i = -L, -L+1, \ldots, L-1, L\}$, $x := (x_1,\ldots,x_d)$, $\theta_i(x) := \arccos(x_i)$.
Parameters: Privacy parameters $\epsilon, \delta > 0$, failure probability $\beta > 0$, smoothness order $K \in \mathbb{N}$.
Input: Database $D \in ([-1,1]^d)^n$
Output: Synthetic database $\tilde{D} \in ([-1,1]^d)^m$
1: Set $t = \lceil n^{\frac{1}{2d+K}} \rceil$, $N = \lceil n^{\frac{K}{2d+K}} \rceil$, $m = \lceil n^{1+\frac{K+1}{2d+K}} \rceil$, $L = \lceil n^{\frac{d+K}{2d+K}} \rceil$.
2: Initialize: $D' \leftarrow \emptyset$, $\tilde{D} \leftarrow \emptyset$, $u \leftarrow 0^{N^d}$
3: for all $z = (z_1,\ldots,z_d) \in D$ do
4:   $x_i \leftarrow \arg\min_{a \in \mathcal{A}} |z_i - a|$, $i = 1,\ldots,d$
5:   Add $x = (x_1,\ldots,x_d)$ to $D'$
6: end for
7: for all $r = (r_1,\ldots,r_d) \in T_t^d$ do
8:   $b_r \leftarrow \frac{1}{n}\sum_{x \in D'} \cos(r_1 \theta_1(x)) \cdots \cos(r_d \theta_d(x))$
9:   $\hat{b}_r \leftarrow b_r + \mathrm{Lap}(\frac{t^d}{n\epsilon})$
10:  $\hat{b}'_r \leftarrow \arg\min_{l \in \mathcal{L}} |\hat{b}_r - l|$
11: end for
12: for all $k = (k_1,\ldots,k_d) \in T_N^d$ do
13:   for all $r = (r_1,\ldots,r_d) \in T_t^d$ do
14:     $W_{rk} \leftarrow \cos(r_1 \arccos(a_{k_1})) \cdots \cos(r_d \arccos(a_{k_d}))$
15:     $W'_{rk} \leftarrow \arg\min_{l \in \mathcal{L}} |W_{rk} - l|$
16:   end for
17: end for
18: $\hat{b}' \leftarrow (\hat{b}'_r)_{\|r\|_\infty \le t-1}$ ($\hat{b}'$ is a $t^d$-dimensional vector)
19: $W' \leftarrow (W'_{rk})_{\|r\|_\infty \le t-1,\ \|k\|_\infty \le N-1}$ (a $t^d \times N^d$ matrix)
20: Solve the following LP problem: $\min_u \|W'u - \hat{b}'\|_1$, subject to $u \succeq 0$, $\|u\|_1 = 1$. Obtain the optimal solution $u^*$.
21: repeat
22:   Sample $y$ according to distribution $u^*$
23:   Add $y$ to $\tilde{D}$
24: until $|\tilde{D}| = m$
25: return $\tilde{D}$

Next we view the trigonometric polynomial functions as a set of basis queries. We compute the answers of the basis queries (step 8 in Algorithm 1) and add Laplace noise to the answers (step 9). These noisy answers guarantee differential privacy. Note that if, for a smooth query, we knew the coefficients of the linear combination of basis functions that approximates the smooth function, then we could easily obtain a differentially private answer to the smooth query by simply combining the noisy answers of the basis queries with these coefficients. Moreover, because all the coefficients are small, the error of the answer to the smooth query is small.

Table 1: Performance vs. order of smoothness

Order of smoothness | Accuracy $\alpha$ | Running time | Size of synthetic DB
$K = 1$ | $O(n^{-\frac{1}{2d+1}})$ | $O(n^2)$ | $O(n^{1+\frac{2}{2d+1}})$
$K = 2d$ | $O(n^{-\frac{1}{2}})$ | $O(n^{\frac{3}{4}d+\frac{5}{8}})$ | $O(n^{\frac{3}{2}+\frac{1}{4d}})$
$K/d = M \gg 1$ | $O(n^{-(1-\frac{2}{M})})$ | $O(n^{d(\frac{3}{2}-\frac{1}{2M})})$ | $O(n^{2-\frac{1}{2M}})$
However, an important advantage of our mechanism is that we do not even need to know the linear coefficients; we merely need to know that there exist coefficients which lead to a good approximation of the smooth function.

Finally, our goal is to generate a synthetic dataset (without using any information of the original database) such that if we evaluate all the basis queries on this synthetic database, all the answers will be close to the noisy answers obtained from the original dataset. The key observation is that if we have such a synthetic dataset, then the evaluation of any smooth query on it is an answer that is both differentially private and accurate. To generate such a dataset, we first learn a probability distribution over $[-1,1]^d$ so that the answers of the basis queries with respect to this distribution are close to the noisy answers. Observe that such a distribution must exist, because the uniform distribution over the original dataset satisfies this requirement. However, learning a continuous distribution is computationally intractable, so we discretize the domain (as well as the original data, in step 4) and consider distributions over the discretized data universe. Because the queries are smooth, the error introduced by discretization can be controlled. Learning the distribution can be formulated as a linear programming problem (step 20). Note that in the LP problem we minimize the $\ell_1$ error instead of the $\ell_\infty$ error because this results in slightly better accuracy. Finally, we randomly draw a sufficiently large number of data points from this probability distribution, and these points form the output synthetic database.

The running time of the mechanism is dominated by the linear programming step.
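The pipeline just described (discretize, release noisy basis answers, fit a distribution by the LP of step 20, and sample) can be sketched end to end for $d = 1$. This is an illustration under assumptions of our own, not the authors' implementation: the parameters $t$, $N$, $m$ are set by hand rather than by step 1, the data are synthetic, the rounding steps 10 and 15 (which only matter for worst-case bit complexity) are omitted, and `scipy.optimize.linprog` stands in for whatever LP solver one prefers:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy setup for d = 1: n points in [-1, 1], t basis queries, N grid points.
n, t, N, m, eps = 5000, 8, 64, 50000, 1.0
data = np.clip(rng.normal(0.3, 0.4, size=n), -1.0, 1.0)  # the private database D

# Step 4: snap each point to the grid A = {(2k + 1 - N) / N : k = 0..N-1}.
grid = (2.0 * np.arange(N) + 1.0 - N) / N
snapped = grid[np.argmin(np.abs(data[:, None] - grid[None, :]), axis=1)]

# Steps 7-9: noisy answers of the basis queries cos(r * arccos(x)).
theta = np.arccos(snapped)
b_hat = np.array([np.cos(r * theta).mean() for r in range(t)])
b_hat += rng.laplace(scale=t / (n * eps), size=t)

# Step 14: W[r, k] = cos(r * arccos(a_k)), the basis evaluated on the grid.
W = np.cos(np.arange(t)[:, None] * np.arccos(grid)[None, :])

# Step 20: min_u ||W u - b_hat||_1 s.t. u >= 0, sum(u) = 1, written as an LP
# with one slack variable s_r per basis query: minimize sum(s) subject to
# -s <= W u - b_hat <= s.
c = np.concatenate([np.zeros(N), np.ones(t)])
A_ub = np.vstack([np.hstack([W, -np.eye(t)]), np.hstack([-W, -np.eye(t)])])
b_ub = np.concatenate([b_hat, -b_hat])
A_eq = np.concatenate([np.ones(N), np.zeros(t)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], method="highs")
u = np.clip(res.x[:N], 0.0, None)
u /= u.sum()

# Steps 21-24: sample the synthetic database from the learned distribution.
synthetic = rng.choice(grid, size=m, p=u)

# Any smooth query can now be answered from the synthetic data alone.
f = lambda x: np.exp(-(x - 0.2) ** 2 / 2.0)
true_answer, private_answer = f(data).mean(), f(synthetic).mean()
```

With these toy parameters the private answer typically lands within a few hundredths of the true one; note that the data are touched only through `b_hat`, whose Laplace noise is what makes everything derived from it, including `synthetic`, differentially private.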
It is known that the worst-case time complexity of the interior point method is upper bounded in terms of the number of variables, the number of constraints, and the number of bits needed to encode the problem. It is easy to see that there are only poly($n$) variables and constraints. To control the number of bits, we round each number in the linear programming problem to a certain precision level (steps 10 and 15). Because all the numbers after rounding are uniformly bounded by a constant, the number of bits is not too large.

3.2 Generalization to $(\epsilon,\delta)$-differential Privacy

It is easy to generalize the previous $\epsilon$-differentially private mechanism to an $(\epsilon,\delta)$-differentially private mechanism which achieves slightly better accuracy. The $(\epsilon,\delta)$-differentially private mechanism differs from Algorithm 1 only in steps 1 and 9, which are replaced by the following:

1) Step 1. Set $t = \lceil n^{\frac{2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{1}{3d+2K}} \rceil$, $N = \lceil n^{\frac{2K}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{K}{3d+2K}} \rceil$, $m = \lceil n^{\frac{4d+4K+2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{2d+2K+1}{3d+2K}} \rceil$, and $L = \lceil n^{\frac{2d+2K}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{d+K}{3d+2K}} \rceil$.

2) Step 9. $\hat{b}_r = b_r + \mathrm{Lap}\left(\frac{(t^d \log\frac{1}{\delta})^{1/2}}{n\epsilon}\right)$.

We have the following theorem for this mechanism.

Theorem 3.2. Let the query set $Q_{C_B^K}$ be defined as in Theorem 3.1, and let the data universe be $[-1,1]^d$, where $d \in \mathbb{N}$ is a constant. Then the mechanism described above satisfies, for any $\epsilon > 0$, $\delta > 0$, the following:

1) The mechanism is $(\epsilon,\delta)$-differentially private.

2) There is an absolute constant $c$ such that for any $\beta \ge c \cdot e^{-n^{\frac{2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{1}{3d+2K}}}$ the mechanism is $(\alpha,\beta)$-accurate, where $\alpha = O\left(n^{-\frac{2K}{3d+2K}} (\log\frac{1}{\delta})^{\frac{K}{3d+2K}} / \epsilon\right)$, and the hidden constant depends only on $d$, $K$ and $B$.

3) The running time of the mechanism is $O\left(n^{\frac{3dK+5d}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{3dK+5d}{6d+4K}}\right)$.
4) The size of the synthetic database is $O\left(n^{\frac{4d+4K+2}{3d+2K}} (\log\frac{1}{\delta})^{-\frac{2d+2K+1}{3d+2K}}\right)$.

The proof of Theorem 3.2 follows from the standard use of the composition theorem [14]; we omit the details. Note that the running time and the size of the output synthetic database of this $(\epsilon,\delta)$-differentially private mechanism are similar to those of the $\epsilon$-differentially private one.

3.3 Comparison to Existing Algorithms

Here we study the performance of an existing differentially private mechanism which can output a synthetic database for accurately answering smooth queries. In particular, we analyze a simple variant of the BLR mechanism. Note that the original BLR mechanism applies to the setting where the data universe is $\{0,1\}^d$ and the query set contains a finite number of linear queries. Given the query set, BLR outputs a synthetic database and preserves $\epsilon$-differential privacy. Let $|Q|$ be the number of queries in the query set $Q$ and let $|\mathcal{X}|$ be the size of the data universe; the accuracy of BLR is $\tilde{O}\left(\left(\frac{\log|Q| \log|\mathcal{X}|}{n}\right)^{1/3}\right)$ [4]. (In this subsection we ignore the dependence on all other factors for clarity.)

For the smooth query problem, the data universe is the continuous domain $[-1,1]^d$, and the query set contains infinitely many elements, as the number of smooth functions is infinite. In order to apply BLR to this problem, one must discretize both the data universe and the range of the smooth functions. It is easy to see that to achieve an accuracy of $\alpha$ for all smooth queries, it is necessary and sufficient to discretize the data universe $[-1,1]^d$ to $\Omega(\frac{1}{\alpha})$ grid points along each dimension, and to discretize the range to $\Omega(\frac{1}{\alpha})$ precision. After this discretization, the data universe $\mathcal{X}$ has size $\Omega((\frac{1}{\alpha})^d)$, and the query set $Q$ contains only a finite number of queries.
The following proposition gives the performance of BLR for the discretized smooth queries.

Proposition 3.3. The accuracy guarantee of the BLR mechanism (implemented as described above) on the set of $K$-smooth queries is $O\left(n^{-\frac{K}{d+3K}}\right)$. The running time for achieving such an accuracy is super-exponential in $n$.

The proof of Proposition 3.3 is given in the appendix, Section A.2. Note that even for highly smooth queries, the accuracy guarantee of BLR is at best $O(n^{-1/3})$. In contrast, our mechanism has an accuracy close to $n^{-1}$ if $K$ is large compared to $d$. More importantly, our mechanism runs in polynomial time, which is much more efficient than BLR on the smooth query problem.

3.4 Practical Acceleration via Private PCA

Theoretically, the worst-case time complexity of our $\epsilon$-differentially private mechanism can be nearly $n^{\frac{3d}{2}}$ to achieve $n^{-1}$ accuracy for highly smooth queries. In real applications such a running time is unacceptable. We thus consider a simple variant of Algorithm 1 which turns out to be very efficient in our experiments and suffers only a minor loss in accuracy.

Note that the running time of Algorithm 1 is dominated by the linear programming step. This LP problem has $O(N^d)$ variables and $O(t^d)$ constraints, where $N^d$ is the number of discretized grid points in $[-1,1]^d$ and $t^d$ is the number of trigonometric polynomial basis functions. To make our algorithm practical, we consider a subset $M$ of the $N^d$ grid points with size $C := |M| \ll N^d$ and restrict the probability distribution $u$ to this subset of grid points. Similarly, we use a subset of size $R$ of the $t^d$ trigonometric polynomial basis functions, preferring lower degrees. By doing this, the LP problem has $C$ variables and $R$ constraints. The simplest approach to obtaining $M$ is to sample uniformly from the $N^d$ grid points in $[-1,1]^d$.
However, this approach suffers from a substantial loss in accuracy (see the appendix for experimental results): since $|\mathcal{M}|$ is extremely small compared to $N^d$, the probability that $\mathcal{M}$ contains data points in $D$ (or close to $D$) is very small. To reduce the size of the LP problem while preserving accuracy, we need a better way to obtain $\mathcal{M}$.

Formally, the problem of choosing a subset $\mathcal{M}$ for our purpose can be stated as follows. We want a subset $\mathcal{M}$ such that 1) $\mathcal{M}$ is differentially private; 2) $|\mathcal{M}|$ is small; 3) for almost every data point $x$ in $D$, there is a point in $\mathcal{M}$ close to $x$. Note that without the privacy concern one could simply let $\mathcal{M} = D$; under the requirement of privacy, however, this problem is highly non-trivial.

Here we adopt private PCA to obtain a low-dimensional ellipsoid. The ellipsoid is spanned by the (private) top eigenvectors of the data covariance matrix, with the square roots of the (private) eigenvalues as the radii. In particular, we use a slightly modified version of the Private Subspace Iteration (PSI) mechanism due to Hardt [16] to compute the private eigenpairs; the mechanism is described in Algorithm 2. Finally, we uniformly sample $C$ points from the ellipsoid to form $\mathcal{M}$.

In the following three results, we show that the PSI mechanism is differentially private and accurate for the top eigenvectors and eigenvalues, respectively. Hardt [16] shows that, with high probability, the tangent of the angle between the space spanned by the top-$k$ leading eigenvectors of the true data covariance matrix and the space spanned by the output column vectors is small. However, this does not suffice to conclude that the output private ellipsoid converges to the true PCA ellipsoid. Our results slightly strengthen the result in [16]: we show column-wise convergence between the eigenvectors and the output columns, which can be concluded from the simultaneous convergence between the increasing sequence of eigenspaces and the increasing sequence of output spaces.

Algorithm 2 Private Subspace Iteration
Input: Database $D \in ([-1,1]^d)^n$.
Output: Top-$k$ private eigenvectors and eigenvalues of $X^{(L)}$.
Parameters: Number of iterations $L \in \mathbb{N}$, dimension $k$, privacy parameters $\epsilon, \delta > 0$. Denote by GS the Gram-Schmidt orthonormalization procedure.
1: Set $\sigma = \frac{5d\sqrt{4kL\log(1/\delta)}}{n\epsilon}$ and $A = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$, where $\bar{D}$ is the mean of $D$.
2: Initialize: $G^{(0)} \sim N(0,1)^{d\times k}$, $X^{(0)} \leftarrow \mathrm{GS}(G^{(0)})$.
3: for $l = 1, 2, \ldots, L$ do
4:   Sample $G^{(l)} \sim N(0,\sigma^2)^{d\times k}$.
5:   $W^{(l)} = AX^{(l-1)} + \|X^{(l-1)}\|_\infty G^{(l)}$
6:   $X^{(l)} \leftarrow \mathrm{GS}(W^{(l)})$
7: end for

Theorem 3.4 (Accuracy of the eigenvectors). Given a database $D$ with $|D| = n$, let $A = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d$, and let $\gamma_k = \lambda_k/\lambda_{k+1} - 1$ for some $k \le d/2$. Let $U = (u_1, \ldots, u_k) \in \mathbb{R}^{d\times k}$ be a basis for the space spanned by the top $k$ eigenvectors. The matrix $X^{(L)} = (x^{(L)}_1, \ldots, x^{(L)}_k) \in \mathbb{R}^{d\times k}$ returned by Algorithm 2 on input $D$, with parameters $k$, $L \ge C(\min_{s\le k}\gamma_s)^{-1}\log d$ for a sufficiently large constant $C$, $L \in \mathbb{N}$, and privacy parameter $\sigma$, satisfies with probability $1 - o(1)$,
$$\sin\theta(u_s, x^{(L)}_s) \le O\Big(\sigma\,\omega_s\sqrt{d\,\max_l \|X^{(l)}\|_\infty^2 \log L}\Big),$$
where
$$\omega_s = \begin{cases} \max\big\{\frac{1}{\gamma_s\lambda_s}, \frac{1}{\gamma_{s-1}\lambda_{s-1}}\big\} & 2 \le s \le k, \\ \frac{1}{\gamma_1\lambda_1} & s = 1. \end{cases}$$

Corollary 3.5 (Accuracy of the eigenvalues). Under the assumptions of Theorem 3.4, let $\hat{\lambda}_s = \sqrt{x_s^T A^2 x_s}$. With probability $1 - o(1)$, we have
$$|\hat{\lambda}_s - \lambda_s| \le O\Big(\frac{\sigma^2 d \max_l\|X^{(l)}\|_\infty^2 \log L}{\gamma_s^2 \lambda_s^2}\Big).$$
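The iteration in Algorithm 2 can be sketched in NumPy as follows. This is a simplified illustration, not a vetted private implementation: Gram-Schmidt is done via a QR factorization, $\|X\|_\infty$ is taken as the maximum absolute entry, and the function name and interface are our own.

```python
import numpy as np

def private_subspace_iteration(D, k, L, eps, delta, rng=None):
    """Sketch of Algorithm 2 (Private Subspace Iteration).

    D : (d, n) array, each column a record in [-1, 1]^d.
    Returns a (d, k) matrix of private top-k eigenvector estimates and the
    eigenvalue estimates lambda_hat_s = sqrt(x_s^T A^2 x_s) of Corollary 3.5.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = D.shape
    sigma = 5 * d * np.sqrt(4 * k * L * np.log(1 / delta)) / (n * eps)
    mean = D.mean(axis=1, keepdims=True)
    A = D @ D.T / n - mean @ mean.T              # centered covariance matrix

    # Gram-Schmidt orthonormalization via QR
    X, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(L):
        G = sigma * rng.standard_normal((d, k))  # Gaussian noise G^(l)
        W = A @ X + np.abs(X).max() * G          # noisy power-iteration step
        X, _ = np.linalg.qr(W)

    # ||A x_s||_2 = sqrt(x_s^T A^2 x_s) since A is symmetric
    lam = np.linalg.norm(A @ X, axis=0)
    return X, lam
```

The subset $\mathcal{M}$ would then be formed by sampling $C$ points uniformly from the ellipsoid with axes given by the returned columns and radii given by the square roots of the returned eigenvalue estimates.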
Theorem 3.6 (Privacy). If Algorithm 2 is executed with each $G^{(l)}$ independently sampled as $G^{(l)} \sim N(0,\sigma^2)^{d\times k}$ with $\sigma = \frac{5d\sqrt{4kL\log(1/\delta)}}{n\epsilon}$, then Algorithm 2 satisfies $(\epsilon,\delta)$-differential privacy. If Algorithm 2 is executed with each $G^{(l)}$ independently sampled as $G^{(l)} \sim \mathrm{Lap}(\sigma)^{d\times k}$ with $\sigma = \frac{50d^{3/2}kL}{n\epsilon}$, then Algorithm 2 satisfies $\epsilon$-differential privacy.

The proofs of Theorem 3.4, Corollary 3.5, and Theorem 3.6 are given in the appendix, Section A.4.

4 Experiments

We evaluate our mechanisms on five datasets, all from the UCI repository: 1) CRM: the Communities and Crime dataset, which combines socio-economic data, law enforcement data, and crime data. 2) CTG: a Cardiotocography dataset consisting of measurements of fetal heart rate and uterine contraction features on cardiotocograms. 3) PAM: a Physical Activity Monitoring dataset consisting of inertial measurements and heart rate data. 4) PKS: a dataset consisting of a series of biomedical voice measurements of a group of people, some of whom have Parkinson's disease. 5) WDBC: the Breast Cancer Wisconsin Diagnostic dataset, consisting of characteristics of cell nuclei.

Table 2: Summary of the datasets

Dataset   Size (n)   # Attributes (d)
CRM       1993       100
CTG       2126       20
PAM       20000      40
PKS       5875       20
WDBC      569        30

A summary of the size and the number of attributes3 of these datasets is given in Table 2. Since the data universe considered in this paper is $[-1,1]^d$, we normalize each attribute to $[-1,1]$.

We conduct two groups of experiments. In one group we use the mechanism which guarantees $\epsilon$-differential privacy, and in the other we use the algorithm which guarantees $(\epsilon,\delta)$-differential privacy. In both groups of experiments we set $\epsilon = 1$; we set $\delta = 10^{-10}$ in the experiments with $(\epsilon,\delta)$-differential privacy.

The queries employed in the experiments are linear combinations of Gaussian kernel functions.
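Concretely, a query of this form can be generated and evaluated as in the following sketch (the function name and interface are our own; parameter choices follow the experimental setup described here):

```python
import numpy as np

def random_gaussian_query(d, J=10, sigma=4.0, rng=None):
    """Draw a random query f(x) = sum_j alpha_j exp(-||x - x_j||^2 / (2 sigma^2)),
    with alpha_j ~ U[0,1] and centers x_j ~ U([-1,1]^d), as in the experiments."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.uniform(0.0, 1.0, size=J)
    centers = rng.uniform(-1.0, 1.0, size=(J, d))

    def f(X):
        # X: (m, d) array of points; returns the (m,) vector of f values.
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2)) @ alpha

    return f
```

The query value on a database $D$ (rows in $[-1,1]^d$) is then the average `f(D).mean()`, i.e., $q_f(D)$.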
We use this type of function because 1) these functions possess the good smoothness properties stated in Section 2, and 2) linear combinations of Gaussians are universal approximators. The detailed parameter setting of the query functions is as follows. We consider
$$f(x) = \sum_{j=1}^{J} \alpha_j \exp\Big(-\frac{\|x - x_j\|^2}{2\sigma^2}\Big).$$
In all experiments we set $J = 10$; $\alpha_j$ is chosen uniformly at random from $[0,1]$, and $x_j$ is chosen uniformly at random from $[-1,1]^d$. We test various values of $\sigma$ to see how the smoothness of the query function affects the performance of the algorithm (see below for detailed results).

We use several performance measures to evaluate the algorithm, the goal being a comprehensive understanding of the performance of the mechanism. We consider the worst-case error of the mechanism over the set of queries. Because our query set, i.e., linear combinations of Gaussian kernels, contains infinitely many functions, we randomly choose 10000 queries in each experiment; the worst-case error is taken over these 10000 queries. We give both absolute error and relative error for all experiments. The absolute error of a query $q_f$ is defined as $|q_f(D) - q_f(\tilde{D})|$, and the relative error is defined as $\big|\frac{q_f(D) - q_f(\tilde{D})}{q_f(D)}\big|$. We present relative error because in certain cases (e.g., when $\sigma$ is small) $f(x)$ is very small for most $x \in D$; in such cases a small absolute error does not necessarily imply good performance, and relative error is more informative.4

3 Because we study smooth queries defined on Euclidean space, we only use the continuous attributes.

4 We point out that one also needs to be careful when using relative error. In our experiments, we deliberately set $\alpha_j \in [0,1]$, so $f(x) \ge 0$ for all $x$. If instead we set $\alpha_j \in [-1,1]$, then $f(x)$ can be either positive or negative, and it is possible that $q_f(D)$ is close to zero while $f(x)$ is not small for most $x \in D$. In such a case, a large relative error does not necessarily imply bad performance.

We report the running time of the mechanism for outputting the synthetic database in each experiment. The computer used in all experiments is a workstation with 2 Intel Xeon X5650 processors at 2.67 GHz and 32 GB RAM. We use the CPLEX package for solving the linear programming problems in our algorithms.

Table 3: Worst-case error under $\epsilon$-differential privacy ($C = 10^4$)

Dataset  Error   σ=2     σ=4     σ=6     σ=8     σ=10    Time(s)
CRM      Abs     0.001   0.035   0.033   0.022   0.020   1.1
         Rel     1.084   0.256   0.083   0.037   0.027
CTG      Abs     0.046   0.041   0.027   0.014   0.005   1.1
         Rel     0.209   0.063   0.033   0.015   0.006
PAM      Abs     0.007   0.006   0.004   0.001   0.004   1.2
         Rel     0.058   0.011   0.006   0.001   0.004
PKS      Abs     0.006   0.007   0.001   0.007   0.004   0.9
         Rel     0.059   0.013   0.002   0.008   0.004
WDBC     Abs     0.037   0.059   0.039   0.011   0.012   1.0
         Rel     0.329   0.110   0.053   0.013   0.014

We present the performance of the $\epsilon$-differentially private algorithm in Table 3. For each dataset, both the absolute error and the relative error, averaged over 20 rounds, are reported. We use linear combinations of Gaussians with different values of $\sigma$ as the query functions. The last column of the table lists the running time, for the worst $\sigma$, of the algorithm for outputting the synthetic database.

We now analyze the experimental results in Table 3 in greater detail. In this set of experiments we set $C = 10^4$. First, the algorithm is quite efficient: on all datasets the mechanism outputs the synthetic database in less than ten seconds. Next, consider the accuracy. As explained earlier, the relative error is more meaningful in our experiments. It can be seen that, except for the case $\sigma = 2$ (recall that in Proposition 2.1 we show $f \in C_1^K$ for $K \le \sigma^2$), the accuracy is reasonably good. The relative errors decrease monotonically as the order of smoothness of the queries increases.

In Table 4, we present the results for the $(\epsilon,\delta)$-differentially private mechanism. Compared to Table 3, the performance of the two algorithms is similar for $\delta = 10^{-10}$.

Table 4: Worst-case error under $(\epsilon,\delta)$-differential privacy ($C = 10^4$)

Dataset  Error   σ=2     σ=4     σ=6     σ=8     σ=10    Time(s)
CRM      Abs     0.001   0.018   0.034   0.020   0.020   7.5
         Rel     0.631   0.126   0.083   0.034   0.028
CTG      Abs     0.042   0.030   0.023   0.014   0.008   1.1
         Rel     0.192   0.045   0.028   0.016   0.008
PAM      Abs     0.012   0.020   0.006   0.003   0.001   6.9
         Rel     0.089   0.033   0.007   0.003   0.002
PKS      Abs     0.015   0.002   0.003   0.001   0.006   1.4
         Rel     0.109   0.003   0.003   0.001   0.007
WDBC     Abs     0.045   0.032   0.019   0.018   0.011   1.2
         Rel     0.388   0.061   0.026   0.021   0.013

5 Conclusion

Outputting a synthetic database while preserving differential privacy is very appealing from a practical viewpoint. In this paper, we propose differentially private mechanisms which output a synthetic database. The user can obtain accurate answers to all smooth queries from the synthetic database. The mechanisms run in polynomial time, while existing algorithms run in super-exponential time. For queries of high-order smoothness, the mechanisms achieve an accuracy of nearly $O(n^{-1})$, much better than the sampling error $O(n^{-1/2})$, which is inherent to differentially private mechanisms answering general queries.

There are a few future directions we think worth exploring. Smooth and non-smooth queries: as mentioned in the Introduction, there exists an efficient, differentially private algorithm which outputs a synthetic database and is accurate for the class of rectangle queries defined on $[-1,1]^d$ [4]. Note that rectangle queries are not smooth; these queries are specified by indicator functions, which are not even continuous.
The mechanism proposed in [4] is completely different from the mechanism for smooth queries given in this paper. Thus an immediate question is: can we develop efficient mechanisms which output synthetic databases and preserve differential privacy for a natural class of queries containing both smooth and important non-smooth functions?

A Proofs of the theorems and auxiliary experiment results

In this appendix, we give the proof of the main theorem in Section A.1; the analysis of BLR on the smooth query problem in Section A.2; the analysis of the smoothness of linear combinations of Gaussian kernel functions in Section A.3; the proofs of the private estimation of eigenvectors and eigenvalues in Section A.4; and the auxiliary experimental results for the simple subset-selection approach in Section A.5.

A.1 Proof of the Main Theorem

In this section we prove Theorem 3.1.

Proof of Theorem 3.1. We first define some notation used repeatedly in the proof. Let the input database be
$$D = (z^{(1)}, z^{(2)}, \cdots, z^{(n)}).$$
Let the discretized dataset be (see step 5 in Algorithm 1)
$$D' = (x^{(1)}, x^{(2)}, \cdots, x^{(n)}).$$
Also let the output synthetic dataset be
$$\tilde{D} = (y^{(1)}, y^{(2)}, \cdots, y^{(m)}).$$
Let $b = (b_r)_{\|r\|_\infty \le t-1}$ be a $t^d$-dimensional vector, where $b_r$ is defined in step 8 of the algorithm. Similarly, let $\hat{b} = (\hat{b}_r)_{\|r\|_\infty \le t-1}$ and let $W = (W_{rk})_{\|r\|_\infty \le t-1,\, \|k\|_\infty \le N-1}$, where $\hat{b}_r$ and $W_{rk}$ are defined in steps 9 and 14 of the algorithm, respectively. Let $\Delta = \hat{b} - b$ be the $t^d$-dimensional Laplace noise, where $\hat{b}$ is defined in step 16 of the algorithm. Finally, let $\tilde{b} = (\tilde{b}_r)_{\|r\|_\infty \le t-1}$, where
$$\tilde{b}_r = \frac{1}{m}\sum_{y \in \tilde{D}} \cos(r_1\theta_1(y))\cdots\cos(r_d\theta_d(y)).$$
(Recall that $\theta_i(y) = \arccos(y_i)$; see also the Notations in Algorithm 1.)

We now prove the four results in the theorem one by one.
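As an aside, the cosine-moment vectors $b$ and $\tilde{b}$ defined above are straightforward to compute directly; a minimal sketch (the function name is our own):

```python
import itertools
import numpy as np

def cosine_moments(data, t):
    """Compute the t^d-dimensional vector (b_r), ||r||_inf <= t - 1, with
    b_r = (1/m) * sum_y prod_i cos(r_i * arccos(y_i)) for rows y in [-1,1]^d.
    Note that cos(r * arccos(y)) is the degree-r Chebyshev polynomial T_r(y)."""
    theta = np.arccos(data)                        # (m, d) array of angles
    d = data.shape[1]
    b = {}
    for r in itertools.product(range(t), repeat=d):
        # product over coordinates of cos(r_i * theta_i), averaged over rows
        b[r] = np.cos(theta * np.asarray(r)).prod(axis=1).mean()
    return b
```

Since each factor is bounded by 1 in absolute value, every $b_r$ lies in $[-1,1]$, and $b_{(0,\ldots,0)} = 1$ always.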
A.1.1 Differential Privacy

That the mechanism preserves $\epsilon$-differential privacy is straightforward. Note that the output synthetic database $\tilde{D}$ contains no private information other than that obtained from $\hat{b}$, so we only need to show that $\hat{b}$ is differentially private. But this is immediate from the privacy of the Laplace mechanism.

A.1.2 Accuracy

Let $\theta = (\theta_1, \ldots, \theta_d)$. For any $f(x) \in C_B^K$, where $x \in [-1,1]^d$, let
$$g_f(\theta) := f(\cos\theta_1, \ldots, \cos\theta_d).$$
Denote by $c = (c_{r_1,\ldots,r_d})_{\|r\|_\infty \le t-1}$ a $t^d$-dimensional vector, and let
$$h_f^t(c, \theta) := \sum_{0 \le r_1,\ldots,r_d \le t-1} c_{r_1,\ldots,r_d}\cos(r_1\theta_1)\cdots\cos(r_d\theta_d).$$
For a constant $M$ (we will specify how to choose the value of $M$ later), let
$$c^* := \arg\inf_{\|c\|_\infty \le M}\ \sup_{\theta \in [-\pi,\pi]^d} \big|h_f^t(c,\theta) - g_f(\theta)\big|,$$
$$h_f^{M,t}(\theta) := \sum_{0 \le r_1,\ldots,r_d \le t-1} c^*_{r_1,\ldots,r_d}\cos(r_1\theta_1)\cdots\cos(r_d\theta_d).$$
Thus $h_f^{M,t}$ is the best $t$-th order small-coefficient approximation of $g_f$. Moreover, for any $x = (x_1,\ldots,x_d) \in [-1,1]^d$, let
$$\theta(x) := (\arccos x_1, \ldots, \arccos x_d).$$

Now we decompose the error of the mechanism into several terms:
$$
\big|q_f(\tilde{D}) - q_f(D)\big| = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{n}\sum_{z\in D} f(z)\Big|
\le \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big|
+ \Big|\frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y)) - \frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x))\Big|
+ \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big|
+ \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big|. \quad (1)
$$

We further decompose the second term on the right-hand side of the above inequality.
We have
$$
\Big|\frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y)) - \frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x))\Big| = \big|c^* \cdot (\tilde{b} - b)\big|
\le \big(\|\tilde{b} - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \big(\|\tilde{b} - Wu^*\|_1 + \|Wu^* - W'u^*\|_1 + \|W'u^* - \hat{b}'\|_1 + \|\hat{b}' - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \big(\|\tilde{b} - Wu^*\|_1 + \|(W - W')u^*\|_1 + \|Wu' - \hat{b}\|_1 + \|(W - W')u'\|_1 + 2\|\hat{b}' - \hat{b}\|_1 + \|\Delta\|_1\big)\|c^*\|_\infty
$$
$$
\le \Big(\|\tilde{b} - Wu^*\|_1 + \frac{4t^d}{L} + 4\|\Delta\|_1\Big)\|c^*\|_\infty, \quad (2)
$$
where $u'$ is the uniform distribution on $D'$. Note that the second-to-last inequality holds because $\|W'u^* - \hat{b}'\|_1 \le \|W'u' - \hat{b}'\|_1$. Also, the last inequality in (2) follows from
$$\|\hat{b}' - \hat{b}\|_1 \le \frac{t^d}{L} + \|\Delta\|_1, \quad \text{and} \quad \|Wu' - \hat{b}\|_1 \le \|Wu' - b\|_1 + \|\Delta\|_1 = \|\Delta\|_1,$$
where the last equality holds since $Wu' = b$.

Define
$$\eta_d = \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big|, \qquad \eta_n = 4\|\Delta\|_1\|c^*\|_\infty,$$
$$\eta_a = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big| + \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big|,$$
$$\eta_s = \|\tilde{b} - Wu^*\|_1\|c^*\|_\infty, \qquad \eta_r = \frac{4t^d}{L}\|c^*\|_\infty,$$
where $\eta_d, \eta_n, \eta_a, \eta_s, \eta_r$ correspond to the discretization error, noise error, approximation error, sampling error, and rounding error, respectively. Combining (1), (2), and the equations above, the error of the mechanism is bounded by the sum of these five types of errors:
$$\big|q_f(\tilde{D}) - q_f(D)\big| \le \eta_d + \eta_n + \eta_a + \eta_s + \eta_r.$$
We now bound the five errors separately.

Discretization error $\eta_d$: Since $f \in C_B^K$ ($K \ge 1$), the first-order derivatives of $f$ are all bounded by $B$. Also, the discretization precision of $[-1,1]^d$ is $\frac{1}{N}$, so the distance between each point of $D$ and the corresponding point of $D'$ is $O(\frac{1}{N})$. Thus we have
$$\eta_d = \Big|\frac{1}{n}\sum_{x\in D'} f(x) - \frac{1}{n}\sum_{z\in D} f(z)\Big| \le \frac{dB}{N} = O\big(n^{-\frac{K}{2d+K}}\big).$$

Noise error $\eta_n$: Let $M$ be a constant depending on $d$, $K$, and $B$, chosen sufficiently large.5 Since $M$ is a constant, $\|c^*\|_\infty = O(1)$.

5 $M = 2^K B(\pi(K+1))^d$ suffices for this and all later requirements on $M$.
Thus, to bound $\eta_n = 4\|\Delta\|_1\|c^*\|_\infty$, we only need to bound the $\ell_1$ norm of the $t^d$-dimensional vector $\Delta$, which contains i.i.d. $\mathrm{Lap}\big(\frac{t^d}{n\epsilon}\big)$ random variables; equivalently, we need to bound the sum of $t^d$ i.i.d. exponentially distributed random variables. It is well known that such a sum follows a gamma distribution. A simple calculation yields
$$P\Big(\|\Delta\|_1 \le \frac{2t^{2d}}{n\epsilon}\Big) \ge 1 - 10e^{-\frac{t^d}{5}}.$$
Thus, with probability $1 - 10e^{-t^d/5}$, we have
$$\eta_n = 4\|\Delta\|_1\|c^*\|_\infty \le O\Big(\frac{t^{2d}}{n\epsilon}\Big).$$

Approximation error $\eta_a$: Recall that for any $x$, $g_f(\theta(x)) = f(x)$. We have
$$\eta_a = \Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{m}\sum_{y\in\tilde{D}} h_f^{M,t}(\theta(y))\Big| + \Big|\frac{1}{n}\sum_{x\in D'} h_f^{M,t}(\theta(x)) - \frac{1}{n}\sum_{x\in D'} f(x)\Big| \le 2\big\|g_f - h_f^{M,t}\big\|_{[-\pi,\pi]^d}.$$
To bound $\eta_a$, we need the following result.

Theorem A.1 ([33]). For any $K, d, B$, there is $M$ such that for every $f \in C_B^K$,
$$\big\|g_f - h_f^{M,t}\big\|_{[-\pi,\pi]^d} \le O\Big(\frac{1}{t^{K+1}}\Big).$$
According to this theorem, we have $\eta_a \le O\big(\frac{1}{t^{K+1}}\big)$.

Sampling error $\eta_s$: It is easy to bound the sampling error. Let $W_r$ be the row vector of the matrix $W$ indexed by $r$. Recall that $-1 \le W_{rk} \le 1$. Thus for each $r$, by the Chernoff bound we have that for any $\tau > 0$,
$$P\big(|\tilde{b}_r - W_r u^*| \ge \tau\big) \le 2e^{-\frac{m\tau^2}{2}},$$
since $\tilde{b}_r$ is just the average of $m$ i.i.d. samples and $W_r u^*$ is its expectation. Next, by the union bound,
$$P\big(\|\tilde{b} - Wu^*\|_\infty \ge \tau\big) \le 2t^d e^{-\frac{m\tau^2}{2}},$$
and therefore
$$P\big(\|\tilde{b} - Wu^*\|_1 \ge t^d\tau\big) \le 2t^d e^{-\frac{m\tau^2}{2}}.$$
Setting $\tau$ such that $2t^d e^{-\frac{m\tau^2}{2}} = e^{-t}$, we have that with probability $1 - e^{-t}$,
$$\|\tilde{b} - Wu^*\|_1 \le O\Big(\frac{t^{d+1/2}}{\sqrt{m}}\Big).$$

Rounding error $\eta_r$: Since $\|c^*\|_\infty$ is upper bounded by a constant, we have $\eta_r \le O\big(\frac{t^d}{L}\big)$.
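The tail bound for $\|\Delta\|_1$ used in the noise-error step above can be checked numerically; a Monte Carlo sketch with toy parameter values:

```python
import numpy as np

# Monte Carlo check of P(||Delta||_1 <= 2 t^(2d) / (n eps)).
# Delta has t^d i.i.d. Lap(t^d/(n eps)) coordinates, so ||Delta||_1 is a sum
# of t^d i.i.d. exponential variables (a gamma variable) with mean
# t^(2d)/(n eps); the threshold below is twice that mean.
rng = np.random.default_rng(1)
t, d, n, eps = 4, 3, 10_000, 1.0
dim = t ** d                                   # number of coordinates, t^d
scale = dim / (n * eps)                        # Laplace scale t^d / (n eps)
samples = rng.laplace(0.0, scale, size=(5000, dim))
frac = (np.abs(samples).sum(axis=1) <= 2 * t ** (2 * d) / (n * eps)).mean()
# frac should be very close to 1, consistent with 1 - 10 exp(-t^d / 5).
```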
Putting it together: Combining the five types of errors, we have that with probability $1 - e^{-t} - 10e^{-t^d/5}$, the error of the mechanism satisfies
$$\Big|\frac{1}{m}\sum_{y\in\tilde{D}} f(y) - \frac{1}{n}\sum_{z\in D} f(z)\Big| \le O\Big(\frac{1}{N} + \frac{1}{t^{K+1}} + \frac{t^{2d}}{n\epsilon} + \frac{t^{d+\frac{1}{2}}}{\sqrt{m}} + \frac{t^d}{L}\Big). \quad (3)$$
Recall that the mechanism sets
$$t = \big\lceil n^{\frac{1}{2d+K}} \big\rceil, \quad N = \big\lceil n^{\frac{K}{2d+K}} \big\rceil, \quad m = \big\lceil n^{1 + \frac{K+1}{2d+K}} \big\rceil, \quad L = \big\lceil n^{\frac{d+K}{2d+K}} \big\rceil.$$
The theorem follows after some simple calculation.

A.1.3 Running time

It is not difficult to see that the running time of the mechanism is dominated by solving the linear programming problem in step 20. (Because the time complexity of linear programming is stated in terms of arithmetic operations, all running times discussed here should be understood in this sense.) To analyze the running time of the LP, observe that it can be rewritten in the following standard form:
$$\min_{\bar{x}}\ \bar{c}^T\bar{x} \quad \text{s.t.} \quad \bar{A}\bar{x} = \bar{b}, \ \bar{x} \ge 0, \quad (4)$$
where
$$\bar{A} = \begin{pmatrix} L\cdot W' & L\cdot I_{t^d} & -L\cdot I_{t^d} \\ \mathbf{1}_{N^d}^T & 0 & 0 \end{pmatrix}, \quad \bar{b} = \begin{pmatrix} L\cdot\hat{b}' \\ 1 \end{pmatrix}, \quad \bar{c} = \begin{pmatrix} 0 \\ \mathbf{1}_{t^d} \\ \mathbf{1}_{t^d} \end{pmatrix}, \quad \bar{x} = \begin{pmatrix} u \\ v \\ w \end{pmatrix}.$$
Here $\bar{A}$ is an $\bar{m}\times\bar{n}$ matrix with $\bar{m} = t^d + 1$ and $\bar{n} = N^d + 2t^d$. Note that 1) each element of $W'$ is in $[-1,1]$; 2) each element of $\hat{b}'$ is in $[-1,1]$; and 3) each element of $W'$ and $\hat{b}'$ is rounded to precision $1/L$. So we have in fact reduced to an LP problem (4) in which the elements of $\bar{A}$, $\bar{b}$, $\bar{c}$ are all integers bounded by $L$.

The best-known worst-case complexity of the interior point algorithm for linear programming with integer parameters is $O(\bar{n}^3\tilde{L})$, where $\bar{n}$ is the number of variables and $\tilde{L}$ is the number of bits needed to encode the linear programming problem. Here we use a more refined bound given in [2]. Using this bound, we are able to prove a much better time complexity for our algorithm, because in the linear programming problem (4) the number of constraints is much smaller than the number of variables.
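For illustration, an LP of the form (4) can be assembled and solved with an off-the-shelf solver. The sketch below uses SciPy with toy sizes and random stand-ins for the rounded moment matrix $W'$ and the noisy moments $\hat{b}'$; the $v - w$ split turns the $\ell_1$ residual into a linear objective.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of LP (4): find a distribution u over grid points minimizing
# ||W u - b_hat||_1 via the v - w split of the residual. W and b_hat are
# random stand-ins for the rounded cosine-moment matrix and noisy moments.
rng = np.random.default_rng(2)
n_grid, n_basis = 60, 8                      # stand-ins for N^d and t^d

W = rng.uniform(-1.0, 1.0, size=(n_basis, n_grid))
b_hat = rng.uniform(-1.0, 1.0, size=n_basis)

A_eq = np.block([
    [W, np.eye(n_basis), -np.eye(n_basis)],             # W u + v - w = b_hat
    [np.ones((1, n_grid)), np.zeros((1, 2 * n_basis))], # sum(u) = 1
])
b_eq = np.concatenate([b_hat, [1.0]])
c = np.concatenate([np.zeros(n_grid), np.ones(2 * n_basis)])  # min sum(v + w)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
u = res.x[:n_grid]   # the distribution from which the synthetic data is sampled
```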
The bound we use for the complexity of linear programming is $O(\bar{n}^{1.5}\bar{m}^{1.5}\ln\bar{m}\,\bar{L})$ [2]. Here $\bar{L}$ is the size of the LP problem in standard form, defined as follows [26]:
$$\bar{L} = \big\lceil \log\big(1 + |\det(\bar{A}_{\max})|\big) \big\rceil + \big\lceil \log(1 + \|\bar{c}\|_\infty) \big\rceil + \big\lceil \log(1 + \|\bar{b}\|_\infty) \big\rceil + \big\lceil \log(\bar{m} + \bar{n}) \big\rceil,$$
where
$$\bar{A}_{\max} = \arg\max_{X \text{ a square submatrix of } \bar{A}} |\det(X)|.$$
Note that $\bar{m} < \bar{n}$, so the size of $\bar{A}_{\max}$ is at most $\bar{m}\times\bar{m}$. Therefore,
$$|\det(\bar{A}_{\max})| \le \bar{m}!\,L^{\bar{m}}, \quad \text{and} \quad \bar{L} = O\big(\bar{m}(\log\bar{m} + \log L) + \log\bar{n}\big).$$
Given $\bar{m} = O(t^d)$ and $\bar{n} = O(N^d)$, a simple calculation shows that the total time complexity is
$$O\big(\bar{n}^{1.5}\bar{m}^{1.5}\ln\bar{m}\,\bar{L}\big) = O\big(N^{1.5d}t^{2.5d}\big) = O\big(n^{\frac{3dK+5d}{4d+2K}}\big).$$

A.1.4 Size of the output synthetic database

The size $m$ of the synthetic dataset is set in step 1 of the algorithm.

A.2 Analysis of the Performance of BLR on the Smooth Query Problem

In this section we prove Proposition 3.3.

Proof of Proposition 3.3. As stated in Section 3.3, the accuracy of BLR is
$$\tilde{O}\Bigg(\bigg(\frac{\log|Q|\log|\mathcal{X}|}{n}\bigg)^{1/3}\Bigg).$$
So here we only need to analyze the size of the query set obtained after discretization. For every $K \in \mathbb{N}$, let $Q_\alpha(C_B^K)$ be the set of queries obtained by discretizing both the domain $[-1,1]^d$ and the range $[-B,B]$ of the smooth functions in $C_B^K$ with precision $\alpha$, as described in Section 3.3. We use the following result.

Lemma A.2. There is an absolute constant $c$ such that
$$\log\big|Q_\alpha(C_B^K)\big| \ge c\Big(\frac{1}{\alpha}\Big)^{d/K}.$$

Since the discretization precision is $\alpha$, and the first-order derivatives of the functions are bounded by the constant $B$, the total error induced by discretization of the domain and range is at least $\alpha$. Thus the error of the discretized BLR is
$$\max\Bigg\{B\alpha,\ \tilde{O}\Bigg(\bigg(\frac{(1/\alpha)^{d/K}}{n}\bigg)^{1/3}\Bigg)\Bigg\}.$$
The proposition follows by choosing the optimal $\alpha$.

Proof of Lemma A.2. Without loss of generality, we consider the case $B = 1$.
Define $h(x)$ ($x \in \mathbb{R}^d$) as follows:
$$h(x) = \begin{cases} \exp\Big(1 - \frac{1}{1 - \|x\|_2^2}\Big), & \|x\|_2 \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
It is well known that $h(x) \in C^\infty(\mathbb{R}^d)$, $h(x) \in [0,1]$, and for every $d$-tuple of nonnegative integers $k = (k_1,\ldots,k_d)$, $D^k h(x) = 0$ when $\|x\| \ge 1$. Since the partial derivatives of $h$ are continuous and $h$ has bounded support, we may define
$$M_K := \max_{|k| \le K}\max_x |D^k h(x)|.$$
Since $K$ is a constant, $M_K$ is also a constant.

Let $N_0 = 1/\alpha$. For simplicity we assume $N_0$ is an integer. First we partition $[-1,1]^d$ into hypercubes with equal side length. Let $n_0$ be an integer whose value will be determined later, let $l = n_0/N_0$ be the side length of the hypercubes, and let $m_0 = 1/l$ be the number of hypercubes along each dimension. Denote the centers of the $m_0^d$ hypercubes by $x_1, \ldots, x_{m_0^d}$.

Consider the set
$$\mathcal{F} = \Big\{ z = (z_1, \ldots, z_{m_0^d}) : z_1 \in \Big\{-\tfrac{N_0-1}{N_0}, -\tfrac{N_0-2}{N_0}, \ldots, \tfrac{N_0-1}{N_0}\Big\},\ z_i \in \{-1, 0, 1\},\ i = 2, 3, \ldots, m_0^d \Big\}.$$
Clearly $|\mathcal{F}| = (2N_0 - 1)\,3^{m_0^d - 1}$.

For every $z \in \mathcal{F}$, we will construct a $K$-smooth function $f_z$ so that for every pair $z, z' \in \mathcal{F}$, $f_z$ and $f_{z'}$ are still different after discretization over the domain and the range. In particular, we require that $f_z$ and $f_{z'}$ are different as long as the discretization precision is $\alpha$; it does not matter where the discretization thresholds are set. If this can be done, then
$$\log\big|Q_\alpha(C_B^K)\big| \ge \Omega(m_0^d).$$
Below, we will show that $m_0$ can be as large as $\Omega\big(N_0^{1/K}\big)$. Once this is proved, the proposition follows.

To do this, define
$$f_z(x) = z_1 + \frac{1}{N_0}\sum_{j=2}^{m_0^d} h\big(2(x - x_j)/l\big)\cdot z_j.$$
Now let us look at some simple properties of the function $f_z$: it perturbs the constant function $z_1$ with linear combinations of the infinitely smooth function $h$ shifted to each $x_j$ (the centers of the hypercubes).
Moreover, $z_j \in \{-1, 0, 1\}$ ($j = 2, 3, \ldots, m_0^d$) controls the perturbation at $x_j$: it can be a positive or negative perturbation, or no perturbation. The magnitude of the perturbation is $1/N_0$. Note that $h(2(x - x_j)/l)$ is supported on the set $\{x \in \mathbb{R}^d : \|x - x_j\|_2 \le l/2\}$. For any $x \in \mathbb{R}^d$, there exists at most one $j$ such that $h(2(x - x_j)/l) \ne 0$; therefore, for any fixed $x$, at most one term in the summation defining $f_z$ does not vanish. Also note that $\frac{1}{N_0}h(2(x - x_j)/l)\cdot z_j$ can contribute $1/N_0$ to the magnitude of $f_z$. Thus for different $z, z'$, the functions $f_z$ and $f_{z'}$ are always different, no matter where the discretization thresholds are placed.

Furthermore, if $|k| \le K$, then for every $z$,
$$|D^k f_z(x)| = \frac{1}{N_0}\Big|\sum_{j=2}^{m_0^d} D^k h\big(2(x - x_j)/l\big)\Big| \le \Big(\frac{2}{l}\Big)^K M_K / N_0,$$
since the supports of the perturbations $h$ do not overlap. In order that all the functions $f_z$ have $K$-norm bounded by 1, we need
$$\Big(\frac{2}{l}\Big)^K M_K / N_0 \le 1.$$
This inequality can be satisfied by setting
$$n_0 = 2M_K^{1/K} N_0^{\frac{K-1}{K}},$$
and thus
$$m_0 = \frac{N_0}{n_0} = \Omega\big(N_0^{1/K}\big).$$
The lemma follows.

A.3 Smoothness of Linear Combinations of Gaussian Kernel Functions

In this section we prove Proposition 2.1. First, we state a well-known inequality for Hermite polynomials; Proposition 2.1 follows almost immediately from it.

Lemma A.3 ([20]). The Hermite polynomial of degree $k$, defined as
$$H_k(x) = (-1)^k e^{x^2}\frac{d^k}{dx^k}e^{-x^2},$$
where $k \in \mathbb{N}$ and $x \in (-\infty,\infty)$, satisfies the inequality
$$|H_k(x)| \le (2^k k!)^{\frac{1}{2}} e^{\frac{1}{2}x^2}.$$

Proof of Proposition 2.1. Since $\|\alpha\|_1 \le 1$, we only need to show that the $K$-norm of the Gaussian kernel function is bounded by 1. Let $g(x) = e^{-x^2}$. From Lemma A.3 we directly have
$$\Big|\frac{d^k}{dx^k}g(x)\Big| = |H_k(x)|\,e^{-x^2} \le (2^k k!)^{\frac{1}{2}}.$$
Let $k = (k_1, \ldots, k_d)$ with $|k| = K$.
Therefore, for $f(x)$ defined in Proposition 2.1, we have
$$|D^k f(x)| = \prod_{j=1}^d \Big|\frac{d^{k_j}}{dx_j^{k_j}}\,g\Big(\frac{x_j - y_j}{\sqrt{2}\sigma}\Big)\Big| \le \Big(\frac{1}{\sqrt{2}\sigma}\Big)^K \prod_{j=1}^d (2^{k_j}k_j!)^{\frac{1}{2}} \le \frac{(K!)^{\frac{1}{2}}}{\sigma^K}.$$
Obviously, when $K \le \sigma^2$,
$$|D^k f(x)| \le \frac{K^{\frac{K}{2}}}{\sigma^K} \le 1.$$
The proposition follows.

A.4 Private Estimation of Eigenvectors and Eigenvalues

In this section we prove Theorem 3.4 and the privacy guarantee (Theorem 3.6). For simplicity we denote by $\|X\|$ the spectral norm of a matrix $X$.

Before stating the proofs formally, let us take a closer look at Theorem 3.3 in [16]: with high probability, given regularity conditions, the tangent of the angle between the space spanned by the top-$k$ leading eigenvectors (the eigenspace) and the space spanned by the output columns (the output space) is small. Our goal is the column-wise convergence between eigenvectors and output columns, which can be concluded from the simultaneous convergence between the increasing sequence of eigenspaces and the increasing sequence of output spaces, given that they share the same dimension. This constraint leads us to use a weaker version of Theorem 3.3 by specifying $r = k$, but the resulting column-wise convergence at least compensates for the loss of the tuning parameter $r$. Note that simply applying Theorem 3.3 consecutively to the sequence will not ensure the high convergence probability $1 - o(1)$. Our analysis extends to the case $k = O(d)$, where the dimension $d$ can grow with the size of the database, provided the magnitude of the added noise is adequate.

Lemma A.4. Assume the data universe $\mathcal{X} = [-1,1]^d$. For all pairs of neighboring databases $D, D'$ with $|D| = |D'| = n$, let $A(D) = \frac{1}{n}DD^T - \bar{D}^T\bar{D}$, where $\bar{D}$ is the mean of $D$. It holds that
$$\|A(D) - A(D')\| \le \frac{5d}{n}.$$

Lemma A.5. Let $A = (a_{ij}) \in \mathbb{R}^{n\times d}$, and denote by $A_{kl} = (a_{ij})_{i\le k, j\le l}$ the $(k,l)$-submatrix of $A$ for any $k \le n$ and $l \le d$. Then $\|A_{kl}\| \le \|A\|$.

Lemma A.6. Let $U \in \mathbb{R}^{d\times k}$ be a matrix with orthonormal columns. Let $G^{(1)}, \ldots, G^{(L)} \sim N(0,\sigma^2)^{d\times k}$ with $k \le d$, and assume $L \le d$. Let $G^{(l)}_s$ and $U_s$ be the $(d,s)$-submatrices of $G^{(l)}$ and $U$, respectively, for $s \in [k]$. Then, with probability $1 - o(1)$,
$$\max_{l\in[L]} \big\|U_s^T G^{(l)}_s\big\| \le O\big(\sigma\sqrt{k\log L}\big), \quad \forall s \in [k].$$

Lemma A.7. Let $U \in \mathbb{R}^{d\times k}$ be a matrix with orthonormal columns. Let $G^{(1)}, \ldots, G^{(L)} \sim \mathrm{Lap}(\sigma)^{d\times k}$ with $k \le d$, and assume $L \le d$. Let $G^{(l)}_s$ and $U_s$ be the $(d,s)$-submatrices of $G^{(l)}$ and $U$, respectively, for $s \in [k]$. Then, with probability $1 - o(1)$,
$$\max_{l\in[L]} \big\|U_s^T G^{(l)}_s\big\| \le O\big(\sigma k\sqrt{\log(Lk^2)}\big), \quad \forall s \in [k].$$

Proof of Theorem 3.4. Let $m = \max_l \|X^{(l)}\|_\infty$, assume the spectral decomposition $A = Z\Lambda Z^{-1}$, and write
$$\Lambda = \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix}, \qquad Z = \begin{pmatrix} Z_1 & Z_2 \end{pmatrix},$$
where $\Lambda_1 \in \mathbb{R}^{s\times s}$ and $Z_1 \in \mathbb{R}^{d\times s}$. Denote $U_s = Z_1$ and $V_s = Z_2$; we then have $A = U_s\Lambda_1 U_s^T + V_s\Lambda_2 V_s^T$. Let $\Delta(U_s) \ge \max_{l\in[L]}\|U_s^T G^{(l)}_s\|$ and $\Delta(V_s) \ge \max_{l\in[L]}\|V_s^T G^{(l)}_s\|$, where $G^{(l)}_s$ is the $(d,s)$-submatrix of $G^{(l)}$. By Lemma A.6, we conclude that with probability $1 - o(1)$ the following events occur simultaneously:

1. $\forall s \in [k]$, $\Delta(U_s) \le O(\sigma m\sqrt{k\log L})$;
2. $\forall s \in [k]$, $\Delta(V_s) \le O(\sigma m\sqrt{d\log L})$.

Notice that for all $s \le k$ we have $\Delta(U_s) \le \Delta(V_s)$, since we set $s \le k \le d/2$. Since $\arccos\theta(U_s, X^{(0)}_s)$ is bounded, where $X^{(0)}_s$ is the $(d,s)$-submatrix of $X^{(0)}$, we have for all $s \le k$
$$\frac{\Delta(U_s)}{\arccos\theta(U_s, X^{(0)}_s)} \le O\big(\sigma m\sqrt{k\log L}\big).$$
Applying Theorem 2.9 in [16], we have with probability $1 - o(1)$, for all $s \le k$,
$$\tan\theta(U_s, X^{(L)}_s) \le O\Big(\frac{\sigma}{\gamma_s\lambda_s}\sqrt{d\max_l\|X^{(l)}\|_\infty^2\log L}\Big). \quad (5)$$
For the case $s = 1$, the theorem is proved. Now, for any fixed $1 < s \le k$, notice that $u_s$ lies in the space spanned by $(u_1,\ldots,u_s)$ as well as in the orthogonal complement of the space spanned by $(u_1,\ldots,u_{s-1})$, so we have
$$
\sin^2\theta(u_s, x^{(L)}_s) = \big\|U_{s-1}U_{s-1}^T x^{(L)}_s + (I - U_s U_s^T)x^{(L)}_s\big\|^2
= \big\|U_{s-1}U_{s-1}^T x^{(L)}_s\big\|^2 + \big\|(I - U_s U_s^T)x^{(L)}_s\big\|^2
$$
$$
\le \sin^2\theta(U_{s-1}, X^{(L)}_{s-1}) + \sin^2\theta(U_s, X^{(L)}_s)
\le 2\max\big\{\sin^2\theta(U_{s-1}, X^{(L)}_{s-1}),\ \sin^2\theta(U_s, X^{(L)}_s)\big\}
$$
$$
\le 2\max\big\{\tan^2\theta(U_{s-1}, X^{(L)}_{s-1}),\ \tan^2\theta(U_s, X^{(L)}_s)\big\}. \quad (6)
$$
The theorem, for the case $s \ge 2$, is proved by substituting (5) into (6).

Proof of Corollary 3.5. Write $x_s = x^{(L)}_s$ and $\theta^{(L)} = \theta(U_s, X^{(L)}_s)$ for short. Let $x_s = u + u^\perp$, where $u$ is in the direction of the eigenvector corresponding to $\lambda_s$. Then, since $\|u\| = \cos\phi$ and $\|u^\perp\| = \sin\phi$ for some $\phi \le \theta^{(L)}$, we have
$$
\hat{\lambda}_s^2 = x_s^T A^2 x_s = u^T A^2 u + (u^\perp)^T A^2 u^\perp = \lambda_s^2 u^T u + (u^\perp)^T A^2 u^\perp
\le \lambda_s^2\|u\|^2 + \lambda_1^2\|u^\perp\|^2
$$
$$
\le \lambda_s^2\cos^2\theta^{(L)} + \lambda_1^2\sin^2\theta^{(L)} = \lambda_s^2(1 - \sin^2\theta^{(L)}) + \lambda_1^2\sin^2\theta^{(L)} = \lambda_s^2 + (\lambda_1^2 - \lambda_s^2)\sin^2\theta^{(L)}.
$$
Thus,
$$|\hat{\lambda}_s - \lambda_s| \le \frac{\lambda_1^2 - \lambda_s^2}{\hat{\lambda}_s + \lambda_s}\sin^2\theta^{(L)} = O\Big(\frac{\sigma^2 d\max_l\|X^{(l)}\|_\infty^2\log L}{\gamma_s^2\lambda_s^2}\Big).$$
The corollary follows.

Proof of Theorem 3.6. This follows from Lemma 3.6 in [16].

A.5 Experimental Results: A Simple Approach to Obtaining the Subset

In this section we give the setting of the number of basis functions $R$ used in our experiments, and provide the experimental results with the subset $S$ sampled uniformly from the $N^d$ grid points. Let
$$R = \begin{cases} \tilde{C}\,n^{\frac{d}{2d+\sigma^2}} & \epsilon\text{-differential privacy}, \\ \tilde{C}\,n^{\frac{2d}{3d+2\sigma^2}} & (\epsilon,\delta)\text{-differential privacy}, \end{cases}$$
where $\tilde{C}$ is a constant; we chose $\tilde{C} = 0.5$. All results in Table 5 and Table 6 are averages over 20 independent experiment rounds.
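The worst-case errors reported in the tables can be computed with a routine like the following sketch (the function name and interface are our own; query generation is as described in Section 4):

```python
import numpy as np

def worst_case_errors(D, D_syn, queries):
    """Worst-case absolute and relative error over a batch of queries, as
    reported in the tables. D, D_syn: (n, d) and (m, d) arrays; queries: a
    list of callables f mapping an (m, d) array to an (m,) array of values."""
    abs_errs, rel_errs = [], []
    for f in queries:
        q_true, q_syn = f(D).mean(), f(D_syn).mean()   # q_f(D) and q_f(D~)
        abs_errs.append(abs(q_true - q_syn))
        rel_errs.append(abs((q_true - q_syn) / q_true))
    return max(abs_errs), max(rel_errs)
```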
Comp are to T able 3 and T able 4, the wo rst-case error obtained th r ough PS I red u ced significan tly . 24 T able 5: W orst-case error of ǫ -differential p riv acy (hyp ercub e) Dataset Erro r σ Time(s) 2 4 6 8 10 CRM Abs 0.001 0.02 8 0.035 0.031 0.031 7.2 Rel 1.721 0.22 6 0.101 0.051 0.046 CTG Abs 0.089 0.07 5 0.050 0.028 0.017 1.8 Rel 0.796 0.13 9 0.066 0.033 0.019 P AM Abs 0.111 0.16 0 0.097 0.062 0.043 9.7 Rel 0.646 0.25 5 0.121 0.070 0.047 PKS Abs 0.071 0.07 9 0.050 0.027 0.017 3.4 Rel 0.655 0.15 4 0.068 0.032 0.019 WDBC Abs 0.040 0.06 2 0.029 0.019 0.015 2.7 Rel 0.309 0.13 7 0.037 0.022 0.017 T able 6: W orst-case error of ( ǫ, δ )-different ial priv acy (h yp ercub e) Dataset Erro r σ Time(s) 2 4 6 8 10 CRM Abs 0.001 0.02 7 0.041 0.034 0.027 14.5 Rel 1.773 0.25 8 0.093 0.054 0.039 CTG Abs 0.103 0.07 5 0.042 0.024 0.019 2.6 Rel 0.884 0.14 0 0.055 0.028 0.021 P AM Abs 0.101 0.15 8 0.104 0.067 0.042 15.3 Rel 0.595 0.25 3 0.128 0.076 0.046 PKS Abs 0.099 0.08 6 0.048 0.027 0.022 3.2 Rel 0.924 0.16 5 0.065 0.032 0.025 WDBC Abs 0.040 0.04 6 0.040 0.021 0.019 3.3 Rel 0.340 0.09 9 0.057 0.026 0.021 25 References [1] C. Aggarw al and P . Y u. A ge neral surv ey of pr iv acy preservin g data mining models and algorithms. I n Privacy-Pr eserving Data Mi ni ng , c hapter 2, pages 11–52. S pringer, 2008. [2] K. M. Anstreic her. Linear pr ogramming in O ( n 3 ln n L ) op erations. SIAM J. on Optimization , 9(4):8 03–81 2, Ap r . 1999. [3] B. Barak, K. Chaudhuri, C. Dwork, S. K ale, F. McSh erry , and K. T alwar. Priv acy , accuracy , and consistency to o: a holistic solution to continge ncy table release. In POD S , p ages 273–282 . A CM, 2007. [4] A. Blum, K. Ligett, and A. Roth. A learning theory approac h to non-inte ractiv e database priv acy . In STOC , pages 609–618 . A CM, 2008. [5] K. Ch audhuri and D. Hsu . Sample complexit y b ounds for different ially priv ate learning. In COL T , 2011. [6] K. Ch audh uri, C. Mon teleoni, and A. Sarwa te. 
Differentially private empirical risk minimization. JMLR, 12:1069, 2011.
[7] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal components. In NIPS, pages 998–1006, 2012.
[8] M. Cheraghchi, A. Klivans, P. Kothari, and H. Lee. Submodular functions are noise stable. In SODA, pages 1586–1592. SIAM, 2012.
[9] K. Choromanski, G. Jagannathan, A. Choromanska, and C. Monteleoni. Differentially-private learning of low dimensional manifolds. In ALT, 2013.
[10] J. Duchi, M. Jordan, and M. Wainwright. Privacy aware learning. In NIPS, 2012.
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, pages 265–284, 2006.
[12] C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381–390. ACM, 2009.
[13] C. Dwork, A. Nikolov, and K. Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. arXiv preprint arXiv:1308.1385, 2013.
[14] C. Dwork, G. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE, 2010.
[15] A. Gupta, M. Hardt, A. Roth, and J. Ullman. Privately releasing conjunctions and the statistical query barrier. In STOC, pages 803–812. ACM, 2011.
[16] M. Hardt. Robust subspace iteration and privacy-preserving spectral analysis. arXiv preprint arXiv:1311.2495, 2013.
[17] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In NIPS, pages 2348–2356, 2012.
[18] M. Hardt and G. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70. IEEE Computer Society, 2010.
[19] M. Hardt, G. N. Rothblum, and R. A. Servedio. Private data release via learning thresholds.
In SODA, pages 168–187. SIAM, 2012.
[20] J. Indritz. An inequality for Hermite polynomials. Proceedings of the American Mathematical Society, 12(6):981–983, 1961.
[21] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, 2012.
[22] D. Kifer and B. Lin. Towards an axiomatization of statistical privacy and utility. In PODS, pages 147–158. ACM, 2010.
[23] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In KDD, pages 193–204. ACM, 2011.
[24] J. Lee and C. Clifton. Differential identifiability. In KDD, pages 1041–1049. ACM, 2012.
[25] J. Lei. Differentially private M-estimators. In NIPS, 2011.
[26] R. D. Monteiro and I. Adler. Interior path following primal-dual algorithms. Part I: Linear programming. Math. Program., 44(1):27–41, June 1989.
[27] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774. ACM, 2010.
[28] A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637–649, 1998.
[29] J. Thaler, J. Ullman, and S. Vadhan. Faster algorithms for privately releasing marginals. In ICALP, pages 810–821. Springer, 2012.
[30] J. Ullman and S. Vadhan. PCPs and the hardness of generating private synthetic data. In TCC, pages 400–416. Springer, 2011.
[31] A. W. Van Der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[32] G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods-Support Vector Learning, 6:69–87, 1999.
[33] Z. Wang, K. Fan, J. Zhang, and L. Wang. Efficient algorithm for privately releasing smooth queries. In NIPS, 2013.
[34] L. Wasserman and S. Zhou. A statistical framework for differential privacy.
Journal of the American Statistical Association, 105(489):375–389, 2010.
[35] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In NIPS, 2010.