Private Empirical Risk Minimization Beyond the Worst Case: The Effect of the Constraint Set Geometry
Authors: Kunal Talwar, Abhradeep Thakurta, Li Zhang
November 22, 2016

Abstract

Empirical Risk Minimization (ERM) is a standard technique in machine learning, where a model is selected by minimizing a loss function over a constraint set. When the training dataset consists of private information, it is natural to use a differentially private ERM algorithm, and this problem has been the subject of a long line of work [CM08, KST12, JKT12, ST13a, DJW13, JT14, BST14, Ull14]. A private ERM algorithm outputs an approximate minimizer of the loss function, and its error can be measured as the difference from the optimal value of the loss function. When the constraint set is arbitrary, the required error bounds are fairly well understood [BST14]. In this work, we show that the geometric properties of the constraint set can be used to derive significantly better results. Specifically, we show that a differentially private version of Mirror Descent leads to error bounds of the form $\tilde{O}(G_\mathcal{C}/n)$ for a Lipschitz loss function, improving on the $\tilde{O}(\sqrt{p}/n)$ bounds in [BST14]. Here $p$ is the dimensionality of the problem, $n$ is the number of data points in the training set, and $G_\mathcal{C}$ denotes the Gaussian width of the constraint set $\mathcal{C}$ that we optimize over. We show similar improvements for strongly convex functions and for smooth functions. In addition, we show that when the loss function is Lipschitz with respect to the $\ell_1$ norm and $\mathcal{C}$ is $\ell_1$-bounded, a differentially private version of the Frank-Wolfe algorithm gives error bounds of the form $\tilde{O}(n^{-2/3})$. This captures the important and common case of sparse linear regression (LASSO), when the data $x_i$ satisfy $\|x_i\|_\infty \le 1$ and we optimize over the $\ell_1$ ball. We also show that our algorithm is nearly optimal by proving a matching lower bound for this setting.

* Google. kunal@kunaltalwar.org.
(Part of this research was performed at the now defunct Microsoft Research Silicon Valley.)
† Yahoo Labs, Sunnyvale. abhradeep@yahoo-inc.com
‡ Google. liqzhang@google.com. (Part of this research was performed at the now defunct Microsoft Research Silicon Valley.)

1 Introduction

A common task in supervised learning is to select the model that best fits the data. This is frequently achieved by selecting a loss function that associates a real-valued loss with each datapoint $d$ and model $\theta$, and then selecting, from a class of admissible models, the model $\theta$ that minimizes the average loss over all data points in the training set. This procedure is commonly referred to as Empirical Risk Minimization (ERM).

The availability of large datasets containing sensitive information from individuals has motivated the study of learning algorithms that guarantee the privacy of the individuals contributing to the database. A rigorous and by-now standard privacy guarantee is the notion of differential privacy. In this work, we study the design of differentially private algorithms for Empirical Risk Minimization, continuing a long line of work initiated by [CM08] and continued in [CMS11, KST12, JKT12, ST13a, DJW13, JT14, BST14, Ull14].

As an example, suppose that the training dataset $D$ consists of $n$ pairs of data $d_i = (x_i, y_i)$, where $x_i \in \mathbb{R}^p$ is usually called the feature vector and $y_i \in \mathbb{R}$ the prediction. The goal is to find a "reasonable model" $\theta \in \mathbb{R}^p$ such that $y_i$ can be predicted from the model $\theta$ and the feature vector $x_i$. The quality of approximation is usually measured by a loss function $\mathcal{L}(\theta; d_i)$, and the empirical loss is defined as $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_i \mathcal{L}(\theta; d_i)$. For example, in the linear model with squared loss, $\mathcal{L}(\theta; d_i) = (\langle \theta, x_i \rangle - y_i)^2$. Commonly, one restricts $\theta$ to come from a constraint set $\mathcal{C}$.
This can account for additional knowledge about $\theta$, or can help in avoiding overfitting and making the learning algorithm more stable. This leads to the constrained optimization problem of computing $\theta^* = \arg\min_{\theta \in \mathcal{C}} \mathcal{L}(\theta; D)$. For example, in the classical sparse linear regression problem, we set $\mathcal{C}$ to be the $\ell_1$ ball. Our goal is now to compute a model $\theta$ that is private with respect to changes in a single $d_i$ while having high quality, where the quality is measured by the excess empirical risk compared to the optimal model.

Problem definition: Given a data set $D = \{d_1, \cdots, d_n\}$ of $n$ samples from a domain $\mathcal{D}$, a convex set $\mathcal{C} \subseteq \mathbb{R}^p$, and a convex loss function $\mathcal{L} : \mathcal{C} \times \mathcal{D} \to \mathbb{R}$, for any model $\theta$, define its excess empirical risk as
$$R(\theta; D) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \mathcal{L}(\theta; d_i) - \min_{\theta' \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^n \mathcal{L}(\theta'; d_i). \qquad (1)$$
We define the risk of a mechanism $A$ on a data set $D$ as $R(A; D) = \mathbb{E}[R(A(D); D)]$, where the expectation is over the internal randomness of $A$, and the risk $R(A) = \max_{D \in \mathcal{D}^n} R(A; D)$ is the maximum risk over all possible data sets. Our objective is then to design a mechanism $A$ which preserves $(\epsilon, \delta)$-differential privacy (Definition 2.1) and achieves as low a risk as possible. We call the minimum achievable risk the privacy risk, defined as $\min_A R(A)$, where the minimum is over all $(\epsilon, \delta)$-differentially private mechanisms $A$.

Previous work on private ERM has studied this problem under fairly general conditions. For convex loss functions $\mathcal{L}(\theta; d_i)$ that for every $d_i$ are 1-Lipschitz as functions from $(\mathbb{R}^p, \ell_2)$ to $\mathbb{R}$ (i.e., are Lipschitz in the first parameter with respect to the $\ell_2$ norm), and for $\mathcal{C}$ contained in the unit $\ell_2$ ball, [BST14] showed^1 that the privacy risk is at most $\tilde{O}(\sqrt{p}/n)$. They also showed that this bound cannot be improved in general, even for the squared loss function.
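To make definition (1) concrete, the sketch below evaluates the excess empirical risk by brute force for a one-dimensional squared-loss instance with $\mathcal{C} = [-1, 1]$ discretized to a grid. The helper names and the toy data are ours, not from the paper.

```python
def excess_empirical_risk(theta, data, loss, candidates):
    """R(theta; D) per (1): empirical loss at theta minus the minimum
    empirical loss over the constraint set (here, a finite grid)."""
    emp = lambda th: sum(loss(th, d) for d in data) / len(data)
    return emp(theta) - min(emp(c) for c in candidates)

# One-dimensional squared loss, C = [-1, 1] discretized to 2001 points.
loss = lambda th, d: (th * d[0] - d[1]) ** 2
data = [(1.0, 0.4), (1.0, 0.6)]
grid = [i / 1000.0 for i in range(-1000, 1001)]

print(excess_empirical_risk(0.5, data, loss, grid))  # 0.0: theta = 0.5 minimizes the empirical loss
print(excess_empirical_risk(0.0, data, loss, grid))  # strictly positive
```

For this instance the empirical loss is $\frac{1}{2}((\theta - 0.4)^2 + (\theta - 0.6)^2)$, minimized at $\theta = 0.5$, so the excess risk vanishes exactly at that grid point.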
Similarly, they gave tight bounds under stronger assumptions on the loss functions (more details below). In this work, we go beyond these worst-case bounds by exploiting properties of the constraint set $\mathcal{C}$. In the setting of the previous paragraph, we show that the $\sqrt{p}$ term in the privacy risk can be replaced by the Gaussian width of $\mathcal{C}$, defined as $G_\mathcal{C} = \mathbb{E}_{g \sim \mathcal{N}(0,1)^p}[\sup_{\theta \in \mathcal{C}} \langle \theta, g \rangle]$. The Gaussian width is a well-studied quantity in convex geometry that captures the global geometry of $\mathcal{C}$ [Bal97]. For a $\mathcal{C}$ contained in the unit $\ell_2$ ball it is never larger than $O(\sqrt{p})$ and can be significantly smaller. For example, for the $\ell_1$ ball, the Gaussian width is only $\Theta(\sqrt{\log p})$. Similarly, we give improved bounds under other assumptions on the loss functions. These bounds are proved by analyzing a noisy version of the mirror descent algorithm [NY83, BT03].

^1 Throughout the paper, we use $\tilde{O}, \tilde{\Omega}$ to hide polynomial factors in $1/\epsilon$, $\log(1/\delta)$, $\log n$, and $\log p$.

In the simplest setting, when the loss function $\mathcal{L}(\cdot, d)$ is convex and $L_2$-Lipschitz with respect to the $\ell_2$ norm on the parameter space, we get the following result. The precise bounds require a potential function that is tailored to the convex set $\mathcal{C}$. In the following, let $\|\mathcal{C}\|_2$ denote the $\ell_2$ radius of $\mathcal{C}$, and $G_\mathcal{C}$ the Gaussian width of $\mathcal{C}$.

Theorem 1.1 (Informal version). There exists an $(\epsilon, \delta)$-differentially private algorithm $A$ such that
$$R(A) = O\left(\frac{L_2\, G_\mathcal{C} \log(n/\delta)}{n\epsilon}\right).$$
In particular, $R(A) = O\left(\frac{L_2 \|\mathcal{C}\|_2 \sqrt{p} \log(n/\delta)}{n\epsilon}\right)$, and if $\mathcal{C}$ is a polytope with $k$ vertices, $R(A) = O\left(\frac{L_2 \|\mathcal{C}\|_2 \log k \log(n/\delta)}{n\epsilon}\right)$.

Similar improvements can be shown (Section 3.2) for other constraint sets, such as those bounding the grouped $\ell_1$ norm, interpolation norms, or the nuclear norm when the vector is viewed as a matrix. When one additionally assumes that the loss functions satisfy a strong convexity definition (Appendix A), we can get further improved bounds.
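The gap between $\sqrt{p}$ and the Gaussian width of a small constraint set can be checked numerically. The Monte Carlo sketch below (all names are ours) uses the facts that the supremum of $\langle\theta, g\rangle$ over the unit $\ell_2$ ball is $\|g\|_2$, and over the unit $\ell_1$ ball is $\|g\|_\infty$.

```python
import math
import random

random.seed(0)

def gaussian_width_mc(support_fn, p, trials=2000):
    """Monte Carlo estimate of G_C = E_g[sup_{theta in C} <theta, g>]."""
    total = 0.0
    for _ in range(trials):
        g = [random.gauss(0.0, 1.0) for _ in range(p)]
        total += support_fn(g)  # sup_{theta in C} <theta, g>
    return total / trials

p = 256
# For the unit l2 ball the supremum is ||g||_2 (attained at theta = g/||g||_2).
w_l2 = gaussian_width_mc(lambda g: math.sqrt(sum(x * x for x in g)), p)
# For the unit l1 ball the supremum is ||g||_inf (attained at a signed vertex).
w_l1 = gaussian_width_mc(lambda g: max(abs(x) for x in g), p)

print(w_l2)  # close to sqrt(p) = 16
print(w_l1)  # on the order of sqrt(log p), much smaller
```

The $\ell_1$ ball's width grows only logarithmically in $p$, which is exactly the gap the mirror descent bounds exploit.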
Moreover, for smooth loss functions (Section 4), we can show that a simpler objective perturbation algorithm [CMS11, KST12] gives Gaussian-width-dependent bounds similar to the ones above. Our work also implies Gaussian-width-dependent convergence bounds for the noisy (stochastic) mirror descent algorithm, which may be of independent interest.

The bounds based on mirror descent have a dependence on the $\ell_2$ Lipschitz constant. This constant might be too large for some problems. For example, for the popular sparse linear regression problem, one often assumes $x_i$ to have bounded $\ell_\infty$ norm, i.e., each entry of $x_i$, rather than $\|x_i\|_2$, is bounded. The $\ell_2$ Lipschitz constant is then polynomial in $p$ and leads to a loose bound. In these cases, it would be more beneficial to have a dependence on the $\ell_1$ Lipschitz constant. Our next contribution addresses this issue. We show that when $\mathcal{C}$ is the $\ell_1$ ball, one can get significantly better bounds using a differentially private version of the Frank-Wolfe algorithm. Let $\|\mathcal{C}\|_1$ denote the maximum $\ell_1$ radius of $\mathcal{C}$, and $\Gamma_\mathcal{L}$ the curvature constant of $\mathcal{L}$ (precise definition in Section 5).

Theorem 1.2. If $\mathcal{C}$ is a polytope with $k$ vertices, then there exists an $(\epsilon, \delta)$-differentially private algorithm $A$ such that
$$R(A) = O\left(\frac{\Gamma_\mathcal{L}^{1/3}\,(L_1 \|\mathcal{C}\|_1)^{2/3}\,\log(nk)\,\sqrt{\log(1/\delta)}}{(n\epsilon)^{2/3}}\right).$$
In particular, for the sparse linear regression problem where each $\|x_i\|_\infty \le 1$, we have $R(A) = O\left(\log(np/\delta)/(n\epsilon)^{2/3}\right)$.

Finally, we use the fingerprinting code lower bound technique developed in [BUV14] to show that the upper bound for the sparse linear regression problem, and hence the above result, is nearly tight.

Theorem 1.3. For the sparse linear regression problem where $\|x_i\|_\infty \le 1$, for $\epsilon = 0.1$ and $\delta = 1/n$, any $(\epsilon, \delta)$-differentially private algorithm $A$ must have $R(A) = \Omega\left(1/(n \log n)^{2/3}\right)$.

In Table 1 we summarize our upper and lower bounds.
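For intuition about Theorem 1.2: Frank-Wolfe only needs a linear minimization oracle, which for the $\ell_1$ ball is solved at a signed coordinate vertex, making it a natural fit for this geometry. Below is a minimal non-private sketch over the $\ell_1$ ball (names and toy data are ours; the private version of Section 5 would perturb the coordinate scores before picking the vertex).

```python
def frank_wolfe_l1(grad_fn, p, radius=1.0, steps=100):
    """Minimize a smooth convex f over the l1 ball of the given radius.

    grad_fn(theta) returns the gradient of f at theta. Each step moves
    toward the vertex of the l1 ball minimizing the linear approximation.
    """
    theta = [0.0] * p
    for t in range(steps):
        g = grad_fn(theta)
        # Linear minimization oracle: the vertex -radius * sign(g_i*) for the
        # coordinate i* with the largest |g_i|.
        i_star = max(range(p), key=lambda i: abs(g[i]))
        vertex = [0.0] * p
        vertex[i_star] = -radius if g[i_star] > 0 else radius
        gamma = 2.0 / (t + 2.0)  # standard step size
        theta = [(1 - gamma) * th + gamma * v for th, v in zip(theta, vertex)]
    return theta

# Toy problem: minimize f(theta) = ||theta - c||_2^2 with c = (0.9, 0, 0);
# the constrained optimum lies inside the l1 ball, at theta = c.
c = [0.9, 0.0, 0.0]
grad = lambda th: [2 * (t - ci) for t, ci in zip(th, c)]
theta = frank_wolfe_l1(grad, p=3)
print(theta[0])  # approaches 0.9
```

The iterates are sparse by construction (a convex combination of at most `steps` vertices), which is the same mechanism behind the LASSO application.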
Combining our results with those of [BST14], we show in particular that all the bounds in this paper are essentially tight. The lower bound for the $\ell_1$-norm case does not follow from [BST14], and we provide a new lower bound argument.

Our results enlarge the set of problems for which privacy comes "for free". Given $n$ samples from a distribution, suppose that $\theta^*$ is the empirical risk minimizer and $\theta_{priv}$ is the differentially private approximate minimizer. Then the non-private ERM algorithm outputs $\theta^*$ and incurs expected (on the distribution) loss equal to loss($\theta^*$, training-set) + generalization error, where the generalization error term depends on the loss function, on $\mathcal{C}$, and on the number of samples $n$. The differentially private algorithm incurs an additional loss of the privacy risk.

Table 1: Upper and lower bounds for $(\epsilon, \delta)$-differentially private ERM. $k$ denotes the number of corners of the convex set $\mathcal{C}$. (In general the dependence is on the Gaussian width of $\mathcal{C}$, generalizing $\sqrt{p}$ or $\sqrt{\log k}$.) The curvature parameter is a weaker condition than smoothness, and is in particular bounded by the smoothness. Bounds ignore multiplicative dependence on $\log(1/\delta)$, and in the lower bounds $\epsilon$ is treated as a constant. The lower bounds of [BST14] have the form $\Omega(\min\{n, \cdots\})$.

Assumption: 1-Lipschitz w.r.t. $\ell_2$-norm and $\|\mathcal{C}\|_2 = 1$.
- Previous upper bound: $\frac{\sqrt{p}}{n\epsilon}$ [BST14]. Previous lower bound: $\Omega\left(\frac{\sqrt{p}}{n\epsilon}\right)$ [BST14].
- This work, upper bound (mirror descent): $\frac{1}{n\epsilon}\min\{\sqrt{p}, \log k\}$.

Assumption: ... and $\lambda$-smooth.
- Previous upper bound: $\frac{\sqrt{p} + \lambda}{n\epsilon}$ [CMS11]. Previous lower bound: $\Omega\left(\frac{\sqrt{p}}{n\epsilon}\right)$ [BST14] (for $\lambda = O(p)$).
- This work, upper bounds (Frank-Wolfe): $\frac{\lambda^{1/3}}{(n\epsilon)^{2/3}}\min\{p^{1/3}, \log^{1/3} k\}$; (objective perturbation): $\frac{\min\{\sqrt{p}, \sqrt{\log k}\} + \lambda}{n\epsilon}$.

Assumption: 1-Lipschitz w.r.t. $\ell_1$-norm, $\|\mathcal{C}\|_1 = 1$, and curvature $\Gamma$.
- This work, upper bound (Frank-Wolfe): $\frac{\Gamma^{1/3}\log(nk)}{(n\epsilon)^{2/3}}$. This work, lower bound: $\tilde{\Omega}\left(\frac{1}{n^{2/3}}\right)$.
If the privacy risk is asymptotically no larger than the generalization error, we can think of privacy as coming for free: under the assumption that $n$ is large enough to make the generalization error small, $n$ is also large enough to make the privacy risk small. For many of these problems, our work gives privacy risk bounds that are close to the best known generalization bounds for those settings. More concretely, when $\|\mathcal{C}\|_2 \le 1$ and the loss function is 1-Lipschitz in the $\ell_2$-norm, the known generalization error bounds strictly dominate the privacy risk when $n = \omega(G_\mathcal{C}^4)$ [SSSSS09, Theorem 7]. When $\mathcal{C}$ is the $\ell_1$-ball and the loss function is the squared loss with $\|x\|_\infty \le 1$ and $|y| \le 1$, the generalization error dominates the privacy risk when $n = \omega(\log^3 p)$ [BM03, Theorem 18].

1.1 Related work

In the following we distinguish between two settings: (i) the convex set is bounded in the $\ell_2$-norm and the loss function is 1-Lipschitz in the $\ell_2$-norm (the $(\ell_2/\ell_2)$-setting for brevity), and (ii) the convex set is bounded in the $\ell_1$-norm and the loss function is 1-Lipschitz in the $\ell_1$-norm (the $(\ell_1/\ell_\infty)$-setting).

The $(\ell_2/\ell_2)$-setting: In all the works on private convex optimization that we are aware of, either the excess risk guarantees depend polynomially on the dimensionality of the problem ($p$), or special structure is assumed of the loss (e.g., generalized linear models [JT14] or linear losses [DNPR10, ST13b]). A similar dependence is also present in the online version of the problem [JKT12, ST13c]. [BST14] recently showed that in the private ERM setting, this polynomial dependence on $p$ is in general unavoidable. In our work we show that one can replace this dependence on $p$ with the Gaussian width of the constraint set $\mathcal{C}$, which can be much smaller. We use the mirror descent algorithm of [BT03] as our building block.
The $(\ell_1/\ell_\infty)$-setting: The only results in this setting that we are aware of are [KST12, ST13a, JT14, ST13b]. The first two works make certain assumptions about the instance (restricted strong convexity (RSC) and mutual incoherence). Under these assumptions, they obtain privacy risk guarantees that depend logarithmically on the dimension $p$, thus allowing the guarantees to be meaningful even when $p \gg n$. In fact their bound of $O(\mathrm{polylog}\, p / n)$ can be better than our tight bound of $O(\mathrm{polylog}\, p / n^{2/3})$. However, these assumptions on the data are strong and may not hold in practice [Was12]. Our guarantees do not require any such data-dependent assumptions. The result of [JT14] captures the scenario where the constraint set $\mathcal{C}$ is the probability simplex and the loss function is a generalized linear model, but provides a worse bound of $O(\mathrm{polylog}\, p / n^{1/3})$.

Effect of Gaussian width in risk minimization: For linear losses, i.e., when the loss functions are of the form $\langle \theta, d \rangle$, the notions of Rademacher and Gaussian complexities are closely related to the Gaussian width. One of the initial works that formalized this connection was [BM03]. In particular, they bound the excess generalization error by the Gaussian complexity of the constraint set $\mathcal{C}$, which in the context of linear functions is very similar to the Gaussian width. Recently, [CRPW12] showed that the Gaussian width of a constraint set $\mathcal{C}$ is closely related to the number of generic linear measurements one needs to perform to recover an underlying model $\theta^* \in \mathcal{C}$.

[SZ13] analyzed the problem of noisy stochastic gradient descent (SGD) for general convex loss functions. Their empirical risk guarantees depend polynomially on the $\ell_2$-norm of the noise vector that is added during the gradient computation in the SGD algorithm.
As a corollary of our results, we show that if the noise vector is sub-Gaussian (not necessarily spherical), the polynomial dependence on the $\ell_2$-norm of the noise can be replaced by a term depending on the Gaussian width of the set $\mathcal{C}$.

Analysis of noisy descent methods: The analysis of noisy versions of gradient descent and mirror descent algorithms has attracted interest for unrelated reasons [RRWN11, DJM13], where asynchronous updates are the source of noise. To our knowledge, this line of work does not take the geometry of the constraint set into account, and thus our results may be applicable to those settings as well. We note that the notion of Gaussian width has been used by [NTZ13] and [DNT13] in the context of differentially private query release mechanisms, but in the very different setting of answering multiple linear queries over a database.

2 Background

2.1 Differential Privacy

The notion of differential privacy (Definition 2.1) is by now a de facto standard for statistical data privacy [DMNS06, Dwo06, Dwo08, Dwo09]. One of the reasons differential privacy has become so popular is that it provides meaningful guarantees even in the presence of arbitrary auxiliary information. At a semantic level, the privacy guarantee ensures that an adversary learns almost the same thing about an individual independent of his presence or absence in the data set. The parameters $(\epsilon, \delta)$ quantify the amount of information leakage. For reasons beyond the scope of this work, $\epsilon \approx 0.1$ and $\delta = 1/n^{\omega(1)}$ are a good choice of parameters. Here $n$ refers to the number of samples in the data set.

Definition 2.1. A randomized algorithm $A$ is $(\epsilon, \delta)$-differentially private ([DMNS06, DKM+06]) if, for all neighboring data sets $D$ and $D'$ (i.e., they differ in one record, or equivalently $d_H(D, D') = 1$) and for all events $S$ in the output space of $A$, we have $\Pr(A(D) \in S) \le e^\epsilon \Pr(A(D') \in S) + \delta$.
Here $d_H(D, D')$ refers to the Hamming distance.

2.2 Bregman Divergence, Convexity, Norms, and Gaussian Width

In this section we review some concepts from convex optimization that are useful for the exposition of our algorithms. In all the definitions below we assume that the set $\mathcal{C} \subseteq \mathbb{R}^p$ is closed and convex.

$\ell_q$-norm, $q \ge 1$: For $q \ge 1$, the $\ell_q$-norm of a vector $v \in \mathbb{R}^p$ is defined as $\left(\sum_{i=1}^p |v(i)|^q\right)^{1/q}$, where $v(i)$ is the $i$-th coordinate of $v$.

Minkowski norm w.r.t. a set $\mathcal{C} \subseteq \mathbb{R}^p$: For any vector $v \in \mathbb{R}^p$, the Minkowski norm (denoted $\|v\|_\mathcal{C}$) w.r.t. a centrally symmetric convex set $\mathcal{C}$ is defined as $\|v\|_\mathcal{C} = \min\{r \ge 0 : v \in r\mathcal{C}\}$.

$L$-Lipschitz continuity w.r.t. a norm $\|\cdot\|$: A function $\Psi : \mathcal{C} \to \mathbb{R}$ is $L$-Lipschitz within a set $\mathcal{C}$ w.r.t. a norm $\|\cdot\|$ if for all $\theta_1, \theta_2 \in \mathcal{C}$, $|\Psi(\theta_1) - \Psi(\theta_2)| \le L \cdot \|\theta_1 - \theta_2\|$.

Convexity and $\Delta$-strong convexity w.r.t. a norm $\|\cdot\|$: A function $\Psi : \mathcal{C} \to \mathbb{R}$ is convex if for all $\theta_1, \theta_2 \in \mathcal{C}$ and $\alpha \in [0, 1]$, $\Psi(\alpha\theta_1 + (1-\alpha)\theta_2) \le \alpha\Psi(\theta_1) + (1-\alpha)\Psi(\theta_2)$. A function is $\Delta$-strongly convex within a set $\mathcal{C}$ w.r.t. a norm $\|\cdot\|$ if for all $\theta_1, \theta_2 \in \mathcal{C}$ and $\alpha \in [0, 1]$,
$$\Psi(\alpha\theta_1 + (1-\alpha)\theta_2) \le \alpha\Psi(\theta_1) + (1-\alpha)\Psi(\theta_2) - \frac{\Delta\,\alpha(1-\alpha)}{2}\|\theta_1 - \theta_2\|^2.$$

Bregman divergence: For any differentiable convex function $\Psi : \mathbb{R}^p \to \mathbb{R}$, the Bregman divergence $B_\Psi : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is defined as $B_\Psi(\theta_1, \theta_2) = \Psi(\theta_1) - \Psi(\theta_2) - \langle \nabla\Psi(\theta_2), \theta_1 - \theta_2 \rangle$. Notice that the Bregman divergence is always non-negative, and convex in its first argument.

$\Delta$-strong convexity w.r.t. a function $\Psi$: A function $f : \mathcal{C} \to \mathbb{R}$ is $\Delta$-strongly convex within a set $\mathcal{C}$ w.r.t. a differentiable convex function $\Psi$ if for all $\theta_1, \theta_2 \in \mathcal{C}$ and $\alpha \in [0, 1]$,
$$f(\alpha\theta_1 + (1-\alpha)\theta_2) \le \alpha f(\theta_1) + (1-\alpha)f(\theta_2) - \frac{\Delta\,\alpha(1-\alpha)}{2} B_\Psi(\theta_1, \theta_2).$$
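Two standard instances of the Bregman divergence are worth keeping in mind: the potential $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$ gives $B_\Psi(\theta_1, \theta_2) = \frac{1}{2}\|\theta_1 - \theta_2\|_2^2$, and the negative-entropy potential on the probability simplex gives the KL divergence. A small numerical check (names ours):

```python
import math

def bregman(psi, grad_psi, t1, t2):
    """B_psi(t1, t2) = psi(t1) - psi(t2) - <grad psi(t2), t1 - t2>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(t2), t1, t2))
    return psi(t1) - psi(t2) - inner

# Potential 1: psi(t) = 0.5 * ||t||_2^2  =>  B(t1, t2) = 0.5 * ||t1 - t2||_2^2.
sq = lambda t: 0.5 * sum(x * x for x in t)
sq_grad = lambda t: list(t)

# Potential 2: negative entropy on the simplex  =>  B is the KL divergence.
ent = lambda t: sum(x * math.log(x) for x in t)
ent_grad = lambda t: [math.log(x) + 1.0 for x in t]

t1, t2 = [0.2, 0.8], [0.5, 0.5]
print(bregman(sq, sq_grad, t1, t2))  # ~0.5 * (0.09 + 0.09) ... i.e. about 0.09
kl = sum(a * math.log(a / b) for a, b in zip(t1, t2))
print(abs(bregman(ent, ent_grad, t1, t2) - kl) < 1e-12)  # True
```

The second potential is the one that turns mirror descent into the classical multiplicative-weights update on the simplex.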
Duality: The following duality property of norms (Fact 2.2) will be useful throughout the rest of this paper. Recall that for any pair of dual norms $\|\cdot\|_a$ and $\|\cdot\|_b$, and $x, y \in \mathbb{R}^p$, Hölder's inequality says that $|\langle x, y \rangle| \le \|x\|_a \cdot \|y\|_b$.

Fact 2.2. The dual of the $\ell_q$ norm is the $\ell_{q'}$ norm, where $1/q + 1/q' = 1$. The dual of $\|\cdot\|_\mathcal{C}$ is $\|\cdot\|_{\mathcal{C}^*}$, where for any vector $v \in \mathbb{R}^p$, $\|v\|_{\mathcal{C}^*} = \max_{w \in \mathcal{C}} |\langle w, v \rangle|$.

Gaussian width of a set $\mathcal{C}$: Let $b \sim \mathcal{N}(0, I_p)$ be a Gaussian random vector in $\mathbb{R}^p$. The Gaussian width of a set $\mathcal{C}$ is defined as $G_\mathcal{C} \stackrel{\text{def}}{=} \mathbb{E}_b\left[\sup_{w \in \mathcal{C}} \langle b, w \rangle\right]$.

Fact 2.3 (Concentration of Gaussian width [BLM13]). Let $W = \sup_{w \in \mathcal{C}} \langle b, w \rangle$, where $b \sim \mathcal{N}(0,1)^p$, and let $\alpha^2 = \max_{\theta \in \mathcal{C}} \|\theta\|_2^2$. Then $\Pr[|W - G_\mathcal{C}| \ge u] \le 2e^{-\frac{u^2}{2\alpha^2}}$.

3 Private Mirror Descent and the Geometry of $\mathcal{C}$

In this section we introduce the well-established mirror descent algorithm [NY83] in the context of private convex optimization. Since mirror descent is designed to closely follow the geometry of the convex set $\mathcal{C}$, we obtain much tighter bounds than were previously known in the literature for a large class of interesting instantiations of $\mathcal{C}$. More precisely, using private mirror descent one can show that the privacy risk depends on the Gaussian width (see Section 2.2) instead of explicitly on the dimensionality $p$. The main technical contribution in the analysis of private (noisy) mirror descent is to express the expected potential drop in terms of the Gaussian width. (See (4) in the proof of Theorem 3.2.)

3.1 Private Mirror Descent Method

In Algorithm 1 we define our private mirror descent procedure. The algorithm takes as input a potential function $\Psi$ that is chosen based on the constraint set $\mathcal{C}$. $B_\Psi$ refers to the Bregman divergence with respect to $\Psi$ (see Section 2.2). If $\mathcal{L}(\theta; d)$ is not differentiable at $\theta$, we use any subgradient at $\theta$ instead of $\nabla\mathcal{L}(\theta; d)$.
Algorithm 1 $A_{Noise-MD}$: Differentially Private Mirror Descent
Input: Data set $D = \{d_1, \cdots, d_n\}$; loss function $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(\theta; d_i)$ (with $\ell_2$-Lipschitz constant $L$ for $\mathcal{L}$); privacy parameters $(\epsilon, \delta)$; convex set $\mathcal{C}$; potential function $\Psi : \mathcal{C} \to \mathbb{R}$; learning rate $\eta : [T+1] \to \mathbb{R}$.
1: Set noise variance $\sigma^2 \leftarrow \frac{32 L^2 T \log^2(T/\delta)}{(n\epsilon)^2}$.
2: Let $\theta_1$ be an arbitrary point in $\mathcal{C}$.
3: for $t = 1$ to $T$ do
4:   $\theta_{t+1} = \arg\min_{\theta \in \mathcal{C}} \langle \eta_{t+1}(\nabla\mathcal{L}(\theta_t; D) + b_t) - \nabla\Psi(\theta_t), \theta - \theta_t \rangle + \Psi(\theta)$, where $b_t \sim \mathcal{N}(0, I_p\sigma^2)$.
5: Output $\theta_{priv} \leftarrow \frac{1}{T}\sum_{t=1}^T \theta_t$.

Theorem 3.1 (Privacy guarantee). Algorithm 1 is $(\epsilon, \delta)$-differentially private.

The proof of this theorem is fairly straightforward and follows from the by-now standard privacy guarantee of the Gaussian mechanism [DKM+06] and the strong composition theorem [DRV10]. For a detailed proof, we refer the reader to [BST14, Theorem 2.1].

To establish the utility guarantee in a general form, it will be useful to introduce a symmetric convex body $\mathcal{Q}$ (and the norm $\|\cdot\|_\mathcal{Q}$) w.r.t. which the potential function $\Psi$ is strongly convex. We will instantiate this theorem with various choices of $\mathcal{Q}$ and $\Psi$ depending on $\mathcal{C}$ in Section 3.2. While relatively standard in mirror descent algorithms, the reader may find it somewhat counter-intuitive that $\mathcal{Q}$ enters the algorithm only through the potential function $\Psi$, yet plays an important role in the analysis and the resulting guarantee. In most cases we will set $\mathcal{Q} = \mathcal{C}$, and the reader may find it convenient to think of that case. Our proof of the theorem below closely follows the analysis of mirror descent from [ST10]. One can obtain stronger guarantees (typically $\tilde{O}(1/(n\epsilon)^2)$) under strong convexity assumptions on the loss function; we defer the details of this result to Appendix A.

Theorem 3.2 (Utility guarantee).
Suppose that for any $d \in \mathcal{D}$, the loss function $\mathcal{L}(\cdot; d)$ is convex and $L$-Lipschitz with respect to the $\ell_2$ norm. Let $\mathcal{Q} \subseteq \mathbb{R}^p$ be a symmetric convex set with Gaussian width $G_\mathcal{Q}$ and $\ell_2$-diameter $\|\mathcal{Q}\|_2$, and let the potential $\Psi : \mathcal{C} \to \mathbb{R}$ chosen in Algorithm $A_{Noise-MD}$ (Algorithm 1) be 1-strongly convex w.r.t. the $\|\cdot\|_\mathcal{Q}$-norm. If $T = \frac{\|\mathcal{Q}\|_2^2\,\epsilon^2 n^2}{\log^2(n/\delta)\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right)}$ and for all $t \in [T+1]$, $\eta_t = \eta = \frac{\sqrt{\max_{\theta\in\mathcal{C}}\Psi(\theta)}}{L\|\mathcal{Q}\|_2\sqrt{T}}$, then
$$\mathbb{E}\left[\mathcal{L}(\theta_{priv}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{L\sqrt{\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right)\max_{\theta\in\mathcal{C}}\Psi(\theta)}\,\log(n/\delta)}{n\epsilon}\right).$$

Remark 1. Notice that the bound above is scale-invariant. For example, given an initial choice of the convex set $\mathcal{Q}$, scaling $\mathcal{Q}$ may reduce $G_\mathcal{Q}$, but at the same time it will scale up the strong convexity parameter.

Proof of Theorem 3.2. For ease of notation we ignore the parameterization of $\mathcal{L}(\theta; D)$ by the data set $D$ and simply write $\mathcal{L}(\theta)$. To begin with, a direct application of Jensen's inequality gives
$$\mathcal{L}(\theta_{priv}) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta) \le \frac{1}{T}\sum_{t=1}^T \mathcal{L}(\theta_t) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta). \qquad (2)$$
So it suffices to bound the R.H.S. of (2) in order to bound the excess empirical risk. In Claim 3.3, we upper bound the R.H.S. of (2) by a sequence of linear approximations of $\mathcal{L}(\theta)$, thus "linearizing" our analysis.

Claim 3.3. Let $\theta^* = \arg\min_{\theta\in\mathcal{C}}\mathcal{L}(\theta)$. For every $t \in [T]$, let $\gamma_t$ be the subgradient of $\mathcal{L}(\theta_t)$ used in iteration $t$ of Algorithm $A_{Noise-MD}$ (Algorithm 1). Then the convexity of the loss function implies that
$$\frac{1}{T}\sum_{t=1}^T \mathcal{L}(\theta_t) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta) \le \frac{1}{T}\sum_{t=1}^T \langle \gamma_t, \theta_t - \theta^* \rangle.$$
Thus it suffices to bound $\frac{1}{T}\sum_{t=1}^T \langle \gamma_t, \theta_t - \theta^* \rangle$ in order to bound the privacy risk. By simple algebraic manipulation we have the following. (Recall that $b_t$ is the noise vector used in Algorithm $A_{Noise-MD}$.)
$$\eta\langle\gamma_t + b_t, \theta_t - \theta^*\rangle = \eta\langle\gamma_t + b_t, \theta_t - \theta_{t+1} + \theta_{t+1} - \theta^*\rangle = \underbrace{\eta\langle\gamma_t + b_t, \theta_t - \theta_{t+1}\rangle}_{A} + \underbrace{\langle\eta(\gamma_t + b_t) + \nabla\Psi(\theta_{t+1}) - \nabla\Psi(\theta_t), \theta_{t+1} - \theta^*\rangle}_{B} + \underbrace{\langle\nabla\Psi(\theta_t) - \nabla\Psi(\theta_{t+1}), \theta_{t+1} - \theta^*\rangle}_{C}. \qquad (3)$$

We next upper bound each of the terms $A$, $B$, and $C$ in (3). By Hölder's inequality, we write
$$A = \eta\langle\gamma_t, \theta_t - \theta_{t+1}\rangle + \eta\langle b_t, \theta_t - \theta_{t+1}\rangle \le \frac{1}{\sqrt{2}}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}\cdot\sqrt{2}\,\eta\|\gamma_t\|_{\mathcal{Q}^*} + \frac{1}{\sqrt{2}}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}\cdot\sqrt{2}\,\eta\|b_t\|_{\mathcal{Q}^*} \le \frac{1}{4}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 + \eta^2\|\gamma_t\|_{\mathcal{Q}^*}^2 + \frac{1}{4}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 + \eta^2\|b_t\|_{\mathcal{Q}^*}^2 = \frac{1}{2}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 + \eta^2\left(\|\gamma_t\|_{\mathcal{Q}^*}^2 + \|b_t\|_{\mathcal{Q}^*}^2\right), \qquad (4)$$
where we have used the AM-GM inequality in the third step. Taking expectations over the choice of $b_t$, we have
$$\mathbb{E}_{b_t}[A] \le \frac{1}{2}\mathbb{E}_{b_t}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 + \eta^2\left(L^2\|\mathcal{Q}\|_2^2 + \mathbb{E}_{b_t}\|b_t\|_{\mathcal{Q}^*}^2\right). \qquad (5)$$

We now bound $\mathbb{E}_{b_t}\|b_t\|_{\mathcal{Q}^*}^2$. First notice that $\|b_t\|_{\mathcal{Q}^*}^2 = \sigma^2\max_{\theta\in\mathcal{Q}}\langle\theta, v\rangle^2$, where $v \sim \mathcal{N}(0,1)^p$. Let $W = \max_{\theta\in\mathcal{Q}}\langle\theta, v\rangle^2$. By Fact 2.3, we have, for any $\mu \ge 0$,
$$\Pr\left[W \ge (\mu+1)^2 G_\mathcal{Q}^2\right] \le 2e^{-\frac{\mu^2 G_\mathcal{Q}^2}{2\|\mathcal{Q}\|_2^2}}. \qquad (6)$$
From (6) we have
$$\mathbb{E}[W] = \int_0^\infty \Pr[W \ge x]\,dx = \int_0^{G_\mathcal{Q}^2}\Pr[W \ge x]\,dx + \int_{G_\mathcal{Q}^2}^\infty\Pr[W \ge x]\,dx \le G_\mathcal{Q}^2 + 2\int_{G_\mathcal{Q}^2}^\infty\exp\left(-\frac{x - G_\mathcal{Q}^2}{2\|\mathcal{Q}\|_2^2}\right)dx = G_\mathcal{Q}^2 + 2\int_0^\infty\exp\left(-\frac{x}{2\|\mathcal{Q}\|_2^2}\right)dx = O\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right). \qquad (7)$$
Using (5) and (7) we have
$$\mathbb{E}_{b_t}[A] \le \frac{1}{2}\mathbb{E}_{b_t}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 + \eta^2\,O\left(L^2\|\mathcal{Q}\|_2^2 + \sigma^2\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right)\right). \qquad (8)$$

We next proceed to bound the term $B$ in (3). By the definition of $\theta_{t+1}$, it follows that
$$\langle\eta(\gamma_t + b_t) - \nabla\Psi(\theta_t), \theta_{t+1}\rangle + \Psi(\theta_{t+1}) \le \langle\eta(\gamma_t + b_t) - \nabla\Psi(\theta_t), \theta^*\rangle + \Psi(\theta^*).$$
This implies that
$$B \le -\Psi(\theta_{t+1}) + \Psi(\theta^*) + \langle\nabla\Psi(\theta^*), \theta_{t+1} - \theta^*\rangle = -B_\Psi(\theta_{t+1}, \theta^*) \le 0. \qquad (9)$$
One can write the term $C$ in (3) as follows.
$$B_\Psi(\theta^*, \theta_t) - B_\Psi(\theta^*, \theta_{t+1}) - B_\Psi(\theta_{t+1}, \theta_t) = \Psi(\theta^*) - \Psi(\theta_t) - \langle\nabla\Psi(\theta_t), \theta^* - \theta_t\rangle - \Psi(\theta^*) + \Psi(\theta_{t+1}) + \langle\nabla\Psi(\theta_{t+1}), \theta^* - \theta_{t+1}\rangle - \Psi(\theta_{t+1}) + \Psi(\theta_t) + \langle\nabla\Psi(\theta_t), \theta_{t+1} - \theta_t\rangle = C. \qquad (10)$$

Notice that since $b_t$ is independent of $\theta_t$, $\mathbb{E}[\langle b_t, \theta_t - \theta^*\rangle] = 0$. Plugging the bounds (8), (9), and (10) into (3), we have
$$\eta\,\mathbb{E}[\langle\gamma_t, \theta_t - \theta^*\rangle] = \eta\,\mathbb{E}[\langle\gamma_t + b_t, \theta_t - \theta^*\rangle] \le B_\Psi(\theta^*, \theta_t) - B_\Psi(\theta^*, \theta_{t+1}) + \eta^2\,O\left(L^2\|\mathcal{Q}\|_2^2 + \sigma^2\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right)\right) + \underbrace{\frac{1}{2}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2 - B_\Psi(\theta_{t+1}, \theta_t)}_{D}. \qquad (11)$$

In order to bound the term $D$ in (11), we use the assumption that $\Psi(\theta)$ is 1-strongly convex with respect to $\|\cdot\|_\mathcal{Q}$, which gives $B_\Psi(\theta_{t+1}, \theta_t) \ge \frac{1}{2}\|\theta_t - \theta_{t+1}\|_\mathcal{Q}^2$ and hence $D \le 0$. Using this bound and summing over all $T$ rounds, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}[\langle\gamma_t, \theta_t - \theta^*\rangle] \le \frac{\max_{\theta\in\mathcal{C}}\Psi(\theta)}{\eta T} + \eta\,O\left(L^2\|\mathcal{Q}\|_2^2 + \sigma^2\left(G_\mathcal{Q}^2 + \|\mathcal{Q}\|_2^2\right)\right). \qquad (12)$$
In the above we used the following property of the Bregman divergence: $B_\Psi(\theta^*, \theta_1) \le \max_{\theta\in\mathcal{C}}\Psi(\theta)$. We can prove this fact as follows. Let $\theta^\dagger = \arg\min_{\theta\in\mathcal{C}}\Psi(\theta)$. By the generalized Pythagorean theorem [Rak09, Chapter 2], it follows that $B_\Psi(\theta^*, \theta_1) \le B_\Psi(\theta^*, \theta^\dagger) - B_\Psi(\theta_1, \theta^\dagger) \le B_\Psi(\theta^*, \theta^\dagger)$, where the last inequality follows from the fact that the Bregman divergence is always non-negative. Now, since $\theta^\dagger$ minimizes $\Psi$ over $\mathcal{C}$ and $\Psi$ is convex, it follows that $\langle\nabla\Psi(\theta^\dagger), \theta^* - \theta^\dagger\rangle \ge 0$. This immediately implies $B_\Psi(\theta^*, \theta^\dagger) \le \Psi(\theta^*) \le \max_{\theta\in\mathcal{C}}\Psi(\theta)$.

Setting $T = \frac{\|\mathcal{Q}\|_2^2\,\epsilon^2 n^2}{\log^2(n/\delta)\left(\|\mathcal{Q}\|_2^2 + G_\mathcal{Q}^2\right)}$ and $\eta = \frac{\sqrt{\max_{\theta\in\mathcal{C}}\Psi(\theta)}}{L\|\mathcal{Q}\|_2\sqrt{T}}$, and using (2) and Claim 3.3, we get the required bound.

3.2 Instantiations of Private Mirror Descent for Various Settings of $\mathcal{C}$

In this section we discuss some instantiations of Theorem 3.2.
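As a concrete special case of Algorithm 1 before the instantiations below: with the Euclidean potential $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$, the Bregman divergence is the squared distance and line 4 reduces to a noisy projected gradient step. The sketch below runs this case on the unit $\ell_2$ ball with squared losses. It is illustrative only: the helper names and toy data are ours, $T$ and $\eta$ are fixed rather than tuned as in Theorem 3.2, and $\epsilon$ is deliberately large so the signal is visible at this tiny scale.

```python
import math
import random

random.seed(1)

def project_l2(theta, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        return theta
    return [radius * x / norm for x in theta]

def noisy_mirror_descent(data, p, lip, eps, delta, T=200, eta=0.05):
    """Algorithm 1 with Psi = 0.5 * ||.||_2^2, so line 4 becomes a noisy
    projected gradient step; returns the average of the iterates."""
    # Line 1 of Algorithm 1: sigma^2 = 32 L^2 T log^2(T/delta) / (n eps)^2.
    sigma = math.sqrt(32.0 * lip ** 2 * T * math.log(T / delta) ** 2
                      / (len(data) * eps) ** 2)
    theta = [0.0] * p
    avg = [0.0] * p
    for _ in range(T):
        # Gradient of the empirical squared loss (1/n) sum_i (<theta,x_i>-y_i)^2.
        grad = [0.0] * p
        for x, y in data:
            r = sum(t * xi for t, xi in zip(theta, x)) - y
            for j in range(p):
                grad[j] += 2.0 * r * x[j] / len(data)
        noisy = [g + random.gauss(0.0, sigma) for g in grad]
        theta = project_l2([t - eta * b for t, b in zip(theta, noisy)])
        avg = [a + t / T for a, t in zip(avg, theta)]
    return avg  # theta_priv

# n = 500 points whose empirical minimizer is (0.5, -0.5), inside the l2 ball.
data = [((1.0, 0.0), 0.5), ((0.0, 1.0), -0.5)] * 250
theta_priv = noisy_mirror_descent(data, p=2, lip=3.0, eps=5.0, delta=1e-3)
print(theta_priv)  # roughly (0.5, -0.5), perturbed by the privacy noise
```

With a non-Euclidean $\Psi$ (as in the polytope instantiation below), only the update in line 4 changes; the noise calibration is identical.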
For an arbitrary convex set $\mathcal{C} \subseteq \mathbb{R}^p$ with $\ell_2$-diameter $\|\mathcal{C}\|_2$: Let $\Psi(\theta) = \frac{1}{2}\|\theta - \theta_0\|_2^2$ (for some fixed $\theta_0 \in \mathcal{C}$), and choose the convex set $\mathcal{Q}$ to be the unit $\ell_2$-ball in Theorem 3.2. We immediately obtain the following corollary:
$$\mathbb{E}\left[\mathcal{L}(\theta_{priv}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{L\sqrt{p}\,\|\mathcal{C}\|_2\log(n/\delta)}{n\epsilon}\right). \qquad (13)$$
This is a slight improvement over [BST14].

For the convex set $\mathcal{C} \subseteq \mathbb{R}^p$ being a polytope: Let $\mathcal{C} = \mathrm{conv}\{v_1, \cdots, v_k\}$ be the convex hull of vectors $v_i \in \mathbb{R}^p$ such that $\|v_i\|_2 \le \|\mathcal{C}\|_2$ for all $i \in [k]$. Fact 3.4 will be very useful for choosing the correct potential function $\Psi$ in Algorithm $A_{Noise-MD}$ (Algorithm 1).

Fact 3.4 (From [SST11]). For the convex set $\mathcal{C}$ defined above, let $\mathcal{Q}$ be the convex hull of $\mathcal{C}$ and $-\mathcal{C}$. The Minkowski norm of any $\theta \in \mathbb{R}^p$ is given by
$$\|\theta\|_\mathcal{Q} = \inf\left\{\sum_{i=1}^k |\alpha_i| : \sum_{i=1}^k \alpha_i v_i = \theta\right\}.$$
Additionally, for any $q \in (1, 2]$, let
$$\|\theta\|_{\mathcal{Q},q} = \inf\left\{\left(\sum_{i=1}^k |\alpha_i|^q\right)^{1/q} : \sum_{i=1}^k \alpha_i v_i = \theta\right\}$$
be a norm. Then the function $\Psi(\theta) = \frac{1}{4(q-1)}\|\theta\|_{\mathcal{Q},q}^2$ is 1-strongly convex w.r.t. the $\|\cdot\|_{\mathcal{Q},q}$-norm.

We now state a claim that will be useful later.

Claim 3.5. If $q = \frac{\log k}{\log k - 1}$, then for any $\theta \in \mathbb{R}^p$: $\|\theta\|_\mathcal{Q} \le e \cdot \|\theta\|_{\mathcal{Q},q}$.

Proof. First notice that for any vector $v = (v_1, \cdots, v_k)$, $\|v\|_1 \le k^{1-1/q}\|v\|_q$; this follows from Hölder's inequality. Setting $q = \log k/(\log k - 1)$ gives $k^{1-1/q} = k^{1/\log k} = e$, so $\|v\|_1 \le e \cdot \|v\|_q$. For any $\theta \in \mathbb{R}^p$, let $a = (\alpha_1, \cdots, \alpha_k)$ be the vector of coefficients achieving $\|\theta\|_{\mathcal{Q},q}$. From the above, we know that $\|a\|_1 \le e \cdot \|a\|_q$, and by definition, $\|\theta\|_\mathcal{Q} \le \|a\|_1$. This completes the proof.

Claim 3.5 implies that if $\Psi(\theta) = \frac{1}{4(q-1)}\|\theta\|_{\mathcal{Q},q}^2$ and $q = \frac{\log k}{\log k - 1}$, then $\max_{\theta\in\mathcal{C}}\Psi(\theta) = O(\log k)$. Additionally, due to Fact 3.4, $\Psi(\theta)$ is $O(1)$-strongly convex w.r.t. $\|\cdot\|_\mathcal{Q}$.
With the above observations, and observing that $G_\mathcal{Q} = O(\|\mathcal{C}\|_2\sqrt{\log k})$, setting $\mathcal{Q}$ and $\Psi$ as above we immediately get the following corollary of Theorem 3.2. Notice that the bound has no explicit dependence on the dimensionality of the problem.
$$\mathbb{E}\left[\mathcal{L}(\theta_{priv}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{L\|\mathcal{C}\|_2\log k\,\log(n/\delta)}{n\epsilon}\right). \qquad (14)$$
Notice that this result extends to the standard $p$-dimensional probability simplex $\mathcal{C} = \{\theta \in \mathbb{R}^p : \sum_{i=1}^p \theta_i = 1,\ \theta_i \ge 0\ \forall i \in [p]\}$. In this case, the only difference is that the term $\log k$ gets replaced by $\log p$ in (14). We remark that applying the standard approaches of [JT14, ST13b] provides a similar bound only in the case of linear loss functions.

For the grouped $\ell_1$-norm: For a vector $\theta \in \mathbb{R}^p$ and a parameter $k$, the grouped $\ell_1$-norm is defined as
$$\|\theta\|_{(k,\ell_{1,2})} = \sum_{i=1}^{\lceil p/k\rceil}\sqrt{\sum_{j=(i-1)k+1}^{\min\{i\cdot k,\,p\}} |\theta_j|^2}.$$
If $\mathcal{C}$ denotes the convex set centered at zero with radius one with respect to the $\|\cdot\|_{(k,\ell_{1,2})}$-norm, then it follows from a union bound over the blocks of coordinates in $[p]$ that $G_\mathcal{C} = \sqrt{k\log(p/k)}$. In the following we propose choices of $\Psi$ depending on the parameter $k$ (based on [BTN13, Section 5.3.3]). For a given $M > 1$, divide the coordinates of $\theta$ into $M$ blocks, and denote each block $\theta^{(j)}$:
$$\Psi(\theta) = \frac{1}{M\xi}\left(\sum_{j=1}^M \left\|\theta^{(j)}\right\|_2^M\right)^{2/M}, \qquad M = \begin{cases}2 & \text{if } \lceil p/k\rceil \le 2,\\ 1 + \frac{1}{\log(p/k)} & \text{otherwise},\end{cases} \qquad \xi = \begin{cases}1 & \text{if } \lceil p/k\rceil = 1,\\ 1/2 & \text{if } \lceil p/k\rceil = 2,\\ \frac{1}{e\log(p/k)} & \text{otherwise}.\end{cases}$$
With this setting of $\Psi(\theta)$, one can show that $\max_{\theta\in\mathcal{C}}\Psi(\theta) = O(\log(p/k))$. Plugging these bounds into Theorem 3.2, we get (15) as a corollary:
$$\mathbb{E}\left[\mathcal{L}(\theta_{priv}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{L\sqrt{k\log^2(p/k)}\,\log(n/\delta)}{n\epsilon}\right). \qquad (15)$$
Similar bounds can be achieved for other forms of interpolation norms, e.g., the $(\ell_1, \ell_2)$-interpolation norm $\|\theta\|_{\alpha,\mathrm{inter}(\ell_1,\ell_2)} = (1-\alpha)\|\theta\|_1 + \alpha\|\theta\|_2$ with $\alpha \in [0, 1]$.
Notice that since the set $\mathcal{C} = \{\theta: \|\theta\|_{\alpha,\mathrm{inter}(\ell_1,\ell_2)} \le 1\}$ is a subset of $\mathcal{C}_1 + \mathcal{C}_2$, where $\mathcal{C}_1 = \{(1-\alpha)\theta: \|\theta\|_1\le 1\}$ and $\mathcal{C}_2 = \{\alpha\theta: \|\theta\|_2\le 1\}$, it follows that the Gaussian width satisfies $G_{\mathcal{C}} \le G_{\mathcal{C}_1} + G_{\mathcal{C}_2} = O\left((1-\alpha)\sqrt{\log p} + \alpha\sqrt{p}\right)$. Additionally, it follows from [SST11] that there exists a strongly convex function $\Psi(\theta)$ w.r.t. $\|\cdot\|_{\mathcal{C}}$ that is $O(1)$ for $\theta\in\mathcal{C}$. When using Theorem 3.2 in both of the above settings, we set the convex set $\mathcal{Q} = \mathcal{C}$.

For low-rank matrices: It is known that non-private mirror descent extends immediately to matrices [BTN13]. In the following we show that this is also true for the private mirror descent algorithm of Algorithm 1 ($\mathcal{A}_{\text{Noise-MD}}$). For the matrix setting, we assume $\theta\in\mathbb{R}^{p\times p}$ and that the loss function $\ell(\theta; d)$ is $L$-Lipschitz in the Frobenius norm $\|\cdot\|_F$. From [DTTZ14] it follows that if the noise vector $b$ in Algorithm $\mathcal{A}_{\text{Noise-MD}}$ is replaced by a matrix $b\in\mathbb{R}^{p\times p}$ with each entry drawn i.i.d. from $\mathcal{N}(0,\sigma^2)$ (with the standard deviation $\sigma$ the same as in Algorithm $\mathcal{A}_{\text{Noise-MD}}$), then the $(\epsilon,\delta)$-differential privacy guarantee holds. In the following we instantiate Theorem 3.2 for the class of $m\times m$ real matrices with nuclear norm at most one; call this set $\mathcal{C}$. (For a matrix $\theta$, $\|\theta\|_{\mathrm{nuc}}$ refers to the sum of the singular values of $\theta$.) This class is the convex hull of the rank-one matrices with unit Euclidean norm. [CRPW12, Proposition 3.11] shows that the Gaussian width of $\mathcal{C}$ is $O(\sqrt{m})$. [BTN13, Section 5.2.3] showed that the function $\Psi(\theta) = \frac{4\sqrt{e}\log(2m)}{2^{q}(1+q)}\sum_{i=1}^{m}\sigma_i^{1+q}(\theta)$ with $q = \frac{1}{2\log(2m)}$ is $1$-strongly convex w.r.t. the $\|\cdot\|_{\mathrm{nuc}}$-norm. Moreover, $\max_{\theta\in\mathcal{C}}\Psi(\theta) = O(\log m)$. Plugging these bounds into Theorem 3.2, we immediately get the following excess empirical risk guarantee.
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{L\sqrt{m}\log m\,\log(n/\delta)}{n\epsilon}\right). \tag{16}$$
3.3 Convergence Rate of Noisy Mirror Descent

In this section we analyze the excess empirical risk guarantees of Algorithm 1 (Algorithm $\mathcal{A}_{\text{Noise-MD}}$) as a purely noisy mirror descent algorithm, ignoring privacy considerations. Assume that the oracle returning the gradient computation is noisy: each $b_t$ (in Line 4 of Algorithm $\mathcal{A}_{\text{Noise-MD}}$) is drawn independently from a mean-zero sub-Gaussian distribution with covariance matrix $\Sigma\in\mathbb{R}^{p\times p}$. For example, this may be achieved by sampling a small number of the $d_i$'s and averaging $\nabla\ell(\theta_t; d_i)$ over the sampled values. Using the same proof technique as Theorem 3.2, together with the observation that $\mathbb{E}_{b\sim\mathcal{N}(0,\mathbb{I}_p)}\left[\max_{\theta\in\mathcal{C}}\langle\sqrt{\Sigma}\cdot b, \theta\rangle\right] = O\left(\sqrt{\lambda_{\max}(\Sigma)}\,G_{\mathcal{C}}\right)$, we obtain the following corollary of Theorem 3.2. Here $\lambda_{\max}$ denotes the maximum eigenvalue, and we set the convex set $\mathcal{Q} = \mathcal{C}$ in Theorem 3.2 for ease of exposition.

Corollary 3.6 (Noisy mirror descent guarantee). Let $\mathcal{C}\subseteq\mathbb{R}^p$ be a symmetric convex set with $\ell_2$-diameter $\|\mathcal{C}\|_2$ and Gaussian width $G_{\mathcal{C}}$, and let $\Psi:\mathcal{C}\to\mathbb{R}$ be a $1$-strongly convex function w.r.t. the $\|\cdot\|_{\mathcal{C}}$-norm, chosen in Algorithm $\mathcal{A}_{\text{Noise-MD}}$ (Algorithm 1). For any $d\in\mathcal{D}$, suppose that the loss function $\ell(\theta; d)$ is convex and $L$-Lipschitz with respect to the $\ell_2$-norm. If for all $t\in[T+1]$,
$$\eta_t = \eta = \frac{\sqrt{\max_{\theta\in\mathcal{C}}\Psi(\theta)}}{\sqrt{T\left(L^2\|\mathcal{C}\|_2^2 + \lambda_{\max}(\Sigma)\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)\right)}},$$
then the following is true.
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{alg}}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{\sqrt{\max_{\theta\in\mathcal{C}}\Psi(\theta)}\,\sqrt{L^2\|\mathcal{C}\|_2^2 + \lambda_{\max}(\Sigma)\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)}}{\sqrt{T}}\right).$$
Here the expectation is over the randomness of the algorithm.

Corollary 3.6 improves on the bound obtained in the noisy gradient descent literature [SZ13, Theorem 2] as long as the noise follows the mean-zero sub-Gaussian distribution described above and the potential function $\Psi$ exists.
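To make Corollary 3.6 concrete, here is a minimal simulation of mirror descent with noisy gradients (assuming NumPy). With $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$ the mirror step reduces to a projected gradient step, so this is the simplest instance of the setting above with privacy ignored; all names and parameter choices are illustrative, not from the paper.

```python
import numpy as np

def noisy_mirror_descent_l2(grad, theta0, radius, T, eta, sigma, rng):
    """Mirror descent with Psi = 0.5*||theta||^2 (i.e. projected gradient
    descent over an l2-ball) under mean-zero Gaussian gradient noise;
    returns the average iterate."""
    theta = theta0.copy()
    iterates = []
    for _ in range(T):
        g = grad(theta) + sigma * rng.standard_normal(theta.shape)
        theta = theta - eta * g
        nrm = np.linalg.norm(theta)
        if nrm > radius:                 # project back onto C
            theta *= radius / nrm
        iterates.append(theta.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(1)
target = np.array([0.3, -0.2, 0.1])
grad = lambda th: th - target            # gradient of 0.5*||theta - target||^2
theta_out = noisy_mirror_descent_l2(grad, np.zeros(3), 1.0, 2000, 0.05, 0.1, rng)
assert np.linalg.norm(theta_out - target) < 0.1
```

Averaging the iterates is what suppresses the zero-mean gradient noise; a single final iterate would fluctuate at the noise scale.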
In particular, it improves the dependence on dimension by removing any explicit dependence on $p$. For different settings of $\Psi$ depending on the convex set $\mathcal{C}$, see Section 3.2.

4 Objective Perturbation for Smooth Functions

In this section we show that if the loss function $\ell$ is twice continuously differentiable, then one can recover bounds similar to those of Section 3 using the objective perturbation algorithm of [CMS11, KST12]. The main contribution of this section is a tighter analysis of objective perturbation using the Gaussian width. In the following (Algorithm 2) we first revisit the objective perturbation algorithm. The $(\epsilon,\delta)$-differential privacy guarantee follows from [KST12]. Theorem 4.1 gives privacy risk bounds similar to those in Section 3.

Remark 2. The smoothness of the loss function $\ell$ is used in the privacy analysis, and it can be shown that this is in some sense necessary (see [KST12] for a more detailed discussion). Standard approaches to smoothing (such as convolving with a smooth function) adversely affect the utility guarantee and result in a sub-optimal dependence on the number of data samples $n$ (see [BST14, Appendix E]).

Remark 3. We define the set $\mathcal{Q}$ as in Theorem 4.1 because we want to both symmetrize and extend the convex set $\mathcal{C}$ to a full-dimensional space. For example, think of the probability simplex in $p$ dimensions as the set $\mathcal{C}$, and $\mathcal{Q}$ as the $\ell_1$-ball. Also, when there exists a differentiable convex function $\Psi:\mathcal{C}\to\mathbb{R}$ such that $\Psi$ is $1$-strongly convex w.r.t. $\|\cdot\|_{\mathcal{Q}}$ and the guarantee in Theorem A.1 holds w.r.t. $\Psi$, then Theorem 4.1 is a special case of Theorem A.1. This in particular captures the following cases: i) $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$ (and correspondingly $\mathcal{Q}$ being the $\ell_2$-ball), and ii) $\Psi(\theta) = \sum_{i=1}^{p}\theta_i\log\theta_i$ (and correspondingly $\mathcal{Q}$ being the $\ell_1$-ball).
Algorithm 2 Objective Perturbation [KST12]
Input: data set $D = \{d_1,\dots,d_n\}$, loss function $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; d_i)$ (with $\ell_2$-Lipschitz constant $L$ for $\ell$), privacy parameters $(\epsilon,\delta)$, convex set $\mathcal{C}$ (denote the diameter in $\ell_2$-norm by $\|\mathcal{C}\|_2$), upper and lower bounds $\lambda_{\max}$, $\lambda_{\min}$ on the eigenvalues of $\nabla^2\ell(\theta; d)$ (for all $d$ and for all $\theta\in\mathcal{C}$).
1: Set $\zeta = \max\left\{\frac{2\lambda_{\max}}{n\epsilon} - \min_{\theta\in\mathcal{C},\, d\in D}\lambda_{\min}\left(\nabla^2\ell(\theta; d)\right),\ 0\right\}$.
2: Output $\theta_{\text{priv}} \leftarrow \arg\min_{\theta\in\mathcal{C}}\ \mathcal{L}(\theta; D) + \frac{\zeta}{2}\|\theta - \theta_0\|_2^2 + \langle b, \theta\rangle$, where $b\sim\mathcal{N}\left(0, \frac{2L^2\log(1/\delta)}{(n\epsilon)^2}\,\mathbb{I}_{p\times p}\right)$ and $\theta_0\in\mathcal{C}$ is fixed.

Theorem 4.1 (Utility guarantee). Suppose that $\mathcal{C}\subseteq\mathbb{R}^p$ has diameter $\|\mathcal{C}\|_2$ and Gaussian width $G_{\mathcal{C}}$. Further suppose that for all $d\in D$, the loss function $\ell(\cdot; d)$ is twice continuously differentiable, and for all $\theta\in\mathcal{C}$, $\nabla^2\ell(\theta; d)$ has spectral norm at most $\lambda_{\max}$. Then Algorithm 2 satisfies the following guarantees.
1. Lipschitz case: Suppose that for any $d\in D$, the loss function $\ell(\cdot; d)$ is convex and $L$-Lipschitz w.r.t. the $\ell_2$-norm. Then
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{LG_{\mathcal{C}}\sqrt{\log(1/\delta)} + \lambda_{\max}\|\mathcal{C}\|_2^2}{n\epsilon}\right).$$
2. Lipschitz and strongly convex case: Suppose that for any $d\in D$, the loss function $\ell(\cdot; d)$ is $L$-Lipschitz in the $\ell_2$-norm and $\Delta$-strongly convex with respect to $\|\cdot\|_{\mathcal{Q}}$, where $\mathcal{Q}$ is the symmetric convex hull of $\mathcal{C}$. If $\Delta \ge \frac{2\|\mathcal{C}\|_2^2\lambda_{\max}}{n\epsilon}$, then the following is true.
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{(LG_{\mathcal{C}})^2\log(1/\delta)}{\Delta(n\epsilon)^2}\right).$$

Proof. For ease of notation, we drop the dependence on the data set $D$ and write the loss function $\mathcal{L}(\theta; D)$ as $\mathcal{L}(\theta)$. Let $J(\theta) = \mathcal{L}(\theta) + \frac{\zeta}{2}\|\theta - \theta_0\|_2^2$ and let $J^{\text{priv}}(\theta) = J(\theta) + \langle b, \theta\rangle$. Also let $\hat\theta = \arg\min_{\theta\in\mathcal{C}} J(\theta)$. We denote the variance of the noise in Algorithm 2 by $\sigma^2 = \frac{2L^2\log(1/\delta)}{(n\epsilon)^2}$.

Case 1 (loss function $\mathcal{L}$ is Lipschitz): By the optimality of $\theta_{\text{priv}}$, the following is true.
$$J^{\text{priv}}(\hat\theta) \ge J^{\text{priv}}(\theta_{\text{priv}}) \;\Leftrightarrow\; J(\hat\theta) + \langle b, \hat\theta\rangle \ge J(\theta_{\text{priv}}) + \langle b, \theta_{\text{priv}}\rangle \;\Leftrightarrow\; J(\theta_{\text{priv}}) - J(\hat\theta) \le \langle b, \hat\theta - \theta_{\text{priv}}\rangle$$
$$\Rightarrow\; \mathbb{E}\left[J(\theta_{\text{priv}}) - J(\hat\theta)\right] = O\left(\frac{LG_{\mathcal{C}}\sqrt{\log(1/\delta)}}{n\epsilon}\right). \tag{17}$$
The last equality follows from the definition of the Gaussian width and the variance of the noise vector $b$. Let $\theta^* = \arg\min_{\theta\in\mathcal{C}}\mathcal{L}(\theta)$. From (17), the definition of $J(\theta)$, and the fact that $\hat\theta$ minimizes $J(\theta)$, the following is true.
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}) - \mathcal{L}(\theta^*)\right] = \mathbb{E}\left[J(\theta_{\text{priv}}) - J(\theta^*)\right] + \frac{\zeta}{2}\|\theta^* - \theta_0\|_2^2 - \frac{\zeta}{2}\mathbb{E}\left[\|\theta_{\text{priv}} - \theta_0\|_2^2\right] \le \mathbb{E}\left[J(\theta_{\text{priv}}) - J(\hat\theta)\right] + \frac{\zeta}{2}\|\theta^* - \theta_0\|_2^2 = O\left(\frac{LG_{\mathcal{C}}\sqrt{\log(1/\delta)} + \lambda_{\max}\|\mathcal{C}\|_2^2}{n\epsilon}\right). \tag{18}$$

Case 2 (loss function $\mathcal{L}$ is Lipschitz and strongly convex): First notice that by the definition of the Minkowski norm, for any vector $v\in\mathcal{C}$, $\|v\|_{\mathcal{Q}} \ge \|v\|_2/\|\mathcal{C}\|_2$. This implies that if $\mathcal{L}$ is $\Delta$-strongly convex w.r.t. the $\|\cdot\|_{\mathcal{Q}}$-norm, then it is $\Delta/\|\mathcal{C}\|_2^2$-strongly convex w.r.t. the $\|\cdot\|_2$-norm. Hence, with the lower bound on $\Delta$ satisfied, $\zeta$ in Algorithm 2 is always zero. By the definition of strong convexity of $\mathcal{L}$, the following is true.
$$\mathcal{L}(\theta^*) \ge \mathcal{L}(\theta_{\text{priv}}) + \frac{\Delta}{2}\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}}^2$$
$$\Leftrightarrow\; \mathcal{L}(\theta^*) + \langle b,\theta^*\rangle - \langle b,\theta^*\rangle \ge \mathcal{L}(\theta_{\text{priv}}) + \langle b,\theta_{\text{priv}}\rangle - \langle b,\theta_{\text{priv}}\rangle + \frac{\Delta}{2}\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}}^2$$
$$\Rightarrow\; \langle b, \theta_{\text{priv}} - \theta^*\rangle \ge \frac{\Delta}{2}\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}}^2 \;\Rightarrow\; \left\langle b, \frac{\theta_{\text{priv}} - \theta^*}{\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}}}\right\rangle \ge \frac{\Delta}{2}\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}}$$
$$\Rightarrow\; \max_{v\in\mathcal{Q}}\langle b, v\rangle \ge \frac{\Delta}{2}\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}} \;\Rightarrow\; \|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}} \le \frac{2\max_{v\in\mathcal{Q}}\langle b,v\rangle}{\Delta} = \frac{2\|b\|_{\mathcal{Q}^*}}{\Delta}. \tag{19}$$
In the above we have used the fact that $\mathcal{L}(\theta^*) + \langle b,\theta^*\rangle \le \mathcal{L}(\theta_{\text{priv}}) + \langle b,\theta_{\text{priv}}\rangle$ (due to the optimality condition, since $\zeta = 0$). Using (19) we get the following.
$$\mathcal{L}(\theta^*) + \langle b,\theta^*\rangle \ge \mathcal{L}(\theta_{\text{priv}}) + \langle b,\theta_{\text{priv}}\rangle \;\Rightarrow\; \mathcal{L}(\theta_{\text{priv}}) - \mathcal{L}(\theta^*) \le \|b\|_{\mathcal{Q}^*}\cdot\|\theta_{\text{priv}} - \theta^*\|_{\mathcal{Q}} \le \frac{2\|b\|_{\mathcal{Q}^*}^2}{\Delta}$$
$$\Rightarrow\; \mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}) - \mathcal{L}(\theta^*)\right] = O\left(\frac{\sigma^2 G_{\mathcal{Q}}^2}{\Delta}\right) = O\left(\frac{(LG_{\mathcal{C}})^2\log(1/\delta)}{\Delta(n\epsilon)^2}\right).$$
This completes the proof. In the last step we used the fact that $G_{\mathcal{Q}} = \Theta(G_{\mathcal{C}})$.
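A minimal numerical sketch of Algorithm 2 (assuming NumPy). The inner arg-min is solved here by plain projected gradient descent over an $\ell_2$-ball, which is an implementation choice of ours, not part of the algorithm's specification; the noise scale matches the $\sigma^2$ used in the proof above, and the toy loss and all parameter values are illustrative.

```python
import numpy as np

def objective_perturbation(loss_grad, n, L, eps, delta, dim, rng,
                           zeta=0.0, theta0=None, radius=1.0,
                           steps=4000, lr=0.05):
    """Sketch of objective perturbation: draw a Gaussian vector b, then
    minimize L(theta) + (zeta/2)*||theta - theta0||^2 + <b, theta> over the
    l2-ball of the given radius (here via projected gradient descent)."""
    if theta0 is None:
        theta0 = np.zeros(dim)
    sigma = L * np.sqrt(2 * np.log(1 / delta)) / (n * eps)
    b = rng.normal(0.0, sigma, size=dim)   # the perturbation term
    theta = theta0.copy()
    for _ in range(steps):
        g = loss_grad(theta) + zeta * (theta - theta0) + b
        theta = theta - lr * g
        nrm = np.linalg.norm(theta)
        if nrm > radius:
            theta *= radius / nrm
    return theta

rng = np.random.default_rng(2)
target = np.array([0.2, -0.4, 0.1])
loss_grad = lambda th: th - target       # gradient of 0.5*||theta - target||^2
theta_priv = objective_perturbation(loss_grad, n=1000, L=2.0, eps=1.0,
                                    delta=1e-6, dim=3, rng=rng)
assert np.linalg.norm(theta_priv - target) < 0.15
```

With $n = 1000$ the perturbation $b$ is tiny, so the output lands close to the unperturbed minimizer, illustrating the $O(1/(n\epsilon))$ scale of the added error.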
5 Private Convex Optimization by the Frank-Wolfe Algorithm

The algorithms in the previous sections work best when the objective function is Lipschitz with respect to the $\ell_2$-norm. But in many machine learning tasks, especially those with a sparsity constraint, the objective function is often Lipschitz with respect to the $\ell_1$-norm. For example, in the high-dimensional linear regression setting, e.g. the classical LASSO algorithm [Tib96], we would like to compute
$$\arg\min_{\theta:\ \|\theta\|_1\le s}\ \frac{1}{n}\|X\theta - y\|_2^2.$$
In the usual case where $|x_{ij}|, |y_j| = O(1)$, the loss $\mathcal{L}(\theta) = \frac{1}{n}\|X\theta - y\|_2^2$ is $O(1)$-Lipschitz with respect to the $\ell_1$-norm but $O(\sqrt{p})$-Lipschitz with respect to the $\ell_2$-norm, so applying private mirror descent would result in a fairly loose bound. In this section, we show that in these cases it is more effective to use a private version of the classical Frank-Wolfe algorithm. In particular, we show that for LASSO such an algorithm achieves the nearly optimal privacy risk of $\tilde O(1/n^{2/3})$.

5.1 Frank-Wolfe algorithm

The Frank-Wolfe algorithm [FW56] can be regarded as a "greedy" algorithm which moves towards the optimum of the first-order approximation (see Algorithm 3 for the description). How fast the Frank-Wolfe algorithm converges depends on $\mathcal{L}$'s "curvature", defined as follows according to [Cla10, Jag13]. We remark that a $\beta$-smooth function on $\mathcal{C}$ has curvature constant bounded by $\beta\|\mathcal{C}\|_2^2$.

Definition 5.1 (Curvature constant). For $\mathcal{L}:\mathcal{C}\to\mathbb{R}$, define $\Gamma_{\mathcal{L}}$ as below.
$$\Gamma_{\mathcal{L}} := \sup_{\substack{\theta_1,\theta_2\in\mathcal{C},\ \gamma\in(0,1],\\ \theta_3 = \theta_1 + \gamma(\theta_2-\theta_1)}} \frac{2}{\gamma^2}\left(\mathcal{L}(\theta_3) - \mathcal{L}(\theta_1) - \langle\theta_3 - \theta_1,\ \nabla\mathcal{L}(\theta_1)\rangle\right).$$

Remark 4. One can show ([Cla10, Jag13]) that for any $q, r \ge 1$ such that $q^{-1} + r^{-1} = 1$, $\Gamma_{\mathcal{L}}$ is upper bounded by $\lambda\|\mathcal{C}\|_q^2$, where $\lambda = \max_{\theta\in\mathcal{C},\ \|v\|_q = 1}\|\nabla^2\mathcal{L}(\theta)\cdot v\|_r$.

Remark 5. One useful bound is for quadratic programming, $\mathcal{L}(\theta) = \theta^\top X^\top X\theta + \langle b, \theta\rangle$. In this case, by [Cla10], $\Gamma_{\mathcal{L}} \le \max_{a,b\in X\cdot\mathcal{C}}\|a - b\|_2^2$.
When $\mathcal{C}$ is centrally symmetric, we have the bound $\Gamma_{\mathcal{L}} \le 4\max_{\theta\in\mathcal{C}}\|X\theta\|_2^2$.

Algorithm 3 Frank-Wolfe algorithm
Input: $\mathcal{C}\subseteq\mathbb{R}^p$, $\mathcal{L}:\mathcal{C}\to\mathbb{R}$, $\mu$
1: Choose an arbitrary $\theta_1$ from $\mathcal{C}$;
2: for $t = 1$ to $T-1$ do
3: Compute $\tilde\theta_t = \arg\min_{\theta\in\mathcal{C}}\langle\nabla\mathcal{L}(\theta_t),\ \theta - \theta_t\rangle$;
4: Set $\theta_{t+1} = \theta_t + \mu(\tilde\theta_t - \theta_t)$;
5: return $\theta_T$.

Define $\theta^* = \arg\min_{\theta\in\mathcal{C}}\mathcal{L}(\theta)$. The following theorem shows the convergence of the Frank-Wolfe algorithm.

Theorem 5.2 ([Cla10, Jag13]). If we set $\mu = 1/T$, then $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^*) = O(\Gamma_{\mathcal{L}}/T)$.

While the Frank-Wolfe algorithm does not necessarily provide faster convergence than gradient-descent based methods, it has two major advantages. First, on Line 3, it reduces the problem to minimizing a linear function. When $\mathcal{C}$ is defined by a small number of vertices, e.g. when $\mathcal{C}$ is an $\ell_1$-ball, the minimization can be done efficiently by checking $\langle\nabla\mathcal{L}(\theta_t), x\rangle$ for each vertex $x$ of $\mathcal{C}$. Second, each step of Frank-Wolfe takes a convex combination of $\theta_t$ and $\tilde\theta_t$, which lies on the boundary of $\mathcal{C}$. Hence each intermediate solution is always inside $\mathcal{C}$ (the method is sometimes called projection free), and the final outcome $\theta_T$ is a convex combination of up to $T$ points on the boundary of $\mathcal{C}$ (or vertices of $\mathcal{C}$ when $\mathcal{C}$ is a polytope). Such an outcome may be desirable, for example when $\mathcal{C}$ is a polytope, as it corresponds to a sparse solution. For these reasons the Frank-Wolfe algorithm has found many applications in machine learning [SSSZ10, HK12, Cla10]. As we shall see below, these properties are also useful for obtaining low risk bounds for its private version.

5.2 Private Frank-Wolfe Algorithm

We now present a private version of the Frank-Wolfe algorithm. We can achieve privacy by replacing Line 3 in Algorithm 3 with its private version in one of two ways.
In the first variant, we apply the exponential mechanism [MT07] to guarantee privacy; in the second variant, we apply objective perturbation. The first variant works especially well when $\mathcal{C}$ is a polytope defined by polynomially many vertices. In this case, we show that the error depends on the $\ell_1$-Lipschitz constant, which can be much smaller than the $\ell_2$-Lipschitz constant. In particular, the private Frank-Wolfe algorithm is nearly optimal for the important high-dimensional sparse linear regression (or compressive sensing) problem. The second variant applies to a general convex set $\mathcal{C}$; in this case, we show that the risk depends on the Gaussian width of $\mathcal{C}$. The details are in Appendix B.

Algorithm 4 describes the private version of the Frank-Wolfe algorithm for the polytope case, i.e. when $\mathcal{C}$ is the convex hull of a finite set $S$ of vertices (or corners). In this case, we know that any linear function is minimized at one of the points of $S$, per the following basic fact.

Fact 5.3. Let $\mathcal{C}\subseteq\mathbb{R}^p$ be the convex hull of a compact set $S\subseteq\mathbb{R}^p$. For any vector $v\in\mathbb{R}^p$, $\arg\min_{\theta\in\mathcal{C}}\langle\theta, v\rangle \cap S \ne \emptyset$.

Since $\tilde\theta_t$ can be selected as one of the $|S|$ vertices, by applying the exponential mechanism [MT07] we obtain a differentially private algorithm with risk logarithmically dependent on $|S|$. When $|S|$ is polynomial in $p$, this leads to an error bound with a $\log p$ dependence. While the exponential mechanism can be applied to general $\mathcal{C}$ as well, its error would depend on the size of a cover of the boundary of $\mathcal{C}$, which can be exponential in $p$, leading to an error bound with polynomial dependence on $p$. Hence for a general convex set $\mathcal{C}$, in $\mathcal{A}_{\text{Noise-FW(Gen-convex)}}$ (Algorithm 5 in Appendix B) we use objective perturbation instead and obtain an error depending on the Gaussian width of $\mathcal{C}$.

Theorem 5.4 (Privacy guarantee). Algorithm 4 is $(\epsilon,\delta)$-differentially private.
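A minimal sketch of the first variant, specialized to the unit $\ell_1$-ball whose $2p$ corners are $\pm e_i$ (assuming NumPy). Report-noisy-max with Laplace noise on the vertex scores stands in for the exponential mechanism here; the gradient is treated as a black-box function of $\theta$, so the per-record sensitivity accounting behind the noise scale is mirrored, not enforced. All names and parameters are illustrative.

```python
import numpy as np

def private_frank_wolfe_l1(grad, dim, n, L1, eps, delta, T, rng):
    """Sketch of the polytope variant over the unit l1-ball: each round adds
    Laplace noise to the 2*dim vertex scores <s, grad> for s = +/- e_i, takes
    the noisy minimizer, then performs a Frank-Wolfe step."""
    scale = L1 * np.sqrt(8 * T * np.log(1 / delta)) / (n * eps)  # ||C||_1 = 1
    theta = np.zeros(dim)
    mu = 1.0 / (T + 2)
    for _ in range(T - 1):
        g = grad(theta)
        scores = np.concatenate([g, -g])       # score of +e_i is g_i, of -e_i is -g_i
        noisy = scores + rng.laplace(0.0, scale, size=2 * dim)
        j = int(np.argmin(noisy))              # noisy minimizer over the corners
        vertex = np.zeros(dim)
        vertex[j % dim] = 1.0 if j < dim else -1.0
        theta = (1 - mu) * theta + mu * vertex
    return theta

rng = np.random.default_rng(5)
y = np.array([0.6, -0.2, 0.0, 0.0, 0.0])
grad = lambda th: th - y                       # gradient of 0.5*||theta - y||^2
theta_fw = private_frank_wolfe_l1(grad, 5, n=10**6, L1=2.0, eps=1.0,
                                  delta=1e-6, T=200, rng=rng)
assert np.sum(np.abs(theta_fw)) <= 1 + 1e-9    # iterates never leave the l1-ball
```

On this toy quadratic with $n$ large, the Laplace noise is negligible and every iterate stays inside $\mathcal{C}$, illustrating the projection-free property discussed in Section 5.1.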
The proof of privacy follows from a straightforward use of the exponential mechanism [MT07, BLST10] (the noisy-maximum version from [BLST10, Theorem 5]) and the strong composition theorem [DRV10]. In Theorem 5.5 we prove the utility guarantee for the private Frank-Wolfe algorithm in the convex polytope case. Define $\Gamma_{\mathcal{L}} = \max_{D}\Gamma_{\mathcal{L}(\cdot;D)}$, the maximum curvature constant over all possible data sets.

Theorem 5.5 (Utility guarantee). Let $L_1$, $S$ and $\|\mathcal{C}\|_1$ be defined as in Algorithm 4 (Algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$). Let $\Gamma_{\mathcal{L}}$ be an upper bound on the curvature constant (defined in Definition 5.1) of the loss function $\mathcal{L}(\cdot; d)$ that holds for all $d\in\mathcal{D}$. In Algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$, if we set $T = \frac{\Gamma_{\mathcal{L}}^{2/3}(n\epsilon)^{2/3}}{(L_1\|\mathcal{C}\|_1)^{2/3}}$, then
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}; D)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D) = O\left(\frac{\Gamma_{\mathcal{L}}^{1/3}(L_1\|\mathcal{C}\|_1)^{2/3}\log(n|S|)\sqrt{\log(1/\delta)}}{(n\epsilon)^{2/3}}\right).$$
Here the expectation is over the randomness of the algorithm.

Algorithm 4 $\mathcal{A}_{\text{Noise-FW(polytope)}}$: Differentially Private Frank-Wolfe Algorithm (Polytope Case)
Input: data set $D = \{d_1,\dots,d_n\}$, loss function $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; d_i)$ (with $\ell_1$-Lipschitz constant $L_1$ for $\ell$), privacy parameters $(\epsilon,\delta)$, convex set $\mathcal{C} = \mathrm{conv}(S)$, with $\|\mathcal{C}\|_1$ denoting $\max_{s\in S}\|s\|_1$ and $S$ being the set of corners.
1: Choose an arbitrary $\theta_1$ from $\mathcal{C}$;
2: for $t = 1$ to $T-1$ do
3: $\forall s\in S$, $\alpha_s \leftarrow \langle s, \nabla\mathcal{L}(\theta_t; D)\rangle + \mathrm{Lap}\left(\frac{L_1\|\mathcal{C}\|_1\sqrt{8T\log(1/\delta)}}{n\epsilon}\right)$, where $\mathrm{Lap}(\lambda)$ has density $\frac{1}{2\lambda}e^{-|x|/\lambda}$.
4: $\tilde\theta_t \leftarrow \arg\min_{s\in S}\alpha_s$.
5: $\theta_{t+1} \leftarrow (1-\mu)\theta_t + \mu\tilde\theta_t$, where $\mu = \frac{1}{T+2}$.
6: Output $\theta_{\text{priv}} = \theta_T$.

Proof. For ease of notation we hide the dependence of $\mathcal{L}$ on the data set $D$ and write it simply as $\mathcal{L}(\theta)$. To prove the utility guarantee we first invoke the utility guarantee of the non-private noisy Frank-Wolfe algorithm from [Jag13, Theorem 1].

Theorem 5.6 (Non-private utility guarantee [Jag13]). Assume the conditions in Theorem 5.5 and let $\beta > 0$ be fixed.
Recall that $\mu = 1/(T+2)$ and let $\phi_1\in\mathcal{C}$. Suppose that $\langle s_1,\dots,s_T\rangle$ is a sequence of vectors from $\mathcal{C}$, with $\phi_{t+1} = (1-\mu)\phi_t + \mu s_t$, such that for all $t\in[T]$,
$$\langle s_t, \nabla\mathcal{L}(\phi_t)\rangle \le \min_{s\in\mathcal{C}}\langle s, \nabla\mathcal{L}(\phi_t)\rangle + \frac{1}{2}\beta\mu\Gamma_{\mathcal{L}}.$$
Then,
$$\mathcal{L}(\phi_T) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta) \le \frac{2\Gamma_{\mathcal{L}}}{T+2}(1+\beta).$$

Since the convex set $\mathcal{C}$ is a polytope with corners in $S$, if $s_t$ in Theorem 5.6 corresponds to $\tilde\theta_t$ in Algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$ and $\phi_t$ corresponds to $\theta_t$, then using the tail properties of the Laplace distribution and Fact 5.3 one can show that with probability at least $1-\zeta$, the term $\beta$ in Theorem 5.6 is at most
$$O\left(\frac{L_1\|\mathcal{C}\|_1\sqrt{8T\log(1/\delta)}\,\log(|S|T/\zeta)}{\mu\, n\epsilon\,\Gamma_{\mathcal{L}}}\right).$$
Plugging this bound into Theorem 5.6, we immediately get that with probability at least $1-\zeta$,
$$\mathcal{L}(\theta_T) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta) = O\left(\frac{\Gamma_{\mathcal{L}}}{T} + \frac{L_1\|\mathcal{C}\|_1\sqrt{8T\log(1/\delta)}\,\log(|S|T/\zeta)}{n\epsilon}\right). \tag{20}$$
From (20) we can conclude the following in expectation.
$$\mathbb{E}\left[\mathcal{L}(\theta_T)\right] - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta) = O\left(\frac{\Gamma_{\mathcal{L}}}{T} + \frac{L_1\|\mathcal{C}\|_1\sqrt{8T\log(1/\delta)}\,\log(T L_1\|\mathcal{C}\|_1\cdot|S|)}{n\epsilon}\right). \tag{21}$$
Setting $T = \frac{\Gamma_{\mathcal{L}}^{2/3}(n\epsilon)^{2/3}}{(L_1\|\mathcal{C}\|_1)^{2/3}}$ results in the claimed utility guarantee.

5.3 Nearly optimal private LASSO

We now apply the private Frank-Wolfe algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$ to the important case of the sparse linear regression (or LASSO) problem. We show that the private Frank-Wolfe algorithm leads to a nearly tight $\tilde O\left(\frac{1}{n^{2/3}}\right)$ bound.

Problem definition: We are given a data set $D = \{(x_1,y_1),\dots,(x_n,y_n)\}$ of $n$ samples from the domain $\mathcal{D} = \{(x,y): x\in\mathbb{R}^p,\ y\in[-1,1],\ \|x\|_\infty\le 1\}$, and the convex set $\mathcal{C}$ is the unit $\ell_1$-ball in $\mathbb{R}^p$. Define the mean squared loss
$$\mathcal{L}(\theta; D) = \frac{1}{2n}\sum_{i\in[n]}\left(\langle x_i, \theta\rangle - y_i\right)^2. \tag{22}$$
The objective is to compute $\theta_{\text{priv}}\in\mathcal{C}$ that minimizes $\mathcal{L}(\theta; D)$ while preserving privacy with respect to any change of an individual $(x_i, y_i)$ pair.
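The claim from the start of Section 5, that the squared loss over $\ell_\infty$-bounded data is $O(1)$-Lipschitz in the $\ell_1$-norm while its $\ell_2$-Lipschitz constant can grow with the dimension, can be observed numerically (assuming NumPy; the synthetic data below is ours): the gradient's $\ell_\infty$-norm stays $O(1)$ while its $\ell_2$-norm is much larger.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 10_000
X = rng.choice([-1.0, 1.0], size=(n, p))   # |x_ij| <= 1
y = rng.uniform(-1.0, 1.0, size=n)         # |y_i| <= 1
theta = np.zeros(p)

# gradient of L(theta) = (1/n)*||X theta - y||_2^2 at theta
g = (2.0 / n) * X.T @ (X @ theta - y)

# l1-Lipschitzness is governed by ||grad||_inf, l2-Lipschitzness by ||grad||_2
linf, l2 = np.max(np.abs(g)), np.linalg.norm(g)
assert linf <= 2.0                 # O(1): |g_j| <= (2/n) * sum_i |x_ij||r_i| <= 2
assert l2 <= 2.0 * np.sqrt(p)      # the l2 bound scales with sqrt(p)
assert l2 > 10 * linf              # in this regime the gap is large
```

This is exactly why the Laplace noise in Algorithm 4 is calibrated to the $\ell_1$-Lipschitz constant $L_1$: each vertex score $\langle s, \nabla\mathcal{L}\rangle$ has sensitivity governed by $\|\nabla\mathcal{L}\|_\infty$, not $\|\nabla\mathcal{L}\|_2$.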
The non-private setting of the above problem is a variant of the least squares problem with $\ell_1$ regularization, which started with the work on LASSO [Tib96, T+97] and has been intensively studied in the past years [HTF01, DJ04, CT05, Don06, CT07, BRT09, BM12, RWY09, Zha13]. Since the $\ell_1$-ball is the convex hull of $2p$ vertices, we can apply the private Frank-Wolfe algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$. For the above setting, it is easy to check that the $\ell_1$-Lipschitz constant is bounded by $O(1)$. Further, by applying the quadratic programming bound of Remark 5, we have $\Gamma_{\mathcal{L}} \le 4\max_{\theta\in\mathcal{C}}\|X\theta\|_2^2 = O(1)$ (with the $1/n$ normalization of the loss), since $\mathcal{C}$ is the unit $\ell_1$-ball and $|x_{ij}|\le 1$. Hence $\Gamma_{\mathcal{L}} = O(1)$. Now applying Theorem 5.5, we have the following.

Corollary 5.7. Let $D = \{(x_1,y_1),\dots,(x_n,y_n)\}$ be a set of $n$ samples from the domain $\mathcal{D} = \{(x,y): \|x\|_\infty\le 1,\ |y|\le 1\}$, and let the convex set $\mathcal{C}$ be the unit $\ell_1$-ball. The output $\theta_{\text{priv}}$ of Algorithm $\mathcal{A}_{\text{Noise-FW(polytope)}}$ ensures the following.
$$\mathbb{E}\left[\mathcal{L}(\theta_{\text{priv}}; D) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D)\right] = O\left(\frac{\log(np/\delta)}{(n\epsilon)^{2/3}}\right).$$

Remark 6. Compared to the previous work [KST12, ST13a], the above upper bound makes no assumption of restricted strong convexity or mutual incoherence, which might be too strong for realistic settings [Was12]. Our results also significantly improve the bounds of [JT14], from $\tilde O(1/n^{1/3})$ to $\tilde O(1/n^{2/3})$; that work considered the case where the set $\mathcal{C}$ is the probability simplex and the loss is a generalized linear model.

In the following, we show that to ensure privacy, the error bound in Corollary 5.7 is nearly optimal in terms of the dominant factor of $1/n^{2/3}$.

Theorem 5.8 (Optimality of private Frank-Wolfe). Let $\mathcal{C}$ be the $\ell_1$-ball and $\mathcal{L}$ be the mean squared loss in equation (22). For every sufficiently large $n$ and every $(\epsilon,\delta)$-differentially private algorithm $\mathcal{A}$ with $\epsilon\le 0.1$ and $\delta = o(1/n^2)$, there exists a data set $D = \{(x_1,y_1),\dots,(x_n,y_n)\}$ of $n$ samples from the domain $\mathcal{D} = \{(x,y): \|x\|_\infty\le 1,\ |y|\le 1\}$ such that
$$\mathbb{E}\left[\mathcal{L}(\mathcal{A}(D); D) - \min_{\theta\in\mathcal{C}}\mathcal{L}(\theta; D)\right] = \tilde\Omega\left(\frac{1}{n^{2/3}}\right).$$

We prove the lower bound by following the fingerprinting codes argument of [BUV14] for lower-bounding the error of $(\epsilon,\delta)$-differentially private algorithms. Similar to [BUV14] and [DTTZ14], we start with the following result, which is implicit in [BUV14]; the matrix $X$ in Theorem 5.9 is the padded Tardos code used in [DTTZ14, Section 5]. For any matrix $X$, denote by $X^{(i)}$ the matrix obtained by removing the $i$-th row of $X$. Call a column of a matrix a consensus column if the entries in the column are either all $1$ or all $-1$; the sign of a consensus column is simply the common value of the column. Write $w = m/\log m$ and $p = 1000m^2$.

Theorem 5.9 ([DTTZ14, Corollary 16], restated). Let $m$ be a sufficiently large positive integer. There exists a matrix $X\in\{-1,1\}^{(w+1)\times p}$ with the following guarantee. For each $i\in[1, w+1]$, there are at least $0.999p$ consensus columns $W_i$ in $X^{(i)}$. In addition, for any algorithm $\mathcal{A}$ on input matrix $X^{(i)}$ where $i\in[1, w+1]$: if, with probability at least $2/3$, $\mathcal{A}(X^{(i)})$ produces a $p$-dimensional sign vector which agrees with at least $\frac{3}{4}p$ columns in $W_i$, then $\mathcal{A}$ is not $(\epsilon,\delta)$-differentially private with respect to a single row change (to some other row in $X$).

Write $\tau = 0.001$ and let $k = \tau w p$. We first form a $k\times p$ matrix $Y$ whose column vectors are mutually orthogonal $\{1,-1\}$ vectors; this is possible since $k \ge p$. Now we construct $w+1$ databases $D_i$ for $1\le i\le w+1$ as follows. All of the databases contain the common set of pairs $(z_j, 0)$ for $1\le j\le k$, where $z_j = (Y_{j1},\dots,Y_{jp})$ is the $j$-th row vector of $Y$. In addition, each $D_i$ contains the $w$ pairs $(x_j, 1)$ for $x_j = (X_{j1},\dots,X_{jp})$, $j\ne i$. Then $\mathcal{L}(\theta; D_i)$ is defined as follows (for ease of notation, in this proof we work with the un-normalized loss; this does not affect the generality of the arguments in any way).
$$\mathcal{L}(\theta; D_i) = \sum_{j\ne i}\left(\langle x_j,\theta\rangle - 1\right)^2 + \sum_{j=1}^{k}\langle z_j,\theta\rangle^2 = \sum_{j\ne i}\left(\langle x_j,\theta\rangle - 1\right)^2 + k\|\theta\|_2^2.$$
The last equality holds because the columns of $Y$ are mutually orthogonal $\{-1,1\}$ vectors. For each $D_i$, consider $\theta^*\in\left\{-\frac{1}{p}, \frac{1}{p}\right\}^p$ such that the sign of each coordinate of $\theta^*$ matches the sign of the corresponding consensus column of $X^{(i)}$. Plugging $\theta^*$ into $\mathcal{L}(\cdot; D_i)$ we have the following,
$$\mathcal{L}(\theta^*; D_i) \le \sum_{j\ne i}(2\tau)^2 + k/p = (\tau + 4\tau^2)w, \tag{23}$$
since the number of consensus columns is at least $(1-\tau)p$ and $k/p = \tau w$.

We now prove the crucial lemma, which states that if $\theta$ is such that $\|\theta\|_1\le 1$ and $\mathcal{L}(\theta; D_i)$ is small, then $\theta$ has to agree with the signs of the consensus columns of $X^{(i)}$.

Lemma 5.10. Suppose that $\|\theta\|_1\le 1$ and $\mathcal{L}(\theta; D_i) < 1.1\tau w$. For $j\in W_i$, denote by $s_j$ the sign of consensus column $j$. Then we have $|\{j\in W_i: \mathrm{sign}(\theta_j) = s_j\}| \ge \frac{3}{4}p$.

Proof. For any $S\subseteq\{1,\dots,p\}$, denote by $\theta|_S$ the projection of $\theta$ onto the coordinate subset $S$. Consider the three subsets $S_1 = \{j\in W_i: \mathrm{sign}(\theta_j) = s_j\}$, $S_2 = \{j\in W_i: \mathrm{sign}(\theta_j)\ne s_j\}$, and $S_3 = \{1,\dots,p\}\setminus W_i$. The proof is by contradiction: assume that $|S_1| < \frac{3}{4}p$. Further denote $\theta_r = \theta|_{S_r}$ for $r = 1, 2, 3$. We will bound $\|\theta_1\|_1$ and $\|\theta_3\|_1$ using the inequality $\|x\|_2 \ge \|x\|_1/\sqrt{d}$ for any $d$-dimensional vector $x$.
First, $\|\theta_3\|_2^2 \ge \|\theta_3\|_1^2/|S_3| \ge \|\theta_3\|_1^2/(\tau p)$. Hence $k\|\theta_3\|_2^2 \ge w\|\theta_3\|_1^2$. But $k\|\theta_3\|_2^2 \le k\|\theta\|_2^2 \le 1.1\tau w$, so $\|\theta_3\|_1 \le \sqrt{1.1\tau} \le 0.04$.
Similarly, by the assumption $|S_1| < \frac{3}{4}p$, $\|\theta_1\|_2^2 \ge \|\theta_1\|_1^2/|S_1| \ge 4\|\theta_1\|_1^2/(3p)$. Again using $k\|\theta\|_2^2 < 1.1\tau w$, we have $\|\theta_1\|_1 \le \sqrt{1.1\cdot 3/4} \le 0.91$.
Now, for each $j\ne i$ we have $\langle x_j,\theta\rangle - 1 = \|\theta_1\|_1 - \|\theta_2\|_1 + \beta_j - 1$, where $|\beta_j|\le\|\theta_3\|_1\le 0.04$. Since $\|\theta_1\|_1 + \|\theta_2\|_1 + \|\theta_3\|_1 \le 1$, we have $|\langle x_j,\theta\rangle - 1| \ge 1 - \|\theta_1\|_1 - |\beta_j| \ge 1 - 0.91 - 0.04 = 0.05$. Hence $\mathcal{L}(\theta; D_i) \ge (0.05)^2 w \ge 1.1\tau w$. This leads to a contradiction, so we must have $|S_1| \ge \frac{3}{4}p$.

With Theorem 5.9 and Lemma 5.10, we can now prove Theorem 5.8.

Proof. Suppose that $\mathcal{A}$ is private and that, for the data sets constructed above, $\mathbb{E}[\mathcal{L}(\mathcal{A}(D_i); D_i) - \min_\theta\mathcal{L}(\theta; D_i)] \le cw$ for every $i$, for a sufficiently small constant $c$. By Markov's inequality, with probability at least $2/3$, $\mathcal{L}(\mathcal{A}(D_i); D_i) - \min_\theta\mathcal{L}(\theta; D_i) \le 3cw$. By (23), we have $\min_\theta\mathcal{L}(\theta; D_i) \le (\tau + 4\tau^2)w$. Hence, if we choose the constant $c$ small enough, with probability $2/3$,
$$\mathcal{L}(\mathcal{A}(D_i); D_i) < (\tau + 4\tau^2 + 3c)w \le 1.1\tau w. \tag{24}$$
By Lemma 5.10, (24) implies that $\mathcal{A}(D_i)$ agrees with at least $\frac{3}{4}p$ consensus columns of $X^{(i)}$. However, by Theorem 5.9, this violates the privacy of $\mathcal{A}$. Hence there exists an $i$ such that $\mathbb{E}[\mathcal{L}(\mathcal{A}(D_i); D_i) - \min_\theta\mathcal{L}(\theta; D_i)] > cw$. Recall that $w = m/\log m$ and $n = w + k = O(m^3/\log m)$. Hence we have $\mathbb{E}[\mathcal{L}(\mathcal{A}(D_i); D_i) - \min_\theta\mathcal{L}(\theta; D_i)] = \Omega(n^{1/3}/\log^{2/3} n)$. The proof is completed by converting the above bound to the normalized version, $\Omega\left(\frac{1}{(n\log n)^{2/3}}\right)$.

References

[Bal97] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[BLST10] Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In KDD, New York, NY, USA, 2010.
[BM03] Peter L. Bartlett and Shahar Mendelson.
Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
[BM12] Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 2012.
[BRT09] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization, revisited. In FOCS, 2014.
[BT03] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[BTN13] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization. Lecture notes, 2013.
[BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, 2014.
[Cla10] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 2010.
[CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.
[CRPW12] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[CT05] Emmanuel Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51, 2005.
[CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.
[DJ04] David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures.
Annals of Statistics, pages 962–994, 2004.
[DJM13] John C. Duchi, Michael I. Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, pages 2832–2840, 2013.
[DJW13] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In FOCS, 2013.
[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In STOC, 2010.
[DNT13] Cynthia Dwork, Aleksandar Nikolov, and Kunal Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. arXiv preprint arXiv:1308.1385, 2013.
[Don06] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 2006.
[DRV10] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, 2010.
[DSSST10] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, 2010.
[DTTZ14] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In STOC, 2014.
[Dwo06] Cynthia Dwork. Differential privacy. In ICALP, LNCS, pages 1–12, 2006.
[Dwo08] Cynthia Dwork. Differential privacy: A survey of results. In TAMC, pages 1–19. Springer, 2008.
[Dwo09] Cynthia Dwork. The differential privacy frontier. In TCC, pages 496–502. Springer, 2009.
[FW56] Marguerite Frank and Philip Wolfe.
An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[HK12] Elad Hazan and Satyen Kale. Projection-free online learning. In ICML, 2012.
[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., 2001.
[Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In COLT, pages 24.1–24.34, 2012.
[JT14] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning (ICML), 2014.
[KST12] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In COLT, pages 25.1–25.40, 2012.
[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, 2007.
[NTZ13] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: The sparse and approximate cases. In STOC, 2013.
[NY83] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[Rak09] Alexander Rakhlin. Lecture notes on online learning. Lecture notes, 2009.
[RRWN11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[RWY09] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. ArXiv e-prints, October 2009.
[SSS07] Shai Shalev-Shwartz and Yoram Singer. Logarithmic regret algorithms for strongly convex repeated games. 2007.
[SSSSS09] Shai Shalev-Shwartz, Ohad Shamir , Nathan Srebro, and Karthik Sridharan. Stochastic Con ve x Optimization. In COLT , 2009. [SSSZ10] Shai Shalev-Shwartz, Nathan Srebro, and T ong Zhang. Trading accuracy for sparsity in opti- mization problems with sparsity constraints. SIAM Journal on Optimization , 2010. [SST11] Nati Srebro, Karthik Sridharan, and Ambuj T ewari. On the univ ersality of online mirror de- scent. In NIPS , 2011. [ST10] Karthik Sridharan and Ambuj T ew ari. Con ve x games in banach spaces. In COLT , 2010. [ST13a] Adam Smith and Abhradeep Thakurta. Differentially priv ate feature selection via stability arguments, and the rob ustness of the lasso. In COLT , 2013. [ST13b] Adam Smith and Abhradeep Thakurta. Follo w the perturbed leader is differentially priv ate with optimal regret guarantees, 2013. Personal Communication. [ST13c] Adam Smith and Abhradeep Thakurta. Nearly optimal algorithms for priv ate online learning in full-information and bandit settings. In NIPS , 2013. [SZ13] Ohad Shamir and T ong Zhang. Stochastic gradient descent for non-smooth optimization: Con- ver gence results and optimal av eraging schemes. In ICML , pages 71–79, 2013. [T + 97] Robert T ibshirani et al. The lasso method for variable selection in the cox model. Statistics in medicine , 16(4):385–395, 1997. [T ib96] R. T ibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society . Series B (Methodolo gical) , 1996. [Ull14] Jonathan Ullman. Priv ate multiplicati ve weights beyond linear queries. CoRR , abs/1407.1571, 2014. [W as12] Larry W asserman. Restricted isometry property , rest in peace. Blog-post , 2012. [Zha13] Li Zhang. Nearly optimal minimax estimator for high dimensional sparse linear regression. Annals of Statistics , 2013. 
A Tighter Guarantees of Mirror Descent for Strongly Convex Functions

In this section we study Algorithm 1 (Algorithm $\mathcal{A}_{\text{Noise-MD}}$) in the context of strongly convex functions of the following form: every loss function $\mathcal{L}(\theta; d)$ is $L$-Lipschitz in the $\ell_2$-norm and $\Delta$-strongly convex with respect to some differentiable convex function $\Psi : \mathcal{C} \to \mathbb{R}$, for any $\theta \in \mathcal{C}$ and $d \in \mathcal{D}$. (See Section 2.2 for a definition.) This setting has previously been studied in [DSSST10, SSS07]. Two common examples are: i) $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$, where $\mathcal{L}(\theta; d)$ is $\Delta$-strongly convex w.r.t. $\|\cdot\|_2$; and ii) composite loss functions $\mathcal{L}(\theta; d) = g(\theta; d) + \Delta \Psi(\theta)$ with $\Psi(\theta) = \sum_{i=1}^p \theta_i \log \theta_i$, where $\mathcal{L}(\theta; d)$ is $\Delta$-strongly convex w.r.t. $\Psi$ within the probability simplex, and $\Psi$ is in turn $1$-strongly convex w.r.t. $\|\cdot\|_1$ [DSSST10, Section 5]. In the following we show that one can get a much sharper dependence on $n$ (compared to Theorem 3.2) under strong convexity.

Remark 7. [BST14] analyzed the setting of strong convexity w.r.t. the $\ell_2$-norm, and in particular provided tight error guarantees. For this case, Theorem A.1 leads to similarly tight bounds, and thus the lower bounds in [BST14] imply that in general our guarantee cannot be improved.

Theorem A.1 (Utility guarantee for strongly convex functions). Let $Q$ be the symmetric convex hull of $\mathcal{C}$. Assume that every loss function $\mathcal{L}(\theta; d)$ is $L$-Lipschitz in the $\ell_2$-norm and $\Delta$-strongly convex with respect to some differentiable, $1$-strongly convex (w.r.t. $\|\cdot\|_Q$) function $\Psi : \mathcal{C} \to \mathbb{R}$, for any $\theta \in \mathcal{C}$ and $d \in \mathcal{D}$. Let $\|\mathcal{C}\|_2$ be the $\ell_2$-diameter of the set $\mathcal{C}$, and let $G_{\mathcal{C}}$ be its Gaussian width. In Algorithm $\mathcal{A}_{\text{Noise-MD}}$ (Algorithm 1), if we set $T = \frac{(\|\mathcal{C}\|_2 \cdot \epsilon n)^2}{G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2}$, the potential function to be $\Psi$, and $\eta_t = \frac{2}{\Delta t}$, then the following is true:
$$\mathbb{E}\left[\mathcal{L}(\theta^{\text{priv}}; D)\right] - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta; D) = O\left(\frac{L^2 \left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right) \log(n/\delta) \log(\|\mathcal{C}\|_2 \, \epsilon n)}{\Delta (\epsilon n)^2}\right).$$
Here the expectation is over the randomness of the algorithm.

Proof. For ease of notation, we hide the dependence of $\mathcal{L}(\theta; D)$ on the data set $D$ and simply write $\mathcal{L}(\theta)$. The first part of the proof is fairly standard and exactly the same as that of Theorem 3.2 up to (2). Following the same notation, it suffices to bound $\frac{1}{T}\sum_{t=1}^{T} \mathcal{L}(\theta_t) - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta)$. The rest of the proof differs from Theorem 3.2 in that we now work with a quadratic approximation (Claim A.2) to the loss function instead of a linear approximation (Claim 3.3).

Claim A.2. Let $\theta^* = \arg\min_{\theta \in \mathcal{C}} \mathcal{L}(\theta)$. For every $t \in [T]$, let $\gamma_t$ be the subgradient of $\mathcal{L}(\theta_t; D)$ used in iteration $t$ of Algorithm $\mathcal{A}_{\text{Noise-MD}}$ (Algorithm 1). Then the following is true:
$$\frac{1}{T}\sum_{t=1}^{T} \mathcal{L}(\theta_t; D) - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta; D) \;\le\; \frac{1}{T}\sum_{t=1}^{T} \left[\langle \gamma_t, \theta_t - \theta^* \rangle - \Delta \cdot B_{\Psi}(\theta^*, \theta_t)\right].$$
The proof of this claim is a direct consequence of the definition of strong convexity.

Now, using (11) from the proof of Theorem 3.2 and summing over the $T$ iterations, we have:
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\langle \gamma_t, \theta_t - \theta^* \rangle - \Delta \cdot B_{\Psi}(\theta^*, \theta_t)\right] \le \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\frac{B_{\Psi}(\theta^*, \theta_t) - B_{\Psi}(\theta^*, \theta_{t+1})}{\eta_{t+1}} - \Delta \cdot B_{\Psi}(\theta^*, \theta_t)\right] + O\left(\frac{L^2\|\mathcal{C}\|_2^2 + \sigma^2\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)}{T}\right)\sum_{t=1}^{T} \eta_{t+1}$$
$$= \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[B_{\Psi}(\theta^*, \theta_t)\left(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - \Delta\right)\right] + O\left(\frac{L^2\|\mathcal{C}\|_2^2 + \sigma^2\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)}{T}\right)\sum_{t=1}^{T} \eta_t. \tag{25}$$

Now, setting $\eta_t = \frac{2}{\Delta t}$ (so that $\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - \Delta \le 0$ and the first term of (25) is nonpositive) and using Claim A.2, we obtain:
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[\mathcal{L}(\theta_t)] - \mathcal{L}(\theta^*) = O\left(\frac{L^2\|\mathcal{C}\|_2^2 + \sigma^2\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)}{\Delta T}\right)\log T = O\left(\frac{\log T}{\Delta}\left(\frac{L^2\|\mathcal{C}\|_2^2}{T} + \frac{L^2\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)\log(n/\delta)}{(\epsilon n)^2}\right)\right). \tag{26}$$

Setting $T = \|\mathcal{C}\|_2^2 (\epsilon n)^2 / \left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)$ in (26), we obtain the required excess risk bound:
$$\mathbb{E}\left[\mathcal{L}(\theta^{\text{priv}})\right] - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta) = O\left(\frac{L^2\left(G_{\mathcal{C}}^2 + \|\mathcal{C}\|_2^2\right)\log(n/\delta)\log(\|\mathcal{C}\|_2 \, \epsilon n)}{\Delta(\epsilon n)^2}\right).$$
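To make the update concrete, here is a minimal numerical sketch of the noisy mirror descent step for strongly convex losses, specialized to the Euclidean potential $\Psi(\theta) = \frac{1}{2}\|\theta\|_2^2$ (example i above), where the mirror step reduces to a noisy projected gradient step with the step size $\eta_t = 2/(\Delta t)$ from Theorem A.1. This is an illustration, not the paper's Algorithm 1: the function names, the toy objective, and the averaged output are all choices made here for the example.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto the l2 ball of the given radius (the set C)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def noisy_mirror_descent(grad_fn, p, radius, delta_sc, sigma, T, rng):
    """T noisy steps with the strongly convex schedule eta_t = 2/(delta_sc * t).

    With Psi = 0.5*||.||^2 the Bregman projection is the Euclidean projection,
    so each mirror step is theta <- Proj_C(theta - eta*(gradient + noise)).
    Returns the running average of the iterates, matching the (1/T)*sum form
    that the utility bound controls.
    """
    theta = np.zeros(p)
    avg = np.zeros(p)
    for t in range(1, T + 1):
        eta = 2.0 / (delta_sc * t)
        noisy_grad = grad_fn(theta) + rng.normal(0.0, sigma, size=p)
        theta = project_l2_ball(theta - eta * noisy_grad, radius)
        avg += (theta - avg) / t  # running average of theta_1, ..., theta_t
    return avg

# Toy 1-strongly-convex objective: L(theta) = 0.5*||theta - c||^2, c inside C.
rng = np.random.default_rng(0)
c = np.array([0.5, -0.25, 0.1])
avg = noisy_mirror_descent(lambda th: th - c, p=3, radius=1.0,
                           delta_sc=1.0, sigma=0.01, T=2000, rng=rng)
print(np.round(avg, 2))  # close to the minimizer c
```

Because the loss is strongly convex, the averaged iterate contracts toward the minimizer even with per-step Gaussian noise; the noise level `sigma` here is an arbitrary small constant, whereas the private algorithm calibrates it to the Lipschitz constant, $T$, and $(\epsilon, \delta)$.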
B Missing Details for Private Frank-Wolfe for the $\ell_2$-bounded Case

In this section we provide the details of the private Frank-Wolfe algorithm for the $\ell_2$-bounded case, along with its privacy and utility guarantees. For a data set $D = \{d_1, \cdots, d_n\}$, define the objective to be the empirical loss $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; d_i)$. We define $L_2$ to be the $\ell_2$-Lipschitz constant of $\ell$ over all possible data sets.

Algorithm 5 $\mathcal{A}_{\text{Noise-FW(Gen-convex)}}$: Differentially Private Frank-Wolfe Algorithm (General Convex Case)
Input: Data set $D = \{d_1, \cdots, d_n\}$; loss function $\mathcal{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; d_i)$ (with $\ell_2$-Lipschitz constant $L_2$ for $\ell$); privacy parameters $(\epsilon, \delta)$; convex set $\mathcal{C}$ bounded in the $\ell_2$-norm, with diameter denoted $\|\mathcal{C}\|_2$.
1: Choose an arbitrary $\theta_1$ from $\mathcal{C}$.
2: for $t = 1$ to $T - 1$ do
3: $\quad \widetilde{\theta}_t \leftarrow \arg\min_{\theta \in \mathcal{C}} \langle \nabla\mathcal{L}(\theta_t; D) + b_t, \theta \rangle$, where $b_t \sim \mathcal{N}(0, \mathbb{I}_p \sigma^2)$ and $\sigma^2 \leftarrow \frac{32 L_2^2 T \log^2(n/\delta)}{(\epsilon n)^2}$.
4: $\quad \theta_{t+1} \leftarrow (1 - \mu)\theta_t + \mu \widetilde{\theta}_t$, where $\mu = \frac{1}{t+2}$.
5: Output $\theta^{\text{priv}} = \theta_T$.

Theorem B.1 (Privacy guarantee). Algorithm $\mathcal{A}_{\text{Noise-FW(Gen-convex)}}$ (Algorithm 5) is $(\epsilon, \delta)$-differentially private.

The proof of privacy is exactly the same as the proof of privacy in Theorem 3.2. In the following we provide the utility guarantee for Algorithm $\mathcal{A}_{\text{Noise-FW(Gen-convex)}}$.

Theorem B.2 (Utility guarantee). Let $L_2$ and $\|\mathcal{C}\|_2$ be defined as in Algorithm 5 (Algorithm $\mathcal{A}_{\text{Noise-FW(Gen-convex)}}$). Let $G_{\mathcal{C}}$ be the Gaussian width of the convex set $\mathcal{C} \subseteq \mathbb{R}^p$, and let $\Gamma_{\mathcal{L}}$ be the curvature constant (defined in Definition 5.1) of the loss function $\ell(\theta; d)$ for all $\theta \in \mathcal{C}$ and $d \in \mathcal{D}$. In Algorithm $\mathcal{A}_{\text{Noise-FW}}$, if we set $T = \frac{\Gamma_{\mathcal{L}}^{2/3}(\epsilon n)^{2/3}}{(L_2 G_{\mathcal{C}})^{2/3}}$, then the excess empirical risk satisfies
$$\mathbb{E}\left[\mathcal{L}(\theta^{\text{priv}}; D)\right] - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta; D) = O\left(\frac{\Gamma_{\mathcal{L}}^{1/3}\left(L_2 G_{\mathcal{C}}\right)^{2/3}\log(n/\delta)}{(\epsilon n)^{2/3}}\right).$$
Here the expectation is over the randomness of the algorithm and $\Gamma_{\mathcal{L}}$ is the curvature constant.

Proof. Recall $\sigma^2 = \frac{32 L_2^2 T \log^2(n/\delta)}{(\epsilon n)^2}$. Using the property of the Gaussian width (Section 2.2) and an analysis similar to that of the convex polytope case, we can conclude the following:
$$\mathbb{E}[\mathcal{L}(\theta_T)] - \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta) = O\left(\frac{\Gamma_{\mathcal{L}}}{T} + \frac{L_2 G_{\mathcal{C}}\sqrt{T}\log(n/\delta)}{\epsilon n}\right). \tag{27}$$
Setting $T = \frac{\Gamma_{\mathcal{L}}^{2/3}(\epsilon n)^{2/3}}{(L_2 G_{\mathcal{C}})^{2/3}}$ balances the two terms of (27) and results in the stated utility guarantee.
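The structure of Algorithm 5 can be sketched numerically. Over an $\ell_2$ ball of radius $R$, the linear minimization oracle has the closed form $\arg\min_{\|\theta\|_2 \le R} \langle g, \theta \rangle = -R\,g/\|g\|_2$, so the noisy step only requires perturbing the gradient before calling it. This is an illustrative sketch under that assumption, with a toy quadratic objective and an arbitrary small noise level, not the calibrated private algorithm.

```python
import numpy as np

def noisy_linear_oracle(grad, radius, sigma, rng):
    """Minimize <grad + b, theta> over the l2 ball: the solution is a boundary
    point in the direction opposite the perturbed gradient."""
    g = grad + rng.normal(0.0, sigma, size=grad.shape)
    norm = np.linalg.norm(g)
    return np.zeros_like(g) if norm == 0 else -radius * g / norm

def noisy_frank_wolfe(grad_fn, p, radius, sigma, T, rng):
    theta = np.zeros(p)  # an arbitrary starting point in C
    for t in range(1, T):
        vertex = noisy_linear_oracle(grad_fn(theta), radius, sigma, rng)
        mu = 1.0 / (t + 2)
        # Convex combination keeps the iterate inside C without any projection.
        theta = (1 - mu) * theta + mu * vertex
    return theta

# Toy smooth objective L(theta) = 0.5*||theta - c||^2 with c inside the ball.
rng = np.random.default_rng(0)
c = np.array([0.3, -0.2, 0.1])
theta = noisy_frank_wolfe(lambda th: th - c, p=3, radius=1.0,
                          sigma=0.01, T=3000, rng=rng)
print(np.round(theta, 2))  # approaches the minimizer c
```

Note the design point the appendix relies on: the error added by the noise enters only through $\max_{\theta \in \mathcal{C}} \langle b_t, \theta \rangle$, whose expectation is governed by the Gaussian width $G_{\mathcal{C}}$ rather than the ambient dimension.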