Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds
Raef Bassily*   Adam Smith*†   Abhradeep Thakurta‡

October 21, 2014

Abstract

In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.

Our algorithms run in polynomial time, and in some cases even match the optimal nonprivate running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ε, 0)- and (ε, δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.

* Computer Science and Engineering Department, The Pennsylvania State University. {bassily,asmith}@psu.edu. R.B. and A.S. were supported in part by NSF awards #0747294 and #0941553.
† A.S.
is on sabbatical at, and partly supported by, Boston University's Hariri Institute for Computing and Center for RISCS as well as Harvard University's Center for Research on Computation and Society, via a Simons Investigator grant to Salil Vadhan.
‡ Stanford University and Microsoft Research. b-abhrag@microsoft.com. Supported in part by the Sloan Foundation.

Contents

1 Introduction
  1.1 Contributions
  1.2 Other Related Work
  1.3 Additional Definitions
  1.4 Organization of this Paper
2 Gradient Descent and Optimal (ε, δ)-differentially private Optimization
3 Exponential Sampling and Optimal (ε, 0)-private Optimization
  3.1 Exponential Mechanism for Lipschitz Convex Loss
  3.2 Efficient Implementation of Algorithm A_exp-samp (Algorithm 2)
  3.3 Our construction
4 Localization and Optimal Private Algorithms for Strongly Convex Loss
5 Lower Bounds on Excess Risk
  5.1 Lower bounds for Lipschitz Convex Functions
  5.2 Lower bounds for Strongly Convex Functions
6 Efficient Sampling from Logconcave Distributions over Convex Sets and The Proof of Theorem 3.4
  6.1 Efficient ε-Differentially Private Algorithm for Lipschitz Convex Loss
A Straightforward Smoothing Does Not Yield Optimal Algorithms
B Localization and (ε, δ)-Differentially Private Algorithms for Lipschitz, Strongly Convex Loss
C Proof of Lemma 5.1
  C.1 Proof of Part 1
  C.2 Proof of Part 2
D Converting Excess Risk Bounds in Expectation to High-probability Bounds
E Excess Risk Bounds for Smooth Functions
F From Excess Empirical Risk to Generalization Error

1 Introduction

Convex optimization is one of the most basic and powerful computational tools in statistics and machine learning. It is most commonly used for empirical risk minimization (ERM): the data set D = {d₁, ..., d_n} defines a convex loss function L(·) which is minimized over a convex set C. When run on sensitive data, however, the results of convex ERM can leak sensitive information. For example, medians and support vector machine parameters can, in many cases, leak entire records in the clear (see "Motivation", below). In this paper, we provide new algorithms and matching lower bounds for differentially private convex ERM assuming only that each data point's contribution to the loss function is Lipschitz and that the domain of optimization is bounded. This builds on a line of work started by Chaudhuri et al. [11].

Problem formulation. Given a data set D = {d₁, ..., d_n} drawn from a universe X, and a closed, convex set C, our goal is to minimize

    L(θ; D) = Σ_{i=1}^{n} ℓ(θ; d_i)   over θ ∈ C.

The map ℓ defines, for each data point d, a loss function ℓ(·; d) on C. We will generally assume that ℓ(·; d) is convex and L-Lipschitz for all d ∈ X. One obtains variants of this basic problem by assuming additional restrictions, such as (i) that ℓ(·; d) is ∆-strongly convex for all d ∈ X, and/or (ii) that ℓ(·; d) is β-smooth for all d ∈ X. Definitions of Lipschitz continuity, strong convexity, and smoothness are provided at the end of the introduction.
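In code, the objective above is just a sum of per-example losses minimized over a feasible set. The following is a minimal sketch of that formulation; the squared loss, the toy data, and the helper names (erm_objective, project_l2_ball) are illustrative choices, not from the paper:

```python
import math

def erm_objective(theta, data, loss):
    """L(theta; D) = sum_i loss(theta, d_i), the quantity minimized over C."""
    return sum(loss(theta, d) for d in data)

def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        return list(theta)
    return [x * radius / norm for x in theta]

# Toy instance in R^2: squared loss (an illustrative convex, Lipschitz-on-C choice)
data = [[1.0, 0.0], [0.0, 1.0]]
sq_loss = lambda t, d: 0.5 * sum((ti - di) ** 2 for ti, di in zip(t, d))

theta = project_l2_ball([2.0, 2.0])      # a feasible point of the unit ball
value = erm_objective(theta, data, sq_loss)
print(round(value, 6))                   # 0.585786, i.e. 2 - sqrt(2)
```

Any convex per-example loss can be swapped in for sq_loss; the private algorithms below only interact with ℓ through (sub)gradient or function evaluations of this form.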
For example, given a collection of data points in R^p, the Euclidean 1-median is a point in R^p that minimizes the sum of the Euclidean distances to the data points. That is, ℓ(θ; d_i) = ‖θ − d_i‖₂, which is 1-Lipschitz in θ for any choice of d_i. Another common example is the support vector machine (SVM): given a data point d_i = (x_i, y_i) ∈ R^p × {−1, 1}, one defines a loss function ℓ(θ; d_i) = hinge(y_i · ⟨θ, x_i⟩), where hinge(z) = (1 − z)₊ (here (1 − z)₊ equals 1 − z for z ≤ 1, and 0 otherwise). The loss is L-Lipschitz in θ when ‖x_i‖₂ ≤ L.

Our formulation also captures regularized ERM, in which an additional (convex) function r(θ) is added to the loss function to penalize certain types of solutions; the loss function is then r(θ) + Σ_{i=1}^{n} ℓ(θ; d_i). One can fold the regularizer r(·) into the data-dependent functions by replacing ℓ(θ; d_i) with ℓ̃(θ; d_i) = ℓ(θ; d_i) + (1/n) r(θ), so that L(θ; D) = Σ_i ℓ̃(θ; d_i). This folding comes at some loss of generality (since it may increase the Lipschitz constant), but it does not affect asymptotic results. Note that if r is ∆n-strongly convex, then every ℓ̃ is ∆-strongly convex.

We measure the success of our algorithms by the worst-case (over inputs) expected excess empirical risk, namely

    E[L(θ̂; D) − L(θ*; D)],   (1)

where θ̂ is the output of the algorithm, θ* = argmin_{θ∈C} L(θ; D) is the true minimizer, and the expectation is only over the coins of the algorithm. Expected risk guarantees can be converted to high-probability guarantees using standard amplification techniques (see Appendix D for details). Another important measure of performance is an algorithm's (excess) generalization error, where loss is measured with respect to the average over an unknown distribution from which the data are assumed to be drawn i.i.d.
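The two running examples, the 1-median and the SVM hinge loss, translate directly into code. A minimal sketch (the numeric inputs are illustrative):

```python
import math

def median_loss(theta, d):
    """Euclidean 1-median loss: ||theta - d||_2 (1-Lipschitz in theta)."""
    return math.sqrt(sum((t - x) ** 2 for t, x in zip(theta, d)))

def hinge(z):
    """hinge(z) = (1 - z)_+ : equals 1 - z for z <= 1, and 0 otherwise."""
    return max(0.0, 1.0 - z)

def svm_loss(theta, d):
    """d = (x, y) with y in {-1, +1}; L-Lipschitz in theta when ||x||_2 <= L."""
    x, y = d
    return hinge(y * sum(t * xi for t, xi in zip(theta, x)))

theta = [0.5, -0.5]
print(median_loss(theta, [1.0, 1.0]))    # sqrt(2.5) ~ 1.5811
print(svm_loss(theta, ([1.0, 0.0], 1)))  # hinge(0.5)  = 0.5
print(svm_loss(theta, ([1.0, 0.0], -1))) # hinge(-0.5) = 1.5
```

Both losses are Lipschitz but not smooth (the median loss has a kink at θ = d_i, the hinge at z = 1), which is exactly the regime the paper's algorithms target.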
Our upper bounds on empirical risk imply upper bounds on generalization error (via uniform convergence and similar ideas); the resulting bounds are only known to be tight in certain ranges of parameters, however. Detailed statements may be found in Appendix F.

Motivation. Convex ERM is used for fitting models from simple least-squares regression to support vector machines, and its use may have significant implications for privacy. As a simple example, note that the Euclidean 1-median of a data set will typically be an actual data point, since the gradient of the loss function has discontinuities at each of the d_i. (Thinking about the one-dimensional median, where there is always a data point that minimizes the loss, is helpful.) Thus, releasing the median may well reveal one of the data points in the clear. A more subtle example is the support vector machine (SVM). The solution to an SVM program is often presented in its dual form, whose coefficients typically consist of a set of p + 1 exact data points. Kasiviswanathan et al. [32] show how the results of many convex ERM problems can be combined to carry out reconstruction attacks in the spirit of Dinur and Nissim [16].

Differential privacy is a rigorous notion of privacy that emerged from a line of work in theoretical computer science and cryptography [19, 6, 21]. We say two data sets D and D′ of size n are neighbors if they differ in one entry (that is, |D △ D′| = 2). A randomized algorithm A is (ε, δ)-differentially private (Dwork et al. [21, 20]) if, for all neighboring data sets D and D′ and for all events S in the output space of A, we have

    Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

Algorithms that satisfy differential privacy for ε < 1 and δ ≪ 1/n provide meaningful privacy guarantees, even in the presence of side information. In particular, they avoid the problems mentioned in "Motivation" above.
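To make the definition concrete, the following is a hedged sketch of the classical Laplace mechanism on a counting query, a standard (ε, 0)-differentially private primitive; the query, the data, and the checked output point are illustrative, not taken from the paper:

```python
import math
import random

def laplace_count(data, predicate, eps):
    """Release |{d in D : predicate(d)}| + Lap(1/eps).
    The count changes by at most 1 between neighboring data sets
    (sensitivity 1), so this release is (eps, 0)-differentially private."""
    count = sum(1 for d in data if predicate(d))
    b = 1.0 / eps
    # Laplace(b) sampled as the difference of two Exp(1/b) draws
    noise = b * math.log((1.0 - random.random()) / (1.0 - random.random()))
    return count + noise

# Pointwise check of the definition (with delta = 0): on neighboring data
# sets the true counts differ by at most 1, so at any output y the two
# Laplace output densities differ by a factor of at most e^eps.
eps = 0.5
b = 1.0 / eps
dens = lambda y, c: math.exp(-abs(y - c) / b) / (2 * b)
y = 10.5
ratio = dens(y, 10) / dens(y, 11)   # true counts on neighbors: 10 vs 11
print(ratio <= math.exp(eps))       # True
```

The density-ratio check is exactly the inequality in the definition, specialized to δ = 0 and to point events; the worst case over y attains the factor e^ε.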
See Dwork [18], Kasiviswanathan and Smith [30], Kifer and Machanavajjhala [33] for discussion of the "semantics" of differential privacy.

Setting Parameters. We aim to quantify the role of several basic parameters in the excess risk of differentially private algorithms: the size of the data set n, the dimension p of the parameter space C, the Lipschitz constant L of the loss functions, the diameter ‖C‖₂ of the constraint set and, when applicable, the strong convexity ∆. We may take L and ‖C‖₂ to be 1 without loss of generality: we can set ‖C‖₂ = 1 by rescaling θ (replacing θ with θ · ‖C‖₂); we can then set L = 1 by rescaling the loss function L (replacing L by L/L). These two transformations change the excess risk by L‖C‖₂. The parameter ∆ cannot similarly be rescaled while keeping L and ‖C‖₂ the same. However, we always have ∆ ≤ 2L/‖C‖₂. In the sequel, we thus focus on the setting where L = ‖C‖₂ = 1 and ∆ ∈ [0, 2]. To convert excess risk bounds for L = ‖C‖₂ = 1 to the general setting, one can multiply the risk bounds by L‖C‖₂ and replace ∆ by ∆‖C‖₂/L.

1.1 Contributions

We give algorithms that significantly improve on the state of the art for optimizing non-smooth loss functions: for both the general case and strongly convex functions, we improve the excess risk bounds by a factor of √n, asymptotically. The algorithms we give for (ε, 0)- and (ε, δ)-differential privacy work on very different principles. We group the algorithms below by technique: gradient descent, exponential sampling, and localization. For the purposes of this section, Õ(·) notation hides factors polynomial in log n and log(1/δ). Detailed bounds are stated in Table 1.

Gradient descent-based algorithms. For (ε, δ)-differential privacy, we show that a noisy version of gradient descent achieves excess risk Õ(√p/ε).
This matches our lower bound, Ω(min(n, √p/ε)), up to logarithmic factors. (Note that every θ ∈ C has excess risk at most n, so a lower bound of n can always be matched.) For ∆-strongly convex functions, a variant of our algorithm has risk Õ(p/(∆nε²)), which matches the lower bound Ω(p/(nε²)) when ∆ is bounded below by a constant (recall that ∆ ≤ 2 since L = ‖C‖₂ = 1). Previously, the best known risk bounds were Ω(√(pn)/ε) for general convex functions and Ω(p/(√n ∆ε²)) for ∆-strongly convex functions (achievable via several different techniques (Chaudhuri et al. [11], Kifer et al. [34], Jain et al. [28], Duchi et al. [17])). Under the restriction that each data point's contribution to the loss function is sufficiently smooth, objective perturbation [11, 34] also has risk Õ(√p/ε) (which is tight, since the lower bounds apply to smooth functions).

| Assumptions | (ε,0)-DP: Previous [11] Upper Bd | (ε,0)-DP: This work Upper Bd | (ε,0)-DP: Lower Bd | (ε,δ)-DP: Previous [34] Upper Bd | (ε,δ)-DP: This work Upper Bd | (ε,δ)-DP: Lower Bd |
|---|---|---|---|---|---|---|
| 1-Lipschitz and ‖C‖₂ = 1 | p√n/ε | p/ε | p/ε | √(pn log(1/δ))/ε | √p log²(n/δ)/ε | √p/ε |
| ... and O(p)-smooth | p/ε | — | p/ε | √(p log(1/δ))/ε | — | √p/ε |
| 1-Lipschitz and ∆-strongly convex and ‖C‖₂ = 1 (implies ∆ ≤ 2) | p²/(√n ∆ε²) | (log n/∆) · p²/(nε²) | p²/(nε²) | p log(1/δ)/(√n ∆ε²) | (log³(n/δ)/∆) · p/(nε²) | p/(nε²) |
| ... and O(p)-smooth | p²/(n∆ε²) | — | p²/(nε²) | p log(1/δ)/(n∆ε²) | — | p/(nε²) |

Table 1: Upper and lower bounds for excess risk of differentially private convex ERM. Bounds ignore leading multiplicative constants, and the values in the table give the bound when it is below n; that is, upper bounds should be read as O(min(n, ...)) and lower bounds as Ω(min(n, ...)). Here ‖C‖₂ is the diameter of C. The bounds are stated for the setting where L = ‖C‖₂ = 1, which can be enforced by rescaling; to get general statements, multiply the risk bounds by L‖C‖₂ and replace ∆ by ∆‖C‖₂/L. We assume δ < 1/n to simplify the bounds.
However, smooth functions do not include important special cases such as medians and support vector machines. Chaudhuri et al. [11] suggest applying their technique to support vector machines by smoothing ("huberizing") the loss function. We show in Appendix A that this approach still yields expected excess risk Ω(√(pn)/ε).

Although straightforward noisy gradient descent would work well in our setting, we present a faster variant based on stochastic gradient descent: at each step t, the algorithm samples a random point d_i from the data set, computes a noisy version of d_i's contribution to the gradient of L at the current estimate θ̃_t, and then uses that noisy measurement to update the parameter estimate. The algorithm is similar to algorithms that have appeared previously (Williams and McSherry [49] first investigated gradient descent with noisy updates; stochastic variants were studied by Jain et al. [28], Duchi et al. [17], Song et al. [46]). The novelty of our analysis lies in taking advantage of the randomness in the choice of d_i (following Kasiviswanathan et al. [31]) to run the algorithm for many steps without a significant cost to privacy. Running the algorithm for T = n² steps gives the desired expected excess risk bound. Even nonprivate first-order algorithms (i.e., those based on gradient measurements) must learn information about the gradient at Ω(n²) points to get risk bounds that are independent of n (this follows from "oracle complexity" bounds showing that a 1/√T convergence rate is optimal [39, 1]). Thus, the query complexity of our algorithm cannot be improved without using more information about the loss function, such as second derivatives.

The gradient descent approach does not, to our knowledge, allow one to get optimal excess risk bounds for (ε, 0)-differential privacy. The main obstacle is that "strong composition" of (ε, δ)-privacy, Dwork et al.
[22], appears necessary to allow a first-order method to run for sufficiently many steps.

Exponential Sampling-based Algorithms. For (ε, 0)-differential privacy, we observe that a straightforward use of the exponential mechanism (sampling from an appropriately-sized net of points in C, where each point θ has probability proportional to exp(−εL(θ; D))) has excess risk Õ(p/ε) on general Lipschitz functions, nearly matching the lower bound of Ω(p/ε). (The bound would not be optimal for (ε, δ)-privacy because it scales as p, not √p.) This mechanism is inefficient in general since it requires construction of a net and an appropriate sampling mechanism. We give a polynomial time algorithm that achieves the optimal excess risk, namely O(p/ε). Note that the achieved excess risk does not have any logarithmic factors; this is shown using a "peeling"-type argument that is specific to convex functions.

The idea of our algorithm is to sample efficiently from the continuous distribution on all points in C with density P(θ) ∝ e^{−εL(θ)}. Although the distribution we hope to sample from is log-concave, standard techniques do not work for our purposes: existing methods converge only in statistical difference, whereas we require a multiplicative convergence guarantee to provide (ε, 0)-differential privacy. Previous solutions to this issue (Hardt and Talwar [25]) worked for the uniform distribution, but not for general log-concave distributions. The problem comes from the combination of an arbitrary convex set and an arbitrary (Lipschitz) loss function defining P. We circumvent this issue by giving an algorithm that samples from an appropriately defined distribution P̃ on a cube containing C, such that (i) the algorithm outputs a point in C with constant probability, and (ii) P̃, conditioned on sampling from C, is within multiplicative distance O(ε) from the correct distribution.
We use, as a subroutine, the random walk on grid points of the cube of [2].

Localization: Optimal Algorithms for Strongly Convex Functions. The exponential-sampling-based technique discussed above does not take advantage of strong convexity of the loss function. We show, however, that a novel combination of two standard techniques (the exponential mechanism and Laplace-noise-based output perturbation) does yield an optimal algorithm. Chaudhuri et al. [11] and [41] showed that strongly convex functions have low-sensitivity minimizers, and hence that one can release the minimizer of a strongly convex function with Laplace noise (of total Euclidean length about ρ = p/(∆nε) if each loss function is ∆-strongly convex). Simply using this first estimate as a candidate output does not yield optimal utility in general; instead it gives a risk bound of roughly p/(∆ε). The main insight is that this first estimate defines a small neighborhood C′ ⊆ C, of radius about ρ, that contains the true minimizer. Running the exponential mechanism in this small set improves the excess risk bound by a factor of about ρ over running the same mechanism on all of C. The final risk bound is then Õ(ρ · p/ε) = Õ(p²/(∆ε²n)), which matches the lower bound of Ω(p²/(ε²n)) when ∆ = Ω(1). This simple "localization" idea is not needed for (ε, δ)-privacy, since the gradient descent method can already take advantage of strong convexity to converge more quickly.

Lower Bounds. We use techniques developed to bound the accuracy of releasing 1-way marginals (due to Hardt and Talwar [25] for (ε, 0)- and Bun et al. [8] for (ε, δ)-privacy) to show that our algorithms have essentially optimal risk bounds. The instances that arise in our lower bounds are simple: the functions can be linear (or quadratic, for the case of strong convexity), and the constraint set C can be either the unit ball or the hypercube.
In particular, our lower bounds apply to the special case of smooth functions, demonstrating the optimality of objective perturbation [11, 34] in that setting. The reduction to lower bounds for 1-way marginals is not quite black-box; we exploit specific properties of the instances used by Hardt and Talwar [25], Bun et al. [8].

Finally, we provide a much stronger lower bound on the utility of a specific algorithm, the Huberization-based algorithm proposed by Chaudhuri et al. [11] for support vector machines. In order to apply their algorithm to nonsmooth loss functions, they proposed smoothing the loss function by Huberization, and then running their algorithm (which requires smoothness for the privacy analysis) on the resulting, modified loss functions. We show that for any setting of the Huberization parameters, there are simple, one-dimensional nonsmooth loss functions for which the algorithm has error Ω(n). This bound justifies the effort we put into designing new algorithms for nonsmooth loss functions.

Generalization Error. In Appendix F, we discuss the implications of our results for generalization error. Specifically, suppose that the data are drawn i.i.d. from a distribution τ, and let ExcessRisk(θ) denote the expected loss of θ on unseen data from τ, that is, ExcessRisk(θ) = E_{d∼τ}[ℓ(θ; d)].

For an important class of loss functions, called generalized linear models, the straightforward application of our algorithms gives generalization error Õ(max{1/√n, p^{1/c}/(εn)}), where c = 1 for the case of (ε, 0)-differential privacy, and c = 2 for (ε, δ)-differential privacy (assuming log(1/δ) = O(log n)). This bound is tight: the 1/√n term is necessary even in the nonprivate setting, and the necessity of the p^{1/c}/(εn) term follows from our lower bounds on excess empirical risk (they are also lower bounds on generalization error).
For the case of general Lipschitz convex functions, a modification of our algorithms gives excess risk Õ(p^{c′}/√(εn)), where c′ = 1/2 for (ε, 0)-differential privacy and c′ = 1/4 for (ε, δ)-differential privacy (that is, the generalization error bound is roughly the square root of the corresponding empirical error bound). The best known lower bound, however, is the same as for the special case of generalized linear models. The bounds match when p ≈ n (in which case no nontrivial generalization error is possible). However, for smaller values of p there remains a gap that is polynomial in p. Closing the gap is an interesting open problem.

1.2 Other Related Work

In addition to the previous work mentioned above, we mention several closely related works. A rich line of work seeks to characterize the optimal error of differentially private algorithms for learning and optimization: Kasiviswanathan et al. [31], Beimel et al. [3], Chaudhuri and Hsu [9], Beimel et al. [4, 5]. In particular, our results on (ε, 0)-differential privacy imply nearly-tight bounds on the "representation dimension" (Beimel et al. [5]) of convex Lipschitz functions.

Jain and Thakurta [27] gave dimension-independent expected excess risk bounds for the special case of "generalized linear models" with a strongly convex regularizer, assuming that C = R^p (that is, unconstrained optimization). Kifer et al. [34], Smith and Thakurta [45] considered parameter convergence for high-dimensional sparse regression (where p ≫ n). The settings of those papers are orthogonal to ours, though a common generalization would be interesting.

Efficient implementations of the exponential mechanism over infinite domains were discussed by Hardt and Talwar [25], Chaudhuri et al. [12], and Kapralov and Talwar [29]. The latter two works were specific to sampling (approximately) singular vectors of a matrix, and their techniques do not obviously apply here.
Differentially private convex learning in different models has also been studied: for example, Jain et al. [28], Duchi et al. [17], Smith and Thakurta [44] study online optimization, and Jain and Thakurta [26] study an interactive model tailored to high-dimensional kernel learning. Convex optimization techniques have also played an important role in the development of algorithms for "simultaneous query release" (e.g., the line of work emerging from Hardt and Rothblum [24]). We do not know of a direct connection between those works and our setting.

1.3 Additional Definitions

For completeness, we state a few additional definitions related to convex sets and functions.

• ℓ: C → R is L-Lipschitz (in the Euclidean norm) if, for all pairs x, y ∈ C, we have |ℓ(x) − ℓ(y)| ≤ L‖x − y‖₂. The subdifferential of a convex function ℓ at x, denoted ∂ℓ(x), is the set of vectors z such that for all y ∈ C, ℓ(y) ≥ ℓ(x) + ⟨z, y − x⟩; its elements are called subgradients.

• ℓ is ∆-strongly convex on C if, for all x ∈ C, for all subgradients z at x, and for all y ∈ C, we have ℓ(y) ≥ ℓ(x) + ⟨z, y − x⟩ + (∆/2)‖y − x‖₂² (i.e., ℓ is bounded below by a quadratic function tangent at x).

• ℓ is β-smooth on C if, for all x ∈ C, for all subgradients z at x, and for all y ∈ C, we have ℓ(y) ≤ ℓ(x) + ⟨z, y − x⟩ + (β/2)‖y − x‖₂² (i.e., ℓ is bounded above by a quadratic function tangent at x). Smoothness implies differentiability, so the subgradient at x is unique.

• Given a convex set C, we denote its diameter by ‖C‖₂. We denote the projection of any vector θ ∈ R^p to the convex set C by Π_C(θ) = argmin_{x∈C} ‖θ − x‖₂.

1.4 Organization of this Paper

Our upper bounds (efficient algorithms) are given in Sections 2, 3, and 4, whereas our lower bounds are given in Section 5. Namely, in Section 2, we give an efficient construction of (ε, δ)-differentially private algorithms for general convex loss as well as Lipschitz strongly convex loss.
In Section 3, we discuss a pure ε-differentially private algorithm for general Lipschitz convex loss and outline an efficient construction of such an algorithm. In Section 4, we discuss our localization technique and show how to construct efficient pure ε-differentially private algorithms for Lipschitz strongly convex loss. We derive our lower bound for general Lipschitz convex loss in Section 5.1 and our lower bound for Lipschitz strongly convex loss in Section 5.2. In Section 6, we discuss a generic construction of an efficient algorithm for sampling (with a multiplicative distance guarantee) from a logconcave distribution over an arbitrary bounded convex set. As a by-product of our generic construction, we give the details of the construction of our efficient ε-differentially private algorithm from Section 3.2.

The appendices contain proof details and supplementary material: Appendix A shows that smoothing a nonsmooth loss function in order to apply the objective perturbation technique of Chaudhuri et al. [11] can introduce significant additional error. Appendix B gives details on the application of localization in the setting of (ε, δ)-differential privacy. Appendix C provides additional details on the proofs of lower bounds. In Appendix D, we explain standard modifications that allow our algorithms to give high-probability guarantees instead of expected risk guarantees. Finally, in Appendix F we discuss how our algorithms can be adapted to provide guarantees on generalization error, rather than empirical error.

2 Gradient Descent and Optimal (ε, δ)-differentially private Optimization

In this section we provide an algorithm, A_Noise-GD (Algorithm 1), for computing θ_priv using a noisy stochastic variant of the classic gradient descent algorithm from the optimization literature [7]. Our algorithm (and the utility analysis) was inspired by the approach of Williams and McSherry [49] for logistic regression.
All the excess risk bounds in this section and the rest of this paper are presented in expectation over the randomness of the algorithm. In Appendix D we provide a generic tool to translate the expectation bounds into high-probability bounds, albeit at the loss of an extra logarithmic factor in the inverse of the failure probability.

Note (1): The results in this section do not require the loss function ℓ to be differentiable. Although we present Algorithm A_Noise-GD (and its analysis) using the gradient of the loss function ℓ(θ; d) at θ, the same guarantees hold if, instead of the gradient, the algorithm is run with any subgradient of ℓ at θ.

Note (2): Instead of using the stochastic variant in Algorithm 1, one can use the complete gradient (i.e., ∇L(θ; D)) in Step 5 and still have the same utility guarantee as Theorem 2.4. However, the running time goes up by a factor of n.

Algorithm 1 A_Noise-GD: Differentially Private Gradient Descent
Input: Data set D = {d₁, ..., d_n}, loss function ℓ (with Lipschitz constant L), privacy parameters (ε, δ), convex set C, and the learning rate function η: [n²] → R.
1: Set noise variance σ² ← 32L²n² log(n/δ) log(1/δ)/ε².
2: θ̃₁: choose any point from C.
3: for t = 1 to n² − 1 do
4:   Pick d ∼_u D with replacement.
5:   θ̃_{t+1} = Π_C[θ̃_t − η(t)(n∇ℓ(θ̃_t; d) + b_t)], where b_t ∼ N(0, I_p σ²).
6: Output θ_priv = θ̃_{n²}.

Theorem 2.1 (Privacy guarantee). Algorithm A_Noise-GD (Algorithm 1) is (ε, δ)-differentially private.

Proof. At any time step t ∈ [n²] in Algorithm A_Noise-GD, fix the randomness due to sampling in Line 4. Let X_t(D) = n∇ℓ(θ̃_t; d) + b_t be a random variable defined over the randomness of b_t and conditioned on θ̃_t (see Line 5 for a definition), where d ∈ D is the data point picked in Line 4. Denote by µ_{X_t(D)}(y) the density of the random variable X_t(D) at y ∈ R^p.
For any two neighboring data sets D and D′, define the privacy loss random variable [22] to be

    W_t = log [ µ_{X_t(D)}(X_t(D)) / µ_{X_t(D′)}(X_t(D)) ].

Standard differential privacy arguments for Gaussian noise addition (see [34, 40]) ensure that, with probability 1 − δ/2 (over the randomness of the random variables b_t and conditioned on the randomness due to sampling), W_t ≤ ε/(2√(log(1/δ))) for all t ∈ [n²]. Now, using the following lemma (Lemma 2.2 with ε₀ = ε/(2√(log(1/δ))) and γ = 1/n), we ensure that, over the randomness of the b_t's and the randomness due to sampling in Line 4, with probability at least 1 − δ/2, W_t ≤ ε/(n√(log(1/δ))) for all t ∈ [n²]. While using Lemma 2.2, we ensure that the condition ε/(2√(log(1/δ))) ≤ 1 is satisfied.

Lemma 2.2 (Privacy amplification via sampling; Lemma 4 in [3]). Over a domain of data sets T^n, if an algorithm A is ε₀ ≤ 1 differentially private, then for any data set D ∈ T^n, executing A on uniformly random γn entries of D ensures 2γε₀-differential privacy.

To conclude the proof, we apply "strong composition" (Lemma 2.3) from [22]: with probability at least 1 − δ, the privacy loss W = Σ_{t=1}^{n²} W_t is at most ε. This concludes the proof.

Lemma 2.3 (Strong composition [22]). Let ε, δ′ ≥ 0. The class of ε-differentially private algorithms satisfies (ε′, δ′)-differential privacy under T-fold adaptive composition for ε′ = √(2T ln(1/δ′)) ε + Tε(e^ε − 1).

In the following we provide the utility guarantees for Algorithm A_Noise-GD under two different settings, namely, when the function ℓ is Lipschitz, and when the function ℓ is Lipschitz and strongly convex. In Section 5 we argue that these excess risk bounds are essentially tight.

Theorem 2.4 (Utility guarantee). Let σ² = O(L²n² log(n/δ) log(1/δ)/ε²). For θ_priv output by Algorithm A_Noise-GD we have the following. (The expectation is over the randomness of the algorithm.)

1.
Lipschitz functions: If we set the learning rate function η(t) = ‖C‖₂/√(t(n²L² + pσ²)), then we have the following excess risk bound, where L is the Lipschitz constant of the loss function ℓ:

    E[L(θ_priv; D) − L(θ*; D)] = O( L‖C‖₂ log^{3/2}(n/δ) √(p log(1/δ)) / ε ).

2. Lipschitz and strongly convex functions: If we set the learning rate function η(t) = 1/(∆nt), then we have the following excess risk bound, where L is the Lipschitz constant of the loss function ℓ and ∆ is the strong convexity parameter:

    E[L(θ_priv; D) − L(θ*; D)] = O( L² log²(n/δ) p log(1/δ) / (n∆ε²) ).

Proof. Let G_t = n∇ℓ(θ̃_t; d) + b_t as in Line 5 of Algorithm 1. First notice that, over the randomness of the sampling of the data entry d from D and the randomness of b_t, we have E[G_t] = ∇L(θ̃_t; D). Additionally, we have the following bound on E[‖G_t‖₂²]:

    E[‖G_t‖₂²] = n² E[‖∇ℓ(θ̃_t; d)‖₂²] + 2n E[⟨∇ℓ(θ̃_t; d), b_t⟩] + E[‖b_t‖₂²] ≤ n²L² + pσ²,   (2)

where σ² is the variance of b_t in Line 5. In the above expression we have used the fact that, since θ̃_t is independent of b_t, E[⟨∇ℓ(θ̃_t; d), b_t⟩] = 0. Also, we have E[‖b_t‖₂²] = pσ². We can now directly use Lemma 2.5 to obtain the required error guarantee for Lipschitz convex functions, and Lemma 2.6 for Lipschitz and strongly convex functions.

Lemma 2.5 (Theorem 2 from [43]). Let F(θ) (for θ ∈ C) be a convex function and let θ* = argmin_{θ∈C} F(θ). Let θ₁ be any arbitrary point from C. Consider the stochastic gradient descent algorithm θ_{t+1} = Π_C[θ_t − η(t) G_t(θ_t)], where E[G_t(θ_t)] = ∇F(θ_t), E[‖G_t‖₂²] ≤ G², and the learning rate function is η(t) = ‖C‖₂/(G√t). Then for any T > 1, the following is true:

    E[F(θ_T) − F(θ*)] = O( ‖C‖₂ G log T / √T ).
Using the bound from (2) in Lemma 2.5 (i.e., setting $G = \sqrt{n^2 L^2 + p\sigma^2}$), with $T = n^2$ and the learning rate function $\eta(t)$ as in Lemma 2.5, gives the required excess risk bound for Lipschitz convex functions. For Lipschitz and strongly convex functions we use the following result of [43].

Lemma 2.6 (Theorem 1 from [43]). Let $F(\theta)$ (for $\theta \in \mathcal{C}$) be a $\lambda$-strongly convex function and let $\theta^* = \arg\min_{\theta\in\mathcal{C}} F(\theta)$. Let $\theta_1$ be an arbitrary point from $\mathcal{C}$. Consider the stochastic gradient descent algorithm $\theta_{t+1} = \Pi_{\mathcal{C}}[\theta_t - \eta(t)G_t(\theta_t)]$, where $\mathbb{E}[G_t(\theta_t)] = \nabla F(\theta_t)$, $\mathbb{E}[\|G_t\|_2^2] \le G^2$, and the learning rate function is $\eta(t) = \frac{1}{\lambda t}$. Then for any $T > 1$, the following is true:
$$\mathbb{E}[F(\theta_T) - F(\theta^*)] = O\!\left(\frac{G^2 \log T}{\lambda T}\right).$$

Using the bound from (2) in Lemma 2.6 (i.e., setting $G = \sqrt{n^2 L^2 + p\sigma^2}$ and $\lambda = n\Delta$), with $T = n^2$ and the learning rate function $\eta(t)$ as in Lemma 2.6, gives the required excess risk bound for Lipschitz and strongly convex functions.

Note: Algorithm $\mathcal{A}_{\text{Noise-GD}}$ has a running time of $O(pn^2)$, assuming that the gradient computation for $\ell$ takes time $O(p)$. Variants of Algorithm $\mathcal{A}_{\text{Noise-GD}}$ have appeared in [49, 28, 17, 47]. The most relevant work in our current context is [47]. The main idea in [47] is to run stochastic gradient descent with gradients computed over small batches of disjoint samples from the data set (as opposed to the single sample used in Algorithm $\mathcal{A}_{\text{Noise-GD}}$). The issue with that algorithm is that it cannot provide an excess risk guarantee that is $o(\sqrt{n})$, where $n$ is the number of data samples. One observation we make is that if one removes the disjointness constraint and uses the amplification lemma (Lemma 2.2), then one can ensure a much tighter privacy guarantee for the same setting of parameters used in that paper.
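To make the structure of $\mathcal{A}_{\text{Noise-GD}}$ concrete, here is a minimal Python sketch on a toy instance: the linear loss $\ell(\theta; d) = -\langle\theta, d\rangle$ over the unit ball from Section 5.1 (so $L = 1$ and $\|\mathcal{C}\|_2 = 2$). It illustrates the update rule only and is not the paper's exact Algorithm 1; in particular, $\sigma$ is passed in directly rather than set from $(\epsilon, \delta)$ as in Theorem 2.4.

```python
import numpy as np

def project_ball(theta, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= radius else theta * (radius / nrm)

def noisy_sgd(D, sigma, L=1.0, diam=2.0, rng=None):
    """Sketch of A_Noise-GD for l(theta; d) = -<theta, d> over the unit ball.
    Each of the T = n^2 steps samples one record uniformly (this is what the
    amplification argument of Lemma 2.2 exploits), forms the noisy gradient
    estimate G_t = n * grad + b_t with b_t ~ N(0, sigma^2 I_p), and takes a
    projected step with eta(t) = ||C||_2 / sqrt(t (n^2 L^2 + p sigma^2))."""
    rng = np.random.default_rng(rng)
    n, p = D.shape
    theta = np.zeros(p)
    G_sq = n * n * L * L + p * sigma * sigma   # the bound (2) on E||G_t||^2
    for t in range(1, n * n + 1):
        d = D[rng.integers(n)]                 # sample one data point
        b = rng.normal(0.0, sigma, size=p)     # Gaussian noise for privacy
        G = n * (-d) + b                       # gradient of -<theta, d> is -d
        theta = project_ball(theta - (diam / np.sqrt(t * G_sq)) * G)
    return theta

# Toy data: biased random signs, scaled so each record has unit norm.
rng = np.random.default_rng(0)
p = 5
D = np.where(rng.random((200, p)) < 0.9, 1.0, -1.0) / np.sqrt(p)
theta_priv = noisy_sgd(D, sigma=5.0, rng=1)
theta_star = D.sum(axis=0) / np.linalg.norm(D.sum(axis=0))
excess = np.linalg.norm(D.sum(axis=0)) * (1.0 - theta_priv @ theta_star)
```

On this instance the excess risk equals $\|\sum_i d_i\|_2 (1 - \langle\theta^{priv}, \theta^*\rangle)$, which should fall well below the trivial bound of $2\|\sum_i d_i\|_2$ despite the injected noise.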
3 Exponential Sampling and Optimal $(\epsilon, 0)$-private Optimization

In this section, we focus on the case of pure $\epsilon$-differential privacy and provide an optimal efficient algorithm for empirical risk minimization for the general class of convex and Lipschitz loss functions. The main building block of this section is the well-known exponential mechanism [37]. First, we show that a variant of the exponential mechanism is optimal. A major technical contribution of this section is to make the exponential mechanism computationally efficient, which is discussed in Section 3.2.

3.1 Exponential Mechanism for Lipschitz Convex Loss

In this section we only deal with loss functions which are Lipschitz. We provide an $\epsilon$-differentially private algorithm (Algorithm 2) which achieves the optimal excess risk for arbitrary convex bounded sets.

Algorithm 2 $\mathcal{A}_{\text{exp-samp}}$: Exponential sampling based convex optimization
Input: Data set of size $n$: $D$, loss function $\ell$, privacy parameter $\epsilon$, and convex set $\mathcal{C}$.
1: $\mathcal{L}(\theta; D) = \sum_{i=1}^n \ell(\theta; d_i)$.
2: Sample a point $\theta^{priv}$ from the convex set $\mathcal{C}$ w.p. proportional to $\exp\!\left(-\frac{\epsilon}{2L\|\mathcal{C}\|_2}\,\mathcal{L}(\theta; D)\right)$ and output it.

Theorem 3.1 (Privacy guarantee). Algorithm 2 is $\epsilon$-differentially private.

Proof. First, notice that the distribution induced by the exponential weight function in Step 2 of Algorithm 2 is the same if we instead use $\exp\!\left(-\frac{\epsilon}{2L\|\mathcal{C}\|_2}\left(\mathcal{L}(\theta; D) - \mathcal{L}(\theta_0; D)\right)\right)$ for some arbitrary point $\theta_0 \in \mathcal{C}$. Since $\ell$ is $L$-Lipschitz, the sensitivity of $\mathcal{L}(\theta; D) - \mathcal{L}(\theta_0; D)$ is at most $L\|\mathcal{C}\|_2$. The proof then follows directly from the analysis of the exponential mechanism in [37].

In the following we prove the utility guarantee for Algorithm $\mathcal{A}_{\text{exp-samp}}$.

Theorem 3.2 (Utility guarantee). Let $\theta^{priv}$ be the output of $\mathcal{A}_{\text{exp-samp}}$ (Algorithm 2 above). Then, we have the following guarantee on the expected excess risk.
(The expectation is over the randomness of the algorithm.)
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) = O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right).$$

[Figure 1: A differential cone $\Omega$ centered at $\theta^*$ inside the convex set $\mathcal{C}$, partitioned into regions $A_1, A_2, A_3, A_4$ whose boundaries lie at distances $r_1, r_2, r_3, r_4$ from $\theta^*$.]

Proof. Consider a differential cone $\Omega$ centered at $\theta^*$ (see Figure 1). We will bound the expected excess risk of $\theta^{priv}$ by $O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right)$ conditioned on $\theta^{priv} \in \Omega \cap \mathcal{C}$, for every differential cone. This immediately implies the theorem by the properties of conditional expectation.

Let $\Gamma$ be a fixed threshold (to be set later) and, for brevity, let $R(\theta) = \mathcal{L}(\theta; D) - \mathcal{L}(\theta^*; D)$. Let the sets $A_i$ marked in Figure 1 be defined as $A_i = \{\theta \in \Omega \cap \mathcal{C} : (i-1)\Gamma \le R(\theta) \le i\Gamma\}$. Instead of directly computing the probability of $\theta^{priv}$ being outside $A_1$, we will analyze the probabilities of being in each of the $A_i$'s individually. This form of "peeling" argument has been used for risk analysis of convex losses in the machine learning literature (e.g., see [48]) and allows us to get rid of the extra logarithmic factor that would otherwise show up in the excess risk if we used the standard analysis of the exponential mechanism in [37].

Since $\Omega$ is a differential cone and $R(\theta)$ is continuous on $\mathcal{C}$, it follows that within $\Omega \cap \mathcal{C}$, $R(\theta)$ depends only on $\|\theta - \theta^*\|_2$. Therefore, let $r_1, r_2, \dots$ be the distances of the set boundaries of $A_1, A_2, \dots$ from $\theta^*$ (see Figure 1). One can equivalently write each $A_i$ as $A_i = \{\theta \in \Omega \cap \mathcal{C} : r_{i-1} < \|\theta - \theta^*\|_2 \le r_i\}$.

The following claim is the key part of the proof.

Claim 3.3. Convexity of $R(\theta)$ for all $\theta \in \mathcal{C}$ implies that $r_i - r_{i-1} \le r_{i-1} - r_{i-2}$ for all $i \ge 3$.

Proof. Since by definition $\theta^*$ is the minimizer of $R(\theta)$ within $\mathcal{C}$ and $R(\theta)$ is convex, we have $R(\theta_2) \ge R(\theta_1)$ for any $\theta_1, \theta_2 \in \mathcal{C} \cap \Omega$ such that $\|\theta_2 - \theta^*\|_2 \ge \|\theta_1 - \theta^*\|_2$.
This directly implies the required bound.

Now, the volume of the set $A_i$ is given by $\mathrm{Vol}(A_i) = \kappa \int_{r_{i-1}}^{r_i} r^{p-1}\,dr$ for some fixed constant $\kappa$. Hence,
$$\frac{\mathrm{Vol}(A_i)}{\mathrm{Vol}(A_2)} = \frac{r_{i-1}^p}{r_1^p} \cdot \frac{(r_i/r_{i-1})^p - 1}{(r_2/r_1)^p - 1} \le \frac{r_{i-1}^p}{r_1^p} \le (i-1)^p,$$
where the last two inequalities follow from Claim 3.3. Hence, we get the following bound on the probability that the excess risk $R(\theta^{priv})$ is at least $4\Gamma$, conditioned on $\theta^{priv} \in \mathcal{C} \cap \Omega$ (for brevity, we omit the conditioning from the probabilities below):
$$\Pr[R(\theta^{priv}) \ge 4\Gamma] \le \frac{\Pr\!\left[\theta^{priv} \in \bigcup_{i=4}^{\infty} A_i\right]}{\Pr[\theta^{priv} \in A_2]} \le \sum_{i=4}^{\infty} \frac{\mathrm{Vol}(A_i)}{\mathrm{Vol}(A_2)}\, e^{-\frac{(i-3)\Gamma\epsilon}{2L\|\mathcal{C}\|_2}} \le \sum_{i=4}^{\infty} (i-1)^p\, e^{-\frac{(i-3)\Gamma\epsilon}{2L\|\mathcal{C}\|_2}} \le \frac{3^p e^{-\frac{\Gamma\epsilon}{2L\|\mathcal{C}\|_2}}}{1 - 2^p e^{-\frac{\Gamma\epsilon}{2L\|\mathcal{C}\|_2}}},$$
where the last inequality follows from the fact that $(i-1)^p \le 3^p \cdot 2^{(i-4)p}$ for $i \ge 4$. Hence, for every $t > 0$, if we choose $\Gamma = \frac{2L\|\mathcal{C}\|_2}{\epsilon}\left((p+1)\ln 3 + t\right)$, then, conditioned on $\theta^{priv} \in \mathcal{C} \cap \Omega$, we get
$$\Pr\!\left[R(\theta^{priv}) \ge \frac{8L\|\mathcal{C}\|_2}{\epsilon}\left((p+1)\ln 3 + t\right)\right] \le e^{-t}.$$
Since this is true for every $t > 0$, we have the required bound as a corollary.

3.2 Efficient Implementation of Algorithm $\mathcal{A}_{\text{exp-samp}}$ (Algorithm 2)

In this section, we give a high-level description of a computationally efficient construction of Algorithm 2. Our algorithm runs in time polynomial in $n, p$ and outputs a sample $\theta \in \mathcal{C}$ from a distribution that is arbitrarily close (in the multiplicative sense) to the distribution of the output of Algorithm 2. Since we are interested in an efficient pure $\epsilon$-differentially private algorithm, we need an efficient sampler with a multiplicative distance guarantee. In fact, if we were interested in $(\epsilon,\delta)$ algorithms, efficient sampling with a total variation guarantee would have sufficed, which would have made our task much easier, as we could have used one of the existing algorithms, e.g., [36].
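Step 2 of Algorithm 2 is the only nontrivial operation. In one dimension it can be illustrated by brute-force discretization; the grid below is purely illustrative (a grid over $\mathcal{C}$ is exponential in $p$, which is exactly why the efficient sampler of this section is needed). The scalar records and the absolute-deviation loss are hypothetical choices, giving a private median.

```python
import numpy as np

def exp_samp_1d(D, eps, L=1.0, grid_size=2001, rng=None):
    """Discretized illustration of A_exp-samp with C = [-1, 1] (||C||_2 = 2)
    and the 1-Lipschitz loss l(theta; d) = |theta - d|: sample theta with
    probability proportional to exp(-(eps / (2 L ||C||_2)) * L(theta; D))."""
    rng = np.random.default_rng(rng)
    diam = 2.0
    grid = np.linspace(-1.0, 1.0, grid_size)
    # Empirical loss L(theta; D) = sum_i |theta - d_i| at every grid point.
    losses = np.abs(grid[None, :] - np.asarray(D)[:, None]).sum(axis=0)
    log_w = -(eps / (2.0 * L * diam)) * losses
    log_w -= log_w.max()                  # stabilize before exponentiating
    w = np.exp(log_w)
    return rng.choice(grid, p=w / w.sum())

D = [0.2, 0.25, 0.28, 0.3, 0.7]           # hypothetical scalar records
theta_priv = exp_samp_1d(D, eps=2.0, rng=0)
theta_high_eps = exp_samp_1d(D, eps=500.0, rng=0)
```

As $\epsilon$ grows, the output concentrates around the empirical median $0.28$ (the minimizer of $\sum_i |\theta - d_i|$), illustrating the privacy/utility trade-off behind Theorem 3.2.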
In [25], it was shown how to sample efficiently, with a multiplicative guarantee, from the uniform distribution over a bounded convex set. However, what we want to achieve here is more general: to sample efficiently from any given logconcave distribution defined over a bounded convex set. To the best of our knowledge, this task has not been explicitly worked out before; nevertheless, all the ingredients needed to accomplish it are present in the literature, mainly [2]. We highlight here the main ideas of our construction. Since the construction is not specific to our privacy problem and could be of independent interest, in this section we only provide a high-level description; all the details and the proof of our main result (Theorem 3.4 below) are deferred to Section 6.

Theorem 3.4. There is an efficient version of Algorithm 2 with the following guarantees.
1. Privacy: The algorithm is $\epsilon$-differentially private.
2. Utility: The output $\theta^{priv} \in \mathcal{C}$ of the algorithm satisfies $\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) = O\!\left(\frac{pL\|\mathcal{C}\|_2}{\epsilon}\right)$.
3. Running time: Assuming $\mathcal{C}$ is in isotropic position (the non-isotropic case is discussed below), the algorithm runs in time $O\!\left(\|\mathcal{C}\|_2^2\, p^3 n^3 \max\{p \log(\|\mathcal{C}\|_2\, p\, n),\ \|\mathcal{C}\|_2\, n\}\right)$.

In fact, the running time of our algorithm depends on $\|\mathcal{C}\|_\infty$ rather than $\|\mathcal{C}\|_2$; namely, all the $\|\mathcal{C}\|_2$ terms in the running time can be replaced with $\|\mathcal{C}\|_\infty$. However, we chose to write it in this less conservative way since all the bounds in this paper are expressed in terms of $\|\mathcal{C}\|_2$.

Before describing our construction, we first introduce some useful notation and discuss some preliminaries. For any two probability measures $\mu, \nu$ defined with respect to the same sample space $\mathcal{Q} \subseteq \mathbb{R}^p$, the relative (multiplicative) distance between $\mu$ and $\nu$, denoted $\mathrm{Dist}_\infty(\mu, \nu)$, is defined as
$$\mathrm{Dist}_\infty(\mu, \nu) = \sup_{q \in \mathcal{Q}} \left|\log \frac{d\mu(q)}{d\nu(q)}\right|,$$
where $\frac{d\mu(q)}{d\nu(q)}$ (resp., $\frac{d\nu(q)}{d\mu(q)}$) denotes the ratio of the two measures (more precisely, the Radon-Nikodym derivative).

Assumptions: We assume that we can efficiently test whether a given point $\theta \in \mathbb{R}^p$ lies in $\mathcal{C}$ using a membership oracle. We also assume that we can efficiently optimize an efficiently computable convex function over a convex set; for this, it suffices to have a projection oracle. We do not account for the extra polynomial factor in the running time required to perform such operations, since this factor is highly dependent on the specific structure of the set $\mathcal{C}$.

When $\mathcal{C}$ is not isotropic: In Theorem 3.4 and in the rest of this section, we assume that $\mathcal{C}$ is in isotropic position; in particular, we assume that $B \subseteq \mathcal{C} \subseteq pB$. If the convex set is not in isotropic position, fortunately, we know of efficient algorithms for placing it in isotropic position (for example, the algorithm of [35]). In that case, the first step of our algorithm is to transform $\mathcal{C}$ into isotropic position (and apply the corresponding transformation to the loss function). This step takes time polynomial in $p$ with an additional polylog factor in $\frac{1}{r}$, where $r > 0$ is the diameter of the largest ball we can fit inside $\mathcal{C}$ (see [35] and [23]). Specifically, if $r = \Omega(1)$, then our set $\mathcal{C}$ is already in isotropic position. We then sample efficiently from the transformed set and, finally, apply the inverse transformation to the output sample to obtain a sample from the desired distribution over $\mathcal{C}$ in its original position. In the worst case, the isotropic transformation can amplify the diameter of $\mathcal{C}$ by a factor of $p$. Putting all this together, the running time in Theorem 3.4 picks up an extra factor of $O\!\left(\max\{p^3, \mathrm{polylog}\frac{1}{r}\}\right)$ in the worst case if $\mathcal{C}$ is not in isotropic position.

3.3 Our construction

Let $\tau$ denote the $L_\infty$ diameter of $\mathcal{C}$.
The Minkowski norm of $\theta \in \mathbb{R}^p$ with respect to $\mathcal{C}$, denoted $\psi(\theta)$, is defined as $\psi(\theta) = \inf\{r > 0 : \theta \in r\,\mathcal{C}\}$. We define $\bar{\psi}_\alpha(\theta) \triangleq \alpha \cdot \max\{0, \psi(\theta) - 1\}$ for $\alpha > 0$. Note that $\bar{\psi}_\alpha(\theta) > 0$ if and only if $\theta \notin \mathcal{C}$. Moreover, it is not hard to verify that $\bar{\psi}_\alpha$ is $\alpha$-Lipschitz. We use the grid-walk algorithm of [2] for sampling from a logconcave distribution defined over a cube as a building block. Our construction is as follows:

1. Enclose the set $\mathcal{C}$ in a cube $A$ with edges of length $\tau$.
2. Obtain a convex Lipschitz extension $\bar{\mathcal{L}}(\,\cdot\,; D)$ of the loss function $\mathcal{L}(\,\cdot\,; D)$ over $A$. This can be done efficiently using a projection oracle.
3. Define $F(\theta) \triangleq e^{-\frac{\epsilon}{2L\|\mathcal{C}\|_2}\bar{\mathcal{L}}(\theta; D) - \bar{\psi}_\alpha(\theta)}$, $\theta \in A$, for a specific choice of $\alpha$ (see Section 6 for details).
4. Run the grid-walk algorithm of [2] with $F$ as the input weight function and $A$ as the input cube, and output a sample $\theta$ whose distribution is close, with respect to $\mathrm{Dist}_\infty$, to the distribution induced by $F$ on $A$, namely $\frac{F(\theta)}{\int_{v \in A} F(v)\,dv}$, $\theta \in A$.

Note that what we have so far is an efficient procedure (call it $\mathcal{A}_{\text{cube-samp}}$) that outputs a sample from a distribution over $A$ that is close, with respect to $\mathrm{Dist}_\infty$, to the continuous distribution $\frac{F(u)}{\int_{v \in A} F(v)\,dv}$, $u \in A$. We then argue that, by the choices made for the parameter values above, $\mathcal{A}_{\text{cube-samp}}$ outputs a sample in $\mathcal{C}$ with probability at least $\frac{1}{2}$. That is, the algorithm succeeds in outputting a sample from a distribution close to the right distribution on $\mathcal{C}$ with probability at least $1/2$. Hence, we can amplify the probability of success by repeating $\mathcal{A}_{\text{cube-samp}}$ sufficiently many times, using fresh random coins each time (specifically, $O(n)$ iterations suffice).
If $\mathcal{A}_{\text{cube-samp}}$ returns a sample $\theta \in \mathcal{C}$ in one of those iterations, then our algorithm terminates and outputs $\theta$. Otherwise, it outputs a uniformly random sample $\theta_\perp$ from the unit ball $B$ (note that $B \subseteq \mathcal{C}$ since $\mathcal{C}$ is in isotropic position). We finally show that this termination condition can only change the distribution of the output sample by a constant factor sufficiently close to $1$. Hence, we obtain the efficient algorithm referred to in Theorem 3.4.

4 Localization and Optimal Private Algorithms for Strongly Convex Loss

It is unclear how to obtain a direct variant of Algorithm 2 from Section 3 that achieves optimal excess risk guarantees for Lipschitz and strongly convex losses. The issue in extending Algorithm 2 directly is that the convex set $\mathcal{C}$ over which the exponential mechanism is defined is "too large" to provide tight guarantees. We show a generic $\epsilon$-differentially private algorithm for minimizing Lipschitz strongly convex loss functions, based on a combination of a simple pre-processing step (called the localization step) and any generic $\epsilon$-differentially private algorithm for Lipschitz convex loss functions. We carry out the localization step using a simple output perturbation algorithm, which ensures that the convex set over which the $\epsilon$-differentially private algorithm (in the second step) is run has diameter $\tilde{O}(p/(\epsilon n))$. Next, we instantiate the generic $\epsilon$-differentially private algorithm in the second step with our efficient exponential mechanism of Section 3.1 (Algorithm 2) to obtain an algorithm with optimal excess risk bound (Theorem 4.3).

Note: The localization technique is not specific to pure $\epsilon$-differential privacy, and extends naturally to the $(\epsilon,\delta)$ case, although this is not relevant in our current context, since we already have a gradient descent based algorithm that achieves the optimal excess risk bound. We defer the details of the $(\epsilon,\delta)$ case to Appendix B.
Details of the generic algorithm: We first give a simple algorithm that carries out the desired localization step. The crux of the algorithm is the same as that of the output perturbation algorithm of [10, 11]. The high-level idea is to first compute $\theta^* = \arg\min_{\theta\in\mathcal{C}} \mathcal{L}(\theta; D)$ and then add noise calibrated to the sensitivity of $\theta^*$. The details are given in Algorithm 3.

Algorithm 3 $\mathcal{A}_{\text{out-pert}}$: Output Perturbation for Strongly Convex Loss
Input: data set of size $n$: $D$, loss function $\ell$, strong convexity parameter $\Delta$, privacy parameter $\epsilon$, convex set $\mathcal{C}$, and radius parameter $\zeta$.
1: $\mathcal{L}(\theta; D) = \sum_{i=1}^n \ell(\theta; d_i)$.
2: Find $\theta^* = \arg\min_{\theta\in\mathcal{C}} \mathcal{L}(\theta; D)$.
3: $\theta_0 = \Pi_{\mathcal{C}}(\theta^* + b)$, where $b$ is a random noise vector with density $\frac{1}{\alpha} e^{-\frac{\epsilon n \Delta}{2L}\|b\|_2}$ ($\alpha$ is a normalizing constant) and $\Pi_{\mathcal{C}}$ is the projection onto the convex set $\mathcal{C}$.
4: Output $\mathcal{C}_0 = \left\{\theta \in \mathcal{C} : \|\theta - \theta_0\|_2 \le \zeta\frac{2Lp}{\Delta\epsilon n}\right\}$.

Having Algorithm 3 in hand, we now give a generic $\epsilon$-differentially private algorithm for minimizing $\mathcal{L}$ over $\mathcal{C}$. Let $\mathcal{A}_{\text{gen-Lip}}$ denote any generic $\epsilon$-differentially private algorithm for optimizing $\mathcal{L}$ over some arbitrary convex set $\tilde{\mathcal{C}} \subseteq \mathcal{C}$. Algorithm 2 from Section 3.1, or its efficient version Algorithm 7 (see Theorem 3.4 and Section 6 for details), is an example of $\mathcal{A}_{\text{gen-Lip}}$. The algorithm we present here (Algorithm 4 below) makes a black-box call in its first step to $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$ (Algorithm 3 shown above); in the second step, it feeds the output of $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$ into $\mathcal{A}^{\epsilon/2}_{\text{gen-Lip}}$ and outputs the result.

Algorithm 4 Output-perturbation-based Generic Algorithm
Input: data set of size $n$: $D$, loss function $\ell$, strong convexity parameter $\Delta$, privacy parameter $\epsilon$, and convex set $\mathcal{C}$.
1: Run $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$ (Algorithm 3) with input privacy parameter $\epsilon/2$ and radius parameter $\zeta = 3\log(n)$, and output $\mathcal{C}_0$.
2: Run $\mathcal{A}^{\epsilon/2}_{\text{gen-Lip}}$ on inputs $n$, $D$, $\ell$, privacy parameter $\epsilon/2$, and convex set $\mathcal{C}_0$, and output $\theta^{priv}$.
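A minimal sketch of the localization step (Algorithm 3), assuming $\theta^*$ has already been computed and omitting the projection $\Pi_{\mathcal{C}}$ for brevity. Sampling the noise $b$ uses the standard decomposition of the density $\propto e^{-(\epsilon n \Delta/(2L))\|b\|_2}$ into a Gamma-distributed norm and a uniformly random direction; the parameter values below are illustrative.

```python
import numpy as np

def out_pert(theta_star, eps, n, L, Delta, zeta, rng=None):
    """Sketch of A_out-pert: add noise b with density proportional to
    exp(-(eps*n*Delta/(2L)) * ||b||_2), i.e. ||b||_2 ~ Gamma(p, 2L/(eps*n*Delta))
    times a uniformly random direction, then return the localized ball's
    center theta0 and radius zeta * 2 L p / (Delta * eps * n).
    (Projection onto C is omitted here.)"""
    rng = np.random.default_rng(rng)
    p = theta_star.shape[0]
    norm_b = rng.gamma(shape=p, scale=2.0 * L / (eps * n * Delta))
    direction = rng.normal(size=p)
    direction /= np.linalg.norm(direction)
    theta0 = theta_star + norm_b * direction
    radius = zeta * 2.0 * L * p / (Delta * eps * n)
    return theta0, radius

theta_star = np.zeros(10)            # stand-in for argmin of L(.; D) over C
n = 10_000
# eps=0.25 plays the role of eps/2 when the overall budget is eps = 0.5,
# and zeta = 3 log n, exactly as in Step 1 of Algorithm 4.
theta0, radius = out_pert(theta_star, eps=0.25, n=n, L=1.0, Delta=1.0,
                          zeta=3 * np.log(n), rng=0)
```

With these settings the returned ball has diameter $O(Lp\log n/(\Delta\epsilon n))$, which is precisely the quantity that Lemma 4.2 feeds into the second-stage algorithm.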
Theorem 4.1 (Privacy guarantee). Algorithm 4 is $\epsilon$-differentially private.

Proof. The privacy guarantee follows directly from the composition theorem, together with the fact that $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$ is $\frac{\epsilon}{2}$-differentially private (see [11]) and that $\mathcal{A}^{\epsilon/2}_{\text{gen-Lip}}$ is $\frac{\epsilon}{2}$-differentially private by assumption.

In the following lemma, we provide a generic expression for the excess risk of Algorithm 4 in terms of the expected excess risk of any given algorithm $\mathcal{A}_{\text{gen-Lip}}$.

Lemma 4.2 (Generic utility guarantee). Let $\tilde{\theta}$ denote the output of Algorithm $\mathcal{A}_{\text{gen-Lip}}$ on inputs $n$, $D$, $\ell$, $\epsilon$, $\tilde{\mathcal{C}}$ (for an arbitrary convex set $\tilde{\mathcal{C}} \subseteq \mathcal{C}$), and let $\hat{\theta}$ denote the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $\tilde{\mathcal{C}}$. If
$$\mathbb{E}\!\left[\mathcal{L}(\tilde{\theta}; D) - \mathcal{L}(\hat{\theta}; D)\right] \le F\!\left(p, n, \epsilon, L, \|\tilde{\mathcal{C}}\|_2\right)$$
for some function $F$, then the output $\theta^{priv}$ of Algorithm 4 satisfies
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) = O\!\left(F\!\left(p, n, \epsilon, L, O\!\left(\tfrac{Lp\log(n)}{\Delta\epsilon n}\right)\right)\right),$$
where $\theta^* = \arg\min_{\theta\in\mathcal{C}} \mathcal{L}(\theta; D)$.

Proof. The proof follows from the fact that, in Algorithm $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$, the norm of the noise vector $\|b\|_2$ is distributed according to the Gamma distribution $\Gamma\!\left(p, \frac{4L}{\Delta\epsilon n}\right)$ and hence satisfies $\Pr\!\left[\|b\|_2 \le \zeta\frac{4Lp}{\Delta\epsilon n}\right] \ge 1 - e^{-\zeta}$ (see, for example, [11]). Now, set $\zeta = 3\log(n)$. Hence, with probability $1 - \frac{1}{n^3}$, $\mathcal{C}_0$ (the output of $\mathcal{A}^{\epsilon/2}_{\text{out-pert}}$) contains $\theta^*$. By setting $\tilde{\mathcal{C}}$ in the statement of the lemma to $\mathcal{C}_0$ (and noting that $\|\mathcal{C}_0\|_2 = O\!\left(\frac{Lp\log(n)}{\Delta\epsilon n}\right)$), then, conditioned on the event that $\mathcal{C}_0$ contains $\theta^*$, we have $\hat{\theta} = \theta^*$ and hence
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) \,\middle|\, \theta^* \in \mathcal{C}_0\right] \le F\!\left(p, n, \epsilon, L, O\!\left(\tfrac{Lp\log(n)}{\Delta\epsilon n}\right)\right).$$
Thus,
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) \le F\!\left(p, n, \epsilon, L, O\!\left(\tfrac{Lp\log(n)}{\Delta\epsilon n}\right)\right)\left(1 - \tfrac{1}{n^3}\right) + nL\|\mathcal{C}\|_2 \cdot \tfrac{1}{n^3}.$$
Note that the second term on the right-hand side is $O(\frac{1}{n^2})$. From our lower bound (Section 5.2 below), $F(\cdot, n, \cdot, \cdot, \cdot)$ must be at least $\Omega(\frac{1}{n})$. Hence, we have
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) = O\!\left(F\!\left(p, n, \epsilon, L, O\!\left(\tfrac{Lp\log(n)}{\Delta\epsilon n}\right)\right)\right),$$
which completes the proof of the lemma.
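The concentration fact driving Lemma 4.2 — $\Pr\!\left[\|b\|_2 \le \zeta\,\frac{4Lp}{\Delta\epsilon n}\right] \ge 1 - e^{-\zeta}$ when $\|b\|_2 \sim \Gamma\!\left(p, \frac{4L}{\Delta\epsilon n}\right)$ — is easy to check by simulation. Since the threshold is $\zeta p$ times the scale, the scale cancels and we can fix it to $1$; the parameter values below are arbitrary.

```python
import numpy as np

# Monte Carlo check of Pr[ ||b||_2 <= zeta * p * scale ] >= 1 - exp(-zeta)
# for ||b||_2 ~ Gamma(shape=p, scale). The scale cancels, so set scale = 1.
rng = np.random.default_rng(0)
p, zeta, trials = 10, 3.0, 200_000
samples = rng.gamma(shape=p, scale=1.0, size=trials)
empirical = np.mean(samples <= zeta * p)
lower_bound = 1.0 - np.exp(-zeta)          # about 0.9502
print(empirical, lower_bound)
```

With $\zeta = 3\log n$, as in Algorithm 4, the failure probability $e^{-\zeta}$ becomes $1/n^3$, which is where the $1 - \frac{1}{n^3}$ term in the proof comes from.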
Instantiation of $\mathcal{A}^{\epsilon/2}_{\text{gen-Lip}}$ with the exponential sampling algorithm: Next, we give our optimal $\epsilon$-differentially private algorithm for Lipschitz strongly convex loss functions. To do this, we instantiate the generic algorithm $\mathcal{A}_{\text{gen-Lip}}$ in Algorithm 4 with our exponential sampling algorithm from Section 3.1 (Algorithm 2), or its efficient version $\mathcal{A}_{\text{eff-exp-samp}}$ (see Section 3.2), to obtain the optimal excess risk bound. We formally state the bound in Theorem 4.3 below. The proof of Theorem 4.3 follows from Theorem 3.2 and Lemma 4.2 above.

Theorem 4.3 (Utility guarantee with Algorithm 7 as an instantiation of $\mathcal{A}_{\text{gen-Lip}}$). Suppose we replace $\mathcal{A}^{\epsilon/2}_{\text{gen-Lip}}$ in Algorithm 4 with Algorithm 2 (Section 3.1), or its efficient version Algorithm 7 (see Theorem 3.4 and Section 6 for details). Then, the output $\theta^{priv}$ satisfies
$$\mathbb{E}\!\left[\mathcal{L}(\theta^{priv}; D)\right] - \mathcal{L}(\theta^*; D) = O\!\left(\frac{p^2 L^2 \log(n)}{n\Delta\,\epsilon^2}\right),$$
where $\theta^* = \arg\min_{\theta\in\mathcal{C}} \mathcal{L}(\theta; D)$.

5 Lower Bounds on Excess Risk

In this section, we complete the picture by deriving lower bounds on the excess risk incurred by differentially private algorithms for risk minimization. As before, for a dataset $D = \{d_1, \dots, d_n\}$, our decomposable loss function is expressed as $\mathcal{L}(\theta; D) = \sum_{i=1}^n \ell(\theta; d_i)$, $\theta \in \mathcal{C}$, for some convex set $\mathcal{C} \subset \mathbb{R}^p$. In Section 5.1, we consider the case of convex Lipschitz loss functions, whereas in Section 5.2, we consider the case of strongly convex and Lipschitz loss functions.

Before we state and prove our lower bounds, we first give the following useful lemma, which provides lower bounds on the $L_2$-error incurred by $\epsilon$- and $(\epsilon,\delta)$-differentially private algorithms for estimating the 1-way marginals of datasets over $\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$. This lemma is based on the results of [25] and [8]; for the sake of completeness, we give a detailed proof in Appendix C.

Lemma 5.1 (Lower bounds for 1-way marginals). 1.
$\epsilon$-differentially private algorithms: Let $n, p \in \mathbb{N}$ and $\epsilon > 0$. There is a number $M = \Omega(\min(n, p/\epsilon))$ such that for every $\epsilon$-differentially private algorithm $\mathcal{A}$, there is a dataset $D = \{d_1, \dots, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$ such that, with probability at least $1/2$ (taken over the algorithm's random coins), we have
$$\|\mathcal{A}(D) - q(D)\|_2 = \Omega\!\left(\min\!\left(1, \frac{p}{\epsilon n}\right)\right),$$
where $q(D) = \frac{1}{n}\sum_{i=1}^n d_i$.

2. $(\epsilon,\delta)$-differentially private algorithms: Let $n, p \in \mathbb{N}$, $\epsilon > 0$, and $\delta = o(\frac{1}{n})$. There is a number $M = \Omega(\min(n, \sqrt{p}/\epsilon))$ such that for every $(\epsilon,\delta)$-differentially private algorithm $\mathcal{A}$, there is a dataset $D = \{d_1, \dots, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$ such that, with probability at least $1/3$ (taken over the algorithm's random coins), we have
$$\|\mathcal{A}(D) - q(D)\|_2 = \Omega\!\left(\min\!\left(1, \frac{\sqrt{p}}{\epsilon n}\right)\right),$$
where $q(D) = \frac{1}{n}\sum_{i=1}^n d_i$.

5.1 Lower bounds for Lipschitz Convex Functions

In this section, we give lower bounds for both $\epsilon$- and $(\epsilon,\delta)$-differentially private algorithms for minimizing any convex Lipschitz loss function $\mathcal{L}(\theta; D)$. We consider the following loss function. Define
$$\ell(\theta; d) = -\langle\theta, d\rangle, \quad \theta \in B,\ d \in \{-\tfrac{1}{\sqrt{p}}, \tfrac{1}{\sqrt{p}}\}^p. \quad (3)$$
For any dataset $D = \{d_1, \dots, d_n\}$ with data points drawn from $\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$, and any $\theta \in B$, define
$$\mathcal{L}(\theta; D) = -\left\langle\theta, \sum_{i=1}^n d_i\right\rangle. \quad (4)$$
Clearly, $\mathcal{L}$ is linear and, hence, Lipschitz and convex. Note that, whenever $\left\|\sum_{i=1}^n d_i\right\|_2 > 0$, $\theta^* = \frac{\sum_{i=1}^n d_i}{\left\|\sum_{i=1}^n d_i\right\|_2}$ is the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$. Next, we show lower bounds on the excess risk incurred by any $\epsilon$- and $(\epsilon,\delta)$-differentially private algorithm with output $\theta^{priv} \in B$.

Theorem 5.2 (Lower bound for $\epsilon$-differentially private algorithms). Let $n, p \in \mathbb{N}$ and $\epsilon > 0$. For every $\epsilon$-differentially private algorithm (whose output is denoted by $\theta^{priv}$), there is a dataset $D = \{d_1, \dots$
$, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ such that, with probability at least $1/2$ (over the algorithm's random coins), we must have
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \Omega\!\left(\min\!\left(n, \frac{p}{\epsilon}\right)\right),$$
where $\theta^* = \frac{\sum_{i=1}^n d_i}{\left\|\sum_{i=1}^n d_i\right\|_2}$ is the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$ and $\mathcal{L}$ is given by (4).

Proof. Let $\mathcal{A}$ be an $\epsilon$-differentially private algorithm for minimizing $\mathcal{L}$ and let $\theta^{priv}$ denote its output. First, observe that for any $\theta \in B$ and dataset $D$,
$$\mathcal{L}(\theta; D) - \mathcal{L}(\theta^*; D) = \left\|\sum_{i=1}^n d_i\right\|_2 \left(1 - \langle\theta, \theta^*\rangle\right) \ge \frac{1}{2}\left\|\sum_{i=1}^n d_i\right\|_2 \|\theta - \theta^*\|_2^2.$$
This is due to the fact that $\|\theta - \theta^*\|_2^2 = \|\theta^*\|_2^2 + \|\theta\|_2^2 - 2\langle\theta, \theta^*\rangle$ and the fact that $\theta^*, \theta \in B$.

Let $M = \Omega(\min(n, p/\epsilon))$ be as in Part 1 of Lemma 5.1. Suppose, for the sake of contradiction, that for every dataset $D \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$, with probability more than $1/2$, we have $\|\theta^{priv} - \theta^*\|_2 \ne \Omega(1)$. Let $\tilde{\mathcal{A}}$ be the $\epsilon$-differentially private algorithm that first runs $\mathcal{A}$ on the data and then outputs $\frac{M}{n}\theta^{priv}$. Note that this implies that for every dataset $D \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$, with probability more than $1/2$, $\|\tilde{\mathcal{A}}(D) - q(D)\|_2 \ne \Omega\!\left(\min\!\left(1, \frac{p}{\epsilon n}\right)\right)$, which contradicts Part 1 of Lemma 5.1. Thus, there must exist a dataset $D \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 = \Omega(\min(n, p/\epsilon))$ such that, with probability at least $1/2$, we have $\|\theta^{priv} - \theta^*\|_2 = \Omega(1)$. Therefore, from the observation made in the previous paragraph, we have, with probability at least $1/2$,
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \Omega\!\left(\min\!\left(n, \frac{p}{\epsilon}\right)\right).$$

Theorem 5.3 (Lower bound for $(\epsilon,\delta)$-differentially private algorithms). Let $n, p \in \mathbb{N}$, $\epsilon > 0$, and $\delta = o(\frac{1}{n})$. For every $(\epsilon,\delta)$-differentially private algorithm (whose output is denoted by $\theta^{priv}$), there is a dataset $D = \{d_1, \dots$
$, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ such that, with probability at least $1/3$ (over the algorithm's random coins), we must have
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \Omega\!\left(\min\!\left(n, \frac{\sqrt{p}}{\epsilon}\right)\right),$$
where $\theta^* = \frac{\sum_{i=1}^n d_i}{\left\|\sum_{i=1}^n d_i\right\|_2}$ is the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$ and $\mathcal{L}$ is given by (4).

Proof. We use Part 2 of Lemma 5.1 and follow the same lines as the proof of Theorem 5.2.

Dependence on $L$ and $\|\mathcal{C}\|_2$: Although our lower bounds above are derived for $L = 1$ and $\|\mathcal{C}\|_2 = 2$, one can easily get their counterparts in the general case, i.e., for arbitrary values of $L$ and $\|\mathcal{C}\|_2$. The only difference is that the lower bounds for the general case pick up an extra factor of $L\|\mathcal{C}\|_2$. To see this, let $L = \alpha$ and $\|\mathcal{C}\|_2 = 2\beta$ for arbitrary $\alpha, \beta > 0$. First, we change (inflate or shrink) the parameter set from $B$ to $\beta B$ and change our loss function in (3) to $\tilde{\ell}(\theta; d) = -\alpha\langle\theta, d\rangle$, $\theta \in \beta B$, $d \in \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$. Let us denote the corresponding big loss function by $\tilde{\mathcal{L}}$, and let $\theta^\dagger$ be the minimizer of $\tilde{\mathcal{L}}(\,\cdot\,; D)$. Note that $\theta^\dagger = \beta\frac{\sum_{i=1}^n d_i}{\left\|\sum_{i=1}^n d_i\right\|_2} = \beta\theta^*$, where $\theta^*$ is the minimizer of our original big loss function $\mathcal{L}(\,\cdot\,; D)$ given by (4). Finally, observe that for any $\theta \in \beta B$ and dataset $D$, we have
$$\tilde{\mathcal{L}}(\theta; D) - \tilde{\mathcal{L}}(\theta^\dagger; D) = \alpha\beta\left(\mathcal{L}\!\left(\tfrac{\theta}{\beta}; D\right) - \mathcal{L}(\theta^*; D)\right).$$
This shows that our bounds above get scaled by $L\|\mathcal{C}\|_2$ in the general case. Hence, given our upper bounds in Sections 2 and 3, our lower bounds in this section are tight for all values of $L$ and $\|\mathcal{C}\|_2$.

5.2 Lower bounds for Strongly Convex Functions

We give here lower bounds on the excess risk of $\epsilon$- and $(\epsilon,\delta)$-differentially private optimization algorithms for the class of strongly convex decomposable loss functions $\mathcal{L}(\theta; D)$. Let $\ell(\theta; d)$ be half the squared $L_2$-distance between $\theta \in B$ and $d \in \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$, that is,
$$\ell(\theta; d) = \frac{1}{2}\|\theta - d\|_2^2. \quad (5)$$
Note that $\ell$, as defined, is $1$-Lipschitz and $1$-strongly convex. For a dataset $D = \{d_1, \dots$
$, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$, the decomposable loss function is defined as
$$\mathcal{L}(\theta; D) = \frac{1}{2}\sum_{i=1}^n \|\theta - d_i\|_2^2. \quad (6)$$
Notice that the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$ is $\theta^* = \frac{1}{n}\sum d_i$, which is equal to $q(D)$ in the terminology of Lemma 5.1. Note also that we can write the excess risk as
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \frac{n}{2}\|\theta^{priv} - q(D)\|_2^2. \quad (7)$$

Theorem 5.4 (Lower bound for $\epsilon$-differentially private algorithms). Let $n, p \in \mathbb{N}$ and $\epsilon > 0$. For every $\epsilon$-differentially private algorithm (whose output is denoted by $\theta^{priv}$), there is a dataset $D = \{d_1, \dots, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ such that, with probability at least $1/2$ (over the algorithm's random coins), we must have
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \Omega\!\left(\min\!\left(n, \frac{p^2}{\epsilon^2 n}\right)\right),$$
where $\theta^* = \frac{1}{n}\sum d_i$ is the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$ and $\mathcal{L}$ is given by (6).

Proof. The proof follows directly from (7) and Part 1 of Lemma 5.1.

Theorem 5.5 (Lower bound for $(\epsilon,\delta)$-differentially private algorithms). Let $n, p \in \mathbb{N}$, $\epsilon > 0$, and $\delta = o(\frac{1}{n})$. For every $(\epsilon,\delta)$-differentially private algorithm (whose output is denoted by $\theta^{priv}$), there is a dataset $D = \{d_1, \dots, d_n\} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$ such that, with probability at least $1/3$ (over the algorithm's random coins), we must have
$$\mathcal{L}(\theta^{priv}; D) - \mathcal{L}(\theta^*; D) = \Omega\!\left(\min\!\left(n, \frac{p}{\epsilon^2 n}\right)\right),$$
where $\theta^* = \frac{1}{n}\sum d_i$ is the minimizer of $\mathcal{L}(\,\cdot\,; D)$ over $B$ and $\mathcal{L}$ is given by (6).

Proof. The proof follows directly from (7) and Part 2 of Lemma 5.1.

Dependence on $L$, $\|\mathcal{C}\|_2$, and $\Delta$: Our lower bounds in this section are derived for the case where $L = \Delta = 1$ and $\|\mathcal{C}\|_2 = 2$. For any values of $L$, $\|\mathcal{C}\|_2$, and $\Delta$ such that $\frac{\Delta\|\mathcal{C}\|_2}{L} = \Omega(1)$, these lower bounds pick up an extra factor of $\frac{L^2}{\Delta}$. To see this, let $L = \alpha$ and $\|\mathcal{C}\|_2 = 2\beta$ for arbitrary $\alpha, \beta > 0$. First, instead of the data universe $\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$, we choose the dataset entries from $\{-\frac{\beta}{\sqrt{p}}, \frac{\beta}{\sqrt{p}}\}^p$, and let any such dataset be denoted by $\tilde{D}$.
We change the parameter set from $B$ to $\beta B$ and change the loss function in (5) to $\tilde{\ell}(\theta; \tilde{d}) = \frac{\alpha}{2\beta}\|\theta - \tilde{d}\|_2^2$, $\theta \in \beta B$, $\tilde{d} \in \{-\frac{\beta}{\sqrt{p}}, \frac{\beta}{\sqrt{p}}\}^p$. Clearly, $\tilde{\ell}(\,\cdot\,; \tilde{d})$ is $\alpha$-Lipschitz and $\frac{\alpha}{\beta}$-strongly convex. Let $\tilde{\mathcal{L}}$ denote the corresponding big loss function and $\theta^\dagger$ denote the minimizer of $\tilde{\mathcal{L}}(\,\cdot\,; \tilde{D})$. Note that $\theta^\dagger = q(\tilde{D}) = \beta\theta^*$, where $\theta^*$ is the minimizer of our original big loss function $\mathcal{L}(\,\cdot\,; D)$ given by (6) for the dataset $D = \frac{1}{\beta}\tilde{D}$. Finally, observe that for any $\theta \in \beta B$ and dataset $\tilde{D} \subseteq \{-\frac{\beta}{\sqrt{p}}, \frac{\beta}{\sqrt{p}}\}^p$, we have
$$\tilde{\mathcal{L}}(\theta; \tilde{D}) - \tilde{\mathcal{L}}(\theta^\dagger; \tilde{D}) = \alpha\beta\left(\mathcal{L}\!\left(\tfrac{\theta}{\beta}; D\right) - \mathcal{L}(\theta^*; D)\right),$$
where $D = \frac{1}{\beta}\tilde{D} \subseteq \{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\}^p$. This shows that the lower bounds in this case get scaled by $\alpha\beta = \frac{\alpha^2}{\alpha/\beta} = \frac{L^2}{\Delta}$. In fact, this holds as long as $\frac{\Delta\|\mathcal{C}\|_2}{L} = \Omega(1)$. Hence, our upper bounds in Sections 2 and 4 imply that our lower bounds are tight for all values of $L$, $\|\mathcal{C}\|_2$, and $\Delta$ for which $\frac{\Delta\|\mathcal{C}\|_2}{L} = \Omega(1)$. In other words, in the general case (where $\frac{\Delta\|\mathcal{C}\|_2}{L}$ is not necessarily $\Omega(1)$), these lower bounds are tight up to a factor of $\frac{\Delta\|\mathcal{C}\|_2}{L}$.

6 Efficient Sampling from Logconcave Distributions over Convex Sets and the Proof of Theorem 3.4

In this section, we discuss a generic construction of an efficient algorithm for sampling from a logconcave distribution over an arbitrary bounded convex set. Such an algorithm gives a multiplicative distance guarantee on the distribution of its output; that is, it outputs a sample from a distribution that is within a constant factor (close to $1$) of the desired logconcave distribution. As a by-product of our generic construction, we obtain our efficient $\epsilon$-differentially private algorithm $\mathcal{A}_{\text{eff-exp-samp}}$, whose construction was outlined in Section 3.2, and prove Theorem 3.4. As argued in Section 3.2, we will assume that the convex set is already in isotropic position.
The reader may refer to Section 3.2 for the details of handling the general case (where the set is not necessarily isotropic) and the effect of that on the running time.

We start with the following lemma, which describes Algorithm $\mathcal{A}_{\text{cube-samp}}$ for sampling from a distribution proportional to a given logconcave function $F$ defined over a hypercube $A$.

Lemma 6.1. Let $A \subset \mathbb{R}^p$ be a $p$-dimensional hypercube with edge length $\tau$. Let $F$ be a logconcave function that is strictly positive over $A$, where $\log F$ is $\eta$-Lipschitz. Let $\mu_A$ be the probability measure induced by the density $\frac{F}{\int_{u \in A} F(u)\,du}$. Let $\tilde{\epsilon} > 0$. There is an algorithm $\mathcal{A}_{\text{cube-samp}}$ that takes $A$, $F$, $\eta$, and $\tilde{\epsilon}$ as inputs, and outputs a sample $\tilde{\theta} \in A$ drawn from a continuous distribution $\hat{\mu}_A$ over $A$ with the property that $\mathrm{Dist}_\infty(\hat{\mu}_A, \mu_A) \le \tilde{\epsilon}$. Moreover, the running time of $\mathcal{A}_{\text{cube-samp}}$ is
$$O\!\left(\frac{\eta^2\tau^2}{\tilde{\epsilon}^2}\, p^3 \max\!\left\{p\log\!\left(\frac{\eta\tau p}{\tilde{\epsilon}}\right),\ \eta\tau\right\}\right).$$

Proof. Let $\gamma = \frac{\tilde{\epsilon}}{2\eta\sqrt{p}}$. We construct a grid $\mathcal{G}_\gamma \triangleq \{u \in \mathbb{R}^p : u_j + \frac{\gamma}{2} \text{ is an integer multiple of } \gamma,\ 1 \le j \le p\}$. Next, we run the grid-walk algorithm of [2] with the logconcave weight function $F$ on $A \cap \mathcal{G}_\gamma$. It follows from the results of [2] that (i) the grid-walk is a lazy, time-reversible Markov chain, (ii) the stationary distribution of the grid-walk is $\pi = \frac{F}{\sum_{u \in A \cap \mathcal{G}_\gamma} F(u)}$, and (iii) the grid-walk has conductance $\phi \ge \frac{\tilde{\epsilon}}{8\eta\tau p^{3/2} e^{\tilde{\epsilon}/2}}$. We run the grid-walk for $t_\infty$ steps (namely, the $L_\infty$ mixing time of the walk, i.e., the mixing time with respect to the relative distance $\mathrm{Dist}_\infty$ defined in Section 3.2) and output a sample $\hat{u} \in A \cap \mathcal{G}_\gamma$. Then, we uniformly sample a point $\theta$ from the grid cell whose center is $\hat{u}$. Let $\hat{\pi}$ denote the distribution of the output $\hat{u}$ of the grid-walk after $t_\infty$ steps, and let $\hat{\mu}_A$, as in the statement of the lemma, denote the distribution of the point $\theta$ sampled uniformly from the grid cell centered at $\hat{u}$. Now, suppose that after $t_\infty$ steps it is guaranteed that $\mathrm{Dist}_\infty(\hat{\pi}, \pi) \le \frac{\tilde{\epsilon}}{2}$.
Then, since $\log F$ is $\eta$-Lipschitz and $\gamma = \frac{\tilde{\epsilon}}{2\eta\sqrt{p}}$ (where, as defined above, $\gamma$ is the edge length of every cell of $\mathcal{G}_\gamma$), it is easy to show that ${\sf Dist}_\infty(\hat{\mu}_A, \mu_A) \le \tilde{\epsilon}$. Hence, it remains to bound $t_\infty$, the $L_\infty$ mixing time of the Markov chain given by the grid-walk. Specifically, $t_\infty$ is the number of steps of the grid-walk required to have ${\sf Dist}_\infty(\hat{\pi}, \pi) \le \frac{\tilde{\epsilon}}{2}$. Towards this end, we use the result of [38] on the rapid mixing of lazy Markov chains with finite state space. We formally restate this result in the following lemma.

Lemma 6.2 (Theorem 1 in [38]). Let $P$ be a lazy, time-reversible Markov chain over a finite state space $\Gamma$. Then, the time $t_\infty$ required for relative $L_\infty$ convergence to $\epsilon_0$ is at most $1 + \int_{4\pi^*}^{4/\epsilon_0} \frac{4\,dx}{x\,\Phi^2(x)}$. Here, $\Phi(x) = \inf\{\phi_S : \pi(S) \le x\}$, where $\phi_S$ denotes the conductance of the set $S \subseteq \Gamma$, and $\pi^*$ is the minimum probability assigned by the stationary distribution.

Now, setting $\epsilon_0 = \frac{\tilde{\epsilon}}{2}$ in the above lemma and using the fact that $\Phi(x) \ge \phi \ge \frac{\tilde{\epsilon}}{8\eta\tau p^{3/2} e^{\tilde{\epsilon}/2}}$ for all $x$, we get $t_\infty = O\!\left(\frac{\eta^2\tau^2 p^3}{\tilde{\epsilon}^2} \log\left(\frac{1}{\tilde{\epsilon}\,\pi^*}\right)\right)$. Observe that $\pi(u) = \frac{F(u)}{\sum_{v \in A \cap \mathcal{G}_\gamma} F(v)} \ge \frac{e^{-\eta\tau}}{(\tau/\gamma)^p}$, where the inequality follows from the fact that $\log F$ is $\eta$-Lipschitz. Plugging in the value we set for $\gamma$, we get $t_\infty = O\!\left(\frac{\eta^2\tau^2}{\tilde{\epsilon}^2}\, p^3 \max\left(p\log\left(\frac{\eta\tau p}{\tilde{\epsilon}}\right),\, \eta\tau\right)\right)$. This completes the proof.

Now, suppose we are given an arbitrary bounded convex set $\mathcal{C}$ and a logconcave function $F(\theta) = e^{-f(\theta)}$, $\theta \in \mathcal{C}$, where $f$ is a convex $\eta$-Lipschitz function on $\mathcal{C}$. Having Algorithm $\mathcal{A}_{\sf cube-samp}$ of Lemma 6.1 in hand, we now construct an efficient algorithm $\mathcal{A}_{\sf init-samp}$ that, with probability at least $1/2$, outputs a sample in $\mathcal{C}$ from a distribution close (w.r.t. ${\sf Dist}_\infty$) to the distribution on $\mathcal{C}$ proportional to $F$.
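For intuition, here is a toy one-dimensional sketch of the kind of lazy, reversible grid walk that $\mathcal{A}_{\sf cube-samp}$ relies on. The actual algorithm of [2] walks on the $p$-dimensional grid $A \cap \mathcal{G}_\gamma$ and its analysis supplies the conductance bound used above; the Metropolis filter below is merely one standard way to realize a lazy reversible chain with stationary law proportional to $F$, and all names are illustrative:

```python
import math
import random

def grid_walk_sample(log_f, grid, steps, rng):
    """Toy 1-D lazy Metropolis walk whose stationary distribution is
    proportional to F = exp(log_f) on the given grid of points.
    Laziness (stay put with probability 1/2) and the Metropolis
    acceptance rule make the chain time-reversible w.r.t. pi ~ F."""
    i = rng.randrange(len(grid))                 # arbitrary starting point
    for _ in range(steps):
        if rng.random() < 0.5:                   # lazy step: stay put
            continue
        j = i + rng.choice((-1, 1))              # propose a neighboring point
        if not 0 <= j < len(grid):               # reject moves off the grid
            continue
        # accept with probability min(1, F(grid[j]) / F(grid[i]))
        if rng.random() < math.exp(min(0.0, log_f(grid[j]) - log_f(grid[i]))):
            i = j
    return grid[i]
```

Running many independent walks with `log_f = lambda x: -5*abs(x)` concentrates the samples near $0$, as the stationary law $\propto e^{-5|x|}$ dictates.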
Our algorithm $\mathcal{A}_{\sf init-samp}$ does this by first enclosing the set $\mathcal{C}$ in a hypercube $A$ and then constructing a convex Lipschitz extension of $f$ over $A$ (we need this step since $f$ may not be defined outside $\mathcal{C}$). Using a standard trick from the literature, our algorithm modulates $F$ by a gauge function to reduce the weight attributed to points outside $\mathcal{C}$. Namely, given the convex Lipschitz extension of $f$ over $A$, denoted $\bar{f}$, Algorithm $\mathcal{A}_{\sf init-samp}$ defines a modified function $\tilde{F}(\theta) = e^{-\bar{f}(\theta) - \bar{\psi}_\alpha(\theta)}$, where $\bar{\psi}_\alpha$ is our gauge function with a tuning parameter $\alpha$. The function $\bar{\psi}_\alpha$ is chosen so that it is zero inside $\mathcal{C}$ and increases monotonically as we move away from $\mathcal{C}$. The exact form of $\bar{\psi}_\alpha$ is given shortly. Algorithm $\mathcal{A}_{\sf init-samp}$ then calls Algorithm $\mathcal{A}_{\sf cube-samp}$ on inputs $A$ and $\tilde{F}$. By appropriately choosing the parameter $\alpha$ of our gauge function, it is guaranteed that, with probability at least $1/2$, $\mathcal{A}_{\sf cube-samp}$ returns a sample in $\mathcal{C}$. Thus, by Lemma 6.1, we reach the desired goal.

The function $\bar{\psi}_\alpha(\theta)$ is defined through the Minkowski norm of $\theta$. The Minkowski norm of $\theta \in \mathbb{R}^p$ with respect to $\mathcal{C}$, denoted $\psi(\theta)$, is defined as $\psi(\theta) = \inf\{r > 0 : \theta \in r\,\mathcal{C}\}$. We define $\bar{\psi}_\alpha(\theta) \triangleq \alpha \cdot \max\{0, \psi(\theta) - 1\}$ for a tuning parameter $\alpha > 0$ (to be specified later). Note that $\bar{\psi}_\alpha(\theta) > 0$ if and only if $\theta \notin \mathcal{C}$, since $0_p \in \mathcal{C}$ (as $\mathcal{C}$ is assumed to be in isotropic position) and $\mathcal{C}$ is convex. Moreover, it is not hard to verify that $\bar{\psi}_\alpha$ is $\alpha$-Lipschitz when $\mathcal{C}$ is in isotropic position.

Before we give the precise construction of Algorithm $\mathcal{A}_{\sf init-samp}$, we first discuss an important step in this algorithm, namely the construction of the convex Lipschitz extension of $f$.
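The gauge penalty is easiest to see in the simplest case $\mathcal{C} = $ the unit Euclidean ball, where the Minkowski norm is just $\psi(\theta) = \|\theta\|_2$. A toy instance (for a general isotropic $\mathcal{C}$, evaluating $\psi$ would require the membership/projection oracles discussed later in this section):

```python
import math

def minkowski_gauge_ball(theta):
    """psi(theta) for C = the unit Euclidean ball:
    inf{r > 0 : theta in r*C} = ||theta||_2."""
    return math.sqrt(sum(t * t for t in theta))

def psi_bar(theta, alpha):
    """psi_bar_alpha(theta) = alpha * max{0, psi(theta) - 1}:
    zero inside C, growing linearly (alpha-Lipschitz) outside C."""
    return alpha * max(0.0, minkowski_gauge_ball(theta) - 1.0)
```

Multiplying the sampling weight by $e^{-\bar{\psi}_\alpha(\theta)}$ leaves the density untouched on $\mathcal{C}$ while damping points outside $\mathcal{C}$ exponentially in their gauge excess.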
The following lemma (a variant of Theorem 1 in [13]) asserts that any convex, Lipschitz, efficiently computable function defined over a bounded convex set $\mathcal{C}$ has a Lipschitz extension (with the same Lipschitz constant) over $\mathbb{R}^p$ that is also convex and efficiently computable, where the efficient computability of the extension is guaranteed under the assumption that a projection oracle exists.

Lemma 6.3 (Convex Lipschitz extension, Theorem 1 in [13]). Let $f$ be an efficiently computable, $\eta$-Lipschitz, convex function defined on a bounded convex set $\mathcal{C} \subset \mathbb{R}^p$. Then there exists an efficiently computable, $\eta$-Lipschitz convex function $\bar{f}$ defined over $\mathbb{R}^p$ such that $\bar{f}$, restricted to $\mathcal{C}$, is equal to $f$. The efficient computation of $\bar{f}$ assumes the existence of a projection oracle.

For clarity and completeness, we give a proof of this lemma here.

Proof. For simplicity, assume that $\mathcal{C}$ is closed. This is without loss of generality, since we can always redefine $f$ on the closure of $\mathcal{C}$, which is possible because $f$ is continuous on $\mathcal{C}$. We use a standard extension from the literature. Namely, define $g_y(x) \triangleq f(y) + \eta\|x - y\|_2$, $y \in \mathcal{C}$, $x \in \mathbb{R}^p$, and $\bar{f}(x) = \min_{y \in \mathcal{C}} g_y(x)$, $x \in \mathbb{R}^p$. As a standard result (see, e.g., [14]), the function $\bar{f}$ is an $\eta$-Lipschitz extension of $f$ to $\mathbb{R}^p$. Moreover, since $f$ is convex and $\mathcal{C}$ is a convex set, for every $x \in \mathbb{R}^p$ the computation of $\bar{f}(x)$ is a convex program which can be implemented efficiently using a linear optimization oracle. In particular, a projection oracle suffices, and hence $\bar{f}$ is efficiently computable. It remains to show that $\bar{f}$ is convex. Let $x_1, x_2 \in \mathbb{R}^p$. Let $y_1$ and $y_2$ denote the minimizers of $g_y(x_1)$ and $g_y(x_2)$ over $y \in \mathcal{C}$, respectively. Let $0 \le \lambda \le 1$. Define $x_\lambda = \lambda x_1 + (1-\lambda)x_2$ and let $y_\lambda$ denote the minimizer of $g_y(x_\lambda)$ over $y \in \mathcal{C}$.
Now, observe that $\bar{f}(x_\lambda) = g_{y_\lambda}(x_\lambda) \le g_{\lambda y_1 + (1-\lambda)y_2}(x_\lambda) = f(\lambda y_1 + (1-\lambda)y_2) + \eta\|\lambda(y_1 - x_1) + (1-\lambda)(y_2 - x_2)\|_2 \le \lambda\left(f(y_1) + \eta\|y_1 - x_1\|_2\right) + (1-\lambda)\left(f(y_2) + \eta\|y_2 - x_2\|_2\right) = \lambda\bar{f}(x_1) + (1-\lambda)\bar{f}(x_2)$, where the first inequality follows from the fact that $y_\lambda$ is the minimizer (w.r.t. $y$) of $g_y(x_\lambda)$ and $\lambda y_1 + (1-\lambda)y_2 \in \mathcal{C}$ by convexity of $\mathcal{C}$, and the second inequality follows from the convexity of $f$ and of the $L_2$-norm. This completes the proof of the lemma.

Now we give the construction of Algorithm $\mathcal{A}_{\sf init-samp}$, followed by a lemma asserting the probabilistic guarantee discussed above.

Lemma 6.4. With probability at least $1/2$, $\mathcal{A}_{\sf init-samp}$ (Algorithm 5) outputs $\tilde{\theta} \in \mathcal{C}$. Moreover, the conditional distribution of $\tilde{\theta}$, conditioned on the event $\tilde{\theta} \in \mathcal{C}$, is within multiplicative distance ${\sf Dist}_\infty$ at most $\tilde{\epsilon}$ of the distribution induced by $F$ over $\mathcal{C}$, i.e., within multiplicative distance $\tilde{\epsilon}$ of the desired distribution $\frac{e^{-f(\theta)}}{\int_{\theta \in \mathcal{C}} e^{-f(\theta)}\, d\theta}$, $\theta \in \mathcal{C}$. The running time of $\mathcal{A}_{\sf init-samp}$ is $O\!\left(\frac{\tilde{\eta}^2\tau^2}{\tilde{\epsilon}^2}\, p^3 \max\left(p\log\left(\frac{\tilde{\eta}\tau p}{\tilde{\epsilon}}\right),\, \tilde{\eta}\tau\right)\right)$, where $\tilde{\eta} = \max(\eta\|\mathcal{C}\|_2, p)$.

Algorithm 5 $\mathcal{A}_{\sf init-samp}$: Efficient Log-Concave Sampling with a Probabilistic Guarantee
Input: A bounded convex set $\mathcal{C}$, a convex function $f$ defined over $\mathcal{C}$, the Lipschitz constant $\eta$ of $f$, and the desired multiplicative distance guarantee $\tilde{\epsilon}$.
1: Find a cube $A \supseteq \mathcal{C}$ with edge length $\tau = \|\mathcal{C}\|_\infty$.
2: Compute the convex Lipschitz extension $\bar{f}(\theta) = \min_{u \in \mathcal{C}}\left(f(u) + \eta\|\theta - u\|_2\right)$.
3: Set $\bar{\psi}_\alpha(\theta) = \alpha \cdot \max\{0, \psi(\theta) - 1\}$ with $\alpha = \frac{3e}{2\tilde{\epsilon}}\left(\eta\|\mathcal{C}\|_2 + p\right)$, where $\psi(\theta)$ is the Minkowski norm of $\theta$ w.r.t. $\mathcal{C}$ as defined above.
4: Set $F(\theta) = e^{-\bar{f}(\theta) - \bar{\psi}_\alpha(\theta)}$.
5: Output $\tilde{\theta} = \mathcal{A}_{\sf cube-samp}\left(A, F, \eta + \alpha, \frac{\tilde{\epsilon}}{2}\right)$.

Proof.
By Lemma 6.1, we know that $\tilde{\theta}$ (the output of $\mathcal{A}_{\sf cube-samp}$ in Step 5) has a distribution $\hat{\mu}_A$ with the property that ${\sf Dist}_\infty(\hat{\mu}_A, \mu_A) \le \frac{\tilde{\epsilon}}{2}$, where $\mu_A(u) = \frac{F(u)}{\int_{v \in A} F(v)\, dv}$, $u \in A$. We will show that $\int_{\theta \in A \setminus \mathcal{C}} \hat{\mu}_A(\theta)\, d\theta \le \int_{\theta \in \mathcal{C}} \hat{\mu}_A(\theta)\, d\theta$. In particular, it suffices to show that $\int_{\theta \in A \setminus \mathcal{C}} F(\theta)\, d\theta \le e^{-2\tilde{\epsilon}} \int_{\theta \in \mathcal{C}} F(\theta)\, d\theta$. Towards this end, consider a differential ($p$-dimensional) cone with a differential solid angle $d\omega$ at its vertex, which is located at the origin (and hence inside $\mathcal{C}$, since $\mathcal{C}$ is in isotropic position). Let $\theta_0$ be the point where the axis of the cone intersects the boundary of $\mathcal{C}$. The set $\mathcal{C}$ divides the cone into two regions, one inside $\mathcal{C}$ and the other outside $\mathcal{C}$. We now show that, for any such cone, the integral of $F$ over its region outside $\mathcal{C}$, denoted by $I_{\rm out}$, is less than the integral of $e^{-2\tilde{\epsilon}} F$ over its region inside $\mathcal{C}$, denoted by $I_{\rm in}$. First, observe that

$I_{\rm in} = \frac{d\omega}{p}\|\theta_0\|_2^p \int_0^1 e^{-\bar{f}(r\theta_0)}\, r^{p-1}\, dr \ge \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)} \int_0^1 e^{-\eta\|\mathcal{C}\|_2(1-r)}\, r^{p-1}\, dr = \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)} \int_0^1 e^{-\eta\|\mathcal{C}\|_2 r}\, (1-r)^{p-1}\, dr \ge \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)} \int_0^{\frac{1}{\eta\|\mathcal{C}\|_2 + p}} (1 - \eta\|\mathcal{C}\|_2 r)(1 - pr)\, dr \ge \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)} \int_0^{\frac{1}{\eta\|\mathcal{C}\|_2 + p}} \left(1 - (\eta\|\mathcal{C}\|_2 + p)r\right) dr = \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)}\, \frac{1}{2(\eta\|\mathcal{C}\|_2 + p)}$,

where the first inequality follows from the Lipschitz property of $\bar{f}$, the middle equality is the change of variable $r \mapsto 1 - r$, and the second inequality uses the facts that $e^{-x} \ge 1 - x$ and $(1-x)^{p-1} \ge 1 - px$. On the other hand, we can upper bound $I_{\rm out}$ as follows:

$I_{\rm out} \le \frac{d\omega}{p}\|\theta_0\|_2^p \int_1^\infty e^{-\bar{f}(r\theta_0)}\, e^{-\alpha(r-1)}\, r^{p-1}\, dr \le \frac{d\omega}{p}\|\theta_0\|_2^p\, e^{-\bar{f}(\theta_0)} \int_1^\infty e^{\eta\|\mathcal{C}\|_2(r-1)}\, e^{-\alpha(r-1)}\, r^{p-1}\, dr \le 2(\eta\|\mathcal{C}\|_2 + p)\, I_{\rm in} \int_0^\infty e^{-(\alpha - (\eta\|\mathcal{C}\|_2 + p))r}\, dr \le e^{-2\tilde{\epsilon}}\, I_{\rm in}$,  (8)

where the third inequality uses $r^{p-1} \le e^{p(r-1)}$ for $r \ge 1$ together with the lower bound on $I_{\rm in}$ above, and the last inequality follows from our setting of $\alpha$ in Algorithm 5.
Since this holds for every differential cone as described above, $\mathcal{A}_{\sf init-samp}$ outputs $\tilde{\theta} \in \mathcal{C}$ with probability at least $1/2$. Next, let ${\sf Good}$ denote the event that $\tilde{\theta} \in \mathcal{C}$, and let $\hat{\mu}_{\sf Good}$ denote the conditional distribution of $\tilde{\theta}$ given ${\sf Good}$. Let $\mu_{\mathcal{C}}$ denote the distribution induced by $F$ on $\mathcal{C}$, that is, $\frac{F}{\int_{\theta \in \mathcal{C}} F(\theta)\, d\theta}$. Observe that, for any measurable set $U \subseteq \mathcal{C}$, $\hat{\mu}_{\sf Good}(U) = \frac{\hat{\mu}_A(U)}{\hat{\mu}_A(\mathcal{C})}$. Now, since $\mu_{\mathcal{C}}(U) = \frac{\mu_A(U)}{\mu_A(\mathcal{C})}$, by Lemma 6.1 we have ${\sf Dist}_\infty(\hat{\mu}_{\sf Good}, \mu_{\mathcal{C}}) \le \tilde{\epsilon}$.

Finally, regarding the running time of $\mathcal{A}_{\sf init-samp}$: as pointed out at the end of Section 3.2, we assume that $f$ is efficiently computable and that there exist a membership oracle (to efficiently test membership of a point in $\mathcal{C}$) and a projection oracle (to efficiently construct the convex Lipschitz extension $\bar{f}$). This enables us to efficiently implement the first four steps of $\mathcal{A}_{\sf init-samp}$. However, we do not account for the extra polynomial factor in running time required to perform those steps, since it is highly dependent on the specific structure of $\mathcal{C}$. Thus, under this assumption, the running time of $\mathcal{A}_{\sf init-samp}$ is the same as the running time of $\mathcal{A}_{\sf cube-samp}$ on inputs $A$, $F$, $\eta + \alpha$, and $\frac{\tilde{\epsilon}}{2}$. The expression in the lemma follows directly from the running time of $\mathcal{A}_{\sf cube-samp}$ in Lemma 6.1 and the fact that $\eta + \alpha = O(\max(\eta\|\mathcal{C}\|_2, p))$.

Now, using a standard boosting approach, we construct an algorithm $\mathcal{A}_{\sf eff-samp}$ that outputs a sample $\theta \in \mathcal{C}$ with probability $1$ whose distribution is within multiplicative distance at most $\tilde{\epsilon}$ of the desired distribution on $\mathcal{C}$.

Algorithm 6 $\mathcal{A}_{\sf eff-samp}$: Efficient Log-Concave Sampling over a Convex Set
Input: A bounded convex set $\mathcal{C}$, a convex function $f$ defined over $\mathcal{C}$, the Lipschitz constant $\eta$ of $f$, and the desired multiplicative distance guarantee $\tilde{\epsilon}$.
1: Find $\tau = \|\mathcal{C}\|_\infty$.
2: Set $m = 4\left(\eta\|\mathcal{C}\|_2 + p\log(\|\mathcal{C}\|_2) + \log\left(\frac{1}{1 - e^{-\tilde{\epsilon}/4}}\right)\right)$.
3: for $1 \le i \le m$ do
4: $\quad\tilde{\theta} = \mathcal{A}_{\sf init-samp}\left(\mathcal{C}, f, \eta, \frac{\tilde{\epsilon}}{4}\right)$.
5: $\quad$if $\tilde{\theta} \in \mathcal{C}$ then
6: $\qquad$Output $\tilde{\theta}$ and abort.
7: Output a uniformly random sample $\tilde{\theta}$ from the unit ball $\mathcal{B}$. (Note that $\mathcal{B} \subseteq \mathcal{C}$ since $\mathcal{C}$ is in isotropic position.)

Lemma 6.5. Let $\hat{\mu}_{\mathcal{C}}$ denote the distribution of $\tilde{\theta}$ (the output of Algorithm $\mathcal{A}_{\sf eff-samp}$) and $\mu_{\mathcal{C}}$ denote the desired distribution $\frac{e^{-f}}{\int_{\theta \in \mathcal{C}} e^{-f(\theta)}\, d\theta}$ on $\mathcal{C}$. We have ${\sf Dist}_\infty(\hat{\mu}_{\mathcal{C}}, \mu_{\mathcal{C}}) \le \tilde{\epsilon}$. Moreover, the running time of $\mathcal{A}_{\sf eff-samp}$ is $O\!\left(\frac{\tilde{\eta}^2\tau^2}{\tilde{\epsilon}^2}\, p^3 \cdot \max\left(p\log\left(\frac{\tilde{\eta}\tau p}{\tilde{\epsilon}}\right),\, \tilde{\eta}\tau\right) \cdot \max\left(p\log(\|\mathcal{C}\|_2),\, \eta\|\mathcal{C}\|_2,\, \log\left(\frac{1}{\tilde{\epsilon}}\right)\right)\right)$, where $\tilde{\eta} = \max(\eta\|\mathcal{C}\|_2, p)$.

Proof. Let $\hat{\mu}_{\sf Good}$ denote the conditional distribution of $\tilde{\theta}$ (the output of Algorithm $\mathcal{A}_{\sf eff-samp}$), conditioned on the event that $\mathcal{A}_{\sf init-samp}$ outputs a sample in $\mathcal{C}$ in one of the $m$ iterations of the for loop. From Lemma 6.4, it is easy to see that the probability measure of the output of $\mathcal{A}_{\sf eff-samp}$ can be written as $d\hat{\mu}_{\mathcal{C}}(\tilde{\theta}) = (1 - 2^{-m})\, d\hat{\mu}_{\sf Good}(\tilde{\theta}) + 2^{-m} \cdot \frac{1}{{\rm Vol}(\mathcal{B})} \cdot \mathbb{1}(\tilde{\theta} \in \mathcal{B})$, where $\mathbb{1}(\cdot)$ is the standard indicator function, i.e., it takes value $1$ whenever $\tilde{\theta} \in \mathcal{B}$ and zero otherwise. Also, from Lemma 6.4, we know that ${\sf Dist}_\infty(\hat{\mu}_{\sf Good}, \mu_{\mathcal{C}}) \le \frac{\tilde{\epsilon}}{4}$. Let $\mu^*$ denote the minimum value of the density $\frac{e^{-f(\theta)}}{\int_{u \in \mathcal{C}} e^{-f(u)}\, du}$ over $\theta \in \mathcal{C}$. By the Lipschitz property of $f$, we have $\mu^* \ge \frac{e^{-2\eta\tau}}{{\rm Vol}(\mathcal{C})} \ge \frac{1}{{\rm Vol}(\mathcal{B})} \cdot \frac{{\rm Vol}(\mathcal{B})}{{\rm Vol}(\mathcal{C})}\, e^{-2\eta\|\mathcal{C}\|_2} \ge \frac{1}{{\rm Vol}(\mathcal{B})} \cdot \frac{e^{-2\eta\|\mathcal{C}\|_2}}{\|\mathcal{C}\|_2^p}$. Hence, our choice of $m$ guarantees that $\frac{2^{-m}}{\mu^* {\rm Vol}(\mathcal{B})} \le e^{\tilde{\epsilon}/2}\left(e^{\tilde{\epsilon}/2} - 1\right)$. It also guarantees that $1 - 2^{-m} \ge e^{-\tilde{\epsilon}/4}$. Putting this together, we get $e^{{\sf Dist}_\infty(\hat{\mu}_{\mathcal{C}}, \mu_{\mathcal{C}})} \le e^{\tilde{\epsilon}/4}\, e^{{\sf Dist}_\infty(\hat{\mu}_{\sf Good}, \mu_{\mathcal{C}})} + \frac{2^{-m}}{\mu^* {\rm Vol}(\mathcal{B})} \le e^{\tilde{\epsilon}}$. The running time of $\mathcal{A}_{\sf eff-samp}$ is at most $O(m \cdot T_{\mathcal{A}_{\sf init-samp}})$, where $T_{\mathcal{A}_{\sf init-samp}}$ is the running time of $\mathcal{A}_{\sf init-samp}$ (which is of the same order as that of $\mathcal{A}_{\sf cube-samp}$, given in Lemma 6.1).
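The retry structure of Algorithm 6 can be sketched as follows. This is a schematic with illustrative names: `init_samp` stands in for one call to $\mathcal{A}_{\sf init-samp}$, which may land outside $\mathcal{C}$ with probability at most $1/2$, and the fallback draws uniformly from a ball contained in $\mathcal{C}$:

```python
import random

def eff_samp(init_samp, in_C, ball_sample, m, rng):
    """Boosting wrapper: try the probabilistic sampler up to m times;
    each attempt independently lands in C with probability >= 1/2, so
    the uniform-ball fallback is reached with probability <= 2**(-m)."""
    for _ in range(m):
        theta = init_samp(rng)
        if in_C(theta):
            return theta
    return ball_sample(rng)                  # fallback inside C
```

With $m$ on the order of a few dozen the fallback branch is essentially never taken, which is why it only perturbs the output distribution by the $2^{-m}/(\mu^* {\rm Vol}(\mathcal{B}))$ term in the proof above.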
Note that Step 7 can be carried out in linear time using standard methods from the literature. Finally, plugging in our choice for the value of $m$ gives the expression in the lemma statement. This completes the proof.

6.1 Efficient $\epsilon$-Differentially Private Algorithm for Lipschitz Convex Loss

In this section, we give a straightforward construction of our efficient Algorithm $\mathcal{A}_{\sf eff-exp-samp}$ (referred to in Section 3.2), based on the construction established above for efficient logconcave sampling. Using the results established above in this section, we then give a proof of Theorem 3.4, which is also fairly straightforward. First, fix a dataset $D$. Our goal is to construct an efficient version of Algorithm $\mathcal{A}_{\sf exp-samp}$ (Algorithm 2 from Section 3.1). To do this, we simply run Algorithm $\mathcal{A}_{\sf eff-samp}$ (Algorithm 6 above) with the function $f$ instantiated with the scaled decomposable loss function $\frac{\epsilon}{6L\|\mathcal{C}\|_2}\mathcal{L}(\cdot\,; D)$, defined over the bounded convex set $\mathcal{C}$ (which is assumed to be in isotropic position, as discussed in Section 3.2). Hence, $\eta$ in our case is $\frac{\epsilon n}{6\|\mathcal{C}\|_2}$. Namely, as shown below, our $\epsilon$-differentially private algorithm $\mathcal{A}_{\sf eff-exp-samp}$ is an instantiation of $\mathcal{A}_{\sf eff-samp}$ on inputs $\mathcal{C}$, $\frac{\epsilon}{6L\|\mathcal{C}\|_2}\mathcal{L}(\cdot\,; D)$, $\frac{\epsilon n}{6\|\mathcal{C}\|_2}$, and $\frac{\epsilon}{3}$.

Algorithm 7 $\mathcal{A}_{\sf eff-exp-samp}$: Efficient Log-Concave Sampling over a Convex Set
Input: A dataset $D$ of size $n$, a bounded convex set $\mathcal{C}$, a convex $L$-Lipschitz loss function $\ell$, and a privacy parameter $\epsilon$.
1: $\mathcal{L}(\theta; D) = \sum_{i=1}^n \ell(\theta; d_i)$.
2: Output $\theta^{\sf priv} = \mathcal{A}_{\sf eff-samp}\left(\mathcal{C},\ \frac{\epsilon}{6L\|\mathcal{C}\|_2}\mathcal{L}(\cdot\,; D),\ \frac{\epsilon n}{6\|\mathcal{C}\|_2},\ \frac{\epsilon}{3}\right)$.

The choice of the scaling factor $\frac{\epsilon}{6L\|\mathcal{C}\|_2}$ for the loss function and of the multiplicative distance guarantee $\frac{\epsilon}{3}$ is tuned to yield an $\epsilon$-differentially private algorithm. To see this, we rely on the following simple lemma from [25].

Lemma 6.6 (follows from Lemma A.1 in [25]). Let $\epsilon, \tilde{\epsilon} > 0$. Let $Q \subseteq \mathbb{R}^p$.
For every dataset $D$, let $\mu_D$ denote the distribution (over $Q$) of the output of an $\epsilon$-differentially private algorithm $\mathcal{A}_1$ when run on the input dataset $D$, and let $\hat{\mu}_D$ denote the distribution (over $Q$) of the output of some algorithm $\mathcal{A}_2$ when run on $D$. Suppose that ${\sf Dist}_\infty(\hat{\mu}_D, \mu_D) \le \tilde{\epsilon}$ for all $D$. Then $\mathcal{A}_2$ is $(2\tilde{\epsilon} + \epsilon)$-differentially private.

Proof of Theorem 3.4: With Lemmas 6.5 and 6.6 in hand, the proof of Theorem 3.4 becomes straightforward. First, we show differential privacy of Algorithm 7. For any dataset $D$, let $\mu_D$ be the distribution of $\theta$ proportional to $e^{-\frac{\epsilon}{6L\|\mathcal{C}\|_2}\mathcal{L}(\theta; D)}$. Note that $\mu_D$ is the distribution of the output of Algorithm $\mathcal{A}_{\sf exp-samp}$ (Algorithm 2 from Section 3.1) when $\epsilon$ is replaced with $\frac{\epsilon}{3}$. Let $\hat{\mu}_D$ be the distribution of the output $\theta^{\sf priv}$ of $\mathcal{A}_{\sf eff-exp-samp}$ (Algorithm 7 above). From Lemma 6.5, it follows that ${\sf Dist}_\infty(\hat{\mu}_D, \mu_D) \le \frac{\epsilon}{3}$. Hence, from Theorem 3.1 and Lemma 6.6, we conclude that $\mathcal{A}_{\sf eff-exp-samp}$ is $\epsilon$-differentially private.

To show the utility guarantee of $\mathcal{A}_{\sf eff-exp-samp}$, we first observe that the distribution of the output $\theta^{\sf priv}$ is close with respect to ${\sf Dist}_\infty$ to (i.e., within a constant factor of) the distribution of the output of $\mathcal{A}_{\sf exp-samp}$ (Algorithm 2 from Section 3.1); hence, the utility analysis follows the same lines as that of Theorem 3.2. Finally, observe that the running time of $\mathcal{A}_{\sf eff-exp-samp}$ is the same as the running time of $\mathcal{A}_{\sf eff-samp}$ with $\eta$ replaced by $\frac{\epsilon n}{6\|\mathcal{C}\|_2}$ and $\tilde{\epsilon}$ replaced by $\frac{\epsilon}{3}$. Also observe that $\tau = \|\mathcal{C}\|_\infty \le \|\mathcal{C}\|_2$ and, as a standard assumption, $n = \omega(p)$. Putting this together, we get the running time given in the statement of the theorem. This completes the proof of Theorem 3.4.

Acknowledgments

We are grateful to Santosh Vempala and Ravi Kannan for discussions about efficient sampling algorithms for log-concave distributions over convex bodies.
In particular, Ravi suggested the idea of using a penalty term to reduce from sampling over $\mathcal{C}$ to sampling over the cube.

References

[1] Alekh Agarwal, Peter L. Bartlett, Pradeep D. Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.
[2] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In STOC, 1991.
[3] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94(3), 2014.
[4] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. CoRR, abs/1402.2224, 2014.
[5] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. CoRR, abs/1407.2674, 2014.
[6] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128-138. ACM, 2005.
[7] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.
[8] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, 2014.
[9] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Sham M. Kakade and Ulrike von Luxburg, editors, COLT, volume 19 of JMLR Proceedings, pages 155-186. JMLR.org, 2011.
[10] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In NIPS. MIT Press, 2008.
[11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 12:1069-1109, 2011.
[12] Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha.
A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14(1):2905-2943, 2013.
[13] S. Cobzas and C. Mustata. Norm-preserving extension of convex Lipschitz functions. Journal of Approximation Theory, 24:236-244, 1978.
[14] J. Czipszer and L. Gehér. Extension of functions satisfying a Lipschitz condition. Acta Math. Acad. Sci. Hungar., 6:236-244, 1955.
[15] Anindya De. Lower bounds in differential privacy. CoRR, abs/1107.2183, 2011.
[16] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202-210. ACM, 2003.
[17] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In IEEE Symp. on Foundations of Computer Science (FOCS), 2013.
[18] Cynthia Dwork. Differential privacy. In ICALP, LNCS, pages 1-12, 2006.
[19] Cynthia Dwork and Kobbi Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO, LNCS, pages 528-544. Springer, 2004.
[20] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486-503, 2006.
[21] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006.
[22] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, 2010.
[23] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM SODA, pages 385-394, 2005.
[24] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, 2010.
[25] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In STOC, 2010.
[26] Prateek Jain and Abhradeep Thakurta. Differentially private learning with kernels. In ICML (3), volume 28 of JMLR Proceedings, pages 118-126. JMLR.org, 2013.
[27] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning (ICML), 2014.
[28] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1-24.34, 2012.
[29] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Sanjeev Khanna, editor, SODA, pages 1395-1414. SIAM, 2013. ISBN 978-1-61197-251-1, 978-1-61197-310-5.
[30] Shiva Prasad Kasiviswanathan and Adam Smith. A note on differential privacy: Defining resistance to arbitrary side information. CoRR, arXiv:0803.3946 [cs.CR], 2008.
[31] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, 2008.
[32] Shiva Prasad Kasiviswanathan, Mark Rudelson, and Adam Smith. The power of linear reconstruction attacks. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
[33] Daniel Kifer and Ashwin Machanavajjhala. A rigorous and customizable framework for privacy. In PODS, pages 77-88, 2012.
[34] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25.1-25.40, 2012.
[35] L. Lovász and S. Vempala. Simulated annealing in convex bodies and an $O^*(n^4)$ volume algorithm. In FOCS, 2003.
[36] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307-358, 2007.
[37] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94-103, 2007.
[38] Ben Morris and Yuval Peres.
Evolving sets, mixing and heat kernel bounds. Probability Theory and Related Fields, 2005.
[39] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[40] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: The sparse and approximate cases. In STOC, 2013.
[41] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. CoRR, abs/0911.5708, 2009.
[42] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009. URL http://eprints.pascal-network.org/archive/00005408/.
[43] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pages 71-79, 2013.
[44] Adam Smith and Abhradeep Thakurta. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Neural Information Processing Systems (NIPS), 2013.
[45] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory (COLT), 2013.
[46] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In Proc. of the Global Conference on Signal and Information Processing, pages 245-248, December 2013. doi: 10.1109/GlobalSIP.2013.6736861.
[47] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, 2013.
[48] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, 2008.
[49] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy.
In NIPS, 2010.

A Straightforward Smoothing Does Not Yield Optimal Algorithms

In Section E we saw that the objective perturbation algorithm (14) of [11, 34] already matches the optimal excess risk bounds for Lipschitz, and for Lipschitz and strongly convex, loss functions when the loss $\ell$ is twice continuously differentiable with second derivative bounded by $\beta$. A natural question arises: is it possible to smooth out a non-smooth loss function, by convolving with a smooth kernel (such as the Gaussian kernel) or by Huberization, and still achieve the optimal excess risk bound? In this section we look at a simple loss function (the hinge loss) and a very popular Huberization method (quadratic smoothing), and argue that there is an inherent cost to smoothing that does not allow one to obtain the optimal excess risk bounds.

Consider the loss function $\ell(\theta; d) = (y - x\theta)_+$, where the data point is $d = (x, y)$ with $x, y \in [-1, 1]$ and $\theta \in \mathbb{R}$. Here the function $f(z) = (z)_+$ equals $z$ when $z > 0$ and zero otherwise. Clearly, $f(z)$ has a point of non-differentiability at zero. We can modify $f$ in the following way to ensure that the resulting function $\hat{f}$ is smooth. Define $\hat{f}(z) = f(z)$ when $z < -h$ or $z > h$; in the range $[-h, h]$, set $\hat{f}(z) = \frac{z^2}{4h} + \frac{z}{2} + \frac{h}{4}$. It is not hard to verify that $\hat{f}$ is continuously differentiable everywhere, with a $\frac{1}{2h}$-Lipschitz derivative; that is, $\hat{f}$ is smooth with parameter $\beta = \frac{1}{2h}$. This form of smoothing is commonly called Huberization. Let the smoothed version of $\ell(\theta; d)$ be defined as $\hat{\ell}(\theta; d) = \hat{f}(y - x\theta)$ for $d = (x, y)$. With the choice of loss function $\hat{\ell}$, the objective perturbation algorithm is as below. (The regularization coefficient is chosen to ensure that it is at least $\frac{\beta}{2}$, where $\beta$ is the smoothness parameter of $\hat{\ell}$.)

$\theta^{\sf priv} = \arg\min_{\theta \in [-2,2]} \sum_{i=1}^n \hat{\ell}(\theta; d_i) + \frac{\theta^2}{8h} + b\theta$  (9)

In (9), the noise is $b \sim \mathcal{N}\!\left(0, \frac{8\log(1/\delta)}{\epsilon^2}\right)$.
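The quadratic smoothing $\hat{f}$ above can be written out directly; the values at the splice points $\pm h$ match the hinge $f(z) = (z)_+$, and the derivative is continuous there:

```python
def huberize(z, h):
    """Quadratic smoothing of the hinge f(z) = max(z, 0) used in (9):
    f_hat(z) = z^2/(4h) + z/2 + h/4 on [-h, h], and f_hat = f outside.
    f_hat(-h) = 0 and f_hat(h) = h, so the pieces splice continuously,
    with derivatives 0 and 1 at the splice points respectively."""
    if z < -h:
        return 0.0
    if z > h:
        return float(z)
    return z * z / (4.0 * h) + z / 2.0 + h / 4.0
```

Note `huberize(0, h) = h/4 > 0`: the smoothing lifts the loss at the kink, which is one way to see that Huberization is not free.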
In the results to follow, we show that for any choice of the Huberization parameter $h$, there exist data sets of size $n$ from the domain above on which the excess risk of objective perturbation is provably worse than our results in this paper. We present the results for the $(\epsilon, \delta)$-differential privacy case, but the same conclusions hold for the pure $\epsilon$-differential privacy case.

Theorem A.1. For every $h > 0$, there exists $D$ such that the excess risk of the objective perturbation algorithm in (9) satisfies $\mathbb{E}\left[\mathcal{L}(\theta^{\sf priv}; D)\right] - \mathcal{L}(\theta^*; D) = \Omega\left(\min\left(n, \max\left\{nh, \frac{1}{h}\right\}\right)\right) = \Omega(\sqrt{n})$. Here the loss function is $\mathcal{L}(\theta; D) = \sum_{i=1}^n \ell(\theta; d_i)$ (where $D = \{d_1, \ldots, d_n\}$) and $\theta^* = \arg\min_{\theta \in [-2,2]} \sum_{i=1}^n \ell(\theta; d_i)$.

Proof. Consider the data set $D_1$ with $\frac{n}{3}$ entries equal to $(x = -1, y = 1)$ and $\frac{2n}{3}$ entries equal to $(x = 1, y = -1)$. In the following lemma we lower bound the excess risk on $D_1$ for a given Huberization parameter $h$.

Lemma A.2. Let $\epsilon, \delta$ be the privacy parameters, with $\epsilon$ a constant ($< 1$) and $\delta = \Omega\left(\frac{1}{n^4}\right)$. For the data set $D_1$ above, the excess risk of objective perturbation (9) satisfies, for all $h > 0$, $\mathbb{E}\left[\mathcal{L}(\theta^{\sf priv}; D_1)\right] - \mathcal{L}(\theta^*; D_1) = \Omega\left(n \cdot \min\{1, h\}\right)$. Here $\mathcal{L}(\theta; D_1) = \sum_{i=1}^n \ell(\theta; d_i)$ (where $D_1 = \{d_1, \ldots, d_n\}$) and $\theta^* = \arg\min_{\theta \in [-2,2]} \sum_{i=1}^n \ell(\theta; d_i)$.

Proof. For ease of notation, let $\hat{\mathcal{L}}(\theta; D_1) = \sum_{i=1}^n \hat{\ell}(\theta; d_i)$. First, note two properties of $\hat{\mathcal{L}}$: (i) its minimizer within the set $[-2, 2]$ (call it $\hat{\theta}$) is at $\max\{-2, 1-h\}$, and (ii) $\hat{\mathcal{L}}$ is quadratic within the range $[1-h, 1+h]$, with strong convexity parameter at least $\frac{n}{6h}$. Additionally, notice that $\theta^* = 1$ and that the regularizer $\frac{\theta^2}{8h}$ in (9) is centered at zero. Also, by Markov's inequality, with probability $\ge 2/3$ we have $|b| \le \frac{8\sqrt{\log(1/\delta)}}{\epsilon}$. Now, to satisfy optimality, $\frac{n\,|\theta^{\sf priv} - \hat{\theta}|}{3h} \le |b|$.
This implies that $|\theta^{\sf priv} - \hat{\theta}| \le \frac{3h|b|}{n}$. Therefore, the difference $\theta^* - \theta^{\sf priv}$ is at least $\min\left\{1, h\left(1 - \frac{3|b|}{n}\right)\right\}$. Hence the excess risk, with probability at least $2/3$, is $\Omega\left(n \cdot \min\{1, h\}\right)$, which concludes the proof.

Consider a data set $D_2$ which has exactly $\max\left\{\frac{n}{2} - \frac{1}{32h}, 0\right\}$ entries with $(x = -1, y = 1)$ and $\min\left\{\frac{n}{2} + \frac{1}{32h}, n\right\}$ entries with $(x = 1, y = 1)$. In the following lemma we lower bound the excess risk on $D_2$ for a given Huberization parameter $h$.

Lemma A.3. Let $\epsilon, \delta$ be the privacy parameters, with $\epsilon$ a constant ($< 1$) and $\delta = \Omega\left(\frac{1}{n^4}\right)$. Let $h < \frac{1}{\log n}$ be a fixed Huberization parameter. Then, for the data set $D_2$ above, the excess risk of objective perturbation (9) satisfies $\mathbb{E}\left[\mathcal{L}(\theta^{\sf priv}; D_2)\right] - \mathcal{L}(\theta^*; D_2) = \Omega\left(\min\left\{\frac{1}{h}, n\right\}\right)$. Here $\mathcal{L}(\theta; D_2) = \sum_{i=1}^n \ell(\theta; d_i)$ and $\theta^* = \arg\min_{\theta \in [-2,2]} \sum_{i=1}^n \ell(\theta; d_i)$.

Proof. For ease of notation, let $\hat{\mathcal{L}}(\theta; D_2) = \sum_{i=1}^n \hat{\ell}(\theta; d_i)$. Notice that within the range $[-1+h, 1-h]$, the slope of $\hat{\mathcal{L}}(\theta; D_2)$ is $\max\left\{-\frac{1}{16h}, -n\right\}$. By the optimality condition for $\theta^{\sf priv}$, we have

$\frac{\theta^{\sf priv}}{4h} + b - \min\left\{\frac{1}{16h}, n\right\} = 0.$  (10)

Solving for $\theta^{\sf priv}$, we have $\theta^{\sf priv} = \min\left\{\frac{1}{4}, 4nh\right\} - 4bh$. By assumption $h < \frac{1}{\log n}$, and with probability $\ge 2/3$ we have $|b| \le \frac{8\sqrt{\log(1/\delta)}}{\epsilon}$. Therefore, with probability $\ge 2/3$, we have $\theta^{\sf priv} \le \frac{1}{2}$. Now notice that with the original loss function $\ell$, $\arg\min_{\theta \in [-2,2]} \sum_{i=1}^n \ell(\theta; d_i) = 1$. Since the loss function $\mathcal{L}(\theta; D_2)$ has slope $\max\left\{-\frac{1}{16h}, -n\right\}$ in the range $[-1, 1]$, the excess risk is $\Omega\left((1 - \theta^{\sf priv}) \min\left\{\frac{1}{h}, n\right\}\right) = \Omega\left(\min\left\{\frac{1}{h}, n\right\}\right)$, which concludes the proof.

Finally, combining Lemmas A.2 and A.3 completes the proof of Theorem A.1.
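Theorem A.1's rate can be sanity-checked numerically: the bound $\Omega(\max\{nh, 1/h\})$ is minimized over $h$ at the balance point $h = 1/\sqrt{n}$, where it equals $\sqrt{n}$, so no choice of Huberization parameter avoids the $\Omega(\sqrt{n})$ cost. A small grid search (illustrative only; the grid and helper name are ours):

```python
def best_huberization(n):
    """Minimize max(n*h, 1/h) over a grid of candidate Huberization
    parameters h in (0, 2].  The two terms balance at h = 1/sqrt(n),
    where the bound equals sqrt(n)."""
    return min((max(n * h, 1.0 / h), h)
               for h in (k / 1000.0 for k in range(1, 2001)))
```

For $n = 100$ the search returns value $10 = \sqrt{100}$ at $h = 0.1 = 1/\sqrt{100}$.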
B Localization and $(\epsilon, \delta)$-Differentially Private Algorithms for Lipschitz, Strongly Convex Loss

We use a slightly different version of $\mathcal{A}_{\sf out-pert}$ (Algorithm 3), which we denote by $\mathcal{A}^{(\epsilon,\delta)}_{\sf out-pert}$: the algorithm takes as input an extra privacy parameter $\delta$, samples the noise vector $b$ from the Gaussian distribution $\mathcal{N}\left(0, \mathbb{I}_p \sigma_0^2\right)$ where $\sigma_0^2 = \frac{4L^2\log(1/\delta)}{\Delta^2\epsilon^2 n^2}$, and outputs $\mathcal{C}_0 = \{\theta \in \mathcal{C} : \|\theta - \theta_0\|_2 \le \zeta\sigma_0\sqrt{p}\}$. Let $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-Lip}$ denote any generic $(\epsilon, \delta)$-differentially private algorithm for optimizing a decomposable loss of convex Lipschitz functions over some arbitrary convex set $\tilde{\mathcal{C}} \subseteq \mathcal{C}$. Algorithm 1 from Section 2 is an example of $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-Lip}$. Now, we construct an algorithm $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-str-convex}$, the $(\epsilon, \delta)$ analog of $\mathcal{A}_{\sf gen-str-convex}$ (Algorithm 4). Namely, $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-str-convex}$ runs in the same fashion as $\mathcal{A}_{\sf gen-str-convex}$; the only differences are that it takes an extra privacy parameter $\delta$ as input and that it calls algorithms $\mathcal{A}^{(\frac{\epsilon}{2}, \frac{\delta}{2})}_{\sf out-pert}$ and $\mathcal{A}^{(\frac{\epsilon}{2}, \frac{\delta}{2})}_{\sf gen-Lip}$ instead of $\mathcal{A}^{\frac{\epsilon}{2}}_{\sf out-pert}$ and $\mathcal{A}^{\frac{\epsilon}{2}}_{\sf gen-Lip}$, respectively.

Theorem B.1 (Privacy guarantee). Algorithm $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-str-convex}$ is $(\epsilon, \delta)$-differentially private.

Proof. The privacy guarantee follows directly from the composition theorem, together with the fact that $\mathcal{A}^{(\frac{\epsilon}{2}, \frac{\delta}{2})}_{\sf out-pert}$ is $(\frac{\epsilon}{2}, \frac{\delta}{2})$-differentially private and that $\mathcal{A}^{(\frac{\epsilon}{2}, \frac{\delta}{2})}_{\sf gen-Lip}$ is $(\frac{\epsilon}{2}, \frac{\delta}{2})$-differentially private by assumption.

Theorem B.2 (Generic utility guarantee). Let $\tilde{\theta}$ denote the output of Algorithm $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-Lip}$ on inputs $n, D, \ell, \epsilon, \delta, \tilde{\mathcal{C}}$ (for an arbitrary convex set $\tilde{\mathcal{C}} \subseteq \mathcal{C}$). Let $\hat{\theta}$ denote the minimizer of $\mathcal{L}(\cdot\,; D)$ over $\tilde{\mathcal{C}}$. If $\mathbb{E}\left[\mathcal{L}(\tilde{\theta}; D) - \mathcal{L}(\hat{\theta}; D)\right] \le F\left(p, n, \epsilon, \delta, L, \|\tilde{\mathcal{C}}\|_2\right)$ for some function $F$, then the output $\theta^{\sf priv}$ of $\mathcal{A}^{(\epsilon,\delta)}_{\sf gen-str-convex}$ satisfies $\mathbb{E}\left[\mathcal{L}(\theta^{\sf priv}; D)\right] - \mathcal{L}(\theta^*; D) \le O\left(F\left(p, n, \epsilon, \delta, L, O\left(\frac{L\sqrt{p\log(1/\delta)}\,\log(n)}{\Delta\epsilon n}\right)\right)\right)$.

Proof.
The proof follows the same lines as the proof of Theorem 4.2, except that in algorithm $\mathcal{A}^{(\epsilon/2,\,\delta/2)}_{out\text{-}pert}$ the noise vector $b$ is Gaussian; hence, using the standard bounds on the norm of an i.i.d. Gaussian vector, we have
$$\Pr\left[\|b\|_2 \le \zeta \sigma_0 \sqrt{p}\right] = \Pr\left[\|b\|_2 \le \zeta\, \frac{4L\sqrt{p\log(2/\delta)}}{\Delta \epsilon n}\right] \ge 1 - e^{-\Omega(\zeta^2)}.$$
We set $\zeta = \sqrt{3\log n}$, and the rest of the proof follows in the same way as the proof of Theorem 4.2.

C Proof of Lemma 5.1

C.1 Proof of Part 1

We restate Part 1 of the lemma here for convenience.

Lemma 5.1, Part 1: Let $n, p \in \mathbb{N}$ and $\epsilon > 0$. There is a number $M = \Omega\left(\min(n, p/\epsilon)\right)$ such that, for every $\epsilon$-differentially private algorithm $\mathcal{A}$, there is a dataset $D = \{d_1, \ldots, d_n\} \subseteq \left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$ such that, with probability at least $1/2$ (taken over the algorithm's random coins), we have
$$\left\|\mathcal{A}(D) - q(D)\right\|_2 = \Omega\left(\min\left\{1, \frac{p}{\epsilon n}\right\}\right),$$
where $q(D) = \frac{1}{n}\sum_{i=1}^n d_i$.

Proof. We use a standard packing argument. Variants of this argument have appeared in several places in the literature, e.g., [25] and [15]. We first construct $K = 2^{p/2}$ points $d^{(1)}, \ldots, d^{(K)}$ in $\left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$ such that every distinct pair $d^{(i)}, d^{(j)}$ of these points satisfies $\|d^{(i)} - d^{(j)}\|_2 \ge \frac{1}{8}$. It is easy to show the existence of such a set of points using the probabilistic method (for example, the Gilbert-Varshamov construction of a random linear $(2^{p/2}, p)$-binary code over $\left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$ achieves this property).

Fix $\epsilon > 0$. Define $n^* = \frac{p}{20\epsilon}$. Let us first consider the case where $n \le n^*$. We construct $K$ datasets $D^{(1)}, \ldots, D^{(K)}$, where for each $i \in [K]$, $D^{(i)}$ contains $n$ copies of $d^{(i)}$. Note that for all $i \neq j$,
$$\left\|q\left(D^{(i)}\right) - q\left(D^{(j)}\right)\right\|_2 \ge \frac{1}{8}. \tag{11}$$
Let $\mathcal{A}$ be any $\epsilon$-differentially private algorithm for answering $q$.
Suppose that for every $D^{(i)}$, $i \in [K]$, with probability at least $1/2$, $\left\|\mathcal{A}\left(D^{(i)}\right) - q\left(D^{(i)}\right)\right\|_2 < \frac{1}{16}$; i.e., for every $i \in [K]$, $\Pr\left[\mathcal{A}\left(D^{(i)}\right) \in B\left(D^{(i)}\right)\right] \ge \frac{1}{2}$, where for any dataset $D$ the set $B(D)$ is defined as
$$B(D) = \left\{\theta \in \mathbb{R}^p : \|\theta - q(D)\|_2 < \tfrac{1}{16}\right\}. \tag{12}$$
Note that for all $i \neq j$, $D^{(i)}$ and $D^{(j)}$ differ in all their $n$ entries. Since $\mathcal{A}$ is $\epsilon$-differentially private, for all $i \in [K]$ we have $\Pr\left[\mathcal{A}\left(D^{(1)}\right) \in B\left(D^{(i)}\right)\right] \ge \frac{1}{2} e^{-\epsilon n}$. Since, by (11) and (12), all the sets $B\left(D^{(i)}\right)$, $i \in [K]$, are mutually disjoint, we get
$$K \cdot \frac{1}{2} e^{-\epsilon n} \le \sum_{i=1}^{K} \Pr\left[\mathcal{A}\left(D^{(1)}\right) \in B\left(D^{(i)}\right)\right] \le 1,$$
which implies that $n > n^*$ for sufficiently large $p$, contradicting the fact that $n \le n^*$. Hence, there must exist a dataset $D^{(i)}$ for some $i \in [K]$ on which $\mathcal{A}$ makes an $L_2$-error of at least $\frac{1}{16}$ with probability at least $\frac{1}{2}$. Note also that the $L_2$ norm of the sum of the entries of such $D^{(i)}$ is $n$.

Next, we consider the case where $n > n^*$. Fix an arbitrary point $c \in \left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$. As before, we construct $K = 2^{p/2}$ datasets $\tilde{D}^{(1)}, \ldots, \tilde{D}^{(K)}$ of size $n$, where for every $i \in [K]$ the first $n^*$ entries of $\tilde{D}^{(i)}$ are the same as in the dataset $D^{(i)}$ from before, and the remaining $n - n^*$ entries are constructed as follows: the first $\left\lceil \frac{n - n^*}{2} \right\rceil$ of those entries are all copies of $c$, whereas the last $\left\lfloor \frac{n - n^*}{2} \right\rfloor$ are copies of $-c$. Note that any two distinct datasets $\tilde{D}^{(i)}, \tilde{D}^{(j)}$ in this collection differ in exactly $n^*$ entries. Let $\mathcal{A}$ be any $\epsilon$-differentially private algorithm for answering $q$. Suppose that for every $i \in [K]$, with probability at least $1/2$, we have
$$\left\|\mathcal{A}\left(\tilde{D}^{(i)}\right) - q\left(\tilde{D}^{(i)}\right)\right\|_2 < \frac{1}{16} \cdot \frac{n^*}{n}. \tag{13}$$
Note that for all $i \in [K]$, $q\left(\tilde{D}^{(i)}\right) = \frac{n^*}{n} q\left(D^{(i)}\right) + \alpha$, where $\alpha = \frac{c}{n}$ if $n - n^*$ is odd and $\alpha = 0$ if $n - n^*$ is even. Now, we define an algorithm $\hat{\mathcal{A}}$ for answering $q$ on datasets $D$ of size $n^*$ as follows.
First, $\hat{\mathcal{A}}$ appends $\left\lceil \frac{n - n^*}{2} \right\rceil$ copies of $c$ followed by $\left\lfloor \frac{n - n^*}{2} \right\rfloor$ copies of $-c$ to $D$ to get a dataset $\tilde{D}$ of size $n$. Then, it runs $\mathcal{A}$ on $\tilde{D}$ and outputs $\frac{n}{n^*}\left(\mathcal{A}(\tilde{D}) - \alpha\right)$. By the post-processing property of differential privacy, $\hat{\mathcal{A}}$ is $\epsilon$-differentially private since $\mathcal{A}$ is $\epsilon$-differentially private. Thus, assumption (13) implies that for every $i \in [K]$, with probability at least $1/2$, we have $\left\|\hat{\mathcal{A}}\left(D^{(i)}\right) - q\left(D^{(i)}\right)\right\|_2 < \frac{1}{16}$. However, this contradicts our result in the first part of the proof. Therefore, there must exist a dataset $\tilde{D}^{(i)}$ in the above collection such that, with probability at least $1/2$,
$$\left\|\mathcal{A}\left(\tilde{D}^{(i)}\right) - q\left(\tilde{D}^{(i)}\right)\right\|_2 \ge \frac{1}{16} \cdot \frac{n^*}{n} = \frac{1}{320} \cdot \frac{p}{\epsilon n}.$$
Note also that the $L_2$ norm of the sum of the entries of such $\tilde{D}^{(i)}$ is always between $n^* - 1$ and $n^* + 1$.

Summing up, we have shown that for every $n$ and every $\epsilon > 0$, there is a number $M = \Omega\left(\min(n, p/\epsilon)\right)$ such that, for every $\epsilon$-differentially private algorithm $\mathcal{A}$, there exists a dataset $D$ of size $n$ with $\left\|\sum_{\ell=1}^n d_\ell\right\|_2 \in [M-1, M+1]$ such that, with probability at least $1/2$,
$$\left\|\mathcal{A}(D) - q(D)\right\|_2 \ge \frac{1}{16} \min\left\{1, \frac{p}{20\epsilon n}\right\} = \Omega\left(\min\left\{1, \frac{p}{\epsilon n}\right\}\right).$$

C.2 Proof of Part 2

We restate Part 2 of the lemma below.

Lemma 5.1, Part 2: Let $n, p \in \mathbb{N}$, $\epsilon > 0$, and $\delta = o\left(\frac{1}{n}\right)$. There is a number $M = \Omega\left(\min\left(n, \sqrt{p}/\epsilon\right)\right)$ such that, for every $(\epsilon, \delta)$-differentially private algorithm $\mathcal{A}$, there is a dataset $D = \{d_1, \ldots, d_n\} \subseteq \left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$ with $\left\|\sum_{i=1}^n d_i\right\|_2 \in [M-1, M+1]$ such that, with probability at least $1/3$ (taken over the algorithm's random coins), we have
$$\left\|\mathcal{A}(D) - q(D)\right\|_2 = \Omega\left(\min\left\{1, \frac{\sqrt{p}}{\epsilon n}\right\}\right),$$
where $q(D) = \frac{1}{n}\sum_{i=1}^n d_i$.

Proof. Let $n \in \mathbb{N}$. Fix $\epsilon > 0$ and $\delta = o\left(\frac{1}{n}\right)$. Let $\mathcal{A}$ be any $(\epsilon, \delta)$-differentially private algorithm for answering $q$.
Corollary 3.6 of [8] (together with Lemma 2.5 of the same reference) shows that there exists $n^* = \Omega\left(\sqrt{p}/\epsilon\right)$ such that for every $n \le n^*$ there exists a dataset $D \subset \left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$ of size $n$ such that, with probability at least $1/3$, $\left\|\mathcal{A}(D) - q(D)\right\|_2 > \frac{2}{27}$. Note that to reach this statement from Corollary 3.6 of [8], we first have to translate their definition of $(\alpha, \beta)$-accuracy (given by Definition 2.2 in the same reference) into what it implies about the $L_2$-error, which is fairly straightforward. Also, note that in their construction the dataset entries are drawn from $\{0, 1\}^p$, whereas here the entries come from $\left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$; hence, we need to take this normalization into account when we translate their statement to our setting. Moreover, by a careful inspection of the construction in [8]³, one can show that the dataset whose existence was argued above has the property that the $L_2$-norm of the sum of its entries is $\Omega(n)$ (see Section 6 of [8] for details). This gives us the desired $\Omega(1)$ lower bound on the $L_2$-error when $n \le n^* = \Omega\left(\sqrt{p}/\epsilon\right)$.

To complete the proof, we need to consider the case where $n > n^*$. To do this, we follow the same argument as in the second half of the proof of Part 1 above (the pure case). Namely, we proceed as follows. First, let $\mathcal{A}$ be any $(\epsilon, \delta)$-differentially private algorithm (where $\delta = o\left(\frac{1}{n}\right)$). We construct datasets $\tilde{D}$ of size $n$ whose first $n^*$ entries are constructed in the same way as the datasets $D$ in the first half of this proof, and whose remaining $n - n^*$ entries are constructed such that half of them contain $c$ and the other half contain $-c$, for some fixed $c \in \left\{-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right\}^p$.
Then, we show that if for all such $\tilde{D}$, with probability at least $2/3$, $\left\|\mathcal{A}(\tilde{D}) - q(\tilde{D})\right\|_2 < \frac{2}{27} \cdot \frac{n^*}{n}$, then there is an $(\epsilon, \delta)$-differentially private algorithm $\hat{\mathcal{A}}$ such that for all $D$ of size $n^*$ constructed earlier we get $\left\|\hat{\mathcal{A}}(D) - q(D)\right\|_2 < \frac{2}{27}$, which contradicts the result of the first half of this proof. Hence, there is a dataset $\tilde{D}$ of size $n$ (constructed as above) such that, with probability at least $1/3$,
$$\left\|\mathcal{A}(\tilde{D}) - q(\tilde{D})\right\|_2 \ge \frac{2}{27} \cdot \frac{n^*}{n} = \Omega\left(\frac{\sqrt{p}}{\epsilon n}\right).$$
Note also that such $\tilde{D}$ has the property that the $L_2$ norm of the sum of its entries lies in $[M-1, M+1]$, where $M$ is the $L_2$ norm of the sum of the first $n^*$ entries of $\tilde{D}$, which is $\Omega(n^*) = \Omega\left(\sqrt{p}/\epsilon\right)$ (following the argument of the first half of this proof).

D Converting Excess Risk Bounds in Expectation to High-probability Bounds

In this paper, all of our utility guarantees are stated in terms of the expectation over the randomness of the algorithm. Although all the utility analyses except that of the gradient-descent-based algorithm (Algorithm 1) provide high-probability guarantees directly, in this section we provide a generic approach for obtaining a high-probability guarantee from an expected excess risk bound. The idea is to run the underlying differentially private algorithm $k$ times, with privacy parameters $\epsilon/k$ and $\delta/k$ for each run. Let $\theta^{priv}_1, \ldots, \theta^{priv}_k$ be the vectors output by the $k$ runs. First, notice that the tuple $\left(\theta^{priv}_1, \ldots, \theta^{priv}_k\right)$ is $(\epsilon, \delta)$-differentially private. Moreover, if the algorithm has expected excess risk $F(\epsilon, \delta)$ (where $F$ is the specific excess risk bound as a function of $\epsilon$ and $\delta$), then by Markov's inequality there exists an execution $i \in [k]$ for which the excess risk is at most $2F(\epsilon/k, \delta/k)$, with probability at least $1 - 1/2^k$. One can now use the exponential mechanism from Algorithm 2 to pick the best $\theta^{priv}_i$ from the list.
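The repetition-plus-selection procedure just described can be sketched as follows; `private_erm` and `loss` are hypothetical stand-ins for the underlying private algorithm and the empirical loss, and a careful implementation would also account for the privacy budget consumed by the exponential-mechanism selection step.

```python
import numpy as np

def high_prob_boost(private_erm, loss, D, eps, delta, rho, sens, rng):
    """Run the private algorithm k = ceil(log(2/rho)) times with budget
    (eps/k, delta/k) per run, then select among the k outputs with the
    exponential mechanism (utility = -loss, sensitivity sens)."""
    k = max(1, int(np.ceil(np.log(2.0 / rho))))
    candidates = [private_erm(D, eps / k, delta / k) for _ in range(k)]
    scores = np.array([-loss(th, D) for th in candidates])
    logits = (eps / (2.0 * sens)) * scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[rng.choice(k, p=probs)]

# toy usage: privately estimate a 1-d mean with Laplace noise
rng = np.random.default_rng(1)
D = rng.normal(1.0, 1.0, size=1000)
erm = lambda D, e, d: D.mean() + rng.laplace(0.0, 1.0 / (e * len(D)))
sq_loss = lambda th, D: float(np.mean((D - th) ** 2))
theta = high_prob_boost(erm, sq_loss, D, eps=1.0, delta=0.0,
                        rho=0.1, sens=0.1, rng=rng)
```

With $\rho = 0.1$ this makes $k = 3$ runs, each at budget $\epsilon/3$, matching the $k = \log(2/\rho)$ choice in the analysis above.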
By the same analysis as in Theorem 3.2, one can show that, with probability at least $1 - \rho/2$, the exponential mechanism outputs a vector $\theta^{priv}$ whose excess risk is within $O\left(\frac{L \|\mathcal{C}\|_2 \log(k/\rho)}{\epsilon}\right)$ of $\min_i \text{ExcessRisk}\left(\theta^{priv}_i\right)$. Setting $k = \log(2/\rho)$, we have that, with probability at least $1 - \rho$, the excess risk of $\theta^{priv}$ is at most $O\left(F\left(\frac{\epsilon}{\log(1/\rho)}, \frac{\delta}{\log(1/\rho)}\right)\right)$. Placing this bound in the context of the paper, the high-probability bounds are only a $\text{poly}\log(1/\rho)$ factor off from the expectation bounds.

E Excess Risk Bounds for Smooth Functions

In this section we consider the scenario where each loss function $\ell(\theta; d)$ (for all $d$ in the domain) is $\beta$-smooth in addition to being $L$-Lipschitz (for $\theta \in \mathcal{C}$). It turns out that, for both $\epsilon$- and $(\epsilon, \delta)$-differential privacy, the objective perturbation algorithm (see (14)) [11, 34] achieves the best possible error guarantees, where the random vector $b$ is sampled either i) from the Gamma distribution with density $\propto e^{-\frac{\epsilon \|b\|_2}{2L}}$, or ii) from the normal distribution $\mathcal{N}\left(0, \mathbb{I}_p \frac{8L^2 \log(1/\delta)}{\epsilon^2}\right)$. In terms of privacy, when the noise vector $b$ is drawn from the Gamma distribution the algorithm is $\epsilon$-differentially private, and when the noise is Gaussian it is $(\epsilon, \delta)$-differentially private. For completeness, we also state the error bounds from [34] (translated to the context of this paper).
$$\theta^{priv} = \arg\min_{\theta \in \mathcal{C}}\; L(\theta; D) + \frac{\Delta}{2}\|\theta\|_2^2 + \langle b, \theta \rangle \tag{14}$$

Theorem E.1 (Lipschitz and smooth functions). The excess risk bounds are as follows:

1. [11] With Gamma density $\nu_1$, setting $\Delta = \Theta\left(\frac{Lp}{\epsilon \|\mathcal{C}\|_2}\right)$ and assuming $\Delta \ge \frac{\beta}{2}$, we have $\mathbb{E}\left[L(\theta^{priv}; D)\right] - L(\theta^*; D) = O\left(\frac{L \|\mathcal{C}\|_2\, p}{\epsilon}\right)$.

2. [34] With Gaussian density, setting $\Delta = \Theta\left(\frac{\sqrt{L^2 p \log(1/\delta)}}{\epsilon \|\mathcal{C}\|_2}\right)$ and assuming $\Delta \ge \frac{\beta}{2}$, we have $\mathbb{E}\left[L(\theta^{priv}; D)\right] - L(\theta^*; D) = O\left(\frac{L \|\mathcal{C}\|_2 \sqrt{p \ln(1/\delta)}}{\epsilon}\right)$.

³Their construction is based on robust $(n, p)$-fingerprinting codes (see Definition 3.3 in the same reference).
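For concreteness, here is a minimal sketch of objective perturbation (14) with the Gaussian noise above, using plain gradient descent on a logistic loss; the loss, step size, and unconstrained setup are illustrative assumptions, not the paper's implementation (which would also restrict the minimization to $\mathcal{C}$).

```python
import numpy as np

def obj_pert_gaussian(X, y, eps, delta, L, Delta, steps=500, lr=0.1, seed=0):
    """Objective perturbation (14): minimize
        L(theta; D) + (Delta/2)||theta||_2^2 + <b, theta>
    with b ~ N(0, I_p * 8 L^2 log(1/delta) / eps^2),
    here with L(theta; D) the logistic loss over n examples."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(8.0 * L**2 * np.log(1.0 / delta)) / eps
    b = rng.normal(0.0, sigma, size=p)
    theta = np.zeros(p)
    for _ in range(steps):
        z = y * (X @ theta)
        # gradient of sum_i log(1 + exp(-z_i)) with respect to theta
        grad_loss = -((1.0 - 1.0 / (1.0 + np.exp(-z))) * y) @ X
        grad = grad_loss + Delta * theta + b
        theta -= (lr / n) * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit rows, so L = 1
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
theta = obj_pert_gaussian(X, y, eps=2.0, delta=1e-6, L=1.0, Delta=5.0)
```

Because the noise enters the *objective* rather than the output, the minimizer itself adapts to the perturbation, which is what yields the dimension-dependent bounds of Theorem E.1 for smooth losses.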
Additionally, when the loss function $\ell(\theta; d)$ is $\Delta$-strongly convex (for $\theta \in \mathcal{C}$) for all $d$ in the domain, with the condition that $\Delta \ge \frac{\beta}{2}$, one can essentially recover the tight error guarantees for the $\epsilon$ and $(\epsilon, \delta)$ cases of differential privacy, respectively. The main observation is that, for the privacy guarantee to be achieved, one need not add the additional regularizer. Although not in this explicit form, a variant of this observation appears in the work of [34]. We state the error guarantee from [34, Theorem 31] translated to our setting. Notice that, unlike Theorem E.1, the error guarantee in Theorem E.2 does not depend on the diameter of the convex set $\mathcal{C}$.

Theorem E.2 (Lipschitz, smooth and strongly convex functions). The excess risk bounds are as follows:

1. With Gamma density $\nu_1$, if $\Delta \ge \frac{\beta}{2}$, we have $\mathbb{E}\left[L(\theta^{priv}; D)\right] - L(\theta^*; D) = O\left(\frac{L^2 p^2}{n \Delta \epsilon^2}\right)$.

2. With Gaussian density, if $\Delta \ge \frac{\beta}{2}$, we have $\mathbb{E}\left[L(\theta^{priv}; D)\right] - L(\theta^*; D) = O\left(\frac{L^2 p \ln(1/\delta)}{n \Delta \epsilon^2}\right)$.

F From Excess Empirical Risk to Generalization Error

In this section, we provide a generic tool to interpret our ERM results in the context of generalization error (true risk) bounds. For a given distribution $\tau$, we define the true risk of a model $\theta \in \mathcal{C}$ as follows:
$$\text{TrueRisk}(\theta) = \mathbb{E}_{d \sim \tau}\left[\ell(\theta; d)\right]. \tag{15}$$
Analogously, we define the excess risk of a given model $\theta$, denoted $\text{ExcessRisk}(\theta)$:
$$\text{ExcessRisk}(\theta) = \mathbb{E}_{d \sim \tau}\left[\ell(\theta; d)\right] - \min_{\theta' \in \mathcal{C}} \mathbb{E}_{d \sim \tau}\left[\ell(\theta'; d)\right]. \tag{16}$$
Let $D$ be a data set of $n$ samples drawn i.i.d. from the distribution $\tau$. The following theorem from learning theory relates the true excess risk to the excess empirical risk.

Theorem F.1 (Section 5.4 from [42]). Let $\ell$ be an $L$-Lipschitz, $\Delta$-strongly convex loss function. With probability at least $1 - \gamma$ over the randomness of sampling the data set $D$, the following is true:
$$\text{ExcessRisk}(\theta) \le \sqrt{\frac{2L^2}{n\Delta}\left(L(\theta; D) - \min_{\theta' \in \mathcal{C}} L(\theta'; D)\right)} + \frac{4L^2}{\gamma \Delta n}.$$
Plugging in the utility guarantees for $\theta^{priv}$ from Theorems 2.4 and 4.3, and using the expectation-to-high-probability trick from Appendix D, we obtain the following.

Theorem F.2 (Lipschitz and strongly convex functions).

1. There exists an $\epsilon$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{L^2 p \sqrt{\log n \cdot \text{poly}\log(1/\gamma)}}{\Delta n \gamma \epsilon}\right).$$

2. There exists an $(\epsilon, \delta)$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{L^2 \sqrt{p} \log^2(n/\delta) \cdot \text{poly}\log(1/\gamma)}{\Delta n \gamma \epsilon}\right).$$

One can use the following regularization trick to get excess risk guarantees for general convex functions. Let $\ell(\theta; d)$ be an $L$-Lipschitz function, and let $\hat{\ell}(\theta; d) = \ell(\theta; d) + \frac{\Delta}{2}\|\theta\|_2^2$. Notice that over the convex set $\mathcal{C}$, $\hat{\ell}$ is $(L + \Delta\|\mathcal{C}\|_2)$-Lipschitz and $\Delta$-strongly convex. Let $\text{ExcessRisk}_\ell(\theta)$ denote the excess risk for the loss function $\ell$. We can observe the following:
$$\text{ExcessRisk}_\ell(\theta) \le \text{ExcessRisk}_{\hat{\ell}}(\theta) + \frac{\Delta}{2}\|\mathcal{C}\|_2^2. \tag{17}$$
Combining Theorem F.2 with (17) and optimizing for $\Delta$, we have the following.

Theorem F.3 (Lipschitz functions).

1. There exists an $\epsilon$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{\sqrt{p}\,(L + \|\mathcal{C}\|_2)\|\mathcal{C}\|_2\,(\log n \cdot \text{poly}\log(1/\gamma))^{1/4}}{\sqrt{n \gamma \epsilon}}\right).$$

2.
There exists an $(\epsilon, \delta)$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{p^{1/4}\,(L + \|\mathcal{C}\|_2)\|\mathcal{C}\|_2 \log(n/\delta) \cdot \text{poly}\log(1/\gamma)}{\sqrt{n \gamma \epsilon}}\right).$$

Note. While the dependence on $n$ of our private algorithms in Theorem F.3 matches the bounds for the corresponding non-private algorithms (see [42]), unlike their non-private counterparts, the private algorithms have an explicit dependence on $p$. We leave it as an open problem to determine the right dependence of the excess risk on $p$ for private algorithms.

Generalized Linear Models (GLM). When the loss function $\ell$ is a generalized linear function, we obtain better bounds on the true excess risk. The following bounds are actually tight (see the note after the theorem statement).

Theorem F.4. Suppose the loss function $\ell(\theta; d)$ can be written as $g(\langle \theta, d \rangle; d)$, where $g$ is $L_g$-Lipschitz in its first argument and $d \in \mathcal{X}$ with $\|\mathcal{X}\|_2 \le R$. Let $L = L_g R$.

1. There exists an $\epsilon$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true, assuming $n = \omega\left((p/\epsilon)^2\right)$:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{L \|\mathcal{C}\|_2 \sqrt{\log(1/\gamma)}}{\sqrt{n}}\right).$$

2. There exists an $(\epsilon, \delta)$-differentially private algorithm that outputs $\theta^{priv}$ such that, with probability at least $1 - \gamma$ over the randomness of sampling the data set $D$ and of the risk minimization algorithm, the following is true, assuming $n = \omega\left(p/\epsilon^2\right)$:
$$\text{ExcessRisk}(\theta^{priv}) = O\left(\frac{L \|\mathcal{C}\|_2 \sqrt{\log(1/\gamma)} \log^2(n/\delta)}{\sqrt{n}}\right).$$

The above theorem follows from Theorem 2 in [42] and the regularization trick above.
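The regularization trick used above is purely mechanical; a short sketch follows, where the choice of base loss is an illustrative assumption.

```python
import numpy as np

def regularize(loss, grad, Delta):
    """Given a convex L-Lipschitz loss ell(theta; d) and its gradient,
    return hat_ell(theta; d) = ell(theta; d) + (Delta/2)||theta||_2^2
    and its gradient. Over a set of radius ||C||_2, hat_ell is
    (L + Delta ||C||_2)-Lipschitz and Delta-strongly convex."""
    hat_loss = lambda th, d: loss(th, d) + 0.5 * Delta * float(np.dot(th, th))
    hat_grad = lambda th, d: grad(th, d) + Delta * th
    return hat_loss, hat_grad

# example: absolute-error loss (1-Lipschitz, not strongly convex)
loss = lambda th, d: float(np.abs(th - d).sum())
grad = lambda th, d: np.sign(th - d)
hat_loss, hat_grad = regularize(loss, grad, Delta=0.5)

th = np.array([1.0, -2.0])
d = np.zeros(2)
# hat_loss exceeds loss by exactly (Delta/2)||th||^2 = 0.25 * 5 = 1.25
gap = hat_loss(th, d) - loss(th, d)
```

The pointwise gap of at most $\frac{\Delta}{2}\|\mathcal{C}\|_2^2$ is exactly the second term of (17); optimizing over $\Delta$ then trades this bias against the strongly convex guarantee of Theorem F.2.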
Theorem F.4 shows that, in the case of GLMs, we can essentially attain the non-private upper bound of $O\left(\frac{L \|\mathcal{C}\|_2}{\sqrt{n}}\right)$, which is known to be tight: for example, if we consider a linear loss function, then using a standard central limit theorem argument (or standard lower bounds on the minimax error in parametric estimation), one can show that there exists a distribution on $\mathcal{X}$ for which the true excess risk is $\Omega\left(\frac{L \|\mathcal{C}\|_2}{\sqrt{n}}\right)$.

For general Lipschitz loss functions, we provide a tool that can be used to obtain expectation guarantees (over the algorithm's random coins) on the excess risk, as opposed to the high-probability guarantees given in Theorem F.3 above. For any $L$-Lipschitz loss function $\ell$ and any distribution $\tau$ from which the data points $d_1, \ldots, d_n$ comprising the dataset $D$ are drawn in i.i.d. fashion, we define
$$\text{EmpExcRisk}(\theta) \triangleq \frac{1}{n}\, \mathbb{E}_{d_1, \ldots, d_n \sim \tau}\left[L(\theta; D) - \min_{\theta' \in \mathcal{C}} L(\theta'; D)\right].$$
Now, we give the following useful lemma.

Lemma F.5.

1. Let $\theta^{priv}$ denote the output of an $(\epsilon, 0)$-differentially private algorithm. We have
$$\mathbb{E}\left[\text{ExcessRisk}(\theta^{priv})\right] = O\left(\epsilon L \|\mathcal{C}\|_2 + \mathbb{E}\left[\text{EmpExcRisk}(\theta^{priv})\right]\right).$$

2. Let $\theta^{priv}$ denote the output of an $(\epsilon, \delta)$-differentially private algorithm. We have
$$\mathbb{E}\left[\text{ExcessRisk}(\theta^{priv})\right] = O\left(\epsilon L \|\mathcal{C}\|_2 + \|\mathcal{C}\|_2^2\, \delta + \mathbb{E}\left[\text{EmpExcRisk}(\theta^{priv})\right]\right),$$
where the expectation in both cases is over the random coins of the algorithm.

We can use this lemma, together with our ERM upper bounds for general Lipschitz functions, to give the following expectation guarantees on the excess risk.

Theorem F.6 (Lipschitz functions: expectation guarantees).

1. There is a $\left(\Theta\left(\sqrt{\frac{p}{n}}\right), 0\right)$-differentially private algorithm that outputs $\theta^{priv}$ such that the following is true:
$$\mathbb{E}\left[\text{ExcessRisk}(\theta^{priv})\right] = O\left(L \|\mathcal{C}\|_2 \sqrt{\frac{p}{n}}\right).$$

2. There exists a $\left(\Theta\left(\frac{p^{1/4} \log(n/\delta)}{\sqrt{n}}\right), \delta\right)$-differentially private algorithm that outputs $\theta^{priv}$ such that the following is true:
$$\mathbb{E}\left[\text{ExcessRisk}(\theta^{priv})\right] = O\left(L \|\mathcal{C}\|_2\, \frac{p^{1/4} \log(n/\delta)}{\sqrt{n}}\right).$$
Here, the expectation in both cases is over the random coins of the algorithm.
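The privacy levels in Theorem F.6 come from balancing the stability term $\epsilon L \|\mathcal{C}\|_2$ of Lemma F.5 against the empirical term; suppressing constants and setting $L = \|\mathcal{C}\|_2 = 1$, the balancing arithmetic for part 1 can be checked numerically (the specific grid and scale below are illustrative):

```python
import numpy as np

# Part 1 of Theorem F.6: with EmpExcRisk ~ p/(n*eps) for the pure-DP
# Lipschitz ERM bound, the total excess risk eps + p/(n*eps) is
# minimized at eps* = sqrt(p/n), giving a value of order sqrt(p/n).
def total_risk(eps, p, n):
    return eps + p / (n * eps)

p, n = 50, 10_000
eps_star = np.sqrt(p / n)               # ~ 0.0707
grid = np.linspace(0.01, 1.0, 2000)
best = grid[np.argmin(total_risk(grid, p, n))]
```

The same balancing with the $(\epsilon, \delta)$ empirical bound $\sqrt{p}\log(n/\delta)/(n\epsilon)$ yields the $\Theta\left(p^{1/4}\log(n/\delta)/\sqrt{n}\right)$ privacy level of part 2.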