Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization
We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
Authors: Shai Shalev-Shwartz, Tong Zhang
1 Introduction

We consider the following generic optimization problem associated with regularized loss minimization of linear predictors. Let $X_1, \ldots, X_n$ be matrices in $\mathbb{R}^{d \times k}$ (referred to as instances), let $\phi_1, \ldots, \phi_n$ be a sequence of vector convex functions defined on $\mathbb{R}^k$ (referred to as loss functions), let $g(\cdot)$ be a convex function defined on $\mathbb{R}^d$ (referred to as a regularizer), and let $\lambda \ge 0$ (referred to as a regularization parameter). Our goal is to solve

$$\min_{w \in \mathbb{R}^d} P(w) \quad \text{where} \quad P(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w). \tag{1}$$

For example, in ridge regression the regularizer is $g(w) = \frac{1}{2}\|w\|_2^2$, the instances are column vectors, and for every $i$ the $i$'th loss function is $\phi_i(a) = \frac{1}{2}(a - y_i)^2$ for some scalar $y_i$. Let $w^* = \operatorname{argmin}_w P(w)$ (we will later make assumptions that imply that $w^*$ is unique). We say that $w$ is $\epsilon$-accurate if $P(w) - P(w^*) \le \epsilon$.

Our main result is a new algorithm for solving (1). If $g$ is $1$-strongly convex and each $\phi_i$ is $(1/\gamma)$-smooth (meaning that its gradient is $(1/\gamma)$-Lipschitz), then our algorithm finds, with probability of at least $1 - \delta$, an $\epsilon$-accurate solution to (1) in time

$$O\!\left( d\left( n + \min\left\{ \tfrac{1}{\lambda\gamma}, \sqrt{\tfrac{n}{\lambda\gamma}} \right\} \right) \log(1/\epsilon)\,\log(1/\delta)\,\max\{1, \log^2(1/(\lambda\gamma n))\} \right) = \tilde{O}\!\left( d\left( n + \min\left\{ \tfrac{1}{\lambda\gamma}, \sqrt{\tfrac{n}{\lambda\gamma}} \right\} \right) \right).$$

This applies, for example, to ridge regression and to logistic regression with $L_2$ regularization.
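As a concrete numerical illustration of objective (1) (our own sketch, not part of the paper; the function names and synthetic data are ours), the ridge regression instance can be written and sanity-checked in a few lines:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i 0.5*(x_i^T w - y_i)^2 + lam * 0.5*||w||_2^2,
    the ridge-regression instance of objective (1)."""
    residuals = X @ w - y                   # x_i^T w - y_i for every example i
    loss = 0.5 * np.mean(residuals ** 2)    # (1/n) sum_i phi_i(x_i^T w)
    reg = lam * 0.5 * np.dot(w, w)          # lam * g(w) with g(w) = 0.5*||w||^2
    return loss + reg

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 0.1
# Ridge has a closed-form minimizer: w* = (X^T X / n + lam I)^{-1} (X^T y / n).
w_star = np.linalg.solve(X.T @ X / 50 + lam * np.eye(5), X.T @ y / 50)
assert ridge_objective(w_star, X, y, lam) <= ridge_objective(np.zeros(5), X, y, lam)
```

The closed-form solve is only usable for this small quadratic instance; the point of the paper is to handle general $\phi_i$ and $g$ at scale.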
The $O$ notation hides constant terms and the $\tilde{O}$ notation hides constants and logarithmic terms. We make these explicit in the formal statement of our theorems.

(Affiliations: Shai Shalev-Shwartz: School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel. Tong Zhang: Department of Statistics, Rutgers University, NJ, USA, and Baidu Inc., Beijing, China.)

Intuitively, we can think of $\frac{1}{\lambda\gamma}$ as the condition number of the problem. If the condition number is $O(n)$, then our runtime becomes $\tilde{O}(dn)$. This means that the runtime is nearly linear in the size of the data. This matches the recent results of Shalev-Shwartz and Zhang [25] and Le Roux et al. [15], but our setting is significantly more general. When the condition number is much larger than $n$, our runtime becomes $\tilde{O}\!\left(d\sqrt{\frac{n}{\lambda\gamma}}\right)$. This significantly improves over the results of [25, 15]. It also significantly improves over the runtime of accelerated gradient descent due to Nesterov [18], which is $\tilde{O}\!\left(dn\sqrt{\frac{1}{\lambda\gamma}}\right)$.

By applying a smoothing technique to the $\phi_i$, we also derive a method that finds an $\epsilon$-accurate solution to (1) assuming only that each $\phi_i$ is $O(1)$-Lipschitz, with runtime

$$\tilde{O}\!\left( d\left( n + \min\left\{ \tfrac{1}{\lambda\epsilon}, \sqrt{\tfrac{n}{\lambda\epsilon}} \right\} \right) \right).$$

This applies, for example, to SVM with the hinge-loss. It significantly improves over the rate $\frac{d}{\lambda\epsilon}$ of SGD (e.g. [22]) when $\frac{1}{\lambda\epsilon} \gg n$.

We can also apply our results to non-strongly convex regularizers (such as the $L_1$ norm regularizer), or to non-regularized problems, by adding a slight $L_2$ regularization. For example, for $L_1$ regularized problems, assuming that each $\phi_i$ is $(1/\gamma)$-smooth, we obtain the runtime $\tilde{O}\!\left( d\left( n + \min\left\{ \tfrac{1}{\gamma\epsilon}, \sqrt{\tfrac{n}{\gamma\epsilon}} \right\} \right) \right)$. This applies, for example, to the Lasso problem, in which the goal is to minimize the squared loss plus an $L_1$ regularization term.
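To get a feel for the regimes discussed above, the following back-of-the-envelope computation (ours; the sizes are hypothetical) compares the two iteration-complexity terms $n + \frac{1}{\lambda\gamma}$ and $n + \sqrt{\frac{n}{\lambda\gamma}}$:

```python
import math

# Hypothetical problem sizes, for illustration: n examples, condition number 1/(lam*gamma).
n = 10**6
lam, gamma = 1e-8, 1.0
cond = 1.0 / (lam * gamma)                 # about 1e8, much larger than n
sdca_iters = n + cond                      # the O(n + 1/(lam*gamma)) regime of [25, 15]
accel_iters = n + math.sqrt(n * cond)      # the O(n + sqrt(n/(lam*gamma))) regime shown here
assert accel_iters < sdca_iters / 8        # acceleration wins by roughly an order of magnitude
```

When the condition number drops to $O(n)$ the two terms coincide up to constants, which is why acceleration is only invoked in the ill-conditioned regime.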
To put our results in context, in the table below we specify the runtime of various algorithms (while ignoring constants and logarithmic terms) for three key machine learning applications: SVM, in which $\phi_i(a) = \max\{0, 1-a\}$ and $g(w) = \frac12\|w\|_2^2$; Lasso, in which $\phi_i(a) = \frac12(a - y_i)^2$ and $g(w) = \sigma\|w\|_1$; and Ridge Regression, in which $\phi_i(a) = \frac12(a - y_i)^2$ and $g(w) = \frac12\|w\|_2^2$. Additional applications, and a more detailed runtime comparison to previous work, are given in Section 5. In the table below, SGD stands for Stochastic Gradient Descent and AGD stands for Accelerated Gradient Descent.

SVM:
  SGD [22]: $\frac{d}{\lambda\epsilon}$
  AGD [17]: $dn\sqrt{\frac{1}{\lambda\epsilon}}$
  This paper: $d\left( n + \min\left\{ \frac{1}{\lambda\epsilon}, \sqrt{\frac{n}{\lambda\epsilon}} \right\} \right)$
Lasso:
  SGD and variants (e.g. [28, 27, 21]): $\frac{d}{\epsilon^2}$
  Stochastic Coordinate Descent [23, 16]: $\frac{dn}{\epsilon}$
  FISTA [18, 2]: $dn\sqrt{\frac{1}{\epsilon}}$
  This paper: $d\left( n + \min\left\{ \frac{1}{\epsilon}, \sqrt{\frac{n}{\epsilon}} \right\} \right)$
Ridge Regression:
  Exact: $d^2 n + d^3$
  SGD [15], SDCA [25]: $d\left( n + \frac{1}{\lambda} \right)$
  AGD [18]: $dn\sqrt{\frac{1}{\lambda}}$
  This paper: $d\left( n + \min\left\{ \frac{1}{\lambda}, \sqrt{\frac{n}{\lambda}} \right\} \right)$

Technical contribution: Our algorithm combines two ideas. The first is a proximal version of stochastic dual coordinate ascent (SDCA).¹ In particular, we generalize the recent analysis of [25] in two directions. First, we allow the regularizer, $g$, to be a general strongly convex function (and not necessarily the squared Euclidean norm). This allows us to consider non-smooth regularization functions, such as the $L_1$ regularization. Second, we allow the loss functions, $\phi_i$, to be vector-valued functions which are smooth (or Lipschitz) with respect to a general norm. This generalization is useful in multiclass applications. As in [25], the runtime of this procedure is $\tilde{O}\!\left( d\left( n + \frac{1}{\lambda\gamma} \right) \right)$. This would be nearly linear time (in the size of the data) if $\frac{1}{\lambda\gamma} = O(n)$. Our second idea deals with the case $\frac{1}{\lambda\gamma} \gg n$ by iteratively approximating the objective function $P$ with objective functions that have a stronger regularization.
In particular, each iteration of our acceleration procedure involves approximate minimization of $P(w) + \frac{\kappa}{2}\|w - y\|_2^2$ with respect to $w$, where $y$ is a vector obtained from previous iterates and $\kappa$ is of order $1/(\gamma n)$. The idea is that the addition of the relatively strong regularization makes the runtime of our proximal stochastic dual coordinate ascent procedure $\tilde{O}(dn)$. And, with a proper choice of $y$ at each iteration, we show that the sequence of solutions of the problems with the added regularization converges to the minimum of $P$ after $\sqrt{\frac{1}{\lambda\gamma n}}$ iterations. This yields the overall runtime of $d\sqrt{\frac{n}{\lambda\gamma}}$.

Additional related work: As mentioned before, our first contribution is a proximal version of the stochastic dual coordinate ascent method and an extension of the analysis given in Shalev-Shwartz and Zhang [25]. Stochastic dual coordinate ascent has also been studied in Collins et al. [3], but in more restricted settings than the general problem considered in this paper. One can also apply the analysis of stochastic coordinate descent methods given in Richtárik and Takáč [19] to the dual problem. However, here we are interested in understanding the primal sub-optimality, hence an analysis which only applies to the dual problem is not sufficient. The generality of our approach allows us to apply it to multiclass prediction problems. We discuss this in detail in Section 5. Recently, [13] derived a stochastic coordinate ascent method for structural SVM based on the Frank-Wolfe algorithm. Although derived with different motivations, for the special case of multiclass problems with the hinge-loss their algorithm ends up being the same as our proximal dual ascent algorithm (with the same rate). Our approach allows us to accelerate the method and obtain an even faster rate.

The proof of our acceleration method adapts Nesterov's estimation sequence technique, studied in Devolder et al. [7], Schmidt et al.
[20], to allow approximate and stochastic proximal mappings. See also [1, 6]. In particular, it relies on similar ideas as in Proposition 4 of [20]. However, our specific requirement is different, and the proof presented here is different and significantly simpler than that of [20].

There have been several attempts to accelerate stochastic optimization algorithms. See for example [12, 11, 4] and the references therein. However, the runtimes of these methods have a polynomial dependence on $1/\epsilon$ even if the $\phi_i$ are smooth and $g$ is $\lambda$-strongly convex, as opposed to the logarithmic dependence on $1/\epsilon$ obtained here. As in [15, 25], we avoid the polynomial dependence on $1/\epsilon$ by allowing more than a single pass over the data.

¹ Technically speaking, it may be more accurate to use the term randomized dual coordinate ascent instead of stochastic dual coordinate ascent. This is because our algorithm makes more than one pass over the data, and therefore cannot work directly on distributions with infinite support. However, following the convention in the prior machine learning literature, we do not make this distinction.

2 Preliminaries

All the functions we consider in this paper are proper convex functions over a Euclidean space. We use $\mathbb{R}$ to denote the set of real numbers; to simplify our notation, when we use $\mathbb{R}$ to denote the range of a function $f$ we in fact allow $f$ to output the value $+\infty$. Given a function $f : \mathbb{R}^d \to \mathbb{R}$, we denote its conjugate function by

$$f^*(y) = \sup_x \left[ y^\top x - f(x) \right].$$

Given a norm $\|\cdot\|_P$, we denote the dual norm by $\|\cdot\|_D$, where

$$\|y\|_D = \sup_{x : \|x\|_P = 1} y^\top x.$$

We use $\|\cdot\|$ or $\|\cdot\|_2$ to denote the $L_2$ norm, $\|x\| = \sqrt{x^\top x}$. We also use $\|x\|_1 = \sum_i |x_i|$ and $\|x\|_\infty = \max_i |x_i|$. The operator norm of a matrix $X$ with respect to norms $\|\cdot\|_P, \|\cdot\|_{P'}$ is defined as $\|X\|_{P \to P'} = \sup_{u : \|u\|_P = 1} \|Xu\|_{P'}$.
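As a quick numerical sanity check of the conjugate definition (our own sketch, not part of the paper), one can approximate $f^*(y) = \sup_x [yx - f(x)]$ on a one-dimensional grid and verify two standard facts used later: $f(x) = \frac12 x^2$ is its own conjugate, and a Lipschitz function has a conjugate that is finite only on a bounded set:

```python
import numpy as np

def conjugate(f, y, grid):
    """Numerically approximate f*(y) = sup_x [y*x - f(x)] over a 1-D grid."""
    return np.max(y * grid - f(grid))

grid = np.linspace(-10.0, 10.0, 100001)

# f(x) = 0.5*x^2 is its own conjugate: f*(y) = 0.5*y^2.
f = lambda x: 0.5 * x ** 2
for y in [-2.0, 0.0, 1.5, 3.0]:
    assert abs(conjugate(f, y, grid) - 0.5 * y ** 2) < 1e-3

# f(x) = |x| is 1-Lipschitz, so f*(y) is finite (in fact zero) only for |y| <= 1.
g = lambda x: np.abs(x)
assert abs(conjugate(g, 0.5, grid)) < 1e-9
assert conjugate(g, 2.0, grid) > 5.0   # grows without bound as the grid widens
```

The second observation is exactly the content of Lemma 1 in Section 3.3.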
A function $f : \mathbb{R}^k \to \mathbb{R}^d$ is $L$-Lipschitz with respect to a norm $\|\cdot\|_P$, whose dual norm is $\|\cdot\|_D$, if for all $a, b$ in its domain we have $\|f(a) - f(b)\|_D \le L\|a - b\|_P$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $(1/\gamma)$-smooth with respect to a norm $\|\cdot\|_P$ if it is differentiable and its gradient is $(1/\gamma)$-Lipschitz with respect to $\|\cdot\|_P$. An equivalent condition is that for all $a, b \in \mathbb{R}^d$,

$$f(a) \le f(b) + \nabla f(b)^\top (a - b) + \frac{1}{2\gamma}\|a - b\|_P^2.$$

A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\gamma$-strongly convex with respect to $\|\cdot\|_P$ if

$$f(w + v) \ge f(w) + \nabla f(w)^\top v + \frac{\gamma}{2}\|v\|_P^2.$$

It is well known that $f$ is $\gamma$-strongly convex with respect to $\|\cdot\|_P$ if and only if $f^*$ is $(1/\gamma)$-smooth with respect to the dual norm $\|\cdot\|_D$.

The dual problem of (1) is

$$\max_{\alpha \in \mathbb{R}^{k \times n}} D(\alpha) \quad \text{where} \quad D(\alpha) = \frac{1}{n}\sum_{i=1}^n -\phi_i^*(-\alpha_i) - \lambda g^*\!\left( \frac{1}{\lambda n}\sum_{i=1}^n X_i \alpha_i \right), \tag{2}$$

where $\alpha_i$ is the $i$'th column of the matrix $\alpha$, which forms a vector in $\mathbb{R}^k$. We will assume that $g$ is strongly convex, which implies that $g^*(\cdot)$ is continuously differentiable. If we define

$$v(\alpha) = \frac{1}{\lambda n}\sum_{i=1}^n X_i \alpha_i \quad \text{and} \quad w(\alpha) = \nabla g^*(v(\alpha)), \tag{3}$$

then it is known that $w(\alpha^*) = w^*$, where $\alpha^*$ is an optimal solution of (2). It is also known that $P(w^*) = D(\alpha^*)$, which immediately implies that $P(w) \ge D(\alpha)$ for all $w$ and $\alpha$; hence the duality gap, defined as $P(w(\alpha)) - D(\alpha)$, can be regarded as an upper bound on both the primal sub-optimality, $P(w(\alpha)) - P(w^*)$, and the dual sub-optimality, $D(\alpha^*) - D(\alpha)$.

3 Main Results

In this section we describe our algorithms and their analysis. We start in Section 3.1 with a description of our proximal stochastic dual coordinate ascent procedure (Prox-SDCA). Then, in Section 3.2 we show how to accelerate the method by calling Prox-SDCA on a sequence of problems with strong regularization. Throughout the first two sections we assume that the loss functions are smooth.
Finally, we discuss the case of Lipschitz loss functions in Section 3.3. The proof of the main acceleration theorem (Theorem 3) is given in Section 4. The rest of the proofs are provided in the appendix.

3.1 Proximal Stochastic Dual Coordinate Ascent

We now describe our proximal stochastic dual coordinate ascent procedure for solving (1). Our results in this subsection hold for $g$ being a $1$-strongly convex function with respect to some norm $\|\cdot\|_{P'}$ and every $\phi_i$ being a $(1/\gamma)$-smooth function with respect to some other norm $\|\cdot\|_P$. The corresponding dual norms are denoted by $\|\cdot\|_{D'}$ and $\|\cdot\|_D$, respectively.

The dual objective in (2) has a different dual vector associated with each example in the training set. At each iteration of dual coordinate ascent we only allow changing the $i$'th column of $\alpha$, while the rest of the dual vectors are kept intact. We focus on a randomized version of dual coordinate ascent, in which at each round we choose which dual vector to update uniformly at random.

At step $t$, let $v^{(t-1)} = (\lambda n)^{-1}\sum_i X_i \alpha_i^{(t-1)}$ and let $w^{(t-1)} = \nabla g^*(v^{(t-1)})$. We will update the $i$-th dual variable, $\alpha_i^{(t)} = \alpha_i^{(t-1)} + \Delta\alpha_i$, in a way that leads to a sufficient increase of the dual objective. For the primal problem, this leads to the update $v^{(t)} = v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i$, and therefore $w^{(t)} = \nabla g^*(v^{(t)})$, which can also be written as

$$w^{(t)} = \operatorname{argmax}_w \left[ w^\top v^{(t)} - g(w) \right] = \operatorname{argmin}_w \left[ -w^\top\!\left( n^{-1}\sum_{i=1}^n X_i \alpha_i^{(t)} \right) + \lambda g(w) \right].$$

Note that this particular update is rather similar to the update step of the proximal-gradient dual-averaging method (see for example Xiao [27]). The difference is in how $\alpha^{(t)}$ is updated.
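To make the bookkeeping above concrete, here is a minimal sketch (ours, not the paper's reference implementation) of the procedure for ridge regression, where $g(w) = \frac12\|w\|_2^2$ so that $w^{(t)} = \nabla g^*(v^{(t)}) = v^{(t)}$. The constant step size used below corresponds to Option V of Figure 1 with $\gamma = 1$:

```python
import numpy as np

def prox_sdca_ridge(X, y, lam, epochs=100, seed=0):
    """Prox-SDCA sketch for ridge regression: phi_i(a) = 0.5*(a - y_i)^2 (1-smooth),
    g(w) = 0.5*||w||_2^2, so grad g* is the identity and w = v.  The step size
    s = lam*n*gamma / (R^2 + lam*n*gamma) is the closed-form Option V of Figure 1."""
    n, d = X.shape
    R2 = max(np.sum(X ** 2, axis=1))     # R^2 >= ||x_i||^2 for all i
    s = lam * n / (R2 + lam * n)         # Option V step size with gamma = 1
    alpha = np.zeros(n)                  # one scalar dual variable per example
    v = X.T @ alpha / (lam * n)          # v = (lam*n)^{-1} sum_i x_i alpha_i
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)              # pick a dual coordinate uniformly at random
        w = v                            # w = grad g*(v) for the L2 regularizer
        u = -(X[i] @ w - y[i])           # u = -phi_i'(x_i^T w)
        dalpha = s * (u - alpha[i])
        alpha[i] += dalpha
        v += X[i] * dalpha / (lam * n)   # maintain v incrementally
    return v, alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lam = 0.1
w, alpha = prox_sdca_ridge(X, y, lam)
w_star = np.linalg.solve(X.T @ X / 40 + lam * np.eye(3), X.T @ y / 40)
assert np.linalg.norm(w - w_star) < 1e-2
```

Only $v^{(t)}$ and the dual vector are maintained, and each step touches a single example, which is what makes the per-iteration cost $O(d)$.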
The goal of dual ascent methods is to increase the dual objective as much as possible, and thus the optimal way to choose $\Delta\alpha_i$ would be to maximize the dual objective, namely, we shall let

$$\Delta\alpha_i = \operatorname{argmax}_{\Delta\alpha_i \in \mathbb{R}^k} \left[ -\frac{1}{n}\phi_i^*(-(\alpha_i + \Delta\alpha_i)) - \lambda g^*\!\left( v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i \right) \right].$$

However, for a complex $g^*(\cdot)$, this optimization problem may not be easy to solve. To simplify the optimization problem we can rely on the smoothness of $g^*$ (with respect to the norm $\|\cdot\|_{D'}$) and, instead of directly maximizing the dual objective function, maximize the following proximal objective, which is a lower bound of the dual objective:

$$\operatorname{argmax}_{\Delta\alpha_i \in \mathbb{R}^k} \left[ -\frac{1}{n}\phi_i^*(-(\alpha_i + \Delta\alpha_i)) - \lambda\left( \nabla g^*(v^{(t-1)})^\top (\lambda n)^{-1} X_i \Delta\alpha_i + \frac{1}{2}\left\| (\lambda n)^{-1} X_i \Delta\alpha_i \right\|_{D'}^2 \right) \right]$$
$$= \operatorname{argmax}_{\Delta\alpha_i \in \mathbb{R}^k} \left[ -\phi_i^*(-(\alpha_i + \Delta\alpha_i)) - {w^{(t-1)}}^\top X_i \Delta\alpha_i - \frac{1}{2\lambda n}\|X_i \Delta\alpha_i\|_{D'}^2 \right].$$

In general, this optimization problem is still not necessarily simple to solve, because $\phi_i^*$ may also be complex. We will thus also propose alternative update rules for $\Delta\alpha_i$ of the form

$$\Delta\alpha_i = s\left( -\nabla\phi_i(X_i^\top w^{(t-1)}) - \alpha_i^{(t-1)} \right)$$

for an appropriately chosen step size parameter $s > 0$. Our analysis shows that an appropriate choice of $s$ still leads to a sufficient increase in the dual objective. It should be pointed out that we can always pick $\Delta\alpha_i$ so that the dual objective is non-decreasing: if for a specific choice of $\Delta\alpha_i$ the dual objective decreases, we may simply set $\Delta\alpha_i = 0$. Therefore, throughout the proof we will assume that the dual objective is non-decreasing whenever needed.

The theorems below provide upper bounds on the number of iterations required by our Prox-SDCA procedure.

Theorem 1. Consider Procedure Prox-SDCA as given in Figure 1. Let $\alpha^*$ be an optimal dual solution and let $\epsilon > 0$. For every $T$ such that

$$T \ge \left( n + \frac{R^2}{\lambda\gamma} \right) \log\!\left( \left( n + \frac{R^2}{\lambda\gamma} \right) \cdot \frac{D(\alpha^*) - D(\alpha^{(0)})}{\epsilon} \right),$$
we are guaranteed that $\mathbb{E}[P(w^{(T)}) - D(\alpha^{(T)})] \le \epsilon$. Moreover, for every $T$ such that

$$T \ge \left( n + \frac{R^2}{\lambda\gamma} \right)\left( 1 + \log\!\left( \frac{D(\alpha^*) - D(\alpha^{(0)})}{\epsilon} \right) \right),$$

letting $T_0 = T - n - \left\lceil \frac{R^2}{\lambda\gamma} \right\rceil$, we are guaranteed that $\mathbb{E}[P(\bar w) - D(\bar\alpha)] \le \epsilon$.

We next give bounds that hold with high probability.

Theorem 2. Consider Procedure Prox-SDCA as given in Figure 1. Let $\alpha^*$ be an optimal dual solution, let $\epsilon_D, \epsilon_P > 0$, and let $\delta \in (0, 1)$.

1. For every $T$ such that

$$T \ge \left\lceil \left( n + \frac{R^2}{\lambda\gamma} \right) \log\!\left( \frac{2(D(\alpha^*) - D(\alpha^{(0)}))}{\epsilon_D} \right) \right\rceil \cdot \log_2\!\frac{1}{\delta},$$

we are guaranteed that with probability of at least $1 - \delta$ it holds that $D(\alpha^*) - D(\alpha^{(T)}) \le \epsilon_D$.

2. For every $T$ such that

$$T \ge \left\lceil \left( n + \frac{R^2}{\lambda\gamma} \right) \left( \log\!\left( n + \frac{R^2}{\lambda\gamma} \right) + \log\!\left( \frac{2(D(\alpha^*) - D(\alpha^{(0)}))}{\epsilon_P} \right) \right) \right\rceil \cdot \log_2\!\frac{1}{\delta},$$

we are guaranteed that with probability of at least $1 - \delta$ it holds that $P(w^{(T)}) - D(\alpha^{(T)}) \le \epsilon_P$.

3. Let $T$ be such that

$$T \ge \left( n + \frac{R^2}{\lambda\gamma} \right) \left( 1 + \left\lceil \log\!\left( \frac{2(D(\alpha^*) - D(\alpha^{(0)}))}{\epsilon_P} \right) \right\rceil \right) \cdot \log_2\!\frac{2}{\delta},$$

and let $T_0 = T - n - \left\lceil \frac{R^2}{\lambda\gamma} \right\rceil$. Suppose we choose $\lceil \log_2(2/\delta) \rceil$ values of $t$ uniformly at random from $T_0 + 1, \ldots, T$, and then choose the single value of $t$ among these $\lceil \log_2(2/\delta) \rceil$ values for which $P(w^{(t)}) - D(\alpha^{(t)})$ is minimal. Then, with probability of at least $1 - \delta$, we have that $P(w^{(t)}) - D(\alpha^{(t)}) \le \epsilon_P$.

Procedure Proximal Stochastic Dual Coordinate Ascent: Prox-SDCA($P$, $\epsilon$, $\alpha^{(0)}$)
  Goal: Minimize $P(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w)$
  Input: Objective $P$, desired accuracy $\epsilon$, initial dual solution $\alpha^{(0)}$ (default: $\alpha^{(0)} = 0$)
  Assumptions:
    $\forall i$, $\phi_i$ is $(1/\gamma)$-smooth w.r.t. $\|\cdot\|_P$; let $\|\cdot\|_D$ be the dual norm of $\|\cdot\|_P$
    $g$ is $1$-strongly convex w.r.t. $\|\cdot\|_{P'}$; let $\|\cdot\|_{D'}$ be the dual norm of $\|\cdot\|_{P'}$
    $\forall i$, $\|X_i\|_{D \to D'} \le R$
  Initialize $v^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^n X_i \alpha_i^{(0)}$, $w^{(0)} = \nabla g^*(v^{(0)})$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    Find $\Delta\alpha_i$ using any of the following options (or any other update that achieves a larger dual objective):
      Option I: $\Delta\alpha_i = \operatorname{argmax}_{\Delta\alpha_i} \left[ -\phi_i^*(-(\alpha_i^{(t-1)} + \Delta\alpha_i)) - {w^{(t-1)}}^\top X_i \Delta\alpha_i - \frac{1}{2\lambda n}\|X_i \Delta\alpha_i\|_{D'}^2 \right]$
      Option II: Let $u = -\nabla\phi_i(X_i^\top w^{(t-1)})$ and $q = u - \alpha_i^{(t-1)}$
        Let $s = \operatorname{argmax}_{s \in [0,1]} \left[ -\phi_i^*(-(\alpha_i^{(t-1)} + sq)) - s\,{w^{(t-1)}}^\top X_i q - \frac{s^2}{2\lambda n}\|X_i q\|_{D'}^2 \right]$
        Set $\Delta\alpha_i = sq$
      Option III: Same as Option II, but replace the definition of $s$ as follows:
        Let $s = \min\!\left( 1,\; \frac{\phi_i(X_i^\top w^{(t-1)}) + \phi_i^*(-\alpha_i^{(t-1)}) + {w^{(t-1)}}^\top X_i \alpha_i^{(t-1)} + \frac{\gamma}{2}\|q\|_D^2}{\|q\|_D^2 \left( \gamma + \frac{1}{\lambda n}\|X_i\|_{D \to D'}^2 \right)} \right)$
      Option IV: Same as Option III, but replace $\|X_i\|_{D \to D'}^2$ in the definition of $s$ with $R^2$
      Option V: Same as Option II, but replace the definition of $s$ to be $s = \frac{\lambda n \gamma}{R^2 + \lambda n \gamma}$
    $\alpha_i^{(t)} \leftarrow \alpha_i^{(t-1)} + \Delta\alpha_i$, and for $j \ne i$, $\alpha_j^{(t)} \leftarrow \alpha_j^{(t-1)}$
    $v^{(t)} \leftarrow v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i$
    $w^{(t)} \leftarrow \nabla g^*(v^{(t)})$
  Stopping condition: Let $T_0 < t$ (default: $T_0 = t - n - \lceil \frac{R^2}{\lambda\gamma} \rceil$)
    Averaging option: Let $\bar\alpha = \frac{1}{t - T_0}\sum_{i = T_0 + 1}^{t} \alpha^{(i-1)}$ and $\bar w = \frac{1}{t - T_0}\sum_{i = T_0 + 1}^{t} w^{(i-1)}$
    Random option: Let $\bar\alpha = \alpha^{(i)}$ and $\bar w = w^{(i)}$ for some random $i \in \{T_0 + 1, \ldots, t\}$
    Stop if $P(\bar w) - D(\bar\alpha) \le \epsilon$; output $\bar w$, $\bar\alpha$, and $P(\bar w) - D(\bar\alpha)$

Figure 1: The Generic Proximal Stochastic Dual Coordinate Ascent Algorithm

The above theorem tells us that the runtime required to find an $\epsilon_P$-accurate solution, with probability of at least $1 - \delta$, is

$$O\!\left( d\left( n + \frac{R^2}{\lambda\gamma} \right) \log\!\left( \frac{D(\alpha^*) - D(\alpha^{(0)})}{\epsilon_P} \right) \log\frac{1}{\delta} \right). \tag{4}$$

This yields the following corollary.

Corollary 1. The expected runtime required to minimize $P$ up to accuracy $\epsilon$ is

$$O\!\left( d\left( n + \frac{R^2}{\lambda\gamma} \right) \log\!\left( \frac{D(\alpha^*) - D(\alpha^{(0)})}{\epsilon} \right) \right).$$

Proof. We have shown that with a runtime of $O\!\left( d\left( n + \frac{R^2}{\lambda\gamma} \right) \log\!\left( \frac{2(D(\alpha^*) - D(\alpha^{(0)}))}{\epsilon} \right) \right)$ we can find an $\epsilon$-accurate solution with probability of at least $1/2$.
Therefore, we can run the procedure for this amount of time and check whether the duality gap is smaller than $\epsilon$. If yes, we are done. Otherwise, we restart the process. Since the probability of success is $1/2$, the expected number of restarts is $2$, which concludes the proof.

3.2 Acceleration

The Prox-SDCA procedure described in the previous subsection has the iteration bound $\tilde{O}\!\left( n + \frac{R^2}{\lambda\gamma} \right)$. This is a nearly linear runtime whenever the condition number, $R^2/(\lambda\gamma)$, is $O(n)$. In this section we show how to improve the dependence on the condition number by an acceleration procedure. In particular, throughout this section we assume that $10n < \frac{R^2}{\lambda\gamma}$. We further assume throughout this subsection that the regularizer, $g$, is $1$-strongly convex with respect to the Euclidean norm, i.e. $\|\cdot\|_{P'} = \|\cdot\|_2$. This also implies that $\|\cdot\|_{D'}$ is the Euclidean norm. A generalization of the acceleration technique to regularizers that are strongly convex with respect to general norms is left to future work.

The main idea of the acceleration procedure is to iteratively run the Prox-SDCA procedure, where at iteration $t$ we call Prox-SDCA with the modified objective

$$\tilde{P}_t(w) = P(w) + \frac{\kappa}{2}\|w - y^{(t-1)}\|^2,$$

where $\kappa$ is a relatively large regularization parameter and the regularization is centered around the vector $y^{(t-1)} = w^{(t-1)} + \beta(w^{(t-1)} - w^{(t-2)})$ for some $\beta \in (0, 1)$. That is, our regularization is centered around the previous solution plus a "momentum term" $\beta(w^{(t-1)} - w^{(t-2)})$. A pseudo-code of the algorithm is given in Figure 2. Note that all the parameters of the algorithm are determined by our theory.

Remark 1. In the pseudo-code below, we specify the parameters based on our theoretical derivation. In our experiments, we found that this choice of parameters also works very well in practice.
However, we also found that the algorithm is not very sensitive to the choice of parameters. For example, we found that running $5n$ iterations of Prox-SDCA (that is, 5 epochs over the data), without checking the stopping condition, also works very well.

The main theorem is the following.

Theorem 3. Consider the accelerated Prox-SDCA algorithm given in Figure 2.

Procedure Accelerated Prox-SDCA
  Goal: Minimize $P(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w)$
  Input: Target accuracy $\epsilon$ (only used in the stopping condition)
  Assumptions:
    $\forall i$, $\phi_i$ is $(1/\gamma)$-smooth w.r.t. $\|\cdot\|_P$; let $\|\cdot\|_D$ be the dual norm of $\|\cdot\|_P$
    $g$ is $1$-strongly convex w.r.t. $\|\cdot\|_2$
    $\forall i$, $\|X_i\|_{D \to 2} \le R$
    $\frac{R^2}{\gamma\lambda} > 10n$ (otherwise, solve the problem using vanilla Prox-SDCA)
  Define $\kappa = \frac{R^2}{\gamma n} - \lambda$, $\mu = \lambda/2$, $\rho = \mu + \kappa$, $\eta = \sqrt{\mu/\rho}$, $\beta = \frac{1 - \eta}{1 + \eta}$
  Initialize $y^{(1)} = w^{(1)} = 0$, $\alpha^{(1)} = 0$, $\xi_1 = (1 + \eta^{-2})(P(0) - D(0))$
  Iterate: for $t = 2, 3, \ldots$
    Let $\tilde{P}_t(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \tilde\lambda\,\tilde{g}_t(w)$, where $\tilde\lambda\,\tilde{g}_t(w) = \lambda g(w) + \frac{\kappa}{2}\|w\|_2^2 - \kappa\, w^\top y^{(t-1)}$
    Call $(w^{(t)}, \alpha^{(t)}, \epsilon_t) = \text{Prox-SDCA}\!\left( \tilde{P}_t,\; \frac{\eta}{2(1 + \eta^{-2})}\,\xi_{t-1},\; \alpha^{(t-1)} \right)$
    Let $y^{(t)} = w^{(t)} + \beta(w^{(t)} - w^{(t-1)})$
    Let $\xi_t = (1 - \eta/2)^{t-1}\,\xi_1$
  Stopping conditions: break and return $w^{(t)}$ if one of the following conditions holds:
    1. $t \ge 1 + \frac{2}{\eta}\log(\xi_1/\epsilon)$
    2. $(1 + \rho/\mu)\,\epsilon_t + \frac{\rho\kappa}{2\mu}\|w^{(t)} - y^{(t-1)}\|^2 \le \epsilon$

Figure 2: The Accelerated Prox-SDCA Algorithm

• Correctness: When the algorithm terminates, we have $P(w^{(t)}) - P(w^*) \le \epsilon$.
• Runtime:
  - The number of outer iterations is at most
    $$1 + \frac{2}{\eta}\log(\xi_1/\epsilon) \le 1 + \sqrt{\frac{8R^2}{\lambda\gamma n}}\left( \log\frac{2R^2}{\lambda\gamma n} + \log\frac{P(0) - D(0)}{\epsilon} \right).$$
  - Each outer iteration involves a single call to Prox-SDCA, and the average runtime required by each such call is $O\!\left( dn\log\frac{R^2}{\lambda\gamma n} \right)$.
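To illustrate the inner-outer structure, here is a sketch under our own simplifications (not the paper's code). For ridge regression the modified objective $\tilde{P}_t$ is again a ridge-type problem with effective regularization strength $\lambda + \kappa$ and $\nabla\tilde{g}_t^*(v) = v + \frac{\kappa}{\lambda + \kappa}y^{(t-1)}$, so the inner call can reuse the closed-form Option V update; following Remark 1, the inner solver runs a fixed number of epochs instead of checking the stopping condition:

```python
import numpy as np

def inner_prox_sdca(X, y, lam_eff, kappa, y_center, alpha, epochs, rng):
    """Inner Prox-SDCA (Option V, gamma = 1) for the ridge instance of
    P(w) + (kappa/2)*||w - y_center||^2, whose regularizer has strength
    lam_eff = lam + kappa and grad g~*(v) = v + (kappa/lam_eff)*y_center."""
    n, d = X.shape
    R2 = max(np.sum(X ** 2, axis=1))
    s = lam_eff * n / (R2 + lam_eff * n)       # Option V step size
    v = X.T @ alpha / (lam_eff * n)            # rebuild v from the warm-started duals
    shift = (kappa / lam_eff) * y_center
    for _ in range(epochs * n):
        i = rng.integers(n)
        w = v + shift                          # w = grad g~*(v)
        dalpha = s * ((y[i] - X[i] @ w) - alpha[i])
        alpha[i] += dalpha
        v += X[i] * dalpha / (lam_eff * n)
    return v + shift, alpha

def accelerated_prox_sdca_ridge(X, y, lam, outer_iters=200, inner_epochs=10, seed=0):
    """Outer acceleration loop of Figure 2, sketched for ridge regression."""
    n, d = X.shape
    R2 = max(np.sum(X ** 2, axis=1))
    kappa = R2 / n - lam                       # kappa = R^2/(gamma*n) - lam, gamma = 1
    mu, rho = lam / 2.0, lam / 2.0 + kappa
    eta = np.sqrt(mu / rho)
    beta = (1.0 - eta) / (1.0 + eta)
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n)
    w_prev = w = np.zeros(d)
    for _ in range(outer_iters):
        y_vec = w + beta * (w - w_prev)        # momentum step
        w_prev = w
        w, alpha = inner_prox_sdca(X, y, lam + kappa, kappa, y_vec, alpha,
                                   inner_epochs, rng)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
lam = 0.001                                    # R^2/(lam*gamma) >> 10n: the accelerated regime
w = accelerated_prox_sdca_ridge(X, y, lam)
w_star = np.linalg.solve(X.T @ X / 50 + lam * np.eye(5), X.T @ y / 50)
assert np.linalg.norm(w - w_star) < 0.05
```

The inner problem is well conditioned (its condition number is roughly $n$), so a handful of epochs per outer call suffices, while the momentum on $y^{(t)}$ supplies the accelerated outer rate.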
By a straightforward amplification argument, we obtain that for every $\delta \in (0, 1)$ the total runtime required by accelerated Prox-SDCA to guarantee an $\epsilon$-accurate solution with probability of at least $1 - \delta$ is

$$O\!\left( d\sqrt{\frac{nR^2}{\lambda\gamma}}\,\log\!\left( \frac{R^2}{\lambda\gamma n} \right)\left( \log\frac{R^2}{\lambda\gamma n} + \log\frac{P(0) - D(0)}{\epsilon} \right)\log\frac{1}{\delta} \right).$$

3.3 Non-smooth, Lipschitz loss functions

So far we have assumed that for every $i$, $\phi_i$ is a $(1/\gamma)$-smooth function. We now consider the case in which $\phi_i$ might be non-smooth, and even non-differentiable, but is $L$-Lipschitz. Following Nesterov [17], we apply a "smoothing" technique. We first observe that if $\phi$ is an $L$-Lipschitz function, then the domain of $\phi^*$ is contained in the ball of radius $L$.

Lemma 1. Let $\phi : \mathbb{R}^k \to \mathbb{R}$ be an $L$-Lipschitz function w.r.t. a norm $\|\cdot\|_P$ and let $\|\cdot\|_D$ be the dual norm. Then, for any $\alpha \in \mathbb{R}^k$ s.t. $\|\alpha\|_D > L$, we have $\phi^*(\alpha) = \infty$.

Proof. Fix some $\alpha$ with $\|\alpha\|_D > L$. Let $x_0$ be a vector such that $\|x_0\|_P = 1$ and $\alpha^\top x_0 = \|\alpha\|_D$ (this is a vector that achieves the maximal objective in the definition of the dual norm). By the definition of the conjugate we have

$$\phi^*(\alpha) = \sup_x \left[ \alpha^\top x - \phi(x) \right] = -\phi(0) + \sup_x \left[ \alpha^\top x - (\phi(x) - \phi(0)) \right]$$
$$\ge -\phi(0) + \sup_x \left[ \alpha^\top x - L\|x - 0\|_P \right] \ge -\phi(0) + \sup_{c > 0} \left[ \alpha^\top (c\,x_0) - L\|c\,x_0\|_P \right]$$
$$= -\phi(0) + \sup_{c > 0}\,(\|\alpha\|_D - L)\,c = \infty.$$

This observation allows us to smooth $L$-Lipschitz functions by adding regularization to their conjugates. In particular, the following lemma generalizes Lemma 2.5 in [26].

Lemma 2. Let $\phi$ be a proper, convex, $L$-Lipschitz function w.r.t. a norm $\|\cdot\|_P$, let $\|\cdot\|_D$ be the dual norm, and let $\phi^*$ be the conjugate of $\phi$. Assume that $\|\cdot\|_2 \le \|\cdot\|_D$. Define $\tilde\phi^*(\alpha) = \phi^*(\alpha) + \frac{\gamma}{2}\|\alpha\|_2^2$ and let $\tilde\phi$ be the conjugate of $\tilde\phi^*$. Then, $\tilde\phi$ is $(1/\gamma)$-smooth w.r.t. the Euclidean norm, and

$$\forall a, \quad 0 \le \phi(a) - \tilde\phi(a) \le \gamma L^2/2.$$

Proof. The fact that $\tilde\phi$ is $(1/\gamma)$-smooth follows directly from the fact that $\tilde\phi^*$ is $\gamma$-strongly convex.
For the second claim, note that

$$\tilde\phi(a) = \sup_b \left[ b^\top a - \phi^*(b) - \frac{\gamma}{2}\|b\|_2^2 \right] \le \sup_b \left[ b^\top a - \phi^*(b) \right] = \phi(a)$$

and

$$\tilde\phi(a) = \sup_b \left[ b^\top a - \phi^*(b) - \frac{\gamma}{2}\|b\|_2^2 \right] = \sup_{b : \|b\|_D \le L} \left[ b^\top a - \phi^*(b) - \frac{\gamma}{2}\|b\|_2^2 \right]$$
$$\ge \sup_{b : \|b\|_D \le L} \left[ b^\top a - \phi^*(b) - \frac{\gamma}{2}\|b\|_D^2 \right] \ge \sup_{b : \|b\|_D \le L} \left[ b^\top a - \phi^*(b) \right] - \frac{\gamma}{2}L^2 = \phi(a) - \frac{\gamma}{2}L^2.$$

Remark 2. It is also possible to smooth using different regularization functions which are strongly convex with respect to other norms. See Nesterov [17] for a discussion.

4 Proof of Theorem 3

The first claim of the theorem is that when the procedure stops we have $P(w^{(t)}) - P(w^*) \le \epsilon$. We therefore need to show that each stopping condition guarantees $P(w^{(t)}) - P(w^*) \le \epsilon$. For the second stopping condition, recall that $w^{(t)}$ is an $\epsilon_t$-accurate minimizer of $P(w) + \frac{\kappa}{2}\|w - y^{(t-1)}\|^2$, and hence by Lemma 3 below (with $z = w^*$, $w^+ = w^{(t)}$, and $y = y^{(t-1)}$):

$$P(w^*) \ge P(w^{(t)}) + Q(w^*; w^{(t)}, y^{(t-1)}) \ge P(w^{(t)}) - \frac{\rho\kappa}{2\mu}\|y^{(t-1)} - w^{(t)}\|^2 - (1 + \rho/\mu)\,\epsilon_t.$$

It is left to show that the first stopping condition is correct, namely, that after $1 + \frac{2}{\eta}\log(\xi_1/\epsilon)$ iterations the algorithm must converge to an $\epsilon$-accurate solution. Observe that the definition of $\xi_t$ yields

$$\xi_t = (1 - \eta/2)^{t-1}\xi_1 \le e^{-\eta(t-1)/2}\,\xi_1.$$

Therefore, to prove that the first stopping condition is valid, it suffices to show that for every $t$, $P(w^{(t)}) - P(w^*) \le \xi_t$.

Recall that at each outer iteration of the accelerated procedure, we approximately minimize an objective of the form $P(w; y) = P(w) + \frac{\kappa}{2}\|w - y\|^2$. Of course, minimizing $P(w; y)$ is not the same as minimizing $P(w)$. Our first lemma shows that, for every $y$, if $w^+$ is an $\epsilon$-accurate minimizer of $P(w; y)$, then we can derive a lower bound on $P(w)$ based on $P(w^+)$ and a convex quadratic function of $w$.

Lemma 3. Let $\mu = \lambda/2$ and $\rho = \mu + \kappa$.
Let $w^+$ be a vector such that $P(w^+; y) \le \min_w P(w; y) + \epsilon$. Then, for every $z$,

$$P(z) \ge P(w^+) + Q(z; w^+, y), \quad \text{where} \quad Q(z; w^+, y) = \frac{\mu}{2}\left\| z - \left( y - \frac{\rho}{\mu}(y - w^+) \right) \right\|^2 - \frac{\rho\kappa}{2\mu}\|y - w^+\|^2 - (1 + \rho/\mu)\,\epsilon.$$

Proof. Denote $\Psi(w) = P(w) - \frac{\mu}{2}\|w\|^2$. We can write

$$\frac{1}{2}\|w\|^2 = \frac{1}{2}\|y\|^2 + y^\top(w - y) + \frac{1}{2}\|w - y\|^2.$$

It follows that

$$P(w) = \Psi(w) + \frac{\mu}{2}\|w\|^2 = \Psi(w) + \frac{\mu}{2}\|y\|^2 + \mu\, y^\top(w - y) + \frac{\mu}{2}\|w - y\|^2.$$

Therefore, we can rewrite $P(w; y)$ as

$$P(w; y) = \Psi(w) + \frac{\mu}{2}\|y\|^2 + \mu\, y^\top(w - y) + \frac{\rho}{2}\|w - y\|^2.$$

Let $\tilde w = \operatorname{argmin}_w P(w; y)$. The gradient² of $P(w; y)$ w.r.t. $w$ vanishes at $\tilde w$, which yields

$$\nabla\Psi(\tilde w) + \mu y + \rho(\tilde w - y) = 0 \quad \Rightarrow \quad \nabla\Psi(\tilde w) + \mu y = \rho(y - \tilde w).$$

By the $\mu$-strong convexity of $\Psi$, we have that for every $z$,

$$\Psi(z) \ge \Psi(\tilde w) + \nabla\Psi(\tilde w)^\top(z - \tilde w) + \frac{\mu}{2}\|z - \tilde w\|^2.$$

Therefore,

$$P(z) = \Psi(z) + \frac{\mu}{2}\|y\|^2 + \mu\, y^\top(z - y) + \frac{\mu}{2}\|z - y\|^2$$
$$\ge \Psi(\tilde w) + \nabla\Psi(\tilde w)^\top(z - \tilde w) + \frac{\mu}{2}\|z - \tilde w\|^2 + \frac{\mu}{2}\|y\|^2 + \mu\, y^\top(z - y) + \frac{\mu}{2}\|z - y\|^2$$
$$= P(\tilde w; y) - \frac{\rho}{2}\|\tilde w - y\|^2 + \nabla\Psi(\tilde w)^\top(z - \tilde w) + \mu\, y^\top(z - \tilde w) + \frac{\mu}{2}\left( \|z - \tilde w\|^2 + \|z - y\|^2 \right)$$
$$= P(\tilde w; y) - \frac{\rho}{2}\|\tilde w - y\|^2 + \rho(y - \tilde w)^\top(z - \tilde w) + \frac{\mu}{2}\left( \|z - \tilde w\|^2 + \|z - y\|^2 \right)$$
$$= P(\tilde w; y) + \frac{\rho}{2}\|\tilde w - y\|^2 + \rho(y - \tilde w)^\top(z - y) + \frac{\mu}{2}\left( \|z - \tilde w\|^2 + \|z - y\|^2 \right).$$

² If the regularizer $g(w)$ in the definition of $P(w)$ is non-differentiable, we can replace $\nabla\Psi(\tilde w)$ with an appropriate sub-gradient of $\Psi$ at $\tilde w$. It is easy to verify that the proof is still valid.
In addition, by standard algebraic manipulations,

$$\left[ \frac{\rho}{2}\|\tilde w - y\|^2 + \rho(y - \tilde w)^\top(z - y) + \frac{\mu}{2}\|z - \tilde w\|^2 \right] - \left[ \frac{\rho}{2}\|w^+ - y\|^2 + \rho(y - w^+)^\top(z - y) + \frac{\mu}{2}\|z - w^+\|^2 \right]$$
$$= \left( \rho(w^+ - y) - \rho(z - y) + \mu(w^+ - z) \right)^\top(\tilde w - w^+) + \frac{\rho + \mu}{2}\|\tilde w - w^+\|^2$$
$$= (\rho + \mu)(w^+ - z)^\top(\tilde w - w^+) + \frac{\rho + \mu}{2}\|\tilde w - w^+\|^2$$
$$= \frac{1}{2}\left\| \sqrt{\mu}\,(w^+ - z) + \frac{\rho + \mu}{\sqrt{\mu}}(\tilde w - w^+) \right\|^2 - \frac{\mu}{2}\|z - w^+\|^2 - \frac{(\rho + \mu)^2}{2\mu}\|\tilde w - w^+\|^2 + \frac{\rho + \mu}{2}\|\tilde w - w^+\|^2$$
$$\ge -\frac{\mu}{2}\|z - w^+\|^2 - \frac{\rho(\rho + \mu)}{2\mu}\|\tilde w - w^+\|^2.$$

Since $P(\cdot\,; y)$ is $(\rho + \mu)$-strongly convex and $\tilde w$ minimizes $P(\cdot\,; y)$, we have that for every $w^+$,

$$\frac{\rho + \mu}{2}\|\tilde w - w^+\|^2 \le P(w^+; y) - P(\tilde w; y).$$

Combining all the above, and using the fact that $P(w; y) \ge P(w)$ for every $w, y$, we obtain that for every $w^+$,

$$P(z) \ge P(w^+) + \frac{\rho}{2}\|w^+ - y\|^2 + \rho(y - w^+)^\top(z - y) + \frac{\mu}{2}\|z - y\|^2 - \left( 1 + \frac{\rho}{\mu} \right)\left( P(w^+; y) - P(\tilde w; y) \right).$$

Finally, using the assumption $P(w^+; y) \le \min_w P(w; y) + \epsilon$, we conclude our proof.

We saw that the quadratic function $P(w^+) + Q(z; w^+, y)$ lower bounds the function $P$ everywhere. Therefore, any convex combination of such functions forms a quadratic function which lower bounds $P$. In particular, the algorithm (implicitly) maintains a sequence of quadratic functions $h_1, h_2, \ldots$, defined as follows. Choose $\eta \in (0, 1)$ and a sequence $y^{(1)}, y^{(2)}, \ldots$ that will be specified later. Define

$$h_1(z) = P(0) + Q_{P(0) - D(0)}(z; 0, 0) = P(0) + \frac{\mu}{2}\|z\|^2 - (1 + \rho/\mu)(P(0) - D(0)),$$

where the subscript of $Q$ denotes the accuracy parameter $\epsilon$ in the definition of $Q$, and for $t \ge 1$,

$$h_{t+1}(z) = (1 - \eta)\,h_t(z) + \eta\left( P(w^{(t+1)}) + Q_{\epsilon_{t+1}}(z; w^{(t+1)}, y^{(t)}) \right).$$

The following simple lemma shows that for every $t \ge 1$ and $z$, $h_t(z)$ lower bounds $P(z)$.

Lemma 4. Let $\eta \in (0, 1)$ and let $y^{(1)}, y^{(2)}, \ldots$ be any sequence of vectors.
Assume that $w^{(1)} = 0$ and that for every $t \ge 1$, $w^{(t+1)}$ satisfies $P(w^{(t+1)}; y^{(t)}) \le \min_w P(w; y^{(t)}) + \epsilon_{t+1}$. Then, for every $t \ge 1$ and every vector $z$, we have $h_t(z) \le P(z)$.

Proof. The proof is by induction. For $t = 1$, observe that $P(0; 0) = P(0)$ and that for every $w$ we have $P(w; 0) \ge P(w) \ge D(0)$. This yields $P(0; 0) - \min_w P(w; 0) \le P(0) - D(0)$. The claim now follows directly from Lemma 3. Next, for the inductive step, assume the claim holds for some $t - 1 \ge 1$ and let us prove it for $t$. By the recursive definition of $h_t$ and by Lemma 3, we have

$$h_t(z) = (1 - \eta)\,h_{t-1}(z) + \eta\left( P(w^{(t)}) + Q_{\epsilon_t}(z; w^{(t)}, y^{(t-1)}) \right) \le (1 - \eta)\,h_{t-1}(z) + \eta\,P(z).$$

Using the inductive assumption, the right-hand side is upper bounded by $(1 - \eta)P(z) + \eta P(z) = P(z)$, which concludes the proof.

The more difficult part of the proof is to show that for every $t \ge 1$,

$$P(w^{(t)}) \le \min_w h_t(w) + \xi_t.$$

If this holds true, then we immediately get that for every $w^*$,

$$P(w^{(t)}) - P(w^*) \le P(w^{(t)}) - h_t(w^*) \le P(w^{(t)}) - \min_w h_t(w) \le \xi_t.$$

This will conclude the proof of the first part of Theorem 3, since $\xi_t = \xi_1(1 - \eta/2)^{t-1} \le \xi_1 e^{-(t-1)\eta/2}$, and therefore $1 + \frac{2}{\eta}\log(\xi_1/\epsilon)$ iterations suffice to guarantee that $P(w^{(t)}) - P(w^*) \le \epsilon$.

Define $v^{(t)} = \operatorname{argmin}_w h_t(w)$. Let us construct an explicit formula for $v^{(t)}$. Clearly, $v^{(1)} = 0$. Assume that we have calculated $v^{(t)}$ and let us calculate $v^{(t+1)}$. Note that $h_t$ is a quadratic function which is minimized at $v^{(t)}$. Furthermore, it is easy to see that for every $t$, $h_t$ is a $\mu$-strongly convex quadratic function. Therefore,

$$h_t(z) = h_t(v^{(t)}) + \frac{\mu}{2}\|z - v^{(t)}\|^2.$$
By the definition of $h_{t+1}$ we obtain
\[ h_{t+1}(z) = (1-\eta)\Big(h_t(v^{(t)}) + \frac{\mu}{2}\|z-v^{(t)}\|^2\Big) + \eta\big(P(w^{(t+1)}) + Q_{t+1}(z; w^{(t+1)}, y^{(t)})\big). \]
Since the gradient of $h_{t+1}$ at $v^{(t+1)}$ must vanish, $v^{(t+1)}$ satisfies
\[ (1-\eta)\mu\big(v^{(t+1)} - v^{(t)}\big) + \eta\mu\Big(v^{(t+1)} - \big(y^{(t)} - \tfrac{\rho}{\mu}(y^{(t)} - w^{(t+1)})\big)\Big) = 0. \]
Rearranging, we obtain
\[ v^{(t+1)} = (1-\eta)v^{(t)} + \eta\Big(y^{(t)} - \frac{\rho}{\mu}(y^{(t)} - w^{(t+1)})\Big). \tag{5} \]
Returning to the second phase of the proof, we need to show that for every $t$,
\[ P(w^{(t)}) \le h_t(v^{(t)}) + \xi_t. \]
We do so by induction. For $t = 1$ we have
\[ P(w^{(1)}) - h_1(v^{(1)}) = P(0) - h_1(0) = (1+\rho/\mu)(P(0) - D(0)) = \xi_1. \]
For the induction step, assume the claim holds for $t \ge 1$ and let us prove it for $t+1$. We use the shorthands $Q_t(z) = Q_t(z; w^{(t)}, y^{(t-1)})$ and $\psi_t(z) = Q_t(z) + P(w^{(t)})$. Let us rewrite $h_{t+1}(v^{(t+1)})$ as
\[ h_{t+1}(v^{(t+1)}) = (1-\eta)h_t(v^{(t+1)}) + \eta\psi_{t+1}(v^{(t+1)}) = (1-\eta)\Big(h_t(v^{(t)}) + \frac{\mu}{2}\|v^{(t)} - v^{(t+1)}\|^2\Big) + \eta\psi_{t+1}(v^{(t+1)}). \]
By the inductive assumption we have $h_t(v^{(t)}) \ge P(w^{(t)}) - \xi_t$, and by Lemma 3 we have $P(w^{(t)}) \ge \psi_{t+1}(w^{(t)})$. Therefore,
\[
\begin{aligned}
h_{t+1}(v^{(t+1)}) &\ge (1-\eta)\Big(\psi_{t+1}(w^{(t)}) - \xi_t + \frac{\mu}{2}\|v^{(t)} - v^{(t+1)}\|^2\Big) + \eta\psi_{t+1}(v^{(t+1)}) \qquad (6) \\
&= (1-\eta)\frac{\mu}{2}\|v^{(t)} - v^{(t+1)}\|^2 + \eta\psi_{t+1}(v^{(t+1)}) + (1-\eta)\psi_{t+1}(w^{(t)}) - (1-\eta)\xi_t.
\end{aligned}
\]
Next, note that we can rewrite
\[ Q_{t+1}(z) = \frac{\mu}{2}\|z - y^{(t)}\|^2 + \rho(z - y^{(t)})^\top(y^{(t)} - w^{(t+1)}) + \frac{\rho}{2}\|y^{(t)} - w^{(t+1)}\|^2 - (1+\rho/\mu)\,\epsilon_{t+1}. \]
Therefore,
\[
\begin{aligned}
&\eta\psi_{t+1}(v^{(t+1)}) + (1-\eta)\psi_{t+1}(w^{(t)}) - P(w^{(t+1)}) + (1+\rho/\mu)\epsilon_{t+1} \qquad (7) \\
&= \frac{\eta\mu}{2}\|v^{(t+1)} - y^{(t)}\|^2 + \frac{(1-\eta)\mu}{2}\|w^{(t)} - y^{(t)}\|^2 + \rho\big(\eta v^{(t+1)} + (1-\eta)w^{(t)} - y^{(t)}\big)^\top(y^{(t)} - w^{(t+1)}) + \frac{\rho}{2}\|y^{(t)} - w^{(t+1)}\|^2 .
\end{aligned}
\]
So far we have not specified $\eta$ and $y^{(t)}$ (except for $y^{(0)} = 0$). We now set
\[ \eta = \sqrt{\mu/\rho} \qquad\text{and}\qquad \forall t \ge 1,\; y^{(t)} = (1+\eta)^{-1}\big(\eta v^{(t)} + w^{(t)}\big). \]
This choice guarantees that (see (5))
\[
\begin{aligned}
\eta v^{(t+1)} + (1-\eta)w^{(t)}
&= \eta(1-\eta)v^{(t)} + \eta^2\Big(1-\frac{\rho}{\mu}\Big)y^{(t)} + \eta^2\frac{\rho}{\mu}\,w^{(t+1)} + (1-\eta)w^{(t)} \\
&= w^{(t+1)} + (1-\eta)\left[\eta v^{(t)} + \frac{\eta^2(1-\rho/\mu)}{1-\eta}\,y^{(t)} + w^{(t)}\right] \\
&= w^{(t+1)} + (1-\eta)\left[\eta v^{(t)} - \frac{1-\eta^2}{1-\eta}\,y^{(t)} + w^{(t)}\right] \\
&= w^{(t+1)} + (1-\eta)\big[\eta v^{(t)} - (1+\eta)y^{(t)} + w^{(t)}\big] = w^{(t+1)},
\end{aligned}
\]
where we used $\eta^2\rho/\mu = 1$, $\eta^2(1-\rho/\mu) = \eta^2 - 1$, and the definition $(1+\eta)y^{(t)} = \eta v^{(t)} + w^{(t)}$. We also observe that $\epsilon_{t+1} \le \frac{\eta\,\xi_t}{2(1+\eta^{-2})}$, which implies
\[ (1+\rho/\mu)\epsilon_{t+1} + (1-\eta)\xi_t \le (1-\eta/2)\xi_t = \xi_{t+1}. \]
Combining the above with (6) and (7) and rearranging terms, we obtain
\[ h_{t+1}(v^{(t+1)}) - P(w^{(t+1)}) + \xi_{t+1} - \frac{(1-\eta)\mu}{2}\|w^{(t)} - y^{(t)}\|^2 \ge \frac{(1-\eta)\mu}{2}\|v^{(t)} - v^{(t+1)}\|^2 + \frac{\eta\mu}{2}\|v^{(t+1)} - y^{(t)}\|^2 - \frac{\rho}{2}\|y^{(t)} - w^{(t+1)}\|^2. \]
Next, observe that $\rho\eta^2 = \mu$ and that, by (5),
\[ y^{(t)} - w^{(t+1)} = \eta\big[\eta y^{(t)} + (1-\eta)v^{(t)} - v^{(t+1)}\big]. \]
We therefore obtain
\[ h_{t+1}(v^{(t+1)}) - P(w^{(t+1)}) + \xi_{t+1} - \frac{(1-\eta)\mu}{2}\|w^{(t)} - y^{(t)}\|^2 \ge \frac{(1-\eta)\mu}{2}\|v^{(t)} - v^{(t+1)}\|^2 + \frac{\eta\mu}{2}\|y^{(t)} - v^{(t+1)}\|^2 - \frac{\mu}{2}\|\eta y^{(t)} + (1-\eta)v^{(t)} - v^{(t+1)}\|^2. \]
The right-hand side is non-negative by the convexity of the function $f(z) = \frac{\mu}{2}\|z - v^{(t+1)}\|^2$, which yields
\[ P(w^{(t+1)}) \le h_{t+1}(v^{(t+1)}) + \xi_{t+1} - \frac{(1-\eta)\mu}{2}\|w^{(t)} - y^{(t)}\|^2 \le h_{t+1}(v^{(t+1)}) + \xi_{t+1}. \]
This concludes the inductive argument.

Proving the "runtime" part of Theorem 3: We next show that each call to Prox-SDCA terminates quickly.
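Before the runtime analysis, the outer-loop recursions just derived — update (5) together with the choices $\eta = \sqrt{\mu/\rho}$ and $y^{(t)} = (1+\eta)^{-1}(\eta v^{(t)} + w^{(t)})$ — can be sketched in code. This is our illustrative sketch, not the paper's reference implementation: `inner_solver` stands in for a call to Prox-SDCA, and in the demo it is an exact prox step on a small quadratic.

```python
import numpy as np

def accelerated_prox_outer(inner_solver, d, mu, rho, T, xi0=1.0):
    """Inner-outer acceleration sketch.
    inner_solver(y, eps) must return an eps-approximate minimizer of
    P(w) + (rho/2)||w - y||^2, where P is mu-strongly convex."""
    eta = np.sqrt(mu / rho)
    xi = xi0                      # tolerance schedule only; iterates ignore it when solves are exact
    tol = lambda x: eta * x / (2.0 * (1.0 + eta ** -2))
    w = inner_solver(np.zeros(d), tol(xi))       # w^{(1)}, using y^{(0)} = 0
    v = np.zeros(d)                              # v^{(1)} = 0
    for _ in range(T):
        y = (eta * v + w) / (1.0 + eta)          # momentum point y^{(t)}
        xi *= 1.0 - eta / 2.0                    # xi_{t+1} = (1 - eta/2) xi_t
        w = inner_solver(y, tol(xi))             # approximate prox step w^{(t+1)}
        v = (1.0 - eta) * v + eta * (y - (rho / mu) * (y - w))   # update (5)
    return w

# demo: P(w) = 0.5 w^T A w - b^T w, so mu = smallest eigenvalue of A,
# and the prox step has the closed form (A + rho I)^{-1} (b + rho y)
A = np.diag([1.0, 10.0])
b = np.array([1.0, -2.0])
rho = 25.0
exact_prox = lambda y, eps: np.linalg.solve(A + rho * np.eye(2), b + rho * y)
w_acc = accelerated_prox_outer(exact_prox, d=2, mu=1.0, rho=rho, T=300)
w_exact = np.linalg.solve(A, b)
```

With an exact inner solver the tolerance schedule is vacuous, but the iterates follow the analyzed recursions, so the geometric $(1-\eta/2)^t$ decrease of $\xi_t$ applies.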
By the definition of $\kappa$ we have $\frac{R^2}{(\kappa+\lambda)\gamma} = n$. Therefore, based on Corollary 1, the averaged runtime of iteration $t$ is
\[ O\!\left(dn\,\log\!\left(\frac{\tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)})}{\frac{\eta}{2(1+\eta^{-2})}\xi_{t-1}}\right)\right). \]
The following lemma bounds the initial dual sub-optimality at iteration $t \ge 4$; similar arguments yield a similar result for $t < 4$.

Lemma 5. $\tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)}) \le \epsilon_{t-1} + \frac{36\kappa}{\lambda}\xi_{t-3}$.

Proof. Define $\tilde\lambda = \lambda + \kappa$, $f(w) = \frac{\lambda}{\tilde\lambda}g(w) + \frac{\kappa}{2\tilde\lambda}\|w\|^2$, and $\tilde g_t(w) = f(w) - \frac{\kappa}{\tilde\lambda}\,w^\top y^{(t-1)}$. Note that $\tilde\lambda$ does not depend on $t$, and therefore $v(\alpha) = \frac{1}{n\tilde\lambda}\sum_i X_i\alpha_i$ is the same for every $t$. Let
\[ \tilde P_t(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \tilde\lambda\,\tilde g_t(w). \]
We have
\[ \tilde P_t(w^{(t-1)}) = \tilde P_{t-1}(w^{(t-1)}) + \kappa\, w^{(t-1)\,\top}\big(y^{(t-2)} - y^{(t-1)}\big). \tag{8} \]
Since
\[ \tilde g_t^\ast(\theta) = \max_w\; w^\top\Big(\theta + \frac{\kappa}{\tilde\lambda}y^{(t-1)}\Big) - f(w) = f^\ast\Big(\theta + \frac{\kappa}{\tilde\lambda}y^{(t-1)}\Big), \]
the dual problem is
\[ \tilde D_t(\alpha) = -\frac{1}{n}\sum_i \phi_i^\ast(-\alpha_i) - \tilde\lambda\, f^\ast\Big(v(\alpha) + \frac{\kappa}{\tilde\lambda}y^{(t-1)}\Big). \]
Let $z = \frac{\kappa}{\tilde\lambda}\big(y^{(t-1)} - y^{(t-2)}\big)$. By the smoothness of $f^\ast$,
\[ f^\ast\Big(v(\alpha) + \frac{\kappa}{\tilde\lambda}y^{(t-1)}\Big) = f^\ast\Big(v(\alpha) + \frac{\kappa}{\tilde\lambda}y^{(t-2)} + z\Big) \le f^\ast\Big(v(\alpha) + \frac{\kappa}{\tilde\lambda}y^{(t-2)}\Big) + \nabla f^\ast\Big(v(\alpha) + \frac{\kappa}{\tilde\lambda}y^{(t-2)}\Big)^\top z + \frac{1}{2}\|z\|^2. \]
Applying this with $\alpha^{(t-1)}$ and using $w^{(t-1)} = \nabla\tilde g_{t-1}^\ast(v(\alpha^{(t-1)})) = \nabla f^\ast\big(v(\alpha^{(t-1)}) + \frac{\kappa}{\tilde\lambda}y^{(t-2)}\big)$, we obtain
\[ f^\ast\Big(v(\alpha^{(t-1)}) + \frac{\kappa}{\tilde\lambda}y^{(t-1)}\Big) \le f^\ast\Big(v(\alpha^{(t-1)}) + \frac{\kappa}{\tilde\lambda}y^{(t-2)}\Big) + w^{(t-1)\,\top}z + \frac{1}{2}\|z\|^2. \]
It follows that
\[ -\tilde D_t(\alpha^{(t-1)}) + \tilde D_{t-1}(\alpha^{(t-1)}) \le \kappa\, w^{(t-1)\,\top}\big(y^{(t-1)} - y^{(t-2)}\big) + \frac{\kappa^2}{2\tilde\lambda}\|y^{(t-1)} - y^{(t-2)}\|^2. \]
Combining the above with (8), we obtain
\[ \tilde P_t(w^{(t-1)}) - \tilde D_t(\alpha^{(t-1)}) \le \tilde P_{t-1}(w^{(t-1)}) - \tilde D_{t-1}(\alpha^{(t-1)}) + \frac{\kappa^2}{2\tilde\lambda}\|y^{(t-1)} - y^{(t-2)}\|^2. \]
Since $\tilde P_t(w^{(t-1)}) \ge \tilde D_t(\alpha^\ast)$ and since $\tilde\lambda \ge \kappa$, we get
\[ \tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)}) \le \epsilon_{t-1} + \frac{\kappa}{2}\|y^{(t-1)} - y^{(t-2)}\|^2. \]
Next, we bound $\|y^{(t-1)} - y^{(t-2)}\|^2$. We have
\[ \|y^{(t-1)} - y^{(t-2)}\| = \big\|w^{(t-1)} - w^{(t-2)} + \beta\big((w^{(t-1)} - w^{(t-2)}) - (w^{(t-2)} - w^{(t-3)})\big)\big\| \le 3\max_{i\in\{1,2\}}\|w^{(t-i)} - w^{(t-i-1)}\|, \]
where we used the triangle inequality and $\beta < 1$. By strong convexity of $P$ we have, for every $i$,
\[ \|w^{(i)} - w^\ast\| \le \sqrt{\frac{P(w^{(i)}) - P(w^\ast)}{\lambda/2}} \le \sqrt{\frac{\xi_i}{\lambda/2}}, \]
which implies
\[ \|w^{(t-i)} - w^{(t-i-1)}\| \le \|w^{(t-i)} - w^\ast\| + \|w^\ast - w^{(t-i-1)}\| \le 2\sqrt{\frac{\xi_{t-i-1}}{\lambda/2}}. \]
This yields the bound $\|y^{(t-1)} - y^{(t-2)}\|^2 \le \frac{72\,\xi_{t-3}}{\lambda}$. All in all, we have obtained
\[ \tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)}) \le \epsilon_{t-1} + \frac{36\kappa}{\lambda}\xi_{t-3}. \]

Returning to the proof of the second claim of Theorem 3, we have obtained that
\[ \frac{\tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)})}{\frac{\eta}{2(1+\eta^{-2})}\xi_{t-1}}
\le \frac{\epsilon_{t-1}}{\frac{\eta}{2(1+\eta^{-2})}\xi_{t-1}} + \frac{\frac{36\kappa}{\lambda}\xi_{t-3}}{\frac{\eta}{2(1+\eta^{-2})}\xi_{t-1}}
\le (1-\eta/2)^{-1} + \frac{72\kappa(1+\eta^{-2})}{\lambda\eta}(1-\eta/2)^{-2}
\le (1-\eta/2)^{-2}\Big(1 + \frac{72\kappa(1+\eta^{-2})}{\lambda\eta}\Big)
\le (1-\eta/2)^{-2}\big(1 + 36\,\eta^{-5}\big), \]
where in the last inequality we used $\eta^{-2} - 1 = \frac{2\kappa}{\lambda}$, which implies $\frac{2\kappa}{\lambda}(1+\eta^{-2}) \le \eta^{-4}$. Using $1 < \eta^{-5}$ and $1-\eta/2 \ge 0.5$, and taking logarithms of both sides, we get
\[ \log\!\left(\frac{\tilde D_t(\alpha^\ast) - \tilde D_t(\alpha^{(t-1)})}{\frac{\eta}{2(1+\eta^{-2})}\xi_{t-1}}\right) \le 2\log(2) + \log(37) - 5\log(\eta) \le 7 + 2.5\log\!\Big(\frac{R^2}{\lambda\gamma n}\Big). \]
All in all, we have shown that the average runtime required by Prox-SDCA$\big(\tilde P_t,\ \frac{\eta}{2(1+\eta^{-2})}\xi_{t-1},\ \alpha^{(t-1)}\big)$ is upper bounded by
\[ O\!\left(dn\,\log\!\Big(\frac{R^2}{\lambda\gamma n}\Big)\right), \]
which concludes the proof of the second claim of Theorem 3.

5 Applications

In this section we specialize our algorithmic framework to several popular machine learning applications. In Section 5.1 we start by describing several loss functions and deriving their conjugates.
In Section 5.2 we describe several regularization functions. Finally, in the remaining subsections we specialize our algorithm for ridge regression, SVM, Lasso, logistic regression, and multiclass prediction.

5.1 Loss functions

Squared loss: $\phi(a) = \frac{1}{2}(a-y)^2$ for some $y \in \mathbb{R}$. The conjugate function is
\[ \phi^\ast(b) = \max_a\; ab - \tfrac{1}{2}(a-y)^2 = \tfrac{1}{2}b^2 + yb. \]

Logistic loss: $\phi(a) = \log(1+e^a)$. The derivative is $\phi'(a) = 1/(1+e^{-a})$ and the second derivative is $\phi''(a) = \frac{1}{(1+e^{-a})(1+e^a)} \in [0, 1/4]$, from which it follows that $\phi$ is $(1/4)$-smooth. The conjugate function is
\[ \phi^\ast(b) = \max_a\; ab - \log(1+e^a) = \begin{cases} b\log(b) + (1-b)\log(1-b) & \text{if } b \in [0,1] \\ \infty & \text{otherwise} \end{cases} \]

Hinge loss: $\phi(a) = [1-a]_+ := \max\{0, 1-a\}$. The conjugate function is
\[ \phi^\ast(b) = \max_a\; ab - \max\{0, 1-a\} = \begin{cases} b & \text{if } b \in [-1, 0] \\ \infty & \text{otherwise} \end{cases} \]

Smooth hinge loss: This loss is obtained by smoothing the hinge loss using the technique described in Lemma 2. It is parameterized by a scalar $\gamma > 0$ and defined as
\[ \tilde\phi_\gamma(a) = \begin{cases} 0 & a \ge 1 \\ 1 - a - \gamma/2 & a \le 1-\gamma \\ \frac{1}{2\gamma}(1-a)^2 & \text{otherwise} \end{cases} \tag{9} \]
The conjugate function is
\[ \tilde\phi_\gamma^\ast(b) = \begin{cases} b + \frac{\gamma}{2}b^2 & \text{if } b \in [-1, 0] \\ \infty & \text{otherwise} \end{cases} \]
It follows that $\tilde\phi_\gamma^\ast$ is $\gamma$-strongly convex and $\tilde\phi_\gamma$ is $(1/\gamma)$-smooth. In addition, if $\phi$ is the vanilla hinge loss, we have for every $a$ that $\phi(a) - \gamma/2 \le \tilde\phi_\gamma(a) \le \phi(a)$.

Max-of-hinge: The max-of-hinge loss is a function from $\mathbb{R}^k$ to $\mathbb{R}$, defined as $\phi(a) = \max_j [c_j + a_j]_+$ for some $c \in \mathbb{R}^k$. This loss function is useful for multiclass prediction problems. To calculate the conjugate of $\phi$, let
\[ S = \{\beta \in \mathbb{R}_+^k : \|\beta\|_1 \le 1\} \tag{10} \]
and note that we can write $\phi$ as $\phi(a) = \max_{\beta\in S}\sum_j \beta_j(c_j + a_j)$. Hence, the conjugate of $\phi$ is
\[ \phi^\ast(b) = \max_a\big[a^\top b - \phi(a)\big] = \max_a\min_{\beta\in S}\; a^\top b - \sum_j \beta_j(c_j+a_j) = \min_{\beta\in S}\max_a\; a^\top b - \sum_j \beta_j(c_j+a_j) = \min_{\beta\in S}\; -\sum_j\beta_j c_j + \sum_j\max_{a_j}\, a_j(b_j - \beta_j). \]
Each inner maximization over $a_j$ is $\infty$ unless $\beta_j = b_j$. Therefore,
\[ \phi^\ast(b) = \begin{cases} -c^\top b & \text{if } b \in S \\ \infty & \text{otherwise} \end{cases} \tag{11} \]

Smooth max-of-hinge: This loss is obtained by smoothing the max-of-hinge loss using the technique described in Lemma 2, and is parameterized by a scalar $\gamma > 0$. We start by adding regularization to the conjugate of the max-of-hinge given in (11), obtaining
\[ \tilde\phi_\gamma^\ast(b) = \begin{cases} \frac{\gamma}{2}\|b\|^2 - c^\top b & \text{if } b \in S \\ \infty & \text{otherwise} \end{cases} \tag{12} \]
Taking the conjugate of the conjugate, we obtain
\[ \tilde\phi_\gamma(a) = \max_b\; b^\top a - \tilde\phi_\gamma^\ast(b) = \max_{b\in S}\; b^\top(a+c) - \frac{\gamma}{2}\|b\|^2 = \frac{\gamma}{2}\|(a+c)/\gamma\|^2 - \frac{\gamma}{2}\min_{b\in S}\|b - (a+c)/\gamma\|^2. \tag{13} \]
While we do not have a closed-form solution for the minimization over $b$ in the definition of $\tilde\phi_\gamma$ above, it is the problem of projecting onto the intersection of the $L_1$ ball and the positive orthant, and can be solved efficiently using the following procedure, adapted from [9].

Project($\mu$)
  Goal: solve $\operatorname{argmin}_b \|b - \mu\|^2$ s.t. $b \in \mathbb{R}_+^k$, $\|b\|_1 \le 1$
  Let: $\forall i,\ \tilde\mu_i = \max\{0, \mu_i\}$
  If: $\|\tilde\mu\|_1 \le 1$, stop and return $b = \tilde\mu$
  Sort: let $i_1, \ldots, i_k$ be s.t. $\tilde\mu_{i_1} \ge \tilde\mu_{i_2} \ge \ldots \ge \tilde\mu_{i_k}$
  Find: $j^\ast = \max\big\{j : j\tilde\mu_{i_j} + 1 - \sum_{r=1}^j \tilde\mu_{i_r} > 0\big\}$
  Define: $\theta = -1 + \sum_{r=1}^{j^\ast}\tilde\mu_{i_r}$
  Return: $b$ s.t. $\forall i,\ b_i = \max\{\mu_i - \theta/j^\ast,\ 0\}$

It also holds that $\nabla\tilde\phi_\gamma(a) = \operatorname{argmin}_{b\in S}\|b - (a+c)/\gamma\|^2$, and therefore the gradient can also be calculated using the above projection procedure. Note that if $\phi$ is the max-of-hinge loss, then $\phi^\ast(b) + \gamma/2 \ge \tilde\phi_\gamma^\ast(b) \ge \phi^\ast(b)$, and hence $\phi(a) - \gamma/2 \le \tilde\phi_\gamma(a) \le \phi(a)$. Observe that the negative elements of $a+c$ do not contribute to $\tilde\phi_\gamma$; this immediately implies that if $\phi(a) = 0$ then $\tilde\phi_\gamma(a) = 0$ as well.

Soft-max-of-hinge loss function: Another approach to smoothing the max-of-hinge loss is to use soft-max instead of max.
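The Project procedure above can be transcribed directly. The following is our sketch of it (function name is ours), using the equivalent water-filling level $\theta/j^\ast$ for the final thresholding:

```python
import numpy as np

def project(mu):
    """Euclidean projection of mu onto {b : b >= 0, ||b||_1 <= 1}."""
    mu_pos = np.maximum(mu, 0.0)               # clip negative coordinates
    if mu_pos.sum() <= 1.0:
        return mu_pos                          # already feasible
    # otherwise the solution lies on the simplex {b >= 0, ||b||_1 = 1}
    s = np.sort(mu_pos)[::-1]                  # sorted, non-increasing
    cumsum = np.cumsum(s)
    j = np.arange(1, len(s) + 1)
    j_star = j[j * s + 1.0 - cumsum > 0][-1]   # largest j with j*s_j + 1 - cumsum_j > 0
    theta = (cumsum[j_star - 1] - 1.0) / j_star
    return np.maximum(mu_pos - theta, 0.0)
```

For instance, `project(np.array([2.0, 0.0, 0.0]))` returns `[1.0, 0.0, 0.0]`, and any already-feasible input is returned unchanged.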
The resulting soft-max-of-hinge loss function is defined as
\[ \phi_\gamma(a) = \gamma\log\!\Big(1 + \sum_{i=1}^k e^{(c_i+a_i)/\gamma}\Big), \tag{14} \]
where $\gamma > 0$ is a parameter. We have
\[ \max_i\,[c_i + a_i]_+ \;\le\; \phi_\gamma(a) \;\le\; \max_i\,[c_i + a_i]_+ + \gamma\log(k+1). \]
The $j$'th element of the gradient of $\phi_\gamma$ is
\[ \nabla_j\phi_\gamma(a) = \frac{e^{(c_j+a_j)/\gamma}}{1 + \sum_{i=1}^k e^{(c_i+a_i)/\gamma}}. \]
By the definition of the conjugate, $\phi_\gamma^\ast(b) = \max_a\; a^\top b - \phi_\gamma(a)$. The vector $a$ that attains the maximum must satisfy
\[ \forall j,\quad b_j = \frac{e^{(c_j+a_j)/\gamma}}{1 + \sum_{i=1}^k e^{(c_i+a_i)/\gamma}}. \]
This can be satisfied only if $b_j \ge 0$ for all $j$ and $\sum_j b_j \le 1$, that is, $b \in S$. Denote $Z = \sum_{i=1}^k e^{(c_i+a_i)/\gamma}$ and note that
\[ (1+Z)\|b\|_1 = Z \;\Rightarrow\; Z = \frac{\|b\|_1}{1-\|b\|_1} \;\Rightarrow\; 1+Z = \frac{1}{1-\|b\|_1}. \]
It follows that
\[ a_j = \gamma\big(\log(b_j) + \log(1+Z)\big) - c_j = \gamma\big(\log(b_j) - \log(1-\|b\|_1)\big) - c_j, \]
which yields
\[ \phi_\gamma^\ast(b) = \sum_j\big(\gamma(\log(b_j) - \log(1-\|b\|_1)) - c_j\big)b_j + \gamma\log(1-\|b\|_1) = -c^\top b + \gamma\Big[(1-\|b\|_1)\log(1-\|b\|_1) + \sum_j b_j\log(b_j)\Big]. \]
Finally, if $b \notin S$ then the gradient of $a^\top b - \phi_\gamma(a)$ vanishes nowhere, which means that $\phi_\gamma^\ast(b) = \infty$. All in all, we obtain
\[ \phi_\gamma^\ast(b) = \begin{cases} -c^\top b + \gamma\big[(1-\|b\|_1)\log(1-\|b\|_1) + \sum_j b_j\log(b_j)\big] & \text{if } b \in S \\ \infty & \text{otherwise} \end{cases} \tag{15} \]
Since the entropic function $\sum_j b_j\log(b_j)$ is $1$-strongly convex over $S$ with respect to the $L_1$ norm, $\phi_\gamma^\ast$ is $\gamma$-strongly convex with respect to the $L_1$ norm, from which it follows that $\phi_\gamma$ is $(1/\gamma)$-smooth with respect to the $L_\infty$ norm.

5.2 Regularizers

$L_2$ regularization: The simplest regularizer is the squared $L_2$ regularizer $g(w) = \frac{1}{2}\|w\|_2^2$. This is a $1$-strongly convex regularization function whose conjugate is $g^\ast(\theta) = \frac{1}{2}\|\theta\|_2^2$; we also have $\nabla g^\ast(\theta) = \theta$. For our acceleration procedure, we also use the $L_2$ regularizer plus a linear term, namely, $g(w) = \frac{1}{2}\|w\|^2 - w^\top z$ for some vector $z$.
The conjugate of this function is
\[ g^\ast(\theta) = \max_w\; w^\top(\theta+z) - \frac{1}{2}\|w\|^2 = \frac{1}{2}\|\theta+z\|^2, \]
and $\nabla g^\ast(\theta) = \theta + z$.

$L_1$ regularization: Another popular regularizer we consider is the $L_1$ regularizer, $f(w) = \sigma\|w\|_1$. This is not a strongly convex regularizer, and we therefore add a slight $L_2$ term to it and define the $L_1$-$L_2$ regularizer
\[ g(w) = \frac{1}{2}\|w\|_2^2 + \sigma'\|w\|_1, \tag{16} \]
where $\sigma' = \frac{\sigma}{\lambda}$ for some small $\lambda$. Note that $\lambda g(w) = \frac{\lambda}{2}\|w\|_2^2 + \sigma\|w\|_1$, so if $\lambda$ is small enough (as will be formalized later) we obtain that $\lambda g(w) \approx \sigma\|w\|_1$. The conjugate of $g$ is
\[ g^\ast(v) = \max_w\; w^\top v - \frac{1}{2}\|w\|_2^2 - \sigma'\|w\|_1. \]
The maximizer is also $\nabla g^\ast(v)$, and we now show how to calculate it. We have
\[ \nabla g^\ast(v) = \operatorname{argmax}_w\; w^\top v - \frac{1}{2}\|w\|_2^2 - \sigma'\|w\|_1 = \operatorname{argmin}_w\; \frac{1}{2}\|w-v\|_2^2 + \sigma'\|w\|_1. \]
A sub-gradient of the objective above has the form $w - v + \sigma' z$, where $z$ is a vector with $z_i = \operatorname{sign}(w_i)$, and $z_i \in [-1,1]$ whenever $w_i = 0$. Therefore, if $w$ is an optimal solution, then for all $i$, either $w_i = 0$ or $w_i = v_i - \sigma'\operatorname{sign}(w_i)$. Furthermore, it is easy to verify that if $w$ is optimal and $w_i \ne 0$, then the sign of $w_i$ must equal the sign of $v_i$. Therefore, whenever $w_i \ne 0$ we have $w_i = v_i - \sigma'\operatorname{sign}(v_i)$, and in that case we must have $|v_i| > \sigma'$. The other direction is also true: if $|v_i| > \sigma'$, then setting $w_i = v_i - \sigma'\operatorname{sign}(v_i)$ leads to an objective whose $i$'th component is $\frac{1}{2}\sigma'^2 + \sigma'(|v_i| - \sigma') \le \frac{1}{2}|v_i|^2$, where the right-hand side is the $i$'th component of the objective obtained by setting $w_i = 0$. This leads to the conclusion that
\[ \nabla_i g^\ast(v) = \operatorname{sign}(v_i)\big[|v_i| - \sigma'\big]_+ = \begin{cases} v_i - \sigma'\operatorname{sign}(v_i) & \text{if } |v_i| > \sigma' \\ 0 & \text{otherwise} \end{cases} \]
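The conclusion above is coordinate-wise soft thresholding. A minimal sketch (function name is ours):

```python
import numpy as np

def grad_g_star(v, sigma_p):
    """Gradient of the conjugate of g(w) = 0.5*||w||^2 + sigma'*||w||_1:
    coordinate-wise soft thresholding, sign(v_i) * [|v_i| - sigma']_+."""
    return np.sign(v) * np.maximum(np.abs(v) - sigma_p, 0.0)

v = np.array([0.3, -1.5, 0.05, 2.0])
w = grad_g_star(v, 0.1)   # each coordinate is either zeroed or shrunk toward 0 by sigma'
```

Here `w` is `[0.2, -1.4, 0.0, 1.9]`: coordinates with $|v_i| \le \sigma'$ are set to zero, and the rest move toward zero by exactly $\sigma'$.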
It follows that
\[ g^\ast(v) = \sum_i \operatorname{sign}(v_i)\big[|v_i|-\sigma'\big]_+ v_i - \frac{1}{2}\sum_i\big[|v_i|-\sigma'\big]_+^2 - \sigma'\sum_i\big[|v_i|-\sigma'\big]_+ = \sum_i\big[|v_i|-\sigma'\big]_+\Big(|v_i| - \sigma' - \frac{1}{2}\big[|v_i|-\sigma'\big]_+\Big) = \frac{1}{2}\sum_i\big[|v_i|-\sigma'\big]_+^2. \]
Another regularization function we will use in the accelerated procedure is
\[ g(w) = \frac{1}{2}\|w\|_2^2 + \sigma'\|w\|_1 - z^\top w. \tag{17} \]
Its conjugate is
\[ g^\ast(v) = \frac{1}{2}\sum_i\big[|v_i+z_i|-\sigma'\big]_+^2, \]
with gradient $\nabla_i g^\ast(v) = \operatorname{sign}(v_i+z_i)\big[|v_i+z_i|-\sigma'\big]_+$.

5.3 Ridge Regression

In ridge regression, we minimize the squared loss with $L_2$ regularization. That is, $g(w) = \frac{1}{2}\|w\|^2$ and for every $i$ we have $x_i \in \mathbb{R}^d$ and $\phi_i(a) = \frac{1}{2}(a-y_i)^2$ for some $y_i \in \mathbb{R}$. The primal problem is therefore
\[ P(w) = \frac{1}{2n}\sum_{i=1}^n (x_i^\top w - y_i)^2 + \frac{\lambda}{2}\|w\|^2. \]
Below we specify Prox-SDCA for ridge regression. We use Option I, since it is possible to derive a closed-form solution to the maximization of the dual with respect to $\Delta\alpha_i$. Indeed, since $-\phi_i^\ast(-b) = -\frac{1}{2}b^2 + y_i b$, the maximization problem is
\[ \Delta\alpha_i = \operatorname{argmax}_b\; -\frac{1}{2}\big(\alpha_i^{(t-1)}+b\big)^2 + y_i\big(\alpha_i^{(t-1)}+b\big) - w^{(t-1)\,\top}x_i\,b - \frac{b^2\|x_i\|_2^2}{2\lambda n} = \operatorname{argmax}_b\; -\frac{1}{2}\Big(1 + \frac{\|x_i\|_2^2}{\lambda n}\Big)b^2 - \big(\alpha_i^{(t-1)} + w^{(t-1)\,\top}x_i - y_i\big)b = -\frac{\alpha_i^{(t-1)} + w^{(t-1)\,\top}x_i - y_i}{1 + \frac{\|x_i\|_2^2}{\lambda n}}. \]
Applying the above update, and using some additional tricks to improve the running time, we obtain the following procedure.

Prox-SDCA($(x_i, y_i)_{i=1}^n, \epsilon, \alpha^{(0)}, z$) for solving ridge regression
  Goal: Minimize $P(w) = \frac{1}{2n}\sum_{i=1}^n(x_i^\top w - y_i)^2 + \lambda\big(\frac{1}{2}\|w\|^2 - w^\top z\big)$
  Initialize $v^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^{(0)}x_i$; $\forall i,\ \tilde y_i = y_i - x_i^\top z$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    $\Delta\alpha_i = -\dfrac{\alpha_i^{(t-1)} + v^{(t-1)\,\top}x_i - \tilde y_i}{1 + \|x_i\|_2^2/(\lambda n)}$
    $\alpha_i^{(t)} \leftarrow \alpha_i^{(t-1)} + \Delta\alpha_i$, and for $j \ne i$, $\alpha_j^{(t)} \leftarrow \alpha_j^{(t-1)}$
    $v^{(t)} \leftarrow v^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}x_i$
  Stopping condition: Let $w^{(t)} = v^{(t)} + z$. Stop if
  \[ \frac{1}{2n}\sum_{i=1}^n\Big[(x_i^\top w^{(t)} - y_i)^2 + (\alpha_i^{(t)} + y_i)^2 - y_i^2\Big] + \lambda\, w^{(t)\,\top}v^{(t)} \le \epsilon \]

The runtime of Prox-SDCA for ridge regression becomes
\[ \tilde O\!\Big(d\Big(n + \frac{R^2}{\lambda}\Big)\Big), \]
where $R = \max_i\|x_i\|$. This matches the recent results of [15, 25]. If $R^2/\lambda \gg n$, we can apply the acceleration procedure and obtain the improved runtime
\[ \tilde O\!\left(d\sqrt{\frac{nR^2}{\lambda}}\right). \]

5.4 Logistic Regression

In logistic regression, we minimize the logistic loss with $L_2$ regularization. That is, $g(w) = \frac{1}{2}\|w\|^2$ and for every $i$ we have $x_i \in \mathbb{R}^d$ and $\phi_i(a) = \log(1+e^a)$. The primal problem is therefore³
\[ P(w) = \frac{1}{n}\sum_{i=1}^n\log(1+e^{x_i^\top w}) + \frac{\lambda}{2}\|w\|^2. \]
The dual problem is
\[ D(\alpha) = \frac{1}{n}\sum_{i=1}^n\big(\alpha_i\log(-\alpha_i) - (1+\alpha_i)\log(1+\alpha_i)\big) - \frac{\lambda}{2}\|v(\alpha)\|^2, \]
and the dual constraints are $\alpha \in [-1, 0]^n$. Below we specify Prox-SDCA for logistic regression using Option III.

Prox-SDCA($(x_i)_{i=1}^n, \epsilon, \alpha^{(0)}, z$) for logistic regression
  Goal: Minimize $P(w) = \frac{1}{n}\sum_{i=1}^n\log(1+e^{x_i^\top w}) + \lambda\big(\frac{1}{2}\|w\|^2 - w^\top z\big)$
  Initialize $v^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^n\alpha_i^{(0)}x_i$, and $\forall i,\ p_i = x_i^\top z$
  Define: $\phi^\ast(b) = b\log(b) + (1-b)\log(1-b)$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    $p = x_i^\top w^{(t-1)}$
    $q = -1/(1+e^{-p}) - \alpha_i^{(t-1)}$
    $s = \min\left\{1,\ \dfrac{\log(1+e^p) + \phi^\ast(-\alpha_i^{(t-1)}) + p\,\alpha_i^{(t-1)} + 2q^2}{q^2\big(4 + \frac{\|x_i\|^2}{\lambda n}\big)}\right\}$
    $\Delta\alpha_i = sq$
    $\alpha_i^{(t)} = \alpha_i^{(t-1)} + \Delta\alpha_i$, and for $j \ne i$, $\alpha_j^{(t)} = \alpha_j^{(t-1)}$
    $v^{(t)} = v^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}x_i$
  Stopping condition: Let $w^{(t)} = v^{(t)} + z$. Stop if
  \[ \frac{1}{n}\sum_{i=1}^n\Big[\log(1+e^{x_i^\top w^{(t)}}) + \phi^\ast(-\alpha_i^{(t)})\Big] + \lambda\, w^{(t)\,\top}v^{(t)} \le \epsilon \]

The runtime analysis is similar to the analysis for ridge regression.
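As a sanity check on the closed-form coordinate update derived in Section 5.3, the following sketch (ours, not the paper's reference code) runs the ridge-regression Prox-SDCA loop with $z = 0$ for a fixed number of epochs and compares the result against the exact ridge solution:

```python
import numpy as np

def prox_sdca_ridge(X, y, lam, epochs=50, seed=0):
    """Prox-SDCA for ridge regression with z = 0, using the closed-form update."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n)
    v = np.zeros(d)                          # v = (1/(lam*n)) * sum_i alpha_i * x_i
    sq_norms = (X ** 2).sum(axis=1)
    for _ in range(epochs * n):
        i = rng.integers(n)
        delta = -(alpha[i] + X[i] @ v - y[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        v += (delta / (lam * n)) * X[i]
    return v                                 # w = v + z = v here

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
y = rng.standard_normal(40)
lam = 0.1
w = prox_sdca_ridge(X, y, lam)
# exact minimizer of (1/2n)||Xw - y||^2 + (lam/2)||w||^2
n = X.shape[0]
w_exact = np.linalg.solve(X.T @ X / n + lam * np.eye(5), X.T @ y / n)
```

After 50 passes the iterate agrees with the closed-form ridge solution to well within the stated tolerance, consistent with the linear convergence rate.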
³ Usually, the training data comes with labels $y_i \in \{\pm 1\}$, and the loss function becomes $\log(1+e^{-y_i x_i^\top w})$. However, we can easily get rid of the labels by redefining $x_i \leftarrow -y_i x_i$.

5.5 Lasso

In the Lasso problem, the loss function is the squared loss but the regularizer is $L_1$. That is, we need to solve the problem
\[ \min_w\left[\frac{1}{2n}\sum_{i=1}^n(x_i^\top w - y_i)^2 + \sigma\|w\|_1\right], \tag{18} \]
with a positive regularization parameter $\sigma \in \mathbb{R}_+$. Let $\bar y = \frac{1}{2n}\sum_{i=1}^n y_i^2$, and let $\bar w$ be an optimal solution of (18). Then, the objective at $\bar w$ is at most the objective at $w = 0$, which yields
\[ \sigma\|\bar w\|_1 \le \bar y \;\Rightarrow\; \|\bar w\|_2 \le \|\bar w\|_1 \le \frac{\bar y}{\sigma}. \]
Consider the optimization problem $\min_w P(w)$ where
\[ P(w) = \frac{1}{2n}\sum_{i=1}^n(x_i^\top w - y_i)^2 + \lambda\Big(\frac{1}{2}\|w\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1\Big), \tag{19} \]
for some $\lambda > 0$. This problem fits into our framework, since the regularizer is now strongly convex. Furthermore, if $w^\ast$ is an $(\epsilon/2)$-accurate solution to the problem in (19), then $P(w^\ast) \le P(\bar w) + \epsilon/2$, which yields
\[ \frac{1}{2n}\sum_{i=1}^n(x_i^\top w^\ast - y_i)^2 + \sigma\|w^\ast\|_1 \le \frac{1}{2n}\sum_{i=1}^n(x_i^\top\bar w - y_i)^2 + \sigma\|\bar w\|_1 + \frac{\lambda}{2}\|\bar w\|_2^2 + \epsilon/2. \]
Since $\|\bar w\|_2^2 \le (\bar y/\sigma)^2$, we obtain that setting $\lambda = \epsilon(\sigma/\bar y)^2$ guarantees that $w^\ast$ is an $\epsilon$-accurate solution to the original problem given in (18). In light of the above, from now on we focus on the problem given in (19). As in the case of ridge regression, we can apply Prox-SDCA with Option I. The resulting pseudo-code is given below.

Prox-SDCA($(x_i, y_i)_{i=1}^n, \epsilon, \alpha^{(0)}, z$) for solving $L_1$-$L_2$ regression
  Goal: Minimize $P(w) = \frac{1}{2n}\sum_{i=1}^n(x_i^\top w - y_i)^2 + \lambda\big(\frac{1}{2}\|w\|^2 + \sigma'\|w\|_1 - w^\top z\big)$
  Initialize $v^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^n\alpha_i^{(0)}x_i$, and $\forall j,\ w_j^{(0)} = \operatorname{sign}(v_j^{(0)}+z_j)\big[|v_j^{(0)}+z_j| - \sigma'\big]_+$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    $\Delta\alpha_i = -\dfrac{\alpha_i^{(t-1)} + w^{(t-1)\,\top}x_i - y_i}{1 + \|x_i\|_2^2/(\lambda n)}$
    $\alpha_i^{(t)} = \alpha_i^{(t-1)} + \Delta\alpha_i$, and for $j \ne i$, $\alpha_j^{(t)} = \alpha_j^{(t-1)}$
    $v^{(t)} = v^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}x_i$
    $\forall j,\ w_j^{(t)} = \operatorname{sign}(v_j^{(t)}+z_j)\big[|v_j^{(t)}+z_j| - \sigma'\big]_+$
  Stopping condition: Stop if
  \[ \frac{1}{2n}\sum_{i=1}^n\Big[(x_i^\top w^{(t)} - y_i)^2 - 2y_i\alpha_i^{(t)} + (\alpha_i^{(t)})^2\Big] + \lambda\, w^{(t)\,\top}v^{(t)} \le \epsilon \]

Let us now discuss the runtime of the resulting method. Denote $R = \max_i\|x_i\|$ and, for simplicity, assume that $\bar y = O(1)$. Choosing $\lambda = \epsilon(\sigma/\bar y)^2$, the runtime of our method becomes
\[ \tilde O\!\left(d\left(n + \min\left\{\frac{R^2}{\epsilon\sigma^2},\ \sqrt{\frac{nR^2}{\epsilon\sigma^2}}\right\}\right)\right). \]
It is also convenient to write the bound in terms of $B = \|\bar w\|_2$, where, as before, $\bar w$ is the optimal solution of the $L_1$ regularized problem. With this parameterization, we can set $\lambda = \epsilon/B^2$ and the runtime becomes
\[ \tilde O\!\left(d\left(n + \min\left\{\frac{R^2B^2}{\epsilon},\ \sqrt{\frac{nR^2B^2}{\epsilon}}\right\}\right)\right). \]
The runtime of standard SGD is $O(dR^2B^2/\epsilon^2)$, even in the case of smooth loss functions such as the squared loss. Several variants of SGD that lead to sparser intermediate solutions have been proposed (e.g. [14, 21, 27, 8, 10]); however, all of these variants share the $O(dR^2B^2/\epsilon^2)$ runtime, which is much slower than ours when $\epsilon$ is small.

Another relevant approach is the FISTA algorithm of [2]. The shrinkage operator of FISTA is the same as the gradient of $g^\ast$ used in our approach. It is a batch algorithm that uses Nesterov's accelerated gradient technique. For the squared loss, the runtime of FISTA is
\[ O\!\left(dn\sqrt{\frac{R^2B^2}{\epsilon}}\right). \]
This bound is worse than ours by a factor of at least $\sqrt{n}$.

Another approach to solving (18) is stochastic coordinate descent over the primal problem. [21] showed that the runtime of this approach is $O\!\big(\frac{dnB^2}{\epsilon}\big)$, under the assumption that $\|x_i\|_\infty \le 1$ for all $i$. Similar results can also be found in [16]. For our method, the runtime depends on $R^2 = \max_i\|x_i\|_2^2$.
If $R^2 = O(1)$, then the runtime of our method is much better than that of [21]. In the general case, if $\max_i\|x_i\|_\infty \le 1$ then $R^2 \le d$, which yields the runtime
\[ \tilde O\!\left(d\left(n + \min\left\{\frac{dB^2}{\epsilon},\ \sqrt{\frac{ndB^2}{\epsilon}}\right\}\right)\right). \]
This is the same as or better than [21] whenever $d = O(n)$.

5.6 Linear SVM

Support Vector Machines (SVM) is an algorithm for learning a linear classifier. Linear SVM (i.e., SVM with the linear kernel) amounts to minimizing the objective
\[ P(w) = \frac{1}{n}\sum_{i=1}^n[1 - x_i^\top w]_+ + \frac{\lambda}{2}\|w\|^2, \]
where $[a]_+ = \max\{0, a\}$ and $x_i \in \mathbb{R}^d$ for every $i$. This can be cast as the objective given in (1) by letting the regularizer be $g(w) = \frac{1}{2}\|w\|_2^2$ and letting $\phi_i(a) = [1-a]_+$ be the hinge loss for every $i$.

Let $R = \max_i\|x_i\|_2$. SGD enjoys the rate of $O\big(\frac{1}{\lambda\epsilon}\big)$. Many software packages apply SDCA and obtain the rate $\tilde O\big(n + \frac{1}{\lambda\epsilon}\big)$. We now show how our accelerated proximal SDCA enjoys the rate $\tilde O\big(n + \sqrt{\frac{n}{\lambda\epsilon}}\big)$. This is significantly better than the rate of SGD when $\lambda\epsilon < 1/n$. We note that a default setting for $\lambda$, which often works well in practice, is $\lambda = 1/n$; in this case, $\lambda\epsilon = \epsilon/n \ll 1/n$.

Our first step is to smooth the hinge loss. Let $\gamma = \epsilon$ and consider the smooth hinge loss as defined in (9). Recall that the smooth hinge loss satisfies $\phi(a) - \gamma/2 \le \tilde\phi_\gamma(a) \le \phi(a)$ for all $a$. Let $\tilde P$ be the SVM objective with the hinge loss replaced by the smooth hinge loss. Then, for every $w'$ and $w$,
\[ P(w') - P(w) \le \tilde P(w') - \tilde P(w) + \gamma/2. \]
It follows that if $w'$ is an $(\epsilon/2)$-optimal solution for $\tilde P$, then it is an $\epsilon$-optimal solution for $P$. For the smooth hinge loss, the optimization problem given in Option I of Prox-SDCA has a closed-form solution, and we obtain the following procedure:

Prox-SDCA($(x_1, \ldots, x_n), \epsilon, \alpha^{(0)}, z$) for solving SVM (with the smooth hinge loss as in (9))
  Define: $\tilde\phi_\gamma$ as in (9)
  Goal: Minimize $P(w) = \frac{1}{n}\sum_{i=1}^n\tilde\phi_\gamma(x_i^\top w) + \lambda\big(\frac{1}{2}\|w\|^2 - w^\top z\big)$
  Initialize $w^{(0)} = z + \frac{1}{\lambda n}\sum_{i=1}^n\alpha_i^{(0)}x_i$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    $\Delta\alpha_i = \max\left\{-\alpha_i^{(t-1)},\ \min\left\{1 - \alpha_i^{(t-1)},\ \dfrac{1 - x_i^\top w^{(t-1)} - \gamma\alpha_i^{(t-1)}}{\|x_i\|^2/(\lambda n) + \gamma}\right\}\right\}$
    $\alpha_i^{(t)} \leftarrow \alpha_i^{(t-1)} + \Delta\alpha_i$, and for $j \ne i$, $\alpha_j^{(t)} \leftarrow \alpha_j^{(t-1)}$
    $w^{(t)} \leftarrow w^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}x_i$
  Stopping condition: Stop if
  \[ \frac{1}{n}\sum_{i=1}^n\Big[\tilde\phi_\gamma(x_i^\top w^{(t)}) - \alpha_i^{(t)} + \frac{\gamma}{2}(\alpha_i^{(t)})^2\Big] + \lambda\, w^{(t)\,\top}(w^{(t)} - z) \le \epsilon \]

Denote $R = \max_i\|x_i\|$. Then, the runtime of the resulting method is
\[ \tilde O\!\left(d\left(n + \min\left\{\frac{R^2}{\gamma\lambda},\ \sqrt{\frac{nR^2}{\gamma\lambda}}\right\}\right)\right). \]
In particular, choosing $\gamma = \epsilon$, we obtain a solution to the original SVM problem in runtime
\[ \tilde O\!\left(d\left(n + \min\left\{\frac{R^2}{\epsilon\lambda},\ \sqrt{\frac{nR^2}{\epsilon\lambda}}\right\}\right)\right). \]
As mentioned before, this is better than SGD when $\frac{1}{\lambda\epsilon} \gg n$.

5.7 Multiclass SVM

Next we consider multiclass SVM using the construction described in Crammer and Singer [5]. Each example consists of an instance vector $x_i \in \mathbb{R}^d$ and a label $y_i \in \{1, \ldots, k\}$. The goal is to learn a matrix $W \in \mathbb{R}^{d\times k}$ such that $W^\top x_i$ is a $k$-dimensional vector of scores for the different classes. The prediction is the coordinate of $W^\top x_i$ of maximal value. The loss function is
\[ \max_{j\ne y_i}\big(1 + (W^\top x_i)_j - (W^\top x_i)_{y_i}\big). \]
This can be written as $\phi_i\big((W^\top x_i) - (W^\top x_i)_{y_i}\big)$, where $\phi_i(a) = \max_j[c_{i,j} + a_j]_+$, with $c_i$ being the all-ones vector except for a $0$ in the $y_i$'th coordinate. We can model this in our framework as follows. Given a matrix $M$, let $\operatorname{vec}(M)$ be the column vector obtained by concatenating the columns of $M$. Let $e_j$ be the all-zeros vector except for a $1$ in the $j$'th coordinate. For every $i$, let $c_i = \mathbf{1} - e_{y_i}$ and let $X_i \in \mathbb{R}^{dk\times k}$ be the matrix whose $j$'th column is $\operatorname{vec}\big(x_i(e_j - e_{y_i})^\top\big)$. Then, $X_i^\top\operatorname{vec}(W) = W^\top x_i - (W^\top x_i)_{y_i}$.
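The identity $X_i^\top\operatorname{vec}(W) = W^\top x_i - (W^\top x_i)_{y_i}$ is easy to verify numerically. A small sketch of the check (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, y = 4, 3, 1                      # y is the label index
x = rng.standard_normal(d)
W = rng.standard_normal((d, k))

def e(j, k):
    v = np.zeros(k)
    v[j] = 1.0
    return v

# the j'th column of X_i is vec(x (e_j - e_y)^T); vec concatenates columns,
# i.e. column-major (Fortran-order) flattening
X_i = np.column_stack([np.outer(x, e(j, k) - e(y, k)).reshape(-1, order="F")
                       for j in range(k)])
lhs = X_i.T @ W.reshape(-1, order="F")   # X_i^T vec(W)
rhs = W.T @ x - (W.T @ x)[y]             # W^T x_i - (W^T x_i)_{y_i}
```

By construction the $y$'th coordinate of both sides is zero, matching the fact that $c_{i,y_i} = 0$ and the $y_i$'th column of $X_i$ is the zero matrix vectorized.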
Therefore, the optimization problem of multiclass SVM becomes
\[ \min_{w\in\mathbb{R}^{dk}} P(w) \quad\text{where}\quad P(w) = \frac{1}{n}\sum_{i=1}^n\phi_i(X_i^\top w) + \frac{\lambda}{2}\|w\|^2. \]
As in the case of SVM, we will use the smooth version of the max-of-hinge loss as described in (13). If we set the smoothness parameter $\gamma$ to be $\epsilon$, then an $(\epsilon/2)$-accurate solution to the problem with the smooth loss is also an $\epsilon$-accurate solution to the original problem with the non-smooth loss. Therefore, from now on we focus on the problem with the smooth max-of-hinge loss.

We specify Prox-SDCA for multiclass SVM using Option I. We will show that the optimization problem in Option I can be solved efficiently by sorting a $k$-dimensional vector; similar ideas were explored in [5] for the non-smooth max-of-hinge loss. Let
\[ \hat w = w - \frac{1}{\lambda n}X_i\alpha_i^{(t-1)}. \]
Then, the optimization problem over $\alpha_i$ can be written as
\[ \operatorname{argmax}_{\alpha_i :\, -\alpha_i\in S}\; \big(-c_i^\top - \hat w^\top X_i\big)\alpha_i - \frac{\gamma}{2}\|\alpha_i\|^2 - \frac{1}{2\lambda n}\|X_i\alpha_i\|^2. \tag{20} \]
As shown before, if we organize $\hat w$ as a $d\times k$ matrix, denoted $\hat W$, we have $X_i^\top\hat w = \hat W^\top x_i - (\hat W^\top x_i)_{y_i}$. We also have
\[ X_i\alpha_i = \sum_j\operatorname{vec}\big(x_i(e_j - e_{y_i})^\top\big)\alpha_{i,j} = \operatorname{vec}\Big(x_i\sum_j\alpha_{i,j}(e_j - e_{y_i})^\top\Big) = \operatorname{vec}\Big(x_i\big(\alpha_i - \big(\textstyle\sum_j\alpha_{i,j}\big)e_{y_i}\big)^\top\Big). \]
It follows that an optimal solution to (20) must set $\alpha_{i,y_i} = 0$, and we only need to optimize over the remaining dual variables. This also yields
\[ \|X_i\alpha_i\|^2 = \|x_i\|^2\|\alpha_i\|_2^2 + \|x_i\|^2\|\alpha_i\|_1^2. \]
So, (20) becomes
\[ \operatorname{argmax}_{\alpha_i :\, -\alpha_i\in S,\ \alpha_{i,y_i}=0}\; \big(-c_i^\top - \hat w^\top X_i\big)\alpha_i - \frac{\gamma}{2}\|\alpha_i\|_2^2 - \frac{\|x_i\|^2}{2\lambda n}\|\alpha_i\|_2^2 - \frac{\|x_i\|^2}{2\lambda n}\|\alpha_i\|_1^2. \tag{21} \]
This is equivalent to a problem of the form
\[ \operatorname{argmin}_{a\in\mathbb{R}_+^{k-1},\,\beta}\; \|a - \mu\|_2^2 + C\beta^2 \quad\text{s.t.}\quad \|a\|_1 = \beta \le 1, \tag{22} \]
where
\[ \mu = \frac{c_i + X_i^\top\hat w}{\gamma + \frac{\|x_i\|^2}{\lambda n}} \qquad\text{and}\qquad C = \frac{\frac{\|x_i\|^2}{\lambda n}}{\gamma + \frac{\|x_i\|^2}{\lambda n}} = \frac{1}{\frac{\gamma\lambda n}{\|x_i\|^2} + 1}. \]
The equivalence is in the sense that if $(a, \beta)$ is a solution of (22), then we can set $\alpha_i = -a$.
Assume for simplicity that $\mu$ is sorted in non-increasing order and that all of its elements are non-negative (otherwise, it is easy to verify that we can zero the negative elements of $\mu$ and sort the non-negative ones without affecting the solution). Let $\bar\mu$ be the cumulative sum of $\mu$; that is, for every $j$, $\bar\mu_j = \sum_{r=1}^j\mu_r$. For every $j$, let $z_j = \bar\mu_j - j\mu_j$. Since $\mu$ is sorted, we have
\[ z_{j+1} = \sum_{r=1}^{j+1}\mu_r - (j+1)\mu_{j+1} = \sum_{r=1}^j\mu_r - j\mu_{j+1} \ge \sum_{r=1}^j\mu_r - j\mu_j = z_j. \]
Note also that $z_1 = 0$ and that $z_k = \bar\mu_k = \|\mu\|_1$ (since the coordinate of $\mu$ corresponding to $y_i$ is zero). By the properties of projection onto the simplex (see [9]), for every $z \in (z_j, z_{j+1})$, the projection of $\mu$ onto the set $\{b \in \mathbb{R}_+^k : \|b\|_1 = z\}$ has the form $a_r = \max\{0, \mu_r - \theta\}$, where $\theta = (\bar\mu_j - z)/j$. Therefore, the objective becomes (ignoring constants that do not depend on $z$)
\[ j\theta^2 + Cz^2 = \frac{(\bar\mu_j - z)^2}{j} + Cz^2. \]
The first-order optimality condition with respect to $z$ is
\[ -\frac{\bar\mu_j - z}{j} + Cz = 0 \;\Rightarrow\; z = \frac{\bar\mu_j}{1 + jC}. \]
If this value of $z$ lies in $(z_j, z_{j+1})$, then it is the optimal $z$ and we are done. Otherwise, the optimum is attained at either $z = 0$ (which yields $a = 0$ as well) or $z = 1$.

a = OptimizeDual($\mu, C$)
  Solve the optimization problem given in (22)
  Initialize: $\forall i,\ \hat\mu_i = \max\{0, \mu_i\}$, and sort $\hat\mu$ s.t. $\hat\mu_1 \ge \hat\mu_2 \ge \ldots \ge \hat\mu_k$
  Let: $\bar\mu$ be s.t. $\bar\mu_j = \sum_{i=1}^j\hat\mu_i$
  Let: $z$ be s.t. $z_j = \min\{\bar\mu_j - j\hat\mu_j,\ 1\}$, and $z_{k+1} = 1$
  If: $\exists j$ s.t. $\frac{\bar\mu_j}{1+jC} \in [z_j, z_{j+1}]$, return $a$ s.t. $\forall i,\ a_i = \max\Big\{0,\ \mu_i - \Big(\bar\mu_j - \frac{\bar\mu_j}{1+jC}\Big)/j\Big\}$
  Else:
    Let $j$ be the minimal index s.t. $z_j = 1$; set $a$ s.t. $\forall i,\ a_i = \max\{0,\ \mu_i - (\bar\mu_j - 1)/j\}$
    If: $\|a - \mu\|^2 + C \le \|\mu\|^2$, return $a$
    Else: return $(0, \ldots, 0)$

The resulting pseudo-code for Prox-SDCA is given below.
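A direct transcription of OptimizeDual (our sketch; function name is ours) searches the segments $[z_j, z_{j+1}]$ for the unconstrained minimizer $\bar\mu_j/(1+jC)$ and otherwise falls back to the boundary cases $\beta = 1$ or $\beta = 0$:

```python
import numpy as np

def optimize_dual(mu, C):
    """Solve problem (22): argmin_{a >= 0, ||a||_1 = beta <= 1} ||a - mu||^2 + C*beta^2."""
    k = len(mu)
    mu_hat = np.sort(np.maximum(mu, 0.0))[::-1]      # clip negatives, sort non-increasing
    mu_bar = np.cumsum(mu_hat)
    j_idx = np.arange(1, k + 1)
    z = np.minimum(mu_bar - j_idx * mu_hat, 1.0)     # breakpoints z_1 <= ... <= z_k
    z_ext = np.append(z, 1.0)                        # z_{k+1} = 1
    for j in range(1, k + 1):
        z_star = mu_bar[j - 1] / (1.0 + j * C)       # segment-j unconstrained optimum
        if z_ext[j - 1] <= z_star <= z_ext[j]:
            theta = (mu_bar[j - 1] - z_star) / j
            return np.maximum(mu - theta, 0.0)
    # boundary cases: beta = 1 (reached only when some z_j was capped at 1) or beta = 0
    j = int(np.argmax(z >= 1.0)) + 1                 # minimal j with z_j = 1
    theta = (mu_bar[j - 1] - 1.0) / j
    a = np.maximum(mu - theta, 0.0)
    if np.sum((a - mu) ** 2) + C <= np.sum(mu ** 2):
        return a
    return np.zeros(k)
```

For example, with $\mu = (0.5, 0)$ and $C = 1$ the optimum is $a = (0.25, 0)$, matching the scalar condition $2(a - 0.5) + 2Ca = 0$; an all-negative $\mu$ returns the zero vector.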
We specify the procedure with $W$ treated as a matrix, because this is the more natural representation. For convenience of the code, we also maintain in $\alpha_{i,y_i}$ the value $-\sum_{j\ne y_i}\alpha_{i,j}$ (instead of the optimal value of $0$).

Prox-SDCA($(x_i, y_i)_{i=1}^n, \epsilon, \alpha, Z$) for solving multiclass SVM (with the smooth max-of-hinge loss as in (13))
  Define: $\tilde\phi_\gamma$ as in (13)
  Goal: Minimize $P(W) = \frac{1}{n}\sum_{i=1}^n\tilde\phi_\gamma\big((W^\top x_i) - (W^\top x_i)_{y_i}\big) + \lambda\big(\frac{1}{2}\operatorname{vec}(W)^\top\operatorname{vec}(W) - \operatorname{vec}(W)^\top\operatorname{vec}(Z)\big)$
  Initialize $W = Z + \frac{1}{\lambda n}\sum_{i=1}^n x_i\alpha_i^\top$
  Iterate: for $t = 1, 2, \ldots$
    Randomly pick $i$
    $\hat W = W - \frac{1}{\lambda n}x_i\alpha_i^\top$
    $p = x_i^\top\hat W$, $\quad p = p - p_{y_i}$, $\quad c = \mathbf{1} - e_{y_i}$
    $\mu = \dfrac{c + p}{\gamma + \|x_i\|^2/(\lambda n)}$, $\quad C = \dfrac{1}{1 + \gamma\lambda n/\|x_i\|^2}$
    $a = \text{OptimizeDual}(\mu, C)$
    $\alpha_i = -a$, $\quad \alpha_{i,y_i} = \|a\|_1$
    $W = \hat W + \frac{1}{\lambda n}x_i\alpha_i^\top$
  Stopping condition: let $G = 0$; for $i = 1, \ldots, n$:
    $a = W^\top x_i$, $\quad a = a - a_{y_i}$, $\quad c = \mathbf{1} - e_{y_i}$, $\quad b = \text{Project}((a+c)/\gamma)$
    $G = G + \frac{\gamma}{2}\big(\|(a+c)/\gamma\|^2 - \|b - (a+c)/\gamma\|^2\big) + c^\top\alpha_i + \frac{\gamma}{2}\big(\|\alpha_i\|^2 - \alpha_{i,y_i}^2\big)$
  Stop if $G/n + \lambda\operatorname{vec}(W)^\top\operatorname{vec}(W - Z) \le \epsilon$

[Figure 3 appears here: a grid of convergence plots, one row per $\lambda \in \{10^{-6}, 10^{-7}, 10^{-8}, 10^{-9}\}$ and one column per data set (astro-ph, CCAT, cov1), each comparing AccProxSDCA, ProxSDCA, and FISTA; primal objective versus number of passes (0 to 100).]
Figure 3: Comparing Accelerated-Prox-SDCA, Prox-SDCA, and FISTA for minimizing the smoothed hinge-loss ($\gamma = 1$) with $L_1$-$L_2$ regularization ($\sigma = 10^{-5}$ and $\lambda$ varies in $\{10^{-6}, \ldots, 10^{-9}\}$). In each of these plots, the y-axis is the primal objective and the x-axis is the number of passes through the entire training set. The three columns correspond to the three data sets. The methods are terminated either when the stopping condition is met (with $\epsilon = 10^{-3}$) or after 100 passes over the data.

6 Experiments

In this section we compare Prox-SDCA, its accelerated version Accelerated-Prox-SDCA, and the FISTA algorithm of [2] on $L_1$-$L_2$ regularized loss minimization problems. The experiments were performed on three large datasets with very different feature counts and sparsity, which were kindly provided by Thorsten Joachims (the datasets were also used in [24]). The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.

Dataset    Training Size   Testing Size   Features   Sparsity
astro-ph   29882           32487          99757      0.08%
CCAT       781265          23149          47236      0.16%
cov1       522911          58101          54         22.22%

These are binary classification problems, with each $x_i$ being a vector which has been normalized so that $\|x_i\|_2 = 1$, and $y_i$ being a binary class label of $\pm 1$. We multiplied each $x_i$ by $y_i$ and, following [24], employed the smooth hinge loss $\tilde\phi_\gamma$ as in (9), with $\gamma = 1$. The optimization problem we need to solve is therefore $\min_w P(w)$, where
\[
P(w) = \frac{1}{n}\sum_{i=1}^n \tilde\phi_\gamma(x_i^\top w) + \frac{\lambda}{2}\|w\|_2^2 + \sigma\|w\|_1 .
\]
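The objective above is straightforward to evaluate. Below is a Python sketch; since definition (9) is not reproduced in this excerpt, the closed form of the smoothed hinge $\tilde\phi_\gamma$ is taken from [24] and should be treated as an assumption, as should the function names:

```python
import numpy as np

def smooth_hinge(a, gamma=1.0):
    # Smoothed hinge loss of [24]; (9) is not shown in this excerpt, so this
    # closed form is an assumption: 0 past the margin, linear below it, and
    # a quadratic bridge of width gamma in between.
    return np.where(a >= 1.0, 0.0,
           np.where(a <= 1.0 - gamma, 1.0 - a - gamma / 2.0,
                    (1.0 - a) ** 2 / (2.0 * gamma)))

def primal_objective(w, X, lam, sigma, gamma=1.0):
    # P(w) = (1/n) sum_i phi(x_i^T w) + (lam/2)||w||_2^2 + sigma*||w||_1,
    # with labels already folded into X (each x_i pre-multiplied by y_i)
    margins = X @ w
    return (smooth_hinge(margins, gamma).mean()
            + 0.5 * lam * np.dot(w, w)
            + sigma * np.abs(w).sum())
```

For $w = 0$ every margin is zero and $\tilde\phi_1(0) = 1/2$, so $P(0) = 0.5$ exactly, which is consistent with the upper ends of the y-axis ranges in Figure 3.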
In the experiments, we set $\sigma = 10^{-5}$ and vary $\lambda$ in the range $\{10^{-6}, 10^{-7}, 10^{-8}, 10^{-9}\}$. The convergence behaviors are plotted in Figure 3. In all the plots we depict the primal objective as a function of the number of passes over the data (often referred to as "epochs"). For FISTA, each iteration involves a single pass over the data. For Prox-SDCA, each $n$ iterations are equivalent to a single pass over the data, and for Accelerated-Prox-SDCA, each $n$ inner iterations are equivalent to a single pass over the data. For Prox-SDCA and Accelerated-Prox-SDCA we implemented their corresponding stopping conditions and terminated the methods once an accuracy of $10^{-3}$ was guaranteed.

It is clear from the graphs that Accelerated-Prox-SDCA yields the best results, and it often significantly outperforms the other methods. Prox-SDCA behaves similarly when $\lambda$ is relatively large, but it converges much more slowly when $\lambda$ is small. This is consistent with our theory. Finally, the relative performance of FISTA and Prox-SDCA depends on the ratio between $\lambda$ and $n$, but in all cases Accelerated-Prox-SDCA is much faster than FISTA. This is again consistent with our theory.

7 Discussion and Open Problems

We have described and analyzed a proximal stochastic dual coordinate ascent method and have shown how to accelerate the procedure. The overall runtime of the resulting method improves state-of-the-art results in many cases of interest. There are two main open problems that we leave to future research.

Open Problem 1. When $\frac{1}{\lambda\gamma}$ is larger than $n$, the runtime of our procedure becomes $\tilde{O}\!\left(d\sqrt{\frac{n}{\lambda\gamma}}\right)$. Is it possible to derive a method whose runtime is $\tilde{O}\!\left(d\left(n + \sqrt{\frac{1}{\lambda\gamma}}\right)\right)$?

Open Problem 2. Our Prox-SDCA procedure and its analysis work for regularizers which are strongly convex with respect to an arbitrary norm. However, our acceleration procedure is designed for regularizers which are strongly convex with respect to the Euclidean norm.
Is it possible to extend the acceleration procedure to more general regularizers?

Acknowledgements

The authors would like to thank Fen Xia for careful proof-reading of the paper, which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1016061, NSF DMS-1007527, and NSF IIS-1250985.

A Proofs of Iteration Bounds for Prox-SDCA

The proof technique follows that of Shalev-Shwartz and Zhang [25], but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms. We prove the theorems for running Prox-SDCA while choosing $\Delta\alpha_i$ as in Option I. A careful examination of the proof easily reveals that the results hold for the other options as well. More specifically, Lemma 6 only requires choosing $\Delta\alpha_i = s(u_i^{(t-1)} - \alpha_i^{(t-1)})$ as in (23); Option III chooses $s$ to optimize the bound on the right-hand side of (25), and hence ensures that this choice can do no worse than the result of Lemma 6 with any $s$. The simplification in Options IV and V employs the specific simplification of the bound of Lemma 6 in the proofs of the theorems.

The key lemma is the following:

Lemma 6. Assume that $\phi_i^*$ is $\gamma$-strongly convex. For any iteration $t$, let $E_t$ denote the expectation with respect to the randomness in choosing $i$ at round $t$, conditional on the value of $\alpha^{(t-1)}$. Then, for any iteration $t$ and any $s \in [0,1]$ we have
\[
E_t[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\ge\; \frac{s}{n}\left[P(w^{(t-1)}) - D(\alpha^{(t-1)})\right] - \left(\frac{s}{n}\right)^2 \frac{G^{(t)}}{2\lambda},
\]
where
\[
G^{(t)} = \frac{1}{n}\sum_{i=1}^n \left(\|X_i\|_{D\to D'}^2 - \frac{\gamma(1-s)\lambda n}{s}\right)\|u_i^{(t-1)} - \alpha_i^{(t-1)}\|_D^2 ,
\]
and $-u_i^{(t-1)} = \nabla\phi_i(X_i^\top w^{(t-1)})$.

Proof.
Since only the $i$'th element of $\alpha$ is updated, the improvement in the dual objective can be written as
\[
n[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] = \left(-\phi^*(-\alpha_i^{(t)}) - \lambda n\, g^*\!\left(v^{(t-1)} + (\lambda n)^{-1} X_i \Delta\alpha_i\right)\right) - \left(-\phi^*(-\alpha_i^{(t-1)}) - \lambda n\, g^*(v^{(t-1)})\right).
\]
The smoothness of $g^*$ implies that $g^*(v + \Delta v) \le h(v; \Delta v)$, where
\[
h(v; \Delta v) := g^*(v) + \nabla g^*(v)^\top \Delta v + \tfrac{1}{2}\|\Delta v\|_{D'}^2 .
\]
Therefore,
\[
n[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\ge\; \underbrace{-\phi^*(-\alpha_i^{(t)}) - \lambda n\, h\!\left(v^{(t-1)}; (\lambda n)^{-1} X_i \Delta\alpha_i\right)}_{A} - \underbrace{\left(-\phi^*(-\alpha_i^{(t-1)}) - \lambda n\, g^*(v^{(t-1)})\right)}_{B}.
\]
By the definition of the update, we have for all $s \in [0,1]$ that
\begin{align}
A &= \max_{\Delta\alpha_i}\left[-\phi^*\!\left(-(\alpha_i^{(t-1)} + \Delta\alpha_i)\right) - \lambda n\, h\!\left(v^{(t-1)}; (\lambda n)^{-1} X_i \Delta\alpha_i\right)\right] \nonumber\\
&\ge -\phi^*\!\left(-(\alpha_i^{(t-1)} + s(u_i^{(t-1)} - \alpha_i^{(t-1)}))\right) - \lambda n\, h\!\left(v^{(t-1)}; (\lambda n)^{-1} s X_i (u_i^{(t-1)} - \alpha_i^{(t-1)})\right). \tag{23}
\end{align}
From now on, we omit the superscripts and subscripts. Since $\phi^*$ is $\gamma$-strongly convex, we have that
\[
\phi^*(-(\alpha + s(u-\alpha))) = \phi^*(s(-u) + (1-s)(-\alpha)) \;\le\; s\,\phi^*(-u) + (1-s)\,\phi^*(-\alpha) - \frac{\gamma}{2}\,s(1-s)\,\|u-\alpha\|_D^2 . \tag{24}
\]
Combining this with (23) and rearranging terms, we obtain that
\begin{align*}
A &\ge -s\phi^*(-u) - (1-s)\phi^*(-\alpha) + \frac{\gamma}{2}s(1-s)\|u-\alpha\|_D^2 - \lambda n\, h\!\left(v; (\lambda n)^{-1} s X(u-\alpha)\right) \\
&= -s\phi^*(-u) - (1-s)\phi^*(-\alpha) + \frac{\gamma}{2}s(1-s)\|u-\alpha\|_D^2 - \lambda n\, g^*(v) - s\,w^\top X(u-\alpha) - \frac{s^2\|X(u-\alpha)\|_{D'}^2}{2\lambda n} \\
&\ge -s\left(\phi^*(-u) + w^\top X u\right) + \left(-\phi^*(-\alpha) - \lambda n\, g^*(v)\right) + \frac{s}{2}\left(\gamma(1-s) - \frac{s\|X\|_{D\to D'}^2}{\lambda n}\right)\|u-\alpha\|_D^2 + s\left(\phi^*(-\alpha) + w^\top X\alpha\right).
\end{align*}
Since $-u = \nabla\phi(X^\top w)$, we have $\phi^*(-u) + w^\top X u = -\phi(X^\top w)$, which yields
\[
A - B \;\ge\; s\left(\phi(X^\top w) + \phi^*(-\alpha) + w^\top X\alpha + \left(\frac{\gamma(1-s)}{2} - \frac{s\|X\|_{D\to D'}^2}{2\lambda n}\right)\|u-\alpha\|_D^2\right). \tag{25}
\]
Next, note that with $w = \nabla g^*(v)$ we have $g(w) + g^*(v) = w^\top v$. Therefore:
\[
P(w) - D(\alpha) = \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w) - \left(-\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) - \lambda g^*(v)\right)
\]
\[
= \frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) + \lambda\, w^\top v \;=\; \frac{1}{n}\sum_{i=1}^n \left(\phi_i(X_i^\top w) + \phi_i^*(-\alpha_i) + w^\top X_i \alpha_i\right).
\]
Therefore, if we take the expectation of (25) w.r.t. the choice of $i$, we obtain that
\[
\frac{1}{s}\,E_t[A - B] \;\ge\; [P(w) - D(\alpha)] - \frac{s}{2\lambda n}\cdot\underbrace{\frac{1}{n}\sum_{i=1}^n \left(\|X_i\|_{D\to D'}^2 - \frac{\gamma(1-s)\lambda n}{s}\right)\|u_i - \alpha_i\|_D^2}_{=\,G^{(t)}} .
\]
We have obtained that
\[
\frac{n}{s}\,E_t[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\ge\; [P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \frac{s\,G^{(t)}}{2\lambda n}. \tag{26}
\]
Multiplying both sides by $s/n$ concludes the proof of the lemma.

Equipped with the above lemmas, we are ready to prove Theorem 1 and Theorem 2.

Proof of Theorem 1. The assumption that $\phi_i$ is $(1/\gamma)$-smooth implies that $\phi_i^*$ is $\gamma$-strongly convex. We will apply Lemma 6 with
\[
s = \frac{n}{n + R^2/(\lambda\gamma)} = \frac{\lambda n\gamma}{R^2 + \lambda n\gamma} \in [0,1].
\]
Recall that $\|X_i\|_{D\to D'} \le R$. Therefore, the choice of $s$ implies that
\[
\|X_i\|_{D\to D'}^2 - \frac{\gamma(1-s)\lambda n}{s} \;\le\; R^2 - \frac{1-s}{s/(\lambda n\gamma)} = R^2 - R^2 = 0,
\]
and hence $G^{(t)} \le 0$ for all $t$. This yields
\[
E_t[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\ge\; \frac{s}{n}\left(P(w^{(t-1)}) - D(\alpha^{(t-1)})\right). \tag{27}
\]
Taking the expectation of both sides with respect to the randomness of the previous rounds, and using the law of total expectation, we obtain that
\[
E[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\ge\; \frac{s}{n}\,E[P(w^{(t-1)}) - D(\alpha^{(t-1)})]. \tag{28}
\]
But since $\epsilon_D^{(t-1)} := D(\alpha^*) - D(\alpha^{(t-1)}) \le P(w^{(t-1)}) - D(\alpha^{(t-1)})$ and $D(\alpha^{(t)}) - D(\alpha^{(t-1)}) = \epsilon_D^{(t-1)} - \epsilon_D^{(t)}$, we obtain that
\[
E[\epsilon_D^{(t)}] \;\le\; \left(1 - \frac{s}{n}\right)E[\epsilon_D^{(t-1)}] \;\le\; \left(1 - \frac{s}{n}\right)^{t} \epsilon_D^{(0)} \;\le\; \epsilon_D^{(0)}\,e^{-st/n}.
\]
Therefore, whenever $t \ge \frac{n}{s}\log(\epsilon_D^{(0)}/\epsilon_D) = \left(n + \frac{R^2}{\lambda\gamma}\right)\log(\epsilon_D^{(0)}/\epsilon_D)$, we are guaranteed that $E[\epsilon_D^{(t)}]$ is smaller than $\epsilon_D$. Using (28) again, we can also obtain that
\[
E[P(w^{(t)}) - D(\alpha^{(t)})] \;\le\; \frac{n}{s}\,E[D(\alpha^{(t+1)}) - D(\alpha^{(t)})] = \frac{n}{s}\,E[\epsilon_D^{(t)} - \epsilon_D^{(t+1)}] \;\le\; \frac{n}{s}\,E[\epsilon_D^{(t)}]. \tag{29}
\]
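Two numerical facts carry this part of the proof: the particular choice of $s$ makes the bracketed coefficient in $G^{(t)}$ vanish exactly, and $(1-s/n)^t \le e^{-st/n}$ turns the per-step contraction into the stated iteration count. Both can be checked directly; the constants below are arbitrary stand-ins of our own choosing:

```python
import numpy as np

# Sanity check of the step-size choice in the proof of Theorem 1:
# with s = lam*n*gamma / (R**2 + lam*n*gamma), the coefficient
# R**2 - gamma*(1-s)*lam*n/s vanishes, so G^(t) <= 0 and the dual
# suboptimality contracts by a factor (1 - s/n) per iteration.
n, R, lam, gamma = 1000, 1.0, 1e-4, 1.0
s = lam * n * gamma / (R ** 2 + lam * n * gamma)
coef = R ** 2 - gamma * (1 - s) * lam * n / s
assert abs(coef) < 1e-12

# iterations needed for E[eps_D] <= eps, via (1 - s/n)**t <= exp(-s*t/n):
# t >= (n/s) * log(eps0/eps) = (n + R**2/(lam*gamma)) * log(eps0/eps)
eps0, eps = 1.0, 1e-3
t = int(np.ceil((n / s) * np.log(eps0 / eps)))
assert eps0 * (1 - s / n) ** t <= eps
```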
So, requiring $E[\epsilon_D^{(t)}] \le \frac{s}{n}\epsilon_P$, we obtain an expected duality gap of at most $\epsilon_P$. This means that we should require
\[
t \;\ge\; \left(n + \frac{R^2}{\lambda\gamma}\right)\log\!\left(\left(n + \frac{R^2}{\lambda\gamma}\right)\cdot\frac{\epsilon_D^{(0)}}{\epsilon_P}\right),
\]
which proves the first part of Theorem 1. Next, we sum the first inequality of (29) over $t = T_0 + 1, \ldots, T$ to obtain
\[
E\left[\frac{1}{T - T_0}\sum_{t=T_0+1}^{T}\left(P(w^{(t)}) - D(\alpha^{(t)})\right)\right] \;\le\; \frac{n}{s(T - T_0)}\,E[D(\alpha^{(T+1)}) - D(\alpha^{(T_0+1)})].
\]
Now, if we choose $\bar w, \bar\alpha$ to be either the average vectors or a randomly chosen vector over $t \in \{T_0+1, \ldots, T\}$, then the above implies
\[
E[P(\bar w) - D(\bar\alpha)] \;\le\; \frac{n}{s(T - T_0)}\,E[D(\alpha^{(T+1)}) - D(\alpha^{(T_0+1)})] \;\le\; \frac{n}{s(T - T_0)}\,E[\epsilon_D^{(T_0+1)}] \;\le\; \frac{n}{s(T - T_0)}\,\epsilon_D^{(0)}\,e^{-sT_0/n}.
\]
It follows that in order to obtain a result of $E[P(\bar w) - D(\bar\alpha)] \le \epsilon_P$, we need to have
\[
T_0 \;\ge\; \frac{n}{s}\log\!\left(\frac{n\,\epsilon_D^{(0)}}{s(T - T_0)\,\epsilon_P}\right).
\]
In particular, the choice of $T - T_0 = \frac{n}{s}$ and $T_0 = \frac{n}{s}\log(\epsilon_D^{(0)}/\epsilon_P)$ satisfies the above requirement.

Proof of Theorem 2. Define $t_0 = \left\lceil\frac{n}{s}\log(2\epsilon_D^{(0)}/\epsilon_D)\right\rceil$. The proof of Theorem 1 implies that for every $t$, $E[\epsilon_D^{(t)}] \le \epsilon_D^{(0)} e^{-st/n}$. By Markov's inequality, with probability of at least $1/2$ we have $\epsilon_D^{(t)} \le 2\epsilon_D^{(0)} e^{-st/n}$. Applying this for $t = t_0$, we get that $\epsilon_D^{(t_0)} \le \epsilon_D$ with probability of at least $1/2$. Now, let us apply the same argument again, this time with the initial dual sub-optimality being $\epsilon_D^{(t_0)}$. Since the dual sub-optimality is monotonically non-increasing, we have that $\epsilon_D^{(t_0)} \le \epsilon_D^{(0)}$. Therefore, the same argument tells us that with probability of at least $1/2$ we would have $\epsilon_D^{(2t_0)} \le \epsilon_D$. Repeating this $\lceil\log_2(1/\delta)\rceil$ times, we obtain that with probability of at least $1 - \delta$, for some $k$ we have $\epsilon_D^{(k t_0)} \le \epsilon_D$. Since the dual objective is monotonically non-decreasing, the claim about the dual sub-optimality follows.
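The amplification step at the end of this argument is a standard confidence-boosting device: each block of $t_0$ iterations succeeds with probability at least $1/2$, so $\lceil\log_2(1/\delta)\rceil$ blocks all fail with probability at most $\delta$. A quick check of this bookkeeping (ours):

```python
import math

# ceil(log2(1/delta)) independent half-probability failures compound to
# (1/2)^k <= delta, which is what the repetition argument uses
for delta in (0.1, 0.01, 1e-6):
    k = math.ceil(math.log2(1.0 / delta))
    assert 0.5 ** k <= delta
```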
Next, for the duality gap, using (27) we have that for every $t$ such that $\epsilon_D^{(t-1)} \le \epsilon_D$,
\[
P(w^{(t-1)}) - D(\alpha^{(t-1)}) \;\le\; \frac{n}{s}\,E_t[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \;\le\; \frac{n}{s}\,\epsilon_D .
\]
This proves the second claim of Theorem 2. For the last claim, suppose that at round $T_0$ we have $\epsilon_D^{(T_0)} \le \epsilon_D$. Let $T = T_0 + n/s$. It follows that if we choose $t$ uniformly at random from $\{T_0, \ldots, T-1\}$, then $E[P(w^{(t)}) - D(\alpha^{(t)})] \le \epsilon_D$. By Markov's inequality, with probability of at least $1/2$ we have $P(w^{(t)}) - D(\alpha^{(t)}) \le 2\epsilon_D$. Therefore, if we choose $\log_2(2/\delta)$ such random $t$, then with probability of at least $1 - \delta/2$, at least one of them will satisfy $P(w^{(t)}) - D(\alpha^{(t)}) \le 2\epsilon_D$. Combining this with the first claim of the theorem, choosing $\epsilon_D = \epsilon_P/2$, and applying the union bound, we conclude the proof of the last claim of Theorem 2.

References

[1] Michel Baes. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2009.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin markov networks. Journal of Machine Learning Research, 9:1775–1822, 2008.
[4] Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint, 2011.
[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[6] Alexandre d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
[7] Olivier Devolder, François Glineur, and Yu. Nesterov.
First-order methods of smooth convex optimization with inexact oracle. Technical Report 2011/2, CORE, 2011. Available: http://www.uclouvain.be/cps/ucl/doc/core/documents/coredp2011_2web.pdf.
[8] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research, 10:2899–2934, 2009.
[9] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.
[10] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 14–26, 2010.
[11] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[12] Chonghai Hu, Weike Pan, and James T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
[13] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint, 2012.
[14] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. In NIPS, pages 905–912, 2009.
[15] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258, 2012.
[16] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[17] Yurii Nesterov. Smooth minimization of non-smooth functions.
Mathematical Programming, 103(1):127–152, 2005.
[18] Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.
[19] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012.
[20] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. Technical Report arXiv:1109.2415, arXiv, 2011.
[21] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. The Journal of Machine Learning Research, 12:1865–1892, 2011.
[22] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In ICML, pages 807–814, 2007.
[23] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. In ICML, page 117, 2009.
[24] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv preprint, 2012.
[25] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, Feb 2013.
[26] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
[27] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[28] Tong Zhang. On the dual formulation of regularized linear systems. Machine Learning, 46:91–129, 2002.