Distributed Convex Optimization with Many Convex Constraints


Authors: Joachim Giesen, Sören Laue

Abstract

We address the problem of solving convex optimization problems with many non-linear constraints in a distributed setting. Our approach is based on an extension of the alternating direction method of multipliers (ADMM). Although it was invented decades ago, ADMM so far can be applied only to unconstrained problems and problems with linear equality or inequality constraints. Our extension can directly handle arbitrary inequality constraints. It combines the ability of ADMM to solve convex optimization problems in a distributed setting with the ability of the Augmented Lagrangian method to solve constrained optimization problems, and, as we show, it inherits the convergence guarantees of both ADMM and the Augmented Lagrangian method.

1 Introduction

The increasing availability of distributed hardware suggests addressing large-scale optimization problems with distributed algorithms. Large-scale optimization problems involve a large number of optimization variables, a large number of input parameters, or a large number of constraints. Here we address the latter case of a large number of constraints. In recent years, the alternating direction method of multipliers (ADMM), proposed decades ago by Glowinski and Marroco [9] and by Gabay and Mercier [7], has received considerable attention, also beyond the machine learning community, because it allows convex optimization problems that involve a large number of parameters to be solved in a distributed setting [2]. For instance, the parameters in ordinary least squares regression are just the data points. The optimization problem behind a typical machine learning method usually aims at minimizing a loss function that is the sum of the losses for the individual data points.
Hence, the objective function $f$ of such problems is separable, i.e., it holds that $f(x) = \sum_i f_i(x_i)$, where $f_i$ is determined by the $i$-th data point. In this case ADMM lends itself to a distributed implementation where the data points are distributed over different compute nodes. Standard ADMM works for unconstrained optimization problems and for optimization problems with linear equality and/or inequality constraints. Surprisingly, so far no general convex inequality constraints have been considered directly in the context of ADMM. Optimization problems with a large number of constraints typically also arise as big data problems, but instead of contributing a term to the objective function, each data point now contributes a constraint to the problem. An illustrative example is the core vector machine [25, 26], where the smallest enclosing ball for a given set of data points has to be computed. The objective function here is the radius of the ball, which needs to be minimized, and every data point contributes a non-linear constraint, namely that the distance of the point from the center must be at most the radius. Another example are robust SVMs, which we discuss in more detail later in the paper.

(*) Friedrich-Schiller-University Jena, Germany, {joachim.giesen, soeren.laue}@uni-jena.de

In principle, standard ADMM can also be used for solving constrained optimization problems. A distributed implementation of the straightforward extension of ADMM leads to non-trivial constrained optimization subproblems that have to be solved in every iteration. Solving a constrained problem is typically reduced to solving a sequence of unconstrained problems. Hence, this approach features three nested loops: the outer loop for reaching consensus, one loop for the constraints, and an inner loop for solving unconstrained problems.
Alternatively, one could use the standard Augmented Lagrangian method, originally known as the method of multipliers [12], which has been specifically designed for solving constrained optimization problems. Combining the Augmented Lagrangian method with ADMM allows general constrained problems to be solved in a distributed fashion by running the Augmented Lagrangian method in an outer loop and ADMM in an inner loop. Again, we end up with three nested loops: the outer loop for the Augmented Lagrangian method and the standard two nested inner loops for ADMM. Thus, one could assume that any distributed solver for constrained optimization problems needs at least three nested loops: one for reaching consensus, one for the constraints, and one for the unconstrained problems. The key contribution of our paper is showing that this is not the case. One of the nested loops can be avoided by merging the loops for reaching consensus and for dealing with the constraints. Our approach, which needs only two nested loops, combines ADMM with the Augmented Lagrangian method differently than the direct approach of running the Augmented Lagrangian method in the outer and ADMM in the inner loop. But the latter combination still provides us with a good baseline to compare against.

Related work. To the best of our knowledge, our extension of ADMM is the first distributed algorithm for solving general convex optimization problems with no restrictions on the type of constraints or assumptions on the structure of the problem. Surprisingly, even our baseline method of running the Augmented Lagrangian method in an outer loop and ADMM in an inner loop has not been studied before. The only special case that we are aware of are quadratically constrained quadratic problems, which have been addressed by Huang and Sidiropoulos [14] using consensus ADMM. However, their approach does not scale to many constraints, because every constraint gives rise to a new subproblem.
Mosk-Aoyama et al. [18] have designed and analyzed a distributed algorithm for solving convex optimization problems with separable objective function and linear equality constraints. Their algorithm blends a gossip-based information spreading, iterative gradient ascent method with the barrier method from interior-point algorithms. It is similar to ADMM and can also handle only linear constraints. Zhu and Martínez [29] have introduced a distributed multi-agent algorithm for minimizing a convex function that is the sum of local functions subject to a global equality or inequality constraint. Their algorithm involves projections onto local constraint sets that are usually as hard to compute as solving the original problem with general constraints. For instance, it is well known via standard duality theory that the feasibility problem for linear programs is as hard as solving linear programs. This holds true for general convex optimization problems with vanishing duality gap. In principle, standard ADMM can also handle convex constraints by transforming them into indicator functions that are added to the objective function. However, this leads to subproblems, to be solved in each iteration, that entail computing a projection onto the feasible region. This raises the same issues as the method by Zhu and Martínez [29], since computing these projections can be as hard as solving the original problem. We will elaborate on this in more detail in Section 3. The recent literature on ADMM is vast.
Most papers on ADMM stay in the standard framework of optimizing a function or a sum of functions subject to linear constraints and make contributions to one or more of the following aspects: (1) theoretical (and practical) convergence guarantees [6, 10, 11, 13, 19], (2) convergence guarantees for asynchronous ADMM [27], (3) splitting the problem into more than two subproblems [4, 15], (4) optimal penalty parameter selection [8], (5) solving the individual subproblems efficiently or inexactly while still guaranteeing convergence [3, 5, 16, 21], and (6) applications of ADMM.

2 Alternating direction method of multipliers

Here, we briefly review the alternating direction method of multipliers (ADMM) and discuss how it can be adapted to deal with distributed data. ADMM is an iterative algorithm that in its most general form can solve convex optimization problems of the form

$$\min_{x,z} \; f_1(x) + f_2(z) \quad \text{s.t.} \quad Ax + Bz - c = 0, \qquad (1)$$

where $f_1 \colon \mathbb{R}^{n_1} \to \mathbb{R} \cup \{\infty\}$ and $f_2 \colon \mathbb{R}^{n_2} \to \mathbb{R} \cup \{\infty\}$ are convex functions, $A \in \mathbb{R}^{m \times n_1}$ and $B \in \mathbb{R}^{m \times n_2}$ are matrices, and $c \in \mathbb{R}^m$. ADMM can obviously deal with linear equality constraints, but it can also handle linear inequality constraints. The latter are reduced to linear equality constraints by replacing constraints of the form $Ax \le b$ by $Ax + s = b$, adding the slack variable $s$ to the set of optimization variables, and setting $f_2(s) = \mathbb{1}_{\mathbb{R}^m_+}(s)$, where

$$\mathbb{1}_{\mathbb{R}^m_+}(s) = \begin{cases} 0, & \text{if } s \ge 0, \\ \infty, & \text{otherwise}, \end{cases}$$

is the indicator function of the set $\mathbb{R}^m_+ = \{x \in \mathbb{R}^m \mid x \ge 0\}$. Note that $f_1$ and $f_2$ are allowed to take the value $\infty$. Recently, ADMM regained a lot of attention because it allows problems with separable objective function to be solved in a distributed setting. Such problems are typically given as $\min_x \sum_i f_i(x)$, where $f_i$ corresponds to the $i$-th data point (or, more generally, the $i$-th data block) and $x$ is a weight vector that describes the data model.
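As a concrete illustration of this distributed mechanism, the following minimal sketch (the toy objective, parameter choices, and all names are ours, not from the paper) runs scaled consensus ADMM on a separable problem $\min_x \sum_i \frac{1}{2}(x - c_i)^2$, whose solution is the mean of the $c_i$; it anticipates the consensus reformulation with local copies $x_i$ and a consensus variable $z$ described next:

```python
import numpy as np

def consensus_admm(c, rho=1.0, iters=100):
    """Scaled consensus ADMM for min_x sum_i 0.5*(x - c_i)^2, split as
    min sum_i 0.5*(x_i - c_i)^2  s.t.  x_i = z for all i."""
    x = np.zeros_like(c)
    u = np.zeros_like(c)   # scaled dual variables u_i = lambda_i / rho
    z = 0.0
    for _ in range(iters):
        # local updates (one per node): argmin 0.5*(x_i - c_i)^2 + (rho/2)*(x_i - z + u_i)^2
        x = (c + rho * (z - u)) / (1.0 + rho)
        # central consensus update
        z = np.mean(x + u)
        # local dual updates
        u = u + x - z
    return z

z = consensus_admm(np.array([1.0, 2.0, 6.0]))
print(round(z, 4))  # prints 3.0, the mean of c
```

The $x$-updates touch only local data, so on real hardware each of them could run on its own compute node, with only $z$ aggregated centrally.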
This problem can be transformed into an equivalent optimization problem with individual weight vectors $x_i$ for each data point (data block) that are coupled through an equality constraint,

$$\min_{x_i, z} \; \sum_i f_i(x_i) \quad \text{s.t.} \quad x_i - z = 0 \;\; \forall i,$$

which is a special case of Problem 1 that can be solved by ADMM in a distributed setting by distributing the data. Adding convex inequality constraints to Problem 1 does not destroy convexity of the problem, but so far ADMM cannot deal with such constraints. Note that the problem only remains convex if all equality constraints are induced by affine functions. That is, we cannot in general add convex equality constraints without destroying convexity. Our goal for the following sections is to extend ADMM such that it can also deal with non-linear, convex inequality constraints. For problems with many constraints we will show that these constraints can be distributed in the same way as the data points are distributed for standard ADMM on problems with separable objective function.

3 Problems with non-linear constraints

We consider convex optimization problems of the form

$$\min_{x,z} \; f_1(x) + f_2(z) \quad \text{s.t.} \quad g_0(x) \le 0, \quad h_1(x) + h_2(z) = 0, \qquad (2)$$

where $f_1$ and $f_2$ are as in Problem 1, $g_0 \colon \mathbb{R}^{n_1} \to \mathbb{R}^p$ is convex in every component, and $h_1 \colon \mathbb{R}^{n_1} \to \mathbb{R}^m$ and $h_2 \colon \mathbb{R}^{n_2} \to \mathbb{R}^m$ are affine functions. In the following we assume that the problem is feasible, i.e., that a feasible solution exists, and that strong duality holds. A sufficient condition for strong duality is that the interior of the feasible region is non-empty. This condition is known as Slater's condition for convex optimization problems [24]. As stated before, our goal is to extend ADMM such that it can also solve Problem 2. The simple trick of adding non-negative slack variables only works if the constraints $g_0(x) \le 0$ are affine. Still, this trick gives some insight into the general problem.
We have dealt with the non-negativity constraints on the slack variables by adding an indicator function to the objective function. The indicator function forces ADMM to project the solution in every iteration onto the set $\{s \in \mathbb{R}^m \mid s \ge 0\}$, which is just the non-negative orthant. Projecting onto the non-negative orthant is an easy problem, and thus ADMM can efficiently deal with linear inequality constraints. As we have already mentioned in the introduction, the idea of transforming the constraints into indicator functions and adding them to the objective function can be generalized to non-linear constraints. However, ADMM then needs to compute in every iteration a projection onto the more complicated feasible set $\{x \mid g_0(x) \le 0\}$. Such a projection is the solution to the following constrained optimization problem

$$\min_x \; \|x\|^2 \quad \text{s.t.} \quad g_0(x) \le 0,$$

whose solution, depending on the constraints, requires a QP, SOCP, or even SDP solver. Thus we have only deferred the difficulties induced by the non-linear constraints to the subproblem of computing the projections. Here, we will devise a method for dealing with arbitrary constraints directly, without any hard-to-compute projections.

4 ADMM extension

For our extension of ADMM and its convergence analysis we need to work with an equivalent reformulation of Problem 2, where we replace $g_0(x)$ by $g(x) = \max\{0, g_0(x)\}^2$, with componentwise maximum, and turn the convex inequality constraints into convex equality constraints. Thus, in the following we consider optimization problems of the form

$$\min_{x,z} \; f_1(x) + f_2(z) \quad \text{s.t.} \quad g(x) = 0, \quad h_1(x) + h_2(z) = 0, \qquad (3)$$

where $g(x) = \max\{0, g_0(x)\}^2$, which by construction is again convex in every component and differentiable if $g_0(x)$ is differentiable. Note, though, that the constraint $g(x) = 0$ is no longer affine.
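The squared-hinge reformulation $g(x) = \max\{0, g_0(x)\}^2$ is easy to sanity-check numerically. The sketch below (the unit-ball constraint and all names are our illustration, not from the paper) shows that $g$ vanishes exactly on the feasible set and that its gradient vanishes at the constraint boundary, which is what makes the reformulated equality constraint differentiable:

```python
import numpy as np

def g0(x):
    # illustrative convex constraint: unit ball, g0(x) = ||x||^2 - 1 <= 0
    return float(np.dot(x, x)) - 1.0

def g(x):
    # squared-hinge transform from the text: g(x) = max(0, g0(x))^2
    return max(0.0, g0(x)) ** 2

def grad_g(x):
    # chain rule: 2*max(0, g0(x)) * grad g0(x); vanishes on the feasible set
    return 2.0 * max(0.0, g0(x)) * (2.0 * x)

x_feas = np.array([0.5, 0.5])   # strictly feasible: g0 = -0.5
x_infe = np.array([1.5, 0.0])   # infeasible: g0 = 1.25
print(g(x_feas), g(x_infe))     # prints 0.0 1.5625
```

In particular `grad_g` is continuous across the boundary `g0(x) = 0`, unlike the gradient of the plain hinge $\max\{0, g_0(x)\}$.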
However, we show in the following that Problem 3 can still be solved efficiently. Analogously to ADMM, our extension builds on the Augmented Lagrangian for Problem 3, which is the function

$$L_\rho(x, z, \mu, \lambda) = f_1(x) + f_2(z) + \frac{\rho}{2}\|g(x)\|^2 + \mu^\top g(x) + \frac{\rho}{2}\|h_1(x) + h_2(z)\|^2 + \lambda^\top\big(h_1(x) + h_2(z)\big),$$

where $\mu \in \mathbb{R}^p$ and $\lambda \in \mathbb{R}^m$ are Lagrange multipliers, $\rho > 0$ is some constant, and $\|\cdot\|$ denotes the Euclidean norm. The Lagrange multipliers are also referred to as dual variables. Algorithm 1 is our extension of ADMM for solving instances of Problem 3. It runs in iterations. In the $(k+1)$-th iteration the primal variables $x^k$ and $z^k$ as well as the dual variables $\mu^k$ and $\lambda^k$ are updated.

Algorithm 1 ADMM for problems with non-linear constraints
1: input: instance of Problem 3
2: output: approximate solution $x \in \mathbb{R}^{n_1}$, $z \in \mathbb{R}^{n_2}$, $\mu \in \mathbb{R}^p$, $\lambda \in \mathbb{R}^m$
3: initialize $x^0 = 0$, $z^0 = 0$, $\mu^0 = 0$, $\lambda^0 = 0$, and $\rho$ to some constant $> 0$
4: repeat
5:   $x^{k+1} := \operatorname{argmin}_x L_\rho(x, z^k, \mu^k, \lambda^k)$
6:   $z^{k+1} := \operatorname{argmin}_z L_\rho(x^{k+1}, z, \mu^k, \lambda^k)$
7:   $\mu^{k+1} := \mu^k + \rho\, g(x^{k+1})$
8:   $\lambda^{k+1} := \lambda^k + \rho\,\big(h_1(x^{k+1}) + h_2(z^{k+1})\big)$
9: until convergence
10: return $x^k$, $z^k$, $\mu^k$, $\lambda^k$

5 Convergence analysis

From duality theory we know that for all $x \in \mathbb{R}^{n_1}$ and $z \in \mathbb{R}^{n_2}$

$$L_0(x^*, z^*, \mu^*, \lambda^*) \le L_0(x, z, \mu^*, \lambda^*), \qquad (4)$$

where $L_0$ is the Lagrangian of Problem 3 and $x^*$, $z^*$, $\mu^*$, and $\lambda^*$ are optimal primal and dual variables. Note that $x^*$, $z^*$, $\mu^*$, and $\lambda^*$ are not necessarily unique; here, they refer to just one optimal solution. Also note that the Lagrangian is identical to the Augmented Lagrangian with $\rho = 0$. Given that strong duality holds, the optimal solution to the original Problem 3 is identical to the optimal solution of the Lagrangian dual. We need a few more definitions.
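Before turning to these definitions, Algorithm 1 can be exercised on a tiny instance of Problem 3. The sketch below is our illustration, not the paper's implementation: $f_1(x) = x^2$, $f_2(z) = z^2$, one inequality constraint $1 - x \le 0$ (so $g(x) = \max\{0, 1-x\}^2$), and the coupling $h_1(x) + h_2(z) = x - z = 0$, whose optimum is $x = z = 1$. The inner solver and the value of $\rho$ are our choices; convergence in the constraint is slow here because the gradient of $g$ vanishes at the boundary:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of Problem 3 (our illustration): optimum x = z = 1, objective 2.
rho = 50.0

def g(x):
    return max(0.0, 1.0 - x) ** 2

def L_x(xa, z, mu, lam):
    # augmented Lagrangian as a function of x (the f2(z) term is constant here)
    x = xa[0]
    return (x * x + 0.5 * rho * g(x) ** 2 + mu * g(x)
            + 0.5 * rho * (x - z) ** 2 + lam * (x - z))

x, z, mu, lam = 0.0, 0.0, 0.0, 0.0
for _ in range(3000):
    # Line 5: x-update via an unconstrained inner solver
    x = minimize(L_x, np.array([x]), args=(z, mu, lam), method="L-BFGS-B").x[0]
    # Line 6: z-update has a closed form here: 2z - rho*(x - z) - lam = 0
    z = (rho * x + lam) / (2.0 + rho)
    # Lines 7 and 8: dual updates
    mu += rho * g(x)
    lam += rho * (x - z)

print(x, z)  # both approach 1
```

The iterates behave as Theorem 1 below predicts: the residuals $g(x^k)$ and $x^k - z^k$ tend to zero and the objective tends to its optimal value.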
Let $f^k = f_1(x^k) + f_2(z^k)$ be the objective function value at the $k$-th iterate $(x^k, z^k)$ and let $f^*$ be the optimal objective function value. Let $r_g^k = g(x^k)$ be the residual of the non-linear equality constraints, i.e., the constraints originating from the convex inequality constraints, and let $r_h^k = h_1(x^k) + h_2(z^k)$ be the residual of the linear equality constraints in iteration $k$. Our goal in this section is to prove the following theorem.

Theorem 1. When Algorithm 1 is applied to an instance of Problem 3, then $\lim_{k\to\infty} r_g^k = 0$, $\lim_{k\to\infty} r_h^k = 0$, and $\lim_{k\to\infty} f^k = f^*$.

The theorem states primal feasibility and convergence of the primal objective function value. Note, however, that convergence to primal optimal points $x^*$ and $z^*$ cannot be guaranteed. This is the case for the original ADMM as well. Additional assumptions on the problem, for instance a unique optimum, are necessary to guarantee convergence to the primal optimal points. However, the points $x^k, z^k$ will be primal optimal and feasible up to an arbitrarily small error for sufficiently large $k$. The proof of Theorem 1 follows along the lines of the convergence proof for the original ADMM in [2] and is subdivided into four lemmas.

Lemma 1. The dual variables $\mu^k$ are non-negative for all iterations, i.e., it holds that $\mu^k \ge 0$ for all $k \in \mathbb{N}$.

Proof. The proof is by induction. In Line 3 of Algorithm 1 the dual variable is initialized as $\mu^0 = 0$. If $\mu^k \ge 0$, then it follows from the update rule in Line 7 of Algorithm 1 that $\mu^{k+1} = \mu^k + \rho g(x^{k+1}) \ge 0$, since $g(x) = \max\{0, g_0(x)\}^2 \ge 0$ and by assumption also $\rho > 0$.

Lemma 2. The difference between the optimal objective function value and its value at the $(k+1)$-th iterate can be bounded as

$$f^* - f^{k+1} \le (\mu^*)^\top r_g^{k+1} + (\lambda^*)^\top r_h^{k+1}.$$

Proof.
It follows from the definitions, vanishing constraints in an optimum, and Inequality 4 that

$$f^* = f_1(x^*) + f_2(z^*) = L_0(x^*, z^*, \mu^*, \lambda^*) \le L_0(x^{k+1}, z^{k+1}, \mu^*, \lambda^*) = f_1(x^{k+1}) + f_2(z^{k+1}) + (\mu^*)^\top r_g^{k+1} + (\lambda^*)^\top r_h^{k+1} = f^{k+1} + (\mu^*)^\top r_g^{k+1} + (\lambda^*)^\top r_h^{k+1}.$$

Lemma 3. The difference between the value of the objective function at the $(k+1)$-th iterate and its optimal value can be bounded as follows:

$$f^{k+1} - f^* \le -(\mu^{k+1})^\top r_g^{k+1} - (\lambda^{k+1})^\top r_h^{k+1} - \rho\,\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big({-r_h^{k+1}} + h_2(z^{k+1}) - h_2(z^*)\big).$$

Proof. From Line 5 of Algorithm 1 we know that $x^{k+1}$ minimizes the function $L_\rho(x, z^k, \mu^k, \lambda^k)$ with respect to $x$. Hence, $0$ must be contained in the subdifferential of $L_\rho(x, z^k, \mu^k, \lambda^k)$ with respect to $x$ at $x^{k+1}$, i.e.,

$$0 \in \partial f_1(x^{k+1}) + \rho\,\partial g(x^{k+1}) \cdot g(x^{k+1}) + \partial g(x^{k+1}) \cdot \mu^k + \rho\,\partial h_1(x^{k+1}) \cdot \big(h_1(x^{k+1}) + h_2(z^k)\big) + \partial h_1(x^{k+1}) \cdot \lambda^k,$$

where $\partial f_1(x^{k+1}) \subseteq \mathbb{R}^{n_1}$ is the subdifferential of $f_1$ at $x^{k+1}$, $\partial g(x^{k+1}) \subseteq \mathbb{R}^{n_1 \times p}$ is the subdifferential of $g$ at $x^{k+1}$, and $\partial h_1(x^{k+1}) \subseteq \mathbb{R}^{n_1 \times m}$ is the subdifferential of $h_1$ at $x^{k+1}$. The update rule for the dual variables $\mu$ in Line 7 of Algorithm 1 gives $\mu^k = \mu^{k+1} - \rho g(x^{k+1})$, and similarly the update rule for the dual variables $\lambda$ in Line 8 gives $\lambda^k = \lambda^{k+1} - \rho\big(h_1(x^{k+1}) + h_2(z^{k+1})\big)$. Plugging these update rules into the subdifferential optimality condition from above gives

$$0 \in \partial f_1(x^{k+1}) + \rho\,\partial g(x^{k+1}) \cdot g(x^{k+1}) + \partial g(x^{k+1}) \cdot \big(\mu^{k+1} - \rho g(x^{k+1})\big) + \rho\,\partial h_1(x^{k+1}) \cdot \big(h_1(x^{k+1}) + h_2(z^k)\big) + \partial h_1(x^{k+1}) \cdot \big(\lambda^{k+1} - \rho h_1(x^{k+1}) - \rho h_2(z^{k+1})\big),$$

and thus

$$0 \in \partial f_1(x^{k+1}) + \partial g(x^{k+1}) \cdot \mu^{k+1} + \partial h_1(x^{k+1}) \cdot \big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big).$$
If $0$ is contained in the subdifferential of a convex function at a point $x$, then $x$ is a minimizer of this function. That is, $x^{k+1}$ minimizes the convex function

$$x \mapsto f_1(x) + (g(x))^\top \mu^{k+1} + (h_1(x))^\top\big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big). \qquad (5)$$

This function is convex because $f_1$ and $g$ are convex functions, $h_1$ is an affine function, and any non-negative combination of convex functions is again a convex function. Note that we have $\mu^{k+1} \ge 0$ by Lemma 1. Similarly, Line 6 of Algorithm 1 implies that $0$ is contained in the subdifferential of $L_\rho(x^{k+1}, z, \mu^k, \lambda^k)$ with respect to $z$ at $z^{k+1}$, i.e.,

$$0 \in \partial f_2(z^{k+1}) + \rho\,\partial h_2(z^{k+1}) \cdot \big(h_1(x^{k+1}) + h_2(z^{k+1})\big) + \partial h_2(z^{k+1}) \cdot \lambda^k.$$

Again, substituting $\lambda^k = \lambda^{k+1} - \rho\big(h_1(x^{k+1}) + h_2(z^{k+1})\big)$ we get

$$0 \in \partial f_2(z^{k+1}) + \partial h_2(z^{k+1}) \cdot \lambda^{k+1}.$$

Hence, $z^{k+1}$ minimizes the convex function

$$z \mapsto f_2(z) + (\lambda^{k+1})^\top h_2(z). \qquad (6)$$

The function is convex since $f_2$ is convex and $h_2$ is affine. Since $x^{k+1}$ is a minimizer of Function 5 we have

$$f_1(x^{k+1}) + (g(x^{k+1}))^\top \mu^{k+1} + (h_1(x^{k+1}))^\top\big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big) \le f_1(x^*) + (g(x^*))^\top \mu^{k+1} + (h_1(x^*))^\top\big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big).$$

Analogously, since $z^{k+1}$ minimizes Function 6 we have

$$f_2(z^{k+1}) + (\lambda^{k+1})^\top h_2(z^{k+1}) \le f_2(z^*) + (\lambda^{k+1})^\top h_2(z^*).$$
Finally, summing up both inequalities and rearranging gives

$$\begin{aligned}
f^{k+1} - f^* &= f_1(x^{k+1}) + f_2(z^{k+1}) - f_1(x^*) - f_2(z^*) \\
&\le (g(x^*))^\top \mu^{k+1} + (h_1(x^*))^\top\big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big) \\
&\quad - (g(x^{k+1}))^\top \mu^{k+1} - (h_1(x^{k+1}))^\top\big(\lambda^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)\big) \\
&\quad + (\lambda^{k+1})^\top h_2(z^*) - (\lambda^{k+1})^\top h_2(z^{k+1}) \\
&= -(g(x^{k+1}))^\top \mu^{k+1} + (\lambda^{k+1})^\top\big(h_1(x^*) + h_2(z^*)\big) - (\lambda^{k+1})^\top\big(h_1(x^{k+1}) + h_2(z^{k+1})\big) \\
&\quad - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_1(x^*) - h_1(x^{k+1})\big) \\
&= -(\mu^{k+1})^\top r_g^{k+1} - (\lambda^{k+1})^\top r_h^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_1(x^*) - h_1(x^{k+1})\big) \\
&= -(\mu^{k+1})^\top r_g^{k+1} - (\lambda^{k+1})^\top r_h^{k+1} \\
&\quad - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_1(x^*) + h_2(z^*) - h_2(z^*) - h_1(x^{k+1}) - h_2(z^{k+1}) + h_2(z^{k+1})\big) \\
&= -(\mu^{k+1})^\top r_g^{k+1} - (\lambda^{k+1})^\top r_h^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big({-h_2(z^*)} - r_h^{k+1} + h_2(z^{k+1})\big),
\end{aligned}$$

where we have used that $g(x^*) = 0$, $h_1(x^*) + h_2(z^*) = 0$, $r_g^{k+1} = g(x^{k+1})$ is the residual of the convex constraints, and $r_h^{k+1} = h_1(x^{k+1}) + h_2(z^{k+1})$ is the residual of the affine constraints in iteration $k+1$. To continue, we need one more definition.

Definition 1. Let

$$V^k = \frac{1}{\rho}\|\mu^k - \mu^*\|^2 + \frac{1}{\rho}\|\lambda^k - \lambda^*\|^2 + \rho\,\|h_2(z^k) - h_2(z^*)\|^2.$$

For this newly defined quantity we show in the following lemma that it is non-increasing over the iterations. This property will be crucial for the proof of Theorem 1.

Lemma 4. For every iteration $k \in \mathbb{N}$ it holds that

$$\rho\,\|r_g^{k+1}\|^2 + \rho\,\|r_h^{k+1}\|^2 + \rho\,\|h_2(z^{k+1}) - h_2(z^k)\|^2 \le V^k - V^{k+1}.$$

Proof.
Summing up the inequality in Lemma 2 and the inequality in Lemma 3 gives

$$0 \le (\mu^*)^\top r_g^{k+1} - (\mu^{k+1})^\top r_g^{k+1} + (\lambda^*)^\top r_h^{k+1} - (\lambda^{k+1})^\top r_h^{k+1} - \rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big({-r_h^{k+1}} + h_2(z^{k+1}) - h_2(z^*)\big),$$

or equivalently, by rearranging and multiplying by $2$,

$$0 \ge 2(\mu^{k+1} - \mu^*)^\top r_g^{k+1} + 2(\lambda^{k+1} - \lambda^*)^\top r_h^{k+1} + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big({-r_h^{k+1}} + h_2(z^{k+1}) - h_2(z^*)\big). \qquad (7)$$

Next we rewrite the three terms in this inequality individually. Using the update rule $\mu^{k+1} = \mu^k + \rho g(x^{k+1}) = \mu^k + \rho r_g^{k+1}$ for the Lagrange multipliers $\mu$ in Line 7 of Algorithm 1 several times, we can rewrite the first term as follows:

$$\begin{aligned}
2(\mu^{k+1} - \mu^*)^\top r_g^{k+1} &= 2(\mu^k + \rho r_g^{k+1} - \mu^*)^\top r_g^{k+1} \\
&= 2(\mu^k - \mu^*)^\top r_g^{k+1} + \rho\|r_g^{k+1}\|^2 + \rho\|r_g^{k+1}\|^2 \\
&= \frac{2(\mu^k - \mu^*)^\top(\mu^{k+1} - \mu^k)}{\rho} + \frac{\|\mu^{k+1} - \mu^k\|^2}{\rho} + \rho\|r_g^{k+1}\|^2 \\
&= \frac{2(\mu^k - \mu^*)^\top\big(\mu^{k+1} - \mu^* - (\mu^k - \mu^*)\big)}{\rho} + \frac{\|\mu^{k+1} - \mu^* - (\mu^k - \mu^*)\|^2}{\rho} + \rho\|r_g^{k+1}\|^2 \\
&= \frac{2(\mu^k - \mu^*)^\top(\mu^{k+1} - \mu^*) - 2\|\mu^k - \mu^*\|^2}{\rho} + \frac{\|\mu^{k+1} - \mu^*\|^2 + \|\mu^k - \mu^*\|^2 - 2(\mu^{k+1} - \mu^*)^\top(\mu^k - \mu^*)}{\rho} + \rho\|r_g^{k+1}\|^2 \\
&= \frac{\|\mu^{k+1} - \mu^*\|^2 - \|\mu^k - \mu^*\|^2}{\rho} + \rho\|r_g^{k+1}\|^2.
\end{aligned}$$

The analogous argument holds for the second term, using the update rule $\lambda^{k+1} = \lambda^k + \rho r_h^{k+1}$ from Line 8 of Algorithm 1, i.e., we have

$$2(\lambda^{k+1} - \lambda^*)^\top r_h^{k+1} = \frac{\|\lambda^{k+1} - \lambda^*\|^2 - \|\lambda^k - \lambda^*\|^2}{\rho} + \rho\|r_h^{k+1}\|^2.$$
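As a quick numerical sanity check (our addition, outside the proof), the rewriting of the first term is an algebraic identity that holds for arbitrary vectors, since it only uses the update rule $\mu^{k+1} = \mu^k + \rho r_g^{k+1}$:

```python
import numpy as np

# For any mu, mu_star, r and rho > 0, the update mu_next = mu + rho*r satisfies
#   2*(mu_next - mu_star)^T r
#     = (||mu_next - mu_star||^2 - ||mu - mu_star||^2) / rho + rho*||r||^2.
rng = np.random.default_rng(1)
rho = 0.7
mu, mu_star, r = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
mu_next = mu + rho * r

lhs = 2.0 * (mu_next - mu_star) @ r
rhs = ((mu_next - mu_star) @ (mu_next - mu_star)
       - (mu - mu_star) @ (mu - mu_star)) / rho + rho * (r @ r)
print(abs(lhs - rhs) < 1e-10)  # prints True
```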
Adding $\rho\|r_h^{k+1}\|^2$ to the third term of Inequality 7 gives

$$\begin{aligned}
&\rho\|r_h^{k+1}\|^2 + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big({-r_h^{k+1}} + h_2(z^{k+1}) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1}\|^2 - 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top r_h^{k+1} + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_2(z^{k+1}) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1}\|^2 - 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top r_h^{k+1} + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(\big(h_2(z^{k+1}) - h_2(z^k)\big) + \big(h_2(z^k) - h_2(z^*)\big)\big) \\
&= \rho\|r_h^{k+1}\|^2 - 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top r_h^{k+1} + 2\rho\|h_2(z^{k+1}) - h_2(z^k)\|^2 + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^k)\|^2 + 2\rho\big(h_2(z^{k+1}) - h_2(z^k)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 + \rho\|\big(h_2(z^{k+1}) - h_2(z^*)\big) - \big(h_2(z^k) - h_2(z^*)\big)\|^2 \\
&\quad + 2\rho\big(\big(h_2(z^{k+1}) - h_2(z^*)\big) - \big(h_2(z^k) - h_2(z^*)\big)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^*)\|^2 + \rho\|h_2(z^k) - h_2(z^*)\|^2 - 2\rho\big(h_2(z^{k+1}) - h_2(z^*)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) \\
&\quad + 2\rho\big(h_2(z^{k+1}) - h_2(z^*)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) - 2\rho\big(h_2(z^k) - h_2(z^*)\big)^\top\big(h_2(z^k) - h_2(z^*)\big) \\
&= \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^*)\|^2 - \rho\|h_2(z^k) - h_2(z^*)\|^2.
\end{aligned}$$

Hence, Inequality 7 is equivalent to

$$0 \ge \frac{1}{\rho}\big(\|\mu^{k+1} - \mu^*\|^2 - \|\mu^k - \mu^*\|^2\big) + \rho\|r_g^{k+1}\|^2 + \frac{1}{\rho}\big(\|\lambda^{k+1} - \lambda^*\|^2 - \|\lambda^k - \lambda^*\|^2\big) + \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^*)\|^2 - \rho\|h_2(z^k) - h_2(z^*)\|^2.$$
By rearranging the terms in this inequality and using the expansion

$$\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 = \|r_h^{k+1}\|^2 + \|h_2(z^{k+1}) - h_2(z^k)\|^2 - 2(r_h^{k+1})^\top\big(h_2(z^{k+1}) - h_2(z^k)\big),$$

we get

$$\begin{aligned}
&\rho\|r_g^{k+1}\|^2 + \rho\|r_h^{k+1}\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^k)\|^2 - 2\rho(r_h^{k+1})^\top\big(h_2(z^{k+1}) - h_2(z^k)\big) \\
&= \rho\|r_g^{k+1}\|^2 + \rho\|r_h^{k+1} - \big(h_2(z^{k+1}) - h_2(z^k)\big)\|^2 \\
&\le \frac{1}{\rho}\|\mu^k - \mu^*\|^2 + \frac{1}{\rho}\|\lambda^k - \lambda^*\|^2 + \rho\|h_2(z^k) - h_2(z^*)\|^2 - \Big(\frac{1}{\rho}\|\mu^{k+1} - \mu^*\|^2 + \frac{1}{\rho}\|\lambda^{k+1} - \lambda^*\|^2 + \rho\|h_2(z^{k+1}) - h_2(z^*)\|^2\Big) \\
&= V^k - V^{k+1},
\end{aligned}$$

where we have used Definition 1 of $V^k$ in the last equality. Hence, to finish the proof of Lemma 4 it only remains to show that

$$2\rho\,(r_h^{k+1})^\top\big(h_2(z^{k+1}) - h_2(z^k)\big) \le 0.$$

From the proof of Lemma 3 we know that $z^{k+1}$ minimizes the function $f_2(z) + (\lambda^{k+1})^\top h_2(z)$ and, similarly, that $z^k$ minimizes the function $f_2(z) + (\lambda^k)^\top h_2(z)$. Hence, we have the following two inequalities:

$$f_2(z^{k+1}) + (\lambda^{k+1})^\top h_2(z^{k+1}) \le f_2(z^k) + (\lambda^{k+1})^\top h_2(z^k)$$

and

$$f_2(z^k) + (\lambda^k)^\top h_2(z^k) \le f_2(z^{k+1}) + (\lambda^k)^\top h_2(z^{k+1}).$$

Summing up these two inequalities yields

$$(\lambda^{k+1})^\top h_2(z^{k+1}) + (\lambda^k)^\top h_2(z^k) \le (\lambda^{k+1})^\top h_2(z^k) + (\lambda^k)^\top h_2(z^{k+1}),$$

or equivalently

$$0 \ge (\lambda^{k+1})^\top\big(h_2(z^{k+1}) - h_2(z^k)\big) + (\lambda^k)^\top\big(h_2(z^k) - h_2(z^{k+1})\big) = (\lambda^{k+1} - \lambda^k)^\top\big(h_2(z^{k+1}) - h_2(z^k)\big) = \rho\,(r_h^{k+1})^\top\big(h_2(z^{k+1}) - h_2(z^k)\big),$$

where we have used the update rule for $\lambda$; see again Line 8 of Algorithm 1. This completes the proof of Lemma 4.

Now we are prepared to prove our main theorem.

Proof of Theorem 1. Using Lemma 4 and that $V^k \ge 0$ for every iteration $k$ (see Definition 1), we can conclude that

$$\rho \sum_{k=0}^{\infty} \big(\|r_g^{k+1}\|^2 + \|r_h^{k+1}\|^2 + \|h_2(z^{k+1}) - h_2(z^k)\|^2\big) \le \sum_{k=0}^{\infty} \big(V^k - V^{k+1}\big) \le V^0.$$

The series on the left-hand side is non-negative and bounded, hence convergent, because $V^0 < \infty$, which follows from the fact that $h_2$ is an affine function. The convergence implies $\lim_{k\to\infty} r_g^k = 0$, $\lim_{k\to\infty} r_h^k = 0$, and $\lim_{k\to\infty}\|h_2(z^{k+1}) - h_2(z^k)\| = 0$, i.e., the points $x^k$ and $z^k$ will be primal feasible up to an arbitrarily small error for sufficiently large $k$. Finally, it follows from Lemmas 2 and 3 that $\lim_{k\to\infty} f^k = f^*$, i.e., the points $x^k$ and $z^k$ are also primal optimal up to an arbitrarily small error for sufficiently large $k$.

6 Convex optimization problems with many constraints

Finally, we are ready to discuss the main problem that we set out to address in this paper, namely solving general convex optimization problems with many constraints in a distributed setting by distributing the constraints. That is, we want to address optimization problems of the form

$$\min_x \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1 \ldots p, \quad h_i(x) = 0, \; i = 1 \ldots m, \qquad (8)$$

where $f \colon \mathbb{R}^n \to \mathbb{R}$ and $g_i \colon \mathbb{R}^n \to \mathbb{R}^{p_i}$ are convex functions, and $h_i \colon \mathbb{R}^n \to \mathbb{R}^{m_i}$ are affine functions. In total, we have $p_1 + p_2 + \ldots + p_p$ inequality constraints that are grouped into $p$ batches and $m_1 + m_2 + \ldots + m_m$ equality constraints that are subdivided into $m$ groups. For distributing the constraints we can assume without loss of generality that $m = p$. That is, we have $m$ batches that each contain $p_i$ inequality and $m_i$ equality constraints. Again it is easier to work with an equivalent reformulation of Problem 8, where each batch of equality and inequality constraints shares the same variables $x_i$, namely problems of the form

$$\min_{x_i, z} \; \sum_{i=1}^m f(x_i) \quad \text{s.t.} \quad \max\{0, g_i(x_i)\}^2 = 0, \; i = 1 \ldots m, \quad h_i(x_i) = 0, \; i = 1 \ldots m, \quad x_i = z, \qquad (9)$$

where all the variables $x_i$ are coupled through the affine constraints $x_i = z$.
To keep our exposition simple, the objective function has been scaled by $m$ in the reformulation. For specializing our extension of ADMM to instances of Problem 9 we need the Augmented Lagrangian of this problem, which reads as

$$L_\rho(x_i, z, \mu_{i,g}, \mu_{i,h}, \lambda) = \sum_{i=1}^m f(x_i) + \frac{\rho}{2}\sum_{i=1}^m \|\max\{0, g_i(x_i)\}^2\|^2 + \sum_{i=1}^m (\mu_{i,g})^\top \max\{0, g_i(x_i)\}^2 + \frac{\rho}{2}\sum_{i=1}^m \|h_i(x_i)\|^2 + \sum_{i=1}^m (\mu_{i,h})^\top h_i(x_i) + \frac{\rho}{2}\sum_{i=1}^m \|x_i - z\|^2 + \sum_{i=1}^m (\lambda_i)^\top (x_i - z),$$

where $\mu_{i,g}$, $\mu_{i,h}$, and $\lambda_i$ are the Lagrange multipliers (dual variables). Note that the Lagrange function is separable. Hence, the update of the $x$ variables in Line 5 of Algorithm 1 decomposes into the following $m$ independent updates

$$x_i^{k+1} = \operatorname{argmin}_{x_i} \; f(x_i) + \frac{\rho}{2}\|\max\{0, g_i(x_i)\}^2\|^2 + (\mu_{i,g}^k)^\top \max\{0, g_i(x_i)\}^2 + \frac{\rho}{2}\|h_i(x_i)\|^2 + (\mu_{i,h}^k)^\top h_i(x_i) + \frac{\rho}{2}\|x_i - z^k\|^2 + (\lambda_i^k)^\top (x_i - z^k),$$

which can be solved in parallel once the constraints $g_i(x_i)$ and $h_i(x_i)$ have been distributed over $m$ different compute nodes. Note that each update is an unconstrained, convex optimization problem, because the functions that need to be minimized are sums of convex functions. The only two summands for which this might not be obvious are $\frac{\rho}{2}\|\max\{0, g_i(x_i)\}^2\|^2$ and $(\mu_{i,g}^k)^\top \max\{0, g_i(x_i)\}^2$. For the first term, note that the squared norm of a non-negative, convex function is again convex. The second term is convex because, according to Lemma 1, the $\mu_{i,g}^k$ are always non-negative.
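The decomposed updates can be sketched end-to-end on a toy instance of Problem 9 (the instance, the value of $\rho$, the iteration count, and all names below are our illustrative choices, not the paper's implementation; the central $z$- and $\lambda$-updates follow the closed forms given in the text):

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of Problem 9 with n = 1 and m = 2 constraint batches:
# f(x) = x^2, g_1(x) = 1 - x <= 0, g_2(x) = x - 3 <= 0, no h_i.
# Each batch keeps a local copy x_i coupled through x_i = z; optimum z = 1.
rho, m = 50.0, 2
g_list = [lambda x: 1.0 - x, lambda x: x - 3.0]
G = [lambda x, gi=gi: max(0.0, gi(x)) ** 2 for gi in g_list]  # squared hinge

def local_obj(xa, i, z, mu, lam):
    x = xa[0]
    return (x * x + 0.5 * rho * G[i](x) ** 2 + mu[i] * G[i](x)
            + 0.5 * rho * (x - z) ** 2 + lam[i] * (x - z))

x = np.zeros(m); mu = np.zeros(m); lam = np.zeros(m); z = 0.0
for _ in range(2000):
    for i in range(m):  # these m solves are independent: one per compute node
        x[i] = minimize(local_obj, np.array([x[i]]), args=(i, z, mu, lam),
                        method="L-BFGS-B").x[0]
        mu[i] += rho * G[i](x[i])                 # local dual update
    z = (rho * x.sum() + lam.sum()) / (rho * m)   # central consensus update
    lam += rho * (x - z)                          # consensus dual update
print(z)  # approaches 1
```

Only the sums $\sum_i x_i$ and $\sum_i \lambda_i$ have to be communicated to the central node, which is what makes the scheme attractive for many constraint batches.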
The update of the $z$ variable in Line 6 of Algorithm 1 amounts to solving the following unconstrained optimization problem

$$z^{k+1} = \operatorname{argmin}_z \; \sum_{i=1}^m \frac{\rho}{2}\|x_i^{k+1} - z\|^2 + \sum_{i=1}^m (\lambda_i^k)^\top (x_i^{k+1} - z) = \frac{\rho\sum_{i=1}^m x_i^{k+1} + \sum_{i=1}^m \lambda_i^k}{\rho \cdot m},$$

and the updates of the dual variables $\mu_i$ and $\lambda_i$ are as follows:

$$\mu_{i,g}^{k+1} = \mu_{i,g}^k + \rho \max\{0, g_i(x_i^{k+1})\}^2, \quad \mu_{i,h}^{k+1} = \mu_{i,h}^k + \rho\, h_i(x_i^{k+1}), \quad \lambda_i^{k+1} = \lambda_i^k + \rho\,\big(x_i^{k+1} - z^{k+1}\big).$$

That is, in each iteration there are $m$ independent, unconstrained minimization problems that can be solved in parallel on different compute nodes. The solutions of the independent subproblems are then combined on a central node through the update of the $z$ variables and the Lagrange multipliers. Actually, since the Lagrange multipliers $\mu_{i,g}$ and $\mu_{i,h}$ are local, i.e., involve only the variables $x_i^{k+1}$ for any given index $i$, they can also be updated in parallel on the same compute nodes where the $x_i^k$ updates take place. Only the variable $z$ and the Lagrange multipliers $\lambda_i$ need to be updated centrally. Looking at the update rules it becomes apparent that Algorithm 1, when applied to instances of Problem 9, is basically a combination of the standard Augmented Lagrangian method [12, 20] for solving convex, constrained optimization problems and ADMM. It combines the ability to solve constrained optimization problems (Augmented Lagrangian) with the ability to solve convex optimization problems in a distributed fashion (ADMM). Let us briefly come back to the comparison with the alternative approach of dealing with convex constraints $g_0(x) \le 0$ by adding appropriate indicator functions to the objective function and using standard ADMM. As we have discussed before, every compute node has to ensure feasibility and thus needs to project onto the feasible set $\{x \mid g_0(x) \le 0\}$ in every iteration. These projections are quadratic, non-linearly constrained optimization problems.
In contrast to that, our extension of ADMM only needs to solve unconstrained optimization problems in every iteration.

7 Experiments

We have implemented our extension of ADMM in Python using the NumPy and SciPy libraries, and tested this implementation on the robust SVM problem [23], which has a second order cone constraint for every data point. In our experiments we distributed these constraints onto different compute nodes, each of which had to solve an unconstrained optimization problem in every iteration. Since there is no other approach available that can deal with a large number of arbitrary constraints in a distributed manner, we compare our approach to the baseline approach of running an Augmented Lagrangian method in an outer loop and standard ADMM in an inner loop. Note that this approach has three nested loops: the outer loop turns the constrained problem into a sequence of unconstrained problems (Augmented Lagrangians), the next loop distributes the problem using distributed ADMM, and the final inner loop solves the unconstrained subproblems, using the L-BFGS-B algorithm [17, 28] in our implementation.

7.1 Robust SVMs

The robust SVM problem has been designed to deal with binary classification problems whose input is not just labeled data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$, where the $x^{(i)}$ are feature vectors and the $y^{(i)}$ are binary labels, but a distribution over the feature vectors. That is, the labels are assumed to be known precisely and the uncertainty is only in the features. The idea behind the robust SVM is to replace the constraints (for feature vectors without uncertainty) of the standard linear soft-margin SVM by their probabilistic counterparts

$$
\Pr\left[ y^{(i)} w^\top x^{(i)} \ge 1 - \xi_i \right] \ge 1 - \delta_i
$$

that require the now random variable $x^{(i)}$ to be on the correct side of the hyperplane with normal vector w with probability at least $1 - \delta_i$. Shivaswamy et al.
show that the probabilistic constraints can be written as second order cone constraints

$$
y^{(i)} w^\top \bar{x}^{(i)} \ge 1 - \xi_i + \sqrt{\delta_i / (1 - \delta_i)} \left\| \Sigma_i^{1/2} w \right\|,
$$

under the assumption that the mean of the random variable $x^{(i)}$ is the empirical mean $\bar{x}^{(i)}$ and the covariance matrix of $x^{(i)}$ is $\Sigma_i$. The robust SVM problem is then the following SOCP (second order cone program)

$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2} \|w\|^2 + c \sum_{i=1}^n \xi_i \\
\text{s.t.} \quad & y^{(i)} w^\top \bar{x}^{(i)} \ge 1 - \xi_i + \sqrt{\delta_i / (1 - \delta_i)} \left\| \Sigma_i^{1/2} w \right\|, \\
& \xi_i \ge 0, \quad i = 1, \ldots, n.
\end{aligned}
$$

This problem can be reformulated into the form of Problem 9 and is thus amenable to a distributed implementation of our extension of ADMM.

7.2 Experimental setup

We generated random data sets similarly to [1], where an interior point solver for the robust SVM problem has been described. The set of feature vectors was sampled from a uniform distribution on $[-1, 1]^n$. The covariance matrices $\Sigma_i$ were randomly chosen from the cone of positive semidefinite matrices with entries in the interval $[-1, 1]$, and $\delta_i$ was set to $\frac{1}{2}$. Each data point contributes exactly one constraint to the problem and is assigned to exactly one of the compute nodes. In the following, the primal optimization variables are w and ξ, the consensus variables for the primal optimization variables w are still denoted as z, and the dual variables are still denoted as λ for the consensus constraints and µ for the convex constraints, respectively.
Figure 1: Various statistics for the performance of the distributed ADMM extension on an instance of the robust SVM problem: the primal objective value $f^k$, the residual norms $\|r_g^k\|$ and $\|r_h^k\|$, the distances $\|z^k - z^*\|$ and $\|\lambda^k - \lambda^*\|$, and the value $V^k$ over 40 iterations. The convergence proof only states that the value $V^k$ must be monotonically decreasing, which can also be observed experimentally in the figure on the bottom right. Neither the primal function value nor the residuals need to be monotonically decreasing, and as can be seen in the figures on the top, they actually do not decrease monotonically.

7.3 Convergence results

Figure 1 shows the primal objective function value $f^k$, the norms of the residuals $r_g^k$ and $r_h^k$, the distances $\|z^k - z^*\|$ and $\|\lambda^k - \lambda^*\|$, and the value $V^k$ for one run of our algorithm on two compute nodes. Note that only $V^k$ must be strictly monotonically decreasing according to our convergence analysis. The proof does not make any statement about the monotonicity of the other values, and as can be seen in Figure 1, such statements would actually not be true. All values decrease in the long run, but are not necessarily monotonically decreasing.

As can be seen in Figure 1 (top-left), the function value $f^k$ actually increases for the first few iterations, while the residuals $r_g^k$ for the inequality constraints become very small, see Figure 1 (top-middle). That is, within the first iterations each compute node finds a solution for its share of the data that is almost feasible but has a higher function value than the true optimal solution. This basically means that the errors $\xi_i$ for the data points are over-estimated.
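The inequality residuals discussed here come from the robust SVM constraints brought into the form $g(x) \le 0$ required by Problem 9. A minimal sketch of that per-point constraint function (the helper name and signature are ours; `cov_sqrt` stands for $\Sigma_i^{1/2}$):

```python
import numpy as np

def soc_constraint(w, xi_i, x_bar, y_i, cov_sqrt, delta_i):
    """Robust-SVM SOC constraint rewritten as g(w, xi_i) <= 0."""
    kappa = np.sqrt(delta_i / (1.0 - delta_i))           # robustness factor
    return 1.0 - xi_i + kappa * np.linalg.norm(cov_sqrt @ w) - y_i * (w @ x_bar)
```

With $\delta_i = \frac{1}{2}$, as in the experiments, the factor $\sqrt{\delta_i/(1-\delta_i)}$ equals 1, and setting $\Sigma_i = 0$ recovers the ordinary soft-margin SVM constraint.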
After a few more iterations the primal function value drops and the inequality residuals increase, meaning that the error terms $\xi_i$ as well as the individual estimators $w_i$ converge to their optimal values. In the long run, the local estimators at the different compute nodes converge to the same solution. This is witnessed in Figure 1 (top-right), where one can see that the residuals $r_h^k$ for the consensus constraints converge to zero, i.e., consensus among the compute nodes is reached in the long run. Finally, it can be seen that the consensus estimator $z^k$ converges to its unique optimal point $z^*$. Note that in general we cannot guarantee such a convergence, since the optimal point need not be unique. But of course, in the special case that the optimal point is unique, we always have convergence to this point.

Figure 2: Running times of the algorithm on the robust SVM problem. The figure on the left shows that the number of iterations increases mildly with the number of compute nodes. The middle picture shows that the number of iterations decreases with an increasing number of data points. The figure on the right shows the dependency of the number of iterations on the distance of the consensus estimator $z^k$ in iteration k to the optimal estimator $z^*$. It can be seen that our extension of ADMM outperforms the baseline approach with three nested loops.

7.4 Scalability results

Figure 2 shows the scalability of our extension of ADMM in terms of the number of compute nodes, the number of data points, and the approximation quality, respectively. All running times were measured in terms of iterations and averaged over runs on ten randomly generated data sets.
The figures show the number of iterations for the approach presented in this paper and for the baseline, i.e., the approach with three nested loops.

(1) For measuring the scalability in terms of employed compute nodes, we generated 10,000 data points with 10,000 features. As stopping criterion we used $\|z^k - z^*\|_\infty \le 5 \cdot 10^{-3}$, i.e., the predictor $z^k$ had to be close to the optimum. Here we use the infinity norm to be independent of the number of dimensions. The data set was split into four, eight, twelve, and 16 equal-sized batches that were distributed among the compute nodes. Note that every batch had far fewer data points than features, and thus the optimal solutions to the respective problems at the compute nodes were quite different from each other. Nevertheless, our algorithm converged well to the globally optimal solution. Only the convergence speed was affected by the diversity of the local solutions at the different compute nodes. Since we kept the total number of data points in our experiments fixed, the diversity increased with the number of compute nodes, as each node was assigned fewer data points. Hence it was expected that the convergence speed decreases, i.e., the number of iterations increases, with a growing number of compute nodes. This expected behavior can be seen in Figure 2 (left). However, the increase is rather mild: the number of iterations less than doubles when the number of compute nodes increases from four to 16.

(2) For measuring the scalability in terms of the number of data points, we increased the number of data points but kept the number of features fixed at 200. The stopping criterion for our algorithm was again $\|z^k - z^*\|_\infty \le 5 \cdot 10^{-3}$. We used eight compute nodes to compute the solutions. Again, the points were distributed equally among the compute nodes.
This time one would expect a decreasing running time with an increasing number of data points, because the number of data points per machine is increasing and thus the diversity of the local solutions at the different compute nodes is decreasing. That is, with an increasing number of data points it should take fewer iterations to reach an approximate consensus about the global solution among the compute nodes. The results of the experiment, shown in Figure 2 (middle), confirm this expectation. The number of iterations indeed decreases with a growing number of data points. It has been noted before by Shalev-Shwartz and Srebro [22] that an increasing number of data points can require less work for providing a good predictor. We observe a similar phenomenon here.

(3) For measuring the scalability in terms of the approximation quality, we generated 8000 data points in 200 dimensions. Again, eight compute nodes were used for the experiments, whose results are shown in Figure 2 (right). As expected, the number of iterations (running time) increases with increasing approximation quality, which was again measured in terms of the infinity norm. In this paper we do not provide a theoretical convergence rate analysis, which we leave for future work, but the experimental results shown here already provide some intuition on the dependency of the number of iterations on the approximation quality: it seems that our extension of ADMM can solve problems to medium accuracy within a reasonable number of iterations, but higher accuracy requires a significant increase in the number of iterations. Such a behavior is well known for standard ADMM without constraints [2]. In the context of our example application, robust SVMs, medium accuracy is usually sufficient, as higher-accuracy solutions often do not provide better predictors, a phenomenon that is also known as regularization by early stopping.
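The scaffolding common to all three experiments — random data with positive semidefinite covariance matrices scaled into $[-1, 1]$, equal-sized batches per compute node, and the infinity-norm stopping test — might be set up as follows. The construction of the covariance matrices via $AA^\top$ is our own choice for illustration; the text does not prescribe a particular sampling scheme, and all function names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_instance(n_points, n_features):
    """Random data in the spirit of Section 7.2 (illustrative construction)."""
    X = rng.uniform(-1.0, 1.0, size=(n_points, n_features))   # features from [-1, 1]^n
    y = rng.choice([-1.0, 1.0], size=n_points)                # binary labels
    covs = []
    for _ in range(n_points):
        A = rng.uniform(-1.0, 1.0, size=(n_features, n_features))
        S = A @ A.T                                           # positive semidefinite
        covs.append(S / np.abs(S).max())                      # entries scaled into [-1, 1]
    return X, y, covs

def split_batches(X, y, n_nodes):
    """Distribute the data in equal-sized batches, one per compute node."""
    idx = np.array_split(np.arange(len(X)), n_nodes)
    return [(X[b], y[b]) for b in idx]

def converged(z_k, z_star, tol=5e-3):
    """Stopping criterion ||z^k - z*||_inf <= tol, independent of the dimension."""
    return np.linalg.norm(z_k - z_star, ord=np.inf) <= tol
```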
8 Conclusions

We have introduced and analyzed an algorithm for solving general convex optimization problems with many non-linear constraints in a distributed setting. The algorithm is based on an extension of the alternating direction method of multipliers (ADMM). Experiments on the robust SVM problem corroborate our theoretical convergence analysis and demonstrate the scalability of the approach in terms of the number of compute nodes as well as the number of data points.

Despite the vast literature on ADMM, to the best of our knowledge, an ADMM-like scheme for distributing general convex constraints has not been studied before. Standard ADMM is typically used for solving unconstrained optimization problems with a separable objective function in a distributed fashion, but in principle standard ADMM can also be used for solving constrained optimization problems. In the distributed implementation of ADMM, this leads to local, constrained optimization problems that have to be solved in every iteration. These local constrained optimization problems are easy to solve in special cases, for instance for linear constraints, but can become hard to solve in the general case of convex, non-linear constraints. In general, three nested loops are necessary in this approach: an outer loop for reaching consensus, one loop for the constraints, and an inner loop for solving unconstrained problems. Alternatively, one can use the Augmented Lagrangian method for constrained optimization in the outer loop and standard ADMM in the inner loop. This approach also entails three nested loops: an outer loop for the constraints, one loop for reaching consensus, and an inner loop for solving unconstrained problems. That is, the tasks of the two outer loops, reaching consensus and dealing with the constraints, are interchanged in the two approaches.
Here, we use the second approach, i.e., the Augmented Lagrangian method with ADMM in the inner loop, as our baseline, since it avoids the need for solving constrained problems in the inner loop. Our main contribution, however, is showing that the two loops for reaching consensus and for handling constraints can be merged. This results in an extension of ADMM for dealing with problems with many non-linear constraints in a distributed fashion that needs only two nested loops. To the best of our knowledge, we provide the first convergence proof for such a lazy algorithmic scheme. Experimental results provide evidence that our two-loop algorithm is indeed more efficient than the baseline approach with three nested loops.

Acknowledgments

This work was supported by Deutsche Forschungsgemeinschaft (DFG) under grant GI-711/5-1 and grant LA2971/1-1.

References

[1] Martin Andersen, Joachim Dahl, Zhang Liu, and Lieven Vandenberghe. Interior-Point Methods for Large-Scale Cone Programming, pages 55–84. MIT Press, 2012.

[2] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[3] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015.

[4] Caihua Chen, Bingsheng He, Yinyu Ye, and Xiaoming Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016.

[5] Liang Chen, Defeng Sun, and Kim-Chuan Toh. An efficient inexact symmetric Gauss–Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1-2):237–270, 2017.

[6] Wei Deng and Wotao Yin.
On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.

[7] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

[8] Euhanna Ghadimi, André Teixeira, Iman Shames, and Mikael Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.

[9] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975.

[10] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

[11] Bingsheng He and Xiaoming Yuan. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.

[12] Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[13] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, pages 1–35, 2012.

[14] Kejun Huang and Nicholas D. Sidiropoulos. Consensus-ADMM for general quadratically constrained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310, 2016.

[15] Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization, 25(3):1478–1497, 2015.

[16] Zhouchen Lin, Risheng Liu, and Zhixun Su.
Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in Neural Information Processing Systems (NIPS), pages 612–620, 2011.

[17] José Luis Morales and Jorge Nocedal. Remark on "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization". ACM Transactions on Mathematical Software, 38(1):7:1–7:4, 2011.

[18] Damon Mosk-Aoyama, Tim Roughgarden, and Devavrat Shah. Fully distributed algorithms for convex optimization problems. SIAM Journal on Optimization, 20(6):3260–3279, 2010.

[19] Robert Nishihara, Laurent Lessard, Benjamin Recht, Andrew Packard, and Michael I. Jordan. A general analysis of the convergence of ADMM. In International Conference on Machine Learning (ICML), pages 343–352, 2015.

[20] M. J. D. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14(1):224–248, 1969.

[21] Katya Scheinberg, Shiqian Ma, and Donald Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems (NIPS), pages 2101–2109, 2010.

[22] Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: inverse dependence on training set size. In International Conference on Machine Learning (ICML), pages 928–935, 2008.

[23] Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, and Alexander J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.

[24] Morton Slater. Lagrange multipliers revisited. Cowles Foundation Discussion Papers 80, Cowles Foundation for Research in Economics, Yale University, 1950.

[25] Ivor W. Tsang, András Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In International Conference on Machine Learning (ICML), pages 911–918, 2007.

[26] E. Alper Yildirim. Two algorithms for the minimum enclosing ball problem.
SIAM Journal on Optimization, 19(3):1368–1391, 2008.

[27] Ruiliang Zhang and James T. Kwok. Asynchronous distributed ADMM for consensus optimization. In International Conference on Machine Learning (ICML), pages 1701–1709, 2014.

[28] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, 1997.

[29] Minghui Zhu and Sonia Martínez. On distributed convex optimization under inequality and equality constraints. IEEE Transactions on Automatic Control, 57(1):151–164, 2012.
