Last-iterate convergence rates for min-max optimization


Authors: Jacob Abernethy, Kevin A. Lai, Andre Wibisono

Jacob Abernethy (Georgia Institute of Technology, prof@gatech.edu), Kevin A. Lai (Georgia Institute of Technology, kevinlai@gatech.edu), Andre Wibisono (Georgia Institute of Technology, wibisono@gatech.edu)

Abstract

While classic work in convex-concave min-max optimization relies on average-iterate convergence results, the emergence of nonconvex applications such as training Generative Adversarial Networks has led to renewed interest in last-iterate convergence guarantees. Proving last-iterate convergence is challenging because many natural algorithms, such as Simultaneous Gradient Descent/Ascent, provably diverge or cycle even in simple convex-concave min-max settings, and previous work on global last-iterate convergence rates has been limited to the bilinear and convex-strongly concave settings. In this work, we show that the Hamiltonian Gradient Descent (HGD) algorithm achieves linear convergence in a variety of more general settings, including convex-concave problems that satisfy a "sufficiently bilinear" condition. We also prove similar convergence rates for the Consensus Optimization (CO) algorithm of [MNG17] for some parameter settings of CO.

1 Introduction

In this paper we consider methods to solve smooth unconstrained min-max optimization problems. In the most classical setting, a min-max objective has the form

$$\min_{x_1} \max_{x_2} g(x_1, x_2)$$

where $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a smooth objective function with two inputs. The usual goal in such problems is to find a saddle point, also known as a min-max solution, which is a pair $(x_1^*, x_2^*) \in \mathbb{R}^d \times \mathbb{R}^d$ that satisfies

$$g(x_1^*, x_2) \le g(x_1^*, x_2^*) \le g(x_1, x_2^*) \qquad (1)$$

for every $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$.
Min-max problems have a long history, going back at least as far as [Neu28], which formed the basis of much of modern game theory, and including a great deal of work in the 1950s when algorithms such as fictitious play were explored [Bro51, Rob51]. The convex-concave setting, where we assume $g$ is convex in $x_1$ and concave in $x_2$, is a classic min-max problem that has a number of different applications, such as solving constrained convex optimization problems. While a variety of tools have been developed for this setting, a very popular approach within the machine learning community has been the use of so-called no-regret algorithms [CBL06, Haz16]. This trick, which was originally developed by [Han57] and later emerged in the development of boosting [FS99], provides a simple computational method via repeated play: each of the inputs $x_1$ and $x_2$ is updated iteratively according to a no-regret learning protocol, and one can prove that the average iterates $(\bar{x}_1, \bar{x}_2)$ converge to a min-max solution.

Recently, interest in min-max optimization has surged due to the enormous popularity of Generative Adversarial Networks (GANs), whose training involves solving a nonconvex min-max problem where $x_1$ and $x_2$ correspond to the parameters of two different neural nets [GPAM+14]. The fundamentally nonconvex nature of this problem changes two things. First, it is infeasible to find a "global" solution of the min-max objective. Instead, a typical goal in GAN training is to find a local min-max, namely a pair $(x_1^*, x_2^*)$ that satisfies (1) for all $(x_1, x_2)$ in some neighborhood of $(x_1^*, x_2^*)$. Second, iterate averaging lacks the theoretical guarantees present in the convex-concave setting.

(∗ Author order is alphabetical and all authors contributed equally. Preprint. Under review.)
This has motivated research on last-iterate convergence guarantees, which are appealing because they more easily carry over from convex to nonconvex settings. Last-iterate convergence guarantees for min-max problems have been challenging to prove, since the standard analysis of no-regret algorithms says essentially nothing about last-iterate convergence. Widely used no-regret algorithms, such as Simultaneous Gradient Descent/Ascent (SGDA), fail to converge even in the simple bilinear setting where $g(x_1, x_2) = x_1^\top C x_2$ for some arbitrary matrix $C$: SGDA provably cycles in continuous time and diverges in discrete time (see for example [DISZ18, MGN18]). In fact, the full range of Follow-The-Regularized-Leader (FTRL) algorithms provably do not converge in zero-sum games with interior equilibria [MPP18]. This occurs because the iterates of FTRL algorithms exhibit cyclic behavior, a phenomenon commonly observed when training GANs in practice as well.

Much of the recent research on last-iterate convergence in min-max problems has focused on asymptotic or local convergence [MLZ+19, MNG17, DP18, BRM+18, LFB+19, MJS19]. While these results are certainly useful, one would ideally like to prove global non-asymptotic last-iterate convergence rates. Provable global convergence rates allow for quantitative comparison of different algorithms and can aid in choosing learning rates and architectures to ensure fast convergence in practice. Yet despite the extensive literature on convergence rates for convex optimization, very few global last-iterate convergence rates have been proved for min-max problems. Existing work on global last-iterate convergence rates has been limited to the bilinear or convex-strongly concave settings [Tse95, LS19, DH19, MOP19].
In particular, the following basic question is still open: "What global last-iterate convergence rates are achievable for convex-concave min-max problems?"

Our Contribution
Understanding global last-iterate rates in the convex-concave setting is an important stepping stone towards provable last-iterate rates in the nonconvex-nonconcave setting. Motivated by this, we prove new linear last-iterate convergence rates in the convex-concave setting for an algorithm called Hamiltonian Gradient Descent (HGD) under weaker assumptions than previous results. HGD is gradient descent on the squared norm of the gradient, and it has been mentioned in [MNG17, BRM+18]. Our results are the first to show non-asymptotic convergence of an efficient algorithm in settings that are not linear or strongly convex in either input. In particular, we introduce a novel "sufficiently bilinear" condition on the second-order derivatives of the objective $g$ and show that this condition is sufficient for HGD to achieve linear convergence in convex-concave settings. The "sufficiently bilinear" condition appears to be a new sufficient condition for linear convergence rates, distinct from previously known conditions such as the Polyak-Łojasiewicz (PL) condition or pure bilinearity.

Our analysis relies on showing that the squared norm of the gradient satisfies the PL condition in various settings. As a corollary of this result, we can leverage [KNS16] to show that a stochastic version of HGD has a last-iterate convergence rate of $O(1/\sqrt{k})$ in the "sufficiently bilinear" setting. On the practical side, while vanilla HGD has issues training GANs in practice, [MNG17] show that a related algorithm known as Consensus Optimization (CO) can effectively train GANs in a variety of settings, including on CIFAR-10 and celebA.
We show that CO can be viewed as a perturbation of HGD, which implies that for some parameter settings, CO converges at the same rate as HGD.

We begin in Section 2 with background material and notation, including some of our key assumptions. In Section 3, we discuss Hamiltonian Gradient Descent (HGD), and we present our linear convergence rates for HGD in various settings. In Section 4, we present some of the key technical components used to prove our results from Section 3. Finally, in Section 5, we present our results for Stochastic HGD and Consensus Optimization. The details of our proofs are in Appendix H.

[Figure 1: HGD converges quickly, while SGDA spirals. This nonconvex-nonconcave objective is defined in Appendix K.]

2 Background

2.1 Preliminaries

In this section, we discuss some key definitions and notation. We will use $\|\cdot\|$ to denote the Euclidean norm for vectors or the operator norm for matrices or tensors. For a symmetric matrix $A$, we will use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the smallest and largest eigenvalues of $A$. For a general real matrix $A$, $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the smallest and largest singular values of $A$.

Definition 2.1. A critical point of $f : \mathbb{R}^d \to \mathbb{R}$ is a point $x \in \mathbb{R}^d$ such that $\nabla f(x) = 0$.

Definition 2.2 (Convexity / Strong convexity). Let $\alpha \ge 0$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\alpha$-strongly convex if for any $u, v \in \mathbb{R}^d$, $f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle + \frac{\alpha}{2}\|u - v\|^2$. When $f$ is twice-differentiable, $f$ is $\alpha$-strongly convex iff for all $x \in \mathbb{R}^d$, $\nabla^2 f(x) \succeq \alpha I$. If $\alpha = 0$ in either of the above definitions, $f$ is called convex.

Definition 2.3 (Monotone / Strongly monotone). Let $\alpha \ge 0$. A vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ is $\alpha$-strongly monotone if for any $x, y \in \mathbb{R}^d$, $\langle x - y, v(x) - v(y) \rangle \ge \alpha \|x - y\|^2$. If $\alpha = 0$, $v$ is called monotone.

Definition 2.4 (Smoothness).
A function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if $f$ is differentiable everywhere and for all $u, v \in \mathbb{R}^d$ satisfies $\|\nabla f(u) - \nabla f(v)\| \le L\|u - v\|$.

Notation
Since $g$ is a function of $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$, we will often consider $x_1$ and $x_2$ to be components of one vector $x = (x_1, x_2)$. We will use superscripts to denote iterate indices. Following [BRM+18], we use $\xi = \left(\frac{\partial g}{\partial x_1}, -\frac{\partial g}{\partial x_2}\right)$ to denote the signed vector of partial derivatives. Under this notation, the Simultaneous Gradient Descent/Ascent (SGDA) algorithm can be written as follows:

$$x^{(k+1)} = x^{(k)} - \eta\, \xi(x^{(k)})$$

We will write the Jacobian of $\xi$ as:

$$J \equiv \nabla \xi = \begin{pmatrix} \frac{\partial^2 g}{\partial x_1^2} & \frac{\partial^2 g}{\partial x_1 \partial x_2} \\ -\frac{\partial^2 g}{\partial x_2 \partial x_1} & -\frac{\partial^2 g}{\partial x_2^2} \end{pmatrix}.$$

Note that unlike the Hessian in standard optimization, $J$ is not symmetric, due to the negative sign in $\xi$. When clear from the context, we often omit the dependence on $x$ when writing $\xi$, $J$, $g$, $\mathcal{H}$, and other functions. Note that $\xi$, $J$, and $\mathcal{H}$ are defined for a given objective $g$; we omit this dependence as well for notational clarity. We will always assume $g$ is sufficiently differentiable whenever we take derivatives. In particular, we assume second-order differentiability in Section 3. We will also use the following non-standard definition for notational convenience:

Definition 2.5 (Higher-order Lipschitz). A function $g : \mathbb{R}^d \to \mathbb{R}$ is $(L_1, L_2, L_3)$-Lipschitz if for all $x \in \mathbb{R}^d$, $\|\xi(x)\| \le L_1$ and $\|\nabla \xi(x)\| \le L_2$, and for all $x, y \in \mathbb{R}^d$, $\|\nabla \xi(x) - \nabla \xi(y)\| \le L_3\|x - y\|$.

We will consider a variety of settings for min-max optimization based on properties of the objective function $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. In the convex-concave setting, $g$ is convex as a function of $x_1$ for any fixed $x_2 \in \mathbb{R}^d$ and concave as a function of $x_2$ for any fixed $x_1 \in \mathbb{R}^d$.
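As a concrete illustration of the SGDA update, and of the divergence on bilinear objectives discussed in the introduction, here is a minimal sketch (ours, not the authors' code) run on the scalar bilinear objective $g(x_1, x_2) = x_1 x_2$:

```python
import numpy as np

def xi(x, C):
    """Signed gradient field xi = (dg/dx1, -dg/dx2) for g(x1, x2) = x1^T C x2."""
    d = C.shape[0]
    x1, x2 = x[:d], x[d:]
    return np.concatenate([C @ x2, -(C.T @ x1)])

def sgda(x0, C, eta=0.1, steps=100):
    """Simultaneous Gradient Descent/Ascent: x <- x - eta * xi(x)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * xi(x, C)
    return x

C = np.array([[1.0]])          # g(x1, x2) = x1 * x2, saddle point at (0, 0)
x0 = np.array([1.0, 1.0])
xT = sgda(x0, C)
# Each SGDA step multiplies ||x||^2 by exactly (1 + eta^2) here, so the
# iterates spiral outward and diverge from the saddle point.
print(np.linalg.norm(xT) > np.linalg.norm(x0))   # True
```

Since $\|x^{(k+1)}\|^2 = (1+\eta^2)\|x^{(k)}\|^2$ on this objective, no constant step-size fixes the divergence, matching the discrete-time behavior cited above.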
We can form analogous definitions by replacing the words "convex" and "concave" with words such as "strongly convex/concave", "linear", or "nonconvex". The bilinear setting refers to the case when $g(x_1, x_2) = x_1^\top C x_2$ for some matrix $C$. The strongly monotone setting refers to the case when $\xi$ is a strongly monotone vector field, as is the case when $g$ is strongly convex-strongly concave.

2.2 Notions of convergence in min-max problems

The convergence rates in this paper apply to min-max problems where $g$ satisfies the following assumption:

Assumption 2.6. All critical points of the objective $g$ are global min-maxes (i.e. they satisfy (1)).

In other words, we prove convergence rates to min-maxes in settings where convergence to critical points is necessary and sufficient for convergence to min-maxes. This assumption holds in convex-concave settings, but also in some nonconvex-nonconcave settings, as we discuss in Appendix E. This assumption allows us to measure the convergence of our algorithms to $\epsilon$-approximate critical points, defined as follows:

Definition 2.7. Let $\epsilon \ge 0$. A point $x \in \mathbb{R}^d \times \mathbb{R}^d$ is an $\epsilon$-approximate critical point if $\|\xi(x)\| \le \epsilon$.

Convergence to approximate critical points is a necessary condition for convergence to local or global minima, and it is a natural measure of convergence since the value of $g$ at a given point gives no information about how close we are to a min-max. Our main convergence rate results focus on this first-order notion of convergence, which is sufficient given Assumption 2.6. We discuss notions of second-order convergence and ways to adapt our results to the general nonconvex setting in Appendix A.

2.3 Related work

Asymptotic and local convergence
Several recent papers have given asymptotic or local convergence results for min-max problems.
[MLZ+19] show that the extragradient (EG) algorithm converges asymptotically in a broad class of problems known as coherent saddle point problems, which includes quasiconvex-quasiconcave problems. However, they do not prove convergence rates. For more general smooth nonconvex min-max problems, a number of papers have given local stability or local asymptotic convergence results for various algorithms, which we discuss in Appendix A.

Non-asymptotic convergence rates
Compared to the work on asymptotic convergence, the work on global non-asymptotic last-iterate convergence rates has been limited to much more restrictive settings. A classic result by [Roc76] shows a linear convergence rate for the proximal point method in the bilinear and strongly convex-strongly concave cases. Another classic result, by [Tse95], shows a linear convergence rate for the extragradient algorithm in the bilinear case. [LS19] show that a number of algorithms achieve a linear convergence rate in the bilinear case, including Optimistic Mirror Descent (OMD) and Consensus Optimization (CO). They also show that SGDA obtains a linear convergence rate in the strongly convex-strongly concave case. [MOP19] show that OMD and EG obtain a linear rate in the strongly convex-strongly concave case, in addition to proving similar results for generalized versions of both algorithms. Finally, [DH19] show that SGDA achieves a linear convergence rate for a convex-strongly concave setting with a full column rank linear interaction term.²

Non-uniform average-iterate convergence
A number of recent works have studied the convergence of non-uniform averages of iterates, which can be viewed as an interpolation between the standard uniform average-iterate and the last-iterate. We discuss these works further in Appendix B.
3 Hamiltonian Gradient Descent

Our main algorithm for finding saddle points of $g(x_1, x_2)$ is called Hamiltonian Gradient Descent (HGD). HGD consists of performing gradient descent on a particular objective function $\mathcal{H}$ that we refer to as the Hamiltonian, following the terminology of [BRM+18].³ If we let $\xi := \left(\frac{\partial g}{\partial x_1}, -\frac{\partial g}{\partial x_2}\right)$ be the vector of (appropriately-signed) partial derivatives, then the Hamiltonian is:

$$\mathcal{H}(x) := \frac{1}{2}\|\xi(x)\|^2 = \frac{1}{2}\left(\left\|\frac{\partial g}{\partial x_1}(x)\right\|^2 + \left\|\frac{\partial g}{\partial x_2}(x)\right\|^2\right).$$

Since a critical point occurs when $\xi(x) = 0$, we can find an (approximate) critical point by finding an (approximate) minimizer of $\mathcal{H}$. Moreover, under Assumption 2.6, finding a critical point is equivalent to finding a saddle point. This motivates the HGD update procedure on $x^{(k)} = (x_1^{(k)}, x_2^{(k)})$ with step-size $\eta > 0$:

$$x^{(k+1)} = x^{(k)} - \eta \nabla \mathcal{H}(x^{(k)}). \qquad (2)$$

HGD has been mentioned in [MNG17, BRM+18], and it strongly resembles the Consensus Optimization (CO) approach of [MNG17].

The HGD update requires a Hessian-vector product because $\nabla \mathcal{H} = J^\top \xi$, making HGD a second-order iterative scheme. However, Hessian-vector products are cheap to compute when the objective is defined by a neural net, taking only two gradient oracle calls [Pea94]. This makes the Hessian-vector product oracle a theoretically appealing primitive, and it has been used widely in the nonconvex optimization literature. Since Hessian-vector product oracles are feasible to compute for GANs, many recent algorithms for local min-max nonconvex optimization have also utilized Hessian-vector products [MNG17, BRM+18, ADLH19, LFB+19, MJS19].
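As a minimal sketch of the HGD update (2) (ours, not the authors' implementation), consider the quadratic objective $g(x_1, x_2) = \frac{a}{2}x_1^2 + c\,x_1 x_2 - \frac{b}{2}x_2^2$ with scalar players, where $J$ is constant and can be written down explicitly; in practice $\nabla\mathcal{H} = J^\top\xi$ would instead be computed via a Hessian-vector product:

```python
import numpy as np

# Quadratic objective g(x1, x2) = (a/2) x1^2 + c x1 x2 - (b/2) x2^2.
# Then xi(x) = (a x1 + c x2, b x2 - c x1) and J = [[a, c], [-c, b]].
a, b, c = 0.0, 0.0, 1.0          # pure bilinear case, where SGDA diverges

J = np.array([[a, c], [-c, b]])

def xi(x):
    return J @ x                  # xi is linear because g is quadratic

def hgd(x0, eta, steps):
    """Hamiltonian Gradient Descent: x <- x - eta * J^T xi(x)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * (J.T @ xi(x))
    return x

x0 = np.array([1.0, 1.0])
xT = hgd(x0, eta=0.5, steps=30)
print(np.linalg.norm(xi(xT)))     # ||xi|| shrinks toward 0
```

On this bilinear instance $J^\top J = c^2 I$, so each HGD step contracts $x$ toward the saddle point, in contrast to the SGDA spiral.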
To the best of our knowledge, previous work on last-iterate convergence rates has only focused on three particular cases: (a) when the objective $g$ is bilinear, (b) when $g$ is strongly convex-strongly concave, and (c) when $g$ is convex-strongly concave [Tse95, LS19, DH19, MOP19]. The existence of methods with provable finite-time guarantees for settings beyond these has remained an open problem. This work is the first to show that an efficient algorithm, namely HGD, can achieve non-asymptotic convergence in settings that are not strongly convex or linear in either player.

3.1 Convergence Rates for HGD

We now state our main theorems, which show convergence to critical points. When Assumption 2.6 holds, we get convergence to min-maxes. All of our main results will use the following multi-part assumption:

Assumption 3.1. Let $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.
1. Assume a critical point for $g$ exists.
2. Assume $g$ is $(L_1, L_2, L_3)$-Lipschitz and let $L_{\mathcal{H}} = L_1 L_3 + L_2^2$.

Our first theorem shows that HGD converges in the strongly convex-strongly concave case. Although simple, this result will help us demonstrate our analysis techniques.

(Footnote 2: Specifically, they assume $g(x_1, x_2) = f(x_1) + x_2^\top A x_1 - h(x_2)$, where $f$ is smooth and convex, $h$ is smooth and strongly convex, and $A$ has full column rank. We make a brief comparison of our work to that of [DH19] for the convex-strongly concave setting in Appendix D.)

(Footnote 3: We note that the function $\mathcal{H}$ is not a Hamiltonian in the sense of classical physics; we do not use any symplectic structure in our analysis, but rather only perform gradient descent on $\mathcal{H}$.)

Theorem 3.2. Let Assumption 3.1 hold and let $g(x_1, x_2)$ be $c$-strongly convex in $x_1$ and $c$-strongly concave in $x_2$.
Then the HGD update procedure described in (2) with step-size $\eta = 1/L_{\mathcal{H}}$ starting from some $x^{(0)} \in \mathbb{R}^d \times \mathbb{R}^d$ will satisfy

$$\|\xi(x^{(k)})\| \le \left(1 - \frac{c^2}{L_{\mathcal{H}}}\right)^{k/2} \|\xi(x^{(0)})\|.$$

Next, we show that HGD converges when $g$ is linear in one of its arguments and the cross-derivative is full rank. This setting allows a slightly tighter analysis than Theorem 3.4.

Theorem 3.3. Let Assumption 3.1 hold, let $g(x_1, x_2)$ be $L$-smooth in $x_1$ and linear in $x_2$, and assume the cross-derivative $\nabla^2_{x_1 x_2} g$ is full rank with all singular values at least $\gamma > 0$ for all $x \in \mathbb{R}^d \times \mathbb{R}^d$. Then the HGD update procedure described in (2) with step-size $\eta = 1/L_{\mathcal{H}}$ starting from some $x^{(0)} \in \mathbb{R}^d \times \mathbb{R}^d$ will satisfy

$$\|\xi(x^{(k)})\| \le \left(1 - \frac{\gamma^4}{(2\gamma^2 + L^2) L_{\mathcal{H}}}\right)^{k/2} \|\xi(x^{(0)})\|.$$

Finally, we show our main result, which requires smoothness in both players and a large, well-conditioned cross-derivative.

Theorem 3.4. Let Assumption 3.1 hold and let $g$ be $L$-smooth in $x_1$ and $L$-smooth in $x_2$. Let $\mu^2 = \min_{x_1, x_2} \lambda_{\min}((\nabla^2_{x_2 x_2} g(x_1, x_2))^2)$ and $\rho^2 = \min_{x_1, x_2} \lambda_{\min}((\nabla^2_{x_1 x_1} g(x_1, x_2))^2)$, and assume the cross-derivative $\nabla^2_{x_1 x_2} g$ is full rank with all singular values lower bounded by $\gamma > 0$ and upper bounded by $\Gamma$ for all $x \in \mathbb{R}^d \times \mathbb{R}^d$. Moreover, let the following "sufficiently bilinear" condition hold:

$$(\gamma^2 + \rho^2)(\mu^2 + \gamma^2) - 4L^2\Gamma^2 > 0. \qquad (3)$$

Then the HGD update procedure described in (2) with step-size $\eta = 1/L_{\mathcal{H}}$ starting from some $x^{(0)} \in \mathbb{R}^d \times \mathbb{R}^d$ will satisfy

$$\|\xi(x^{(k)})\| \le \left(1 - \frac{(\gamma^2 + \rho^2)(\gamma^2 + \mu^2) - 4L^2\Gamma^2}{(2\gamma^2 + \rho^2 + \mu^2) L_{\mathcal{H}}}\right)^{k/2} \|\xi(x^{(0)})\|. \qquad (4)$$

As discussed above, Theorem 3.4 provides the first last-iterate convergence rate for min-max problems that does not require strong convexity or linearity in either input.
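The rate in Theorem 3.2 can be sanity-checked numerically on a strongly convex-strongly concave quadratic (the test objective and constants below are our choices, not from the paper). For a quadratic, $\xi(x) = Jx$ with constant $J$, so $L_3 = 0$ and $L_{\mathcal{H}} = L_2^2 = \sigma_{\max}(J)^2$:

```python
import numpy as np

# g(x1, x2) = (1/2) x1^2 + 2 x1 x2 - (3/2) x2^2 is 1-strongly convex in x1
# and 1-strongly concave in x2 (in fact 3-strongly concave), so c = 1.
J = np.array([[1.0, 2.0], [-2.0, 3.0]])   # J = grad(xi), constant here
c = 1.0

# For a quadratic, xi(x) = J x, so L3 = 0 and L_H = L2^2 = ||J||_2^2.
L_H = np.linalg.norm(J, ord=2) ** 2
eta = 1.0 / L_H

x = np.array([2.0, -1.0])
xi0 = np.linalg.norm(J @ x)
ok = True
for k in range(1, 201):
    x = x - eta * (J.T @ (J @ x))          # HGD step: x <- x - eta * J^T xi(x)
    bound = (1 - c**2 / L_H) ** (k / 2) * xi0
    ok = ok and (np.linalg.norm(J @ x) <= bound + 1e-12)
print(ok)   # True: every iterate satisfies the Theorem 3.2 rate bound
```

Here the bound holds with room to spare, since $\lambda_{\min}(JJ^\top) > c^2$ for this instance; the theorem's guarantee only uses the strong convexity/concavity parameter $c$.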
For example, the objective $g(x_1, x_2) = f(x_1) + 3L\, x_1^\top x_2 - h(x_2)$, where $f$ and $h$ are $L$-smooth convex functions, satisfies the assumptions of Theorem 3.4 and is not strongly convex or linear in either input. We discuss a simple example that is not convex-concave in Appendix E. We also show how our results can be applied to specific settings, such as the Dirac-GAN, in Appendix G.

The "sufficiently bilinear" condition (3) is in some sense necessary for our linear convergence rate, since linear convergence is impossible in general for convex-concave settings, due to lower bounds on convex optimization [AH18, ASS17]. We give some explanations for this condition in the following section. In simple experiments for HGD on convex-concave and nonconvex-nonconcave objectives, the convergence rate speeds up when there is a larger bilinear component, as expected from our theoretical results. We show these experiments in Appendix K.

3.2 Explanation of the "sufficiently bilinear" condition

In this section, we explain the "sufficiently bilinear" condition (3). Suppose our objective is $g(x_1, x_2) = \hat{g}(x_1, x_2) + c\, x_1^\top x_2$ for a smooth function $\hat{g}$. Then for sufficiently large values of $c$ (i.e. if $g$ has a large enough bilinear term), $g$ satisfies (3). To see this, note that if $\gamma^4 > 4L^2\Gamma^2$, then condition (3) holds. Let $\gamma_0$ and $\Gamma_0$ be lower and upper bounds on the singular values of $\nabla^2_{x_1 x_2} \hat{g}$. Then it suffices to have $(\gamma_0 + c)^4 > 4L^2(\Gamma_0 + c)^2$, which is true for $c = 3\max\{L, \Gamma_0\}$ (i.e. $c = O(L)$ suffices).

This condition is analogous to the case when we use SGDA on the objective $g(x_1, x_2) = \hat{g}(x_1, x_2) + c\|x_1\|^2 - c\|x_2\|^2$ for $L$-smooth convex-concave $\hat{g}$. According to [LS19], SGDA will converge at a rate of roughly $\frac{\tilde{L}^2}{c^2}\log(1/\epsilon)$ for $\tilde{L}$-smooth and $c$-strongly convex-strongly concave objectives.⁴
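The sufficiency claim just made, that $c = 3\max\{L, \Gamma_0\}$ guarantees $(\gamma_0 + c)^4 > 4L^2(\Gamma_0 + c)^2$, is easy to check numerically; the sketch below (ours) tests it over a small grid of smoothness and cross-derivative bounds:

```python
import itertools

def sufficiently_bilinear(gamma, Gamma, L):
    """Check the simplified form gamma^4 > 4 L^2 Gamma^2 of condition (3)."""
    return gamma**4 > 4 * L**2 * Gamma**2

# Per the discussion above, for g = g_hat + c * x1^T x2 it suffices to check
# the simplified condition with gamma = gamma0 + c and Gamma = Gamma0 + c.
vals = [0.0, 0.5, 1.0, 2.0, 10.0]
ok = True
for L, gamma0, Gamma0 in itertools.product(vals, repeat=3):
    if gamma0 > Gamma0 or L == 0:
        continue                       # need gamma0 <= Gamma0 and L > 0
    c = 3 * max(L, Gamma0)
    ok = ok and sufficiently_bilinear(gamma0 + c, Gamma0 + c, L)
print(ok)   # True for every combination tested
```

The check passes because $(\gamma_0 + c)^4 \ge c^4 = 81\max\{L,\Gamma_0\}^4$ while $4L^2(\Gamma_0 + c)^2 \le 64\max\{L,\Gamma_0\}^4$.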
(Footnote 4: The actual rate is $\frac{\beta}{c}\log(1/\epsilon)$, for some parameter $\beta$ that is at least $(L+c)^2$.)

For $c = 0$, SGDA will diverge in the worst case. For $c = o(L)$, we get linear convergence, but it will be slow because $\frac{L+c}{c}$ is large (this can be thought of as a large condition number). Finally, for $c = \Omega(L)$, we get fast linear convergence, since $\frac{L+c}{c} = O(1)$. Thus, to get fast linear convergence it suffices to make the problem "sufficiently strongly convex-strongly concave" (or "sufficiently strongly monotone"). Theorem 3.4 and condition (3) show that there exists another class of settings where we can achieve linear rates in the min-max setting. In our case, if we have an objective $g(x_1, x_2) = \hat{g}(x_1, x_2) + c\, x_1^\top x_2$ for a smooth function $\hat{g}$, we will get linear convergence if $\|\nabla^2_{x_1 x_2} \hat{g}\| \le \delta L$ and $c \ge 3(1 + \delta)L$, which ensures that the problem is "sufficiently bilinear." Intuitively, it makes sense that the "sufficiently bilinear" setting allows a linear rate because the pure bilinear setting allows a linear rate.

Another way to understand condition (3) is that it is a sufficient condition for the existence of a unique critical point in a general class of settings, as we show in the following lemma, which we prove in Appendix F.

Lemma 3.5. Let $g(x_1, x_2) = f(x_1) + c\, x_1^\top x_2 - h(x_2)$ where $f$ and $h$ are $L$-smooth. Moreover, assume that $\nabla^2 f(x_1)$ and $\nabla^2 h(x_2)$ each have a 0 eigenvalue for some $x_1$ and $x_2$. If (3) holds, then $g$ has a unique critical point.

4 Proof sketches for HGD convergence rate results

In this section, we go over the key components of the proofs of our convergence rates from Section 3.1. Recall that the intuition behind HGD was that critical points (where $\xi(x) = 0$) are global minima of $\mathcal{H} = \frac{1}{2}\|\xi\|^2$.
On the other hand, there is no guarantee that $\mathcal{H}$ is a convex potential function, and a priori one would not expect gradient descent on this potential to find a critical point. Nonetheless, we are able to show that in a variety of settings, $\mathcal{H}$ satisfies the PL condition, which allows HGD to achieve linear convergence. Proving this requires establishing properties of the singular values of $J \equiv \nabla \xi$.

4.1 The Polyak-Łojasiewicz condition for the Hamiltonian

We begin by recalling the definition of the PL condition.

Definition 4.1 (Polyak-Łojasiewicz (PL) condition [Pol63, Loj63]). A function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies the PL condition with parameter $\alpha > 0$ if for all $x \in \mathbb{R}^d$,

$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \alpha\left(f(x) - \min_{x^* \in \mathbb{R}^d} f(x^*)\right).$$

The PL condition is well known to be the weakest condition necessary to obtain a linear convergence rate for gradient methods; see for example [KNS16]. We will show that $\mathcal{H}$ satisfies the PL condition, which allows us to use the following classic theorem.

Theorem 4.2 (Linear rate under PL [Pol63, Loj63]). Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and let $x^* \in \arg\min_{x \in \mathbb{R}^d} f(x)$. Suppose $f$ satisfies the PL condition with parameter $\alpha$. Then if we run gradient descent from $x^{(0)} \in \mathbb{R}^d$ with step-size $\frac{1}{L}$, we have:

$$f(x^{(k)}) - f(x^*) \le \left(1 - \frac{\alpha}{L}\right)^k \left(f(x^{(0)}) - f(x^*)\right).$$

For completeness, we provide the proof of Theorem 4.2 in Appendix C. All of our results use Assumption 3.1, so we are guaranteed that $g$ has a critical point. This implies that the global minimum of $\mathcal{H}$ is 0, which allows us to prove the following key lemma:

Lemma 4.3. Assume we have a twice-differentiable $g(x_1, x_2)$ with associated $\xi$, $\mathcal{H}$, $J$. Let $\alpha > 0$. If $JJ^\top \succeq \alpha I$ for every $x$, then $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$.

Proof. Consider the squared norm of the gradient of the Hamiltonian:

$$\frac{1}{2}\|\nabla \mathcal{H}\|^2 = \frac{1}{2}\|J^\top \xi\|^2 = \frac{1}{2}\langle \xi, (JJ^\top)\xi \rangle \ge \frac{\alpha}{2}\|\xi\|^2 = \alpha \mathcal{H}.$$
The proof is finished by noting that $\mathcal{H}(x) = 0$ when $x$ is a critical point.

To use Theorem 4.2, we also need to show that $\mathcal{H}$ is smooth, which holds when $g$ is $(L_1, L_2, L_3)$-Lipschitz. The proof of Lemma 4.4 is in Appendix H.

Lemma 4.4. Consider any $g(x_1, x_2)$ which is $(L_1, L_2, L_3)$-Lipschitz for constants $L_1, L_2, L_3 > 0$. Then the Hamiltonian $\mathcal{H}(x)$ is $(L_1 L_3 + L_2^2)$-smooth.

To use Lemma 4.3, we need control over the eigenvalues of $JJ^\top$, which we achieve with the following linear algebra lemmas. We provide their proofs in Appendix H.

Lemma 4.5. Let $H = \begin{pmatrix} M_1 & B \\ -B^\top & -M_2 \end{pmatrix}$ and let $\epsilon \ge 0$. If $M_1 \succeq \epsilon I$ and $M_2 \succeq \epsilon I$, then every eigenvalue $\lambda$ of $HH^\top$ satisfies $\lambda \ge \epsilon^2$.

Lemma 4.6. Let $H = \begin{pmatrix} A & C \\ -C^\top & 0 \end{pmatrix}$, where $C$ is square and full rank. Then if $\lambda$ is an eigenvalue of $HH^\top$, we must have

$$\lambda \ge \frac{\sigma_{\min}^4(C)}{2\sigma_{\min}^2(C) + \|A\|^2}.$$

4.2 Proof sketches for Theorems 3.2, 3.3, and 3.4

We now sketch the proofs of our main theorems using the techniques described above. The following lemma shows that it suffices to prove the PL condition for $\mathcal{H}$ in the various settings of our theorems:

Lemma 4.7. Given $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, suppose $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$ and is $L_{\mathcal{H}}$-smooth. Then if we update some $x^{(0)} \in \mathbb{R}^d \times \mathbb{R}^d$ using (2) with step-size $\eta = 1/L_{\mathcal{H}}$, we have the following:

$$\|\xi(x^{(k)})\| \le \left(1 - \frac{\alpha}{L_{\mathcal{H}}}\right)^{k/2} \|\xi(x^{(0)})\|.$$

Proof. Since $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$ and $\mathcal{H}$ is $L_{\mathcal{H}}$-smooth, we know by Theorem 4.2 that gradient descent on $\mathcal{H}$ with step-size $1/L_{\mathcal{H}}$ converges at a rate of $\mathcal{H}(x^{(k)}) \le \left(1 - \frac{\alpha}{L_{\mathcal{H}}}\right)^k \mathcal{H}(x^{(0)})$. Substituting in for $\mathcal{H}$ gives the lemma.

It remains to show that $\mathcal{H}$ satisfies the PL condition in the settings of Theorems 3.2 to 3.4. First, we show the result for the strongly convex-strongly concave setting of Theorem 3.2.

Lemma 4.8 (PL for the strongly convex-strongly concave setting).
Let $g$ be $c$-strongly convex in $x_1$ and $c$-strongly concave in $x_2$. Then $\mathcal{H}$ satisfies the PL condition with parameter $\alpha = c^2$.

Proof. We apply Lemma 4.5 with $H = J$. Since $g$ is $c$-strongly convex in $x_1$ and $c$-strongly concave in $x_2$, we have $M_1 = \nabla^2_{x_1 x_1} g \succeq cI$ and $M_2 = -\nabla^2_{x_2 x_2} g \succeq cI$. Then the magnitude of the eigenvalues of $J$ is at least $c$. Thus, $JJ^\top \succeq c^2 I$, so by Lemma 4.3, $\mathcal{H}$ satisfies the PL condition with parameter $c^2$.

Next, we show that $\mathcal{H}$ satisfies the PL condition in the nonconvex-linear setting of Theorem 3.3. We prove this lemma in Appendix H.4 using Lemma 4.6.

Lemma 4.9 (PL for the smooth nonconvex-linear setting). Let $g$ be $L$-smooth in $x_1$ and linear in $x_2$. Moreover, for all $x \in \mathbb{R}^d \times \mathbb{R}^d$, let $\nabla^2_{x_1 x_2} g(x_1, x_2)$ be full rank and square with $\sigma_{\min}(\nabla^2_{x_1 x_2} g(x_1, x_2)) \ge \gamma$. Then $\mathcal{H}$ satisfies the PL condition with parameter $\alpha = \frac{\gamma^4}{2\gamma^2 + L^2}$.

Finally, we prove that $\mathcal{H}$ satisfies the PL condition in the nonconvex-nonconvex setting of Theorem 3.4. The proof of Lemma 4.10 is in Appendix H.5, and it uses Lemma H.2, which is similar to Lemma 4.6.

Lemma 4.10 (PL for the smooth nonconvex-nonconvex setting). Let $g$ be $L$-smooth in $x_1$ and $L$-smooth in $x_2$. Also, let $\nabla^2_{x_1 x_2} g$ be full rank and let all of its singular values be lower bounded by $\gamma$ and upper bounded by $\Gamma$ for all $x \in \mathbb{R}^d \times \mathbb{R}^d$. Let $\rho^2 = \min_{x_1, x_2} \lambda_{\min}((\nabla^2_{x_1 x_1} g(x_1, x_2))^2)$ and $\mu^2 = \min_{x_1, x_2} \lambda_{\min}((\nabla^2_{x_2 x_2} g(x_1, x_2))^2)$. Assume the following condition holds:

$$(\gamma^2 + \rho^2)(\gamma^2 + \mu^2) - 4L^2\Gamma^2 > 0.$$

Then $\mathcal{H}$ satisfies the PL condition with parameter

$$\alpha = \frac{(\gamma^2 + \rho^2)(\gamma^2 + \mu^2) - 4L^2\Gamma^2}{2\gamma^2 + \rho^2 + \mu^2}.$$

Combining Lemmas 4.8 to 4.10 with Lemma 4.7 yields Theorems 3.2 to 3.4.

5 Extensions of HGD results

Stochastic HGD
Our results above also imply rates for stochastic HGD, where the gradient $\nabla \mathcal{H}$ in (2) is replaced by a stochastic estimator $v$ of $\nabla \mathcal{H}$ such that $\mathbb{E}[v] = \nabla \mathcal{H}$.
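A minimal sketch of such a stochastic variant (ours, not the authors' implementation) replaces $\nabla\mathcal{H} = J^\top\xi$ with a noisy unbiased estimate and uses the decaying step-size schedule $\eta_k = \frac{2k+1}{2\alpha(k+1)^2}$ from Theorem 5.1; the bilinear test objective and noise level are our choices:

```python
import numpy as np

# Bilinear objective g(x1, x2) = x1 * x2, so xi(x) = J x with
# J = [[0, 1], [-1, 0]], grad H(x) = J^T J x = x, and the PL parameter
# of H is alpha = lambda_min(J J^T) = 1.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
alpha = 1.0
rng = np.random.default_rng(0)

x = np.array([1.0, 1.0])
xi0 = np.linalg.norm(J @ x)
for k in range(1, 2001):
    eta_k = (2 * k + 1) / (2 * alpha * (k + 1) ** 2)   # Theorem 5.1 schedule
    v = J.T @ (J @ x) + 0.1 * rng.standard_normal(2)   # unbiased noisy grad H
    x = x - eta_k * v
print(np.linalg.norm(J @ x) / xi0)   # well below 1 after the noisy run
```

Despite the noise never vanishing, the decaying schedule drives $\|\xi(x^{(k)})\|$ down, consistent with the $O(1/\sqrt{k})$ rate stated below.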
Since we show that $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$ in different settings, we can use Theorem 4 in [KNS16] to show that stochastic HGD converges at an $O(1/\sqrt{k})$ rate in the settings of Theorems 3.2 to 3.4, including the "sufficiently bilinear" setting. We prove Theorem 5.1 in Appendix I.

Theorem 5.1. Let Assumption 3.1 hold and suppose $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$. Suppose we use the update $x^{(k+1)} = x^{(k)} - \eta_k v(x^{(k)})$, where $v$ is a stochastic estimate of $\nabla \mathcal{H}$ such that $\mathbb{E}[v] = \nabla \mathcal{H}$ and $\mathbb{E}[\|v(x^{(k)})\|^2] \le C^2$ for all $x^{(k)}$. Then if we use $\eta_k = \frac{2k+1}{2\alpha(k+1)^2}$, we have the following convergence rate:

$$\mathbb{E}[\|\xi(x^{(k)})\|] \le \sqrt{\frac{L_{\mathcal{H}} C^2}{k\alpha^2}}.$$

Consensus Optimization
The Consensus Optimization (CO) algorithm of [MNG17] is as follows:

$$x^{(k+1)} = x^{(k)} - \eta\left(\xi(x^{(k)}) + \gamma \nabla \mathcal{H}(x^{(k)})\right) \qquad (5)$$

where $\gamma > 0$. This is essentially a weighted combination of SGDA and HGD. [MNG17] remark that while HGD has poor performance on nonconvex problems in practice, CO can effectively train GANs in a variety of settings, including on CIFAR-10 and celebA. While they frame CO as SGDA with a small modification, they actually set $\gamma = 10$ for several of their experiments, which suggests that one can also view CO as a modified form of HGD. Using this perspective, we prove Theorem 5.2, which implies that we get linear convergence of CO in the same settings as Theorems 3.2 to 3.4 provided that $\gamma$ is sufficiently large (i.e. the HGD update is large compared to the SGDA update). The key technical component is showing that HGD still performs well even under a certain kind of small arbitrary perturbation. Previously, [LS19] proved that CO achieves linear convergence in the bilinear setting, so our result greatly expands the settings where CO has provable non-asymptotic convergence. We prove Theorem 5.2 in Appendix J.

Theorem 5.2. Let Assumption 3.1 hold.
Let g be L_g-smooth and suppose H satisfies the PL condition with parameter α. Then if we update some x^(0) ∈ R^d × R^d using the CO update (5) with step size η = α/(4 L_H L_g) and γ = 4L_g/α, we get the following convergence:

||ξ(x^(k))|| ≤ (1 − α/(4L_H))^k ||ξ(x^(0))||.    (6)

We also show that CO converges in practice on some simple examples in Appendix K.

References

[ADLH19] Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. In Artificial Intelligence and Statistics (AISTATS), 2019.
[AH18] Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. In Conference on Learning Theory (COLT), 2018.
[ALLW18] Jacob Abernethy, Kevin A. Lai, Kfir Y. Levy, and Jun-Kun Wang. Faster rates for convex-concave games. In Conference on Learning Theory (COLT), 2018.
[ASS17] Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, pages 1–34, 2017.
[AZH16] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning (ICML), pages 699–707, 2016.
[BRM+18] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning (ICML), 2018.
[Bro51] George W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.
[CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[CHDS17] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions.
In International Conference on Machine Learning (ICML), 2017.
[DH19] Simon S. Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In Artificial Intelligence and Statistics (AISTATS), 2019.
[DISZ18] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations (ICLR), 2018.
[DP18] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 9255–9265, 2018.
[FS99] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, October 1999.
[GBVLJ19] Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. In International Conference on Learning Representations (ICLR), 2019.
[GL16] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
[Han57] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[KALL18] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
[Kro19] Christian Kroer. First-order methods with increasing iterate averaging for solving saddle-point problems. arXiv preprint arXiv:1903.10646, 2019.
[LFB+19] Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations (ICLR), 2019.
[Loj63] Stanisław Łojasiewicz. A topological property of real analytic subsets (in French). Coll. du CNRS, Les équations aux dérivées partielles, pages 87–89, 1963.
[LS19] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In Artificial Intelligence and Statistics (AISTATS), 2019.
[MGN18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), pages 3478–3487, 2018.
[MJS19] Eric V. Mazumdar, Michael I. Jordan, and S. Shankar Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
[MLZ+19] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations (ICLR), 2019.
[MNG17] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1825–1835, 2017.
[MOP19] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil.
A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
[MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2703–2717. SIAM, 2018.
[Neu28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
[Pea94] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[Pol63] B. T. Polyak. Gradient methods for minimizing functionals (in Russian). Zh. Vychisl. Mat. Mat. Fiz., pages 643–653, 1963.
[Rob51] Julia Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.
[Roc76] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[Tse95] Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.
[YFW+19] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations (ICLR), 2019.

A General nonconvex min-max optimization

In standard nonconvex optimization, a common goal is to find second-order local minima: approximate critical points where ∇²f is approximately positive definite. Likewise, a common goal in nonconvex min-max optimization is to find approximate critical points where an analogous second-order condition holds, namely that ∇²_{x1x1} g(x) is approximately positive definite and ∇²_{x2x2} g(x) is approximately negative definite.
Critical points where this second-order condition holds are called local min-maxes. When Assumption 2.6 holds, all critical points are global min-maxes, but in more general settings we may encounter critical points that do not satisfy these conditions: critical points may be local min-mins, local max-mins, or indefinite points. A number of recent papers have proposed dynamics for nonconvex min-max optimization, showing local stability or local asymptotic convergence results [MNG17, DP18, BRM+18, LFB+19, MJS19]. The key guarantee that these papers generally give is that their algorithms are stable at local min-maxes and unstable at some set of undesirable critical points (such as local max-mins). This essentially amounts to a guarantee that in the convex-concave setting their algorithms converge asymptotically, and that in the strictly concave-strictly convex setting (i.e., where there is only an undesirable max-min), their algorithms diverge asymptotically. This type of local stability is essentially the best one can ask for in the general nonconvex setting, and we show how to give similar guarantees for our algorithm in Section A.1.

A.1 Nonconvex extensions for HGD

While the naive version of HGD will try to converge to all critical points, we can modify HGD slightly to achieve second-order stability guarantees as in related work such as [BRM+18, LFB+19]. In particular, we consider modifying HGD by placing a scalar α in front of the ∇H term:

x^(k+1) = x^(k) − η α ∇H(x^(k))    (7)

We now present two ways to choose α. Our first method is inspired by the Symplectic Gradient Adjustment algorithm of [BRM+18], which is as follows:

x^(k+1) = x^(k) − η(ξ(x^(k)) − λAᵀξ(x^(k)))    (8)

where A is the antisymmetric part of J and λ = sgn(⟨ξ, Jᵀξ⟩ · ⟨Aᵀξ, Jᵀξ⟩).
[BRM+18] show that λ is positive in a strictly convex-strictly concave region and negative in a strictly concave-strictly convex region. Thus, if we choose α = λ = sgn(⟨ξ, Jᵀξ⟩ · ⟨Aᵀξ, Jᵀξ⟩), the modified HGD exhibits local stability around strict min-maxes and local instability around strict max-mins. This follows simply because we perform gradient descent on H in the first case and gradient ascent on H in the second case. Another way to choose α involves using an approximate eigenvalue computation on ∇²_{x1x1} g and ∇²_{x2x2} g to detect whether ∇²_{x1x1} g is positive semidefinite and ∇²_{x2x2} g is negative semidefinite (which would mean we are in a convex-concave region). We set α = 1 if we are in a convex-concave region and α = −1 otherwise, which guarantees local stability around min-maxes and local instability around other critical points. This approximate eigenvalue computation can be done using a logarithmic number of Hessian-vector products.

B Background on non-uniform average iterates

A number of recent works have focused on the performance of a non-uniform average of an algorithm's iterates. Iterate averaging can lend stability to an algorithm or improve performance if the algorithm cycles around the solution. On the other hand, uniform averages can suffer from worse performance in nonconvex settings if early iterates are far from optimal. Non-uniform averaging is a way to achieve the stability benefits of iterate averaging while potentially speeding up convergence compared to uniform averaging. In this way, one can view non-uniform averaging as an interpolation between average-iterate and last-iterate algorithms. One popular non-uniform averaging scheme is the exponential moving average (EMA).
For an algorithm with iterates z^(0), ..., z^(T), the EMA at iterate t is defined recursively as

z^(t)_EMA = β z^(t−1)_EMA + (1 − β) z^(t)

where z^(0)_EMA = z^(0) and β < 1. A typical value for β is 0.999. [YFW+19] and [GBVLJ19] show that uniform and EMA schemes can improve GAN performance on a variety of datasets. [MGN18] and [KALL18] use EMA to evaluate the GAN models they train, showing the effectiveness of EMA in practice. In terms of theoretical results, [Kro19] studies saddle point problems of the form min_{x1} max_{x2} f(x1) + g(x1) + ⟨Kx1, x2⟩ − h*(x2), where f is a smooth convex function, g and h are convex functions with easily computable prox-mappings, and K is some linear operator. They show that for certain algorithms, linear averaging and quadratic averaging schemes are provably at least as good as the uniform averaging scheme in terms of iteration complexity. [ALLW18] show how linear and exponential averaging schemes can be used to achieve faster convergence rates in some specific convex-concave games. Overall, while non-uniform averaging is appealing for a variety of reasons, there is currently no theoretical explanation for why it outperforms uniform averaging or why it would converge at all in many settings. In fact, one natural way to show convergence for an EMA scheme would be to show last-iterate convergence.

C Proof of linear convergence rate under PL condition

Here we present a classic proof of Theorem 4.2.

Proof of Theorem 4.2.

f(x^(k+1)) − f(x*) ≤ f(x^(k)) − f(x*) − (1/(2L)) ||∇f(x^(k))||²    (9)
≤ f(x^(k)) − f(x*) − (α/L)(f(x^(k)) − f(x*))    (10)
= (1 − α/L)(f(x^(k)) − f(x*))    (11)

where the first inequality comes from smoothness and the gradient descent update rule, and the second comes from the PL condition. Applying the last line recursively gives the result.
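The contraction in (9)-(11) is easy to observe numerically. The sketch below (our own illustration, with an arbitrarily chosen quadratic) runs gradient descent with η = 1/L on f(x) = (1/2)xᵀAx, which is L-smooth with L = λ_max(A), satisfies the PL condition with α = λ_min(A), and has f* = 0; it then compares f(x^(k)) to the bound (1 − α/L)^k f(x^(0)).

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x) = 0.5 * x^T A x with A ≻ 0 diagonal: f is L-smooth with
# L = λ_max(A), satisfies PL with α = λ_min(A), and f* = 0.
A = np.diag([0.5, 2.0, 10.0])
L = 10.0
alpha = 0.5

def f(x):
    return 0.5 * x @ A @ x

x = rng.standard_normal(3)
f0 = f(x)
for _ in range(50):
    x = x - (1.0 / L) * (A @ x)   # gradient descent with η = 1/L

fk = f(x)
bound = (1 - alpha / L) ** 50 * f0   # the guarantee from (9)-(11)
print(fk, bound)
```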
D Comparison of Theorem 3.4 to [DH19]

In this section, we compare our results in Theorem 3.4 to those of [DH19]. [DH19] prove a rate for SGDA when g is L-smooth and convex in x1, L-smooth and µ-strongly concave in x2, and ∇²_{x1x2} g is some fixed matrix A. The specific setting they consider is to find the unconstrained min-max of a function g: R^{d1} × R^{d2} → R defined as g(x1, x2) = f(x1) + x2ᵀAx1 − h(x2), where f is convex and smooth, h is strongly convex and smooth, and A ∈ R^{d2×d1} has rank d1 (i.e., A has full column rank). Their rate uses the potential function P_t = λa_t + b_t, where:

λ = 2LΓ(L + Γ²/µ)/(µγ²)    (12)
a_k = ||x^(k)_1 − x*_1||    (13)
b_k = ||x^(k)_2 − x*_2||    (14)

and (x*_1, x*_2) is the min-max of the objective. Their rate (Theorem 3.1 in [DH19]) is

P_k ≤ (1 − c µ²γ⁴/(L³Γ²(L + Γ²/µ)))^k P_0    (15)

for some constant c > 0. To translate this rate into bounds on ||ξ||, we can use the smoothness of g in both of its arguments to note that ||∂g/∂x1 (x1, x2)|| = ||∂g/∂x1 (x1, x2) − ∂g/∂x1 (x*_1, x*_2)|| ≤ L||x^(k)_1 − x*_1||, and likewise for x2. So the rate on P_k translates into a rate on ||ξ|| with some additional factor in front.

Their rate and our rate are incomparable: neither is strictly better. For instance, when γ = Γ is much larger than all other quantities, their rate simplifies to (1 − O(µ³/L³))^k, while ours goes to (1 − O(γ²/L_H))^{k/2}. While our convergence rate requires the sufficiently bilinear condition (3) to hold, we do not require convexity in x1 or concavity in x2. Moreover, we allow ∇²_{x1x2} g to change as long as the bounds on its singular values hold, whereas [DH19] require ∇²_{x1x2} g to be a fixed matrix.
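To get a feel for the two bounds, one can plug sample constants into each per-iteration contraction factor. The constants below are made up purely for illustration and do not correspond to any particular problem instance.

```python
# Per-iteration contraction factors being compared, with illustrative
# (made-up) constants: mu, L for [DH19]; gamma, Gamma, L_H, and the
# unspecified constant c from (15) for the comparison above.
mu, L, gamma, Gamma, L_H, c = 0.5, 2.0, 5.0, 5.0, 50.0, 1.0

dh19_factor = 1 - c * mu**2 * gamma**4 / (L**3 * Gamma**2 * (L + Gamma**2 / mu))
our_factor = (1 - gamma**2 / L_H) ** 0.5   # our (1 - O(γ²/L_H))^{k/2} rate, per step

print(dh19_factor, our_factor)
```

Both factors lie in (0, 1) for these values; which is smaller depends entirely on the constants, consistent with the rates being incomparable.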
E Nonconvex-nonconcave setting where Assumption 2.6 and the conditions for Theorem 3.4 hold

In this section we give a concrete example of a nonconvex-nonconcave setting where Assumption 2.6 and the conditions for Theorem 3.4 hold. We choose this example for simplicity, but one can easily construct other, more complicated examples. For our example, we define the following function:

F(x) =  −3(x + π/2)        for x ≤ −π/2
        −3 cos x           for −π/2 < x ≤ π/2
        −cos x + 2x − π    for x > π/2     (16)

The first and second derivatives of F are as follows:

F'(x) =  −3                for x ≤ −π/2
         3 sin x           for −π/2 < x ≤ π/2
         sin x + 2         for x > π/2     (17)

F''(x) =  0                for x ≤ −π/2
          3 cos x          for −π/2 < x ≤ π/2
          cos x            for x > π/2     (18)

From Figure 2, we can see that this function is neither convex nor concave.

Figure 2: Plot of the nonconvex function F(x) defined in (16), as well as its first and second derivatives.

Our objective will be g(x1, x2) = F(x1) + 4x1ᵀx2 − F(x2). Note that L = 3 because |F''(x)| ≤ 3 for all x. Also, γ = Γ = 4, since ∇²_{x1x2} g = 4I.

First, we show that g satisfies Assumption 3.1. We see that g has a critical point at (0, 0). Moreover, g is (L1, L2, L3)-Lipschitz on any finite-sized region of R². Thus, if we assume our algorithm stays within a ball of some radius R, the (L1, L2, L3)-Lipschitz assumption will be satisfied. Since our algorithm does not diverge, and indeed converges at a linear rate to the min-max, this assumption is fairly mild.

Next, we show that g satisfies condition (3). Condition (3) requires γ⁴ > 4LΓ² for g. We see that this holds because γ⁴ = 4⁴ = 256 and 4LΓ² = 4 · 3 · 4² = 192. Therefore, the assumptions of Theorem 3.4 are satisfied. We can also show that this objective satisfies Assumption 2.6, so we get convergence to the min-max of g. We will show that g has only one critical point (at (0, 0)) and that this critical point is a min-max.
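The basic facts used above, that L = 3 and that (0, 0) is a critical point of g, can be spot-checked numerically. The following sketch (our own illustration, not part of the paper's argument) evaluates the piecewise derivatives from (17) and (18) on a grid.

```python
import numpy as np

def Fp(x):
    # first derivative of the piecewise F, as in (17)
    if x <= -np.pi / 2:
        return -3.0
    elif x <= np.pi / 2:
        return 3.0 * np.sin(x)
    else:
        return np.sin(x) + 2.0

def Fpp(x):
    # second derivative of the piecewise F, as in (18)
    if x <= -np.pi / 2:
        return 0.0
    elif x <= np.pi / 2:
        return 3.0 * np.cos(x)
    else:
        return np.cos(x)

# L = 3: |F''| ≤ 3 everywhere (checked on a grid)
max_fpp = max(abs(Fpp(x)) for x in np.linspace(-10.0, 10.0, 100001))

# ∇g at (0, 0) for g(x1, x2) = F(x1) + 4*x1*x2 - F(x2):
# ∂g/∂x1 = F'(x1) + 4*x2 and ∂g/∂x2 = 4*x1 - F'(x2)
g1 = Fp(0.0) + 4 * 0.0
g2 = 4 * 0.0 - Fp(0.0)
print(max_fpp, g1, g2)
```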
We first give a "proof by picture", showing a plot of g in Figure 3 along with plots of g(·, 0) and g(0, ·), which show that (0, 0) is indeed a min-max.

Figure 3: Plot of the nonconvex-nonconcave g(x1, x2) = F(x1) + 4x1ᵀx2 − F(x2).

We can also formally show that (0, 0) is the unique critical point of g and that it is a min-max. We prove this for completeness, although the calculations more or less amount to a simple case analysis. Consider the derivatives of g with respect to x1 and x2:

∂g/∂x1 (x1, x2) =  −3 + 4x2            for x1 ≤ −π/2
                   3 sin x1 + 4x2      for −π/2 < x1 ≤ π/2
                   sin x1 + 2 + 4x2    for x1 > π/2     (19)

∂g/∂x2 (x1, x2) =  3 + 4x1             for x2 ≤ −π/2
                   −3 sin x2 + 4x1     for −π/2 < x2 ≤ π/2
                   −sin x2 − 2 + 4x1   for x2 > π/2     (20)

Figure 4: Plot of g(·, 0). There is only one minimum, and it occurs at x1 = 0.

Figure 5: Plot of g(0, ·). There is only one maximum, and it occurs at x2 = 0.

Observe that if x1 ∈ [−π/2, π/2], then any critical point of g must satisfy 3 sin x1 + 4x2 = 0, which implies x2 ∈ [−3/4, 3/4]. Likewise, if x2 ∈ [−π/2, π/2], then any critical point of g must have x1 ∈ [−3/4, 3/4]. We now show that this implies g has critical points only where x1 and x2 both lie in [−π/2, π/2]. Suppose g had a critical point with x1 ≤ −π/2. Then this critical point must satisfy x2 = 3/4. But by the observation above, a critical point with x2 = 3/4 must have x1 ∈ [−3/4, 3/4], which contradicts x1 ≤ −π/2. Next, suppose g had a critical point with x1 > π/2. Then this critical point must satisfy x2 = −(1/4)(sin x1 + 2), which implies x2 ∈ [−3/4, 3/4]. But then by the observation above, x1 must lie in [−3/4, 3/4], which contradicts x1 > π/2. From this we see that any critical point of g must have x1 ∈ [−π/2, π/2].
We can make analogous arguments to show that any critical point of g must have x2 ∈ [−π/2, π/2]. From this, we conclude that all critical points of g must satisfy:

3 sin x1 + 4x2 = 0    (21)
−3 sin x2 + 4x1 = 0    (22)

These equations imply:

x1 = (3/4) sin x2    (23)
x2 = −(3/4) sin x1    (24)
⇒ x1 = (3/4) sin(−(3/4) sin x1)    (25)
⇒ x2 = −(3/4) sin((3/4) sin x2)    (26)

That is, for every critical point of g, x1 must be a fixed point of h1(x) = (3/4) sin(−(3/4) sin x) and x2 must be a fixed point of h2(x) = −(3/4) sin((3/4) sin x). Since |h1'(x)| < 1 and |h2'(x)| < 1 everywhere, h1 and h2 are contractive maps, so each has exactly one fixed point. Thus g has only one critical point, namely the point (x1, x2) such that x1 is the unique fixed point of h1 and x2 is the unique fixed point of h2. Finally, we observe that (0, 0) is a critical point of g, so it must be the unique critical point of g. One can also see that it is a min-max by looking at the second derivatives of F in (18).

F Proof of Lemma 3.5

To prove Lemma 3.5, we will use the following lemma:

Lemma F.1. Let g(x1, x2) = f(x1) + c x1ᵀx2 − h(x2), where f and h are L-smooth. Then if c > L, g has a unique critical point.

Proof of Lemma 3.5. Condition (3) is as follows:

(γ² + ρ²)(µ² + γ²) − 4L²Γ² > 0.    (27)

Note that in our setting, γ = Γ = c. Next, observe that if ∇²f(x1) and ∇²h(x2) each have a 0 eigenvalue for some x1 and x2, condition (3) reduces to:

c > 2L.    (28)

Then by Lemma F.1, we see that g must have a unique critical point.

Next, we prove Lemma F.1.

Proof of Lemma F.1. Suppose our objective is g(x1, x2) = f(x1) + c x1ᵀx2 − h(x2), where f and h are both L-smooth convex functions.
Critical points of g must satisfy:

∇f(x1) + c x2 = 0    (29)
−∇h(x2) + c x1 = 0    (30)
⇒ x1 = (1/c) ∇h(x2)    (31)
⇒ x2 = −(1/c) ∇f((1/c) ∇h(x2))    (32)

In other words, x2 must be a fixed point of F(z) = −(1/c) ∇f((1/c) ∇h(z)). The function F has a unique fixed point if it is a contractive map. We now show that this is the case when c > L:

||F(u) − F(v)|| = ||(1/c) ∇f((1/c) ∇h(u)) − (1/c) ∇f((1/c) ∇h(v))||    (33)
≤ (L/c) ||(1/c) ∇h(u) − (1/c) ∇h(v)||    (34)
≤ (L²/c²) ||u − v|| < ||u − v||    (35)

where the inequalities follow from the smoothness of f and h. An analogous property can be shown by solving for x1 instead. Thus, if c > L, then g has a unique critical point. Condition (3) is thus a sufficient condition for the existence of a unique critical point for the class of objectives above.

G Applications

In this section, we discuss how our results can be applied in various settings. One simple setting is the Dirac-GAN from [MGN18], whose objective is min_{x1} max_{x2} f(x1ᵀx2) − f(0) for some function f whose derivative is always non-zero. When f(t) = t, the Dirac-GAN is just a bilinear game, so HGD converges globally to the Nash equilibrium (NE) of this Dirac-GAN, as shown in [BRM+18]. Our results prove global convergence rates for HGD on the Dirac-GAN even when a small smooth convex regularizer is added for the discriminator or subtracted for the generator. Moreover, Lemma 2.2 of [MGN18] shows that the diagonal blocks of the Jacobian are 0 at the NE for arbitrary f with non-zero derivative. As such, HGD achieves the convergence rates in this paper in a region around the NE of the Dirac-GAN for arbitrary f with non-zero derivative, even when a small smooth convex regularizer is added for either player.
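For the bilinear case f(t) = t, the contrast between simultaneous GDA and HGD is easy to reproduce. The sketch below (a toy illustration we add, with arbitrarily chosen step size and starting point) runs both methods on g(x1, x2) = x1·x2, for which ξ = (x2, −x1) and ∇H = (x1, x2).

```python
import numpy as np

def xi(z):
    # ξ = (∂g/∂x1, -∂g/∂x2) for the bilinear g(x1, x2) = x1 * x2
    return np.array([z[1], -z[0]])

def grad_H(z):
    # H = 0.5 * ||ξ||^2 = 0.5 * ||z||^2 here, so ∇H = z
    return z.copy()

eta = 0.1
z_gda = np.array([1.0, 1.0])
z_hgd = np.array([1.0, 1.0])
for _ in range(100):
    z_gda = z_gda - eta * xi(z_gda)      # simultaneous GDA: spirals outward
    z_hgd = z_hgd - eta * grad_H(z_hgd)  # HGD: contracts linearly

print(np.linalg.norm(z_gda))
print(np.linalg.norm(z_hgd))
```

Each GDA step multiplies the distance to the saddle by √(1 + η²), while each HGD step multiplies it by (1 − η), matching the divergence/convergence dichotomy discussed above.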
[DH19] list several applications where the min-max formulation is relevant, such as ERM problems with a linear classifier. Given a data matrix A, the ERM problem involves solving min_x ℓ(Ax) + f(x) for some smooth convex loss ℓ and smooth convex regularizer f. This problem has the saddle point formulation min_x max_y yᵀAx − ℓ*(y) + f(x). According to [DH19], this formulation can be advantageous when it admits a finite-sum structure, reduces communication complexity in a distributed setting, or allows some sparsity structure to be exploited. Our results show that linear rates are possible for this problem if A is square, well-conditioned, and sufficiently large compared to ℓ and f.

H Proofs for Section 4

In this section, we prove our main results about the convergence of HGD, starting with some key technical lemmas.

H.1 Proof of Lemma 4.4

Proof. We have ∇H = ξᵀJ. Let u, v ∈ R^d × R^d. Then:

||∇H(u) − ∇H(v)|| = ||ξ(u)ᵀJ(u) − ξ(v)ᵀJ(v)||
= ||ξ(u)ᵀJ(u) − ξ(u)ᵀJ(v) + ξ(u)ᵀJ(v) − ξ(v)ᵀJ(v)||
≤ ||ξ(u)ᵀJ(u) − ξ(u)ᵀJ(v)|| + ||ξ(u)ᵀJ(v) − ξ(v)ᵀJ(v)||
≤ ||ξ(u)|| ||J(u) − J(v)|| + ||ξ(u) − ξ(v)|| ||J(v)||
≤ (L1 L3 + L2²) ||u − v||

H.2 Proof of Lemma 4.5

Proof. Note that

HHᵀ = [M1² + BBᵀ, −M1B − BM2; −(M1B + BM2)ᵀ, M2² + BᵀB] = [M1, −B; −Bᵀ, M2]².

Now let Z = [M1, −B; −Bᵀ, M2]. It suffices to show that every eigenvalue δ of Z satisfies |δ| > ε. For the sake of contradiction, suppose v is an eigenvector of Z with eigenvalue δ such that |δ| ≤ ε. Write v = (v1; v2). Since Zv = δv with |δ| ≤ ε, and M1 ⪰ εI and M2 ⪯ −εI, we must have v1 ≠ 0 and v2 ≠ 0.
Then we hav e:  M 1 v 1 − B v 2 M 2 v 2 − B > v 1  = δ  v 1 v 2  (36) This implies ( M 1 − δ I ) v 1 = B v 2 (37) ( M 2 − δ I ) v 2 = B > v 1 (38) Let ˆ M 1 = M 1 − δ I and let ˆ M 2 = M 2 − δ I . Note that ˆ M 1  0 and ˆ M 2 ≺ 0 . Then we can write v 1 = ˆ M − 1 1 B v 2 . Further , we can substitute into (38) to get ˆ M 2 v 2 = B > ˆ M − 1 1 B v 2 (39) ⇐ ⇒ − ˆ M − 1 2 B > ˆ M − 1 1 B v 2 = − v 2 (40) In other words, v 2 is an eigen vector of − ˆ M − 1 2 B > ˆ M − 1 1 B with eigen v alue − 1 . Let A = − ˆ M − 1 2 and T = B > ˆ M − 1 1 B . Note that A is positiv e definite and T is PSD. Then we have: AT = A 1 / 2 ( A 1 / 2 T A 1 / 2 ) A − 1 / 2 (41) Since A 1 / 2 T A 1 / 2 is PSD, and AT is similar to A 1 / 2 T A 1 / 2 , we must have that all of the eigen v alues of AT are nonnegati ve. This contradicts that v 2 is an eigen vector of AT with eigen v alue − 1 . Thus, all eigen v alues of Z must have magnitude greater than  . H.3 Proof of Lemma 4.6 Pr oof. Suppose λ is an eigen value of H H > with eigenv ector v =  v 1 v 2  . WLOG, suppose λ < σ 2 min ( C ) . Since v is an eigenv ector , we hav e:  A 2 + C C > − AC − C > A C > C   v 1 v 2  = λ  v 1 v 2  (42) Thus, we hav e: ( A 2 + C C > − λI ) v 1 − AC v 2 = 0 (43) − C > Av 1 + ( C > C − λI ) v 2 = 0 (44) 19 Since λ < σ 2 min ( C ) , we have that C > C − λI is inv ertible, so we can write v 2 = ( C > C − λI ) − 1 C > Av 1 from the (44). Plugging this into (43) giv es: ( A 2 + C C > − λI − AC ( C > C − λI ) − 1 C > A ) v 1 = 0 (45) ( A ( I − C ( C > C − λI ) − 1 C > ) A + C C > − λI ) v 1 = 0 (46) Write the SVD of C as C = U Σ V > . 
Then we have:

C(CᵀC − λI)⁻¹Cᵀ = UΣVᵀ(VΣUᵀUΣVᵀ − λI)⁻¹VΣUᵀ    (47)
= UΣVᵀ(V(Σ² − λI)Vᵀ)⁻¹VΣUᵀ    (48)
= UΣVᵀV^(−ᵀ)(Σ² − λI)⁻¹V⁻¹VΣUᵀ    (49)
= UΣ²(Σ² − λI)⁻¹Uᵀ    (50)
= UDUᵀ    (51)

where the second equality follows because VVᵀ = I when C is square and full rank, and where D is a diagonal matrix with D_ii = σ²_i(C)/(σ²_i(C) − λ). Let M = I − D, so M is diagonal with M_ii = −λ/(σ²_i(C) − λ). Then (46) becomes:

(AMA + CCᵀ − λI)v1 = 0    (52)

This means T = AMA + CCᵀ − λI has a 0 eigenvalue. A simple lower bound on the eigenvalues of T is

λ_min(T) ≥ −||A||² λ/(σ²_min(C) − λ) + σ²_min(C) − λ    (53)

We will show that if λ < δ, where δ = (σ²_min(C) + ||A||²)/2 − √(((σ²_min(C) + ||A||²)/2)² − σ⁴_min(C)), then λ_min(T) > 0, which is a contradiction. It suffices to show the following inequality:

−||A||² λ/(σ²_min(C) − λ) + σ²_min(C) − λ > 0    (54)
⇔ σ²_min(C) − λ > ||A||² λ/(σ²_min(C) − λ)    (55)
⇔ (σ²_min(C) − λ)² > ||A||² λ    (56)
⇔ λ² − (2σ²_min(C) + ||A||²)λ + σ⁴_min(C) > 0    (57)

The quadratic in (57) has zeros at

(σ²_min(C) + ||A||²)/2 ± √(((σ²_min(C) + ||A||²)/2)² − σ⁴_min(C))    (58)

Since (57) is a convex parabola, (57) holds whenever λ is less than both zeros, which is clearly true when λ < δ. As a last step, we can give a slightly nicer form of δ using Lemma H.1. Letting x = (σ²_min(C) + ||A||²)/2 and c = σ⁴_min(C), we have δ > σ⁴_min(C)/(2σ²_min(C) + ||A||²). So to reiterate: if λ < σ⁴_min(C)/(2σ²_min(C) + ||A||²) < δ, then (57) holds, so T ≻ 0, which contradicts (52).

Lemma H.1. For x > 0 and c ∈ (0, x²), we have:

x − √(x² − c) > c/(2x)

Proof. x − √(x² − c) = x − x√(1 − c/x²) > x − x(1 − c/(2x²)) = c/(2x).

H.4 Proof of Lemma 4.9

Proof. Let C(x1, x2) = ∇²_{x1x2} g(x1, x2).
For all x ∈ R^d × R^d, C(x1, x2) is square and full rank by assumption, so we can apply Lemma 4.6 with H = J at each point x ∈ R^d × R^d, which gives

λ(JJᵀ) ≥ σ⁴_min(C(x1, x2))/(2σ²_min(C(x1, x2)) + ||∇²_{x1x1} g(x1, x2)||²).

We have ||∇²_{x1x1} g(x1, x2)|| ≤ L since g is smooth in x1, and σ_min(C(x1, x2)) ≥ γ. Then JJᵀ ⪰ (γ⁴/(2γ² + L²)) I, so by Lemma 4.3, H satisfies the PL condition with parameter γ⁴/(2γ² + L²).

H.5 Proof of Lemma 4.10

To prove Lemma 4.10, we use the following lemma:

Lemma H.2. Let H = [A, C; −Cᵀ, −B], where C is square and full rank. Moreover, let

c = (σ²_min(C) + λ_min(A²))(λ_min(B²) + σ²_min(C)) − σ²_max(C)(||A|| + ||B||)²

and assume c > 0. Then if λ is an eigenvalue of HHᵀ = [A² + CCᵀ, −AC − CB; −CᵀA − BCᵀ, B² + CᵀC], we must have

λ ≥ ((σ²_min(C) + λ_min(A²))(λ_min(B²) + σ²_min(C)) − σ²_max(C)(||A|| + ||B||)²)/(2σ²_min(C) + λ_min(A²) + λ_min(B²)).

Proof of Lemma H.2. This proof resembles that of Lemma 4.6. Let v = (v1; v2) be an eigenvector of HHᵀ with eigenvalue λ. Expanding HHᵀv = λv, we have:

(A² + CCᵀ − λI)v1 − (AC + CB)v2 = 0    (59)
−(CᵀA + BCᵀ)v1 + (B² + CᵀC − λI)v2 = 0    (60)

Writing M = B² + CᵀC − λI, this gives

v2 = M⁻¹(CᵀA + BCᵀ)v1    (61)
⇒ (−(AC + CB)M⁻¹(CᵀA + BCᵀ) + A² + CCᵀ − λI)v1 = 0    (62)

where M is invertible because CᵀC is positive definite and, WLOG, we may assume λ < λ_min(CᵀC) = σ²_min(C). We will show that if the assumptions in the statement of the lemma hold, then we get a contradiction when λ is below some positive threshold.
In particular, we show that the following inequality holds for small enough λ (this inequality contradicts (62)):

σ²_min(C) − λ + λ_min(A²) > σ²_max(C)(||A|| + ||B||)² ||M⁻¹||
⇐ σ²_min(C) − λ + λ_min(A²) > σ²_max(C)(||A|| + ||B||)²/(λ_min(B²) + σ²_min(C) − λ)
⇔ λ² − (2σ²_min(C) + λ_min(A²) + λ_min(B²))λ + (σ²_min(C) + λ_min(A²))(λ_min(B²) + σ²_min(C)) − σ²_max(C)(||A|| + ||B||)² > 0

Letting b = 2σ²_min(C) + λ_min(A²) + λ_min(B²), we can solve for the zeros of the quadratic above:

λ = (b ± √(b² − 4c))/2    (63)

We have c > 0 by assumption, so this quadratic has only positive roots. Note also that b² > 4c, so the roots are not imaginary. Then we see that if λ < δ = (b − √(b² − 4c))/2, we get a contradiction. Using Lemma H.1, we see that δ > c/b. So we have proven that λ < c/b gives a contradiction; hence we must have λ ≥ c/b, i.e.,

λ ≥ ((σ²_min(C) + λ_min(A²))(λ_min(B²) + σ²_min(C)) − σ²_max(C)(||A|| + ||B||)²)/(2σ²_min(C) + λ_min(A²) + λ_min(B²)).

Proof of Lemma 4.10. The proof is very similar to that of Lemma 4.9. Let C(x1, x2) = ∇²_{x1x2} g(x1, x2). For all x ∈ R^d × R^d, C(x1, x2) is square and full rank with the stated bounds on its singular values by assumption. Moreover, (3) holds, so we can apply Lemma H.2 with H = J at each point x ∈ R^d × R^d. Using the fact that g is smooth in x1 and x2, this gives

λ(JJᵀ) ≥ ((σ²_min(C(x1, x2)) + λ_min(A²))(σ²_min(C(x1, x2)) + µ²) − 4L²σ²_max(C(x1, x2)))/(2σ²_min(C(x1, x2)) + λ_min(A²) + µ²).
Using the bounds on the singular values of C(x_1, x_2), we have

JJ^⊤ ⪰ [((γ² + λ_min(A²))(γ² + μ²) − 4L²Γ²) / (2γ² + λ_min(A²) + μ²)] I,

so by Lemma 4.3, H satisfies the PL condition with parameter ((γ² + λ_min(A²))(γ² + μ²) − 4L²Γ²) / (2γ² + λ_min(A²) + μ²).

I Proof of Theorem 5.1

In this section, we prove Theorem 5.1. The proof leverages the following theorem from [KNS16].⁵

Theorem I.1 ([KNS16]). Assume that f is L-smooth, has a non-empty solution set X*, and satisfies the PL condition with parameter α. Let v be a stochastic estimate of ∇f such that E[v] = ∇f. Assume E[‖v(x^(k))‖²] ≤ C² for all x^(k) and some C. If we use the SGD update x^(k+1) = x^(k) − η_k v(x^(k)) with η_k = (2k + 1) / (2α(k + 1)²), then we get a convergence rate of

E[f(x^(k)) − f*] ≤ LC² / (2kα²).   (64)

If instead we use a constant step size η_k = η < 1/(2α), then we obtain a linear convergence rate up to a solution level that is proportional to η:

E[f(x^(k)) − f*] ≤ (1 − 2αη)^k [f(x^(0)) − f*] + LC²η / (4α).   (65)

Now we can prove Theorem 5.1.

Proof of Theorem 5.1. If H satisfies the PL condition with parameter α, then we can apply Theorem I.1 to the stochastic variant of HGD. Since H* = 0, we get

E[(1/2)‖ξ(x^(k))‖²] ≤ L_H C² / (2kα²).   (66)

The theorem follows from Jensen's inequality, which implies that E[‖ξ(x^(k))‖] ≤ √(E[‖ξ(x^(k))‖²]).

J Proof of Theorem 5.2

In this section, we prove our main result about Consensus Optimization, namely Theorem 5.2. The key technical component is showing that HGD still performs well even under small arbitrary perturbations, as the following theorem shows:

Theorem J.1. Let x^(k+1) = x^(k) − η ∇H(x^(k)) + η_v v^(k), where v^(k) is an arbitrary vector such that ‖v^(k)‖ = ‖ξ(x^(k))‖. Let g be L_g-smooth and suppose H satisfies the PL condition with parameter α.
Let η = 1/L_H and let η_v = α/(4 L_H L_g). Then we get the following convergence:

‖ξ(x^(k))‖ ≤ (1 − α/(4L_H))^k ‖ξ(x^(0))‖.   (67)

From Theorem J.1, it is simple to prove Theorem 5.2.

⁵ The actual theorem in [KNS16] is stated in a slightly different way, but it is equivalent to our presentation.

Proof of Theorem 5.2. Note that the CO update (5) with γ = 4L_g/α is exactly the update in Theorem J.1 with v^(k) = −ξ(x^(k)), so we get the desired convergence rate.

Our result treats the SGDA component as an adversarial perturbation even though it is not one, which suggests that this analysis may be improvable. It would be preferable to apply the PL-based analysis that we used for HGD directly, but this does not seem to work for CO, since CO is not gradient descent on any objective.

Now we prove Theorem J.1.

Proof of Theorem J.1. Let x^(k+1/2) = x^(k) − η ∇H(x^(k)), so x^(k+1) = x^(k+1/2) + η_v v^(k). From (11) in the proof of Theorem 4.2 with η = 1/L_H, we get

‖ξ(x^(k+1/2))‖ ≤ (1 − α/L_H)^{1/2} ‖ξ(x^(k))‖ ≤ (1 − α/(2L_H)) ‖ξ(x^(k))‖.   (68)

Next, the triangle inequality and the smoothness of g imply:

‖ξ(x^(k+1))‖ ≤ ‖ξ(x^(k+1/2))‖ + ‖ξ(x^(k+1)) − ξ(x^(k+1/2))‖   (69)
            ≤ ‖ξ(x^(k+1/2))‖ + L_g ‖x^(k+1) − x^(k+1/2)‖   (70)
            = ‖ξ(x^(k+1/2))‖ + L_g η_v ‖v^(k)‖   (71)

Using the above result and ‖v^(k)‖ = ‖ξ(x^(k))‖, we get:

‖ξ(x^(k+1))‖ ≤ (1 − α/(2L_H) + L_g η_v) ‖ξ(x^(k))‖.   (72)

Setting η_v = α/(4 L_H L_g) gives the result.
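The perturbed-HGD recursion of Theorem J.1 can be sanity-checked numerically. The sketch below (ours, not the authors' code) runs HGD on H(x) = ½‖ξ(x)‖² for the convex-concave test objective of Appendix K, g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) with f(x) = log(1 + eˣ) and c = 10, while injecting an arbitrary random-direction perturbation of norm ‖ξ(x^(k))‖ at every step; the step sizes eta and eta_v are illustrative choices, not the exact constants 1/L_H and α/(4 L_H L_g):

```python
import numpy as np

# Sanity check of the Theorem J.1 recursion: HGD steps on H = 0.5*||xi||^2
# plus an *arbitrary* perturbation v^(k) with ||v^(k)|| = ||xi(x^(k))||.
# Objective: g(x1, x2) = f(x1) + c*x1*x2 - f(x2), f(x) = log(1 + e^x), c = 10.
c = 10.0
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))   # f'(t)

def xi(x):
    """Signed gradient xi = (dg/dx1, -dg/dx2)."""
    return np.array([sigmoid(x[0]) + c * x[1], sigmoid(x[1]) - c * x[0]])

def grad_H(x):
    """grad H = J^T xi, where J is the Jacobian of xi (closed form here)."""
    d1 = sigmoid(x[0]) * (1 - sigmoid(x[0]))   # f''(x1)
    d2 = sigmoid(x[1]) * (1 - sigmoid(x[1]))   # f''(x2)
    J = np.array([[d1, c], [-c, d2]])
    return J.T @ xi(x)

rng = np.random.default_rng(0)
x = np.array([5.0, 5.0])                       # same initialization as Appendix K
eta, eta_v = 0.01, 0.001                       # illustrative: eta_v << eta
norms = [np.linalg.norm(xi(x))]
for _ in range(500):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)                     # random unit direction
    v = norms[-1] * u                          # ||v^(k)|| = ||xi(x^(k))||
    x = x - eta * grad_H(x) + eta_v * v
    norms.append(np.linalg.norm(xi(x)))

print(norms[0], norms[-1])                     # ||xi|| decays despite the perturbation
```

Even with this adversarial-style perturbation, ‖ξ(x^(k))‖ decays geometrically, which is the qualitative behavior (67) predicts.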
Note that for this result, we assume g is L_g-smooth in x_1 and x_2 jointly, whereas in other parts of the paper we assume g is smooth in x_1 or x_2 separately. If g is L-smooth in x_1, L-smooth in x_2, and ‖∇²_{x_1 x_2} g(x_1, x_2)‖ ≤ L_c for all x_1, x_2, then g is (L + L_c)-smooth.

K Experiments

In this section, we present experimental results showing how SGDA, HGD, and CO perform on a convex-concave objective and on a nonconvex-nonconcave objective. In the CO plots, γ refers to the γ parameter in the CO algorithm. All experiments are initialized at (5, 5). The step size η for HGD and SGDA is always 0.01, while the step size η for CO with γ ∈ {0.1, 1, 10} is {0.1, 0.01, 0.001}, respectively; increasing γ increases the effective step size, so η must be decreased accordingly. The experiments were all run on a standard 2017 MacBook Pro.

The main takeaways from the experiments are that CO with low γ does not converge when the bilinear term is large, while CO with high γ and HGD both converge for small and large bilinear terms. When the bilinear term is large, CO with high γ and HGD both converge in fewer iterations (for the same step size). We did not optimize the step size, so this effect may change if the optimal step size is chosen for each setting.

K.1 Convex-concave objective

The convex-concave objective we use is g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2), where f(x) = log(1 + eˣ). We show a plot of f in Figure 6. When c = 3, SGDA converges, and when c = 10, SGDA diverges. We note that HGD and CO (for large enough γ) tend to converge faster when c is larger.

Figure 6: Plot of f(x) = log(1 + eˣ) with its first and second derivatives.
This is a convex, smooth function.

K.1.1 SGDA converges (c = 3)

These plots show g when c = 3, so SGDA converges, as does CO with γ = 0.1.

Figure 7: SGDA vs. HGD for 300 iterations for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 3. SGDA slowly circles towards the min-max, while HGD goes directly to the min-max.

Figure 8: CO for 100 iterations with different values of γ for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 3. The γ = 0.1 curve slowly circles towards the min-max, while the other curves go directly to the min-max.

Figure 9: HGD vs. CO for 100 iterations for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 3, with different values of γ.

K.1.2 SGDA diverges (c = 10)

These plots show g when c = 10, so SGDA diverges, as does CO with γ = 0.1. Note that in this case, CO with γ ≥ 1 and HGD both require very few iterations (typically about 2) to reach the min-max.

Figure 10: SGDA vs. HGD for 150 iterations for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 10. SGDA slowly circles away from the min-max, while HGD goes directly to the min-max.

Figure 11: CO for 15 iterations with different values of γ for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 10. The γ = 0.1 curve makes a cyclic pattern around the min-max, while the other curves go directly to the min-max.

Figure 12: HGD vs. CO for 15 iterations with different values of γ for g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2) where f(x) = log(1 + eˣ) and c = 10.

K.2 Nonconvex-nonconcave objective

The nonconvex-nonconcave objective we use is g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2), where F is defined as in (16) in Appendix E.
F(x) = −3(x + π/2)        for x ≤ −π/2
       −3 cos x           for −π/2 < x ≤ π/2
       −cos x + 2x − π    for x > π/2        (73)

We show a plot of F in Figure 13.

Figure 13: Plot of the nonconvex function F(x) defined in (16), with its first and second derivatives.

As in the convex-concave case, when c = 3, SGDA converges, and when c = 10, SGDA diverges. Again, HGD and CO (for large enough γ) tend to converge faster when c is larger.

K.2.1 SGDA converges (c = 3)

These plots show g when c = 3, so SGDA converges, as does CO with γ = 0.1.

Figure 14: SGDA vs. HGD for 300 iterations for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 3. SGDA slowly circles towards the min-max, while HGD goes more directly to the min-max.

Figure 15: CO for 100 iterations with different values of γ for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 3. The γ = 0.1 curve slowly circles towards the min-max, while the other curves go more directly to the min-max.

Figure 16: HGD vs. CO for 100 iterations for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 3, with different values of γ.

K.2.2 SGDA diverges (c = 10)

These plots show g when c = 10, so SGDA diverges, as does CO with γ = 0.1. Note that in this case, CO with γ ≥ 1 and HGD both require very few iterations (typically about 2) to reach the min-max.

Figure 17: SGDA vs. HGD for 150 iterations for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 10. SGDA slowly circles away from the min-max, while HGD goes directly to the min-max.

Figure 18: CO for 15 iterations with different values of γ for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 10. The γ = 0.1 curve makes an erratic cycle around the min-max, slowly diverging, while the other curves go directly to the min-max.

Figure 19: HGD vs. CO for 15 iterations with different values of γ for g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73) and c = 10.

K.3 Convergence of HGD for the nonconvex-nonconcave objective with different-sized bilinear terms

In this section, we look at the convergence of HGD for the same objective as in the previous section, namely g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2), where F is defined as in (16) in Appendix E:

F(x) = −3(x + π/2)        for x ≤ −π/2
       −3 cos x           for −π/2 < x ≤ π/2
       −cos x + 2x − π    for x > π/2        (74)

In this case, we vary c to show that HGD converges faster for higher c and does not converge for sufficiently low c.

Figure 20: Distance to the min-max for HGD iterates for different values of c in the objective g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73).

Figure 21: Gradient norm for HGD iterates for different values of c in the objective g(x_1, x_2) = F(x_1) + c x_1 x_2 − F(x_2) where F(x) is defined in (73). Since all runs are initialized at (5, 5), increasing c also increases the initial gradient norm. Nonetheless, HGD still converges faster in the cases with higher c.
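The qualitative comparison above is easy to reproduce. The following sketch is our own minimal reimplementation (not the authors' experiment code): it runs deterministic SGDA, HGD, and CO with γ = 1 on the convex-concave objective g(x_1, x_2) = f(x_1) + c x_1 x_2 − f(x_2), f(x) = log(1 + eˣ), from the initialization (5, 5), with the step sizes stated in the text:

```python
import numpy as np

# Minimal reimplementation of the Appendix K comparison on the convex-concave
# objective g(x1, x2) = f(x1) + c*x1*x2 - f(x2), f(x) = log(1 + e^x).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))   # f'(t)

def xi(x, c):
    """Signed gradient field xi = (dg/dx1, -dg/dx2)."""
    return np.array([sigmoid(x[0]) + c * x[1], sigmoid(x[1]) - c * x[0]])

def jac_xi(x, c):
    """Jacobian J of xi, in closed form for this g."""
    d1 = sigmoid(x[0]) * (1 - sigmoid(x[0]))   # f''(x1)
    d2 = sigmoid(x[1]) * (1 - sigmoid(x[1]))   # f''(x2)
    return np.array([[d1, c], [-c, d2]])

def run(method, c, iters=300, eta=0.01, gamma=1.0):
    """Return the final gradient norm ||xi|| after `iters` steps from (5, 5)."""
    x = np.array([5.0, 5.0])
    for _ in range(iters):
        v = xi(x, c)
        if method == "sgda":
            x = x - eta * v
        elif method == "hgd":                  # descent on H = 0.5*||xi||^2
            x = x - eta * (jac_xi(x, c).T @ v)
        elif method == "co":                   # CO: xi step plus gamma * HGD step
            x = x - eta * (v + gamma * (jac_xi(x, c).T @ v))
    return np.linalg.norm(xi(x, c))

for c in (3.0, 10.0):
    print(c, run("sgda", c), run("hgd", c), run("co", c))
```

With c = 10, SGDA's gradient norm grows as the iterates spiral outward, while HGD and CO (γ = 1, η = 0.01, matching the pairing in the text) drive ‖ξ‖ to zero very quickly, consistent with Figures 10-12.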