On the Adequacy of Untuned Warmup for Adaptive Optimization

Jerry Ma (Booth School of Business, University of Chicago; U.S. Patent and Trademark Office, Department of Commerce)
Denis Yarats (Courant Institute of Mathematical Sciences, New York University; Facebook AI Research)
research@jerryma.net, denisyarats@cs.nyu.edu
Abstract

Adaptive optimization algorithms such as Adam are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Appendix available at https://arxiv.org/abs/1910.04209.

1 Introduction

Stochastic gradient-based optimization serves as the workhorse training approach for many classes of parametric models, including neural networks. Stochastic gradient descent and its various first-order cousins (Polyak 1964; Nesterov 1983) have enabled numerous advances in deep learning across domains (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016; Gehring et al. 2017). More recently, adaptive optimization algorithms have become prevalent in training the largest deep learning models. These adaptive methods, which include Adagrad (Duchi, Hazan, and Singer 2010), RMSProp (Hinton, Srivastava, and Swersky 2012), and Adam (Kingma and Ba 2014), scale the step size for each individual parameter based on various gradient moments. Many practitioners have adopted the Adam algorithm for general-purpose use; notably, the preponderance of recent state-of-the-art results in natural language processing (Devlin et al. 2018; Radford et al. 2019; Liu et al. 2019b; Brown et al. 2020) have employed Adam, demonstrating the algorithm's ability to effectively train neural networks with parameter counts from 100 million to several billion.

In these large-scale settings, Adam's global learning rate is usually annealed with a "warmup schedule" which promotes early-stage training stability by regulating the size of the parameter updates. The prevalent warmup schedule is a simple linear warmup, in which the global learning rate starts at zero and increases by a constant at each iteration until reaching its intended value.¹ The parameters of these warmup schedules are typically tuned for each problem setting and model.

¹ Linear warmup has also been deployed for first-order optimization; see, for example, Goyal et al. (2017).

Liu et al. (2020) performed an analysis of Adam with warmup, concluding that Adam requires a warmup schedule to mitigate the large or divergent variance of the per-parameter scale term. They then propose the rectified Adam ("RAdam") algorithm, which automatically corrects for this high variance.
Liu et al. highlight the robustness of RAdam, noting in particular that RAdam reduces or eliminates the need for tuning warmup schedules when using Adam. RAdam has been applied to domains including generative modeling (Yamamoto, Song, and Kim 2020), natural language processing (Nguyen and Salazar 2019), and video retrieval (Liu et al. 2019a).

Contributions

Our contributions in this work are as follows:

Reexamining RAdam and the variance-based motivation for warmup. We dive into the inner operation of RAdam and find that it is precisely Adam with a fixed warmup schedule, with the only deviation being to perform four iterations of heavy-ball momentum (Polyak 1964) at the outset. We then argue that the variance-based motivation for warmup is impaired, as it overlooks the correlation between the first and second moment estimators, which is crucial for understanding the actual parameter updates applied by Adam.

Analyzing Adam's early-stage update magnitudes. Shifting focus from gradients to parameter updates, we then perform a simulation-based analysis of the magnitudes of Adam's parameter updates. We find that even at a simulated local minimum of the objective, Adam exhibits considerable non-regularity in its early-stage parameter updates, shedding light on why Adam may require learning rate warmup to a greater extent than first-order optimization methods.

Demonstrating the sufficiency of untuned warmup. We provide some simple and intuitive "rule-of-thumb" warmup schedules for Adam, all of which require no tuning. As our main empirical result, we demonstrate that these schedules result in substantively identical performance and training dynamics to those of RAdam across a wide range of models, problem settings, and hyperparameters, indicating that any claimed benefits can be achieved with lower complexity using off-the-shelf optimization tools. As a sensible untuned default, we recommend linear warmup over $2 \cdot (1 - \beta_2)^{-1}$ iterations.

2 Preliminaries

We begin with notation and a brief review of stochastic gradient descent and Adam.

Primitives. $\theta \in \mathbb{R}^p$ denotes a vector of model parameters. $\mathcal{L}(\theta) : \mathbb{R}^p \to \mathbb{R}$ denotes a loss function to be minimized over the model parameters. $\hat{\mathcal{L}}(\theta) : \mathbb{R}^p \to \mathbb{R}$ denotes an unbiased approximator of the loss function (e.g., over a minibatch). $\nabla \mathcal{L}(\theta)$ and $\nabla \hat{\mathcal{L}}(\theta)$ denote the gradients of $\mathcal{L}(\theta)$ and $\hat{\mathcal{L}}(\theta)$, respectively. The terms $\theta$, $\hat{\mathcal{L}}(\theta)$, and $\nabla \hat{\mathcal{L}}(\theta)$ are subscriptable by $t \geq 0$, the optimization time step ("training iteration"). $\theta_0$ represents the initial model parameters.

We write optimization algorithms as per-iteration procedures ("update rules"), taking the basic form:

$$\theta_t \leftarrow \theta_{t-1} - \underbrace{\{\,\ldots\,\}}_{\text{update step}}$$

Stochastic gradient descent. The SGD algorithm, parameterized by learning rate $\alpha > 0$, performs the following procedure at each iteration $t$:

$$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \nabla \hat{\mathcal{L}}_{t-1}(\theta_{t-1}) \tag{1}$$

Adam. The Adam algorithm (Kingma and Ba 2014), parameterized by global learning rate $\alpha > 0$, discount factors $\beta_1, \beta_2 \in (0, 1)$, and stability constant $\epsilon > 0$, performs the following procedure at each iteration $t$:

$$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla \hat{\mathcal{L}}_{t-1}(\theta_{t-1}) \tag{2}$$

$$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \left[ \nabla \hat{\mathcal{L}}_{t-1}(\theta_{t-1}) \right]^2 \tag{3}$$

$$\theta_t \leftarrow \theta_{t-1} - \alpha \left[ \frac{(1 - \beta_1^t)^{-1} \cdot m_t}{\sqrt{(1 - \beta_2^t)^{-1} \cdot v_t} + \epsilon} \right] \tag{4}$$

where $m, v \in \mathbb{R}^p$ denote auxiliary memory (interpretable as first moment and second moment estimators of $\nabla \hat{\mathcal{L}}_t$, respectively). By convention, $m_0 = v_0 = 0$.
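For concreteness, Equations 2–4 can be written as a minimal NumPy sketch (illustrative only; this is not the implementation used in our experiments):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration (Equations 2-4); t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad                      # first moment estimate (Eq. 2)
    v = beta2 * v + (1 - beta2) * grad ** 2                 # second moment estimate (Eq. 3)
    m_hat = m / (1 - beta1 ** t)                            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # update step (Eq. 4)
    return theta, m, v
```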
Warmup schedules. For any optimization algorithm parameterized with a learning rate $\alpha$, a warmup schedule $\omega$ can be applied. $\omega$ is a sequence of "warmup factors" $\omega_t \in [0, 1]$, which serve to dampen the step size of each iteration $t$. Specifically, a warmup schedule is imposed by replacing $\alpha$ with $\alpha_t = \alpha \cdot \omega_t$ in the algorithm's update rule. Perhaps the most common functional form for the schedule is linear warmup, parameterized by a "warmup period" $\tau$:

$$\omega_t^{\text{linear},\tau} = \min\left(1, \frac{1}{\tau} \cdot t\right) \tag{5}$$

Rectified Adam. The RAdam algorithm (Liu et al. 2020), parameterized identically to Adam, performs the following procedure at each iteration $t$:

$$\rho_\infty \leftarrow 2 / (1 - \beta_2) - 1 \tag{6}$$

$$\rho_t \leftarrow \rho_\infty - 2 t \cdot \beta_2^t / (1 - \beta_2^t) \tag{7}$$

$$\omega_t \leftarrow \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}} \tag{8}$$

$$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla \hat{\mathcal{L}}_{t-1}(\theta_{t-1}) \tag{9}$$

$$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \left[ \nabla \hat{\mathcal{L}}_{t-1}(\theta_{t-1}) \right]^2 \tag{10}$$

$$\theta_t \leftarrow \theta_{t-1} - \begin{cases} \alpha \cdot (1 - \beta_1^t)^{-1} \cdot m_t & \rho_t \leq 4 \\[4pt] \alpha \cdot \omega_t \cdot \dfrac{(1 - \beta_1^t)^{-1} \cdot m_t}{\sqrt{(1 - \beta_2^t)^{-1} \cdot v_t} + \epsilon} & \rho_t > 4 \end{cases} \tag{11}$$

3 Rectified Adam, Adaptive Variance, and Update Steps

We begin by uncovering the precise behavior of RAdam, before delving into its underlying variance-based motivation.

3.1 RAdam: Perform 4 Iterations of Momentum SGD, Then Use Adam with Fixed Warmup

Liu et al. describe RAdam as having two modes of operation: "divergent variance" and "convergent variance", corresponding respectively to the cases $\rho_t \leq 4$ and $\rho_t > 4$ in Equation 11. In the "divergent" phase, RAdam performs a variant of heavy-ball momentum SGD (Polyak 1964).² Then, in the "convergent" phase, RAdam performs Adam, with the learning rate scaled down by $\omega_t$.

However, this is not dynamic scaling based on the training-time behavior of the optimizer or the distribution of the gradients. Rather, $\omega_t$ is a deterministic function of solely $t$ and $\beta_2$. Thus, the "convergent" phase is simply Adam with a fixed warmup schedule. We find that for all practically relevant values of $\beta_2$, the condition $\rho_t \leq 4$ is simply $t \leq 4$:

Fact 3.1. Assume that $0.8 \leq \beta_2 < 1$ and $t$ is a positive integer. Then, for $\rho_t$ as defined in Equation 7:

$$\rho_t \leq 4 \iff t \leq 4$$

Proof. See Appendix B.1.

Thus follows a layman's description of RAdam:

1. Perform four iterations of heavy-ball momentum.
2. At iteration five and beyond, use Adam with a fixed warmup schedule.

² The departure from standard heavy-ball momentum is in the bias correction by $(1 - \beta_1^t)$.
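Fact 3.1 is easy to check numerically. The following sketch (illustrative only) evaluates $\rho_t$ from Equation 7 and locates the first iteration at which $\rho_t > 4$ for several practical values of $\beta_2$:

```python
import math

def rho(t, beta2):
    """Equation 7: rho_t as a function of t and beta2."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

def omega_radam(t, beta2):
    """RAdam's fixed warmup factor (Equation 8); defined only where rho_t > 4."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    r = rho(t, beta2)
    return math.sqrt((r - 4) * (r - 2) * rho_inf /
                     ((rho_inf - 4) * (rho_inf - 2) * r))

for beta2 in (0.8, 0.99, 0.999):
    crossing = min(t for t in range(1, 100) if rho(t, beta2) > 4)
    # Prints 5 in every case: rho_t <= 4 exactly when t <= 4.
    print(beta2, crossing, round(omega_radam(crossing, beta2), 4))
```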
Figure 1: Analysis of gradients and updates during the training of a simple feed-forward network on the EMNIST digit recognition task with the Adam optimizer; see Appendix A.5 for comprehensive details. (a) Median coefficient of variation of gradients (calculated over 256 trials). (b) Pearson correlation between Adam's $|m_t|$ and $\sqrt{v_t}$. (c) Median parameter update magnitude ($\epsilon = 0$).

On its face, using four iterations of momentum at the beginning of training seems arbitrary. In preliminary experimentation (including the experimental settings described in Section 5), we performed ablations over the following options for these four initial iterations:

• Do absolutely nothing.
• Use Adam with learning rate $\alpha \cdot \omega_5$ (i.e., do exactly what RAdam does at the fifth iteration).
• Use Adam with linear warmup to $\alpha \cdot \omega_5$ (i.e., gradually warm up the learning rate to RAdam's fifth iteration).

As expected for a decision affecting only four training iterations, the practical difference between these choices is uniformly negligible. Thus, the only possible benefit of RAdam stems from its custom warmup schedule $\omega_t$ for the fifth iteration and beyond. We revisit this topic in Sections 4 and 5.

3.2 Variance-Based Motivation for RAdam and Warmup

Given the arbitrary nature of RAdam's operation, we proceed to investigate the motivation for RAdam, which Liu et al. also identify as the underlying motivation for warmup's crucial role in Adam.

Liu et al. focus their principal analysis on the term $\sqrt{\frac{1 - \beta_2^t}{v_t}}$. Fixing $\epsilon = 0$, this term can be interpreted as Adam's "adaptive learning rate", which scales the global learning rate for each parameter before computing Adam's final update for that parameter. They identify that the quantity $\mathrm{Var}\left[\sqrt{\frac{1 - \beta_2^t}{v_t}}\right]$ does not exist during the first few training iterations,³ and even after converging to a finite value, continues to remain elevated for some time.

³ The authors approximate $\frac{1 - \beta_2^t}{v_t}$ as having a scaled inverse $\chi^2$ distribution, under the assumption that (1) all gradients are i.i.d. and zero-mean, and (2) a simple average approximates an exponential moving average.

Perhaps the most immediate observation is that early-stage gradients are not zero-mean. In fact, at the beginning of optimization, the expected magnitude of the gradients $\nabla \hat{\mathcal{L}}_t(\theta_t)$ (i.e., the absolute value of the deterministic gradients $\nabla \mathcal{L}(\theta_t)$) should dominate the gradient variance, since a randomly-initialized model is exceedingly unlikely to be near a local minimum of $\mathcal{L}(\theta_t)$. Indeed, on a demonstration training run of a feed-forward network on the EMNIST digit recognition task, we observe that the median coefficient of variation of the gradients (Figure 1a) starts at below 1, indicating that for most parameters, the expected value of the gradient exceeds the standard deviation during early-stage training. Only beyond training iteration 50 does the coefficient of variation consistently exceed 1. Relaxing the zero-mean assumption decreases $\mathrm{Var}\left[\frac{1 - \beta_2^t}{v_t}\right]$ considerably.⁴

⁴ Although Liu et al. do not comment on the relative magnitudes of $\mathrm{Var}\left[\frac{1 - \beta_2^t}{v_t}\right]$, their Fig. 9 reveals that coefficients of variation below 1 dampen that quantity by an order of magnitude or more.

More important, however, is that $m_t$ and $v_t$ are not at all independent. Figure 1b reveals that in the EMNIST setting, the absolute value of the first moment estimator ($|m_t|$) is extremely correlated with the square root of the second moment estimator ($\sqrt{v_t}$). Since Adam's parameter updates are proportional to $m_t / \sqrt{v_t}$, high correlation between these two quantities implies that the magnitude of the updates is quite regular, despite the high variance of $\sqrt{\frac{1 - \beta_2^t}{v_t}}$.

Indeed, during the first training iteration ($t = 1$), it is guaranteed that $|m_t| = \sqrt{v_t}$ for all parameters (after bias correction), making all Adam parameter updates either $-\alpha$ or $\alpha$ (assuming $\epsilon = 0$). Thus, even though $\mathrm{Var}\left[\frac{1 - \beta_2^t}{v_t}\right]$ is divergent, the magnitude of the parameter updates themselves is constant. Ironically, it is precisely when the adaptive learning rate's variance is "divergent" that the actual parameter update magnitudes have zero variance. This suggests that the adaptive learning rate may not be the best medium of analysis for understanding the role of warmup in Adam.
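The $t = 1$ observation can be verified in a few lines (illustrative sketch; the gradient distribution below is arbitrary and deliberately non-zero-mean):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(3.0, 10.0, size=1_000)    # arbitrary first-iteration gradients
beta1, beta2 = 0.9, 0.999

m1 = (1 - beta1) * g                     # Eq. 2 with m_0 = 0
v1 = (1 - beta2) * g ** 2                # Eq. 3 with v_0 = 0
# Eq. 4 with eps = 0 and alpha = 1: bias-corrected ratio m_hat / sqrt(v_hat)
update = (m1 / (1 - beta1)) / np.sqrt(v1 / (1 - beta2))

print(np.unique(np.round(np.abs(update), 12)))  # -> [1.]: every update is +/- alpha
```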
3.3 High Initial Update Step Magnitudes Necessitate Warmup in Adam

We provide an alternative view of the frequent necessity of learning rate warmup when using Adam. We do so by directly investigating the magnitudes of the update steps, perhaps the most proximate determinant of training stability.

In stochastic gradient descent, parameter updates are simply the gradients multiplied by the learning rate. Warmup for SGD can thus be motivated as mitigating the large expected magnitudes of the gradients (directly proportional to update magnitudes) and the rapid change in gradients at the beginning of training (Goyal et al. 2017; Gotmare et al. 2019). Similar logic can be employed for adaptive methods.

On the other hand, if a model's gradients have near-zero means and low gradient variances, the update steps are similarly well-regulated and optimization via SGD can be stable without any learning rate warmup. For example, a nearly-converged model (thus having near-zero expected gradients and low gradient magnitudes) trained via SGD can have its optimization be stably restarted without learning rate warmup.

This is not the case with Adam. We proceed to computationally analyze the magnitude of Adam's update step over the course of training. Specifically, we demonstrate via simulation that even when the model parameters $\theta_t$ are initialized at an idealized local minimum of $\mathcal{L}(\theta)$ (i.e., $\nabla \hat{\mathcal{L}}_t(\theta_t)$ has zero mean and is i.i.d. across time), the magnitude of Adam's update steps will still be quite high at the start of training, only gradually decaying toward a stationary distribution.

Simulation configuration. All gradients are simulated as i.i.d. normal variables with zero mean and constant isotropic variance $10^{-9}$, thus approximating the optimization dynamics at an exact local minimum of $\mathcal{L}(\theta)$.⁵ We sample independent gradient trajectories (each 1000 iterations long) for 25000 parameters. We then run the Adam optimizer with these sampled gradients and evaluate the distribution of the update step magnitudes (before multiplication by the global learning rate $\alpha$) at each iteration. The Adam optimizer configuration is $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 0$.

⁵ Note that the behavior of Adam in this setting is invariant to the choice of variance constant.
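A compact reimplementation of this simulation might look as follows (an illustrative sketch of the setup described above, not our exact experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_iters = 25_000, 1_000
beta1, beta2 = 0.9, 0.999                # eps = 0, per the simulation configuration

m = np.zeros(n_params)
v = np.zeros(n_params)
medians = []
for t in range(1, n_iters + 1):
    g = rng.normal(0.0, np.sqrt(1e-9), size=n_params)  # i.i.d. zero-mean gradients
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    step = (m / (1 - beta1 ** t)) / np.sqrt(v / (1 - beta2 ** t))
    medians.append(np.median(np.abs(step)))            # |update| before scaling by alpha

print(medians[0], medians[49], medians[-1])  # starts at 1.0, decays toward ~0.15
```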
Figure 2: Distribution of Adam's update step magnitudes at a simulated local minimum of $\mathcal{L}(\theta)$ (quantiles: {2.5%, 25%, 50%, 75%, 97.5%}).

Simulation results. Figure 2 depicts the outcome of this computational simulation. As alluded to in Section 3.2, the update magnitudes for all parameters start at $1 \cdot \alpha$. The update magnitudes gradually decay but continue to remain high for quite some time, only beginning to settle into a stationary distribution after 40 or so training iterations (with median update magnitude $\approx 0.16 \cdot \alpha$). We extend the trajectory length to 10000 and find that the median update step of the stationary distribution is approximately $0.153 \cdot \alpha$.

These results imply that unlike SGD, Adam will always encounter early-stage training instability by way of large update magnitudes, even when the model is already initialized at a local minimum. This stands as a contributing factor to Adam's need for learning rate warmup above and beyond that of first-order methods.

Comparison to real-world, random initialization settings. Finally, we examine the update step distribution of a model initialized away from a local minimum of $\mathcal{L}(\theta)$. Figure 1c depicts the median parameter update magnitudes of Adam in the EMNIST setting from Section 3.2. We observe a qualitative similarity to the local minimum simulation results: the update magnitudes start at $1 \cdot \alpha$, only gradually settling into a stationary distribution around $0.15 \cdot \alpha$.

Note that the EMNIST optimization decreases more slowly in update magnitude and takes longer ($\approx 100$ training iterations) to settle into the stationary distribution. This suggests that the update step non-regularity observed in the idealized local minimum initialization setting is only exacerbated in the more realistic setting of random initialization.

4 Rules of Thumb

Turning to the practical application of learning rate warmup, we first define a simple heuristic function, the effective warmup period, to characterize the dampening effect of warmup schedules. We then present and intuitively motivate two Adam warmup schedules that require no tuning and are thus usable as rules of thumb.

4.1 Effective Warmup Period

We define the effective warmup period $\mathcal{T}(\omega)$ of a warmup schedule $\omega$ as follows:

$$\mathcal{T}(\omega) = \sum_{t=1}^{\infty} (1 - \omega_t)$$

Intuitively, this is the sum of the warmup's dampening effect across all of training.

4.2 Exponential Warmup

We propose a simple "exponential warmup" schedule based on a decaying exponential and a constant $\tau$:

$$\omega_t^{\text{expo},\tau} = 1 - \exp\left(-\frac{1}{\tau} \cdot t\right) \tag{12}$$

The constant $\tau$ is analogous to a linear warmup period, and we recommend $\tau = (1 - \beta_2)^{-1}$ as a rule of thumb:

$$\omega_t^{\text{expo, untuned}} = 1 - \exp\left(-(1 - \beta_2) \cdot t\right) \tag{13}$$

In choosing $\tau$, our guiding (albeit extremely speculative) intuition is to have the warmup factor $\omega_t^{\text{expo},\tau}$ be roughly equivalent to Adam's second moment bias correction term. This term, $1 - \beta_2^t$, is the sum of the coefficients in the moving average estimation of the second moment, and can thus be interpreted as how "complete" the second moment estimator is at any given point in time. We briefly show the approximate correspondence between the bias correction term and the warmup factor:⁶

$$1 - \beta_2^t = 1 - \exp(\log(\beta_2) \cdot t) \approx 1 - \exp((\beta_2 - 1) \cdot t) = 1 - \exp(-(1 - \beta_2) \cdot t)$$

⁶ The second step follows from a first-order Taylor expansion of $\log(\beta_2)$ around $\beta_2 = 1$. In practice, this approximation is extremely accurate for typical values of $\beta_2$.
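As a quick numerical check (illustrative sketch), the effective warmup period of the untuned exponential schedule can be computed directly and compared against the $(1 - \beta_2)^{-1}$ approximation derived in Appendix B.2:

```python
import math

def omega_expo_untuned(t, beta2):
    """Untuned exponential warmup factor (Equation 13)."""
    return 1.0 - math.exp(-(1.0 - beta2) * t)

beta2 = 0.999
# T(omega) = sum over t of (1 - omega_t); the geometric tail is negligible
# well before t = 100000 for this beta2.
T_eff = sum(1.0 - omega_expo_untuned(t, beta2) for t in range(1, 100_000))
print(T_eff)                 # ~999.5
print(1.0 / (1.0 - beta2))   # 1000.0, the rule-of-thumb approximation
```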
4.3 Linear Warmup

Recall the formulation of linear warmup:

$$\omega_t^{\text{linear},\tau} = \min\left(1, \frac{1}{\tau} \cdot t\right)$$

As a similar rule of thumb to the exponential warmup schedule, we suggest performing linear warmup over $\tau = 2 \cdot (1 - \beta_2)^{-1}$ iterations:

$$\omega_t^{\text{linear, untuned}} = \min\left(1, \frac{1 - \beta_2}{2} \cdot t\right) \tag{14}$$

Our choice of $\tau$ is carried over from exponential warmup as a starting point. To preserve the same effective warmup period, the $\tau$ from the exponential rule of thumb is multiplied by 2 to account for the fact that exponential warmup decelerates over time, whereas linear warmup does not. We elaborate in Appendix B.2.

4.4 Comparison with RAdam

Figure 3: Comparison of various characteristics of RAdam and rule-of-thumb warmup schedules. (a) Effective warmup periods of RAdam and rule-of-thumb warmup schedules, as a function of $\beta_2$. (b) RAdam and rule-of-thumb warmup schedules over time for $\beta_2 = 0.999$.

We first compare RAdam with the rule-of-thumb schedules (Equations 13 and 14) by computing their effective warmup periods across a range of $\beta_2$ values.⁷ Figure 3a reveals that the effective warmup periods of RAdam and the rules of thumb are nearly identical across all practical values of $\beta_2$, indicating that they have similar dampening effects over early-stage training.

⁷ For the purpose of this analysis, $\omega_1, \ldots, \omega_4$ are all defined to be zero for RAdam.

We then proceed to examine the trajectory of the warmup schedule for the commonly used setting of $\beta_2 = 0.999$. Figure 3b reveals that the functional forms of the warmup factors are qualitatively similar in magnitude. The warmup schedules for RAdam and the rule-of-thumb exponential warmup closely correspond in shape as well.

We thus posit that RAdam and the untuned rule-of-thumb warmup schedules are more or less interchangeable. An empirical verification follows.
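The computation behind Figure 3a can be reproduced with a short script (illustrative sketch; per footnote 7, $\omega_1, \ldots, \omega_4$ are treated as zero for RAdam):

```python
import math

def rho(t, beta2):
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

def omega_radam(t, beta2):
    r = rho(t, beta2)
    if r <= 4.0:
        return 0.0                       # omega_1..4 defined as zero (footnote 7)
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    return math.sqrt((r - 4) * (r - 2) * rho_inf /
                     ((rho_inf - 4) * (rho_inf - 2) * r))

def omega_expo(t, beta2):                # Equation 13
    return 1.0 - math.exp(-(1.0 - beta2) * t)

def omega_linear(t, beta2):              # Equation 14
    return min(1.0, 0.5 * (1.0 - beta2) * t)

beta2, horizon = 0.999, 100_000
for name, omega in [("RAdam", omega_radam), ("expo", omega_expo), ("linear", omega_linear)]:
    T = sum(1.0 - omega(t, beta2) for t in range(1, horizon + 1))
    print(name, round(T, 1))             # all three land near the same value (cf. Figure 3a)
```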
5 Experiments

We evaluate untuned exponential warmup (Equation 13), untuned linear warmup (Equation 14), and RAdam across a variety of supervised machine learning tasks. For brevity, all experimental settings are summarized in the main text and comprehensively detailed in Appendix A.

5.1 Image Classification

Using each of the three warmup methods, we train a ResNet-50 model (He et al. 2016) on the ILSVRC ("ImageNet") image classification dataset with various configurations of Adam. Specifically, we sweep over:

$\alpha$ (learning rate) $\in \{10^{-4}, 10^{-3}, 10^{-2}\}$
$\beta_2 \in \{0.99, 0.997, 0.999\}$

Table 1 presents the top-1 error rates at the end of training for the three warmup methods. Across all configurations of Adam, the top-1 error rates are indistinguishable between the warmup methods.⁸

⁸ The best error rates fall roughly 3% behind those from SGD, as is typical with Adam on computer vision tasks.

| LR | $\beta_2$ | Exponential | Linear | RAdam |
|---|---|---|---|---|
| $10^{-4}$ | 0.99 | 34.2% ± 0.1 | 34.2% ± 0.1 | 34.2% ± 0.1 |
| $10^{-4}$ | 0.997 | 34.3% ± 0.2 | 34.2% ± 0.2 | 34.1% ± 0.1 |
| $10^{-4}$ | 0.999 | 34.5% ± 0.1 | 34.4% ± 0.1 | 34.2% ± 0.3 |
| $10^{-3}$ | 0.99 | 27.9% ± 0.1 | 28.0% ± 0.1 | 28.4% ± 0.1 |
| $10^{-3}$ | 0.997 | 27.9% ± 0.1 | 27.9% ± 0.1 | 28.3% ± 0.1 |
| $10^{-3}$ | 0.999 | 28.2% ± 0.1 | 28.3% ± 0.1 | 28.4% ± 0.1 |
| $10^{-2}$ | 0.99 | 29.3% ± 0.1 | 29.3% ± 0.3 | 29.4% ± 0.2 |
| $10^{-2}$ | 0.997 | 29.2% ± 0.2 | 29.3% ± 0.1 | 29.4% ± 0.5 |
| $10^{-2}$ | 0.999 | 28.9% ± 0.2 | 28.7% ± 0.1 | 29.8% ± 0.4 |

Table 1: Top-1 error rates of ResNet-50 on ImageNet (means and standard deviations over 5 random seeds).

Figure 4: Mean training loss (5 seeds) of ResNet-50 on ImageNet, using Adam with $\alpha = 10^{-3}$ and $\beta_2 = 0.999$.

We next examine the course of optimization for individual configurations of Adam's $\alpha$ and $\beta_2$. Figure 4 depicts the training loss using the popular "default" Adam configuration of learning rate $\alpha = 10^{-3}$ and $\beta_2 = 0.999$, revealing that the behavior of these warmup methods is indeed nearly indistinguishable. Appendix C.1 provides both training and validation metrics (Figures 7 and 8, respectively) for all tested configurations, reinforcing this trend.

5.2 Language Modeling

Using each of the three warmup methods, we train a state-of-the-art Transformer-based language model from Baevski and Auli (2018) on WIKITEXT-103. We sweep over the following grid of Adam hyperparameters:

$\alpha$ (learning rate) $\in \{1 \cdot 10^{-4}, 3 \cdot 10^{-4}, 5 \cdot 10^{-4}\}$
$\beta_2 \in \{0.99, 0.998, 0.999\}$

with $\beta_1 = 0.9$ and $\epsilon = 10^{-7}$ fixed. As with image classification, we observe in Table 2 that the choice of warmup method has a minimal impact on training across different hyperparameters.

| LR | $\beta_2$ | Exponential | Linear | RAdam |
|---|---|---|---|---|
| $1 \cdot 10^{-4}$ | 0.99 | 21.0 ± 0.1 | 21.0 ± 0.1 | 21.1 ± 0.1 |
| $1 \cdot 10^{-4}$ | 0.998 | 19.9 ± 0.0 | 19.9 ± 0.0 | 20.0 ± 0.0 |
| $1 \cdot 10^{-4}$ | 0.999 | 20.0 ± 0.0 | 20.0 ± 0.0 | 20.1 ± 0.1 |
| $3 \cdot 10^{-4}$ | 0.99 | 21.3 ± 0.3 | 20.8 ± 0.1 | 22.4 ± 0.0 |
| $3 \cdot 10^{-4}$ | 0.998 | 19.6 ± 0.0 | 19.6 ± 0.0 | 19.6 ± 0.1 |
| $3 \cdot 10^{-4}$ | 0.999 | 19.5 ± 0.0 | 19.5 ± 0.0 | 19.5 ± 0.0 |
| $5 \cdot 10^{-4}$ | 0.99 | 24.4 ± 2.4 | 24.1 ± 1.4 | 26.0 ± 1.8 |
| $5 \cdot 10^{-4}$ | 0.998 | 20.1 ± 0.0 | 20.0 ± 0.0 | 20.1 ± 0.0 |
| $5 \cdot 10^{-4}$ | 0.999 | 19.8 ± 0.0 | 19.7 ± 0.1 | 19.7 ± 0.0 |

Table 2: Validation perplexity of a Transformer LM on the WIKITEXT-103 dataset (means and standard deviations over 3 random seeds).

Figure 5 depicts the validation perplexity throughout training for the best Adam parametrization ($\alpha = 10^{-4}$ and $\beta_2 = 0.999$), which similarly supports the indistinguishability of the warmup methods.

Figure 5: Mean validation perplexity (3 seeds) of Transformer LM on WIKITEXT-103, using Adam with $\alpha = 10^{-4}$ and $\beta_2 = 0.999$.

5.3 Machine Translation

Finally, we evaluate the warmup methods on a large-scale machine translation task. Using each of the three warmup methods, we train a Transformer model (Vaswani et al. 2017) on the WMT16 English-German ("EN-DE") dataset. We fix Adam's $\beta_1 = 0.9$ and $\epsilon = 10^{-7}$ and sweep over the following grid of Adam hyperparameters:

$\alpha$ (learning rate) $\in \{5 \cdot 10^{-5}, 8 \cdot 10^{-5}, 1 \cdot 10^{-4}\}$
$\beta_2 \in \{0.98, 0.99, 0.998, 0.999\}$

We observe no perceptible differences between the warmup methods in either final performance (Table 3) or the training-time metrics of a single canonical configuration ($\alpha = 10^{-4}$ and $\beta_2 = 0.999$, shown in Figure 6).

| LR | $\beta_2$ | Exponential | Linear | RAdam |
|---|---|---|---|---|
| $5 \cdot 10^{-5}$ | 0.98 | 24.5 ± 0.1 | 24.4 ± 0.1 | 24.4 ± 0.1 |
| $5 \cdot 10^{-5}$ | 0.99 | 24.5 ± 0.0 | 24.5 ± 0.0 | 24.5 ± 0.1 |
| $5 \cdot 10^{-5}$ | 0.998 | 24.3 ± 0.2 | 24.4 ± 0.2 | 24.4 ± 0.1 |
| $5 \cdot 10^{-5}$ | 0.999 | 24.2 ± 0.1 | 24.2 ± 0.1 | 24.1 ± 0.1 |
| $8 \cdot 10^{-5}$ | 0.98 | 25.9 ± 0.1 | 25.9 ± 0.1 | 25.9 ± 0.1 |
| $8 \cdot 10^{-5}$ | 0.99 | 25.9 ± 0.2 | 25.9 ± 0.1 | 25.9 ± 0.0 |
| $8 \cdot 10^{-5}$ | 0.998 | 26.0 ± 0.1 | 25.2 ± 1.0 | 25.9 ± 0.1 |
| $8 \cdot 10^{-5}$ | 0.999 | 25.7 ± 0.1 | 25.8 ± 0.1 | 25.7 ± 0.0 |
| $1 \cdot 10^{-4}$ | 0.98 | 26.5 ± 0.1 | 26.6 ± 0.1 | 26.6 ± 0.1 |
| $1 \cdot 10^{-4}$ | 0.99 | 26.7 ± 0.1 | 26.6 ± 0.1 | 26.6 ± 0.0 |
| $1 \cdot 10^{-4}$ | 0.998 | 25.9 ± 0.9 | 26.5 ± 0.1 | 26.6 ± 0.0 |
| $1 \cdot 10^{-4}$ | 0.999 | 26.2 ± 0.2 | 26.4 ± 0.0 | 26.4 ± 0.0 |

Table 3: BLEU score of Transformer on WMT16 EN-DE (means and standard deviations over 3 random seeds).
Figure 6: Mean validation perplexity (3 seeds) of Transformer on WMT16 EN-DE, using Adam with $\alpha = 10^{-4}$ and $\beta_2 = 0.999$.

6 Discussion

We discuss various consequences of our findings, along with directions for future work.

6.1 Extended Warmup Periods

The analysis of the update step magnitudes in Section 3.3 suggests shorter warmup periods than those typically used in practice. For example, in the setting of $\beta_2 = 0.999$, Adam's update magnitudes in the theoretical model converge to a stationary distribution in roughly 40 iterations. If update magnitudes were the only relevant consideration, then a warmup schedule over a few hundred iterations would suffice to stabilize training. In contrast, the effective warmup periods of both RAdam and our rule-of-thumb schedules are roughly 1000 iterations for $\beta_2 = 0.999$. State-of-the-art methods with hand-tuned warmup schedules often go well beyond, using up to 10000 iterations of linear warmup in some cases (Liu et al. 2019b; Baevski and Auli 2018; Ott et al. 2019).

Accordingly, we surmise that the precise channel by which Adam necessitates an extended period of warmup is still an unresolved question, likely related to the properties of the gradients at random initialization. Future work could rigorously investigate the effect of extended warmup periods on the training dynamics of Adam, beyond simple per-iteration statistics.

6.2 Consequences of Update Step Invariance to Gradients

One ancillary finding of Section 3.3 is that the magnitudes of Adam's update steps during later stages of training are largely invariant to the properties or dynamics of the gradient distribution; both the simulated local optimum and real-world random initialization settings result in convergence to similar stationary distributions of update magnitudes. This suggests that learning rate decay at later stages of training could be the only way to improve late-stage convergence, as Adam's late-stage update magnitudes do not appear to be very sensitive to the variance or stationarity of gradients.

In particular, we suspect that variance-based methods of improving the late-stage convergence of SGD, such as increasing the batch size (Smith et al. 2018), will not yield comparable benefits when applied to Adam, as the stationary distribution of the update magnitudes will remain largely the same. Partially adaptive methods (Chen and Gu 2018; Keskar and Socher 2017; Luo et al. 2019), which interpolate between the full adaptivity of Adam and the non-adaptivity of SGD, may hold more promise for improving late-stage convergence.

6.3 Dynamic Warmup

All methods considered by this work use fixed warmup schedules, computed only as a function of the training iteration $t$ and various hyperparameters. Such schedules will inevitably be brittle to some combination of problem setting, model, and optimizer configuration. Another direction for future work could be to devise truly dynamic mechanisms for scheduling warmup in Adam. Such a mechanism could (among other things) track and utilize auxiliary statistics, such as the running moments of the applied updates, in order to determine the stability of training at each iteration.
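Purely as an illustration of the kind of mechanism we have in mind (a hypothetical sketch, not an algorithm we propose or evaluate; all names and constants below are placeholders), such a controller might compare a running estimate of the applied update magnitudes against their stationary level from Section 3.3:

```python
class UpdateMagnitudeWarmup:
    """Hypothetical sketch: scale the global learning rate by comparing a
    running estimate of the median |update| against an assumed stationary
    level (~0.15, per Section 3.3). Illustrative only."""

    def __init__(self, alpha_max, stationary_level=0.15, decay=0.9):
        self.alpha_max = alpha_max            # maximum learning rate, tuned a priori
        self.stationary_level = stationary_level
        self.decay = decay
        self.running = None                   # running estimate of median |update|

    def scaled_lr(self, median_update_magnitude):
        if self.running is None:
            self.running = median_update_magnitude
        else:
            self.running = (self.decay * self.running
                            + (1 - self.decay) * median_update_magnitude)
        # Dampen the step size while updates remain above their stationary level.
        return self.alpha_max * min(1.0, self.stationary_level / self.running)
```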
This direction comes dangerously close to seeking the "holy grail" of an automatic learning rate tuner; existing attempts to devise such a method have achieved limited adoption as of yet (Li, Tai, and E 2017; Zhang, Mitliagkas, and Ré 2017; Baydin et al. 2018). What makes this potentially more tractable is that a maximum learning rate is still tuned and given a priori to the optimizer; the task is then restricted to dynamic scheduling of the learning rate from zero to this known constant, instead of an arbitrary range $(0, \infty)$.

7 Conclusion

We show that the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule. We also examine the shortcomings of a variance-based approach to analyzing the learning rate warmup heuristic, and we illustrate that Adam's frequent need for learning rate warmup can be partially explained by inspecting Adam's early-stage update step magnitudes when applied to an already-converged model.

RAdam's claimed benefits are its superior performance to Adam and its elimination of costly warmup schedule tuning. We obviate RAdam by providing two simple "rule-of-thumb" warmup schedules for Adam, both of which require no tuning. Linear warmup of Adam's learning rate over $2 \cdot (1 - \beta_2)^{-1}$ iterations is functionally equivalent to RAdam across a wide range of settings. Hence, we suggest that practitioners considering the need for untuned warmup of Adam's learning rate first try linear warmup over $2 \cdot (1 - \beta_2)^{-1}$ training iterations.

References

Baevski, A.; and Auli, M. 2018. Adaptive Input Representations for Neural Language Modeling. CoRR abs/1809.10853. URL http://arxiv.org/abs/1809.10853.

Baydin, A. G.; Cornish, R.; Martínez-Rubio, D.; Schmidt, M.; and Wood, F. 2018. Online Learning Rate Adaptation with Hypergradient Descent. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. URL https://openreview.net/forum?id=BkrsAzWAb.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165. URL https://arxiv.org/abs/2005.14165.

Chen, J.; and Gu, Q. 2018. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks. CoRR abs/1806.06763. URL http://arxiv.org/abs/1806.06763.

Cohen, G.; Afshar, S.; Tapson, J.; and van Schaik, A. 2017. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14-19, 2017, 2921–2926. doi:10.1109/IJCNN.2017.7966217. URL https://doi.org/10.1109/IJCNN.2017.7966217.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805. URL http://arxiv.org/abs/1810.04805.

Duchi, J. C.; Hazan, E.; and Singer, Y. 2010. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, 257–269. URL http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=265.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 1243–1252. URL http://proceedings.mlr.press/v70/gehring17a.html.

Gotmare, A.; Keskar, N. S.; Xiong, C.; and Socher, R. 2019. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=r14EOsCqKX.

Goyal, P.; Dollár, P.; Girshick, R. B.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs/1706.02677. URL http://arxiv.org/abs/1706.02677.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 770–778. doi:10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.

Hinton, G.; Srivastava, N.; and Swersky, K. 2012. Neural networks for machine learning: Lecture 6a.

Keskar, N. S.; and Socher, R. 2017. Improving Generalization Performance by Switching from Adam to SGD. CoRR abs/1712.07628. URL http://arxiv.org/abs/1712.07628.

Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980. URL http://arxiv.org/abs/1412.6980.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States, 1106–1114. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.

Li, Q.; Tai, C.; and E, W. 2017. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2101–2110. URL http://proceedings.mlr.press/v70/li17f.html.

Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; and Han, J. 2020. On the Variance of the Adaptive Learning Rate and Beyond. In International Conference on Learning Representations. URL https://openreview.net/forum?id=rkgz2aEKDr.

Liu, Y.; Albanie, S.; Nagrani, A.; and Zisserman, A. 2019a. Use What You Have: Video retrieval using representations from collaborative experts. In 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019, 279. BMVA Press. URL https://bmvc2019.org/wp-content/uploads/papers/0363-paper.pdf.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692. URL http://arxiv.org/abs/1907.11692.

Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Luo, L.; Xiong, Y.; Liu, Y.; and Sun, X. 2019. Adaptive Gradient Methods with Dynamic Bound of Learning Rate. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.

Ma, J.; and Yarats, D. 2019. Quasi-hyperbolic momentum and Adam for deep learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. URL https://openreview.net/forum?id=S1fUpoR5FQ.

Nesterov, Y. E. 1983. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR, volume 269, 543–547.

Nguyen, T. Q.; and Salazar, J. 2019. Transformers without Tears: Improving the Normalization of Self-Attention. CoRR abs/1910.05895. URL http://arxiv.org/abs/1910.05895.

Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2016. PyTorch Examples. https://github.com/pytorch/examples.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Polyak, B. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5): 1–17.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3): 211–252. doi:10.1007/s11263-015-0816-y.

Smith, S. L.; Kindermans, P.; Ying, C.; and Le, Q. V. 2018. Don't Decay the Learning Rate, Increase the Batch Size. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. URL https://openreview.net/forum?id=B1Yy1BxCZ.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. CoRR abs/1706.03762. URL http://arxiv.org/abs/1706.03762.

Yamamoto, R.; Song, E.; and Kim, J.-M. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6199–6203. IEEE.

Zhang, J.; Mitliagkas, I.; and Ré, C. 2017. YellowFin and the Art of Momentum Tuning. CoRR abs/1706.03471. URL http://arxiv.org/abs/1706.03471.

A Full details of experimental setup

A.1 System configuration

All experiments are performed using Python 3.7 and PyTorch version 1.2 (Paszke et al. 2017) compiled with CUDA 10, on Ubuntu 18.04 systems containing 8 NVIDIA V100 GPUs each.

A.2 Image classification

Experimentation is performed using the ILSVRC 2012 1000-class dataset ("ImageNet"; Russakovsky et al. 2015) and a 50-layer convolutional residual network model ("ResNet-50"; He et al. 2016). The implementation follows that of Paszke et al.
(2016),⁹ with the only deviations being to enable alternative optimizer configurations, to enable intermediate metric logging, and to drop the last batch from each training epoch. Training occurs over 90 epochs, with ten-fold learning rate decay after epochs 30 and 60. The minibatch size is 1024. The optimization objective is cross-entropy, with a decoupled weight decay (Loshchilov and Hutter 2019) of $10^{-4}$. Data augmentation includes horizontal flipping at random, as well as random 224-pixel crops. Validation is performed on 224-pixel center crops.

⁹ Commit hash ee964a2.

For Adam and RAdam, the following hyperparameters are fixed: $\beta_1 = 0.9$ and $\epsilon = 10^{-8}$. All other Adam parameters (warmup schedule, learning rate $\alpha$, and $\beta_2$) are enumerated via parameter sweep as described in Section 5.1. Each Adam configuration is independently trained with 5 random seeds.

A.3 Language modeling

We evaluate the state-of-the-art, Transformer-based language model described in Baevski and Auli (2018) on the WIKITEXT-103 dataset, consisting of 100M tokens with a size-260K vocabulary. We leverage the authors' implementation provided in fairseq (Gehring et al. 2017; Ott et al. 2019), and train on 8 GPUs with half-precision floating point.

Our experimentation setup closely follows Baevski and Auli (2018), except that we sweep over Adam parameters such as warmup schedule, learning rate $\alpha$, and $\beta_2$, while keeping $\beta_1 = 0.9$ and $\epsilon = 10^{-7}$ fixed (both for Adam and RAdam). The hyperparameter grid is presented in Section 5.2. Each Adam configuration is independently trained with 3 random seeds.

A.4 Machine translation

Our setup employs a state-of-the-art Transformer model (Vaswani et al. 2017) implemented in fairseq (Ott et al. 2019). We train on the WMT16 English-German large machine translation dataset, and evaluate on the newstest14 validation set. As observed in Ma and Yarats (2019), these state-of-the-art large-scale models are fragile to train with Adam and require either a carefully chosen optimization procedure, or robust optimizers that can sustain gradients with large variance, such as QHAdam (Ma and Yarats 2019). To eliminate this factor from our studies, we choose to lower the learning rate $\alpha$ to stabilize training, taking a marginal performance hit in training.

Apart from that, our experimentation setup is identical to the one in Ott et al. (2019). We fix Adam parameters $\beta_1 = 0.9$ and $\epsilon = 10^{-7}$, and sweep over the warmup schedule, learning rate $\alpha$, and $\beta_2$, as described in Section 5.3. We again use half-precision floating point and train on 8 GPUs. As Ott et al. (2019) train on 128 GPUs, we accumulate gradients over 16 minibatches before each optimization step to achieve an identical configuration. The BLEU score is averaged over 3 random seeds.

A.5 Gradient analysis workhorse: EMNIST digit classification

The EMNIST digit classification task (Cohen et al. 2017) serves as the workhorse for our gradient analysis studies. Our model is a simple feed-forward neural network with three hidden layers (sizes 200, 100, and 50) and uniform weight initialization with range inversely proportional to the square root of layer sizes. Optimization is performed on the cross-entropy objective with the Adam optimizer. The Adam configuration is $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and decoupled weight decay $10^{-4}$. The minibatch size is 256. Training occurs over 10000 training iterations.
At each training iteration, 256 backwards passes are performed with independently sampled batches to collect a sample of the gradient distribution. Due to the cost of storing and analyzing the gradients of all parameters, we randomly sample 500 parameters from each weight matrix and only collect gradients for the sampled parameters. These samples are used to approximate the distribution of the gradient coefficients of variation. After the 256 backwards passes, one final pass is performed as a regular optimization step to update the model parameters and proceed to the next iteration.

B Miscellaneous derivations

This appendix provides miscellaneous informal derivations of statements in the main text.

B.1 Number of RAdam momentum iterations

Fact 3.1. Assume that $0.8 \leq \beta_2 < 1$ and $t$ is a positive integer. Then, for $\rho_t$ as defined in Equation 7:

$$\rho_t \leq 4 \iff t \leq 4$$

Proof. We define $\rho(t, \beta_2)$ to be the continuous version of $\rho_t$, parameterized over both $t$ and $\beta_2$:

$$\rho(t, \beta_2) = \frac{2}{1 - \beta_2} - 1 - \frac{2 t \beta_2^t}{1 - \beta_2^t}$$

We then differentiate with respect to $t$:

$$\frac{\partial \rho(t, \beta_2)}{\partial t} = \frac{2 \beta_2^t \left( \beta_2^t - 1 - t \ln \beta_2 \right)}{(1 - \beta_2^t)^2}$$

$\frac{\partial \rho(t, \beta_2)}{\partial t}$ is thus positive for all $t > 0$. We also differentiate with respect to $\beta_2$, and take specific values thereof:

$$\frac{\partial \rho(t, \beta_2)}{\partial \beta_2} = 2 \left( \frac{1}{(1 - \beta_2)^2} - \frac{t^2 \beta_2^{t-1}}{(1 - \beta_2^t)^2} \right)$$

$$\frac{\partial \rho(4, \beta_2)}{\partial \beta_2} = 2 \left( \frac{1}{(1 - \beta_2)^2} - \frac{16 \beta_2^3}{(1 - \beta_2^4)^2} \right)$$

$$\frac{\partial \rho(5, \beta_2)}{\partial \beta_2} = 2 \left( \frac{1}{(1 - \beta_2)^2} - \frac{25 \beta_2^4}{(1 - \beta_2^5)^2} \right)$$

$\frac{\partial \rho(t, \beta_2)}{\partial \beta_2}$ is thus positive for all $\beta_2 \in (0, 1)$ at $t = 4$ and $t = 5$. Then, we take $\lim_{\beta_2 \to 1} \rho(4, \beta_2)$:

$$\lim_{\beta_2 \to 1} \rho(4, \beta_2) = \lim_{\beta_2 \to 1} \left( \frac{2}{1 - \beta_2} - \frac{8 \beta_2^4}{1 - \beta_2^4} \right) - 1 = \lim_{\beta_2 \to 1} \left( \frac{2 \left( 4 \beta_2^3 + 3 \beta_2^2 + 2 \beta_2 + 1 \right)}{(1 + \beta_2)(1 + \beta_2^2)} \right) - 1 = 5 - 1 = 4$$

Combining this result with the fact that $\frac{\partial \rho(4, \beta_2)}{\partial \beta_2}$ is positive for $\beta_2 \in (0, 1)$, it follows that $\rho(4, \beta_2) < 4$ for all $\beta_2 \in (0, 1)$. Then, since $\frac{\partial \rho(t, \beta_2)}{\partial t} > 0$ for all $t > 0$, we have that $\rho(t, \beta_2) < 4$ for all $\beta_2 \in (0, 1)$ and $t \in (0, 4]$. We have thus shown that $t \leq 4 \implies \rho_t \leq 4$ for positive integers $t$.

In the reverse direction, we evaluate $\rho(5, 0.8)$:

$$\rho(5, 0.8) = \frac{2}{1 - 0.8} - 1 - \frac{2 \cdot 5 \cdot 0.8^5}{1 - 0.8^5} \approx 9 - 4.87 \approx 4.13$$

Similarly combining this result with the fact that $\frac{\partial \rho(5, \beta_2)}{\partial \beta_2}$ is positive for $\beta_2 \in (0, 1)$, and then with the fact that $\frac{\partial \rho(t, \beta_2)}{\partial t} > 0$ for all $t > 0$, we have that $\rho(t, \beta_2) \gtrsim 4.13$ for all $t \geq 5$ and $\beta_2 \in [0.8, 1)$. We have thus shown that $t > 4 \implies \rho_t > 4$ for positive integers $t$, completing the proof.

B.2 Linear warmup period (rule of thumb)

We desire for the effective warmup period to be roughly equivalent between the exponential and linear rule-of-thumb schedules; that is, $\mathcal{T}(\omega^{\text{expo, untuned}}) \approx \mathcal{T}(\omega^{\text{linear},\tau})$. Solving approximately for $\tau$:

$$\mathcal{T}(\omega^{\text{expo, untuned}}) = \sum_{t=1}^{\infty} \exp(-(1 - \beta_2) \cdot t) = \frac{1}{\exp(1 - \beta_2) - 1} \approx (1 - \beta_2)^{-1}$$

$$\mathcal{T}(\omega^{\text{linear},\tau}) = \sum_{t=1}^{\tau} \left( 1 - \frac{1}{\tau} \cdot t \right) = \frac{\tau - 1}{2} \approx \frac{\tau}{2}$$

$$\tau = 2 \cdot (1 - \beta_2)^{-1} \implies \mathcal{T}(\omega^{\text{expo, untuned}}) \approx \mathcal{T}(\omega^{\text{linear},\tau})$$

C Supplementary experimental results

C.1 Image classification
Figure 7: Mean training loss of ResNet-50 on ImageNet under various configurations of Adam (5 random seeds per configuration); subfigures (a)-(i) span $\alpha \in \{10^{-2}, 10^{-3}, 10^{-4}\}$ and $\beta_2 \in \{0.99, 0.997, 0.999\}$. Standard deviations are negligible across configurations.

Figure 8: Mean top-1 validation error of ResNet-50 on ImageNet under various configurations of Adam (5 random seeds per configuration); subfigures (a)-(i) span $\alpha \in \{10^{-2}, 10^{-3}, 10^{-4}\}$ and $\beta_2 \in \{0.99, 0.997, 0.999\}$. Standard deviations are negligible across configurations.
