Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets
Authors: Kristi Topollai, Anna Choromanska
Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

Kristi Topollai (New York University, kt2664@nyu.edu)
Anna Choromanska (New York University, ac5455@nyu.edu)

Abstract

Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided method for choosing useful reset periods, showing that in low precision the key question is not only whether resets help, but when they should be applied. Experiments in controlled simulations and LLM pre-training show that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.

1 Introduction

Modern large-scale pre-training relies heavily on optimizers such as AdamW (Kingma and Ba, 2017; Loshchilov and Hutter, 2019), Lion (Chen et al., 2023), Shampoo (Gupta et al., 2018), SOAP (Vyas et al., 2025), and Muon (Liu et al., 2025), which maintain running statistics, typically exponential moving averages (EMAs), to smooth gradients, adapt step sizes, or construct preconditioners.
At the same time, modern training pipelines increasingly store these states in reduced precision to lower memory cost and improve throughput (Dettmers et al., 2022; Peng et al., 2023; Xi et al., 2025; Wang et al., 2024). This makes optimizer-state quantization practically important, but its effect on the actual dynamics of the optimizer remains insufficiently understood.

A related empirical observation is that resetting optimizer states can improve pre-training (Huang et al., 2025b,a; Glentis et al., 2025). These resets are usually motivated through spike mitigation or generic training stability. However, in our experiments their benefit is strongly precision-dependent: resets help substantially when EMA states are stored in BF16 or lower precision, but are often much less useful in full precision. This suggests that state quantization and state resets should not be viewed separately: understanding why quantized states become ineffective is also the key to understanding when resets help.

In this paper, we study that connection through the dynamics of quantized EMAs. Consider a generic recursion

    s_t = \beta s_{t-1} + (1 - \beta) \Delta_t,

where s_t is an optimizer state, \beta is a decay coefficient close to 1, and \Delta_t denotes the new update information. This form captures the state evolution of many widely used optimizers. In low precision, each step first forms a higher-precision update and then maps it back to the storage format. When the increment is too small relative to the local spacing of the floating-point grid, the update rounds back to the same stored value, so the state does not change. We refer to this phenomenon as state-update stalling, also known as signal swamping and unchanged-state dynamics in low-bit EMA updates (Higham, 1993; Wang et al., 2018; Xu et al., 2025).
In high-decay recursions, repeated stalling slows adaptation, effectively lengthens the memory of the EMA beyond what the nominal decay suggests, and makes quantized states substantially less responsive than their full-precision counterparts.

This viewpoint also helps explain several recurring features of low-precision training. In particular, it clarifies why large \beta_2 values become more problematic as precision decreases, why stochastic rounding (Gupta et al., 2015; Ozkara et al., 2025; Croci et al., 2022) helps but does not fully remove the problem when stalling is severe, and why the first and second moments of Adam-like methods respond differently to quantization. Most importantly, it explains why resets can help and why their timing matters: once a quantized EMA has become effectively stale, resetting it can restore responsiveness, but resetting too early discards useful averaging. The key question is therefore not only whether resets help, but when they should be applied.

We formalize these effects through a simple predictive model of quantized EMA dynamics. The model combines local floating-point grid geometry with a short-horizon approximation to estimate one-step stalling probabilities and to characterize the finite post-initialization period during which a quantized EMA remains meaningfully responsive. This responsive startup window provides a natural basis both for understanding when resets should help and for choosing useful reset periods as a function of decay and precision. We validate the resulting predictions in controlled simulations and LLM pre-training, showing that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.

2 Related Work

Optimizers for large-scale pre-training. Modern LLM pre-training increasingly follows compute-optimal scaling and open training recipes (Hoffmann et al.
, 2022; Touvron et al., 2023a,b; Groeneveld et al., 2024), most of which rely on optimizers with persistent state. Beyond momentum SGD and Adam/AdamW (Kingma and Ba, 2017; Loshchilov and Hutter, 2019), this includes memory-reduced adaptive methods such as Adafactor (Shazeer and Stern, 2018), SM3 (Anil et al., 2019), and Adam-mini (Zhang et al., 2025), structured preconditioners such as Shampoo (Gupta et al., 2018) and SOAP (Vyas et al., 2025), and more recent alternatives such as Lion (Chen et al., 2023) and Muon (Liu et al., 2025).

Low-precision and quantized pre-training. Recent low-precision pre-training builds on FP8 formats (Micikevicius et al., 2022), with systems such as FP8-LM (Peng et al., 2023) and COAT (Xi et al., 2025) demonstrating end-to-end FP8 LLM training, and more recent work extending to FP4 through mixed-precision or native low-bit stacks (Wang et al., 2025; Chmiel et al., 2025; Castro et al., 2025). Closest to our setting is direct quantization of optimizer states: 8-bit and 4-bit states for Adam-like methods (Dettmers et al., 2022; Li et al., 2023), 4-bit Shampoo preconditioners (Wang et al., 2024), and 8-bit quantized Muon states (Gupta et al., 2026). Our contribution is complementary: rather than proposing a new quantization or scaling scheme, we analyze a specific failure mode of quantized EMA states, state-update stalling under requantization, and use it to explain when stochastic rounding and periodic resets help.

Optimizer state resetting. Restarting or state resetting has a long history in optimization (O'Donoghue and Candès, 2012) and was later adapted to neural network training (Wang et al., 2022). More recent work shows that clearing optimizer states can help in non-stationary settings such as reinforcement learning (Asadi et al., 2023). In LLM training, methods such as SPAM (Huang et al., 2025b,a) and (Glentis et al.
, 2025) use momentum resets mainly for spike suppression and stabilization, while (Topollai and Choromanska, 2026) show that implicit resets via adapting the momentum coefficient can also improve performance. In contrast, we show that in low precision the benefit of resets is instead tied to quantized-state stalling: resets help by temporarily restoring responsiveness once EMA states become effectively stale.

3 State update stalling under quantized EMA states

Preliminaries. We consider exponential moving averages (EMAs) of the form

    x_t = \beta x_{t-1} + (1 - \beta) s_t,    (1)

where s_t is a stochastic signal (e.g., g_t or g_t^2) and \beta \in (0, 1) is the decay coefficient. In the quantized setting, the EMA state is stored in low precision but updated in high precision. Let x_t^q denote the stored state, let D(\cdot) denote dequantization (equivalently, an exact cast to FP32), and let Q(\cdot) denote quantization back to the storage format. One update therefore reads x_{t-1} = D(x_{t-1}^q), forms the FP32 update x_t = \beta x_{t-1} + (1 - \beta) s_t, and writes back x_t^q = Q(x_t). Since D(\cdot) is just a cast, no new quantization error is introduced during the read phase; all new errors enter on the write phase through Q(\cdot).

Definition 3.1 (State update stalling). We say that state update stalling occurs at step t if

    x_t^q = x_{t-1}^q.    (2)

Equivalently, the high-precision update fails to move the stored state to a new representable value.

Quantization therefore acts as a nonlinear gate in the EMA recursion, suppressing sufficiently small updates. We focus on the standard Adam setting, where the optimizer maintains first- and second-moment EMAs of the gradient g_t = \nabla f(\theta_t), namely the momentum m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t and the variance estimate v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2.
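As a concrete illustration of the read-update-write cycle and of Definition 3.1, the following minimal self-contained sketch (ours, not the paper's implementation) emulates BF16 storage by rounding an FP32 value to the nearest BF16 value via bit manipulation, and shows a second-moment update whose high-precision proposal rounds back to the stored value:

```python
import struct

def bf16_round(x: float) -> float:
    """Round an FP32 value to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)          # round-to-nearest-even bias
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

beta2 = 0.999
v_q = bf16_round(1.0)                 # stored second-moment state, on the BF16 grid

# A squared-gradient sample near the state scale: the FP32 proposal moves by
# only (1 - beta2) * (g^2 - v) = 0.0002, well inside the BF16 rounding cell
# around 1.0 (ulp = 2^-7), so the write-back returns the same stored value.
proposal = beta2 * v_q + (1 - beta2) * 1.2
stalled = bf16_round(proposal) == v_q            # True: state unchanged

# A much larger sample yields an increment above half an ulp; the state moves.
proposal_big = beta2 * v_q + (1 - beta2) * 10.0
moved = bf16_round(proposal_big) != v_q          # True: state updated
print(stalled, moved)
```

Under nearest rounding the gate is hard: any proposal within half an ulp of the stored value leaves the state bit-for-bit identical, which is exactly the stalling event of Definition 3.1.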
Although the same gating mechanism applies to both moments, it is typically more pronounced for the second moment, whose quantization effects are often more visible in practice (Fishman et al., 2025; Tang et al., 2026; Yu et al., 2024). Moreover, the second moment admits a simple probabilistic model that matches empirical behavior well. We therefore focus the analysis on the second-moment EMA.

3.1 One-step stalling as a geometry problem

We analyze a single coordinate of the second-moment EMA: v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2. Our analysis is coordinate-wise and does not depend on the global training objective; it requires only a short-window approximation for one gradient coordinate. Concretely, over the window of interest we assume that the coordinate has an approximately fixed local scale \sigma, and we analyze the EMA in the regime where v_{t-1} is of the same order as \sigma^2. To obtain closed-form expressions, we further use the local Gaussian approximation

    g_t^2 / \sigma^2 \approx_d \chi^2_1,  equivalently  g_t / \sigma \approx_d N(0, 1).    (3)

This is a tractable short-window approximation for the step-wise distribution of a single gradient coordinate, and is in line with recent evidence that stochastic gradients exhibit substantial Gaussian structure in this sense (Xie et al., 2023). The high-precision update increment is then

    \Delta_t := v_t - v_{t-1} = (1 - \beta_2)(g_t^2 - v_{t-1}).    (4)

Let u_t := ulp(v_{t-1}^q) denote the local spacing between adjacent representable numbers around the stored value. Under nearest rounding, state-update stalling occurs whenever the high-precision proposal remains within the rounding cell of the current stored value, i.e.

    |\Delta_t| < u_t / 2.    (5)

Equivalently, the stored value changes only if |\Delta_t| \geq u_t / 2.

Let \varepsilon denote the characteristic relative grid spacing of the number format used for the EMA state under consideration.
For the second moment, which is nonnegative, we use an unsigned E2M2 FP4 format, giving \varepsilon = 2^{-2}; for BF16 and E4M3 FP8 we use \varepsilon = 2^{-7} and \varepsilon = 2^{-3}, respectively. The precise relation between u_t and \varepsilon, including the mantissa-dependent factor, is derived in Section A.1. Using the local-stationarity approximation v_{t-1} \approx \sigma^2 and defining

    z_t := g_t^2 / \sigma^2 \sim \chi^2_1,    (6)

the update-to-spacing ratio can be summarized, under the effective-spacing approximation, by a single effective precision parameter introduced next. Here \varepsilon denotes a format-dependent effective spacing parameter, and \bar{m} := 1 / \ln 2 is the mean normalized mantissa under the log-uniform mantissa model, used here as a simple single-scalar approximation to the exact mantissa dependence in floating-point spacing (Goldberg, 1991); see Section A.1 for details.

Definition 3.2 (Effective precision ratio). We define

    \hat{\rho} := \varepsilon / (2 (1 - \beta_2) \bar{m}).    (7)

With this convention,

    |\Delta_t| / u_t \approx |z_t - 1| / (2 \hat{\rho}),    (8)

so all stalling statements below are controlled by \hat{\rho}.

3.2 Nearest rounding induces hard-gated stalling

Under nearest rounding (NR), condition (5) becomes, under the effective-spacing approximation,

    |z_t - 1| < \hat{\rho}.    (9)

Equivalently, the stored value changes only when |z_t - 1| \geq \hat{\rho}.

Proposition 3.3 (NR stalling probability). Let z_t \sim \chi^2_1. Under the effective-spacing approximation, the one-step probability of state-update stalling under nearest rounding is

    P^NR_stall(\hat{\rho}) \approx F_{\chi^2_1}(1 + \hat{\rho}) - F_{\chi^2_1}(max(0, 1 - \hat{\rho})).    (10)

Corollary 3.4 (High-stalling regime under nearest rounding). When \hat{\rho} \geq 1, the lower term in (10) vanishes, so

    P^NR_stall(\hat{\rho}) \approx F_{\chi^2_1}(1 + \hat{\rho}) \approx 1.    (11)

Thus, once the EMA reaches its steady magnitude, most updates are blocked by the rounding gate.

For \beta_2 = 0.999, the effective ratios from (7) give the steady-state stalling probabilities in Table 1, which are also verified empirically in Figure 2.

Table 1: Theoretical steady-state stalling probabilities under nearest rounding (NR) from (7).

    Format       \varepsilon   \hat{\rho}   P^NR_stall(\hat{\rho})
    BF16         2^{-7}        2.71         \approx 0.946
    FP8 (E4M3)   2^{-3}        43.3         \approx 1.000
    FP4 (E2M2)   2^{-2}        86.6         \approx 1.000

These values already reveal the core problem: at FP8 and FP4, once the second moment reaches its steady-state scale, almost every step stalls. In this regime the EMA rarely changes, so the optimizer effectively stops updating its estimate of the gradient magnitude and instead operates with nearly frozen statistics. Importantly, this effect is largely independent of model or training scale. The analysis is expressed in terms of the normalized variable z_t = g_t^2 / E[g^2], whose distribution does not depend on the absolute magnitude of gradients, while the ratio \hat{\rho} governing stalling depends only on the precision format and the decay \beta_2. As a result, changing the overall gradient scale, model width, or dataset shifts the EMA magnitude and the floating-point spacing together, leaving the stalling probability essentially unchanged.

3.3 Empirical sensitivity to simulated moment stalling

To isolate the effect of stale EMA updates, we run a controlled intervention during LLaMA-60M pre-training on WikiText: after warmup, each update of a chosen moment buffer is independently skipped with probability p. This is not meant as a literal model of quantization-induced stalling, but as a direct stress test of optimizer sensitivity to persistent unchanged-state behavior.

Figure 1: Simulated EMA stalling during LLaMA-60M pre-training on WikiText; final validation loss vs. fraction of stale moment updates, shown separately for the first and second moments.
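The Table 1 entries can be reproduced in a few lines using the closed form F_{\chi^2_1}(x) = erf(\sqrt{x/2}); the following is a standard-library sketch of Eqs. (7) and (10), not code from the paper:

```python
import math

def chi2_1_cdf(x: float) -> float:
    """CDF of a chi-squared variable with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def p_stall_nr(eps: float, beta2: float = 0.999) -> float:
    """Steady-state NR stalling probability from Eqs. (7) and (10)."""
    m_bar = 1.0 / math.log(2.0)                  # mean normalized mantissa
    rho = eps / (2.0 * (1.0 - beta2) * m_bar)    # effective precision ratio
    return chi2_1_cdf(1.0 + rho) - chi2_1_cdf(max(0.0, 1.0 - rho))

for fmt, eps in [("BF16", 2**-7), ("FP8 (E4M3)", 2**-3), ("FP4 (E2M2)", 2**-2)]:
    print(fmt, round(p_stall_nr(eps), 3))
```

This prints approximately 0.946 for BF16 and 1.000 for FP8 and FP4, matching Table 1.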
Figure 1 shows a clear trend: second-moment stalling degrades convergence gradually, mainly through worse final loss, whereas first-moment stalling is far more destabilizing. This is consistent with their roles in Adam-like methods: the second moment mainly controls effective learning-rate scaling, while the first moment carries directional information and is therefore more sensitive to missed updates.

4 What stalling does to the EMA dynamics

4.1 A coarse-grained effective decay approximation

A useful summary of stalling is through an effective decay approximation. If a step stalls, the stored state remains unchanged (v_t^q = v_{t-1}^q), which is equivalent to using decay 1 on that step rather than the nominal decay \beta_2. If the empirical stalling rate is approximately constant at P_stall over a short window, only a fraction 1 - P_stall of steps effectively apply the nominal EMA contraction, giving

    \beta_2^eff \approx 1 - (1 - \beta_2)(1 - P_stall),    (12)

and, for \tau := 1 / (1 - \beta_2),

    \tau^eff \approx 1 / ((1 - \beta_2)(1 - P_stall)) = \tau / (1 - P_stall).    (13)

This is not an exact reduction of the quantized recursion, since stalling events are state-dependent and correlated across steps, but a local mean-field summary of how a high stalled fraction slows the EMA response. The same persistence also makes successive optimizer updates more temporally correlated, a behavior linked to instability in Adam-like methods (Molybog et al., 2023).

4.2 A finite startup window before stalling dominates

Stalling does not occur immediately after state initialization. When the EMA state is small, the quantization grid is fine relative to the typical update, so many updates pass through. As the state grows toward its stationary scale, the probability of a stalled update rises, creating a finite startup window during which the EMA remains meaningfully responsive.
Under the same idealized model as above, after j updates have been accumulated from a zero-initialized state, the reference high-precision trajectory is

    v_j^ref := \sigma^2 \phi_j,   \phi_j := 1 - \beta_2^j.    (14)

Proposition 4.1 (Transient NR stalling probability). With z_t defined in (6), the transient stalling probability under nearest rounding is approximated by

    P^NR_stall(j) \approx F_{\chi^2_1}((1 + \hat{\rho}) \phi_j),    (15)

for the formats of interest, where \hat{\rho} \geq 1. This expression centers the stalling event at the current state scale \phi_j rather than the steady-state value, and therefore captures the increase in stalling as the EMA grows away from initialization.

Definition 4.2 (Startup window length). For a tolerance P_0 \in (0, 1), define the startup window length as the smallest j such that P_stall(j) \geq P_0.

Proposition 4.3 (Ideal startup window). Let

    \phi^*(P_0) := F_{\chi^2_1}^{-1}(P_0) / (1 + \hat{\rho}).

Whenever \phi^*(P_0) < 1, the ideal startup window is

    j^*_ideal(P_0) = \lceil log(1 - \phi^*(P_0)) / log \beta_2 \rceil.    (16)

In practice, measured stalling often starts from a nonzero floor P_init, so we use

    P_total(j) = P_init + (1 - P_init) P_stall(j),    (17)

which is equivalent to replacing P_0 by the effective threshold

    P_0^eff := (P_0 - P_init) / (1 - P_init),    (18)

and therefore

    j^*(P_0) = 0 if P_0 \leq P_init,  and  j^*(P_0) = j^*_ideal(P_0^eff) if P_0 > P_init.    (19)

Details on the empirical floor P_init and numerical startup-window values are deferred to Sections A.4 and A.5. Figure 2 illustrates two main trends. First, the stalled fraction is already nonzero immediately after initialization, and this empirical floor P_init is itself larger at lower precision; we discuss its origin and estimate it in Section A.4. Second, the stalled fraction rises as the second moment grows toward its steady-state scale, producing a finite responsive startup window before the EMA becomes largely stale.
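Equation (16) is easy to evaluate with the standard library, using the \chi^2_1 quantile F^{-1}(p) = (\Phi^{-1}((1+p)/2))^2. The sketch below ignores the empirical floor P_init, i.e. it computes the ideal window of Proposition 4.3:

```python
import math
from statistics import NormalDist

def chi2_1_inv(p: float) -> float:
    """Quantile of a chi-squared variable with 1 dof via the normal quantile."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2

def startup_window(eps: float, beta2: float = 0.999, p0: float = 0.6) -> int:
    """Ideal startup window length j*_ideal(P0) from Eq. (16)."""
    rho = eps * math.log(2.0) / (2.0 * (1.0 - beta2))   # Eq. (7)
    phi_star = chi2_1_inv(p0) / (1.0 + rho)             # < 1 for these formats
    return math.ceil(math.log(1.0 - phi_star) / math.log(beta2))

print(startup_window(2**-7), startup_window(2**-3), startup_window(2**-2))
```

For \beta_2 = 0.999 and P_0 = 0.6 this gives a responsive window of roughly 212 steps in BF16 but only on the order of 17 steps in FP8 and 9 in FP4, before accounting for the floor P_init.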
This window is longest in BF16 and substantially shorter in FP8; for FP4, accounting for the empirical floor is essential to separate the reset-sensitive buildup of stalling from the large format-dependent baseline. This precision-dependent startup behavior will later provide the key intuition for why state resets can help and why their timing matters.

4.3 First moment stalling

The first moment behaves qualitatively similarly to the second, even though its update scale and ulp scale do not normalize in the same way. For the second moment, the update is driven by g_t^2 and the stored state is of the same order as the local variance, so the variance scale largely cancels from the update-to-ulp ratio, yielding the nearly universal approximation developed above. For the first moment, m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, the relevant scale is set directly by the signed gradient. As a result, first-moment stalling depends on the local signal-to-noise ratio between the mean gradient and its fluctuations, and therefore varies across coordinates rather than being determined primarily by the precision format. This makes the first moment less amenable to a simple closed-form analysis of the kind developed above for the second moment. Empirically, however, its behavior follows a qualitatively similar pattern, as we show in Appendix Section A.6.

5 A recipe for stalling mitigation

Quantized EMA states fail primarily because stalling makes the effective EMA dynamics too slow. We discuss two complementary ways to mitigate this.

Figure 2: Measured stalling fraction during LLM pre-training for quantized EMA states (BF16, FP8, and FP4 panels; NR, SR, and periodic-reset runs against the NR and SR theory curves). As the second moment grows away from initialization, the stalled fraction rises and then plateaus near its steady-state level. Lower-precision formats enter the stalled regime earlier and remain there longer. 100M-parameter LLaMA model trained on C4.

5.1 Stochastic rounding

Stochastic rounding (SR) rounds to one of the two neighboring representable values with probabilities proportional to distance, replacing the NR hard gate with a soft one.

Proposition 5.1 (SR stalling probability). Under the effective-spacing approximation at steady state, with z_t as in (6), the conditional probability that SR returns to the current stored value is

    p^SR_stall(z_t; \hat{\rho}) = max(0, 1 - |z_t - 1| / (2 \hat{\rho})),    (20)

and hence P^SR_stall(\hat{\rho}) = E[p^SR_stall(z_t; \hat{\rho})].

SR is only a partial fix. At \beta_2 = 0.999, the steady-state stalling probabilities remain high (Table 2) and are consistent with the empirical behavior in Figure 2.

Table 2: Theoretical steady-state stalling probabilities under SR for \beta_2 = 0.999.

    Format          BF16    FP8     FP4
    P^SR_stall      0.825   0.989   0.994

Thus SR materially helps at BF16, but at FP8 and FP4 the stalled regime remains dominant. This analysis isolates unchanged-state probabilities only; in practice SR may also help by reducing directional quantization bias (Croci et al., 2022).

5.2 Periodic state resetting

When steady-state stalling is high, a quantized EMA eventually becomes nearly frozen, so resetting the state is a natural way to restore responsiveness. The transient analysis above shows, however, that this responsiveness is not lost immediately: after each initialization there is a finite startup window during which the EMA remains adaptive before stalling dominates. Since the length of this window is governed primarily by the effective precision ratio \hat{\rho} in Definition 3.2, it is largely stable throughout training for a fixed precision regime.
This both motivates periodic resets and suggests how to time them: resets should occur once the startup window is largely exhausted, but not so early that they discard useful averaging.

We compare two quantities over a cycle of length K: the accumulated reset-sensitive staleness, and the statistical benefit still left to gain from continuing the EMA. Removing the reset-insensitive floor P_init, we define the normalized stalling progress

    S(j) := (P_total(j) - P_init) / (P^ss_total - P_init) = P_stall(j) / P^ss_stall,    (21)

where P^ss_total = P_init + (1 - P_init) P^ss_stall. Rather than using the endpoint S(K) alone, we average over the cycle and count only the excess above a tolerance level s_0:

    \bar{S}_{s_0}(K) := (1/K) \sum_{j=1}^{K} [ (S(j) - s_0) / (1 - s_0) ]_+,    (22)

and compare it with the remaining statistical error of the bias-corrected EMA, whose derivation is given in Appendix Section A.7,

    E(K) := 2 \beta_2^K / (1 + \beta_2^K).    (23)

Proposition 5.2 (Heuristic reset period). Define

    K^* := min { K \geq 1 : \bar{S}_{s_0}(K) \geq E(K) }.    (24)

Using P_stall(j) \approx F_{\chi^2_1}((1 + \hat{\rho})(1 - \beta_2^j)) and P^ss_stall \approx F_{\chi^2_1}(1 + \hat{\rho}), the condition in (24) is a one-dimensional monotone test in K. We use s_0 = 0.6 throughout.
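The test in (24) amounts to a one-dimensional scan over K. The following standard-library sketch (ours, with P_init = 0 for simplicity) implements it using the closed form of the \chi^2_1 CDF:

```python
import math

def chi2_1_cdf(x: float) -> float:
    """CDF of a chi-squared variable with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def reset_period(eps: float, beta2: float = 0.999, s0: float = 0.6,
                 k_max: int = 20_000) -> int:
    """Smallest K with mean excess staleness >= remaining EMA error, Eq. (24)."""
    rho = eps * math.log(2.0) / (2.0 * (1.0 - beta2))   # Eq. (7)
    p_ss = chi2_1_cdf(1.0 + rho)
    running = 0.0                    # running sum of [(S(j) - s0)/(1 - s0)]_+
    for k in range(1, k_max + 1):
        s_j = chi2_1_cdf((1.0 + rho) * (1.0 - beta2 ** k)) / p_ss
        running += max(0.0, (s_j - s0) / (1.0 - s0))
        err = 2.0 * beta2 ** k / (1.0 + beta2 ** k)     # Eq. (23)
        if running / k >= err:                          # Eq. (24)
            return k
    return k_max

print([reset_period(e) for e in (2**-7, 2**-3, 2**-2)])
```

With \beta_2 = 0.999 and s_0 = 0.6 this reproduces the ordering and rough magnitudes of the heuristic periods reported in Table 3 (longest for BF16, shortest for FP4); including the empirical floor P_init shifts the values somewhat.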
Intuitively, we reset once the cycle spends enough time in a highly stalled regime that the remaining benefit of further averaging is no longer worth the cost. Solving (24) for the precisions considered in this work gives the periods in Table 3, which should be viewed as selecting a good operating region rather than a uniquely optimal period.

Table 3: Heuristic reset periods from (24) with \beta_2 = 0.999 and s_0 = 0.6.

    Format   BF16           FP8           FP4
    K^*      \approx 1116   \approx 320   \approx 224

Figure 3: Training loss curves (validation loss vs. step) for the three model scales (LLaMA-60M, 100M, 350M) under BF16 and FP8 optimizer states, comparing AdamW with and without restarts.

Resetting the first moment is mainly useful in very aggressive regimes such as FP4, since at higher precisions its steady-state stalling rate is often below the initial post-reset level P_init, as also shown in Section A.6.

6 Experiments

6.1 Settings

We follow the pre-training protocol of (Zhao et al., 2024; Lialin et al., 2024) on C4 (Raffel et al., 2020) using the LLaMA architecture family (Touvron et al., 2023a), matching the small-scale GaLore setup for the 60M, 100M, and 350M models.
Concretely, the models are trained for 10K/20K/60K steps on approximately 1.3B/2.6B/7.8B tokens, with maximum sequence length 256, token batch size 131K, linear warmup over the first 10% of training, cosine decay to 10% of the initial learning rate, and no data repetition. We evaluate three optimizer-state precision regimes: BF16; FP8, using E4M3 for both moments; and FP4, with the first moment in FP4 (E2M1) and the unsigned second moment in FP4 (E2M2). For FP8 we use per-tensor scaling; for FP4 we use block-wise scaling with block size 128 and exclude the zero point from the second-moment grid, following (Li et al., 2023). Hyperparameters are given in Section A.9.

For each regime, we compare vanilla AdamW against AdamW with stochastic rounding and periodic EMA resets. Unless otherwise stated, the second-moment reset period is chosen using Equation (24), with P_init estimated empirically from the stalled fraction at the first step after state initialization and s_0 = 0.6. Although this heuristic is derived under nearest rounding, we use it as a practical proxy under stochastic rounding as well: SR lowers unchanged-state probabilities but preserves the same qualitative dependence on precision and decay, so the resulting periods still identify a good reset regime. This gives periods of 1000 for BF16, 300 for FP8, and 200 for FP4. The first moment is reset with the same period.

6.2 Results

We first evaluate whether periodic EMA resets improve training under quantized optimizer-state storage across several practical low-precision regimes. Table 4 and Figure 3 show that stochastic rounding and periodic EMA resets consistently improve final performance in BF16, FP8, and FP4. Although implementation choices such as scaling affect how severe stalling becomes, they do not remove the underlying problem of staleness.

6.3 Ablation Studies

All ablations are conducted on the 100M model.
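For reference, the block-wise optimizer-state quantization used in the FP4 setting (Section 6.1) can be sketched as follows. The unsigned E2M2-style grid construction and the absmax scale convention here are simplified illustrative assumptions of ours, not the exact kernels used in the experiments; the zero point is excluded from the grid, as in the settings above:

```python
import random

def e2m2_grid():
    """Positive values of a toy unsigned E2M2-style FP4 grid (zero excluded)."""
    return sorted({(1.0 + m / 4.0) * 2.0 ** e for e in range(4) for m in range(4)})

GRID = e2m2_grid()   # 16 representable positive values, 1.0 .. 14.0

def quantize_block(block, grid=GRID):
    """Per-block absmax scaling followed by nearest-grid-point rounding."""
    scale = max(block) / grid[-1] if max(block) > 0 else 1.0
    codes = [min(range(len(grid)), key=lambda i: abs(x / scale - grid[i]))
             for x in block]
    return codes, scale

def dequantize_block(codes, scale, grid=GRID):
    return [grid[c] * scale for c in codes]

# Round-trip a block of 128 nonnegative "second-moment" values.
rng = random.Random(0)
block = [rng.random() ** 2 for _ in range(128)]
codes, scale = quantize_block(block)
recon = dequantize_block(codes, scale)
rel_err = max(abs(a - b) / max(block) for a, b in zip(block, recon))
print(round(rel_err, 3))
```

Only the 4-bit codes and one scale per 128-entry block are stored, which is the source of the memory savings; the coarse relative spacing of the grid is precisely the \varepsilon = 2^{-2} that drives stalling in the analysis.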
Table 4: Validation loss across model scales.

    Configuration          60M    100M   350M
    BF16 AdamW             3.52   3.27   2.99
    BF16 AdamW + Resets    3.45   3.21   2.92
    FP8 AdamW              3.53   3.33   3.02
    FP8 AdamW + Resets     3.44   3.22   2.92
    FP4 AdamW              3.52   3.32   3.03
    FP4 AdamW + Resets     3.48   3.26   2.97

Sensitivity to the Reset Period. We vary the second-moment reset period over a wide range while keeping all other hyperparameters fixed. Figure 4 shows that performance is robust over a broad range, with the best results attained near the period predicted by Equation (24). This supports the view that the useful reset window is governed by quantized EMA stalling.

Figure 4: Sensitivity to the reset period (final validation loss vs. restart period in steps, for BF16 and FP8, against the corresponding AdamW baselines).

Contribution of Different Components. We further isolate each mitigation component by keeping training otherwise in FP32 and quantizing only the optimizer states. Table 5 shows that both stochastic rounding and resets help on their own in quantized regimes, while their combination gives the best overall performance. By contrast, applying the same reset schedule in full FP32 slightly hurts performance, reinforcing our main claim that the benefit of resets is tied to low-precision state staleness rather than to resets being universally useful.

Table 5: Component ablation with FP32 training and quantized optimizer states only. In the FP32 column, we apply resets only, with period 1000 for both moments.

    Configuration     FP32    BF16    FP8     FP4
    NR only           3.155   3.181   3.193   3.199
    SR only           –       3.156   3.160   3.182
    Resets only       3.161   3.158   3.159   3.166
    SR + Resets       –       3.155   3.155   3.160
    Memory savings    0%      50%     75%     87.5%

Asymmetric and Adaptive Resets. We next consider two extensions of the basic reset strategy: asymmetric reset periods for the two moments, and an adaptive rule based on Proposition 5.2 and detailed in Section A.8.
In the latter case, we set P^ss_stall = 1 and s_0 = 0.6, estimate P_stall(j) online, and trigger a reset once \bar{S}_{s_0}(K) - E(K) > 0. As shown in Table 6, using no resets for the first moment and period 300 for the second slightly improves on the symmetric baseline, consistent with the fact that in FP8 first-moment steady-state stalling is typically the same as its initial floor P_init; see Section A.6. The adaptive rule is comparable while avoiding a fixed reset period.

Table 6: Different resetting policies under FP8.

    Reset Strategy       Validation Loss
    Symmetric Resets     3.223
    Asymmetric Resets    3.212
    Adaptive Resets      3.225

Extensions to Other EMA-Based Optimizers. Finally, we test the same reset strategy on SOAP, also resetting the EMA states of its left and right preconditioners, using the same period as in AdamW and following the setup of (Wang et al., 2024). Table 7 shows that resets again improve low-precision performance.

Table 7: SOAP with and without EMA resets.

    Configuration    BF16    FP8
    SOAP             3.22    4.15
    SOAP + Resets    3.19    3.23

7 Conclusion

Low-precision optimizer states do not just save memory; they change the dynamics of optimization. Our central message is that quantized EMA states can become effectively stale for long stretches of training, and that this staleness is a useful lens for understanding both the failure modes of low-precision training and the surprising effectiveness of state resets. Viewed this way, resets are not merely a heuristic borrowed from successful training recipes: they are a targeted way to restore responsiveness once the optimizer state has stopped meaningfully evolving. Just as importantly, our analysis suggests that good resets are not arbitrary: their timing matters.
We hope this perspective helps shift the discussion of low-precision optimizer states from storage alone to dynamics, and motivates future work on optimizers and quantization schemes designed explicitly to remain responsive under quantization.

8 Limitations

Our analysis is designed to isolate a specific and practically important mechanism, state-update stalling under quantized EMA storage, rather than to model every aspect of low-precision training. As a result, some components are intentionally approximate, including the local scalar analysis of the EMA dynamics and the heuristic used to choose reset periods. These simplifications still match the main empirical trends well in our experiments, but the exact quantitative behavior can depend on implementation choices such as scaling strategy, rounding mode, and optimizer variant. In addition, our empirical study focuses on the LLaMA pre-training settings and EMA-based optimizers considered here. Extending the analysis to broader training regimes, larger scales, and additional classes of stateful optimizers is a natural direction for future work.

References

Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. 2019. Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 32.

Kavosh Asadi, Rasool Fakoor, and Shoham Sabach. 2023. Resetting the optimizer in deep RL: An empirical study. In Advances in Neural Information Processing Systems, volume 36, pages 72284–72324. Curran Associates, Inc.

Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. 2025. Quartet: Native FP4 training can be optimal for large language models. Advances in Neural Information Processing Systems.

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and 1 others. 2023.
Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36:49205–49233.

Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. 2025. FP4 all the way: Fully quantized training of LLMs. Advances in Neural Information Processing Systems.

Matteo Croci, Massimiliano Fasi, Nicholas J Higham, Theo Mary, and Mantas Mikaitis. 2022. Stochastic rounding: implementation, error analysis and applications. Royal Society Open Science, 9(3).

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit optimizers via block-wise quantization. International Conference on Learning Representations.

Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. 2025. Scaling FP8 training to trillion-token LLMs. International Conference on Learning Representations.

Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, and Mingyi Hong. 2025. Scalable parameter and memory efficient pretraining for LLM: Recent algorithmic advances and benchmarking. arXiv preprint arXiv:2505.22922.

David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48.

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. OLMo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, Bangkok, Thailand. Association for Computational Linguistics.

Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. 2026.
Effective quantization of Muon optimizer states. arXiv preprint arXiv:2509.23106.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746. PMLR.

Vineet Gupta, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR.

Nicholas J Higham. 1993. The accuracy of floating point summation. SIAM Journal on Scientific Computing, 14(4):783–799.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, and 1 others. 2022. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030.

Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, and 1 others. 2025a. Stable-SPAM: How to train in 4-bit more stably than 16-bit Adam. In First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models.

Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, and Shiwei Liu. 2025b. SPAM: Spike-aware Adam with momentum reset for stable LLM training. International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization. Preprint.

Bingrui Li, Jianfei Chen, and Jun Zhu. 2023. Memory efficient optimizers with 4-bit states. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. 2024.
ReLoRA: High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations.

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, and 9 others. 2025. Muon is scalable for LLM training. Preprint.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep K. Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. 2022. FP8 formats for deep learning. ArXiv, abs/2209.05433.

Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, and 1 others. 2023. A theory on Adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871.

Brendan O'Donoghue and Emmanuel J. Candès. 2012. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15:715–732.

Kaan Ozkara, Tao Yu, and Youngsuk Park. 2025. Stochastic rounding for LLM training: Theory and practice. In International Conference on Artificial Intelligence and Statistics.

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. 2023. FP8-LM: Training FP8 large language models. Preprint.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020.
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Xuan Tang, Jichu Li, and Difan Zou. 2026. A convergence analysis of adaptive optimizers under floating-point quantization. International Conference on Learning Representations.

Kristi Topollai and Anna Ewa Choromanska. 2026. Adaptive memory momentum via a model-based framework for deep learning optimization. In The 29th International Conference on Artificial Intelligence and Statistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. 2025. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In International Conference on Learning Representations, volume 2025, pages 93423–93444.

Bao Wang, Tan Nguyen, Tao Sun, Andrea L. Bertozzi, Richard G. Baraniuk, and Stanley J. Osher. 2022. Scheduled restart momentum for accelerated stochastic gradient descent.
SIAM Journal on Imaging Sciences, 15(2):738–761.

Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31.

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zheng-Jun Zha, and Peng Cheng. 2025. Optimizing large language model training using FP4 quantization. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 62937–62957. PMLR.

Sike Wang, Pan Zhou, Jia Li, and Hua Huang. 2024. 4-bit Shampoo for memory-efficient network training. Advances in Neural Information Processing Systems, 37:126997–127029.

Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, and Song Han. 2025. COAT: Compressing optimizer states and activations for memory-efficient FP8 training. International Conference on Learning Representations.

Zeke Xie, Qian-Yuan Tang, Mingming Sun, and Ping Li. 2023. On the overlooked structure of stochastic gradients. Advances in Neural Information Processing Systems, 36:66257–66276.

Cong Xu, Wenbin Liang, Mo Yu, Anan Liu, Ke-Yue Zhang, Shunli Wang, Lizhuang Ma, Jianyong Wang, Jun Wang, and Wei Zhang. 2025. Pushing the limits of low-bit optimizers: A focus on EMA dynamics. Preprint.

Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, and Luke Huan. 2024. Collage: Light-weight low-precision strategy for LLM training. In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org.

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. 2025. Adam-mini: Use fewer learning rates to gain more. International Conference on Learning Representations.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. 2024. GaLore: Memory-efficient LLM training by gradient low-rank projection. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 61121–61143. PMLR.

A Appendix

A.1 Floating-point grid spacing and the effective precision ratio

For a normalized floating-point number $x = m\,2^e$, $m \in [1, 2)$, $e \in \mathbb{Z}$, the spacing between adjacent representable numbers within a fixed exponent binade is $\mathrm{ulp}(x) = 2^{e-p}$, where $p$ is the number of stored mantissa bits. Hence
\[
\mathrm{ulp}(x) = 2^{e-p} = \frac{2^{-p}}{m}\,x, \tag{25}
\]
so the exact local relative spacing is $\mathrm{ulp}(x)/x = 2^{-p}/m$. In the main text we do not carry this binade- and mantissa-dependent quantity through all formulas. Instead, we summarize each format by a single parameter $\varepsilon$, writing $\mathrm{ulp}(x) \approx (\varepsilon/m)\,x$. For the normalized regimes considered here, we take $\varepsilon = 2^{-p}$; in particular, $\varepsilon = 2^{-7}$ for BF16, $\varepsilon = 2^{-3}$ for E4M3 FP8, and $\varepsilon = 2^{-2}$ for the unsigned E2M2 FP4 format used for the second moment.

In our setting, the current EMA state is $v_{t-1}$ and the high-precision proposal is $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$. Hence the update increment is
\[
\Delta_t := v_t - v_{t-1} = (1 - \beta_2)(g_t^2 - v_{t-1}).
\]
Writing $v_{t-1} = m_t 2^{e_t}$ and using (25), the local spacing around the stored value is approximated by
\[
u_t = \mathrm{ulp}(v^q_{t-1}) \approx \mathrm{ulp}(v_{t-1}) = \frac{\varepsilon}{m_t}\, v_{t-1}.
\]
Under nearest rounding, state-update stalling occurs whenever $|\Delta_t| < u_t/2$; equivalently, $|\Delta_t| \ge u_t/2$ is required for the stored value to change. Therefore
\[
\frac{|\Delta_t|}{u_t} = \frac{(1 - \beta_2)\,|g_t^2 - v_{t-1}|}{(\varepsilon/m_t)\, v_{t-1}}. \tag{26}
\]
Under the local-stationarity approximation $v_{t-1} \approx \sigma^2$, and with $z_t := g_t^2/\sigma^2 \sim \chi^2_1$, this becomes
\[
\frac{|\Delta_t|}{u_t} \approx \frac{(1 - \beta_2)\, m_t\, |z_t - 1|}{\varepsilon}. \tag{27}
\]
Hence the nearest-rounding gate $|\Delta_t| \ge u_t/2$ is equivalent to
\[
|z_t - 1| \ge \frac{\varepsilon}{2 (1 - \beta_2)\, m_t}. \tag{28}
\]
The threshold depends on the normalized mantissa $m_t$, which varies across coordinates. To reduce this dependence to a single scalar, we adopt a log-uniform mantissa model and treat $\log_2 m_t$ as approximately uniform on $[0, 1]$, corresponding to $f(m) = \frac{1}{m \ln 2}$, $m \in [1, 2]$. Under this model,
\[
\bar m := \mathbb{E}[m] = \int_1^2 m \cdot \frac{1}{m \ln 2}\, dm = \frac{1}{\ln 2}. \tag{29}
\]
We then replace the coordinate-dependent threshold in (28) by its single-scalar surrogate obtained from $\bar m$. This is an effective-spacing approximation rather than an exact averaging of the stalling probability, and it is used only to obtain the compact predictor $\hat\rho$:
\[
\hat\rho := \frac{\varepsilon}{2 (1 - \beta_2)\, \bar m}, \qquad \bar m = \frac{1}{\ln 2}. \tag{30}
\]
Thus the gate is approximated by $|z_t - 1| \ge \hat\rho$, or equivalently $\frac{|\Delta_t|}{u_t} \approx \frac{|z_t - 1|}{2 \hat\rho}$.

A.2 NR stalling probability

We derive Proposition 3.3. Under nearest rounding (NR), the stored state stalls if and only if the high-precision proposal rounds back to the current stored value. Under the hard-gate approximation from the main text,
\[
|\Delta_t| < \frac{u_t}{2} \iff |z_t - 1| < \hat\rho, \qquad z_t \sim \chi^2_1.
\]
Equivalently, the stored value changes only when $|z_t - 1| \ge \hat\rho$. Since $z_t \ge 0$, the stalling region is $z_t \in \big(\max(0, 1 - \hat\rho),\, 1 + \hat\rho\big)$. Therefore,
\[
P^{\mathrm{NR}}_{\mathrm{stall}}(\hat\rho) = P\big(\max(0, 1 - \hat\rho) < z_t < 1 + \hat\rho\big) \tag{31}
\]
\[
= F_{\chi^2_1}(1 + \hat\rho) - F_{\chi^2_1}\big(\max(0, 1 - \hat\rho)\big). \tag{32}
\]

Useful closed form for the $\chi^2_1$ CDF. If $z = X^2$ with $X \sim \mathcal{N}(0, 1)$, then for $x \ge 0$,
\[
F_{\chi^2_1}(x) = P(|X| \le \sqrt{x}) = 2\Phi(\sqrt{x}) - 1.
\]
This identity is convenient for numerical evaluation and inversion of the stalling thresholds.

A.3 SR stalling probability

We derive Proposition 5.1. Fix a grid with spacing $u$ around the current stored value $\bar v$, and suppose the higher-precision proposal is $\tilde v = \bar v + \delta$.
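Both the effective precision ratio $\hat\rho$ from (30) and the NR stalling probability (31)–(32) are straightforward to evaluate using the closed-form $\chi^2_1$ CDF above. A minimal sketch (function names are ours):

```python
from math import sqrt, erf, log

M_BAR = 1.0 / log(2.0)   # mean mantissa under the log-uniform model, Eq. (29)

def chi2_1_cdf(x: float) -> float:
    # Closed form above: F_{chi^2_1}(x) = 2*Phi(sqrt(x)) - 1 = erf(sqrt(x/2))
    return erf(sqrt(x / 2.0)) if x > 0.0 else 0.0

def rho_hat(eps: float, beta2: float) -> float:
    # Effective precision ratio, Eq. (30)
    return eps / (2.0 * (1.0 - beta2) * M_BAR)

def p_stall_nr(rho: float) -> float:
    # NR stalling probability, Eqs. (31)-(32)
    return chi2_1_cdf(1.0 + rho) - chi2_1_cdf(max(0.0, 1.0 - rho))
```

For $\beta_2 = 0.999$ this gives $\hat\rho \approx 2.71$ (BF16, $\varepsilon = 2^{-7}$), $43.3$ (FP8, $\varepsilon = 2^{-3}$), and $86.6$ (FP4, $\varepsilon = 2^{-2}$), matching the values used later in Section A.7, with steady-state NR stalling already around 0.95 in BF16.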
Under stochastic rounding (SR), if $\tilde v$ lies between $\bar v$ and an adjacent grid point at distance $u$, then the probability of rounding back to $\bar v$ is proportional to proximity:
\[
P\big(Q_{\mathrm{SR}}(\tilde v) = \bar v \,\big|\, \delta\big) =
\begin{cases}
1 - |\delta|/u, & |\delta| < u,\\
0, & |\delta| \ge u.
\end{cases}
\]
At steady state, writing $\alpha := 1 - \beta_2$ and using
\[
\delta \approx \alpha \bar v (z_t - 1), \qquad u \approx (\varepsilon/\bar m)\, \bar v, \qquad z_t \sim \chi^2_1,
\]
gives
\[
\frac{|\delta|}{u} \approx \frac{\alpha \bar m}{\varepsilon}\, |z_t - 1| = \frac{|z_t - 1|}{2 \hat\rho}.
\]
Hence the conditional SR stalling probability is
\[
p^{\mathrm{SR}}_{\mathrm{stall}}(z_t; \hat\rho) = \max\!\Big(0,\, 1 - \frac{|z_t - 1|}{2 \hat\rho}\Big), \tag{33}
\]
and averaging over $z_t$ yields
\[
P^{\mathrm{SR}}_{\mathrm{stall}}(\hat\rho) = \mathbb{E}_{z_t \sim \chi^2_1}\!\Big[\max\!\Big(0,\, 1 - \frac{|z_t - 1|}{2 \hat\rho}\Big)\Big]. \tag{34}
\]

Large-$\hat\rho$ approximation. When $\hat\rho$ is large, the truncation at $0$ is rarely active, since $|z_t - 1|$ is typically $O(1)$. Dropping the truncation gives
\[
P^{\mathrm{SR}}_{\mathrm{stall}}(\hat\rho) \approx 1 - \frac{\mu_1}{2 \hat\rho}, \qquad \mu_1 := \mathbb{E}[\,|z_t - 1|\,].
\]
For $z_t = X^2$ with $X \sim \mathcal{N}(0, 1)$, $\mu_1$ has a closed form.

Lemma A.1 (Closed form for $\mu_1$). If $X \sim \mathcal{N}(0, 1)$ and $z_t = X^2$, then
\[
\mu_1 = \mathbb{E}[\,|X^2 - 1|\,] = 4 \varphi(1) = \frac{4}{\sqrt{2\pi}}\, e^{-1/2} \approx 0.9679,
\]
where $\varphi$ is the standard normal density.

Proof. By symmetry,
\[
\mu_1 = \mathbb{E}[\,|X^2 - 1|\,] = 2 \int_0^\infty |x^2 - 1|\, \varphi(x)\, dx
= 2 \Big( \int_0^1 (1 - x^2)\, \varphi(x)\, dx + \int_1^\infty (x^2 - 1)\, \varphi(x)\, dx \Big).
\]
Since $\mathbb{E}[X^2 - 1] = 0$, the positive and negative parts have equal mass:
\[
\int_1^\infty (x^2 - 1)\, \varphi(x)\, dx = \int_0^1 (1 - x^2)\, \varphi(x)\, dx.
\]
Hence $\mu_1 = 4 \int_0^1 (1 - x^2)\, \varphi(x)\, dx$. Using $\varphi'(x) = -x \varphi(x)$,
\[
\int_0^1 x^2 \varphi(x)\, dx = \big[ -x \varphi(x) \big]_0^1 + \int_0^1 \varphi(x)\, dx = -\varphi(1) + \int_0^1 \varphi(x)\, dx,
\]
so $\int_0^1 (1 - x^2)\, \varphi(x)\, dx = \varphi(1)$, which gives $\mu_1 = 4 \varphi(1)$.

This subsection isolates the unchanged-state probability under SR. It does not attempt to capture the full benefit of SR, which may also arise from the conditional unbiasedness of the quantization error even on steps where the stored state changes.

A.4 Empirical initial floor $P_{\mathrm{init}}$

Empirically, the stalled fraction immediately after initialization is already nonzero.
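The SR stalling probability (34) and the constant $\mu_1$ from Lemma A.1 are easy to check numerically before moving on. A sketch (function names are ours), evaluating (34) by Monte Carlo and comparing with the large-$\hat\rho$ approximation:

```python
import random
from math import sqrt, pi, exp

# mu_1 = 4 * phi(1) from Lemma A.1, approximately 0.9679
MU_1 = 4.0 * exp(-0.5) / sqrt(2.0 * pi)

def p_stall_sr(rho: float, n: int = 200_000, seed: int = 0) -> float:
    # Monte Carlo evaluation of Eq. (34): draw z ~ chi^2_1 as the square of
    # a standard normal and average max(0, 1 - |z - 1| / (2 rho)).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0) ** 2
        total += max(0.0, 1.0 - abs(z - 1.0) / (2.0 * rho))
    return total / n

def p_stall_sr_large_rho(rho: float) -> float:
    # Large-rho approximation: 1 - mu_1 / (2 rho)
    return 1.0 - MU_1 / (2.0 * rho)
```

At the FP8 value $\hat\rho = 43.3$ the two agree closely; at the BF16 value $\hat\rho = 2.71$ the truncation at zero is occasionally active and the exact expectation sits a little above the approximation, but both remain below the corresponding NR stalling level.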
In the main text we therefore treat $P_{\mathrm{init}}$ as an empirical correction; here we explain why such a floor is expected and why it depends only weakly on model scale. At the first update after a zero-initialized second moment, the higher-precision proposal is $\tilde v_1 = \alpha g_1^2$, $\alpha := 1 - \beta_2$. A coordinate remains at its initialized stored value in two main ways: (i) the corresponding gradient entry is exactly zero, and (ii) the proposal is nonzero in higher precision but is mapped back to the initialized quantized value by low-precision storage.

Let $p_{\mathrm{zero}} := \Pr(g_{1,i} = 0)$ denote the fraction of coordinates with exactly zero gradient at the first step. This contribution is architecture- and data-dependent, but largely format-independent. The second contribution depends on the quantization scheme. Consider a scaling group of size $B$ (the full tensor under per-tensor scaling, or one block under block-wise scaling). If the maximum proposal magnitude in the group is mapped to the format maximum $x_{\max}$, then coordinate $i$ is represented by
\[
x^{\mathrm{work}}_i = \frac{g_i^2}{\max_{j \le B} g_j^2}\, x_{\max}.
\]
Under nearest rounding, a nonzero coordinate is mapped back to the initialized stored value whenever
\[
\frac{g_i^2}{\max_{j \le B} g_j^2} < \tau, \qquad \tau := \frac{s_{\min}}{2 x_{\max}}, \tag{35}
\]
where $s_{\min}$ is the smallest positive representable spacing of the format. For stochastic rounding, the hard threshold is replaced by the corresponding soft gate, so the probability of returning to the initialized value is smaller but controlled by the same scale ratio.

To obtain a simple approximation, assume that within one scaling group $g_i^2/\sigma^2$ approximately follows a $\chi^2_1$ distribution. If the group maximum is replaced by a typical upper order statistic $M_B := F^{-1}_{\chi^2_1}(1 - 1/B)$, then the fraction of nonzero coordinates crushed back to the initialized value under nearest rounding is approximately
\[
f^{\mathrm{NR}}_{\mathrm{crush}}(\tau, B) \approx F_{\chi^2_1}(\tau M_B). \tag{36}
\]
This is a heuristic approximation: it replaces the random group maximum by a typical high quantile.
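Equation (36) is cheap to evaluate once the $\chi^2_1$ CDF and quantile are available. A sketch (function names are ours; $\tau$ and $B$ as defined above):

```python
from math import sqrt, erf

def chi2_1_cdf(x: float) -> float:
    # F_{chi^2_1}(x) = 2*Phi(sqrt(x)) - 1 = erf(sqrt(x/2)) for x >= 0
    return erf(sqrt(x / 2.0)) if x > 0.0 else 0.0

def chi2_1_quantile(p: float, tol: float = 1e-12) -> float:
    # Invert the chi^2_1 CDF by bisection on [0, 200].
    lo, hi = 0.0, 200.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_1_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def f_crush_nr(tau: float, B: int) -> float:
    # Eq. (36): fraction of nonzero coordinates rounded back to the
    # initialized value, with the group maximum replaced by the typical
    # upper order statistic M_B = F^{-1}_{chi^2_1}(1 - 1/B).
    M_B = chi2_1_quantile(1.0 - 1.0 / B)
    return chi2_1_cdf(tau * M_B)
```

Increasing $B$ from $10^2$ to $10^4$ raises $M_B$ only from about $6.6$ to about $15.1$, which is the weak, roughly logarithmic scale dependence discussed in the text.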
For stochastic rounding, the analogous crush fraction is obtained by averaging the soft gate over the same distribution, and is therefore no larger than the NR value. Combining exact zeros with quantization-induced crushing gives
\[
P_{\mathrm{init}} \approx p_{\mathrm{zero}} + (1 - p_{\mathrm{zero}})\, f_{\mathrm{crush}}(\tau, B). \tag{37}
\]
This explains two qualitative features seen in practice. First, BF16 typically has little additional crushing beyond structural zeros, whereas narrower-range formats can exhibit a much larger initial floor. Second, the dependence on model size is weak: under per-tensor scaling it enters only through the typical order statistic $M_B$, which grows only logarithmically in $B$, while under fixed block-wise scaling it is governed primarily by the block size itself.

Because subsequent stalling slows the growth of the EMA, the ideal predictor $j^\star_{\mathrm{ideal}}$ can still rise somewhat too quickly after initialization. Rather than introducing a fully self-consistent transient model, in the main text we account for the dominant effect through the phenomenological correction
\[
P_{\mathrm{total}}(j) = P_{\mathrm{init}} + (1 - P_{\mathrm{init}})\, P_{\mathrm{stall}}(j),
\]
which preserves the closed-form startup-window expression while incorporating the empirical post-initialization floor.

A.5 Deriving the startup window length $j^\star(P_0)$

We derive the transient stalling law under an ideal-trajectory approximation. After initialization (or reset) to zero, the unquantized EMA evolves as
\[
v_j = \beta_2 v_{j-1} + \alpha g_j^2, \qquad v_0 = 0, \qquad \alpha := 1 - \beta_2.
\]
Over the short window relevant to the startup analysis, we assume that a single gradient coordinate has an approximately fixed local scale $\sigma$, and use $g_j^2/\sigma^2 \approx \chi^2_1$ in distribution. In particular, this implies $\mathbb{E}[g_j^2] \approx \sigma^2$, which we use only to motivate a deterministic reference trajectory. By linearity of expectation,
\[
\mathbb{E}[v_j] = \alpha \sum_{k=1}^{j} \beta_2^{\,j-k}\, \mathbb{E}[g_k^2] \approx \alpha \sigma^2 \sum_{k=1}^{j} \beta_2^{\,j-k} = \sigma^2 \big(1 - \beta_2^{\,j}\big).
\]
This motivates the reference path
\[
\bar v_j := \sigma^2 \phi_j, \qquad \phi_j := 1 - \beta_2^{\,j}. \tag{38}
\]
We index $j$ as the number of post-initialization updates already accumulated in the EMA. Thus the next squared-gradient update is tested against the current stored state $\bar v_j$. Define
\[
z := \frac{g^2}{\sigma^2} \sim \chi^2_1, \qquad W := \frac{g^2}{\bar v_j} = \frac{z}{\phi_j}.
\]
Under nearest rounding (NR), stalling occurs when the innovation is smaller than half a ulp of the current state, i.e. $|W - 1| < \hat\rho$. Equivalently,
\[
-\hat\rho < \frac{z}{\phi_j} - 1 < \hat\rho \iff \phi_j (1 - \hat\rho) < z < \phi_j (1 + \hat\rho).
\]
Therefore
\[
P^{\mathrm{NR}}_{\mathrm{stall}}(j) = F_{\chi^2_1}\big(\phi_j (1 + \hat\rho)\big) - F_{\chi^2_1}\big(\max\{0,\, \phi_j (1 - \hat\rho)\}\big). \tag{39}
\]
For the formats of interest in the main text, $\hat\rho \ge 1$, so the lower endpoint is nonpositive and
\[
P^{\mathrm{NR}}_{\mathrm{stall}}(j) = F_{\chi^2_1}\big((1 + \hat\rho)\, \phi_j\big). \tag{40}
\]

Conservativeness of the ideal-trajectory approximation. More generally, if the current state scale is $v = \sigma^2 \phi$, then the same derivation gives
\[
P^{\mathrm{NR}}_{\mathrm{stall}}(\phi) = F_{\chi^2_1}\big((1 + \hat\rho)\, \phi\big), \qquad (\hat\rho \ge 1),
\]
which is monotone increasing in $\phi$. Therefore, whenever the quantized EMA lags behind the ideal reference trajectory, its instantaneous stalling probability is smaller than the ideal-trajectory prediction at the same step. This does not yield a pathwise bound for the full quantized process, since nearest rounding can also perturb successful updates upward, but it explains why the ideal-trajectory predictor is typically conservative early in the startup phase.

Why the naive substitution is incorrect. A tempting shortcut is to replace $\hat\rho$ by the scaled quantity $\hat\rho_j = \hat\rho\, \phi_j$ inside the steady-state formula. This gives a stalling interval centered at $z = 1$ for every $j$, namely $1 - \hat\rho \phi_j < z < 1 + \hat\rho \phi_j$, which is correct only at steady state ($\phi_j = 1$). During the transient, the relevant ratio is $W = z/\phi_j$, so the correct interval is instead centered at the current state scale $z = \phi_j$, as in (39).
This distinction matters especially for small $\phi_j$, because the $\chi^2_1$ density is largest near zero.

Closed-form startup window. Fix a tolerance $P_0 \in (0, 1)$ and define
\[
j^\star(P_0) := \min\{\, j : P^{\mathrm{NR}}_{\mathrm{stall}}(j) \ge P_0 \,\}.
\]
When $\hat\rho \ge 1$, using (40),
\[
F_{\chi^2_1}\big((1 + \hat\rho)\, \phi_j\big) \ge P_0 \iff (1 + \hat\rho)\, \phi_j \ge F^{-1}_{\chi^2_1}(P_0).
\]
Hence $\phi_j \ge \phi^\star(P_0)$, where $\phi^\star(P_0) := F^{-1}_{\chi^2_1}(P_0) / (1 + \hat\rho)$. Since $\phi_j = 1 - \beta_2^{\,j}$ is monotone increasing in $j$, the smallest such $j$ is
\[
j^\star(P_0) = \left\lceil \frac{\log\big(1 - \phi^\star(P_0)\big)}{\log \beta_2} \right\rceil, \qquad \phi^\star(P_0) = \frac{F^{-1}_{\chi^2_1}(P_0)}{1 + \hat\rho}, \tag{41}
\]
defined whenever $\phi^\star(P_0) < 1$, i.e. $F^{-1}_{\chi^2_1}(P_0) < 1 + \hat\rho$.

Including the empirical initial floor. The closed-form expression above captures only the precision-induced transient. In practice, the measured stalled fraction also contains an empirical floor $P_{\mathrm{init}}$, due to effects not modeled by the idealized scalar EMA analysis, including structural zeros in the gradient and, for narrow-range formats, dynamic-range crushing under scaled quantization. We therefore use the phenomenological mixture model
\[
P_{\mathrm{total}}(j) = P_{\mathrm{init}} + (1 - P_{\mathrm{init}})\, P^{\mathrm{NR}}_{\mathrm{stall}}(j). \tag{42}
\]
For a target $P_0$, this is equivalent to replacing $P_0$ by the effective target
\[
P^{\mathrm{eff}}_0 = \frac{P_0 - P_{\mathrm{init}}}{1 - P_{\mathrm{init}}}, \tag{43}
\]
with the convention that $j^\star(P_0) = 0$ whenever $P_0 \le P_{\mathrm{init}}$. Thus the startup window reported in the main text should be interpreted as a theory-plus-floor predictor rather than a fully first-principles quantity: $j^\star(P_0) = j^\star_{\mathrm{ideal}}\big(P^{\mathrm{eff}}_0\big)$.

Table 8: Startup window (theory / empirical) under nearest rounding for $\beta_2 = 0.999$ after incorporating the empirical floor $P_{\mathrm{init}}$. Theory uses (16)–(18); empirical values are measured in LLM pretraining. "0" means the target is already met immediately because $P_0 \le P_{\mathrm{init}}$. "–" means the threshold was not reached empirically within the observation window.

                    BF16         FP8       FP4
P_init              0.17         0.53      0.97
j*(P0 = 0.5)        76 / 50      0 / 0     0 / 0
j*(P0 = 0.8)        464 / 360    15 / 15   0 / 0
j*(P0 = 0.9)        1051 / 700   36 / 43   0 / 0
j*(P0 = 0.95)       3044 / –     61 / 79   0 / 0

Startup-window values. Table 8 gives the corresponding numerical startup-window values. The main trend is consistent with Figure 2: BF16 retains the longest responsive window, FP8 remains responsive only over a much shorter range, and in FP4 the total stalled fraction already exceeds practical thresholds immediately after reset once the empirical floor $P_{\mathrm{init}}$ is incorporated.

A.6 Empirical observations of stalling in the first moment

Figure 5 shows that the first moment exhibits the same basic qualitative pattern as the second: the stalled fraction is already nonzero after initialization and then quickly approaches a steady regime. What differs is the severity. In BF16, the steady-state stalled fraction is very small, confirming that first-moment stalling is largely negligible in this regime. Moreover, the initial floor $P_{\mathrm{init}}$ is slightly higher than the steady-state level, which explains the mild non-monotonic behavior and suggests that repeatedly resetting the first moment is not especially useful. In FP8, the steady-state stalled fraction is still relatively low, and is close to the initial floor; accordingly, resetting the first moment has little effect, either positive or negative. In FP4, by contrast, the steady-state stalled fraction becomes substantially larger and clearly harmful, while remaining above $P_{\mathrm{init}}$, so resetting can recover part of the performance lost to persistent first-moment stalling. Finally, as in the second moment, stochastic rounding lowers the steady-state stalled fraction, but its practical importance depends strongly on the underlying precision regime.

A.7 Derivation of the reset-period heuristic

This section derives the theory-guided heuristic used in Section 5.2.
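As a check on the startup window of Section A.5, Eq. (41) with the floor correction (43) can be evaluated directly. The sketch below (our function names; $\chi^2_1$ inversion by simple bisection) reproduces the FP8 theory entries of Table 8:

```python
from math import sqrt, erf, log, ceil

def chi2_1_cdf(x: float) -> float:
    # F_{chi^2_1}(x) = 2*Phi(sqrt(x)) - 1 = erf(sqrt(x/2)) for x >= 0
    return erf(sqrt(x / 2.0)) if x > 0.0 else 0.0

def chi2_1_quantile(p: float, tol: float = 1e-12) -> float:
    # Invert the chi^2_1 CDF by bisection on [0, 200].
    lo, hi = 0.0, 200.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_1_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def startup_window(P0: float, rho: float, beta2: float, P_init: float = 0.0):
    # j*(P0) from Eq. (41), with P0 replaced by the effective target
    # P_eff = (P0 - P_init) / (1 - P_init) of Eq. (43); valid for rho >= 1.
    P_eff = (P0 - P_init) / (1.0 - P_init)
    if P_eff <= 0.0:
        return 0                 # target already met immediately after reset
    phi_star = chi2_1_quantile(P_eff) / (1.0 + rho)
    if phi_star >= 1.0:
        return None              # threshold never reached (phi_j < 1 for all j)
    return ceil(log(1.0 - phi_star) / log(beta2))
```

For the FP8 column ($\hat\rho = 43.3$, $P_{\mathrm{init}} = 0.53$, $\beta_2 = 0.999$), this yields $0$, $15$, and $36$ for $P_0 = 0.5$, $0.8$, and $0.9$, matching the theory values in Table 8.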
The goal is not to identify a uniquely optimal reset period, but rather to preselect a useful operating region: long enough to let the EMA accumulate statistical information, but not so long that the reset-sensitive part of stalling dominates a substantial portion of the cycle. Compared to a purely endpoint-based rule, the heuristic below incorporates two empirical facts: (i) moderate second-moment staleness is often harmless, and (ii) good reset periods typically form a broad window rather than a sharp optimum.

Step 1: normalize the controllable stalling buildup. After a reset, the total measured stalled fraction at step $j$ is modeled as
\[
P_{\mathrm{total}}(j) = P_{\mathrm{init}} + (1 - P_{\mathrm{init}})\, P_{\mathrm{stall}}(j),
\]
where $P_{\mathrm{init}}$ is the baseline floor and $P_{\mathrm{stall}}(j)$ is the precision-induced stalling on the responsive subset. Since $P_{\mathrm{init}}$ is already present immediately after reset and is not removed by resetting, it should not determine the reset period. We therefore define the normalized stalling progress
\[
S(j) := \frac{P_{\mathrm{total}}(j) - P_{\mathrm{init}}}{P^{\mathrm{ss}}_{\mathrm{total}} - P_{\mathrm{init}}}. \tag{44}
\]
Using $P^{\mathrm{ss}}_{\mathrm{total}} = P_{\mathrm{init}} + (1 - P_{\mathrm{init}})\, P^{\mathrm{ss}}_{\mathrm{stall}}$, this simplifies to
\[
S(j) = \frac{P_{\mathrm{stall}}(j)}{P^{\mathrm{ss}}_{\mathrm{stall}}}. \tag{45}
\]
Under the transient model, $S(j)$ starts at $0$ and approaches $1$ as $j \to \infty$. The key point is that this normalization removes the reset-insensitive floor $P_{\mathrm{init}}$ and isolates the part of stalling that accumulates within a cycle and can be mitigated by resetting.

Step 2: measure cycle-averaged excess staleness.
A rule based only on the endpoint $S(K)$ is often too conservative: it penalizes a cycle as soon as the last step becomes highly stalled, even if much of the cycle remains in a regime where training is still robust.

Figure 5: Measured stalling fraction for the first moment during pretraining of the 100M-parameter LLaMA model on C4 (three panels: BF16, FP8, and FP4; curves for NR, SR, and periodic resets).

To reflect cumulative exposure to harmful staleness, we average over the cycle and count only the excess above a tolerance level $s_0 \in [0, 1)$:
\[
\bar S_{s_0}(K) := \frac{1}{K} \sum_{j=1}^{K} \left[ \frac{S(j) - s_0}{1 - s_0} \right]_+, \tag{46}
\]
\[
[x]_+ := \max(x, 0). \tag{47}
\]
This quantity is $0$ when the entire cycle stays below the tolerated staleness level $s_0$, and approaches $1$ only when a substantial fraction of the cycle is spent near fully saturated controllable stalling. Thus $s_0$ encodes the empirically observed tolerance to moderate staleness: larger $s_0$ delays the point at which staleness begins to count against a longer cycle.

Step 3: quantify the remaining statistical error of the EMA. Consider the bias-corrected EMA after $K$ steps, where the local age of the EMA is reset at the reset boundary together with the moment state itself. Its normalized weights are
\[
w^{(K)}_t = \frac{(1 - \beta_2)\, \beta_2^{\,K-t}}{1 - \beta_2^{\,K}}, \qquad t = 1, \dots, K.
\]
The corresponding effective sample size is
\[
N_{\mathrm{stat}}(K) = \frac{1}{\sum_{t=1}^{K} \big( w^{(K)}_t \big)^2} = \frac{(1 + \beta_2)\big(1 - \beta_2^{\,K}\big)}{(1 - \beta_2)\big(1 + \beta_2^{\,K}\big)}. \tag{48}
\]
Its infinite-horizon limit is $N^\infty_{\mathrm{stat}} = \frac{1 + \beta_2}{1 - \beta_2}$. We define the normalized remaining statistical error as
\[
E(K) := 1 - \frac{N_{\mathrm{stat}}(K)}{N^\infty_{\mathrm{stat}}} = \frac{2 \beta_2^{\,K}}{1 + \beta_2^{\,K}}. \tag{49}
\]
This quantity decreases monotonically from $1$ to $0$ as the EMA accumulates samples, and measures how much averaging benefit is still left to gain by extending the current cycle.
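Equations (48)–(49) are simple to compute; the small sketch below (our function names) also makes the identity $E(K) = 1 - N_{\mathrm{stat}}(K)/N^\infty_{\mathrm{stat}}$ easy to verify numerically:

```python
def effective_sample_size(K: int, beta2: float) -> float:
    # N_stat(K) from Eq. (48) for the bias-corrected EMA after K steps.
    bK = beta2 ** K
    return (1.0 + beta2) * (1.0 - bK) / ((1.0 - beta2) * (1.0 + bK))

def remaining_stat_error(K: int, beta2: float) -> float:
    # E(K) from Eq. (49): fraction of the infinite-horizon effective
    # sample size N_inf = (1 + beta2) / (1 - beta2) not yet accumulated.
    bK = beta2 ** K
    return 2.0 * bK / (1.0 + bK)
```

For $\beta_2 = 0.999$, $N_{\mathrm{stat}}(1) = 1$ (a single sample) and $E(K)$ decays monotonically toward $0$ as the cycle lengthens.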
Step 4: define the crossing heuristic. We choose the reset period as the first cycle length for which the cycle-averaged excess staleness matches the remaining statistical error:
\[
K^\star := \min\big\{\, K \ge 1 : \bar S_{s_0}(K) \ge E(K) \,\big\}. \tag{50}
\]
This yields a direct tradeoff. Before the crossing, the EMA is still gaining substantial statistical accuracy relative to the harmful staleness accumulated within the cycle. After the crossing, the additional gain from a longer cycle is increasingly outweighed by the time spent in a highly stalled regime.

Step 5: substitute the transient stalling law. For the precision regimes considered in the main text, $\hat\rho \ge 1$, so the transient NR approximation from Section A.5 gives
\[
P_{\mathrm{stall}}(j) \approx F_{\chi^2_1}\big((1 + \hat\rho)\big(1 - \beta_2^{\,j}\big)\big), \qquad P^{\mathrm{ss}}_{\mathrm{stall}} \approx F_{\chi^2_1}(1 + \hat\rho).
\]
Substituting into (45) yields
\[
S(j) \approx \frac{F_{\chi^2_1}\big((1 + \hat\rho)\big(1 - \beta_2^{\,j}\big)\big)}{F_{\chi^2_1}(1 + \hat\rho)}. \tag{51}
\]
Therefore the cycle-averaged excess staleness becomes
\[
\bar S_{s_0}(K) \approx \frac{1}{K} \sum_{j=1}^{K} \left[ \frac{\dfrac{F_{\chi^2_1}\big((1 + \hat\rho)(1 - \beta_2^{\,j})\big)}{F_{\chi^2_1}(1 + \hat\rho)} - s_0}{1 - s_0} \right]_+. \tag{52}
\]
Combining (52) with (49), the reset period is the smallest integer $K$ such that
\[
\frac{1}{K} \sum_{j=1}^{K} \left[ \frac{\dfrac{F_{\chi^2_1}\big((1 + \hat\rho)(1 - \beta_2^{\,j})\big)}{F_{\chi^2_1}(1 + \hat\rho)} - s_0}{1 - s_0} \right]_+ \;\ge\; \frac{2 \beta_2^{\,K}}{1 + \beta_2^{\,K}}. \tag{53}
\]
The left-hand side is nondecreasing in $K$, since it is the running average of a nondecreasing sequence, while the right-hand side is strictly decreasing and tends to $0$. Moreover, because $S(j) \to 1$, the left-hand side tends to $1$. Hence the crossing in (53) exists and is unique.

Numerical values. For $\beta_2 = 0.999$ and the effective precision ratios used in the main text,
\[
\hat\rho_{\mathrm{BF16}} = 2.71, \qquad \hat\rho_{\mathrm{FP8}} = 43.3, \qquad \hat\rho_{\mathrm{FP4}} = 86.6,
\]
we solve (53) numerically. For the tolerance level used in the main text, $s_0 = 0.6$, this gives
\[
K^\star_{\mathrm{BF16}} \approx 1116, \qquad K^\star_{\mathrm{FP8}} \approx 320, \qquad K^\star_{\mathrm{FP4}} \approx 224.
\]
Table 9 also reports nearby values of $s_0$ to show that the heuristic is stable across a reasonable range of tolerance levels.
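The crossing condition (53) has no closed form, but a direct scan over $K$ is cheap. A sketch (our function names) that approximately reproduces the $K^\star$ values quoted above:

```python
from math import sqrt, erf

def chi2_1_cdf(x: float) -> float:
    # F_{chi^2_1}(x) = erf(sqrt(x/2)) for x >= 0
    return erf(sqrt(x / 2.0)) if x > 0.0 else 0.0

def reset_period(rho: float, beta2: float, s0: float = 0.6,
                 K_max: int = 20_000) -> int:
    # Smallest K with cycle-averaged excess staleness >= remaining
    # statistical error, Eq. (53); valid for rho >= 1.
    p_ss = chi2_1_cdf(1.0 + rho)          # steady-state stalling level
    running = 0.0                         # running sum of thresholded S(j)
    for K in range(1, K_max + 1):
        S_K = chi2_1_cdf((1.0 + rho) * (1.0 - beta2 ** K)) / p_ss
        running += max(0.0, (S_K - s0) / (1.0 - s0))
        bK = beta2 ** K
        if running / K >= 2.0 * bK / (1.0 + bK):
            return K
    raise ValueError("no crossing found within K_max")
```

With $\beta_2 = 0.999$ and $s_0 = 0.6$, this scan lands in the same regime as the values above: on the order of $10^3$ steps for BF16 and a few hundred for FP8 and FP4, with the FP4 period the shortest.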
Overall, the predicted reset periods lie in the same broad regime as the empirically robust windows observed in our experiments, namely roughly 500–1500 for BF16, 300–600 for FP8, and 50–250 for FP4.

Table 9: Reset periods from the cycle-averaged thresholded heuristic (53) under NR with $\beta_2 = 0.999$. We report several tolerance levels $s_0$ to show the sensitivity of the rule to the tolerated amount of reset-sensitive staleness.

| Format | $\hat{\rho}$ | $K^\star$ ($s_0 = 0.5$) | $K^\star$ ($s_0 = 0.6$) | $K^\star$ ($s_0 = 0.7$) | Empirical window |
|--------|--------------|--------------------------|--------------------------|--------------------------|------------------|
| BF16   | 2.71         | ≈ 1004                   | ≈ 1116                   | ≈ 1262                   | 500–1500         |
| FP8    | 43.3         | ≈ 295                    | ≈ 320                    | ≈ 351                    | 300–600          |
| FP4    | 86.6         | ≈ 206                    | ≈ 224                    | ≈ 246                    | 50–250           |

In the main text we use $s_0 = 0.6$, which places the predicted periods near the middle of the empirically robust range while remaining stable to moderate changes in the tolerance level.

A.8 Adaptive resetting policies

While periodic resets are simple and effective, one can also design adaptive reset rules by monitoring stalling online during training. This avoids committing to a fixed period in advance. Our adaptive rule follows the same principle as the period heuristic in Proposition 5.2, but evaluates it online rather than precomputing a reset interval.

Concretely, at each iteration we estimate the empirical stalled fraction $P_{\mathrm{stall}}(k)$ by comparing the newly quantized state with the previously stored state before writeback. We then form the same cycle-averaged excess staleness quantity used in the periodic heuristic, but replace the model-based transient predictor by the observed stalled fraction. In the adaptive version, we fix $P^{\mathrm{ss}}_{\mathrm{stall}} = 1$ and $s_0 = 0.6$, and trigger a reset once

$$\bar{S}_{s_0}(k) - E(k) > 0. \qquad (54)$$

Intuitively, this condition detects when the observed buildup of stalling within the current cycle outweighs the remaining statistical benefit of continuing the EMA.
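The empirical stalled fraction is simply the fraction of state entries whose stored quantized value did not change across a step. A minimal sketch of this measurement, using a hypothetical uniform quantization grid of our own choosing rather than the paper's FP8/FP4 codecs:

```python
import numpy as np

def nearest_quantize(x: np.ndarray, step: float) -> np.ndarray:
    """Nearest rounding onto a hypothetical uniform grid with the given step."""
    return np.round(x / step) * step

def stalled_fraction(prev_q: np.ndarray, new_q: np.ndarray) -> float:
    """P_stall(k): fraction of entries whose stored value is unchanged at writeback."""
    return float(np.mean(prev_q == new_q))

rng = np.random.default_rng(0)
step = 0.1
state = rng.normal(size=10_000)
prev_q = nearest_quantize(state, step)

# An EMA-style nominal update much smaller than half a grid step: most entries
# round back to the same stored value, so the measured stalled fraction is high.
new_q = nearest_quantize(state + 1e-3 * rng.normal(size=state.shape), step)
print(f"stalled fraction: {stalled_fraction(prev_q, new_q):.3f}")  # close to 1
```

In the adaptive rule this observed fraction replaces the model-based transient predictor $P_{\mathrm{stall}}(j)$, so no effective precision ratio $\hat{\rho}$ needs to be estimated in advance.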
Thus, the adaptive policy implements the same reset principle as the periodic heuristic, but replaces a fixed reset interval with an online test based on the actual behavior of the quantized EMA states. In practice, periodic resets provide a simple default strategy, while adaptive resetting offers a lightweight extension that can respond to the realized stalling dynamics during training.

A.9 Experimental Setting

We introduce the LLaMA architecture and training setup used in our experiments. All models use a maximum sequence length of 256, a global token batch size of 131K, and BF16 model weights and activations. We apply gradient clipping with maximum norm 1.0, linear warmup over the first 10% of training steps, and cosine decay to 10% of the peak learning rate.

We train all models on C4 (Raffel et al., 2020) using AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and AdamW epsilon $\epsilon_{\mathrm{adam}} = 10^{-6}$. The peak learning rates used for each model are reported in Table 10 and are kept fixed across precision regimes. SOAP-specific parameters for the ablation include a preconditioner update frequency of every 10 steps, a maximum preconditioner dimension of 10,000, and a preconditioner EMA coefficient $\beta = 0.999$.

We evaluate three optimizer-state precision settings: BF16, FP8 (E4M3 for both moments with per-tensor scaling), and FP4 (E2M1 for the first moment and unsigned E2M2 for the second moment, with block size 128). Model weights and activations remain in BF16.

Because current hardware does not provide native support for all of the low-precision state formats and rounding modes considered here, we simulate quantized optimizer-state storage in software. Concretely, optimizer states are dequantized to high precision for the update, then quantized and stored in the target format after each step, using either nearest or stochastic rounding depending on the experiment.
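The simulated writeback with the two rounding modes can be sketched as follows. This is our own minimal illustration of nearest vs. stochastic rounding onto a uniform grid, not the paper's FP8/FP4 codec (the real E4M3/E2M1 grids are nonuniform):

```python
import numpy as np

def quantize(x: np.ndarray, step: float, mode: str, rng=None) -> np.ndarray:
    """Quantize onto a uniform grid of the given step using NR or SR."""
    scaled = x / step
    if mode == "nr":                # nearest rounding: deterministic snap
        q = np.round(scaled)
    elif mode == "sr":              # stochastic rounding: round up with
        lo = np.floor(scaled)       # probability equal to the fractional part,
        q = lo + (rng.random(x.shape) < (scaled - lo))   # so E[q * step] = x
    else:
        raise ValueError(f"unknown mode: {mode}")
    return q * step

x = np.full(100_000, 0.03)          # sits between grid points 0.0 and 0.1
nr = quantize(x, 0.1, "nr")         # NR always snaps 0.03 down to 0.0
rng = np.random.default_rng(0)
sr = quantize(x, 0.1, "sr", rng)    # SR lands on 0.1 about 30% of the time
print(nr[0], sr.mean())             # SR is unbiased: mean close to 0.03
```

This is the mechanism behind the NR/SR distinction in the stalling analysis: under NR a sub-half-step nominal update always rounds back to the stored value, while under SR it survives in expectation.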
Table 10: Model configurations used in our experiments. Data amounts are given in tokens.

| Params | Hidden | Intermediate | Heads | Layers | Steps | Tokens | LR     |
|--------|--------|--------------|-------|--------|-------|--------|--------|
| 60M    | 512    | 1376         | 8     | 8      | 10K   | 1.3B   | 0.001  |
| 100M   | 640    | 1708         | 10    | 12     | 20K   | 2.6B   | 0.001  |
| 350M   | 1024   | 2736         | 16    | 24     | 60K   | 7.8B   | 0.0004 |

[Figure 6: Training loss curves for the three model scales with FP4 optimizer-state storage, comparing AdamW with and without restarts.]

When enabled, stochastic rounding is applied to all quantized states. Periodic EMA resets are applied to both moments unless stated otherwise. The reset period for the second moment is chosen using Equation (24), yielding periods of 1000 (BF16), 300 (FP8), and 200 (FP4). The first moment uses the same period unless otherwise specified.

A.10 Additional Results

We report additional results and plots omitted from the main text.

To better understand the role of resets relative to scaling, we also evaluate FP4 with per-tensor scaling instead of the block-wise scaling used in the main text. As expected, per-tensor scaling leads to substantially worse performance, since the effective quantization grid is much coarser. In this regime, resets should be viewed as complementary to scaling rather than a substitute for it: even with resets, the validation loss reaches only ≈ 3.8, compared to 3.45–3.55 with block-wise scaling.

[Figure 7: Training loss curves for LLaMA-60M with FP4 optimizer-state storage under per-tensor scaling; AdamW reaches validation loss 6.431, AdamW + restarts reaches 3.868.]

However, resets remain critical in this setting. Without resets, training fails to converge, whereas with resets it reaches a reasonable loss.
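The gap between per-tensor and block-wise scaling comes from how the scale adapts to local magnitude: one large outlier inflates a per-tensor scale and flushes small entries to zero. A sketch of this effect with a hypothetical symmetric absmax-scaled integer grid (our own simplification, not the paper's FP4 codec):

```python
import numpy as np

def quantize_absmax(x: np.ndarray, levels: int = 8) -> np.ndarray:
    """Symmetric absmax scaling: snap x onto a grid of `levels` steps per side."""
    scale = float(np.abs(x).max()) / levels
    if scale == 0.0:
        scale = 1.0
    return np.round(x / scale) * scale

def blockwise_quantize(x: np.ndarray, block: int = 128, levels: int = 8) -> np.ndarray:
    """Apply absmax quantization independently to each contiguous block."""
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        out[i:i + block] = quantize_absmax(x[i:i + block], levels)
    return out

rng = np.random.default_rng(0)
# Mostly small values with a few large outliers, as in second-moment tensors.
x = rng.normal(size=4096) * 1e-3
x[::512] = 5.0
err_tensor = float(np.abs(quantize_absmax(x) - x).mean())
err_block = float(np.abs(blockwise_quantize(x) - x).mean())
# Block-wise scaling adapts its grid to each block's local magnitude.
assert err_block < err_tensor
```

With block size 128, only the blocks containing an outlier inherit the coarse grid; the remaining blocks keep a fine local grid, which is why the main-text FP4 runs use block-wise scaling.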
This further supports our interpretation that resets act by restoring responsiveness in highly stalled regimes, even when the underlying precision is too limited to match stronger scaling strategies.

A.11 Algorithms

Here $s_t$ may denote a single EMA state or a tuple of optimizer states; when multiple states are present, $\Phi$, $\Psi$, $Q$, and $D$ are understood componentwise.

Algorithm 1: Stateful Optimizer with Quantized State Storage and Periodic Resets

Require: update maps $\Phi$, $\Psi$; hyperparameters $h$; quantizer $Q$ (NR or SR); dequantizer $D$; reset period $K$; initial state $s_{\mathrm{init}}$ (often 0); (optional) resettable schedule state $\xi_{\mathrm{init}}$.

1:  Initialize $\theta_0$; $\bar{s}_0 \leftarrow Q(s_{\mathrm{init}})$; $k \leftarrow 0$; (optional) $\xi \leftarrow \xi_{\mathrm{init}}$
2:  for $t = 1, 2, \dots$ do
3:      $g_t \leftarrow \nabla f_t(\theta_{t-1})$
4:      $s^{\mathrm{hp}}_{t-1} \leftarrow D(\bar{s}_{t-1})$  ▷ load quantized state
5:      $k \leftarrow k + 1$  ▷ local step within the current reset cycle
6:      $s^{\mathrm{hp}}_t \leftarrow \Phi(s^{\mathrm{hp}}_{t-1}, g_t, \theta_{t-1}; h, k, \xi)$
7:      $\Delta\theta_t \leftarrow \Psi(s^{\mathrm{hp}}_t, g_t, \theta_{t-1}; h, k, \xi)$
8:      $\theta_t \leftarrow \theta_{t-1} + \Delta\theta_t$  ▷ parameter update in higher precision
9:      $\bar{s}_t \leftarrow Q(s^{\mathrm{hp}}_t)$  ▷ requantize state for storage
10:     if $k = K$ then
11:         $\bar{s}_t \leftarrow Q(s_{\mathrm{init}})$  ▷ often $\bar{s}_t \leftarrow 0$
12:         $k \leftarrow 0$
13:         (optional) $\xi \leftarrow \xi_{\mathrm{init}}$  ▷ reset bias correction / schedules
14:     end if
15: end for

Algorithm 2: Stateful Optimizer with Quantized State Storage and Adaptive Resets

Require: update maps $\Phi$, $\Psi$; hyperparameters $h$; quantizer $Q$; dequantizer $D$; decay $\beta_2$; reset tolerance $s_0$; steady-state stalled fraction $P^{\mathrm{ss}}_{\mathrm{stall}}$ (we use $P^{\mathrm{ss}}_{\mathrm{stall}} = 1$ in practice); initial state $s_{\mathrm{init}}$ (often 0); (optional) resettable schedule state $\xi_{\mathrm{init}}$.

1:  Initialize parameters $\theta_0$
2:  Initialize quantized state $\bar{s}_0 \leftarrow Q(s_{\mathrm{init}})$
3:  Initialize cycle counter $k \leftarrow 0$
4:  Initialize accumulated excess staleness $A \leftarrow 0$
5:  (optional) initialize schedule state $\xi \leftarrow \xi_{\mathrm{init}}$
6:  for $t = 1, 2, \dots$
do
7:      $g_t \leftarrow \nabla f_t(\theta_{t-1})$
8:      $s^{\mathrm{hp}}_{t-1} \leftarrow D(\bar{s}_{t-1})$
9:      $k \leftarrow k + 1$
10:     $s^{\mathrm{hp}}_t \leftarrow \Phi(s^{\mathrm{hp}}_{t-1}, g_t, \theta_{t-1}; h, k, \xi)$
11:     $\Delta\theta_t \leftarrow \Psi(s^{\mathrm{hp}}_t, g_t, \theta_{t-1}; h, k, \xi)$
12:     $\theta_t \leftarrow \theta_{t-1} + \Delta\theta_t$
13:     $\bar{s}_t \leftarrow Q(s^{\mathrm{hp}}_t)$
14:     Compute the empirical stalled fraction
            $P_{\mathrm{stall}}(k) \leftarrow \frac{1}{d} \sum_{i=1}^{d} \mathbb{1}\big[(\bar{s}_t)_i = (\bar{s}_{t-1})_i\big]$,
        where $d$ is the number of entries in the state tensor
15:     Compute normalized stalling progress $S(k) \leftarrow P_{\mathrm{stall}}(k) / P^{\mathrm{ss}}_{\mathrm{stall}}$
16:     Update accumulated excess staleness $A \leftarrow A + \left[ \frac{S(k) - s_0}{1 - s_0} \right]_+$
17:     Compute cycle-averaged excess staleness $\bar{S}_{s_0}(k) \leftarrow A / k$
18:     Compute remaining statistical error $E(k) \leftarrow \frac{2\beta_2^k}{1 + \beta_2^k}$
19:     if $\bar{S}_{s_0}(k) \geq E(k)$ then
20:         reset: $\bar{s}_t \leftarrow Q(s_{\mathrm{init}})$
21:         $k \leftarrow 0$
22:         $A \leftarrow 0$
23:         (optional) $\xi \leftarrow \xi_{\mathrm{init}}$
24:     end if
25: end for
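The trigger in Algorithm 2 (lines 14–19) can be exercised in isolation. The sketch below is our own illustration driven by a synthetic stalled-fraction trace rather than a real optimizer, with $P^{\mathrm{ss}}_{\mathrm{stall}} = 1$ and $s_0 = 0.6$ as in the text:

```python
def adaptive_reset_step(p_stall: float, k: int, A: float,
                        beta2: float = 0.999, s0: float = 0.6):
    """One evaluation of the Algorithm 2 trigger; returns (reset_fired, k, A)."""
    k += 1
    S = p_stall / 1.0                       # P_stall^ss = 1 in practice
    A += max((S - s0) / (1.0 - s0), 0.0)    # accumulated excess staleness
    S_bar = A / k                           # cycle-averaged excess staleness
    bk = beta2 ** k
    E = 2.0 * bk / (1.0 + bk)               # remaining statistical error
    if S_bar >= E:
        return True, 0, 0.0                 # reset: clear counter and accumulator
    return False, k, A

# Synthetic trace: observed stalling ramps toward full saturation over ~500 steps,
# a hypothetical profile loosely mimicking the transient buildup after a reset.
k, A, reset_at = 0, 0.0, None
for t in range(1, 5000):
    p = min(1.0, t / 500.0)
    fired, k, A = adaptive_reset_step(p, k, A)
    if fired:
        reset_at = t
        break
print("first reset at step", reset_at)
```

Note that a reset clears both the cycle counter and the staleness accumulator, so the test restarts from a fresh, fully responsive EMA, mirroring lines 20–22 of Algorithm 2.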