On the Role of Batch Size in Stochastic Conditional Gradient Methods


We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.


💡 Research Summary

This paper investigates how batch size, sequence length, and stepsize should be chosen when training large‑scale language models under a fixed token budget T. The authors focus on momentum‑based stochastic conditional gradient (SCG) methods such as Scion and analyze them under a μ‑Kurdyka‑Łojasiewicz (μ‑KL) error bound. Assuming L‑smoothness, norm equivalence, and bounded‑variance stochastic gradients whose variance shrinks inversely with the effective batch‑sequence product BS = B·S, they derive explicit convergence guarantees for the SCG algorithm (Algorithm 1).
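The averaging effect behind this variance assumption can be checked with a tiny simulation; the Gaussian noise model, seed, and batch sizes below are illustrative placeholders, not values from the paper.

```python
# Minimal illustration of the minibatch variance assumption: averaging
# b i.i.d. stochastic gradient samples shrinks the variance by ~1/b.
import random

random.seed(0)

def mean_of_batch(b):
    # Stand-in for one stochastic gradient averaged over b samples.
    return sum(random.gauss(0.0, 1.0) for _ in range(b)) / b

def empirical_var(b, trials=2000):
    xs = [mean_of_batch(b) for _ in range(trials)]
    m = sum(xs) / trials
    return sum((x - m) ** 2 for x in xs) / trials

v1, v16 = empirical_var(1), empirical_var(16)
assert v1 / v16 > 8  # roughly a 16x reduction, up to sampling noise
```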

The main theoretical contribution is a three‑regime scaling law for the achievable optimization error as a function of BS when the total number of processed tokens is fixed (T = K·B·S).

  1. Noise‑dominated regime (small BS) – error decreases roughly as 1/(BS) because larger batches average out stochastic noise.
  2. Intermediate regime – error becomes essentially independent of BS; the algorithm is limited by curvature and the μ‑KL constant rather than noise.
  3. Large‑batch regime (very large BS) – error grows with BS because the stepsize must be reduced to maintain stability, leading to slower progress.
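The three regimes above can be sketched with a toy error model. The functional form and constants here are hypothetical placeholders; in particular, the quadratic large‑batch penalty is chosen only so that the minimizer scales as T^{2/3}, consistent with the paper's BST rule.

```python
# Toy three-regime error curve under a fixed token budget T.
# Constants a, c, d are illustrative, not from the paper.

def error(bs, T, a=1.0, c=1e-4, d=1.0):
    noise = a / bs                 # regime 1: noise averaged out as 1/(BS)
    floor = c                      # regime 2: curvature / mu-KL limited plateau
    penalty = d * (bs / T) ** 2    # regime 3: stepsize forced down for stability
    return noise + floor + penalty

T = 2e9
assert error(1e3, T) > error(1e6, T)   # small BS: noise-dominated, larger BS helps
assert error(1e9, T) > error(1e6, T)   # very large BS: performance degrades
```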

Balancing the dominant terms across these regimes yields a critical “BST scaling rule”:
  BS ≈ T^{2/3},
up to problem‑dependent constants involving the smoothness L, the noise scale σ, the norm‑equivalence factor ρ, and the μ‑KL constant μ.

Guided by this rule, the authors propose an adaptive training schedule that gradually increases the effective batch‑sequence product while simultaneously adjusting the learning‑rate parameter β ≈ 1/K ≈ BS/T. The schedule consists of two (or more) stages: an initial phase with modest BS to exploit rapid early‑stage learning, followed by a phase where BS tracks the T^{2/3} law to preserve token‑efficiency in later training. This approach generalizes classic heuristics such as linear learning‑rate scaling with warm‑up, but it is specifically tailored to projection‑free conditional‑gradient geometry.
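A minimal sketch of such a two‑stage schedule, assuming a hypothetical warmup fraction and initial batch‑sequence product (neither constant is from the paper):

```python
# Hypothetical two-stage BS schedule tracking the T^(2/3) rule, with
# beta ~ BS/T as described in the summary. Constants are illustrative.

def schedule(tokens_seen, T, bs_init=2**15, warmup_frac=0.1):
    """Return (effective batch-sequence product BS, momentum parameter beta)."""
    bs_target = T ** (2.0 / 3.0)
    if tokens_seen < warmup_frac * T:
        bs = bs_init        # stage 1: modest BS for rapid early-stage learning
    else:
        bs = bs_target      # stage 2: track the T^(2/3) law for token efficiency
    beta = bs / T           # beta ~ 1/K ~ BS/T
    return bs, beta

T = 2e9
bs0, beta0 = schedule(0, T)
bs1, beta1 = schedule(0.5 * T, T)
assert bs1 > bs0 and beta1 > beta0  # BS and beta both increase across stages
```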

Empirical validation is performed on a 124M‑parameter NanoGPT model trained on the FineWeb dataset with a total token budget of T = 2 × 10⁹. Experiments vary batch size B and sequence length S, confirming the three predicted regimes. The lowest test loss and best token efficiency are observed when BS ≈ T^{2/3} ≈ 1.6 × 10⁶, matching the theoretical optimum. Moreover, the authors plot the dual gradient norm against the suboptimality gap during training and find a linear relationship with slope μ, confirming that the μ‑KL condition holds empirically for large language models.
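The reported optimum is a quick arithmetic check: plugging the paper's token budget into the scaling rule (with problem‑dependent constants suppressed) reproduces the stated value.

```python
# Sanity check: T^(2/3) for T = 2e9 tokens matches the reported ~1.6e6.
T = 2e9
bs_opt = T ** (2.0 / 3.0)
assert 1.5e6 < bs_opt < 1.7e6  # ~1.59e6
```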

The paper also situates its contributions relative to the μ‑P framework, which ensures local stability by keeping per‑step updates Θ(1) through appropriate parameterization. In contrast, this work addresses global efficiency under a token‑budget constraint, showing that large batches are not inherently detrimental if stepsize and sequence length are co‑scaled according to the BST rule.

Overall, the paper provides (i) the first convergence analysis of momentum SCG under μ‑KL that explicitly tracks batch size, (ii) a token‑budget‑aware scaling law for batch size and sequence length, and (iii) practical adaptive scheduling guidelines validated on real language‑model training. These results give both theorists and practitioners a principled foundation for designing efficient large‑scale training pipelines that balance hardware utilization, token efficiency, and final model performance.

