Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula: Age-of-Acquisition, word frequency, and Verb Variation (VV). Each is compared against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control. Our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.
💡 Research Summary
The paper investigates how the ordering of pre‑training data influences the learning dynamics of large language models (LLMs). Using the Pythia suite, the authors train models ranging from 14 M to 1 B parameters for up to 300 B tokens, comparing four data‑ordering strategies: a single random shuffle and three linguistically motivated curricula—Age‑of‑Acquisition (AoA), word frequency, and Verb Variation (VV). Each curriculum assigns a scalar “difficulty” score to every 2048‑token sample and sorts the entire dataset from easy to hard, thereby implementing an easy‑to‑hard curriculum in a single‑pass setting.
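In a single-pass setting, an easy-to-hard curriculum of this kind reduces to sorting sample indices by their scalar difficulty score. A minimal sketch (the scoring function itself is whatever AoA, frequency, or VV assigns; the scores below are made up for illustration):

```python
import numpy as np

def easy_to_hard_order(difficulty):
    """Return sample indices sorted easiest-first (single-pass curriculum)."""
    return np.argsort(difficulty, kind="stable")

# hypothetical per-sample difficulty scores, e.g. mean Age-of-Acquisition
# over each 2048-token sample
scores = np.array([3.2, 1.1, 2.7, 0.9])
order = easy_to_hard_order(scores)
# order -> [3, 1, 2, 0]: the lowest-difficulty sample is seen first
```

Sorting the whole dataset once and streaming it in order is the degenerate case of a pacing function that exposes data exactly in difficulty rank.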
Two complementary analyses are performed. First, a latent‑phase analysis fits Hidden Markov Models (HMMs) to trajectories of loss, perplexity, and other training signals. Across all orderings, the HMM discovers the same sequence of 5–6 latent phases, indicating that curricula do not create new phases but merely reshuffle which examples appear within each phase.
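The paper's HMM fitting pipeline is not reproduced here, but the core idea of recovering latent phases from a scalar training signal can be illustrated with a minimal Viterbi decode for a hand-specified two-state Gaussian HMM (all parameters below are toy values, not the paper's):

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, log_trans, log_init):
    """Most likely latent-state path for a 1-D Gaussian-emission HMM."""
    T, K = len(obs), len(means)
    # log N(obs_t | mean_k, std_k) for every (t, k)
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds * np.sqrt(2 * np.pi)))
    delta = log_init + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):            # backtrack the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

# toy loss trajectory: a high-loss phase followed by a low-loss phase
rng = np.random.default_rng(0)
loss = np.concatenate([4.0 + 0.1 * rng.standard_normal(50),
                       2.0 + 0.1 * rng.standard_normal(50)])
means, stds = np.array([4.0, 2.0]), np.array([0.5, 0.5])
log_trans = np.log([[0.95, 0.05], [0.05, 0.95]])  # sticky phases
log_init = np.log([0.5, 0.5])
phases = viterbi_gaussian(loss, means, stds, log_trans, log_init)
# phases recovers the two-phase structure: 0s then 1s
```

The paper's finding is that, with parameters learned from the real trajectories, the decoded phase sequence is the same under every data ordering.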
Second, the authors examine optimization stability using two diagnostics: the Gradient Noise Scale (GNS) and the singular entropy of the output-head weight matrix. GNS quantifies the ratio of gradient variance to squared gradient norm; higher values imply noisy, inefficient updates. Singular entropy measures the spread of the weight matrix's normalized singular values; its collapse indicates spectral saturation of the softmax head, a symptom of the “softmax bottleneck” that limits the rank of the output distribution in small models.
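The paper's exact estimators are not given here, but both diagnostics have common simple forms, sketched below: a GNS estimate in the spirit of the "simple" noise-scale statistic (trace of the per-example gradient covariance over the squared mean-gradient norm), and the Shannon entropy of the normalized singular spectrum.

```python
import numpy as np

def gradient_noise_scale(per_example_grads):
    """Simple GNS estimate: tr(Cov[g]) / ||E[g]||^2, ignoring finite-sample
    corrections. Rows of per_example_grads are per-example gradients."""
    g_mean = per_example_grads.mean(axis=0)
    tr_cov = ((per_example_grads - g_mean) ** 2).sum(axis=1).mean()
    return tr_cov / (g_mean ** 2).sum()

def singular_entropy(W, normalize=True):
    """Entropy of the normalized singular values of W; low values indicate
    spectral saturation (mass concentrated on a few directions)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    h = -(p * np.log(p)).sum()
    return h / np.log(min(W.shape)) if normalize else h

# identical per-example gradients -> zero gradient noise
grads = np.tile(np.array([1.0, 2.0]), (8, 1))
print(gradient_noise_scale(grads))   # -> 0.0
print(singular_entropy(np.eye(4)))   # close to 1.0: maximally spread spectrum
```

A saturating softmax head would show this entropy drifting toward 0 as training progresses, which is the late-stage behavior the paper reports for Random ordering in small models.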
Empirical results show that for capacity‑constrained models (≤ 160 M parameters), random ordering yields substantially higher GNS and a sharper drop in singular‑entropy during later training, correlating with lower final accuracy and higher perplexity. All three curricula reduce GNS, mitigate late‑stage spectral saturation, and achieve modest accuracy gains at matched compute. As model size increases (410 M and 1 B), the differences in GNS and spectral saturation shrink, and the performance gap between random and curriculum orderings becomes marginal.
The theoretical contribution formalizes curricula as a variance‑control mechanism. The authors define a difficulty score d(z) and a pacing function p(t) that together induce a time‑varying sampling distribution Pₜ. Under standard strong‑convexity and Lipschitz assumptions, they prove (Theorem 3.2) that the stability radius of stochastic gradient descent scales with the effective gradient variance σ²ₜ. Uniform random sampling can cause σ²ₜ to drift upward as training progresses because harder examples (with higher gradient variance) dominate later stages. An easy‑to‑hard curriculum, by restricting exposure to high‑variance examples until later, keeps σ²ₜ bounded and thus improves stability.
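The paper's Theorem 3.2 is not reproduced here, but a standard recursion of this flavor (μ-strongly convex, L-smooth objective, unbiased stochastic gradients, step size η ≤ 1/L) shows why bounding the effective variance bounds the steady-state error; the notation below follows the summary's d(z), p(t), and Pₜ, and the exact statement may differ from the paper's:

```latex
% Time-varying sampling distribution induced by difficulty d and pacing p:
P_t(z) \;\propto\; \mathbf{1}\{\, d(z) \le p(t) \,\}\, P(z)

% Effective gradient variance under P_t:
\sigma_t^2 \;=\; \mathbb{E}_{z \sim P_t}\!\left[\, \bigl\| \nabla \ell(w_t; z) - \nabla L_t(w_t) \bigr\|^2 \right]

% A standard SGD recursion for a \mu-strongly convex, L-smooth objective:
\mathbb{E}\bigl\| w_{t+1} - w^\star \bigr\|^2
\;\le\; (1 - \eta\mu)\, \mathbb{E}\bigl\| w_t - w^\star \bigr\|^2 \;+\; \eta^2 \sigma_t^2
```

Under uniform sampling, σ²ₜ reflects the full mixture of easy and hard examples at every step; the curriculum's indicator in Pₜ excludes high-variance (hard) examples early, keeping the η²σ²ₜ term small exactly when the contraction factor has done the least work.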
Mapping this framework to the experiments, AoA, frequency, and VV scores serve as proxies for the ideal difficulty (validated by strong positive correlations with per‑sample loss). The pacing function is effectively linear in the empirical quantiles of these scores, matching the sorted order used in training. The observed reductions in GNS and improvements in spectral stability align with the theoretical prediction that curricula bound gradient variance and thereby enhance optimization stability.
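The linear-in-quantiles pacing can be sketched as follows (the function name and the linear schedule are illustrative assumptions; a sorted single pass realizes exactly this exposure schedule):

```python
import numpy as np

def available_mask(difficulty, t, T):
    """Linear pacing in empirical quantiles: at step t, expose the easiest
    t/T fraction of the data."""
    frac = min(1.0, max(t / T, 1.0 / len(difficulty)))
    threshold = np.quantile(difficulty, frac)
    return difficulty <= threshold

d = np.array([0.9, 1.1, 2.7, 3.2])  # toy difficulty scores
print(available_mask(d, t=1, T=4))  # only the easiest quarter is exposed
print(available_mask(d, t=4, T=4))  # everything is available by the end
```

Because the mask is defined through quantiles of the empirical scores, the same pacing function works unchanged for AoA, frequency, or VV difficulty, despite their different scales.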
In summary, the study concludes that curriculum learning for LLM pre‑training does not alter the fundamental learning trajectory but stabilizes optimization within the existing latent phases. The benefits are most pronounced for smaller, capacity‑limited models where gradient noise and softmax bottleneck effects are significant. For larger models, where capacity alleviates these issues, curriculum gains diminish. Practically, applying linguistically motivated easy‑to‑hard curricula can be an effective, low‑cost technique to improve sample‑efficiency and final performance when training modest‑sized LLMs or when compute budgets are constrained.