An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in the data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.


💡 Research Summary

The paper presents a systematic empirical investigation into whether noisy data in web‑scale pre‑training corpora can cause loss divergence during large language model (LLM) training, and if so, how the phenomenon depends on noise characteristics, model scale, and training hyper‑parameters. The authors generate synthetic uniform random noise because collecting genuine random web text at scale is infeasible. For each document in a clean corpus D_c (a subset of the Llama‑4 pre‑training mix), a fraction α of tokens is replaced by noise tokens sampled uniformly from a designated noise vocabulary V_N ⊆ V, where V is the full tokenizer vocabulary. Two injection strategies are explored: (1) insertion, where noise tokens are inserted at randomly sampled positions (allowing consecutive insertions), and (2) overwriting, where each original token is replaced with a noise token with probability α.
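The two injection strategies can be sketched as follows. This is a minimal illustration of the mechanics described above, not the authors' code; the function name and signature are assumptions.

```python
import random

def inject_noise(tokens, alpha, noise_vocab, mode="overwrite", rng=None):
    """Sketch of the two noise-injection strategies described in the paper.

    tokens      -- list of token ids from a clean document
    alpha       -- noise fraction (the paper's alpha)
    noise_vocab -- list of token ids forming V_N, a subset of the vocabulary V
    mode        -- "overwrite" or "insert"
    """
    rng = rng or random.Random(0)
    if mode == "overwrite":
        # Replace each original token with a uniformly sampled noise token
        # independently with probability alpha; sequence length is preserved.
        return [rng.choice(noise_vocab) if rng.random() < alpha else t
                for t in tokens]
    if mode == "insert":
        # Insert noise tokens at randomly sampled positions; because each
        # position is sampled independently, consecutive insertions can occur.
        n_noise = int(alpha * len(tokens))
        out = list(tokens)
        for _ in range(n_noise):
            pos = rng.randrange(len(out) + 1)
            out.insert(pos, rng.choice(noise_vocab))
        return out
    raise ValueError(f"unknown mode: {mode}")
```

Note the asymmetry the findings below turn on: overwriting keeps sequence length fixed while destroying a fraction of the original tokens, whereas insertion keeps all original tokens but pushes them apart, disrupting local context.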

Experiments are conducted on dense transformer models ranging from 480 M to 5.2 B parameters, following the Llama‑3 architectural family (pre‑normalization, rotary embeddings, group‑query attention, no bias, no QK‑layer‑norm). A comparable set of Mixture‑of‑Experts (MoE) models is also evaluated, using 16 experts with top‑2 routing and scaling the feed‑forward dimension to match the dense parameter count. All models are trained with AdamW (β1=0.9, β2=0.95, ε=1e‑8), weight decay 1e‑4, gradient clipping 1.0, a cosine learning‑rate schedule (peak 1.85e‑2, 2000‑step warm‑up, minimum 0.1× peak), and batch size 2.6e5 tokens, for fewer than one epoch (≈2e4 steps).
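The reported learning-rate schedule can be reproduced as a simple function. This is a sketch matching the stated hyper-parameters (peak 1.85e-2, 2000-step linear warm-up, cosine decay to 0.1× peak over ≈2e4 steps); the exact warm-up and decay shapes are assumptions, since the summary does not specify them.

```python
import math

def lr_at_step(step, peak_lr=1.85e-2, warmup_steps=2000,
               total_steps=20_000, min_ratio=0.1):
    """Cosine learning-rate schedule with linear warm-up, using the
    hyper-parameters reported in the summary. Illustrative only."""
    min_lr = min_ratio * peak_lr
    if step < warmup_steps:
        # Linear warm-up from ~0 to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine
```

The schedule matters here because the paper's third finding contrasts noise-induced divergence against divergence caused by an exaggerated peak learning rate under otherwise identical settings.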

Training stability is measured by repeating each configuration with 20 random seeds. A run is labeled “diverged” if its loss exceeds the minimum observed loss by more than 0.5 nats/token for at least 600 consecutive steps. No divergences are observed on the clean corpus across all model sizes, confirming that any instability is attributable to injected noise.
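The divergence criterion is mechanical enough to state as code. This is a sketch of the stated rule (loss more than 0.5 nats/token above the minimum observed loss for at least 600 consecutive steps); the comparison against a running minimum is an assumption about how "minimum observed loss" is tracked.

```python
def is_diverged(losses, margin=0.5, window=600):
    """Label a run 'diverged' if its loss exceeds the running minimum
    observed loss by more than `margin` nats/token for at least
    `window` consecutive steps."""
    best = float("inf")
    streak = 0
    for loss in losses:
        best = min(best, loss)
        if loss > best + margin:
            streak += 1
            if streak >= window:
                return True
        else:
            # A single step back within the margin resets the count,
            # so transient loss spikes are not labeled as divergence.
            streak = 0
    return False
```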

Key findings:

  1. Noise can cause divergence and the type matters. With α = 55 % and a 540 M dense model, reducing the noise‑vocabulary size dramatically increases divergence probability. Using the smallest vocabulary (|V_N| = 5) yields the highest divergence rate, while the full vocabulary (|V| ≈ 200 k) shows far fewer failures. The actual content of the five noise tokens (common vs. rare) has negligible effect (Pearson r ≈ 0.125). Insertion noise leads to more divergences than overwriting, likely because insertion disrupts context more severely.

  2. Scaling trends. When the most destabilizing noise setting (insertion, |V_N| = 5) is applied, larger models diverge more often. Jointly scaling depth and width from 472 M (5 layers, 1024 dim) to 5.2 B (20 layers, 4096 dim) shows a monotonic increase in divergence rate at any fixed α. Increasing α from 5 % to 55 % raises divergence probability across all sizes. Isolating width vs. depth reveals that depth is the dominant factor: widening from 1024 to 4096 (keeping depth = 10) yields modest changes, whereas deepening from 5 to 35 layers (keeping width = 2048) causes a sharp rise, with the 35‑layer, 2.5 B model diverging in ~15 % of runs even at α = 5 %.

  3. Distinct signatures from high‑learning‑rate divergence. By comparing clean‑data runs trained with exaggerated learning rates against low‑learning‑rate runs with varying amounts of noise, the authors observe different activation behaviors. High‑LR runs exhibit spikes in maximum attention logits concentrated in a few layers, while noisy‑data runs show a more uniform but sustained increase in average activations across layers. This provides a practical diagnostic: monitoring layer‑wise attention logits can help practitioners decide whether to lower the learning rate or to clean the data.

  4. Dense vs. MoE sensitivity. Parameter‑matched MoE models (16 experts, top‑2 routing, FFN dimension scaled by 0.5×) display divergence rates comparable to their dense counterparts under the same noise conditions. Thus, the presence of sparsely activated experts does not inherently protect against noise‑induced instability.
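The diagnostic in finding 3 can be sketched as a simple heuristic over per-layer statistics. Everything here is illustrative: the function name, the `spike_ratio` threshold, and the concrete decision rule are assumptions layered on the paper's qualitative observation (concentrated spikes vs. uniform elevation), not the authors' procedure.

```python
def classify_instability(layer_max_logits, spike_ratio=3.0):
    """Heuristic triage for a destabilizing run, inspired by finding 3.

    layer_max_logits -- per-layer maximum attention logits at the
                        current training step.
    spike_ratio      -- how far one layer must stand above the mean to
                        count as a concentrated spike (assumed value).
    """
    mean = sum(layer_max_logits) / len(layer_max_logits)
    peak = max(layer_max_logits)
    if peak > spike_ratio * mean:
        # Spike concentrated in few layers: resembles high-LR divergence,
        # so lowering the learning rate is the first lever to try.
        return "high-lr-like"
    # Uniform, sustained elevation across layers: resembles the
    # noisy-data signature, so inspect and clean the training data.
    return "noise-like"
```

In practice such a monitor would run alongside training, logging per-layer attention logits every few hundred steps and flagging which failure mode a drifting run resembles.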

Overall, the paper contributes four main insights: (i) uniform random noise can indeed trigger loss divergence; (ii) the probability of divergence is strongly modulated by noise vocabulary size, injection method, noise proportion, and especially model depth; (iii) noisy‑data divergences are distinguishable from high‑LR divergences via activation diagnostics; and (iv) both dense and MoE architectures share similar vulnerability to noisy data. These results underscore the critical importance of data quality control in large‑scale LLM pre‑training and suggest concrete monitoring tools for early detection of instability.

