Context-Free Synthetic Data Mitigates Forgetting
Fine-tuning a language model often degrades its existing performance on other tasks, owing to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this in settings where we only have access to the model weights but not to its training data or recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process, which we term context-free generation, allows for an approximately unbiased estimate of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices such as generation temperature and data ratios. We present results for OLMo-1B in the pretrained-only setting and for R1-Distill-Llama-8B in the reasoning setting.
💡 Research Summary
The paper addresses the catastrophic forgetting that occurs when large language models are fine-tuned on a downstream task. Existing solutions such as L2 regularization, Elastic Weight Consolidation, LoRA adapters, or post-hoc model averaging (WiSE-FT) mitigate forgetting indirectly but do not explicitly preserve the original data distribution learned during pre-training. The authors propose a simple yet powerful approach: generate “context-free synthetic data” by feeding only the model’s beginning-of-sequence (BOS) token to the pre-trained model and sampling an unconditional text sequence. Because the BOS token is the only cue, the model effectively samples from its own full-sequence probability distribution p_θ∗(x), which reflects the original training data distribution.
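A toy illustration of why sampling from the base model's own distribution matters: Monte Carlo samples x ∼ p yield an unbiased estimate of KL(p ‖ q), which is the quantity the abstract says context-free generations let us approximate. The two discrete distributions below are illustrative stand-ins for the original model p and a fine-tuned model q, not anything from the paper.

```python
import math
import random

random.seed(0)

# Stand-ins for the original model p and the fine-tuned model q.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

# Exact KL(p || q) = sum_x p(x) * log(p(x) / q(x)).
exact_kl = sum(px * math.log(px / q[x]) for x, px in p.items())

# Monte Carlo estimate: sample x ~ p (the analogue of context-free
# generation from the base model) and average log p(x) - log q(x).
samples = random.choices(list(p), weights=list(p.values()), k=50_000)
mc_kl = sum(math.log(p[x] / q[x]) for x in samples) / len(samples)

print(f"exact KL = {exact_kl:.4f}, Monte Carlo estimate = {mc_kl:.4f}")
```

With 50k samples the estimator's standard error is well below 0.01 here, so the two numbers agree closely; the same argument carries over to full-sequence LM distributions, where exact summation is intractable but sampling is cheap.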
The method consists of two steps. First, a set of synthetic sequences S is generated from the frozen base model θ∗ using a temperature τ (typically 0.7) and a maximum length (256 tokens for OLMo-1B, 512 tokens for R1-Distill-Llama-8B). Second, the fine-tuning loss on the downstream dataset F is combined with a pre-training-style loss on the synthetic data:
L(θ) = E_{(x,y)∼F}[−log p_θ(y | x)] + λ · E_{x∼S}[−log p_θ(x)],

where λ sets the mixing ratio between the fine-tuning term and the synthetic-data term.
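The two-term objective described above can be sketched in pure Python. The uniform per-token "model", the helper names, and the mixing weight `lam` are illustrative assumptions, not the paper's implementation; a real setup would compute both terms with the same next-token cross-entropy the LM trainer already uses.

```python
import math

def sequence_nll(logprob, tokens):
    """Mean negative log-likelihood of a sequence under per-token log-probs."""
    return -sum(logprob[t] for t in tokens) / len(tokens)

def combined_loss(logprob, finetune_pairs, synthetic_seqs, lam=1.0):
    # Fine-tuning term: NLL of the response y given the prompt x
    # (flattened here to the concatenated sequence for simplicity).
    ft = sum(sequence_nll(logprob, x + y) for x, y in finetune_pairs)
    # Pre-training-style term: NLL of the context-free samples S.
    syn = sum(sequence_nll(logprob, s) for s in synthetic_seqs)
    return ft / len(finetune_pairs) + lam * syn / len(synthetic_seqs)

# Toy uniform model over a 4-token vocabulary: log p(t) = log(1/4).
logprob = {t: math.log(0.25) for t in "abcd"}
loss = combined_loss(logprob, [("ab", "cd")], ["aabb"], lam=0.5)
print(round(loss, 4))  # uniform model: every term is log 4, so loss = 1.5 * log 4
```

The `lam` knob corresponds to the data-ratio choice the abstract mentions investigating; in practice the same effect is often achieved by simply interleaving synthetic sequences into the fine-tuning batches.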