Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Designing effective reasoning-capable LLMs typically requires training with Reinforcement Learning with Verifiable Rewards (RLVR) or distillation from carefully curated Long Chains of Thought (CoT), both of which depend heavily on extensive training data. This creates a major challenge when quality training data is scarce. We propose a sample-efficient, two-stage training strategy for developing reasoning LLMs under limited supervision. In the first stage, we “warm up” the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: $(i)$ the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval$^{+}$, and MMLU-Pro; $(ii)$ when both the base model and the warmed-up model are RLVR-trained on the same small dataset ($\leq100$ examples), the warmed-up model consistently outperforms the base model; $(iii)$ warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; $(iv)$ introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.


💡 Research Summary

The paper tackles a pressing problem in the development of reasoning‑capable large language models (LLMs): how to obtain strong multi‑step reasoning abilities when only a tiny amount of domain‑specific training data is available. The authors propose a two‑stage training pipeline that first “warms up” a base model on a toy, domain‑agnostic logic game—Knights & Knaves (K&K)—and then fine‑tunes the warmed‑up model on the target task using Reinforcement Learning with Verifiable Rewards (RLVR).

Stage 1 – Warm‑up.
K&K puzzles require only Boolean reasoning (identifying who tells the truth and who lies) and no specialized knowledge. The authors generate long chain‑of‑thought (CoT) explanations for thousands of K&K questions using a strong teacher model (Qwen‑32B). These explanations contain rich reasoning behaviors such as self‑reflection, hypothesis testing, and error correction. The teacher’s outputs are distilled into several base models (Qwen2.5‑3B, Qwen2.5‑1.5B‑Math, DeepSeek‑Math‑7B‑Base, Qwen2.5‑14B) via supervised fine‑tuning (SFT).
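To make the toy domain concrete, a K&K puzzle can be solved by brute force over truth assignments: every knight's statement must be true and every knave's must be false. The sketch below is illustrative only (the puzzle, names, and helper `solve_kk` are not from the paper); it shows why these puzzles require pure Boolean reasoning and no outside knowledge.

```python
from itertools import product

def solve_kk(names, statements):
    """Brute-force a Knights & Knaves puzzle.

    `names` lists the speakers; `statements` is a list of (speaker, claim)
    pairs, where each claim maps an assignment {name: bool} (True = knight)
    to whether the claim holds under that assignment. A world is consistent
    when every knight's claim is true and every knave's claim is false.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A speaker's claim must be true iff the speaker is a knight.
        if all(claim(world) == world[who] for who, claim in statements):
            solutions.append(world)
    return solutions

# Classic two-person puzzle: A says "We are both knaves."
puzzle = [("A", lambda w: (not w["A"]) and (not w["B"]))]
print(solve_kk(["A", "B"], puzzle))
# → [{'A': False, 'B': True}]  (A is a knave, B is a knight)
```

If A were a knight, A's statement would make A a knave, a contradiction; so A is a knave, the statement is false, and B must be a knight. The long CoTs distilled in the paper walk through exactly this kind of hypothesis testing and self-correction in natural language.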

Empirically, the warm‑up alone yields substantial gains across three very different benchmarks: MATH (math problem solving), HumanEval+ (code generation), and MMLU‑Pro (general knowledge). For example, the Qwen2.5‑3B model improves from 43.8% to 54.0% on MATH (+10.2 pp), from 32.5% to 47.8% on HumanEval+ (+15.3 pp), and from 29.2% to 38.2% on MMLU‑Pro (+9.0 pp). Similar or larger improvements are observed for the larger 14‑B model (reaching 77.4% on MATH, close to the 80.2% achieved by full‑scale RLVR with thousands of examples). The authors also compare against distillation with the curated s1K dataset (a collection of high‑quality long CoTs from many domains). Warm‑up with K&K matches or exceeds s1K performance, especially for smaller models, where s1K sometimes degrades accuracy. In ablation experiments, distilling short CoTs (lacking explicit reasoning steps) causes performance to collapse (e.g., MATH drops to 11%), confirming that the reasoning patterns—not merely the logical content of K&K—drive the transfer.

Stage 2 – Target‑domain adaptation via RLVR.
After warm‑up, the model is fine‑tuned on a very small set of task‑specific examples (≤ 100 for math, 50 for coding) using an unbiased GRPO RLVR algorithm. The same RLVR procedure is applied to the original base model for a fair comparison. Results show that the warmed‑up model learns faster and reaches higher final performance: on MATH, the base model gains +14.0 pp after 250 RL steps, whereas the warmed‑up model gains +20.7 pp after only 100 steps. With just 100 math examples, the warmed‑up model’s final accuracy matches that of a model trained on the full 7,500‑example RLVR baseline. Similar trends appear on HumanEval+, where the warmed‑up model achieves a +5.0 pp absolute gain over the base model under identical RLVR conditions.
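The core of RLVR is that rewards come from an automatic checker rather than a learned reward model, and GRPO converts those rewards into advantages by normalizing within a group of sampled completions for the same prompt. The sketch below is a minimal illustration under common conventions, not the paper's exact "unbiased GRPO" variant (which, e.g., may drop the standard-deviation term); `verifiable_reward` and `group_advantages` are hypothetical helper names.

```python
import statistics

def verifiable_reward(answer: str, gold: str) -> float:
    """Binary reward from an automatic checker: 1.0 if the model's final
    answer matches the reference exactly, else 0.0 (a common RLVR choice)."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each completion's reward by the
    mean and std of its group (all samples drawn for the same prompt)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one math prompt whose reference answer is "42":
rewards = [verifiable_reward(a, "42") for a in ["42", "41", "42", "7"]]
advs = group_advantages(rewards)
print(advs)  # positive for correct samples, negative for incorrect ones
```

Because correctness is checkable, no reward model needs to be trained, which is what makes RLVR viable with only tens of target-domain examples.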

Cross‑domain generalization.
A known side‑effect of RLVR is over‑specialization: performance on unrelated domains can deteriorate. The authors evaluate this by testing on MMLU‑Pro subsets (physics, history) after RLVR on math or coding. The warmed‑up + RLVR models retain most of their pre‑RLVR performance, whereas the base‑model‑only RLVR suffers noticeable drops. This demonstrates that the warm‑up phase embeds domain‑agnostic reasoning schemas that protect against catastrophic forgetting.

Robustness checks.
The warm‑up benefits persist across different teacher models (Qwen‑32B, DeepSeek‑R1) and across model families (e.g., DeepSeek‑Math‑7B‑Base). When the teacher provides short, non‑reasoning outputs, the distilled model overfits the K&K task and loses generalization, reinforcing that the quality of the reasoning traces is crucial.

Key contributions and implications.

  1. Warm‑up alone is a strong meta‑learner. Simple distillation from a toy logic game yields sizable, consistent gains on diverse downstream tasks.
  2. Sample‑efficient RLVR. With a warmed‑up model, RLVR requires far fewer examples and training steps to reach or surpass the performance of full‑scale RLVR.
  3. Preserved cross‑domain ability. Warm‑up mitigates the usual trade‑off between in‑domain RLVR performance and out‑of‑domain generalization.
  4. Practical pipeline for data‑scarce settings. The two‑stage method offers a low‑cost, scalable way to build reasoning LLMs when high‑quality domain data are scarce.

Overall, the paper provides compelling evidence that a brief, reasoning‑focused pre‑training on a domain‑agnostic logical game can endow LLMs with transferable reasoning skills, dramatically improve the efficiency of subsequent RLVR fine‑tuning, and maintain broad generalization. This work opens a promising direction for meta‑learning of reasoning in large language models, especially in resource‑constrained environments.

