From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models
Large Language Models (LLMs) trained solely on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. But how does this ability evolve during training? We present the first analysis of how the mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training. To this end, we construct MathCAMPS, a synthetic dataset of novel mathematical reasoning problems grounded in 44 fine-grained skills drawn from the Common Core curriculum for grades K-8. In one experiment, we show that mathematical skills are learned during pre-training in an order that measurably correlates with the human-designed curriculum, even though the training data are randomly ordered. We also present a detailed analysis of which mathematical abilities benefit from instruction tuning, a widely used post-training method, and, in contrast, which skills suffer. Our work paves the way for an empirical understanding of LLM training dynamics in relation to reasoning.
💡 Research Summary
This paper presents a pioneering empirical investigation into how mathematical reasoning ability emerges and evolves in Large Language Models (LLMs) trained solely on the next-token prediction objective. To overcome the limitations of existing benchmarks—such as potential data contamination and the lack of fine-grained skill assessment—the authors introduce MathCAMPS, a novel synthetic dataset grounded in the human educational curriculum.
MathCAMPS is systematically constructed from 44 fine-grained mathematical skills (standards) drawn from the Common Core State Standards for grades K-8. The core innovation lies in representing each standard as an attribute grammar, a formalism that allows for the sampling of an infinite number of valid symbolic problem structures while enforcing semantic constraints specific to each skill (e.g., “addition and subtraction within 20”). These symbolic structures are then converted into natural language word problems using GPT-4. To ensure the faithfulness and quality of these LLM-generated problems, the authors introduce a critical “cycle-consistency” check: the generated word problem is fed back to GPT-4 to be translated back into a symbolic form, and the problem is only retained if the answer derived from this recovered structure matches the original answer. The dataset also includes automatically generated follow-up questions (counterfactual and incremental) to enable deeper probing of model understanding.
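The generation-and-filtering loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the actual pipeline uses attribute grammars for sampling and GPT-4 for both translation directions, which are reduced here to a toy sampler and a direct answer comparison. All function names (`sample_problem`, `cycle_consistent`, etc.) are invented for this sketch.

```python
# Hedged sketch of skill-constrained sampling plus the cycle-consistency
# filter. The real pipeline samples from an attribute grammar and uses
# GPT-4 to translate symbolic <-> natural language; both are stubbed here.

import operator
import random

OPS = {"+": operator.add, "-": operator.sub}

def sample_problem(max_value=20, rng=random):
    """Sample a symbolic problem respecting a skill constraint,
    e.g. 'addition and subtraction within 20': both operands and
    the result must stay in [0, max_value]."""
    while True:
        a = rng.randint(0, max_value)
        b = rng.randint(0, max_value)
        op = rng.choice(list(OPS))
        if 0 <= OPS[op](a, b) <= max_value:
            return (a, op, b)

def answer(problem):
    """Evaluate a symbolic problem (a, op, b)."""
    a, op, b = problem
    return OPS[op](a, b)

def cycle_consistent(original, recovered):
    """Keep a generated word problem only if the symbolic structure
    recovered from it yields the same answer as the original."""
    return answer(original) == answer(recovered)
```

In the paper's setup, `recovered` would be the symbolic form GPT-4 extracts from its own word-problem rendering; a mismatch signals an unfaithful translation, and the problem is discarded.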
Using MathCAMPS, the authors analyze the learning dynamics across intermediate checkpoints of several open-weight models, including Pythia-12B, OLMo-7B, and Amber. A key finding is that during pre-training, the order in which mathematical skills are acquired by the models shows a statistically significant correlation with the human-designed grade-level sequence of the Common Core curriculum, despite the training data being randomly ordered. This suggests an intriguing alignment between the intrinsic learning dynamics of LLMs and human knowledge organization.
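The curriculum-alignment claim boils down to a rank correlation: order skills by the training step at which a model first masters them, and compare that ordering to the skills' Common Core grade levels. The sketch below computes a Spearman rank correlation from scratch on invented numbers; the paper's actual skill list, checkpoints, and statistics are not reproduced here.

```python
# Illustrative curriculum-alignment check: Spearman rank correlation
# between each skill's grade level and the training step at which a
# model "acquires" it. All data below are made up.

def rank(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical example: grade level of six skills vs. the step at which
# a model first passes them. A high rho means curriculum-aligned order.
grades = [0, 1, 2, 3, 5, 7]
acquired_at = [10_000, 12_000, 30_000, 25_000, 60_000, 80_000]
print(round(spearman(grades, acquired_at), 3))  # → 0.943
```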
Furthermore, the paper provides a detailed analysis of the impact of instruction tuning, a prevalent post-training method. The results reveal that the effect of instruction tuning is highly skill-dependent. While it can boost overall problem-solving performance, it may also lead to a “cost of specialization,” where proficiency in certain basic arithmetic skills degrades even as more complex reasoning abilities improve. This highlights that aggregate performance scores can mask significant regressions in specific capabilities.
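The point that aggregate scores can mask regressions is easy to demonstrate numerically. The toy example below uses invented skill names and accuracies (none taken from the paper): the mean accuracy rises after tuning even while one basic skill degrades.

```python
# Toy illustration: an aggregate score improves after instruction
# tuning while a specific skill regresses. Skill names and all
# accuracy values are invented for illustration.

base  = {"add_within_20": 0.95, "multi_digit_mult": 0.40, "fraction_word": 0.30}
tuned = {"add_within_20": 0.85, "multi_digit_mult": 0.65, "fraction_word": 0.55}

overall_base = sum(base.values()) / len(base)
overall_tuned = sum(tuned.values()) / len(tuned)
print(f"aggregate: {overall_base:.2f} -> {overall_tuned:.2f}")

# Per-skill deltas expose what the aggregate hides.
regressions = {s: round(tuned[s] - base[s], 2)
               for s in base if tuned[s] < base[s]}
print("regressed skills:", regressions)
```

This is exactly the kind of per-skill behavioral profile the paper advocates over a single monolithic benchmark number.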
In summary, this work makes three primary contributions: 1) the creation of MathCAMPS, a fine-grained, curriculum-grounded, and contamination-free benchmark for mathematical reasoning; 2) novel insights into pre-training dynamics, showing curriculum-aligned skill acquisition; and 3) a nuanced analysis of the heterogeneous effects of instruction tuning on distinct mathematical abilities. The research paves the way for a more empirical science of understanding how reasoning capabilities develop in LLMs, advocating for evaluation frameworks that move beyond monolithic scores to detailed behavioral skill profiles.