Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models
📝 Original Info
- Title: Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models
- ArXiv ID: 2511.07368
- Date: 2025-11-10
- Authors: Not provided in the source data.
📝 Abstract
Foundation models encode rich structural knowledge but often rely on post-training procedures to adapt their reasoning behavior to specific tasks. Popular approaches such as reinforcement learning with verifiable rewards (RLVR) and inference-time reward aggregation are typically analyzed from a performance perspective, leaving their effects on the underlying reasoning distribution less understood. In this work, we study post-training reasoning from a stochastic trajectory viewpoint. Following Kim et al. (2025), we model reasoning steps of varying difficulty as Markov transitions with different probabilities, and formalize reasoning processes using tree-structured Markov chains. Within this framework, pretraining corresponds to discovering the reasoning structure, while post-training primarily reweights existing chains of thought. We show that both RLVR and inference-time reward aggregation concentrate probability mass on a small number of high-probability trajectories, leading to the suppression of rare but essential reasoning paths. As a consequence, solving hard instances often depends on low-probability trajectories already present in the base model. We further prove that exploration-oriented mechanisms, such as rejecting easy instances and applying KL regularization, help preserve these rare trajectories. Empirical simulations support our theoretical analysis.
💡 Deep Analysis
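To make the reweighting picture concrete, here is a minimal Python sketch (not the paper's code): a tree of reasoning steps is collapsed into four whole trajectories with assumed base probabilities and verifiable rewards; an exact-expectation REINFORCE update stands in for RLVR-style post-training, and the closed-form reward-tilted distribution stands in for a KL-regularized update. The trajectory names, probabilities, rewards, and hyperparameters are all illustrative assumptions, not values from the paper.

```python
# Minimal simulation sketch of post-training as reweighting (illustrative only).
import math

# A tiny tree-structured reasoning model, flattened to whole trajectories.
# "rare_path" is the low-probability trajectory needed for hard instances.
NAMES  = ["easy_path_A", "easy_path_B", "detour_path", "rare_path"]
BASE_P = [0.55,          0.30,          0.13,          0.02]
REWARD = [1.0,           1.0,           0.0,           1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce(base_probs, rewards, lr=1.0, steps=1000):
    """Exact expected REINFORCE update on a softmax policy over trajectories:
    dz_i = lr * p_i * (r_i - E[r]).  High-probability rewarded trajectories
    receive larger updates, so mass concentrates on them and rare rewarded
    trajectories are relatively suppressed."""
    logits = [math.log(p) for p in base_probs]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

def kl_regularized_optimum(base_probs, rewards, kl_coef=1.0):
    """Closed-form optimum of reward maximization with a KL(pi || pi_base)
    penalty: p*(t) is proportional to p_base(t) * exp(r(t) / kl_coef), so rare
    but rewarded trajectories keep a share close to their base mass."""
    weights = [p * math.exp(r / kl_coef) for p, r in zip(base_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def show(title, probs):
    print(title)
    for name, q in zip(NAMES, probs):
        print(f"  {name:12s} {q:.4f}")

if __name__ == "__main__":
    show("base model", BASE_P)
    show("unregularized RL (RLVR-like)", reinforce(BASE_P, REWARD))
    show("KL-regularized optimum (kl_coef=1.0)", kl_regularized_optimum(BASE_P, REWARD))
```

In this toy setting, the rare rewarded trajectory's mass falls well below its base share under the unregularized update, while the KL-regularized optimum keeps it near its base probability, mirroring the abstract's claim that exploration-preserving mechanisms such as KL regularization protect rare trajectories.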
📄 Full Content
Reference
This content is AI-processed based on open access ArXiv data.