Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models
📝 Original Info
- Title: Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models
- ArXiv ID: 2511.07368
- Date: 2025-11-10
- Authors: Not provided in the source data.
📝 Abstract
Foundation models encode rich structural knowledge but often rely on post-training procedures to adapt their reasoning behavior to specific tasks. Popular approaches such as reinforcement learning with verifiable rewards (RLVR) and inference-time reward aggregation are typically analyzed from a performance perspective, leaving their effects on the underlying reasoning distribution less understood. In this work, we study post-training reasoning from a stochastic trajectory viewpoint. Following Kim et al. (2025), we model reasoning steps of varying difficulty as Markov transitions with different probabilities, and formalize reasoning processes using tree-structured Markov chains. Within this framework, pretraining corresponds to discovering the reasoning structure, while post-training primarily reweights existing chains of thought. We show that both RLVR and inference-time reward aggregation concentrate probability mass on a small number of high-probability trajectories, leading to the suppression of rare but essential reasoning paths. As a consequence, solving hard instances often depends on low-probability trajectories already present in the base model. We further prove that exploration-oriented mechanisms, such as rejecting easy instances and applying KL regularization, help preserve these rare trajectories. Empirical simulations support our theoretical analysis.
💡 Deep Analysis
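To make the reweighting picture concrete, here is a minimal Python sketch (not the paper's code): a tree of reasoning steps is collapsed into four whole trajectories with assumed base probabilities and verifiable rewards; an exact-expectation REINFORCE update stands in for RLVR-style post-training, and the closed-form reward-tilted distribution stands in for a KL-regularized update. The trajectory names, probabilities, rewards, and hyperparameters are all illustrative assumptions, not values from the paper.

```python
# Minimal simulation sketch of post-training as reweighting (illustrative only).
import math

# A tiny tree-structured reasoning model, flattened to whole trajectories.
# "rare_path" is the low-probability trajectory needed for hard instances.
NAMES  = ["easy_path_A", "easy_path_B", "detour_path", "rare_path"]
BASE_P = [0.55,          0.30,          0.13,          0.02]
REWARD = [1.0,           1.0,           0.0,           1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce(base_probs, rewards, lr=1.0, steps=1000):
    """Exact expected REINFORCE update on a softmax policy over trajectories:
    dz_i = lr * p_i * (r_i - E[r]).  High-probability rewarded trajectories
    receive larger updates, so mass concentrates on them and rare rewarded
    trajectories are relatively suppressed."""
    logits = [math.log(p) for p in base_probs]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

def kl_regularized_optimum(base_probs, rewards, kl_coef=1.0):
    """Closed-form optimum of reward maximization with a KL(pi || pi_base)
    penalty: p*(t) is proportional to p_base(t) * exp(r(t) / kl_coef), so rare
    but rewarded trajectories keep a share close to their base mass."""
    weights = [p * math.exp(r / kl_coef) for p, r in zip(base_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def show(title, probs):
    print(title)
    for name, q in zip(NAMES, probs):
        print(f"  {name:12s} {q:.4f}")

if __name__ == "__main__":
    show("base model", BASE_P)
    show("unregularized RL (RLVR-like)", reinforce(BASE_P, REWARD))
    show("KL-regularized optimum (kl_coef=1.0)", kl_regularized_optimum(BASE_P, REWARD))
```

In this toy setting, the rare rewarded trajectory's mass falls well below its base share under the unregularized update, while the KL-regularized optimum keeps it near its base probability, mirroring the abstract's claim that exploration-preserving mechanisms such as KL regularization protect rare trajectories.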
📄 Full Content
Reference
This content is AI-processed based on open access ArXiv data.