Pretrain Value, Not Reward: Decoupled Value Policy Optimization

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision. The value function predicts the return-to-go of a partial answer, that is, how promising the partial answer is if it were continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected. This makes critic learning redundant: training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision, and our value model is trained on exactly the same data used for reward modeling. Building on this insight, we introduce Decoupled Value Policy Optimization (DVPO), a framework that pretrains a Global Value Model (GVM) offline and freezes it as a universal critic for policy learning. The GVM provides stable, fine-grained credit assignment without critic drift or trajectory sampling. Experiments across MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods. These results highlight that RLHF can be reframed as policy-only optimization guided by a single pretrained value model.


💡 Research Summary

The paper revisits the standard Reinforcement Learning from Human Feedback (RLHF) pipeline, which traditionally consists of two stages: first, a reward model (RM) is trained on a fixed set of human preference data; second, a value model (critic) is learned online from the RM or via trajectory sampling. The authors argue that, when the preference dataset is static and no new environment rewards can be collected, the second stage adds no new information beyond what is already encoded in the RM. They formalize this claim in Lemma 3.1, showing that “reward‑pretraining + value‑estimation” is informationally equivalent to directly pretraining a value model on the same data.
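One way to see the claimed equivalence (a sketch under the paper's terminal-reward setup, where the sentence-level reward r(x, y) is assigned only to the final token):

```latex
V^{\pi}(s_t)
  = \mathbb{E}_{\pi}\!\left[\,\sum_{k=t}^{T} \gamma^{\,k-t}\, r_k \,\middle|\, s_t \right]
  = \mathbb{E}_{\pi}\!\left[\, \gamma^{\,T-t}\, r(x, y) \,\middle|\, s_t \right]
```

Since all intermediate rewards are zero, the value targets are a deterministic function of the same preference-derived reward r(x, y) that the RM is fit to, so the online value-estimation stage introduces no information beyond the fixed preference dataset.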

Motivated by this equivalence, the authors propose Decoupled Value Policy Optimization (DVPO). The core component is a Global Value Model (GVM), denoted Qϕ(τ, s, a), which predicts token‑level return‑to‑go conditioned on a sampled trajectory τ that implicitly encodes the policy’s style, correctness, and domain expertise. The GVM is trained offline using temporal‑difference (TD) loss on the fixed dataset D = {τ_i, s_{it}, a_{it}, G_{it}}. Importantly, the reward signal r(s,a) is derived from sentence‑level human feedback and assigned only to the final token, while intermediate tokens receive zero reward, yielding a simple return G_t = γ^{T‑t} r(x,y). The TD loss L_GVM = (r_t + γ Qϕ(τ, s_{t+1}, a_{t+1}) – Qϕ(τ, s_t, a_t))² drives the GVM to approximate the true action‑value function across many policies.
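The return and TD-loss construction above can be sketched numerically. This is an illustrative sketch, not the paper's implementation: `token_returns` and `td_loss` are hypothetical helper names, and the GVM's neural parameterization is replaced by a plain array of Q-values.

```python
import numpy as np

def token_returns(terminal_reward: float, T: int, gamma: float) -> np.ndarray:
    """Return-to-go G_t = gamma^(T-t) * r(x, y) for a T-token response,
    where the sentence-level reward sits only on the final token."""
    t = np.arange(1, T + 1)
    return gamma ** (T - t) * terminal_reward

def td_loss(q_values: np.ndarray, terminal_reward: float, gamma: float) -> float:
    """Squared TD error sum_t (r_t + gamma * Q_{t+1} - Q_t)^2, with r_t = 0
    for all tokens except the last, and Q_{T+1} = 0 past the sequence end."""
    rewards = np.zeros_like(q_values, dtype=float)
    rewards[-1] = terminal_reward
    next_q = np.append(q_values[1:], 0.0)
    td_errors = rewards + gamma * next_q - q_values
    return float(np.sum(td_errors ** 2))
```

As a sanity check, plugging the exact returns G_t back in as Q-values drives the TD loss to zero, which is what the offline GVM training objective pushes toward.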

Once the GVM converges, it is frozen and used as a static critic for policy optimization. DVPO adopts the standard clipped PPO objective L_PPO(θ) = E_t[min(ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t)], where ρ_t(θ) is the probability ratio between the new and old policies and the advantage Â_t is derived from the frozen GVM rather than from an online critic.
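The clipped PPO surrogate with advantages supplied by a frozen critic can be sketched as follows. This is a minimal illustration, not DVPO's actual training code: the function name and the flat-array inputs are assumptions, and advantage computation from the GVM is taken as given.

```python
import numpy as np

def ppo_clip_loss(logp_new: np.ndarray, logp_old: np.ndarray,
                  advantages: np.ndarray, eps: float = 0.2) -> float:
    """Negated clipped surrogate -E[min(rho * A, clip(rho, 1-eps, 1+eps) * A)],
    where rho = exp(logp_new - logp_old) is the per-token probability ratio
    and A holds advantages computed from the frozen GVM."""
    rho = np.exp(logp_new - logp_old)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

Because the critic is frozen, only the policy parameters θ receive gradients, which is what makes DVPO a policy-only optimization loop.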

