Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective
Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing the user's utility because the KL regularization may cause the LLM to inherit biases in the base policy that conflict with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
💡 Research Summary
The paper addresses a fundamental inefficiency in current large‑language‑model (LLM) alignment pipelines that rely on a reward model learned from user preference data. In both training‑time and inference‑time alignment, the policy is optimized to maximize the learned reward while being regularized by a KL‑divergence term that keeps the policy close to a base model. When the base model encodes biases that conflict with user preferences, the KL constraint prevents the policy from fully exploiting the reward, leading to sub‑optimal user utility. Amplifying the reward for preferred outputs can counteract this bias, but excessive amplification raises the risk of reward hacking and degenerate behavior.
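The tradeoff can be made concrete with a toy sketch. Under KL regularization, the optimal policy tilts the base policy as ρ(y) ∝ ρ_base(y)·exp(α·r(y)/β), where the amplification factor α and all numbers below are illustrative assumptions, not values from the paper:

```python
import math

def tilted_policy(base, reward, beta=1.0, alpha=1.0):
    # KL-regularized optimum: rho(y) ∝ base(y) * exp(alpha * reward(y) / beta).
    # alpha is a hypothetical amplification factor used for illustration.
    w = [b * math.exp(alpha * r / beta) for b, r in zip(base, reward)]
    z = sum(w)
    return [x / z for x in w]

# Toy vocabulary of three responses:
#   y0: user-preferred, but disfavored by the biased base policy
#   y1: base-policy favorite with mediocre true utility
#   y2: "reward-hacked" output whose proxy reward overstates its true utility
base    = [0.10, 0.88, 0.02]
proxy_r = [1.0, 0.2, 1.1]    # learned (proxy) reward
true_r  = [1.0, 0.2, -1.0]   # true user utility

for alpha in (1.0, 4.0, 16.0):
    rho = tilted_policy(base, proxy_r, alpha=alpha)
    utility = sum(p * u for p, u in zip(rho, true_r))
    print(f"alpha={alpha:>4}: P(hacked y2)={rho[2]:.2f}, true utility={utility:.2f}")
```

In this toy, moderate amplification lifts true utility by overriding the base-policy bias, while heavy amplification hands a large share of probability mass to the hacked output and utility collapses.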
To formalize the trade‑off, the authors cast reward‑model design as a Stackelberg game. The “leader” (reward‑model provider) selects a reward function r, anticipating that the “follower” (the LLM) will respond by solving the KL‑regularized optimization problem. The leader’s objective is to maximize the true user utility r_U under the follower’s best‑response policy ρ_r. This yields a bi‑level program: the leader maximizes E_{y∼ρ_r}[r_U(y)] over r, where the follower’s best response ρ_r solves max_ρ E_{y∼ρ}[r(y)] − β·KL(ρ‖ρ_base) and admits the closed form ρ_r(y) ∝ ρ_base(y)·exp(r(y)/β).
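Because the follower's KL-regularized best response has the standard closed form ρ_r(y) ∝ ρ_base(y)·exp(r(y)/β), a one-dimensional slice of the leader's bi-level problem can be sketched numerically. The scalar shaping grid and all toy numbers below are illustrative assumptions, not the paper's shaping scheme:

```python
import math

def best_response(base, reward, beta=1.0):
    # Follower: closed-form optimum of E[r] - beta * KL(rho || base).
    w = [b * math.exp(r / beta) for b, r in zip(base, reward)]
    z = sum(w)
    return [x / z for x in w]

def leader_utility(base, shaped_reward, true_reward, beta=1.0):
    # Leader's objective: true utility under the follower's best response.
    rho = best_response(base, shaped_reward, beta)
    return sum(p * u for p, u in zip(rho, true_reward))

base    = [0.10, 0.88, 0.02]   # biased base policy (toy numbers)
proxy_r = [1.0, 0.2, 1.1]      # learned proxy reward
true_r  = [1.0, 0.2, -1.0]     # true user utility r_U

# One-dimensional slice of the leader's problem: scale the proxy reward
# by alpha and pick the alpha that maximizes true utility.
grid = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
best_u, best_alpha = max(
    (leader_utility(base, [alpha * r for r in proxy_r], true_r), alpha)
    for alpha in grid
)
print(f"best alpha={best_alpha}, leader utility={best_u:.2f}")
```

In this toy an interior shaping factor wins: too little shaping leaves the base-policy bias in place, while too much shaping rewards the hacked output, mirroring the tradeoff the bi-level program formalizes.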