Factored Causal Representation Learning for Robust Reward Modeling in RLHF
A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model’s contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.
💡 Research Summary
The paper tackles a critical weakness in reinforcement learning from human feedback (RLHF): reward models often latch onto spurious features—such as response length or sycophantic phrasing—that are not causally related to human preferences. This “reward hacking” leads to policies that achieve high predicted rewards while drifting away from true user intent.
Causal Perspective
The authors first formalize the problem with a causal graph. The prompt‑response pair (x, y) generates two latent factors: causal factors z_c that truly determine human preference, and non‑causal factors z_nc that capture irrelevant attributes. Standard reward models allow a direct edge z_nc → r, violating the desired invariance condition r ⊥⊥ z_nc.
CausalRM Framework
To enforce invariance, CausalRM factorizes the backbone embedding h = f_ϕ(x, y) into two variational latents via a VAE‑style encoder:
- q_α(z_c | h) – encouraged to retain information sufficient for reward prediction.
- q_α(z_nc | h) – encouraged to capture the remaining, reward‑irrelevant information.
Both posteriors are diagonal‑covariance Gaussians with standard normal priors, regularized by KL terms.
The reward head g_ψ is structurally restricted to consume only z_c, producing a scalar reward r̂ = g_ψ(z_c). This eliminates the spurious path at the architecture level.
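The factorization above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the linear encoder heads, the dimensions `D_H`/`D_Z`, and the weight initializations are all assumptions made for the sketch; the key points it shows are the two diagonal-Gaussian posteriors with closed-form KL to a standard normal prior, the reparameterized samples, and a reward head that consumes z_c alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the paper summary).
D_H, D_Z = 16, 4

# Toy linear "encoder heads" mapping the backbone embedding h to the
# parameters of the diagonal-Gaussian posteriors q(z_c|h) and q(z_nc|h).
W = {k: rng.normal(scale=0.1, size=(D_H, D_Z))
     for k in ("mu_c", "logvar_c", "mu_nc", "logvar_nc")}
w_reward = rng.normal(scale=0.1, size=D_Z)  # reward head sees z_c only


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (reparameterization trick)."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)


def forward(h, rng):
    mu_c, logvar_c = h @ W["mu_c"], h @ W["logvar_c"]
    mu_nc, logvar_nc = h @ W["mu_nc"], h @ W["logvar_nc"]
    z_c = reparameterize(mu_c, logvar_c, rng)
    z_nc = reparameterize(mu_nc, logvar_nc, rng)
    reward = float(z_c @ w_reward)  # structurally depends on z_c alone
    kl = (kl_to_standard_normal(mu_c, logvar_c)
          + kl_to_standard_normal(mu_nc, logvar_nc))
    return reward, z_c, z_nc, kl


h = rng.normal(size=D_H)  # stand-in for the backbone embedding f_phi(x, y)
reward, z_c, z_nc, kl = forward(h, rng)
```

Because the reward head never receives z_nc, any feature routed into the non-causal latent is architecturally cut off from the reward prediction, rather than merely penalized.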
A reconstruction decoder d_η takes the concatenated latents [z_c; z_nc] and reconstructs the backbone embedding h, encouraging the two factors to jointly preserve the information in the original representation.
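The adversarial head described in the abstract is trained through a gradient reversal layer, so the encoder is pushed to strip reward information out of z_nc even as the head tries to extract it. Below is a minimal sketch of that mechanism alone; the class name, the coefficient lam, and the example values are illustrative, not from the paper.

```python
import numpy as np


class GradReverse:
    """Gradient reversal layer: identity on the forward pass,
    multiplies incoming gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, z_nc):
        # The adversarial head sees z_nc unchanged.
        return z_nc

    def backward(self, grad_output):
        # Gradients flowing back into the encoder are flipped, so the
        # encoder is trained to *remove* reward-relevant information
        # from z_nc while the adversarial head still tries to use it.
        return -self.lam * grad_output


grl = GradReverse(lam=0.5)
z_nc = np.array([1.0, -2.0, 3.0])
out = grl.forward(z_nc)                     # identity
grad_in = grl.backward(np.array([0.1, 0.2, -0.3]))  # flipped and scaled
```

In a full training loop this layer would sit between the z_nc encoder and the adversarial reward predictor, with an autograd framework handling the backward hook; the two methods here just make the forward/backward asymmetry explicit.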