Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities
Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent’s learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this “judging the judges” mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.
💡 Research Summary
The paper “Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities” challenges a tacit assumption that underlies most modern AI alignment pipelines, especially Reinforcement Learning from Human Feedback (RLHF). The authors label this assumption “Dogma 4”: the feedback signal received by an RL agent is an exogenous, immutable ground‑truth reward. While this holds in physics‑based simulators, it collapses in social settings where human evaluators can be biased, lazy, or actively adversarial.
To formalize the problem, the authors introduce the Social MDP, an extension of the standard Markov Decision Process that explicitly models a set of evaluators \(E = \{e_1,\dots,e_M\}\). At each timestep the agent receives a feedback vector \(r_t\) whose component from evaluator \(m\) is
\[
r^{(m)}_t = R^{*}(s_t,a_t) + b_m(s_t,a_t) + \epsilon^{(m)}_t.
\]
Here \(R^{*}\) is the latent true reward, \(b_m\) is a systematic bias, and \(\epsilon^{(m)}_t\) is zero-mean noise. The paper classifies evaluators into three types: (1) truthful (\(b_m \approx 0\)), (2) sycophantic (\(b_m \propto \pi\), rewarding actions that confirm the agent's current policy), and (3) adversarial (\(b_m \approx -C \cdot R^{*}\)).
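The three evaluator types can be made concrete with a small simulation. The sketch below is illustrative only, not the paper's code: the function name `feedback_vector` and its parameters are assumptions, and the sycophantic bias is modeled simply as the agent's probability of its chosen action, \(\pi(a_t \mid s_t)\).

```python
import random

def feedback_vector(r_true, policy_prob, num_truthful=1, num_sycophantic=1,
                    num_adversarial=1, noise_std=0.1, adv_scale=1.0):
    """Simulate one feedback vector r_t in the Social MDP (toy illustration).

    r_true       -- R*(s_t, a_t), the latent true reward
    policy_prob  -- pi(a_t | s_t), used for the sycophantic bias b_m ∝ pi
    adv_scale    -- the constant C in the adversarial bias b_m ≈ -C · R*
    """
    r = []
    for _ in range(num_truthful):      # truthful: b_m ≈ 0
        r.append(r_true + random.gauss(0, noise_std))
    for _ in range(num_sycophantic):   # sycophantic: rewards confirming the policy
        r.append(r_true + policy_prob + random.gauss(0, noise_std))
    for _ in range(num_adversarial):   # adversarial: b_m ≈ -C · R*
        r.append(r_true - adv_scale * r_true + random.gauss(0, noise_std))
    return r
```

With the noise switched off, `feedback_vector(1.0, 0.5, noise_std=0.0)` returns one component per evaluator type: the truthful signal, the policy-inflated one, and the adversarially flipped one.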
The core theoretical contribution is the Objective Decoupling theorem. The authors define an Objective Decoupling Gap \(\Delta = J(\pi^{*}) - \mathbb{E}_{\hat\pi}\bigl[\sum_t \gamma^t R^{*}(s_t,a_t)\bigr]\): the shortfall of the learned policy \(\hat\pi\) relative to the optimal policy \(\pi^{*}\) when both are evaluated against the true reward \(R^{*}\). The theorem shows that when biased evaluators form a majority, \(\Delta\) remains bounded away from zero, so the agent's learned objective permanently decouples from the latent ground truth.
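The gap \(\Delta\) can be estimated by Monte Carlo in a toy bandit setting, where \(J(\pi^{*})\) is simply the best arm's true reward. This is a minimal sketch under assumed names (`estimate_gap`, `r_star`, `pi_hat` are illustrative, not from the paper):

```python
import random

def estimate_gap(r_star, pi_hat, num_episodes=10_000, seed=0):
    """Monte Carlo estimate of Delta = J(pi*) - J(pi_hat) in a one-step bandit,
    where both policies are scored against the *true* reward R*.

    r_star -- dict action -> true reward R*(a)
    pi_hat -- dict action -> probability under the learned policy
    """
    rng = random.Random(seed)
    j_opt = max(r_star.values())                    # J(pi*): pick the best arm
    actions, probs = zip(*pi_hat.items())
    total = 0.0
    for _ in range(num_episodes):
        a = rng.choices(actions, weights=probs)[0]  # sample a ~ pi_hat
        total += r_star[a]                          # score against R*, not feedback
    return j_opt - total / num_episodes
```

For example, if a sycophantic majority has pushed the learned policy toward the worse arm, `estimate_gap({"good": 1.0, "bad": 0.0}, {"good": 0.2, "bad": 0.8})` yields a gap near 0.8, while a policy matching \(\pi^{*}\) yields a gap of zero.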