Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases – often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues predictive. To address this gap, we propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. Code and data are available at https://anonymous.4open.science/r/bias-mitigation-with-rl-BC47.


💡 Research Summary

The paper tackles a critical vulnerability of large language models (LLMs) when they are used as automated judges: susceptibility to social and cognitive biases such as bandwagon, authority, distraction, and positional cues. The authors argue that existing mitigation strategies—prompt engineering and supervised fine‑tuning (SFT)—only alter surface behavior and do not change the underlying optimization objective that makes bias cues predictive of reward. To fundamentally address this, they introduce Epistemic Independence Training (EIT), a reinforcement‑learning (RL) framework that makes bias cues non‑predictive of reward, thereby forcing the model to rely on intrinsic reasoning.

EIT consists of two complementary components. First, a “balanced conflict” data generation strategy: during training, each bias cue is paired with a correct answer in 50 % of the examples and with an incorrect answer in the remaining 50 %. This statistical neutralization removes any correlation between the bias signal and the ground truth, so a policy that follows the cue cannot improve its expected return. Second, a hierarchical reward function composed of three terms: (1) R_acc, a positive reward for a correct answer; (2) R_struct, a reward for producing a well‑formed chain‑of‑thought (CoT) response; and (3) R_ind, an asymmetric bias‑penalty term. When the bias contradicts the truth, following the bias incurs a penalty (‑γ₁) while answering correctly yields a modest bonus (+γ₁). When the bias aligns with the truth, there is no extra reward for agreement, but a penalty (‑γ₂) for deliberately contradicting the correct answer. This design eliminates any incentive to “always follow bias” and makes independence the optimal policy.
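The two components above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the coefficient values (γ₁, γ₂, R_acc, R_struct) and the function/field names are hypothetical placeholders chosen for clarity.

```python
import random

# Hypothetical weights; the paper's actual coefficients may differ.
GAMMA1 = 0.5  # conflict case: penalty for following bias / bonus for resisting it
GAMMA2 = 0.5  # aligned case: penalty for contradicting the correct answer

def make_balanced_conflict(examples, rng=random):
    """Pair each bias cue with the correct answer 50% of the time, so the
    cue carries zero information about the ground truth."""
    out = []
    for ex in examples:
        if rng.random() < 0.5:
            bias_target = ex["answer"]
        else:
            bias_target = rng.choice(
                [o for o in ex["options"] if o != ex["answer"]])
        out.append({**ex, "bias_target": bias_target})
    return out

def independence_reward(pred, answer, bias_target,
                        r_acc=1.0, r_struct=0.1, well_formed=True):
    """Hierarchical reward: R_acc + R_struct + asymmetric bias term R_ind."""
    r = (r_acc if pred == answer else 0.0) + (r_struct if well_formed else 0.0)
    if bias_target != answer:          # bias contradicts the truth
        if pred == answer:
            r += GAMMA1                # modest bonus for resisting the cue
        elif pred == bias_target:
            r -= GAMMA1                # penalty for following the cue
    else:                              # bias aligns with the truth
        if pred != answer:
            r -= GAMMA2                # penalty for contrarian disagreement
        # note: no extra reward for agreeing with an aligned cue
    return r
```

The asymmetry is the key design point: agreeing with a truth-aligned cue earns nothing beyond R_acc, so "always follow the bias" has no expected-reward advantage over "ignore the bias and reason."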

Optimization is performed with Group Relative Policy Optimization (GRPO). For each input, a group of G candidate responses is sampled; the group's average reward serves as a dynamic baseline, reducing gradient variance and shifting probability mass toward responses that are both accurate and independent. The authors train EIT on Qwen‑3‑1.7B and Qwen‑3‑4B using the MMLU‑Pro benchmark, injecting only bandwagon bias during training.
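The group-relative baseline can be sketched as follows. This is a simplified illustration of the standard GRPO advantage computation (mean-centered, standard-deviation-normalized), not code from the paper:

```python
import statistics

def grpo_advantages(group_rewards):
    """Score each of the G sampled responses relative to its siblings:
    subtract the group's mean reward (the dynamic baseline) and normalize
    by the group's standard deviation. No learned value model is needed."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]
```

Responses scoring above the group mean get positive advantages and are reinforced; in EIT's setting, bias-following responses tend to fall below the mean whenever the cue conflicts with the truth, so independence is what gets reinforced.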

Empirical results are striking. On Qwen‑3‑4B, accuracy under adversarial bandwagon bias improves from 70.1 % to 83.3 % (+13.2 pts), and robustness (stability of decisions after bias injection) rises from 68.5 % to 84.9 % (+16.4 pts). Crucially, the model generalizes to unseen bias types: authority bias sees a 10‑15 % boost, distraction bias a 39 % increase in robustness, and positional bias also benefits despite never appearing during training. The EIT‑trained Qwen‑3‑4B also outperforms larger, untrained models (Qwen‑3‑8B, Qwen‑3‑14B) on bias resistance, demonstrating that targeted RL training is more effective than scaling alone. Ablation studies confirm that both the balanced conflict data and the asymmetric bias penalty are essential; removing either leads to substantial performance degradation.

A qualitative analysis of reasoning traces reveals that SFT often yields “performative independence”—the model merely outputs refusal or bias‑ignoring language without genuine computation—whereas EIT produces substantive reasoning, with explicit domain engagement and logical verification before overriding bias cues. This indicates that EIT induces true epistemic independence rather than superficial pattern learning.

In summary, the paper presents a principled RL‑based solution that renders spurious cues non‑predictive of reward, thereby training LLM judges that are robust to a wide range of social biases. The balanced conflict strategy, bias‑penalizing reward shaping, and GRPO optimization together constitute a novel and effective approach to improving the reliability of LLM‑as‑a‑Judge systems. Future work may explore extending EIT to multi‑turn dialogues, broader bias families, and human‑in‑the‑loop evaluations.

