Think Twice: Branch-and-Rethink Reasoning Reward Model
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
💡 Research Summary
The paper introduces Branch‑and‑Rethink Reward Model (BR‑RM), a two‑turn reward‑modeling framework that brings the “think‑twice” principle from reasoning LLMs to reward models. Traditional reward models (RMs) compress many quality dimensions—factuality, safety, coherence, style, etc.—into a single scalar in one forward pass, which the authors term “judgment diffusion”: attention is spread thinly across criteria, leading to shallow analysis and missed subtle errors. Recent generative RMs (GenRMs) and reasoning RMs (ReasonRMs) add a rationale step or allocate test‑time compute, but they still evaluate the full rubric at once and therefore do not guarantee focused scrutiny.
BR‑RM addresses this by structuring the evaluation into two sequential generations. In Turn 1 (Adaptive Branching), the model selects a small subset (typically 2–3) of instance‑critical criteria from a predefined pool of nine (e.g., Information Accuracy, Logical Coherence, Safety, Creativity). This selection forces the model to hypothesize which dimensions are most at risk for the given pair of responses. Conditioned on the selected criteria, the model also produces a brief preliminary analysis for each response (α₁, α₂). The output of Turn 1 (the selected criteria and the initial analyses) constitutes the first part of the trace τ₁.
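As a concrete illustration, a minimal sketch of the Turn-1 format gate might look as follows. The criterion pool mixes the four names the paper mentions with five placeholder names to fill out the pool of nine, and the `CRITERIA:` tag is an assumed trace schema, not the paper's exact prompt format:

```python
# Hypothetical Turn-1 parser/validator. The tag name "CRITERIA:" and five of
# the nine pool entries are illustrative assumptions, not the paper's schema.
CRITERIA_POOL = {
    "Information Accuracy", "Logical Coherence", "Safety", "Creativity",   # named in the paper
    "Instruction Following", "Completeness", "Clarity", "Style", "Helpfulness",  # placeholders
}

def parse_turn1(trace: str, max_branches: int = 3):
    """Extract the selected criteria from a Turn-1 trace containing a line
    'CRITERIA: a; b; c'. Returns the criteria list, or None if the trace is
    malformed, selects too many dimensions, or names an unknown criterion."""
    for line in trace.splitlines():
        if line.startswith("CRITERIA:"):
            chosen = [c.strip() for c in line[len("CRITERIA:"):].split(";") if c.strip()]
            if 1 <= len(chosen) <= max_branches and all(c in CRITERIA_POOL for c in chosen):
                return chosen
            return None  # fails the strict format check
    return None
```

A validator like this is what makes the "strict format checks" in training enforceable: a trace that branches on too many dimensions, or on dimensions outside the pool, is rejected before any reward is assigned.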
Turn 2 (Branch‑Conditioned Rethinking) takes τ₁ as a conditioning signal and performs a deep, issue‑driven re‑evaluation. The model re‑reads the responses through the lens of the previously identified dimensions, verifying facts, checking for safety violations, or probing logical flaws as appropriate. This targeted second pass concentrates compute where it matters most, curbing the shallow, broad reasoning that plagues single‑pass RMs. The final decision (ẑ) is extracted from the Turn 2 output, completing the trace τ₂.
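The two-turn loop can be sketched as a thin wrapper around any text-generation backend. The prompt templates and the `DECISION:` tag below are hypothetical stand-ins for the paper's actual prompts; only the structure (Turn 2 conditioned on the full Turn-1 trace, decision extracted from Turn 2) reflects the method:

```python
import re

def two_turn_judge(generate, prompt, resp_a, resp_b):
    """Branch-and-rethink as two sequential generations.

    `generate` is any string -> string LLM call supplied by the caller.
    Prompt wording and the 'DECISION:' tag are illustrative placeholders.
    """
    # Turn 1: adaptive branching -- pick at-risk dimensions, sketch hypotheses.
    turn1 = generate(
        "Select the 2-3 evaluation criteria most at risk and give a brief "
        f"preliminary analysis of each response.\nPrompt: {prompt}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}"
    )
    # Turn 2: rethinking conditioned on the full Turn-1 trace tau_1.
    turn2 = generate(
        "Re-read both responses strictly through the criteria identified "
        f"below, then end with 'DECISION: A' or 'DECISION: B'.\n{turn1}"
    )
    m = re.search(r"DECISION:\s*([AB])", turn2)
    return m.group(1) if m else None  # z-hat, or None on format failure
```

Returning `None` on a missing decision tag mirrors the strict-format policy: a trace that cannot be parsed is simply treated as incorrect.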
Training uses Group Relative Policy Optimization (GRPO), a PPO‑style algorithm that is stable and efficient for multi‑turn generation. The policy πθ generates the full trace τ = τ₁ ∘ τ₂ for each (prompt, response₁, response₂) triplet. A simple binary outcome reward (correct preference) is assigned uniformly to all tokens in the trace after strict format validation. The advantage is computed by whitening the per‑prompt reward across K sampled traces, and the loss combines a clipped surrogate term with a KL penalty against a reference policy π_ref.
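The reward and advantage computation described above is simple enough to sketch directly. This is a minimal, dependency-free illustration of the group-whitened (per-prompt) advantage and the format-gated binary reward; the actual training loop would plug these into a clipped-surrogate loss with a KL penalty:

```python
import math

def outcome_reward(predicted, gold, format_ok):
    """Binary outcome reward with a strict format gate: a trace that fails
    format validation gets zero regardless of preference correctness."""
    return 1.0 if (format_ok and predicted == gold) else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Whiten the rewards of K traces sampled for the same prompt
    (subtract the group mean, divide by the group std), GRPO-style.
    If every trace got the same reward, all advantages are ~0 and the
    prompt contributes no gradient signal."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward is assigned per trace and whitened within each prompt's group of K samples, no separate value network is needed, which is part of what keeps the approach compatible with standard RLHF pipelines.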
Empirically, BR‑RM achieves state‑of‑the‑art results on three major reward‑modeling benchmarks: RewardBench, RM‑Bench, and RMB. It outperforms strong baselines—including ReasonRM, GenRM, and conventional scalar RMs—especially on metrics that require sensitivity to subtle factual slips, safety breaches, and code correctness. Ablation studies show that selecting 2–3 dimensions per instance yields the best trade‑off between focus and coverage, and that GRPO provides more stable learning than vanilla PPO. The method also reallocates token budget efficiently, focusing on critical dimensions without a large increase in overall compute.
Limitations include the need for a predefined set of evaluation dimensions; extending to new domains requires curating appropriate criteria. The two‑turn generation introduces additional latency, which may be problematic for real‑time applications, so practitioners must balance token budget and response time. Future work could explore automatic dimension discovery, multi‑turn extensions, and integrating human feedback to learn dimension weights dynamically.
In summary, BR‑RM structurally mitigates judgment diffusion by enforcing a focused first pass and a deep, conditioned second pass. This “branch‑and‑rethink” approach substantially improves the depth, reliability, and error‑sensitivity of reward models, offering a promising new direction for alignment pipelines that rely on high‑quality preference judgments.