CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria
Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigm enables better and more flexible evaluation, and shows promise for generative reward models in reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and their actual effectiveness in RL practice. We attribute this gap to several limitations of existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. We therefore propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and guided by unified query-based criteria. Using only about 5.7K high-quality examples curated from an open-source preference dataset, CE-RM-4B achieves superior performance on diverse reward-model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
💡 Research Summary
The paper tackles a fundamental problem in open‑ended natural‑language generation: how to obtain reliable automatic evaluation signals that can be used as rewards for reinforcement learning (RL). While recent “LLM‑as‑a‑Judge” approaches have shown promise, most existing generative reward models (GRMs) are trained on pairwise preference data and evaluate responses by directly comparing two candidates. This pairwise paradigm creates two practical issues. First, in RL the pairwise scores must be converted into pointwise rewards, often via Elo‑style ranking, which incurs quadratic computational cost and reduces efficiency. Second, many works generate evaluation criteria jointly conditioned on both the query and the response; when multiple responses share the same query, the criteria become inconsistent, introducing bias.
To address these shortcomings, the authors propose CE‑RM‑4B, a pointwise generative reward model that (1) first generates a set of evaluation criteria from the query alone (the "unified criteria"), and then (2) evaluates each response conditioned on both the query and this shared criteria set. This two‑stage rollout separates criteria generation from response analysis, ensuring that all responses to the same query are judged against the same standards. This improves consistency and lets the criteria be reused across many responses, reducing inference cost.
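A minimal Python sketch of this two-stage rollout. Here `generate` is a stand-in for an LLM sampling call, and the prompt strings are illustrative, not the paper's actual templates:

```python
def generate(prompt: str) -> str:
    """Placeholder for an LLM sampling call; returns a deterministic stub."""
    return f"<model output for: {prompt.splitlines()[0]}>"

def two_stage_rollout(query: str, responses: list[str]) -> dict[str, str]:
    # Stage 1: derive unified criteria from the query alone, so every
    # response to this query is judged against the same standards.
    criteria = generate(f"List evaluation criteria for this query:\n{query}")
    # Stage 2: evaluate each response conditioned on the query and the
    # shared criteria. The criteria are generated once and reused,
    # which amortizes inference cost across responses.
    return {
        resp: generate(
            f"Evaluate the response.\nQuery: {query}\n"
            f"Criteria: {criteria}\nResponse: {resp}"
        )
        for resp in responses
    }
```

The key design point is that `criteria` depends only on `query`, never on any individual response, which is what removes the per-response criteria bias described above.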
The training data are derived from the public Skywork‑Reward‑Preference‑80K‑v0.2 dataset. The authors filter the data in several steps: (a) they run a strong LLM (Qwen3‑4B‑Instruct‑2507) to perform multiple pointwise evaluations per instance and keep only those where the model’s accuracy on the chosen vs. rejected response is ≤ 0.6, i.e., cases the model is uncertain about; (b) they cluster queries by task type using Qwen3‑Max embeddings and perform stratified sampling to obtain a balanced set of about 5.7 K high‑quality instances.
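The uncertainty filter in step (a) can be sketched as follows; `judge` stands in for the pointwise evaluator, and the function and field names are assumptions:

```python
def filter_uncertain(instances, judge, n_evals=5, max_acc=0.6):
    """Keep preference pairs the judge model is unsure about.

    An evaluation counts as correct when the chosen response outscores
    the rejected one; instances where the judge's accuracy exceeds
    `max_acc` are considered too easy and dropped.
    """
    kept = []
    for inst in instances:
        correct = sum(
            judge(inst["query"], inst["chosen"])
            > judge(inst["query"], inst["rejected"])
            for _ in range(n_evals)
        )
        if correct / n_evals <= max_acc:  # retain only hard/ambiguous cases
            kept.append(inst)
    return kept
```

In practice `judge` would be a stochastic sampling call, so repeated evaluations can disagree; the stub below is deterministic only for testing purposes.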
Training proceeds in two phases. In a “cold‑start” supervised fine‑tuning (SFT) stage, a small set of curated instances is used to teach the model to generate three candidate criteria per query and, for each criterion, produce three evaluations of the chosen and rejected responses. The criterion with the smallest variance across its induced scores is selected, and the median evaluation score is kept as the final label. This process yields a dataset of roughly 2.2 K SFT examples, each containing a query, a response, a criteria set, and an evaluation. The SFT loss jointly maximizes the likelihood of the criteria and the evaluation text under their respective conditioning contexts.
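The variance-based criterion selection and median labeling described above might look like the following (the data layout is an assumption; `pvariance` and `median` come from the standard library):

```python
from statistics import median, pvariance

def select_criterion(scores_per_criterion: dict[str, list[float]]):
    """Pick the criterion whose induced evaluation scores vary least,
    and use the median of those scores as the final SFT label."""
    best = min(
        scores_per_criterion,
        key=lambda c: pvariance(scores_per_criterion[c]),
    )
    return best, median(scores_per_criterion[best])
```

Selecting by minimum variance favors criteria under which repeated evaluations agree, and taking the median makes the label robust to a single outlier evaluation.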
For RL, the authors retain the original pairwise format (query, chosen response, rejected response) but relax the filtering to keep any instance where at least one criterion leads to completely correct evaluations, resulting in the 5.7 K‑example D_RL set. They adopt the GRPO algorithm and extend it with a two‑stage rollout: (i) generate n_c criteria trajectories for the query; (ii) for each criterion, generate n_e evaluation trajectories for both the chosen and rejected responses. Because only pairwise preference labels are available, they devise fine‑grained reward estimators. The reward for a criterion is the win‑rate of the chosen response’s scores over the rejected response’s scores across all evaluation trajectories generated under that criterion. The reward for an evaluation trajectory is the win‑rate of its score against all scores of the opposite response, multiplied by a binary “format” reward indicating whether the pairwise preference is satisfied. These rewards are used in PPO‑style updates, with trajectories partitioned into three groups (criteria, chosen‑response evaluations, rejected‑response evaluations) for relative advantage computation.
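The reward estimators can be sketched as below. Tie handling and the exact form of the binary format reward are not fully specified in this summary, so both are assumptions:

```python
def criterion_reward(chosen_scores, rejected_scores):
    """Win-rate of the chosen response's scores over the rejected
    response's scores across all evaluation trajectories under one
    criterion (ties counted as half, an assumption)."""
    pairs = [(c, r) for c in chosen_scores for r in rejected_scores]
    wins = sum(c > r for c, r in pairs) + 0.5 * sum(c == r for c, r in pairs)
    return wins / len(pairs)

def trajectory_reward(score, opponent_scores, is_chosen, format_ok=True):
    """Reward for one evaluation trajectory: the fraction of the opposite
    response's scores it orders correctly against, gated by a binary
    format reward."""
    if is_chosen:
        wr = sum(score > o for o in opponent_scores) / len(opponent_scores)
    else:
        wr = sum(score < o for o in opponent_scores) / len(opponent_scores)
    return wr * (1.0 if format_ok else 0.0)

def group_advantages(rewards, eps=1e-8):
    """GRPO-style relative advantages, normalized within one trajectory
    group (criteria, chosen-side, or rejected-side evaluations)."""
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sd + eps) for r in rewards]
```

Normalizing within each of the three groups separately keeps criteria trajectories and evaluation trajectories on comparable advantage scales, even though their raw rewards are computed differently.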
Empirical evaluation covers three widely used reward‑model benchmarks—RWBench, RWBench2, and RM‑Bench—as well as Best‑of‑N scenarios where each query is paired with 2, 4, or 6 candidate responses. The authors first conduct a preliminary study with three evaluation settings: (1) Direct Evaluation (no explicit criteria), (2) Explicit Criteria (criteria conditioned on query + response), and (3) Unified Criteria (criteria conditioned only on query). Results show that Unified Criteria consistently outperforms the other two, and the gap widens as the number of responses per query grows.
On the full benchmarks, CE‑RM‑4B (4 B parameters) surpasses larger models (up to 32 B) on most metrics, especially in the more realistic Best‑of‑N setups. For example, on RWBench the model achieves 90.6 % versus 86.7 % for GPT‑4o, and on RM‑Bench it reaches 83.0 % versus 79.8 % for the same baseline. Ablation studies confirm that (a) removing the unified‑criteria step degrades performance, (b) collapsing the two‑stage rollout into a single stage reduces both benchmark scores and RL effectiveness, and (c) the variance‑based criterion selection is crucial for stability.
Finally, the authors integrate CE‑RM‑4B into a practical RL pipeline using GRPO to fine‑tune a language model for instruction following. Compared with a baseline reward model based on GPT‑4o, policies trained with CE‑RM‑4B achieve higher human‑rated quality scores and converge faster (≈ 20 % fewer training steps). This demonstrates that the improvements observed on static benchmarks translate into tangible gains in downstream RL tasks.
In summary, the paper makes three key contributions: (1) it identifies and empirically validates the limitations of pairwise‑only GRMs and the need for dedicated criteria optimization; (2) it introduces a novel two‑stage rollout framework with query‑only unified criteria, together with reward estimators that turn pairwise preferences into pointwise signals; and (3) it shows that a modest‑size model trained on a carefully curated 5.7 K‑example dataset can outperform much larger baselines on both benchmark and real‑world RL evaluations. The work opens avenues for further research on scaling the data curation pipeline, handling ambiguous queries, and extending the unified‑criteria concept to multi‑modal or domain‑specific generation tasks.