Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
RL training of LLMs on open-ended tasks is challenging due to the lack of direct verifiability. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model’s token-level certainty of a reference answer under its CoT reasoning prefix while selectively emphasizing reasoning-reflective tokens to capture how likely the generated reasoning is to yield the desired answer. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
💡 Research Summary
The paper tackles the longstanding challenge of applying reinforcement learning (RL) to open‑ended, long‑form tasks where direct verification of outputs is impossible. While RL with verifiable rewards (RL‑VR) has succeeded in domains such as coding or mathematics—where a correct answer can be automatically checked—open‑ended tasks like report writing, document revision, or scientific summarization lack such binary signals. To bridge this gap, the authors propose a constrained RL framework called Direct Reasoning Optimization (DRO). DRO combines two complementary mechanisms: (1) a token‑level dense reward named Reasoning Reflection Reward (R3) that measures the model’s self‑certainty on a reference answer conditioned on its generated chain‑of‑thought (CoT) reasoning, and (2) rubric‑gating, a hard feasibility constraint that accepts or rejects an entire rollout group based on whether the final answer satisfies a task‑specific rubric.
R3 works by computing, for each token of the reference answer, the probability that the model assigns to that token given the query and the sampled CoT prefix. Crucially, the authors identify a small subset of “reasoning‑reflective tokens” whose probabilities vary significantly with the quality of the preceding reasoning. R3 up‑weights these tokens (using a sigmoid of the certainty) and aggregates the weighted log‑probabilities, thereby preserving the signal that truly reflects reasoning quality while avoiding dilution that occurs when averaging over the whole sequence. This design yields a more discriminative reward that correlates strongly with human judgments of reasoning quality.
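The description above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the sigmoid centering at 0.5, and the sharpness hyperparameter `alpha` are all assumptions; the inputs are per-token log-probabilities of the reference answer as scored by the policy conditioned on its own CoT prefix.

```python
import math

def r3_reward(ref_token_logprobs, alpha=5.0):
    """Illustrative sketch of a Reasoning Reflection Reward (R3).

    ref_token_logprobs: log p(y_t | query, CoT, y_<t) for each token of the
    reference answer, scored by the policy itself. Tokens the reasoning makes
    the model more certain of are up-weighted via a sigmoid of the certainty,
    and the weighted log-probabilities are aggregated. `alpha` is an assumed
    sharpness hyperparameter, not taken from the paper.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for lp in ref_token_logprobs:
        p = math.exp(lp)                                 # token-level certainty
        w = 1.0 / (1.0 + math.exp(-alpha * (p - 0.5)))   # sigmoid gate on certainty
        weighted_sum += w * lp
        weight_total += w
    return weighted_sum / max(weight_total, 1e-8)        # weighted average log-prob
```

Because high-certainty tokens dominate the weighted average, a CoT that makes the reference answer predictable scores higher than one that leaves it uncertain, which is the discriminative behavior the weighting is designed to preserve.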
Rubric‑gating addresses the vulnerability of dense rewards to “reward hacking,” where a model might produce fluent but semantically empty text that still receives a high token‑level score. Instead of converting rubric scores into dense rewards—a process that is labor‑intensive, noisy, and computationally expensive—the authors treat rubrics as binary accept/reject checks applied at the rollout‑group level. For each query, G rollouts are sampled; if any rollout’s final answer fails any rubric criterion, the entire group is rejected and excluded from the policy update. This hard gating ensures that only rollouts satisfying essential lexical and semantic constraints influence learning, while R3 continues to guide the fine‑grained optimization of reasoning.
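The group-level gate described above is deliberately simple. A hedged sketch, with illustrative names (the rubric here is just a list of boolean predicates over the final answer; the paper's rubrics are query-specific and automatically generated):

```python
def rubric_gate(rollout_answers, rubric_checks):
    """Sketch of group-level rubric gating.

    rollout_answers: final answers from the G rollouts for one query.
    rubric_checks: predicates answer -> bool encoding the query's rubric.
    Per the summary, a single failing criterion on any rollout rejects the
    entire group, excluding it from the policy update.
    """
    return all(check(ans) for ans in rollout_answers for check in rubric_checks)
```

Treating the rubric as a binary gate rather than a dense score is the key design choice: a fluent-but-empty rollout cannot trade rubric violations against a high token-level reward, because the gate removes its whole group before any gradient is computed.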
The learning algorithm builds on Group Relative Policy Optimization (GRPO), a variant of PPO that computes advantages by z‑scoring rewards within each group. DRO feeds R3 as the scalar reward for each rollout, applies rubric‑gating to filter groups, and additionally employs a variance‑based dynamic filter that discards groups whose R3 variance is too low (i.e., groups providing little learning signal). All reward and gating information is derived from the policy model itself, eliminating the need for external judges or separate reward models.
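The group-wise advantage computation and the variance-based filter can be sketched as follows. This is an illustration of z-scoring within a rollout group, not the paper's code; the threshold `min_var` and the convention of returning `None` for a discarded group are assumptions.

```python
import statistics

def grpo_advantages(group_rewards, min_var=1e-4):
    """Sketch of group-relative advantages with a variance-based filter.

    group_rewards: R3 scores for the G rollouts of one query (the group has
    already passed rubric-gating). Rewards are z-scored within the group;
    a group whose reward variance falls below `min_var` (an assumed
    threshold) is discarded as carrying little learning signal.
    """
    mean = statistics.fmean(group_rewards)
    var = statistics.pvariance(group_rewards)
    if var < min_var:
        return None                      # dynamic filter: uninformative group
    std = var ** 0.5
    return [(r - mean) / std for r in group_rewards]
```

Z-scoring makes the advantages relative: a rollout is reinforced only insofar as its R3 exceeds its group's mean, so the filter simply drops groups where no rollout meaningfully stands out.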
Empirical evaluation spans four datasets covering mathematics, programming, scientific reporting, and paragraph revision (ParaRev), each requiring answers of 100–300 tokens. DRO consistently outperforms baselines—including standard PPO‑VR with answer‑matching rewards, self‑certainty‑only RL, and rubric‑scored RL—on downstream metrics such as accuracy, coverage, and consistency. It reaches target performance 2–3× faster, demonstrating superior sample efficiency. Moreover, rubric violations drop to near zero, confirming that the hard gating effectively prevents reward hacking. Cross‑domain transfer experiments show that a policy pretrained on one task accelerates learning on another when combined with DRO, highlighting the generality of the approach.
The paper also discusses limitations: R3 requires a reference answer, so purely generative tasks without a ground‑truth reference may need alternative formulations; rubric design remains a bottleneck, though the authors mitigate it by using query‑specific, automatically generated rubrics and treating them as binary constraints. Experiments are limited to 7B–13B models, leaving scalability to larger models for future work.
In summary, the authors present a principled, practical solution for RL on open‑ended tasks by (i) crafting a token‑level dense reward that isolates reasoning‑relevant signal, (ii) enforcing hard rubric constraints to safeguard against superficial optimization, and (iii) integrating both within a group‑wise policy optimization framework. DRO achieves faster, more reliable learning while respecting task feasibility, marking a significant step toward robust RL‑based reasoning improvement in LLMs.