KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a “learning cliff.” Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints derived from a teacher model to sample reward-positive on-policy trajectories via rejection sampling, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.


💡 Research Summary

The paper tackles two fundamental problems that arise when applying reinforcement learning (RL) to reasoning‑intensive large language and vision‑language models: (1) sparse trajectory‑level rewards that make credit assignment ambiguous and cause high‑variance updates, and (2) exploration collapse, where naïve on‑policy sampling fails to discover any reward‑positive trajectories, trapping the policy in a “learning cliff.” Existing on‑policy distillation methods mitigate instability by providing dense teacher supervision, but they apply this supervision uniformly to all sampled trajectories. The authors argue that uniform distillation is detrimental for reasoning tasks because low‑quality trajectories often stem from early logical errors; forcing the student to imitate the teacher under such flawed contexts injects noisy gradients that hinder learning.

To address both issues, the authors propose Knowledge‑Enhanced Preference Optimization (KEPO), a unified post‑training framework that combines (i) a quality‑gated distillation objective and (ii) a knowledge‑enhanced exploration strategy.

Quality‑gated distillation: For each on‑policy trajectory y_i generated by the student policy π_θ, the scalar reward r_i is computed. Only trajectories whose reward exceeds a predefined threshold τ are subjected to a KL‑based distillation loss D(π_T‖π_θ), where π_T is a strong teacher model. This gating ensures that dense token‑level guidance is aligned with the RL objective, turning distillation into a form of dense credit assignment that amplifies successful reasoning patterns rather than competing with them.
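The gating logic can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it assumes per-token teacher and student distributions are available, and that a trajectory's scalar reward can be compared against the threshold τ before any distillation loss is accumulated.

```python
# Sketch of quality-gated on-policy distillation (hypothetical; the paper's
# actual implementation and tensor shapes are not reproduced here).
import math

def kl_term(p_teacher, p_student):
    """Forward KL D(pi_T || pi_theta) for one token's distribution."""
    return sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

def gated_distill_loss(trajectories, tau=0.5):
    """Average token-level KL over trajectories whose reward exceeds tau.

    Each trajectory is (reward, [(teacher_dist, student_dist), ...]).
    Low-reward rollouts contribute nothing, so the teacher's dense
    guidance never reinforces a flawed reasoning context.
    """
    total, n_tokens = 0.0, 0
    for reward, token_dists in trajectories:
        if reward <= tau:  # quality gate: skip low-reward rollouts
            continue
        for p_t, p_s in token_dists:
            total += kl_term(p_t, p_s)
            n_tokens += 1
    return total / n_tokens if n_tokens else 0.0
```

Setting `tau=-1.0` recovers uniform distillation, which is exactly the baseline behavior the paper argues against.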

Knowledge‑enhanced exploration: When a batch of G sampled trajectories yields no positive reward (all r_i = 0), the system triggers a hint‑generation step. The teacher model produces a reasoning hint conditioned on the ground‑truth answer (e.g., key concepts, visual regions, or step‑wise pointers). The student then performs rejection sampling conditioned on this hint, repeatedly generating candidate trajectories until a reward‑positive one is found. These hint‑guided candidates are added to the original on‑policy pool, effectively injecting teacher knowledge into the exploration process without relying on static demonstrations.
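The trigger-and-resample loop described above can be sketched as follows. All names here (`sample`, `reward_fn`, `make_hint`) are placeholders standing in for the student policy's generator, the task reward, and the teacher's hint generator; the paper's actual interfaces are not reproduced.

```python
# Sketch of knowledge-enhanced exploration via hint-guided rejection
# sampling (hypothetical interfaces, not the authors' code).
def hint_guided_sample(sample, reward_fn, hint=None, max_tries=8):
    """Draw trajectories until one earns positive reward.

    `sample(hint)` generates one trajectory, optionally conditioned on a
    teacher hint; `reward_fn(traj)` returns the scalar task reward.
    Returns (trajectory, reward), or (None, 0.0) if every attempt fails.
    """
    for _ in range(max_tries):
        traj = sample(hint)
        r = reward_fn(traj)
        if r > 0:
            return traj, r
    return None, 0.0

def augment_batch(batch_rewards, sample, reward_fn, make_hint):
    """If every on-policy reward in the batch is zero, request a teacher
    hint and rejection-sample a reward-positive trajectory to add to the
    on-policy pool; otherwise leave the batch untouched."""
    if any(r > 0 for r in batch_rewards):
        return None  # normal RL update; no hint needed
    hint = make_hint()  # teacher hint conditioned on ground truth
    return hint_guided_sample(sample, reward_fn, hint=hint)
```

Because the hint is only produced when the whole batch fails, teacher knowledge enters exploration exactly where naïve sampling has collapsed, rather than overriding the policy everywhere.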

The overall objective integrates three terms: (1) an advantage‑weighted preference optimization term (using group‑based estimators such as GRPO or RLOO), (2) the quality‑gated distillation term, and (3) a KL regularization term that keeps the updated policy close to a reference policy. Importance weights correct for the shift from the old policy to the current one, and the KL term prevents excessive deviation.
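The paper's exact loss is not reproduced in this summary, but a plausible form consistent with the three terms described above (hypothetical notation: λ and β weight the gated distillation and regularization terms, Â_i is a group-based advantage estimate, 𝟙{·} is the quality gate) would be:

```latex
J(\theta) \;=\;
\mathbb{E}_{y_i \sim \pi_{\text{old}}}\!\left[
  \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\,\hat{A}_i
\right]
\;-\; \lambda\,\mathbb{E}\!\left[
  \mathbb{1}\{r_i > \tau\}\; D_{\mathrm{KL}}\!\left(\pi_T \,\|\, \pi_\theta\right)
\right]
\;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
```

The first term is the importance-weighted advantage objective, the second is the quality-gated distillation toward the teacher π_T, and the third anchors the policy to the reference π_ref.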

Experiments are conducted on a medical visual question answering (VQA) benchmark. The model is trained exclusively on MRI‑based data (single‑source) and evaluated on out‑of‑distribution modalities such as OCT and X‑ray. Baselines include pure RL (PPO/GRPO), uniform on‑policy distillation, and hybrid methods like Prefix‑RFT. KEPO demonstrates:

  • Improved training stability – variance of the reward signal drops by ~45% compared with pure RL.
  • Faster emergence of coherent chain‑of‑thought reasoning – logical consistency scores rise from 0.62 (baseline) to 0.78.
  • Higher exploration success – the probability of finding a reward‑positive trajectory jumps from 12% to 71% when the hint‑guided sampler is active.
  • Superior OOD performance – accuracy gains of 4–7 percentage points across unseen modalities, indicating that teacher‑derived hints generalize beyond the training domain.

Ablation studies confirm that both components are necessary: removing the gating leads to noisy gradients and degraded performance, while disabling hint‑guided exploration causes the policy to stall on the learning cliff.

The authors acknowledge limitations such as dependence on the teacher’s hint quality and the fixed reward threshold τ, suggesting future work on adaptive gating and broader domain validation.

In summary, KEPO offers a principled way to fuse dense teacher supervision with reinforcement learning for reasoning‑heavy tasks. By gating distillation to high‑quality trajectories and dynamically injecting teacher hints during exploration, it resolves credit‑assignment ambiguity and exploration collapse, yielding more stable training, better reasoning, and robust out‑of‑distribution generalization. This work sets a new direction for safe and effective post‑training of large multimodal models in high‑stakes applications.

