Counterfactual Self-Questioning for Stable Policy Optimization in Language Models

Reading time: 1 minute

📝 Original Info

  • Title: Counterfactual Self-Questioning for Stable Policy Optimization in Language Models
  • ArXiv ID: 2601.00885
  • Date: 2025-12-31
  • Authors: Mandar Parab

📝 Abstract

Recent advances in language model self-improvement, including self-reflection [14], step-wise verification [4, 17], debate [5], and self-reward optimization [6], demonstrate that models can iteratively refine their own reasoning. However, these approaches typically depend on external critics, hand-crafted reward models, or ensemble sampling, introducing additional supervision and instability during training. We propose Counterfactual Self-Questioning (CSQ), a framework in which a single language model generates counterfactual critiques of its own reasoning and uses these internally generated trajectories as a structured policy optimization signal. CSQ decomposes learning into three stages: (1) an initial policy rollout producing a base reasoning trajectory; (2) self-questioning, where the model formulates targeted counterfactual probes conditioned on its own reasoning; and (3) counterfactual critique, where alternative trajectories expose faulty assumptions, missing constraints, or invalid steps. The resulting counterfactual trajectories provide relative feedback that can be directly integrated with policy optimization methods such as Group Relative Policy Optimization (GRPO) [10], without requiring external reward models or multiple agents. Across GSM...
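
To make the three-stage decomposition concrete, here is a minimal sketch of how a CSQ group of trajectories could be built and fed into a GRPO-style relative advantage. This is an illustration under assumptions, not the paper's implementation: `model.generate` stands in for whatever sampling interface the authors use, the prompts and probe format are hypothetical, and the excerpt does not specify how rewards are assigned to counterfactual trajectories.

```python
# Hypothetical sketch of CSQ's three stages feeding group-relative advantages.
# `model.generate` is an assumed text-in/text-out interface; prompts and the
# probe-splitting rule are illustrative placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    text: str  # generated reasoning trace


def csq_group(model, problem: str, num_probes: int = 4) -> List[Trajectory]:
    """Build one group of trajectories for a problem: the base rollout plus
    counterfactual revisions, later ranked against each other (GRPO-style)."""
    # Stage 1: initial policy rollout -> base reasoning trajectory.
    base = Trajectory(model.generate(f"Solve step by step: {problem}"))

    # Stage 2: self-questioning -> counterfactual probes conditioned on the base trace.
    probe_text = model.generate(
        f"Problem: {problem}\nYour reasoning: {base.text}\n"
        f"Ask {num_probes} counterfactual questions that could expose faulty "
        f"assumptions, missing constraints, or invalid steps (one per line)."
    )
    probes = [p for p in probe_text.splitlines() if p.strip()][:num_probes]

    # Stage 3: counterfactual critique -> alternative trajectories that revise
    # the base trace under each probe.
    group = [base]
    for probe in probes:
        revised = model.generate(
            f"Problem: {problem}\nOriginal reasoning: {base.text}\n"
            f"Counterfactual probe: {probe}\nRevised reasoning:"
        )
        group.append(Trajectory(revised))
    return group


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style relative feedback: each trajectory's reward is normalized
    against the mean and standard deviation of its own group, so no external
    reward model or second agent is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

The key structural point the sketch mirrors is that all trajectories in a group come from the same single model, so the relative advantages are computed purely from internally generated rollouts.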

📄 Full Content

...(The full text is omitted here due to its length. Please see the complete article on the site.)
