rePIRL: Learn PRM with Inverse RL for LLM Reasoning
Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRMs) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that alternately updates the policy and the PRM. Our learning algorithm includes customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework unifies both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal when training on hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.
💡 Research Summary
The paper introduces rePIRL, a novel framework for learning Process Reward Models (PRMs) for large language model (LLM) reasoning by leveraging ideas from inverse reinforcement learning (IRL). Existing PRM approaches fall into two categories: offline methods that require token-level reward annotations or expert policy access (often via costly human labeling or Monte Carlo Tree Search), and online methods that infer rewards from model-internal signals such as entropy or confidence. Offline methods suffer from high annotation cost and computational expense, while online methods frequently encounter entropy collapse, degrading performance in later training stages.
rePIRL aims to eliminate these strong assumptions, requiring only a set of expert trajectories D without any reward labels, preference data, or direct access to the expert policy. The authors model LLM generation as a token‑level Markov Decision Process (MDP) where each state is the current prefix of generated tokens and each action is the next token. They introduce a hidden binary variable o_t indicating whether a given (state, action) pair originates from the expert policy. An energy‑based distribution p(o_t|s_t,a_t) ∝ exp(r_ϕ(s_t,a_t)) is defined, where r_ϕ is a parametrized reward function. Maximizing the likelihood of expert trajectories under this model yields the IRL objective
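The energy-based formulation above can be sketched numerically. Treating o_t as binary and assigning zero energy to o_t = 0 (an assumption; the paper's exact normalization may differ), the normalized distribution p(o_t = 1 | s_t, a_t) reduces to a logistic function of the learned reward:

```python
import math

def p_expert(r_phi: float) -> float:
    """p(o_t = 1 | s_t, a_t) ∝ exp(r_phi(s_t, a_t)).

    Assuming zero energy for o_t = 0, normalizing over o_t ∈ {0, 1}
    gives exp(r) / (1 + exp(r)), i.e. a sigmoid of the reward.
    """
    return math.exp(r_phi) / (1.0 + math.exp(r_phi))

# A higher learned reward means the (state, action) pair is more
# likely to have been produced by the expert policy.
print(p_expert(2.0), p_expert(-2.0))
```

With a zero reward the model is maximally uncertain (probability 0.5), and the probability increases monotonically in r_ϕ, which is what makes r_ϕ interpretable as a process reward.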
J(ϕ) = E_{τ∈D} [ Σ_t log p(o_t = 1 | s_t, a_t) ],
i.e., the token-level log-likelihood that each (state, action) pair along an expert trajectory is labeled as expert-generated.
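This objective can be evaluated directly once per-token rewards are available. A minimal sketch, assuming the sigmoid normalization of p(o_t = 1 | s_t, a_t) discussed above and using hypothetical toy trajectories of per-token rewards:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def irl_objective(trajectories: list[list[float]]) -> float:
    """J(phi): average over expert trajectories of the summed
    log p(o_t = 1 | s_t, a_t).

    Each trajectory is a list of per-token rewards r_phi(s_t, a_t)
    produced by the parameterized reward model.
    """
    total = 0.0
    for rewards in trajectories:
        total += sum(math.log(sigmoid(r)) for r in rewards)
    return total / len(trajectories)

# Toy expert data: two trajectories with hypothetical per-token rewards.
demo = [[1.2, 0.8, 2.0], [0.5, 1.5]]
print(irl_objective(demo))
```

Maximizing J(ϕ) over ϕ (e.g., by gradient ascent on the reward model's parameters) pushes the learned reward up on expert tokens, which is the PRM-learning direction of the dual update; the alternating policy update is not shown here.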