Likelihood-Based Reward Designs for General LLM Reasoning
Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and of being available at scale. Several recent works have advocated for similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards achieve success rates comparable to or better than reinforcement learning with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on raw probability, such as VeriFree, flatline in non-verifiable settings due to vanishing probabilities of producing the exact correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
💡 Research Summary
The paper investigates how to fine‑tune large language models (LLMs) for chain‑of‑thought (CoT) reasoning without relying on binary correctness rewards that require a verifier. The authors propose using the probability or log‑probability of a reference answer as the reward signal. Because many datasets contain a reference answer (even when the answer cannot be automatically verified), this approach can be applied universally.
Four main categories of reward functions are examined: (1) VeriFree – the raw probability of the reference answer given a sampled CoT; (2) AvgProb – average per‑token probability; (3) LogProb – the log‑probability of the whole reference answer; (4) AvgLogProb – log‑probability normalized by answer length. In addition, the authors implement JEPO (a group‑level log‑mean‑exp reward) and a standard binary‑reward RL baseline (RLOO).
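The four per-sample reward variants differ only in whether the per-token log-probabilities of the reference answer are summed, averaged, or exponentiated. A minimal sketch of these relationships (the helper name and the assumption that per-token log-probabilities are already extracted from the model are illustrative, not from the paper's code):

```python
import math

def reward_variants(token_logprobs):
    """Compute the four reward variants from the per-token log-probabilities
    log p(y_t | x, CoT, y_<t) of the reference answer under the model.

    token_logprobs: list of floats, one per reference-answer token.
    """
    n = len(token_logprobs)
    total_logprob = sum(token_logprobs)  # LogProb: log p(answer | x, CoT)
    return {
        # VeriFree: raw probability of the whole reference answer
        "VeriFree": math.exp(total_logprob),
        # AvgProb: mean per-token probability
        "AvgProb": sum(math.exp(lp) for lp in token_logprobs) / n,
        # LogProb: log-probability of the whole reference answer
        "LogProb": total_logprob,
        # AvgLogProb: log-probability normalized by answer length
        "AvgLogProb": total_logprob / n,
    }
```

The sketch makes the failure mode of probability-based rewards visible: for a long answer, `VeriFree` is a product of many per-token probabilities and shrinks toward zero, whereas `LogProb` stays a finite sum that still provides gradient signal.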
Experiments span two verifiable benchmarks (MATH and DeepScaleR) and two non‑verifiable settings (Alpaca and the proof‑portion of NuminaMath). Two modern model families, Qwen‑2.5 and Llama‑3.2, are fine‑tuned under each reward scheme. Evaluation metrics include success rate (probability of producing the correct answer), log‑probability‑based perplexity (a language‑modeling quality measure), and CoT length.
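The perplexity metric used for evaluation is the standard language-modeling quantity: the exponential of the negative mean per-token log-probability of the reference answer. A minimal sketch (the helper name is illustrative; the paper's exact evaluation code is not shown here):

```python
import math

def answer_perplexity(token_logprobs):
    """Perplexity of the reference answer given the prompt (and CoT):
    exp of the negative mean per-token log-probability. Lower is better;
    a perfectly confident model would approach 1.0."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```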
Key findings:
- Log‑probability rewards consistently excel across all domains. In verifiable tasks they match or slightly surpass binary‑reward RL in success rate while achieving perplexity close to supervised fine‑tuning (SFT). In non‑verifiable, long‑form tasks they attain the same success rate as SFT and avoid the perplexity degradation seen in binary‑reward RL.
- Pure probability rewards (VeriFree, AvgProb) fail on long answers because the probability of an exact match becomes vanishingly small, providing almost no learning signal.
- Standard RL (binary reward) yields reasonable success but poor perplexity, indicating that the model over‑fits to producing the correct answer at the expense of overall language‑modeling quality.
- CoT length dynamics differ: log‑probability rewards initially shorten the CoT. In verifiable settings the length recovers during training, while in non‑verifiable settings it remains short, effectively making the model behave like SFT. Binary‑reward RL and VeriFree do not exhibit this shortening.
- Training efficiency: log‑probability rewards do not require sampling an answer token sequence, reducing computation per update and allowing unbiased leave‑one‑out advantage estimation.
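The leave-one-out advantage estimation mentioned above (as in RLOO) baselines each sampled CoT's reward against the mean reward of the other samples in its group. A minimal sketch, assuming a group of scalar rewards has already been computed:

```python
def leave_one_out_advantages(rewards):
    """RLOO-style advantages: each sample's reward minus the mean reward
    of the *other* samples in the group. The baseline excludes the sample
    itself, keeping the policy-gradient estimate unbiased.

    rewards: list of scalar rewards for K >= 2 samples of the same prompt.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For example, with one rewarded sample in a group of four, `leave_one_out_advantages([1.0, 0.0, 0.0, 0.0])` gives the winning sample an advantage of 1.0 and each loser -1/3, so the advantages are centered without reusing any sample as its own baseline.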
Limitations: the method assumes a reference answer is available, which does not hold for fully open‑ended generation. Moreover, aggressive CoT shortening could produce reasoning too terse for humans to follow. The authors suggest auxiliary length penalties or KL‑regularization to mitigate this.
Overall, the study demonstrates that using the log‑probability of the reference answer as a reward aligns the fine‑tuning objective with the pre‑training loss, provides a dense learning signal, works uniformly across verifiable and non‑verifiable tasks, and offers computational advantages. This makes log‑probability rewards a compelling default for CoT fine‑tuning of LLMs.