On the Optimal Reasoning Length for RL-Trained Language Models
Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models: Qwen3-1.7B-Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: (1) long outputs increase dispersion, and (2) short outputs lead to under-thinking.
💡 Research Summary
This paper investigates how the length of chain‑of‑thought (CoT) outputs generated by reinforcement‑learning (RL) fine‑tuned large language models (LLMs) affects both reasoning performance and computational efficiency. While RL has been shown to dramatically improve reasoning on complex tasks, it also tends to produce longer generations, raising training and inference costs. Existing length‑control methods (e.g., RLOO‑LP, ALP, DRPO) aim to curb verbosity, but it remains unclear what the optimal output length is, especially when the underlying model already possesses varying degrees of reasoning ability.
To address this gap, the authors conduct systematic experiments on two distinct models: (1) Qwen3‑1.7B‑Base, a relatively modest model that must acquire reasoning patterns during RL fine‑tuning, and (2) DeepSeek‑R1‑Distill‑Qwen‑1.5B, a distilled model that already exhibits strong reasoning capabilities out‑of‑the‑box. Both models are trained using a modified DAPO pipeline that disables any built‑in length penalty, ensuring that the subsequent length‑control interventions are the sole source of length modulation. Training runs on eight GPUs for 72 hours (≈576 GPU‑hours), with checkpoints evaluated after 640 steps (Qwen) and 480 steps (DeepSeek). Maximum generation caps are set to 8 K tokens for Qwen and 16 K tokens for DeepSeek, chosen so that fewer than 5 % of rollouts exceed the limit at the start of training.
Evaluation is performed on four mathematical reasoning benchmarks—AIME 2024, AIME 2025, AMC, and Math‑500—using the standard protocol of sampling multiple responses per problem and reporting mean accuracy. Length‑control methods examined include two baselines, Sample Avg (the original GRPO objective) and Token Avg (the DAPO objective), and three explicit penalties: RLOO‑LP (Arora & Zanette, 2025), ALP (Xiang et al., 2025), and DRPO (Li et al., 2025a). An attempt to reproduce GFPO failed, and that method is omitted from the main analysis.
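To make the shape of these interventions concrete, here is a minimal sketch of a length-penalized reward. RLOO‑LP, ALP, and DRPO each use their own formulation; the linear penalty, the `alpha` coefficient, and the normalization below are hypothetical stand-ins for illustration only.

```python
def length_penalized_reward(correct: bool, length: int, max_len: int, alpha: float = 0.1) -> float:
    """Toy length-penalized reward in the spirit of the surveyed methods.

    A binary correctness reward is discounted by a penalty proportional
    to the rollout's length relative to the generation cap. `alpha`
    controls how strongly brevity is favored over raw correctness.
    """
    base = 1.0 if correct else 0.0
    penalty = alpha * (length / max_len)  # normalized length cost in [0, alpha]
    return base - penalty

# Example: two correct rollouts under an 8K-token cap (the Qwen setting above).
r_short = length_penalized_reward(True, 4000, 8000)  # shorter rollout is rewarded more
r_long = length_penalized_reward(True, 7000, 8000)
```

The key tension the paper studies falls out of even this toy form: with large `alpha`, a wrong-but-short rollout can out-score a correct-but-long one, which is how aggressive penalties can trade correctness for brevity.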
The central findings are twofold. First, the relationship between output length and accuracy is qualitatively different across the two models. Qwen3‑1.7B‑Base exhibits a monotonic increase: longer CoT generations consistently improve accuracy, and any strong length penalty degrades performance. This suggests that, for a model that must learn to reason, ample “thinking steps” are essential during RL fine‑tuning. Second, DeepSeek‑R1‑Distill‑Qwen‑1.5B shows a non‑monotonic curve: performance peaks at an intermediate length, while both overly short and overly long generations hurt accuracy. The authors decompose this phenomenon using three metrics derived from sampled outputs: mode accuracy (how often the most frequent answer is correct), answer entropy (distributional dispersion), and mode share (fraction of samples matching the mode). In the long‑output regime, mode accuracy remains stable or slightly improves, but entropy rises and mode share falls, indicating that the distribution becomes more dispersed even though its center stays near the correct answer. In the short‑output regime, both mode accuracy and mode share drop while entropy remains high, reflecting “under‑thinking”: the model fails to generate enough reasoning steps to converge on the correct answer.
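The three diagnostic metrics can be sketched as follows. This is an illustrative reconstruction from the descriptions above (most-frequent-answer correctness, Shannon entropy over the answer distribution, and the mode's sample fraction); the paper's exact definitions may differ in detail.

```python
import math
from collections import Counter

def answer_stats(answers: list[str], correct: str) -> tuple[float, float, float]:
    """Compute mode accuracy, mode share, and answer entropy for one problem.

    `answers` holds the final answers extracted from multiple sampled
    responses; `correct` is the ground-truth answer.
    """
    counts = Counter(answers)
    n = len(answers)
    mode, mode_count = counts.most_common(1)[0]
    mode_accuracy = 1.0 if mode == correct else 0.0  # is the most frequent answer right?
    mode_share = mode_count / n                      # fraction of samples matching the mode
    # Shannon entropy (bits) of the empirical answer distribution: dispersion
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return mode_accuracy, mode_share, entropy

# The long-output regime described above: the mode is still correct,
# but the distribution is dispersed (high entropy, low mode share).
mode_acc, share, ent = answer_stats(
    ["42", "42", "42", "41", "43", "42", "40", "42"], correct="42"
)
```

Under this decomposition, the long-output failure shows up as rising `entropy` with falling `mode_share` at stable `mode_accuracy`, while the short-output failure drags down `mode_accuracy` and `mode_share` together.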
These empirical observations align with the theoretical framework of Ghosal et al. (2025), which predicts a non‑monotonic reward‑variance relationship for a Gaussian policy. The present work extends that analysis from fixed test‑time interventions (e.g., inserting “Wait” tokens) to policies that have been altered by RL training with different length penalties. The authors show that the same dispersion‑driven degradation observed in test‑time scaling also manifests in standard generation from RL‑trained policies. Moreover, they demonstrate that stronger length penalties shift the optimization focus toward brevity at the expense of correctness, moving the distribution’s center away from the correct answer.
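A toy Monte Carlo run makes the dispersion argument tangible. Below, sampled answers are modeled as `round(N(correct, sigma))`, so the distribution's center (and mode) stays on the correct answer while `sigma` controls dispersion. This is a simplified stand-in for the Gaussian-policy analysis, not the paper's actual model.

```python
import random

def sample_accuracy(sigma: float, n: int = 10_000, correct: int = 0, seed: int = 0) -> float:
    """Fraction of samples from round(N(correct, sigma)) that hit the answer.

    The mode stays on `correct` for any sigma, but higher dispersion
    lowers the chance that any single sample lands on it.
    """
    rng = random.Random(seed)
    hits = sum(round(rng.gauss(correct, sigma)) == correct for _ in range(n))
    return hits / n

low_disp = sample_accuracy(0.3)   # tight distribution: most samples hit the answer
high_disp = sample_accuracy(3.0)  # dispersed: mode unchanged, per-sample accuracy drops
```

Even with the center fixed on the correct answer, per-sample accuracy falls sharply as dispersion grows, mirroring the long-output degradation described above.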
The paper concludes that length control must be carefully tuned: overly aggressive penalties can impede the acquisition of reasoning skills, especially for models lacking strong priors, while moderate penalties can improve efficiency for models that already reason well. Two failure modes are identified—(1) long outputs increase dispersion, degrading performance, and (2) short outputs cause under‑thinking, also degrading performance. The authors acknowledge limitations: only two models and a single domain (mathematical reasoning) were examined, and hyper‑parameter tuning remains labor‑intensive. They suggest future work on automatic, meta‑learning approaches to discover the optimal length regime without exhaustive search, as well as broader evaluations across domains and larger model families.