Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. In mathematical problem solving, however, the reliance on ground truth answers poses significant challenges due to their high collection cost and limited availability. This work explores the use of simple surrogate signals, format and length, to guide RL training. We find that early training is dominated by format learning, where structural feedback alone accounts for most performance gains. Incorporating length-based rewards further refines outputs by discouraging overly long or short responses, enabling a GRPO approach with format-length signals to approximate, and in some cases surpass, ground-truth-based optimization. For example, our method achieves 40.0% accuracy on AIME2024 with a 7B base model, and generalizes across different model sizes and series. Beyond practical efficiency, these findings provide an inspirational perspective on RL: rather than imparting new knowledge, RL primarily activates reasoning capabilities already embedded in pre-trained models. This insight suggests that lightweight, label-efficient strategies can complement pre-training to unlock LLMs’ latent potential in reasoning-intensive tasks.
💡 Research Summary
Paper Overview
The authors address a fundamental bottleneck in applying reinforcement learning (RL) to mathematical problem solving with large language models (LLMs): the scarcity and high cost of ground‑truth answer labels. They propose a label‑free RL framework that relies solely on two surrogate signals—format correctness and response length—to guide policy optimization. By integrating these signals into the Group Relative Policy Optimization (GRPO) algorithm, they demonstrate that models can achieve performance comparable to, and sometimes surpassing, traditional correctness‑based RL without ever seeing a correct answer during training.
Surrogate Signal Design
- Format Reward (R_f) – A binary indicator (1 or 0) that checks whether the model’s output follows the expected mathematical presentation (e.g., proper use of symbols, step numbering, equation layout). The check is performed with SymPy and a set of regular‑expression rules.
- Length Reward (R_l) – A continuous, differentiable function of the normalized response length x = L / L_max. The function rises from 0 to a peak at a tunable turning point p (default 0.5) and then falls, penalizing overly short or overly long answers. An alternative piecewise‑linear (“polyline”) version yields similar results, showing that the exact analytic form is not critical.
- Combined Reward (R_fl) – The sum R_f + R_l is granted only when the format check passes; otherwise the total reward is forced to zero. This coupling ensures that length incentives apply only to structurally valid solutions.
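The coupling above can be illustrated with a minimal sketch. The paper's actual format check uses SymPy plus regex rules that are not fully specified here, so a simple `\boxed{...}` test stands in for it, and a quadratic bump is assumed for the length reward (the paper notes the exact analytic form is not critical):

```python
import re

def format_reward(response: str) -> float:
    """Binary format check: 1.0 if the answer is wrapped in \\boxed{...}.
    A stand-in for the paper's SymPy + regular-expression rules."""
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", response) else 0.0

def length_reward(length: int, max_length: int, p: float = 0.5) -> float:
    """Continuous length reward over x = L / L_max: rises from 0 to a
    peak at the turning point p (default 0.5), then falls, penalizing
    overly short or overly long answers. The quadratic shape here is
    an assumption for illustration."""
    x = min(length / max_length, 1.0)
    if x <= p:
        return (x / p) ** 2               # rise toward the peak at x = p
    return ((1.0 - x) / (1.0 - p)) ** 2   # fall back to 0 at x = 1

def combined_reward(response: str, length: int, max_length: int) -> float:
    """R_fl: the length reward counts only when the format check passes;
    otherwise the total reward is forced to zero."""
    r_f = format_reward(response)
    if r_f == 0.0:
        return 0.0
    return r_f + length_reward(length, max_length)
```

With these definitions, a structurally invalid response scores zero regardless of its length, while a well-formatted response of moderate length earns the maximum combined reward.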
Training Setup
- Datasets: DeepScaleR (≈17 k problems) and Math‑train (≈7.5 k problems), both curated from AMC, AIME, and other competition sources.
- Models: A spectrum of LLMs ranging from 1.5 B to 72 B parameters, including Qwen2.5‑Math series, Llama‑3.1‑OctoThinker‑8 B, DeepSeek‑Math‑7 B‑Base, and Mathstral‑7 B.
- Algorithm: GRPO implemented via the veRL framework, with standard hyper‑parameters (learning rate 1e‑6, batch size 128, KL coefficient 0.001, temperature 0.6, 8 samples per prompt).
- Baselines: (a) Correctness‑based reward (exact answer match), (b) Format‑only reward, and (c) the proposed Format‑Length reward (both continuous and polyline variants).
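GRPO's defining feature, relevant to the setup above, is that it needs no learned value function: the advantage of each sampled response is computed relative to the other samples drawn for the same prompt (8 per prompt in this setup). A minimal sketch of that group-relative standardization:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: each of the G samples
    drawn for one prompt is scored against the group's mean and
    standard deviation, so no critic network is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Identical rewards (e.g. all responses fail the format
        # check) carry no learning signal for this prompt.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

This also makes concrete why a binary format reward alone saturates: once every sample in a group passes the format check, all rewards tie and the advantages collapse to zero, which is exactly the plateau the length reward is introduced to break.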
Key Empirical Findings
- Early‑stage dominance of format – Within the first ~15 optimization steps, a model trained with only the format reward matches the learning curve of the correctness‑based baseline, accounting for roughly 85 % of total performance gains. This confirms that structural compliance is a powerful early learning signal.
- Saturation of format‑only training – After the initial phase, the format‑only policy plateaus; without additional guidance the model cannot improve answer correctness beyond structural compliance.
- Impact of length reward – Adding the length component prevents the plateau. The combined Format‑Length reward yields stable, high‑accuracy trajectories throughout training. On AIME2024, a 7 B Qwen‑Math model reaches 40 % accuracy, outperforming the correctness baseline (≈26 %). Similar gains are observed on Math500 and AMC2023.
- Scalability across model sizes – Small models (1.5 B) improve from 27 % to 44 % average accuracy (≈60 % relative gain). Large models (72 B) retain >90 % of the correctness‑based performance, sometimes even exceeding it. Even models with virtually no prior math fine‑tuning (DeepSeek‑Math‑7 B‑Base) achieve >13 % absolute accuracy, a >4,000 % relative lift.
- Robustness to reward shape – The polyline length reward produces results nearly identical to the smooth quadratic form, indicating that the crucial factor is the inductive bias toward “moderate length” rather than the precise functional expression.
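The robustness finding can be made concrete with a sketch of the piecewise-linear variant (the exact breakpoints are assumptions here, not taken from the paper): a straight rise from 0 to 1 at x = p, then a straight fall back to 0 at x = 1. Any shape with this "peak at moderate length" bias reportedly behaves similarly.

```python
def polyline_length_reward(length: int, max_length: int, p: float = 0.5) -> float:
    """Piecewise-linear ("polyline") length reward: linear rise to a
    peak of 1 at x = p, linear fall to 0 at x = 1. The paper reports
    results nearly identical to the smooth form, suggesting only the
    inductive bias toward moderate length matters."""
    x = min(length / max_length, 1.0)
    return x / p if x <= p else (1.0 - x) / (1.0 - p)
```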
Interpretation and Theoretical Insight
The authors argue that RL in this context does not inject new mathematical knowledge; instead, it activates reasoning pathways already embedded during massive unsupervised pre‑training. The surrogate signals act as lightweight “hooks” that focus the model’s latent capabilities on producing well‑structured, appropriately concise solutions. This reframes RL as a knowledge‑unlocking mechanism rather than a knowledge‑acquisition process.
Broader Implications
- Label‑efficient fine‑tuning: The work suggests a general recipe for domains where ground‑truth is scarce: identify structural or quantitative proxies that correlate with correctness and use them as RL rewards.
- Safety and contamination: Since no answer labels are required, the approach sidesteps data contamination concerns that have plagued recent LLM evaluations.
- Future directions: Extending surrogate design to richer signals (e.g., intermediate step verification, logical consistency checks) could enable label‑free RL for more complex tasks such as formal theorem proving or multi‑step scientific reasoning.
Conclusion
By demonstrating that simple format and length signals can replace explicit answer supervision, the paper provides compelling evidence that reinforcement learning can efficiently unlock the latent reasoning power of large language models. This label‑free paradigm opens a practical pathway for scaling mathematical and other reasoning‑intensive applications where annotated data are prohibitively expensive or unavailable.