Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.


💡 Research Summary

The paper addresses a fundamental limitation of large language models (LLMs) that rely on chain‑of‑thought (CoT) prompting for complex reasoning. While CoT enables step‑by‑step reasoning, its strictly sequential nature becomes a bottleneck when tasks approach the limits of model capability, leading to high token consumption and poor test‑time scalability. The authors propose a divide‑and‑conquer (DAC) reasoning paradigm, where a problem is first decomposed into multiple sub‑problems (Division step) and then those sub‑problems are solved sequentially before the final answer is produced (Conquering step). Existing DAC‑style methods have been applied only at inference time and suffer from a misalignment with the general‑purpose post‑training that most LLMs undergo, which is primarily CoT‑oriented. This misalignment prevents models from fully exploiting DAC’s potential.
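The Division/Conquering loop described above can be sketched at inference time as follows. This is a minimal illustration, not the paper's implementation; the `generate` callable and the prompt wording are assumptions standing in for an actual LLM call.

```python
# Minimal sketch of DAC-style inference. `generate(prompt) -> str` is a
# hypothetical wrapper around an LLM; prompts are illustrative only.

def dac_solve(problem: str, generate) -> str:
    # Division: ask the model to decompose the problem into sub-problems.
    division_prompt = (
        "Decompose the following problem into simpler sub-problems, "
        "one per line:\n" + problem
    )
    subproblems = [s for s in generate(division_prompt).splitlines() if s.strip()]

    # Conquering: solve the sub-problems sequentially, conditioning each
    # step on the sub-problem solutions produced so far.
    context = problem
    for sp in subproblems:
        context += "\nSub-problem: " + sp
        context += "\nSolution: " + generate(context)

    # Finally, answer the original problem given all sub-problem solutions.
    return generate(context + "\nNow solve the original problem:")
```

In DAC-RL both stages are produced by the same policy and trained jointly, rather than being fixed prompting heuristics as in earlier inference-only DAC methods.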

To bridge this gap, the authors introduce an end‑to‑end reinforcement learning (RL) framework called DAC‑RL. The policy model π_θ is trained to perform both Division and Conquering jointly. During training, for each input problem x, the model generates G_d groups of sub‑problem sets P_g. A composite reward is assigned to the Division output based on three criteria: (1) format validity (parsable via regular expressions), (2) quantity validity (producing at least a predefined number N_s of sub‑problems), and (3) helpfulness, measured by the accuracy of the original problem's solution when the generated sub‑problems are used in the Conquering phase. The Conquering phase concatenates the original problem with the generated sub‑problems and asks the model to solve them sequentially; a binary reward is given if the final answer matches the ground truth.
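The three-part Division reward can be sketched as a gated composition: the format and quantity checks act as hard filters, and the helpfulness term supplies the graded signal. The `<sub>…</sub>` tag format and the `conquer_accuracy` helper below are assumptions for illustration, not the paper's actual parsing scheme.

```python
# Sketch of the composite Division reward. The <sub> tag format and the
# conquer_accuracy callback are hypothetical stand-ins.
import re

def division_reward(division_output: str, n_min: int, conquer_accuracy) -> float:
    # (1) Format validity: sub-problems must be extractable via a regex.
    subproblems = re.findall(r"<sub>(.*?)</sub>", division_output, re.S)
    if not subproblems:
        return 0.0
    # (2) Quantity validity: at least N_s sub-problems are required.
    if len(subproblems) < n_min:
        return 0.0
    # (3) Helpfulness: accuracy of the final answer when these sub-problems
    # are used in the Conquering phase (estimated from Conquering rollouts).
    return conquer_accuracy(subproblems)
```

Gating on format and quantity before scoring helpfulness is what discourages degenerate decompositions, such as emitting the original problem verbatim as a single "sub-problem".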

The authors prove (Lemma 2.1) that the final‑answer reward positively correlates with the correctness of the sub‑problems, ensuring that the RL signal encourages the model to produce useful decompositions. Training uses the GRPO (Group Relative Policy Optimization) algorithm, storing Division and Conquering tuples in an experience buffer and updating the policy after each batch.
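The core mechanic of GRPO is that each rollout's reward is normalized against the other rollouts sampled for the same prompt, which removes the need for a learned value function. A minimal sketch of that group-relative advantage, not tied to the paper's code:

```python
# Group-relative advantages as used by GRPO: standardize each rollout's
# reward within its group of samples for the same prompt.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all rollouts scored equally: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With the binary Conquering reward, a mixed group (some correct, some incorrect rollouts) yields positive advantages for the correct samples and negative for the rest, while a uniformly correct or uniformly failed group contributes no gradient.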

Experiments are conducted on two instruction‑tuned models, Qwen2.5‑7B‑Instruct and Qwen3‑4B‑Instruct‑2507, using the DAPO‑Math‑17k dataset for RL fine‑tuning. Evaluation is performed on four competition‑level mathematics benchmarks: AIME 2024, AIME 2025 (MAA), Beyond‑AIME (ByteDance‑Seed 2025), and HMMT‑25, all of which require integer answers for precise scoring. The metrics are Pass@1 and Pass@32. After DAC‑RL training, both models achieve a substantial absolute improvement over standard CoT: +8.6 percentage points on Pass@1 and +6.3 percentage points on Pass@32. Moreover, DAC‑RL models consume 12–18% fewer tokens to reach comparable accuracy, demonstrating enhanced test‑time scalability. Interestingly, the DAC‑RL fine‑tuned models also show modest gains when evaluated under CoT prompting, suggesting that learning to decompose problems improves overall reasoning representations.

The analysis highlights several strengths: the composite reward effectively discourages degenerate decompositions; multi‑candidate generation in the Conquering step maintains exploration diversity and prevents entropy collapse; and the framework scales across model sizes. Limitations include reliance on the final‑answer reward, which may mask errors in individual sub‑problems, and the high computational cost of RL fine‑tuning for very large models. The authors suggest future work on more granular sub‑problem verification, extending DAC‑RL to multimodal or code‑generation tasks, and integrating human feedback to improve sample efficiency.

In summary, the paper presents a novel RL‑based training paradigm that aligns LLMs with a divide‑and‑conquer reasoning style, thereby raising the performance ceiling and test‑time scalability of large language models on the most challenging reasoning benchmarks.

