Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning
Long chain-of-thought reasoning (Long CoT) is now fundamental to state-of-the-art LLMs, especially in mathematical reasoning. However, LLM generation is highly sequential, and long CoTs lead to high latency. We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce this latency. With DC-CoT, the model acts as a director that identifies distinct subtasks in its reasoning process that can be performed in parallel, and then spawns workers to execute those subtasks. Our goal is to achieve high accuracy with a low longest path length, a theoretical measure of the latency of the response. We start with a long CoT base model (DeepScaleR-1.5B-Preview) and first use SFT with a small curated demonstration set to initialize its ability to spawn workers in a fixed format. Because SFT significantly degrades accuracy, we design a multi-stage RL algorithm, with various data filtering strategies, to recover accuracy while decreasing the longest path length. Across several benchmarks including AIME 2024 and HMMT 2025, DC-CoT achieves accuracy similar to DeepScaleR-1.5B-Preview while decreasing longest path length by 35-40%. Our code, SFT dataset and models are publicly available at https://github.com/amahankali10/DC_CoT_RL_for_Low_Latency_CoT_with_Parallel_Reasoning.
💡 Research Summary
The paper addresses a critical bottleneck in modern large language models (LLMs) that excel at mathematical reasoning through long chain‑of‑thought (CoT) sequences: the sequential nature of token generation leads to high latency, especially when responses contain tens of thousands of tokens. To mitigate this, the authors introduce Divide‑and‑Conquer CoT (DC‑CoT), a training framework that equips a single model with the ability to decompose a problem into independent subtasks, spawn “worker” processes to solve those subtasks in parallel, and then aggregate the results to produce a final answer.
Architecture and Inference Procedure
DC‑CoT adopts a “director‑worker” dual‑role architecture. The model first operates as a director, performing an initial single‑thread reasoning pass until it emits a special token <spawn_workers>. At this point the model switches to a multi‑thread mode: three workers are instantiated, each receiving the original prompt plus the director’s initial reasoning and a unique <worker_i> tag. Each worker runs standard inference (via the vLLM library) independently until it outputs a closing tag </worker_i>. After all workers finish, the director reads their outputs, may perform additional reasoning, and either finalizes the answer or issues another <spawn_workers> token to start a new parallel round. The longest path length—defined as the token count along the longest dependency chain from start to finish—is used as a theoretical proxy for wall‑clock latency.
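The longest-path-length metric described above can be sketched as a simple critical-path computation over a director/worker trace. The function name, the trace representation, and the example token counts below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the longest-path-length metric: token counts along the
# longest dependency chain of a director/worker trace. Each parallel round
# contributes the director's sequential prefix plus only the SLOWEST worker,
# since workers in a round run concurrently.

def longest_path_length(rounds, final_tokens):
    """rounds: list of (director_tokens, [worker_token_counts]) pairs,
    one per parallel round; final_tokens: the director's closing segment."""
    total = 0
    for director_tokens, worker_token_counts in rounds:
        total += director_tokens                # sequential director segment
        if worker_token_counts:
            total += max(worker_token_counts)   # parallel workers: take the max
    return total + final_tokens                 # final aggregation by the director

# Example: a 200-token director prefix, three workers of 1000/800/1200 tokens,
# then 300 tokens of aggregation -> 200 + 1200 + 300 = 1700.
print(longest_path_length([(200, [1000, 800, 1200])], 300))  # 1700
```

A purely sequential CoT is the special case with no worker rounds, where the metric degenerates to the total token count.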
Training Pipeline
The training proceeds in two major phases.
- Supervised Fine‑Tuning (SFT): A small curated dataset of demonstration problems is used to teach the model the parallel format (the special tokens and the overall control flow). This step gives the model the ability to emit the required tags but significantly degrades raw accuracy compared to the base model.
- Multi‑Stage Reinforcement Learning (RL): To recover accuracy while reducing longest path length, the authors design a staged RL algorithm. The reward combines three components: (a) a binary correctness reward (pass@1), (b) a penalty proportional to the longest path length, and (c) an entropy‑stability term that prevents the policy from becoming overly stochastic. The authors initially experiment with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) but observe that it inflates token‑level entropy and stalls performance; switching to CISPO (Clipped IS‑weight Policy Optimization) stabilizes entropy and yields monotonic accuracy gains.
Dynamic Data Filtering
A novel aspect of the work is the use of dynamic data filtering to balance the competing objectives of accuracy and latency. Early in training, “all‑wrong” examples are removed while “all‑correct” examples are retained, ensuring the model receives strong signals for the length‑penalty term. As training progresses, the authors notice the model over‑optimizing for shorter paths at the expense of correctness. To counteract this, they gradually filter out the easy “all‑correct” cases, thereby increasing the proportion of harder problems where accuracy matters more. This curriculum‑like adjustment proves essential for achieving a Pareto‑optimal trade‑off.
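The filtering schedule above can be sketched as a batch-level filter over groups of rollouts. The group representation, the hard step threshold, and the function name are illustrative assumptions; the paper's schedule (e.g., a gradual rather than abrupt switch) may differ:

```python
# Hypothetical sketch of dynamic data filtering over rollout groups.
# Each group holds per-rollout correctness indicators (0/1) for one problem.

def filter_batch(groups, step, switch_step=500):
    """Early in training (step < switch_step): drop all-wrong groups but keep
    all-correct ones, so the length penalty still gets a clean signal.
    Later: also drop all-correct (easy) groups to refocus on accuracy."""
    kept = []
    for g in groups:
        n_correct = sum(g["rewards"])
        if n_correct == 0:
            continue  # all-wrong: no useful gradient signal, always dropped
        if step >= switch_step and n_correct == len(g["rewards"]):
            continue  # all-correct: dropped only in the later stage
        kept.append(g)
    return kept

groups = [
    {"rewards": [0, 0, 0]},  # all wrong
    {"rewards": [1, 1, 1]},  # all correct
    {"rewards": [1, 0, 1]},  # mixed
]
print(len(filter_batch(groups, step=0)))     # 2 (all-correct kept early)
print(len(filter_batch(groups, step=1000)))  # 1 (only the mixed group survives)
```

The mixed-difficulty groups that survive both stages are exactly those where the correctness and length objectives are in genuine tension, which is where the curriculum concentrates the training signal.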
Experimental Results
The authors evaluate DC‑CoT on several challenging competition‑mathematics benchmarks, notably AIME 2024 and HMMT 2025. Baselines include the original DeepScaleR‑1.5B‑Preview model run with fixed token limits (32 K and 12 K tokens, denoted DSR‑32K and DSR‑12K) and versions of the same base model trained with RL and a high length penalty (DSR‑HLP‑24K, DSR‑HLP‑12K). Key findings:
- Accuracy: DC‑CoT matches the base model’s pass@1 (≈ 92 %) across all benchmarks.
- Latency (Longest Path Length): DC‑CoT reduces the longest path length by 35‑40 % relative to DSR‑32K (roughly 37 % on average), a direct reduction in the theoretical latency metric.
- DC‑CoT‑HLP (high length‑penalty variant) outperforms similarly‑trained baselines on the Pareto frontier, achieving lower latency without sacrificing accuracy.
- Majority Voting (maj@3): When combined with three independent samples (DC‑CoT‑Maj), the approach yields both higher accuracy and lower longest path length than the baseline majority‑voted model (Maj‑DSR‑3).
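The maj@3 aggregation used in the DC‑CoT‑Maj results is standard plurality voting over independent samples; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def majority_vote(answers):
    """maj@k: return the most common final answer among k independent samples.
    Counter.most_common(1) yields [(answer, count)] for the plurality winner."""
    return Counter(answers).most_common(1)[0][0]

# Two of three samples agree on "42", so it wins the vote.
print(majority_vote(["42", "41", "42"]))  # 42
```

Because the three samples run concurrently, majority voting adds accuracy without lengthening the critical path, which is why DC‑CoT‑Maj can beat Maj‑DSR‑3 on both axes at once.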
The paper also reports complementary gains on additional benchmarks (see Figure 5, Tables 1‑2) and demonstrates that the method can be combined with parallel sampling techniques for further improvements.
Relation to Prior Work
The authors position DC‑CoT relative to several strands of literature:
- Multi‑Agent Systems (e.g., AsyncThink, Native Parallel Reasoner) that orchestrate non‑reasoning LLMs for parallel execution. DC‑CoT differs by starting from a strong, RL‑trained CoT model and explicitly teaching parallel reasoning.
- Parallel Sampling methods that improve accuracy by generating multiple independent completions; DC‑CoT is orthogonal because it identifies distinct sub‑tasks rather than merely sampling alternatives.
- Parallel Decoding approaches that modify the transformer’s attention masks or use specialized hardware; DC‑CoT achieves parallelism at the algorithmic level without altering the underlying model architecture.
- Kimi K2.5, which also employs a multi‑stage curriculum emphasizing parallelism early on, but does not train the worker agents via RL.
Limitations and Future Directions
The authors acknowledge that spawning multiple workers increases GPU memory consumption, and scaling to more than three workers may require memory‑efficient techniques. The current experiments focus on mathematical reasoning; extending the approach to domains such as code generation, multi‑modal reasoning, or open‑ended question answering will require more sophisticated sub‑task definition mechanisms. Additionally, the current implementation uses a fixed number of workers and parallel rounds; adaptive scheduling that dynamically decides how many workers to launch based on problem complexity is an open research avenue.
Conclusion
DC‑CoT demonstrates that “parallel thinking” can be learned as a distinct capability in LLMs. By jointly optimizing for correctness and a latency‑related metric through a carefully designed RL pipeline and dynamic data filtering, the method achieves substantial reductions in theoretical inference time while preserving state‑of‑the‑art accuracy on challenging mathematics benchmarks. This work opens a promising path toward deploying large reasoning models in latency‑sensitive applications, and suggests broader applicability of parallel task decomposition in future LLM research.