R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency–accuracy trade-offs that can be tailored to diverse computational budgets without retraining.


💡 Research Summary

Paper Overview
The paper tackles the latency bottleneck of chain‑of‑thought (CoT) reasoning in large language models (LLMs). While CoT dramatically improves problem‑solving ability, it forces the model to generate long token sequences autoregressively, making inference costly. Existing acceleration strategies fall into three categories: (1) shortening the reasoning trace (early exit, length‑aware rewards), (2) speculative decoding that drafts tokens with a small language model (SLM) and verifies them with the LLM, and (3) KV‑cache optimizations for long‑context decoding. The authors point out a critical limitation of speculative decoding: its speedup depends heavily on token‑level agreement between SLM and LLM. When agreement is low, frequent rollbacks erase any benefit and can even slow down inference.

Key Observations

  1. Entropy‑Error Correlation – Empirical analysis on the AMC benchmark shows that tokens with higher normalized entropy are far more likely to be erroneous. This holds both at the sample level (incorrect solutions have higher average entropy) and locally around the first “harmful” token that flips a correct LLM answer into a wrong one.
  2. SLM Generates Concise Traces – When the SLM solves a problem correctly, its CoT trace is substantially shorter than the LLM’s trace, indicating that the SLM can provide a more efficient reasoning path if trusted.
  3. Skewed Entropy Distribution – Most tokens have very low entropy; only about 10 % exceed an entropy of 0.1 (on a normalized scale). High‑entropy tokens are therefore rare but highly informative about uncertainty.
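The normalized entropy underlying these observations can be sketched in a few lines. This is an illustrative implementation, not the paper's code: it divides the Shannon entropy of a next-token distribution by $\log V$ so the value lies in $[0, 1]$ regardless of vocabulary size.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a next-token distribution, divided by log(V)
    so the result lies in [0, 1] regardless of vocabulary size V."""
    V = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(V)

# A peaked (confident) distribution has low normalized entropy;
# a uniform (maximally uncertain) one has normalized entropy 1.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(normalized_entropy(peaked))
print(normalized_entropy(uniform))
```

Under the skewed distribution noted above, a threshold such as 0.1 on this quantity flags only the rare uncertain tokens while letting the vast majority pass.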

R‑Stitch: Entropy‑Guided Hybrid Decoding
Based on these findings, the authors propose R‑Stitch, a token‑level routing framework that dynamically switches between SLM and LLM using entropy as an uncertainty proxy. The process works as follows:

  • Decoding always starts with the SLM.
  • At each step the active model computes the normalized entropy $H_t = -\frac{1}{\log V}\sum_{i} p_{t,i}\log p_{t,i}$, where $V$ is the vocabulary size.
  • If $H_t \le \tau$ (a pre-defined threshold), the token is accepted and decoding continues with the same model.
  • If $H_t > \tau$, the token is discarded and the other model (the LLM when the SLM is active, the SLM when the LLM is active) re-processes the same prefix and generates a replacement token.
  • The LLM can also hand control back to the SLM once it produces a low‑entropy token, enabling a bidirectional flow that avoids full‑sequence rollbacks.
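The switching loop above can be sketched as follows. This is a simplified, hypothetical rendering, not the paper's implementation: `slm` and `llm` stand in for model interfaces that return a next token and its normalized entropy for a given prefix, and the replacement token after a switch is accepted unconditionally (the real system also manages per-model KV caches).

```python
def r_stitch_decode(prompt, slm, llm, tau=0.1, max_tokens=256, eos="<eos>"):
    """Entropy-guided hybrid decoding: low-entropy tokens stay with the
    active model; a high-entropy token triggers a switch to the other model,
    which regenerates that token from the same prefix (no full rollback)."""
    tokens = list(prompt)
    active, other = slm, llm  # decoding always starts with the SLM
    while len(tokens) < len(prompt) + max_tokens:
        token, entropy = active(tokens)
        if entropy > tau:
            # Discard the uncertain token; the other model re-processes
            # the same prefix and supplies a replacement token.
            active, other = other, active
            token, _ = active(tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens[len(prompt):]

# Toy demo with stand-in models that always emit one token at a fixed entropy.
def constant_model(token, ent):
    return lambda prefix: (token, ent)

# The SLM is always confident here, so it keeps control throughout.
print(r_stitch_decode(["Q"], constant_model("s", 0.02),
                      constant_model("L", 0.4), max_tokens=4))
```

Because control flows in both directions, a confident SLM keeps decoding cheaply, while sustained uncertainty hands consecutive tokens to the LLM until its entropy drops again.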

KV‑Cache Management
Each model maintains its own KV‑cache. When a model re‑enters after a switch, its cache is reused and only the newly generated tokens from the other model are pre‑filled. This partial‑prefill strategy eliminates redundant attention over already processed tokens, dramatically reducing the overhead of model switches.
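The bookkeeping behind partial prefill can be illustrated with a small tracker. This sketch is an assumption about the mechanics, not the paper's code: each model remembers how many tokens of the shared sequence its KV cache already covers, so a re-entering model only prefills the suffix generated by the other model since its last turn.

```python
class KVCacheTracker:
    """Tracks, per model, how much of the shared token sequence its
    KV cache already covers, so switches only prefill the new suffix."""

    def __init__(self):
        self.cached = {"slm": 0, "llm": 0}

    def tokens_to_prefill(self, model, seq_len):
        """Number of new tokens `model` must prefill before it can decode,
        i.e. only the tokens appended since its cache was last updated."""
        delta = seq_len - self.cached[model]
        self.cached[model] = seq_len
        return delta
```

For example, if the SLM has decoded to length 20, the LLM must prefill all 20 tokens on its first turn, but when the SLM re-enters at length 35 it prefills only the 15 tokens the LLM added, rather than re-attending over the whole prefix.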

R‑Stitch⁺: Adaptive Routing via Reinforcement Learning
Fixed thresholds may be sub‑optimal across diverse hardware budgets and data distributions. R‑Stitch⁺ augments the basic framework with a lightweight router network that decides, for each token, whether to invoke the LLM. The router receives the current entropy, token position, and a short history as input and outputs a binary decision. It is trained with a latency‑aware reward that balances speed (negative latency) and correctness (positive reward for correct answers). The resulting policy automatically adapts the token budget, achieving better speed‑accuracy trade‑offs than a static τ.
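The router's interface and reward can be sketched as below. The linear-logistic form, feature names, and reward coefficients are purely illustrative assumptions; the paper describes a lightweight network trained with reinforcement learning.

```python
import math

def router_decision(entropy, position, recent_entropies, w, b):
    """Binary routing decision for one token: True -> invoke the LLM.
    Features: current entropy, token position, and mean of a short
    entropy history. `w` (three weights) and `b` are learned parameters."""
    hist = sum(recent_entropies) / max(len(recent_entropies), 1)
    logit = w[0] * entropy + w[1] * position + w[2] * hist + b
    return 1.0 / (1.0 + math.exp(-logit)) > 0.5

def episode_reward(correct, latency_s, alpha=1.0, beta=0.1):
    """Latency-aware reward: positive credit for a correct answer minus a
    penalty proportional to wall-clock latency, trading speed for accuracy."""
    return alpha * float(correct) - beta * latency_s
```

Tuning $\alpha$ and $\beta$ (or retraining under a different latency budget) shifts the learned policy along the speed-accuracy frontier, which is what lets R‑Stitch⁺ adapt the token budget beyond a single static $\tau$.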

Experimental Setup

  • Models – LLMs: DeepSeek‑R1‑Distill‑Qwen at 7 B, 14 B, and 32 B scales. SLMs: L1‑Short, Distill‑1.5 B, Oat‑1.5 B, etc.
  • Benchmarks – Five challenging mathematical reasoning datasets: AMC, MATH, Olympiad, Minerva, and a fifth unnamed set.
  • Metrics – Speedup (relative to full LLM decoding), token count reduction, and accuracy (percentage of correctly solved problems).

Results

  • Speedup – Peak speedups of 3.00× (7 B), 3.85× (14 B), and 4.10× (32 B) are reported.
  • Accuracy – The accuracy remains virtually identical to full LLM decoding; any drop is within 0.2 % on average.
  • Robustness to Low Agreement – Even when token‑level agreement between SLM and LLM is below 30 %, R‑Stitch still yields consistent acceleration because it does not rely on exact token matches but on entropy‑driven confidence.
  • R‑Stitch⁺ Gains – The learned router adds an extra 12‑18 % speedup over the fixed‑threshold version while keeping the accuracy loss below 0.2 %.

Ablation & Analysis

  • Entropy Threshold Sensitivity – Varying τ shows a clear trade‑off: lower τ yields higher speed but slightly more errors; higher τ preserves accuracy at the cost of reduced speedup.
  • KV‑Cache Overhead – The partial‑prefill mechanism cuts cache‑related latency by roughly 30‑40 % compared to naïve switching.
  • Computation Cost of Entropy – Computing entropy on the SLM is cheap due to its small size; the added cost is negligible compared to the saved LLM forward passes.

Discussion
R‑Stitch demonstrates that uncertainty measured directly from the model’s probability distribution is a powerful signal for allocating computational resources. Unlike speculative decoding, which depends on the SLM’s ability to exactly match the LLM, entropy‑guided routing works even when the two models have divergent vocabularies or sampling strategies. The bidirectional switch also prevents the “all‑or‑nothing” rollback problem, leading to smoother latency profiles. The approach is training‑free for the basic version, making it immediately applicable to any SLM‑LLM pair. R‑Stitch⁺ adds a modest learning component that can be fine‑tuned for specific deployment constraints.

Limitations & Future Work

  • The entropy threshold τ must still be chosen; while the paper provides sensible defaults, optimal values may vary across tasks.
  • RL training for R‑Stitch⁺ requires a reward design that balances latency and correctness, which could be non‑trivial for new domains.
  • The method assumes both models share the same tokenizer; extending to heterogeneous tokenizers would require additional alignment mechanisms.

Conclusion
R‑Stitch introduces a principled, entropy‑based routing mechanism that dynamically stitches together a small and a large language model during CoT generation. By delegating low‑uncertainty tokens to the cheap SLM and reserving the expensive LLM for high‑uncertainty regions, it cuts both the number of generated tokens and the per‑token computation cost. The extension R‑Stitch⁺ learns an adaptive policy that further refines the speed‑accuracy trade‑off. Empirical results across multiple math reasoning benchmarks and model scales show up to 4× speedup with negligible accuracy loss, positioning R‑Stitch as a practical solution for real‑time, cost‑effective deployment of large reasoning models.

