rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is "aha" moments, when they start to perform strategies such as self-reflection and deep thinking within chains of thought (CoTs). Motivated by this observation, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.


💡 Research Summary

The paper introduces rSIM (reinforced Strategy Injection Mechanism), a novel framework designed to endow any large language model (LLM) with advanced reasoning capabilities by explicitly injecting reasoning strategies into its chain‑of‑thought (CoT) process. The core idea is to pair a lightweight “planner” (the leader) with a pre‑trained LLM (the follower) in a leader‑follower multi‑agent reinforcement learning (MARL) setting. The planner decides, at each step of the CoT, which reasoning strategy—such as self‑reflection, hypothesis testing, backward reasoning, or multi‑perspective exploration—should be inserted. These strategies are encoded as short token prompts that the LLM consumes, thereby steering its subsequent generation toward more structured and “aha‑moment” reasoning.
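The injection loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the strategy names, cue strings, and the `plan_strategy` / `llm_step` functions are hypothetical stand-ins (the real planner is a learned policy and the follower is an actual LLM).

```python
import random

# Hypothetical strategy repertoire; the paper's actual set may differ.
STRATEGIES = {
    "self_reflection": "Wait, let me double-check the previous step.",
    "hypothesis_testing": "Suppose the opposite holds; does it lead to a contradiction?",
    "backward_reasoning": "Working backward from the goal:",
    "none": "",  # let the LLM continue unguided
}

def plan_strategy(cot_so_far: str) -> str:
    """Stand-in for the small planner: chooses a strategy per CoT step.
    The real planner is a trained policy; here we pick at random."""
    return random.choice(list(STRATEGIES))

def generate_with_injection(problem: str, llm_step, max_steps: int = 4) -> str:
    """Sketch of rSIM inference: before each CoT step, the planner may
    inject a short strategy cue that steers the follower LLM's generation."""
    cot = problem
    for _ in range(max_steps):
        cue = STRATEGIES[plan_strategy(cot)]
        if cue:
            cot += "\n[strategy] " + cue   # strategy tokens inserted into the CoT
        cot += "\n" + llm_step(cot)        # follower LLM produces the next step
    return cot

# Toy "LLM" that just labels steps, for illustration only.
toy_llm = lambda ctx: f"step {ctx.count(chr(10)) + 1}"
print(generate_with_injection("Q: 2+2?", toy_llm))
```

The key design point visible even in this sketch is that the planner acts between generation steps, so the follower LLM never needs architectural changes; it simply conditions on the injected cue tokens.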

Training is performed jointly: the planner’s policy and the LLM’s generation policy are updated simultaneously using policy‑gradient methods. The planner is deliberately small (on the order of a few hundred thousand parameters) to ensure fast convergence and low computational overhead. The reward function is rule‑based and composed of three main components: (1) a correctness reward that measures improvement in final answer accuracy after a strategy is applied, (2) an efficiency penalty that discourages over‑use of strategies (keeping the proportion of strategy tokens modest), and (3) a human‑alignment bonus that gives extra credit when the chosen strategy sequence resembles expert‑crafted CoT patterns. This design pushes the system to not only produce correct answers but also to develop a disciplined, strategy‑driven reasoning style.
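The three-part rule-based reward can be summarized in a short sketch. The component weights `w_eff` and `w_align` are illustrative assumptions, not values reported in the paper, and the boolean inputs abstract away how correctness and pattern matching are actually computed.

```python
def rsim_reward(correct_after: bool, correct_before: bool,
                n_strategy_tokens: int, n_total_tokens: int,
                matches_expert_pattern: bool,
                w_eff: float = 0.5, w_align: float = 0.2) -> float:
    """Hedged sketch of the three-component rule-based reward.
    Weights are hypothetical; the paper's exact formulation may differ."""
    # (1) correctness: reward improvement in final-answer accuracy
    r_correct = float(correct_after) - float(correct_before)
    # (2) efficiency: penalize a high proportion of strategy tokens
    r_eff = -w_eff * (n_strategy_tokens / max(n_total_tokens, 1))
    # (3) human alignment: bonus when the strategy sequence resembles
    #     expert-crafted CoT patterns
    r_align = w_align if matches_expert_pattern else 0.0
    return r_correct + r_eff + r_align
```

Under this shape, injecting strategies only pays off when they flip an answer from wrong to right by more than the token-budget penalty they incur, which is what pushes the planner toward sparse, well-timed injections.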

Empirical evaluation focuses on the Qwen2.5 family of models. When the 0.5B-parameter version is equipped with a trained planner, it consistently outperforms the much larger 14B version across a suite of benchmarks: GSM8K, MATH, ARC‑Easy/Challenge, and HumanEval. The performance gap ranges from 4% to 18% absolute improvement, with the most pronounced gains on tasks that require multi‑step logical deduction (e.g., MATH). Ablation studies reveal that an optimal strategy insertion rate of roughly 5–8% of total tokens yields the best trade‑off between guidance and generation freedom; excessive insertion leads to degradation, confirming the importance of the efficiency penalty.

A key contribution is the planner's plug‑in nature. After a single training run, the same planner can be attached to other LLMs (e.g., Qwen2.5‑1.5B, LLaMA‑2‑7B) without any further fine‑tuning, delivering 4–9% absolute accuracy lifts across the same benchmarks. This demonstrates that the planner learns a model‑agnostic policy for when and how to inject strategies, effectively acting as a universal reasoning controller.

The authors also explore continual learning. By introducing new tasks (legal QA, medical diagnosis) and augmenting the reward with task‑specific correctness signals, the planner adapts its strategy repertoire while retaining previously learned behaviors. This incremental learning capability suggests that rSIM can evolve alongside expanding application domains, gradually building a richer library of reasoning tactics.

Limitations are acknowledged. The current set of strategies is hand‑crafted and limited in scope; extending to a broader, possibly automatically discovered strategy space would likely improve flexibility. The rule‑based reward, while transparent, may not capture nuanced trade‑offs in more complex environments. Moreover, because strategy cues are injected as tokens, very long contexts risk diluting the signal, potentially reducing effectiveness on extremely lengthy prompts. Future work is proposed to incorporate meta‑reward learning for automatic strategy discovery and to develop richer communication channels (e.g., separate control streams or dynamic attention modulation) between planner and LLM.

In summary, rSIM offers a practical, scalable pathway to transform ordinary LLMs into Reasoning Language Models (RLMs) by coupling them with a small, train‑once planner that injects reasoning strategies during inference. The approach delivers substantial performance gains even for modest‑sized models, generalizes across architectures, and supports continual learning, positioning it as a promising building block for next‑generation AI systems that require reliable, interpretable, and human‑aligned reasoning.

