Internalizing LLM Reasoning via Discovery and Replay of Latent Actions


The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings show that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, which internalizes the reasoning process, bypasses explicit generation, and achieves superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.


💡 Research Summary

The paper introduces STIR (Self‑Distilled Tools for Internal Reasoning), a novel framework that internalizes chain‑of‑thought reasoning into the hidden states of large language models (LLMs) by dynamically controlling latent trajectories. Existing activation‑steering approaches rely on static control vectors that are injected uniformly across all decoding steps. The authors argue that reasoning is a non‑stationary process composed of distinct cognitive phases (hypothesis generation, verification, conclusion) and therefore requires time‑varying, state‑dependent interventions.

STIR consists of three tightly coupled stages:

  1. Differential Intrinsic Action Induction: For each training prompt, a set of stochastic rollouts is sampled. Each rollout is scored with a length‑regularized reward that penalizes unnecessary tokens while rewarding correctness. At predefined decision checkpoints (identified by double‑newline delimiters), the hidden‑state centroids of high‑reward (successful) and low‑reward (failed) rollouts are computed (µ⁺ and µ⁻). Their difference v = µ⁺ − µ⁻ is treated as a latent “steering impulse” that would move an erroneous state toward the successful manifold. Both a correction entry (keyed by µ⁻, value v) and an anchor entry (keyed by µ⁺, value null) are stored, enabling the system to recognize when no intervention is needed.
  2. Sparse Control Basis Construction: The raw set of impulses is typically large and redundant. The authors formulate a geometric optimization problem that balances individual utility (how much a tool improves accuracy) with collective diversity (orthogonality among tools). By maximizing a weighted sum of utility and pairwise cosine dissimilarity under a budget B (e.g., 64 tools), they obtain a compact, diverse control basis that can be efficiently queried at inference time.
  3. Value‑Modulated Trajectory Intervention: During decoding, the current hidden state hₜ is compared to the stored keys. If hₜ is close to an anchor (µ⁺), the system abstains. If it is near a negative centroid (µ⁻), the most relevant tool v is retrieved. A lightweight value function V(hₜ, v) (implemented via a small verification model or learned reward estimator) evaluates the expected benefit of applying the impulse. When V exceeds a threshold, the impulse is injected into the residual stream as ˜hₜ = hₜ + α·v; otherwise, a null action is taken. This “retrieve‑preview‑commit” cycle ensures that interventions are only applied when they are predicted to improve the trajectory, preventing over‑correction.
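
Stage 1 can be sketched in a few lines. This is a minimal illustration assuming NumPy arrays for hidden states; the function and variable names (`induce_impulse`, `tool_library`) are illustrative, not taken from the paper's released code.

```python
import numpy as np

def induce_impulse(pos_states, neg_states):
    """Compute a steering impulse at one decision checkpoint.

    pos_states: (n_pos, d) hidden states of high-reward rollouts
    neg_states: (n_neg, d) hidden states of low-reward rollouts
    Returns (mu_neg, v, mu_pos): correction key, impulse, anchor key.
    """
    mu_pos = pos_states.mean(axis=0)   # centroid of successes, µ⁺
    mu_neg = neg_states.mean(axis=0)   # centroid of failures,  µ⁻
    v = mu_pos - mu_neg                # impulse pointing toward the successful manifold
    return mu_neg, v, mu_pos

# Toy checkpoint: 3 successful and 2 failed rollouts in a 4-d hidden space.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(3, 4))
neg = rng.normal(-1.0, 0.1, size=(2, 4))
key_neg, impulse, key_pos = induce_impulse(pos, neg)

# Library entries: a correction (keyed by µ⁻, value v) and an
# anchor (keyed by µ⁺, value None, meaning "no intervention needed").
tool_library = [(key_neg, impulse), (key_pos, None)]
```

By construction, adding the impulse to the failure centroid lands exactly on the success centroid, which is the intuition behind treating v as a corrective action.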

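The retrieve-preview-commit cycle of Stage 3 might look like the following sketch. The cosine-similarity retrieval, thresholds, and trivial value function here are assumptions for illustration; the paper describes V as a small verification model or learned reward estimator.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def intervene(h, library, value_fn, alpha=1.0, sim_thresh=0.8, value_thresh=0.0):
    """One decoding step of value-modulated trajectory intervention.

    h: current hidden state h_t
    library: list of (key, impulse) pairs; impulse is None for anchor entries
    value_fn: V(h, v), the estimated benefit of applying impulse v at h
    Returns the (possibly steered) hidden state.
    """
    # Retrieve: find the closest stored key.
    key, impulse = max(library, key=lambda kv: cosine(h, kv[0]))
    if cosine(h, key) < sim_thresh:
        return h                      # no relevant tool: null action
    if impulse is None:
        return h                      # near an anchor (µ⁺): abstain
    # Preview, then commit only if the value gate predicts improvement.
    if value_fn(h, impulse) > value_thresh:
        return h + alpha * impulse    # inject into the residual stream
    return h

# Toy library: µ⁻ = -1s, µ⁺ = +1s, impulse v = µ⁺ − µ⁻.
mu_neg, mu_pos = -np.ones(4), np.ones(4)
library = [(mu_neg, mu_pos - mu_neg), (mu_pos, None)]
steered = intervene(mu_neg, library, value_fn=lambda h, v: 1.0)
```

In the toy run, a state at µ⁻ is steered onto µ⁺, while a state already at the anchor µ⁺ passes through untouched.
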
Experiments were conducted on six arithmetic and logical reasoning benchmarks (including GSM8K, MultiArith, LogicalDeduction) using four representative LLMs (LLaMA‑2‑7B/13B, GPT‑NeoX‑20B, Falcon‑40B). Compared to vanilla greedy decoding, standard chain‑of‑thought prompting, and static activation steering, STIR achieved average accuracy gains of 1.9%–7.5% while reducing token consumption by up to 35%. The most significant improvements appeared on multi‑step problems where static steering either failed to help or even degraded performance. Analysis showed that corrective impulses were applied in less than 12% of total decoding steps, primarily at phase transitions, and that the anchor‑null mechanism effectively prevented unnecessary perturbations.

Key contributions are: (1) a clear articulation of the temporal misalignment problem inherent in static steering for reasoning; (2) a self‑supervised pipeline that extracts actionable latent vectors from the model’s own successful rollouts; (3) a sparsity‑driven basis construction that yields a compact, diverse tool library; and (4) a runtime controller that combines sparse retrieval with value‑modulated gating to enable context‑specific, dynamic intervention.
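
The sparsity-driven basis construction (contribution 3) trades off per-tool utility against pairwise diversity under a budget B. A greedy solver is one natural way to approximate that objective; the paper states the objective but this particular solver and its `lam` weight are illustrative assumptions.

```python
import numpy as np

def select_basis(impulses, utilities, budget, lam=0.5):
    """Greedily pick up to `budget` impulses, scoring each candidate by
    utility + lam * (1 - max cosine similarity to tools already chosen)."""
    unit = [v / np.linalg.norm(v) for v in impulses]
    chosen, remaining = [], list(range(len(impulses)))
    while remaining and len(chosen) < budget:
        def score(i):
            # Diversity bonus: far from every already-selected direction.
            div = 1.0 if not chosen else \
                1.0 - max(float(unit[i] @ unit[j]) for j in chosen)
            return utilities[i] + lam * div
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Two nearly parallel impulses plus one orthogonal one: with the diversity
# term, the orthogonal (lower-utility) tool beats the redundant duplicate.
impulses = np.array([[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]])
chosen = select_basis(impulses, utilities=[1.0, 0.9, 0.5], budget=2)
```

Here the near-duplicate of the first tool is skipped despite its higher standalone utility, which is exactly the redundancy-pruning behavior the basis construction is after.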

Limitations include the upfront cost of building the tool library, the current focus on single‑language, single‑model settings, and sensitivity to the design of the value function. Future work may explore continual updating of the tool set, extension to multimodal or instruction‑following scenarios, and scaling the approach to even larger models. Overall, STIR demonstrates that the benefits of explicit chain‑of‑thought can be captured within hidden states, offering a promising path toward more compute‑efficient, high‑fidelity reasoning in LLMs.

