CoLT: Reasoning with Chain of Latent Tool Calls
Chain-of-Thought (CoT) is a critical technique for enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as "tool calls". Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain the information of a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks them back into a full reasoning step. In this way, the main model continues to reason in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
💡 Research Summary
The paper introduces CoLT (Chain of Latent Tool Calls), a framework that enables large language models (LLMs) to perform chain‑of‑thought (CoT) reasoning more efficiently by offloading compressed reasoning steps to external, lightweight decoders. Traditional explicit CoT generates each reasoning token sequentially, leading to high inference cost. Implicit latent CoT methods reduce token length but require substantial model architecture changes and extensive retraining, limiting their applicability.
CoLT addresses these issues by having the main LLM generate special "seed tokens" that embed condensed information about a reasoning step in their hidden states. Two token types are defined: seed tokens, whose hidden states carry the compressed content of a reasoning step, and a trigger token that invokes the latent tool call, at which point a lightweight external decoder unpacks the buffered seed hidden states into an explicit reasoning step.
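The generate-then-unpack loop described above can be sketched as follows. This is a toy, runnable illustration under assumed interfaces, not the authors' implementation: a stub main model emits two seed tokens followed by a trigger token, and a stub external decoder "unpacks" the buffered seed hidden states into a full reasoning step.

```python
# Illustrative sketch of a CoLT-style latent tool call loop.
# All class and token names here are assumptions for demonstration.

TRIGGER = "<latent_call>"  # hypothetical trigger token

class StubMainModel:
    """Emits a fixed token stream: two seed tokens, a trigger, then EOS."""
    def __init__(self):
        self.stream = [("seed_1", [0.1, 0.2]), ("seed_2", [0.3, 0.4]),
                       (TRIGGER, None), ("<eos>", None)]
    def step(self):
        # Returns (token, hidden_state); a real model would run a forward pass.
        return self.stream.pop(0)

class StubLatentDecoder:
    """Stands in for the smaller external model that expands seed hidden
    states back into an explicit reasoning step."""
    def unpack(self, seed_hiddens):
        return f"reasoning step from {len(seed_hiddens)} seeds"

def colt_generate(main_model, decoder):
    hidden_buffer, steps, tokens = [], [], []
    while True:
        token, hidden = main_model.step()
        tokens.append(token)
        if token == TRIGGER:
            # Latent tool call: hand buffered seed hidden states to the decoder.
            steps.append(decoder.unpack(hidden_buffer))
            hidden_buffer = []
        elif token == "<eos>":
            return tokens, steps
        else:
            hidden_buffer.append(hidden)

tokens, steps = colt_generate(StubMainModel(), StubLatentDecoder())
print(steps)  # → ['reasoning step from 2 seeds']
```

The key point the sketch captures is that the main model's output stream stays in explicit token space; only the expansion of each compressed step is delegated to the external decoder.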
Training consists of two supervised losses: L_main, encouraging the main model to emit correct seed tokens, and L_lat, a cross‑entropy loss on the decoder’s output. The total supervised loss is L_sup = L_main + L_lat. To go beyond gold CoT supervision, the authors also apply reinforcement learning using Group Relative Policy Optimization (GRPO). By sampling both main‑model outputs and decoder outputs, multiple reasoning trajectories are generated for each question; a reward based on answer correctness (1 for correct, 0.1 for correct format, 0 otherwise) guides policy updates, with a KL‑penalty for stability.
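The supervised objective and the RL reward described above can be written down directly. The following minimal sketch uses the values stated in the summary (reward 1.0 for a correct answer, 0.1 for correct format only, 0.0 otherwise; L_sup as the unweighted sum of L_main and L_lat); the function names are illustrative, not the authors'.

```python
# Sketch of the reward and supervised loss from the summary.
# The unweighted sum for L_sup follows the text; any weighting would be an assumption.

def grpo_reward(answer: str, gold: str, well_formatted: bool) -> float:
    """Answer-correctness reward used to guide GRPO policy updates."""
    if answer == gold:
        return 1.0          # correct answer
    if well_formatted:
        return 0.1          # wrong answer but correct format
    return 0.0              # otherwise

def supervised_loss(l_main: float, l_lat: float) -> float:
    """L_sup = L_main + L_lat."""
    return l_main + l_lat

print(grpo_reward("42", "42", True))   # → 1.0
print(grpo_reward("41", "42", True))   # → 0.1
print(grpo_reward("41", "42", False))  # → 0.0
```

In GRPO, this scalar reward is computed for each sampled trajectory in a group, and relative advantages within the group drive the policy update, with a KL penalty keeping the policy close to the reference model.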
Experiments are conducted on four math reasoning benchmarks: GSM8K‑Aug (an expanded version of GSM8K with ~385k training examples), GSM8K‑Hard, SVAMP, and MultiArith. CoLT is evaluated with one‑seed and two‑seed configurations and compared against strong baselines such as Coconut, CODI, COLAR (with various compression ratios), SIM‑CoT, and standard CoT. Results show that CoLT achieves higher accuracy while reducing reasoning chain length. For example, on GSM8K‑Aug, CoLT (2‑seed) reaches 45.5% accuracy with a 10.84 token reduction, outperforming COLAR (5×), which attains 42.2% accuracy with a 13.2 token reduction. Reinforcement learning further improves performance on the harder out‑of‑domain datasets.
The authors also explore alternative decoder architectures, including multi‑hot decoders, demonstrating that the framework is flexible and not tied to a specific decoder design. Ablation studies confirm that the number of seed tokens and the choice of decoder affect the trade‑off between compression and accuracy, but reasonable defaults work well across datasets.
Limitations are acknowledged: the optimal seed‑token length and decoder selection require hyper‑parameter tuning; the current evaluation focuses on mathematical problem solving, so generalization to other domains (e.g., code generation, commonsense reasoning) remains to be validated; and maintaining separate decoder modules incurs additional memory overhead, though this is offset by the reduced token generation cost.
In summary, CoLT proposes a novel “latent tool call” paradigm that preserves the explicit‑text reasoning capabilities of pretrained LLMs while delegating compressed reasoning steps to smaller, efficient decoders. This approach yields both computational savings and accuracy gains, offering a practical pathway to deploy powerful LLMs in resource‑constrained settings without extensive model redesign or retraining.