SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens


The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within an LLM's hidden embeddings (termed "implicit reasoning") rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning but neglect the considerable time cost for an LLM to generate a single implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.


💡 Research Summary

Chain‑of‑Thought (CoT) prompting dramatically improves large language models’ (LLMs) ability to solve complex reasoning tasks, but the approach is notoriously verbose: a single problem can require hundreds of intermediate tokens, leading to long inference times, especially for very large models. Recent “implicit CoT” methods address this by replacing explicit reasoning tokens with a small set of hidden‑layer embeddings—so‑called implicit reasoning tokens—thereby shortening the reasoning sequence and bypassing tokenization and un‑embedding steps. However, existing implicit CoT approaches suffer from two fundamental drawbacks. First, they do not preserve semantic alignment between the implicit embeddings (when decoded back to natural language) and the ground‑truth step‑by‑step reasoning, causing a noticeable drop in downstream accuracy. Second, they still rely on the original massive LLM to generate each implicit token, which can take roughly 0.1 s per token on modern 70B‑parameter models; the cumulative cost remains prohibitive for real‑time applications.

SemCoT (Semantically‑aligned Implicit CoT) is introduced to solve both problems simultaneously. The framework consists of two sequential stages.

Stage 1 – Semantic Alignment Assessment.
The authors design a custom sentence transformer Cϕ that operates directly on the LLM’s hidden representations. They extract the middle five transformer layers of the target LLM (shown in prior work to capture rich linguistic information) and add a pooling layer followed by a linear projection to obtain a low‑dimensional semantic vector. To train Cϕ, they construct a reasoning‑pair dataset G = {(R, S)} where R is the full ground‑truth CoT text and S is a condensed, semantically‑equivalent version generated by GPT‑4o‑mini. Using contrastive learning, the model maximizes cosine similarity between embeddings of (R, S) while minimizing similarity with other pairs in the same minibatch. The resulting loss L_sim encourages the transformer to map sentences that convey the same reasoning to nearby points, even though the inputs are raw token embeddings rather than token IDs. Once trained, Cϕ can evaluate the semantic similarity between any implicit reasoning embedding Z and the embedded ground‑truth reasoning TF(R). This similarity score becomes an explicit objective during the next stage, ensuring that the implicit tokens retain the meaning of the original CoT.

Stage 2 – Efficient Implicit Reasoning Generation.
To reduce the per‑token generation cost, SemCoT replaces the heavyweight LLM with a lightweight language model Iψ (e.g., a distilled or pruned version of the same architecture, such as Sheared‑LLaMA‑1.3B). The lightweight model is equipped with a special token added to its vocabulary; during inference it receives the user query Q followed by k copies of this token, where k is the desired number of implicit reasoning steps. Iψ produces its own last‑layer hidden embeddings, which are then linearly projected (via a learned matrix Wproj) into the embedding space of the original LLM. The projected embeddings serve as the implicit reasoning Z that the original LLM will consume when generating the final answer.
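The projection step can be sketched as a single matrix multiply. The dimensions below are hypothetical stand-ins (a ~1.3B generator with hidden size 2048 feeding a 7B-class LLM with embedding size 4096); in the actual method Wproj is learned jointly with Iψ, whereas here it is randomly initialized for illustration.

```python
import numpy as np

d_small, d_large, k = 2048, 4096, 5   # hypothetical sizes; k = implicit tokens

rng = np.random.default_rng(0)
# Stand-in for the learned projection matrix Wproj.
W_proj = rng.normal(scale=d_small ** -0.5, size=(d_small, d_large))

def project_implicit_tokens(hidden_states, W):
    """Map the generator's last-layer hidden states (k, d_small) into the
    original LLM's embedding space (k, d_large). The result Z is fed to the
    LLM as if it were k input token embeddings."""
    return hidden_states @ W

h = rng.normal(size=(k, d_small))     # stand-in for I_psi's hidden states
Z = project_implicit_tokens(h, W_proj)
```

Because the projection is linear, it adds negligible cost per token; the latency savings come from Iψ being far smaller than the original LLM.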

Training Iψ and Wproj is performed with a multi‑objective loss: (a) a semantic alignment loss that maximizes the similarity between Cϕ(TF(R)) and Cϕ(Iψ(Q)) (i.e., the implicit reasoning should be semantically close to the ground‑truth CoT), and (b) an answer correctness loss that maximizes the probability of the correct answer tokens under the original LLM conditioned on the generated Z. By jointly optimizing these objectives, the lightweight generator learns to produce fast, compact implicit tokens that still convey the full reasoning content required for accurate answers.
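The two objectives above can be combined as a weighted sum. This is a hedged sketch: the trade-off weight `alpha`, the use of `1 - cosine` for the alignment term, and mean negative log-likelihood for the answer term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two semantic vectors produced by C_phi."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_loss(sem_gt, sem_implicit, answer_log_probs, alpha=0.5):
    """Illustrative multi-objective loss:
    (a) semantic alignment: 1 - cos(C_phi(TF(R)), C_phi(I_psi(Q)));
    (b) answer correctness: mean negative log-likelihood of the gold
        answer tokens under the frozen LLM, conditioned on Z.
    alpha is a hypothetical trade-off weight."""
    align_loss = 1.0 - cosine(sem_gt, sem_implicit)
    answer_loss = -float(np.mean(answer_log_probs))
    return alpha * align_loss + (1.0 - alpha) * answer_loss
```

When the implicit reasoning's semantic vector matches the ground truth exactly, the alignment term vanishes and only the answer-correctness term drives the gradient, which matches the intuition that alignment acts as a regularizer on top of accuracy.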

Experiments.
The authors evaluate SemCoT across several LLM backbones (Llama‑2‑7B‑chat, Llama‑2‑13B, GPT‑3.5‑Turbo) and a suite of reasoning benchmarks: GSM‑8K (grade‑school math), SVAMP (arithmetic), CLUTRR (logical deduction), and OpenBookQA (common‑sense). Baselines include traditional explicit CoT, prior implicit CoT methods (Implicit‑CoT, Distill‑CoT), and direct answer generation without reasoning. Metrics reported are (1) answer accuracy, (2) average inference latency, and (3) total number of tokens generated (including implicit tokens).

Results show that SemCoT consistently reduces latency by 30 %–45 % and token count by 20 %–35 % compared with the best existing implicit CoT baselines, while preserving accuracy within 0.5 %–2 % of the original explicit CoT. Notably, on the largest models the per‑token generation time drops from ~0.1 s to ~0.02 s thanks to the lightweight generator, making the overall reasoning pipeline viable for latency‑sensitive applications. Qualitative analyses confirm that the implicit embeddings, when decoded via the sentence transformer, produce natural‑language explanations that are nearly indistinguishable from the original CoT.

Discussion and Limitations.
SemCoT’s success hinges on two design choices. First, the custom sentence transformer bridges the gap between raw hidden states and semantic similarity, avoiding the pitfalls of using raw LLM embeddings directly (which are optimized for next‑token prediction rather than sentence meaning). Second, the linear projection assumes that the lightweight model’s embedding distribution is a linearly transformable subset of the full LLM’s space; while this holds for the distilled/pruned models tested, more divergent architectures may require non‑linear alignment modules. Training the sentence transformer also demands a sizable contrastive dataset of reasoning pairs, which could be a bottleneck for new domains.

Future Directions.
The authors suggest extending SemCoT to multimodal reasoning (e.g., vision‑language tasks), exploring non‑linear alignment layers, and dynamically determining the optimal number of implicit tokens per query. Moreover, integrating the semantic alignment loss directly into the LLM’s pre‑training objective could further tighten the coupling between implicit representations and human‑readable reasoning.

Conclusion.
SemCoT presents the first unified solution that simultaneously accelerates implicit CoT generation and guarantees that the compressed reasoning remains semantically faithful to the original step‑by‑step explanation. By coupling a contrastively trained sentence transformer with a distilled lightweight generator, the framework achieves substantial speed‑ups without sacrificing the accuracy gains that made CoT popular. This work paves the way for deploying sophisticated reasoning capabilities of large language models in real‑time, resource‑constrained environments.

