S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs
Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods: the scarcity of high-quality supervision data. Using data filtered by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at https://github.com/DYR1/S3-CoT.
💡 Research Summary
S3‑CoT (Self‑Sampled Succinct Reasoning) tackles the growing concern that chain‑of‑thought (CoT) reasoning in large language models (LLMs) often becomes unnecessarily verbose, inflating latency and computational cost even on simple queries. Inspired by the dual‑process theory of human cognition—System 1 (fast, intuitive) and System 2 (slow, deliberative)—the authors propose a self‑sampling framework that generates high‑quality, variable‑length CoT data directly from the target LLM without any external teacher or tool.
The core technical insight is the existence of a linear “length‑control direction” (VL‑D) in the hidden‑state space of transformer‑based LLMs. By appending long‑CoT and short‑CoT prompts to the same question, the authors collect activation pairs from the final token of each layer, compute the difference‑in‑means vector, and verify via PCA that a consistent direction emerges starting around the middle layers. Quantitative metrics—mean separation strength and angle variance—confirm that these vectors are highly parallel across samples, indicating a stable, model‑intrinsic attribute that can be manipulated linearly.
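The difference-in-means construction and the parallelism check described above can be sketched as follows. This is a minimal illustration with numpy arrays standing in for collected hidden states; the function names, the use of mean pairwise cosine similarity as the "parallelism" measure, and the toy data layout are assumptions, not the paper's actual implementation.

```python
import numpy as np

def length_control_direction(long_acts, short_acts):
    """Difference-in-means VL-D estimate for one layer.

    long_acts, short_acts: (n_samples, hidden_dim) arrays of final-token
    hidden states collected under long-CoT and short-CoT prompts for the
    same questions. Returns a unit-length steering direction.
    """
    d = long_acts.mean(axis=0) - short_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def mean_pairwise_cosine(per_sample_diffs):
    """Average cosine similarity between per-sample difference vectors.

    Values near 1.0 indicate the per-sample directions are highly
    parallel, i.e. a single linear direction explains the length effect
    (a stand-in for the paper's separation/angle-variance metrics).
    """
    v = per_sample_diffs / np.linalg.norm(per_sample_diffs, axis=1, keepdims=True)
    sims = v @ v.T  # (n, n) cosine matrix; diagonal is all ones
    n = len(v)
    return (sims.sum() - n) / (n * (n - 1))  # mean of off-diagonal entries
```

In this sketch, high `mean_pairwise_cosine` across samples is what would justify treating VL-D as a stable, linearly manipulable attribute.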
Using VL‑D, the framework intervenes on a contiguous block of top layers, adding a scaled version of the direction (α × d) to the hidden states during inference. Negative α values shrink the generated CoT, while positive values lengthen it. A probing stage determines suitable layer ranges (typically the top 5‑10 layers for general LLMs, top 15 for R1‑style models) and safe α magnitudes (|α| ≤ 0.5). Too weak an intervention leaves length unchanged; too strong an intervention causes output collapse (repetition or nonsensical text).
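The intervention itself reduces to adding α × d to the hidden states of a contiguous block of top layers. A minimal sketch of that arithmetic is below, operating on a plain array of per-layer hidden states rather than hooking a real transformer; the function name and array layout are hypothetical, and in practice this addition would be applied inside the model's forward pass during decoding.

```python
import numpy as np

def apply_vl_d(hidden_states, d, alpha, top_k):
    """Add alpha * d to the hidden states of the top `top_k` layers.

    hidden_states: (n_layers, seq_len, hidden_dim) array of activations.
    d: (hidden_dim,) unit-length VL-D steering direction.
    alpha < 0 shortens the generated CoT, alpha > 0 lengthens it; the
    summary above reports |alpha| <= 0.5 as a safe magnitude, with
    stronger interventions risking repetition or output collapse.
    """
    steered = hidden_states.copy()
    steered[-top_k:] += alpha * d  # broadcasts over layers and positions
    return steered
```

With a real model, the same addition would typically live in a per-layer forward hook, with the probed layer range (e.g. top 5-10 layers for general LLMs) selecting which hooks are registered.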
The generated CoT traces are filtered for quality in two ways. When gold answers are available, a simple answer‑matching filter retains only samples that produce the correct answer. When gold answers are absent, the authors employ a self‑consistency filter: they generate multiple CoT variants with different α values and keep only those where all variants agree on the final answer. This self‑consistency check yields near‑perfect accuracy on retained samples, though the retention rate depends heavily on the base model’s capability (e.g., only 517 of 6,838 LLaMA‑3 8B samples survive).
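The gold-free self-consistency filter amounts to keeping only questions whose variable-length variants are unanimous. A minimal sketch, assuming a simple mapping from question id to the list of final answers extracted from each α-variant (the real pipeline's data format and answer extraction are not specified here):

```python
def self_consistency_filter(samples):
    """Keep questions whose variable-length CoT variants all agree.

    samples: dict mapping question id -> list of final answers produced
    under different alpha values. Returns the retained question ids with
    their (unanimous) answer, which serves as the pseudo-label for SFT.
    """
    kept = {}
    for qid, answers in samples.items():
        if answers and len(set(answers)) == 1:  # unanimous agreement
            kept[qid] = answers[0]
    return kept
```

The unanimity requirement is what makes retained samples near-perfectly accurate while making the retention rate sensitive to the base model's capability, as with the 517-of-6,838 LLaMA-3 8B figure above.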
Training proceeds with two complementary mechanisms. First, a dual‑cognitive system mimics fast and slow reasoning: a “fast” head is encouraged to produce short CoT, while a “slow” head retains the ability to generate longer, more detailed reasoning. Second, a progressive compression curriculum gradually reduces the target length ratio (Len‑R) from near‑unity (full‑length CoT) to as low as 0.5, allowing the model to adapt without catastrophic loss of reasoning ability. This curriculum mitigates the over‑compression problem observed in prior SFT‑based approaches.
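The progressive compression curriculum can be sketched as a schedule over the target length ratio. The linear annealing below is an assumption for illustration; the summary only states that Len-R decreases progressively from near 1.0 down to 0.5.

```python
def len_r_schedule(stage, n_stages, start=1.0, end=0.5):
    """Target length ratio (Len-R) for curriculum stage `stage` (0-indexed).

    Anneals linearly from `start` (full-length CoT) to `end` over
    `n_stages` stages, so each SFT stage compresses targets only slightly
    relative to the previous one, avoiding over-compression.
    """
    if n_stages <= 1:
        return end
    frac = stage / (n_stages - 1)
    return start + frac * (end - start)
```

At each stage, training targets would be drawn from the self-sampled traces whose lengths match the current Len-R budget, letting the model adapt gradually rather than jumping straight to maximally compressed reasoning.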
Extensive evaluation covers mathematics (GSM8K, MATH, BIG‑Bench Math) and medical QA (MedQA, PubMedQA). Compared to three families of baselines—prompt‑control methods that inject length cues, SFT‑based methods that fine‑tune on curated concise CoT (e.g., C3oT, CoT‑Value), and RL‑based methods that explicitly reward brevity—the S3‑CoT approach consistently achieves higher accuracy per token and comparable or superior absolute accuracy. Notably, the improvements hold on R1‑style models (which emit explicit "<think>"‑delimited reasoning traces) as well as on general LLMs.
The paper’s contributions are threefold: (1) a teacher‑free pipeline for generating style‑aligned, variable‑length CoT data via activation steering; (2) a self‑evolution regime that leverages self‑consistency to obtain high‑quality supervision without gold labels; and (3) an efficient fine‑tuning strategy that combines a dual‑cognitive architecture with a progressive compression curriculum, delivering fast‑thinking capabilities at lower computational cost than RL‑based solutions. Limitations include the need for a probing step for each new model, reduced sampling efficiency for weaker LLMs, and the focus on length as the sole controllable attribute. Future work could explore multi‑attribute steering (e.g., logical depth, confidence), scaling to larger models, and integrating VL‑D discovery with reinforcement learning for even finer control.
Overall, S3‑CoT demonstrates that LLMs can be coaxed into a fast‑thinking mode by internally steering their hidden representations, thereby achieving concise, high‑quality reasoning without external supervision—a significant step toward more efficient and human‑like AI reasoning.