STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a spoken-response chunk is much longer than the time needed to generate its tokens, we use the remaining free time to generate unspoken reasoning tokens. While a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
💡 Research Summary
The paper “STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models” tackles a fundamental limitation of current spoken language models (SLMs): they generate speech directly from speech inputs without any internal, unspoken reasoning step. Human speakers, by contrast, often perform substantial internal reasoning before or even while speaking, which improves the clarity, correctness, and conciseness of their utterances. The authors argue that endowing SLMs with a comparable “thinking” capability could boost performance on tasks that require logical or mathematical reasoning, while preserving the low‑latency, real‑time interaction that speech systems demand.
Naïve baseline – Thinking Before Speaking (TBS).
A straightforward way to add reasoning is to force the model to generate a full chain‑of‑thought (CoT) in text first, then interleave the usual text‑speech token generation for the spoken answer. This approach (named TBS) indeed improves answer quality, but it introduces an uncontrolled latency because the CoT can be arbitrarily long. In real‑time conversational settings, waiting for the entire CoT before any audio is emitted is unacceptable.
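To make the latency problem concrete, here is a back-of-the-envelope sketch of TBS's time-to-first-audio. The function name, decoding speed, and token counts are illustrative assumptions, not figures from the paper:

```python
# Hypothetical illustration of the TBS latency bottleneck: under TBS, the
# entire CoT must be decoded before any speech tokens are produced.
TOKENS_PER_SECOND = 50.0  # assumed autoregressive decoding speed of the SLM


def tbs_first_audio_latency(cot_len: int, first_speech_chunk_len: int) -> float:
    """Seconds before the first audio chunk can start playing under TBS:
    all CoT tokens plus the first speech chunk must be decoded up front."""
    return (cot_len + first_speech_chunk_len) / TOKENS_PER_SECOND


# A 600-token CoT alone delays the first audio by over 12 seconds.
print(tbs_first_audio_latency(cot_len=600, first_speech_chunk_len=25))  # → 12.5
```

Because `cot_len` is unbounded in the worst case, the delay before the user hears anything is uncontrolled, which is what STITCH is designed to avoid.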
Core contribution – STITCH.
STITCH eliminates the latency bottleneck by exploiting the fact that the audio duration of a spoken chunk is far longer than the time needed to generate the corresponding tokens. The method divides generation into fixed‑size chunks and alternates three streams:
- Reasoning chunk – N tokens of unspoken CoT.
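The scheduling intuition above can be sketched as a timing budget: while one audio chunk is playing, decoding the next speech chunk consumes only part of the playback window, and the slack is spent on unspoken reasoning tokens. All constants and names below are assumed for illustration, not taken from the paper's implementation:

```python
# Minimal sketch of STITCH-style chunked generation (hypothetical numbers).
TOKENS_PER_SECOND = 50.0        # assumed decoding speed of the SLM
AUDIO_SECONDS_PER_CHUNK = 2.0   # assumed playback duration of one spoken chunk
SPEECH_CHUNK_TOKENS = 25        # assumed tokens per spoken-response chunk


def stitch_schedule(num_chunks: int, reasoning_chunk_tokens: int):
    """For each audio chunk being played, check whether decoding the next
    speech chunk plus one reasoning chunk fits inside the playback window,
    i.e. whether thinking can happen "for free" while audio plays."""
    schedule = []
    for i in range(num_chunks):
        speech_decode_s = SPEECH_CHUNK_TOKENS / TOKENS_PER_SECOND
        reasoning_decode_s = reasoning_chunk_tokens / TOKENS_PER_SECOND
        slack_s = AUDIO_SECONDS_PER_CHUNK - speech_decode_s
        schedule.append({
            "chunk": i,
            "speech_decode_s": speech_decode_s,
            "reasoning_decode_s": reasoning_decode_s,
            "fits_in_playback": reasoning_decode_s <= slack_s,
        })
    return schedule


# With these assumed rates, 70 reasoning tokens per chunk fit in the slack,
# so no extra user-facing latency is added after the first chunk.
for row in stitch_schedule(num_chunks=3, reasoning_chunk_tokens=70):
    print(row)
```

The key design point this illustrates: the reasoning chunk size N is a tunable budget, chosen so that reasoning decoding stays within the playback slack rather than delaying the next audio chunk.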