TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Recent efforts target spoken language models (SLMs) that not only listen but also speak, enabling more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE preserves essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by applying Low-Rank Adaptation to a pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that uses a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
💡 Research Summary
The paper introduces TASTE (Text‑Aligned Speech Tokenization and Embedding), a novel approach that directly aligns speech tokens with their corresponding text transcriptions during the tokenization stage, thereby addressing the modality gap that hampers joint text‑speech spoken language modeling (SLM). Traditional SLM pipelines first discretize speech into a long sequence of tokens (often using self‑supervised SSL representations or neural codecs) and then train a language model on those tokens. This creates two major problems: (1) a severe length mismatch between speech and text sequences, which forces complex interleaving, padding, or sequential generation schemes; and (2) redundancy, because speech tokens still encode the lexical content already captured by text tokens, wasting capacity that could be devoted to paralinguistic cues (prosody, speaker identity, emotion).
TASTE solves both issues by leveraging an off‑the‑shelf ASR model (Whisper) to obtain a high‑quality transcription v for each utterance u. The speech encoder (the frozen Whisper encoder) produces a deep hidden state h⁽ᴸ⁾ (last layer) and a shallow hidden state h⁽ˡ⁾ (mid‑layer). An attention‑based aggregator then takes the text token sequence as the query, h⁽ᴸ⁾ as keys, and h⁽ˡ⁾ as values. Because the query length equals the number of text tokens N, the output of this multi‑head attention is a compressed speech representation z of shape (N, d_z), i.e., it is already aligned in length with the text.
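The aggregation step above can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is a simplification for intuition only: the paper describes a multi-head, learned attention module, and the shapes, variable names, and random inputs here are illustrative assumptions, not the actual TASTE implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(text_queries, deep_keys, shallow_values):
    """Cross-attention that compresses T speech frames down to N text positions.

    text_queries   : (N, d)  -- one query per text token
    deep_keys      : (T, d)  -- deep-layer Whisper states h^(L), used as keys
    shallow_values : (T, d_z)-- shallow-layer states h^(l), used as values
    Returns z of shape (N, d_z): a speech representation length-aligned with text.
    """
    d = text_queries.shape[-1]
    scores = text_queries @ deep_keys.T / np.sqrt(d)  # (N, T) similarity
    weights = softmax(scores, axis=-1)                # each text token attends over frames
    return weights @ shallow_values                   # (N, d_z)

# Toy shapes: 4 text tokens attending over 50 speech frames (illustrative only).
rng = np.random.default_rng(0)
N, T, d, d_z = 4, 50, 8, 8
z = aggregate(rng.normal(size=(N, d)),
              rng.normal(size=(T, d)),
              rng.normal(size=(T, d_z)))
print(z.shape)  # (4, 8)
```

Because the query axis has length N, the output is aligned with the text sequence by construction, regardless of how many speech frames T the utterance contains.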
The compressed representation is discretized with Residual Vector Quantization (RVQ) across R stages, yielding a code sequence q = (q₁, …, q_N), where each q_n carries R codebook indices, so the discrete speech codes remain length-aligned with the N text tokens.
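A minimal sketch of the RVQ encoding loop, assuming fixed (here random) codebooks rather than the learned ones used in practice; all names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization over R stages.

    z         : (N, d) continuous representations to quantize
    codebooks : list of R arrays, each (K, d)
    At stage r, the residual is matched to its nearest code vector,
    which is then subtracted before the next stage refines the remainder.
    Returns (codes, z_hat): (N, R) index matrix and the (N, d) reconstruction.
    """
    residual = z.copy()
    codes = []
    for cb in codebooks:
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, K)
        idx = dists.argmin(axis=1)        # nearest code per position
        codes.append(idx)
        residual = residual - cb[idx]     # quantize the remainder next stage
    return np.stack(codes, axis=1), z - residual

# Toy example: N=4 positions, d=8 dims, R=3 stages, K=16 codes per stage.
rng = np.random.default_rng(1)
N, d, R, K = 4, 8, 3, 16
z = rng.normal(size=(N, d))
codebooks = [rng.normal(size=(K, d)) for _ in range(R)]
codes, z_hat = rvq_encode(z, codebooks)
print(codes.shape)  # (4, 3): R indices per text-aligned position
```

Each of the N positions thus carries R small integers instead of a dense vector, and the reconstruction z_hat is simply the sum of the selected code vectors across stages.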