Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model


Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to its long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.


💡 Research Summary

This paper investigates the core design choice of decoding paradigms in speech‑language models (Speech LMs) that generate both text and speech within a single model. While prior work has introduced several paradigms—interleaved, parallel, and Thinker‑Talker—comparisons have been confounded by differing base models, tokenizers, and training data. To enable a fair assessment, the authors adopt a single multimodal language model, Phi‑4‑MM (both 3.8 B and 7 B variants), the same speech tokenizer (S3Tokenizer), and identical supervised fine‑tuning (SFT) datasets across all experiments.

Three decoding strategies are examined in detail:

  1. Interleaved Decoding: Text and audio tokens are placed in a fixed ratio (e.g., 1:2) and generated alternately. Each newly generated token—whether text or audio—is fed back into the model for the next prediction. This approach yields the best alignment between modalities, as measured by the S2S/S2T accuracy ratio and word error rate (WER). However, once the textual portion is exhausted, the fixed ratio forces the insertion of many padding text tokens, inflating the sequence length by up to threefold and dramatically slowing inference.

  2. Parallel Decoding: In each forward pass the model predicts one text token together with multiple audio tokens. The embeddings of these tokens are averaged and fed back for the next step. While this reduces the number of decoding steps, the averaging operation weakens the explicit coupling between modalities, leading to lower alignment quality compared with the interleaved method.

  3. Thinker‑Talker: The “Thinker” (a large language model) generates only text, while a separate “Talker” autoregressively produces audio conditioned on the Thinker’s hidden states and the generated text. Only text tokens are fed back to the Thinker. This separation preserves the Thinker’s original text capabilities but makes it difficult to achieve tight speech‑text synchronization; the experiments show a noticeable drop in S2S/S2T ratio.

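The fixed-ratio schedule behind the interleaved paradigm, and the padding overhead it incurs, can be sketched in a few lines. This is an illustrative simulation, not the paper's implementation: the 1:2 ratio, the `<pad>` token name, and the placeholder token values are assumptions.

```python
# Sketch of a fixed-ratio interleaved schedule (assumption: 1:2 text-to-audio
# ratio with <pad> text tokens once the text is exhausted; token values are
# illustrative placeholders, not real model vocabulary).

def interleave(text_tokens, audio_tokens, ratio=2, pad="<pad>"):
    """Merge one text token with `ratio` audio tokens per step.

    When the text runs out before the audio does, padding text tokens are
    inserted in the text slots -- the sequence-length overhead the paper
    identifies as the main cost of interleaved decoding.
    """
    seq, t, a = [], 0, 0
    while a < len(audio_tokens):
        seq.append(text_tokens[t] if t < len(text_tokens) else pad)
        t += 1
        seq.extend(audio_tokens[a:a + ratio])
        a += ratio
    return seq

text = ["Hi", "there", "<eos>"]
audio = [f"a{i}" for i in range(12)]  # audio is usually much longer than text
seq = interleave(text, audio)
print(len(seq))            # 18 tokens: 12 audio + 3 text + 3 padding
print(seq.count("<pad>"))  # 3 wasted padding slots
```

Because real answers often need far more audio tokens than text tokens, the padded text slots dominate the tail of the sequence, which is exactly the inefficiency ESI removes.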
The authors confirm that, under identical conditions, the interleaved paradigm outperforms the other two in both textual and audio metrics. The main drawback of interleaved decoding is the computational overhead caused by excessive padding tokens after the EOS token.

Early‑Stop Interleaved (ESI) Paradigm
To mitigate the inefficiency, the paper proposes an Early‑Stop Interleaved (ESI) decoding scheme. After the model generates the end‑of‑sentence (EOS) token for text, a special token <S> is emitted to signal that the remainder of the sequence will consist solely of audio tokens. Consequently, the model no longer inserts padding text tokens, reducing the total sequence length to roughly 75 % of the original. Empirically, this yields a 30 %+ speed‑up in inference time while preserving, and in some cases slightly improving, alignment metrics. The authors hypothesize that padding tokens act as noise, diluting attention to earlier, semantically important tokens and thus harming performance.
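The ESI pattern can be sketched as a small change to the interleaving loop: once the text is exhausted, a single `<S>` marker is emitted and the remaining audio tokens follow with no text padding. Again, token names and the 1:2 ratio are illustrative assumptions, not the paper's exact format.

```python
# Sketch of the Early-Stop Interleaved (ESI) schedule (assumption: after the
# text <eos>, one <S> marker replaces all remaining padding text slots, so the
# tail of the sequence is audio-only; tokens are placeholders).

def esi_interleave(text_tokens, audio_tokens, ratio=2, stop="<S>"):
    seq, t, a = [], 0, 0
    # Interleave normally while text tokens remain.
    while t < len(text_tokens):
        seq.append(text_tokens[t])
        t += 1
        seq.extend(audio_tokens[a:a + ratio])
        a += ratio
    # One <S> token, then the rest of the audio with no text padding.
    seq.append(stop)
    seq.extend(audio_tokens[a:])
    return seq

text = ["Hi", "there", "<eos>"]
audio = [f"a{i}" for i in range(12)]
print(len(esi_interleave(text, audio)))  # 16 tokens vs. 18 for plain interleaving
```

In this toy example the saving is small, but the longer the audio tail relative to the text, the closer the saving approaches one token in every `ratio + 1`, consistent with the roughly 75% sequence length the paper reports.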

Speech QA Data Curation
Beyond decoding efficiency, the paper addresses the scarcity of high‑quality speech‑to‑speech question‑answering (SpokenQA) data. Starting from two well‑known text QA corpora—TriviaQA and Natural Questions—the authors (1) rewrite short answers into full conversational sentences using a large language model, (2) synthesize both questions and answers into speech via zero‑shot TTS with thousands of speaker prompts to increase speaker diversity, and (3) filter out samples whose automatic speech recognition (ASR) transcriptions have a word error rate above 20 %. They also incorporate the VoiceAssistant dataset, generating answer speech with zero‑shot TTS. The final training corpus totals roughly 800 hours of paired speech‑question and speech‑answer data.
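The WER-based filtering step in the curation pipeline can be sketched as follows. The WER function below is a standard word-level Levenshtein computation; the sample format and `keep` helper are hypothetical, and the paper's exact text normalization is not specified in this summary.

```python
# Sketch of the WER <= 20% filter used to discard noisy TTS/ASR round-trip
# samples (assumption: whitespace tokenization and no extra normalization).

def wer(ref, hyp):
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row: distance to each hyp prefix
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,              # deletion
                       d[j - 1] + 1,          # insertion
                       prev + (rw != hw))     # substitution / match
            prev = cur
    return d[len(h)] / max(len(r), 1)

def keep(sample, threshold=0.20):
    """Keep a synthesized sample only if its ASR transcript stays close
    to the reference text (WER at or below 20%)."""
    return wer(sample["text"], sample["asr"]) <= threshold

samples = [
    {"text": "the capital of france is paris",
     "asr":  "the capital of france is paris"},
    {"text": "the capital of france is paris",
     "asr":  "the capital france in paris here"},
]
print([keep(s) for s in samples])  # [True, False]
```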

Training and Evaluation
The Speech LM uses the Phi‑4‑MM speech encoder and LoRA adapters (rank 320) to keep the number of trainable parameters modest (460 M for the 3.8 B model, 707 M for the 7 B model). Evaluation is performed on three SpokenQA benchmarks: Llama Questions, Trivia‑QA, and Web Questions. Metrics include:

  • S2T accuracy: whether the reference answer appears in the generated text.
  • S2S accuracy: whether the reference answer appears in the transcribed speech (using Whisper‑large‑v3).
  • S2S/S2T ratio: a measure of how faithfully the speech reflects the textual answer.
  • WER between the transcribed speech and the generated text.
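The metrics above can be sketched as simple containment checks over a set of reference answers. The substring matching and case normalization here are assumptions; the paper's exact matching rules are not given in this summary.

```python
# Illustrative sketch of the SpokenQA metrics (assumption: an answer counts as
# correct if any reference string appears, case-insensitively, in the output).

def contains_answer(output, references):
    out = output.lower()
    return any(ref.lower() in out for ref in references)

def evaluate(samples):
    """samples: dicts with generated 'text', the ASR transcript of the
    generated speech ('speech_asr'), and a list of acceptable 'answers'."""
    s2t = sum(contains_answer(s["text"], s["answers"]) for s in samples)
    s2s = sum(contains_answer(s["speech_asr"], s["answers"]) for s in samples)
    n = len(samples)
    return {"S2T": s2t / n,
            "S2S": s2s / n,
            "S2S/S2T": s2s / s2t if s2t else 0.0}

demo = [
    {"text": "It is Paris.", "speech_asr": "it is paris", "answers": ["Paris"]},
    {"text": "I think Rome.", "speech_asr": "I think Roam.", "answers": ["Rome"]},
]
print(evaluate(demo))  # {'S2T': 1.0, 'S2S': 0.5, 'S2S/S2T': 0.5}
```

The second demo sample shows why the S2S/S2T ratio matters: the text answer is correct, but the spoken answer drifts, and the ratio penalizes exactly that misalignment.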

Results show that the ESI‑enabled interleaved model matches or exceeds the baseline interleaved model on all metrics while achieving a substantial reduction in latency. The parallel and Thinker‑Talker baselines lag behind, especially in the S2S/S2T ratio and WER, confirming the importance of tight token‑level coupling for speech‑text alignment.

Contributions and Impact

  1. Fair Comparative Study: By holding the base model, tokenizer, and data constant, the paper provides the first unbiased comparison of major decoding paradigms for joint speech‑text generation.
  2. Efficiency Innovation: The Early‑Stop Interleaved pattern demonstrates that careful sequence design can dramatically improve inference speed without sacrificing—and even slightly improving—quality.
  3. Data‑Driven Performance Boost: Curated high‑quality speech QA data significantly lifts Speech LM performance on spoken question answering tasks.

Future Directions
The authors suggest extending ESI to variable text‑audio ratios, exploring multi‑turn dialogue scenarios, and investigating adaptive stopping criteria that could further reduce latency in real‑time conversational agents.

In summary, this work clarifies the trade‑offs among decoding strategies, introduces a practical solution to the long‑standing efficiency bottleneck of interleaved decoding, and enriches Speech LM training with curated speech QA data, thereby advancing the feasibility of end‑to‑end spoken dialogue systems.

