Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective
Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) speech sequences are much longer than text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of these three factors separately by transitioning the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies: factor A has a relatively minor impact, factor B more noticeably affects syntactic and semantic modeling, and factor C exerts the most significant impact, particularly on basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways toward more effective end-to-end SLMs.
💡 Research Summary
The paper investigates why speech language models (SLMs) lag behind text‑based large language models (LLMs) in producing semantically coherent outputs when no explicit transcription is provided. The authors identify three plausible factors: (A) speech tokens are primarily phonetic and carry limited semantic content, (B) speech token sequences are substantially longer than their textual counterparts because they encode duration information, and (C) paralinguistic cues such as prosody, intonation, and timbre introduce additional variability. To isolate the impact of each factor, the study adopts a “modality‑evolving” experimental design that gradually transitions from pure text to phones and finally to discrete speech tokens, while keeping all other variables constant.
Dataset and Tokenizations
The experiments use the LibriHeavy‑large corpus (~50 k hours of English speech). Five tokenization schemes are constructed: (1) Text‑BPE – a 2048‑vocabulary sub‑word tokenizer trained on transcripts, representing the ideal semantic modality; (2) Phone‑Raw – raw phone symbols (~80 types) derived from Kaldi alignments; (3) Phone‑BPE – BPE applied to the phone stream, preserving the same vocabulary size as Text‑BPE; (4) Phone‑Repeat – phones repeated according to their duration, yielding a 50 Hz frame rate; and (5) Speech‑HuBERT – discrete tokens obtained by clustering HuBERT‑Large hidden states into 2048 units at 50 Hz, thereby embedding both phonetic and paralinguistic information. The token‑per‑second rates are 4.45 (Text‑BPE), 4.04 (Phone‑BPE), 9.97 (Phone‑Raw), and 50 (Phone‑Repeat and Speech‑HuBERT).
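The Phone‑Repeat scheme can be illustrated with a small sketch: each phone is repeated for its duration at a 50 Hz frame rate. This is an illustrative reconstruction of the idea, not the authors' code; the phones and durations below are made‑up examples.

```python
def expand_phones(phones, durations_s, frame_rate_hz=50):
    """Repeat each phone for its duration at the given frame rate,
    mimicking the Phone-Repeat tokenization (illustrative sketch)."""
    frames = []
    for phone, dur in zip(phones, durations_s):
        # number of 50 Hz frames this phone occupies
        frames.extend([phone] * round(dur * frame_rate_hz))
    return frames

# hypothetical example: "cat" as /K AE T/ with durations 0.06 s, 0.10 s, 0.08 s
seq = expand_phones(["K", "AE", "T"], [0.06, 0.10, 0.08])
```

Three phones become twelve frames here, which is how duration encoding inflates sequence length (factor B) relative to Phone‑Raw.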
Model and Training
All modalities are trained from scratch using the same TinyLlama architecture (1.1 B parameters, 22 transformer layers, grouped‑query attention). Training hyper‑parameters are identical across runs (AdamW, lr 4e‑4, cosine scheduler, batch size 128 on 4 × NVIDIA A800‑80G GPUs). Models are trained until validation loss converges, ensuring comparable exposure to the data.
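For reference, a cosine learning‑rate schedule of the kind named above can be sketched as follows. The paper only states "cosine scheduler, lr 4e‑4"; the warmup and floor parameters here are generic assumptions, not values from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr=4e-4, warmup_steps=0, min_lr=0.0):
    """Generic cosine learning-rate schedule (sketch; warmup_steps and
    min_lr are assumptions, only peak_lr matches the paper)."""
    if step < warmup_steps:
        # linear warmup to the peak rate
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    # cosine decay from peak_lr down to min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Keeping this schedule identical across all five tokenizations is what makes the per‑modality comparisons meaningful.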
Evaluation Tasks
Four zero‑shot tasks assess lexical, syntactic, semantic, and generative capabilities:
- sWUGGY – lexical discrimination between real and pseudo‑words.
- sBLIMP – grammaticality judgment of sentence pairs.
- Topic‑StoryCloze – selecting the more plausible continuation of a short story.
- Free continuation – autoregressive generation from 20 prompts; outputs are transcribed with Whisper‑large‑v3 and perplexity (PPL) is computed.
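The perplexity metric for the continuation task follows the standard formula, exp of the mean negative log‑likelihood over tokens. A minimal sketch (the scoring model applied to the Whisper transcripts is not detailed in this summary, so the inputs here are generic per‑token log‑probabilities):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the average negative log-likelihood."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# a sequence where every token has probability 1/2 yields PPL = 2
ppl = perplexity([math.log(0.5)] * 10)
```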
Results Overview
- Lexical (sWUGGY): Text‑BPE, Phone‑BPE, and Phone‑Raw achieve >85 % accuracy, indicating that factor A (phonetic vs. semantic token nature) has minimal effect on word‑level recognition. Speech‑HuBERT drops to ~51 %, revealing that the added paralinguistic content (factor C) severely hampers lexical modeling.
- Syntactic (sBLIMP): Phone‑Repeat suffers an 11.1 % accuracy loss relative to Phone‑Raw, showing that longer sequences (factor B) make syntax learning harder. Speech‑HuBERT further declines by 13.4 % compared to Phone‑Repeat, confirming that factor C compounds the difficulty.
- Semantic (Topic‑StoryCloze): Accuracy degrades progressively: Phone‑Raw 66.6 % → Phone‑Repeat 58.3 % → Speech‑HuBERT 52.9 %. Both factors B and C erode the model’s ability to capture higher‑order meaning.
- Generation (Continuation): Perplexity rises substantially for Phone‑Repeat (+88 %) and even more sharply for Speech‑HuBERT (+141 %). The explosion in token count and the presence of prosodic variability make coherent long‑form generation especially challenging.
Scaling Analysis
Following scaling‑law methodology, the authors plot task accuracy versus the number of tokens processed within the first epoch. All modalities exhibit roughly linear improvements, but Speech‑HuBERT’s slope is markedly shallower, indicating slower gains per token. Layer‑wise probing shows that Text‑BPE and Phone‑BPE acquire lexical patterns early (due to sub‑word priors), whereas Speech‑HuBERT only begins to improve in deeper layers, reflecting the need to disentangle paralinguistic noise before semantic signals emerge.
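The slope comparison described above amounts to fitting a line to accuracy as a function of tokens processed and comparing coefficients across modalities. A sketch with ordinary least squares (the trajectories below are hypothetical numbers, not the paper's data):

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# hypothetical accuracy trajectories over tokens processed (billions)
tokens   = [1, 2, 3, 4]
text_bpe = [60, 70, 80, 90]  # steeper: faster gains per token
hubert   = [50, 53, 56, 59]  # shallower: slower gains per token
```

A shallower slope for Speech‑HuBERT, as the paper reports, means each additional token of speech data buys less task accuracy than a token of text.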
Interpretation and Implications
The systematic modality transition isolates each hypothesized factor. Factor A (phonetic‑centric tokens) has limited impact on lexical tasks but does not dominate overall performance gaps. Factor B (sequence length) primarily harms syntactic and semantic modeling, as longer contexts increase the burden on attention mechanisms and exacerbate error propagation. Factor C (paralinguistic information) is the most detrimental, crippling even basic lexical discrimination and inflating generation perplexity. These findings suggest that successful end‑to‑end SLMs must either (i) effectively normalize or factor out paralinguistic variability, (ii) adopt architectures that handle very long sequences efficiently (e.g., hierarchical or memory‑augmented attention), and (iii) incorporate multimodal pre‑training that aligns speech representations with textual semantics.
Proposed Directions
Based on the analysis, the authors outline several avenues:
- Paralinguistic Factorization: Separate prosodic and timbral streams from the core linguistic stream, possibly via dedicated encoders or adversarial training.
- Length‑Efficient Modeling: Employ chunk‑wise processing, recurrent memory, or sparse attention to mitigate the quadratic cost of long token sequences.
- Cross‑Modal Alignment: Pre‑train on paired speech‑text data to align discrete speech tokens with textual embeddings, leveraging techniques such as contrastive learning or teacher‑student distillation.
- Hybrid Decoding: Combine a text‑level LLM for high‑level planning with a speech‑level decoder that injects realistic prosody, thereby retaining semantic coherence while preserving naturalness.
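The length‑efficient modeling direction can be made concrete with a causal block‑local attention mask, one simple way to cut the quadratic cost of long speech‑token sequences. This is a sketch of the general idea only; the paper proposes the direction but not this specific mechanism, and the chunk size is arbitrary.

```python
def block_local_mask(seq_len, chunk=4):
    """Boolean attention mask: position q may attend to position k only if
    k is in q's chunk and k <= q (causal block-local attention sketch)."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        start = (q // chunk) * chunk  # first position of q's chunk
        for k in range(start, q + 1):
            mask[q][k] = True
    return mask
```

With chunked masks like this, attention cost grows linearly in sequence length (seq_len × chunk entries) instead of quadratically, at the price of no direct cross‑chunk attention.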
Conclusion
The paper provides a rigorous, factor‑by‑factor dissection of why current SLMs fall short of their text‑based counterparts. By progressively evolving the modality, the authors demonstrate that paralinguistic complexity (factor C) is the principal obstacle, while sequence length (factor B) also contributes significantly, and token semantics (factor A) plays a relatively minor role. These insights chart a clear research roadmap for building truly end‑to‑end speech language models that can generate semantically coherent, natural‑sounding speech without relying on intermediate transcription.