Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near zero. Qwen2-Audio genuinely diverges, revealing that cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
End-to-end speech large language models (speech LLMs) such as Qwen2-Audio [1], Ultravox [2], Phi-4-Multimodal [3], and Gemini [4] accept audio directly and produce text responses, bypassing the traditional pipeline of automatic speech recognition followed by a text LLM. The implicit promise is richer understanding: by processing raw audio, these models should capture paralinguistic cues (prosody, emotion, emphasis) that ASR discards. This promise has driven substantial investment across diverse architectures, from discrete audio tokens [5] to learned connectors [2] to cross-attention [1] to dual encoders [6], yet whether this diversity translates to genuinely different internal processing remains untested.
But does this promise hold? On tasks that can be solved from a transcript alone (factual question answering, topic classification, sentiment analysis), do speech LLMs actually exploit acoustic information beyond what a simple Whisper-to-LLM cascade provides? Or do they converge on implicit text representations, effectively becoming cascades with extra steps?
Prior work offers fragments of an answer. LISTEN [7] showed speech LLMs rely on lexical cues even for emotion recognition, but tested only emotion tasks without cascade comparison or per-example error analysis. SALAD [8] measured the speech-text understanding gap within models but analyzed only aggregate accuracy, not whether speech and cascade pathways fail on the same examples in the same ways; critically, comparing speech vs. text input to a single model tests input robustness, not architectural equivalence. Broader benchmarks [9,10] also report only aggregate performance. Separately, Cuervo and Marxer [11] showed speech LLMs scale orders of magnitude less efficiently than text LLMs, raising the question of whether end-to-end speech processing is worth the investment when a cascade might suffice. The evaluation studies above share two limitations, to varying degrees. First, none analyzes per-example behavioral agreement (whether two systems produce the same answer on the same input), which is necessary to distinguish genuine architectural divergence from superficially similar accuracy. Second, cross-model comparisons (benchmarks, LISTEN) do not control for the LLM backbone. When Qwen2-Audio (built on Qwen2-7B) is compared against a Whisper + Qwen2.5-7B cascade, any observed difference conflates the speech processing architecture with the different reasoning capabilities of the underlying LLM. A model might appear to “diverge from its cascade” simply because its backbone reasons differently, not because it processes audio differently.
We introduce matched-backbone behavioral testing to resolve this confound. By constructing cascades that pair Whisper with the same LLM backbone used inside each speech LLM (Llama-3.1-8B for Ultravox, Qwen2-7B for Qwen2-Audio, Phi-4-mini for Phi-4-Multimodal), we isolate architectural effects from backbone effects. This reveals that the backbone confound alone can inflate apparent architectural divergence by up to +0.13 κ; controlling for it closes much of the gap toward the cascade ceiling.
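As a concrete illustration, a matched cascade is simply Whisper transcription followed by the same backbone LLM that the speech model was built on. The sketch below shows the Ultravox-matched case; the specific checkpoints and prompt format are illustrative assumptions, not the exact configuration used in our experiments.

```python
from transformers import pipeline

# Matched-backbone cascade sketch: Whisper ASR feeding the *same* LLM backbone
# used inside the speech LLM under test (here Llama-3.1-8B for Ultravox).
# Checkpoint names are assumptions; substitute the variants actually used.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def cascade_answer(audio_path: str, question: str, max_new_tokens: int = 64) -> str:
    transcript = asr(audio_path)["text"]
    prompt = f"Transcript: {transcript}\n\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The text-generation pipeline returns the prompt plus the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```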
We formalize the underlying question as the Cascade Equivalence Hypothesis: on tasks where the transcript is informationally sufficient, i.e., $I(A; Y \mid T) \approx 0$ with $A$ the audio, $T$ the transcript, and $Y$ the task label, a speech LLM should be behaviorally indistinguishable from a cascade using the same LLM backbone. We test this not through aggregate accuracy, which can mask disagreement at the example level, but through per-example behavioral metrics: Cohen's κ [12] for agreement, conditional error overlap for shared failure modes, and McNemar's test [13] for systematic directional bias.
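Concretely, all three metrics operate on paired per-example predictions from the speech LLM and its matched cascade. The sketch below shows one way to compute them; the Jaccard-style error-overlap definition and the exact (binomial) McNemar variant are illustrative choices rather than a specification of our implementation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import binomtest

def behavioral_comparison(y_true, pred_speech, pred_cascade):
    """Per-example agreement between a speech LLM and its matched cascade."""
    y_true = np.asarray(y_true)
    a, b = np.asarray(pred_speech), np.asarray(pred_cascade)

    # Cohen's kappa: chance-corrected agreement on per-example answers.
    kappa = cohen_kappa_score(a, b)

    # Conditional error overlap: of the examples that at least one system
    # gets wrong, the fraction that both get wrong (Jaccard overlap of errors).
    err_a, err_b = (a != y_true), (b != y_true)
    union = np.logical_or(err_a, err_b).sum()
    overlap = np.logical_and(err_a, err_b).sum() / union if union else float("nan")

    # McNemar's exact test on discordant pairs: is one system systematically
    # correct on examples where the other is wrong?
    n01 = int(np.sum(~err_a & err_b))   # speech right, cascade wrong
    n10 = int(np.sum(err_a & ~err_b))   # speech wrong, cascade right
    p = binomtest(n01, n01 + n10, 0.5).pvalue if (n01 + n10) else 1.0

    return {"kappa": kappa, "error_overlap": overlap, "mcnemar_p": p}
```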
Our results establish cascade equivalence as a spectrum across architectures, with mechanistic evidence explaining why. We detail our methodology (§2-3), behavioral findings (§4), mechanistic evidence (§5), and implications (§6-7). Our contributions are:
Consider a spoken input $A$, its transcript $T = \mathrm{ASR}(A)$, and a task label $Y$. We define the acoustic surplus for task $Y$ as

$$\Delta I_Y = I(A;\, Y \mid T),$$

the task-relevant information in the audio that the transcript does not preserve.
When $\Delta I_Y \approx 0$, the transcript preserves nearly all task-relevant information; we call such tasks text-sufficient (e.g., factual QA, topic classification, sentiment analysis). Tasks whose labels depend on prosodic cues that transcription discards, such as emotion recognition and sarcasm detection, are text-insufficient.
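Although $\Delta I_Y$ is not directly observable, it can be approximated with a plug-in estimate: the held-out cross-entropy of a classifier that sees only transcript features, minus that of a classifier that also sees audio features, is a proxy for $H(Y \mid T) - H(Y \mid T, A) = \Delta I_Y$. The sketch below illustrates this; the feature representations and the choice of estimator are illustrative assumptions, not the procedure used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def acoustic_surplus_estimate(X_text, X_audio, y, seed=0):
    """Plug-in proxy for Delta I_Y = I(A; Y | T), in nats: cross-entropy of a
    transcript-only classifier minus that of a transcript+audio classifier,
    both evaluated on a held-out split. Illustrative sketch only."""
    X_text, X_audio, y = np.asarray(X_text), np.asarray(X_audio), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, te = idx[: len(y) // 2], idx[len(y) // 2:]

    # Proxy for H(Y | T): classifier over transcript-derived features only.
    clf_t = LogisticRegression(max_iter=1000).fit(X_text[tr], y[tr])
    h_y_t = log_loss(y[te], clf_t.predict_proba(X_text[te]), labels=clf_t.classes_)

    # Proxy for H(Y | T, A): classifier that additionally sees audio features.
    X_both = np.hstack([X_text, X_audio])
    clf_ta = LogisticRegression(max_iter=1000).fit(X_both[tr], y[tr])
    h_y_ta = log_loss(y[te], clf_ta.predict_proba(X_both[te]), labels=clf_ta.classes_)

    # Near-zero values indicate a text-sufficient task.
    return max(0.0, h_y_t - h_y_ta)
```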
Text sufficiency is a property of the (task, ASR system) pair: what matters is whether a realistic ASR transcript, with its characteristic errors, preserves enough information for the downstream task. Billa [14] demonstrated this empirically, showing that even high-WER ASR transcripts in mismatched languages preserve enough structural regularity for downstream performance improvements.
When $\Delta I_Y \approx 0$, the transcript carries all task-relevant information, so any system that extracts $T$ from $A$, whether an explicit ASR module or a learned internal representation, will support the same task behavior.