Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA to Whisper’s absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the best balance between performance and memory efficiency. Our approach converts a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.


💡 Research Summary

The paper tackles a critical bottleneck in OpenAI’s Whisper speech‑recognition model: the linear growth of the key‑value (KV) cache during autoregressive decoding, which leads to excessive GPU memory consumption, especially for long‑form audio. To alleviate this, the authors introduce Whisper‑MLA, a variant that replaces the standard multi‑head attention (MHA) with Multi‑Head Latent Attention (MLA), a low‑rank attention mechanism that compresses keys and values into a shared latent space, dramatically shrinking the KV cache.
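To make the memory argument concrete, here is a minimal back-of-the-envelope sketch (not from the paper) comparing the per-sequence KV-cache footprint of standard MHA, which caches full keys and values for every decoded token, against an MLA-style cache that stores only a small shared latent plus a few preserved key dimensions. The dimensions and the accounting are illustrative; the paper's reported "up to 87.5%" figure comes from its own configuration and bookkeeping.

```python
# Illustrative KV-cache accounting. Dimensions are hypothetical examples
# (d_model = 768 as in Whisper-small); fp16 assumed, i.e. 2 bytes/element.
BYTES_PER_ELEM = 2

def mha_kv_cache_bytes(seq_len: int, d_model: int, n_layers: int) -> int:
    """Standard MHA caches full keys AND values: 2 * d_model per token/layer."""
    return 2 * seq_len * d_model * n_layers * BYTES_PER_ELEM

def mla_kv_cache_bytes(seq_len: int, d_latent: int, d_keep: int,
                       n_layers: int) -> int:
    """MLA-style cache: one shared latent of size d_latent plus d_keep
    preserved (uncompressed) key dimensions per token/layer."""
    return seq_len * (d_latent + d_keep) * n_layers * BYTES_PER_ELEM

# Hypothetical example: 4096 tokens, 12 decoder layers, 96-dim latent,
# 48 preserved key dimensions.
mha = mha_kv_cache_bytes(4096, 768, 12)
mla = mla_kv_cache_bytes(4096, 96, 48, 12)
print(f"MHA: {mha/2**20:.1f} MiB, MLA: {mla/2**20:.1f} MiB, "
      f"reduction: {1 - mla/mha:.1%}")
# → MHA: 144.0 MiB, MLA: 13.5 MiB, reduction: 90.6%
```

The key point is that the MHA cache grows with `2 * d_model` per token while the latent cache grows with a much smaller constant, so the gap widens linearly with decoded length.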

A major challenge is that Whisper relies on absolute sinusoidal positional embeddings, whereas the original MHA2MLA conversion framework assumes relative rotary position encodings (RoPE). Directly applying MLA would break Whisper’s positional handling. The authors therefore design a custom MLA that respects absolute positional subspaces. They split the key projection matrix into a “preserved” part (selected dimensions) and a “compressible” part. Two dimension‑selection strategies are explored: (1) uniform sampling of frequency subspaces, which evenly preserves high‑ and low‑frequency components, and (2) a head‑wise 2‑norm contribution metric that ranks subspaces by the average product of query and key 2‑norms, an upper bound on attention scores. Experiments show that retaining even a small fraction (≈6 % of key dimensions) yields far better ASR performance than fully compressing all dimensions.
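The two dimension-selection strategies above can be sketched as follows. The function names are hypothetical, and this is a simplified reading of the criteria: `uniform_select` picks evenly spaced indices (standing in for uniform sampling over frequency subspaces, treating single dimensions rather than (sin, cos) pairs for brevity), and `norm_select` ranks dimensions by the product of query and key column 2-norms, mirroring the idea that this product upper-bounds a dimension's contribution to attention scores.

```python
import numpy as np

def uniform_select(d_key: int, n_keep: int) -> list[int]:
    """Strategy 1: evenly spaced indices, so both high- and low-frequency
    sinusoidal subspaces are represented among the preserved dimensions."""
    stride = d_key // n_keep
    return list(range(0, d_key, stride))[:n_keep]

def norm_select(Q: np.ndarray, K: np.ndarray, n_keep: int) -> list[int]:
    """Strategy 2: score each dimension by the product of its query and key
    2-norms (computed over a batch of tokens) and keep the top-scoring ones."""
    score = np.linalg.norm(Q, axis=0) * np.linalg.norm(K, axis=0)
    return sorted(np.argsort(score)[-n_keep:].tolist())

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
K[:, 3] *= 10.0  # inflate one dimension so the norm criterion must keep it
print(uniform_select(8, 4))       # → [0, 2, 4, 6]
print(3 in norm_select(Q, K, 4))  # → True
```

The preserved indices would then define the "preserved" slice of the key projection; everything else falls into the compressible part.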

The conversion pipeline follows the MHA2MLA methodology: query weights remain unchanged; key weights are split; the compressible key part and the full value matrix are concatenated and jointly factorized via Singular Value Decomposition (SVD). The resulting low‑rank factors define latent projections for keys and values, enabling a shared latent space that maximizes parameter reuse from the pretrained Whisper checkpoint. This approach avoids costly training from scratch, requiring only modest fine‑tuning.
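The joint factorization step can be sketched with a toy example. For the demo the weights are constructed to share an exactly rank-`d_latent` structure, so the truncated SVD is lossless; real pretrained weights are only approximately low-rank, which is why the paper follows conversion with fine-tuning. All dimensions and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 16, 4

# Toy "pretrained" weights: compressible key part and value matrix, built
# from a shared latent structure A so the joint rank is exactly d_latent.
A = rng.normal(size=(d_model, d_latent))
W_kc = A @ rng.normal(size=(d_latent, d_model))  # compressible key projection
W_v  = A @ rng.normal(size=(d_latent, d_model))  # value projection

# Concatenate along the output dimension and factorize jointly, so keys and
# values share one down-projection into the latent space.
W_kv = np.concatenate([W_kc, W_v], axis=1)       # (d_model, 2*d_model)
U, s, Vt = np.linalg.svd(W_kv, full_matrices=False)
W_down = U[:, :d_latent] * s[:d_latent]          # input  -> latent
W_up   = Vt[:d_latent]                           # latent -> concatenated K, V

# Split the up-projection back into separate key and value heads.
W_up_k, W_up_v = W_up[:, :d_model], W_up[:, d_model:]
```

At inference only the `d_latent`-dimensional output of `W_down` needs to be cached per token; keys and values are recovered on the fly through `W_up_k` and `W_up_v`.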

Two architectural variants are evaluated. Whisper‑MLA (Full) substitutes MLA in encoder self‑attention, decoder self‑attention, and cross‑attention, testing the general applicability of MLA across all attention modules. Whisper‑MLA (Decoder Self‑Attention Only, DSO) applies MLA solely to the decoder’s self‑attention, the only component that generates a growing KV cache during inference. Because the encoder’s KV cache is static after a single forward pass, DSO preserves the encoder’s pretrained acoustic representations while still achieving the same cache reduction as the full conversion.
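The two variants differ only in which attention modules are swapped for MLA, which a minimal configuration sketch makes explicit (the dictionary layout and names are a hypothetical illustration, not the paper's code):

```python
# Which attention modules each variant converts to MLA, per the paper's design.
VARIANTS = {
    "full": {"encoder_self_attn": True,  "decoder_self_attn": True, "cross_attn": True},
    "dso":  {"encoder_self_attn": False, "decoder_self_attn": True, "cross_attn": False},
}

def modules_to_convert(variant: str) -> list[str]:
    """Return the attention modules replaced by MLA for the given variant."""
    return [name for name, flag in VARIANTS[variant].items() if flag]

print(modules_to_convert("dso"))  # → ['decoder_self_attn']
```

Only the decoder self-attention cache grows during autoregressive decoding, which is why the DSO variant captures the full cache saving while leaving the pretrained encoder untouched.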

The experimental setup uses the Whisper‑small model (244 M parameters, pretrained on 680k hours of speech) as the baseline. The model is converted to Whisper‑MLA, then fine‑tuned on the 960‑hour LibriSpeech corpus for three epochs on a single RTX 4090 (24 GB). The dimension‑preserving scheme keeps 48 of the 768 key dimensions and projects the remaining keys plus all values into a 96‑dimensional latent space. Results show up to 87.5% KV‑cache reduction with minimal impact on word error rate (WER). The best DSO configuration (uniform sampling) incurs only a 0.17% absolute WER increase relative to a fully fine‑tuned Whisper baseline, while achieving an average 81.25% memory reduction. Full‑conversion models also reduce memory but suffer larger WER degradation, confirming that preserving the encoder and cross‑attention is crucial for accuracy.

GPU memory profiling across batch sizes (1–64) and sequence lengths (256–4096) demonstrates that Whisper‑MLA consistently consumes roughly half the memory of the original Whisper. At batch size 16 and sequence length 4096, Whisper‑MLA uses about 50% of the baseline's memory; at batch size 64 and length 2048, Whisper exceeds the 24 GB limit (out of memory), whereas Whisper‑MLA runs comfortably at ~15.4 GB.

In conclusion, the paper demonstrates that MLA can be adapted to models with absolute positional embeddings and that applying it selectively to the decoder’s self‑attention yields an optimal trade‑off between memory efficiency and transcription accuracy. The proposed conversion is parameter‑efficient, requires only modest fine‑tuning, and enables deployment of large‑scale speech models on memory‑constrained hardware, particularly for long‑form audio tasks. Future work may explore dynamic rank selection, more sophisticated subspace preservation criteria, and real‑time streaming scenarios.

