Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition
Self-supervised learning (SSL) has advanced speech processing but suffers from the quadratic complexity of self-attention. To address this, SummaryMixing (SM) was proposed as a linear-time alternative that summarizes the entire utterance with mean pooling, but it lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, preserving efficiency while better capturing temporal dependencies. We also introduce a selective fine-tuning approach that replaces self-attention layers in SSL models with WSM blocks and fine-tunes only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40%. WSM blocks retain linear-time complexity with enhanced context awareness, and selectively replacing some attention layers reduces compute, memory, and latency, making the method well suited to low-resource speech recognition.
💡 Research Summary
The paper tackles the quadratic computational and memory burden of self‑attention in transformer‑based self‑supervised learning (SSL) speech models, which hampers their deployment in low‑resource scenarios. While the previously introduced SummaryMixing (SM) replaces pairwise attention with a global mean summary (s_g) and achieves linear‑time complexity, it suffers from a lack of fine‑grained temporal context, leading to sub‑optimal automatic speech recognition (ASR) performance.
To overcome this limitation, the authors propose Windowed SummaryMixing (WSM). In addition to the global summary, WSM computes a local window summary (s_wt) for each time step by averaging the representations of k frames before and after the current frame (a window size of 2k + 1). The final representation for frame t is obtained by concatenating the transformed frame, the global summary, and the local window summary, followed by a feed‑forward network:
s_wt = (1/(2k+1)) ∑_{j=t−k}^{t+k} FF(h_j)
y_t = FF(concat(FF(h_t), s_g, s_wt))
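The two equations above can be sketched directly in numpy. This is a minimal illustration, not the authors' implementation: each feed-forward transform FF is stood in by a single random linear map, and the window is simply clipped at utterance boundaries (an assumption, since the paper's boundary handling is not specified here).

```python
import numpy as np

def wsm_block(h, k=5, seed=0):
    """Sketch of a Windowed SummaryMixing block; h is a (T, d) array
    of frame representations. Each FF from the paper is stood in by
    one linear map for brevity."""
    T, d = h.shape
    rng = np.random.default_rng(seed)
    W_s = rng.standard_normal((d, d)) / np.sqrt(d)          # summary transform
    W_f = rng.standard_normal((d, d)) / np.sqrt(d)          # per-frame transform
    W_o = rng.standard_normal((3 * d, d)) / np.sqrt(3 * d)  # output transform

    s = h @ W_s                  # FF(h_j), shared by both summaries
    s_g = s.mean(axis=0)         # global mean summary s_g (plain SummaryMixing)

    # Local window summary: s_wt = mean of s_j for j in [t-k, t+k],
    # with the window clipped at utterance boundaries.
    s_w = np.stack([s[max(0, t - k): t + k + 1].mean(axis=0) for t in range(T)])

    # y_t = FF(concat(FF(h_t), s_g, s_wt))
    y = np.concatenate([h @ W_f,
                        np.broadcast_to(s_g, (T, d)),
                        s_w], axis=1) @ W_o
    return y
```

The explicit per-frame loop above is written for clarity; the same windowed means can be computed without it, keeping the block linear in T.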
Through empirical sweeps (k ∈ {3,5,7,9}) the authors find k = 5 consistently yields the lowest word error rate (WER), and they fix this value for all subsequent experiments. Importantly, the added local summary does not increase asymptotic complexity; the overall operation remains O(T) with respect to sequence length.
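The linearity claim rests on computing all T windowed means without a per-frame loop over the window. A cumulative-sum formulation (a standard sliding-window trick, not necessarily the authors' implementation) does this in O(T) regardless of k:

```python
import numpy as np

def windowed_means(s, k):
    """O(T) sliding-window means over the rows of s (shape (T, d)),
    window [t-k, t+k] clipped at sequence boundaries, via prefix sums."""
    T, d = s.shape
    # csum[i] = sum of rows 0..i-1, so csum[hi] - csum[lo] = sum of rows lo..hi-1.
    csum = np.concatenate([np.zeros((1, d)), np.cumsum(s, axis=0)])
    lo = np.clip(np.arange(T) - k, 0, None)
    hi = np.minimum(np.arange(T) + k + 1, T)
    return (csum[hi] - csum[lo]) / (hi - lo)[:, None]
```

Each frame's window mean becomes two lookups and a division, so the cost per frame is constant in k and the whole pass stays O(T).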
Beyond the architectural change, the paper introduces a selective fine‑tuning strategy. Instead of updating all parameters of a large SSL model, only the last two self‑attention layers are replaced with WSM blocks, and only these new blocks are fine‑tuned on the downstream ASR task. This approach preserves the rich, pre‑trained knowledge in the frozen layers while focusing learning capacity on the more efficient WSM modules, thereby reducing the risk of over‑fitting on small labeled corpora.
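The selective fine-tuning recipe can be sketched framework-agnostically: freeze every pre-trained layer, swap the final two self-attention layers for freshly initialized WSM blocks, and train only those. The toy `Layer` class and layer names below are illustrative placeholders, not any real library's API:

```python
class Layer:
    """Stand-in for one encoder layer; `trainable` mimics a framework's
    requires-gradient flag."""
    def __init__(self, name, trainable=True):
        self.name, self.trainable = name, trainable

def selective_wsm_finetune(layers, n_replace=2):
    """Freeze all pre-trained layers, then replace the last n_replace
    self-attention layers with trainable WSM blocks (the paper's protocol:
    only these new blocks are updated on the downstream ASR task)."""
    for layer in layers:
        layer.trainable = False                        # preserve SSL knowledge
    for i in range(len(layers) - n_replace, len(layers)):
        layers[i] = Layer(f"wsm_{i}", trainable=True)  # only WSM blocks learn
    return layers

encoder = [Layer(f"attn_{i}") for i in range(12)]      # e.g. a 12-layer encoder
encoder = selective_wsm_finetune(encoder)
trainable = [layer.name for layer in encoder if layer.trainable]
```

In a real framework the freeze step would set `requires_grad` (or the equivalent) on each parameter, but the control flow is the same: the optimizer only ever sees the two new WSM blocks.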
The authors evaluate the method on six datasets covering both Indian (Hindi, Tamil) and non‑Indian (Mexican Spanish, Mandarin, Arabic, and a US English conversational corpus) languages. They experiment with several popular SSL backbones: wav2vec 2.0, HuBERT, data2vec (monolingual) and XLS‑R, mHuBERT, MMS (multilingual). For each backbone they compare four encoder variants: baseline SM, the proposed WSM, a pretrained attention block (Att‑PT), and a randomly initialized attention block (Att‑Scratch). They also explore different layer‑replacement depths, ultimately finding that swapping only the final two layers offers the best trade‑off between performance and efficiency.
Key results include:
- WSM consistently outperforms SM and matches or exceeds Att‑PT while using far less memory. Peak VRAM drops from ~50 GB (Att‑PT/Att‑Scratch) to 30 GB (SM) and 32 GB (WSM), a 40 % reduction.
- Across all models and languages, WSM reduces WER/CER by 0.2–2.5 absolute points compared to SM. For example, XLS‑R’s Spanish WER improves from 28.09 % to 26.42 %, and Arabic from 40.34 % to 38.21 %.
- Inference speed gains become pronounced for longer utterances; at 100 seconds input length, WSM achieves a 25 % speedup over traditional attention, making it suitable for real‑time applications.
- Training time is also shortened because only a small subset of parameters is updated; fine‑tuning runs on a single NVIDIA H100 GPU with batch size 16 for 25 epochs per dataset.
The paper’s contributions are threefold: (1) an enhanced linear‑time mixing block that injects local context via windowed summaries, (2) a selective fine‑tuning protocol that replaces only the last two attention layers with WSM blocks, and (3) extensive empirical validation showing that this combination yields both accuracy gains and substantial resource savings across monolingual and multilingual SSL models.
In conclusion, Windowed SummaryMixing offers a practical path to deploy powerful SSL speech models in environments with limited computational resources or scarce labeled data. Future work will explore extending WSM to other speech tasks such as emotion recognition and speaker verification, and investigating adaptive window sizing or learnable summary mechanisms to further boost performance.