Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning


Continual Learning (CL) in Automatic Speech Recognition (ASR) suffers from catastrophic forgetting when adapting to new tasks, domains, or speakers. A common strategy to mitigate this is to store a subset of past data in memory for rehearsal. However, rehearsal-based methods face key limitations: storing data is often costly, infeasible with pre-trained models, or restricted by privacy regulations. Running existing rehearsal-based methods with smaller memory sizes to alleviate these issues usually leads to degraded performance. We propose a rehearsal-based CL method that remains effective even with minimal memory. It operates in two stages: first, fine-tuning on the new task; second, applying Singular Value Decomposition (SVD) to the changes in linear layers and, in a parameter-efficient manner, retraining only gating vectors on the singular values, which control the extent to which updates from the first stage are accepted, using rehearsal. We extensively test and analyze our method on two monolingual and two multilingual benchmarks. Our method reduces forgetting and outperforms state-of-the-art CL approaches for ASR, even when limited to a single utterance per previous task.


💡 Research Summary

The paper tackles catastrophic forgetting in continual learning (CL) for automatic speech recognition (ASR) when models must adapt to new tasks, domains, speakers, or languages. Traditional rehearsal‑based CL mitigates forgetting by storing a subset of past data, but real‑world constraints—privacy regulations, storage limits, and the use of large pre‑trained models—make large memory buffers impractical. Moreover, performance typically collapses when the memory size is reduced.

The authors propose a memory‑efficient rehearsal method called Singular Value‑based Rehearsal (SVR). SVR operates in two stages. In Stage 1 the model is fine‑tuned on the new task in the usual way, producing an updated parameter set θ̃_t. The weight change ΔW_t in each linear layer (the difference between the fine‑tuned weight matrix and the previous weight matrix) is then decomposed by singular value decomposition (SVD): ΔW_t = U Σ Vᵀ = ∑_{i=1}^{k} s_i u_i v_iᵀ. Each rank‑one component represents a principal direction of adaptation, but not all components are equally beneficial; some improve the new task while harming previously learned tasks.
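The decomposition step above can be sketched with NumPy; shapes and values here are illustrative toys, not taken from the paper:

```python
import numpy as np

# Toy linear-layer weights before and after fine-tuning on a new task
# (illustrative sizes; real ASR layers are much larger).
rng = np.random.default_rng(0)
d_out, d_in = 8, 6
W_prev = rng.standard_normal((d_out, d_in))
W_new = W_prev + 0.1 * rng.standard_normal((d_out, d_in))

# Decompose the fine-tuning update into rank-one components.
delta_W = W_new - W_prev
U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)  # k = min(d_out, d_in)

# Each term s[i] * outer(U[:, i], Vt[i]) is one principal direction
# of adaptation; summing all k of them recovers delta_W exactly.
reconstruction = (U * s) @ Vt
assert np.allclose(reconstruction, delta_W)
```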

To control the influence of each component, a learnable gating vector α∈ℝ^k is introduced. The singular values are scaled by σ(α_i) (σ is the sigmoid), yielding a modified update Δ\hat{W}_t = U diag(σ(α)⊙s) Vᵀ. Only α is trainable in Stage 2; the matrices U, Σ, V, and the original weights remain fixed. Consequently, the number of trainable parameters per linear layer is reduced from d_out × d_in to at most k (typically a few hundred), dramatically lowering memory and compute requirements.
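The gating mechanism can be illustrated as follows. The key property is that the gates interpolate between "no update" (σ(α)≈0 recovers the previous weights) and "full update" (σ(α)≈1 recovers the Stage 1 fine-tuning result); the sizes below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(U, s, Vt, alpha):
    """Scale each singular value by sigmoid(alpha_i), then re-compose
    the update: U @ diag(sigmoid(alpha) * s) @ Vt."""
    return (U * (sigmoid(alpha) * s)) @ Vt

rng = np.random.default_rng(1)
delta_W = rng.standard_normal((8, 6))          # Stage 1 weight change (toy)
U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)
k = s.shape[0]

# Very negative alpha -> sigmoid(alpha) ~ 0 -> essentially no update applied.
assert np.allclose(gated_update(U, s, Vt, np.full(k, -10.0)), 0.0, atol=1e-3)

# Very positive alpha -> sigmoid(alpha) ~ 1 -> full fine-tuning update recovered.
assert np.allclose(gated_update(U, s, Vt, np.full(k, 10.0)), delta_W, atol=1e-3)
```

Note the parameter saving: only the k-dimensional α is trained here (6 values in this toy), versus the 8 × 6 = 48 entries of the full weight matrix.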

Stage 2 trains α jointly on (i) the new task’s data and (ii) a tiny rehearsal memory M containing a few utterances from all previous tasks. The loss combines the standard cross‑entropy (CTC + decoder) term with a knowledge‑distillation term that encourages the current model to mimic the output distribution of the previous model θ_{t‑1}. The memory loss is weighted by (t‑1)/2 to reflect that the buffer represents all prior tasks. α is initialized so that σ(α)≈0, ensuring the model starts from the previous task’s parameters and stays within a low‑loss region for old tasks. Non‑linear layers, biases, and convolutional parameters are frozen after being set to the average of their values before and after Stage 1.
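A minimal sketch of how such a combined objective might look, under the assumption of simple per-frame logits and labels. The function names and the `kd_weight` knob are illustrative; the paper's actual loss operates on CTC and decoder outputs rather than plain cross-entropy:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the correct labels (toy stand-in
    # for the CTC + decoder loss).
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def kd_loss(student_logits, teacher_logits):
    # Distillation term: cross-entropy of the student against the
    # previous model's (teacher's) output distribution.
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))

def stage2_loss(new_logits, new_labels, mem_logits, mem_labels,
                teacher_mem_logits, t, kd_weight=1.0):
    """Hypothetical combined objective: new-task loss plus a memory term
    (label loss + distillation against θ_{t-1}) weighted by (t - 1) / 2."""
    loss_new = cross_entropy(new_logits, new_labels)
    loss_mem = (cross_entropy(mem_logits, mem_labels)
                + kd_weight * kd_loss(mem_logits, teacher_mem_logits))
    return loss_new + (t - 1) / 2 * loss_mem
```

With t = 1 (the first task) the memory weight (t−1)/2 is zero, so the objective reduces to the plain new-task loss, consistent with there being no prior tasks to rehearse.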

The algorithm proceeds as follows: (1) fine‑tune on the new task; (2) apply SVD to each linear layer and create α; (3) freeze all other parameters (averaged); (4) iterate over mini‑batches from the new task and the memory, updating α via the combined loss; (5) after training, add a few new utterances to the memory (either fixed‑size or growing). The authors evaluate SVR on four benchmarks: two monolingual setups (microphone shift and accent shift) and two multilingual setups (task‑specific adapters and a Whisper‑style foundation model). Memory budgets range from a single utterance per prior task to a few utterances.
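Step (5), the memory update, can be sketched as below. This is an illustrative buffer policy, not the paper's exact one: it stores a few random utterances per finished task and, if a fixed budget is given, evicts from the largest task's share to stay within it:

```python
import random

def update_memory(memory, task_id, utterances, per_task=1, budget=None):
    """Illustrative rehearsal-buffer update (hypothetical policy).

    memory: dict mapping task id -> list of stored utterances.
    per_task: how many utterances to sample from the finished task.
    budget: optional fixed total size; None means the buffer grows.
    """
    memory[task_id] = random.sample(utterances, min(per_task, len(utterances)))
    if budget is not None:
        # Shrink toward the budget by evicting from the largest share.
        while sum(len(v) for v in memory.values()) > budget:
            largest = max(memory, key=lambda tid: len(memory[tid]))
            memory[largest].pop()
    return memory
```

Usage: after finishing task 1, `update_memory({}, 1, task1_utts, per_task=1)` stores one utterance, matching the paper's smallest setting of a single utterance per previous task.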

Results show that SVR consistently outperforms state‑of‑the‑art rehearsal methods (e.g., Experience Replay, GEM) and regularization methods (EWC, LwF) even when the memory contains only one utterance per previous task. Word error rate (WER) reductions of 2–3 % (absolute) are reported compared to the best baselines. Analysis of the learned gating vectors reveals a near‑binary pattern: most σ(α_i) are close to 0 or 1, indicating that the model either fully suppresses or fully accepts each rank‑one update. Sensitivity experiments demonstrate that performance saturates once k reaches about half the smaller dimension of the weight matrix, and that the method remains stable across a wide range of memory sizes.

In summary, the paper introduces a novel CL technique for ASR that leverages SVD to decompose weight updates and learns a compact set of gating parameters using an extremely small rehearsal buffer. By updating only these gates, SVR achieves strong forgetting mitigation with negligible memory overhead, offering a practical solution for privacy‑sensitive or resource‑constrained deployment scenarios. Future work may extend the gating mechanism to non‑linear layers and explore online streaming settings.

