Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR


Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a finetuned model and those from its pretrained counterpart, encode task-specific information that complements finetuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to finetuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64%, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.


💡 Research Summary

The paper tackles the persistent difficulty of child automatic speech recognition (ASR) when using self‑supervised learning (SSL) models that were originally trained on large adult speech corpora. Two main obstacles are highlighted: (1) the scarcity of labeled child speech data, and (2) a domain mismatch between the pre‑training data (mostly adult speech) and the downstream child‑speech task. While fine‑tuning an SSL model on child data does improve performance, it also shifts the internal representation space in ways that are not directly exploited by standard fusion techniques.

Inspired by “task vectors” in model‑merging literature, the authors extend the idea to the representation level. For any SSL encoder f_i they define a delta embedding ΔE_i = E_i^ft – E_i^pt, where E_i^ft is the embedding produced by the fine‑tuned model and E_i^pt is the embedding from the frozen pre‑trained checkpoint. This delta captures the task‑specific adjustments introduced during fine‑tuning. The central hypothesis is that Δ‑embeddings contain complementary information that, when fused with the fine‑tuned embeddings of a different SSL model, can improve ASR performance, especially under low‑resource conditions.
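The delta embedding is just a frame-wise difference between two encoders that share the same architecture. A minimal numpy sketch of the ΔE_i = E_i^ft − E_i^pt definition (the toy arrays stand in for real SSL features, which would come from, e.g., the last transformer layer of HuBERT):

```python
import numpy as np

def delta_embedding(e_ft: np.ndarray, e_pt: np.ndarray) -> np.ndarray:
    """Delta SSL embedding: the representation shift induced by fine-tuning.

    e_ft: frame-level embeddings from the fine-tuned encoder, shape (T, D)
    e_pt: embeddings from the frozen pre-trained checkpoint, same shape
    """
    assert e_ft.shape == e_pt.shape, "both encoders must share an architecture"
    return e_ft - e_pt

# Toy frame-level embeddings (T=4 frames, D=8 dims) -- placeholders for
# actual SSL encoder outputs on the same utterance.
rng = np.random.default_rng(0)
e_pt = rng.standard_normal((4, 8))
e_ft = e_pt + 0.1 * rng.standard_normal((4, 8))  # small fine-tuning shift
delta = delta_embedding(e_ft, e_pt)
```

By construction the pre-trained embedding plus its delta recovers the fine-tuned embedding, which is what makes the delta a representation-level analogue of a parameter-level task vector.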

Three widely used SSL encoders are examined: Wav2Vec 2.0, HuBERT, and WavLM. All three are fine‑tuned on the MyST children’s corpus (≈133 h training, 21 h dev, 25 h test) using a CTC loss. WavLM consistently yields the strongest single‑model results, so it is used as the reference model. The authors fuse WavLM’s fine‑tuned last‑layer embeddings with Δ‑embeddings derived from HuBERT or Wav2Vec 2.0. Three fusion strategies are evaluated: (i) a learnable weighted sum, (ii) simple concatenation, and (iii) cross‑attention where WavLM embeddings query the Δ‑embeddings. Across all data‑size regimes (full, 10 h, 5 h, 1 h) concatenation outperforms the other two, likely because it preserves the full dimensionality of both sources without imposing restrictive linear combinations that can under‑utilize complementary cues.
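The three fusion strategies can be sketched in a few lines each. This is an illustrative skeleton, not the authors' implementation: the scalar weight and attention projections are learnable in the real system, and the toy arrays stand in for WavLM and delta features.

```python
import numpy as np

rng = np.random.default_rng(1)
wavlm_ft = rng.standard_normal((4, 8))   # fine-tuned WavLM features (T, D)
delta_h = rng.standard_normal((4, 8))    # delta HuBERT features (T, D)

def weighted_sum(a, b, alpha=0.5):
    # (i) weighted sum -- alpha is a learnable weight in the real system
    return alpha * a + (1.0 - alpha) * b

def concat_fuse(a, b):
    # (ii) concatenation -- keeps the full dimensionality of both sources
    return np.concatenate([a, b], axis=-1)

def cross_attention(queries, keys_values):
    # (iii) cross-attention -- WavLM frames query the delta embeddings;
    # single head, no learned projections (illustration only)
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

fused = concat_fuse(wavlm_ft, delta_h)   # shape (4, 16)
```

The shapes make the paper's intuition concrete: concatenation doubles the feature dimension instead of forcing both sources through a shared D-dimensional bottleneck, which is one plausible reason it under-utilizes complementary cues less than the other two.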

Performance results are striking. On the full MyST training set, WavLM + ΔWav2Vec 2.0 (concatenated) achieves a word error rate (WER) of 9.64 %, a new state‑of‑the‑art among SSL‑based systems on this corpus. In the extreme 1‑hour low‑resource scenario, WavLM + ΔHuBERT reduces WER to 22.74 % (a 10 % relative improvement over using HuBERT fine‑tuned embeddings alone), while WavLM + ΔWav2Vec 2.0 yields a 4.4 % relative gain. These gains are statistically significant (p < 0.05) across all conditions.

To understand why Δ‑embeddings help, the authors conduct a layer‑wise Canonical Correlation Analysis (CCA) using the PWCCA variant. They first measure similarity between pre‑trained and fine‑tuned representations, confirming that fine‑tuning mainly alters the upper transformer layers. Next, they compute similarity between fine‑tuned embeddings and their Δ‑counterparts. Both HuBERT and Wav2Vec 2.0 show stable similarity through middle layers, with a sharp drop at the final layer, indicating that Δ‑embeddings capture the task‑specific shift concentrated in the top layers. Notably, ΔWav2Vec 2.0 exhibits a steeper final‑layer drop than ΔHuBERT, suggesting stronger task‑specific deviations, which aligns with its superior complementarity in fusion experiments.
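The layer-wise similarity measurement can be approximated with plain linear CCA over activation matrices. The sketch below omits the projection weighting that distinguishes PWCCA from vanilla CCA, and the activation matrices are synthetic stand-ins for per-layer SSL features:

```python
import numpy as np

def cca_similarity(X, Y):
    """Mean canonical correlation between two (T, D) activation matrices.

    Simplified linear CCA: canonical correlations are the singular values
    of Qx^T Qy, where Qx, Qy are orthonormal bases of the centered data.
    The paper uses the PWCCA variant, which additionally weights each
    canonical direction by how much of X it explains.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(X)
    qy, _ = np.linalg.qr(Y)
    sigma = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(sigma, 0.0, 1.0).mean())

rng = np.random.default_rng(2)
layer_pt = rng.standard_normal((200, 16))                     # pre-trained layer
layer_ft = layer_pt + 0.05 * rng.standard_normal((200, 16))   # lightly shifted
```

Comparing a layer against its lightly perturbed copy yields near-1 similarity, mirroring the stable middle layers; a large shift, as fine-tuning induces in the top layers, drives the score down.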

Cross‑domain experiments further validate the approach: Δ‑embeddings derived from models fine‑tuned on 100 h of LibriSpeech (adult speech) still improve WavLM performance, though not as much as in‑domain Δ‑embeddings. This demonstrates that Δ‑embeddings encode general task‑specific knowledge that can transfer across domains, echoing prior findings that adult speech data can benefit child ASR.

Finally, a Mixture‑of‑Experts (MoE) gating analysis reveals that the model assigns non‑trivial weights to both WavLM and Δ‑embeddings during inference. The average gating weight for WavLM is lower when fused with ΔWav2Vec 2.0 than with ΔHuBERT, correlating with the lower WER and indicating that ΔWav2Vec 2.0 provides more complementary information due to its distinct pre‑training objectives.
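A minimal sketch of the two-expert soft gate described above, assuming a per-frame softmax over a WavLM branch and a delta branch; the gate logits would come from a small learned network in the real system, and the returned mean WavLM weight is the quantity the paper's gating analysis inspects:

```python
import numpy as np

def moe_fuse(wavlm, delta, gate_logits):
    """Soft two-expert gating over WavLM and delta features (illustrative).

    wavlm, delta: (T, D) frame-level features
    gate_logits:  (T, 2) per-frame logits from a gating network
    """
    z = gate_logits - gate_logits.max(axis=-1, keepdims=True)  # stability
    g = np.exp(z)
    g /= g.sum(axis=-1, keepdims=True)            # each row sums to 1
    fused = g[:, :1] * wavlm + g[:, 1:] * delta   # convex combination
    return fused, float(g[:, 0].mean())           # mean WavLM gate weight

rng = np.random.default_rng(3)
wavlm_feats = rng.standard_normal((4, 8))
delta_feats = rng.standard_normal((4, 8))
logits = rng.standard_normal((4, 2))
fused, mean_wavlm_weight = moe_fuse(wavlm_feats, delta_feats, logits)
```

A lower mean WavLM gate weight means the model leans more on the delta branch, which is the correlation the authors report for ΔWav2Vec 2.0.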

In summary, the paper makes three key contributions: (1) it introduces delta SSL embeddings as a novel, representation‑level task vector; (2) it demonstrates that fusing delta embeddings with the fine‑tuned embeddings of heterogeneous SSL models yields consistent and significant WER reductions, especially in low‑resource child ASR; and (3) it provides quantitative analyses (CCA, MoE gating) that elucidate the mechanisms behind the observed gains. The work opens avenues for further research on delta‑based fusion across languages and domains, and on combining representation‑level and parameter‑level task vectors for broader speech technology improvements.

