Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering-anchored-jepa.


💡 Research Summary

This paper addresses a critical failure mode of Joint Embedding Predictive Architectures (JEPA) when applied to speech self‑supervised learning: representation collapse. In a standard JEPA, a student network predicts masked latent representations produced by an exponential‑moving‑average (EMA) teacher. Because the teacher is derived from the student and receives no explicit acoustic grounding, both networks can drift toward a trivial solution where all frames map to the same embedding, rendering the learned features useless for downstream tasks.

Existing speech SSL methods such as HuBERT and WavLM mitigate collapse by periodically clustering intermediate representations with k‑means, then using the hard cluster IDs as discrete targets. This iterative re‑clustering is computationally expensive and discards uncertainty at phonetic boundaries, because each frame is forced into a single cluster.

The authors propose GMM‑Anchored JEPA, a lightweight alternative that requires only a single offline clustering step. They fit a diagonal‑covariance Gaussian Mixture Model (GMM) with K components on log‑mel spectrograms of the entire training corpus. The GMM yields soft posterior probabilities qₖ(m) for each frame m, preserving the probability mass over multiple clusters and thus encoding acoustic ambiguity. The GMM parameters are frozen for the whole training run, providing a stable external anchor that cannot co‑adapt with the student network.
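The key operation at training time is evaluating the frozen GMM's soft posteriors q_k(m) for each frame. A minimal numpy sketch of that computation is below, assuming a diagonal-covariance mixture; the dimensions, component count, and random parameters are illustrative placeholders, not the authors' settings (in the paper the parameters come from a one-time EM fit on log-mel frames of the full corpus):

```python
import numpy as np

# Sketch: computing frozen soft posteriors q_k(m) under a fitted
# diagonal-covariance GMM. All parameters here are random placeholders.
rng = np.random.default_rng(0)
D, K, M = 80, 8, 100          # mel bins, components, frames (illustrative)
x = rng.normal(size=(M, D))   # stand-in for log-mel frames
mu = rng.normal(size=(K, D))  # component means
var = rng.uniform(0.5, 2.0, size=(K, D))  # diagonal variances
log_pi = np.log(np.full(K, 1.0 / K))      # mixture weights

# log N(x_m; mu_k, diag var_k) for every (frame, component) pair
log_lik = -0.5 * (
    ((x[:, None, :] - mu[None]) ** 2 / var[None]).sum(-1)
    + np.log(2 * np.pi * var).sum(-1)[None]
)
log_q = log_pi[None] + log_lik
log_q -= np.logaddexp.reduce(log_q, axis=1, keepdims=True)  # log-sum-exp normalize
q = np.exp(log_q)  # soft posteriors: each row sums to 1, with mass
                   # spread across components near ambiguous frames
```

Because each row of `q` is a full distribution over components rather than a single hard ID, probability mass at phonetic boundaries is preserved instead of discarded.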

Training proceeds with two losses: (1) the standard JEPA loss L_JEPA, an MSE between the student’s masked predictions and the EMA teacher’s latent vectors; (2) a cluster loss L_cluster, the KL‑divergence between the model’s cluster‑head output pₖ and the frozen GMM posterior qₖ. The total loss is

L_total = L_JEPA + λ(t)·L_cluster

where λ(t) decays linearly from 1.0 at the start of training to 0.01 at the end. Early training is dominated by the GMM supervision, forcing the encoder to respect the spectral structure of the input; later, the JEPA objective refines higher‑level abstractions while a residual λ = 0.01 term continues to prevent drift.
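The schedule and combined loss can be sketched in a few lines of numpy. The KL direction (frozen posterior q as target, cluster-head output p as prediction) is an assumption not pinned down by this summary, as are the function names:

```python
import numpy as np

def lam(step, total_steps, lam0=1.0, lam_min=0.01):
    """Linear decay of the cluster-loss weight from lam0 to lam_min."""
    frac = step / total_steps
    return lam0 + frac * (lam_min - lam0)

def kl_cluster_loss(p, q, eps=1e-8):
    """KL(q || p) between frozen GMM posteriors q and cluster-head
    output p, averaged over frames. The direction is an assumption."""
    return float(np.mean(np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=1)))

def total_loss(l_jepa, l_cluster, step, total_steps):
    """L_total = L_JEPA + lambda(t) * L_cluster."""
    return l_jepa + lam(step, total_steps) * l_cluster
```

Note that `lam` never reaches zero: the residual weight of 0.01 keeps the anchor active for the entire run, which the ablation below shows is necessary.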

Experiments use roughly 50 000 hours of speech (LibriLight‑large + English Granary). Two encoder backbones are evaluated: a Conformer‑based model (GMM‑JEPA) and a standard Transformer (GMM‑JEPA‑T). Baselines include (a) Pure JEPA with λ = 0 (no anchoring) and (b) a WavLM‑style model that uses the same architecture but replaces the GMM with a hard k‑means clustering of log‑mel features (no iterative re‑clustering).

Downstream performance:

  • ASR on LibriSpeech dev‑clean: GMM‑JEPA‑T achieves 28.68 % WER, a 14 % relative reduction compared to the WavLM‑style baseline (33.22 % WER). Pure JEPA collapses to 100 % WER.
  • Emotion recognition on IEMOCAP: GMM‑JEPA‑T and GMM‑JEPA reach average accuracies of 67.76 % and 67.30 % respectively, versus 65.46 % for the baseline (+2.3 points absolute).
  • Slot filling on SNIPS: GMM‑JEPA obtains an F1 score of 64.7 % (+5.6 points absolute), while GMM‑JEPA‑T matches the baseline within 0.1 percentage points.

Cluster utilization analysis: The models predict a 1024‑class cluster distribution via a dedicated head. Entropy (normalized by log K) quantifies how evenly clusters are used. Pure JEPA uses only ~45 % of the possible entropy (516/1024 clusters). WavLM‑style, despite using more clusters, achieves only 31 % entropy (978/1024 used). In contrast, GMM‑JEPA reaches 85 % entropy (1007/1024 used) and GMM‑JEPA‑T attains 98 % entropy (1013/1024 used). This demonstrates that soft GMM posteriors keep the entire cluster vocabulary active, preventing the collapse observed in the baselines.
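The normalized-entropy metric used above is straightforward to compute from per-cluster frame counts; a small sketch (function name and input format are assumptions):

```python
import numpy as np

def normalized_entropy(counts):
    """Entropy of the empirical cluster-usage distribution, divided by
    log K, so 1.0 means all K clusters are used perfectly uniformly
    and 0.0 means a single cluster absorbs every frame."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    nz = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(counts)))
```

Under this definition a model can touch many of the 1024 clusters yet still score low, as the WavLM-style baseline does: usage is nonzero for 978 clusters but so skewed that normalized entropy is only 31 %.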

Visualization: UMAP projections of 10 000 randomly sampled frames show that Pure JEPA collapses into a tight region with little separation, WavLM‑style spreads but with overlapping cluster assignments, while both GMM‑anchored variants form distinct, well‑separated clusters across the embedding space. Temporal consistency metrics also reveal that GMM‑anchored models maintain smoother cluster trajectories, reflecting more stable phonetic representations.
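One plausible way to quantify the "smoother cluster trajectories" mentioned above is the fraction of adjacent frames assigned to the same cluster; the paper's exact temporal-consistency metric is not spelled out in this summary, so the definition below is an assumption for illustration:

```python
import numpy as np

def frame_consistency(cluster_ids):
    """Fraction of adjacent frame pairs with the same (argmax) cluster
    assignment -- a simple temporal-smoothness proxy, assumed here
    rather than taken from the paper."""
    ids = np.asarray(cluster_ids)
    return float(np.mean(ids[1:] == ids[:-1]))
```

Higher values indicate that cluster assignments change only at genuine acoustic transitions rather than flickering frame to frame.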

Ablation: Removing the residual λ = 0.01 term (i.e., setting λ = 0 after the decay) leads to a resurgence of collapse even after the early GMM phase, confirming that the frozen GMM acts not only as an initialization but as a continual regularizer throughout training.

Contributions:

  1. Identification of representation collapse in speech JEPA and a simple, one‑time GMM anchoring strategy that eliminates the need for costly iterative re‑clustering.
  2. Introduction of a decaying supervision schedule that balances early acoustic grounding with later high‑level predictive learning.
  3. Demonstration that soft clustering preserves boundary uncertainty, yields near‑uniform cluster utilization, and improves downstream ASR, emotion, and slot‑filling tasks across two encoder families.
  4. Comprehensive analysis (entropy, temporal consistency, UMAP) that links the observed performance gains to the mitigation of collapse.

Future directions include adaptive selection of the number of GMM components, extending the approach to noisy or multilingual corpora, integrating GMM anchoring with multimodal (speech‑text) JEPA variants, and exploring richer interactions between the cluster head and the encoder (e.g., attention‑based gating). Overall, the work offers a practical, compute‑efficient recipe for stabilizing self‑supervised speech representation learning and opens avenues for broader applications of soft clustering anchors in other modalities.

