Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition


Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, then gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.


💡 Research Summary

The paper tackles the persistent problem of degraded automatic speech recognition (ASR) performance on accented English, a consequence of training data being dominated by a few high‑resource varieties. Existing accent‑agnostic approaches (large‑scale self‑supervised models, adversarial training) improve overall robustness but still falter on heavily accented or unseen speakers. Accent‑specific methods (fine‑tuning, data augmentation, adapters, LoRA) achieve strong gains for known accents but rely on accent labels at training and often at inference, limiting scalability and generalization.

To bridge this gap, the authors propose MoE-CTC, a novel architecture that integrates a Mixture-of-Experts (MoE) framework with intermediate Connectionist Temporal Classification (CTC) supervision. The backbone is a FastConformer encoder; between its layers they insert sequence-level MoE modules. Each MoE module contains N parallel experts (two-layer feed-forward networks) and a routing network that pools the utterance-level hidden representation, computes softmax gate probabilities over the experts, and selects the top-K experts for sparse activation, thereby keeping computational cost low while allowing specialization.
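The utterance-level top-K routing step described above can be sketched in a few lines. The following is a minimal pure-Python illustration, not the paper's code; the function names and the renormalization of the selected gates are our own assumptions about a standard sparse-MoE formulation:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_logits, k):
    """Select the k experts with the highest gate probability and
    renormalize their gates so the selected weights sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
    z = sum(probs[j] for j in top)
    return [(j, probs[j] / z) for j in top]

# One utterance-level routing decision over N=4 experts with K=2:
selected = route_top_k([2.0, 0.5, 1.0, -1.0], k=2)
```

Only the selected experts' feed-forward networks are then evaluated for the utterance, which is what keeps per-utterance compute close to that of a dense single-expert layer while the total parameter count grows with N.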

Training proceeds in two stages. In the first stage, accent labels are available. The routing logits are biased toward the expert assigned to the utterance's accent (bias strength α), and an auxiliary accent-classification loss (L_accent) encourages the router to predict the correct accent. This "Accent-MoE" phase forces each expert to specialize on a particular accent. However, routing based solely on accent may not align with transcription quality. To address this, each expert is equipped with its own CTC head, producing an expert-level CTC loss (L_(ℓ,j)^CTC) at intermediate layers. The CTC logits are projected back into the hidden space and added via a residual connection, allowing the CTC signal to influence the shared representation. A routing-augmented loss (L_local = Σ_j g_(ℓ,j)·L_(ℓ,j)^CTC) further pushes the router to assign higher weight to experts that yield lower CTC loss, directly coupling routing decisions with ASR performance.
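The stage-one loss terms can be sketched numerically. This is a hedged pure-Python illustration under the definitions above: the gate values g and the per-expert CTC losses would come from the model, and the function names and example numbers are our own:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def accent_biased_logits(gate_logits, accent_id, alpha):
    """Stage 1: add bias alpha to the routing logit of the expert
    assigned to the utterance's labeled accent."""
    biased = list(gate_logits)
    biased[accent_id] += alpha
    return biased

def accent_loss(gate_logits, accent_id):
    """L_accent: cross-entropy of the router's softmax vs. the accent label."""
    return -math.log(softmax(gate_logits)[accent_id])

def routing_augmented_loss(gates, expert_ctc_losses):
    """L_local = sum_j g_j * L_j^CTC, so lowering an expert's CTC loss
    (or routing away from high-loss experts) lowers the objective."""
    return sum(g * l for g, l in zip(gates, expert_ctc_losses))

biased = accent_biased_logits([1.0, 0.0, 0.0], accent_id=1, alpha=2.0)  # -> [1.0, 2.0, 0.0]
l_acc = accent_loss([0.0, 0.0], accent_id=1)                            # -log(0.5)
l_loc = routing_augmented_loss([0.7, 0.3], [1.0, 2.0])                  # 0.7*1.0 + 0.3*2.0 = 1.3
```

Because L_local is a gate-weighted sum, its gradient with respect to each gate is that expert's CTC loss, which is exactly what nudges the router toward experts that transcribe the utterance well.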

In the second stage, the accent‑bias term and L_accent are removed, letting the router operate without explicit accent information. The model continues to be trained with the global CTC loss and the expert‑level CTC supervision, so it can generalize to unseen accents while still benefiting from the specialized expertise learned earlier.
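The two-stage schedule amounts to switching off the accent-specific terms after stage one while keeping both CTC losses on throughout. A minimal sketch of such a schedule follows; the step threshold and the α and loss-weight values are illustrative assumptions, not figures from the paper:

```python
def training_config(step, stage1_steps, alpha=4.0, lambda_accent=0.3):
    """Hyper-parameter schedule for the two training stages.

    Stage 1 (step < stage1_steps): accent-biased routing (alpha > 0)
    plus the accent-classification loss L_accent.
    Stage 2: both accent terms are switched off so the router runs
    label-free, while global and expert-level CTC losses stay active.
    """
    stage1 = step < stage1_steps
    return {
        "alpha": alpha if stage1 else 0.0,                   # routing bias strength
        "lambda_accent": lambda_accent if stage1 else 0.0,   # weight of L_accent
        "use_global_ctc": True,                              # on in both stages
        "use_expert_ctc": True,                              # on in both stages
    }
```

At inference the model simply runs the stage-two configuration, which is why no accent label is needed at test time.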

Experiments use the MCV-Accent benchmark derived from Mozilla Common Voice. The dataset provides two training subsets (100 h and 600 h) with five "seen" accents (Australia, Canada, England, Scotland, United States) and nine "unseen" accents for evaluation. All models are first pretrained on LibriSpeech (960 h) and then fine-tuned on MCV-Accent. Baselines include FastConformer models of three sizes (small, medium, large), an intermediate-CTC variant, a vanilla MoE, and an Accent-MoE that uses accent labels at inference.

Results show that MoE-CTC consistently outperforms all baselines on both seen and unseen accents. In the low-resource 100 h setting, the smallest MoE-CTC model (≈12 M parameters) achieves up to a 29.3% relative word error rate (WER) reduction compared to the strong FastConformer baseline, and it also beats Accent-MoE despite not requiring accent labels at test time. The routing-augmented loss proves crucial: removing it leads to unstable routing and significant performance drops.

Key insights are: (1) expert‑level CTC supervision aligns routing with transcription quality, preventing experts from drifting toward merely accent‑discriminative features; (2) a two‑stage training schedule enables the model to first acquire accent‑specific expertise and then generalize to unseen accents; (3) sequence‑level MoE with top‑K routing offers a computationally efficient way to scale capacity for utterance‑level attributes like accent.

In summary, MoE-CTC presents a practical, scalable solution for accented speech recognition that combines the capacity-boosting benefits of MoE with the stabilizing effect of intermediate CTC supervision, achieving state-of-the-art performance without requiring accent information at inference. This work advances both the fairness of ASR systems, by reducing the accent gap, and their applicability in diverse real-world environments.

