Frontend Token Enhancement for Token-Based Speech Recognition
Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
💡 Research Summary
This paper addresses the vulnerability of discretized speech representations—specifically semantic or phonetic tokens derived from clustering self‑supervised learning (SSL) model outputs—to environmental noise. While such tokens enable efficient transmission, model training, and decoding, their noise sensitivity hampers practical deployment of token‑based automatic speech recognition (ASR) systems. To improve robustness, the authors propose a modular frontend that estimates clean tokens from noisy speech and evaluate its impact on a token‑based ASR backend that uses semantic tokens.
Four families of enhancement models are investigated, distinguished by their input and output domains:
- Wave‑to‑Wave (W2W‑E) – Conventional speech‑enhancement frontends that map noisy waveforms to enhanced waveforms. Two state‑of‑the‑art SE models are used: Conv‑TasNet and TF‑GridNet. The enhanced waveforms are then processed by the usual SSL feature extractor and tokenizer.
- Token‑to‑Token (T2T‑E) – Directly maps duplicated noisy token sequences to cleaned token sequences. The architecture mirrors the token‑ASR encoder, employing an embedding layer followed by four E‑Branchformer blocks with a reduced hidden dimension.
- Vector‑to‑Token (V2T‑E) – Takes continuous SSL features as input and predicts clean tokens. The authors explore three decoders: a simple two‑layer MLP, a Temporal Convolutional Network (TCN), and an E‑Branchformer. Features are not limited to a single SSL layer; instead, a trainable weighted sum of all WavLM Large layers is used, capturing hierarchical speech information.
- Wave‑to‑Token (W2T‑E) – The most integrated approach: a pretrained SSL model (WavLM Large) is fine‑tuned end‑to‑end with a classifier on top, using CTC loss to predict clean tokens directly from raw waveforms. This model has the highest parameter count, but it simplifies inference by bypassing the separate enhancement, feature‑extraction, and tokenization stages.
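The trainable weighted sum over SSL layers used by V2T‑E can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the layer count, frame count, and feature dimension below are illustrative (WavLM Large has 24 transformer layers plus the convolutional input features, each 1024‑dimensional), and in practice the softmax logits would be learned jointly with the decoder.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def weighted_layer_sum(layer_feats, layer_logits):
    """Combine per-layer SSL features with softmax-normalized weights.

    layer_feats:  (num_layers, num_frames, feat_dim) hidden states
    layer_logits: (num_layers,) trainable logits, one per layer
    Returns a (num_frames, feat_dim) mixture of all layers.
    """
    w = softmax(layer_logits)               # normalized layer weights
    return np.tensordot(w, layer_feats, 1)  # contract over the layer axis

# Illustrative sizes: 25 "layers", 10 frames, 1024-dim features.
feats = np.random.randn(25, 10, 1024)
logits = np.zeros(25)                       # uniform weights at initialization
mixed = weighted_layer_sum(feats, logits)
print(mixed.shape)                          # (10, 1024)
```

With zero logits the mixture is just the mean over layers; during training the logits shift weight toward whichever layers carry the most useful information for the token-prediction task.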
All enhancement models are trained independently of the ASR backend, allowing the frontend to be swapped without retraining the recognizer. The ASR backends consist of two variants: a joint CTC/attention encoder‑decoder (AED) and a CTC‑only model, both built on 12 E‑Branchformer encoder blocks and 6 Transformer decoder blocks. The token ASR uses 2k BPE units derived from 1k k‑means clusters on the 21st layer of WavLM Large; the continuous ASR uses either 80‑dim FBANK or weighted‑sum WavLM features.
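The tokenization step behind this pipeline reduces to assigning each frame‑level SSL feature to its nearest k‑means centroid, yielding a sequence of discrete cluster IDs. The sketch below uses toy 1‑dimensional features and a 2‑entry codebook rather than the paper's 1024‑dim features and 1k clusters; the `deduplicate` helper (collapsing runs of repeated tokens, a common preprocessing step for token sequences) is an assumption, not a detail confirmed by the summary.

```python
import numpy as np

def tokenize(features, centroids):
    """Map frame-level SSL features to nearest-centroid cluster IDs.

    features:  (num_frames, feat_dim) continuous SSL features
    centroids: (num_clusters, feat_dim) k-means codebook
    Returns an integer token ID per frame.
    """
    # Squared Euclidean distance from every frame to every centroid
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def deduplicate(tokens):
    """Collapse runs of repeated tokens (hypothetical preprocessing)."""
    out = [tokens[0]]
    for t in tokens[1:]:
        if t != out[-1]:
            out.append(t)
    return out

toks = tokenize(np.array([[0.1], [0.2], [0.9], [1.1]]),
                np.array([[0.0], [1.0]]))
print(list(toks))         # [0, 0, 1, 1]
print(deduplicate(toks))  # [0, 1]
```

The resulting ID sequences would then be segmented into BPE units before being fed to the token ASR.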
Experiments are conducted on the single‑channel tracks of the CHiME‑4 dataset (both simulated and real noisy conditions). Evaluation metrics include word error rate (WER) for the ASR, scale‑invariant signal‑to‑noise ratio (SI‑SNR) for waveform enhancement, and unit edit distance (UED) for token‑level enhancement.
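Unit edit distance is essentially a Levenshtein distance computed over token sequences instead of words. The sketch below normalizes by reference length; the paper's exact normalization and alignment conventions may differ, so treat this as an illustration of the metric's shape rather than a reference implementation.

```python
def unit_edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, normalized by
    reference length (one plausible definition of UED)."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))          # DP row: distances against empty prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i        # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[n] / max(m, 1)

# One token deleted from a 4-token reference -> UED of 0.25
print(unit_edit_distance([3, 7, 7, 2], [3, 7, 2]))  # 0.25
```

As the paper's findings note, a lower UED does not guarantee a lower WER, since token‑level fidelity and downstream recognition accuracy are only loosely coupled.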
Key findings:
- W2T‑E achieves the best ASR performance, yielding WERs of 5.6 % (simulated) and 4.5 % (real) for the AED token ASR, surpassing the strongest continuous ASR baseline (WavLM weighted‑sum features), which records 8.1 % / 6.0 % WER. The CTC‑only token ASR also benefits, dropping from 21.9 % to 6.6 % WER with W2T‑E.
- TF‑GridNet SE improves continuous ASR (WER reduced from 8.1 % to 5.9 % for weighted‑sum features), but its impact on token ASR is less consistent; Conv‑TasNet sometimes degrades performance.
- T2T‑E provides minimal gains, indicating that directly cleaning noisy token sequences is challenging when the input tokens already contain substantial distortion.
- V2T‑E shows moderate improvements, with the E‑Branchformer decoder performing best among the three variants (WER ≈ 9.8 %). Using weighted‑sum SSL features is crucial; single‑layer features underperform.
- UED does not correlate perfectly with WER; a lower token edit distance does not guarantee lower word error, highlighting a mismatch between token‑level fidelity and downstream recognition.
The paper’s contributions are twofold: (1) it offers the first systematic comparison of diverse frontend enhancement strategies for token‑based ASR, introducing novel V2T‑E and W2T‑E approaches; (2) it demonstrates that a waveform‑to‑token frontend can not only close the gap with continuous‑feature ASR but actually surpass it under noisy conditions.
The authors discuss that while W2T‑E incurs a large training cost (≈ 312 M parameters), its inference simplicity and superior robustness make it attractive for real‑time applications. They also note that the modular training paradigm enables future upgrades of either the frontend or the backend without joint retraining.
Future work suggested includes developing lighter‑weight W2T‑E models, extending evaluation to multi‑channel and streaming scenarios, and investigating better alignment between token‑level enhancement metrics and ASR performance. Overall, the study establishes a solid foundation for integrating noise‑robust frontends into token‑based speech processing pipelines.