From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

"Phoneme hallucinations" (PH) commonly occur in low-bitrate DNN-based codecs: they arise from the generative decoder's attempt to synthesize plausible outputs from excessively compressed tokens that are missing semantic information. In this work, we propose language model-driven losses (LM losses) and show they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. The proposed LM losses build upon language models pretrained to associate speech with text. When ground-truth transcripts are unavailable, we propose to modify a popular automatic speech recognition (ASR) model, Whisper, to compare the decoded utterance against the ASR-inferred transcriptions of the input speech. Otherwise, we propose to use the timed-text regularizer (TTR) to compare WavLM representations of the decoded utterance against BERT representations of the ground-truth transcriptions. We test and compare the LM losses against an SD objective, using a reference codec whose three-stage training regimen was designed after several popular codecs. Subjective and objective evaluations conclude that LM losses may provide stronger guidance to extract semantic information from self-supervised speech representations, boosting human-perceived semantic adherence while preserving overall output quality. Demo samples, code, and checkpoints are available online.


💡 Research Summary

The paper tackles a persistent problem in ultra‑low‑bitrate neural speech codecs—“phoneme hallucination” (PH), where the generative decoder invents plausible‑sounding phonemes that differ from the original content because the compressed token stream lacks sufficient semantic information. While recent semantic codecs mitigate this issue by distilling self‑supervised speech representations (e.g., HuBERT‑AAPT) into the loss function, PH still appears at bitrates below roughly 0.4 kbps, especially when the codebook size or frame rate is severely limited.

To address this, the authors propose two language‑model‑driven loss functions (LM Loss) that exploit pretrained models explicitly trained to align speech with text. The first, called the ASR loss, repurposes an automatic speech recognition model (Whisper) that predicts sub‑word tokens autoregressively. For a clean utterance x, Whisper generates a token sequence c_W; the same model then processes the decoded signal x̂ and, teacher‑forced on c_W, predicts a second sequence of token distributions f_W. The loss is the cross‑entropy between the predictions f_W and the target tokens c_W, summed over all time steps. Because the loss operates entirely in the token space, it does not require ground‑truth transcripts, allowing training on any clean speech corpus. The authors note that directly using transcripts can destabilize training, so the token‑level formulation is preferred.
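The token‑level cross‑entropy described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `asr_loss` is a hypothetical name, and in practice the logits would come from running Whisper on the decoded signal x̂ with teacher forcing on the clean‑input token sequence c_W.

```python
import numpy as np

def asr_loss(student_logits, teacher_tokens):
    """Token-level cross-entropy between the ASR predictions on the
    decoded signal and the tokens inferred from the clean input.

    student_logits: (T, V) unnormalized scores from the ASR model run on
                    the decoded signal, teacher-forced on c_W (assumed given).
    teacher_tokens: (T,) integer token ids c_W inferred from the clean input.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each teacher token, summed over time steps.
    nll = -log_probs[np.arange(len(teacher_tokens)), teacher_tokens]
    return float(nll.sum())
```

When the decoded speech yields the same tokens as the clean input, the loss is near zero; any hallucinated phoneme that flips a predicted sub‑word token contributes a large penalty at that time step.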

The second loss, the Timed‑Text Regularizer (TTR), builds on the audio‑based language model WavLM and the text‑based language model BERT. For each sub‑word segment of the decoded speech, WavLM produces a variable‑length sequence of embeddings, which a trainable "summarizer" transformer (P_Sum) condenses into a single vector S_i. A second "aggregator" transformer (P_Agg) then refines the whole sequence {S_i} via self‑attention. In parallel, BERT processes the ground‑truth transcript to obtain text embeddings {T_i}. The TTR loss combines (1) the cosine similarity between each S_i and its corresponding T_i, and (2) a pairwise cosine‑distance term that enforces that the relational structure among the audio embeddings matches that of the text embeddings. During a pre‑training stage, only P_Sum and P_Agg are updated while WavLM and BERT remain frozen; subsequently, the same loss is applied to the decoded signal x̂ to guide the codec's parameters.
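The two TTR terms can be sketched as below. This is an illustrative NumPy version under simplifying assumptions: the frozen WavLM/BERT forward passes and the P_Sum/P_Agg transformers are not shown, and `ttr_loss` is a hypothetical name; the function simply takes the already‑summarized audio vectors S_i and the text embeddings T_i.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors (assumes nonzero norms)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ttr_loss(S, T):
    """S: (N, D) summarized audio embeddings S_i (WavLM -> P_Sum -> P_Agg);
    T: (N, D) BERT embeddings T_i of the time-aligned sub-word transcript.
    Term 1 aligns each S_i with its matching T_i; term 2 matches the
    pairwise cosine structure of the two embedding sets."""
    n = len(S)
    # (1) per-segment alignment: 1 - cos(S_i, T_i), averaged over segments.
    align = np.mean([1.0 - _cos(S[i], T[i]) for i in range(n)])
    # (2) relational term: pairwise similarities should agree across modalities.
    rel, cnt = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            rel += (_cos(S[i], S[j]) - _cos(T[i], T[j])) ** 2
            cnt += 1
    return float(align + rel / max(cnt, 1))
```

The relational term is what distinguishes TTR from a plain per‑token cosine loss: even if every S_i drifts, the loss stays small as long as the geometry among the audio summaries mirrors the geometry among the text embeddings.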

The experimental framework adopts a three‑stage training pipeline common to many neural codecs: (1) pre‑training of encoder and decoder on reconstruction losses, (2) vector‑quantization (VQ) codebook learning, and (3) fine‑tuning of the decoder with either the proposed LM Losses or the conventional semantic‑distillation loss (L_HuBERT). The reference codec is a slight modification of a prior system that uses a HiFi‑GAN vocoder to synthesize audio from quantized pitch and HuBERT features; the authors add a HuBERT encoder that halves the feature rate and a VQ codebook for the compressed representation.
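The three‑stage pipeline amounts to gating which loss terms are active at each stage. The sketch below is purely illustrative: the function name and the unweighted sums are assumptions for clarity, not the paper's actual loss weighting.

```python
def stage_losses(stage, recon, vq_commit, semantic):
    """Combine loss terms per training stage (illustrative, equal weights).

    stage 1: encoder/decoder pre-training on reconstruction losses only.
    stage 2: VQ codebook learning adds a commitment/quantization term.
    stage 3: decoder fine-tuning adds the semantic term
             (an LM Loss, or L_HuBERT for the semantic-distillation baseline).
    """
    if stage == 1:
        return recon
    if stage == 2:
        return recon + vq_commit
    if stage == 3:
        return recon + semantic
    raise ValueError("stage must be 1, 2, or 3")
```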

Evaluation includes objective metrics (Word Error Rate, PESQ, STOI) and subjective listening tests modeled after MUSHRA, together with a dedicated PH‑counting analysis. At an extreme bitrate of 187.5 bps (≈0.19 kbps), both LM Losses dramatically reduce PH occurrences compared with the baseline SD loss. The ASR loss yields a roughly 30 % relative WER reduction, while the TTR loss provides an even larger gain when accurate transcripts are available. Subjectively, listeners rate "semantic consistency" higher by an average of 0.12 points for the LM Losses, with no statistically significant degradation in overall audio quality. The ASR loss's advantage is its applicability to unlabeled data, whereas the TTR loss excels when precise time‑aligned transcripts are present, making the two approaches complementary.

In conclusion, integrating pretrained language models directly into the loss function offers a powerful mechanism to inject linguistic constraints into ultra‑low‑bitrate speech coding. This approach overcomes the representational bottleneck of traditional semantic distillation by leveraging the rich speech‑text alignment learned by large‑scale ASR and text‑audio models. The paper suggests future directions such as scaling to larger multimodal LMs (e.g., Whisper‑large, GPT‑4V) and optimizing inference cost for real‑time streaming scenarios.

