DisContSE: Single-Step Diffusion Speech Enhancement Based on Joint Discrete and Continuous Embeddings

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Diffusion-based speech enhancement on discrete audio codec features has gained considerable attention due to its improved speech component reconstruction capability. However, such methods usually suffer from high inference computational complexity owing to the multiple iterations of the reverse process. Furthermore, they generally achieve promising results on non-intrusive metrics but perform poorly on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model operating on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to improve fidelity and intelligibility simultaneously. Second, a semantic enhancement module is further adopted to improve phonetic accuracy. Third, we achieve an efficient single-step reverse process at inference with a novel quantization-error mask initialization strategy, which, to our knowledge, is the first successful single-step diffusion speech enhancement based on an audio codec. Trained and evaluated on the URGENT 2024 Speech Enhancement Challenge data splits, the proposed DisContSE outperforms top-reported time- and frequency-domain diffusion baseline methods in PESQ, POLQA, UTMOS, and a subjective ITU-T P.808 listening test, clearly achieving the overall top rank.


💡 Research Summary

Title: DisContSE: Single‑Step Diffusion Speech Enhancement Based on Joint Discrete and Continuous Embeddings

Problem Statement:
Diffusion‑based speech enhancement (SE) has shown strong performance, especially when operating on discrete audio‑codec tokens. However, existing methods require many reverse‑diffusion steps, leading to high inference latency. Moreover, while they achieve good non‑intrusive scores, they often lag on intrusive metrics (e.g., PESQ, POLQA) because the reconstructed speech may contain phonetic errors.

Proposed Solution:
DisContSE introduces a hybrid architecture that simultaneously processes (i) discrete codec tokens and (ii) continuous embeddings extracted from a pre‑trained Descript Audio Codec (DAC), and (iii) semantic features from a pre‑trained WavLM model. Three enhancement modules are employed:

  1. Discrete Enhancement Module – a masked language model (MaskGIT‑style) that predicts masked tokens. It uses parallel embedding layers for each of the 12 codebooks, sums the resulting representations with continuous and semantic embeddings, and is trained with cross‑entropy loss on masked positions plus a self‑critic binary‑cross‑entropy loss.

  2. Continuous Enhancement Module – a continuous‑domain LM consisting of two fully‑connected layers and eight transformer blocks. It refines the DAC‑encoded continuous embeddings using a mean‑absolute‑error (MAE) loss, then re‑tokenizes the refined embeddings to obtain an improved token set.

  3. Semantic Enhancement Module – a WavLM‑based LM (four transformer blocks) that enhances phonetic information, also optimized with an MAE loss.

All three modules share the same transformer dimension (H = 512) and are trained jointly while keeping the DAC encoder, DAC tokenizer, and WavLM encoder frozen. The total trainable parameter count is 81.4 M.
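The summary does not include code, so the following is a minimal NumPy sketch of how the discrete module's inputs could be fused: one embedding table per codebook, with the per-codebook token embeddings summed and then added to the continuous and semantic embeddings. The vocabulary size V, frame count L, function name, and random initialization are illustrative assumptions; the dimensions H = 512 and C = 12 follow the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, C, V, L = 512, 12, 1024, 50  # hidden dim, codebooks, vocab size (assumed), frames

# One embedding table per codebook, as in the parallel embedding layers
# described for the discrete enhancement module.
codebook_embeds = [rng.standard_normal((V, H)) * 0.02 for _ in range(C)]

def fuse_inputs(tokens, cont_emb, sem_emb):
    """Sum per-codebook token embeddings with continuous and semantic embeddings.

    tokens:   (L, C) integer codec tokens
    cont_emb: (L, H) continuous embeddings (e.g. from the DAC encoder)
    sem_emb:  (L, H) semantic embeddings (e.g. from WavLM)
    Returns the fused (L, H) transformer input.
    """
    x = sum(codebook_embeds[c][tokens[:, c]] for c in range(C))
    return x + cont_emb + sem_emb

tokens = rng.integers(0, V, size=(L, C))
fused = fuse_inputs(tokens,
                    rng.standard_normal((L, H)),
                    rng.standard_normal((L, H)))
```

Summing (rather than concatenating) keeps the transformer width fixed at H regardless of how many input streams are fused, which is consistent with all three modules sharing the same dimension.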

Key Innovation – Single‑Step Reverse Diffusion:
Traditional diffusion SE starts from a fully masked token sequence (T = 1) and iteratively reduces the mask over N steps (N ≈ 10–50). DisContSE replaces this multi‑step schedule with a quantization‑error mask initialization strategy. The DAC tokenizer provides a quantization‑error matrix Δ_quant for each codebook. The algorithm selects the top ⌈sin(π T / 2)·L·C⌉ entries (where L is sequence length, C = 12) and sets those positions to “masked”. With T = 0.1 (≈10 % mask), the model directly feeds this partially masked token set into the discrete enhancement module, which predicts the final token distribution in a single reverse step. This eliminates the iterative refinement loop, dramatically reducing inference time while preserving quality.

Training Details:

  • Optimizer: AdamW, learning rate = 2.5 × 10⁻⁴, 4 K warm‑up steps.
  • Total steps: 300 K (≈3.5 days on 4 × NVIDIA H100 GPUs).
  • Loss: J = J_dis^CE + J_critic^BCE + J_cont^MAE + J_sem^MAE.
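The four-term objective above can be sketched as follows. This is a hedged reconstruction assuming equal weighting of the terms (the summary gives no loss weights), with cross-entropy applied only at masked token positions; shapes and argument names are illustrative.

```python
import numpy as np

def joint_loss(dis_logits, dis_targets, mask,
               critic_logits, critic_targets,
               cont_pred, cont_ref, sem_pred, sem_ref):
    """J = J_dis^CE + J_critic^BCE + J_cont^MAE + J_sem^MAE (equal weights assumed).

    dis_logits: (L, V) token logits; dis_targets: (L,) target token ids;
    mask: (L,) bool, True where tokens were masked; critic_*: (L,) self-critic
    logits/labels; cont_*/sem_*: continuous/semantic embedding arrays.
    """
    # Cross-entropy over masked positions only.
    logp = dis_logits - np.log(np.exp(dis_logits).sum(-1, keepdims=True))
    ce = -logp[mask, dis_targets[mask]].mean()
    # Self-critic binary cross-entropy.
    p = 1.0 / (1.0 + np.exp(-critic_logits))
    bce = -(critic_targets * np.log(p)
            + (1 - critic_targets) * np.log(1 - p)).mean()
    # Mean absolute errors for the continuous and semantic pathways.
    mae_cont = np.abs(cont_pred - cont_ref).mean()
    mae_sem = np.abs(sem_pred - sem_ref).mean()
    return ce + bce + mae_cont + mae_sem

# Toy check: uniform logits over 4 classes and zero regression error,
# so the loss reduces to log(4) + log(2).
loss = joint_loss(
    np.zeros((3, 4)), np.array([0, 1, 2]), np.array([True, False, True]),
    np.zeros(3), np.array([0.0, 1.0, 0.0]),
    np.zeros((3, 8)), np.zeros((3, 8)), np.zeros((3, 8)), np.zeros((3, 8)),
)
```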

Experimental Setup:

  • Dataset: URGENT 2024 Speech Enhancement Challenge (634.5 h training, 32.7 h validation, 661 test utterances, 16 kHz).
  • Evaluation: Intrusive metrics (PESQ, POLQA, ESTOI), non‑intrusive metrics (DNS‑MOS, NISQA, UTMOS), phonetic fidelity (Levenshtein Phone Similarity, SBScore), speaker similarity (SpkSim), word‑accuracy (W Acc), and a subjective ITU‑T P.808 listening test (MOS).
  • Baselines: Seven recent diffusion‑based SE models (SGMSE+, BBED, SB, CRP, CDiffuSE, StoRM, Universe++) re‑trained under identical conditions.

Results:

Model              PESQ  POLQA  UTMOS  MOS (P.808)  Overall Rank
Noisy              1.88  2.17   0.67   2.17         8.36
Clean              4.50  4.71   1.00   3.95         –
DisContSE (D+G1)   3.14  3.25   0.84   3.75         2.36
CRP (G1)           3.10  3.01   0.82   3.71         3.36
SB (G30)           2.57  2.86   0.80   3.73         4.82
Universe++ (D+G8)  3.09  3.23   0.81   3.73         4.18

DisContSE achieves the best scores on PESQ, POLQA, UTMOS, and the subjective MOS, while remaining competitive on ESTOI and phonetic fidelity (LPS). The overall rank (2.36, lower is better) places it ahead of all baselines; the closest competitor, CRP, ranks 3.36.

Ablation Findings:

  • Mask Initialization: Quantization‑error mask (T = 0.1) outperforms random mask and fully‑masked initialization across all metrics.
  • Module Contributions: Removing the semantic module degrades phonetic metrics; removing the continuous module reduces PESQ and POLQA; removing both harms overall performance.
  • Number of Reverse Steps: Increasing N beyond 1 (e.g., 5, 10 steps) yields marginal gains but at a substantial computational cost, confirming the efficiency of the single‑step design.

Conclusions and Future Work:
DisContSE demonstrates that a carefully designed combination of discrete, continuous, and semantic enhancement pathways can deliver diffusion‑level quality with a single reverse diffusion step, achieving real‑time inference capability. The quantization‑error mask initialization is a novel, effective technique for reducing diffusion steps without sacrificing fidelity. Limitations include reliance on the quality of the DAC quantization error and the fixed nature of the pre‑trained DAC/WavLM encoders, which may affect cross‑domain generalization. Future research directions include learning a data‑driven mask initialization, extending the framework to multi‑speaker and multilingual scenarios, and exploring end‑to‑end fine‑tuning of the DAC and WavLM backbones.

