ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-based Neural Speech Codec


Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent tokens, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines, under a wide range of distortions including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.


💡 Research Summary

The paper introduces ParaGSE, a novel parallel generative speech‑enhancement framework that leverages a group‑vector‑quantization (GVQ) based neural speech codec (G‑MDCTCodec). Traditional generative SE methods rely on large language models or residual vector quantizers (RVQ) that predict clean tokens sequentially, leading to high computational cost and limited real‑time applicability. ParaGSE addresses these issues by designing a codec where the encoder output is split into N groups, each quantized independently by its own codebook. This independence yields mutually independent token streams that can be predicted in parallel.
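The core GVQ idea — split one latent vector into N independent sub-vectors and quantize each against its own codebook — can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation; the codebook size and dimensions below are deliberately small for readability.

```python
import numpy as np

def group_vector_quantize(z, codebooks):
    """Split latent z into len(codebooks) groups and quantize each
    group independently by nearest-neighbour (L2) codebook lookup."""
    groups = np.split(z, len(codebooks))     # mutually independent sub-vectors
    tokens, recon = [], []
    for g, cb in zip(groups, codebooks):
        dists = np.linalg.norm(cb - g, axis=1)  # distance to every codeword
        idx = int(np.argmin(dists))
        tokens.append(idx)
        recon.append(cb[idx])
    return tokens, np.concatenate(recon)

# Toy setup: N = 4 groups, codebook size 8, group dimension 8
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 8)) for _ in range(4)]
z = rng.normal(size=32)
tokens, z_q = group_vector_quantize(z, codebooks)
print(tokens)   # one token per group -> four parallel token streams
```

Because no group's quantization depends on another group's result (unlike RVQ, where each stage quantizes the residual left by the previous stage), the four tokens can later be predicted by four independent branches.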

The system consists of three main components. First, a spectral‑feature extraction module converts degraded speech into a short‑time Fourier transform (STFT) representation, downsamples it, and processes it with two BiLSTM layers followed by a Conformer block to produce a high‑dimensional spectral feature vector ŝ. Second, the degraded speech is fed into the G‑MDCTCodec encoder and GVQ, producing N degraded tokens dₙ^(y) (n = 1…N). Third, N parallel prediction branches each take a degraded token and the spectral feature ŝ as inputs. Inside each branch, the token is looked up in a trainable embedding table to obtain a latent vector vₙ, which is concatenated with ŝ and passed through two BiLSTM layers and a Conformer block. The output is projected to M dimensions, softmax‑normalized, and interpreted as a probability distribution over the codebook entries. Cross‑entropy loss between this distribution and the one‑hot representation of the target clean token dₙ^(x) drives training. The codec’s encoder, GVQ, and decoder are frozen after pre‑training; only the spectral‑feature extractor and the parallel branches are fine‑tuned for enhancement.
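The per-branch training objective described above can be sketched as follows. A simple linear projection stands in for the branch's BiLSTM + Conformer stack, and all names and dimensions here are illustrative assumptions, not the paper's; the point is only the shape of the computation: embed the degraded token, condition on the spectral feature, produce a distribution over the M codewords, and score it with cross-entropy against the clean token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def branch_loss(degraded_token, clean_token, embed_table, W, s_hat):
    """One toy prediction branch: embed the degraded token, condition on
    spectral feature s_hat, project to M logits, and compute cross-entropy
    against the one-hot clean token (a linear map stands in for the
    branch's BiLSTM + Conformer stack)."""
    v = embed_table[degraded_token]        # trainable embedding lookup
    h = np.concatenate([v, s_hat])         # concatenate with spectral feature
    p = softmax(W @ h)                     # distribution over M codewords
    return -np.log(p[clean_token])         # cross-entropy with one-hot target

M, emb_dim, feat_dim = 256, 8, 16          # illustrative dimensions
rng = np.random.default_rng(1)
embed_table = rng.normal(size=(M, emb_dim))
W = rng.normal(size=(M, emb_dim + feat_dim)) * 0.1
s_hat = rng.normal(size=feat_dim)
loss = branch_loss(degraded_token=12, clean_token=40,
                   embed_table=embed_table, W=W, s_hat=s_hat)
print(f"branch cross-entropy: {loss:.3f}")
```

During training, only parameters like `embed_table` and the branch networks would receive gradients; the codec itself stays frozen.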

G‑MDCTCodec itself is built upon the previously proposed MDCTCodec but replaces the residual vector quantizer with GVQ. The authors demonstrate that this change does not degrade coding quality: log‑spectral distance (LSD), short‑time objective intelligibility (STOI), and virtual speech quality objective listener (VISQOL) scores are virtually identical to those of the original codec. This validates that independent quantization can preserve speech fidelity while enabling parallel token prediction.

Experiments are conducted on three representative SE tasks: denoising, dereverberation, and mixed distortion (noise + reverberation + band‑limiting). The training data are derived from the VoiceBank corpus, with degradations generated from DEMAND noise, DNS Challenge room impulse responses, and down‑sampling to 8 kHz for the mixed case. The codec uses N = 4 groups, each with a codebook of size M = 256 and vector dimension K/N = 8. The spectral‑feature extractor uses an STFT (frame length 320, shift 40), three convolutional down‑sampling layers (overall factor 8), and Conformer blocks with 512 channels and 8 attention heads.
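The configuration numbers above imply a compact token budget, which a few lines of arithmetic make concrete. Two caveats: the input sampling rate is not stated in this summary (16 kHz below is an assumption), and the STFT hop and downsampling factor belong to the spectral-feature extractor, so treating them as the codec's token-frame rate is also an assumption made purely for illustration.

```python
# Token budget implied by the stated configuration (illustrative only).
N_groups, M_codebook = 4, 256
stft_shift, downsample = 40, 8

bits_per_frame = N_groups * (M_codebook.bit_length() - 1)  # 4 tokens x log2(256) bits
effective_hop = stft_shift * downsample                    # samples between token frames

assumed_sr = 16_000                                        # NOT stated in the summary
frames_per_sec = assumed_sr / effective_hop
bitrate_bps = frames_per_sec * bits_per_frame
print(bits_per_frame, effective_hop, frames_per_sec, bitrate_bps)
```

Under these assumptions each token frame carries 32 bits (four 8-bit tokens), and each branch only ever has to choose among 256 codewords per frame, which keeps the per-branch classification problem small.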

ParaGSE is compared against a suite of baselines: the time‑domain regression model DEMUCS, frequency‑domain discriminative models CMGAN and MP‑SENet, and the generative model Genhancer. Objective metrics include intrusive LSD and three non‑intrusive scores (NISQA, DNSMOS, UTMOS). Subjective evaluation uses ABX preference tests on Amazon Mechanical Turk with at least 30 listeners per pair.

Results show that ParaGSE consistently outperforms DEMUCS and Genhancer across all objective metrics in both denoising and dereverberation. Compared with CMGAN and MP‑SENet, ParaGSE achieves comparable or better non‑intrusive scores, and surpasses them in the dereverberation task. The only metric where ParaGSE lags is LSD, which the authors attribute to the generative nature of the model (approximating a distribution rather than exact waveform reconstruction). Subjective tests confirm a statistically significant preference for ParaGSE over the strongest discriminative baselines, especially in the mixed‑distortion scenario.

Efficiency analysis reveals that the parallel token‑prediction architecture yields a 1.5× speedup on CPU relative to serial generative approaches such as Genhancer, demonstrating that high‑quality enhancement can be achieved without sacrificing real‑time performance.
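The source of that speedup can be illustrated with a toy timing experiment: because GVQ token streams are mutually independent, the N branches can run concurrently, whereas RVQ-style prediction must process stages one after another. The `sleep` call below is a stand-in for per-branch network inference; nothing here reflects the paper's actual runtimes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def predict_branch(tokens):
    """Toy stand-in for one prediction branch: simulate a fixed
    inference cost, then return the tokens unchanged."""
    time.sleep(0.05)            # simulated per-branch compute
    return list(tokens)

streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]]  # N = 4 token streams

# Serial execution: branches run one after another (RVQ-style dependence).
t0 = time.perf_counter()
serial_out = [predict_branch(s) for s in streams]
serial_t = time.perf_counter() - t0

# Parallel execution: independent GVQ streams predicted concurrently.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    parallel_out = list(pool.map(predict_branch, streams))
parallel_t = time.perf_counter() - t0

assert serial_out == parallel_out
print(f"serial {serial_t:.2f}s vs parallel {parallel_t:.2f}s")
```

In this toy model the parallel version approaches the cost of a single branch rather than the sum of all N, which is the structural reason independent token streams translate into faster generation.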

In summary, ParaGSE introduces three key innovations: (1) a GVQ‑based codec that provides independent token streams, (2) a spectral‑feature conditioned parallel prediction mechanism that simultaneously restores all token groups, and (3) a training regime that freezes the codec while fine‑tuning only the enhancement modules. These contributions collectively address the long‑standing trade‑off between model complexity, inference speed, and speech quality in generative speech enhancement, opening the door for practical deployment in communication devices, hearing aids, and voice‑assistant systems.

