WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation


Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.


💡 Research Summary

WhispEar introduces a novel bidirectional framework for converting whispered speech to normal speech (W2N) and vice versa (N2W) by leveraging a shared, speaking‑mode‑invariant semantic representation. The system consists of three sequential stages. In Stage 1, a lightweight semantic tokenizer is distilled from a large‑scale ASR encoder (HuBERT‑X‑Large) using an L2 distillation loss; the student model processes both whispered and normal waveforms, producing discrete semantic tokens via finite‑scalar quantization. Stage 2 trains a conditional flow‑matching transformer that maps these tokens to mel‑spectrograms. The model receives masked mel regions initialized with Gaussian noise and predicts the velocity field along an optimal‑transport path, using a flow‑matching loss that operates only on the masked portions. Crucially, the same acoustic model and vocoder are shared for both W2N and N2W directions, ensuring parameter efficiency and consistent quality.
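The two training objectives above can be sketched numerically. The following is a minimal NumPy illustration under stated assumptions (shapes, level counts, and the toy stand-ins are hypothetical, not the paper's actual configuration): finite-scalar quantization rounds each bounded channel onto a fixed grid, and the conditional flow-matching loss supervises the predicted velocity only on masked mel frames along a linear optimal-transport path.

```python
import numpy as np

def fsq_quantize(z, levels=8):
    """Finite-scalar quantization sketch: bound each channel with tanh,
    then round onto a uniform grid of `levels` values in [-1, 1].
    (Training would use a straight-through gradient estimator.)"""
    z = np.tanh(z)                        # bound to (-1, 1)
    half = (levels - 1) / 2.0
    return np.round(z * half) / half      # snap to the grid

def flow_matching_loss(v_pred, x0, x1, t, mask):
    """Masked conditional flow-matching loss along the linear
    optimal-transport path x_t = (1 - t) * x0 + t * x1, whose
    ground-truth velocity is constant in t: v* = x1 - x0.
    Only masked mel positions (mask == 1) contribute."""
    v_target = x1 - x0
    sq_err = (v_pred - v_target) ** 2 * mask
    return sq_err.sum() / np.maximum(mask.sum(), 1)

# Toy usage: Gaussian-noise init x0 -> "mel" target x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(80, 100))           # mel-spectrogram target (toy)
x0 = rng.normal(size=(80, 100))           # noise initialization
mask = (rng.random((80, 100)) < 0.5).astype(float)
loss = flow_matching_loss(x1 - x0, x0, x1, t=0.3, mask=mask)
print(round(loss, 6))                     # exact velocity -> 0.0
```

Note that because the OT path is linear, the target velocity does not depend on `t`; a learned model would still receive `t` as conditioning.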

Stage 3 builds two “unified” tokenizers: f_n2w (normal‑to‑whisper) and f_w2n (whisper‑to‑normal). First, f_n2w is trained on the limited real paired data, learning to map normal speech into the semantic space of whispered speech. Once trained, it is used to synthesize high‑quality pseudo‑whispered utterances from massive normal‑speech corpora (e.g., LibriSpeech, EMILIA). This yields large‑scale aligned pseudo‑parallel pairs (x̃_w, x_n) without any additional recording effort. Finally, f_w2n is trained on a combination of the original real paired data and the generated pseudo‑pairs, dramatically expanding the effective training set for the W2N task.

To evaluate the approach, the authors constructed wEar, the largest bilingual (Chinese‑English) whispered‑normal parallel corpus to date. The real portion (wEar‑Real) contains 146 speakers and 18 hours of Chinese recordings, with carefully aligned whispered/normal utterances captured in low‑noise environments. Using the N2W model, they generated over 3,000 hours of pseudo‑parallel data (wEar‑Pseudo), bringing the total corpus to more than 3,000 hours and 600k aligned pairs.

Experiments address three questions: (EQ1) How does WhispEar compare to state‑of‑the‑art W2N systems? (EQ2) How effective is the internally generated pseudo‑whispered data for training? (EQ3) Does scaling the pseudo‑data further improve performance, and what is the role of fine‑tuning on real data?

Performance is measured across four dimensions: naturalness (UTMOS, DNSMOS, NISQA), intelligibility (WER for English, CER for Chinese using Whisper‑large‑v3), prosody (F0 Pearson correlation with the target normal speech), and speaker similarity (cosine similarity of speaker embeddings from wavlm‑base‑plus‑sv). Compared with four baselines—WESPER, DistillW2N, CosyVoice2, and MaskCycleGAN—WhispEar (without large‑scale scaling) already matches or exceeds them on most metrics. When the pseudo‑parallel data is incorporated (WhispEar‑Scaled), the system achieves the best overall results: English WER drops to 22.44% (vs. 36.69% for the strongest baseline), speaker similarity rises to 0.577, UTMOS reaches 3.75, DNSMOS 3.76, NISQA 4.38, and F0 correlation improves to 0.513. In Chinese, WhispEar‑Scaled attains a CER of 14.93% and speaker similarity of 0.750, whereas English‑only baselines suffer severe degradation (CER > 80%).
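Two of these metrics are simple to state directly. The NumPy sketch below uses toy inputs, not the actual evaluation stack (the real pipeline would extract F0 with a pitch tracker and embeddings with wavlm‑base‑plus‑sv): F0 Pearson correlation is computed over mutually voiced frames, and speaker similarity is a plain cosine.

```python
import numpy as np

def f0_pearson(f0_conv, f0_ref):
    """Pearson correlation between converted and reference F0 tracks,
    restricted to frames where both tracks are voiced (F0 > 0)."""
    voiced = (f0_conv > 0) & (f0_ref > 0)
    a, b = f0_conv[voiced], f0_ref[voiced]
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(emb_a @ emb_b /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Toy check: a uniformly scaled contour keeps the same shape,
# so its Pearson correlation with the reference is 1.
f0_ref = np.array([0.0, 120.0, 125.0, 130.0, 0.0, 140.0])
f0_conv = f0_ref * 1.05
print(round(f0_pearson(f0_conv, f0_ref), 3))  # 1.0
```

Masking to mutually voiced frames matters here: whispered sources have no F0 at all, so correlation is only meaningful on the converted output against the normal-speech target.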

Ablation studies on data construction reveal that naïvely using raw, unaligned pairs yields the poorest performance, while traditional DSP‑based whisper synthesis offers modest gains but still lags behind. Aligning real pairs (using the proposed forced‑alignment pipeline) and generating pseudo‑whispers with the N2W model each provide substantial improvements; combining both (A + P) yields the highest scores across all metrics. This demonstrates that high‑quality temporal alignment and model‑based pseudo‑generation are both essential for overcoming data scarcity.

Scaling experiments further confirm the data‑centric advantage. Pre‑training the unified tokenizer on 10 k, 50 k, and 200 k pseudo‑pairs shows monotonic reductions in WER and increases in speaker similarity. Adding a fine‑tuning stage on the limited real aligned data (Pretrain + SFT) amplifies these gains, indicating that a small amount of high‑quality real data can effectively calibrate a model trained on massive synthetic data.

In summary, WhispEar contributes four key innovations: (1) a modality‑invariant semantic token representation that bridges whispered and normal speech; (2) a flow‑matching acoustic model shared across conversion directions; (3) a zero‑shot N2W pipeline that generates large‑scale, high‑fidelity pseudo‑whispered speech; and (4) a systematic scaling study confirming that expanding pseudo‑parallel data consistently improves W2N performance. The release of the extensive bilingual wEar corpus further provides a valuable benchmark for future whispered‑speech research.

