Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder.

Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability
💡 Research Summary
Voice conversion (VC) aims to modify the speech characteristics of a source speaker so that they sound as if spoken by a target speaker while preserving the linguistic content. Recent VC systems typically consist of two modules: a statistical or neural spectral conversion front‑end that maps source acoustic features to target features, and a neural vocoder that synthesizes the waveform from the converted features. WaveNet (WN) has become the de facto vocoder in many VC pipelines because of its ability to generate high‑fidelity speech. However, WN’s architecture relies on fixed dilated convolutions and a generic, large‑scale network. This design leads to two practical problems in VC. First, the fixed dilation pattern does not adapt to the large pitch variations that are common after spectral conversion, resulting in reduced pitch controllability and occasional artifacts. Second, achieving acceptable speech quality often requires a very large model (tens of millions of parameters), which increases computational cost and makes the system less robust to out‑of‑distribution input features.
The authors propose a Quasi‑Periodic WaveNet (QPNet) vocoder that directly addresses these shortcomings. The core innovation is a pitch‑dependent dilated convolution (PDDC) layer. In a PDDC, the dilation factor for each convolutional block is modulated by the instantaneous fundamental frequency (F0) of the conditioning features. When the pitch is low, the dilation expands, allowing the receptive field to cover the longer temporal span of a low‑frequency pitch period; when the pitch is high, the dilation contracts, matching the shorter period with finer temporal resolution. This dynamic adjustment enables the network to follow the quasi‑periodic nature of voiced speech without increasing the number of parameters. Consequently, QPNet can be built with a comparable parameter budget to a standard WaveNet while offering superior pitch controllability and a more compact representation of periodic structure.
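As a rough illustration of this mechanism, the sketch below derives per‑frame dilation sizes from F0, following the intuition that the dilation should scale with the pitch period (sampling rate divided by F0). The function name, the fallback factor of 1 for unvoiced frames, and the `dense_factor` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pitch_dependent_dilations(f0, base_dilations, fs=22050, dense_factor=4):
    """Sketch of pitch-dependent dilation sizing (simplified, hypothetical).

    For each frame, the base dilation of every adaptive layer is scaled by a
    pitch-dependent factor proportional to the pitch period in samples, so
    the receptive field tracks the quasi-periodic waveform structure.

    f0:             per-frame F0 values in Hz (0 denotes unvoiced frames)
    base_dilations: per-layer base dilations, e.g. [1, 2, 4]
    fs:             sampling rate in Hz
    dense_factor:   assumed knob controlling how densely the adaptive
                    layers sample within one pitch cycle
    """
    f0 = np.asarray(f0, dtype=float)
    # Pitch period in samples divided by the dense factor; unvoiced frames
    # fall back to a neutral factor of 1 (an assumption made here).
    factor = np.where(f0 > 0, fs / (np.maximum(f0, 1e-6) * dense_factor), 1.0)
    factor = np.maximum(np.rint(factor), 1).astype(int)
    # Per-frame, per-layer dilation: d_t = E_t * d_base
    return factor[:, None] * np.asarray(base_dilations, dtype=int)[None, :]

# Higher F0 -> shorter period -> smaller dilation, and vice versa.
dils = pitch_dependent_dilations([220.0, 110.0, 0.0], [1, 2, 4])
```

Note how the 220 Hz frame receives half the dilation of the 110 Hz frame: the receptive field shrinks as the pitch period shortens, which is the inverse relationship described above.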
In the proposed VC pipeline, the source spectral features (e.g., mel‑cepstral coefficients) are first transformed to target‑speaker features using a frame‑wise deep neural network (DNN). The DNN consists of three hidden layers with 1024 ReLU units each and is trained with an L2 loss on paired source‑target data. After conversion, the prosodic features—fundamental frequency (F0) and voiced/unvoiced flags—are linearly transformed (mean‑variance normalization and scaling) to align with the target speaker’s pitch distribution. Both the converted spectral features and the linearly transformed prosodic features are fed as conditioning inputs to the QPNet vocoder, which then generates the final waveform.
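The "linearly transformed" F0 step described above is commonly realized as a log‑Gaussian normalized transformation: log‑F0 is standardized with the source speaker's statistics and rescaled with the target's. The sketch below assumes that standard formulation; the function name and the zero fallback for unvoiced frames are illustrative choices, not details taken from the paper.

```python
import numpy as np

def convert_f0(f0_src, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transformation (standard linear prosody
    conversion, assumed here).

    f0_src:    per-frame source F0 in Hz (0 denotes unvoiced frames)
    src_stats: (mean, std) of log-F0 over the source speaker's voiced frames
    tgt_stats: (mean, std) of log-F0 over the target speaker's voiced frames
    """
    f0_src = np.asarray(f0_src, dtype=float)
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    voiced = f0_src > 0
    f0_conv = np.zeros_like(f0_src)  # unvoiced frames stay at 0
    # Standardize in the log domain, then map into the target distribution.
    lf0 = (np.log(f0_src[voiced]) - mu_s) / sd_s * sd_t + mu_t
    f0_conv[voiced] = np.exp(lf0)
    return f0_conv

# Shifting the mean from log(100) to log(200) doubles the converted pitch.
out = convert_f0([100.0, 0.0], (np.log(100.0), 0.5), (np.log(200.0), 0.5))
```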
The authors evaluate the system on the VCC 2018 database, using several objective and subjective metrics. Objective measures include Mel‑Cepstral Distortion (MCD) and F0 Root‑Mean‑Square Error (RMSE), while subjective evaluation comprises Mean Opinion Score (MOS) listening tests and ABX preference tests. Three configurations are compared: (1) a baseline WaveNet with the same number of parameters as QPNet (≈12 M parameters), (2) a larger WaveNet with roughly double the parameters (≈24 M), and (3) the proposed QPNet (≈12 M). Results show that QPNet outperforms the same‑size WaveNet by a substantial margin: MOS improves by an average of 0.34 points, MCD decreases by about 0.45 dB, and F0 RMSE drops by roughly 12 Hz. When compared with the double‑size WaveNet, QPNet’s MOS difference is less than 0.07 points, indicating comparable speech quality despite having half the parameters. Moreover, QPNet requires roughly 45% fewer FLOPs and less memory during inference, highlighting its efficiency.
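For reference, the two objective metrics can be computed as sketched below. This assumes the common conventions (0th cepstral coefficient excluded from MCD, RMSE taken over frames voiced in both signals, frames already time‑aligned, e.g. by dynamic time warping); the paper may differ in these details.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Frame-averaged Mel-Cepstral Distortion in dB (standard definition
    assumed): (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over
    time-aligned frames, with the 0th (energy) coefficient excluded.
    """
    mc_ref = np.asarray(mc_ref, dtype=float)
    mc_conv = np.asarray(mc_conv, dtype=float)
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]  # drop the 0th coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse(f0_ref, f0_conv):
    """F0 RMSE in Hz over frames voiced in both signals (an assumption;
    some studies instead measure RMSE in the log-F0 domain)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_conv = np.asarray(f0_conv, dtype=float)
    both = (f0_ref > 0) & (f0_conv > 0)
    return float(np.sqrt(np.mean((f0_ref[both] - f0_conv[both]) ** 2)))

mcd = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[0.0, 0.0, 0.0]])
rmse = f0_rmse([100.0, 0.0, 200.0], [110.0, 50.0, 200.0])
```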
Key contributions of the paper are: (1) introduction of pitch‑dependent dilated convolutions that endow the vocoder with explicit pitch control without enlarging the network; (2) demonstration that a statistically‑driven spectral conversion front‑end can be seamlessly combined with QPNet to achieve state‑of‑the‑art VC performance; (3) empirical evidence that QPNet matches the quality of a much larger WaveNet while being more robust to the out‑of‑distribution features typical of VC tasks. The work suggests that incorporating task‑specific priors (such as quasi‑periodicity) into vocoder architecture is a promising direction for future research. Potential extensions include multi‑speaker and multilingual VC, more sophisticated non‑linear pitch modeling, and real‑time deployment through further model compression or quantization techniques.