A comparative study of generative models for child voice conversion
Generative models are a popular choice for adult-to-adult voice conversion (VC) because they can model unlabelled data efficiently. To date, their usefulness for producing children's speech, and in particular for adult-to-child VC, has not been investigated. For adult-to-child VC, four generative models are compared: a diffusion model, a flow-based model, a variational autoencoder, and a generative adversarial network. Results show that although the converted speech produced by these models sounds plausible, it exhibits insufficient similarity to the target speaker's characteristics. We introduce an efficient frequency-warping technique that can be applied to the models' outputs and significantly reduces the mismatch between adult and child speech. The outputs of all models are evaluated using both objective and subjective measures. In particular, we compare specific speaker pairings using a unique corpus collected for the dubbing of children's speech.
💡 Research Summary
This paper investigates adult‑to‑child voice conversion (VC), a scenario that has received little attention despite its relevance for dubbing children’s media. Four state‑of‑the‑art generative models—CycleGAN‑VC2 (a GAN), a variational autoencoder (VAE), a normalizing flow model, and a diffusion probabilistic model (DPM)—are trained on non‑parallel, acted speech recorded from professional dubbing actors. The authors also propose a post‑processing step based on frequency warping, which multiplies mel‑cepstral coefficients by a learned transformation matrix to align the vocal‑tract length characteristics of adult and child speech.
Data: An internal studio corpus (5 h, three speakers: adult male, adult female, child female) and the public CMU child speech corpus (9.1 h, 76 speakers) are used. All audio is resampled to 16 kHz, with 10 % held out for validation and 10 % for testing.
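The 80/10/10 utterance split described above could be sketched as follows. This is a minimal illustration only: the paper does not specify the shuffling scheme or per-speaker handling, so the deterministic seeded shuffle below is an assumption.

```python
import random

def split_files(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Deterministic train/validation/test split of utterance files.

    Mirrors the paper's 10 % validation / 10 % test hold-out; the seeded
    shuffle is an assumption, used here so the split is reproducible.
    """
    files = sorted(files)          # fixed order before shuffling
    random.Random(seed).shuffle(files)
    n = len(files)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = files[:n_test]
    val = files[n_test:n_test + n_val]
    train = files[n_test + n_val:]
    return train, val, test
```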
Model details:
- CycleGAN‑VC2: Uses 36‑dimensional mel‑cepstral features extracted with WORLD, adversarial and cycle‑consistency losses, and a WORLD vocoder for waveform synthesis.
- VAE: Consists of a speaker encoder, content encoder, and decoder; operates on 512‑dimensional mel‑spectrograms; reconstruction uses Griffin‑Lim.
- Normalizing Flow: Trains directly on raw waveforms with invertible transformations, 256 ms windows, no overlap, and Griffin‑Lim for synthesis.
- Diffusion Model: Implements a forward diffusion (encoder) and reverse diffusion (decoder) using a Transformer encoder; 80‑dimensional mel‑spectrograms are input, and HiFi‑GAN serves as the final vocoder.
Frequency warping: The warping matrix D is initialized with cosine terms and refined by minimizing a mean‑squared error between warped source mel‑cepstral coefficients and target coefficients. The warping factor α depends on the source and target fundamental frequencies, enabling a smooth shift of the spectral envelope toward child‑like characteristics.
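The cosine-initialized warping matrix and its MSE refinement can be sketched as below. The paper only states that D is built from cosine terms and refined by minimizing an MSE against aligned target coefficients; the bilinear frequency warp and the least-squares construction used here to realize the cosine initialization are assumptions, and α is left as a free parameter (the paper derives it from source and target F0).

```python
import numpy as np

def cosine_warping_matrix(order, alpha, n_freq=256):
    """Warping matrix D mapping mel-cepstral coefficients c -> D @ c.

    The log-envelope is sum_n c_n * cos(n*w), so evaluating the cosine
    basis on a warped frequency axis and solving a least-squares problem
    yields D. The bilinear warp below is an assumption (identity at
    alpha = 0); the paper only specifies cosine initialization.
    """
    w = np.linspace(0.0, np.pi, n_freq)
    w_warp = w + 2.0 * np.arctan2(alpha * np.sin(w), 1.0 - alpha * np.cos(w))
    C = np.cos(np.outer(w, np.arange(order)))        # basis on original axis
    Cw = np.cos(np.outer(w_warp, np.arange(order)))  # basis on warped axis
    # Solve C @ D ~= Cw in least squares, so that C @ (D @ c) ~= Cw @ c.
    return np.linalg.pinv(C) @ Cw

def refine_warping(D, src, tgt, lr=0.1, steps=200):
    """Refine D by gradient descent on the mean squared error between
    warped source frames (src @ D.T) and time-aligned target frames,
    following the refinement objective the paper describes."""
    D = D.copy()
    for _ in range(steps):
        err = src @ D.T - tgt                    # (T, order) residual
        D -= lr * 2.0 * err.T @ src / len(src)   # gradient of mean ||err||^2
    return D
```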
Evaluation: Objective metrics are Mel‑Cepstrum Distortion (MCD) and normalized F0 root‑mean‑square error (F0 RMSE). Subjective listening tests (MOS and ABX) assess naturalness and similarity to the target child voice.
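The two objective metrics can be computed as follows. MCD uses the standard definition over time-aligned mel-cepstral frames; the exact normalization behind the paper's "normalized F0 RMSE" is not stated, so dividing by the reference mean F0 is an assumption here.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """Mel-Cepstrum Distortion in dB over time-aligned frames of shape (T, D).

    Standard MCD definition; the 0th (energy) coefficient is assumed to be
    excluded before calling, as is conventional.
    """
    diff = c_ref - c_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse(f0_ref, f0_conv, normalize=True):
    """F0 RMSE over frames voiced in both tracks (F0 > 0).

    The normalization by the reference mean F0 is an assumption about the
    paper's 'normalized F0 RMSE' metric.
    """
    voiced = (f0_ref > 0) & (f0_conv > 0)
    err = f0_ref[voiced] - f0_conv[voiced]
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return rmse / float(np.mean(f0_ref[voiced])) if normalize else rmse
```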
Results:
- On the internal acted corpus, the diffusion model achieves the lowest MCD (7.82 dB) and F0 RMSE (0.39 Hz), followed closely by CycleGAN‑VC2 (8.47 dB / 0.67 Hz). VAE and flow models perform worse (MCD ≈ 9–10 dB).
- Adding the warping post‑processing reduces MCD by 0.3–0.7 dB and F0 RMSE by 0.1–0.2 Hz for all models, indicating a consistent benefit.
- On the multi‑speaker CMU corpus, similar trends appear: diffusion (8.92 dB / 0.72 Hz) and CycleGAN‑VC2 (8.08 dB / 0.63 Hz) improve after warping to 8.45 dB / 0.68 Hz and 7.98 dB / 0.52 Hz respectively.
- Subjective scores confirm that warped outputs are perceived as more natural and closer to the target child voice, with diffusion + warping receiving the highest preference.
Discussion:
- Model trade‑offs: Diffusion models excel in capturing speaker‑independent representations but are computationally heavy. CycleGAN‑VC2 is lightweight and fast but struggles with multi‑speaker generalization. VAE suffers from the reconstruction‑KL trade‑off, leading to higher spectral distortion. Flow models provide exact likelihoods but lack contextual modeling, limiting prosody preservation.
- Warping impact: The frequency warping effectively compensates for the vocal‑tract length mismatch, a dominant factor distinguishing adult from child speech. However, it introduces an extra training stage and may hinder real‑time deployment.
- Limitations: The internal dataset is modest (5 h, three speakers), limiting conclusions about scalability. Warping is optimized on mel‑features only; extending to phase or waveform‑level adjustments could yield further gains. Objective metrics do not capture emotional or prosodic fidelity, suggesting the need for richer evaluation protocols.
Conclusion: The study demonstrates that generative models, especially diffusion‑based approaches, can perform adult‑to‑child VC with reasonable quality, and that a simple frequency‑warping post‑processor consistently enhances both objective and subjective outcomes. This constitutes the first systematic comparison of generative VC models on acted dubbing data. Future work should explore larger, more diverse corpora, real‑time warping implementations, and evaluation metrics that reflect emotional expressivity.