Speech to Speech Synthesis for Voice Impersonation
Numerous models have shown great success in speech recognition and speech synthesis, but models for speech-to-speech processing have not been heavily explored. We propose the Speech-to-Speech Synthesis Network (STSSN), a model built on current state-of-the-art systems that fuses the two disciplines to perform effective speech-to-speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful and succeeds in generating realistic audio samples despite limitations in its capacity. We benchmark it against a generative adversarial model that accomplishes a similar task, and show that ours produces more convincing results.
💡 Research Summary
The paper tackles the under‑explored problem of speech‑to‑speech style transfer, aiming to generate an utterance that preserves the linguistic content of a source speaker while adopting the voice characteristics of a target speaker. The authors propose the Speech‑to‑Speech Synthesis Network (STSSN), which is essentially a pipeline that stitches together three state‑of‑the‑art components: a DeepSpeech‑style automatic speech recognition (ASR) front‑end, a speaker‑embedding encoder, and a Tacotron‑2 text‑to‑speech (TTS) back‑end.
System Architecture
- ASR Module – A modified DeepSpeech model (single convolutional layer, three residual blocks, a fully‑connected layer, followed by four bidirectional GRU layers) is trained on the LibriSpeech train‑clean‑100 subset using Connectionist Temporal Classification (CTC) loss. It converts the source audio into a character sequence.
- Speaker Embedding – An LSTM‑based speaker encoder (trained on a speaker‑verification task) extracts a 256‑dimensional fixed‑size embedding from the target speaker’s audio. The embedding is designed to capture timbre, pitch range, and other style cues.
- TTS Module – A pre‑trained Tacotron‑2 model receives the character sequence from the ASR and the speaker embedding. The embedding is concatenated to the character embeddings at every decoding step, and a location‑sensitive attention mechanism produces a mel‑spectrogram conditioned on the target voice. A vocoder from the same repository converts the mel‑spectrogram into a waveform.
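The conditioning step described above can be sketched as follows. This is a minimal numpy illustration, not the paper's code: the 256-dimensional speaker embedding size is stated in the summary, while the 512-dimensional character-embedding size is the Tacotron-2 default and the function name is purely illustrative.

```python
import numpy as np

EMB_DIM = 512   # assumed character-embedding size (Tacotron-2 default)
SPK_DIM = 256   # speaker-embedding size stated in the summary

def condition_on_speaker(char_embeddings: np.ndarray,
                         speaker_embedding: np.ndarray) -> np.ndarray:
    """Broadcast-concatenate one fixed speaker vector onto every time step."""
    T = char_embeddings.shape[0]
    tiled = np.tile(speaker_embedding, (T, 1))               # (T, SPK_DIM)
    return np.concatenate([char_embeddings, tiled], axis=1)  # (T, EMB_DIM + SPK_DIM)

chars = np.random.randn(37, EMB_DIM)   # e.g. a 37-character utterance
spk = np.random.randn(SPK_DIM)
conditioned = condition_on_speaker(chars, spk)
print(conditioned.shape)  # (37, 768)
```

Because the same vector is repeated at every step, the decoder sees a constant style signal throughout the utterance, which is exactly the "static speaker embedding" limitation discussed below.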
Datasets and Training
- STSSN: LibriSpeech (100 h train‑clean‑100, 5.4 h test‑clean). Audio is resampled to 16 kHz and transformed into 128‑band mel spectrograms.
- Baseline: CycleGAN applied to the VCC2016 voice conversion challenge data (1620 training, 108 validation samples). Spectrograms are decomposed into fundamental frequency (f0), spectral envelope, and aperiodicity; the envelope is encoded and fed to the GAN.
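The 128-band mel analysis at 16 kHz could be laid out as below. This is a sketch of the filterbank geometry only: the paper does not give frame length or hop size, and the mel formula shown is the common HTK variant, which may differ from the authors' toolkit.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale (an assumption; the paper does not specify)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_mels=128, sr=16_000, fmin=0.0):
    """Edge frequencies for n_mels triangular filters (n_mels + 2 points)."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(sr / 2), n_mels + 2)
    return mel_to_hz(mels)

edges = mel_band_edges()
print(len(edges), round(float(edges[-1])))  # 130 8000
```

The 130 edge frequencies span 0 Hz to the 8 kHz Nyquist limit of 16 kHz audio, with consecutive triples defining each triangular filter.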
Training details for the ASR component include a batch size of 8, a one‑cycle learning‑rate schedule peaking at 5 × 10⁻⁴, and roughly 70 epochs (≈48 h on a single GPU). The Tacotron‑2 component is used unchanged, relying on its publicly available pretrained weights and a built‑in vocoder. The CycleGAN uses gated linear units, a minibatch size of 1, and the standard adversarial plus cycle‑consistency loss (λ weighting the latter).
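A one-cycle schedule peaking at 5 × 10⁻⁴, as described for the ASR training, can be sketched in pure Python. The 30/70 warm-up/anneal split and the floor learning rate are assumptions; the paper states only the peak value.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=5e-4, pct_warmup=0.3, min_lr=1e-6):
    """Linear warm-up to max_lr, then cosine anneal back down to min_lr."""
    warm = int(total_steps * pct_warmup)
    if step < warm:
        # linear ramp from min_lr up to the peak
        return min_lr + (max_lr - min_lr) * step / max(warm, 1)
    # cosine decay from the peak back to min_lr
    frac = (step - warm) / max(total_steps - warm, 1)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))

total = 70_000                          # e.g. 70 epochs x 1000 steps/epoch
print(one_cycle_lr(0, total))           # starts at the floor, 1e-6
print(one_cycle_lr(21_000, total))      # hits the stated peak, 5e-4
```

Frameworks such as PyTorch ship an equivalent scheduler (`torch.optim.lr_scheduler.OneCycleLR`); the explicit function here just makes the shape of the schedule visible.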
Evaluation
The authors rely primarily on Mean Opinion Score (MOS) collected from human listeners, claiming that STSSN yields “more convincing” speech than the CycleGAN baseline. Visual inspection of spectrograms shows that the output preserves the horizontal stripe patterns of the content audio (indicating linguistic structure) while adopting the overall texture of the style audio (indicating voice characteristics). No objective metrics such as PESQ, STOI, or speaker‑verification similarity scores are reported.
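For context on what an MOS report involves, the aggregation step can be sketched as below. The ratings are hypothetical (the paper does not publish raw listener data), and the normal-approximation confidence interval is one common convention, not necessarily the authors' method.

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% half-interval."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half = z * r.std(ddof=1) / np.sqrt(len(r))
    return mean, half

scores = [4, 5, 3, 4, 4, 5, 3, 4]   # hypothetical 1-5 listener ratings
mean, half = mos_with_ci(scores)
print(f"MOS = {mean:.2f} ± {half:.2f}")
```

Reporting intervals like this, alongside objective metrics such as PESQ or speaker-verification similarity, would make the "more convincing" claim easier to verify.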
Strengths
- Pragmatic Integration – By reusing well‑tested open‑source models, the authors avoid reinventing complex components and achieve a functional prototype quickly.
- Clear Motivation – The paper discusses both beneficial applications (identity protection, accessibility, game development) and potential misuse (voice fraud), showing awareness of ethical implications.
- Empirical Advantage – Subjective listening tests suggest that the concatenated pipeline outperforms a direct spectrogram‑to‑spectrogram CycleGAN approach, especially in handling noise and preserving content.
Weaknesses and Limitations
- Text Bottleneck – Converting source audio to text discards prosodic, emotional, and fine‑grained acoustic cues. Errors in transcription can propagate to the TTS stage, potentially degrading naturalness.
- Static Speaker Embedding – A single 256‑dimensional vector may not capture intra‑speaker variability (e.g., different emotions, speaking rates). Dynamic conditioning or hierarchical embeddings could improve fidelity.
- Evaluation Shortcomings – Relying solely on MOS limits reproducibility and objective comparison. Inclusion of PESQ, STOI, or speaker verification scores would provide a more rigorous assessment.
- Baseline Fairness – The CycleGAN is trained on a much smaller, task‑specific dataset (VCC2016) whereas STSSN benefits from the large LibriSpeech corpus. This disparity makes direct performance comparison less conclusive.
- Insufficient Detail for Replication – Hyper‑parameters such as frame length, hop size, mel‑normalization, and exact GPU specifications are omitted, hindering reproducibility.
- Model Size and Efficiency – While the ASR component is deliberately lightweight, the overall pipeline still requires three separate forward passes (ASR → embedding → TTS), increasing latency and memory footprint.
Future Directions
The authors propose an end‑to‑end architecture that eliminates the intermediate text representation, directly mapping source spectrograms to target spectrograms conditioned on speaker embeddings. This would preserve more acoustic information, reduce model size, and enable richer style transfer. Potential enhancements include:
- Using variational or contrastive learning to obtain more expressive, dynamic speaker embeddings.
- Incorporating adversarial training on the final waveform (e.g., WaveGAN or HiFi‑GAN) to improve audio realism.
- Expanding evaluation to objective metrics and large‑scale speaker verification tests.
- Exploring multilingual and cross‑accent transfer to broaden applicability.
Conclusion
The paper demonstrates that a straightforward concatenation of high‑performing ASR and TTS models, augmented with a speaker‑style encoder, can achieve convincing speech‑to‑speech voice impersonation. While the prototype shows promise and outperforms a CycleGAN baseline in subjective listening tests, the work would benefit from a more rigorous evaluation, a deeper analysis of the bottlenecks introduced by the text intermediate, and a transition toward a fully end‑to‑end trainable system. The proposed future work outlines a clear path toward higher fidelity, lower latency, and broader applicability, making this an interesting contribution to the emerging field of neural voice conversion.