Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We investigated the impact of noisy linguistic features on the performance of a Japanese speech synthesis system based on a neural network that uses a WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features, including phoneme and prosodic information, in both training and test sets against several other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, degraded the ideal system’s performance in a statistically significant way because of the mismatched condition between the training and test sets. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, the results further indicated that adding noise to the linguistic features of the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.


💡 Research Summary

This paper investigates how inaccuracies in linguistic features—specifically Japanese pitch‑accent types and interrogative phrase flags—affect the quality of a neural‑network‑based text‑to‑speech (TTS) system that uses a WaveNet vocoder. The authors adopt a conventional pipeline (linguistic analyzer → acoustic model → vocoder) but replace the deterministic vocoder and RNN‑based acoustic models with state‑of‑the‑art components: two autoregressive acoustic networks (SAR for mel‑generalized cepstral coefficients and voiced/unvoiced flags, DAR for quantized F0) and a speaker‑dependent WaveNet vocoder operating at 16 kHz with 10‑bit µ‑law quantization.
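The 10-bit µ-law quantization mentioned above compresses waveform samples nonlinearly before the WaveNet vocoder predicts them as discrete classes. A minimal sketch of standard µ-law companding (the general textbook formula, not code from the paper; `mu=1023` gives 1024 levels, i.e. 10 bits):

```python
import numpy as np

def mulaw_encode(x, mu=1023):
    """Compress samples in [-1, 1] with mu-law, then quantize.

    Returns integer codes in [0, mu]; mu=1023 corresponds to the
    10-bit quantization used for the WaveNet vocoder's output.
    """
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(codes, mu=1023):
    """Invert the companding (lossy: quantization error remains)."""
    y = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 5)
codes = mulaw_encode(x)      # e.g. 0 for -1.0, 512 for 0.0, 1023 for 1.0
x_hat = mulaw_decode(codes)  # close to x, up to quantization error
```

The nonlinear compression allocates more quantization levels to low-amplitude samples, which dominate speech waveforms, so a 10-bit code book suffices where linear quantization would need far more levels.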

Three sets of linguistic features are prepared: (1) OpenJTalk automatic extraction (389‑dimensional), (2) “oracle” manually corrected annotations from KDDI (265‑dimensional), and (3) a corrupted version of the oracle where the pitch‑accent type is perturbed by adding a random integer between –2 and +2 with 50 % probability, and the binary question flag is flipped with 30 % probability. These manipulations simulate realistic annotation errors.
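The corruption scheme above is simple enough to sketch directly. The following is a hedged illustration of the described perturbations (the clamping of the accent type at zero is my assumption; the paper only specifies the perturbation range and probabilities):

```python
import random

def corrupt_labels(accent_type, question_flag, rng,
                   accent_prob=0.5, flag_prob=0.3):
    """Simulate annotation errors as described in the summary.

    - With 50% probability, add a random integer in [-2, 2] to the
      pitch-accent type (clamped at 0 here; clamping is an assumption).
    - With 30% probability, flip the binary interrogative-phrase flag.
    """
    if rng.random() < accent_prob:
        accent_type = max(0, accent_type + rng.randint(-2, 2))
    if rng.random() < flag_prob:
        question_flag = 1 - question_flag
    return accent_type, question_flag

rng = random.Random(0)
noisy = [corrupt_labels(3, 0, rng) for _ in range(5)]
```

Because `rng.randint(-2, 2)` can return 0, the effective rate of actually changed accent types is somewhat below 50%, while the flag flips at the full 30% rate.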

The corpus consists of 27,999 training utterances (≈46.9 h) and 480 utterances each for validation and test (≈0.8 h). Acoustic features (60‑dimensional MGC, 25‑dimensional BAP, and 255‑level quantized F0) are extracted with WORLD and SPTK. The acoustic models use a three‑layer architecture (512‑unit feed‑forward, 256‑unit bidirectional LSTM, 128‑unit unidirectional LSTM). The WaveNet vocoder contains 40 causal dilated convolution layers.

Five system configurations are evaluated:

  • OJT: OpenJTalk features for both training and testing.
  • MOO: Oracle features for both training and testing (the “ideal” system).
  • MOC: Oracle training, corrupted testing.
  • MMC: Corrupted features for both training and testing.
  • MMO: Training uses a mixture in which 28.6 % of utterances have corrupted oracle labels; testing uses clean oracle features.

Objective metrics (F0 RMSE, correlation, V/UV error, mel‑cepstral distortion) show that MOO achieves the best scores, while MOC suffers a large drop when test‑time features are noisy. Interestingly, MMC outperforms OJT, and MMO matches MOO despite being trained on partially corrupted data, suggesting a regularization effect akin to a denoising auto‑encoder.

A large‑scale crowdsourced listening test (100 participants, 720 sets, three ratings per sample) provides MOS and a Turing‑style binary discrimination test. Natural speech (µ‑law down‑sampled) scores 3.96 / 5.0. MOO and MMO obtain virtually identical MOS (3.62 vs. 3.63, p = 0.72), indicating that a modest amount of training‑time noise does not degrade perceived quality. OJT and MOC receive lower scores (≈3.3), and MMC again outperforms OJT. In the Turing test, listeners struggle to reliably distinguish synthetic from natural speech, confirming the high naturalness afforded by the WaveNet vocoder.
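Whether listeners "struggle to reliably distinguish" synthetic from natural speech can be framed as testing a discrimination rate against 50 % chance. A hedged sketch using an exact one-sided binomial test (the listener counts below are hypothetical, not figures from the paper):

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): one-sided p-value against chance."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 100 trials, 56 correct identifications.
# A large p-value here would mean the accuracy is consistent with guessing.
p_value = binom_sf(56, 100)
```

If the p-value is large, the observed accuracy is statistically indistinguishable from coin-flipping, which is the operational meaning of "passing" an utterance-level Turing test.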

The authors draw several conclusions: (1) Accurate prosodic annotations are crucial; mismatches between training and test linguistic features cause significant degradation. (2) Introducing controlled noise into training‑time linguistic features acts as a regularizer, improving robustness when test‑time features are imperfect. (3) The combination of autoregressive acoustic models and WaveNet vocoder yields synthetic speech that can pass a Turing‑style test for many listeners. They suggest future work on automatic detection and correction of annotation errors, and on extending the denoising‑regularization strategy to multilingual or multi‑speaker scenarios.

