Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids' Story Speech Synthesis


Expressive speech synthesis requires vibrant prosody and well-timed pauses. We propose an effective strategy for augmenting a small dataset to train an expressive end-to-end Text-to-Speech model. Using a text emotion recognizer, we merge the audio of emotionally congruent sentences to create augmented expressive speech data. By training on two-sentence audio, our model learns natural breaks between lines. We further apply self-supervised contrastive training to improve the extraction of speaking-style embeddings from speech. During inference, our model produces multi-sentence speech in one step, guided by a speaking style predicted from the text. Evaluations demonstrate the effectiveness of the proposed approach compared to a baseline model trained on consecutive two-sentence audio: our synthesized speech exhibits an inter-sentence pause distribution closer to that of the ground-truth speech, and subjective evaluations show it scores higher in naturalness and style suitability than the baseline.


💡 Research Summary

This paper addresses the challenge of expressive, multi‑sentence speech synthesis for children’s storybooks when only a small amount of high‑quality audio data is available. The authors propose a two‑pronged approach: (1) emotion‑coherent data augmentation and (2) self‑supervised contrastive training of the style encoder.

First, a T5‑based text emotion recognizer is fine‑tuned on a large external corpus and then applied to the Blizzard 2017 children’s audiobook dataset. Each sentence receives one of seven emotion labels (neutral, joy, fear, anger, sadness, love, surprise). Instead of concatenating consecutive sentences arbitrarily, the method pairs sentences that share the same emotion label and stitches their audio together to form longer training utterances (two‑sentence or three‑sentence segments). Between the paired sentences a silent pause is inserted, sampled from a normal distribution (mean ≈ 509 ms, std ≈ 223 ms) estimated from the real inter‑sentence gaps in the data. This yields “emotion‑coherent” long‑form audio that preserves consistent expressive style while providing the model with natural pause timing.
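The pairing-and-stitching step can be sketched as follows. This is a minimal illustration, not the authors' code: `sentences` is assumed to be a list of dicts holding each sentence's text, predicted emotion label, and audio as a 1-D float array, and the sample rate and helper names are illustrative. Only the pause statistics (mean ≈ 509 ms, std ≈ 223 ms) come from the paper.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed; the paper does not fix this here
PAUSE_MEAN_MS, PAUSE_STD_MS = 509.0, 223.0  # estimated from real gaps (paper)

def sample_pause(rng):
    """Sample a pause duration in samples, clipped to be non-negative."""
    ms = max(0.0, rng.normal(PAUSE_MEAN_MS, PAUSE_STD_MS))
    return int(ms / 1000.0 * SAMPLE_RATE)

def make_pairs(sentences, rng):
    """Pair sentences sharing an emotion label and stitch their audio
    with a sampled silent gap in between."""
    by_emotion = {}
    for s in sentences:
        by_emotion.setdefault(s["emotion"], []).append(s)
    augmented = []
    for emotion, group in by_emotion.items():
        rng.shuffle(group)
        for a, b in zip(group[::2], group[1::2]):
            silence = np.zeros(sample_pause(rng), dtype=np.float32)
            augmented.append({
                "text": a["text"] + " " + b["text"],
                "emotion": emotion,
                "audio": np.concatenate([a["audio"], silence, b["audio"]]),
            })
    return augmented
```

Three-sentence segments follow the same pattern with one more same-emotion sentence and a second sampled gap.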

Second, the Global Style Tokens (GST) module’s reference encoder is enhanced with SimCLR‑style contrastive learning. For each training sample two “views” are created by randomly masking 500 ms segments of the same mel‑spectrogram. The encoder is trained to pull together the embeddings of the two views while pushing apart embeddings of other samples in the batch. The contrastive loss is added to the overall TTS loss with a scaling factor of 0.1. This self‑supervised objective forces the reference encoder to learn a more robust, emotion‑invariant representation of speaking style.
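A minimal sketch of this SimCLR-style objective is below, assuming batches of mel-spectrograms shaped `[B, T, n_mels]`. The encoder itself, the mask length in frames, and the temperature are simplified stand-ins rather than the authors' exact configuration; only the two-masked-views construction, the NT-Xent-style loss, and the 0.1 weighting mirror the description above.

```python
import torch
import torch.nn.functional as F

MASK_FRAMES = 43  # ~500 ms at an assumed ~11.6 ms hop; illustrative

def random_mask(mel):
    """Zero out one random ~500 ms span per utterance, producing one 'view'."""
    mel = mel.clone()
    B, T, _ = mel.shape
    for b in range(B):
        start = torch.randint(0, max(1, T - MASK_FRAMES), (1,)).item()
        mel[b, start:start + MASK_FRAMES] = 0.0
    return mel

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss: pull each sample's two view embeddings together,
    push apart embeddings of other samples in the batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2B, D]
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# Training step (schematic):
#   total_loss = tts_loss + 0.1 * nt_xent(ref_encoder(random_mask(mel)),
#                                         ref_encoder(random_mask(mel)))
```

Because the two views differ only in which 500 ms span is hidden, minimizing this loss pushes the reference encoder toward features stable across local occlusions of the spectrogram, i.e. a more robust style representation.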

The backbone TTS architecture is Tacotron2 equipped with stepwise monotonic attention and a reduction factor of 2 to handle longer sequences. Text‑Predicted GST (TP‑GST) predicts a style embedding directly from the input text, enabling inference without an external reference audio. WaveGlow serves as the vocoder to convert predicted mel‑spectrograms into waveforms.

Training proceeds in three stages: (i) pre‑training on LJSpeech (≈21 h) for 300 epochs, (ii) further training on LibriTTS (≈585 h) for 200 epochs to develop style tokens, and (iii) fine‑tuning on the target speaker (a female child storyteller) from the Blizzard dataset for 100 epochs. Four model variants are evaluated:

  • M1: only single‑sentence utterances,
  • M2: single‑sentence + original consecutive two‑sentence pairs,
  • M3: single‑sentence + emotion‑coherent two‑sentence pairs,
  • M4: M3 plus contrastive self‑supervision.

Objective metrics show that M3 already reduces the L1 loss of the TP‑GST prediction from 0.212 (M1) to 0.119, and M4 further lowers it to 0.075. To assess whether the style embeddings capture emotion, a Support Vector Machine classifier is trained on GST embeddings extracted from an unseen Emotional Speech Dataset (ESD). Classification accuracy improves from 71.5 % (M1) to 75.3 % (M4), confirming that the contrastive objective yields more discriminative style vectors.
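The SVM probe can be reproduced schematically as below. The paper's exact kernel, split, and hyperparameters are not stated here, so an RBF-kernel `SVC` with a stratified hold-out split is an assumption; in practice `embeddings` would be GST vectors extracted from ESD utterances with their emotion labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def probe_accuracy(embeddings, labels, seed=0):
    """Fit an SVM on style embeddings and report held-out accuracy,
    as a proxy for how emotion-discriminative the embeddings are."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.25, random_state=seed, stratify=labels)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)  # kernel choice is an assumption
    return clf.score(X_te, y_te)
```

Higher probe accuracy on unseen emotional speech indicates that the style space separates emotions more cleanly, which is the sense in which M4's 75.3 % improves on M1's 71.5 %.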

A separate evaluation of inter‑sentence pause modeling demonstrates that models trained with the inserted pauses (M3, M4) generate more natural silence durations between sentences, aligning closely with the ground‑truth pause distribution. Subjective listening tests corroborate the objective findings: participants rate M4 higher than the baseline (M2) in both overall naturalness and style suitability.
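One way to quantify "aligning closely with the ground-truth pause distribution" is a distance between the two empirical distributions of measured inter-sentence silences. The 1-D Wasserstein-style quantile metric below is an illustrative choice, not necessarily the paper's statistic; inputs are pause durations in milliseconds.

```python
import numpy as np

def pause_distance(synth_ms, gt_ms):
    """Approximate 1-D Wasserstein distance between two empirical
    pause-duration distributions via their quantile functions."""
    q = np.linspace(0.0, 1.0, 200)
    qa = np.quantile(np.asarray(synth_ms, dtype=float), q)
    qb = np.quantile(np.asarray(gt_ms, dtype=float), q)
    return float(np.mean(np.abs(qa - qb)))
```

A model whose pauses match the ground truth yields a distance near zero; a systematic bias (e.g. pauses uniformly 100 ms too long) shows up directly as a distance of about that offset.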

The paper’s contributions are threefold: (1) a simple yet effective augmentation pipeline that leverages emotion‑matched sentence pairing and realistic pause insertion, (2) the first application of contrastive self‑supervision to GST‑based speaking‑style extraction, and (3) empirical evidence that both techniques jointly improve expressive, long‑form TTS for low‑resource children’s storybooks. Limitations include reliance on the accuracy of the emotion classifier (93 % on its own test set) and the current focus on a single English speaker; future work should explore multilingual extensions, multi‑speaker scenarios, and more sophisticated pause modeling. Overall, the study demonstrates that careful data augmentation combined with modern self‑supervised learning can substantially close the gap between low‑resource expressive TTS and high‑quality human narration.

