ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy


This work investigates how emotional speech and speech generative strategies affect ASR performance. We analyze speech synthesized by three emotional TTS models and find that substitution errors dominate, with emotional expressiveness varying across models. Based on these insights, we introduce two generative strategies for constructing fine-tuning subsets: one based on transcription correctness and the other on emotional salience. Results show consistent WER improvements on real emotional datasets without noticeable degradation on clean LibriSpeech utterances. The combined strategy achieves the strongest gains, particularly for expressive speech. These findings highlight the importance of targeted augmentation for building emotion-aware ASR systems.


💡 Research Summary

This paper investigates how emotional speech and the strategies used to generate synthetic data influence automatic speech recognition (ASR) performance, and it proposes targeted data‑selection methods to improve emotion‑aware ASR. The authors first create large synthetic corpora using three state‑of‑the‑art emotion‑controllable text‑to‑speech (TTS) systems: CosyVoice2, EmoVoice, and MaskGCT. For each system they synthesize five emotions (Angry, Happy, Neutral, Sad, Surprise) from a fixed set of LibriSpeech transcriptions, yielding 30,000 training utterances and roughly 13,500 development and test utterances per model. By keeping the lexical content constant, the study isolates the acoustic and prosodic effects of emotion.
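To make the corpus construction concrete, the sketch below shows the kind of loop the paper describes: every fixed LibriSpeech transcript is rendered in all five target emotions by one TTS system. The `synthesize` helper and the directory layout are hypothetical placeholders, not the actual CosyVoice2/EmoVoice/MaskGCT interfaces.

```python
from pathlib import Path
import soundfile as sf

EMOTIONS = ["Angry", "Happy", "Neutral", "Sad", "Surprise"]

def build_emotional_corpus(transcripts, tts_model, synthesize, out_dir):
    """Render every transcript in all five emotions with one TTS system.

    `synthesize(tts_model, text, emotion)` is a hypothetical wrapper that
    returns (waveform, sampling_rate); swap in the real CosyVoice2 /
    EmoVoice / MaskGCT inference call here.
    """
    out_dir = Path(out_dir)
    manifest = []
    for utt_id, text in transcripts.items():      # lexical content is held fixed
        for emotion in EMOTIONS:                  # only the emotional rendering varies
            wav, sr = synthesize(tts_model, text, emotion)
            wav_path = out_dir / emotion / f"{utt_id}.wav"
            wav_path.parent.mkdir(parents=True, exist_ok=True)
            sf.write(wav_path, wav, sr)
            manifest.append({"id": utt_id, "emotion": emotion,
                             "text": text, "audio": str(wav_path)})
    return manifest
```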

Next, the synthetic speech is evaluated along two dimensions. Using the Qwen2‑Audio ASR backbone (Whisper‑large‑v3 encoder + Qwen‑7B language model), the authors compute word error rate (WER) and break it down into substitution, insertion, and deletion errors. All three emotional TTS datasets show substantially higher WER than the neutral LibriSpeech baseline (≈1.57 %). Substitution errors dominate, confirming that emotional prosody mainly disrupts phoneme‑level recognition. To ensure that the higher error rates are not simply due to poor audio quality, the authors also apply the non‑intrusive speech quality metric NISQA, which reports mean opinion scores above 3.7 for all systems. Finally, an emotion‑regression model based on WavLM predicts arousal, valence, and dominance scores on a 1–7 scale. CosyVoice2 and EmoVoice produce relatively low arousal (centered below 4), indicating weak emotional activation, whereas MaskGCT yields a broader distribution, suggesting richer affective expression.
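A per-utterance error breakdown of this kind can be reproduced with an edit-distance alignment between reference and hypothesis; the sketch below uses the jiwer library, and the text-normalization choices are assumptions rather than the paper's exact pipeline.

```python
import jiwer

def error_breakdown(reference: str, hypothesis: str) -> dict:
    """Return WER plus substitution/insertion/deletion counts for one utterance."""
    # Basic normalization; the paper's exact normalization is not specified here.
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ])
    out = jiwer.process_words(reference, hypothesis,
                              reference_transform=transform,
                              hypothesis_transform=transform)
    return {"wer": out.wer,
            "substitutions": out.substitutions,
            "insertions": out.insertions,
            "deletions": out.deletions}

# Example: a substitution-dominated error pattern, typical of emotional prosody
print(error_breakdown("he could hardly believe it", "he could hartly relieve it"))
```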

Guided by these observations, two generative selection strategies are defined. The first, TTS‑Correctness (TTS‑G), retains only those synthetic utterances that increase substitution errors relative to the original neutral version while keeping insertion and deletion errors equal or lower. This criterion forces the ASR model to see challenging phonetic variations caused by emotion. The second, Emotion‑Salience (EMO‑G), selects utterances whose predicted arousal, valence, or dominance deviates by more than one standard deviation from the dataset mean, thereby ensuring that the chosen samples convey a clear emotional signal. A combined strategy (TTS‑EMO‑G) applies both filters simultaneously, aiming to provide data that are both linguistically challenging and emotionally vivid.
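As a rough illustration, a combined TTS‑EMO‑G filter along the lines described above might look like the following; the field names and per-sample bookkeeping are assumptions based on this summary, not the authors' released code.

```python
def select_tts_emo_g(samples, emo_means, emo_stds):
    """Keep synthetic utterances that are both phonetically challenging and emotionally salient.

    Each sample dict is assumed to hold error counts for the emotional rendering
    ("sub", "ins", "del"), the matching counts for the neutral original
    ("sub_neutral", "ins_neutral", "del_neutral"), and predicted
    "arousal"/"valence"/"dominance" scores.
    """
    selected = []
    for s in samples:
        # TTS-G: more substitutions than the neutral version,
        # with insertions and deletions no worse.
        tts_ok = (s["sub"] > s["sub_neutral"]
                  and s["ins"] <= s["ins_neutral"]
                  and s["del"] <= s["del_neutral"])
        # EMO-G: at least one emotion dimension deviates by more than one
        # standard deviation from the dataset mean.
        emo_ok = any(abs(s[dim] - emo_means[dim]) > emo_stds[dim]
                     for dim in ("arousal", "valence", "dominance"))
        if tts_ok and emo_ok:          # TTS-EMO-G applies both filters
            selected.append(s)
    return selected
```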

The authors fine‑tune the pretrained Qwen2‑Audio‑7B model on the subsets produced by each strategy for each TTS system. Results (Table III) show consistent WER reductions on real emotional speech test sets while preserving performance on clean LibriSpeech. For CosyVoice2, EMO‑G and TTS‑G each lower WER by roughly 0.3 percentage points, and the combined TTS‑EMO‑G yields about a 0.5‑point gain. Similar trends appear for EmoVoice and MaskGCT, though MaskGCT’s EMO‑G alone sometimes leads to higher WER on development data, indicating that excessive emotional variance can destabilize training if not balanced with transcription quality.
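For context, transcribing a single utterance with the publicly released Qwen2‑Audio checkpoint on Hugging Face looks roughly like the sketch below; the prompt wording and generation settings are assumptions, and the paper's fine‑tuning recipe is not reproduced here.

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B"      # base checkpoint; the paper fine-tunes from here
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

def transcribe(wav_path: str) -> str:
    """One-shot ASR with Qwen2-Audio; the prompt string is an assumption."""
    audio, _ = librosa.load(wav_path, sr=processor.feature_extractor.sampling_rate)
    prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Recognize the speech: "
    inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128)
    generated = generated[:, inputs["input_ids"].shape[1]:]   # strip the prompt tokens
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

Running `transcribe` over a real emotional test set before and after fine-tuning, and feeding the outputs to a WER breakdown like the one sketched earlier, is one way to reproduce comparisons of the kind reported in Table III.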

In summary, the paper makes three key contributions: (1) a systematic analysis showing that emotional speech primarily inflates substitution errors rather than insertions or deletions; (2) quantitative evidence that not all synthetic emotional utterances are equally useful—high‑quality transcription and strong affective cues are both required; (3) practical, easy‑to‑implement data‑selection heuristics that improve ASR robustness to emotion without sacrificing neutral‑speech accuracy. These findings suggest that targeted augmentation with carefully filtered synthetic emotional speech is a viable path toward building ASR systems that remain reliable in emotionally rich real‑world interactions.

