VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.


💡 Research Summary

The paper introduces VowelPrompt, a novel framework that enriches text‑only large language models (LLMs) with fine‑grained prosodic information extracted at the vowel level, thereby improving speech emotion recognition (SER) without requiring an audio encoder at inference time. The authors begin by noting that existing SER approaches either rely on handcrafted low‑level descriptors (e.g., pitch, energy, duration) processed by traditional classifiers, or on self‑supervised acoustic embeddings such as wav2vec 2.0 and HuBERT. While the latter achieve strong performance, they are high‑dimensional, opaque, and demand an audio front‑end, limiting interpretability and deployment in text‑only scenarios. Conversely, recent text‑only prompting methods augment automatic speech recognition (ASR) transcripts with coarse prosodic cues (e.g., “spoken loudly with rising intonation”), but these utterance‑level descriptors blur fine‑grained emotional signals.

Drawing on phonetic research that identifies vowels as primary carriers of affective prosody, VowelPrompt first obtains precise phoneme boundaries through a forced‑alignment pipeline. For each vowel segment, it computes a compact set of low‑level descriptors: mean fundamental frequency (F0), pitch slope (rising/falling trend), pitch variability (standard deviation), mean intensity (RMS energy), intensity variability, and segment duration. To mitigate speaker and vowel‑type biases, the authors apply a two‑stage normalization: speaker‑level z‑normalization followed by vowel‑type normalization. The continuous values are then discretized using quantile‑based binning into a small number of ordinal categories (e.g., very low, low, moderate, high, very high). Each discretized feature is deterministically mapped to a short natural‑language phrase such as “high pitch, rising, loud, lengthened.” These phrases are appended to the original transcript, forming a richly annotated prompt that conveys both lexical content and localized prosodic cues.
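The descriptor pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes forced alignment and frame-level F0/RMS extraction have already run, and all function names, bin edges, and speaker statistics here are hypothetical.

```python
import numpy as np

BIN_LABELS = ["very low", "low", "moderate", "high", "very high"]

def vowel_descriptors(f0, rms, duration):
    """Raw low-level descriptors for one vowel segment (illustrative)."""
    t = np.arange(len(f0))
    slope = np.polyfit(t, f0, 1)[0] if len(f0) > 1 else 0.0
    return {
        "f0_mean": float(np.mean(f0)),   # mean fundamental frequency
        "f0_slope": float(slope),        # rising/falling pitch trend
        "f0_std": float(np.std(f0)),     # pitch variability
        "rms_mean": float(np.mean(rms)), # mean intensity
        "rms_std": float(np.std(rms)),   # intensity variability
        "duration": float(duration),     # segment length in seconds
    }

def z_normalize(value, mean, std):
    """Speaker- or vowel-type-level z-normalization (statistics precomputed)."""
    return (value - mean) / std if std > 0 else 0.0

def quantile_bin(value, quantiles):
    """Map a normalized value to an ordinal label via quantile edges."""
    idx = int(np.searchsorted(quantiles, value))
    return BIN_LABELS[min(idx, len(BIN_LABELS) - 1)]

# Toy example: one synthetic vowel with rising pitch.
f0 = np.array([180.0, 195.0, 210.0, 225.0])  # Hz per frame
rms = np.array([0.12, 0.15, 0.14, 0.13])     # linear RMS energy
desc = vowel_descriptors(f0, rms, duration=0.18)

# Hypothetical speaker statistics and quantile edges.
z = z_normalize(desc["f0_mean"], mean=160.0, std=25.0)
label = quantile_bin(z, quantiles=[-1.5, -0.5, 0.5, 1.5])
trend = "rising" if desc["f0_slope"] > 0 else "falling"
print(f"{label} pitch, {trend}")  # → "very high pitch, rising"
```

In practice the resulting phrase (e.g., "very high pitch, rising") would be attached to the corresponding vowel in the transcript, giving the LLM a localized, human-readable prosodic cue.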

Model adaptation proceeds in two phases. The first phase, supervised fine‑tuning (SFT), aligns the LLM to predict emotion labels given the augmented prompts, while also teaching the model to output a structured response format (a dedicated reasoning segment followed by the final label). The second phase, Reinforcement Learning with Verifiable Reward (RLVR), refines the model using Group Relative Policy Optimization (GRPO). The reward function combines three components: (1) classification accuracy, (2) adherence to the prescribed output format, and (3) verification that the reasoning segment correctly references the prosodic descriptors extracted from the audio. This design encourages the LLM not only to be accurate but also to produce human‑readable, auditable explanations that are grounded in actual acoustic evidence.
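A composite reward of this kind can be sketched as follows. This is a hedged illustration in the spirit of the description above, not the paper's actual reward: the `<reason>`/`<answer>` tag names, the weights, and the substring-matching grounding check are all assumptions.

```python
import re

def composite_reward(response, gold_label, prosody_phrases,
                     w_acc=1.0, w_fmt=0.2, w_ground=0.5):
    """Score one sampled response: accuracy + format + prosody grounding."""
    reason = re.search(r"<reason>(.*?)</reason>", response, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.S)

    # (2) Format adherence: both structured segments must be present.
    r_fmt = 1.0 if (reason is not None and answer is not None) else 0.0

    # (1) Classification accuracy against the gold emotion label.
    pred = answer.group(1).strip().lower() if answer else ""
    r_acc = 1.0 if pred == gold_label.lower() else 0.0

    # (3) Grounding: fraction of extracted prosodic phrases the reasoning cites.
    text = reason.group(1).lower() if reason else ""
    hits = sum(p.lower() in text for p in prosody_phrases)
    r_ground = hits / len(prosody_phrases) if prosody_phrases else 0.0

    return w_acc * r_acc + w_fmt * r_fmt + w_ground * r_ground

resp = ("<reason>The vowel shows high pitch, rising contour and is "
        "lengthened, suggesting excitement.</reason><answer>happy</answer>")
score = composite_reward(resp, "happy", ["high pitch, rising", "lengthened"])
print(score)  # 1.0*1 + 0.2*1 + 0.5*1 = 1.7
```

Under GRPO, several such responses are sampled per prompt and each reward is converted to a group-relative advantage (reward minus the group mean, scaled by the group's standard deviation), so only the scalar score above is needed per sample.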

The authors evaluate VowelPrompt on five widely used SER benchmarks: IEMOCAP, MELD, CaFE, EmoDB, and ASVP‑ESD. They test under a variety of conditions: zero‑shot (no task‑specific fine‑tuning), few‑shot (limited labeled data), full fine‑tuning, cross‑domain transfer (train on one dataset, test on another), and cross‑lingual transfer (train on English, test on German/French). Across all settings, VowelPrompt consistently outperforms strong baselines, including audio‑LLM hybrids (AudioPaLM, Emotion‑LLaMA), traditional low‑level feature classifiers, and recent self‑supervised acoustic embeddings. Gains are especially pronounced in text‑only scenarios, demonstrating that vowel‑level prosodic augmentation can substitute for full acoustic embeddings.

Interpretability is assessed through human evaluation of the model’s reasoning output. Annotators judged the explanations as highly plausible (≈85% agreement) and correctly aligned with the acoustic cues, confirming that the system provides verifiable justification for its predictions. Moreover, the RLVR stage improves robustness to speaker variability and domain shift, as evidenced by stable performance across diverse corpora.

In summary, the paper makes three key contributions: (1) a linguistically motivated method for extracting and discretizing vowel‑level prosodic features, converting them into natural‑language descriptors that can be seamlessly integrated into LLM prompts; (2) a two‑stage adaptation pipeline (SFT → RLVR with GRPO) that enhances both predictive accuracy and the generation of structured, verifiable explanations; and (3) extensive empirical validation showing state‑of‑the‑art results across multiple datasets, domains, and languages while preserving interpretability. The work opens avenues for further research, such as incorporating consonantal cues, hierarchical prosodic modeling, or tighter multimodal integration with audio‑enabled LLMs, to build even richer emotion‑aware conversational agents.

