Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.


💡 Research Summary

The paper addresses the persistent gap in zero‑shot speech emotion recognition (SER) when using large audio‑language models (LALMs). While LALMs have demonstrated strong zero‑shot capabilities on many speech tasks, they tend to focus on linguistic content and neglect crucial paralinguistic cues such as pitch, speech rate, jitter, shimmer, intensity, and articulation rate. Existing remedies either fine‑tune the model on emotion‑annotated corpora, which requires costly labeled data, or rely on unstructured chain‑of‑thought (CoT) prompting, which is prone to hallucination and lacks interpretability.

To overcome these limitations, the authors propose Compositional Chain‑of‑Thought Prompting for Emotion Reasoning (CCoT‑Emo), a framework that introduces a structured intermediate representation called an Emotion Graph (EG). The EG is built in a completely training‑free manner from each input utterance and consists of four parts:

  1. Acoustic attributes – Seven low‑level prosodic features (pitch, speech rate, volume, jitter, shimmer, intensity, articulation rate) are extracted with the openSMILE toolkit, then discretized into categorical labels (low/normal/high). This symbolic encoding makes the information directly consumable by a language model.

  2. Textual sentiment – The automatic transcription (via Whisper when necessary) is passed through a RoBERTa‑based sentiment classifier, yielding a coarse polarity (positive, neutral, negative).

  3. Emotion‑relevant keywords – KeyBERT extracts salient words from the transcript that are likely to carry affective meaning.

  4. Cross‑modal relationships – Using GPT‑4, each acoustic attribute is evaluated for whether it supports, contradicts, or is neutral with respect to the textual sentiment. The resulting relational triples are stored alongside the attributes.
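The four steps above can be sketched in code. This is a minimal illustration of how the symbolic EG might be assembled: the threshold ranges, helper names, and example values are assumptions for illustration, not the paper's exact settings; in the real pipeline the raw features would come from openSMILE, the sentiment from a RoBERTa classifier, the keywords from KeyBERT, and the relations from GPT‑4.

```python
# Hypothetical reference ranges (low, high) per acoustic attribute.
# These cutoffs are illustrative, not the paper's calibrated values.
RANGES = {
    "pitch_hz": (110.0, 220.0),
    "speech_rate_sps": (3.0, 5.5),       # syllables per second
    "volume_db": (55.0, 70.0),
    "jitter_pct": (0.5, 1.5),
    "shimmer_pct": (3.0, 6.0),
    "intensity_db": (55.0, 70.0),
    "articulation_rate_sps": (4.0, 6.0),
}

def discretize(name: str, value: float) -> str:
    """Map a raw feature value to a symbolic low/normal/high label."""
    lo, hi = RANGES[name]
    if value < lo:
        return "low"
    if value > hi:
        return "high"
    return "normal"

def build_emotion_graph(features, sentiment, keywords, relations):
    """Assemble the four EG components into one JSON-ready dict."""
    return {
        "acoustic": {k: discretize(k, v) for k, v in features.items()},
        "sentiment": sentiment,          # from a sentiment classifier
        "keywords": keywords,            # from keyword extraction
        "cross_modal": relations,        # attribute vs. sentiment relations
    }

# Stand-in feature values, as if measured by openSMILE on one utterance.
feats = {"pitch_hz": 250.0, "speech_rate_sps": 6.1, "volume_db": 72.0,
         "jitter_pct": 1.8, "shimmer_pct": 6.5, "intensity_db": 73.0,
         "articulation_rate_sps": 6.4}
eg = build_emotion_graph(
    feats, "negative", ["furious", "late"],
    [{"attribute": "pitch_hz", "relation": "supports"}],
)
print(eg["acoustic"]["pitch_hz"])  # -> high
```

The discretization step is what makes the representation "plug-and-play": a language model can consume `"pitch": "high"` directly, whereas a raw float like `250.0 Hz` would require the model to know speaker-dependent baselines.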

All components are serialized into a compact JSON object, which is then embedded into the prompt so that the LALM reasons over the EG before predicting an emotion label.
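A hedged sketch of this serialization step follows. The prompt wording here is an assumption (the paper's exact template is not reproduced in this summary); only the idea of embedding the JSON-serialized EG into the prompt comes from the text above.

```python
import json

# An EG of the form described above, with illustrative values.
eg = {
    "acoustic": {"pitch": "high", "speech_rate": "high", "volume": "high",
                 "jitter": "high", "shimmer": "high", "intensity": "high",
                 "articulation_rate": "normal"},
    "sentiment": "negative",
    "keywords": ["furious", "late"],
    "cross_modal": [{"attribute": "pitch", "relation": "supports"}],
}

# Hypothetical prompt template: the EG is inlined as compact JSON, and the
# model is asked to reason compositionally before emitting a label.
prompt = (
    "You are given an Emotion Graph describing an utterance.\n"
    f"Emotion Graph (JSON): {json.dumps(eg)}\n"
    "Reason step by step over the acoustic attributes, textual sentiment, "
    "keywords, and cross-modal relations, then output one emotion label."
)
print(prompt)
```

Because the EG is plain JSON in the prompt, the approach requires no fine-tuning: any instruction-following LALM can consume it as-is.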

