ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning


Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially enhancing minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.


💡 Research Summary

The paper tackles two persistent shortcomings in speech emotion recognition (SER): self-supervised speech encoders such as WavLM provide strong acoustic representations but are opaque discriminative models, while speech large language models (SLLMs) excel at high-level semantic reasoning yet often generate ungrounded, text-biased judgments lacking verifiable acoustic evidence. To bridge this gap, the authors propose ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a novel framework that reconceptualizes emotion recognition as a multi-turn, evidence-driven reasoning process rather than a single-pass classification task.

ADEPT transforms an SLLM into an autonomous agent that maintains an evolving set of candidate emotions. The inference pipeline is divided into three distinct phases. In Phase 1, the agent performs high-recall hypothesis initialization, generating a broad, ranked candidate set from both audio and transcript cues while deliberately preserving ambiguity (e.g., keeping tied top-votes). Phase 2 is an adaptive evidence-accumulation loop in which the agent dynamically orchestrates a suite of probing tools under a soft budget constraint. The toolkit comprises four families: (i) Structural Prior tools for budget allocation and fallback candidate scheduling; (ii) Semantic Probing tools that verify emotion-related textual spans and pairwise semantic consistency; (iii) Acoustic Probing tools that locate, analyze, and compare low-level signal metrics such as pitch, energy, and pauses; and (iv) Refinement tools that replay audio or re-check semantic alignment when evidence conflicts arise. The agent decides which tool to invoke based on the current candidate set’s uncertainty (e.g., probability gaps, annotator vote dispersion) and the expected information gain versus cost. In Phase 3, the agent adjudicates: it weighs the accumulated evidence and commits to a final set of primary and minor emotion labels.
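To make the control flow concrete, below is a minimal Python sketch of the Phase 2 loop. The `Candidate` and `ProbingTool` interfaces, the top-2 score-gap uncertainty measure, and the gain-per-cost selection rule are illustrative assumptions, not the paper's actual tool APIs or heuristics.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    emotion: str
    score: float                        # current belief in this emotion
    evidence: list = field(default_factory=list)

def uncertainty(cands):
    """Ambiguity as 1 minus the gap between the top-2 scores (small gap = ambiguous)."""
    top = sorted((c.score for c in cands), reverse=True)
    return 1.0 - (top[0] - top[1]) if len(top) > 1 else 0.0

@dataclass
class ProbingTool:
    name: str
    cost: float
    def expected_gain(self, cands):     # toy heuristic: probing pays off when ambiguous
        return uncertainty(cands)
    def probe(self, cands):             # toy observation (stands in for pitch/span checks)
        return {"tool": self.name, "finding": "rising pitch, elevated energy"}
    def update(self, cand, obs):        # toy re-scoring rule given the observation
        return cand.score * (1.10 if cand.emotion == "surprise" else 0.97)

def accumulate_evidence(cands, tools, budget=6.0, tau=0.15):
    """Phase 2 loop: probe until ambiguity falls below tau or the soft budget is spent."""
    spent = 0.0
    while spent < budget and uncertainty(cands) > tau:
        tool = max(tools, key=lambda t: t.expected_gain(cands) / t.cost)  # gain per cost
        obs = tool.probe(cands)
        for c in cands:
            c.score = tool.update(c, obs)
            c.evidence.append(obs)
        spent += tool.cost
    return sorted(cands, key=lambda c: c.score, reverse=True)

cands = [Candidate("surprise", 0.40), Candidate("joy", 0.35), Candidate("neutral", 0.25)]
tools = [ProbingTool("acoustic_pitch", cost=1.0), ProbingTool("semantic_span", cost=0.5)]
print([(c.emotion, round(c.score, 2)) for c in accumulate_evidence(cands, tools)])
```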

Crucially, ADEPT integrates Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate. Rather than training a separate value network, GRPO samples a group of reasoning trajectories for each utterance and normalizes every trajectory’s reward against the group’s statistics; the resulting relative advantages update the tool-selecting policy, coupling tool-usage behavior with prediction quality and encouraging the agent to seek informative evidence before committing to a label. The Evidence Trust Gate filters out low-confidence observations, preventing the agent from basing decisions on noisy or irrelevant probes.
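A minimal sketch of this training signal follows, assuming a hypothetical reward that combines prediction correctness, trust-gated evidence credit, and a per-call cost; the weights, gate threshold, and reward decomposition are illustrative, not the paper's exact formulation.

```python
import numpy as np

def trust_gated_reward(correct: bool, evidence_conf: list[float],
                       n_tool_calls: int, gate: float = 0.5,
                       lam_evidence: float = 0.3, lam_cost: float = 0.05) -> float:
    """Reward = prediction quality + credit for trusted evidence - tool-call cost.
    Observations whose confidence falls below the trust gate earn no credit."""
    trusted = [c for c in evidence_conf if c >= gate]
    return ((1.0 if correct else 0.0)
            + lam_evidence * (len(trusted) / max(len(evidence_conf), 1))
            - lam_cost * n_tool_calls)

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO baseline: normalize each rollout's reward against its group's
    mean and spread, replacing a learned value network with group statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# A group of 4 sampled reasoning trajectories for the same utterance.
rewards = np.array([
    trust_gated_reward(True,  [0.9, 0.7], 2),   # correct and well-grounded
    trust_gated_reward(True,  [0.3, 0.2], 5),   # correct but untrusted evidence
    trust_gated_reward(False, [0.8],      1),   # grounded but wrong
    trust_gated_reward(False, [],         0),   # no probing at all
])
print(grpo_advantages(rewards))  # trajectories above the group mean get positive advantage
```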

The authors also address the “consensus paradox” that plagues most SER datasets: majority‑vote labeling discards minority annotations that often encode subtle, co‑occurring emotions. Using the MSP‑Podcast V2.0 corpus, they adopt a flexible plurality‑based labeling scheme that retains all tied top‑votes as primary emotions and treats any non‑primary votes as minor emotions. ADEPT explicitly models these minor emotions through a dedicated loss term, thereby learning to recover co‑occurring affective states rather than smoothing them away.
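As a concrete illustration, the sketch below converts a list of annotator votes into primary and minor label sets under this plurality scheme; the function name and example votes are hypothetical.

```python
from collections import Counter

def plurality_labels(votes: list[str]) -> tuple[set[str], set[str]]:
    """Split annotator votes into primary emotions (all tied top-votes)
    and minor emotions (every remaining voted class)."""
    counts = Counter(votes)
    top = max(counts.values())
    primary = {emo for emo, n in counts.items() if n == top}
    minor = set(counts) - primary
    return primary, minor

# Five annotators with split perception: a tie keeps both 'joy' and 'surprise' primary.
print(plurality_labels(["joy", "joy", "surprise", "surprise", "neutral"]))
# -> ({'joy', 'surprise'}, {'neutral'})
```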

Experimental results on MSP-Podcast demonstrate that ADEPT matches or slightly exceeds baseline primary-emotion accuracy (≈1–2% improvement) while dramatically boosting minor-emotion recall (a 15–30% absolute gain). Human evaluators rate the generated explanations as highly trustworthy because each rationale is grounded in concrete acoustic measurements (e.g., “pitch rise between 0.45 s and 0.78 s”) and textual evidence (“the word ‘surprised’ appears in the transcript”). Ablation studies reveal that removing GRPO or the Evidence Trust Gate leads to excessive tool calls, weaker minor-emotion recovery, and a drop in overall accuracy, confirming the importance of the reinforcement-learning components.

The paper concludes that by reframing SER as evidence‑based multi‑turn reasoning, ADEPT achieves a rare combination of performance and interpretability. The modular probing toolkit, budget‑aware policy optimization, and explicit handling of annotator disagreement constitute a general recipe that could be extended to other multimodal affective computing tasks where ambiguity and explainability are paramount.

