AffectGPT-R1: Leveraging Reinforcement Learning for Open-Vocabulary Multimodal Emotion Recognition
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by label spaces, enabling fine-grained emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models to capture the full spectrum of emotions and employs emotion wheels (EWs) for metric calculation. Previous approaches (e.g., AffectGPT) primarily rely on token-level loss during training. However, this objective is misaligned with the metrics used in OV-MER, and these metrics cannot be optimized via gradient backpropagation. To address this limitation, we propose AffectGPT-R1, a reinforcement learning framework that treats EW-based metrics as a reward function and applies policy optimization to maximize this reward. Additionally, we introduce an explicit reasoning process and examine its necessity in OV-MER. To further guide model behavior, we incorporate auxiliary rewards that regularize both emotion reasoning and emotion prediction. We also apply length penalties to mitigate reward hacking. Experimental results demonstrate that AffectGPT-R1 yields significant performance improvements on OV-MER. Moreover, our approach enhances generalized emotion understanding, achieving state-of-the-art results on MER-UniBench. Our code is provided in the supplementary material and will be released to facilitate future research.
💡 Research Summary
The paper introduces AffectGPT‑R1, a reinforcement‑learning (RL) framework designed to address the misalignment between token‑level training objectives and the emotion‑wheel (EW) based evaluation metrics used in Open‑Vocabulary Multimodal Emotion Recognition (OV‑MER). Traditional OV‑MER approaches, such as the original AffectGPT, rely on cross‑entropy loss over token sequences, which does not correlate well with EW‑based semantic similarity. For instance, “sadness” and “madness” may be close in token space but far apart in the emotion‑wheel topology. Because EW metrics are non‑differentiable, they cannot be directly optimized via gradient descent, limiting performance.
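The "sadness" vs. "madness" mismatch can be made concrete with a toy comparison: token-level (edit-distance) similarity ranks "madness" closest to "sadness", while a semantic grouping ranks "sorrow" closest. The two-cluster wheel below is a hypothetical stand-in for the real EW topology, which uses a full emotion-wheel hierarchy rather than hand-picked clusters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a proxy for how 'close' two words are in token space."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Hypothetical two-cluster "wheel" for illustration only; real EW metrics
# score similarity over the full emotion-wheel structure.
WHEEL_CLUSTER = {"sadness": "sad", "sorrow": "sad", "madness": "anger"}

def wheel_similarity(a: str, b: str) -> float:
    """1.0 if two labels fall in the same (toy) wheel cluster, else 0.0."""
    return 1.0 if WHEEL_CLUSTER[a] == WHEEL_CLUSTER[b] else 0.0
```

Here `edit_distance("sadness", "madness")` is just 1, yet the two words sit in opposite wheel clusters, while the semantically close pair "sadness"/"sorrow" is far apart in token space.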
AffectGPT‑R1 tackles this by treating EW‑based scores as a reward function and applying policy optimization to maximize this reward. The training pipeline consists of two stages:
- Cold-Start Training – A large-scale, coarse-grained dataset is used to teach the model basic multimodal integration and output formatting. The model is trained to produce both a "thinking" segment (enclosed in `<think>...</think>` tags) and an "answer" segment (enclosed in `<answer>...</answer>` tags). This stage aligns the model with the required output structure and provides a solid foundation for later RL fine-tuning.
- Reinforcement Learning Fine-Tuning – A high-quality, fine-grained dataset is employed for reward calculation and policy updates. The authors adopt Group Relative Policy Optimization (GRPO) rather than PPO, eliminating the need for a separate value network and instead using group-relative baselines to compute advantage estimates. Rewards are normalized across a sampled group of responses to stabilize training.
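The group-relative baseline described above can be sketched in a few lines: each sampled response's reward is standardized against the mean and standard deviation of its own group, so no learned value network is needed. This is a minimal illustration of the GRPO advantage computation, not the paper's implementation.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: standardize each sampled response's reward
    by the group's mean and (population) standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of rewards `[0.2, 0.5, 0.8]` yields advantages that sum to zero, with the best response getting a positive advantage and the worst a negative one.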
The paper proposes five complementary reward components:
- Format Reward: Binary 1/0 indicating whether both the `<think>` and `<answer>` sections are present, enforcing structured output.
- Accuracy Reward: Directly uses the official OV-MER EW metrics (e.g., cosine similarity between predicted and ground-truth emotion embeddings) to encourage semantic correctness.
- Reasoning‑Answer Consistency Reward: Measures semantic alignment between the reasoning text and the final answer, promoting coherent explanations.
- Multimodal Modality Rewards: Separate scores for visual, acoustic, and lexical cues, ensuring the model leverages each modality appropriately.
- Length Penalty: A negative term proportional to output length, mitigating reward‑hacking where the model inflates reward by generating overly long, redundant emotion lists.
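The five components above can be combined as a weighted sum. The sketch below shows the mechanical pieces (a regex-based format check and a length penalty beyond a token budget); the weights, the 256-token budget, and the penalty scale are all hypothetical choices for illustration, and the accuracy, consistency, and modality scores are assumed to be computed elsewhere.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.S)
ANSWER_RE = re.compile(r"<answer>.*?</answer>", re.S)

def format_reward(output: str) -> float:
    """1.0 if both tagged sections are present, else 0.0."""
    return float(bool(THINK_RE.search(output) and ANSWER_RE.search(output)))

def length_penalty(output: str, budget: int = 256, scale: float = 0.001) -> float:
    """Negative term proportional to length beyond a budget (values hypothetical)."""
    n_tokens = len(output.split())  # crude whitespace tokenization for illustration
    return -scale * max(0, n_tokens - budget)

def total_reward(output: str, accuracy: float, consistency: float,
                 modality: float, w=(1.0, 1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the five reward components; weights are illustrative."""
    return (w[0] * format_reward(output) + w[1] * accuracy
            + w[2] * consistency + w[3] * modality + length_penalty(output))
```

An output missing either tag pair scores 0 on the format term regardless of its content, which is what forces the model to keep the structured `<think>`/`<answer>` layout during RL.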
Ablation studies show that the combination of these rewards outperforms any single reward, and that the reasoning segment substantially improves both accuracy and interpretability. The length penalty effectively curtails unnecessary verbosity without harming EW scores.
Experiments are conducted on two benchmarks:
- OV‑MER Specific Test Set: AffectGPT‑R1 achieves a 4.2 percentage‑point gain in EW‑based F1 over the token‑loss baseline, demonstrating that RL directly optimizes the target metric.
- MER‑UniBench: A unified benchmark covering diverse domains (videos, dialogues, movies). AffectGPT‑R1 sets new state‑of‑the‑art results across all domains, particularly excelling at nuanced, compound emotions such as “bittersweet” or “nostalgic surprise”.
The authors also analyze the impact of cold‑start data volume. Larger pre‑training corpora lead to faster convergence and higher final performance in the RL stage, indicating that a well‑initialized policy benefits more from reward‑driven fine‑tuning.
Finally, the paper discusses future directions: automated reward design, integration of Direct Preference Optimization (DPO) to incorporate human preference data, and real‑time policy updates for streaming multimodal inputs. All code and model checkpoints are released in the supplementary material to promote reproducibility.
In summary, AffectGPT‑R1 demonstrates that reinforcement learning, when equipped with carefully crafted reward functions—including format enforcement, semantic accuracy, reasoning consistency, multimodal balance, and length control—can bridge the gap between training objectives and evaluation metrics in open‑vocabulary emotion recognition, yielding both higher performance and more interpretable outputs.