MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as two questions are still open: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present **MME-Emotion**, a systematic benchmark that assesses both the emotional understanding and reasoning capabilities of MLLMs, featuring *scalable capacity*, *diverse settings*, and *unified protocols*. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a 39.3% recognition score and a 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope it can serve as a foundation for advancing MLLMs' emotional intelligence in the future.
💡 Research Summary
The paper introduces MME‑Emotion, a large‑scale benchmark designed to evaluate the emotional intelligence of multimodal large language models (MLLMs). Recognizing that existing affective computing benchmarks suffer from limited scenario coverage and a lack of reasoning assessment, the authors compile over 6,500 short video clips (averaging just over 3.3 seconds) drawn from public datasets and segment them into temporally consistent emotion intervals. Each clip is paired with task‑specific question‑answer (QA) prompts covering eight distinct emotional tasks: laboratory emotion recognition (ER‑Lab), wild‑scene recognition (ER‑Wild), noise‑robust recognition (Noise‑ER), fine‑grained recognition (FG‑ER), multi‑label recognition (ML‑ER), sentiment analysis (SA), fine‑grained sentiment analysis (FG‑SA), and intent recognition (IR). The tasks span 27 scenario types, ensuring a balanced distribution of question volume and video duration.
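The per-clip structure described above can be sketched as a simple record: each benchmark item ties a video segment to one of the eight task codes and its QA pair. This is a minimal illustrative layout; the field names are hypothetical and not taken from the released dataset.

```python
from dataclasses import dataclass

# The eight task codes named in the summary.
TASKS = {"ER-Lab", "ER-Wild", "Noise-ER", "FG-ER", "ML-ER", "SA", "FG-SA", "IR"}


@dataclass
class EmotionQA:
    """Hypothetical record for one MME-Emotion benchmark item."""
    video_path: str   # short clip, segmented to a temporally consistent interval
    task: str         # one of the eight task codes above
    question: str     # task-specific prompt posed to the model
    answer: str       # ground-truth emotion / sentiment / intent label

    def __post_init__(self) -> None:
        # Reject items whose task code is not one of the eight defined tasks.
        if self.task not in TASKS:
            raise ValueError(f"unknown task code: {self.task}")
```

A multi-label item (ML‑ER) would presumably store several labels rather than one string, but the single-label form above covers the majority of the tasks listed.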
To assess models comprehensively, the authors define three unified metrics: Recognition Score (Rec‑S) for label accuracy, Reasoning Score (Rea‑S) for the quality of extracted reasoning steps, and Chain‑of‑Thought Score (CoT‑S) for logical coherence of the entire answer. Evaluation is fully automated via a multi‑agent system. First, a “Step‑LLM” (GPT‑4.1) parses a model’s raw answer into concise reasoning steps. Then a “Judge‑MLLM” (GPT‑4o) receives visual clues (frames extracted from the video), auditory clues (produced by Qwen2‑Audio), the ground‑truth label, and the step list, and computes the three metrics. Human expert validation on a sampled subset shows a high correlation (≥0.87) with the automated scores, confirming the reliability of the LLM‑as‑judge approach.
The benchmark is applied to 20 state‑of‑the‑art MLLMs, including generalist systems (Gemini‑2.5‑Pro, GPT‑4o) and specialist models (R1‑Omni, Audio‑Reasoner, AffectGPT). The best performer, Gemini‑2.5‑Pro, attains only 39.3% Rec‑S and 56.0% CoT‑S, while the averages across all models are 29.4% Rec‑S, 49.5% Rea‑S, and 39.5% CoT‑S. These results reveal that current MLLMs still possess limited emotional intelligence. Generalist models benefit from large‑scale multimodal pre‑training, whereas specialist models achieve comparable results through domain‑specific fine‑tuning, reinforcement learning with verified feedback, or dedicated emotion encoders. An additional finding is that a higher number of reasoning steps correlates positively with performance, highlighting the importance of step‑wise emotional reasoning.
The paper’s contributions are threefold: (1) the creation of the largest, most diverse emotional intelligence benchmark for MLLMs, (2) a holistic, automated evaluation suite that jointly measures recognition, reasoning, and chain‑of‑thought quality, validated against human judgments, and (3) an extensive empirical analysis that uncovers strengths and weaknesses of current models and outlines future research directions. Limitations include the short clip length, which may miss long‑term emotional dynamics, and the reliance on discrete label sets that cannot fully capture mixed or evolving emotions. Future work is suggested to extend the benchmark with longer continuous videos, richer multi‑label annotations, and to explore MLLMs with integrated multimodal encoders capable of processing visual, auditory, and textual cues simultaneously.