Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

February 09, 2026

Reading time: 29 minute

...

📝 Original Info

Title: Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding
ArXiv ID: 2512.02699
Date: 2025-12-02
Authors: Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro

📝 Abstract

In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotiondominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework-comprising modalityaligned supervised fine-tuning and modality-aware reward optimization-encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.

📄 Full Content

As human-machine interaction becomes increasingly prevalent through AI assistants, smart speakers, and wearable devices, understanding human emotions has become crucial for achieving empathetic and socially aware communication. To address this need, Multimodal Emotion Recognition (MER) [1][2][3][4][5][6][7][8] has emerged as a key research di- rection, leveraging complementary cues from visual, acoustic, and textual modalities to infer emotional states more accurately than unimodal approaches [9]. Recent progress in multimodal integration with Large Language Models (i.e., MLLMs) has significantly improved MER performance, enabling models to more effectively leverage the complementary cues [10][11][12][13][14]. Building on this foundation, the introduction of Reinforcement Learning (RL) optimization has further extended MER from recognition to reasoning-based emotion understanding, allowing models to articulate why an emotion arises based on multimodal evidence [15]. Despite the recent progress, reasoning-based emotion understanding with MLLMs remains relatively underexplored. Our aim is to achieve reliable emotional reasoning, which we define as the ability to generate explanations that are consistent with the model's final emotion prediction. It is an essential capability for emotional AI, which must not only predict emotions accurately but also justify them with consistent and trustworthy explanations in ambiguous multimodal scenarios. However, to the best of our knowledge, only one prior work has attempted to improve such reliability [16]. It introduces RL optimization that rewards MLLMs when an LLM judges the generated reasoning to be consistent with the prediction. While this approach shows some improvements, it remains an early step toward reliable emotional reasoning, and substantial further advances are still required.

In this context, we begin by exploring where this recent attempt [16] falls short. As shown in Figure 1(a), the model’s attention gradually shifts from multimodal inputs (i.e., audio and visual cues) to its own generated text as reasoning progresses. This shift causes the model to anchor its subsequent reasoning on whatever emotional cues appear in the initial generated text. Moreover, the reasoning process typically starts from generating a video description, because the training data are organized in a fixed order in which visual descriptions precede audio descriptions and the final reasoning. However, such a visual-first reasoning path may not reflect the true emotional driver, especially in cases where emotion is conveyed primarily through vocal tone or semantics. When the initial visual-based reasoning is irrelevant or inaccurate, the subsequent reasoning becomes increasingly misaligned, as illustrated in Figure 1(b). As a result, the initial step of reasoning becomes disproportionately influential. Similar issues appear in other domains, where early cues steer reasoning away from the underlying evidence [17].

To address the issue, we propose Modality-Importance-Guided Reasoning (MIGR), a method that aims to initiate reasoning from the emotion-dominant modality (i.e., the modality most strongly associated with the target emotion) and maintain emotion-coherent reasoning throughout inference. To achieve this, we introduce Modality Importance (MI) estimation that identifies the modality most critical for accurate emotion recognition by comparing the model’s predicted emotion under audio-only, visual-only, and audio-visual inputs. If the model accurately predicts the emotion under audio-only input but fails under visual-only or audio-visual conditions, this suggests that the visual modality may be irrelevant or even misleading. In this case, the audio modality is considered emotion-dominant.

Building on this MI and recent reasoning-based MLLM training paradigms [15,16], we design our method as a two-stage learning framework consisting of (i) modalityaligned Supervised Fine-Tuning (SFT) and (ii) modalityaligned reward optimization. In the modality-aligned SFT stage, we leverage the MI to reorganize the original reasoning text into modality-specific segments and then reorder according to emotion-dominant modality. Then, we train on the data so that it starts to generate emotion-relevant reasoning consistently at the beginning. In the modalityaligned reward optimization, we depart from conventional RL reward designs [15] that primarily focus on reasoning format or answer accuracy, and instead enforce modalityaware constraints through two complementary rewards. The modality-aligned order reward encourages the model to initiate its reasoning from the emotion-dominant modality, thereby reinforcing a modality-aligned reasoning structure. The modality-grounded reasoning reward further promotes generating emotion-dominant modality reasoning text that is causally and semantically aligned with the target emotion.

In this work, we make the following contributions: (i) We introduce the MI for identifying the emotion-dominant modality in multimodal emotion reasoning. It provides a simple yet effective way to diagnose misleading and informative modalities for modality-aligned emotion reasoning. (ii) We propose MIGR which restructures reasoning data so that the model initiates reasoning from the most informative modality. MIGR further reinforces this modality-aligned reasoning through modality-aware reward optimization, producing more stable and emotion-grounded explanations. (iii) MIGR achieves substantial improvements in reasoning reliability on the DEFW benchmark, including a 15.42% gain in Explanation-Prediction Consistency and a reduction of emotionally incorrect reasoning from 18.10% to 7.37%. These gains demonstrate the effectiveness of emotion-dominant modality alignment for robust multimodal emotional reasoning.

MER aims to understand human affective states by analyzing multimodal signals that appear across diverse expressive scenarios. The development of MER was initially driven by the introduction of unimodal emotion-related datasets [1,4,7,8,18], which provided isolated affective cues such as audio, visual, or speech signals. Leveraging these datasets enabled researchers [19][20][21][22][23] to quantitatively analyze emotional expressions within each modality, leading to the development of modality-specific encoders that effectively model and encode unimodal emotional information. Subsequently, the introduction of multimodal emotion datasets [2,3,5,6,24] further accelerated the progress of MER by offering richer combinations of affective cues-including audio, visual, textual, and physiological signals. These datasets opened the door for modeling cross-modal interactions, driving the development of fusion strategies [25,26] designed to integrate complemen-tary information across modalities and improve recognition robustness.

With the recent success of MLLMs across a wide range of domains, these models have also begun to be applied to emotion understanding tasks. The introduction of multimodal emotion-descriptive datasets such as EMER [27] has further accelerated progress in MLLM-based MER, enabling MLLMs to approach emotion recognition as a natural language-based reasoning and explanation task [10][11][12][13][14]28]. However, such emotion descriptions require manual annotation, making large-scale expansion difficult. To address this limitation, recent studies [11,12,29] have proposed leveraging MLLMs to automatically generate emotion descriptions as pseudo-labels, enabling scalable construction of description-enriched datasets. Building upon these expanded datasets, several works [11,29] have explored generating and interpreting emotion explanations grounded in video and audio cues. In parallel, architectural advances [28,30] have further improved the accuracy of emotion recognition by enhancing multimodal representation learning and reasoning capabilities.

Despite significant progress in MLLM-based MER, these approaches still rely heavily on well-designed emotional annotations. This dependency has motivated growing interest in training paradigms that reduce reliance on explicit labels and instead strengthen reasoning through alternative supervision. In this context, RL-based post-training has recently emerged as an effective strategy for enhancing MLLMs’ reasoning abilities while mitigating label dependence. In particular, GRPO [31] and Verifiable Reward [32] have demonstrated strong generalization not only in math and coding but also across a wide range of multimodal domains [15,[33][34][35][36][37][38][39]. However, many existing methods primarily focus on optimizing the final answer, while the fidelity and consistency of the intermediate reasoning process often receive insufficient attention, an issue highlighted by [40]. To address this, recent studies [41,42] have introduced consistency-aware methodologies designed to better align reasoning with final outputs. Yet, such reasoning consistency problems remain largely unexplored in inherently ambiguous domains such as MER. Because emotional interpretation depends on subtle and context-dependent multimodal cues, reliable emotional reasoning becomes especially important. In particular, the generated explanations must faithfully support the model’s final emotion prediction. Nevertheless, only one prior work has attempted to improve such reliability in MER [16], introducing an RL objective that rewards MLLMs when an LLM judge deems the generated reasoning consistent with the prediction. While this approach represents an important initial step, substan-tial further progress is required to achieve robust and trustworthy emotional reasoning.

In this section, we introduce MIGR, a method that learns which modality to attend to first based on MI. Through this design, we aim to mitigate text biases in the initial reasoning stage and ensure that early reasoning remains grounded in the emotionally dominant modality. Our method consists of two learning stages. First, we present modalityaligned SFT, which enables the model to learn an initial reasoning direction by beginning from the modality identified as emotion-dominant by the MI label. Second, we introduce Modality-Aware Reward Optimization, which aims to reinforce modality prioritization and emotion-grounded reasoning while providing stability during optimization. To this end, we propose two rewards: modality-aligned order reward and modality-grounded reasoning reward. The following sections describe each training stage in detail.

Recent studies [15] have demonstrated the effectiveness of SFT as a cold-start in emotion understanding tasks, showing that even a small amount of high-quality supervision from the Explainable Multimodal Emotion Reasoning (EMER) dataset [27] can yield substantial performance gains. Motivated by these findings, we also initiate our training with a cold-start strategy.

Although EMER [27] provides high-quality supervision, the SFT stage can be further improved by incorporating samples with particularly clear and consistent emotional cues. To enhance early-stage supervision, we therefore augment our training set with additional emotion-consistent samples. To identify such samples, we extract Facial Action Units (FAUs) from the facial frames of each video, which provide strongly correlated emotional cues [43]. We then evaluate whether the extracted FAU patterns align with the target emotion based on a predefined FAU-emotion mapping table [11]. Only samples whose FAU patterns exactly match the target emotion are selected. For these filtered samples, we construct reasoning-answer pairs using FAUconsistent emotional evidence to guide the reasoning content.

We now introduce MI that determines the modality most strongly correlated with the target emotional signal. The key idea is to assess the contribution of each modality to the model’s emotional understanding by comparing how accurately the model can infer the target emotion when different modality combinations (i.e., audio-only, visual-only, and audio-visual) are used as inputs. By contrasting the model’s behavior across these modality combinations, we identify which modality is the emotion-dominant in each audio-visual sample.

After constructing both the high-quality human-annotated dataset (EMER) and the FAU-filtered additional dataset, we merge them to form a unified training set. Using this combined data, we extract the MI for every sample. Once the emotion-dominant modality is determined, we reorganize the structure of the reasoning text, as shown in Figure 3.

Specifically, for each training example, we decompose the reasoning text into two modality-specific reasoning texts: an audio-based reasoning text and a video-based reasoning text. To explicitly mark these modality-specific parts, we employ two special tokens, and , placing the corresponding token at both the beginning and the end of each audio-and video-based text. We then reorder these text according to the MI. If audio is the emotion-dominant modality, the audio-based reasoning text is placed first, followed by the video-based text; conversely, if video is emotion-dominant, the video-based text is placed before the audio-based text. For samples where audio and video are equally informative, we include both orderings, i.e., one with audio-first and one with videofirst. By leveraging the reasoning text reorganized with the dominant modality during SFT, the model learns to initiate its reasoning process from the modality that carries the strongest emotional signal.

Following the cold-start stage, we refine the model’s reasoning process through Modality-Aligned Reward Optimization. Prior GRPO-based approaches [15,16] typically employ two reward types: an answer reward and a format reward. Since these rewards focus only on generating a correct answer and producing well-structured rea-soning, they provide no explicit guidance on what kind of reasoning should be produced. Therefore, these two rewards are insufficient to preserve the emotion-dominant modality-prioritized reasoning learned during SFT. To reinforce MI-guided reasoning throughout optimization, we introduce two rewards: the modality-aligned order reward and the modality-grounded reasoning reward.

The modality-aligned order reward encourages the model to generate reasoning in a modality order consistent with the MI m. During optimization, if audio is the emotiondominant modality, the reward is granted when the model initiates its reasoning with the token and the target emotion is detected within the token. Similarly, if the video is emotion-dominant, the reward is granted when the reasoning begins with the token and the target emotion appears within the token. We denote the modality-specific token corresponding to the MI m as . For this token, we define E m (y) as the set of predicted emotion labels obtained by applying an emotion classifier e classif ier to its reasoning sentences. Under this definition, the modality-aligned order reward is given by:

(1)

The modality-grounded reasoning reward evaluates the semantic alignment between each modality-specific reasoning text and the target emotion. For each sample, we measure how well the audio-based and video-based reasoning texts provide emotion-consistent evidence relative to the ground-truth emotion. The reward is then computed based on the MI: if audio is the emotion-dominant modality, we aggregate the reasoning sentences from the and tokens. Likewise, if video is emotiondominant, we aggregate the reasoning sentences from the and tokens. An emotion classifier e classif ier is then applied to this aggregated set of reasoning sentences, denoted as S m (y), to obtain the predicted emotion labels. The correction ratio computed from these predictions is used as the reward. This reward function can be formulated as follows:

In addition to these two modality-focused rewards, we further include an answer reward [15] (R answer ) to ensure the correctness of the final emotion prediction. In total, MIGR employs three rewards during optimization as follows:

As shown in Figure 3, MIGR integrates these three rewards to guide modality-aligned reasoning during optimization.

We utilize three multimodal emotion dataset to train our model and validate its effectiveness. EMER [27] provides high-quality human-verified reasoning-label pairs and is used for cold-start SFT. DFEW [7] and MAFW [6] are large-scale benchmarks covering diverse facial expressions in video, audio, and text modalities. A detailed description of dataset composition and statistics is provided in the Appendix 7.1.

Following prior work [11,16], we report UAR and WAR for recognition accuracy, and three complementary metrics-EEA, EPC, and FCR-to quantify the emotional coherence of generated reasoning. These metrics jointly assess how well the explanation aligns with the target emotion and the model’s prediction. Formal definitions and annotation protocols are included in the Appendix 7.2.

Pre-processing. For video input, we resize each frame to 384 × 384 and uniformly sample 8 frames per clip. For audio, we use 16 kHz waveforms and convert them into 128channel mel-spectrograms using a window size of 25 ms and a hop size of 10 ms. Architecture. Following the design of R1-Omni [15], our framework consists of SigLIP [45] as the vision encoder, Whisper-large-v3 [46] as the audio encoder, and BERT [47] as the text encoder. Each modality output is projected into the LLM embedding space through its corresponding visual and audio projectors, implemented as two linear layers to match the dimensionality of the LLM representation. For reference, a 1-second audio clip yields approximately 50 audio features, while a single video frame produces 729 visual features. The LLM backbone is Qwen2.5-7B [48], and all model weights are initialized from the HumanOmni [30] checkpoint to preserve pretrained multimodal representations.

To assess the effectiveness of the proposed framework, we compare MIGR with existing emotion understanding methods on the MAFW and DFEW datasets (Table 1). We categorize previous methods into two groups: (i) Non-Reasoning Models that directly predict emotion labels, and (ii) Reasoning-Generating Models that produce textual explanations before making a prediction. We first compare MIGR with other reasoning-generating models in terms of consistency metrics to validate that the reasoning texts generated by our model are emotionally coherent. As shown in Table 1, MIGR consistently outperforms ERV on all three consistency measures across both datasets. On MAFW, MIGR improves over ERV by +4.32, +2.95, and +11.31 points in FCR, EEA, and EPC, respectively, achieving 55.30% FCR, 57.65% EEA, and 84.37% EPC. Similarly, on DFEW, MIGR attains 68.48% FCR, 70.06% EEA, and 88.95% EPC, surpassing ERV by +6.42, +4.56, and +15.42 points. We then compare recognition accuracy between non-reasoning models and reasoning-generating models. Among non-reasoning approaches, HumanOmni achieves the strongest performance with 68.40% WAR on MAFW and 82.48% WAR on DFEW, establishing a strong accuracy-oriented baseline. In contrast, all reasoning-generating models, including MIGR and ERV, exhibit slightly lower recognition accuracy than Hu-manOmni, suggesting that the explicit reasoning generation process can weaken the direct answer-prediction capability. When comparing MIGR with reasoning models, our framework attains 62.46% WAR on MAFW and 73.93% WAR on DFEW, which is somewhat lower than the best reasoning counterparts, but accompanied by substantially higher consistency scores.

To validate whether the improvement in emotional coherence of MIGR stems from more faithful reasoning, we further examine cases where the reasoning explanation contradicts the target emotion while the predicted answer remains correct. To this end, we utilize two indicators: (i) the proportion of all samples where reasoning is incorrect but the answer is correct (R× / A✓), and (ii) the proportion of such reasoning errors among correctly predicted samples.

As shown in the Table 2, MIGR reduces reasoning-answer inconsistency compared to reasoninggenerating models, R1-Omni, and ERV across both datasets. On the DFEW dataset, the percentage of inconsistent cases drops from 24.66% (R1-Omni) and 13.75% (ERV) to 5.45% with MIGR, while the relative inconsistency among correct predictions decreases from 32.2% to . These findings suggest that MIGR not only generates explanations that are more consistent with predicted emotions but also effectively mitigates spurious reasoning that leads to correct answers for the wrong reasons.

To investigate the underlying cause of the reduced accuracy observed in Table 1 compared to existing methods, we conduct an emotion-wise performance analysis on the DFEW dataset-where Hap (Happiness), Sad (Sadness), Neu (Neutral), Ang (Anger), Sur (Surprise), Dis (Disgust), and Fea (Fear) are the seven evaluation categories-as presented in Table 3.

In this experiment, we identify two key findings. First, both reasoning-generating models, ERV and MIGR demonstrate substantially stronger performance than the nonreasoning model (Emotion-LLaMA [11]) on the challenging Disgust and Fear categories. This suggests that incorporating explicit reasoning allows models to better capture subtle or ambiguous emotional cues, particularly when predicting emotions such as Fear and Surprise. Second, although MIGR achieves accuracy levels comparable to other reasoning-based approaches and maintains stable performance across most of the remaining emotion classes, it exhibits a pronounced degradation on the Surprise category compared with ERV, with accuracy dropping from 78.84% to 66.89%. To more precisely understand this degradation on Surprise, we further analyze the prediction errors of MIGR in comparison with ERV, focusing on the confusion patterns for Surprise and identifying which emotion categories it is most frequently misclassified into.

To analyze why Surprise shows particularly lower performance, we examine 47 samples on which ERV correctly predicts Surprise but MIGR fails. As shown in Table 4, MIGR predominantly misclassifies these samples as Neutral or Fear. This pattern aligns with the well-known perceptual similarity between Fear and Surprise, which often share facial cues such as widened eyes or raised eyebrows, making them challenging to distinguish even for humans. To further examine whether MIGR truly “loses” the Surprise signal, we analyze the generated reasoning using a closedsource LLM. Interestingly, although MIGR’s final predictions are incorrect, Surprise still appears within the Top-2 reasoning-level emotion predictions for 83% of the samples. This indicates that MIGR retains Surprise-related evidence in its reasoning but ultimately assigns the final label to a neighboring emotion category. Overall, these findings suggest that MIGR’s reduced Surprise accuracy may stem from the inherent ambiguity between Surprise and Fear, indicating that this aspect has room for further improvement.

To better understand the contribution of each component in MIGR, we conduct a stage-wise ablation analysis on the DFEW dataset, as summarized in Table 5. Our ablation follows the two-stage training pipeline described in Section 3 and evaluates how successive additions in both SFT and GRPO stages influence emotional consistency and classi- ERV-R1: “In the video, we see a male character standing in a dimly lit indoor environment. His facial expression is somewhat negative, his eyes are wide open, and he seems to be staring directly at someone or something outside the frame, giving off an overall sense of tension. In the audio, it is impossible to determine the character’s emotional state. In the text, it is impossible to judge the emotional state based on the subtitle content.fear” R1-Omni: “In the video, we see a male character standing in a dimly lit indoor environment. His facial expression is relatively calm, but his eyes seem slightly confused, as if he is contemplating something. In the audio, there are no valuable emotional clues provided. In the text, it is impossible to determine the emotional state based on the subtitle content.surprise” fication accuracy.

We begin with a baseline model trained solely on the EMER dataset, which provides high-quality but limited supervision. Introducing emotion-consistent data augmentation notably improves all consistency metrics, confirming the importance of supplying emotionally reliable samples during early-stage training. Subsequently, applying MIbased reasoning reordering yields additional gains by guiding the model to initiate its reasoning from the emotiondominant modality. This validates our hypothesis that organizing reasoning sequences around MI helps mitigate visually biased or text-driven drift in the initial reasoning stage.

In the second stage, we examine how each proposed reward contributes to reasoning stability and coherence. The Modality-Aligned Order Reward encourages the model to begin reasoning from the modality prioritized by MI, restoring the modality-first structure learned during SFT. The Modality-Grounded Reasoning Reward further strengthens the emotional validity of modality-specific reasoning, promoting causal and semantically consistent evidence aligned with the target emotion. When both rewards are applied together, we observe the highest consistency across all metrics, demonstrating that the two rewards provide complementary benefits.

Figure 4 presents qualitative comparisons illustrating how MIGR utilizes modality-aligned reasoning to make more reliable emotion predictions. In the first example, the input contains no speech, making audio-based evidence unavailable. Baseline models, ERV, and R1-Omni, therefore fail to describe the emotional state because they are confused by audio or text cues. In contrast, MIGR correctly recognizes that the visual modality is dominant and structures its reasoning accordingly: the model begins with < vis desc >, identifies key facial cues, and then consolidates the interpretation in the < think > step, leading to a coherent and accurate conclusion. The second example highlights the benefit of modality-grounded reasoning in audio-dominant scenarios. MIGR initiates its reasoning with the audio modality and identifies sobbing, a strong indicator of sadness. This early recognition enables the model to reinterpret the negative visual cues in the context of the audio evidence, resulting in the correct prediction. Baseline models, however, over-rely on visual information and incorrectly interpret the frowning expression as anger, leading to faulty reasoning.

In this work, we aimed to improve the reliability of reasoning-based multimodal emotion understanding by addressing the early-stage reasoning drift commonly observed in MLLMs. To this end, we introduced MI and proposed MIGR, a training framework designed to initiate and maintain reasoning from the modality most relevant to the target emotion. Our two-stage approach ensures that both the supervised and reinforcement learning processes are aligned with the emotion-dominant modality, leading to ex-planations that are more coherent, emotionally grounded, and causally meaningful. Experiments on the MAFW and DFEW benchmark show that MIGR greatly reduces emotionally inconsistent reasoning. Overall, MIGR moves toward more trustworthy and interpretable multimodal emotional reasoning, highlighting the value of incorporating modality dominance into both data organization and optimization.

To assess the accuracy of emotion recognition, we adopt two commonly used metrics by following prior works [11,14]: (i) Unweighted Average Recall (UAR), which computes the average recall across all emotion categories without considering their frequencies; and (ii) Weighted Average Recall (WAR), which weights each class by its occurrence proportion to account for class imbalance. Together, these two metrics provide a balanced evaluation of the model’s performance on both frequent and infrequent emotion categories. Formally, let C denote the total number of emotion classes, N i the number of samples in the i-th class, and T P i the number of correctly predicted samples for class i. WAR and UAR are defined as follows:

To evaluate the emotional coherence of generated explanations, we leverage three complementary metrics by following [16]: (i) Explanation Emotion Accuracy (EEA), which measures how accurately the emotion expressed in the explanation aligns with the ground-truth emotion label; (ii) Explanation-Prediction Consistency (EPC), which quantifies the degree of consistency between the emotion reflected in the explanation and the model’s predicted emotion; and (iii) Faithful Consistency Rate (FCR), which assesses whether the explanation, the predicted emotion, and the target emotion are mutually consistent.

In our experiments, we utilize GPT-4.1-mini to infer the emotion expressed in the generated explanation. Given a total of S samples, for the i-th sample, let y i denote the ground-truth emotion, ŷi the model’s predicted emotion label, and e i the emotion derived from the reasoning text. The metrics are defined as:

Qualitative Analysis

To investigate the source of the improvements in consistency metrics, we analyze the attention distribution of the models during the reasoning process. Specifically, we calculate the relative attention scores allocated to visual tokens, audio tokens, and generated text tokens (excluding instruction prompts) at each step of token generation. The analysis was conducted on the DFEW test set, categorized into two subsets based on the MI. To account for varying lengths of generated reasoning across samples, we normalized the generation steps to a fixed length of 100. For the model consisting of 28 layers, we specifically examine the 21st layer, with attention scores averaged across all heads. Figure 5 illustrates the modality-wise attention distribution over generated tokens for MIGR, ERV, and R1-Omni. Our analysis reveals two key observations that distinguish MIGR from existing baselines: Enhanced Grounding on Multimodal Evidence As shown in Figure 5, a prominent difference is that MIGR demonstrates significantly higher peak attention scores for both visual and audio modalities compared to the baseline models, ERV and R1-Omni. This indicates that MIGR is capable of attending to multimodal evidence with greater intensity, thereby reducing reliance solely on language priors. This observation aligns with the qualitative results in Figure 6, where baseline models often fail to capture audio cues or generate hallucinations (marked in gray), whereas MIGR produces reasoning well-grounded in the input data.

Additionally, MIGR exhibits a distinct attention pattern that adapts to the dominant modality. As highlighted by the dashed circles in Figure 5(a), when the emotion-dominant modality is Vision (M I : V ision), the model’s attention to visual tokens is explicitly heightened. Conversely, when the Audio modality is dominant (M I : Audio), there is a clear surge in attention toward audio tokens. This contrasts with the baselines, which tend to exhibit a static attention bias (often towards vision or text) regardless of which modality holds the critical emotional information. This demonstrates that MIGR successfully focuses on the informative modality identified by the MI, thereby contributing to the generation of more reliable and emotion-grounded reasoning.

Figures 6 and7 present concrete examples that further illustrate the distinct advantages of our model compared to baseline methods.

Precise Capture of Auditory Emotional Cues Figure 6 demonstrates the difference in focusing capability regarding auditory cues. In case (a), despite the presence of the protagonist’s clear painful speech, baseline models fail to detect it or provide reasoning with contradictory emotions, such as “excited.” Conversely, in case (b), which lacks explicit emotional speech, baseline models tend to hallucinate audio descriptions corresponding to the serious visual cues. In contrast, MIGR accurately captures the “sad and tearful” audio signals in (a) and correctly infers the absence of emotion-related information in the audio of (b), demonstrating superior audio-visual discrimination.

Alignment between Reasoning and Prediction Figure 7 highlights the improved consistency performance of our approach. In (a), MIGR maintains the “surprise” emotion consistently from the initial reasoning phase to the final prediction. In (b), while MIGR initially considers both “happy” and “surprise,” it successfully refines its judgment through MIGR (Ours) : <vis_desc> . . . His facial expression is neutral, with his eyes straight ahead and his mouth closed. . . . </vis_desc><aud_desc>In the audio, there are no valuable emotional clues provided. In the text, it is impossible to determine. . . . </aud_desc> . . . The man’s neutral facial expression and focused gaze suggest that he is calmly observing his surroundings without displaying strong emotion . . . Since the auditory channel offers no additional emotional information . . . Together, these multimodal signals coalesce naturally to support the inference that the target emotion is neutral. neutral ERV : . . . His facial expression is serious, and his eyes are fixed straight ahead, giving off an overall stern vibe. . . . In the audio, there is a long pause at the beginning of the speech, after which the voice is deep and carries a sense of powerlessness, as if the character is sighing. In the text, it is impossible to determine the emotional state based on the subtitle content. sad R1-Omni: . . . his overall expression is serious and negative. From his facial expression and posture, it can be inferred that he may be experiencing an inner conflict or struggle. In the audio, the character sighs, and the tone of voice is low and conveys a sense of heaviness and exhaustion. In the text, it is impossible to determine . . a progressive reasoning process, converging to “happy” as the final prediction. On the other hand, baseline models exhibit reasoning processes that are inconsistent with their final predicted emotions. These qualitative results provide concrete examples that corroborate the consistency analysis presented in Table 2.

Additional Analysis of Performance Degradation in the Surprise Category

Although the DFEW dataset is utilized as a single-label benchmark, its ground truth generation involves a voting process by 10 trained human annotators. A video is assigned a single emotion label only if it receives a consensus vote of over 60%. Importantly, other emotions perceived in the video are also recorded as concurrent labels. To investigate the misclassification of Surprise samples by MIGR, we ERV: . . . woman in a forest at night. At the beginning of the video, she turns her head to one side, as if responding to someone or something. In the following scenes, she turns her head back, looking directly at the camera, with a somewhat surprised expression on her face. In the audio, it is impossible to determine . . . In the text, it is impossible to judge . . . surprise R1-Omni: . . . woman in a dark forest. . . . she turns her head and looks up, seemingly seeing or thinking about something above her. Her mouth moves slightly, as if speaking, but her gaze does not meet anyone, indicating that she may be engaged in a monologue or contemplating something. In the middle of the video, she stops speaking and opens her mouth, possibly reacting to something unexpected. Her eyes widen and her facial expression becomes tense, . . . a scare or is experiencing a tense situation. In the audio, the tone is low and the pace is slow, with long pauses between sentences.

In the text, it is impossible . . . surprise (No Speech / BGM only) (d) Figure 8. Qualitative comparison of samples with Surprise as the ground truth. While MIGR misclassifies the final label as Fear, its reasoning process accurately captures the compound emotional state of both Surprise and Fear. In contrast, ERV focuses solely on a single emotion in its reasoning, and R1-Omni exhibits hallucinations by basing its inference on non-existent subtitles. Text highlighted in yellow and green corresponds to reasoning associated with Surprise and Fear, respectively, while gray highlights indicate hallucinations regarding content absent from the input video.

concurrent human annotations and MIGR’s predictions (Table 4) suggests that MIGR is not merely making erroneous predictions; rather, it is sensitive to the subtle, multi-faceted nature of emotions that even human annotators perceive differently.

Building on this insight, we revisited the qualitative performance discussed in Sections 5.1.3 and 5.1.4. While a notable portion of Surprise samples were classified as Fear, our analysis shows that MIGR successfully captures the Surprise element within its reasoning process in 83% of these cases (as shown in Table 4).

Consistent with the previous annotation analysis, our qualitative inspection reveals that these misclassified samples often exhibit compound emotions, where facial cues. The generated reasoning text confirms that MIGR accurately recognizes this complexity, explicitly discussing the presence of both Fear and Surprise elements.

In contrast, while the ERV often matched the ground truth label, its reasoning tended to focus exclusively on the Surprise features, failing to capture the co-occurring Fear signals or the emotional ambiguity. Furthermore, we observed critical hallucination issues in baselines. For example, as shown in Figure 8(a), in a scene containing only background sound, ERV and R1-Omni generated reasoning based on non-existent dialogue (e.g., “Where are you? Come Out!”). This indicates a susceptibility to text-bias, leading to fabricated evidence. Conversely, MIGR effectively mitigates such hallucinations and provides robust reasoning by anchoring its output to the actual multimodal evidence, attributed to our Modality-Grounded Reasoning objective.

Training and Evaluation. MIGR consists of two training stages: the SFT stage and the GRPO stage. In the SFT stage, we construct an additional emotional reasoning dataset using the MAFW and DFEW training datasets. Based on the ERV[16], we align samples where the emotion represented by the reasoning output matches the target emotion. AU (Action Unit) information is extracted using the OpenFace toolkit, and AU sets are organized following the dataset construction pipeline of Emotion-LLaMA[11]. Using the Emotion-AU Table for emotion alignment verification, we incorporate 184 samples from MAFW and 253 samples from the DFEW training set as additional reasoning data. Training is conducted for 5 epochs with a cosine scheduler, a warmup ratio of 0.03, a learning rate of 2e-5, and a batch size of 32, using 8 × NVIDIA A100 GPUs. In the GRPO stage, the gradient accumulation step is set to 2, the local batch size to 1, and the generation number (G) to 4, with a learning rate of 1e-6, and a cosine decay scheduler. The training is performed for 1 epoch, and evaluation is conducted with a temperature of 0.3.

EMER

📄 Read Full PDF on ArXiv