Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition


Authors: Yu Liu, Lei Zhang, Haoxun Li

Yu Liu*1  Lei Zhang*1  Haoxun Li1  Hanlei Shi1  Yuxuan Ding1  Leyuan Qu1  Taihao Li1

Abstract

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose–Verify–Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines—especially in ambiguous or conflicting scenarios—while providing interpretable, diagnostic evidence traces.

1. Introduction

A girl stands on a podium holding a silver medal, her eyes filled with tears. A standard multimodal classifier may quickly commit to a single dominant label such as sadness.
Yet, these equivocal multimodal cues can support a plurality of co-occurring interpretations: for instance, sadness intertwined with pride (achievement), regret (missing the gold), or relief (concluding a struggle). This is not an edge case; inferring affective states is inherently challenging because emotion is often contextually underspecified by surface signals, requiring the inference of unobserved situational dynamics (Poria et al., 2017; Lian et al., 2023b).

1 Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China. Correspondence to: Taihao Li <lith@ucas.ac.cn>, Leyuan Qu <leyuan.qu@ucas.ac.cn>. Preprint. March 18, 2026.

Conventional Multimodal Emotion Recognition (MER) models have been restricted to fixed label taxonomies, which struggle to reflect the open-ended and nuanced nature of human affect (Han et al., 2025b; Zhao et al., 2025c). Open-Vocabulary MER (OV-MER) relaxes this constraint and enables flexible predictions beyond fixed taxonomies (Lian et al., 2024c; 2023b). However, OV-MER introduces a key mismatch: evaluation stresses label cardinality and synonymy, while training often reduces to token-level likelihoods that ignore such semantic structure (Bhattacharyya & Wang, 2025). As a result, models exploit shortcut heuristics and become fragile under cross-modal conflict (Geirhos et al., 2020; Li et al., 2023; Zhu et al., 2025). This gap naturally motivates adopting stronger semantic priors to interpret open-ended labels and latent contexts.

To supply such priors, recent work increasingly adopts Multimodal Large Language Models (MLLMs) as general-purpose multimodal reasoners (Liu et al., 2023; Lin et al., 2024; Ge et al., 2024). They offer broad semantic coverage, yet they often fail in the same way: premature commitment to a dominant, prior-driven interpretation, with complementary cues left unused (Bai et al., 2024; Leng et al., 2024; Lei et al., 2024). Standard grounding strategies therefore suppress generation to mitigate hallucinations (Dhuliawala et al., 2024). We argue this misses the point: the main issue is not generation itself, but biased singular interpretation from incomplete evidence (Zhang et al., 2025). Under ambiguous or conflicting cues, models collapse to a single dominant narrative and ignore complementary affective evidence across modalities—a phenomenon mirroring the "System 1" thinking that lacks the rigors of deliberative verification (Kahneman, 2011).

Our key idea is to treat the generative prior as structured intuition, not as a final decision. Instead of accepting a single rationale, we generate multiple competing candidates and then force them to be adjudicated by the observed multimodal evidence (Mistretta et al., 2025; Wu et al., 2025b; Chang et al., 2026). We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes this process as a Propose–Verify–Decide protocol. HyDRA proposes a small set of diverse situational hypotheses to avoid early collapse to a single, prior-driven narrative. It then verifies them through evidence-constrained comparison, eliminating candidates that conflict with salient multimodal observations. Finally, it decides by selecting the hypothesis that best reconciles the observed cues, effectively retaining evidence-consistent affective descriptors while filtering shallow semantics and spurious priors.

To make this behavior a learned capability rather than a prompting trick, we optimize HyDRA with reinforcement learning and hierarchical reward shaping. Using Group Relative Policy Optimization (GRPO) (Shao et al., 2024; Guo et al., 2025), we align the policy to favor diverse proposal generation, rigorous evidence grounding, and accurate task-specific decisions. GRPO acts as a differential filter that rewards evidential closure, pushing the model away from plausible-sounding hallucinations and toward rationales necessitated by the multimodal evidence (Li et al., 2025c). This also yields diagnostic reasoning traces for analyzing model behavior under ambiguity and conflict.

Contributions:

• A hypothesis-driven inference interface for OV-MER. We formalize OV-MER as a Propose–Verify–Decide procedure that generates multiple latent-context hypotheses and performs evidence-constrained adjudication to avoid premature commitment under equivocal multimodal cues.

• Learning to adjudicate, not prompting to look structured. We couple the protocol with GRPO-based policy optimization and hierarchical rewards to internalize comparative verification and evidence closure, outperforming prompt-only and alternative training paradigms under the same backbone.

• Systematic evidence beyond aggregate scores. We provide controlled ablations on hypothesis cardinality, reward components, and training paradigms, and demonstrate that the gains are driven by multi-path adjudication rather than model scale.

2. Related Work

2.1. From Closed-set to Open-Vocabulary MER

Conventional MER typically treats affect as a classification task within fixed taxonomies (Zadeh et al., 2016). While structured, these paradigms struggle to capture the fluid, overlapping, and context-dependent nature of human emotions in the wild (Guerdelli et al., 2023). To address this, OV-MER has emerged, leveraging unrestricted natural language to bridge discrete labels with continuous semantic spaces (Lian et al., 2024c). Despite this flexibility, OV-MER faces a persistent challenge in resolving situational dynamics under modal ambiguity or conflict (Han et al., 2025a).
When visual and acoustic signals contradict—such as a "tearful smile"—current models often lack the inferential depth to deduce the latent context, resulting in superficial mappings between signals and descriptors (Wei et al., 2023). While MLLMs like LLaVA-NeXT and InternVL3.5 offer vast world knowledge for affective interpretation (Li et al., 2024a; Wang et al., 2025), their performance is frequently bottlenecked by a "premature commitment" to dominant linguistic priors (Zhou et al., 2023; Huang et al., 2024; Goyal et al., 2017). This leads to suboptimal heuristics that ignore subtle, complementary cues, necessitating a more rigorous, evidence-grounded reasoning framework (Dhuliawala et al., 2024; Zelikman et al., 2024).

2.2. Heuristic Bias and Premature Commitment in MLLMs

The "premature commitment" observed in MLLMs is often attributed to a reliance on strong internal priors over sparse external evidence (Qian et al., 2025; Poria et al., 2017; Lian et al., 2023b), a phenomenon widely documented in complex multimodal reasoning. In affective contexts, this manifests as a collapse into a single dominant narrative, disregarding the "long tail" of nuanced emotional possibilities (Rha et al., 2025). Recent studies suggest that standard grounding techniques often fail to mitigate this bias because they focus on token-level alignment rather than higher-order semantic consistency (Wu et al., 2025a; Mohsin et al., 2025). We argue that robust affective reasoning requires an abductive process that transcends surface-level associations (Chang et al., 2026). This perspective is motivated by recent findings that large language models often struggle with logical parsimony in abductive reasoning, frequently failing to reconcile evidence through the simplest or most consistent explanatory paths (Sun & Saparov, 2025; Zhang et al., 2025).
Specifically, synthesizing multiple evidence-grounded rationales—rather than pursuing a singular path—has been shown to reconcile conflicting observations across diverse latent perspectives in other reasoning-heavy tasks like video understanding and fact-checking (Dang et al., 2025; Zhao et al., 2023). By formalizing this as a Propose–Verify–Decide protocol, HyDRA ensures that the final interpretation is not merely a reflection of model priors, but a conclusion necessitated by the totality of multimodal evidence.

Figure 1. HyDRA overview. (A) Propose–Verify–Decide inference protocol: given multimodal input, HyDRA proposes multiple latent-context hypotheses, adjudicates them via evidence-constrained comparison with explicit citations, and outputs the most plausible emotion set. (B) Learning HyDRA with GRPO and hierarchical reward shaping: for each input we sample a group of structured trajectories in the ---- format, score them with six rewards, compute group-relative advantages, and update the policy to favor evidential closure and robust decisions under ambiguity. Textual rationales are grounded to human-verified textualized multimodal cues (ObsG) provided by the datasets, and enforced via semantic alignment to ground-truth cue descriptions.

2.3. Evidence-Grounded Reasoning and Rationale Synthesis

We argue that effective affective reasoning necessitates moving beyond surface-level associations, as emotions are often contextually underspecified by surface signals (Poria et al., 2017; Lian et al., 2023b). While MLLMs can generate descriptive labels, they frequently fall victim to shortcut learning, where the model bypasses deep situational analysis by mapping salient but equivocal features (e.g., "tears") directly to high-frequency labels (e.g., "sadness") based on statistical co-occurrences in training data (Sun et al., 2024; Geirhos et al., 2020; Li et al., 2023; Zhu et al., 2025). Such associations are often "hijacked" by dominant linguistic priors, leading to systematic failures in "emotion conflict" scenarios where surface signals contradict latent situational dynamics (Wu et al., 2025c). Consequently, simple associative mapping is insufficient for capturing evoked emotions—those deeply rooted in situational context rather than mere facial muscle movements (Barrett et al., 2019).

To overcome these heuristic bottlenecks, robust reasoning requires reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile observations from diverse latent perspectives (Wang et al., 2022; Zhao et al., 2023; Dhuliawala et al., 2024). This perspective is rooted in abductive reasoning, which requires the generation of "Explanatory Hypotheses" to account for incomplete or ambiguous visual evidence (Chang et al., 2026). Formulated as Propose–Verify–Decide, HyDRA makes affective prediction the outcome of comparative verification and elimination under multimodal constraints, yielding decisions that remain stable under ambiguity and cross-modal conflict.

3. Methodology

3.1. Overview

We introduce HyDRA, a framework that operationalizes multimodal emotion reasoning as a structured Propose–Verify–Decide sequence. As shown in Fig. 1, HyDRA replaces the standard singular discriminative path with a multi-path reconstruction process, synthesizing rationales from divergent perspectives to mitigate prior-driven biases. The framework is optimized via a two-stage pipeline: (i) Cold-start Multimodal Supervision to seed the structured reasoning protocol, and (ii) Policy Optimization via GRPO to enforce multimodal grounding and deductive consistency through hierarchical reward shaping.

3.2. The HyDRA Protocol

Unlike standard MLLMs, HyDRA treats emotion recognition as a reconstruction task over a latent situational space S. Given multimodal observations X = {V, A, T}, we estimate the optimal emotion set Ŷ by adjudicating over K generated rationales H = {H_1, ..., H_K}:

    Ŷ = Decide( Σ_{k=1}^{K} Φ(H_k, X) · Ψ(H_k, Y) ),    (1)

where Φ(·) measures the abductive grounding of rationale H_k against X, and Ψ(·) represents the deductive consistency between the rationale and label Y. This process is operationalized through a three-stage inference interface:

• Proposal (Abductive Stage): The model generates K competing hypotheses H, where each H_k encapsulates a latent context s_k and predicted cue descriptions E_k = {e_{k,1}, e_{k,2}, ...}.

• Verification (Deductive Stage): The model performs a "cross-examination" within a ⟨think⟩ block, treating each H_k as a premise to verify its consistency with observations X.

• Decision: In HyDRA, the Decide operation is operationalized as the final synthesis in the ⟨think⟩ block, where the model identifies the optimal rationale H* that maximizes the joint grounding strength Φ and consistency Ψ, thereby yielding the final emotion set Ŷ.

While Eq. 1 provides a theoretical objective, HyDRA operationalizes it end-to-end: Φ is implicitly optimized via semantic alignment of predicted cues, and Ψ is enforced through task-level rewards during the Reinforcement Learning (RL) stage.

3.3. Cold-Start Multimodal Supervision

To initialize the reasoning protocol π_θ, we perform supervised fine-tuning (SFT) on a corpus of structured reasoning traces. The model comprises a causal transformer backbone integrated with vision and audio encoders via projection layers.
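Before turning to training, the adjudication in Eq. (1) can be made concrete. The sketch below stands in for Φ and Ψ with hypothetical scoring callables (the paper optimizes both only implicitly, through rewards), so the Decide step reduces to an argmax over grounding-weighted consistency scores; the hypothesis dictionaries and score fields are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

def decide(
    hypotheses: List[Dict],
    observations: Dict,
    grounding: Callable[[Dict, Dict], float],   # stand-in for Phi(H_k, X)
    consistency: Callable[[Dict], float],       # stand-in for Psi(H_k, Y)
) -> Tuple[Dict, float]:
    """Select the hypothesis maximizing joint grounding * consistency (cf. Eq. 1)."""
    scored = [(h, grounding(h, observations) * consistency(h)) for h in hypotheses]
    return max(scored, key=lambda pair: pair[1])

# Toy usage with stand-in scores attached to each hypothesis:
# the better-grounded runner-up narrative wins the adjudication.
hyps = [
    {"context": "grief over a loss", "emotions": ["sadness"], "phi": 0.3, "psi": 0.9},
    {"context": "proud but regretful runner-up", "emotions": ["pride", "regret"], "phi": 0.8, "psi": 0.8},
]
best, score = decide(hyps, {}, lambda h, x: h["phi"], lambda h: h["psi"])
```

In HyDRA itself this selection is performed in text inside the ⟨think⟩ block rather than by an explicit scoring loop; the sketch only fixes the semantics of the Decide operator.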
Given multimodal inputs X, we minimize the negative log-likelihood over the expert trace Y:

    L_SFT(θ) = −E_{(X,Y)∼D_SFT} Σ_{t=1}^{|Y|} log p_θ(y_t | X, y_{<t}).

    … = 1 if match(·) > 0, else 0.    (8)

The match(·) operator employs fuzzy string matching to tolerate minor surface variations.

Semantic Grounding (r_sem). To ensure rationales remain rooted in multimodal reality, we align predicted evidence descriptions P with ground-truth cue annotations G using sentence embedding similarity φ(·, ·):

    r_sem = (1/|P|) Σ_{p_i ∈ P} Q( max_{g ∈ G} φ(emb(p_i), emb(g)) ),    (9)

where Q(·) discretizes continuous similarity into robust reward levels.

Table 2. Conflict robustness results on MER-FG. We evaluate HyDRA against HumanOmni (Baseline) and other models across High Conflict (HCS) and Low Conflict (LCS) subsets. HyDRA maintains superior performance even under significant modal conflict. ‡: HumanOmni (7B); *: AffectGPT trained on OV-MERD. Best results are bolded, second-best underlined.

| Model      | HCS S1 | HCS S2 | LCS S1 | LCS S2 |
|------------|--------|--------|--------|--------|
| Baseline   | 30.85  | 17.24  | 31.33  | 18.50  |
| Baseline‡  | 33.05  | 17.80  | 34.56  | 19.17  |
| PandaGPT   | 41.15  | 22.15  | 45.43  | 25.59  |
| AffectGPT* | 43.63  | 27.89  | 48.82  | 29.76  |
| HyDRA      | 54.78  | 30.56  | 55.51  | 30.13  |

Synergistic Effect. While r_sem and r_evid provide fact-based grounding, r_think and r_cite facilitate logical synthesis. Together, this hierarchical reward system guides the policy to evolve from generating isolated "backstory seeds" to internalizing a comprehensive reasoning trajectory that reconstructs the true affective state through the triangulation of multimodal evidence.

4. Experiments

4.1. Datasets & Evaluation Metrics

Cold-start Subset. To establish a robust initial capability for open-vocabulary reasoning, we utilize OV-MERD (Lian et al., 2024a) and 242 expert-verified samples from MERCaption+ (Lian et al., 2025a), ensuring high semantic density for the cold-start phase.

GRPO RL Subset. For the RL stage, we curate 12,000 samples from MERCaption+, manually filtered by audio-visual clarity and alignment accuracy. Both subsets are partitioned into a unified Observation Graph (ObsG) format; the full preparation process is detailed in Appendix B. We adopt HumanOmni-0.5B (Zhao et al., 2025b) as our backbone, noting its architectural differences from the 7B version in Appendix D.

Figure 2. Left: The original input (visual/audio/text). Visual priors contradict subtle audio/textual cues. Middle (Success): HyDRA resolves the conflict via the Propose–Verify–Decide protocol. Right (Failure): R1-omni commits prematurely to salient visual signals. Due to the presence of real individuals in the original videos, personally identifiable information has been removed and processed via visualization to address copyright and privacy concerns.

Benchmarks. We evaluate HyDRA across two primary dimensions: (1) General Emotion Recognition: We assess the model's performance on both sentiment-level intensity and discrete emotional categories. Specifically, we utilize CMU-MOSI (Zadeh et al., 2017) and CH-SIMS (Yu et al., 2020) to evaluate sentiment understanding, while MER2023 (Lian et al., 2023a) and MER2024 (Lian et al., 2024b) are employed to test the recognition of basic emotional categories. (2) Open-Vocabulary Fine-Grained (OV-FG): We adopt the MER-FG benchmark (Lian et al., 2024c) (a curated subset of MERCaption+), which provides a high-density label space and serves as a rigorous stress test for open-set affective reasoning and deductive precision.
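The MER-FG protocol scores each sample by an F1 between open-vocabulary label sets after synonym grouping. A minimal sketch of that set-level scoring follows, with a toy grouping table standing in for the actual emotion wheels; function and variable names are our own illustrative choices, not the benchmark's implementation.

```python
def sample_f1(pred: set, gold: set, group: dict) -> float:
    """Sample-wise F1 between open-vocabulary label sets after mapping
    each label onto a shared taxonomic level (toy emotion wheel)."""
    p = {label: None for label in pred} and {group.get(label, label) for label in pred}
    g = {group.get(label, label) for label in gold}
    if not p or not g:
        return 0.0
    tp = len(p & g)                      # grouped labels present in both sets
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# "sorrow" and "pride" land in the same coarse groups as the references,
# so the sample scores a perfect F1 despite no exact string match.
wheel = {"sorrow": "sadness", "delight": "joy", "pride": "joy"}
score = sample_f1({"sorrow", "pride"}, {"sadness", "joy"}, wheel)
```

A coarser or finer `group` table corresponds, loosely, to the S1 versus S2 granularities.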
Crucially, although MERCaption+ is utilized across different stages, we ensure that the three subsets—the 200 SFT samples, the 12,000 RL samples, and the OV-FG evaluation set—are strictly disjoint. This isolation guarantees that the evaluation reflects true zero-shot or out-of-distribution performance without any overlap with the training data.

Evaluation Metrics. For the Sentiment and Basic emotion tasks, we strictly follow the protocols in (Lian et al., 2025a), adopting ACC. Following the MER-FG protocol (Lian et al., 2025b), we report the average F1-score across two granularities on OV-FG: S1 (coarse-grained) and S2 (fine-grained). To handle open-vocabulary synonyms and varying label densities, both metrics apply hierarchical grouping via multiple emotion wheels to map predictions into unified taxonomic levels before calculating the sample-wise F1-score. Unless otherwise noted, all experimental results are reported in percentages (%). Full details regarding benchmark statistics, baseline descriptions, metric specifications, and implementation configurations are provided in Appendices C, D, E, and F, respectively.

4.2. Main Result

Table 1 shows our method achieves the best overall performance, despite using a 0.5B backbone, reaching an average score of 61.53 across all six evaluations. This indicates that the improvement is primarily driven by the proposed Propose–Verify–Decide inference, rather than model scale.

Strength on open-vocabulary fine-grained emotion. The largest gains appear on OV-FG, where our method ranks first on both granularities. In particular, it substantially improves the coarse-grained score and also sets a new best fine-grained score. This aligns with our motivation: when the label space is open and cues are underspecified, a single prior-driven interpretation is brittle, while multi-hypothesis proposal followed by evidence-constrained verification better reconciles complementary cues.

Table 3. Ablation study of GRPO reward components on MER-FG. Symbols • and ◦ indicate enabled and disabled rewards, respectively. For.: Format constraints; Acc.: Accuracy reward; Cit.: Hierarchical citation; Sem.: Semantic Grounding. Bold: Best; Underline: Second-best.

| For. | Acc. | Cit. | Sem. | S1    | S2    | Avg   |
|------|------|------|------|-------|-------|-------|
| •    | ◦    | ◦    | ◦    | 47.93 | 25.67 | 36.80 |
| ◦    | •    | ◦    | ◦    | 49.27 | 25.73 | 37.50 |
| ◦    | ◦    | •    | ◦    | 47.30 | 25.85 | 36.57 |
| ◦    | ◦    | ◦    | •    | 49.79 | 26.00 | 37.90 |
| •    | ◦    | •    | •    | 50.46 | 26.35 | 38.40 |
| •    | •    | ◦    | •    | 52.65 | 27.88 | 40.26 |
| •    | •    | •    | ◦    | 52.42 | 26.08 | 39.25 |
| •    | •    | •    | •    | 55.52 | 30.48 | 43.00 |

Robustness on basic emotion and sentiment. Beyond OV-FG, our method remains consistently strong on Basic and Sentiment tasks, achieving second-best performance across MER2023, MER2024, SIMS, and MOSI. Notably, it stays competitive with 7B baselines on sentiment while preserving strong basic-emotion recognition, suggesting that enforcing evidence-closed reasoning does not trade off general affect recognition capability for fine-grained gains.

Conflict robustness. Table 2 stratifies the test set by cross-modal conflict. All baselines drop noticeably on HCS compared to LCS, confirming their fragility under contradictory cues. HyDRA stays best on both subsets and degrades the least, indicating that multi-path adjudication mitigates premature commitment when modalities disagree. Compared to AffectGPT, HyDRA improves S1 by +11.15 on HCS versus +6.69 on LCS, showing larger relative gains in conflict-heavy cases.

Figure 2 illustrates a representative scenario where HyDRA effectively resolves modality conflicts that mislead standard models. In this case, the subject maintains a "calm and focused" facial expression, while the audio and text cues signify intense internal distress.
Unlike R1-omni, which suffers from premature commitment to the salient visual signal, HyDRA avoids an immediate decision. Through the Propose–Verify–Decide protocol, it generates competing hypotheses and correctly adjudicates that the subtle audio cue is a more authentic indicator of affect than the controlled visual mask. By reconciling these contradictory signals, HyDRA successfully identifies the state as Genuine Anxiety, matching the ground truth while providing a transparent, evidence-grounded reasoning trace.

Table 4. Impact of hypothesis cardinality (K) on MER-FG. "No-hypotheses" denotes the standard linear reasoning (R1-style) without the Propose–Verify–Decide protocol. K = 2 achieves the optimal balance between analytical diversity and evidential focus. All variants use the same 0.5B backbone. Bold: Best; Underline: Second-best.

| Model         | S1    | S2    | Avg   |
|---------------|-------|-------|-------|
| Baseline      | 30.98 | 17.58 | 24.28 |
| No-hypotheses | 52.14 | 26.88 | 39.51 |
| 1-hypothesis  | 49.24 | 25.50 | 37.37 |
| 3-hypotheses  | 54.48 | 26.05 | 40.26 |
| 4-hypotheses  | 53.51 | 26.28 | 39.89 |
| HyDRA         | 55.52 | 30.48 | 43.00 |

A detailed discussion regarding the scope and limitations of our proposed method is provided in Appendix G.

5. Ablation Study

This section provides a systematic decomposition of HyDRA to evaluate the contribution of its core components: reward design, reasoning structure, and training paradigms. Unless otherwise specified, all variants are evaluated on OV-FG using its standard metrics. In every ablation configuration, the RL stage follows the same training schedule as the formal experiments. The results reported here reflect the alignment behavior across different design choices, providing a direct comparison of the impact of each component on the model's performance.

5.1. Ablating GRPO Rewards

We study how each reward component contributes to RL alignment under HyDRA. We evaluate three settings: enabling all rewards (default), keep-one-only, and leave-one-out, as summarized in Table 3.

Keep-one-only. When only one reward is enabled, the Accuracy reward and the Evidence selection reward yield the strongest performance. This is expected because both provide task-aligned content supervision: one targets the final prediction quality, while the other constrains evidence quality via semantic matching. In contrast, Format rewards and Hierarchical evidence citation rewards offer limited gains in isolation, suggesting that structural compliance alone is insufficient to ensure correct open-vocabulary emotion inference.

Leave-one-out. In multi-reward RL, removing the Accuracy reward causes a clear degradation, indicating that it remains the dominant task-aligned learning signal, which employs P_ℓ to mitigate reward hacking (see Appendix H). Conversely, removing Hierarchical evidence citation rewards yields the strongest performance among leave-one-out variants.

Table 5. Comparison of training paradigms on MER-FG. Cold-start and GRPO RL Subsets are used for the initial and RL stages. SFT_full, PPO, and HyDRA share the same data budget. Results show that RL (especially our GRPO-based HyDRA) significantly outperforms SFT scaling. Bold: Best; Underline: Second-best.

| Model        | S1    | S2    | Avg   |
|--------------|-------|-------|-------|
| Prompt-only  | 23.26 | 12.34 | 17.80 |
| Baseline     | 30.98 | 17.58 | 24.28 |
| SFT_cold     | 45.91 | 24.22 | 35.07 |
| PPO training | 51.09 | 26.93 | 39.01 |
| SFT_full     | 49.60 | 25.31 | 37.46 |
| HyDRA        | 55.52 | 30.48 | 43.00 |
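Whatever the reward configuration, each sampled trajectory ultimately contributes one scalar that GRPO normalizes within its sampling group. A minimal sketch of this aggregation, assuming the standard group-relative normalization and purely illustrative component names:

```python
import statistics

def group_relative_advantages(component_rewards: list) -> list:
    """GRPO-style advantages: sum each trajectory's reward components
    (format, accuracy, citation, semantic, ...), then normalize the totals
    by the group mean and standard deviation."""
    totals = [sum(r.values()) for r in component_rewards]
    mu = statistics.mean(totals)
    sigma = statistics.pstdev(totals) or 1.0  # guard against a zero-variance group
    return [(t - mu) / sigma for t in totals]

# A group of 4 sampled trajectories with toy reward components.
group = [
    {"format": 1.0, "accuracy": 1.0, "citation": 0.5, "semantic": 0.5},
    {"format": 1.0, "accuracy": 0.0, "citation": 0.5, "semantic": 0.0},
    {"format": 0.0, "accuracy": 0.0, "citation": 0.0, "semantic": 0.0},
    {"format": 1.0, "accuracy": 1.0, "citation": 0.0, "semantic": 0.5},
]
adv = group_relative_advantages(group)
```

Because the advantages are relative within the group, a trajectory is reinforced only when it beats its siblings, which is what lets the content-driven rewards act as the differential filter described above.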
In our final training configuration, Format rewards are consistently enabled as a foundational constraint, rather than a primary variable in the leave-one-out analysis. This decision is made because removing format rewards in RL yields high numerical scores, yet it incurs the risk of long-term structural degradation. By fixing format rewards as a constant, we ensure the model maintains strict structural integrity, while simultaneously isolating the impacts of content-driven rewards.

5.2. Impact of Hypothesis Cardinality on Reasoning

Table 4 evaluates hypothesis cardinality K. Every Propose–Verify–Decide variant significantly outperforms the baseline, confirming that the structural benefit of the Propose–Verify–Decide protocol is fundamental. However, the "No-hypotheses" variant—representing a linear reasoning paradigm akin to the DeepSeek-R1 style—remains suboptimal. While linear traces allow for logical depth, they lack the explicit divergent-then-convergent mechanism necessary to reconcile the ambiguous and often contradictory cues inherent in affective reasoning.

We identify a performance "sweet spot" at K = 2, which balances analytical diversity with token efficiency. Interestingly, the K = 1 variant performs worse than the linear "No-hypotheses" baseline, revealing a confirmation-bias trap: forced upfront commitment causes the verification phase to anchor on justifying the initial guess rather than performing the "cross-examination" described in Sec. 3.2. K = 2 introduces sufficient "cognitive friction" to compel genuine adjudication without exceeding the informational capacity of typical 2–5 second video clips.

Beyond K = 2, returns diminish due to the sparse evidence in short-form content. K = 3 often leads to semantic redundancy where candidates overlap significantly, while K = 4 triggers "over-interpretation." When forced to generate excessive diversity from limited cues, the model may hallucinate situational dynamics to satisfy the cardinality constraint, introducing noise. Notably, while S2 shows slight resilience at K = 4 by occasionally capturing rare fine-grained labels, the overall reliability and latency trade-offs favor the more robust K = 2 configuration.

5.3. Training Paradigm Boundary

Table 5 investigates whether HyDRA's reasoning capability stems from parameter internalization or mere data scaling.

Necessity of Adaptation. The "Prompt-only" approach, which applies our Propose–Verify–Decide protocol to the frozen backbone, performs significantly worse than the baseline. This confirms that the multi-step reasoning logic is not a superficial prompting trick but a complex deductive behavior that must be internalized through systematic parameter updates.

RL vs. SFT Efficiency. While SFT_cold establishes a solid foundation for open-vocabulary tasks, expanding the supervised data to SFT_full yields only marginal improvements. Notably, both RL-based paradigms—PPO and HyDRA—outperform SFT_full using the same data budget. This gap demonstrates that for fine-grained affective reasoning, reinforcement learning is more sample-efficient than density-driven supervised learning, as it encourages the model to explore and self-correct reasoning paths.

Superiority of HyDRA. Among RL variants, HyDRA further surpasses PPO by 3.99% in average score, achieving the peak performance. This suggests that our GRPO framework, tailored with evidence-constrained rewards, more effectively stabilizes the Propose–Verify–Decide trajectory, preventing the model from regressing to biased heuristics.

6. Conclusion

We introduced HyDRA, which operationalizes Open-Vocabulary MER as a Propose–Verify–Decide protocol.
Hy- DRA repurposes MLLM generativ e priors to propose multi- ple competing latent-context hypotheses, and then performs evidence-constrained adjudication to select the emotion set that best reconciles equi vocal and potentially conflicting multimodal cues. T o make this protocol a learned capability rather than a prompting heuristic, we further optimized the model with GRPO and hier ar chical r ewar d shaping , e xplic- itly aligning intermediate reasoning trajectories with final task performance and enforcing evide ntial closur e . Across systematic ev aluations, HyDRA consistently outperformed strong baselines on diverse O V -MER benchmarks, with particularly pronounced gains under ambiguity and cross- modal conflict, and it notably surpassed the performance of an zero-shot 7B baseline. Beyond metric g ains, HyDRA pro- duces interpretable and diagnostic evidence traces, enabling principled analysis of when and why the model commits to a prediction. W e hope this work encourages the community to treat O V -MER as a hybrid abducti ve–deducti ve infer- 8 Follo w the Clues, Frame the T ruth: Hybrid-evidential Deductive Reasoning in O V -MER ence problem and to build learning objectives that re ward verifiable, e vidence-grounded reasoning. Impact Statement This work enhances the reliability and interpretability of Multimodal Emotion Recognition by mitigating model bias in ambiguous scenarios. While our frame work enables more empathetic and transparent AI for applications in mental health and human-computer interaction, we recognize the inherent risks of emotional profiling and priv acy infringe- ment associated with affecti ve computing. W e advocate for the deployment of these technologies exclusi vely within ethical framew orks that prioritize user consent and prohibit unauthorized surveillance. References Bai, Z., W ang, P ., Xiao, T ., He, T ., Han, Z., Zhang, Z., and Shou, M. Z. Hallucination of multimodal large language models: A surv ey . 
arXiv preprint arXiv:2404.18930, 2024.

Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., and Pollak, S. D. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1):1–68, 2019. doi: 10.1177/1529100619832930. PMID: 31313636.

Bhattacharyya, S. and Wang, J. Z. Evaluating vision-language models for emotion recognition. In Chiruzzo, L., Ritter, A., and Wang, L. (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1798–1820, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-195-7. doi: 10.18653/v1/2025.findings-naacl.97.

Chang, B., Wang, Q., Guo, X., Nan, Z., Yao, Y., and Zhou, T. AbductiveMLLM: Boosting visual abductive reasoning within MLLMs, 2026.

Dang, J., Song, H., Xiao, J., Wang, B., Peng, H., Li, H., Yang, X., Wang, M., and Chua, T.-S. MUPA: Towards multi-path agentic reasoning for grounded video question answering, 2025.

Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of-verification reduces hallucination in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3563–3578, 2024.

Ge, M., Tang, D., and Li, M. Video emotion open-vocabulary recognition based on multimodal large language model. arXiv preprint, 2024.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913, 2017.

Guerdelli, H., Ferrari, C., Cardia Neto, J.
B., Berretti, S., Barhoumi, W., and Del Bimbo, A. Towards a better understanding of human emotions: Challenges of dataset labeling. In International Conference on Image Analysis and Processing, pp. 242–254. Springer, 2023.

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.

Han, Z., Zhu, B., Xu, Y., Song, P., and Yang, X. Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia, MM '25, pp. 5528–5537, New York, NY, USA, 2025a. Association for Computing Machinery. ISBN 9798400720352. doi: 10.1145/3746027.3754856.

Han, Z., Zhu, B., Xu, Y., Song, P., and Yang, X. Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. In Proceedings of the 33rd ACM International Conference on Multimedia, pp. 5528–5537, 2025b.

Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., and Yu, N. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13418–13427, 2024.

Jin, P., Takanobu, R., Zhang, W., Cao, X., and Yuan, L. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13700–13710, 2024.

Kahneman, D. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, 2011.

Lei, Y., Yang, D., Chen, Z., Chen, J., Zhai, P., and Zhang, L. Large vision-language models as emotion recognizers in context awareness. arXiv preprint arXiv:2407.11300, 2024.
Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13872–13882, 2024.

Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J. A., Yang, J., Li, C., and Liu, Z. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025a.

Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., and Li, C. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models, 2024a.

Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., and Qiao, Y. VideoChat: Chat-centric video understanding. Science China Information Sciences, 68(10):200102, 2025b.

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20.

Li, Y., Wang, C., and Jia, J. LLaMA-VID: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pp. 323–340. Springer, 2024b.

Li, Z., Wu, X., Shi, G., Qin, Y., Du, H., Liu, F., Zhou, T., Manocha, D., and Boyd-Graber, J. L. Video-Hallu: Evaluating and mitigating multi-modal hallucinations on synthetic video understanding. arXiv preprint arXiv:2505.01481, 2025c.

Lian, Z., Sun, H., Sun, L., Chen, K., Xu, M., Wang, K., Xu, K., He, Y., Li, Y., Zhao, J., Liu, Y., Liu, B., Yi, J., Wang, M., Cambria, E., Zhao, G., Schuller, B. W., and Tao, J.
MER 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, MM '23, pp. 9610–9614, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400701085. doi: 10.1145/3581783.3612836.

Lian, Z., Sun, H., Sun, L., Gu, H., Wen, Z., Zhang, S., Chen, S., Xu, M., Xu, K., Chen, K., et al. Explainable multimodal emotion recognition. arXiv preprint arXiv:2306.15401, 2023b.

Lian, Z., Sun, H., Sun, L., Chen, H., Chen, L., Gu, H., Wen, Z., Chen, S., Zhang, S., Yao, H., et al. OV-MER: Towards open-vocabulary multimodal emotion recognition. arXiv preprint arXiv:2410.01495, 2024a.

Lian, Z., Sun, H., Sun, L., Wen, Z., Zhang, S., Chen, S., Gu, H., Zhao, J., Ma, Z., Chen, X., et al. MER 2024: Semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, pp. 41–48, 2024b.

Lian, Z., Chen, H., Chen, L., Sun, H., Sun, L., Ren, Y., Cheng, Z., Liu, B., Liu, R., Peng, X., Yi, J., and Tao, J. AffectGPT: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. In Forty-second International Conference on Machine Learning, 2025a.

Lian, Z., Liu, R., Xu, K., Liu, B., Liu, X., Zhang, Y., Liu, X., Li, Y., Cheng, Z., Zuo, H., et al. MER 2025: When affective computing meets large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, pp. 13837–13842, 2025b.

Lian, Z. et al. OV-MER: Towards open-vocabulary multimodal emotion recognition. arXiv preprint arXiv:2410.01495, 2024c.

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., and Yuan, L. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5971–5984, 2024.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

Maaz, M., Rasheed, H., Khan, S., and Khan, F. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12585–12602, 2024.

Mistretta, M., Baldrati, A., Agnolucci, L., Bertini, M., and Bagdanov, A. D. Cross the gap: Exposing the intra-modal misalignment in CLIP via modality inversion. arXiv preprint arXiv:2502.04263, 2025.

Mohsin, M. A., Umer, M., Bilal, A., Memon, Z., Qadir, M. I., Bhattacharya, S., Rizwan, H., Gorle, A. R., Kazmi, M. Z., Mohsin, A., et al. On the fundamental limits of LLMs at scale. arXiv preprint, 2025.

Poria, S., Cambria, E., Bajpai, R., and Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017.

Qian, Y., Wan, C., Jia, C., Yang, Y., Zhao, Q., and Gan, Z. PRISM-Bench: A benchmark of puzzle-based visual tasks with CoT error detection, 2025.

Rha, H., Yeo, J. H., Won, J., Park, S. J., and Ro, Y. M. Learning what to attend first: Modality-importance-guided reasoning for reliable multimodal emotion understanding, 2025.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.

Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., and Cai, D. PandaGPT: One model to instruction-follow them all, 2023.

Sun, Y. and Saparov, A. Language models do not follow Occam's razor: A benchmark for inductive and abductive reasoning, 2025.

Sun, Z., Xiao, Y., Li, J., Ji, Y., Chen, W., and Zhang, M.
Exploring and mitigating shortcut learning for generative large language models. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 6883–6893, Torino, Italia, May 2024. ELRA and ICCL.

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Wei, Y., Yuan, S., Yang, R., Shen, L., Li, Z., Wang, L., and Chen, M. Tackling modality heterogeneity with multi-view calibration network for multimodal sentiment detection. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5240–5252, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.287.

Wu, Q., Yang, X., Zhou, Y., Fang, C., Song, B., Sun, X., and Ji, R. Grounded chain-of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799, 2025a.

Wu, Y., Zhang, L., Yao, H., Du, J., Yan, K., Ding, S., Wu, Y., and Li, X. Antidote: A unified framework for mitigating LVLM hallucinations in counterfactual presupposition and object perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 14646–14656, 2025b.

Wu, Z., Huang, H.-Y., and Wu, Y. Beyond spurious signals: Debiasing multimodal large language models via counterfactual inference and adaptive expert routing, 2025c.
Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

Yu, W., Xu, H., Meng, F., Zhu, Y., Ma, Y., Wu, J., Zou, J., and Yang, K. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3718–3727, 2020.

Zadeh, A., Zellers, R., Pincus, E., and Morency, L.-P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos, 2016.

Zadeh, A., Chen, M., Poria, S., Cambria, E., and Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. arXiv preprint, 2017.

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.

Zhang, X., Zeng, F., and Gu, C. SimIgnore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks, 184:107059, 2025.

Zhao, J., Wei, X., and Bo, L. R1-Omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv preprint, 2025a.

Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., Wei, X., Bo, L., et al. HumanOmni: A large vision-speech language model for human-centric video understanding. arXiv preprint, 2025b.

Zhao, K., Zheng, M., Li, Q., and Liu, J. Multimodal sentiment analysis: A comprehensive survey from a fusion methods perspective. IEEE Access, 2025c.

Zhao, R., Li, X., Joty, S., Qin, C., and Bing, L. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. arXiv preprint arXiv:2305.03268, 2023.

Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., and Yao, H.
Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.

Zhu, Y., Tao, L., Dong, M., and Xu, C. Mitigating object hallucinations in large vision-language models via attention calibration. arXiv preprint, 2025.

A. Implementation Details of Reward Mechanisms

A.1. Reward Composition and Weights

As introduced in Sec. 3.1, the total reward R is calculated as the weighted sum of individual task-specific rewards. All reward components r_k are normalized to [0, 1] before scaling. Table 6 lists the specific weights used during GRPO training. These weights were determined through a coarse grid search on a validation subset to balance logical rigour (citations/evidence) and task accuracy.

Table 6. Hyperparameters for Hierarchical Reward Shaping.

Category      Component             Weight λ_k   Primary Optimization Goal
Performance   Accuracy (r_acc)      4.0          Maximize F1-score with length penalty P_ℓ.
Protocol      Consistency (r_fmt)   0.5          Ensure JSON validity and tag ordering.
Protocol      Reasoning (r_think)   0.5          Enforce comparative/differential thinking blocks.
Reliability   Citation (r_cite)     1.0          Encourage explicit hypothesis referencing.
Reliability   Evidence (r_evid)     2.0          Enforce intra-trace consistency via fuzzy matching.
Reliability   Grounding (r_sem)     2.0          Align reasoning with multimodal cue annotations.

A.2. Detailed Formulations of Reward Components

Accuracy with Length Penalty. To prevent reward hacking via verbosity, r_acc incorporates a penalty P_ℓ:

$P_\ell = \frac{L}{L_{\mathrm{pre}}}$  (10)

where L is the number of emotion words in the ground-truth labels and L_pre is the number of emotion words in the model-predicted labels.

Fuzzy Matching for r_evid. The match(c, V) operator (Eq. 4) handles surface variations by: (i) converting to lowercase, (ii) stripping punctuation, and (iii) verifying whether the citation string c exists as a substring within the evidence pool V.
For ambiguous cases, a Levenshtein similarity threshold of 0.85 is applied.

Discretization of Semantic Grounding (r_sem). The function Q(·) in Eq. 5 maps continuous cosine similarity s to discrete rewards to stabilize the RL gradient:

$Q(s) = \begin{cases} 1.0, & s \ge 0.7 \\ 0.5, & 0.5 \le s < 0.7 \\ 0, & s < 0.5 \end{cases}$  (11)

We use all-MiniLM-L6-v2 as the backbone for emb(·).

B. Data Preparation

This section details the technical implementation of the automated generation pipeline used to construct data for cold-start SFT and GRPO RL. The overall procedure, illustrated in Figure 3 and formalized in Algorithm 1, is designed as a structured, sequential workflow that transforms raw multimodal descriptions and Ground Truth (GT) labels into standardized training samples following the HyDRA output interface, i.e., a structured response composed of <hypotheses>, <think>, and <answer>.

The pipeline begins with input preparation, followed by Phase 1: ObsG generation. In Phase 1, a general-purpose DeepSeek Chat model converts unstructured multimodal text into structured ObsG JSON objects that summarize objective, time-sliced evidence. The core of the pipeline is Phase 2, where a specialized DeepSeek Reasoner model instantiates HyDRA's Propose–Verify–Decide protocol. Specifically, given a composite context consisting of the objective ObsG and the GT labels, the reasoner is prompted to (i) generate multiple competing hypotheses in <hypotheses>, each grounded in modality-specific evidence candidates; (ii) perform evidence-constrained comparison in <think> by explicitly contrasting shared cues and conflicting cues across hypotheses and selecting the most coherent explanation; and (iii) output the final open-vocabulary emotion set in <answer>, consistent with the verified decision.

Figure 3. Overview of the generation stages of ObsG.
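As a concrete reference for the reward components formalized in Appendix A.2, the following is a minimal, self-contained sketch. It is our own illustration, not the released implementation: function names are ours, and `difflib.SequenceMatcher.ratio` stands in for the Levenshtein similarity mentioned above.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """(i) lowercase, (ii) strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()


def match(citation: str, evidence_pool: list[str], threshold: float = 0.85) -> bool:
    """Fuzzy match(c, V): substring containment first, then a similarity
    fallback for ambiguous cases (the paper uses Levenshtein >= 0.85)."""
    c = normalize(citation)
    for v in evidence_pool:
        v_norm = normalize(v)
        if c in v_norm:  # (iii) substring containment
            return True
        if SequenceMatcher(None, c, v_norm).ratio() >= threshold:
            return True
    return False


def grounding_reward(cos_sim: float) -> float:
    """Q(s) from Eq. (11): discretize cosine similarity to stabilize RL gradients."""
    if cos_sim >= 0.7:
        return 1.0
    if cos_sim >= 0.5:
        return 0.5
    return 0.0


def total_reward(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """R = sum_k lambda_k * r_k, with each r_k pre-normalized to [0, 1] (Table 6)."""
    return sum(weights[k] * rewards[k] for k in rewards)
```

Note that the discretization in `grounding_reward` deliberately discards fine similarity differences; per Appendix A.2, this trades resolution for a more stable RL gradient.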
In Phase 3, we apply deterministic post-processing and normalization to ensure interface stability and reduce spurious variance. Finally, we assemble the verified HyDRA reasoning modules, the original observation modules, and the formatted target answers into a single, structurally standardized JSON training sample, as shown in the output of Step 5 in Figure 3, completing the automated construction of high-quality cold-start data.

For completeness, we also reuse this pipeline to build GRPO training data. In this case, we retain only the Phase 1 outputs (ObsG JSON) and omit Phases 2–3, because GRPO requires objective observations as reward signals rather than fully instantiated HyDRA responses.

System Prompt: Cold-Start Data Generation

You are a structured reasoning assistant. Use ONLY the provided ObservationGraph (ObsG).
• Input Context: Multimodal evidence (video/audio/text) and EVAL-ONLY affect profiles (Ground Truth hints).
• Constraint: Do NOT quote EVAL-ONLY content directly. Rank evidence by cross-modality coverage.
• Output Schema (JSON):
1. hypotheses: A list of 1–3 hypotheses. Each must contain an assumption and evidence_ids_top5 (unique IDs).
2. think: A reasoning trace using **[Common]**, **[Differences]**, and **[Decision]** headers. **MANDATORY: Cite specific evidence like [v1][a2] and cross-reference hypotheses [H1]. [v1][a2] need to be natural language descriptions in the raw data.**
• Termination: Conclude the [Decision] section with "leading to the conclusion for an emotional state of [DESCRIPTOR].
"

Algorithm 1: Cold-Start Data Generation Pipeline

Input: Multimodal data (V, A, T), ground-truth labels GT_labels, prompt templates
Output: Structured training sample S_final

// Phase 1: Observation Generation
Prompt_ObsG ← ConstructObsGPrompt(V, A, T, Template)
ObsG_JSON ← LLM_Chat(Prompt_ObsG)
// Phase 2: Reasoning & Hypotheses Generation
Context ← EmbedGT(ObsG_JSON, GT_labels)
Prompt_Reason ← ConstructReasonerPrompt(Context)
Raw_Output ← LLM_Reasoner(Prompt_Reason)
Hypotheses, CoT ← Parse(Raw_Output)
// Phase 3: Post-Processing & Final Assembly
Hypotheses′ ← EnforceEvidenceCap(Hypotheses, top_k = 5)
Hypotheses′′ ← DeriveModalities(Hypotheses′)
Payload_Answer ← FormatGT(GT_labels)
S_final ← AssembleXML(...)
return S_final

Table 7. Dataset card of the benchmarks used in our experiments. We group datasets by task type and report the official split adopted for evaluation.

Dataset Type                     Raw Dataset                     Selected Subset     # Samples
Sentiment Analysis               CMU-MOSI (Zadeh et al., 2017)   Test                686
                                 CH-SIMS (Yu et al., 2020)       Test                457
Basic Emotion Recognition        MER2023 (Lian et al., 2023a)    MER-MULTI (Test)    411
                                 MER2024 (Lian et al., 2024b)    MER-SEMI (Test)     1,169
Fine-grained Emotion Detection   MER-FG (Lian et al., 2025b)     Test                1,200

C. Datasets Details

We evaluate our method on a diverse set of public multimodal affect benchmarks (Table 7). For Sentiment Analysis, we use the official test splits of CMU-MOSI and CH-SIMS. For Basic Emotion Recognition, we use the test sets of MER2023 (MER-MULTI) and MER2024 (MER-SEMI).
For Fine-grained Emotion Detection, we use the MER-FG test set to assess open-vocabulary fine-grained emotion prediction. Dataset splits and sample counts are summarized in Table 7.

D. Baseline Details

This section summarizes the backbone components of the MLLMs referenced in our experiments and comparisons.

HumanOmni. We adopt HumanOmni-0.5B (Zhao et al., 2025b) as our backbone, a large multimodal model tailored for human-centric tasks via a dual-path vision architecture. HumanOmni is offered at two scales: 0.5B and 7B. While both versions share the same fundamental design, they differ in component scale: HumanOmni-0.5B integrates a SigLip-400M vision tower, a HumanVision-Base encoder, and a Qwen2-0.5B LLM, whereas the 7B version scales these to SigLip-SO400M, HumanVision-Large, and Qwen2-7B, respectively. To best demonstrate our method's effectiveness under constrained computational resources, we employ the 0.5B variant, which provides high-fidelity human-centric features within a compact parameter space.

Otter. Otter is a Flamingo-style, instruction-tuned multimodal assistant that supports in-context multimodal instruction following; we include it as a generic image-based MLLM baseline (Li et al., 2025a).

Video-LLaVA. Video-LLaVA extends LLaVA-style instruction tuning to jointly handle images and videos via a unified visual representation, serving as a strong video-chat baseline without explicit audio modeling (Lin et al., 2024).

Video-ChatGPT. Video-ChatGPT is a video conversation model that pairs a video-adapted visual encoder with an LLM for detailed video dialogue; we use it as a representative early video-chat baseline (Maaz et al., 2024).

LLaMA-VID.
LLaMA-VID targets long-context video understanding by reducing per-frame visual token length; we adopt it to test whether improved temporal scalability benefits affect reasoning (Li et al., 2024b).

Chat-UniVi. Chat-UniVi unifies image and video understanding with dynamic visual tokens under a unified visual representation; we treat it as a competitive unified image-video MLLM baseline (Jin et al., 2024).

PandaGPT. PandaGPT utilizes the multimodal joint representation space of ImageBind to align diverse inputs with the Vicuna language model, serving as a versatile baseline for general-purpose multimodal instruction following (Su et al., 2023).

VideoChat. VideoChat is a chat-centric video understanding framework that connects video foundation models with LLMs via a learnable interface; we include it as a widely used video understanding baseline (Li et al., 2025b).

mPLUG-Owl. mPLUG-Owl equips LLMs with multimodality via modularized training and multimodal instruction tuning; we use it as a representative modular MLLM baseline (Ye et al., 2023).

R1-Omni. R1-Omni applies reinforcement learning with verifiable rewards to improve omni-multimodal (audio-visual) emotion recognition and explainability, making it a key affect-specialized baseline. It is based on HumanOmni-0.5B (Zhao et al., 2025a).

AffectGPT (OV-MERD). We use AffectGPT (OV-MERD) as specified in the MER2025 setting, i.e., the AffectGPT variant trained on OV-MERD for open-vocabulary/fine-grained emotion understanding comparisons (Lian et al., 2025b;a).

E. Emotion Wheel-based (EW) Metric

To provide a robust evaluation for open-vocabulary emotion prediction, we adopt the Emotion Wheel-based (EW) metric proposed in prior literature (Lian et al., 2025a); see Figure 4 for illustrations. This metric accounts for semantic redundancies and synonyms by employing a hierarchical clustering strategy and set-level evaluation.
The computation process is objective and primarily consists of two components: neutralizing synonym influence and defining set-level metrics.

E.1. Handling Synonyms and Variations

To mitigate the impact of morphological variations and semantic overlaps, a three-level hierarchical grouping strategy is employed, represented by a composite clustering function $G^{w_k}(\cdot)$:

• (L1) Morphological Normalization ($F_{l_1}$): This function maps different forms of words to their base form. For instance, words like happier and happiness are normalized to happy.
• (L2) Synonym Mapping ($F_{l_2}$): This function unifies synonyms into a single representative word. For example, joyful and cheerful are mapped to happy.
• (L3) Emotion Wheel Clustering ($F^{w_k}_{l_3}$): This function maps outer, more specific labels to their corresponding inner fundamental labels. Following (Lian et al., 2025a), K = 5 different emotion wheels are adopted to ensure comprehensive coverage.

The complete clustering operation for a specific wheel $w_k$ is summarized as:

$G^{w_k}(\cdot) = F^{w_k}_{l_3}(F_{l_2}(F_{l_1}(\cdot))), \quad k \in [1, K].$  (12)

E.2. Set-level Metric Calculation

Given that the number of predicted and annotated emotions may vary across samples, the evaluation is conducted using set-level metrics. Suppose the dataset contains N samples. For a sample $x_i$, let $Y_i = \{y_i^j\}_{j=1}^{n_i}$ denote the set of true labels and $\hat{Y}_i = \{\hat{y}_i^j\}_{j=1}^{\hat{n}_i}$ denote the set of predicted labels. The metrics for each wheel k are defined as:

$\mathrm{Precision}^k_S = \frac{1}{N}\sum_{i=1}^{N} \frac{|G^{w_k}(Y_i) \cap G^{w_k}(\hat{Y}_i)|}{|G^{w_k}(\hat{Y}_i)|},$  (13)

$\mathrm{Recall}^k_S = \frac{1}{N}\sum_{i=1}^{N} \frac{|G^{w_k}(Y_i) \cap G^{w_k}(\hat{Y}_i)|}{|G^{w_k}(Y_i)|},$  (14)

$F^k_S = \frac{2 \times \mathrm{Precision}^k_S \times \mathrm{Recall}^k_S}{\mathrm{Precision}^k_S + \mathrm{Recall}^k_S}.$  (15)

In these set-wise operations, duplicate emotion words are automatically removed.

E.3.
Final Score Computation

Finally, the comprehensive EW score is determined by averaging the F1-score across all K emotion wheels:

$\mathrm{EW}(Y_i, \hat{Y}_i) = \frac{1}{K}\sum_{k=1}^{K} F^k_S.$  (16)

F. Experimental Setup

Training Details. Our models were trained on a cluster of up to 8 NVIDIA L20 GPUs. To balance computational efficiency with memory constraints, we uniformly sampled 8 frames per video input during training. All experiments were conducted using BF16 precision and FlashAttention-2, with DeepSpeed ZeRO-3 employed for memory optimization. The training process consists of two stages: SFT followed by GRPO RL.

• SFT Stage: We use BF16 training with gradient checkpointing and FlashAttention. The maximum sequence length is 2048; the per-device batch size is 2 with gradient accumulation to a global batch of 16. We train for 5 epochs using a cosine schedule, learning rate 2 × 10^-4, warmup ratio 0.01, and weight decay 0.0. The multimodal configuration uses a vision tower and an audio tower, a lightweight projector, and modality start/end wrappers; the number of video frames is 8.

• GRPO RL Stage: We run BF16 training with gradient checkpointing and FlashAttention. The prompt and completion lengths are 1024/1024. We sample G = 8 completions per prompt, use a per-device batch size of 1 with gradient accumulation of 4, and train for 4.5k steps. This duration is empirically selected as the optimal convergence point; as shown in Figure 5, the evaluation scores (e.g., S1 and Mean) peak at approximately 4.5k steps and plateau thereafter. The learning rate is 1 × 10^-6 with warmup ratio 0.03. The maximum pixel budget for visual inputs is capped at 401,408.

G. Limitations

Model scale and compute. We train and evaluate HyDRA on a relatively small backbone (0.5B) and do not perform RL training on 7B-scale (or larger) MLLMs.
This choice is primarily constrained by available compute, and it may limit the absolute performance ceiling as well as the generality of our conclusions across model sizes. Larger backbones could better absorb the Propose–Verify–Decide objective, support stronger perception and longer-context reasoning, and potentially change the optimal trade-offs (e.g., between diversity and verification strictness). With access to larger-scale training resources, we plan to scale HyDRA to 7B+ models and systematically study scaling behavior under the same evaluation protocol.

Figure 4. Emotion wheels. Panels (a)–(e) show wheels w1–w5, all derived from previous research (Lian et al., 2025a).

Sensitivity to cold-start formatting and fixed hypothesis budget. HyDRA is strongly shaped by the cold-start training format and the structured rationale schema (e.g., the explicit <hypotheses> field). In our current setting, generating two hypotheses is the best-performing configuration; however, this fixed budget may not be robust when the input has higher information density, more complex situational dynamics, or richer multi-label affective mixtures. In such cases, two candidates may fail to cover the space of plausible explanations, weakening the comparative verification stage and re-introducing premature commitment. Future work will explore elastic hypothesis generation, adapting the number (and granularity) of hypotheses to the uncertainty and cue diversity of each instance, while keeping verification efficient (e.g., early stopping, hypothesis pruning, or learned controllers).

Backbone perception as an upstream bottleneck. Our contribution targets the reasoning paradigm (how the model forms and adjudicates interpretations), but OV-MER performance can be fundamentally limited by the base MLLM's perceptual competence.
If the model fails to reliably encode fine-grained visual or acoustic cues (i.e., it cannot "see/hear" them), then richer reasoning may not recover the missing evidence, and verification may become brittle or under-informed. This is especially relevant for subtle facial micro-expressions, prosody, and cross-modal timing. An important next step is to re-evaluate HyDRA on stronger multimodal backbones and perception modules to disentangle gains from improved perception versus improved abductive adjudication.

Compatibility with emerging MLLM design directions. Recent MLLM research suggests two complementary directions for improving multimodal understanding: (i) decoupling the Perception–Cognition loop into explicit perception-first pipelines (often yielding large gains in recognition), and (ii) strengthening multimodal fusion directly in latent space to improve grounding and reduce reliance on language priors. Our current implementation does not explicitly incorporate either paradigm, and therefore may not realize the best achievable synergy between perception, fusion, and abductive reasoning. In future work, we will investigate combining HyDRA with perception-first/cognition-second workflows and with latent fusion architectures, to test whether stronger base representations further amplify the benefits of Propose–Verify–Decide training and evidence-closed verification.

Figure 5. Evaluation scores across various epochs on the MER-FG.

H. Analysis of Emotional Label Cardinality Distribution

Figure 6 illustrates the distribution of emotion-word counts per sample for both ground truth (GT) and model predictions. A common risk in open-vocabulary reward design is "label dumping," where the model hacks the reward by over-predicting labels to increase recall. However, our distribution shows that while GT labels typically range between 5 and 8, model predictions remain highly concentrated at k = 2. This significant difference demonstrates that the length penalty P_ℓ effectively constrains the model, ensuring that performance gains stem from semantic precision rather than cardinality-based reward hacking. Notably, this conciseness does not compromise accuracy; as detailed in Appendix E, our hierarchical metric maps predictions into unified taxonomic levels, allowing a few high-precision labels to effectively capture the primary emotional dimensions within the GT.

Figure 6. Analysis of emotional label cardinality distribution.
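To make the interaction between the hierarchical EW metric (Appendix E) and the length penalty P_ℓ concrete, the following is a minimal, self-contained sketch. It is our own illustration, not the official evaluation code; the toy wheel mapping is invented for the example, and each element of `wheels` stands in for one composite clustering function G^{w_k}.

```python
def length_penalty(gt_words: list[str], pred_words: list[str]) -> float:
    """P_l = L / L_pre (Eq. 10): ratio of ground-truth to predicted emotion-word counts.
    Over-predicting labels (L_pre > L) drives the penalty below 1."""
    return len(gt_words) / max(len(pred_words), 1)


def ew_score(gt_sets, pred_sets, wheels):
    """EW = (1/K) sum_k F_S^k (Eqs. 13-16): average precision/recall over samples
    per wheel, then F1 per wheel, then the mean F1 across all K wheels.
    Set semantics automatically drop duplicate emotion words."""
    n = len(gt_sets)
    f1_scores = []
    for cluster in wheels:
        precision = recall = 0.0
        for gt, pred in zip(gt_sets, pred_sets):
            g = {cluster(label) for label in gt}       # G^{w_k}(Y_i)
            p = {cluster(label) for label in pred}     # G^{w_k}(Y_hat_i)
            overlap = len(g & p)
            precision += overlap / len(p) if p else 0.0
            recall += overlap / len(g) if g else 0.0
        precision, recall = precision / n, recall / n
        f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
        f1_scores.append(f1)
    return sum(f1_scores) / len(wheels)


# Toy single wheel: map fine-grained labels onto inner fundamental labels.
TOY_WHEEL = {"joyful": "happy", "cheerful": "happy", "gloomy": "sad"}
toy_cluster = lambda word: TOY_WHEEL.get(word, word)
```

Under this metric, a concise prediction like {"cheerful"} against GT {"joyful", "gloomy"} still earns full precision after clustering, which is exactly why a length-penalized model can stay at k = 2 without sacrificing the score.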
