MIND: Multi-rationale Integrated Discriminative Reasoning Framework

Reading time: 5 minutes

📝 Original Info

  • Title: MIND: Multi-rationale Integrated Discriminative Reasoning Framework
  • ArXiv ID: 2512.05530
  • Date: Pending
  • Authors: Chuang Yu¹²³⁶, Jinmiao Zhao¹²³, Mingxuan Zhao⁴, Yunpeng Liu²†, Xiujun Shu⁵, Yuanhao Feng⁵, Bo Wang⁵, Xiangyu Yue⁶†
  • Affiliations: 1. Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences 2. Shenyang Institute of Automation, Chinese Academy of Sciences 3. University of Chinese Academy of Sciences 4. HKUST (Guangzhou) 5. Tencent 6. MMLab, The Chinese University of Hong Kong

📝 Abstract

Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of "Understand → Rethink → Correct", and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence.
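The abstract only names the Multi-rationale Contrastive Alignment (MCA) objective; its exact formulation is not given in this excerpt. Below is a minimal sketch of what such a contrastive alignment could look like, assuming each rationale has already been encoded into an embedding and labeled as correct or incorrect. The function name, the InfoNCE-style form, and the temperature value are illustrative assumptions, not the authors' formula.

```python
import torch
import torch.nn.functional as F


def mca_style_loss(anchor, correct_embs, incorrect_embs, temperature=0.07):
    """Illustrative contrastive alignment over rationale embeddings (assumed form).

    anchor:         (d,)   embedding of a reference (e.g. gold) rationale
    correct_embs:   (P, d) embeddings of correct rationales for the same question
    incorrect_embs: (N, d) embeddings of incorrect / misleading rationales

    Correct rationales are pulled toward the anchor (semantic aggregation);
    incorrect ones appear only as negatives and are pushed away
    (boundary separation).
    """
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(correct_embs, dim=-1)
    neg = F.normalize(incorrect_embs, dim=-1)

    pos_sim = (pos @ a) / temperature          # (P,) similarities to the anchor
    neg_sim = (neg @ a) / temperature          # (N,)

    # For every correct rationale, contrast it against all incorrect ones.
    logits = torch.cat(
        [pos_sim.unsqueeze(1), neg_sim.expand(pos_sim.size(0), -1)], dim=1
    )                                          # (P, 1 + N)
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long)  # index 0 = positive
    return F.cross_entropy(logits, labels)


# Toy usage with random 768-d embeddings: 3 correct and 2 incorrect rationales.
loss = mca_style_loss(torch.randn(768), torch.randn(3, 768), torch.randn(2, 768))
```

Under this reading, the objective tightens the cluster of correct rationales around a shared anchor while enlarging the margin to misleading ones, which is what "semantic aggregation" and "boundary separation" describe.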

💡 Deep Analysis

Figure 1

📄 Full Content

MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

Chuang Yu¹²³⁶, Jinmiao Zhao¹²³, Mingxuan Zhao⁴, Yunpeng Liu²†, Xiujun Shu⁵, Yuanhao Feng⁵, Bo Wang⁵, Xiangyu Yue⁶†

¹Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences ²Shenyang Institute of Automation, Chinese Academy of Sciences ³University of Chinese Academy of Sciences ⁴HKUST(GZ) ⁵Tencent ⁶MMLab, The Chinese University of Hong Kong

https://github.com/YuChuang1205/MIND

Abstract

Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of "Understand → Rethink → Correct", and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence.

1. Introduction

With the rapid advancement of multimodal large language models (MLLMs) across tasks such as visual question answering (VQA) [4, 6, 53, 67], visual reasoning [9, 54, 66], and cross-modal understanding [15, 16, 61], reasoning ability has emerged as a key indicator of a model's intelligence. To enhance this, recent research has proposed Multimodal

[Figure 1 (text from the figure): a conventional two-phase pipeline (Phase I: rationale generation, Phase II: answer generation) produces multiple candidate rationales (R-1 … R-N) for the example question "Which type of force from the baby's hand opens the cabinet door? Options: (A) pull (B) push". MIND instead follows "Understand → Rethink → Correct": given a correct rationale, it outputs "The answer is (A). BECAUSE: The input Rationale is correct! …"; given a misleading rationale ("The direction of this force is away from the baby's hand. This force is a push."), it outputs "The answer is (A). BECAUSE: The input Rationale is incorrect! The correct is as follows: … The door swings open toward the child, which is typical of a pulling force.", revealing and correcting reasoning traps.]
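The output strings in Figure 1 suggest how the second, discriminative stage of P2CL could be supervised: the model is shown a candidate rationale and must either endorse it or reject and rewrite it before committing to an answer. The sketch below builds such (prompt, target) pairs from the cabinet-door example; the prompt template and the function name are assumptions made for illustration, while the target format ("The answer is (A). BECAUSE: …") follows the figure text.

```python
def build_discrimination_sample(question, options, candidate_rationale,
                                is_correct, answer_letter,
                                corrected_rationale=None):
    """Assemble one (prompt, target) pair for a rationale-discrimination stage.

    The model must judge the given rationale and, if it is flawed, produce a
    corrected one before stating the answer. Template is an assumed example,
    not the authors' exact prompt.
    """
    prompt = (
        f"Question: {question}\n"
        f"Options: {options}\n"
        f"Rationale: {candidate_rationale}\n"
        "Judge the rationale, correct it if needed, then answer."
    )
    if is_correct:
        target = (f"The answer is ({answer_letter}). "
                  f"BECAUSE: The input Rationale is correct! {candidate_rationale}")
    else:
        target = (f"The answer is ({answer_letter}). "
                  f"BECAUSE: The input Rationale is incorrect! "
                  f"The correct is as follows: {corrected_rationale}")
    return prompt, target


# The cabinet-door example from Figure 1, with a deliberately misleading rationale.
prompt, target = build_discrimination_sample(
    question="Which type of force from the baby's hand opens the cabinet door?",
    options="(A) pull (B) push",
    candidate_rationale="The direction of this force is away from the baby's hand. "
                        "This force is a push.",
    is_correct=False,
    answer_letter="A",
    corrected_rationale="The door swings open toward the child, which is typical "
                        "of a pulling force.",
)
```

Pairs like these, generated at scale over the augmented rationales from RAD, would give the model explicit supervision for rejecting reasoning traps rather than imitating whatever rationale it is handed.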

Reference

This content is AI-processed based on open access ArXiv data.
