MIND: Multi-rationale Integrated Discriminative Reasoning Framework
📝 Abstract
Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of “Understand → Rethink → Correct” and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy: the first stage enhances multi-rationale positive learning, while the second stage enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios, and it provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence.
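The abstract describes MCA as pulling correct-rationale representations together while pushing incorrect ones past a boundary. The paper's actual loss formulation is not given in this excerpt, so the margin-based cosine objective below is only an illustrative sketch of that idea; the function name `mca_contrastive_loss`, the margin value, and the toy embeddings are all assumptions, not the authors' implementation.

```python
import numpy as np

def mca_contrastive_loss(anchor, positives, negatives, margin=0.5):
    """Illustrative (assumed) multi-rationale contrastive objective:
    aggregate correct-rationale embeddings around the anchor and push
    incorrect-rationale embeddings below a similarity margin."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Semantic aggregation: correct rationales should be close to the anchor.
    pos_term = np.mean([1.0 - cos(anchor, p) for p in positives])
    # Boundary separation: penalize incorrect rationales whose similarity
    # to the anchor exceeds the margin.
    neg_term = np.mean([max(0.0, cos(anchor, n) - margin) for n in negatives])
    return pos_term + neg_term

# Toy check with synthetic 8-d embeddings: positives perturb the anchor,
# negatives are unrelated random vectors.
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positives = [anchor + 0.05 * rng.normal(size=8) for _ in range(3)]
negatives = [rng.normal(size=8) for _ in range(3)]
print(mca_contrastive_loss(anchor, positives, negatives))
```

With aligned positives and anti-aligned negatives the loss vanishes, which is the intended behavior of any such objective: zero penalty once correct rationales cluster and incorrect ones sit beyond the boundary.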
📄 Content
MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

Chuang Yu1,2,3,6, Jinmiao Zhao1,2,3, Mingxuan Zhao4, Yunpeng Liu2†, Xiujun Shu5, Yuanhao Feng5, Bo Wang5, Xiangyu Yue6†
1Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences; 2Shenyang Institute of Automation, Chinese Academy of Sciences; 3University of Chinese Academy of Sciences; 4HKUST(GZ); 5Tencent; 6MMLab, The Chinese University of Hong Kong
https://github.com/YuChuang1205/MIND
- Introduction
With the rapid advancement of multimodal large language models (MLLMs) across tasks such as visual question answering (VQA) [4, 6, 53, 67], visual reasoning [9, 54, 66], and cross-modal understanding [15, 16, 61], reasoning ability has emerged as a key indicator of a model’s intelligence. To enhance this, recent research has proposed Multimodal

[Figure 1: Revealing and correcting reasoning traps with the “Understand → Rethink → Correct” paradigm. For the question “Which type of force from the baby’s hand opens the cabinet door? Options: (A) pull (B) push”, Phase I (rationale generation) produces diverse rationales R-1 … R-N, each noting that the force is directed inward toward the baby and therefore constitutes a pull. Phase II (answer generation) outputs “The answer is (A)” while judging the supplied rationale: a correct rationale (inward force → pull) is confirmed, whereas a misleading one (force directed away from the hand → push) is flagged as incorrect and corrected.]
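The figure's two-phase flow (Phase I: generate diverse rationales; Phase II: generate the answer while judging each rationale) can be outlined as a minimal driver. This is a sketch only: `generate` is a hypothetical stand-in for an MLLM call (in practice it would receive the image and text), and the prompt formats and function names are assumptions, not the paper's interface.

```python
def generate(prompt):
    """Hypothetical stand-in for an MLLM call; returns canned text here.
    A real system would query a multimodal model with image + prompt."""
    if "Rationale" in prompt:
        return "The force is directed inward toward the baby, so it is a pull."
    return "The answer is (A). BECAUSE: The input Rationale is correct!"

def two_phase_reasoning(question, options, n_rationales=3):
    # Phase I: generate N diverse rationales for the question.
    rationales = [
        generate(f"Rationale {i}: {question} Options: {options}")
        for i in range(n_rationales)
    ]
    # Phase II: condition answer generation on each rationale; the model
    # is expected to discriminate (and correct) misleading rationales.
    answers = [
        generate(f"{question} Options: {options} Given: {r}")
        for r in rationales
    ]
    return rationales, answers

q = "Which type of force from the baby's hand opens the cabinet door?"
rationales, answers = two_phase_reasoning(q, "(A) pull (B) push")
print(answers[0])
```

The point of the structure is that the answer phase sees an explicit rationale rather than only the question, which is what makes active discrimination (confirming or correcting the rationale) possible at all.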