MIND: Multi-rationale INtegrated Discriminative Reasoning
Framework for Multi-modal Large Models
Chuang Yu1,2,3,6, Jinmiao Zhao1,2,3, Mingxuan Zhao4, Yunpeng Liu2†, Xiujun Shu5,
Yuanhao Feng5, Bo Wang5, Xiangyu Yue6†
1Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences
2Shenyang Institute of Automation, Chinese Academy of Sciences
3University of Chinese Academy of Sciences
4HKUST(GZ)
5Tencent
6MMLab, The Chinese University of Hong Kong
https://github.com/YuChuang1205/MIND
Abstract
Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling and insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of "Understand → Rethink → Correct", and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence.
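The MCA objective described above, semantic aggregation of correct rationales and boundary separation of incorrect ones, can be illustrated with an InfoNCE-style contrastive loss. The sketch below is an assumption about its general form, not the paper's exact formulation; the function name `mca_loss`, its arguments, and the temperature value are illustrative:

```python
import numpy as np

def mca_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style multi-rationale contrastive loss (illustrative sketch).

    anchor:    (d,)   embedding of one correct rationale
    positives: (P, d) embeddings of other correct rationales
    negatives: (N, d) embeddings of incorrect rationales
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positives), normalize(negatives)
    pos_sim = p @ a / temperature   # (P,) cosine similarities to the anchor
    neg_sim = n @ a / temperature   # (N,)

    # Each correct rationale is pulled toward the anchor and contrasted
    # against all incorrect rationales:
    # -log( exp(pos_i) / (exp(pos_i) + sum_j exp(neg_j)) ), averaged over i.
    losses = -(pos_sim - np.log(np.exp(pos_sim) + np.exp(neg_sim).sum()))
    return losses.mean()
```

Minimizing this loss drives correct-rationale embeddings to cluster near the anchor while incorrect ones are pushed across the decision boundary, matching the aggregation/separation behavior the abstract attributes to MCA.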
1. Introduction
With the rapid advancement of multimodal large language models (MLLMs) across tasks such as visual question answering (VQA) [4, 6, 53, 67], visual reasoning [9, 54, 66], and cross-modal understanding [15, 16, 61], reasoning ability has emerged as a key indicator of a model's intelligence. To enhance this, recent research has proposed Multimodal
[Figure 1: Overview of the MIND "Understand → Rethink → Correct" process. For the question "Which type of force from the baby's hand opens the cabinet door? Options: (A) pull (B) push", Phase I generates diverse rationales (R-1 through R-N) and Phase II generates the answer. A correct input rationale (the force is directed inward toward the baby, so it is a pull) is confirmed, while an incorrect one (the force is directed away from the baby's hand, so it is a push) is flagged and corrected, revealing and correcting reasoning traps.]
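The figure's Phase II behavior, confirming a correct input rationale or flagging and correcting an incorrect one, can be mirrored as supervised training pairs. The sketch below is illustrative only: the field names and prompt/target wording are assumptions, not the paper's exact RAD templates.

```python
def build_discrimination_sample(question, options, rationale, answer,
                                is_correct, corrected_rationale=None):
    """Assemble one discriminate-and-correct training pair (illustrative)."""
    prompt = (
        f"Question: {question}\n"
        f"Options: {options}\n"
        f"Rationale: {rationale}"
    )
    if is_correct:
        # Correct branch: the model should confirm the input rationale.
        target = (f"The answer is {answer}. BECAUSE: The input Rationale "
                  f"is correct! {rationale}")
    else:
        # Incorrect branch: the model should flag the reasoning trap
        # and supply a corrected rationale.
        target = (f"The answer is {answer}. BECAUSE: The input Rationale "
                  f"is incorrect! The correct is as follows: "
                  f"{corrected_rationale}")
    return {"prompt": prompt, "target": target}
```

Building both branches for each question exposes the model to correct and incorrect rationales over the same input, which is the data-level prerequisite for the active discrimination the framework targets.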