MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

Reading time: 29 minutes

📝 Original Info

  • Title: MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
  • ArXiv ID: 2511.23055
  • Date: 2025-11-28
  • Authors: Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

📝 Abstract

… To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making, and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages …

📄 Full Content

Understanding human mental states is a prerequisite for genuine human-agent collaboration. Unlike conventional embodied systems that merely execute explicit commands, next-generation agents must reason about what humans believe, desire, and intend, and act proactively on that understanding [14]. This requires an explicit mental reasoning mechanism based on the human Theory of Mind (ToM) [13,23,33], which can be formalized by the Belief-Desire-Intention (BDI) framework [36]. In BDI, humans perceive the world and others' behaviors, form beliefs about the environment, derive desires that encode goals, and generate intentions that guide actions, reflecting ToM Reasoning. This raises the question: Can embodied agents reason and act in a similar ToM-consistent manner?

We formalize this cognitive process into three progressive levels of embodied intelligence. (1) Perception: understanding human behaviors and environmental contexts via vision-language reasoning. (2) Mental Reasoning: inferring human beliefs, desires, and intentions, as demonstrated in first- and second-order ToM Reasoning tasks. (3) Decision Making and Action: reasoning about one’s own beliefs and intentions to make autonomous, goal-directed decisions and provide proactive assistance. This three-level hierarchy bridges perception and intention, paving the way toward truly collaborative human-AI interaction.

Despite rapid progress in Vision-Language Models (VLMs), a fundamental gap remains in embodied intelligence. As shown in the bottom right of Fig. 1, current VLMs such as Gemini [7], GPT [1], and Qwen-VL [2] excel at perception but remain largely reactive. They can describe what they see, yet fail to reason about what humans believe, desire, or intend. Existing Theory-of-Mind (ToM) benchmarks [18,32,40] have endowed VLMs with certain mental reasoning abilities, but they are limited to reasoning about the mental states of humans appearing in the video. They do not build ToM Reasoning from their own perspective, which prevents VLMs from learning to make decisions and generate actions.

We address this gap through a Robot-Centric Perspective, which enables VLMs to reason simultaneously about their own mental states and those of humans, forming a continuous and interpretable ToM Reasoning loop. Inspired by frameworks such as LLaVA-CoT [46] and Visual-RFT [29], we further design the Robot-Centric MindPower Reasoning Hierarchy, which connects ToM Reasoning with decision making and action generation. It structures reasoning into three levels and six layers: from Perception, through Belief, Desire, and Intention (Mental Reasoning), to Decision and Action (Decision Making and Action).

To realize this goal, we introduce the MindPower Benchmark. An overview of our benchmark, reasoning hierarchy, and experiments is shown in Fig. 1. MindPower comprises two core embodied reasoning tasks: (1) False-Belief Correction, which examines whether an embodied agent can detect and resolve a human’s mistaken belief; and (2) Implicit Goal Inference & Completion, which tests whether the agent can infer a hidden goal and assist in achieving it. We construct 590 scenarios across two interactive home-environment simulators, each containing multimodal observations and object-manipulation activities that reflect everyday embodied reasoning challenges.

Further, to enhance reasoning consistency across these layers, we propose Mind-Reward, a reinforcement-based optimization framework that aligns intermediate ToM states with final actions, promoting Robot-Centric, continuous reasoning.

Our contributions are threefold:

• Robot-Centric Perception Benchmark for mental-state-grounded action. MindPower links mental reasoning with embodied action through two tasks, False-Belief Correction and Implicit Goal Inference & Completion, across 590 interactive home scenarios, evaluating agents’ ability to infer, make decisions, and assist.
• Unified MindPower Reasoning Hierarchy bridging perception and action. The MindPower Reasoning Hierarchy structures reasoning across three levels and six layers, providing a standardized way to evaluate how perception leads to action.
• Reinforcement optimization for consistent ToM Reasoning. Mind-Reward aligns intermediate reasoning states with final actions, promoting coherent Robot-Centric reasoning. With this optimization, our model surpasses GPT-4o by 12.77% in decision accuracy and 12.49% in action generation.

Theory of Mind Benchmark. Early ToM benchmarks relied on narrative text to infer beliefs, desires, and intentions [6, 16, 19-22, 44, 45, 48], but lacked multimodal grounding. Subsequent multimodal benchmarks introduced videos or images depicting story-based social scenarios to support richer mental-state inference [10, 11, 18, 26, 32, 40, 41, 53]. However, most adopt multiple-choice or short-answer formats and focus on role-level or factual queries, offering limited support for open-ended, real-world reasoning where agents must update beliefs and act continuously.

Although datasets such as MuMA-ToM [40] and MMToM-QA [18] explore false-belief understanding or implicit goal inference, they still do not support dynamic reasoning processes that involve belief correction, assistance-oriented behavior, or proactive decision-making, which are essential for autonomous embodied agents.

VLMs-based Embodied Agents. Embodied agents have been developed to perform tasks autonomously by decomposing complex goals into multiple subtasks and executing them step by step [5,25,30,38,51]. For example, PaLM-E [9] demonstrates that large embodied models can perform high-level task planning by integrating visual and linguistic cues. Some benchmarks further support multi-agent collaboration, enabling agents to observe each other or even human partners to coordinate goals and actions [8,17,35,43,50]. For example, RoboBench [30] allows agents to decompose high-level goals into subgoals for sequential execution, while Smart-Help [4] focuses on achieving comfortable human-robot interaction by balancing human comfort and task efficiency. However, these systems still depend on predefined goals or imitation signals and lack self-perspective mental reasoning. As highlighted in Mindblindness [3], social intelligence requires inferring others’ mental states and acting upon those inferences, a capability missing from current embodied benchmarks. They do not evaluate first- or second-order belief reasoning, which is crucial for autonomous and socially grounded decision-making. Even robotic setups that incorporate hidden-belief modeling, such as AToM-Bot [8], cover only narrow goal spaces and provide limited task diversity, falling short of comprehensive ToM evaluation.

The Theory of Mind (ToM) framework [36] models human decision-making through a Belief-Desire-Intention hierarchy: individuals form desires from their beliefs and commit to intentions that drive actions. Building on this cognitive structure, we introduce the MindPower Benchmark, which includes a unified reasoning hierarchy (i.e., the MindPower Reasoning Hierarchy), the curated MindPower dataset, and comprehensive evaluation metrics. Specifically, as shown in Fig. 2, the MindPower Reasoning Hierarchy extends the embodied decision-making process into six layers organized across three levels, each reflecting how an embodied agent perceives, reasons, and acts within its environment.

Level-1: Perception. The agent observes the environment through vision or other sensory inputs. This step answers “What is happening now?”

Level-2: Mental Reasoning. The agent infers beliefs, desires, and intentions, reasoning about both its own mental states and those of the human.

Level-3: Decision Making and Action. Guided by the inferred mental states, the agent makes autonomous, goal-directed decisions and generates the actions needed to assist proactively.
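To make the six-layer structure concrete, here is a minimal sketch of how one annotated sample could be represented; the class and field names are illustrative assumptions, not the benchmark’s released schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MindPowerSample:
    """Illustrative container for one annotated scenario (hypothetical field names)."""
    # Level-1: Perception
    perception: str      # what the agent observes in the video
    # Level-2: Mental Reasoning (Belief-Desire-Intention)
    belief: str          # inferred beliefs of the human and of the agent itself
    desire: str          # inferred desires / goals
    intention: str       # committed intentions
    # Level-3: Decision Making and Action
    decision: str        # high-level decision in natural language
    action: List[str]    # atomic action sequence executed by the embodied agent

sample = MindPowerSample(
    perception="Alice puts the apple on the table; David moves it to the refrigerator; Alice returns and searches.",
    belief="Alice believes the apple is on the table, but I believe it is actually in the refrigerator.",
    desire="I want to resolve the mismatch between Alice's belief and the real world state.",
    intention="I intend to take the apple out of the fridge and hand it to Alice.",
    decision="Retrieve the apple from the refrigerator and give it to Alice.",
    action=["walk (fridge)", "open (fridge)", "grab (apple)", "walk (Alice)", "give (apple, Alice)"],
)
```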

Based on the proposed MindPower Reasoning Hierarchy, we build the MindPower Dataset. Dataset Collection Principles. We construct the dataset based on three principles: (1) Realism: scenarios and events should be plausible in the real world.

(2) BDI Consistency: each sample preserves a coherent BDI hierarchy with logically consistent intermediate states. (3) Diversity under simulator constraints: within simulator constraints, we include varied scenes, roles, and goals to ensure diversity while maintaining feasible simulation and annotation.

Task Design. We define two core task types for constructing the MindPower dataset: False-Belief Correction and Implicit Goal Inference & Completion. The former evaluates whether an embodied agent can detect and correct a human’s mistaken belief about the environment (e.g., misjudged object locations). The latter tests the agent’s ability to infer unstated intentions from subtle behavioral cues, such as searching or repeated failed attempts. For example, when the human starts rummaging through drawers or walking around to search after completing several actions, the agent should reason that the human is looking for a specific target object. We further incorporate special-needs scenarios (e.g., wheelchair users or children), enabling evaluation of assistive behaviors under mobility and reach constraints. Different from ToM benchmarks such as MuMA-ToM [40] and MMToM-QA [18], our task explicitly models the moment when belief contradictions arise. As shown in Fig. 3, we introduce a searching event (e.g., “Alice comes back and walks around”), enabling the agent to perceive both intention and false belief. Furthermore, our Level-2 Mental Reasoning is Robot-Centric, requiring inference of both the agent’s and the human’s mental states, whereas existing benchmarks adopt a role-centric design that infers only the human’s reasoning via multiple-choice questions.

Construction Pipeline. We use VirtualHome [34] and ThreeDWorld [15] to simulate realistic household environments. The pipeline consists of three stages: (1) Story Construction. We generate initial story scripts using GPT-4o [1] based on room type, character setup, goals, and involved objects, followed by manual filtering by five annotators to remove implausible scenarios. (2) Multimodal Data Collection. Each script is reenacted in the simulators to collect video data. Ensuring strict adherence to scripts, each sample takes 25-35 minutes in VirtualHome and 50-70 minutes in ThreeDWorld, yielding 590 examples in total. (3) MindPower Reasoning Hierarchy Annotation. Five trained annotators label all six layers of the MindPower Reasoning Hierarchy for every sample.

Task Diversity. Our benchmark incorporates 2 simulators, 8 home layouts, and 16 humanoid agents representing different age groups, genders, and mobility conditions, including children, adults, and wheelchair users.

Robot-Centric Perspective. As summarized in Tab. 1 and Fig. 3, existing ToM benchmarks such as MuMA-ToM [40] and MMToM-QA [18] primarily assess the understanding of beliefs or intentions in narrative settings, typically through role-centric, multiple-choice questions. In contrast, our MindPower Benchmark bridges these gaps by integrating explicit Mental Reasoning with autonomous decision making and action generation. It enables reasoning from the agent’s own perspective and adopts an Open-Ended format that jointly evaluates False-Belief Correction and Implicit Goal Inference & Completion. We will introduce the proposed evaluation metrics in Sec. 5.

We split the dataset into training and testing sets with an 8:2 ratio and evaluated human participants as well as both open-source and closed-source Vision-Language Models (VLMs). The detailed results are presented in Tab. 2. We summarize our main findings as follows:

(1) Human participants achieved the highest scores, clearly outperforming all VLMs. Specifically, 9 trained participants were asked to watch the collected videos and provide BDI reasoning processes, followed by corresponding decisions and actions. As shown in the bottom right of Fig. 1, humans surpass all VLMs.

(2) Closed-source VLMs showed superior results in Perception, Mental Reasoning, and Decision Making and Action, with Gemini-2.5 Pro and GPT-4o achieving the highest scores. As shown in Tab. 2, among open-source VLMs, those with reasoning abilities, such as Video-R1 [12] and VideoChat-R1 [27], performed the best.

(3) The MindPower Reasoning Hierarchy substantially improves decision and action accuracy (Level-3).

To further validate the effectiveness of the MindPower Benchmark and the proposed MindPower Reasoning Hierarchy, we conducted ablation studies by removing Level-1 and Level-2 and instructing models to directly output Decision and Action results. We evaluated this setup on GPT-4o [1]. As shown in Fig. 4a, removing the MindPower Reasoning Hierarchy led to a clear performance degradation: GPT-4o’s decision-making accuracy dropped by 1.24%, while action generation accuracy decreased from 2.91% to 0.82%, demonstrating that the MindPower Reasoning Hierarchy is crucial for improving both the quality and consistency of decision and action outputs. Moreover, when using standard step-by-step reasoning instead of the MindPower Reasoning Hierarchy, performance degrades substantially: decision accuracy falls by 4.89%, and action accuracy decreases from 2.91% to 0.90%. These results indicate that the MindPower Reasoning Hierarchy significantly improves the accuracy of both decision-making and action generation compared with standard reasoning.

(4) VLMs, especially open-source models, lack a Robot-Centric Perspective. At the Perception level, VLMs often provide general video descriptions about clothing or the environment instead of focusing on individual actions. They overlook crucial details such as movements, directions, and appearance sequences, which are essential for inferring implicit goals or detecting false beliefs. At higher reasoning layers, they are easily biased by the environment rather than reasoning from a Robot-Centric Perspective over both human and robot mental states. For example, in a kitchen scene, a model may predict cleaning kitchenware, while the person is actually searching for an item that someone else has taken. In a bedroom, it may assume tidying the bed, even though the person is only retrieving something from it. Overall, from Perception to Mental Reasoning, VLMs fail to adopt a Robot-Centric Perspective and to reason from specific actions or contradictions. Instead, they rely on coarse and stereotypical descriptions. As shown in Fig. 4b, we use GPT-4o to evaluate whether VLMs consider individual actions and contradictions in human behavior. The results show that open-source VLMs still exhibit a substantial gap compared with human reasoning.

After data collection, we propose our method to let VLMs learn to act from ToM Reasoning. Our method is guided by two core principles:

• BDI Consistency. The reasoning hierarchy, from Perception through Belief, Desire, and Intention to Decision and Action, should remain logically consistent across all layers.

• Robot-Centric Optimality. The agent must reason and act from its own embodied perspective. During the Mental Reasoning level, it simultaneously infers its own beliefs and performs second-order reasoning about the human’s beliefs, maintaining correct perspective separation.

Following this design, we adopt a two-stage training paradigm similar to Visual-RFT [29] and DeepSeekMath [39]. Specifically, we first perform Supervised Fine-Tuning (SFT) to establish base reasoning alignment, followed by Group Relative Policy Optimization (GRPO) using our proposed reward, combining Mind-Reward and Format-Reward, to enhance BDI consistency and Robot-Centric optimality.

Mind-Reward. In the GRPO stage, we introduce Mind-Reward R_Mind to further optimize the SFT model. The MindPower Reasoning Hierarchy is continuous and requires maintaining consistency across all reasoning levels and layers. Moreover, across different reasoning layers, such as the perception of events and the inference of beliefs about embodied agents and humans, there exist inherent temporal and logical dependencies that must be preserved. We represent each reasoning layer as a sequence of atomic actions, denoted as action (agent, object), where agent refers to the owner of the action or mental state, and object denotes the target entity or mental content being acted upon. Since layers in the Mental Reasoning level involve distinct cognitive and physical reasoning patterns, we construct a unified atomic-action table that encompasses both categories. Both the ground-truth and generated outputs are then converted into structured atomic action sequences by an LLM (Qwen3-Max [47]) during the GRPO training process, and these extracted atomic actions are subsequently used for reward computation.

Mind-Reward evaluates reasoning quality from three complementary aspects: (1) Atomic Accuracy: measured by ROUGE-1, it quantifies the proportion of correctly matched atomic actions, each tagged with a perspective attribute (human or embodied agent) to ensure Robot-Centric Perspective alignment; (2) Local Consistency: measured by ROUGE-2 between adjacent atomic pairs to assess short-range reasoning coherence; (3) Global Consistency: measured by ROUGE-L (longest common subsequence) to evaluate the overall reasoning alignment across the reasoning process.

The final reward is a weighted sum of these components, R_Mind = α_1 · R_1 + α_2 · R_2 + α_3 · R_L, where R_1, R_2, and R_L denote the ROUGE-1, ROUGE-2, and ROUGE-L scores corresponding to Atomic Accuracy, Local Consistency, and Global Consistency, and α_1, α_2, α_3 are weighting coefficients.
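A minimal sketch of how this weighted sum over extracted atomic-action sequences could be computed (the LLM-based extraction step is omitted; the helper names and the exact n-gram matching are assumptions, with the weights taken from the implementation details reported later):

```python
def rouge_n(pred, ref, n):
    """Recall-style n-gram overlap between two atomic-action sequences (lists of strings)."""
    pred_ngrams = [tuple(pred[i:i + n]) for i in range(len(pred) - n + 1)]
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    if not ref_ngrams:
        return 0.0
    return sum(1 for g in ref_ngrams if g in pred_ngrams) / len(ref_ngrams)

def lcs_len(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def mind_reward(pred_actions, gt_actions, alpha=(0.2, 0.3, 0.5)):
    """R_Mind = a1*ROUGE-1 + a2*ROUGE-2 + a3*ROUGE-L over atomic actions such as 'pick (human, cup)'."""
    r1 = rouge_n(pred_actions, gt_actions, 1)                         # atomic accuracy
    r2 = rouge_n(pred_actions, gt_actions, 2)                         # local consistency
    rl = lcs_len(pred_actions, gt_actions) / max(len(gt_actions), 1)  # global consistency
    return alpha[0] * r1 + alpha[1] * r2 + alpha[2] * rl
```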

This reward formulation explicitly enforces both ToM consistency and Robot-Centric Perspective throughout GRPO.

Format-Reward. Format-Reward R_Format is computed by performing a sequential regular expression match over the six reasoning layers: Perception, Belief, Desire, Intention, Decision, and Action. If all layers appear in the correct order, the reward is set to 1; otherwise, it is 0.

Overall Reward. As shown in Fig. 5, the final reward R used in GRPO combines the proposed Mind-Reward R_Mind and Format-Reward R_Format.
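A sketch of the sequential format check described above; the concrete layer tags in the model output are not specified here, so the <layer> markers and the simple additive combination with Mind-Reward are assumptions used only for illustration:

```python
import re

# Assumed layer order from the MindPower Reasoning Hierarchy; the tag syntax is hypothetical.
LAYERS = ["perception", "belief", "desire", "intention", "decision", "action"]

def format_reward(output: str) -> float:
    """Return 1.0 if all six reasoning layers appear in the correct order, else 0.0."""
    pattern = r"[\s\S]*?".join(rf"<{layer}>[\s\S]*?</{layer}>" for layer in LAYERS)
    return 1.0 if re.search(pattern, output) else 0.0

def overall_reward(r_mind: float, output: str) -> float:
    # The exact combination of R_Mind and R_Format is shown in Fig. 5 of the paper;
    # a plain sum is used here purely as a placeholder assumption.
    return r_mind + format_reward(output)
```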

The advantage A_i is then computed within each group as:

A_i = (R_i − mean({R_1, …, R_G})) / std({R_1, …, R_G}),

where R_i is the reward for the i-th response and the statistics are computed over the group of G sampled responses.
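A small sketch of this group-relative normalization over one group of sampled responses (the example reward values are made up):

```python
import statistics

def group_advantages(rewards):
    """Normalize each reward by the mean and standard deviation of its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-8  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# One group of G = 8 responses, each scored by the combined Mind- and Format-Reward.
rewards = [0.62, 0.45, 0.71, 0.30, 0.58, 0.55, 0.49, 0.66]
advantages = group_advantages(rewards)
```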

Optimization. We adopt the GRPO algorithm proposed in DeepSeekMath [39] to optimize the model. GRPO samples a group of outputs {o_1, o_2, …, o_G} from the old policy π_θold and updates the policy π_θ by maximizing the following objective:
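The objective itself is not reproduced in the text above; for reference, the standard clipped GRPO surrogate from DeepSeekMath [39], which the method adopts, has the form below (a reference sketch, not a quotation from this paper):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)=
\mathbb{E}_{\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}
\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\big)
-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\big]\Big)\Bigg],
\qquad
r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}
```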

Experiments

We design evaluation metrics to assess the model’s performance across three levels, corresponding to the full reasoning hierarchy from Perception to Decision Making and Action.

Level-1: Perception. The perception module outputs textual descriptions (captions). We evaluate these outputs using BERTScore [52] and Sentence Transformer [37] similarity, which measure the semantic alignment between the generated captions and the ground-truth descriptions.

Level-2: Mental Reasoning. We similarly evaluate the reasoning outputs using BERTScore and Sentence Transformer similarity, measuring semantic consistency across the three components of Belief, Desire, and Intention.
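A minimal sketch of computing these two similarity metrics with the commonly used bert-score and sentence-transformers packages (the checkpoint choice is illustrative, not necessarily the one used by the authors):

```python
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

def semantic_scores(generated: list[str], references: list[str]):
    """Return per-sample BERTScore F1 and Sentence-Transformer cosine similarity."""
    # BERTScore: token-level semantic alignment between candidate and reference texts.
    _, _, f1 = bert_score(generated, references, lang="en")

    # Sentence-Transformer similarity: cosine similarity between sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    cosine = util.cos_sim(gen_emb, ref_emb).diagonal()

    return f1.tolist(), cosine.tolist()
```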

Level-3: Decision Making and Action. The decision stage generates textual outputs, which are evaluated using the same BERTScore and Sentence Transformer similarity metrics. The action stage produces sequences of atomic actions, which are evaluated using two additional metrics: Success Rate (SR) and Action Correctness (AC). These metrics assess both the overall correctness of the action sequence and the accuracy of each atomic action, represented in the form action (object). The SR score combines multiple ROUGE components into a single score, where R_1, R_2, and R_L denote the ROUGE-1, ROUGE-2, and ROUGE-L scores, respectively. The AC score measures how accurately the generated action sequence A* matches the ground-truth sequence Â, and is computed as:

AC = |A* ∩ Â| / |Â|,

where |A* ∩ Â| denotes the number of atomic actions in A* that correctly match the ground-truth sequence Â, and |Â| is the total number of actions in the ground-truth sequence.
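A sketch of the AC computation as defined above; the exact SR weighting over ROUGE components is not spelled out in the text, so only AC is implemented here:

```python
from collections import Counter

def action_correctness(pred: list[str], gt: list[str]) -> float:
    """AC = |A* ∩ Â| / |Â|: the fraction of ground-truth atomic actions recovered by the prediction."""
    if not gt:
        return 0.0
    pred_counts = Counter(pred)
    matched = sum(min(count, pred_counts[action]) for action, count in Counter(gt).items())
    return matched / len(gt)

# Example with the action (object) form used in the paper.
gt = ["walk (kitchen_counter)", "pick (cup)", "walk (owner)", "give (cup, owner)"]
pred = ["walk (kitchen_counter)", "pick (cup)", "give (cup, owner)"]
print(action_correctness(pred, gt))  # 0.75
```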

BDI and Perspective Consistency. We use GPT-4o to evaluate the BDI consistency and perspective of the generated outputs. The generated content across all reasoning layers is assessed by GPT-4o based on three criteria: (1) whether each reasoning layer logically follows from the previous one without contradictions, (2) whether the overall reasoning is complete and precise, and (3) whether the reasoning genuinely adopts the robot’s perspective and effectively assists the human characters in the story.

We randomly split the dataset into training and testing sets with an 8:2 ratio. We used Qwen2.5-VL-7B-Instruct as the base model. We extracted 32 frames from each video and concatenated them for training. We used 5 training epochs for SFT and 400 iterations for GRPO. The number of generations was set to 8, and training was done on a single H800 GPU. We set α_1 = 0.2, α_2 = 0.3, and α_3 = 0.5.
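For reference, this setup maps onto a small configuration sketch; the key names below are illustrative assumptions, while the values are those stated in the text:

```python
# Illustrative training configuration (key names are hypothetical; values follow the paper).
config = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "video_frames": 32,                  # frames extracted and concatenated per video
    "split": {"train": 0.8, "test": 0.2},
    "sft": {"epochs": 5},
    "grpo": {
        "iterations": 400,
        "num_generations": 8,            # group size G
        "reward_weights": {"alpha1": 0.2, "alpha2": 0.3, "alpha3": 0.5},
    },
    "hardware": "1x NVIDIA H800 GPU",
}
```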

We evaluated several closed-source baselines, including Gemini-2.5 Pro [7], Gemini-2.5 Flash [7], and GPT-4o [1].

Since GPT-4o does not accept raw video input, we uniformly sampled an average of 64 frames as its input.

For open-source baselines, we tested Qwen2.5-VL-7B-Instruct [2], InternVL3.5-8B [42], Video-LLaVA3 [28], Video-ChatGPT [31], Video-R1 [12], VideoChat-R1 [27], and LLaVA-OV-8B [24]. Without SFT, we find that, compared with the initial Qwen2.5-VL-7B-Instruct, although there is some improvement in decision and action accuracy, the overall performance remains suboptimal.

(Figure residue: qualitative outputs from our model, GPT-4o, and Qwen2.5-VL-7B-Instruct for a scenario in which a wheelchair user cannot reach a wine glass on the kitchen counter; a fuller version of this comparison appears in the qualitative analysis below.)

We illustrate the differences between GPT-4o and Qwen2.5-VL-7B-Instruct using a scenario where a man in a wheelchair attempts to reach a distant cup (Fig. 3). Both models are easily swayed by environmental cues: GPT-4o hallucinates a refrigerator-opening action, whereas Qwen2.5-VL-7B-Instruct infers hunger. As discussed in Sec. 3.4, these failures arise from the lack of Robot-Centric Perception. In contrast, our model infers the human’s inability to reach the cup and performs second-order reasoning by clearly separating perspectives.

In this work, we introduce the MindPower Benchmark, which incorporates the Robot-Centric MindPower Reasoning Hierarchy with three levels and six layers for modeling ToM Reasoning. The benchmark includes the MindPower Dataset with two tasks, False-Belief Correction and Implicit Goal Inference & Completion, together with evaluation metrics for assessing whether VLM-based embodied agents can perform decision making and action generation grounded in ToM Reasoning. Finally, we evaluate a variety of VLMs on our benchmark and propose a Mind-Reward mechanism that achieves the best overall performance.

In future work, we will extend the MindPower Reasoning Hierarchy to human-robot collaboration and multi-agent coordination, and deploy our model on real robots to assess its performance in practical settings.

Considering the space limitations of the main paper, we provide additional results and discussions in this appendix. The appendix is organized to first clarify the key concepts used throughout the paper, followed by detailed descriptions of our dataset collection and annotation process, comparisons with other benchmarks, and the prompts used in Sec. 3.4. We then describe how textual instructions are converted into atomic action sequences in the Mind-Reward framework. Next, we present additional experimental results, including evaluation metrics and task-specific experiments. We further discuss potential extensions of our dataset, such as multi-view extension and its connection to low-level execution models. Finally, we summarize the limitations of the current benchmark and future directions for improvement. The full benchmark will be publicly released to encourage future research.

Theory of Mind (ToM). Theory of Mind (ToM) [33,36] is the cognitive ability to infer others’ mental states such as beliefs, desires, and intentions, and to use these inferences to predict and guide actions.

For the False-Belief Correction task, as illustrated in Fig. 7, we follow a taxonomy-driven approach. We first categorize scenarios based on the mapping between VirtualHome [34] and ThreeDWorld [15] environments and the typical object distributions in each room (e.g., kitchen, living room). We then determine the number of humans involved in each scene. To cover different numbers of humanoid agents and different target (final) humanoid agents, we design three distinct prompt templates for GPT-4o to generate story scripts. When issuing each request, we iterate over a predefined list of objects along with their corresponding start and end locations. The prompts are shown in Fig. 16.

For the Implicit Goal Inference & Completion task, we design four types of scenarios to comprehensively evaluate agents’ goal-inference abilities:

(1) Special populations. We include scenarios featuring individuals with unique physical conditions: a wheelchair user and a 1.2-meter-tall child. A wheelchair user faces mobility and height limitations, while the child cannot reach high places. We design stories that incorporate these constraints so that the hidden goal must be inferred through contextual cues rather than physical actions.

(2) Object-centric property reasoning. We exploit special physical properties of household objects to construct implicit goals. For instance, since faucets can leak water, we create situations where a person leaves without turning off the faucet. Similarly, because candles provide light, we design scenes where a person reading a book suddenly experiences a power outage and begins walking around; the agent can infer that they are searching for candles (no flashlight is available in the environment).

(3) Functional object combinations. Based on the objects present in VirtualHome and ThreeDWorld, we identify typical usage pairs or triplets. For example, a knife, cutting board, and carrot together imply the goal of cutting carrots. If a person places a cutting board on the table and puts a carrot on it before searching for another object, the hidden goal is most likely to find a knife to complete the task.

(4) Dialogue-driven inference. We additionally design conversational scenarios like MuMA-ToM [40] and FanToM [21] in which implicit goals must be inferred from incomplete verbal exchanges rather than direct physical interactions.

Finally, we collect 200 examples for Implicit Goal Inference & Completion and 390 examples for False-Belief Correction. Among them, 37 examples are adapted from MuMA-ToM [40], where we further augment each story by incorporating a stage-3 “character search” segment, as illustrated in Fig. 8, and 2 examples are sourced from CHAIC [10]. Overall, 113 examples contain a single humanoid agent, 373 contain two agents, and 104 contain three agents. In addition, 17 examples involve agents with special needs, 96 focus on object-centric property reasoning and functional object combinations, and 87 correspond to dialogue-driven inference.

Data Annotation. For each example in the MindPower Reasoning Hierarchy, the annotations are manually created and subsequently verified using GPT-4o [1]. During the annotation process, particularly for the Action layer, we adopt a unified action space that integrates action definitions from both VirtualHome and ThreeDWorld. This approach enables us to standardize heterogeneous simulators under a single executable schema. The complete list of supported high-level actions is as follows:

High-Level Action Set: Walk, Run, WalkTowards, WalkForward, TurnLeft, TurnRight, Sit, StandUp, Grab, Open, Close, Put, PutIn, SwitchOn, SwitchOff, Drink, Touch, LookAt, TurnBy, TurnTo, MoveBy, MoveTo, ReachFor, ResetArm, Drop, Animate, RotateHead, ResetHead
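As a sketch, this unified action space can be treated as a single vocabulary against which annotated sequences are validated; the validation helper below is illustrative and not part of the benchmark tooling:

```python
# Unified high-level action vocabulary merged from VirtualHome and ThreeDWorld (as listed above).
HIGH_LEVEL_ACTIONS = {
    "Walk", "Run", "WalkTowards", "WalkForward", "TurnLeft", "TurnRight",
    "Sit", "StandUp", "Grab", "Open", "Close", "Put", "PutIn",
    "SwitchOn", "SwitchOff", "Drink", "Touch", "LookAt", "TurnBy", "TurnTo",
    "MoveBy", "MoveTo", "ReachFor", "ResetArm", "Drop", "Animate",
    "RotateHead", "ResetHead",
}

def validate_sequence(actions: list[str]) -> bool:
    """Check that every step of an annotated sequence uses a supported action name, e.g. 'Grab (apple)'."""
    return all(step.split("(")[0].strip() in HIGH_LEVEL_ACTIONS for step in actions)

# validate_sequence(["Walk (kitchen)", "Grab (apple)", "PutIn (apple, fridge)"])  -> True
```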

For some examples in the False-Belief Correction task, the camera viewpoint prevents certain objects from being visible after they are moved. For instance, we design scenarios where a humanoid agent moves an object from the fridge in the kitchen to the bedroom, but the camera is fixed in the kitchen and cannot capture the final location. As a result, the embodied agent can only infer that the object has been moved, without knowing where it ends up. In such cases, the annotated Action does not require the agent to find the object. Instead, the action is defined as reminding the returning character that the object has already been moved, thereby correcting their false belief even though the agent cannot locate the object.

We compare our dataset with existing multimodal ToM benchmarks from three perspectives:

• Data source and diversity. To the best of our knowledge, our benchmark is the first to be constructed using two different simulators, which substantially increases the diversity of environments, interaction patterns, and embodied tasks. In contrast, prior multimodal ToM datasets are typically collected from a single simulator; for example, MuMA-ToM [40], MMToM-QA [18], and BDIQA [32] are limited to VirtualHome, while SoMi-ToM [11] is restricted to Minecraft.
• Reasoning paradigm. As shown in Fig. 8, our dataset adopts a Robot-Centric ToM reasoning paradigm, where the agent must infer both the mental states of humans and its own belief state, and then produce decisions and action sequences. Existing multimodal ToM benchmarks primarily focus on inferring human mental states without requiring downstream decision making or action generation.
• Evaluation format. Our benchmark supports open-ended evaluation, allowing agents to autonomously reason and respond in natural language. This differs from prior datasets, which mainly rely on multiple-choice question formats and therefore cannot reflect real-world embodied decision-making where agents act independently.

We employ two simulators in total, VirtualHome and ThreeDWorld, covering 8 different apartment layouts that include dining rooms, bedrooms, kitchens, and bathrooms, as well as 16 humanoid agents consisting of 2 children, 1 wheelchair user, and 13 adults of diverse ages and skin tones. The set of humanoid agents is illustrated in Fig. 12, while the distribution of apartment layouts is shown in Fig. 13.

Detailed Example of Example 1 in Fig. 1. The MindPower Reasoning Hierarchy output of Example 1 in Fig. 1 is:

Alice walks into the kitchen, puts the apple on the table, and then leaves the kitchen. Then David walks into the kitchen, picks up the apple, and puts it in the refrigerator. Alice comes back and walks around. • I think Alice is looking for the apple. I believe she thinks the apple is on the table, but I also believe the apple is actually in the refrigerator. • I want to assist Alice in achieving her goal of retrieving the apple, and I want to resolve the mismatch between Alice’s belief and the real world state. • I want to take out the apple from the fridge and hand it to Alice.

In Sec. 3.4 of the manuscript, we conduct several experiments on the MindPower Benchmark. Prompt used for VLMs to produce outputs in MindPower Reasoning Hierarchy format. For the experiments in Sec. 3.4 and Tab. 2 of the manuscript, we employed the prompt shown in Fig. 15 to guide the vision-language models (VLMs) to generate outputs in the MindPower Reasoning Hierarchy format.

Prompt used for GPT-4o. In Sec. 3.4 of the manuscript, we use the prompt shown in Fig. 17 to instruct GPT-4o to generate the Decision and Action directly, without performing step-by-step reasoning, while the prompt shown in Fig. 18 guides the model to produce the Decision and Action with standard reasoning.

In Fig. 4 of the manuscript, we evaluate the Robot-Centric score across all VLMs using GPT-4o, with the prompt shown in Fig. 19 to assess whether the model performs reasoning from the robot’s own perspective rather than inferring solely from the surrounding environment.

In the atomic-action representation, the token character refers to any human identifier in the scene (e.g., char0, char1). However, for the Action layer, we omit the character argument because actions in this layer exclusively represent the behaviors of the embodied agent itself and therefore do not require explicit character attribution.

For the remaining reasoning layers, the defined atomic-action table is presented in Tab. 3.

The prompt used for Qwen3-Max is in Fig. 20.
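A minimal sketch of parsing such extracted atomic actions into (action, arguments) tuples; the exact string format produced by the extraction prompt is not shown here, so the pattern below is an assumption:

```python
import re

ATOMIC = re.compile(r"^\s*(?P<action>\w+)\s*\((?P<args>[^)]*)\)\s*$")

def parse_atomic(line: str):
    """Parse strings like 'pick (char0, cup)' or 'walk (kitchen_counter)' into (action, [args])."""
    match = ATOMIC.match(line)
    if match is None:
        return None
    args = [a.strip() for a in match.group("args").split(",") if a.strip()]
    return match.group("action"), args

print(parse_atomic("pick (char0, cup)"))       # ('pick', ['char0', 'cup'])
print(parse_atomic("walk (kitchen_counter)"))  # ('walk', ['kitchen_counter'])  Action layer: no character token
```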

Can the model still make correct decisions or carry out assisting actions even if the reasoning in the previous layer is incorrect? Even if the model makes errors in object recognition or misinterprets the initial scene, it can still produce correct outputs as long as it correctly identifies the final location of the object. This is because our decision-making process is designed to correct for human false beliefs.

(Figure residue: qualitative comparison, prompted with “What can you do for him?”, of outputs from our model, GPT-4o, and Qwen2.5-VL-7B-Instruct for a scenario in which a wheelchair user cannot reach a glass on the kitchen counter; our model separates perspectives and ends with the action sequence walk (kitchen_counter), reach (glass), pick (glass), walk (owner), give (glass, owner).)

In all experiments of this paper, we use the first viewpoint (the standard view), while the other two viewpoints will be released for use in global tracking and analysis.

Our method focuses on high-level mental-state modeling and decision making, rather than fine-grained action execution. Current Vision-Language-Action (VLA) models are strong low-level executors, generating gripper motions and stepwise trajectories, but they remain confined to action-command prediction and lack explicit reasoning about beliefs, goals, or social context. In contrast, our agent, similar in spirit to PaLM-E [9], performs high-level planning that grounds actions in inferred mental states and task intent. Structured Belief-Desire-Intention (BDI) reasoning enables goal inference and planning that are guided by perspective rather than by how to do it. Although our system is architecturally distinct from low-level VLA executors, it is inherently complementary to them. The high-level plans produced by our agent can serve as abstract, semantically grounded guidance for downstream controllers. Future work can integrate our model with existing VLA-based executors by simply attaching an action head or a motion-generation module on top of the inferred intentions and subgoals. This design creates a hierarchical embodied agent: our model provides deliberate, interpretable, and socially aligned planning, while low-level VLA modules translate these plans into precise motor actions. Such a combination offers a promising direction toward end-to-end agents that are both cognitively capable and physically competent.

Limitations.

• Due to the constraints of current open-source simulators, our experiments are limited to the environments, humanoid agents, and action sets provided by the simulators.
• Our system relies on an explicit MindPower Reasoning Hierarchy, which models the full chain from Perception to Action. While this ensures interpretable reasoning, it inevitably increases the number of output tokens.

Future Work.

• Extend the benchmark to real-world settings beyond simulation.
• Develop implicit mental-state modeling based on the proposed MindPower Reasoning Hierarchy to reduce reasoning length while maintaining interpretability.
• Expand our scenarios to broader domains, including outdoor environments and human-robot collaboration.

We provide one example in which humanoid agents, controlled by embodied agents, perform assisting actions in the videos. The example is shown in Fig. 11.

(Table residue: partial rows from the results table showing scores for LLaVA-OV-8B [24] and for ablations of our model trained with Mind-Reward only and with SFT only; the column headers are not recoverable.)

Reasoning about the mental states of others is essential for coherent decision-making in multi-agent interactions involving cooperation, conflict, or deception.

ToM Reasoning. In our work, “ToM Reasoning” refers to an agent’s ability to infer others’ mental states and make decisions based on them rather than solely on observable states.

Robot-Centric. In our work, by “Robot-Centric” we mean that the embodied agent should reason from its own perspective. It not only needs to infer its own mental states but also needs to reason about how it perceives the mental states of humans.

Role-Centric. “Role-Centric” refers to the model reasoning about mental states from the perspective of a character within the current story or multimodal input.

MindPower Reasoning Hierarchy. In this work, we propose that the model follows the reasoning path Perception → Belief → Desire → Intention → Decision → Action, which constitutes the MindPower Reasoning Hierarchy.



