BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models


Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and benchmarking robotic intelligence with them has become a pivotal research direction. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks such as lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates uniquely bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently producing mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision avoidance and fine-grained temporal sequencing.


💡 Research Summary

BiManiBench introduces a hierarchical benchmark specifically designed to evaluate multimodal large language models (MLLMs) on bimanual manipulation tasks, addressing a critical gap in existing robotics benchmarks that focus almost exclusively on single‑arm scenarios. The benchmark is organized into three tiers: (1) Dual‑Arm Spatial Reasoning, which tests the model’s ability to assign the correct arm (left or right) to grasp objects based on visual input; (2) High‑Level Action Planning, which requires the model to generate a logical sequence of atomic actions in JSON format for complex, long‑horizon tasks; and (3) Low‑Level End‑Effector Control, which demands direct output of continuous 16‑dimensional control commands (7‑DoF pose + 1‑DoF gripper for each arm) at every timestep.
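To make the tier-2 and tier-3 output formats concrete, the following is a minimal sketch. The action vocabulary, JSON field names, and vector ordering are illustrative assumptions, not BiManiBench's actual schema; only the dimensionality (7-DoF pose + 1-DoF gripper per arm, 16-D total) comes from the summary above.

```python
import json

# Tier 2 (hypothetical schema): a high-level plan as a JSON list of
# atomic actions, one entry per arm assignment.
plan = json.dumps([
    {"arm": "left",  "action": "grasp", "object": "pot_handle_left"},
    {"arm": "right", "action": "grasp", "object": "pot_handle_right"},
    {"arm": "both",  "action": "lift",  "height": 0.15},
])

# Tier 3: one 16-D continuous control command per timestep
# (7-DoF pose + 1-DoF gripper for each arm). Ordering is assumed:
# position xyz + quaternion wxyz, then gripper opening.
def pack_bimanual_command(left_pose, left_grip, right_pose, right_grip):
    """Concatenate both arms' commands into a single 16-D vector."""
    assert len(left_pose) == 7 and len(right_pose) == 7
    return list(left_pose) + [left_grip] + list(right_pose) + [right_grip]

cmd = pack_bimanual_command(
    left_pose=[0.3, 0.2, 0.5, 1.0, 0.0, 0.0, 0.0],   left_grip=0.04,
    right_pose=[0.3, -0.2, 0.5, 1.0, 0.0, 0.0, 0.0], right_grip=0.0,
)
print(len(cmd))  # 16
```

The point of the flat 16-D layout is that the model must emit both arms' commands jointly at every timestep, which is precisely where inter-arm coordination failures surface.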

To capture the nuances of bimanual coordination, the authors define three coordination modes: Independent Parallel Manipulation, Synchronous Collaborative Manipulation, and Sequential Collaborative Manipulation. Each mode stresses different aspects of inter‑arm interaction such as collision avoidance, temporal synchronization, and logical dependencies. The benchmark also introduces a Gaussian‑Weighted Spatial Score for the reasoning tier, providing a soft penalty for mis‑assignments near the workspace center line, thereby reflecting the inherent ambiguity of that region.
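The summary does not give the exact formula for the Gaussian-Weighted Spatial Score; one plausible reading, sketched below, weights each arm-assignment decision by a Gaussian-shaped ambiguity factor so that mistakes near the workspace centre line are penalised less. The function name, the weighting form, and the width `sigma` are all assumptions.

```python
import math

def gaussian_spatial_score(assignments, sigma=0.1):
    """Soft-weighted arm-assignment accuracy (illustrative sketch).

    Each item is (x_offset, correct): x_offset is the object's signed
    lateral distance from the workspace centre line (in metres), and
    `correct` is whether the model chose the ground-truth arm.
    """
    total, score = 0.0, 0.0
    for x, correct in assignments:
        # Ambiguity weight: ~1 far from the centre line, ~0 at it,
        # so centre-line mistakes barely reduce the score.
        w = 1.0 - math.exp(-x * x / (2.0 * sigma ** 2))
        total += w
        if correct:
            score += w
    return score / total if total > 0 else 1.0
```

Under this sketch, a wrong assignment 1 cm from the centre line costs almost nothing, while the same mistake 30 cm out is penalised nearly as a hard error.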

The experimental platform builds on the RoboTwin simulator, extending it with a rich set of objects (blocks, containers, bottles, etc.) and providing both first‑person and third‑person visual streams. A vision‑driven agent architecture is employed: the model receives images, language instructions, interaction history, and optional auxiliary data, then produces a textual description, a high‑level plan, and finally a concrete action sequence. To mitigate the “action lag” problem caused by planning multiple steps ahead without intermediate feedback, the authors implement Task‑Adaptive Execution Truncation, which limits the number of actions executed from a chunk before re‑observing the environment and replanning.
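Task-Adaptive Execution Truncation can be sketched as a replanning loop that executes only a prefix of each planned action chunk before re-observing the scene. The function names below (`plan_chunk`, `execute`, `observe`) and the truncation length are placeholders, not the benchmark's actual API.

```python
def run_episode(plan_chunk, execute, observe, max_steps=50, truncate_n=3):
    """Replanning loop that limits open-loop execution per chunk.

    plan_chunk(obs) -> list of actions (empty list signals task done);
    execute(action) runs one action; observe() returns a fresh observation.
    """
    obs = observe()
    steps = 0
    while steps < max_steps:
        chunk = plan_chunk(obs)           # MLLM proposes an action chunk
        if not chunk:
            break                         # planner signals completion
        for action in chunk[:truncate_n]: # truncate: execute only a prefix
            execute(action)
            steps += 1
            if steps >= max_steps:
                break
        obs = observe()                   # close the loop before replanning
    return steps

# Toy usage: a scripted planner that emits two 5-action chunks, then stops.
executed = []
chunks = iter([["a1", "a2", "a3", "a4", "a5"],
               ["b1", "b2", "b3", "b4", "b5"],
               []])
steps = run_episode(lambda obs: next(chunks), executed.append, lambda: "obs")
print(steps, executed)  # 6 ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']
```

Only three of each five planned actions run before a fresh observation is taken, which is the mechanism the authors use to curb the "action lag" of long open-loop plans.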

The benchmark evaluates over 30 state‑of‑the‑art models, including proprietary systems (GPT‑5, Gemini‑2.5‑Pro, Claude‑4‑Sonnet) and open‑source alternatives (InternVL‑3, Qwen2.5‑VL). Results reveal three consistent patterns: (1) Spatial reasoning is generally strong, yet stochastic hallucinations near the central axis cause incorrect arm assignments, leading to unreachable grasps or self‑collisions. (2) High‑level planning shows solid logical reasoning; however, when these plans are translated into low‑level continuous control, models frequently fail to generate collision‑free, temporally synchronized trajectories, especially in synchronous collaborative tasks. (3) Model size and visual bandwidth matter: large models benefit from multi‑view inputs and achieve modest performance gains, whereas smaller models suffer from information overload and achieve lower success rates than in single‑view settings.

The authors conclude that current MLLM‑based embodied agents excel at perception and high‑level reasoning but lack an integrated understanding of mutual kinematic constraints and fine‑grained temporal sequencing required for true bimanual coordination. They propose future research directions: embedding explicit kinematic graphs or constraint‑aware modules within the language model, developing dedicated collision‑avoidance and synchronization heads (e.g., graph neural network planners), and designing efficient multimodal token handling strategies for limited‑capacity models.

Overall, BiManiBench provides the first systematic, hierarchical evaluation suite for bimanual manipulation, exposing critical failure modes of contemporary MLLMs and offering a clear roadmap for advancing multi‑arm embodied intelligence.

