OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, the Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, the Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experiments demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits strong capabilities across a wide range of downstream scenarios. Evaluations on a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
💡 Research Summary
OmniEVA tackles two fundamental shortcomings of current multimodal large language model (MLLM)‑based embodied systems: a geometric adaptability gap and an embodiment constraint gap. The former arises because models trained only on 2‑D inputs lack sufficient spatial information for tasks that require rich 3‑D reasoning, while existing 3‑D‑LLMs inject geometry in a static, task‑agnostic manner, leading to unnecessary computation and noisy embeddings when 3‑D data is irrelevant. The latter stems from neglecting the physical capabilities and limits of real robots—such as joint ranges, workspace boundaries, and object affordances—causing generated plans to be theoretically correct but practically infeasible.
OmniEVA introduces two complementary innovations to close these gaps. First, a Task‑Adaptive 3‑D Grounding mechanism employs a gated router (TA‑GR) that decides, on a per‑sample basis, whether to fuse 3‑D positional encodings with the visual tokens of a Vision Transformer. Depth maps are projected into world coordinates using camera intrinsics/extrinsics, averaged per image patch, and sinusoidally encoded to produce a 3‑D token tensor Vp. The router receives a task embedding (via a lightweight sentence transformer) and a global scene descriptor (average‑pooled visual features), concatenates them, and passes them through an MLP to obtain gate logits. A Gumbel‑Softmax hard gate (g∈{0,1}) either adds Vp to the visual tokens (g=1) or leaves the visual stream untouched (g=0). This dynamic mixture‑of‑experts approach ensures that 3‑D information is leveraged only when the task truly demands it, preserving efficiency for purely 2‑D queries.
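The gating mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the single linear router layer (standing in for the MLP), and the per-axis sin/cos layout of the positional encoding are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sinusoidal_encode(xyz, dim_per_axis=32):
    """Sinusoidally encode per-patch 3-D coordinates (illustrative sin/cos-per-axis layout)."""
    half = dim_per_axis // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    parts = []
    for axis in range(3):
        phase = xyz[:, axis:axis + 1] * freqs          # (patches, half)
        parts.append(np.sin(phase))
        parts.append(np.cos(phase))
    return np.concatenate(parts, axis=-1)              # (patches, 3 * dim_per_axis)

def gumbel_hard_gate(logits, rng, tau=1.0):
    """Hard 0/1 gate via Gumbel perturbation (argmax stands in for the straight-through trick)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax((logits + gumbel) / tau))     # 1 -> fuse 3-D tokens, 0 -> skip

patches, d = 16, 96
visual = rng.normal(size=(patches, d))                 # ViT patch tokens
xyz = rng.uniform(-2.0, 2.0, size=(patches, 3))        # per-patch mean world coords from depth
v_p = sinusoidal_encode(xyz, dim_per_axis=d // 3)      # 3-D token tensor Vp, same width as visual

task_emb = rng.normal(size=(8,))                       # sentence-transformer task embedding (toy)
scene = visual.mean(axis=0)                            # average-pooled global scene descriptor
W = rng.normal(size=(8 + d, 2)) * 0.1                  # one linear layer stands in for the router MLP
logits = np.concatenate([task_emb, scene]) @ W

g = gumbel_hard_gate(logits, rng)
fused = visual + v_p if g == 1 else visual             # g=1: add Vp; g=0: leave visual untouched
```

At inference the hard gate makes the 3-D branch a true on/off switch, so purely 2-D queries skip the positional-encoding addition entirely.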
Second, an Embodiment-Aware Reasoning framework integrates task goals, environmental context, and robot physical constraints into the reasoning loop. After the TA-GR pre-training stage, OmniEVA undergoes supervised fine-tuning on a hybrid dataset covering 2-D VQA, video-based scene understanding, and 3-D grounding, establishing a strong general embodied reasoning foundation (OmniEVA-Base). The final stage applies Task-and-Embodiment-aware GRadient-based Policy Optimization (TE-GRPO), a reinforcement-learning fine-tuning stage whose reward function penalizes joint-limit violations, workspace collisions, and affordance mismatches. The gate parameters remain frozen while the language model and embodiment-aware module are updated, preserving pretrained linguistic knowledge while the model learns executable plans.
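A reward of this shape can be sketched as below. The exact TE-GRPO reward terms and weights are not given in the summary; the penalty forms, weights, and helper signature here are illustrative assumptions.

```python
import numpy as np

def embodiment_reward(task_success, joints, joint_limits, ee_pos, workspace,
                      affordance_ok, w_joint=1.0, w_space=1.0, w_afford=1.0):
    """Illustrative embodiment-aware reward: task reward minus feasibility penalties.
    All penalty forms are assumptions, not the paper's actual reward."""
    lo, hi = joint_limits
    # magnitude by which each joint exceeds its range (zero when within limits)
    joint_violation = np.clip(lo - joints, 0.0, None) + np.clip(joints - hi, 0.0, None)
    ws_lo, ws_hi = workspace
    # binary flag: end-effector target outside the reachable workspace box
    out_of_workspace = bool(np.any((ee_pos < ws_lo) | (ee_pos > ws_hi)))
    return (float(task_success)
            - w_joint * float(joint_violation.sum())
            - w_space * float(out_of_workspace)
            - w_afford * float(not affordance_ok))

# A plan that "succeeds" but drives one joint 0.5 rad past its limit loses reward.
limits = (np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
ws = (np.zeros(3), np.ones(3))
r = embodiment_reward(True, np.array([0.0, 1.5]), limits,
                      np.array([0.5, 0.5, 0.5]), ws, affordance_ok=True)
# r = 1.0 task reward - 0.5 joint penalty = 0.5
```

The key property is that a theoretically correct plan that violates the robot's physical envelope scores lower than a feasible one, which is exactly the signal the embodiment constraint gap calls for.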
Training proceeds in three stages: (1) TA‑GR pre‑training on large 3‑D datasets (ScanNet, Matterport3D, 3RScan, ARKitScenes) with a differentiated learning‑rate schedule; (2) supervised fine‑tuning on a curated multimodal corpus to build general reasoning capabilities; (3) TE‑GRPO reinforcement fine‑tuning in simulated robot environments that model real‑world kinematics and constraints.
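The three-stage recipe can be summarized as a configuration sketch. Dataset names come from the summary; every learning rate and parameter-group name below is invented for illustration.

```python
# Hypothetical three-stage training configuration (all hyperparameters are assumptions).
STAGES = [
    {"stage": "ta_gr_pretrain",
     "data": ["ScanNet", "Matterport3D", "3RScan", "ARKitScenes"],
     # differentiated schedule: freshly initialized 3-D modules learn faster than the backbone
     "lr": {"router_and_3d_encoder": 1e-4, "vit_backbone": 1e-5}},
    {"stage": "sft",
     "data": ["2d_vqa", "video_scene_understanding", "3d_grounding"],
     "lr": {"llm": 2e-5, "router_and_3d_encoder": 2e-5}},
    {"stage": "te_grpo",
     "data": ["simulated_robot_envs"],
     # gate parameters stay frozen; the LLM and embodiment-aware module are updated
     "lr": {"llm": 1e-6, "embodiment_module": 1e-6, "router_and_3d_encoder": 0.0}},
]

for s in STAGES:
    print(s["stage"], "->", ", ".join(sorted(s["lr"])))
```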
Evaluation spans eight public embodied benchmarks, covering 2-D spatial VQA, 3-D visual grounding, video-based scene captioning, and large-scale navigation (HM3D, MP3D), plus four newly introduced primitive tasks (Where2Go, Where2Grasp, Where2Approach, Where2Fit) that isolate core planning skills. OmniEVA achieves state-of-the-art performance on seven of the eight benchmarks, notably surpassing prior bests on HM3D and MP3D navigation. In the primitive suite it outperforms all baselines, and the gated router activates 3-D fusion in over 95% of geometry-critical queries while remaining inactive in under 10% of color- or language-only queries. Reinforcement-tuned policies achieve an 87% success rate on simulated execution, compared with roughly 45% for models lacking embodiment awareness.
In summary, OmniEVA provides a principled architecture that dynamically balances 2‑D and 3‑D perception with task relevance and rigorously enforces robot physical constraints during planning. By bridging the geometric adaptability and embodiment constraint gaps, it delivers a versatile, executable planner ready for real‑world robotic navigation, mobile manipulation, and human‑robot collaboration, and offers a scalable foundation for future extensions to multi‑robot coordination and long‑term memory‑driven planning.