EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
Deploying humanoid robots in real-world settings is fundamentally challenging: it demands tight integration of perception, locomotion, and manipulation under partial observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into diverse, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that predicts locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial-reasoning question answering, and simulated-environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fast action inference (under 1 s) with both 8B- and 4B-parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution while generalizing across diverse tasks and unseen environments.
💡 Research Summary
The paper introduces a new embodied‑AI task called EgoActing, which requires a humanoid robot to translate a high‑level natural‑language instruction into a sequence of low‑level, spatially aware actions using only egocentric RGB observations, past action history, and a set of available whole‑body policies. To instantiate this task the authors present EgoActor, a unified vision‑language model (VLM) that directly predicts locomotion primitives (forward/backward steps, lateral shifts, turns, height adjustments), head‑orientation actions, manipulation commands, and human‑interaction behaviors in real time.
Model architecture – EgoActor builds on the open‑source Qwen3‑VL transformer, applying LoRA adapters to fine‑tune all linear layers. Two model sizes (4B and 8B parameters) are trained, offering a trade‑off between latency (sub‑second inference) and performance. No architectural changes specific to robotics are introduced; the same VLM backbone that powers image‑text tasks is repurposed for action prediction.
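The LoRA adaptation of a single linear layer can be illustrated with a minimal NumPy sketch. The rank, scaling factor, and dimensions below are illustrative assumptions for exposition, not the authors' training settings:

```python
import numpy as np

# Minimal sketch of a LoRA-adapted linear layer, mirroring the idea of
# fine-tuning all linear layers of a frozen VLM backbone. All shapes and
# hyperparameters here are illustrative assumptions.
rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 64, 64, 8, 16
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # zero-initialized, so the
                                              # adapter starts as a no-op

def lora_linear(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / rank) * B (A x); only A and B are trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer exactly reproduces the frozen layer,
# which is why LoRA fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_linear(x), W @ x)
```

Because only the low-rank factors `A` and `B` receive gradients, the number of trainable parameters is a small fraction of the backbone, which is what makes fine-tuning 4B/8B-scale models practical.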
Training data – The authors assemble a heterogeneous corpus comprising (1) real‑world egocentric video demonstrations collected from a humanoid platform, (2) a spatial‑reasoning question‑answer dataset (EgoTaskQA) that forces the model to reason about distances, angles, and object locations, and (3) simulated trajectories generated in a physics‑based environment. Each training example contains a recent history of ten RGB frames plus the three most recent observation‑action pairs. Actions are annotated in two formats:
- Structured Language Actions (SLAs) – concise templates encoding quantitative motion parameters (e.g., “Turn left 30.5 degrees”, “Move forward 0.8 meters”). These are used for movement and active‑perception skills, enabling the model to learn precise spatial grounding.
- Natural Language Actions (NLAs) – free‑form text describing manipulation or human‑interaction steps (e.g., “Pick up the orange on the desk”, “Ask: Could you guide me to the meeting room?”). NLAs give the system flexibility to generate previously unseen commands without a fixed skill library.
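The input layout described earlier — the ten most recent RGB frames plus the three most recent observation-action pairs — can be sketched as a simple buffer-and-assemble step. The field names and data structures below are hypothetical, chosen only to make the history sizes concrete:

```python
from collections import deque

# Hypothetical sketch of assembling one inference input from the history
# sizes described in the text: ten recent RGB frames plus the three most
# recent observation-action pairs. Field names are illustrative.
N_FRAMES, N_PAIRS = 10, 3

def build_model_input(instruction, frame_buffer, history):
    """frame_buffer: deque of frame ids; history: list of (obs, action) pairs."""
    return {
        "instruction": instruction,
        "frames": list(frame_buffer)[-N_FRAMES:],
        "recent_steps": history[-N_PAIRS:],
    }

frames = deque(maxlen=N_FRAMES)      # older frames fall off automatically
for t in range(15):
    frames.append(f"frame_{t:03d}.jpg")

history = [(f"obs_{t}", f"action_{t}") for t in range(5)]
inp = build_model_input("Bring me the orange from the desk", frames, history)
assert len(inp["frames"]) == 10 and inp["frames"][0] == "frame_005.jpg"
assert len(inp["recent_steps"]) == 3
```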
During inference, the model receives the latest visual context and action history and outputs a mixed sequence of SLAs and NLAs; a lightweight parser then translates SLAs into velocity/angle commands for the robot's low-level controller. NLAs are dispatched via keyword triggers: speech-related phrases invoke a text-to-speech system, predefined interaction keywords trigger scripted gestures, and remaining commands are forwarded to a pre-trained vision-language-action (VLA) manipulation module.
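The SLA parsing and keyword-based NLA dispatch can be sketched as follows. The regex templates, keyword sets, and routing targets are illustrative assumptions, not the authors' exact rules:

```python
import re

# Illustrative sketch of a lightweight parser/dispatcher: SLAs are matched
# against fixed templates and converted to numeric commands, while NLAs are
# routed by keyword. Patterns and routing rules here are assumptions.
SLA_PATTERNS = [
    (re.compile(r"Turn (left|right) ([\d.]+) degrees"), "turn"),
    (re.compile(r"Move (forward|backward|left|right) ([\d.]+) meters"), "move"),
]

def dispatch(action: str):
    # 1) Structured Language Actions -> numeric low-level commands
    for pattern, kind in SLA_PATTERNS:
        m = pattern.fullmatch(action)
        if m:
            direction, value = m.group(1), float(m.group(2))
            return ("controller", kind, direction, value)
    # 2) Natural Language Actions, routed by keyword triggers
    if action.startswith("Ask:"):                 # speech -> text-to-speech
        return ("tts", action[4:].strip())
    if action.split()[0] in {"Wave", "Nod"}:      # scripted gestures
        return ("gesture", action)
    return ("vla", action)                        # fall through to the VLA module

assert dispatch("Turn left 30.5 degrees") == ("controller", "turn", "left", 30.5)
assert dispatch("Ask: Could you guide me to the meeting room?")[0] == "tts"
assert dispatch("Pick up the orange on the desk")[0] == "vla"
```

Keeping the SLA grammar small and template-based is what makes the parser "lightweight": anything that fails the templates is treated as an open-ended NLA rather than a parse error.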
Evaluation – Experiments are conducted both in simulation (diverse layouts, dynamic obstacles) and on a physical humanoid robot. Three benchmark families are used: (i) Human‑Robot Interaction (e.g., ask for directions, request objects), (ii) Mobile Manipulation (search‑and‑pick tasks), and (iii) Traversability (navigating narrow passages while maintaining balance). EgoActor achieves >85% success with the 8B model across these domains, outperforming prior approaches such as SayCan, UniNav, and other VLN‑style agents that either rely on fixed skill libraries or lack whole‑body coordination. Qualitative videos demonstrate the robot halting its walk, tilting its head to locate a target, extending an arm to grasp, and then resuming locomotion—all within a single coherent plan, evidencing smooth transitions between movement, perception, manipulation, and interaction.
Insights and limitations – The work shows that a generic VLM, when supplied with appropriately formatted multimodal data, can learn the spatial reasoning needed for precise motor control without any hand‑crafted geometric modules. The separation of SLAs and NLAs balances the need for accurate low‑level motion with the flexibility of open‑ended high‑level commands. However, reliance on RGB alone limits depth perception, potentially affecting tasks that require fine vertical discrimination or handling transparent objects. The rule‑based action parser may become a bottleneck for more complex, temporally extended motions, and scaling LoRA‑fine‑tuned VLMs to even larger corpora remains an open research question.
Contributions – (1) Definition of the EgoActing task that explicitly stresses spatial understanding for humanoid robots, (2) Introduction of EgoActor, a unified VLM that predicts a rich set of egocentric actions covering locomotion, head movement, manipulation, and human interaction, and (3) Extensive real‑world and simulated validation, together with the release of code, models, datasets, and evaluation protocols to foster reproducibility.
In summary, EgoActor bridges the gap between abstract language planning and concrete motor execution for humanoid robots, demonstrating that vision‑language models can serve as a single, scalable backbone for embodied agents operating in complex, partially observable, and dynamically changing real‑world environments.