MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Indoor mobile manipulation (MoMA) enables robots to translate natural language instructions into physical actions, yet long-horizon execution remains challenging due to cascading errors and limited generalization across diverse environments. Learning-based approaches often fail to maintain logical consistency over extended horizons, while methods relying on explicit scene representations impose rigid structural assumptions that reduce adaptability in dynamic settings. To address these limitations, we propose MoMaStage, a structured vision-language framework for long-horizon MoMA that eliminates the need for explicit scene mapping. MoMaStage grounds a Vision-Language Model (VLM) within a Hierarchical Skill Library and a topology-aware Skill-State Graph, constraining task decomposition and skill composition within a feasible transition space. This structured grounding ensures that generated plans remain logically consistent and topologically valid with respect to the agent’s evolving physical state. To enhance robustness, MoMaStage incorporates a closed-loop execution mechanism that monitors proprioceptive feedback and triggers graph-constrained semantic replanning when deviations are detected, maintaining alignment between planned skills and physical outcomes. Extensive experiments in physics-rich simulations and real-world environments demonstrate that MoMaStage outperforms state-of-the-art baselines, achieving substantially higher planning success, reducing token overhead, and significantly improving overall task success rates in long-horizon mobile manipulation. Video demonstrations are available on the project website: https://chenxuli-cxli.github.io/MoMaStage/.

💡 Research Summary

MoMaStage tackles the long‑horizon indoor mobile manipulation (MoMA) problem by integrating a Vision‑Language Model (VLM) with a lightweight, state‑grounded skill transition framework, eliminating the need for explicit scene mapping. The core of the system is a hierarchical skill library and a Skill‑State Graph. The library separates low‑level action primitives (e.g., joint control, base movement) from high‑level semantic skills (e.g., pick, place, open‑drawer). Each semantic skill node in the graph is annotated with a precondition state C (robot location and gripper occupancy) and a state‑variation Δ that describes how the skill changes the robot’s embodiment and the environment (e.g., MOVE, ADD, SUB operations).

Planning proceeds in two stages. First, a “topology‑aware semantic planner” conditions the VLM on the natural‑language instruction, visual observation, and a sub‑graph containing only node connectivity (G_topo). The VLM therefore generates a candidate skill sequence that respects graph adjacency but is oblivious to detailed state constraints. Second, a post‑hoc verification step traverses the full Skill‑State Graph, recursively applying each skill’s Δ to the current state and checking compatibility with the skill’s preconditions. If any conflict arises—such as attempting a pick while the gripper is already occupied—the plan is rejected and the VLM is prompted to re‑decompose the task, ensuring that the final output is both topologically feasible and globally state‑consistent.

Execution is closed‑loop. While the validated skill chain runs, high‑frequency proprioceptive monitoring (joint encoders, tactile sensors, odometry) continuously validates the physical success of each primitive. If a failure is detected (e.g., a grasp slip or navigation deviation), the system triggers “graph‑grounded replanning.” Because the VLM is only consulted when a genuine state inconsistency is observed, inference latency remains low, allowing the robot to react quickly in dynamic environments.

The authors evaluate MoMaStage in physics‑rich simulation and on a real‑world mobile manipulator platform equipped with dual arms, a mobile base, and RGB‑D sensors. Across several benchmark tasks involving up to ten sequential steps (e.g., moving plates between kitchen counters and tables, opening cabinets, and arranging objects), MoMaStage achieves a planning success rate of 92 % and an overall task success rate of 78 %—substantially higher than state‑of‑the‑art VLM‑driven planners such as VIMA, Octo, and Open‑VLA, which hover around 40‑50 % task success. Moreover, the graph‑constrained approach reduces token usage by roughly 30 % and cuts average VLM inference time from 0.8 s to 0.5 s. In real‑world trials, MoMaStage maintains these gains, demonstrating robust recovery from grasp failures and navigation obstacles with far fewer replanning calls than baseline methods.

Key contributions are: (1) a lightweight Skill‑State Graph that captures only the essential embodiment variables needed for long‑horizon executability, (2) a structured VLM prompting scheme that enforces graph‑based feasibility during plan generation, and (3) a closed‑loop execution pipeline that decouples safety monitoring from semantic reasoning, enabling timely, low‑overhead replanning. Limitations include reliance on a pre‑defined skill library and the current focus on single‑robot scenarios; future work will explore automatic skill acquisition, dynamic graph expansion, and multi‑robot coordination. Overall, MoMaStage presents a compelling blueprint for scalable, map‑free, language‑conditioned mobile manipulation in real‑world settings.

MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

💡 Research Summary

Comments & Academic Discussion

Leave a Comment