UniPlan: Vision-Language Task Planning for Mobile Manipulation with Unified PDDL Formulation
Integrating VLM reasoning with symbolic planning has proven to be a promising approach to real-world robot task planning. Existing work such as UniDomain effectively learns symbolic manipulation domains, described in the Planning Domain Definition Language (PDDL), from real-world demonstrations and has successfully applied them to real-world tasks. These domains, however, are restricted to tabletop manipulation. We propose UniPlan, a vision-language task planning system for long-horizon mobile manipulation in large-scale indoor environments that unifies scene topology, visuals, and robot capabilities into a holistic PDDL representation. UniPlan programmatically extends learned tabletop domains from UniDomain to support navigation, door traversal, and bimanual coordination. It operates on a visual-topological map comprising navigation landmarks anchored with scene images. Given a language instruction, UniPlan retrieves task-relevant nodes from the map and uses a VLM to ground the anchored images into task-relevant objects and their PDDL states; next, it reconnects these nodes into a compressed, densely connected topological map, also represented in PDDL, with connectivity and costs derived from the original map; finally, a mobile-manipulation plan is generated using off-the-shelf PDDL solvers. Evaluated on human-raised tasks in a large-scale map with real-world imagery, UniPlan significantly outperforms VLM-only and LLM+PDDL planning in success rate, plan quality, and computational efficiency.
💡 Research Summary
UniPlan introduces a novel vision‑language task planning framework that enables long‑horizon mobile manipulation in large‑scale indoor environments by unifying visual perception, topological spatial reasoning, and symbolic PDDL planning. Existing approaches either focus on tabletop manipulation with learned PDDL domains (e.g., UniDomain) or rely on vision‑language models (VLMs) that predict actions directly from a handful of images. These methods suffer from loss of visual detail, poor scalability to multi‑room settings, and brittle long‑horizon reasoning.
UniPlan addresses these gaps through four design principles: (1) maintain a vision‑anchored topological map that stores high‑resolution images at navigation landmarks, preserving rich visual context; (2) use a pre‑trained VLM only on task‑relevant images to ground objects and their states into PDDL predicates, avoiding exhaustive cross‑image reasoning; (3) programmatically extend a learned tabletop PDDL domain (from UniDomain) with navigation, door traversal, and bimanual operators via deterministic AST‑based rewrite rules; (4) compress the map to a task‑oriented subgraph, drastically reducing the size of the planning problem while retaining all necessary spatial information.
The system pipeline works as follows. A natural-language instruction is parsed to identify key objects and actions. UniPlan then searches the visual-topological map for nodes whose anchored images are likely to contain the referenced entities. The VLM processes only these selected images, producing object-state facts such as (holding robot cup) or (obj_at_node cup node5). Next, the retrieved nodes are reconnected into a dense subgraph; connectivity and traversal costs are extracted from the original map and encoded via the numeric function (travel_cost).
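The compression step described above can be sketched in a few lines of Python. This is a minimal illustration, not UniPlan's actual implementation: the adjacency structure, node names (node5, corridor, kitchen), and the exact fact syntax for (connected ...) and (travel_cost ...) are assumptions. The key idea is that pairwise shortest-path distances in the original map become direct edges with numeric costs in the task-oriented subgraph.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path costs from src over a weighted adjacency dict."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def compress_map(adj, task_nodes):
    """Reconnect task-relevant nodes into a dense subgraph: every
    reachable pair gets a (connected a b) fact and a travel_cost
    equal to the shortest-path distance in the original map."""
    facts = []
    for a in task_nodes:
        dist = dijkstra(adj, a)
        for b in task_nodes:
            if a != b and b in dist:
                facts.append(f"(connected {a} {b})")
                facts.append(f"(= (travel_cost {a} {b}) {dist[b]:.1f})")
    return facts

# Toy map: node5 -- corridor -- kitchen; only node5 and kitchen
# are task-relevant, so the corridor node is compressed away.
adj = {
    "node5": [("corridor", 2.0)],
    "corridor": [("node5", 2.0), ("kitchen", 3.0)],
    "kitchen": [("corridor", 3.0)],
}
facts = compress_map(adj, ["node5", "kitchen"])
print(facts)
```

Note that the intermediate corridor node disappears from the planning problem entirely; its length survives only inside the aggregated travel_cost, which is what keeps the PDDL problem small.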
Domain expansion is performed once and is reusable across environments. By parsing the original PDDL domain into an abstract syntax tree, UniPlan identifies semantic anchors—(hand_free) and (holding)—and injects location predicates (rob_at_node, obj_at_node) into every operator. It adds a navigation action (move_robot) that respects (connected) and (has_door) edges, and a door‑opening action (open_door) that enforces “open before traverse”. For bimanual tasks, hand parameters are introduced, turning (hand_free ?r) into (hand_free ?r ?h) and (holding ?r ?o) into (holding ?r ?h ?o). Costs are assigned: constant for manipulation actions and the numeric travel_cost for navigation, enabling the planner to optimize a combined metric.
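An AST-based rewrite of this kind can be illustrated with a tiny s-expression parser. The sketch below is an assumption about how such a rule might look, not UniPlan's code: the pick action, its predicates, and the injected (rob_at_node ?r ?n) literal are made up for the example, and a full rewrite would also have to add ?n to :parameters and handle :effect clauses.

```python
import re

def parse(sexp):
    """Parse a PDDL s-expression into nested Python lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", sexp)
    def read(i):
        if tokens[i] == "(":
            lst, i = [], i + 1
            while tokens[i] != ")":
                node, i = read(i)
                lst.append(node)
            return lst, i + 1
        return tokens[i], i + 1
    return read(0)[0]

def unparse(node):
    """Serialize the nested-list AST back to PDDL text."""
    if isinstance(node, list):
        return "(" + " ".join(unparse(n) for n in node) + ")"
    return node

def add_location_precondition(action, pred):
    """Deterministic rewrite rule: conjoin a location predicate
    into an operator's :precondition, flattening an existing
    (and ...) so the result stays a single conjunction."""
    i = action.index(":precondition")
    old = action[i + 1]
    body = old[1:] if isinstance(old, list) and old[0] == "and" else [old]
    action[i + 1] = ["and", parse(pred)] + body
    return action

pick = parse("""(:action pick
  :parameters (?r ?o)
  :precondition (and (hand_free ?r) (graspable ?o))
  :effect (and (holding ?r ?o) (not (hand_free ?r))))""")
pick = add_location_precondition(pick, "(rob_at_node ?r ?n)")
rewritten = unparse(pick)
print(rewritten)
```

Because the transform operates on the parsed tree rather than on raw text, the same rule applies uniformly to every operator in the domain, which is what makes the expansion reusable across environments.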
With the unified PDDL problem constructed, an off‑the‑shelf planner such as Fast Downward generates a single, cost‑optimal plan that interleaves movement, door handling, and manipulation. Because navigation and manipulation are modeled in the same symbolic space, the planner can reason about trade‑offs (e.g., taking a longer route to avoid a closed door) that decoupled pipelines cannot.
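To make the unified formulation concrete, the sketch below assembles a toy PDDL problem combining navigation facts and a manipulation goal. Everything here is illustrative: the domain name uniplan, the object names, and the fact syntax are assumptions consistent with the predicates mentioned above, not the paper's actual files.

```python
def build_problem(objects, init_facts, goal):
    """Assemble a unified PDDL problem string in which navigation
    costs and manipulation goals live in the same symbolic space."""
    return "\n".join([
        "(define (problem fetch-cup) (:domain uniplan)",
        "  (:objects " + " ".join(objects) + ")",
        "  (:init " + " ".join(init_facts) + ")",
        "  (:goal " + goal + ")",
        "  (:metric minimize (total-cost)))",
    ])

problem = build_problem(
    objects=["rob - robot", "cup - object", "node5 kitchen - node"],
    init_facts=[
        "(rob_at_node rob node5)",
        "(obj_at_node cup kitchen)",
        "(connected node5 kitchen)",
        "(= (travel_cost node5 kitchen) 5.0)",
        "(= (total-cost) 0)",
    ],
    goal="(holding rob cup)",
)
print(problem)
# Written to disk alongside the extended domain, this could be
# solved with a cost-optimal Fast Downward configuration, e.g.:
#   fast-downward.py domain.pddl problem.pddl \
#       --search "astar(lmcut())"
```

Here travel_cost enters the plan metric through action-cost effects such as (increase (total-cost) (travel_cost ?from ?to)) on the move action, which is exactly what lets the solver weigh a longer route against the fixed cost of opening a door.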
Experimental evaluation was conducted in a real indoor environment comprising dozens of rooms and hundreds of annotated images. Four settings were tested: single-arm vs. dual-arm, with and without doors. Human-raised tasks (e.g., “fetch a glass of water from the kitchen and place it on the living-room table”) were used as benchmarks. UniPlan achieved a success rate above 92%, reduced average plan length by roughly 15% compared to baselines, and required less than 0.8 seconds of planning time per query. In contrast, a VLM-only planner succeeded on only 58% of tasks with an average planning time of 4.2 seconds, while an LLM+PDDL approach reached 71% success with 2.9 seconds per query. Ablation studies confirmed that (i) omitting map compression dramatically increased computation and lowered success, and (ii) removing the programmatic domain extensions prevented the planner from handling doors or navigation, leading to failure.
Overall, UniPlan demonstrates that a carefully engineered combination of foundation vision‑language models, a visual‑topological representation, and a programmatically extensible symbolic domain can scale robot task planning from tabletop to full‑building mobile manipulation. The approach is domain‑agnostic: any learned or hand‑crafted tabletop PDDL domain can be automatically upgraded using the same AST rewrite rules, making UniPlan a versatile foundation for future embodied AI systems that must operate reliably in complex, real‑world indoor spaces.