Using VLM Reasoning to Constrain Task and Motion Planning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain’s refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on three challenging TAMP domains show that our approach is able to extract plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, generalizing to a diverse range of instances from the broader domain.


💡 Research Summary

The paper addresses a fundamental inefficiency in Task and Motion Planning (TAMP), namely the frequent “downward‑refinement failures” where a symbolic task plan cannot be realized by a feasible motion plan. Traditional approaches either sample geometric parameters before planning (sample‑first methods such as PDDLStream) or add constraints only after a failure is encountered (plan‑first methods like IDTMP and COAST). Both strategies waste considerable computation because they explore infeasible branches before learning the necessary constraints.

VIZ‑COAST (Visual Insight for Z3‑based Constraints and Streams) proposes a proactive solution: it leverages the common‑sense spatial reasoning of large pretrained Vision‑Language Models (VLMs) to predict potential refinement problems before any planning begins, and injects the resulting constraints directly into an SMT‑based task planner (Z3). The system consists of three main components:

  1. SMT‑based Task Planner – Using the Madagascar encoding of SATPlan, the planner can accept arbitrary first‑order constraints via Z3’s Python API, avoiding the need to modify PDDL files and keeping search overhead low.
  2. Visual Reasoning Module (VRM) – Given an example scene (image + geometric description), the PDDL domain/problem, and a natural‑language description of the grasp sampling strategy, the VLM is prompted in two steps. First, it identifies likely geometric conflicts (e.g., “object A is blocked by object B”) and expresses them in natural language. Second, it translates these statements into executable Python code that adds Z3 constraints (e.g., solver.add(Not(And(ActionPick(obj), InClosedContainer(obj))))). If the initial constraints are insufficient, a feedback loop re‑prompts the VLM with the planner’s failure trace to generate additional constraints.
  3. Stream‑based Motion Grounding – After the task planner produces a high‑level plan, the existing PDDLStream framework samples continuous trajectories and grasp poses, verifying that the plan can be refined into executable motion.

The authors evaluate VIZ‑COAST on three challenging domains: (1) tabletop object rearrangement, (2) manipulation inside closed containers, and (3) multi‑room navigation with key‑door dependencies. For each domain they generate 30 random problem instances and compare against PDDLStream, IDTMP, and COAST. Results show a substantial reduction in total planning time (average 45 % faster) and a dramatic drop in replanning episodes. In the container domain, refinement failures drop from a non‑zero baseline to zero, because the VLM correctly infers the constraint “no pick or place actions are allowed on objects inside a closed container.” Overall success rates improve from 92 % to 98 %, while the constraint‑generation step adds only 1–2 seconds per scene.

The paper also discusses limitations. VLM reasoning is image‑dependent; adverse lighting, heavy occlusion, or transparent objects can cause missed constraints. The code‑generation stage may produce syntactic errors, requiring lightweight static analysis or human correction. Moreover, the current pipeline assumes a static example scene; extending to truly online, dynamic environments would need continuous visual input and incremental constraint updates.
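The "lightweight static analysis" mentioned above could be as simple as parsing the generated script and checking it against an allow-list before handing it to the planner. The sketch below is an assumption on my part, not part of the paper's pipeline; the allowed-call set is a guessed subset of the Z3 Python API.

```python
import ast

def validate_constraint_script(source: str) -> list[str]:
    """Hypothetical static check for VLM-generated constraint code:
    reject scripts that fail to parse, and flag calls to any name
    outside an allow-list of expected Z3 constructors."""
    allowed_calls = {"And", "Or", "Not", "Implies", "Bool", "Int", "solver.add"}
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e.msg} (line {e.lineno})"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)
            if name not in allowed_calls:
                problems.append(f"disallowed call: {name}")
    return problems

# A well-formed generated snippet passes the check...
ok = "solver.add(Not(And(Bool('pick'), Bool('closed'))))"
print(validate_constraint_script(ok))  # []

# ...while a truncated or malicious one is caught before it runs.
print(validate_constraint_script("solver.add(Not(And(Bool('pick'),"))
print(validate_constraint_script("os.system('rm -rf /')"))
```

A check like this catches the syntactic errors the authors mention without executing untrusted generated code, though it cannot verify that the constraints are semantically correct.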

Future work suggested includes: (a) refining multimodal prompting with chain‑of‑thought techniques to increase VLM reliability; (b) integrating formal verification tools to automatically validate generated constraint scripts; and (c) deploying the system on physical robots to test real‑time constraint adaptation.

In summary, VIZ‑COAST demonstrates that pretrained VLMs can serve as a powerful source of prior geometric knowledge, enabling TAMP systems to prune infeasible branches before they are explored. By converting visual commonsense into SMT constraints, the method achieves significant speed‑ups and higher reliability without requiring domain‑specific training data, marking a notable advance in the integration of large‑scale vision‑language models with classical planning frameworks.

