From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
💡 Research Summary
This paper tackles the challenge of long‑horizon decision making in robotics by learning abstract symbolic world models directly from a handful of short‑horizon, image‑based demonstrations. The key innovation is the use of pretrained vision‑language models (VLMs) to automatically propose a large pool of high‑level visual predicates (e.g., NoObjectsOnTop(?table), IsEraser(?obj)) and to evaluate these predicates on raw camera images. During training, the proposed predicates together with the demonstrations are fed into a bilevel optimization framework. The lower level estimates the truth values of each predicate across the demonstration trajectories, while the upper level selects a compact subset of predicates and learns corresponding PDDL‑style operators that maximize planning efficiency (i.e., minimize the abstract state space and operator count).
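The upper-level selection problem can be illustrated with a small sketch. This is a toy surrogate, not the authors' implementation: it scores candidate predicate subsets by whether they keep consecutive demonstration states distinguishable (so demonstrated transitions remain visible in the abstraction) while using as few predicates as possible. The state encoding, the transition-distinguishability criterion, and the tiebreaking rule are all illustrative assumptions.

```python
from itertools import combinations

def abstract_states(trajectory, subset):
    """Project each low-level state (a set of (predicate, *args) atoms)
    onto the chosen predicate subset."""
    return [frozenset(a for a in state if a[0] in subset) for state in trajectory]

def distinguishes_transitions(trajectory, subset):
    """A useful abstraction should not collapse consecutive, distinct states."""
    abs_states = abstract_states(trajectory, subset)
    return all(a != b for a, b in zip(abs_states, abs_states[1:]))

def select_predicates(trajectory, candidates):
    """Pick the smallest subset (name-sorted tiebreak) that keeps all
    demonstrated transitions visible at the abstract level."""
    valid = [
        frozenset(s)
        for r in range(1, len(candidates) + 1)
        for s in combinations(sorted(candidates), r)
        if distinguishes_transitions(trajectory, frozenset(s))
    ]
    return min(valid, key=lambda s: (len(s), sorted(s)))

# Toy demo: a pick action removes On(a, b) and HandEmpty, adds Holding(a).
# IsRed(a) never changes, so it alone cannot distinguish the transition.
traj = [
    {("On", "a", "b"), ("HandEmpty",), ("IsRed", "a")},
    {("Holding", "a"), ("IsRed", "a")},
]
chosen = select_predicates(traj, {"On", "HandEmpty", "Holding", "IsRed"})
```

Note how a static predicate like `IsRed` is rejected: it yields identical abstract states before and after the pick, so keeping it buys nothing for explaining the demonstration.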
The resulting method, pix2pred, follows a generate‑then‑subselect pipeline: (1) VLM‑driven generation of candidate predicate names via natural‑language prompts; (2) labeling of each candidate on every demonstration state using the VLM; (3) selection of the most useful predicates and learning of operators (including soft precondition intersection to mitigate VLM noise). For skills that require continuous parameters, generative samplers are trained, again guided by VLM outputs.
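The first two stages of this pipeline can be sketched as follows. The `vlm_propose` and `vlm_label` functions are stand-ins for real VLM queries (assumptions for illustration, not an actual API); here they return canned outputs so the data flow is visible.

```python
def vlm_propose(type_names):
    """Stage 1 (stubbed): in the real pipeline, a VLM is prompted with the
    object types and returns candidate predicate names in natural language."""
    return ["IsEraser(?obj)", "NoObjectsOnTop(?table)", "Holding(?obj)"]

def vlm_label(predicate, image):
    """Stage 2 (stubbed): the VLM is asked whether `predicate` holds in
    `image`; here we return a fixed placeholder instead of querying a model."""
    return True

def label_demonstrations(demo_images, type_names):
    """Build the predicate-by-state truth table that stage 3 (predicate
    selection and operator learning) consumes."""
    candidates = vlm_propose(type_names)
    return {p: [vlm_label(p, img) for img in demo_images] for p in candidates}

table = label_demonstrations(["frame0.png", "frame1.png"], ["obj", "table"])
```

Because stage 2 queries a noisy model, the downstream operator learner cannot take these labels at face value; this is what motivates the soft precondition intersection mentioned above.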
At test time, given a novel set of objects, a new initial visual scene, and a goal expressed in the initial predicate vocabulary, the system first uses the VLM to compute the current abstract state (truth values of the selected predicates). A standard PDDL planner then searches for a sequence of operators that achieves the goal. Each operator is mapped back to a low‑level, parameterized skill, which is executed on the robot. Re‑planning can be performed if the environment changes during execution.
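The test-time loop above can be sketched with a toy breadth-first search over abstract states standing in for the PDDL planner. The operators, skill names, and state encoding below are illustrative assumptions, not the paper's learned model.

```python
from collections import deque

def plan(init, goal, operators):
    """Toy symbolic planner: BFS over abstract states. Each operator is a
    (preconditions, add effects, delete effects) triple of ground-atom sets."""
    frontier = deque([(frozenset(init), [])])
    seen = {frozenset(init)}
    while frontier:
        state, actions = frontier.popleft()
        if goal <= state:
            return actions
        for name, (pre, add, delete) in operators.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, actions + [name]))
    return None  # goal unreachable under the learned model

# Illustrative operators for a pick-and-place skill pair.
operators = {
    "Pick(a)": ({"Clear(a)", "HandEmpty"}, {"Holding(a)"}, {"HandEmpty", "Clear(a)"}),
    "PlaceOnTable(a)": ({"Holding(a)"}, {"OnTable(a)", "HandEmpty", "Clear(a)"}, {"Holding(a)"}),
}
init = {"On(a,b)", "Clear(a)", "HandEmpty"}  # abstract state, as labeled by the VLM
goal = {"OnTable(a)"}                        # novel goal in the predicate vocabulary
skills = plan(init, goal, operators)         # each name maps to a low-level skill
```

In the full system, each returned operator name is grounded into a parameterized skill and executed; if the VLM-labeled abstract state after execution diverges from the planner's prediction, the same `plan` call can simply be re-run from the new state.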
Empirical evaluation spans three simulated domains and real‑world experiments on a Boston Dynamics Spot robot. In simulation, pix2pred outperforms prior approaches that rely on handcrafted abstractions, online interaction, or dense language annotations, achieving higher success rates, handling more objects, and solving longer‑horizon tasks. In the real‑world setting, the robot solves two complex cleanup tasks in visually distinct rooms, despite having been trained on only a few human demonstration videos. The approach requires no online interaction or environment resets, demonstrating strong data efficiency.
Overall, the paper contributes (1) a novel VLM‑based predicate invention mechanism that bridges raw pixels to symbolic logic, (2) an optimization‑driven selection process that tailors the symbolic world model for efficient planning, and (3) a demonstration‑only learning pipeline that enables zero‑shot generalization to new goals, objects, and visual backgrounds. Limitations include dependence on the quality of VLM prompts and difficulty handling intricate 3D relational concepts, which the authors suggest as future work.