GoalLadder: Incremental Goal Discovery with Vision-Language Models


Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, GoalLadder, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses this feedback to rank potential goal states with an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for the abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, achieving an average final success rate of $\sim$95% compared to only $\sim$45% for the best competitor.


💡 Research Summary

The paper “GoalLadder: Incremental Goal Discovery with Vision-Language Models” presents a novel framework for training reinforcement learning (RL) agents in visual environments using only a single, natural language instruction. It addresses key limitations of prior work, which relied on non-visual representations, required prohibitively large amounts of feedback from large models, or generated noisy reward functions.

GoalLadder’s core innovation is an iterative process that discovers and ranks intermediate “goal states” that progressively lead toward task completion. The method operates through a repeating cycle. First, the RL agent (built on Soft Actor-Critic) interacts with the environment to collect a trajectory of visual observations. Second, in the Discovery phase, states from this trajectory are uniformly sampled and presented to a Vision-Language Model (VLM), such as Gemini, alongside the current top-rated goal state. The VLM is asked which state is closer to completing the language-specified task. If a sampled state is deemed better, it is added to a candidate goal buffer. This acts as a filter, ensuring only promising states are evaluated further.
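A minimal sketch of this filtering step, where `vlm_prefers` is a hypothetical wrapper around the VLM query (the function name, signature, and sampling details are illustrative assumptions, not the paper's exact implementation):

```python
import random

def discovery_step(trajectory, top_goal, candidate_buffer, vlm_prefers,
                   instruction, n_samples=4):
    """Filter step: sample states from the latest trajectory and keep only
    those the VLM judges closer to completing `instruction` than the
    current top-rated goal state."""
    for state in random.sample(trajectory, min(n_samples, len(trajectory))):
        # vlm_prefers(a, b, instruction) -> True if state `a` shows more
        # task progress than `b`; a hypothetical wrapper around a VLM call.
        if vlm_prefers(state, top_goal, instruction):
            candidate_buffer.append(state)
    return candidate_buffer
```

Only states that win this comparison enter the candidate buffer, so expensive pairwise ranking is reserved for promising states.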

Third, in the Ranking phase, pairs of candidate goals are sampled from the buffer and compared by the VLM. Crucially, the outcomes of these pairwise comparisons are not trusted absolutely. Instead, they feed into an ELO-based rating system, similar to chess player rankings. Each candidate goal has a rating that is updated based on the expected outcome (given the current ratings) and the actual VLM judgment. This design makes the system robust to individual instances of noisy or incorrect VLM feedback, allowing a reliable hierarchy of goals to emerge over time.
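The ELO update itself follows the standard chess formula: each goal's rating moves by the gap between the VLM's actual verdict and the outcome expected from the current ratings. A sketch (the K-factor of 32 is an illustrative default, not a value taken from the paper):

```python
def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Update two candidate-goal ratings after one VLM pairwise judgment.

    a_wins: 1.0 if the VLM preferred goal A, 0.0 if it preferred goal B.
    k: step size; a smaller k makes the hierarchy more robust to any
       single noisy or incorrect VLM judgment.
    """
    # Expected score of A under the standard logistic ELO model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b
```

Because each comparison only nudges ratings by at most `k` points, one wrong VLM answer cannot overturn a ranking built from many consistent judgments.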

Fourth, in the Training phase, the highest-rated candidate goal becomes the agent’s current target. The reward function is defined as the negative Euclidean distance to this target state, not in pixel space, but in a learned latent embedding space. This space is created by a Variational Autoencoder (VAE) trained in a self-supervised manner on the agent’s unlabeled visual observations. This allows the reward to generalize to unseen states without requiring additional VLM queries. The agent’s past experience is periodically relabeled with this new reward function to account for the non-stationary target.
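The reward computation reduces to a distance in the VAE's latent space; a sketch, where `encoder` stands in for the trained VAE encoder (an assumed callable, not the paper's exact interface):

```python
import numpy as np

def latent_reward(encoder, obs, goal_obs):
    """Dense reward: negative Euclidean distance between the latent
    embeddings of the current observation and the top-ranked goal.

    encoder: maps a visual observation to a latent vector (here assumed
    to be the VAE encoder's mean output). No VLM query is needed, so
    this reward can be evaluated on every transition and used to
    relabel past experience when the top-ranked goal changes.
    """
    z_obs = encoder(obs)
    z_goal = encoder(goal_obs)
    return -float(np.linalg.norm(z_obs - z_goal))
```

Computing rewards in this learned space is what lets the method generalize to unseen states without further VLM calls.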

The key advantages are twofold: robustness to noisy feedback (via ELO ratings) and query efficiency (due to the filtering step and the use of a learned embedding space for dense reward calculation, minimizing expensive VLM calls).

Experiments were conducted on classic control and five robotic manipulation tasks from Meta-World (e.g., drawer open, button press). GoalLadder was compared against embedding-based methods (using CLIP) and preference-based methods (RL-VLM-F). The results were striking: GoalLadder achieved an average final success rate of ~95%, drastically outperforming the best baseline at ~45%. It performed nearly as well as an oracle agent with access to the ground-truth reward signal, even surpassing it on one task. This demonstrates that GoalLadder not only approximates a reward but facilitates efficient exploration through incremental goal setting.

In summary, GoalLadder offers a principled and practical framework for leveraging the commonsense knowledge of VLMs for RL in visual domains, while intelligently mitigating their weaknesses in spatial reasoning and reliability. It moves beyond using VLMs as monolithic reward generators, instead employing them as comparators within a robust, adaptive system that discovers the path to a goal step-by-step.

