Reading time: 9 minutes

๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.20675
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with ground truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvement in recent approaches can be attributed to differences in data and architectures.

Full Content

The increasing momentum of embodied intelligence has motivated the study of generalizable reward models that do not rely on hand-crafted human supervision. Recent work leverages large vision-language models (VLMs) as general-purpose reward functions that measure progress through alignment between visual observations and language goals. Rocamonde et al. (2024) showed that CLIP (Radford et al., 2021) can be used as a zero-shot reward model for downstream policy learning, though it struggles with domain shift related to action specification and environmental dynamics. To mitigate this effect, many methods have turned to large-scale video demonstration datasets (Damen et al., 2018; Grauman et al., 2022) to finetune VLMs, enabling a better understanding of task-relevant behaviors. Nevertheless, the question of extracting meaningful signals of task progress from video demonstrations remains a central challenge for vision-language reward modeling. As a result, increasingly complex learning objectives have been proposed to generate more accurate rewards (Sermanet et al., 2018; Nair et al., 2022; Ma et al., 2023a;b; Karamcheti et al., 2023). However, most of these methods were pre-trained using different datasets and architectures, making it difficult to isolate the effect of the learning objective on downstream performance. For instance, R3M (Nair et al., 2022) was pre-trained on Ego4D (Grauman et al., 2022) yet compares to methods pre-trained on ImageNet (Russakovsky et al., 2015; Parisi et al., 2022). LIV (Ma et al., 2023a) was initialized with CLIP weights and pre-trained on EPIC-KITCHENS (Damen et al., 2018) yet compares to methods utilizing a frozen BERT encoder (Devlin et al., 2019) and a vision encoder pre-trained from scratch on Ego4D (Nair et al., 2022).

In this work, we systematically compare objectives under a unified framework, holding the pretrained model backbone, finetuning data, and downstream evaluation environments constant. This setup allows us to decouple the impact of the learning objectives from any other confounding factors. We assess model performance via two distinct benchmarks, evaluating consistency with ground truth reward and alignment with expert progress. While we initially introduced it as a minimal baseline for comparison with recent methods, our results show that a very simple triplet loss (Schroff et al., 2015) can surpass current state-of-the-art learning objectives. More broadly, our findings indicate that simpler ranking-based learning objectives offer greater accuracy and robustness, suggesting that much of the apparent progress in recent methods could instead be due to differences in data and model architecture.

State-of-the-art contrastive reward modeling approaches rely on combinations of learning objectives, either in their original form or adapted for alignment with language. For our experiments, we focus on a set of loss functions that can be used to reproduce many popular reward models.

TCN (Sermanet et al., 2018) reduces the embedding distance of images which are closer in time and increases the distance of images which are temporally further apart. Given an image encoder $\pi_{\text{img}}$, a sequence of images $\{I_t\}_{t=0}^{k}$, their corresponding encodings $z_t = \pi_{\text{img}}(I_t)$, batches of $[I_i, I_{j>i}, I_{k>j}]_{1:B}$, and a similarity function $S$, TCN minimizes the following objective:
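One common instantiation, sketched here under the assumption of a softmax ranking over the similarity $S$ between the anchor frame and its temporally near versus far counterparts (the exact formulation used in the paper may differ), is:

$$\mathcal{L}_{\text{TCN}} = -\,\mathbb{E}_{[I_i, I_{j>i}, I_{k>j}]_{1:B}}\left[\log \frac{\exp\big(S(z_i, z_j)\big)}{\exp\big(S(z_i, z_j)\big) + \exp\big(S(z_i, z_k)\big)}\right]$$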

We adapt $\mathcal{L}_{\text{TCN}}$ to language by replacing $I_{k>j}$ with the language annotation for the given sequence.

Inserting the language embedding $v = \pi_{\text{text}}(l)$ in its place yields the language-conditioned objective $\mathcal{L}_{\text{TCN-text}}$.

VIP (Ma et al., 2023b) uses a goal-conditioned approach that separates embeddings of temporally adjacent images while bringing together embeddings of images that are farther apart in time. Given batches of $[I_i, I_{j>i}, I_{j+1}, I_{k \geq j+1}]_{1:B}$, VIP is defined as:

Just like TCN, we adapt $\mathcal{L}_{\text{VIP}}$ to language by replacing $I_{k \geq j+1}$ with the embedded language annotation $v = \pi_{\text{text}}(l)$ for the given sequence.

Combinations of these losses reproduce popular reward modeling methods, minus their regularization terms. R3M is obtained by combining both TCN versions:

$$\mathcal{L}_{\text{R3M}} = \mathcal{L}_{\text{TCN}} + \mathcal{L}_{\text{TCN-text}}$$

Note that R3M adapts TCN to language by training an MLP to predict a similarity score from the concatenated vector $[z_0, z_i, v]$, where $z_0$ is the first image in the sequence. This approach introduces a non-negligible number of additional parameters, especially for large embeddings like those from SigLIP2. For fairness, we omit the initial embedding in our setup.

LIV is a combination of both VIP versions plus an InfoNCE objective (Oord et al., 2018), the latter of which we also include in our experiments:

$$\mathcal{L}_{\text{LIV}} = \mathcal{L}_{\text{VIP}} + \mathcal{L}_{\text{VIP-text}} + \mathcal{L}_{\text{InfoNCE}}$$
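For reference, the InfoNCE term can be sketched in its standard CLIP-style image-text form, assuming a temperature $\tau$ and in-batch negatives (the paper may use a slightly different variant):

$$\mathcal{L}_{\text{InfoNCE}} = -\,\frac{1}{B}\sum_{b=1}^{B} \log \frac{\exp\big(S(z^{(b)}, v^{(b)})/\tau\big)}{\sum_{b'=1}^{B} \exp\big(S(z^{(b)}, v^{(b')})/\tau\big)}$$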

Triplet (Schroff et al., 2015), applied to this context, reduces the embedding distance between the language goal and the later images of a sequence, and pushes earlier images further away. We propose it as a simple baseline for task progress and show its remarkable effectiveness. Given batches of $[I_i, I_{j>i}, l]_{1:B}$, we use the language annotation as the anchor, the later image as the positive, and the earlier image as the negative example. The triplet loss is defined as:
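In its standard margin-based form (Schroff et al., 2015), written here with the similarity function $S$ and the language embedding $v = \pi_{\text{text}}(l)$ as the anchor (a sketch; the paper's exact equation may differ in normalization),

$$\mathcal{L}_{\text{Triplet}} = \mathbb{E}_{[I_i, I_{j>i}, l]_{1:B}}\Big[\max\big(0,\; S(v, z_i) - S(v, z_j) + \alpha\big)\Big]$$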

where $\alpha$ is the margin controlling how far apart the negative should be from the positive.

Learning objectives. We finetune using different configurations of the losses described in Section 2.


Model backbone. To isolate the contribution of the learning objective from model capacity and pretraining, we employ the base version of SigLIP2 (Tschannen et al., 2025) as the backbone for all methods. We finetune using LoRA (Hu et al., 2021) to prevent overfitting.
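To make this concrete, a minimal sketch of loading a SigLIP2 backbone and attaching LoRA adapters with the PEFT library might look as follows; the checkpoint name, rank, and target module names are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: SigLIP2 backbone wrapped with LoRA adapters.
# Checkpoint name, rank, and target modules are assumptions for illustration.
from transformers import AutoModel, AutoProcessor
from peft import LoraConfig, get_peft_model

ckpt = "google/siglip2-base-patch16-224"          # assumed SigLIP2 base checkpoint
model = AutoModel.from_pretrained(ckpt)           # dual image/text encoder
processor = AutoProcessor.from_pretrained(ckpt)   # image + text preprocessing

lora_cfg = LoraConfig(
    r=16,                                         # assumed adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)           # only LoRA adapters are trainable
model.print_trainable_parameters()
```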

Data. Each model is finetuned on expert demonstrations from the Meta-World environments (Yu et al., 2021). To evaluate out-of-domain performance, we remove 3 tasks from the training data (see Section 3.2). For the remaining tasks, we collect 3 expert trajectories per task with randomized end-effector initializations, yielding 132 trajectories in total. To keep the dataset size consistent, we cap the number of samples at 50,000 and generate them evenly across trajectories. Each timestep is recorded from 3 camera views to enable multi-view training. The timesteps within a sample remain fixed between epochs, but their associated views are randomly reassigned at each iteration, acting as a form of data augmentation that promotes view-invariant representations and reduces overfitting.
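As an illustration of the view-reshuffling augmentation, a minimal dataset sketch could look like the following; the data layout and the load_frame helper are hypothetical and only show how views are resampled each time a sample is drawn.

```python
import random

class MultiViewProgressDataset:
    """Sketch: timesteps per sample are fixed, camera views are resampled."""

    def __init__(self, samples, num_views=3):
        # samples[i] = (trajectory_id, timestep_indices, language_annotation)
        self.samples = samples
        self.num_views = num_views

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        traj_id, timesteps, lang = self.samples[idx]
        # The same timesteps are returned every epoch, but each one is rendered
        # from a freshly sampled camera view, acting as a view augmentation.
        images = [
            load_frame(traj_id, t, view=random.randrange(self.num_views))  # hypothetical loader
            for t in timesteps
        ]
        return images, lang
```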

Implementation details. For a complete list of implementation details, refer to Appendix A.1.

We wish to learn generalizable reward models that can leverage pre-training knowledge to predict progress on unseen tasks. To this end, we exclude the button-press, drawer-open, and door-open environments from the finetuning data and reserve them for evaluation. Button-press is relatively simple, as success depends mainly on end-effector proximity to a target. In contrast, drawer-open and door-open are compositional, requiring navigation to the object and subsequent manipulation. Note that we evaluate only the standard button-press but remove all its variants from training to prevent leakage.

To evaluate the performance of finetuned reward models on out-of-domain tasks, we introduce two benchmarks.

(1) To measure reward modeling robustness, we collect random and sub-optimal rollouts from each test environment to construct per-task datasets of 10,000 timestep pairs, each labeled by which timestep has the higher ground-truth reward. The model predicts the higher-reward timestep by comparing each timestep's similarity to the language goal. Accuracy is defined as the percentage of pairs where this prediction matches the ground-truth ordering. We refer to this benchmark as consistency with ground-truth reward. While informative about robustness, this metric does not necessarily indicate that a model can guide an agent towards a goal.
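A minimal sketch of this pairwise accuracy computation; the function and variable names are illustrative and not taken from the paper's code.

```python
def consistency_accuracy(pairs, sim_to_goal):
    """pairs: iterable of ((obs_a, obs_b), label), where label is 0 if obs_a
    has the higher ground-truth reward and 1 otherwise. sim_to_goal maps an
    observation to its similarity with the language goal embedding."""
    correct, total = 0, 0
    for (obs_a, obs_b), label in pairs:
        # Predict whichever observation is more similar to the language goal.
        pred = 0 if sim_to_goal(obs_a) > sim_to_goal(obs_b) else 1
        correct += int(pred == label)
        total += 1
    return correct / total
```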

(2) To evaluate this capacity, we collect 50 expert trajectories per task and predict rewards along each trajectory. Assuming expert timesteps correlate with increasing ground-truth reward, we compute the Value-Order Correlation (VOC; Ma et al., 2024) to measure how well the model's predicted reward ordering aligns with goal-directed behavior. A model should perform well on both of these benchmarks to yield transferable rewards.
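A minimal sketch of this evaluation, assuming VOC is computed as a rank correlation (for example Spearman) between predicted rewards and chronological order along an expert trajectory; see Ma et al. (2024) for the exact definition.

```python
from scipy.stats import spearmanr

def value_order_correlation(predicted_rewards):
    """Rank correlation between predicted rewards and timestep order; +1 means
    the predicted reward increases monotonically along the expert trajectory."""
    timesteps = list(range(len(predicted_rewards)))
    rho, _ = spearmanr(timesteps, predicted_rewards)
    return rho
```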

We report all results both on a per-view basis and with multi-view aggregation. The multi-view score is obtained by averaging the similarity scores across the views of a timestep and using that average for prediction instead. This evaluates whether the objectives were able to take advantage of the multi-view augmentation to learn view-agnostic representations.

Consistency with ground truth reward. Table 1 summarizes each model's per-task accuracy on the consistency with ground-truth reward benchmark. The triplet loss achieves the highest overall accuracy across all held-out tasks, surpassing both TCN-based and VIP-based objectives. While $\mathcal{L}_{\text{TCN-text}}$ and $\mathcal{L}_{\text{R3M}}$ perform similarly, objectives incorporating VIP components exhibit inconsistent ranking ability, often close to random chance. The base SigLIP2 backbone performs slightly below 50%, confirming that finetuning on expert demonstrations is necessary to encode temporal semantics.

Meta-World task names and their language goal annotations (excerpt):

  • handle-pull-side-v3: pull the handle up
  • handle-pull-v3: pull the handle up
  • lever-pull-v3: pull the lever down 90 degrees
  • pick-place-wall-v3: pick the red object, bypass the wall and place the object on the target
  • pick-out-of-hole-v3: pick up the object from the hole
  • pick-place-v3: pick and place the red object to the goal
  • plate-slide-v3: slide the puck into the hockey net
  • plate-slide-side-v3: slide the puck into the hockey net
  • plate-slide-back-v3: get the puck from the hockey net
  • plate-slide-back-side-v3: get the puck from the hockey net
  • peg-insert-side-v3: insert the peg in the hole
  • peg-unplug-side-v3: unplug the peg from the hole
  • soccer-v3: kick the soccer ball into the goal
  • stick-push-v3: grasp the blue stick and push the box using the stick
  • stick-pull-v3: grasp the blue stick and pull the box using the stick
  • push-v3: push the red object to the green objective
  • push-wall-v3: bypass the wall and push the object behind the wall
  • push-back-v3: push the red object back to the green objective
