Policy Contrastive Decoding for Robotic Foundation Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Robotic foundation models, or generalist robot policies, hold immense potential to enable flexible, general-purpose and dexterous robotic systems. Despite their advancements, our empirical experiments reveal that existing robot policies are prone to learning spurious correlations from pre-training trajectories, adversely affecting their generalization capabilities beyond the training data. To tackle this, we propose a novel Policy Contrastive Decoding (PCD) approach, which redirects the robot policy’s focus toward object-relevant visual clues by contrasting action probability distributions derived from original and object-masked visual inputs. As a training-free method, our PCD can be used as a plugin to improve different types of robot policies without needing to finetune or access model weights. We conduct extensive experiments on top of three open-source robot policies, including the autoregressive policy OpenVLA and the diffusion-based policies Octo and $π_0$. The obtained results in both simulation and real-world environments prove PCD’s flexibility and effectiveness, e.g., PCD enhances the state-of-the-art policy $π_0$ by 8.9% in the simulation environment and by 108% in the real-world environment. Code and demos are publicly available at: https://koorye.github.io/PCD.


💡 Research Summary

Robotic foundation models have demonstrated impressive manipulation capabilities across a wide range of tasks, yet they often learn spurious correlations from large pre‑training datasets. These correlations cause policies to rely on irrelevant visual cues such as background textures, lighting, or object placement, leading to severe performance drops when the deployment environment changes. The paper introduces Policy Contrastive Decoding (PCD), a training‑free, plug‑and‑play inference‑time technique that forces a policy to focus on object‑relevant visual information.
The core idea is simple: generate two versions of the current observation—(1) the original image and (2) an object‑masked image in which the target object is removed. The policy is run on both inputs, producing two action probability distributions. By contrasting these distributions with a tunable exponent α, PCD amplifies cues that are present in the original image but vanish under masking (i.e., object‑related evidence) while suppressing cues that survive masking (spurious background evidence). The resulting distribution is used for action selection without modifying any model weights.
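The contrastive step can be sketched as follows. This is a minimal illustration, not the paper's exact formula: it assumes a discrete action-token distribution (OpenVLA-style) and uses the common contrastive-decoding form `log p ∝ (1 + α)·log p_orig − α·log p_masked`, which realizes the amplify/suppress behavior described above.

```python
import numpy as np

def contrastive_decode(p_orig, p_masked, alpha=1.0, eps=1e-8):
    """Contrast two action distributions in log space (sketch).

    Evidence that raises the probability under the original image but
    not under the object-masked image (object-related cues) is
    amplified; evidence shared by both views (spurious background
    cues) is suppressed. The exact formulation in the paper may differ.
    """
    log_p = (1 + alpha) * np.log(p_orig + eps) - alpha * np.log(p_masked + eps)
    p = np.exp(log_p - log_p.max())  # numerically stable normalization
    return p / p.sum()

# Toy example: action token 0 is favored by a background cue that still
# fires in the masked image; token 1 depends on the object itself.
p_orig = np.array([0.6, 0.4])
p_masked = np.array([0.7, 0.3])
p_pcd = contrastive_decode(p_orig, p_masked, alpha=1.0)
# Token 1's probability rises relative to p_orig, since its evidence
# disappears when the object is masked out.
```

With α = 0 the method reduces to the unmodified policy, which matches the ablation observation that α trades off debiasing strength against action diversity.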
Two technical challenges are addressed. First, producing object‑masked observations for an entire trajectory is handled by Track2Mask: an initial annotation (via human point/box prompts or an off‑the‑shelf detector such as Grounding DINO) identifies the target object; SAM2 then tracks and segments the object across frames, and inpainting removes it. This pipeline can operate with minimal human effort. Second, diffusion‑based policies (Octo, π₀) do not output explicit probability densities. The authors propose KDE‑based Probabilistic Modeling (KDE‑PM), which samples N candidate actions from the diffusion policy and estimates per‑dimension densities using Gaussian kernel density estimation, assuming independence across dimensions. This yields an approximate probability distribution compatible with the contrastive formulation.
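The KDE‑PM step described above can be sketched in a few lines of numpy. This is an illustrative implementation under stated assumptions: N sampled actions, per‑dimension 1‑D Gaussian kernels combined under the independence assumption, and Silverman's rule for the bandwidth (the paper may use a different bandwidth choice).

```python
import numpy as np

def kde_pm_log_density(samples, query, bandwidth=None):
    """Approximate log-density of an action under a diffusion policy.

    samples: (N, D) array of actions sampled from the policy.
    query:   (D,) action whose density is estimated.

    Each dimension gets its own 1-D Gaussian KDE; the joint log-density
    is their sum (independence across dimensions, as assumed by KDE-PM).
    """
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    if bandwidth is None:
        # Silverman's rule per dimension (an assumption, not from the paper).
        std = samples.std(axis=0, ddof=1) + 1e-8
        bandwidth = 1.06 * std * n ** (-1 / 5)
    z = (query[None, :] - samples) / bandwidth              # (N, D)
    log_k = -0.5 * z**2 - np.log(bandwidth * np.sqrt(2 * np.pi))
    per_dim = np.logaddexp.reduce(log_k, axis=0) - np.log(n)  # avg over kernels
    return per_dim.sum()                                     # sum over dims

# Usage: estimate each candidate action's density under both the
# original and the masked observation, then contrast the two.
rng = np.random.default_rng(0)
actions = rng.normal(size=(64, 7))   # e.g., N=64 samples of a 7-DoF action
logp = kde_pm_log_density(actions, actions[0])
```

The independence assumption keeps the estimate tractable for small N, at the cost of ignoring correlations between action dimensions, which the authors list as a limitation.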
Experiments span 15 diverse manipulation tasks in simulation and 7 real‑world tasks on a physical robot arm. Three open‑source policies—OpenVLA (autoregressive), Octo, and π₀ (diffusion)—are evaluated. In simulation, PCD improves success rates by 50.6% (OpenVLA), 29.7% (Octo), and 8.9% (π₀). In real‑world trials, π₀'s performance more than doubles, showing a 108% gain. Ablation studies reveal that moderate α values (≈1.0–1.5) provide the best trade‑off; overly large α suppresses action diversity. Additional analyses confirm that PCD mitigates the 30%+ drops observed when lighting or object positions are perturbed.
The contributions are threefold: (1) a novel, training‑free contrastive decoding scheme tailored to robotic policies; (2) a unified framework that works for both autoregressive and diffusion policies via Track2Mask and KDE‑PM; (3) extensive empirical validation demonstrating consistent gains across simulation and real hardware. Limitations include the current focus on single‑object masking and the independence assumption in KDE‑PM, which may not hold for more complex action spaces. Future work could extend PCD to multi‑object scenarios, incorporate additional sensor modalities, and explore more sophisticated density estimation techniques. Overall, PCD offers a practical, scalable solution to the spurious correlation problem that hampers the generalization of modern robot foundation models.

