Learning Adaptive Cross-Embodiment Visuomotor Policy with Contrastive Prompt Orchestration
Learning adaptive visuomotor policies for embodied agents remains a formidable challenge, particularly when facing cross-embodiment variations such as diverse sensor configurations and dynamic properties. Conventional learning approaches often struggle to separate task-relevant features from domain-specific variations (e.g., lighting, field-of-view, and rotation), leading to poor sample efficiency and catastrophic failure in unseen environments. To bridge this gap, we propose ContrAstive Prompt Orchestration (CAPO), a novel approach for learning visuomotor policies that integrates contrastive prompt learning and adaptive prompt orchestration. For prompt learning, we devise a hybrid contrastive learning strategy that integrates visual, temporal action, and text objectives to establish a pool of learnable prompts, where each prompt induces a visual representation encapsulating fine-grained domain factors. Based on these learned prompts, we introduce an adaptive prompt orchestration mechanism that dynamically aggregates these prompts conditioned on current observations. This enables the agent to adaptively construct optimal state representations by identifying dominant domain factors instantaneously. Consequently, the policy optimization is effectively shielded from irrelevant interference, preventing the common issue of overfitting to source domains. Extensive experiments demonstrate that CAPO significantly outperforms state-of-the-art baselines in sample efficiency and asymptotic performance. Crucially, it exhibits superior zero-shot adaptation across unseen target domains characterized by drastic environmental (e.g., illumination) and physical shifts (e.g., field-of-view and rotation), validating its effectiveness as a viable solution for cross-embodiment visuomotor policy adaptation.
💡 Research Summary
The paper introduces ContrAstive Prompt Orchestration (CAPO), a novel framework for learning visuomotor policies that can adapt across different embodiments and visual domains without additional fine‑tuning. The authors identify two major shortcomings of existing approaches: end‑to‑end reinforcement learning (RL) suffers from extreme sample inefficiency and is highly sensitive to domain shifts such as illumination changes or sensor field‑of‑view variations; decoupled methods that freeze a pretrained visual encoder produce static representations that cannot react to new domain factors during deployment. CAPO bridges these gaps by integrating contrastive prompt learning with an adaptive orchestration mechanism.
In the first stage, a large‑scale vision‑language model (CLIP) is kept frozen while a set of lightweight, learnable prompts is introduced. Each prompt, when injected into the CLIP backbone, yields a distinct visual embedding that emphasizes different domain attributes (e.g., lighting, FOV, robot geometry). The prompts are trained using a hybrid contrastive loss that simultaneously aligns (1) visual pairs across illumination and viewpoint variations, (2) temporal action pairs to capture the relationship between visual frames and the same motor behavior, and (3) textual domain descriptors (e.g., “low illumination”, “wide FOV”) with the corresponding visual embeddings. This multi‑objective contrastive training forces the prompts to encode fine‑grained, domain‑specific cues while preserving the semantic richness of CLIP.
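The hybrid objective described above can be sketched as a weighted sum of three InfoNCE terms. This is an illustrative reconstruction, not the paper's exact loss: the pairing scheme, the weights `w_vis`/`w_act`/`w_txt`, and the temperature are assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(positive | anchor)

def hybrid_contrastive_loss(z_view_a, z_view_b,   # same scene, different domain (lighting/FOV)
                            z_t, z_t1,            # frames linked by the same motor behavior
                            z_img, z_txt,         # image vs. domain-descriptor text embedding
                            w_vis=1.0, w_act=1.0, w_txt=1.0):
    """Hybrid contrastive objective combining the three alignment terms
    described in the summary; weights are hypothetical."""
    return (w_vis * info_nce(z_view_a, z_view_b)
            + w_act * info_nce(z_t, z_t1)
            + w_txt * info_nce(z_img, z_txt))
```

In this sketch, lowering any of the three weights recovers an ablated variant of the loss, mirroring the component ablations reported later in the summary.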
The second stage is the Adaptive Prompt Orchestration module. For each incoming observation, the embeddings produced by all prompts are fed into a dual‑branch attention network. One branch computes attention weights based on the similarity between the current image and each prompt‑induced embedding; the other branch incorporates recent action history to modulate the importance of prompts over time. The weighted sum of prompt embeddings forms a dynamic state representation that is passed to a policy network. Both the policy parameters and the orchestration attention weights are jointly optimized via standard RL algorithms (e.g., PPO). Consequently, the policy can instantly re‑weight the most relevant domain factors, shielding learning from irrelevant visual noise and avoiding over‑fitting to the source domain.
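The orchestration step can be sketched as follows, assuming K prompt embeddings, one observation embedding, and a flattened action-history vector. The combination rule (additive fusion of the two branches before a softmax) and the learned matrix `W_act` are assumptions for illustration; the paper's dual-branch attention network is likely more elaborate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def orchestrate(obs_emb, prompt_embs, action_hist, W_act, tau=1.0):
    """Adaptive prompt orchestration (illustrative sketch).

    Branch 1 scores each prompt by cosine similarity between the current
    observation embedding and that prompt-induced embedding. Branch 2 scores
    prompts from recent action history via a hypothetical learned projection
    `W_act` of shape (K, action_dim). The softmax-normalized combined scores
    weight a sum of prompt embeddings, yielding the dynamic state fed to the
    policy network.
    """
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    o = obs_emb / np.linalg.norm(obs_emb)
    sim_scores = p @ o                      # branch 1: (K,) image-prompt similarity
    act_scores = W_act @ action_hist        # branch 2: (K,) action-history modulation
    weights = softmax((sim_scores + act_scores) / tau)
    state = weights @ prompt_embs           # (D,) dynamic state representation
    return state, weights
```

Because the weights are recomputed per observation, a shift in the dominant domain factor (say, a sudden illumination change) re-weights the prompt pool at the next step without any gradient update, which is what enables the zero-shot behavior described below.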
Extensive experiments are conducted in high‑fidelity simulators with three cross‑embodiment challenges: (a) drastic illumination shifts, (b) varying camera field‑of‑view, and (c) changes in robot morphology such as arm length or stride. CAPO is compared against domain‑randomization baselines, static CLIP‑prompt methods, and recent adaptive visual encoders. Results show that CAPO converges roughly 30% faster (i.e., better sample efficiency) and attains higher asymptotic returns. In zero‑shot transfer to unseen target domains, CAPO's success rate exceeds 70%, outperforming all baselines, especially in combined illumination‑and‑embodiment scenarios where other methods collapse. Ablation studies reveal that each component—visual contrastive loss, temporal action contrast, and text‑guided contrast—contributes significantly, and that a moderate number of prompts (≈8) balances expressiveness against over‑fitting.
The authors acknowledge limitations: the added prompts and attention module increase computational overhead, and overly specialized prompts may struggle with completely novel sensor configurations. Future work is suggested on meta‑learning of prompts, automatic prompt‑count selection, and joint fine‑tuning of prompts with the backbone to further improve robustness.
Overall, CAPO demonstrates that coupling contrastive prompt learning with adaptive orchestration yields a powerful, sample‑efficient, and zero‑shot capable solution for cross‑embodiment visuomotor policy learning, opening a practical pathway for robots, drones, and autonomous vehicles to operate reliably under unpredictable visual and physical conditions.