When would Vision-Proprioception Policies Fail in Robotic Manipulation?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Proprioceptive information is critical for precise servo control, as it provides real-time robot states. Combining it with vision is widely expected to enhance the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this issue through temporally controlled experiments. We find that during task sub-phases in which the robot's motion transitions, which require target localization, the vision modality of the vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training; these signals therefore dominate the optimization and suppress learning of the visual modality during motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively modulates the optimization of proprioception, enabling dynamic collaboration within the vision-proprioception policy. Specifically, we leverage proprioception to capture robot states and estimate the probability that each timestep in a trajectory belongs to a motion-transition phase. During policy learning, we apply a fine-grained adjustment that reduces the magnitude of proprioception's gradient according to the estimated probabilities, yielding robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across single-arm and dual-arm setups, and is compatible with both conventional and Vision-Language-Action models. We believe this work offers valuable insights into the development of vision-proprioception policies for robotic manipulation.


💡 Research Summary

This paper investigates why vision‑proprioception (VP) policies sometimes underperform compared to vision‑only policies in robotic manipulation, despite the intuitive benefit of fusing visual and proprioceptive signals. The authors introduce the concept of “modality temporality,” which states that the relative importance of vision and proprioception changes over the course of a task. During motion‑consistent phases, where the robot executes continuous movements, proprioceptive signals (e.g., 6‑DOF gripper pose and opening degree) provide concise, low‑dimensional information that directly reduces the behavior‑cloning loss. Conversely, during motion‑transition phases—moments when the robot must locate or re‑localize a target object—visual cues become critical, yet they are often subtle (pixel‑level changes) and therefore contribute less to loss reduction.

Through a controlled “intervention experiment,” the authors replace actions from a vision‑only policy with those from a VP policy for short 10‑step windows. The substitution has negligible impact during motion‑consistent phases but dramatically degrades performance during transition phases, indicating that the VP policy fails to exploit visual information when it matters most. From an optimization perspective, the loss gradient with respect to the visual encoder parameters (ω_v) is dominated by the proprioceptive branch because proprioceptive features yield larger error reductions. This dominance suppresses learning of the visual branch, leading to poor generalization in scenarios where object positions shift or out‑of‑distribution (OOD) conditions arise.
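The substitution protocol above can be sketched as follows. This is an illustrative sketch only: the policy stubs and window placement are hypothetical, and the real study uses trained vision-only and VP policies; the point is the action-swapping logic over fixed 10-step windows.

```python
# Illustrative sketch of the 10-step intervention experiment
# (hypothetical stub policies, not the authors' trained models).

def vision_only_policy(obs):
    return ("vision_action", obs)

def vp_policy(obs):
    return ("vp_action", obs)

def rollout_with_intervention(observations, windows, window_len=10):
    """Replace vision-only actions with VP-policy actions inside each window.

    windows: start indices of the substitution windows; outside them,
    the vision-only policy drives the rollout unchanged.
    """
    substituted = {t for start in windows
                   for t in range(start, start + window_len)}
    actions = []
    for t, obs in enumerate(observations):
        policy = vp_policy if t in substituted else vision_only_policy
        actions.append(policy(obs))
    return actions

actions = rollout_with_intervention(list(range(30)), windows=[5])
# Timesteps 5..14 come from the VP policy; all others from vision-only.
```

Measuring success-rate drop as a function of where the window falls (motion-consistent vs. motion-transition phase) is what isolates when the VP policy's visual branch fails.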

To remedy this, the authors propose Gradient Adjustment with Phase‑guidance (GAP). GAP consists of three key components:

  1. Motion Representation – The robot’s proprioceptive stream (position p, orientation θ, gripper opening g) is transformed into motion deltas m_i:j = {p_i:j, θ_i:j, g_i:j} between timesteps i and j, capturing the robot’s movement in a compact vector.

  2. Transition Phase Estimation – A Change‑Point Detection (CPD) algorithm first segments trajectories into motion‑consistent phases by minimizing a cost based on cosine similarity of position, weighted orientation similarity, and sign agreement of opening degree. Since CPD yields discrete phase boundaries, a temporal network (LSTM) processes proprioceptive differences Δs_i = s_{i+1} − s_i to predict a continuous probability ρ_i that timestep i belongs to a transition phase. The network is supervised by the CPD indices, and a relaxed penalty is applied near boundaries to respect the gradual nature of transitions.

  3. Gradient Adjustment – During policy training, the gradient flowing into the proprioceptive encoder ω_s is scaled by a factor (1 – ρ_i). When ρ_i is high (i.e., a transition step), the proprioceptive gradient is attenuated, preventing it from overwhelming the visual gradient. In motion‑consistent steps (ρ_i ≈ 0), the gradient remains unchanged, preserving the benefits of proprioception for precise control.
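The two numeric pieces above can be sketched in a few lines: the motion-delta representation m_{i:j} and the (1 − ρ_i) gradient attenuation. Function names and the toy values are assumptions for illustration, not the authors' implementation, which operates on encoder parameters inside a deep-learning framework.

```python
# Sketch of GAP's core numerics (illustrative names, not the authors' code).

def motion_deltas(states, i, j):
    """Motion representation m_{i:j}: per-dimension deltas of position p,
    orientation theta, and gripper opening g between timesteps i and j."""
    si, sj = states[i], states[j]
    return {key: [b - a for a, b in zip(si[key], sj[key])]
            for key in ("p", "theta", "g")}

def adjust_proprio_gradient(grad_s, rho):
    """Scale the proprioceptive-encoder gradient by (1 - rho):
    strongly attenuated at transition steps (rho near 1),
    left unchanged at motion-consistent steps (rho near 0)."""
    return [(1.0 - rho) * g for g in grad_s]

states = [
    {"p": [0.0, 0.0, 0.0], "theta": [0.0], "g": [1.0]},
    {"p": [0.1, 0.0, 0.2], "theta": [0.3], "g": [0.0]},
]
m = motion_deltas(states, 0, 1)                       # p delta: [0.1, 0.0, 0.2]
g_hi = adjust_proprio_gradient([2.0, -4.0], rho=0.9)  # heavily attenuated
g_lo = adjust_proprio_gradient([2.0, -4.0], rho=0.0)  # unchanged
```

In a real training loop this scaling would typically be applied per-timestep to the gradients flowing into the proprioceptive encoder (e.g., via a backward hook), while the visual branch is optimized normally.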

The GAP framework is agnostic to the underlying policy architecture. The authors evaluate it on a broad suite of tasks: simulated RoboSuite and MetaWorld benchmarks (assembly, pick‑place, rotation‑sensitive, soft‑object manipulation), real‑world single‑arm and dual‑arm setups, and even Vision‑Language‑Action (VLA) models. Policies are implemented with MLP, diffusion, and transformer heads. Across all configurations, GAP‑enhanced VP policies achieve 12–18 percentage‑point improvements in success rate over vanilla VP policies, and they consistently outperform vision‑only baselines, especially under OOD variations (different object poses, lighting, backgrounds). Qualitative visualizations show increased activation of visual features during transition phases and a balanced contribution from both modalities.

Key contributions of the work are:

  • Identification of a training‑time optimization bias where proprioceptive signals dominate loss reduction, suppressing visual learning.
  • A principled motion representation and CPD‑based segmentation to automatically detect transition phases.
  • A lightweight gradient‑scaling mechanism that dynamically re‑weights modality contributions during learning.
  • Extensive empirical validation across simulators, real robots, single/dual arms, and VLA models, demonstrating the generality of GAP.

Limitations include reliance on accurate CPD segmentation and LSTM‑based probability estimation, which may degrade in highly stochastic or highly dynamic tasks. The current study focuses on behavior‑cloning (BC) with mean‑squared error loss; extending GAP to reinforcement‑learning objectives or other loss functions remains an open direction. Future work could explore more sophisticated temporal models (e.g., Transformers), integrate uncertainty estimation for phase detection, and apply GAP to human‑robot collaborative scenarios where visual cues are even more volatile.

In summary, the paper provides a clear diagnosis of why vision‑proprioception policies can fail during critical task phases and offers a practical, architecture‑agnostic solution that restores balanced multimodal learning, leading to more robust and generalizable manipulation policies.

