Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires careful tuning and incurs data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. An information-theoretic analysis formally validates the method's effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page: https://xukechun.github.io/papers/BayesVLA.


💡 Research Summary

This paper, “Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy,” addresses the critical challenge of out-of-distribution generalization in Vision-Language-Action (VLA) models, which is often severely hampered by catastrophic forgetting of the pre-trained Vision-Language Model (VLM) backbone during fine-tuning on robot data. While co-training with external reasoning data is a common remedy, it introduces significant deployment costs and reliance on heuristic tuning.

The authors identify a more fundamental, intrinsic issue within standard VLA datasets: modality imbalance. In a typical demonstration, a single language instruction corresponds to dozens or hundreds of visual frames and actions, creating a stark disparity in diversity that biases the model towards learning visual shortcuts and discarding nuanced language understanding.

To structurally address this imbalance without external data, the paper proposes BayesVLA, a novel framework based on a Bayesian factorization of the VLA policy. The core idea is to decompose the policy π(a|v,ℓ) into a vision-action prior π_p(a|v) and a language-conditioned likelihood L(ℓ|v,a), formalized as π(a|v,ℓ) ∝ π_p(a|v) · L(ℓ|v,a). This factorization embodies a “see-to-act then prompt-to-specify” philosophy: the prior learns foundational visuomotor skills (what actions are feasible given the visual scene), and the likelihood aligns these action proposals with the specific language instruction.
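In practice, a posterior of this form can be approximated by sampling action candidates from the prior and reweighting them with the likelihood. The sketch below illustrates this with self-normalized importance weighting; `prior_sample` and `likelihood_score` are hypothetical stand-ins for the learned prior and likelihood networks, not the paper's implementation.

```python
import numpy as np

def bayes_policy(prior_sample, likelihood_score, v, instruction, n_candidates=64):
    """Approximate pi(a|v,l) ∝ pi_p(a|v) * L(l|v,a): draw action candidates
    from the vision-action prior, then reweight them with the
    language-conditioned likelihood (self-normalized importance weighting)."""
    candidates = [prior_sample(v) for _ in range(n_candidates)]        # a ~ pi_p(a|v)
    log_l = np.array([likelihood_score(instruction, v, a) for a in candidates])
    weights = np.exp(log_l - log_l.max())                              # numerically stable
    weights /= weights.sum()
    return candidates, weights                                         # discrete posterior
```

Because the candidates are already distributed according to the prior, weighting them by the (log-)likelihood alone yields a sample-based approximation of the posterior.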

This modeling leads to a natural two-stage training procedure. In Stage 1, the language-agnostic vision-action prior is trained solely on abundant vision-action pairs to learn general manipulation primitives (e.g., generating diverse grasp poses). In Stage 2, the language-conditioned likelihood is trained to score action candidates sampled from the fixed prior based on their alignment with the language instruction, achieved by injecting features from a frozen VLM backbone. This separation prevents the language alignment process from corrupting the learned visuomotor representations.

The architecture further incorporates a decomposition into pre-contact and post-contact phases, reflecting the physical stages of manipulation. For the pre-contact phase (e.g., approaching and grasping), BayesVLA directly leverages powerful pre-trained action foundation models (like AnyGrasp) as strong priors. For the post-contact phase (e.g., lifting and placing), it employs a diffusion-based model as the prior to generate dense, multimodal trajectories. The language-conditioned likelihood is applied to both phases during the second training stage.
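A rollout under this decomposition reduces to a dispatch between the two priors, with the same language likelihood re-ranking both candidate sets. The sketch below assumes an AnyGrasp-like grasp interface and a diffusion-style trajectory sampler; all callables are hypothetical stand-ins.

```python
def act(v, instruction, in_contact, grasp_prior, traj_prior, likelihood):
    """Phase-decomposed rollout sketch: before contact, candidates come from
    a pretrained grasp foundation model; after contact, from a diffusion-style
    trajectory prior. Both sets are re-ranked by the same language likelihood."""
    candidates = traj_prior(v) if in_contact else grasp_prior(v)
    return max(candidates, key=lambda a: likelihood(instruction, v, a))
```

Sharing one likelihood across phases is what lets a single round of Stage-2 training specialize both priors to the instruction.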

The paper provides a formal information-theoretic analysis, demonstrating that the modality imbalance reduces the mutual information between actions and language, leading to poor instruction following. The proposed Bayesian factorization is proven to mitigate this by preserving the action-language information while discouraging vision-action shortcuts.
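The factorization makes this argument concrete. Writing the posterior in log form (a sketch following the summary's notation, with Z the normalizer and π_p treated as the language-marginalized policy; this is not the paper's exact derivation):

```latex
\log \pi(a \mid v, \ell)
  = \log \pi_p(a \mid v) + \log L(\ell \mid v, a) - \log Z(v, \ell),
\qquad
I(A;\, L \mid V)
  = \mathbb{E}\!\left[\log \frac{\pi(a \mid v, \ell)}{\pi_p(a \mid v)}\right].
```

If training collapses to the visual shortcut, π(a|v,ℓ) ≈ π_p(a|v), the ratio inside the expectation is 1 and I(A; L | V) ≈ 0: the instruction carries no information about the chosen action. Keeping an explicit likelihood term that must score (ℓ, v, a) triplets blocks this collapse.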

Extensive experiments validate the superiority of BayesVLA. Evaluations on public benchmarks (RLBench, LIBERO) and custom simulation environments show that it significantly outperforms existing VLA methods (including RT-2, Octo, and Diffusion Policy) in generalizing to novel language instructions, unseen object instances, and entirely new scenes. Real-world experiments with a 6-DoF robot arm further confirm its robust language grounding and generalization capabilities in complex manipulation tasks. The work offers a principled and effective alternative to heuristic co-training, advancing the development of generalizable robot policies.

