A Diffusion-Bridge-Based Robot Policy that Integrates Observations
📝 Abstract
Imitation learning with diffusion models has advanced robotic control by capturing multimodal action distributions. However, existing methods typically treat observations only as high-level conditions to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling is forced to begin from random noise, weakening the coupling between perception and control and often yielding suboptimal performance. We propose BridgePolicy, a generative visuomotor policy that directly integrates observations into the stochastic dynamics via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich and informative prior rather than random noise, substantially improving precision and reliability in control. A key difficulty is that a diffusion bridge normally connects distributions of matched dimensionality, while robotic observations are heterogeneous and not naturally aligned with actions. To overcome this, we introduce a multi-modal fusion module and a semantic aligner that unify the visual and state inputs and align the observations with action representations, making the diffusion bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and 5 real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
📄 Content
Imitation learning (Osa et al., 2018) is a widely adopted paradigm in robotic learning (Li et al., 2024; Shafiullah et al., 2023; Ze et al., 2023b; Seo et al., 2023; Fu et al., 2024), where a robot is provided with a set of expert demonstrations and learns to mimic them to perform the tasks effectively. Recently, generative models such as diffusion models (Ho et al., 2020; Song et al., 2020b; Dhariwal & Nichol, 2021; Ho & Salimans, 2022) and flow matching (Lipman et al., 2024; Liu et al., 2022; Albergo et al., 2023) have gained prominence owing to their capacity to capture multi-modal distributions and learn temporal dependencies (Chi et al., 2023; Ze et al., 2023a; Zhang et al., 2025). These methods share a similar principle: they perturb action chunks into random noise via a forward process defined by a stochastic or ordinary differential equation (SDE/ODE) and then train a neural network conditioned on observations to reverse this process, iteratively transforming noise into executable actions.
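The shared forward process can be illustrated with a minimal NumPy sketch of a DDPM-style variance-preserving perturbation applied to an action chunk. The action dimensions and schedule constants here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action chunk: horizon of 8 steps, 7-DoF actions.
actions = rng.standard_normal((8, 7))

# DDPM-style variance-preserving noise schedule over K diffusion steps.
K = 1000
betas = np.linspace(1e-4, 0.02, K)
alpha_bars = np.cumprod(1.0 - betas)

def forward_sample(x0, k):
    """Closed-form forward perturbation to step k:
    x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps."""
    eps = rng.standard_normal(x0.shape)
    xk = np.sqrt(alpha_bars[k]) * x0 + np.sqrt(1.0 - alpha_bars[k]) * eps
    return xk, eps  # the denoiser is trained to predict eps from (x_k, k, obs)

x_K, _ = forward_sample(actions, K - 1)

# By the final step almost no signal remains: the chunk is essentially
# standard Gaussian noise, which is where reverse sampling must begin.
print(alpha_bars[-1] < 1e-3, x_K.shape)  # → True (8, 7)
```

This final near-Gaussian endpoint is exactly the uninformative prior the paper argues against: the observations influence only the learned denoiser, not the trajectory itself.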
With this paradigm, Diffusion Policy (Chi et al., 2023) and 3D Diffusion Policy (Ze et al., 2023a), known as DP and DP3, employ an SDE-defined forward process and train a neural network conditioned on visual inputs and robot states to steer the denoising process, sampling actions from random noise. Similarly, FlowPolicy (Zhang et al., 2025) employs an ODE-defined forward and reverse process, which reduces stochasticity during training and inference. Despite their success, current generative policies largely treat observations as high-level conditioning signals to the denoising network (Chi et al., 2023; Ze et al., 2023a; Zhang et al., 2025), rather than integrating them into the dynamics of the forward process. This underutilization forces sampling to begin from uninformative random noise, weakening the coupling between perception and control and often yielding suboptimal performance.
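The ODE-defined alternative can be illustrated with a rectified-flow toy example. The sketch below uses the closed-form straight-line velocity in place of a trained network, and a 1-D bimodal "action" distribution purely for illustration; it is not FlowPolicy's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "action" distribution with two modes at -2 and +2.
def sample_actions(n):
    modes = rng.choice([-2.0, 2.0], size=n)
    return modes + 0.1 * rng.standard_normal(n)

# Rectified-flow / flow-matching view: interpolate
# x_t = (1 - t) * noise + t * action and regress the constant
# velocity target v = action - noise. We plug in the closed-form
# target to show the deterministic Euler integration of the ODE.
noise = rng.standard_normal(1000)
target = sample_actions(1000)
velocity = target - noise          # per-sample straight-line velocity

x = noise.copy()
for _ in range(10):                # Euler steps from t=0 to t=1
    x = x + velocity / 10.0

# Straight-line integration lands on the action samples deterministically.
print(np.allclose(x, target))  # → True
```

The determinism is the point: unlike the SDE reverse process, every sampling pass follows the same trajectory from a given starting point, but that starting point is still observation-agnostic noise.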
Diffusion bridges have demonstrated considerable success in image restoration and translation (De Bortoli et al., 2021a; Yue et al., 2023; Li et al., 2023; Luo et al., 2023; Zhou et al., 2023), where the forward process is modified so that its endpoint distribution naturally aligns with the desired conditioning distribution of standard diffusion. As shown in Figure 1, with this mathematically exact formulation that models informative observations in the forward process rather than treating them solely as an external condition, the reverse process can naturally start from the observations, a more informative prior than uninformative random noise, thereby improving the precision of the generated actions. Building on these insights, we propose that the diffusion bridge can also serve as a more effective framework for visuomotor policy learning. Specifically, rather than adopting the conventional diffusion model, we formulate policy learning as learning a diffusion bridge (Zhu et al., 2025; Pan et al., 2025), where observations are explicitly modeled into the diffusion SDE trajectory itself instead of merely being treated as conditions, so that the reverse process can sample actions starting from the more informative prior of the observations rather than uninformative random Gaussian noise. This deeper integration allows the policy to exploit observations more effectively, leading to more precise and reliable control.
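One common bridge construction is the Brownian bridge, whose marginals are pinned at both endpoints. The sketch below is illustrative and not necessarily the exact SDE used by BridgePolicy; `obs_latent` stands for an observation already mapped into the action space (the role of the aligner introduced later), and `sigma` is an arbitrary noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)

action = rng.standard_normal(7)      # clean action, endpoint at t=0
obs_latent = rng.standard_normal(7)  # observation in action space, endpoint at t=1
sigma = 0.5                          # illustrative bridge noise scale

def bridge_marginal(t):
    """Brownian-bridge marginal pinned at action (t=0) and obs_latent (t=1):
    x_t = (1-t) * x_0 + t * x_1 + sigma * sqrt(t*(1-t)) * eps."""
    eps = rng.standard_normal(action.shape)
    return (1 - t) * action + t * obs_latent + sigma * np.sqrt(t * (1 - t)) * eps

# The forward trajectory terminates exactly at the observation-informed
# prior, so reverse sampling can start there instead of at random noise.
print(np.allclose(bridge_marginal(1.0), obs_latent))  # → True
```

Because the noise term `sqrt(t*(1-t))` vanishes at both endpoints, the process interpolates stochastically between the action and the observation prior rather than between the action and pure Gaussian noise.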
However, formulating policy learning as a diffusion bridge brings two difficulties. First, robotic observations and actions are inherently multi-modal and heterogeneous, violating the standard diffusion-bridge assumption that the connected endpoint distributions share the same dimensionality. Second, observations often include proprioceptive states, RGB-D vision, and language instructions, which do not admit a simple one-to-one mapping to the action space required by classical bridge formulations. Unifying the representation of multi-modal observations and aligning the observation and action shapes therefore constitute the critical challenges in making diffusion bridges applicable to policy learning. We introduce BridgePolicy, a generative visuomotor policy that directly samples actions from observation-informed priors rather than random noise. We construct a diffusion bridge that embeds observations within the diffusion SDE trajectory, enabling the model to fully exploit sensory inputs instead of using them solely as external conditions. To resolve the modality and shape mismatches, we design two components: (i) a multi-modal fusion module that consolidates heterogeneous inputs (e.g., vision, proprioception) into a unified observation representation, and (ii) a semantic aligner that maps this fused representation into an action-aligned latent space, ensuring compatible endpoints for the bridge. Together, these modules allow the diffusion bridge to leverage heterogeneous data sources for policy learning.
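The shape-matching role of the two modules can be sketched with random linear projections standing in for trained layers. All feature dimensions and layer names below are hypothetical, chosen only to show how heterogeneous inputs end up action-shaped:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random projection standing in for a trained linear layer."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

# Hypothetical dimensions: 512-d visual features, 14-d proprioceptive
# state, action chunks of horizon 8 x 7 DoF.
W_vis, W_state = linear(512, 128), linear(14, 128)
W_fuse = linear(256, 128)       # multi-modal fusion: unify modalities
W_align = linear(128, 8 * 7)    # semantic aligner: match the action shape

vis_feat = rng.standard_normal(512)   # e.g., from a vision encoder
state = rng.standard_normal(14)       # proprioceptive reading

fused = np.concatenate([vis_feat @ W_vis, state @ W_state]) @ W_fuse
obs_prior = (fused @ W_align).reshape(8, 7)

# The fused observation now has the action chunk's shape, giving the
# diffusion bridge dimensionally compatible endpoints.
print(obs_prior.shape)  # → (8, 7)
```

In the actual method these projections would be learned jointly with the bridge, but the dimensional bookkeeping, heterogeneous inputs in, action-shaped prior out, is the constraint the two modules exist to satisfy.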