Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene’s physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.


💡 Research Summary

The paper tackles the problem of video object removal, where a target object must be erased from a video while plausibly filling the resulting hole. Existing approaches fall into two categories: propagation‑based methods that rely on optical flow to copy pixels from other frames, and generative inpainting methods, especially recent diffusion‑based models. The latter follow a “noise‑to‑data” paradigm: generation starts from pure Gaussian noise and progressively denoises toward a video conditioned on the masked input. While powerful, this paradigm discards the rich structural and contextual information already present in the source video, often leading to incomplete erasures or implausible content that violates scene physics.

Motivated by recent advances in stochastic bridge models, the authors reformulate video object removal as a video‑to‑video translation problem using a data‑to‑data stochastic bridge. Instead of a generic Gaussian prior, the source video (with the object) itself serves as the prior distribution. The bridge is realized via a variance‑preserving stochastic differential equation (VP‑SDE) that defines a continuous‑time path from the source latent representation (z_{\text{src}}) to the target latent representation (z_{\text{tgt}}). Using a pre‑trained VAE encoder, the source video, the ground‑truth clean video, and the binary mask are each projected into a compact latent space. The source latent is used in two ways: (1) as the boundary condition of the bridge at time (t=1), and (2) as part of the spatial conditioning input (y = \text{concat}(z_M, z_{\text{src}})) that provides the network with explicit background context.
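The conditioning setup described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: `encode` stands in for the pre‑trained VAE encoder (here just 2×2 average pooling followed by a fixed random channel projection), and the latent sizes are arbitrary choices.

```python
import numpy as np

def encode(video, latent_channels=4):
    """Hypothetical stand-in for the pre-trained VAE encoder:
    2x2 spatial average pooling, then a fixed random channel
    projection into `latent_channels` channels."""
    T, H, W, C = video.shape
    pooled = video.reshape(T, H // 2, 2, W // 2, 2, C).mean(axis=(2, 4))
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((C, latent_channels)) / np.sqrt(C)
    return pooled @ proj  # (T, H//2, W//2, latent_channels)

T, H, W = 4, 16, 16
src_video = np.random.default_rng(1).standard_normal((T, H, W, 3))
mask = np.ones((T, H, W, 1))  # binary object mask

z_src = encode(src_video)  # source latent, (T, 8, 8, 4)
z_M = encode(mask)         # mask latent,   (T, 8, 8, 4)

# Spatial conditioning: y = concat(z_M, z_src) along the channel axis,
# giving the network explicit background context alongside the mask.
y = np.concatenate([z_M, z_src], axis=-1)
print(y.shape)  # (4, 8, 8, 8)
```

In the actual model the concatenation is fed to the denoising network at every bridge timestep; the toy encoder here only exists to make the shapes concrete.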

The bridge process is analytically tractable: at any interpolation time (t \in [0, 1]), the marginal distribution of the latent is available in closed form as a Gaussian whose mean interpolates between the target latent (z_{\text{tgt}}) (at (t = 0)) and the source latent (z_{\text{src}}) (at (t = 1)), so training samples can be drawn at arbitrary timesteps without simulating the full SDE trajectory.
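A minimal sketch of sampling from such a tractable bridge marginal is shown below. The `sqrt(t(1-t))` noise schedule is a generic Brownian‑bridge‑style choice for illustration, not the paper's exact VP‑SDE coefficients; what it shares with the method is the key property that the process is pinned to (z_{\text{tgt}}) at (t=0) and (z_{\text{src}}) at (t=1).

```python
import numpy as np

def sample_bridge(z_tgt, z_src, t, rng):
    """Sample z_t from a Brownian-bridge-style Gaussian marginal.
    Mean interpolates linearly from z_tgt (t=0) to z_src (t=1);
    the noise scale vanishes at both endpoints, pinning the path."""
    mean = (1.0 - t) * z_tgt + t * z_src
    std = np.sqrt(t * (1.0 - t))  # illustrative schedule, not the paper's
    return mean + std * rng.standard_normal(z_tgt.shape)

rng = np.random.default_rng(0)
z_tgt = np.zeros((2, 8, 8, 4))  # clean (object-removed) latent
z_src = np.ones((2, 8, 8, 4))   # source (object-present) latent

# Endpoints are deterministic: no noise at t=0 or t=1.
z0 = sample_bridge(z_tgt, z_src, 0.0, rng)
z1 = sample_bridge(z_tgt, z_src, 1.0, rng)

# A noisy intermediate state, as used for training the denoiser.
z_half = sample_bridge(z_tgt, z_src, 0.5, rng)
```

During training, a network would be given `z_half` (plus the conditioning input) and asked to predict the target latent; at inference, sampling starts from the source latent at (t=1) rather than from Gaussian noise.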

