Target-Aware Video Diffusion Models
We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human-object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target’s spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.
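As a rough illustration of the cross-attention loss described above, the sketch below compares the attention that image patches pay to the special target token against the flattened target mask. The abstract does not give the exact form of the loss, so the mean-squared-error form, the max-normalization, and the names `target_attention_loss` and `selected_block_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def target_attention_loss(attn, token_idx, mask):
    """Hypothetical alignment loss (assumed MSE form).

    attn:      (num_patches, num_text_tokens) non-negative cross-attention weights
    token_idx: column index of the special target token
    mask:      (num_patches,) binary target mask at attention resolution
    """
    a = attn[:, token_idx]
    a = a / (a.max() + 1e-8)  # rescale to [0, 1] so it is comparable to the mask
    return float(np.mean((a - mask) ** 2))

def selected_block_loss(attn_per_block, selected_blocks, token_idx, mask):
    """Average the loss over a chosen subset of transformer blocks only."""
    losses = [target_attention_loss(attn_per_block[b], token_idx, mask)
              for b in selected_blocks]
    return sum(losses) / len(losses)

# Toy example: 4 patches, 3 text tokens, target token in column 2.
mask = np.array([1.0, 0.0, 0.0, 1.0])
aligned = np.zeros((4, 3))
aligned[:, 2] = 0.8 * mask           # attention concentrated on the target
misaligned = np.zeros((4, 3))
misaligned[:, 2] = 0.8 * (1 - mask)  # attention concentrated off the target

loss_aligned = target_attention_loss(aligned, 2, mask)
loss_misaligned = target_attention_loss(misaligned, 2, mask)
loss_blocks = selected_block_loss([aligned, misaligned, aligned], [0, 2], 2, mask)
```

In this toy setup the aligned map yields a much smaller loss than the misaligned one, and restricting the average to blocks 0 and 2 ignores the misaligned block entirely, which mirrors the idea of applying the loss only to selected transformer blocks.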
💡 Research Summary
The paper introduces a target‑aware video diffusion model that generates videos from a single input image, a segmentation mask defining a target object, and a textual description of an action. Building on the state‑of‑the‑art text‑to‑video diffusion model CogVideoX, the authors extend the architecture to accept an additional binary mask channel. The mask is down‑sampled to match the latent resolution and concatenated with the image condition, while the image projection layer is expanded with a zero‑initialized channel to preserve pretrained weights.
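The effect of the zero-initialized channel can be checked with a small sketch: when the new input column of the projection is all zeros, the output for the concatenated image-plus-mask condition equals the output of the original projection, so the pretrained behavior is preserved at the start of fine-tuning. The code below is a minimal NumPy sketch under stated assumptions: a per-pixel linear projection stands in for the model's actual image projection layer, average pooling with a downsampling factor of 8 stands in for the unspecified mask down-sampling, and all function names and shapes are hypothetical.

```python
import numpy as np

def downsample_mask(mask, factor):
    """Average-pool a binary (H, W) mask by `factor`, then re-binarize at 0.5."""
    h, w = mask.shape
    assert h % factor == 0 and w % factor == 0
    pooled = mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return (pooled >= 0.5).astype(np.float32)

def expand_projection(weight, extra_in=1):
    """Append zero-initialized input columns to an (out, in) projection weight."""
    zeros = np.zeros((weight.shape[0], extra_in), dtype=weight.dtype)
    return np.concatenate([weight, zeros], axis=1)

rng = np.random.default_rng(0)
C, H, W, factor, D = 4, 16, 16, 8, 6  # assumed toy sizes

# Image condition at latent resolution and a pixel-space target mask.
image_latent = rng.standard_normal((C, H // factor, W // factor)).astype(np.float32)
mask = np.zeros((H, W), dtype=np.float32)
mask[0:8, 8:16] = 1.0  # target occupies the top-right quadrant

# Down-sample the mask and concatenate it as one extra channel.
latent_mask = downsample_mask(mask, factor)                       # shape (2, 2)
cond = np.concatenate([image_latent, latent_mask[None]], axis=0)  # shape (C + 1, 2, 2)

# "Pretrained" projection and its expanded version with a zero column.
w_old = rng.standard_normal((D, C)).astype(np.float32)
w_new = expand_projection(w_old)

out_old = np.einsum("dc,chw->dhw", w_old, image_latent)
out_new = np.einsum("dc,chw->dhw", w_new, cond)
```

Here `out_new` matches `out_old` regardless of the mask contents, because the mask channel is multiplied by zeros; the mask only starts to influence the output once fine-tuning updates the new column.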
To make the model explicitly aware of the target, a special token “