Target-Aware Video Diffusion Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human-object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target’s spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.
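The cross-attention loss described above can be illustrated with a minimal NumPy sketch. The source does not state the exact form of the loss, so the mean squared error, the max-normalization of the attention map, and all names here (`target_attention_loss`, `token_idx`, the toy 4×4 grid) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def target_attention_loss(attn, token_idx, mask):
    """MSE between one text token's attention map and a binary target mask.

    attn: (num_queries, num_text_tokens) cross-attention weights, where each
          query is one flattened spatial position.
    mask: (h, w) binary mask with h * w == num_queries.
    The MSE form and max-normalization are assumptions of this sketch.
    """
    token_map = attn[:, token_idx]
    token_map = token_map / (token_map.max() + 1e-8)  # rescale to [0, 1]
    return float(np.mean((token_map - mask.reshape(-1)) ** 2))

# Toy setup: a 4x4 spatial grid, 3 text tokens, token 2 is the special token.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
sign = mask.reshape(-1) * 2 - 1  # +1 inside the mask, -1 outside

logits = np.zeros((16, 3))
logits[:, 2] = 4.0 * sign        # special token attends inside the mask
loss_aligned = target_attention_loss(softmax(logits), 2, mask)

logits[:, 2] = -4.0 * sign       # special token attends outside the mask
loss_misaligned = target_attention_loss(softmax(logits), 2, mask)

print(loss_aligned < loss_misaligned)
```

Minimizing a loss of this kind pushes the special token's attention toward the masked region. In this toy case, the attention map that matches the mask gets the lower loss.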

💡 Research Summary

The paper introduces a target‑aware video diffusion model that generates videos from a single input image, a segmentation mask defining a target object, and a textual description of an action. Building on the state‑of‑the‑art text‑to‑video diffusion model CogVideoX, the authors extend the architecture to accept an additional binary mask channel. The mask is down‑sampled to match the latent resolution and concatenated with the image condition, while the image projection layer is expanded with a zero‑initialized channel to preserve pretrained weights.
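The mask conditioning described above can be sketched as follows. This is a simplified illustration in NumPy, assuming a per-token linear projection and block-max pooling for the down-sampling. The shapes, the pooling choice, the factor of 4, and the helper names `downsample_mask` and `expand_projection` are hypothetical and do not come from the paper or from CogVideoX.

```python
import numpy as np

def downsample_mask(mask, factor):
    """Block-max pool a binary (H, W) mask by `factor` (assumed pooling choice)."""
    h, w = mask.shape
    assert h % factor == 0 and w % factor == 0
    blocks = mask.reshape(h // factor, factor, w // factor, factor)
    return blocks.max(axis=(1, 3))

def expand_projection(weight, extra_in=1):
    """Append zero-initialized input columns to an (out, in) weight matrix."""
    zeros = np.zeros((weight.shape[0], extra_in), dtype=weight.dtype)
    return np.concatenate([weight, zeros], axis=1)

rng = np.random.default_rng(0)
C, D = 4, 8                                # hypothetical latent and embedding sizes
image_latent = rng.normal(size=(C, 4, 4))  # stand-in for the image condition
W = rng.normal(size=(D, C))                # stand-in for pretrained projection weights

# Binary target mask at pixel resolution, reduced to the 4x4 latent grid.
mask = np.zeros((16, 16))
mask[4:12, 8:16] = 1.0
small = downsample_mask(mask, 4)

# Original projection: one token per spatial position.
tokens_old = image_latent.reshape(C, -1).T            # (16, C)
out_old = tokens_old @ W.T                            # (16, D)

# Extended projection: concatenate the mask channel, add a zero weight column.
cond = np.concatenate([image_latent, small[None]], axis=0)
tokens_new = cond.reshape(C + 1, -1).T                # (16, C + 1)
W_new = expand_projection(W)                          # (D, C + 1)
out_new = tokens_new @ W_new.T

print(np.allclose(out_old, out_new))
```

Because the new weight column starts at zero, the extended layer produces the same output as the pretrained layer at initialization. The mask channel begins to contribute only once fine-tuning updates that column.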

To make the model explicitly aware of the target, a special token that encodes the target's spatial information is added to the text prompt.
