VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Rather than relying solely on Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.
💡 Research Summary
The paper introduces VDOT, an efficient unified video creation framework that dramatically reduces inference time while maintaining high visual quality across a wide range of video generation and editing tasks. Existing unified video models such as VACE and UNIC can handle multiple conditioning signals (text, reference frames, masks, etc.) but rely on large, complex architectures and require 100 or more diffusion steps, making them impractical for real‑world deployment.
VDOT tackles this problem by distilling a large pretrained video diffusion model (VACE‑W) into a lightweight student generator that operates in only four denoising steps. The distillation is performed within the Distribution‑Matching Distillation (DMD) paradigm, but instead of relying solely on the reverse Kullback‑Leibler (KL) divergence, the authors augment the loss with an Entropic Optimal Transport (EOT) discrepancy. The EOT term computes the minimum transport cost between the teacher’s score distribution and the student’s score distribution, yielding an optimal transport plan T* that provides explicit, geometrically meaningful gradients (∇_{a_i} W_ε² = ∑_j T*_{ij} (a_i − b_j)). This geometric regularization prevents the mode‑seeking behavior and zero‑forcing/gradient‑collapse problems that commonly arise when few‑step generators are trained with KL alone.
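The entropic OT machinery above can be illustrated with a minimal Sinkhorn iteration: solve for a transport plan T* between two small point clouds standing in for teacher and student score samples, then form the geometric gradient ∇_{a_i} = ∑_j T*_{ij}(a_i − b_j). This is a sketch under simplifying assumptions (uniform empirical measures, squared-Euclidean cost, a basic Sinkhorn loop); the paper's actual solver, cost, and scale are not specified here.

```python
import numpy as np

def sinkhorn_plan(a_pts, b_pts, eps=0.5, n_iters=300):
    """Entropic OT plan between uniform empirical measures (illustrative)."""
    n, m = len(a_pts), len(b_pts)
    # Squared Euclidean cost matrix C_ij = ||a_i - b_j||^2
    C = ((a_pts[:, None, :] - b_pts[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                      # Gibbs kernel
    r, c = np.ones(n) / n, np.ones(m) / m     # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                  # alternating marginal scaling
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan T*

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 2))   # stand-in for student ("fake") score samples
b = rng.normal(size=(6, 2))   # stand-in for teacher ("real") score samples
T = sinkhorn_plan(a, b)

# Geometric gradient from the text: grad_{a_i} = sum_j T*_ij (a_i - b_j).
# Each a_i is pulled toward the b_j it is coupled with, weighted by T*.
grad = T.sum(axis=1, keepdims=True) * a - T @ b
```

Because T* couples each student sample to nearby teacher samples, the gradient carries distance information that a pointwise KL term lacks, which is the intuition behind the stability claim.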
In addition to the OT‑based distillation loss, VDOT incorporates an adversarial discriminator that is trained on real video data. The discriminator supplies a GAN loss that corrects residual score‑approximation errors and improves texture fidelity, while the generator follows a “Self‑Forcing” schedule—using previously denoised frames as conditioning for the current step—to keep training and inference distributions aligned.
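The self-forcing schedule and the combined objective can be sketched as a toy training step. Everything here is illustrative: the one-step denoiser, critic, loss forms, and the 0.1 GAN weight are placeholder assumptions, not the paper's actual networks or hyperparameters; the point is only the structure (rollout conditioned on the model's own outputs, then distillation loss plus GAN loss).

```python
import numpy as np

rng = np.random.default_rng(1)

def generator_step(x_t, cond):
    """Toy one-step denoiser (stand-in for the distilled student)."""
    return 0.5 * x_t + 0.5 * cond

def discriminator(x):
    """Toy critic score in (-1, 1); the paper's discriminator sees real videos."""
    return np.tanh(x.mean())

# Self-forcing rollout: each step conditions on the frame the model itself
# just denoised, so the training distribution matches inference.
x = rng.normal(size=(4,))      # noisy "frame" latent
cond = np.zeros(4)             # initial conditioning
for _ in range(4):             # 4-step generation, as in VDOT
    x = generator_step(x, cond)
    cond = x                   # self-forcing: reuse the model's own output

# Hypothetical combined objective: a distillation term (stand-in for the
# OT/DMD loss) plus a non-saturating GAN term from the critic.
teacher_sample = np.zeros(4)
distill_loss = ((x - teacher_sample) ** 2).mean()
gan_loss = -np.log(1e-8 + 0.5 * (discriminator(x) + 1.0))
total_loss = distill_loss + 0.1 * gan_loss
```

The key design point mirrored here is that `cond` is never a ground-truth frame during the rollout, which is what keeps the few-step student's training and inference distributions aligned.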
All conditioning modalities (text, image frames, video clips, binary masks) are unified through a Video Condition Unit (VCU) that encodes them as a token triplet (T;F;M). These tokens are processed by the frozen WAN‑DiT backbone and the VACE‑DiT adapters, enabling VDOT to support five canonical tasks—text‑to‑video, reference‑to‑video, video‑to‑video, masked video‑to‑video, and composite tasks—within a single model.
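The (T; F; M) triplet can be pictured as a small container that concatenates the three token streams into one sequence for the backbone. The class name, field names, and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VCU:
    """Hypothetical sketch of a Video Condition Unit: the (T; F; M) triplet
    of text tokens, frame tokens, and mask tokens, all with hidden size d."""
    text: np.ndarray    # (n_text, d)
    frames: np.ndarray  # (n_frame, d)
    masks: np.ndarray   # (n_mask, d)

    def tokens(self) -> np.ndarray:
        # Concatenate the triplet into one sequence for the DiT backbone.
        return np.concatenate([self.text, self.frames, self.masks], axis=0)

d = 8
unit = VCU(
    text=np.zeros((3, d)),
    frames=np.ones((5, d)),
    masks=np.full((2, d), 0.5),
)
seq = unit.tokens()  # one (10, d) sequence fed to the frozen backbone
```

Unifying every conditioning modality into one token sequence is what lets a single frozen backbone plus adapters cover all five task families.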
To train such a model at scale, the authors build a fully automated data pipeline: 4K‑resolution videos are scraped, dense captions are generated with vision‑language models, task‑aware filtering removes low‑quality samples, and a ranking stage selects the most diverse and informative clips. The resulting dataset feeds into a new benchmark, UVCBench, which comprises 18 generation/editing tasks, each with 20 representative test cases, providing both objective metrics (FVD, IS, CLIP‑Score) and human preference evaluations.
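The pipeline's stages (caption, filter, rank) can be sketched as a chain of small functions over toy clip records. The field names, thresholds, and scoring rule are invented for illustration; the actual pipeline uses vision-language models and task-aware criteria not reproduced here.

```python
# Toy clip records standing in for scraped 4K videos.
raw_clips = [
    {"id": 0, "res": 2160, "caption": None, "quality": 0.9},
    {"id": 1, "res": 720,  "caption": None, "quality": 0.8},
    {"id": 2, "res": 2160, "caption": None, "quality": 0.3},
]

def add_caption(clip):
    # Stand-in for a vision-language model producing a dense caption.
    return dict(clip, caption=f"dense caption for clip {clip['id']}")

def passes_filter(clip, min_res=1080, min_quality=0.5):
    # Stand-in for task-aware filtering of low-quality samples.
    return clip["res"] >= min_res and clip["quality"] >= min_quality

def rank_clips(clips):
    # Stand-in for the ranking stage selecting the most informative clips.
    return sorted(clips, key=lambda c: c["quality"], reverse=True)

captioned = [add_caption(c) for c in raw_clips]
dataset = rank_clips([c for c in captioned if passes_filter(c)])
```

Only clip 0 survives here: clip 1 fails the resolution gate and clip 2 the quality gate, mimicking how the real pipeline prunes scraped data before it reaches training.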
Empirical results on UVCBench show that the 4‑step VDOT matches or exceeds the performance of 100‑step baselines while being 4–5× faster at inference. Notably, VDOT excels on composite tasks that combine multiple modalities, where previous unified models struggle. Ablation studies confirm that the OT regularizer stabilizes training and improves diversity, and that the adversarial component boosts perceptual quality.
In summary, VDOT is the first work to apply optimal‑transport‑based geometric constraints within distribution‑matching distillation for video generation, and to combine this with adversarial training in a unified, few‑step setting. The approach delivers a practical solution for real‑time, multi‑task video creation, and the accompanying data pipeline and benchmark are likely to become valuable resources for future AIGC research.