Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many others fine-tune a pretrained T2V model for image-to-video (I2V), video-to-video (V2V), and image and video manipulation tasks. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or a few tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. This joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation compared to open-source and even commercial engines. Our models and source code are available at https://github.com/leeruibin/MfM.git.


💡 Research Summary

The paper introduces Many‑for‑Many (MfM), a unified diffusion‑transformer framework that can simultaneously handle more than ten visual generation and manipulation tasks, including text‑to‑video (T2V), image‑to‑video (I2V), video‑to‑video (V2V), video extension, first‑last‑frame‑to‑video, first‑last‑clip‑to‑video, inpainting/outpainting, colorization, style transfer, single‑image super‑resolution (SISR), and video super‑resolution (VSR). The authors argue that existing approaches either train a separate model per task or fine‑tune a T2V foundation model for downstream tasks, which requires massive high‑quality annotations and limits the number of tasks a single model can perform.

To overcome these limitations, MfM adopts four key technical components. First, a lightweight “adapter” normalizes heterogeneous conditioning inputs—0‑D (timestep, motion score), 1‑D (text), 2‑D (image, mask), and 3‑D (video, depth maps)—into a unified 3‑D latent space. The adapter consists of several convolutional layers and down‑sampling blocks that reshape any pixel‑based condition (including depth maps) to match the spatial‑temporal resolution of the video VAE’s latent representation, after which it is simply added to the latent features.
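The adapter idea above can be sketched numerically: a pixel-space condition is down-sampled to the spatio-temporal grid of the video VAE latent and then added element-wise. This is a minimal numpy sketch, not the paper's implementation; the 4× temporal and 8× spatial compression factors are assumptions, and simple average pooling stands in for the adapter's learned convolutional and down-sampling layers.

```python
import numpy as np

# Hypothetical VAE compression factors (assumed, not from the paper).
T_FACTOR, S_FACTOR = 4, 8

def pool3d(x, ft, fs):
    """Average-pool a (T, H, W, C) tensor by (ft, fs, fs) blocks,
    standing in for the adapter's strided conv / down-sampling layers."""
    t, h, w, c = x.shape
    x = x.reshape(t // ft, ft, h // fs, fs, w // fs, fs, c)
    return x.mean(axis=(1, 3, 5))

def adapt_and_fuse(condition, latent):
    """Reshape a pixel-space condition to the latent grid and add it."""
    cond_latent = pool3d(condition, T_FACTOR, S_FACTOR)
    assert cond_latent.shape[:3] == latent.shape[:3]
    return latent + cond_latent

cond = np.random.randn(16, 256, 256, 16)  # pixel-space condition (e.g. a depth map)
lat = np.random.randn(4, 32, 32, 16)      # video VAE latent
fused = adapt_and_fuse(cond, lat)
print(fused.shape)  # (4, 32, 32, 16)
```

In the real model the adapter would also project the condition's channel count to the latent's; here the channels are kept equal to keep the sketch short.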

Second, the backbone is a Diffusion Transformer (DiT) equipped with full 3‑D attention. The authors extend Rotary Position Embedding (RoPE) to three dimensions (temporal, height, width) and allocate hidden‑state channels proportionally (2/8 for time, 3/8 each for height and width). This 3‑D RoPE enables the model to handle videos of varying lengths and resolutions without additional architectural changes. To stabilize training of large transformers, they incorporate RMSNorm and Query‑Key Normalization (QK‑Norm), following recent findings on attention entropy growth.
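The 2/8 : 3/8 : 3/8 channel split can be made concrete with a small sketch: each attention head's channels are partitioned across the three axes, and a token's rotary angles are the concatenation of per-axis angles. This is an illustrative numpy sketch under the stated split; the head dimension, RoPE base, and positions below are assumptions, not the paper's exact values.

```python
import numpy as np

def split_rope_channels(head_dim):
    """2/8 of channels rotate with time, 3/8 each with height and width."""
    d_t = head_dim * 2 // 8
    d_h = head_dim * 3 // 8
    d_w = head_dim - d_t - d_h
    return d_t, d_h, d_w

def axis_angles(pos, dim, base=10000.0):
    """Rotary angles for one axis: pos * base**(-2i/dim), i = 0..dim/2-1."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)  # shape (len(pos), dim // 2)

d_t, d_h, d_w = split_rope_channels(128)
print(d_t, d_h, d_w)  # 32 48 48

# Angles for a token at position (t=2, h=5, w=7) concatenate the three axes:
ang = np.concatenate([
    axis_angles(np.array([2]), d_t),
    axis_angles(np.array([5]), d_h),
    axis_angles(np.array([7]), d_w),
], axis=-1)
print(ang.shape)  # (1, 64) -- one angle per rotated channel pair
```

Because each axis contributes its own angles, the same scheme extends to any video length or resolution without changing the architecture, which is the property the summary highlights.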

Third, training follows the Flow Matching paradigm (Rectified Flow). At each step a noisy sample Xₜ is generated by linear interpolation between a clean video X₀ and Gaussian noise ε, and the model is trained to predict the ground‑truth velocity Vₜ = ε – X₀. The loss is the L2 distance between the predicted velocity and Vₜ, conditioned on timestep, motion score, text prompt, and the unified 3‑D condition. This approach yields faster convergence and higher sampling efficiency compared with classic DDPM.
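The rectified-flow training target described above is simple enough to write out directly: the noisy sample is a linear blend of clean data and noise, and the regression target is the constant velocity between them. A minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

def make_training_pair(x0, eps, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * eps, with the
    ground-truth velocity v = eps - x0 as the regression target."""
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """L2 (mean squared) distance between predicted and true velocity."""
    return np.mean((v_pred - v_target) ** 2)

x0 = np.random.randn(2, 4, 8, 8)   # clean latents
eps = np.random.randn(2, 4, 8, 8)  # Gaussian noise
x_t, v = make_training_pair(x0, eps, t=0.3)

print(fm_loss(v, v))  # 0.0 -- a perfect predictor has zero loss
print(np.allclose(make_training_pair(x0, eps, 1.0)[0], eps))  # True: t=1 is pure noise
```

In the full model, the velocity predictor is the DiT conditioned on the timestep, motion score, text prompt, and the unified 3-D condition; the sketch only shows the target construction and loss.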

Fourth, the authors devise a progressive joint image‑video learning schedule. Training begins with pure text‑image pairs to align textual semantics with visual features, then gradually introduces video data while decreasing the image‑to‑video ratio to 0.1. This strategy expands the effective training corpus, allowing the same model to excel at image‑centric tasks (T2I, SISR) and video‑centric tasks. For each training step, a video sample is drawn, a set of compatible tasks is identified, and one task is randomly selected to construct the conditioning inputs via the adapter.
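The schedule above can be sketched as a ratio that starts image-only and anneals down to 0.1, plus a per-step task draw. The phase boundaries and linear annealing below are illustrative assumptions; only the 1.0 starting point and 0.1 floor come from the summary.

```python
import random

def image_ratio(step, warmup_steps=10_000, anneal_steps=50_000, floor=0.1):
    """Probability of drawing an image batch at a given training step
    (hypothetical schedule: image-only warmup, then linear anneal)."""
    if step < warmup_steps:
        return 1.0                            # text-image pairs only
    frac = min(1.0, (step - warmup_steps) / anneal_steps)
    return floor + (1.0 - frac) * (1.0 - floor)

def sample_task(compatible_tasks, rng=random):
    """Each step: pick one task compatible with the drawn sample, then
    build its conditioning inputs via the adapter."""
    return rng.choice(compatible_tasks)

print(image_ratio(0))       # 1.0
print(image_ratio(60_000))  # 0.1
print(sample_task(["T2V", "I2V", "V2V", "colorization"]))
```

The key point is that one sample can serve several tasks (e.g. a video clip supports T2V, I2V, V2V, and colorization targets), so random task selection per step multiplies the effective training signal.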

Two model sizes are released: a 2 B‑parameter version (28 layers, 28 attention heads, hidden dimension 1792) and an 8 B‑parameter version (40 layers, 48 heads, hidden dimension 3072). Both are trained on 120 M–160 M samples (text‑image and text‑video pairs). Remarkably, the 8 B model achieves competitive or superior video quality compared with state‑of‑the‑art open‑source T2V foundations such as CogVideoX, MovieGen, and StepVideo, while using roughly only 10 % of the data those models consume. Depth maps are incorporated as an additional condition, improving the model’s perception of 3‑D structure and yielding better results on scenes with significant depth variation.
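For reference, the two released configurations can be written out as a small config table; the layer, head, and hidden-dimension numbers are the ones stated above, while the per-head dimension is derived here (hidden_dim / heads), not quoted from the paper.

```python
# Reported MfM configurations; head_dim is derived, not stated.
CONFIGS = {
    "MfM-2B": {"layers": 28, "heads": 28, "hidden_dim": 1792},
    "MfM-8B": {"layers": 40, "heads": 48, "hidden_dim": 3072},
}

for name, cfg in CONFIGS.items():
    head_dim = cfg["hidden_dim"] // cfg["heads"]
    print(name, "head_dim =", head_dim)
# MfM-2B head_dim = 64
# MfM-8B head_dim = 64
```

Interestingly, both sizes keep the same 64-channel head dimension and scale by adding layers and heads, a common pattern in transformer families.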

Extensive experiments—including quantitative metrics (FVD, IS) and qualitative visual comparisons—demonstrate that MfM not only matches specialized models on individual tasks but also benefits from multi‑task training: video generation quality improves thanks to the large amount of image data that would otherwise be unusable for pure T2V training. The code, pretrained checkpoints, and inference pipelines are publicly released at https://github.com/leeruibin/MfM.git, facilitating reproducibility and further research.

In summary, Many‑for‑Many presents a practical and scalable solution for unified visual generation and manipulation. By standardizing conditions with a lightweight adapter, employing full 3‑D attention with RoPE, leveraging Flow Matching, and jointly training on image and video data, the framework dramatically reduces the annotation burden while delivering a single model capable of a broad spectrum of tasks. This work sets a new direction for multi‑task diffusion models and opens avenues for more efficient, versatile generative AI systems.

