MTC-VAE: Multi-Level Temporal Compression with Content Awareness
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous VAEs, higher compression rates are desirable; yet efficiency declines notably when extra sampling layers are added without expanding the hidden-channel dimensions. In this paper, we present a technique to convert fixed-compression-rate VAEs into models that support multi-level temporal compression, together with a straightforward, minimal fine-tuning approach that counteracts the performance decline at higher compression rates. Moreover, we examine how varying compression levels affect performance on video segments with diverse characteristics, offering empirical evidence for the effectiveness of our approach. We also integrate our multi-level temporal compression VAE with a diffusion-based generative model, DiT, demonstrating successful joint training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.
💡 Research Summary
The paper introduces MTC‑VAE (Multi‑level Temporal Compression VAE), a method that converts a fixed‑rate video VAE into a model capable of adaptive, segment‑wise temporal compression with only minimal fine‑tuning. The authors observe that current continuous VAEs used in latent video diffusion models (LVDMs) compress each video frame independently in time, leading to unnecessary computational overhead for static or slowly moving segments and insufficient quality for fast‑moving parts. To address this, MTC‑VAE first splits an input video into equal‑length segments. For each segment a compression rate cᵢ is selected from {4, 8, 16} by maximizing a score function that balances reconstruction quality (measured by PSNR) against the logarithmic gain of higher compression. The score incorporates a per‑segment tolerance factor α(xᵢ) that favors higher compression when the segment’s average quality is high and its quality variance across rates is low.
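The rate-selection step can be sketched as a small search over the candidate rates. The summary does not give the exact score formula, so the sketch below assumes a simple form: PSNR plus a tolerance-weighted logarithmic compression gain, with `alpha` standing in for the per-segment factor α(xᵢ); the real score and α(xᵢ) computation may differ.

```python
import math

def select_rate(psnr, alpha, rates=(4, 8, 16)):
    """Pick the compression rate c maximizing a score that trades
    reconstruction quality against the logarithmic gain of compression.

    psnr  : dict mapping each candidate rate to the segment's PSNR at that rate
    alpha : per-segment tolerance factor (hypothetical scalar stand-in
            for the paper's alpha(x_i))
    """
    # Higher alpha favors more aggressive compression; alpha = 0
    # reduces to picking the best-PSNR (lowest) rate.
    return max(rates, key=lambda c: psnr[c] + alpha * math.log2(c))
```

With a segment whose PSNR degrades slowly across rates, a larger `alpha` tips the choice toward 16×, matching the description that high average quality and low variance across rates favor higher compression.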
The encoder then applies a variable number of temporal sampling layers according to the chosen cᵢ. New layers are initialized from the nearest existing layers of the pretrained VAE, ensuring that the model’s spatial capacity remains unchanged. An additional embedding f_c is added to the keyframe (the first frame of each segment) to signal the compression level. All latent segments are concatenated into a single latent sequence Z, which is the input for the diffusion transformer.
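A toy sketch of the segment encoder's shape behavior, assuming the number of temporal sampling layers is log₂(cᵢ) and using stride-2 average pooling as a stand-in for the learned (and newly initialized) sampling layers; `f_c` is the compression-level embedding added to the keyframe latent. This is illustrative only, not the paper's architecture.

```python
import numpy as np

def encode_segment(frames, c, f_c):
    """Toy temporal encoder: apply log2(c) stride-2 temporal halvings
    (average pooling stands in for learned sampling layers), then add
    the compression-level embedding f_c to the keyframe latent.

    frames : (T, D) array with T divisible by c
    c      : chosen temporal compression rate for this segment
    f_c    : (D,) embedding signaling the compression level
    """
    z = frames
    for _ in range(int(np.log2(c))):
        z = 0.5 * (z[0::2] + z[1::2])  # temporal downsample by 2
    z = z.copy()
    z[0] += f_c  # mark the keyframe (first latent frame) with the rate embedding
    return z
```

Per-segment outputs `encode_segment(x_i, c_i, f_c[c_i])` would then be concatenated along the temporal axis to form the latent sequence Z fed to the diffusion transformer.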
During decoding, the challenge is to recover the original temporal resolution despite heterogeneous compression rates. The authors propose a lightweight “keyframe predictor” P that scans Z and classifies each latent frame as a keyframe or not. The detected keyframes delineate the boundaries of the original segments, allowing each latent chunk zᵢ to be decoded with its specific rate ĉᵢ. This binary classification approach avoids the heavy parameter cost of a full multi‑class predictor while achieving accurate segmentation.
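Once the predictor has labeled each latent frame, recovering the segment boundaries is a simple scan: start a new chunk at every detected keyframe. A minimal sketch of that splitting step (the binary labels `is_key` are assumed to come from the predictor P):

```python
def split_by_keyframes(Z, is_key):
    """Split the latent sequence Z into per-segment chunks z_i.
    A chunk runs from one detected keyframe up to (but not including)
    the next; each chunk is then decoded with its own rate c_hat_i.

    Z      : sequence of latent frames
    is_key : binary labels (1 = keyframe) predicted for each latent frame
    """
    chunks, current = [], []
    for z, k in zip(Z, is_key):
        if k and current:   # a new keyframe closes the previous chunk
            chunks.append(current)
            current = []
        current.append(z)
    if current:
        chunks.append(current)
    return chunks
```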
Training proceeds in two stages. Stage 1 fine‑tunes the VAE with the newly added sampling layers using a combination of reconstruction, KL, and adversarial losses (identical to OD‑VAE). Stage 2 trains the keyframe embedding f_c and the predictor P with binary cross‑entropy loss, and introduces a flow‑guided consistency loss L_flow = L_quality + L_motion. Optical flow Δf computed by a pretrained flow model is used to warp the reconstructed frames; the L1 distance between warped output and ground truth enforces motion consistency, while an additional L1 term penalizes visual artifacts. An exponential moving average (EMA) with decay 0.999 stabilizes training.
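The flow-guided consistency loss can be sketched as two L1 terms, assuming (as the description suggests) that L_quality penalizes visual artifacts against the ground-truth frame and L_motion compares the flow-warped reconstruction to the next ground-truth frame. The warping itself (applying Δf from the pretrained flow model) is assumed to happen upstream of this function.

```python
import numpy as np

def flow_guided_loss(recon, target, recon_warped, target_next):
    """Sketch of L_flow = L_quality + L_motion with both terms as L1
    distances (an assumption; the paper's exact weighting may differ).

    recon        : reconstructed frame
    target       : ground-truth frame for recon
    recon_warped : reconstruction warped forward by optical flow
    target_next  : ground-truth frame at the warped-to time step
    """
    l_quality = np.abs(recon - target).mean()        # penalize visual artifacts
    l_motion = np.abs(recon_warped - target_next).mean()  # enforce motion consistency
    return l_quality + l_motion
```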
Extensive experiments on WebVid‑10M and Panda‑70M validate the approach. Compared to state‑of‑the‑art VAEs (e.g., CogVideoVAE), MTC‑VAE achieves up to a 92 % increase in compression rate (VCPR) while degrading PSNR by only 0.03 dB, SSIM by 0.0027, and LPIPS by 0.0015. Table 1 shows that with a 16× temporal compression the model reaches PSNR 35.10 dB and SSIM 0.9296, matching or surpassing the quality of lower‑compression baselines.
The authors also integrate MTC‑VAE with a DiT‑based diffusion model. By feeding the multi‑level compressed latents into DiT during fine‑tuning, they demonstrate two practical benefits: (1) the same hardware can now generate longer videos (e.g., 1‑minute 1080p) without exceeding memory limits, and (2) for videos of equal duration, computational cost drops by 45 %–68 % while preserving generation fidelity.
Key contributions are: (i) a simple, low‑overhead conversion of fixed‑rate VAEs to multi‑level temporal compression, (ii) a principled segment‑wise compression‑rate selection based on PSNR‑driven scoring, (iii) an efficient keyframe predictor that restores temporal ordering, (iv) a flow‑guided loss that balances visual quality and motion consistency, and (v) successful coupling with transformer‑based diffusion generators. MTC‑VAE thus offers a practical pathway to substantially improve memory and compute efficiency in video generation pipelines without sacrificing output quality, opening avenues for high‑resolution, long‑duration video synthesis in both research and production settings.