Unified Latents (UL): How to train your latents
We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder’s output noise to the prior’s minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR ≈ 30 dB), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
💡 Research Summary
Unified Latents (UL) introduces a principled framework for learning latent representations that are simultaneously regularized by a diffusion prior and decoded by a diffusion model. The authors identify two longstanding issues in latent diffusion models: the need for manually tuned KL‑weighting in VAE‑style encoders, and the loss of high‑frequency detail in reconstructions. UL resolves these by tightly coupling the encoder’s output noise to the minimum noise level of the diffusion prior, thereby turning the KL term into a simple weighted mean‑squared error over noise levels and providing an explicit upper bound on the bits per dimension (bpd) of the latent space.
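The bitrate claim can be made concrete with the standard continuous-time diffusion ELBO; the sketch below uses the generic form of that bound, and the exact constants are illustrative rather than taken from the paper:

```latex
% Continuous-time diffusion ELBO as an upper bound on the latent's negative log-likelihood:
-\log p_\theta(\mathbf{z}_0) \;\le\;
\underbrace{\int_0^1 \frac{d\lambda_z}{dt}\, e^{\lambda_z}\,
\bigl\lVert \mathbf{z}_{\mathrm{clean}} - \hat{\mathbf{z}}_\theta(\mathbf{z}_t)\bigr\rVert^2 \, dt}_{\mathcal{L}_z}
\;+\; \mathrm{const.}
% Dividing by d \ln 2 (with d the latent dimension) converts nats to bits per dimension:
\mathrm{bpd}(\mathbf{z}_0) \;\le\; \frac{\mathcal{L}_z + \mathrm{const.}}{d \ln 2}
```

Because the prior loss is itself this weighted mean-squared error, minimizing it directly tightens the bound on the latent bitrate.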
The encoder E_θ maps an image x to a deterministic latent z_clean. This latent is then forward-noised to a fixed log-SNR λ₀ = 5 (σ ≈ 0.08), yielding z₀. The diffusion prior P_θ models the trajectory from pure noise z₁ to the slightly noisy z₀. Its loss is the continuous ELBO integral ∫₀¹ (dλ_z/dt) · e^{λ_z} · ‖z_clean − ẑ_θ(z_t)‖² dt, with unit weighting w(λ_z) = 1 so that low-noise levels are not overly penalized. The diffusion decoder D_θ operates in image space, conditioning on both the noisy image x_t and the latent z₀. Its reconstruction loss uses a sigmoid-weighted ELBO, w(λ_x) = sigmoid(λ_x − b), which emphasizes high-frequency details while allowing the latent to carry a controllable amount of information. Two hyper-parameters, the loss factor c_lf and the bias b, directly regulate the trade-off between latent bitrate and reconstruction fidelity.
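The fixed noise level and the two weightings above can be sketched in a few lines. This assumes the common variance-preserving parameterization α² = sigmoid(λ), σ² = sigmoid(−λ), which is consistent with the σ ≈ 0.08 quoted for λ₀ = 5; the function names are ours:

```python
import math

def logsnr_to_alpha_sigma(logsnr):
    """Variance-preserving mapping (assumed): alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma

def forward_noise(z_clean, logsnr, eps):
    """Forward-noise a latent to a given log-SNR: z_lambda = alpha * z_clean + sigma * eps."""
    alpha, sigma = logsnr_to_alpha_sigma(logsnr)
    return [alpha * z + sigma * e for z, e in zip(z_clean, eps)]

def sigmoid_weight(logsnr, b):
    """Decoder ELBO weight from the text: w(lambda_x) = sigmoid(lambda_x - b)."""
    return 1.0 / (1.0 + math.exp(-(logsnr - b)))

# At the minimum latent noise level lambda_0 = 5, sigma comes out near 0.08:
_, sigma0 = logsnr_to_alpha_sigma(5.0)
```

Under this parameterization, raising λ₀ shrinks σ and lets z₀ carry more information; the bias b shifts which image-space noise levels the decoder loss emphasizes.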
Training proceeds in two stages. Stage 1 jointly optimizes encoder, prior, and decoder using the combined loss L = Lz + Lx. This yields latents that are easy for the diffusion prior to model yet retain sufficient information density. Stage 2 freezes the encoder and retrains the prior as a “base model” (a multi‑stage ViT) with the same fixed noise schedule. This step addresses the observation that a prior trained with ELBO weighting alone produces sub‑optimal samples; the base model, trained with sigmoid weighting, improves sample quality without altering the latent distribution.
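The stage-1 objective can be sketched as one weighted-ELBO helper applied twice, with unit weighting for the prior term and sigmoid weighting for the decoder term. This is a minimal single-sample illustration: the `denoiser` callables, the Monte-Carlo treatment of λ, and the placement of c_lf on the decoder term are our assumptions, not the paper's exact formulation:

```python
import math
import random

def weighted_elbo_term(denoiser, target, lam, weight_fn):
    """One Monte-Carlo sample of a weighted continuous-time ELBO loss.

    `denoiser(noised, lam)` is a placeholder predicting the clean signal;
    e^{lam} is the standard ELBO factor, and `weight_fn` supplies w(lam):
    unit for the prior, sigmoid(lam - b) for the decoder.
    """
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-lam)))  # variance-preserving (assumed)
    sigma = math.sqrt(1.0 / (1.0 + math.exp(lam)))
    noised = [alpha * t + sigma * random.gauss(0.0, 1.0) for t in target]
    pred = denoiser(noised, lam)
    mse = sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)
    return weight_fn(lam) * math.exp(lam) * mse

def stage1_loss(prior, decoder, z_clean, x_clean, lam_z, lam_x, c_lf, b):
    """Combined loss L = L_z + c_lf * L_x (c_lf placement is an assumption)."""
    L_z = weighted_elbo_term(prior, z_clean, lam_z, lambda lam: 1.0)
    L_x = weighted_elbo_term(
        decoder, x_clean, lam_x,
        lambda lam: 1.0 / (1.0 + math.exp(-(lam - b))),
    )
    return L_z + c_lf * L_x
```

A perfect denoiser drives both terms to zero; in training, λ is sampled per step, gradients flow into the encoder through z_clean during stage 1, and in stage 2 the encoder is frozen while the prior is retrained with sigmoid weighting.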
Empirically, UL is evaluated on ImageNet‑512 and Kinetics‑600. On ImageNet‑512, a 4×4 latent grid (≈1 M parameters) achieves FID 1.4, PSNR ≈30 dB, and an estimated 3.2 bpd, while requiring roughly 30 % fewer FLOPs than models trained on Stable Diffusion latents. On Kinetics‑600, UL sets a new state‑of‑the‑art FVD of 1.3. Scaling studies on large text‑to‑image and text‑to‑video datasets show that UL consistently delivers higher sample quality for a given compute budget.
Compared to prior work, UL differs from VAE‑GAN hybrids (e.g., Latent Diffusion Model) by providing a mathematically grounded bitrate control, and from diffusion‑AE variants (DiffuseVAE, ε‑VAE) by preserving high‑frequency content through the weighted decoder loss. Unlike methods that rely on pretrained semantic encoders (DINO, SigLIP), UL learns latents end‑to‑end purely from data, making it broadly applicable across modalities.
In summary, Unified Latents offers a clean, two‑term objective (prior loss + decoder loss) that eliminates the need for ad‑hoc KL weighting, supplies an explicit information‑theoretic bound on latents, and yields state‑of‑the‑art generation quality with reduced training cost. Its design principles—aligning encoder noise with prior precision, using weighted ELBOs for both prior and decoder, and employing a two‑stage training pipeline—provide a compelling blueprint for future high‑resolution image and video generation, compression, and latent‑based transfer learning.