One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256×256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches a state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.


💡 Research Summary

The paper introduces Feature Auto‑Encoder (FAE), a minimalist framework that bridges high‑dimensional, understanding‑oriented visual features from pretrained encoders with the low‑dimensional latent spaces required by modern generative models such as diffusion models and normalizing flows. The authors first identify a fundamental mismatch: self‑supervised vision encoders (e.g., DINO, SigLIP) produce rich, high‑dimensional embeddings that capture many hypotheses about masked regions, whereas generative models need compact latents that can faithfully propagate injected noise during sampling. Prior attempts to reconcile these representations have relied on elaborate loss functions (contrastive, KL‑divergence) and deep transformation networks, which increase computational cost and often destabilize training.

FAE resolves this by compressing the encoder's features into a low‑dimensional latent with as little as a single attention layer, then coupling two separate deep decoders on top of that latent. The first decoder is trained solely to reconstruct the original feature space, preserving the semantic richness of the pretrained encoder. The second decoder takes the reconstructed features as input and decodes them into pixels for image generation. Because the compression itself is so shallow, the latent retains the diversity of hypotheses encoded in the high‑dimensional features while remaining compact enough to faithfully preserve the injected noise that diffusion models rely on.
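The compression step above can be sketched with a single self‑attention layer whose value projection reduces the channel dimension. This is a minimal NumPy illustration, not the paper's implementation; all shapes, the class name `OneLayerAdapter`, and the choice of putting the reduction in the value projection are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class OneLayerAdapter:
    """Hypothetical sketch: one attention layer that maps N high-dimensional
    encoder tokens (N, enc_dim) to N low-dimensional latent tokens (N, lat_dim).
    The dimensionality reduction happens in the value projection Wv."""

    def __init__(self, enc_dim=768, lat_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        s = enc_dim ** -0.5
        self.Wq = rng.normal(0.0, s, (enc_dim, enc_dim))
        self.Wk = rng.normal(0.0, s, (enc_dim, enc_dim))
        self.Wv = rng.normal(0.0, s, (enc_dim, lat_dim))  # projects down to lat_dim

    def __call__(self, feats):
        # feats: (N, enc_dim) tokens from a frozen encoder (e.g., DINO, SigLIP).
        q = feats @ self.Wq
        k = feats @ self.Wk
        v = feats @ self.Wv                               # (N, lat_dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, N) token mixing
        return attn @ v                                   # (N, lat_dim) latents
```

In this reading, the feature-reconstruction decoder and the pixel decoder are both deep networks trained on top of these latents; only the adapter itself is a single layer.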

The framework is encoder‑agnostic (compatible with DINO, SigLIP, etc.) and generator‑agnostic (works with both diffusion models and normalizing flows). Experiments cover class‑conditional generation on ImageNet at 256×256 resolution as well as text‑to‑image benchmarks. With classifier‑free guidance (CFG), a diffusion model trained for 800 epochs achieves an FID of 1.29, and even after only 80 epochs it reaches 1.70. Without CFG, FAE attains state‑of‑the‑art FIDs of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating that high quality can be obtained quickly. Ablation studies show that adding more attention layers yields negligible gains, confirming that a single layer suffices. The method also scales to normalizing‑flow generators, where similar improvements in sample quality and training speed are observed.
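To make the generator-agnostic claim concrete, a downstream diffusion model simply treats the compact FAE latents as its data. The toy step below is illustrative only: the linear noise schedule, the epsilon-prediction loss, and the function name `diffusion_step_loss` are assumptions, not the paper's training recipe.

```python
import numpy as np

def diffusion_step_loss(latents, denoiser, rng):
    """One toy epsilon-prediction training objective evaluated on compact
    latents of shape (batch, lat_dim). Illustrative sketch only."""
    t = rng.uniform(0.01, 0.99, size=(latents.shape[0], 1))   # per-sample timestep
    noise = rng.normal(size=latents.shape)                    # injected Gaussian noise
    noisy = np.sqrt(1.0 - t) * latents + np.sqrt(t) * noise   # assumed simple schedule
    pred = denoiser(noisy, t)                                 # model predicts the noise
    return np.mean((pred - noise) ** 2)
```

The point of the low-dimensional latent is visible here: the injected `noise` must survive the round trip through the representation, which is easier when the latent space is compact rather than the encoder's full high-dimensional feature space.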

Limitations include the relatively large parameter count of the first reconstruction decoder and the lack of extensive evaluation on very high‑resolution images (>512×512). Future work will explore lightweight reconstruction modules and strategies for scaling FAE to ultra‑high‑resolution synthesis.

In summary, FAE provides a simple yet powerful solution to adapt pretrained visual representations for image generation. By decoupling reconstruction from latent compression and using only one attention layer, it achieves fast convergence, competitive FID scores, and broad applicability across encoder and generator families, marking a significant step toward more efficient and versatile generative modeling.

