UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks the reconstruction abilities required of a unified visual encoder, and previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing that its continuous features not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.


💡 Research Summary

UniLIP presents a novel unified framework that extends the CLIP visual encoder beyond pure understanding to support high‑fidelity image reconstruction, text‑guided generation, and instruction‑based editing. The authors identify two fundamental shortcomings of existing CLIP‑based unified models: (1) a trade‑off between semantic understanding and pixel‑level reconstruction, and (2) insufficient utilization of CLIP features in generation/editing pipelines, often leading to inconsistent outputs. To address (1), UniLIP adopts a two‑stage training regime. In Stage 1 the CLIP backbone is frozen while a lightweight pixel decoder and a projection head are trained with a combination of MSE and LPIPS losses, extracting the weak pixel cues already present in CLIP embeddings. In Stage 2 the CLIP encoder is fine‑tuned, but a self‑distillation loss (the L2 distance between frozen‑CLIP and fine‑tuned features) is added, together with a reduced learning rate (0.1×) for the encoder. This prevents catastrophic forgetting and preserves the original semantic space while enabling the encoder to emit richer, reconstructable features. Empirically, this yields a PSNR of 24.93 dB and SSIM of 0.788—substantially higher than prior CLIP‑quantization or diffusion‑conditioned baselines—while also improving understanding metrics such as MME‑P, TextVQA, and AI2D.
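The Stage 2 objective described above (pixel reconstruction plus an L2 self-distillation anchor to the frozen encoder) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `stage2_loss` and the toy shapes are hypothetical, the LPIPS term is omitted (it requires a pretrained perceptual network), and the distillation weight is fixed at λ = 1 as the summary's limitations section notes.

```python
import numpy as np

def stage2_loss(recon, target, feat_tuned, feat_frozen, lam=1.0):
    """Hypothetical sketch of the Stage-2 objective: pixel-level MSE
    reconstruction plus a self-distillation term that keeps the
    fine-tuned CLIP features close to the frozen encoder's output.
    (The paper's LPIPS perceptual term is omitted for brevity.)"""
    mse = np.mean((recon - target) ** 2)                # pixel reconstruction
    distill = np.mean((feat_tuned - feat_frozen) ** 2)  # L2 self-distillation
    return mse + lam * distill

rng = np.random.default_rng(0)
img = rng.random((3, 32, 32))                      # toy "ground-truth" image
recon = img + 0.1 * rng.standard_normal(img.shape)  # noisy toy reconstruction
feat = rng.random((16, 64))                        # toy CLIP feature grid

# Identical tuned/frozen features -> the distillation term vanishes,
# so only the reconstruction error remains.
loss = stage2_loss(recon, img, feat, feat)
```

In practice the encoder would also be updated with a 0.1× learning rate, as the summary describes; the distillation term above is what prevents the fine-tuned features from drifting out of CLIP's original semantic space.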

For (2), the paper builds on the MetaQuery paradigm but introduces a dual‑condition architecture. Instead of relying solely on a fixed set of learnable query tokens, UniLIP concatenates (i) the multimodal hidden states produced by a frozen Multimodal Large Language Model (MLLM) and (ii) the query embeddings that encode the LLM’s reasoning results. This combined conditioning is fed to a diffusion transformer (DiT) that generates or edits images. The hidden states carry the continuous CLIP‑derived visual information (pixel‑level details), while the query embeddings inject high‑level semantic intent (e.g., “replace the cake with two smaller cakes”). The design eliminates the information bottleneck observed in prior query‑only methods, leading to better prompt alignment in generation and higher edit consistency.
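The dual-condition scheme amounts to concatenating two token sequences before they reach the diffusion transformer. The sketch below is a hypothetical illustration (the function name `build_dit_condition` and the token counts are invented for the example); it shows only the conditioning-sequence construction, not the DiT itself.

```python
import numpy as np

def build_dit_condition(hidden_states, query_embeds):
    """Hypothetical sketch of dual conditioning: concatenate the MLLM's
    multimodal hidden states (contextual/visual detail) with the learnable
    query embeddings (high-level reasoning intent) along the token axis,
    forming the conditioning sequence fed to the diffusion transformer."""
    assert hidden_states.shape[-1] == query_embeds.shape[-1], \
        "both streams must share the model's embedding dimension"
    return np.concatenate([hidden_states, query_embeds], axis=0)

hidden = np.zeros((256, 1024))  # e.g. 256 multimodal hidden-state tokens
queries = np.ones((64, 1024))   # e.g. 64 learnable query tokens
cond = build_dit_condition(hidden, queries)  # 320 conditioning tokens
```

Because the hidden states pass through unchanged alongside the queries, the DiT is no longer limited to whatever a small fixed set of query tokens can carry, which is the information bottleneck the summary attributes to query-only conditioning.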

UniLIP is instantiated on InternViT (the InternVL3 visual encoder) and trained on 40 M publicly available images. The model sizes are modest: per the abstract, UniLIP is released in 1 B and 3 B parameter variants. Despite this compactness, UniLIP outperforms larger unified models such as BAGEL (7 B) and UniWorld‑V1 (12 B) on three key benchmarks: GenEval (0.90), WISE (0.63), and ImgEdit (3.94). Notably, the system achieves a 32× compression ratio relative to raw pixels, yet the decoder can reconstruct images directly without an auxiliary VAE, demonstrating the efficacy of the learned continuous representation.
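The 32× compression figure can be made concrete with simple arithmetic. The snippet below is an illustration under an assumed 32× spatial downsampling (the input resolution and helper name are hypothetical, not taken from the paper):

```python
def token_count(h, w, downsample=32):
    """Number of continuous tokens produced at an assumed 32x spatial
    downsampling -- a hypothetical illustration of the summary's
    compression figure, not the paper's exact tokenizer."""
    assert h % downsample == 0 and w % downsample == 0
    return (h // downsample) * (w // downsample)

# Under this assumption, a 512x512 input maps to a 16x16 grid,
# i.e. 256 continuous tokens for the decoder to reconstruct from.
n = token_count(512, 512)
```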

The contributions can be summarized as:

  1. A two‑stage training plus self‑distillation scheme that endows CLIP with high‑fidelity reconstruction while preserving (and even enhancing) its semantic understanding.
  2. A dual‑condition generation/editing pipeline that fuses multimodal hidden states and learnable query embeddings, leveraging the reasoning power of frozen MLLMs without sacrificing visual detail.
  3. Empirical evidence that lightweight (1 B and 3 B) models can surpass much larger baselines across understanding, generation, and editing tasks, highlighting the potential of CLIP‑centric unified architectures.

Limitations include the fixed distillation weight (λ = 1) which may need tuning for domain‑specific data, the exclusive use of InternViT as the CLIP backbone (generalization to other CLIP variants remains untested), and a reliance on quantitative edit metrics without extensive qualitative analysis of complex structural edits. Nonetheless, UniLIP marks a significant step toward a single, efficient multimodal model that seamlessly integrates perception, synthesis, and manipulation, and it is likely to influence future research on parameter‑efficient, unified AI systems.

