CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval. Our code is available at https://github.com/Z1zs/Causal-Embed.
💡 Research Summary
CausalEmbed tackles the storage and computational bottleneck that plagues current multi‑vector visual document retrieval (VDR) systems. Traditional multi‑vector approaches, exemplified by ColPali, encode each document page into hundreds or thousands of patch‑level visual tokens and treat each token as an independent embedding. While this fine‑grained representation preserves layout information and yields strong retrieval performance, the sheer number of vectors per page incurs prohibitive memory and disk costs, making large‑scale deployment difficult.
The proposed method reframes embedding creation as an auto‑regressive generation problem. Starting from a pre‑trained multimodal large language model (MLLM) consisting of a vision encoder Φ and a language decoder Ψ, the input image I (or textual query T) is first transformed into an initial context C (a visual feature sequence or token sequence). The decoder Ψ then auto‑regressively generates a sequence of latent vectors Z = (z₁, …, z_K), each conditioned on the context C and on the vectors produced so far; these K vectors form the page's compact multi‑vector embedding.
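The generation loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the linear "decoder" weights, the mean-pooled conditioning on previous vectors, and the names `generate_embeddings`, `W_ctx`, and `W_prev` are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 8, 4  # latent dimension and number of generated vectors (illustrative values)

# Toy stand-in for the MLLM decoder Psi: a fixed pair of linear maps.
W_ctx = rng.standard_normal((D, D)) / np.sqrt(D)
W_prev = rng.standard_normal((D, D)) / np.sqrt(D)

def generate_embeddings(context: np.ndarray, k: int = K) -> np.ndarray:
    """Auto-regressively generate k latent vectors from a pooled context C.

    Each new vector is conditioned on the context and on a summary (here,
    the mean) of the previously generated vectors; the exact conditioning
    mechanism is an assumption for this sketch.
    """
    latents = []
    prev = np.zeros(D)
    for _ in range(k):
        z = np.tanh(W_ctx @ context + W_prev @ prev)
        z /= np.linalg.norm(z)           # unit-normalize, as is common for embeddings
        latents.append(z)
        prev = np.mean(latents, axis=0)  # summary of what has been generated so far
    return np.stack(latents)             # shape (k, D): the multi-vector embedding

Z = generate_embeddings(rng.standard_normal(D))
print(Z.shape)  # (4, 8)
```

Because each vector depends on its predecessors, generation can be stopped after any prefix of the K steps, which is what makes the test-time scaling strategy in the abstract possible: more vectors buy accuracy, fewer vectors buy storage.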