Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval
Retrieval-Augmented Generation (RAG) systems have become popular for generative applications, grounding language models by injecting external knowledge. Companies have been trying to leverage their large catalogs of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach: embedding models map queries and documents into a shared vector space, so that embeddings of relevant content lie close to the query embedding. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with a Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss the compute and storage engineering challenges posed by the late interaction mechanism and present experiments on balancing accuracy and storage with lower-dimensional embeddings.
💡 Research Summary
Nemotron ColEmbed V2 introduces a family of late‑interaction multimodal embedding models that set new state‑of‑the‑art results on visual document retrieval (VDR) benchmarks. The authors release three variants with 3 B, 4 B and 8 B parameters, built on top of pre‑trained vision‑language models (VLMs): NVIDIA Eagle 2 with a Llama 3.2 3 B backbone for the smallest model, and Qwen3‑VL‑4B‑Instruct and Qwen3‑VL‑8B‑Instruct for the larger ones. The 8 B model achieves the top spot on the ViDoRe V3 leaderboard (as of 2026‑02‑03) with an average NDCG@10 of 63.42, outperforming the second‑place model by about 3 %.
The paper first motivates the need for visual‑document‑centric retrieval, pointing out that traditional dense retrieval pipelines rely on OCR‑extracted text, which discards layout, tables, charts and other visual cues. Modern VLMs can ingest raw page images, preserving these cues and simplifying the indexing pipeline. However, naïve bi‑encoder VLMs that pool image tokens into a single vector suffer from a representation bottleneck, especially on cluttered pages. To address this, the authors adopt a ColBERT‑style late‑interaction mechanism: each image is encoded into a sequence of dense patch‑level embeddings, and at query time the query tokens interact with all document tokens via a MaxSim operation (for each query token, take the maximum similarity over all document tokens, then sum over query tokens). This yields fine‑grained token‑level matching while still allowing pre‑computation of document embeddings.
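The MaxSim scoring described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the standard ColBERT-style operation, not code from the released models; the array shapes and names are placeholders:

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late-interaction (ColBERT-style) relevance score.

    query_embs: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_embs:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_embs @ doc_embs.T  # (num_query_tokens, num_doc_tokens)
    # For each query token keep its best-matching document token, then sum.
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(3, 4)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Because document token embeddings appear only on the right-hand side of the matrix product, they can be computed once at indexing time; only the small query-side matrix is produced per request.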
Key engineering contributions include:
- Bidirectional attention – VLMs are originally decoder‑style models with causal attention. The authors replace causal attention with bidirectional attention in the LLM component, enabling each token to attend to the full context. This change yields a substantial boost in retrieval accuracy, consistent with prior findings for embedding models.
- Hard‑negative mining – Using an internal Llama‑Eagle 3B embedding model as a teacher, the top‑k most similar pages to each query are retrieved. A “positive‑threshold” filter (set at 0.95 of the query‑positive similarity) selects negatives that are hard but not false positives. This strategy improves contrastive learning efficiency.
- Cluster‑based data sampling – To mitigate domain imbalance in the training corpus, the authors first embed document images with the internal Eagle model, reduce the embeddings to 50 dimensions via PCA, and then apply K‑Means with gap statistics to obtain 14 clusters. Uniform sampling across clusters ensures balanced exposure to diverse document types.
- Cross‑lingual translation – Although not detailed in the excerpt, the paper mentions augmenting the training set with translated queries to improve multilingual retrieval, aligning with the multilingual nature of many enterprise document collections.
- Model merging – Weights from multiple fine‑tuned checkpoints are linearly combined into a single model that inherits the strengths of each (linear merging requires the checkpoints to share one architecture, so merging happens within a backbone family rather than across Eagle 2 and Qwen3‑VL). The merged 8 B model gains roughly 1.2 % absolute NDCG@10 over the best single‑checkpoint baseline, demonstrating that ensemble‑style weight merging is viable for late‑interaction models.
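The attention change in the first bullet amounts to dropping the causal mask so that every token can attend to the full sequence. A minimal mask-level sketch (the real change happens inside the VLM's attention layers; this only illustrates the mask shapes involved):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Boolean attention mask: entry [i, j] is True if token i may attend to token j."""
    if causal:
        # Lower-triangular: each token sees only itself and earlier tokens.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Bidirectional: every token sees the full context.
    return np.ones((seq_len, seq_len), dtype=bool)

causal_mask = attention_mask(4, causal=True)
full_mask = attention_mask(4, causal=False)
```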
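The positive-threshold filter from the hard-negative mining bullet can be illustrated as follows. Function and variable names are hypothetical; only the 0.95 factor comes from the summary:

```python
def select_hard_negatives(pos_sim, candidate_sims, threshold=0.95, k=4):
    """Keep the hardest candidates whose similarity to the query stays below
    threshold * pos_sim; anything above is treated as a likely false negative
    (i.e., an unlabeled positive) and discarded."""
    cutoff = threshold * pos_sim
    hard = [(doc_id, s) for doc_id, s in candidate_sims if s < cutoff]
    hard.sort(key=lambda item: item[1], reverse=True)  # hardest (most similar) first
    return [doc_id for doc_id, _ in hard[:k]]

# Query-positive similarity is 0.80, so the cutoff is 0.95 * 0.80 = 0.76.
negatives = select_hard_negatives(
    pos_sim=0.80,
    candidate_sims=[("d1", 0.79), ("d2", 0.75), ("d3", 0.60), ("d4", 0.74)],
    k=2,
)
# "d1" (similarity 0.79 > 0.76) is filtered out as a probable false negative.
```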
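After the clustering step described above (PCA to 50 dimensions, then K-Means with gap statistics), the balanced sampling across clusters might look like this NumPy-only sketch; the function name and toy corpus are illustrative, and the clustering itself is assumed to have already produced the labels:

```python
import numpy as np

def balanced_sample(cluster_labels: np.ndarray, n_per_cluster: int, seed: int = 0):
    """Uniformly sample up to n_per_cluster document indices from each cluster,
    so over-represented domains cannot dominate the training mix."""
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(cluster_labels):
        members = np.flatnonzero(cluster_labels == c)
        take = min(n_per_cluster, members.size)
        picked.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(picked)

# Imbalanced toy corpus: cluster 0 has 6 documents, cluster 1 has only 2.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
sample = balanced_sample(labels, n_per_cluster=2)
# Two documents are drawn from each cluster despite the 6-vs-2 imbalance.
```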
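Linear weight merging, as in the last bullet, is a simple parameter-wise average of checkpoints that share one architecture. A minimal sketch with NumPy state dicts (the paper's actual merging recipe and coefficients are not given in the summary):

```python
import numpy as np

def merge_checkpoints(state_dicts, weights=None):
    """Linearly combine parameter tensors from checkpoints with identical
    architectures (every checkpoint must have the same parameter shapes)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Two toy "checkpoints", each holding a single 2x2 weight matrix.
ckpt_a = {"proj.weight": np.array([[1.0, 0.0], [0.0, 1.0]])}
ckpt_b = {"proj.weight": np.array([[3.0, 0.0], [0.0, 3.0]])}
merged = merge_checkpoints([ckpt_a, ckpt_b])
# The equal-weight average has 2.0 on the diagonal.
```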
The authors also address the practical trade‑offs of late interaction. Storing per‑token embeddings for a large corpus is memory‑intensive; a full‑dimensional (4096‑dim) embedding set for the 8 B model occupies ~1.2 TB. To reduce storage, they experiment with dimensionality reduction (e.g., 2048‑dim, 1024‑dim). Results show that dropping to 2048 dimensions halves storage while incurring less than 0.3 % NDCG loss, a favorable compromise. Query latency also increases, because each query must compute MaxSim against all stored token vectors: the authors report a 1.8× slowdown relative to single‑vector bi‑encoders, which can be mitigated with GPU batch processing and optimized indexing structures.
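The storage figures above follow from simple arithmetic over corpus size, tokens per page, embedding dimension, and bytes per value; the page and token counts below are illustrative placeholders, not numbers from the paper:

```python
def index_size_bytes(num_pages: int, tokens_per_page: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Bytes needed to store per-token embeddings (fp16 values by default)."""
    return num_pages * tokens_per_page * dim * bytes_per_value

# Hypothetical corpus: one million pages, 800 image tokens each.
full = index_size_bytes(num_pages=1_000_000, tokens_per_page=800, dim=4096)
half = index_size_bytes(num_pages=1_000_000, tokens_per_page=800, dim=2048)
# Halving the dimension halves the index: ~6.5 TB -> ~3.3 TB in this toy setup.
```

Since index size is linear in every factor, halving the dimension (or quantizing fp16 values to 8 bits) always halves storage, which is why the 2048-dim setting with <0.3 % NDCG loss is such a favorable point on the curve.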
Experimental evaluation on the ViDoRe suite (V1, V2, V3) demonstrates consistent gains. The 8 B model leads the V3 leaderboard, while the 4 B and 3 B models rank within the top‑6, outperforming other models of comparable size. Ablation studies confirm the importance of each component: removing bidirectional attention drops NDCG@10 by ~2 %; omitting hard‑negative mining reduces it by ~1.5 %; and using average pooling instead of late interaction leads to a 3‑5 % NDCG degradation.
In the discussion, the authors acknowledge that late‑interaction models still face scalability challenges. Future work includes exploring quantization (e.g., 8‑bit) and knowledge distillation to produce lightweight student models, extending multilingual capabilities through more extensive cross‑lingual data generation, and integrating multimodal queries (text + image) for richer retrieval scenarios.
Overall, Nemotron ColEmbed V2 represents a significant advance in visual document retrieval by successfully marrying large‑scale VLM backbones with a token‑level late‑interaction architecture, complemented by a suite of data‑centric and training‑centric optimizations. The work not only pushes benchmark performance but also provides practical guidance on handling the storage and compute overhead inherent to multi‑vector retrieval, making it a valuable reference for both research and production deployments in enterprise RAG pipelines.