Scaling Laws for Embedding Dimension in Information Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Dense retrieval, which encodes queries and documents into a single dense vector, has become the dominant neural retrieval approach due to its simplicity and compatibility with fast approximate nearest neighbor algorithms. As the tasks dense retrieval performs grow in complexity, the fundamental limitations of the underlying data structure and similarity metric – namely vectors and inner products – become more apparent. Recent work has shown theoretical limitations inherent to single vectors and inner products that are generally tied to the embedding dimension. Given the importance of embedding dimension for retrieval capacity, understanding how dense retrieval performance changes as embedding dimension is scaled is fundamental to building next-generation retrieval models that balance effectiveness and efficiency. In this work, we conduct a comprehensive analysis of the relationship between embedding dimension and retrieval performance. Our experiments include two model families and a range of model sizes from each to construct a detailed picture of embedding scaling behavior. We find that the scaling behavior fits a power law, allowing us to derive scaling laws for performance given only embedding dimension, as well as a joint law accounting for embedding dimension and model size. Our analysis shows that for evaluation tasks aligned with the training task, performance continues to improve as embedding size increases, though with diminishing returns. For evaluation data that is less aligned with the training task, we find that performance is less predictable, with performance degrading at larger embedding dimensions for certain tasks. We hope our work provides additional insight into the limitations of embeddings and their behavior, and offers a practical guide for selecting model and embedding dimension to achieve optimal performance with reduced storage and compute costs.


💡 Research Summary

This paper investigates how the dimensionality of dense embeddings influences retrieval effectiveness, a topic that has received limited empirical attention despite strong theoretical indications that embedding dimension (d) bounds the capacity of inner‑product based retrieval. The authors conduct a systematic study using two distinct encoder families—BERT (a classic transformer) and Ettin (a newer, more advanced suite)—across a wide range of model sizes (from a few million to over a billion parameters) and embedding dimensions ranging from 32 up to 28,720, well beyond the native hidden size of the underlying backbones.

Training data differ for the two families: BERT models are trained on the standard MSMARCO Passage dataset (≈500 k queries) with teacher scores from a cross‑encoder, while Ettin models use MSMARCO Instruct, which augments queries with LLM‑generated instructions and adds hard negatives, yielding roughly 1 M training queries. Both setups employ a two‑tower architecture with mean‑pooling over token representations, followed by a linear projection to the target embedding dimension. BERT models are optimized with a combination of Margin‑MSE (distillation) and contrastive cross‑entropy losses; Ettin models use only the contrastive loss due to the lack of teacher scores.
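The two-tower encoding step described above (mean-pooling over token representations followed by a linear projection to the target dimension) can be sketched with plain NumPy; the shapes and random weights here are illustrative stand-ins, not the paper's actual models:

```python
import numpy as np

def encode(token_embs: np.ndarray, mask: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Mean-pool token representations (ignoring padded positions),
    then linearly project the pooled vector to the target embedding size."""
    # token_embs: (seq_len, hidden); mask: (seq_len,) with 1 = real token;
    # proj: (hidden, d) projection matrix
    pooled = (token_embs * mask[:, None]).sum(axis=0) / mask.sum()
    return pooled @ proj  # shape: (d,)

# Illustrative sizes: a 768-hidden backbone projected down to d = 32
rng = np.random.default_rng(0)
hidden, d, seq_len = 768, 32, 10
proj = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
tokens = rng.standard_normal((seq_len, hidden))
mask = np.ones(seq_len)

query_emb = encode(tokens, mask, proj)
print(query_emb.shape)  # (32,)
```

In the actual setup, queries and documents are encoded by the same (or twin) towers and compared via inner product, so varying `d` only changes the projection matrix.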

Evaluation is performed on two benchmarks: MSMARCO Dev (in‑distribution) and TREC‑DL (combined, out‑of‑distribution). The primary metric is contrastive entropy, complemented by standard IR measures such as MRR. Across all configurations, performance exhibits a clear power‑law relationship with embedding dimension:

 Performance ≈ A · d^α + δ

where A, α, and δ are fitted per dataset and model family. The fitted α values lie between 0.02 and 0.07, indicating diminishing returns as d grows. The R² values exceed 0.99, confirming an excellent fit.
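Fitting such a power law is straightforward once performance measurements at several dimensions are available. A minimal sketch, using synthetic data and assuming δ = 0 so the fit reduces to linear regression in log space (the paper fits all three parameters, and its fitted values differ):

```python
import numpy as np

# Hypothetical (dimension, performance) pairs; real values come from trained runs.
dims = np.array([32, 64, 128, 256, 512, 1024])
perf = 0.5 * dims ** 0.05  # synthetic data obeying Performance = A * d^alpha

# With delta = 0, taking logs makes the law linear:
#   log(perf) = log(A) + alpha * log(d)
alpha, logA = np.polyfit(np.log(dims), np.log(perf), 1)
A = np.exp(logA)
print(alpha, A)  # recovers alpha ≈ 0.05, A ≈ 0.5
```

Fitting the offset δ as well requires a nonlinear least-squares routine (e.g., SciPy's `curve_fit`), but the log-space view is useful for eyeballing whether the data are power-law shaped at all.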

Beyond dimension‑only scaling, the authors derive a joint scaling law that incorporates model parameter count (N):

 Performance ≈ A · N^β · d^α + δ

Here β ranges from 0.4 to 0.9, capturing the additional gains from larger models. This unified formulation allows practitioners to predict performance for any (N, d) pair without retraining.
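Once the coefficients are fitted, the joint law is just a closed-form prediction. A sketch with illustrative placeholder coefficients chosen inside the ranges quoted above (not the paper's fitted values):

```python
def predict_performance(N: float, d: int, A: float, beta: float,
                        alpha: float, delta: float) -> float:
    """Joint scaling law: Performance ≈ A * N^beta * d^alpha + delta."""
    return A * (N ** beta) * (d ** alpha) + delta

# Placeholder coefficients for illustration only
A, beta, alpha, delta = 1e-4, 0.5, 0.05, 0.1

for N in (30e6, 110e6, 1e9):          # hypothetical parameter counts
    for d in (128, 768, 4096):        # candidate embedding dimensions
        print(f"N={N:.0e}, d={d}: {predict_performance(N, d, A, beta, alpha, delta):.4f}")
```

Because both exponents are positive, predicted performance rises monotonically in N and d here; the practical question is how fast the gains flatten relative to cost.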

A key insight is the divergence between in‑distribution and out‑of‑distribution tasks. For MSMARCO Dev, increasing d consistently improves performance, albeit with smaller increments at higher dimensions. For TREC‑DL, however, the relationship is less monotonic; in some cases larger dimensions lead to slight degradations, suggesting that over‑parameterized embeddings may overfit to the training distribution and fail to generalize.

The paper also presents a cost‑aware analysis. By fixing an inference compute budget (e.g., FLOPs) and a storage constraint, the authors identify the (N, d) configuration that maximizes predicted performance. For example, under a 10 GFLOPs budget, a BERT‑L4 model with 4k‑dimensional embeddings outperforms a larger BERT‑L8 model with 8k‑dimensional embeddings, delivering 1.2% higher MRR while using less memory. This demonstrates that the optimal design is not simply “bigger model + larger embeddings” but a balanced trade‑off guided by the derived scaling laws.
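The selection procedure amounts to a constrained search over candidate configurations. A minimal sketch, assuming a float32 index, a storage-only budget, and the same placeholder scaling-law coefficients as above (the paper additionally constrains inference FLOPs):

```python
def predicted_perf(N: float, d: int, A: float = 1e-4, beta: float = 0.5,
                   alpha: float = 0.05, delta: float = 0.1) -> float:
    """Joint law Performance ≈ A * N^beta * d^alpha + delta,
    with illustrative coefficients (not the paper's fitted values)."""
    return A * (N ** beta) * (d ** alpha) + delta

def best_config(budget_bytes: int, n_docs: int, candidates):
    """Return the (N, d) pair with the highest predicted performance
    whose float32 index (n_docs * d * 4 bytes) fits in the budget."""
    feasible = [(N, d) for N, d in candidates if n_docs * d * 4 <= budget_bytes]
    return max(feasible, key=lambda nd: predicted_perf(*nd), default=None)

candidates = [(N, d) for N in (30e6, 110e6, 1e9) for d in (128, 768, 4096)]
choice = best_config(budget_bytes=8 * 2**30, n_docs=1_000_000, candidates=candidates)
print(choice)  # d=4096 is infeasible at 1M docs under 8 GiB, so a smaller d wins
```

Here the 4096-dimensional index (~16.4 GB for 1M documents) exceeds the 8 GiB budget, so the search settles on the largest feasible model and dimension instead.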

The study’s contributions are threefold: (1) empirical validation that embedding dimension follows a power‑law scaling with retrieval performance; (2) a joint scaling law that unifies the effects of model size and embedding dimension; (3) practical guidelines for selecting (N, d) under real‑world resource constraints.

Limitations include the focus on English datasets, the lack of analysis on how dimensionality interacts with compression techniques (e.g., quantization, PCA), and the omission of dynamic system costs such as index update latency. Future work could extend the analysis to multilingual corpora, explore hybrid approaches that combine dimensional scaling with product quantization, and integrate the scaling laws into end‑to‑end system simulators that account for indexing, retrieval latency, and energy consumption.

Overall, the paper provides a rigorous, data‑driven foundation for understanding and optimizing embedding dimensionality in dense retrieval, offering both theoretical insight and actionable recommendations for building efficient, high‑performing search systems.

