QVCache: A Query-Aware Vector Cache
Vector databases have become a cornerstone of modern information retrieval, powering applications in recommendation, search, and retrieval-augmented generation (RAG) pipelines. However, scaling approximate nearest neighbor (ANN) search to high recall under strict latency SLOs remains fundamentally constrained by memory capacity and I/O bandwidth. Disk-based vector search systems suffer severe latency degradation at high accuracy, while fully in-memory solutions incur prohibitive memory costs at billion-scale. Despite the central role of caching in traditional databases, vector search lacks a general query-level caching layer capable of amortizing repeated query work. We present QVCache, the first backend-agnostic, query-level caching system for ANN search with bounded memory footprint. QVCache exploits semantic query repetition by performing similarity-aware caching rather than exact-match lookup. It dynamically learns region-specific distance thresholds using an online learning algorithm, enabling recall-preserving cache hits while bounding lookup latency and memory usage independently of dataset size. QVCache operates as a drop-in layer for existing vector databases. It maintains a megabyte-scale memory footprint and achieves sub-millisecond cache-hit latency, reducing end-to-end query latency by 40x to 1000x when integrated with existing ANN systems. For workloads exhibiting temporal-semantic locality, QVCache substantially reduces latency while preserving recall comparable to the underlying ANN backend, establishing it as a missing but essential caching layer for scalable vector search.
💡 Research Summary
Vector databases have become essential components in modern information‑retrieval pipelines, powering recommendation engines, web search, and retrieval‑augmented generation (RAG) for large language models. However, achieving high recall in approximate nearest‑neighbor (ANN) search under strict latency Service Level Objectives (SLOs) remains a fundamental systems bottleneck. In‑memory ANN indexes provide low latency but scale linearly in memory cost, making billion‑scale deployments prohibitively expensive. Disk‑based ANN systems reduce memory pressure but suffer severe latency degradation at high recall because random I/O dominates execution time. Existing caching techniques from traditional databases cannot be directly applied to vector search because they assume exact query repetition, an assumption that fails when even minor changes in user input produce distinct high‑dimensional embeddings.
The authors observe that real‑world workloads exhibit “temporal‑semantic locality”: queries issued within short time windows are often semantically similar, i.e., their embeddings lie close together in the vector space, even though the raw vectors are not identical. Empirical studies across web search, e‑commerce, and LLM‑driven RAG show that 30‑70 % of queries are semantic variants of recent queries. Moreover, only a small fraction of the overall dataset (often <1 %) is repeatedly accessed within these short windows, suggesting a hot set that can be cached efficiently.
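The notion of temporal-semantic locality can be made concrete with a small measurement sketch: slide a window over the query stream and count how many queries fall within a distance threshold of some recent query. This is our own illustration, not the paper's methodology; the window size, threshold, and synthetic workload below are arbitrary assumptions.

```python
import numpy as np
from collections import deque

def semantic_repeat_rate(queries, window=100, threshold=0.2):
    """Fraction of queries whose embedding lies within `threshold`
    (Euclidean) of at least one of the previous `window` queries.
    Parameters are illustrative, not taken from the paper."""
    recent = deque(maxlen=window)
    hits = 0
    for q in queries:
        if recent and min(np.linalg.norm(q - r) for r in recent) <= threshold:
            hits += 1
        recent.append(q)
    return hits / len(queries)

# Synthetic workload: 200 queries drawn as small perturbations of 20
# base embeddings, mimicking semantic variants of recent queries.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 8))
workload = np.array([base[i % 20] + 0.01 * rng.normal(size=8) for i in range(200)])
rate = semantic_repeat_rate(workload)
```

On this synthetic stream nearly every query after the first pass over the base embeddings registers as a semantic repeat, which is exactly the kind of pattern the paper reports for real web-search, e-commerce, and RAG traces.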
QVCache is introduced as the first backend‑agnostic, query‑level cache for ANN search that exploits this semantic repetition. Its core ideas are: (1) treat caching as a similarity problem rather than an exact‑match problem; (2) learn region‑specific distance thresholds online, using feedback from the backend ANN system to adapt to local distance distributions; (3) bound both lookup latency and memory usage independently of the total dataset size by organizing cached vectors into a fixed number of “mini‑indexes”. Each mini‑index is a small dynamic graph‑based ANN structure (FreshVamana) that supports concurrent search and insertion. The cache maintains a megabyte‑scale memory footprint by fixing the capacity of each mini‑index and evicting whole mini‑indexes according to policies such as LRU.
When a query arrives, QVCache first searches across all mini‑indexes to obtain a candidate k‑nearest‑neighbor set. It then compares the distance of the k‑th candidate to the learned region‑specific threshold. If the distance is within the threshold, the query is classified as a cache hit and answered directly, bypassing the backend. If not, the query is forwarded to the underlying ANN engine; the resulting vectors are fetched and inserted into the hottest mini‑index, and the threshold for the corresponding region is updated asynchronously. This design ensures that cache‑miss latency remains comparable to the backend latency, even for disk‑based or remote backends where vector fetching may involve multiple I/O operations.
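The hit/miss decision above can be sketched in a few lines. For clarity this searches the cached vectors brute force, uses a trivial single-region `region_of`, and replaces the paper's online threshold learner and asynchronous update with a naive synchronous one; every name here is illustrative.

```python
import numpy as np

def lookup(cache_vecs, thresholds, region_of, backend_search, query, k):
    """Hit/miss path: search the cache, compare the k-th candidate's
    distance with the learned threshold for the query's region, and
    fall back to the backend on a miss. `cache_vecs` stands in for the
    union of the mini-indexes."""
    region = region_of(query)
    cand = sorted(cache_vecs, key=lambda v: float(np.linalg.norm(query - v)))[:k]
    if len(cand) == k and np.linalg.norm(query - cand[-1]) <= thresholds.get(region, 0.0):
        return cand, True                     # cache hit: bypass the backend
    results = backend_search(query, k)        # cache miss: full ANN search
    cache_vecs.extend(results)                # admit fetched vectors into the cache
    # Naive stand-in for the paper's online learner: remember the k-th
    # true distance for this region as its threshold.
    thresholds[region] = float(np.linalg.norm(query - results[k - 1]))
    return results, False

# Toy backend: exact k-NN over a small random dataset.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 8))
backend = lambda q, k: sorted(data, key=lambda v: float(np.linalg.norm(q - v)))[:k]

cache_vecs, thresholds = [], {}
q = data[0] + 0.001
_, first_hit = lookup(cache_vecs, thresholds, lambda q: 0, backend, q, 5)
_, second_hit = lookup(cache_vecs, thresholds, lambda q: 0, backend, q, 5)
```

The first query misses (the cache is empty) and populates the cache and the region's threshold; the repeated query then falls within that threshold and is answered without touching the backend.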
The paper provides a thorough evaluation on several large‑scale datasets (Deep1B, MS‑MARCO, OpenAI embeddings) and multiple ANN backends (FAISS IVF, DiskANN, Milvus). QVCache consistently reduces p50 latency by 40× to 1000× while keeping recall loss below 0.1 % across all configurations. Memory consumption stays below 1 % of a full in‑memory index (typically a few megabytes). The authors also present a cost model showing that, in cloud environments where ANN queries are billed per request, QVCache can cut operational expenses by 60‑90 % for workloads with strong temporal‑semantic locality.
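The cost argument follows directly from per-request billing: only cache misses reach the billed backend, so the savings fraction tracks the hit rate. A back-of-envelope version, with all prices and rates chosen for illustration rather than taken from the paper:

```python
def monthly_backend_cost(qps, price_per_1k_queries, hit_rate):
    """Only cache misses are billed by the backend. All figures below
    are illustrative assumptions, not measurements from the paper."""
    queries_per_month = qps * 86_400 * 30
    billed = queries_per_month * (1.0 - hit_rate)
    return billed * price_per_1k_queries / 1_000

baseline = monthly_backend_cost(qps=100, price_per_1k_queries=0.10, hit_rate=0.0)
cached = monthly_backend_cost(qps=100, price_per_1k_queries=0.10, hit_rate=0.7)
savings = 1.0 - cached / baseline   # a 70% hit rate cuts backend spend by ~70%
```

Under this simple model, the 30-70% semantic repetition observed in real workloads maps directly onto the 60-90% cost reductions the authors report once the cache also absorbs hits on the recurring hot set.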
In summary, QVCache demonstrates that a similarity‑aware, adaptive caching layer can bridge the gap between high‑recall ANN search and strict latency requirements without incurring prohibitive memory costs. By being drop‑in compatible with existing vector databases, it offers a practical, low‑overhead solution that leverages inherent workload patterns to achieve orders‑of‑magnitude performance gains while preserving accuracy. Future work includes extending the approach to multimodal embeddings, dynamic workload adaptation, and integrating cache‑side re‑ranking mechanisms.