Forward Index Compression for Learned Sparse Retrieval
Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search, with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based Hnsw, that retrieval with sparse representations became viable in practice. In this work, we scrutinize the efficiency of sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques that reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. In our examination of various integer compression techniques, we report that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MsMarco show that our improvements lead to significant space savings while maintaining retrieval efficiency.
💡 Research Summary
The paper addresses a critical bottleneck in modern sparse‑vector retrieval systems that use learned sparse embeddings: the forward index, which stores the mapping from document identifiers to their sparse vectors, consumes a large fraction of the total index memory. The authors first evaluate a range of integer compression techniques—VByte, Elias Gamma, Elias Delta, Zeta, and the SIMD‑optimized StreamVByte—both with and without a graph‑based reordering step (Recursive Graph Bisection, RGB). Their experiments on the MS‑MARCO collection show that while RGB can reduce bits‑per‑component by up to 27 %, the trade‑off between compression ratio and decoding speed varies widely. StreamVByte offers the fastest decoding but its compression ratio is sub‑optimal, whereas Zeta achieves the best compression at the cost of much higher latency.
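To make the baseline concrete, here is a minimal scalar sketch of classic VByte, the simplest of the evaluated codecs: each integer is split into 7-bit payload chunks, one per byte, with the high bit marking the final byte. (This is an illustrative textbook implementation, not the paper's code; StreamVByte differs by separating the control bits from the data bytes to enable SIMD decoding.)

```python
def vbyte_encode(values):
    """Encode non-negative integers with classic VByte:
    7 payload bits per byte; the high bit is set on the last byte."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)   # continuation byte, high bit clear
            v >>= 7
        out.append(v | 0x80)       # terminating byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode: accumulate 7-bit chunks until the
    high bit signals the end of the current integer."""
    values, v, shift = [], 0, 0
    for b in data:
        v |= (b & 0x7F) << shift
        if b & 0x80:               # last byte of this integer
            values.append(v)
            v, shift = 0, 0
        else:
            shift += 7
    return values
```

Small values cost one byte and values up to 16383 cost two, which is why VByte compresses 16-bit forward-index components well but never below 8 bits per component.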
Motivated by the observation that forward‑index components are 16‑bit integers, the authors design a new algorithm called DotVByte. DotVByte reduces the control information to a single bit per component, allowing eight values to be decoded simultaneously with a 128‑bit SIMD register. It also leverages the popcnt instruction to compute the number of bytes to advance directly from the control byte. Crucially, the decompression step is fused with the inner‑product computation: after decoding the component IDs, the corresponding query values are gathered, document values are loaded, converted to 32‑bit floats, multiplied, and accumulated—all inside SIMD registers without intermediate buffering. To avoid cross‑document control‑byte sharing, the component stream is padded to a multiple of eight, ensuring alignment and eliminating unnecessary decoding of previous documents.
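The paper's exact byte layout is not reproduced here, so the following Python sketch emulates the described DotVByte scheme scalar-wise: one control bit per 16-bit component ID (clear = one data byte, set = two), one control byte per group of eight IDs, a popcnt on the control byte to find how far the data pointer advances, and decoding fused with the inner-product accumulation. Function names, the padding value, and the query/value representation are illustrative assumptions.

```python
POPCNT = [bin(i).count("1") for i in range(256)]  # emulates the popcnt instruction

def dotvbyte_encode(ids):
    """Encode 16-bit component IDs with one control bit per value.
    Each control byte governs eight IDs; the stream is padded to a
    multiple of eight, as the paper describes, to keep groups aligned."""
    ids = list(ids) + [0] * (-len(ids) % 8)       # pad to a multiple of 8
    ctrl, data = bytearray(), bytearray()
    for g in range(0, len(ids), 8):
        c = 0
        for i, v in enumerate(ids[g:g + 8]):
            if v > 0xFF:
                c |= 1 << i                        # bit set -> two bytes
                data += v.to_bytes(2, "little")
            else:
                data.append(v)                     # bit clear -> one byte
        ctrl.append(c)
    return bytes(ctrl), bytes(data)

def dotvbyte_dot(ctrl, data, query, doc_values):
    """Fused decode and inner product: per group of eight IDs,
    popcnt on the control byte gives the group's data length, and each
    decoded ID is immediately multiplied against the query weight."""
    score, pos, k = 0.0, 0, 0
    for c in ctrl:
        nbytes = 8 + POPCNT[c]                     # one SIMD-style advance
        chunk, off = data[pos:pos + nbytes], 0
        for i in range(8):
            if c >> i & 1:
                cid = chunk[off] | chunk[off + 1] << 8
                off += 2
            else:
                cid = chunk[off]
                off += 1
            if k < len(doc_values):                # skip padding entries
                score += query.get(cid, 0.0) * float(doc_values[k])
            k += 1
        pos += nbytes
    return score
```

In the real implementation the eight IDs in a group are decoded at once in a 128-bit register and the multiply-accumulate happens in SIMD lanes; the scalar loop above only mirrors the data layout and control flow.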
The authors integrate DotVByte into the state‑of‑the‑art sparse ANN system Seismic and evaluate it with two leading sparse embedding models: Splade (≈119 non‑zero terms per document) and LiLsr (≈387 non‑zero terms per document). Experiments are conducted under a realistic memory budget (≈1.5 × the collection size) and a single‑threaded query setting on an Intel Core Ultra 7 server. Results demonstrate that DotVByte reduces the forward‑index size to about 10.8 bits per component (a 22 % reduction versus the uncompressed baseline) while incurring only a 20 % increase in scan time (0.36 s versus 0.30 s). Compared to StreamVByte, DotVByte is more than three times faster and uses slightly less space. In terms of end‑to‑end query latency at high recall levels (90 %–99 % of true nearest neighbors), DotVByte matches the uncompressed index (≈1,080 µs for 99 % recall on Splade) and dramatically outperforms Zeta and StreamVByte, which exhibit latencies of several seconds. Similar gains are observed for LiLsr, where DotVByte achieves 99 % recall with 1,643 µs latency and a total index size of 14.9 GB, compared to 25.4 GB and 12.2 s for Zeta.
The study concludes that targeted compression of the forward index, when co‑designed with SIMD‑friendly inner‑product computation, can simultaneously shrink memory footprints and preserve—or even improve—search efficiency. DotVByte exemplifies how domain‑specific adaptations of general‑purpose compression schemes can yield practical benefits for large‑scale learned sparse retrieval. Future work will explore sub‑byte encoding of small gaps to further improve compression ratios and extend the approach to other ANN frameworks.