Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, concluding that our Re-Pair variants offer an interesting time/space tradeoff for this problem, though further improvements are required before they can surpass the state of the art.
💡 Research Summary
The paper investigates a novel approach to compressing inverted index posting lists by applying the Re‑Pair grammar‑based compression algorithm to the sequence of gaps (differences) between consecutive document identifiers. Traditional inverted list compression techniques, such as Variable‑Byte, Gamma/Delta coding, PForDelta, and SIMD‑accelerated block codes, focus on representing small gap values efficiently but suffer from limited random‑access capabilities; decoding often requires scanning from the beginning of the list up to the desired position. This limitation becomes a performance bottleneck during list intersection, a core operation in query processing, because each intersecting list must be traversed repeatedly.
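The gap (d-gap) transform described above, and the sequential-decoding limitation it implies, can be sketched as follows (illustrative Python, not the paper's implementation):

```python
def to_gaps(doc_ids):
    """Convert a sorted posting list into d-gaps.
    The first entry is kept as-is; each later entry stores the
    difference to its predecessor, which tends to be small."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover doc IDs by a running prefix sum. Note that reaching
    position i requires decoding every gap before it -- the
    random-access limitation discussed above."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids
```

Real codecs (Variable-Byte, PForDelta, etc.) then encode these small gap values compactly; the prefix-sum dependency is what makes arbitrary-position access costly.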
Re‑Pair works by repeatedly finding the most frequent adjacent symbol pair in a sequence and replacing it with a new non‑terminal symbol, building a hierarchy of production rules. The resulting compressed sequence consists of rule identifiers, while the rule dictionary encodes the original symbols. Crucially, the hierarchical nature of the dictionary enables direct access to any position by traversing the rule tree, offering the potential for fast “skip” operations during intersections.
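The pair-replacement loop and the rule expansion it enables can be illustrated with a toy version of the algorithm (a simplified quadratic sketch; production Re-Pair implementations run in linear time with priority queues):

```python
from collections import Counter

def repair(seq, first_nonterminal=256):
    """Toy Re-Pair: repeatedly replace the most frequent adjacent
    pair with a fresh non-terminal until no pair occurs twice.
    Returns (compressed sequence, rules: nonterminal -> (left, right))."""
    seq, rules, next_sym = list(seq), {}, first_nonterminal
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):          # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

def expand(sym, rules):
    """Expand one symbol back to terminals by descending the rule tree --
    the traversal that gives Re-Pair its direct-access potential."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

Because each rule expands independently, decoding can start at any top-level symbol rather than at the beginning of the list.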
The authors first apply vanilla Re‑Pair to the gap sequences of real‑world posting lists derived from a large web collection. They observe that compression ratios are comparable to, and sometimes slightly better than, Variable‑Byte, and that pure decompression speed is very high because the algorithm can expand symbols in a streaming fashion. However, when used for intersection, the naïve Re‑Pair approach incurs overhead: each step may require descending multiple rule levels, which is costly for short lists where the rule metadata dominates the payload.
To address this, two enhancements are introduced:
- Range metadata per rule: each non‑terminal is annotated with the minimum and maximum document IDs that its expansion covers. During intersection, the algorithm compares these bounds and discards entire rules that cannot contain a match, skipping large portions of the list without full expansion.
- Depth‑limited flattening: the rule tree is flattened up to a configurable depth; below this depth the gaps are stored directly, while deeper levels remain as rule identifiers. This reduces the number of indirections for frequently accessed shallow parts of the list, improving cache locality and reducing branch mispredictions.
The paper evaluates these Re‑Pair variants against several state‑of‑the‑art compression schemes (Variable‑Byte, PForDelta, SIMD‑BP128, QMX) under multiple intersection algorithms (simple merge, galloping, adaptive skip). Experiments are conducted on both main‑memory and SSD‑based secondary storage to capture the impact of random I/O. Results show:
- For medium‑sized posting lists (thousands to tens of thousands of entries), the enhanced Re‑Pair achieves a favorable balance of space savings (10–15 % less memory than Variable‑Byte) and intersection time, often matching or slightly outperforming SIMD‑BP128.
- For very short lists (tens of entries), the overhead of rule metadata outweighs benefits; traditional Variable‑Byte remains faster.
- For very long lists (hundreds of thousands of entries), block‑based schemes like PForDelta and SIMD‑BP128 still lead in raw speed, though Re‑Pair’s ability to skip large rule blocks reduces disk seeks when lists reside on SSDs.
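Among the intersection algorithms used in the evaluation, galloping (exponential probing followed by binary search) is the standard choice when list lengths differ greatly. A minimal sketch:

```python
from bisect import bisect_left

def gallop_to(lst, lo, target):
    """Return the smallest index >= lo with lst[index] >= target,
    probing at exponentially growing steps before binary-searching."""
    step = 1
    while lo + step < len(lst) and lst[lo + step] < target:
        step *= 2
    return bisect_left(lst, target, lo, min(lo + step, len(lst)))

def intersect_galloping(a, b):
    """Intersect two sorted lists by galloping in the longer list
    for each element of the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    out, j = [], 0
    for x in a:
        j = gallop_to(b, j, x)
        if j == len(b):
            break
        if b[j] == x:
            out.append(x)
    return out
```

The per-rule range metadata above plays well with this strategy: a gallop's binary-search probes become rule-boundary checks instead of full decodes.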
Implementation details include storing the rule dictionary as a memory‑mapped file, interleaving rule identifiers and raw gaps within each posting list, and using SIMD instructions to parallelize the expansion of shallow rules. The authors also discuss how the approach can be integrated into existing search engine pipelines with minimal changes to the query planner.
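One way the interleaving of rule identifiers and raw gaps might look (a hypothetical layout, assuming symbols below a threshold are literal gaps and symbols at or above it are rule IDs):

```python
def decode_interleaved(stream, rules, first_nonterminal=256):
    """Decode a stream mixing literal gaps (< first_nonterminal) with
    rule identifiers (>= first_nonterminal) into doc IDs.
    The threshold-based tagging is an assumption of this sketch."""
    gaps = []
    def emit(sym):
        if sym < first_nonterminal:
            gaps.append(sym)       # raw gap stored inline
        else:
            l, r = rules[sym]      # rule reference: expand recursively
            emit(l)
            emit(r)
    for sym in stream:
        emit(sym)
    ids, total = [], 0             # prefix-sum the gaps back into doc IDs
    for g in gaps:
        total += g
        ids.append(total)
    return ids
```

Keeping shallow, frequently touched gaps inline while deeper structure stays rule-encoded is the same locality argument made for depth-limited flattening above.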
In conclusion, Re‑Pair‑based compression offers an alternative trade‑off: it provides comparable compression ratios to established methods while enabling more efficient random access, which is valuable for intersection‑heavy workloads and for scenarios where posting lists are partially stored on secondary media. Nevertheless, the current prototype incurs non‑trivial overhead from rule management and metadata storage, preventing it from universally outperforming the best existing techniques. The authors suggest future work on adaptive rule generation tuned to posting‑list characteristics, hardware‑accelerated rule decoding (e.g., GPU or FPGA), and shared rule dictionaries across multiple terms to amortize dictionary costs. With such enhancements, Re‑Pair could become a competitive or even superior solution for inverted index compression in large‑scale information retrieval systems.