Adaptive encodings for small and fast compressed suffix arrays
Compressed suffix arrays (CSAs) index large repetitive collections and are key in many text applications. The r-index and its derivatives combine the run-length Burrows-Wheeler Transform (BWT) with suffix array sampling to achieve space proportional to the number of equal-symbol runs in the BWT. While effective for near-identical strings, their size grows quickly as variation increases, since the number of BWT runs is sensitive to edits. Existing approaches typically trade space for query speed, or vice versa, limiting their practicality at large scale. We introduce variable-length blocking (VLB), an encoding technique for BWT-based CSAs that adapts the amount of indexing information to local compressibility. The BWT is recursively divided into blocks of at most w runs (a parameter) and organized into a tree. Compressible regions appear near the root and store little auxiliary data, while incompressible regions lie deeper and retain additional information to speed up access. Queries traverse a short root-to-leaf path followed by a small run scan. This strategy balances space and query speed by transferring bits saved in compressible areas to accelerate access in incompressible ones. Backward search relies on rank and successor queries over the BWT. We introduce a sampling technique that guarantees correctness only along valid backward-search states, reducing space without affecting query performance. We extend VLB to encode the subsampled r-index (sr-index). Experiments show that VLB-based techniques outperform the r-index and sr-index in query time, while retaining space close to that of the sr-index. Compared to the move data structure, VLB offers a more favorable space-time tradeoff.
💡 Research Summary
The paper addresses the longstanding trade‑off between space consumption and query speed in compressed suffix arrays (CSAs) for large, repetitive text collections. While the r‑index achieves O(r) space by run‑length encoding the Burrows–Wheeler Transform (BWT) and sampling two suffix‑array (SA) entries per run, its size quickly inflates when the collection contains many edits, because each edit can increase the number of BWT runs dramatically. Existing solutions either apply additional compression at the cost of slower queries or prioritize speed while using substantially more memory, leaving a gap for a practical, balanced structure.
The authors propose Variable‑Length Blocking (VLB), an adaptive encoding that tailors the amount of auxiliary indexing information to the local compressibility of the BWT. The BWT is recursively partitioned so that each block contains at most w runs (w is a tunable parameter). These blocks are organized into an f‑ary tree, the VLB‑tree. Blocks that are highly compressible (few long runs) appear near the root and store only minimal metadata (block start position and run lengths). Blocks that are less compressible (many short runs) are placed deeper in the tree and are enriched with extra rank, successor, and SA‑sample data. Consequently, a query follows a short root‑to‑leaf path of height O(log_f(ℓ/w)) and then scans at most w consecutive runs in the leaf, where ℓ is the original block size. This yields a query time of O(m·(log_f(ℓ/w)+w)) for counting a pattern of length m, with roughly one cache miss per highly compressible BWT region and up to log_f(ℓ/w) cache misses in dense, high‑run regions.
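To make the blocking idea concrete, here is a minimal sketch (not the paper's implementation; `Node`, `build_vlb`, and the parameter choices are hypothetical) of recursively partitioning a run-length-encoded BWT into an f‑ary tree whose leaves hold at most w runs:

```python
# Toy VLB-tree construction: split a run sequence into f children until a
# block holds at most w runs.  Compressible regions (few runs) terminate
# near the root; run-dense regions recurse deeper, where the real index
# would attach extra rank/successor/SA-sample metadata.
from dataclasses import dataclass, field

@dataclass
class Node:
    runs: list                                  # (symbol, length) runs covered
    children: list = field(default_factory=list)

def build_vlb(runs, w=2, f=2):
    node = Node(runs)
    if len(runs) <= w:
        return node                             # leaf: queries scan <= w runs
    step = -(-len(runs) // f)                   # ceil(len/f): f child blocks
    for i in range(0, len(runs), step):
        node.children.append(build_vlb(runs[i:i + step], w, f))
    return node

def height(node):
    return 0 if not node.children else 1 + max(height(c) for c in node.children)

# A BWT region with 8 runs, w=2, f=2: root-to-leaf path of length log_2(8/2)
runs = [("a", 5), ("b", 1), ("a", 2), ("c", 3),
        ("b", 4), ("a", 1), ("c", 2), ("b", 6)]
tree = build_vlb(runs, w=2, f=2)
print(height(tree))  # → 2
```

The sketch only mirrors the shape of the structure: the height matches the O(log_f(ℓ/w)) traversal bound, and a query would finish with a scan of the ≤ w runs stored at the leaf it reaches.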
A key innovation is a sampling scheme for rank and successor queries that guarantees correctness only along “valid backward‑search states”. Because backward search only ever visits BWT intervals that correspond to prefixes of the pattern, the index can omit many samples that would never be used, reducing space without harming speed or correctness. This relaxation principle is independent of VLB and could be applied to other CSA designs.
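The "valid states" observation can be illustrated with a plain (uncompressed) backward search; this sketch uses naive rank computations rather than the paper's sampled structures, and the helper names are ours. The `visited` list records exactly the rank states a query can reach, which is the set the paper's sampling scheme must cover:

```python
# Backward search over a plain BWT via the classic C-array + rank recurrence.
# Only the (lo, hi) intervals in `visited` are ever touched by a query, so an
# index need only guarantee correct rank answers at such reachable states.
def bwt_from_sa(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa

def backward_search(bwt, pattern):
    symbols = sorted(set(bwt))
    C, tot = {}, 0                       # C[c] = # symbols in bwt smaller than c
    for c in symbols:
        C[c] = tot
        tot += bwt.count(c)
    lo, hi = 0, len(bwt)                 # current SA interval [lo, hi)
    visited = []                         # the only rank states the search uses
    for c in reversed(pattern):
        lo = C[c] + bwt[:lo].count(c)    # rank(c, lo)
        hi = C[c] + bwt[:hi].count(c)    # rank(c, hi)
        visited.append((lo, hi))
        if lo >= hi:
            break
    return hi - lo, visited

text = "abracadabra$"
bwt, sa = bwt_from_sa(text)
count, states = backward_search(bwt, "abra")
print(count)  # → 2 ("abra" occurs at text positions 0 and 7)
```

Here `bwt[:i].count(c)` stands in for a constant-time rank query; the point is that the sequence of `(lo, hi)` intervals is fully determined by the pattern's suffixes, so samples supporting any other state are dead weight.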
The VLB‑tree is further extended to encode the φ⁻¹ function, which enables reconstruction of the full SA from a single "toehold" value obtained during counting. By storing a single difference value per BWT run, φ⁻¹ can be evaluated with a predecessor query on a bitvector followed by a table lookup, keeping the overhead low. Combining the VLB‑tree for the BWT with the VLB‑tree for φ⁻¹ yields a full CSA equivalent to the r‑index. By subsampling the SA samples in dense regions, the authors obtain a VLB‑based counterpart of the subsampled r‑index (sr‑index) that retains the cache‑friendly hierarchical layout while matching the space of the original sr‑index.
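A small sketch of the φ⁻¹ evaluation scheme, under the assumption (stated above) that φ⁻¹(SA[i]) = SA[i+1] is piecewise linear with one stored difference per piece; the construction here finds pieces empirically rather than from BWT runs, and all names are illustrative:

```python
# phi^{-1} maps each suffix-array value to its right neighbor: if SA[i] = j,
# then phi_inv(j) = SA[i+1].  It is linear within pieces, so marking each
# piece start (a 1-bit in a bitvector in the real index) and storing one
# difference per piece lets us evaluate it with predecessor + table lookup.
from bisect import bisect_right

def build_phi_inv(text):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    nxt = {sa[i]: sa[(i + 1) % n] for i in range(n)}
    starts, diffs = [], []               # one (start, diff) pair per piece
    for j in range(n):
        d = (nxt[j] - j) % n
        if not diffs or diffs[-1] != d:
            starts.append(j)
            diffs.append(d)
    return starts, diffs, n

def phi_inv(j, starts, diffs, n):
    k = bisect_right(starts, j) - 1      # predecessor query on piece starts
    return (j + diffs[k]) % n            # table lookup of the stored diff

text = "abracadabra$"
starts, diffs, n = build_phi_inv(text)
sa = sorted(range(n), key=lambda i: text[i:])

# Walking phi^{-1} from the toehold SA[0] enumerates the whole suffix array.
walk, j = [], sa[0]
for _ in range(n):
    walk.append(j)
    j = phi_inv(j, starts, diffs, n)
assert walk == sa
```

The walk at the end mirrors locate: counting delivers one SA value (the toehold), and repeated φ⁻¹ evaluations recover the rest of the interval without storing the SA explicitly.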
Experimental evaluation on ten real‑world genomic datasets and five version‑control repositories compares VLB‑based structures against the r‑index, the subsampled r‑index, and the recent “move” data structure. VLB‑run‑length BWT achieves up to a 2× reduction in space relative to the r‑index, while VLB‑sr‑index uses roughly the same space as Cobas et al.’s sr‑index. Query performance improves dramatically: count queries are 4.8–5.5× faster, and locate queries are 8.9–9.8× faster than the best existing alternatives. The move structure remains faster in raw time but requires 2.5–8.3× more memory, making VLB the more balanced choice. Cache‑miss profiling confirms that the hierarchical block layout yields superior locality, which is the primary source of the speedup.
In summary, Variable‑Length Blocking provides a principled method to reallocate indexing bits from highly compressible BWT regions to those that are hard to compress, achieving a CSA that is both space‑efficient and fast. The approach is especially suited for massive, moderately repetitive collections where traditional r‑index variants either waste space or suffer from slow queries. Future work may explore external‑memory adaptations, dynamic updates, and applying the “valid‑state‑only” sampling to other compressed indexing frameworks.