Compressed Text Indexes: From Theory to Practice
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless, exploiting it requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner’s point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and testing. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.
💡 Research Summary
The paper surveys the state of compressed full‑text self‑indexes, a class of data structures that store a text in a compressed form while still supporting fast pattern‑matching queries. Traditional indexes such as suffix arrays or inverted files require several times the size of the original text, which becomes prohibitive for massive collections. Over the past decade, a rich theoretical literature has produced a variety of such self‑indexes—most notably the FM‑index, compressed suffix arrays (CSA), the run‑length compressed suffix array (RLCSA), the LZ‑index, and several hybrid schemes. Each of these combines a core compression technique (Burrows‑Wheeler Transform, Lempel‑Ziv parsing, run‑length encoding, etc.) with auxiliary rank/select structures, wavelet trees, or sampling strategies to enable navigation on the compressed representation.
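As a toy illustration of the core mechanism several of these indexes share, the sketch below counts pattern occurrences with backward search over the Burrows‑Wheeler Transform. It replaces the compressed rank/select structures with naive string counting, so it shows the principle only, not any of the actual implementations discussed in the paper.

```python
# Toy backward search over the BWT (the FM-index principle), with
# naive rank queries in place of compressed rank/select structures.

def bwt(text):
    """Burrows-Wheeler Transform of text (text must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(text, pattern):
    """Count occurrences of pattern in text via backward search."""
    L = bwt(text + "$")
    first_col = sorted(L)
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: first_col.index(c) for c in set(L)}
    lo, hi = 0, len(L)            # current suffix-array range [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        # rank(c, i) = occurrences of c in L[0:i], computed naively here
        lo = C[c] + L[:lo].count(c)
        hi = C[c] + L[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("abracadabra", "abra"))  # prints 2
print(count_occurrences("abracadabra", "cad"))   # prints 1
```

A real FM‑index answers each rank query in constant or near‑constant time over the compressed BWT (e.g., via a wavelet tree), which is what makes the search both fast and space‑efficient.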
Despite the theoretical maturity, practical adoption has lagged because existing implementations are scattered, use incompatible file formats, and expose ad‑hoc APIs. Consequently, researchers cannot easily reproduce results, and developers cannot integrate these indexes into larger systems without substantial engineering effort. The authors address this gap by first cataloguing the most widely used implementations and describing their algorithmic foundations, memory layouts, and engineering trade‑offs. They then introduce the Pizza&Chili platform, a publicly available repository that hosts tuned versions of the leading compressed indexes behind a unified C/C++ API. The API is minimal and consistent: building an index from a raw text, saving and loading a persisted index, querying it (counting occurrences, locating their positions, or extracting text snippets), and releasing resources.
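The shape of such an operation set can be sketched in pure Python with a naive (uncompressed) suffix array standing in for the real compressed structures. All class and method names here are illustrative assumptions for this sketch; the actual Pizza&Chili library exposes the corresponding functionality through a C interface.

```python
# Illustrative sketch of a build / save-load / query API around a
# naive suffix array (a stand-in for a compressed self-index).

import bisect
import pickle

class ToyIndex:
    """Naive suffix-array index; a real self-index would be compressed."""

    def __init__(self, text):
        # Build: sort all suffix start positions lexicographically.
        self.text = text
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def save(self, path):
        # Persist the index so it can be reloaded without rebuilding.
        with open(path, "wb") as f:
            pickle.dump((self.text, self.sa), f)

    @classmethod
    def load(cls, path):
        idx = cls.__new__(cls)
        with open(path, "rb") as f:
            idx.text, idx.sa = pickle.load(f)
        return idx

    def _range(self, pattern):
        # Suffix-array interval of suffixes starting with pattern.
        keys = [self.text[i:i + len(pattern)] for i in self.sa]
        return bisect.bisect_left(keys, pattern), bisect.bisect_right(keys, pattern)

    def count(self, pattern):
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        lo, hi = self._range(pattern)
        return sorted(self.sa[lo:hi])

idx = ToyIndex("abracadabra")
print(idx.count("abra"), idx.locate("abra"))  # prints: 2 [0, 7]
```

Releasing resources is implicit here (Python garbage collection); in the C interface it is an explicit operation because the index owns large allocated buffers.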
A key contribution of Pizza&Chili is the automation of parameter tuning and validation. The authors provide scripts that automatically select sampling rates, block sizes, and other hyper‑parameters based on the characteristics of the input corpus. They also supply a comprehensive testbed of standard corpora spanning natural‑language text, DNA, proteins, XML, and source code, together with a suite of query workloads covering counting, locating, and text‑extraction queries. The validation framework checks both functional correctness (exact match of results) and performance metrics, ensuring reproducibility across different hardware platforms.
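To make the tuning trade‑off concrete, here is a hypothetical sketch of the kind of rule such a script might apply: choose the densest suffix‑array sampling that still fits a given memory budget. The function name, formula, and defaults are illustrative assumptions, not the actual Pizza&Chili scripts.

```python
# Hypothetical sampling-rate selection: sampling every k-th suffix-array
# position stores ceil(text_len / k) text positions of word_bytes each.
# Smaller k speeds up locate() queries but costs more space.

import math

def choose_sampling_rate(text_len, budget_bytes, word_bytes=4):
    """Smallest sampling step k whose sample table fits the budget."""
    if budget_bytes < word_bytes:
        raise ValueError("budget too small to store even one sample")
    max_samples = budget_bytes // word_bytes
    return max(1, math.ceil(text_len / max_samples))
```

For example, a 1000‑character text with a 1000‑byte budget and 4‑byte samples can afford 250 samples, so every 4th suffix‑array entry would be kept.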
The experimental section presents an extensive benchmark covering five major metrics: compression ratio, index construction time, memory footprint during construction and query, query latency, and throughput under multithreaded workloads. Results show that the FM‑index achieves a typical 30‑40 % compression of the original text while delivering microsecond‑scale single‑pattern searches; RLCSA offers the highest memory savings (up to a factor of two on large DNA datasets) at the cost of modestly higher latency; the LZ‑index provides the strongest compression but suffers on random‑access heavy queries. SIMD‑accelerated variants consistently improve throughput by 1.5‑2×, and a four‑core pipeline yields up to 3.2× speed‑up for batch query processing. These findings demonstrate that compressed indexes are not only theoretically attractive but also practically viable for environments with stringent memory constraints, such as mobile devices, embedded systems, or large‑scale bioinformatics pipelines.
The authors conclude by outlining future research directions: supporting dynamic updates (insertions/deletions) without rebuilding the whole index, extending the framework to distributed and cloud‑based settings, and leveraging machine‑learning techniques for automatic hyper‑parameter optimisation. By delivering a standardized, well‑tested implementation suite and a reproducible benchmarking methodology, the paper bridges the gap between theory and practice, paving the way for broader adoption of compressed full‑text self‑indexes in real‑world applications.