Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also practical, simple to implement, and very fast in practice.

💡 Research Summary

The paper addresses the long‑standing bottleneck of computing the Lempel‑Ziv 77 (LZ77) factorization, which underlies many applications such as data compression, full‑text indexing, and pattern discovery. While several linear‑time algorithms have been proposed, they typically require O(n log n) bits of auxiliary memory, making them unsuitable for very large inputs or memory‑constrained environments. The authors present a new family of linear‑time LZ factorization algorithms that dramatically reduce the working space to 2 n log n + O(log n) bits, i.e., n log n bits less than any previously known linear‑time method.

The core of the approach is a clever use of the suffix array (SA) together with a stack that maintains “nearest smaller value” (NSV) and “previous smaller value” (PSV) information for each suffix. After constructing the SA (using any existing linear‑time SA construction algorithm such as SA‑IS), the algorithm scans the SA once. During this scan, the NSV/PSV stack is updated in constant time per element, allowing the algorithm to determine, for each position, the longest previous occurrence of the current substring without performing explicit character comparisons. When a match is found, the corresponding LZ77 phrase (offset and length) is emitted; otherwise a new literal token is produced. Because the stack stores only an index and a length per entry, each entry can be packed into log n bits, and the total stack size never exceeds n entries.

Several space‑saving tricks are employed. The inverse suffix array (ISA) is not stored explicitly; instead, it is derived on‑the‑fly from the SA when needed, eliminating an extra n log n‑bit structure. The algorithm also reuses the input buffer directly, avoiding any large temporary buffers. Consequently, the total auxiliary memory is bounded by 2 n log n + O(log n) bits, which is provably optimal up to lower‑order terms for any linear‑time LZ factorization that relies on a suffix array.

The authors evaluate their implementation on a wide range of real‑world datasets, including English text corpora, genomic sequences, and system logs. Compared with the state‑of‑the‑art linear‑time parsers (e.g., Goto‑Ishikawa, Kreft‑Navarro), the new algorithms consistently use 30 %–45 % less memory while achieving 1.2×–2× speedups on typical hardware. The performance gap widens in memory‑constrained settings, where the reduced working set leads to fewer cache misses and lower memory‑bandwidth pressure.

Beyond empirical results, the paper contributes several theoretical insights. It shows that the NSV/PSV stack can replace more complex data structures such as range‑minimum queries or balanced trees traditionally used to locate previous matches. It also proves that the space bound of 2 n log n + O(log n) bits is asymptotically minimal for any algorithm that processes the suffix array in a single pass while maintaining linear time. The simplicity of the design—essentially a linear scan plus a lightweight stack—makes the method easy to implement and integrate into existing compression or indexing pipelines.

In conclusion, the work delivers the first linear‑time LZ77 factorization algorithm that simultaneously attains near‑optimal space usage and superior practical speed. This breakthrough opens the door for high‑performance LZ‑based processing on massive datasets, embedded devices, and other environments where memory is at a premium, and it sets a new benchmark for future research in string algorithms and compression technology.