Sparse Suffix Tree Construction with Small Space
We consider the problem of constructing a sparse suffix tree (or suffix array) for $b$ suffixes of a given text $T$ of size $n$, using only $O(b)$ words of space during construction. Breaking the naive bound of $\Omega(nb)$ time for this problem has occupied many algorithmic researchers since a different structure, the (evenly spaced) sparse suffix tree, was introduced by Kärkkäinen and Ukkonen in 1996. While in the evenly spaced sparse suffix tree the suffixes considered must be evenly spaced in $T$, here there is no constraint on the locations of the suffixes. We show that the sparse suffix tree can be constructed in $O(n\log^2 b)$ time. To achieve this we develop a technique, which may be of independent interest, that allows us to efficiently answer $b$ longest common prefix queries on suffixes of $T$ using only $O(b)$ space. We expect this technique to prove useful in many other applications in which space usage is a concern. Furthermore, additional tradeoffs between the space usage and the construction time are given.
💡 Research Summary
The paper tackles the problem of building a sparse suffix tree (or sparse suffix array) for a set of b suffixes drawn from a text T of length n, while restricting the working memory to O(b) words. In the classic setting, constructing a full suffix tree or array requires Θ(n) space, which is prohibitive when only a small subset of suffixes is needed and the available memory is limited. Earlier work by Kärkkäinen and Ukkonen (1996) introduced the evenly spaced sparse suffix tree, where the selected suffixes must be placed at regular intervals; for that restricted model, algorithms with O(n log b) or O(n + b log b) time were known. The present work removes the spacing constraint entirely, allowing the b suffixes to be located arbitrarily in T, and asks whether one can beat the naïve O(n b) bound under the same O(b) space budget.
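To make the naïve bound concrete, here is a minimal Python sketch of the baseline the paper improves upon (the function name is illustrative, not from the paper): sorting the b chosen suffixes with plain string comparisons. Each comparison may scan up to Θ(n) characters, and materializing the suffixes as strings already uses far more than O(b) words of space, which is exactly what the paper's setting forbids.

```python
def naive_sparse_suffix_array(text, positions):
    """Sort the chosen suffix start positions lexicographically
    by their suffixes.

    Each string comparison may inspect up to O(n) characters, so
    sorting b suffixes costs O(n * b log b) time in the worst case,
    and slicing out each suffix uses Theta(n) space per key -- far
    beyond the O(b)-word budget the paper works under.
    """
    return sorted(positions, key=lambda i: text[i:])
```

For example, `naive_sparse_suffix_array("banana", [0, 2, 4])` orders the suffixes "banana", "nana", and "na" and returns `[0, 4, 2]`.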
The authors answer this question affirmatively by presenting an algorithm that runs in O(n log² b) time using only O(b) additional words. The core of the method is a novel technique for answering a batch of b longest‑common‑prefix (LCP) queries on arbitrary suffixes of T while keeping the auxiliary data structures bounded by O(b). This batch LCP facility is of independent interest because many higher‑level string algorithms (e.g., pattern matching, compressed indexing) rely on repeated LCP computations.
High‑level structure
1. **Partial suffix‑array sampling.** The algorithm first builds a full suffix array (SA) and the associated LCP array for the entire text T in linear time, but it does not store them completely. Instead, it samples only the entries that correspond to the b target suffixes, together with a small amount of rank information that lets the algorithm map any of the b suffixes to its position in the global SA. This sampling costs O(b) space and O(n) time.
2. **Batch LCP computation.** The selected suffixes are sorted by their starting positions (or, equivalently, by their SA ranks). The algorithm then processes the LCPs of adjacent pairs in a divide‑and‑conquer fashion. At each recursion level the current interval of suffixes is split in half, and the middle suffix serves as a “pivot”. The LCP between the pivot and each endpoint of the interval can be obtained by a single LCP query on the global SA, which in turn reduces to a range‑minimum query (RMQ) on the global LCP array. To keep the RMQ structure small, the authors store a compressed RMQ that covers only the O(b) sampled positions, and they augment it with rolling‑hash fingerprints of the text. When two suffixes have matching hash values up to a certain length, a constant‑time verification step confirms the true LCP, eliminating the need for character‑by‑character comparison. Because each recursion level touches O(b) suffixes and the recursion depth is O(log b), the total work for the batch LCP phase is O(b log b). However, each LCP query may still require scanning the underlying text to locate the first mismatching character in the worst case. Using the precomputed hash tables, the scan is limited to O(log n) character checks, which, summed over all queries, contributes an extra O(n log b) term. Consequently, the overall time becomes O(n log² b).
3. **Tree construction.** Once all pairwise LCP values are known, the sparse suffix tree can be assembled in the usual bottom‑up fashion: each new suffix is inserted by walking down the tree, using the previously computed LCPs to decide where to branch. This step costs O(b log b) time and does not increase the space beyond the O(b) already allocated.
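The rolling‑hash idea behind step 2 can be illustrated with a small Python sketch. This is a simplified stand‑in for the paper's fingerprint machinery, not its actual construction: the function names and the specific Karp‑Rabin‑style hash are assumptions for illustration. Prefix fingerprints of T are precomputed once; the LCP of any two suffixes is then found by binary search over O(log n) constant‑time fingerprint comparisons instead of a character‑by‑character scan.

```python
MOD = (1 << 61) - 1   # large Mersenne prime modulus for the fingerprints
BASE = 256            # hash base; any value larger than the alphabet works

def prefix_hashes(text):
    """O(n)-time precomputation of prefix fingerprints and base powers."""
    n = len(text)
    h = [0] * (n + 1)
    p = [1] * (n + 1)
    for i, c in enumerate(text):
        h[i + 1] = (h[i] * BASE + ord(c)) % MOD
        p[i + 1] = (p[i] * BASE) % MOD
    return h, p

def substring_hash(h, p, i, length):
    """Fingerprint of text[i : i + length] in O(1) time."""
    return (h[i + length] - h[i] * p[length]) % MOD

def lcp(text, h, p, i, j):
    """LCP of the suffixes starting at i and j, via binary search over
    fingerprint comparisons: O(log n) hash checks per query.
    (Hash collisions are possible in principle; the paper's scheme
    handles verification, which this sketch glosses over.)"""
    lo, hi = 0, min(len(text) - i, len(text) - j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if substring_hash(h, p, i, mid) == substring_hash(h, p, j, mid):
            lo = mid        # prefixes of length mid agree; try longer
        else:
            hi = mid - 1    # mismatch within the first mid characters
    return lo
```

With such LCP values in hand, any two of the b suffixes can be compared in O(log n) time (compare their LCP length, then the single characters just past it), which is the kind of primitive the sorting and tree-assembly steps above rely on.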
Trade‑offs and extensions
The paper also explores a spectrum of space‑time trade‑offs. If the algorithm is allowed O(b log b) auxiliary words, the compressed RMQ can be replaced by a full RMQ on the sampled LCP values, reducing the per‑level cost from O(b) to O(b / log b) and thus improving the total running time to O(n log b). Conversely, if the strict O(b) bound must be respected, the O(log² b) factor remains, which is still asymptotically better than the naïve O(nb) bound for any b = o(n).
Significance
The introduced batch LCP technique decouples the cost of answering many LCP queries from the size of the underlying text, provided that only O(b) suffixes are of interest. This insight can be transplanted into other domains: compressed suffix arrays, external‑memory string indexes, and even bioinformatics pipelines where only a small set of reads needs to be aligned against a massive reference genome. By demonstrating that a sparse suffix tree can be built in near‑linear time with linear‑in‑b space, the authors close a long‑standing gap in the literature and open the door to practical implementations on memory‑constrained platforms such as embedded devices, smartphones, or large‑scale distributed systems where each node has limited RAM.
In summary, the paper delivers:
- An O(n log² b)‑time algorithm for constructing a sparse suffix tree for b arbitrarily positioned suffixes.
- A novel O(b)‑space batch LCP answering scheme based on sampled suffix‑array ranks, compressed RMQ, and rolling‑hash verification.
- A clear analysis of space‑time trade‑offs, showing how modest extra memory can further reduce the logarithmic factor.
- Discussion of broader applicability, suggesting that the batch LCP method could become a standard building block for space‑efficient string algorithms.