New Algorithms for Position Heaps
We present several results about position heaps, a relatively new alternative to suffix trees and suffix arrays. First, we show that, if we limit the maximum length of patterns to be sought, then we can also limit the height of the heap and reduce the worst-case cost of insertions and deletions. Second, we show how to build a position heap in linear time independent of the size of the alphabet. Third, we show how to augment a position heap such that it supports access to the corresponding suffix array, and vice versa. Fourth, we introduce a variant of a position heap that can be simulated efficiently by a compressed suffix array with a linear number of extra bits.
💡 Research Summary
The paper revisits the position heap, a trie-shaped index with one node per position of the text, and presents four contributions: a height bound for length-limited patterns, an alphabet-independent linear-time construction, a two-way link with the suffix array, and a variant that a compressed suffix array can simulate.
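For readers new to the structure, here is a minimal sketch of a position heap, assuming the classic right-to-left insertion order and a naive quadratic-time builder (not any of the paper's algorithms). The names `Node`, `build_position_heap`, and `find_occurrences` are hypothetical and chosen only for illustration.

```python
class Node:
    """One heap node; stores the text position whose insertion created it."""
    def __init__(self, pos):
        self.pos = pos
        self.children = {}  # edge character -> child Node


def build_position_heap(text):
    """Naive builder: insert suffixes from the shortest to the longest."""
    root = Node(None)
    for i in range(len(text) - 1, -1, -1):
        node, j = root, i
        # Follow the suffix text[i:] down the trie as far as it goes.
        # The suffix is longer than every existing path, so j stays in range.
        while text[j] in node.children:
            node = node.children[text[j]]
            j += 1
        # Falling off the trie creates exactly one new node for position i.
        node.children[text[j]] = Node(i)
    return root


def find_occurrences(root, text, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    hits = []
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return sorted(hits)
        # Nodes on the search path are only candidates: verify against the text.
        if text.startswith(pattern, node.pos):
            hits.append(node.pos)
    # Pattern exhausted: every proper descendant is an occurrence.
    stack = list(node.children.values())
    while stack:
        cur = stack.pop()
        hits.append(cur.pos)
        stack.extend(cur.children.values())
    return sorted(hits)


root = build_position_heap("abab")
print(find_occurrences(root, "abab", "ab"))  # [0, 2]
```

Verifying each path node by direct comparison keeps the sketch short at the cost of extra character comparisons; it is a simplification made here, not a statement about how the paper performs search.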
First, the authors observe that if the maximum length L of patterns to be searched is known in advance, the height of the heap can be bounded by O(L). By limiting the depth, insertion and deletion operations, which previously could take O(n) time in the worst case, are reduced to O(L) time. The proof relies on the fact that a suffix never needs to be compared beyond its first L characters when the pattern length is bounded, so the heap never grows deeper than L levels. This result is especially relevant for applications that repeatedly query short patterns, such as log analysis or certain bioinformatics pipelines.
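The effect of the height bound can be illustrated with a small sketch. It assumes one particular mechanism that the summary does not confirm: a suffix whose walk reaches depth L is appended to an overflow list at that depth-L node instead of creating a deeper node. The names `build_bounded_heap`, `overflow`, `find_occurrences`, and `height` are hypothetical.

```python
class Node:
    def __init__(self, pos):
        self.pos = pos
        self.children = {}   # edge character -> child Node
        self.overflow = []   # assumption of this sketch: used only at depth max_len


def build_bounded_heap(text, max_len):
    """Naive right-to-left insertion; each insertion walks at most max_len edges."""
    root = Node(None)
    for i in range(len(text) - 1, -1, -1):
        node, depth = root, 0
        while depth < max_len and text[i + depth] in node.children:
            node = node.children[text[i + depth]]
            depth += 1
        if depth == max_len:
            node.overflow.append(i)   # stop here instead of growing deeper
        else:
            node.children[text[i + depth]] = Node(i)
    return root


def find_occurrences(root, text, pattern, max_len):
    if len(pattern) > max_len:
        raise ValueError("pattern is longer than the height bound")
    hits = []
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return sorted(hits)
        # Path nodes are candidates and must be checked against the text.
        if text.startswith(pattern, node.pos):
            hits.append(node.pos)
    # Everything stored at or below the final node starts with the pattern.
    hits.extend(node.overflow)
    stack = list(node.children.values())
    while stack:
        cur = stack.pop()
        hits.append(cur.pos)
        hits.extend(cur.overflow)
        stack.extend(cur.children.values())
    return sorted(hits)


def height(node):
    """Number of edges on the longest root-to-leaf path."""
    return 1 + max(map(height, node.children.values())) if node.children else 0
```

With this layout an insertion touches at most L nodes regardless of n, which is the property the paragraph above describes; deletion is left out of the sketch.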
Second, they introduce a linear‑time construction algorithm that is independent of the alphabet size σ. Traditional linear‑time methods often depend on bucket or radix sorting, incurring a σ factor when the alphabet is large (e.g., Unicode). The new algorithm scans the input string once, maintaining a stack of current common prefixes and dynamically linking new suffixes to the appropriate node. No extra sorting phase is required, and the memory footprint stays O(n). The authors provide an inductive correctness argument showing that each inserted suffix either matches an existing prefix or creates a new branching point, guaranteeing that the final structure is a valid position heap.
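The summary does not give enough detail to reproduce the linear-time builder, but the validity condition its correctness argument targets can be checked mechanically. The sketch below is a checker, not the construction itself. It assumes right-to-left insertion order and a plain nested-dict representation; `node` and `is_valid_position_heap` are hypothetical helper names.

```python
def node(pos, **children):
    """Hypothetical helper: build a heap node as a plain dict."""
    return {"pos": pos, "children": children}


def is_valid_position_heap(root, text):
    """Check the invariants of a position heap built by right-to-left insertion."""
    n = len(text)
    seen = set()
    stack = [(root, "")]  # (node, path label from the root)
    while stack:
        cur, label = stack.pop()
        for ch, child in cur["children"].items():
            pos = child["pos"]
            child_label = label + ch
            if not isinstance(pos, int) or not 0 <= pos < n or pos in seen:
                return False
            # The path label must be a prefix of the suffix starting at pos.
            if not text.startswith(child_label, pos):
                return False
            # Heap order: a parent was inserted earlier, so it holds a larger position.
            if cur["pos"] is not None and cur["pos"] <= pos:
                return False
            seen.add(pos)
            stack.append((child, child_label))
    # Exactly one node per text position.
    return len(seen) == n


# Position heap of "abab", written out by hand.
heap_abab = node(None, a=node(2, b=node(0)), b=node(3, a=node(1)))
print(is_valid_position_heap(heap_abab, "abab"))  # True
```

A checker of this kind is a convenient oracle when testing any faster builder against the naive one.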
Third, the paper shows how to augment a position heap with direct links to the corresponding suffix array (SA) and, conversely, how to embed heap pointers inside the SA. Each heap node stores its SA index, and each SA entry stores a pointer to its heap node. This bidirectional mapping enables O(1) time conversion between the two representations, eliminating the need to maintain them separately. Consequently, pattern‑matching queries can be answered while traversing the heap and the results can be emitted in SA order without additional sorting, which speeds up downstream processing such as range reporting or document retrieval.
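The two-way link can be sketched as follows, assuming a naively sorted suffix array and the naive right-to-left heap builder (both stand-ins chosen for brevity, not the paper's algorithms). The field `sa_index` and the table `sa_to_node` are hypothetical names for the two directions of the mapping.

```python
class Node:
    def __init__(self, pos, sa_index):
        self.pos = pos            # text position stored at this node
        self.sa_index = sa_index  # rank of text[pos:] among all suffixes
        self.children = {}        # edge character -> child Node


def build_linked(text):
    """Build a position heap and a suffix array that point at each other."""
    n = len(text)
    # Naive suffix array: sort the start positions by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    root = Node(None, None)
    sa_to_node = [None] * n       # SA index -> heap node
    for i in range(n - 1, -1, -1):
        node, j = root, i
        # The suffix is longer than every existing path, so j stays in range.
        while text[j] in node.children:
            node = node.children[text[j]]
            j += 1
        child = Node(i, rank[i])  # heap node -> SA index
        node.children[text[j]] = child
        sa_to_node[rank[i]] = child
    return root, sa, sa_to_node


root, sa, sa_to_node = build_linked("abaab")
# Either direction is a single field or array lookup.
node = sa_to_node[0]
assert sa[node.sa_index] == node.pos
```

Storing one integer per node and one reference per suffix array entry costs O(n) words in this sketch; how the paper represents the links is not stated in the summary.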
Fourth, the authors propose a variant of the position heap that can be simulated efficiently on top of a compressed suffix array (CSA). The CSA already provides LF-mapping and LCP information in a space-efficient manner. By adding O(n) extra bits in total to encode the heap topology (a constant number per suffix on average, rather than a ⌈log n⌉-bit identifier for each one, which would total Θ(n log n) bits), the heap can be navigated on the fly using the CSA's navigation primitives. All heap operations (search, insert, delete) are performed in O(log n) time or better. This construction makes it possible to enjoy the dynamic capabilities of a heap while keeping the storage cost close to that of a CSA, which is crucial for massive genomic or text collections.
Experimental evaluation compares the four techniques against classic position heap implementations, suffix trees, and FM‑indexes. With the height bound, insertion/deletion times drop by 30‑45 % on average. The alphabet‑independent builder outperforms radix‑based builders on Unicode data (σ≈100 000) by roughly 20 %. The SA‑heap linkage halves query response times compared to maintaining separate structures. Finally, the compressed‑heap simulation reduces memory consumption to less than 15 % of the uncompressed heap while preserving exact search results.
In summary, the paper establishes the position heap as a competitive alternative to suffix trees and suffix arrays. By controlling heap height, removing alphabet dependence from construction, providing constant‑time SA conversion, and enabling a compressed simulation, the authors deliver both stronger theoretical bounds and tangible performance gains. Future work is suggested on multi‑pattern simultaneous search, external‑memory scalability, and real‑time updates for streaming texts.