Sequence representations supporting queries $access$, $select$ and $rank$ are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong lower bound for $rank$, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations $access$ and $select$ can be solved in constant or almost-constant time, which is optimal for large alphabets. Our new upper bounds dominate all of the previous work in the time/space map.
A large number of data structures build on sequence representations. In particular, supporting the following three queries on a sequence S[1, n] over alphabet [1, σ] has proved extremely useful:
- $access(S, i)$ gives $S[i]$;
- $select_a(S, j)$ gives the position of the $j$th occurrence of $a \in [1, \sigma]$ in $S$; and
- $rank_a(S, i)$ gives the number of occurrences of $a \in [1, \sigma]$ in $S[1, i]$.
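As an illustration only, the three queries can be written as naive linear-time scans. This is a minimal sketch for fixing the semantics (1-based positions, as in the definitions above), not one of the compact structures discussed in this article; the function names and the sample string are assumptions made for the example.

```python
def access(S, i):
    """Return S[i], where i is a 1-based position."""
    return S[i - 1]


def rank(S, a, i):
    """Return the number of occurrences of symbol a in S[1, i]."""
    return sum(1 for x in S[:i] if x == a)


def select(S, a, j):
    """Return the 1-based position of the j-th occurrence of a in S,
    or None if a occurs fewer than j times."""
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == a:
            seen += 1
            if seen == j:
                return pos
    return None


# Hypothetical sample sequence, chosen only for illustration.
S = "abracadabra"
print(access(S, 4))       # a
print(rank(S, "a", 4))    # 2
print(select(S, "a", 3))  # 6
```

Note that rank and select are inverses in the sense that rank(S, a, select(S, a, j)) = j whenever the jth occurrence exists.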
The most basic case is that of bitmaps, when σ = 2. Obvious applications are set representations supporting membership and predecessor search, although many other uses, such as representing tree topologies, multisets, and partial sums [Jacobson 1989; Raman et al. 2007], have been reported. The focus of this article is on general alphabets, where further applications have been described. For example, the FM-index [Ferragina and Manzini 2005], a compressed indexed representation for text collections that supports pattern searches, is most successfully implemented over a sequence representation supporting access and rank [Ferragina et al. 2007], and more recently select [Belazzougui and Navarro 2011]. Grossi et al. [2003] had earlier used similar techniques for text indexing. Golynski et al. [2006] used these operations for representing labeled trees and permutations. Further applications of these operations to multi-labeled trees and binary relations were uncovered by Barbay et al. [2011]. Ferragina et al. [2009], Gupta et al. [2006], and Arroyuelo et al. [2010a] devised new applications to XML indexing. Other applications have been described for representing permutations and inverted indexes [Barbay and Navarro 2009; Barbay et al. 2012] and graphs [Claude and Navarro 2010; Hernández and Navarro 2012]. Välimäki and Mäkinen [2007] and Gagie et al. [2010] applied them to document retrieval on general texts. Finally, applications to various types of inverted indexes on natural language text collections have been explored [Brisaboa et al. 2012; Arroyuelo et al. 2010b; Arroyuelo et al. 2012].
When representing sequences supporting the three operations, it seems reasonable to aim for O(n lg σ) bits of space. However, in many applications the size of the data is huge and space usage is crucial: only sublinear space on top of the raw data can be accepted. This is our focus.
Various time- and space-efficient sequence representations supporting the three operations have been proposed, and various lower bounds have been proved. All the representations proposed assume the RAM model with word size w = Ω(lg n).
In the case of bitmaps, Munro [1996] and Clark [1996] achieved constant-time rank and select using $o(n)$ extra bits on top of a plain representation of $S$. Golynski [2007] proved a lower bound of $\Omega(n \lg\lg n / \lg n)$ extra bits for supporting either operation in constant time if $S$ is to be represented in plain form, and gave matching upper bounds. This assumption is particularly inconvenient in the frequent case where the bitmap is sparse, that is, it has only $m \ll n$ 1s, and hence can be compressed. When $S$ can be represented arbitrarily, Pȃtraşcu [2008] achieved $\lg\binom{n}{m} + O(n / \lg^c n)$ bits of space, where $c$ is any constant. This space was shown later to be optimal [Pȃtraşcu and Viola 2010]. However, the space can be reduced further, up to $\lg\binom{n}{m} + O(m)$ bits, if superconstant time for the operations is permitted [Gupta et al. 2007; Okanohara and Sadakane 2007], or if the operations are weakened: When $rank_1(S, i)$ can only be applied if $S[i] = 1$ and only $select_1(S, j)$ is supported, Raman et al. [2007] achieved constant time and $\lg\binom{n}{m} + o(m) + O(\lg\lg n)$ bits of space. When only $rank_1(S, i)$ is supported for the positions $i$ such that $S[i] = 1$, and in addition we cannot even determine $S[i]$, the structure is called a monotone minimum perfect hash function (mmphf) and can be implemented in $O(m \lg\lg \frac{n}{m})$ bits and answering in constant time [Belazzougui et al. 2009].
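To give a feel for why the sparse case matters, the following sketch evaluates the leading term $\lg\binom{n}{m}$ of the bounds above for one bitmap size and compares it with the $n$ bits of a plain representation. The values of `n` and `m` and the function name are assumptions chosen for illustration; the sandwich $m \lg\frac{n}{m} \le \lg\binom{n}{m} \le m \lg\frac{n}{m} + m \lg e$ used as a sanity check is the standard estimate of the binomial coefficient.

```python
import math


def sparse_bitmap_bits(n, m):
    """Leading space term lg C(n, m): the information-theoretic minimum
    number of bits needed to identify a bitmap of length n with m ones."""
    return math.log2(math.comb(n, m))


# Hypothetical sizes: one million positions, one thousand 1s.
n, m = 1_000_000, 1_000
bits = sparse_bitmap_bits(n, m)

# Standard estimate: (n/m)^m <= C(n, m) <= (e*n/m)^m.
lower = m * math.log2(n / m)
upper = lower + m * math.log2(math.e)

print(f"plain representation: {n} bits")
print(f"lg C(n, m):           {bits:.0f} bits")
print(f"estimate range:       [{lower:.0f}, {upper:.0f}] bits")
```

For these sizes the leading term is a small fraction of $n$, which is why lower-order terms such as $O(n / \lg^c n)$ versus $O(m)$ can dominate the total space.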
For general sequences, a useful measure of compressibility is the zeroth-order entropy of $S$, $H_0(S) = \sum_{a \in [1,\sigma]} \frac{n_a}{n} \lg \frac{n}{n_a}$, where $n_a$ is the number of occurrences of $a$ in $S$. This can be extended to the $k$-th order entropy, $H_k(S) = \frac{1}{n} \sum_{A \in [1,\sigma]^k} |T_A| H_0(T_A)$, where $T_A$ is the string of symbols following $k$-tuple $A$ in $S$. It holds $0 \le H_k(S) \le H_{k-1}(S) \le H_0(S) \le \lg \sigma$ for any $k$, but the entropy measure is only meaningful for $k < \log_\sigma n$. See Manzini [2001] and Gagie [2006] for a deeper discussion.
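The two entropy definitions can be evaluated directly from symbol counts. The sketch below is a straightforward transcription of the formulas for $H_0$ and $H_k$, assuming the sequence is given as a Python string or list; the function names `h0` and `hk` and the sample string are hypothetical and serve only to illustrate the definitions.

```python
import math
from collections import Counter, defaultdict


def h0(S):
    """Zeroth-order empirical entropy: sum over symbols a of
    (n_a / n) * lg(n / n_a), in bits per symbol."""
    n = len(S)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(S).values())


def hk(S, k):
    """k-th order empirical entropy: (1/n) * sum over k-tuples A of
    |T_A| * H_0(T_A), where T_A collects the symbols following A in S."""
    n = len(S)
    if n == 0:
        return 0.0
    followers = defaultdict(list)
    for i in range(n - k):
        # The k symbols starting at i form the context; S[i + k] follows it.
        followers[tuple(S[i:i + k])].append(S[i + k])
    return sum(len(T) * h0(T) for T in followers.values()) / n


# Hypothetical sample sequence, chosen only for illustration.
S = "abracadabra"
print(h0(S), hk(S, 1), hk(S, 2))
```

With $k = 0$ the only context is the empty tuple and $T_A = S$, so `hk(S, 0)` reduces to `h0(S)`, matching the definitions.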
We say that a representation of $S$ is succinct if it takes $n \lg \sigma + o(n \lg \sigma)$ bits, zeroth-order compressed if it takes $nH_0(S) + o(n \lg \sigma)$ bits, and high-order compressed if it takes $nH_k(S) + o(n \lg \sigma)$ bits. We may also compress the redundancy, $o(n \lg \sigma)$, to use for example $nH_0(S) + o(nH_0(S))$ bits.
Upper and lower bounds for sequence representations supporting the three operations are far less understood over arbitrary alphabets. Grossi et al. [2003] introduced the wavelet tree, a zeroth-order compressed representation using $nH_0(S) + o(n \lg \sigma)$ bits that solves the three