Improved Grammar-Based Compressed Indexes

Reading time: 5 minutes
...

📝 Original Info

  • Title: Improved Grammar-Based Compressed Indexes
  • ArXiv ID: 1110.4493
  • Date: 2023-05-17
  • Authors: N/A

📝 Abstract

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text $T[1..u]$ that is represented by a (context-free) grammar of $n$ (terminal and nonterminal) symbols and size $N$ (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of $T$ takes $N\lg n$ bits of space. Our representation requires $2N\lg n + N\lg u + \epsilon\, n\lg n + o(N\lg n)$ bits of space, for any $0<\epsilon \le 1$. It can find the positions of the $occ$ occurrences of a pattern of length $m$ in $T$ in $O((m^2/\epsilon)\lg (\frac{\lg u}{\lg n}) + occ\lg n)$ time, and extract any substring of length $\ell$ of $T$ in time $O(\ell+h\lg(N/h))$, where $h$ is the height of the grammar tree.

📄 Full Content

Grammar-based compression is an active area of research that dates back at least to the seventies. A given sequence T[1..u] over alphabet [1..σ] is replaced by a hopefully small (context-free) grammar G that generates just the string T. Let n be the number of grammar symbols, counting terminals and nonterminals. Let N be the size of the grammar, measured as the sum of the lengths of the right-hand sides of the rules. Then the grammar-compressed representation of T requires N lg n bits, versus the u lg σ bits required by a plain representation.
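To make this bookkeeping concrete, the following Python sketch builds a toy "doubling" grammar for a highly repetitive string (a hypothetical example, not taken from the paper) and compares the N lg n bound against the u lg σ bits of a plain representation.

```python
import math

# A toy doubling grammar (hypothetical example):
# X1 -> a b,  X2 -> X1 X1,  ...,  Xk -> X(k-1) X(k-1).
# The start symbol Xk expands to "ab" repeated 2^(k-1) times.
k = 11
grammar = {'X1': ['a', 'b']}
for i in range(2, k + 1):
    grammar[f'X{i}'] = [f'X{i-1}', f'X{i-1}']

def expand(sym, rules):
    """Recursively expand a symbol into the string it generates."""
    if sym not in rules:                              # terminal symbol
        return sym
    return ''.join(expand(s, rules) for s in rules[sym])

T = expand(f'X{k}', grammar)
u, sigma = len(T), len(set(T))                        # text length, alphabet size
n = len(grammar) + sigma                              # nonterminals + terminals
N = sum(len(rhs) for rhs in grammar.values())         # sum of right-hand side lengths

print(f"u = {u}, sigma = {sigma}, n = {n}, N = {N}")
print(f"plain representation:   u * lg(sigma) = {u * math.log2(sigma):.0f} bits")
print(f"grammar representation: N * lg(n)    ~= {N * math.log2(n):.0f} bits")
```

For this repetitive text of length 2048 the grammar bound is roughly 80 bits versus 2048 bits for the plain encoding, which is the effect the paragraph above describes.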

Grammar-based methods can achieve universal compression [21]. Unlike statistical methods, which exploit character frequencies to achieve compression, grammar-based methods exploit repetitions in the text, and thus they are especially suitable for compressing highly repetitive sequence collections. These collections, which contain long identical substrings possibly far away from each other, arise when managing software repositories, versioned documents, temporal databases, transaction logs, periodic publications, and computational biology sequence databases.

Finding the smallest grammar G* that represents a given text T is NP-complete [33,9]. Moreover, the smallest grammar is never smaller than an LZ77 parse [35] of T. A simple method to achieve an O(lg u)-approximation to the smallest grammar size is to parse T using LZ77 and then to convert it into a grammar [33]. A more sophisticated approximation achieves ratio O(lg(u/N*)), where N* is the size of G*.
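To make the LZ77 lower bound concrete, here is a minimal sketch (naive quadratic matching, only for illustration, not a real parser) of a greedy LZ77 parse; as stated above, the number of phrases it produces lower-bounds the size of the smallest grammar for the same text.

```python
def lz77_parse(T):
    """Greedy LZ77 parse (naive quadratic matching, for illustration only).
    Each phrase is the longest prefix of the remaining suffix that already
    occurs starting at an earlier position (overlaps allowed), plus one
    fresh character.  Returns (source, length, next_char) triples."""
    phrases, i = [], 0
    while i < len(T):
        best_len, best_src = 0, None
        for j in range(i):                        # candidate earlier sources
            l = 0
            while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        phrases.append((best_src, best_len, T[i + best_len]))
        i += best_len + 1
    return phrases

# A highly repetitive text parses into very few phrases:
print(lz77_parse("abababababababab$"))   # [(None, 0, 'a'), (None, 0, 'b'), (0, 14, '$')]
```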

While grammar-compression methods are strictly inferior to LZ77 compression, and some popular grammar-based compressors such as LZ78 [36], Re-Pair [24] and Sequitur [30] can generate sizes much larger than the smallest grammar [9], some of those methods (in particular Re-Pair) perform very well in practice, both in classical and repetitive settings. In exchange, unlike LZ77, grammar compression allows one to decompress arbitrary substrings of T almost optimally [16,6]. The most recent result [6] extracts any T[p..p+ℓ-1] in time O(ℓ + lg u). Unfortunately, the representation that achieves this time complexity requires O(N lg u) bits, which is possibly proportional to, but in practice many times, the size of the output of a grammar-based compressor. On the practical side, applications like Comrad [23] achieve good space and time performance for extracting substrings of T.
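Since Re-Pair is singled out above, the following toy sketch shows its core pair-replacement idea. It is a quadratic-time illustration, not the linear-time algorithm of [24].

```python
from collections import Counter

def repair(text):
    """Toy Re-Pair compressor: repeatedly replace the most frequent adjacent
    pair of symbols by a fresh nonterminal until no pair appears twice.
    Returns (final_sequence, rules) with rules mapping nonterminal -> (a, b)."""
    seq, rules, next_id = list(text), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nt = f"R{next_id}"                       # fresh nonterminal, rule nt -> a b
        next_id += 1
        rules[nt] = (a, b)
        out, i = [], 0
        while i < len(seq):                      # replace non-overlapping occurrences
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

seq, rules = repair("abracadabra abracadabra abracadabra")
print(seq)
print(rules)
```

The repeated word is folded into a handful of nonterminals, which is why Re-Pair behaves well on repetitive collections.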

More ambitious than just extracting arbitrary substrings from T is to ask for indexed searches, that is, finding all the occ occurrences in T of a given pattern P[1..m]. Self-indexes are compressed text representations that support both operations, extract and search, in time that depends only polylogarithmically on u.
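Concretely, a self-index must answer two queries over the text it replaces. The interface sketch below is only a hypothetical illustration of that contract; the class and method names are not from the paper.

```python
from abc import ABC, abstractmethod
from typing import List

class SelfIndex(ABC):
    """Hypothetical interface: a self-index replaces the text T[1..u] and must
    answer both queries below without keeping T in plain form."""

    @abstractmethod
    def extract(self, p: int, length: int) -> str:
        """Return the substring T[p .. p + length - 1]."""

    @abstractmethod
    def locate(self, pattern: str) -> List[int]:
        """Return the starting positions of all occurrences of pattern in T."""
```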

They have appeared in the last decade [28], and have focused mostly on statistical compression. As a result, they work well on classical texts, but not on repetitive collections [25]. Some of those self-indexes have been adapted to repetitive collections [25], but they cannot reach the compression ratio of the best grammar-based methods.

Searching for patterns on grammar-compressed text has been addressed mostly in sequential form [2], that is, by scanning the whole grammar. The best result [20] achieves time O(N + m² + occ). This may be o(u), but it is still linear in the size of the compressed text. There exist a few self-indexes based on LZ78-like compression [15,3,32], but LZ78 is among the weakest grammar-based compressors. In particular, LZ78 has been shown not to be competitive on highly repetitive collections [25].

The only self-index supporting general grammar compressors [13] operates on “straight-line programs” (SLPs), where the right-hand sides of the rules have length 1 or 2. Given such a grammar, it achieves, among other tradeoffs, 3n lg n + n lg u bits of space and O(m(m + h) lg² n) search time, where h is the height of the parse tree of the grammar. A general grammar of n symbols and size N can be converted into an SLP of at most N rules.
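The conversion mentioned above can be illustrated by a straightforward left-to-right binarization (a generic sketch, not the specific transformation used in [13]): each rule of length k becomes k−1 binary rules, so the total number of SLP rules stays within the grammar size N.

```python
def to_slp(grammar):
    """Binarize a general CFG (nonterminal -> list of symbols) into an SLP
    whose right-hand sides have length 1 or 2.  Illustrative sketch; the
    names of the fresh nonterminals are arbitrary."""
    slp, fresh = {}, 0
    for head, rhs in grammar.items():
        if len(rhs) <= 2:
            slp[head] = list(rhs)
            continue
        # A -> B1 B2 ... Bk becomes A -> B1 C1, C1 -> B2 C2, ..., C(k-2) -> B(k-1) Bk,
        # i.e. k-1 binary rules per original rule of length k.
        prev = head
        for i in range(len(rhs) - 2):
            new = f"_{head}_{fresh}"
            fresh += 1
            slp[prev] = [rhs[i], new]
            prev = new
        slp[prev] = [rhs[-2], rhs[-1]]
    return slp

g = {'S': ['a', 'A', 'b', 'A', 'c'], 'A': ['x', 'y', 'z']}
print(to_slp(g))
```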

More recently, a self-index based on LZ77 compression has been developed [22]. Given a parsing of T into n phrases, the self-index uses n lg n + 2n lg u + O(n lg σ) bits of space, and searches in time O(m²h + (m + occ) lg n), where h is the nesting depth of the parsing. Extracting ℓ symbols requires O(ℓh) time. Experiments on repetitive collections [11,12] show that the grammar-based self-index [13] can be competitive with the best classical self-index adapted to repetitive collections [25] but, at least in that particular implementation, it is not competitive with the LZ77-based self-index [22].

Note that the search time in both self-indexes depends on h. This is undesirable as h is only bounded by n. That kind of dependence has been removed for extracting text substrings [6], but not for searches.

Our main contribution is a new representation of general context-free grammars, whose space and query bounds are those stated in the abstract above. Note that the search time is independent of h. The rest of the paper describes how this structure operates.
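Although the theorem statement itself is omitted from this excerpt, the extraction primitive such structures rely on can be illustrated with a standard technique: store the expansion length of every nonterminal and descend the grammar tree toward the requested positions, skipping subtrees that fall outside the range. The sketch below is a generic illustration under that assumption, not the paper's data structure (which attains the O(ℓ + h lg(N/h)) extraction bound with additional machinery).

```python
def expansion_lengths(slp):
    """Memoized expansion length |exp(X)| for every symbol of a binary SLP;
    symbols with no rule are terminals of length 1."""
    memo = {}
    def length(sym):
        if sym not in memo:
            memo[sym] = 1 if sym not in slp else sum(length(c) for c in slp[sym])
        return memo[sym]
    for s in slp:
        length(s)
    return memo

def extract(slp, lens, sym, p, l):
    """Return exp(sym)[p : p + l] (0-based) by descending the grammar tree and
    skipping children whose expansion falls outside the requested range."""
    if l <= 0:
        return ""
    if sym not in slp:                            # terminal expands to itself
        return sym
    out, offset = [], 0
    for child in slp[sym]:
        clen = lens.get(child, 1)
        lo, hi = max(p, offset), min(p + l, offset + clen)
        if lo < hi:                               # range intersects this child
            out.append(extract(slp, lens, child, lo - offset, hi - lo))
        offset += clen
    return "".join(out)

# Tiny SLP for "abababab" (hypothetical example):
slp = {'A': ['a', 'b'], 'B': ['A', 'A'], 'S': ['B', 'B']}
lens = expansion_lengths(slp)
print(extract(slp, lens, 'S', 3, 4))              # -> "baba"
```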

Reference

This content is AI-processed based on open access ArXiv data.
