Faster subsequence recognition in compressed strings
Computation on compressed strings is one of the key approaches to processing massive data sets. We consider local subsequence recognition problems on strings compressed by straight-line programs (SLP), a compression model closely related to Lempel–Ziv compression. For an SLP-compressed text of length $\bar m$ and an uncompressed pattern of length $n$, Cégielski et al. gave an algorithm for local subsequence recognition running in time $O(\bar mn^2 \log n)$. We improve the running time to $O(\bar mn^{1.5})$. Our algorithm can also be used to compute the longest common subsequence between a compressed text and an uncompressed pattern in time $O(\bar mn^{1.5})$; the same problem with a compressed pattern is known to be NP-hard.
💡 Research Summary
The paper addresses the problem of recognizing subsequences locally within a text that is compressed using a straight-line program (SLP), while the pattern remains uncompressed. An SLP is a context-free grammar that generates exactly one string; it captures Lempel–Ziv (LZ) and LZW compression and can represent a text of length $m$ with a description of size $\bar m$, where $\bar m$ may be exponentially smaller than $m$.
Cégielski et al. previously gave algorithms for global and local subsequence recognition on SLP-compressed texts that run in $O(\bar m n^2 \log n)$ time, where $n$ is the length of the uncompressed pattern. The present work improves this bound substantially. First, a simple folklore algorithm solves the global subsequence recognition problem in $O(\bar m n)$ time. The main contribution, however, is an $O(\bar m n^{1.5})$-time algorithm for the more general "partial semi-local longest common subsequence" (LCS) problem. This problem asks for the LCS scores of the compressed text against every substring, every prefix–suffix combination, and every suffix–prefix combination of the pattern. Computing these scores implicitly yields solutions to all the local subsequence-counting problems considered by Cégielski et al.: minimal-window, fixed-window, and bounded-minimal-window counting.
The technical core relies on a geometric representation of LCS via an alignment directed acyclic graph (dag) embedded in an $m \times n$ grid. Extending the finite grid to an infinite one allows the definition of a highest-score matrix $A(i, j)$, which records the optimal LCS score between the text prefix ending at position $i$ and the pattern suffix starting at position $j$. Instead of storing $A$ explicitly, the paper encodes it by a set of "critical points" located at half-integer coordinates. These critical points form an infinite permutation matrix $D_A$; its distribution matrix $d_A$ satisfies the simple identity $A(i, j) = j - i - d_A(i, j)$. Thus, the entire highest-score matrix is uniquely determined by the permutation matrix of critical points.
The crucial observation is that the composition of two such permutation matrices corresponds to the "partial highest-score matrix multiplication" introduced in earlier work; applying this composition bottom-up over the SLP's rules is what yields the claimed $O(\bar m n^{1.5})$ running time.
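Via the identity $A(i, j) = j - i - d_A(i, j)$, composing two highest-score matrices amounts to a $(\min, +)$ product of their distribution matrices, $d_C(i, k) = \min_j \big(d_A(i, j) + d_B(j, k)\big)$. A naive cubic-time sketch of that product (illustration only; the efficiency of the actual algorithm comes from exploiting the permutation structure of the critical points rather than multiplying dense matrices):

```python
def minplus_product(dA, dB):
    """Naive (min,+) product of two square matrices:
    C[i][k] = min over j of dA[i][j] + dB[j][k]."""
    n = len(dA)
    return [[min(dA[i][j] + dB[j][k] for j in range(n))
             for k in range(n)]
            for i in range(n)]

# Tiny hypothetical example (not real distribution matrices):
dA = [[0, 1], [2, 0]]
dB = [[0, 2], [1, 0]]
print(minplus_product(dA, dB))  # [[0, 1], [1, 0]]
```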