Incongruity-sensitive access to highly compressed strings

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Random access to highly compressed strings – represented by straight-line programs or Lempel-Ziv parses, for example – is a well-studied topic. Random access to such strings in strongly sublogarithmic time is impossible in the worst case, but previous authors have shown how to support faster access to specific characters and their neighbourhoods. In this paper we explore whether, since better compression can impede access, we can support faster access to relatively incompressible substrings of highly compressed strings. We first show how, given a run-length compressed straight-line program (RLSLP) of size $g_{rl}$ or a block tree of size $L$, we can build an $O (g_{rl})$-space or an $O (L)$-space data structure, respectively, that supports access to any character in time logarithmic in the length of the longest repeated substring containing that character. That is, the more incongruous a character is with respect to the characters around it in a certain sense, the faster we can support access to it. We then prove a similar but more powerful and sophisticated result for parsings in which phrases’ sources do not overlap much larger phrases, with the query time depending also on the number of phrases we must copy from their sources to obtain the queried character.


💡 Research Summary

The paper investigates a new angle on the classic problem of random access in highly compressed strings, focusing on the relationship between compressibility of a local region and the speed of accessing characters in that region. While prior work has shown that sub‑logarithmic worst‑case access is impossible for strings stored in space polynomial in strong compressibility measures (e.g., smallest SLP size, substring complexity, or attractor size), it also demonstrated that certain “easy” positions—such as the ends of non‑terminals—can be accessed faster. The authors ask whether the opposite is true: can we guarantee faster access to characters that lie in relatively incompressible (i.e., incongruous) parts of a compressed text? This question is motivated by practical scenarios such as genomic analysis, where rare variants or mutations correspond to low‑repetition regions that are often the most interesting.

The core contribution is a family of data structures whose query time depends on the length $\ell_q$ of the longest repeated substring that contains the queried character S
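
To make the measure behind these bounds concrete, the following is a minimal brute-force sketch that computes, for every position of a small uncompressed string, the length of the longest repeated substring containing that position. It is not the paper's data structure and works on plain text rather than on an RLSLP or block tree. The function name `longest_repeat_containing` is hypothetical, and the sketch assumes that a substring counts as repeated when it occurs at least twice, with overlapping occurrences allowed.

```python
def longest_repeat_containing(s: str) -> list[int]:
    """For each position q of s, return the length of the longest substring
    that covers q and occurs at least twice in s (overlaps allowed).
    A position covered by no repeated substring gets 0.

    Illustrative brute force only (assumption: not the paper's method).
    """
    n = len(s)
    best = [0] * n
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            # Repeated if it also occurs starting before i or after i.
            occurs_elsewhere = s.find(sub) != i or s.find(sub, i + 1) != -1
            if not occurs_elsewhere:
                # Any longer substring starting at i extends a unique
                # substring, so it is unique too; stop extending.
                break
            for q in range(i, j):
                best[q] = max(best[q], j - i)
    return best


if __name__ == "__main__":
    text = "abcabcZabcabc"
    for ch, length in zip(text, longest_repeat_containing(text)):
        print(ch, length)
```

In this toy input the character `Z` lies in no repeated substring, so its value is 0, while every other character lies in the repeated block `abcabc` of length 6. Under the bounds described above, positions with small values are the ones for which faster access is supported.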

