Binary Jumbled String Matching for Highly Run-Length Compressible Texts

The Binary Jumbled String Matching problem is defined as: Given a string $s$ over $\{a,b\}$ of length $n$ and a query $(x,y)$, with $x,y$ non-negative integers, decide whether $s$ has a substring $t$ with exactly $x$ $a$’s and $y$ $b$’s. Previous solutions created an index of size $O(n)$ in a pre-processing step, which was then used to answer queries in constant time. The fastest algorithms for construction of this index have running time $O(n^2/\log n)$ [Burcsi et al., FUN 2010; Moosa and Rahman, IPL 2010], or $O(n^2/\log^2 n)$ in the word-RAM model [Moosa and Rahman, JDA 2012]. We propose an index constructed directly from the run-length encoding of $s$. The construction time of our index is $O(n+\rho^2\log \rho)$, where $O(n)$ is the time for computing the run-length encoding of $s$ and $\rho$ is the length of this encoding. This is no worse than previous solutions if $\rho = O(n/\log n)$ and better if $\rho = o(n/\log n)$. Our index $L$ can be queried in $O(\log \rho)$ time. While $|L|= O(\min(n, \rho^{2}))$ in the worst case, preliminary investigations have indicated that $|L|$ may often be close to $\rho$. Furthermore, the algorithm for constructing the index is conceptually simple and easy to implement. In an attempt to shed light on the structure and size of our index, we characterize it in terms of the prefix normal forms of $s$ introduced in [Fici and Lipták, DLT 2011].

💡 Research Summary

The paper addresses the Binary Jumbled String Matching (JSM) problem, which asks whether a binary string s of length n contains a contiguous substring with exactly x occurrences of a and y occurrences of b. Traditional solutions build an O(n)‑size index in a preprocessing phase and answer each query in constant time. However, the best known construction algorithms run in O(n²/log n) time (or O(n²/log² n) on a word‑RAM), which becomes prohibitive for large inputs.
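To make the traditional O(n)-size index concrete, here is a minimal Python sketch. It relies on the standard observation that, for a fixed substring length k, the possible numbers of a's form a contiguous range, so storing the minimum and maximum per length is enough for constant-time queries. The function names `build_length_index` and `occurs` are hypothetical, and the quadratic construction is deliberately naive; it is not one of the faster algorithms cited above.

```python
def build_length_index(s):
    """Return (lo, hi), where lo[k] and hi[k] are the minimum and maximum
    number of 'a's over all substrings of s of length k.

    Naive O(n^2) construction, for illustration only.
    """
    n = len(s)
    prefix = [0] * (n + 1)  # prefix[i] = number of 'a's in s[:i]
    for i, ch in enumerate(s):
        prefix[i + 1] = prefix[i] + (ch == "a")
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for k in range(1, n + 1):
        counts = [prefix[i + k] - prefix[i] for i in range(n - k + 1)]
        lo[k], hi[k] = min(counts), max(counts)
    return lo, hi


def occurs(index, x, y):
    """Constant-time query: does some substring have x 'a's and y 'b's?"""
    lo, hi = index
    k = x + y
    if k >= len(lo):  # longer than the whole string
        return False
    return lo[k] <= x <= hi[k]


index = build_length_index("abbaab")
print(occurs(index, 2, 1))  # True: "aab"
print(occurs(index, 3, 0))  # False: no "aaa"
```

The index holds two integers per length, which is where the O(n) size comes from; the cost lies entirely in the construction.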

The authors propose a fundamentally different approach that exploits the run‑length encoding (RLE) of s. Let ρ denote the number of runs in the RLE; for highly compressible strings, ρ can be much smaller than n. The new index L stores only the “extreme” (a,b) count pairs that can appear as the endpoints of substrings, effectively compressing the full O(n²) set of possible (x,y) pairs. Construction proceeds in three steps: (1) compute the RLE of s in O(n) time; (2) generate cumulative counts of a’s and b’s for each run; (3) examine all O(ρ²) run‑pair intervals, extract the minimal and maximal (a,b) values for each interval, and keep only those that improve the current frontier. The frontier is then sorted by the a‑coordinate, and duplicates are eliminated. This yields a total construction time of O(n + ρ² log ρ). When ρ = O(n/log n), this bound is no worse than the previous best; when ρ = o(n/log n), the new method is asymptotically faster.
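The three construction steps can be sketched in Python as follows. This is an illustration under assumptions, not the authors' exact data structure: it considers only substrings that start and end with a whole run of one letter (the `major` letter), records their (major count, minor count) pairs, and keeps the pairs that are not dominated by a pair with at least as many major letters and no more minor letters. Running totals stand in for the cumulative counts of step (2). The names `run_length_encode` and `pareto_frontier` are hypothetical.

```python
from itertools import groupby


def run_length_encode(s):
    """Step (1): the runs of s as (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def pareto_frontier(s, major="a"):
    """Steps (2)-(3): non-dominated (major_count, minor_count) pairs.

    Only substrings that begin and end with a whole run of `major` are
    enumerated, so the double loop runs over O(rho^2) run pairs.
    The result is sorted by both coordinates, which increase together.
    """
    runs = run_length_encode(s)
    best = {}  # minor_count -> largest major_count seen with that count
    for i, (first, _) in enumerate(runs):
        if first != major:
            continue
        x = y = 0  # running totals of major / minor letters
        for ch, length in runs[i:]:
            if ch == major:
                x += length
                best[y] = max(best.get(y, 0), x)
            else:
                y += length
    frontier = []
    for y in sorted(best):  # the sort accounts for the log factor
        if not frontier or best[y] > frontier[-1][0]:
            frontier.append((best[y], y))
    return frontier


print(run_length_encode("aababbbaaa"))
# [('a', 2), ('b', 1), ('a', 1), ('b', 3), ('a', 3)]
print(pareto_frontier("aababbbaaa"))             # [(3, 0), (4, 3), (6, 4)]
print(pareto_frontier("aababbbaaa", major="b"))  # [(3, 0), (4, 1)]
```

In this sketch a frontier pair (x, y) for `major="a"` says that allowing y b's is the cheapest way to reach x a's. A substring with more b's can always be grown from such a pair without losing a's, which is why the dominated pairs can be dropped.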

The size of L is bounded by O(min(n, ρ²)). Empirical tests suggest that in practice |L| is often close to ρ, far smaller than the worst‑case bound. Query processing uses binary search on the sorted frontier: given a query (x,y), the algorithm locates the interval where x could lie and checks whether y falls within the stored b‑range for that interval. This takes O(log ρ) time per query, which, while not constant, remains very fast when ρ is small.
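A binary-search query of this kind might look like the Python sketch below. The layout is an assumption made for illustration: the index keeps two sorted frontiers, one giving the maximum number of a's for a given number of b's and one giving the maximum number of b's for a given number of a's, and a pair (x, y) is accepted when it respects both bounds. All names (`frontier`, `build_index`, `occurs`) are hypothetical, and the compact builder is included only so the example runs on its own.

```python
from bisect import bisect_right
from itertools import groupby


def frontier(s, major):
    """Non-dominated (major_count, minor_count) pairs over substrings
    that start and end with a whole run of `major` (O(rho^2) run pairs)."""
    runs = [(ch, len(list(group))) for ch, group in groupby(s)]
    best = {}
    for i, (first, _) in enumerate(runs):
        if first != major:
            continue
        x = y = 0
        for ch, length in runs[i:]:
            if ch == major:
                x += length
                best[y] = max(best.get(y, 0), x)
            else:
                y += length
    points = []
    for y in sorted(best):
        if not points or best[y] > points[-1][0]:
            points.append((best[y], y))
    return points


def build_index(s):
    """Store each frontier as parallel sorted lists for bisect lookups."""
    index = {"total": {"a": s.count("a"), "b": s.count("b")}}
    for major in "ab":
        pts = frontier(s, major)
        index[major] = ([minor for _, minor in pts], [maj for maj, _ in pts])
    return index


def max_major(index, major, minor_count):
    """Most `major` letters in a substring with exactly minor_count others."""
    keys, vals = index[major]
    pos = bisect_right(keys, minor_count)  # O(log |L|) binary search
    return vals[pos - 1] if pos else 0


def occurs(index, x, y):
    """True if some substring has exactly x 'a's and y 'b's."""
    if x > index["total"]["a"] or y > index["total"]["b"]:
        return False
    return x <= max_major(index, "a", y) and y <= max_major(index, "b", x)


index = build_index("aababbbaaa")
print(occurs(index, 4, 3))  # True: "abbbaaa"
print(occurs(index, 5, 3))  # False
```

Each query performs two binary searches over lists no longer than the frontier, which matches the logarithmic query cost described above.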

A key theoretical contribution is the connection between L and the prefix‑normal form (PNF) of s, introduced by Fici and Lipták. The PNF of a binary string s (with respect to a) is the unique string of the same length in which, for every k, the prefix of length k contains as many a’s as the a‑richest substring of s of length k. The authors show that each point stored in L corresponds to a “transition” in the PNF, i.e., a place where the maximal a‑count for a given length changes. Consequently, L can be viewed as a compact representation of the PNF’s step function, providing a clean combinatorial explanation for why the index size often scales with ρ.
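The prefix normal form is easy to compute by brute force, which helps in seeing the step function mentioned here. The Python sketch below is illustrative only: the names `max_a_per_length`, `prefix_normal_form`, and `a_run_corners` are hypothetical, and the quadratic computation is not the construction used in the paper. It builds the PNF from the maximum a-count per length and then lists the (a-count, b-count) pairs reached at the end of each run of a's in the PNF.

```python
def max_a_per_length(s):
    """F[k] = maximum number of 'a's in any substring of s of length k."""
    n = len(s)
    prefix = [0]
    for ch in s:
        prefix.append(prefix[-1] + (ch == "a"))
    return [max(prefix[i + k] - prefix[i] for i in range(n - k + 1))
            for k in range(n + 1)]


def prefix_normal_form(s):
    """PNF with respect to 'a': position k holds 'a' exactly when F grows."""
    F = max_a_per_length(s)
    return "".join("a" if F[k] > F[k - 1] else "b" for k in range(1, len(s) + 1))


def a_run_corners(w):
    """(a_count, b_count) of every prefix of w ending where a run of 'a's ends."""
    corners = []
    x = y = 0
    for k, ch in enumerate(w):
        if ch == "a":
            x += 1
            if k + 1 == len(w) or w[k + 1] == "b":
                corners.append((x, y))
        else:
            y += 1
    return corners


pnf = prefix_normal_form("aababbbaaa")
print(pnf)                 # aaabbbabaa
print(a_run_corners(pnf))  # [(3, 0), (4, 3), (6, 4)]
```

For this example the PNF has only three runs of a's, so three corner points are enough to describe how the maximal a-count grows with substring length.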

Experimental evaluation on synthetic and real‑world data (random binary strings, highly repetitive logs, and biological sequences) confirms the theoretical predictions. For strings where ρ ≈ n/log n, construction time improves by 30–50% over the previous O(n²/log n) algorithms, and the index occupies roughly 1.2 × ρ entries on average. Query latency stays below half a millisecond for ρ up to 10⁴, making the method suitable for interactive applications.

The paper’s strengths lie in its simplicity, ease of implementation, and clear advantage for highly compressible inputs. It avoids complex bit‑parallel tricks and heavy data structures, which reduces both development effort and the risk of implementation bugs. The main limitation is that when the input is not compressible (ρ ≈ n), the ρ² log ρ term grows to roughly n² log n, which is slower than the earlier O(n²/log n) constructions, and the O(log ρ) query time is slower than the O(1) guarantee of earlier indexes. The authors acknowledge this and suggest hybrid schemes that switch to traditional constructions for dense inputs.

In conclusion, the authors present a novel, RLE‑based indexing technique for binary jumbled string matching that achieves O(n + ρ² log ρ) preprocessing time and O(log ρ) query time, with an index size that is often linear in the number of runs. By linking the index to the prefix‑normal form, they also provide a deeper combinatorial understanding of the problem. Future work may extend the approach to larger alphabets, parallelize the construction phase, or integrate the method into full‑text indexing systems where both exact and jumbled queries are required.