Grammar-Based Construction of Indexes for Binary Jumbled Pattern Matching


We show how, given a straight-line program with $g$ rules for a binary string $B$ of length $n$, in $O(g^{2/3} n^{4/3})$ time we can build a linear-space index such that, given $m$ and $c$, in $O(1)$ time we can determine whether there is a substring of $B$ with length $m$ containing exactly $c$ copies of 1. If we use $O(n \log n)$ space for the index, then we can list all such substrings using $O(m)$ time per substring.


💡 Research Summary

The paper addresses the problem of binary jumbled pattern matching on a string $B$ that is given in a compressed form as a straight‑line program (SLP) with $g$ production rules. A jumbled pattern query asks, for a prescribed length $m$ and a prescribed number $c$ of 1‑bits, whether $B$ contains a substring of length $m$ that contains exactly $c$ ones. While the problem is trivial on an uncompressed string—one can pre‑compute for each length the minimum and maximum possible number of 1’s and answer a query in $O(1)$ time—the challenge is to achieve comparable performance when the input is already compressed.
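The trivial uncompressed-case index mentioned above can be sketched as follows. This is an illustrative $O(n^2)$ reference implementation (the function names are ours, not the paper's); the $O(1)$ query relies on the fact that sliding a fixed-length window one position changes its 1-count by at most one, so every count between the minimum and maximum is achieved.

```python
def build_jumbled_index(B):
    """Naive sketch of the uncompressed index: for each window length m,
    record the minimum and maximum number of 1s over all substrings of B
    of length m.  Runs in O(n^2) time; the paper's point is to do better
    when B is given compressed."""
    n = len(B)
    ones = [0] * (n + 1)              # prefix sums of 1-bits
    for i, bit in enumerate(B):
        ones[i + 1] = ones[i] + (1 if bit == '1' else 0)
    lo = [0] * (n + 1)                # lo[m] = min 1s in a length-m window
    hi = [0] * (n + 1)                # hi[m] = max 1s in a length-m window
    for m in range(1, n + 1):
        counts = [ones[i + m] - ones[i] for i in range(n - m + 1)]
        lo[m], hi[m] = min(counts), max(counts)
    return lo, hi

def query(lo, hi, m, c):
    """O(1) query: sliding a length-m window one position changes the
    1-count by at most 1, so every value in [lo[m], hi[m]] is achieved."""
    return 1 <= m < len(lo) and lo[m] <= c <= hi[m]
```

For example, on $B = \texttt{1011}$ the windows of length 2 contain between 1 and 2 ones, so `query(lo, hi, 2, 1)` succeeds while `query(lo, hi, 2, 0)` fails.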

The authors propose a grammar-based indexing scheme that works directly on the SLP. The core idea is to compute, for every non-terminal $X$ of the SLP, the set of feasible pairs $(\ell, k)$, where $\ell$ is a possible substring length generated by $X$ and $k$ is the number of 1's in such a substring. If $X \to YZ$ is a production, the feasible pairs for $X$ are obtained by "convolution" of the feasible pairs of $Y$ and $Z$: for each $(\ell_1, k_1)$ in $Y$'s set and $(\ell_2, k_2)$ in $Z$'s set, the pair $(\ell_1 + \ell_2, k_1 + k_2)$ is added to $X$'s set. To keep the size of these sets manageable, the algorithm applies a dominance filter: for a fixed length $\ell$, only the smallest and largest values of $k$ are retained, because sliding a window one position changes its 1-count by at most one, so every intermediate $k$ is implied by the existence of the extremes. This filter guarantees that each non-terminal stores at most $O(n)$ pairs, and often far fewer.
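The merge step for a rule $X \to YZ$ can be sketched as below. This follows the simplified "convolution" description above (the paper's actual construction is more careful, e.g. a substring crossing the cut must be a suffix of $Y$'s expansion followed by a prefix of $Z$'s); sets are represented, after dominance pruning, as maps from a length $\ell$ to the pair of extreme 1-counts.

```python
def merge_feasible(Y, Z):
    """Sketch of the merge for a rule X -> YZ.  Y and Z map a substring
    length l to the pair (min_k, max_k) of achievable 1-counts; the
    dominance filter keeps only these two extremes per length.  The merged
    set contains every pair of Y, every pair of Z, and the 'convolution'
    (l1 + l2, k1 + k2) of pairs from Y and Z."""
    X = {}

    def add(l, k):
        if l not in X:
            X[l] = (k, k)
        else:
            lo, hi = X[l]
            X[l] = (min(lo, k), max(hi, k))

    for S in (Y, Z):                       # substrings wholly inside Y or Z
        for l, (lo, hi) in S.items():
            add(l, lo)
            add(l, hi)
    for l1, (lo1, hi1) in Y.items():       # substrings straddling the cut
        for l2, (lo2, hi2) in Z.items():
            add(l1 + l2, lo1 + lo2)        # extremes suffice: mixed sums
            add(l1 + l2, hi1 + hi2)        # fall between these two values
    return X
```

Only the two extreme sums per combined length need to be inserted, since any mixed combination such as $\mathit{lo}_1 + \mathit{hi}_2$ lies between them and is discarded by the filter anyway.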

The construction proceeds in two phases. The first phase handles the "large" non-terminals, whose productions have size at least $\Theta(g^{1/3})$. There are at most $O(g^{2/3})$ such symbols, and each can be processed in $O(n)$ time, for a total contribution of $O(g^{2/3} n)$. The second phase deals with the remaining "small" non-terminals; here the authors use fast polynomial multiplication (FFT-based convolution) to merge the feasible-pair sets efficiently. A careful choice of thresholds yields a total running time of $O(g^{2/3} n^{4/3})$ for the whole index construction.

The resulting index occupies linear space: for each length $\ell$ (from $1$ to $n$) the index stores the minimum and maximum number of 1's that any substring of length $\ell$ can have. Consequently, a query $(m, c)$ is answered by a single array lookup, checking whether $c$ lies between the stored minimum and maximum for length $m$. This yields $O(1)$ query time.

If one is willing to increase the space to $O(n \log n)$, the index can be augmented to keep the full set of feasible $(\ell, k)$ pairs for every non-terminal. With this richer information, after the $O(1)$ existence test the algorithm can enumerate all substrings that satisfy the query, spending $O(m)$ time per reported substring. This enumeration capability was not available in prior compressed-string solutions.
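For reference, the enumeration interface can be illustrated with a naive scan over the uncompressed string. This is not the paper's grammar-based $O(m)$-per-output procedure, only a simple $O(n)$-time sketch of what a query $(m, c)$ reports.

```python
def list_matches(B, m, c):
    """Reference enumerator (naive O(n) sliding-window scan, not the
    paper's grammar-based procedure): yield the starting position of
    every length-m substring of B containing exactly c ones."""
    count = sum(1 for bit in B[:m] if bit == '1')
    for i in range(len(B) - m + 1):
        if i > 0:                          # slide the window one step
            count += (B[i + m - 1] == '1') - (B[i - 1] == '1')
        if count == c:
            yield i
```

On $B = \texttt{1011}$ with $m = 2$ and $c = 1$, this reports the substrings starting at positions 0 and 1 (`10` and `01`).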

The authors also discuss the theoretical significance of their results. When $g = o(n)$, the $O(g^{2/3} n^{4/3})$ preprocessing bound improves over the roughly quadratic time needed to build the index naïvely on the decompressed string, and it is in line with the best known bounds for related problems such as pattern matching on grammar-compressed texts. Moreover, the method is robust: it works for any binary string, regardless of the structure of its SLP, and it degrades gracefully when the compression is weak (i.e., $g \approx n$).

Experimental evaluation (as reported in the paper) confirms the practical benefits. On synthetic and real datasets where the SLP achieves high compression ratios, the index is built substantially faster than uncompressed alternatives, while query latency remains constant. The space‑time trade‑off between the $O(n)$‑space, $O(1)$‑query version and the $O(n\log n)$‑space, $O(m)$‑per‑output version is also demonstrated empirically.

In summary, the paper introduces a novel grammar‑based framework for binary jumbled pattern matching that achieves linear‑space, constant‑time queries, and an optional enumeration mode with near‑optimal per‑output cost. The combination of convolution‑based merging, dominance pruning, and careful analysis of the SLP’s structure yields a preprocessing algorithm running in $O(g^{2/3}n^{4/3})$ time. The work opens several avenues for future research, including extensions to larger alphabets, dynamic updates of the compressed text, and integration with other compression schemes such as LZ77 or BWT‑based indexes.