Merging RLBWTs adaptively
We show how to merge two run-length compressed Burrows-Wheeler Transforms (RLBWTs) into a run-length compressed extended Burrows-Wheeler Transform (eBWT) in $O(r)$ space and $O((r + L) \log(m + n))$ time, where $m$ and $n$ are the lengths of the uncompressed strings, $r$ is the number of runs in the final eBWT and $L$ is the sum of its irreducible LCP values.
💡 Research Summary
The paper addresses the problem of merging two run‑length compressed Burrows‑Wheeler Transforms (RLBWTs) into a single run‑length compressed extended BWT (eBWT). This operation is crucial for building pan‑genomic indexes where each species or strain may be represented by its own RLBWT and the final index must integrate all of them. Prior work on merging BWTs either operated on uncompressed strings or relied on dynamic RLBWT structures that still required time proportional to the original text lengths, even when the inputs were highly repetitive.
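As a concrete picture of the objects being merged, the following minimal sketch builds the BWT of a tiny string by sorting its rotations and then run-length encodes the result. The names `bwt` and `run_length_encode` are illustrative and not taken from the paper, the `$` sentinel is assumed to be absent from the text and to sort below every other character, and the rotation sort is only practical for toy inputs.

```python
from itertools import groupby

def bwt(text):
    """BWT of text + '$', computed by sorting all rotations (toy inputs only)."""
    s = text + "$"  # assumption: '$' is not in text and sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(b):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(b)]

runs = run_length_encode(bwt("banana"))
print(runs)  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

Here the BWT `annb$aa` has 7 characters but only 5 runs; on highly repetitive collections the gap between the two counts is what makes run-length compressed merging attractive.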
The authors introduce a novel static approach that works directly on the compressed representations. The key technical contribution is the construction of “move structures” that enable iterative evaluation of the Ψ function (the inverse of the LF mapping) in O(log r) time per step while using only O(r) additional space, where r denotes the number of runs in the final eBWT. Lemma 1 proves that, with these structures, the contexts of two characters drawn from either input can be compared in O(log r) time plus a term proportional to the length of the contexts' longest common prefix.
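To make the role of Ψ concrete, here is a minimal sketch on an uncompressed BWT. It uses plain arrays that take space linear in the text, so it is not the move structure described above; it only shows that Ψ inverts LF and that iterating Ψ spells out the context (the text following a BWT character). The function names are hypothetical.

```python
def lf_and_psi(b):
    """Return the LF and Psi permutations of a BWT string b (naive arrays)."""
    # A stable sort of BWT positions by character gives Psi:
    # psi[j] is the BWT position whose character occupies row j of the
    # first column F of the sorted rotation matrix.
    psi = sorted(range(len(b)), key=lambda i: b[i])
    lf = [0] * len(b)
    for j, i in enumerate(psi):
        lf[i] = j  # LF is the inverse permutation of Psi
    return lf, psi

def context(b, row, k):
    """First k characters of the rotation in the given row, read via Psi."""
    _, psi = lf_and_psi(b)
    first = sorted(b)  # the F column
    out = []
    for _ in range(k):
        out.append(first[row])
        row = psi[row]
    return "".join(out)

b = "annb$aa"                      # BWT of "banana$"
lf, psi = lf_and_psi(b)
print(all(lf[psi[j]] == j for j in range(len(b))))  # True
print(context(b, 4, 7))            # banana$
```

Each call to `context` costs one Ψ step per character produced, which is why the cost of a comparison grows with the length of the common prefix of the two contexts being compared.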
To illustrate the methodology, the paper first presents a warm‑up algorithm for merging positional BWTs (PBWTs). By treating each column as a sequence of blocks of identical bits (0‑blocks and 1‑blocks) and copying whole blocks rather than individual bits, the PBWT merge runs in O(r + B) time, where B is the total number of blocks across all columns. This block‑wise technique naturally extends to RLBWTs because each run corresponds to a block of identical characters.
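The block-copying idea can be sketched independently of how the interleaving is decided. In the sketch below, the `plan` argument (a list of `(source, count)` pairs saying how many symbols to take next from which input) is a hypothetical stand-in for whatever the merge procedure computes; the point is only that the work is proportional to the number of runs and blocks touched, not to the number of bits.

```python
def append_run(runs, symbol, length):
    """Append a run, coalescing it with the previous run if symbols match."""
    if runs and runs[-1][0] == symbol:
        runs[-1] = (symbol, runs[-1][1] + length)
    else:
        runs.append((symbol, length))

def interleave_runs(runs_a, runs_b, plan):
    """Interleave two run-length sequences block by block.

    plan is a list of (source, count): take `count` symbols from input
    0 (runs_a) or 1 (runs_b). This format is an assumption for illustration.
    """
    sources = [runs_a, runs_b]
    pos = [0, 0]    # index of the current run in each input
    used = [0, 0]   # symbols already consumed from that run
    out = []
    for src, count in plan:
        while count > 0:
            symbol, length = sources[src][pos[src]]
            take = min(count, length - used[src])  # copy a whole chunk at once
            append_run(out, symbol, take)
            count -= take
            used[src] += take
            if used[src] == length:  # current run exhausted, move to the next
                pos[src] += 1
                used[src] = 0
    return out

a = [(0, 3), (1, 2)]   # 0 0 0 1 1
b = [(1, 1), (0, 4)]   # 1 0 0 0 0
print(interleave_runs(a, b, [(0, 2), (1, 3), (0, 3), (1, 2)]))
# [(0, 2), (1, 1), (0, 3), (1, 2), (0, 2)]
```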
The main algorithm combines the block‑wise insight with the move structures. It scans the runs of the two input RLBWTs in order, compares the contexts of the run boundaries using binary search, and decides from which input each run should be taken in the merged eBWT. When a run boundary coincides with an “irreducible” LCP value (i.e., an LCP value between two adjacent contexts whose preceding characters differ, so that it cannot be obtained from another LCP value by adding one), an extra cost proportional to that LCP value is incurred. The sum of all such irreducible LCP values is denoted by L. The authors show that L is bounded by O((m + n)·log δ), where δ ≤ r is a measure of the compressibility of the eBWT.
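For intuition about what the merge must decide, the following naive sketch merges two uncompressed BWTs character by character by comparing contexts read through Ψ. It is a quadratic-time baseline, not the run-level algorithm summarized above: it uses no move structures and no binary search. It assumes both inputs are BWTs of strings ending in the same `$` sentinel and that contexts are compared cyclically; all function names are hypothetical.

```python
def bwt(text):
    """BWT of text + '$' by sorting rotations (toy inputs only)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def psi_and_first(b):
    """Psi permutation and first column F of a BWT string b."""
    psi = sorted(range(len(b)), key=lambda i: b[i])  # stable sort by character
    return psi, [b[i] for i in psi]

def context(psi, first, row, k):
    """First k characters of the (cyclically repeated) rotation in a row."""
    out = []
    for _ in range(k):
        out.append(first[row])
        row = psi[row]
    return "".join(out)

def merge_bwts(b1, b2):
    """Naive merge: interleave b1 and b2 in the order of their contexts."""
    # len(b1) + len(b2) characters suffice to separate two distinct
    # periodic contexts; equal contexts are resolved in favour of b1.
    k = len(b1) + len(b2)
    p1, f1 = psi_and_first(b1)
    p2, f2 = psi_and_first(b2)
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):
        if context(p1, f1, i, k) <= context(p2, f2, j, k):
            out.append(b1[i])
            i += 1
        else:
            out.append(b2[j])
            j += 1
    out.extend(b1[i:])  # at most one of these two tails is non-empty
    out.extend(b2[j:])
    return "".join(out)

print(merge_bwts(bwt("ab"), bwt("b")))  # bb$a$
```

Every comparison here re-reads up to m + n characters, whereas the approach described above pays only for the common prefix at run boundaries and copies the rest of each run in one step.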
Initially the analysis yields a time bound of O((r log r + L)·log(m + n)). The authors then prove a technical lemma (Lemma 6) that eliminates the r log r term, resulting in the final bound O((r + L)·log(m + n)). The space consumption remains O(r) beyond the storage of the two input RLBWTs. Consequently, apart from the log(m + n) factor, the running time is governed by the number of runs in the output and the total irreducible LCP rather than by the raw input lengths. For highly repetitive data, r is typically much smaller than m + n; when L is comparably small, the running time is within a logarithmic factor of the compressed size.
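The bounds quoted in this summary can be lined up as follows. The last line is a back-of-the-envelope substitution of the stated bound on L into the final running time, not a statement taken from the paper.

```latex
\begin{align*}
T_{\text{initial}} &= O\bigl((r \log r + L)\,\log(m+n)\bigr), \\
T_{\text{final}}   &= O\bigl((r + L)\,\log(m+n)\bigr), \\
L = O\bigl((m+n)\log\delta\bigr),\ \delta \le r
  \;&\Longrightarrow\;
  T_{\text{final}} = O\bigl((r + (m+n)\log\delta)\,\log(m+n)\bigr).
\end{align*}
```

Read this way, the improvement over a text-length-dependent merge comes from inputs where L falls well below its worst-case bound.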
The paper notes that the technique generalizes to merging more than two RLBWTs or already compressed eBWTs by repeatedly applying the pairwise merge, preserving the same asymptotic guarantees. This makes the method suitable for constructing large pan‑genomic indexes, metagenomic classifiers, and other applications where many similar but not identical sequences must be combined into a single compressed index.
In conclusion, the authors provide a static, space‑efficient algorithm for merging run‑length compressed BWTs that improves upon prior dynamic approaches. The method achieves O(r) extra space and O((r + L)·log(m + n)) time, offering practical benefits for large‑scale genomic data sets characterized by high repetition and moderate similarity across sequences. Suggested future work includes empirical evaluation on real genomic collections and further optimization of the handling of irreducible LCP values.