On Bijective Variants of the Burrows-Wheeler Transform
The sort transform (ST) is a modification of the Burrows-Wheeler transform (BWT). Both transformations map an arbitrary word of length n to a pair consisting of a word of length n and an index between 1 and n. The BWT sorts all rotation conjugates of the input word, whereas the ST of order k only uses the first k letters for sorting all such conjugates. If two conjugates start with the same prefix of length k, then the indices of the rotations are used for tie-breaking. Both transforms output the sequence of the last letters of the sorted list and the index of the input within the sorted list. In this paper, we discuss a bijective variant of the BWT (due to Scott), proving its correctness and relations to other results due to Gessel and Reutenauer (1993) and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel bijective variant of the ST.
💡 Research Summary
The paper investigates bijective (one‑to‑one) variants of the Burrows‑Wheeler Transform (BWT) and its generalisation, the Sort Transform (ST). The classic BWT maps a word w of length n to a pair (L, i), where L is the last column of the matrix formed by lexicographically sorting all cyclic rotations of w and i is the row index of the original word. Because the index i (or an explicit end‑of‑string symbol) is needed to invert the transform, the classic BWT is not a bijection on words of length n: the index costs about log n extra bits, which can be undesirable in compression or cryptographic applications.
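The classic transform described above can be sketched in a few lines of Python. This is a naive illustration that sorts the rotations directly, not the construction used in the paper; the function names `bwt` and `inverse_bwt` are chosen here for illustration, and the index is 1-based to match the abstract.

```python
def bwt(w):
    """Classic BWT: return (last column, 1-based row index of w)."""
    n = len(w)
    # Sort rotation start positions by the rotation they denote.
    # sorted() is stable, so equal rotations keep their index order.
    starts = sorted(range(n), key=lambda s: w[s:] + w[:s])
    last = "".join(w[(s - 1) % n] for s in starts)
    return last, starts.index(0) + 1


def inverse_bwt(last, index):
    """Invert the classic BWT using the standard permutation of `last`."""
    n = len(last)
    # pi[i] is the position in `last` of the i-th letter of sorted(last);
    # the stable sort keeps equal letters in left-to-right order.
    pi = sorted(range(n), key=lambda j: last[j])
    out = []
    j = index - 1
    for _ in range(n):
        j = pi[j]
        out.append(last[j])
    return "".join(out)


print(bwt("banana"))             # ('nnbaaa', 4)
print(inverse_bwt("nnbaaa", 4))  # banana
```

The inverse never rebuilds the rotation matrix: it only follows the standard permutation n times, starting from the row given by the index.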
The authors revisit the bijective BWT originally proposed by Scott (2007). They provide a clearer description based on Lyndon factorisation: any word w can be uniquely written as a non‑increasing concatenation of Lyndon words, w = v₁v₂…v_s with v₁ ≥ v₂ ≥ … ≥ v_s. Each Lyndon factor is primitive and is the unique minimal rotation of its conjugacy class, so it serves as a canonical representative of that class. The rotations of all Lyndon factors are then sorted, comparing two rotations by their infinite periodic repetitions, which gives a list M(w) with one entry per rotation (n entries in total). The last column L = BWT(w), together with the standard permutation π_L (defined by sorting the letters of L while preserving the left‑to‑right order of equal symbols), is sufficient to reconstruct w without any external index. Lemma 3 shows that, for any k, the k‑order context of the i‑th rotation can be expressed as a product of letters taken from L according to successive applications of π_L. Consequently, Corollary 4 proves that, for the classic transform, the pair (L, i) uniquely determines w; the inverse is computed by following π_L from row i. In the bijective variant, the cycles of π_L take over the role of the index.
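As a rough sketch of the forward direction, the following Python code computes the Lyndon factorisation with Duval's algorithm and then sorts the rotations of all factors by comparing their infinite repetitions. The names `lyndon_factors` and `bijective_bwt` are illustrative, and the quadratic comparison is an assumption made for brevity rather than the paper's procedure.

```python
from functools import cmp_to_key


def lyndon_factors(w):
    """Duval's algorithm: factor w into non-increasing Lyndon words."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            # Restart the comparison on a strictly larger letter,
            # otherwise keep matching the current period.
            k = i if w[k] < w[j] else k + 1
            j += 1
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def omega_cmp(u, v):
    """Compare the infinite words uuu... and vvv...

    Comparing u repeated len(v) times with v repeated len(u) times
    is enough, because both have the same length len(u) * len(v).
    """
    a, b = u * len(v), v * len(u)
    return (a > b) - (a < b)


def bijective_bwt(w):
    """Last letters of the sorted rotations of all Lyndon factors of w."""
    rotations = [f[s:] + f[:s]
                 for f in lyndon_factors(w)
                 for s in range(len(f))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


print(lyndon_factors("banana"))  # ['b', 'an', 'an', 'a']
print(bijective_bwt("banana"))   # annbaa
```

The output has the same length as the input and carries no index. For "banana" the sorted rotations are a, an, an, b, na, na, whose last letters spell "annbaa".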
The paper then turns to the Sort Transform (ST), a modification of the BWT in which rotations are sorted only by their first k symbols (the “k‑order context”). If two rotations share the same k‑prefix, their original rotation indices are used as a tie‑breaker. Like the classic BWT, the classic ST still needs an index or a special terminator. The authors construct a bijective ST by extending the same standard permutation technique to the k‑order context. They define L_k as the last column of the k‑sorted matrix and π_{L_k} as the permutation induced by a stable sort of the first letters of the right‑shifted rotations. Lemma 3 (generalised) guarantees that the k‑order context of each rotation can be recovered from L_k and π_{L_k}. Therefore, the transformed word L_k alone determines the input: the map is a bijection on words of length n and needs no auxiliary index.
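The classic, index-based ST of order k is simple to sketch in Python, because a stable sort on the k-prefix already breaks ties by rotation index. The function name `sort_transform` is illustrative, and the sketch covers only the classic ST, not the bijective variant introduced in the paper.

```python
def sort_transform(w, k):
    """Classic ST of order k: return (last column, 1-based index of w)."""
    n = len(w)
    k = min(k, n)      # contexts longer than n add no information
    doubled = w + w    # lets us slice cyclic k-prefixes directly
    # Stable sort on the k-letter context; rotations with equal
    # contexts stay in order of their rotation index.
    starts = sorted(range(n), key=lambda s: doubled[s:s + k])
    last = "".join(w[(s - 1) % n] for s in starts)
    return last, starts.index(0) + 1


print(sort_transform("banana", 1))  # ('bnnaaa', 4)
print(sort_transform("banana", 2))  # ('nbnaaa', 4)
print(sort_transform("banana", 6))  # ('nnbaaa', 4)
```

With k equal to the word length the contexts are the full rotations, so the output for "banana" coincides with the classic BWT.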
A substantial portion of the work connects these bijective transforms to earlier combinatorial results. The authors cite Gessel and Reutenauer (1993), who gave a bijection between words and multisets of conjugacy classes of primitive words, and Crochemore, Désarménien, and Perrin (2005), who observed that the BWT is a special case of that bijection. By framing the bijective BWT and ST in terms of Lyndon factorisation and standard permutations, the paper shows that these transforms naturally inherit the algebraic properties described in those works.
Algorithmically, the forward bijective BWT and ST are straightforward: compute the Lyndon factorisation (linear time via Duval’s algorithm), generate all rotations of each factor, perform a stable lexical sort based on the appropriate context length (full word for BWT, first k symbols for ST), and output the last column. The inverse procedures rely on constructing the standard permutation from the last column and iteratively applying it to rebuild the original word. The authors analyse the computational complexity, noting that the dominant cost is the stable sort (O(n log n) in the general case, O(n) for integer alphabets using radix sort). Memory usage remains linear.
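The inverse of the bijective BWT can be sketched along the same lines: build the standard permutation of the transformed word, read one Lyndon factor off each cycle, and concatenate the factors in reverse order of their smallest cycle element. This is a minimal Python illustration under those assumptions; the name `inverse_bijective_bwt` is hypothetical, and Python's built-in stable sort stands in for a radix sort.

```python
def inverse_bijective_bwt(last):
    """Rebuild the original word from its bijective BWT."""
    n = len(last)
    # Standard permutation: pi[i] is the position in `last` of the
    # i-th letter of sorted(last), equal letters kept left to right.
    pi = sorted(range(n), key=lambda j: last[j])
    seen = [False] * n
    factors = []
    for start in range(n):
        if seen[start]:
            continue
        # Walk the cycle containing `start`; the letters met on the
        # way spell the Lyndon factor stored in row `start`.
        letters = []
        j = start
        while not seen[j]:
            seen[j] = True
            j = pi[j]
            letters.append(last[j])
        factors.append("".join(letters))
    # Cycles are found in increasing order of their smallest row,
    # i.e. smallest factor first; the word lists them largest first.
    return "".join(reversed(factors))


print(inverse_bijective_bwt("annbaa"))  # banana
```

Every input word is a valid transform here, which is what makes the map a bijection: for "annbaa" the cycles yield the factors a, an, an, b, and reversing them gives "banana".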
In the concluding section, the authors argue that bijective variants save the O(log n) bits otherwise needed for an index, improve security in cryptographic settings (since no explicit marker of the original position exists), and provide a more natural theoretical framework for understanding BWT‑based compression. They suggest future research directions, including the design of compression schemes that directly exploit the bijective transforms, optimisation for different alphabet orders, and adaptation to streaming or real‑time environments.
Overall, the paper delivers a rigorous treatment of bijective BWT and ST, supplies constructive proofs of correctness, links the constructions to classical combinatorial theory, and offers practical algorithms that could replace the traditional index‑based transforms in applications where bijectivity is advantageous.