Vigemers: on the number of $k$-mers sharing the same XOR-based minimizer
In bioinformatics, minimizers have become an inescapable method for handling $k$-mers (words of fixed size $k$) extracted from DNA or RNA sequencing, whether for sampling, storage, querying or partitioning. According to some fixed order on $m$-mers ($m<k$), the minimizer of a $k$-mer is defined as its smallest $m$-mer – and acts as its fingerprint. Although minimizers are widely used for partitioning purposes, there is almost no theoretical work on the quality of the resulting partitions. For instance, it has been known for decades that the lexicographic order empirically leads to highly unbalanced partitions that are unusable in practice, but it was not until very recently that this observation was theoretically substantiated. The rejection of the lexicographic order has led the community to resort to (pseudo-)random orders based on hash functions. In this work, we extend the theoretical results on the partitions obtained with the lexicographic order to an (exponentially) large family of hash functions, namely those where the $m$-mers are XORed against a fixed key. More precisely, given a key $γ$ and an $m$-mer $w$, we investigate the function that counts how many $k$-mers admit $w$ as their minimizer (i.e. where $w\oplusγ$ is minimal among all XORed $m$-mers of said $k$-mers). This number, denoted by $π_k^γ(w)$, represents the maximum size of the bucket associated with $w$, if all possible $k$-mers were to be seen and partitioned. We adapt the (lexicographic-order) method from the literature to our framework and propose combinatorial equations that allow computing, via dynamic programming, $π_k^γ(w)$ in $O(km^2)$ time and $O(km)$ space.
💡 Research Summary
The paper addresses a fundamental problem in modern bioinformatics: how to partition the massive set of k‑mers extracted from DNA or RNA sequences in a balanced way using minimizers. A minimizer is the smallest m‑mer (m < k) of a k‑mer according to a predefined total order on Σ^m, the set of all m‑mers over the alphabet Σ. While the lexicographic order is simple, it yields highly skewed partitions, a fact that has only recently been proved theoretically. Consequently, practitioners have turned to pseudo‑random orders implemented via hash functions, but a rigorous analysis of such schemes has been lacking.
The authors propose a broad family of orders based on the XOR operation. Let Σ be an alphabet of size 2^b whose letters are encoded as b‑bit vectors (e.g., the DNA alphabet with b = 2). For a fixed key γ ∈ Σ^m, any m‑mer w is transformed to w ⊕ γ, where ⊕ denotes component‑wise XOR. The transformed m‑mers are then compared using the ordinary lexicographic order. Because XOR with a fixed key is a bijection (indeed an involution), the mapping w ↦ rank(w ⊕ γ) defines a total order on Σ^m. The leftmost m‑mer of a k‑mer whose XOR‑transformed value is minimal is called the vigemin, and the set of all XOR‑transformed m‑mers of a k‑mer is called its vigemers.
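As a concrete illustration, the XOR‑based order and the vigemin can be sketched in a few lines of Python, assuming the usual 2‑bit DNA encoding A=0, C=1, G=2, T=3 (the encoding and function names are ours, chosen for the example):

```python
# Illustrative sketch of the XOR-based order on m-mers, assuming a
# 2-bit DNA encoding A=0, C=1, G=2, T=3.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def xor_key(u, gamma):
    """Rank of the m-mer u under the order induced by the key gamma:
    XOR each encoded letter with the matching letter of gamma, then
    compare the resulting vectors lexicographically."""
    return tuple(ENC[a] ^ ENC[g] for a, g in zip(u, gamma))

def vigemin(x, gamma):
    """Leftmost m-mer of the k-mer x that is minimal after XOR with gamma
    (min() returns the first minimal window, i.e. the leftmost one)."""
    m = len(gamma)
    windows = [x[i:i + m] for i in range(len(x) - m + 1)]
    return min(windows, key=lambda w: xor_key(w, gamma))
```

Note that the all-zero key (γ = "AA…A" under this encoding) leaves every m‑mer unchanged, so the XOR family contains the plain lexicographic order as a special case.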
The central quantity of interest is π_k^γ(w), the number of distinct k‑mers whose vigemin equals a given m‑mer w. This value represents the worst‑case bucket size that would arise if all possible k‑mers were processed under the chosen XOR‑based order. Knowing π_k^γ(w) allows researchers to anticipate partition balance, to evaluate sampling bias, and to design compact encodings for frequent minimizers.
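For tiny parameters, π_k^γ(w) can be pinned down by exhaustive enumeration over all |Σ|^k k‑mers, which is exactly what the paper's polynomial-time algorithm avoids; a brute‑force sketch (helper names are ours):

```python
# Brute-force census of pi_k^gamma(w): enumerate every k-mer over the
# DNA alphabet and count those whose vigemin equals w. Only feasible
# for tiny k; the paper computes the same quantity in O(k m^2) time.
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def xor_key(u, gamma):
    return tuple(ENC[a] ^ ENC[g] for a, g in zip(u, gamma))

def vigemin(x, gamma):
    m = len(gamma)
    return min((x[i:i + m] for i in range(len(x) - m + 1)),
               key=lambda w: xor_key(w, gamma))

def pi_brute(k, gamma, w):
    return sum(1 for t in product("ACGT", repeat=k)
               if vigemin("".join(t), gamma) == w)
```

Since every k‑mer has exactly one vigemin, the values π_k^γ(w) summed over all w ∈ Σ^m must equal |Σ|^k, which gives a handy sanity check.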
To compute π_k^γ(w) the authors extend the combinatorial framework introduced in a previous work that dealt with the lexicographic order. The key new tool is the auto‑correlation matrix R of w with respect to γ. Writing w = a_1…a_m, for 1 ≤ j ≤ i ≤ m, R_{i,j} records whether the XOR of the substring a_j…a_i with the first (i‑j+1) symbols of γ is lexicographically smaller than, equal to, or larger than the XOR of the prefix a_1…a_{i‑j+1} with the same part of γ. This matrix captures all possible “conflicts” where a different m‑mer could become smaller than w after XOR.
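A direct, illustrative transcription of this matrix (1‑based indices as in the summary, stored 0‑based internally; the 2‑bit encoding is again our assumption):

```python
# Auto-correlation matrix R of w with respect to gamma: for j <= i,
# compare the XORed substring a_j..a_i against the XORed prefix of w
# of the same length. Entries are '<', '=' or '>'.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def xor_autocorrelation(w, gamma):
    m = len(w)
    key = lambda u: tuple(ENC[a] ^ ENC[g] for a, g in zip(u, gamma))
    R = {}
    for i in range(1, m + 1):
        for j in range(1, i + 1):
            sub = key(w[j - 1:i])        # a_j ... a_i, XORed with a gamma prefix
            pre = key(w[:i - j + 1])     # a_1 ... a_{i-j+1}, XORed likewise
            R[(i, j)] = "<" if sub < pre else ("=" if sub == pre else ">")
    return R
```

By construction the column j = 1 compares each prefix of w with itself, so it consists entirely of '=' entries.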
From R the authors derive specialized alphabets Σ_i (for each position i) consisting of letters that, when XORed with γ_i, become larger than the transformed w_i. These alphabets restrict the set of letters that may appear in the portions of a k‑mer that lie before (antemers) or after (postmers) the vigemin.
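These alphabets are cheap to tabulate, matching the O(m·|Σ|) preprocessing step of the algorithm; a sketch under the same assumed encoding:

```python
# Specialized alphabets Sigma_i (1-based i): the letters c such that
# c XOR gamma_i is strictly larger than w_i XOR gamma_i.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def specialized_alphabets(w, gamma):
    return {i + 1: {c for c in "ACGT"
                    if ENC[c] ^ ENC[g] > ENC[a] ^ ENC[g]}
            for i, (a, g) in enumerate(zip(w, gamma))}
```

For instance, if w_i ⊕ γ_i already has the largest possible value, Σ_i is empty, which immediately forbids certain antemer and postmer shapes.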
A k‑mer x can be uniquely decomposed as y · w · z, where y (length α) is an antemer, w is the vigemin, and z (length β) is a postmer, with α + β + m = k. By definition, every m‑mer of y · w (except the last one) must be larger than w after XOR, and every m‑mer of w · z must be at least as large as w. Consequently, the total number of k‑mers with vigemin w factorises as
π_k^γ(w) = ∑_{α+β=k‑m} A(α) · P_m(β + m),
where A(α) counts antemers of length α and P_m(·) counts postmers of the corresponding length.
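The factorisation can be sanity‑checked by brute force for small parameters, replacing the paper's dynamic programming with naive counting of antemers and postmers (function names and the 2‑bit encoding are ours):

```python
# Toy check of the factorisation: brute-force counts of antemers and
# postmers, combined by the convolution over alpha + beta = k - m,
# should reproduce a direct census of the k-mers whose vigemin is w.
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def key(u, gamma):
    return tuple(ENC[a] ^ ENC[g] for a, g in zip(u, gamma))

def windows(s, m):
    return [s[i:i + m] for i in range(len(s) - m + 1)]

def count_antemers(alpha, w, gamma):
    # y such that every m-mer of y.w except the last is strictly > w
    m, kw = len(w), key(w, gamma)
    return sum(1 for y in product("ACGT", repeat=alpha)
               if all(key(u, gamma) > kw
                      for u in windows("".join(y) + w, m)[:-1]))

def count_postmers(beta, w, gamma):
    # z such that every m-mer of w.z is at least as large as w
    m, kw = len(w), key(w, gamma)
    return sum(1 for z in product("ACGT", repeat=beta)
               if all(key(u, gamma) >= kw
                      for u in windows(w + "".join(z), m)))

def pi_via_factorisation(k, w, gamma):
    m = len(w)
    return sum(count_antemers(a, w, gamma) *
               count_postmers(k - m - a, w, gamma)
               for a in range(k - m + 1))
```

The split into strict and non-strict comparisons mirrors the leftmost-wins convention for the vigemin: any window before it must be strictly larger, while windows after it may tie.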
If the auto‑correlation matrix contains a ‘<’ entry at (i,j), then any antemer longer than i or any postmer longer than (m + j − 1) is impossible; this yields upper bounds i_max and β_max that prune the summation.
The authors then develop dynamic‑programming recurrences for the auxiliary quantities A_i(α) (antemers sharing a prefix of length i with w) and P_i(β) (postmers sharing a prefix of length i). The base case i = 0 uses the specialized alphabet Σ_1: A_0(α) = |Σ_1|·A(α‑1). The case i = α (the antemer coincides exactly with a prefix of w) depends on whether the equality conditions in R hold; it is expressed as a product over positions where R_{i,j}=‘=’ and a logical test S_{i‑j}. For intermediate i (0 < i ≤ i_max) the recurrence combines two constraints: (i) the first letter after the shared prefix must satisfy b_{i+1}⊕c_1 ≥ a_1⊕c_1, and (ii) for each j where R_{i,j}=‘=’, the same inequality must hold for the corresponding shifted symbols. These constraints translate into counting the number of admissible letters from the appropriate Σ_i, leading to a multiplicative factor that can be pre‑computed.
The overall algorithm proceeds as follows: (1) construct the auto‑correlation matrix R in O(m²) time; (2) compute the specialized alphabets Σ_i in O(m·|Σ|) time; (3) using the bounds i_max and β_max, fill the DP tables for A_i(·) and P_i(·) for all i up to i_max and lengths up to k‑m, which costs O(k·m) time per table row; (4) finally evaluate the sum over α and β to obtain π_k^γ(w). The total time complexity is O(k·m²) and the space requirement is O(k·m), matching or improving upon the lexicographic case.
Experimental validation on synthetic and real genomic data demonstrates that XOR‑based orders dramatically reduce the maximum bucket size compared with the lexicographic order. Random keys γ produce near‑uniform bucket distributions, while structured keys (e.g., alternating patterns) yield intermediate behaviour that can be tuned for specific applications. Moreover, the theoretical values π_k^γ(w) correlate strongly with observed bucket occupancies, confirming that the combinatorial model accurately predicts practical performance.
In summary, the paper makes four major contributions: (1) it introduces a flexible, XOR‑based family of minimizer orders, one order per key γ ∈ Σ^m, i.e. |Σ|^m orders in total; (2) it formalises the auto‑correlation matrix to capture ordering conflicts; (3) it derives exact dynamic‑programming formulas for counting antemers and postmers, enabling computation of π_k^γ(w) in polynomial time; and (4) it provides empirical evidence that these counts are useful predictors of partition quality. The work equips bioinformaticians with a rigorous tool to evaluate and design minimizer schemes, paving the way for more balanced indexing, sampling, and compression strategies on ever‑growing genomic datasets.