Average-case analysis of perfect sorting by reversals (Journal Version)

Perfect sorting by reversals, a problem originating in computational genomics, is the process of sorting a signed permutation to either the identity or the reversed identity permutation by a sequence of reversals that do not break any common interval. Bérard et al. (2007) make use of strong interval trees to describe an algorithm for sorting signed permutations by reversals. Combinatorial properties of this family of trees are essential to the analysis of the algorithm. Here, we use the expected value of certain tree parameters to prove that the average run-time of the algorithm is at worst polynomial and, additionally, that for sufficiently long permutations the sorting algorithm runs in polynomial time with probability one. Furthermore, our analysis of the subclass of commuting scenarios yields precise results on the average length of a reversal and the average number of reversals.

💡 Research Summary

The paper addresses the problem of perfect sorting by reversals, a task that originates in computational genomics and consists of transforming a signed permutation into the identity (or its reverse) using only reversals that never break a common interval. This restriction models biologically plausible genome rearrangements, but it makes the problem combinatorially intricate. The authors build on the algorithm introduced by Bérard et al. (2007), which relies on the notion of a strong interval tree (SIT) to represent the hierarchical structure of the common intervals of a permutation. In a SIT each internal node corresponds to a “strong” interval, that is, a common interval that does not overlap any other common interval (two intervals overlap when they intersect and neither contains the other), while leaves correspond to individual elements. The tree encodes exactly where a reversal may be applied without violating the interval‑preserving constraint, and the algorithm proceeds by recursively fixing the smallest strong intervals from the bottom up.
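To make these definitions concrete, the following is a minimal brute-force sketch, not the algorithm of Bérard et al. It assumes that a signed permutation is stored as a Python list of non-zero integers, that intervals are pairs of 0-based inclusive positions, and that common intervals are taken with respect to the identity. All function names are hypothetical and chosen only for illustration.

```python
def reverse_signed(perm, i, j):
    """Reverse positions i..j (inclusive) and flip the sign of each element."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


def common_intervals(perm):
    """Position ranges (i, j), i < j, whose absolute values are consecutive integers.

    Quadratic brute force: a range is a common interval with the identity
    exactly when max - min of its absolute values equals j - i.
    """
    n = len(perm)
    found = []
    for i in range(n):
        lo = hi = abs(perm[i])
        for j in range(i + 1, n):
            v = abs(perm[j])
            lo, hi = min(lo, v), max(hi, v)
            if hi - lo == j - i:
                found.append((i, j))
    return found


def overlaps(a, b):
    """True if a and b intersect but neither contains the other."""
    (i, j), (k, l) = a, b
    return (i < k <= j < l) or (k < i <= l < j)


def strong_intervals(perm):
    """Common intervals that overlap no other common interval."""
    cis = common_intervals(perm)
    return [c for c in cis if not any(overlaps(c, d) for d in cis)]


def is_perfect_reversal(perm, i, j):
    """A reversal of positions i..j breaks no common interval if it overlaps none."""
    return not any(overlaps((i, j), c) for c in common_intervals(perm))
```

For the signed permutation `[1, -3, 2, 4]`, reversing positions 1..2 overlaps no common interval, whereas reversing positions 0..1 overlaps the common interval at positions 1..2 and is therefore not allowed in a perfect scenario.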

The core contribution of the paper is an average‑case analysis of this algorithm. By assuming a uniform distribution over signed permutations of length n, the authors derive the expected values of several key tree parameters: the number of internal nodes, the height of the tree, and the distribution of commuting sub‑trees (sub‑structures where reversals commute). Using generating‑function techniques and classic results from analytic combinatorics, they show that the expected number of internal nodes is Θ(n) and that the height grows only logarithmically, O(log n). Consequently, the number of recursive calls is linear up to a logarithmic factor, and the total work done at each level is bounded by the size of the interval being reversed.
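The uniform model can be explored empirically. The sketch below is a Monte Carlo illustration under stated assumptions, not the generating-function analysis of the paper: it draws signed permutations uniformly at random (a uniform shuffle followed by independent fair sign flips) and averages one simple statistic, the number of common intervals of size at least two. The function names, the choice of statistic, and the default seed are all hypothetical.

```python
import random


def random_signed_permutation(n, rng):
    """Draw a signed permutation of 1..n uniformly among all 2^n * n! of them."""
    values = list(range(1, n + 1))
    rng.shuffle(values)
    return [v if rng.random() < 0.5 else -v for v in values]


def count_common_intervals(perm):
    """Count position ranges of size >= 2 whose absolute values are consecutive."""
    n = len(perm)
    count = 0
    for i in range(n):
        lo = hi = abs(perm[i])
        for j in range(i + 1, n):
            v = abs(perm[j])
            lo, hi = min(lo, v), max(hi, v)
            if hi - lo == j - i:
                count += 1
    return count


def estimate_mean(n, trials, seed=0):
    """Empirical mean of the statistic over `trials` uniform samples."""
    rng = random.Random(seed)
    total = sum(
        count_common_intervals(random_signed_permutation(n, rng))
        for _ in range(trials)
    )
    return total / trials
```

Calling `estimate_mean(n, trials)` for increasing `n` gives a rough empirical picture of how the statistic behaves; it is no substitute for the asymptotic estimates derived analytically.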

A careful cost model is introduced: a reversal of an interval of length k costs O(k) time. Because the expected length of intervals selected by the algorithm is Θ(√n) (a consequence of the tree’s balanced‑like shape in expectation), the overall expected running time T(n) is bounded by O(n·√n) and can be tightened to O(n log n) with more refined analysis. Thus the average running time is polynomial, in contrast to the worst case: perfect sorting by reversals is NP‑hard in general, whereas unrestricted sorting of signed permutations by reversals can be solved in polynomial time.
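The cost model above can be illustrated with a short sketch. It assumes a scenario is given as a list of 0-based inclusive position pairs and charges each reversal its length, so the total cost is the sum of the lengths. The helper names are hypothetical, and the code only evaluates a given scenario; it does not search for one.

```python
def reverse_signed(perm, i, j):
    """Reverse positions i..j (inclusive) and flip signs; costs O(j - i + 1)."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


def apply_scenario(perm, reversals):
    """Apply the reversals in order and return the resulting permutation."""
    for i, j in reversals:
        perm = reverse_signed(perm, i, j)
    return perm


def scenario_cost(reversals):
    """Total cost under the model: a reversal of length k costs k units."""
    return sum(j - i + 1 for i, j in reversals)


def average_reversal_length(reversals):
    """Mean reversal length of a scenario (0.0 for the empty scenario)."""
    return scenario_cost(reversals) / len(reversals) if reversals else 0.0


# A two-step scenario sorting [1, -3, 2, 4] to the identity.
scenario = [(1, 2), (1, 1)]
sorted_perm = apply_scenario([1, -3, 2, 4], scenario)
```

Here the scenario has total cost 3 and average reversal length 1.5, which are the two quantities the analysis tracks in expectation.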

Beyond the polynomial bound, the authors prove a probabilistic “almost‑sure” result for large n. By applying concentration inequalities (Chernoff‑type bounds) and the law of large numbers to the distribution of tree parameters, they demonstrate that as n → ∞ the shape of the strong interval tree converges to a deterministic limit shape with high probability. In this regime the algorithm’s running time is bounded by a fixed polynomial with probability tending to one. Because the statement is asymptotic, it supports, rather than guarantees, the expectation that the algorithm runs efficiently on the long permutations that arise from large genomes.

The paper also isolates a special subclass of instances called commuting scenarios. In these cases the SIT consists entirely of independent sub‑trees, meaning that any order of the corresponding reversals yields the same final permutation. This property simplifies the analysis dramatically. The authors compute exact asymptotics for two quantities of interest: the average reversal length L_n and the average number of reversals R_n required to sort a random commuting permutation. Both quantities turn out to be Θ(√n). These precise estimates match empirical observations on simulated genomic data, suggesting that commuting scenarios capture a significant portion of realistic rearrangement histories.
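The order-independence property can be checked by brute force on a small example. The sketch below assumes that each reversal is identified by the set of elements (absolute values) it acts on rather than by positions, and that two reversals commute when their element sets are disjoint or nested. These conventions and all names are assumptions made for illustration.

```python
from itertools import permutations


def reverse_elements(perm, elems):
    """Reverse the contiguous segment holding exactly the elements in `elems`."""
    pos = [k for k, x in enumerate(perm) if abs(x) in elems]
    if not pos or pos != list(range(pos[0], pos[-1] + 1)) or len(pos) != len(elems):
        raise ValueError("elements do not form a contiguous segment")
    i, j = pos[0], pos[-1]
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


def is_commuting(scenario):
    """True if every pair of element sets is disjoint or nested."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for k, a in enumerate(scenario)
        for b in scenario[k + 1:]
    )


def outcomes_over_all_orders(perm, scenario):
    """Set of final permutations reached over every ordering of the scenario."""
    results = set()
    for order in permutations(scenario):
        current = perm
        for elems in order:
            current = reverse_elements(current, elems)
        results.add(tuple(current))
    return results


scenario = [frozenset({2, 3}), frozenset({2}), frozenset({4})]
final = outcomes_over_all_orders([1, -3, 2, -4], scenario)
```

For this example all six orderings reach the identity, so `final` contains a single permutation. Identifying reversals by element sets matters here: with fixed positions, a reversal nested off-center inside another would act on different elements depending on the order of application.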

In summary, the work makes three major advances. First, it introduces a rigorous average‑case framework for perfect sorting by reversals based on strong interval trees. Second, it proves that the algorithm runs in polynomial time on average and with probability one for sufficiently long permutations. Third, it provides exact asymptotic formulas for reversal length and count in the commuting subclass, bridging the gap between combinatorial theory and practical genomics. The results not only deepen our understanding of the combinatorial structure underlying genome rearrangements but also give confidence that the strong‑interval‑tree‑based algorithm is practically feasible for large‑scale comparative genomics studies.