Balanced Vertices in Trees and a Simpler Algorithm to Compute the Genomic Distance

Computational comparative genomics is a subdiscipline of computational biology in which the relationships between two or more genomes are studied by computational means. A highly relevant question in this field is the calculation of the minimum number of rearrangement operations (reversals, translocations, fusions and fissions) that are necessary to transform one given genome into another, the so-called genome rearrangement problem [1]. The white-grey tree cover problem studied in this paper (formally defined in Section 2) arises as a subproblem of the genome rearrangement problem, and so far only an unsatisfactory (and not self-contained) solution exists [2]. The main goal of this paper is to give a short solution of the problem and to correct some omissions and discrepancies of the original formulation. (In Section 4 we point out some cases where the original formulation fails.) Moreover, the problem gives rise to a combinatorial problem on trees, detailed in Section 3, that seems to be interesting in its own right. Since one of our main concerns here is brevity, we (usually) do not give detailed proofs of easy facts that are not essential for our main goal.

A white-grey tree is a rooted tree with (white or grey) colored and uncolored vertices. The root is uncolored, some children of the root are grey (some of them can be leaves), and all leaves that are not children of the root are white. All uncolored vertices (with the possible exception of the root) are branching points.

A system of paths in a white-grey tree is a colored cover if:
(i) Each path has colored endpoints. One vertex alone may constitute a path.
(ii) Each colored vertex is covered by at least one path.

The cost of a path $P$ is denoted by $\mathrm{cost}(P)$ and is defined as follows:
(i) $P$ is short if it has exactly one vertex. Then $\mathrm{cost}(P) = 1$.
(ii) $P$ is grey if its endpoints are grey vertices (its only other vertex is then the uncolored root). Then $\mathrm{cost}(P) = 1$.
(iii) $P$ is long otherwise. Then $\mathrm{cost}(P) = 2$.
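The three cost cases can be illustrated with a minimal Python sketch. The representation is an assumption made only for illustration and is not taken from the paper: a path is a list of vertices, and the hypothetical dictionary `color` maps each vertex to `"white"`, `"grey"`, or `None` (uncolored).

```python
def path_cost(path, color):
    """Return the cost of one path of a colored cover.

    path  -- list of vertices along the path (illustrative representation)
    color -- dict: vertex -> "white", "grey", or None for uncolored
    """
    first, last = path[0], path[-1]
    if color[first] is None or color[last] is None:
        raise ValueError("both endpoints of a path must be colored")
    if len(path) == 1:
        return 1  # short path: a single colored vertex
    if color[first] == "grey" and color[last] == "grey":
        return 1  # grey path: grey vertex - root - grey vertex
    return 2      # long path: every other case


# Toy tree (hypothetical): root r with grey children g1, g2 and an
# uncolored branching child u that has two white leaves w1, w2.
color = {"r": None, "u": None, "g1": "grey", "g2": "grey",
         "w1": "white", "w2": "white"}

print(path_cost(["w1"], color))                  # short path
print(path_cost(["g1", "r", "g2"], color))       # grey path
print(path_cost(["w1", "u", "w2"], color))       # long path
print(path_cost(["g1", "r", "u", "w1"], color))  # long path, grey and white ends
```

The sketch only classifies a single path; it does not check that a collection of paths actually covers every colored vertex.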
The cost of a path system $\mathcal{P}$ is the sum of the individual costs: $\mathrm{cost}(\mathcal{P}) := \sum_{P \in \mathcal{P}} \mathrm{cost}(P)$. A colored cover $\mathcal{P}$ is optimal for a given white-grey tree $T$ if it has minimal cost among all possible colored covers; this minimal cost is denoted by $\mathrm{cost}(T)$.

Problem 2 (White-grey tree cover problem). Given a white-grey tree $T$, compute $\mathrm{cost}(T)$.

The main result of this paper is a simple way to calculate the exact cost of an optimal cover. We are not quite ready to formalize the main result (without some further observations it would require a detailed case analysis), but we mention here a well-known fact [1]:

Lemma 3. Let $T$ be a white-grey tree with $w$ white and $g$ grey leaves. Then $w + \lceil g/2 \rceil \le \mathrm{cost}(T) \le w + \lceil g/2 \rceil + 1$.

In this section we prove a useful tool for (unrooted) trees which seems to be interesting in its own right. In a tree $T'$ denote by $P_{u,v}$ the unique path between vertices $u$ and $v$.

Theorem 4. Let $T'$ be a tree with $2n$ leaves and let $L$ denote its set of leaves. Then there exist a vertex $v \in V(T')$ and a bijection $\alpha : L \to L$ on the leaves such that the path system $\{P_{\ell,\alpha(\ell)} : \ell \in L\}$ covers each vertex of $T'$ and all paths contain $v$.

We offer two proofs. The first gives a very simple algorithm to construct such a cover, but it cannot provide all possible solutions. The second proof is based on a necessary and sufficient reformulation of the statement.

First proof. Consider an embedding of the tree into the plane and enumerate the leaves $\ell_0, \ldots, \ell_{2n-1}$ in counterclockwise order. One way to obtain such a numbering is to fix a leaf as a root and take the left-to-right depth-first traversal of the tree that conforms with the embedding. Now define the bijection by $\alpha : \ell_i \mapsto \ell_{(i+n) \bmod 2n}$. For any two such paths, their endpoints alternate along the circle that contains the leaves in increasing order. Therefore the two paths intersect each other.
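The pairing used in this construction can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the tree is given as a hypothetical adjacency-list dictionary `adj`, the order of each adjacency list plays the role of the planar embedding, and the helper names `leaves_in_dfs_order`, `antipodal_pairing` and `tree_path` are invented here.

```python
def leaves_in_dfs_order(adj, root_leaf):
    """Leaves in left-to-right depth-first order, starting at root_leaf."""
    order, seen, stack = [], {root_leaf}, [root_leaf]
    while stack:
        v = stack.pop()
        if len(adj[v]) == 1:          # degree 1: v is a leaf
            order.append(v)
        for u in reversed(adj[v]):    # reversed, so the first neighbor is popped first
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return order


def antipodal_pairing(adj, root_leaf):
    """Map the i-th leaf to the (i + n mod 2n)-th leaf of the DFS order."""
    leaves = leaves_in_dfs_order(adj, root_leaf)
    if len(leaves) % 2:
        raise ValueError("the tree must have an even number of leaves")
    n = len(leaves) // 2
    return {leaves[i]: leaves[(i + n) % (2 * n)] for i in range(2 * n)}


def tree_path(adj, u, v):
    """Vertices of the unique path from u to v in the tree."""
    parent, stack = {u: None}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path = [v]
    while path[-1] != u:
        path.append(parent[path[-1]])
    return path[::-1]


# Toy tree with four leaves a, b, c, d and inner vertices x, y.
adj = {"a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
       "x": ["a", "b", "y"], "y": ["x", "c", "d"]}
alpha = antipodal_pairing(adj, "a")
paths = [tree_path(adj, leaf, alpha[leaf]) for leaf in alpha]
common = set.intersection(*(set(p) for p in paths))  # vertices on every path
covered = set().union(*paths)                        # vertices on some path
print(alpha, common, covered == set(adj))
```

On the toy tree the leaf order is a, b, c, d, so a is paired with c and b with d; both paths run through x and y, and together they cover every vertex.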
As is well known (the first proof is due to Gyárfás and Lehel [3]), if a set of paths in a tree does not contain two disjoint paths, then all the paths share a common vertex $v$. Because these paths connect $v$ to the leaves, they cover all edges of the tree.

If $T'$ is a fully balanced tree, then no matter which embedding is used in the previous proof, two close leaves will be paired with two close leaves. Therefore there are solutions which cannot be obtained with the previous method. In the remaining part of this section we sketch a proof which is able to find all possible solutions.

Second proof. For each vertex-edge pair $(v, e)$ with $v \in V(T')$, $e \in E(T')$ and $v \in e$, denote by $\delta(v, e)$ the number of leaves $\ell$ of $T'$ such that $P_{v,\ell}$ contains the edge $e$. Furthermore, denote by $E(v)$ the set of edges that contain $v$. In the configuration required by Theorem 4, every leaf reached from $v$ through $e$ must be paired with a leaf outside that branch, so vertex $v$ clearly satisfies the inequality

$$\delta(v, e) \le n \quad \text{for all } e \in E(v).$$

Such a vertex $v \in T'$ is called a balanced vertex. (If a vertex-edge pair does not satisfy this inequality, then the pair is called oversaturated.) In fact, this property is equivalent to the existence of the required cover:

Lemma 5. Let $T'$ be a tree with $2n$ leaves, $n \ge 1$. Then for any balanced vertex $v$ there exists a bijection $\alpha$ such that the paths $P_{\ell,\alpha(\ell)}$ cover all edges, and all paths contain $v$.

The easy proof is left to the diligent reader. (One can argue, for example, by induction.)

A balanced vertex in a tree is similar to the well-known notion of the center of a tree, but while a center is usually (almost) unique, there may be several balanced vertices. The following observation completes our second proof of Theorem 4.

Lemma 6. Any tree $T'$ with an even number of leaves contains a balanced vertex.

Proof. (Sketch) Assume that a particular vertex $v$ is not balanced. Then there exists an edge $e \in E(v)$ such that the pair $(v, e)$ is oversaturated. We repeat the process with the other end of that edge.
If this vertex is again not balanced, then it provides another oversaturated pair. The finiteness of the graph finishes the proof of the lemma, and this also completes the second proof of Theorem 4.

The flexibility in the pairing algorithm clearly can provide any possible bijection $\alpha$. It is also interesting to note that one can find a suitable balanced vertex quickly:

Lemma 7. Let $T'$ be a tree with $2n$ leaves. Then there is a linear-time algorithm (linear in the number of leaves) to find a balanced vertex in $T'$.

This proof is again left to the reader; a simple dynamic programming algorithm suffices.

We are ready to determine the cost of an optimal cover for the white-grey tree $T$. We say that a path in the cover is a mixed path if it contains at least two colored vertices and exactly one of them is a grey leaf. We write $T_w$ for the subtree derived from $T$ by deleting all grey leaves and their edges, and the root if it would become a leaf. Furthermore, for a path $P$ in $T$ we write $P \upharpoonright T_w$ for the trace of $P$ on $T_w$, i.e. the restriction of $P$ to the nodes of $T_w$, with the extra condition that any uncolored vertices at the start of the truncated path are deleted. We extend this notation to the trace $\mathcal{P} \upharpoonright T_w$ of a path system $\mathcal{P}$.

Our general strategy to determine an optimal colored cover is to build it from an optimal colored cover of the subtree $T_w$. To do that we exploit certain properties, described in the following result, of optimal solutions having a minimum number of mixed paths.

Theorem 8. Every white-grey tree $T$ has an optimal colored cover $\mathcal{P}$ such that
(1) $\mathcal{P}$ contains at most 2 mixed paths,
(2) $\mathcal{P} \upharpoonright T_w$ is an optimal cover of $T_w$,
(3) for each mixed path $P \in \mathcal{P}$, $\mathrm{cost}(P) = \mathrm{cost}(P \upharpoonright T_w)$, and so $P \upharpoonright T_w$ is either a long path, or a short path consisting of a single grey leaf.

Proof. (1) Let $\mathcal{P}$ be an optimal cover with a minimum number of mixed paths.
Assume on the contrary that $\mathcal{P}$ contains three mixed paths $P_1$, $P_2$ and $P_3$, where $P_i$ is a path from the grey leaf $g_i$ to the colored vertex $c_i \in T_w$. (If two paths cover the same grey leaf, then deleting that leaf from one of the paths decreases the number of mixed paths in the cover. So we may assume that the grey leaves are pairwise distinct.) Let the path $P$ be the intersection of the paths $P_1$, $P_2$, $P_3$. Clearly $P$ is a path from the root to some $c \in T_w$. (Note that $c$ may be the root itself.) Since $c$ is the "last" point of the intersection, we can assume that the unique sub-paths $P_{c_1,c}$ and $P_{c_2,c}$ are edge-disjoint (and of course vertex-disjoint except for the vertex $c$). Then replace the paths $\{P_1, P_2, P_3\}$ in $\mathcal{P}$ with the paths $\{P_{g_1,g_2}, P_{c_1,c_2}, P_3\}$ to obtain a path cover $\mathcal{P}'$. But $\mathrm{cost}(\mathcal{P}') \le \mathrm{cost}(\mathcal{P})$ and $\mathcal{P}'$ contains fewer mixed paths than $\mathcal{P}$, a contradiction.

(2) So we have an optimal cover which contains at most two mixed paths. If its trace is not optimal, then consider the following cover $\mathcal{Q}$: cover $T_w$ optimally (this costs at least 1 less than the trace of the original cover), keep the paths from $\mathcal{P}$ which do not intersect $T_w$, and finally cover the (at most two) grey vertices that were covered by the mixed path(s) with a path whose cost is 1. Then the cost of $\mathcal{Q}$ is less than or equal to the cost of $\mathcal{P}$, and $\mathcal{Q}$ does not contain any mixed path.

(3) Assume that $2 = \mathrm{cost}(P) > \mathrm{cost}(P \upharpoonright T_w) = 1$ for a mixed path $P \in \mathcal{P}$. Then the restriction $P \upharpoonright T_w$ must be a short white path covering a vertex $u$, and $P$ is a path $P_{u,g} = (u, \text{root}, g)$ for a grey leaf $g$. Replacing the path $P$ with two short paths covering $u$ and $g$, respectively, keeps the cost of the cover but decreases the number of mixed paths.

A cover $\mathcal{P}$ is nice iff it satisfies the requirements of Theorem 8. Let $P$ be a path in $T_w$. We say that $P$ is free iff $P$ can be extended to a path $P'$ such that $P'$ contains a grey leaf while $\mathrm{cost}(P) = \mathrm{cost}(P')$ holds.
Theorem 8 implies the following statement:

Lemma 9. Assume that $T$ is a white-grey tree which has $g$ grey leaves. Then [...], where $f$ is the maximal number of free paths in a nice optimal cover of $T_w$.

Next we should solve the white-grey tree cover problem for the subtree $T_w$. Therefore we first solve the problem for trees where (essentially) all leaves are white. In what follows, we say that a leaf is short if it is adjacent to a branching vertex.

Lemma 10. Let $T'$ be a white-grey tree with $w$ colored leaves and either no grey vertex or exactly one grey leaf. Then the minimal cost of a colored cover is

$$\mathrm{cost}(T') = \begin{cases} w + 1, & \text{if } w \text{ is odd and there is no short leaf;} \\ w, & \text{otherwise.} \end{cases}$$

Proof. Since we have at most one grey leaf, we cannot use a "cheap" grey-grey path to cover it. So we can change the color of that vertex to white without changing the cost of the tree, and thus assume that all leaves are white. If the number of leaves is even, then the result is a direct consequence of Lemma 3 and Theorem 4. If the number of leaves is odd but there is a short leaf, then we cover that leaf with a short path; deleting it from the tree, we are back in the previous case. Finally, assume that $w$ is odd but $\mathrm{cost}(T') = w$. Then each leaf is covered exactly once in an optimal cover, and one of them is covered by a short path. If this leaf is not a short one, however, then its colored neighbor is not covered, a contradiction.

For simplicity we fix the following convention: in this case the constructed optimal cover contains a long path which does not cover any branching vertex. This path will be called a half-path.

Let us remark that Lemma 10 for white-only trees is certainly not new: it was proved as early as 1995 (see [1]). But the consideration of more general white-grey trees raises several problematic issues. One of them is that, in the literature known to the authors, grey vertices which are not leaves have not been studied.
However, white-grey trees are constructed in connection with the genome rearrangement problem [2], and grey vertices can appear in non-leaf positions. Another problem is that paper [2] fails to determine the exact cost of a minimal colored cover in some cases. Here we give only one of them. (The references relate to the relevant sections of that paper.) Assume that the root of $T$ has two neighbors: one is a grey leaf ($g = 1$), and the other is a branching vertex. Furthermore, assume that $w$ is odd and no white leaf is short. Then we are in the scope of Theorem 5 of [2]. Since $g$ is odd and $T_c$ is a fortress or junior fortress, we are to apply the case "otherwise" of Theorem 5. That formula gives $\mathrm{cost}(T) = w + \lceil g/2 \rceil + 1 = w + 2$, while the proper cost is only $w + \lceil g/2 \rceil = w + 1$.

Before we give our main result we introduce one more notion: when among the children of the root there is exactly one child that is not a grey leaf, the (colored) vertices between the root and the first branching point are called dangerous.

Theorem 11. Let $T$ be a white-grey tree with $g$ grey and $w$ white leaves. Let $T_w$ be derived from $T$ by deleting the grey leaves (and the root if it would become a leaf).
(1) If $T$ does not have any dangerous vertex then

$$\mathrm{cost}(T) = \begin{cases} [\ldots], & \text{if } w \text{ is odd and there is no short leaf in } T_w; \\ w + \lceil g/2 \rceil, & \text{otherwise.} \end{cases}$$
