We establish a limit formula for the median of the distance between two leaves in a fully resolved unrooted phylogenetic tree with n leaves. More precisely, we prove that this median is equal, in the limit, to the square root of 4*ln(2)*n.
The definition and study of metrics for the comparison of phylogenetic trees is a classical problem in phylogenetics [1,Ch. 30], motivated, among other applications, by the need to compare alternative phylogenies for a given set of organisms obtained from different datasets or using different methods. Many metrics for the comparison of rooted or unrooted phylogenetic trees on the same set of taxa have been proposed so far. Some of the most popular such metrics are based on the comparison of the vectors of distances between pairs of taxa in the corresponding trees. But, in contrast with other metrics, the statistical properties of these metrics are mostly unknown.
Steel and Penny [3] computed the mean value of the square of the metric for fully resolved unrooted trees defined through the Euclidean distance between their vectors of distances (they called it the path difference metric). One of the main ingredients in their work was the explicit computation of the mean value and the variance of the distance $d$ between two leaves in a fully resolved unrooted phylogenetic tree with $n$ leaves, obtaining that
$$\mu(d) = \frac{2^{2(n-2)}}{\binom{2(n-2)}{n-2}} \sim \sqrt{\pi n}, \qquad \sigma^2(d) = 4n - 6 - \mu(d) - \mu(d)^2.$$
In this work we continue the statistical analysis of this random variable d, by giving an expression for its median that allows the derivation of a limit formula for it. We hope our result will constitute a first step towards obtaining a formula for the median of the aforementioned squared path difference metric between fully resolved unrooted phylogenetic trees, a problem that still remains open.
In this paper, by a phylogenetic tree on a set $S$ we mean a fully resolved (that is, with all its internal nodes of degree 3) unrooted tree with its leaves bijectively labeled by the elements of $S$. Although in practice $S$ may be any set of taxa, to fix ideas we shall always take $S = \{1, \ldots, n\}$, with $n$ the number of leaves of the tree, and we shall use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on this set. For simplicity, we shall always identify a leaf of a phylogenetic tree with its label.
Let $\mathcal{T}_n$ be the set of (isomorphism classes of) phylogenetic trees with $n$ leaves. It is well known [1] that
$$|\mathcal{T}_n| = (2n-5)!! = (2n-5)(2n-7)\cdots 3 \cdot 1.$$
Let $k, l \in S = \{1, \ldots, n\}$ be any two different labels of trees in $\mathcal{T}_n$. The distance $d_T(k, l)$ between the leaves $k$ and $l$ in a phylogenetic tree $T \in \mathcal{T}_n$ is the length (number of edges) of the unique path between them. Consider the random variable $d_{kl}$, the distance between the leaves labeled $k$ and $l$ in a tree in $\mathcal{T}_n$.
The possible values of $d_{kl}$ are $1, 2, \ldots, n-1$.
Our goal is to estimate the value $\mathrm{median}(n)$ of the median of this variable $d_{kl}$ on $\mathcal{T}_n$ when the tree and the leaves are chosen equiprobably. In this case, $d_{kl}$ has the same distribution as $d_{12}$, and thus we can reduce our problem to computing the median of the variable $d := d_{12}$.
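As a small worked illustration (added here, not part of the original argument), take $n = 4$. There are three phylogenetic trees with 4 leaves, one for each way of pairing the leaves into two cherries, and the distance between leaves 1 and 2, counted in edges, can be read off directly:

```latex
\begin{array}{c|c}
\text{tree (pairing of leaves)} & d_T(1,2) \\ \hline
\{1,2\} \mid \{3,4\} & 2 \\
\{1,3\} \mid \{2,4\} & 3 \\
\{1,4\} \mid \{2,3\} & 3
\end{array}
\qquad\Longrightarrow\qquad
E[d] = \frac{2+3+3}{3} = \frac{8}{3}
```

So for $n = 4$ the variable $d$ takes the value 2 in one tree and the value 3 in two trees.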
For every $i = 1, \ldots, n-1$, let $c_i$ be the cardinality of $\{T \in \mathcal{T}_n \mid d_T(1, 2) = i\}$. Arguing as in [3, p. 140], we have the following result.
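Lemma 1. $c_{n-1} = (n-2)!$ and, for every $i = 1, \ldots, n-2$,
$$c_i = \frac{(i-1)\,(2n-i-4)!}{(2(n-i-1))!!} = \frac{(i-1)\,(2n-i-4)!}{2^{n-i-1}\,(n-i-1)!}.$$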
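Proof. Consider the function $B(x) = 1 - \sqrt{1-2x}$. By [3, p. 140], we have that
$$c_i = (n-2)!\,[x^{n-2}]\,B(x)^{i-1},$$
where $[x^m]$ denotes the coefficient of $x^m$. Since, by Lagrange inversion,
$$[x^m]\,B(x)^j = \frac{j}{m}\binom{2m-j-1}{m-j}\frac{1}{2^{m-j}}, \qquad 1 \le j \le m,$$
we obtain the formulas in the statement. □

Lemma 2. For every $k = 1, \ldots, n-1$,
$$\sum_{i=1}^{k} c_i = (2n-5)!! - \frac{(2n-k-4)!}{2^{n-k-2}\,(n-k-2)!},$$
where the subtracted term is taken to be 0 for $k = n-1$.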
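Proof. Taking into account that $(2j)!! = 2^j j!$ and $(2j+1)!! = \frac{(2j+1)!}{2^j j!}$ for every $j \in \mathbb{N}$, and using Lemma 1, we have, for $k \le n-2$:
$$\sum_{i=1}^{k} c_i = \frac{1}{2^{n-1}} \sum_{i=1}^{k} \frac{(i-1)\,(2n-i-4)!\,2^i}{(n-i-1)!}.$$
We use now the method in [2, Chap. 5] to compute
$$S_k = \sum_{i=1}^{k} t_i, \qquad t_i = \frac{(i-1)\,(2n-i-4)!\,2^i}{(n-i-1)!}.$$
The ratio of consecutive terms is
$$\frac{t_{i+1}}{t_i} = \frac{2i\,(n-i-1)}{(i-1)\,(2n-i-4)}.$$
The next step is to find three polynomials $a(i)$, $b(i)$ and $c(i)$ such that
$$\frac{t_{i+1}}{t_i} = \frac{a(i)}{b(i)} \cdot \frac{c(i+1)}{c(i)}.$$
We take $a(i) = -2(n-i-1)$, $b(i) = -(2n-i-4)$ and $c(i) = i-1$, and look for a polynomial $x(i)$ such that
$$a(i)\,x(i+1) - b(i-1)\,x(i) = c(i).$$
The polynomial $x(i) = 1$ satisfies this equation. Then, by [2, Chap. 5],
$$S_k = \frac{b(k)\,x(k+1)}{c(k+1)}\,t_{k+1} + g(n) = g(n) - \frac{2^{k+1}\,(2n-k-4)!}{(n-k-2)!},$$
where $g$ is a function of $n$. We find this function from the case $k = 2$:
$$S_2 = t_2 = \frac{4\,(2n-6)!}{(n-3)!} = g(n) - \frac{8\,(2n-6)!}{(n-4)!}.$$
From this equality we deduce that $g(n) = \frac{4\,(2n-5)!}{(n-3)!}$. We conclude that:
$$S_k = \frac{4\,(2n-5)!}{(n-3)!} - \frac{2^{k+1}\,(2n-k-4)!}{(n-k-2)!}.$$
The formula in the statement follows from this expression. □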
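The counts $c_i$ can be cross-checked by brute force for small $n$. The sketch below is not taken from the paper. It enumerates all trees by attaching each new leaf to every edge in turn (a standard construction that produces each of the $(2n-5)!!$ trees exactly once), tallies $d_T(1,2)$, and compares the tallies with the closed form $c_i = (i-1)(2n-i-4)!/(2^{n-i-1}(n-i-1)!)$, which is an assumption of this example. All function names are illustrative.

```python
from collections import Counter, deque
from math import factorial


def all_trees(n):
    """Yield the edge list of every fully resolved unrooted tree on leaves 1..n (n >= 3).

    Leaves are positive integers, internal nodes are negative integers.
    """
    def grow(edges, next_leaf, next_internal):
        if next_leaf > n:
            yield edges
            return
        # Subdivide each edge with a new internal node and hang the new leaf from it.
        for idx, (u, v) in enumerate(edges):
            w = next_internal
            new_edges = edges[:idx] + edges[idx + 1:] + [(u, w), (w, v), (next_leaf, w)]
            yield from grow(new_edges, next_leaf + 1, next_internal - 1)

    # The unique tree with 3 leaves is a star around one internal node.
    yield from grow([(1, -1), (2, -1), (3, -1)], 4, -2)


def leaf_distance(edges, a, b):
    """Number of edges on the path between nodes a and b (breadth-first search)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {a: 0}
    queue = deque([a])
    while queue:
        x = queue.popleft()
        if x == b:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    raise ValueError("nodes are not connected")


def brute_force_counts(n):
    """Tally d_T(1, 2) over all trees with n leaves."""
    return Counter(leaf_distance(tree, 1, 2) for tree in all_trees(n))


def closed_form_count(n, i):
    """Assumed closed form for c_i (an assumption of this sketch), 1 <= i <= n-1."""
    return (i - 1) * factorial(2 * n - i - 4) // (2 ** (n - i - 1) * factorial(n - i - 1))


if __name__ == "__main__":
    for n in range(4, 8):
        counts = brute_force_counts(n)
        print(n, [counts[i] for i in range(1, n)],
              [closed_form_count(n, i) for i in range(1, n)])
```

The enumeration grows like $(2n-5)!!$, so it is only practical for small $n$; it serves as a sanity check rather than a proof.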
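Theorem. In the limit, $\mathrm{median}(n)$ is equal to $\sqrt{4\ln(2)\,n}$.

Proof. To simplify the notation, we shall denote $\mathrm{median}(n)$ by $k$. By definition,
$$k = \max\Big\{ j \in \mathbb{N} \;\Big|\; \sum_{i=1}^{j} c_i \le \frac{|\mathcal{T}_n|}{2} \Big\}.$$
Thus, by Lemma 2, $k$ is the largest integer value such that
$$(2n-5)!! - \frac{(2n-k-4)!}{2^{n-k-2}\,(n-k-2)!} \le \frac{(2n-5)!!}{2}.$$
If we simplify this inequality and take logarithms, this condition becomes
$$k\ln(2) + \ln\frac{(2n-k-4)!\,(n-2)!}{(n-k-2)!\,(2n-4)!} \ge -\ln(2). \tag{1}$$
Combining the asymptotic development of the logarithms of these factorials with equation (1), we obtain, for $k = O(\sqrt{n})$:
$$\frac{k^2}{4n} + O\!\left(n^{-1/2}\right) \le \ln(2).$$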
So, the first order term of the median $k$ will be the largest integer value that satisfies $k^2/(4n) \le \ln(2)$. Therefore, the median will be the closest integer to $\sqrt{4\ln(2)\,n}$, from which the claim in the statement follows. □
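The limit formula can be compared numerically with the exact median. The following is a minimal sketch under two assumptions that are not spelled out in the surviving text: the closed form $c_i = (i-1)(2n-i-4)!/(2^{n-i-1}(n-i-1)!)$ for the counts, and the convention that the median is the largest $k$ with $\sum_{i \le k} c_i \le |\mathcal{T}_n|/2$. Exact integer arithmetic is used throughout, so no rounding enters the cumulative sums. The function names are illustrative.

```python
from math import factorial, log, sqrt


def distance_counts(n):
    """List [c_1, ..., c_{n-1}] from the assumed closed form for c_i."""
    return [
        (i - 1) * factorial(2 * n - i - 4) // (2 ** (n - i - 1) * factorial(n - i - 1))
        for i in range(1, n)
    ]


def median_distance(n):
    """Largest k such that c_1 + ... + c_k <= |T_n| / 2 (assumed convention)."""
    counts = distance_counts(n)
    total = sum(counts)  # equals (2n-5)!!
    running, k = 0, 0
    for i, c in enumerate(counts, start=1):
        if 2 * (running + c) > total:
            break
        running += c
        k = i
    return k


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, median_distance(n), sqrt(4 * log(2) * n))
```

For small $n$ the two values can differ noticeably, since the formula only describes the leading term.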
We have obtained a limit formula for the median of the distance between two leaves in a fully resolved unrooted phylogenetic tree with $n$ leaves. Our method allows one to find more terms of the asymptotic development of the median. For instance, it can be proved that $\mathrm{median}(n) \approx \sqrt{4n\ln 2} + \left(\tfrac{1}{2} - \ln 2\right)$. The limit formula obtained in this work can be generalized to the $p$-percentile $x_p = \max\{k \in \mathbb{N} \mid \sum_{i=1}^{k} c_i \le |\mathcal{T}_n|\,p\}$. Indeed, using our method we obtain that $x_p \approx \sqrt{-4\ln(1-p)\,n}$.
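The percentile formula can be explored in the same way. The sketch below is illustrative only: it assumes the closed form $c_i = (i-1)(2n-i-4)!/(2^{n-i-1}(n-i-1)!)$ for the counts and reads the definition of $x_p$ with the relation $\le$. The probability $p$ is passed as a `Fraction` so that the comparison with $|\mathcal{T}_n|\,p$ is exact. The name `percentile_distance` is hypothetical.

```python
from fractions import Fraction
from math import factorial, log, sqrt


def percentile_distance(n, p):
    """Largest k with c_1 + ... + c_k <= p * |T_n|, for a Fraction p in (0, 1)."""
    counts = [
        (i - 1) * factorial(2 * n - i - 4) // (2 ** (n - i - 1) * factorial(n - i - 1))
        for i in range(1, n)
    ]  # assumed closed form for c_i
    total = sum(counts)
    running, k = 0, 0
    for i, c in enumerate(counts, start=1):
        running += c
        if running > p * total:  # exact comparison: int against Fraction
            break
        k = i
    return k


if __name__ == "__main__":
    n = 2000
    for p in (Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)):
        approx = sqrt(-4 * log(1 - p) * n)
        print(p, percentile_distance(n, p), approx)
```

With `p = Fraction(1, 2)` the function returns the median under the same convention, so the output can also be compared with $\sqrt{4n\ln 2} + (\tfrac{1}{2} - \ln 2)$.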