The mean value of the squared path-difference distance for rooted phylogenetic trees

The path-difference metric is one of the oldest distances for the comparison of fully resolved phylogenetic trees, but its statistical properties are still quite unknown. In this paper we compute the mean value of the square of the path-difference me…

Authors: Arnau Mir, Francesc Rossello

Arnau Mir^a, Francesc Rosselló^{∗,a}

^a Research Institute of Health Science (IUNICS) and Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain

∗ Corresponding author. Email addresses: arnau.mir@uib.es (Arnau Mir), cesc.rossello@uib.es (Francesc Rosselló)

Preprint submitted to Elsevier, October 29, 2018

Abstract

The path-difference metric is one of the oldest distances for the comparison of fully resolved phylogenetic trees, but its statistical properties are still quite unknown. In this paper we compute the mean value of the square of the path-difference metric between two fully resolved rooted phylogenetic trees with $n$ leaves, under the uniform distribution. This complements previous work by Steel and Penny, who computed this mean value for fully resolved unrooted phylogenetic trees.

Key words: Phylogenetic trees, path-difference metric, nodal distance, hypergeometric series

1. Introduction

The definition and study of metrics for the comparison of rooted phylogenetic trees is a classical problem in phylogenetics [9, Ch. 30], motivated by the need to compare alternative phylogenetic trees for a given set of organisms obtained from different datasets or using different reconstruction algorithms [11]. Other applications of these metrics include the assessment of phylogenetic tree reconstruction methods [18] and the definition of search-by-similarity procedures on databases [12].

Many metrics for the comparison of rooted phylogenetic trees on the same set of taxa have been proposed so far. Some of the first such metrics, defined around 40 years ago, were based on the comparison of the vectors of lengths of (undirected) paths connecting pairs of taxa in the corresponding trees.
These metrics comprise, for instance, the euclidean distance between these vectors [6, 7], the Manhattan distance between them [18], or the correlation between them [14]. Similar metrics have also been defined for unrooted phylogenetic trees [3, 15, 17]. Let us point out here that, in the rooted case, these metrics satisfy the separation axiom of metrics (distance 0 means isomorphism) only for fully resolved, or binary, phylogenetic trees, and hence they are metrics, in the actual mathematical sense of the term, only in this case; cf. [5]. In the unrooted case, they are metrics for arbitrary trees.

In contrast with other metrics [4, 10, 16, 17], and despite their tradition and popularity, the statistical properties of these path-lengths based metrics are mostly unknown. For instance, the diameter of none of these metrics (either in the rooted or in the unrooted case) is known yet. Steel and Penny [17] studied, among others, the distribution of one of these distances for unrooted trees: the one defined through the euclidean distance between path-lengths vectors, which these authors called the path-difference metric (other published names for this metric are the cladistic difference [6] and, generically, a nodal distance [3, 15]). In the aforementioned paper, Steel and Penny computed the mean value of the square of this path-difference metric for fully resolved unrooted trees. The knowledge of this mean value is useful in the assessment of a comparison of two trees through this metric, because it "provides an indication as to whether or not this measured similarity could have come about by chance" [17].

In this paper we compute the mean value of the square of the path-difference metric for fully resolved rooted phylogenetic trees with $n$ leaves.
Although the raw argument underlying our computation is the same as in Steel and Penny's paper, the details in the rooted case are much harder than in the unrooted case, because of the asymmetric role of the root. We have proved that this mean value grows in $O(n^3)$; more specifically, it is
\[
2\binom{n}{2}\left(4(n-1)+2-\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}-\left(\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}\right)^{2}\right).
\]
This turns out to be the mean value obtained by Steel and Penny for unrooted phylogenetic trees, but with $n+1$ leaves. A similar relationship between combinatorial values for rooted and unrooted phylogenetic trees arises in other problems; for instance, a simple argument shows that the number of rooted phylogenetic trees with $n$ leaves is the number of unrooted phylogenetic trees with $n+1$ leaves [9, Ch. 3]; also, as we shall see in this paper (Corollary 11), the mean value of the length of the undirected path between two given leaves in a rooted phylogenetic tree with $n$ leaves is equal to the corresponding mean value for unrooted phylogenetic trees. But we have not been able to find a clever argument that proves directly this relationship between the mean values of the squared path-difference metric, or of the path-length between two leaves, in the rooted and unrooted cases, and thus we have needed to compute them.

2. Preliminaries

2.1. Phylogenetic trees

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a fully resolved, or binary (that is, with all its internal nodes of out-degree 2), rooted tree with its leaves bijectively labeled in the set $S$. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on a given set of $n$ taxa, when this set is known or irrelevant.
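
As a quick sanity check on the closed-form mean value stated in the Introduction, the following Python sketch evaluates it with exact rational arithmetic. The function name `mean_squared_path_difference` is an illustrative choice and not part of the paper; the code assumes only the displayed formula.

```python
from fractions import Fraction
from math import comb


def mean_squared_path_difference(n: int) -> Fraction:
    """Closed-form mean of the squared path-difference metric between two
    independent, uniformly chosen rooted binary trees with n leaves
    (the expression quoted in the Introduction)."""
    if n < 2:
        raise ValueError("the formula is stated for n >= 2 leaves")
    m = n - 1
    # r = 2^{2(n-1)} / binom(2(n-1), n-1), kept as an exact fraction
    r = Fraction(4 ** m, comb(2 * m, m))
    return 2 * comb(n, 2) * (4 * m + 2 - r - r * r)


if __name__ == "__main__":
    for n in (2, 3, 4, 10, 100, 1000):
        value = mean_squared_path_difference(n)
        # The last column illustrates the O(n^3) growth.
        print(n, float(value), float(value) / n ** 3)
```

For $n = 2$ the value is 0, as it must be since there is a single tree with two leaves, and for $n = 3$ it is $4/3$. By Stirling's approximation of the central binomial coefficient, the ratio printed in the last column should drift slowly toward $4 - \pi$, which is consistent with the stated $O(n^3)$ growth.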
We shall represent a path from $u$ to $v$ in a phylogenetic tree $T$ by $u \rightsquigarrow v$. Whenever there exists a path $u \rightsquigarrow v$, we shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$. Given a node $v$ of a phylogenetic tree $T$, the subtree of $T$ rooted at $v$ is the subgraph of $T$ induced on the set of descendants of $v$. It is a phylogenetic tree on the set of descendant leaves of $v$, with root this node $v$.

The lowest common ancestor (LCA) of a pair of nodes $u, v$ of a phylogenetic tree $T$, in symbols $LCA_T(u,v)$, is the unique common ancestor of them that is a descendant of every other common ancestor of them. The path difference $d_T(u,v)$ between two nodes $u$ and $v$ is the sum of the lengths of the paths $LCA_T(u,v) \rightsquigarrow u$ and $LCA_T(u,v) \rightsquigarrow v$; equivalently, it is the length of the only path connecting $u$ and $v$ in the undirected tree associated to $T$. It is well-known (for a proof, see [5]) that the vector of path differences $d(T) = \big(d_T(i,j)\big)_{1 \le i < j \le n}$ […] $n \ge 2$, $|\mathcal{T}_n| = (2n-3)!! = (2n-3)(2n-5)\cdots 3\cdot 1$.

An ordered $m$-forest on a set $S$ is an ordered sequence of $m$ phylogenetic trees $(T_1, T_2, \ldots, T_m)$, each $T_i$ on a set $S_i$ of taxa, such that these sets $S_i$ are pairwise disjoint and their union is $S$. Let $\mathcal{F}_{m,n}$ be the set of (isomorphism classes of) ordered $m$-forests on any given set $S$ with $|S| = n$. The cardinal of $\mathcal{F}_{m,n}$ is computed (although not explicitly) along the proof of Theorem 3 in [17].

Lemma 1. For every $m \ge 1$, $|\mathcal{F}_{m,m}| = m!$ and
\[
|\mathcal{F}_{m,n}| = \frac{m\,(n!)\prod_{l=1}^{n-m-1}(n+l)}{(2(n-m))!!} = \frac{(2n-m-1)!\,m}{(n-m)!\,2^{n-m}}
\]
for every $n > m$.

Proof. The exponential generating function for the number of rooted phylogenetic trees with $n$ leaves is $B(x) = 1 - \sqrt{1-2x}$.
Then, the exponential generating function for the number of ordered forests consisting of a given number of trees (marked by the variable $y$) and a given global number of leaves (marked by the variable $x$) is
\[
F(x,y) = \sum_{m \ge 1} y^m B(x)^m = \frac{1}{1 - yB(x)} - 1.
\]
This implies that the number $|\mathcal{F}_{m,n}|$ of ordered $m$-forests on a set of $n$ leaves is equal to
\[
\frac{\partial^n}{\partial x^n}\big(B(x)^m\big)\Big|_{x=0}.
\]
This derivative can be easily computed, yielding the values given in the statement.

2.2. Hypergeometric functions

The (generalized) hypergeometric function ${}_pF_q$ is defined [2] as
\[
{}_pF_q\!\left(\begin{array}{c} a_1,\ldots,a_p \\ b_1,\ldots,b_q \end{array};\, z\right) = \sum_{k \ge 0} \frac{(a_1)_k\cdots(a_p)_k}{(b_1)_k\cdots(b_q)_k}\cdot\frac{z^k}{k!},
\]
where $(a)_k := a\cdot(a+1)\cdots(a+k-1)$. The following lemmas will be used in the next section.

Lemma 2.
\[
{}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right) = \frac{2^{n-1}}{n}.
\]

Proof. To compute the value of ${}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right)$ we shall use Formula 15.1.26 in [1] (see also http://functions.wolfram.com/07.23.03.0028.01):
\[
{}_2F_1\!\left(\begin{array}{c} a,\ 1-a \\ c \end{array};\, \frac{1}{2}\right) = \frac{2^{1-c}\sqrt{\pi}\,\Gamma(c)}{\Gamma\!\left(\frac{a+c}{2}\right)\Gamma\!\left(\frac{c-a+1}{2}\right)}.
\]
We cannot apply this expression to $a = n-1$ and $c = -n$, because $\Gamma(-n) = \infty$. So, instead, we use a standard passage-to-the-limit argument:
\[
{}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right)
= \lim_{\varepsilon \to 0} {}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n+\varepsilon \end{array};\, \frac{1}{2}\right)
= \lim_{\varepsilon \to 0} \frac{2^{1+n-\varepsilon}\sqrt{\pi}\,\Gamma(-n+\varepsilon)}{\Gamma\!\left(\frac{\varepsilon-1}{2}\right)\Gamma\!\left(\frac{2-2n+\varepsilon}{2}\right)}
= \frac{2^{n-1}}{n}.
\]

Lemma 3.
\[
{}_3F_2\!\left(\begin{array}{c} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{array};\, \frac{1}{2}\right) = \frac{2^{n-1}}{n^2}\left(-1 + \frac{(2n-1)!!}{2^{n-2}(n-1)!}\right).
\]

Proof. The hypergeometric series ${}_3F_2\!\left(\begin{array}{c} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{array};\, \frac{1}{2}\right)$ can be written as a function of the hypergeometric function ${}_2F_1$ as follows:^1
\[
{}_3F_2\!\left(\begin{array}{c} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{array};\, \frac{1}{2}\right)
= {}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right)
- \frac{(n-1)(n-2)}{2n^2}\, {}_2F_1\!\left(\begin{array}{c} n,\ 3-n \\ 1-n \end{array};\, \frac{1}{2}\right). \tag{1}
\]
We already know from the previous lemma that ${}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right) = \frac{2^{n-1}}{n}$.
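It remains to compute ${}_2F_1\!\left(\begin{array}{c} n,\ 3-n \\ 1-n \end{array};\, \frac{1}{2}\right)$. To do it, we shall use the following formula:^2
\[
{}_2F_1\!\left(\begin{array}{c} a,\ 3-a \\ c \end{array};\, \frac{1}{2}\right)
= \frac{2^{3-c}\sqrt{\pi}\,\Gamma(c)}{(a-1)(a-2)}\left(\frac{c-2}{\Gamma\!\left(\frac{a+c}{2}-1\right)\Gamma\!\left(\frac{c-a+1}{2}\right)} - \frac{2}{\Gamma\!\left(\frac{a+c-3}{2}\right)\Gamma\!\left(\frac{c-a}{2}\right)}\right).
\]

^1 See http://functions.wolfram.com/07.27.03.0118.01
^2 See http://functions.wolfram.com/07.23.03.0030.01

Again, we cannot apply this formula directly to $a = n$ and $c = 1-n$, because $\Gamma(1-n) = \infty$, and thus we use a passage-to-the-limit argument:
\[
\begin{aligned}
{}_2F_1\!\left(\begin{array}{c} n,\ 3-n \\ 1-n \end{array};\, \frac{1}{2}\right)
&= \lim_{\varepsilon \to 0} {}_2F_1\!\left(\begin{array}{c} n,\ 3-n \\ 1-n+\varepsilon \end{array};\, \frac{1}{2}\right) \\
&= \lim_{\varepsilon \to 0} \frac{2^{2+n-\varepsilon}\sqrt{\pi}\,\Gamma(1-n+\varepsilon)}{(n-1)(n-2)}\left(\frac{-n-1+\varepsilon}{\Gamma\!\left(\frac{\varepsilon-1}{2}\right)\Gamma\!\left(1-n+\frac{\varepsilon}{2}\right)} - \frac{2}{\Gamma\!\left(\frac{\varepsilon-2}{2}\right)\Gamma\!\left(\frac{1+\varepsilon-2n}{2}\right)}\right) \\
&= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\lim_{\varepsilon \to 0}\frac{(-n-1+\varepsilon)\,\Gamma(1-n+\varepsilon)}{\Gamma\!\left(\frac{\varepsilon-1}{2}\right)\Gamma\!\left(1-n+\frac{\varepsilon}{2}\right)} - \lim_{\varepsilon \to 0}\frac{2\,\Gamma(1-n+\varepsilon)}{\Gamma\!\left(\frac{\varepsilon-2}{2}\right)\Gamma\!\left(\frac{1+\varepsilon-2n}{2}\right)}\right) \\
&= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\frac{n+1}{4\sqrt{\pi}} - \frac{(-1)^{n+2}\binom{-1/2}{n}\,n!}{(n-1)!\,\sqrt{\pi}}\right) \\
&= \frac{2^{2+n}}{(n-1)(n-2)}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^n}\right).
\end{aligned}
\]
Replacing ${}_2F_1\!\left(\begin{array}{c} n-1,\ 2-n \\ -n \end{array};\, \frac{1}{2}\right)$ and ${}_2F_1\!\left(\begin{array}{c} n,\ 3-n \\ 1-n \end{array};\, \frac{1}{2}\right)$ in equation (1) by their values given above, we obtain
\[
{}_3F_2\!\left(\begin{array}{c} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{array};\, \frac{1}{2}\right)
= \frac{2^{n-1}}{n} - \frac{2^{n+2}}{2n^2}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^n}\right)
= \frac{2^{n-1}}{n^2}\left(-1 + \frac{(2n-1)!!}{2^{n-2}(n-1)!}\right),
\]
as we claimed.

Lemma 4. For all real numbers $a, b$,
\[
{}_4F_3\!\left(\begin{array}{c} 1,\ a,\ a+1/2,\ b \\ 2,\ 2a,\ b+1/2 \end{array};\, 1\right)
= \frac{2b-1}{(a-1)(b-1)}\left(-1 + {}_3F_2\!\left(\begin{array}{c} a-1,\ a-1/2,\ b-1 \\ 2a-1,\ b-1/2 \end{array};\, 1\right)\right).
\]

Proof. By definition,
\[
{}_4F_3\!\left(\begin{array}{c} 1,\ a,\ a+1/2,\ b \\ 2,\ 2a,\ b+1/2 \end{array};\, 1\right)
= \sum_{k \ge 0} \frac{k!\,(a)_k (a+1/2)_k (b)_k}{(k+1)!\,(2a)_k (b+1/2)_k}\cdot\frac{1}{k!}
= \sum_{k \ge 1} \frac{(a)_{k-1}(a+1/2)_{k-1}(b)_{k-1}}{k!\,(2a)_{k-1}(b+1/2)_{k-1}} = (\ast).
\]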
Taking into account that
\[
(a)_{k-1} = \frac{(a-1)_k}{a-1},\quad
(a+1/2)_{k-1} = \frac{(a-1/2)_k}{a-1/2},\quad
(b)_{k-1} = \frac{(b-1)_k}{b-1},\quad
(2a)_{k-1} = \frac{(2a-1)_k}{2a-1},\quad
(b+1/2)_{k-1} = \frac{(b-1/2)_k}{b-1/2},
\]
the expression $(\ast)$ can be written as
\[
(\ast) = \sum_{k \ge 1} \frac{(a-1)_k (a-1/2)_k (b-1)_k\,(2a-1)(b-1/2)}{(a-1)(a-1/2)(b-1)\,(2a-1)_k (b-1/2)_k}\cdot\frac{1}{k!}
= \frac{2b-1}{(a-1)(b-1)}\left(-1 + {}_3F_2\!\left(\begin{array}{c} a-1,\ a-1/2,\ b-1 \\ 2a-1,\ b-1/2 \end{array};\, 1\right)\right),
\]
yielding the formula in the statement.

3. Mean total areas

For every $s \in \mathbb{Z}^+$, the total $s$-area of a phylogenetic tree $T$ is $D^{(s)}(T) = \sum_{1 \le i \,\cdots}$ […]

[…] $n \ge 2$ and for every $s \in \mathbb{Z}^+$,
\[
\mu(D^{(s)})_n = \binom{n}{2}\frac{S^{(s)}_n}{(2n-3)!!}.
\]

Proof. Using the previous lemma,
\[
\mu(D^{(s)})_n = \frac{\sum_{T \in \mathcal{T}_n} D^{(s)}(T)}{|\mathcal{T}_n|} = \cdots
\]
[…]
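
For small $n$, mean total areas can also be obtained by exhaustive enumeration, which gives an independent check on the closed-form results. The Python sketch below is an illustration rather than the paper's method. It assumes that the total $s$-area is the sum of $d_T(i,j)^s$ over all pairs of leaves (the defining sum is cut off in the text above), it represents trees as nested tuples, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`,
    including the position above the current root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_trees(n):
    """List all (2n-3)!! rooted binary trees with leaves 1..n."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


def path_differences(tree):
    """Return {(i, j): d_T(i, j)} for all leaf pairs i < j."""
    dist = {}

    def depths(node):
        # Depth of every leaf below `node`, measured from `node`.
        if not isinstance(node, tuple):
            return {node: 0}
        left, right = (depths(child) for child in node)
        # `node` is the LCA of every pair with one leaf on each side.
        for i, di in left.items():
            for j, dj in right.items():
                dist[(min(i, j), max(i, j))] = di + dj + 2
        return {leaf: d + 1 for side in (left, right) for leaf, d in side.items()}

    depths(tree)
    return dist


def mean_total_area(n, s):
    """Mean over all trees of the sum of d_T(i, j)**s (assumed definition)."""
    trees = all_trees(n)
    total = sum(sum(d ** s for d in path_differences(t).values()) for t in trees)
    return Fraction(total, len(trees))


def mean_squared_distance(n):
    """Mean squared path-difference distance over all ordered pairs of trees."""
    vectors = [path_differences(t) for t in all_trees(n)]
    pairs = list(combinations(range(1, n + 1), 2))
    total = sum((u[p] - v[p]) ** 2 for u in vectors for v in vectors for p in pairs)
    return Fraction(total, len(vectors) ** 2)


if __name__ == "__main__":
    for n in range(2, 6):
        print(n, len(all_trees(n)), mean_total_area(n, 1), mean_squared_distance(n))
```

The number of trees grows as $(2n-3)!!$, so this brute-force approach is only practical for very small $n$. Its value is as a cross-check: the printed mean squared distances can be compared with the closed-form expression given in the Introduction.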
