The mean value of the squared path-difference distance for rooted phylogenetic trees

Arnau Mir (a), Francesc Rosselló (*, a)

(a) Research Institute of Health Science (IUNICS) and Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain

(*) Corresponding author. Email addresses: arnau.mir@uib.es (Arnau Mir), cesc.rossello@uib.es (Francesc Rosselló)

Preprint submitted to Elsevier, October 29, 2018

Abstract

The path-difference metric is one of the oldest distances for the comparison of fully resolved phylogenetic trees, but its statistical properties are still quite unknown. In this paper we compute the mean value of the square of the path-difference metric between two fully resolved rooted phylogenetic trees with $n$ leaves, under the uniform distribution. This complements previous work by Steel and Penny, who computed this mean value for fully resolved unrooted phylogenetic trees.

Key words: Phylogenetic trees, path-difference metric, nodal distance, hypergeometric series

1. Introduction

The definition and study of metrics for the comparison of rooted phylogenetic trees is a classical problem in phylogenetics [9, Ch. 30], motivated by the need to compare alternative phylogenetic trees for a given set of organisms obtained from different datasets or using different reconstruction algorithms [11]. Other applications of these metrics include the assessment of phylogenetic tree reconstruction methods [18] and the definition of search-by-similarity procedures on databases [12].

Many metrics for the comparison of rooted phylogenetic trees on the same set of taxa have been proposed so far. Some of the first such metrics, defined around 40 years ago, were based on the comparison of the vectors of lengths of (undirected) paths connecting pairs of taxa in the corresponding trees.
These metrics comprise, for instance, the Euclidean distance between these vectors [6, 7], the Manhattan distance between them [18], or the correlation between them [14]. Similar metrics have also been defined for unrooted phylogenetic trees [3, 15, 17]. Let us point out here that, in the rooted case, these metrics satisfy the separation axiom of metrics (distance 0 means isomorphism) only for fully resolved, or binary, phylogenetic trees, and hence they are metrics, in the actual mathematical sense of the term, only in this case; cf. [5]. In the unrooted case, they are metrics for arbitrary trees.

In contrast with other metrics [4, 10, 16, 17], and despite their tradition and popularity, the statistical properties of these path-lengths based metrics are mostly unknown. For instance, the diameter of none of these metrics (either in the rooted or in the unrooted case) is known yet. Steel and Penny [17] studied, among others, the distribution of one of these distances for unrooted trees: the one defined through the Euclidean distance between path-lengths vectors, which these authors called the path-difference metric (other published names for this metric are the cladistic difference [6] and, generically, a nodal distance [3, 15]). In the aforementioned paper, Steel and Penny computed the mean value of the square of this path-difference metric for fully resolved unrooted trees. The knowledge of this mean value is useful in the assessment of a comparison of two trees through this metric, because it "provides an indication as to whether or not this measured similarity could have come about by chance" [17].

In this paper we compute the mean value of the square of the path-difference metric for fully resolved rooted phylogenetic trees with $n$ leaves.
Although the raw argument underlying our computation is the same as in Steel and Penny's paper, the details in the rooted case are much harder than in the unrooted case, because of the asymmetric role of the root. We have proved that this mean value grows in $O(n^3)$; more specifically, it is

$$2\binom{n}{2}\left(4(n-1)+2-\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}-\left(\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}\right)^{2}\right).$$

This turns out to be the mean value obtained by Steel and Penny for unrooted phylogenetic trees, but with $n+1$ leaves. A similar relationship between combinatorial values for rooted and unrooted phylogenetic trees arises in other problems; for instance, a simple argument shows that the number of rooted phylogenetic trees with $n$ leaves is the number of unrooted phylogenetic trees with $n+1$ leaves [9, Ch. 3]; also, as we shall see in this paper (Corollary 11), the mean value of the length of the undirected path between two given leaves in a rooted phylogenetic tree with $n$ leaves is equal to the corresponding mean value for unrooted phylogenetic trees. But we have not been able to find a clever argument that proves directly this relationship between the mean values of the squared path-difference metric, or of the path-length between two leaves, in the rooted and unrooted cases, and thus we have needed to compute them.

2. Preliminaries

2.1. Phylogenetic trees

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a fully resolved, or binary (that is, with all its internal nodes of out-degree 2), rooted tree with its leaves bijectively labeled in the set $S$. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on a given set of $n$ taxa, when this set is known or irrelevant.
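As a quick sanity check on the closed formula for the mean value stated in the Introduction, the following sketch evaluates it with exact rational arithmetic and compares it, for $n = 3$, with a brute-force average over the three rooted binary trees on three leaves. The function names and the hard-coded path-length vectors are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from itertools import product
from math import comb


def mean_squared_path_difference(n):
    """Closed-form mean of the squared path-difference metric (rooted trees, n leaves)."""
    # r = 2^{2(n-1)} / binom(2(n-1), n-1), kept as an exact fraction
    r = Fraction(4 ** (n - 1), comb(2 * (n - 1), n - 1))
    return 2 * comb(n, 2) * (4 * (n - 1) + 2 - r - r * r)


def brute_force_mean_n3():
    """Average squared Euclidean distance over all ordered pairs of trees on {1, 2, 3}."""
    # Path-length vectors (d(1,2), d(1,3), d(2,3)): the cherry pair is at
    # distance 2, the other two pairs are at distance 3.
    vectors = [(2, 3, 3), (3, 2, 3), (3, 3, 2)]
    total = sum(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in product(vectors, repeat=2)
    )
    return Fraction(total, len(vectors) ** 2)
```

For $n = 2$ there is a single tree, so the mean must vanish, and the formula indeed returns 0; for $n = 3$ both functions return 4/3.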
We shall represent a path from $u$ to $v$ in a phylogenetic tree $T$ by $u \rightsquigarrow v$. Whenever there exists a path $u \rightsquigarrow v$, we shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$. Given a node $v$ of a phylogenetic tree $T$, the subtree of $T$ rooted at $v$ is the subgraph of $T$ induced on the set of descendants of $v$. It is a phylogenetic tree on the set of descendant leaves of $v$, and with root this node $v$.

The lowest common ancestor (LCA) of a pair of nodes $u, v$ of a phylogenetic tree $T$, in symbols $LCA_T(u,v)$, is the unique common ancestor of them that is a descendant of every other common ancestor of them. The path difference $d_T(u,v)$ between two nodes $u$ and $v$ is the sum of the lengths of the paths $LCA_T(u,v) \rightsquigarrow u$ and $LCA_T(u,v) \rightsquigarrow v$; equivalently, it is the length of the only path connecting $u$ and $v$ in the undirected tree associated to $T$. It is well known (for a proof, see [5]) that the vector of path differences $d(T) = \big(d_T(i,j)\big)_{1 \le i \ldots}$ […] 2, $|\mathcal{T}_n| = (2n-3)!! = (2n-3)(2n-5)\cdots 3\cdot 1$.

An ordered $m$-forest on a set $S$ is an ordered sequence of $m$ phylogenetic trees $(T_1, T_2, \ldots, T_m)$, each $T_i$ on a set $S_i$ of taxa, such that these sets $S_i$ are pairwise disjoint and their union is $S$. Let $\mathcal{F}_{m,n}$ be the set of (isomorphism classes of) ordered $m$-forests on any given set $S$ with $|S| = n$. The cardinality of $\mathcal{F}_{m,n}$ is computed (although not explicitly) along the proof of Theorem 3 in [17].

Lemma 1. For every $m \ge 1$, $|\mathcal{F}_{m,m}| = m!$ and

$$|\mathcal{F}_{m,n}| = \frac{m\,(n!)\prod_{l=1}^{n-m-1}(n+l)}{(2(n-m))!!} = \frac{(2n-m-1)!\,m}{(n-m)!\,2^{n-m}}$$

for every $n > m$.

Proof. The exponential generating function for the number of rooted phylogenetic trees with $n$ leaves is $B(x) = 1-\sqrt{1-2x}$.
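The generating function $B(x)$ just introduced makes Lemma 1 easy to cross-check numerically. The sketch below is only an illustration under stated assumptions: it takes the coefficients of $B(x)$ to be $(2n-3)!!/n!$ (from the tree count given above), relies on the standard fact, used in the remainder of the proof, that the number of ordered $m$-forests on $n$ leaves is $n!\,[x^n]B(x)^m$, and compares that number with the closed formula. All function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def odd_double_factorial(k):
    """k!! for odd k >= -1, with the convention (-1)!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def tree_egf_coefficients(max_n):
    """Coefficients of x^0..x^max_n in B(x) = 1 - sqrt(1 - 2x)."""
    return [Fraction(0)] + [
        Fraction(odd_double_factorial(2 * n - 3), factorial(n))
        for n in range(1, max_n + 1)
    ]


def forests_from_series(m, n):
    """n! [x^n] B(x)^m, computed by truncated power-series multiplication."""
    b = tree_egf_coefficients(n)
    power = [Fraction(1)] + [Fraction(0)] * n  # series of B(x)^0 = 1
    for _ in range(m):
        power = [
            sum(power[i] * b[k - i] for i in range(k + 1))
            for k in range(n + 1)
        ]
    return power[n] * factorial(n)


def forests_closed_form(m, n):
    """Second expression in Lemma 1; for n = m it reduces to m!."""
    return Fraction(factorial(2 * n - m - 1) * m, factorial(n - m) * 2 ** (n - m))
```

For example, `forests_from_series(2, 3)` and `forests_closed_form(2, 3)` both give 6: choose which of the two trees has two leaves (2 ways) and which leaf is the singleton (3 ways).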
Then, the exponential generating function for the number of ordered forests consisting of a given number of trees (marked by the variable $y$) and a given global number of leaves (marked by the variable $x$) is

$$F(x,y) = \sum_{m\ge 1} y^m B(x)^m = \frac{1}{1-yB(x)} - 1.$$

This implies that the number $|\mathcal{F}_{m,n}|$ of ordered $m$-forests on a set of $n$ leaves is equal to

$$\left.\frac{\partial^n}{\partial x^n}\big(B(x)^m\big)\right|_{x=0}.$$

This derivative can be easily computed, yielding the values given in the statement.

2.2. Hypergeometric functions

The (generalized) hypergeometric function ${}_pF_q$ is defined [2] as

$${}_pF_q\left(\begin{matrix} a_1,\ldots,a_p\\ b_1,\ldots,b_q\end{matrix};z\right) = \sum_{k\ge 0}\frac{(a_1)_k\cdots(a_p)_k}{(b_1)_k\cdots(b_q)_k}\cdot\frac{z^k}{k!},$$

where $(a)_k := a\cdot(a+1)\cdots(a+k-1)$. The following lemmas will be used in the next section.

Lemma 2.
$${}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right) = \frac{2^{n-1}}{n}.$$

Proof. To compute the value of ${}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right)$ we shall use Formula 15.1.26 in [1] (see also http://functions.wolfram.com/07.23.03.0028.01):

$${}_2F_1\left(\begin{matrix} a,\ 1-a\\ c\end{matrix};\frac12\right) = \frac{2^{1-c}\sqrt{\pi}\,\Gamma(c)}{\Gamma\left(\frac{a+c}{2}\right)\Gamma\left(\frac{c-a+1}{2}\right)}.$$

We cannot apply this expression to $a = n-1$ and $c = -n$, because $\Gamma(-n) = \infty$. So, instead, we use a standard passage to the limit:

$${}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right) = \lim_{\varepsilon\to 0}{}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n+\varepsilon\end{matrix};\frac12\right) = \lim_{\varepsilon\to 0}\frac{2^{1+n-\varepsilon}\sqrt{\pi}\,\Gamma(-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(\frac{2-2n+\varepsilon}{2}\right)} = \frac{2^{n-1}}{n}.$$

Lemma 3.
$${}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1\\ -n,\ -n\end{matrix};\frac12\right) = \frac{2^{n-1}}{n^2}\left(-1+\frac{(2n-1)!!}{2^{n-2}(n-1)!}\right).$$

Proof. The hypergeometric series ${}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1\\ -n,\ -n\end{matrix};\frac12\right)$ can be written as a function of the hypergeometric function ${}_2F_1$ as follows (see footnote 1):

$${}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1\\ -n,\ -n\end{matrix};\frac12\right) = {}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right) - \frac{(n-1)(n-2)}{2n^2}\,{}_2F_1\left(\begin{matrix} n,\ 3-n\\ 1-n\end{matrix};\frac12\right). \tag{1}$$

We already know from the previous lemma that ${}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right) = \frac{2^{n-1}}{n}$.
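Because one of the upper parameters is a non-positive integer, the series in Lemmas 2 and 3 are finite sums, so both stated values can be checked exactly for small $n$. The following sketch does this with rational arithmetic; the helper names are hypothetical, and the check is an illustration rather than a substitute for the proofs.

```python
from fractions import Fraction
from math import factorial


def pochhammer(a, k):
    """Rising factorial (a)_k = a (a+1) ... (a+k-1), with (a)_0 = 1."""
    result = Fraction(1)
    for i in range(k):
        result *= a + i
    return result


def terminating_hypergeometric(upper, lower, z):
    """Exact value of a pFq series that terminates.

    Assumes integer parameters, at least one non-positive upper parameter,
    and lower parameters whose Pochhammer symbols do not vanish before the
    series stops (true for the parameters of Lemmas 2 and 3 with n >= 2).
    """
    last_k = min(-a for a in upper if a <= 0)
    total = Fraction(0)
    for k in range(last_k + 1):
        term = Fraction(z) ** k / factorial(k)
        for a in upper:
            term *= pochhammer(a, k)
        for b in lower:
            term /= pochhammer(b, k)
        total += term
    return total


def odd_double_factorial(k):
    """k!! for odd k >= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def lemma2_value(n):
    return Fraction(2 ** (n - 1), n)


def lemma3_value(n):
    ratio = Fraction(odd_double_factorial(2 * n - 1), 2 ** (n - 2) * factorial(n - 1))
    return Fraction(2 ** (n - 1), n * n) * (-1 + ratio)
```

For instance, with $n = 3$ the two series evaluate to $4/3$ and $11/9$, which agree with `lemma2_value(3)` and `lemma3_value(3)`.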
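It remains to compute ${}_2F_1\left(\begin{matrix} n,\ 3-n\\ 1-n\end{matrix};\frac12\right)$. To do it, we shall use the following formula (see footnote 2):

$${}_2F_1\left(\begin{matrix} a,\ 3-a\\ c\end{matrix};\frac12\right) = \frac{2^{3-c}\sqrt{\pi}\,\Gamma(c)}{(a-1)(a-2)}\left(\frac{c-2}{\Gamma\left(\frac{a+c}{2}-1\right)\Gamma\left(\frac{c-a+1}{2}\right)} - \frac{2}{\Gamma\left(\frac{a+c-3}{2}\right)\Gamma\left(\frac{c-a}{2}\right)}\right).$$

Footnote 1: See http://functions.wolfram.com/07.27.03.0118.01
Footnote 2: See http://functions.wolfram.com/07.23.03.0030.01

Again, we cannot apply this formula to $a = n$ and $c = 1-n$, and thus we use a passage to the limit:

$$\begin{aligned}
{}_2F_1\left(\begin{matrix} n,\ 3-n\\ 1-n\end{matrix};\frac12\right)
&= \lim_{\varepsilon\to 0}{}_2F_1\left(\begin{matrix} n,\ 3-n\\ 1-n+\varepsilon\end{matrix};\frac12\right)\\
&= \lim_{\varepsilon\to 0}\frac{2^{2+n-\varepsilon}\sqrt{\pi}\,\Gamma(1-n+\varepsilon)}{(n-1)(n-2)}\left(\frac{-n-1+\varepsilon}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(1-n+\frac{\varepsilon}{2}\right)} - \frac{2}{\Gamma\left(\frac{\varepsilon-2}{2}\right)\Gamma\left(\frac{1+\varepsilon-2n}{2}\right)}\right)\\
&= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\lim_{\varepsilon\to 0}\frac{(-n-1+\varepsilon)\,\Gamma(1-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(1-n+\frac{\varepsilon}{2}\right)} - \lim_{\varepsilon\to 0}\frac{2\,\Gamma(1-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon-2}{2}\right)\Gamma\left(\frac{1+\varepsilon-2n}{2}\right)}\right)\\
&= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\frac{n+1}{4\sqrt{\pi}} - \frac{(-1)^{n+2}\binom{-1/2}{n}\,n!}{(n-1)!\,\sqrt{\pi}}\right)\\
&= \frac{2^{2+n}}{(n-1)(n-2)}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^n}\right).
\end{aligned}$$

Replacing ${}_2F_1\left(\begin{matrix} n-1,\ 2-n\\ -n\end{matrix};\frac12\right)$ and ${}_2F_1\left(\begin{matrix} n,\ 3-n\\ 1-n\end{matrix};\frac12\right)$ in equation (1) by their values given above, we obtain

$${}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1\\ -n,\ -n\end{matrix};\frac12\right) = \frac{2^{n-1}}{n} - \frac{2^{n+2}}{2n^2}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^n}\right) = \frac{2^{n-1}}{n^2}\left(-1+\frac{(2n-1)!!}{2^{n-2}(n-1)!}\right),$$

as we claimed.

Lemma 4. For all real numbers $a, b$,

$${}_4F_3\left(\begin{matrix} 1,\ a,\ a+1/2,\ b\\ 2,\ 2a,\ b+1/2\end{matrix};1\right) = \frac{2b-1}{(a-1)(b-1)}\left(-1+{}_3F_2\left(\begin{matrix} a-1,\ a-1/2,\ b-1\\ 2a-1,\ b-1/2\end{matrix};1\right)\right).$$

Proof. By definition,

$${}_4F_3\left(\begin{matrix} 1,\ a,\ a+1/2,\ b\\ 2,\ 2a,\ b+1/2\end{matrix};1\right) = \sum_{k\ge 0}\frac{k!\,(a)_k(a+1/2)_k(b)_k}{(k+1)!\,(2a)_k(b+1/2)_k}\cdot\frac{1}{k!} = \sum_{k\ge 1}\frac{(a)_{k-1}(a+1/2)_{k-1}(b)_{k-1}}{k!\,(2a)_{k-1}(b+1/2)_{k-1}} = (*).$$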
Taking into account that

$$(a)_{k-1} = \frac{(a-1)_k}{a-1},\quad (a+1/2)_{k-1} = \frac{(a-1/2)_k}{a-1/2},\quad (b)_{k-1} = \frac{(b-1)_k}{b-1},\quad (2a)_{k-1} = \frac{(2a-1)_k}{2a-1},\quad (b+1/2)_{k-1} = \frac{(b-1/2)_k}{b-1/2},$$

the expression $(*)$ can be written as

$$(*) = \sum_{k\ge 1}\frac{(a-1)_k(a-1/2)_k(b-1)_k\,(2a-1)(b-1/2)}{(a-1)(a-1/2)(b-1)\,(2a-1)_k(b-1/2)_k}\cdot\frac{1}{k!} = \frac{2b-1}{(a-1)(b-1)}\left(-1+{}_3F_2\left(\begin{matrix} a-1,\ a-1/2,\ b-1\\ 2a-1,\ b-1/2\end{matrix};1\right)\right),$$

yielding the formula in the statement.

3. Mean total areas

For every $s \in \mathbb{Z}^+$, the total $s$-area of a phylogenetic tree $T$ is $D^{(s)}(T) = \sum_{1 \le i \ldots}$ […] 2 and for every $s \in \mathbb{Z}^+$,

$$\mu(D^{(s)})_n = \binom{n}{2}\frac{S^{(s)}_n}{(2n-3)!!}.$$

Proof. Using the previous lemma,

$$\mu(D^{(s)})_n = \frac{\sum_{T\in\mathcal{T}_n} D^{(s)}(T)}{|\mathcal{T}_n|} = \frac{\sum_{1 \le i \ldots}}{\ldots}$$ […]
