Nodal distances for rooted phylogenetic trees

Gabriel Cardona (1), Mercè Llabrés (1,2), Francesc Rosselló (1,2), and Gabriel Valiente (2,3)

(1) Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain
(2) Research Institute of Health Science (IUNICS), E-07122 Palma de Mallorca, Spain
(3) Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, E-08034 Barcelona, Spain

Abstract. Dissimilarity measures for (possibly weighted) phylogenetic trees based on the comparison of their vectors of path lengths between pairs of taxa have been present in the systematics literature since the early seventies. But, as far as rooted phylogenetic trees go, these vectors can only separate non-weighted binary trees, and therefore these dissimilarity measures are metrics only on this class. In this paper we overcome this problem by splitting in a suitable way each path length between two taxa into two lengths. We prove that the resulting splitted path lengths matrices single out arbitrary rooted phylogenetic trees with nested taxa and arcs weighted in the set of positive real numbers. This allows the definition of metrics on this general class by comparing these matrices by means of metrics in spaces $\mathcal{M}_n(\mathbb{R})$ of real-valued $n\times n$ matrices. We conclude this paper by establishing some basic facts about the metrics for non-weighted phylogenetic trees defined in this way using $L^p$ metrics on $\mathcal{M}_n(\mathbb{R})$, with $p\in\mathbb{N}\setminus\{0\}$.

1 Introduction

The exponential increase in the amount of available genomic and metagenomic data has produced an explosion in the number of phylogenetic trees proposed by researchers: according to Rokas [24], phylogeneticists are currently publishing an average of 15 phylogenetic trees per day.
Many such trees are alternative phylogenies for the same sets of organisms, obtained from different datasets or using different evolutionary models or different phylogenetic reconstruction algorithms [16]. This variety of phylogenetic trees makes methods for measuring the differences between phylogenetic trees necessary [13, Ch. 30], and the safest way to quantify these differences is by using a metric, for which zero difference means isomorphism.

The comparison of phylogenetic trees is also used to assess the stability of reconstruction methods [31], and it is essential to performing phylogenetic queries on databases [18]. Further, the need for comparing phylogenetic trees also arises in the comparative analysis of clustering results obtained using different methods or different distance matrices, and there is a growing interest in the assessment of clustering results in bioinformatics [15]. Recent applications of the comparison of phylogenetic-like trees also include the study of the similarity between sequences, or sets of sequences, by measuring the difference between their context trees [17]. In summary, and using the words of Steel and Penny [29], tree comparison metrics are an important aid in the study of evolution.

Many metrics for phylogenetic tree comparison have been proposed so far, including the Robinson-Foulds, or partition, metric [22, 23], the nearest-neighbor interchange metric [30], the subtree transfer distance [2], and the triples metric [9]. In the early seventies, several researchers proposed dissimilarity measures for (possibly weighted) rooted phylogenetic trees based on the comparison of the vectors of lengths of paths connecting pairs of taxa. The aim of these measures was to quantify the rate at which pairs of taxa that are close together in one tree lie at opposite ends in another tree [19].
These authors defined the dissimilarity between a pair of trees as the Euclidean distance between the corresponding vectors of path lengths [10, 11], the Manhattan distance between these vectors [31], or the correlation between these vectors [20]. Similar dissimilarity measures have also been defined for unrooted phylogenetic trees [6, 29]. Although different names have been used for these dissimilarity measures (cladistic difference [10], topological distance [20], path difference distance [29]), the term nodal distance seems to have prevailed [6, 21]. According to Steel and Penny [29], they have several interesting features that make them deserve more study and consideration.

The theoretical basis for these nodal distances is Smolenskii's theorem [28], which establishes that two unrooted phylogenetic trees $T$, $T'$ on the same set $S$ of taxa are isomorphic if, and only if, for every pair of leaves $i,j$, the distances between $i$ and $j$ in $T$ and in $T'$ are the same. This result was later expanded by Zaretskii [32], who characterized the vectors of distances between pairs of leaves of an unrooted phylogenetic tree by means of the well-known four-point condition. Smolenskii's and Zaretskii's papers were published in Russian, which has contributed to their results being rediscovered and generalized many times [3, 7, 8, 26]; for a modern textbook treatment of these results in all their generality (weighted unrooted trees with nested taxa), see [25, Ch. 7], and for a historical account, see [1].

Unfortunately, Smolenskii's theorem is not valid for arbitrary rooted phylogenetic trees: there exist non-isomorphic rooted phylogenetic trees with the same path lengths between pairs of leaves (see Figs. 1, 2, 3).
It turns out that only the fully resolved, or binary, non-weighted rooted phylogenetic trees are singled out by their path lengths vectors, and therefore the nodal distances based on the comparison of these vectors are metrics (more specifically, zero nodal distance means isomorphism) only on the space of non-weighted binary phylogenetic trees. Although this result seems to have been known since the time of the first proposals of nodal distances, we have not been able to find an explicit proof in the literature, and thus, for the sake of completeness, we include a simple proof of this fact in Section 3, reducing it to the general version of Smolenskii's result.

The main result of this paper is the definition of metrics on the space of arbitrary rooted phylogenetic trees that generalize the nodal distances, where arbitrary means not necessarily binary and with possibly nested taxa and arcs weighted in the set of positive real numbers. To do that, we split each path between two taxa into the paths from their least common ancestor to each taxon. In this way we associate to each rooted phylogenetic tree with $n$ taxa an $n\times n$ matrix, with rows and columns indexed by the taxa, whose $(i,j)$-entry contains the length of the path from the least common ancestor of the $i$-th and $j$-th taxa to the $i$-th taxon. We prove that these splitted path lengths matrices single out arbitrary rooted phylogenetic trees, and then we use them to define splitted nodal metrics on the space of weighted rooted phylogenetic trees with nested taxa by comparing these matrices through real-valued norms applied to their difference. We also prove some basic properties of the splitted nodal metrics on the space of non-weighted rooted phylogenetic trees obtained using the $L^p$ norms, with $p\in\mathbb{N}\setminus\{0\}$.
2 Notations and conventions

A rooted tree is a non-empty directed finite graph that contains a distinguished node, called the root, from which every other node can be reached through exactly one path. An $A$-weighted rooted tree, with $A\subseteq\mathbb{R}$, is a pair $(T,\omega)$ consisting of a rooted tree $T=(V,E)$ and a weight function $\omega:E\to A$ that associates to every arc $e\in E$ a real number $\omega(e)\in A$. In this paper we shall only consider two sets $A$ of weights: the set of non-negative real numbers $\mathbb{R}_{\geq 0}=\{t\in\mathbb{R}\mid t\geq 0\}$, and the set of positive real numbers $\mathbb{R}_{>0}=\{t\in\mathbb{R}\mid t>0\}$. When the set $A$ is irrelevant (for instance, in general definitions), we shall omit it and simply talk about weighted, instead of $A$-weighted, trees. We identify every non-weighted (that is, where no weight function has been explicitly defined) rooted tree $T$ with the weighted rooted tree $(T,\omega)$ with $\omega$ the weight 1 constant function.

Let $T=(V,E)$ be a rooted tree. Whenever $(u,v)\in E$, we say that $v$ is a child of $u$ and that $u$ is the parent of $v$. Every node in $T$ has exactly one parent, except the root, which has no parent. The number of children of a node is its out-degree. The nodes without children are the leaves of the tree, and the other nodes are called internal. An arc $(u,v)$ is internal when its head $v$ is internal, and pendant when $v$ is a leaf. The out-degree 1 nodes are called elementary. A tree is binary when all its internal nodes have out-degree 2.

Given a path $(v_0,v_1,\ldots,v_k)$ in a rooted tree $T$, its origin is $v_0$, its end is $v_k$, and its intermediate nodes are $v_1,\ldots,v_{k-1}$. Such a path is non-trivial when $k\geq 1$. We shall represent a path from $u$ to $v$, that is, a path with origin $u$ and end $v$, by $u\rightsquigarrow v$.
Whenever there exists a (non-trivial) path $u\rightsquigarrow v$, we shall say that $v$ is a (non-trivial) descendant of $u$ and also that $u$ is a (non-trivial) ancestor of $v$. If $v$ is a descendant of $u$, the path $u\rightsquigarrow v$ is unique. The distance from a node $u$ to a descendant $v$ of it in a weighted rooted tree is the sum of the weights of the arcs forming the unique path $u\rightsquigarrow v$; in a non-weighted rooted tree, this distance is simply the number of arcs of this path. The depth of a node $v$, in symbols $\mathrm{depth}_T(v)$, is the distance from the root to $v$.

The least common ancestor (LCA) of a pair of nodes $u,v$ of a rooted tree $T$, in symbols $[u,v]_T$, is the unique common ancestor of them that is a descendant of every other common ancestor of them. Alternatively, it is the unique common ancestor of $u,v$ such that the paths from it to $u$ and $v$ have only their origin in common. In particular, if one of the nodes, say $u$, is an ancestor of the other, then $[u,v]_T=u$.

Let $S$ be a non-empty finite set of labels, or taxa. A (weighted) phylogenetic tree on $S$ is a (weighted) rooted tree with some of its nodes, including all its leaves and its elementary nodes, bijectively labeled in the set $S$. In such a phylogenetic tree, we shall always identify, usually without any further mention, a labeled node with its taxon. The internal labeled nodes of a phylogenetic tree are called nested taxa.

Two phylogenetic trees $T$ and $T'$ on the same set $S$ of taxa are isomorphic when they are isomorphic as directed graphs and the isomorphism sends each labeled node of $T$ to the labeled node with the same label in $T'$; an isomorphism of weighted phylogenetic trees is also required to preserve arc weights. As usual, we shall use the symbol $\cong$ to denote the existence of an isomorphism.
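The definitions of distance, depth and least common ancestor given above can be made concrete with a minimal Python sketch. The representation is an assumption made only for illustration: a rooted tree is stored as a `parent` dictionary (the root maps to `None`), and `weight[v]` holds the weight of the arc with head `v`. The function names and the small example tree are hypothetical and do not come from the paper.

```python
def path_to_root(parent, v):
    """Return the nodes met when climbing from v to the root (v first)."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(parent, u, v):
    """Least common ancestor [u, v]_T: the first ancestor of v that is
    also an ancestor of u (every node counts as its own ancestor)."""
    ancestors_of_u = set(path_to_root(parent, u))
    for x in path_to_root(parent, v):
        if x in ancestors_of_u:
            return x
    raise ValueError("nodes do not belong to the same rooted tree")


def distance(parent, weight, ancestor, v):
    """Sum of the arc weights on the unique path from `ancestor` down to v."""
    total = 0
    while v != ancestor:
        if parent[v] is None:
            raise ValueError("first node is not an ancestor of the second")
        total += weight[v]
        v = parent[v]
    return total


# Hypothetical example: root r with children x and 3; x has children 1 and 2.
parent = {"r": None, "x": "r", 1: "x", 2: "x", 3: "r"}
weight = {"x": 1, 1: 1, 2: 1, 3: 1}  # non-weighted tree: every arc has weight 1

lca_12 = lca(parent, 1, 2)                 # the node x
lca_13 = lca(parent, 1, 3)                 # the root r
depth_1 = distance(parent, weight, "r", 1)  # depth of node 1
```

In this example `lca(parent, "x", 1)` returns `"x"`, in agreement with the remark that $[u,v]_T=u$ whenever $u$ is an ancestor of $v$.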
Although our main objects of study are weighted phylogenetic trees, which are rooted trees, unrooted trees will also appear in the next section. An unrooted tree is an undirected finite graph where every pair of nodes is connected by exactly one path. An $A$-weighted unrooted tree is a pair $(T,\omega)$ consisting of an unrooted tree $T=(V,E)$ and a weight function $\omega:E\to A$. The distance between two nodes in a weighted unrooted tree is the sum of the weights of the edges forming the unique path that connects these nodes.

An unrooted tree is partially labeled in a set $S$ when some of its nodes are bijectively labeled in the set $S$. An unrooted $S$-tree is an unrooted tree partially labeled in $S$ with all its leaves and all its nodes of degree 2 labeled. Given a phylogenetic tree $T=(V,E)$ on $S$, its unrooted version is the unrooted tree $T^u=(V,E^u)$ partially labeled in $S$ obtained by replacing each arc $(u,v)\in E$ by an edge $\{u,v\}\in E^u$, and keeping the labels.

The notion of isomorphism for (possibly weighted) partially labeled unrooted trees is similar to the notion given in the rooted case. Notice that if $T_1=(V_1,E_1)$ and $T_2=(V_2,E_2)$ are two phylogenetic trees on the same set $S$ of taxa, with roots $r_1$ and $r_2$, respectively, then a mapping $f:V_1\to V_2$ is an isomorphism between $T_1$ and $T_2$ if, and only if, it is an isomorphism between $T_1^u$ and $T_2^u$ and $f(r_1)=r_2$.

3 Path lengths separate non-weighted binary phylogenetic trees

Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on the set $S=\{1,\ldots,n\}$. For every $i,j\in S$, let $\ell_T(i,j)$ and $\ell_T(j,i)$ denote the distances from $[i,j]_T$ to $i$ and $j$, respectively. The path length between two labeled nodes $i$ and $j$ is
$$L_T(i,j)=\ell_T(i,j)+\ell_T(j,i).$$

Definition 1.
The path lengths vector of $T$ is the vector
$$L(T)=\big(L_T(i,j)\big)_{1\leq i<j\leq n}$$
[...] $\mathbb{R}_{>0}$-weighted unrooted $S$-tree; see also Thm. 7.1.8 in [25].

Proposition 1. Two non-weighted binary phylogenetic trees on the same set $S$ of taxa are isomorphic if, and only if, they have the same path lengths vectors.

Proof. The 'only if' implication is obvious. As far as the 'if' implication goes, let $T_1$ and $T_2$ be two non-weighted binary phylogenetic trees on the same set $S$ with the same path lengths vectors. If $|S|=1$, the equivalence in the statement is obvious, because every phylogenetic tree with only one labeled node consists only of one node. So we assume henceforth that $|S|\geq 2$.

For every $t=1,2$, let $(T_t^*,\omega_t)$ be the $\mathbb{R}_{>0}$-weighted unrooted $S$-tree defined as follows:
– If the root of $T_t$ is labeled, then $T_t^*=T_t^u$ and all edges of $T_t^*$ have weight 1.
– If the root $r_t$ of $T_t$ is not labeled, and if $u_t,v_t$ are the children of $r_t$, then $T_t^*$ is obtained from $T_t^u$ by removing the node $r_t$ and replacing the edges $\{r_t,u_t\}$, $\{r_t,v_t\}$ by a single edge $\{u_t,v_t\}$, and then all edges of $T_t^*$ have weight 1, except $\{u_t,v_t\}$, which has weight 2.

It is straightforward to check that such a $T_t^*$ is always an unrooted $S$-tree: the root $r_t$ of $T_t$ is the only degree 2 node in $T_t^u$ and then, if it is labeled, $T_t^u$ is an unrooted $S$-tree, and if it is not labeled, we remove it in the construction of $T_t^*$ without modifying the degrees of the remaining nodes. Moreover, it is also obvious from the construction that the distance between any pair of labeled nodes in $T_t^*$ is equal to the path length between these nodes in $T_t$. In particular, $T_1^*$ and $T_2^*$ have the same distances between each pair of labeled nodes. Then, by [25, Thm. 7.1.8], $T_1^*\cong T_2^*$ as weighted unrooted $S$-trees.
It remains to check that this isomorphism induces an isomorphism of phylogenetic trees $T_1\cong T_2$. To do it, notice that, since the isomorphism between $T_1^*$ and $T_2^*$ preserves edge weights, there are only two possibilities:
– All edges in $T_1^*$ and $T_2^*$ have weight 1. In this case $T_1^*=T_1^u$ and $T_2^*=T_2^u$ and the isomorphism $T_1^u\cong T_2^u$ sends the root of $T_1$ to the root of $T_2$, because they are the only degree 2 nodes in $T_1^*$ and $T_2^*$. Therefore, it induces an isomorphism $T_1\cong T_2$.
– Both $T_1^*$ and $T_2^*$ have one weight 2 edge, say $\{u_1,v_1\}$ and $\{u_2,v_2\}$, respectively. Then each $T_t^u$ is obtained from $T_t^*$ by adding the root $r_t$ of $T_t$ and splitting the edge $\{u_t,v_t\}$ into two edges $\{u_t,r_t\}$ and $\{v_t,r_t\}$. Since the isomorphism $T_1^*\cong T_2^*$ sends $\{u_1,v_1\}$ to $\{u_2,v_2\}$, its extension to a mapping $V_1\to V_2$ by sending $r_1$ to $r_2$ defines an isomorphism $T_1^u\cong T_2^u$ that sends the root of $T_1$ to the root of $T_2$, and hence an isomorphism $T_1\cong T_2$. □

Let $\mathcal{BT}_n$ be the class of all non-weighted binary phylogenetic trees on $S=\{1,\ldots,n\}$. The injectivity up to isomorphisms of the mapping
$$L:\mathcal{BT}_n\to\mathbb{R}^{n(n-1)/2},\qquad T\mapsto L(T)$$
makes the classical definitions of nodal metrics on $\mathcal{BT}_n$ induced by metrics on $\mathbb{R}^{n(n-1)/2}$ yield, indeed, metrics. For example, recall that the $L^p$ norm on $\mathbb{R}^m$ is defined as
$$\|(x_1,\ldots,x_m)\|_p=\begin{cases}\big|\{i\mid i=1,\ldots,m,\ x_i\neq 0\}\big| & \text{if } p=0\\[2pt] \sqrt[p]{\sum_{i=1}^m |x_i|^p} & \text{if } p\in\mathbb{N}^+\\[2pt] \max\{|x_i|\mid i=1,\ldots,m\} & \text{if } p=\infty\end{cases}$$
where, here and henceforth, $\mathbb{N}^+$ stands for $\mathbb{N}\setminus\{0\}$. Each $L^p$ norm on $\mathbb{R}^{n(n-1)/2}$ then induces a metric on $\mathcal{BT}_n$ through the formula
$$d_p(T_1,T_2)=\|L(T_1)-L(T_2)\|_p.$$
Some of these metrics have been present in the literature since the early seventies.
For instance, Farris [10] introduced the metric on $\mathcal{BT}_n$ induced by the $L^2$, or Euclidean, norm on $\mathbb{R}^{n(n-1)/2}$:
$$d_2(T_1,T_2)=\sqrt{\sum_{1\leq i<j\leq n}\big(L_{T_1}(i,j)-L_{T_2}(i,j)\big)^2}$$
[...] $\mathbb{R}_{>0}$-weighted binary phylogenetic trees with the same path lengths vectors.

Remark 1. Let $T$ be a non-weighted binary phylogenetic tree on a set $S$ of taxa. Since the path lengths vector $L(T)$ is the vector of distances of a (possibly weighted) unrooted $S$-tree (see the proof of Proposition 1), it is well-known (see, for instance, Lem. 7.1.7 in [25]) that it satisfies the four-point condition: for every $a,b,c,d\in S$,
$$L_T(a,b)+L_T(c,d)\leq\max\{L_T(a,c)+L_T(b,d),\ L_T(a,d)+L_T(b,c)\}.$$
Zaretskii's theorem [32] establishes that any dissimilarity measure on $S$ satisfying this four-point condition is given by the distances between labeled nodes in an $\mathbb{R}_{>0}$-weighted unrooted $S$-tree (see Thm. 7.2.6 in [25]). But, to our knowledge, it is not known what extra properties should be required of such a dissimilarity measure on $S$ to guarantee that it is given by the path lengths between labeled nodes in a non-weighted binary phylogenetic tree.

4 Splitted path lengths separate arbitrary phylogenetic trees

Let $(T,\omega)$, with $T=(V,E)$, be again an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S=\{1,\ldots,n\}$ and, for every $i,j\in S$, let $\ell_T(i,j)$ and $\ell_T(j,i)$ still denote the distances from $[i,j]_T$ to $i$ and $j$, respectively.

Definition 2. The splitted path lengths matrix of $T$ is the $n\times n$ square matrix over $\mathbb{R}_{\geq 0}$
$$\ell(T)=\begin{pmatrix}\ell_T(1,1) & \ell_T(1,2) & \ldots & \ell_T(1,n)\\ \ell_T(2,1) & \ell_T(2,2) & \ldots & \ell_T(2,n)\\ \vdots & \vdots & \ddots & \vdots\\ \ell_T(n,1) & \ell_T(n,2) & \ldots & \ell_T(n,n)\end{pmatrix}\in\mathcal{M}_n(\mathbb{R}_{\geq 0}).$$

Notice that this matrix need not be symmetric (see the next example), but all entries $\ell_T(i,i)$ in its main diagonal are 0.
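Definition 2 can be illustrated with a minimal Python sketch that builds $\ell(T)$ directly from the definition. The tree encoding (a `parent` dictionary, a `weight` dictionary indexed by the head of each arc, and a `label_of` dictionary from taxa to nodes) is an assumption made for this illustration, as are the function name and the small example tree. The sketch climbs from each taxon to the root for every pair, so it is a direct transcription of the definition and makes no attempt to be an asymptotically optimal algorithm.

```python
def splitted_path_lengths(parent, weight, label_of):
    """Return the matrix whose (i, j) entry is the distance from the
    least common ancestor of taxa i and j down to taxon i.

    parent:   node -> parent node (None for the root)
    weight:   node -> weight of the arc whose head is that node
    label_of: taxon -> node carrying that taxon
    """
    def climb(v):
        # Pairs (ancestor, distance from that ancestor down to v), v first.
        pairs, dist = [(v, 0)], 0
        while parent[v] is not None:
            dist += weight[v]
            v = parent[v]
            pairs.append((v, dist))
        return pairs

    taxa = sorted(label_of)
    up = {i: climb(label_of[i]) for i in taxa}
    matrix = []
    for i in taxa:
        row = []
        for j in taxa:
            ancestors_of_j = {a for a, _ in up[j]}
            # Climbing from i, the first node that is also an ancestor of j
            # is the LCA; its recorded distance is the (i, j) entry.
            row.append(next(d for a, d in up[i] if a in ancestors_of_j))
        matrix.append(row)
    return matrix


# Hypothetical tree: the root carries taxon 3 (a nested taxon), it has one
# unlabeled child x, and x has two leaf children carrying taxa 1 and 2.
parent = {"r": None, "x": "r", "a": "x", "b": "x"}
weight = {"x": 1, "a": 1, "b": 1}
label_of = {1: "a", 2: "b", 3: "r"}

ell = splitted_path_lengths(parent, weight, label_of)
# ell == [[0, 1, 2], [1, 0, 2], [0, 0, 0]]
```

The resulting matrix is not symmetric and has a zero diagonal; the row of taxon 3 is identically zero because the node carrying it is an ancestor of every other labeled node.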
The splitted path lengths matrix $\ell(T)$ of a tree $T\in\mathcal{T}_n$ can be computed in optimal $O(n^2)$ time, by computing by breadth-first search, for each internal node of $T$, the distance to each one of its descendant taxa and the pairs of taxa of which it is the LCA.

Example 1. The splitted path lengths matrices of the trees $T$ and $T'$ depicted in Fig. 1 are
$$\ell(T)=\begin{pmatrix}0&1&1&1\\1&0&1&1\\2&2&0&1\\2&2&1&0\end{pmatrix},\qquad \ell(T')=\begin{pmatrix}0&1&2&2\\1&0&2&2\\1&1&0&1\\1&1&1&0\end{pmatrix}.$$
The splitted path lengths matrices of the trees $T$ and $T'$ depicted in Fig. 2 are
$$\ell(T)=\begin{pmatrix}0&1&2\\1&0&2\\0&0&0\end{pmatrix},\qquad \ell(T')=\begin{pmatrix}0&2&1\\0&0&0\\1&2&0\end{pmatrix}.$$
The splitted path lengths matrices of the weighted trees $T$ and $T'$ depicted in Fig. 3 are
$$\ell(T)=\begin{pmatrix}0&1\\2&0\end{pmatrix},\qquad \ell(T')=\begin{pmatrix}0&2\\1&0\end{pmatrix}.$$
This example shows that the splitted path lengths matrices can separate pairs of phylogenetic trees that could not be separated by means of their path lengths vectors.

Our main result in this section states that these matrices characterize arbitrary $\mathbb{R}_{>0}$-weighted phylogenetic trees. To prove it, it is convenient to establish first some lemmas, and to recall a result from [14].

Lemma 1. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. A label $i\in S$ is a nested taxon of $T$ if, and only if, $\ell_T(i,j)=0$ for some $j\neq i$.

Proof. If an internal node of $T$ is labeled with $i$, then taking as $j\in S$ any descendant leaf of $i$ we have that $[i,j]_T=i$ and hence $\ell_T(i,j)=0$. Conversely, if $\ell_T(i,j)=0$, then $[i,j]_T=i$ and therefore the node $i$ is an ancestor of the node $j$. If $i\neq j$, this can only happen if $i$ is internal. □

Lemma 2. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. For every $i\in S$, consider the set of weights
$$W_i=\{\ell_T(i,j)\mid j\in S,\ \ell_T(i,j)>0\}.$$
(a) $W_i=\emptyset$ if, and only if, $i$ is the root of $T$.
(b) If $W_i\neq\emptyset$, then its smallest element $w_i$ is the weight of the arc with head $i$.

Proof.
As far as (a) goes, $W_i=\emptyset$ if, and only if, $\ell_T(i,j)=0$ for every $j\in S$, that is, if, and only if, $i$ is an ancestor of every labeled node. Since the set of labeled nodes of $T$ includes all leaves and all elementary nodes, this is equivalent to the fact that $i$ is the root.

As far as (b) goes, assume that $W_i\neq\emptyset$, so that $i$ has a parent $x$. Let $w_i$ be the weight of the arc $(x,i)$. Then, since every non-trivial path $[i,j]_T\rightsquigarrow i$ must end with the arc $(x,i)$, it is clear that if $\ell_T(i,j)>0$, then $\ell_T(i,j)\geq w_i$. Now, if $x$ is labeled, say with label $i_0$, then $x=[i,i_0]_T$ and thus $\ell_T(i,i_0)=w_i$. If $x$ is not labeled, then it cannot be elementary, and hence it must have at least another child $y$. Let $i_0$ be a descendant leaf of $y$. In this case, $x=[i,i_0]_T$ and $\ell_T(i,i_0)=w_i$, too. This proves that, in all cases, $w_i\in W_i$, and thus that it is the smallest element of this set. □

The following result is a direct consequence of the last two lemmas.

Corollary 1. Let $T$ and $T'$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa such that $\ell(T)=\ell(T')$. Then:
(a) The nested taxa of $T$ and $T'$ are the same.
(b) $T$ has its root labeled with $i$ if, and only if, $T'$ has its root labeled with $i$.
(c) If the nodes labeled with $i$ in $T$ and $T'$ are not their roots, the weight of the arc with head $i$ in $T$ and in $T'$ is the same. □

Let $S$ be a set of taxa and $R(S)$ the set of $S$-triples, that is, of structures $ab|c$ with $a,b,c\in S$ pairwise different. Classically, an $S$-triple $ab|c$ is said to be present in a phylogenetic tree $T$ if $c$ diverged from $a$ before $b$ did, in the sense that $[a,b]_T<[a,c]_T=[b,c]_T$, that is, $[a,b]_T$ is a non-trivial descendant of $[a,c]_T=[b,c]_T$. Let now $(T,\omega)$ be an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on $S$.
For every $ab|c\in R(S)$, let $\lambda_T(ab|c)\in\mathbb{R}_{\geq 0}$ be defined as follows:
– If $ab|c$ is present in $T$, then $\lambda_T(ab|c)$ is the distance from $[a,c]_T=[b,c]_T$ to $[a,b]_T$.
– If $ab|c$ is not present in $T$, then $\lambda_T(ab|c)=0$.
Notice that $\lambda_T(ab|c)=\lambda_T(ba|c)$. This mapping $\lambda_T$ has a simple description in terms of $\ell(T)$.

Lemma 3. Let $(T,\omega)$ be an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on $S$. For every $ab|c\in R(S)$,
$$\lambda_T(ab|c)=\max\{\ell_T(a,c)-\ell_T(a,b),\ 0\}.$$

Proof. If $[a,c]_T$ is a non-trivial ancestor of $[a,b]_T$ in $T$, then the path $[a,c]_T\rightsquigarrow a$ contains the node $[a,b]_T$ and the distance $\ell_T(a,c)$ from $[a,c]_T$ to $a$ is equal to the distance $\lambda_T(ab|c)$ from $[a,c]_T$ to $[a,b]_T$ plus the distance $\ell_T(a,b)$ from $[a,b]_T$ to $a$. Therefore, in this case,
$$\max\{\ell_T(a,c)-\ell_T(a,b),\ 0\}=\ell_T(a,c)-\ell_T(a,b)=\lambda_T(ab|c).$$
If $[a,c]_T=[a,b]_T$, then $\ell_T(a,c)=\ell_T(a,b)$ and $ab|c$ is not present in $T$, and thus
$$\max\{\ell_T(a,c)-\ell_T(a,b),\ 0\}=0=\lambda_T(ab|c).$$
Finally, if $[a,c]_T$ is not an ancestor of $[a,b]_T$, then it must happen that $[a,b]_T$ is a non-trivial ancestor of $[a,c]_T$ and therefore $\ell_T(a,b)\geq\ell_T(a,c)$. Since $ab|c$ is not present in $T$, either, this implies that
$$\max\{\ell_T(a,c)-\ell_T(a,b),\ 0\}=0=\lambda_T(ab|c).$$
So, the equality in the statement always holds. □

The following result is Thm. 2 in [14]. In it, $Q(X)$ denotes the set of $X$-quartets, that is, of structures $ab|cd$ with $a,b,c,d\in X$ pairwise different.

Theorem 1. Let $\lambda:R(S)\to\mathbb{R}_{\geq 0}$ be a map such that $\lambda(ab|c)=\lambda(ba|c)$ for every $a,b,c\in S$ pairwise different, and let $z$ be an element not in $S$.
Then:
(a) $\lambda=\lambda_T$ for some $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree $(T,\omega)$ with neither nested taxa nor weight 0 internal arcs if, and only if, the mapping $\mu:Q(S\cup\{z\})\to\mathbb{R}_{\geq 0}$ defined by
$$\mu(ab|cd)=\begin{cases}\lambda(ab|c) & \text{if } d=z\\ \min\{\lambda(ab|c),\lambda(ab|d)\}+\min\{\lambda(cd|a),\lambda(cd|b)\} & \text{if } d\neq z\end{cases}$$
satisfies the following properties:
(1) $\mu(ab|cd)=\mu(ba|cd)=\mu(cd|ab)$.
(2) For every $a,b,c,d$, at least two of $\mu(ab|cd)$, $\mu(ac|bd)$, and $\mu(ad|bc)$ are equal to 0.
(3) If $\mu(ab|cd)>0$, then, for every $x\neq a,b,c,d$, either $\mu(ab|cx)\cdot\mu(ab|dx)>0$ or $\mu(ax|cd)\cdot\mu(bx|cd)>0$.
(4) For every $a,b,c,d,e$, if $\mu(ab|cd)>\mu(ab|ce)>0$, then $\mu(ae|cd)=\mu(ab|cd)-\mu(ab|ce)$.
(5) For every $a,b,c,d,e$, if $\mu(ab|cd)>0$ and $\mu(bc|de)>0$, then $\mu(ab|de)=\mu(ab|cd)+\mu(bc|de)$.
(b) If $(T,\omega)$ and $(T',\omega')$ are two $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs and such that $\lambda_T=\lambda_{T'}$, then $T\cong T'$ as phylogenetic trees and the isomorphism preserves the weights of the internal arcs. □

Now we can proceed with the proof that splitted path lengths matrices characterize $\mathbb{R}_{>0}$-weighted phylogenetic trees.

Theorem 2. Two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa are isomorphic if, and only if, they have the same splitted path lengths matrices.

Proof. As in Proposition 1, the statement when $|S|=1$ is obviously true. Assume now that $|S|\geq 2$.

For every $\mathbb{R}_{>0}$-weighted phylogenetic tree $(T,\omega)$ on $S$, let $(\overline{T},\overline{\omega})$ be the $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree without nested taxa obtained as follows: for every internal labeled node $i$ of $T$, unlabel it and add to it a leaf child labeled with $i$ through an arc of weight 0.
It is straightforward to check that $\ell_{\overline{T}}(i,j)=\ell_T(i,j)$ for every $i,j\in S$. Since $T$ was $\mathbb{R}_{>0}$-weighted, the only weight 0 arcs in $\overline{T}$ are the new pendant arcs that replace the nested taxa. Moreover, $(T,\omega)$ can be recovered from $(\overline{T},\overline{\omega})$ by simply removing the weight 0 pendant arcs and labeling the tail of a removed arc with the label of the arc's head.

Let now $(T_1,\omega_1)$ and $(T_2,\omega_2)$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa such that $\ell(T_1)=\ell(T_2)$. Then $\ell(\overline{T}_1)=\ell(\overline{T}_2)$ and hence, by Lemma 3, $\lambda_{\overline{T}_1}=\lambda_{\overline{T}_2}$. Since $(\overline{T}_1,\overline{\omega}_1)$ and $(\overline{T}_2,\overline{\omega}_2)$ are $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs, by Theorem 1.(b) we have that $\overline{T}_1\cong\overline{T}_2$ as phylogenetic trees, and moreover this isomorphism preserves the weights of the internal arcs. But we also know that the arc ending in the leaf $i$ has the same weight in $\overline{T}_1$ and in $\overline{T}_2$: if $i$ was a nested taxon of $T_1$ and $T_2$ (and recall that $T_1$ and $T_2$ have the same nested taxa by Corollary 1.(a)), this weight is in both cases 0, and if $i$ was the label of a leaf of $T_1$ and $T_2$, this weight is the same in $T_1$ and in $T_2$ by Corollary 1.(c), and hence in $\overline{T}_1$ and in $\overline{T}_2$. Therefore, the isomorphism $\overline{T}_1\cong\overline{T}_2$ is an isomorphism of weighted phylogenetic trees. Finally, the way $(T_1,\omega_1)$ and $(T_2,\omega_2)$ are reconstructed from $(\overline{T}_1,\overline{\omega}_1)$ and $(\overline{T}_2,\overline{\omega}_2)$ implies that this isomorphism induces an isomorphism of weighted phylogenetic trees $T_1\cong T_2$.

This proves the 'if' implication; the 'only if' implication is obvious. □

Remark 2.
The proof of the last theorem can also be applied, with small modifications, to prove that the splitted path lengths matrices also separate $\mathbb{R}_{>0}$-weighted phylogenetic trees with multi-labeled nodes, that is, where a node can have more than one label (but two different nodes cannot share any label); in such a tree $T$, if $i$ and $j$ are labels of the same node, then $\ell_T(i,j)=\ell_T(j,i)=0$. It is enough to slightly change the definition of $\overline{T}$: on the one hand, for every internal labeled node of $T$, unlabel it and, for each one of its labels, add to it a leaf child labeled with this label through an arc of weight 0; and, on the other hand, do the same for every leaf with more than one label. The same argument as in the proof of the last theorem shows that if $T_1$ and $T_2$ are two $\mathbb{R}_{>0}$-weighted phylogenetic trees with multi-labeled nodes such that $\ell(T_1)=\ell(T_2)$, then the $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs $\overline{T}_1$ and $\overline{T}_2$ obtained in this way are isomorphic. To derive from this isomorphism an isomorphism $T_1\cong T_2$, one must use that, in this multi-labeled case:
– An internal node of a tree $T$ is labeled $\{i_1,\ldots,i_k\}$ if, and only if, $\ell_T(a,b)=0$ for every $a,b\in\{i_1,\ldots,i_k\}$, $\ell_T(a,j)>0$ or $\ell_T(j,a)>0$ for every $a\in\{i_1,\ldots,i_k\}$ and every $j\notin\{i_1,\ldots,i_k\}$, and there exists some $j\notin\{i_1,\ldots,i_k\}$ such that $\ell_T(a,j)=0$ for every $a\in\{i_1,\ldots,i_k\}$.
– A leaf of $T$ is labeled $\{i_1,\ldots,i_k\}$ if, and only if, $\ell_T(a,b)=0$ for every $a,b\in\{i_1,\ldots,i_k\}$, and $\ell_T(a,j)>0$ for every $a\in\{i_1,\ldots,i_k\}$ and every $j\notin\{i_1,\ldots,i_k\}$.
These properties entail that if $\ell(T_1)=\ell(T_2)$, then $T_1$ and $T_2$ have the same families of sets $\{i_1,\ldots,i_k\}$ of labels of internal nodes as well as of leaves. We leave the details to the reader.

Notice that Theorem 1 not only establishes that the mapping $\lambda_T$ singles out an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree $T$ with neither nested taxa nor weight 0 internal arcs, up to the weights of its pendant arcs, but it also characterizes what mappings can be realized as $\lambda_T$-mappings, for some $T$ of this type. We can use this result to characterize the matrices that are splitted path lengths matrices of $\mathbb{R}_{>0}$-weighted phylogenetic trees.

Proposition 2. Let $M=(m_{i,j})\in\mathcal{M}_n(\mathbb{R}_{\geq 0})$ be an $n\times n$ square matrix over $\mathbb{R}_{\geq 0}$ with $m_{i,i}=0$ for every $i=1,\ldots,n$. Then, $M=\ell(T)$ for some $\mathbb{R}_{>0}$-weighted phylogenetic tree $T$ on $S=\{1,\ldots,n\}$ if, and only if, the mapping $\lambda_M:R(S)\to\mathbb{R}_{\geq 0}$ defined by
$$\lambda_M(ab|c)=\max\{m_{a,c}-m_{a,b},\ 0\}$$
satisfies the following conditions:
(a) $\lambda_M(ab|c)=\lambda_M(ba|c)$ for every $a,b,c\in S$ pairwise different.
(b) The mapping $\mu_M$ defined from $\lambda_M$ as in Theorem 1.(a) satisfies properties (1)–(5) therein.

Proof. The 'only if' implication is easy: if $M=\ell(T)$, so that $m_{i,j}=\ell_T(i,j)$ for every $i,j\in S$, then $\lambda_M=\lambda_{\overline{T}}$, with $\overline{T}$ the $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree without nested taxa or weight 0 internal arcs associated to $T$ in the proof of Theorem 2, and therefore it satisfies conditions (a) and (b) in the statement.

Conversely, if $\lambda_M$ satisfies conditions (a) and (b), then by Theorem 1 there exists an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree $T_0$ without nested taxa or weight 0 internal arcs such that $\lambda_M=\lambda_{T_0}$. By Lemma 3,
$$\lambda_{T_0}(ab|c)=\max\{\ell_{T_0}(a,c)-\ell_{T_0}(a,b),\ 0\}.$$
Therefore, for every $a,b,c\in S$ pairwise different,
$$\max\{\ell_{T_0}(a,c)-\ell_{T_0}(a,b),\ 0\}=\max\{m_{a,c}-m_{a,b},\ 0\}.$$
The tree $T_0$ is unique up to the weights of the pendant arcs.
So, without any loss of generality we may assume that the weight of the arc ending in the leaf $a$ is $\min\{m_{a,j} \mid j \neq a\}$.

Now, for every $a \in S$ and for every $b \in S \setminus \{a\}$, $b$ is a descendant of the parent $x_a$ of $a$ in $T_0$ if, and only if, $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$. As far as the 'if' implication goes, assume that $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$ but $b$ is not a descendant of $x_a$. Let $c \in S \setminus \{a\}$ be a descendant of $x_a$, so that $[a,c]_{T_0} = x_a$. Then, $[a,c]_{T_0}$ is a non-trivial descendant of $[a,b]_{T_0}$ and therefore (since the internal arcs of $T_0$ have positive weight), $\ell_{T_0}(a,b) - \ell_{T_0}(a,c) > 0$. But this contradicts the fact that, since $m_{a,c} \geq m_{a,b}$,
$$\ell_{T_0}(a,b) - \ell_{T_0}(a,c) = \lambda_{T_0}(ac|b) = \lambda_M(ac|b) = \max\{m_{a,b} - m_{a,c},\, 0\} = 0.$$
As far as the converse implication goes, let $b \in S \setminus \{a\}$ be a descendant of $x_a$, and let $b' \in S \setminus \{a\}$ be such that $m_{a,b'} = \min\{m_{a,j} \mid j \neq a\}$: as we have just seen, $b'$ is also a descendant of $x_a$ and therefore $[a,b]_{T_0} = [a,b']_{T_0} = x_a$. Then, $\max\{m_{a,b} - m_{a,b'},\, 0\} = \lambda_{T_0}(ab'|b) = 0$ implies that $m_{a,b} - m_{a,b'} \leq 0$, that is, that $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$, too.

Now, let us fix a taxon $a \in S$, and let $b \in S \setminus \{a\}$ be a descendant of the parent $x_a$ of $a$ in $T_0$. Then, on the one hand, $\ell_{T_0}(a,b) = m_{a,b}$, because it is the weight of the arc $(x_a, a)$, and, on the other hand, for every $c \neq a,b$, we have that $m_{a,c} \geq m_{a,b}$ and $\ell_{T_0}(a,c) \geq \ell_{T_0}(a,b)$ and therefore
$$m_{a,c} = \lambda_M(ab|c) + m_{a,b} = \lambda_{T_0}(ab|c) + \ell_{T_0}(a,b) = \ell_{T_0}(a,c).$$
This implies that the $a$-th rows of $M$ and $\ell(T_0)$ are equal, and hence, since $a$ was any element of $S$, $M = \ell(T_0)$.
Finally, $T_0$ is transformed into an $\mathbb{R}_{>0}$-weighted phylogenetic tree with the same splitted path lengths matrix by simply removing the weight 0 pendant arcs and labeling the tail of a removed arc with the label of the arc's head; cf. the proof of Theorem 2. □

5 Splitted nodal metrics

Let $\mathcal{T}_n$ be the space of $\mathbb{R}_{>0}$-weighted phylogenetic trees on the set $S = \{1,\ldots,n\}$ of taxa. As we have seen, the mapping
$$\ell : \mathcal{T}_n \longrightarrow \mathcal{M}_n(\mathbb{R}_{\geq 0})$$
that associates to each $(T,\omega) \in \mathcal{T}_n$ its splitted path lengths matrix $\ell(T)$ is injective up to isomorphisms. As happened with the embedding $L : \mathcal{BT}_n \hookrightarrow \mathbb{R}^{n(n-1)/2}$, this allows one to induce metrics on $\mathcal{T}_n$ from metrics on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.

Proposition 3. Let $D$ be any metric on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$. The mapping
$$d : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{R}_{\geq 0}, \qquad (T_1,T_2) \mapsto D(\ell(T_1), \ell(T_2))$$
satisfies the axioms of metrics up to isomorphisms:
(1) $d(T_1,T_2) \geq 0$,
(2) $d(T_1,T_2) = 0$ if, and only if, $T_1 \cong T_2$,
(3) $d(T_1,T_2) = d(T_2,T_1)$,
(4) $d(T_1,T_3) \leq d(T_1,T_2) + d(T_2,T_3)$.

Proof. Properties (1), (3) and (4) are direct consequences of the corresponding properties of $D$, while property (2) follows from the separation axiom for $D$ (which says that $D(M_1,M_2) = 0$ if, and only if, $M_1 = M_2$) and Theorem 2. □

We shall generically call splitted nodal metrics the metrics on $\mathcal{T}_n$ induced by metrics on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$ through the embedding $\ell$. In particular, every $L^p$ norm $\|\cdot\|_p$ on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$ defines a splitted nodal metric $d^s_p$ through
$$d^s_p(T_1,T_2) = \|\ell(T_1) - \ell(T_2)\|_p.$$
For instance,
$$d^s_1(T_1,T_2) = \sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|, \qquad d^s_2(T_1,T_2) = \sqrt{\sum_{1 \leq i \neq j \leq n} (\ell_{T_1}(i,j) - \ell_{T_2}(i,j))^2}$$
are the splitted nodal metrics induced by the $L^1$ and $L^2$ norms on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.
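To make the definition concrete, the following Python sketch computes the splitted path lengths matrix of a small weighted tree with a nested taxon and the induced distances $d^s_0$, $d^s_1$, $d^s_2$ and $d^s_\infty$. It is only an illustration under stated assumptions, not the authors' implementation: the tree encoding (a dictionary `parent` mapping each non-root node to its parent and arc weight, and a dictionary from taxa to node names), the function names and the two example trees are all hypothetical, and the least common ancestor is found by a naive walk to the root with no attempt at efficiency.

```python
from math import inf

def ancestors(node, parent):
    """Return [(node, 0), (parent, w), ...]: every ancestor of `node`
    paired with the total arc weight from that ancestor down to `node`."""
    chain, dist = [(node, 0)], 0
    while node in parent:
        node, weight = parent[node]
        dist += weight
        chain.append((node, dist))
    return chain

def splitted_path_lengths(parent, label_node, n):
    """Matrix whose (i, j) entry is the distance from the least common
    ancestor of the nodes labeled i and j down to the node labeled i."""
    ell = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        chain_i = ancestors(label_node[i], parent)
        for j in range(1, n + 1):
            if i == j:
                continue
            anc_j = {node for node, _ in ancestors(label_node[j], parent)}
            # The first ancestor of i that is also an ancestor of j is the LCA.
            ell[i - 1][j - 1] = next(d for node, d in chain_i if node in anc_j)
    return ell

def splitted_nodal_distance(ell1, ell2, p):
    """d^s_p for p = 0 (number of differing entries), a positive integer p,
    or p = inf (largest absolute difference)."""
    diffs = [abs(a - b) for r1, r2 in zip(ell1, ell2) for a, b in zip(r1, r2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    if p == inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

# Hypothetical example trees on S = {1, 2, 3}.
# Tree A: the node labeled 2 is the parent of leaf 1 (a nested taxon).
parent_a = {"x1": ("x2", 0.5), "x2": ("r", 2.0), "x3": ("r", 1.5)}
labels_a = {1: "x1", 2: "x2", 3: "x3"}
# Tree B: a star tree with unit weights.
parent_b = {"y1": ("q", 1.0), "y2": ("q", 1.0), "y3": ("q", 1.0)}
labels_b = {1: "y1", 2: "y2", 3: "y3"}

ell_a = splitted_path_lengths(parent_a, labels_a, 3)
ell_b = splitted_path_lengths(parent_b, labels_b, 3)
for p in (0, 1, 2, inf):
    print(p, splitted_nodal_distance(ell_a, ell_b, p))
```

Note that the matrix of tree A is not symmetric: the entry for the pair (2, 1) is 0 because the node labeled 2 is itself the least common ancestor, while the entry for (1, 2) is the weight of the arc below it.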
We have seen in the previous section that the splitted path lengths matrices can be computed in $O(n^2)$ time. Their difference can be computed in $O(n^2)$ time, and the sum of the $p$-th powers of the entries of the resulting matrix can be computed in $O(n^2 \log(p) + n^2)$ time (assuming constant-time addition and multiplication of real numbers). Therefore, the cost of computing $d^s_p(T_1,T_2)^p$, for $T_1,T_2 \in \mathcal{T}_n$ and $p \in \mathbb{N}^+$, is $O(n^2 \log(p) + n^2)$. Thus, if $p = 1$, the $d^s_1$ metric on $\mathcal{T}_n$ can be computed in $O(n^2)$ time. For $p \geq 2$, the cost of computing $d^s_p(T_1,T_2)$, for $T_1,T_2 \in \mathcal{T}_n$, as the $p$-th root of $d^s_p(T_1,T_2)^p$ will depend on the accuracy with which this root is computed. For instance, using the Newton method to compute it to within a fraction $1/2^h$ of its value has a cost of $O(p^2 \log(p) \log(hp))$; see, for instance, [4]. So, in practice, for small $p$ and not too large $h$, this step will be dominated by the computation of $d^s_p(T_1,T_2)^p$, and the total cost will be $O(n^2)$ (we understand in this case $\log(p)$ as part of the constant factor). For $p = 0$ or $\infty$, the cost of computing $d^s_p(T_1,T_2)$ is also $O(n^2)$ time.

These splitted nodal metrics can be seen conceptually as the generalizations to $\mathcal{T}_n$ of the classical nodal metrics on $\mathcal{BT}_n$. Conceptually, but not numerically, because the restriction of $d^s_p$ to $\mathcal{BT}_n$ is not equal to $d_p$, even up to a scalar factor, as the following easy example shows.

Example 2. Consider the non-weighted binary trees $T_1$, $T_2$, $T_3$ depicted in Fig. 4.
It is easy to compute their path lengths vectors and splitted path lengths matrices:
$$L(T_1) = (3,4,4,3,3,2), \quad L(T_2) = (2,3,4,3,4,3), \quad L(T_3) = (4,4,3,2,3,3)$$
$$\ell(T_1) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 2 & 0 & 1 & 1 \\ 3 & 2 & 0 & 1 \\ 3 & 2 & 1 & 0 \end{pmatrix}, \quad \ell(T_2) = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 2 & 3 \\ 1 & 1 & 0 & 2 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \quad \ell(T_3) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 0 & 1 & 2 \\ 3 & 1 & 0 & 2 \\ 2 & 1 & 1 & 0 \end{pmatrix}.$$

Fig. 4. The non-weighted binary phylogenetic trees in Example 2.

From these vectors and matrices we obtain that
$$d_p(T_1,T_2) = d_p(T_1,T_3) = \begin{cases} 4 & \text{if } p = 0 \\ \sqrt[p]{4} & \text{if } p \in \mathbb{N}^+ \\ 1 & \text{if } p = \infty \end{cases}$$
while
$$d^s_p(T_1,T_2) = \begin{cases} 10 & \text{if } p = 0 \\ \sqrt[p]{6 + 4 \cdot 2^p} & \text{if } p \in \mathbb{N}^+ \\ 2 & \text{if } p = \infty \end{cases} \qquad d^s_p(T_1,T_3) = \begin{cases} 6 & \text{if } p = 0 \\ \sqrt[p]{6} & \text{if } p \in \mathbb{N}^+ \\ 1 & \text{if } p = \infty \end{cases}$$
This shows that there does not exist any $\lambda \in \mathbb{R}$ such that $d^s_p = \lambda \cdot d_p$ on $\mathcal{BT}_4$ for any $p \in \mathbb{N} \cup \{\infty\}$. Similar counterexamples can be produced for every $n \geq 4$.

The following inequality relates $d_p$ and $d^s_p$ on any $\mathcal{BT}_n$.

Proposition 4. For every $T_1,T_2 \in \mathcal{BT}_n$ and for every $p \in \mathbb{N} \cup \{\infty\}$,
$$d_p(T_1,T_2) \leq \begin{cases} d^s_p(T_1,T_2) & \text{if } p = 0 \\ 2^{1-\frac{1}{p}}\, d^s_p(T_1,T_2) & \text{if } p \in \mathbb{N}^+ \\ 2\, d^s_p(T_1,T_2) & \text{if } p = \infty \end{cases}$$

Proof. For every $T \in \mathcal{BT}_n$, let $L^*(T)$ be the symmetric matrix $L^*(T) = \ell(T) + \ell(T)^t$. Notice that the $(i,j)$-th and the $(j,i)$-th entries of $L^*(T)$ are both equal to $L_T(i,j)$. Now, by the usual properties of norms,
$$\|L^*(T_1) - L^*(T_2)\|_p = \|\ell(T_1) + \ell(T_1)^t - (\ell(T_2) + \ell(T_2)^t)\|_p \leq \|\ell(T_1) - \ell(T_2)\|_p + \|\ell(T_1)^t - \ell(T_2)^t\|_p = 2\,\|\ell(T_1) - \ell(T_2)\|_p.$$
On the other hand, $L^*(T_1) - L^*(T_2)$ can be understood as two concatenated copies of $L(T_1) - L(T_2)$ and therefore,
$$\|L^*(T_1) - L^*(T_2)\|_p = \begin{cases} 2\,\|L(T_1) - L(T_2)\|_p & \text{if } p = 0 \\ \sqrt[p]{2} \cdot \|L(T_1) - L(T_2)\|_p & \text{if } p \in \mathbb{N}^+ \\ \|L(T_1) - L(T_2)\|_p & \text{if } p = \infty \end{cases}$$
Combining this equality with the previous inequality we obtain the inequality in the statement. □

6 The non-weighted case

Although weights enrich the topological structure of a phylogenetic tree, for instance by adding probabilities, bootstrap values or divergence degrees to branches, the comparison of non-weighted phylogenetic trees, as bare hierarchical classifications or evolutive histories, has an interest in itself.

Let $\mathcal{NT}_n$ denote the class of all non-weighted phylogenetic trees on $S = \{1,\ldots,n\}$. Felsenstein [12] gave a recurrent formula for the number $U(n,m)$ of different trees in $\mathcal{NT}_n$ with $m$ unlabeled internal nodes, from which the total number $|\mathcal{NT}_n|$ of different non-weighted phylogenetic trees on $n$ taxa can be computed: see Table 2 in [12] or sequence A005264 in [27]. Table 1 recalls the first values of $|\mathcal{NT}_n|$.

n                    1    2    3     4      5       6         7
$|\mathcal{NT}_n|$   1    3    22    262    4336    91984     2381408

Table 1. The values of $|\mathcal{NT}_n|$ for $n$ up to 7.

In this section we gather some results on the splitted nodal metrics $d^s_p$, for $p \in \mathbb{N}^+$, on $\mathcal{NT}_n$, and we report on some numerical experiments for $d^s_1$ and $d^s_2$ on this class.

To simplify the notations, for every $a,b \in S$ and $p \in \mathbb{N}^+$, we shall write $C^p_{T_1,T_2}(a,b)$ to denote $|\ell_{T_1}(a,b) - \ell_{T_2}(a,b)|^p$. In this way, if $T_1,T_2 \in \mathcal{NT}_n$ and $p \in \mathbb{N}^+$, then
$$d^s_p(T_1,T_2)^p = \sum_{(a,b) \in S^2} C^p_{T_1,T_2}(a,b) \in \mathbb{N}.$$

Our first result shows that the metrics $d^s_p$ have a redundant factor on $\mathcal{NT}_n$ when $n$ is odd.

Lemma 4. If $n$ is odd, then $\|\ell(T)\|_1$ is even, for every $T \in \mathcal{NT}_n$.

Proof.
Let $T = (V,E)$ be a non-weighted phylogenetic tree on $S = \{1,\ldots,n\}$ with $n$ odd. For every $e \in E$, let $\nu_\ell(e)$ be the number of paths $[i,j] \rightsquigarrow i$, with $i,j \in S$, that contain the arc $e$. It is clear that
$$\|\ell(T)\|_1 = \sum_{1 \leq i \neq j \leq n} \ell_T(i,j) = \sum_{e \in E} \nu_\ell(e).$$
It turns out that if $n$ is odd, then every $\nu_\ell(e)$ is even and therefore the right-hand side sum is even. Indeed, let $e = (u,v)$ be any arc and let $S_v \subseteq S$ be the set of labels of the labeled nodes that are descendants of $v$. Then, $e$ is contained in a path $[i,j] \rightsquigarrow i$ if, and only if, $i \in S_v$ and $j \notin S_v$. This shows that $\nu_\ell(e) = |S_v| \cdot |S - S_v|$. Now, since $|S|$ is odd, either $|S_v|$ or $|S - S_v|$ is even, which implies that $\nu_\ell(e)$ is even. □

Proposition 5. If $n$ is odd, then $d^s_p(T_1,T_2)^p$ is even, for every $T_1,T_2 \in \mathcal{NT}_n$ and for every $p \in \mathbb{N}^+$.

Proof. Let $T_1,T_2 \in \mathcal{NT}_n$, with $n$ odd. Then
$$d^s_p(T_1,T_2)^p = \sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j).$$
Now, we know that $\sum_{1 \leq i \neq j \leq n} \ell_{T_1}(i,j)$ and $\sum_{1 \leq i \neq j \leq n} \ell_{T_2}(i,j)$ are even numbers. This implies that the number
$$|\{(i,j) \in S^2 \mid C^p_{T_1,T_2}(i,j) \text{ odd}\}| = |\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) - \ell_{T_2}(i,j) \text{ odd}\}|$$
is even, and hence that the sum $\sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j)$ is even. □

This result shows that if $n$ is odd, $d^s_1$ takes only even values on $\mathcal{NT}_n$, and therefore it can be divided by 2 and the resulting values are still integer numbers. In a similar way, $d^s_2$ has a 'redundant' $\sqrt{2}$ factor on $\mathcal{NT}_n$, for $n$ odd. No similar result holds for even values of $n$: for instance, $\mathcal{NT}_2$ consists of three trees $T_1$, $T_2$, $T_3$, with Newick strings (1,2);, ((1)2);, and ((2)1);, respectively, and $d^s_1(T_1,T_2) = d^s_1(T_1,T_3) = 1$, $d^s_1(T_2,T_3) = 2$.

Remark 3. The theses in the last two results are true in the more general setting of $\mathbb{N}^+$-weighted phylogenetic trees.
To see it, notice that if $(T,\omega)$ is such a tree, then
$$\|\ell(T)\|_1 = \sum_{1 \leq i \neq j \leq n} \ell_T(i,j) = \sum_{e \in E} \omega(e) \cdot \nu_\ell(e)$$
and then, the proof that each $\nu_\ell(e)$ is even is the same as in the non-weighted case. On the other hand, the thesis in the last proposition does not generalize to $p = 0$ or $\infty$: it is easy to produce counterexamples showing that $d^s_0$ and $d^s_\infty$ take odd values on $\mathcal{NT}_3$.

Our next goal is to find the least value for $d^s_p$ on $\mathcal{NT}_n$, for $p \in \mathbb{N}^+$.

Lemma 5. Let $T_1,T_2 \in \mathcal{NT}_n$ with $n \geq 6$ and $p \in \mathbb{N}^+$. If there is some taxon that is a leaf of largest depth in $T_1$ but not in $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$.

Proof. To simplify the notations, and since in this proof the trees $T_1$, $T_2$ and the index $p$ are fixed, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Assume, without any loss of generality, that 1 is a deepest leaf of $T_1$ and that 2 is a leaf of $T_2$ such that $\mathrm{depth}_{T_2}(2) > \mathrm{depth}_{T_2}(1)$. Then, the distance from $[1,2]_{T_2}$ to 2 will be larger than to 1. This implies that $\ell_{T_2}(2,1) > \ell_{T_2}(1,2)$. Since $\ell_{T_1}(2,1) \leq \ell_{T_1}(1,2)$ (because $\mathrm{depth}_{T_1}(2) \leq \mathrm{depth}_{T_1}(1)$), it must happen that $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or $\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$, and therefore
$$C(1,2) + C(2,1) \geq 1.$$

Let us check now that, for every $a \in S \setminus \{1,2\}$, at least one of the following four equalities does not hold:
$$\ell_{T_2}(1,a) = \ell_{T_1}(1,a), \quad \ell_{T_2}(2,a) = \ell_{T_1}(2,a), \quad \ell_{T_2}(a,1) = \ell_{T_1}(a,1), \quad \ell_{T_2}(a,2) = \ell_{T_1}(a,2) \qquad (1)$$
This will imply that every $a \in S \setminus \{1,2\}$ contributes at least 1 to $d^s_p(T_1,T_2)^p$, in the sense that
$$C(1,a) + C(2,a) + C(a,1) + C(a,2) \geq 1.$$
Since there are at least 4 taxa in $S \setminus \{1,2\}$ and these contributions add up to $C(1,2) + C(2,1)$, this will prove that $d^s_p(T_1,T_2)^p \geq 5$.
The way each $a \in S \setminus \{1,2\}$ contributes to $d^s_p(T_1,T_2)^p$ depends on its relative position with respect to 1 and 2 in $T_2$.

– If $a \leq 1$, then $\ell_{T_2}(1,a) = 0$ but $\ell_{T_1}(1,a) > 0$ and therefore $\ell_{T_2}(1,a) \neq \ell_{T_1}(1,a)$.
– Assume that $[a,1]_{T_2} = [a,2]_{T_2} \geq [1,2]_{T_2}$. In this case $\ell_{T_2}(a,2) = \ell_{T_2}(a,1)$ and $\ell_{T_2}(2,a) > \ell_{T_2}(1,a)$. But these relations cannot hold in $T_1$, because they imply that $\mathrm{depth}_{T_1}(2) > \mathrm{depth}_{T_1}(1)$. Thus, the equalities (1) cannot hold simultaneously.
– Assume that $1 < [a,1]_{T_2} < [1,2]_{T_2}$. In this case $\lambda_{T_2}(a1|2) > 0$ and
$$\ell_{T_2}(a,1) + \lambda_{T_2}(a1|2) = \ell_{T_2}(a,2), \qquad \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) = \ell_{T_2}(1,2), \qquad \ell_{T_2}(2,a) = \ell_{T_2}(2,1).$$
If $\ell_{T_1}(a,1) = \ell_{T_2}(a,1)$ and $\ell_{T_1}(a,2) = \ell_{T_2}(a,2)$, then the fact that $\ell_{T_1}(a,2) > \ell_{T_1}(a,1)$ implies that $1 < [a,1]_{T_1} < [1,2]_{T_1}$ and thus
$$\lambda_{T_1}(a1|2) = \ell_{T_1}(a,2) - \ell_{T_1}(a,1) = \ell_{T_2}(a,2) - \ell_{T_2}(a,1) = \lambda_{T_2}(a1|2).$$
Then, if $\ell_{T_1}(1,a) = \ell_{T_2}(1,a)$,
$$\ell_{T_1}(1,2) = \ell_{T_1}(1,a) + \lambda_{T_1}(a1|2) = \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) = \ell_{T_2}(1,2).$$
Finally, if $\ell_{T_1}(2,a) = \ell_{T_2}(2,a)$, then
$$\ell_{T_1}(2,1) = \ell_{T_1}(2,a) = \ell_{T_2}(2,a) = \ell_{T_2}(2,1).$$
And this leads to a contradiction, because, as we have seen at the beginning of the proof, $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or $\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$. Therefore, the equalities (1) cannot hold simultaneously.
– If $2 < [a,2]_{T_2} < [1,2]_{T_2}$, a similar argument shows that at least one of the equalities (1) fails, too.

This finishes the proof of the lemma. □

Fig. 5. Two non-isomorphic phylogenetic trees $T$, $T'$ in $\mathcal{NT}_n$ such that $d^s_p(T,T')^p = 4$ for every $p \in \mathbb{N}^+$.

Theorem 3. For every $p \in \mathbb{N}^+$ and for every $n \geq 2$:
(1) If $n \leq 5$, then $\min\{d^s_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = n - 1$.
(2) If $n \geq 6$, then $\min\{d^s_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = 4$.

Proof. The cases $n = 1$ to 5 can be checked 'by hand' through the computation of the distances between all pairs of trees in $\mathcal{NT}_n$. In the case $n = 1$, there is only one tree in $\mathcal{NT}_1$, and, as we mentioned after Proposition 5, $\mathcal{NT}_2$ consists only of three trees $T_1$, $T_2$, $T_3$, with Newick strings (1,2);, ((1)2);, and ((2)1);, respectively, and it can be seen that $d^s_p(T_1,T_2)^p = d^s_p(T_1,T_3)^p = 1$, $d^s_p(T_2,T_3)^p = 2$. As far as the cases $n = 3, 4, 5$ go, the files {3,4,5}-tree-nt-pairs.dat available at the Supplementary Material web page contain the values of $d^s_p(T_1,T_2)^p$ for each (unordered) pair of trees $\{T_1,T_2\}$ in the corresponding $\mathcal{NT}_n$.

Now, for $n \geq 5$, we shall prove by induction on $n$ that $d^s_p(T_1,T_2)^p \geq 4$ for every pair of different trees $T_1,T_2 \in \mathcal{NT}_n$. Since it is easy to produce pairs of trees $T_1,T_2 \in \mathcal{NT}_n$ such that $d^s_p(T_1,T_2)^p = 4$, like for instance those depicted in Fig. 5, this will finish the proof of the statement.

The starting point for the induction procedure is $n = 5$: we know (by direct inspection of the file 5-tree-nt-pairs.dat) that $d^s_p(T_1,T_2)^p \geq 4$ for every pair of different trees $T_1,T_2 \in \mathcal{NT}_5$. Assume now that this inequality holds for every two trees in $\mathcal{NT}_n$, for some $n \geq 5$, and let us prove it for $\mathcal{NT}_{n+1}$. So, let $T_1,T_2 \in \mathcal{NT}_{n+1}$ be a pair of different trees. As in the last proof, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Without any loss of generality, we assume that $n+1$ is a leaf of largest depth in $T_1$. By Lemma 5, if $n+1$ is not a deepest leaf of $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$.
So, in the rest of the proof we assume that $n+1$ is also a deepest leaf of $T_2$. In particular, in both trees, the siblings of $n+1$ (if they exist) are also deepest leaves.

We distinguish now two main cases, each one divided into several subcases.

(a) Assume that the parent of $n+1$ in $T_1$ is labeled, say with $n$. This implies that
$$\ell_{T_1}(n,n+1) = 0, \qquad \ell_{T_1}(n+1,n) = 1,$$
$$\ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) + 1 \quad \text{for every } a \in S \setminus \{n,n+1\},$$
$$\ell_{T_1}(a,n+1) = \ell_{T_1}(a,n) \quad \text{for every } a \in S \setminus \{n,n+1\}.$$
We distinguish the following subcases.

(a.1) Assume that, in $T_2$, the node $n$ is an ancestor of $n+1$, but not its parent. In this case, $\ell_{T_2}(n+1,n) > 1$, and therefore
$$C(n+1,n) \geq 1.$$
Now, let $a \in S \setminus \{n,n+1\}$. Let us see that $a$ contributes at least 1 to $d^s_p(T_1,T_2)^p$.

– If $n > [a,n+1]_{T_2}$ (that is, if $a$ is a descendant of an intermediate node in the path $n \rightsquigarrow n+1$), then $\ell_{T_2}(a,n+1) < \ell_{T_2}(a,n)$ and therefore, since $\ell_{T_1}(a,n+1) = \ell_{T_1}(a,n)$, it must happen that $\ell_{T_1}(a,n+1) \neq \ell_{T_2}(a,n+1)$ or $\ell_{T_1}(a,n) \neq \ell_{T_2}(a,n)$, which implies that
$$C(a,n) + C(a,n+1) \geq 1.$$
– If $n \leq [a,n+1]_{T_2}$ in $T_2$, then
$$\ell_{T_2}(n+1,a) = \ell_{T_2}(n,a) + \ell_{T_2}(n+1,n) > \ell_{T_2}(n,a) + 1,$$
and therefore, since $\ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) + 1$, it must happen that $\ell_{T_1}(n+1,a) \neq \ell_{T_2}(n+1,a)$ or $\ell_{T_1}(n,a) \neq \ell_{T_2}(n,a)$, and hence
$$C(n,a) + C(n+1,a) \geq 1.$$

Since there are at least 4 taxa other than $n$ and $n+1$, and their contributions add up to $C(n+1,n)$, we conclude that, in this case, $d^s_p(T_1,T_2)^p \geq 5$.

(a.2) Assume that, in $T_2$, the node $n$ is not an ancestor of $n+1$; set
$$\ell_{T_2}(n,n+1) = x \geq 1, \qquad \ell_{T_2}(n+1,n) = y \geq 1.$$
If $x \geq y$, then $\mathrm{depth}_{T_2}(n) \geq \mathrm{depth}_{T_2}(n+1)$ and thus, since $n+1$ was a deepest leaf of $T_2$, $n$ would also be a deepest leaf of $T_2$. But $n$ is not a deepest leaf of $T_1$ and therefore, in this case, we already know by Lemma 5 that $d^s_p(T_1,T_2)^p \geq 5$.

Assume now that $x < y$. Then, $y \geq 2$ and thus, on the one hand,
$$C(n+1,n) + C(n,n+1) = (y-1)^p + x^p \geq 2$$
and, on the other hand, the path $[n+1,n]_{T_2} \rightsquigarrow n+1$ has at least one intermediate node: let $a_0 \neq n+1$ be a labeled node that is a descendant of the parent of $n+1$ (notice that, in this case, $a_0$ is either the parent of $n+1$ or its sibling). Then,
$$\ell_{T_2}(a_0,n+1) < \ell_{T_2}(a_0,n), \qquad \ell_{T_2}(n+1,a_0) = 1 \leq \ell_{T_2}(n,a_0)$$
imply that
$$C(a_0,n+1) + C(a_0,n) \geq 1, \qquad C(n+1,a_0) + C(n,a_0) \geq 1.$$
So, in this case, $d^s_p(T_1,T_2)^p \geq 4$.

(a.3) Assume that, in $T_2$, the node $n+1$ is a leaf and its parent is $n$. Let $T_1^*, T_2^* \in \mathcal{NT}_n$ be the trees obtained from $T_1$ and $T_2$, respectively, by removing the leaf $n+1$ together with its pendant arc. After this operation, we have that, for every $1 \leq a \neq b \leq n$, $\ell_{T_i^*}(a,b) = \ell_{T_i}(a,b)$ and therefore $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then,
$$d^s_p(T_1,T_2)^p \geq \sum_{1 \leq a \neq b \leq n} C(a,b) = \sum_{1 \leq a \neq b \leq n} C^p_{T_1^*,T_2^*}(a,b) = d^s_p(T_1^*,T_2^*)^p \geq 4,$$
the last inequality being given by the induction hypothesis.

(b) Assume now that the parent of $n+1$ in $T_1$ is not labeled. Therefore, $n+1$ must have at least one sibling, which, we recall, is a leaf. Without any loss of generality we assume that $n$ is a sibling of $n+1$.
In this case, we have that
$$\ell_{T_1}(n,n+1) = \ell_{T_1}(n+1,n) = 1,$$
$$\ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) > 0 \quad \text{for every } a \in S \setminus \{n,n+1\},$$
$$\ell_{T_1}(a,n+1) = \ell_{T_1}(a,n) \quad \text{for every } a \in S \setminus \{n,n+1\}.$$
Notice moreover that $n$ is also a deepest leaf in $T_1$ and therefore, by Lemma 5, if it is not a deepest leaf in $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$. So, we assume henceforth that $n$ and $n+1$ are deepest leaves in $T_2$.

As in (a), there are several subcases to discuss.

(b.1) Assume that, in $T_2$, the leaves $n$ and $n+1$ are not siblings. In this case,
$$\ell_{T_2}(n,n+1) = x \geq 1, \qquad \ell_{T_2}(n+1,n) = y \geq 1$$
and $x > 1$ or $y > 1$. Since the depths of $n$ and $n+1$ in $T_2$ are the same, it must happen that $x = y$. Then,
$$C(n,n+1) + C(n+1,n) = (x-1)^p + (x-1)^p \geq 2.$$
Let now $a_0$ be a labeled node, other than $n$, that is a descendant of the parent of $n$ in $T_2$: notice that this parent is an intermediate node in the path $[n,n+1]_{T_2} \rightsquigarrow n$. Then,
$$\ell_{T_2}(n,a_0) = 1 < x = \ell_{T_2}(n+1,a_0), \qquad \ell_{T_2}(a_0,n) < \ell_{T_2}(a_0,n+1)$$
imply that $a_0$ contributes at least 2 to $d^s_p(T_1,T_2)^p$, and therefore that $d^s_p(T_1,T_2)^p \geq 4$. Actually, $d^s_p(T_1,T_2)^p \geq 6$, because any labeled node $b_0 \neq n+1$ that is a descendant of the parent of $n+1$ in $T_2$ will also contribute at least 2 to $d^s_p(T_1,T_2)^p$.

(b.2) Assume that, in $T_2$, the leaves $n$ and $n+1$ are siblings and their parent is labeled, say with 1. In this case, by (a) (applied interchanging the roles of $T_1$ and $T_2$ and the roles of $n$ and 1), we already know that $d^s_p(T_1,T_2)^p \geq 4$.

(b.3) Assume that, in $T_2$, the nodes $n$ and $n+1$ are sibling leaves and their parent is not labeled.
In this case, let $T_1^*, T_2^* \in \mathcal{NT}_n$ be the trees obtained from $T_1$ and $T_2$, respectively, by removing the leaves $n$ and $n+1$ together with their pendant arcs, and labeling with $n$ the former parent of $n$ and $n+1$. In this way we have that, for every $1 \leq a \neq b \leq n$ and for every $i = 1,2$,
$$\ell_{T_i^*}(a,b) = \ell_{T_i}(a,b) \ \text{ if } a \neq n, \qquad \ell_{T_i^*}(n,b) = \ell_{T_i}(n,b) - 1 \ \text{ if } a = n,$$
and therefore, $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then, arguing as in (a.3),
$$d^s_p(T_1,T_2)^p \geq d^s_p(T_1^*,T_2^*)^p \geq 4.$$
This finishes the proof by induction. □

Remark 4. Following in detail the arguments developed in the last theorem to their last consequences, it can be proved that, for $n \geq 6$, the pairs of trees $T_1,T_2$ in $\mathcal{NT}_n$ such that $d^s_p(T_1,T_2)^p = 4$, for every $p \in \mathbb{N}^+$, are exactly those pairs such that $d^s_1(T_1,T_2) = 4$, and they have the following form. Let $i_1, i_2, i_3$ be any three taxa in $S$ and let $T_0$ be any non-weighted rooted tree with some of its nodes, including all its elementary nodes and all its leaves except at most one elementary node or one leaf, labeled in $S \setminus \{i_1,i_2,i_3\}$. Then, $T_1$ and $T_2$ are obtained, respectively, by attaching to $T_0$ at the same node the 'basic' trees $T'_1$ and $T'_2$ or $T''_1$ and $T''_2$ in Fig. 6. The attachment of one of these trees at a node $v$ in $T_0$ is carried out by identifying the node with the root of the tree, and in such a way that the resulting trees $T_1$ and $T_2$ have all their leaves and elementary nodes labeled. This implies that if $T_0$ had some non-labeled leaf or elementary node, this is necessarily the node where the basic trees must be attached, and that (since $T''_2$ has its root elementary), the basic pair $T''_1$, $T''_2$ cannot be attached to a non-labeled leaf (this would create an elementary node in $T_2$). For instance, the trees $T$ and $T'$ in Fig.
5 are obtained by attaching the basic trees $T'_1$ and $T'_2$ (with $i_1 = 1$, $i_2 = 2$, and $i_3 = 3$) to the tree with Newick code (4,...,n);.

Fig. 6. The pairs of basic trees $T'_1$, $T'_2$ and $T''_1$, $T''_2$ that give rise, when attached to the same place in a tree, to pairs of non-weighted phylogenetic trees at $d^s_p$ distance $\sqrt[p]{4}$.

Remark 5. It can be checked that the pairs of different trees in $\mathcal{NT}_n$ at least distance for $d^s_1$ always have splitted path lengths matrices with $n - 1$ (if $n \leq 5$) or 4 (if $n > 5$) entries that differ by only 1. This implies that the least non-zero value for $d^s_\infty$ on $\mathcal{NT}_n$ is always 1, and that the least non-zero value for $d^s_0$ on $\mathcal{NT}_n$ is again $n - 1$ for $n \leq 5$ and 4 for $n \geq 6$.

Unfortunately, we have not been able to find a formula for the diameter of $\mathcal{NT}_n$ with respect to any metric $d^s_p$ with $p \in \mathbb{N}^+$. Actually, and to our knowledge, the diameter of the space of non-weighted binary phylogenetic trees with respect to the nodal metrics $d_1$ and $d_2$ is still not known, either.

Not knowing a formula for the diameter, we are not able to give an explicit description of the distribution of distances for any $p$, either. In the file distributions.pdf in the Supplementary Material we provide the distributions of $d^s_1$ and $(d^s_2)^2$ (that is, of $d^s_2$ squared) on $\mathcal{NT}_n$ for $n = 3, 4, 5, 6$, as well as the distributions of the values of $d^s_1$ and $(d^s_2)^2$ applied to pairs of trees in TreeBASE sharing $n = 2$ to 6 labels.

7 Conclusions

Some classical metrics for phylogenetic trees are based on the comparison of the representations of rooted phylogenetic trees as vectors of path lengths between pairs of labeled nodes.
But these metrics only separate non-weighted binary rooted trees: two more general non-isomorphic rooted phylogenetic trees can have the same vectors of path lengths, and therefore be at zero distance for these metrics.

In this paper we have overcome this problem by representing a rooted phylogenetic tree by means of a matrix with rows and columns indexed by taxa and where every entry $(i,j)$ is the distance from the least common ancestor of the pair of nodes labeled with $i$ and $j$ to the node labeled with $i$. We call these matrices splitted path lengths matrices, because they split in two terms the path length between every pair of labeled nodes. These matrices define an injective mapping from the space $\mathcal{T}_n$ of all $\mathbb{R}_{>0}$-weighted rooted phylogenetic trees with $n$ labeled nodes and possibly nested taxa into the set $\mathcal{M}_n(\mathbb{R})$ of $n \times n$ real-valued matrices. Therefore, any norm on $\mathcal{M}_n(\mathbb{R})$ applied to the difference of the splitted path lengths matrices of trees defines a metric on $\mathcal{T}_n$. Using the well-known $L^p$ norms on $\mathcal{M}_n(\mathbb{R})$, for $p \in \mathbb{N} \cup \{\infty\}$, we obtain the family of splitted nodal metrics $d^s_p$ on $\mathcal{T}_n$:
$$d^s_p(T_1,T_2) = \begin{cases} |\{(i,j) \mid 1 \leq i \neq j \leq n,\ \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}| & \text{if } p = 0 \\ \sqrt[p]{\sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|^p} & \text{if } p \in \mathbb{N}^+ \\ \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid 1 \leq i \neq j \leq n\} & \text{if } p = \infty \end{cases}$$

We have proved several properties for these metrics $d^s_p$ on the subspace $\mathcal{NT}_n$ of non-weighted rooted phylogenetic trees possibly with nested taxa. For instance, we have established the least distance between any pair of such trees. It remains as an open problem to find the diameter of $\mathcal{NT}_n$ with respect to these metrics, and the distribution of their values. Actually, these problems also remain open for the classical nodal distances on non-weighted binary (rooted as well as unrooted) trees.
These are interesting problems: knowing the largest value reached by a metric is necessary to normalize the metric between 0 and 1, while knowing the distribution of the values allows one to answer the question of whether two trees are more similar than expected by chance [19]. We hope to report on these problems in the near future.

We cannot advocate the use of any splitted nodal metric $d^s_p$ over the other ones except, perhaps, warning against the use of
$$d^s_0(T_1,T_2) = |\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}|, \qquad d^s_\infty(T_1,T_2) = \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid (i,j) \in S^2\}$$
because they are too uninformative. Since the most popular norms on $\mathbb{R}^m$ are the Manhattan and the Euclidean, it seems natural to use $d^s_1$ and $d^s_2$, as has been the case in the classical, non-weighted binary setting. Each one has its advantages. For instance, the computation of $d^s_1$ does not involve square roots, and therefore it can be computed exactly and, if the weights are integer numbers, the resulting value is an integer number. Moreover, it is well known that, for every $p \in \mathbb{N}^+$, $\|x\|_p \leq \|x\|_1$ for every $x \in \mathbb{R}^m$ and therefore, $d^s_p(T_1,T_2) \leq d^s_1(T_1,T_2)$ for every $T_1,T_2 \in \mathcal{T}_n$. On the other hand, the comparison of splitted path lengths matrices by means of the Euclidean norm enables the use of many geometric and clustering methods that are not available otherwise. For instance, the specific properties of the Euclidean norm allowed Steel and Penny to compute explicitly the mean value of the nodal distance $d_2$ on the class of non-weighted unrooted binary trees [29], while no similar result is known for $d_1$.
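These two remarks (exactness of $d^s_1$ for integer weights, and the bound of $d^s_p$ by $d^s_1$) can be checked numerically on the trees of Example 2. The sketch below is only an illustration: the matrices are copied from Example 2, while the dictionary `ELL` and the helper names `power_sum` and `ds` are hypothetical names introduced for this example.

```python
# Splitted path lengths matrices of T1, T2, T3, copied from Example 2.
ELL = {
    1: [[0, 1, 1, 1], [2, 0, 1, 1], [3, 2, 0, 1], [3, 2, 1, 0]],
    2: [[0, 1, 2, 3], [1, 0, 2, 3], [1, 1, 0, 2], [1, 1, 1, 0]],
    3: [[0, 1, 1, 1], [3, 0, 1, 2], [3, 1, 0, 2], [2, 1, 1, 0]],
}

def power_sum(a, b, p):
    """d^s_p(T_a, T_b)^p: an exact integer when all weights are integers."""
    return sum(abs(x - y) ** p
               for row_a, row_b in zip(ELL[a], ELL[b])
               for x, y in zip(row_a, row_b))

def ds(a, b, p):
    """d^s_p(T_a, T_b) for a positive integer p (a float, rounded for p >= 2)."""
    return power_sum(a, b, p) ** (1.0 / p)

# d^s_1 needs no root extraction, so it is computed exactly.
ds1_12, ds1_13 = power_sum(1, 2, 1), power_sum(1, 3, 1)
# d^s_2 is a square root and is only available in floating point here.
ds2_12, ds2_13 = ds(1, 2, 2), ds(1, 3, 2)

print(ds1_12, ds2_12)
print(ds1_13, ds2_13)
```

In this sketch the $p$-th power sums are kept as integers and the root is taken only at the end, which mirrors the two-step cost analysis given earlier for $d^s_p(T_1,T_2)^p$ and its $p$-th root.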
As a rule of thumb, we consider it suitable to use $d^s_1$ when the trees are non-weighted (or when they have integer weights), because these trees can be seen as discrete objects and thus their comparison through a discrete tool such as the Manhattan norm seems appropriate. When the trees have arbitrary positive real weights, they should be understood as belonging to a continuous space [5], and then the Euclidean norm is more appropriate.

Supplementary Material

The Supplementary Material referenced in the paper is available at http://bioinfo.uib.es/~recerca/phylotrees/nodal/.

Acknowledgements: The research described in this paper has been partially supported by the Spanish DGI projects MTM2006-07773 COMGRIO and MTM2006-15038-C02-01.

References

1. H. Abdi, Additive-tree representations, Lecture Notes in Biomathematics 84 (1990) 43–59.
2. B. L. Allen, M. A. Steel, Subtree transfer operations and their induced metrics on evolutionary trees, Ann. Combin. 5 (2001) 1–13.
3. V. Batagelj, T. Pisanski, J. M. S. Simões-Pereira, An algorithm for tree-realizability of distance matrices, Int. J. Comput. Math. 34 (3) (1990) 171–176.
4. P. Batra, Newton's method and the computational complexity of the fundamental theorem of algebra, Electron. Notes Theor. Comput. Sci. 202 (2008) 201–218.
5. L. J. Billera, S. P. Holmes, K. Vogtmann, Geometry of the space of phylogenetic trees, Adv. Appl. Math. 27 (1) (2001) 733–767.
6. J. Bluis, D.-G. Shin, Nodal distance algorithm: Calculating a phylogenetic tree comparison metric, in: Proc. 3rd IEEE Symp. BioInformatics and BioEngineering, 2003.
7. F. T. Boesch, Properties of the distance matrix of a tree, Q. Appl. Math. 16 (1968) 607–609.
8. P. Buneman, The recovery of trees from measures of dissimilarity, in: J. H. et al (ed.), Mathematics in the archaeological and historical sciences, Edinburgh University Press, 1969, pp. 387–395.
9. D. E. Critchlow, D. K. Pearl, C. Qian, The triples distance for rooted bifurcating phylogenetic trees, Syst. Biol. 45 (3) (1996) 323–334.
10. J. S. Farris, A successive approximations approach to character weighting, Syst. Zool. 18 (1969) 374–385.
11. J. S. Farris, On comparing the shapes of taxonomic trees, Syst. Zool. 22 (1973) 50–54.
12. J. Felsenstein, The number of evolutionary trees, Syst. Zool. 27 (1978) 27–33.
13. J. Felsenstein, Inferring Phylogenies, Sinauer Associates Inc., 2004.
14. S. Grünewald, K. T. Huber, V. Moulton, C. Semple, Encoding phylogenetic trees in terms of weighted quartets, J. Math. Biol. 56 (4) (2008) 465–477.
15. J. Handl, J. Knowles, D. B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics 21 (15) (2005) 3201–3212.
16. K. Hoef-Emden, Molecular phylogenetic analyses and real-life data, Computing in Science and Engineering 7 (3) (2005) 86–91.
17. F. Leonardi, S. R. Matioli, H. A. Armelin, A. Galves, Detecting phylogenetic relations out from sparse context trees, http://arxiv.org/abs/0804.4279.
18. R. D. M. Page, Phyloinformatics: Toward a phylogenetic database, in: J. T.-L. Wang, M. J. Zaki, H. Toivonen, D. Shasha (eds.), Data Mining in Bioinformatics, Springer-Verlag, 2005, pp. 219–241.
19. D. Penny, M. D. Hendy, The use of tree comparison metrics, Syst. Zool. 34 (1) (1985) 75–82.
20. J. B. Phipps, Dendrogram topology, Syst. Zool. 20 (1971) 306–308.
21. P. Puigbò, S. Garcia-Vallvé, J. McInerney, TOPD/FMTS: a new software to compare phylogenetic trees, Bioinformatics 23 (12) (2007) 1556–1558.
22. D. F. Robinson, L. R. Foulds, Comparison of weighted labelled trees, in: Proc. 6th Australian Conf. Combinatorial Mathematics, vol. 748 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1979.
23. D. F. Robinson, L. R. Foulds, Comparison of phylogenetic trees, Math. Biosci. 53 (1/2) (1981) 131–147.
24. A. Rokas, Genomics and the tree of life, Science 313 (5795) (2006) 1897–1899.
25. C. Semple, M. Steel, Phylogenetics, Oxford University Press, 2003.
26. J. M. S. Simões-Pereira, A note on the tree realizability of a distance, J. Comb. Th. B 6 (3) (1969) 303–310.
27. N. J. A. Sloane, The On-Line Encyclopedia of Integer Sequences, published electronically at www.research.att.com/~njas/sequences/.
28. Y. A. Smolenskii, A method for the linear recording of graphs, USSR Computational Mathematics and Mathematical Physics 2 (1963) 396–397.
29. M. A. Steel, D. Penny, Distributions of tree comparison metrics: some new results, Syst. Biol. 42 (2) (1993) 126–141.
30. M. S. Waterman, T. F. Smith, On the similarity of dendrograms, J. Theor. Biol. 73 (1978) 789–800.
31. W. T. Williams, H. T. Clifford, On the comparison of two classifications of the same set of elements, Taxon 20 (4) (1971) 519–522.
32. K. A. Zaretskii, Construction of a tree from the collection of distances between suspending vertices, Uspekhi Matematicheskikh Nauk 6 (1965) 90–92, in Russian.
