Nodal distances for rooted phylogenetic trees

Dissimilarity measures for (possibly weighted) phylogenetic trees based on the comparison of their vectors of path lengths between pairs of taxa, have been present in the systematics literature since the early seventies. But, as far as rooted phyloge…

Authors: Gabriel Cardona, Merce Llabres, Francesc Rossello

Gabriel Cardona (1), Mercè Llabrés (1, 2), Francesc Rosselló (1, 2), and Gabriel Valiente (2, 3)

(1) Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain
(2) Research Institute of Health Science (IUNICS), E-07122 Palma de Mallorca, Spain
(3) Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, E-08034 Barcelona, Spain

Abstract. Dissimilarity measures for (possibly weighted) phylogenetic trees based on the comparison of their vectors of path lengths between pairs of taxa have been present in the systematics literature since the early seventies. But, as far as rooted phylogenetic trees go, these vectors can only separate non-weighted binary trees, and therefore these dissimilarity measures are metrics only on this class. In this paper we overcome this problem by splitting in a suitable way each path length between two taxa into two lengths. We prove that the resulting splitted path lengths matrices single out arbitrary rooted phylogenetic trees with nested taxa and arcs weighted in the set of positive real numbers. This allows the definition of metrics on this general class by comparing these matrices by means of metrics in spaces $M_n(\mathbb{R})$ of real-valued $n \times n$ matrices. We conclude this paper by establishing some basic facts about the metrics for non-weighted phylogenetic trees defined in this way using $L^p$ metrics on $M_n(\mathbb{R})$, with $p \in \mathbb{N} \setminus \{0\}$.

1 Introduction

The exponential increase in the amount of available genomic and metagenomic data has produced an explosion in the number of phylogenetic trees proposed by researchers: according to Rokas [24], phylogeneticists are currently publishing an average of 15 phylogenetic trees per day.
Many such trees are alternative phylogenies for the same sets of organisms, obtained from different datasets or using different evolutionary models or different phylogenetic reconstruction algorithms [16]. This variety of phylogenetic trees makes it necessary to have methods for measuring the differences between phylogenetic trees [13, Ch. 30], and the safest way to quantify these differences is by using a metric, for which zero difference means isomorphism. The comparison of phylogenetic trees is also used to assess the stability of reconstruction methods [31], and it is essential for performing phylogenetic queries on databases [18]. Further, the need for comparing phylogenetic trees also arises in the comparative analysis of clustering results obtained using different methods or different distance matrices, and there is a growing interest in the assessment of clustering results in bioinformatics [15]. Recent applications of the comparison of phylogenetic-like trees also include the study of the similarity between sequences, or sets of sequences, by measuring the difference between their context trees [17]. In summary, and using the words of Steel and Penny [29], tree comparison metrics are an important aid in the study of evolution.

Many metrics for phylogenetic tree comparison have been proposed so far, including the Robinson-Foulds, or partition, metric [22, 23], the nearest-neighbor interchange metric [30], the subtree transfer distance [2], and the triples metric [9]. In the early seventies, several researchers proposed dissimilarity measures for (possibly weighted) rooted phylogenetic trees based on the comparison of the vectors of lengths of paths connecting pairs of taxa. The aim of these measures was to quantify the rate at which pairs of taxa that are close together in one tree lie at opposite ends in another tree [19].
These authors defined the dissimilarity between a pair of trees as the Euclidean distance between the corresponding vectors of path lengths [10, 11], the Manhattan distance between these vectors [31], or the correlation between these vectors [20]. Similar dissimilarity measures have also been defined for unrooted phylogenetic trees [6, 29]. Although different names have been used for these dissimilarity measures (cladistic difference [10], topological distance [20], path difference distance [29]), the term nodal distance seems to have prevailed [6, 21]. According to Steel and Penny [29], they have several interesting features that make them deserve more study and consideration.

The theoretical basis for these nodal distances is Smolenskii's theorem [28], which establishes that two unrooted phylogenetic trees $T$, $T'$ on the same set $S$ of taxa are isomorphic if, and only if, for every pair of leaves $i, j$, the distances between $i$ and $j$ in $T$ and in $T'$ are the same. This result was later expanded by Zaretskii [32], who characterized the vectors of distances between pairs of leaves of an unrooted phylogenetic tree by means of the well-known four-point condition. Smolenskii's and Zaretskii's papers were published in Russian, which has contributed to their results being rediscovered and generalized many times [3, 7, 8, 26]; for a modern textbook treatment of these results in all their generality (weighted unrooted trees with nested taxa), see [25, Ch. 7], and for a historical account, see [1].

Unfortunately, Smolenskii's theorem is not valid for arbitrary rooted phylogenetic trees: there exist non-isomorphic rooted phylogenetic trees with the same path lengths between pairs of leaves (see Figs. 1, 2, 3).
It turns out that only the fully resolved, or binary, non-weighted rooted phylogenetic trees are singled out by their path lengths vectors, and therefore the nodal distances based on the comparison of these vectors are metrics (more specifically, zero nodal distance means isomorphism) only on the space of non-weighted binary phylogenetic trees. Although this result seems to have been known since the time of the first proposals of nodal distances, we have not been able to find an explicit proof in the literature, and thus, for the sake of completeness, we include a simple proof of this fact in Section 3, reducing it to the general version of Smolenskii's result.

The main result of this paper is the definition of metrics on the space of arbitrary rooted phylogenetic trees that generalize the nodal distances, where arbitrary means not necessarily binary and with possibly nested taxa and arcs weighted in the set of positive real numbers. To do that, we split each path between two taxa into the paths from their least common ancestor to each taxon. In this way we associate to each rooted phylogenetic tree with $n$ taxa an $n \times n$ matrix, with rows and columns indexed by the taxa, whose $(i,j)$-entry contains the length of the path from the least common ancestor of the $i$-th and $j$-th taxa to the $i$-th taxon. We prove that these splitted path lengths matrices single out arbitrary rooted phylogenetic trees, and then we use them to define splitted nodal metrics on the space of weighted rooted phylogenetic trees with nested taxa by comparing these matrices through real-valued norms applied to their difference. We also prove some basic properties of the splitted nodal metrics on the space of non-weighted rooted phylogenetic trees obtained using the $L^p$ norms, with $p \in \mathbb{N} \setminus \{0\}$.
2 Notations and conventions

A rooted tree is a non-empty directed finite graph that contains a distinguished node, called the root, from which every other node can be reached through exactly one path. An $A$-weighted rooted tree, with $A \subseteq \mathbb{R}$, is a pair $(T, \omega)$ consisting of a rooted tree $T = (V, E)$ and a weight function $\omega : E \to A$ that associates to every arc $e \in E$ a real number $\omega(e) \in A$. In this paper we shall only consider two sets $A$ of weights: the set of non-negative real numbers $\mathbb{R}_{\ge 0} = \{t \in \mathbb{R} \mid t \ge 0\}$, and the set of positive real numbers $\mathbb{R}_{>0} = \{t \in \mathbb{R} \mid t > 0\}$. When the set $A$ is irrelevant (for instance, in general definitions), we shall omit it and simply talk about weighted, instead of $A$-weighted, trees. We identify every non-weighted (that is, where no weight function has been explicitly defined) rooted tree $T$ with the weighted rooted tree $(T, \omega)$ with $\omega$ the constant function with value 1.

Let $T = (V, E)$ be a rooted tree. Whenever $(u, v) \in E$, we say that $v$ is a child of $u$ and that $u$ is the parent of $v$. Every node in $T$ has exactly one parent, except the root, which has no parent. The number of children of a node is its out-degree. The nodes without children are the leaves of the tree, and the other nodes are called internal. An arc $(u, v)$ is internal when its head $v$ is internal, and pendant when $v$ is a leaf. The out-degree 1 nodes are called elementary. A tree is binary when all its internal nodes have out-degree 2.

Given a path $(v_0, v_1, \ldots, v_k)$ in a rooted tree $T$, its origin is $v_0$, its end is $v_k$, and its intermediate nodes are $v_1, \ldots, v_{k-1}$. Such a path is non-trivial when $k \ge 1$. We shall represent a path from $u$ to $v$, that is, a path with origin $u$ and end $v$, by $u \rightsquigarrow v$.
Whenever there exists a (non-trivial) path $u \rightsquigarrow v$, we shall say that $v$ is a (non-trivial) descendant of $u$ and also that $u$ is a (non-trivial) ancestor of $v$. If $v$ is a descendant of $u$, the path $u \rightsquigarrow v$ is unique. The distance from a node $u$ to a descendant $v$ of it in a weighted rooted tree is the sum of the weights of the arcs forming the unique path $u \rightsquigarrow v$; in a non-weighted rooted tree, this distance is simply the number of arcs of this path. The depth of a node $v$, in symbols $\mathrm{depth}_T(v)$, is the distance from the root to $v$.

The least common ancestor (LCA) of a pair of nodes $u, v$ of a rooted tree $T$, in symbols $[u, v]_T$, is the unique common ancestor of them that is a descendant of every other common ancestor of them. Alternatively, it is the unique common ancestor of $u, v$ such that the paths from it to $u$ and $v$ have only their origin in common. In particular, if one of the nodes, say $u$, is an ancestor of the other, then $[u, v]_T = u$.

Let $S$ be a non-empty finite set of labels, or taxa. A (weighted) phylogenetic tree on $S$ is a (weighted) rooted tree with some of its nodes, including all its leaves and its elementary nodes, bijectively labeled in the set $S$. In such a phylogenetic tree, we shall always identify, usually without any further mention, a labeled node with its taxon. The internal labeled nodes of a phylogenetic tree are called nested taxa.

Two phylogenetic trees $T$ and $T'$ on the same set $S$ of taxa are isomorphic when they are isomorphic as directed graphs and the isomorphism sends each labeled node of $T$ to the labeled node with the same label in $T'$; an isomorphism of weighted phylogenetic trees is also required to preserve arc weights. As usual, we shall use the symbol $\cong$ to denote the existence of an isomorphism.
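To make the notions of distance, depth and least common ancestor concrete, here is a minimal Python sketch. The encoding of a weighted rooted tree as a dictionary of parent pointers, the function names and the small example tree are illustrative assumptions, not notation taken from the paper.

```python
# Hypothetical encoding (an assumption for illustration): every non-root node
# maps to (parent, weight of the arc parent -> node); the root has no entry.
parent = {
    "a": ("r", 1.0),  # "a" is internal and labeled, hence a nested taxon
    "d": ("r", 2.0),
    "b": ("a", 1.0),
    "c": ("a", 3.0),
}


def ancestors_with_distance(parent, v):
    """Map every ancestor of v (v itself included) to its distance down to v."""
    dist = {v: 0.0}
    d = 0.0
    while v in parent:
        v, w = parent[v]
        d += w
        dist[v] = d
    return dist


def depth(parent, v):
    """Distance from the root to v: sum of arc weights on the unique path."""
    d = 0.0
    while v in parent:
        v, w = parent[v]
        d += w
    return d


def lca(parent, u, v):
    """Least common ancestor [u, v]_T: first ancestor of v that is also an ancestor of u."""
    anc_u = ancestors_with_distance(parent, u)
    while v not in anc_u:
        v = parent[v][0]
    return v


print(depth(parent, "c"))     # 4.0
print(lca(parent, "b", "c"))  # a
print(lca(parent, "a", "b"))  # a  (one node is an ancestor of the other)
print(lca(parent, "b", "d"))  # r
```

Climbing parent pointers takes time proportional to the depth of the nodes involved, which is enough for small examples; it is not meant as an efficient LCA structure.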
Although our main objects of study are weighted phylogenetic trees, which are rooted, unrooted trees will also appear in the next section. An unrooted tree is an undirected finite graph where every pair of nodes is connected by exactly one path. An $A$-weighted unrooted tree is a pair $(T, \omega)$ consisting of an unrooted tree $T = (V, E)$ and a weight function $\omega : E \to A$. The distance between two nodes in a weighted unrooted tree is the sum of the weights of the edges forming the unique path that connects these nodes. An unrooted tree is partially labeled in a set $S$ when some of its nodes are bijectively labeled in the set $S$. An unrooted $S$-tree is an unrooted tree partially labeled in $S$ with all its leaves and all its nodes of degree 2 labeled.

Given a phylogenetic tree $T = (V, E)$ on $S$, its unrooted version is the unrooted tree $T^u = (V, E^u)$ partially labeled in $S$ obtained by replacing each arc $(u, v) \in E$ by an edge $\{u, v\} \in E^u$, and keeping the labels. The notion of isomorphism for (possibly weighted) partially labeled unrooted trees is similar to the notion given in the rooted case. Notice that if $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ are two phylogenetic trees on the same set $S$ of taxa, with roots $r_1$ and $r_2$, respectively, then a mapping $f : V_1 \to V_2$ is an isomorphism between $T_1$ and $T_2$ if, and only if, it is an isomorphism between $T_1^u$ and $T_2^u$ and $f(r_1) = r_2$.

3 Path lengths separate non-weighted binary phylogenetic trees

Let $T$ be an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree on the set $S = \{1, \ldots, n\}$. For every $i, j \in S$, let $\ell_T(i, j)$ and $\ell_T(j, i)$ denote the distances from $[i, j]_T$ to $i$ and $j$, respectively. The path length between two labeled nodes $i$ and $j$ is

$$L_T(i, j) = \ell_T(i, j) + \ell_T(j, i).$$

Definition 1. The path lengths vector of $T$ is the vector $L(T) = \big(L_T(i, j)\big)_{1 \le i < j \le n}$ […] $\mathbb{R}_{>0}$-weighted unrooted $S$-tree; see also Thm. 7.1.8 in [25].

Proposition 1. Two non-weighted binary phylogenetic trees on the same set $S$ of taxa are isomorphic if, and only if, they have the same path lengths vectors.

Proof. The 'only if' implication is obvious. As far as the 'if' implication goes, let $T_1$ and $T_2$ be two non-weighted binary phylogenetic trees on the same set $S$ with the same path lengths vectors. If $|S| = 1$, the equivalence in the statement is obvious, because every phylogenetic tree with only one labeled node consists only of one node. So we assume henceforth that $|S| \ge 2$. For every $t = 1, 2$, let $(T_t^*, \omega_t)$ be the $\mathbb{R}_{>0}$-weighted unrooted $S$-tree defined as follows:

– If the root of $T_t$ is labeled, then $T_t^* = T_t^u$ and all edges of $T_t^*$ have weight 1.
– If the root $r_t$ of $T_t$ is not labeled, and if $u_t, v_t$ are the children of $r_t$, then $T_t^*$ is obtained from $T_t^u$ by removing the node $r_t$ and replacing the edges $\{r_t, u_t\}$, $\{r_t, v_t\}$ by a single edge $\{u_t, v_t\}$, and then all edges of $T_t^*$ have weight 1, except $\{u_t, v_t\}$, which has weight 2.

It is straightforward to check that such a $T_t^*$ is always an unrooted $S$-tree: the root $r_t$ of $T_t$ is the only degree 2 node in $T_t^u$ and then, if it is labeled, $T_t^u$ is an unrooted $S$-tree, and if it is not labeled, we remove it in the construction of $T_t^*$ without modifying the degrees of the remaining nodes. Moreover, it is also obvious from the construction that the distance between any pair of labeled nodes in $T_t^*$ is equal to the path length between these nodes in $T_t$. In particular, $T_1^*$ and $T_2^*$ have the same distances between each pair of labeled nodes. Then, by [25, Thm. 7.1.8], $T_1^* \cong T_2^*$ as weighted unrooted $S$-trees.
It remains to check that this isomorphism induces an isomorphism of phylogenetic trees $T_1 \cong T_2$. To do it, notice that, since the isomorphism between $T_1^*$ and $T_2^*$ preserves edge weights, there are only two possibilities:

– All edges in $T_1^*$ and $T_2^*$ have weight 1. In this case $T_1^* = T_1^u$ and $T_2^* = T_2^u$ and the isomorphism $T_1^u \cong T_2^u$ sends the root of $T_1$ to the root of $T_2$, because they are the only degree 2 nodes in $T_1^*$ and $T_2^*$. Therefore, it induces an isomorphism $T_1 \cong T_2$.
– Both $T_1^*$ and $T_2^*$ have one weight 2 edge, say $\{u_1, v_1\}$ and $\{u_2, v_2\}$, respectively. Then each $T_t^u$ is obtained from $T_t^*$ by adding the root $r_t$ of $T_t$ and splitting the edge $\{u_t, v_t\}$ into two edges $\{u_t, r_t\}$ and $\{v_t, r_t\}$. Since the isomorphism $T_1^* \cong T_2^*$ sends $\{u_1, v_1\}$ to $\{u_2, v_2\}$, its extension to a mapping $V_1 \to V_2$ by sending $r_1$ to $r_2$ defines an isomorphism $T_1^u \cong T_2^u$ that sends the root of $T_1$ to the root of $T_2$, and hence an isomorphism $T_1 \cong T_2$. □

Let $\mathcal{BT}_n$ be the class of all non-weighted binary phylogenetic trees on $S = \{1, \ldots, n\}$. The injectivity up to isomorphisms of the mapping

$$L : \mathcal{BT}_n \to \mathbb{R}^{n(n-1)/2}, \qquad T \mapsto L(T)$$

implies that the classical definitions of nodal metrics on $\mathcal{BT}_n$ induced by metrics on $\mathbb{R}^{n(n-1)/2}$ do yield metrics. For example, recall that the $L^p$ norm on $\mathbb{R}^m$ is defined as

$$\|(x_1, \ldots, x_m)\|_p = \begin{cases} \big|\{i \mid i = 1, \ldots, m,\ x_i \ne 0\}\big| & \text{if } p = 0 \\ \sqrt[p]{\sum_{i=1}^m |x_i|^p} & \text{if } p \in \mathbb{N}_+ \\ \max\{|x_i| \mid i = 1, \ldots, m\} & \text{if } p = \infty \end{cases}$$

where, here and henceforth, $\mathbb{N}_+$ stands for $\mathbb{N} \setminus \{0\}$. Each $L^p$ norm on $\mathbb{R}^{n(n-1)/2}$ then induces a metric on $\mathcal{BT}_n$ through the formula

$$d_p(T_1, T_2) = \|L(T_1) - L(T_2)\|_p.$$

Some of these metrics have been present in the literature since the early seventies.
For instance, Farris [10] introduced the metric on $\mathcal{BT}_n$ induced by the $L^2$, or Euclidean, norm on $\mathbb{R}^{n(n-1)/2}$:

$$d_2(T_1, T_2) = \sqrt{\sum_{1 \le i < j \le n} \big(L_{T_1}(i, j) - L_{T_2}(i, j)\big)^2}$$

[…] $\mathbb{R}_{>0}$-weighted binary phylogenetic trees with the same path lengths vectors.

Remark 1. Let $T$ be a non-weighted binary phylogenetic tree on a set $S$ of taxa. Since the path lengths vector $L(T)$ is the vector of distances of a (possibly weighted) unrooted $S$-tree (see the proof of Proposition 1), it is well known (see, for instance, Lem. 7.1.7 in [25]) that it satisfies the four-point condition: for every $a, b, c, d \in S$,

$$L_T(a, b) + L_T(c, d) \le \max\{L_T(a, c) + L_T(b, d),\ L_T(a, d) + L_T(b, c)\}.$$

Zaretskii's theorem [32] establishes that any dissimilarity measure on $S$ satisfying this four-point condition is given by the distances between labeled nodes in an $\mathbb{R}_{>0}$-weighted unrooted $S$-tree (see Thm. 7.2.6 in [25]). But, to our knowledge, it is not known what extra properties should be required of such a dissimilarity measure on $S$ to guarantee that it is given by the path lengths between labeled nodes in a non-weighted binary phylogenetic tree.

4 Splitted path lengths separate arbitrary phylogenetic trees

Let $(T, \omega)$, with $T = (V, E)$, be again an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree on $S = \{1, \ldots, n\}$ and, for every $i, j \in S$, let $\ell_T(i, j)$ and $\ell_T(j, i)$ still denote the distances from $[i, j]_T$ to $i$ and $j$, respectively.

Definition 2. The splitted path lengths matrix of $T$ is the $n \times n$ square matrix over $\mathbb{R}_{\ge 0}$

$$\ell(T) = \begin{pmatrix} \ell_T(1,1) & \ell_T(1,2) & \cdots & \ell_T(1,n) \\ \ell_T(2,1) & \ell_T(2,2) & \cdots & \ell_T(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ \ell_T(n,1) & \ell_T(n,2) & \cdots & \ell_T(n,n) \end{pmatrix} \in M_n(\mathbb{R}_{\ge 0}).$$

Notice that this matrix need not be symmetric (see the next example), but all entries $\ell_T(i, i)$ in its main diagonal are 0.
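Definition 2 can be turned into a direct computation. The Python sketch below is only an illustration: the parent-pointer encoding, the function names and the two small trees are assumptions (the trees were chosen as non-weighted, non-binary trees on four taxa, with unlabeled internal nodes named "r" and "y"). It climbs parent pointers for each pair of taxa, so it is a straightforward implementation and not a time-optimal one.

```python
def splitted_path_lengths(parent, taxa):
    """Return ell with ell[a][b] = distance from the LCA of taxa[a], taxa[b] to taxa[a].

    `parent` maps each non-root node to (its parent, weight of the arc into it).
    """
    # For each taxon, the distance from each of its ancestors (itself included).
    up = {}
    for t in taxa:
        dist, v, d = {t: 0.0}, t, 0.0
        while v in parent:
            v, w = parent[v]
            d += w
            dist[v] = d
        up[t] = dist
    n = len(taxa)
    ell = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(taxa):
        for b, j in enumerate(taxa):
            v = j
            while v not in up[i]:  # climb from j to the first ancestor of i
                v = parent[v][0]
            ell[a][b] = up[i][v]   # v is the LCA of i and j
    return ell


def path_lengths_vector(ell):
    """Path lengths L(i, j) = ell(i, j) + ell(j, i) for i < j, in lexicographic order."""
    n = len(ell)
    return [ell[i][j] + ell[j][i] for i in range(n) for j in range(i + 1, n)]


# Two hypothetical non-weighted trees on taxa 1..4 (all arcs have weight 1).
tree_a = {1: ("r", 1), 2: ("r", 1), "y": ("r", 1), 3: ("y", 1), 4: ("y", 1)}
tree_b = {3: ("r", 1), 4: ("r", 1), "y": ("r", 1), 1: ("y", 1), 2: ("y", 1)}
taxa = [1, 2, 3, 4]

ell_a = splitted_path_lengths(tree_a, taxa)
ell_b = splitted_path_lengths(tree_b, taxa)
print(ell_a)  # rows (0,1,1,1), (0... see the assertions in the note below)
print(path_lengths_vector(ell_a) == path_lengths_vector(ell_b))  # True
print(ell_a == ell_b)                                            # False
```

For these two trees the matrices come out as rows (0,1,1,1), (1,0,1,1), (2,2,0,1), (2,2,1,0) and (0,1,2,2), (1,0,2,2), (1,1,0,1), (1,1,1,0), respectively: the path lengths vectors coincide while the splitted matrices differ, and neither matrix is symmetric.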
The splitted path lengths matrix $\ell(T)$ of a tree $T \in \mathcal{T}_n$ can be computed in optimal $O(n^2)$ time, by computing by breadth-first search, for each internal node of $T$, the distance to each one of its descendant taxa and the pairs of taxa of which it is the LCA.

Example 1. The splitted path lengths matrices of the trees $T$ and $T'$ depicted in Fig. 1 are

$$\ell(T) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 2 & 2 & 0 & 1 \\ 2 & 2 & 1 & 0 \end{pmatrix}, \qquad \ell(T') = \begin{pmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 2 & 2 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}.$$

The splitted path lengths matrices of the trees $T$ and $T'$ depicted in Fig. 2 are

$$\ell(T) = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix}, \qquad \ell(T') = \begin{pmatrix} 0 & 2 & 1 \\ 0 & 0 & 0 \\ 1 & 2 & 0 \end{pmatrix}.$$

The splitted path lengths matrices of the weighted trees $T$ and $T'$ depicted in Fig. 3 are

$$\ell(T) = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}, \qquad \ell(T') = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}.$$

This example shows that the splitted path lengths matrices can separate pairs of phylogenetic trees that could not be separated by means of their path lengths vectors. Our main result in this section states that these matrices characterize arbitrary $\mathbb{R}_{>0}$-weighted phylogenetic trees. To prove it, it is convenient to establish first some lemmas, and to recall a result from [14].

Lemma 1. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. A label $i \in S$ is a nested taxon of $T$ if, and only if, $\ell_T(i, j) = 0$ for some $j \ne i$.

Proof. If an internal node of $T$ is labeled with $i$, then taking as $j \in S$ any descendant leaf of $i$ we have that $[i, j]_T = i$ and hence $\ell_T(i, j) = 0$. Conversely, if $\ell_T(i, j) = 0$, then $[i, j]_T = i$ and therefore the node $i$ is an ancestor of the node $j$. If $i \ne j$, this can only happen if $i$ is internal. □

Lemma 2. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. For every $i \in S$, consider the set of weights

$$W_i = \{\ell_T(i, j) \mid j \in S,\ \ell_T(i, j) > 0\}.$$

(a) $W_i = \emptyset$ if, and only if, $i$ is the root of $T$.
(b) If $W_i \ne \emptyset$, then its smallest element $w_i$ is the weight of the arc with head $i$.

Proof. As far as (a) goes, $W_i = \emptyset$ if, and only if, $\ell_T(i, j) = 0$ for every $j \in S$, that is, if, and only if, $i$ is an ancestor of every labeled node. Since the set of labeled nodes of $T$ includes all leaves and all elementary nodes, this is equivalent to the fact that $i$ is the root.

As far as (b) goes, assume that $W_i \ne \emptyset$, so that $i$ has a parent $x$. Let $w_i$ be the weight of the arc $(x, i)$. Then, since every non-trivial path $[i, j]_T \rightsquigarrow i$ must end with the arc $(x, i)$, it is clear that if $\ell_T(i, j) > 0$, then $\ell_T(i, j) \ge w_i$. Now, if $x$ is labeled, say with label $i_0$, then $x = [i, i_0]_T$ and thus $\ell_T(i, i_0) = w_i$. If $x$ is not labeled, then it cannot be elementary, and hence it must have at least another child $y$. Let $i_0$ be a descendant leaf of $y$. In this case, $x = [i, i_0]_T$ and $\ell_T(i, i_0) = w_i$, too. This proves that, in all cases, $w_i \in W_i$, and thus that it is the smallest element of this set. □

The following result is a direct consequence of the last two lemmas.

Corollary 1. Let $T$ and $T'$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa such that $\ell(T) = \ell(T')$. Then:
(a) The nested taxa of $T$ and $T'$ are the same.
(b) $T$ has its root labeled with $i$ if, and only if, $T'$ has its root labeled with $i$.
(c) If the nodes labeled with $i$ in $T$ and $T'$ are not their roots, the weight of the arc with head $i$ in $T$ and in $T'$ is the same. □

Let $S$ be a set of taxa and $R(S)$ the set of $S$-triples, that is, of structures $ab|c$ with $a, b, c \in S$ pairwise different. Classically, an $S$-triple $ab|c$ is said to be present in a phylogenetic tree $T$ if $c$ diverged from $a$ before $b$ did, in the sense that $[a, b]_T < [a, c]_T = [b, c]_T$ (that is, $[a, b]_T$ is a non-trivial descendant of $[a, c]_T = [b, c]_T$). Let now $(T, \omega)$ be an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree on $S$.
For every $ab|c \in R(S)$, let $\lambda_T(ab|c) \in \mathbb{R}_{\ge 0}$ be defined as follows:

– If $ab|c$ is present in $T$, then $\lambda_T(ab|c)$ is the distance from $[a, c]_T = [b, c]_T$ to $[a, b]_T$.
– If $ab|c$ is not present in $T$, then $\lambda_T(ab|c) = 0$.

Notice that $\lambda_T(ab|c) = \lambda_T(ba|c)$. This mapping $\lambda_T$ has a simple description in terms of $\ell(T)$.

Lemma 3. Let $(T, \omega)$ be an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree on $S$. For every $ab|c \in R(S)$,

$$\lambda_T(ab|c) = \max\{\ell_T(a, c) - \ell_T(a, b),\ 0\}.$$

Proof. If $[a, c]_T$ is a non-trivial ancestor of $[a, b]_T$ in $T$, then the path $[a, c]_T \rightsquigarrow a$ contains the node $[a, b]_T$ and the distance $\ell_T(a, c)$ from $[a, c]_T$ to $a$ is equal to the distance $\lambda_T(ab|c)$ from $[a, c]_T$ to $[a, b]_T$ plus the distance $\ell_T(a, b)$ from $[a, b]_T$ to $a$. Therefore, in this case,

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = \ell_T(a, c) - \ell_T(a, b) = \lambda_T(ab|c).$$

If $[a, c]_T = [a, b]_T$, then $\ell_T(a, c) = \ell_T(a, b)$ and $ab|c$ is not present in $T$, and thus

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = 0 = \lambda_T(ab|c).$$

Finally, if $[a, c]_T$ is not an ancestor of $[a, b]_T$, then it must happen that $[a, b]_T$ is a non-trivial ancestor of $[a, c]_T$ and therefore $\ell_T(a, b) \ge \ell_T(a, c)$. Since $ab|c$ is not present in $T$, either, this implies that

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = 0 = \lambda_T(ab|c).$$

So, the equality in the statement always holds. □

The following result is Thm. 2 in [14]. In it, $Q(X)$ denotes the set of $X$-quartets, that is, of structures $ab|cd$ with $a, b, c, d \in X$ pairwise different.

Theorem 1. Let $\lambda : R(S) \to \mathbb{R}_{\ge 0}$ be a map such that $\lambda(ab|c) = \lambda(ba|c)$ for every $a, b, c \in S$ pairwise different, and let $z$ be an element not in $S$.
Then:

(a) $\lambda = \lambda_T$ for some $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree $(T, \omega)$ with neither nested taxa nor weight 0 internal arcs if, and only if, the mapping $\mu : Q(S \cup \{z\}) \to \mathbb{R}_{\ge 0}$ defined by

$$\mu(ab|cd) = \begin{cases} \lambda(ab|c) & \text{if } d = z \\ \min\{\lambda(ab|c), \lambda(ab|d)\} + \min\{\lambda(cd|a), \lambda(cd|b)\} & \text{if } d \ne z \end{cases}$$

satisfies the following properties:
(1) $\mu(ab|cd) = \mu(ba|cd) = \mu(cd|ab)$.
(2) For every $a, b, c, d$, at least two of $\mu(ab|cd)$, $\mu(ac|bd)$, and $\mu(ad|bc)$ are equal to 0.
(3) If $\mu(ab|cd) > 0$, then, for every $x \ne a, b, c, d$, either $\mu(ab|cx) \cdot \mu(ab|dx) > 0$ or $\mu(ax|cd) \cdot \mu(bx|cd) > 0$.
(4) For every $a, b, c, d, e$, if $\mu(ab|cd) > \mu(ab|ce) > 0$, then $\mu(ae|cd) = \mu(ab|cd) - \mu(ab|ce)$.
(5) For every $a, b, c, d, e$, if $\mu(ab|cd) > 0$ and $\mu(bc|de) > 0$, then $\mu(ab|de) = \mu(ab|cd) + \mu(bc|de)$.

(b) If $(T, \omega)$ and $(T', \omega')$ are two $\mathbb{R}_{\ge 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs and such that $\lambda_T = \lambda_{T'}$, then $T \cong T'$ as phylogenetic trees and the isomorphism preserves the weights of the internal arcs. □

Now we can proceed with the proof that splitted path lengths matrices characterize $\mathbb{R}_{>0}$-weighted phylogenetic trees.

Theorem 2. Two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa are isomorphic if, and only if, they have the same splitted path lengths matrices.

Proof. As in Proposition 1, the statement when $|S| = 1$ is obviously true. Assume now that $|S| \ge 2$. For every $\mathbb{R}_{>0}$-weighted phylogenetic tree $(T, \omega)$ on $S$, let $(\overline{T}, \overline{\omega})$ be the $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree without nested taxa obtained as follows: for every internal labeled node $i$ of $T$, unlabel it and add to it a leaf child labeled with $i$ through an arc of weight 0.
It is straightforward to check that $\ell_{\overline{T}}(i, j) = \ell_T(i, j)$ for every $i, j \in S$. Since $T$ was $\mathbb{R}_{>0}$-weighted, the only weight 0 arcs in $\overline{T}$ are the new pendant arcs that replace the nested taxa. Moreover, $(T, \omega)$ can be recovered from $(\overline{T}, \overline{\omega})$ by simply removing the weight 0 pendant arcs and labeling the tail of a removed arc with the label of the arc's head.

Let now $(T_1, \omega_1)$ and $(T_2, \omega_2)$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa such that $\ell(T_1) = \ell(T_2)$. Then $\ell(\overline{T}_1) = \ell(\overline{T}_2)$ and hence, by Lemma 3, $\lambda_{\overline{T}_1} = \lambda_{\overline{T}_2}$. Since $(\overline{T}_1, \overline{\omega}_1)$ and $(\overline{T}_2, \overline{\omega}_2)$ are $\mathbb{R}_{\ge 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs, by Theorem 1.(b) we have that $\overline{T}_1 \cong \overline{T}_2$ as phylogenetic trees, and moreover this isomorphism preserves the weights of the internal arcs. But we also know that the arc ending in the leaf $i$ has the same weight in $\overline{T}_1$ and in $\overline{T}_2$: if $i$ was a nested taxon of $T_1$ and $T_2$ (and recall that $T_1$ and $T_2$ have the same nested taxa by Corollary 1.(a)), this weight is in both cases 0, and if $i$ was the label of a leaf of $T_1$ and $T_2$, this weight is the same in $T_1$ and in $T_2$ by Corollary 1.(c), and hence in $\overline{T}_1$ and in $\overline{T}_2$. Therefore, the isomorphism $\overline{T}_1 \cong \overline{T}_2$ is an isomorphism of weighted phylogenetic trees. Finally, the way $(T_1, \omega_1)$ and $(T_2, \omega_2)$ are reconstructed from $(\overline{T}_1, \overline{\omega}_1)$ and $(\overline{T}_2, \overline{\omega}_2)$ implies that this isomorphism induces an isomorphism of weighted phylogenetic trees $T_1 \cong T_2$. This proves the 'if' implication; the 'only if' implication is obvious. □

Remark 2.
The proof of the last theorem can also be applied, with small modifications, to prove that the splitted path lengths matrices also separate $\mathbb{R}_{>0}$-weighted phylogenetic trees with multi-labeled nodes, that is, where a node can have more than one label (but two different nodes cannot share any label); in such a tree $T$, if $i$ and $j$ are labels of the same node, then $\ell_T(i, j) = \ell_T(j, i) = 0$. It is enough to slightly change the definition of $\overline{T}$: on the one hand, for every internal labeled node of $T$, unlabel it and, for each one of its labels, add to it a leaf child labeled with this label through an arc of weight 0; and, on the other hand, do the same for every leaf with more than one label. The same argument as in the proof of the last theorem shows that if $T_1$ and $T_2$ are two $\mathbb{R}_{>0}$-weighted phylogenetic trees with multi-labeled nodes such that $\ell(T_1) = \ell(T_2)$, then the $\mathbb{R}_{\ge 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs $\overline{T}_1$ and $\overline{T}_2$ obtained in this way are isomorphic. To derive from this isomorphism an isomorphism $T_1 \cong T_2$, one must use that, in this multi-labeled case:

– An internal node of a tree $T$ is labeled $\{i_1, \ldots, i_k\}$ if, and only if, $\ell_T(a, b) = 0$ for every $a, b \in \{i_1, \ldots, i_k\}$, $\ell_T(a, j) > 0$ or $\ell_T(j, a) > 0$ for every $a \in \{i_1, \ldots, i_k\}$ and every $j \notin \{i_1, \ldots, i_k\}$, and there exists some $j \notin \{i_1, \ldots, i_k\}$ such that $\ell_T(a, j) = 0$ for every $a \in \{i_1, \ldots, i_k\}$.
– A leaf of $T$ is labeled $\{i_1, \ldots, i_k\}$ if, and only if, $\ell_T(a, b) = 0$ for every $a, b \in \{i_1, \ldots, i_k\}$, and $\ell_T(a, j) > 0$ for every $a \in \{i_1, \ldots, i_k\}$ and every $j \notin \{i_1, \ldots, i_k\}$.

These properties entail that if $\ell(T_1) = \ell(T_2)$, then $T_1$ and $T_2$ have the same families of sets $\{i_1, \ldots, i_k\}$ of labels of internal nodes as well as of leaves. We leave the details to the reader.

Notice that Theorem 1 not only establishes that the mapping $\lambda_T$ singles out an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree $T$ with neither nested taxa nor weight 0 internal arcs, up to the weights of its pendant arcs, but it also characterizes what mappings can be realized as $\lambda_T$-mappings, for some $T$ of this type. We can use this result to characterize the matrices that are splitted path lengths matrices of $\mathbb{R}_{>0}$-weighted phylogenetic trees.

Proposition 2. Let $M = (m_{i,j}) \in M_n(\mathbb{R}_{\ge 0})$ be an $n \times n$ square matrix over $\mathbb{R}_{\ge 0}$ with $m_{i,i} = 0$ for every $i = 1, \ldots, n$. Then, $M = \ell(T)$ for some $\mathbb{R}_{>0}$-weighted phylogenetic tree $T$ on $S = \{1, \ldots, n\}$ if, and only if, the mapping $\lambda_M : R(S) \to \mathbb{R}_{\ge 0}$ defined by

$$\lambda_M(ab|c) = \max\{m_{a,c} - m_{a,b},\ 0\}$$

satisfies the following conditions:
(a) $\lambda_M(ab|c) = \lambda_M(ba|c)$ for every $a, b, c \in S$ pairwise different.
(b) The mapping $\mu_M$ defined from $\lambda_M$ as in Theorem 1.(a) satisfies properties (1)–(5) therein.

Proof. The 'only if' implication is easy: if $M = \ell(T)$, so that $m_{i,j} = \ell_T(i, j)$ for every $i, j \in S$, then $\lambda_M = \lambda_{\overline{T}}$, with $\overline{T}$ the $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree without nested taxa or weight 0 internal arcs associated to $T$ in the proof of Theorem 2, and therefore it satisfies conditions (a) and (b) in the statement.

Conversely, if $\lambda_M$ satisfies conditions (a) and (b), then by Theorem 1 there exists an $\mathbb{R}_{\ge 0}$-weighted phylogenetic tree $T_0$ without nested taxa or weight 0 internal arcs such that $\lambda_M = \lambda_{T_0}$. By Lemma 3,

$$\lambda_{T_0}(ab|c) = \max\{\ell_{T_0}(a, c) - \ell_{T_0}(a, b),\ 0\}.$$

Therefore, for every $a, b, c \in S$ pairwise different,

$$\max\{\ell_{T_0}(a, c) - \ell_{T_0}(a, b),\ 0\} = \max\{m_{a,c} - m_{a,b},\ 0\}.$$

The tree $T_0$ is unique up to the weights of the pendant arcs.
So, without any loss of generality, we may assume that the weight of the arc ending in the leaf $a$ is $\min\{m_{a,j} \mid j \neq a\}$.

Now, for every $a \in S$ and for every $b \in S \setminus \{a\}$, $b$ is a descendant of the parent $x_a$ of $a$ in $T_0$ if, and only if, $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$. As far as the 'if' implication goes, assume that $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$ but $b$ is not a descendant of $x_a$. Let $c \in S \setminus \{a\}$ be a descendant of $x_a$, so that $[a,c]_{T_0} = x_a$. Then, $[a,c]_{T_0}$ is a non-trivial descendant of $[a,b]_{T_0}$ and therefore (since the internal arcs of $T_0$ have positive weight) $\ell_{T_0}(a,b) - \ell_{T_0}(a,c) > 0$. But this contradicts the fact that, since $m_{a,c} \geq m_{a,b}$,
\[ \ell_{T_0}(a,b) - \ell_{T_0}(a,c) = \lambda_{T_0}(ac|b) = \lambda_M(ac|b) = \max\{m_{a,b} - m_{a,c},\, 0\} = 0. \]
As far as the converse implication goes, let $b \in S \setminus \{a\}$ be a descendant of $x_a$, and let $b' \in S \setminus \{a\}$ be such that $m_{a,b'} = \min\{m_{a,j} \mid j \neq a\}$: as we have just seen, $b'$ is also a descendant of $x_a$ and therefore $[a,b]_{T_0} = [a,b']_{T_0} = x_a$. Then,
\[ \max\{m_{a,b} - m_{a,b'},\, 0\} = \lambda_{T_0}(ab'|b) = 0 \]
implies that $m_{a,b} - m_{a,b'} \leq 0$, that is, that $m_{a,b} = \min\{m_{a,j} \mid j \neq a\}$, too.

Now, let us fix a taxon $a \in S$, and let $b \in S \setminus \{a\}$ be a descendant of the parent $x_a$ of $a$ in $T_0$. Then, on the one hand, $\ell_{T_0}(a,b) = m_{a,b}$, because it is the weight of the arc $(x_a, a)$, and, on the other hand, for every $c \neq a,b$, we have that $m_{a,c} \geq m_{a,b}$ and $\ell_{T_0}(a,c) \geq \ell_{T_0}(a,b)$ and therefore
\[ m_{a,c} = \lambda_M(ab|c) + m_{a,b} = \lambda_{T_0}(ab|c) + \ell_{T_0}(a,b) = \ell_{T_0}(a,c). \]
This implies that the $a$-th rows of $M$ and $\ell(T_0)$ are equal, and hence, since $a$ was any element of $S$, $M = \ell(T_0)$.
Finally, $T_0$ is transformed into an $\mathbb{R}_{>0}$-weighted phylogenetic tree with the same splitted path lengths matrix by simply removing the weight 0 pendant arcs and labeling the tail of each removed arc with the label of the arc's head; cf. the proof of Theorem 2. □

5 Splitted nodal metrics

Let $\mathcal{T}_n$ be the space of $\mathbb{R}_{>0}$-weighted phylogenetic trees on the set $S = \{1,\ldots,n\}$ of taxa. As we have seen, the mapping
\[ \ell : \mathcal{T}_n \longrightarrow \mathcal{M}_n(\mathbb{R}_{\geq 0}) \]
that associates to each $(T,\omega) \in \mathcal{T}_n$ its splitted path lengths matrix $\ell(T)$ is injective up to isomorphisms. As it happened with the embedding $L : \mathcal{BT}_n \hookrightarrow \mathbb{R}^{n(n-1)/2}$, this allows one to induce metrics on $\mathcal{T}_n$ from metrics on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.

Proposition 3. Let $D$ be any metric on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$. The mapping
\[ d : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{R}_{\geq 0}, \qquad (T_1,T_2) \mapsto D(\ell(T_1), \ell(T_2)) \]
satisfies the axioms of metrics up to isomorphisms:
(1) $d(T_1,T_2) \geq 0$,
(2) $d(T_1,T_2) = 0$ if, and only if, $T_1 \cong T_2$,
(3) $d(T_1,T_2) = d(T_2,T_1)$,
(4) $d(T_1,T_3) \leq d(T_1,T_2) + d(T_2,T_3)$.

Proof. Properties (1), (3) and (4) are direct consequences of the corresponding properties of $D$, while property (2) follows from the separation axiom for $D$ (which says that $D(M_1,M_2) = 0$ if, and only if, $M_1 = M_2$) and Theorem 2. □

We shall generically call splitted nodal metrics the metrics on $\mathcal{T}_n$ induced by metrics on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$ through the embedding $\ell$. In particular, every $L^p$ norm $\|\cdot\|_p$ on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$ defines a splitted nodal metric $d^s_p$ through
\[ d^s_p(T_1,T_2) = \|\ell(T_1) - \ell(T_2)\|_p. \]
For instance,
\[ d^s_1(T_1,T_2) = \sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|, \qquad d^s_2(T_1,T_2) = \sqrt{\sum_{1 \leq i \neq j \leq n} \bigl(\ell_{T_1}(i,j) - \ell_{T_2}(i,j)\bigr)^2} \]
are the splitted nodal metrics induced by the $L^1$ and $L^2$ norms on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.
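As an illustration only, the definitions above can be sketched in Python. The encoding of a tree as a parent map, and the names `dist_to_ancestors`, `splitted_path_lengths`, `splitted_nodal`, `LEAVES`, `T1` and `T2`, are assumptions made for this sketch and are not part of the paper. The two sample trees are chosen so that their matrices coincide with $\ell(T_1)$ and $\ell(T_2)$ in Example 2. The matrix routine is a naive one (its cost grows with the depth of the tree) and is not the $O(n^2)$ procedure referred to in the text.

```python
import math

def dist_to_ancestors(parent, weight, v):
    """Map each ancestor of v (v included) to the length of the path from it down to v."""
    dist, d = {v: 0}, 0
    while parent[v] is not None:
        d += weight.get(v, 1)  # weight of the arc (parent[v], v); 1 if non-weighted
        v = parent[v]
        dist[v] = d
    return dist

def splitted_path_lengths(parent, node_of, weight=None):
    """Matrix whose (i, j) entry is the distance from the least common
    ancestor of the nodes labeled i and j to the node labeled i."""
    weight = weight or {}
    taxa = sorted(node_of)
    up = {i: dist_to_ancestors(parent, weight, node_of[i]) for i in taxa}
    ell = [[0] * len(taxa) for _ in taxa]
    for r, i in enumerate(taxa):
        for c, j in enumerate(taxa):
            if i != j:
                # The least common ancestor is the common ancestor closest to i.
                ell[r][c] = min(d for a, d in up[i].items() if a in up[j])
    return ell

def splitted_nodal(m1, m2, p):
    """d^s_p between two matrices, for p = 0, an integer p >= 1, or math.inf."""
    diffs = [abs(x - y) for row1, row2 in zip(m1, m2) for x, y in zip(row1, row2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)   # number of differing entries
    if p == math.inf:
        return max(diffs)                        # largest entrywise difference
    return sum(d ** p for d in diffs) ** (1 / p)

# Hypothetical encoding: node -> parent, root -> None; leaves are named by their taxon.
LEAVES = {1: 1, 2: 2, 3: 3, 4: 4}                                       # taxon -> node
T1 = {"r": None, 1: "r", "a": "r", 2: "a", "b": "a", 3: "b", 4: "b"}    # (1,(2,(3,4)));
T2 = {"r": None, 4: "r", "a": "r", 3: "a", "b": "a", 1: "b", 2: "b"}    # (((1,2),3),4);

M1 = splitted_path_lengths(T1, LEAVES)
M2 = splitted_path_lengths(T2, LEAVES)
for p in (0, 1, 2, math.inf):
    print(p, splitted_nodal(M1, M2, p))
```

Nested taxa need no special treatment in this encoding: a labeled internal node simply appears both as a key of `node_of` and as the parent of other nodes, and the entry for a pair (ancestor, descendant) comes out as 0.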
We have seen in the previous section that the splitted path lengths matrices can be computed in $O(n^2)$ time. Their difference can be computed in $O(n^2)$ time, and the sum of the $p$-th powers of the entries of the resulting matrix can be computed in $O(n^2 \log(p) + n^2)$ time (assuming constant-time addition and multiplication of real numbers). Therefore, the cost of computing $d^s_p(T_1,T_2)^p$, for $T_1,T_2 \in \mathcal{T}_n$ and $p \in \mathbb{N}^+$, is $O(n^2 \log(p) + n^2)$. Thus, if $p = 1$, the $d^s_1$ metric on $\mathcal{T}_n$ can be computed in $O(n^2)$ time. For $p \geq 2$, the cost of computing $d^s_p(T_1,T_2)$, for $T_1,T_2 \in \mathcal{T}_n$, as the $p$-th root of $d^s_p(T_1,T_2)^p$ will depend on the accuracy with which this root is computed. For instance, using the Newton method to compute it with an accuracy of a $1/2^h$-th of its value has a cost of $O(p^2 \log(p) \log(hp))$; see, for instance, [4]. So, in practice, for small $p$ and not too large $h$, this step will be dominated by the computation of $d^s_p(T_1,T_2)^p$, and the total cost will be $O(n^2)$ (we understand in this case $\log(p)$ as part of the constant factor). For $p = 0$ or $\infty$, the cost of computing $d^s_p(T_1,T_2)$ is also $O(n^2)$.

These splitted nodal metrics can be seen conceptually as the generalizations to $\mathcal{T}_n$ of the classical nodal metrics on $\mathcal{BT}_n$. Conceptually, but not numerically, because the restriction of $d^s_p$ to $\mathcal{BT}_n$ is not equal to $d_p$, even up to a scalar factor, as the following easy example shows.

Example 2. Consider the non-weighted binary trees $T_1, T_2, T_3$ depicted in Fig. 4.
It is easy to compute their path lengths vectors and splitted path lengths matrices:
\[ L(T_1) = (3,4,4,3,3,2), \quad L(T_2) = (2,3,4,3,4,3), \quad L(T_3) = (4,4,3,2,3,3), \]
\[ \ell(T_1) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 2 & 0 & 1 & 1 \\ 3 & 2 & 0 & 1 \\ 3 & 2 & 1 & 0 \end{pmatrix}, \quad \ell(T_2) = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 2 & 3 \\ 1 & 1 & 0 & 2 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \quad \ell(T_3) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 3 & 0 & 1 & 2 \\ 3 & 1 & 0 & 2 \\ 2 & 1 & 1 & 0 \end{pmatrix}. \]

[Figure 4: the trees $T_1$, $T_2$, $T_3$, with leaves drawn from left to right in the orders 1 2 3 4, 1 2 3 4 and 1 4 3 2, respectively.]
Fig. 4. The non-weighted binary phylogenetic trees in Example 2.

From these vectors and matrices we obtain that
\[ d_p(T_1,T_2) = d_p(T_1,T_3) = \begin{cases} 4 & \text{if } p = 0 \\ \sqrt[p]{4} & \text{if } p \in \mathbb{N}^+ \\ 1 & \text{if } p = \infty \end{cases} \]
while
\[ d^s_p(T_1,T_2) = \begin{cases} 10 & \text{if } p = 0 \\ \sqrt[p]{6 + 4 \cdot 2^p} & \text{if } p \in \mathbb{N}^+ \\ 2 & \text{if } p = \infty \end{cases} \qquad d^s_p(T_1,T_3) = \begin{cases} 6 & \text{if } p = 0 \\ \sqrt[p]{6} & \text{if } p \in \mathbb{N}^+ \\ 1 & \text{if } p = \infty \end{cases} \]
This shows that there does not exist any $\lambda \in \mathbb{R}$ such that $d^s_p = \lambda \cdot d_p$ on $\mathcal{BT}_4$ for any $p \in \mathbb{N} \cup \{\infty\}$. Similar counterexamples can be produced for every $n > 4$.

The following inequality relates $d_p$ and $d^s_p$ on any $\mathcal{BT}_n$.

Proposition 4. For every $T_1,T_2 \in \mathcal{BT}_n$ and for every $p \in \mathbb{N} \cup \{\infty\}$,
\[ d_p(T_1,T_2) \leq \begin{cases} d^s_p(T_1,T_2) & \text{if } p = 0 \\ 2^{1-\frac{1}{p}}\, d^s_p(T_1,T_2) & \text{if } p \in \mathbb{N}^+ \\ 2\, d^s_p(T_1,T_2) & \text{if } p = \infty \end{cases} \]

Proof. For every $T \in \mathcal{BT}_n$, let $L^*(T)$ be the symmetric matrix $L^*(T) = \ell(T) + \ell(T)^t$. Notice that the $(i,j)$-th and the $(j,i)$-th entries of $L^*(T)$ are both equal to $L_T(i,j)$. Now, by the usual properties of norms,
\[ \|L^*(T_1) - L^*(T_2)\|_p = \|\ell(T_1) + \ell(T_1)^t - (\ell(T_2) + \ell(T_2)^t)\|_p \leq \|\ell(T_1) - \ell(T_2)\|_p + \|\ell(T_1)^t - \ell(T_2)^t\|_p = 2\,\|\ell(T_1) - \ell(T_2)\|_p. \]
On the other hand, $L^*(T_1) - L^*(T_2)$ can be understood as two concatenated copies of $L(T_1) - L(T_2)$ and therefore
\[ \|L^*(T_1) - L^*(T_2)\|_p = \begin{cases} 2\,\|L(T_1) - L(T_2)\|_p & \text{if } p = 0 \\ \sqrt[p]{2} \cdot \|L(T_1) - L(T_2)\|_p & \text{if } p \in \mathbb{N}^+ \\ \|L(T_1) - L(T_2)\|_p & \text{if } p = \infty \end{cases} \]
Combining this equality with the previous inequality we obtain the inequality in the statement. □

6 The non-weighted case

Although weights enrich the topological structure of a phylogenetic tree, for instance by adding probabilities, bootstrap values or divergence degrees to branches, the comparison of non-weighted phylogenetic trees, as bare hierarchical classifications or evolutive histories, has an interest in itself. Let $\mathcal{NT}_n$ denote the class of all non-weighted phylogenetic trees on $S = \{1,\ldots,n\}$. Felsenstein [12] gave a recurrent formula for the number $U(n,m)$ of different trees in $\mathcal{NT}_n$ with $m$ unlabeled internal nodes, from which the total number $|\mathcal{NT}_n|$ of different non-weighted phylogenetic trees on $n$ taxa can be computed: see Table 2 in [12] or sequence A005264 in [27]. Table 1 recalls the first values of $|\mathcal{NT}_n|$.

  n                  1    2    3     4      5       6         7
  |\mathcal{NT}_n|   1    3    22    262    4336    91984     2381408

Table 1. The values of $|\mathcal{NT}_n|$ for $n$ up to 7.

In this section we gather some results on the splitted nodal metrics $d^s_p$, for $p \in \mathbb{N}^+$, on $\mathcal{NT}_n$, and we report on some numerical experiments for $d^s_1$ and $d^s_2$ on this class.

To simplify the notations, for every $a,b \in S$ and $p \in \mathbb{N}^+$, we shall write $C^p_{T_1,T_2}(a,b)$ to denote $|\ell_{T_1}(a,b) - \ell_{T_2}(a,b)|^p$. In this way, if $T_1,T_2 \in \mathcal{NT}_n$ and $p \in \mathbb{N}^+$, then
\[ d^s_p(T_1,T_2)^p = \sum_{(a,b) \in S^2} C^p_{T_1,T_2}(a,b) \in \mathbb{N}. \]
Our first result shows that the metrics $d^s_p$ have a redundant factor on $\mathcal{NT}_n$ when $n$ is odd.

Lemma 4. If $n$ is odd, then $\|\ell(T)\|_1$ is even, for every $T \in \mathcal{NT}_n$.

Proof.
Let $T = (V,E)$ be a non-weighted phylogenetic tree on $S = \{1,\ldots,n\}$ with $n$ odd. For every $e \in E$, let $\nu_\ell(e)$ be the number of paths $[i,j] \rightsquigarrow i$, with $i,j \in S$, that contain the arc $e$. It is clear that
\[ \|\ell(T)\|_1 = \sum_{1 \leq i \neq j \leq n} \ell_T(i,j) = \sum_{e \in E} \nu_\ell(e). \]
It turns out that if $n$ is odd, then every $\nu_\ell(e)$ is even and therefore the right-hand side sum is even. Indeed, let $e = (u,v)$ be any arc and let $V_e$ be the set of descendant labeled nodes of $v$. Then, $e$ is contained in a path $[i,j] \rightsquigarrow i$ if, and only if, $i \in V_e$ and $j \notin V_e$. This shows that $\nu_\ell(e) = |V_e| \cdot |S - V_e|$. Now, since $|S|$ is odd, either $|V_e|$ or $|S - V_e|$ is even, which implies that $\nu_\ell(e)$ is even. □

Proposition 5. If $n$ is odd, then $d^s_p(T_1,T_2)^p$ is even, for every $T_1,T_2 \in \mathcal{NT}_n$ and for every $p \in \mathbb{N}^+$.

Proof. Let $T_1,T_2 \in \mathcal{NT}_n$, with $n$ odd. Then
\[ d^s_p(T_1,T_2)^p = \sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j). \]
Now, we know that $\sum_{1 \leq i \neq j \leq n} \ell_{T_1}(i,j)$ and $\sum_{1 \leq i \neq j \leq n} \ell_{T_2}(i,j)$ are even numbers. This implies that the number
\[ \bigl|\{(i,j) \in S^2 \mid C^p_{T_1,T_2}(i,j) \text{ odd}\}\bigr| = \bigl|\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) - \ell_{T_2}(i,j) \text{ odd}\}\bigr| \]
is even, and hence that the sum $\sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j)$ is even. □

This result shows that if $n$ is odd, $d^s_1$ takes only even values on $\mathcal{NT}_n$, and therefore it can be divided by 2 and the resulting values are still integer numbers. In a similar way, $d^s_2$ has a 'redundant' $\sqrt{2}$ factor on $\mathcal{NT}_n$, for $n$ odd. No similar result holds for even values of $n$: for instance, $\mathcal{NT}_2$ consists of three trees $T_1, T_2, T_3$, with Newick strings (1,2);, ((1)2);, and ((2)1);, respectively, and $d^s_1(T_1,T_2) = d^s_1(T_1,T_3) = 1$, $d^s_1(T_2,T_3) = 2$.

Remark 3. The conclusions of the last two results are true in the more general setting of $\mathbb{N}^+$-weighted phylogenetic trees.
To see it, notice that if $(T,\omega)$ is such a tree, then
\[ \|\ell(T)\|_1 = \sum_{1 \leq i \neq j \leq n} \ell_T(i,j) = \sum_{e \in E} \omega(e) \cdot \nu_\ell(e) \]
and then the proof that each $\nu_\ell(e)$ is even is the same as in the non-weighted case. On the other hand, the conclusion of the last proposition does not generalize to $p = 0$ or $\infty$: it is easy to produce counterexamples showing that $d^s_0$ and $d^s_\infty$ take odd values on $\mathcal{NT}_3$.

Our next goal is to find the least value for $d^s_p$ on $\mathcal{NT}_n$, for $p \in \mathbb{N}^+$.

Lemma 5. Let $T_1,T_2 \in \mathcal{NT}_n$ with $n \geq 6$ and $p \in \mathbb{N}^+$. If there is some taxon that is a leaf of largest depth in $T_1$ but not in $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$.

Proof. To simplify the notations, and since in this proof the trees $T_1, T_2$ and the index $p$ are fixed, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Assume, without any loss of generality, that 1 is a deepest leaf of $T_1$ and that 2 is a leaf of $T_2$ such that $\mathrm{depth}_{T_2}(2) > \mathrm{depth}_{T_2}(1)$. Then, the distance from $[1,2]_{T_2}$ to 2 will be larger than to 1. This implies that $\ell_{T_2}(2,1) > \ell_{T_2}(1,2)$. Since $\ell_{T_1}(2,1) \leq \ell_{T_1}(1,2)$ (because $\mathrm{depth}_{T_1}(2) \leq \mathrm{depth}_{T_1}(1)$), it must happen that $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or $\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$, and therefore
\[ C(1,2) + C(2,1) \geq 1. \]
Let us check now that, for every $a \in S \setminus \{1,2\}$, at least one of the following four equalities does not hold:
\[ \ell_{T_2}(1,a) = \ell_{T_1}(1,a), \quad \ell_{T_2}(2,a) = \ell_{T_1}(2,a), \quad \ell_{T_2}(a,1) = \ell_{T_1}(a,1), \quad \ell_{T_2}(a,2) = \ell_{T_1}(a,2). \tag{1} \]
This will imply that every $a \in S \setminus \{1,2\}$ contributes at least 1 to $d^s_p(T_1,T_2)^p$, in the sense that
\[ C(1,a) + C(2,a) + C(a,1) + C(a,2) \geq 1. \]
Since there are at least 4 taxa in $S \setminus \{1,2\}$ and these contributions are added to $C(1,2) + C(2,1) \geq 1$, this will prove that $d^s_p(T_1,T_2)^p \geq 5$.
The way each $a \in S \setminus \{1,2\}$ contributes to $d^s_p(T_1,T_2)^p$ depends on its relative position with respect to 1 and 2 in $T_2$.

– If $a \leq 1$ (that is, if $a$ is a descendant of the node labeled 1 in $T_2$), then $\ell_{T_2}(1,a) = 0$ but $\ell_{T_1}(1,a) > 0$ and therefore $\ell_{T_2}(1,a) \neq \ell_{T_1}(1,a)$.

– Assume that $[a,1]_{T_2} = [a,2]_{T_2} \geq [1,2]_{T_2}$. In this case $\ell_{T_2}(a,2) = \ell_{T_2}(a,1)$ and $\ell_{T_2}(2,a) > \ell_{T_2}(1,a)$. But these relations cannot hold in $T_1$, because they imply that $\mathrm{depth}_{T_1}(2) > \mathrm{depth}_{T_1}(1)$. Thus, the equalities (1) cannot hold simultaneously.

– Assume that $1 < [a,1]_{T_2} < [1,2]_{T_2}$. In this case $\lambda_{T_2}(a1|2) > 0$ and
\[ \ell_{T_2}(a,1) + \lambda_{T_2}(a1|2) = \ell_{T_2}(a,2), \qquad \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) = \ell_{T_2}(1,2), \qquad \ell_{T_2}(2,a) = \ell_{T_2}(2,1). \]
If $\ell_{T_1}(a,1) = \ell_{T_2}(a,1)$ and $\ell_{T_1}(a,2) = \ell_{T_2}(a,2)$, then the fact that $\ell_{T_1}(a,2) > \ell_{T_1}(a,1)$ implies that $1 < [a,1]_{T_1} < [1,2]_{T_1}$ and thus
\[ \lambda_{T_1}(a1|2) = \ell_{T_1}(a,2) - \ell_{T_1}(a,1) = \ell_{T_2}(a,2) - \ell_{T_2}(a,1) = \lambda_{T_2}(a1|2). \]
Then, if $\ell_{T_1}(1,a) = \ell_{T_2}(1,a)$,
\[ \ell_{T_1}(1,2) = \ell_{T_1}(1,a) + \lambda_{T_1}(a1|2) = \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) = \ell_{T_2}(1,2). \]
Finally, if $\ell_{T_1}(2,a) = \ell_{T_2}(2,a)$, then
\[ \ell_{T_1}(2,1) = \ell_{T_1}(2,a) = \ell_{T_2}(2,a) = \ell_{T_2}(2,1). \]
And this leads to a contradiction, because, as we have seen at the beginning of the proof, $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or $\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$. Therefore, the equalities (1) cannot hold simultaneously.

– If $2 < [a,2]_{T_2} < [1,2]_{T_2}$, a similar argument shows that at least one of the equalities (1) fails, too.

This finishes the proof of the lemma. □

[Figure 5: two trees $T$ and $T'$ on the taxa $1, 2, 3, 4, \ldots, n$.]
Fig. 5. Two non-isomorphic phylogenetic trees in $\mathcal{NT}_n$ such that $d^s_p(T,T')^p = 4$ for every $p \in \mathbb{N}^+$.

Theorem 3. For every $p \in \mathbb{N}^+$ and for every $n \geq 2$:
(1) If $n \leq 5$, then $\min\{d^s_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = n - 1$.
(2) If $n \geq 6$, then $\min\{d^s_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = 4$.

Proof. The cases $n = 1$ to 5 can be checked 'by hand' through the computation of the distances between all pairs of trees in $\mathcal{NT}_n$. In the case $n = 1$, there is only one tree in $\mathcal{NT}_1$, and, as we mentioned after Lemma 4, $\mathcal{NT}_2$ consists only of three trees $T_1, T_2, T_3$, with Newick strings (1,2);, ((1)2);, and ((2)1);, respectively, and it can be seen that $d^s_p(T_1,T_2)^p = d^s_p(T_1,T_3)^p = 1$, $d^s_p(T_2,T_3)^p = 2$. As far as the cases $n = 3, 4, 5$ go, the files {3,4,5}-tree-nt-pairs.dat available at the Supplementary Material web page contain the values of $d^s_p(T_1,T_2)^p$ for each (unordered) pair of trees $\{T_1,T_2\}$ in the corresponding $\mathcal{NT}_n$.

Now, for $n \geq 5$, we shall prove by induction on $n$ that $d^s_p(T_1,T_2)^p \geq 4$ for every pair of different trees $T_1,T_2 \in \mathcal{NT}_n$. Since it is easy to produce pairs of trees $T_1,T_2 \in \mathcal{NT}_n$ such that $d^s_p(T_1,T_2)^p = 4$, like for instance those depicted in Fig. 5, this will finish the proof of the statement.

The starting point for the induction procedure is $n = 5$: we know (by direct inspection of the file 5-tree-nt-pairs.dat) that $d^s_p(T_1,T_2)^p \geq 4$ for every pair of different trees $T_1,T_2 \in \mathcal{NT}_5$. Assume now that this inequality holds for every two different trees in $\mathcal{NT}_n$, for some $n \geq 5$, and let us prove it for $\mathcal{NT}_{n+1}$. So, let $T_1,T_2 \in \mathcal{NT}_{n+1}$ be a pair of different trees. As in the last proof, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Without any loss of generality, we assume that $n+1$ is a leaf of largest depth in $T_1$. By Lemma 5, if $n+1$ is not a deepest leaf of $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$.
So, in the rest of the proof we assume that $n+1$ is also a deepest leaf of $T_2$. In particular, in both trees, the siblings of $n+1$ (if they exist) are also deepest leaves. We distinguish now two main cases, each one divided into several subcases.

(a) Assume that the parent of $n+1$ in $T_1$ is labeled, say with $n$. This implies that
\[ \ell_{T_1}(n,n+1) = 0, \qquad \ell_{T_1}(n+1,n) = 1, \]
\[ \ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) + 1 \quad \text{for every } a \in S \setminus \{n,n+1\}, \]
\[ \ell_{T_1}(a,n+1) = \ell_{T_1}(a,n) \quad \text{for every } a \in S \setminus \{n,n+1\}. \]
We distinguish the following subcases.

(a.1) Assume that, in $T_2$, the node $n$ is an ancestor of $n+1$, but not its parent. In this case, $\ell_{T_2}(n+1,n) > 1$, and therefore
\[ C(n+1,n) \geq 1. \]
Now, let $a \in S \setminus \{n,n+1\}$. Let us see that $a$ contributes at least 1 to $d^s_p(T_1,T_2)^p$.

– If $n > [a,n+1]_{T_2}$ (that is, if $a$ is a descendant of an intermediate node in the path $n \rightsquigarrow n+1$), then $\ell_{T_2}(a,n+1) < \ell_{T_2}(a,n)$ and therefore, since $\ell_{T_1}(a,n+1) = \ell_{T_1}(a,n)$, it must happen that $\ell_{T_1}(a,n+1) \neq \ell_{T_2}(a,n+1)$ or $\ell_{T_1}(a,n) \neq \ell_{T_2}(a,n)$, which implies that
\[ C(a,n) + C(a,n+1) \geq 1. \]

– If $n \leq [a,n+1]_{T_2}$ in $T_2$, then
\[ \ell_{T_2}(n+1,a) = \ell_{T_2}(n,a) + \ell_{T_2}(n+1,n) > \ell_{T_2}(n,a) + 1, \]
and therefore, since $\ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) + 1$, it must happen that $\ell_{T_1}(n+1,a) \neq \ell_{T_2}(n+1,a)$ or $\ell_{T_1}(n,a) \neq \ell_{T_2}(n,a)$, and hence
\[ C(n,a) + C(n+1,a) \geq 1. \]

Since there are at least 4 taxa other than $n$ and $n+1$, and their contributions are added to $C(n+1,n) \geq 1$, we conclude that, in this case, $d^s_p(T_1,T_2)^p \geq 5$.

(a.2) Assume that, in $T_2$, the node $n$ is not an ancestor of $n+1$; set
\[ \ell_{T_2}(n,n+1) = x \geq 1, \qquad \ell_{T_2}(n+1,n) = y \geq 1. \]
If $x \geq y$, then $\mathrm{depth}_{T_2}(n) \geq \mathrm{depth}_{T_2}(n+1)$ and thus, since $n+1$ was a deepest leaf of $T_2$, $n$ would also be a deepest leaf of $T_2$. But $n$ is not a deepest leaf of $T_1$ and therefore, in this case, we already know by Lemma 5 that $d^s_p(T_1,T_2)^p \geq 5$.

Assume now that $x < y$. Then, $y \geq 2$ and thus, on the one hand,
\[ C(n+1,n) + C(n,n+1) = (y-1)^p + x^p \geq 2 \]
and, on the other hand, the path $[n+1,n]_{T_2} \rightsquigarrow n+1$ has at least one intermediate node: let $a_0 \neq n+1$ be a labeled node that is a descendant of the parent of $n+1$ (notice that, in this case, $a_0$ is either the parent of $n+1$ or its sibling). Then,
\[ \ell_{T_2}(a_0,n+1) < \ell_{T_2}(a_0,n), \qquad \ell_{T_2}(n+1,a_0) = 1 \leq \ell_{T_2}(n,a_0) \]
imply that
\[ C(a_0,n+1) + C(a_0,n) \geq 1, \qquad C(n+1,a_0) + C(n,a_0) \geq 1. \]
So, in this case, $d^s_p(T_1,T_2)^p \geq 4$.

(a.3) Assume that, in $T_2$, the node $n+1$ is a leaf and its parent is $n$. Let $T_1^*, T_2^* \in \mathcal{NT}_n$ be the trees obtained from $T_1$ and $T_2$, respectively, by removing the leaf $n+1$ together with its pendant arc. Notice that $T_1^* \neq T_2^*$, because $T_1$ and $T_2$ are recovered from them by the same operation (adding the leaf $n+1$ as a child of $n$). After this operation, we have that, for every $1 \leq a \neq b \leq n$, $\ell_{T_i^*}(a,b) = \ell_{T_i}(a,b)$ and therefore $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then,
\[ d^s_p(T_1,T_2)^p \geq \sum_{1 \leq a \neq b \leq n} C(a,b) = \sum_{1 \leq a \neq b \leq n} C^p_{T_1^*,T_2^*}(a,b) = d^s_p(T_1^*,T_2^*)^p \geq 4, \]
the last inequality being given by the induction hypothesis.

(b) Assume now that the parent of $n+1$ in $T_1$ is not labeled. Therefore, $n+1$ must have at least one sibling, which, we recall, is a leaf. Without any loss of generality we assume that $n$ is a sibling of $n+1$.
In this case, we have that
\[ \ell_{T_1}(n,n+1) = \ell_{T_1}(n+1,n) = 1, \]
\[ \ell_{T_1}(n+1,a) = \ell_{T_1}(n,a) > 0 \quad \text{for every } a \in S \setminus \{n,n+1\}, \]
\[ \ell_{T_1}(a,n+1) = \ell_{T_1}(a,n) \quad \text{for every } a \in S \setminus \{n,n+1\}. \]
Notice moreover that $n$ is also a deepest leaf in $T_1$ and therefore, by Lemma 5, if it is not a deepest leaf in $T_2$, then $d^s_p(T_1,T_2)^p \geq 5$. So, we assume henceforth that $n$ and $n+1$ are deepest leaves in $T_2$. As in (a), there are several subcases to discuss.

(b.1) Assume that, in $T_2$, the leaves $n$ and $n+1$ are not siblings. In this case,
\[ \ell_{T_2}(n,n+1) = x \geq 1, \qquad \ell_{T_2}(n+1,n) = y \geq 1, \]
and $x > 1$ or $y > 1$. Since the depths of $n$ and $n+1$ in $T_2$ are the same, it must happen that $x = y$. Then,
\[ C(n,n+1) + C(n+1,n) = (x-1)^p + (x-1)^p \geq 2. \]
Let now $a_0$ be a labeled node, other than $n$, that is a descendant of the parent of $n$ in $T_2$: notice that this parent is an intermediate node in the path $[n,n+1]_{T_2} \rightsquigarrow n$. Then,
\[ \ell_{T_2}(n,a_0) = 1 < x = \ell_{T_2}(n+1,a_0), \qquad \ell_{T_2}(a_0,n) < \ell_{T_2}(a_0,n+1) \]
imply that $a_0$ contributes at least 2 to $d^s_p(T_1,T_2)^p$, and therefore that $d^s_p(T_1,T_2)^p \geq 4$. Actually, $d^s_p(T_1,T_2)^p \geq 6$, because any labeled node $b_0 \neq n+1$ that is a descendant of the parent of $n+1$ in $T_2$ will also contribute at least 2 to $d^s_p(T_1,T_2)^p$.

(b.2) Assume that, in $T_2$, the leaves $n$ and $n+1$ are siblings and their parent is labeled, say with 1. In this case, by (a) (applied interchanging the roles of $T_1$ and $T_2$ and the roles of $n$ and 1), we already know that $d^s_p(T_1,T_2)^p \geq 4$.

(b.3) Assume that, in $T_2$, the nodes $n$ and $n+1$ are sibling leaves and their parent is not labeled.
In this case, let $T_1^*, T_2^* \in \mathcal{NT}_n$ be the trees obtained from $T_1$ and $T_2$, respectively, by removing the leaves $n$ and $n+1$ together with their pendant arcs, and labeling with $n$ the former parent of $n$ and $n+1$. In this way we have that, for every $1 \leq a \neq b \leq n$ and for every $i = 1,2$,
\[ \ell_{T_i^*}(a,b) = \ell_{T_i}(a,b) \ \text{ if } a \neq n, \qquad \ell_{T_i^*}(n,b) = \ell_{T_i}(n,b) - 1 \ \text{ if } a = n, \]
and therefore $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then, arguing as in (a.3),
\[ d^s_p(T_1,T_2)^p \geq d^s_p(T_1^*,T_2^*)^p \geq 4. \]
This finishes the proof by induction. □

Remark 4. Following in detail the arguments developed in the last theorem to their last consequences, it can be proved that, for $n \geq 6$, the pairs of trees $T_1,T_2$ in $\mathcal{NT}_n$ such that $d^s_p(T_1,T_2)^p = 4$, for every $p \in \mathbb{N}^+$, are exactly those pairs such that $d_1(T_1,T_2) = 4$, and they have the following form. Let $i_1, i_2, i_3$ be any three taxa in $S$ and let $T_0$ be any non-weighted rooted tree with some of its nodes, including all its elementary nodes and all its leaves except at most one elementary node or one leaf, labeled in $S \setminus \{i_1,i_2,i_3\}$. Then, $T_1$ and $T_2$ are obtained, respectively, by attaching to $T_0$ at the same node the 'basic' trees $T_1'$ and $T_2'$, or $T_1''$ and $T_2''$, in Fig. 6. The attachment of one of these trees at a node $v$ in $T_0$ is carried out by identifying the node with the root of the tree, and in such a way that the resulting trees $T_1$ and $T_2$ have all their leaves and elementary nodes labeled. This implies that if $T_0$ has some non-labeled leaf or elementary node, this is necessarily the node where the basic trees must be attached, and that (since $T_2''$ has its root elementary) the basic pair $T_1'', T_2''$ cannot be attached to a non-labeled leaf (this would create an elementary node in $T_2$). For instance, the trees $T$ and $T'$ in Fig. 5 are obtained by attaching the basic trees $T_1'$ and $T_2'$ (with $i_1 = 1$, $i_2 = 2$, and $i_3 = 3$) to the tree with Newick code (4,...,n);.

[Figure 6: the basic trees $T_1'$, $T_2'$, $T_1''$, $T_2''$ on the taxa $i_1, i_2, i_3$.]
Fig. 6. The pairs of basic trees that give rise, when attached to the same place in a tree, to pairs of non-weighted phylogenetic trees at $d^s_p$ distance $\sqrt[p]{4}$.

Remark 5. It can be checked that the pairs of different trees in $\mathcal{NT}_n$ at the least distance for $d^s_1$ always have splitted path lengths matrices with $n-1$ (if $n \leq 5$) or 4 (if $n \geq 5$) differing entries, each of which differs by exactly 1. This implies that the least non-zero value for $d^s_\infty$ on $\mathcal{NT}_n$ is always 1, and that the least non-zero value for $d^s_0$ on $\mathcal{NT}_n$ is again $n-1$ for $n \leq 5$ and 4 for $n \geq 6$.

Unfortunately, we have not been able to find a formula for the diameter of $\mathcal{NT}_n$ with respect to any metric $d^s_p$ with $p \in \mathbb{N}^+$. Actually, and to our knowledge, the diameter of the space of non-weighted binary phylogenetic trees with respect to the nodal metrics $d_1$ and $d_2$ is still not known, either. Not knowing a formula for the diameter, we are not able to give an explicit description of the distribution of distances for any $p$, either. In the file distributions.pdf in the Supplementary Material we provide the distributions of $d^s_1$ and $(d^s_2)^2$ (that is, of $d^s_2$ squared) on $\mathcal{NT}_n$ for $n = 3, 4, 5, 6$, as well as the distributions of the values of $d^s_1$ and $(d^s_2)^2$ applied to pairs of trees in TreeBASE sharing $n = 2$ to 6 labels.

7 Conclusions

Some classical metrics for phylogenetic trees are based on the comparison of the representations of rooted phylogenetic trees as vectors of path lengths between pairs of labeled nodes.
But these metrics only separate non-weighted binary rooted trees: two more general non-isomorphic rooted phylogenetic trees can have the same vectors of path lengths, and therefore be at zero distance for these metrics.

In this paper we have overcome this problem by representing a rooted phylogenetic tree by means of a matrix with rows and columns indexed by taxa and where every entry $(i,j)$ is the distance from the least common ancestor of the pair of nodes labeled with $i$ and $j$ to the node labeled with $i$. We call these matrices splitted path lengths matrices, because they split in two terms the path length between every pair of labeled nodes.

These matrices define an injective mapping from the space $\mathcal{T}_n$ of all $\mathbb{R}_{>0}$-weighted rooted phylogenetic trees with $n$ labeled nodes and possibly nested taxa into the set $\mathcal{M}_n(\mathbb{R})$ of $n \times n$ real-valued matrices. Therefore, any norm on $\mathcal{M}_n(\mathbb{R})$ applied to the difference of the splitted path lengths matrices of trees defines a metric on $\mathcal{T}_n$. Using the well-known $L^p$ norms on $\mathcal{M}_n(\mathbb{R})$, for $p \in \mathbb{N} \cup \{\infty\}$, we obtain the family of splitted nodal metrics $d^s_p$ on $\mathcal{T}_n$:
\[ d^s_p(T_1,T_2) = \begin{cases} \bigl|\{(i,j) \mid 1 \leq i \neq j \leq n,\ \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}\bigr| & \text{if } p = 0 \\ \sqrt[p]{\sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|^p} & \text{if } p \in \mathbb{N}^+ \\ \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid 1 \leq i \neq j \leq n\} & \text{if } p = \infty \end{cases} \]

We have proved several properties for these metrics $d^s_p$ on the subspace $\mathcal{NT}_n$ of non-weighted rooted phylogenetic trees possibly with nested taxa. For instance, we have established the least distance between any pair of such trees. It remains as an open problem to find the diameter of $\mathcal{NT}_n$ with respect to these metrics, and the distribution of their values.
Actually, these problems also remain open for the classical nodal distances on non-weighted binary (rooted as well as unrooted) trees. These are interesting problems: knowing the largest value reached by a metric is necessary to normalize the metric between 0 and 1, while knowing the distribution of the values allows one to answer the question of whether two trees are more similar than expected by chance [19]. We hope to report on these problems in the near future.

We cannot advocate the use of any splitted nodal metric $d^s_p$ over the other ones except, perhaps, warning against the use of
\[ d^s_0(T_1,T_2) = \bigl|\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}\bigr|, \qquad d^s_\infty(T_1,T_2) = \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid (i,j) \in S^2\} \]
because they are too uninformative. Since the most popular norms on $\mathbb{R}^m$ are the Manhattan and the Euclidean, it seems natural to use $d^s_1$ and $d^s_2$, as has been the case in the classical, non-weighted binary setting. Each one has its advantages. For instance, the computation of $d^s_1$ does not involve square roots, and therefore it can be computed exactly and, if the weights are integer numbers, the resulting value is an integer number. Moreover, it is well known that, for every $p \in \mathbb{N}^+$, $\|x\|_p \leq \|x\|_1$ for every $x \in \mathbb{R}^m$ and therefore $d^s_p(T_1,T_2) \leq d^s_1(T_1,T_2)$ for every $T_1,T_2 \in \mathcal{T}_n$. On the other hand, the comparison of splitted path lengths matrices by means of the Euclidean norm enables the use of many geometric and clustering methods that are not available otherwise. For instance, the specific properties of the Euclidean norm allowed Steel and Penny to compute explicitly the mean value of the nodal distance $d_2$ on the class of non-weighted unrooted binary trees [29], while no similar result is known for $d_1$.
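The two elementary observations in the previous paragraph (integer values of $d^s_1$ for integer weights, and $d^s_p \leq d^s_1$) can be checked on a small instance. The following Python sketch is only an illustration: it takes the matrices $\ell(T_1)$ and $\ell(T_2)$ from Example 2 as given, and the names `L1`, `L2`, `power_sum` and `splitted_nodal` are assumptions of this sketch, not notation from the paper.

```python
# Splitted path lengths matrices of T1 and T2, copied from Example 2.
L1 = [[0, 1, 1, 1], [2, 0, 1, 1], [3, 2, 0, 1], [3, 2, 1, 0]]
L2 = [[0, 1, 2, 3], [1, 0, 2, 3], [1, 1, 0, 2], [1, 1, 1, 0]]

def power_sum(m1, m2, p):
    """Return d^s_p(T1, T2)^p: the sum of the p-th powers of the entrywise differences."""
    return sum(abs(x - y) ** p for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

def splitted_nodal(m1, m2, p):
    """Return d^s_p for an integer p >= 1; only p >= 2 needs a (floating point) root."""
    total = power_sum(m1, m2, p)
    return total if p == 1 else total ** (1 / p)

d1 = splitted_nodal(L1, L2, 1)                                 # exact integer arithmetic
values = {p: splitted_nodal(L1, L2, p) for p in range(1, 6)}   # d^s_1, ..., d^s_5

for p, value in values.items():
    print(p, value, value <= d1)
```

Keeping `power_sum` separate makes it possible to work with the integer $d^s_p(T_1,T_2)^p$, for instance $(d^s_2)^2$, without taking any root at all.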
As a rule of thumb, we consider it suitable to use $d^s_1$ when the trees are non-weighted (or when they have integer weights), because these trees can be seen as discrete objects and thus their comparison through a discrete tool such as the Manhattan norm seems appropriate. When the trees have arbitrary positive real weights, they should be understood as belonging to a continuous space [5], and then the Euclidean norm is more appropriate.

Supplementary Material

The Supplementary Material referenced in the paper is available at http://bioinfo.uib.es/~recerca/phylotrees/nodal/.

Acknowledgements: The research described in this paper has been partially supported by the Spanish DGI projects MTM2006-07773 COMGRIO and MTM2006-15038-C02-01.

References

1. H. Abdi, Additive-tree representations, Lecture Notes in Biomathematics 84 (1990) 43–59.
2. B. L. Allen, M. A. Steel, Subtree transfer operations and their induced metrics on evolutionary trees, Ann. Combin. 5 (2001) 1–13.
3. V. Batagelj, T. Pisanski, J. M. S. Simões-Pereira, An algorithm for tree-realizability of distance matrices, Int. J. Comput. Math. 34 (3) (1990) 171–176.
4. P. Batra, Newton's method and the computational complexity of the fundamental theorem of algebra, Electron. Notes Theor. Comput. Sci. 202 (2008) 201–218.
5. L. J. Billera, S. P. Holmes, K. Vogtmann, Geometry of the space of phylogenetic trees, Adv. Appl. Math. 27 (1) (2001) 733–767.
6. J. Bluis, D.-G. Shin, Nodal distance algorithm: Calculating a phylogenetic tree comparison metric, in: Proc. 3rd IEEE Symp. BioInformatics and BioEngineering, 2003.
7. F. T. Boesch, Properties of the distance matrix of a tree, Q. Appl. Math. 16 (1968) 607–609.
8. P. Buneman, The recovery of trees from measures of dissimilarity, in: J. H. et al (ed.), Mathematics in the archaeological and historical sciences, Edinburgh University Press, 1969, pp. 387–395.
9. D. E. Critchlow, D. K. Pearl, C. Qian, The triples distance for rooted bifurcating phylogenetic trees, Syst. Biol. 45 (3) (1996) 323–334.
10. J. S. Farris, A successive approximations approach to character weighting, Syst. Zool. 18 (1969) 374–385.
11. J. S. Farris, On comparing the shapes of taxonomic trees, Syst. Zool. 22 (1973) 50–54.
12. J. Felsenstein, The number of evolutionary trees, Syst. Zool. 27 (1978) 27–33.
13. J. Felsenstein, Inferring Phylogenies, Sinauer Associates Inc., 2004.
14. S. Grünewald, K. T. Huber, V. Moulton, C. Semple, Encoding phylogenetic trees in terms of weighted quartets, J. Math. Biol. 56 (4) (2008) 465–477.
15. J. Handl, J. Knowles, D. B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics 21 (15) (2005) 3201–3212.
16. K. Hoef-Emden, Molecular phylogenetic analyses and real-life data, Computing in Science and Engineering 7 (3) (2005) 86–91.
17. F. Leonardi, S. R. Matioli, H. A. Armelin, A. Galves, Detecting phylogenetic relations out from sparse context trees, http://arxiv.org/abs/0804.4279.
18. R. D. M. Page, Phyloinformatics: Toward a phylogenetic database, in: J. T.-L. Wang, M. J. Zaki, H. Toivonen, D. Shasha (eds.), Data Mining in Bioinformatics, Springer-Verlag, 2005, pp. 219–241.
19. D. Penny, M. D. Hendy, The use of tree comparison metrics, Syst. Zool. 34 (1) (1985) 75–82.
20. J. B. Phipps, Dendrogram topology, Syst. Zool. 20 (1971) 306–308.
21. P. Puigbò, S. Garcia-Vallvé, J. McInerney, TOPD/FMTS: a new software to compare phylogenetic trees, Bioinformatics 23 (12) (2007) 1556–1558.
22. D. F. Robinson, L. R. Foulds, Comparison of weighted labelled trees, in: Proc. 6th Australian Conf. Combinatorial Mathematics, vol. 748 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1979.
23. D. F. Robinson, L. R. Foulds, Comparison of phylogenetic trees, Math. Biosci. 53 (1/2) (1981) 131–147.
24. A. Rokas, Genomics and the tree of life, Science 313 (5795) (2006) 1897–1899.
25. C. Semple, M. Steel, Phylogenetics, Oxford University Press, 2003.
26. J. M. S. Simões-Pereira, A note on the tree realizability of a distance, J. Comb. Th. B 6 (3) (1969) 303–310.
27. N. J. A. Sloane, The On-Line Encyclopedia of Integer Sequences, published electronically at www.research.att.com/~njas/sequences/.
28. Y. A. Smolenskii, A method for the linear recording of graphs, USSR Computational Mathematics and Mathematical Physics 2 (1963) 396–397.
29. M. A. Steel, D. Penny, Distributions of tree comparison metrics: some new results, Syst. Biol. 42 (2) (1993) 126–141.
30. M. S. Waterman, T. F. Smith, On the similarity of dendrograms, J. Theor. Biol. 73 (1978) 789–800.
31. W. T. Williams, H. T. Clifford, On the comparison of two classifications of the same set of elements, Taxon 20 (4) (1971) 519–522.
32. K. A. Zaretskii, Construction of a tree from the collection of distances between suspending vertices, Uspekhi Matematicheskikh Nauka 6 (1965) 90–92, in Russian.
