Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Lik eliho o d-based tree reconstruction on a concatenation of alignments can b e p ositiv ely misleading Sebastien Ro c h a and Mik e Steel b a Dep artment of Mathematics, University of Wisconsin–Madison, Madison WI, USA b MS Biomathematics R ese ar ch Centr e, University of Canterbury, Christchur ch, New Ze aland Abstract The reconstruction of a sp ecies tree from genomic data faces a double h urdle. First, the (gene) tree describing the evolution of each gene may diﬀer from the sp ecies tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leav es of eac h gene tree provide merely an imp erfect estimate of the topology of the gene tree. In this note, w e demonstrate formally that a basic statistical problem arises if one tries to a v oid accounting for these t w o processes and analyses the genetic data directly via a concatenation approac h. More precisely , we show that, under the m ulti-sp ecies coalescent with a standard site substitution mo del, maximum likelihoo d estimation on sequence data that has been concatenated across genes and p erformed under the incorrect assumption that all sites ha ve ev olv ed indep enden tly and iden tically on a ﬁxed tree is a statistically inconsistent estimator of the sp ecies tree. Our results provide a formal justiﬁcation of simulation results describ ed of Kubatko and Degnan (2007) and others, and complements recen t theoretical results by DeGorgio and Degnan (2010) and Chifman and Kubtak o (2014). Keywor ds: Ph ylogenetic reconstruction, incomplete lineage sorting, maximum lik eliho o d, consistency Pr eprint submitte d to Elsevier January 17, 2021 1. Introduction Mo dern molecular sequencing tec hnology has pro vided a wealth of data to help biologists infer evolutionary relationships b et ween sp ecies. Not only is it p ossible to quickly sequence a single gene across a wide range of sp ecies, but h undreds, or ev en thousands of genes can also be sequenced across those taxa. But with this abundance of data comes new statistical and mathematical c hallenges. These arise because tree inference requires dealing with the interpla y of t wo random pro cesses, as we now explain. F or each gene, the asso ciated aligned sequence data pro vides an estimate of the evolutionary gene tr e e that describ es the ancestry of this gene as one traces bac k its ancestry in time (each copy b eing inherited from one paren t in the previous generation). Moreo ver, giv en suﬃciently long sequences, several metho ds (e.g. maximum likelihoo d and corrected distance metho ds) hav e b een sho wn to b e statically consisten t estimators of the gene tree top ology under v arious site substitution mo dels [7]. ‘Statistical consistency’ here refers to the usual notion in molecular ph ylogenetics, namely that as the sequence length gro ws, the probabilit y that the correct gene tree top ology is returned from the data conv erges to 1 as the n um b er of sites grow. Here the site patterns generated indep enden tly and identically (i.i.d.) under the substitution model on a binary (fully-resolv ed) gene tree. But inferring a gene tree is only part of the puzzle of reconstructing the main ev olutionary ob ject of interest in biology – namely a sp e cies tr e e . This latter tree describ es, on a broad (macro ev olutionary) scale, ho w lineages (consisting of p opulations of a sp ecies) successively separated and diverged from each other o ver evolutionary time scales, with some lineages forming new sp ecies, ulti- mately leading to the giv en taxa observed at the presen t (a precise deﬁnition of a sp ecies-lev el phylogenetic tree is problematic as it requires ﬁrst agreeing on a deﬁnition of ‘sp ecies’, for whic h there are m ultitude of diﬀering opinions) [19, 23]. A sp ecies tree, together with the length (time-scale) and width (popula- tion size) of its branches, induces a probability distribution on the p ossible gene trees and, when the discordance b et ween gene trees is attributed to incomplete lineage sorting, this probability distribution can b e describ ed by the so-called multi-sp e cies c o alesc ent process (details are provided in the recen t b o ok b y [12]). This pro cess extends the celebrated Kingman c o alesc ent pro cess from a single p opulation to a ph ylogenetic tree, where the latter can be view ed as a ‘tree of p opulations’ The relationship betw een gene trees and sp ecies trees has attracted a go od deal of attention from mathematicians and statisticians ov er the last decade or so [5, 11, 17, 18, 24, 26]. An early and easily veriﬁed result is that for three taxa, the most probable gene tree top ology under the m ulti-sp ecies coalescen t matc hes the sp ecies tree (the other tw o comp eting binary top ologies hav e equal but lo w er probabilit y) [29]. Consequently , estimating the sp ecies tree b y the gene tree that appears most frequently is a statistically consistent metho d (under the m ulti-sp ecies coalescen t) when we hav e just three taxa. Moreo ver, when there are more than three taxa, one can still estimate a species tree consistently , for 2 example, by estimating all the ro oted triples, and using these to reconstruct the sp ecies tree top ology [4]. Ho wev er, the alternative simple ‘ma jority rule’ strategy of estimating the sp ecies tree b y merely taking the most frequen t gene tree falls apart when w e ha ve more than than three species. With four taxa, the most probable gene tree top ology can diﬀer from certain (unbalanced) species tree top ologies, while for ﬁv e or more taxa a more striking result applies – every species tree top ology has branc h lengths for which the most probable gene tree top ology diﬀers from that of the sp ecies tree (for details, see [5]). Nevertheless one can still infer a sp ecies tree in a statistically consistent manner from a series of gene trees generated i.i.d. by the m ulti-sp ecies coalescent pro cess, and several techniques ha ve b een dev elop ed for this (see e.g. [22]). There are also additional mec hanisms that can lead to conﬂict b et ween gene trees and species trees, including reticulate evolu- tion (e.g. the formation of h ybrid sp ecies), lateral gene transfer (in prok aryotic taxa such as bacteria) and gene duplication and loss, but w e do not consider these pro cesses here. W e hav e so far discussed these t wo random pro cesses – the evolution of se- quence site patterns on a gene tree under a site-substitution mo del, and the random generation of gene trees from the sp ecies tree under the m ulti-species coalescen t pro cess – as separate process. But in reality these tw o processes w ork in concert, a gene tree will ha v e a random topology (determined by the multi- sp ecies coalescent on the species tree) and on this random gene tree sequences will evolv e according to a substitution pro cess. Th us, it is not immediately ob vious whether metho ds exist for inferring a sp ecies tree top ology directly from a series of aligned sequences (one for eac h gene) which would b e statisti- cally consistent as the num ber of genes gro ws. Using techniques from algebraic statistics, Chifman and Kubatko [2] recen tly established that the species tree top ology (up to the placement of the ro ot) is an identiﬁable discrete parameter under the combined substitution–coalescence process. Moreo ver they describe an explicit metho d for estimating the sp ecies tree based on ph ylogenetic inv ari- an ts and singular v alue decomp osition techniques. F or Bay esian inference of sp ecies trees directly from sequence data (e.g. via the program *BEAST, [10]) the statistical consistency has also b een formally established [28]. In this paper we consider a simpler and alternative strategy that has b een used widely for inferring the sp ecies tree directly from sequence data, namely concatenation of sequences (e.g. [20, 25]). In its simplest form, this strategy simply concatenates all the sequences, and treats them as though each site had evolv ed i.i.d. on a ﬁxed tree. Kubatko and Degnan [13] used sim ulations to study the p erformance of suc h a concatenation approach, and their ﬁnding suggested that it could lead to misleading ph ylogenetic estimates. Nev ertheless, the accuracy of concatenation metho ds is still very m uch under debate (e.g. [9, 27, 31]). While many sim ulation studies hav e concluded that concatenation metho ds are signiﬁcantly less accurate than ILS-based metho ds or are prone to pro ducing erroneous estimates with high conﬁdence [10, 13, 14, 15, 16], others ha ve found that they can be more accurate under s ome conditions (such as low ph ylogenetic signal) [1, 8, 21]. Moreov er, a formal pro of of whether or not a 3 standard statistical method, suc h as maxim um likelihoo d (ML), is statistically consisten t as an estimator of tree top ology based on concatenated sequences has nev er b een presented, with the exception of the w ork of DeGiorgio and Degnan [3] who established the consistency of ML in the sp ecial case of three taxa under a molecular clo c k. This is the motiv ation for our current pap er. W e consider what happ ens when ML is applied under the assumption that the sites evolv e i.i.d. on a ﬁxed tree (in k eeping with the concatenation approac h). Our main result (Theorem 1) sho ws that ML is statistically inconsisten t as an estimator of tree top ology . Indeed the probability that the true sp ecies tree is an ML tree can b e made as small as we wish in the limit as the num ber of genes grows (ev en with six taxa). What mak es this result non-trivial is that studying the b eha vior of mis-speciﬁed lik eliho o ds can b e challenging. Our pro of of inconsistency in v olves com bining a n umber of arguments and results, including a classic result in p opulations genetics (the ‘Ewens’ Sampling formula’), a formal link age betw een likelihoo d and parsimony , and the in terplay of v arious concentration and appro ximations b ounds. 2. Deﬁnitions and main result Consider: • a species tree top ology T together with branch lengths L (which, for each edge e of T , combine temp oral branc h lengths ( t e ) and an eﬀective p opu- lation size for that edge N e – note the subscript e here refers to the edge e not ‘eﬀectiv e’). • g alignments A 1 , A 2 , . . . , A g , where A i consists of sequences of length ` = ` ( g ) ev olved i.i.d. under a symmetric r -state site substitution mo del at substitution rate θ on the random gene tree (with associated branch lengths) that is generated by ( T , L ) via the m ultisp ecies coalescent model. That is, on each branc h of T , lo oking backw ards in time, lineages entering the branch coalesce at constant rate according to the Kingman coalescent with ﬁxed p opulation size. The remaining lineages at the top of the branch en ter the ancestral p opulation. F or eac h locus, conditioned on the gen- erated gene tree, the alignment is generated according to the symmetric r -state mo del. • maximum likelihoo d tree(s) T M L for the concatenated alignment A 1 A 2 · · · A g inferred under the assumption that all sites evolv e i.i.d. on a tree accord- ing to the symmetric r -state site substitution mo del (for branch lengths that are optimized, as usual, as part of the ML estimation). Let P ( T , L, r, g , `, θ ) b e the probability that T has the same unro oted top ol- ogy as (at least one) ML tree T M L . Our main result can be stated as follows. 4 Theorem 1. Under the mo del describ e d ab ove, ther e exist tr e e top olo gies T with br anch lengths L for T , and a site substitution r ate θ suﬃciently smal l, for which the fol lowing holds: F or any δ > 0 , ther e is a value g 0 so that P ( T , L, r, g , `, θ ) ≤ δ for al l g ≥ g 0 , and for al l se quenc e length functions ` = ` ( g ) . 3. Heuristic argumen t and a key preliminary result The formal pro of of Theorem 1 is presen ted in the next section. Here w e describ e the idea of the pro of, and establish a preliminary result that is central to the pro of. In the anomaly zone, the most frequen t gene tree top ology diﬀers from the sp ecies tree topology . That in itself do es not imply that maxim um likelihoo d on the concatenation will pic k the wrong tree. How ever what we sho w is that the wrong top ology do es indeed lead to a higher exp ected likelihoo d. W e exploit a connection to parsimon y: at low mutation rates the likelihoo d score is roughly equal to the parsimon y score (up to a factor). The latter, being combinatorial in nature, turns out to b e easier to characterize. In particular, we sho w that under the m ultispecies coalescent the wrong top ology has a higher exp ected parsimony score. The follo wing preliminary result (Prop osition 1) establishes the previous claim under the related inﬁnite-alleles mo del of m utation. This prop osition pla ys a key role in the ﬁnal step (Claim 7) of the pro of of the theorem. Giv en allele frequencies ( a 1 , a 2 , . . . , a n ) where P n j =1 j a j = n , the celebrated ‘Ew ens’ Samping F orm ula’ describ es the probabilit y of generating such an allele distribution in a coalescen t tree, with scaled m utation rate θ = 4 N µ under an inﬁnite alleles mo del: P θ,n ( a 1 , a 2 , . . . , a n ) = n ! θ ( n ) n Y j =1 ( θ /j ) a j a j ! , where θ ( n ) = θ ( θ + 1) · · · ( θ + n − 1) . (for details, see Durrett [6], p.18). W e will apply this in the current setting, where n = 6 and θ =  a small positive constan t (to b e determined later). Let x = (0 , 1 , 0 , 1 , 0 , 0) and y = (0 , 0 , 2 , 0 , 0 , 0) . Then P , 6 ( x ) = 6!  (6) ( / 2) 1 1! ( / 4) 1 1! = 3 4  + O (  2 ) . (1) Similarly , P , 6 ( y ) = 6!  (6) ( / 3) 2 2! = 1 3  + O (  2 ) . (2) Consider the tw o unro oted binary tree shap es on six leav es, shown in Fig. 1, and denote these as Y (the symmetric tree with three cherries) and Z (the caterpillar tree with t wo cherries). W e apply the ab o v e calculations to establish the following result. 5 Y Z Figure 1: The tw o binary tree shapes on six leav es: (a) the shap e Y ; (b) the shap e Z . There are 15 and 90 phylogenetic trees on a given leaf set that hav e the shapes Y and Z , resp ectively . Prop osition 1. L et T and T 0 b e two unr o ote d binary phylo genetic tr e es of shap es Z and Y r esp e ctively. Consider a site p attern that is r andomly gen- er ate d on a c o alesc ent tr e e on the same le af set under the inﬁnite al leles mo del with sc ale d mutation r ate θ (= 4 N µ ) =  . F or a binary tr e e top olo gy W , P W ( χ ) denote the p arsimony sc or e of a site p attern χ on W . Then E ESF [ P Z ( χ ) − P Y ( χ )] = 1 60  + O (  2 ) . wher e E ESF denotes the exp e ctation under the inﬁnite-al leles mo del. Pr o of: W e refer to a binary pattern on the leaf set { 1 , 2 , 3 , 4 , 5 , 6 } as a k -clade if there are k lea v es in one state, and 6 − k in another ( k ≤ n/ 2). Given such a binary pattern, the additional p enalty of this clade is its homoplasy score (i.e. the parsimony score minus 1, unless the clade is a 0-clade in which case the p enalt y is 0). F or a ph ylogenetic tree having shap e Y there are: •  3 2  · 2 · 2 = 12 in total 2-clades that cost an additional p enalty of +1; • 1 2  3 1  ·  2 1  · 2 = 6 in total 3-clades that cost an additional p enalty of +1; • 1 2 2 · 2 · 2 = 4 in total 3-clades that cost an additional p enalt y of +2. 6 Th us the expected v alue of the additional parsimon y p enalt y ∆ Y for a tree ph ylogenetic tree having shap e Y is: 1 · P , 6 ( x ) · 12  6 2  + 1 · P , 6 ( y ) · 6 1 2  6 3  + 2 · P , 6 ( y ) · 4 1 2  6 3  . Substituting Eqns. (1) and (2) in to this last expression gives: E ESF [ P Y ( χ )] = 16 15  + O (  2 ) . (3) A similar analysis for a Z -shape tree shows that there are: • 13 in total 2-clades that cost an additional p enalty of +1; • 5 in total 3-clades that cost an additional p enalty of +1; • 4 in total 3-clades that cost an additional p enalty of +2. Th us the exp ected v alue of the additional parsimony p enalt y P Z ( χ ) for a tree ph ylogenetic tree having shap e Z is: 1 · P , 6 ( x ) · 13  6 2  + 1 · P , 6 ( y ) · 5 1 2  6 3  + 2 · P , 6 ( y ) · 4 1 2  6 3  . Substituting Eqns. (1) and (2) in to this last expression gives: E ESF [ P Z ( χ )] = 13 12  + O (  2 ) . (4) Com bining Eqns. (3) and (4) gives: E ESF [ P Z ( χ ) − P Y ( χ )] = 1 60  + O (  2 ) . (5) Prop osition 1 now follows from (5). 2 4. Pro of of Theorem 1 T o establish Theorem 1 it suﬃces to do so for any n um b er n of taxa, and w e do so for n = 6. F or the sp ecies tree T , take an y ro oted tree that has the unro oted topology of the Z -shaped tree (caterpillar). Mak e all the edges L of this tree less than β . W e use the following notation: • Denote by G 1 , . . . , G g the gene trees generated by the multispecies coales- cen t on T . • Let E T denote the exp ectation under T and let G b e a gene tree generated under T . • Let C = [ r ] n b e the set of r -state characters on the set of n taxa. 7 A B C D E F > Figure 2: Anomalous gene trees on a 6-taxon species tree with shap e Z . The even t of deepest coalescence is depicted. • Let χ f k ∈ C be the k -th c haracter of the f -th alignment, where 1 ≤ k ≤ ` and 1 ≤ f ≤ g , and let X = { χ f k } k,f . • F or a character χ ∈ C , let N f χ b e the n umber of times character χ app ears in the f -th alignmen t and let N χ b e the num ber of times it appears o v erall. Let U b e an n -leaf tree with mutation probabilites { q e } . W e denote b y p U χ the probabilit y that χ is pro duced b y U under the symmetric r -state site substitu- tion model. Then the (mis-sp eciﬁed, i.e., not taking in to accoun t the coalescen t) empirical min us log-likelihoo d under tree U is given b y L U ( X ) = − 1 g ` X k,f log h r p U χ f k i = − 1 g ` X χ ∈C N χ log  r p U χ  . W e w ant to sho w that with high probabilit y L U ( X ) is not minimized on the sp ecies tree top ology . W e follo w the pro of sk etched in Section 3. F or a binary tree topology W and a character χ ∈ C we let ¯ χ W denote a minimal extension of χ on W and P W ( χ ), the parsimon y score of χ . Let P W ( X ) = 1 g ` X k,f P W ( χ k,f ) = 1 g ` X χ ∈C N χ P W ( χ ) . Let E ( W ) and V ( W ) b e the edges and v ertices of W . W e assume that W is binary and has n lea ves, hence | E ( W ) | = 2 n − 3 and | V ( W ) | = 2 n − 2. Let 8 L ∗ W ( X ) be the minus log-likelihoo d under an optimal c hoice of branch lengths (in [0 , + ∞ ]) for W . Let N 0 denote the num ber of constan t characters and N 6 =0 = g ` − N 0 . Claim 1 (Parsimon y-based appro ximation of the likelihoo d). If N 0 > 1 , g ` P W ( X ) N 0 ≤ 1 (6) then, for al l q 0 ∈ (0 , 1) , L ∗ W ( X ) ≤ − P W ( X ) log  q 0 r − 1  − 2 n log (1 − q 0 ) (7) and L ∗ W ( X ) ≥ − P W ( X ) log  g ` P W ( X ) ( r − 1) N 0  − N 6 =0 g ` n log r . (8) Pr o of: W e adapt several bounds deriv ed in T uﬄey and Steel [30, Lemmas 5 and 6]. Letting U hav e top ology W with all transition probabilities equal to q 0 , by considering a minimal extension (see T uﬄey and Steel [30, Equation (52)]) w e ha ve r p U χ ≥  q 0 r − 1  P W ( χ ) (1 − q 0 ) 2 n − 3 − P W ( χ ) ≥  q 0 r − 1  P W ( χ ) (1 − q 0 ) 2 n , and therefore L ∗ W ( X ) ≤ − 1 g `    X χ 6 =0 N χ log "  q 0 r − 1  P W ( χ ) (1 − q 0 ) 2 n # + N 0 log h (1 − q 0 ) 2 n i    = − P W ( X ) log  q 0 r − 1  − 2 n log (1 − q 0 ) , where w e used that N 0 + P χ 6 =0 N χ = g ` . This prov es (7). F or the other direction, let U b e the tree with top ology W and optimal m utation probabilities ( q ∗ e ) e . Let ¯ q = max e q e . Then, summing ov er all minimal extensions (see T uﬄey and Steel [30, Equation (63)]), r p U χ ≤ r n − 2  ¯ q r − 1  P W ( χ ) ≤ r n  ¯ q r − 1  P W ( χ ) , and b y considering tw o lea v es whose connecting path go es through an edge with probabilit y ¯ q (see T uﬄey and Steel [30, Equation (9)]) p U 0 ≤ 1 − ¯ q . 9 Hence L U ( χ ) ≥ − 1 g `    X χ 6 =0 N χ log " r n  ¯ q r − 1  P W ( χ ) # + N 0 log(1 − ¯ q )    ≥ − N 6 =0 g ` n log r − P W ( X ) log  ¯ q r − 1  + N 0 g ` ¯ q , where we used − log(1 − ¯ q ) ≥ ¯ q . Minimizing L U ( χ ) ov er ¯ q (see T uﬄey and Steel [30, Equation (65) and (66)]), a low er bound is obtained by ﬁxing ¯ q to g ` P W ( X ) / N 0 . 2 In order for the appro ximation in Claim 1 to b e useful, w e need that N 0 is asymptotically larger than max { nN 6 =0 , r } and that P W ( X ) is not to o small. W e pro ceed to prov e that these t w o properties hold when the mutation rate is low enough. W e begin by showing that the empirical frequencies of c haracters are close to their exp ectation when g → + ∞ . Claim 2 (Concentration of empirical frequencies). With pr ob ability exc e e d- ing 1 − 2 r n exp( − 2 g ζ 2 1 ) , for al l χ ∈ C ,     1 g ` N χ − E T [ p G χ ]     < ζ 1 . (9) Pr o of: F or all χ ∈ C , 1 g ` N χ = 1 g ` X k,f 1 { χ f k = χ } = 1 g X f 1 ` N f χ = 1 g X f 1 ` X k 1 { χ f k = χ } ! . (10) Noting that the ` − 1 N f χ s are in [0 , 1] and indep endent, Ho eﬀding’s inequality implies for all ζ > 0 P T      1 g ` N χ − 1 g ` E T [ N χ ]     ≥ ζ 1  ≤ 2 exp  − 2 g ζ 2 1  . Moreo ver by Eqn. (10) 1 g ` E T [ N χ ] = 1 g ` X k,f E T [ 1 { χ f k = χ } ] = P T [ χ f k = χ ] = E T [ p G χ ] . The result follo ws from the fact that |C | = r n . 2 An immediate corollary is the concen tration of the parsimony score. Claim 3 (Concentration of parsimon y score). Under Eqn. (9) , | P W ( X ) − E T [ P W ( χ )] | ≤ nr n ζ 1 . 10 Pr o of: By deﬁnition, | P W ( X ) − E T [ P W ( χ )] | =      1 g ` X χ P W ( χ ) N χ − X χ P W ( χ ) E T [ p G χ ]      ≤ X χ P W ( χ )     1 g ` N χ − E T [ p G χ ]     ≤ r n nζ 1 . 2 The next tw o claims relate the multispecies coalescent to the standard coa- lescen t. W e will refer to the p opulation of T ancestral to all taxa as the master p opulation . W e let D be the gene tree ev ent that no coalescence o ccurs b efore the master p opulation, whic h we refer to as de ep est c o alesc enc e . W e let T D b e the coalescent model on the master p opulation (i.e., the standard n -coalescent). W e further let D 0 b e the site even t such that D occurs and further no mutation o ccurs b elow the master population. Let M b e the n um b er of mutations on a site. Claim 4 (Low er bound on the n um b er of constant c haracters). Ther e is ζ 2 (dep ending only on n and β ) such that, for any θ > 0 , E T [ p G 0 ] ≥ 1 − ζ 2 θ . Pr o of: Note that { χ = 0 } ⊇ { M = 0 } . The n umber of mutations on a site is sto c hastically dominated by the same quan tity c onditione d on D . Indeed deepest coalescence ensures the highest total length of the gene tree. Hence E T [ p G 0 ] ≥ P T [ M = 0] ≥ P T [ M = 0 | D ] = E T [exp ( − θ H G ) | D ] ≥ E T [1 − θ H G | D ] , where H G is the total length of gene tree G . Note that, on D , H G ≤ n · nβ + H 0 G , where H 0 G is the total length of the gene tree inside the master p opulation. Letting h (1) n b e the exp ected length of the standard coalescent on n samples, we ha ve E T [ p G 0 ] ≥ 1 − θ ( n 2 β + h (1) n ) . Therefore w e can take ζ 2 = n 2 β + h (1) n . 2 Claim 5 (Reduction to standard coalescent). F or any θ and ζ 3 > 0 , ther e is β smal l enough (dep ending only on ζ 3 , n , and θ ), such that | E T [ P W ( χ )] − E T D [ P W ( χ )] | ≤ ζ 3 . 11 Pr o of: Note that E T [ P W ( χ ) | D 0 ] = E T D [ P W ( χ )] . F urther E T [ P W ( χ )] = E T [ P W ( χ ) | D 0 ] P T [ D 0 ] + E T [ P W ( χ ) | ( D 0 ) c ] P T [( D 0 ) c ] ≤ E T D [ P W ( χ )] + n ζ 3 n , b y choosing β small enough to make the probability P T [ D 0 ] ≥ ( e − ( n 2 ) β ) n exp( − θ ( n · nβ )) ≥ 1 − ζ 3 n . Ab o v e we used that P W ( χ ) ≤ n . Similarly , E T [ P W ( χ )] = E T [ P W ( χ ) | D 0 ] P T [ D 0 ] + E T [ P W ( χ ) | ( D 0 ) c ] P T [( D 0 ) c ] ≥ E T D [ P W ( χ )]  1 − ζ 3 n  ≥ E T D [ P W ( χ )] − n ζ 3 n . 2 Recall that E ESF is the exp ectation under the inﬁnite-alleles mo del on T D . Claim 6 (Inﬁnite-alleles approximation). Ther e is ζ 4 dep ending only on n such that, for any θ > 0 , | E T D [ P W ( χ )] − E ESF [ P W ( χ )] | ≤ ζ 4 θ 2 . Pr o of: Note that E T D [ P W ( χ ) | M ≤ 1] = E ESF [ P W ( χ ) | M ≤ 1] , as a single mutation as the same eﬀect on the c haracters of r -state symmetric and inﬁnite-alleles mo dels. Moreov er, b ecause b oth mo dels are run with the same parameters, they ha ve the same distribution of n um b er of mutations. In particular, P T D [ M ≤ 1] = P ESF [ M ≤ 1] . Note that P ESF [ M > 1] = E ESF   X i ≥ 2 e − θH G ( θ H G ) i i !   ≤ E ESF   ( θ H G ) 2 X i ≥ 0 e − θH G ( θ H G ) i i !   = θ 2 h (2) n , 12 where h (2) n = E ESF [ H 2 G ]. Hence, since E T D [ P W ( χ )] = E T D [ P W ( χ ) | M ≤ 1] P T D [ M ≤ 1] + E T D [ P W ( χ ) | M > 1] P T D [ M > 1] , w e hav e on the one hand E T D [ P W ( χ )] ≤ E ESF [ P W ( χ ) | M ≤ 1] P ESF [ M ≤ 1] +( E ESF [ P W ( χ ) | M > 1] + n ) P ESF [ M > 1] = E ESF [ P W ( χ )] + nθ 2 h (2) n . And, on the other hand, w e hav e E T D [ P W ( χ )] ≥ E ESF [ P W ( χ ) | M ≤ 1] P ESF [ M ≤ 1] +( E ESF [ P W ( χ ) | M > 1] − n ) P ESF [ M > 1] = E ESF [ P W ( χ )] − nθ 2 h (2) n . 2 Claim 7 (Final argument). L et θ =  . Ther e ar e  and β smal l enough (dep ending on n and r ) such that L ∗ Z ( X ) > L ∗ Y ( X ) , with pr ob ability exc e e ding 1 − 2 r n exp( − 2 g  4 ) . Pr o of: Cho osing β small enough, ζ 1 = ζ 3 =  2 . Claims 2 and 4 imply that, with probabilit y exceeding 1 − 2 r n exp( − 2 g ζ 2 1 ), N 0 ≥ g ` [1 − ζ 2  −  2 ] = g ` (1 − O (  )) , N 6 =0 = O ( g` ) . (11) When Eqn. (11) holds, b y Claims 3, 5 and 6, | P W ( X ) − E ESF [ P W ( χ )] | = O (  2 ) . (12) T ogether with Prop osition 1, this implies that P Z ( X ) − P Y ( X ) = 1 60  + O (  2 ) . (13) W e ﬁnally return to the lik eliho od. Note that (6) in Claim 1 is satisﬁed by (11) and P Y ( X ) ≤ nN 6 =0 /g ` = O ( nε ) . (14) Hence, taking q 0 = g ` P Y ( X ) / N 0 = O ( nε ) , (15) in Claim 1 yields L ∗ Z ( X ) − L ∗ Y ( X ) ≥ − [ P Z ( X ) − P Y ( X )] log  g ` P Y ( X ) ( r − 1) N 0  +2 n log  1 − g ` P Y ( X ) N 0  − N 6 =0 g ` n log r ≥  1 60  + O (  2 )  log[Ω(( n ) − 1 )] − 2 n · O ( n ) − n log r · O (  ) > 0 , 13 b y (14) and (15), when  is small enough (dep ending on n and r ). 2 Theorem 1 now follows immediately from Claim 7 by noting that the low er b ound 1 − 2 r n exp( − 2 g  4 ) con v erges to 1 as g gro ws; consequently , the proba- bilit y that Y has a higher likelihoo d than Z (i.e. a low er minus log-likelihoo d) con verges to 1 as the num b er of alignments g increases. 5. Concluding commen ts Our statistical inconsistency result applies for the particular case of a tree with six leav es. While this suﬃces to establish inconsistency in general, w e conjecture that an extension of our argumen t would apply to a tree with an y n umber of leav es. How ev er a detailed pro of of this assertion is b ey ond the scop e of this short note. 6. Ackno wledgemen ts The authors thank the Simons Institute at UC Berk eley , where this work w as carried out. S.R. is supp orted b y NSF grants DMS-1248176 and DMS-1149312 (CAREER), and an Alfred P . Sloan Research F ellowship. M.S. w ould lik e to thank the NZ Marsden F und and the Allan Wilson Centre for funding supp ort. References [1] M. S. Bayzid and T. W arnow. Naive binning impro ves phylogenomic anal- yses. Bioinformatics , 29(18):2277–2284, 2013. [2] J. Chifman and L. Kubatko. Identiﬁabilit y of the unro oted sp ecies tree top ology under the coalescent mo del with time-reversible substitution pro- cesses. arXiv , page 1406.4811, 2014. [3] M. DeGiorgio and J. H. Degnan. F ast and consistent estimation of sp ecies trees using sup ermatrix ro oted triples. Mole cular Biolo gy and Evolution , 27(3):552–569, 2010. doi: 10.1093/molb ev/msp250. URL http://mbe. oxfordjournals.org/content/27/3/552.abstract . [4] J. Degnan, M. DeGiorgio, D. Bryan t, and N. Rosenberg. Prop erties of consensus methods for inferring sp ecies trees from gene trees. Syst. Biol. , 58(1):35–54, 2009. [5] J. H. Degnan and N. A. Rosenberg. Gene tree discordance, ph ylogenetic inference and the multispecies coalescen t. T r ends Ec ol. Evol. , 24(6):332– 340, 2009. [6] R. Durrett. Pr ob ability mo dels for DNA se quenc e evolution (2nd e d.) . Springer, 2008. 14 [7] J. F elsenstein. Inferring phylo genies, vol. 2 . Sinauer Associates Sunderland, 2004. [8] S. R. Gadagk ar, M. S. Rosenberg, and S. Kumar. Inferring sp ecies ph ylo- genies from m ultiple genes: Concatenated sequence tree versus consensus gene tree. J. Exp. Zo ol. B Mol. Dev. Evol. , 304:64–74, 2005. [9] J. Gatesy and M. S. Springer. Concatenation versus coalescence versus “concatalescence”. Pr o c. Natl. A c ad. Sci. (USA). , 110(13):E1179, 2013. [10] J. Heled and A. J. Drummond. Bay esian inference of species trees from m ultilo cus data. Mol. Biol. Evol. , 27(3):570–580, 2010. [11] H. Huang, Q. He, L. S. Kubatko, and L. L. Knowles. Sources of error for sp ecies-tree estimation: Impact of mutational and coalescen t eﬀects on accuracy and implications for c ho osing among diﬀeren t metho ds. Syst. Biol. , 59(5):573–583, 2010. [12] L. L. Knowles and L. S. Kubatko. Estimating Sp e cies T r e es: Pr actic al and The or etic al Asp e cts . Wiley-Blac kwell, 2010. [13] L. S. Kubatko and J. H. Degnan. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. , 56(1):17–24, 2007. [14] L. S. Kubatko, B. C. Carstens, and L. L. Knowles. STEM: sp ecies tree estimation using maximum lik eliho o d for gene trees under coales- cence. Bioinformatics , 25(7):971–973, 2009. doi: 10.1093/bioinformatics/ btp079. URL http://bioinformatics.oxfordjournals.org/content/ 25/7/971.abstract . [15] B. R. Larget, S. K. Kotha, C. N. Dew ey , and C. An´ e. BUCKy: Gene tree/sp ecies tree reconciliation with Bay esian concordance analy- sis. Bioinformatics , 26(22):2910–2911, 2010. doi: 10.1093/bioinformatics/ btq539. URL http://bioinformatics.oxfordjournals.org/content/ 26/22/2910.abstract . [16] A. D. Leach ´ e and B. Rannala. The accuracy of species tree estimation under simulation: a comparison of metho ds. Syst. Biol. , 60(2):126–137, 2011. [17] L. Liu, L. Y u, L. Kubatk o, D. K. P earl, and S. V. Edwards. Coalescen t metho ds for estimating phylogenetic trees. Mole cular Phylo genetics and Evolution , 53(1):320–328, 2009. [18] L. Liu, L. Y u, D. K. Pearl, and S. V. Edwards. Estimating sp ecies phyloge- nies using coalescence times among sequences. Syst. Biol. , 58(5):468–477, 2009. [19] W. P . Maddison. Gene trees in sp ecies trees. Syst. Biol. , 46(3):523–536, 1997. 15 [20] R. W. Meredith, J. E. Janeˇ ck a, J. Gatesy , O. A. Ryder, C. A. Fisher, E. C. T eeling, A. Go o dbla, E. Eizirik, Sim˜ ao. T. L. L., T. Stadler, D. L. Rab osky , R. L. Honeycutt, J. J. Flynn, C. M. Ingram, C. Steiner, T. L. Williams, T. J. Robinson, A. Burk-Herric k, M. W esterman, N. A. Ayoub, M. S. Springer, and W. J. Murph y . Impacts of the cretaceous terrestrial rev olution and KPg extinction on mammal diversiﬁcation. Scienc e , 334 (6055):521–524, 2011. doi: 10.1126/science.1211028. URL http://www. sciencemag.org/content/334/6055/521.abstract . [21] Siav ash Mirarab, Md Shamsuzzoha Bayzid, and T andy W arno w. Ev al- uating summary metho ds for multi-locus sp ecies tree estimation in the presence of incomplete lineage sorting. Systematic Biolo gy , 2014. doi: 10.1093/sysbio/syu063. URL http://sysbio.oxfordjournals.org/ content/early/2014/08/26/sysbio.syu063.abstract . [22] E. Mossel and S. Roch. Incomplete lineage sorting: consisten t phylogen y estimation from multiple lo ci. IEEE/ACM T r ans. Comput. Biol. Bioinf. (TCBB) , 7(1):166–171, 2010. [23] R. Nic hols. Gene trees and sp ecies trees are not the same. T r ends Ec ol. Evol. , 16(7):358–364, 2001. [24] S. Ro c h. An analytical comparison of m ultilo cus metho ds under the m ul- tisp ecies coalescen t: the three-taxon case. In Paciﬁc Symp osium on Bio- c omputing , pages 297–306. W orld Scientiﬁc, 2013. [25] A. Rok as, B. L. Williams, N. King, and S. B. Carroll. Genome-scale ap- proac hes to resolving incongruence in molecular ph ylogenies. Natur e , 425: 798–804, 2003. [26] N. A. Rosenberg. The probability of top ological concordance of gene trees and sp ecies trees. The or. Pop. Biol. , 61:225–247, 2002. [27] S. Song, L. Liu, S. V. Edwards, and S. W u. Resolving conﬂict in eutherian mammal phylogen y using phylogenomics and the multispecies coalescent mo del. Pr o c e e dings of the National A c ademy of Scienc es , 109(37):14942– 14947, 2012. [28] M. Steel. Consistency of Bay esian inference of resolv ed phylogenetic trees. J. The or et. Biol. , 336:246–249, 2013. [29] F. T a jima. Ev olutionary relationships of DNA sequences in ﬁnite p opula- tions. Genetics , 105:437–460, 1983. [30] C. T uﬄey and M. A. Steel. Links b et ween maximum likelihoo d and maxi- m um parsimony under a simple mo del of site substitution. B. Math. Biol. , 59(3):581–607, 1997. [31] S. W u, S. Song, L. Liu, and S. V. Edwards. Reply to Gatesy and Springer: The m ultisp ecies coalescen t mo del can eﬀectiv ely handle recombination and gene tree heterogeneity . Pr o c. Natl. A c ad. Sci. (USA) , 110(13):E1180, 2013. 16

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment