Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method
We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It is known that the evolutionary history of individual genes (gene trees) might be topologically distinct from each ot…
Authors: Gautam Dasarathy, Robert Nowak, Sebastien Roch
D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 1 Data Requirement f or Ph ylogenetic Inf erence from Multiple Loci: A Ne w Distance Method Gautam Dasarath y † , Rober t Now ak † , and Sebastien Roch # † Wisconsin Institutes f or Disco v er y # Depar tment of Mathematics Univ ersity of Wisconsin - Madison Abstract W e consider the problem of estimating the ev olutionar y history of a set of species (phylogeny or species tree) from se ver al genes. It is known that the ev olutionar y histor y of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possib ly conf ounding phylogenetic analysis . A fur ther complication in practice is that one has to estimate gene trees from molecular sequences of finite length. W e pro vide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene lev el. Under that criterion, we also de vise a nov el reconstruction algorithm that prov ab ly improv es o ver all pre vious methods in a regime of interest. Index T erms ph ylogenetic inf erence, incomplete lineage sor ting, multispecies coalescent, distance methods, sample comple xity , molecular clock F 1 I N T R O D U C T I O N W E consider the problem of estimating the common evolutionary history , mor e pr ecisely the species tree , of a set of n species using sequence data fr om multiple genes or loci. It is well known that the estimated genealogical history of a gene ( gene tree ) may be topologically distinct from the species tr ee that encapsulates it, possibly confounding phylogenetic analysis [ 1 ]. The subject of this paper is an important source of such gene tree incongruence, known as incomplete lineage sorting (ILS), wher e two lineages fail to coalesce in their most recent common ancestral population. That failure may lead one of the lineages to first coalesce with a more distantly related population ther eby producing a gene tree whose topology differs from the species tree that we ar e trying to estimate. Several species tree reconstr uction methods have recently been developed that address ILS. See for instance [ 2 ], [ 3 ] and r eferences therein. Many such methods r ely on a statistical model known as the multispecies coalescent which, roughly speaking, generates gen e trees by performing independent coalescent processes in each ancestral population and then assembling these together . This pr ocess is illustrated in Figure 1 below and explained in a little more detail in Section 2.2 . For more background on phylogenetic infer ence and coalescent theory see, e.g., [ 4 ], [ 5 ], [ 6 ]. The accuracy of multiloci reconstruction methods has been evaluated empirically , for instance, in [ 7 ], [ 8 ]. The focus of this paper is the mathematical characterization of the S.R. acknowledges the support of NSF grants DMS-1248176 and DMS-1149312 (CAREER), and an Alfred P . Sloan Research Fellowship. D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 2 performance of such methods. Prior theoretical work has focused mainly on statistical consistency under the multispecies coalescent; see e.g., [ 8 ], [ 9 ], [ 10 ], [ 11 ]. That is, assuming access to either correct gene trees or correct pairwise distances (or coalescence times) for each gene, a method is statistically consistent if it is guaranteed to converge on the correct species tree as the number of genes, m , tends to infinity . [ 12 ] studies the rates of convergence (in m ) for several such methods. For instance, letting f > 0 denote the smallest branch length in the species tree, in the limit f → 0 , it was shown that the GLASS algorithm [ 10 ], which is an agglomerative clustering method in which the dissimilarity between each pair of species is taken to be the minimum of the coalescent times among the m genes, needs the number of genes m to scale as f − 1 . On the other hand, m needs to scale as f − 2 for the STEAC algorithm [ 8 ], which is also an agglomerative clustering method which instead uses the average of the coalescent times across the m genes as the measure of dissimilarity . In reality , however , one has to estimate gene trees and coalescent times fr om finite, say , length- k molecular sequences. T aking into account the resulting estimation errors at the gene level is key to mathematically quantify and compare the performance of differ ent methods (see e.g., [ 13 ], [ 14 ], [ 15 ]). Intuitively , for instance, the “minimum” used in GLASS may be significantly more sensitive to estimation err ors than the “average” used in STEAC. W e make progr ess towards this goal by performing the first full data requir ement analysis of some species tr ee reconstr uction methods. Our contribution is two-fold. First it is known that, in or der to reconstruct a single gene tree correctly with high probability , it is both necessary [ 16 ] and sufficient [ 17 ] for the sequence length k to scale as f − 2 . Therefor e, in light of this and the results in [ 12 ], one might expect that the total amount of data requir ed, mk , must scale as f − 3 and f − 4 for GLASS and STEAC r espectively . W e show that, by a crucial modification of STEAC, one obtains an algorithm that is guaranteed to reconstr uct the species tree exactly with high probability as long as m scales like f − 2 and k ≥ 1 . In particular , it suf fices for the overall sample complexity , mk , to scale like f − 2 (which is much smaller than f − 3 and f − 4 in the regime of interest, where f 1 ). Secondly , unlike GLASS, STEAC only works under the restrictive molecular clock assumption [ 6 ], where the mutation rates and population sizes ar e constant across the populations repr esented by the branches of the species tree. W e extend the previous data requir ement result beyond the molecular clock by devising a novel STEAC-like species tree reconstruction algorithm which we call MET AL (Metric algorithm for Estimation of Trees based on Aggregation of Loci). This algorithm is a distance based method where the distances are defined by concatenating the molecular sequences corresponding to all the loci (genes). 2 P R E L I M I N A R I E S A N D N O TA T I O N W e will begin with a description of our modeling assumptions and introduce some notation that will be used thr oughout the paper . 2.1 The Species T ree At the heart of the model is an unknown species tree S = ( V , E ) which repr esents the evolutionary history of n isolated populations; these isolated populations are represented by the size n leaf set L of this tr ee. The goal is to learn the structure of S . W e assume that each branch e ∈ E of the species tree corresponds to t e generations of evolution and we assume that each generation in this branch has a population of size N e . As is standard in coalescent theory , we will assign each branch e ∈ E , a length τ e > 0 in coalescent time units defined as τ e , t e / N e . The smallest branch length, f , min e τ e , will play an important role in D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 3 Gene 2 Gene 1 T ime Fig. 1: A species tree (the thick, shaded tr ee) and two samples from the multispecies coalescent. Notice that while the topology of Gene 1 agrees with the species tree, the topology of Gene 2 does not. our analysis and in particular , we will be interested in the case where f is very small. For a pair of vertices X , Y ∈ V , we will use π S X Y ⊂ E to denote the unique path connecting X and Y in S and τ X Y will denote the length of this path. Notice that { τ AB } A,B ∈ L forms a metric on the set L and such a metric that can be written as a sum of path lengths on a tree is called an additive metric (see e.g., [ 6 ]) with respect to that tree. If we additionally assume that the population sizes in each branch are equal to some constant N , then { τ AB } A,B ∈ L forms an ultrametric with r espect to S , i.e., for any three leaves A, B , C such that S restricted to A, B , C has the topology (( A, B ) , C ) ∗ , we have that τ AB ≤ τ AC = τ B C . W e will let ∆ , max A,B ∈ L τ AB denote the diameter of the species tree. Finally , T o each branch e ∈ E , we will also associate a mutation rate, µ e and we will let µ L , min e ∈ E µ e and µ U , max e ∈ E µ e denote the smallest and lar gest mutation rates, r espectively . 2.2 The Multispecies Coalescent and the Gene T rees Following [ 18 ], we assume that a multispecies coalescent (MSC) process produces m (indepen- dent) random genealogies G (1) , G (2) , . . . , G ( m ) based on S . These encode, say , the evolutionary history of m differ ent genes or loci on the genome and will be r eferred to as gene trees henceforth. It is easier to understand the MSC constr uctively and in the case wher e the population size N e in each branch e ∈ E is a constant N . Consider the 3 species example of Figure 1 , where the thick, shaded tree is the species tr ee S with edges { e i } 5 i =1 . As is standard in coalescent theory , we will think of time as running backwards, that is, time (in coalescent time units) starts at 0 at the leaves and increases towards the root of the tree. By T AB (resp. T AB C ), we mean the time when the parent population of A and B (resp. the parent population of A, B , and C ) branch (or speciate). Let us first consider one random draw from the MSC, i.e., the case of one particular gene, Gene 1. A, B , and C each have a copy (or allele) of Gene 1 and ∗ . W e will sometimes find it useful to repr esent trees in the so called Newick Format. For instance, the Newick repr esentations of the trees labelled Gene 1 and Gene 2 in Figure 1 are (( A, B ) , C ) and ( A, ( B , C )) , respectively . D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 4 the MSC describes the evolutionary history of the lineages corresponding to these alleles. From time 0 until T AB , the lineages corr esponding to A and B are in isolated populations and hence do not “coalesce”. However , once these lineages reach the parent population of A and B (repr esented by the branch e 4 ), they have a chance to coalesce. According to the MSC, the coalescence happens after a random time drawn according to the Exp (1) distribution, that is, P h t (1) AB − T AB ≥ x i = 1 − e − x , x ≥ 0 . (1) Now , the coalesced A - B lineage and the lineage corresponding to C do not interact until time T AB C , which is when they find themselves in a common population. They then coalesce at a random time t (1) AB C which is again such that t (1) AB C − T AB C ∼ Exp (1) . This gives us a random gene tree with the topology (( A, B ) , C ) . T o contrast with this, consider the case of Gene 2. Here, the lineages corresponding to the alleles in A and B do not coalesce in e 4 (since the randomly drawn coalescence time was more than the length of e 4 ). So, at time T AB C , there are three lineages present in the branch e 5 . When ther e are multiple lineages in the same population, accor ding to the MSC, each pair independently coalesces again after a random time period drawn accor ding to the Exp (1) distribution. In this case, the genealogies of B and C alleles coalesce (at time t (2) B C ) before A and B , thus giving us a second random tree with topology ( A, ( B , C )) . Notice that while the genealogy (evolutionary history) of Gene 1 agrees with that of the species, the genealogy of Gene 2 does not. This is an example of incomplete lineage sorting which, as mentioned earlier , is a fundamental road block for learning the tree of life. W e refer the reader to [ 18 ] for more details on the multispecies coalescent but, we will state the model her e for the sake of completeness. Before we proceed, we will r ecord a simple fact about the exponential distribution: If X 1 , . . . , X p iid ∼ Exp (1) , then min i ∈{ 1 ,...,p } X i ∼ Exp ( p ) . This follows since P min i ∈{ 1 ,...,p } X i ≥ t = p Y i =1 P ( X i ≥ t ) = e − pt . (2) The density of the likelihood of a gene tree G ( i ) = V ( i ) , E ( i ) can now be written down as follows. W e will focus our attention on the branch e ∈ E of the species tr ee and for the gene tree G ( i ) , let I ( i ) e and O ( i ) e be the number of lineages entering and leaving the branch e respectively . For instance, consider Gene 1 in Figure 1 . Here, two lineages enter the branch e 4 and one lineage leaves it. On the other hand, in the case of Gene 2 in Figure 1 , two lineages enter the branch e 4 and two lineages leave it. Let t ( i ) e,s , s = n 1 , 2 , . . . , I ( i ) e − O ( i ) e + 1 o be the s − th coalescent time corresponding to G ( i ) in the branch e . Recall that each pair of lineages in a population can coalesce at a random time drawn according to the Exp (1) distribution independently of each other . Therefore, after the ( s − 1) -th coalescent event at time t ( i ) e,s − 1 , there are I ( i ) e − s + 1 surviving lineages in branch e and the likelihood that the s − th coalescence time in branch e is t ( i ) e,s corresponds to the event that the minimum of I ( i ) e − s +1 2 random variables distributed according to Exp (1) has the value t ( i ) e,s − t ( i ) e,s − 1 . Therefor e using ( 2 ), the density of the likelihood of G ( i ) can be written as Y e ∈ E I ( i ) e − O ( i ) e +1 Y s =1 exp ( − I ( i ) e − s + 1 2 h t ( i ) e,s − t ( i ) e,s − 1 i ) , (3) where, for convenience, we let t ( i ) e, 0 and t ( i ) e,I ( i ) e − O ( i ) e +1 be respectively the divergence times of the population in e and of its parent population. D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 5 2.3 Observation Model and The Inference Prob lem Much of the prior work on understanding the theoretical complexity of learning species trees from multiple loci (or gene trees) has focused on the case where exact gene trees are available. However , in reality one needs to estimate these gene tr ees from molecular sequences and indeed there has been a recent thrust towards understanding the effect of errors in estimating the gene trees (see e.g., [ 13 ], [ 14 ], [ 15 ]). Our approach will be to take this error into account explicitly and in fact bypass the reconstruction of gene trees altogether . W e model the sample generation process according to the standard Jukes-Cantor (JC) model (see e.g., [ 6 ]). That is, given a gene tree G = ( V , E ) , we will associate to each ˜ e ∈ E , a probability p ˜ e (whose dependence on the length of ˜ e we will make explicit below). Then, the JC model assigns a character from { A , T , G , C } uniformly at random to the root of G . Moving away fr om the r oot, with probability p ˜ e , each edge ˜ e changes the state of its ancestor to one of the other three, chosen uniformly at random. The states at the leaves of G are assembled into a length n vector to get the first sample; this process is repeated k times to generate the data set. Notice that k models the number of sites or the sequence length of each gene. Now , we will define p ˜ e . T o each edge ˜ e of the random gene tree G is associated a random length σ ˜ e according to the MSC. Also, given an edge e ∈ E of the species tree, we will write σ e ∩ ˜ e to denote the length of the portion ˜ e that overlaps with e . This lets us define the effective (mutation rate adjusted) branch lengths, δ ˜ e = P e ∈ E µ e σ e ∩ ˜ e . As before, for any two vertices X , Y ∈ V , π G X Y denotes the path joining X and Y in G and σ X Y (resp. δ X Y ) denotes the length of this path under σ (resp. under δ ). Now , for an edge ˜ e ∈ E , we define p ˜ e , 3 4 (1 − e − 4 3 δ ˜ e ) . Notice that this definition implies that the probability p X Y of disagreement between the characters at vertices X and Y satisfies, p X Y = 3 4 (1 − e − 4 3 δ X Y ) . The goal then, is to learn the structure of S given the data { χ ij } i ∈ [ m ] ,j ∈ [ k ] which is an n × m × k array composed of the characters { A , T , G , C } , where { χ ij } j ∈ [ k ] is the data generated from the random gene tree G ( i ) according to the Jukes-Cantor model. The Jukes-Cantor model was chosen because it lends itself to easy presentation. Since the techniques developed here are distance-based , all our results can be generalized to the more realistic Generalized T ime-Reversible (GTR) model [ 19 ] using spectral techniques as in [ 20 ], [ 21 ]. 3 M A I N R E S U LT S W e now state the main results of the paper . First, we will deal with the case where the strong molecular clock [ 6 ] assumption holds. W e will then turn our attention to the more general case that does away with this assumption. 3.1 The Molecular Clock Assumption Holds Assuming that the molecular clock hypothesis holds is often unrealistic; it is equivalent to believing that all extant and ancestral populations have the same population size and that the mutations happen at the same rate through time and acr oss populations. It has however proven to be a useful abstraction for developing powerful methods. In our setting, this is equivalent to assuming that for all e ∈ E , µ e = µ > 0 , and N e = N , both constants independent of e . In or der to infer the species tree from samples, we will begin by defining a distance measure on the leaves. For each pair of leaves A, B ∈ L , we define b p AB = 1 mk X i ∈ [ m ] ,j ∈ [ k ] 1 { χ ij A 6 = χ ij B } , (4) D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 6 which can be thought of as the normalized hamming distance between the concatenated molecular sequences corr esponding to species A and B . Our first result, which is proved in Appendix A , is that, in expectation, { b p AB } A,B ∈ L is not only a metric on L , but is in fact an ultrametric with respect to S . Theorem 1. { E [ b p AB ] } A,B ∈ L forms an ultrametric with respect to the true species tree S . In fact, for any triple A, B , C ∈ L with the topology (( A, B ) , C ) in S , we have E [ b p AC ] = E [ b p B C ] > E [ b p AB ] + 3 e − 4 3 µτ AC µ 8 µ + 3 f . (5) This r esult inspir es the following pr ocedure for r econstructing S : Use { b p AB } A,B ∈ L as a dissimilarity measure for L and use a standard algorithm that accepts a dissimilarity measure and returns an ultrametric tree (see e.g., [ 4 ], [ 6 ] for backgr ound on distance based methods). For the sake of simplicity , we may assume that we use the UPGMA algorithm [ 22 ], the standard method for bottom-up agglomerative clustering, in order to pr oduce an ultrametric tree. Then, recalling that µ denotes the (common) mutation rate across the populations repr esented by the species tree S , and ∆ denotes diameter of S , we have the following performance guarantee. Theorem 2. Given an > 0 , using UPGMA on L with the dissimilarity measure { b p AB } A,B ∈ L results in the correct tree S being output with probability no less than 1 − as long as the number of genes m , and the sequence length k satisfy m ≥ C 1 ( µ, ∆ , n, ) × f − 2 and k ≥ 1 , (6) where C 1 ( µ, ∆ , n, ) = 16 e 8 3 µ ∆ (8 µ +3) 2 9 µ 2 log 8 ( n 3 ) . Theorem 2 , which is proved in Appendix B , tells us that the above procedur e succeeds with high probability as long as we get molecular sequences of length at least one from at least O ( f − 2 ) genes. That is, a total sequence length of mk = O ( f − 2 ) suffices for reliable learning. Notice that the procedur e we propose is similar to the STEAC algorithm [ 8 ] except instead of using the average coalescent time as the distance measure , we use ( 4 ), which can be considered as the normalized hamming distance. It turns out that this modification is crucial to obtaining our impr oved sample complexity result. 3.2 The Molecular Clock Assumption Does Not Hold W e will now consider the more general case where the strong molecular clock assumption does not hold. That is, we will assume that each branch e of the species tree has a (possibly) distinct mutation rate µ e and population size N e . First, we observe that { E [ b p AB ] } A,B ∈ L as defined above is no longer an ultrametric with respect to S and therefor e, the above procedur e (and for a similar r eason, the STEAC algorithm) cannot be used to recover the species tree. In such situations, one usually turns to distance methods that rely on the 4-point condition (see e.g., [ 6 ]). However , it is not immediately clear how to define a metric that satisfies the 4-point condition in our setting. Our next result, which is arguably the most important contribution of this paper , shows that this can be done. As befor e, we will first consider an idealized measure of dissimilarity as follows: d AB = − 3 4 log 1 − 4 3 E [ b p AB ] , A, B ∈ L, D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 7 where b p AB is as defined in ( 4 ). Our next result, which parallels Theorem 1 , shows that this “idealized” dissimilarity measur e is actually an additive metric with respect to S . Re- call that this means that the four point condition holds, i.e., for a quadruple of leaves A, B , C , D that are such that the topology of S restricted to these 4 leaves is (( A, B ) , ( C, D )) or ((( A, B ) , C ) , D ) , the above distances satisfy d AB + d C D ≤ d AC + d B D = d AD + d B C . See [ 6 ], for instance, for more information about tree metrics. Theorem 3. The set of dissimilarities { d AB } A,B ∈ L forms an additive metric with respect to S . In fact, suppose the leaves A, B , C , D ∈ L are such that either (( A, B ) , ( C, D )) or ((( A, B ) , C ) , D ) holds with respect to S , then d AC + d B D = d AD + d B C > d AB + d C D + α add , (7) where α add = 3 4 log 8 3 µ L (1 − e − f ) + 1 > 0 and µ L , min e ∈ E µ e is the smallest mutations rate, as defined in Section 2.1 . It is somewhat surprising that this r esult is true. It tells us that if one ignores the fact that there ar e multiple loci and pretends as though all samples came from a single gene tree, then the gene tree estimated from this “concatenated molecular sequence” has the same topology as S . Furthermore, this result is also interesting since phylogenetic mixtures are known to cause problems for distance-based methods [ 23 ]. W e pr ove Theorem 3 in Appendix C . In light of this, we propose the following algorithm to reconstruct S . First, we define the following sample-based corrected measure of dissimilarity (with b p AB as defined in ( 4 )) b d AB , − 3 4 log 1 − 4 3 b p AB . (8) Now , use any quartet-test based algorithm (like Neighbor Joining [ 24 ]) which returns an additive tr ee using { b d AB } A,B ∈ L defined as in ( 8 ) as the input dissimilarity measur e. W e call this algorithm MET AL (for Metric algorithm for Estimation of Trees based on Aggregation of Loci). Recall that µ U and µ L are respectively the maximum and minimum mutation rates, and ∆ is the diameter of the species tree S (c.f. Section 2.1 ). W e then have the following result. Theorem 4. For any > 0 , MET AL succeeds in reconstructing (the unrooted version of) S with probability at least 1 − as long as m and k satisfy k ≥ 1 and m ≥ e 8 µ U ∆ 3 (8 µ U + 3) 2 (24 + 8 α add ) 2 162 α 2 add log 16 n 4 ! (9) where α add = 3 4 log 8 3 µ L (1 − e − f ) + 1 . In the limit as f → 0 , the right side above approaches C 2 ( µ U , µ L , ∆ , n, ) × f − 2 , where C 2 ( µ U , µ L , ∆ , n, ) = 8 e 8 µ U ∆ 3 (8 µ U + 3) 2 9 µ 2 L log 16 n 3 ! . Remark . Following [ 17 ], the diameter ∆ can be r eplaced by the (often much smaller) depth 1 of the tr ee by employing a distance method that uses only those distances that ar e “small enough”. 1. The depth of an edge e is the length (under τ ) of the shortest path between two leaves crossing e ; the depth of a tree is the maximum edge depth. D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 8 W e prove Theorem 4 using arguments that are similar in spirit to those in the proof of Theorem 2 . W e r efer the reader to Appendix D for the exact details. Theorem 4 tells us that as long as m scales like O ( f − 2 ) and k ≥ 1 , the species tree can be reconstr ucted (upto the location of the root) reliably . It should be noted here that we assume that for each population/branch e ∈ E , the mutation rate µ e is constant across gene trees; generalizing this analysis to the case where the mutation rates ar e allowed to change is an interesting avenue for future work. 4 D I S C U S S I O N Irrespective of the sequence length k of each gene, the number of genes m requir ed needs to satisfy m ∈ Ω( f − 1 ) for consistent species tree estimation. T o see this, consider the species tree in Figur e 1 . Given m gene trees drawn according to the MSC based on this species tree, the probability that none of them have a coalescent event in branch e 4 is given by e − mτ e 4 (this is the probability that m independent exponentials are bigger than τ e 4 ). Therefor e, if m < τ − 1 e 4 , then with probability greater than e − 1 , none of the m the gene trees have a coalescence event in e 4 , that is, there is no evidence for the existence of this branch from the sample. This argument can also be formalized by observing that any algorithm that is able to estimate S reliably should be able to perform a reliable hypothesis test between two shifted exponential distributions. Therefor e, this result follows from the fact that D KL ( p ( x ; τ AB + f ) k p ( x ; τ AB )) = f , wher e p ( x ; a ) = e − ( x − a ) 1 { x ≥ a } and D K L ( ·k· ) is the Kullback-Liebler diver gence [ 25 ]. On the other hand, we know from [ 16 ] that even without the confounding effect of the multispecies coalescent, a total sequence length ( m × k ) of at least Ω( f − 2 ) is needed for consistent estimation. These two together imply that there is a constant C > 0 such that m needs to satisfy the following for consistent estimation of the species tree m ≥ C max f − 1 , f − 2 k . (10) As mentioned earlier , the results in this paper show that m ∈ O ( f − 2 ) is achievable irrespective of the value of k , i.e., in particular , a total data set size of mk ∈ O ( f − 2 ) is achievable. Prior to this, to the best of our knowledge, the best complexity bounds were provably attained by GLASS [ 10 ] (as shown in [ 12 ]) which requir es that m ≥ O ( f − 1 ) and k ≥ O ( f − 2 ) , i.e., a total data set size of mk ∈ O ( f − 3 ) . This raises two very interesting open questions. (A) What is the precise tradeof f between m and k for reliable recovery of S and in particular , is it possible to devise an algorithm that recovers S given m ∈ o ( f − 2 ) when the sequence length, k , is moderate, say , O ( f − 1 ) ? (B) Is there a procedur e that attains all points (values of m and k ) in this tradeoff, as opposed to the current situation wher e it appears as though GLASS meets the lower bounds for large k and MET AL meets the lower bound for small k ? R E F E R E N C E S [1] W . P . Maddison, “Gene tr ees in species trees,” Systematic biology , vol. 46, no. 3, pp. 523–536, 1997. [2] R. Nichols, “Gene trees and species trees are not the same,” T rends in Ecology & Evolution , vol. 16, no. 7, pp. 358–364, 2001. [3] L. Liu, L. Y u, L. Kubatko, D. K. Pearl, and S. V . Edwards, “Coalescent methods for estimating phylogenetic trees,” Molecular Phylogenetics and Evolution , vol. 53, no. 1, pp. 320–328, 2009. [4] J. Felsenstein, Inferring phylogenies , vol. 2. Sinauer Associates Sunderland, 2004. [5] R. Griffiths and S. T avar ´ e, “Ancestral inference in population genetics,” Statistical Science , pp. 307–319, 1994. [6] C. Semple and M. A. Steel, Phylogenetics , vol. 24. Oxford University Press, 2003. [7] A. D. Leach ´ e and B. Rannala, “The accuracy of species tree estimation under simulation: a comparison of methods,” Systematic Biology , vol. 60, no. 2, pp. 126–137, 2011. D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 9 [8] L. Liu, L. Y u, D. K. Pearl, and S. V . Edwards, “Estimating species phylogenies using coalescence times among sequences,” Systematic Biology , vol. 58, no. 5, pp. 468–477, 2009. [9] J. H. Degnan, M. DeGior gio, D. Bryant, and N. A. Rosenberg, “Properties of consensus methods for inferring species trees from gene trees,” Systematic Biology , vol. 58, no. 1, pp. 35–54, 2009. [10] E. Mossel and S. Roch, “Incomplete lineage sorting: consistent phylogeny estimation from multiple loci,” IEEE/ACM T ransactions on Computational Biology and Bioinformatics (TCBB) , vol. 7, no. 1, pp. 166–171, 2010. [11] L. Liu, L. Y u, and D. Pearl, “Maximum tree: a consistent estimator of the species tree,” Journal of Mathematical Biology , vol. 60, pp. 95–106, 2010. 10.1007/s00285-009-0260-0. [12] S. Roch, “An analytical comparison of multilocus methods under the multispecies coalescent: the three-taxon case.,” in Pacific Symposium on Biocomputing , pp. 297–306, W orld Scientific, 2013. [13] L. Nakhleh, “Computational approaches to species phylogeny inference and gene tree reconciliation,” T rends in ecology & evolution , vol. 28, no. 12, pp. 719–728, 2013. [14] J. Y ang and T . W arnow , “Fast and accurate methods for phylogenomic analyses,” BMC bioinformatics , vol. 12, no. Suppl 9, p. S4, 2011. [15] M. W . Hahn, “Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution,” Genome Biol , vol. 8, no. 7, p. R141, 2007. [16] M. A. Steel and L. A. Sz ´ ekely , “Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications,” SIAM J. Discrete Math. , vol. 15, no. 4, pp. 562–575 (electronic), 2002. [17] P . L. Erdos, M. A. Steel, L. A. Sz ´ ekely , and T . J. W arnow , “A few logs suffice to build (almost) all trees (i),” Random Structures and Algorithms , vol. 14, no. 2, pp. 153–184, 1999. [18] B. Rannala and Z. Y ang, “Bayes estimation of species divergence times and ancestral population sizes using dna sequences from multiple loci,” Genetics , vol. 164, no. 4, pp. 1645–1656, 2003. [19] S. T avar ´ e, “Some probabilistic and statistical problems in the analysis of dna sequences,” Lectures on mathematics in the life sciences , vol. 17, pp. 57–86, 1986. [20] S. Roch, “T oward extracting all phylogenetic information from matrices of evolutionary distances,” Science , vol. 327, no. 5971, pp. 1376–1379, 2010. [21] E. Mossel and Y . Peres, “Information flow on trees,” The Annals of Applied Probability , vol. 13, no. 3, pp. 817–844, 2003. [22] R. R. Sokal and C. D. Michener , A statistical method for evaluating systematic relationships . University of Kansas, 1958. [23] M. Steel, “A basic limitation on inferring phylogenies by pairwise sequence comparisons,” Journal of Theoretical Biology , vol. 256, no. 3, pp. 467 – 472, 2009. [24] N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic trees.,” Molecular biology and evolution , vol. 4, no. 4, pp. 406–425, 1987. [25] T . M. Cover and J. A. Thomas, Elements of information theory . John W iley & Sons, 2012. A P P E N D I X A P R O O F O F T H E O R E M 1 Recall that for any pair of leaves A, B ∈ L , we define b p AB = 1 mk X i ∈ [ m ] ,j ∈ [ k ] 1 { χ ij A 6 = χ ij B } . (11) Theorem 1. { E [ b p AB ] } A,B ∈ L † forms an ultrametric with respect to the true species tree S . In fact, for any triple A, B , C ∈ L with the topology (( A, B ) , C ) in S , we have E [ b p AC ] = E [ b p B C ] > E [ b p AB ] + 3 e − 4 3 µτ AC µf 8 µ + 3 . (12) Proof: Suppose that A, B , C ∈ L are three arbitrary leaves of the species tree with the topology (( A, B ) , C ) . By definition, we have that E [ b p AC ] = E 3 4 1 − e − 4 3 δ AC , where δ AC is the distance between A and C on a random gene tree drawn according to the MSC. Notice that it satisfies δ AC = µτ AC + 2 µZ with Z ∼ Exp (1) . Therefore, we have † . Unless otherwise noted, expectations will be with respect all the randomness present. D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 10 E [ b p AC ] − E [ b p AB ] = − 3 4 e − 4 3 µτ AC E h e − 8 3 µZ i + − 3 4 e − 4 3 µτ AB E h e − 8 3 µZ i ( a ) = 3 e − 4 3 µτ AB − e − 4 3 µτ AC 4( 8 3 µ + 1) ( b ) ≥ 3 e − 4 3 µτ AC µf (8 µ + 3) , where ( a ) follows fr om the fact that if X ∼ Exp (1) , for any α > 0 , E [ e − αX ] = ( α + 1) − 1 and ( b ) follows from observing that for any α > 0 and x < y , we have e − αx α − e − αy α = Z y x e − αt dt ≥ ( y − x ) e − αy Proceeding similarly , It can be seen that E [ b p AC ] = E [ b p B C ] . This concludes the pr oof. A P P E N D I X B P R O O F O F T H E O R E M 2 W e now prove Theorem 2 which guarantees that S can be reliably recover ed by using a standard distance-based algorithm like UPGMA or bottom-up agglomerative clustering with { b p AB } A,B ∈ L as a dissimilarity measur e for L . Theorem 2. Given an > 0 , using UPGMA on L with the dissimilarity measur e { b p AB } A,B ∈ L results in the correct tree S being output with pr obability no less than 1 − as long as the number of genes m , and the sequence length k satisfy m ≥ C 1 ( µ, ∆ , n, ) × f − 2 and k ≥ 1 , (13) where C 1 ( µ, ∆ , n, ) = 16 e 8 3 µ ∆ (8 µ +3) 2 9 µ 2 log 8 ( n 3 ) . Proof: Recall that the algorithm we propose to recover the tree uses { ˆ p AB } A,B ∈ L as a dissimilarity measure and uses an agglomerative clustering algorithm. Therefor e, this procedur e errs if for any triple of leaves A, B , C which have the topology (( A, B ) , C ) with respect to S , either b p AB > b p AC or b p AB > b p B C . Letting L 3 denote the set of all unorder ed triples in L , we can use the union bound and over-estimate the error as follows P [ Error ] = P [ (( A,B ) ,C ) ∈ ( L 3 ) n The triple (( A, B ) , C ) is such that b p AB > b p AC or b p AB > b p B C o ≤ X (( A,B ) ,C ) ∈ ( L 3 ) P [ b p AB > b p AC ] + P [ b p AB > b p B C ] . (14) W e will now upper bound the term P [ b p AB > b p AC ] , the other term will satisfy the same upper bound. Defining α um = 3 e − 4 3 ∆ µf (8 µ +3) , for an arbitrary triple (( A, B ) , C ) we have P [ b p AB − b p AC > 0] = P [ b p AB − E [ b p AB ] − b p AC + E [ b p AC ] > E [ b p AC ] − E [ b p AB ]] ( a ) ≤ P [ b p AB − E [ b p AB ] − b p AC + E [ b p AC ] > α um ] ≤ P h b p AB − E [ b p AB ] > α um 2 i + P h E [ b p AC ] − b p AC > α um 2 i , (15) D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 11 where ( a ) follows from Theorem 1 . Let us first look at the first term in ( 15 ). The second one will follow similarly . P [ b p AB − E [ p AB ] > α um / 2] ( a ) = E P b p AB − 1 m X i ∈ [ m ] p ( i ) AB + 1 m X i ∈ [ m ] p ( i ) AB − E [ b p AB ] > α um 2 { δ ( i ) AB } i ∈ [ m ] ≤ E P b p AB − 1 m X i ∈ [ m ] p ( i ) AB > α um 4 { δ ( i ) AB } i ∈ [ m ] + P 1 m X i ∈ [ m ] p ( i ) AB − E [ b p AB ] > α um 4 { δ ( i ) AB } i ∈ [ m ] = E P b p AB − 1 m X i ∈ [ m ] p ( i ) AB > α um 4 { δ ( i ) AB } i ∈ [ m ] + P 1 m X i ∈ [ m ] p ( i ) AB − E [ b p AB ] > α um 4 . (16) In ( a ) , we condition on { δ ( i ) AB } i ∈ [ m ] , where δ ( i ) AB , as before, is the random distance between the leaves A and B on the gene tree G ( i ) . W e then add and subtract 1 m P i ∈ [ m ] p ( i ) AB , where p ( i ) AB , 3 4 1 − e − 4 3 δ ( i ) AB . The next inequality follows from a union bound. The two terms in the last equation can now be upper bounded using Hoeffding’s inequality: E " P " 1 mk m X i =1 k X j =1 X ij AB − 1 m m X i =1 p ( i ) AB > α um 4 n d ( i ) AB o ## ≤ e − mkα 2 um / 16 . (17) P 1 m X i ∈ [ m ] p ( i ) AB − E [ b p AB ] > α um 4 ≤ e − mα 2 um / 16 . (18) These inequalities follow since E h X ij AB δ ( i ) AB i = p ( i ) AB and E h p ( i ) AB i = E [ b p AB ] . Substituting these in ( 14 ), we have P [ Error ] ≤ X (( AB ) C ) ∈ ( L 3 ) P [ b p AB > b p AC ] + P [ b p AB > b p B C ] ≤ X (( AB ) C ) ∈ ( L 3 ) 4 e − mkα 2 um / 16 + e − mα 2 um / 16 ≤ n 3 4 e − mkα 2 um / 16 + e − mα 2 um / 16 Therefor e, the probability of error can be made less than if we pick m and k as shown in ( 6 ) or ( 13 ). A P P E N D I X C P R O O F O F T H E O R E M 3 Recall that we define d AB = − 3 4 log 1 − 4 3 E [ b p AB ] and Theorem 3 , which we will pr ove now , tells us that these distances form an additive metric with respect to S . Theorem 3. The set of dissimilarities { d AB } A,B ∈ L forms an additive metric with r espect to S . In fact, suppose the leaves A, B , C, D ∈ L are such that either (( A, B ) , ( C, D )) or ((( A, B ) , C ) , D ) holds with respect to S , then d AC + d B D = d AD + d B C > d AB + d C D + α add , D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 12 where α add = 3 4 log 8 3 µ L (1 − e − f ) + 1 > 0 . Proof: W e will first show that for any 4 leaves A, B , C, D ∈ L that are such that either (( A, B ) , ( C , D )) or ((( A, B ) , C , D )) holds with respect to S , then d AC + d B D > d AB + d C D + α add . Using similar techniques, we will next establish that d AC + d B D = d AB + d C D . W e begin by observing that by definition, d AC + d B D − d AB − d C D = − 3 4 log 1 − 4 3 E [ b p AC ] − 3 4 log 1 − 4 3 E [ b p B D ] + 3 4 log 1 − 4 3 E [ b p AB ] + 3 4 log 1 − 4 3 E [ b p AB ] (19) = 3 4 log E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i , (20) where the expectations in the last equation are with respect to the multispecies coalescent and the δ ’s are the random gene tr ee distances as defined in Section 2.3 . W e will pr ove this theorem by lower bounding the quantity E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i appropri- ately . T owar ds this end, we note that for any 4 leaves of the species tree A, B , C, D , there are only 2 possible topologies with r espect to S upto relabeli ng: (a) (( A, B ) , ( C, D )) and (b) ((( A, B ) , C ) , D ) . W e will consider each case separately and bound the above quantity in what follows. Case (a): (( A, B ) , ( C , D )) In order to tackle the first case, we will use the notation from Figure 2a below , which shows the species tree S r estricted to the leaves A, B , C, D . Let o 1 , o 2 and o 3 be the common ancestors of ( A, B ) , ( C, D ) and ( A, C ) respectively . Let E AB be the event that the lineages corr esponding to A and B coalesce in the segment ( o 1 , o 3 ) of the tree in Figure 2a and let E AB be the event that this does not occur . Similarly , we define the events E C D and E C D . T o reduce notational clutter , for w , v ∈ S , we will write µ wv to denote P e ∈ π S wv µ e τ e . Now , for leaves X, Y ∈ L , let Z X Y denote the random quantity 1 2 ( δ X Y − µ X Y ) , i.e., it is the effective (mutation rate adjusted) coalescent time after the lineages corresponding to X and Y find themselves in a common population. By the memoryless pr operty of the exponential distribution, it is easy to check that Z AB − µ o 1 o 3 conditioned on E AB has the same distribution as Z C D − µ o 2 o 3 conditioned on E C D . Let Z denote be a random variable with this common distribution. Also observe that Z AC and Z B D have the same distribution as Z . This is depicted diagrammatically in Figure 2a . Now , using the fact that by definition, δ AB = µ AB + 2 Z AB , we have E h e − 4 3 δ AB i = e − 4 3 µ AB E h e − 8 3 Z AB i = e − 4 3 µ AB n E h e − 8 3 Z AB E AB i P ( E AB ) + E h e − 8 3 Z AB E AB i P E AB o ( a ) ≥ e − 4 3 µ AB n e − 8 3 µ o 1 o 3 P ( E AB ) + e − 8 3 µ o 1 o 3 E h e − 8 3 Z i P E AB o = e − 4 3 ( µ AB +2 µ o 1 o 3 ) n P ( E AB ) + E h e − 8 3 Z i P E AB o , (21) where ( a ) follows from the fact that conditioned on E AB , Z AB ≤ µ o 1 o 3 and that conditioned on E AB , Z AB d = Z + µ o 1 o 3 . Similarly , we get the following lower bound corresponding to the leaves C , D . E h e − 4 3 δ C D i ≥ e − 4 3 ( µ C D +2 µ o 2 o 3 ) n P ( E C D ) + E h e − 8 3 Z i P E C D o (22) D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 13 On the other hand, notice that δ AC = µ AC + 2 Z AC d = µ AC + 2 Z and δ B D = µ B D + 2 Z B D d = µ B D + 2 Z . Ther efore, we have E h e − 4 3 δ AC i = e − 4 3 µ AC E h e − 8 3 Z i , and E h e − 4 3 δ B D i = e − 4 3 µ B D E h e − 8 3 Z i , (23) From equations ( 21 ) - ( 23 ), we have E h e − 4 3 δ AB i E h e − 4 3 δ AC i × E h e − 4 3 δ C D i E h e − 4 3 δ B D i ≥ e − 4 3 ( µ AB +2 µ o 1 o 3 ) n P ( E AB ) + E h e − 8 3 Z i P E AB o e − 4 3 µ AC E h e − 8 3 Z i × e − 4 3 ( µ C D +2 µ o 2 o 3 ) n P ( E C D ) + E h e − 8 3 Z i P E AB o e − 4 3 µ B D E h e − 8 3 Z i (24) ( a ) = n P ( E AB ) + E h e − 8 3 Z i P E AB o n P ( E C D ) + E h e − 8 3 Z i P E C D o E h e − 8 3 Z i 2 = P ( E AB ) E h e − 8 3 Z i + P E AB × P ( E C D ) E h e − 8 3 Z i + P E C D (25) where in ( a ) , we have used the fact that µ AB + µ C D + 2 µ o 1 o 3 + 2 µ o 2 o 3 = µ AC + µ B D and in the last step we divide each term in the numerator by E h e − 8 3 Z i . Next, observe that Z stochastically dominates the random variable µ L ˜ Z , where ˜ Z ∼ Exp (1) . Therefor e, we have E h e − 8 3 Z i ≤ E h e − 8 3 µ L ˜ Z i = 1 8 3 µ L + 1 . (26) Substituting this in ( 25 ) gives us E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i ≥ 8 3 µ L P ( E AB ) + 1 × 8 3 µ L P ( E C D ) + 1 (27) Finally , we observe that the probability that the event E AB occurs is given by 1 − e − τ o 1 o 3 , wher e τ o 1 o 3 is the length of the path ( o 1 , o 3 ) in the species tree; this follows fr om the memoryless property of the exponential distribution. Since τ o 1 o 3 ≥ f , we have that P ( E AB ) ≥ 1 − e − f , and similarly P ( E C D ) ≥ 1 − e − f . Substituting this in ( 27 ), we get the following lower bound E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i ≥ 8 3 µ L 1 − e − f + 1 2 (28) Next, we consider Case (b). Case (b) : ((( A, B ) , C ) , D ) Here, we will write o 1 , o 2 , o 3 to denote the most recent common ancestors of ( A, B ) , ( A, C ) and ( A, D ) respectively . Again we will use notation from the D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 14 A B C D Z AB Z CD Z ( d ) = Z AC ( d ) = Z BD o 2 o 1 o 3 (a) Case (a) A B C D Z AB o 2 o 1 o 3 Z 1 ( d ) = Z AC Z 2 ( d ) = Z BD ( d ) = Z CD (b) Case (b) Fig. 2: Pictures showing the random variables and internal nodes used in Proof of Theorem 3 previous case for random variables of the form Z X Y , X , Y ∈ L . In this case, we let E AB denote the event that the lineages corresponding to A and B coalesce in the branch ( o 1 , o 2 ) in Figure 2b . Again, from the memoryless property , it can be seen that the random variable Z AB − µ o 1 o 2 conditioned on E AB and the random variable Z AC have the same distribution; we let Z 1 denote a random variable with this common distribution. Similarly Z C D and Z B D have the same distribution and we let Z 2 denote a random variable with this distribution. Reasoning as before, we see that since δ AB = µ AB + 2 Z AB , E h e − 4 3 δ AB i = e − 4 3 µ AB n E h e − 8 3 Z AB E AB i P ( E AB ) + E h e − 8 3 Z AB E AB i P E AB o ( a ) ≥ e − 4 3 µ AB n e − 8 3 µ o 1 o 2 P ( E AB ) + e − 8 3 µ o 1 o 2 E h e − 8 3 Z 1 i P E AB o = e − 4 3 ( µ AB +2 µ o 1 o 2 ) n P ( E AB ) + E h e − 8 3 Z 1 i P E AB o , (29) where, as before, ( a ) follows from the fact that conditioned on E AB , Z AB ≤ µ o 1 o 2 and that conditioned on E AB , Z AB d = Z 1 + µ o 1 o 2 . On the other hand, we have E h e − 4 3 δ C D i = e − 4 3 µ C D E h e − 8 3 Z 2 i (30) E h e − 4 3 δ AC i = e − 4 3 µ AC E h e − 8 3 Z 1 i (31) E h e − 4 3 δ B D i = e − 4 3 µ B D E h e − 8 3 Z 2 i . (32) Therefor e, fr om ( 29 )-( 32 ), we have that E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i ≥ e − 4 3 ( µ AB + µ C D +2 µ o 1 o 2 − µ AC − µ B D ) 1 E h e − 8 3 Z 1 i P ( E AB ) + P E AB = P [ E AB ] E h e − 8 3 Z 1 i + P E AB (33) where the second step follows from the fact that µ AB + µ C D + 2 µ o 1 o 2 = µ AC + µ B D . Finally , as in case (a), we use the bounds E h e − 8 3 Z 1 i ≤ 1 8 3 µ L +1 and that P [ E AB ] ≥ 1 − e − f to get the following lower bound. E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i ≥ 8 3 µ L (1 − e − f ) + 1 (34) D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 15 Since 8 3 µ L (1 − e − f ) + 1 ≥ 1 , from ( 28 ) and ( 34 ), we have that for any 4 leaves A, B , C, D such that the species tr ee S restricted to these four leaves satisfies either (( A, B ) , ( C, D )) or ((( A, B ) , C ) , D ) , then E h e − 4 3 δ AB i E h e − 4 3 δ C D i E h e − 4 3 δ AC i E h e − 4 3 δ B D i ≥ 8 3 µ L (1 − e − f ) + 1 (35) Substituting this lower bound in ( 20 ), we get the r esult that for any 4 leaves A, B , C, D ∈ L that are such that (( A, B ) , ( C, D )) or ((( A, B ) , C, D )) holds with r espect to S , we have that d AC + d B D > d AB + d C D + α add , where α add = 3 4 log 8 3 µ L (1 − e − f ) + 1 . T o conclude the proof, we will next establish the “equality part” of the theorem. As in ( 20 ), notice that the following holds. d AC + d B D − d AD − d B C = 3 4 log E h e − 4 3 δ AD i E h e − 4 3 δ B C i E h e − 4 3 δ AC i E h e − 4 3 δ B D i . (36) Again, we will divide this proof into two cases as above. Case (a): (( A, B ) , ( C , D )) Observe that the following hold with µ X Y and Z as defined before (cf. Fig 2a ) δ AD = µ AD + 2 Z , δ B C = µ B C + 2 Z , δ AC = µ AC + 2 Z , δ B D = µ B D + 2 Z Substituting these in ( 36 ) and observing that µ AD + µ B C = µ AC + µ B D tells us that d AC + d B D = d AD + d B C in case (a). Case (b): ((( A, B ) , C ) , D ) In this case, observe that the following hold again with µ X Y and Z 1 and Z 2 as defined earlier (cf. Fig 2(b)): δ AD = µ AD + 2 Z 2 , δ B C = µ B C + 2 Z 1 δ AC = µ AC + 2 Z 1 , δ B D = µ B D + 2 Z 2 Again, substituting these in ( 36 ) and observing that µ AD + µ B C = µ AC + µ B D tells us that d AC + d B D = d AD + d B C in case (b) as well. This concludes the pr oof. A P P E N D I X D P R O O F O F T H E O R E M 4 W e will now prove the last main result in our paper that shows that Theorem 3 can be used to design a tree reconstruction algorithm when one only has access to molecular data and also provides sample complexity results for this algorithm. Recall that we propose the following measure of dissimilarity from the samples b d AB , − 3 4 log 1 − 4 3 b p AB . (37) where b p AB is as defined in ( 4 ). In light of Theor em 3 , we proposed the following tree reconstruction procedur e, which we call MET AL: use any distance algorithm (like Neighbor Joining [ 24 ]) which r eturns an D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 16 additive tree using { b d AB } A,B ∈ L as the dissimilarity measure. W e then have the following result. Theorem 4. For any > 0 , the MET AL algorithm succeeds in reconstructing (the unrooted version of) S with probability at least 1 − as long as m and k satisfy k ≥ 1 and m ≥ e 8 µ U ∆ 3 (8 µ U + 3) 2 (24 + 8 α add ) 2 162 α 2 add log 16 n 4 ! (38) where α add = 3 4 log 8 3 µ L (1 − e − f ) + 1 . In the limit as f → 0 , the right side above approaches C 2 ( µ U , µ L , ∆ , n, ) × f − 2 , where C 2 ( µ U , µ L , ∆ , n, ) = 8 e 8 µ U ∆ 3 (8 µ U + 3) 2 9 µ 2 L log 16 n 3 ! . Proof: Notice that the above algorithm makes an err or only if there exists a set of four leaves A, B , C , D such that τ AB + τ C D ≤ τ AC + τ B D = τ AD + τ B C , but the 4-point condition is not satisfied by b d , that is: b d AB + b d C D − b d AC − b d B D > 0 or b d AB + b d C D − b d AD − b d B C > 0 Therefor e, using the union bound, the probability of error can be upper bounded as follows: P ( Error ) ≤ X A,B ,C,D ∈ L : τ AB + τ C D ≤ τ AC + τ B D = τ AD + τ B C P h ˆ d AB + ˆ d C D − ˆ d AC − ˆ d B D > 0 i + P h ˆ d AB + ˆ d C D − ˆ d AD − ˆ d B C > 0 i (39) W e will bound the first term inside the summation of ( 39 ) and the second one will follow similarly . Setting α add , 3 4 log 8 3 µ L (1 − e − f ) + 1 , observe that for a quadruple of leaves A, B , C , D such that τ AB + τ C D ≤ τ AC + τ B D = τ AD + τ B C , we have P h b d AB + b d C D − b d AC − b d B D > 0 i = P h b d AB − d AB + b d C D − d C D − b d AC + d AC − b d B D + d B D > d AC + d B D − d AB − d C D i ≤ P h b d AB − d AB + b d C D − d C D − b d AC + d AC − b d B D + d B D > α add i , where the second inequality follows fr om Theorem 3 which says that d AC + d B D − d AB − d C D > α add . W e will again use the union bound to get P h b d AB + b d C D − b d AC − b d B D > 0 i ≤ P h b d AB − d AB + b d C D − d C D − b d AC + d AC − b d B D + d B D > α add i ≤ P h b d AB − d AB > α add 4 i + P h b d C D − d C D > α add 4 i + P h d AC − b d AC > α add 4 i + P h d B D − b d B D > α add 4 i . (40) T o proceed, we will focus our attention on the first term in ( 40 ). The remaining terms will follow similarly . For notational clarity , let us define the function ` ( x ) , − 3 4 log 1 − 4 3 x and let p ( i ) AB denote the random quantity 3 4 1 − e − 4 3 δ ( i ) AB = ` − 1 δ ( i ) AB , where, as usual, δ ( i ) AB is the distances between A and B on the random gene tree G ( i ) drawn according to the MSC. Now , observe that, by definition, b d AB and d AB are equal to ` ( b p AB ) and ` ( E [ b p AB ]) respectively . D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 17 Our strategy will be to first show that with high probability b p AB is close to 1 m P m i =1 p ( i ) AB which is in turn close to E [ b p AB ] . W e will then use the fact that ` ( x ) is a well-behaved function to obtain an upper bound on the the first term of ( 40 ). Conditioned on a particular realization of the MSC process n δ ( i ) AB o i ∈ [ m ] , let E 1 ( ξ ) and E 2 ( ξ ) denote the events that 1 m P i ∈ [ m ] p ( i ) AB − E b p AB > ξ and 1 m P i ∈ [ m ] p ( i ) AB − b p AB > ξ , r espectively . Now , notice we can bound the first term in ( 40 ) as follows. P h b d AB − d AB > α add 4 i = P h ` ( b p AB ) − ` ( E b p AB ) > α add 4 i ( a ) = E " P ` ( b p AB ) − ` ( E b p AB ) > α add 4 n δ ( i ) AB o i ∈ [ m ] # ( b ) ≤ E P ` ( b p AB ) − ` ( E b p AB ) > α add 4 n δ ( i ) AB o i ∈ [ m ] , E 1 ( ξ ) c , E 2 ( ξ ) c + E P E 1 ( ξ ) n δ ( i ) AB o i ∈ [ m ] + E P E 2 ( ξ ) n δ ( i ) AB o i ∈ [ m ] , (41) where in ( a ) we condition on n δ ( i ) AB o , a particular realization of the MSC. In ( b ) we use the following fact: for any thr ee events E a , E b , E c , the following inequality holds P ( E a ) = P ( E a |E b ∪ E c ) P ( E b ∪ E c ) + P ( E a |E c b ∩ E c c ) P ( E c b ∩ E c c ) ≤ P ( E b ∪ E c ) + P ( E a |E c b ∩ E c c ) ≤ P ( E b ) + P ( E c ) + P ( E a |E c b ∩ E c c ) , where we identify E a , E b , and E c with the events b d AB − d AB > α add 4 , E 1 ( ξ ) , and E 2 ( ξ ) respectively . Our goal now is to pick a value of ξ so that the first term in ( 41 ) is 0. T owar ds this end, we will use the following r esult that we pr ove in Section D.1 . Claim 1. For any ξ > 0 , conditioned on a particular realization n δ ( i ) AB o i ∈ [ m ] of the MSC process, and the events E 1 ( ξ ) c and E 2 ( ξ ) c , the following inequality holds | ` ( b p AB ) − ` ( E b p AB ) | ≤ 2 ξ e − 4 µ U ∆ / 3 8 3 µ U +1 − 8 ξ 3 . (42) Now , Claim 1 tells us that if we make the following choice for ξ ξ = ξ 0 , 9 α add e − 4 3 µ U ∆ (24 + α add )(8 µ U + 3) , (43) then conditioned on the events E 1 ( ξ 0 ) c and E 2 ( ξ 0 ) c , we have that ` ( b p AB ) − ` ( E b p AB ) ≤ α add 4 Therefor e, we have P ` ( b p AB ) − ` ( E b p AB ) > α add 4 n δ ( i ) AB o i ∈ [ m ] , E 1 ( ξ 0 ) c , E 2 ( ξ 0 ) c = 0 . (44) Using this in ( 41 ), we have P h b d AB − d AB > α add 4 i ≤ E P E 1 ( ξ 0 ) n δ ( i ) AB o i ∈ [ m ] + E P E 2 ( ξ 0 ) n δ ( i ) AB o i ∈ [ m ] ≤ e − 2 mξ 2 0 + e − 2 mkξ 2 0 , (45) D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 18 where the second inequality comes from applying Hoeffding’s inequality to each term, as in ( 17 ) and ( 18 ). Since this upper bound is independent of the choice of the pair of leaves, we can use ( 45 ) and ( 40 ) in ( 39 ) to get P [ Error ] ≤ X A,B ,C,D ∈ L : τ AB + τ C D ≤ τ AC + τ B D = τ AD + τ B C 8 e − 2 mξ 2 0 + e − 2 mkξ 2 0 ≤ 8 n 4 e − 2 mξ 2 0 + e − 2 mkξ 2 0 . (46) Now , if we pick m and k as in ( 38 ) (also ( 9 )), we see that the right side above is less than , which concludes the proof. The limit as f → 0 can also be readily computed by observing that α add → 2 µ L f as f → 0 . D .1 Pr oof of Claim 1 W e will begin by using the fact that ` ( x ) satisfies the following Lipschitz property: for any 0 ≤ x ≤ y ≤ B , we have ` ( y ) − ` ( x ) = − 3 4 log 1 − 4 3 y + 3 4 log 1 − 4 3 x = Z y x 1 1 − 4 3 t dt ≤ ( y − x ) 1 − 4 3 B . (47) From this, we have that ` 1 m X i ∈ [ m ] p ( i ) AB − ` ( E [ p AB ]) ≤ ξ 1 − 4 3 ( E [ b p AB ] + ξ ) , conditioned on E 1 ( ξ ) , (48) where we have chosen the B (of ( 47 )) to be E [ b p AB ] + ξ , since conditioned on E 1 ( ξ ) , we have that 1 m m X i =1 p ( i ) AB ≤ E [ b p AB ] + ξ . (49) Similarly , conditioned on E 2 ( ξ ) c and E 1 ( ξ ) c , we have ` 1 m X i ∈ [ m ] p ( i ) AB − ` ( b p AB ) ≤ ξ 1 − 4 3 1 m P i ∈ [ m ] p ( i ) AB + ξ ≤ ξ 1 − 4 3 ( E [ b p AB ] + 2 ξ ) , (50) where in the first inequality we have chosen B (of ( 47 )) to be 1 m P i ∈ [ m ] p ( i ) AB + ξ , since conditioned on E 2 ( ξ ) c , we have that b p AB ≤ 1 m P i ∈ [ m ] p ( i ) AB + ξ , and the second inequality D ASARA THY et al. : D A T A REQUIREMENT FOR PHYLOGENETIC INFERENCE FROM MUL TIPLE LOCI 19 follows from ( 49 ). Ther efore from ( 48 ) and ( 50 ), we have that the following inequality holds conditioned on E 1 ( ξ ) c and E 2 ( ξ ) c : | ` ( b p AB ) − ` ( E [ b p AB ]) | ≤ ` 1 m X i ∈ [ m ] p ( i ) AB − ` ( E [ p AB ]) + ` 1 m X i ∈ [ m ] p ( i ) AB − ` ( b p AB ) ≤ 2 ξ 1 − 4 3 ( E [ b p AB ] + 2 ξ ) (51) Finally , to conclude the proof of the claim, we bound E [ p AB ] using the properties of the multispecies coalescent. Notice that, by definition, the random distance δ AB is equal to µ AB + 2 Z AB , where µ AB and Z AB are as defined in Section C . Therefore, E [ b p AB ] = E 3 4 (1 − e − 4 3 δ AB ) = 3 4 1 − e − 4 3 µ AB E h e − 8 3 Z AB i (52) Next, we observe that the random variable Z AB is stochastically dominated by the random variable µ U Z , where Z ∼ Exp (1) . This implies that E h e − 8 3 Z AB i ≥ E h e − 8 3 µ U Z i = 1 8 3 µ U + 1 . Using this and the fact that µ AB ≤ µ U ∆ in ( 52 ), we have E [ b p AB ] ≤ 3 4 1 − e − 4 3 µ U ∆ 8 3 µ U + 1 ! . Substituting this in ( 51 ) concludes the proof.
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment