Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies

Identiﬁability and inference of non-parametric rates-across-sites models on lar ge-scale phylogenies ∗ Elchanan Mossel † Sebastien Roch ‡ February 25, 2019 Abstract Mutation rate v ariation across loci is well known to cause dif ﬁculties, no- tably identiﬁability issues, in the reconstruction of ev olutionary trees from molecular sequences. Here we introduce a ne w approach for estimating gen- eral rates-across-sites models. Our results imply , in particular , that large phylogenies are typically identiﬁable under rate v ariation. W e also deriv e sequence-length requirements for high-probability reconstruction. Our main contribution is a no vel algorithm that clusters sites accord- ing to their mutation rate. F ollo wing this site clustering step, standard re- construction techniques can be used to recover the phylogen y . Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena. ∗ Ke ywords: phylogenetic reconstruction, rates-across-sites models, concentration of measure. † U.C. Berkeley and W eizmann Institute of Science. Supported by DMS 0548249 (CAREER) award, by DOD ONR grant N000141110140, by ISF grant 1300/08 and by ERC PIRG04-GA- 2008-239137 grant. ‡ Department of Mathematics and Bioinformatics Program, UCLA. W ork supported by NSF grant DMS-1007144. 1 Intr oduction The ev olutionary history of li ving organisms is typically represented graphically by a phylogeny , a tree whose branchings indicate past speciation ev ents. The infer- ence of phylogenies based on molecular sequences extracted from extant species is a major task of computational biology . Among the many biological phenomena that complicate this task, one that has recei ved much attention in the statistical phylogenetics literature is the v ariation in mutation rate across sites in a genome. (See related work belo w .) Such v ariation is generally attributed to unequal de- grees of selectiv e pressure. As we describe formally belo w , mathematically this phenomenon can be modeled as a mixtur e of phylogenies. That is, interpreting branch length as a measure of the amount of e v olutionary change, rates-across- sites (RAS) models posit that all sites in a genome e volv e according to a common tree topology , b ut branch lengths for a giv en site are scaled by a random factor . Here we introduce a ne w approach for estimating RAS models. Our main con- tribution is a nov el algorithm which clusters the sites according to their mutation rate. W e sho w that our technique may be used to reconstruct phylogenies. In- deed, follo wing the site clustering step, standard reconstruction techniques can be employed to reco ver a phylogen y on the unmix ed subset of sites obtained. Our re- sults rely on the follo wing basic insight: there exist simple site-wise statistics that experience concentration-of-measure phenomena. Consequently , our techniques only hold in the lar ge-tr ee limit . Concentration has been used extensi vely in statistical phylogenetics. Ho we ver its typical use is in the lar ge-sample limit , that is, as the sequence length gro ws to inﬁnity , for instance in order to show that so-called ev olutionary distance estimates are accurate giv en sufﬁciently long sequences (see e.g. [ESSW99]). Instead, we consider here concentration in what we call the larg e-tr ee limit , that is, as the number of lea ves goes to inﬁnity . Note that the latter is trickier to analyze. Indeed, whereas different sites are usually assumed to e volv e independently , leaf states are not independent. T o the best of our knowledge, this is the ﬁrst use of this type of concentration in the context of phylogenetics. Our results imply , in particular , that lar ge phylogenies are typically identiﬁ- able under rate variation. W e also derive sequence-length requirements for high- probability reconstruction. 1 1.1 Related work Most prior theoretical work on mixture models has focused on the question of identiﬁability . A class of phylogenetic models is identiﬁable if any two models in the class produce different data distrib utions. It is well-known that unmix ed phylogenetic models are typically identiﬁable [Cha96]. This is not the case in gener al for mixtur es of phylogenies. For instance, Steel et al. [SSH94] showed that for any two trees one can ﬁnd a random scaling on each of them such that their data distrib utions are identical. Hence it is hopeless in general to reconstruct phylogenies under mixture models. See also [EW04, MS07, MMS08, SV07b, ˇ SV07a, Ste09] for further examples of this type. Ho wev er the negati ve e xamples constructed in the references abov e are not necessarily typical. The y use special features of the mutation models (and their in variants) and allow themselves quite a bit of ﬂexibility in setting up the topolo- gies and branch lengths. In fact, recently a variety of more standard mixture mod- els hav e been sho wn to be identiﬁable. These include the common GTR+Gamma model [AAR08, WS10] and GTR+Gamma+I model [CH11], as well as some co- v arion models [AR06], some group-based models [APRS11], and so-called r - component identical tree mixtures [RS10]. Although these results do not provide practical algorithms for reconstructing the corresponding mixtures, they do gi ve hope that these problems may be tackled successfully . Beyond the identiﬁability question, there seems to hav e been little rigorous work on reconstructing phylogenetic mixture models. One positiv e result is the case of the molecular clock assumption with across-sites rate variation [SSH94], although no sequence-length requirements are provided. There is a large body of work on practical reconstruction algorithms for v arious types of mixtures, notably rates-across-sites models and cov arion-type models, using mostly likelihood and bayesian methods. See e.g. [Fel04] for references. But the optimization prob- lems the y attempt to solve are likely NP-hard [CT06, Roc06]. There also exist many techniques for testing for the presence of a mixture (for example, for test- ing for rate heterogeneity), b ut such tests typically require the kno wledge of the phylogeny . See e.g. [HR97]. Here we giv e both identiﬁability and reconstruction results. Whereas Steel et al. [SSH94] sho w that any two ﬁxed trees can be made indistinguishable with an appropriate (arbitrarily complex) choice of scaling distributions, we sho w in essence that, giv en a ﬁxed rate distribution (or a well-behav ed class of rate dis- tributions), sufﬁciently large trees are typically distinguishable. After a draft of our results were circulated [MR08], related results for large trees were established 2 by Rhodes and Sulliv ant [RS10] using dif ferent techniques. In particular , our technical assumptions are similar in spirit to the genericity condition in [RS10]. Although our genericity assumptions are stronger , they allow an efﬁcient recon- struction of the model and explicit bounds on sequence-length requirements. Note moreov er that our results apply to general, possibly continuous, nonparametric rate distributions. The proof of our main results relies on the construction of a site clustering statistic that discriminates between dif ferent rates. A similar statistic was also used in [SS06] in a different context. Ho wev er , in contrast to [SS06], our main reconstruction result requires that a site clustering statistic be constructed based only on data generated by the mixture—that is, without prior kno wledge of the model. 1.2 Over view of techniques A simpliﬁed setting T o illustrate our main ideas, we ﬁrst consider a simple two- speed model. Assume that molecular sequences hav e two types of sites: “slo w” and “fast. ” Both types of sites e volve independently by single substitution on a common ev olutionary tree according, say , to a standard Jukes-Cantor model of substitution, but the fast ones e volv e three times as fast. See Section 2 for a formal deﬁnition of the Jukes-Cantor model. T o keep things simple, assume for no w that the e v olutionary tree is a complete binary tree with n = 2 h leav es, where h is the number of levels. (Note that our results apply to much more general rate distributions. W e also discuss how to deal with general trees. See below .) Our approach is based on the follo wing question: Is it possible to tell with high conﬁdence which sites are slo w or fast, with no prior kno wledge of the phylogen y that generated them? Perhaps surprisingly , the answer is yes —at least for large trees. This far -reaching observ ation does not seem to hav e been made pre viously . T o see how this w orks, assume for the time being that we know the phylogen y . W e will sho w ho w to remove this assumption below . T ake a pair of leav es a, b . The ef fect of the speed of a site can be seen in the probability of agreement be- tween a and b : the lea ves agree more often on slow-e volving sites. Hence, if a site sho ws agreement between a and b , one may deduce that the site is more likely to be slow-e volving. But this is too little information to infer with high conﬁdence the speed of a site. Instead, one may look at a larger collection of pairs of leav es and consider the statistic that counts ho w man y of them agree on a gi ven site. The idea is that a large number of agreements should indicate a slo w site. F or this scheme to work accurately , we require two properties from this statistic: separa- 3 tion and concentration . By separation, we mean that the expected value of the statistic should be different on slo w and fast sites—in order to distinguish them. By concentration, we mean that the statistic should lie very close to its expecta- tion. These two properties produce a good site clustering test. T o satisfy them, the pairs of leav es in volved must be chosen carefully . Separation and concentration T o obtain separation, it is natural to use only pairs of “close” lea ves. Indeed, leav es that are far away are practically indepen- dent and the speed of a site has v ery little noticeable ef fect on their agreement. As for concentration, what one needs is the kind of conditions that gi ve rise to the central limit theorem: a large sum of small independent contributions. For sym- metric models such as the Jukes-Cantor model, the agr eement e vents on two pairs of leaves ( a, b ) and ( c, d ) are independent as long as the paths between ( a, b ) and ( c, d ) do not intersect. Therefore, we are led to consider the follo wing statistic: count ho w man y cherries (that is, sister lea ves) agree and di vide by the total num- ber of cherries to obtain a fraction. One can sho w from the considerations above that such a statistic is highly concentrated. Unknown, general tree Ho wev er our deriv ation so far has relied hea vily on two unsatisﬁed premises: 1. That the tree is known. This is of course not the case since our ultimate goal is precisely to reconstruct the phylogeny . 2. And that the tree is complete. In particular , our argument uses the fact that complete binary trees contain many cherries. But general trees may ha ve very fe w cherries. P erhaps surprisingly , neither of these conditions is necessary . The bulk of the technical contributions of this paper lie in getting rid of these assumptions. W e sho w in particular ho w to construct a site clustering statistic similar to the one abov e directly from the data without prior knowledge of the tree. At a high le vel, all one needs is to select a large collection of “sufﬁciently correlated” pairs of leav es and then “dilute” them to discard pairs that are too close to each other . This leads to a highly concentrated site-wise statistic. See Section 2 for a statement of our results. 4 2 Deﬁnitions and Results 2.1 Basic Deﬁnitions Phylogenies A phylogen y is a graphical representation of the speciation history of a group of or ganisms. The lea ves typically correspond to current species. Each branching indicates a speciation e vent. Moreo ver we associate to each edge a positi ve weight. This weight can be thought roughly as the time elapsed on the edge multiplied by the mutation rate which may also depend on the edge. More formally: Deﬁnition 1 (Phylogeny) A phylogeny T = ( V , E ; L, µ ) is a tree with vertex set V , edge set E and n (labelled) leaves L = [ n ] = { 1 , . . . , n } such that 1) the de gree of all internal vertices V − L is exactly 3 , and 2) the edges ar e assigned weights µ : E → (0 , + ∞ ) . W e let T [ T ] = ( V , E ; L ) be the topology of T . A phylogeny is naturally equipped with a so-called tree metric on the leaves d : L × L → (0 , + ∞ ) deﬁned as follows ∀ u, v ∈ L, d ( u, v ) = X e ∈ Path T ( u,v ) µ e , wher e Path T ( u, v ) is the set of edges on the path between u and v in T . W e will r efer to d ( u, v ) as the ev olutionary distance between u and v . Since under the assumptions above ther e is a one-to-one corr espondence between d and µ (see e.g . [SS03]), we write either T = ( V , E ; L, d ) or T = ( V , E ; L, µ ) . W e also sometimes use the natural e xtension of d to the internal vertices of T . W e will sometimes restrict ourselves to the follo wing standard special case. Deﬁnition 2 (Regular Phylogenies) Let 0 < f ≤ g < + ∞ . W e denote by T f ,g the set of phylogenies T = ( V , E ; L, µ ) suc h that ∀ e ∈ E , f ≤ µ e ≤ g . Poisson Model A standard model of DNA sequence ev olution is the following P oisson model . See e.g. [SS03]. Deﬁnition 3 (Poisson Model) Consider the following stochastic pr ocess. W e ar e given a phylogeny T = ( V , E ; [ n ] , µ ) and a ﬁnite set R with r elements. Let π be a pr obability distribution on R . Let Q ∈ R r × r be the following rate matrix Q xy =  π y , if x 6 = y , π y − 1 , o.w . 5 Associate to each edg e e ∈ E the stochastic matrix [ M ( e )] xy = [exp ( µ e Q )] xy =  π x + (1 − π x ) e − µ e , if x = y , π y (1 − e − µ e ) , o.w . The pr ocess runs as follows. Choose an arbitrary r oot ρ ∈ V . Denote by E ↓ the set E directed away fr om the r oot. Pick a state for the r oot accor ding to π . Moving away fr om the r oot towar d the leaves, apply the c hannel M ( e ) to each edge e independently . Denote the state so obtained σ V = ( σ v ) v ∈ V . In particular , σ [ n ] is the state at the leaves. Mor e pr ecisely , the joint distribution of σ V is given by µ V ( σ V ) = π ρ ( σ ρ ) Y e =( u,v ) ∈ E ↓ [ M ( e )] σ u σ v . F or W ⊆ V , we denote by µ W the mar ginal of µ V at W . Under this model, the weight µ e is the expected number of substitutions on edge e in a r elated continuous-time pr ocess. The r -state Poisson model is the special case when π is the uniform distribution over R . In that case, we denote the distribution of σ V by D [ T , r ] . When r is clear fr om the conte xt, we write instead σ V ∼ D [ T ] . More generally , we take k independent samples ( σ i V ) k i =1 from the model abov e, that is, σ 1 V , . . . , σ k V are i.i.d. D [ T , r ] . W e think of ( σ i v ) k i =1 as the sequence at node v ∈ V . T ypically , R = { A , G , C , T } and the model describes how DN A se- quences stochastically ev olve by point mutations along an ev olutionary tree— under the assumption that each site in the sequences e volv es independently . Example 1 (CFN and Juk es-Cantor Models) The special case r = 2 corr e- sponds to the so-called CFN model. The special case r = 4 is the well-known J ukes-Cantor model. W e ﬁx r throughout. Remark 1 W e discuss the mor e gener al GTR model in the concluding r emarks. Rates-across-sites model W e introduce the basic rates-across-sites model which will be the focus of this paper . W e will use the following deﬁnition. Deﬁnition 4 (Phylogenetic Scaling) Let T = ( V , E ; [ n ] , µ ) be a phylogeny and Λ , a constant in [0 , + ∞ ) . Then we denote by Λ T the phylogeny obtained by scaling the weights of T by Λ , that is, Λ T = ( V , E ; [ n ] , Λ µ ) . 6 Deﬁnition 5 (Rates-Across-Sites Model (see e.g . [SSH94])) In the generalized Pois- son model we ar e given a phylogeny T and a scaling factor Λ , that is, a ran- dom variable on [0 , + ∞ ) . Let Λ 1 , . . . , Λ k be i.i.d. copies of Λ . Conditioned on Λ 1 , . . . , Λ k , the samples ( σ i V ) k i =1 gener ated under this model ar e independent with σ j V ∼ D [Λ j T ] , j = 1 , . . . , k . W e denote by D [ T , Λ , r ] the pr obability distribution of σ 1 V . W e also let D [ T , Λ , r ] be the pr obability distrib ution of σ 1 L . 2.2 Main r esults T ree identiﬁability T o provide a uniform bound on the minimum tree size re- quired for our identiﬁability result to hold, we make explicit assumptions on the mutation model. For s ≥ 0 , let Φ( s ) = E  e − s Λ  , be the moment g enerating function (or one-sided Laplace transform) of the scaling factor Λ . The probability distribution of Λ is determined by Φ . See e.g. [Bil95]. W e normalize Λ so that − Φ 0 (0) = E [Λ] = 1 . In particular , Λ is not identically 0 and Φ is continuous and strictly decreasing. Assumption 1 Let 0 < f ≤ g < + ∞ , and M > 0 . The following set of assump- tions on a gener alized P oisson model will be denoted by A( f , g , M ) : 1. Regular phylogeny: The phylogeny T = ( V , E ; [ n ] , µ ) is in T f ,g . 2. Mass close to 0 : W e have that Φ − 1  e − 6 g  ≤ M . In wor ds, an evolutionary distance of M under Λ -scaling pr oduces a corr ela- tion corresponding to an evolutionary distance of at least 6 g without the scaling. W e denote by GPM( f , g , M , n 0 ) the set of generalized P oisson models satisfying A( f , g , M ) with at least n 0 leaves. Remark 2 Note that the r esults in [SSH94] indicate that appr opriate conditions ar e needed to obtain a tree identiﬁability result in the generalized P oisson model when the random scaling is unknown. W e do not claim that the conditions above ar e minimal. The ﬁrst assumption is meant to ensur e that ther e is enough signal to 7 r econstruct the tr ee. The second assumption bounds the distortion of the random scaling for e volutionary distances corr esponding to short paths. It essentially implies that the pr obability mass of Λ close to 0 is bounded. In particular , note that if the pr obability mass below ε = − 1 M ln  e − 6 g − δ 1 − δ  , is less than δ (for δ < e − 6 g ), then Φ( M ) ≤ δ + (1 − δ ) e − εM ≤ e − 6 g , and the assumption is satisﬁed. Con versely , if Λ satisﬁes the second assumption then the pr obability mass δ below ε (for ε < 6 g / M ) must be such that δ ≤ e − (6 g − εM ) , since e − 6 g ≥ Φ( M ) ≥ δ e − εM . Remark 3 The second assumption implicitly implies that P [Λ = 0] ≤ e − 6 g . This is in fact not necessary . By ﬁrst r emoving all in variant sites, it should be possible to extend our main theor em to moment gener ating functions of the form Φ( s ) = α + (1 − α )Φ + ( s ) , wher e 0 ≤ α < 1 is uniformly bounded away fr om 1 and Φ + satisﬁes the assump- tions above . Indeed, on a lar ge phylogeny , it is extr emely unlikely to pr oduce an in variant site using a positive scaling factor . Hence remo ving all in variant sites has the effect of essentially r estricting the dataset to the positive part of the distri- bution of Λ . W e leave the details to the r eader . Given this observation, in the rest of the manuscript one can assume that P [Λ = 0] = 0 . Theorem 1 (T ree identiﬁability) F ix 0 < f ≤ g < + ∞ , and M > 0 . Then, ther e exists n 0 ( f , g , M ) ≥ 1 such that, if ( T , Λ) and ( T 0 , Λ 0 ) ar e in GPM( f , g , M , n 0 ) with T [ T ] 6 = T [ T 0 ] , then D [ T , Λ , r ] 6 = D [ T 0 , Λ 0 , r ] . (Recall that D [ T , Λ , r ] denotes the distrib ution at the lea ves .) 8 Remark 4 Note that we allow Λ ∼ Λ 0 (wher e ∼ denotes equality in distrib ution). This is the sense in which our r esult is a tree identiﬁability r esult. Remark 5 Note that our identiﬁability r esult applies only to suf ﬁciently large phylogenies. Computing n 0 fr om our techniques is difﬁcult (and in gener al de- pends on the parameter s f , g , M ). One could estimate the r equir ed size by run- ning the r econstruction algorithm below on simulated data for various sizes and parameter s. W e leave such empirical studies for futur e work. The proof of our main theorem relies on the follo wing reconstruction result. T ree r econstruction Moreov er , we giv e a stronger result implying that the phy- logeny can be reconstructed with high conﬁdence using polynomial length se- quences in polynomial time. The proof appears in Sections 3, 4 and 5. Theorem 2 (T ree Reconstruction) Under Assumption 1, for all 0 < δ < 1 , ther e is a γ k > 0 lar ge enough so that the topology of the tr ee can be r econstructed in polynomial time using k = n γ k samples, except with pr obability δ . Remark 6 Once the tr ee has been estimated, one can also infer the r ate distrib u- tion. Details ar e left to the inter ested r eader . 3 Site clustering statistic: Existence and properties In this section, we introduce our main site clustering statistic and show that it is concentrated. Let 0 < f ≤ g < + ∞ and M > 0 . In this section, we ﬁx a phylogeny T = ( V , E ; [ n ] , µ ) in T f ,g . W e let ( σ i L ) k i =1 be k i.i.d. samples from D [ T , Λ , r ] where the generalized Poisson model ( T , Λ) satisﬁes Assumption 1. Moreov er , let Λ 1 , . . . , Λ k be the i.i.d. scaling factors corresponding to the k sam- ples abov e. Some notation W e will also use the notation [ n ] 2 = { ( a, b ) ∈ [ n ] × [ n ] : a ≤ b } , [ n ] 2 = = { ( a, a ) } a ∈ [ n ] , and [ n ] 2 6 = = [ n ] 2 − [ n ] 2 = . W e also denote by [ n ] 4 6 = the set of pairs ( a, b ) , ( c, d ) ∈ [ n ] 2 6 = such that ( a, b ) 6 = ( c, d ) (as pairs). For α > 0 , we let Υ α = { ( a, b ) ∈ [ n ] 2 6 = : d ( a, b ) ≤ α } , be all pairs of leav es in T at ev olutionary distance at most α . Let p ∞ = 1 − q ∞ where q ∞ = P x ∈R π 2 x . 9 3.1 What makes a good site clustering statistic? For a site i = 1 , . . . , k , consider a statistic of the form U i = 1 | Υ | X ( a,b ) ∈ Υ p − 1 ∞ [ 1 { σ i a = σ i b } − q ∞ ] , (1) where Υ ⊆ [ n ] 2 6 = , is a subset of pairs of distinct lea ves independent of i . Using the expression for the transition matrix gi ven in Deﬁnition 3, note that E [ U i | Λ i ] = 1 | Υ | X ( a,b ) ∈ Υ p − 1 ∞  E [ 1 { σ i a = σ i b } | Λ i ] − q ∞  = 1 | Υ | X ( a,b ) ∈ Υ p − 1 ∞ " X x ∈R π x  π x + (1 − π x ) e − Λ i d ( a,b )  − q ∞ # = 1 | Υ | X ( a,b ) ∈ Υ e − Λ i d ( a,b ) , which is strictly decr easing in Λ i . W e need two properties for (1) to make a good site clustering statistic: separation and concentration. For separation , that is, for the statistic abo ve to distinguish different scaling factors as much as possible, we require the follo wing condition: S1 Each pair in Υ is composed of two “sufﬁciently close” leav es, that is, there is α < + ∞ such that Υ ⊆ Υ α . Indeed, if two leav es are far away , their joint distribution is close to independent and scaling has little ef fect on their agreement. A much better separation is ob- tained from close leav es. T o guarantee concentration of a statistic of the type (1), we require the follow- ing three conditions on Υ : C1 The set Υ is “large enough” and each pair makes a “small contribution” to the sum. This will be satisﬁed if we show that we can take | Υ | = Θ( n ) , as (1) is a sum of { 0 , 1 } -variables. C2 Agreement for different pairs in Υ is “sufﬁciently uncorrelated, ” e.g., inde- pendent. These conditions will allo w us to apply standard large de viations ar guments. 10 Example 2 (Full sum) As a ﬁrst guess, one may expect that taking Υ to be all pairs of leav es may give a good site clustering statistic. However , in general this is not the case as we show in the following example . Consider the two-state case on a complete binary tree with identical edge lengths µ and Λ = 1 . F or mathematical con venience, assume that the states ar e R = { +1 , − 1 } and let γ = 2 e − 2 µ . Then, up to a multiplicative factor and additive constant, the clustering statistic is simply U ( h ) = X ( a,b ) ∈ [ n ] 2 6 = σ a σ b , for a tr ee with h levels. Using a calculation of [EKPS00, Section 5], one has E [ σ a σ b ] = e − d ( a,b ) . (2) Dividing the expectation into terms over the ﬁrst subtr ee of the r oot, terms over the second subtr ee of the r oot, and terms between the two subtr ees, we have E  U ( h )  = 2 E  U ( h − 1)  + (2 h − 1 ) 2 e − 2 hµ . Solving for the r ecursion gives E  U ( h )  = γ 2 h − 2 γ h − 1 γ − 1 = O  2 2 h γ 2 h  , as h → ∞ . On the other hand, the expectation of the squar e E [( U ( h ) ) 2 ] is a sum of terms of the form E [ σ z 1 σ z 2 σ z 3 σ z 4 ] wher e some of the z ’ s may be r epeated. All such terms are non-ne gative because of (2) and the fact that terms where all z ’ s ar e dif fer ent factor into a pr oduct by Pr oposition 1 below . Hence V ar  U ( h )  ≥ X ( a,b ) ∈ [ n ] 2 6 = E  ( σ a σ b ) 2  − E  U ( h )  2 = 2 h (2 h − 1) 2 − γ 2 2 2 h − 4  γ h − 1 γ − 1  2 = Ω(2 2 h ) , if γ < 1 . Hence, assuming that γ < 1 , we have E  U ( h )  2 V ar [ U ( h ) ] → 0 , as h → ∞ . In other wor ds, the sum o ver all pairs is too “noisy” to serve as a site clustering statistic in that case. 11 3.2 Does it exist? W e now show that there alw ays exist statistics that satisfy the properties abo ve and we giv e e xplicit guarantee on their concentration. Note that, in the current section, we only provide a proof of e xistence . In particular , in establishing existence, we use e volutionary distances which are not av ailable from the data. Later , in Section 4, we explain ho w to construct such a statistic from the data D [ T , Λ , r ] (or , more precisely , the samples ( σ i L ) k i =1 ) without knowledge of the tree topology , evolutionary distances, or site scaling factor s . W e now explain ho w the conditions abov e can be achiev ed for an appropriate choice of Υ on any tree topology . Note, howe ver , that Υ depends on T . Independence W e ﬁrst show that the clustering stat istic (1) is a sum of indepen- dent v ariables as long as the paths between differ ent pairs do not intersect. This will allo w us to satisfy C2. Proposition 1 (Independence) Assume that for all ( a, b ) , ( a 0 , b 0 ) ∈ Υ with ( a, b ) 6 = ( a 0 , b 0 ) we have P ath( a, b ) ∩ P ath( a 0 , b 0 ) = ∅ , wher e Path( a, b ) is the set of edges on the path between a and b . Then the random variables { 1 { σ a = σ b }} ( a,b ) ∈ Υ , ar e mutually independent. Proof: Denote Υ = { ( a 1 , b 1 ) , . . . , ( a υ , b υ ) } with υ = | Υ | . Let V be the set of nodes on the path between a 1 and b 1 . Removing the edges in P ath( a 1 , b 1 ) creates a forest where the V -nodes can be taken as roots. Note that, by symmetry and the Marko v property , conditioned on V , the distribution of the random v ariables { 1 { σ a = σ b }} ( a,b ) ∈ Υ −{ ( a 1 ,b 1 ) } , (3) does not depend on the states of the V -nodes. In particular , 1 { σ a 1 = σ b 1 } is independent of (3). Proceeding by induction gi ves the result.  Size T o satisfy S1, we restrict ourselves to “close pairs. ” W e ﬁrst sho w that the size of Υ α gro ws linearly as long as α ≥ 4 g , allo wing us to also satisfy C1. A similar result is prov ed in [SS06]. Proposition 2 (Size of Υ α ) Let α ≥ 4 g . Then | Υ α | ≥ n 4 . 12 Proof: Let Γ = { a ∈ [ n ] : d ( a, b ) > α, ∀ b ∈ [ n ] − { a }} , that is, Γ is the set of lea ves with no other leaf at e volutionary distance α . W e will bound the size of Γ . For a ∈ Γ , let B ( a ) = n v ∈ V : d ( a, v ) ≤ α 2 o . Note that for all a, b ∈ Γ with a 6 = b we have B ( a ) ∩ B ( b ) = ∅ by the triangle inequality . Moreover , it holds that for all a ∈ Γ |B ( a ) | ≥ 2 b α 2 g c , since T is binary and there is no leaf other than a in B ( a ) . Hence, we must ha ve | Γ | ≤ 2 n − 2 2 b α 2 g c ≤  1 2 b α 2 g c − 1  n, as there are 2 n − 2 nodes in T . No w , for all a / ∈ Γ assign an arbitrary leaf at ev olutionary distance at most α . Then | Υ α | ≥ 1 2 ( n − | Γ | ) ≥ 1 2  1 − 1 2 b α 2 g c − 1  n, where we divided by 2 to av oid o ver -counting. The result follo ws from the as- sumption α ≥ 4 g .  Sparsiﬁcation Note that Υ 4 g satisﬁes C1 but does not satisfy C2 as the pairs may be intersecting (see Proposition 1). W e no w sho w ho w to satisfy both C1 and C2 by “sparsifying” Υ 4 g . In stating this procedure, we allo w some ﬂe xibility (that is, arbitrary choices) which will be useful in analyzing the actual implementation in the next section. Let 4 g < m < M be a constant to be determined later and assume Υ 0 is any set satisfying Υ 4 g ⊆ Υ 0 ⊆ Υ m . W e kno w from Proposition 2 that Υ 0 has linear size, that is, | Υ 0 | ≥ n/ 4 . W e construct a linear-sized subset Υ of Υ 0 satisfying the non-intersection condition of Proposition 1 as follo ws. Let S := Υ 0 and Υ 00 := ∅ . 13 • T ake an y pair ( a ∗ , b ∗ ) in S and add it to Υ 00 . • Let S 0 be any subset of S such that S 0 contains all pairs with at least one node within ev olutionary distance m of either a ∗ or b ∗ and contains no pair with both nodes beyond ev olutionary distance M from both a ∗ and b ∗ . Re- mov e S 0 from S . • Repeat until S is empty . • Return Υ := Υ 00 . W e claim that Υ is linear in size and that no tw o pairs in Υ intersect. Proposition 3 (Pr operties of Υ ) Let Υ be any set built by the pr ocedur e above . Then, 1. F or all ( a, b ) , ( a 0 , b 0 ) ∈ Υ with ( a, b ) 6 = ( a 0 , b 0 ) we have Path( a, b ) ∩ P ath( a 0 , b 0 ) = ∅ . 2. There is γ s = γ s ( M , f ) > 0 such that | Υ | ≥ γ s n , wher e γ s does not depend on T , but only on M , f . 3. F or all ( a, b ) ∈ Υ , we have 2 f ≤ d ( a, b ) ≤ M . Proof: W e ﬁrst prove the non-intersecting condition. All pairs of lea ves in Υ are at ev olutionary distance at most m . Moreover , for any ( a, b ) 6 = ( a 0 , b 0 ) in Υ , we hav e by construction min { d ( u, v ) : u ∈ { a, b } , v ∈ { a 0 , b 0 }} ≥ m. Hence, the path between a and b and the path between a 0 and b 0 cannot intersect: we hav e d ( a, b ) + d ( a 0 , b 0 ) − d ( a, a 0 ) − d ( b, b 0 ) ≤ 0 , which, using µ e > 0 for all e and the four-point test (see e.g. [SS03]), excludes the topology aa 0 | bb 0 (that is, the four -leaf topology where { a, a 0 } is one side of the internal edge and { b, b 0 } is on the other); and similarly for the topology ab 0 | a 0 b . W e now bound the size of Υ . Let ( a, b ) be a pair of lea ves at e volutionary distance at most m . There are at most 2 · 2 b M f c − 1 = 2 b M f c leav es at e volu- tionary distance at most M from either a or b . Therefore, at each iteration of 14 the sparsiﬁcation algorithm, the number of elements of S removed is at most 2 b M f c + b m f c − 1 ≤ 2 2 b M f c , as each leaf remov ed is in volv ed in at most 2 b m f c − 1 pairs at ev olutionary distance m . Since by Proposition 2, the size of Υ 0 is at least n/ 4 , the number of elements in Υ at the end of the sparsiﬁcation algorithm is at least | Υ | ≥ n 2 2 b M f c +2 . Finally , by our assumption on the phylogen y , two distinct lea ves are al ways at e volutionary distance at least 2 f . For the upper bound, use m < M .  Deﬁne γ s = γ s ( M , f ) = 1 2 2 b M f c +2 . Deﬁnition 6 (Sparse pairs) W e say that a set Υ ⊆ [ n ] 2 6 = is γ s -sparse if it satisﬁes the thr ee pr operties in the statement of Pr oposition 3. 3.3 Pr operties of the site clustering statistic In (1) ﬁx an γ s -sparse Υ . W e now show that conditions C1 and C2 lead to con- centration. Let U Υ = E [ U i ] = 1 | Υ | X ( a,b ) ∈ Υ E  e − Λ i d ( a,b )  = 1 | Υ | X ( a,b ) ∈ Υ Φ( d ( a, b )) . Moreov er , for λ ≥ 0 , deﬁne U Υ ( λ ) = 1 | Υ | X ( a,b ) ∈ Υ e − λd ( a,b ) and note that U Υ (Λ i ) = E [ U i | Λ i ] , and U Υ = E [ U Υ (Λ i )] , for i = 1 , . . . , k . Proposition 4 (Concentration of U i ) F or all ζ > 0 , ther e is c > 0 depending on M , f such that P [ |U i − U Υ (Λ i ) | ≥ ζ | Λ i ] ≤ 2 exp( − cζ 2 n ) , almost sur ely , for all i = 1 , . . . , k . (W e will eventually use ζ = o (1) .) 15 Proof: Recall the following standard concentration inequality (see e.g. [MR95]): Lemma 1 (Azuma-Hoeffding Inequality) Suppose X = ( X 1 , . . . , X m ) ar e in- dependent random variables taking values in a set S , and f : S m → R is any t -Lipschitz function: | f ( x ) − f ( y ) | ≤ t whenever x and y differ at just one coor - dinate. Then, ∀ ζ > 0 , P [ | f ( X ) − E [ f ( X )] | ≥ ζ ] ≤ 2 exp  − ζ 2 2 t 2 m  . From Propositions 1, 2, and 3, the random v ariable U i is a (normalized) sum of Ω( n ) independent bounded variables. By Lemma 1, conditioning on Λ i , we hav e | U Υ (Λ i ) − U i | ≤ ζ except with probability exp( − Ω( ζ 2 n )) , where we used that m = Ω( n ) and t = O (1 /n ) .  Moreov er , we show separation. Proposition 5 (Separation of U i ) If λ − λ 0 ≥ β , wher e β ≥ 0 , then U Υ ( λ 0 ) − U Υ ( λ ) ≥ e − λM  e 2 f β − 1  . Proof: W e hav e U Υ ( λ 0 ) − U Υ ( λ ) = 1 | Υ | X ( a,b ) ∈ Υ h e − λ 0 d ( a,b ) − e − λd ( a,b ) i ≥ 1 | Υ | X ( a,b ) ∈ Υ  e − ( λ − β ) d ( a,b ) − e − λd ( a,b )  ≥ 1 | Υ | X ( a,b ) ∈ Υ e − λd ( a,b )  e β d ( a,b ) − 1  ≥ 1 | Υ | X ( a,b ) ∈ Υ e − λM  e β · 2 f − 1  , since 2 f ≤ d ( a, b ) ≤ M for all ( a, b ) ∈ Υ by assumption.  Remark 7 The last bound may seem pr oblematic because under our assumptions the scaling factor is allowed to have an unbounded support. In that case the RHS could be arbitrarily close to 0 . But we will show below that we can safely ignor e lar ge values of λ . 16 4 Constructing the site clustering statistic from data Note that in the pre vious section we only established the e xistence of an appropri- ate site clustering statistic. W e now show ho w such a statistic can be built from data without knowledge of the tr ee topology or site scaling factors . Notation W e use the same notation as in the previous section. Further , we let ˆ q ( a, b ) = 1 k k X i =1 p − 1 ∞  1 { σ i a = σ i b } − q ∞  . Also let q ( a, b ) = E [ ˆ q ( a, b )] = E  p − 1 ∞  1 { σ 1 a = σ 1 b } − q ∞  = E  E  p − 1 ∞  1 { σ 1 a = σ 1 b } − q ∞  | Λ 1  = E  e − Λ 1 d ( a,b )  = Φ( d ( a, b )) . where we used our previous calculations. W e deﬁne some constants used in the algorithm and its analysis whose v alues will be justiﬁed below . Let ω m = e − 5 g , ω + m = e − 5 . 5 g , ω − m = e − 4 . 5 g . Also recall that Φ is strictly decreasing and let m = Φ − 1 ( ω m ) . Note that by Jensen’ s inequality and E [Λ 1 ] = 1 Φ(5 g ) ≥ e − E [Λ 1 ] · 5 g = e − 5 g , so that m ≥ 5 g > 4 g , as assumed in the pre vious section. Similarly , by assumption, m = Φ − 1 (5 g ) < Φ − 1 ( e − 6 g ) ≤ M , so that m < M . Finally let η = min  e − 4 g − e − 4 . 5 g , e − 4 . 5 g − e − 5 g , e − 5 g − e − 5 . 5 g , e − 5 . 5 g − e − 6 g  . 17 4.1 Site clustering algorithm W e proceed in three steps, as in the idealized setting of Section 3.2. Howe ver , unlike the idealized setting, we do not assume the kno wledge of e volutionary distances. Note in particular that it is not possible to estimate d ( a, b ) from the samples ( σ i L ) k i =1 because the rate distribution is unknown. Instead, we use ˆ q ( a, b ) as a r ough estimate of how close a and b are in the tree. This will sufﬁce for our purposes, as we sho w in the next subsection. The algorithm is the following: 1. (Close P airs) F or all pairs of leav es a, b ∈ [ n ] , compute ˆ q ( a, b ) and set Υ 0 = { ( a, b ) ∈ [ n ] 2 6 = : ˆ q ( a, b ) ≥ ω − m } . 2. (Sparsiﬁcation) Let S := Υ 0 and Υ 00 := ∅ . • T ake an y pair ( a ∗ , b ∗ ) in S and add it to Υ 00 . • Remove from S all pairs ( a, b ) such that max { ˆ q ( c ∗ , c ) : c ∗ ∈ { a ∗ , b ∗ } , c ∈ { a, b }} ≥ ω + m . • Repeat until S is empty . 3. (F inal Statistic) Return Υ := Υ 00 . 4.2 Analysis of the clustering algorithm Let Υ be the set returned by the previous algorithm. W e show that it is γ s -sparse with high probability . Proposition 6 (Clustering statistic) Under Assumption 1, for all 0 < δ < 1 ther e exists a constant 0 < C < + ∞ (depending on g ) such that the set of pairs Υ r eturned by the pre vious algorithm is γ s -sparse with pr obability 1 − δ pr ovided that the number of samples satisﬁes k ≥ C log n. Mor eover , the algorithm runs in polynomial time. Proof: W e ﬁrst prove that all ˆ q ( a, b ) ’ s are sufﬁciently accurate. 18 Lemma 2 F or all 0 < δ < 1 , ther e exists a constant 0 < C < + ∞ (depending on g ) such that | ˆ q ( a, b ) − q ( a, b ) | ≤ η , for all ( a, b ) ∈ [ n ] 2 6 = with pr obability 1 − δ pr ovided that the number of samples satisﬁes k ≥ C log n. Proof: For each ( a, b ) ∈ [ n ] 2 6 = , ˆ q ( a, b ) is a sum of k independent bounded vari- ables. By Lemma 1, taking ζ = η we hav e | ˆ q ( a, b ) − q ( a, b ) | ≤ η , except with probability 2 exp( − C 0 k ) for some C 0 > 0 depending on p ∞ and η . Note that there are at most n 2 elements in [ n ] 2 6 = so that the probability of failure is at most 2 n 2 exp( − C 0 · C log n ) ≤ δ, for C sufﬁciently lar ge.  W e return to the proof of Proposition 6. Assume that the conclusion of the pre- vious lemma holds. Our goal is to prov e that the site clustering algorithm then follo ws the idealized sparsiﬁcation procedure described in Section 3. 1. W e ﬁrst prov e that the set Υ 0 = { ( a, b ) ∈ [ n ] 2 6 = : ˆ q ( a, b ) ≥ ω − m } . satisﬁes Υ 4 g ⊆ Υ 0 ⊆ Υ m . Let ( a, b ) be such that d ( a, b ) ≤ 4 g . Then q ( a, b ) = Φ( d ( a, b )) ≥ Φ(4 g ) ≥ e − 4 g , by monotonicity and Jensen’ s inequality . Hence ˆ q ( a, b ) ≥ e − 4 g − η ≥ e − 4 g −  e − 4 g − e − 4 . 5 g  ≥ e − 4 . 5 g = ω − m , and Υ 4 g ⊆ Υ 0 . Similarly , let ( a, b ) be such that d ( a, b ) > m . Then q ( a, b ) = Φ( d ( a, b )) < Φ( m ) = ω m , and ˆ q ( a, b ) < ω m + η ≤ e − 5 g +  e − 4 . 5 g − e − 5 g  = ω − m , so that Υ 0 ⊆ Υ m . 19 2. Let Υ 00 be the set obtained during one of the iterations of Step 2 of the site clustering algorithm and ﬁx a pair ( a ∗ , b ∗ ) ∈ Υ 00 . W e need to sho w that the set S 0 of pairs ( a, b ) in Υ 00 such that max { ˆ q ( c ∗ , c ) : c ∗ ∈ { a ∗ , b ∗ } , c ∈ { a, b }} ≥ ω + m (4) is such that it contains all pairs with at least one node within e volutionary distance m of either a ∗ or b ∗ and contains no pair with both nodes beyond e volutionary distance M from both a ∗ and b ∗ . In the ﬁrst case, assume w .l.o.g. that d ( a, a ∗ ) ≤ m. Then, arguing as abo ve, ˆ q ( a, a ∗ ) ≥ e − 5 g −  e − 5 g − e − 5 . 5 g  = ω + m , and (4) is satisﬁed. In the second case, for all c ∈ { a, b } and c ∗ ∈ { a ∗ , b ∗ } d ( c, c ∗ ) > M , and ˆ q ( c, c ∗ ) < e − 6 g +  e − 5 . 5 g − e − 6 g  = ω + m , so that (4) is not satisﬁed. The two properties abo ve guarantee that the algorithm constructs a set Υ as in the idealized sparsiﬁcation procedure of Section 3. In particular , Υ is γ s -sparse by Proposition 3.  5 T ree r econstruction W e now show ho w to use our site clustering statistic to b uild the tree itself. The algorithm is composed of tw o steps: we ﬁrst “bin” the sites according to the value of the clustering statistic; we then use the sites in one of those bins and apply a standard distance-based reconstruction method. By taking the bins suf ﬁciently small, we show that the content of the bins is made of sites with almost identical scaling factor —thus essentially reducing the situation to the unmixed case. Throughout this section, we assume that there is a γ k > 0 such that k = n γ k . W e also assume that Υ is γ s -sparse, that U stands for a cop y of the corresponding clustering statistic under scaling factor Λ , and that Assumption 1 is satisﬁed. 20 5.1 Site binning Ignoring small and large scaling factors W e ﬁrst show that, under Assump- tion 1, the scaling factor has non-ne gligible mass between two bounded v alues. Proposition 7 (Bounding the scaling factor) W e have P  λ ≤ Λ ≤ λ  ≥ χ, wher e λ = g M , λ = 2 1 − e − 5 g , and χ = 1 − e − 5 g 2 . Proof: From our con vention that E [Λ] = 1 , Marko v’ s inequality implies that P [Λ ≥ λ ] ≤ 1 λ = 1 − e − 5 g 2 . For the other direction, we reproduce the argument in Remark 2. Recall that we assume that Φ − 1  e − 6 g  ≤ M . Then the probability mass δ belo w ε (for ε < 6 g / M ) must be such that δ ≤ e − (6 g − εM ) , since e − 6 g ≥ Φ( M ) ≥ δ e − εM . T ake ε = g / M so that δ ≤ e − 5 g . Then we hav e P  λ ≤ Λ ≤ λ  ≥ 1 −  e − 5 g  −  1 − e − 5 g 2  = χ, as desired.  T ranslating the previous proposition into a statement about U -values, we obtain the follo wing. 21 Proposition 8 (Bounding U -values) Letting χ be as above, we have P  U ≤ U Υ (Λ) ≤ U  ≥ χ, wher e U = e − M λ , and U = e − 2 f λ . Proof: Recall that U Υ (Λ) = E [ U | Λ] = 1 | Υ | X ( a,b ) ∈ Υ e − Λ d ( a,b ) . Since 2 f ≤ d ( a, b ) ≤ M for all ( a, b ) ∈ Υ , we hav e e − M Λ ≤ 1 | Υ | X ( a,b ) ∈ Υ e − Λ d ( a,b ) ≤ e − 2 f Λ . The result then follo ws from Proposition 7.  Binning the sites W e will no w bin the sites whose clustering statistic lie be- tween U and U . The previous proposition guarantees that there is a positiv e frac- tion of such sites in expectation. Let ∆ U = γ U log n , be the size of the bins in U -space, where γ U > 0 is a constant to be ﬁx ed later . T o av oid taking integer parts, we assume for simplicity that U − U is a multiple of ∆ U . Let N U = U − U + 2∆ U ∆ U , be the number of bins. (The extra 2∆ U in the numerator accounts for estimation error . See belo w .) Note that U − U = Θ(1) and therefore N U = Θ(log n ) . W e proceed as follo ws: 22 • (Initialization) For j = 0 , . . . , N U , – b B j = ∅ . • (Main Loop) For i = 1 , . . . , k , – (Out-of-bounds) If U i / ∈ [ U − ∆ U , U + ∆ U ) then b B 0 := b B 0 ∪ { i } . – (Binning) Else if U i ∈ [ U − ∆ U + ( j − 1)∆ U , U − ∆ U + j ∆ U ) then b B j := b B j ∪ { i } and b B > 0 := b B > 0 ∪ { i } . Restating Proposition 4, we hav e: Proposition 9 (Concentration of U -values) W e have |U i − U Υ (Λ i ) | ≤ ∆ U , ∀ i ∈ { 1 , . . . , k } , (5) except with pr obability exp  − Ω( n/ log 2 n )  . Proof: T aking ζ = ∆ U in Proposition 4, (5) holds except with probability 2 n γ k exp  − Ω( n/ log 2 n )  = exp  − Ω( n/ log 2 n )  .  W e ﬁrst sho w that each bin contains sites with roughly the same scaling factor . W e ﬁrst need a bound on the scaling factors in b B > 0 . (Note that, because we needed that the bounds U and U be independent of Υ (which itself depends on unkno wn e volutionary distances), Proposition 7 does not apply directly here.) Proposition 10 (Bounds on selected scaling factors) Assume (5) holds. Ther e is γ U > 0 small enough so that for all i ∈ b B > 0 f g M 2 ≤ Λ i ≤ 2 M f (1 − e − 5 g ) . Proof: Since U Υ (Λ i ) ≤ U + 2∆ U by (5), we hav e U + 2∆ U ≥ 1 | Υ | X ( a,b ) ∈ Υ e − Λ i d ( a,b ) ≥ e − M Λ i . 23 Choose γ U > 0 small enough, so that U + 2∆ U ≤ e − f λ . T aking logarithms, Λ i ≥ f g M 2 . Similarly , since U Υ (Λ i ) ≥ U − 2∆ U by (5), we hav e U − 2∆ U ≤ 1 | Υ | X ( a,b ) ∈ Υ e − Λ i d ( a,b ) ≤ e − 2 f Λ i . Choose γ U > 0 small enough, so that U − 2∆ U ≥ e − 2 M λ . T aking logarithms, Λ i ≤ 2 M f (1 − e − 5 g ) .  For j ∈ { 1 , . . . , N U } , let U j = U − ∆ U + ( j − 1 + 1 / 2)∆ U , be the midpoint of the j -th bin. Using the fact that U Υ ( λ ) is strictly decreasing in λ ∈ R + , we deﬁne λ j as the unique solution to U Υ ( λ j ) = U j , for j ∈ { 1 , . . . , N U } . Proposition 11 (Bin v ariation) Assume (5) holds. F or any γ Λ > 0 , one can pick γ U > 0 small enough (depending on M , f , g ) such that for any j ∈ { 1 , . . . , N U } and i ∈ b B j , | Λ i − λ j | ≤ γ Λ log n . 24 Proof: This follows from Proposition 5. Assume that U j ≥ U Υ (Λ i ) . (The other case is similar .) Using the upper bound in Proposition 10 and (5), we get 3 2 ∆ U ≥ U j − U Υ (Λ i ) ≥ e − Λ i M  e 2 f β − 1  ≥  e 2 f β − 1  exp  − 2 M 2 f (1 − e − 5 g )  . Hence, β ≤ 1 2 f log   1 + 3 γ U exp  2 M 2 f (1 − e − 5 g )  2 log n   .  Next, we ar gue that at least one bin contains a non-ne gligible fraction of sites. W e say that a bin b B j is abundant if    b B j    ≥ k χ 6 N U . Proposition 12 (Ab undant bin) W e have ∃ j ∗ ∈ { 1 , . . . , N U } such that b B j ∗ is abundant , (6) except with pr obability exp( − Ω( n γ δ )) for some γ δ > 0 . Proof: For the analysis, we introduce ﬁctitious bins for the (unknown) expected U -v alues. That is, for i = 1 , . . . , k , we let i ∈ B j if U Υ (Λ i ) ∈ [ U − ∆ U + ( j − 1)∆ U , U − ∆ U + j ∆ U ) , for some j ∈ { 2 , . . . , N U − 1 } , or i ∈ B 0 otherwise. Then, there is j ∗∗ ∈ { 2 , . . . , N U − 1 } such that P [ U Υ (Λ) ∈ [ U − ∆ U + ( j ∗∗ − 1)∆ U , U − ∆ U + j ∗∗ ∆ U )] ≥ χ N U , (7) that is, the probability that a site falls into bin B j ∗∗ is at least χ/ N U . This follows immediately from Proposition 8 and the fact that the bins are disjoint and cover the interv al [ U , U ] . From Lemma 1 and (7), P  | B j ∗∗ | ≤ k χ 2 N U  ≤ 2 exp − ( k χ/ 2 N U ) 2 2 k ! = exp  − Ω( n γ k / log 2 n )  . Therefore, if (5) holds, one of b B j ∗∗ − 1 , b B j ∗∗ , or b B j ∗∗ +1 must contain at least a third of the sites in B j ∗∗ . This occurs with probability at least 1 − exp( − Ω( n γ δ )) for some γ δ > 0 .  25 5.2 Estimating a distorted metric Estimating evolutionary distances W e use an abundant bin to estimate ev olu- tionary distances. • (Abundant bin) Let b B ∗ be any bin with at least k χ 6 N U sites and set k ∗ =    b B ∗    and λ ∗ = λ j , where j is the index of b B ∗ , that is, λ ∗ is the midpoint of b B ∗ . • (Evolutionary distances) For all a 6 = b ∈ L , compute ˆ q ∗ ( a, b ) = 1 k ∗ X i ∈ b B ∗ p − 1 ∞  1 { σ i a = σ i b } − q ∞  . W e prov e that the ˆ q ∗ ( a, b ) is a good approximation of e − λ ∗ d ( a,b ) . Proposition 13 (Accuracy) Let γ Λ > 0 , γ q < γ k / 2 be ﬁxed constants. There is a γ δ > 0 such that the following hold e xcept with pr obability exp( − Ω( n γ δ )) : 1. There is at least one ab undant bin. Let b B ∗ be an arbitrary suc h bin. 2. And, for each i ∈ b B ∗ and for all a 6 = b ∈ L ,   ˆ q ∗ ( a, b ) − e − λ ∗ d ( a,b )   ≤ 1 n γ q + e − λ ∗ d ( a,b )  e γ Λ d ( a,b ) log n − 1  . (8) Proof: The result follows from Propositions 9, 10, 11 and 12, and the following lemma. Lemma 3 Let γ Λ > 0 , γ q < γ k / 2 be ﬁxed constants. Let ˜ k ≥ k χ 6 N U , and f g M 2 ≤ ˜ λ, ˜ λ 1 , . . . , ˜ λ ˜ k ≤ 2 M f (1 − e − 5 g ) 26 be such that, for all i ,    ˜ λ i − ˜ λ    ≤ γ Λ log n . (9) Let ˜ σ i L ∼ D [ T , ˜ λ i , r ] independently for all i . Then, for all a 6 = b ∈ L ,       1 ˜ k ˜ k X i =1 p − 1 ∞  1 { ˜ σ i a = ˜ σ i b } − q ∞  − e − ˜ λd ( a,b )       ≤ 1 n γ q + e − ˜ λd ( a,b )  e γ Λ d ( a,b ) log n − 1  . except with pr obability exp ( − Ω( n γ k − 2 γ q / log n )) . Proof: In Lemma 1, take m = ˜ k = Ω( n γ k / log n ) , t = 1 ˜ k , and ζ = 1 n γ q . Then, except with probability 2 n 2 exp ( − Ω( n γ k − 2 γ q / log n )) ,       1 ˜ k ˜ k X i =1 p − 1 ∞  1 { ˜ σ i a = ˜ σ i b } − q ∞  − 1 ˜ k ˜ k X i =1 e − ˜ λ i d ( a,b )       ≤ 1 n γ q , for all a 6 = b ∈ L . Moreov er , by (9),       e − ˜ λd ( a,b ) − 1 ˜ k ˜ k X i =1 e − ˜ λ i d ( a,b )       ≤ e − ˜ λd ( a,b )       1 − 1 ˜ k ˜ k X i =1 e | ˜ λ − ˜ λ i | d ( a,b )       ≤ e − ˜ λd ( a,b )    1 − e γ Λ d ( a,b ) log n    .   T ree construction T o reconstruct the tree, we use a distance-based method of [DMR09]. W e require the follo wing deﬁnition. Deﬁnition 7 (Distorted metric [Mos07, KZZ03]) Let T = ( V , E ; L, d ) be a phy- logeny and let τ , Ψ > 0 . W e say that ˆ d : L × L → (0 , + ∞ ] is a ( τ , Ψ) - distorted metric for T or a ( τ , Ψ) - distortion of d if: 1. [Symmetry] F or all u, v ∈ L , ˆ d is symmetric, that is, ˆ d ( u, v ) = ˆ d ( v , u ); 27 2. [Distortion] ˆ d is accurate on “short” distances, that is, for all u, v ∈ L , if either d ( u, v ) < Ψ + τ or ˆ d ( u, v ) < Ψ + τ then    d ( u, v ) − ˆ d ( u, v )    < τ . An immediate consequence of [DMR09, Theorem 1] is the follo wing. Theorem 3 (See [DMR09].) Let T = ( V , E ; L, d ) be a phylogeny with n leaves in T f ,g . Then topology of T can be r ecover ed in polynomial time fr om a ( τ , Ψ) - distortion ˆ d of d as long as τ ≤ f 5 , and Ψ ≥ 5 g log n. (The constants above ar e not optimal but will suf ﬁce for our purposes.) See [DMR09] for the details of the reconstruction algorithm. W e no w sho w ho w to obtain a ( f / 5 , 5 g log n ) -distortion with high probability . Proposition 14 (Distortion estimation) Ther e ar e γ U , γ Λ , γ q , γ k > 0 so that, given that the conclusions of Pr oposition 13 hold, then ˆ d ( a, b ) = − ln ( ˆ q ∗ ( a, b ) + ) , ( a, b ) ∈ L × L, is a ( λ ∗ f / 5 , 5 λ ∗ g log n ) -distortion of λ ∗ d . Proof: Deﬁne Ł − 2 = { ( a, b ) ∈ L × L : d ( a, b ) ≤ 15 g log n } , and Ł + 2 = { ( a, b ) ∈ L × L : d ( a, b ) > 12 g log n } , Let ( a, b ) ∈ Ł − 2 . Note that e − λ ∗ d ( a,b ) ≥ exp  −  2 M f (1 − e − 5 g )  15 g log n  ≡ 1 n γ 0 q , where the last equality is a deﬁnition. Then, taking γ q (and hence γ k ) large enough and γ Λ (and hence γ U ) small enough, from (8) we hav e    ˆ d ( a, b ) − λ ∗ d ( a, b )    ≤  f g M 2  f 5 ≤ λ ∗ f 5 . 28 Similarly , let ( a, b ) ∈ Ł + 2 . Note that e − λ ∗ d ( a,b ) < exp  −  f g M 2  12 g log n  ≡ 1 n γ 00 q , where the last equality is a deﬁnition. Then, taking γ q large enough and γ Λ small enough, from (8) we hav e ˆ d ( a, b ) ≥ 6 λ ∗ g log n ≥ 5 λ ∗ g log n + λ ∗ f 5 .  W e ﬁnally state our main tree-construction result. Proposition 15 (T ree r econstruction) Under Assumption 1, given a γ s -sparse Υ ther e is a γ k > 0 lar ge enough so that the topology of the tr ee can be r econstructed in polynomial time using k = n γ k samples, e xcept with pr obability exp( − Ω( n γ δ )) for some γ δ > 0 . Proof: The result follows from Theorem 3 and Proposition 14.  Combining Propositions 6 and 15, we get Theorem 2. 6 Concluding r emarks Using techniques from the recent unpublished manuscript [MR11], our results can be e xtended to handle the more general GTR model of molecular e volution which allo ws Q -matrices to be time-re versible. This generalization in volv es choosing pairs of lea ves that are not only connected by edge-disjoint paths, but also far enough from each other . One can then use mixing arguments to deri ve the inde- pendence properties required for concentration of the site clustering statistic. W e leav e out the details. 29 Refer ences [AAR08] Elizabeth S. Allman, Cecile Ane, and John A. Rhodes. Identiﬁability of a Mark ovian model of molecular e volution with gamma-distrib uted rates. Advances in Applied Pr obability , 40(1):228–249, 2008. [APRS11] Elizabeth S. Allman, Sonja Petrovic, John A. Rhodes, and Seth Sul- li vant. Identiﬁability of two-tree mixtures for group-based models. IEEE/A CM T ransactions on Computational Biolo gy and Bioinformat- ics , 8:710–722, 2011. [AR06] Elizabeth S. Allman and John A. Rhodes. The identiﬁability of tree topology for phylogenetic models, including cov arion and mixture models. Journal of Computational Biology , 13(5):1101–1113, 2006. PMID: 16796553. [Bil95] Patrick Billingsley . Pr obability and measure . W iley Series in Proba- bility and Mathematical Statistics. John W iley & Sons Inc., New Y ork, 1995. [CH11] Juanjuan Chai and Elizabeth A. Housworth. On Rogers’ proof of identiﬁability for the GTR + Gamma + I model. 2011. Published online at http://sysbio.oxfordjournals.org/content/early/2011/03/27/sysbio.syr023.short. [Cha96] Joseph T . Chang. Full reconstruction of Markov models on ev olution- ary trees: identiﬁability and consistency . Math. Biosci. , 137(1):51–73, 1996. [CT06] Benny Chor and T amir T uller . Finding a maximum likelihood tree is hard. J. A CM , 53(5):722–744, 2006. [DMR09] Constantinos Daskalakis, Elchanan Mossel, and S ´ ebastien Roch. Phy- logenies without branch bounds: Contracting the short, pruning the deep. In Seraﬁm Batzoglou, editor , RECOMB , volume 5541 of Lec- tur e Notes in Computer Science , pages 451–465. Springer , 2009. [EKPS00] W . S. Ev ans, C. K enyon, Y . Peres, and L. J. Schulman. Broadcasting on trees and the Ising model. Ann. Appl. Pr obab . , 10(2):410–433, 2000. 30 [ESSW99] P . L. Erd ¨ os, M. A. Steel, L. A. Sz ´ ekely , and T . A. W arnow . A few logs suf ﬁce to b uild (almost) all trees (part 1). Random Struct. Algor . , 14(2):153–184, 1999. [EW04] Stev en N. Evans and T andy W arnow . Unidentiﬁable di ver gence times in rates-across-sites models. IEEE/ACM T rans. Comput. Biology Bioinform. , 1(3):130–134, 2004. [Fel04] J. Felsenstein. Inferring Phylog enies . Sinauer , Sunderland, MA, 2004. [HR97] J. P . Huelsenbeck and B. Rannala. Phylogenetic methods come of age: T esting hypotheses in an e volutionary context. Science , 276(5310):227–232, 1997. [KZZ03] V alerie King, Li Zhang, and Y unhong Zhou. On the complexity of distance-based e volutionary tree reconstruction. In SOD A ’03: Pr o- ceedings of the fourteenth annual ACM-SIAM symposium on Discr ete algorithms , pages 444–453, Philadelphia, P A, USA, 2003. Society for Industrial and Applied Mathematics. [MMS08] Frederick A. Matsen, Elchanan Mossel, and Mike Steel. Mixed-up trees: the structure of phylogenetic mixtures. Bulletin of Mathemati- cal Biology , 70(4):1115–1139, 2008. [Mos07] E. Mossel. Distorted metrics on trees and phylogenetic forests. IEEE/A CM T rans. Comput. Bio. Bioinform. , 4(1):108–116, 2007. [MR95] Rajeev Motwani and Prabhakar Raghav an. Randomized algorithms . Cambridge Uni versity Press, Cambridge, 1995. [MR08] Elchanan Mossel and S ´ ebastien Roch. Detecting and untangling phy- logenetic mixtures: An approach based on site clustering. Preprint, 2008. [MR11] Elchanan Mossel and S ´ ebastien Roch. Phylogenetic mixtures: Con- centration of measure in the large-tree limit. Preprint, 2011. [MS07] Frederick A. Matsen and Mike Steel. Phylogenetic mixtures on a single tree can mimic a tree of another topology . Systematic Biology , 56(5):767–775, 2007. 31 [Roc06] S ´ ebastien Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM T rans. Comput. Biology Bioinform. , 3(1):92–94, 2006. [RS10] J. Rhodes and S. Sulli v ant. Identiﬁability of large phylogenetic mix- ture models. Preprint, 2010. [SS03] C. Semple and M. Steel. Phylog enetics , v olume 22 of Mathematics and its Applications series . Oxford Univ ersity Press, 2003. [SS06] M. A. Steel and L. A. Sz ´ ekely . On the v ariational distance of two trees. Ann. Appl. Pr obab . , 16(3):1563–1575, 2006. [SSH94] MA Steel, LA Szkely , and MD Hendy . Reconstructing trees when sequence sites e volve at variable rates. J. Comput. Biol. , 1(2):153– 163, 1994. [Ste09] Mike Steel. A basic limitation on inferring phylogenies by pairwise sequence comparisons. Journal of Theor etical Biology , 256(3):467 – 472, 2009. [ ˇ SV07a] Daniel ˇ Stefanko vi ˇ c and Eric V igoda. Phylogeny of mixture models: robustness of maximum likelihood and non-identiﬁable distributions. J . Comput. Biol. , 14(2):156–189 (electronic), 2007. [SV07b] Daniel Stefankovic and Eric V igoda. Pitfalls of heterogeneous pro- cesses for phylogenetic reconstruction. Syst. Biol. , 56(1):113–124, 2007. [WS10] Jihua W u and Edward Susko. Rate-variation need not defeat phylo- genetic inference through pairwise sequence comparisons. Journal of Theor etical Biology , 263(4):587 – 589, 2010. 32

Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment