Phylogenetic mixtures: Concentration of measure in the large-tree limit
The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make …
Authors: Elchanan Mossel, Sebastien Roch
The Annals of Applied Probability
2012, Vol. 22, No. 6, 2429–2459
DOI: 10.1214/11-AAP837
© Institute of Mathematical Statistics, 2012

PHYLOGENETIC MIXTURES: CONCENTRATION OF MEASURE IN THE LARGE-TREE LIMIT

By Elchanan Mossel$^{1}$ and Sebastien Roch$^{2}$

University of California, Berkeley, and University of California, Los Angeles

The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixture of phylogenies. Such mixed models play an increasingly important role in practice. Using concentration of measure techniques, we show that mixtures of large trees are typically identifiable. We also derive sequence-length requirements for high-probability reconstruction.

Received August 2011; revised December 2011.
$^{1}$Supported by DMS 0548249 (CAREER) award, by DOD ONR Grant N0001411 10140, by ISF Grant 1300/08 and by ERC Grant PIRG04-GA-2008-239137.
$^{2}$Supported by NSF Grant DMS-10-07144.
AMS 2000 subject classifications. Primary 60K35; secondary 92D15.
Key words and phrases. Phylogenetic reconstruction, random trees, concentration of measure.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Probability, 2012, Vol. 22, No. 6, 2429–2459. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Phylogenetics [10, 22] is centered around the reconstruction of evolutionary histories from molecular data extracted from modern species. The assumption is that molecular data consists of aligned sequences and that each position in the sequences evolves independently according to a Markov model on a tree, where the key parameters are (see Section 3 for formal definitions):

• Rate matrix. An $r \times r$ mutation rate matrix $Q$, where $r$ is the alphabet size. A typical alphabet is the set of nucleotides $\{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$, but here we allow more general state spaces. Without loss of generality, we denote the alphabet by $\mathcal{R} = [r] = \{1, \ldots, r\}$. The $(i, j)$th entry of $Q$ encodes the rate at which state $i$ mutates into state $j$.

• Binary tree. An evolutionary tree $T$, where the leaves are the modern species and each branching represents a past speciation event. The leaves are labeled with names of species. Without loss of generality, we assume the labels are $X = [n]$.

• Branch lengths. For each edge $e$, we have a scalar branch length $w_e$ which measures the expected total number of substitutions per site along edge $e$. Roughly speaking, $w_e$ is the amount of mutational change between the endpoints of $e$.

The classical problem in phylogenetics can be stated as follows:

• Phylogenetic tree reconstruction (PTR): Unmixed case. Given $n$ molecular sequences of length $k$,
\[ \{ s_a = (s_a^i)_{i=1}^k \}_{a \in [n]} \]
with $s_a^i \in [r]$, which have evolved according to the process above with independent sites, reconstruct the topology of the evolutionary tree.

There exists a vast theoretical literature on this problem; see, for example, [22] and references therein. However, various phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixture of different phylogenies. Here, using concentration of measure techniques, we show that mixtures of large trees are typically identifiable.
By typically, we mean informally that our results hold under conditions guaranteeing that the tree topologies present in the mixture are sufficiently distinct. (See Section 2.2 for a careful statement of the theorems.) In particular, we give a broad new class of conditions under which mixtures are identifiable, and we extend, to more general substitution models, previous results on the total variation distance between Markov models on trees. Our proofs are constructive in that we provide a computationally efficient reconstruction algorithm. We also derive sequence-length requirements for high-probability reconstruction.

Our identifiability and reconstruction results represent an important first step toward dealing with more biologically relevant mixture models (such as the ones mentioned above) in which the tree topologies tend to be similar. In particular, in a recent related paper [18], we have used the techniques developed here to reconstruct common rates-across-sites models.

1.1. Related work. Most prior theoretical work on mixture models has focused on the question of identifiability. A class of phylogenetic models is identifiable if any two models in the class produce different data distributions. It is well known that unmixed phylogenetic models are typically identifiable [6]. This is not the case in general for mixtures of phylogenies. For instance, Steel et al. [24] showed that for any two trees one can find a random scaling on each of them, such that their data distributions are identical. Hence it is hopeless, in general, to reconstruct phylogenies under mixture models. See also [9, 13, 14, 23, 26, 27] for further examples of this type.

However, the negative examples constructed in the references above are not necessarily typical.
They use special features of the mutation models (and their invariants) and allow themselves quite a bit of flexibility in setting up the topologies and branch lengths. In fact, recently a variety of more standard mixture models have been shown to be identifiable. These include the common GTR + Γ model [1, 28] and GTR + Γ + I model [5], as well as some covarion models [3], some group-based models [2] and so-called $r$-component identical tree mixtures [20]. Although these results do not provide practical algorithms for reconstructing the corresponding mixtures, they do give hope that these problems may be tackled successfully.

Beyond the identifiability question, there seems to have been little rigorous work on reconstructing phylogenetic mixture models. One positive result is the case of the molecular clock assumption with across-sites rate variation [24], although no sequence-length requirements are provided. There is a large body of work on practical reconstruction algorithms for various types of mixtures, notably rates-across-sites models and covarion-type models, using mostly likelihood and Bayesian methods; see, for example, [10] for references. But the optimization problems they attempt to solve are likely NP-hard [7, 21]. There also exist many techniques for testing for the presence of a mixture (e.g., for testing for rate heterogeneity), but such tests typically require the knowledge of the phylogeny; see, for example, [11].

Here we give both identifiability and reconstruction results. The proof of our main results relies on the construction of a clustering statistic that discriminates between distinct phylogenies. A similar approach was used recently in [18]. There, however, the problem was to distinguish between phylogenies with the same topology, but different branch lengths.
In the current work, a main technical challenge is to analyze the simultaneous behavior of such a clustering statistic on distinct topologies. A similar statistic was also used in [25] to prove a special case of Theorem 2 below. However, in contrast to [25], our main result requires that a clustering statistic be constructed based only on data generated by the mixture, that is, without prior knowledge of the topologies to be distinguished. Finally, unlike [18] and [25], we consider the more general GTR model.

2. Definitions and results.

2.1. Basic definitions.

Phylogenies. A phylogeny is a graphical representation of the speciation history of a group of organisms. The leaves typically correspond to current species. Each branching indicates a speciation event. Moreover we associate to each edge a positive weight. This weight can be thought of roughly as the amount of evolutionary change on the edge. More formally, we make the following definitions; see, for example, [22]. Fix a set of leaf labels $X = [n] = \{1, \ldots, n\}$.

Definition 2.1 (Phylogeny). A weighted binary phylogenetic $X$-tree (or phylogeny) $T = (V, E; \phi; w)$ is a tree with vertex set $V$, edge set $E$, leaf set $L$ with $|L| = n$, and a bijective mapping $\phi : X \to L$ such that:
(1) The degree of all internal vertices $V - L$ is exactly 3.
(2) The edges are assigned weights $w : E \to (0, +\infty)$.
We let $\mathcal{T}_l[T] = (V, E; \phi)$ be the leaf-labelled topology of $T$.

Definition 2.2 (Tree metric). A phylogeny $T = (V, E; \phi; w)$ is naturally equipped with a tree metric $d_T : X \times X \to (0, +\infty)$ defined as follows:
\[ \forall a, b \in X \qquad d_T(a, b) = \sum_{e \in \mathrm{Path}_T(\phi(a), \phi(b))} w_e, \]
where $\mathrm{Path}_T(u, v)$ is the set of edges on the path between $u$ and $v$ in $T$. We will refer to $d_T(a, b)$ as the evolutionary distance between $a$ and $b$.
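As an illustration of Definition 2.2, the following minimal Python sketch computes $d_T$ on a small weighted tree by summing branch lengths along the unique path between two leaves. The function name `tree_metric`, the vertex names and the quartet example are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from collections import defaultdict

def tree_metric(edges, leaves):
    """Return d_T(a, b) for all pairs of leaf labels of a weighted tree.

    edges  : iterable of (u, v, w) with positive branch length w
    leaves : dict mapping each leaf label a in X to its vertex phi(a)
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(src):
        # Iterative depth-first search; paths in a tree are unique,
        # so the first distance found for a vertex is the path length.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    d = {}
    for a, va in leaves.items():
        from_a = distances_from(va)
        for b, vb in leaves.items():
            d[(a, b)] = from_a[vb]
    return d

# Hypothetical quartet 12|34 with internal vertices "u" and "v".
edges = [("l1", "u", 0.1), ("l2", "u", 0.2), ("u", "v", 0.3),
         ("l3", "v", 0.4), ("l4", "v", 0.5)]
leaves = {1: "l1", 2: "l2", 3: "l3", 4: "l4"}
d = tree_metric(edges, leaves)
```

Here `d[(1, 3)]` adds the three branch lengths on the path from leaf 1 to leaf 3. The same routine applied to internal vertices gives the extended distance $d_T(u, v)$.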
In a slight abuse of notation, we also sometimes use $d_T(u, v)$ to denote the evolutionary distance as above between any two vertices $u, v$ of $T$.

We will restrict ourselves to the following standard special case.

Definition 2.3 (Regular phylogenies). Let $0 < f \le g < +\infty$. We denote by $\mathcal{Y}^{(n)}_{f,g}$ the set of phylogenies $T = (V, E; \phi; w)$ with $n$ leaves such that $f \le w_e \le g$, $\forall e \in E$. We also let $\mathcal{Y}_{f,g} = \bigcup_{n \ge 1} \mathcal{Y}^{(n)}_{f,g}$.

GTR model. A commonly used model of DNA sequence evolution is the following GTR model; see, for example, [22]. We first define an appropriate class of rate matrices.

Definition 2.4 (GTR rate matrix). Let $\mathcal{R}$ be a set of character states with $r = |\mathcal{R}|$. Without loss of generality we assume that $\mathcal{R} = [r]$. Let $\pi$ be a probability distribution on $\mathcal{R}$ satisfying $\pi_x > 0$ for all $x \in \mathcal{R}$. A general time-reversible (GTR) rate matrix on $\mathcal{R}$, with respect to stationary distribution $\pi$, is an $r \times r$ real-valued matrix $Q$ such that:
(1) $Q_{xy} > 0$ for all $x \ne y \in \mathcal{R}$.
(2) $\sum_{y \in \mathcal{R}} Q_{xy} = 0$, for all $x \in \mathcal{R}$.
(3) $\pi_x Q_{xy} = \pi_y Q_{yx}$, for all $x, y \in \mathcal{R}$.
By the reversibility assumption, $Q$ has $r$ real eigenvalues
\[ 0 = \Lambda_1 > \Lambda_2 \ge \cdots \ge \Lambda_r. \]
We normalize $Q$ by fixing $\Lambda_2 = -1$.

Definition 2.5 (GTR model). Consider the following stochastic process. We are given a phylogeny $T = (V, E; \phi; w)$ and a finite set $\mathcal{R}$ with $r$ elements. Let $\pi$ be a probability distribution on $\mathcal{R}$ and $Q$ be a GTR rate matrix with respect to $\pi$. Associate to each edge $e \in E$ the stochastic matrix
\[ M(e) = \exp(w_e Q). \]
The process runs as follows. Choose an arbitrary root $\rho \in V$. Denote by $E_{\downarrow}$ the set $E$ directed away from the root. Pick a state for the root at random according to $\pi$. Moving away from the root toward the leaves, apply the channel $M(e)$ to each edge $e$ independently.
Denote the state so obtained $s_V = (s_v)_{v \in V}$. In particular, $s_L$ is the state at the leaves, which we also denote by $s_X$. More precisely, the joint distribution of $s_V$ is given by
\[ \mu_V(s_V) = \pi_{\rho}(s_{\rho}) \prod_{e = (u, v) \in E_{\downarrow}} [M(e)]_{s_u s_v}. \]
For $W \subseteq V$, we denote by $\mu_W$ the marginal of $\mu_V$ at $W$. Under this model, the weight $w_e$ is the expected number of substitutions on edge $e$ in the continuous-time process. We denote by $\mathcal{D}[T, Q]$ the probability distribution of $s_V$. We also let $\mathcal{D}_l[T, Q]$ denote the probability distribution of
\[ s_X \equiv (s_{\phi(a)})_{a \in X}. \]
More generally, we consider $k$ independent samples $\{s^i_V\}_{i=1}^k$ from the model above, that is, $s^1_V, \ldots, s^k_V$ are i.i.d. $\mathcal{D}[T, Q]$. We think of $(s^i_v)_{i=1}^k$ as the sequence at node $v \in V$. Typically, $\mathcal{R} = \{\mathrm{A}, \mathrm{G}, \mathrm{C}, \mathrm{T}\}$ and the model describes how DNA sequences stochastically evolve by point mutations along an evolutionary tree under the assumption that each site in the sequences evolves independently. When considering many samples $\{s^i_V\}_{i=1}^k$, we drop the superscript to refer to a single sample $s_V$.

Mixed model. We introduce the basic mixed model which will be the focus of this paper. We will use the following definition. We assume that $Q$ is fixed and known throughout.

Remark 2.1 (Unknown rate matrix). See the concluding remarks for an extension of our techniques when $Q$ is unknown.

Definition 2.6 ($\Theta$-mixture). Let $\Theta$ be a positive integer. In the $\Theta$-mixture model, we consider a finite set of phylogenies
\[ \mathcal{T} = \{ T_\theta = (V_\theta, E_\theta; \phi_\theta; w_\theta) \}_{\theta=1}^{\Theta} \]
on the same set of leaf labels $X = [n]$ and a positive probability distribution $\nu = (\nu_\theta)_{\theta=1}^{\Theta}$ on $[\Theta]$. Consider $k$ i.i.d. random variables $N_1, \ldots, N_k$ with distribution $\nu$. Then, conditioned on $N_1, \ldots, N_k$, the samples $\{s^i_X\}_{i=1}^k$ generated under the $\Theta$-mixture model $(\mathcal{T}, \nu, Q)$ are independent with conditional distribution
\[ s^j_X \sim \mathcal{D}_l[T_{N_j}, Q], \qquad j = 1, \ldots, k. \]
We denote by $\mathcal{D}_l[(\mathcal{T}, \nu, Q)]$ the probability distribution of $s^1_X$. We will refer to $T_\theta$ as the $\theta$-component of the mixture $(\mathcal{T}, \nu, Q)$. We assume that $\Theta$ is fixed and known throughout.

As above, we drop the superscript to refer to a single sample $s_X$ with corresponding component indicator $N$. To simplify notation, we let
\[ d_{T_\theta} = d_\theta \qquad \forall \theta \in [\Theta]. \]

Some notation. We will use the notation $[n]^2 = \{(a, b) \in [n] \times [n] : a \le b\}$, $[n]^2_{=} = \{(a, a)\}_{a \in [n]}$ and $[n]^2_{\ne} = [n]^2 - [n]^2_{=}$. We also denote by $[n]^4_{\ne}$ the set of pairs $(a_1, b_1), (a_2, b_2) \in [n]^2_{\ne}$ such that $(a_1, b_1) \ne (a_2, b_2)$ (as pairs). We use the notation $\mathrm{poly}(n)$ to denote the growth condition usually written $\Theta(n^C)$ for some $C > 0$.

2.2. Main results. We make the following assumptions on the mutation model.

Assumption 1. Let $0 < f \le g < +\infty$, and $\underline{\nu} > 0$. We will use the following set of assumptions on a $\Theta$-mixture model $(\mathcal{T}, \nu, Q)$:
(1) Regular phylogenies: $T_\theta \in \mathcal{Y}_{f,g}$, $\forall \theta \in [\Theta]$.
(2) Minimum frequency: $\nu_\theta \ge \underline{\nu}$, $\forall \theta \in [\Theta]$.
We denote by $\Theta$-M$[f, g, \underline{\nu}, n]$ the set of $\Theta$-mixture models on $n$ leaves satisfying these conditions.

Remark 2.2 (No minimum frequency). See the concluding remarks for an extension of our techniques when the minimum frequency assumption is not satisfied.

Tree identifiability. Our first result states that, under Assumption 1, $\Theta$-mixture models are identifiable, except for an "asymptotically negligible fraction." To formalize this notion, we use the following definition. Note that $\Theta$-M$[f, g, \underline{\nu}, n]$ is a compact subset of a finite product of metric spaces [4] which we equip with its Borel $\sigma$-algebra.
Definition 2.7 (Permutation-invariant measure). Let $A \subseteq \Theta$-M$[f, g, \underline{\nu}, n]$ be a Borel set. Given $\Theta$ permutations $\Pi = \{\Pi_\theta\}_{\theta \in [\Theta]}$ of $X$, we let
\[ \Pi[\mathcal{T}] \equiv \{ \Pi_\theta[T_\theta] \}_{\theta \in [\Theta]} \equiv \{ (V_\theta, E_\theta; \phi_\theta \circ \Pi_\theta; w_\theta) \}_{\theta \in [\Theta]}, \]
where $\circ$ indicates composition, and
\[ A_\Pi = \{ (\mathcal{T}, \nu, Q) \in \Theta\text{-M}[f, g, \underline{\nu}, n] : (\Pi[\mathcal{T}], \nu, Q) \in A \}. \]
A probability measure $\lambda$ on $\Theta$-M$[f, g, \underline{\nu}, n]$ is permutation-invariant if for all $A$ and $\Pi$ as above, we have the following:
\[ \lambda[A] = \lambda[A_\Pi]. \]

Remark 2.3. Alternatively one can think of a permutation-invariant measure as first picking unlabeled trees, branch weights and mixture frequencies according to a specified joint distribution, and then labeling the leaves of each tree in the mixture independently, uniformly at random. Note that the independent labeling of the trees is needed for our proof. It ensures that the phylogenies in the mixture are typically "sufficiently distinct." Generalizing our results, possibly in a weaker form, to mixtures of "similar" phylogenies is an important open problem. See [18] for recent progress in this direction.

For two $\Theta$-mixture models $(\mathcal{T}, \nu, Q)$ and $(\mathcal{T}' = \{T'_\theta\}_{\theta \in [\Theta]}, \nu', Q)$, we write
\[ (\mathcal{T}, \nu, Q) \nsim (\mathcal{T}', \nu', Q), \]
if there is no bijective mapping $h$ of $[\Theta]$ such that
\[ \mathcal{T}_l[T_\theta] = \mathcal{T}_l[T'_{h(\theta)}] \qquad \forall \theta \in [\Theta]. \]
In words, $(\mathcal{T}, \nu, Q)$ and $(\mathcal{T}', \nu', Q)$ are not equivalent up to component relabeling.

Theorem 1 (Tree identifiability). Fix $0 < f \le g < +\infty$, and $\underline{\nu} > 0$. Then, there exists a sequence of Borel subsets $A_n \subseteq \Theta$-M$[f, g, \underline{\nu}, n]$, $n \ge 1$, such that the following hold:
(1) For any sequence of permutation-invariant measures $\lambda_n$, $n \ge 1$, respectively, on $\Theta$-M$[f, g, \underline{\nu}, n]$, $n \ge 1$, we have
\[ \lambda_n[A_n] = 1 - o_n(\underline{\nu}, f, g) \]
as $n \to \infty$.
Here $o_n(\underline{\nu}, f, g)$ indicates convergence to 0 as $n \to \infty$ for fixed $\underline{\nu}, f, g$.
(2) For all
\[ (\mathcal{T}, \nu, Q) \nsim (\mathcal{T}', \nu', Q) \in \bigcup_{n \ge 1} A_n, \]
we have
\[ \mathcal{D}_l[(\mathcal{T}, \nu, Q)] \ne \mathcal{D}_l[(\mathcal{T}', \nu', Q)]. \]

Remark 2.4. As remarked above, our proof requires that the phylogenies in the mixture are "sufficiently different." This is typically the case under a permutation-invariant measure. Roughly speaking, the complements of the sets $A_n$ in the previous theorem contain those exceptional instances where the phylogenies are too "similar." See the proof for a formal definition of $A_n$.

Tree distance. We also generalize to GTR models a result of Steel and Székely: phylogenies are typically far away in variational distance [25]. The techniques in [25] apply only to group-based models and other highly symmetric models; see [25] for details. Let $\| \cdot \|_{\mathrm{TV}}$ denote total variation distance; that is, for two probability measures $\mathcal{D}, \mathcal{D}'$ on a measure space $(\Omega, \mathcal{F})$ define
\[ \| \mathcal{D} - \mathcal{D}' \|_{\mathrm{TV}} = \sup_{B \in \mathcal{F}} | \mathcal{D}(B) - \mathcal{D}'(B) |. \]

Theorem 2 (Tree distance). Let $\{A_n\}_n$ be as in Theorem 1 where $\Theta = 2$ and $\underline{\nu} = 1/2$ [in which case we necessarily have $\nu = (1/2, 1/2)$]. Then for all
\[ (\mathcal{T}, \nu, Q) \in \bigcup_{n \ge 1} A_n, \]
we have
\[ \| \mathcal{D}_l[T_1, Q] - \mathcal{D}_l[T_2, Q] \|_{\mathrm{TV}} = 1 - o_n(1). \]

Remark 2.5. Note that $\underline{\nu}$ plays no substantive role in the previous theorem other than to determine $A_n$.

Tree reconstruction. The proofs of Theorems 1 and 2 rely on the following reconstruction result of independent interest. We show that the topologies can be reconstructed efficiently with high confidence using polynomial length sequences. Recall that $k$ denotes the sequence length.

Theorem 3 (Tree reconstruction). Fix $0 < f \le g < +\infty$, and $\underline{\nu} > 0$.
Then, there exists a sequence of Borel subsets $A_n \subseteq \Theta$-M$[f, g, \underline{\nu}, n]$, $n \ge 1$, such that the following hold:
(1) For any sequence of permutation-invariant measures $\lambda_n$, $n \ge 1$, respectively, on $\Theta$-M$[f, g, \underline{\nu}, n]$, $n \ge 1$, we have
\[ \lambda_n[A_n] = 1 - o_n(\underline{\nu}, f, g) \]
as $n \to \infty$.
(2) For all
\[ (\mathcal{T}, \nu, Q) \in \bigcup_{n \ge 1} A_n, \]
the topologies of $(\mathcal{T}, \nu, Q)$ can be reconstructed in time polynomial in $n$ and $k$ using polynomially many samples (i.e., $k$ is polynomial in $n$) with probability $1 - o_n(\underline{\nu}, f, g)$ under the samples and the randomness of the algorithm.

Remark 2.6. The subsets $\{A_n\}_n$ in Theorems 1 and 3 are in fact the same.

The rest of the paper is devoted to the proof of Theorem 3 which implies Theorems 1 and 2.

2.3. Proof overview. The proof of Theorem 3 relies on the construction of a clustering statistic that discriminates between distinct phylogenies.

Clustering statistic. Fix $0 < f \le g < +\infty$ and $\underline{\nu} > 0$. Suppose for now that $\Theta = 2$, and let $\lambda$ be a permutation-invariant probability measure on $\Theta$-M$[f, g, \underline{\nu}, n]$. It will be useful to think of $\lambda$ as a two-step procedure: first pick unlabeled, weighted topologies; and second, assign a uniformly random labeling to the leaves of each tree. Pick a $\Theta$-mixture model $(\mathcal{T}, \nu, Q)$ according to $\lambda$. We will denote by $\mathbb{P}_\lambda$ and $\mathbb{E}_\lambda$ probability and expectation under $\lambda$. Similarly, we denote by $\mathbb{P}_l$ and $\mathbb{E}_l$ (resp., $\mathbb{P}_A$ and $\mathbb{E}_A$) probability and expectation under $(\mathcal{T}, \nu, Q)$ (resp., under the randomness of our algorithm), as well as combinations such as $\mathbb{P}_{A,\lambda}$ with the obvious meaning.

Let $z = (z_x)_{x=1}^r$ be a (real-valued) right eigenvector of $Q$ corresponding to eigenvalue $\Lambda_2 = -1$ and normalize $z$ so that
\[ \sum_{x=1}^r \pi_x z_x^2 = 1. \]
(Any negative eigenvalue could be used instead.)
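To make the normalization of $z$ concrete, here is a minimal Python sketch for one specific rate matrix. The choice of the 4-state Jukes-Cantor matrix (scaled so that $\Lambda_2 = -1$) and of the particular eigenvector are assumptions made for illustration only; the paper works with a general GTR matrix.

```python
import math

# Hypothetical 4-state example: Jukes-Cantor rates scaled so Lambda_2 = -1.
r = 4
pi = [0.25] * r
Q = [[-0.75 if x == y else 0.25 for y in range(r)] for x in range(r)]

# Candidate right eigenvector for eigenvalue -1,
# scaled so that sum_x pi_x * z_x^2 = 1.
z = [math.sqrt(2.0), -math.sqrt(2.0), 0.0, 0.0]

def mat_vec(M, v):
    """Plain matrix-vector product M v."""
    return [sum(M[x][y] * v[y] for y in range(len(v))) for x in range(len(M))]

Qz = mat_vec(Q, z)                                  # should equal -z
norm = sum(pi[x] * z[x] ** 2 for x in range(r))     # should equal 1
mean = sum(pi[x] * z[x] for x in range(r))          # should equal 0

# GTR conditions of Definition 2.4: zero row sums and detailed balance.
row_sums = [sum(row) for row in Q]
reversible = all(abs(pi[x] * Q[x][y] - pi[y] * Q[y][x]) < 1e-12
                 for x in range(r) for y in range(r))
```

In this example the $\pi$-weighted mean of $z$ vanishes, which is what one expects for any eigenvector of a reversible $Q$ with nonzero eigenvalue, since such a $z$ is $\pi$-orthogonal to the constant eigenvector.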
Consider the following one-dimensional mapping of the samples ([17], Lemma 5.3): for all $i = 1, \ldots, k$ and $a \in X$,
\[ \sigma^i_a = z_{s^i_a}. \tag{1} \]
Recall that we drop the superscript when referring to a single sample. It holds that
\[ \mathbb{E}_l[\sigma_a \mid N = \theta] = 0. \tag{2} \]
Moreover, following a computation in [17], Lemma 5.3, letting $a \wedge b$ be the most recent common ancestor of $a$ and $b$ (under the arbitrary choice of root $\rho$) one has
\begin{align*}
q_\theta(a, b) &= \mathbb{E}_l[\sigma_a \sigma_b \mid N = \theta] - \mathbb{E}_l[\sigma_a \mid N = \theta]\, \mathbb{E}_l[\sigma_b \mid N = \theta] \\
&= \mathbb{E}_l[\sigma_a \sigma_b \mid N = \theta] \\
&= \sum_{x=1}^r \pi_x\, \mathbb{E}_l[\sigma_a \sigma_b \mid N = \theta, s_{a \wedge b} = x] \tag{3} \\
&= \sum_{x=1}^r \pi_x\, \mathbb{E}_l[\sigma_a \mid N = \theta, s_{a \wedge b} = x]\, \mathbb{E}_l[\sigma_b \mid N = \theta, s_{a \wedge b} = x] \\
&= \sum_{x=1}^r \pi_x \left( e^{-d_\theta(a \wedge b, a)} z_x \right) \left( e^{-d_\theta(a \wedge b, b)} z_x \right) \\
&= e^{-d_\theta(a, b)}
\end{align*}
and
\[ q(a, b) = \mathbb{E}_l[\sigma_a \sigma_b] - \mathbb{E}_l[\sigma_a]\, \mathbb{E}_l[\sigma_b] = \mathbb{E}_l[\sigma_a \sigma_b] = \sum_{\theta=1}^{\Theta} \nu_\theta\, e^{-d_\theta(a, b)}. \tag{4} \]
We use a statistic of the form
\[ U = \frac{1}{|\Upsilon|} \sum_{(a, b) \in \Upsilon} \sigma_a \sigma_b, \tag{5} \]
where $\Upsilon \subseteq [n]^2_{\ne}$. For $U$ to be effective in discriminating between $T_1$ and $T_2$, we require the following (informal) conditions:
(C1) The difference in conditional expectations
\[ \Delta = | \mathbb{E}_l[U \mid N = 1] - \mathbb{E}_l[U \mid N = 2] | \]
is large.
(C2) The statistic $U$ is concentrated around its mean under both $\mathcal{D}_l[T_1, Q]$ and $\mathcal{D}_l[T_2, Q]$.
(C3) The set $\Upsilon$ can be constructed from data generated by the mixture $(\mathcal{T}, \nu, Q)$.
A $U$ satisfying C1–C3 could be used to infer the hidden variables $N_1, \ldots, N_k$ and, thereby, to cluster the samples in their respective component.

Prior work. In [18], it was shown in a related context that taking $\Upsilon = [n]^2_{\ne}$ is not in general an appropriate choice, as it may lead to a large variance. Instead, the following lemma was used.

Claim (Disjoint close pairs [25]; see also [18]).
For any $T \in \mathcal{Y}^{(n)}_{f,g}$, there exists a subset $\Upsilon \subseteq [n]^2_{\ne}$ such that the following hold:
(1) $|\Upsilon| = \Omega(n)$;
(2) $\forall (a, b) \in \Upsilon$, $d_T(a, b) \le 3g$;
(3) $\forall (a_1, b_1) \ne (a_2, b_2) \in \Upsilon$, the paths $\mathrm{Path}_T(a_1, b_1)$ and $\mathrm{Path}_T(a_2, b_2)$ are edge-disjoint. We will say that such pairs are $T$-disjoint.

For special $Q$ matrices, it was shown in [25] and [18] that such a $\Upsilon$ for $T = T_1$, say, can be used to construct a clustering statistic [similar to (5)] concentrated under $\mathcal{D}_l[T_1, Q]$. In particular, the $T_1$-disjointness assumption above implies the independence of the variables $\sigma_{a_1} \sigma_{b_1}$ and $\sigma_{a_2} \sigma_{b_2}$ under the $Q$ matrices considered in [18, 25]. Moreover, Steel and Székely [25] proved the existence of a further subset that is also $T_2$-disjoint, but their construction requires the knowledge of $T_2$.

Here we show how to satisfy conditions C1–C3 under GTR models.

High-level construction. We give a sketch of our techniques. Formal statements and full proofs can be found in Sections 3, 4 and 5.

For $\alpha > 0$, let
\[ \Upsilon_{\alpha,\theta} = \{ (a, b) \in [n]^2_{\ne} : d_\theta(a, b) \le \alpha \} \]
and
\[ \Upsilon_\alpha = \bigcup_{\theta \in [\Theta]} \Upsilon_{\alpha,\theta}. \]
Because the variables $N_1, \ldots, N_k$ are hidden, we cannot infer $\Upsilon_{\alpha,\theta}$ directly from the samples, for instance, using (3). Instead:

(Step 1) Using (4) and the estimator
\[ \hat{q}(a, b) = \frac{1}{k} \sum_{i=1}^k \sigma^i_a \sigma^i_b, \]
we construct a set with size linear in $n$ satisfying
\[ \Upsilon_{4g} \subseteq \Upsilon' \subseteq \Upsilon_{C_c} \]
for an appropriate constant $C_c$; see Lemma 4.1.

Define $\Upsilon'_\theta = \Upsilon' \cap \Upsilon_{C_c,\theta}$. For general GTR rate matrices, $T_\theta$-disjointness of $(a_1, b_1), (a_2, b_2) \in \Upsilon'_\theta$ does not guarantee independence of $\sigma_{a_1} \sigma_{b_1}$ and $\sigma_{a_2} \sigma_{b_2}$ under $\mathcal{D}_l[T_\theta, Q]$. Instead, we choose pairs that are far enough from each other by picking a sufficiently sparse random subset of $\Upsilon'$; see Lemma 3.8.
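The estimator $\hat{q}(a, b)$ of Step 1 can be illustrated on simulated data. The sketch below is not the paper's algorithm: it draws a single pair of leaves from a two-component mixture under an assumed Jukes-Cantor rate matrix (scaled so that $\Lambda_2 = -1$), with made-up distances and frequencies, and compares $\hat{q}$ with the mixture formula $\sum_\theta \nu_\theta e^{-d_\theta(a,b)}$ from (4). It relies on reversibility, so the pair $(s_a, s_b)$ can be sampled directly as a stationary state followed by one transition over distance $d$.

```python
import math
import random

def sample_sigma_pair(d, rng):
    """Draw (sigma_a, sigma_b) for two leaves at evolutionary distance d.

    Assumes the hypothetical Jukes-Cantor matrix with Lambda_2 = -1, for
    which P[s_b = s_a] = 1/4 + (3/4) e^{-d}, and z = (sqrt2, -sqrt2, 0, 0).
    """
    z = (math.sqrt(2.0), -math.sqrt(2.0), 0.0, 0.0)
    s_a = rng.randrange(4)                      # stationary (uniform) state
    if rng.random() < 0.25 + 0.75 * math.exp(-d):
        s_b = s_a
    else:
        s_b = rng.choice([x for x in range(4) if x != s_a])
    return z[s_a], z[s_b]

def q_hat(samples):
    """Empirical estimator (1/k) * sum_i sigma_a^i * sigma_b^i."""
    return sum(sa * sb for sa, sb in samples) / len(samples)

rng = random.Random(0)
nu = [0.5, 0.5]     # mixture frequencies (assumed)
d = [0.2, 3.0]      # d_1(a,b), d_2(a,b): close in T_1, far in T_2 (assumed)
k = 20000           # number of samples (sequence length)

samples = []
for _ in range(k):
    theta = 0 if rng.random() < nu[0] else 1    # hidden component N_i
    samples.append(sample_sigma_pair(d[theta], rng))

q_est = q_hat(samples)
q_true = sum(nu[t] * math.exp(-d[t]) for t in range(2))
```

With these assumed values the pair is close in the first tree and far in the second, so `q_true` is dominated by the term $\nu_1 e^{-d_1(a,b)}$, which is the situation exploited in the stretched-pair argument later in the overview.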
We say that $(a_1, b_1), (a_2, b_2) \in \Upsilon'_\theta$ are $T_\theta$-far if the smallest evolutionary distance between $\{a_1, b_1\}$ and $\{a_2, b_2\}$ is at least $C_f \log\log n$ for a constant $C_f > 0$ to be determined.

(Step 2) We take a random subset $\Upsilon''$ of $\Upsilon'$ with $|\Upsilon''| = \Theta(\log n)$; see Lemma 4.2.

Denoting $\Upsilon''_\theta = \Upsilon'' \cap \Upsilon_{C_c,\theta}$, we show that all $(a_1, b_1) \ne (a_2, b_2) \in \Upsilon''_\theta$ are $T_\theta$-far.

Under a permutation-invariant $\lambda$, a pair $(a, b) \in \Upsilon_{\alpha,1}$ is unlikely to be in $\Upsilon_{\alpha,2}$. In particular, we show that, under $\lambda$, the intersection of $\Upsilon''_1$ and $\Upsilon''_2$ is empty. In fact, a pair $(a, b) \in \Upsilon_{\alpha,1}$ is likely to be such that $d_2(a, b)$ is large. We say that $(a, b) \in [n]^2_{\ne}$ is $T_\theta$-stretched if $d_\theta(a, b) \ge C_{st} \log\log n$ for a constant $C_{st} > 0$ to be determined. We show that all $(a, b) \in \Upsilon''_1$ are $T_2$-stretched; see Lemma 3.7.

To infer $\Upsilon''_\theta$, we consider the quantity
\[ \hat{r}(c_1, c_2) = \frac{1}{k} \sum_{i=1}^k \left[ \sigma^i_{a_1} \sigma^i_{b_1} \sigma^i_{a_2} \sigma^i_{b_2} - \hat{q}(a_1, b_1)\, \hat{q}(a_2, b_2) \right] \]
for $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in [n]^2_{\ne}$. We note that if $(a, b) \in \Upsilon''$ is $T_2$-stretched, then
\[ \mathbb{E}_l[\sigma_a \sigma_b \mid N = 2] \approx \mathbb{E}_l[\sigma_a \mid N = 2]\, \mathbb{E}_l[\sigma_b \mid N = 2] = 0 \]
and
\[ q(a, b) \approx \nu_1 q_1(a, b). \]
There are then two cases:

(I) If $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''_1$ (and similarly for $\Upsilon''_2$), they are $T_1$-far and each is $T_2$-stretched. Moreover we show that $(c_1, c_2)$ is $T_2$-far. Therefore,
\[ q(a_1, b_1) \approx \nu_1 q_1(a_1, b_1), \qquad q(a_2, b_2) \approx \nu_1 q_1(a_2, b_2), \]
and we show further that
\begin{align*}
\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2}]
&\approx \nu_1\, \mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \mid N = 1]\, \mathbb{E}_l[\sigma_{a_2} \sigma_{b_2} \mid N = 1] \\
&\quad + \nu_2\, \mathbb{E}_l[\sigma_{a_1} \mid N = 2]\, \mathbb{E}_l[\sigma_{b_1} \mid N = 2]\, \mathbb{E}_l[\sigma_{a_2} \mid N = 2]\, \mathbb{E}_l[\sigma_{b_2} \mid N = 2] \\
&\approx \nu_1 q_1(a_1, b_1)\, q_1(a_2, b_2).
\end{align*}
So
\[ \hat{r}(c_1, c_2) \approx \nu_1 (1 - \nu_1)\, q_1(a_1, b_1)\, q_1(a_2, b_2) > 0. \]
(II) On the other hand, if $c_1 = (a_1, b_1) \in \Upsilon''_1$ and $c_2 = (a_2, b_2) \in \Upsilon''_2$, then $c_1$ is $T_2$-stretched, and $c_2$ is $T_1$-stretched. Moreover we show that $(c_1, c_2)$ is both $T_1$-far and $T_2$-far. Therefore,
\[ q(a_1, b_1) \approx \nu_1 q_1(a_1, b_1), \qquad q(a_2, b_2) \approx \nu_2 q_2(a_2, b_2), \]
and we show that
\begin{align*}
\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2}]
&\approx \nu_1\, \mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \mid N = 1]\, \mathbb{E}_l[\sigma_{a_2} \mid N = 1]\, \mathbb{E}_l[\sigma_{b_2} \mid N = 1] \\
&\quad + \nu_2\, \mathbb{E}_l[\sigma_{a_1} \mid N = 2]\, \mathbb{E}_l[\sigma_{b_1} \mid N = 2]\, \mathbb{E}_l[\sigma_{a_2} \sigma_{b_2} \mid N = 2] \\
&\approx 0.
\end{align*}
So
\[ \hat{r}(c_1, c_2) \approx -\nu_1 q_1(a_1, b_1)\, \nu_2 q_2(a_2, b_2) < 0; \]
see Lemma 3.9.

The argument above leads to the following step.

(Step 3) For all pairs $c_1 = (a_1, b_1)$ and $c_2 = (a_2, b_2)$ in $\Upsilon''$, we compute $\hat{r}(c_1, c_2)$. Using cases I and II, we then infer the sets $\Upsilon''_1$ and $\Upsilon''_2$. We form the clustering statistics
\[ U^i_\theta = \frac{1}{|\Upsilon''_\theta|} \sum_{(a, b) \in \Upsilon''_\theta} \sigma^i_a \sigma^i_b \]
for $\theta = 1, 2$ and $i = 1, \ldots, k$; see Lemma 4.3.

By the arguments in cases I and II above, we get that for $(a, b) \in \Upsilon''_1$,
\[ \mathbb{E}_l[\sigma_a \sigma_b \mid N = 1] \approx \nu_1 q_1(a, b), \]
whereas
\[ \mathbb{E}_l[\sigma_a \sigma_b \mid N = 2] \approx \mathbb{E}_l[\sigma_a \mid N = 2]\, \mathbb{E}_l[\sigma_b \mid N = 2] \approx 0, \]
so that (dropping the superscript to refer to a single sample)
\[ \mathbb{E}_l[U_1 \mid N = 1] > C_\Delta, \]
whereas
\[ \mathbb{E}_l[U_1 \mid N = 2] < C_\Delta \]
for a constant $C_\Delta > 0$ to be determined later; see Lemma 3.10. Moreover, the properties of $\Upsilon''_\theta$ discussed in cases I and II allow us to prove further that $U_\theta$ is concentrated around its mean; see Lemma 3.11. This leads to the following step.

(Step 4) Divide the samples $i = 1, \ldots, k$ into two clusters $\mathcal{K}_1$ and $\mathcal{K}_2$, according to whether $U^i_1 > C_\Delta$ or $U^i_2 > C_\Delta$, respectively; see Lemma 5.1.

Once the samples are divided into pure components, we apply standard reconstruction techniques to infer each topology.
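Steps 3 and 4 above can be sketched in Python as follows. This is an illustrative reading of the sign rule from cases I and II, not the formal algorithm of Sections 4 and 5: the function names, the choice of a single reference pair, the toy $\hat{r}$ values, the toy $\sigma$ values and the threshold are all assumptions made for the example.

```python
def split_pairs_by_sign(pairs, r_hat):
    """Split the pairs of Upsilon'' into two groups using the sign of r_hat.

    pairs : list of pair identifiers c = (a, b)
    r_hat : function (c1, c2) -> estimated value of r-hat
    A pair joins the group of the reference pair when r_hat is positive
    (case I) and the other group when it is not (case II).
    """
    ref = pairs[0]
    same = [ref] + [c for c in pairs[1:] if r_hat(ref, c) > 0]
    other = [c for c in pairs[1:] if r_hat(ref, c) <= 0]
    return same, other

def cluster_samples(sigma, group1, group2, threshold):
    """Return (K1, K2): samples with U_1^i > threshold, resp. U_2^i > threshold.

    sigma : list over samples i of dicts mapping leaf -> sigma value
    """
    def U(sample, group):
        return sum(sample[a] * sample[b] for a, b in group) / len(group)

    K1 = [i for i, s in enumerate(sigma) if U(s, group1) > threshold]
    K2 = [i for i, s in enumerate(sigma) if U(s, group2) > threshold]
    return K1, K2

# Hypothetical r-hat values: positive within a component, negative across.
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
table = {
    frozenset({(1, 2), (3, 4)}): 0.12,
    frozenset({(1, 2), (5, 6)}): -0.10,
    frozenset({(1, 2), (7, 8)}): -0.09,
}
r_hat = lambda c1, c2: table[frozenset({c1, c2})]
group1, group2 = split_pairs_by_sign(pairs, r_hat)

# Two toy samples: the first looks like component 1, the second like component 2.
sigma = [
    {1: 1.0, 2: 1.0, 3: -1.0, 4: -1.0, 5: 1.0, 6: 0.0, 7: 0.0, 8: -1.0},
    {1: 1.0, 2: 0.0, 3: 0.0, 4: -1.0, 5: -1.0, 6: -1.0, 7: 1.0, 8: 1.0},
]
K1, K2 = cluster_samples(sigma, group1, group2, threshold=0.5)
```

Which group is called "1" is arbitrary, which matches the fact that components are only identified up to relabeling. Using one reference pair is a simplification; with noisy estimates one might prefer to check the sign pattern over all pairs.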
(Step 5) For $\theta = 1, 2$, reconstruct the topology $\mathcal{T}_l[T_\theta]$ from the samples in $\mathcal{K}_\theta$; see Lemma 5.3.

General $\Theta$. When $\Theta > 2$, we proceed as above and construct a clustering statistic for each component.

3. Main lemmas. In this section, we derive a number of preliminary results. These results are also described informally in Section 2.3.

Fix a GTR matrix $Q$ and constants $\Theta \ge 2$, $0 < f \le g < +\infty$ and $\underline{\nu} > 0$. Let $\lambda$ be a permutation-invariant probability measure on $\Theta$-M$[f, g, \underline{\nu}, n]$. Pick a $\Theta$-mixture model $(\mathcal{T}, \nu, Q)$ according to $\lambda$, and generate $k$ independent samples $\{s^i_X\}_{i=1}^k$ from $\mathcal{D}_l[(\mathcal{T}, \nu, Q)]$. We work with the mapping $\{\sigma^i_X\}_{i=1}^k$ defined in (1). Throughout we assume that the number of samples is $k = n^{C_k}$ for some $C_k > 0$ to be fixed later.

3.1. Useful lemmas. We will need the following standard concentration inequalities; see, for example, [19]:

Lemma 3.1 (Azuma–Hoeffding inequality). Suppose $Z = (Z_1, \ldots, Z_m)$ are independent random variables taking values in a set $S$, and $h : S^m \to \mathbb{R}$ is any $t$-Lipschitz function: $|h(z) - h(z')| \le t$ whenever $z, z' \in S^m$ differ at just one coordinate. Then, $\forall \zeta > 0$,
\[ \mathbb{P}[ |h(Z) - \mathbb{E}[h(Z)]| \ge \zeta ] \le 2 \exp\left( - \frac{\zeta^2}{2 t^2 m} \right). \]

Lemma 3.2 (Chernoff bounds). Let $Z_1, \ldots, Z_m$ be independent Poisson trials such that, for $1 \le i \le m$, $\mathbb{P}[Z_i = 1] = p_i$ where $0 < p_i < 1$. Then, for $Z = \sum_{i=1}^m Z_i$, $M = \mathbb{E}[Z] = \sum_{i=1}^m p_i$, $0 < \delta_- \le 1$, and $\delta_+ > 2e - 1$,
\[ \mathbb{P}[Z < (1 - \delta_-) M] < e^{-M \delta_-^2 / 2} \]
and
\[ \mathbb{P}[Z > (1 + \delta_+) M] < 2^{-(1 + \delta_+) M}. \]

3.2. Large-sample asymptotics. Denoting $\mathcal{K} = [k]$, let $\mathcal{K}_\theta \subseteq \mathcal{K}$ be those samples coming from component $\theta$, that is,
\[ \mathcal{K}_\theta = \{ i \in \mathcal{K} : N_i = \theta \}. \]

Lemma 3.3 (Size of $\mathcal{K}_\theta$). Under $\mathbb{P}_l$, for any $C_s > 1$, we have
\[ C_s^{-1} \le \frac{|\mathcal{K}_\theta|}{\nu_\theta k} \le C_s \]
for all $\theta \in [\Theta]$, except with probability $\exp(-\Omega(n^{C_k}))$.

Proof.
Recall that $\underline{\nu} \le \nu_\theta \le 1 - \underline{\nu}$. Using Lemma 3.1 with $m = k$ and
\[ \zeta = \nu_\theta k \max\{ 1 - C_s^{-1}, C_s - 1 \} = \nu_\theta k (C_s - 1) \]
gives the result.

Consider the estimators
\[ \hat{q}_\theta(a, b) = \frac{1}{|\mathcal{K}_\theta|} \sum_{i \in \mathcal{K}_\theta} \sigma^i_a \sigma^i_b \]
and
\[ \hat{q}(a, b) = \frac{1}{k} \sum_{i=1}^k \sigma^i_a \sigma^i_b. \]
Let
\[ q_\theta(a, b) = e^{-d_\theta(a, b)} \]
and
\[ q(a, b) = \sum_{\theta=1}^{\Theta} \nu_\theta\, q_\theta(a, b). \]

Lemma 3.4 (Accuracy of $\hat{q}$). Fix $0 < C_q < C_k / 2$. Under $\mathbb{P}_l$, we have
\[ | \hat{q}(a, b) - q(a, b) | \le n^{-C_q} \]
and
\[ | \hat{q}_\theta(a, b) - q_\theta(a, b) | \le n^{-C_q} \]
for all $\theta \in [\Theta]$ and all $(a, b) \in [n]^2_{\ne}$ except with probability $\exp(-\mathrm{poly}(n))$.

Proof. For each $(a, b) \in [n]^2_{\ne}$, $\hat{q}(a, b)$ is a sum of $k$ independent variables. By Lemma 3.1, taking $m = k$, $t = k^{-1} \max_i |z_i|^2$, $\zeta = n^{-C_q}$, we have
\[ | \hat{q}(a, b) - q(a, b) | \le n^{-C_q}, \]
except with probability $2 \exp(-\Omega(n^{C_k - 2 C_q}))$. Note that there are at most $n^2$ elements in $[n]^2_{\ne}$ so that the probability of failure is at most
\[ 2 n^2 \exp(-\Omega(n^{C_k - 2 C_q})) = \exp(-\Omega(n^{C_k - 2 C_q})). \]
Using Lemma 3.3, the same holds for each $\theta$. The overall probability of failure under $\mathbb{P}_l$ is $\exp(-\Omega(n^{C_k - 2 C_q}))$.

Following the same argument, a similar result holds for
\[ \hat{r}(c_1, c_2) = \frac{1}{k} \sum_{i=1}^k \left[ \sigma^i_{a_1} \sigma^i_{b_1} \sigma^i_{a_2} \sigma^i_{b_2} - \hat{q}(a_1, b_1)\, \hat{q}(a_2, b_2) \right] \]
for $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in [n]^2_{\ne}$. Let
\[ r(c_1, c_2) = \mathbb{E}_l[\hat{r}(c_1, c_2)]. \]

Lemma 3.5 (Accuracy of $\hat{r}$). Under $\mathbb{P}_l$, we have
\[ | \hat{r}(c_1, c_2) - r(c_1, c_2) | \le n^{-C_q} \]
for all $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in [n]^2_{\ne}$ except with probability $\exp(-\mathrm{poly}(n))$.

3.3. Combinatorial properties. For $\alpha > 0$, let
\[ \Upsilon_{\alpha,\theta} = \{ (a, b) \in [n]^2_{\ne} : d_\theta(a, b) \le \alpha \} \tag{6} \]
and
\[ \Upsilon_\alpha = \bigcup_{\theta \in [\Theta]} \Upsilon_{\alpha,\theta}. \tag{7} \]
The lower bound below follows from a (stronger) lemma in [25]; see also [18].

Lemma 3.6 (Size of $\Upsilon_{\alpha,\theta}$).
For all $\alpha > 0$ and $\theta \in [\Theta]$,
$$\tfrac{1}{4} n \le |\Upsilon_{\alpha,\theta}| \le 2^{\lfloor \alpha/f \rfloor} n.$$
In particular,
$$\tfrac{1}{4} n \le |\Upsilon_\alpha| \le \Theta\, 2^{\lfloor \alpha/f \rfloor} n.$$

Proof. For $a \in X$ and $\alpha \ge 4g$, let
$$\mathcal{B}_\alpha(a) = \{v \in V : d_\theta(\phi_\theta(a), v) \le \alpha\}.$$
Since $T_\theta$ is binary, there are at most $2^{\lfloor \alpha/f \rfloor}$ vertices within evolutionary distance $\alpha$, that is,
$$|\mathcal{B}_\alpha(a)| \le 2^{\lfloor \alpha/f \rfloor}.$$
Restricting to leaves gives the upper bound. Let
$$\Gamma_\alpha = \{a \in [n] : d_\theta(a, b) > \alpha,\ \forall b \in [n] - \{a\}\},$$
that is, $\Gamma_\alpha$ is the set of leaves with no other leaf at evolutionary distance $\alpha$ in $T_\theta$. We will bound the size of $\Gamma_\alpha$. Note that for all $a, b \in \Gamma_\alpha$ with $a \ne b$, we have
$$\mathcal{B}_{\alpha/2}(a) \cap \mathcal{B}_{\alpha/2}(b) = \emptyset$$
by the triangle inequality. Moreover, it holds that for all $a \in \Gamma_\alpha$,
$$|\mathcal{B}_{\alpha/2}(a)| \ge 2^{\lfloor \alpha/(2g) \rfloor},$$
since $T_\theta$ is binary, and there is no leaf other than $a$ in $\mathcal{B}_{\alpha/2}(a)$. Hence, we must have
$$|\Gamma_\alpha| \le \frac{2n - 2}{2^{\lfloor \alpha/(2g) \rfloor}} \le \frac{1}{2^{\lfloor \alpha/(2g) \rfloor - 1}}\, n$$
as there are $2n - 2$ nodes in $T_\theta$. Now, for all $a \notin \Gamma_\alpha$ assign an arbitrary leaf at evolutionary distance at most $\alpha$. Then
$$|\Upsilon_{\alpha,\theta}| \ge \frac{1}{2}(n - |\Gamma_\alpha|) \ge \frac{1}{2}\left(1 - \frac{1}{2^{\lfloor \alpha/(2g) \rfloor - 1}}\right) n,$$
where we divided by 2 to avoid double-counting. The result follows from the assumption $\alpha \ge 4g$.

Let $C_c > 4g$, $C_f > 0$, and $C_{\mathrm{st}} > C_f$ to be fixed later.

Definition 3.1 ($T_\theta$-quasicherry). We say that $(a, b) \in [n]^2_{\ne}$ is a $T_\theta$-quasicherry if $(a, b) \in \Upsilon_{C_c,\theta}$.

Definition 3.2 ($T_\theta$-stretched). We say that $(a, b) \in [n]^2_{\ne}$ is $T_\theta$-stretched if $d_\theta(a, b) \ge C_{\mathrm{st}} \log\log n$.

Definition 3.3 ($T_\theta$-far). We say that $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in [n]^2_{\ne}$ are $T_\theta$-far if
$$d_\theta(c_1, c_2) \equiv \min\{d_\theta(x_1, x_2) : x_1 \in \{a_1, b_1\},\ x_2 \in \{a_2, b_2\}\} \ge C_f \log\log n.$$

Let $\Upsilon'$ be any subset satisfying
$$\Upsilon_{4g} \subseteq \Upsilon' \subseteq \Upsilon_{C_c} \tag{8}$$
and let
$$\Upsilon'_\theta = \Upsilon' \cap \Upsilon_{C_c,\theta}. \tag{9}$$
Let $C_{p_{\mathrm{sp}}} > 0$ to be fixed later.
Keep each $(a, b) \in \Upsilon_{C_c}$ independently with probability
$$p_{\mathrm{sp}} = \frac{C_{p_{\mathrm{sp}}} \log n}{n}$$
to form the set $\Upsilon''_{C_c}$, and let
$$\Upsilon'' = \Upsilon' \cap \Upsilon''_{C_c}.$$
Let $0 < C^-_{\mathrm{sp}} < C^+_{\mathrm{sp}} < +\infty$ be constants (to be determined).

Definition 3.4 (Properly sparse). A subset $\Upsilon_{4g} \subseteq \Upsilon'' \subseteq \Upsilon_{C_c}$ with
$$\Upsilon''_\theta = \Upsilon'' \cap \Upsilon_{C_c,\theta}, \qquad \theta \in [\Theta],$$
is properly sparse if it satisfies the following properties. For all $\theta \in [\Theta]$:
(1) We have $C^-_{\mathrm{sp}} \log n \le |\Upsilon''_\theta| \le C^+_{\mathrm{sp}} \log n$.
(2) All $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''$ are $T_\theta$-far.
(3) All pairs in $\Upsilon''_\theta$ are $T_{\theta'}$-stretched for $\theta' \ne \theta$.

Let
$$\Upsilon''_{C_c,\theta} = \Upsilon''_{C_c} \cap \Upsilon_{C_c,\theta}, \qquad \theta \in [\Theta],$$
and
$$\Upsilon''_{4g,\theta} = \Upsilon_{4g} \cap \Upsilon''_{C_c,\theta}, \qquad \theta \in [\Theta].$$

Lemma 3.7 (Sparsification). There exist constants $0 < C^-_{\mathrm{sp}} < C^+_{\mathrm{sp}} < +\infty$ such that, under $\mathbb{P}_{A,\lambda}$, the set $\Upsilon''_{C_c}$ as above satisfies the following properties, except with probability $1/\mathrm{poly}(n)$: for all $\theta \in [\Theta]$:
(1) We have $C^-_{\mathrm{sp}} \log n \le |\Upsilon''_{4g,\theta}|$ and $|\Upsilon''_{C_c,\theta}| \le C^+_{\mathrm{sp}} \log n$.
(2) All $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''_{C_c}$ are $T_\theta$-far.
(3) All pairs in $\Upsilon''_{C_c,\theta}$ are $T_{\theta'}$-stretched for $\theta' \ne \theta$.
In particular, the set $\Upsilon''$ as above is properly sparse. Moreover, the claim holds for any $C^-_{\mathrm{sp}} > 0$ by taking $C_{p_{\mathrm{sp}}} > 0$ large enough.

Intuitively, part (2) follows from the sparsification step whereas part (3) is a consequence of the permutation-invariance of $\lambda$. We give a formal proof next.

Proof of Lemma 3.7. For part (1), we use Lemma 3.2. Take
$$\tfrac{1}{4} C_{p_{\mathrm{sp}}} \log n \le M_{4g} \equiv \frac{C_{p_{\mathrm{sp}}} \log n}{n} |\Upsilon_{4g,\theta}|$$
and
$$M_{C_c} \equiv \frac{C_{p_{\mathrm{sp}}} \log n}{n} |\Upsilon_{C_c,\theta}| \le 2^{\lfloor C_c/f \rfloor} C_{p_{\mathrm{sp}}} \log n.$$
With $\delta_- = 1/2$, $\delta_+ = 5$, we have
$$\mathbb{P}_A[|\Upsilon''_{4g,\theta}| < (1 - \delta_-) M_{4g}] < e^{-M_{4g} \delta_-^2/2} = \frac{1}{\mathrm{poly}(n)}$$
and
$$\mathbb{P}_A[|\Upsilon''_{C_c,\theta}| > (1 + \delta_+) M_{C_c}] < 2^{-(1 + \delta_+) M_{C_c}} = \frac{1}{\mathrm{poly}(n)}.$$
The first part follows from the choice
$$C^-_{\mathrm{sp}} = \frac{C_{p_{\mathrm{sp}}}}{8} \quad\text{and}\quad C^+_{\mathrm{sp}} = 6\, C_{p_{\mathrm{sp}}}\, 2^{\lfloor C_c/f \rfloor}.$$

For the second part, let $c_1 = (a_1, b_1)$ be a pair in $\Upsilon''_{C_c}$. Let $\mathcal{S}$ be the collection of pairs $c_2 = (a_2, b_2) \ne c_1$ in the original set $\Upsilon_{C_c}$ that are within evolutionary distance $C_f \log\log n$ of $c_1$ in $T_\theta$, that is,
$$d_\theta(c_1, c_2) \le C_f \log\log n.$$
Note that the number of leaves within evolutionary distance $C_f \log\log n$ from $a_1$ or $b_1$ is at most $2 \cdot 2^{\lfloor C_f \log\log n / f \rfloor}$. Moreover, each such leaf can be involved in at most $\Theta\, 2^{\lfloor C_c/f \rfloor}$ pairs, since any pair in $\Upsilon_{C_c}$ must be a $T_{\theta'}$-quasicherry for some $\theta' \in [\Theta]$ and the number of leaves at evolutionary distance $C_c$ from a vertex in a tree in $\mathbb{Y}_{f,g}$ is at most $2^{\lfloor C_c/f \rfloor}$. Hence
$$|\mathcal{S}| \le 2 \cdot 2^{\lfloor C_f \log\log n / f \rfloor} \cdot \Theta\, 2^{\lfloor C_c/f \rfloor} = O(\log n).$$
Therefore the probability that any $c_2 \in \mathcal{S}$ remains in $\Upsilon''_{C_c}$ is at most $O(\log^2 n / n)$. Assuming part (1) holds, summing over $\Upsilon''_{C_c}$, and applying Markov's inequality, we get
$$\mathbb{P}_A\left[\left|\{c_1 \ne c_2 \in \Upsilon''_{C_c} : c_1, c_2 \text{ are not } T_\theta\text{-far}\}\right| \ge 1\right] = O\left(\frac{\log^3 n}{n}\right) + \frac{1}{\mathrm{poly}(n)}.$$
This gives the second part.

For the third part, consider a $T_\theta$-quasicherry $(a, b)$. Thinking of $\lambda$ as assigning leaf labels in $T_{\theta'}$ uniformly at random, the probability that $b$ is within evolutionary distance $C_{\mathrm{st}} \log\log n$ of $a$ in $T_{\theta'}$ is at most
$$\mathbb{P}_\lambda[(a, b) \text{ is not } T_{\theta'}\text{-stretched}] \le \frac{2^{\lfloor C_{\mathrm{st}} \log\log n / f \rfloor}}{n} = O\left(\frac{\log n}{n}\right),$$
where the numerator in the second expression is an upper bound on the number of vertices at evolutionary distance $C_{\mathrm{st}} \log\log n$ of $a$ in $T_{\theta'}$. Summing over all pairs in $\Upsilon''_{C_c,\theta}$ and assuming the bound in part (1) holds, the expected number of pairs in $\Upsilon''_{C_c,\theta}$ that are not $T_{\theta'}$-stretched is $O(\log^2 n / n)$. By Markov's inequality,
$$\mathbb{P}_{A,\lambda}\left[\left|\{(a, b) \in \Upsilon''_{C_c,\theta} : (a, b) \text{ is not } T_{\theta'}\text{-stretched}\}\right| \ge 1\right] \le O\left(\frac{\log^2 n}{n}\right) + \frac{1}{\mathrm{poly}(n)}.$$
This gives the third part.

3.4. Mixing. We use a mixing argument similar to [15]. Let
$$Q_{\min} = \min_{x \ne y} Q_{xy},$$
which is positive by assumption. We think of $Q$ as acting as follows. From a state $x$, we have two types of transitions to $y \ne x$:
(i) We jump to state $y$ at rate $Q_{\min} > 0$.
(ii) We jump to state $y$ at rate $Q_{xy} - Q_{\min} \ge 0$.
Note that a transition of type (i) does not depend on the starting state. Hence if $P$ is a path from $u$ to $v$ in $T_\theta$, $N = \theta$, and a transition of type (i) occurs along $P$, then $\sigma_u$ is independent of $\sigma_v$. The probability, conditioned on $N = \theta$, that such a transition does not occur, is $e^{-d_\theta(u,v)(r-1)Q_{\min}}$.

Let $\Upsilon'' \subseteq [n]^2_{\ne}$ be a properly sparse set. We show next that pairs in $\Upsilon''$ are independent with high probability. We proceed by considering the paths joining them and arguing that transitions of type (i) are likely to occur on them by the combinatorial properties in Definition 3.4. Formally, fix $\theta \in [\Theta]$, and consider two pairs $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''$. By Definition 3.4, $c_1$ and $c_2$ are $T_\theta$-far. There are three cases without loss of generality:

(1) $c_1, c_2$ are $T_\theta$-quasicherries. In the subtree of $T_\theta$ connecting $\{a_1, b_1, a_2, b_2\}$, called a quartet, the paths $\mathrm{Path}_{T_\theta}(a_1, b_1)$ and $\mathrm{Path}_{T_\theta}(a_2, b_2)$ are disjoint. This is denoted by the quartet split $a_1 b_1 | a_2 b_2$. Let $P^\theta[c_1, c_2]$ be the internal path of the quartet. Note that by Definition 3.4 the length of $P^\theta[c_1, c_2]$ is at least $C_f \log\log n - 2C_c$. Denote by $P^\theta_{c_1}[c_1, c_2]$ the subpath of $P^\theta[c_1, c_2]$ within evolutionary distance $\frac{1}{3} C_f \log\log n$ of $c_1$.

(2) $c_1$ is a $T_\theta$-quasicherry, and $c_2$ is $T_\theta$-stretched. Consider the subtree of $T_\theta$ connecting $\{a_1, b_1, a_2\}$, called a triplet, and let $u$ be the central vertex of it.
Let $P^\theta[c_1, a_2]$ be the path connecting $u$ and $a_2$. Note that by Definition 3.4, the length of $P^\theta[c_1, a_2]$ is at least $C_f \log\log n - C_c$. Denote by $P^\theta_{c_1}[c_1, a_2]$ the subpath of $P^\theta[c_1, a_2]$ within evolutionary distance $\frac{1}{3} C_f \log\log n$ of $c_1$. Similarly, denote by $P^\theta_{a_2}[c_1, a_2]$ the subpath of $P^\theta[c_1, a_2]$ within evolutionary distance $\frac{1}{3} C_f \log\log n$ of $a_2$.

(3) $c_1, c_2$ are $T_\theta$-stretched. Let $P^\theta[a_1, a_2]$ be the path connecting $a_1$ and $a_2$. Note that by Definition 3.4 the length of $P^\theta[a_1, a_2]$ is at least $C_f \log\log n$. Denote by $P^\theta_{a_1}[a_1, a_2]$ the subpath of $P^\theta[a_1, a_2]$ within evolutionary distance $\frac{1}{3} C_f \log\log n$ of $a_1$. Similarly, let $P^\theta[a_1, b_1]$ be the path joining $a_1$ and $b_1$, and let $P^\theta_{a_1}[a_1, b_1]$ be the subpath of $P^\theta[a_1, b_1]$ within evolutionary distance $\frac{1}{3} C_{\mathrm{st}} \log\log n > \frac{1}{3} C_f \log\log n$ of $a_1$.

Condition on $N = \theta$. For each $c_1 = (a_1, b_1) \in \Upsilon''_\theta$, let $\mathcal{E}^\theta_{c_1}$ be the following event:

Each subpath $P^\theta_{c_1}[c_1, c_2]$, $c_2 \ne c_1 \in \Upsilon''_\theta$, and each subpath $P^\theta_{c_1}[c_1, a_2]$, $c_2 = (a_2, b_2) \in \Upsilon'' - \Upsilon''_\theta$, undergo a transition of type (i) during the generation of sample $\sigma_X$.

Similarly, for each $c_1 = (a_1, b_1) \in \Upsilon'' - \Upsilon''_\theta$, let $\mathcal{E}^\theta_{c_1} = \mathcal{E}^\theta_{a_1} \cap \mathcal{E}^\theta_{b_1}$ where $\mathcal{E}^\theta_{a_1}$ is the following event (and similarly for $\mathcal{E}^\theta_{b_1}$):

Each subpath $P^\theta_{a_1}[c_2, a_1]$, $c_2 \in \Upsilon''_\theta$, each subpath $P^\theta_{a_1}[a_1, a_2]$, $c_2 = (a_2, b_2) \in \Upsilon'' - \Upsilon''_\theta$ with $c_1 \ne c_2$, as well as subpath $P^\theta_{a_1}[a_1, b_1]$ undergo a transition of type (i) during the generation of sample $\sigma_X$.

Note that, under $\mathcal{E}^\theta_{c_1}$, the random variable $\sigma_{a_1} \sigma_{b_1}$ is independent of every other such random variable in $\Upsilon''$. Moreover, in the case $c_1 \in \Upsilon'' - \Upsilon''_\theta$, then further $\sigma_{a_1}$ is independent of $\sigma_{b_1}$.
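As a numerical illustration of the decoupling probability used in this section, the following Python sketch evaluates $e^{-d(r-1)Q_{\min}}$ on a subpath of length $\frac{1}{3} C_f \log\log n$, which equals $(\log n)^{-C_f (r-1) Q_{\min}/3}$. This is only a sketch under stated assumptions: the function names, the Jukes–Cantor-type matrix $Q$ and the value $C_f = 6$ are hypothetical choices made for illustration and do not come from the analysis above.

```python
import math


def q_min(Q):
    """Smallest off-diagonal entry of a square rate matrix Q (list of lists)."""
    r = len(Q)
    return min(Q[x][y] for x in range(r) for y in range(r) if x != y)


def prob_no_type_i(d, Q):
    """exp(-d (r-1) Q_min): probability that no type (i) transition
    occurs along a path of evolutionary length d."""
    r = len(Q)
    return math.exp(-d * (r - 1) * q_min(Q))


def subpath_failure(n, C_f, Q):
    """Same probability on a subpath of length (1/3) C_f log log n."""
    d = C_f * math.log(math.log(n)) / 3.0
    return prob_no_type_i(d, Q)


# Hypothetical example (an assumption, not from the paper): a
# Jukes-Cantor-type rate matrix on r = 4 states, so (r - 1) * Q_min = 1.
Q = [[-1.0 if x == y else 1.0 / 3.0 for y in range(4)] for x in range(4)]

if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(n, subpath_failure(n, C_f=6.0, Q=Q))
```

With these choices the failure probability on a single subpath is $(\log n)^{-2}$, which shows why $C_f$ must be taken large: the decay is only polylogarithmic in $n$.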
The next lemma shows that most of the events above occur with high probability, implying that a large fraction of the $\sigma_{a_1} \sigma_{b_1}$'s are mutually independent.

Lemma 3.8 (Pair independence). Let $\Upsilon'' \subseteq [n]^2_{\ne}$ be a properly sparse set. Conditioned on $N = \theta$, let
$$\mathcal{I} = \{c_1 \in \Upsilon'' : \mathcal{E}^\theta_{c_1} \text{ holds}\}.$$
For any $0 < \varepsilon_I < 1$ and $C_I > 0$, there exist $C_f$, $C_{\mathrm{st}} > C_f$ and $C^-_{\mathrm{sp}} > 0$ large enough so that the following holds except with probability $n^{-C_I}$ under $\mathbb{P}_l$:
$$|\mathcal{I}| \ge (1 - \varepsilon_I) |\Upsilon''|.$$

Proof. Condition on $N = \theta$. Note that the $\mathcal{E}^\theta_{c_1}$'s are mutually independent because the corresponding paths are disjoint by construction. By a union bound over $\Upsilon''$, for all $c_1 \in \Upsilon''$,
$$\mathbb{P}_l[(\mathcal{E}^\theta_{c_1})^c \mid N = \theta] \le 2 C^+_{\mathrm{sp}} \log n \cdot e^{-((1/3) C_f \log\log n - 2C_c)(r-1) Q_{\min}} = \frac{1}{\mathrm{poly}(\log n)} \tag{10}$$
for $C_f$ large enough. Applying Lemma 3.2 with $M = |\Upsilon''| \cdot \mathbb{P}_l[(\mathcal{E}^\theta_{c_1})^c \mid N = \theta]$ and $\delta_+ > 2e$ such that
$$(1 + \delta_+) M = \varepsilon_I |\Upsilon''| \ge \varepsilon_I C^-_{\mathrm{sp}} \log n,$$
we get
$$\mathbb{P}_l[|\Upsilon'' - \mathcal{I}| > \varepsilon_I |\Upsilon''|] \le 2^{-\varepsilon_I |\Upsilon''|} = \frac{1}{n^{C_I}}$$
by taking $C^-_{\mathrm{sp}}$ large enough in Definition 3.4.

We use the independence claims above to simplify expectation computations.

Lemma 3.9 (Expectation computations). Let $\Upsilon'' \subseteq [n]^2_{\ne}$ be a properly sparse set. The following hold. For all $\theta \ne \theta' \in [\Theta]$:
(1) $\forall (a, b) \in \Upsilon''_\theta$,
$$q_\theta(a, b) \ge e^{-C_c}.$$
(2) $\forall (a, b) \in \Upsilon'' - \Upsilon''_\theta$,
$$q_\theta(a, b) = \frac{1}{\mathrm{poly}(\log n)}.$$
(3) $\forall (a, b) \in \Upsilon''_\theta$,
$$q(a, b) = \nu_\theta q_\theta(a, b) + \frac{1}{\mathrm{poly}(\log n)}.$$
(4) $\forall c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''_\theta$,
$$r(c_1, c_2) = \nu_\theta (1 - \nu_\theta) q_\theta(a_1, b_1) q_\theta(a_2, b_2) + \frac{1}{\mathrm{poly}(\log n)} \ge \frac{1}{2} \nu (1 - \nu) e^{-2C_c} > 0.$$
(5) $\forall c_1 = (a_1, b_1) \in \Upsilon''_\theta$, $c_2 = (a_2, b_2) \in \Upsilon''_{\theta'}$,
$$r(c_1, c_2) = -\nu_\theta q_\theta(a_1, b_1)\, \nu_{\theta'} q_{\theta'}(a_2, b_2) + \frac{1}{\mathrm{poly}(\log n)} \le -\frac{1}{2} \nu e^{-2C_c} < 0.$$

Proof.
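Parts (1) and (2) follow from the fact that $q_\theta(a, b) = e^{-d_\theta(a,b)}$, $d_\theta(a, b) \le C_c$ for all $(a, b) \in \Upsilon''_\theta$ and $d_\theta(a, b) \ge C_{\mathrm{st}} \log\log n$ for all $(a, b) \in \Upsilon'' - \Upsilon''_\theta$ from Definition 3.4. Part (3) follows from parts (1) and (2).

For part (4), let $c_1 = (a_1, b_1) \ne c_2 = (a_2, b_2) \in \Upsilon''_\theta$. Note that
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2} \mid N = \theta, \mathcal{E}^\theta_{c_1}, \mathcal{E}^\theta_{c_2}] = \mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \mid N = \theta]\, \mathbb{E}_l[\sigma_{a_2} \sigma_{b_2} \mid N = \theta] = q_\theta(a_1, b_1) q_\theta(a_2, b_2)$$
and
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2} \mid N = \theta', \mathcal{E}^{\theta'}_{c_1}, \mathcal{E}^{\theta'}_{c_2}] = \mathbb{E}_l[\sigma_{a_1} \mid N = \theta']\, \mathbb{E}_l[\sigma_{b_1} \mid N = \theta'] \times \mathbb{E}_l[\sigma_{a_2} \mid N = \theta']\, \mathbb{E}_l[\sigma_{b_2} \mid N = \theta'] = 0$$
by (2), so that
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2}] = \nu_\theta q_\theta(a_1, b_1) q_\theta(a_2, b_2) + \frac{1}{\mathrm{poly}(\log n)}$$
from (10). Then part (4) follows from Lemma 3.4 and part (3).

For part (5), let $c_1 = (a_1, b_1) \in \Upsilon''_\theta$, $c_2 = (a_2, b_2) \in \Upsilon''_{\theta'}$. Let $\theta'' \ne \theta, \theta'$. Note that
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2} \mid N = \theta, \mathcal{E}^\theta_{c_1}, \mathcal{E}^\theta_{c_2}] = \mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \mid N = \theta] \times \mathbb{E}_l[\sigma_{a_2} \mid N = \theta]\, \mathbb{E}_l[\sigma_{b_2} \mid N = \theta] = 0$$
and
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2} \mid N = \theta', \mathcal{E}^{\theta'}_{c_1}, \mathcal{E}^{\theta'}_{c_2}] = \mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \mid N = \theta'] \times \mathbb{E}_l[\sigma_{a_2} \mid N = \theta']\, \mathbb{E}_l[\sigma_{b_2} \mid N = \theta'] = 0.$$
Moreover, since $c_1, c_2 \notin \Upsilon''_{\theta''}$,
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2} \mid N = \theta'', \mathcal{E}^{\theta''}_{c_1}, \mathcal{E}^{\theta''}_{c_2}] = \mathbb{E}_l[\sigma_{a_1} \mid N = \theta'']\, \mathbb{E}_l[\sigma_{b_1} \mid N = \theta''] \times \mathbb{E}_l[\sigma_{a_2} \mid N = \theta'']\, \mathbb{E}_l[\sigma_{b_2} \mid N = \theta''] = 0.$$
Hence
$$\mathbb{E}_l[\sigma_{a_1} \sigma_{b_1} \sigma_{a_2} \sigma_{b_2}] = 0 + \frac{1}{\mathrm{poly}(\log n)}$$
from (10). Then part (5) follows from Lemma 3.4 and part (3).

3.5. Large-tree concentration. Let $\Upsilon'' \subseteq [n]^2_{\ne}$ be a properly sparse set. Consider the clustering statistic
$$\mathcal{U}_\theta = \frac{1}{|\Upsilon''_\theta|} \sum_{(a,b) \in \Upsilon''_\theta} \sigma_a \sigma_b.$$
We show that $\mathcal{U}_\theta$ is concentrated and separates the $\theta$-component from all other components.

Lemma 3.10 (Separation).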
There exists $C_\Delta > 0$ such that for $\theta' \ne \theta$
$$\mathbb{E}_l[\mathcal{U}_\theta \mid N = \theta] > C_\Delta$$
and
$$\mathbb{E}_l[\mathcal{U}_\theta \mid N = \theta'] < C_\Delta.$$

Proof. By Definition 3.4, all $(a, b) \in \Upsilon''_\theta$ are $T_{\theta'}$-stretched. Hence
$$\mathbb{E}_l[\mathcal{U}_\theta \mid N = \theta] \ge e^{-C_c}$$
and
$$\mathbb{E}_l[\mathcal{U}_\theta \mid N = \theta'] = \frac{1}{\mathrm{poly}(\log n)}$$
by Lemma 3.9. Taking $C_\Delta = \frac{1}{2} e^{-C_c}$ gives the result.

Lemma 3.11 (Concentration of $\mathcal{U}_\theta$). For all $\varepsilon_U > 0$ and $C_U > 0$, there are $C_f > 0$, $C_{\mathrm{st}} > C_f$ and $C^-_{\mathrm{sp}} > 0$ large enough such that for all $\theta, \theta'$ (possibly equal)
$$\mathbb{P}_l\left[\,|\mathcal{U}_{\theta'} - \mathbb{E}_l[\mathcal{U}_{\theta'} \mid N = \theta]| \ge \varepsilon_U \,\middle|\, N = \theta\right] \le \frac{1}{n^{C_U}}.$$

Proof. Let $\mathcal{I}$ be as in Lemma 3.8, and let $\mathcal{U}^{\mathcal{I}}_\theta$ be the same as $\mathcal{U}_\theta$ with the sum restricted to $\mathcal{I}$. From Lemmas 3.7 and 3.8, conditioned on $\mathcal{I}$, $\mathcal{U}^{\mathcal{I}}_\theta$ is a normalized sum of $\Theta(\log n)$ independent bounded variables. Concentration of $\mathcal{U}^{\mathcal{I}}_\theta$ therefore follows from Lemma 3.1 using $m = \Omega(\log n)$, $t = O(1/\log n)$ and $\zeta = \frac{1}{2} \varepsilon_U$. Taking
$$\varepsilon_I = \frac{\varepsilon_U}{2 \max_i z_i^2}$$
and $C_I > C_U$ in Lemma 3.8 as well as $C^-_{\mathrm{sp}} > 0$ large enough gives the result.

4. Constructing the clustering statistic from data. In this section, we provide details on the plan laid out in Section 2.3. Fix a GTR matrix $Q$ and constants $\Theta \ge 2$, $0 < f \le g < +\infty$ and $\nu > 0$. Let $\lambda$ be a permutation-invariant probability measure on $\Theta$-M$[f, g, \nu, n]$. In this section, we work directly with samples $\{\sigma^i_X\}_{i=1}^k$ generated from an unknown $\Theta$-mixture model $(T, \nu, Q)$ picked according to $\lambda$. Our goal is to construct the clustering statistics $\{\mathcal{U}_\theta\}_{\theta=1}^{\Theta}$ from $\{\sigma^i_X\}_{i=1}^k$. These statistics will be used in the next section to reconstruct the topologies of the model $(T, \nu, Q)$.

4.1. Clustering algorithm. We proceed in three steps. Let
$$C_c = -\ln\left(\frac{1}{3\Theta(1 - \nu)}\, \nu e^{-4g}\right)$$
and
$$\omega = \frac{2}{3} \nu e^{-4g}.$$
The algorithm is the following:
(1) (Finding quasicherries) For all pairs of leaves $a, b \in [n]$, compute $\hat{q}(a, b)$, and set
$$\hat{\Upsilon}' = \{(a, b) \in [n]^2_{\ne} : \hat{q}(a, b) \ge \omega\}.$$
(2) (Sparsification) Construct $\hat{\Upsilon}''$ by keeping each $(a, b) \in \hat{\Upsilon}'$ independently with probability
$$p_{\mathrm{sp}} = \frac{C_{p_{\mathrm{sp}}} \log n}{n}.$$
(3) (Inferring clusters) For all $c_1 \ne c_2 \in \hat{\Upsilon}'$, compute $\hat{r}(c_1, c_2)$, and set
$$c_1 \sim c_2 \quad\text{if}\quad \hat{r}(c_1, c_2) > 0.$$
Let $\hat{\Upsilon}''_\theta$, $\theta = 1, \ldots, \hat{\Theta}$, be the equivalence classes of the transitive closure of $\sim$.
(4) (Final sets) Return $\hat{\Upsilon}''_\theta$, $\theta \in [\Theta]$.

4.2. Analysis of the clustering algorithm. We show that each step of the previous algorithm succeeds with high probability.

Lemma 4.1 (Finding quasicherries). The set $\hat{\Upsilon}'$ satisfies the following, except with probability at most $\exp(-\mathrm{poly}(n))$ under $\mathbb{P}_l$:
$$\Upsilon_{4g} \subseteq \hat{\Upsilon}' \subseteq \Upsilon_{C_c}.$$

Proof. We prove both inclusions. For all $\theta \in [\Theta]$ and $(a, b) \in \Upsilon_{4g,\theta}$,
$$q_\theta(a, b) \ge e^{-4g}$$
and
$$q(a, b) \ge \nu e^{-4g} > \tfrac{2}{3} \nu e^{-4g} = \omega.$$
By Lemma 3.4,
$$\hat{q}(a, b) \ge \omega,$$
except with probability $\exp(-\mathrm{poly}(n))$. Similarly for any $(a, b) \in \hat{\Upsilon}'$, by Lemma 3.4, if
$$\hat{q}(a, b) \ge \omega = \tfrac{2}{3} \nu e^{-4g},$$
then
$$q(a, b) \ge \tfrac{1}{3} \nu e^{-4g},$$
so that there is $\theta \in [\Theta]$ with
$$\nu_\theta q_\theta(a, b) \ge \frac{1}{3\Theta} \nu e^{-4g}.$$
That is,
$$q_\theta(a, b) \ge \frac{1}{3\Theta(1 - \nu)} \nu e^{-4g}$$
and
$$d_\theta(a, b) \le -\ln\left(\frac{1}{3\Theta(1 - \nu)} \nu e^{-4g}\right) = C_c.$$
Hence $(a, b) \in \Upsilon_{C_c,\theta}$.

Lemma 4.2 (Sparsification). Assuming that the conclusions of Lemma 4.1 hold, $\hat{\Upsilon}''$ is properly sparse, except with probability $1/\mathrm{poly}(n)$.

Proof. This follows from Lemma 4.1 and the choice of $p_{\mathrm{sp}}$.

Lemma 4.3 (Inferring clusters). Assuming that the conclusions of Lemmas 4.1 and 4.2 hold, we have $\hat{\Theta} = \Theta$, and there is a bijective mapping $h$ of $[\Theta]$ such that
$$\hat{\Upsilon}''_{h(\theta)} = \Upsilon''_\theta$$
with the choice $\Upsilon' = \hat{\Upsilon}'$ in Section 3.3, except with probability $\exp(-\mathrm{poly}(n))$.

Proof. It follows from Lemmas 3.5 and 3.9 that $\sim$ is an equivalence relation with equivalence classes $\Upsilon''_\theta$, $\theta = 1, \ldots, \Theta$, except with probability $\exp(-\mathrm{poly}(n))$.

5. Tree reconstruction. We now show how to use the clustering statistics to build the topologies. The algorithm is composed of two steps: we first bin the sites according to the value of the clustering statistics; we then use the sites in one of those bins and apply a standard distance-based reconstruction method. We show that the content of the bins is made of sites from the same component, thus reducing the situation to the unmixed case.

Let
$$C_\Delta = \tfrac{1}{2} e^{-C_c}, \qquad \varepsilon_U = \tfrac{1}{3} e^{-C_c} \qquad\text{and}\qquad \varepsilon_I = \frac{\varepsilon_U}{2 \max_i z_i^2}.$$
Moreover take $C_f$, $C_{\mathrm{st}}$, $C_{p_{\mathrm{sp}}}$ and $C^-_{\mathrm{sp}}$ so that the lemmas in Section 3 hold. To simplify notation, we rename the components so that $h$ is the identity.

5.1. Site binning. Let $\hat{\Upsilon}''_\theta$, $\theta \in [\Theta]$, be the sets returned by the algorithm in Section 4. Assume that the conclusions of Lemmas 4.1, 4.2 and 4.3 hold. We bin the sites with the following procedure:
(1) (Clustering statistics) For all $i = 1, \ldots, k$ and all $\theta = 1, \ldots, \Theta$, compute
$$\hat{\mathcal{U}}^i_\theta = \frac{1}{|\hat{\Upsilon}''_\theta|} \sum_{(a,b) \in \hat{\Upsilon}''_\theta} \sigma^i_a \sigma^i_b.$$
(2) (Binning sites) For all $\theta = 1, \ldots, \Theta$, set
$$\hat{\mathcal{K}}_\theta = \{i \in [k] : \hat{\mathcal{U}}^i_\theta > C_\Delta\}.$$
We show that the binning is successful with high probability.

Lemma 5.1 (Binning the sites). Assume that the conclusions of Lemmas 4.1, 4.2 and 4.3 hold. For any $C_k$, there exists $C_U$ large enough so that, for all $\theta \in [\Theta]$,
$$\hat{\mathcal{K}}_\theta = \mathcal{K}_\theta,$$
except with probability $1/\mathrm{poly}(n)$.

Proof. This follows from Lemmas 3.10 and 3.11 by a union bound over all samples.

5.2. Estimating a distorted metric.

Estimating evolutionary distances. We estimate evolutionary distances on each component. For all $\theta \in [\Theta]$, let $\hat{\mathcal{K}}_\theta$ be as above and assume the conclusions of Lemma 5.1 hold.
(1) (Estimating distances) For all $\theta = 1, \ldots, \Theta$ and $a \ne b \in [n]$, compute
$$\hat{q}_\theta(a, b) = \frac{1}{|\hat{\mathcal{K}}_\theta|} \sum_{i \in \hat{\mathcal{K}}_\theta} \sigma^i_a \sigma^i_b.$$

Lemma 5.2 (Estimating distances). Assume the conclusions of Lemma 5.1 hold. The following hold except with probability $\exp(-\mathrm{poly}(n))$: for all $\theta \in [\Theta]$ and all $a \ne b \in [n]$,
$$|\hat{q}_\theta(a, b) - q_\theta(a, b)| \le \frac{1}{n^{C_q}}.$$

Proof. The result follows from Lemma 3.4.

Tree construction. To reconstruct the tree, we use a distance-based method of [8]. We require the following definition.

Definition 5.1 (Distorted metric [12, 16]). Let $T = (V, E; \phi; w)$ be a phylogeny with corresponding tree metric $d$, and let $\tau, \Psi > 0$. We say that $\hat{d} : X \times X \to (0, +\infty]$ is a $(\tau, \Psi)$-distorted metric for $T$ or a $(\tau, \Psi)$-distortion of $d$ if:
(1) (Symmetry) For all $a, b \in X$, $\hat{d}$ is symmetric, that is,
$$\hat{d}(a, b) = \hat{d}(b, a);$$
(2) (Distortion) $\hat{d}$ is accurate on "short" distances; that is, for all $a, b \in X$, if either $d(a, b) < \Psi + \tau$ or $\hat{d}(a, b) < \Psi + \tau$, then
$$|d(a, b) - \hat{d}(a, b)| < \tau.$$

An immediate consequence of [8], Theorem 1, is the following.

Claim (Reconstruction from distorted metrics [8]). Let $T = (V, E; \phi; w)$ be a phylogeny in $\mathbb{Y}_{f,g}$. Then the topology of $T$ can be recovered in polynomial time from a $(\tau, \Psi)$-distortion $\hat{d}$ of $d$ as long as
$$\tau \le \frac{f}{5} \quad\text{and}\quad \Psi \ge 5 g \log n.$$

Remark 5.1. The constants above are not optimal but will suffice for our purposes. See [8] for the details of the reconstruction algorithm.

We now show how to obtain a $(f/5, 5g \log n)$-distortion with high probability for each component.

Lemma 5.3 (Distortion estimation). There exist $C_q, C_k > 0$ so that, given that the conclusions of Lemma 5.2 hold, for all $\theta \in [\Theta]$,
$$\hat{d}_\theta(a, b) = -\ln(\hat{q}_\theta(a, b)_+), \qquad (a, b) \in X \times X,$$
is a $(f/5, 5g \log n)$-distortion of $d_\theta$.

Proof. Fix $\theta \in [\Theta]$.
Define
$$\mathcal{L}^-_2 = \{(a, b) \in X \times X : d_\theta(a, b) \le 15 g \log n\}$$
and
$$\mathcal{L}^+_2 = \{(a, b) \in X \times X : d_\theta(a, b) > 12 g \log n\}.$$
Let $(a, b) \in \mathcal{L}^-_2$. Note that
$$e^{-d_\theta(a,b)} \ge \exp(-15 g \log n) \equiv \frac{1}{n^{C'_q}},$$
where the last equality is a definition. Then, taking $C_q$ (and hence $C_k$) large enough, from Lemma 5.2, we have
$$|\hat{d}_\theta(a, b) - d_\theta(a, b)| \le \frac{f}{5}.$$
Similarly, let $(a, b) \in \mathcal{L}^+_2$. Note that
$$e^{-d_\theta(a,b)} < \exp(-12 g \log n) \equiv \frac{1}{n^{C''_q}},$$
where the last equality is a definition. Then, taking $C_q$ large enough, from Lemma 5.2 we have
$$\hat{d}_\theta(a, b) \ge 5 g \log n + \frac{f}{5}.$$

6. Proof of main theorems. We are now ready to prove the main theorems.

Proof of Theorem 3. Let $C_1, C_2 > 0$. Let $A_n$ be the subset of those $\Theta$-mixture models $(T, \nu, Q)$ in $\Theta$-M$[f, g, \nu, n]$ for which part (3) of Lemma 3.7 holds with probability at least $1 - n^{-C_1}$ under the random choices of the algorithm. By the proof of Lemma 3.7, for small enough $C_1, C_2 > 0$, we have $\lambda_n[A_n^c] \le n^{-C_2}$. On $A_n$, the lemmas in Sections 3, 4 and 5 hold with probability $1 - 1/\mathrm{poly}(n)$. Then the topologies are correctly reconstructed by the claim in Section 5.2.

Proof of Theorem 1. Let
$$(T, \nu, Q) \not\sim (T', \nu', Q) \in \bigcup_{n \ge 1} A_n.$$
Then, by Theorem 3, the algorithm correctly reconstructs the topologies in $(T, \nu, Q)$ with probability $1 - 1/\mathrm{poly}(n)$ on sequences of length $k = \mathrm{poly}(n)$. Repeating the reconstruction on independent sequences and taking a majority vote, we get almost sure convergence to the correct topologies. The same holds for $(T', \nu', Q)$. Hence,
$$\mathcal{D}_l[(T, \nu, Q)] \ne \mathcal{D}_l[(T', \nu', Q)].$$

Proof of Theorem 2. Let
$$(T, \nu, Q) \in \bigcup_{n \ge 1} A_n$$
with $\Theta = 2$ and $\nu = (1/2, 1/2)$.
Then, from the proof of Lemma 5.1, there exists a clustering statistic such that samples from $T_1$ and $T_2$ are correctly distinguished with probability $1 - 1/\mathrm{poly}(n)$. Recall that
$$\|\mathcal{D} - \mathcal{D}'\|_{\mathrm{TV}} = \sup_{B \in \mathcal{F}} |\mathcal{D}(B) - \mathcal{D}'(B)|.$$
Taking $B$ to be the event that a site is recognized as belonging to component 1 by the clustering statistic above, we get
$$\|\mathcal{D}_l[T_1, Q] - \mathcal{D}_l[T_2, Q]\|_{\mathrm{TV}} = 1 - o_n(1).$$

7. Concluding remarks. Our techniques also admit the following extensions:
• When $Q$ is unknown, one can still apply our technique by using the following idea. Note that all we need is an eigenvector of $Q$ with negative eigenvalue. Choose a pair $(a, b)$ of close leaves using, for instance, the classical log-det distance [22]. Under a permutation-invariant measure, $(a, b)$ is stretched in all but one component, with high probability. One can then compute an eigenvector decomposition of the transition matrix between $a$ and $b$. We leave out the details.
• The minimum frequency assumption is not necessary as long as one has an upper bound on the number of components and one requires only that frequent enough components be detected and reconstructed. We leave out the details.

REFERENCES
[1] Allman, E. S., Ané, C. and Rhodes, J. A. (2008). Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. in Appl. Probab. 40 229–249. MR2411822
[2] Allman, E. S., Petrovic, S., Rhodes, J. A. and Sullivant, S. (2011). Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans. Comput. Biology Bioinform. 8 710–722.
[3] Allman, E. S. and Rhodes, J. A. (2006). The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol. 13 1101–1113 (electronic). MR2255411
[4] Billera, L. J., Holmes, S. P. and Vogtmann, K. (2001).
Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27 733–767. MR1867931
[5] Chai, J. and Housworth, E. A. (2011). On Rogers' proof of identifiability for the GTR + Gamma + I model. Available at http://sysbio.oxfordjournals.org/content/early/2011/03/27/sysbio.syr023.short.
[6] Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Math. Biosci. 137 51–73. MR1410044
[7] Chor, B. and Tuller, T. (2006). Finding a maximum likelihood tree is hard. J. ACM 53 722–744 (electronic). MR2263067
[8] Daskalakis, C., Mossel, E. and Roch, S. (2009). Phylogenies without branch bounds: Contracting the short, pruning the deep. In RECOMB (S. Batzoglou, ed.). Lecture Notes in Computer Science 5541 451–465. Springer, New York.
[9] Evans, S. N. and Warnow, T. (2004). Unidentifiable divergence times in rates-across-sites models. IEEE/ACM Trans. Comput. Biology Bioinform. 1 130–134.
[10] Felsenstein, J. (2004). Inferring Phylogenies. Sinauer, Sunderland, MA.
[11] Huelsenbeck, J. P. and Rannala, B. (1997). Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science 276 227–232.
[12] King, V., Zhang, L. and Zhou, Y. (2003). On the complexity of distance-based evolutionary tree reconstruction. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, MD, 2003) 444–453. ACM, New York. MR1974948
[13] Matsen, F. A., Mossel, E. and Steel, M. (2008). Mixed-up trees: The structure of phylogenetic mixtures. Bull. Math. Biol. 70 1115–1139. MR2391182
[14] Matsen, F. A. and Steel, M. (2007). Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol. 56 767–775.
[15] Mossel, E. (2003). On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10 669–678.
[16] Mossel, E. (2007).
Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Biology Bioinform. 4 108–116.
[17] Mossel, E. and Peres, Y. (2003). Information flow on trees. Ann. Appl. Probab. 13 817–844. MR1994038
[18] Mossel, E. and Roch, S. (2011). Identifiability and inference of nonparametric rates-across-sites models on large-scale phylogenies. Preprint.
[19] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms. Cambridge Univ. Press, Cambridge. MR1344451
[20] Rhodes, J. and Sullivant, S. (2010). Identifiability of large phylogenetic mixture models. Preprint.
[21] Roch, S. (2006). A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biology Bioinform. 3 92–94.
[22] Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications 24. Oxford Univ. Press, Oxford. MR2060009
[23] Steel, M. (2009). A basic limitation on inferring phylogenies by pairwise sequence comparisons. J. Theoret. Biol. 256 467–472.
[24] Steel, M., Székely, L. A. and Hendy, M. D. (1994). Reconstructing trees when sequence sites evolve at variable rates. J. Comput. Biol. 1 153–163.
[25] Steel, M. A. and Székely, L. A. (2006). On the variational distance of two trees. Ann. Appl. Probab. 16 1563–1575. MR2260073
[26] Štefankovič, D. and Vigoda, E. (2007). Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J. Comput. Biol. 14 156–189 (electronic). MR2299868
[27] Štefankovič, D. and Vigoda, E. (2007). Pitfalls of heterogeneous processes for phylogenetic reconstruction. Syst. Biol. 56 113–124.
[28] Wu, J. and Susko, E. (2010). Rate-variation need not defeat phylogenetic inference through pairwise sequence comparisons. J. Theoret. Biol. 263 587–589.
Departments of Statistics and Computer Science
University of California
Berkeley, California 94720
USA
E-mail: mossel@stat.berkeley.edu

Department of Mathematics and Bioinformatics Program
University of California
Los Angeles, California 90095
USA
E-mail: roch@math.ucla.edu