Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although non-identifiability was proven for a semi-parametric model and an incorrect proof of identifiability was published for a general parametric model (GTR+Gamma+I). Here we prove that one of the most widely used models (GTR+Gamma) is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
Deep Dive into Identifiability of a Markovian model of molecular evolution with Gamma-distributed rates.
A central goal of molecular phylogenetics is to infer evolutionary trees from DNA or protein sequences. Such sequence data come from extant species at the tips of the tree (the tree of life), while the topology of the tree relating these species is unknown. Inferring this tree helps us understand the evolutionary relationships between sequences. Phylogenetic data analysis is often performed using Markovian models of evolution: mutations occur along the branches of the tree under a finite-state Markov process. There is ample evidence that some places in the genome undergo mutations at a high rate, while other loci evolve very slowly, perhaps due to some functional constraint. Such rate variation occurs at all spatial scales, across genes as well as across sites within genes. In performing inference, this heterogeneity is accounted for by the use of flexible mixture models where each site is allowed its own rate according to a rate distribution µ. In the context of molecular phylogenetics, the use of a parametric family for µ is generally considered both advantageous and sufficiently flexible.
The question of identifiability for such a rate-variation model is a fundamental one, as standard proofs of consistency of statistical inference methods begin by establishing identifiability. Without identifiability, inference of some or all model parameters may be unjustified. However, since phylogenetic data are gathered only from the tips of the tree, determining when the tree topology and other parameters of a phylogenetic model are identifiable poses substantial mathematical challenges. Indeed, it has been shown that the tree and model parameters are not identifiable if the distribution of rates µ is too general, even when the Markovian mutation model is quite simple [13].
The most commonly used phylogenetic model is a general time-reversible (GTR) Markovian mutation model along with a Gamma distribution family (Γ) for µ. For more flexibility, a class of invariable sites (I) can be added by allowing µ to be the mixture of a Gamma distribution with an atom at 0 [4].
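To make the model concrete, the following is a minimal sketch of how the joint distribution of states at two taxa could be computed under GTR+Γ. It is an illustration under stated assumptions, not the construction used in the paper: the Gamma distribution is assumed to have mean 1 (shape α, rate α), the rate matrix is assumed to be normalized to one expected substitution per unit time, and the function names `gtr_rate_matrix` and `pairwise_joint_gtr_gamma` are hypothetical. The sketch relies on the Gamma moment generating function, E[exp(λrt)] = (1 − λt/α)^(−α) for λ ≤ 0, applied to each eigenvalue λ of the rate matrix.

```python
import numpy as np

def gtr_rate_matrix(pi, exch):
    """Build a GTR rate matrix with q_ij = s_ij * pi_j for i != j.

    pi   : stationary state distribution (length kappa)
    exch : symmetric exchangeability matrix s_ij (diagonal is ignored)
    Assumption: Q is rescaled to one expected substitution per unit time.
    """
    pi = np.asarray(pi, dtype=float)
    Q = np.asarray(exch, dtype=float) * pi[None, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))      # rows sum to zero
    rate = -np.dot(pi, np.diag(Q))           # expected substitution rate
    return Q / rate

def pairwise_joint_gtr_gamma(pi, Q, t, alpha):
    """Joint distribution of states at two taxa separated by path length t.

    Assumption: site rates r ~ Gamma(shape=alpha, rate=alpha), so E[r] = 1.
    Returns diag(pi) @ E[exp(Q r t)].
    """
    pi = np.asarray(pi, dtype=float)
    sqrt_pi = np.sqrt(pi)
    # Time reversibility makes S = D^{1/2} Q D^{-1/2} symmetric.
    S = (sqrt_pi[:, None] * Q) / sqrt_pi[None, :]
    S = 0.5 * (S + S.T)                      # remove round-off asymmetry
    lam, U = np.linalg.eigh(S)
    # Gamma moment generating function, eigenvalue by eigenvalue.
    mgf = (1.0 - lam * t / alpha) ** (-alpha)
    M = (U * mgf) @ U.T                      # E[exp(S r t)]
    P = (M / sqrt_pi[:, None]) * sqrt_pi[None, :]   # E[exp(Q r t)]
    return pi[:, None] * P

# Usage with illustrative (made-up) parameter values.
pi = np.array([0.1, 0.2, 0.3, 0.4])
exch = np.array([[0, 1, 2, 1],
                 [1, 0, 1, 3],
                 [2, 1, 0, 1],
                 [1, 3, 1, 0]], dtype=float)
Q = gtr_rate_matrix(pi, exch)
J = pairwise_joint_gtr_gamma(pi, Q, t=0.3, alpha=0.5)
print(J)
```

The resulting matrix is symmetric with both marginals equal to the stationary distribution, as expected for a time-reversible model. An invariable-sites class (I) would, under the same assumptions, mix this matrix with diag(π) using the unknown mixing weight.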
Numerous studies have shown that the addition to the GTR model of rate heterogeneity through Γ, I, or both, can considerably improve fit to data at the expense of only a few additional parameters. In fact, when model selection procedures are performed, the GTR+Γ+I model is preferred in most studies. These stochastic models are the basis of hundreds of publications every year in the biological sciences (over 40 in Systematic Biology alone in 2006). Their impact is immense in the fields of evolutionary biology, ecology, conservation biology, and biogeography, as well as in medicine, where, for example, they appear in the study of the evolution of infectious diseases such as HIV and influenza viruses.
The main result claimed in the widely-cited paper [11] is the following:
The 4-base (DNA) GTR+Γ+I model, with unknown mixing parameter and Γ shape parameter, is identifiable from the joint distributions of pairs of taxa.
However, the proof given in [11] of this statement is flawed; in fact, two gaps occur in the argument. The first gap is in the use of an unjustified claim concerning graphs of the sort exemplified by Figure 3 of that paper. As this claim plays a crucial role in the entire argument, the statement above remains unproven.
The second gap, though less sweeping in its impact, is still significant. Assuming the unjustified graphical claim mentioned above could be proved, the argument of [11] still uses an assumption that the eigenvalues of the GTR rate matrix be distinct. While this is true for generic GTR parameters, there are exceptions, including the well-known Jukes-Cantor and Kimura 2-parameter models [4]. Without substantial additional arguments, the reasoning given in [11] cannot prove identifiability in all cases.
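The failure of the distinct-eigenvalue assumption for these two submodels can be checked numerically. The sketch below is only an illustration: the helper names `k2p_rate_matrix` and `count_distinct`, the state order A, C, G, T, and the rate values are assumptions made here, not taken from the paper. Jukes-Cantor is treated as the special case of the Kimura 2-parameter model in which the transition and transversion rates coincide.

```python
import numpy as np

def k2p_rate_matrix(a, b):
    """Kimura 2-parameter rate matrix (unnormalized).

    Assumed state order: A, C, G, T.
    Transitions (A<->G, C<->T) occur at rate a, transversions at rate b.
    Setting a == b gives the Jukes-Cantor model.
    """
    Q = np.full((4, 4), float(b))
    Q[0, 2] = Q[2, 0] = a                    # A <-> G
    Q[1, 3] = Q[3, 1] = a                    # C <-> T
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))      # rows sum to zero
    return Q

def count_distinct(values, tol=1e-9):
    """Count values that differ by more than tol after sorting."""
    vals = np.sort(np.real(values))
    return 1 + int(np.sum(np.diff(vals) > tol))

# Both matrices are symmetric, so eigvalsh applies.
jc_eigs = np.linalg.eigvalsh(k2p_rate_matrix(1.0, 1.0))
k2p_eigs = np.linalg.eigvalsh(k2p_rate_matrix(2.0, 1.0))

print("Jukes-Cantor distinct eigenvalues:", count_distinct(jc_eigs))
print("Kimura 2-parameter distinct eigenvalues:", count_distinct(k2p_eigs))
```

With these rates, the Jukes-Cantor matrix has eigenvalues 0 and −4 (multiplicity three), and the Kimura 2-parameter matrix has eigenvalues 0, −4, and −6 (multiplicity two), so neither has four distinct eigenvalues.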
Furthermore, bridging either of the gaps in [11] is not a trivial matter.
Though we suspect that Rogers’ statement of identifiability is correct, at least for generic parameters, we have not been able to establish it by his methods.
For further exposition on the nature of the gaps, see the Appendix.
In this paper, we consider only the GTR+Γ model, but for characters with any number κ ≥ 2 states, where the case κ = 4 corresponds to DNA sequences.
Our main result is the following:
Theorem 1. The κ-state GTR+Γ model is identifiable from the joint distributions of triples of taxa for generic parameters on any tree with 3 or more taxa. Moreover, when κ = 4 the model is identifiable for all parameters.
The term ‘generic’ here refers to those GTR state distributions and rate matrices which do not satisfy at least one of a collection of equalities to be explicitly given in Theorem 2. Consequently, the set of non-generic parameters has Lebesgue measure zero in the full parameter space. Our arguments are quite different from those attempted in [11]: we combine arguments from algebra, algebraic geometry, and analysis.
…(Full text truncated)…