Phylogenetic mixtures on a single tree can mimic a tree of another topology
Phylogenetic mixtures model the inhomogeneous molecular evolution commonly observed in data. The performance of phylogenetic reconstruction methods where the underlying data is generated by a mixture model has stimulated considerable recent debate. Much of the controversy stems from simulations of mixture model data on a given tree topology for which reconstruction algorithms output a tree of a different topology; these findings were held up to show the shortcomings of particular tree reconstruction methods. In so doing, the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the "correct" method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.
💡 Research Summary
The paper addresses a fundamental assumption in phylogenetics: that data generated under a mixture model on one tree topology can, given enough sequence length and an appropriate inference method, be distinguished from data generated on an unmixed tree of a different topology. The authors demonstrate that this assumption can fail. They formalize a phylogenetic mixture as a convex combination of site‑pattern probability distributions produced by the same tree topology but with different sets of branch lengths (or other substitution parameters). By exploiting the linear structure of these distributions, they construct explicit examples where a mixture on tree T₁ yields exactly the same site‑pattern distribution as an unmixed model, with a single set of branch lengths, on a distinct tree T₂.
The proof proceeds in two stages. First, for the simplest non‑trivial case of four taxa, they select two distinct branch‑length vectors for topology T₁, generating two distributions P₁ and P₂. By weighting these with coefficients α and (1‑α), they obtain a mixed distribution P_mix = αP₁ + (1‑α)P₂. Second, they show that there exists a branch‑length vector for the alternative topology T₂ such that the distribution P_alt produced by the standard Markov model on T₂ coincides with P_mix. This equivalence reduces to solving a system of linear equations in the space of site‑pattern probabilities, which is possible because the dimension of that space exceeds the number of constraints imposed by the two‑component mixture. The authors also connect the construction to phylogenetic invariants: certain polynomial relationships among pattern frequencies remain unchanged under the mixture, allowing the two distinct topologies to be algebraically indistinguishable.
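The quantities P₁, P₂, P_mix and P_alt can be made concrete with a small numerical sketch. The code below assumes the two-state symmetric (CFN) model on four taxa, which is a simplification chosen here for illustration and not necessarily the model used in the paper. The function names and all parameter values are hypothetical. The values are not claimed to produce an exact match; the script only shows how a mixture on T₁ would be compared against an unmixed distribution on T₂.

```python
from itertools import product

def flip(a, b, p):
    """CFN transition probability: the state changes with probability p."""
    return p if a != b else 1.0 - p

def quartet_distribution(pendant, internal, split=((0, 1), (2, 3))):
    """Site-pattern distribution on a four-taxon tree under the CFN model.

    pendant  -- flip probabilities of the edges leading to leaves 0..3
    internal -- flip probability of the internal edge
    split    -- which pair of leaves attaches to each internal node
    """
    (a, b), (c, d) = split
    dist = {}
    for x in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the unobserved states at the two internal nodes u and v,
        # with a uniform distribution on the state at u.
        for u, v in product((0, 1), repeat=2):
            total += (0.5 * flip(u, v, internal)
                      * flip(u, x[a], pendant[a]) * flip(u, x[b], pendant[b])
                      * flip(v, x[c], pendant[c]) * flip(v, x[d], pendant[d]))
        dist[x] = total
    return dist

def mix(alpha, p1, p2):
    """Convex combination alpha * p1 + (1 - alpha) * p2."""
    return {x: alpha * p1[x] + (1 - alpha) * p2[x] for x in p1}

def max_abs_diff(p, q):
    """Largest absolute difference between two pattern distributions."""
    return max(abs(p[x] - q[x]) for x in p)

# Hypothetical parameter values, for illustration only.
T1 = ((0, 1), (2, 3))   # topology 01|23
T2 = ((0, 2), (1, 3))   # topology 02|13
P1 = quartet_distribution([0.05, 0.30, 0.05, 0.30], 0.10, T1)
P2 = quartet_distribution([0.30, 0.05, 0.30, 0.05], 0.10, T1)
P_mix = mix(0.5, P1, P2)
P_alt = quartet_distribution([0.20, 0.20, 0.20, 0.20], 0.05, T2)

# A value of zero would mean the mixture on T1 is indistinguishable
# from the unmixed model on T2 for these parameters.
print(max_abs_diff(P_mix, P_alt))
```

Searching over the branch parameters of T₂ to drive this difference toward zero would be one way to explore the construction numerically; the paper's argument is analytic and does not depend on such a search.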
To validate the theory, the authors conduct simulations. They generate two gene alignments under the same topology but with different branch lengths, concatenate them, and then infer trees using maximum‑likelihood and Bayesian approaches. In every replicate, the inferred tree matches the wrong topology (the one that would produce the same mixed distribution), confirming that standard reconstruction methods cannot detect the underlying mixture.
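The data-generation step of such a simulation can be sketched as follows. This is not the authors' code: it assumes a two-state symmetric (CFN) model on the four-taxon tree 01|23, and the function names, gene lengths, and flip probabilities are all hypothetical. Tree inference itself is omitted; the sketch only shows that concatenating two genes yields a sample whose expected pattern frequencies are a mixture with weights proportional to the gene lengths.

```python
import random
from collections import Counter

def evolve(state, p, rng):
    """Flip a binary state with probability p (one edge of the tree)."""
    return state ^ int(rng.random() < p)

def simulate_gene(n_sites, pendant, internal, rng):
    """Sample n_sites binary site patterns on the quartet 01|23 under CFN."""
    sites = []
    for _ in range(n_sites):
        u = rng.randint(0, 1)          # uniform state at internal node u
        v = evolve(u, internal, rng)   # state at internal node v
        sites.append((evolve(u, pendant[0], rng), evolve(u, pendant[1], rng),
                      evolve(v, pendant[2], rng), evolve(v, pendant[3], rng)))
    return sites

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two genes on the same topology with different (hypothetical) branch parameters.
gene_a = simulate_gene(600, [0.05, 0.30, 0.05, 0.30], 0.10, rng)
gene_b = simulate_gene(400, [0.30, 0.05, 0.30, 0.05], 0.10, rng)

# Concatenation: the combined alignment is a sample from the mixture
# alpha * P_a + (1 - alpha) * P_b with alpha = len(gene_a) / total length.
alignment = gene_a + gene_b
alpha = len(gene_a) / len(alignment)

# Empirical site-pattern frequencies, the input to most inference methods.
freqs = {pattern: count / len(alignment)
         for pattern, count in Counter(alignment).items()}
print(alpha, len(freqs))
```

The resulting `freqs` table could then be passed to any tree-inference routine; which topology that routine prefers depends on the chosen parameters and is not asserted here.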
The results have direct consequences for practice. Many datasets combine sequences from multiple genes or genomic regions that evolve under heterogeneous processes. Even if each gene individually supports the same underlying tree, differences in branch lengths alone can cause the combined data to fit a tree of a different topology exactly. Consequently, phylogenetic analyses that ignore mixture effects risk drawing erroneous conclusions about species relationships.
The authors propose several mitigation strategies. One is to fit explicit mixture models that allow multiple sets of branch lengths on the same topology, using model‑selection criteria to test whether a mixture provides a significantly better fit than a single‑tree model. Another is to partition the data (by gene, codon position, or other criteria) and assess congruence among the resulting trees before concatenation. Finally, they suggest employing invariant‑based tests that can detect when a dataset lies in the intersection of the model spaces of two distinct topologies.
In summary, the paper provides rigorous mathematical evidence that phylogenetic mixture models on a single tree can exactly replicate the site‑pattern distribution of a different tree topology. This challenges the prevailing view that sufficient data and optimal methods guarantee correct topology recovery, and it underscores the need for more sophisticated modeling and diagnostic tools in modern phylogenetics.