Mixed-up trees: the structure of phylogenetic mixtures

In this paper we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric (CFN) model. In particular, we show that the set of mixture distributions forms a convex polytope and we calculate its dimension; corollaries include a simple criterion for when a mixture of branch lengths on the star tree can mimic the site pattern frequency vector of a resolved quartet tree. Furthermore, by computing volumes of polytopes we can clarify how “common” non-identifiable mixtures are under the CFN model. We also present a new combinatorial result which extends any identifiability result for a specific pair of trees of size six to arbitrary pairs of trees. Next we present a positive result showing identifiability of rates-across-sites models. Finally, we answer a question raised in a previous paper concerning “mixed branch repulsion” on trees larger than quartet trees under the CFN model.

Research Summary

The paper investigates the structural properties of phylogenetic mixture models by applying geometric, combinatorial, and probabilistic techniques to the two‑state random‑cluster model, a natural extension of the classic two‑state symmetric (Cavender‑Farris‑Neyman, CFN) model. The authors first prove that the set of all possible mixture distributions under this model forms a convex polytope. By representing each tree‑branch‑length configuration as a point in a high‑dimensional space of site‑pattern frequencies, they show that the polytope’s dimension for an n‑taxon tree is exactly (2ⁿ – n – 1). This matches the known degrees of freedom for the CFN model but highlights additional flexibility introduced by the random‑cluster parameter (the probability of clustering).
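To make the picture of "a point in a space of site‑pattern frequencies" concrete, the following is a minimal Python sketch for the plain CFN model on a quartet tree. It is not the authors' code and does not implement the random‑cluster model; the function names and the flip probabilities are illustrative assumptions. It computes the 16 site‑pattern probabilities by summing over the two internal node states, then forms a mixture as a weighted average of two such vectors, which is why the set of mixtures is convex by construction.

```python
from itertools import product


def cfn_quartet_pattern_probs(p_leaf, p_internal):
    """Site-pattern probabilities for the CFN model on the quartet tree 12|34.

    p_leaf:     flip probabilities on the pendant edges of leaves 1..4
    p_internal: flip probability on the internal edge (0 gives a star tree)
    Returns a dict mapping each 0/1 pattern (a 4-tuple) to its probability.
    """
    def edge(p, a, b):
        # Probability of seeing state b at one end given state a at the other.
        return p if a != b else 1.0 - p

    probs = {}
    for pattern in product((0, 1), repeat=4):
        total = 0.0
        # u is the internal node next to leaves 1, 2; v is next to leaves 3, 4.
        for u, v in product((0, 1), repeat=2):
            term = 0.5 * edge(p_internal, u, v)  # uniform state at u
            term *= edge(p_leaf[0], u, pattern[0]) * edge(p_leaf[1], u, pattern[1])
            term *= edge(p_leaf[2], v, pattern[2]) * edge(p_leaf[3], v, pattern[3])
            total += term
        probs[pattern] = total
    return probs


def mix_distributions(weights, dists):
    """Convex combination of pattern distributions (weights should sum to 1)."""
    return {k: sum(w * d[k] for w, d in zip(weights, dists)) for k in dists[0]}


# Two branch-length settings on the same topology (illustrative values only).
tree_a = cfn_quartet_pattern_probs([0.05, 0.30, 0.05, 0.30], 0.10)
tree_b = cfn_quartet_pattern_probs([0.30, 0.05, 0.30, 0.05], 0.10)
mixture = mix_distributions([0.5, 0.5], [tree_a, tree_b])
```

Each returned dictionary is one point in the 16‑dimensional pattern space; the CFN symmetry under swapping the two states means complementary patterns always receive equal probability.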

A central contribution is a simple linear‑algebraic criterion that determines when a mixture of branch lengths on a star tree can reproduce the site‑pattern frequency vector of a resolved quartet tree. The condition reduces to checking whether a weighted sum of star‑tree branch‑length vectors satisfies a specific linear equation derived from the quartet’s pattern frequencies. Although the criterion is easy to verify, the authors demonstrate that the parameter region satisfying it is extremely narrow, implying that exact mimicry between star‑tree mixtures and resolved trees is theoretically possible but practically unlikely.
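The question of whether a star‑tree mixture reproduces a resolved quartet can be explored numerically. The sketch below is not the paper's criterion: it is a direct check under the plain CFN model, using the standard Fourier (Hadamard) coordinates in which each edge carries a parameter theta = 1 − 2p and the coordinate for an even set of leaves is a product of edge parameters. All function names are hypothetical, only the topology 12|34 is tested, and the equality checks use a numerical tolerance.

```python
from itertools import combinations
from math import isclose, prod, sqrt

PAIRS = list(combinations(range(4), 2))  # leaf pairs, leaves indexed 0..3


def quartet_fourier(b, t):
    """CFN Fourier coordinates for the quartet 12|34.

    b: pendant-edge parameters theta_i = 1 - 2 * p_i for the four leaves
    t: internal-edge parameter (t = 1.0 gives the star tree)
    """
    q = {}
    for i, j in PAIRS:
        crosses = (i < 2) != (j < 2)  # path i-j uses the internal edge
        q[(i, j)] = b[i] * b[j] * (t if crosses else 1.0)
    q["all"] = prod(b)  # internal edge does not contribute here
    return q


def mix(weights, coords):
    """Mixtures average Fourier coordinates, just as they average patterns."""
    return {k: sum(w * c[k] for w, c in zip(weights, coords)) for k in coords[0]}


def fit_quartet_12_34(q, tol=1e-9):
    """Return (leaf thetas, internal theta) if q arises from a CFN quartet
    12|34 with all parameters in (0, 1]; otherwise return None."""
    if min(q.values()) <= 0:
        return None
    # Two equalities forced by the 12|34 parametrisation.
    if not isclose(q["all"], q[(0, 1)] * q[(2, 3)], rel_tol=tol):
        return None
    if not isclose(q[(0, 2)] * q[(1, 3)], q[(0, 3)] * q[(1, 2)], rel_tol=tol):
        return None
    t = sqrt(q[(0, 2)] * q[(1, 3)] / (q[(0, 1)] * q[(2, 3)]))
    b = [
        sqrt(q[(0, 1)] * q[(0, 2)] / q[(1, 2)]),
        sqrt(q[(0, 1)] * q[(1, 2)] / q[(0, 2)]),
        sqrt(q[(2, 3)] * q[(0, 2)] / q[(0, 3)]),
        sqrt(q[(2, 3)] * q[(0, 3)] / q[(0, 2)]),
    ]
    if t > 1 + tol or any(x > 1 + tol for x in b):
        return None
    return b, t


# A genuine resolved quartet is recovered (illustrative parameter values).
resolved = fit_quartet_12_34(quartet_fourier([0.9, 0.8, 0.7, 0.6], 0.5))

# An arbitrary two-class star-tree mixture usually fails the check.
star_mix = mix(
    [0.5, 0.5],
    [quartet_fourier([0.9, 0.9, 0.5, 0.5], 1.0),
     quartet_fourier([0.5, 0.5, 0.9, 0.9], 1.0)],
)
mimic = fit_quartet_12_34(star_mix)
```

A star‑tree mixture mimics a resolved 12|34 quartet in this sense only when the function returns an internal parameter strictly below 1; a single star tree passes the equalities but returns an internal parameter of 1, that is, an internal edge of length zero.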

To assess how “common” non‑identifiable mixtures are, the authors compute volumes of the relevant polytopes. By normalising the total volume of the parameter space, they obtain a quantitative measure of the proportion of the space occupied by mixtures that are indistinguishable from a single‑tree model. The volume of the non‑identifiable region is small relative to the whole space, yet it concentrates in biologically plausible zones (short internal branches combined with high clustering probabilities). This concentration suggests that, while rare overall, non‑identifiable mixtures could arise in empirical datasets that feature rapid radiations or strong site‑specific correlations.

On the combinatorial side, the paper presents a powerful extension theorem: any identifiability result proved for a particular pair of six‑leaf trees automatically extends to arbitrary pairs of trees of any size. The proof hinges on the concept of a minimal conflicting subtree, that is, a smallest induced subtree on which the two trees disagree. By showing that any larger disagreement can be reduced to a collection of such minimal subtrees, the authors construct an inductive argument that propagates identifiability from the six‑leaf case to the general case. This theorem dramatically simplifies the task of establishing identifiability for complex trees, because it suffices to verify a finite set of small‑tree configurations.

The authors also address rates‑across‑sites (RAS) models, where each site evolves under its own substitution rate drawn from a continuous distribution. Under the assumption that the rate distribution has full support and that each rate class contributes a non‑negligible number of sites, they prove that the mixture model remains identifiable: the tree topology, branch lengths, and the underlying rate distribution can all be uniquely recovered from the observed site‑pattern frequencies. This result provides theoretical justification for using RAS models in phylogenomic analyses, reassuring practitioners that the added realism does not inevitably lead to unresolvable ambiguity.
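A rates‑across‑sites model is a particular kind of mixture: every site shares the same tree and branch lengths, and a site's rate multiplies all branch lengths at once. The short sketch below illustrates this for a single edge under the CFN model. It assumes a discrete set of rate classes purely for illustration (the summary above speaks of a continuous distribution), and the function names and rate values are hypothetical.

```python
from math import exp


def cfn_flip_probability(branch_length, rate=1.0):
    """CFN probability that the two ends of an edge differ.

    branch_length is the expected number of substitutions per site at rate 1;
    the site-specific rate rescales it.
    """
    return 0.5 * (1.0 - exp(-2.0 * rate * branch_length))


def ras_flip_probability(branch_length, rates, weights):
    """Flip probability averaged over rate classes (weights should sum to 1)."""
    return sum(
        w * cfn_flip_probability(branch_length, r)
        for r, w in zip(rates, weights)
    )


# Two equally likely rate classes with mean rate 1 (illustrative values).
single = cfn_flip_probability(0.5)
mixed = ras_flip_probability(0.5, rates=[0.2, 1.8], weights=[0.5, 0.5])
```

Because the flip probability is a concave function of the rate, the averaged value is smaller than the single‑rate value at the same mean rate, so rate variation makes an edge look shorter than it is when analysed with a single rate.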

Finally, the paper resolves an open question concerning “mixed branch repulsion,” a phenomenon previously observed in quartet trees where certain mixtures of long and short branches can “push” each other away, making the mixture appear more tree‑like than any constituent. By combining polytope‑volume calculations with the combinatorial extension theorem, the authors show that for trees larger than quartets, mixed branch repulsion disappears under the CFN model provided that internal branches are sufficiently long. In larger trees, the mixing effect is diluted across many edges, preventing the pathological reinforcement seen in small trees.

In summary, the study delivers a comprehensive geometric description of phylogenetic mixture spaces, quantifies the rarity of non‑identifiable mixtures, furnishes a general combinatorial framework for extending identifiability results, confirms the identifiability of rates‑across‑sites models, and clarifies the limits of mixed branch repulsion. These insights collectively advance our theoretical understanding of phylogenetic mixture models and provide practical guidance for designing robust phylogenetic inference pipelines in the presence of heterogeneous evolutionary processes.