Maximum Likelihood Supertrees

📝 Original Info

  • Title: Maximum Likelihood Supertrees
  • ArXiv ID: 0708.2124
  • Date: 2007-08-17
  • Authors: Mike Steel, David Bryant

📝 Abstract

We analyse a maximum-likelihood approach for combining phylogenetic trees into a larger 'supertree'. This is based on a simple exponential model of phylogenetic error, which ensures that ML supertrees have a simple combinatorial description (as a median tree, minimising a weighted sum of distances to the input trees). We show that this approach to ML supertree reconstruction is statistically consistent (it converges on the true species supertree as more input trees are combined), in contrast to the widely used MRP method, which we show can be statistically inconsistent under the exponential error model. We also show that this statistical consistency extends to an ML approach for constructing species supertrees from gene trees. In this setting, incomplete lineage sorting (due to coalescence rates of homologous genes being lower than speciation rates) has been shown to lead to gene trees that are frequently different from species trees, and this can confound efforts to reconstruct the species phylogeny correctly.

📄 Full Content

Combining trees on different, overlapping sets of taxa into a parent 'supertree' is now a mainstream strategy for constructing large phylogenetic trees. The literature on supertrees is growing steadily: new methods of supertree reconstruction are being developed (Cotton and Wilkinson, 2007) and supertree analyses are shedding light on fundamental evolutionary questions (Bininda-Emonds et al., 2007). Despite this surge in research activity, it is probably fair to say that biologists are still confused about what supertrees really are and what it is we do when we build a supertree. Are we, as some maintain, simply summarising the phylogenetic information contained in a group of subtrees? Or are we trying to derive the best estimate of phylogeny given the information at hand? Nor is it clear which of these two conceptually different objectives underpins each of the various supertree reconstruction methods.

We take the view that what biologists really want a supertree reconstruction method to deliver is the best hypothesis of evolutionary relationships that can be inferred from the data available. A supertree constructed as a summary statistic will not necessarily be the best estimate of phylogeny.

Nonetheless, if we are prepared to consider supertree reconstruction a problem of phylogenetic estimation, we have at our disposal an arsenal of phylogenetic tools and methods that have been tried and tested. Matrix Representation with Parsimony (MRP; Baum and Ragan, 1992), Matrix Representation with Compatibility (MRC; Rodrigo, 1996; Ross and Rodrigo, 2004) and, most recently, Bayesian supertree reconstruction (BSR; Ronquist et al., 2004) are undoubtedly inspired by standard phylogenetic methods. A gap remains, though, as there has been remarkably little development of likelihood-based methods for supertree reconstruction.

In this paper, we analyse one approach to obtaining maximum-likelihood (ML) estimates of supertrees, based on a probability model that permits 'errors' in subtree topologies. The approach is of the type described by Cotton and Page (2004), and it permits supertrees to be estimated even if there is topological conflict amongst the constituent subtrees. We show that ML estimates of supertrees so obtained are statistically consistent under fairly general conditions. By contrast, we show that MRP may be inconsistent under these same conditions. We then consider a further complication that arises in the supertree setting when combining gene trees into species trees: in addition to the possibility that the input gene trees are reconstructed incorrectly (as a consequence of either the reconstruction method used or some sampling error), there is a further stochastic process that leads to the (true) gene trees differing from their underlying species tree (a consequence of incomplete lineage sorting). Although simple majority-rule approaches (and gene concatenation) have recently been shown to be misleading, we show that an ML supertree approach for combining gene trees is also statistically consistent.
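
The abstract characterises the error model as exponential and the ML supertree as a median tree that minimises a weighted sum of distances to the input trees. The following is a sketch of how those two statements fit together, not necessarily the exact formulation used in the paper. The notation is assumed here for illustration: input trees $T_i$ on taxon sets $X_i$, a tree metric $d$, a positive weight $\beta_i$ for each input tree, a normalising constant $\alpha_i$ taken not to depend on the candidate supertree $T$, and independence of the input trees given $T$.

```latex
% Assumed exponential error model for one input tree T_i given supertree T
\mathbb{P}[T_i \mid T] = \alpha_i \exp\bigl(-\beta_i \, d(T_i,\, T|X_i)\bigr)

% Log-likelihood of a profile (T_1, ..., T_k), assuming independent input trees
\log L(T) = \sum_{i=1}^{k} \log \alpha_i \;-\; \sum_{i=1}^{k} \beta_i \, d(T_i,\, T|X_i)

% If the alpha_i do not depend on T, maximising L is minimising the weighted sum
\hat{T}_{\mathrm{ML}} = \arg\min_{T} \sum_{i=1}^{k} \beta_i \, d(T_i,\, T|X_i)
```

Under these assumptions the first sum in the log-likelihood is constant in $T$, so the ML supertree is exactly the tree minimising the weighted distance sum, which is the median-tree description given in the abstract.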

1.1. Terminology. Throughout this paper, unless stated otherwise, phylogenetic trees may be either rooted or unrooted, and we will mostly follow the notation of Semple and Steel (2003). In particular, given a (rooted or unrooted) phylogenetic tree $T$ on a set $X$ of taxa (which will always label the leaves of the tree), any subset $Y$ of $X$ induces a phylogenetic tree on taxon set $Y$, denoted $T|Y$, which, informally, is the subtree of $T$ that connects the taxa in $Y$ only. In the supertree problem, we have a sequence $P = (T_1, T_2, \ldots, T_k)$ of input trees, called a profile, where $T_i$ is a phylogenetic tree on taxon set $X_i$. We wish to combine these trees into a phylogenetic tree $T$ on the union of the taxon sets (i.e.

We assume that the trees in P are either all rooted or all unrooted, and that T is rooted or unrooted accordingly. We will mostly assume that trees are fully-resolved (i.e. binary trees, without polytomies); in Section 5 we briefly describe how this restriction can be lifted.
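
The induced subtree T|Y from the terminology above can be illustrated with a small sketch for the rooted case. The representation is an assumption made for this example only: a leaf is a string, an internal vertex is a tuple of its children, and the helper names `restrict` and `clusters` are hypothetical. Two rooted trees are compared by their clusters (the leaf sets below each internal vertex), so that child order does not matter.

```python
def restrict(tree, keep):
    """Return the rooted subtree induced by the taxa in `keep` (None if empty)."""
    if isinstance(tree, str):                      # a leaf
        return tree if tree in keep else None
    children = [c for c in (restrict(ch, keep) for ch in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:                         # suppress a vertex with one child
        return children[0]
    return tuple(children)


def clusters(tree):
    """Set of leaf sets below each internal vertex of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(ch) for ch in node))
        found.add(below)
        return below

    leaves(tree)
    return found


# Hypothetical example: a 'true' tree T0 on {a, b, c, d, e} and two input trees on Y.
T0 = ((("a", "b"), "c"), ("d", "e"))
Y = {"a", "c", "d"}
induced = restrict(T0, Y)                          # T0|Y
agrees = clusters((("c", "a"), "d")) == clusters(induced)
conflicts = clusters(("a", ("c", "d"))) != clusters(induced)
```

Here `induced` groups a with c, so an input tree on Y that groups c with d would conflict with T0|Y even though it is a perfectly valid tree on Y.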

A special case of the supertree problem arises when the taxon sets of the input trees are all the same ($X_1 = X_2 = \cdots = X_k$). This is the much-studied consensus tree problem. In an early paper, McMorris (1990) described how, in this consensus setting, the majority-rule consensus tree can be given a maximum likelihood interpretation. However, this approach is quite different from the one described here (even when restricted to the consensus problem).
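
For readers unfamiliar with the majority-rule consensus tree mentioned above, the following sketch computes it for rooted trees on a common taxon set, using the standard definition: keep every cluster that appears in more than half of the input trees. It illustrates the consensus method only, not the ML approach of this paper. The nested-tuple tree representation and the function names are assumptions made for this example.

```python
from collections import Counter


def clusters(tree):
    """Set of leaf sets below each internal vertex of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):                  # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves(ch) for ch in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def majority_rule_clusters(profile):
    """Clusters present in strictly more than half of the trees in `profile`."""
    counts = Counter()
    for tree in profile:
        counts.update(clusters(tree))              # each tree contributes a cluster once
    half = len(profile) / 2
    return {c for c, n in counts.items() if n > half}


# Hypothetical profile of three rooted trees on the same taxa {a, b, c, d}.
t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "b"), "d"), "c")
t3 = ((("a", "c"), "b"), "d")
consensus = majority_rule_clusters([t1, t2, t3])
```

In this profile the clusters {a, b} and {a, b, c} each occur in two of the three trees, so the majority-rule clusters are exactly those of `t1`.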

In this paper, we denote the underlying ('true') species tree by $T_0$ (assuming that such a tree exists and that the evolution of the taxa has not involved reticulate processes such as the formation of hybrid taxa). In an ideal world, we would like $T_i = T_0|X_i$ for each tree $T_i$ in the profile; that is, we would like each of the reconstructed trees to be identical to the subtree of the 'true' tree for the taxa in $X_i$. But in practice, the t

Reference

This content is AI-processed based on open access ArXiv data.
