Resampling Residuals: Robust Estimators of Error and Fit for Evolutionary Trees and Phylogenomics

Reading time: 6 minutes

📝 Original Info

  • Title: Resampling Residuals: Robust Estimators of Error and Fit for Evolutionary Trees and Phylogenomics
  • ArXiv ID: 0912.5288
  • Date: 2009-12-31
  • Authors: Peter J. Waddell and Ariful Azad

📝 Abstract

Phylogenomics, even more so than traditional phylogenetics, needs to represent the uncertainty in evolutionary trees due to systematic error. Here we illustrate the analysis of genome-scale alignments of yeast, using robust measures of the additivity of the fit of distances to tree when using flexi Weighted Least Squares. A variety of DNA and protein distances are used. We explore the nature of the residuals, standardize them, and then create replicate data sets by resampling these residuals. Under the model, the results are shown to be very similar to the conventional sequence bootstrap. With real data they show up uncertainty in the tree that is either due to underestimating the stochastic error (hence massively overestimating the effective sequence length) and/or systematic error. The methods are extended to the very fast BME criterion with similarly promising results.

📄 Full Content

Waddell and Azad (2009). Resampling Residuals: Robust Estimators of Error and Fit for Evolutionary Trees and Phylogenomics

Peter J. Waddell¹,² and Ariful Azad²

pwaddell@purdue.edu

¹Department of Biological Sciences, Purdue University, West Lafayette, IN 47906, U.S.A.
²Department of Computer Science, Purdue University, West Lafayette, IN 47906, U.S.A.
Phylogenomics, even more so than traditional phylogenetics, needs to represent the uncertainty in evolutionary trees due to systematic error. Here we illustrate the analysis of genome-scale alignments of yeast, using robust measures of the additivity of the fit of distances to tree when using flexi Weighted Least Squares. A variety of DNA and protein distances are used. We explore the nature of the residuals, standardize them, and then create replicate data sets by resampling these residuals. Under the model, the results are shown to be very similar to the conventional sequence bootstrap. With real data they show up uncertainty in the tree that is either due to underestimating the stochastic error (hence massively overestimating the effective sequence length) and/or systematic error. The methods are extended to the very fast BME criterion with similarly promising results.
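The resampling scheme described in the abstract (standardize the residuals of the distance-to-tree fit, resample them, and rebuild replicate distance sets) can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: the function name, the argument layout, and the use of simple with-replacement resampling are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_residuals(d_obs, d_fit, weights, n_reps=100):
    """Build replicate distance sets by resampling standardized residuals.

    d_obs   : observed pairwise distances (condensed vector)
    d_fit   : distances implied by the fitted tree
    weights : fWLS weights, e.g. 1 / d_fit**P for some power P
    """
    # Standardize so the residuals are (approximately) exchangeable
    std_resid = (d_obs - d_fit) * np.sqrt(weights)
    reps = []
    for _ in range(n_reps):
        # Resample standardized residuals with replacement, then
        # un-standardize and add them back onto the fitted distances
        star = rng.choice(std_resid, size=std_resid.size, replace=True)
        reps.append(d_fit + star / np.sqrt(weights))
    return reps
```

Each replicate distance set can then be re-fitted to candidate trees, giving bootstrap-like support values whose spread reflects the residual structure of the fit rather than resampling of alignment columns.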

“… his ignorance and almost doe-like naivety is keeping his mind receptive to a possible solution.” A quotation from Kryten: Red Dwarf VIII-Cassandra

Keywords: Resampled Residuals, Sequence Bootstrap, flexi Weighted Least Squares (fWLS) Phylogenetic Trees, Balanced Minimum Evolution (BME), Rokas Yeast Genomes, Phylogenomics

1 Introduction

Phylogenetics and phylogenomics aim to recover the evolutionary trees by which homologous parts of the genome evolved. Absent factors such as hybridization or horizontal gene transfer, and if the coalescence time is small compared to edge length durations, then a species tree should be a good description of the data. Phylogenetic methods should aim not simply to build a tree, but to inform the user of the fit of the data to the tree (hopefully relative not only to other treatments of the data, but also to other data sets) and also of the reliability of the inferred tree (e.g., Swofford et al. 1996). One consequence of fit is making a realistic assessment of how much confidence to have that an edge (internode, or less precisely a “branch”) in the reconstructed tree is in the tree generating the data. For purposes such as divergence time estimation, with its reliance on the weighted tree, a variance-covariance matrix of the edge lengths in the tree is similarly important (e.g., Waddell and Penny 1996, Thorne, Kishino and Painter 1998).

In terms of phylogenetic methods, it is often prescribed that there are three main kinds: parsimony, likelihood-based and distance-based. Many practitioners swear by one method over the others (often for somewhat vague reasons), but in reality they often blend into each other, and it is in understanding the interrelationships that a more robust understanding of phylogenetics emerges. One example of a bridging method is the Hadamard conjugation (Hendy and Penny 1993), which is a likelihood, distance and invariants method all at the same time (Waddell 1995, Swofford et al. 1996). In practice, too, the methods have quite different and complementary strengths. For example, parsimony is intuitive and has a useful exploratory flavor to it (especially when combined with software like Mesquite, http://mesquiteproject.org). Parsimony is also relatively fast for moderate-sized data sets. Likelihood-based methods (such as maximum likelihood, ML, or marginal likelihood/Bayesian inference) offer detailed predictions of how the data and model should relate to each other. These are generally computationally expensive methods (Swofford et al. 1996), but tests of fit can be approximated (Waddell, Ota, and Penny 2009). Distance-based methods directly use the fundamental property of additivity of pairwise distances on the tree (Swofford et al. 1996). They also offer clear advantages in computational speed (or simply computability) on very large data sets. For example, a balanced minimum evolution (BME) criterion-based method (e.g., Gascuel and Steel 2006) will now complete an SPR cycle of tree search in O(t²) time (where t is the number of tips on the tree, Hordijk and Gascuel 2005). This is phenomenally fast, since the number of subtree pruning and regrafting (SPR) tree rearrangements to be considered is itself O(t²).
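The additivity property that distance methods exploit can be made concrete: each pairwise distance is (ideally) the sum of the edge lengths on the path between the two tips, so fitting edge lengths is a weighted least-squares solve against a 0/1 tree-path matrix. Below is a minimal numpy sketch of an fWLS fit; the function name and the weight form 1/d**P are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fwls_edge_lengths(A, d, P=1.0):
    """Estimate edge lengths by flexi Weighted Least Squares (fWLS).

    Minimizes sum_k w_k * (d_k - (A @ x)_k)**2 with w_k = 1 / d_k**P,
    where the power P tunes how strongly larger (noisier) distances
    are down-weighted.

    A : (n_pairs, n_edges) 0/1 matrix; A[k, e] = 1 if edge e lies on
        the path between the k-th pair of tips
    d : observed pairwise distances, as a condensed vector
    """
    w = 1.0 / np.maximum(d, 1e-12) ** P       # fWLS weights
    Aw = A * np.sqrt(w)[:, None]              # weighted design matrix
    dw = d * np.sqrt(w)
    x, *_ = np.linalg.lstsq(Aw, dw, rcond=None)
    fitted = A @ x
    return x, fitted, d - fitted              # residuals feed the resampling

# Example: 4 tips a,b,c,d on the tree ((a,b),(c,d)) with pendant edge
# lengths 1, 2, 3, 4 and internal edge 1; distances are exactly additive.
A = np.array([[1, 1, 0, 0, 0],   # a-b
              [1, 0, 1, 0, 1],   # a-c
              [1, 0, 0, 1, 1],   # a-d
              [0, 1, 1, 0, 1],   # b-c
              [0, 1, 0, 1, 1],   # b-d
              [0, 0, 1, 1, 0]],  # c-d
             dtype=float)
d = np.array([3, 5, 6, 6, 7, 7], dtype=float)
x, fitted, resid = fwls_edge_lengths(A, d)
# x recovers [1, 2, 3, 4, 1] and every residual is ~0
```

On perfectly additive data any choice of P recovers the generating edge lengths exactly; with real distances the residuals are nonzero, and it is these (after standardization) that the resampling scheme of this paper reuses.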

At present, in terms of fit of data to tree, there are few clear guides. Cladists have long held that measures such as the consistency index are a useful guide to how “clean” a data set is, but it is also well recognized that these are hard to calibrate for different sized data sets (e.g.,

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
