Title: Resampling Residuals: Robust Estimators of Error and Fit for Evolutionary Trees and Phylogenomics
ArXiv ID: 0912.5288
Date: 2009-12-31
Authors: Peter J. Waddell and Ariful Azad
📝 Abstract
Phylogenomics, even more so than traditional phylogenetics, needs to represent the uncertainty in evolutionary trees due to systematic error. Here we illustrate the analysis of genome-scale alignments of yeast, using robust measures of the additivity of the fit of distances to tree when using flexi Weighted Least Squares. A variety of DNA and protein distances are used. We explore the nature of the residuals, standardize them, and then create replicate data sets by resampling these residuals. Under the model, the results are shown to be very similar to the conventional sequence bootstrap. With real data they show up uncertainty in the tree that is either due to underestimating the stochastic error (hence massively overestimating the effective sequence length) and/or systematic error. The methods are extended to the very fast BME criterion with similarly promising results.
📄 Full Content
Waddell and Azad (2009). Residual Resampling of Phylogenetic Trees Page 1
Resampling Residuals: Robust Estimators of Error
and Fit for Evolutionary Trees and Phylogenomics
Peter J. Waddell¹,² and Ariful Azad²
¹Department of Biological Sciences, Purdue University, West Lafayette, IN 47906, U.S.A. ²Department of Computer Science, Purdue University, West Lafayette, IN 47906, U.S.A.
Phylogenomics, even more so than traditional phylogenetics, needs to represent the uncertainty in
evolutionary trees due to systematic error. Here we illustrate the analysis of genome-scale
alignments of yeast, using robust measures of the additivity of the fit of distances to tree when
using flexi Weighted Least Squares. A variety of DNA and protein distances are used. We
explore the nature of the residuals, standardize them, and then create replicate data sets by
resampling these residuals. Under the model, the results are shown to be very similar to the
conventional sequence bootstrap. With real data they show up uncertainty in the tree that is either
due to underestimating the stochastic error (hence massively overestimating the effective
sequence length) and/or systematic error. The methods are extended to the very fast BME
criterion with similarly promising results.
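The resampling procedure summarized above (fit distances to a tree by flexi Weighted Least Squares, standardize the residuals, then resample them to build replicate distance sets) can be sketched numerically. The toy below is illustrative only: the fixed four-taxon topology, the design matrix, the weights 1/d^P, and the standardization by d^(P/2) are assumptions of the sketch, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix for topology ((1,2),(3,4)): rows are taxon pairs
# (12, 13, 14, 23, 24, 34); columns are edges e1..e4 (external), e5 (internal).
A = np.array([
    [1, 1, 0, 0, 0],  # d12 = e1 + e2
    [1, 0, 1, 0, 1],  # d13 = e1 + e3 + e5
    [1, 0, 0, 1, 1],  # d14 = e1 + e4 + e5
    [0, 1, 1, 0, 1],  # d23 = e2 + e3 + e5
    [0, 1, 0, 1, 1],  # d24 = e2 + e4 + e5
    [0, 0, 1, 1, 0],  # d34 = e3 + e4
], dtype=float)

def fwls_fit(d, P=1.0):
    """Weighted LS edge lengths with weights 1/d**P (flexi-WLS style)."""
    w = 1.0 / d**P
    W = np.diag(w)
    b = np.linalg.solve(A.T @ W @ A, A.T @ W @ d)  # edge-length estimates
    return b, A @ b                                 # estimates, fitted distances

def resample_residuals(d, P=1.0, rng=rng):
    """One replicate distance vector built from resampled standardized residuals."""
    _, fitted = fwls_fit(d, P)
    scale = d**(P / 2.0)           # sd of d_ij assumed proportional to d_ij^(P/2)
    z = (d - fitted) / scale       # standardized residuals
    z_star = rng.choice(z, size=z.size, replace=True)  # resample with replacement
    return fitted + z_star * scale # replicate distances

d = np.array([0.30, 0.52, 0.55, 0.50, 0.53, 0.25])
b, fitted = fwls_fit(d)
rep = resample_residuals(d)
```

Repeating `resample_residuals` many times and re-estimating the tree from each replicate gives the residual-resampling analogue of bootstrap support described in the abstract; for an exactly additive input matrix every replicate reproduces the fitted distances, since all residuals are zero.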
“… his ignorance and almost doe-like naivety is keeping his mind receptive to a possible
solution.” A quotation from Kryten: Red Dwarf VIII-Cassandra
Keywords: Resampled Residuals, Sequence Bootstrap, Flexi Weighted Least Squares or fWLS
Phylogenetic Trees, Balanced Minimum Evolution BME, Rokas Yeast Genomes, Phylogenomics
1 Introduction
Phylogenetics and phylogenomics aim to recover the evolutionary trees by which
homologous parts of the genome evolved. Absent factors such as hybridization or horizontal gene
transfer, and if the coalescence time is small compared to edge length durations, then a species
tree should be a good description of the data. Phylogenetic methods should aim to not simply
build a tree, but inform the user of the fit of the data to the tree (hopefully relative to not only
other treatments of the data, but also other data sets) and also the reliability of the inferred tree
(e.g., Swofford et al. 1996). One consequence of fit is making a realistic assessment of how much
confidence to have that an edge (internode or less precisely a “branch”) in the reconstructed tree
was in the tree generating the data. For purposes such as divergence time estimation, with its
reliance on the weighted tree, a variance covariance matrix of the edge lengths in the tree is
similarly important (e.g., Waddell and Penny 1996, Thorne, Kishino and Painter 1998).
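The role of this variance-covariance matrix can be illustrated with a standard weighted least squares result: if edge lengths are estimated as b = (AᵀWA)⁻¹AᵀWd for a pairs-by-edges design matrix A, then Cov(b) ≈ (AᵀWA)⁻¹ when the distances have covariance W⁻¹. The four-taxon design matrix and weights below are illustrative, not taken from the paper.

```python
import numpy as np

# Pairs (12,13,14,23,24,34) x edges (4 external + 1 internal) for ((1,2),(3,4)).
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

d = np.array([0.20, 0.30, 0.30, 0.30, 0.30, 0.20])
w = 1.0 / d                        # fWLS-style weights with P = 1 (illustrative)
W = np.diag(w)

info = A.T @ W @ A                 # weighted information matrix
cov_edges = np.linalg.inv(info)    # approx. covariance of edge-length estimates
se_edges = np.sqrt(np.diag(cov_edges))  # standard errors of each edge length
```

Off-diagonal entries of `cov_edges` capture the correlations between edge-length estimates that divergence time methods need to propagate, rather than treating each edge error as independent.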
In terms of phylogenetic methods, it is often prescribed that there are three main methods,
parsimony, likelihood based and distance based. Many practitioners swear by one method over
the others (often for somewhat vague reasons), but in reality they often blend into each other and
it is in understanding the interrelationships that a more robust understanding of phylogenetics
emerges. One example of a bridging method is the Hadamard conjugation (Hendy and Penny
1993), which is a likelihood, distance and invariants method all at the same time (Waddell 1995,
Swofford et al. 1996). In practice also, the methods have quite different and complementary
strengths. For example, parsimony is intuitive and has a useful exploratory flavor to it (especially
when combined with software like Mesquite, http://mesquiteproject.org). Parsimony is also
relatively fast for moderate sized data sets. Likelihood-based methods (such as maximum
likelihood, ML, or marginal likelihood/Bayesian inference) offer detailed predictions of how the
data/model should relate to each other. These are generally computationally expensive methods
(Swofford et al. 1996), but tests of fit can be approximated (Waddell, Ota, and Penny 2009).
Distance-based methods directly use the fundamental property of additivity of pairwise distances
on the tree (Swofford et al. 1996). They also offer clear advantages in computational speed (or
simply computability) on very large data sets. For example, a balanced minimum evolution
(BME) criterion based method (e.g., Gascuel and Steel 2006) will now complete an SPR cycle
of tree search in O(t²) time (where t is the number of tips on the tree; Hordijk and Gascuel 2005).
This is phenomenally fast, since the number of sub-tree pruning and regrafting (SPR) tree
rearrangements to be considered is itself O(t²)!
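The BME criterion can be made concrete with Pauplin's formula (the basis of BME, as analyzed by Gascuel and Steel 2006): a topology is scored as Σ_{i<j} 2^(1−τ_ij) d_ij, where τ_ij is the number of edges on the path between tips i and j. The four-taxon tree and exactly additive distances below are illustrative; the check that the score recovers the true total edge length holds only because the input is additive.

```python
from itertools import combinations

def bme_length(d, tau):
    """Pauplin's BME tree-length estimate from pairwise distances d
    and topological (edge-count) path lengths tau."""
    return sum(2.0 ** (1 - tau[p]) * d[p] for p in d)

# Tree ((1,2),(3,4)): external edges 0.1 each, internal edge 0.05.
# Topological distances: cherries are 2 edges apart, cross pairs 3.
tau = {(0, 1): 2, (2, 3): 2, (0, 2): 3, (0, 3): 3, (1, 2): 3, (1, 3): 3}
# Additive distances implied by those edge lengths.
d = {(0, 1): 0.2, (2, 3): 0.2, (0, 2): 0.25, (0, 3): 0.25, (1, 2): 0.25, (1, 3): 0.25}

length = bme_length(d, tau)  # equals 0.4*0.1 + 0.05 = 0.45, the true total length
```

Because each pair's weight depends only on τ_ij, the score updates cheaply as subtrees move, which is what lets an SPR search over the roughly t² candidate rearrangements stay within the O(t²) cycle cost cited above.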
At present, in terms of fit of data to tree, there are few clear guides. Cladists have long
held that measures such as the consistency index are a useful guide to how “clean” a data set is,
but it is also well recognized that these are hard to calibrate for different sized data sets (e.g.,