pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Reading time: 6 minutes
...

📝 Original Info

  • Title: pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree
  • ArXiv ID: 1003.5943
  • Date: 2010-04-01
  • Authors: Frederick A Matsen, Robin B Kodner, E Virginia Armbrust

📝 Abstract

Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power of likelihood-based approaches to large data sets. This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence. Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is available as source code, binaries, and a web service.
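
The two uncertainty measures named in the abstract can be illustrated with a small sketch. This is not pplacer's implementation or API: the function names, the uniform prior over edges, and the use of a single edge-to-edge distance (rather than distances between exact attachment points) are all assumptions made for illustration. The first function turns per-edge log marginal likelihoods into normalized probabilities; the second computes an expected distance between placement locations as the probability-weighted sum of pairwise distances.

```python
import math
from typing import Dict, Tuple


def edge_posteriors(log_marginals: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-edge log marginal likelihoods into probabilities.

    Assumes a uniform prior over edges (an illustrative simplification).
    The maximum is subtracted before exponentiating to avoid underflow.
    """
    peak = max(log_marginals.values())
    weights = {edge: math.exp(v - peak) for edge, v in log_marginals.items()}
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()}


def expected_distance(
    posteriors: Dict[str, float],
    distances: Dict[Tuple[str, str], float],
) -> float:
    """Sum p_a * p_b * d(a, b) over ordered pairs of distinct edges.

    `distances` holds one symmetric tree distance per unordered edge pair.
    """
    total = 0.0
    for a, pa in posteriors.items():
        for b, pb in posteriors.items():
            if a == b:
                continue  # a placement is at distance zero from itself
            key = (a, b) if (a, b) in distances else (b, a)
            total += pa * pb * distances[key]
    return total


# Hypothetical query with three candidate edges: two near-equivalent
# neighbors (e1, e2) and one distant, much less likely edge (e3).
log_marginals = {"e1": -1000.0, "e2": -1000.0, "e3": -1005.0}
distances = {("e1", "e2"): 0.02, ("e1", "e3"): 0.5, ("e2", "e3"): 0.5}

posteriors = edge_posteriors(log_marginals)
spread = expected_distance(posteriors, distances)
```

In this toy case no single edge has a probability above one half, yet the expected distance stays small because almost all of the mass sits on two adjacent edges. That is the situation the abstract describes for a well-sampled reference tree, where per-edge probabilities alone would overstate the uncertainty.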

📄 Full Content

pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Frederick A Matsen∗1, Robin B Kodner2,3 and E Virginia Armbrust2

1 Departments of Integrative Biology, Mathematics, and Statistics, University of California, Berkeley, Berkeley, CA, USA
2 School of Oceanography, University of Washington, Seattle, Washington, USA
3 Friday Harbor Laboratories, University of Washington, Friday Harbor, Washington, USA

Email: Frederick A Matsen∗ - ematsen@gmail.com; Robin B Kodner - rkodner@u.washington.edu; E Virginia Armbrust - armbrust@u.washington.edu

∗Corresponding author

arXiv:1003.5943v1 [q-bio.PE] 30 Mar 2010

Abstract

Background: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. “Phylogenetic placement,” where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.

Results: This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.

Conclusions: Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.

Background

High-throughput pyrosequencing technologies have enabled the widespread use of metagenomics and metatranscriptomics in a variety of fields [1]. This technology has revolutionized the possibilities for unbiased surveys of environmental microbial diversity, ranging from the human gut to the open ocean [2–8]. The trade off for high throughput sequencing is that the resulting sequence reads can be short and come without information on organismal origin or read location within a genome.

The most common way of analyzing a metagenomic data set is to use BLAST [9] to assign a taxonomic name to each query sequence based on “reference” data of known origin. This strategy has its problems: when a query sequence is only distantly related to sequences in the database, BLAST can either err substantially by forcing a query into an alignment with a known sequence, or return an uninformatively broad collection of alignments. Furthermore, similarity statistics such as BLAST E-values can be difficult to interpret because they are dependent on fragment length and database size. Therefore it can be difficult to know if a given taxonomic assignment is correct unless a very clear “hit” is found.

Numerous tools have appeared that assign taxonomic information to query sequences, overcoming the shortcomings of BLAST. For example, MEGAN (MEtaGenome ANalyzer) [10] implements a common-ancestor algorithm on the NCBI taxonomy using BLAST scores. PhyloPythia [11], TACOA [12], and Phymm [13] use composition based methods to assign taxonomic information to metagenomic sequences. Recent tools can work with reads as short as 100bp.

Phylogeny offers an alternative and complementary means of understanding the evolutionary origin of query sequences. The presence of a query sequence on a certain branch of a tree gives precise information about the evolutionary relationship of that sequence to other sequences in the tree. For example, a query sequence placed deep in the tree can indicate how the query is distantly related to the other sequences in the tree, whereas the corresponding taxonomic name would simply indicate membership in a large taxonomic group. On the other hand, taxonomic names are key to obtaining functional information about organisms, and the most robust and comprehensive means of understanding the composition of unknown sequen

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
