pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Internal identifier: 000480 (Pmc/Corpus); previous: 000479; next: 000481

Authors: Frederick A. Matsen; Robin B. Kodner; E. Virginia Armbrust

Source:

BMC Bioinformatics [1471-2105]; 2010.

RBID : PMC:3098090

Abstract

Background

Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.

Results

This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.

Conclusions

Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.
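
The two uncertainty measures named in the Results can be illustrated with a small sketch. It is not pplacer's implementation: the function names, the toy edge labels, and the numbers are hypothetical, and the abstract does not state the exact normalization the software uses. The sketch assumes that each candidate edge has a log-likelihood for the query, that weights are obtained by normalizing the likelihoods across edges, and that the expected distance is the weight-averaged distance over all ordered pairs of candidate edges.

```python
import math
from itertools import product


def likelihood_weight_ratios(log_likelihoods):
    """Turn per-edge log-likelihoods into weights that sum to 1."""
    peak = max(log_likelihoods.values())  # subtract the max to avoid underflow
    scaled = {edge: math.exp(ll - peak) for edge, ll in log_likelihoods.items()}
    total = sum(scaled.values())
    return {edge: value / total for edge, value in scaled.items()}


def expected_distance(weights, distance):
    """Weight-averaged distance over all ordered pairs of candidate edges."""
    return sum(
        weights[a] * weights[b] * distance(a, b)
        for a, b in product(weights, repeat=2)
    )


# Hypothetical toy data: three candidate edges for one query read.
TOY_LOG_LIKELIHOODS = {
    "edge_1": -1000.0 + math.log(6.0),
    "edge_2": -1000.0 + math.log(3.0),
    "edge_3": -1000.0 + math.log(1.0),
}

# Hypothetical tree distances between the candidate placement locations.
TOY_DISTANCES = {
    ("edge_1", "edge_2"): 0.2,
    ("edge_1", "edge_3"): 1.0,
    ("edge_2", "edge_3"): 0.9,
}


def toy_distance(a, b):
    """Symmetric lookup into TOY_DISTANCES; distance to self is zero."""
    if a == b:
        return 0.0
    return TOY_DISTANCES.get((a, b), TOY_DISTANCES.get((b, a)))


weights = likelihood_weight_ratios(TOY_LOG_LIKELIHOODS)
print(weights)
print(expected_distance(weights, toy_distance))
```

A small expected distance means the plausible placements sit close together on the tree even when no single edge dominates, which is the situation the abstract describes for a well-sampled reference tree.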

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098090

DOI: 10.1186/1471-2105-11-538
PubMed: 21034504
PubMed Central: 3098090

Links to Exploration step

PMC:3098090

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree</title>
<author><name sortKey="Matsen, Frederick A" sort="Matsen, Frederick A" uniqKey="Matsen F" first="Frederick A" last="Matsen">Frederick A. Matsen</name>
<affiliation><nlm:aff id="I1">Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kodner, Robin B" sort="Kodner, Robin B" uniqKey="Kodner R" first="Robin B" last="Kodner">Robin B. Kodner</name>
<affiliation><nlm:aff id="I2">School of Oceanography, University of Washington, Seattle, Washington, USA</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I3">Friday Harbor Laboratories, University of Washington, Friday Harbor, Washington, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Armbrust, E Virginia" sort="Armbrust, E Virginia" uniqKey="Armbrust E" first="E Virginia" last="Armbrust">E Virginia Armbrust</name>
<affiliation><nlm:aff id="I2">School of Oceanography, University of Washington, Seattle, Washington, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">21034504</idno>
<idno type="pmc">3098090</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098090</idno>
<idno type="RBID">PMC:3098090</idno>
<idno type="doi">10.1186/1471-2105-11-538</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">000480</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree</title>
<author><name sortKey="Matsen, Frederick A" sort="Matsen, Frederick A" uniqKey="Matsen F" first="Frederick A" last="Matsen">Frederick A. Matsen</name>
<affiliation><nlm:aff id="I1">Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kodner, Robin B" sort="Kodner, Robin B" uniqKey="Kodner R" first="Robin B" last="Kodner">Robin B. Kodner</name>
<affiliation><nlm:aff id="I2">School of Oceanography, University of Washington, Seattle, Washington, USA</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I3">Friday Harbor Laboratories, University of Washington, Friday Harbor, Washington, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Armbrust, E Virginia" sort="Armbrust, E Virginia" uniqKey="Armbrust E" first="E Virginia" last="Armbrust">E Virginia Armbrust</name>
<affiliation><nlm:aff id="I2">School of Oceanography, University of Washington, Seattle, Washington, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.</p>
</sec>
<sec><title>Results</title>
<p>This paper introduces <monospace>pplacer</monospace>, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. <monospace>Pplacer</monospace> features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.</p>
</sec>
<sec><title>Conclusions</title>
<p><monospace>Pplacer</monospace> enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Margulies, M" uniqKey="Margulies M">M Margulies</name>
</author>
<author><name sortKey="Egholm, M" uniqKey="Egholm M">M Egholm</name>
</author>
<author><name sortKey="Altman, W" uniqKey="Altman W">W Altman</name>
</author>
<author><name sortKey="Attiya, S" uniqKey="Attiya S">S Attiya</name>
</author>
<author><name sortKey="Bader, J" uniqKey="Bader J">J Bader</name>
</author>
<author><name sortKey="Bemben, L" uniqKey="Bemben L">L Bemben</name>
</author>
<author><name sortKey="Berka, J" uniqKey="Berka J">J Berka</name>
</author>
<author><name sortKey="Braverman, M" uniqKey="Braverman M">M Braverman</name>
</author>
<author><name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author><name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Culley, A" uniqKey="Culley A">A Culley</name>
</author>
<author><name sortKey="Lang, A" uniqKey="Lang A">A Lang</name>
</author>
<author><name sortKey="Suttle, C" uniqKey="Suttle C">C Suttle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gill, S" uniqKey="Gill S">S Gill</name>
</author>
<author><name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author><name sortKey="Deboy, R" uniqKey="Deboy R">R DeBoy</name>
</author>
<author><name sortKey="Eckburg, P" uniqKey="Eckburg P">P Eckburg</name>
</author>
<author><name sortKey="Turnbaugh, P" uniqKey="Turnbaugh P">P Turnbaugh</name>
</author>
<author><name sortKey="Samuel, B" uniqKey="Samuel B">B Samuel</name>
</author>
<author><name sortKey="Gordon, J" uniqKey="Gordon J">J Gordon</name>
</author>
<author><name sortKey="Relman, D" uniqKey="Relman D">D Relman</name>
</author>
<author><name sortKey="Fraser Liggett, C" uniqKey="Fraser Liggett C">C Fraser-Liggett</name>
</author>
<author><name sortKey="Nelson, K" uniqKey="Nelson K">K Nelson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Venter, J" uniqKey="Venter J">J Venter</name>
</author>
<author><name sortKey="Remington, K" uniqKey="Remington K">K Remington</name>
</author>
<author><name sortKey="Heidelberg, J" uniqKey="Heidelberg J">J Heidelberg</name>
</author>
<author><name sortKey="Halpern, A" uniqKey="Halpern A">A Halpern</name>
</author>
<author><name sortKey="Rusch, D" uniqKey="Rusch D">D Rusch</name>
</author>
<author><name sortKey="Eisen, J" uniqKey="Eisen J">J Eisen</name>
</author>
<author><name sortKey="Wu, D" uniqKey="Wu D">D Wu</name>
</author>
<author><name sortKey="Paulsen, I" uniqKey="Paulsen I">I Paulsen</name>
</author>
<author><name sortKey="Nelson, K" uniqKey="Nelson K">K Nelson</name>
</author>
<author><name sortKey="Nelson, W" uniqKey="Nelson W">W Nelson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tringe, S" uniqKey="Tringe S">S Tringe</name>
</author>
<author><name sortKey="Rubin, E" uniqKey="Rubin E">E Rubin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Martin, H" uniqKey="Martin H">H Martín</name>
</author>
<author><name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
<author><name sortKey="Kunin, V" uniqKey="Kunin V">V Kunin</name>
</author>
<author><name sortKey="Warnecke, F" uniqKey="Warnecke F">F Warnecke</name>
</author>
<author><name sortKey="Barry, K" uniqKey="Barry K">K Barry</name>
</author>
<author><name sortKey="Mchardy, A" uniqKey="Mchardy A">A McHardy</name>
</author>
<author><name sortKey="Yeates, C" uniqKey="Yeates C">C Yeates</name>
</author>
<author><name sortKey="He, S" uniqKey="He S">S He</name>
</author>
<author><name sortKey="Salamov, A" uniqKey="Salamov A">A Salamov</name>
</author>
<author><name sortKey="Szeto, E" uniqKey="Szeto E">E Szeto</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Warnecke, F" uniqKey="Warnecke F">F Warnecke</name>
</author>
<author><name sortKey="Luginbuhl, P" uniqKey="Luginbuhl P">P Luginbühl</name>
</author>
<author><name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
<author><name sortKey="Ghassemian, M" uniqKey="Ghassemian M">M Ghassemian</name>
</author>
<author><name sortKey="Richardson, T" uniqKey="Richardson T">T Richardson</name>
</author>
<author><name sortKey="Stege, J" uniqKey="Stege J">J Stege</name>
</author>
<author><name sortKey="Cayouette, M" uniqKey="Cayouette M">M Cayouette</name>
</author>
<author><name sortKey="Mchardy, A" uniqKey="Mchardy A">A McHardy</name>
</author>
<author><name sortKey="Djord Jevic, G" uniqKey="Djord Jevic G">G Djord-jevic</name>
</author>
<author><name sortKey="Aboushadi, N" uniqKey="Aboushadi N">N Aboushadi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Baker, B" uniqKey="Baker B">B Baker</name>
</author>
<author><name sortKey="Banfield, J" uniqKey="Banfield J">J Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Altschul, S" uniqKey="Altschul S">S Altschul</name>
</author>
<author><name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author><name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author><name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
<author><name sortKey="Lipman, D" uniqKey="Lipman D">D Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Huson, D" uniqKey="Huson D">D Huson</name>
</author>
<author><name sortKey="Auch, A" uniqKey="Auch A">A Auch</name>
</author>
<author><name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author><name sortKey="Schuster, S" uniqKey="Schuster S">S Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mchardy, A" uniqKey="Mchardy A">A McHardy</name>
</author>
<author><name sortKey="Martin, H" uniqKey="Martin H">H Martín</name>
</author>
<author><name sortKey="Tsirigos, A" uniqKey="Tsirigos A">A Tsirigos</name>
</author>
<author><name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author><name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I Rigoutsos</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Diaz, N" uniqKey="Diaz N">N Diaz</name>
</author>
<author><name sortKey="Krause, L" uniqKey="Krause L">L Krause</name>
</author>
<author><name sortKey="Goesmann, A" uniqKey="Goesmann A">A Goesmann</name>
</author>
<author><name sortKey="Niehaus, K" uniqKey="Niehaus K">K Niehaus</name>
</author>
<author><name sortKey="Nattkemper, T" uniqKey="Nattkemper T">T Nattkemper</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Brady, A" uniqKey="Brady A">A Brady</name>
</author>
<author><name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Allman, E" uniqKey="Allman E">E Allman</name>
</author>
<author><name sortKey="Rhodes, J" uniqKey="Rhodes J">J Rhodes</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Allman, E" uniqKey="Allman E">E Allman</name>
</author>
<author><name sortKey="Rhodes, J" uniqKey="Rhodes J">J Rhodes</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Shimodaira, H" uniqKey="Shimodaira H">H Shimodaira</name>
</author>
<author><name sortKey="Hasegawa, M" uniqKey="Hasegawa M">M Hasegawa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, Z" uniqKey="Yang Z">Z Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Le, S" uniqKey="Le S">S Le</name>
</author>
<author><name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J Felsenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chor, B" uniqKey="Chor B">B Chor</name>
</author>
<author><name sortKey="Tuller, T" uniqKey="Tuller T">T Tuller</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Roch, S" uniqKey="Roch S">S Roch</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Guindon, S" uniqKey="Guindon S">S Guindon</name>
</author>
<author><name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zwickl, D" uniqKey="Zwickl D">D Zwickl</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Price, Mn" uniqKey="Price M">MN Price</name>
</author>
<author><name sortKey="Dehal, Ps" uniqKey="Dehal P">PS Dehal</name>
</author>
<author><name sortKey="Arkin, Ap" uniqKey="Arkin A">AP Arkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Steel, M" uniqKey="Steel M">M Steel</name>
</author>
<author><name sortKey="Szekely, L" uniqKey="Szekely L">L Székely</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Moret, B" uniqKey="Moret B">B Moret</name>
</author>
<author><name sortKey="Roshan, U" uniqKey="Roshan U">U Roshan</name>
</author>
<author><name sortKey="Warnow, T" uniqKey="Warnow T">T Warnow</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Berger, S" uniqKey="Berger S">S Berger</name>
</author>
<author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Margulies, M" uniqKey="Margulies M">M Margulies</name>
</author>
<author><name sortKey="Egholm, M" uniqKey="Egholm M">M Egholm</name>
</author>
<author><name sortKey="Altman, W" uniqKey="Altman W">W Altman</name>
</author>
<author><name sortKey="Attiya, S" uniqKey="Attiya S">S Attiya</name>
</author>
<author><name sortKey="Bader, J" uniqKey="Bader J">J Bader</name>
</author>
<author><name sortKey="Bemben, L" uniqKey="Bemben L">L Bemben</name>
</author>
<author><name sortKey="Berka, J" uniqKey="Berka J">J Berka</name>
</author>
<author><name sortKey="Braverman, M" uniqKey="Braverman M">M Braverman</name>
</author>
<author><name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author><name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mardis, E" uniqKey="Mardis E">E Mardis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lemmon, A" uniqKey="Lemmon A">A Lemmon</name>
</author>
<author><name sortKey="Brown, J" uniqKey="Brown J">J Brown</name>
</author>
<author><name sortKey="Stanger Hall, K" uniqKey="Stanger Hall K">K Stanger-Hall</name>
</author>
<author><name sortKey="Lemmon, E" uniqKey="Lemmon E">E Lemmon</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Mooers, A" uniqKey="Mooers A">A Mooers</name>
</author>
<author><name sortKey="Heard, S" uniqKey="Heard S">S Heard</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lozupone, C" uniqKey="Lozupone C">C Lozupone</name>
</author>
<author><name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kluge, A" uniqKey="Kluge A">A Kluge</name>
</author>
<author><name sortKey="Farris, J" uniqKey="Farris J">J Farris</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J Felsenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Monier, A" uniqKey="Monier A">A Monier</name>
</author>
<author><name sortKey="Claverie, J" uniqKey="Claverie J">J Claverie</name>
</author>
<author><name sortKey="Ogata, H" uniqKey="Ogata H">H Ogata</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author><name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author><name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
<author><name sortKey="Tringe, S" uniqKey="Tringe S">S Tringe</name>
</author>
<author><name sortKey="Doerks, T" uniqKey="Doerks T">T Doerks</name>
</author>
<author><name sortKey="Jensen, L" uniqKey="Jensen L">L Jensen</name>
</author>
<author><name sortKey="Ward, N" uniqKey="Ward N">N Ward</name>
</author>
<author><name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kosakovsky, P" uniqKey="Kosakovsky P">P Kosakovsky</name>
</author>
<author><name sortKey="Posada, D" uniqKey="Posada D">D Posada</name>
</author>
<author><name sortKey="Stawiski, E" uniqKey="Stawiski E">E Stawiski</name>
</author>
<author><name sortKey="Chappey, C" uniqKey="Chappey C">C Chappey</name>
</author>
<author><name sortKey="Poon, A" uniqKey="Poon A">A Poon</name>
</author>
<author><name sortKey="Hughes, G" uniqKey="Hughes G">G Hughes</name>
</author>
<author><name sortKey="Fearnhill, E" uniqKey="Fearnhill E">E Fearnhill</name>
</author>
<author><name sortKey="Gravenor, M" uniqKey="Gravenor M">M Gravenor</name>
</author>
<author><name sortKey="Leigh, B" uniqKey="Leigh B">B Leigh</name>
</author>
<author><name sortKey="Frost, S" uniqKey="Frost S">S Frost</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zwickl, D" uniqKey="Zwickl D">D Zwickl</name>
</author>
<author><name sortKey="Hillis, D" uniqKey="Hillis D">D Hillis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cueto, M" uniqKey="Cueto M">M Cueto</name>
</author>
<author><name sortKey="Matsen, F" uniqKey="Matsen F">F Matsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Munch, K" uniqKey="Munch K">K Munch</name>
</author>
<author><name sortKey="Boomsma, W" uniqKey="Boomsma W">W Boomsma</name>
</author>
<author><name sortKey="Willerslev, E" uniqKey="Willerslev E">E Willerslev</name>
</author>
<author><name sortKey="Nielsen, R" uniqKey="Nielsen R">R Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Drummond, A" uniqKey="Drummond A">A Drummond</name>
</author>
<author><name sortKey="Rambaut, A" uniqKey="Rambaut A">A Rambaut</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Huelsenbeck, Jp" uniqKey="Huelsenbeck J">JP Huelsenbeck</name>
</author>
<author><name sortKey="Ronquist, F" uniqKey="Ronquist F">F Ronquist</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Whelan, S" uniqKey="Whelan S">S Whelan</name>
</author>
<author><name sortKey="Goldman, N" uniqKey="Goldman N">N Goldman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Han, M" uniqKey="Han M">M Han</name>
</author>
<author><name sortKey="Zmasek, C" uniqKey="Zmasek C">C Zmasek</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zurawski, G" uniqKey="Zurawski G">G Zurawski</name>
</author>
<author><name sortKey="Bohnert, H" uniqKey="Bohnert H">H Bohnert</name>
</author>
<author><name sortKey="Whitfeld, P" uniqKey="Whitfeld P">P Whitfeld</name>
</author>
<author><name sortKey="Bottomley, W" uniqKey="Bottomley W">W Bottomley</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zeidner, G" uniqKey="Zeidner G">G Zeidner</name>
</author>
<author><name sortKey="Preston, C" uniqKey="Preston C">C Preston</name>
</author>
<author><name sortKey="Delong, E" uniqKey="Delong E">E Delong</name>
</author>
<author><name sortKey="Massana, R" uniqKey="Massana R">R Massana</name>
</author>
<author><name sortKey="Post, A" uniqKey="Post A">A Post</name>
</author>
<author><name sortKey="Scanlan, D" uniqKey="Scanlan D">D Scanlan</name>
</author>
<author><name sortKey="Beja, O" uniqKey="Beja O">O Beja</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sullivan, M" uniqKey="Sullivan M">M Sullivan</name>
</author>
<author><name sortKey="Lindell, D" uniqKey="Lindell D">D Lindell</name>
</author>
<author><name sortKey="Lee, J" uniqKey="Lee J">J Lee</name>
</author>
<author><name sortKey="Thompson, L" uniqKey="Thompson L">L Thompson</name>
</author>
<author><name sortKey="Bielawski, J" uniqKey="Bielawski J">J Bielawski</name>
</author>
<author><name sortKey="Chisholm, S" uniqKey="Chisholm S">S Chisholm</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Millard, A" uniqKey="Millard A">A Millard</name>
</author>
<author><name sortKey="Clokie, M" uniqKey="Clokie M">M Clokie</name>
</author>
<author><name sortKey="Shub, D" uniqKey="Shub D">D Shub</name>
</author>
<author><name sortKey="Mann, N" uniqKey="Mann N">N Mann</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lindell, D" uniqKey="Lindell D">D Lindell</name>
</author>
<author><name sortKey="Jaffe, J" uniqKey="Jaffe J">J Jaffe</name>
</author>
<author><name sortKey="Coleman, M" uniqKey="Coleman M">M Coleman</name>
</author>
<author><name sortKey="Futschik, M" uniqKey="Futschik M">M Futschik</name>
</author>
<author><name sortKey="Axmann, I" uniqKey="Axmann I">I Axmann</name>
</author>
<author><name sortKey="Rector, T" uniqKey="Rector T">T Rector</name>
</author>
<author><name sortKey="Kettler, G" uniqKey="Kettler G">G Kettler</name>
</author>
<author><name sortKey="Sullivan, M" uniqKey="Sullivan M">M Sullivan</name>
</author>
<author><name sortKey="Steen, R" uniqKey="Steen R">R Steen</name>
</author>
<author><name sortKey="Hess, W" uniqKey="Hess W">W Hess</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chenard, C" uniqKey="Chenard C">C Chenard</name>
</author>
<author><name sortKey="Suttle, C" uniqKey="Suttle C">C Suttle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Williamson, S" uniqKey="Williamson S">S Williamson</name>
</author>
<author><name sortKey="Rusch, D" uniqKey="Rusch D">D Rusch</name>
</author>
<author><name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
<author><name sortKey="Halpern, A" uniqKey="Halpern A">A Halpern</name>
</author>
<author><name sortKey="Heidelberg, K" uniqKey="Heidelberg K">K Heidelberg</name>
</author>
<author><name sortKey="Glass, J" uniqKey="Glass J">J Glass</name>
</author>
<author><name sortKey="Andrews Pfannkoch, C" uniqKey="Andrews Pfannkoch C">C Andrews-Pfannkoch</name>
</author>
<author><name sortKey="Fadrosh, D" uniqKey="Fadrosh D">D Fadrosh</name>
</author>
<author><name sortKey="Miller, C" uniqKey="Miller C">C Miller</name>
</author>
<author><name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sharon, I" uniqKey="Sharon I">I Sharon</name>
</author>
<author><name sortKey="Tzahor, S" uniqKey="Tzahor S">S Tzahor</name>
</author>
<author><name sortKey="Williamson, S" uniqKey="Williamson S">S Williamson</name>
</author>
<author><name sortKey="Shmoish, M" uniqKey="Shmoish M">M Shmoish</name>
</author>
<author><name sortKey="Man Aharonovich, D" uniqKey="Man Aharonovich D">D Man-Aharonovich</name>
</author>
<author><name sortKey="Rusch, D" uniqKey="Rusch D">D Rusch</name>
</author>
<author><name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
<author><name sortKey="Zeidner, G" uniqKey="Zeidner G">G Zeidner</name>
</author>
<author><name sortKey="Golden, S" uniqKey="Golden S">S Golden</name>
</author>
<author><name sortKey="Mackey, S" uniqKey="Mackey S">S Mackey</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Eddy, S" uniqKey="Eddy S">S Eddy</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tatusov, R" uniqKey="Tatusov R">R Tatusov</name>
</author>
<author><name sortKey="Galperin, M" uniqKey="Galperin M">M Galperin</name>
</author>
<author><name sortKey="Natale, D" uniqKey="Natale D">D Natale</name>
</author>
<author><name sortKey="Koonin, E" uniqKey="Koonin E">E Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stark, M" uniqKey="Stark M">M Stark</name>
</author>
<author><name sortKey="Berger, S" uniqKey="Berger S">S Berger</name>
</author>
<author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
<author><name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Krause, L" uniqKey="Krause L">L Krause</name>
</author>
<author><name sortKey="Diaz, N" uniqKey="Diaz N">N Diaz</name>
</author>
<author><name sortKey="Goesmann, A" uniqKey="Goesmann A">A Goesmann</name>
</author>
<author><name sortKey="Kelley, S" uniqKey="Kelley S">S Kelley</name>
</author>
<author><name sortKey="Nattkemper, T" uniqKey="Nattkemper T">T Nattkemper</name>
</author>
<author><name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author><name sortKey="Edwards, R" uniqKey="Edwards R">R Edwards</name>
</author>
<author><name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Munch, K" uniqKey="Munch K">K Munch</name>
</author>
<author><name sortKey="Boomsma, W" uniqKey="Boomsma W">W Boomsma</name>
</author>
<author><name sortKey="Huelsenbeck, J" uniqKey="Huelsenbeck J">J Huelsenbeck</name>
</author>
<author><name sortKey="Willerslev, E" uniqKey="Willerslev E">E Willerslev</name>
</author>
<author><name sortKey="Nielsen, R" uniqKey="Nielsen R">R Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J Felsenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schmidt, H" uniqKey="Schmidt H">H Schmidt</name>
</author>
<author><name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
<author><name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
<author><name sortKey="Von Haeseler, A" uniqKey="Von Haeseler A">A von Haeseler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kishino, H" uniqKey="Kishino H">H Kishino</name>
</author>
<author><name sortKey="Miyata, T" uniqKey="Miyata T">T Miyata</name>
</author>
<author><name sortKey="Hasegawa, M" uniqKey="Hasegawa M">M Hasegawa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
<author><name sortKey="Rambaut, A" uniqKey="Rambaut A">A Rambaut</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, M" uniqKey="Wu M">M Wu</name>
</author>
<author><name sortKey="Eisen, J" uniqKey="Eisen J">J Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
<author><name sortKey="Komornik, Z" uniqKey="Komornik Z">Z Komornik</name>
</author>
<author><name sortKey="Berger, S" uniqKey="Berger S">S Berger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Evans, S" uniqKey="Evans S">S Evans</name>
</author>
<author><name sortKey="Matsen, F" uniqKey="Matsen F">F Matsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lozupone, C" uniqKey="Lozupone C">C Lozupone</name>
</author>
<author><name sortKey="Hamady, M" uniqKey="Hamady M">M Hamady</name>
</author>
<author><name sortKey="Kelley, S" uniqKey="Kelley S">S Kelley</name>
</author>
<author><name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Turnbaugh, P" uniqKey="Turnbaugh P">P Turnbaugh</name>
</author>
<author><name sortKey="Hamady, M" uniqKey="Hamady M">M Hamady</name>
</author>
<author><name sortKey="Yatsunenko, T" uniqKey="Yatsunenko T">T Yatsunenko</name>
</author>
<author><name sortKey="Cantarel, B" uniqKey="Cantarel B">B Cantarel</name>
</author>
<author><name sortKey="Duncan, A" uniqKey="Duncan A">A Duncan</name>
</author>
<author><name sortKey="Ley, R" uniqKey="Ley R">R Ley</name>
</author>
<author><name sortKey="Sogin, M" uniqKey="Sogin M">M Sogin</name>
</author>
<author><name sortKey="Jones, W" uniqKey="Jones W">W Jones</name>
</author>
<author><name sortKey="Roe, B" uniqKey="Roe B">B Roe</name>
</author>
<author><name sortKey="Affourtit, J" uniqKey="Affourtit J">J Affourtit</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Drummond, A" uniqKey="Drummond A">A Drummond</name>
</author>
<author><name sortKey="Ashton, B" uniqKey="Ashton B">B Ashton</name>
</author>
<author><name sortKey="Cheung, M" uniqKey="Cheung M">M Cheung</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">21034504</article-id>
<article-id pub-id-type="pmc">3098090</article-id>
<article-id pub-id-type="publisher-id">1471-2105-11-538</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-11-538</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes" id="A1"><name><surname>Matsen</surname>
<given-names>Frederick A</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>matsen@fhcrc.org</email>
</contrib>
<contrib contrib-type="author" id="A2"><name><surname>Kodner</surname>
<given-names>Robin B</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>rkodner@u.washington.edu</email>
</contrib>
<contrib contrib-type="author" id="A3"><name><surname>Armbrust</surname>
<given-names>E Virginia</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>armbrust@u.washington.edu</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA</aff>
<aff id="I2"><label>2</label>
School of Oceanography, University of Washington, Seattle, Washington, USA</aff>
<aff id="I3"><label>3</label>
Friday Harbor Laboratories, University of Washington, Friday Harbor, Washington, USA</aff>
<pub-date pub-type="collection"><year>2010</year>
</pub-date>
<pub-date pub-type="epub"><day>30</day>
<month>10</month>
<year>2010</year>
</pub-date>
<volume>11</volume>
<fpage>538</fpage>
<lpage>538</lpage>
<history><date date-type="received"><day>18</day>
<month>3</month>
<year>2010</year>
</date>
<date date-type="accepted"><day>30</day>
<month>10</month>
<year>2010</year>
</date>
</history>
<permissions><copyright-statement>Copyright ©2010 Matsen et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2010</copyright-year>
<copyright-holder>Matsen et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/11/538"></self-uri>
<abstract><sec><title>Background</title>
<p>Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets.</p>
</sec>
<sec><title>Results</title>
<p>This paper introduces <monospace>pplacer</monospace>
, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. <monospace>Pplacer</monospace>
 features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.</p>
</sec>
<sec><title>Conclusions</title>
<p><monospace>Pplacer</monospace>
 enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.</p>
</sec>
</abstract>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>High-throughput pyrosequencing technologies have enabled the widespread use of metagenomics and metatranscriptomics in a variety of fields [<xref ref-type="bibr" rid="B1">1</xref>
]. This technology has revolutionized the possibilities for unbiased surveys of environmental microbial diversity, ranging from the human gut to the open ocean [<xref ref-type="bibr" rid="B2">2</xref>
-<xref ref-type="bibr" rid="B8">8</xref>
]. The trade-off for high-throughput sequencing is that the resulting sequence reads can be short and come without information on organismal origin or read location within a genome.</p>
<p>The most common way of analyzing a metagenomic data set is to use BLAST [<xref ref-type="bibr" rid="B9">9</xref>
] to assign a taxonomic name to each query sequence based on "reference" data of known origin. This strategy has its problems: when a query sequence is only distantly related to sequences in the database, BLAST can either err substantially by forcing a query into an alignment with a known sequence, or return an uninformatively broad collection of alignments. Furthermore, similarity statistics such as BLAST <italic>E</italic>
-values can be difficult to interpret because they are dependent on fragment length and database size. Therefore it can be difficult to know if a given taxonomic assignment is correct unless a very clear "hit" is found.</p>
<p>Numerous tools have appeared that assign taxonomic information to query sequences, overcoming the shortcomings of BLAST. For example, MEGAN (MEtaGenome ANalyzer) [<xref ref-type="bibr" rid="B10">10</xref>
] implements a common-ancestor algorithm on the NCBI taxonomy using BLAST scores. PhyloPythia [<xref ref-type="bibr" rid="B11">11</xref>
], TACOA [<xref ref-type="bibr" rid="B12">12</xref>
], and Phymm [<xref ref-type="bibr" rid="B13">13</xref>
] use composition-based methods to assign taxonomic information to metagenomic sequences. Recent tools can work with reads as short as 100 bp.</p>
<p>Phylogeny offers an alternative and complementary means of understanding the evolutionary origin of query sequences. The presence of a query sequence on a certain branch of a tree gives precise information about the evolutionary relationship of that sequence to other sequences in the tree. For example, a query sequence placed deep in the tree can indicate <italic>how </italic>
the query is distantly related to the other sequences in the tree, whereas the corresponding taxonomic name would simply indicate membership in a large taxonomic group. On the other hand, taxonomic names are key to obtaining functional information about organisms, and the most robust and comprehensive means of understanding the provenance of unknown sequences will derive from both taxonomic and phylogenetic sources.</p>
<p>Likelihood-based phylogenetics, with over 30 years of theoretical and practical development, is a sophisticated tool for the evolutionary analysis of sequence data. It has well-developed statistical foundations for inference [<xref ref-type="bibr" rid="B14">14</xref>
,<xref ref-type="bibr" rid="B15">15</xref>
], tests for uncertainty estimation [<xref ref-type="bibr" rid="B16">16</xref>
], and sophisticated evolutionary models [<xref ref-type="bibr" rid="B17">17</xref>
,<xref ref-type="bibr" rid="B18">18</xref>
]. In contrast to distance-based methods, likelihood-based methods can use both low and high variation regions of an alignment to provide resolution at different levels of a phylogenetic tree [<xref ref-type="bibr" rid="B19">19</xref>
].</p>
<p>Traditional likelihood-based phylogenetics approaches are not always appropriate for analyzing the data from metagenomic and metatranscriptomic studies. The first challenge is that of complexity: the maximum likelihood phylogenetics problem is NP-hard [<xref ref-type="bibr" rid="B20">20</xref>
,<xref ref-type="bibr" rid="B21">21</xref>
] and thus maximum likelihood trees cannot be found in a practical amount of time with many taxa. A remarkable amount of progress has been made in approximate acceleration heuristics [<xref ref-type="bibr" rid="B22">22</xref>
-<xref ref-type="bibr" rid="B25">25</xref>
], but accurate maximum likelihood inference for hundreds of thousands of taxa remains out of reach.</p>
<p>Second, accurate phylogenetic inference is not possible with fixed length sequences in the limit of a large number of taxa. This can be seen via theory [<xref ref-type="bibr" rid="B26">26</xref>
], where lower bounds on sequence length can be derived as an increasing function of the number of taxa. It can also be seen in simulation [<xref ref-type="bibr" rid="B27">27</xref>
], where one can directly observe the growth of needed sequence length. Such problems can also be observed in real data where insufficient sequence length for a large number of taxa is manifested as a large collection of trees similar in terms of likelihood [<xref ref-type="bibr" rid="B28">28</xref>
]; statistical tools can aid in the diagnosis of such situations [<xref ref-type="bibr" rid="B16">16</xref>
].</p>
<p>The lack of signal problem is especially pronounced when using contemporary sequencing methods that produce a large number of short reads. Some methodologies, such as 454 [<xref ref-type="bibr" rid="B29">29</xref>
], will soon be producing sequence in the 600-800 bp range, which is sufficient for classical phylogenetic inference on a moderate number of taxa. However, there is considerable interest in using massively parallel methodologies such as SOLiD and Illumina which produce hundreds of millions of short reads at low cost [<xref ref-type="bibr" rid="B30">30</xref>
]. Signal problems are further exacerbated by shotgun sequencing methodology where the sequenced position is randomly distributed over a given gene. Applying classical maximum-likelihood phylogeny to a single alignment of shotgun reads together with full-length reference sequences can lead to artifactual grouping of short reads based on the read position in the alignment; such grouping is not a surprise given that non-sequenced regions are treated as missing data (see, e.g. [<xref ref-type="bibr" rid="B19">19</xref>
,<xref ref-type="bibr" rid="B31">31</xref>
]).</p>
<p>A third problem is deriving meaningful information from large trees. Although significant progress has been made in visualizing trees with thousands of taxa [<xref ref-type="bibr" rid="B32">32</xref>
,<xref ref-type="bibr" rid="B33">33</xref>
], understanding the similarities and differences between such trees is inherently difficult. In a setting with many samples, constructing one tree per sample requires comparing trees with disjoint sets of taxa; such comparisons can only be done in terms of tree shape [<xref ref-type="bibr" rid="B34">34</xref>
]. Alternatively, phylogenetic trees can be constructed on pairs of environments at a time, then comparison software such as UniFrac [<xref ref-type="bibr" rid="B35">35</xref>
] can be used to derive distances between them, but the lack of a unifying phylogenetic framework hampers the analysis of a large collection of samples.</p>
<p>"Phylogenetic placement" has emerged in the last several years as an alternative way to gain an evolutionary understanding of sequence data from a large collection of taxa. The input of a phylogenetic placement algorithm consists of a reference tree, a reference alignment, and a collection of query sequences. The result of a phylogenetic placement algorithm is a collection of assignments of query sequences to the tree, one assignment for each query (or more than one when placement location is uncertain). Phylogenetic placement is a simplified version of phylogenetic tree reconstruction by sequential insertion [<xref ref-type="bibr" rid="B36">36</xref>
,<xref ref-type="bibr" rid="B37">37</xref>
]. It has been gaining in popularity, with recent implementations in 2008 [<xref ref-type="bibr" rid="B38">38</xref>
,<xref ref-type="bibr" rid="B39">39</xref>
], and more efficient implementations in this paper and by Berger and Stamatakis [<xref ref-type="bibr" rid="B28">28</xref>
]. A recent HIV subtype classification scheme [<xref ref-type="bibr" rid="B40">40</xref>
] is also a type of phylogenetic placement algorithm that allows the potential for recombination in query sequences.</p>
<p>Phylogenetic placement sidesteps many of the problems associated with applying traditional phylogenetics algorithms to large, environmentally derived sequence data. Computation is significantly simplified, resulting in algorithms that can place thousands to tens of thousands of query sequences per hour per processor into a reference tree on a thousand taxa. Because computation is performed on each query sequence individually, the calculation can be readily parallelized. The relationships between the query sequences are not investigated, which reduces the number of phylogenetic hypotheses from exponential to linear. Short and/or non-overlapping query sequences pose less of a problem, as query sequences are compared to the full-length reference sequences. Visualization of samples and comparison between samples are facilitated by the assumption of a reference tree, which can be drawn in a way that shows the location of reads.</p>
<p>Phylogenetic placement is not a substitute for traditional phylogenetic analysis, but rather an approximate tool when handling a large number of sequences. Importantly, the addition of a taxon <italic>x </italic>
to a phylogenetic data set on taxa <italic>S </italic>
can lead to re-evaluation of the phylogenetic tree on <italic>S</italic>
; this is the essence of the taxon sampling debate [<xref ref-type="bibr" rid="B41">41</xref>
] and has recently been the subject of mathematical investigation [<xref ref-type="bibr" rid="B42">42</xref>
]. This problem can be mitigated by the judicious selection of reference taxa and the use of well-supported phylogenetic trees. The error resulting from the assumption of a fixed phylogenetic reference tree will be smaller than that when using an assumed taxonomy such as the commonly used NCBI taxonomy, which forms a reference tree of sorts for a number of popular methods currently in use [<xref ref-type="bibr" rid="B10">10</xref>
,<xref ref-type="bibr" rid="B43">43</xref>
]. Phylogenetic placement, in contrast, is done on a gene-by-gene basis and can thus accommodate the variability in the evolutionary history of different genes, which may include gene duplication, horizontal transfer, and loss.</p>
<p>This paper describes <monospace>pplacer</monospace>
, software developed to perform phylogenetic placement with linear time and memory complexity in each relevant parameter: number of reference sequences, number of query sequences, and sequence length. <monospace>Pplacer</monospace>
 was developed to be user-friendly, and its design facilitates integration into metagenomic analysis pipelines. It has a number of distinctive features. First, it is unique among phylogenetic placement software in its ability to evaluate the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. Second, <monospace>pplacer</monospace>
 enables calculation of the expected distance between placement locations for each query sequence; this development is crucial for uncertainty estimation in regions of the tree consisting of many short branches, where the placement edge may be uncertain although the correct placement region in the tree may be relatively clear. Third, <monospace>pplacer</monospace>
 can display both the number of placements on an edge and the uncertainty of those placements on a single tree (Figure <xref ref-type="fig" rid="F1">1</xref>
). Such visualizations can be used to understand if placement uncertainty is a significant problem for downstream analysis and to identify problematic parts of the tree. Fourth, the <monospace>pplacer</monospace>
 software package includes utilities to ease large-scale analysis and sorting of the query alignment based on placement location. These programs are available in GPLv3-licensed code and binary form at <ext-link ext-link-type="uri" xlink:href="http://matsen.fhcrc.org/pplacer/">http://matsen.fhcrc.org/pplacer/</ext-link>
, which also includes a web portal for running <monospace>pplacer</monospace>
 and for visualizing placement results.</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p><bold>Example application, showing uncertainty</bold>
. <monospace>Pplacer</monospace>
 example application using <italic>psbA </italic>
reference sequences and the corresponding recruited Global Ocean Sampling [<xref ref-type="bibr" rid="B4">4</xref>
] (GOS) sequences showing both number of placements and their uncertainty. Branch thickness is a linear function of the log-transformed number of placements on that edge, and branch color represents average uncertainty (more red implies more uncertain, with yellow denoting EDPL above a user-defined limit). The upper panel shows the <italic>Prochlorococcus </italic>
clade of the tree. The lower panel shows a portion of the tree with substantial uncertainty using the EDPL metric. <monospace>Placeviz</monospace>
 output viewed using Archaeopteryx [<xref ref-type="bibr" rid="B32">32</xref>
].</p>
</caption>
<graphic xlink:href="1471-2105-11-538-1"></graphic>
</fig>
<p>To validate <monospace>pplacer</monospace>
's phylogenetic placement algorithm we implemented a framework that simulates reads from real alignments and tests <monospace>pplacer</monospace>
's ability to place the read in the correct location. As described below, a primary focus of this effort is a simulation study of 631 COG alignments, where 10 reads were simulated from each taxon of each alignment, placed on their respective trees, and evaluated for accuracy. These tests confirm both that <monospace>pplacer</monospace>
 places reads accurately and that the posterior probability and the likelihood weight ratio (described below) both do a good job of indicating whether a placement can be trusted or not. We also use these simulations to understand how the distance to sister taxon impacts placement accuracy.</p>
</sec>
<sec><title>Results</title>
<sec><title><bold>Overview of phylogenetic placement using </bold>
<monospace>pplacer</monospace>
</title>
<p><monospace>Pplacer</monospace>
 places query sequences in a fixed reference phylogeny according to phylogenetic posterior probability and maximum likelihood criteria. In Bayesian mode, <monospace>pplacer</monospace>
 evaluates the posterior probability of a fragment placement on an edge conditioned on the reference tree topology and branch lengths. The posterior probability has a clear statistical interpretation as the probability that the fragment is correctly placed on that edge, assuming the reference tree, the alignment, and the priors on pendant branch length. Because the reference tree is fixed, direct numerical quadrature over the likelihood function can be performed to obtain the posterior probability rather than relying on Markov chain Monte-Carlo procedures as is typically done in phylogenetics [<xref ref-type="bibr" rid="B44">44</xref>
,<xref ref-type="bibr" rid="B45">45</xref>
]. In maximum likelihood (ML) mode, <monospace>pplacer</monospace>
 evaluates the "likelihood weight ratio" [<xref ref-type="bibr" rid="B39">39</xref>
], i.e. the ML likelihood values across all placement locations normalized to sum to one.</p>
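<p>As an illustration of the likelihood weight ratio, the following Python sketch normalizes a set of per-edge ML log-likelihoods so that the resulting weights sum to one. It is not taken from the <monospace>pplacer</monospace> source: the function name, the edge labels, and the example log-likelihood values are hypothetical, and subtracting the maximum before exponentiation is a standard numerical safeguard that we assume here rather than a documented detail of the implementation.</p>
<preformat>
```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Normalize per-edge ML log-likelihoods so that the weights sum to one.

    log_likelihoods -- dict mapping an edge label to the best (ML)
                       log-likelihood found for a placement on that edge
    """
    # Shift by the maximum so that exponentiation does not underflow
    # for the large negative log-likelihoods typical of phylogenetics.
    peak = max(log_likelihoods.values())
    unnormalized = {
        edge: math.exp(ll - peak) for edge, ll in log_likelihoods.items()
    }
    total = sum(unnormalized.values())
    return {edge: weight / total for edge, weight in unnormalized.items()}


# Hypothetical log-likelihoods for three candidate edges.
example = likelihood_weight_ratios(
    {"edge_a": -1000.0, "edge_b": -1001.0, "edge_c": -1005.0}
)
```
</preformat>
<p>With these hypothetical values the edge with the highest likelihood receives the largest weight, and a difference of one log-likelihood unit between two edges corresponds to a weight ratio of <italic>e</italic>.</p>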
<p>Because the reference tree is fixed with respect to topology and branch length, only two tree traversals are needed to precompute all of the information needed from the reference tree. From there all likelihood computation is performed on a collection of three-taxon trees, the number of which is linear in the number of reference taxa. Therefore the fragment placement component of our algorithm has linear time and space complexity in the number of taxa <italic>n </italic>
in the reference tree (Figures <xref ref-type="fig" rid="F2">2</xref>
 and <xref ref-type="fig" rid="F3">3</xref>
). It is also linear in the length of the query sequence, as described in the section on algorithmic internals. Note that the fixing of branch lengths in the reference tree is an approximation that permits the linear time operation in <italic>n </italic>
(typically all branch lengths are re-optimized when modifying the tree).</p>
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>Linear time dependence on number of reference taxa</bold>
. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-2"></graphic>
</fig>
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><monospace>pplacer</monospace>
<bold>memory requirements</bold>
. Memory required to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. Tests run on an Intel Xeon @ 2.33 GHz.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-3"></graphic>
</fig>
<p>The <monospace>pplacer</monospace>
 binary is stand-alone; a single command specifying the reference tree, the reference alignment, a reference statistics file, and the aligned reads suffices to run the core <monospace>pplacer</monospace>
 analysis. <monospace>Pplacer</monospace>
 does not optimize sequence mutation model parameters, and instead obtains those values from PHYML [<xref ref-type="bibr" rid="B22">22</xref>
] or RAxML [<xref ref-type="bibr" rid="B23">23</xref>
] statistics output files. When analyzing protein sequences the user can choose between the LG [<xref ref-type="bibr" rid="B18">18</xref>
] or WAG [<xref ref-type="bibr" rid="B46">46</xref>
] models, and nucleotide likelihoods are computed via the general time reversible (GTR) model. Rate variation among sites is accommodated by the discrete Γ model [<xref ref-type="bibr" rid="B17">17</xref>
]. For posterior probability calculation, the user can choose between exponential or uniform pendant branch length priors. Each <monospace>pplacer</monospace>
 run creates a <monospace>.place</monospace>
 file that describes the various placements and their confidence scores; analysis can be done directly on this file, or the user can run it through <monospace>placeviz</monospace>
, our tool to visualize the fragment placements. The <monospace>pplacer</monospace>
 code is written in the functional/imperative language <monospace>OCaml</monospace>
 [<xref ref-type="bibr" rid="B47">47</xref>
] using routines from the GNU Scientific Library (GSL) [<xref ref-type="bibr" rid="B48">48</xref>
].</p>
<p>To accelerate placements, <monospace>pplacer</monospace>
 implements a two-stage search algorithm for query sequences, where a quick first evaluation of the tree is followed by a more detailed search in high-scoring parts of the tree. The more detailed second search is directed by <monospace>pplacer</monospace>
's "baseball" heuristics, which limit the full search in a way that adapts to the difficulty of the optimization problem (described in detail in "Methods"). The balance between speed and accuracy depends on two parameters, which can be appropriately chosen for the problem at hand via <monospace>pplacer</monospace>
's "fantasy baseball" mode. This feature places a subset of the query sequences and reports the accuracy of the parameter combinations within specified ranges, as well as information concerning runtime for those parameter combinations. The user can then apply these parameter choices for an optimized run of their data.</p>
</sec>
<sec><title>Quantifying uncertainty in placement location</title>
<p><monospace>Pplacer</monospace>
 calculates edge uncertainty via posterior probability and the likelihood weight ratio. These methods quantify uncertainty on an edge-by-edge basis by comparing the best placement locations on each edge. Such quantities form the basis of an understanding of placement uncertainty.</p>
<p>The Expected Distance between Placement Locations (EDPL) is used to overcome difficulties in distinguishing between local and global uncertainty, which is a complication of relying on confidence scores determined on an edge-by-edge basis. This quantity is computed as follows for a given query sequence. <monospace>Pplacer</monospace>
 first determines the top-scoring collection of edges; the optimal placement on each edge is assigned a probability defining confidence, which is the likelihood weight ratio (in ML mode) or the posterior probability (in Bayesian mode). The EDPL uncertainty is the weighted-average distance between those placements (Figure <xref ref-type="fig" rid="F4">4</xref>
), i.e. the sum of the distances between the optimal placements weighted by their probability (4). The EDPL thus uses distances on the tree to distinguish between cases where nearby edges appear equally good, versus cases when a given query sequence does not have a clear position in the tree. These measures of uncertainty can then be viewed with <monospace>placeviz</monospace>
 as described below.</p>
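<p>The EDPL computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the <monospace>pplacer</monospace> implementation: the function and variable names are hypothetical, the sum is taken over ordered pairs of placements (our reading of "sum of the distances ... weighted by their probability"), the optional <monospace>tree_length</monospace> divisor is an assumed normalization that defaults to none, and the toy distance function simply measures positions along a single path rather than on a real tree.</p>
<preformat>
```python
from itertools import product


def edpl(weights, distance, tree_length=1.0):
    """Probability-weighted sum of distances between placement locations.

    weights     -- dict mapping an edge label to its confidence score
                   (likelihood weight ratio or posterior probability)
    distance    -- function giving the tree distance between two placements
    tree_length -- optional divisor for normalization; an assumption of
                   this sketch, with 1.0 meaning no normalization
    """
    total = 0.0
    # Sum over ordered pairs of placements. The distance function is
    # assumed to return zero for a placement paired with itself.
    for a, b in product(weights, repeat=2):
        total += weights[a] * weights[b] * distance(a, b)
    return total / tree_length


# Hypothetical placements, described by a position along a single path
# of the tree so that distance is an absolute difference.
position = {"e1": 0.0, "e2": 0.1, "e3": 2.0}


def path_distance(a, b):
    return abs(position[a] - position[b])


# Two equally good placements on nearby edges: local uncertainty.
local_edpl = edpl({"e1": 0.5, "e2": 0.5}, path_distance)
# Two equally good placements far apart: global uncertainty.
spread_edpl = edpl({"e1": 0.5, "e3": 0.5}, path_distance)
```
</preformat>
<p>In this toy example both queries have the same edge-by-edge confidence scores (0.5 on each of two edges), yet the second has a much larger EDPL because its candidate placements are far apart.</p>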
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>Measuring uncertainty by the expected distance between placement locations (EDPL)</bold>
. The Expected Distance between Placement Locations (EDPL) uncertainty metric can indicate if placement uncertainty may pose a problem for downstream analysis. The EDPL uncertainty is the sum of the distances between the optimal placements weighted by their probability (4). The hollow stars on the left side of the tree depict a case where there is considerable uncertainty as to the exact placement edge, but the collection of possible edges all sit in a small region of the tree. This local uncertainty would have a low EDPL score. The full stars on the right side of the diagram would have a large EDPL, as the different placements are spread widely across the tree. Such a situation can be flagged for special treatment or removal.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-4"></graphic>
</fig>
</sec>
<sec><title>Visualizing placements using <monospace>placeviz</monospace>
 and placement management using <monospace>placeutil</monospace>
</title>
<p>Our package includes tools to facilitate placement visualization and management: <monospace>placeviz</monospace>
 and <monospace>placeutil</monospace>
. <monospace>Placeviz</monospace>
 converts the placement files generated by <monospace>pplacer</monospace>
 into tree formats that are viewable by external viewers. The richest visualizations make use of the phyloXML format [<xref ref-type="bibr" rid="B49">49</xref>
], which can be viewed using the freely available Archaeopteryx [<xref ref-type="bibr" rid="B32">32</xref>
] Java software. Less information-dense visualizations are also available in the standard "Newick" format [<xref ref-type="bibr" rid="B19">19</xref>
].</p>
<p>As shown in Figure <xref ref-type="fig" rid="F1">1</xref>
, <monospace>placeviz</monospace>
 extends previous work on visualizations [<xref ref-type="bibr" rid="B39">39</xref>
], representing placement density (branch thickness) and uncertainty (color) on a single tree. Specifically, it draws the reference tree such that the thickness of the branch is a linear function of the number of placements (this linear function has a non-zero <italic>y</italic>
-intercept so that the whole tree is visible); the weighted average EDPL uncertainty for the placements on the tree is expressed as a color gradient from the usual branch length color (white or black by choice) to red, with 100% red representing a user-defined uncertainty maximum. Yellow is used to denote edges whose average EDPL uncertainty is above the given maximum level. An example <monospace>placeviz</monospace>
 visualization can be viewed interactively at <ext-link ext-link-type="uri" xlink:href="http://matsen.fhcrc.org/pplacer/visualization.html">http://matsen.fhcrc.org/pplacer/visualization.html</ext-link>
.</p>
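<p>The mapping from placements to branch appearance can be sketched as follows. This Python fragment only illustrates the description in the text and is not <monospace>placeviz</monospace> code: the function names, the default widths, and the RGB encoding are all hypothetical choices, and the maximum uncertainty is assumed to be positive.</p>
<preformat>
```python
RED = (255, 0, 0)
YELLOW = (255, 255, 0)


def branch_width(n_placements, min_width=1.0, width_per_placement=0.5):
    """Linear width with a non-zero intercept, so empty edges stay visible."""
    return min_width + width_per_placement * n_placements


def branch_color(avg_edpl, max_edpl, base=(0, 0, 0)):
    """Blend from the base branch color toward red as uncertainty grows.

    avg_edpl -- weighted average EDPL of the placements on the edge
    max_edpl -- user-defined uncertainty maximum (assumed positive)
    base     -- usual branch color as an RGB triple, black by default
    """
    # Edges above the user-defined maximum are flagged in yellow.
    if avg_edpl > max_edpl:
        return YELLOW
    fraction = avg_edpl / max_edpl
    # Linear interpolation per channel; fraction 1.0 gives pure red.
    return tuple(round(b + fraction * (r - b)) for b, r in zip(base, RED))
```
</preformat>
<p>For instance, with a black base color an edge with no uncertainty stays black, an edge exactly at the maximum is drawn in pure red, and any edge above the maximum is drawn in yellow.</p>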
<p><monospace>Placeviz</monospace>
 also offers other visualization options, such as individually placing the query sequences on the tree, which is useful for a small number of placements. It also can sort query sequences by their best scoring edge into a <monospace>.loc.fasta</monospace>
 file; inspection can reveal if any specific features of the query sequences lead to placement on one edge or another. This sorting can also group query sequences as potentially coming from similar organisms, even if those query sequences do not overlap.</p>
<p><monospace>Placeutil</monospace>
 is a utility for combining, splitting apart, and filtering placements, which can be useful when doing large-scale analysis. For example, when a collection of query sequences is split apart to run in parallel, their placements can be brought back together using <monospace>placeutil</monospace>
, while checking that they were run using the same reference tree and model parameters. Conversely, if a number of samples were run together, they can be split apart again using regular expressions on their names. Placements can also be separated by likelihood weight ratio, posterior probability, and EDPL.</p>
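<p>The combining, splitting, and filtering operations described above can be illustrated on a simplified in-memory representation. The <monospace>Placement</monospace> record and the three helper functions below are hypothetical; they do not reflect the actual <monospace>placeutil</monospace> file format or command-line interface.</p>
<preformat><![CDATA[
```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Placement:
    name: str        # query sequence name
    edge: int        # best-scoring edge number
    ml_ratio: float  # likelihood weight ratio of the best placement


def merge_runs(runs):
    """runs: list of (reference_tree_string, placements) from parallel jobs.

    Refuses to merge when the runs did not share one reference tree.
    """
    trees = {tree for tree, _ in runs}
    if len(trees) != 1:
        raise ValueError("runs used different reference trees")
    return [p for _, placements in runs for p in placements]


def split_by_name(placements, pattern):
    """Group placements by the first capture group of `pattern` in the name."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for p in placements:
        m = regex.search(p.name)
        groups[m.group(1) if m else None].append(p)
    return dict(groups)


def separate_by_ratio(placements, cutoff):
    """Partition placements into (at or above cutoff, below cutoff)."""
    high = [p for p in placements if p.ml_ratio >= cutoff]
    low = [p for p in placements if p.ml_ratio < cutoff]
    return high, low
```
]]></preformat>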
</sec>
<sec><title><bold>A </bold>
<monospace>pplacer </monospace>
<bold>application: psbA in the Global Ocean Sampling (GOS) database</bold>
</title>
<p>To demonstrate the use of <monospace>pplacer</monospace>
 for a metagenomic study, we analyzed the <italic>psbA </italic>
and <italic>psbD </italic>
gene for the D1 and D2 subunits of photosystem II in cyanobacterial and eukaryotic chloroplasts [<xref ref-type="bibr" rid="B50">50</xref>
] from the Global Ocean Sampling (GOS) dataset [<xref ref-type="bibr" rid="B4">4</xref>
]. The GOS database is the largest publicly available metagenomic database, and has been the subject of numerous studies. We chose the <italic>psbA </italic>
and <italic>psbD </italic>
genes because they are well defined, are found across domains, and can be used to differentiate cyanobacteria from eukaryotic phototrophs in a data set assuming sequence reads are accurately identified [<xref ref-type="bibr" rid="B51">51</xref>
]. In addition, it has been shown in a number of studies that cyanophage virus genomes contain both <italic>psbA </italic>
and <italic>psbD </italic>
sequences [<xref ref-type="bibr" rid="B52">52</xref>
-<xref ref-type="bibr" rid="B55">55</xref>
], and that viruses are the source of a substantial number of <italic>psbA </italic>
and <italic>psbD </italic>
sequences in GOS [<xref ref-type="bibr" rid="B56">56</xref>
,<xref ref-type="bibr" rid="B57">57</xref>
]. BLAST searches using either a eukaryotic <italic>psbA </italic>
query or a cyanobacterial <italic>psbA </italic>
 query sequence can yield the same collection of reads from GOS with similar E-values, even very low values on the order of 10<sup>-100</sup>
or smaller in some cases (Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S1). This can make taxonomic assignment even at a high level difficult using BLAST-based comparisons. The use of <monospace>pplacer</monospace>
 on the closely related <italic>psbA </italic>
and <italic>psbD </italic>
genes demonstrates phylogenetic placement on closely related paralogs.</p>
<p>To identify <italic>psbA </italic>
and <italic>psbD </italic>
genes in the GOS dataset, we performed a HMMER [<xref ref-type="bibr" rid="B58">58</xref>
] search of the GOS dataset using an 836-nucleotide reference alignment containing 270 reference sequences from cyanobacteria, eukaryotic plastids, and viruses. The reference alignment included all possible reference sequences for <italic>psbA </italic>
and <italic>psbD </italic>
from published genomes, which is important for confident phylogenetic identification of new clades or strains. A total of 8535 metagenomic sequences were recruited by HMMER with an E-value cutoff of 10<sup>-5</sup>
; these were then placed on the reference tree using <monospace>pplacer</monospace>
 (Figures <xref ref-type="fig" rid="F1">1</xref>
 and <xref ref-type="fig" rid="F5">5</xref>
). The expanded region of the trees shown in the figures highlights the <italic>Prochlorococcus </italic>
clade, known to be one of the most abundant phototrophs in the global ocean. Many sequences are placed sister to the sequenced representatives, but many others are placed at internal nodes and could represent as yet unsequenced strains of these cyanobacteria.</p>
<fig id="F5" position="float"><label>Figure 5</label>
<caption><p><bold>Example application</bold>
. Placement visualization of the same results as in Figure 1. The notation "15_at_4", for example, means that 15 sequences were placed at internal edge number 4. These edge numbers can then be used to find the corresponding sequences in the <monospace>.loc.fasta</monospace> file
. <monospace>Placeviz</monospace>
 output viewed using FigTree [<xref ref-type="bibr" rid="B75">75</xref>
].</p>
</caption>
<graphic xlink:href="1471-2105-11-538-5"></graphic>
</fig>
</sec>
<sec><title>Simulation</title>
<p>Simulation experiments were conducted to verify overall accuracy and to determine the relationship between confidence scores and accuracy. The simulation removes one taxon at a time from a given reference tree, simulates fragments from that taxon, then evaluates how accurately the placement method assigns the simulated fragments to their original position. In order to evaluate the accuracy of the placements, a simple topological distance metric is used. We have not simulated homopolymer-type errors in the alignments, because such errors should be treated by a pre-processing step and thus are not the domain of a phylogenetic placement algorithm. Furthermore, the emergence of more accurate very high throughput sequencing technology [<xref ref-type="bibr" rid="B30">30</xref>
] re-focuses our attention on the question of speed rather than error problems. Further details are given in the "Methods" section.</p>
<p>A broad simulation analysis of <monospace>pplacer</monospace>
 performance was done using 631 COG [<xref ref-type="bibr" rid="B59">59</xref>
] alignments. The COG alignments had between 19 and 436 taxa, with a median of 41; they were between 200 and 2050 amino acids in length, with a median of 391 (supplemental Figures S1 and S2). Reference phylogenetic trees were built based on the full-length gene sequences for each of these genes using PHYML [<xref ref-type="bibr" rid="B22">22</xref>
] and the LG [<xref ref-type="bibr" rid="B18">18</xref>
] protein substitution model (LG model chosen based on the evidence presented in the corresponding paper). Each taxon from each gene alignment was eliminated one at a time from the reference set as described in "Methods"; ten reads were simulated from each, leading to a total of 334,670 simulated reads, which were aligned to a hidden Markov model of the reference alignment. As is commonly done when analyzing a metagenome, the reads were filtered by their HMMER E-value (in this case 10<sup>-5</sup>
). Two normal distributions were used for read length: a "long" read simulation with amino acid sequence length of mean 85 and standard deviation of 20, and a "short" read simulation with mean 30 and standard deviation of 7. After the HMMER step, the "long" read simulation placed a total of 285,621 reads, and the "short" one placed a total of 148,969 reads on their respective phylogenetic trees.</p>
<p>The best resulting maximum likelihood placement edge was compared to the placement with the highest posterior probability to determine how well the confidence scores reflect the difference between accurate and inaccurate placements (Tables <xref ref-type="table" rid="T1">1</xref>
 and <xref ref-type="table" rid="T2">2</xref>
). Both methods provide similar results, suggesting that the likelihood weight ratio is a reasonable proxy for the more statistically rigorous posterior probability calculation, although posterior probability does a slightly better job of distinguishing between accurate and inaccurate placements for the short reads. Overall, accuracy is high and there is a strong correlation between likelihood weight ratio, posterior probability, and accuracy. Many of the reads were placed with a high confidence score and high accuracy in both large and small trees (Figure <xref ref-type="fig" rid="F6">6</xref>
). Reads from more closely related taxa are easier to place accurately than reads from more distantly related taxa (Figure <xref ref-type="fig" rid="F7">7</xref>
), although good placement is achieved even when sequences are only distantly related to the sequences in the reference tree.</p>
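<p>The pooled statistics reported in Tables 1 and 2 can be reproduced in outline as follows. This is a minimal sketch, not the analysis code used for the paper: it assumes that a read counts as "placed correctly" when its topological error is zero and that the population standard deviation is used, and the function name is hypothetical.</p>
<preformat><![CDATA[
```python
from statistics import mean, pstdev


def summarize_by_confidence(reads, n_bins=10):
    """Pool reads into equal-width confidence bins and summarize the error.

    reads: iterable of (confidence, error) pairs, confidence in [0, 1],
           where confidence is a likelihood weight ratio or posterior probability.
    Returns {bin_index: (mean_error, sd_error, fraction_correct, count)}.
    """
    bins = {}
    for conf, err in reads:
        idx = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins.setdefault(idx, []).append(err)
    summary = {}
    for idx, errs in sorted(bins.items()):
        # Assumption: "correct" means the error metric is exactly zero.
        frac_correct = sum(1 for e in errs if e == 0) / len(errs)
        summary[idx] = (mean(errs), pstdev(errs), frac_correct, len(errs))
    return summary
```
]]></preformat>
<p>With ten bins, index 4 corresponds to the row labeled 0.4-0.5 and index 9 to the row labeled 0.9-1.0.</p>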
<table-wrap id="T1" position="float"><label>Table 1</label>
<caption><p>Accuracy results for the mean 85 AA COG simulation</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">range</th>
<th align="center">ML <italic>μ</italic>
</th>
<th align="center">PP <italic>μ</italic>
</th>
<th align="center">ML <italic>σ</italic>
</th>
<th align="center">PP <italic>σ</italic>
</th>
<th align="center">ML FC</th>
<th align="center">PP FC</th>
<th align="center">ML #</th>
<th align="center">PP #</th>
</tr>
</thead>
<tbody><tr><td align="center">0.0-0.1</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr><td align="center">0.1-0.2</td>
<td align="center">3.57</td>
<td align="center">3.78</td>
<td align="center">3.09</td>
<td align="center">3.27</td>
<td align="center">0.07</td>
<td align="center">0.03</td>
<td align="center">4149</td>
<td align="center">2312</td>
</tr>
<tr><td align="center">0.2-0.3</td>
<td align="center">2.97</td>
<td align="center">3.19</td>
<td align="center">3.04</td>
<td align="center">3.06</td>
<td align="center">0.16</td>
<td align="center">0.11</td>
<td align="center">15123</td>
<td align="center">9018</td>
</tr>
<tr><td align="center">0.3-0.4</td>
<td align="center">2.39</td>
<td align="center">2.76</td>
<td align="center">3.00</td>
<td align="center">3.07</td>
<td align="center">0.26</td>
<td align="center">0.17</td>
<td align="center">22696</td>
<td align="center">18373</td>
</tr>
<tr><td align="center">0.4-0.5</td>
<td align="center">2.25</td>
<td align="center">2.29</td>
<td align="center">3.11</td>
<td align="center">2.98</td>
<td align="center">0.32</td>
<td align="center">0.24</td>
<td align="center">20120</td>
<td align="center">23022</td>
</tr>
<tr><td align="center">0.5-0.6</td>
<td align="center">2.14</td>
<td align="center">2.11</td>
<td align="center">3.09</td>
<td align="center">3.01</td>
<td align="center">0.36</td>
<td align="center">0.32</td>
<td align="center">17228</td>
<td align="center">20090</td>
</tr>
<tr><td align="center">0.6-0.7</td>
<td align="center">1.94</td>
<td align="center">1.95</td>
<td align="center">3.04</td>
<td align="center">2.99</td>
<td align="center">0.42</td>
<td align="center">0.38</td>
<td align="center">14113</td>
<td align="center">16223</td>
</tr>
<tr><td align="center">0.7-0.8</td>
<td align="center">1.86</td>
<td align="center">1.85</td>
<td align="center">3.05</td>
<td align="center">3.01</td>
<td align="center">0.47</td>
<td align="center">0.44</td>
<td align="center">13527</td>
<td align="center">14879</td>
</tr>
<tr><td align="center">0.8-0.9</td>
<td align="center">1.62</td>
<td align="center">1.65</td>
<td align="center">2.97</td>
<td align="center">2.97</td>
<td align="center">0.55</td>
<td align="center">0.52</td>
<td align="center">14850</td>
<td align="center">15747</td>
</tr>
<tr><td align="center">0.9-1.0</td>
<td align="center">0.32</td>
<td align="center">0.32</td>
<td align="center">1.54</td>
<td align="center">1.53</td>
<td align="center">0.92</td>
<td align="center">0.92</td>
<td align="center">163815</td>
<td align="center">165957</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Error analysis for the COG simulation with the error metric described in the text. As in Figure 6, simulated reads had a normally-distributed length with a mean of 85 amino acids, and a standard deviation of 20. This table pools the results, and shows mean (<italic>μ</italic>
) and standard deviation (<italic>σ</italic>
) of the error, the fraction placed correctly (FC), and the number of reads placed for <monospace>pplacer</monospace>
 run in maximum likelihood (ML) and posterior probability (PP) modes. For example, the "ML" columns in the row labeled 0.4-0.5 show error statistics for all of the reads in the simulation that had likelihood weight ratio between 0.4 and 0.5: there were 20120 such reads, of which 32% were placed correctly, and the corresponding error mean and standard deviation were about 2.25 and 3.11, respectively. This table demonstrates the effectiveness of the confidence scores: as the confidence scores increase, the error decreases. We note that the ML and PP methods have very comparable performance for this length of read, and thus the quickly-calculated ML weight ratio can act as a proxy for the more statistically rigorous posterior probability calculation.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T2" position="float"><label>Table 2</label>
<caption><p>Accuracy results for the mean 30 AA COG simulation</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">range</th>
<th align="center">ML <italic>μ</italic>
</th>
<th align="center">PP <italic>μ</italic>
</th>
<th align="center">ML <italic>σ</italic>
</th>
<th align="center">PP <italic>σ</italic>
</th>
<th align="center">ML FC</th>
<th align="center">PP FC</th>
<th align="center">ML #</th>
<th align="center">PP #</th>
</tr>
</thead>
<tbody><tr><td align="center">0.0-0.1</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr><td align="center">0.1-0.2</td>
<td align="center">3.67</td>
<td align="center">3.94</td>
<td align="center">3.23</td>
<td align="center">3.31</td>
<td align="center">0.09</td>
<td align="center">0.05</td>
<td align="center">7736</td>
<td align="center">3583</td>
</tr>
<tr><td align="center">0.2-0.3</td>
<td align="center">3.24</td>
<td align="center">3.48</td>
<td align="center">3.26</td>
<td align="center">3.23</td>
<td align="center">0.16</td>
<td align="center">0.11</td>
<td align="center">17491</td>
<td align="center">14308</td>
</tr>
<tr><td align="center">0.3-0.4</td>
<td align="center">2.64</td>
<td align="center">2.98</td>
<td align="center">3.23</td>
<td align="center">3.26</td>
<td align="center">0.26</td>
<td align="center">0.17</td>
<td align="center">17000</td>
<td align="center">17600</td>
</tr>
<tr><td align="center">0.4-0.5</td>
<td align="center">2.51</td>
<td align="center">2.46</td>
<td align="center">3.30</td>
<td align="center">3.11</td>
<td align="center">0.33</td>
<td align="center">0.25</td>
<td align="center">11114</td>
<td align="center">14572</td>
</tr>
<tr><td align="center">0.5-0.6</td>
<td align="center">2.27</td>
<td align="center">2.27</td>
<td align="center">3.26</td>
<td align="center">3.10</td>
<td align="center">0.40</td>
<td align="center">0.33</td>
<td align="center">8375</td>
<td align="center">9894</td>
</tr>
<tr><td align="center">0.6-0.7</td>
<td align="center">2.11</td>
<td align="center">2.03</td>
<td align="center">3.14</td>
<td align="center">3.08</td>
<td align="center">0.45</td>
<td align="center">0.41</td>
<td align="center">6921</td>
<td align="center">7771</td>
</tr>
<tr><td align="center">0.7-0.8</td>
<td align="center">1.83</td>
<td align="center">1.76</td>
<td align="center">3.06</td>
<td align="center">2.98</td>
<td align="center">0.52</td>
<td align="center">0.50</td>
<td align="center">6321</td>
<td align="center">6530</td>
</tr>
<tr><td align="center">0.8-0.9</td>
<td align="center">1.51</td>
<td align="center">1.44</td>
<td align="center">2.92</td>
<td align="center">2.83</td>
<td align="center">0.62</td>
<td align="center">0.60</td>
<td align="center">7101</td>
<td align="center">6873</td>
</tr>
<tr><td align="center">0.9-1.0</td>
<td align="center">0.22</td>
<td align="center">0.20</td>
<td align="center">1.22</td>
<td align="center">1.17</td>
<td align="center">0.94</td>
<td align="center">0.94</td>
<td align="center">66910</td>
<td align="center">67838</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Analysis as in Table 1, but for simulated reads with a normally-distributed length with a mean of 30 amino acids and a standard deviation of 7. In this case, the posterior probability calculation is slightly better than the likelihood weight ratio at distinguishing between accurate and inaccurate placements.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F6" position="float"><label>Figure 6</label>
<caption><p><bold>Simulation with 631 COG alignments</bold>
. Error analysis from a simulation study using 631 COG alignments. Ten reads were simulated from each taxon of each alignment, and then binned according to the likelihood weight ratio of their best placement; ranges for the four bins are indicated in the legend. There is one scatter point in the plot for each bin of each alignment: the <italic>x</italic>
-axis for each plot shows the number of taxa in the tree used for the simulation, and the <italic>y </italic>
axis shows the average error for that bin. For example, a point at (100, 1.2) labeled 0.5 - 0.75 indicates that the set of all placements for an alignment of 100 taxa with confidence score between 0.5 and 0.75 has average error of 1.2. As described in the text, the error metric is the number of internal nodes between the correct edge and the node placement edge.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-6"></graphic>
</fig>
<fig id="F7" position="float"><label>Figure 7</label>
<caption><p><bold>Accuracy versus distance to sister taxon: COG simulation</bold>
. The relationship between accuracy and phylogenetic (sum of branch length) distance to the sister taxon for the COG simulation. For each taxon in each alignment, the phylogenetic distance to the closest sister taxon was calculated, along with the average placement error for the ten reads simulated from that taxon in that alignment. The results were binned and shown in boxplot form, with the central line showing the median, the box showing the interquartile range, and the "whiskers" showing the extent of values which are within 1.5 times the interquartile range beyond the lower and upper quartiles. Outliers were eliminated for clarity.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-7"></graphic>
</fig>
</sec>
</sec>
<sec><title>Discussion</title>
<p>Likelihood-based phylogeny is a well developed way to establish the evolutionary relationships between sequences. Phylogenetic placement is a simplified version of likelihood-based phylogenetic inference that enables rapid placement of numerous short query sequences and sidesteps some of the problems inherent in applying phylogenetics to hundreds of thousands or millions of taxa. Phylogenetic placement is by no means a replacement for classical phylogenetic inference, which should be applied when query sequences are full length and moderate in number.</p>
<p>Phylogenetic placement software sits in a category distinct from taxonomic identification software such as MEGAN [<xref ref-type="bibr" rid="B10">10</xref>
] or Phymm [<xref ref-type="bibr" rid="B13">13</xref>
]. First, phylogenetic placement software does not assign names to query sequences, and instead returns an assignment of the query sequences to edges of a phylogenetic tree. Second, phylogenetic placement is designed to work with a single reference phylogenetic tree built on a single alignment. Thus it is well suited for fine-scale analysis of query sequences to provide detailed comparative and evolutionary information at the single gene level. This poses no problems when looking at a single marker gene such as 16S, but some scripting and automation is necessary when there are many genes of interest. These challenges are somewhat mitigated through program design and pipeline scripts [<xref ref-type="bibr" rid="B60">60</xref>
], but phylogenetic placement methods may always require more work than general purpose taxonomic classification software.</p>
<p>Phylogenetic placement is also different from packages that construct a phylogenetic tree <italic>de novo </italic>
in order to infer taxonomic identity by clade membership. Packages of this kind, such as CARMA [<xref ref-type="bibr" rid="B61">61</xref>
] and SAP [<xref ref-type="bibr" rid="B43">43</xref>
,<xref ref-type="bibr" rid="B62">62</xref>
], combine sequence search, alignment, and phylogeny into a complete pipeline to provide taxonomic information for an unknown query sequence. Because different query sequences will have different sets of reference taxa, these methods are not phylogenetic placement algorithms as described above. Also, because they are performing a full phylogenetic tree construction, they either use distance-based methods for faster results [<xref ref-type="bibr" rid="B43">43</xref>
,<xref ref-type="bibr" rid="B61">61</xref>
] or are many orders of magnitude slower than phylogenetic placement methods [<xref ref-type="bibr" rid="B62">62</xref>
].</p>
<p><monospace>Pplacer</monospace>
 is not the only software to perform likelihood-based phylogenetic placement. The first pair of software implementations were the "phylomapping" method of [<xref ref-type="bibr" rid="B38">38</xref>
], and the first version of the "MLTreeMap" method of [<xref ref-type="bibr" rid="B39">39</xref>
]. Both methods use a topologically fixed reference tree, and are wrappers around existing phylogenetic implementations: ProtML [<xref ref-type="bibr" rid="B63">63</xref>
] for phylomapping, and TREE-PUZZLE [<xref ref-type="bibr" rid="B64">64</xref>
] for MLTreeMap. Neither project has resulted in software that is freely available for download (MLTreeMap is available as a web service, but as it is tied to a core set of bacterial genes it is not useful for scientists examining other genes or domains). Also, by using a general-purpose phylogenetic computing engine, they miss opportunities to optimize the computation, and the resulting algorithms are not linear in the number of reference taxa. Both methods equip placement with a statistically justifiable but non-traditional confidence score: phylomapping adapts the RELL bootstrap [<xref ref-type="bibr" rid="B65">65</xref>
] to their setting, and MLTreeMap uses the "expected maximum likelihood weight ratio," which has been discussed in [<xref ref-type="bibr" rid="B66">66</xref>
]. AMPHORA also uses a hybrid parsimony and neighbor-joining strategy to place query sequences in a fixed reference tree [<xref ref-type="bibr" rid="B67">67</xref>
].</p>
<p>The only other software at present that performs likelihood-based phylogenetic placement at speeds comparable to those of <monospace>pplacer</monospace>
 is the independently-developed "evolutionary placement algorithm" (EPA) [<xref ref-type="bibr" rid="B28">28</xref>
] available as an option to RAxML [<xref ref-type="bibr" rid="B23">23</xref>
]. <monospace>Pplacer</monospace>
 and the EPA both cache likelihood information on the tree to accelerate placement, and both use two-stage algorithms to quickly place many sequences. The two packages use different acceleration heuristics, but only <monospace>pplacer</monospace>
 offers guidance on parameter choices to use for those heuristics via its "fantasy baseball" feature as described below in the section on algorithmic internals. The EPA allows one more free parameter than <monospace>pplacer</monospace>
 for branch length optimization, and can perform placement on partitioned datasets and inference on binary, RNA secondary structure, and multi-state data. The EPA offers single-process parallelization [<xref ref-type="bibr" rid="B68">68</xref>
] (note both the EPA and <monospace>pplacer</monospace>
 can easily be run in parallel as multiple processes). The EPA leverages the efficient memory representation of RAxML, such that an equivalent run using the Gamma model of rate variation will use half the memory of <monospace>pplacer</monospace>
, and a run using the CAT approximation will require one eighth of the memory. The EPA comes without a visualization tool such as <monospace>placeviz</monospace>
, although it can be visualized if run on their webserver, or within the new MLTreeMap suite of Perl scripts for visualization [<xref ref-type="bibr" rid="B60">60</xref>
].</p>
<p>We have compared the performance of EPA and <monospace>pplacer</monospace>
 in a study designed jointly by ourselves and the authors of [<xref ref-type="bibr" rid="B28">28</xref>
]. <monospace>Pplacer</monospace>
 and the EPA showed comparable speed in placing metagenomic reads on reference trees of different sizes (Figure <xref ref-type="fig" rid="F8">8</xref>
). To assess accuracy, we simulated reads from the 16S alignments used for accuracy evaluation in [<xref ref-type="bibr" rid="B28">28</xref>
]. As in their paper, we simulated nucleotide reads of normally distributed length with mean 200 and standard deviation 60. The error was evaluated using the same topological error metric in two ways: first, the error of the placement with the highest likelihood (Figure <xref ref-type="fig" rid="F9">9</xref>
), and second, the total error weighted by the normalized likelihood weights (Figure <xref ref-type="fig" rid="F10">10</xref>
). Each program was run with the four-category gamma model of rate heterogeneity. There was no clear difference in accuracy between EPA and <monospace>pplacer</monospace>
 for these alignments with either of these ways of evaluating the error. This is despite the fact that the "correct" placement was chosen to be that assigned by the EPA with the full length sequence.</p>
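<p>The two ways of scoring error used in this comparison can be written down compactly. The sketch below assumes each query is represented as a list of (likelihood weight, topological error) pairs, one per candidate placement; this representation and the function names are illustrative only.</p>
<preformat><![CDATA[
```python
def top_placement_error(placements):
    """Error of the single placement with the highest likelihood weight."""
    return max(placements, key=lambda p: p[0])[1]


def expected_error(placements):
    """Total error weighted by likelihood weights normalized to sum to one."""
    total = sum(w for w, _ in placements)
    return sum(w * e for w, e in placements) / total
```
]]></preformat>
<p>For a query whose best placement is correct but whose remaining weight is spread over distant edges, the first score is zero while the second is positive, which is why the two scores are reported separately.</p>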
<fig id="F8" position="float"><label>Figure 8</label>
<caption><p><bold>Speed comparison of </bold>
<monospace>pplacer</monospace>
<bold>and RAxML's EPA algorithm</bold>
. Time to place 10,000 16S rRNA reads of median length 198 nt onto a reference phylogenetic tree, with a 1287 nt reference alignment. "Γ model" refers to a four-category gamma model of rate heterogeneity [<xref ref-type="bibr" rid="B17">17</xref>
], and "CAT" is an approximation which chooses a single rate for each site [<xref ref-type="bibr" rid="B76">76</xref>
]. Tests were run on an Intel Xeon at 2.33 GHz.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-8"></graphic>
</fig>
<fig id="F9" position="float"><label>Figure 9</label>
<caption><p><bold>Top placement accuracy comparison of </bold>
<monospace>pplacer</monospace>
<bold>and RAxML's EPA algorithm</bold>
. Accuracy comparison between EPA and <monospace>pplacer</monospace>
 both run with the Γ model of rate variation, using reads of mean length 200 simulated from the test data sets from [<xref ref-type="bibr" rid="B28">28</xref>
]. The x-axis numbers are the size of the data set used for simulation. The y-axis shows the error for the placement with the highest likelihood score.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-9"></graphic>
</fig>
<fig id="F10" position="float"><label>Figure 10</label>
<caption><p><bold>Expected accuracy comparison of </bold>
<monospace>pplacer</monospace>
<bold>and RAxML's EPA algorithm</bold>
. Comparison as in Figure 9 but scoring the expected error, i.e. the total error weighted by the likelihood weight ratios.</p>
</caption>
<graphic xlink:href="1471-2105-11-538-10"></graphic>
</fig>
<p>In contrast to the EPA, <monospace>pplacer</monospace>
 placements all sit on a single reference tree with its associated branch lengths fixed. Thus it is easy to compare the relative location of placements, and to consider all placements on a single tree. Placement locations along a branch are useful in cases such as classification, where a placement close to the root of a clade may be assigned membership to that clade, whereas placements in the middle of the same edge may not. The EPA, on the other hand, optimizes the length of the branch of the reference tree as well as the placement location along that branch; thus each placement is done onto a slightly different reference tree. Presumably because the placement location does not happen on a single reference tree, the placement location is not reported by the program and this information is lost [<xref ref-type="bibr" rid="B28">28</xref>
].</p>
<p>We did not compare the RAxML parsimony insertions wrapped by AMPHORA to these likelihood placements, because we would be scoring a parsimony insertion algorithm according to the original positions in a maximum-likelihood tree. The difference between these optimality criteria would naturally lead to some differences, which would be viewed by the scoring metric as error. The innovative bootstrap-based taxonomic assignment procedure in the AMPHORA package produces a name rather than a phylogenetic placement, and thus cannot be directly compared to the output of <monospace>pplacer</monospace>
.</p>
</sec>
<sec><title>Conclusions</title>
<p><monospace>Pplacer</monospace>
 enables efficient maximum likelihood and posterior probability phylogenetic placement of reads, making likelihood-based phylogenetics methodology practical for large-scale metagenomic or 16S survey data. <monospace>Pplacer</monospace>
 can be used whenever a reference alignment and phylogenetic tree are available, and is designed for ease of use for both single-run and pipelined applications. "Baseball" heuristics adapt to the difficulty of the phylogenetic placement problem at hand, and come with features which guide the user to an appropriate set of parameter choices. The EDPL metric helps users decide if edge uncertainty is a substantial problem for downstream analysis. <monospace>Pplacer</monospace>
 offers tightly integrated yet flexible visualization tools which can be used to view both the placements and their uncertainty on a single tree. Large-scale simulations confirmed the accuracy of the <monospace>pplacer</monospace>
 results and the descriptive ability of the confidence scores. <monospace>Pplacer</monospace>
 is freely available, comes with a complete manual and tutorials, and can be used via a web service.</p>
<p><monospace>Pplacer</monospace>
 forms the core of a body of work we are developing to facilitate and extend the utility of phylogenetic placement methodology. We have shown recently [<xref ref-type="bibr" rid="B69">69</xref>
] that phylogenetic placements (and uncertainty measurements thereof) fit perfectly into a statistical framework generalizing weighted UniFrac [<xref ref-type="bibr" rid="B70">70</xref>
] allowing for statistical comparison and visualization of differences between samples. In collaboration with another group, we have also implemented a preliminary version of software which automates the selection of appropriate reference sequences, as well as the assignment of taxonomic names based on phylogenetic placements.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<sec><title><monospace>Pplacer </monospace>
<bold>algorithmic internals</bold>
</title>
<p>Here we survey <monospace>pplacer</monospace>
 algorithmic developments. The code implementing these algorithms is freely available in a GitHub code repository [<xref ref-type="bibr" rid="B71">71</xref>
]. The basic development that permits linear time and space scaling in the size of the reference tree is that of pre-calculation of likelihood vectors at either end of each edge of the reference tree; this development is shared by the EPA and SCUEAL [<xref ref-type="bibr" rid="B40">40</xref>
] and the original idea goes back much earlier. Using these cached likelihood vectors, a naive algorithm might insert the query sequence into each edge of the tree and perform full branch length optimization using the cached likelihood vectors. However, a substantial speed improvement can be gained by performing a two-stage algorithm, where the first stage does a quick initial evaluation to find a good set of locations, and the second stage does a more detailed evaluation of the results from the first stage.</p>
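<p>To illustrate why cached likelihood vectors make each attachment cheap, the following sketch evaluates the likelihood of one alignment site for a query attached to one edge. It is a simplified illustration and not the <monospace>pplacer</monospace> implementation: it assumes the Jukes-Cantor nucleotide model with no rate variation, a single site, and hypothetical function names. The vectors <monospace>distal</monospace> and <monospace>proximal</monospace> stand for the pre-calculated likelihood vectors at the two ends of the edge.</p>
<preformat><![CDATA[
```python
import math


def jc69(t):
    """4x4 Jukes-Cantor transition probability matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def evolve(vec, t):
    """Propagate a per-state likelihood vector along a branch of length t."""
    P = jc69(t)
    return [sum(P[i][j] * vec[j] for j in range(4)) for i in range(4)]


def site_likelihood(distal, proximal, query, a, b, pendant):
    """Likelihood of one site with the query attached to an edge.

    distal, proximal: cached vectors at the two ends of the edge.
    query: per-state vector for the query base (one-hot if observed).
    a, b: distances from the attachment point to the distal and proximal ends.
    pendant: length of the new branch leading to the query.
    """
    d, p, q = evolve(distal, a), evolve(proximal, b), evolve(query, pendant)
    # Uniform stationary frequencies (0.25 each) under Jukes-Cantor.
    return sum(0.25 * d[s] * p[s] * q[s] for s in range(4))
```
]]></preformat>
<p>Because the two cached vectors are computed once per edge and reused for every query, the work per query and edge does not depend on the size of the tree, so evaluating all edges is linear in the number of reference taxa.</p>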
<p><monospace>Pplacer</monospace>
's "baseball" heuristics limit the full search on the tree in a way that adapts to the difficulty of the optimization problem. The first stage is enabled by calculating likelihood vectors for the center of each edge; these vectors can be used to quickly sort the edges in approximate order of fit for a given query sequence. This edge ordering will be called the "batting order." The edges are evaluated in the batting order with full branch length optimization, stopping as follows. Start with the edge that looks best from the initial evaluation; let <italic>L </italic>
be the log likelihood of the branch-length-optimized ML attachment to that edge. Fix some positive number <italic>D</italic>
, called the "strike box." We proceed down the list in order until we encounter the first placement that has log likelihood less than <italic>L </italic>
- <italic>D</italic>
, which is called a "strike." We continue down the list, tolerating up to a user-specified number of strikes, and then stop the detailed evaluation, since the remaining edges most likely lie in rather poor parts of the tree. A further option caps the total number of "pitches," i.e. full branch length optimizations.</p>
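<p>The stopping rule above can be sketched as follows. This is an illustrative Python sketch rather than the <monospace>pplacer</monospace> source; the names baseball_search and optimize_edge are hypothetical, and we assume that evaluation stops as soon as the allowed number of strikes has been reached.</p>

```python
from typing import Callable, List, Sequence, Tuple


def baseball_search(
    batting_order: Sequence[int],
    optimize_edge: Callable[[int], float],
    strike_box: float,
    max_strikes: int,
    max_pitches: int,
) -> List[Tuple[int, float]]:
    """Fully evaluate edges in batting order until too many strikes occur.

    optimize_edge is a hypothetical callback that returns the
    branch-length-optimized log likelihood of attaching the query to an edge.
    """
    results: List[Tuple[int, float]] = []
    reference = None  # L: optimized log likelihood of the first edge tried
    strikes = 0
    for edge in batting_order:
        if len(results) >= max_pitches:
            break  # cap on the number of full optimizations ("pitches")
        log_like = optimize_edge(edge)
        results.append((edge, log_like))
        if reference is None:
            reference = log_like
        elif reference - strike_box > log_like:
            strikes += 1  # this placement fell below L - D
            if strikes >= max_strikes:
                break  # assumed rule: stop once the allowance is used up
    return results
```

<p>The returned list holds every fully evaluated edge together with its log likelihood, so the best placement is simply the entry with the largest second component.</p>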
<p>The baseball heuristics allow the algorithm to adapt to the likelihood surface present in the tree; its behavior is controlled by parameters that can be chosen using <monospace>pplacer</monospace>
's "fantasy baseball" feature. This option automates the testing of parameter combinations for the baseball heuristics. Namely, it evaluates a large fixed number of placements and records what the results would have been under various settings for the number of allowed strikes and the strike box. For each setting it records the number of full evaluations performed (to which the run time is essentially proportional), whether the optimal placement would have been found, and how close the best placement found comes to the optimal one.</p>
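<p>A minimal sketch of such a replay is given below. It assumes that the optimized log likelihoods of a fixed number of edges have already been computed in batting order; the function names, the shape of the result table, and the exact stopping rule are our assumptions for illustration, not the <monospace>pplacer</monospace> implementation.</p>

```python
from typing import Dict, Sequence, Tuple


def replay(log_likes: Sequence[float], strike_box: float, max_strikes: int) -> int:
    """Number of full evaluations a given setting would have performed."""
    reference = log_likes[0]
    strikes = 0
    pitches = 0
    for log_like in log_likes:
        pitches += 1
        if reference - strike_box > log_like:
            strikes += 1
            if strikes >= max_strikes:
                break
    return pitches


def fantasy_baseball(
    log_likes: Sequence[float],
    strike_boxes: Sequence[float],
    strike_counts: Sequence[int],
) -> Dict[Tuple[float, int], Tuple[int, bool, float]]:
    """Map (strike box, strikes) to (pitches, optimum found, log likelihood gap)."""
    best = max(log_likes)  # optimum over everything that was evaluated
    table = {}
    for box in strike_boxes:
        for count in strike_counts:  # strike counts are assumed positive
            pitches = replay(log_likes, box, count)
            found = max(log_likes[:pitches])
            table[(box, count)] = (pitches, found == best, best - found)
    return table
```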
<p>Placement speed can also be accelerated by using information gained about the placement of a given query sequence to aid in placement of closely related query sequences. Before placement begins, pairwise sequence comparisons are done, first in terms of number of mismatches and second in terms of number of matches to gaps. Specifically, each sequence <italic>s<sub>i </sub>
</italic>
is compared to previous sequences in order; the sequence <italic>s<sub>j </sub>
</italic>
that is most closely related to <italic>s<sub>i </sub>
</italic>
with <italic>j &lt; i </italic>
is found and assigned as <italic>s<sub>i</sub>
</italic>
's "friend." If no sequence is above a certain threshold of similarity then no friend is assigned. If <italic>s<sub>i </sub>
</italic>
and <italic>s<sub>j </sub>
</italic>
are identical, then <italic>s<sub>j</sub>
</italic>
's placement is used for <italic>s<sub>i</sub>
</italic>
. If they are similar but not identical, the branch lengths for <italic>s<sub>j </sub>
</italic>
are used as starting values for the branch length optimization of <italic>s<sub>i</sub>
</italic>
. This scheme is not a heuristic, but rather an exact way to accelerate the optimization process. On the other hand, such comparison is inherently an <italic>O</italic>
(<italic>n<sup>2</sup>
</italic>
) operation and thus may slow placement down given more than tens of thousands of query sequences. In such a case the user may choose to forgo the friend finding process.</p>
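<p>The friend-finding pass can be sketched as below for sequences that are already aligned to a common set of columns. The gap character, the mismatch threshold, and the function names are assumptions made for illustration; the similarity threshold actually used by <monospace>pplacer</monospace> is not specified here.</p>

```python
from typing import List, Optional, Tuple

GAP = "-"  # assumed gap symbol


def compare(a: str, b: str) -> Tuple[int, int]:
    """Return (mismatches, residue-against-gap columns) for aligned sequences."""
    mismatches = 0
    gap_pairs = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # both unknown: carries no information
        if x == GAP or y == GAP:
            gap_pairs += 1
        elif x != y:
            mismatches += 1
    return mismatches, gap_pairs


def find_friends(seqs: List[str], max_mismatches: int) -> List[Optional[int]]:
    """For each sequence i, index of its closest earlier sequence, or None."""
    friends: List[Optional[int]] = []
    for i, s_i in enumerate(seqs):
        best_j = None
        best_score = None
        for j in range(i):  # only earlier sequences are candidates
            score = compare(s_i, seqs[j])
            if score[0] > max_mismatches:
                continue  # not similar enough to be a friend
            # tuples compare by mismatches first, then by gap columns
            if best_score is None or best_score > score:
                best_score, best_j = score, j
        friends.append(best_j)
    return friends
```

<p>The nested loop makes the quadratic cost explicit: every query is compared with all of its predecessors.</p>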
<p><monospace>Pplacer</monospace>
's run time is also linearly proportional to the lengths of the query sequences, which is possible because the reference tree is fixed with respect to topology and branch length. Specifically, as described below, likelihood computations are performed such that the sites without a known state (gaps or missing sites) cancel out of the computation of likelihood weight or posterior probability. These sites are masked out of <monospace>pplacer</monospace>
's computation and thus do not contribute to the run time.</p>
<p>Because of the extensive memory caching to accelerate placement, <monospace>pplacer</monospace>
 consumes a nontrivial amount of memory. The fixed contributions to memory use break down as follows: a factor of two for quick and full evaluation of placements, two nodes on each edge, four rate variation categories, four bytes per double precision floating point number, and four (nucleotide) or 20 (amino acid) states. To get a lower bound for total memory use, multiply this number, which is 128 bytes (nucleotide) or 640 bytes (amino acid), by the number of edges (two times the number of reference sequences, minus three) and by the number of columns in the reference alignment. Other data structures add to this total (Figure <xref ref-type="fig" rid="F3">3</xref>
).</p>
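<p>As a worked example of this lower bound, the sketch below takes the per-site figures quoted above (128 or 640 bytes) as given and multiplies them by the number of edges and the number of alignment columns. The function name is hypothetical.</p>

```python
def memory_lower_bound_bytes(
    n_reference_seqs: int, n_columns: int, amino_acid: bool = False
) -> int:
    """Lower bound on cached likelihood-vector memory, in bytes."""
    per_site = 640 if amino_acid else 128  # figures quoted in the text
    n_edges = 2 * n_reference_seqs - 3  # edges of an unrooted binary tree
    return per_site * n_edges * n_columns
```

<p>For 1000 nucleotide reference sequences and a 1000-column alignment this gives 128 x 1997 x 1000 = 255,616,000 bytes, i.e. roughly a quarter of a gigabyte before any other data structures are counted.</p>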
</sec>
<sec><title>Likelihood weight ratio, posterior probability, and EDPL</title>
<p>Posterior probability is calculated by first integrating out the possible attachment locations and branch lengths against a prior distribution of pendant branch lengths. Let <italic>ℓ<sub>i </sub>
</italic>
denote an edge of the reference tree, <italic>A<sub>i </sub>
</italic>
the length of that edge, <italic>a </italic>
the attachment location along <italic>ℓ<sub>i</sub>
</italic>
, <italic>b </italic>
the pendant branch length, ℒ the phylogenetic likelihood function (e.g. equation 16.9 of [<xref ref-type="bibr" rid="B19">19</xref>
]), <italic>D </italic>
the alignment, <italic>T</italic>
<sub>ref </sub>
the reference phylogenetic tree, and <italic>P </italic>
the prior probability of a pendant branch length. We obtain the Bayes marginal likelihood by direct two-dimensional numerical integration:</p>
<p><disp-formula id="bmcM1"><label>(1)</label>
<mml:math id="M1" name="1471-2105-11-538-i1" overflow="scroll"><mml:mrow><mml:msub><mml:mi>ℒ</mml:mi>
<mml:mrow><mml:mtext>Bayes</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup><mml:mi>A</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow><mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mstyle displaystyle="true"><mml:mrow><mml:msubsup><mml:mo>∫</mml:mo>
<mml:mn>0</mml:mn>
<mml:mi>∞</mml:mi>
</mml:msubsup>
<mml:mrow><mml:mstyle displaystyle="true"><mml:mrow><mml:msubsup><mml:mo>∫</mml:mo>
<mml:mn>0</mml:mn>
<mml:mrow><mml:msub><mml:mi>A</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mi>ℒ</mml:mi>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>a</mml:mi>
<mml:mtext> </mml:mtext>
<mml:mi>d</mml:mi>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
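<p>The double integral in equation (1) can be approximated on a grid. The sketch below uses a simple midpoint rule and truncates the pendant branch length at a finite upper limit; the quadrature scheme, the truncation, the grid sizes, and the callback signatures are all assumptions for illustration and are not claimed to match the numerical integration performed by <monospace>pplacer</monospace>.</p>

```python
from typing import Callable


def marginal_likelihood(
    like: Callable[[float, float], float],
    edge_length: float,
    prior: Callable[[float], float],
    max_pendant: float,
    n_a: int = 50,
    n_b: int = 200,
) -> float:
    """Midpoint-rule estimate of the Bayes marginal likelihood for one edge.

    like(a, b) is a hypothetical callback giving the likelihood for attachment
    location a and pendant branch length b; prior(b) is the prior density.
    """
    da = edge_length / n_a
    db = max_pendant / n_b
    total = 0.0
    for k in range(n_b):
        b = (k + 0.5) * db
        inner = 0.0
        for m in range(n_a):
            a = (m + 0.5) * da
            inner += like(a, b)
        total += inner * da * prior(b) * db
    return total / edge_length  # the leading factor of 1 / A_i
```

<p>In practice likelihoods are tiny, so a real implementation would need to rescale them (for instance by factoring out the maximum) before summing; the sketch omits this for clarity.</p>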
<p>The posterior probability can then be obtained by taking a ratio of these marginal likelihoods (summation is over branches <italic>j </italic>
of the tree):</p>
<p><disp-formula id="bmcM2"><label>(2)</label>
<mml:math id="M2" name="1471-2105-11-538-i2" overflow="scroll"><mml:mrow><mml:mi>ℙ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mi>ℒ</mml:mi>
<mml:mrow><mml:mtext>Bayes</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mo>∑</mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mrow><mml:msub><mml:mi>ℒ</mml:mi>
<mml:mrow><mml:mtext>Bayes</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>The likelihood weight ratio is defined as the corresponding ratio, with the marginal likelihood replaced by the maximized likelihood:</p>
<p><disp-formula id="bmcM3"><label>(3)</label>
<mml:math id="M3" name="1471-2105-11-538-i3" overflow="scroll"><mml:mrow><mml:mi>ℙ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mi>ℒ</mml:mi>
<mml:mrow><mml:mtext>ML</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mo>∑</mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mrow><mml:msub><mml:mi>ℒ</mml:mi>
<mml:mrow><mml:mtext>ML</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub><mml:mi>T</mml:mi>
<mml:mrow><mml:mtext>ref</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>ℓ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where <italic>ℒ</italic>
<sub>ML </sub>
is the maximal likelihood obtained by maximizing <italic>ℒ</italic>
(<italic>D</italic>
|<italic>T</italic>
<sub>ref </sub>
, <italic>ℓ<sub>i</sub>
, a, b</italic>
) with respect to branch length parameters <italic>a </italic>
and <italic>b</italic>
. The expected (under bootstrap replicates) likelihood weight ratio is the confidence score used in [<xref ref-type="bibr" rid="B39">39</xref>
]. Some justification for using likelihood weight ratios is given in [<xref ref-type="bibr" rid="B66">66</xref>
].</p>
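<p>Because likelihoods are usually handled as log likelihoods, the ratio in equation (3) is most safely computed after subtracting the largest log likelihood. The following sketch shows this normalization; the function name is hypothetical, and the same normalization applies to equation (2) if log marginal likelihoods are supplied instead.</p>

```python
import math
from typing import List


def likelihood_weight_ratios(log_likes: List[float]) -> List[float]:
    """Normalize per-edge log likelihoods into ratios that sum to one."""
    peak = max(log_likes)
    # Subtracting the peak keeps exp() from underflowing to zero.
    weights = [math.exp(ll - peak) for ll in log_likes]
    total = sum(weights)
    return [w / total for w in weights]
```

<p>For example, two edges with equal log likelihood each receive a ratio of one half, however negative that common log likelihood is.</p>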
<p>The expected distance between placement locations (EDPL) is a simple summation given probabilities from likelihood weight distributions or posterior probabilities. Let <italic>p<sub>i </sub>
</italic>
= ℙ(<italic>ℓ<sub>i</sub>
</italic>
|<italic>T</italic>
<sub>ref</sub>
, <italic>D</italic>
) from either (2) or (3), let <italic>d<sub>ij </sub>
</italic>
denote the tree distance between the optimal attachment positions on edges <italic>ℓ<sub>i </sub>
</italic>
and <italic>ℓ<sub>j</sub>
</italic>
, and let <italic>L </italic>
denote the total tree length. Then the EDPL is simply</p>
<p><disp-formula id="bmcM4"><label>(4)</label>
<mml:math id="M4" name="1471-2105-11-538-i4" overflow="scroll"><mml:mrow><mml:mstyle displaystyle="true"><mml:munder><mml:mo>∑</mml:mo>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow><mml:msub><mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub><mml:mi>p</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub><mml:mi>d</mml:mi>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
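<p>Equation (4) translates directly into a double loop. In the sketch below the sum is taken over all ordered pairs of edges, which is our reading of the index in (4); the function name and the dense distance matrix are illustrative assumptions.</p>

```python
from typing import Sequence


def edpl(
    probs: Sequence[float],
    dist: Sequence[Sequence[float]],
    tree_length: float,
) -> float:
    """Expected distance between placement locations, scaled by tree length."""
    total = 0.0
    for i, p_i in enumerate(probs):
        for j, p_j in enumerate(probs):
            total += p_i * p_j * dist[i][j]
    return total / tree_length
```

<p>A query with a single candidate edge has an EDPL of zero, while probability mass split between distant edges produces a large value.</p>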
<p>An extension of these ideas would be to integrate the marginal likelihoods over the potential attachment positions on the edges of interest; we have not pursued such a calculation.</p>
</sec>
<sec><title>Simulation design and error metric</title>
<p>The simulation procedure for a single gene is as follows. Begin with an alignment <italic>A </italic>
of full-length sequences for the gene of interest, along with a phylogeny <italic>T </italic>
derived from that alignment. <italic>T </italic>
is assumed to be correct.</p>
<p>Simulated fragments from a given taxon <italic>X </italic>
are re-placed in the phylogenetic tree, and their location relative to <italic>X</italic>
's original location is determined. The simulation pipeline repeats the following steps for every taxon <italic>X </italic>
in the alignment <italic>A</italic>
.</p>
<p>1. remove <italic>X </italic>
from the reference alignment, making an alignment <italic>A<sub>X </sub>
</italic>
.</p>
<p>2. build a profile HMM out of <italic>A<sub>X </sub>
</italic>
.</p>
<p>3. cut <italic>X </italic>
and its pendant branch out of the tree <italic>T</italic>
, suppressing the resultant degree-two internal node. Re-estimate branch lengths using <italic>A<sub>X </sub>
</italic>
, and call the resulting tree <italic>T<sub>X </sub>
</italic>
.</p>
<p>4. simulate fragments from the unaligned sequence of <italic>X </italic>
by taking sequences of normally-distributed length and uniformly-distributed position.</p>
<p>5. align these simulated fragments using the profile HMM built from <italic>A<sub>X </sub>
</italic>
.</p>
<p>6. place the simulated fragments in <italic>T<sub>X </sub>
</italic>
with respect to the reference alignment <italic>A<sub>X </sub>
</italic>
.</p>
<p>7. compare the resulting placements to the location of <italic>X </italic>
in <italic>T </italic>
using our error metric described below.</p>
<p>Note that only branch lengths are re-estimated; if we estimated <italic>T<sub>X </sub>
de novo </italic>
from <italic>A<sub>X </sub>
</italic>
then we would not be able to compare the placements to the taxon locations in <italic>T</italic>
.</p>
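<p>Step 4 of the pipeline can be sketched as follows. The defaults of 200 and 60 are the mean and standard deviation quoted for the simulated nucleotide fragments under Alignments and Reference Trees; clamping the drawn length to lie between one and the sequence length is our assumption, as is the function name.</p>

```python
import random


def simulate_fragment(
    seq: str,
    rng: random.Random,
    mean_len: float = 200.0,
    sd_len: float = 60.0,
) -> str:
    """Cut one fragment of normally distributed length at a uniform position."""
    length = int(round(rng.gauss(mean_len, sd_len)))
    # Assumed handling of extreme draws: clamp to a valid fragment length.
    length = max(1, min(length, len(seq)))
    start = rng.randint(0, len(seq) - length)  # both endpoints inclusive
    return seq[start:start + length]
```

<p>Passing an explicitly seeded <monospace>random.Random</monospace> instance keeps a simulation run reproducible.</p>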
<p>In order to evaluate the accuracy of the placements, a simple topological distance metric is used. To calculate this metric for the placement of a taxon <italic>X</italic>
, highlight both the edge of <italic>T<sub>X </sub>
</italic>
corresponding to the correct placement and the edge of <italic>T<sub>X </sub>
</italic>
corresponding to the actual placement of the simulated fragment. The error metric then is the number of internal nodes between the two highlighted edges. Thus, if the fragment is placed in the correct position, then the error is zero; if it is placed sister to the correct position, then the error is one; and so on. This error metric is also used in [<xref ref-type="bibr" rid="B28">28</xref>
].</p>
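<p>The error metric can be computed with a breadth-first search from one highlighted edge to the other. The sketch below assumes the tree is stored as an adjacency map from node to neighbors and that an edge is given as a pair of node names; this representation and the function name are hypothetical.</p>

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple

Node = Hashable


def node_distance(
    adj: Dict[Node, List[Node]],
    edge_a: Tuple[Node, Node],
    edge_b: Tuple[Node, Node],
) -> int:
    """Number of nodes on the shortest path separating two edges of a tree."""
    if set(edge_a) == set(edge_b):
        return 0  # placed on the correct edge
    targets = set(edge_b)
    seen = set(edge_a)
    # Start from both ends of edge_a; each start already counts as one node.
    queue = deque((node, 1) for node in edge_a)
    while queue:
        node, count = queue.popleft()
        if node in targets:
            return count
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, count + 1))
    raise ValueError("edges are not connected")
```

<p>On a four-taxon tree with internal nodes u and v, the edges (A, u) and (B, u) are at distance one, while (A, u) and (C, v) are at distance two.</p>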
</sec>
<sec><title>Alignments and Reference Trees</title>
<p>Data for the analysis of speed and memory use was drawn from [<xref ref-type="bibr" rid="B72">72</xref>
]. The data came partitioned into two files, the smaller of which was used for the reference set. Sequences with at least 1200 non-gap characters were selected from the reference set and the sequence order was randomized. Reference trees were built on the first 200, 400, ..., 1600 sequences, and the other file was used as the query set.</p>
<p>The EPA to <monospace>pplacer</monospace>
 accuracy comparison was done using the simulation framework from [<xref ref-type="bibr" rid="B28">28</xref>
]. The same taxa were used to generate simulated nucleotide fragments, which had normally distributed length with mean 200 and standard deviation 60. These were aligned to the reference alignments using HMMER. Reference tree branch lengths were re-estimated using RAxML after deletion of the taxon used for simulation. The standard version of the EPA reroots the tree at an arbitrary location; Alexandros Stamatakis modified the code for this comparison so that the tree is rerooted at the lexicographically (i.e. alphabetically) smallest node, and the branch order is re-sorted similarly. Because of this rerooting and re-sorting, the error could not be judged directly from the reference tree, and so the correct placement was assumed to be that chosen by the EPA with a full-length sequence. Simulation data can be downloaded from <ext-link ext-link-type="uri" xlink:href="http://matsen.fhcrc.org/pplacer/data/10_EPA_comparison.tar.gz">http://matsen.fhcrc.org/pplacer/data/10_EPA_comparison.tar.gz</ext-link>.
</p>
<p>Alignments for the COG simulation were downloaded from the COG website [<xref ref-type="bibr" rid="B59">59</xref>
]. The alignments were screened for completeness and taxa with incomplete sequences were removed. Alignment ends were trimmed to eliminate excessive gaps on either end. For the GOS <italic>psbA </italic>
analysis, the two data sets All_Metagenomic_Reads and All_Assembled_Sequences were downloaded to a local computer cluster from CAMERA [<xref ref-type="bibr" rid="B73">73</xref>
]. A <italic>psbA </italic>
and <italic>psbD </italic>
reference alignment was made of eukaryotic plastid sequences retrieved from Genbank and was then extended to include all cyanobacteria found by an HMM search of a local copy of microbial refseq (from Genbank); alignment was done using Geneious alignment [<xref ref-type="bibr" rid="B74">74</xref>
] and was hand edited.</p>
</sec>
</sec>
<sec><title>Authors' contributions</title>
<p>FAM and RBK conceived of and developed the project. FAM did the coding, scripting, and simulation data analysis. RBK made and edited alignments and reference trees. FAM, RBK, and EVA analyzed the results and wrote the manuscript. All authors have read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
<p><bold>Supplemental Table 1 -- Example BLAST results</bold>
. Table showing how blastn will often retrieve the same GOS reads when given chloroplast and cyanobacterial <italic>psbA </italic>
query sequences. The first and fourth columns show the query names, and the second and fifth columns show the (identical) GOS top hits. The top 100 records shared by the results of each BLAST search are shown.</p>
</caption>
<media xlink:href="1471-2105-11-538-S1.TAB" mimetype="text" mime-subtype="plain">
</media>
</supplementary-material>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>Jonathan Eisen, Steve Evans, John Huelsenbeck and Rasmus Nielsen made helpful suggestions concerning phylogenetics, while Robert Bradley, Ruchira Datta, and Sean Eddy helped with the use of profile HMMs in this context. We thank the Center for Environmental Genomics at the University of Washington, in particular Chris Berthiaume and David Schruth, for computational assistance. David Schruth, Adrian Marchetti and Alexandros Stamatakis provided helpful suggestions on the manuscript. Simon Berger and Alexandros Stamatakis generously helped with simulation design and running of the EPA algorithm. FAM is especially grateful to Andrés Varón and Ward Wheeler, who made a number of suggestions which greatly improved the <monospace>pplacer</monospace>
 code. The following individuals from the <monospace>ocaml</monospace>
 listserv made helpful suggestions: Will M. Farr, Mauricio Fernandez, Stéphane Glondu, Jon D. Harrop, Xavier Leroy, Mike Lin, and Markus Mottl. FAM was supported by the Miller Institute for Basic Research at the University of California, Berkeley while doing this work. RBK is supported by the University of Washington Friday Harbor Laboratories. EVA is supported through a Gordon and Betty Moore Foundation Marine Microbiology Investigator award.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Margulies</surname>
<given-names>M</given-names>
</name>
<name><surname>Egholm</surname>
<given-names>M</given-names>
</name>
<name><surname>Altman</surname>
<given-names>W</given-names>
</name>
<name><surname>Attiya</surname>
<given-names>S</given-names>
</name>
<name><surname>Bader</surname>
<given-names>J</given-names>
</name>
<name><surname>Bemben</surname>
<given-names>L</given-names>
</name>
<name><surname>Berka</surname>
<given-names>J</given-names>
</name>
<name><surname>Braverman</surname>
<given-names>M</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
<article-title>Genome sequencing in microfabricated high-density picolitre reactors</article-title>
<source>Nature</source>
<year>2005</year>
<volume>437</volume>
<fpage>376</fpage>
<lpage>380</lpage>
<pub-id pub-id-type="pmid">16056220</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Culley</surname>
<given-names>A</given-names>
</name>
<name><surname>Lang</surname>
<given-names>A</given-names>
</name>
<name><surname>Suttle</surname>
<given-names>C</given-names>
</name>
<article-title>Metagenomic analysis of coastal RNA virus communities</article-title>
<source>Science</source>
<year>2006</year>
<volume>312</volume>
<issue>5781</issue>
<fpage>1795</fpage>
<lpage>1798</lpage>
<pub-id pub-id-type="doi">10.1126/science.1127404</pub-id>
<pub-id pub-id-type="pmid">16794078</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Gill</surname>
<given-names>S</given-names>
</name>
<name><surname>Pop</surname>
<given-names>M</given-names>
</name>
<name><surname>DeBoy</surname>
<given-names>R</given-names>
</name>
<name><surname>Eckburg</surname>
<given-names>P</given-names>
</name>
<name><surname>Turnbaugh</surname>
<given-names>P</given-names>
</name>
<name><surname>Samuel</surname>
<given-names>B</given-names>
</name>
<name><surname>Gordon</surname>
<given-names>J</given-names>
</name>
<name><surname>Relman</surname>
<given-names>D</given-names>
</name>
<name><surname>Fraser-Liggett</surname>
<given-names>C</given-names>
</name>
<name><surname>Nelson</surname>
<given-names>K</given-names>
</name>
<article-title>Metagenomic analysis of the human distal gut microbiome</article-title>
<source>Science</source>
<year>2006</year>
<volume>312</volume>
<issue>5778</issue>
<fpage>1355</fpage>
<lpage>1359</lpage>
<pub-id pub-id-type="doi">10.1126/science.1124234</pub-id>
<pub-id pub-id-type="pmid">16741115</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Venter</surname>
<given-names>J</given-names>
</name>
<name><surname>Remington</surname>
<given-names>K</given-names>
</name>
<name><surname>Heidelberg</surname>
<given-names>J</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>A</given-names>
</name>
<name><surname>Rusch</surname>
<given-names>D</given-names>
</name>
<name><surname>Eisen</surname>
<given-names>J</given-names>
</name>
<name><surname>Wu</surname>
<given-names>D</given-names>
</name>
<name><surname>Paulsen</surname>
<given-names>I</given-names>
</name>
<name><surname>Nelson</surname>
<given-names>K</given-names>
</name>
<name><surname>Nelson</surname>
<given-names>W</given-names>
</name>
<etal></etal>
<article-title>Environmental genome shotgun sequencing of the Sargasso Sea</article-title>
<source>Science</source>
<year>2004</year>
<volume>304</volume>
<issue>5667</issue>
<fpage>66</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="doi">10.1126/science.1093857</pub-id>
<pub-id pub-id-type="pmid">15001713</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Tringe</surname>
<given-names>S</given-names>
</name>
<name><surname>Rubin</surname>
<given-names>E</given-names>
</name>
<article-title>Metagenomics: DNA sequencing of environmental samples</article-title>
<source>Nat Rev Genet</source>
<year>2005</year>
<volume>6</volume>
<issue>11</issue>
<fpage>805</fpage>
<lpage>814</lpage>
<pub-id pub-id-type="doi">10.1038/nrg1709</pub-id>
<pub-id pub-id-type="pmid">16304596</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="journal"><name><surname>Martín</surname>
<given-names>H</given-names>
</name>
<name><surname>Ivanova</surname>
<given-names>N</given-names>
</name>
<name><surname>Kunin</surname>
<given-names>V</given-names>
</name>
<name><surname>Warnecke</surname>
<given-names>F</given-names>
</name>
<name><surname>Barry</surname>
<given-names>K</given-names>
</name>
<name><surname>McHardy</surname>
<given-names>A</given-names>
</name>
<name><surname>Yeates</surname>
<given-names>C</given-names>
</name>
<name><surname>He</surname>
<given-names>S</given-names>
</name>
<name><surname>Salamov</surname>
<given-names>A</given-names>
</name>
<name><surname>Szeto</surname>
<given-names>E</given-names>
</name>
<etal></etal>
<article-title>Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities</article-title>
<source>Nat Biotech</source>
<year>2006</year>
<volume>24</volume>
<fpage>1263</fpage>
<lpage>1269</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1247</pub-id>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Warnecke</surname>
<given-names>F</given-names>
</name>
<name><surname>Luginbühl</surname>
<given-names>P</given-names>
</name>
<name><surname>Ivanova</surname>
<given-names>N</given-names>
</name>
<name><surname>Ghassemian</surname>
<given-names>M</given-names>
</name>
<name><surname>Richardson</surname>
<given-names>T</given-names>
</name>
<name><surname>Stege</surname>
<given-names>J</given-names>
</name>
<name><surname>Cayouette</surname>
<given-names>M</given-names>
</name>
<name><surname>McHardy</surname>
<given-names>A</given-names>
</name>
<name><surname>Djordjevic</surname>
<given-names>G</given-names>
</name>
<name><surname>Aboushadi</surname>
<given-names>N</given-names>
</name>
<etal></etal>
<article-title>Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite</article-title>
<source>Nature</source>
<year>2007</year>
<volume>450</volume>
<issue>7169</issue>
<fpage>560</fpage>
<lpage>565</lpage>
<pub-id pub-id-type="doi">10.1038/nature06269</pub-id>
<pub-id pub-id-type="pmid">18033299</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><name><surname>Baker</surname>
<given-names>B</given-names>
</name>
<name><surname>Banfield</surname>
<given-names>J</given-names>
</name>
<article-title>Microbial communities in acid mine drainage</article-title>
<source>FEMS Microbiol Ecol</source>
<year>2003</year>
<volume>44</volume>
<issue>2</issue>
<fpage>139</fpage>
<lpage>152</lpage>
<pub-id pub-id-type="doi">10.1016/S0168-6496(03)00028-X</pub-id>
<pub-id pub-id-type="pmid">19719632</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>Altschul</surname>
<given-names>S</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W</given-names>
</name>
<name><surname>Myers</surname>
<given-names>E</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>D</given-names>
</name>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="journal"><name><surname>Huson</surname>
<given-names>D</given-names>
</name>
<name><surname>Auch</surname>
<given-names>A</given-names>
</name>
<name><surname>Qi</surname>
<given-names>J</given-names>
</name>
<name><surname>Schuster</surname>
<given-names>S</given-names>
</name>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res</source>
<year>2007</year>
<volume>17</volume>
<issue>3</issue>
<fpage>377</fpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>McHardy</surname>
<given-names>A</given-names>
</name>
<name><surname>Martín</surname>
<given-names>H</given-names>
</name>
<name><surname>Tsirigos</surname>
<given-names>A</given-names>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name><surname>Rigoutsos</surname>
<given-names>I</given-names>
</name>
<article-title>Accurate phylogenetic classification of variable-length DNA fragments</article-title>
<source>Nature Methods</source>
<year>2007</year>
<volume>4</volume>
<fpage>63</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth976</pub-id>
<pub-id pub-id-type="pmid">17179938</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="journal"><name><surname>Diaz</surname>
<given-names>N</given-names>
</name>
<name><surname>Krause</surname>
<given-names>L</given-names>
</name>
<name><surname>Goesmann</surname>
<given-names>A</given-names>
</name>
<name><surname>Niehaus</surname>
<given-names>K</given-names>
</name>
<name><surname>Nattkemper</surname>
<given-names>T</given-names>
</name>
<article-title>TACOA-Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach</article-title>
<source>BMC Bioinfo</source>
<year>2009</year>
<volume>10</volume>
<fpage>56</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-56</pub-id>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><name><surname>Brady</surname>
<given-names>A</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>S</given-names>
</name>
<article-title>Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models</article-title>
<source>Nature Methods</source>
<year>2009</year>
<volume>6</volume>
<issue>9</issue>
<fpage>673</fpage>
<lpage>676</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1358</pub-id>
<pub-id pub-id-type="pmid">19648916</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Allman</surname>
<given-names>E</given-names>
</name>
<name><surname>Rhodes</surname>
<given-names>J</given-names>
</name>
<article-title>The identifiability of tree topology for phylogenetic models, including covarion and mixture models</article-title>
<source>J Comput Biol</source>
<year>2006</year>
<volume>13</volume>
<issue>5</issue>
<fpage>1101</fpage>
<lpage>1113</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2006.13.1101</pub-id>
<pub-id pub-id-type="pmid">16796553</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="journal"><name><surname>Allman</surname>
<given-names>E</given-names>
</name>
<name><surname>Rhodes</surname>
<given-names>J</given-names>
</name>
<article-title>Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites</article-title>
<source>Math Biosci</source>
<year>2008</year>
<volume>211</volume>
<fpage>18</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="doi">10.1016/j.mbs.2007.09.001</pub-id>
<pub-id pub-id-type="pmid">17964612</pub-id>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><name><surname>Shimodaira</surname>
<given-names>H</given-names>
</name>
<name><surname>Hasegawa</surname>
<given-names>M</given-names>
</name>
<article-title>Multiple comparisons of log-likelihoods with applications to phylogenetic inference</article-title>
<source>Mol Biol Evol</source>
<year>1999</year>
<volume>16</volume>
<fpage>1114</fpage>
<lpage>1116</lpage>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="journal"><name><surname>Yang</surname>
<given-names>Z</given-names>
</name>
<article-title>Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods</article-title>
<source>J Mol Evol</source>
<year>1994</year>
<volume>39</volume>
<issue>3</issue>
<fpage>306</fpage>
<lpage>314</lpage>
<pub-id pub-id-type="doi">10.1007/BF00160154</pub-id>
<pub-id pub-id-type="pmid">7932792</pub-id>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Le</surname>
<given-names>S</given-names>
</name>
<name><surname>Gascuel</surname>
<given-names>O</given-names>
</name>
<article-title>An improved general amino acid replacement matrix</article-title>
<source>Mol Biol Evol</source>
<year>2008</year>
<volume>25</volume>
<issue>7</issue>
<fpage>1307</fpage>
<pub-id pub-id-type="doi">10.1093/molbev/msn067</pub-id>
<pub-id pub-id-type="pmid">18367465</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="other"><name><surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
<source>Inferring Phylogenies</source>
<year>2004</year>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>Chor</surname>
<given-names>B</given-names>
</name>
<name><surname>Tuller</surname>
<given-names>T</given-names>
</name>
<article-title>Finding a maximum likelihood tree is hard</article-title>
<source>J ACM</source>
<year>2006</year>
<volume>53</volume>
<issue>5</issue>
<fpage>744</fpage>
<pub-id pub-id-type="doi">10.1145/1183907.1183909</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="other"><name><surname>Roch</surname>
<given-names>S</given-names>
</name>
<article-title>A short proof that phylogenetic tree reconstruction by maximum likelihood is hard</article-title>
<source>IEEE/ACM TCBB</source>
<year>2006</year>
<fpage>92</fpage>
<lpage>94</lpage>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="other"><name><surname>Guindon</surname>
<given-names>S</given-names>
</name>
<name><surname>Gascuel</surname>
<given-names>O</given-names>
</name>
<article-title>A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood</article-title>
<source>Syst Biol</source>
<year>2003</year>
<fpage>696</fpage>
<lpage>704</lpage>
<pub-id pub-id-type="doi">10.1080/10635150390235520</pub-id>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
<article-title>RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>21</issue>
<fpage>2688</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl446</pub-id>
<pub-id pub-id-type="pmid">16928733</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="book"><name><surname>Zwickl</surname>
<given-names>D</given-names>
</name>
<article-title>Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion</article-title>
<source>PhD thesis</source>
<year>2006</year>
<publisher-name>The University of Texas at Austin</publisher-name>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Price</surname>
<given-names>MN</given-names>
</name>
<name><surname>Dehal</surname>
<given-names>PS</given-names>
</name>
<name><surname>Arkin</surname>
<given-names>AP</given-names>
</name>
<article-title>FastTree 2: Approximately Maximum-Likelihood Trees for Large Alignments</article-title>
<source>PLoS ONE</source>
<year>2010</year>
<volume>5</volume>
<issue>3</issue>
<fpage>e9490</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0009490</pub-id>
<pub-id pub-id-type="pmid">20224823</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Steel</surname>
<given-names>M</given-names>
</name>
<name><surname>Székely</surname>
<given-names>L</given-names>
</name>
<article-title>Inverting random functions II: Explicit bounds for discrete maximum likelihood estimation, with applications</article-title>
<source>SIAM J Discrete Math</source>
<year>2002</year>
<volume>15</volume>
<issue>4</issue>
<fpage>562</fpage>
<lpage>578</lpage>
<pub-id pub-id-type="doi">10.1137/S089548010138790X</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="other"><name><surname>Moret</surname>
<given-names>B</given-names>
</name>
<name><surname>Roshan</surname>
<given-names>U</given-names>
</name>
<name><surname>Warnow</surname>
<given-names>T</given-names>
</name>
<article-title>Sequence-length requirements for phylogenetic methods</article-title>
<source>Lecture Notes in Computer Science</source>
<year>2002</year>
<fpage>343</fpage>
<lpage>356</lpage>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="other"><name><surname>Berger</surname>
<given-names>S</given-names>
</name>
<name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
<article-title>Evolutionary Placement of Short Sequence Reads</article-title>
<source>Submitted to Sys Biol</source>
<year>2009</year>
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/0911.2852">http://arxiv.org/abs/0911.2852</ext-link>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Margulies</surname>
<given-names>M</given-names>
</name>
<name><surname>Egholm</surname>
<given-names>M</given-names>
</name>
<name><surname>Altman</surname>
<given-names>W</given-names>
</name>
<name><surname>Attiya</surname>
<given-names>S</given-names>
</name>
<name><surname>Bader</surname>
<given-names>J</given-names>
</name>
<name><surname>Bemben</surname>
<given-names>L</given-names>
</name>
<name><surname>Berka</surname>
<given-names>J</given-names>
</name>
<name><surname>Braverman</surname>
<given-names>M</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
<article-title>Genome sequencing in open microfabricated high density picoliter reactors</article-title>
<source>Nature</source>
<year>2005</year>
<volume>437</volume>
<issue>7057</issue>
<fpage>376</fpage>
<pub-id pub-id-type="pmid">16056220</pub-id>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><name><surname>Mardis</surname>
<given-names>E</given-names>
</name>
<article-title>Next-generation DNA sequencing methods</article-title>
<source>Ann Rev Genomics Human Genet</source>
<year>2008</year>
<volume>9</volume>
<fpage>387</fpage>
<pub-id pub-id-type="doi">10.1146/annurev.genom.9.081307.164359</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><name><surname>Lemmon</surname>
<given-names>A</given-names>
</name>
<name><surname>Brown</surname>
<given-names>J</given-names>
</name>
<name><surname>Stanger-Hall</surname>
<given-names>K</given-names>
</name>
<name><surname>Lemmon</surname>
<given-names>E</given-names>
</name>
<article-title>The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference</article-title>
<source>Syst Biol</source>
<year>2009</year>
<volume>58</volume>
<fpage>130</fpage>
<pub-id pub-id-type="doi">10.1093/sysbio/syp017</pub-id>
<pub-id pub-id-type="pmid">20525573</pub-id>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="other"><article-title>Archaeopteryx</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.phylosoft.org/archaeopteryx/">http://www.phylosoft.org/archaeopteryx/</ext-link>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="other"><article-title>Dendroscope</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www-ab.informatik.uni-tuebingen.de/software/dendroscope">http://www-ab.informatik.uni-tuebingen.de/software/dendroscope</ext-link>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="journal"><name><surname>Mooers</surname>
<given-names>A</given-names>
</name>
<name><surname>Heard</surname>
<given-names>S</given-names>
</name>
<article-title>Evolutionary process from phylogenetic tree shape</article-title>
<source>Q Rev Biol</source>
<year>1997</year>
<volume>72</volume>
<fpage>31</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="doi">10.1086/419657</pub-id>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="journal"><name><surname>Lozupone</surname>
<given-names>C</given-names>
</name>
<name><surname>Knight</surname>
<given-names>R</given-names>
</name>
<article-title>UniFrac: a new phylogenetic method for comparing microbial communities</article-title>
<source>Appl Environ Microbiol</source>
<year>2005</year>
<volume>71</volume>
<issue>12</issue>
<fpage>8228</fpage>
<pub-id pub-id-type="doi">10.1128/AEM.71.12.8228-8235.2005</pub-id>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="other"><name><surname>Kluge</surname>
<given-names>A</given-names>
</name>
<name><surname>Farris</surname>
<given-names>J</given-names>
</name>
<article-title>Quantitative phyletics and the evolution of anurans</article-title>
<source>Syst Zool</source>
<year>1969</year>
<fpage>1</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.2307/2412407</pub-id>
</mixed-citation>
</ref>
<ref id="B37"><mixed-citation publication-type="journal"><name><surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
<article-title>Evolutionary trees from DNA sequences: a maximum likelihood approach</article-title>
<source>J Mol Evol</source>
<year>1981</year>
<volume>17</volume>
<issue>6</issue>
<fpage>368</fpage>
<lpage>376</lpage>
<pub-id pub-id-type="doi">10.1007/BF01734359</pub-id>
<pub-id pub-id-type="pmid">7288891</pub-id>
</mixed-citation>
</ref>
<ref id="B38"><mixed-citation publication-type="journal"><name><surname>Monier</surname>
<given-names>A</given-names>
</name>
<name><surname>Claverie</surname>
<given-names>J</given-names>
</name>
<name><surname>Ogata</surname>
<given-names>H</given-names>
</name>
<article-title>Taxonomic distribution of large DNA viruses in the sea</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>7</issue>
<fpage>R106</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2008-9-7-r106</pub-id>
<pub-id pub-id-type="pmid">18598358</pub-id>
</mixed-citation>
</ref>
<ref id="B39"><mixed-citation publication-type="journal"><name><surname>Von Mering</surname>
<given-names>C</given-names>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name><surname>Raes</surname>
<given-names>J</given-names>
</name>
<name><surname>Tringe</surname>
<given-names>S</given-names>
</name>
<name><surname>Doerks</surname>
<given-names>T</given-names>
</name>
<name><surname>Jensen</surname>
<given-names>L</given-names>
</name>
<name><surname>Ward</surname>
<given-names>N</given-names>
</name>
<name><surname>Bork</surname>
<given-names>P</given-names>
</name>
<article-title>Quantitative phylogenetic assessment of microbial communities in diverse environments</article-title>
<source>Science</source>
<year>2007</year>
<volume>315</volume>
<issue>5815</issue>
<fpage>1126</fpage>
<pub-id pub-id-type="doi">10.1126/science.1133420</pub-id>
<pub-id pub-id-type="pmid">17272687</pub-id>
</mixed-citation>
</ref>
<ref id="B40"><mixed-citation publication-type="journal"><name><surname>Kosakovsky Pond</surname>
<given-names>S</given-names>
</name>
<name><surname>Posada</surname>
<given-names>D</given-names>
</name>
<name><surname>Stawiski</surname>
<given-names>E</given-names>
</name>
<name><surname>Chappey</surname>
<given-names>C</given-names>
</name>
<name><surname>Poon</surname>
<given-names>A</given-names>
</name>
<name><surname>Hughes</surname>
<given-names>G</given-names>
</name>
<name><surname>Fearnhill</surname>
<given-names>E</given-names>
</name>
<name><surname>Gravenor</surname>
<given-names>M</given-names>
</name>
<name><surname>Leigh Brown</surname>
<given-names>A</given-names>
</name>
<name><surname>Frost</surname>
<given-names>S</given-names>
</name>
<article-title>An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1</article-title>
<source>PLoS Comp Biol</source>
<year>2009</year>
<volume>5</volume>
<issue>11</issue>
<fpage>e1000581</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000581</pub-id>
</mixed-citation>
</ref>
<ref id="B41"><mixed-citation publication-type="journal"><name><surname>Zwickl</surname>
<given-names>D</given-names>
</name>
<name><surname>Hillis</surname>
<given-names>D</given-names>
</name>
<article-title>Increased taxon sampling greatly reduces phylogenetic error</article-title>
<source>Syst Biol</source>
<year>2002</year>
<volume>51</volume>
<issue>4</issue>
<fpage>588</fpage>
<pub-id pub-id-type="doi">10.1080/10635150290102339</pub-id>
</mixed-citation>
</ref>
<ref id="B42"><mixed-citation publication-type="other"><name><surname>Cueto</surname>
<given-names>M</given-names>
</name>
<name><surname>Matsen</surname>
<given-names>F</given-names>
</name>
<article-title>The polyhedral geometry of phylogenetic rogue taxa</article-title>
<source>In press Bull Math Biol</source>
<year>2010</year>
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1001.5241">http://arxiv.org/abs/1001.5241</ext-link>
</mixed-citation>
</ref>
<ref id="B43"><mixed-citation publication-type="journal"><name><surname>Munch</surname>
<given-names>K</given-names>
</name>
<name><surname>Boomsma</surname>
<given-names>W</given-names>
</name>
<name><surname>Willerslev</surname>
<given-names>E</given-names>
</name>
<name><surname>Nielsen</surname>
<given-names>R</given-names>
</name>
<article-title>Fast phylogenetic DNA barcoding</article-title>
<source>Phil Trans Royal Soc B</source>
<year>2008</year>
<volume>363</volume>
<issue>1512</issue>
<fpage>3997</fpage>
<lpage>4002</lpage>
<pub-id pub-id-type="doi">10.1098/rstb.2008.0169</pub-id>
</mixed-citation>
</ref>
<ref id="B44"><mixed-citation publication-type="other"><name><surname>Drummond</surname>
<given-names>A</given-names>
</name>
<name><surname>Rambaut</surname>
<given-names>A</given-names>
</name>
<article-title>BEAST v1.0</article-title>
<year>2003</year>
<ext-link ext-link-type="uri" xlink:href="http://beast.bio.ed.ac.uk/">http://beast.bio.ed.ac.uk/</ext-link>
</mixed-citation>
</ref>
<ref id="B45"><mixed-citation publication-type="journal"><name><surname>Huelsenbeck</surname>
<given-names>JP</given-names>
</name>
<name><surname>Ronquist</surname>
<given-names>F</given-names>
</name>
<article-title>MRBAYES: Bayesian inference of phylogeny</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>754</fpage>
<lpage>755</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.8.754</pub-id>
<pub-id pub-id-type="pmid">11524383</pub-id>
</mixed-citation>
</ref>
<ref id="B46"><mixed-citation publication-type="journal"><name><surname>Whelan</surname>
<given-names>S</given-names>
</name>
<name><surname>Goldman</surname>
<given-names>N</given-names>
</name>
<article-title>A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach</article-title>
<source>Mol Biol Evol</source>
<year>2001</year>
<volume>18</volume>
<issue>5</issue>
<fpage>691</fpage>
<lpage>699</lpage>
<pub-id pub-id-type="pmid">11319253</pub-id>
</mixed-citation>
</ref>
<ref id="B47"><mixed-citation publication-type="other"><article-title>Objective Caml</article-title>
<ext-link ext-link-type="uri" xlink:href="http://caml.inria.fr/ocaml/index.en.html">http://caml.inria.fr/ocaml/index.en.html</ext-link>
</mixed-citation>
</ref>
<ref id="B48"><mixed-citation publication-type="other"><article-title>The GNU scientific library</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.gnu.org/software/gsl/">http://www.gnu.org/software/gsl/</ext-link>
</mixed-citation>
</ref>
<ref id="B49"><mixed-citation publication-type="journal"><name><surname>Han</surname>
<given-names>M</given-names>
</name>
<name><surname>Zmasek</surname>
<given-names>C</given-names>
</name>
<article-title>phyloXML: XML for evolutionary biology and comparative genomics</article-title>
<source>BMC Bioinfo</source>
<year>2009</year>
<volume>10</volume>
<fpage>356</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-356</pub-id>
</mixed-citation>
</ref>
<ref id="B50"><mixed-citation publication-type="journal"><name><surname>Zurawski</surname>
<given-names>G</given-names>
</name>
<name><surname>Bohnert</surname>
<given-names>H</given-names>
</name>
<name><surname>Whitfeld</surname>
<given-names>P</given-names>
</name>
<name><surname>Bottomley</surname>
<given-names>W</given-names>
</name>
<article-title>Nucleotide sequence of the gene for the Mr 32,000 thylakoid membrane protein from Spinacia oleracea and Nicotiana debneyi predicts a totally conserved primary translation product of Mr 38,950</article-title>
<source>Proc Nat Acad Sci</source>
<year>1982</year>
<volume>79</volume>
<issue>24</issue>
<fpage>7699</fpage>
<lpage>7703</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.79.24.7699</pub-id>
<pub-id pub-id-type="pmid">16593262</pub-id>
</mixed-citation>
</ref>
<ref id="B51"><mixed-citation publication-type="journal"><name><surname>Zeidner</surname>
<given-names>G</given-names>
</name>
<name><surname>Preston</surname>
<given-names>C</given-names>
</name>
<name><surname>Delong</surname>
<given-names>E</given-names>
</name>
<name><surname>Massana</surname>
<given-names>R</given-names>
</name>
<name><surname>Post</surname>
<given-names>A</given-names>
</name>
<name><surname>Scanlan</surname>
<given-names>D</given-names>
</name>
<name><surname>Beja</surname>
<given-names>O</given-names>
</name>
<article-title>Molecular diversity among marine picophytoplankton as revealed by psbA analyses</article-title>
<source>Environ Microbiol</source>
<year>2003</year>
<volume>5</volume>
<issue>3</issue>
<fpage>212</fpage>
<pub-id pub-id-type="doi">10.1046/j.1462-2920.2003.00403.x</pub-id>
<pub-id pub-id-type="pmid">12588300</pub-id>
</mixed-citation>
</ref>
<ref id="B52"><mixed-citation publication-type="journal"><name><surname>Sullivan</surname>
<given-names>M</given-names>
</name>
<name><surname>Lindell</surname>
<given-names>D</given-names>
</name>
<name><surname>Lee</surname>
<given-names>J</given-names>
</name>
<name><surname>Thompson</surname>
<given-names>L</given-names>
</name>
<name><surname>Bielawski</surname>
<given-names>J</given-names>
</name>
<name><surname>Chisholm</surname>
<given-names>S</given-names>
</name>
<article-title>Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts</article-title>
<source>PLoS Biol</source>
<year>2006</year>
<volume>4</volume>
<issue>8</issue>
<fpage>e234</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pbio.0040234</pub-id>
<pub-id pub-id-type="pmid">16802857</pub-id>
</mixed-citation>
</ref>
<ref id="B53"><mixed-citation publication-type="journal"><name><surname>Millard</surname>
<given-names>A</given-names>
</name>
<name><surname>Clokie</surname>
<given-names>M</given-names>
</name>
<name><surname>Shub</surname>
<given-names>D</given-names>
</name>
<name><surname>Mann</surname>
<given-names>N</given-names>
</name>
<article-title>Genetic organization of the psbAD region in phages infecting marine Synechococcus strains</article-title>
<source>PNAS</source>
<year>2004</year>
<volume>101</volume>
<issue>30</issue>
<fpage>11007</fpage>
<pub-id pub-id-type="doi">10.1073/pnas.0401478101</pub-id>
<pub-id pub-id-type="pmid">15263091</pub-id>
</mixed-citation>
</ref>
<ref id="B54"><mixed-citation publication-type="journal"><name><surname>Lindell</surname>
<given-names>D</given-names>
</name>
<name><surname>Jaffe</surname>
<given-names>J</given-names>
</name>
<name><surname>Coleman</surname>
<given-names>M</given-names>
</name>
<name><surname>Futschik</surname>
<given-names>M</given-names>
</name>
<name><surname>Axmann</surname>
<given-names>I</given-names>
</name>
<name><surname>Rector</surname>
<given-names>T</given-names>
</name>
<name><surname>Kettler</surname>
<given-names>G</given-names>
</name>
<name><surname>Sullivan</surname>
<given-names>M</given-names>
</name>
<name><surname>Steen</surname>
<given-names>R</given-names>
</name>
<name><surname>Hess</surname>
<given-names>W</given-names>
</name>
<etal></etal>
<article-title>Genome-wide expression dynamics of a marine virus and host reveal features of co-evolution</article-title>
<source>Nature</source>
<year>2007</year>
<volume>449</volume>
<issue>7158</issue>
<fpage>83</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1038/nature06130</pub-id>
<pub-id pub-id-type="pmid">17805294</pub-id>
</mixed-citation>
</ref>
<ref id="B55"><mixed-citation publication-type="journal"><name><surname>Chenard</surname>
<given-names>C</given-names>
</name>
<name><surname>Suttle</surname>
<given-names>C</given-names>
</name>
<article-title>Phylogenetic diversity of sequences of cyanophage photosynthetic gene psbA in marine and freshwaters</article-title>
<source>Appl Environ Microbiol</source>
<year>2008</year>
<volume>74</volume>
<issue>17</issue>
<fpage>5317</fpage>
<pub-id pub-id-type="doi">10.1128/AEM.02480-07</pub-id>
</mixed-citation>
</ref>
<ref id="B56"><mixed-citation publication-type="journal"><name><surname>Williamson</surname>
<given-names>S</given-names>
</name>
<name><surname>Rusch</surname>
<given-names>D</given-names>
</name>
<name><surname>Yooseph</surname>
<given-names>S</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>A</given-names>
</name>
<name><surname>Heidelberg</surname>
<given-names>K</given-names>
</name>
<name><surname>Glass</surname>
<given-names>J</given-names>
</name>
<name><surname>Andrews-Pfannkoch</surname>
<given-names>C</given-names>
</name>
<name><surname>Fadrosh</surname>
<given-names>D</given-names>
</name>
<name><surname>Miller</surname>
<given-names>C</given-names>
</name>
<name><surname>Sutton</surname>
<given-names>G</given-names>
</name>
<etal></etal>
<article-title>The Sorcerer II Global Ocean Sampling Expedition: metagenomic characterization of viruses within aquatic microbial samples</article-title>
<source>PLoS ONE</source>
<year>2008</year>
<volume>3</volume>
<issue>1</issue>
<fpage>e1456</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0001456</pub-id>
<pub-id pub-id-type="pmid">18213365</pub-id>
</mixed-citation>
</ref>
<ref id="B57"><mixed-citation publication-type="journal"><name><surname>Sharon</surname>
<given-names>I</given-names>
</name>
<name><surname>Tzahor</surname>
<given-names>S</given-names>
</name>
<name><surname>Williamson</surname>
<given-names>S</given-names>
</name>
<name><surname>Shmoish</surname>
<given-names>M</given-names>
</name>
<name><surname>Man-Aharonovich</surname>
<given-names>D</given-names>
</name>
<name><surname>Rusch</surname>
<given-names>D</given-names>
</name>
<name><surname>Yooseph</surname>
<given-names>S</given-names>
</name>
<name><surname>Zeidner</surname>
<given-names>G</given-names>
</name>
<name><surname>Golden</surname>
<given-names>S</given-names>
</name>
<name><surname>Mackey</surname>
<given-names>S</given-names>
</name>
<etal></etal>
<article-title>Viral photosynthetic reaction center genes and transcripts in the marine environment</article-title>
<source>The ISME Journal</source>
<year>2007</year>
<volume>1</volume>
<issue>6</issue>
<fpage>492</fpage>
<lpage>501</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2007.67</pub-id>
<pub-id pub-id-type="pmid">18043651</pub-id>
</mixed-citation>
</ref>
<ref id="B58"><mixed-citation publication-type="journal"><name><surname>Eddy</surname>
<given-names>S</given-names>
</name>
<article-title>Profile hidden Markov models</article-title>
<source>Bioinformatics</source>
<year>1998</year>
<volume>14</volume>
<issue>9</issue>
<fpage>755</fpage>
<lpage>763</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/14.9.755</pub-id>
<pub-id pub-id-type="pmid">9918945</pub-id>
</mixed-citation>
</ref>
<ref id="B59"><mixed-citation publication-type="journal"><name><surname>Tatusov</surname>
<given-names>R</given-names>
</name>
<name><surname>Galperin</surname>
<given-names>M</given-names>
</name>
<name><surname>Natale</surname>
<given-names>D</given-names>
</name>
<name><surname>Koonin</surname>
<given-names>E</given-names>
</name>
<article-title>The COG database: a tool for genome-scale analysis of protein functions and evolution</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<fpage>33</fpage>
<pub-id pub-id-type="doi">10.1093/nar/28.1.33</pub-id>
<pub-id pub-id-type="pmid">10592175</pub-id>
</mixed-citation>
</ref>
<ref id="B60"><mixed-citation publication-type="journal"><name><surname>Stark</surname>
<given-names>M</given-names>
</name>
<name><surname>Berger</surname>
<given-names>S</given-names>
</name>
<name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
<name><surname>von Mering</surname>
<given-names>C</given-names>
</name>
<article-title>MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies</article-title>
<source>BMC Genomics</source>
<year>2010</year>
<volume>11</volume>
<fpage>461</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-11-461</pub-id>
<pub-id pub-id-type="pmid">20687950</pub-id>
</mixed-citation>
</ref>
<ref id="B61"><mixed-citation publication-type="other"><name><surname>Krause</surname>
<given-names>L</given-names>
</name>
<name><surname>Diaz</surname>
<given-names>N</given-names>
</name>
<name><surname>Goesmann</surname>
<given-names>A</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name><surname>Nattkemper</surname>
<given-names>T</given-names>
</name>
<name><surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name><surname>Edwards</surname>
<given-names>R</given-names>
</name>
<name><surname>Stoye</surname>
<given-names>J</given-names>
</name>
<article-title>Phylogenetic classification of short environmental DNA fragments</article-title>
<source>Nucleic Acids Res</source>
<year>2008</year>
<pub-id pub-id-type="pmid">18285365</pub-id>
</mixed-citation>
</ref>
<ref id="B62"><mixed-citation publication-type="journal"><name><surname>Munch</surname>
<given-names>K</given-names>
</name>
<name><surname>Boomsma</surname>
<given-names>W</given-names>
</name>
<name><surname>Huelsenbeck</surname>
<given-names>J</given-names>
</name>
<name><surname>Willerslev</surname>
<given-names>E</given-names>
</name>
<name><surname>Nielsen</surname>
<given-names>R</given-names>
</name>
<article-title>Statistical Assignment of DNA Sequences Using Bayesian Phylogenetics</article-title>
<source>Syst Biol</source>
<year>2008</year>
<volume>57</volume>
<issue>5</issue>
<fpage>750</fpage>
<lpage>757</lpage>
<pub-id pub-id-type="doi">10.1080/10635150802422316</pub-id>
</mixed-citation>
</ref>
<ref id="B63"><mixed-citation publication-type="book"><name><surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
<article-title>PHYLIP (Phylogeny Inference Package) version 3.6</article-title>
<source>Distributed by the author</source>
<year>2004</year>
<publisher-name>Department of Genome Sciences, University of Washington, Seattle</publisher-name>
</mixed-citation>
</ref>
<ref id="B64"><mixed-citation publication-type="journal"><name><surname>Schmidt</surname>
<given-names>H</given-names>
</name>
<name><surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<name><surname>Vingron</surname>
<given-names>M</given-names>
</name>
<name><surname>von Haeseler</surname>
<given-names>A</given-names>
</name>
<article-title>TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>3</issue>
<fpage>502</fpage>
<lpage>504</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.3.502</pub-id>
<pub-id pub-id-type="pmid">11934758</pub-id>
</mixed-citation>
</ref>
<ref id="B65"><mixed-citation publication-type="journal"><name><surname>Kishino</surname>
<given-names>H</given-names>
</name>
<name><surname>Miyata</surname>
<given-names>T</given-names>
</name>
<name><surname>Hasegawa</surname>
<given-names>M</given-names>
</name>
<article-title>Maximum likelihood inference of protein phylogeny and the origin of chloroplasts</article-title>
<source>J Mol Evol</source>
<year>1990</year>
<volume>31</volume>
<issue>2</issue>
<fpage>151</fpage>
<lpage>160</lpage>
<pub-id pub-id-type="doi">10.1007/BF02109483</pub-id>
</mixed-citation>
</ref>
<ref id="B66"><mixed-citation publication-type="journal"><name><surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<name><surname>Rambaut</surname>
<given-names>A</given-names>
</name>
<article-title>Inferring confidence sets of possibly misspecified gene trees</article-title>
<source>Proc Royal Soc B</source>
<year>2002</year>
<volume>269</volume>
<issue>1487</issue>
<fpage>137</fpage>
<lpage>142</lpage>
<pub-id pub-id-type="doi">10.1098/rspb.2001.1862</pub-id>
</mixed-citation>
</ref>
<ref id="B67"><mixed-citation publication-type="journal"><name><surname>Wu</surname>
<given-names>M</given-names>
</name>
<name><surname>Eisen</surname>
<given-names>J</given-names>
</name>
<article-title>A simple, fast, and accurate method of phylogenomic inference</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>10</issue>
<fpage>R151</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2008-9-10-r151</pub-id>
<pub-id pub-id-type="pmid">18851752</pub-id>
</mixed-citation>
</ref>
<ref id="B68"><mixed-citation publication-type="other"><name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
<name><surname>Komornik</surname>
<given-names>Z</given-names>
</name>
<name><surname>Berger</surname>
<given-names>S</given-names>
</name>
<article-title>Evolutionary placement of short sequence reads on multi-core architectures</article-title>
<source>Proceedings of AICCSA-10, at 8th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-10), Hammamet, Tunisia</source>
<year>2010</year>
</mixed-citation>
</ref>
<ref id="B69"><mixed-citation publication-type="other"><name><surname>Evans</surname>
<given-names>S</given-names>
</name>
<name><surname>Matsen</surname>
<given-names>F</given-names>
</name>
<article-title>The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples</article-title>
<source>submitted to JRSS B</source>
<year>2010</year>
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1005.1699v2">http://arxiv.org/abs/1005.1699v2</ext-link>
</mixed-citation>
</ref>
<ref id="B70"><mixed-citation publication-type="journal"><name><surname>Lozupone</surname>
<given-names>C</given-names>
</name>
<name><surname>Hamady</surname>
<given-names>M</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name><surname>Knight</surname>
<given-names>R</given-names>
</name>
<article-title>Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities</article-title>
<source>Appl Environ Microbiol</source>
<year>2007</year>
<volume>73</volume>
<issue>5</issue>
<fpage>1576</fpage>
<pub-id pub-id-type="doi">10.1128/AEM.01996-06</pub-id>
<pub-id pub-id-type="pmid">17220268</pub-id>
</mixed-citation>
</ref>
<ref id="B71"><mixed-citation publication-type="other"><article-title>Pplacer Github repository</article-title>
<ext-link ext-link-type="uri" xlink:href="http://github.com/matsen/pplacer">http://github.com/matsen/pplacer</ext-link>
</mixed-citation>
</ref>
<ref id="B72"><mixed-citation publication-type="journal"><name><surname>Turnbaugh</surname>
<given-names>P</given-names>
</name>
<name><surname>Hamady</surname>
<given-names>M</given-names>
</name>
<name><surname>Yatsunenko</surname>
<given-names>T</given-names>
</name>
<name><surname>Cantarel</surname>
<given-names>B</given-names>
</name>
<name><surname>Duncan</surname>
<given-names>A</given-names>
</name>
<name><surname>Ley</surname>
<given-names>R</given-names>
</name>
<name><surname>Sogin</surname>
<given-names>M</given-names>
</name>
<name><surname>Jones</surname>
<given-names>W</given-names>
</name>
<name><surname>Roe</surname>
<given-names>B</given-names>
</name>
<name><surname>Affourtit</surname>
<given-names>J</given-names>
</name>
<etal></etal>
<article-title>A core gut microbiome in obese and lean twins</article-title>
<source>Nature</source>
<year>2008</year>
<volume>457</volume>
<issue>7228</issue>
<fpage>480</fpage>
<lpage>484</lpage>
<pub-id pub-id-type="doi">10.1038/nature07540</pub-id>
<pub-id pub-id-type="pmid">19043404</pub-id>
</mixed-citation>
</ref>
<ref id="B73"><mixed-citation publication-type="other"><article-title>CAMERA - Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis</article-title>
<ext-link ext-link-type="uri" xlink:href="http://camera.calit2.net/">http://camera.calit2.net/</ext-link>
</mixed-citation>
</ref>
<ref id="B74"><mixed-citation publication-type="other"><name><surname>Drummond</surname>
<given-names>A</given-names>
</name>
<name><surname>Ashton</surname>
<given-names>B</given-names>
</name>
<name><surname>Cheung</surname>
<given-names>M</given-names>
</name>
<etal></etal>
<article-title>Geneious Version 3.5</article-title>
<year>2007</year>
</mixed-citation>
</ref>
<ref id="B75"><mixed-citation publication-type="other"><article-title>FigTree</article-title>
<ext-link ext-link-type="uri" xlink:href="http://tree.bio.ed.ac.uk/software/figtree/">http://tree.bio.ed.ac.uk/software/figtree/</ext-link>
</mixed-citation>
</ref>
<ref id="B76"><mixed-citation publication-type="other"><name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
<article-title>Phylogenetic models of rate heterogeneity: a high performance computing perspective</article-title>
<source>Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International</source>
<year>2006</year>
<fpage>8</fpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000480 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000480 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3098090
   |texte=   pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21034504" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024
