Cyberinfrastructure exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences

Internal identifier: 000603 (Pmc/Corpus); previous: 000602; next: 000604

Authors: Gaurav Bhardwaj; Kyung Dae Ko; Yoojin Hong; Zhenhai Zhang; Ngai Lam Ho; Sree V. Chintapalli; Lindsay A. Kline; Matthew Gotlin; David Nicholas Hartranft; Morgen E. Patterson; Foram Dave; Evan J. Smith; Edward C. Holmes; Randen L. Patterson; Damian B. Van Rossum

RBID: PMC:3325999

Abstract

Both multiple sequence alignment and phylogenetic analysis are problematic in the “twilight zone” of sequence similarity (≤25% amino acid identity). Herein we explore the accuracy of phylogenetic inference at extreme sequence divergence using a variety of simulated data sets. We evaluate four leading multiple sequence alignment (MSA) methods (MAFFT, T-COFFEE, CLUSTAL, and MUSCLE) and six commonly used programs of tree estimation (Distance-based: Neighbor-Joining; Character-based: PhyML, RAxML, GARLI, Maximum Parsimony, and Bayesian) against a novel MSA-independent method (PHYRN) described here. Strikingly, at “midnight zone” genetic distances (∼7% pairwise identity and 4.0 gaps per position), PHYRN returns high-resolution phylogenies that outperform traditional approaches. We reason this is due to PHYRN's capability to amplify informative positions, even at the most extreme levels of sequence divergence. We also assess the applicability of the PHYRN algorithm for inferring deep evolutionary relationships in the divergent DANGER protein superfamily, for which PHYRN infers a more robust tree compared to MSA-based approaches. Taken together, these results demonstrate that PHYRN represents a powerful mechanism for mapping uncharted frontiers in highly divergent protein sequence data sets.


DOI: 10.1371/journal.pone.0034261
PubMed: 22514627
PubMed Central: 3325999

Links to Exploration step

PMC:3325999

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences</title>
<author>
<name sortKey="Bhardwaj, Gaurav" sort="Bhardwaj, Gaurav" uniqKey="Bhardwaj G" first="Gaurav" last="Bhardwaj">Gaurav Bhardwaj</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff6">
<addr-line>Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ko, Kyung Dae" sort="Ko, Kyung Dae" uniqKey="Ko K" first="Kyung Dae" last="Ko">Kyung Dae Ko</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hong, Yoojin" sort="Hong, Yoojin" uniqKey="Hong Y" first="Yoojin" last="Hong">Yoojin Hong</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhenhai" sort="Zhang, Zhenhai" uniqKey="Zhang Z" first="Zhenhai" last="Zhang">Zhenhai Zhang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff4">
<addr-line>Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ho, Ngai Lam" sort="Ho, Ngai Lam" uniqKey="Ho N" first="Ngai Lam" last="Ho">Ngai Lam Ho</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chintapalli, Sree V" sort="Chintapalli, Sree V" uniqKey="Chintapalli S" first="Sree V." last="Chintapalli">Sree V. Chintapalli</name>
<affiliation>
<nlm:aff id="aff7">
<addr-line>Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kline, Lindsay A" sort="Kline, Lindsay A" uniqKey="Kline L" first="Lindsay A." last="Kline">Lindsay A. Kline</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gotlin, Matthew" sort="Gotlin, Matthew" uniqKey="Gotlin M" first="Matthew" last="Gotlin">Matthew Gotlin</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hartranft, David Nicholas" sort="Hartranft, David Nicholas" uniqKey="Hartranft D" first="David Nicholas" last="Hartranft">David Nicholas Hartranft</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Patterson, Morgen E" sort="Patterson, Morgen E" uniqKey="Patterson M" first="Morgen E." last="Patterson">Morgen E. Patterson</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dave, Foram" sort="Dave, Foram" uniqKey="Dave F" first="Foram" last="Dave">Foram Dave</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smith, Evan J" sort="Smith, Evan J" uniqKey="Smith E" first="Evan J." last="Smith">Evan J. Smith</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Holmes, Edward C" sort="Holmes, Edward C" uniqKey="Holmes E" first="Edward C." last="Holmes">Edward C. Holmes</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff5">
<addr-line>Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Patterson, Randen L" sort="Patterson, Randen L" uniqKey="Patterson R" first="Randen L." last="Patterson">Randen L. Patterson</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff6">
<addr-line>Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff7">
<addr-line>Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Rossum, Damian B" sort="Van Rossum, Damian B" uniqKey="Van Rossum D" first="Damian B." last="Van Rossum">Damian B. Van Rossum</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22514627</idno>
<idno type="pmc">3325999</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3325999</idno>
<idno type="RBID">PMC:3325999</idno>
<idno type="doi">10.1371/journal.pone.0034261</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000603</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences</title>
<author>
<name sortKey="Bhardwaj, Gaurav" sort="Bhardwaj, Gaurav" uniqKey="Bhardwaj G" first="Gaurav" last="Bhardwaj">Gaurav Bhardwaj</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff6">
<addr-line>Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ko, Kyung Dae" sort="Ko, Kyung Dae" uniqKey="Ko K" first="Kyung Dae" last="Ko">Kyung Dae Ko</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hong, Yoojin" sort="Hong, Yoojin" uniqKey="Hong Y" first="Yoojin" last="Hong">Yoojin Hong</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Zhenhai" sort="Zhang, Zhenhai" uniqKey="Zhang Z" first="Zhenhai" last="Zhang">Zhenhai Zhang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff4">
<addr-line>Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ho, Ngai Lam" sort="Ho, Ngai Lam" uniqKey="Ho N" first="Ngai Lam" last="Ho">Ngai Lam Ho</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chintapalli, Sree V" sort="Chintapalli, Sree V" uniqKey="Chintapalli S" first="Sree V." last="Chintapalli">Sree V. Chintapalli</name>
<affiliation>
<nlm:aff id="aff7">
<addr-line>Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kline, Lindsay A" sort="Kline, Lindsay A" uniqKey="Kline L" first="Lindsay A." last="Kline">Lindsay A. Kline</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gotlin, Matthew" sort="Gotlin, Matthew" uniqKey="Gotlin M" first="Matthew" last="Gotlin">Matthew Gotlin</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hartranft, David Nicholas" sort="Hartranft, David Nicholas" uniqKey="Hartranft D" first="David Nicholas" last="Hartranft">David Nicholas Hartranft</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Patterson, Morgen E" sort="Patterson, Morgen E" uniqKey="Patterson M" first="Morgen E." last="Patterson">Morgen E. Patterson</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dave, Foram" sort="Dave, Foram" uniqKey="Dave F" first="Foram" last="Dave">Foram Dave</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smith, Evan J" sort="Smith, Evan J" uniqKey="Smith E" first="Evan J." last="Smith">Evan J. Smith</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Holmes, Edward C" sort="Holmes, Edward C" uniqKey="Holmes E" first="Edward C." last="Holmes">Edward C. Holmes</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff5">
<addr-line>Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Patterson, Randen L" sort="Patterson, Randen L" uniqKey="Patterson R" first="Randen L." last="Patterson">Randen L. Patterson</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff6">
<addr-line>Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff7">
<addr-line>Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Rossum, Damian B" sort="Van Rossum, Damian B" uniqKey="Van Rossum D" first="Damian B." last="Van Rossum">Damian B. Van Rossum</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff8">
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Both multiple sequence alignment and phylogenetic analysis are problematic in the “twilight zone” of sequence similarity (≤25% amino acid identity). Herein we explore the accuracy of phylogenetic inference at extreme sequence divergence using a variety of simulated data sets. We evaluate four leading multiple sequence alignment (MSA) methods (MAFFT, T-COFFEE, CLUSTAL, and MUSCLE) and six commonly used programs of tree estimation (Distance-based: Neighbor-Joining; Character-based: PhyML, RAxML, GARLI, Maximum Parsimony, and Bayesian) against a novel MSA-independent method (PHYRN) described here. Strikingly, at “midnight zone” genetic distances (∼7% pairwise identity and 4.0 gaps per position), PHYRN returns high-resolution phylogenies that outperform traditional approaches. We reason this is due to PHYRN's capability to amplify informative positions, even at the most extreme levels of sequence divergence. We also assess the applicability of the PHYRN algorithm for inferring deep evolutionary relationships in the divergent DANGER protein superfamily, for which PHYRN infers a more robust tree compared to MSA-based approaches. Taken together, these results demonstrate that PHYRN represents a powerful mechanism for mapping uncharted frontiers in highly divergent protein sequence data sets.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Blake, Jd" uniqKey="Blake J">JD Blake</name>
</author>
<author>
<name sortKey="Cohen, Fe" uniqKey="Cohen F">FE Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yona, G" uniqKey="Yona G">G Yona</name>
</author>
<author>
<name sortKey="Levitt, M" uniqKey="Levitt M">M Levitt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ko, Kd" uniqKey="Ko K">KD Ko</name>
</author>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Chang, Gs" uniqKey="Chang G">GS Chang</name>
</author>
<author>
<name sortKey="Bhardwaj, G" uniqKey="Bhardwaj G">G Bhardwaj</name>
</author>
<author>
<name sortKey="Van Rossum, Db" uniqKey="Van Rossum D">DB van Rossum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K Liu</name>
</author>
<author>
<name sortKey="Linder, Cr" uniqKey="Linder C">CR Linder</name>
</author>
<author>
<name sortKey="Warnow, T" uniqKey="Warnow T">T Warnow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roch, S" uniqKey="Roch S">S Roch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bergsten, J" uniqKey="Bergsten J">J Bergsten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chang, Gs" uniqKey="Chang G">GS Chang</name>
</author>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Ko, Kd" uniqKey="Ko K">KD Ko</name>
</author>
<author>
<name sortKey="Bhardwaj, G" uniqKey="Bhardwaj G">G Bhardwaj</name>
</author>
<author>
<name sortKey="Holmes, Ec" uniqKey="Holmes E">EC Holmes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ko, Kd" uniqKey="Ko K">KD Ko</name>
</author>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Bhardwaj, G" uniqKey="Bhardwaj G">G Bhardwaj</name>
</author>
<author>
<name sortKey="Killick, Tm" uniqKey="Killick T">TM Killick</name>
</author>
<author>
<name sortKey="Van Rossum, Db" uniqKey="Van Rossum D">DB van Rossum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bhardwaj, G" uniqKey="Bhardwaj G">G Bhardwaj</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Ko, Kd" uniqKey="Ko K">KD Ko</name>
</author>
<author>
<name sortKey="Chang, Gs" uniqKey="Chang G">GS Chang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Kang, J" uniqKey="Kang J">J Kang</name>
</author>
<author>
<name sortKey="Van Rossum, Db" uniqKey="Van Rossum D">DB van Rossum</name>
</author>
<author>
<name sortKey="Patterson, Rl" uniqKey="Patterson R">RL Patterson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Talavera, G" uniqKey="Talavera G">G Talavera</name>
</author>
<author>
<name sortKey="Castresana, J" uniqKey="Castresana J">J Castresana</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roshan, U" uniqKey="Roshan U">U Roshan</name>
</author>
<author>
<name sortKey="Livesay, Dr" uniqKey="Livesay D">DR Livesay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K Liu</name>
</author>
<author>
<name sortKey="Raghavan, S" uniqKey="Raghavan S">S Raghavan</name>
</author>
<author>
<name sortKey="Nelesen, S" uniqKey="Nelesen S">S Nelesen</name>
</author>
<author>
<name sortKey="Linder, Cr" uniqKey="Linder C">CR Linder</name>
</author>
<author>
<name sortKey="Warnow, T" uniqKey="Warnow T">T Warnow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Price, Mn" uniqKey="Price M">MN Price</name>
</author>
<author>
<name sortKey="Dehal, Ps" uniqKey="Dehal P">PS Dehal</name>
</author>
<author>
<name sortKey="Arkin, Ap" uniqKey="Arkin A">AP Arkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beiko, Rg" uniqKey="Beiko R">RG Beiko</name>
</author>
<author>
<name sortKey="Charlebois, Rl" uniqKey="Charlebois R">RL Charlebois</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lassmann, T" uniqKey="Lassmann T">T Lassmann</name>
</author>
<author>
<name sortKey="Frings, O" uniqKey="Frings O">O Frings</name>
</author>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
<author>
<name sortKey="Evers, D" uniqKey="Evers D">D Evers</name>
</author>
<author>
<name sortKey="Meyer, F" uniqKey="Meyer F">F Meyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grassly, Nc" uniqKey="Grassly N">NC Grassly</name>
</author>
<author>
<name sortKey="Adachi, J" uniqKey="Adachi J">J Adachi</name>
</author>
<author>
<name sortKey="Rambaut, A" uniqKey="Rambaut A">A Rambaut</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
<author>
<name sortKey="Hollich, V" uniqKey="Hollich V">V Hollich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robinson, Df" uniqKey="Robinson D">DF Robinson</name>
</author>
<author>
<name sortKey="Foulds, Lr" uniqKey="Foulds L">LR Foulds</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Subramanian, Ar" uniqKey="Subramanian A">AR Subramanian</name>
</author>
<author>
<name sortKey="Weyer Menkhoff, J" uniqKey="Weyer Menkhoff J">J Weyer-Menkhoff</name>
</author>
<author>
<name sortKey="Kaufmann, M" uniqKey="Kaufmann M">M Kaufmann</name>
</author>
<author>
<name sortKey="Morgenstern, B" uniqKey="Morgenstern B">B Morgenstern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Katoh, K" uniqKey="Katoh K">K Katoh</name>
</author>
<author>
<name sortKey="Asimenos, G" uniqKey="Asimenos G">G Asimenos</name>
</author>
<author>
<name sortKey="Toh, H" uniqKey="Toh H">H Toh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thompson, Jd" uniqKey="Thompson J">JD Thompson</name>
</author>
<author>
<name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
<author>
<name sortKey="Gibson, Tj" uniqKey="Gibson T">TJ Gibson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Notredame, C" uniqKey="Notredame C">C Notredame</name>
</author>
<author>
<name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
<author>
<name sortKey="Heringa, J" uniqKey="Heringa J">J Heringa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guindon, S" uniqKey="Guindon S">S Guindon</name>
</author>
<author>
<name sortKey="Lethiec, F" uniqKey="Lethiec F">F Lethiec</name>
</author>
<author>
<name sortKey="Duroux, P" uniqKey="Duroux P">P Duroux</name>
</author>
<author>
<name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le, Sq" uniqKey="Le S">SQ Le</name>
</author>
<author>
<name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zwickl, Dj" uniqKey="Zwickl D">DJ Zwickl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wilgenbusch, Jc" uniqKey="Wilgenbusch J">JC Wilgenbusch</name>
</author>
<author>
<name sortKey="Swofford, D" uniqKey="Swofford D">D Swofford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ronquist, F" uniqKey="Ronquist F">F Ronquist</name>
</author>
<author>
<name sortKey="Huelsenbeck, Jp" uniqKey="Huelsenbeck J">JP Huelsenbeck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ulitsky, I" uniqKey="Ulitsky I">I Ulitsky</name>
</author>
<author>
<name sortKey="Burstein, D" uniqKey="Burstein D">D Burstein</name>
</author>
<author>
<name sortKey="Tuller, T" uniqKey="Tuller T">T Tuller</name>
</author>
<author>
<name sortKey="Chor, B" uniqKey="Chor B">B Chor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lempel, A" uniqKey="Lempel A">A Lempel</name>
</author>
<author>
<name sortKey="Ziv, J" uniqKey="Ziv J">J Ziv</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hohl, M" uniqKey="Hohl M">M Hohl</name>
</author>
<author>
<name sortKey="Ragan, Ma" uniqKey="Ragan M">MA Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bruno, Wj" uniqKey="Bruno W">WJ Bruno</name>
</author>
<author>
<name sortKey="Socci, Nd" uniqKey="Socci N">ND Socci</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Desper, R" uniqKey="Desper R">R Desper</name>
</author>
<author>
<name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wheeler, Tj" uniqKey="Wheeler T">TJ Wheeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Chintapalli, Sv" uniqKey="Chintapalli S">SV Chintapalli</name>
</author>
<author>
<name sortKey="Ko, Kd" uniqKey="Ko K">KD Ko</name>
</author>
<author>
<name sortKey="Bhardwaj, G" uniqKey="Bhardwaj G">G Bhardwaj</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hong, Y" uniqKey="Hong Y">Y Hong</name>
</author>
<author>
<name sortKey="Kang, J" uniqKey="Kang J">J Kang</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Van Rossum, Db" uniqKey="Van Rossum D">DB van Rossum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Han, Q" uniqKey="Han Q">Q Han</name>
</author>
<author>
<name sortKey="Aligo, J" uniqKey="Aligo J">J Aligo</name>
</author>
<author>
<name sortKey="Manna, D" uniqKey="Manna D">D Manna</name>
</author>
<author>
<name sortKey="Belton, K" uniqKey="Belton K">K Belton</name>
</author>
<author>
<name sortKey="Chintapalli, Sv" uniqKey="Chintapalli S">SV Chintapalli</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nikolaidis, N" uniqKey="Nikolaidis N">N Nikolaidis</name>
</author>
<author>
<name sortKey="Chalkia, D" uniqKey="Chalkia D">D Chalkia</name>
</author>
<author>
<name sortKey="Watkins, Dn" uniqKey="Watkins D">DN Watkins</name>
</author>
<author>
<name sortKey="Barrow, Rk" uniqKey="Barrow R">RK Barrow</name>
</author>
<author>
<name sortKey="Snyder, Sh" uniqKey="Snyder S">SH Snyder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Rossum, Db" uniqKey="Van Rossum D">DB van Rossum</name>
</author>
<author>
<name sortKey="Patterson, Rl" uniqKey="Patterson R">RL Patterson</name>
</author>
<author>
<name sortKey="Cheung, Kh" uniqKey="Cheung K">KH Cheung</name>
</author>
<author>
<name sortKey="Barrow, Rk" uniqKey="Barrow R">RK Barrow</name>
</author>
<author>
<name sortKey="Syrovatkina, V" uniqKey="Syrovatkina V">V Syrovatkina</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lau, Gt" uniqKey="Lau G">GT Lau</name>
</author>
<author>
<name sortKey="Wong, Og" uniqKey="Wong O">OG Wong</name>
</author>
<author>
<name sortKey="Chan, Pm" uniqKey="Chan P">PM Chan</name>
</author>
<author>
<name sortKey="Kok, Kh" uniqKey="Kok K">KH Kok</name>
</author>
<author>
<name sortKey="Wong, Rl" uniqKey="Wong R">RL Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kang, Bn" uniqKey="Kang B">BN Kang</name>
</author>
<author>
<name sortKey="Ahmad, As" uniqKey="Ahmad A">AS Ahmad</name>
</author>
<author>
<name sortKey="Saleem, S" uniqKey="Saleem S">S Saleem</name>
</author>
<author>
<name sortKey="Patterson, Rl" uniqKey="Patterson R">RL Patterson</name>
</author>
<author>
<name sortKey="Hester, L" uniqKey="Hester L">L Hester</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marchler Bauer, A" uniqKey="Marchler Bauer A">A Marchler-Bauer</name>
</author>
<author>
<name sortKey="Lu, S" uniqKey="Lu S">S Lu</name>
</author>
<author>
<name sortKey="Anderson, Jb" uniqKey="Anderson J">JB Anderson</name>
</author>
<author>
<name sortKey="Chitsaz, F" uniqKey="Chitsaz F">F Chitsaz</name>
</author>
<author>
<name sortKey="Derbyshire, Mk" uniqKey="Derbyshire M">MK Derbyshire</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tamura, K" uniqKey="Tamura K">K Tamura</name>
</author>
<author>
<name sortKey="Dudley, J" uniqKey="Dudley J">J Dudley</name>
</author>
<author>
<name sortKey="Nei, M" uniqKey="Nei M">M Nei</name>
</author>
<author>
<name sortKey="Kumar, S" uniqKey="Kumar S">S Kumar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, S" uniqKey="Sun S">S Sun</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Altintas, I" uniqKey="Altintas I">I Altintas</name>
</author>
<author>
<name sortKey="Lin, A" uniqKey="Lin A">A Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Watanabe, H" uniqKey="Watanabe H">H Watanabe</name>
</author>
<author>
<name sortKey="Vriens, J" uniqKey="Vriens J">J Vriens</name>
</author>
<author>
<name sortKey="Prenen, J" uniqKey="Prenen J">J Prenen</name>
</author>
<author>
<name sortKey="Droogmans, G" uniqKey="Droogmans G">G Droogmans</name>
</author>
<author>
<name sortKey="Voets, T" uniqKey="Voets T">T Voets</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Watanabe, H" uniqKey="Watanabe H">H Watanabe</name>
</author>
<author>
<name sortKey="Fujisawa, T" uniqKey="Fujisawa T">T Fujisawa</name>
</author>
<author>
<name sortKey="Holstein, Tw" uniqKey="Holstein T">TW Holstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chow, Kl" uniqKey="Chow K">KL Chow</name>
</author>
<author>
<name sortKey="Hall, Dh" uniqKey="Hall D">DH Hall</name>
</author>
<author>
<name sortKey="Emmons, Sw" uniqKey="Emmons S">SW Emmons</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wong, Ym" uniqKey="Wong Y">YM Wong</name>
</author>
<author>
<name sortKey="Chow, Kl" uniqKey="Chow K">KL Chow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Essoussi, N" uniqKey="Essoussi N">N Essoussi</name>
</author>
<author>
<name sortKey="Boujenfa, K" uniqKey="Boujenfa K">K Boujenfa</name>
</author>
<author>
<name sortKey="Limam, M" uniqKey="Limam M">M Limam</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loytynoja, A" uniqKey="Loytynoja A">A Loytynoja</name>
</author>
<author>
<name sortKey="Goldman, N" uniqKey="Goldman N">N Goldman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K Liu</name>
</author>
<author>
<name sortKey="Warnow, Tj" uniqKey="Warnow T">TJ Warnow</name>
</author>
<author>
<name sortKey="Holder, Mt" uniqKey="Holder M">MT Holder</name>
</author>
<author>
<name sortKey="Nelesen, Sm" uniqKey="Nelesen S">SM Nelesen</name>
</author>
<author>
<name sortKey="Yu, J" uniqKey="Yu J">J Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eddy, Sr" uniqKey="Eddy S">SR Eddy</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22514627</article-id>
<article-id pub-id-type="pmc">3325999</article-id>
<article-id pub-id-type="publisher-id">PONE-D-11-24752</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0034261</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Biology</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Biological Data Management</subject>
<subject>Evolutionary Modeling</subject>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Evolutionary Biology</subject>
<subj-group>
<subject>Evolutionary Systematics</subject>
<subj-group>
<subject>Phylogenetics</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Evolutionary Theory</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Molecular Genetics</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Proteomics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences</article-title>
<alt-title alt-title-type="running-head">Phylogenetic Accuracy in Divergent Data Sets</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Bhardwaj</surname>
<given-names>Gaurav</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff6">
<sup>6</sup>
</xref>
<xref ref-type="aff" rid="aff8">
<sup>8</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ko</surname>
<given-names>Kyung Dae</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hong</surname>
<given-names>Yoojin</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Zhenhai</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ho</surname>
<given-names>Ngai Lam</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chintapalli</surname>
<given-names>Sree V.</given-names>
</name>
<xref ref-type="aff" rid="aff7">
<sup>7</sup>
</xref>
<xref ref-type="aff" rid="aff8">
<sup>8</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kline</surname>
<given-names>Lindsay A.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gotlin</surname>
<given-names>Matthew</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hartranft</surname>
<given-names>David Nicholas</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Patterson</surname>
<given-names>Morgen E.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Dave</surname>
<given-names>Foram</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Smith</surname>
<given-names>Evan J.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Holmes</surname>
<given-names>Edward C.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff5">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Patterson</surname>
<given-names>Randen L.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff6">
<sup>6</sup>
</xref>
<xref ref-type="aff" rid="aff7">
<sup>7</sup>
</xref>
<xref ref-type="aff" rid="aff8">
<sup>8</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>van Rossum</surname>
<given-names>Damian B.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff8">
<sup>8</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</aff>
<aff id="aff4">
<label>4</label>
<addr-line>Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America</addr-line>
</aff>
<aff id="aff5">
<label>5</label>
<addr-line>Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</aff>
<aff id="aff6">
<label>6</label>
<addr-line>Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</aff>
<aff id="aff7">
<label>7</label>
<addr-line>Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America</addr-line>
</aff>
<aff id="aff8">
<label>8</label>
<addr-line>Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Tuller</surname>
<given-names>Tamir</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">Tel Aviv University, Israel</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>randen100@gmail.com</email>
(RLP);
<email>dbv10@psu.edu</email>
(DBV)</corresp>
<fn fn-type="con">
<p>Conceived and designed the experiments: GB ECH RLP DBVR. Performed the experiments: GB KDK YH SVC ZZ NLH LAK MG DNH MEP FD EJS. Analyzed the data: GB ECH RLP DBVR. Wrote the paper: GB ECH RLP DBVR.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>13</day>
<month>4</month>
<year>2012</year>
</pub-date>
<volume>7</volume>
<issue>4</issue>
<elocation-id>e34261</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>11</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>2</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.</copyright-statement>
<copyright-year>2012</copyright-year>
</permissions>
<abstract>
<p>Both multiple sequence alignment and phylogenetic analysis are problematic in the “twilight zone” of sequence similarity (≤25% amino acid identity). Herein we explore the accuracy of phylogenetic inference at extreme sequence divergence using a variety of simulated data sets. We evaluate four leading multiple sequence alignment (MSA) methods (MAFFT, T-COFFEE, CLUSTAL, and MUSCLE) and six commonly used programs of tree estimation (Distance-based: Neighbor-Joining; Character-based: PhyML, RAxML, GARLI, Maximum Parsimony, and Bayesian) against a novel MSA-independent method (PHYRN) described here. Strikingly, at “midnight zone” genetic distances (∼7% pairwise identity and 4.0 gaps per position), PHYRN returns high-resolution phylogenies that outperform traditional approaches. We reason this is due to PHYRN's capability to amplify informative positions, even at the most extreme levels of sequence divergence. We also assess the applicability of the PHYRN algorithm for inferring deep evolutionary relationships in the divergent DANGER protein superfamily, for which PHYRN infers a more robust tree compared to MSA-based approaches. Taken together, these results demonstrate that PHYRN represents a powerful mechanism for mapping uncharted frontiers in highly divergent protein sequence data sets.</p>
</abstract>
<counts>
<page-count count="13"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Inferring phylogenetic history among highly divergent protein sequences is one of the most challenging problems in modern evolutionary biology. The ability to reliably determine the evolutionary history of protein sequences that fall below ∼25% identity (i.e. the “twilight zone” and lower still, the “midnight zone”) would allow for better identification of homologous proteins and shed light on key events in the deep evolutionary past
<xref ref-type="bibr" rid="pone.0034261-Blake1">[1]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Yona1">[2]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Ko1">[3]</xref>
. To date, most attempts to resolve deep-node evolutionary relationships have relied upon improving methods, models, and parameters of multiple sequence alignment (MSA) and/or tree inference programs. However, MSA methods tend to get progressively worse with additional sequence divergence
<xref ref-type="bibr" rid="pone.0034261-Liu1">[4]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Edgar1">[5]</xref>
. This is due to the low information content of divergent sequences and the subsequent loss of informative points (i.e. number of common sites). No matter how robustly a given tree-building algorithm performs, this lack of informative points tends to result in poor phylogenetic inference (i.e. noise in, noise out).</p>
<p>Current phylogenetic analysis often follows a two-step process: (i) obtain a guide tree based on percentage identity of all-against-all pairwise alignments, which designates the order of a progressive alignment, and (ii) estimate a phylogeny based upon the resultant MSA with distance-based or character-based tree inference programs. Distance methods (e.g. UPGMA, Neighbor Joining) are fast, and can handle large numbers of sequences
<xref ref-type="bibr" rid="pone.0034261-Roch1">[6]</xref>
. However, distance-based trees are often erroneous when rates of substitution vary greatly among lineages. Thus, distance-based methods are generally thought to be inferior to character-based methods (e.g. Parsimony, Maximum Likelihood (ML), and Bayesian). Character-based inference can generate trees with the minimum number of changes needed to explain the data, or with the highest likelihood given the data and a particular model of molecular evolution. The downside to character-based phylogenetic inference is the computational cost and problems with scalability to large data sets. Further, both distance- and character-based methods are prone to long branch attraction, in which rapidly evolving sequences (with long branches) are placed with other rapidly evolving sequences even if they are not sister taxa
<xref ref-type="bibr" rid="pone.0034261-Bergsten1">[7]</xref>
.</p>
<p>We developed an alternative approach to MSA-based tree inference which utilizes the Euclidean distance of sequence profiles (a.k.a. phylogenetic profiles or NxM matrices)
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
. In this manner, the sequence profile of a set N of query amino acid sequences is defined as vectors where each entry quantifies the pairwise alignment between N queries and a set M of position specific scoring matrices (PSSMs)
<xref ref-type="bibr" rid="pone.0034261-Ko1">[3]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Ko2">[9]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Bhardwaj1">[10]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Hong1">[11]</xref>
. Given this matrix, we calculate the Euclidean distance between all pairs of queries and generate an NxN distance matrix for tree inference. Although the statistical robustness and computational cost of this initial algorithm made it impractical for routine use, it was sufficiently robust in a benchmark data set of divergent retroelements
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
. This initial success led us to pursue this approach further, and alterations to the initial algorithm are discussed in detail in following sections. The underlying theory is that through the use of PSSMs, sequence profiling methods can amplify the signal (i.e. informative positions) contained in each sequence, handle large data sets, and give more refined distance measures.</p>
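<p>The distance step described above can be sketched as follows. This is a minimal illustration only, not the PHYRN implementation: the function name, the toy score values, and the use of plain Python lists for the NxM profile are all assumptions made for the example.</p>
<preformat preformat-type="code">
```python
import math


def euclidean_distance_matrix(profile):
    """Return the N x N matrix of Euclidean distances between the rows
    of an N x M profile (one row per query, one column per PSSM)."""
    n = len(profile)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(profile[i], profile[j]))
            )
            # The matrix is symmetric, so fill both halves at once.
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical profile: 3 queries scored against 2 PSSMs.
# The scores are invented purely for illustration.
toy_profile = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]
toy_distances = euclidean_distance_matrix(toy_profile)
```
</preformat>
<p>The resulting square matrix is the kind of input that a distance-based tree method such as Neighbor-Joining expects; how each profile entry is actually scored is outside the scope of this sketch.</p>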
<p>In this study we address the performance of various traditional phylogenetic approaches and our new method presented here, PHYRN, in simulated data sets at extreme divergence levels. We use simulated data sets as the test bed of performance because unlike biological data sets, the true history of simulated sequence data is known and predefined. Knowledge of the true evolutionary history of the sequences under consideration makes it possible to quantify the performance of phylogenetic inference algorithms
<xref ref-type="bibr" rid="pone.0034261-Talavera1">[12]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Roshan1">[13]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Liu2">[14]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Price1">[15]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Beiko1">[16]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Lassmann1">[17]</xref>
. Simulated data sets also allow us to evaluate performance at multiple divergence levels using many replicates. In this study, we have compared PHYRN to four leading MSA methods (MAFFT, T-COFFEE, ClustalW, and MUSCLE), two alignment-free methods (Average Common Substring (ACS) approach and Lempel-Ziv (LZ) distance), and seven established methods for tree estimation (Distance-based: Neighbor-Joining, FastME; Character-based: PhyML, RAxML, GARLI (all three of which utilize a maximum likelihood approach), Maximum Parsimony, and Bayesian).</p>
<p>While simulated data sets represent a powerful way to benchmark accuracy of a given algorithm, they may not incorporate all the underlying mechanisms of natural molecular evolution (e.g. translocations, rearrangements, recombination and/or inversions)
<xref ref-type="bibr" rid="pone.0034261-Stoye1">[18]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Grassly1">[19]</xref>
. Therefore, it is informative to test PHYRN in a biologically relevant test bed. Accordingly, we also evaluate whether PHYRN is capable of providing informative measurements that could describe the evolutionary history of the highly divergent developmental DANGER superfamily. Based on the results from synthetic data sets and the DANGER superfamily, we propose that: (i) high-resolution phylogenies can be built for protein families using PHYRN, (ii) these measurements have robust statistical support and inform intra- and inter-group relationships, and (iii) these measures can outperform traditional MSA-dependent tree inference methods.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>Methods</title>
<sec id="s2a">
<title>Generation and Sequence Evolution of Synthetic Data Sets</title>
<p>We artificially generated protein sequences using SeqGen and ROSE simulation packages to test the performance of phylogenetic methods in highly divergent sequences
<xref ref-type="bibr" rid="pone.0034261-Stoye1">[18]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Grassly1">[19]</xref>
. In both simulations, sequences are created from a common ancestor to produce a data set of known size, divergence, and history using a variety of tree shapes (e.g. biased vs. unbiased). In this
<italic>in silico</italic>
evolutionary process, an accurate phylogenetic history is recorded since the MSA is created simultaneously, thereby allowing us to test the reliability of different methods of tree inference at different levels of sequence divergence. Both simulation methods use PAM matrices
<xref ref-type="bibr" rid="pone.0034261-Sonnhammer1">[20]</xref>
, where increasing PAM score equates to decreasing percentage identity and similarity, and an increasing number of gaps. A key difference between SeqGen and ROSE is that SeqGen does not incorporate insertion-deletion events (indels) while generating these simulated protein families. ROSE does include indels, providing a better approximation of molecular evolution (see
<xref ref-type="supplementary-material" rid="pone.0034261.s006">Supporting Methods S1</xref>
for more details).</p>
<p>Simulated sequences were then aligned by PHYRN or a distance estimation technique (MSA or alignment-free) and passed to the tree estimation method. The estimated trees were scored against the true tree for accuracy via two methods. First, we used the CONSENSE program in the PHYLIP v3.67 package (
<ext-link ext-link-type="uri" xlink:href="http://evolution.genetics.washington.edu/phylip.html">http://evolution.genetics.washington.edu/phylip.html</ext-link>
) to generate consensus trees between the true-history tree and the estimated trees. Recapitulation rate and percentages were then calculated from consensus tree Newick files. Deep nodes were defined as those which are evolutionary ancestors of the last two tiers of leaf nodes. For a second measure of topological difference we used the ‘treedist’ program in the PHYLIP v3.67 package to calculate symmetric distances of Robinson and Foulds (RF distance)
<xref ref-type="bibr" rid="pone.0034261-Robinson1">[21]</xref>
. RF distance is a well-established metric for comparing tree topologies, in which the bipartitions of two trees are compared to calculate the difference in their topologies. For two trees with exactly the same topology, this distance is 0, but for two trees of n leaves, with all branches differently placed, symmetric distance is equal to 2(n–3). Thus, the larger the symmetric distance between an estimated tree and the true simulated tree, the lower the accuracy of the tree-building algorithm.</p>
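<p>To make the symmetric distance concrete, the sketch below counts the bipartitions that appear in exactly one of two trees. It is an illustration under stated assumptions, not the PHYLIP ‘treedist’ code: trees are represented as nested Python tuples of leaf names (a hypothetical encoding chosen for brevity), both trees are assumed to share the same leaf set, and the function names are invented for the example.</p>
<preformat preformat-type="code">
```python
def clades(tree, out):
    """Collect the leaf set under every internal node of a nested-tuple
    tree into `out`, and return the leaf set of `tree` itself."""
    if isinstance(tree, str):
        return frozenset([tree])
    clade = frozenset().union(*(clades(child, out) for child in tree))
    out.append(clade)
    return clade


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of a tree, each stored
    as the side that does NOT contain a fixed anchor leaf."""
    found = []
    all_leaves = clades(tree, found)
    anchor = min(all_leaves)
    splits = set()
    for clade in found:
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits that separate fewer than two leaves.
        if len(side) >= 2 and len(all_leaves) - len(side) >= 2:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Symmetric (Robinson-Foulds) distance: number of splits present in
    one tree but not in the other."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Toy 5-leaf trees, invented for illustration.
true_tree = (("A", "B"), ("C", ("D", "E")))
one_swap = (("A", "C"), ("B", ("D", "E")))
all_different = (("A", "D"), ("B", ("C", "E")))
```
</preformat>
<p>With n = 5 leaves, the tree that shares no internal branch with the true tree reaches the maximum of 2(n–3) = 4, while an identical tree scores 0.</p>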
<p>To simulate sequence evolution, a single amino acid sequence was placed at the root of the tree
<italic>T</italic>
and evolved down the tree according to the parameters of the simulation programs. In this way each leaf of
<italic>T</italic>
has a sequence. For the majority of experiments we generated simulated data sets comprised of 100 sequences with an average length of 450 amino acids. We used Seq-Gen v 1.3.2
<xref ref-type="bibr" rid="pone.0034261-Grassly1">[19]</xref>
with PAM as the default substitution matrix and varied the scaling factor from 0.1 to 1 to generate multiple replicates (n = 25) of the synthetic data sets with sequences at different divergence ranges. The SeqGen scaling factor scales the branch lengths of the input tree to a specified value before generating a data set from the input tree. This changes the expected number of amino acid substitutions per site for each branch, and thus changes the overall divergence of the simulated tree. We also used ROSE v1.3
<xref ref-type="bibr" rid="pone.0034261-Stoye1">[18]</xref>
with default settings to generate multiple replicates of true trees (n = 25) across a range of divergence. The extent of sequence divergence was varied across multiple replicates by changing the average ROSE distance parameter from 100 PAM to 700 PAM. Importantly, both ROSE and Seq-Gen employed a fixed substitution rate across all branches, such that we assume a strict molecular clock. All simulated data sets are available upon request or downloadable from
<ext-link ext-link-type="uri" xlink:href="http://www.ccp.psu.edu/downloads">www.ccp.psu.edu/downloads</ext-link>
.</p>
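<p>The effect of a branch-length scaling factor can be illustrated with a short sketch. It assumes the factor acts as a simple multiplier on every branch length of a Newick tree; the function name and the toy tree are hypothetical, and this is not the Seq-Gen code itself.</p>
<preformat preformat-type="code">
```python
import re


def scale_branch_lengths(newick, factor):
    """Multiply every branch length in a Newick string by `factor`.

    Branch lengths are the numbers that follow a colon; each one is the
    expected number of substitutions per site along that branch."""

    def rescale(match):
        return ":" + format(float(match.group(1)) * factor, "g")

    return re.sub(r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", rescale, newick)


# Toy input tree, invented for illustration.
scaled = scale_branch_lengths("((A:0.2,B:0.2):0.1,C:0.3);", 0.5)
```
</preformat>
<p>Halving every branch length halves the expected number of substitutions per site along each path, so the simulated sequences diverge less; a larger factor has the opposite effect.</p>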
<table-wrap id="pone-0034261-t001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.t001</object-id>
<label>Table 1</label>
<caption>
<title>Performance Comparison of Phylogenetic Inference Methods.</title>
</caption>
<alternatives>
<graphic id="pone-0034261-t001-1" xlink:href="pone.0034261.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Alignment/Alignment-free Method</td>
<td align="left" rowspan="1" colspan="1">Tree Inference Method</td>
<td align="left" rowspan="1" colspan="1">Settings and Parameters</td>
<td colspan="4" align="left" rowspan="1">ROSE Data Sets
<xref ref-type="table-fn" rid="nt101">#</xref>
</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">550</td>
<td align="left" rowspan="1" colspan="1">650</td>
<td align="left" rowspan="1" colspan="1">750</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">LZ</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">14(+/−48.16)</td>
<td align="left" rowspan="1" colspan="1">116.88(+/−18.03)</td>
<td align="left" rowspan="1" colspan="1">126.96(+/−6.98)</td>
<td align="left" rowspan="1" colspan="1">143.36(+/−7.34)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average Common Substring (ACS)</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">7.36(+/−35.14)</td>
<td align="left" rowspan="1" colspan="1">106.96(+/−20.98)</td>
<td align="left" rowspan="1" colspan="1">116.8(+/−18.80)</td>
<td align="left" rowspan="1" colspan="1">122(+/−13.49)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">Maximum Parsimony (ProtPars)</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">0.15(+/−0.54)</td>
<td align="left" rowspan="1" colspan="1">49.61(+/−15.64)</td>
<td align="left" rowspan="1" colspan="1">91.69(+/−13.24)</td>
<td align="left" rowspan="1" colspan="1">106.15(+/−11.76)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">Maximum Parsimony (PAUP)</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">33.2(+/−12.19)</td>
<td align="left" rowspan="1" colspan="1">77.68(+/−9.27)</td>
<td align="left" rowspan="1" colspan="1">99.2(+/−11.05)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">41.44(+/−13.50)</td>
<td align="left" rowspan="1" colspan="1">82.24(+/−14.08)</td>
<td align="left" rowspan="1" colspan="1">94.96(+/−16.04)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1">JTT, Gamma  = 0.5</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">13.04(+/−10.31)</td>
<td align="left" rowspan="1" colspan="1">68.4(+/−23.08)</td>
<td align="left" rowspan="1" colspan="1">99.36(+/−22.63)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">FastME</td>
<td align="left" rowspan="1" colspan="1">JTT, Gamma = 0.5</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">12.64(+/−12.72)</td>
<td align="left" rowspan="1" colspan="1">66(+/−21.73)</td>
<td align="left" rowspan="1" colspan="1">92.96(+/−21.31)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1">JTT, Gamma  = 1</td>
<td align="left" rowspan="1" colspan="1">0.24(+/−0.66)</td>
<td align="left" rowspan="1" colspan="1">52.32(+/−18.63)</td>
<td align="left" rowspan="1" colspan="1">114.4(+/−25.81)</td>
<td align="left" rowspan="1" colspan="1">135.52(+/−17.00)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">FastME</td>
<td align="left" rowspan="1" colspan="1">JTT, Gamma  = 1</td>
<td align="left" rowspan="1" colspan="1">0.16(+/−0.55)</td>
<td align="left" rowspan="1" colspan="1">36.96(+/−22.20)</td>
<td align="left" rowspan="1" colspan="1">105.84(+/−22.31)</td>
<td align="left" rowspan="1" colspan="1">132.88(+/−23.38)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">PhyML (Maximum-Likelihood)</td>
<td align="left" rowspan="1" colspan="1">LG</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">29.24(+/−25.77)</td>
<td align="left" rowspan="1" colspan="1">70.85(+/−21.73)</td>
<td align="left" rowspan="1" colspan="1">89.14(+/−24.14)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">PhyML (Maximum-Likelihood)</td>
<td align="left" rowspan="1" colspan="1">LG(+F)</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">38(+/−24.77)</td>
<td align="left" rowspan="1" colspan="1">75.28(+/−22.23)</td>
<td align="left" rowspan="1" colspan="1">96.08(+/−19.33)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">RAxML (Maximum-Likelihood)</td>
<td align="left" rowspan="1" colspan="1">WAG(+G)</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">7.76(+/−7.53)</td>
<td align="left" rowspan="1" colspan="1">43.28(+/−12.03)</td>
<td align="left" rowspan="1" colspan="1">63.2(+/−17.98)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE</td>
<td align="left" rowspan="1" colspan="1">GARLI (Maximum-Likelihood)</td>
<td align="left" rowspan="1" colspan="1">WAG(+F)</td>
<td align="left" rowspan="1" colspan="1">0(+/−0)</td>
<td align="left" rowspan="1" colspan="1">8.8(+/−7.46)</td>
<td align="left" rowspan="1" colspan="1">46.16(+/−12.62)</td>
<td align="left" rowspan="1" colspan="1">65.68(+/−16.19)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CLUSTAL</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">94.95(+/−10.05)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TCOFFEE</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">149.42(+/−14.17)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MAFFT-L-INS-i</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">153.52(+/−37.44)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MAFFT</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">174.09(+/−2.05)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">PHYRN</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">1.2(+/−1.53)</td>
<td align="left" rowspan="1" colspan="1">1.52(+/−1.66)</td>
<td align="left" rowspan="1" colspan="1">7.04(+/−4.05)</td>
<td align="left" rowspan="1" colspan="1">13.6(+/−4.12)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MUSCLE
<xref ref-type="table-fn" rid="nt102">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">GARLI</td>
<td align="left" rowspan="1" colspan="1">WAG(+F), 10 independent runs</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">61.27(+/−18.24)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">PHYRN
<xref ref-type="table-fn" rid="nt102">*</xref>
</td>
<td align="left" rowspan="1" colspan="1">Neighbor-Joining</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">NA</td>
<td align="left" rowspan="1" colspan="1">14.6(+/−4.46)</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label>#</label>
<p>Performance described as RF distance +/−SD.</p>
</fn>
<fn id="nt102">
<label>*</label>
<p>Analysis done only on Data Sets 2 for ROSE 700.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s2b">
<title>Methods for Estimating MSA and Phylogenetic Trees</title>
<p>We utilized a variety of MSA-based methods for each simulated data set. For a given data set, we obtained an MSA using MUSCLE v3.6
<xref ref-type="bibr" rid="pone.0034261-Edgar2">[22]</xref>
, DIALIGN v2.2.1
<xref ref-type="bibr" rid="pone.0034261-Subramanian1">[23]</xref>
, MAFFT v6.833b
<xref ref-type="bibr" rid="pone.0034261-Katoh1">[24]</xref>
, CLUSTALW v 2.0.12
<xref ref-type="bibr" rid="pone.0034261-Thompson1">[25]</xref>
, or T-COFFEE v 8.93
<xref ref-type="bibr" rid="pone.0034261-Notredame1">[26]</xref>
with default parameters. Phylogenetic trees based on these MSAs were inferred by both distance- and character-based programs. For the distance-based condition, trees were inferred using the popular Neighbor-Joining (NJ) method and/or FastME methods. Further, we also explored more complex substitution models with character-based methods. Specifically, we tested various Maximum Likelihood (ML) algorithms for tree inference. PhyML v3.0
<xref ref-type="bibr" rid="pone.0034261-Guindon1">[27]</xref>
was used at its default settings (using BioNJ to obtain the initial tree, and the Le and Gascuel (LG) amino acid replacement matrix
<xref ref-type="bibr" rid="pone.0034261-Le1">[28]</xref>
). Equilibrium amino acid frequencies were estimated from the data set using the +F option. RAxML v7.0.4 parallel Pthreads version
<xref ref-type="bibr" rid="pone.0034261-Stamatakis1">[29]</xref>
is a different ML algorithm and was used with the Whelan and Goldman (WAG) amino acid substitution model and the CAT approximation. The CAT approximation was used in RAxML because it decreases computational time while retaining accuracy of tree inference. GARLI v1.0
<xref ref-type="bibr" rid="pone.0034261-Zwickl1">[30]</xref>
(
<ext-link ext-link-type="uri" xlink:href="http://www.bio.utexas.edu/faculty/antisense/garli/Garli.html">www.bio.utexas.edu/faculty/antisense/garli/Garli.html</ext-link>
) was also used assuming the WAG amino acid substitution model and with the substitution frequencies estimated from the data in hand (+F settings). The gamma model of among-site rate variation was employed with empirical estimates of the extent of rate variation. In additional runs, we also used the recent version GARLI v2.0
<xref ref-type="bibr" rid="pone.0034261-Zwickl1">[30]</xref>
. To infer maximum parsimony trees we used both the PROTPARS program in the PHYLIP v3.67 package (
<ext-link ext-link-type="uri" xlink:href="http://evolution.genetics.washington.edu/phylip.html">http://evolution.genetics.washington.edu/phylip.html</ext-link>
) and PAUP* (version 4)
<xref ref-type="bibr" rid="pone.0034261-Wilgenbusch1">[31]</xref>
. In data sets where the parsimony method output multiple trees, only the best tree (based on RF distance) was used for average accuracy calculations. Finally, we tested the Bayesian method available in MrBayes 3.1.2
<xref ref-type="bibr" rid="pone.0034261-Ronquist1">[32]</xref>
, incorporating its default settings with a mixed amino acid substitution model and a gamma model of among-site rate variation (and in additional runs using a gamma model of rate variation with a proportion of invariable amino acid sites). Default settings in MrBayes employ two independent runs with four chains each. Besides these default settings, we also utilized a parallel version of MrBayes with the following settings: i) 16 parallel runs with the WAG amino acid substitution model and a gamma model of among-site rate variation, and ii) 32 parallel runs with the WAG amino acid substitution model. Optimal trees were obtained from two independent runs for each data set, and runs were stopped when they reached stationarity (based on the standard deviation of split frequencies, and also by examining the log likelihood values during runs). Majority-rule consensus trees, after discarding the first 25% of samples as ‘burn-in’, were used for RF distance calculation. For each data set, the consensus tree from the settings that provided the best results was used in average RF distance calculations. For alignment-free methods, Average Common Substring (ACS) length-based distance
<xref ref-type="bibr" rid="pone.0034261-Ulitsky1">[33]</xref>
and Lempel-Ziv (LZ) distance
<xref ref-type="bibr" rid="pone.0034261-Lempel1">[34]</xref>
were calculated using the ‘decaf+py’ package
<xref ref-type="bibr" rid="pone.0034261-Hohl1">[35]</xref>
, followed by tree inference using the ‘neighbor’ program of the PHYLIP package. All the settings and implementations used are summarized in
<xref ref-type="table" rid="pone-0034261-t001">Table 1</xref>
, and more details on the commands are provided in
<xref ref-type="supplementary-material" rid="pone.0034261.s006">Supporting Methods S1</xref>
.</p>
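<p>The Robinson-Foulds (RF) comparisons used throughout this study can be illustrated with a minimal sketch. It is not the code used for the analyses: it assumes that each unrooted tree has already been reduced to its set of non-trivial bipartitions (splits), each written as the set of leaf names on one side, and the helper names (normalize_split, rf_distance, max_rf_distance) are illustrative only.</p>
<preformat><![CDATA[
```python
def normalize_split(side, leaves):
    """Canonical form of a bipartition: the side that excludes a fixed reference leaf."""
    side = frozenset(side)
    reference = min(leaves)
    if reference in side:
        side = frozenset(leaves) - side
    return side


def rf_distance(splits_a, splits_b, leaves):
    """Symmetric-difference (Robinson-Foulds) distance between two sets of splits."""
    a = {normalize_split(s, leaves) for s in splits_a}
    b = {normalize_split(s, leaves) for s in splits_b}
    return len(a ^ b)


def max_rf_distance(n_leaves):
    """Largest possible RF distance between two fully resolved unrooted trees."""
    return 2 * (n_leaves - 3)


leaves = {"A", "B", "C", "D", "E"}
# Tree 1: ((A,B),C,(D,E)); tree 2: ((A,C),B,(D,E))
tree1 = [{"A", "B"}, {"D", "E"}]
tree2 = [{"A", "C"}, {"D", "E"}]
print(rf_distance(tree1, tree2, leaves))  # 2
print(max_rf_distance(100))               # 194
```
]]></preformat>
<p>For data sets of 100 sequences, the bound 2(n − 3) gives 194, the maximum possible RF distance quoted in the figure legends.</p>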
</sec>
<sec id="s2c">
<title>Framework for PHYRN-Based Tree Inference</title>
<p>The pipeline for the PHYRN algorithm is graphically represented in
<xref ref-type="fig" rid="pone-0034261-g001">Figure 1</xref>
. The input is a set N of amino acid sequences and set M of their associated PSSMs. The output is a tree
<italic>T</italic>
, leaf-labeled by the set N. In this study we tested four different tree building algorithms from our PHYRN distance matrix, including NJ, Weighbor
<xref ref-type="bibr" rid="pone.0034261-Bruno1">[36]</xref>
(weighted NJ), FastME
<xref ref-type="bibr" rid="pone.0034261-Desper1">[37]</xref>
and NINJA
<xref ref-type="bibr" rid="pone.0034261-Wheeler1">[38]</xref>
. PHYRN is a five step procedure: (i) curate a data set of amino acid sequences, (ii) construct a database/library of query-based PSSMs using PSI-BLAST, (iii) collect alignment statistics as a function of percentage identity X percentage coverage using a custom code of rps-BLAST and populate the real numbers into a NXM matrix, (iv) calculate Euclidean distance of all sequence pairs and represent distance in a NXN matrix, and (v) generate a distance-based tree estimated using Neighbor-Joining (or a similar clustering technique).</p>
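<p>Step (iii) can be sketched as follows. This is an illustrative sketch rather than the PHYRN code: it assumes that the rps-BLAST output has already been parsed into (query, PSSM, product score) records, which is a hypothetical record layout, and it keeps the best score for each query/PSSM pair, leaving 0 where no alignment was reported.</p>
<preformat><![CDATA[
```python
def build_score_matrix(queries, pssms, hits):
    """Build an N x M matrix of best product scores (0.0 where no alignment exists).

    `hits` is an iterable of (query_id, pssm_id, product_score) tuples;
    this record layout is an assumption made for illustration.
    """
    row = {q: i for i, q in enumerate(queries)}
    col = {p: j for j, p in enumerate(pssms)}
    matrix = [[0.0] * len(pssms) for _ in queries]
    for query_id, pssm_id, score in hits:
        i, j = row[query_id], col[pssm_id]
        # Keep only the best-scoring alignment for each query/PSSM pair.
        matrix[i][j] = max(matrix[i][j], score)
    return matrix


queries = ["seq1", "seq2"]
pssms = ["pssmA", "pssmB", "pssmC"]
hits = [
    ("seq1", "pssmA", 0.42),
    ("seq1", "pssmA", 0.10),  # weaker second alignment, ignored
    ("seq2", "pssmC", 0.25),
]
score_matrix = build_score_matrix(queries, pssms, hits)
print(score_matrix)
# [[0.42, 0.0, 0.0], [0.0, 0.0, 0.25]]
```
]]></preformat>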
<fig id="pone-0034261-g001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g001</object-id>
<label>Figure 1</label>
<caption>
<title>PHYRN concept and work flow.</title>
<p>PHYRN begins by (i–ii) defining and extracting the domain specific region among the query sequences. (iii) Domain specific regions are then used to create PSSM library using PSI-BLAST. (iv–v) Positive alignments are then calculated between queries and PSSM library using rpsBLAST, and encoded as a PHYRN product score (percentage identity X percentage coverage) matrix. (vi) The product score matrix is converted to a Euclidean distance matrix by calculating Euclidean distance between each query pair. (vii) Phylogenetic trees are then inferred using Neighbor Joining, WEIGHBOR, Minimum Evolution, NINJA, or FastME.</p>
</caption>
<graphic xlink:href="pone.0034261.g001"></graphic>
</fig>
<p>To generate PSSM libraries for the synthetic data sets, we used full-length sequences. Further, since there are no biological homologs for these synthetic sequences in the NCBI non-redundant (nr) database, full length sequences from synthetic data sets were added to the nr database, and PSSMs were generated from this modified non-redundant (nr) database. We used full-length synthetic sequences to generate PSSMs using PSI-BLAST with the aforementioned, modified nr database, at 6 iterations and e-value = 10
<sup>−6</sup>
. In contrast to our previous report
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
, in this study we modified the product score to omit hits, excluded sequence embedding, and modified the PSSM library architecture to allow for increased computational speed in simulated libraries. Instead of organizing the PSSM library as an assembly of individual single-domain databases, we changed the library organization to a single database comprising all the PSSMs. In later sections on the DANGER superfamily, we illustrate how homologous regions from biological protein families can be identified and converted to PSSM libraries. Briefly, this can be accomplished using several approaches such as (i) CDD profiles, (ii) an iterative use of PHYRN methodology, and/or (iii) sequence embedding based approaches
<xref ref-type="bibr" rid="pone.0034261-Hong2">[39]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Hong3">[40]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Han1">[41]</xref>
. All the codes and the user manual for PHYRN can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://www.ccp.psu.edu/downloads">www.ccp.psu.edu/downloads</ext-link>
.</p>
<p>PHYRN uses a custom code for rps-BLAST for recording positive alignments between simulated sequences and their respective PSSM library. For a given profile M, the matrix entry is set to 0 when there is no alignment, or to the positive product score of the alignment with the best PHYRN score (%identity X %coverage) retrieved with an e-value threshold of 10
<sup>10</sup>
. Equation sets for calculating % identity and % coverage were defined as in previous studies
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
. However, unlike the composite score mentioned in Chang et al.
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
which included hits, in this case the PHYRN product score equals %identity X % coverage for each PSSM that provided an alignment
<xref ref-type="bibr" rid="pone.0034261-Bhardwaj1">[10]</xref>
. Percentage identity (%i) and percentage coverage (%c) are defined as follows:</p>
<p>
<bold>%i</bold>
 = [(Number of Identical residues in alignment)/(Alignment length including gaps)]</p>
<p>
<bold>%c</bold>
 = [(Alignment length in query excluding gaps)/(Sequence length of PSSM)]</p>
<p>Thus, the PHYRN product score is directly proportional to the similarity between query sequence and PSSM, and inversely proportional to the gaps in an alignment. Overall, PHYRN product score provides a measurement of the length, robustness, and strength of the alignment. Mathematical derivations show that this PHYRN product score is equivalent to [(1-(Alignment restricted p-distance))*(1-PHYRN gap-weight)] (Equations i–v).</p>
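<p>As a minimal sketch of the two definitions above, the product score for one alignment can be computed from four counts. The function and argument names are illustrative (they follow the abbreviations ids, alen, aqlen, and plen used in the derivation), both factors are kept as fractions rather than percentages, and the example numbers are invented.</p>
<preformat><![CDATA[
```python
def product_score(ids, alen, aqlen, plen):
    """Product score: fractional identity times fractional coverage.

    ids   -- identical residues in the aligned region
    alen  -- alignment length including gaps
    aqlen -- alignment length in the query excluding gaps
    plen  -- length of the PSSM
    """
    if alen == 0 or plen == 0:
        return 0.0  # no alignment: the matrix entry stays 0
    identity = ids / alen
    coverage = aqlen / plen
    return identity * coverage


# Invented example: 30 identities over a 120-column alignment,
# 100 ungapped query positions, PSSM of length 150.
score = product_score(ids=30, alen=120, aqlen=100, plen=150)
print(round(score, 4))  # 0.1667
```
]]></preformat>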
</sec>
<sec id="s2d">
<title>Derivation of PHYRN Product Score</title>
<p>PHYRN product score  =  %Identity × %Coverage
<disp-formula>
<graphic xlink:href="pone.0034261.e001"></graphic>
<label>(i)</label>
</disp-formula>
where:</p>
<p>ids  =  number of identical residues in aligned region</p>
<p>alen =  length of the alignment</p>
<p>aqlen  =  length of the alignment without gaps</p>
<p>plen  =  length of the PSSM
<disp-formula>
<graphic xlink:href="pone.0034261.e002"></graphic>
<label>(ii)</label>
</disp-formula>
<disp-formula>
<graphic xlink:href="pone.0034261.e003"></graphic>
<label>(iii)</label>
</disp-formula>
<disp-formula>
<graphic xlink:href="pone.0034261.e004"></graphic>
<label>(iv)</label>
</disp-formula>
<disp-formula>
<graphic xlink:href="pone.0034261.e005"></graphic>
<label>(v)</label>
</disp-formula>
</p>
<p>Alignment Restricted p-distance (p
<sub>ARP</sub>
) is defined as the proportion of amino acid sites different in alignment defined as a function of PSSM length. It is calculated by dividing number of non-identical amino acid sites by total length of the PSSM. PHYRN Gap Weight (w
<sub>g</sub>
) is defined as proportion of gaps defined as a function of alignment length. It is calculated by dividing total number of gaps in alignment by length of alignment.</p>
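<p>Because Equations (i)–(v) are available here only as images, the following LaTeX restatement is offered as a reading aid rather than a transcription of them. It assumes that the number of gaps in the alignment equals alen − aqlen, and that the number of non-identical sites counted for the alignment restricted p-distance equals plen − ids; under those two assumptions the product score factorizes as stated in the text.</p>
<preformat><![CDATA[
```latex
\begin{align}
\text{score} &= \frac{\mathit{ids}}{\mathit{alen}} \times \frac{\mathit{aqlen}}{\mathit{plen}}
              = \frac{\mathit{ids}}{\mathit{plen}} \times \frac{\mathit{aqlen}}{\mathit{alen}} \\
\frac{\mathit{ids}}{\mathit{plen}} &= 1 - \frac{\mathit{plen} - \mathit{ids}}{\mathit{plen}} = 1 - p_{ARP} \\
\frac{\mathit{aqlen}}{\mathit{alen}} &= 1 - \frac{\mathit{alen} - \mathit{aqlen}}{\mathit{alen}} = 1 - w_{g} \\
\text{score} &= (1 - p_{ARP})\,(1 - w_{g})
\end{align}
```
]]></preformat>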
<p>From the NXM matrix, PHYRN calculates the Euclidean distance between each pair of queries
<xref ref-type="bibr" rid="pone.0034261-Chang1">[8]</xref>
(Equation vi), which can then be depicted as a phylogenetic tree using a variety of tree-building algorithms.</p>
<p>Euclidean distance between two sequences
<italic>X</italic>
and
<italic>Y</italic>
, say
<italic>D(X, Y)</italic>
, is as follows:
<disp-formula>
<graphic xlink:href="pone.0034261.e006"></graphic>
<label>(vi)</label>
</disp-formula>
where X sequence is encoded as a vector of M scores (x
<sub>1</sub>
, x
<sub>2</sub>
, …, x
<sub>M</sub>
).</p>
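<p>The step from the N x M score matrix to the N x N distance matrix can be sketched with the standard Euclidean distance between two score vectors. The sketch below uses only the Python standard library; the function names and the scores are invented for illustration and are not taken from the PHYRN code.</p>
<preformat><![CDATA[
```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length score vectors."""
    if len(x) != len(y):
        raise ValueError("score vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def distance_matrix(scores):
    """Symmetric N x N matrix of pairwise distances between rows of an N x M matrix."""
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(scores[i], scores[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Invented product scores for three queries against four PSSMs.
scores = [
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
dist = distance_matrix(scores)
print(dist[0][1], dist[0][2])  # 1.0 0.0
```
]]></preformat>
<p>The resulting matrix is the kind of input expected by distance-based tree builders such as Neighbor-Joining.</p>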
</sec>
</sec>
<sec id="s3">
<title>Results</title>
<sec id="s3a">
<title>Comparison of Tree Accuracy in Simulated Evolutions</title>
<p>
<xref ref-type="fig" rid="pone-0034261-g002">Figure 2A–F</xref>
depicts how the percentage identity and gap statistics change between PAM 100 and PAM 700 of ROSE generated data sets. We observe that for trees constructed at an overall distance of PAM 550, average percentage identity as calculated from true alignments provided by ROSE is as low as 10% (
<xref ref-type="fig" rid="pone-0034261-g002">Fig 2A</xref>
). In data sets generated at PAM 650 and PAM 700, the average percent identity falls to 8.99% and 8.58%, respectively. We also observe that indel events (i.e. gap openings), calculated per amino acid position, increase with increasing PAM distance (
<xref ref-type="fig" rid="pone-0034261-g002">Fig 2B</xref>
). Moreover, we plotted the frequency distribution of all the gaps in 25 replicates at each divergence range (Number of data sets at each range = 25, Number of sequences in each data set = 100). We observe that with increasing PAM distance, average gap length (AGL) and the frequency of gaps increases (
<xref ref-type="fig" rid="pone-0034261-g002">Fig 2C–F</xref>
); however, the ratio of indel rate to substitution rate (ISR) does not change significantly between PAM 550 and PAM 700. In summary, our comparative statistics across PAM distances demonstrate that increasing PAM distance increases: 1) substitution rates, 2) frequency of gap events, and 3) the average length of gaps.</p>
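<p>Statistics of this kind can be computed directly from an alignment. The sketch below is not the script used for Figure 2: it assumes that percent identity is the mean, over all sequence pairs, of identical residues divided by the columns in which both sequences have a residue, that '-' marks a gap, and that gaps are counted as gap characters rather than gap openings. The toy alignment is invented.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def average_identity(alignment):
    """Mean pairwise percent identity across all sequence pairs."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def gaps_per_position(alignment, gap="-"):
    """Total gap characters divided by total residues in all sequences."""
    gaps = sum(seq.count(gap) for seq in alignment)
    residues = sum(len(seq) - seq.count(gap) for seq in alignment)
    return gaps / residues


toy = ["ACDE-", "ACDF-", "A-DEG"]
print(round(average_identity(toy), 2))   # 80.56
print(round(gaps_per_position(toy), 2))  # 0.25
```
]]></preformat>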
<fig id="pone-0034261-g002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Characteristics of ROSE Data Sets.</title>
<p>Multiple data sets (n = 25) were generated using ROSE at each divergence range (PAM distance  = 100–700). The true alignment provided by the ROSE simulation was used to calculate the percentage identity, and gap statistics. A) Average percent identity calculated from each data set, decreases on increasing PAM distance (n = 25, Error Bars: +/– S.E.M.). B) Distribution of Average INDEL Events per position at different divergence ranges (PAM100–700). Average Indel events are calculated by dividing total number of gaps by total number of amino acid positions in all sequences represented in 25 replicates. C−F) Distribution of gap lengths in all replicates generated at PAM 100-PAM700. (Number of replicates = 25, number of sequences in each replicate  =  100. Average length of each sequence = 450 aa). AGL: Average Gap Length as calculated from the mean of all gap lengths in all 25 replicates. ISR: Indel event Rate/Substitution Rate.</p>
</caption>
<graphic xlink:href="pone.0034261.g002"></graphic>
</fig>
<p>Using these data sets, we first determined the most accurate MSA method for benchmarking in our study. We tested the performance of multiple popular MSA methods in these data sets (MAFFT, MUSCLE, TCOFFEE, and CLUSTAL). We generated trees for 25 different ROSE data sets at an average distance of PAM 700. For rapid comparisons, we employed the NJ algorithm for these analyses. Since we employed a single tree-inference method, phylogenetic accuracy in this analysis is a function of MSA quality.
<xref ref-type="fig" rid="pone-0034261-g003">Figure 3</xref>
shows that MUSCLE and CLUSTAL have improved performance over MAFFT and TCOFFEE that is statistically significant (p<0.01). However, MUSCLE and CLUSTAL have statistically similar performance. Therefore, for the rest of our study, we used MUSCLE as it is computationally much more efficient than CLUSTAL. In data not shown, we tested additional MSA methods (i.e. Dialign, and K-align); however, these were excluded from
<xref ref-type="fig" rid="pone-0034261-g003">Figure 3</xref>
, as they could not generate trees in data sets above PAM 550.</p>
<fig id="pone-0034261-g003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Accuracy Comparison of Different MSA methods.</title>
<p>Graphical representation of average Robinson-Foulds Distance from true ROSE trees (n = 21, PAM700) generated using different MSA methods. All trees were inferred using Neighbor-Joining. (n = 21, Error Bars: +/− S. E. M.). The number of sequences in each data set = 100. Maximum possible RF distance = 194.</p>
</caption>
<graphic xlink:href="pone.0034261.g003"></graphic>
</fig>
<p>In
<xref ref-type="supplementary-material" rid="pone.0034261.s001">Figure S1</xref>
we present our initial comparative analysis of PHYRN and MUSCLE-NJ using both the SeqGen and ROSE data sets generated at seven different ranges of divergence (40%–7% identity,
<xref ref-type="supplementary-material" rid="pone.0034261.s002">Figure S2 C–D</xref>
). In total, we performed 425 simulations totaling 1655 tree comparisons. At the lower levels of divergence, PHYRN marginally outperforms MUSCLE-NJ at recapitulating deep nodes; however, at higher divergences, PHYRN performs significantly better in both the SeqGen and ROSE data sets.</p>
<p>To extend upon our previous comparative analysis we next benchmarked against alignment-free, maximum parsimony, corrected distance, and ML methods (
<xref ref-type="fig" rid="pone-0034261-g004">Figure 4</xref>
and
<xref ref-type="table" rid="pone-0034261-t001">Table 1</xref>
). For this analysis, simulated data sets were generated at four different PAM distances (PAM 100, 550, 650, and 700). For each divergence range we generated 25 different data sets comprised of 100 sequences each with an average sequence length of 450 amino acids. All ML analyses were conducted with the substitution frequencies estimated from the data to ensure the best performance of these algorithms (+F option). Substitution matrix and other parameters used for each method are listed in
<xref ref-type="table" rid="pone-0034261-t001">Table 1</xref>
.</p>
<fig id="pone-0034261-g004" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g004</object-id>
<label>Figure 4</label>
<caption>
<title>Performance Comparison of PHYRN with other Phylogenetic Inference methods.</title>
<p>A–C) Graphical representation of average symmetric distance (Robinson-Foulds Distance) between the true ROSE tree and trees estimated using PHYRN, ACS-NJ (Average Common Substring, Alignment-free method), MUSCLE-FastME (corrected distance method), MSA-PAUP (Maximum Parsimony), MSA-RAxML (Maximum Likelihood), and MUSCLE-GARLI (Maximum Likelihood) based methods. Number of replicates tested at each divergence range = 25, Error bars = +/− S.E.M. The number of sequences in each data set = 100, Avg. Length of sequences = 450. Maximum possible RF Distance = 194.</p>
</caption>
<graphic xlink:href="pone.0034261.g004"></graphic>
</fig>
<p>At the lowest level of divergence (PAM 100), all methods perform equally well (
<xref ref-type="table" rid="pone-0034261-t001">Table 1</xref>
). In addition, for all data sets, we observe that alignment-free, FastME, and Maximum Parsimony (MP) methods perform poorly when compared to RAxML and GARLI, while RAxML and GARLI perform similarly. However, in our PAM 550 data set (25 replicates, ∼10% identity), RAxML and GARLI have average RF-distances of 7.76 and 8.8, respectively, while PHYRN has an average RF-distance of 1.52, which is significantly lower than all methods tested (p<0.0001). Similarly, in our PAM 650 data set (25 replicates, ∼8.9% identity), RAxML and GARLI have average RF-distances of 43.6 and 46.16, respectively, while PHYRN remains robust with an average RF-distance of 7.04. In the most divergent data set we tested in this experiment, PAM 700 (25 replicates, ∼8.5% identity), RAxML and GARLI have average RF-distances of 63.84 and 65.68, respectively. Once again, PHYRN remains relatively robust with an average RF-distance of 13.6. We also tested multiple other methods and settings (NJ, corrected vs. uncorrected distances, Lempel-Ziv distance, PhyML, MP using Protpars, variations in substitution matrices, gamma rate (+G), and empirical frequencies (+F)), the results of which are shown in
<xref ref-type="table" rid="pone-0034261-t001">Table 1</xref>
. Overall, we observe that at high rates of sequence divergence, PHYRN provides statistically more accurate inference of tree topologies than other methods (and their implementations) tested here. To test whether other distance-based tree-inference methods besides NJ would improve PHYRN performance, we tried WEIGHBOR, Ninja, and FASTME (
<xref ref-type="supplementary-material" rid="pone.0034261.s002">Fig S2</xref>
). Importantly, the performance of PHYRN is consistent regardless of the tree-building method employed. Since all methods tested here produced similar results, we suggest that the PHYRN distances derived are robust.</p>
<p>As a final comparison of PHYRN performance, we compared it with the Bayesian method MrBayes
<xref ref-type="bibr" rid="pone.0034261-Ronquist1">[32]</xref>
. Since this Bayesian approach is extremely computationally expensive we compared only five data sets at PAM 700. In this analysis, PHYRN consistently yielded a lower RF-distance, and thus more phylogenetic accuracy, than MrBayes (
<xref ref-type="fig" rid="pone-0034261-g005">Figure 5A</xref>
). Specifically, MUSCLE-MrBAYES has an average RF-distance of 46.0, while PHYRN has an RF-distance of 8.0 for these five data sets. The differences in performance are highlighted in
<xref ref-type="fig" rid="pone-0034261-g005">Figure 5B,C</xref>
. Panel 5B depicts a consensus tree between one trial of PHYRN and the true ROSE tree; a branch value of 100 means that PHYRN inferred the correct branching pattern, while a value of 50 means that PHYRN incorrectly inferred that branch. Panel 5C depicts the analogous results from one trial of MrBayes and the true ROSE tree. From this we observe that PHYRN has only four branching errors, while MrBayes contains 30.</p>
<fig id="pone-0034261-g005" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g005</object-id>
<label>Figure 5</label>
<caption>
<title>PHYRN outperforms MrBayes in ‘midnight-zone’ synthetic data sets.</title>
<p>A) Graphical representation of symmetric distance (from true ROSE trees) for trees inferred using PHYRN and MrBayes. Data sets used were generated using ROSE at PAM 700 distance. Number of data sets tested (n = 5). The number of sequences in each data set = 100. Maximum possible RF distance = 194. Error Bars: +/−S.E.M. B) Consensus tree between true ROSE tree and PHYRN tree (PAM 700 data set 1). Red circles mark nodes that are incorrectly inferred by PHYRN. C) Consensus tree between true ROSE tree and MrBayes tree (PAM 700 data set 1). Red circles mark nodes that are incorrectly inferred by MrBayes.</p>
</caption>
<graphic xlink:href="pone.0034261.g005"></graphic>
</fig>
<p>Following these results, we examined the scalability of PHYRN with a single data set comprised of 1000 sequences and a mean distance of PAM 550. The consensus tree between the true ROSE tree and the PHYRN-NJ tree shows that PHYRN infers only 8 branches incorrectly out of a total of 1998 branches. Moreover, the PHYRN-NJ tree shows a symmetric RF distance of 14 to the true ROSE tree (
<xref ref-type="supplementary-material" rid="pone.0034261.s003">Figure S3</xref>
). In data not shown we also tested the efficacy of PHYRN using different tree topologies at extreme divergences. In both biased (i.e. unbalanced) and unbiased trees (i.e. balanced), PHYRN outperforms all MSA-based methods analyzed here. However, in highly biased trees, all methods fail to perform due to the extreme divergence that occurs at the basal nodes. Thus, additional experimentation is needed to resolve highly biased evolutionary histories.</p>
</sec>
<sec id="s3b">
<title>Isolating Variables Underlying Phylogenetic Accuracy</title>
<p>The relatively poor performance we observe for MSA-based methods in this study could be due to sub-optimal MSA quality, inaccurate tree inference, or both. To discriminate between these variables we employed the true alignments provided by ROSE. If phylogenetic accuracy is not substantially improved when the true alignment is supplied, we can infer that the tree-inference method is sub-optimal; if it is improved, the alignment step is the limiting factor. In this experiment, we used the 25 data sets generated at PAM 700 distance from ROSE, and trees were inferred using corrected distances (FastME) and GARLI. Notably, trees inferred using the true alignment perform very well, with an average RF distance of 5.12 using FastME and 0.18 using GARLI (
<xref ref-type="fig" rid="pone-0034261-g006">Figure 6</xref>
). Hence, these data demonstrate that poor phylogenetic accuracy as observed in earlier comparisons is largely due to poor MSA quality. Indeed,
<xref ref-type="fig" rid="pone-0034261-g006">Figure 6</xref>
shows that when these same data sets are aligned by MUSCLE, both FastME and GARLI markedly lose phylogenetic accuracy. Since PHYRN does not use an MSA step, we could not use the true alignment with this method, but PHYRN using its default methodology gives an average RF distance of 14.16. In sum, these results show that phylogenetic inference in divergent data sets is stymied by the sub-optimal quality of MSA, not by tree-inference methods.</p>
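<p>The averages and error bars reported here (mean RF distance ± S.E.M. over replicate data sets) follow from standard formulas, sketched below. The replicate values in the example are hypothetical and serve only to show the arithmetic; the normalization by the maximum RF distance, 2(n − 3) for n sequences, is added for illustration and is not a statistic reported in the text.</p>
<preformat>
```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for replicate scores."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def normalized_rf(rf, n_leaves):
    """Scale an RF distance by its maximum, 2 * (n - 3), for unrooted binary trees."""
    return rf / (2 * (n_leaves - 3))


# Hypothetical RF distances for five replicates (illustrative, not from the paper).
example_rf = [12, 16, 14, 10, 18]
mean_rf, sem_rf = mean_and_sem(example_rf)

# Average RF distance reported for PHYRN in the text, as a fraction of the maximum (194).
phyrn_fraction = normalized_rf(14.16, 100)
```
</preformat>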
<fig id="pone-0034261-g006" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Effect of ‘True Alignment’ on Phylogenetic Inference.</title>
<p>Graphical representation of average symmetric distance (Robinson-Foulds Distance) between the true ROSE tree and trees estimated using PHYRN, corrected distance (FastME) and ML methods (GARLI). Corrected Distance and ML trees were generated with both MUSCLE alignment, and True Alignment (TA) provided by ROSE. (Number of replicates tested at each divergence range = 25, Error bars = +/− S.E.M. Number of sequences in each data set = 100, Avg. Length of sequences = 450). Maximum possible RF distance = 194.</p>
</caption>
<graphic xlink:href="pone.0034261.g006"></graphic>
</fig>
</sec>
<sec id="s3c">
<title>DANGER Superfamily as a Phylogenetic Test-Bed</title>
<p>To determine the efficacy of PHYRN in a biologically relevant data set, we examine sequences from the highly divergent DANGER superfamily. This developmental superfamily is ubiquitously expressed and is linked to multiple physiological (Ca
<sup>2+</sup>
signaling, cranio-facial development, reproduction, neurite outgrowth
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-vanRossum1">[43]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Lau1">[44]</xref>
) and pathophysiological (excitotoxicity) processes
<xref ref-type="bibr" rid="pone.0034261-Kang1">[45]</xref>
. Until recently, members of the MAB21-domain-containing DANGER superfamily escaped detection due to their extreme divergence
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
. Indeed, significant genetic correlates were required to support monophyletic groups for this superfamily. A previous study by our group, which relied on MSA-based approaches, defined six distinct monophyletic groups of DANGER
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
. Although the orthologous relationships were well defined, the paralogous relationships in this family were ambiguous. Indeed, even upon rigorous genetic analyses and extensive manual editing of these alignments, deep-node statistical support was unattainable.</p>
</sec>
<sec id="s3d">
<title>Implementation of PHYRN for Biological Data</title>
<p>In the simulated data sets reported earlier, we utilized the full-length sequences to generate PSSMs. However, biological data sets are often comprised of both homologous and non-homologous regions. Previous studies have demonstrated that phylogenetic inference in divergent data sets improves when measurements of phylogenetic signal are limited to homologous regions
<xref ref-type="bibr" rid="pone.0034261-Ko1">[3]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Bhardwaj1">[10]</xref>
. Therefore, we sought to further refine PSSM generation by limiting the PHYRN-based measurement of phylogenetic signal to homologous regions and to generate PSSM libraries for these areas of interest (see
<xref ref-type="supplementary-material" rid="pone.0034261.s006">Supporting Methods S1</xref>
for complete description). Briefly, we curated protein sequences belonging to the DANGER superfamily from the literature and sequence databases. All known DANGER members (D1–D6 groups) share the MAB-21 domain
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
. Therefore, we aligned each putative DANGER member against PSSMs for the MAB-21 domain as defined by NCBI Conserved Domain Database (CDD)
<xref ref-type="bibr" rid="pone.0034261-MarchlerBauer1">[46]</xref>
. These alignments were utilized to define the homologous region in each protein sequence. Together, these regions were converted to a MAB21-specific PSSM library containing 112 PSSMs using PSI-BLAST and compiled into an rpsBLAST compatible database.</p>
<p>Once an appropriate PSSM library is constructed, the next step is to align all queries with all PSSMs and encode the alignment statistics into an N×M matrix, where N is the number of full-length query sequences and M is the number of PSSMs in the library. Hence, we aligned 108 full-length DANGER query sequences against 112 MAB-21 PSSMs using rpsBLAST. A composite score matrix (%identity×%coverage) was generated by encoding alignment statistics for all query-PSSM alignments. Pairwise distances among the query sequences (i.e. an N×N matrix) were computed as Euclidean distances between the rows of the 108×112 data matrix. Finally, we inferred a phylogenetic tree from this distance matrix with the Neighbor-Joining (NJ) method available in the MEGA package
<xref ref-type="bibr" rid="pone.0034261-Tamura1">[47]</xref>
(See
<xref ref-type="supplementary-material" rid="pone.0034261.s006">Supporting Methods S1</xref>
for more details.).</p>
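<p>The matrix encoding and distance step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PHYRN implementation: identity and coverage are taken as fractions between 0 and 1, query-PSSM pairs without an alignment are scored as 0, and the alignment statistics are hypothetical values rather than parsed rpsBLAST output. All function names are invented for the example. The resulting N×N matrix is the kind of input a Neighbor-Joining program would then consume.</p>
<preformat>
```python
import math


def composite_score(identity, coverage):
    """Product score for one query-PSSM alignment (inputs as fractions 0-1)."""
    return identity * coverage


def score_matrix(hits, n_queries, n_pssms):
    """Build the N x M matrix; pairs with no alignment stay at 0 (an assumption)."""
    matrix = [[0.0] * n_pssms for _ in range(n_queries)]
    for (query, pssm), (identity, coverage) in hits.items():
        matrix[query][pssm] = composite_score(identity, coverage)
    return matrix


def euclidean_distances(matrix):
    """Return the N x N matrix of Euclidean distances between query rows."""
    return [[math.dist(row_a, row_b) for row_b in matrix] for row_a in matrix]


# Hypothetical alignment statistics: (query index, PSSM index) -> (identity, coverage)
hits = {
    (0, 0): (0.30, 0.90), (0, 1): (0.25, 0.80),
    (1, 0): (0.28, 0.85), (1, 2): (0.20, 0.50),
    (2, 2): (0.40, 1.00),
}
scores = score_matrix(hits, n_queries=3, n_pssms=3)
distances = euclidean_distances(scores)
```
</preformat>
<p>Each query is thus represented by its row of scores against the whole PSSM library, so two queries are close when they align similarly well to the same PSSMs.</p>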
</sec>
<sec id="s3e">
<title>Comparative Analyses of Inferred Trees</title>
<p>To compare PHYRN-based trees with traditional methods, we generated phylogenetic trees with the same DANGER sequences using a variety of MSA and tree building algorithms. These include: (i) MUSCLE-MrBayes
<xref ref-type="bibr" rid="pone.0034261-Edgar1">[5]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Edgar2">[22]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Ronquist1">[32]</xref>
, (ii) MUSCLE-PhyML
<xref ref-type="bibr" rid="pone.0034261-Edgar1">[5]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Edgar2">[22]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Guindon1">[27]</xref>
, (iii) MUSCLE-NJ, (iv) CLUSTAL-NJ
<xref ref-type="bibr" rid="pone.0034261-Thompson1">[25]</xref>
, and (v) TCOFFEE-NJ
<xref ref-type="bibr" rid="pone.0034261-Notredame1">[26]</xref>
(
<xref ref-type="fig" rid="pone-0034261-g007">Figure 7</xref>
). Together these five approaches are representative of traditional character-based and distance-based methods for phylogenetic inference and form a suitable set for comparative analysis with PHYRN.
<xref ref-type="fig" rid="pone-0034261-g007">Figure 7a</xref>
and
<xref ref-type="supplementary-material" rid="pone.0034261.s004">Figure S4a</xref>
depict the unrooted tree derived by PHYRN, from which we observe plausible biological patterning of six monophyletic groups (D1–D6) in accord with our previous studies
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
. For example, within the D6 clade, cnidarians (e.g. sea anemone) occupy a basal position, followed by nematodes (e.g. worms), urochordates (e.g. sea squirt), arthropods (e.g. insects), and chordates (e.g. humans). However, a single sequence from sea urchin diverges subsequent to arthropods, and thus appears to be misplaced. Trees generated by MUSCLE-NJ, MUSCLE-PhyML, CLUSTAL-NJ, and TCOFFEE-NJ also place this sequence in the same, possibly erroneous position (
<xref ref-type="fig" rid="pone-0034261-g007">Figure 7c–f</xref>
). By comparison, MUSCLE-MrBayes lacks monophyly for various groups, such as members of D2 and D3 clades and incorrectly places
<italic>Nematostella</italic>
(i.e. Cnidaria- sea anemone) D3 sequences after other higher order organisms. CLUSTAL-NJ tree splits members of D2 clade, and places some
<italic>Nematostella</italic>
sequences after the mammalian specific group D1. MUSCLE-NJ and TCOFFEE-NJ trees also misplace
<italic>Nematostella</italic>
sequences. MUSCLE-PhyML provides good bootstrap support but splits members of D3 clade. Thus, all methods with the exception of PHYRN either fail to infer monophyly, and/or yield a tree with an improbable evolutionary scenario.</p>
<fig id="pone-0034261-g007" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Comparison of Topology and Resampling Statistics for Various Tree Construction Methods.</title>
<p>Collapsed unrooted phylogenetic trees for DANGER superfamily generated using (A) PHYRN-NJ, (B) MUSCLE-MrBayes, (C) MUSCLE-PhyML, (D) MUSCLE-NJ, (E) CLUSTAL-NJ and (F) TCOFFEE-NJ. For PHYRN trees the statistics are represented by two numbers, with Bootstrap listed first followed by Jackknife statistics. Statistics for panel A were calculated from resampling results from 3000 replicates. Bootstrap statistics for panels B–F were calculated from resampling results from 1000 replicates.</p>
</caption>
<graphic xlink:href="pone.0034261.g007"></graphic>
</fig>
<p>To assess the statistical support for these various phylogenetic trees we conducted an 80% jack-knife resampling for PHYRN and bootstrap resampling for all approaches. By these measures PHYRN obtains support of >83% (bootstrap) and >88% (jack-knife) for all deep nodes except for the placement of the D4 clade. Conversely, none of the traditional methods tested obtains significant support for any deep node other than the D5/D6 clades, which form the most conserved subgroup in the superfamily.
<xref ref-type="supplementary-material" rid="pone.0034261.s004">Figure S4</xref>
depicts the unrooted and non-collapsed phylogenetic trees for PHYRN and MrBayes with resampling statistics at all branch points. On leaf nodes, both methods perform equally well; however, there are major differences in topology and branch statistics between the methods. Overall, this suggests that PHYRN has an increased ability to measure low phylogenetic signal.</p>
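<p>The two resampling schemes named above differ only in how replicate data sets are drawn, which the following sketch illustrates. It assumes that the resampled units are the columns (PSSMs) of the query-by-PSSM score matrix; the text does not state this explicitly, and Supporting Methods S1 remains the authoritative description. The function names are hypothetical, and tree building for each replicate is left out.</p>
<preformat>
```python
import random


def bootstrap_columns(n_columns, rng):
    """Draw n_columns column indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_columns) for _ in range(n_columns)]


def jackknife_columns(n_columns, rng, keep_fraction=0.8):
    """Keep a random fraction (default 80%) of the columns, without replacement."""
    n_keep = round(n_columns * keep_fraction)
    return sorted(rng.sample(range(n_columns), n_keep))


def resample_matrix(matrix, columns):
    """Return a copy of the score matrix restricted to the chosen columns."""
    return [[row[c] for c in columns] for row in matrix]


def split_support(replicate_splits, split):
    """Percentage of replicate trees (given as sets of splits) containing `split`."""
    found = sum(1 for splits in replicate_splits if split in splits)
    return 100.0 * found / len(replicate_splits)


rng = random.Random(0)  # fixed seed so the example is reproducible
jackknife_sample = jackknife_columns(112, rng)
bootstrap_sample = bootstrap_columns(112, rng)
```
</preformat>
<p>A tree would be inferred from each resampled matrix, and the support for a branch is the percentage of replicate trees in which the corresponding split reappears.</p>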
</sec>
<sec id="s3f">
<title>Meta-Analysis of PHYRN-Derived Data</title>
<p>In our previous evolutionary study of DANGER, we identified a single sequence from choanoflagellate, which was used as the putative outgroup
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
. Importantly, this sequence obtained no statistical support for this position. To ascertain whether this sequence was indeed an outgroup, we searched for additional putative DANGER sequences in multiple publicly available sequence databases including NCBI, Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA,
<xref ref-type="bibr" rid="pone.0034261-Sun1">[48]</xref>
), and Department of Energy Joint Genome Institute (JGI) databases. Taken together, we identified 13 additional
<italic>Monosiga</italic>
sequences (i.e. Choanoflagellate- microscopic, heterotrophic single-celled and colony-forming eukaryotes). When we incorporate these sequences into our analyses, PHYRN infers a monophyletic topology; however, the choanoflagellate sequences form a distinct clade, with D3 as the nearest neighbor (
<xref ref-type="supplementary-material" rid="pone.0034261.s005">Figure S5a</xref>
). Moreover, their inclusion drastically reduces the statistical support across the entire tree (compare
<xref ref-type="fig" rid="pone-0034261-g007">Figure 7a</xref>
and
<xref ref-type="supplementary-material" rid="pone.0034261.s005">Figure S5a</xref>
). Based upon this observation, the homology of these
<italic>Monosiga</italic>
sequences with the DANGER superfamily is highly questionable, and their inclusion in the phylogeny is likely in error.</p>
<p>From the matrix data generated by PHYRN, we can obtain additional quantitative measurements such as group-wise distribution of composite scores of sequence to PSSM comparisons, as well as their information content. These measures can be utilized to interrogate placement of the
<italic>Monosiga</italic>
group in the DANGER phylogeny.
<xref ref-type="supplementary-material" rid="pone.0034261.s005">Figure S5b</xref>
demonstrates that in all cases, these choanoflagellate PSSMs have the fewest alignments across all clades, and their sequences have the lowest information content (average product score, ±S.E.M). Moreover, the position of the choanoflagellate sequences relative to the vertebrate-specific D1 clade within the tree is suspect. In this scenario, multiple clades that contain ancient species (e.g. cnidarians, nematodes, and arthropods) would have evolved after D1. Thus, for this scenario to make sense, D1 proteins would have to have been lost from all species prior to chordates, which is not parsimonious. Final evidence that these choanoflagellate sequences are not homologous to DANGER, and thus do not belong in the phylogeny, comes from exhaustive searches of sequence databases. We could not identify any DANGER sequences in species before choanoflagellate or between choanoflagellate and cnidaria.</p>
<p>Thus, the question arises: which DANGER clade is the oldest? In our quantitative statistics, we observe that PSSMs from the D6 clade have the highest group-wise distribution and D6 sequences have the highest information content (
<xref ref-type="supplementary-material" rid="pone.0034261.s005">Figure S5b</xref>
). Further, in the unrooted tree the D6 clade has the longest branch length. Thus, D6 is the most logical outgroup of the superfamily based on (i) statistical support, (ii) information content, and (iii) speciation. Taken together, our results suggest the following evolutionary scenario (
<xref ref-type="fig" rid="pone-0034261-g008">Figure 8</xref>
). The first DANGER sequences emerged in cnidarians (>580 million years ago), which are some of the first organisms known to have a developed neural net, radial axis of symmetry, muscle cells, and stem cells
<xref ref-type="bibr" rid="pone.0034261-Watanabe1">[49]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Watanabe2">[50]</xref>
. Accordingly, members of the DANGER superfamily have been shown by functional studies to be involved in neurite length extension
<xref ref-type="bibr" rid="pone.0034261-Nikolaidis1">[42]</xref>
, calcium mobilization
<xref ref-type="bibr" rid="pone.0034261-vanRossum1">[43]</xref>
, and developmental patterning
<xref ref-type="bibr" rid="pone.0034261-Lau1">[44]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Chow1">[51]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Wong1">[52]</xref>
. If we root our phylogenetic tree with D6 (
<xref ref-type="fig" rid="pone-0034261-g008">Figure 8</xref>
), we see a “simple to complex” evolutionary pattern for the DANGER superfamily, with the mammalian-specific D1 clade attaining the most distant position. Similarly, we see the appearance of simpler organisms before more complex organisms within individual monophyletic groups. For example, in D6 clade, cnidarians are the first ones to show DANGER followed by nematodes, arthropods and then chordates. Importantly, we could not identify cnidarian sequences in D4 and D5 clades. This is relevant because relatively newer clades D3 and D2 do have cnidarian members, and suggests that D4 and D5 were lost from cnidaria.</p>
<fig id="pone-0034261-g008" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0034261.g008</object-id>
<label>Figure 8</label>
<caption>
<title>Model for the Evolution of the DANGER Superfamily.</title>
<p>Graphical representation of the Neighbor-Joining (NJ) tree for 108 DANGER sequences generated by PHYRN. In this model, DANGER appeared first in cnidarian organisms
<italic>(Nematostella)</italic>
and then evolved into 6 different clades. The chordate specific group, D1 attains the furthest position from the putative root (D6).</p>
</caption>
<graphic xlink:href="pone.0034261.g008"></graphic>
</fig>
</sec>
</sec>
<sec id="s4">
<title>Discussion</title>
<p>Within divergent biological data sets it is impossible to know the true evolutionary history of sequences under consideration. Due to the lack of knowledge about true evolutionary history, there is no way to accurately evaluate the performance of algorithms on biological data sets. Therefore, in the present study we utilized simulations as test beds of phylogenetic inference. Only in this way can one measure a true evolutionary history, and hence accurately quantify the performance of various algorithms by comparing ‘inferred history’ to ‘true history’. Indeed, synthetic data sets have frequently been used for benchmarking algorithm performance
<xref ref-type="bibr" rid="pone.0034261-Talavera1">[12]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Roshan1">[13]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Liu2">[14]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Price1">[15]</xref>
,
<xref ref-type="bibr" rid="pone.0034261-Beiko1">[16]</xref>
. Our comparative analyses of synthetic protein and biological sequence data indicate that PHYRN can provide a more accurate and statistically robust representation of evolutionary history within the “twilight zone/midnight zone” of sequence similarity than multiple popular MSA-based approaches. Our interpretation is supported by several key findings: 1) PHYRN outperforms all distance and ML methods tested given a MUSCLE alignment, 2) PHYRN outperforms a Bayesian method (MrBayes) given a MUSCLE alignment, and 3) both distance-based and character-based methods require the true alignment in order to outperform PHYRN.</p>
<p>While simulations do not entirely reflect all of the underlying mechanisms of natural molecular evolution, they still represent powerful approximations of the evolutionary process (for example, including substitution matrices derived from biological databases, inclusion of insertions-deletions) and we have tested our method on two of the most-utilized simulation methods (SeqGen and ROSE). While more research is needed to develop improved models of evolution that more accurately reflect biological mechanisms, this does not detract from their utility for benchmarking studies, and PHYRN also appeared to perform well on a real biological data set comprising a highly divergent superfamily of developmental proteins (the DANGER superfamily).</p>
<p>In both synthetic and biological data sets we reason that the improved performance of PHYRN is due to the increased information content contained in PSSM libraries and an effective alignment search and scoring method. Conversely, the inability of MSA methods to obtain accurate alignments at high divergence leads to low accuracy of trees across all tree-building methods. At the lower end of this performance spectrum, Neighbor-Joining performed well in conserved data sets, but poorly at higher levels of divergence. Some ML methods (RAxML and GARLI) perform better than NJ, but their performance is also greatly limited by the quality of the input MSA. Bayesian methods, which are computationally very slow (because a whole posterior distribution of trees is produced), show a similar performance to RAxML and GARLI. We also considered other approaches such as PROBCONS and other consistency-based models, but these have been shown to be slower, and thus are not easily scalable. Moreover, PROBCONS has been previously benchmarked in the twilight-zone
<xref ref-type="bibr" rid="pone.0034261-Essoussi1">[53]</xref>
, which showed that PROBCONS performs no better than ClustalX, Align-m, T-Coffee, SAGA, MAFFT, MUSCLE and DIALIGN. More generally, our results on tree inference using true ROSE alignments show that better alignments may be the key to estimating accurate phylogenies in highly divergent data sets (
<xref ref-type="fig" rid="pone-0034261-g006">Figure 6</xref>
).</p>
<p>For those MSA-based methods tested, we have tried to give these algorithms the best opportunity to perform well. In addition to comparisons with the default settings, we also explored: (i) equilibrium frequencies estimated from the empirical data, and (ii) a variety of among-site rate variation models in the Bayesian method. Importantly, most of these settings did not improve performance to any great extent. An ideal MSA method at extreme divergence involves “cleaning” for badly aligned regions followed by tree-inference. To accomplish this goal, we filtered our MSAs using Gblocks
<xref ref-type="bibr" rid="pone.0034261-Talavera1">[12]</xref>
; strikingly, however, our simulated data sets are so divergent that Gblocks fails to recognize any conserved sequence blocks. Thus, in these simulated data sets it is impossible to simply filter out badly aligned regions. Nevertheless, we acknowledge that there are still settings that could be fine-tuned to improve alignment estimation and tree inference in a data set dependent manner. In particular, although we have tried to benchmark against many popular MSA methods, further experimentation is needed to benchmark PHYRN against other MSA algorithms such as Prank
<xref ref-type="bibr" rid="pone.0034261-Loytynoja1">[54]</xref>
and SATe
<xref ref-type="bibr" rid="pone.0034261-Liu3">[55]</xref>
. In addition, we also need to explore PHYRN's performance in data sets where substitution rates deviate substantially from a molecular clock, and where evolutionary models are permitted to change across a phylogeny.</p>
<p>In conclusion, we propose that the increased performance on synthetic and biological data sets demonstrates that PHYRN is an accurate and scalable approach. We suggest that PHYRN's ability to handle large numbers of highly divergent sequences makes it an ideal framework to study a number of unanswered questions relating to some of the earliest events in the history of life. Future work will focus on exploring: (i) the utility of PHYRN-based ‘guide trees’ for improving MSA-based algorithms, (ii) the integration of PHYRN-based distance estimates with other statistical methods such as Maximum Likelihood, and (iii) the refinement of PHYRN-based PSSM libraries with Markovian statistics (i.e. HMM profiles)
<xref ref-type="bibr" rid="pone.0034261-Eddy1">[56]</xref>
.</p>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0034261.s001">
<label>Figure S1</label>
<caption>
<p>
<bold>PHYRN outperforms MSA in synthetic protein families.</bold>
Consensus tree between true ROSE tree and tree generated using a) PHYRN and b) MUSCLE with NJ. Simulated protein family generated using ROSE, with an average distance of 550 (p distance ∼0.83). Red circles mark the branch points (nodes) that are recapitulated incorrectly. (# of query sequences = 67). c) Graphical representation of %deep node recapitulation versus SeqGen scaling factor. Number of replicates for each bar = 25, Error bars  =  +/− S.E.M. *p-value<0.01. Number of sequences in each data set = 100, Length of sequences = 450. d) Graphical representation of %deep node recapitulation versus average Rose distance. Number of replicates for each bar = 25, Error bars = +/− S.E.M. *p-value < 0.01. Number of sequences in each data set = 100, Avg. Length of sequences = 450.</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s001.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0034261.s002">
<label>Figure S2</label>
<caption>
<p>
<bold>Effect of Tree Inference Method on PHYRN Performance.</bold>
Graphical representation of symmetric distance for trees inferred from PHYRN distance matrix and different tree inference methods. Number of replicates tested at each divergence range = 25, Error bars = +/− S.E.M. Number of sequences in each data set = 100, Avg. Length of sequences = 450. (Maximum possible RF distance for each data set = 194).</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s002.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0034261.s003">
<label>Figure S3</label>
<caption>
<p>
<bold>Deep node recapitulation of ‘true evolutionary history’ in mega-phylogenies.</bold>
Consensus phylogenetic tree between true ROSE tree and tree generated using PHYRN. The simulated protein family was generated using ROSE, with an average PAM distance of 550. Red colored branches mark the branches that are recapitulated incorrectly in the consensus tree (number of query sequences = 1000). PHYRN recapitulates 1990 branches correctly out of a total of 1998 branches in the consensus tree. PHYRN shows an RF distance of 14 from the true ROSE tree (maximum possible RF distance for this data set = 1994).</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s003.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0034261.s004">
<label>Figure S4</label>
<caption>
<p>
<bold>Comparison of PHYRN and MrBayes generated Trees for DANGER Superfamily.</bold>
Unrooted Phylogenetic trees for 108 DANGER sequences generated using (A) PHYRN or (B) MUSCLE-MrBayes. Statistical support for PHYRN calculated using Bootstrap and Jackknife analysis, while for MUSCLE-MrBayes only bootstrap was used. The blank marked ‘‘_/_’’ in the statistical support indicates that the clustering of the branching connection cannot be measured in a standardized fashion by the given resampling method.</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s004.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0034261.s005">
<label>Figure S5</label>
<caption>
<p>
<bold>Identification of most basal DANGER clade using PHYRN quantitative measures.</bold>
(A) DANGER tree generated by PHYRN-NJ including 13
<italic>Monosiga</italic>
sequences. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances. Statistical support was calculated using Bootstrap and Jackknife analysis from 3,000 replicates and is reported as percentages with bootstrap values labeled first. (B) This bar graph depicts additional quantitative measures derived by PHYRN for group-wise distribution of composite score (i.e. percentage identity × percentage coverage). Error bars = +/−S.E.M. In all cases, choanoflagellate sequences have the lowest information content (average PHYRN product score, ± S.E.M).</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s005.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0034261.s006">
<label>Supporting Methods S1</label>
<caption>
<p>
<bold>Supplemental methods describing PHYRN PSSM generation and simulation parameters.</bold>
</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0034261.s006.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>We would like to thank Teresa Killick, Loukia Hadjiyianni, Alyssa Thunen, Natasha Shah, and Anand Padmanbha for their help and support during the project, as well as Jason Holmes at the Pennsylvania State University CAC center for technical assistance. We would like to thank Drs. Robert E. Rothe, Russell H. Carroll, Jim White, Barbara VanRossum, and Cordozar C. Broadus for creative dialogue. We also thank the anonymous reviewers for their helpful suggestions at different stages of this manuscript review process.</p>
</ack>
<fn-group>
<fn fn-type="conflict">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="financial-disclosure">
<p>
<bold>Funding: </bold>
This work was supported by the Searle Young Investigators Award and start-up money from PSU (RLP), NCSA grant TG-MCB070027N (RLP, DVR), The National Science Foundation 428-15 691M (RLP, DVR), and The National Institutes of Health R01 GM087410-01 (RLP, DVR). This project was also funded by a Fellowship from the Eberly College of Sciences and the Huck Institutes of the Life Sciences (DVR) and a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds (DVR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="pone.0034261-Blake1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blake</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>FE</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Pairwise sequence alignment below the twilight zone.</article-title>
<source>JMolBiol</source>
<volume>307</volume>
<fpage>721</fpage>
<lpage>735</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Yona1">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yona</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Levitt</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Within the twilight zone: a sensitive profile-profile comparison tool based on information theory.</article-title>
<source>JMolBiol</source>
<volume>315</volume>
<fpage>1257</fpage>
<lpage>1275</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Ko1">
<label>3</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ko</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Bhardwaj</surname>
<given-names>G</given-names>
</name>
<name>
<surname>van Rossum</surname>
<given-names>DB</given-names>
</name>
<etal></etal>
</person-group>
<year>2008</year>
<article-title>Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution.</article-title>
<publisher-name>arXiv</publisher-name>
<size units="page">0806.239</size>
</element-citation>
</ref>
<ref id="pone.0034261-Liu1">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Linder</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Warnow</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Multiple sequence alignment: a major challenge to large-scale phylogenetics.</article-title>
<source>PLoS Curr</source>
<volume>2</volume>
<fpage>RRN1198</fpage>
<pub-id pub-id-type="pmid">21113338</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Edgar1">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>MUSCLE: a multiple sequence alignment method with reduced time and space complexity.</article-title>
<source>BMC Bioinformatics</source>
<volume>5</volume>
<fpage>113</fpage>
</element-citation>
</ref>
<ref id="pone.0034261-Roch1">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roch</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Toward extracting all phylogenetic information from matrices of evolutionary distances.</article-title>
<source>Science</source>
<volume>327</volume>
<fpage>1376</fpage>
<lpage>1379</lpage>
<pub-id pub-id-type="pmid">20223986</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Bergsten1">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bergsten</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>A review of long-branch attraction.</article-title>
<source>Cladistics</source>
<volume>21</volume>
<fpage>163</fpage>
<lpage>193</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Chang1">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chang</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Bhardwaj</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>EC</given-names>
</name>
<etal></etal>
</person-group>
<year>2008</year>
<article-title>Phylogenetic profiles reveal evolutionary relationships within the “twilight zone” of sequence similarity.</article-title>
<source>Proc Natl Acad Sci USA</source>
<volume>105</volume>
<fpage>13474</fpage>
<lpage>13479</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Ko2">
<label>9</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ko</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Bhardwaj</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Killick</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>van Rossum</surname>
<given-names>DB</given-names>
</name>
<etal></etal>
</person-group>
<year>2009</year>
<article-title>Brainstorming through the Sequence Universe: Theories on the Protein Problem.</article-title>
<publisher-name>Physics Archives q-bio.QM</publisher-name>
<fpage>1</fpage>
<lpage>21</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Bhardwaj1">
<label>10</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bhardwaj</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>GS</given-names>
</name>
<etal></etal>
</person-group>
<year>2010</year>
<article-title>Theories on PHYlogenetic ReconstructioN (PHYRN).</article-title>
<publisher-name>arXiv q-bio.PE, q-bio.QM</publisher-name>
<fpage>1</fpage>
<lpage>13</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Hong1">
<label>11</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>van Rossum</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>RL</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Adaptive BLASTing through Sequence Dataspace: Theories on Protein Sequence Embedding.</article-title>
<publisher-name>Physics Archives q-bio.QM</publisher-name>
<fpage>1</fpage>
<lpage>21</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Talavera1">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Talavera</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Castresana</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments.</article-title>
<source>Syst Biol</source>
<volume>56</volume>
<fpage>564</fpage>
<lpage>577</lpage>
<pub-id pub-id-type="pmid">17654362</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Roshan1">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roshan</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Livesay</surname>
<given-names>DR</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Probalign: multiple sequence alignment using partition function posterior probabilities.</article-title>
<source>Bioinformatics</source>
<volume>22</volume>
<fpage>2715</fpage>
<lpage>2721</lpage>
<pub-id pub-id-type="pmid">16954142</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Liu2">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Raghavan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Nelesen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Linder</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Warnow</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees.</article-title>
<source>Science</source>
<volume>324</volume>
<fpage>1561</fpage>
<lpage>1564</lpage>
<pub-id pub-id-type="pmid">19541996</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Price1">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Price</surname>
<given-names>MN</given-names>
</name>
<name>
<surname>Dehal</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Arkin</surname>
<given-names>AP</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>FastTree: computing large minimum evolution trees with profiles instead of a distance matrix.</article-title>
<source>Mol Biol Evol</source>
<volume>26</volume>
<fpage>1641</fpage>
<lpage>1650</lpage>
<pub-id pub-id-type="pmid">19377059</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Beiko1">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beiko</surname>
<given-names>RG</given-names>
</name>
<name>
<surname>Charlebois</surname>
<given-names>RL</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>A simulation test bed for hypotheses of genome evolution.</article-title>
<source>Bioinformatics</source>
<volume>23</volume>
<fpage>825</fpage>
<lpage>831</lpage>
<pub-id pub-id-type="pmid">17267425</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Lassmann1">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lassmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Frings</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features.</article-title>
<source>Nucleic Acids Res</source>
<volume>37</volume>
<fpage>858</fpage>
<lpage>865</lpage>
<pub-id pub-id-type="pmid">19103665</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Stoye1">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Evers</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>F</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Rose: generating sequence families.</article-title>
<source>Bioinformatics</source>
<volume>14</volume>
<fpage>157</fpage>
<lpage>163</lpage>
<pub-id pub-id-type="pmid">9545448</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Grassly1">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grassly</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Adachi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rambaut</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees.</article-title>
<source>Comput Appl Biosci</source>
<volume>13</volume>
<fpage>559</fpage>
<lpage>560</lpage>
<pub-id pub-id-type="pmid">9367131</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Sonnhammer1">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
<name>
<surname>Hollich</surname>
<given-names>V</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Scoredist: a simple and robust protein sequence distance estimator.</article-title>
<source>BMC Bioinformatics</source>
<volume>6</volume>
<fpage>108</fpage>
<pub-id pub-id-type="pmid">15857510</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Robinson1">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robinson</surname>
<given-names>DF</given-names>
</name>
<name>
<surname>Foulds</surname>
<given-names>LR</given-names>
</name>
</person-group>
<year>1981</year>
<article-title>Comparison of Phylogenetic Trees.</article-title>
<source>Mathematical Biosciences</source>
<volume>53</volume>
<fpage>131</fpage>
<lpage>147</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Edgar2">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>MUSCLE: multiple sequence alignment with high accuracy and high throughput.</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>1792</fpage>
<lpage>1797</lpage>
<pub-id pub-id-type="pmid">15034147</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Subramanian1">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Subramanian</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Weyer-Menkhoff</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kaufmann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Morgenstern</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment.</article-title>
<source>BMC Bioinformatics</source>
<volume>6</volume>
<fpage>66</fpage>
</element-citation>
</ref>
<ref id="pone.0034261-Katoh1">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Katoh</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Asimenos</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Toh</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Multiple alignment of DNA sequences with MAFFT.</article-title>
<source>Methods Mol Biol</source>
<volume>537</volume>
<fpage>39</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="pmid">19378139</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Thompson1">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thompson</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>Gibson</surname>
<given-names>TJ</given-names>
</name>
</person-group>
<year>1994</year>
<article-title>CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.</article-title>
<source>Nucleic Acids Res</source>
<volume>22</volume>
<fpage>4673</fpage>
<lpage>4680</lpage>
<pub-id pub-id-type="pmid">7984417</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Notredame1">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Notredame</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>Heringa</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>T-Coffee: A novel method for fast and accurate multiple sequence alignment.</article-title>
<source>J Mol Biol</source>
<volume>302</volume>
<fpage>205</fpage>
<lpage>217</lpage>
<pub-id pub-id-type="pmid">10964570</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Guindon1">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guindon</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lethiec</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Duroux</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gascuel</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>PHYML Online – a web server for fast maximum likelihood-based phylogenetic inference.</article-title>
<source>Nucleic Acids Res</source>
<volume>33</volume>
<fpage>W557</fpage>
<lpage>559</lpage>
<pub-id pub-id-type="pmid">15980534</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Le1">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Le</surname>
<given-names>SQ</given-names>
</name>
<name>
<surname>Gascuel</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>An improved general amino acid replacement matrix.</article-title>
<source>Mol Biol Evol</source>
<volume>25</volume>
<fpage>1307</fpage>
<lpage>1320</lpage>
<pub-id pub-id-type="pmid">18367465</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Stamatakis1">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.</article-title>
<source>Bioinformatics</source>
<volume>22</volume>
<fpage>2688</fpage>
<lpage>2690</lpage>
<pub-id pub-id-type="pmid">16928733</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Zwickl1">
<label>30</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Zwickl</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion.</article-title>
<publisher-name>PhD dissertation, The University of Texas at Austin</publisher-name>
</element-citation>
</ref>
<ref id="pone.0034261-Wilgenbusch1">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wilgenbusch</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Swofford</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Inferring evolutionary trees with PAUP*.</article-title>
<source>Curr Protoc Bioinformatics Chapter 6: Unit 6</source>
<volume>4</volume>
</element-citation>
</ref>
<ref id="pone.0034261-Ronquist1">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ronquist</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Huelsenbeck</surname>
<given-names>JP</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>MrBayes 3: Bayesian phylogenetic inference under mixed models.</article-title>
<source>Bioinformatics</source>
<volume>19</volume>
<fpage>1572</fpage>
<lpage>1574</lpage>
<pub-id pub-id-type="pmid">12912839</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Ulitsky1">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ulitsky</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Burstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tuller</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Chor</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>The average common substring approach to phylogenomic reconstruction.</article-title>
<source>J Comput Biol</source>
<volume>13</volume>
<fpage>336</fpage>
<lpage>350</lpage>
<pub-id pub-id-type="pmid">16597244</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Lempel1">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lempel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ziv</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>1976</year>
<article-title>Complexity of Finite Sequences.</article-title>
<source>IEEE Transactions on Information Theory</source>
<volume>22</volume>
<fpage>75</fpage>
<lpage>81</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Hohl1">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hohl</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ragan</surname>
<given-names>MA</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Is multiple-sequence alignment required for accurate inference of phylogeny?</article-title>
<source>Syst Biol</source>
<volume>56</volume>
<fpage>206</fpage>
<lpage>221</lpage>
<pub-id pub-id-type="pmid">17454975</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Bruno1">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bruno</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Socci</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>AL</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction.</article-title>
<source>Mol Biol Evol</source>
<volume>17</volume>
<fpage>189</fpage>
<lpage>197</lpage>
<pub-id pub-id-type="pmid">10666718</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Desper1">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Desper</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gascuel</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle.</article-title>
<source>J Comput Biol</source>
<volume>9</volume>
<fpage>687</fpage>
<lpage>705</lpage>
<pub-id pub-id-type="pmid">12487758</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Wheeler1">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wheeler</surname>
<given-names>TJ</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Large-Scale Neighbor-Joining with NINJA.</article-title>
<source>Algorithms in Bioinformatics</source>
<volume>5724</volume>
<fpage>375</fpage>
<lpage>389</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Hong2">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chintapalli</surname>
<given-names>SV</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Bhardwaj</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
</person-group>
<year>2011</year>
<article-title>Predicting Protein Folds with Fold-Specific PSSM Libraries.</article-title>
<source>PLoS One</source>
<volume>6</volume>
<fpage>e20557</fpage>
<pub-id pub-id-type="pmid">21698189</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Hong3">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>van Rossum</surname>
<given-names>DB</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.</article-title>
<source>PLoS One</source>
<volume>5</volume>
<fpage>e13596</fpage>
<pub-id pub-id-type="pmid">21042584</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Han1">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Aligo</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Manna</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Belton</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chintapalli</surname>
<given-names>SV</given-names>
</name>
<etal></etal>
</person-group>
<year>2011</year>
<article-title>Conserved GXXXG- and S/T-Like Motifs in the Transmembrane Domains of NS4B Protein Are Required for Hepatitis C Virus Replication.</article-title>
<source>J Virol</source>
<volume>85</volume>
<fpage>6464</fpage>
<lpage>6479</lpage>
<pub-id pub-id-type="pmid">21507970</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Nikolaidis1">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nikolaidis</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Chalkia</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Watkins</surname>
<given-names>DN</given-names>
</name>
<name>
<surname>Barrow</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>SH</given-names>
</name>
<etal></etal>
</person-group>
<year>2007</year>
<article-title>Ancient Origin of the New Developmental Superfamily DANGER.</article-title>
<source>PLoS One</source>
<volume>2</volume>
<fpage>e204</fpage>
</element-citation>
</ref>
<ref id="pone.0034261-vanRossum1">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Rossum</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Cheung</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Barrow</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Syrovatkina</surname>
<given-names>V</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>DANGER: A novel regulatory protein of IP3-receptor activity.</article-title>
<source>J Biol Chem</source>
<volume>281</volume>
<issue>48</issue>
<fpage>37111</fpage>
<lpage>37116</lpage>
<pub-id pub-id-type="pmid">16990268</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Lau1">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lau</surname>
<given-names>GT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>OG</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Kok</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>RL</given-names>
</name>
<etal></etal>
</person-group>
<year>2001</year>
<article-title>Embryonic XMab21l2 expression is required for gastrulation and subsequent neural development.</article-title>
<source>Biochem Biophys Res Commun</source>
<volume>280</volume>
<fpage>1378</fpage>
<lpage>1384</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Kang1">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>BN</given-names>
</name>
<name>
<surname>Ahmad</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Saleem</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Hester</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<year>2010</year>
<article-title>Death-associated protein kinase-mediated cell death modulated by interaction with DANGER.</article-title>
<source>J Neurosci</source>
<volume>30</volume>
<fpage>93</fpage>
<lpage>98</lpage>
<pub-id pub-id-type="pmid">20053891</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-MarchlerBauer1">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marchler-Bauer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Chitsaz</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Derbyshire</surname>
<given-names>MK</given-names>
</name>
<etal></etal>
</person-group>
<year>2011</year>
<article-title>CDD: a Conserved Domain Database for the functional annotation of proteins.</article-title>
<source>Nucleic Acids Res</source>
<volume>39</volume>
<fpage>D225</fpage>
<lpage>229</lpage>
<pub-id pub-id-type="pmid">21109532</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Tamura1">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tamura</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dudley</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Nei</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0.</article-title>
<source>Mol Biol Evol</source>
<volume>24</volume>
<fpage>1596</fpage>
<lpage>1599</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Sun1">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Altintas</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2011</year>
<article-title>Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource.</article-title>
<source>Nucleic Acids Res</source>
<volume>39</volume>
<fpage>D546</fpage>
<lpage>551</lpage>
<pub-id pub-id-type="pmid">21045053</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Watanabe1">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Watanabe</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Vriens</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Prenen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Droogmans</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Voets</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Anandamide and arachidonic acid use epoxyeicosatrienoic acids to activate TRPV4 channels.</article-title>
<source>Nature</source>
<volume>424</volume>
<fpage>434</fpage>
<lpage>438</lpage>
<pub-id pub-id-type="pmid">12879072</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Watanabe2">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Watanabe</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Fujisawa</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Holstein</surname>
<given-names>TW</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Cnidarians and the evolutionary origin of the nervous system.</article-title>
<source>Dev Growth Differ</source>
<volume>51</volume>
<fpage>167</fpage>
<lpage>183</lpage>
<pub-id pub-id-type="pmid">19379274</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Chow1">
<label>51</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chow</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Emmons</surname>
<given-names>SW</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>The mab-21 gene of Caenorhabditis elegans encodes a novel protein required for choice of alternate cell fates.</article-title>
<source>Development</source>
<volume>121</volume>
<fpage>3615</fpage>
<lpage>3626</lpage>
<pub-id pub-id-type="pmid">8582275</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Wong1">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wong</surname>
<given-names>YM</given-names>
</name>
<name>
<surname>Chow</surname>
<given-names>KL</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Expression of zebrafish mab21 genes marks the differentiating eye, midbrain and neural tube.</article-title>
<source>Mech Dev</source>
<volume>113</volume>
<fpage>149</fpage>
<lpage>152</lpage>
</element-citation>
</ref>
<ref id="pone.0034261-Essoussi1">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Essoussi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Boujenfa</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Limam</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>A comparison of MSA tools.</article-title>
<source>Bioinformation</source>
<volume>2</volume>
<fpage>452</fpage>
<lpage>455</lpage>
<pub-id pub-id-type="pmid">18841241</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Loytynoja1">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Loytynoja</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>N</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>An algorithm for progressive multiple alignment of sequences with insertions.</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>102</volume>
<fpage>10557</fpage>
<lpage>10562</lpage>
<pub-id pub-id-type="pmid">16000407</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Liu3">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Warnow</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Holder</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Nelesen</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2012</year>
<article-title>SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.</article-title>
<source>Syst Biol</source>
<volume>61</volume>
<fpage>90</fpage>
<lpage>106</lpage>
<pub-id pub-id-type="pmid">22139466</pub-id>
</element-citation>
</ref>
<ref id="pone.0034261-Eddy1">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eddy</surname>
<given-names>SR</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Profile hidden Markov models.</article-title>
<source>Bioinformatics</source>
<volume>14</volume>
<fpage>755</fpage>
<lpage>763</lpage>
<pub-id pub-id-type="pmid">9918945</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000603 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000603 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3325999
   |texte=   PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22514627" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1 


This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024