Exploration server on relations between France and Australia

Internal identifier: 0001280 (Pmc/Corpus); previous: 0001279; next: 0001281

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Inferring phylogenies of evolving sequences without multiple sequence alignment</title>
<author>
<name sortKey="Chan, Cheong Xin" sort="Chan, Cheong Xin" uniqKey="Chan C" first="Cheong Xin" last="Chan">Cheong Xin Chan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bernard, Guillaume" sort="Bernard, Guillaume" uniqKey="Bernard G" first="Guillaume" last="Bernard">Guillaume Bernard</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Poirion, Olivier" sort="Poirion, Olivier" uniqKey="Poirion O" first="Olivier" last="Poirion">Olivier Poirion</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Current address: Laboratoire Ampère, CNRS UMR 5005, École Centrale de Lyon, France.</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hogan, James M" sort="Hogan, James M" uniqKey="Hogan J" first="James M." last="Hogan">James M. Hogan</name>
<affiliation>
<nlm:aff id="a2">
<institution>School of Electrical Engineering and Computer Science, Queensland University of Technology</institution>
, Brisbane, QLD 4000, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ragan, Mark A" sort="Ragan, Mark A" uniqKey="Ragan M" first="Mark A." last="Ragan">Mark A. Ragan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25266120</idno>
<idno type="pmc">4179140</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4179140</idno>
<idno type="RBID">PMC:4179140</idno>
<idno type="doi">10.1038/srep06504</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000128</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000128</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Inferring phylogenies of evolving sequences without multiple sequence alignment</title>
<author>
<name sortKey="Chan, Cheong Xin" sort="Chan, Cheong Xin" uniqKey="Chan C" first="Cheong Xin" last="Chan">Cheong Xin Chan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bernard, Guillaume" sort="Bernard, Guillaume" uniqKey="Bernard G" first="Guillaume" last="Bernard">Guillaume Bernard</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Poirion, Olivier" sort="Poirion, Olivier" uniqKey="Poirion O" first="Olivier" last="Poirion">Olivier Poirion</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Current address: Laboratoire Ampère, CNRS UMR 5005, École Centrale de Lyon, France.</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hogan, James M" sort="Hogan, James M" uniqKey="Hogan J" first="James M." last="Hogan">James M. Hogan</name>
<affiliation>
<nlm:aff id="a2">
<institution>School of Electrical Engineering and Computer Science, Queensland University of Technology</institution>
, Brisbane, QLD 4000, Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ragan, Mark A" sort="Ragan, Mark A" uniqKey="Ragan M" first="Mark A." last="Ragan">Mark A. Ragan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on
<italic>D
<sub>2</sub>
</italic>
statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach,
<italic>D
<sub>2</sub>
</italic>
methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, R C" uniqKey="Edgar R">R. C. Edgar</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S. Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Notredame, C" uniqKey="Notredame C">C. Notredame</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Darling, A E" uniqKey="Darling A">A. E. Darling</name>
</author>
<author>
<name sortKey="Miklos, I" uniqKey="Miklos I">I. Miklos</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Puigb, P" uniqKey="Puigb P">P. Puigbò</name>
</author>
<author>
<name sortKey="Wolf, Y I" uniqKey="Wolf Y">Y. I. Wolf</name>
</author>
<author>
<name sortKey="Koonin, E V" uniqKey="Koonin E">E. V. Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhaxybayeva, O" uniqKey="Zhaxybayeva O">O. Zhaxybayeva</name>
</author>
<author>
<name sortKey="Doolittle, W F" uniqKey="Doolittle W">W. F. Doolittle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wong, K M" uniqKey="Wong K">K. M. Wong</name>
</author>
<author>
<name sortKey="Suchard, M A" uniqKey="Suchard M">M. A. Suchard</name>
</author>
<author>
<name sortKey="Huelsenbeck, J P" uniqKey="Huelsenbeck J">J. P. Huelsenbeck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, M T" uniqKey="Wu M">M. T. Wu</name>
</author>
<author>
<name sortKey="Chatterji, S" uniqKey="Chatterji S">S. Chatterji</name>
</author>
<author>
<name sortKey="Eisen, J A" uniqKey="Eisen J">J. A. Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hohl, M" uniqKey="Hohl M">M. Höhl</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hohl, M" uniqKey="Hohl M">M. Höhl</name>
</author>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I. Rigoutsos</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Domazet Loso, M" uniqKey="Domazet Loso M">M. Domazet-Lošo</name>
</author>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S. Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J. Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bonham Carter, O" uniqKey="Bonham Carter O">O. Bonham-Carter</name>
</author>
<author>
<name sortKey="Steele, J" uniqKey="Steele J">J. Steele</name>
</author>
<author>
<name sortKey="Bastola, D" uniqKey="Bastola D">D. Bastola</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K. Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Torney, D C" uniqKey="Torney D">D. C. Torney</name>
</author>
<author>
<name sortKey="Burks, C" uniqKey="Burks C">C. Burks</name>
</author>
<author>
<name sortKey="Davison, D" uniqKey="Davison D">D. Davison</name>
</author>
<author>
<name sortKey="Sirotkin, K M" uniqKey="Sirotkin K">K. M. Sirotkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wan, L" uniqKey="Wan L">L. Wan</name>
</author>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G. Reinert</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
<author>
<name sortKey="Waterman, M S" uniqKey="Waterman M">M. S. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G. Reinert</name>
</author>
<author>
<name sortKey="Chew, D" uniqKey="Chew D">D. Chew</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
<author>
<name sortKey="Waterman, M S" uniqKey="Waterman M">M. S. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hide, W" uniqKey="Hide W">W. Hide</name>
</author>
<author>
<name sortKey="Burke, J" uniqKey="Burke J">J. Burke</name>
</author>
<author>
<name sortKey="Davison, D B" uniqKey="Davison D">D. B. Davison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, R T" uniqKey="Miller R">R. T. Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guindon, S" uniqKey="Guindon S">S. Guindon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Price, M N" uniqKey="Price M">M. N. Price</name>
</author>
<author>
<name sortKey="Dehal, P S" uniqKey="Dehal P">P. S. Dehal</name>
</author>
<author>
<name sortKey="Arkin, A P" uniqKey="Arkin A">A. P. Arkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S F" uniqKey="Altschul S">S. F. Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W. Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W. Miller</name>
</author>
<author>
<name sortKey="Myers, E W" uniqKey="Myers E">E. W. Myers</name>
</author>
<author>
<name sortKey="Lipman, D J" uniqKey="Lipman D">D. J. Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goke, J" uniqKey="Goke J">J. Göke</name>
</author>
<author>
<name sortKey="Schulz, M H" uniqKey="Schulz M">M. H. Schulz</name>
</author>
<author>
<name sortKey="Lasserre, J" uniqKey="Lasserre J">J. Lasserre</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M. Vingron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yi, H" uniqKey="Yi H">H. Yi</name>
</author>
<author>
<name sortKey="Jin, L" uniqKey="Jin L">L. Jin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, R C" uniqKey="Edgar R">R. C. Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ronquist, F" uniqKey="Ronquist F">F. Ronquist</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robinson, D F" uniqKey="Robinson D">D. F. Robinson</name>
</author>
<author>
<name sortKey="Foulds, L R" uniqKey="Foulds L">L. R. Foulds</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foret, S" uniqKey="Foret S">S. Forêt</name>
</author>
<author>
<name sortKey="Wilson, S R" uniqKey="Wilson S">S. R. Wilson</name>
</author>
<author>
<name sortKey="Burden, C J" uniqKey="Burden C">C. J. Burden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foret, S" uniqKey="Foret S">S. Forêt</name>
</author>
<author>
<name sortKey="Kantorovitz, M R" uniqKey="Kantorovitz M">M. R. Kantorovitz</name>
</author>
<author>
<name sortKey="Burden, C J" uniqKey="Burden C">C. J. Burden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huffman, D A" uniqKey="Huffman D">D. A. Huffman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fletcher, W" uniqKey="Fletcher W">W. Fletcher</name>
</author>
<author>
<name sortKey="Yang, Z" uniqKey="Yang Z">Z. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lavalette, D" uniqKey="Lavalette D">D. Lavalette</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Popescu, I I" uniqKey="Popescu I">I. I. Popescu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A. Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A. Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Golubchik, T" uniqKey="Golubchik T">T. Golubchik</name>
</author>
<author>
<name sortKey="Wise, M J" uniqKey="Wise M">M. J. Wise</name>
</author>
<author>
<name sortKey="Easteal, S" uniqKey="Easteal S">S. Easteal</name>
</author>
<author>
<name sortKey="Jermiin, L S" uniqKey="Jermiin L">L. S. Jermiin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kingman, J F C" uniqKey="Kingman J">J. F. C. Kingman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tellier, A" uniqKey="Tellier A">A. Tellier</name>
</author>
<author>
<name sortKey="Lemaire, C" uniqKey="Lemaire C">C. Lemaire</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sjodin, P" uniqKey="Sjodin P">P. Sjödin</name>
</author>
<author>
<name sortKey="Kaj, I" uniqKey="Kaj I">I. Kaj</name>
</author>
<author>
<name sortKey="Krone, S" uniqKey="Krone S">S. Krone</name>
</author>
<author>
<name sortKey="Lascoux, M" uniqKey="Lascoux M">M. Lascoux</name>
</author>
<author>
<name sortKey="Nordborg, M" uniqKey="Nordborg M">M. Nordborg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Piel, W H" uniqKey="Piel W">W. H. Piel</name>
</author>
<author>
<name sortKey="Donoghue, M J" uniqKey="Donoghue M">M. J. Donoghue</name>
</author>
<author>
<name sortKey="Sanderson, M J" uniqKey="Sanderson M">M. J. Sanderson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Posada, D" uniqKey="Posada D">D. Posada</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
<author>
<name sortKey="Bernard, G" uniqKey="Bernard G">G. Bernard</name>
</author>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
<author>
<name sortKey="Darling, A E" uniqKey="Darling A">A. E. Darling</name>
</author>
<author>
<name sortKey="Beiko, R G" uniqKey="Beiko R">R. G. Beiko</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Katoh, K" uniqKey="Katoh K">K. Katoh</name>
</author>
<author>
<name sortKey="Standley, D M" uniqKey="Standley D">D. M. Standley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thompson, J D" uniqKey="Thompson J">J. D. Thompson</name>
</author>
<author>
<name sortKey="Linard, B" uniqKey="Linard B">B. Linard</name>
</author>
<author>
<name sortKey="Lecompte, O" uniqKey="Lecompte O">O. Lecompte</name>
</author>
<author>
<name sortKey="Poch, O" uniqKey="Poch O">O. Poch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K. Liu</name>
</author>
<author>
<name sortKey="Linder, C R" uniqKey="Linder C">C. R. Linder</name>
</author>
<author>
<name sortKey="Warnow, T" uniqKey="Warnow T">T. Warnow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gunasinghe, U" uniqKey="Gunasinghe U">U. Gunasinghe</name>
</author>
<author>
<name sortKey="Alahakoon, D" uniqKey="Alahakoon D">D. Alahakoon</name>
</author>
<author>
<name sortKey="Bedingfield, S" uniqKey="Bedingfield S">S. Bedingfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
<author>
<name sortKey="Pfaffelhuber, P" uniqKey="Pfaffelhuber P">P. Pfaffelhuber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fitch, W M" uniqKey="Fitch W">W. M. Fitch</name>
</author>
<author>
<name sortKey="Margoliash, E" uniqKey="Margoliash E">E. Margoliash</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burden, C J" uniqKey="Burden C">C. J. Burden</name>
</author>
<author>
<name sortKey="Kantorovitz, M R" uniqKey="Kantorovitz M">M. R. Kantorovitz</name>
</author>
<author>
<name sortKey="Wilson, S R" uniqKey="Wilson S">S. R. Wilson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, Z" uniqKey="Yang Z">Z. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tavare, S" uniqKey="Tavare S">S. Tavaré</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, Z" uniqKey="Yang Z">Z. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Whelan, S" uniqKey="Whelan S">S. Whelan</name>
</author>
<author>
<name sortKey="Goldman, N" uniqKey="Goldman N">N. Goldman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arenas, M" uniqKey="Arenas M">M. Arenas</name>
</author>
<author>
<name sortKey="Posada, D" uniqKey="Posada D">D. Posada</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sjostrand, J" uniqKey="Sjostrand J">J. Sjöstrand</name>
</author>
<author>
<name sortKey="Arvestad, L" uniqKey="Arvestad L">L. Arvestad</name>
</author>
<author>
<name sortKey="Lagergren, J" uniqKey="Lagergren J">J. Lagergren</name>
</author>
<author>
<name sortKey="Sennblad, B" uniqKey="Sennblad B">B. Sennblad</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Drummond, A J" uniqKey="Drummond A">A. J. Drummond</name>
</author>
<author>
<name sortKey="Ho, S Y" uniqKey="Ho S">S. Y. Ho</name>
</author>
<author>
<name sortKey="Phillips, M J" uniqKey="Phillips M">M. J. Phillips</name>
</author>
<author>
<name sortKey="Rambaut, A" uniqKey="Rambaut A">A. Rambaut</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcdonald, D" uniqKey="Mcdonald D">D. McDonald</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
<author>
<name sortKey="Mahbob, M" uniqKey="Mahbob M">M. Mahbob</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stuart, G W" uniqKey="Stuart G">G. W. Stuart</name>
</author>
<author>
<name sortKey="Moffett, K" uniqKey="Moffett K">K. Moffett</name>
</author>
<author>
<name sortKey="Baker, S" uniqKey="Baker S">S. Baker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kupczok, A" uniqKey="Kupczok A">A. Kupczok</name>
</author>
<author>
<name sortKey="Schmidt, H" uniqKey="Schmidt H">H. Schmidt</name>
</author>
<author>
<name sortKey="Von Haeseler, A" uniqKey="Von Haeseler A">A. von Haeseler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bryant, D" uniqKey="Bryant D">D. Bryant</name>
</author>
<author>
<name sortKey="Steel, M" uniqKey="Steel M">M. Steel</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group>
<journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher>
<publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25266120</article-id>
<article-id pub-id-type="pmc">4179140</article-id>
<article-id pub-id-type="pii">srep06504</article-id>
<article-id pub-id-type="doi">10.1038/srep06504</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Inferring phylogenies of evolving sequences without multiple sequence alignment</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Chan</surname>
<given-names>Cheong Xin</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bernard</surname>
<given-names>Guillaume</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Poirion</surname>
<given-names>Olivier</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hogan</surname>
<given-names>James M.</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ragan</surname>
<given-names>Mark A.</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<aff id="a1">
<label>1</label>
<institution>Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, Brisbane, QLD 4072, Australia</aff>
<aff id="a2">
<label>2</label>
<institution>School of Electrical Engineering and Computer Science, Queensland University of Technology</institution>
, Brisbane, QLD 4000, Australia</aff>
<aff id="a3">
<label>3</label>
Current address: Laboratoire Ampère, CNRS UMR 5005, École Centrale de Lyon, France.</aff>
</contrib-group>
<author-notes>
<corresp id="c1">
<label>a</label>
<email>m.ragan@uq.edu.au</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>09</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<volume>4</volume>
<elocation-id>6504</elocation-id>
<history>
<date date-type="received">
<day>06</day>
<month>03</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>09</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2014, Macmillan Publishers Limited. All rights reserved</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Macmillan Publishers Limited. All rights reserved</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by-nc-sa/4.0/">
<pmc-comment>author-paid</pmc-comment>
<license-p>This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc-sa/4.0/">http://creativecommons.org/licenses/by-nc-sa/4.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract>
<p>Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on
<italic>D
<sub>2</sub>
</italic>
statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach,
<italic>D
<sub>2</sub>
</italic>
methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.</p>
</abstract>
</article-meta>
</front>
<body>
<p>Multiple sequence alignment (MSA) has long been a standard stage in phylogenetic workflows
<xref ref-type="bibr" rid="b1">1</xref>
<xref ref-type="bibr" rid="b2">2</xref>
. In this approach, homologous sequences are first multiply aligned along their full length, yielding positional hypotheses of homology (alignment columns) that are input to maximum parsimony, maximum likelihood (ML) or Bayesian inference, or summarised in a distance matrix and used to compute a tree, e.g. by neighbour-joining (NJ). A key assumption of MSA is that in each such set of sequences, homologous positions occur in the same order relative to one another. This assumption is not fully realistic, as genes and genomes are subject to recombination, rearrangement and lateral genetic transfer
<xref ref-type="bibr" rid="b3">3</xref>
<xref ref-type="bibr" rid="b4">4</xref>
<xref ref-type="bibr" rid="b5">5</xref>
. In sequences so affected, the positional hypothesis of homology generated by MSA will be incomplete or incorrect, diffusing the phylogenetic signal, violating models of the substitution process across sites and branches, and consequently misleading phylogenetic inference
<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
. These issues can only be intensified by the on-going deluge of sequencing data arising from advances in sequencing technologies
<xref ref-type="bibr" rid="b8">8</xref>
.</p>
<p>An alternative to MSA in phylogenetic inference is the so-called
<italic>alignment-free</italic>
approach in which pairwise similarity is computed from sub-sequences, e.g. counts of exact (or inexact) sub-sequences of defined length, or by extension, of conserved sequence patterns
<xref ref-type="bibr" rid="b9">9</xref>
<xref ref-type="bibr" rid="b10">10</xref>
, or alternatively of match lengths
<xref ref-type="bibr" rid="b11">11</xref>
. These sub-sequences are known variously as words,
<italic>k</italic>
-mers or
<italic>n</italic>
-grams
<xref ref-type="bibr" rid="b12">12</xref>
; see refs.
<xref ref-type="bibr" rid="b13">13</xref>
,
<xref ref-type="bibr" rid="b14">14</xref>
,
<xref ref-type="bibr" rid="b15">15</xref>
for recent reviews. A word-count approach for alignment-free sequence comparison uses the
<italic>D</italic>
<sub>2</sub>
statistic
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b16">16</xref>
<xref ref-type="bibr" rid="b17">17</xref>
<xref ref-type="bibr" rid="b18">18</xref>
. A
<italic>D</italic>
<sub>2</sub>
score is calculated based on the exact count of shared
<italic>k</italic>
-mers between any two sequences, thus representing the extent of similarity they share (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
for details). Since the profile of
<italic>k</italic>
-mers depends on the length of the sequence, modifications have been proposed to correct for this bias, e.g. normalising the
<italic>D</italic>
<sub>2</sub>
score by the probability of occurrence for each
<italic>k</italic>
-mer observed in the sequences (
<inline-formula id="m1">
<inline-graphic id="d33e251" xlink:href="srep06504-m1.jpg"></inline-graphic>
</inline-formula>
), or by the mean and variance of
<italic>k</italic>
-mer occurrences (
<inline-formula id="m2">
<inline-graphic id="d33e257" xlink:href="srep06504-m2.jpg"></inline-graphic>
</inline-formula>
)
<xref ref-type="bibr" rid="b17">17</xref>
<xref ref-type="bibr" rid="b18">18</xref>
. These studies have demonstrated that
<inline-formula id="m3">
<inline-graphic id="d33e263" xlink:href="srep06504-m3.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula id="m4">
<inline-graphic id="d33e266" xlink:href="srep06504-m4.jpg"></inline-graphic>
</inline-formula>
have greater statistical power than
<italic>D</italic>
<sub>2</sub>
, and that this power increases with sequence length
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b17">17</xref>
<xref ref-type="bibr" rid="b18">18</xref>
. These statistics can be easily transformed into a pairwise measure of dissimilarity or distance, which can then be used to compute phylogenetic relationships.</p>
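<p>The word-count idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, the score is the basic form (the sum, over shared k-mers, of the product of their counts in the two sequences), and the distance transform shown (negative logarithm of the score normalised by the geometric mean of the two self-scores) is an assumption made for illustration only.</p>
<preformat>
```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(seq_x, seq_y, k):
    """Basic D2: sum over shared k-mers of (count in x) * (count in y)."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


def d2_distance(seq_x, seq_y, k):
    """Assumed transform: -ln of D2 normalised by the geometric mean of self-scores."""
    s_xy = d2_score(seq_x, seq_y, k)
    if s_xy == 0:
        return math.inf  # no shared k-mers, so the logarithm is undefined
    norm = math.sqrt(d2_score(seq_x, seq_x, k) * d2_score(seq_y, seq_y, k))
    return -math.log(s_xy / norm)


print(d2_score("ACGT", "ACGA", 2))  # 2 (the shared 2-mers are AC and CG)
```
</preformat>
<p>Under this assumed transform, identical sequences receive a distance of zero, and the distance grows as the shared k-mer content shrinks relative to the self-scores.</p>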
<p>Alignment-free approaches have been adopted in searches of sequence databases
<xref ref-type="bibr" rid="b19">19</xref>
, clustering of expressed sequence tags
<xref ref-type="bibr" rid="b20">20</xref>
, and more recently in detecting lateral genetic transfer
<xref ref-type="bibr" rid="b11">11</xref>
. By directly computing pairwise dissimilarity or distance using these methods, one can bypass resource-intensive ML or Bayesian approaches in favour of NJ. Some methods implementing approximate ML measures
<xref ref-type="bibr" rid="b21">21</xref>
<xref ref-type="bibr" rid="b22">22</xref>
, although less accurate, are less resource-intensive. However, the sensitivity of alignment-free methods to different evolutionary scenarios, and the scalability of these methods, have not been systematically investigated.</p>
<p>Here, using both simulated and empirical data we assess the accuracy of alignment-free phylogenetic approaches using
<italic>D</italic>
<sub>2</sub>
statistics compared to standard MSA-based approaches. Using sets of simulated nucleotide and amino acid sequences, we systematically examine the accuracy and sensitivity of
<italic>D</italic>
<sub>2</sub>
methods to key molecular evolutionary processes including sequence divergence, among-site rate heterogeneity, biases of G + C content, genetic rearrangements and insertions/deletions, as well as to the technical issue of incomplete sequence data. We demonstrate the scalability and potential of using alignment-free approaches to compute phylogenetic trees quickly and accurately from large-scale DNA or protein data.</p>
<sec disp-level="1" sec-type="results">
<title>Results</title>
<p>For our alignment-free phylogenetic approach, we used
<italic>D</italic>
<sub>2</sub>
statistics (independently for
<italic>D</italic>
<sub>2</sub>
,
<inline-formula id="m5">
<inline-graphic id="d33e313" xlink:href="srep06504-m5.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula id="m6">
<inline-graphic id="d33e316" xlink:href="srep06504-m6.jpg"></inline-graphic>
</inline-formula>
)
<xref ref-type="bibr" rid="b17">17</xref>
<xref ref-type="bibr" rid="b18">18</xref>
to generate a score for each possible pair of sequences within a set. Here we also introduce
<inline-formula id="m7">
<inline-graphic id="d33e322" xlink:href="srep06504-m7.jpg"></inline-graphic>
</inline-formula>
, a
<italic>D</italic>
<sub>2</sub>
statistic that extends each
<italic>k</italic>
-mer recovered in the sequences to its neighbourhood
<italic>n</italic>
, i.e. allows
<italic>n</italic>
number of wildcard residue(s). This simple extension of
<italic>D</italic>
<sub>2</sub>
is analogous to generation of high-scoring words for the query phase of BLAST
<xref ref-type="bibr" rid="b23">23</xref>
, and to a published alignment-free measure of sequence similarity
<xref ref-type="bibr" rid="b24">24</xref>
; a measure of inexact match has recently been extended to a position-specific context
<xref ref-type="bibr" rid="b25">25</xref>
. We denote cases of
<inline-formula id="m8">
<inline-graphic id="d33e352" xlink:href="srep06504-m8.jpg"></inline-graphic>
</inline-formula>
where
<italic>n</italic>
= 1 as
<inline-formula id="m9">
<inline-graphic id="d33e358" xlink:href="srep06504-m9.jpg"></inline-graphic>
</inline-formula>
hereinafter. Each of these metrics is described in the
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
. For each method, we transform the scores
<italic>via</italic>
logarithmic representation of the geometric mean to estimate evolutionary distances (see Methods). Each resulting distance matrix was then used to calculate phylogenetic relationships using NJ. For comparison, for each sequence set we performed MSA using the popular tool, MUSCLE
<xref ref-type="bibr" rid="b26">26</xref>
and inferred a phylogenetic tree using the widely used MrBayes
<xref ref-type="bibr" rid="b27">27</xref>
. We use Robinson-Foulds distances
<xref ref-type="bibr" rid="b28">28</xref>
to evaluate topological congruence between each of the resulting test trees and a reference tree, normalised to adjust for different tree sizes (see Methods for details). We denote
<italic>RF</italic>
as the normalised Robinson-Foulds distance.
<italic>RF</italic>
= 0 indicates that the test tree shows complete topological congruence with the reference, while
<italic>RF</italic>
= 1 indicates that the test tree has no bipartition in common with the reference. The
<italic>RF</italic>
for a test tree generated
<italic>via</italic>
one of the four
<italic>D</italic>
<sub>2</sub>
methods is denoted as
<italic>RF
<sub>D2</sub>
</italic>
,
<italic>RF
<sub>D2S</sub>
</italic>
,
<italic>RF
<sub>D2*</sub>
</italic>
or
<italic>RF
<sub>D2n1</sub>
</italic>
, and the equivalent for a test tree generated
<italic>via</italic>
MSA and MrBayes is denoted as
<italic>RF
<sub>MSA</sub>
</italic>
.</p>
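<p>To make the neighbourhood extension concrete, the sketch below counts pairs of k-mers (one from each sequence) that differ at no more than n positions, which is one plausible reading of allowing n wildcard residues. The formal definition is given in the Supplementary Note and may differ; the function names here are hypothetical, and the brute-force comparison is chosen for clarity rather than speed.</p>
<preformat>
```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(ca != cb for ca, cb in zip(a, b))


def d2_neighbourhood(seq_x, seq_y, k, n=1):
    """Count k-mer pairs, one per sequence, that differ at no more than n positions."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    return sum(
        nx * ny
        for wx, nx in cx.items()
        for wy, ny in cy.items()
        if n >= hamming(wx, wy)
    )
```
</preformat>
<p>With n = 0 this reduces to the exact-match count; with n = 1 it additionally credits word pairs separated by a single substitution.</p>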
<p>Using simulated data, we independently assess the sensitivity of
<italic>D</italic>
<sub>2</sub>
methods to variation in key evolutionary processes: sequence divergence, genetic rearrangement, and insertions/deletions. Because the phylogenetic tree is known for each simulated sequence set, we use that as the reference.</p>
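<p>Comparison of a test tree against a known reference can be illustrated with the following sketch of a normalised Robinson-Foulds distance. It assumes unrooted trees written as nested Python tuples over the same leaf set, and it normalises by the total number of non-trivial bipartitions in the two trees; the normalisation actually used is described in Methods and may differ in detail.</p>
<preformat>
```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) of an unrooted nested-tuple tree."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            # Store both sides together so either orientation compares equal.
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def normalised_rf(test_tree, reference_tree):
    """0 means identical bipartitions; 1 means no bipartition in common."""
    a, b = bipartitions(test_tree), bipartitions(reference_tree)
    if not a and not b:
        return 0.0
    return len(a ^ b) / (len(a) + len(b))
```
</preformat>
<p>For example, two five-taxon trees that share one of their two internal bipartitions obtain a value of 0.5 under this definition.</p>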
<sec disp-level="2">
<title>Sequence divergence</title>
<p>We simulated nucleotide sequence sets of various size categories
<italic>N</italic>
= 8, 32 and 128 (total length,
<italic>L</italic>
= 1500 nt). For each category, six sequence sets were simulated under an unrooted tree topology across distinct situations of relative branch lengths, with
<italic>α</italic>
= 1 in an 8-category discrete gamma distribution. Each of these trees (T1 through T6 in
<xref ref-type="fig" rid="f1">Fig. 1</xref>
; shown for 8-taxon trees) represents a fine-scale scenario of sequence divergence, as determined by different combinations of internal (
<italic>x</italic>
) and terminal (
<italic>y</italic>
) branch lengths. In some simulations, we recognise two subsets of
<italic>y</italic>
(
<italic>y
<sub>1</sub>
</italic>
and
<italic>y
<sub>2</sub>
</italic>
) of different length. Sets containing varied divergence levels had different combinations of
<italic>x</italic>
,
<italic>y
<sub>1</sub>
</italic>
and
<italic>y
<sub>2</sub>
</italic>
as shown in T2, T3, T5 and T6; these are the reference trees for the corresponding sequence sets. For 32- and 128-taxon trees, the topologies were simply expanded for each upper and lower half, as indicated in
<xref ref-type="fig" rid="f1">Fig. 1</xref>
(labels
<italic>p1</italic>
and
<italic>p2</italic>
). For instance in a 128-taxon tree, the relative lengths (
<italic>x</italic>
,
<italic>y
<sub>1</sub>
</italic>
,
<italic>y
<sub>2</sub>
</italic>
) of the first 64 taxa follow pattern
<italic>p1</italic>
, while the others follow
<italic>p2</italic>
. For simplicity,
<italic>x</italic>
and
<italic>y</italic>
(or
<italic>y
<sub>1</sub>
</italic>
and
<italic>y
<sub>2</sub>
</italic>
) were set at either 0.01 or 0.05 (unit in number of substitutions per site). The least-divergent (most-similar) sequence set (T1) was simulated with all branch lengths
<italic>x</italic>
=
<italic>y
<sub>1</sub>
</italic>
=
<italic>y
<sub>2</sub>
</italic>
= 0.01 (two most dissimilar sequences differ at 0.14 substitutions per site at
<italic>N</italic>
= 128), whereas the most-divergent (most-dissimilar) set (T4) had
<italic>x</italic>
=
<italic>y
<sub>1</sub>
</italic>
=
<italic>y
<sub>2</sub>
</italic>
= 0.05 (two most dissimilar sequences differ at 0.70 substitutions per site at
<italic>N</italic>
= 128). The branch lengths in all these trees are short (the two most dissimilar sequences in any set differ at ≤0.70 substitutions per site), so any MSA-based approach should have no problem recovering these phylogenies. However, these datasets provide a testable range of sequence divergence to assess the sensitivity of alignment-free methods in recovering the topologies. For each sequence set, we independently derived pairwise distances using
<italic>D</italic>
<sub>2</sub>
,
<inline-formula id="m10">
<inline-graphic id="d33e566" xlink:href="srep06504-m10.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula id="m11">
<inline-graphic id="d33e570" xlink:href="srep06504-m11.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula id="m12">
<inline-graphic id="d33e573" xlink:href="srep06504-m12.jpg"></inline-graphic>
</inline-formula>
, in each case across different
<italic>k</italic>
-mer lengths (
<italic>k</italic>
= 4, 8, 12, 16, 20 and 24). Each parameter setting was run with 100 replicates, i.e. 100 × 3 size categories × 6 trees × 4 methods of
<italic>D</italic>
<sub>2</sub>
statistics × 6
<italic>k</italic>
-mer lengths (total of 43200 sequence sets). The same experimental design applies to protein sequences with fixed sequence length of 500 amino acids. See Methods for details.</p>
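<p>The size of this experimental design can be checked by enumerating it. The short sketch below is illustrative only; the method labels follow the subscripts used for RF in this section, and each element of the product is one replicate under one combination of settings.</p>
<preformat>
```python
from itertools import product

replicates = range(100)
sizes = (8, 32, 128)
trees = ("T1", "T2", "T3", "T4", "T5", "T6")
methods = ("D2", "D2S", "D2*", "D2n1")
kmer_lengths = (4, 8, 12, 16, 20, 24)

# One entry per replicate and combination of settings.
runs = list(product(replicates, sizes, trees, methods, kmer_lengths))
print(len(runs))  # 43200
```
</preformat>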
<p>To compare the performance between MSA-based and the
<italic>D</italic>
<sub>2</sub>
methods, we denote a relative measure of accuracy
<italic>Q
<sub>DX</sub>
</italic>
=
<italic>RF
<sub>MSA</sub>
</italic>
−
<italic>RF
<sub>DX</sub>
</italic>
, where
<italic>DX</italic>
represents any of the
<italic>D</italic>
<sub>2</sub>
methods, i.e.
<italic>Q
<sub>D2</sub>
</italic>
is the
<italic>Q</italic>
that corresponds to
<italic>RF
<sub>MSA</sub>
</italic>
−
<italic>RF
<sub>D2</sub>
</italic>
, and so forth. Derived from
<italic>RF</italic>
, the
<italic>Q</italic>
values reflect the proportion of bipartitions in a tree, and can be interpreted as the difference between the deviation of each tree from the common reference. The sign of the
<italic>Q</italic>
value indicates which of the two approaches performs better; if a
<italic>D</italic>
<sub>2</sub>
method performs better than MSA in recovering the reference tree then
<italic>Q</italic>
> 0 (i.e.
<italic>RF
<sub>MSA</sub>
</italic>
>
<italic>RF
<sub>DX</sub>
</italic>
), whereas if a
<italic>D</italic>
<sub>2</sub>
method performs worse than MSA then
<italic>Q</italic>
< 0 (i.e.
<italic>RF
<sub>MSA</sub>
</italic>
<
<italic>RF
<sub>DX</sub>
</italic>
). Where
<italic>Q</italic>
= 0 (i.e.
<italic>RF
<sub>MSA</sub>
</italic>
=
<italic>RF
<sub>DX</sub>
</italic>
) the
<italic>D</italic>
<sub>2</sub>
method performs as well as the MSA-based approach, although the trees could still be incongruent with the reference (i.e. their
<italic>RF</italic>
could be non-zero).</p>
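<p>The sign convention for Q can be summarised in a short sketch. The function names and the tolerance parameter are hypothetical, and the numbers in the usage lines are illustrative values rather than results from this study.</p>
<preformat>
```python
def relative_accuracy(rf_msa, rf_dx):
    """Q_DX = RF_MSA - RF_DX; positive when the D2 method is closer to the reference."""
    return rf_msa - rf_dx


def verdict(q, tol=1e-12):
    """Interpret the sign of Q, treating values within tol of zero as a tie."""
    if q > tol:
        return "D2 method better"
    if -q > tol:
        return "MSA better"
    return "tie"


print(verdict(relative_accuracy(0.10, 0.04)))  # D2 method better
print(verdict(relative_accuracy(0.00, 0.06)))  # MSA better
```
</preformat>
<p>A tie says nothing about absolute accuracy: both trees may deviate from the reference by the same amount.</p>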
<p>Across all
<italic>D</italic>
<sub>2</sub>
methods used in this study, we found that
<inline-formula id="m13">
<inline-graphic id="d33e717" xlink:href="srep06504-m13.jpg"></inline-graphic>
</inline-formula>
yielded the smallest
<italic>RF</italic>
across all categories of size and situations of relative branch length, for both nucleotide (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S1</xref>
) and protein (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S2</xref>
) sequence sets.
<xref ref-type="fig" rid="f2">Figure 2a</xref>
shows mean
<italic>RF
<sub>D2n1</sub>
</italic>
at different
<italic>k</italic>
-mer lengths (shown for
<italic>k</italic>
≥ 8) in each size category
<italic>N</italic>
of nucleotide sequence sets, across all trees (T1 through T6;
<xref ref-type="fig" rid="f1">Fig. 1</xref>
), with the corresponding mean
<italic>Q</italic>
value shown in
<xref ref-type="fig" rid="f2">Fig. 2b</xref>
. Across all
<italic>N</italic>
,
<inline-formula id="m14">
<inline-graphic id="d33e760" xlink:href="srep06504-m14.jpg"></inline-graphic>
</inline-formula>
recovered the reference topology almost perfectly for sets of sequences simulated under trees T1, T2, T4 and T6 (at
<italic>k</italic>
= 16, mean
<italic>RF
<sub>D2n1</sub>
</italic>
≤ 0.001 across these sets and all
<italic>N</italic>
;
<xref ref-type="fig" rid="f2">Fig. 2a</xref>
), whereas larger
<italic>RF
<sub>D2n1</sub>
</italic>
distances are observed for cases of T3 and T5 (e.g. for
<italic>N</italic>
= 32 at
<italic>k</italic>
= 16, mean
<italic>RF
<sub>D2n1</sub>
</italic>
= 0.06 and 0.03 respectively for T3 and T5;
<xref ref-type="fig" rid="f2">Fig. 2a</xref>
). The accuracy decreased with increasing
<italic>k</italic>
, e.g. for
<italic>N</italic>
= 128 and T3, mean
<italic>RF
<sub>D2n1</sub>
</italic>
= 0.01, 0.03, 0.06, 0.11, 0.18 at
<italic>k</italic>
= 8, 12, 16, 20 and 24.</p>
<p>While relative performance differed across the simulated scenarios, overall across these sequence sets we find that
<inline-formula id="m15">
<inline-graphic id="d33e814" xlink:href="srep06504-m15.jpg"></inline-graphic>
</inline-formula>
performed as well as the standard MSA-based approach (e.g. for T1 and T2 at
<italic>k</italic>
= 8, mean
<italic>Q
<sub>D2n1</sub>
</italic>
= 0.00 in all cases of
<italic>N</italic>
= 8, 32 and 128;
<xref ref-type="fig" rid="f2">Fig. 2b</xref>
), with the relative performance
<italic>Q</italic>
decreasing slightly with increased
<italic>k</italic>
(e.g. for
<italic>N</italic>
= 32 at T3,
<italic>Q</italic>
= −0.01, −0.03, −0.06, −0.11, −0.17 at
<italic>k</italic>
= 8, 12, 16, 20 and 24). Across all
<italic>N</italic>
examined here,
<inline-formula id="m16">
<inline-graphic id="d33e851" xlink:href="srep06504-m16.jpg"></inline-graphic>
</inline-formula>
performed slightly worse than MSA for T3 and T5, e.g. at
<italic>k</italic>
= 8,
<italic>Q
<sub>D2n1</sub>
</italic>
= −0.01 and −0.02 respectively at
<italic>N</italic>
= 32;
<italic>Q
<sub>D2n1</sub>
</italic>
= −0.02 and −0.07 respectively at
<italic>N</italic>
= 128. The bar plots in
<xref ref-type="fig" rid="f2">Fig. 2a</xref>
almost mirror those in
<xref ref-type="fig" rid="f2">Fig. 2b</xref>
, suggesting that
<italic>RF
<sub>MSA</sub>
</italic>
= 0 in most cases. Both T3 and T5, the cases problematic for
<italic>D</italic>
<sub>2</sub>
methods, have short internal branches (
<italic>x</italic>
) with long terminal branches (
<italic>y</italic>
:
<xref ref-type="fig" rid="f1">Fig. 1</xref>
). Our results suggest that
<italic>D</italic>
<sub>2</sub>
methods are more vulnerable to this situation, while the MSA-based approach performed well across these six cases.
<italic>Q</italic>
values observed for other
<italic>D</italic>
<sub>2</sub>
methods across nucleotide and protein sequence sets are shown in
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S3 and S4</xref>
respectively.</p>
<p>To assess the optimal
<italic>k</italic>
-mer length for use in
<italic>D</italic>
<sub>2</sub>
methods in deducing phylogenetic relationships from nucleotide and protein sequences, we compared
<italic>RF</italic>
values from all
<italic>D</italic>
<sub>2</sub>
methods between the two sequence types across
<italic>N</italic>
= 8, 32 and 128 pooled from all six trees, as shown in
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S5</xref>
. For nucleotide sequences,
<italic>k</italic>
= 8 yielded the lowest
<italic>RF</italic>
distances, with
<italic>RF</italic>
= 0 at
<italic>N</italic>
= 8 and 32, and
<italic>RF</italic>
< 0.002 at
<italic>N</italic>
= 128 across all
<italic>D</italic>
<sub>2</sub>
methods. For protein sequences,
<italic>k</italic>
= 4 is the optimal length across all
<italic>D</italic>
<sub>2</sub>
methods, with
<inline-formula id="m17">
<inline-graphic id="d33e975" xlink:href="srep06504-m17.jpg"></inline-graphic>
</inline-formula>
yielding the smallest RF distances across all size categories, i.e.
<italic>RF
<sub>D2n1</sub>
</italic>
= 0.012, 0.009 and 0.009 at
<italic>N</italic>
= 8, 32 and 128. This result supports the notion that optimal
<italic>k</italic>
is negatively correlated with alphabet size of the sequence data
<xref ref-type="bibr" rid="b9">9</xref>
<xref ref-type="bibr" rid="b29">29</xref>
<xref ref-type="bibr" rid="b30">30</xref>
. Formal proof appears to be lacking, but might be approached analogously to an earlier study
<xref ref-type="bibr" rid="b31">31</xref>
.</p>
<p>Two other scenarios relevant to sequence divergence are among-site rate heterogeneity (the presence of fast-
<italic>versus</italic>
slow-evolving sequence regions), and compositional (G + C content) biases in the sequences. We examined the sensitivity of
<italic>D</italic>
<sub>2</sub>
methods independently to each of these scenarios (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
for details). Overall, among-site rate variation does not appear to drastically affect the accuracy of either
<italic>D</italic>
<sub>2</sub>
or MSA-based approaches (
<italic>Q</italic>
= 0 in most cases at optimal
<italic>k</italic>
in
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S6</xref>
); the
<italic>RF</italic>
values for all analyses of nucleotide and protein sequences are shown respectively in
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S7 and S8</xref>
. Interestingly, we note that high G + C proportion (thus low complexity of sequences) plays to the strength of local exact matches, rather than neighbourhood (non-exact) matches as allowed in
<inline-formula id="m18">
<inline-graphic id="d33e1027" xlink:href="srep06504-m18.jpg"></inline-graphic>
</inline-formula>
(
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S9</xref>
).</p>
</sec>
<sec disp-level="2">
<title>Genetic rearrangement</title>
<p>Here we simulated sequence data to assess the direct impact of genetic rearrangement on the performance of
<italic>D</italic>
<sub>2</sub>
methods in phylogenetic inference. We define
<italic>R</italic>
as the percentage length of a full-length nucleotide sequence that has undergone a non-overlapping rearrangement. We simulated post-hoc rearrangements in half of the sequences in a set of 5000-nt sequences, i.e. at
<italic>N</italic>
= 8, each of any 4 sequences would have
<italic>R</italic>
% of its length rearranged in a non-overlapping manner. Each rearrangement event involves one or more fragments of 250 nt, such that the total rearranged region (i.e.
<italic>R%</italic>
of full length) is no longer contiguous (see Methods).
<xref ref-type="fig" rid="f3">Figure 3a</xref>
shows the average
<italic>RF
<sub>D2n1</sub>
</italic>
for each
<italic>k</italic>
-mer length in nucleotide sequence sets (
<italic>N</italic>
= 8) across
<italic>R</italic>
= 10, 25 and 50%, including
<italic>RF
<sub>MSA</sub>
</italic>
of the MSA-based approach MUSCLE + MrBayes. Across all categories and all
<italic>k</italic>
-mer sizes, all methods, alignment-free or not, yielded average
<italic>RF</italic>
< 0.05 compared to the reference tree.
<inline-formula id="m19">
<inline-graphic id="d33e1086" xlink:href="srep06504-m19.jpg"></inline-graphic>
</inline-formula>
at
<italic>k</italic>
= 8 or 12 perfectly recovered the reference topologies (
<italic>RF
<sub>D2n1</sub>
</italic>
= 0 in both cases) regardless of
<italic>R</italic>
.
<xref ref-type="fig" rid="f3">Figure 3b</xref>
shows the mean
<italic>Q</italic>
values for each of these cases. At
<italic>R</italic>
= 10% and 25%, we observed
<italic>Q</italic>
= 0 for
<italic>k</italic>
= 8 and 12, i.e.
<inline-formula id="m20">
<inline-graphic id="d33e1116" xlink:href="srep06504-m20.jpg"></inline-graphic>
</inline-formula>
performed as well as did the MSA-based approach in recovering the reference topologies. At
<italic>R</italic>
= 50%, the
<italic>D</italic>
<sub>2</sub>
methods yielded higher accuracy than did MUSCLE + MrBayes (
<italic>Q</italic>
> 0 for all
<italic>k</italic>
-mer lengths). Compared to MUSCLE (
<xref ref-type="fig" rid="f3">Fig. 3</xref>
), the use of MAFFT resulted in higher
<italic>RF</italic>
and
<italic>Q</italic>
values (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S10</xref>
), thus lower accuracy (
<italic>p</italic>
< 2.2 × 10
<sup>−16</sup>
; see
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
). Our findings suggest that
<italic>D</italic>
<sub>2</sub>
methods are more robust to the effect of genetic rearrangement than is the standard approach based on MSA.</p>
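<p>One way to see why word-based scores tolerate rearrangement is that relocating whole blocks disturbs only the k-mers that span block boundaries. The sketch below illustrates this with a hypothetical rearrangement scheme (rotating randomly chosen 250-nt blocks among themselves); it is not the simulation procedure described in Methods, and all function names and parameters are assumptions.</p>
<preformat>
```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rearrange(seq, percent, fragment=250, seed=0):
    """Relocate roughly percent% of seq as whole, non-overlapping blocks (hypothetical scheme)."""
    blocks = [seq[i:i + fragment] for i in range(0, len(seq), fragment)]
    n_move = max(2, round(len(seq) * percent / 100 / fragment))
    chosen = random.Random(seed).sample(range(len(blocks)), n_move)
    # Rotate the chosen blocks among themselves so each one lands at a new position.
    out = list(blocks)
    for src, dst in zip(chosen, chosen[1:] + chosen[:1]):
        out[dst] = blocks[src]
    return "".join(out)


def shared_kmer_fraction(original, rearranged, k):
    """Fraction of k-mer occurrences in original that survive in rearranged."""
    co, cr = kmer_counts(original, k), kmer_counts(rearranged, k)
    return sum(min(n, cr[w]) for w, n in co.items()) / sum(co.values())
```
</preformat>
<p>For a 5000-nt sequence with half of its length relocated in this way, at most a few k-mers per junction are lost at k = 8, so the great majority of the k-mer profile is preserved even though the sequence is no longer collinear with its unrearranged relatives.</p>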
</sec>
<sec disp-level="2">
<title>Insertions/deletions</title>
<p>To assess the sensitivity of the alignment-free approach to insertions/deletions (indels) we simulated nucleotide sequence sets (
<italic>N</italic>
= 32) under tree T4 by incorporating indel events at a predefined rate (
<italic>r</italic>
) along the tree branches
<xref ref-type="bibr" rid="b32">32</xref>
, with the inserted/deleted fragment lengths following a Lavalette distribution
<xref ref-type="bibr" rid="b33">33</xref>
<xref ref-type="bibr" rid="b34">34</xref>
(maximum length = 100 nt).
<xref ref-type="fig" rid="f4">Figure 4a</xref>
shows the
<italic>RF</italic>
values obtained using
<inline-formula id="m21">
<inline-graphic id="d33e1184" xlink:href="srep06504-m21.jpg"></inline-graphic>
</inline-formula>
, and using two MUSCLE-based methods (MrBayes and the popular ML method RAxML
<xref ref-type="bibr" rid="b35">35</xref>
<xref ref-type="bibr" rid="b36">36</xref>
) across cases at different values of
<italic>r</italic>
; the corresponding
<italic>Q</italic>
values for each MSA-based approach are shown in
<xref ref-type="fig" rid="f4">Fig. 4b</xref>
. At
<italic>r</italic>
= 0.1, all approaches recovered the reference topology perfectly (
<italic>RF</italic>
= 0 in all cases). As
<italic>r</italic>
increases, observed
<italic>RF</italic>
increases proportionately: for trees generated using
<inline-formula id="m22">
<inline-graphic id="d33e1211" xlink:href="srep06504-m22.jpg"></inline-graphic>
</inline-formula>
at
<italic>r</italic>
= 0.3, 0.4 and 0.5,
<italic>RF</italic>
= 0.001, 0.005 and 0.024. In comparison, the corresponding
<italic>RF</italic>
values for MSA-based methods are higher:
<italic>RF</italic>
= 0.071, 0.370 and 0.597 for MUSCLE + MrBayes and
<italic>RF</italic>
= 0.087, 0.391, and 0.642 for MUSCLE + RAxML. These results suggest that alignment-free methods are more robust to insertions/deletions (
<italic>RF</italic>
< 0.025 at
<italic>r</italic>
= 0.5) than MSA-based approaches (
<italic>RF</italic>
≥ 0.60 at
<italic>r</italic>
= 0.5 in both cases), with all observed
<italic>Q</italic>
≥ 0 (e.g.
<italic>Q</italic>
= 0.07, 0.37 and 0.57 at
<italic>r</italic>
= 0.3, 0.4 and 0.5 for MUSCLE + MrBayes:
<xref ref-type="fig" rid="f4">Fig. 4b</xref>
). Here the use of MAFFT instead of MUSCLE yielded lower
<italic>RF</italic>
and
<italic>Q</italic>
values, i.e. a higher accuracy of phylogenetic inference (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S11</xref>
<italic>versus</italic>
<xref ref-type="fig" rid="f4">Fig. 4</xref>
;
<italic>p</italic>
< 2.2 × 10
<sup>−16</sup>
). These findings are consistent with our analysis of other insertion/deletion scenarios including vertically staggered deletions (
<xref ref-type="supplementary-material" rid="s1">Supplementary Note and Fig. S12</xref>
), a (biologically not very realistic) scenario in which MSA is known to perform poorly
<xref ref-type="bibr" rid="b37">37</xref>
. Independently, we observed that the accuracy of
<italic>D</italic>
<sub>2</sub>
methods decreases with increasing extent of sequence truncation, and increases proportionately with sequence length (
<xref ref-type="supplementary-material" rid="s1">Supplementary Note and Fig. S13</xref>
).</p>
</sec>
<sec disp-level="2">
<title>Gene family evolution based on coalescence</title>
<p>Here we simulated nucleotide sequence sets under the coalescent model of gene family evolution (within a population)
<xref ref-type="bibr" rid="b38">38</xref>
<xref ref-type="bibr" rid="b39">39</xref>
across different fixed effective population sizes
<italic>N
<sub>e</sub>
</italic>
(see Methods). The
<italic>N
<sub>e</sub>
</italic>
parameter affects the overall population structure, thus branching patterns and branch lengths of a tree. Coalescent rate between two lineages is higher within a smaller population
<xref ref-type="bibr" rid="b40">40</xref>
, thus a smaller
<italic>N
<sub>e</sub>
</italic>
yields shorter branch lengths in a tree. All trees are asymmetric, and thus represent a more-realistic biological scenario. We note that the observed performance in this part of our analysis could be affected by one or more scenarios in addition to
<italic>N
<sub>e</sub>
</italic>
(and sequence divergence).
<xref ref-type="fig" rid="f5">Figure 5a</xref>
shows the
<italic>RF</italic>
values obtained using
<inline-formula id="m23">
<inline-graphic id="d33e1327" xlink:href="srep06504-m23.jpg"></inline-graphic>
</inline-formula>
, and by MSA-based approaches using MUSCLE, across cases at varied
<italic>N
<sub>e</sub>
</italic>
; the corresponding
<italic>Q</italic>
values for each MSA-based approach are shown in
<xref ref-type="fig" rid="f5">Fig. 5b</xref>
.
<italic>RF</italic>
> 0 was observed across all cases, suggesting that all approaches on average failed to recover known tree topologies perfectly. Observed
<italic>RF</italic>
values for all approaches increase proportionately with increasing
<italic>N
<sub>e</sub>
</italic>
when
<italic>N
<sub>e</sub>
</italic>
≥ 100000, e.g. for
<inline-formula id="m24">
<inline-graphic id="d33e1358" xlink:href="srep06504-m24.jpg"></inline-graphic>
</inline-formula>
,
<italic>RF</italic>
= 0.072, 0.119, 0.239 and 0.407 at
<italic>N
<sub>e</sub>
</italic>
= 100000, 250000, 500000 and 1000000 (
<xref ref-type="fig" rid="f5">Fig. 5a</xref>
), suggesting an inverse relationship between
<italic>N
<sub>e</sub>
</italic>
and the accuracies of these approaches in recovering the known tree topology. At
<italic>N
<sub>e</sub>
</italic>
= 10000, 100000 and 250000,
<inline-formula id="m25">
<inline-graphic id="d33e1383" xlink:href="srep06504-m25.jpg"></inline-graphic>
</inline-formula>
and the MSA-based approaches yielded almost identical trees (e.g.
<italic>Q</italic>
= −0.007, −0.010, −0.016 against MUSCLE + RAxML;
<xref ref-type="fig" rid="f5">Fig. 5b</xref>
), although
<inline-formula id="m26">
<inline-graphic id="d33e1393" xlink:href="srep06504-m26.jpg"></inline-graphic>
</inline-formula>
yielded less-accurate topologies (
<italic>Q</italic>
< 0). In the extreme cases of
<italic>N
<sub>e</sub>
</italic>
> 250000,
<inline-formula id="m27">
<inline-graphic id="d33e1404" xlink:href="srep06504-m27.jpg"></inline-graphic>
</inline-formula>
performed substantially worse than either of the two MSA-based methods, e.g.
<italic>Q</italic>
= −0.146 and −0.279 for MUSCLE + RAxML (
<xref ref-type="fig" rid="f5">Fig. 5b</xref>
). At the other end of the spectrum, cases of small
<italic>N
<sub>e</sub>
</italic>
= 1000 also negatively impacted the accuracies of all approaches, i.e.
<italic>RF</italic>
= 0.240, 0.230 and 0.213 for
<inline-formula id="m28">
<inline-graphic id="d33e1422" xlink:href="srep06504-m28.jpg"></inline-graphic>
</inline-formula>
, MUSCLE + MrBayes and MUSCLE + RAxML (
<xref ref-type="fig" rid="f5">Fig. 5a</xref>
). Results of the corresponding analysis using MAFFT are shown in
<xref ref-type="supplementary-material" rid="s1">Supplementary Figure S14</xref>
(
<italic>p</italic>
= 0.74; no significant difference). These findings indicate that in these scenarios, the alignment-free approach yields results similar to those of the MSA-based approaches, regardless of which MSA tool is used, when
<italic>N
<sub>e</sub>
</italic>
is reasonably large, but performs substantially worse in extreme cases i.e. when
<italic>N
<sub>e</sub>
</italic>
is very small or very large. This observation is plausibly explained by extreme (high/low) sequence divergence (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S1</xref>
), although we cannot rule out the impact of other evolutionary scenarios. In an independent analysis across datasets that were simulated under non-ultrametric trees (specifically violating the molecular clock) we observed a similar trend (
<italic>RF</italic>
> 0;
<italic>Q</italic>
< 0), with higher
<italic>RF</italic>
observed for
<inline-formula id="m29">
<inline-graphic id="d33e1458" xlink:href="srep06504-m29.jpg"></inline-graphic>
</inline-formula>
than for MSA-based approaches (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S15</xref>
). This complex scenario is more realistic than ultrametric trees, but we cannot distinguish the effect of clock violation from that of other evolutionary processes.</p>
</sec>
<sec disp-level="2">
<title>Analysis of empirical data</title>
<p>To examine the performance of these methods with empirical data, we used 4156 sets of nucleotide sequences and their corresponding phylogenetic trees from TreeBASE (treebase.org)
<xref ref-type="bibr" rid="b41">41</xref>
. These sequence sets and trees were obtained from 2471 studies deposited in TreeBASE as of 27 May 2013 (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Data</xref>
for the complete list). As shown in
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S16</xref>
, the sizes of these sequence sets range between 6 and 2957 sequences (mean 59.41, median 41 sequences), and within-set sequence similarity has a mean of 90.12% (median 92.37%). For each sequence set, we used each of the
<italic>D</italic>
<sub>2</sub>
methods (independently for
<italic>k</italic>
= 6 and 8) to generate a distance matrix, from which we reconstructed an NJ tree. The selection of
<italic>k</italic>
is based on our observation of an optimal length in the analysis of simulated nucleotide sequence sets (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S5</xref>
). Because the true reference tree is unknown for empirical datasets, we cannot readily assess accuracy. Here we compare each of our resulting test trees inferred using the
<italic>D</italic>
<sub>2</sub>
methods against the corresponding tree published (and peer-reviewed) in TreeBASE. Because we cannot assume that published trees perfectly reflect true evolutionary relationships, we intentionally do not interpret
<italic>RF</italic>
as a measure of accuracy here, but instead simply as a measure of (dis)agreement between the trees produced by an alignment-free and an MSA-based approach.</p>
<p>As shown in
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S2</xref>
, the use of
<italic>k</italic>
= 6
<italic>versus</italic>
8 does not impact
<italic>RF</italic>
for any
<italic>D</italic>
<sub>2</sub>
method, with
<inline-formula id="m30">
<inline-graphic id="d33e1520" xlink:href="srep06504-m30.jpg"></inline-graphic>
</inline-formula>
yielding the smallest average
<italic>RF</italic>
(0.438; median 0.409 at
<italic>k</italic>
= 8).
<xref ref-type="fig" rid="f6">Figure 6</xref>
shows the distribution density of
<italic>RF</italic>
as observed for
<inline-formula id="m31">
<inline-graphic id="d33e1535" xlink:href="srep06504-m31.jpg"></inline-graphic>
</inline-formula>
at
<italic>k</italic>
= 8, based on sizes of the sequence sets
<italic>N</italic>
(
<xref ref-type="fig" rid="f6">Fig. 6a</xref>
) and within-set sequence similarity (
<xref ref-type="fig" rid="f6">Fig. 6b</xref>
). See
<xref ref-type="supplementary-material" rid="s1">Supplementary Tables S3 and S4</xref>
respectively for the corresponding values. As shown in
<xref ref-type="fig" rid="f6">Fig. 6a</xref>
and
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S3</xref>
,
<inline-formula id="m32">
<inline-graphic id="d33e1561" xlink:href="srep06504-m32.jpg"></inline-graphic>
</inline-formula>
yielded topologies that are more congruent with those generated using the standard MSA approach for small sequence sets (e.g. mean
<italic>RF</italic>
0.363, median 0.333 at
<italic>N</italic>
≤ 25) than for larger sequence sets of
<italic>N</italic>
> 25 (mean
<italic>RF</italic>
0.661, median 0.635 at
<italic>N</italic>
> 500), and these
<italic>RF</italic>
distances increase proportionately with increasing
<italic>N</italic>
. Interestingly, across different categories of within-set sequence similarity (percent identity;
<italic>ID</italic>
) regardless of
<italic>N</italic>
(
<xref ref-type="fig" rid="f6">Fig. 6b</xref>
), density plots of
<italic>RF</italic>
for cases of
<italic>ID</italic>
> 70% peak at values of
<italic>RF</italic>
between 0.25 and 0.40, with the smallest means observed for highly similar sequence sets (0.424 at
<italic>ID</italic>
between 80% and 90%, median 0.392;
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S4</xref>
).
<italic>RF</italic>
values increase with decreasing
<italic>ID</italic>
, with mean
<italic>RF</italic>
0.533, median 0.528 observed for cases of
<italic>ID</italic>
< 70% (
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S4</xref>
). These findings suggest that the
<italic>D</italic>
<sub>2</sub>
-based approach, across most of these diverse empirical data, yields topologies that are slightly incongruent (
<italic>RF</italic>
< 0.5 in 2809/4156 trees;
<inline-formula id="m33">
<inline-graphic id="d33e1636" xlink:href="srep06504-m33.jpg"></inline-graphic>
</inline-formula>
at
<italic>k</italic>
= 8) to those arising from the standard MSA-based approach, and that it is rare for both approaches to recover the exact same tree topology (
<italic>RF</italic>
= 0 recovered by any
<italic>D</italic>
<sub>2</sub>
-based approach in 106/4156 trees).</p>
</sec>
<sec disp-level="2">
<title>Computational efficiency and scalability</title>
<p>The computational complexity of various
<italic>D</italic>
<sub>2</sub>
methods has been described earlier
<xref ref-type="bibr" rid="b24">24</xref>
(see also
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
).
<xref ref-type="fig" rid="f7">Figure 7a</xref>
shows the computation time required to generate pairwise
<italic>D</italic>
<sub>2</sub>
distance matrices across large empirical sequence sets (
<italic>N</italic>
= 1000, 2000, 3000, 4000 and 5000); for the corresponding numerical values see
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S5</xref>
. These large sequence sets are of 16S ribosomal RNA genes sampled from the GreenGenes database (see Methods). Mean computation time increases with
<italic>N</italic>
, from 49.77 seconds at
<italic>N</italic>
= 1000 to 842.98 seconds at
<italic>N</italic>
= 5000 (17-fold increase). Similarly, memory usage (
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S5</xref>
) increases with
<italic>N</italic>
, from 378.24 MB (
<italic>N</italic>
= 1000) to 2445.31 MB (
<italic>N</italic>
= 5000; approximately 6-fold increase).</p>
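<p>The fold increases quoted above can be checked directly from the two reported endpoints. The sketch below also derives a two-point scaling exponent under the assumption that cost grows as a power of N; that power-law form and the helper name fold_and_exponent are illustrative assumptions, not part of the original analysis.</p>

```python
import math

# Mean values reported in the text (Supplementary Table S5) at N = 1000 and N = 5000.
time_s = {1000: 49.77, 5000: 842.98}
memory_mb = {1000: 378.24, 5000: 2445.31}


def fold_and_exponent(values, n_small=1000, n_large=5000):
    """Return (fold increase, exponent b), assuming cost grows as N**b.

    The exponent is a crude two-point estimate, not a fitted value.
    """
    fold = values[n_large] / values[n_small]
    exponent = math.log(fold) / math.log(n_large / n_small)
    return fold, exponent


time_fold, time_exp = fold_and_exponent(time_s)
mem_fold, mem_exp = fold_and_exponent(memory_mb)
print(f"time:   {time_fold:.1f}-fold, exponent {time_exp:.2f}")
print(f"memory: {mem_fold:.1f}-fold, exponent {mem_exp:.2f}")
```

<p>With only two points per quantity, the exponents are indicative at best; the intermediate values of N in Supplementary Table S5 would be needed for a proper fit.</p>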
<p>Phylogenetic inference involves details not only of software (e.g.
<italic>D</italic>
<sub>2</sub>
and
<italic>neighbor</italic>
in PHYLIP
<italic>versus</italic>
MUSCLE and MrBayes) but also of parameter settings, implementation (e.g. programming language used, and capacity for multi-threading) and hardware (e.g. machine architecture and its efficiency of memory usage). Therefore, comparing computation time and memory usage between the two approaches is not straightforward. For 50 sets of nucleotide sequences (
<italic>N</italic>
= 8;
<italic>L</italic>
= 1500 nt), we observe an average wall time of 1.50, 86.38 and 491.16 seconds for
<italic>D</italic>
<sub>2</sub>
+
<italic>neighbor</italic>
, MUSCLE + RAxML and MUSCLE + MrBayes, respectively (four-threaded runs; see Methods). For the same analysis across protein sequence sets (
<italic>N</italic>
= 8;
<italic>L</italic>
= 500 aa), wall times are respectively 1.82, 255.48 and 3047.14 seconds. For these protein sets, our alignment-free approach is approximately 140-fold and 1670-fold faster than MUSCLE + RAxML and MUSCLE + MrBayes, respectively. These findings suggest that
<italic>D</italic>
<sub>2</sub>
methods are highly scalable for phylogenetic inference of large-scale sequence data.</p>
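<p>The speed-up factors follow from simple ratios of the reported wall times. The short sketch below recomputes them for both data types; the dictionary layout and the helper name speedups are illustrative assumptions, and only the wall times themselves come from the text.</p>

```python
# Average wall times in seconds, as reported in the text (N = 8, four threads).
wall_times = {
    "nucleotide": {"D2+neighbor": 1.50, "MUSCLE+RAxML": 86.38, "MUSCLE+MrBayes": 491.16},
    "protein": {"D2+neighbor": 1.82, "MUSCLE+RAxML": 255.48, "MUSCLE+MrBayes": 3047.14},
}


def speedups(times, baseline="D2+neighbor"):
    """Return how many times longer each pipeline runs than the baseline."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


for data_type, times in wall_times.items():
    for name, factor in speedups(times).items():
        print(f"{data_type}: {name} takes {factor:.0f}x the time of D2+neighbor")
```

<p>For the nucleotide sets the same calculation gives roughly 58-fold and 327-fold, so the advantage is smaller there than for the protein sets but still substantial.</p>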
<p>In an independent experiment on nucleotide sequence sets of
<italic>N</italic>
= 8 (
<xref ref-type="fig" rid="f7">Fig. 7b</xref>
), we found that computation time for
<inline-formula id="m34">
<inline-graphic id="d33e1749" xlink:href="srep06504-m34.jpg"></inline-graphic>
</inline-formula>
(at
<italic>k</italic>
= 8) increases exponentially with increasing neighbourhood
<italic>n</italic>
, from 0.71 seconds at
<italic>n</italic>
= 1 to 124.73 seconds at
<italic>n</italic>
= 5. At greater values of neighbourhood (
<italic>n</italic>
> 2) i.e. when a higher number of wildcards is considered, the accuracy of
<inline-formula id="m35">
<inline-graphic id="d33e1768" xlink:href="srep06504-m35.jpg"></inline-graphic>
</inline-formula>
appears to decrease, more so at larger
<italic>N</italic>
(
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S17</xref>
; shown for
<italic>k</italic>
= 8 across nucleotide sequence sets). However, the interplay among
<italic>n</italic>
,
<italic>k</italic>
and
<italic>N</italic>
remains to be investigated systematically.</p>
</sec>
</sec>
<sec disp-level="1" sec-type="discussion">
<title>Discussion</title>
<p>Alignment-free methods yielded similar if not identical tree topologies to those generated using MSA-based approaches across a wide range of data sizes and scenarios. Our findings demonstrate that the accuracy of alignment-free methods, compared to the current standard based on MSA, is more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but is more sensitive to sequence divergence and the presence of incomplete (truncated) sequence data. The alignment-free methods operated at far greater computation speed (more than 2000 times faster in some cases).</p>
<p>Opposing views have recently been expressed on whether the application of alignment-free methods in phylogenetics reflects a model-free, purely informatic exercise, or alternatively can capture homology signal inherent in evolving sequences
<xref ref-type="bibr" rid="b42">42</xref>
<xref ref-type="bibr" rid="b43">43</xref>
<xref ref-type="bibr" rid="b44">44</xref>
. Our results support the latter view. The alignment-free approach implemented here appears to have no difficulty, at appropriate parameter settings across our simulated datasets, in capturing homology signal and generating topologies that are very similar or identical to those generated by MSA followed by Bayesian inference, arguably the current standard in phylogenetics (see below). The robustness of alignment-free methods to rearrangements and insertions/deletions represents a critical advantage, since these events are common among microbial genomes
<xref ref-type="bibr" rid="b3">3</xref>
and frequently interrupt individual genes
<xref ref-type="bibr" rid="b45">45</xref>
. Our findings support the notion that gappy regions tend to be forced into alignment within an MSA framework and thereby bias subsequent phylogenetic inference
<xref ref-type="bibr" rid="b37">37</xref>
.</p>
<p>Here we used MUSCLE
<xref ref-type="bibr" rid="b26">26</xref>
and MrBayes
<xref ref-type="bibr" rid="b27">27</xref>
as the standard phylogenetic approach in the analysis of simulated data. Another popular MSA tool is MAFFT
<xref ref-type="bibr" rid="b46">46</xref>
; both MUSCLE and MAFFT compare favourably against other MSA tools in a number of benchmark studies
<xref ref-type="bibr" rid="b26">26</xref>
<xref ref-type="bibr" rid="b47">47</xref>
. A comprehensive analysis of performance across different MSA tools is beyond the scope of this study. Across scenarios of random insertions/deletions, we found little difference in our inference between the use of MUSCLE and MAFFT (
<italic>p</italic>
> 0.5;
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S6</xref>
), except under the unrealistic scenarios of vertically staggered deletions (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S12</xref>
;
<italic>p</italic>
< 2.2 × 10
<sup>−16</sup>
) in which MAFFT performed better, lending support to an earlier report
<xref ref-type="bibr" rid="b37">37</xref>
. The use of other programs for MSA and phylogenetic inference, or indeed the use of different parameter settings in these programs (e.g. fewer MCMC generations in MrBayes than the 1.5 million used in this study), would inevitably yield somewhat different results. ML is another popular MSA-based method of phylogenetic inference, which estimates goodness-of-fit of sequence data given an underlying evolutionary (substitution) model. ML methods e.g. RAxML
<xref ref-type="bibr" rid="b35">35</xref>
are time-consuming, and this has prompted the development of faster though less-accurate implementations e.g. PhyML
<xref ref-type="bibr" rid="b21">21</xref>
and/or scalable methods that approximate ML estimates e.g. FastTree
<xref ref-type="bibr" rid="b22">22</xref>
(see ref.
<xref ref-type="bibr" rid="b48">48</xref>
for a comparative analysis). We generated ML trees for a subset of the simulated sequence data using RAxML and found no or little topological difference between these trees and those generated using MrBayes, as shown by the similar trends of
<italic>RF</italic>
and
<italic>Q</italic>
in
<xref ref-type="fig" rid="f4">Figs 4</xref>
and
<xref ref-type="fig" rid="f5">5</xref>
. In fact, RAxML yielded less-accurate topologies than MrBayes in many cases (larger
<italic>RF</italic>
observed for RAxML:
<xref ref-type="fig" rid="f5">Fig. 5</xref>
).</p>
<p>Using extensive simulated data and diverse empirical data (here from the TreeBASE dataset, generated by various programs and phylogenetic inference methods common in the peer-reviewed literature), our results consistently demonstrate the relative accuracy and scalability of alignment-free methods in large-scale phylogenetic inference, regardless of which specific method they were compared against. The empirical datasets used in this study are highly diverse, with various extents of within-set sequence divergence and data sizes. Many of these sequence sets contain partial and/or fragmented sequences (
<xref ref-type="supplementary-material" rid="s1">Supplementary Data</xref>
). As per our analysis of simulated sequence sets, these aspects impact the accuracy of alignment-free methods more than that of the MSA-based approach in recovering accurate phylogenies. In addition, we applied
<italic>k</italic>
= 6 and 8 in our alignment-free approach across these datasets, a decision based on our observation in simulated sets of 1500 nt sequences (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S5</xref>
). In cases where sequences are longer, the representation of distinct
<italic>k</italic>
-mers (at
<italic>k</italic>
= 6 or 8) could be saturated, thus losing the resolution (reducing the distinguishing power of the
<italic>k</italic>
-mers) necessary to accurately infer dissimilarity (
<italic>vis-à-vis</italic>
phylogenetic) relationships among the sequences
<xref ref-type="bibr" rid="b9">9</xref>
<xref ref-type="bibr" rid="b30">30</xref>
. The correlation between sequence length and
<italic>k</italic>
within the context of phylogenetics has been explored to some extent
<xref ref-type="bibr" rid="b30">30</xref>
<xref ref-type="bibr" rid="b49">49</xref>
, e.g. using shortest unique substrings
<xref ref-type="bibr" rid="b50">50</xref>
, but this issue remains to be systematically investigated. In this study we used NJ to infer phylogenetic trees from the distance matrices generated from
<italic>D</italic>
<sub>2</sub>
methods; one can imagine using other distance-based approaches, e.g. a weighted least-squares method such as Fitch-Margoliash
<xref ref-type="bibr" rid="b51">51</xref>
. In small-scale investigations, we find no topological difference across trees generated using NJ or Fitch-Margoliash.</p>
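<p>The saturation argument above can be made concrete with a back-of-the-envelope estimate. The sketch below computes the expected fraction of all possible k-mers that occur at least once in a random sequence; it assumes independent, uniformly distributed residues and treats overlapping positions as independent, both of which are simplifying assumptions rather than properties of real sequences. The function name expected_kmer_coverage is hypothetical.</p>

```python
def expected_kmer_coverage(k, seq_len, alphabet_size=4):
    """Approximate expected fraction of possible k-mers seen at least once.

    Idealised model: uniform, independent residues, and overlapping
    k-mer positions treated as independent draws.
    """
    n_possible = alphabet_size ** k
    n_positions = max(seq_len - k + 1, 0)
    return 1.0 - (1.0 - 1.0 / n_possible) ** n_positions


for k in (6, 8):
    for seq_len in (1500, 15000, 150000):
        coverage = expected_kmer_coverage(k, seq_len)
        print(f"k={k} L={seq_len}: {coverage:.3f} of {4 ** k} possible k-mers expected")
```

<p>Under this model, a fraction close to 1 means that nearly every k-mer is present in every sequence, so presence alone no longer discriminates between sequences and only count differences remain informative.</p>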
<p>Conversion of subsequence similarity (profile) scores into a measure that represents the evolutionary relatedness between two full-length sequences remains an active field of research. Here we simply transformed
<italic>D</italic>
<sub>2</sub>
scores into pairwise distances of sequences using a logarithmic representation of the geometric mean. Other strategies have been proposed to create a more realistic measure of distance or dissimilarity, including the assignment of a
<italic>p</italic>
-value for each pairwise score based on a null distribution (hypothesis) of subsequences as observed across the whole dataset
<xref ref-type="bibr" rid="b29">29</xref>
<xref ref-type="bibr" rid="b52">52</xref>
. Approaches inspired by information retrieval are under consideration.</p>
<p>In general, our results demonstrate the utility and robustness of alignment-free methods across the choice of scoring methods. The non-monotonic relationship between word length and performance, the utility of
<inline-formula id="m36">
<inline-graphic id="d33e1917" xlink:href="srep06504-m36.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula id="m37">
<inline-graphic id="d33e1920" xlink:href="srep06504-m37.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula id="m38">
<inline-graphic id="d33e1923" xlink:href="srep06504-m38.jpg"></inline-graphic>
</inline-formula>
, and the failure of larger mismatch neighbourhoods are broadly consistent with previous reports
<xref ref-type="bibr" rid="b18">18</xref>
<xref ref-type="bibr" rid="b52">52</xref>
. However, simple
<italic>D</italic>
<sub>2</sub>
scoring is known to be dominated by single-sequence noise effects as
<italic>k</italic>
increases
<xref ref-type="bibr" rid="b18">18</xref>
; its good performance here may in part be explained by the normalisation inherent in our distance measure. The one exception to these comments lies in the vulnerability of
<inline-formula id="m39">
<inline-graphic id="d33e1939" xlink:href="srep06504-m39.jpg"></inline-graphic>
</inline-formula>
approaches to heterogeneous variation, an effect especially pronounced for protein sequences (
<xref ref-type="supplementary-material" rid="s1">Supplementary Fig. S6</xref>
), which may arise from the failure of the variance estimate in the denominator.</p>
<p>Crucially, the computational advantages identified above extend to a broad range of scoring methods and distance transformations. The use of a mismatch neighbourhood has potential to add significantly to both the compute and memory requirements of the process, but these demands are modest for
<inline-formula id="m40">
<inline-graphic id="d33e1947" xlink:href="srep06504-m40.jpg"></inline-graphic>
</inline-formula>
and larger neighbourhoods seem not to improve its performance in phylogenetic inference. Alignment-free methods thus offer computational speed many hundreds or thousands of times faster than the comparable MSA-based approaches, with memory requirements in the hundreds of megabytes, well within the capabilities of even portable commodity devices. To the extent that memory is not an issue, alignment-free methods present an attractive, highly scalable alternative to MSA-based methods in large-scale phylogenetic (and phylogenomic) analyses.</p>
</sec>
<sec disp-level="1" sec-type="methods">
<title>Methods</title>
<sec disp-level="2">
<title>Simulated sequence data</title>
<p>For all programs, default settings were used unless otherwise specified. We simulated sets of DNA and protein sequences of different sizes (
<italic>N</italic>
= 8, 16, 32, 128) using
<italic>evolver</italic>
as implemented in PAML 4.5
<xref ref-type="bibr" rid="b53">53</xref>
, unless otherwise specified. We used GTR
<xref ref-type="bibr" rid="b54">54</xref>
(rate parameters
<italic>a</italic>
= 0.987,
<italic>b</italic>
= 0.110,
<italic>c</italic>
= 0.218,
<italic>d</italic>
= 0.243,
<italic>e</italic>
= 0.395)
<xref ref-type="bibr" rid="b55">55</xref>
and WAG
<xref ref-type="bibr" rid="b56">56</xref>
substitution models respectively for simulation of nucleotide and protein sequences. We detail simulation strategy for each evolutionary scenario below.</p>
</sec>
<sec disp-level="2">
<title>Sequence divergence</title>
<p>For each set, sequences of fixed length (
<italic>L</italic>
= 1500 nt for DNA; 500 amino acids for protein) were simulated on an unrooted symmetrical tree on which the lengths of internal (
<italic>x</italic>
) and terminal (
<italic>y</italic>
, or
<italic>y
<sub>1</sub>
</italic>
and
<italic>y
<sub>2</sub>
</italic>
) branches are set separately, at either 0.01 or 0.05 substitutions per site, to represent six distinct scenarios (
<xref ref-type="fig" rid="f1">Fig. 1</xref>
; shown for 8-taxon trees). These sequence sets were simulated under a discrete approximation of the gamma distribution (shape parameter
<italic>α</italic>
= 1.0, 8 categories).</p>
</sec>
<sec disp-level="2">
<title>Genetic rearrangement</title>
<p>For each nucleotide sequence set (
<italic>N</italic>
= 8;
<italic>L</italic>
= 5000 nt), we relocated one or more regions (i.e. individual rearrangement events) of 250 nt within a sequence in a cut-and-paste manner, with no overlaps. We define
<italic>R</italic>
as the total percentage length of
<italic>L</italic>
that has been relocated. We simulated sequence sets with
<italic>R</italic>
= 10, 25 and 50% (each in 50 replicates), such that the total rearranged region is not contiguous. Given the prior expectation that alignment-free methods would be less sensitive to sequence rearrangements, here we simulated sequence sets under tree T3 (
<xref ref-type="fig" rid="f1">Fig. 1</xref>
), one of the more problematic cases for
<italic>D</italic>
<sub>2</sub>
methods (as shown in
<xref ref-type="fig" rid="f2">Fig. 2</xref>
).</p>
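<p>The cut-and-paste procedure described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name rearrange, the alignment of regions to a 250-nt grid and the random choice of paste positions are all assumptions. The sketch does not enforce the requirement that the total rearranged region be non-contiguous, and a region may by chance be pasted back where it came from.</p>

```python
import random


def rearrange(seq, percent, block=250, seed=None):
    """Relocate non-overlapping blocks totalling `percent` % of the sequence length."""
    rng = random.Random(seed)
    n_events = round(len(seq) * percent / 100 / block)
    # Cut the sequence into consecutive blocks so that chosen regions cannot overlap.
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    chosen = set(rng.sample(range(len(blocks)), n_events))
    moved = [blocks[i] for i in sorted(chosen)]
    kept = [b for i, b in enumerate(blocks) if i not in chosen]
    # Paste each excised block back at a random boundary of the growing sequence.
    for piece in moved:
        kept.insert(rng.randint(0, len(kept)), piece)
    return "".join(kept)


source_rng = random.Random(1)
original = "".join(source_rng.choice("ACGT") for _ in range(5000))
for r in (10, 25, 50):
    rearranged = rearrange(original, r, seed=r)
    print(r, len(rearranged), rearranged != original)
```

<p>For L = 5000 nt and 250-nt regions, R = 10, 25 and 50% correspond to 2, 5 and 10 relocation events respectively; sequence length and composition are unchanged by construction.</p>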
</sec>
<sec disp-level="2">
<title>Insertions/deletions</title>
<p>For this analysis, we simulated nucleotide sequence sets of size
<italic>N</italic>
= 32 (
<italic>L</italic>
= 1500 nt) using INDELible
<xref ref-type="bibr" rid="b32">32</xref>
under tree T4 (
<xref ref-type="fig" rid="f1">Fig. 1</xref>
), a discrete approximation of the gamma distribution (
<italic>α</italic>
= 1.0, 8 categories) and GTR model. Indel rates were set at 0.1, 0.2, 0.3, 0.4 and 0.5, with insertion rate = deletion rate; these rates are relative to site substitution rate of 1. Length distribution of inserted/deleted fragments follows a Lavalette distribution
<xref ref-type="bibr" rid="b33">33</xref>
<xref ref-type="bibr" rid="b34">34</xref>
(
<italic>a</italic>
= 1.1; maximum indel size 100 nt) as implemented in INDELible
<xref ref-type="bibr" rid="b32">32</xref>
.</p>
</sec>
<sec disp-level="2">
<title>Coalescent model of gene family evolution</title>
<p>We used NetRecodon
<xref ref-type="bibr" rid="b57">57</xref>
to simulate gene family evolution under the coalescent model along a tree, each case at a defined effective population size (
<italic>N
<sub>e</sub>
</italic>
) of 1000, 10000, 100000, 250000, 500000 and 1000000, with a discrete approximation of the gamma distribution (
<italic>α</italic>
= 0.5, 8 categories), GTR model and mutation rate
<italic>u</italic>
= 10
<sup>−5</sup>
. Sequence sets of size
<italic>N</italic>
= 32 (
<italic>L</italic>
= 1500 nt) were used. Larger
<italic>N
<sub>e</sub>
</italic>
values result in longer branch lengths on a tree (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S1</xref>
). To simulate violation of the molecular clock, relaxed branch lengths were further simulated on these trees using
<italic>BranchRelaxer</italic>
in GenPhyloData
<xref ref-type="bibr" rid="b58">58</xref>
, with substitution rates along branches modelled as independent and identically distributed variables in a log-normal scale (IIDLogNormal model: mean 0.0, variance 1.0)
<xref ref-type="bibr" rid="b59">59</xref>
. Sequences were then simulated using
<italic>evolver</italic>
along these new trees as per above.</p>
</sec>
<sec disp-level="2">
<title>Empirical sequence data</title>
<p>All 2471 nucleotide datasets in NEXUS format were downloaded from TreeBASE (treebase.org as of 27 May 2013)
<xref ref-type="bibr" rid="b41">41</xref>
using a custom script kindly provided by Dr William Piel. For each dataset, one or more nucleotide sequence alignments and their corresponding phylogenetic trees (totalling 4156) were extracted (
<xref ref-type="supplementary-material" rid="s1">Supplementary Data</xref>
). All 406997 unaligned 16S ribosomal RNA gene sequences (sequences_16S_all_gg_2011_1_unaligned.fasta.gz)
<xref ref-type="bibr" rid="b60">60</xref>
were downloaded from the GreenGenes database (secondgenome.com/go/2011-greengenes-taxonomy). To assess scalability of
<italic>D</italic>
<sub>2</sub>
methods on different sizes of sequence sets, we randomly sampled from these 406997 sequences to form sets of size
<italic>N</italic>
= 1000, 2000, 3000, 4000 and 5000, each in 100 replicates. We follow ref.
<xref ref-type="bibr" rid="b61">61</xref>
in defining within-set sequence similarity as the average pairwise similarity between each sequence in a set and the centroid sequence. The centroid sequence of a set is the one that yielded the single highest bit score across all pairwise comparisons within the set using BLAST (
<italic>e</italic>
< 10
<sup>−3</sup>
).</p>
</sec>
<sec disp-level="2">
<title>Alignment-free phylogenetic approach</title>
<p>For each sequence set, we used
<italic>D</italic>
<sub>2</sub>
statistics independently for
<italic>D</italic>
<sub>2</sub>
,
<inline-formula id="m41">
<inline-graphic id="d33e2170" xlink:href="srep06504-m41.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula id="m42">
<inline-graphic id="d33e2173" xlink:href="srep06504-m42.jpg"></inline-graphic>
</inline-formula>
, and
<inline-formula id="m43">
<inline-graphic id="d33e2177" xlink:href="srep06504-m43.jpg"></inline-graphic>
</inline-formula>
to generate a score for each possible pair of sequences within a set (see
<xref ref-type="supplementary-material" rid="s1">Supplementary Note</xref>
for details). These scores were transformed
<italic>via</italic>
logarithmic representation of the geometric mean to generate a distance. The pairwise distance between sequences
<italic>a</italic>
and
<italic>b</italic>
,
<italic>D
<sub>ab</sub>
</italic>
is defined as
<disp-formula id="m44">
<inline-graphic id="d33e2198" xlink:href="srep06504-m44.jpg"></inline-graphic>
</disp-formula>
where
<italic>S
<sub>ab</sub>
</italic>
is the pairwise score between them, and
<italic>S
<sub>aa</sub>
</italic>
and
<italic>S
<sub>bb</sub>
</italic>
are the self-matching scores. These transformed pairwise distances closely approximate the angle-based distances in an earlier alignment-free method for inferring protein phylogenies
<xref ref-type="bibr" rid="b62">62</xref>
. The resulting distance matrix was used to reconstruct a phylogenetic tree using
<italic>neighbor</italic>
in PHYLIP v3.69 (evolution.genetics.washington.edu/phylip). Generation of the distance matrix from any of these
<italic>D</italic>
<sub>2</sub>
methods is implemented in a JAVA program, JIWA, which is freely available at
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.org.au/tools/jiwa/">http://bioinformatics.org.au/tools/jiwa/</ext-link>
.</p>
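<p>The scoring and transformation steps described above can be sketched for the plain D2 statistic as follows. This is an illustration only, not the JIWA implementation. The displayed formula is available here only as an image, so the transform used below, |ln(S_ab / sqrt(S_aa S_bb))|, is an assumed reading of "logarithmic representation of the geometric mean". The function names are hypothetical, and the normalised D2 variants are not covered.</p>

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(counts_a, counts_b):
    """Plain D2: sum over shared k-mers of the product of their counts."""
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def d2_distance(seq_a, seq_b, k=8):
    """Assumed transform: |ln(S_ab / sqrt(S_aa * S_bb))|."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    s_ab = d2_score(ca, cb)
    if s_ab == 0:
        return math.inf  # no shared k-mers: treated here as infinitely distant
    s_aa, s_bb = d2_score(ca, ca), d2_score(cb, cb)
    return abs(math.log(s_ab / math.sqrt(s_aa * s_bb)))


print(d2_distance("ACGTACGT", "ACGTTTTT", k=3))
print(d2_distance("ACGTACGT", "ACGTACGT", k=3))
```

<p>Dividing by the geometric mean of the two self-matching scores makes the distance zero for identical sequences and symmetric in its two arguments. The resulting matrix could then be written in PHYLIP format for <italic>neighbor</italic>.</p>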
</sec>
<sec disp-level="2">
<title>Standard phylogenetic approach using multiple sequence alignment</title>
<p>For each sequence set, we used MUSCLE v3.8.31
<xref ref-type="bibr" rid="b26">26</xref>
to generate a multiple sequence alignment. For scenarios of genetic rearrangement, insertions/deletions and the coalescent model, we also used MAFFT (mafft-linsi) v7.158b
<xref ref-type="bibr" rid="b46">46</xref>
. For the other simulated scenarios, the true alignments were known from the simulation itself, so the use of any MSA tool would not change the final alignments. For Bayesian phylogenetic inference, we used MrBayes v3.2.1
<xref ref-type="bibr" rid="b27">27</xref>
(MCMC ngen = 1500000 generations, samplefreq = 100, burn-in = 10000 samples, temp = 0.5, nchains = 4; sumt contype = allcompat). We assumed the general reversible substitution model (lset Nucmodel = 4by4 Nst = 6) and a mixed amino acid substitution model (prset aamodel = mixed) respectively for nucleotide and protein sequences, under a four-category discrete gamma distribution across all runs (lset rate = gamma ngammacat = 4). In all cases except the insertions/deletions analysis, the standard deviation of split frequencies was <0.01 after 200000 generations. For the insertions/deletions analysis, MrBayes was run with a larger number of MCMC generations (ngen = 5000000) and a larger burn-in (samplefreq = 100, burn-in = 25000 samples), while other parameters remained the same. The standard deviation of split frequencies in most cases was <0.01 after 1000000 generations. For maximum likelihood inference of phylogenetic trees, we used RAxML v8.0.2
<xref ref-type="bibr" rid="b36">36</xref>
(-# 100, -t 4, -m GTRGAMMA or PROTGAMMAWAG respectively for nucleotide and protein sequences).</p>
</sec>
<sec disp-level="2">
<title>Assessment of accuracy</title>
<p>For each tree generated from a sequence set using
<italic>D</italic>
<sub>2</sub>
statistics or the standard approach, we compared its topological congruence to a reference tree using the Robinson-Foulds distance
<xref ref-type="bibr" rid="b28">28</xref>
, as implemented in
<italic>treedist</italic>
in PHYLIP v3.69 (evolution.genetics.washington.edu/phylip). This distance represents the number of splits (i.e. bipartitions) that are present in only one of the two trees. To facilitate comparison of our results across trees (i.e. sequence sets) of various sizes
<italic>N</italic>
, we normalised the distances by the maximum possible distance between two unrooted trees, 2(
<italic>N</italic>
− 3), following ref.
<xref ref-type="bibr" rid="b63">63</xref>
. Here we denote
<italic>RF</italic>
as the normalised Robinson-Foulds distance, with a value between 0 and 1 that can be interpreted as the proportion of false or missing bipartitions in the test tree topology compared to the reference topology
<xref ref-type="bibr" rid="b63">63</xref>
. When
<italic>RF</italic>
= 0, the test and reference topologies are identical, suggesting high accuracy of the approach. When
<italic>RF</italic>
= 1, none of the bipartitions in the reference is recovered in the test. In these cases, the trees could have been generated at random, as a pair of randomly generated tree topologies of
<italic>N</italic>
taxa has a Robinson-Foulds distance that approximates the denominator for normalisation, 2(
<italic>N</italic>
− 3)
<xref ref-type="bibr" rid="b64">64</xref>
. For the simulated data, we used the known tree (under which the sequences were simulated) as the reference. For empirical data from TreeBASE we used the published tree in the database as reference; in these cases, a zero
<italic>RF</italic>
does not relate directly to accuracy, but rather reflects the extent to which our method recovers the same topology as the published method based on multiple sequence alignment.</p>
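<p>The normalisation described above can be sketched as follows. This is not the <italic>treedist</italic> implementation: the function name normalised_rf and the representation of each tree as a set of bipartitions are assumptions for illustration. Each bipartition is stored as the set of taxa on the side that excludes a fixed anchor taxon, and both trees are assumed to be fully resolved and to share the same N taxa.</p>

```python
def normalised_rf(splits_test, splits_ref, n_taxa):
    """Robinson-Foulds distance divided by its maximum, 2(N - 3).

    Each tree is a set of non-trivial bipartitions; a bipartition is a
    frozenset of the taxa on the side that excludes a fixed anchor taxon.
    """
    if not n_taxa > 3:
        raise ValueError("need at least four taxa")
    unshared = len(splits_test ^ splits_ref)  # splits present in only one tree
    return unshared / (2 * (n_taxa - 3))


# Five taxa A-E; each split lists the side that excludes the anchor taxon A.
tree_1 = {frozenset("CDE"), frozenset("DE")}  # ((A,B),C,(D,E))
tree_2 = {frozenset("BDE"), frozenset("DE")}  # ((A,C),B,(D,E))
tree_3 = {frozenset("BCE"), frozenset("BE")}  # ((A,D),C,(B,E))

print(normalised_rf(tree_1, tree_1, 5))  # 0.0: identical topologies
print(normalised_rf(tree_1, tree_2, 5))  # 0.5: one of two splits shared
print(normalised_rf(tree_1, tree_3, 5))  # 1.0: no splits shared
```

<p>Storing each split from the side that excludes a fixed taxon gives every bipartition a single canonical form, so the symmetric set difference counts exactly the splits found in only one of the two trees.</p>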
</sec>
<sec disp-level="2">
<title>Assessment of computational scalability and runtime</title>
<p>The assessment of computational scalability was carried out using a high-performance distributed-memory computing cluster based on Intel Sandy Bridge 8-core 2.6 GHz processors. Comparative runtime analysis of alignment-free and MSA-based phylogenetic approaches was done on Intel Xeon L5520 8-core 2.26 GHz processors (multi-threaded, four threads). MCMC ngen = 1500000 was used for MrBayes runs.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Author Contributions</title>
<p>C.X.C., J.M.H. and M.A.R. conceived the project. C.X.C., G.B. and M.A.R. designed the experiments; C.X.C., G.B. and O.P. implemented the analysis workflow and conducted the experiments; C.X.C., G.B., J.M.H. and M.A.R. analysed and interpreted the results; C.X.C. prepared all figures and tables; C.X.C. and M.A.R. prepared and wrote the manuscript. All authors reviewed, commented on and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material" id="s1">
<title>Supplementary Material</title>
<supplementary-material id="d33e77" content-type="local-data">
<caption>
<title>Supplementary Information</title>
<p>Supplementary Information</p>
</caption>
<media xlink:href="srep06504-s1.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>We thank The University of Queensland and James S McDonnell Foundation for financial support. CXC is supported by a University of Queensland Early Career Researcher grant. This work was supported by computational resources of the National Computational Infrastructure (NCI) National Facility systems and NCI Specialised Facility in Bioinformatics through the NCI Merit Allocation Scheme (Project d85). We thank Professor Michael Waterman for helpful suggestions, and Dr Lars Jermiin and Professor David Penny for their constructive feedback on this work.</p>
</ack>
<ref-list>
<ref id="b1">
<mixed-citation publication-type="journal">
<name>
<surname>Edgar</surname>
<given-names>R. C.</given-names>
</name>
&
<name>
<surname>Batzoglou</surname>
<given-names>S.</given-names>
</name>
<article-title>Multiple sequence alignment</article-title>
.
<source>Curr. Opin. Struct. Biol.</source>
<volume>16</volume>
,
<fpage>368</fpage>
<lpage>373</lpage>
(
<year>2006</year>
).
<pub-id pub-id-type="pmid">16679011</pub-id>
</mixed-citation>
</ref>
<ref id="b2">
<mixed-citation publication-type="journal">
<name>
<surname>Notredame</surname>
<given-names>C.</given-names>
</name>
<article-title>Recent evolutions of multiple sequence alignment algorithms</article-title>
.
<source>PLoS Comput. Biol.</source>
<volume>3</volume>
,
<fpage>1405</fpage>
<lpage>1408</lpage>
(
<year>2007</year>
).</mixed-citation>
</ref>
<ref id="b3">
<mixed-citation publication-type="journal">
<name>
<surname>Darling</surname>
<given-names>A. E.</given-names>
</name>
,
<name>
<surname>Miklos</surname>
<given-names>I.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Dynamics of genome rearrangement in bacterial populations</article-title>
.
<source>PLoS Genet.</source>
<volume>4</volume>
,
<fpage>e1000128</fpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18650965</pub-id>
</mixed-citation>
</ref>
<ref id="b4">
<mixed-citation publication-type="journal">
<name>
<surname>Puigbò</surname>
<given-names>P.</given-names>
</name>
,
<name>
<surname>Wolf</surname>
<given-names>Y. I.</given-names>
</name>
&
<name>
<surname>Koonin</surname>
<given-names>E. V.</given-names>
</name>
<article-title>The tree and net components of prokaryote evolution</article-title>
.
<source>Genome Biol. Evol.</source>
<volume>2</volume>
,
<fpage>745</fpage>
<lpage>756</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">20889655</pub-id>
</mixed-citation>
</ref>
<ref id="b5">
<mixed-citation publication-type="journal">
<name>
<surname>Zhaxybayeva</surname>
<given-names>O.</given-names>
</name>
&
<name>
<surname>Doolittle</surname>
<given-names>W. F.</given-names>
</name>
<article-title>Lateral gene transfer</article-title>
.
<source>Curr. Biol.</source>
<volume>21</volume>
,
<fpage>R242</fpage>
<lpage>246</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21481756</pub-id>
</mixed-citation>
</ref>
<ref id="b6">
<mixed-citation publication-type="journal">
<name>
<surname>Wong</surname>
<given-names>K. M.</given-names>
</name>
,
<name>
<surname>Suchard</surname>
<given-names>M. A.</given-names>
</name>
&
<name>
<surname>Huelsenbeck</surname>
<given-names>J. P.</given-names>
</name>
<article-title>Alignment uncertainty and genomic analysis</article-title>
.
<source>Science</source>
<volume>319</volume>
,
<fpage>473</fpage>
<lpage>476</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18218900</pub-id>
</mixed-citation>
</ref>
<ref id="b7">
<mixed-citation publication-type="journal">
<name>
<surname>Wu</surname>
<given-names>M. T.</given-names>
</name>
,
<name>
<surname>Chatterji</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Eisen</surname>
<given-names>J. A.</given-names>
</name>
<article-title>Accounting for alignment uncertainty in phylogenomics</article-title>
.
<source>PLoS ONE</source>
<volume>7</volume>
,
<fpage>e30288</fpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22272325</pub-id>
</mixed-citation>
</ref>
<ref id="b8">
<mixed-citation publication-type="journal">
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Next-generation phylogenomics</article-title>
.
<source>Biol. Direct</source>
<volume>8</volume>
,
<fpage>3</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23339707</pub-id>
</mixed-citation>
</ref>
<ref id="b9">
<mixed-citation publication-type="journal">
<name>
<surname>Höhl</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Is multiple-sequence alignment required for accurate inference of phylogeny?</article-title>
<source>Syst. Biol.</source>
<volume>56</volume>
,
<fpage>206</fpage>
<lpage>221</lpage>
(
<year>2007</year>
).
<pub-id pub-id-type="pmid">17454975</pub-id>
</mixed-citation>
</ref>
<ref id="b10">
<mixed-citation publication-type="journal">
<name>
<surname>Höhl</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Rigoutsos</surname>
<given-names>I.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Pattern-based phylogenetic distance estimation and tree reconstruction</article-title>
.
<source>Evol. Bioinform. Online</source>
<volume>2</volume>
,
<fpage>359</fpage>
<lpage>375</lpage>
(
<year>2006</year>
).</mixed-citation>
</ref>
<ref id="b11">
<mixed-citation publication-type="journal">
<name>
<surname>Domazet-Lošo</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
<article-title>Alignment-free detection of local similarity among viral and bacterial genomes</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
,
<fpage>1466</fpage>
<lpage>1472</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21471011</pub-id>
</mixed-citation>
</ref>
<ref id="b12">
<mixed-citation publication-type="journal">
<name>
<surname>Vinga</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Almeida</surname>
<given-names>J.</given-names>
</name>
<article-title>Alignment-free sequence comparison - a review</article-title>
.
<source>Bioinformatics</source>
<volume>19</volume>
,
<fpage>513</fpage>
<lpage>523</lpage>
(
<year>2003</year>
).
<pub-id pub-id-type="pmid">12611807</pub-id>
</mixed-citation>
</ref>
<ref id="b13">
<mixed-citation publication-type="journal">
<name>
<surname>Bonham-Carter</surname>
<given-names>O.</given-names>
</name>
,
<name>
<surname>Steele</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Bastola</surname>
<given-names>D.</given-names>
</name>
<article-title>Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis</article-title>
.
<source>Brief. Bioinform.</source>
, In Press, 10.1093/bib/bbt052 (
<year>2013</year>
).</mixed-citation>
</ref>
<ref id="b14">
<mixed-citation publication-type="journal">
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
<article-title>Alignment-free phylogenetics and population genetics</article-title>
.
<source>Brief. Bioinform.</source>
<volume>15</volume>
,
<fpage>407</fpage>
<lpage>418</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24291823</pub-id>
</mixed-citation>
</ref>
<ref id="b15">
<mixed-citation publication-type="journal">
<name>
<surname>Song</surname>
<given-names>K.</given-names>
</name>
<italic>et al.</italic>
<article-title>New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing</article-title>
.
<source>Brief. Bioinform.</source>
<volume>15</volume>
,
<fpage>343</fpage>
<lpage>353</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24064230</pub-id>
</mixed-citation>
</ref>
<ref id="b16">
<mixed-citation publication-type="book">
<name>
<surname>Torney</surname>
<given-names>D. C.</given-names>
</name>
,
<name>
<surname>Burks</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Davison</surname>
<given-names>D.</given-names>
</name>
&
<name>
<surname>Sirotkin</surname>
<given-names>K. M.</given-names>
</name>
in
<source>Computers and DNA - Santa Fe Institute Studies in the Sciences of Complexity</source>
,
<volume>
<italic>Vol. 7</italic>
</volume>
(eds. Bell, G. & Marr, R.)
<fpage>109</fpage>
<lpage>125</lpage>
(Addison-Wesley, Reading, MA;
<year>1990</year>
).</mixed-citation>
</ref>
<ref id="b17">
<mixed-citation publication-type="journal">
<name>
<surname>Wan</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Reinert</surname>
<given-names>G.</given-names>
</name>
,
<name>
<surname>Sun</surname>
<given-names>F.</given-names>
</name>
&
<name>
<surname>Waterman</surname>
<given-names>M. S.</given-names>
</name>
<article-title>Alignment-free sequence comparison (II): theoretical power of comparison statistics</article-title>
.
<source>J. Comput. Biol.</source>
<volume>17</volume>
,
<fpage>1467</fpage>
<lpage>1490</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">20973742</pub-id>
</mixed-citation>
</ref>
<ref id="b18">
<mixed-citation publication-type="journal">
<name>
<surname>Reinert</surname>
<given-names>G.</given-names>
</name>
,
<name>
<surname>Chew</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Sun</surname>
<given-names>F.</given-names>
</name>
&
<name>
<surname>Waterman</surname>
<given-names>M. S.</given-names>
</name>
<article-title>Alignment-free sequence comparison (I): statistics and power</article-title>
.
<source>J. Comput. Biol.</source>
<volume>16</volume>
,
<fpage>1615</fpage>
<lpage>1634</lpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">20001252</pub-id>
</mixed-citation>
</ref>
<ref id="b19">
<mixed-citation publication-type="journal">
<name>
<surname>Hide</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Burke</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Davison</surname>
<given-names>D. B.</given-names>
</name>
<article-title>Biological evaluation of d
<sup>2</sup>
, an algorithm for high-performance sequence comparison</article-title>
.
<source>J. Comput. Biol.</source>
<volume>1</volume>
,
<fpage>199</fpage>
<lpage>215</lpage>
(
<year>1994</year>
).
<pub-id pub-id-type="pmid">8790465</pub-id>
</mixed-citation>
</ref>
<ref id="b20">
<mixed-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>R. T.</given-names>
</name>
<italic>et al.</italic>
<article-title>A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base</article-title>
.
<source>Genome Res.</source>
<volume>9</volume>
,
<fpage>1143</fpage>
<lpage>1155</lpage>
(
<year>1999</year>
).
<pub-id pub-id-type="pmid">10568754</pub-id>
</mixed-citation>
</ref>
<ref id="b21">
<mixed-citation publication-type="journal">
<name>
<surname>Guindon</surname>
<given-names>S.</given-names>
</name>
<italic>et al.</italic>
<article-title>New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0</article-title>
.
<source>Syst. Biol.</source>
<volume>59</volume>
,
<fpage>307</fpage>
<lpage>321</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">20525638</pub-id>
</mixed-citation>
</ref>
<ref id="b22">
<mixed-citation publication-type="journal">
<name>
<surname>Price</surname>
<given-names>M. N.</given-names>
</name>
,
<name>
<surname>Dehal</surname>
<given-names>P. S.</given-names>
</name>
&
<name>
<surname>Arkin</surname>
<given-names>A. P.</given-names>
</name>
<article-title>FastTree 2 – approximately maximum-likelihood trees for large alignments</article-title>
.
<source>PLoS ONE</source>
<volume>5</volume>
,
<fpage>e9490</fpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">20224823</pub-id>
</mixed-citation>
</ref>
<ref id="b23">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>S. F.</given-names>
</name>
,
<name>
<surname>Gish</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Miller</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Myers</surname>
<given-names>E. W.</given-names>
</name>
&
<name>
<surname>Lipman</surname>
<given-names>D. J.</given-names>
</name>
<article-title>Basic local alignment search tool</article-title>
.
<source>J. Mol. Biol.</source>
<volume>215</volume>
,
<fpage>403</fpage>
<lpage>410</lpage>
(
<year>1990</year>
).
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="b24">
<mixed-citation publication-type="journal">
<name>
<surname>Göke</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Schulz</surname>
<given-names>M. H.</given-names>
</name>
,
<name>
<surname>Lasserre</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Vingron</surname>
<given-names>M.</given-names>
</name>
<article-title>Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts</article-title>
.
<source>Bioinformatics</source>
<volume>28</volume>
,
<fpage>656</fpage>
<lpage>663</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22247280</pub-id>
</mixed-citation>
</ref>
<ref id="b25">
<mixed-citation publication-type="journal">
<name>
<surname>Yi</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>Jin</surname>
<given-names>L.</given-names>
</name>
<article-title>Co-phylog: an assembly-free phylogenomic approach for closely related organisms</article-title>
.
<source>Nucleic Acids Res.</source>
<volume>41</volume>
,
<fpage>e75</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23335788</pub-id>
</mixed-citation>
</ref>
<ref id="b26">
<mixed-citation publication-type="journal">
<name>
<surname>Edgar</surname>
<given-names>R. C.</given-names>
</name>
<article-title>MUSCLE: multiple sequence alignment with high accuracy and high throughput</article-title>
.
<source>Nucleic Acids Res.</source>
<volume>32</volume>
,
<fpage>1792</fpage>
<lpage>1797</lpage>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">15034147</pub-id>
</mixed-citation>
</ref>
<ref id="b27">
<mixed-citation publication-type="journal">
<name>
<surname>Ronquist</surname>
<given-names>F.</given-names>
</name>
<italic>et al.</italic>
<article-title>MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space</article-title>
.
<source>Syst. Biol.</source>
<volume>61</volume>
,
<fpage>539</fpage>
<lpage>542</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22357727</pub-id>
</mixed-citation>
</ref>
<ref id="b28">
<mixed-citation publication-type="journal">
<name>
<surname>Robinson</surname>
<given-names>D. F.</given-names>
</name>
&
<name>
<surname>Foulds</surname>
<given-names>L. R.</given-names>
</name>
<article-title>Comparison of phylogenetic trees</article-title>
.
<source>Math. Biosci.</source>
<volume>53</volume>
,
<fpage>131</fpage>
<lpage>147</lpage>
(
<year>1981</year>
).</mixed-citation>
</ref>
<ref id="b29">
<mixed-citation publication-type="journal">
<name>
<surname>Forêt</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Wilson</surname>
<given-names>S. R.</given-names>
</name>
&
<name>
<surname>Burden</surname>
<given-names>C. J.</given-names>
</name>
<article-title>Empirical distribution of
<italic>k</italic>
-word matches in biological sequences</article-title>
.
<source>Pattern Recognit.</source>
<volume>42</volume>
,
<fpage>539</fpage>
<lpage>548</lpage>
(
<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b30">
<mixed-citation publication-type="journal">
<name>
<surname>Forêt</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Kantorovitz</surname>
<given-names>M. R.</given-names>
</name>
&
<name>
<surname>Burden</surname>
<given-names>C. J.</given-names>
</name>
<article-title>Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences</article-title>
.
<source>BMC Bioinformatics</source>
<volume>7 Suppl 5</volume>
,
<fpage>S21</fpage>
(
<year>2006</year>
).</mixed-citation>
</ref>
<ref id="b31">
<mixed-citation publication-type="journal">
<name>
<surname>Huffman</surname>
<given-names>D. A.</given-names>
</name>
<article-title>A method for the construction of minimum-redundancy codes</article-title>
.
<source>Proc. IRE</source>
<volume>40</volume>
,
<fpage>1098</fpage>
<lpage>1101</lpage>
(
<year>1952</year>
).</mixed-citation>
</ref>
<ref id="b32">
<mixed-citation publication-type="journal">
<name>
<surname>Fletcher</surname>
<given-names>W.</given-names>
</name>
&
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<article-title>INDELible: a flexible simulator of biological sequence evolution</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>26</volume>
,
<fpage>1879</fpage>
<lpage>1888</lpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19423664</pub-id>
</mixed-citation>
</ref>
<ref id="b33">
<mixed-citation publication-type="other">
<name>
<surname>Lavalette</surname>
<given-names>D.</given-names>
</name>
<article-title>Facteur d'impact: impartialité ou impuissance?</article-title>
(INSERM U350 Institut Curie-Recherche, Bât. 112, Centre Universitaire, Orsay, France;
<year>1996</year>
).</mixed-citation>
</ref>
<ref id="b34">
<mixed-citation publication-type="journal">
<name>
<surname>Popescu</surname>
<given-names>I. I.</given-names>
</name>
<article-title>On a Zipf's Law extension to impact factors</article-title>
.
<source>Glottometrics</source>
<volume>6</volume>
,
<fpage>83</fpage>
<lpage>93</lpage>
(
<year>2003</year>
).</mixed-citation>
</ref>
<ref id="b35">
<mixed-citation publication-type="journal">
<name>
<surname>Stamatakis</surname>
<given-names>A.</given-names>
</name>
<article-title>RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models</article-title>
.
<source>Bioinformatics</source>
<volume>22</volume>
,
<fpage>2688</fpage>
<lpage>2690</lpage>
(
<year>2006</year>
).
<pub-id pub-id-type="pmid">16928733</pub-id>
</mixed-citation>
</ref>
<ref id="b36">
<mixed-citation publication-type="journal">
<name>
<surname>Stamatakis</surname>
<given-names>A.</given-names>
</name>
<article-title>RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies</article-title>
.
<source>Bioinformatics</source>
<volume>30</volume>
,
<fpage>1312</fpage>
<lpage>1313</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24451623</pub-id>
</mixed-citation>
</ref>
<ref id="b37">
<mixed-citation publication-type="journal">
<name>
<surname>Golubchik</surname>
<given-names>T.</given-names>
</name>
,
<name>
<surname>Wise</surname>
<given-names>M. J.</given-names>
</name>
,
<name>
<surname>Easteal</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Jermiin</surname>
<given-names>L. S.</given-names>
</name>
<article-title>Mind the gaps: evidence of bias in estimates of multiple sequence alignments</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>24</volume>
,
<fpage>2433</fpage>
<lpage>2442</lpage>
(
<year>2007</year>
).
<pub-id pub-id-type="pmid">17709332</pub-id>
</mixed-citation>
</ref>
<ref id="b38">
<mixed-citation publication-type="journal">
<name>
<surname>Kingman</surname>
<given-names>J. F. C.</given-names>
</name>
<article-title>The coalescent</article-title>
.
<source>Stoch. Proc. Appl.</source>
<volume>13</volume>
,
<fpage>235</fpage>
<lpage>248</lpage>
(
<year>1982</year>
).</mixed-citation>
</ref>
<ref id="b39">
<mixed-citation publication-type="journal">
<name>
<surname>Tellier</surname>
<given-names>A.</given-names>
</name>
&
<name>
<surname>Lemaire</surname>
<given-names>C.</given-names>
</name>
<article-title>Coalescence 2.0: a multiple branching of recent theoretical developments and their applications</article-title>
.
<source>Mol. Ecol.</source>
<volume>23</volume>
,
<fpage>2637</fpage>
<lpage>2652</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24750385</pub-id>
</mixed-citation>
</ref>
<ref id="b40">
<mixed-citation publication-type="journal">
<name>
<surname>Sjödin</surname>
<given-names>P.</given-names>
</name>
,
<name>
<surname>Kaj</surname>
<given-names>I.</given-names>
</name>
,
<name>
<surname>Krone</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Lascoux</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Nordborg</surname>
<given-names>M.</given-names>
</name>
<article-title>On the meaning and existence of an effective population size</article-title>
.
<source>Genetics</source>
<volume>169</volume>
,
<fpage>1061</fpage>
<lpage>1070</lpage>
(
<year>2005</year>
).
<pub-id pub-id-type="pmid">15489538</pub-id>
</mixed-citation>
</ref>
<ref id="b41">
<mixed-citation publication-type="book">
<name>
<surname>Piel</surname>
<given-names>W. H.</given-names>
</name>
,
<name>
<surname>Donoghue</surname>
<given-names>M. J.</given-names>
</name>
&
<name>
<surname>Sanderson</surname>
<given-names>M. J.</given-names>
</name>
in
<source>To the interoperable “Catalog of Life” with partners Species 2000 Asia Oceania. NIES Research Report</source>
,
<volume>
<italic>Vol. 171</italic>
</volume>
(eds. Shimura, J., Wilson, K. L. & Gordon, D.)
<fpage>41</fpage>
<lpage>47</lpage>
(National Institute for Environmental Studies, Tsukuba, Japan;
<year>2002</year>
).</mixed-citation>
</ref>
<ref id="b42">
<mixed-citation publication-type="journal">
<name>
<surname>Posada</surname>
<given-names>D.</given-names>
</name>
<article-title>Phylogenetic models of molecular evolution: next-generation data, fit, and performance</article-title>
.
<source>J. Mol. Evol.</source>
<volume>76</volume>
,
<fpage>351</fpage>
<lpage>352</lpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23695649</pub-id>
</mixed-citation>
</ref>
<ref id="b43">
<mixed-citation publication-type="journal">
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
&
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
<article-title>Biological intuition in alignment-free methods: response to Posada</article-title>
.
<source>J. Mol. Evol.</source>
<volume>77</volume>
,
<fpage>1</fpage>
<lpage>2</lpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23877343</pub-id>
</mixed-citation>
</ref>
<ref id="b44">
<mixed-citation publication-type="journal">
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
,
<name>
<surname>Bernard</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
<article-title>Molecular phylogenetics before sequences: Oligonucleotide catalogs as
<italic>k</italic>
-mer spectra</article-title>
.
<source>RNA Biol.</source>
<volume>11</volume>
,
<fpage>176</fpage>
<lpage>185</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24572375</pub-id>
</mixed-citation>
</ref>
<ref id="b45">
<mixed-citation publication-type="journal">
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
,
<name>
<surname>Darling</surname>
<given-names>A. E.</given-names>
</name>
,
<name>
<surname>Beiko</surname>
<given-names>R. G.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Are protein domains modules of lateral genetic transfer?</article-title>
<source>PLoS ONE</source>
<volume>4</volume>
,
<fpage>e4524</fpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19229333</pub-id>
</mixed-citation>
</ref>
<ref id="b46">
<mixed-citation publication-type="journal">
<name>
<surname>Katoh</surname>
<given-names>K.</given-names>
</name>
&
<name>
<surname>Standley</surname>
<given-names>D. M.</given-names>
</name>
<article-title>MAFFT multiple sequence alignment software version 7: improvements in performance and usability</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>30</volume>
,
<fpage>772</fpage>
<lpage>780</lpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23329690</pub-id>
</mixed-citation>
</ref>
<ref id="b47">
<mixed-citation publication-type="journal">
<name>
<surname>Thompson</surname>
<given-names>J. D.</given-names>
</name>
,
<name>
<surname>Linard</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Lecompte</surname>
<given-names>O.</given-names>
</name>
&
<name>
<surname>Poch</surname>
<given-names>O.</given-names>
</name>
<article-title>A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives</article-title>
.
<source>PLoS ONE</source>
<volume>6</volume>
,
<fpage>e18093</fpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21483869</pub-id>
</mixed-citation>
</ref>
<ref id="b48">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>K.</given-names>
</name>
,
<name>
<surname>Linder</surname>
<given-names>C. R.</given-names>
</name>
&
<name>
<surname>Warnow</surname>
<given-names>T.</given-names>
</name>
<article-title>RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation</article-title>
.
<source>PLoS ONE</source>
<volume>6</volume>
,
<fpage>e27731</fpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">22132132</pub-id>
</mixed-citation>
</ref>
<ref id="b49">
<mixed-citation publication-type="journal">
<name>
<surname>Gunasinghe</surname>
<given-names>U.</given-names>
</name>
,
<name>
<surname>Alahakoon</surname>
<given-names>D.</given-names>
</name>
&
<name>
<surname>Bedingfield</surname>
<given-names>S.</given-names>
</name>
<article-title>Extraction of high quality
<italic>k</italic>
-words for alignment-free sequence comparison</article-title>
.
<source>J. Theor. Biol.</source>
<volume>358</volume>
,
<fpage>31</fpage>
<lpage>51</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24846728</pub-id>
</mixed-citation>
</ref>
<ref id="b50">
<mixed-citation publication-type="journal">
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
&
<name>
<surname>Pfaffelhuber</surname>
<given-names>P.</given-names>
</name>
<article-title>Alignment-free population genomics: an efficient estimator of sequence diversity</article-title>
.
<source>G3</source>
<volume>2</volume>
,
<fpage>883</fpage>
<lpage>889</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22908037</pub-id>
</mixed-citation>
</ref>
<ref id="b51">
<mixed-citation publication-type="journal">
<name>
<surname>Fitch</surname>
<given-names>W. M.</given-names>
</name>
&
<name>
<surname>Margoliash</surname>
<given-names>E.</given-names>
</name>
<article-title>Construction of phylogenetic trees</article-title>
.
<source>Science</source>
<volume>155</volume>
,
<fpage>279</fpage>
<lpage>284</lpage>
(
<year>1967</year>
).
<pub-id pub-id-type="pmid">5334057</pub-id>
</mixed-citation>
</ref>
<ref id="b52">
<mixed-citation publication-type="journal">
<name>
<surname>Burden</surname>
<given-names>C. J.</given-names>
</name>
,
<name>
<surname>Kantorovitz</surname>
<given-names>M. R.</given-names>
</name>
&
<name>
<surname>Wilson</surname>
<given-names>S. R.</given-names>
</name>
<article-title>Approximate word matches between two random sequences</article-title>
.
<source>Ann. Appl. Probab.</source>
<volume>18</volume>
,
<fpage>1</fpage>
<lpage>21</lpage>
(
<year>2008</year>
).</mixed-citation>
</ref>
<ref id="b53">
<mixed-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<article-title>PAML 4: phylogenetic analysis by maximum likelihood</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>24</volume>
,
<fpage>1586</fpage>
<lpage>1591</lpage>
(
<year>2007</year>
).
<pub-id pub-id-type="pmid">17483113</pub-id>
</mixed-citation>
</ref>
<ref id="b54">
<mixed-citation publication-type="journal">
<name>
<surname>Tavaré</surname>
<given-names>S.</given-names>
</name>
<article-title>Some probabilistic and statistical problems in the analysis of DNA sequences</article-title>
.
<source>Lect. Math. Life Sci.</source>
<volume>17</volume>
,
<fpage>57</fpage>
<lpage>86</lpage>
(
<year>1986</year>
).</mixed-citation>
</ref>
<ref id="b55">
<mixed-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<article-title>Estimating the pattern of nucleotide substitution</article-title>
.
<source>J. Mol. Evol.</source>
<volume>39</volume>
,
<fpage>105</fpage>
<lpage>111</lpage>
(
<year>1994</year>
).
<pub-id pub-id-type="pmid">8064867</pub-id>
</mixed-citation>
</ref>
<ref id="b56">
<mixed-citation publication-type="journal">
<name>
<surname>Whelan</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Goldman</surname>
<given-names>N.</given-names>
</name>
<article-title>A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>18</volume>
,
<fpage>691</fpage>
<lpage>699</lpage>
(
<year>2001</year>
).
<pub-id pub-id-type="pmid">11319253</pub-id>
</mixed-citation>
</ref>
<ref id="b57">
<mixed-citation publication-type="journal">
<name>
<surname>Arenas</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Posada</surname>
<given-names>D.</given-names>
</name>
<article-title>Coalescent simulation of intracodon recombination</article-title>
.
<source>Genetics</source>
<volume>184</volume>
,
<fpage>429</fpage>
<lpage>437</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">19933876</pub-id>
</mixed-citation>
</ref>
<ref id="b58">
<mixed-citation publication-type="journal">
<name>
<surname>Sjöstrand</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Arvestad</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Lagergren</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Sennblad</surname>
<given-names>B.</given-names>
</name>
<article-title>GenPhyloData: realistic simulation of gene family evolution</article-title>
.
<source>BMC Bioinformatics</source>
<volume>14</volume>
,
<fpage>209</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23803001</pub-id>
</mixed-citation>
</ref>
<ref id="b59">
<mixed-citation publication-type="journal">
<name>
<surname>Drummond</surname>
<given-names>A. J.</given-names>
</name>
,
<name>
<surname>Ho</surname>
<given-names>S. Y.</given-names>
</name>
,
<name>
<surname>Phillips</surname>
<given-names>M. J.</given-names>
</name>
&
<name>
<surname>Rambaut</surname>
<given-names>A.</given-names>
</name>
<article-title>Relaxed phylogenetics and dating with confidence</article-title>
.
<source>PLoS Biol.</source>
<volume>4</volume>
,
<fpage>e88</fpage>
(
<year>2006</year>
).
<pub-id pub-id-type="pmid">16683862</pub-id>
</mixed-citation>
</ref>
<ref id="b60">
<mixed-citation publication-type="journal">
<name>
<surname>McDonald</surname>
<given-names>D.</given-names>
</name>
<italic>et al.</italic>
<article-title>An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea</article-title>
.
<source>ISME J.</source>
<volume>6</volume>
,
<fpage>610</fpage>
<lpage>618</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22134646</pub-id>
</mixed-citation>
</ref>
<ref id="b61">
<mixed-citation publication-type="journal">
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
,
<name>
<surname>Mahbob</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Clustering evolving proteins into homologous families</article-title>
.
<source>BMC Bioinformatics</source>
<volume>14</volume>
,
<fpage>120</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23566217</pub-id>
</mixed-citation>
</ref>
<ref id="b62">
<mixed-citation publication-type="journal">
<name>
<surname>Stuart</surname>
<given-names>G. W.</given-names>
</name>
,
<name>
<surname>Moffett</surname>
<given-names>K.</given-names>
</name>
&
<name>
<surname>Baker</surname>
<given-names>S.</given-names>
</name>
<article-title>Integrated gene and species phylogenies from unaligned whole genome protein sequences</article-title>
.
<source>Bioinformatics</source>
<volume>18</volume>
,
<fpage>100</fpage>
<lpage>108</lpage>
(
<year>2002</year>
).
<pub-id pub-id-type="pmid">11836217</pub-id>
</mixed-citation>
</ref>
<ref id="b63">
<mixed-citation publication-type="journal">
<name>
<surname>Kupczok</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Schmidt</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>von Haeseler</surname>
<given-names>A.</given-names>
</name>
<article-title>Accuracy of phylogeny reconstruction methods combining overlapping gene data sets</article-title>
.
<source>Algorithms Mol. Biol.</source>
<volume>5</volume>
,
<fpage>37</fpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">21134245</pub-id>
</mixed-citation>
</ref>
<ref id="b64">
<mixed-citation publication-type="journal">
<name>
<surname>Bryant</surname>
<given-names>D.</given-names>
</name>
&
<name>
<surname>Steel</surname>
<given-names>M.</given-names>
</name>
<article-title>Computing the distribution of a tree metric</article-title>
.
<source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source>
<volume>6</volume>
,
<fpage>420</fpage>
<lpage>426</lpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19644170</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="f1">
<label>Figure 1</label>
<caption>
<title>Trees for simulation of sequence data.</title>
<p>Six situations showing distinct combinations of internal (
<italic>x</italic>
) and terminal (
<italic>y</italic>
) branches, labelled as T1 through T6, with
<italic>y</italic>
specified differently between the first (
<italic>p1</italic>
) and second (
<italic>p2</italic>
) half of the branches on a tree. Branch lengths are given in substitutions per site; each edge has a length of either 0.01 or 0.05.</p>
</caption>
<graphic xlink:href="srep06504-f1"></graphic>
</fig>
<fig id="f2">
<label>Figure 2</label>
<caption>
<title>The accuracy of
<italic>D
<sub>2</sub>
</italic>
methods based on sequence divergence of the nucleotide sequence sets.</title>
<p>For each set size
<italic>N</italic>
at (i) 8, (ii) 32 and (iii) 128, mean
<italic>RF
<sub>D2n1</sub>
</italic>
values are shown in (a) across different
<italic>k</italic>
-mer lengths (shown for
<italic>k</italic>
= 8, 12, 16, 20, 24), for cases simulated under each of the six trees (T1 through T6 on the
<italic>x</italic>
-axis). The corresponding
<italic>Q
<sub>D2n1</sub>
</italic>
for each case is shown in (b). Error bars indicate standard deviation from the mean. See
<xref ref-type="supplementary-material" rid="s1">Supplementary Figures S1 through S4</xref>
for complete results for all
<italic>D</italic>
<sub>2</sub>
methods for both nucleotide and protein sequence sets.</p>
</caption>
<graphic xlink:href="srep06504-f2"></graphic>
</fig>
<fig id="f3">
<label>Figure 3</label>
<caption>
<title>The accuracy of
<italic>D
<sub>2</sub>
</italic>
methods based on genetic rearrangement.</title>
<p>
<italic>RF
<sub>D2n1</sub>
</italic>
values are shown in (a) across different
<italic>k</italic>
-mer lengths (
<italic>k</italic>
≥ 8), together with those of the standard approach (
<italic>RF
<sub>MSA</sub>
</italic>
), across different
<italic>R</italic>
at 10%, 25% and 50%. The corresponding
<italic>Q
<sub>D2n1</sub>
</italic>
values are shown in (b). Error bars indicate standard deviation from the mean.</p>
</caption>
<graphic xlink:href="srep06504-f3"></graphic>
</fig>
<fig id="f4">
<label>Figure 4</label>
<caption>
<title>The accuracy of phylogenetic approaches based on insertions/deletions.</title>
<p>
<italic>RF</italic>
values are shown in (a) for
<inline-formula id="m45">
<inline-graphic id="d33e2408" xlink:href="srep06504-m45.jpg"></inline-graphic>
</inline-formula>
, MUSCLE + MrBayes and MUSCLE + RAxML across different indel rates
<italic>r</italic>
. The corresponding
<italic>Q</italic>
values for MUSCLE + MrBayes and MUSCLE + RAxML are shown in (b). Error bars indicate standard deviation from the mean.</p>
</caption>
<graphic xlink:href="srep06504-f4"></graphic>
</fig>
<fig id="f5">
<label>Figure 5</label>
<caption>
<title>The accuracy of phylogenetic approaches based on coalescent evolution of gene families.</title>
<p>
<italic>RF</italic>
values are shown in (a) for
<inline-formula id="m46">
<inline-graphic id="d33e2425" xlink:href="srep06504-m46.jpg"></inline-graphic>
</inline-formula>
, MUSCLE + MrBayes and MUSCLE + RAxML across different effective population sizes
<italic>N
<sub>e</sub>
</italic>
. The corresponding
<italic>Q</italic>
values for MUSCLE + MrBayes and MUSCLE + RAxML are shown in (b). Error bars indicate standard deviation from the mean.</p>
</caption>
<graphic xlink:href="srep06504-f5"></graphic>
</fig>
<fig id="f6">
<label>Figure 6</label>
<caption>
<title>The accuracy of
<italic>D
<sub>2</sub>
</italic>
methods based on TreeBASE data.</title>
<p>The probability density of
<italic>RF
<sub>D2n1</sub>
</italic>
at
<italic>k</italic>
= 8, categorised by (a) the total number of sequences within a set,
<italic>N</italic>
(mean and median in
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S3</xref>
), and (b) within-set sequence similarity,
<italic>ID</italic>
(mean and median in
<xref ref-type="supplementary-material" rid="s1">Supplementary Table S4</xref>
).</p>
</caption>
<graphic xlink:href="srep06504-f6"></graphic>
</fig>
<fig id="f7">
<label>Figure 7</label>
<caption>
<title>Computation time of
<italic>D
<sub>2</sub>
</italic>
methods.</title>
<p>The computation time in seconds is shown for (a) the
<italic>D</italic>
<sub>2</sub>
method at
<italic>k</italic>
= 8 on subsets of the GreenGenes data of size
<italic>N</italic>
= 1000, 2000, 3000, 4000 and 5000, and for (b)
<inline-formula id="m47">
<inline-graphic id="d33e2490" xlink:href="srep06504-m47.jpg"></inline-graphic>
</inline-formula>
analysis across neighbourhood sizes
<italic>n</italic>
= 1 through 5, for nucleotide sequence sets of
<italic>N</italic>
= 8. Error bars indicate standard deviation from the mean.</p>
</caption>
<graphic xlink:href="srep06504-f7"></graphic>
</fig>
</floats-group>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Asie/explor/AustralieFrV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0001280 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0001280 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Asie
   |area=    AustralieFrV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Tue Dec 5 10:43:12 2017. Site generation: Tue Mar 5 14:07:20 2024