MersV1, Pmc, Corpus, bibRecord, 000C68

Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages

Internal identifier: 000C68 (Pmc/Corpus); previous: 000C67; next: 000C69

Authors: Benjamin Siranosian; Sudheesha Perera; Edward Williams; Chen Ye; Christopher De Graffenried; Peter Shank

Source :

F1000Research [ 2046-1402 ] ; 2015.

RBID : PMC:4841201

Abstract

Background

The genomic sequences of mycobacteriophages, phages infecting mycobacterial hosts, are diverse and mosaic. Mycobacteriophages often share little nucleotide similarity, but most of them have been grouped into lettered clusters and further into subclusters. Traditionally, mycobacteriophage genomes are analyzed based on sequence alignment or knowledge of gene content. However, these approaches are computationally expensive and can be ineffective for significantly diverged sequences. As an alternative to alignment-based genome analysis, we evaluated tetranucleotide usage in mycobacteriophage genomes. These methods make it easier to characterize features of the mycobacteriophage population at many scales.

Description

We computed tetranucleotide usage deviation (TUD), the ratio of observed counts of 4-mers in a genome to the expected count under a null model. TUD values are comparable between members of a phage subcluster and distinct between subclusters. With few exceptions, neighbor joining phylogenetic trees and hierarchical clustering dendrograms constructed using TUD values place phages in a monophyletic clade with members of the same subcluster. Regions in a genome with exceptional TUD values can point to interesting features of genomic architecture. Finally, we found that subcluster B3 mycobacteriophages contain significantly overrepresented 4-mers and 6-mers that are atypical of phage genomes.
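
As an illustration of the statistic described above, the following is a minimal Python sketch of a TUD calculation. It assumes a zero-order null model, in which the expected count of a 4-mer is the product of its single-nucleotide frequencies multiplied by the number of 4-mer windows in the genome. The abstract does not specify the null model, so that choice, the single-strand counting, and the function name tetranucleotide_usage_deviation are assumptions made for illustration and may differ from the authors' implementation.

```python
from collections import Counter
from itertools import product


def tetranucleotide_usage_deviation(sequence, k=4):
    """Return {k-mer: observed / expected} for every k-mer over ACGT.

    Assumption (hypothetical, not taken from the paper): the expected count
    of a k-mer is the product of its single-nucleotide frequencies times the
    number of overlapping k-length windows (a zero-order null model).
    Only the given strand is counted.
    """
    seq = sequence.upper()
    windows = len(seq) - k + 1
    if windows < 1:
        raise ValueError("sequence is shorter than k")

    # Single-nucleotide frequencies, ignoring ambiguous bases such as N.
    base_counts = Counter(b for b in seq if b in "ACGT")
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    freq = {b: base_counts[b] / total for b in "ACGT"}

    # Observed counts from overlapping windows along the sequence.
    observed = Counter(seq[i:i + k] for i in range(windows))

    tud = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        expected = float(windows)
        for base in kmer:
            expected *= freq[base]
        # A k-mer that cannot occur under the null model is reported as 0.0.
        tud[kmer] = observed[kmer] / expected if expected > 0 else 0.0
    return tud
```

Under these assumptions, a value above 1 marks a 4-mer that is overrepresented relative to the null model and a value below 1 marks an underrepresented one. The resulting 256-value vectors could then be compared between genomes with any distance measure; the abstract does not state which one was used.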

Conclusions

Statistics based on tetranucleotide usage support established clustering of mycobacteriophages and can uncover interesting relationships within and between sequenced phage genomes. These methods are efficient to compute and do not require sequence alignment or knowledge of gene content. The code to download mycobacteriophage genome sequences and reproduce our analysis is freely available at https://github.com/bsiranosian/tango_final.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4841201

DOI: 10.12688/f1000research.6077.2
PubMed: 27134721
PubMed Central: 4841201

Links to Exploration step

PMC:4841201

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages</title>
<author><name sortKey="Siranosian, Benjamin" sort="Siranosian, Benjamin" uniqKey="Siranosian B" first="Benjamin" last="Siranosian">Benjamin Siranosian</name>
<affiliation><nlm:aff id="a1">Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Perera, Sudheesha" sort="Perera, Sudheesha" uniqKey="Perera S" first="Sudheesha" last="Perera">Sudheesha Perera</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williams, Edward" sort="Williams, Edward" uniqKey="Williams E" first="Edward" last="Williams">Edward Williams</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Ye, Chen" sort="Ye, Chen" uniqKey="Ye C" first="Chen" last="Ye">Chen Ye</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="De Graffenried, Christopher" sort="De Graffenried, Christopher" uniqKey="De Graffenried C" first="Christopher" last="De Graffenried">Christopher De Graffenried</name>
<affiliation><nlm:aff id="a3">Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Shank, Peter" sort="Shank, Peter" uniqKey="Shank P" first="Peter" last="Shank">Peter Shank</name>
<affiliation><nlm:aff id="a3">Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">27134721</idno>
<idno type="pmc">4841201</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4841201</idno>
<idno type="RBID">PMC:4841201</idno>
<idno type="doi">10.12688/f1000research.6077.2</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000C68</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000C68</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages</title>
<author><name sortKey="Siranosian, Benjamin" sort="Siranosian, Benjamin" uniqKey="Siranosian B" first="Benjamin" last="Siranosian">Benjamin Siranosian</name>
<affiliation><nlm:aff id="a1">Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Perera, Sudheesha" sort="Perera, Sudheesha" uniqKey="Perera S" first="Sudheesha" last="Perera">Sudheesha Perera</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williams, Edward" sort="Williams, Edward" uniqKey="Williams E" first="Edward" last="Williams">Edward Williams</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Ye, Chen" sort="Ye, Chen" uniqKey="Ye C" first="Chen" last="Ye">Chen Ye</name>
<affiliation><nlm:aff id="a2">Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="De Graffenried, Christopher" sort="De Graffenried, Christopher" uniqKey="De Graffenried C" first="Christopher" last="De Graffenried">Christopher De Graffenried</name>
<affiliation><nlm:aff id="a3">Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Shank, Peter" sort="Shank, Peter" uniqKey="Shank P" first="Peter" last="Shank">Peter Shank</name>
<affiliation><nlm:aff id="a3">Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">F1000Research</title>
<idno type="eISSN">2046-1402</idno>
<imprint><date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p><bold>Background</bold>
</p>
<p>The genomic sequences of mycobacteriophages, phages infecting mycobacterial hosts, are diverse and mosaic. Mycobacteriophages often share little nucleotide similarity, but most of them have been grouped into lettered clusters and further into subclusters. Traditionally, mycobacteriophage genomes are analyzed based on sequence alignment or knowledge of gene content. However, these approaches are computationally expensive and can be ineffective for significantly diverged sequences. As an alternative to alignment-based genome analysis, we evaluated tetranucleotide usage in mycobacteriophage genomes. These methods make it easier to characterize features of the mycobacteriophage population at many scales.</p>
<p><bold>Description</bold>
</p>
<p>We computed tetranucleotide usage deviation (TUD), the ratio of observed counts of 4-mers in a genome to the expected count under a null model. TUD values are comparable between members of a phage subcluster and distinct between subclusters. With few exceptions, neighbor joining phylogenetic trees and hierarchical clustering dendrograms constructed using TUD values place phages in a monophyletic clade with members of the same subcluster. Regions in a genome with exceptional TUD values can point to interesting features of genomic architecture. Finally, we found that subcluster B3 mycobacteriophages contain significantly overrepresented 4-mers and 6-mers that are atypical of phage genomes.</p>
<p><bold>Conclusions</bold>
</p>
<p>Statistics based on tetranucleotide usage support established clustering of mycobacteriophages and can uncover interesting relationships within and between sequenced phage genomes. These methods are efficient to compute and do not require sequence alignment or knowledge of gene content. The code to download mycobacteriophage genome sequences and reproduce our analysis is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/bsiranosian/tango_final">https://github.com/bsiranosian/tango_final</ext-link>
.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author><name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author><name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Betley, Jn" uniqKey="Betley J">JN Betley</name>
</author>
<author><name sortKey="Frith, Mc" uniqKey="Frith M">MC Frith</name>
</author>
<author><name sortKey="Graber, Jh" uniqKey="Graber J">JH Graber</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bohannan, Bjm" uniqKey="Bohannan B">BJM Bohannan</name>
</author>
<author><name sortKey="Lenski, Re" uniqKey="Lenski R">RE Lenski</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chan, Cx" uniqKey="Chan C">CX Chan</name>
</author>
<author><name sortKey="Ragan, Ma" uniqKey="Ragan M">MA Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chibani Chennoufi, S" uniqKey="Chibani Chennoufi S">S Chibani-Chennoufi</name>
</author>
<author><name sortKey="Bruttin, A" uniqKey="Bruttin A">A Bruttin</name>
</author>
<author><name sortKey="Dillmann, Ml" uniqKey="Dillmann M">ML Dillmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cresawn, Sg" uniqKey="Cresawn S">SG Cresawn</name>
</author>
<author><name sortKey="Bogel, M" uniqKey="Bogel M">M Bogel</name>
</author>
<author><name sortKey="Day, N" uniqKey="Day N">N Day</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Danelishvili, L" uniqKey="Danelishvili L">L Danelishvili</name>
</author>
<author><name sortKey="Young, Ls" uniqKey="Young L">LS Young</name>
</author>
<author><name sortKey="Bermudez, Le" uniqKey="Bermudez L">LE Bermudez</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Doolittle, Wf" uniqKey="Doolittle W">WF Doolittle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Frith, Mc" uniqKey="Frith M">MC Frith</name>
</author>
<author><name sortKey="Hamada, M" uniqKey="Hamada M">M Hamada</name>
</author>
<author><name sortKey="Horton, P" uniqKey="Horton P">P Horton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gelfand, Ms" uniqKey="Gelfand M">MS Gelfand</name>
</author>
<author><name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hacker, J" uniqKey="Hacker J">J Hacker</name>
</author>
<author><name sortKey="Kaper, Jb" uniqKey="Kaper J">JB Kaper</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hatfull, Gf" uniqKey="Hatfull G">GF Hatfull</name>
</author>
<author><name sortKey="Jacobs Sera, D" uniqKey="Jacobs Sera D">D Jacobs-Sera</name>
</author>
<author><name sortKey="Lawrence, Jg" uniqKey="Lawrence J">JG Lawrence</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hatfull, Gf" uniqKey="Hatfull G">GF Hatfull</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hemavathy, Kc" uniqKey="Hemavathy K">KC Hemavathy</name>
</author>
<author><name sortKey="Nagaraja, V" uniqKey="Nagaraja V">V Nagaraja</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hendrix, Rw" uniqKey="Hendrix R">RW Hendrix</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author><name sortKey="Bryant, D" uniqKey="Bryant D">D Bryant</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jordan, Tc" uniqKey="Jordan T">TC Jordan</name>
</author>
<author><name sortKey="Burnett, Sh" uniqKey="Burnett S">SH Burnett</name>
</author>
<author><name sortKey="Carson, S" uniqKey="Carson S">S Carson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
<author><name sortKey="Burge, C" uniqKey="Burge C">C Burge</name>
</author>
<author><name sortKey="Campbell, Am" uniqKey="Campbell A">AM Campbell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Koski, Lb" uniqKey="Koski L">LB Koski</name>
</author>
<author><name sortKey="Morton, Ra" uniqKey="Morton R">RA Morton</name>
</author>
<author><name sortKey="Golding, Gb" uniqKey="Golding G">GB Golding</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lawrence, Jg" uniqKey="Lawrence J">JG Lawrence</name>
</author>
<author><name sortKey="Hatfull, Gf" uniqKey="Hatfull G">GF Hatfull</name>
</author>
<author><name sortKey="Hendrix, Rw" uniqKey="Hendrix R">RW Hendrix</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lawrence, Jg" uniqKey="Lawrence J">JG Lawrence</name>
</author>
<author><name sortKey="Ochman, H" uniqKey="Ochman H">H Ochman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marinus, Mg" uniqKey="Marinus M">MG Marinus</name>
</author>
<author><name sortKey="Morris, Nr" uniqKey="Morris N">NR Morris</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mcnerney, R" uniqKey="Mcnerney R">R McNerney</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Needleman, Sb" uniqKey="Needleman S">SB Needleman</name>
</author>
<author><name sortKey="Wunsch, Cd" uniqKey="Wunsch C">CD Wunsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ogilvie, La" uniqKey="Ogilvie L">LA Ogilvie</name>
</author>
<author><name sortKey="Bowler, Ld" uniqKey="Bowler L">LD Bowler</name>
</author>
<author><name sortKey="Caplin, J" uniqKey="Caplin J">J Caplin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pedulla, Ml" uniqKey="Pedulla M">ML Pedulla</name>
</author>
<author><name sortKey="Ford, Me" uniqKey="Ford M">ME Ford</name>
</author>
<author><name sortKey="Houtz, Jm" uniqKey="Houtz J">JM Houtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author><name sortKey="Meinersmann, Rj" uniqKey="Meinersmann R">RJ Meinersmann</name>
</author>
<author><name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author><name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
<author><name sortKey="Ghose, C" uniqKey="Ghose C">C Ghose</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sandberg, R" uniqKey="Sandberg R">R Sandberg</name>
</author>
<author><name sortKey="Winberg, G" uniqKey="Winberg G">G Winberg</name>
</author>
<author><name sortKey="Br Nden, Ci" uniqKey="Br Nden C">CI Bränden</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Shankar, S" uniqKey="Shankar S">S Shankar</name>
</author>
<author><name sortKey="Tyagi, Ak" uniqKey="Tyagi A">AK Tyagi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sharp, Pm" uniqKey="Sharp P">PM Sharp</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Simmons, Mp" uniqKey="Simmons M">MP Simmons</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Siranosian, B" uniqKey="Siranosian B">B Siranosian</name>
</author>
<author><name sortKey="Herold, E" uniqKey="Herold E">E Herold</name>
</author>
<author><name sortKey="Williams, E" uniqKey="Williams E">E Williams</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Siranosian, B" uniqKey="Siranosian B">B Siranosian</name>
</author>
<author><name sortKey="Perera, S" uniqKey="Perera S">S Perera</name>
</author>
<author><name sortKey="Williams, E" uniqKey="Williams E">E Williams</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Waack, S" uniqKey="Waack S">S Waack</name>
</author>
<author><name sortKey="Keller, O" uniqKey="Keller O">O Keller</name>
</author>
<author><name sortKey="Asper, R" uniqKey="Asper R">R Asper</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">F1000Res</journal-id>
<journal-id journal-id-type="iso-abbrev">F1000Res</journal-id>
<journal-id journal-id-type="pmc">F1000Research</journal-id>
<journal-title-group><journal-title>F1000Research</journal-title>
</journal-title-group>
<issn pub-type="epub">2046-1402</issn>
<publisher><publisher-name>F1000Research</publisher-name>
<publisher-loc>London, UK</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">27134721</article-id>
<article-id pub-id-type="pmc">4841201</article-id>
<article-id pub-id-type="doi">10.12688/f1000research.6077.2</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
<subj-group><subject>Articles</subject>
<subj-group><subject>Bioinformatics</subject>
</subj-group>
<subj-group><subject>Genomics</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages</article-title>
<fn-group content-type="pub-status"><fn><p>[version 2; referees: 2 approved]</p>
</fn>
</fn-group>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Siranosian</surname>
<given-names>Benjamin</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Perera</surname>
<given-names>Sudheesha</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Williams</surname>
<given-names>Edward</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Ye</surname>
<given-names>Chen</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>de Graffenried</surname>
<given-names>Christopher</given-names>
</name>
<xref ref-type="aff" rid="a3">3</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Shank</surname>
<given-names>Peter</given-names>
</name>
<xref ref-type="aff" rid="a3">3</xref>
</contrib>
<aff id="a1"><label>1</label>
Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA</aff>
<aff id="a2"><label>2</label>
Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA</aff>
<aff id="a3"><label>3</label>
Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA</aff>
</contrib-group>
<author-notes><corresp id="c1"><label>a</label>
<email xlink:href="mailto:benjamin_siranosian@alumni.brown.edu">benjamin_siranosian@alumni.brown.edu</email>
</corresp>
<fn fn-type="con"><p>BS designed the study. BS, SP, EW and CY performed the analysis. BS and CY prepared the figures. BS, SP, EW, CDG and PS wrote the manuscript.</p>
</fn>
<fn fn-type="COI-statement"><p><bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>30</day>
<month>10</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection"><year>2015</year>
</pub-date>
<volume>4</volume>
<elocation-id>36</elocation-id>
<history><date date-type="accepted"><day>28</day>
<month>10</month>
<year>2015</year>
</date>
</history>
<permissions><copyright-statement>Copyright: © 2015 Siranosian B et al.</copyright-statement>
<copyright-year>2015</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="f1000research-4-7828.pdf"></self-uri>
<abstract><p><bold>Background</bold>
</p>
<p>The genomic sequences of mycobacteriophages, phages infecting mycobacterial hosts, are diverse and mosaic. Mycobacteriophages often share little nucleotide similarity, but most of them have been grouped into lettered clusters and further into subclusters. Traditionally, mycobacteriophage genomes are analyzed based on sequence alignment or knowledge of gene content. However, these approaches are computationally expensive and can be ineffective for significantly diverged sequences. As an alternative to alignment-based genome analysis, we evaluated tetranucleotide usage in mycobacteriophage genomes. These methods make it easier to characterize features of the mycobacteriophage population at many scales.</p>
<p><bold>Description</bold>
</p>
<p>We computed tetranucleotide usage deviation (TUD), the ratio of observed counts of 4-mers in a genome to the expected count under a null model. TUD values are comparable between members of a phage subcluster and distinct between subclusters. With few exceptions, neighbor joining phylogenetic trees and hierarchical clustering dendrograms constructed using TUD values place phages in a monophyletic clade with members of the same subcluster. Regions in a genome with exceptional TUD values can point to interesting features of genomic architecture. Finally, we found that subcluster B3 mycobacteriophages contain significantly overrepresented 4-mers and 6-mers that are atypical of phage genomes.</p>
<p><bold>Conclusions</bold>
</p>
<p>Statistics based on tetranucleotide usage support established clustering of mycobacteriophages and can uncover interesting relationships within and between sequenced phage genomes. These methods are efficient to compute and do not require sequence alignment or knowledge of gene content. The code to download mycobacteriophage genome sequences and reproduce our analysis is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/bsiranosian/tango_final">https://github.com/bsiranosian/tango_final</ext-link>
.</p>
</abstract>
<kwd-group kwd-group-type="author"><kwd>mycobacteriophages, computed</kwd>
<kwd>tetranucleotide</kwd>
<kwd>usage</kwd>
<kwd>deviation, genome</kwd>
<kwd>sequences</kwd>
</kwd-group>
<funding-group><award-group id="fund-1"><funding-source>Brown University</funding-source>
</award-group>
<award-group id="fund-2"><funding-source>HHMI SEA-PHAGES program</funding-source>
</award-group>
<funding-statement>This work was funded by Brown University Biology Undergraduate Education and the HHMI SEA-PHAGES program.</funding-statement>
</funding-group>
</article-meta>
<notes notes-type="version-changes"><sec sec-type="version-changes"><label>Revised</label>
<title>Amendments from Version 1</title>
<p>This version addresses the review by Dr. Bonham-Carter. Changes have been made to make the methods section clearer, and I have included an example figure to show the calculation of TUD on a small sequence. The results from the paper remain unchanged.</p>
</sec>
</notes>
</front>
<body><sec sec-type="intro"><title>Introduction</title>
<p>Mycobacteriophages, phages infecting mycobacterial hosts, are a subset of the estimated 10
<sup>31</sup> phage particles present globally. Mycobacteriophages infect a number of bacterial hosts from the genus
<italic>Mycobacterium</italic>, and they are broadly classified into
<italic>Siphoviridae</italic> and
<italic>Myoviridae</italic>. Mycobacteriophages are present in both land and aquatic environments and play a large ecological role in the turnover and evolution of bacteria (
<xref rid="ref-3" ref-type="bibr">Bohannan &amp; Lenski, 2000</xref>;
<xref rid="ref-5" ref-type="bibr">Chibani-Chennoufi
<italic>et al.</italic>
, 2004</xref>;
<xref rid="ref-15" ref-type="bibr">Hendrix, 2002</xref>). The recent rise of antimicrobial-resistant pathogenic bacteria has renewed interest in mycobacteriophages and the potential for phage therapy of
<italic>Mycobacterium tuberculosis</italic> infections. Although
<italic>in vivo</italic> experiments have not yet yielded promising clinical results, mycobacteriophages are still powerful diagnostic tools for the investigation of mycobacterial pathogenesis (
<xref rid="ref-7" ref-type="bibr">Danelishvili, 2006</xref>;
<xref rid="ref-13" ref-type="bibr">Hatfull, 2014</xref>;
<xref rid="ref-23" ref-type="bibr">McNerney, 1999</xref>
).</p>
<p>The genomic sequences of mycobacteriophages are mosaic and diverse. As of April 2014, 663 distinct mycobacteriophage genomes were available on the database
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org/">PhagesDB.org</ext-link>; most were isolated on
<italic>Mycobacterium smegmatis</italic> MC
<sup>2</sup>155. Global guanine + cytosine (GC) content ranges from 50.3% to 70% (mean of 63.9%), and genome lengths range from 41 kb to 165 kb (mean of 67 kb). In total, more than 50,000 distinct genes are found within the population. The majority of these genes are of unknown function and do not have homologs in other types of phages or bacteria (
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
, 2010</xref>
). However, many genes are shared between closely related mycobacteriophages. Similar genes have been grouped into almost 4,000 phamilies (or phams, a play on gene families) based on shared amino acid sequence. Phams have been used to investigate horizontal gene transfer within the mycobacteriophage population and to create phylogenetic trees.</p>
<p>Despite the high levels of diversity, mycobacteriophages can be grouped into distinct clusters based on their morphologic and genetic features. Some clusters are large and further divided into subclusters (cluster A, for example, with 11 subclusters and 246 members), while others are small and undivided (cluster S with two members and no subclusters). Some phages have no nearest neighbor to establish a cluster and are classified as singletons. Clusters are defined using four methods: dot-plot comparisons, pairwise average nucleotide identities, pairwise genome map comparisons and gene content analysis (
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
, 2010</xref>). However, it should be noted that the clustering scheme proposed for mycobacteriophages mainly serves to identify similarities in genome architecture. This clustering scheme, and our proposed methods of grouping based on tetranucleotide usage described below, are not true taxonomic representations of the mycobacteriophage population. Extensive horizontal gene transfer prevents accurate reconstruction of evolutionary history from purely phylogenetic information (
<xref rid="ref-20" ref-type="bibr">Lawrence
<italic>et al.</italic>
, 2002</xref>
).</p>
<p>Methods traditionally used to analyze mycobacteriophage genomes require sequence alignment or genome annotation. These analytical tasks can be effective, but they are not without drawbacks. Alignment-based methods can be biased by the choice of score parameters (
<xref rid="ref-9" ref-type="bibr">Frith
<italic>et al.</italic>
, 2010</xref>), and genome annotation may require significant manual input, including by-hand verification of automated gene calls before a mycobacteriophage genome is submitted to GenBank. It is especially difficult to build multiple-sequence alignment based phylogenetic trees from mycobacteriophage genomes because phages lack a common genetic element, such as 16S rRNA in bacteria (
<xref rid="ref-8" ref-type="bibr">Doolittle, 1999</xref>). Alignment-free methods avoid many of the disadvantages associated with alignment-based inference. These methods typically use statistics based on the oligonucleotide composition of a sequence and are completely independent of alignment or annotation. Several methods have been developed for different applications; most are covered in the excellent review by
<xref rid="ref-35" ref-type="bibr">Vinga (2007)</xref>. Alignment-free methods are also less computationally intensive than multiple sequence alignment. While the complexity of sequence alignment algorithms scales at least as fast as the square of the number of sequences (at least O(
<italic>n
<sup>2</sup>
</italic>) complexity), alignment-free methods typically fall below O(
<italic>n
<sup>2</sup>
</italic>) (
<xref rid="ref-4" ref-type="bibr">Chan &amp; Ragan, 2013</xref>
).</p>
<p>Even so, there are drawbacks to alignment-free methods for analyzing genomes, mostly related to the interpretation of statistics in an evolutionary context. It can be difficult to understand how oligonucleotide frequencies are modified in a population over time when selection usually takes place at the level of genes. Oligonucleotide frequencies can also be subject to convergent evolution: if two distantly related phages slowly converge to similar usage frequencies, these methods can give a false indication of common ancestry (
<xref rid="ref-27" ref-type="bibr">Pride
<italic>et al.</italic>
, 2003</xref>
).</p>
<p>Alignment-free methods have been used to study phage and bacterial genomes in a variety of contexts. For example,
<xref rid="ref-28" ref-type="bibr">Pride
<italic>et al.</italic>
 (2006)</xref> found tetranucleotide usage to carry a strong phylogenetic signal in bacteriophages and showed that tetranucleotide composition was similar among phages with common hosts. More recently,
<xref rid="ref-25" ref-type="bibr">Ogilvie
<italic>et al.</italic>
 (2013)</xref> surveyed metagenomic sequencing datasets using a tetranucleotide usage-based method and discovered several novel
<italic>Bacteroidales</italic>-like phages which could not be identified with alignment-based methods. Oligonucleotide composition vectors have also been proposed as a method to root viral phylogenies (
<xref rid="ref-32" ref-type="bibr">Simmons, 2008</xref>
).</p>
<p>Statistics based on nucleotide composition in a sliding window can theoretically be used to uncover horizontal gene transfer (HGT), based on the assumption that genomes have self-similar nucleotide composition and outlier regions could represent recent horizontal transfer events (
<xref rid="ref-21" ref-type="bibr">Lawrence & Ochman, 1997</xref>). Guanine + Cytosine (GC) content in a sliding window was first used to look for pathogenicity islands within a genome (
<xref rid="ref-11" ref-type="bibr">Hacker & Kaper, 2000</xref>). More recent methods have used nucleotide composition and Naïve Bayesian classifiers (
<xref rid="ref-29" ref-type="bibr">Sandberg
<italic>et al.</italic>
, 2001</xref>) or hidden Markov models (
<xref rid="ref-36" ref-type="bibr">Waack
<italic>et al.</italic>
, 2006</xref>). However, horizontally transferred segments can change in oligonucleotide composition over time to become more similar to the resident genome, a process known as amelioration, which can obscure genuine transfer events (
<xref rid="ref-19" ref-type="bibr">Koski
<italic>et al.</italic>
, 2001</xref>
).</p>
<p>The number of sequenced mycobacteriophages has grown immensely in the past few years thanks to the Howard Hughes Medical Institute (HHMI) Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) course (
<xref rid="ref-17" ref-type="bibr">Jordan
<italic>et al.</italic>
, 2014</xref>). This program allows first-year undergraduate students to isolate and characterize novel mycobacteriophages from the environment. It has also provided excellent opportunities for collaborative projects between undergraduates, resulting in the work presented here (
<xref rid="ref-33" ref-type="bibr">Siranosian
<italic>et al.</italic>
, 2015a</xref>
).</p>
<p>As the number of sequenced mycobacteriophages continues to increase, researchers need new methods to quickly make comparisons at many scales. Alignment-free methods are one possibility: they are independent of sequence alignment or genome annotation, less computationally complex than alignment-based methods and applicable to genomes without a common subsequence. We investigated tetranucleotide usage in mycobacteriophage genomes as an alignment-free alternative to traditional methods for genome comparison. Our findings support what is known about mycobacteriophage biology: phages form identifiable groups and subgroups, known as clusters, but have extensive differences between clusters. Tetranucleotide usage also highlights outliers in the population and can describe unique genomic features. All of the analyses here can be done in minutes on a personal laptop. Tetranucleotide usage is a powerful tool to quickly investigate features of the growing mycobacteriophage population.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<p>We obtained the genomic sequences of all 663 sequenced mycobacteriophages publicly available on the website
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org/">PhagesDB.org</ext-link>
 as of April 2014. This dataset contains both unpublished genomes and genomes available on GenBank. There is no easy way to download the mycobacteriophage database in its entirety, so we automated the process with a Python script available in the code accompanying this manuscript.</p>
<p>To compare mycobacteriophage genomes independently of sequence alignment, we investigated the usage of
<italic>k</italic>-mers, substrings of DNA of length
<italic>k,</italic> in each genome. Given a value for
<italic>k</italic>, there are 4
<sup><italic>k</italic>
</sup> possible substrings. For example, the 16 possible ways to combine {A, T, C, G} in substrings of length two are {AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG}. Different values for
<italic>k</italic> are used throughout this paper, but we focus mainly on results from
<italic>k</italic>=4 and
<italic>k</italic>=6. In the following section, a substring of length
<italic>k</italic> is called a word, abbreviated by
<italic>W</italic>. Before computing
<italic>k</italic>
-mer usage, each genome is extended by the reverse complement to account for biases from transcriptional start orientation.</p>
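<p>The enumeration of words and the reverse-complement extension described above can be sketched as follows. This is a minimal illustration, not the code from the accompanying repository: the function names are hypothetical, and we assume that "extended by the reverse complement" means the reverse complement is appended to the end of the genome.</p>

```python
from itertools import product

# Hypothetical helper names; not taken from the authors' repository.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def all_kmers(k):
    """Return all 4**k words over {A, T, C, G} in a fixed order."""
    return ["".join(p) for p in product("ATCG", repeat=k)]


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_with_reverse_complement(genome):
    """Assumption: 'extended' means the reverse complement is appended."""
    return genome + reverse_complement(genome)


print(len(all_kmers(2)))                        # 16
print(reverse_complement("AACG"))               # CGTT
print(extend_with_reverse_complement("AACG"))   # AACGCGTT
```

<p>A sequence containing ambiguous bases (for example N) would raise a KeyError in this sketch; how such bases should be handled is not specified in the text.</p>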
<p>With a chosen value of
<italic>k</italic>, we first compute the number of times each substring occurs in the genome. This gives a vector
<italic>N</italic> of length 4
<sup><italic>k</italic>
</sup>, where each entry
<italic>N(W)</italic> is the number of times word
<italic>W</italic> occurs in the genome sequence. Next, we normalize the
<italic>k</italic>-mer frequencies using a zero-order Markov model, which removes biases from the background nucleotide composition and can be effective for analysis of prokaryotic genomes (
<xref rid="ref-27" ref-type="bibr">Pride
<italic>et al.</italic>
, 2003</xref>;
<xref rid="ref-28" ref-type="bibr">Pride
<italic>et al.</italic>
, 2006</xref>). Normalization accounts for the fact that GC-rich genomes are expected to have more GC-rich
<italic>k</italic>-mers simply because of the available nucleotide composition. Dividing the observed counts of
<italic>k</italic>-mers by the expected counts highlights
<italic>k</italic>
-mer usage that can differentiate between mycobacteriophage genomes.</p>
<p>The expected number of a
<italic>k</italic>-mer
<italic>W</italic>
 given the background nucleotide distribution is calculated by:</p>
<p>     
<italic>E</italic>(
<italic>W</italic>) = [(
<italic>A</italic>
<sup><italic>a</italic>
</sup> *
<italic>T</italic>
<sup><italic>t</italic>
</sup> *
<italic>C</italic>
<sup><italic>c</italic>
</sup> *
<italic>G</italic>
<sup><italic>g</italic>
</sup>) *
<italic>N</italic>
]</p>
<p>where
<italic>A,T,C,G</italic> are the frequencies of the four nucleotides in the genome,
<italic>a,t,c,g</italic> are the number of times each nucleotide occurs in the
<italic>k</italic>-mer
<italic>W,</italic> and
<italic>N</italic>
 is the length of the genome (distinct from the count vector <italic>N(W)</italic> defined above).</p>
<p>The normalized value for a word
<italic>W</italic> is calculated by dividing the observed count by the expected count. Collecting these values for all words gives the usage deviation vector for a genome; for
<italic>k</italic>
=4, this is the tetranucleotide usage deviation (TUD):</p>
<p>     
<italic>TUD(W) = N(W)/E(W)</italic>
</p>
<p>An example of calculating TUD values for a short sequence is given in
<xref ref-type="fig" rid="f1">Figure 1</xref>. TUD is equivalent to the “tetranucleotide usage departures from expectation” measure proposed by
<xref rid="ref-27" ref-type="bibr">Pride
<italic>et al.</italic>
 (2003)</xref>
. For a given 4-mer, a TUD value of one corresponds to the expected usage, while a value of two corresponds to usage twice as frequently as expected.</p>
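<p>The observed-over-expected calculation defined above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function name ‘tud’ is hypothetical, the sequence passed in is taken as given (any reverse-complement extension would be applied by the caller), and words with an expected count of zero are reported as 0.0, a choice the text does not specify.</p>

```python
from collections import Counter
from itertools import product


def tud(genome, k=4):
    """Observed / expected k-mer counts under a zero-order Markov model."""
    n = len(genome)
    if n == 0:
        raise ValueError("empty sequence")
    # Background frequency of each nucleotide in the sequence.
    freq = {base: genome.count(base) / n for base in "ATCG"}
    # Observed counts N(W), using overlapping windows of length k.
    observed = Counter(genome[i:i + k] for i in range(n - k + 1))
    result = {}
    for word in ("".join(p) for p in product("ATCG", repeat=k)):
        # E(W) = A^a * T^t * C^c * G^g * N, built one base at a time.
        expected = n
        for base in word:
            expected *= freq[base]
        # Assumption: a word that cannot occur (expected == 0) gets 0.0.
        result[word] = observed[word] / expected if expected else 0.0
    return result


values = tud("ATATATAT", k=2)
print(values["AT"])  # 2.0
print(values["TA"])  # 1.5
```

<p>In this toy sequence A and T each have frequency 0.5, so every dinucleotide made of A and T is expected 8 * 0.25 = 2 times. AT is observed four times and TA three times, giving deviations of 2.0 and 1.5.</p>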
<fig fig-type="figure" id="f1" orientation="portrait" position="float"><label>Figure 1. </label>
<caption><title>Example of calculating TUD for an input sequence of 10 bases.</title>
</caption>
<graphic xlink:href="f1000research-4-7828-g0000"></graphic>
</fig>
<sec><title>Data filtering</title>
<p>Phage genomic sequences are extended by the reverse complement before calculation, so a given tetranucleotide and its reverse complement have redundant values. One member of each redundant pair was removed before distance calculations and Principal Components Analysis (PCA). We also removed tetranucleotides that did not occur at least once in every phage genome. Only ATAT and AATT were removed by this filter.</p>
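<p>The two filters can be sketched as below. The rule used to pick which member of a reverse-complement pair to keep (the one that sorts first alphabetically) is our assumption; the text only states that one of the two was removed. Function names and the dictionary layout of the count tables are likewise hypothetical.</p>

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_words(k):
    """Keep one word from each reverse-complement pair.

    Assumption: the alphabetically smaller member is kept. Words that are
    their own reverse complement have no partner and are always kept.
    """
    words = ["".join(p) for p in product("ATCG", repeat=k)]
    return [w for w in words if min(w, reverse_complement(w)) == w]


def present_in_all(words, count_tables):
    """Keep only words with a non-zero count in every genome's count table."""
    return [w for w in words if all(t.get(w, 0) for t in count_tables)]


print(len(canonical_words(4)))  # 136
```

<p>Of the 256 tetranucleotides, 16 are their own reverse complement and the remaining 240 form 120 pairs, which leaves 136 words before the presence filter is applied.</p>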
</sec>
<sec><title>Comparison of phage genomes</title>
<p>To compare phage genomes in an alignment-free way, we calculated the Euclidean distance between usage deviation vectors. In the case of
<italic>k</italic>=4 for a pair of TUD vectors from genomes
<italic>x</italic> and
<italic>y</italic>
:</p>
<p><disp-formula id="math1"><mml:math id="M1"><mml:mrow><mml:msub><mml:mi>d</mml:mi>
<mml:mrow><mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msqrt><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover><mml:mo>∑</mml:mo>
<mml:mrow><mml:mi>W</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:msup><mml:mn>4</mml:mn>
<mml:mn>4</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mi>T</mml:mi>
<mml:mi>U</mml:mi>
<mml:msub><mml:mi>D</mml:mi>
<mml:mi>x</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>U</mml:mi>
<mml:msub><mml:mi>D</mml:mi>
<mml:mi>y</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where individual 4-mers are indexed by integers ranging from 1 to 4
<sup>4</sup>
.</p>
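<p>As a small illustration of the distance defined above, the sketch below computes the Euclidean distance between two usage deviation vectors and assembles a pairwise distance matrix. The representation of each vector as a dictionary keyed by word, and the function names, are assumptions made for this example.</p>

```python
import math


def euclidean(x, y, words):
    """Euclidean distance between two usage deviation vectors."""
    return math.sqrt(sum((x[w] - y[w]) ** 2 for w in words))


def distance_matrix(vectors, words):
    """Pairwise distances between all genomes, ordered by genome name."""
    names = sorted(vectors)
    matrix = [[euclidean(vectors[a], vectors[b], words) for b in names]
              for a in names]
    return names, matrix


# Toy two-word vectors for two hypothetical genomes.
vectors = {"x": {"AA": 1.0, "AT": 2.0}, "y": {"AA": 4.0, "AT": 6.0}}
names, matrix = distance_matrix(vectors, ["AA", "AT"])
print(names)         # ['x', 'y']
print(matrix[0][1])  # 5.0
```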
<p>Computing pairwise distances between all usage deviation vectors produced a distance matrix used for tree building. For analysis of the subset of 60 phages in
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
 (2010)</xref>, we used the SplitsTree program (
<xref rid="ref-16" ref-type="bibr">Huson & Bryant, 2006</xref>
) to construct neighbor joining phylogenetic trees. This was done to facilitate easy comparisons between previously published figures and our alignment-free trees. Hierarchical clustering using the “average” method within the statistical programming language R (version 3.1.0) was used to construct dendrograms for analyzing the entire phage database.</p>
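<p>To make the “average” method concrete, the following is a naive pure-Python sketch of average-linkage agglomerative clustering on a distance matrix. It is a stand-in for illustration only: the analysis in this paper used R, and this sketch does not reproduce R's output format or tie-breaking. The function name and the nested-parenthesis tree string are hypothetical.</p>

```python
def average_linkage(names, dist):
    """Repeatedly merge the two clusters with the smallest mean distance.

    'dist' is a square matrix (list of lists) indexed like 'names'.
    Returns a nested-parenthesis tree string and the list of merges.
    """
    # Each cluster is (label, list of original indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    merges = []
    while len(clusters) != 1:
        candidates = []
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma, mb = clusters[a][1], clusters[b][1]
                # Average of all pairwise distances between the two clusters.
                mean = sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))
                candidates.append((mean, a, b))
        mean, a, b = min(candidates)
        label = "(" + clusters[a][0] + "," + clusters[b][0] + ")"
        merges.append((clusters[a][0], clusters[b][0], mean))
        merged = (label, clusters[a][1] + clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0][0], merges


tree, merges = average_linkage(["A", "B", "C"],
                               [[0, 1, 4], [1, 0, 5], [4, 5, 0]])
print(tree)    # (C,(A,B))
print(merges)  # [('A', 'B', 1.0), ('C', '(A,B)', 4.5)]
```

<p>This version recomputes every cluster-to-cluster average at each step, which is fine for a toy example but slower than library implementations on hundreds of genomes.</p>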
</sec>
<sec><title>Principal components analysis</title>
<p>PCA was used to visualize relationships between phage genomes in lower-dimensional space. PCA was done on log-transformed data in R using the ‘prcomp’ function and results were plotted using the ‘ggbiplot’ package.</p>
</sec>
<sec><title>Within-genome comparisons</title>
<p>To compare tetranucleotide usage within a phage genome, we used a sliding window of 2000bp (500bp step size). This window size was selected to balance two factors: a short window can detect differences in small regions, while a longer window is necessary to encounter the majority of tetranucleotides. 4-mers were counted and normalized to the nucleotide composition of a given window. A distance matrix was constructed from pairwise Euclidean distances of all windows and used to build heatmaps. Parts of the heatmap where windows overlapped were removed before plotting, leading to the white section along the diagonal in
<xref ref-type="fig" rid="f5">Figure 5</xref>
.</p>
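<p>The windowing scheme can be sketched as follows. The defaults match the 2000bp window and 500bp step described above; the function names are hypothetical, and dropping a final partial window is our assumption, since the text does not say how the end of the genome is handled.</p>

```python
def sliding_windows(genome, size=2000, step=500):
    """Return (start, subsequence) pairs for full-length windows only.

    Assumption: a trailing window shorter than 'size' is discarded.
    """
    return [(start, genome[start:start + size])
            for start in range(0, len(genome) - size + 1, step)]


def windows_overlap(start_a, start_b, size=2000):
    """True when two windows of equal size share at least one base."""
    return abs(start_a - start_b) < size


windows = sliding_windows("A" * 3000)
print([start for start, _ in windows])  # [0, 500, 1000]
print(windows_overlap(0, 1500))         # True
print(windows_overlap(0, 2000))         # False
```

<p>Each window would then be normalized to its own nucleotide composition, and pairs for which ‘windows_overlap’ is true correspond to the entries removed from the heatmap.</p>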
</sec>
</sec>
<sec sec-type="results"><title>Results</title>
<sec><title>Mycobacteriophage genomes have heterogeneous, yet clustered tetranucleotide usage</title>
<p>First, we investigated whether TUD reflects relationships described from alignment-based analysis of phage genomes. In particular, does a grouping scheme based on tetranucleotide usage agree with previously assigned phage clusters? To answer this question, we examined a subset of 60 mycobacteriophages first analyzed by
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
 (2010)</xref>, where the authors propose a clustering scheme based on dot-plot comparisons, pairwise average nucleotide identities, pairwise genome maps and gene content analysis. We calculated the pairwise Euclidean distances between TUD vectors for the subset of 60 phages and used the SplitsTree program (
<xref rid="ref-16" ref-type="bibr">Huson & Bryant, 2006</xref>) to construct a neighbor joining tree (
<xref ref-type="fig" rid="f2">Figure 2a</xref>). Our alignment-free tree has a striking resemblance to the tree from
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
 (2010)</xref>, which is constructed from similarities in genomic architecture (
<xref ref-type="fig" rid="f2">Figure 2b</xref>
). In every case, phages are placed in a monophyletic clade with members of their subcluster.</p>
<fig fig-type="figure" id="f2" orientation="portrait" position="float"><label>Figure 2. </label>
<caption><title>TUD captures similarity within mycobacteriophage subclusters.</title>
<p><bold>a</bold>) Neighbor joining phylogenetic tree constructed from pairwise Euclidean distances between TUD vectors for 60 mycobacteriophage genomes. Phage names are colored based on previously assigned cluster information.
<bold>b</bold>) Neighbor joining phylogenetic tree constructed from gene presence data in mycobacteriophage genomes. Reproduced with permission from Figure 3 in
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
 (2010)</xref>
. The TUD tree is similar to the alignment-based tree. Phages from the same subcluster form monophyletic clades. In clusters C, F and H, subclusters from the same parent cluster form monophyletic clades.</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0001"></graphic>
</fig>
<p>Hierarchically grouping phages into clusters and subclusters represents heterogeneity within the mycobacteriophage population. In the alignment-free tree, subclusters from parent clusters C, H and F are placed in a monophyletic clade. However, in some cases, tetranucleotide usage was vastly different between subclusters of a parent cluster. For example, subcluster B3 phages are most similar to cluster A phages in terms of tetranucleotide usage, but they are similar to other cluster B genomes when compared on genetic elements (
<xref ref-type="fig" rid="f2">Figure 2</xref>). We investigate this relationship further in a following section. Importantly, the relationships between the subset of 60 phages are consistent for varying values of
<italic>k</italic> (
<xref ref-type="fig" rid="f3">Figure 3</xref>
).</p>
<fig fig-type="figure" id="f3" orientation="portrait" position="float"><label>Figure 3. </label>
<caption><title>Changing k does not change the structure of the tree.</title>
<p>Neighbor joining phylogenetic trees constructed from pairwise Euclidean distances between oligonucleotide usage deviation vectors for 60 mycobacteriophage genomes. Trees from k equal to two, five and seven are shown here. Trees show a high degree of similarity regardless of the k used. Trends observed in the tetranucleotide usage based tree (
<xref ref-type="fig" rid="f2">Figure 2</xref>
), such as grouping of subcluster members into monophyletic clades, are conserved in these trees.</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0002"></graphic>
</fig>
<p>Hundreds of mycobacteriophages have been sequenced in the past few years, bringing the total to 663 genomes (
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org/">PhagesDB.org</ext-link> as of April 2014), 21 clusters and 48 subclusters. We next examined TUD patterns in the entire database to see whether the relationships observed for the subset of 60 phages were conserved. We used hierarchical clustering within R to analyze this larger dataset (see Methods). As observed for the subset of 60, almost all phages are grouped closely with members of their subcluster. For each of clusters F, C, D, M and L, the subclusters together form a monophyletic clade (
<xref ref-type="fig" rid="sf1">Supplementary Figure 1</xref>
). The relationships for cluster B genomes are also conserved – genomes within a given B subcluster are similar, but the subclusters themselves are different and placed in separate sections of the dendrogram.</p>
</sec>
<sec><title>Principal components analysis captures variation in tetranucleotide usage</title>
<p>We further investigated the ability of TUD to differentiate between predetermined phage clusters using PCA. PCA is useful for visualizing TUD, a 256-dimensional vector, in intuitive 2D space. PCA was applied to log-transformed TUD vectors for all 663 genomes. The first three principal components captured 29.3%, 15.6% and 12.9% of the variance, respectively. Comparing PC1 and PC2 highlighted groups of phage that corresponded well with assigned clusters (
<xref ref-type="fig" rid="f4">Figure 4a</xref>
). Clusters that were similar in PC1/PC2 space could be separated further by including additional PCs.</p>
<fig fig-type="figure" id="f4" orientation="portrait" position="float"><label>Figure 4. </label>
<caption><title>PCA differentiates between clusters and subclusters.</title>
<p><bold>a</bold>) Principal components analysis of all 663 mycobacteriophage genomes. Individual clusters of phages are well separated by PC1 and PC2 in most cases. Further separation can be achieved by incorporating additional principal components.
<bold>b</bold>
) Principal components analysis of cluster B phages. Individual subclusters are well separated. The outlier in B4 is KayaCho, a phage with different tetranucleotide usage but similar genome architecture when compared with other B4 phages.</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0003"></graphic>
</fig>
<p>PCA was also useful to compare phages within a single cluster. When comparing cluster B phages, the first three components captured 44.6%, 31.4% and 10.3% of the variance present, respectively. Phage subclusters typically group tightly with each other in PC-space, which makes it easy to detect outliers in terms of TUD. A single member of B4, KayaCho, is placed far from the other genomes of that subcluster (
<xref ref-type="fig" rid="f4">Figure 4b</xref>
). This indicates that KayaCho is dissimilar from other B4 phages, a finding that is supported through other methods of comparison. For example, KayaCho has a similar global genome architecture to other members of B4, but pairwise nucleotide identity is low in relation to other comparisons within the subcluster. TUD provides a quick and alignment-free way to detect genomes that are outliers within a subcluster.</p>
</sec>
<sec><title>Mycobacteriophage genomes have self-similar tetranucleotide usage, but some regions are outliers</title>
<p>Mycobacteriophage genomes are mosaic and heavily influenced by horizontal gene transfer (HGT) (
<xref rid="ref-26" ref-type="bibr">Pedulla
<italic>et al.</italic>
, 2003</xref>
). We looked for sections within a phage genome that stood out in TUD as potential candidates for HGT events. Tetranucleotide usage was calculated in a 2000bp window with a 500bp step size. Heatmaps of pairwise Euclidean distances between all windows were plotted.</p>
<p>Observation of these heatmaps revealed several interesting features. The last 5kb of cluster E phage “244” is self-similar, but different from the rest of the genome in terms of TUD (
<xref ref-type="fig" rid="f5">Figure 5a</xref>). This self-similar segment is present with >97% nucleotide identity in all cluster E phages and could represent an HGT event from a different phage cluster or organism. To search for potential transfer sources of this segment, we compared TUD in the region with other mycobacteriophages and searched for nucleotide similarity with BLAST (nr/nt database, blastn algorithm) (
<xref rid="ref-1" ref-type="bibr">Altschul
<italic>et al.</italic>
, 1997</xref>
). However, we were unable to find regions of considerable homology with either method.</p>
<fig fig-type="figure" id="f5" orientation="portrait" position="float"><label>Figure 5. </label>
<caption><title>TUD highlights putative horizontally transferred segments.</title>
<p>Comparing tetranucleotide usage in a sliding window (2000bp window, 500bp step size) across phage genomes. Each entry in the heatmap is the Euclidean distance between windows.
<bold>a</bold>) 244, a cluster E phage, is relatively self-similar with low distance values (red) between most windows. The last 5kb of the genome is an exception: it is self-similar but different from the rest of the genome. This signature is not driven by repetitive sequences, and represents a putative HGT event.
<bold>b</bold>) UPIE, a cluster L1 phage, also has a self-similar signature at the end of the genome. However, the difference in TUD in this window is driven by two clusters of repetitive k-mers (
<xref ref-type="fig" rid="f6">Figure 6</xref>
).</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0004"></graphic>
</fig>
<p>Cluster L1 phages contain two small self-similar yet genome-different regions at the end of the genome (
<xref ref-type="fig" rid="f5">Figure 5b</xref>). We examined the genome of “UPIE” with the Repfind program (
<xref rid="ref-2" ref-type="bibr">Betley
<italic>et al.</italic>
, 2002</xref>) to search for repetitive sequences that could be driving the change in TUD. There are two blocks of repetitive GC-rich
<italic>k</italic>-mers, from 68650-69050bp and 71100-71900bp, which match the regions in the heatmap (
<xref ref-type="fig" rid="f6">Figure 6</xref>
). As the sliding window moves through each of these blocks, the TUD signal becomes dominated by the repetitive sequence, making the regions appear self-similar yet genome-different. The repetitive features do not preclude the possibility of HGT in the region, but they likely obscure any HGT signal carried by TUD. We found other self-similar yet genome-different repetitive regions in phages from clusters F1, H and O. Although the regions highlighted here vary in GC content, TUD removes biases from nucleotide composition using a zero-order Markov model (see Methods), so differences in TUD are not a result of variations in the underlying GC content.</p>
<fig fig-type="figure" id="f6" orientation="portrait" position="float"><label>Figure 6. </label>
<caption><title>L1 phages contain two clusters of repetitive k-mers.</title>
<p>Two clusters of GC-rich repetitive sequences at the end of the genome of UPIE (cluster L1). The repetitive sequences drive the differences in TUD and correspond with the self-similar yet genome-different sections in the within-genome heatmap (
<xref ref-type="fig" rid="f5">Figure 5</xref>). This image was reconstructed from the output of Repfind (
<xref rid="ref-2" ref-type="bibr">Betley
<italic>et al.</italic>
, 2002</xref>
).</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0005"></graphic>
</fig>
</sec>
<sec><title>B3 phages contain overrepresented 4-mers and 6-mers</title>
<p>Finally, we examined why B3 phages are not placed with other members of cluster B in the hierarchical clustering dendrogram, while most of the other clusters show this relationship. B3 genomes share greater than 60% average nucleotide identity with other members of cluster B. This is comparable with the relationship between B2 and B4 phages, which are placed close to each other in the dendrogram. The difference in TUD is not likely to be driven solely by differences in pairwise nucleotide identity. We investigated the individual
<italic>k</italic>
-mers making up the TUD vector to examine this relationship further.</p>
<p>B3 phages use the 4-mer GATC four times more often than expected by chance, more than any other B subcluster (
<xref ref-type="fig" rid="f7">Figure 7a</xref>). The high abundance of GATC could be driven by a global increase in frequency or by discrete regions with very high usage of the 4-mer. To address this point, we compared normalized GATC usage in a sliding window across all cluster B genomes. GATC usage was increased genome-wide in B3 phages, refuting the hypothesis that the deviation was caused by a single genomic region (
<xref ref-type="fig" rid="f7">Figure 7c</xref>
). This points to a genome-wide amelioration of GATC usage in cluster B3 genomes. Interestingly, some local peaks and valleys in GATC usage are persistent across all cluster B genomes, even though these genomes are unaligned.</p>
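<p>A per-window usage deviation for a single word, as used to distinguish a genome-wide increase from a localized one, might look like the sketch below. The 5kb window and 1kb step follow the Figure 7 caption. Normalizing each window to its own nucleotide composition is our assumption, carried over from the within-genome comparison in Methods; the function names are hypothetical.</p>

```python
def word_deviation(window, word):
    """Observed / expected count of one word within a single window."""
    n = len(window)
    if n == 0:
        return 0.0
    k = len(word)
    # Count overlapping occurrences explicitly (str.count skips overlaps).
    observed = sum(1 for i in range(n - k + 1) if window[i:i + k] == word)
    # Zero-order expectation from this window's own base frequencies.
    expected = n
    for base in word:
        expected *= window.count(base) / n
    return observed / expected if expected else 0.0


def deviation_profile(genome, word, size=5000, step=1000):
    """(start, deviation) for each full-length window along the genome."""
    return [(start, word_deviation(genome[start:start + size], word))
            for start in range(0, len(genome) - size + 1, step)]


print(word_deviation("GATCGATC", "GATC"))  # 64.0
```

<p>A profile that stays elevated across all windows is consistent with a genome-wide increase, whereas a single sharp peak would point to one region driving the signal.</p>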
<fig fig-type="figure" id="f7" orientation="portrait" position="float"><label>Figure 7. </label>
<caption><title>GATC and GGATCC are overrepresented in B3 phages.</title>
<p><bold>a</bold>) Density plot of TUD values for the 4-mer GATC. Individual subclusters form well-defined groups. B3 phages have GATC usage four times what is expected, much higher than other B subclusters.
<bold>b</bold>) Repeat of
<bold>(a)</bold> with the 6-mer GGATCC. B3 phages use this 6-mer more than four times as often as expected.
<bold>c</bold>) GATC usage deviation in a sliding window (5kb, 1kb step size). Each line represents the mean value in the specified subcluster. The increase in GATC usage is genome-wide, indicative of a global change in usage frequency.
<bold>d</bold>) Repeat of
<bold>(c)</bold>
 with the 6-mer GGATCC. Increased usage is also genome-wide.</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0006"></graphic>
</fig>
<p>Given the genome-wide increase in B3 GATC usage, it is possible that a higher-order signal could be driving the trend. We searched for highly used 6-mers in B3 phages and found GGATCC had a usage deviation value greater than four, while all other B genomes had a value less than one (
<xref ref-type="fig" rid="f7">Figure 7b</xref>). This increase was also genome-wide (
<xref ref-type="fig" rid="f7">Figure 7d</xref>). GATC and GGATCC are both palindromes, DNA sequences with identical reverse complements. Palindromes are typically underrepresented in bacteriophage and other prokaryotic genomes because they can be parts of recognition sites for restriction enzymes (
<xref rid="ref-10" ref-type="bibr">Gelfand & Koonin, 1997</xref>;
<xref rid="ref-18" ref-type="bibr">Karlin
<italic>et al.</italic>
, 1992</xref>;
<xref rid="ref-31" ref-type="bibr">Sharp, 1986</xref>
).</p>
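<p>The definition of a palindrome used here (a sequence identical to its reverse complement) is easy to check programmatically. The sketch below uses hypothetical function names and confirms that GATC and GGATCC satisfy it.</p>

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindrome(word):
    """A DNA palindrome reads the same as its reverse complement."""
    return word == reverse_complement(word)


def palindromic_words(k):
    """All words of length k that are their own reverse complement."""
    return [w for w in ("".join(p) for p in product("ATCG", repeat=k))
            if is_palindrome(w)]


print(is_palindrome("GATC"))      # True
print(is_palindrome("GGATCC"))    # True
print(len(palindromic_words(4)))  # 16
```

<p>Only 16 of the 256 tetranucleotides are palindromic, because the first two bases fully determine the last two.</p>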
<p>GATC is recognized by Dam methylase in
<italic>E. coli</italic> (
<xref rid="ref-22" ref-type="bibr">Marinus & Morris, 1973</xref>), but
<italic>Mycobacterium</italic> species do not encode Dam methylase (
<xref rid="ref-14" ref-type="bibr">Hemavathy & Nagaraja, 1995</xref>). If B3 phages recently accessed a host with an active Dam methylase, it could lead to a change in GATC frequency. Several restriction enzymes recognize GATC, like
<italic>Mgo</italic>I in
<italic>Mycobacterium gordonae</italic> (
<xref rid="ref-30" ref-type="bibr">Shankar & Tyagi, 1993</xref>), while others recognize GGATCC, such as
<italic>Bam</italic>HI in
<italic>Bacillus amyloliquefaciens</italic>
. However, the presence of a restriction/modification system in a host would theoretically lead to a decrease in usage of the recognized site. The finding that GATC and GGATCC occur in B3 genomes four times more than expected and significantly more frequently than in all other sequenced mycobacteriophages bears further investigation.</p>
</sec>
</sec>
<sec sec-type="discussion"><title>Discussion</title>
<p>In 2010, there were 60 sequenced mycobacteriophages. There are more than 660 as of April 2014. Alignment-based methods have been used to investigate the mycobacteriophage population, leading to interesting characterizations, such as hierarchical grouping into clusters and subclusters. However, as the number of published genomes continues to grow, there is a need for methods to quickly analyze the entire database of mycobacteriophage sequences.</p>
<p>Throughout this paper, we apply oligonucleotide usage methods to uncover relationships within the population of sequenced mycobacteriophages. These methods allow phage genomes to be compared independently of sequence alignment or genome annotation. The methods for counting
<italic>k</italic>-mer usage and normalizing to expected counts are simple to implement and compute. A usage deviation value has a clear interpretation: a value of two corresponds to a
<italic>k</italic>
-mer occurring twice as frequently as expected in a randomized genome sequence. Usage deviation vectors are also well-suited to distance computation and PCA.</p>
<p>Our findings support what is known about mycobacteriophage biology. Neighbor joining and hierarchical clustering from TUD place closely related phage in well-defined groups that correspond with assigned phage subclusters. In most cases, TUD supports grouping into larger clusters, such as cluster A, where all 246 members form a monophyletic clade in the hierarchical clustering dendrogram. The fact that members of cluster B do not form a clade in TUD-based comparisons does not invalidate grouping of phage into clusters, but rather serves as a way to highlight phages where TUD and gene or sequence comparisons capture different relationships.</p>
<p>Comparing TUD in a sliding window can highlight regions with dissimilar tetranucleotide composition and identify genomic segments that could have been horizontally transferred. We found self-similar yet genome-different regions at the end of cluster E and L genomes. The new TUD ‘space’ occupied by these segments could be from HGT – a recently transferred genomic section that had not yet ameliorated to the average genome TUD profile. At least for cluster L, we can say that HGT is likely not the cause. Two groups of repetitive sequences at the end of the genome are driving the difference in TUD. However, we found neither repetitive sequences nor a putative transfer candidate for the segment in cluster E. An improvement on our method could potentially detect legitimate HGT events, but we note that the concept of phams (
<xref rid="ref-12" ref-type="bibr">Hatfull
<italic>et al.</italic>
, 2010</xref>) and the computer program Phamerator (
<xref rid="ref-6" ref-type="bibr">Cresawn
<italic>et al.</italic>
, 2011</xref>
) are already efficient for detecting and visualizing these features.</p>
<p>TUD vectors are similar between subcluster B3 phages but different from other members of cluster B. We found that the 4-mer GATC and 6-mer GGATCC were present over four times more than expected in B3 genomes. These sequences are palindromes and part of recognition sites for restriction enzymes, two characteristics of sequences that are typically underrepresented in prokaryotic genomes. GATC and GGATCC are highly used in all sections of B3 genomes, pointing to genome-wide amelioration of usage frequencies.</p>
<p>Oligonucleotide composition methods do not require knowledge of sequence alignment or gene content. They are ideal to compare mycobacteriophage genomes, which lack a common subsequence on which to make alignment-based inference. Alignment-free methods are also valuable when a reference sequence is not available. Recently, methods based on tetranucleotide usage were used to investigate sequences from a gut microbiome and uncovered a population of
<italic>Bacteroidales</italic>-like phage that was previously unrepresented in metagenomic sequencing datasets (
<xref rid="ref-25" ref-type="bibr">Ogilvie
<italic>et al.</italic>
, 2013</xref>). Statistics based on oligonucleotide usage are part of a broader class of alignment-free methods. These methods are easy to compute across large datasets: constructing the dendrogram in
<xref ref-type="other" rid="sf1">Supplementary Figure 1</xref> from raw phage sequences takes less than two minutes on a personal laptop. By comparison, creating phylogenetic trees from pairwise global sequence alignment with the Needleman-Wunsch algorithm (
<xref rid="ref-24" ref-type="bibr">Needleman & Wunsch, 1970</xref>
) takes over 24 hours on a computing cluster. We envision oligonucleotide usage methods being used alongside alignment-based techniques. These methods make it easy to highlight large trends and outliers, but sequence alignment and gene annotation are still needed to extract biological insights from the data.</p>
<sec><title>Data and software availability</title>
<p>The genomic sequences of all 663 sequenced mycobacteriophages are publicly available on the website
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org/">PhagesDB.org</ext-link>
 as of April 2014. The authors obtained permission to use the data.</p>
</sec>
<sec><title>Software access</title>
<p>The code to download mycobacteriophage genome sequences and reproduce our analysis is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/bsiranosian/tango_final">https://github.com/bsiranosian/tango_final</ext-link>. Mycobacteriophage genome sequences are available at
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org">http://phagesdb.org</ext-link>
.</p>
</sec>
<sec><title>Latest source code</title>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/bsiranosian/tango_final">https://github.com/bsiranosian/tango_final</ext-link>
</p>
</sec>
<sec><title>Source code as at the time of publication</title>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/F1000Research/tango_final">https://github.com/F1000Research/tango_final</ext-link>
</p>
</sec>
<sec><title>Archived source code as at the time of publication</title>
<p><ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.14609">http://dx.doi.org/10.5281/zenodo.14609</ext-link> (
<xref rid="ref-34" ref-type="bibr">Siranosian
<italic>et al.</italic>
, 2015b</xref>
).</p>
</sec>
</sec>
</body>
<back><ack><title>Acknowledgments</title>
<p>We would like to thank Sarah Taylor for instructing the Brown University Phage Hunters course and for her assistance during the development and presentation of this work. The present manuscript benefited from helpful comments by Dr. Graham Hatfull. We would also like to thank the hundreds of students from schools participating in the SEA-PHAGES program who have isolated, characterized and purified the mycobacteriophages we analyzed. Finally, we are deeply grateful to the SEA-PHAGES program and Howard Hughes Medical Institute for providing the resources to sequence hundreds of mycobacteriophage genomes, and
<ext-link ext-link-type="uri" xlink:href="http://phagesdb.org/">PhagesDB.org</ext-link>
 for providing access to the unpublished material that formed the base of this work.</p>
</ack>
<sec sec-type="supplementary-material"><title>Supplementary material</title>
<fig fig-type="figure" id="sf1" orientation="portrait" position="float"><label>Supplementary Figure 1. </label>
<caption><title>Hierarchical clustering of all 663 phage genomes.</title>
<p>Hierarchical clustering dendrogram constructed on pairwise Euclidean distances between all 663 phages in the mycobacteriophage database. In almost every case, phages are placed in a monophyletic clade with members of their subcluster, highlighting the concordance between alignment-based and alignment-free methods for comparing these genomes. Some clusters (F, C, D, M and L) form monophyletic clades, while others (B, for example) are grouped in different parts of the dendrogram. A larger version of this figure can be downloaded from
<supplementary-material content-type="local-data" id="S0"><media xlink:href="f1000research-4-7828-s0000.tgz"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
.</p>
</caption>
<graphic xlink:href="f1000research-4-7828-g0007"></graphic>
</fig>
</sec>
<ref-list><ref id="ref-1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name><surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</article-title>
<source><italic>Nucleic Acids Res.</italic>
</source>
<year>1997</year>
;<volume>25</volume>
(<issue>17</issue>
):<fpage>3389</fpage>
–<lpage>3402</lpage>.
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pmc-comment>146917</pmc-comment>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="ref-2"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Betley</surname>
<given-names>JN</given-names>
</name>
<name><surname>Frith</surname>
<given-names>MC</given-names>
</name>
<name><surname>Graber</surname>
<given-names>JH</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>A ubiquitous and conserved signal for RNA localization in chordates.</article-title>
<source><italic>Curr Biol.</italic>
</source>
<year>2002</year>
;<volume>12</volume>
(<issue>20</issue>
):<fpage>1756</fpage>
–<lpage>1761</lpage>.
<pub-id pub-id-type="doi">10.1016/S0960-9822(02)01220-4</pub-id>
<pub-id pub-id-type="pmid">12401170</pub-id>
</mixed-citation>
</ref>
<ref id="ref-3"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bohannan</surname>
<given-names>BJM</given-names>
</name>
<name><surname>Lenski</surname>
<given-names>RE</given-names>
</name>
</person-group>:
<article-title>Linking genetic change to community evolution: insights from studies of bacteria and bacteriophage.</article-title>
<source><italic>Ecology Letters.</italic>
</source>
<year>2000</year>
;<volume>3</volume>
(<issue>4</issue>
):<fpage>362</fpage>
–<lpage>377</lpage>.
<pub-id pub-id-type="doi">10.1046/j.1461-0248.2000.00161.x</pub-id>
</mixed-citation>
</ref>
<ref id="ref-4"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chan</surname>
<given-names>CX</given-names>
</name>
<name><surname>Ragan</surname>
<given-names>MA</given-names>
</name>
</person-group>:
<article-title>Next-generation phylogenomics.</article-title>
<source><italic>Biol Direct.</italic>
</source>
<year>2013</year>
;<volume>8</volume>
:<fpage>3</fpage>.
<pub-id pub-id-type="doi">10.1186/1745-6150-8-3</pub-id>
<pmc-comment>3564786</pmc-comment>
<pub-id pub-id-type="pmid">23339707</pub-id>
</mixed-citation>
</ref>
<ref id="ref-5"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chibani-Chennoufi</surname>
<given-names>S</given-names>
</name>
<name><surname>Bruttin</surname>
<given-names>A</given-names>
</name>
<name><surname>Dillmann</surname>
<given-names>ML</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Phage-host interaction: an ecological perspective.</article-title>
<source><italic>J Bacteriol.</italic>
</source>
<year>2004</year>
;<volume>186</volume>
(<issue>12</issue>
):<fpage>3677</fpage>
–<lpage>3686</lpage>.
<pub-id pub-id-type="doi">10.1128/JB.186.12.3677-3686.2004</pub-id>
<pmc-comment>419959</pmc-comment>
<pub-id pub-id-type="pmid">15175280</pub-id>
</mixed-citation>
</ref>
<ref id="ref-6"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cresawn</surname>
<given-names>SG</given-names>
</name>
<name><surname>Bogel</surname>
<given-names>M</given-names>
</name>
<name><surname>Day</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Phamerator: a bioinformatic tool for comparative bacteriophage genomics.</article-title>
<source><italic>BMC Bioinformatics.</italic>
</source>
<year>2011</year>
;<volume>12</volume>
:<fpage>395</fpage>.
<pub-id pub-id-type="doi">10.1186/1471-2105-12-395</pub-id>
<pmc-comment>3233612</pmc-comment>
<pub-id pub-id-type="pmid">21991981</pub-id>
</mixed-citation>
</ref>
<ref id="ref-7"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Danelishvili</surname>
<given-names>L</given-names>
</name>
<name><surname>Young</surname>
<given-names>LS</given-names>
</name>
<name><surname>Bermudez</surname>
<given-names>LE</given-names>
</name>
</person-group>:
<article-title><italic>In vivo</italic> efficacy of phage therapy for
<italic>Mycobacterium avium</italic>
 infection as delivered by a nonvirulent mycobacterium.</article-title>
<source><italic>Microb Drug Resist.</italic>
</source>
<year>2006</year>
;<volume>12</volume>
(<issue>1</issue>
):<fpage>1</fpage>
–<lpage>6</lpage>.
<pub-id pub-id-type="doi">10.1089/mdr.2006.12.1</pub-id>
<pub-id pub-id-type="pmid">16584300</pub-id>
</mixed-citation>
</ref>
<ref id="ref-8"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Doolittle</surname>
<given-names>WF</given-names>
</name>
</person-group>:
<article-title>Phylogenetic classification and the universal tree.</article-title>
<source><italic>Science.</italic>
</source>
<year>1999</year>
;<volume>284</volume>
(<issue>5423</issue>
):<fpage>2124</fpage>
–<lpage>2129</lpage>.
<pub-id pub-id-type="doi">10.1126/science.284.5423.2124</pub-id>
<pub-id pub-id-type="pmid">10381871</pub-id>
</mixed-citation>
</ref>
<ref id="ref-9"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Frith</surname>
<given-names>MC</given-names>
</name>
<name><surname>Hamada</surname>
<given-names>M</given-names>
</name>
<name><surname>Horton</surname>
<given-names>P</given-names>
</name>
</person-group>:
<article-title>Parameters for accurate genome alignment.</article-title>
<source><italic>BMC Bioinformatics.</italic>
</source>
<year>2010</year>
;<volume>11</volume>
:<fpage>80</fpage>.
<pub-id pub-id-type="doi">10.1186/1471-2105-11-80</pub-id>
<pmc-comment>2829014</pmc-comment>
<pub-id pub-id-type="pmid">20144198</pub-id>
</mixed-citation>
</ref>
<ref id="ref-10"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gelfand</surname>
<given-names>MS</given-names>
</name>
<name><surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>:
<article-title>Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes.</article-title>
<source><italic>Nucleic Acids Res.</italic>
</source>
<year>1997</year>
;<volume>25</volume>
(<issue>12</issue>
):<fpage>2430</fpage>
–<lpage>2439</lpage>.
<pub-id pub-id-type="doi">10.1093/nar/25.12.2430</pub-id>
<pmc-comment>1995031</pmc-comment>
<pub-id pub-id-type="pmid">9171096</pub-id>
</mixed-citation>
</ref>
<ref id="ref-11"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hacker</surname>
<given-names>J</given-names>
</name>
<name><surname>Kaper</surname>
<given-names>JB</given-names>
</name>
</person-group>:
<article-title>Pathogenicity islands and the evolution of microbes.</article-title>
<source><italic>Annu Rev Microbiol.</italic>
</source>
<year>2000</year>
;<volume>54</volume>
:<fpage>641</fpage>
–<lpage>679</lpage>.
<pub-id pub-id-type="doi">10.1146/annurev.micro.54.1.641</pub-id>
<pub-id pub-id-type="pmid">11018140</pub-id>
</mixed-citation>
</ref>
<ref id="ref-12"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hatfull</surname>
<given-names>GF</given-names>
</name>
<name><surname>Jacobs-Sera</surname>
<given-names>D</given-names>
</name>
<name><surname>Lawrence</surname>
<given-names>JG</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Comparative genomic analysis of 60 Mycobacteriophage genomes: genome clustering, gene acquisition, and gene size.</article-title>
<source><italic>J Mol Biol.</italic>
</source>
<year>2010</year>
;<volume>397</volume>
(<issue>1</issue>
):<fpage>119</fpage>
–<lpage>143</lpage>.
<pub-id pub-id-type="doi">10.1016/j.jmb.2010.01.011</pub-id>
<pmc-comment>2830324</pmc-comment>
<pub-id pub-id-type="pmid">20064525</pub-id>
</mixed-citation>
</ref>
<ref id="ref-13"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hatfull</surname>
<given-names>GF</given-names>
</name>
</person-group>:
<article-title>Mycobacteriophages: windows into tuberculosis.</article-title>
<source><italic>PLoS Pathog.</italic>
</source>
<year>2014</year>
;<volume>10</volume>
(<issue>3</issue>
):<fpage>e1003953</fpage>.
<pub-id pub-id-type="doi">10.1371/journal.ppat.1003953</pub-id>
<pub-id pub-id-type="pmid">24651299</pub-id>
</mixed-citation>
</ref>
<ref id="ref-14"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hemavathy</surname>
<given-names>KC</given-names>
</name>
<name><surname>Nagaraja</surname>
<given-names>V</given-names>
</name>
</person-group>:
<article-title>DNA methylation in mycobacteria: absence of methylation at GATC (Dam) and CCA/TGG (Dcm) sequences.</article-title>
<source><italic>FEMS Immunol Med Microbiol.</italic>
</source>
<year>1995</year>
;<volume>11</volume>
(<issue>4</issue>
):<fpage>291</fpage>
–<lpage>296</lpage>.
<pub-id pub-id-type="doi">10.1111/j.1574-695X.1995.tb00159.x</pub-id>
<pub-id pub-id-type="pmid">8541807</pub-id>
</mixed-citation>
</ref>
<ref id="ref-15"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hendrix</surname>
<given-names>RW</given-names>
</name>
</person-group>:
<article-title>Bacteriophages: evolution of the majority.</article-title>
<source><italic>Theor Popul Biol.</italic>
</source>
<year>2002</year>
;<volume>61</volume>
(<issue>4</issue>
):<fpage>471</fpage>
–<lpage>480</lpage>.
<pub-id pub-id-type="doi">10.1006/tpbi.2002.1590</pub-id>
<pub-id pub-id-type="pmid">12167366</pub-id>
</mixed-citation>
</ref>
<ref id="ref-16"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name><surname>Bryant</surname>
<given-names>D</given-names>
</name>
</person-group>:
<article-title>Application of phylogenetic networks in evolutionary studies.</article-title>
<source><italic>Mol Biol Evol.</italic>
</source>
<year>2006</year>
;<volume>23</volume>
(<issue>2</issue>
):<fpage>254</fpage>
–<lpage>267</lpage>.
<pub-id pub-id-type="doi">10.1093/molbev/msj030</pub-id>
<pub-id pub-id-type="pmid">16221896</pub-id>
</mixed-citation>
</ref>
<ref id="ref-17"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jordan</surname>
<given-names>TC</given-names>
</name>
<name><surname>Burnett</surname>
<given-names>SH</given-names>
</name>
<name><surname>Carson</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>A broadly implementable research course in phage discovery and genomics for first-year undergraduate students.</article-title>
<source><italic>MBio.</italic>
</source>
<year>2014</year>
;<volume>5</volume>
(<issue>1</issue>
):<fpage>e01051</fpage>
–<lpage>13</lpage>.
<pub-id pub-id-type="doi">10.1128/mBio.01051-13</pub-id>
<pmc-comment>3950523</pmc-comment>
<pub-id pub-id-type="pmid">24496795</pub-id>
</mixed-citation>
</ref>
<ref id="ref-18"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Karlin</surname>
<given-names>S</given-names>
</name>
<name><surname>Burge</surname>
<given-names>C</given-names>
</name>
<name><surname>Campbell</surname>
<given-names>AM</given-names>
</name>
</person-group>:
<article-title>Statistical analyses of counts and distributions of restriction sites in DNA sequences.</article-title>
<source><italic>Nucleic Acids Res.</italic>
</source>
<year>1992</year>
;<volume>20</volume>
(<issue>6</issue>
):<fpage>1363</fpage>
–<lpage>1370</lpage>.
<pub-id pub-id-type="doi">10.1093/nar/20.6.1363</pub-id>
<pmc-comment>312184</pmc-comment>
<pub-id pub-id-type="pmid">1313968</pub-id>
</mixed-citation>
</ref>
<ref id="ref-19"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Koski</surname>
<given-names>LB</given-names>
</name>
<name><surname>Morton</surname>
<given-names>RA</given-names>
</name>
<name><surname>Golding</surname>
<given-names>GB</given-names>
</name>
</person-group>:
<article-title>Codon bias and base composition are poor indicators of horizontally transferred genes.</article-title>
<source><italic>Mol Biol Evol.</italic>
</source>
<year>2001</year>
;<volume>18</volume>
(<issue>3</issue>
):<fpage>404</fpage>
–<lpage>412</lpage>.
<pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a003816</pub-id>
<pub-id pub-id-type="pmid">11230541</pub-id>
</mixed-citation>
</ref>
<ref id="ref-20"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lawrence</surname>
<given-names>JG</given-names>
</name>
<name><surname>Hatfull</surname>
<given-names>GF</given-names>
</name>
<name><surname>Hendrix</surname>
<given-names>RW</given-names>
</name>
</person-group>:
<article-title>Imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches.</article-title>
<source><italic>J Bacteriol.</italic>
</source>
<year>2002</year>
;<volume>184</volume>
(<issue>17</issue>
):<fpage>4891</fpage>
–<lpage>4905</lpage>.
<pub-id pub-id-type="doi">10.1128/JB.184.17.4891-4905.2002</pub-id>
<pmc-comment>135278</pmc-comment>
<pub-id pub-id-type="pmid">12169615</pub-id>
</mixed-citation>
</ref>
<ref id="ref-21"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lawrence</surname>
<given-names>JG</given-names>
</name>
<name><surname>Ochman</surname>
<given-names>H</given-names>
</name>
</person-group>:
<article-title>Amelioration of bacterial genomes: rates of change and exchange.</article-title>
<source><italic>J Mol Evol.</italic>
</source>
<year>1997</year>
;<volume>44</volume>
(<issue>4</issue>
):<fpage>383</fpage>
–<lpage>397</lpage>.
<pub-id pub-id-type="doi">10.1007/PL00006158</pub-id>
<pub-id pub-id-type="pmid">9089078</pub-id>
</mixed-citation>
</ref>
<ref id="ref-22"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Marinus</surname>
<given-names>MG</given-names>
</name>
<name><surname>Morris</surname>
<given-names>NR</given-names>
</name>
</person-group>:
<article-title>Isolation of deoxyribonucleic acid methylase mutants of
<italic>Escherichia coli</italic>
 K-12.</article-title>
<source><italic>J Bacteriol.</italic>
</source>
<year>1973</year>
;<volume>114</volume>
(<issue>3</issue>
):<fpage>1143</fpage>
–<lpage>1150</lpage>.
<pmc-comment>285375</pmc-comment>
<pub-id pub-id-type="pmid">4576399</pub-id>
</mixed-citation>
</ref>
<ref id="ref-23"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>McNerney</surname>
<given-names>R</given-names>
</name>
</person-group>:
<article-title>TB: the return of the phage. A review of fifty years of mycobacteriophage research.</article-title>
<source><italic>Int J Tuberc Lung Dis.</italic>
</source>
<year>1999</year>
;<volume>3</volume>
(<issue>3</issue>
):<fpage>179</fpage>
–<lpage>184</lpage>.
<pub-id pub-id-type="pmid">10094316</pub-id>
</mixed-citation>
</ref>
<ref id="ref-24"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Needleman</surname>
<given-names>SB</given-names>
</name>
<name><surname>Wunsch</surname>
<given-names>CD</given-names>
</name>
</person-group>:
<article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins.</article-title>
<source><italic>J Mol Biol.</italic>
</source>
<year>1970</year>
;<volume>48</volume>
(<issue>3</issue>
):<fpage>443</fpage>
–<lpage>453</lpage>.
<pub-id pub-id-type="doi">10.1016/0022-2836(70)90057-4</pub-id>
<pub-id pub-id-type="pmid">5420325</pub-id>
</mixed-citation>
</ref>
<ref id="ref-25"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ogilvie</surname>
<given-names>LA</given-names>
</name>
<name><surname>Bowler</surname>
<given-names>LD</given-names>
</name>
<name><surname>Caplin</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Genome signature-based dissection of human gut metagenomes to extract subliminal viral sequences.</article-title>
<source><italic>Nat Commun.</italic>
</source>
<year>2013</year>
;<volume>4</volume>
:<fpage>2420</fpage>.
<pub-id pub-id-type="doi">10.1038/ncomms3420</pub-id>
<pmc-comment>3778543</pmc-comment>
<pub-id pub-id-type="pmid">24036533</pub-id>
</mixed-citation>
</ref>
<ref id="ref-26"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pedulla</surname>
<given-names>ML</given-names>
</name>
<name><surname>Ford</surname>
<given-names>ME</given-names>
</name>
<name><surname>Houtz</surname>
<given-names>JM</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Origins of highly mosaic mycobacteriophage genomes.</article-title>
<source><italic>Cell.</italic>
</source>
<year>2003</year>
;<volume>113</volume>
(<issue>2</issue>
):<fpage>171</fpage>
–<lpage>182</lpage>.
<pub-id pub-id-type="doi">10.1016/S0092-8674(03)00233-2</pub-id>
<pub-id pub-id-type="pmid">12705866</pub-id>
</mixed-citation>
</ref>
<ref id="ref-27"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name><surname>Meinersmann</surname>
<given-names>RJ</given-names>
</name>
<name><surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases.</article-title>
<source><italic>Genome Res.</italic>
</source>
<year>2003</year>
;<volume>13</volume>
(<issue>2</issue>
):<fpage>145</fpage>
–<lpage>158</lpage>.
<pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
<pmc-comment>420360</pmc-comment>
<pub-id pub-id-type="pmid">12566393</pub-id>
</mixed-citation>
</ref>
<ref id="ref-28"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name><surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name><surname>Ghose</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses.</article-title>
<source><italic>BMC Genomics.</italic>
</source>
<year>2006</year>
;<volume>7</volume>
:<fpage>8</fpage>.
<pub-id pub-id-type="doi">10.1186/1471-2164-7-8</pub-id>
<pmc-comment>1360066</pmc-comment>
<pub-id pub-id-type="pmid">16417644</pub-id>
</mixed-citation>
</ref>
<ref id="ref-29"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sandberg</surname>
<given-names>R</given-names>
</name>
<name><surname>Winberg</surname>
<given-names>G</given-names>
</name>
<name><surname>Bränden</surname>
<given-names>CI</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier.</article-title>
<source><italic>Genome Res.</italic>
</source>
<year>2001</year>
;<volume>11</volume>
(<issue>8</issue>
):<fpage>1404</fpage>
–<lpage>1409</lpage>.
<pub-id pub-id-type="doi">10.1101/gr.186401</pub-id>
<pmc-comment>311094</pmc-comment>
<pub-id pub-id-type="pmid">11483581</pub-id>
</mixed-citation>
</ref>
<ref id="ref-30"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Shankar</surname>
<given-names>S</given-names>
</name>
<name><surname>Tyagi</surname>
<given-names>AK</given-names>
</name>
</person-group>:
<article-title>Purification and characterization of restriction endonuclease
<italic>Mgo</italic>I from
<italic>Mycobacterium gordonae</italic>
.</article-title>
<source><italic>Gene.</italic>
</source>
<year>1993</year>
;<volume>131</volume>
(<issue>1</issue>
):<fpage>153</fpage>
–<lpage>154</lpage>.
<pub-id pub-id-type="doi">10.1016/0378-1119(93)90686-W</pub-id>
<pub-id pub-id-type="pmid">8370536</pub-id>
</mixed-citation>
</ref>
<ref id="ref-31"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sharp</surname>
<given-names>PM</given-names>
</name>
</person-group>:
<article-title>Molecular evolution of bacteriophages: evidence of selection against the recognition sites of host restriction enzymes.</article-title>
<source><italic>Mol Biol Evol.</italic>
</source>
<year>1986</year>
;<volume>3</volume>
(<issue>1</issue>
):<fpage>75</fpage>
–<lpage>83</lpage>.
<pub-id pub-id-type="pmid">2832688</pub-id>
</mixed-citation>
</ref>
<ref id="ref-32"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Simmons</surname>
<given-names>MP</given-names>
</name>
</person-group>:
<article-title>Potential use of host-derived genome signatures to root virus phylogenies.</article-title>
<source><italic>Mol Phylogenet Evol.</italic>
</source>
<year>2008</year>
;<volume>49</volume>
(<issue>3</issue>
):<fpage>969</fpage>
–<lpage>978</lpage>.
<pub-id pub-id-type="doi">10.1016/j.ympev.2008.08.014</pub-id>
<pub-id pub-id-type="pmid">18793737</pub-id>
</mixed-citation>
</ref>
<ref id="ref-33"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Siranosian</surname>
<given-names>B</given-names>
</name>
<name><surname>Herold</surname>
<given-names>E</given-names>
</name>
<name><surname>Williams</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Tetranucleotide usage in mycobacteriophage genomes: alignment-free methods to cluster phage and infer evolutionary relationships.</article-title>
<source><italic>BMC Bioinformatics.</italic>
</source>
<year>2015a</year>
;<volume>16</volume>
(<issue>Suppl 2</issue>
):<fpage>A7</fpage>.
<pub-id pub-id-type="doi">10.1186/1471-2105-16-S2-A7</pub-id>
<pmc-comment>4331797</pmc-comment>
</mixed-citation>
</ref>
<ref id="ref-34"><mixed-citation publication-type="data"><person-group person-group-type="author"><name><surname>Siranosian</surname>
<given-names>B</given-names>
</name>
<name><surname>Perera</surname>
<given-names>S</given-names>
</name>
<name><surname>Williams</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Code to download mycobacteriophage genome sequences.</article-title>
<source><italic>Zenodo.</italic>
</source>
<year>2015b</year>
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.14609">Data Source</ext-link>
</mixed-citation>
</ref>
<ref id="ref-35"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Vinga</surname>
<given-names>S</given-names>
</name>
</person-group>:
<article-title>Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification.</article-title> In:
<italic>Advanced Computational Methods for Biocomputing and Bioimaging</italic>
 (Nova Science Publishers).<year>2007</year>
;<fpage>71</fpage>
–<lpage>107</lpage>.
<ext-link ext-link-type="uri" xlink:href="http://web.ist.utl.pt/susanavinga/VINGA_bookchapter.preprint.pdf">Reference Source</ext-link>
</mixed-citation>
</ref>
<ref id="ref-36"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Waack</surname>
<given-names>S</given-names>
</name>
<name><surname>Keller</surname>
<given-names>O</given-names>
</name>
<name><surname>Asper</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>:
<article-title>Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models.</article-title>
<source><italic>BMC Bioinformatics.</italic>
</source>
<year>2006</year>
;<volume>7</volume>
:<fpage>142</fpage>.
<pub-id pub-id-type="doi">10.1186/1471-2105-7-142</pub-id>
<pmc-comment>1489950</pmc-comment>
<pub-id pub-id-type="pmid">16542435</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
<sub-article id="report13337" article-type="peer-review"><front-stub><article-id pub-id-type="doi">10.5256/f1000research.7828.r13337</article-id>
<title-group><article-title>Referee response for version 2</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Martin</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="r13337a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r13337a1"><label>1</label>
Life and Biomedical Sciences Education, School of Life Sciences, University of Dundee, Dundee, UK</aff>
</contrib-group>
<author-notes><fn fn-type="COI-statement"><p><bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>20</day>
<month>4</month>
<year>2016</year>
</pub-date>
<related-article id="d35e2771" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.6077.2">Version 2</related-article>
<custom-meta-group><custom-meta><meta-name>recommendation</meta-name>
<meta-value>approve</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body><p>The study provides an interesting approach to the evaluation of divergence between the phage genomes. I'm not an expert in this area so come into it with a more general view. I found the revised paper clear and well explained in terms of approach. I agree with the first reviewer that the authors have perhaps been selective in just showing data from a select choice of
<italic>k-</italic>mer values. Expanding the results to show the deviation across the full range of
<italic>k</italic> tested, even if just in summary, would be interesting, though there would be a disparity between odd and even values of
<italic>k</italic> as there are no palindromes with odd 
<italic>k</italic>
.  </p>
<p>A minor issue with regard to the present publication, but which might be worth consideration for future work, is over the TUD metric where the authors compare the observed frequencies to the expected. It is not clear from the study as to the variation one might see in a null model. If TUD is the test statistic of choice, a significance value for the deviation from expected should be determinable empirically by modelling TUD, e.g. where there is a randomly assigned sequence of nucleotides corresponding to the genome of the organism. This could be done by shuffling the whole genome, taking a large sliding window and aggregating these scores (with or without shuffling etc.). A discussion of the significance of the deviation from expected (or the lack of appreciation of it) is worth including in the paper.</p>
<p>It is nice to see the distance measures, but without an estimate of the significance of the deviation from expected values, it becomes difficult to assess the significance of the deviation between genomes. It may be the case that using a significance measure as the distance (a Z-score or equivalent) may produce a different clustering.</p>
<p>I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
</body>
</sub-article>
<sub-article id="report11005" article-type="peer-review"><front-stub><article-id pub-id-type="doi">10.5256/f1000research.7828.r11005</article-id>
<title-group><article-title>Referee response for version 2</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Bonham-Carter</surname>
<given-names>Oliver</given-names>
</name>
<xref ref-type="aff" rid="r11005a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r11005a1"><label>1</label>
College of Information Science & Technology, School of Interdisciplinary Informatics, University of Nebraska, Omaha, NE, USA</aff>
</contrib-group>
<author-notes><fn fn-type="COI-statement"><p><bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>17</day>
<month>11</month>
<year>2015</year>
</pub-date>
<related-article id="d35e2836" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.6077.2">Version 2</related-article>
<custom-meta-group><custom-meta><meta-name>recommendation</meta-name>
<meta-value>approve</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body><p>My initial concerns have been addressed.</p>
<p>I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
</body>
<back><ref-list><title>References</title>
<ref id="rep-ref-11005-1"><label>1</label>
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bonham-Carter</surname>
<given-names>O</given-names>
</name>
<name><surname>Steele</surname>
<given-names>J</given-names>
</name>
<name><surname>Bastola</surname>
<given-names>D</given-names>
</name>
</person-group>:
<article-title>Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.</article-title>
<source><italic>Brief Bioinform</italic>
</source>
.<year>2014</year>
;<volume>15</volume>
(<issue>6</issue>) :
<elocation-id>10.1093/bib/bbt052</elocation-id>
<fpage>890</fpage>
-<lpage>905</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt052</pub-id>
<pub-id pub-id-type="pmid">23904502</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</sub-article>
<sub-article id="report7811" article-type="peer-review"><front-stub><article-id pub-id-type="doi">10.5256/f1000research.6506.r7811</article-id>
<title-group><article-title>Referee response for version 1</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Bonham-Carter</surname>
<given-names>Oliver</given-names>
</name>
<xref ref-type="aff" rid="r7811a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r7811a1"><label>1</label>
College of Information Science & Technology, School of Interdisciplinary Informatics, University of Nebraska, Omaha, NE, USA</aff>
</contrib-group>
<author-notes><fn fn-type="COI-statement"><p><bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>10</day>
<month>3</month>
<year>2015</year>
</pub-date>
<related-article id="d35e2943" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.6077.1">Version 1</related-article>
<custom-meta-group><custom-meta><meta-name>recommendation</meta-name>
<meta-value>reject</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body><p>The article is nicely written, but some elements of discussion are absent from the paper. If added, the paper's research on mycobacteriophages using alignment-free analysis would have much more support.
<list list-type="bullet"><list-item><p>The choice of TUDs as statistics for the alignment-free analysis is not fully explained or justified, nor is there much discussion about what algorithm or method is being employed by the analysis tools of the paper. Are TUDs frequencies? How do these software tools work?</p>
</list-item>
<list-item><p>A simple example of how to calculate a TUD and apply it to a method is necessary to completely understand what TUDs are and to see how they differ from any other motif frequency calculation applied to some other method.</p>
</list-item>
<list-item><p>The assumptions of the methods are not discussed. Many methods from information theory, statistics and other areas of mathematics require that the input data meet specific requirements (is normal, has a certain distribution, is a frequency, etc.). From the discussion in this paper, the function of the analysis tool (the exact algorithm or method) is never clear, so we cannot be sure that the calculations from this work, as applied to these tools, are appropriate. For instance, many tools in information theory require that frequencies be used for their analysis. These frequencies must pass basic rules to be called as such (i.e., found on the scale of 0 to 1, all frequencies must sum to 1, 0 = false, 1 = true). This discussion is missing; if it were included, the choice to use TUDs could easily be integrated into it.</p>
</list-item>
<list-item><p>The manuscript mentioned that k-mers in the range of two to seven were calculated (Methods Section). Where are the results for all these other values of k={2, 3, 5 and 7} which were not the k={4 and 6} results of the article?</p>
</list-item>
<list-item><p>Although other motif sizes were apparently used in the analysis, the manuscript focuses on the length-4 motifs. The choice of k=4 for the size of motifs to study is not a very interesting statistic since the probability of a particular length-4 motif showing up randomly in a sequence is not very high (1/(4^4) = 1/256). Given the frequency of mutations and all the evolutionary time during which changes to a sequence can occur, similar length-4 motifs are likely to turn up randomly.</p>
</list-item>
<list-item><p>The authors should consider using the occurrence of motifs of length at least seven, since these frequencies begin to become less randomly placed. Length-4 words are already common in many bacteria as restriction sites for restriction enzymes. The authors will also find that there are restriction sites of length 6 for the same purpose, so they will have to remove all restriction enzyme palindromes from their sets of k=4 or 6 sized motifs if they cannot continue with a longer motif length. However, if they are determining the level of conservation between organisms, then having longer motifs should not hurt their results.</p>
</list-item>
</list>
Once these issues are addressed, the manuscript will be much stronger.</p>
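<p>To put the 1/256 figure quoted above in context, the sketch below works out how often a fixed k-mer would be expected in a random sequence. It is an illustration under stated assumptions: a uniform, independent nucleotide model, a hypothetical genome length of 50,000 bp, and an approximation that ignores the dependence between overlapping windows. The function names are made up for the example.</p>

```python
def expected_kmer_count(genome_length, k):
    """Expected occurrences of one fixed k-mer under a uniform random model.

    Each of the (genome_length - k + 1) windows matches with probability 4**-k.
    """
    windows = genome_length - k + 1
    return windows / 4 ** k


def prob_at_least_once(genome_length, k):
    """Approximate chance that the k-mer appears at least once.

    Assumption: windows are treated as independent, which overlapping
    windows are not, so this is only a rough estimate.
    """
    windows = genome_length - k + 1
    return 1.0 - (1.0 - 4.0 ** -k) ** windows


# Hypothetical 50,000 bp genome: roughly 195 expected copies of a given
# 4-mer, and about 3 expected copies of a given 7-mer.
for k in (4, 6, 7):
    print(k, expected_kmer_count(50_000, k), prob_at_least_once(50_000, k))
```

<p>Under this simple model, any given 4-mer is expected many times in a genome of that size, whereas a given 7-mer is expected only a handful of times.</p>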
<p>I have read this submission. I believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
</body>
<sub-article id="comment1668" article-type="response"><front-stub><contrib-group><contrib contrib-type="author"><name><surname>Siranosian</surname>
<given-names>Benjamin</given-names>
</name>
<aff></aff>
</contrib>
</contrib-group>
<author-notes><fn fn-type="COI-statement"><p><bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>23</day>
<month>10</month>
<year>2015</year>
</pub-date>
</front-stub>
<body><p>Thank you for reviewing the manuscript. I have considered the points you raised, and responded in order below. Changes to the manuscript are noted.
<list list-type="order"><list-item><p>The usage deviation-based statistics chosen for this paper are similar to those based on the composition vector of a sequence (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/23904502">Bonham-Carter
<italic>et al</italic>
., 2013</ext-link>
). Usage deviation (tetranucleotide usage deviation, TUD, in the case of k=4) is a vector of the counts of the possible k-mers, normalized to the expected counts in a randomized genome with the same nucleotide composition. I have made additions to the methods section and included a new figure that makes the calculation of usage deviation clearer. The software tools used to perform these calculations are described on the GitHub page linked in the paper.</p>
</list-item>
<list-item><p>I have added an example in the methods section that shows how to calculate TUD for a small sequence. Although this example outlines the method, the results are not very informative. The expected number of any 4-mer is very small in a short sequence, resulting in high TUD values for any 4-mers that do occur.</p>
</list-item>
<list-item><p>We do not make any assumptions about the input data when calculating usage deviation or performing statistics in the paper.</p>
</list-item>
<list-item><p>I showed trees constructed from other values of k in Figure 2. The relationships between phage genomes were consistent regardless of the value chosen for k. Other analyses mirrored this result, so we proceeded exclusively with k={4, 6}.</p>
</list-item>
<list-item><p>I agree that length-4 motifs are not interesting to study in isolation. Usage deviation, where values represent deviations from expected frequencies, addresses this point. Single occurrences or counts of any 4-mer are uninteresting. Only when counts are normalized and compared in aggregate do the trends observed in the paper become meaningful.</p>
</list-item>
<list-item><p>7-mers would be less randomly placed in the phage genomes analyzed. Similar to the point above, however, the occurrences of singular k-mers are not considered. As k increases, the resulting usage deviation vectors become sparse. Up to 43% of the (4^7=16384) 7-mers are absent from individual genome sequences, and no 7-mer occurs at least once in every genome analyzed. The sparse nature of the data for 7-mers would not be well-suited to some of the analyses presented in this paper (PCA, searching for horizontally transferred segments).</p>
</list-item>
<list-item><p>I acknowledge that many 4-mers and 6-mers are restriction sites. In fact, this makes the substrings more interesting. B3 mycobacteriophages have 4 times the expected usage of GATC, a restriction site in some bacteria. Biological sense dictates restriction sites would occur infrequently, but the results say the opposite. I do not feel it is necessary to remove restriction sites before the analysis, and doing so would be somewhat arbitrary. The set of restriction sites in mycobacteria species is not entirely characterized, and the host range for each mycobacteriophage has not been studied.</p>
</list-item>
</list>
We hope you find the answers to the points you raised and the revisions to the paper acceptable.</p>
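<p>The usage deviation described in the responses above can be sketched as follows. This is a minimal illustration, not the code from the linked repository: it assumes the expected count of a k-mer is the number of windows multiplied by the product of the single-nucleotide frequencies of the sequence, and it counts one strand only. The paper's exact normalization may differ, and the function name is made up for the example.</p>

```python
from collections import Counter
from itertools import product
from math import prod


def usage_deviation(seq, k=4):
    """Observed/expected k-mer counts under an independent-nucleotide null.

    Assumption: expected count = (number of windows) * product of the
    frequencies of the k-mer's nucleotides in this same sequence.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    # Single-nucleotide frequencies of this sequence (its composition).
    base_counts = Counter(seq)
    freq = {b: base_counts[b] / len(seq) for b in "ACGT"}
    # Observed counts of every overlapping k-mer.
    observed = Counter(seq[i:i + k] for i in range(n_windows))
    deviation = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        expected = n_windows * prod(freq[b] for b in letters)
        # A k-mer containing a base absent from seq has expected == 0.
        deviation[kmer] = observed[kmer] / expected if expected > 0 else 0.0
    return deviation


# Toy input: 12 bases, 9 windows, each 4-mer expected 9/256 times.
tud = usage_deviation("ACGTACGTACGT")
print(round(tud["ACGT"], 2), tud["AAAA"])
```

<p>In this toy sequence ACGT occurs three times against an expectation of 9/256, giving a value of about 85. This mirrors the point made in the second response: in a short sequence the expected count of any 4-mer is tiny, so any 4-mer that does occur receives a very high value.</p>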
<p><bold>References:</bold>
</p>
<p>Bonham-Carter, O., Steele, J. & Bastola, D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
<italic>Brief Bioinform</italic>
<bold>15,</bold>
 890–905 (2014).</p>
</body>
</sub-article>
</sub-article>
</pmc>
</record>