Cyberinfrastructure Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Viral dark matter and virus–host interactions resolved from publicly available microbial genomes

Internal identifier: 000092 (Pmc/Corpus); previous: 000091; next: 000093

Authors: Simon Roux; Steven J. Hallam; Tanja Woyke; Matthew B. Sullivan

Source:

RBID : PMC:4533152

Abstract

The ecological importance of viruses is now widely recognized, yet our limited knowledge of viral sequence space and virus–host interactions precludes accurate prediction of their roles and impacts. In this study, we mined publicly available bacterial and archaeal genomic data sets to identify 12,498 high-confidence viral genomes linked to their microbial hosts. These data augment public data sets 10-fold, provide first viral sequences for 13 new bacterial phyla including ecologically abundant phyla, and help taxonomically identify 7–38% of ‘unknown’ sequence space in viromes. Genome- and network-based classification was largely consistent with accepted viral taxonomy and suggested that (i) 264 new viral genera were identified (doubling known genera) and (ii) cross-taxon genomic recombination is limited. Further analyses provided empirical data on extrachromosomal prophages and coinfection prevalences, as well as evaluation of in silico virus–host linkage predictions. Together these findings illustrate the value of mining viral signal from microbial genomes.

DOI: http://dx.doi.org/10.7554/eLife.08490.001


Url:
DOI: 10.7554/eLife.08490
PubMed: 26200428
PubMed Central: 4533152

Links to Exploration step

PMC:4533152

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Viral dark matter and virus–host interactions resolved from publicly available microbial genomes</title>
<author>
<name sortKey="Roux, Simon" sort="Roux, Simon" uniqKey="Roux S" first="Simon" last="Roux">Simon Roux</name>
<affiliation>
<nlm:aff id="aff1">
<institution content-type="dept">Department of Ecology and Evolutionary Biology</institution>
,
<institution>University of Arizona</institution>
,
<addr-line>Tucson</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hallam, Steven J" sort="Hallam, Steven J" uniqKey="Hallam S" first="Steven J" last="Hallam">Steven J. Hallam</name>
<affiliation>
<nlm:aff id="aff2">
<institution content-type="dept">Department of Microbiology and Immunology</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<institution content-type="dept">Graduate Program in Bioinformatics</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Woyke, Tanja" sort="Woyke, Tanja" uniqKey="Woyke T" first="Tanja" last="Woyke">Tanja Woyke</name>
<affiliation>
<nlm:aff id="aff4">
<institution>U.S. Department of Energy Joint Genome Institute</institution>
,
<addr-line>Walnut Creek</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sullivan, Matthew B" sort="Sullivan, Matthew B" uniqKey="Sullivan M" first="Matthew B" last="Sullivan">Matthew B. Sullivan</name>
<affiliation>
<nlm:aff id="aff1">
<institution content-type="dept">Department of Ecology and Evolutionary Biology</institution>
,
<institution>University of Arizona</institution>
,
<addr-line>Tucson</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26200428</idno>
<idno type="pmc">4533152</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4533152</idno>
<idno type="RBID">PMC:4533152</idno>
<idno type="doi">10.7554/eLife.08490</idno>
<date when="????">????</date>
<idno type="wicri:Area/Pmc/Corpus">000092</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Viral dark matter and virus–host interactions resolved from publicly available microbial genomes</title>
<author>
<name sortKey="Roux, Simon" sort="Roux, Simon" uniqKey="Roux S" first="Simon" last="Roux">Simon Roux</name>
<affiliation>
<nlm:aff id="aff1">
<institution content-type="dept">Department of Ecology and Evolutionary Biology</institution>
,
<institution>University of Arizona</institution>
,
<addr-line>Tucson</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hallam, Steven J" sort="Hallam, Steven J" uniqKey="Hallam S" first="Steven J" last="Hallam">Steven J. Hallam</name>
<affiliation>
<nlm:aff id="aff2">
<institution content-type="dept">Department of Microbiology and Immunology</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<institution content-type="dept">Graduate Program in Bioinformatics</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Woyke, Tanja" sort="Woyke, Tanja" uniqKey="Woyke T" first="Tanja" last="Woyke">Tanja Woyke</name>
<affiliation>
<nlm:aff id="aff4">
<institution>U.S. Department of Energy Joint Genome Institute</institution>
,
<addr-line>Walnut Creek</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sullivan, Matthew B" sort="Sullivan, Matthew B" uniqKey="Sullivan M" first="Matthew B" last="Sullivan">Matthew B. Sullivan</name>
<affiliation>
<nlm:aff id="aff1">
<institution content-type="dept">Department of Ecology and Evolutionary Biology</institution>
,
<institution>University of Arizona</institution>
,
<addr-line>Tucson</addr-line>
,
<country>United States</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">eLife</title>
<idno type="ISSN">2050-084X</idno>
<idno type="eISSN">2050-084X</idno>
<imprint>
<date when="????">????</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>The ecological importance of viruses is now widely recognized, yet our limited knowledge of viral sequence space and virus–host interactions precludes accurate prediction of their roles and impacts. In this study, we mined publicly available bacterial and archaeal genomic data sets to identify 12,498 high-confidence viral genomes linked to their microbial hosts. These data augment public data sets 10-fold, provide first viral sequences for 13 new bacterial phyla including ecologically abundant phyla, and help taxonomically identify 7–38% of ‘unknown’ sequence space in viromes. Genome- and network-based classification was largely consistent with accepted viral taxonomy and suggested that (i) 264 new viral genera were identified (doubling known genera) and (ii) cross-taxon genomic recombination is limited. Further analyses provided empirical data on extrachromosomal prophages and coinfection prevalences, as well as evaluation of in silico virus–host linkage predictions. Together these findings illustrate the value of mining viral signal from microbial genomes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.001">http://dx.doi.org/10.7554/eLife.08490.001</ext-link>
</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Abedon, St" uniqKey="Abedon S">ST Abedon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Akhter, S" uniqKey="Akhter S">S Akhter</name>
</author>
<author>
<name sortKey="Aziz, Rk" uniqKey="Aziz R">RK Aziz</name>
</author>
<author>
<name sortKey="Edwards, Ra" uniqKey="Edwards R">RA Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Allers, E" uniqKey="Allers E">E Allers</name>
</author>
<author>
<name sortKey="Moraru, C" uniqKey="Moraru C">C Moraru</name>
</author>
<author>
<name sortKey="Duhaime, Mb" uniqKey="Duhaime M">MB Duhaime</name>
</author>
<author>
<name sortKey="Beneze, E" uniqKey="Beneze E">E Beneze</name>
</author>
<author>
<name sortKey="Solonenko, N" uniqKey="Solonenko N">N Solonenko</name>
</author>
<author>
<name sortKey="Canosa, Jb" uniqKey="Canosa J">JB Canosa</name>
</author>
<author>
<name sortKey="Amann, R" uniqKey="Amann R">R Amann</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Allers, E" uniqKey="Allers E">E Allers</name>
</author>
<author>
<name sortKey="Wright, Jj" uniqKey="Wright J">JJ Wright</name>
</author>
<author>
<name sortKey="Konwar, Km" uniqKey="Konwar K">KM Konwar</name>
</author>
<author>
<name sortKey="Howes, Cg" uniqKey="Howes C">CG Howes</name>
</author>
<author>
<name sortKey="Beneze, E" uniqKey="Beneze E">E Beneze</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Andersson, Af" uniqKey="Andersson A">AF Andersson</name>
</author>
<author>
<name sortKey="Banfield, Jf" uniqKey="Banfield J">JF Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
<author>
<name sortKey="Nurk, S" uniqKey="Nurk S">S Nurk</name>
</author>
<author>
<name sortKey="Antipov, D" uniqKey="Antipov D">D Antipov</name>
</author>
<author>
<name sortKey="Gurevich, Aa" uniqKey="Gurevich A">AA Gurevich</name>
</author>
<author>
<name sortKey="Dvorkin, M" uniqKey="Dvorkin M">M Dvorkin</name>
</author>
<author>
<name sortKey="Kulikov, As" uniqKey="Kulikov A">AS Kulikov</name>
</author>
<author>
<name sortKey="Lesin, Vm" uniqKey="Lesin V">VM Lesin</name>
</author>
<author>
<name sortKey="Nikolenko, Si" uniqKey="Nikolenko S">SI Nikolenko</name>
</author>
<author>
<name sortKey="Pham, S" uniqKey="Pham S">S Pham</name>
</author>
<author>
<name sortKey="Prjibelski, Ad" uniqKey="Prjibelski A">AD Prjibelski</name>
</author>
<author>
<name sortKey="Pyshkin, Av" uniqKey="Pyshkin A">AV Pyshkin</name>
</author>
<author>
<name sortKey="Sirotkin, Av" uniqKey="Sirotkin A">AV Sirotkin</name>
</author>
<author>
<name sortKey="Vyahhi, N" uniqKey="Vyahhi N">N Vyahhi</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
<author>
<name sortKey="Alekseyev, Ma" uniqKey="Alekseyev M">MA Alekseyev</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bastias, R" uniqKey="Bastias R">R Bastías</name>
</author>
<author>
<name sortKey="Higuera, G" uniqKey="Higuera G">G Higuera</name>
</author>
<author>
<name sortKey="Sierralta, W" uniqKey="Sierralta W">W Sierralta</name>
</author>
<author>
<name sortKey="Espejo, Rt" uniqKey="Espejo R">RT Espejo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brum, J" uniqKey="Brum J">J Brum</name>
</author>
<author>
<name sortKey="Ignacio Espinoza, J" uniqKey="Ignacio Espinoza J">J Ignacio-Espinoza</name>
</author>
<author>
<name sortKey="Roux, S" uniqKey="Roux S">S Roux</name>
</author>
<author>
<name sortKey="Doulcier, G" uniqKey="Doulcier G">G Doulcier</name>
</author>
<author>
<name sortKey="Acinas, Sg" uniqKey="Acinas S">SG Acinas</name>
</author>
<author>
<name sortKey="Alberti, A" uniqKey="Alberti A">A Alberti</name>
</author>
<author>
<name sortKey="Chaffron, S" uniqKey="Chaffron S">S Chaffron</name>
</author>
<author>
<name sortKey="Cruaud, C" uniqKey="Cruaud C">C Cruaud</name>
</author>
<author>
<name sortKey="De Vargas, C" uniqKey="De Vargas C">C de Vargas</name>
</author>
<author>
<name sortKey="Gasol, Jm" uniqKey="Gasol J">JM Gasol</name>
</author>
<author>
<name sortKey="Gorsky, G" uniqKey="Gorsky G">G Gorsky</name>
</author>
<author>
<name sortKey="Gregory, Ac" uniqKey="Gregory A">AC Gregory</name>
</author>
<author>
<name sortKey="Ogata, H" uniqKey="Ogata H">H Ogata</name>
</author>
<author>
<name sortKey="Pesant, S" uniqKey="Pesant S">S Pesant</name>
</author>
<author>
<name sortKey="Poulos, Bt" uniqKey="Poulos B">BT Poulos</name>
</author>
<author>
<name sortKey="Schwenck, Sm" uniqKey="Schwenck S">SM Schwenck</name>
</author>
<author>
<name sortKey="Speich, S" uniqKey="Speich S">S Speich</name>
</author>
<author>
<name sortKey="Dimier, C" uniqKey="Dimier C">C Dimier</name>
</author>
<author>
<name sortKey="Kandels Lewis, S" uniqKey="Kandels Lewis S">S Kandels-Lewis</name>
</author>
<author>
<name sortKey="Picheral, M" uniqKey="Picheral M">M Picheral</name>
</author>
<author>
<name sortKey="Searson, S" uniqKey="Searson S">S Searson</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
<author>
<name sortKey="Bowler, C" uniqKey="Bowler C">C Bowler</name>
</author>
<author>
<name sortKey="Sunagawa, S" uniqKey="Sunagawa S">S Sunagawa</name>
</author>
<author>
<name sortKey="Wincker, P" uniqKey="Wincker P">P Wincker</name>
</author>
<author>
<name sortKey="Karsenti, E" uniqKey="Karsenti E">E Karsenti</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brum, Jr" uniqKey="Brum J">JR Brum</name>
</author>
<author>
<name sortKey="Jeffrey Morris, J" uniqKey="Jeffrey Morris J">J Jeffrey Morris</name>
</author>
<author>
<name sortKey="Decima, M" uniqKey="Decima M">M Décima</name>
</author>
<author>
<name sortKey="Stukel, Mr" uniqKey="Stukel M">MR Stukel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brum, Jr" uniqKey="Brum J">JR Brum</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Canchaya, C" uniqKey="Canchaya C">C Canchaya</name>
</author>
<author>
<name sortKey="Fournous, G" uniqKey="Fournous G">G Fournous</name>
</author>
<author>
<name sortKey="Brussow, H" uniqKey="Brussow H">H Brüssow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Carbone, A" uniqKey="Carbone A">A Carbone</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cardinale, Dj" uniqKey="Cardinale D">DJ Cardinale</name>
</author>
<author>
<name sortKey="Duffy, S" uniqKey="Duffy S">S Duffy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Carey Smith, Gv" uniqKey="Carey Smith G">GV Carey-Smith</name>
</author>
<author>
<name sortKey="Billington, C" uniqKey="Billington C">C Billington</name>
</author>
<author>
<name sortKey="Cornelius, Aj" uniqKey="Cornelius A">AJ Cornelius</name>
</author>
<author>
<name sortKey="Hudson, Ja" uniqKey="Hudson J">JA Hudson</name>
</author>
<author>
<name sortKey="Heinemann, Ja" uniqKey="Heinemann J">JA Heinemann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Casjens, S" uniqKey="Casjens S">S Casjens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Castelle, Cj" uniqKey="Castelle C">CJ Castelle</name>
</author>
<author>
<name sortKey="Hug, La" uniqKey="Hug L">LA Hug</name>
</author>
<author>
<name sortKey="Wrighton, Kc" uniqKey="Wrighton K">KC Wrighton</name>
</author>
<author>
<name sortKey="Thomas, Bc" uniqKey="Thomas B">BC Thomas</name>
</author>
<author>
<name sortKey="Williams, Kh" uniqKey="Williams K">KH Williams</name>
</author>
<author>
<name sortKey="Wu, D" uniqKey="Wu D">D Wu</name>
</author>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Singer, Sw" uniqKey="Singer S">SW Singer</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
<author>
<name sortKey="Banfield, Jf" uniqKey="Banfield J">JF Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clemente, Jc" uniqKey="Clemente J">JC Clemente</name>
</author>
<author>
<name sortKey="Ursell, Lk" uniqKey="Ursell L">LK Ursell</name>
</author>
<author>
<name sortKey="Parfrey, Lw" uniqKey="Parfrey L">LW Parfrey</name>
</author>
<author>
<name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davis, Ma" uniqKey="Davis M">MA Davis</name>
</author>
<author>
<name sortKey="Martin, Ka" uniqKey="Martin K">KA Martin</name>
</author>
<author>
<name sortKey="Austin, Sj" uniqKey="Austin S">SJ Austin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Delong, Ef" uniqKey="Delong E">EF DeLong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deng, L" uniqKey="Deng L">L Deng</name>
</author>
<author>
<name sortKey="Ignacio Espinoza, Jc" uniqKey="Ignacio Espinoza J">JC Ignacio-Espinoza</name>
</author>
<author>
<name sortKey="Gregory, A" uniqKey="Gregory A">A Gregory</name>
</author>
<author>
<name sortKey="Poulos, Bt" uniqKey="Poulos B">BT Poulos</name>
</author>
<author>
<name sortKey="Weitz, Js" uniqKey="Weitz J">JS Weitz</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Diemer, Gs" uniqKey="Diemer G">GS Diemer</name>
</author>
<author>
<name sortKey="Stedman, Km" uniqKey="Stedman K">KM Stedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Emerson, Jb" uniqKey="Emerson J">JB Emerson</name>
</author>
<author>
<name sortKey="Thomas, Bc" uniqKey="Thomas B">BC Thomas</name>
</author>
<author>
<name sortKey="Alvarez, W" uniqKey="Alvarez W">W Alvarez</name>
</author>
<author>
<name sortKey="Banfield, Jf" uniqKey="Banfield J">JF Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enav, H" uniqKey="Enav H">H Enav</name>
</author>
<author>
<name sortKey="Beja, O" uniqKey="Beja O">O Béjà</name>
</author>
<author>
<name sortKey="Mandel Gutfreund, Y" uniqKey="Mandel Gutfreund Y">Y Mandel-Gutfreund</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Falkowski, Pg" uniqKey="Falkowski P">PG Falkowski</name>
</author>
<author>
<name sortKey="Fenchel, T" uniqKey="Fenchel T">T Fenchel</name>
</author>
<author>
<name sortKey="Delong, Ef" uniqKey="Delong E">EF Delong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fischer, Cr" uniqKey="Fischer C">CR Fischer</name>
</author>
<author>
<name sortKey="Yoichi, M" uniqKey="Yoichi M">M Yoichi</name>
</author>
<author>
<name sortKey="Unno, H" uniqKey="Unno H">H Unno</name>
</author>
<author>
<name sortKey="Tanji, Y" uniqKey="Tanji Y">Y Tanji</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flores, Co" uniqKey="Flores C">CO Flores</name>
</author>
<author>
<name sortKey="Meyer, Jr" uniqKey="Meyer J">JR Meyer</name>
</author>
<author>
<name sortKey="Valverde, S" uniqKey="Valverde S">S Valverde</name>
</author>
<author>
<name sortKey="Farr, L" uniqKey="Farr L">L Farr</name>
</author>
<author>
<name sortKey="Weitz, Js" uniqKey="Weitz J">JS Weitz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flores, Co" uniqKey="Flores C">CO Flores</name>
</author>
<author>
<name sortKey="Valverde, S" uniqKey="Valverde S">S Valverde</name>
</author>
<author>
<name sortKey="Weitz, Js" uniqKey="Weitz J">JS Weitz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Forterre, P" uniqKey="Forterre P">P Forterre</name>
</author>
<author>
<name sortKey="Prangishvili, D" uniqKey="Prangishvili D">D Prangishvili</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fouts, De" uniqKey="Fouts D">DE Fouts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Garrett, Ra" uniqKey="Garrett R">RA Garrett</name>
</author>
<author>
<name sortKey="Prangishvili, D" uniqKey="Prangishvili D">D Prangishvili</name>
</author>
<author>
<name sortKey="Shah, Sa" uniqKey="Shah S">SA Shah</name>
</author>
<author>
<name sortKey="Reuter, M" uniqKey="Reuter M">M Reuter</name>
</author>
<author>
<name sortKey="Stetter, Ko" uniqKey="Stetter K">KO Stetter</name>
</author>
<author>
<name sortKey="Peng, X" uniqKey="Peng X">X Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hanson, Ca" uniqKey="Hanson C">CA Hanson</name>
</author>
<author>
<name sortKey="Fuhrman, Ja" uniqKey="Fuhrman J">JA Fuhrman</name>
</author>
<author>
<name sortKey="Horner Devine, Mc" uniqKey="Horner Devine M">MC Horner-Devine</name>
</author>
<author>
<name sortKey="Martiny, Jb" uniqKey="Martiny J">JB Martiny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hendrix, Rw" uniqKey="Hendrix R">RW Hendrix</name>
</author>
<author>
<name sortKey="Smith, Mc" uniqKey="Smith M">MC Smith</name>
</author>
<author>
<name sortKey="Burns, Rn" uniqKey="Burns R">RN Burns</name>
</author>
<author>
<name sortKey="Ford, Me" uniqKey="Ford M">ME Ford</name>
</author>
<author>
<name sortKey="Hatfull, Gf" uniqKey="Hatfull G">GF Hatfull</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hurwitz, Bl" uniqKey="Hurwitz B">BL Hurwitz</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hurwitz, Bl" uniqKey="Hurwitz B">BL Hurwitz</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ignacio Espinoza, Jc" uniqKey="Ignacio Espinoza J">JC Ignacio-Espinoza</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jia, B" uniqKey="Jia B">B Jia</name>
</author>
<author>
<name sortKey="Xuan, L" uniqKey="Xuan L">L Xuan</name>
</author>
<author>
<name sortKey="Cai, K" uniqKey="Cai K">K Cai</name>
</author>
<author>
<name sortKey="Hu, Z" uniqKey="Hu Z">Z Hu</name>
</author>
<author>
<name sortKey="Ma, L" uniqKey="Ma L">L Ma</name>
</author>
<author>
<name sortKey="Wei, C" uniqKey="Wei C">C Wei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kamke, J" uniqKey="Kamke J">J Kamke</name>
</author>
<author>
<name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author>
<name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kashtan, N" uniqKey="Kashtan N">N Kashtan</name>
</author>
<author>
<name sortKey="Roggensack, Se" uniqKey="Roggensack S">SE Roggensack</name>
</author>
<author>
<name sortKey="Rodrigue, S" uniqKey="Rodrigue S">S Rodrigue</name>
</author>
<author>
<name sortKey="Thompson, Jw" uniqKey="Thompson J">JW Thompson</name>
</author>
<author>
<name sortKey="Biller, Sj" uniqKey="Biller S">SJ Biller</name>
</author>
<author>
<name sortKey="Coe, A" uniqKey="Coe A">A Coe</name>
</author>
<author>
<name sortKey="Ding, H" uniqKey="Ding H">H Ding</name>
</author>
<author>
<name sortKey="Marttinen, P" uniqKey="Marttinen P">P Marttinen</name>
</author>
<author>
<name sortKey="Malmstrom, Rr" uniqKey="Malmstrom R">RR Malmstrom</name>
</author>
<author>
<name sortKey="Stocker, R" uniqKey="Stocker R">R Stocker</name>
</author>
<author>
<name sortKey="Follows, Mj" uniqKey="Follows M">MJ Follows</name>
</author>
<author>
<name sortKey="Stepanauskas, R" uniqKey="Stepanauskas R">R Stepanauskas</name>
</author>
<author>
<name sortKey="Chisholm, Sw" uniqKey="Chisholm S">SW Chisholm</name>
</author>
<author>
<name sortKey="Biller, J" uniqKey="Biller J">J Biller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, Ms" uniqKey="Kim M">MS Kim</name>
</author>
<author>
<name sortKey="Park, Ej" uniqKey="Park E">EJ Park</name>
</author>
<author>
<name sortKey="Roh, Sw" uniqKey="Roh S">SW Roh</name>
</author>
<author>
<name sortKey="Bae, Jw" uniqKey="Bae J">JW Bae</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
<author>
<name sortKey="Senkevich, Tg" uniqKey="Senkevich T">TG Senkevich</name>
</author>
<author>
<name sortKey="Dolja, Vv" uniqKey="Dolja V">VV Dolja</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krupovic, M" uniqKey="Krupovic M">M Krupovic</name>
</author>
<author>
<name sortKey="Zhi, N" uniqKey="Zhi N">N Zhi</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Hu, G" uniqKey="Hu G">G Hu</name>
</author>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
<author>
<name sortKey="Wong, S" uniqKey="Wong S">S Wong</name>
</author>
<author>
<name sortKey="Shevchenko, S" uniqKey="Shevchenko S">S Shevchenko</name>
</author>
<author>
<name sortKey="Zhao, K" uniqKey="Zhao K">K Zhao</name>
</author>
<author>
<name sortKey="Young, Ns" uniqKey="Young N">NS Young</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Labonte, Jm" uniqKey="Labonte J">JM Labonté</name>
</author>
<author>
<name sortKey="Swan, Bk" uniqKey="Swan B">BK Swan</name>
</author>
<author>
<name sortKey="Poulos, Bt" uniqKey="Poulos B">BT Poulos</name>
</author>
<author>
<name sortKey="Luo, H" uniqKey="Luo H">H Luo</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
<author>
<name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
<author>
<name sortKey="Wommack, Ek" uniqKey="Wommack E">EK Wommack</name>
</author>
<author>
<name sortKey="Stepanauskas, R" uniqKey="Stepanauskas R">R Stepanauskas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Labonte, Jm" uniqKey="Labonte J">JM Labonté</name>
</author>
<author>
<name sortKey="Suttle, Ca" uniqKey="Suttle C">CA Suttle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leplae, R" uniqKey="Leplae R">R Leplae</name>
</author>
<author>
<name sortKey="Lima Mendez, G" uniqKey="Lima Mendez G">G Lima-Mendez</name>
</author>
<author>
<name sortKey="Toussaint, A" uniqKey="Toussaint A">A Toussaint</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lima Mendez, G" uniqKey="Lima Mendez G">G Lima-Mendez</name>
</author>
<author>
<name sortKey="Van Helden, J" uniqKey="Van Helden J">J Van Helden</name>
</author>
<author>
<name sortKey="Toussaint, A" uniqKey="Toussaint A">A Toussaint</name>
</author>
<author>
<name sortKey="Leplae, R" uniqKey="Leplae R">R Leplae</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lima Mendez, G" uniqKey="Lima Mendez G">G Lima-Mendez</name>
</author>
<author>
<name sortKey="Van Helden, J" uniqKey="Van Helden J">J Van Helden</name>
</author>
<author>
<name sortKey="Toussaint, A" uniqKey="Toussaint A">A Toussaint</name>
</author>
<author>
<name sortKey="Leplae, R" uniqKey="Leplae R">R Leplae</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marston, Mf" uniqKey="Marston M">MF Marston</name>
</author>
<author>
<name sortKey="Pierciey, Fj" uniqKey="Pierciey F">FJ Pierciey</name>
</author>
<author>
<name sortKey="Shepard, A" uniqKey="Shepard A">A Shepard</name>
</author>
<author>
<name sortKey="Gearin, G" uniqKey="Gearin G">G Gearin</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Yandava, C" uniqKey="Yandava C">C Yandava</name>
</author>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
<author>
<name sortKey="Henn, Mr" uniqKey="Henn M">MR Henn</name>
</author>
<author>
<name sortKey="Martiny, Jb" uniqKey="Martiny J">JB Martiny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Middelboe, M" uniqKey="Middelboe M">M Middelboe</name>
</author>
<author>
<name sortKey="Holmfeldt, K" uniqKey="Holmfeldt K">K Holmfeldt</name>
</author>
<author>
<name sortKey="Riemann, L" uniqKey="Riemann L">L Riemann</name>
</author>
<author>
<name sortKey="Nybroe, O" uniqKey="Nybroe O">O Nybroe</name>
</author>
<author>
<name sortKey="Haaber, J" uniqKey="Haaber J">J Haaber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Minot, S" uniqKey="Minot S">S Minot</name>
</author>
<author>
<name sortKey="Grunberg, S" uniqKey="Grunberg S">S Grunberg</name>
</author>
<author>
<name sortKey="Wu, Gd" uniqKey="Wu G">GD Wu</name>
</author>
<author>
<name sortKey="Lewis, Jd" uniqKey="Lewis J">JD Lewis</name>
</author>
<author>
<name sortKey="Bushman, Fd" uniqKey="Bushman F">FD Bushman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mizuno, Cm" uniqKey="Mizuno C">CM Mizuno</name>
</author>
<author>
<name sortKey="Rodriguez Valera, F" uniqKey="Rodriguez Valera F">F Rodriguez-Valera</name>
</author>
<author>
<name sortKey="Kimes, Ne" uniqKey="Kimes N">NE Kimes</name>
</author>
<author>
<name sortKey="Ghai, R" uniqKey="Ghai R">R Ghai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mosig, G" uniqKey="Mosig G">G Mosig</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pace, Nr" uniqKey="Pace N">NR Pace</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Leung, Hc" uniqKey="Leung H">HC Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Chin, Fy" uniqKey="Chin F">FY Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author>
<name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
<author>
<name sortKey="Ghose, C" uniqKey="Ghose C">C Ghose</name>
</author>
<author>
<name sortKey="Blaser, Mj" uniqKey="Blaser M">MJ Blaser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pruitt, Kd" uniqKey="Pruitt K">KD Pruitt</name>
</author>
<author>
<name sortKey="Tatusova, T" uniqKey="Tatusova T">T Tatusova</name>
</author>
<author>
<name sortKey="Klimke, W" uniqKey="Klimke W">W Klimke</name>
</author>
<author>
<name sortKey="Maglott, Dr" uniqKey="Maglott D">DR Maglott</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rakonjac, J" uniqKey="Rakonjac J">J Rakonjac</name>
</author>
<author>
<name sortKey="Bennett, Nj" uniqKey="Bennett N">NJ Bennett</name>
</author>
<author>
<name sortKey="Spagnuolo, J" uniqKey="Spagnuolo J">J Spagnuolo</name>
</author>
<author>
<name sortKey="Gagic, D" uniqKey="Gagic D">D Gagic</name>
</author>
<author>
<name sortKey="Russel, M" uniqKey="Russel M">M Russel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rappe, Ms" uniqKey="Rappe M">MS Rappé</name>
</author>
<author>
<name sortKey="Giovannoni, Sj" uniqKey="Giovannoni S">SJ Giovannoni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reyes, A" uniqKey="Reyes A">A Reyes</name>
</author>
<author>
<name sortKey="Semenkovich, Np" uniqKey="Semenkovich N">NP Semenkovich</name>
</author>
<author>
<name sortKey="Whiteson, K" uniqKey="Whiteson K">K Whiteson</name>
</author>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author>
<name sortKey="Gordon, Ji" uniqKey="Gordon J">JI Gordon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rice, P" uniqKey="Rice P">P Rice</name>
</author>
<author>
<name sortKey="Longden, I" uniqKey="Longden I">I Longden</name>
</author>
<author>
<name sortKey="Bleasby, A" uniqKey="Bleasby A">A Bleasby</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rinke, C" uniqKey="Rinke C">C Rinke</name>
</author>
<author>
<name sortKey="Schwientek, P" uniqKey="Schwientek P">P Schwientek</name>
</author>
<author>
<name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author>
<name sortKey="Ivanova, Nn" uniqKey="Ivanova N">NN Ivanova</name>
</author>
<author>
<name sortKey="Anderson, Ij" uniqKey="Anderson I">IJ Anderson</name>
</author>
<author>
<name sortKey="Cheng, Jf" uniqKey="Cheng J">JF Cheng</name>
</author>
<author>
<name sortKey="Darling, A" uniqKey="Darling A">A Darling</name>
</author>
<author>
<name sortKey="Malfatti, S" uniqKey="Malfatti S">S Malfatti</name>
</author>
<author>
<name sortKey="Swan, Bk" uniqKey="Swan B">BK Swan</name>
</author>
<author>
<name sortKey="Gies, Ea" uniqKey="Gies E">EA Gies</name>
</author>
<author>
<name sortKey="Dodsworth, Ja" uniqKey="Dodsworth J">JA Dodsworth</name>
</author>
<author>
<name sortKey="Hedlund, Bp" uniqKey="Hedlund B">BP Hedlund</name>
</author>
<author>
<name sortKey="Tsiamis, G" uniqKey="Tsiamis G">G Tsiamis</name>
</author>
<author>
<name sortKey="Sievert, Sm" uniqKey="Sievert S">SM Sievert</name>
</author>
<author>
<name sortKey="Liu, Wt" uniqKey="Liu W">WT Liu</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Kyrpides, Nc" uniqKey="Kyrpides N">NC Kyrpides</name>
</author>
<author>
<name sortKey="Stepanauskas, R" uniqKey="Stepanauskas R">R Stepanauskas</name>
</author>
<author>
<name sortKey="Rubin, Em" uniqKey="Rubin E">EM Rubin</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodriguez Valera, F" uniqKey="Rodriguez Valera F">F Rodriguez-Valera</name>
</author>
<author>
<name sortKey="Martin Cuadrado, Ab" uniqKey="Martin Cuadrado A">AB Martin-Cuadrado</name>
</author>
<author>
<name sortKey="Rodriguez Brito, B" uniqKey="Rodriguez Brito B">B Rodriguez-Brito</name>
</author>
<author>
<name sortKey="Pasi, L" uniqKey="Pasi L">L Pasić</name>
</author>
<author>
<name sortKey="Thingstad, Tf" uniqKey="Thingstad T">TF Thingstad</name>
</author>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author>
<name sortKey="Mira, A" uniqKey="Mira A">A Mira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author>
<name sortKey="Edwards, R" uniqKey="Edwards R">R Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roux, S" uniqKey="Roux S">S Roux</name>
</author>
<author>
<name sortKey="Enault, F" uniqKey="Enault F">F Enault</name>
</author>
<author>
<name sortKey="Bronner, G" uniqKey="Bronner G">G Bronner</name>
</author>
<author>
<name sortKey="Vaulot, D" uniqKey="Vaulot D">D Vaulot</name>
</author>
<author>
<name sortKey="Forterre, P" uniqKey="Forterre P">P Forterre</name>
</author>
<author>
<name sortKey="Krupovic, M" uniqKey="Krupovic M">M Krupovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roux, S" uniqKey="Roux S">S Roux</name>
</author>
<author>
<name sortKey="Enault, F" uniqKey="Enault F">F Enault</name>
</author>
<author>
<name sortKey="Hurwitz, Bl" uniqKey="Hurwitz B">BL Hurwitz</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roux, S" uniqKey="Roux S">S Roux</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roux, S" uniqKey="Roux S">S Roux</name>
</author>
<author>
<name sortKey="Hawley, Ak" uniqKey="Hawley A">AK Hawley</name>
</author>
<author>
<name sortKey="Torres Beltran, M" uniqKey="Torres Beltran M">M Torres Beltran</name>
</author>
<author>
<name sortKey="Scofield, M" uniqKey="Scofield M">M Scofield</name>
</author>
<author>
<name sortKey="Schwientek, P" uniqKey="Schwientek P">P Schwientek</name>
</author>
<author>
<name sortKey="Stepanauskas, R" uniqKey="Stepanauskas R">R Stepanauskas</name>
</author>
<author>
<name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saint Girons, I" uniqKey="Saint Girons I">I Saint Girons</name>
</author>
<author>
<name sortKey="Bourhy, P" uniqKey="Bourhy P">P Bourhy</name>
</author>
<author>
<name sortKey="Ottone, C" uniqKey="Ottone C">C Ottone</name>
</author>
<author>
<name sortKey="Picardeau, M" uniqKey="Picardeau M">M Picardeau</name>
</author>
<author>
<name sortKey="Yelton, D" uniqKey="Yelton D">D Yelton</name>
</author>
<author>
<name sortKey="Hendrix, Rw" uniqKey="Hendrix R">RW Hendrix</name>
</author>
<author>
<name sortKey="Glaser, P" uniqKey="Glaser P">P Glaser</name>
</author>
<author>
<name sortKey="Charon, N" uniqKey="Charon N">N Charon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salim, O" uniqKey="Salim O">O Salim</name>
</author>
<author>
<name sortKey="Skilton, Rj" uniqKey="Skilton R">RJ Skilton</name>
</author>
<author>
<name sortKey="Lambden, Pr" uniqKey="Lambden P">PR Lambden</name>
</author>
<author>
<name sortKey="Fane, Ba" uniqKey="Fane B">BA Fane</name>
</author>
<author>
<name sortKey="Clarke, In" uniqKey="Clarke I">IN Clarke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sencilo, A" uniqKey="Sencilo A">A Sencilo</name>
</author>
<author>
<name sortKey="Paulin, L" uniqKey="Paulin L">L Paulin</name>
</author>
<author>
<name sortKey="Kellner, S" uniqKey="Kellner S">S Kellner</name>
</author>
<author>
<name sortKey="Helm, M" uniqKey="Helm M">M Helm</name>
</author>
<author>
<name sortKey="Roine, E" uniqKey="Roine E">E Roine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sims, Ge" uniqKey="Sims G">GE Sims</name>
</author>
<author>
<name sortKey="Jun, Sr" uniqKey="Jun S">SR Jun</name>
</author>
<author>
<name sortKey="Wu, Ga" uniqKey="Wu G">GA Wu</name>
</author>
<author>
<name sortKey="Kim, Sh" uniqKey="Kim S">SH Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sternberg, N" uniqKey="Sternberg N">N Sternberg</name>
</author>
<author>
<name sortKey="Austin, S" uniqKey="Austin S">S Austin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sullivan, Mj" uniqKey="Sullivan M">MJ Sullivan</name>
</author>
<author>
<name sortKey="Petty, Nk" uniqKey="Petty N">NK Petty</name>
</author>
<author>
<name sortKey="Beatson, Sa" uniqKey="Beatson S">SA Beatson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Suttle, Ca" uniqKey="Suttle C">CA Suttle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tadmor, Ad" uniqKey="Tadmor A">AD Tadmor</name>
</author>
<author>
<name sortKey="Ottesen, Ea" uniqKey="Ottesen E">EA Ottesen</name>
</author>
<author>
<name sortKey="Leadbetter, Jr" uniqKey="Leadbetter J">JR Leadbetter</name>
</author>
<author>
<name sortKey="Phillips, R" uniqKey="Phillips R">R Phillips</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weitz, Js" uniqKey="Weitz J">JS Weitz</name>
</author>
<author>
<name sortKey="Poisot, T" uniqKey="Poisot T">T Poisot</name>
</author>
<author>
<name sortKey="Meyer, Jr" uniqKey="Meyer J">JR Meyer</name>
</author>
<author>
<name sortKey="Flores, Co" uniqKey="Flores C">CO Flores</name>
</author>
<author>
<name sortKey="Valverde, S" uniqKey="Valverde S">S Valverde</name>
</author>
<author>
<name sortKey="Sullivan, Mb" uniqKey="Sullivan M">MB Sullivan</name>
</author>
<author>
<name sortKey="Hochberg, Me" uniqKey="Hochberg M">ME Hochberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Whitman, Wb" uniqKey="Whitman W">WB Whitman</name>
</author>
<author>
<name sortKey="Coleman, Dc" uniqKey="Coleman D">DC Coleman</name>
</author>
<author>
<name sortKey="Wiebe, Wj" uniqKey="Wiebe W">WJ Wiebe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wright, Jj" uniqKey="Wright J">JJ Wright</name>
</author>
<author>
<name sortKey="Konwar, Km" uniqKey="Konwar K">KM Konwar</name>
</author>
<author>
<name sortKey="Hallam, Sj" uniqKey="Hallam S">SJ Hallam</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wrighton, K" uniqKey="Wrighton K">K Wrighton</name>
</author>
<author>
<name sortKey="Thomas, B" uniqKey="Thomas B">B Thomas</name>
</author>
<author>
<name sortKey="Sharon, I" uniqKey="Sharon I">I Sharon</name>
</author>
<author>
<name sortKey="Miller, Cs" uniqKey="Miller C">CS Miller</name>
</author>
<author>
<name sortKey="Castelle, Cj" uniqKey="Castelle C">CJ Castelle</name>
</author>
<author>
<name sortKey="Verberkmoes, Nc" uniqKey="Verberkmoes N">NC VerBerkmoes</name>
</author>
<author>
<name sortKey="Wilkins, Mj" uniqKey="Wilkins M">MJ Wilkins</name>
</author>
<author>
<name sortKey="Hettich, Rl" uniqKey="Hettich R">RL Hettich</name>
</author>
<author>
<name sortKey="Lipton, Ms" uniqKey="Lipton M">MS Lipton</name>
</author>
<author>
<name sortKey="Williams, Kh" uniqKey="Williams K">KH Williams</name>
</author>
<author>
<name sortKey="Long, Pe" uniqKey="Long P">PE Long</name>
</author>
<author>
<name sortKey="Banfield, Jf" uniqKey="Banfield J">JF Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yoon, Hs" uniqKey="Yoon H">HS Yoon</name>
</author>
<author>
<name sortKey="Price, Dc" uniqKey="Price D">DC Price</name>
</author>
<author>
<name sortKey="Stepanauskas, R" uniqKey="Stepanauskas R">R Stepanauskas</name>
</author>
<author>
<name sortKey="Rajah, Vd" uniqKey="Rajah V">VD Rajah</name>
</author>
<author>
<name sortKey="Sieracki, Me" uniqKey="Sieracki M">ME Sieracki</name>
</author>
<author>
<name sortKey="Wilson, Wh" uniqKey="Wilson W">WH Wilson</name>
</author>
<author>
<name sortKey="Yang, Ec" uniqKey="Yang E">EC Yang</name>
</author>
<author>
<name sortKey="Duffy, S" uniqKey="Duffy S">S Duffy</name>
</author>
<author>
<name sortKey="Bhattacharya, D" uniqKey="Bhattacharya D">D Bhattacharya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Youle, M" uniqKey="Youle M">M Youle</name>
</author>
<author>
<name sortKey="Haynes, M" uniqKey="Haynes M">M Haynes</name>
</author>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, Y" uniqKey="Zhou Y">Y Zhou</name>
</author>
<author>
<name sortKey="Liang, Y" uniqKey="Liang Y">Y Liang</name>
</author>
<author>
<name sortKey="Lynch, Kh" uniqKey="Lynch K">KH Lynch</name>
</author>
<author>
<name sortKey="Dennis, Jj" uniqKey="Dennis J">JJ Dennis</name>
</author>
<author>
<name sortKey="Wishart, Ds" uniqKey="Wishart D">DS Wishart</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">eLife</journal-id>
<journal-id journal-id-type="hwp">eLife</journal-id>
<journal-id journal-id-type="publisher-id">eLife</journal-id>
<journal-title-group>
<journal-title>eLife</journal-title>
</journal-title-group>
<issn pub-type="ppub">2050-084X</issn>
<issn pub-type="epub">2050-084X</issn>
<publisher>
<publisher-name>eLife Sciences Publications, Ltd</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26200428</article-id>
<article-id pub-id-type="pmc">4533152</article-id>
<article-id pub-id-type="publisher-id">08490</article-id>
<article-id pub-id-type="doi">10.7554/eLife.08490</article-id>
<article-categories>
<subj-group subj-group-type="display-channel">
<subject>Tools and Resources</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Ecology</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Genomics and Evolutionary Biology</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Viral dark matter and virus–host interactions resolved from publicly available microbial genomes</article-title>
</title-group>
<contrib-group>
<contrib id="author-13616" contrib-type="author">
<name>
<surname>Roux</surname>
<given-names>Simon</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="author-notes" rid="pa1"></xref>
<xref ref-type="fn" rid="con1"></xref>
<xref ref-type="fn" rid="conf1"></xref>
</contrib>
<contrib id="author-13631" contrib-type="author">
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
<xref ref-type="aff" rid="aff3">3</xref>
<xref ref-type="other" rid="par-2"></xref>
<xref ref-type="other" rid="par-3"></xref>
<xref ref-type="other" rid="par-4"></xref>
<xref ref-type="other" rid="par-5"></xref>
<xref ref-type="other" rid="par-6"></xref>
<xref ref-type="other" rid="par-7"></xref>
<xref ref-type="fn" rid="con3"></xref>
<xref ref-type="fn" rid="conf1"></xref>
</contrib>
<contrib id="author-13630" contrib-type="author">
<name>
<surname>Woyke</surname>
<given-names>Tanja</given-names>
</name>
<xref ref-type="aff" rid="aff4">4</xref>
<xref ref-type="other" rid="par-8"></xref>
<xref ref-type="fn" rid="con4"></xref>
<xref ref-type="fn" rid="conf1"></xref>
</contrib>
<contrib id="author-13273" contrib-type="author">
<name>
<surname>Sullivan</surname>
<given-names>Matthew B</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
<xref ref-type="author-notes" rid="pa1"></xref>
<xref ref-type="author-notes" rid="pa2"></xref>
<xref ref-type="other" rid="par-1"></xref>
<xref ref-type="fn" rid="con2"></xref>
<xref ref-type="fn" rid="conf1"></xref>
</contrib>
<aff id="aff1">
<label>1</label>
<institution content-type="dept">Department of Ecology and Evolutionary Biology</institution>
,
<institution>University of Arizona</institution>
,
<addr-line>Tucson</addr-line>
,
<country>United States</country>
</aff>
<aff id="aff2">
<label>2</label>
<institution content-type="dept">Department of Microbiology and Immunology</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</aff>
<aff id="aff3">
<label>3</label>
<institution content-type="dept">Graduate Program in Bioinformatics</institution>
,
<institution>University of British Columbia</institution>
,
<addr-line>Vancouver</addr-line>
,
<country>Canada</country>
</aff>
<aff id="aff4">
<label>4</label>
<institution>U.S. Department of Energy Joint Genome Institute</institution>
,
<addr-line>Walnut Creek</addr-line>
,
<country>United States</country>
</aff>
</contrib-group>
<contrib-group>
<contrib id="author-1701" contrib-type="editor">
<name>
<surname>Neher</surname>
<given-names>Richard A</given-names>
</name>
<role>Reviewing editor</role>
<aff>
<institution>Max Planck Institute for Developmental Biology</institution>
,
<country>Germany</country>
</aff>
</contrib>
</contrib-group>
<author-notes>
<corresp id="cor1">
<label>*</label>
For correspondence:
<email>mbsulli@gmail.com</email>
</corresp>
<fn fn-type="present-address" id="pa1">
<label></label>
<p>Department of Microbiology, The Ohio State University, Columbus, United States.</p>
</fn>
<fn fn-type="present-address" id="pa2">
<label></label>
<p>Department of Civil, Environmental, and Geodetic Engineering, Columbus, United States.</p>
</fn>
</author-notes>
<pub-date publication-format="electronic" date-type="pub">
<day>22</day>
<month>7</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>4</volume>
<elocation-id>e08490</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>5</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>7</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© 2015, Roux et al</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Roux et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This article is distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use and redistribution provided that the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="elife08490.pdf"></self-uri>
<abstract>
<p>The ecological importance of viruses is now widely recognized, yet our limited knowledge of viral sequence space and virus–host interactions precludes accurate prediction of their roles and impacts. In this study, we mined publicly available bacterial and archaeal genomic data sets to identify 12,498 high-confidence viral genomes linked to their microbial hosts. These data augment public data sets 10-fold, provide first viral sequences for 13 new bacterial phyla including ecologically abundant phyla, and help taxonomically identify 7–38% of ‘unknown’ sequence space in viromes. Genome- and network-based classification was largely consistent with accepted viral taxonomy and suggested that (i) 264 new viral genera were identified (doubling known genera) and (ii) cross-taxon genomic recombination is limited. Further analyses provided empirical data on extrachromosomal prophages and coinfection prevalences, as well as evaluation of in silico virus–host linkage predictions. Together these findings illustrate the value of mining viral signal from microbial genomes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.001">http://dx.doi.org/10.7554/eLife.08490.001</ext-link>
</p>
</abstract>
<abstract abstract-type="executive-summary">
<title>eLife digest</title>
<p>Viruses are infectious particles that can only multiply inside the cells of microbes and other organisms. Little is known about the genetic differences between virus particles (so-called ‘genetic diversity’), especially compared to what we know about the diversity of bacteria, archaea, and other single-celled microbes. This lack of knowledge hampers our understanding of the role viruses play in the evolution of microbial communities and their associated ecosystems.</p>
<p>Studying the genetics of the viruses in these communities is challenging. There is no single ‘marker’ gene that can be used to identify all viruses in environmental samples. Also, many of the fragments of viral genomes that have been identified have not yet been linked to their host microbes. Many viruses integrate their genome into the DNA of their host cell, and there are computational tools available that exploit this ability to identify viruses and link them to their host. However, other viruses can live and multiply inside cells without integrating their genome into the host's DNA.</p>
<p>Earlier in 2015, researchers developed a new computational tool called VirSorter that can predict virus genome sequences within the DNA extracted from microbes. VirSorter identifies viral genome sequences based on the presence of ‘hallmark’ genes that encode components found in many virus particles, together with a reference database of genomes from many viruses.</p>
<p>Now, Roux et al., including some of the researchers from the earlier work, use VirSorter to predict viral DNA from publicly available bacterial and archaeal genome data. The study identifies over 12,000 viral genomes and links them to their microbial hosts. These data increase the number of viral genome sequences that are publicly available by a factor of ten and identify the first viruses associated with 13 new types of bacteria, which include species that are abundant in particular environments.</p>
<p>It is possible for several different viruses to infect a single cell at the same time. Some viruses are known to be able to exchange DNA, and if this happens frequently in other viruses, it could have a big impact on how viruses evolve. Roux et al.'s findings suggest that although it is common for several different viruses to infect the same cell, it is relatively rare for these viruses to exchange genetic material.</p>
<p>Roux et al.'s findings demonstrate the value of searching publicly available microbial genome data for fragments of viral genomes. These new viral genomes will serve as a useful resource for researchers as they explore the communities of viruses and microbes in natural environments, in the human body and in industrial processes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.002">http://dx.doi.org/10.7554/eLife.08490.002</ext-link>
</p>
</abstract>
<kwd-group kwd-group-type="author-keywords">
<title>Author keywords</title>
<kwd>virus</kwd>
<kwd>phage</kwd>
<kwd>prophage</kwd>
<kwd>virus-host adaptation</kwd>
</kwd-group>
<kwd-group kwd-group-type="research-organism">
<title>Research organism</title>
<kwd>none</kwd>
</kwd-group>
<funding-group>
<award-group id="par-1">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000936</institution-id>
<institution>Gordon and Betty Moore Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>3790</award-id>
<principal-award-recipient>
<name>
<surname>Sullivan</surname>
<given-names>Matthew B</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-2">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000038</institution-id>
<institution>Natural Sciences and Engineering Research Council of Canada (Conseil de Recherches en Sciences Naturelles et en Génie du Canada)</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-3">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000196</institution-id>
<institution>Canada Foundation for Innovation (Fondation canadienne pour l'innovation)</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-4">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100007631</institution-id>
<institution>Canadian Institute for Advanced Research (L'Institut Canadien de Recherches Avancées)</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-5">
<funding-source>
<institution-wrap>
<institution>Tula Foundation</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-6">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100001246</institution-id>
<institution>Ambrose Monell Foundation</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-7">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100001372</institution-id>
<institution>G. Unger Vetlesen Foundation</institution>
</institution-wrap>
</funding-source>
<principal-award-recipient>
<name>
<surname>Hallam</surname>
<given-names>Steven J</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="par-8">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000015</institution-id>
<institution>U.S. Department of Energy (Department of Energy)</institution>
</institution-wrap>
</funding-source>
<award-id>Joint Genome Institute (DE-AC02-05CH11231)</award-id>
<principal-award-recipient>
<name>
<surname>Woyke</surname>
<given-names>Tanja</given-names>
</name>
</principal-award-recipient>
</award-group>
<funding-statement>The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.</funding-statement>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>elife-xml-version</meta-name>
<meta-value>2.3</meta-value>
</custom-meta>
<custom-meta specific-use="meta-only">
<meta-name>Author impact statement</meta-name>
<meta-value>From public microbial genomes, VirSorter revealed 12,498 viral genome sequences that expand the map of the global virosphere and whose analyses improve understanding of viral taxonomy, evolution and virus-host interactions.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Over the past two decades, our collective understanding of microbial diversity has been profoundly expanded by cultivation-independent molecular methods (
<xref rid="bib56" ref-type="bibr">Pace, 1997</xref>
;
<xref rid="bib78" ref-type="bibr">Whitman et al., 1998</xref>
;
<xref rid="bib61" ref-type="bibr">Rappé and Giovannoni, 2003</xref>
;
<xref rid="bib19" ref-type="bibr">DeLong, 2009</xref>
;
<xref rid="bib33" ref-type="bibr">Hanson et al., 2012</xref>
). It is now widely recognized that interconnected microbial communities drive matter and energy transformations in natural and engineered ecosystems (
<xref rid="bib25" ref-type="bibr">Falkowski et al., 2008</xref>
), while also contributing to health and disease states in multicellular hosts (
<xref rid="bib17" ref-type="bibr">Clemente et al., 2012</xref>
). Concomitant with this changing worldview is a growing awareness that viruses modulate microbial interaction networks and long-term evolution with resulting feedbacks on ecosystem functions and services (
<xref rid="bib75" ref-type="bibr">Suttle, 2007</xref>
;
<xref rid="bib65" ref-type="bibr">Rodriguez-Valera et al., 2009</xref>
;
<xref rid="bib29" ref-type="bibr">Forterre and Prangishvili, 2013</xref>
;
<xref rid="bib35" ref-type="bibr">Hurwitz et al., 2013</xref>
;
<xref rid="bib9" ref-type="bibr">Brum et al., 2014</xref>
;
<xref rid="bib10" ref-type="bibr">Brum and Sullivan, 2015</xref>
).</p>
<p>However, our understanding of viral diversity and virus–host interactions remains a major bottleneck in the development of predictive ecosystem models and unifying eco-evolutionary theories. This is because the lack of a universal marker gene for viruses hinders environmental survey capabilities, while the number of isolate viral genomes in databases remains limited: for comparison, more than 25,000 bacterial and archaeal host genomes are available in NCBI RefSeq (January 2015), whereas only 1,531 of their viruses were entirely sequenced and most (86%) of these derive from only 3 of 61 known host phyla (
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
). Thus, although advances in high-throughput sequencing expand the bounds of viral sequence space, these data sets are dominated by uncharacterized sequences (usually 60–95%), termed ‘viral dark matter’ (
<xref rid="bib62" ref-type="bibr">Reyes et al., 2012</xref>
;
<xref rid="bib82" ref-type="bibr">Youle et al., 2012</xref>
;
<xref rid="bib54" ref-type="bibr">Mizuno et al., 2013</xref>
;
<xref rid="bib10" ref-type="bibr">Brum and Sullivan, 2015</xref>
). In the absence of closely related isolates, viral genes and genomes remain unlinked to hosts, which greatly limits ecological and evolutionary inferences.</p>
<p>Alternatively, viral sequence space can be explored in a known host context by revealing putative viral sequences hidden in microbial genomes. Such signal was first analyzed through annotation of prophages—viral genomes integrated in microbial genomes. Numerous tools exist to automatically detect prophages (
<xref rid="bib30" ref-type="bibr">Fouts, 2006</xref>
;
<xref rid="bib48" ref-type="bibr">Lima-Mendez et al., 2008a</xref>
;
<xref rid="bib83" ref-type="bibr">Zhou et al., 2011</xref>
;
<xref rid="bib2" ref-type="bibr">Akhter et al., 2012</xref>
), so prophage diversity and abundance are relatively well studied (
<xref rid="bib15" ref-type="bibr">Casjens, 2003</xref>
;
<xref rid="bib11" ref-type="bibr">Canchaya et al., 2004</xref>
). Early estimations, when only a few hundred bacterial genomes were available, suggested that prophages are common (62% of bacterial genomes tested contained at least one), existing as intact and functional forms or in varying degrees of decay (
<xref rid="bib15" ref-type="bibr">Casjens, 2003</xref>
). Given that tens of thousands more microbial genomes are now publicly available, it is expected that many new prophages and other viral sequences remain to be discovered.</p>
<p>Further, other viral signals might be prevalent in modern microbial genomic data sets. First, certain types of prophage do not integrate into the host genome. These ‘extrachromosomal prophages’ (also termed ‘plasmid prophage’) exist outside the microbial chromosome until induced to undergo lytic replication. These have been known to occur for decades (e.g., coliphage P1,
<xref rid="bib73" ref-type="bibr">Sternberg and Austin, 1981</xref>
), though their abundance in nature is unknown. Second, some phages can enter a ‘chronic’ cycle, in which they replicate in the cell outside of the host chromosome, and produce virions that are extruded without killing their host (
<xref rid="bib1" ref-type="bibr">Abedon, 2009</xref>
;
<xref rid="bib60" ref-type="bibr">Rakonjac et al., 2011</xref>
). Third, a phage ‘carrier state’ has been observed, in which a lytic phage is maintained and multiplied within a cultivated host population without measurable effect on cell growth (
<xref rid="bib7" ref-type="bibr">Bastías et al., 2010</xref>
). This phenomenon is thought to arise due to the presence of both resistant and sensitive cells that frequently transition between these two states. Sometimes also termed ‘partial resistance’, such states that enable the coexistence of phage and host in culture have now been observed in different systems (
<italic>Vibrio</italic>
,
<italic>Escherichia coli</italic>
,
<italic>Salmonella</italic>
,
<italic>Flavobacterium</italic>
), and are linked to slight decreases in growth rate or cell concentration but not to the host cell clearing (i.e., plaque formation) observed for ‘typical’ lytic viruses; they could thus go unnoticed in a microbial cell culture (
<xref rid="bib26" ref-type="bibr">Fischer et al., 2004</xref>
;
<xref rid="bib14" ref-type="bibr">Carey-Smith et al., 2006</xref>
;
<xref rid="bib52" ref-type="bibr">Middelboe et al., 2009</xref>
). All three of these lesser studied types of infection would result in the assembly of viral sequences outside of the main host chromosome in a microbial genome sequencing project and could be a new type of viral signal in modern microbial genomic data sets due to deep sequencing and public release of draft (i.e., not completely assembled) genomic sequences.</p>
<p>Finally, single amplified genome (SAG) data sets, sourced from anonymously sorted, amplified, and sequenced cells, are especially valuable for accessing the vast majority of environmental microbes that remain uncultivated in the lab (
<xref rid="bib64" ref-type="bibr">Rinke et al., 2013</xref>
;
<xref rid="bib40" ref-type="bibr">Kashtan et al., 2014</xref>
). Single-cell amplified genomes can reveal viral sequences directly linked to uncultivated hosts (
<xref rid="bib81" ref-type="bibr">Yoon et al., 2011</xref>
;
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
;
<xref rid="bib44" ref-type="bibr">Labonté et al., 2015</xref>
). When combined with metagenomic sequences, these data provide information on population dynamics, lineage-specific viral-induced mortality rates, relative ratios of prophages and current lytic infections, as well as putative links between viral infection and host metabolic state (
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
;
<xref rid="bib44" ref-type="bibr">Labonté et al., 2015</xref>
). Thus, as microbial genomic data sets evolve from complete genomes to fragmented draft and single-cell genomes, new windows into viral diversity and virus–host interactions are opened.</p>
<p>Here, we applied a recently developed and automated virus discovery pipeline, VirSorter (
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
), to mine the viral signal from 14,977 publicly available bacterial and archaeal genomic data sets. This identified 12,498 high-confidence viral sequences with known hosts, ∼10-fold more than in the RefSeqVirus database, that we then used to expand our understanding of viral diversity and virus–host interactions.</p>
</sec>
<sec id="s2">
<title>Results and discussion</title>
<sec id="s2-1">
<title>New viruses detected in public microbial genomic data sets with VirSorter</title>
<p>VirSorter is designed to predict bacterial and archaeal virus sequences in isolate or single-cell draft genomes, as well as complete genomes (
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
). Briefly, VirSorter identifies viral sequences through (i) statistical enrichment in viral gene content, using a reference database composed of viral genomes of archaeal and bacterial viruses from RefSeq (hereafter named RefSeqABVir for ‘RefSeq Archaea and Bacteria Viruses’) and assembled from viral metagenomes (database ‘Viromes’ in VirSorter), or (ii) a combination of viral ‘hallmark’ gene(s) that code for virion-related functions such as major capsid proteins or terminases (
<xref rid="bib42" ref-type="bibr">Koonin et al., 2006</xref>
;
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
), and at least one viral-like genomic feature: statistical depletion in genes with a hit in the PFAM database, statistical enrichment in uncharacterized genes, short genes, or strand bias (i.e., consecutive genes which tend to be coded on the same strand).</p>
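<p>The two detection routes described above can be read as a simple decision rule. The following is a minimal sketch of that logic only, not VirSorter's actual code: the class name, field names, and boolean encoding of each statistical signal are hypothetical, and the real pipeline computes these signals with its own statistics and thresholds.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass
class RegionEvidence:
    """Hypothetical summary of the signals computed for one genomic region."""
    viral_gene_enrichment: bool = False       # criterion (i)
    hallmark_genes: int = 0                   # capsid- or terminase-like genes
    pfam_depletion: bool = False              # viral-like feature
    uncharacterized_enrichment: bool = False  # viral-like feature
    short_gene_enrichment: bool = False       # viral-like feature
    strand_bias: bool = False                 # viral-like feature


def is_candidate_viral(ev):
    """Return True when criterion (i) or criterion (ii) is met."""
    viral_like_features = [
        ev.pfam_depletion,
        ev.uncharacterized_enrichment,
        ev.short_gene_enrichment,
        ev.strand_bias,
    ]
    criterion_i = ev.viral_gene_enrichment
    # Criterion (ii): hallmark gene(s) plus at least one viral-like feature.
    criterion_ii = ev.hallmark_genes >= 1 and any(viral_like_features)
    return criterion_i or criterion_ii
```
</preformat>
<p>In this sketch a hallmark gene alone, or a viral-like feature alone, is not sufficient; only their combination, or a direct enrichment in viral genes, flags the region.</p>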
<p>Applied to 14,977 publicly available microbial genomes (
<xref ref-type="supplementary-material" rid="SD1-data">Figure 1—source data 1</xref>
), VirSorter identified 12,498 high-confidence viral sequences representing either long genome fragments (>10 kb when linear) or complete genomes (contigs detected as circular). These viral sequences were found in 5492 of the microbial genomes (∼30%). Simply scanning the identified viruses for novel hosts extended the host range of common viral families to now include several recently described phyla like
<italic>Caldiserica</italic>
(formerly known as candidate phylum OP5),
<italic>Marinimicrobia</italic>
(SAR406 also known as Marine Group A), or
<italic>Omnitrophica</italic>
(OP3), in addition to other understudied groups such as
<italic>Poribacteria</italic>
,
<italic>Nitrospinae</italic>
,
<italic>Cloacimonetes</italic>
(WWE1), and Chloroflexi-type SAR202 (
<xref ref-type="fig" rid="fig1">Figure 1</xref>
,
<xref ref-type="supplementary-material" rid="SD2-data">Figure 1—source data 2</xref>
,
<xref ref-type="supplementary-material" rid="SD3-data">Figure 1—source data 3</xref>
). Uncovering the first viruses infecting these major microbial groups is critical given that many candidate phyla are abundant in understudied ecosystems and play substantial roles in coupled biogeochemical cycling (
<xref rid="bib79" ref-type="bibr">Wright et al., 2012</xref>
;
<xref rid="bib80" ref-type="bibr">Wrighton et al., 2012</xref>
;
<xref rid="bib16" ref-type="bibr">Castelle et al., 2013</xref>
;
<xref rid="bib39" ref-type="bibr">Kamke et al., 2013</xref>
;
<xref rid="bib64" ref-type="bibr">Rinke et al., 2013</xref>
;
<xref rid="bib4" ref-type="bibr">Allers et al., 2013b</xref>
;
<xref rid="bib22" ref-type="bibr">Emerson et al., 2015</xref>
).
<fig id="fig1" position="float" orientation="portrait">
<object-id pub-id-type="doi">10.7554/eLife.08490.003</object-id>
<label>Figure 1.</label>
<caption>
<title>Distribution of viral sequences from the VirSorter curated data set across the bacterial and archaeal phylogeny.</title>
<p>For each bacterial or archaeal phylum (or phylum-level group), corresponding viruses in RefSeq (gray) and VirSorter curated data set (red) are indicated with circles proportional to the number of sequences available. Groups for which no viruses were available in RefSeq are highlighted in black.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.003">http://dx.doi.org/10.7554/eLife.08490.003</ext-link>
</p>
<p>
<supplementary-material content-type="local-data" id="SD1-data">
<object-id pub-id-type="doi">10.7554/eLife.08490.004</object-id>
<label>Figure 1—source data 1.</label>
<caption>
<title>List of data sets mined for viral signal.</title>
<p>Bacterial and archaeal genomes searched with VirSorter for viral sequences originated from NCBI RefSeq and WGS, as well as the Microbial Dark Matter data set (MDM,
<xref rid="bib64" ref-type="bibr">Rinke et al., 2013</xref>
) and the SUP05 SAGs data set (
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
).</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.004">http://dx.doi.org/10.7554/eLife.08490.004</ext-link>
</p>
</caption>
<media xlink:href="elife08490s001.xls" mimetype="application" mime-subtype="xls" orientation="portrait" id="d35e664" position="anchor"></media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="SD2-data">
<object-id pub-id-type="doi">10.7554/eLife.08490.005</object-id>
<label>Figure 1—source data 2.</label>
<caption>
<title>New virus–host associations detected in VirSorter sequences.</title>
<p>The star (*) marks the questionable detection of an
<italic>Inoviridae</italic>
genome in a
<italic>Caldiserica</italic>
SAG, which could originate from another bacterium contaminating MDA reagents (see ‘Materials and methods’).</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.005">http://dx.doi.org/10.7554/eLife.08490.005</ext-link>
</p>
</caption>
<media xlink:href="elife08490s002.xls" mimetype="application" mime-subtype="xls" orientation="portrait" id="d35e688" position="anchor"></media>
</supplementary-material>
</p>
<p>
<supplementary-material content-type="local-data" id="SD3-data">
<object-id pub-id-type="doi">10.7554/eLife.08490.006</object-id>
<label>Figure 1—source data 3.</label>
<caption>
<title>Summary table of VirSorter data set sequences.</title>
<p>All sequences currently identified as plasmids on NCBI and which did not display any viral gene in the automatic annotation from NCBI are gathered at the bottom of the table and highlighted in orange. ‘Detection tag’ column indicates how the sequence was detected as viral by VirSorter: ‘hallmark’ for the presence of viral hallmark gene(s), ‘refseq’ for an enrichment in bacterial and archaeal virus genes, ‘noncaudo’ for an enrichment in non-
<italic>Caudovirales</italic>
genes, and ‘vdb’ for an enrichment in virome-like genes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.006">http://dx.doi.org/10.7554/eLife.08490.006</ext-link>
</p>
</caption>
<media xlink:href="elife08490s003.xls" mimetype="application" mime-subtype="xls" orientation="portrait" id="d35e709" position="anchor"></media>
</supplementary-material>
</p>
</caption>
<graphic xlink:href="elife08490f001"></graphic>
<p content-type="supplemental-figure">
<fig id="fig1s1" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.007</object-id>
<label>Figure 1—figure supplement 1.</label>
<caption>
<title>Viral diversity in the VirSorter data set.</title>
<p>The best BLAST hits of predicted proteins along each sequence (i.e., within 75% of the best BLAST hit for this sequence) were used in a Lowest Common Ancestor affiliation (here displayed at the family level). ‘Unclassified
<italic>Caudovirales</italic>
’ gathers viruses only affiliated to the
<italic>Caudovirales</italic>
level without confident affiliation to the
<italic>Myo</italic>
-,
<italic>Sipho</italic>
-, or
<italic>Podoviridae</italic>
. The number and percentage of sequences affiliated is indicated next to each family.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.007">http://dx.doi.org/10.7554/eLife.08490.007</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs001"></graphic>
</fig>
</p>
<p content-type="supplemental-figure">
<fig id="fig1s2" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.008</object-id>
<label>Figure 1—figure supplement 2.</label>
<caption>
<title>Genome map comparison (
<bold>A</bold>
) and recruitment plot (
<bold>B</bold>
) of
<italic>Bacteroidia</italic>
virus sequences from a putative new order.</title>
<p>Replication-associated, Relaxase, and hypothetical proteins are depicted in blue, orange, and gray respectively. The recruitment plot includes two viromes from human feces samples from two different studies (Human gut assembly,
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
, and Human feces,
<xref rid="bib41" ref-type="bibr">Kim et al., 2011</xref>
). Identity percentage is based on a blastn between virome contigs and the reference genome.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.008">http://dx.doi.org/10.7554/eLife.08490.008</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs002"></graphic>
</fig>
</p>
</fig>
</p>
<p>BLAST-based family-level affiliations suggested that 90.45% of these 12,498 sequences correspond to
<italic>Caudovirales</italic>
, 6.82% to ssDNA viruses (predominantly
<italic>Inoviridae</italic>
family), and 2.73% could not be confidently assigned (
<xref ref-type="fig" rid="fig1s1">Figure 1—figure supplement 1</xref>
). Among the unassigned group, 7 sequences lacked any hit to a viral reference genome. These 7 short (4.1 kb) near-identical circular contigs from
<italic>Bacteroides</italic>
draft genomes were detected as viral based on sequence similarity with human gut viromes, but contained two genes associated with plasmid replication (
<xref ref-type="fig" rid="fig1s2">Figure 1—figure supplement 2A</xref>
). This could suggest a plasmid origin, but the high and even coverage of these genomes across several CsCl-purified viromes from different studies (
<xref rid="bib41" ref-type="bibr">Kim et al., 2011</xref>
;
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
) suggests that they are derived from encapsidated particles typical of viruses (
<xref ref-type="fig" rid="fig1s2">Figure 1—figure supplement 2B</xref>
). If confirmed, these sequences would represent the first complete genomes for an entirely new viral order.</p>
</sec>
<sec id="s2-2">
<title>264 new putative viral genera identified through genome-based network clustering</title>
<p>To better determine relationships between viral genomes and host range, we next built a network based on shared gene content to quantify genetic relatedness between the 12,498 sequences identified with VirSorter and the 1,240 taxonomically curated genomes available in RefSeqABVir (
<xref ref-type="fig" rid="fig2s1">Figure 2—figure supplement 1</xref>
and see ‘Materials and methods’). Despite the absence of a universal marker gene, a long history of organizing viral sequence space through genome-to-genome comparison exists using either gene content (
<xref rid="bib66" ref-type="bibr">Rohwer and Edwards, 2002</xref>
;
<xref rid="bib49" ref-type="bibr">Lima-Mendez et al., 2008b</xref>
) or nucleotide composition (
<xref rid="bib72" ref-type="bibr">Sims and Jun, 2009</xref>
;
<xref rid="bib45" ref-type="bibr">Labonté and Suttle, 2013</xref>
). We used MCL (Markov Cluster Algorithm) based on the number of shared genes between sequence pairs as it had been previously shown to accurately recapitulate taxonomic relationships in the
<italic>Caudovirales</italic>
, which dominated our data set (
<xref rid="bib24" ref-type="bibr">Enright et al., 2002</xref>
;
<xref rid="bib49" ref-type="bibr">Lima-Mendez et al., 2008b</xref>
).</p>
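<p>The input to such a clustering is a weighted genome network in which each edge reflects the genes shared by a pair of sequences. As a minimal sketch of that first step, the function below counts shared gene clusters per genome pair. The function name, the toy genome identifiers, and the protein-cluster labels are hypothetical, and raw counts are used here for simplicity; they stand in for whatever edge weight the actual analysis derived from shared genes.</p>
<preformat>
```python
from itertools import combinations


def shared_gene_edges(genome_to_clusters, min_shared=1):
    """Return {(genome_a, genome_b): n_shared} for pairs sharing gene clusters."""
    edges = {}
    for a, b in combinations(sorted(genome_to_clusters), 2):
        n_shared = len(genome_to_clusters[a].intersection(genome_to_clusters[b]))
        if n_shared >= min_shared:
            edges[(a, b)] = n_shared
    return edges


# Toy input: genome identifiers mapped to hypothetical protein-cluster labels.
toy = {
    "phage_A": {"PC1", "PC2", "PC3"},
    "phage_B": {"PC2", "PC3", "PC4"},
    "phage_C": {"PC9"},
}
edges = shared_gene_edges(toy)
```
</preformat>
<p>Here phage_A and phage_B are connected by an edge of weight 2, while phage_C shares no gene cluster and stays unconnected. The resulting weighted edge list is the kind of input a Markov clustering implementation would then partition into VCs.</p>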
<p>Most (99.3% of 12,498) sequences affiliated to one of 614 virus clusters (VCs), of which 535 contained at least one complete genome or large genomic fragment (>30 kb), and approximately half (271 of 535 VCs) included RefSeqABVir sequences (
<xref ref-type="fig" rid="fig2">Figure 2A</xref>
,
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 1</xref>
). Those VCs with RefSeqABVir sequences provided the opportunity to evaluate whether a VC corresponded to any particular taxonomic level of ICTV classification. Of 43 RefSeq-curated viral genera, 27 have all their sequences in the same VC, 12 were spread across two VCs, and 4 were spread across >2 VCs—these latter genera included the Spouna-like viruses (3 VCs), N4-like viruses (4 VCs), Lambda-like viruses (9 VCs), and Inoviruses (11 VCs). Consistent with previous applications of this method, VCs identified in this analysis were thus approximately equivalent to a RefSeq-curated viral genus (
<xref rid="bib49" ref-type="bibr">Lima-Mendez et al., 2008b</xref>
).
<fig id="fig2" position="float" orientation="portrait">
<object-id pub-id-type="doi">10.7554/eLife.08490.009</object-id>
<label>Figure 2.</label>
<caption>
<title>Degree of novelty of viruses detected in VirSorter curated data set.</title>
<p>(
<bold>A</bold>
) Viral clusters (VCs) are considered as putative new genera when including at least one sequence larger than 30 kb, circular, or known to be a complete genome (from RefSeq). These putative genera were considered as ‘new’ when the VC did not include any RefSeq sequence, and ‘known’ otherwise. (
<bold>B</bold>
) The proportion of new VCs (containing no RefSeqABVir), VCs with only one RefSeqABVir sequence, and VCs with more than one RefSeqABVir sequence is displayed for host classes associated with more than 10 viral sequences. Only ‘putative genera’ VCs were considered (i.e., clusters containing a RefSeqABVir genome, a circular sequence, or a sequence with more than 30 predicted genes).</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.009">http://dx.doi.org/10.7554/eLife.08490.009</ext-link>
</p>
<p>
<supplementary-material content-type="local-data" id="SD4-data">
<object-id pub-id-type="doi">10.7554/eLife.08490.010</object-id>
<label>Figure 2—source data 1.</label>
<caption>
<title>Summary table of virus clusters (VCs).</title>
<p>Cluster affiliation is based on the combination of BLAST-based taxonomic affiliation of its members. For VCs with more than 10 proteins, those composed only of VirSorter sequences are highlighted in green and those with only one sequence from RefSeqABVir are marked in blue. Cases where sequences affiliated to both ssDNA and dsDNA viruses are clustered together are highlighted in red. ‘Detection tags’ lists the different detection tags for the cluster members, with ‘NCBI_RefSeq’ for complete genomes from the RefSeq database. These NCBI RefSeq sequences are counted as ‘complete’ in the ‘type of sequences’ column.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.010">http://dx.doi.org/10.7554/eLife.08490.010</ext-link>
</p>
</caption>
<media xlink:href="elife08490s004.xls" mimetype="application" mime-subtype="xls" orientation="portrait" id="d35e886" position="anchor"></media>
</supplementary-material>
</p>
</caption>
<graphic xlink:href="elife08490f002"></graphic>
<p content-type="supplemental-figure">
<fig id="fig2s1" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.011</object-id>
<label>Figure 2—figure supplement 1.</label>
<caption>
<title>Structure of viral sequence space sampled in VirSorter data set.</title>
<p>Network of virus clusters (VCs) based on gene content comparison between viral genome sequences from RefSeqABVir and VirSorter data set. VCs including only VirSorter sequences are highlighted with a black outline. The size of nodes is proportional to the number of sequences in the cluster and the color of the node corresponds to the BLAST-based affiliation (at the family level) of its members when consistent (i.e., agreement between >75% of the cluster members, otherwise clusters are indicated as ‘unaffiliated’).</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.011">http://dx.doi.org/10.7554/eLife.08490.011</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs003"></graphic>
</fig>
</p>
<p content-type="supplemental-figure">
<fig id="fig2s2" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.012</object-id>
<label>Figure 2—figure supplement 2.</label>
<caption>
<title>Benchmarks used to determine the best value for inflation and significance thresholds for virus clustering.</title>
<p>For each pair of values (inflation and significance threshold), the genome network was computed and its overall shape evaluated with ICCC (intra-cluster clustering coefficient). The chosen values are highlighted in green in the table and with a star on the associated plot.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.012">http://dx.doi.org/10.7554/eLife.08490.012</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs004"></graphic>
</fig>
</p>
</fig>
</p>
<p>Given this level of taxonomic resolution and ignoring the 79 VCs that lacked large (>30 kb) genome sequences, we identified a total of 264 new candidate viral genera (i.e., 264 VCs with no sequences from RefSeqABVir,
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 1</xref>
). These 264 candidate genera were derived from both understudied and well-studied hosts (e.g.,
<italic>Gammaproteobacteria</italic>
and
<italic>Bacilli</italic>
,
<xref ref-type="fig" rid="fig2">Figure 2B</xref>
) and included 5 of the 30 highest-membership VCs (
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 1</xref>
), which confirms that our knowledge of viral diversity is limited even in well-studied hosts and with prevalent viruses.</p>
</sec>
<sec id="s2-3">
<title>VirSorter curated data set includes extrachromosomal genomes and improves virome affiliation</title>
<p>Of the 12,498 sequences, 5,232 were prophages (i.e., a viral genome integrated into a microbial contig) and 1,756 were either complete (circularized) or large (>30 kb) genome fragments assembled outside of the host chromosome (i.e., no microbial gene was detected on the contig,
<xref ref-type="fig" rid="fig3">Figure 3A</xref>
,
<xref ref-type="supplementary-material" rid="SD1-data">Figure 1—source data 1</xref>
).
<fig id="fig3" position="float" orientation="portrait">
<object-id pub-id-type="doi">10.7554/eLife.08490.013</object-id>
<label>Figure 3.</label>
<caption>
<title>Extrachromosomal prophages in VirSorter curated data set and improvement in virome affiliation.</title>
<p>(
<bold>A</bold>
) The distribution of VirSorter curated data set as ‘integrated’ (i.e., prophages integrated in the host chromosome), ‘extrachromosomal’ (i.e., >30 kb or circular sequences with no microbial genes), or ‘undetermined’ (<30 kb linear with no microbial genes) is indicated for each host class with at least five VirSorter curated data set sequences. The number of sequences associated with each host class is indicated above the histogram. (
<bold>B</bold>
) Improvement in the proportion of affiliated genes from viromes with VirSorter data set. Predicted genes from the Pacific Ocean Viromes (
<xref rid="bib36" ref-type="bibr">Hurwitz and Sullivan, 2013</xref>
), Tara Ocean Viromes (
<xref rid="bib8" ref-type="bibr">Brum et al., 2015</xref>
), and Human Gut Viromes (
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
) were compared to RefSeqVirus (May 2015) and the VirSorter data set (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq sequence.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.013">http://dx.doi.org/10.7554/eLife.08490.013</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f003"></graphic>
<p content-type="supplemental-figure">
<fig id="fig3s1" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.014</object-id>
<label>Figure 3—figure supplement 1.</label>
<caption>
<title>Contig map of a putative new extrachromosomal prophage.</title>
<p>Contig Spirochaetia_gi_359585655 represents a complete genome (the contig was detected as circular) from a new genus (affiliated to a VC with no RefSeqABVir sequence). Functional affiliation of predicted genes is indicated on the map, with notably two genes (ParA/ParB) indicative of extrachromosomal prophages, as well as two genes (in orange) affiliated to the ACR_tran efflux pump family, of which some members are involved in antibiotic resistance phenotypes. This contig belongs to the virus cluster VC_61, composed of 35 new putative extrachromosomal prophages from different Spirochetes genomes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.014">http://dx.doi.org/10.7554/eLife.08490.014</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs005"></graphic>
</fig>
</p>
</fig>
</p>
<p>To estimate how often a large (>30 kb) genome fragment could be an integrated prophage and not capture the microbial gene content, we simulated the process for 22 different prophage-containing bacterial genomes ‘sequenced’ (in silico) at coverage of 5, 25, 50, 75, and 100× (see ‘Materials and methods’). These analyses suggested that only 2.3% of large (>30 kb) prophage-originating contigs lacked any identifiable microbial genes. Thus, the 1,756 sequences described above are most likely extrachromosomal and so represent a unique data source for quantifying the prevalence of under-studied viral infection modes including chronic infections, lytic viruses, or extrachromosomal prophages.</p>
<p>Although we identified no clear sequence-based marker for the first two infection types, we could conservatively estimate the fraction of extrachromosomal prophages by identifying plasmid partition genes (ParA and ParB,
<xref rid="bib18" ref-type="bibr">Davis et al., 1992</xref>
;
<xref rid="bib32" ref-type="bibr">Saint Girons et al., 2000</xref>
; see
<xref ref-type="fig" rid="fig3s1">Figure 3—figure supplement 1</xref>
for an example of a putative extrachromosomal prophage displaying ParA-ParB genes). These genes were significantly more abundant in the 1,756 circular and large genome fragments than in the rest of the data set (13% vs 1%, respectively; Poisson test p-value < 10
<sup>−05</sup>
). Thus, at least 13% of these sequences appear to be bona fide extrachromosomal prophages, whereas the others might be lytic viruses in ‘carrier’ states, chronic infections, or extrachromosomal prophages without detectable ParA/ParB genes.</p>
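<p>To give a sense of the scale of this enrichment, the comparison can be sketched as a one-sided Poisson test. This is an illustration under stated assumptions rather than the exact test that was run: the observed count is reconstructed by rounding 13% of 1,756, the 1% background rate is taken as the expected rate, and the function name is hypothetical.</p>
<preformat>
```python
import math


def poisson_upper_tail(k_observed, expected, extra_terms=1000):
    """Approximate P(X >= k_observed) for X ~ Poisson(expected).

    Terms are summed in log space over a finite window, so the result is a
    truncated sum; the window is ample when k_observed is above the mean.
    """
    log_terms = [
        -expected + k * math.log(expected) - math.lgamma(k + 1)
        for k in range(k_observed, k_observed + extra_terms)
    ]
    peak = max(log_terms)
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


# Assumed reconstruction of the counts from the reported percentages.
n_extrachromosomal = 1756
observed = round(0.13 * n_extrachromosomal)   # sequences with ParA/ParB
expected = 0.01 * n_extrachromosomal          # count expected at the 1% rate
p_value = poisson_upper_tail(observed, expected)
```
</preformat>
<p>Under these assumptions the tail probability is many orders of magnitude below the reported bound, which is consistent with the enrichment being reported as highly significant.</p>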
<p>Beyond this glimpse into under-studied viral infection modes, these new reference genomes are likely to help improve taxonomic affiliation for the ‘viral dark matter’ in viromes. To quantify this, we added these sequences to the RefSeqABVir database and assigned taxonomy to predicted genes in three large-scale virome data sets available. We found that the VirSorter curated data set improved affiliation by 32 and 40%, respectively, in the marine Pacific Ocean Viromes (POV) (
<xref rid="bib36" ref-type="bibr">Hurwitz and Sullivan, 2013</xref>
) and Tara Oceans Viromes (TOV) (
<xref rid="bib8" ref-type="bibr">Brum et al., 2015</xref>
) data sets, and more than doubled the number of affiliated genes in human gut viromes (
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
,
<xref ref-type="fig" rid="fig3">Figure 3B</xref>
). This particularly strong improvement in the human gut virome affiliation is presumably due to enterobacteria being abundant among current publicly available microbial genomes.</p>
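<p>The affiliation step can be sketched as a filter on BLASTp hits followed by a priority rule. The thresholds below (bit score of 50, e-value of 0.001) are those given in the Figure 3 legend; the tuple layout, the database labels, the function names, and the choice of inclusive comparisons are assumptions made for illustration.</p>
<preformat>
```python
MIN_BIT_SCORE = 50.0   # threshold from the Figure 3 legend
MAX_EVALUE = 0.001     # threshold from the Figure 3 legend


def is_significant(hit):
    """hit is a (database, bit_score, evalue) tuple; this layout is assumed."""
    _, bit_score, evalue = hit
    return bit_score >= MIN_BIT_SCORE and MAX_EVALUE >= evalue


def affiliate(hits):
    """Label one predicted protein from its list of BLASTp hits."""
    databases = {hit[0] for hit in hits if is_significant(hit)}
    # RefSeq takes priority: a protein counts as newly affiliated only when
    # it has no significant RefSeq hit but does match the VirSorter data set.
    if "RefSeq" in databases:
        return "RefSeq"
    if "VirSorter" in databases:
        return "VirSorter only"
    return "unaffiliated"
```
</preformat>
<p>The improvement reported for each virome then corresponds to the proteins labeled ‘VirSorter only’, i.e., those that would have remained unaffiliated against RefSeq alone.</p>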
<p>Finally, both the detection of non-integrated viral genomes and the improved virome affiliation suggest that the VirSorter curated data set includes not only integrated prophage data, but also viruses actively infecting these microbes (i.e., not integrated in the host chromosome and producing virions) with under-studied infection modes.</p>
</sec>
<sec id="s2-4">
<title>Long-term evolutionary patterns of bacterial and archaeal virus genomes</title>
<p>Examination of the VC network beyond classification revealed additional higher order patterns. First, bacterial and archaeal viruses clustered separately in >99% of VCs; the exception (VC_89) included a single and unique (
<xref rid="bib31" ref-type="bibr">Garrett et al., 2010</xref>
) archaeal virus (Hyperthermophilic Archaeal Virus 2, NC_014321) that clustered with 21 bacterial viruses, presumably due to poor archaeal virus representation. Second, >95% of these VCs contained exclusively one nucleic acid type (e.g., DNA or RNA, and dsDNA or ssDNA,
<xref ref-type="fig" rid="fig2s1">Figure 2—figure supplement 1</xref>
), although RNA viral representation is low because only RefSeq-curated families
<italic>Cystoviridae</italic>
and
<italic>Leviviridae</italic>
were available (no RNA viruses were detected with VirSorter, presumably because DNA-based data sets were analyzed)
.
The 15 VCs including both ssDNA and dsDNA viral genomes are either associated with archaeal viruses for which composite ssDNA/dsDNA genomes were already described (2 VCs;
<xref rid="bib71" ref-type="bibr">Sencilo et al., 2012</xref>
) or more surprisingly with ssDNA
<italic>Inoviridae</italic>
, which clustered with
<italic>Caudovirales</italic>
in 13 VCs (
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 1</xref>
). For 9 of these 13
<italic>Inoviridae–Caudovirales</italic>
VCs, some of the sequences were wrongly affiliated due to genes shared by both viral families such as integrases, exonucleases, and replication-associated proteins. Two other VCs corresponded to prophage sequences that include genes similar to
<italic>Inoviridae</italic>
and
<italic>Caudovirales</italic>
and could actually be two different viruses integrated at the same genome location. However, the 2 remaining mixed VCs (VC_128 and VC_215) include sequences displaying a mix of
<italic>Caudovirales</italic>
and
<italic>Inoviridae</italic>
genes (VC_215 sequences also included
<italic>Corticoviridae</italic>
genes). We posit that these might represent new composite genomes beyond the ones already described for archaea viruses (
<xref rid="bib71" ref-type="bibr">Sencilo et al., 2012</xref>
) and the recently discovered RNA–DNA chimeric viruses (
<xref rid="bib21" ref-type="bibr">Diemer and Stedman, 2012</xref>
;
<xref rid="bib67" ref-type="bibr">Roux et al., 2013</xref>
;
<xref rid="bib43" ref-type="bibr">Krupovic et al., 2015</xref>
).</p>
<p>We next evaluated the scale and range of viral co-infection, a phenomenon critical to viral genome evolution and thought to blur the vertical gene inheritance signal used to classify genomes into VCs. Indeed, the idea that super-infection of prophage-containing bacteria would provide genomic proximity for gene acquisition via illegitimate recombination was posited more than a decade ago (
<xref rid="bib55" ref-type="bibr">Mosig, 1998</xref>
). However, viral co-infection rates remain poorly constrained; the only data for natural systems derive from a single large-scale single-cell genomic data set in which ∼35% of infected cells contained multiple viruses (
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
). Here, in the 5492 microbial genomes with detectable viral signal, nearly half (2445) contained more than one detectable virus (
<xref ref-type="fig" rid="fig4">Figure 4</xref>
). Most (∼82%) of these co-infections involved multiple
<italic>Caudovirales</italic>
, as previously observed (
<xref rid="bib15" ref-type="bibr">Casjens, 2003</xref>
). Such co-infection likely provides a mechanism for viral gene exchange and may be more common in some phages displaying rampant mosaicism (e.g., the
<italic>Siphoviridae</italic>
,
<xref rid="bib34" ref-type="bibr">Hendrix et al., 1999</xref>
) than others. The second most commonly observed co-infections (9%) occurred between ssDNA
<italic>Inoviridae</italic>
and dsDNA
<italic>Caudovirales</italic>
(
<xref ref-type="fig" rid="fig4">Figure 4</xref>
). These genomes represented the mixed VCs from the network analyses and putative new composite genomes described above. Mechanistically,
<italic>Inoviridae</italic>
might be more prone to such co-infection due to their long infection cycle whereby they extrude their filamentous virions without killing their host (
<xref rid="bib60" ref-type="bibr">Rakonjac et al., 2011</xref>
), with a dsDNA replication stage (
<xref rid="bib70" ref-type="bibr">Salim et al., 2008</xref>
) that could increase genomic exchanges with co-infecting dsDNA viruses.
<fig id="fig4" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7554/eLife.08490.015</object-id>
<label>Figure 4.</label>
<caption>
<title>Scale and range of co-infection.</title>
<p>(
<bold>A</bold>
) Number of different viral sequences detected by host genome. Numbers are based on the set of microbial genomes with at least one viral sequence detected (5492 genomes). (
<bold>B</bold>
) Affiliation of viruses involved in multiple infections of the same host. Affiliations are deduced from best BLAST hits alongside the viral sequences, as in
<xref ref-type="fig" rid="fig1">Figure 1</xref>
. Co-infections involving dsDNA and ssDNA viruses are highlighted in bold.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.015">http://dx.doi.org/10.7554/eLife.08490.015</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f004"></graphic>
</fig>
</p>
<p>Together, these findings suggest that genome-based network analyses could be used to identify novel viruses, as well as to infer host domain (archaeal or bacterial, >99% accuracy) and nucleic acid type (ssDNA or dsDNA, >95% accuracy). Evolutionarily, we posit that while co-infection by multiple viruses appears common, the consistency of so many VCs with ICTV taxonomy suggests that most phage genomes harbor a largely vertically inherited core gene set as detected for marine T4-like populations (
<xref rid="bib37" ref-type="bibr">Ignacio-Espinoza and Sullivan, 2012</xref>
;
<xref rid="bib51" ref-type="bibr">Marston et al., 2012</xref>
;
<xref rid="bib20" ref-type="bibr">Deng et al., 2014</xref>
) rather than the rampant mosaicism paradigm largely derived from
<italic>Siphoviridae</italic>
genomes (
<xref rid="bib34" ref-type="bibr">Hendrix et al., 1999</xref>
). While data remain limited to a subset of the known microbial phyla, it might be that viral infection modes influence the tempo of their genome evolution. Specifically, we posit that horizontal gene transfer is more prevalent in phages that occupy host cells longer due to lysogenic or chronic infection stages and/or infect densely packed hosts (e.g., biofilms or clumped life stages) as these parameters would increase the probability of co-infection. Perhaps then, at least for more highly lytic viral groups, genome-based clustering approaches can now be leveraged for their taxonomic predictive value as suggested over a decade ago (
<xref rid="bib66" ref-type="bibr">Rohwer and Edwards, 2002</xref>
).</p>
</sec>
<sec id="s2-5">
<title>Global virus–host network is confirmed as modular</title>
<p>Beyond charting diversity and taxonomic affiliation of viral sequence space, the VirSorter data set provided a unique opportunity to explore virus–host interactions. Beyond the above-noted expansion of viruses to novel hosts, we next examined these patterns on a global scale by constructing a virus–host interaction network based on database-available taxa. When considering viral diversity at the genus level, the network displays a modular topology (
<xref ref-type="fig" rid="fig5">Figure 5</xref>
and
<xref ref-type="fig" rid="fig5s1">Figure 5—figure supplement 1</xref>
). Such modularity in virus–host interaction networks suggests that hosts are specifically associated with particular viruses (
<xref rid="bib77" ref-type="bibr">Weitz et al., 2012</xref>
), probably reflecting long-term coevolution between microbial hosts and their viruses. Such modular structure was expected, but not observed in previous virus–host interaction network studies, likely due to the short phylogenetic distances between hosts evaluated in available data sets (
<xref rid="bib27" ref-type="bibr">Flores et al., 2011</xref>
). The modular network presented here derives from a data set spanning 18 phyla across bacterial and archaeal domains. These results confirmed the prediction that ‘at macroevolutionary scales, host–phage interaction matrices should be typified by a modular structure’ (
<xref rid="bib27" ref-type="bibr">Flores et al., 2011</xref>
), as also had been observed across 215 phage types against 286 host types of unknown diversity (
<xref rid="bib28" ref-type="bibr">Flores et al., 2013</xref>
).
<fig id="fig5" position="float" orientation="portrait">
<object-id pub-id-type="doi">10.7554/eLife.08490.016</object-id>
<label>Figure 5.</label>
<caption>
<title>Virus–host network between virus clusters and host classes (matrix visualization).</title>
<p>A cell in the matrix is colored when at least one virus from a virus cluster (VC, rows) was retrieved in a genome from a host class (columns). This virus–host network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. Virus clusters are identified by their number, and their family-level affiliation (based on BLAST-based affiliation of the cluster members) is indicated next to each cluster when available (virus clusters with inconsistent member affiliations are considered ‘unclassified’; affiliations are spread along the x-axis for spacing purposes). Host phylum and class are indicated for each host column, with domains indicated above the corresponding hosts.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.016">http://dx.doi.org/10.7554/eLife.08490.016</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f005"></graphic>
<p content-type="supplemental-figure">
<fig id="fig5s1" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.017</object-id>
<label>Figure 5—figure supplement 1.</label>
<caption>
<title>Virus–host network between virus clusters and host classes (network visualization).</title>
<p>An edge is displayed between a virus cluster (VC) and a host class when at least one virus from this cluster was retrieved in a genome from the host class. This network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. VCs are identified by their number, and their family-level affiliation (based on BLAST-based affiliation of the cluster members) is indicated below each cluster when available (VCs with inconsistent member affiliations are considered ‘unclassified’). Host phylum and class are indicated for each host node, with phyla (when multiple classes from the same phylum are included in the network) and domains indicated above the corresponding host nodes.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.017">http://dx.doi.org/10.7554/eLife.08490.017</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs006"></graphic>
</fig>
</p>
</fig>
</p>
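<p>As an illustration of the modularity index reported for Figure 5, the following minimal Python sketch computes a bipartite modularity Q for a binary virus–host matrix under a given assignment of rows (VCs) and columns (host classes) to modules. It assumes the index is Barber's bipartite modularity, the quantity that lp-Brim is generally described as optimizing, and it does not reproduce the lp-Brim search for the best partition. The function name, the toy matrix, and the module labels are hypothetical and are not taken from the study.</p>
<preformat><![CDATA[
```python
def bipartite_modularity(matrix, row_modules, col_modules):
    """Barber-style bipartite modularity Q for a binary matrix and a fixed partition.

    matrix      : list of rows, 1 where a VC (row) was found in a host class (column)
    row_modules : module label of each row
    col_modules : module label of each column
    """
    n_links = sum(sum(row) for row in matrix)
    if n_links == 0:
        raise ValueError("matrix has no links")
    row_degree = [sum(row) for row in matrix]
    col_degree = [sum(col) for col in zip(*matrix)]
    q = 0.0
    for i, row in enumerate(matrix):
        for j, link in enumerate(row):
            # Only pairs assigned to the same module contribute to Q.
            if row_modules[i] == col_modules[j]:
                q += link - row_degree[i] * col_degree[j] / n_links
    return q / n_links


# Hypothetical toy network: two perfectly separated blocks.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
q_two_modules = bipartite_modularity(toy, [0, 0, 1, 1], [0, 0, 1, 1])
q_one_module = bipartite_modularity(toy, [0, 0, 0, 0], [0, 0, 0, 0])
```
]]></preformat>
<p>In this toy case the two-module partition scores higher than the single-module partition. A significance test along the lines described in the legend would recompute the optimized index on randomly permuted matrices and compare the observed value with that null range.</p>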
</sec>
<sec id="s2-6">
<title>Virus–host adaptation signals detected at the genome composition and codon usage level</title>
<p>Finally, given the number of virus–host linkages revealed by VirSorter, we evaluated the adaptation of viral genome composition within the host milieu, an idea previously explored in the literature with limited genomic information (
<xref rid="bib58" ref-type="bibr">Pride et al., 2006</xref>
;
<xref rid="bib12" ref-type="bibr">Carbone, 2008</xref>
;
<xref rid="bib13" ref-type="bibr">Cardinale and Duffy, 2011</xref>
). To this end, we computed the distance between viral and microbial genomes in terms of mono-, di-, tri-, tetra-nucleotide frequency and codon usage, and compared the distances between the virus and its host vs non-hosts in the data set. Every metric tested displayed a smaller distance between viruses and their hosts than with non-host genomes, with tetranucleotide frequency (TNF) maximizing the host to non-host distances (
<xref ref-type="fig" rid="fig6">Figure 6</xref>
).
<fig id="fig6" position="float" orientation="portrait">
<object-id pub-id-type="doi">10.7554/eLife.08490.018</object-id>
<label>Figure 6.</label>
<caption>
<title>Adaptation of viral genome composition and codon usage to the host genome.</title>
<p>K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and different subsets of the viral sequences (all sequences, by type, and by taxonomy). Only families with more than 5 genomes are displayed (although it should be noted that the VirSorter data set includes only 6
<italic>Microviridae</italic>
sequences). The number of sequences in each category is indicated in brackets. Distributions used to compute distances are displayed in
<xref ref-type="fig" rid="fig6s1">Figure 6—figure supplement 1</xref>
.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.018">http://dx.doi.org/10.7554/eLife.08490.018</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f006"></graphic>
<p content-type="supplemental-figure">
<fig id="fig6s1" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.019</object-id>
<label>Figure 6—figure supplement 1.</label>
<caption>
<title>(
<bold>A</bold>
) K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and different subsets of the viral sequences (based on the number of tRNA genes detected).</title>
<p>The number of sequences in each category is indicated below the number of tRNA. (
<bold>B</bold>
) Distribution of k-mer distances between viral and cellular genomes and codon usage adaptation index for host, host genus, host family, and non-host (different order) genomes. For each viral genome, the distance to the host is displayed, as well as 10 randomly taken distances to genomes from each category and different subsets of the viral sequences (by taxonomy on the left column, and by number of tRNA genes on the right column).</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.019">http://dx.doi.org/10.7554/eLife.08490.019</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs007"></graphic>
</fig>
</p>
<p content-type="supplemental-figure">
<fig id="fig6s2" specific-use="child-fig" orientation="portrait" position="anchor">
<object-id pub-id-type="doi">10.7554/eLife.08490.020</object-id>
<label>Figure 6—figure supplement 2.</label>
<caption>
<title>Distance between k-mer frequency vectors of virus genome subsamples and host genomes for
<italic>Caudovirales</italic>
.</title>
<p>Viral genomes (1000) were randomly sub-sampled at different sizes (from 2000 to 20,000 bp). Only
<italic>Caudovirales</italic>
genomes were selected for this subsample analysis. For each size of k-mer, the result of a linear regression of distance between host or non-host and viral subsample size is indicated. The same distances for the
<italic>Microviridae</italic>
and
<italic>Inoviridae</italic>
(taken from
<xref ref-type="fig" rid="fig6">Figure 6A</xref>
) are indicated for comparison, and associated with the size of the reference genome of each group (
<italic>Enterobacteria</italic>
phage phiX174 and
<italic>Enterobacteria</italic>
phage M13). For clarity's sake, the almost-identical values for 2-mer, 3-mer, and 4-mer for
<italic>Microviridae</italic>
are slightly horizontally shifted.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.020">http://dx.doi.org/10.7554/eLife.08490.020</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490fs008"></graphic>
</fig>
</p>
</fig>
</p>
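<p>The comparison summarized in Figure 6 can be sketched as follows, assuming that ‘K–S distance’ refers to the two-sample Kolmogorov–Smirnov statistic, i.e., the largest gap between the empirical cumulative distributions of virus–host and virus–non-host distances. The function name and the toy values are hypothetical and only illustrate the calculation.</p>
<preformat><![CDATA[
```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Hypothetical composition distances (arbitrary units).
virus_to_host = [0.0004, 0.0006, 0.0007, 0.0009]
virus_to_non_host = [0.0015, 0.0020, 0.0024, 0.0031]
separation = ks_distance(virus_to_host, virus_to_non_host)
```
]]></preformat>
<p>A value near 1 means that the two distributions barely overlap, which is the situation in which a composition metric discriminates hosts from non-hosts most clearly.</p>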
<p>Among dsDNA viruses, host-correlated genome composition patterns were robust across integrated prophages and extrachromosomal genomes (i.e., viral sequences assembled outside of the main host chromosome,
<xref ref-type="fig" rid="fig6">Figure 6A</xref>
,
<xref ref-type="fig" rid="fig6s1">Figure 6—figure supplement 1</xref>
). Our expectations were that prophages would be largely optimized towards the genome of their host, but that genome composition of the extrachromosomal category would be less correlated. Particularly, as cyanophage host range breadth scales with the number of tRNA genes encoded by the virus (
<xref rid="bib23" ref-type="bibr">Enav et al., 2012</xref>
), we expected that genome composition of viral genomes with many tRNA genes might have poor correlation to that of their host genomes, assuming that the viral-encoded tRNA genes could compensate for codon mismatches across hosts. However, these latter expectations were not met as viral and host genome composition correlations were strong regardless of the number of viral-encoded tRNA genes (
<xref ref-type="fig" rid="fig6s1">Figure 6—figure supplement 1</xref>
), which suggests that host-optimized viral genome composition may be beneficial even when the virus encodes its own tRNA genes.</p>
<p>Among ssDNA viruses, nucleotide composition of viral genomes was also correlated to host genomes, but less so than for dsDNA viruses (
<xref ref-type="fig" rid="fig6">Figure 6A</xref>
). This contrasts with a previous analysis of 500 phage genomes that did not detect any difference between dsDNA and ssDNA genomes adaptations to their host's genome (
<xref rid="bib13" ref-type="bibr">Cardinale and Duffy, 2011</xref>
). One ssDNA viral group, the
<italic>Microviridae</italic>
, had a reduced signal for genome composition metrics except for codon usage where its signal was comparable to that of the dsDNA viruses (
<xref ref-type="fig" rid="fig6">Figure 6A</xref>
). Although this could indicate a bias linked to the small genome size of these viruses (around 5 kb), dsDNA viruses' genomes subsampled to similar sizes displayed a minimal signal loss (
<xref ref-type="fig" rid="fig6s2">Figure 6—figure supplement 2</xref>
), which suggests other mechanisms may be driving this lower genome composition adaptation in
<italic>Microviridae</italic>
. Another ssDNA group, the
<italic>Inoviridae</italic>
, had reduced genome composition and codon usage adaptation signals. Again, because
<italic>Inoviridae</italic>
release virions without killing their hosts, it is possible that the virus is exposed to host resources over a much longer time interval, lowering the selection pressure toward transcription and translation speed and efficiency, which is the main mechanism thought to drive genome composition and codon usage adaptation of viral genomes (
<xref rid="bib13" ref-type="bibr">Cardinale and Duffy, 2011</xref>
).</p>
<p>Pragmatically, to assess whether this signal could be used to predict the host of a new virus, we calculated the distance based on TNF vectors between each VirSorter curated data set sequence and the 14,977 microbial genomes. The taxonomy of the microbial genome with the lowest distance to the viral sequence (i.e., the predicted host) was then compared to the taxonomy of the actual host (i.e., the genomic data set in which the viral sequence was identified). When the host database included all host genomes, this host prediction was 99% accurate at both the family and genus level for virus–host TNF distances lower than 4 × 10
<sup>−04</sup>
, 88%/51% (family/genus level) for TNF distances ranging between 4 × 10
<sup>−04</sup>
and 1 × 10
<sup>−03</sup>
, and 70%/37% for distances greater than 1 × 10
<sup>−03</sup>
(
<xref ref-type="table" rid="tbl1">Table 1</xref>
). When genomes from the actual host species were excluded, the accuracy of host prediction dropped slightly (95%, 83%/30%, 67%/30% for the same distance ranges), and dropped further when all genomes from the host genus were excluded (68% and 38% at the family level for the two higher distance ranges; no correct genus could be predicted in that case, and only one distance lower than 4 × 10
<sup>−04</sup>
was observed,
<xref ref-type="table" rid="tbl1">Table 1</xref>
). Hence, TNF comparison provides a promising in silico approach to link new viral genomes to hosts at different levels of accuracy within the taxonomic hierarchy when the suitable host reference genome is available.
<table-wrap id="tbl1" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7554/eLife.08490.021</object-id>
<label>Table 1.</label>
<caption>
<p>Accuracy of host prediction based on distance (d) between tetranucleotide frequencies of viral and microbial genomes</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.021">http://dx.doi.org/10.7554/eLife.08490.021</ext-link>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2" colspan="1"></th>
<th rowspan="2" colspan="1">Predicted</th>
<th colspan="2" rowspan="1">Host order</th>
<th colspan="2" rowspan="1">Host family</th>
<th colspan="2" rowspan="1">Host genus</th>
</tr>
<tr>
<th rowspan="1" colspan="1">Correct</th>
<th rowspan="1" colspan="1">Ratio (%)</th>
<th rowspan="1" colspan="1">Correct</th>
<th rowspan="1" colspan="1">Ratio (%)</th>
<th rowspan="1" colspan="1">Correct</th>
<th rowspan="1" colspan="1">Ratio (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="8" rowspan="1">All reference sequences</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> d &lt; 4 × 10
<sup>−04</sup>
</td>
<td rowspan="1" colspan="1">98</td>
<td rowspan="1" colspan="1">97</td>
<td style="background:silver" rowspan="1" colspan="1">98.98</td>
<td rowspan="1" colspan="1">97</td>
<td style="background:silver" rowspan="1" colspan="1">98.98</td>
<td rowspan="1" colspan="1">97</td>
<td style="background:silver" rowspan="1" colspan="1">98.98</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 4 × 10
<sup>−04</sup>
≤ d &lt; 1 × 10
<sup>−03</sup>
</td>
<td rowspan="1" colspan="1">10,173</td>
<td rowspan="1" colspan="1">9361</td>
<td style="background:silver" rowspan="1" colspan="1">92.02</td>
<td rowspan="1" colspan="1">8971</td>
<td style="background:silver" rowspan="1" colspan="1">88.18</td>
<td rowspan="1" colspan="1">5261</td>
<td rowspan="1" colspan="1">51.72</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 1 × 10
<sup>−03</sup>
≤ d</td>
<td rowspan="1" colspan="1">2508</td>
<td rowspan="1" colspan="1">1872</td>
<td rowspan="1" colspan="1">74.64</td>
<td rowspan="1" colspan="1">1757</td>
<td rowspan="1" colspan="1">70.06</td>
<td rowspan="1" colspan="1">917</td>
<td rowspan="1" colspan="1">36.56</td>
</tr>
<tr>
<td colspan="8" rowspan="1">Host species excluded</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> d &lt; 4 × 10
<sup>−04</sup>
</td>
<td rowspan="1" colspan="1">21</td>
<td rowspan="1" colspan="1">20</td>
<td style="background:silver" rowspan="1" colspan="1">95.24</td>
<td rowspan="1" colspan="1">20</td>
<td style="background:silver" rowspan="1" colspan="1">95.24</td>
<td rowspan="1" colspan="1">20</td>
<td style="background:silver" rowspan="1" colspan="1">95.24</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 4 × 10
<sup>−04</sup>
≤ d &lt; 1 × 10
<sup>−03</sup>
</td>
<td rowspan="1" colspan="1">10,003</td>
<td rowspan="1" colspan="1">9067</td>
<td style="background:silver" rowspan="1" colspan="1">90.64</td>
<td rowspan="1" colspan="1">8372</td>
<td style="background:silver" rowspan="1" colspan="1">83.69</td>
<td rowspan="1" colspan="1">2992</td>
<td rowspan="1" colspan="1">29.91</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 1 × 10
<sup>−03</sup>
≤ d</td>
<td rowspan="1" colspan="1">2755</td>
<td rowspan="1" colspan="1">1981</td>
<td rowspan="1" colspan="1">71.91</td>
<td rowspan="1" colspan="1">1840</td>
<td rowspan="1" colspan="1">66.79</td>
<td rowspan="1" colspan="1">818</td>
<td rowspan="1" colspan="1">29.69</td>
</tr>
<tr>
<td colspan="8" rowspan="1">Host genus excluded</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> d &lt; 4 × 10
<sup>−04</sup>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0.00</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0.00</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0.00</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 4 × 10
<sup>−04</sup>
≤ d &lt; 1 × 10
<sup>−03</sup>
</td>
<td rowspan="1" colspan="1">9085</td>
<td rowspan="1" colspan="1">7303</td>
<td style="background:silver" rowspan="1" colspan="1">80.39</td>
<td rowspan="1" colspan="1">6181</td>
<td rowspan="1" colspan="1">68.04</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0.00</td>
</tr>
<tr>
<td rowspan="1" colspan="1"> 1 × 10
<sup>−03</sup>
≤ d</td>
<td rowspan="1" colspan="1">3693</td>
<td rowspan="1" colspan="1">1768</td>
<td rowspan="1" colspan="1">47.87</td>
<td rowspan="1" colspan="1">1388</td>
<td rowspan="1" colspan="1">37.58</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0.00</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>For each viral genome, the order, family, and genus of its host were predicted from the taxonomy of the closest microbial genome (based on the mean absolute difference between tetranucleotide frequency vectors) and compared to the order, family, and genus of the actual host (i.e., the taxonomy of the genome with which the virus was identified). These predictions were computed with (i) all microbial genomes, (ii) excluding specifically all genomes from the host species, and (iii) excluding all genomes from the host genus. Cases with over 75% of prediction accuracy are highlighted in gray.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
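<p>The host prediction evaluated in Table 1 can be sketched as below: each genome is reduced to a vector of tetranucleotide frequencies, and the predicted host is the reference genome with the smallest mean absolute difference to the viral vector, as stated in the table footnote. This is a minimal sketch under stated assumptions: k-mers are counted on a single strand and windows containing ambiguous bases are skipped (neither choice is specified in the text), and all function names and toy sequences are hypothetical.</p>
<preformat><![CDATA[
```python
from itertools import product

# All 256 tetranucleotides in a fixed order, with their vector positions.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tnf_vector(sequence):
    """Relative frequency of each tetranucleotide (single strand; assumption)."""
    counts = [0] * len(KMERS)
    sequence = sequence.upper()
    for i in range(len(sequence) - 3):
        position = INDEX.get(sequence[i:i + 4])
        if position is not None:  # skip windows containing N or other symbols
            counts[position] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence too short or without valid tetranucleotides")
    return [count / total for count in counts]


def tnf_distance(vector_a, vector_b):
    """Mean absolute difference between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(vector_a, vector_b)) / len(vector_a)


def predict_host(viral_sequence, host_genomes):
    """Return (distance, name) of the closest reference genome."""
    viral = tnf_vector(viral_sequence)
    return min(
        (tnf_distance(viral, tnf_vector(genome)), name)
        for name, genome in host_genomes.items()
    )


# Hypothetical toy data, far shorter than real genomes.
hosts = {
    "at_rich_host": "ATATATATTTAAATAT" * 5,
    "gc_rich_host": "GCGCGGCCGCGCCGGC" * 5,
}
distance, predicted = predict_host("ATTATAATATTTAAAT" * 3, hosts)
```
]]></preformat>
<p>In practice the returned distance would be compared with the ranges of Table 1 to decide how much confidence the prediction deserves at the order, family, or genus level.</p>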
</sec>
<sec id="s2-7">
<title>Data set availability</title>
<p>As evidenced by the improvement in virome taxonomic affiliation (
<xref ref-type="fig" rid="fig3">Figure 3B</xref>
), the VirSorter curated data set should represent a useful reference data set for future virome studies. This data set also likely harbors novel biology beyond the global patterns of viral diversity and virus–host interactions presented in this manuscript, to be revealed through analyses targeted toward specific viral or host subgroups. To facilitate these follow-up studies, the VirSorter curated data set is made available through two complementary websites: MetaVir and iVirus. MetaVir (project ‘VirSorter’, data set ‘VirSorter curated data set’) provides an automatic annotation of each sequence, with multiple visualization tools to explore and compare genome maps, as well as multiple ways of searching the data (by host, by phage affiliation, by taxonomic or functional affiliation of predicted genes, etc.) and extracting a specific subset of interest (these tools are under the tab ‘Contig maps’). Nucleotide sequences from the VirSorter curated data set are also hosted at iVirus, alongside the viral clusters annotation and network (as Cytoscape-ready text files), the virus–host matrix, and the complete list of viral sequence predictions in the 14,977 archaeal and bacterial genomic data sets, including the category 3 predictions that are not in the VirSorter curated data set (
<ext-link ext-link-type="uri" xlink:href="http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/ivirus/VirSorter_curated_dataset">http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/ivirus/VirSorter_curated_dataset</ext-link>
). Finally, a summary of the sequences and clusters is provided as
<xref ref-type="supplementary-material" rid="SD1-data">Figure 1—source data 1</xref>
and
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 1</xref>
, and a Data Dryad package including all annotated GenBank files from the VirSorter curated data set is available (
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5061/dryad.b8226">http://dx.doi.org/10.5061/dryad.b8226</ext-link>
;
<xref rid="bib84" ref-type="bibr">Roux et al., 2015b</xref>
).</p>
</sec>
<sec id="s2-8">
<title>Conclusion</title>
<p>While recent advances in high-throughput sequencing and viral metagenomics continue to expand the bounds of viral sequence space (e.g.,
<xref rid="bib62" ref-type="bibr">Reyes et al., 2012</xref>
;
<xref rid="bib54" ref-type="bibr">Mizuno et al., 2013</xref>
;
<xref rid="bib10" ref-type="bibr">Brum and Sullivan, 2015</xref>
), such viruses are typically unlinked to cognate hosts, severely limiting ecological and evolutionary inferences. Concurrently, emerging methods provide new virus–host linkage capabilities, but do not scale well with increasing data set size and complexity (e.g.,
<xref rid="bib5" ref-type="bibr">Andersson and Banfield, 2008</xref>
;
<xref rid="bib76" ref-type="bibr">Tadmor et al., 2011</xref>
;
<xref rid="bib3" ref-type="bibr">Allers et al., 2013a</xref>
;
<xref rid="bib20" ref-type="bibr">Deng et al., 2014</xref>
). Here, the mining of publicly available microbial genomic data proved to be a useful complement to these approaches as it enables the exploration of host-linked viral diversity. The resulting viral sequences hidden in microbial genomes represent a powerful data set, increasing the number of known, host-linked viruses by an order of magnitude, with analyses of these data elucidating viral dark matter in ocean and human gut viromes, as well as augmenting our understanding of viral taxonomy, viral genome evolution, and virus–host interactions on multiple fronts. While this current VirSorter data set remains limited by the cultivation bias inherent in the publicly available complete and draft microbial genomes, such bias will progressively be eliminated as SAGs are used to better map microbial dark matter (e.g.,
<xref rid="bib64" ref-type="bibr">Rinke et al., 2013</xref>
). Such a drastically improved map of the virosphere, together with advances in experimental approaches and theory (
<xref rid="bib10" ref-type="bibr">Brum and Sullivan, 2015</xref>
), will help reveal the eco-evolutionary forces shaping virus–host interactions across diverse ecosystems and eventually shift our inference capability from observation to prediction.</p>
</sec>
</sec>
<sec sec-type="materials|methods" id="s3">
<title>Materials and methods</title>
<sec id="s3-1">
<title>Application of VirSorter to public bacterial and archaeal genomes</title>
<p>A total of 14,977 bacterial and archaeal genomes (complete and draft) included in RefSeq and WGS databases (
<xref rid="bib59" ref-type="bibr">Pruitt et al., 2009</xref>
) were downloaded from the NCBI ftp website in March 2014 (RefSeq release 64). Raw reads from the 264 SAGs of new candidate phyla (‘Microbial Dark Matter’;
<xref rid="bib64" ref-type="bibr">Rinke et al., 2013</xref>
) were downloaded from the JGI portal page and assembled with SPAdes Genome Assembler (
<xref rid="bib6" ref-type="bibr">Bankevich et al., 2012</xref>
) (default parameters). Finally, 127 SUP05 SAGs that we previously analyzed manually were added to the cellular genome pool (
<xref rid="bib69" ref-type="bibr">Roux et al., 2014</xref>
). This data set included 4240 complete genomes and 10,547 draft genomes (as there is no clear annotation of a genome as ‘draft’ or ‘complete’ at the NCBI, we identified as ‘draft’ genomes all genome projects including more than 5 different sequences, to avoid considering genomes split into different chromosomes or including one or several plasmids as ‘draft’).</p>
<p>Genomes were processed with VirSorter (
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
) separately for each class (except for
<italic>Cyanobacteria,</italic>
SUP05 SAGs
<italic>,</italic>
and the Microbial Dark Matter data set that were all processed together), first using the RefSeqABVir database, and then using the Viromes database, yielding 89,301 total predicted viral sequences. Among these, 938 correspond to Enterobacteria phage PhiX174, which is used for quality control during Illumina sequencing, and were thus discarded.</p>
</sec>
<sec id="s3-2">
<title>Selection of a relevant subset of viral sequences: the VirSorter data set</title>
<p>We focused on a subset of the putative viral sequences extracted from RefSeq, WGS and the Microbial Dark Matter and SUP05 SAGs (89,301 sequences), and targeted the active prophages and lytic virus signatures. To this end, we discarded all predictions lacking a viral hallmark gene or a viral gene enrichment (i.e., category 3 predictions,
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
), and all prophage detections displaying viral gene enrichment only and lacking viral hallmark genes, as these are likely defective prophages for which boundaries are difficult to predict in silico and that often include bacterial genes. We next removed all linear sequences shorter than 10 kb except for sequences detected with the non-
<italic>Caudovirales</italic>
score where a threshold of 5 kb was used, as these viruses can frequently have genomes smaller than 10 kb. We also discarded all circular contigs (which should represent complete genomes) smaller than 3 kb as these are likely short repeat regions (the smallest known genome for a bacterial or archaeal virus is ∼5 kb). The resulting 13,391 sequences were then manually curated to remove false positives. These false positives corresponded to defective prophages (most of which are expected to be smaller than 10 kb), plasmid-like sequences, GTA gene clusters, and low complexity regions. In addition, this manual curation step allowed us to adjust the boundaries of some prophage predictions and/or modify the prophage vs complete viral contig automatic prediction. Consequently, 892 sequences were discarded (false-positive rate of 6.7%), leaving 12,498 curated sequences.</p>
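<p>The automatic part of the selection described above (before manual curation) can be summarized by the following sketch. The record fields and the function name are hypothetical and do not correspond to the actual VirSorter output format; the length thresholds and rules are those given in the text.</p>
<preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical summary of one predicted viral sequence."""
    length: int             # sequence length in bp
    category: int           # confidence category; 3 = no hallmark gene or enrichment
    is_prophage: bool       # detected within a cellular genome fragment
    has_hallmark: bool      # at least one viral hallmark gene
    circular: bool          # circular contig, assumed to be a complete genome
    non_caudovirales: bool  # detected with the non-Caudovirales score


def passes_filters(p):
    """Apply the selection rules stated in the text, in order."""
    if p.category == 3:
        return False                    # lowest-confidence predictions
    if p.is_prophage and not p.has_hallmark:
        return False                    # likely defective prophages
    if p.circular:
        return p.length >= 3000         # shorter circles are likely repeats
    minimum = 5000 if p.non_caudovirales else 10000
    return p.length >= minimum          # linear sequences


examples = [
    Prediction(12000, 1, False, True, False, False),  # kept: long linear contig
    Prediction(8000, 1, False, True, False, False),   # dropped: linear, under 10 kb
    Prediction(6000, 2, False, False, False, True),   # kept: non-Caudovirales, over 5 kb
    Prediction(3500, 1, False, True, True, False),    # kept: circular, over 3 kb
    Prediction(40000, 2, True, False, False, False),  # dropped: prophage, no hallmark
    Prediction(50000, 3, False, False, False, False), # dropped: category 3
]
kept = [passes_filters(p) for p in examples]
```
]]></preformat>
<p>Sequences passing these rules would still go through the manual curation step, which removed the remaining false positives and adjusted prophage boundaries.</p>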
<p>Among these, 7266 sequences were entirely viral (thus potentially representing lytic, chronic, or extrachromosomal lysogenic infections assembled in draft genomes), and 5232 were prophages (viral-like regions detected within a cellular genome fragment). Among the sequences detected as entirely viral, 6 were tagged in the NCBI database as bacteriophages, and 108 as plasmids. Viruses and plasmids can be difficult to distinguish, as gene exchange is known to occur between these two types of mobile genetic elements (
<xref rid="bib47" ref-type="bibr">Leplae et al., 2010</xref>
). Here, 84 out of these 108 ‘plasmid’ sequences displayed conclusive evidence of a viral origin as they contained viral hallmark genes (coding for terminase large subunits or major capsid proteins) and originated from draft unpublished genomes; they were hence likely named ‘plasmid’ because they formed extrachromosomal circular assemblies (see, e.g., sequence gi:383080718 available at RefSeq). The 24 others were more ambiguous (highlighted in orange in
<xref ref-type="supplementary-material" rid="SD1-data">Figure 1—source data 1</xref>
) since the automatic annotation from NCBI did not display any viral-like gene, yet these sequences all displayed statistical viral-like gene enrichment, and as such were maintained in the VirSorter curated data set.</p>
<p>Finally, one additional ambiguous sequence, considered as entirely viral by VirSorter, was detected in the
<italic>Caldiserica</italic>
SAG (Caldiserica_bacterium_sp_JGI_0000059-M03_ID_3757). Even though this sequence looks indeed like a complete
<italic>Inoviridae</italic>
genome, it displayed a high level of similarity (99% identity) to the complete genome of
<italic>Delftia acidovorans</italic>
SPH-1 (gi:160361034, from coordinates 2300885 to 2307389). Such high similarity with another virus is suspicious, as well as the fact that the matching genome is
<italic>Delftia</italic>
, a bacterium known to contaminate some MDA reagents. This sequence was maintained in the VirSorter data set as there is no definitive proof of contamination, but the existence of a Caldiserica-infecting
<italic>Inoviridae</italic>
should be considered as uncertain until further evidence is available (and is displayed as such in
<xref ref-type="supplementary-material" rid="SD4-data">Figure 2—source data 2</xref>
).</p>
</sec>
<sec id="s3-3">
<title>Protein and virus clustering of the VirSorter curated data set</title>
<p>The pool of 450,047 proteins predicted from the 12,498 viral sequences was clustered with all proteins from RefSeq and the viral metagenomes (i.e., sequences from the Viromes database) with MCL based on reciprocal best BLAST hit (threshold of 50 on score and 0.001 on e-value,
<xref rid="bib24" ref-type="bibr">Enright et al., 2002</xref>
). Most of these sequences (423,618) could be included in 22,460 protein clusters (PCs). About a third (7742) of these PCs also contained sequences from the RefSeqABVir database, and the remainder formed new PCs.</p>
<p>This protein clustering was then used to cluster genomes as in Lima-Mendez et al. (
<xref rid="bib24" ref-type="bibr">Enright et al., 2002</xref>
;
<xref rid="bib49" ref-type="bibr">Lima-Mendez et al., 2008b</xref>
). Briefly, the number of shared PCs between each pair of sequences (either RefSeq or VirSorter) is computed, and a significance value is deduced by comparing it to an expected number of shared PCs (modeled with a hypergeometric formula taking into account the number of genes of both sequences).</p>
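<p>As an illustration of the hypergeometric comparison described above, the following minimal Python sketch computes the probability that two sequences share at least a given number of PCs by chance. The function names, the use of the total number of PCs as the population size, and the score defined as minus the base-10 logarithm of the probability multiplied by the number of pairwise comparisons are assumptions made for illustration; they are not taken from the scripts used in this study.</p>
<preformat>
```python
from math import comb, log10


def shared_pc_pvalue(n_total, n_a, n_b, shared):
    """P(X >= shared) when X follows a hypergeometric distribution.

    n_total: number of PCs in the whole data set (assumed population size)
    n_a, n_b: number of PCs found in sequence A and in sequence B
    shared: number of PCs observed in both sequences
    """
    upper = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / denominator


def significance(n_total, n_a, n_b, shared, n_comparisons):
    """Hypothetical score: -log10(p-value x number of comparisons)."""
    expected = shared_pc_pvalue(n_total, n_a, n_b, shared) * n_comparisons
    return float("inf") if expected == 0 else -log10(expected)
```
</preformat>
<p>Under these assumptions, an edge would be drawn between two sequences when the score reaches the chosen significance threshold.</p>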
<p>We used ICCC (intracluster clustering coefficient, which estimates cluster homogeneity by measuring around each node how many of its neighbors are part of the same cluster) to determine the best inflation value (from 1.5 to 5 by 0.25 increments) and significance threshold (i.e., which minimum significance was required to draw an edge between two sequences, from 1 to 50). As expected, the number of VCs formed increased with inflation and with significance. ICCC was clearly higher with the lowest significance threshold (sig ≥ 1), regardless of the inflation value used. For the lowest significance threshold, ICCC increased with inflation, usually with a first small peak around 2 and a plateau around 4. These different values of inflation did not have a major impact on the clustering though, as 95–99% of pairs of sequences were clustered similarly using inflation values of 2.75, 3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, or 5. We eventually settled on the combination yielding the highest ICCC: a significance threshold of 1 and inflation of 4 (
<xref ref-type="fig" rid="fig2s2">Figure 2—figure supplement 2</xref>
).</p>
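<p>The parameter scan can be sketched as follows. The homogeneity function below follows only the verbal description given above (the fraction of each node's neighbors that fall in the same cluster, averaged over nodes) and may differ from the published ICCC formula; all names are hypothetical, and testing every integer threshold from 1 to 50 is an assumption.</p>
<preformat>
```python
def neighbor_homogeneity(edges, cluster_of):
    """Mean fraction of each node's neighbors that share its cluster."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    fractions = []
    for node, nbrs in neighbors.items():
        same = sum(1 for n in nbrs if cluster_of[n] == cluster_of[node])
        fractions.append(same / len(nbrs))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Parameter grid described in the text.
INFLATIONS = [1.5 + 0.25 * i for i in range(15)]  # 1.5, 1.75, ..., 5.0
SIG_THRESHOLDS = list(range(1, 51))               # assumed: every integer 1..50


def best_parameters(scores):
    """Return the (significance threshold, inflation) pair with the top score."""
    return max(scores, key=scores.get)
```
</preformat>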
</sec>
<sec id="s3-4">
<title>Taxonomic and functional affiliation of sequences and VCs</title>
<p>Taxonomic affiliation of sequences was based on hits to the RefSeqABVir database. Each profile in the database was first affiliated based on the origin of its members, with a 75% majority rule: at each taxonomic level, a profile is affiliated to a taxon if more than 75% of the profile sequences are affiliated to this taxon. Then, for each of the 12,498 viral sequences identified by VirSorter, a set of relevant hits was selected: (i) first the profile with the best hit across all genes along the sequence, and (ii) the best hit from other genes with a score close to this ‘absolute’ best hit in the sequence (>75% of the score of the first best hit). The sequence was then affiliated to the Lowest Common Ancestor (LCA) of this set of relevant hits. Hence, a predicted protein will only be affiliated if pointing toward sequences or profiles typical of a viral group, and a sequence detected by VirSorter will only be affiliated if its best hits are consistent. Functional affiliation for each PC was based on the comparison of its members (predicted proteins) with PFAM (v. 27, threshold of 50 on score). Each VC was affiliated based on its members' affiliations if >75% of them were consistent.</p>
<p>For the detection of new genera in the VCs, we chose to ignore the 79 VCs that lacked large (>30 kb) genome sequences. This 30 kb threshold is conservative as it avoids considering short genome fragments as new genera but would also overlook small non-circular viral genomes (such as some
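<p>The majority rule and LCA steps described above can be sketched as follows. This is a minimal illustration, assuming that lineages are stored as lists ordered from the highest to the lowest rank and that hits are (score, lineage) pairs; the function names and data layout are hypothetical and do not reproduce the scripts used in this study.</p>
<preformat>
```python
from collections import Counter


def majority_taxon(labels, min_fraction=0.75):
    """Return the taxon held by more than min_fraction of labels, else None."""
    if not labels:
        return None
    taxon, count = Counter(labels).most_common(1)[0]
    return taxon if count / len(labels) > min_fraction else None


def relevant_hits(hits, min_ratio=0.75):
    """Keep lineages of hits scoring more than min_ratio of the best score."""
    if not hits:
        return []
    best = max(score for score, _ in hits)
    return [lineage for score, lineage in hits if score > min_ratio * best]


def lowest_common_ancestor(lineages):
    """Longest prefix shared by all lineages (ordered from root to tip)."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared
```
</preformat>
<p>For instance, two relevant hits pointing to different families of the same order would be affiliated at the order level only.</p>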
<italic>Tectiviridae</italic>
). However, because the latter comprise a minority (∼0.1% of 12,498 sequences) of the VirSorter data set (
<xref ref-type="fig" rid="fig2">Figure 2</xref>
), we chose to retain the larger, more conservative threshold.</p>
<p>The 7 short circular sequences from
<italic>Bacteroidia</italic>
only detected with the Viromes database (gi 319430465, 298484481, 329959038, 423221334, 423242675, 423298785, 345651594) were targeted for further examination. Hits to PFAM domains could be found on two proteins: a relaxase (PF03432.9, score ∼170), and one replication initiator protein (PF01051.16, score ∼80). Genome organization was compared with Easyfig (
<xref rid="bib74" ref-type="bibr">Sullivan et al., 2011</xref>
) after aligning all genomes to the same starting point (one base before the start of the Rep-domain protein). Recruitment plots of virome contigs (extracted from
<xref rid="bib41" ref-type="bibr">Kim et al., 2011</xref>
;
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
) were generated with ggplot2 and based on blastn comparison.</p>
</sec>
<sec id="s3-5">
<title>Host range and co-infection</title>
<p>The virus–host network was assessed considering only VCs with more than 10 sequences. Hosts were grouped at the class level. The modularity Q value of the virus–host matrix was computed with the lp-BRIM module in R software (
<ext-link ext-link-type="uri" xlink:href="https://github.com/tpoisot/lp-brim">https://github.com/tpoisot/lp-brim</ext-link>
). The virus–host matrix had a modularity of 0.45. The same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.07.</p>
<p>Co-infection was defined as the detection of several distinct viruses in one genome project (one complete genome or one SAG). However, different viral contigs in a single draft genome could also originate from a single viral genome mis-assembled in several different contigs. This will be especially true for
<italic>Caudovirales</italic>
that are the most detected viruses as well as the ones with the largest genomes. To limit the over-estimation of co-infection due to mis-assembled
<italic>Caudovirales</italic>
genomes, co-infection was only considered in the cases where multiple copies of the large subunit of the terminase were detected, because this gene is present in single copy in
<italic>Caudovirales</italic>
genomes, and usually detected even in new viruses (due to a high level of sequence conservation).</p>
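<p>The terminase-based criterion can be illustrated with the short sketch below, which flags a genome project as co-infected when at least two copies of the marker gene are found across its viral contigs. The annotation label and the input layout are hypothetical.</p>
<preformat>
```python
def coinfected_genomes(contigs, marker="terminase_large_subunit"):
    """Return genome IDs whose viral contigs carry two or more marker copies.

    contigs: iterable of (genome_id, list of gene annotations) pairs,
    one pair per viral contig.
    """
    counts = {}
    for genome_id, genes in contigs:
        counts[genome_id] = counts.get(genome_id, 0) + genes.count(marker)
    return sorted(g for g, n in counts.items() if n >= 2)
```
</preformat>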
</sec>
<sec id="s3-6">
<title>Evaluation of virus–host genome adaptation</title>
<p>Relative frequencies of k-mers (mono-, di-, tri-, and tetra-nucleotide) were computed with Jellyfish (
<xref rid="bib50" ref-type="bibr">Marçais and Kingsford, 2011</xref>
) for every VirSorter sequence and every bacterial and archaeal genome initially mined. The mean absolute error (i.e., the average of absolute differences) between k-mer frequency vectors was then computed with an in-house Perl script for each pair of VirSorter sequence and cellular genome, and used as a distance metric between viruses and putative hosts. For each VirSorter sequence, a set of distances that included its host (i.e., the genome with which the sequence was initially associated) alongside 10 randomly selected sequences from the same genus, the same family, and a different order than the host was factored into the distance distribution (
<xref ref-type="fig" rid="fig6s1">Figure 6—figure supplement 1</xref>
).</p>
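<p>The distance metric described above can be sketched in a few lines of Python. This is an illustration only: the study used Jellyfish and an in-house Perl script, whereas the sketch below counts k-mers on a single strand, ignores words containing ambiguous bases, and uses hypothetical function names.</p>
<preformat>
```python
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words with N or other ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def mean_absolute_error(u, v):
    """Average of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)
```
</preformat>
<p>A tetranucleotide comparison would call kmer_frequencies with k set to 4 on both the viral sequence and the cellular genome before computing the distance.</p>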
<p>Codon usage adaptation was evaluated with cusp and cai from the European Molecular Biology Open Software Suite (EMBOSS,
<xref rid="bib63" ref-type="bibr">Rice and Longden, 2000</xref>
). First, codon usage bias of each bacterial and archaeal genome was computed. Then, the codon usage adaptation index (cai) was calculated for each gene between VirSorter sequences and cellular genomes. The global distribution displays the average (across genes) adaptation index for each VirSorter sequence and (as for the k-mer distances) a subset of cellular genomes including its host and 10 randomly selected sequences from respectively the same genus, the same family, and a different order than the actual host. Function-specific codon usage bias was based on the gene-by-gene adaptation between each VirSorter sequence and its host.</p>
<p>For each category studied, the distance between distribution of distances to host genome (in red on
<xref ref-type="fig" rid="fig5s1">Figure 5—figure supplement 1</xref>
) and distribution of distances to non-host genomes (in purple on
<xref ref-type="fig" rid="fig5s1">Figure 5—figure supplement 1</xref>
) was evaluated with a Kolmogorov–Smirnov (K–S) statistic. The codon usage adaptation indexes for the different functional categories were compared to the ‘other functions’ values with a Wilcoxon signed-rank test to detect categories with statistically different averages. Both statistics were computed with R software.</p>
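<p>For reference, the two-sample K–S statistic is the largest absolute difference between the two empirical cumulative distribution functions. The statistics in this study were computed with R; the pure-Python sketch below only illustrates the quantity being measured, and its function name is hypothetical.</p>
<preformat>
```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```
</preformat>
<p>A value close to 1 indicates that host and non-host distance distributions barely overlap, whereas a value close to 0 indicates nearly identical distributions.</p>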
<p>To evaluate the effect of small genome size on distance between k-mer frequencies, a sub-sample of 1000
<italic>Caudovirales</italic>
was randomly taken at different sizes (from 2000 to 20,000 bp), and the same procedure as for complete sequences was used to determine the distance between host and non-host distributions of k-mer distances. Even though the signal was slightly less strong for shorter fragments, this simulation indicates that genome size is not the only factor that could explain such low viral–host genome adaptation for ssDNA viruses.</p>
<p>The prediction of the host taxonomy for each viral sequence was based on the microbial genome with the lowest tetramer frequency distance to the viral sequence. A prediction was considered as ‘correct’ when this closest microbial genome taxonomy was the same as the original genome in which the viral sequence was detected. This prediction was computed using (i) all microbial genomes, (ii) only genomes from a different species than the actual host (i.e., the genome in which the viral sequence was originally detected), and (iii) only genomes from a different genus than the actual host.</p>
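<p>The nearest-genome prediction can be sketched as follows. This is a simplified illustration, assuming that each microbial genome is stored with one taxon label and a precomputed tetramer frequency vector; excluding genomes by taxon label stands in for the species-level and genus-level exclusions of scenarios (ii) and (iii). All names are hypothetical.</p>
<preformat>
```python
def mean_absolute_error(u, v):
    """Average of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def predict_host(virus_vector, genomes, excluded_taxa=()):
    """Return (genome_id, taxon) of the closest genome, or None if no candidate.

    genomes: dict mapping genome_id to a (taxon, frequency vector) pair.
    excluded_taxa: taxa to leave out, e.g. the taxon of the actual host.
    """
    candidates = [
        (mean_absolute_error(virus_vector, vector), genome_id, taxon)
        for genome_id, (taxon, vector) in genomes.items()
        if taxon not in excluded_taxa
    ]
    if not candidates:
        return None
    _, genome_id, taxon = min(candidates)
    return genome_id, taxon


def is_correct(prediction, true_taxon):
    """A prediction counts as correct when its taxon matches the true host."""
    return prediction is not None and prediction[1] == true_taxon
```
</preformat>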
</sec>
<sec id="s3-7">
<title>Estimation of virome affiliation improvement and prophage assembly efficiency</title>
<p>Protein sequences predicted from the POV (
<xref rid="bib36" ref-type="bibr">Hurwitz and Sullivan, 2013</xref>
), TOV (
<xref rid="bib8" ref-type="bibr">Brum et al., 2015</xref>
), and human gut viromes (
<xref rid="bib53" ref-type="bibr">Minot et al., 2012</xref>
) data sets were compared to RefSeqABVir (Jan. 2014) using BLAST (blastp, threshold of 50 on bit score and 0.001 on e-value). Those proteins that did not affiliate at >50 bit score and <0.001 e-value thresholds were considered ‘unclassified’ and then used as queries in a secondary BLAST (blastp with the same thresholds) against the predicted proteins from the VirSorter curated data set. Any unclassified proteins matching the VirSorter data set better were considered newly affiliated.</p>
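<p>The two-step affiliation can be illustrated with the sketch below, which assumes that BLAST hits have already been parsed into dictionaries holding a bit score and an e-value. The field names, the inclusive handling of the thresholds, and the function names are assumptions made for illustration.</p>
<preformat>
```python
def passes(hit, min_bits=50.0, max_evalue=0.001):
    """True when a hit meets both the bit score and the e-value threshold."""
    return hit["bitscore"] >= min_bits and max_evalue >= hit["evalue"]


def newly_affiliated(proteins, refseq_hits, virsorter_hits):
    """Proteins with no significant RefSeq hit but a significant VirSorter hit.

    refseq_hits, virsorter_hits: dicts mapping a protein ID to a list of hits.
    """
    unclassified = [
        p for p in proteins
        if not any(passes(h) for h in refseq_hits.get(p, []))
    ]
    return [
        p for p in unclassified
        if any(passes(h) for h in virsorter_hits.get(p, []))
    ]
```
</preformat>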
<p>To evaluate the efficiency of prophage assembly, we simulated genome sequencing from 23 bacterial genomes with identified prophages (NC_000907 NC_000913 NC_000964 NC_002570 NC_002655 NC_002662 NC_002695 NC_002935 NC_003030 NC_003212 NC_003295 NC_003366 NC_003997 NC_004070 NC_004307 NC_004310 NC_004431 NC_004557 NC_004567 NC_004668 NC_004722 NC_005085 NC_005362). NeSSM (
<xref rid="bib38" ref-type="bibr">Jia et al., 2013</xref>
) was used to simulate HiSeq Illumina reads (100 bp paired-end) with a coverage of the prophage region of 5×, 25×, 50×, 75×, or 100×. Reads were then assembled with Idba_ud (
<xref rid="bib57" ref-type="bibr">Peng et al., 2012</xref>
), and viral contigs were predicted with VirSorter (
<xref rid="bib68" ref-type="bibr">Roux et al., 2015a</xref>
). Of the 481 contigs larger than 30 kb detected as viral by VirSorter, 11 were considered as ‘entirely viral’ even though these originated from integrated prophages, resulting in a ‘false-positive’ ratio of integrated prophages wrongly considered as extrachromosomal viral genomes of 2.3% for contigs of 30 kb and more. As could be expected, this same ‘false-positive’ ratio was higher for smaller contigs (12.06% for contigs <20 kb, and 22.81% for contigs <10 kb), so that we considered the origin of these small contigs as ‘undetermined’, since they may come from integrated prophages or extrachromosomal genomes.</p>
<p>All scripts used in this study are available on the TMPL wiki as a zip package:
<ext-link ext-link-type="uri" xlink:href="http://tmpl.arizona.edu/dokuwiki/doku.php?id=bioinformatics:scripts:vsb">http://tmpl.arizona.edu/dokuwiki/doku.php?id=bioinformatics:scripts:vsb</ext-link>
and on github:
<ext-link ext-link-type="uri" xlink:href="http://github.com/simroux/virsorter-curated-dataset-scripts-package">http://github.com/simroux/virsorter-curated-dataset-scripts-package</ext-link>
.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="funding-information">
<title>Funding Information</title>
<p>This paper was supported by the following grants:</p>
<list list-type="bullet">
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000936</institution-id>
<institution>Gordon and Betty Moore Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>3790</award-id>
to Matthew B Sullivan.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000038</institution-id>
<institution>Natural Sciences and Engineering Research Council of Canada (Conseil de Recherches en Sciences Naturelles et en Génie du Canada)</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000196</institution-id>
<institution>Canada Foundation for Innovation (Fondation canadienne pour l'innovation)</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100007631</institution-id>
<institution>Canadian Institute for Advanced Research (L'Institut Canadien de Recherches Avancées)</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution>Tula Foundation</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100001246</institution-id>
<institution>Ambrose Monell Foundation</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100001372</institution-id>
<institution>G. Unger Vetlesen Foundation</institution>
</institution-wrap>
</funding-source>
to Steven J Hallam.</p>
</list-item>
<list-item>
<p>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000015</institution-id>
<institution>U.S. Department of Energy (Department of Energy)</institution>
</institution-wrap>
</funding-source>
<award-id>Joint Genome Institute (DE-AC02-05CH11231)</award-id>
to Tanja Woyke.</p>
</list-item>
</list>
</sec>
<ack id="ack">
<title>Acknowledgements</title>
<p>We thank Natalie Solonenko and Sheri Floge and TMPL members for their comments on the manuscript. This work was performed under the auspices of the Gordon and Betty Moore Foundation (#3790) through grants awarded to MBS and the Natural Sciences and Engineering Research Council (NSERC) of Canada, Canada Foundation for Innovation (CFI), the Canadian Institute for Advanced Research (CIFAR), and the Tula Foundation funded Centre for Microbial Diversity and Evolution, G Unger Vetlesen and Ambrose Monell Foundation through grants awarded to SJH. The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under Contract No. DE-AC02-05CH11231.</p>
</ack>
<sec sec-type="additional-information" id="s4">
<title>Additional information</title>
<fn-group content-type="competing-interest">
<title>
<bold>Competing interests</bold>
</title>
<fn fn-type="conflict" id="conf1">
<p>The authors declare that no competing interests exist.</p>
</fn>
</fn-group>
<fn-group content-type="author-contribution">
<title>
<bold>Author contributions</bold>
</title>
<fn fn-type="con" id="con1">
<p>SR, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article.</p>
</fn>
<fn fn-type="con" id="con2">
<p>MBS, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article.</p>
</fn>
<fn fn-type="con" id="con3">
<p>SJH, Conception and design, Drafting or revising the article.</p>
</fn>
<fn fn-type="con" id="con4">
<p>TW, Conception and design, Drafting or revising the article.</p>
</fn>
</fn-group>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Additional files</title>
<sec sec-type="datasets" id="s5-1">
<title>Major dataset</title>
<p>The following dataset was generated:</p>
<p>
<related-object content-type="generated-dataset" source-id="http://datadryad.org/resource/doi:10.5061/dryad.b8226" source-id-type="uri" id="dataro1">
<collab collab-type="author">Roux S</collab>
,
<collab collab-type="author">Hallam SJ</collab>
,
<collab collab-type="author">Woyke T</collab>
,
<collab collab-type="author">Sullivan MB</collab>
,
<year>2015</year>
<x xml:space="preserve">, </x>
<source>Data from: Viral dark matter and virus-host interactions resolved from publicly available microbial genomes</source>
<x xml:space="preserve">, </x>
<ext-link ext-link-type="uri" xlink:href="http://datadryad.org/resource/doi:10.5061/dryad.b8226">http://datadryad.org/resource/doi:10.5061/dryad.b8226</ext-link>
<x xml:space="preserve">, </x>
<comment>Available at Dryad Digital Repository under a CC0 Public Domain Dedication.</comment>
</related-object>
</p>
</sec>
</sec>
<ref-list>
<title>References</title>
<ref id="bib1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abedon</surname>
<given-names>ST</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Phage evolution and ecology</article-title>
<source>Advances in Applied Microbiology</source>
<volume>67</volume>
<fpage>1</fpage>
<lpage>45</lpage>
<pub-id pub-id-type="doi">10.1016/S0065-2164(08)01001-0</pub-id>
<pub-id pub-id-type="pmid">19245935</pub-id>
</element-citation>
</ref>
<ref id="bib2">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Akhter</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Aziz</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies</article-title>
<source>Nucleic Acids Research</source>
<volume>40</volume>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gks406</pub-id>
<pub-id pub-id-type="pmid">21908400</pub-id>
</element-citation>
</ref>
<ref id="bib3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Allers</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Moraru</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Duhaime</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Beneze</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Solonenko</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Canosa</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Amann</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2013a</year>
<article-title>Single-cell and population level viral infection dynamics revealed by phageFISH, a method to visualize intracellular and free viruses</article-title>
<source>Environmental Microbiology</source>
<volume>15</volume>
<fpage>2306</fpage>
<lpage>2318</lpage>
<pub-id pub-id-type="doi">10.1111/1462-2920.12100</pub-id>
<pub-id pub-id-type="pmid">23489642</pub-id>
</element-citation>
</ref>
<ref id="bib4">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Allers</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Wright</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Konwar</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Howes</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Beneze</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2013b</year>
<article-title>Diversity and population structure of Marine Group A bacteria in the Northeast subarctic Pacific Ocean</article-title>
<source>The ISME Journal</source>
<volume>7</volume>
<fpage>256</fpage>
<lpage>268</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2012.108</pub-id>
<pub-id pub-id-type="pmid">23151638</pub-id>
</element-citation>
</ref>
<ref id="bib5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andersson</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Banfield</surname>
<given-names>JF</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Virus population dynamics and acquired virus resistance in natural microbial communities</article-title>
<source>Science</source>
<volume>320</volume>
<fpage>1047</fpage>
<lpage>1050</lpage>
<pub-id pub-id-type="doi">10.1126/science.1157358</pub-id>
<pub-id pub-id-type="pmid">18497291</pub-id>
</element-citation>
</ref>
<ref id="bib6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bankevich</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nurk</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Antipov</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Gurevich</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Dvorkin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kulikov</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Lesin</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Nikolenko</surname>
<given-names>SI</given-names>
</name>
<name>
<surname>Pham</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Prjibelski</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Pyshkin</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Sirotkin</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Vyahhi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Alekseyev</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
<source>Journal of Computational Biology</source>
<volume>19</volume>
<fpage>455</fpage>
<lpage>477</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0021</pub-id>
<pub-id pub-id-type="pmid">22506599</pub-id>
</element-citation>
</ref>
<ref id="bib7">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bastías</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Higuera</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sierralta</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Espejo</surname>
<given-names>RT</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>A new group of cosmopolitan bacteriophages induce a carrier state in the pandemic strain of Vibrio parahaemolyticus</article-title>
<source>Environmental Microbiology</source>
<volume>12</volume>
<fpage>990</fpage>
<lpage>1000</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2010.02143.x</pub-id>
<pub-id pub-id-type="pmid">20105216</pub-id>
</element-citation>
</ref>
<ref id="bib8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brum</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ignacio-Espinoza</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Roux</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Doulcier</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Acinas</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Alberti</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Chaffron</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Cruaud</surname>
<given-names>C</given-names>
</name>
<name>
<surname>de Vargas</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gasol</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Gorsky</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gregory</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Ogata</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Pesant</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Poulos</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Schwenck</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Speich</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dimier</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kandels-Lewis</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Picheral</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Searson</surname>
<given-names>S</given-names>
</name>
<collab>Tara Oceans Coordinators</collab>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bowler</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Sunagawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wincker</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Karsenti</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2015</year>
<article-title>Patterns and ecological drivers of ocean viral communities</article-title>
<source>Science</source>
<volume>348</volume>
<fpage>1261498</fpage>
<pub-id pub-id-type="doi">10.1126/science.1261498</pub-id>
<pub-id pub-id-type="pmid">25999515</pub-id>
</element-citation>
</ref>
<ref id="bib9">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Brum</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Jeffrey Morris</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Décima</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stukel</surname>
<given-names>MR</given-names>
</name>
</person-group>
<year>2014</year>
<article-title>Mortality in the oceans: causes and consequences. Association for the Sciences of Limnology and Oceanography</article-title>
<person-group person-group-type="editor">
<name>
<surname>Kemp</surname>
<given-names>PF</given-names>
</name>
</person-group>
<source>Eco-DAS IX Symposium Proceedings</source>
<fpage>16</fpage>
<lpage>48</lpage>
</element-citation>
</ref>
<ref id="bib10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brum</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2015</year>
<article-title>Rising to the challenge: accelerated pace of discovery transforms marine virology</article-title>
<source>Nature Reviews Microbiology</source>
<volume>13</volume>
<fpage>147</fpage>
<lpage>159</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro3404</pub-id>
<pub-id pub-id-type="pmid">25639680</pub-id>
</element-citation>
</ref>
<ref id="bib11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Canchaya</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fournous</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Brüssow</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>The impact of prophages on bacterial chromosomes</article-title>
<source>Molecular Microbiology</source>
<volume>53</volume>
<fpage>9</fpage>
<lpage>18</lpage>
<pub-id pub-id-type="doi">10.1111/j.1365-2958.2004.04113.x</pub-id>
<pub-id pub-id-type="pmid">15225299</pub-id>
</element-citation>
</ref>
<ref id="bib12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carbone</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Codon bias is a major factor explaining phage evolution in translationally biased hosts</article-title>
<source>Journal of Molecular Evolution</source>
<volume>66</volume>
<fpage>210</fpage>
<lpage>223</lpage>
<pub-id pub-id-type="doi">10.1007/s00239-008-9068-6</pub-id>
<pub-id pub-id-type="pmid">18286220</pub-id>
</element-citation>
</ref>
<ref id="bib13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cardinale</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Duffy</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Single-stranded genomic architecture constrains optimal codon usage</article-title>
<source>Bacteriophage</source>
<volume>1</volume>
<fpage>219</fpage>
<lpage>224</lpage>
<pub-id pub-id-type="doi">10.4161/bact.1.4.18496</pub-id>
<pub-id pub-id-type="pmid">22334868</pub-id>
</element-citation>
</ref>
<ref id="bib14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carey-Smith</surname>
<given-names>GV</given-names>
</name>
<name>
<surname>Billington</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Cornelius</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Hudson</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Heinemann</surname>
<given-names>JA</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Isolation and characterization of bacteriophages infecting
<italic>Salmonella</italic>
spp</article-title>
<source>FEMS Microbiology Letters</source>
<volume>258</volume>
<fpage>182</fpage>
<lpage>186</lpage>
<pub-id pub-id-type="doi">10.1111/j.1574-6968.2006.00217.x</pub-id>
<pub-id pub-id-type="pmid">16640570</pub-id>
</element-citation>
</ref>
<ref id="bib15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Casjens</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Prophages and bacterial genomics: what have we learned so far?</article-title>
<source>Molecular Microbiology</source>
<volume>49</volume>
<fpage>277</fpage>
<lpage>300</lpage>
<pub-id pub-id-type="doi">10.1046/j.1365-2958.2003.03580.x</pub-id>
<pub-id pub-id-type="pmid">12886937</pub-id>
</element-citation>
</ref>
<ref id="bib16">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Castelle</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Hug</surname>
<given-names>LA</given-names>
</name>
<name>
<surname>Wrighton</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Singer</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Banfield</surname>
<given-names>JF</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment</article-title>
<source>Nature Communications</source>
<volume>4</volume>
<fpage>2120</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms3120</pub-id>
</element-citation>
</ref>
<ref id="bib17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clemente</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Ursell</surname>
<given-names>LK</given-names>
</name>
<name>
<surname>Parfrey</surname>
<given-names>LW</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>The impact of the gut microbiota on human health: an integrative view</article-title>
<source>Cell</source>
<volume>148</volume>
<fpage>1258</fpage>
<lpage>1270</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2012.01.035</pub-id>
<pub-id pub-id-type="pmid">22424233</pub-id>
</element-citation>
</ref>
<ref id="bib18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Davis</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Austin</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<year>1992</year>
<article-title>Biochemical activities of the ParA partition protein of the P1 plasmid</article-title>
<source>Molecular Microbiology</source>
<volume>6</volume>
<fpage>1141</fpage>
<lpage>1147</lpage>
<pub-id pub-id-type="doi">10.1111/j.1365-2958.1992.tb01552.x</pub-id>
<pub-id pub-id-type="pmid">1534133</pub-id>
</element-citation>
</ref>
<ref id="bib19">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeLong</surname>
<given-names>EF</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>The microbial ocean from genomes to biomes</article-title>
<source>Nature</source>
<volume>459</volume>
<fpage>200</fpage>
<lpage>206</lpage>
<pub-id pub-id-type="doi">10.1038/nature08059</pub-id>
<pub-id pub-id-type="pmid">19444206</pub-id>
</element-citation>
</ref>
<ref id="bib20">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deng</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ignacio-Espinoza</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Gregory</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Poulos</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Weitz</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2014</year>
<article-title>Viral tagging reveals discrete populations in Synechococcus viral genome sequence space</article-title>
<source>Nature</source>
<volume>513</volume>
<fpage>242</fpage>
<lpage>245</lpage>
<pub-id pub-id-type="doi">10.1038/nature13459</pub-id>
<pub-id pub-id-type="pmid">25043051</pub-id>
</element-citation>
</ref>
<ref id="bib21">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Diemer</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Stedman</surname>
<given-names>KM</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>A novel virus genome discovered in an extreme environment suggests recombination between unrelated groups of RNA and DNA viruses</article-title>
<source>Biology Direct</source>
<volume>7</volume>
<fpage>13</fpage>
<pub-id pub-id-type="doi">10.1186/1745-6150-7-13</pub-id>
<pub-id pub-id-type="pmid">22515485</pub-id>
</element-citation>
</ref>
<ref id="bib22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Emerson</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Alvarez</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Banfield</surname>
<given-names>JF</given-names>
</name>
</person-group>
<year>2015</year>
<article-title>Metagenomic analysis of a high CO2 subsurface microbial community populated by chemolithoautotrophs and bacteria and archaea from candidate phyla</article-title>
<source>Environmental Microbiology</source>
<pub-id pub-id-type="doi">10.1111/1462-2920.12817</pub-id>
</element-citation>
</ref>
<ref id="bib23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Enav</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Béjà</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Mandel-Gutfreund</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Cyanophage tRNAs may have a role in cross-infectivity of oceanic
<italic>Prochlorococcus</italic>
and
<italic>Synechococcus</italic>
hosts</article-title>
<source>The ISME Journal</source>
<volume>6</volume>
<fpage>619</fpage>
<lpage>628</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2011.146</pub-id>
<pub-id pub-id-type="pmid">22011720</pub-id>
</element-citation>
</ref>
<ref id="bib24">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>An efficient algorithm for large-scale detection of protein families</article-title>
<source>Nucleic Acids Research</source>
<volume>30</volume>
<fpage>1575</fpage>
<lpage>1584</lpage>
<pub-id pub-id-type="doi">10.1093/nar/30.7.1575</pub-id>
<pub-id pub-id-type="pmid">11917018</pub-id>
</element-citation>
</ref>
<ref id="bib25">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Falkowski</surname>
<given-names>PG</given-names>
</name>
<name>
<surname>Fenchel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Delong</surname>
<given-names>EF</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>The microbial engines that drive Earth's biogeochemical cycles</article-title>
<source>Science</source>
<volume>320</volume>
<fpage>1034</fpage>
<lpage>1039</lpage>
<pub-id pub-id-type="doi">10.1126/science.1153213</pub-id>
<pub-id pub-id-type="pmid">18497287</pub-id>
</element-citation>
</ref>
<ref id="bib26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fischer</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Yoichi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Unno</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Tanji</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>The coexistence of
<italic>Escherichia coli</italic>
serotype O157:H7 and its specific bacteriophage in continuous culture</article-title>
<source>FEMS Microbiology Letters</source>
<volume>241</volume>
<fpage>171</fpage>
<lpage>177</lpage>
<pub-id pub-id-type="doi">10.1016/j.femsle.2004.10.017</pub-id>
<pub-id pub-id-type="pmid">15598529</pub-id>
</element-citation>
</ref>
<ref id="bib27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Flores</surname>
<given-names>CO</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Valverde</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Farr</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Weitz</surname>
<given-names>JS</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Statistical structure of host–phage interactions</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>108</volume>
<fpage>E288</fpage>
<lpage>E297</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1101595108</pub-id>
</element-citation>
</ref>
<ref id="bib28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Flores</surname>
<given-names>CO</given-names>
</name>
<name>
<surname>Valverde</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Weitz</surname>
<given-names>JS</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Multi-scale structure and geographic drivers of cross-infection within marine bacteria and phages</article-title>
<source>The ISME Journal</source>
<volume>7</volume>
<fpage>520</fpage>
<lpage>532</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2012.135</pub-id>
<pub-id pub-id-type="pmid">23178671</pub-id>
</element-citation>
</ref>
<ref id="bib29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Forterre</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Prangishvili</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>The major role of viruses in cellular evolution: facts and hypotheses</article-title>
<source>Current Opinion in Virology</source>
<volume>3</volume>
<fpage>558</fpage>
<lpage>565</lpage>
<pub-id pub-id-type="doi">10.1016/j.coviro.2013.06.013</pub-id>
<pub-id pub-id-type="pmid">23870799</pub-id>
</element-citation>
</ref>
<ref id="bib30">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fouts</surname>
<given-names>DE</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Phage_Finder: automated identification and classification of prophage regions in complete bacterial genome sequences</article-title>
<source>Nucleic Acids Research</source>
<volume>34</volume>
<fpage>5839</fpage>
<lpage>5851</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl732</pub-id>
<pub-id pub-id-type="pmid">17062630</pub-id>
</element-citation>
</ref>
<ref id="bib31">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garrett</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Prangishvili</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Reuter</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stetter</surname>
<given-names>KO</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>X</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Metagenomic analyses of novel viruses and plasmids from a cultured environmental sample of hyperthermophilic neutrophiles</article-title>
<source>Environmental Microbiology</source>
<volume>12</volume>
<fpage>2918</fpage>
<lpage>2930</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2010.02266.x</pub-id>
<pub-id pub-id-type="pmid">20545752</pub-id>
</element-citation>
</ref>
<ref id="bib33">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hanson</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Fuhrman</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Horner-Devine</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Martiny</surname>
<given-names>JB</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Beyond biogeographic patterns: processes shaping the microbial landscape</article-title>
<source>Nature Reviews. Microbiology</source>
<volume>10</volume>
<fpage>497</fpage>
<lpage>506</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro2795</pub-id>
<pub-id pub-id-type="pmid">22580365</pub-id>
</element-citation>
</ref>
<ref id="bib34">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hendrix</surname>
<given-names>RW</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Burns</surname>
<given-names>RN</given-names>
</name>
<name>
<surname>Ford</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Hatfull</surname>
<given-names>GF</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>96</volume>
<fpage>2192</fpage>
<lpage>2197</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.96.5.2192</pub-id>
</element-citation>
</ref>
<ref id="bib35">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hurwitz</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Metabolic reprogramming by viruses in the sunlit and dark ocean</article-title>
<source>Genome Biology</source>
<volume>14</volume>
<fpage>R123</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-11-r123</pub-id>
<pub-id pub-id-type="pmid">24200126</pub-id>
</element-citation>
</ref>
<ref id="bib36">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hurwitz</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>The Pacific Ocean Virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology</article-title>
<source>PLOS ONE</source>
<volume>8</volume>
<fpage>e57355</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0057355</pub-id>
<pub-id pub-id-type="pmid">23468974</pub-id>
</element-citation>
</ref>
<ref id="bib37">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ignacio-Espinoza</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Phylogenomics of T4 cyanophages: lateral gene transfer in the “core” and origins of host genes</article-title>
<source>Environmental Microbiology</source>
<volume>14</volume>
<fpage>2113</fpage>
<lpage>2126</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2012.02704.x</pub-id>
<pub-id pub-id-type="pmid">22348436</pub-id>
</element-citation>
</ref>
<ref id="bib38">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xuan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>C</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>NeSSM: a next-generation sequencing simulator for metagenomics</article-title>
<source>PLOS ONE</source>
<volume>8</volume>
<fpage>e75448</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0075448</pub-id>
<pub-id pub-id-type="pmid">24124490</pub-id>
</element-citation>
</ref>
<ref id="bib39">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kamke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>N</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Single-cell genomics reveals complex carbohydrate degradation patterns in poribacterial symbionts of marine sponges</article-title>
<source>The ISME Journal</source>
<volume>7</volume>
<fpage>2287</fpage>
<lpage>2300</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2013.111</pub-id>
<pub-id pub-id-type="pmid">23842652</pub-id>
</element-citation>
</ref>
<ref id="bib40">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kashtan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Roggensack</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Rodrigue</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Thompson</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Biller</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Coe</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Marttinen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Malmstrom</surname>
<given-names>RR</given-names>
</name>
<name>
<surname>Stocker</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Follows</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Stepanauskas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Chisholm</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Biller</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2014</year>
<article-title>Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus</article-title>
<source>Science</source>
<volume>344</volume>
<fpage>416</fpage>
<lpage>420</lpage>
<pub-id pub-id-type="doi">10.1126/science.1248575</pub-id>
<pub-id pub-id-type="pmid">24763590</pub-id>
</element-citation>
</ref>
<ref id="bib41">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>EJ</given-names>
</name>
<name>
<surname>Roh</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Bae</surname>
<given-names>JW</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Diversity and abundance of single-stranded DNA viruses in human feces</article-title>
<source>Applied and Environmental Microbiology</source>
<volume>77</volume>
<fpage>8062</fpage>
<lpage>8070</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.06331-11</pub-id>
<pub-id pub-id-type="pmid">21948823</pub-id>
</element-citation>
</ref>
<ref id="bib42">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
<name>
<surname>Senkevich</surname>
<given-names>TG</given-names>
</name>
<name>
<surname>Dolja</surname>
<given-names>VV</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>The ancient Virus World and evolution of cells</article-title>
<source>Biology Direct</source>
<volume>1</volume>
<fpage>29</fpage>
<pub-id pub-id-type="doi">10.1186/1745-6150-1-29</pub-id>
<pub-id pub-id-type="pmid">16984643</pub-id>
</element-citation>
</ref>
<ref id="bib43">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krupovic</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zhi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Shevchenko</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>NS</given-names>
</name>
</person-group>
<year>2015</year>
<article-title>Multiple layers of chimerism in a single-stranded DNA virus discovered by deep sequencing</article-title>
<source>Genome Biology and Evolution</source>
<pub-id pub-id-type="doi">10.1093/gbe/evv034</pub-id>
</element-citation>
</ref>
<ref id="bib44">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Labonté</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Swan</surname>
<given-names>BK</given-names>
</name>
<name>
<surname>Poulos</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Woyke</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Wommack</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Stepanauskas</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2015</year>
<article-title>Single cell genomics-based analysis of virus-host interactions in marine surface bacterioplankton</article-title>
<source>The ISME Journal</source>
<pub-id pub-id-type="doi">10.1038/ismej.2015.48</pub-id>
</element-citation>
</ref>
<ref id="bib45">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Labonté</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Suttle</surname>
<given-names>CA</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Previously unknown and highly divergent ssDNA viruses populate the oceans</article-title>
<source>The ISME Journal</source>
<volume>7</volume>
<fpage>2169</fpage>
<lpage>2177</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2013.110</pub-id>
<pub-id pub-id-type="pmid">23842650</pub-id>
</element-citation>
</ref>
<ref id="bib47">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leplae</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lima-Mendez</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Toussaint</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>ACLAME: a CLAssification of Mobile genetic Elements, update 2010</article-title>
<source>Nucleic Acids Research</source>
<volume>38</volume>
<fpage>D57</fpage>
<lpage>D61</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp938</pub-id>
<pub-id pub-id-type="pmid">19933762</pub-id>
</element-citation>
</ref>
<ref id="bib48">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lima-Mendez</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Van Helden</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Toussaint</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Leplae</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2008a</year>
<article-title>Prophinder: a computational tool for prophage prediction in prokaryotic genomes</article-title>
<source>Bioinformatics</source>
<volume>24</volume>
<fpage>863</fpage>
<lpage>865</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn043</pub-id>
<pub-id pub-id-type="pmid">18238785</pub-id>
</element-citation>
</ref>
<ref id="bib49">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lima-Mendez</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Van Helden</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Toussaint</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Leplae</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2008b</year>
<article-title>Reticulate representation of evolutionary and functional relationships between phage genomes</article-title>
<source>Molecular Biology and Evolution</source>
<volume>25</volume>
<fpage>762</fpage>
<lpage>777</lpage>
<pub-id pub-id-type="doi">10.1093/molbev/msn023</pub-id>
<pub-id pub-id-type="pmid">18234706</pub-id>
</element-citation>
</ref>
<ref id="bib50">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marçais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<volume>27</volume>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="bib51">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marston</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Pierciey</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Shepard</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gearin</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Yandava</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Henn</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Martiny</surname>
<given-names>JB</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Rapid diversification of coevolving marine Synechococcus and a virus</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>109</volume>
<fpage>4544</fpage>
<lpage>4549</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1120310109</pub-id>
</element-citation>
</ref>
<ref id="bib52">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Middelboe</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Holmfeldt</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Riemann</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Nybroe</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Haaber</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Bacteriophages drive strain diversification in a marine Flavobacterium: implications for phage resistance and physiological properties</article-title>
<source>Environmental Microbiology</source>
<volume>11</volume>
<fpage>1971</fpage>
<lpage>1982</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2009.01920.x</pub-id>
<pub-id pub-id-type="pmid">19508553</pub-id>
</element-citation>
</ref>
<ref id="bib53">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Minot</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Grunberg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Bushman</surname>
<given-names>FD</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Hypervariable loci in the human gut virome</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>109</volume>
<fpage>3962</fpage>
<lpage>3966</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1119061109</pub-id>
</element-citation>
</ref>
<ref id="bib54">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mizuno</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Rodriguez-Valera</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Kimes</surname>
<given-names>NE</given-names>
</name>
<name>
<surname>Ghai</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Expanding the marine virosphere using metagenomics</article-title>
<source>PLOS Genetics</source>
<volume>9</volume>
<fpage>e1003987</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pgen.1003987</pub-id>
<pub-id pub-id-type="pmid">24348267</pub-id>
</element-citation>
</ref>
<ref id="bib55">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mosig</surname>
<given-names>G</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Recombination and recombination-dependent DNA replication in bacteriophage T4</article-title>
<source>Annual Review of Genetics</source>
<volume>32</volume>
<fpage>379</fpage>
<lpage>413</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.genet.32.1.379</pub-id>
</element-citation>
</ref>
<ref id="bib56">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pace</surname>
<given-names>NR</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>A molecular view of microbial diversity and the biosphere</article-title>
<source>Science</source>
<volume>276</volume>
<fpage>734</fpage>
<lpage>740</lpage>
<pub-id pub-id-type="doi">10.1126/science.276.5313.734</pub-id>
<pub-id pub-id-type="pmid">9115194</pub-id>
</element-citation>
</ref>
<ref id="bib57">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FY</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth</article-title>
<source>Bioinformatics</source>
<volume>28</volume>
<fpage>1420</fpage>
<lpage>1428</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts174</pub-id>
<pub-id pub-id-type="pmid">22495754</pub-id>
</element-citation>
</ref>
<ref id="bib58">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Ghose</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Blaser</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses</article-title>
<source>BMC Genomics</source>
<volume>7</volume>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-7-8</pub-id>
<pub-id pub-id-type="pmid">16403227</pub-id>
</element-citation>
</ref>
<ref id="bib59">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruitt</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Tatusova</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Klimke</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Maglott</surname>
<given-names>DR</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>NCBI reference sequences: current status, policy and new initiatives</article-title>
<source>Nucleic Acids Research</source>
<volume>37</volume>
<fpage>32</fpage>
<lpage>36</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn721</pub-id>
</element-citation>
</ref>
<ref id="bib60">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rakonjac</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bennett</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Spagnuolo</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gagic</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Russel</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Filamentous bacteriophage: biology, phage display and nanotechnology applications</article-title>
<source>Current Issues in Molecular Biology</source>
<volume>13</volume>
<fpage>51</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="pmid">21502666</pub-id>
</element-citation>
</ref>
<ref id="bib61">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rappé</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Giovannoni</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>The uncultured microbial majority</article-title>
<source>Annual Review of Microbiology</source>
<volume>57</volume>
<fpage>369</fpage>
<lpage>394</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.micro.57.030502.090759</pub-id>
</element-citation>
</ref>
<ref id="bib62">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reyes</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Semenkovich</surname>
<given-names>NP</given-names>
</name>
<name>
<surname>Whiteson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>JI</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Going viral: next-generation sequencing applied to phage populations in the human gut</article-title>
<source>Nature Reviews Microbiology</source>
<volume>10</volume>
<fpage>607</fpage>
<lpage>617</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro2853</pub-id>
<pub-id pub-id-type="pmid">22864264</pub-id>
</element-citation>
</ref>
<ref id="bib63">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rice</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Longden</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bleasby</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>EMBOSS: the European Molecular Biology Open Software Suite</article-title>
<source>Trends in Genetics</source>
<volume>16</volume>
<fpage>276</fpage>
<lpage>277</lpage>
<pub-id pub-id-type="doi">10.1016/S0168-9525(00)02024-2</pub-id>
<pub-id pub-id-type="pmid">10827456</pub-id>
</element-citation>
</ref>
<ref id="bib64">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rinke</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Schwientek</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>IJ</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Darling</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Malfatti</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Swan</surname>
<given-names>BK</given-names>
</name>
<name>
<surname>Gies</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Dodsworth</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Hedlund</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Tsiamis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sievert</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>WT</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Stepanauskas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Woyke</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Insights into the phylogeny and coding potential of microbial dark matter</article-title>
<source>Nature</source>
<volume>499</volume>
<fpage>431</fpage>
<lpage>437</lpage>
<pub-id pub-id-type="doi">10.1038/nature12352</pub-id>
<pub-id pub-id-type="pmid">23851394</pub-id>
</element-citation>
</ref>
<ref id="bib65">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodriguez-Valera</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Martin-Cuadrado</surname>
<given-names>AB</given-names>
</name>
<name>
<surname>Rodriguez-Brito</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pasić</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Thingstad</surname>
<given-names>TF</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Mira</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Explaining microbial population genomics through phage predation</article-title>
<source>Nature Reviews Microbiology</source>
<volume>7</volume>
<fpage>828</fpage>
<lpage>836</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro2235</pub-id>
</element-citation>
</ref>
<ref id="bib66">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>The phage proteomic tree: a genome-based taxonomy for phage</article-title>
<source>Journal of Bacteriology</source>
<volume>184</volume>
<fpage>4529</fpage>
<lpage>4535</lpage>
</element-citation>
</ref>
<ref id="bib67">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roux</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Enault</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Bronner</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Vaulot</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Forterre</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Krupovic</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2013</year>
<article-title>Chimeric viruses blur the borders between the major groups of eukaryotic single-stranded DNA viruses</article-title>
<source>Nature Communications</source>
<volume>4</volume>
<fpage>1</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1038/ncomms3700</pub-id>
</element-citation>
</ref>
<ref id="bib68">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roux</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Enault</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hurwitz</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2015a</year>
<article-title>VirSorter: mining viral signal from microbial genomic data</article-title>
<source>PeerJ</source>
<volume>3</volume>
<fpage>e985</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.985</pub-id>
<pub-id pub-id-type="pmid">26038737</pub-id>
</element-citation>
</ref>
<ref id="bib84">
<element-citation publication-type="data">
<person-group person-group-type="author">
<name>
<surname>Roux</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Woyke</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2015b</year>
<article-title>Data from: Viral dark matter and virus-host interactions resolved from publicly available microbial genomes</article-title>
<source>Dryad Data Repository</source>
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5061/dryad.b8226">http://dx.doi.org/10.5061/dryad.b8226</ext-link>
</element-citation>
</ref>
<ref id="bib69">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roux</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hawley</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Torres Beltran</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Scofield</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schwientek</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Stepanauskas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Woyke</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2014</year>
<article-title>Ecology and evolution of viruses infecting uncultivated SUP05 bacteria as revealed by single-cell- and meta-genomics</article-title>
<source>eLife</source>
<volume>3</volume>
<fpage>e03125</fpage>
<pub-id pub-id-type="doi">10.7554/eLife.03125</pub-id>
<pub-id pub-id-type="pmid">25171894</pub-id>
</element-citation>
</ref>
<ref id="bib32">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saint Girons</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bourhy</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ottone</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Picardeau</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Yelton</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hendrix</surname>
<given-names>RW</given-names>
</name>
<name>
<surname>Glaser</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Charon</surname>
<given-names>N</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>The LE1 bacteriophage replicates as a plasmid within Leptospira biflexa: construction of an L. biflexa-
<italic>Escherichia coli</italic>
shuttle vector</article-title>
<source>Journal of Bacteriology</source>
<volume>182</volume>
<fpage>5700</fpage>
<lpage>5705</lpage>
<pub-id pub-id-type="doi">10.1128/JB.182.20.5700-5705.2000</pub-id>
<pub-id pub-id-type="pmid">11004167</pub-id>
</element-citation>
</ref>
<ref id="bib70">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salim</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Skilton</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Lambden</surname>
<given-names>PR</given-names>
</name>
<name>
<surname>Fane</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Clarke</surname>
<given-names>IN</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Behind the chlamydial cloak: the replication cycle of chlamydiaphage Chp2, revealed</article-title>
<source>Virology</source>
<volume>377</volume>
<fpage>440</fpage>
<lpage>445</lpage>
<pub-id pub-id-type="doi">10.1016/j.virol.2008.05.001</pub-id>
<pub-id pub-id-type="pmid">18570973</pub-id>
</element-citation>
</ref>
<ref id="bib71">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sencilo</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Paulin</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kellner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Helm</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Roine</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Related haloarchaeal pleomorphic viruses contain different genome types</article-title>
<source>Nucleic Acids Research</source>
<volume>40</volume>
<fpage>5523</fpage>
<lpage>5534</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gks215</pub-id>
<pub-id pub-id-type="pmid">22396526</pub-id>
</element-citation>
</ref>
<ref id="bib72">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sims</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Jun</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>GA</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>SH</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>106</volume>
<fpage>2677</fpage>
<lpage>2682</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0813249106</pub-id>
</element-citation>
</ref>
<ref id="bib73">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sternberg</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Austin</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>1981</year>
<article-title>The maintenance of the P1 plasmid prophage</article-title>
<source>Plasmid</source>
<volume>5</volume>
<fpage>20</fpage>
<lpage>31</lpage>
<pub-id pub-id-type="doi">10.1016/0147-619X(81)90075-5</pub-id>
<pub-id pub-id-type="pmid">7012872</pub-id>
</element-citation>
</ref>
<ref id="bib74">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sullivan</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Petty</surname>
<given-names>NK</given-names>
</name>
<name>
<surname>Beatson</surname>
<given-names>SA</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Easyfig: a genome comparison visualizer</article-title>
<source>Bioinformatics</source>
<volume>27</volume>
<fpage>1009</fpage>
<lpage>1010</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr039</pub-id>
<pub-id pub-id-type="pmid">21278367</pub-id>
</element-citation>
</ref>
<ref id="bib75">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Suttle</surname>
<given-names>CA</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Marine viruses–major players in the global ecosystem</article-title>
<source>Nature Reviews Microbiology</source>
<volume>5</volume>
<fpage>801</fpage>
<lpage>812</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro1750</pub-id>
<pub-id pub-id-type="pmid">17853907</pub-id>
</element-citation>
</ref>
<ref id="bib76">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tadmor</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Ottesen</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Leadbetter</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Phillips</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Probing individual environmental bacteria for viruses by using microfluidic digital PCR</article-title>
<source>Science</source>
<volume>333</volume>
<fpage>58</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1126/science.1200758</pub-id>
<pub-id pub-id-type="pmid">21719670</pub-id>
</element-citation>
</ref>
<ref id="bib77">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weitz</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Poisot</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Flores</surname>
<given-names>CO</given-names>
</name>
<name>
<surname>Valverde</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Hochberg</surname>
<given-names>ME</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Phage-bacteria infection networks</article-title>
<source>Trends in Microbiology</source>
<volume>21</volume>
<fpage>82</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="doi">10.1016/j.tim.2012.11.003</pub-id>
<pub-id pub-id-type="pmid">23245704</pub-id>
</element-citation>
</ref>
<ref id="bib78">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Whitman</surname>
<given-names>WB</given-names>
</name>
<name>
<surname>Coleman</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Wiebe</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Prokaryotes: the unseen majority</article-title>
<source>Proceedings of the National Academy of Sciences of USA</source>
<volume>95</volume>
<fpage>6578</fpage>
<lpage>6583</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.95.12.6578</pub-id>
</element-citation>
</ref>
<ref id="bib79">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wright</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Konwar</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Microbial ecology of expanding oxygen minimum zones</article-title>
<source>Nature Reviews Microbiology</source>
<volume>10</volume>
<fpage>381</fpage>
<lpage>394</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro2778</pub-id>
<pub-id pub-id-type="pmid">22580367</pub-id>
</element-citation>
</ref>
<ref id="bib80">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wrighton</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Sharon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Castelle</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>VerBerkmoes</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Wilkins</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Hettich</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Lipton</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Long</surname>
<given-names>PE</given-names>
</name>
<name>
<surname>Banfield</surname>
<given-names>JF</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla</article-title>
<source>Science</source>
<volume>337</volume>
<fpage>1661</fpage>
<lpage>1666</lpage>
<pub-id pub-id-type="doi">10.1126/science.1224041</pub-id>
<pub-id pub-id-type="pmid">23019650</pub-id>
</element-citation>
</ref>
<ref id="bib81">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yoon</surname>
<given-names>HS</given-names>
</name>
<name>
<surname>Price</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Stepanauskas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rajah</surname>
<given-names>VD</given-names>
</name>
<name>
<surname>Sieracki</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>EC</given-names>
</name>
<name>
<surname>Duffy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bhattacharya</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Single-cell genomics reveals organismal interactions in uncultivated marine protists</article-title>
<source>Science</source>
<volume>332</volume>
<fpage>714</fpage>
<lpage>717</lpage>
<pub-id pub-id-type="doi">10.1126/science.1203163</pub-id>
<pub-id pub-id-type="pmid">21551060</pub-id>
</element-citation>
</ref>
<ref id="bib82">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Youle</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Haynes</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Scratching the surface of biology's dark matter</article-title>
<person-group person-group-type="editor">
<name>
<surname>Witzany</surname>
<given-names>G</given-names>
</name>
</person-group>
<source>Viruses: essential agents of life</source>
<publisher-loc>Dordrecht, Netherlands</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>61</fpage>
<lpage>80</lpage>
</element-citation>
</ref>
<ref id="bib83">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lynch</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Dennis</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Wishart</surname>
<given-names>DS</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>PHAST: a fast phage search tool</article-title>
<source>Nucleic Acids Research</source>
<volume>39</volume>
<fpage>W347</fpage>
<lpage>W352</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkr485</pub-id>
<pub-id pub-id-type="pmid">21672955</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<sub-article id="SA1" article-type="article-commentary">
<front-stub>
<article-id pub-id-type="doi">10.7554/eLife.08490.022</article-id>
<title-group>
<article-title>Decision letter</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Neher</surname>
<given-names>Richard A</given-names>
</name>
<role>Reviewing editor</role>
<aff>
<institution>Max Planck Institute for Developmental Biology</institution>
,
<country>Germany</country>
</aff>
</contrib>
</contrib-group>
</front-stub>
<body>
<boxed-text position="float" orientation="portrait">
<p>eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see
<ext-link ext-link-type="uri" xlink:href="http://elifesciences.org/review-process">review process</ext-link>
). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.</p>
</boxed-text>
<p>[Editors’ note: this article was originally rejected after discussions between the reviewers, but the authors were invited to resubmit after an appeal against the decision.]</p>
<p>Thank you for choosing to send your work entitled “Viral dark matter and virus-host interactions resolved from publicly available microbial genomes” for consideration at
<italic>eLife</italic>
. Your full submission has been evaluated by Diethard Tautz (Senior editor) and three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the decision was reached after discussions between the reviewers. One of the three reviewers, Ken Stedmann, has agreed to share his identity.</p>
<p>All reviewers agreed that virus diversity and ecology are very important topics that deserve more attention and new approaches. The large set of putative viral sequences is impressive and the patterns of host association are intriguing, but we felt that the analysis didn't deliver much novel insight into the evolution of “viral dark matter”. A more in-depth analysis covering multiple scales of evolution would be necessary to make significant progress in this direction. As it stands, the manuscript describes in broad terms a data set generated with a tool that is supposed to be published elsewhere. The usefulness of the data set as a resource for others remains limited without a feature-rich database that allows convenient exploration and access. Hence we don't feel that your manuscript is appropriate for publication as a Research article in
<italic>eLife</italic>
.</p>
<p>Reviewer #1:</p>
<p>Roux and colleagues analyze a large set of putative viral sequences mined from published bacterial and archaeal genomes using software described in a manuscript that is currently under review elsewhere (and provided to the reviewers). The method seems sound and I think the majority of the reported viral sequences are genuine. The authors use this large set (10-fold larger than existing databases) to investigate patterns of host adaptation, host range, and virus taxonomy.</p>
<p>The main results reported are:</p>
<p>(i) > 12000 sequences fall into ∼600 clusters, half of which contain known viruses;</p>
<p>(ii) Viruses are well adapted to their hosts, across virus types and proteins;</p>
<p>(iii) Viruses are mostly host specific and virus/hosts define modules.</p>
<p>These results make sense and are mostly expected; the novel element here is the ability to do this on a massive scale. I have a couple of other comments/criticisms:</p>
<p>1) Only the more reliable predictions were used. How do things change when less stringent criteria are used?</p>
<p>2) Is there a sense of saturation: if one had done the study a few years ago with fewer genomes, how many clusters would have been found? Are some parts of the bacterial world exhausted? Where do the new sequences come from?</p>
<p>3) What can be learned from this for future efforts to detect viruses? How should sequencing be targeted?</p>
<p>4) Coinfection: the number of viruses per genome seems compatible with a random distribution. What can really be learned from this? Does this reflect the number of concomitant infections, or the number of viral genomes deposited in the host genome in the past (like endogenous retroviruses in mammals)?</p>
<p>5) How are others supposed to use this? A data set of this size needs tools to analyze. Are the authors going to develop a database with interactive views, etc.?</p>
<p>6) This is a computational study. I expect the scripts and code to be deposited.</p>
<p>Reviewer #2:</p>
<p>In this manuscript the authors describe an approach to increase our knowledge about prokaryotic viruses (phages) by mining prokaryotic genomes available in public repositories. It is an excellent idea that makes a lot of sense and is badly needed to expand knowledge of the phage sequence space, which is presently very biased and incomplete. It can also make a great contribution to one of the conundrums of phage biology, the infection range, without the bias of culture. However, I have mixed feelings about this manuscript. On the one hand, it is very comprehensive, including all genomes in repositories (including many draft genomes); on the other, the results are a bit disappointing and provide very little novelty. That the pattern of infection at large phylogenetic scale would be modular was largely expected from classical work with cultures. But the most relevant question, whether the pattern is nested at short phylogenetic distances, is left unanswered. Maybe a problem that is general to these “big data” analyses is the gross level of detail. I wonder why the authors do not provide analysis at the fine resolution level, i.e. phages detected within a single species or genus. At the broad level analysed here most of the results are very predictable from classic approaches. The use of draft genomes and the possibility of discriminating plasmids from phages is another question that is left untouched in both this manuscript and the previous submission. There is a gradient in nature between infective phages and conjugative elements and establishing the borderline might be risky.</p>
<p>In summary, I missed a more fine-grained analysis of examples in this big data approach.</p>
<p>Reviewer #3:</p>
<p>I find this to be a very well-written report of the application of a new bioinformatic tool (VirSorter) developed by some of the authors. This tool has been applied to data mining of the available and rapidly growing genomic datasets and thereby has increased the number of putative (mostly partial) viral genomes by ten-fold. Due to association with both known genomes and SAG genomes from known sources, the analysis allowed identification of potential viruses in hosts for which no known viruses are currently available. This is clearly a boon to researchers working on these organisms. I find the tetranucleotide analysis of viral genomes in order to possibly identify hosts for these viruses to be particularly attractive and plan on using it in my own research.</p>
<p>I am not convinced of the premise stated in the title and Abstract that this analysis provides much insight into viral dark matter or virus-host interactions. This tool enables further investigations that would allow that insight, which would otherwise be extremely difficult if not unfeasible.</p>
<p>I wonder if the manuscript describing the development and testing of the tool (which was submitted together with this manuscript) could be combined into one manuscript.</p>
<p>[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]</p>
<p>Thank you for resubmitting your work entitled “Viral dark matter and virus-host interactions resolved from publicly available microbial genomes” for further consideration as a Tools and resources article at
<italic>eLife</italic>
. Your revised article has been favorably evaluated by Diethard Tautz (Senior editor) and a Reviewing editor. The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:</p>
<p>1) Data availability: given that this is now being considered as a Tools and resources article, we feel that the data availability section should be more prominent. We suggest moving the availability section up (possibly before the conclusions) and providing a little more detail. iVirus.us itself seems like a rather hollow shell: clicking on data access yields a 502 bad gateway error. As far as I can tell, everything happens within the discovery environment of iPlant, for which registration is necessary. Please elaborate a little. The MetaVir environment seems useful, but some of the MetaVir analyses have not completed yet. In addition, we think that the “richly annotated genbank files” (promised in the rebuttal letter) should be made available not only on the author's website but also uploaded to a big-data repository such as Dryad.</p>
<p>2) It seems the authors have misunderstood the request for making the scripts available. We were not asking for the VirSorter scripts, but the scripts that analyze the VirSorter data set to produce figures and results of the paper. Those scripts provide the most accurate description of the methods, and in the interest of reproducibility, they should – whenever possible – be made available. The preferred place would be a separate GitHub repository.</p>
<p>3) Host association figure: We continue to be underwhelmed by this figure. There are lots of lines which clearly fall into a handful of modules, but within these modules it is pretty hard to see what is going on. Maybe a two-way clustering would be more insightful. Consider a distance matrix d_ij, where d_ij is the fraction of sequences in viral cluster (VC) i that come from genomes of host phylum j, maybe normalized for the abundance of genomes from phylum j. Then cluster this matrix both by VC and host, similar to RNA-seq being clustered by gene and tissue. The modules should show up as blocks on the diagonal, while promiscuous affiliations are off-diagonal terms. What exactly the distance matrix d_ij should be requires some thought and there are probably better choices than this proposal. But if something like this would work out, it could be more informative than the current figure. Keeping one as a supplement of the other could be a good solution.</p>
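<p>The matrix proposed in point 3 can be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not the method used in the manuscript: the function names (association_matrix, block_order) and the input layout are hypothetical, every host phylum is assumed to have a known genome count, and the row ordering by dominant phylum is a simple stand-in for a full two-way hierarchical clustering.</p>
<preformat>
```python
from collections import Counter, defaultdict


def association_matrix(records, genomes_per_phylum):
    """Build d_ij: share of sequences of viral cluster i found in host phylum j.

    records: iterable of (viral_cluster, host_phylum) pairs, one per sequence.
    genomes_per_phylum: dict mapping each phylum to its number of genomes
    (assumed to cover every phylum seen in records).
    """
    counts = defaultdict(Counter)
    for cluster, phylum in records:
        counts[cluster][phylum] += 1
    phyla = sorted(genomes_per_phylum)
    matrix = {}
    for cluster, row in counts.items():
        # Normalize each count by the number of genomes available for the phylum,
        # then rescale the row so that it sums to 1.
        weighted = [row[p] / genomes_per_phylum[p] for p in phyla]
        total = sum(weighted)
        matrix[cluster] = [w / total for w in weighted]
    return phyla, matrix


def block_order(matrix):
    """Order clusters by dominant phylum column, most specific clusters first."""
    def sort_key(cluster):
        row = matrix[cluster]
        top = max(range(len(row)), key=row.__getitem__)
        return (top, -row[top], cluster)
    return sorted(matrix, key=sort_key)
```
</preformat>
<p>With rows ordered this way, clusters restricted to one phylum group into blocks, and clusters spread over several phyla stand out as rows with sizeable off-block values.</p>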
</body>
</sub-article>
<sub-article id="SA2" article-type="reply">
<front-stub>
<article-id pub-id-type="doi">10.7554/eLife.08490.023</article-id>
<title-group>
<article-title>Author response</article-title>
</title-group>
</front-stub>
<body>
<p>
<italic>All reviewers agreed that virus diversity and ecology are very important topics that deserve more attention and new approaches. The large set of putative viral sequences is impressive and the patterns of host association are intriguing, but we felt that the analysis didn't deliver much novel insight into the evolution of</italic>
<italic>viral dark matter</italic>
<italic>. A more in depth analysis covering multiple scales of evolution would be necessary to make significant progress in this direction. As it stands, the manuscript describes in broad terms a data set generated with a tool that is supposed to be published elsewhere. The usefulness of the data set as a resource for others remains limited without a feature rich data base that allows convenient exploration and access. Hence we don't feel that your manuscript is appropriate for publication as a Research article in</italic>
eLife.</p>
<p>We appreciate that a manuscript introducing 12,498 new phage genomes (whole and large fragments) leaves a feeling of unfinished business no matter how it is written. These reviews also helped us see that we failed to open with a quantitative metric of “impact”; to us, the scale alone (augmenting available phage genome sequences by an order of magnitude) made the case. This is because the last decade has seen microbial ecology transformed by large-scale datasets, including the Global Ocean Survey microbial metagenomics dataset (Rusch et al. PLoS Biology, 2007) and the first viral metagenomic dataset (Angly et al. PLoS Biology, 2006), papers which have 1383 and 613 citations, respectively. At the same time, viral ecology is paralyzed by the dominance of “unknowns” in metagenomic studies: commonly 63–93% of new viral metagenomic reads are new to science, presumably because we have only just over a thousand phage genomes and they derive largely (85%) from only 3 of 45 bacterial phyla.</p>
<p>How much will our 12,498 host-associated phage genomes improve the situation? A new analysis we include here shows that they as much as double the number of affiliated proteins for some environmental viromes (∼35% for seawater viromes vs ∼100% for the human gut virome; see
<xref ref-type="fig" rid="fig7">Author response image 1</xref>
). We hope this more clearly shows how this single study’s dataset will be foundational for future ecology studies seeking to “see” viruses in microbial datasets and to affiliate viruses in viral datasets. These new results were added to the revised manuscript (text and new
<xref ref-type="fig" rid="fig3">Figure 3B</xref>
).
<fig id="fig7" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7554/eLife.08490.024</object-id>
<label>Author response image 1.</label>
<caption>
<title>Improvement in the proportion of affiliated genes from viromes with the VirSorter dataset.</title>
<p>Predicted genes from the Pacific Ocean Viromes (
<xref rid="bib36" ref-type="bibr">Hurwitz and Sullivan, 2013</xref>
), Tara Ocean Viromes (
<xref rid="bib8" ref-type="bibr">Brum et al., 2015</xref>
) and Human Gut Viromes (Minot et al., 2013) were compared to RefSeqVirus (May 2015) and the 12.5k VirSorter dataset (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq virus, but can now be associated with a phage and a host through the VirSorter database.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.024">http://dx.doi.org/10.7554/eLife.08490.024</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f007"></graphic>
</fig>
</p>
<p>We can also see that we failed to articulate clearly to the reviewers the specific biological advances made in this manuscript. We list the major advances here, any one of which, we would argue, could be the sole focus of a strong, top-tier manuscript.</p>
<p>1) The amount of viral signal in publicly deposited genomes (12.5k highly confident viral sequences in 15k bacteria and archaea genomes) is unexpectedly high since we focused our analysis on “active” infections by excluding fragmented genomes likely to be defective or decayed prophages.</p>
<p>2) This study is the first to attempt to quantify the lesser-studied types of viruses and finds viral genomes not integrated in the host genome to be rather abundant (>1k sequences were identifiable, subsection “New viruses detected in public microbial genomic datasets with VirSorter”). These could represent extrachromosomal prophages, chronic infections, or “cryptic” lytic viruses (i.e., lytic viruses that go unnoticed in a culture), all infection types that are understudied, with unknown and likely underestimated ecological impacts.</p>
<p>3) Genome-based clustering analyses revealed that approximately half of the observed viral clusters in the VirSorter dataset lacked known reference genomes (subsection “264 new putative viral genera identified through genome-based network clustering”, last paragraph). Obtaining complete or near-complete genomes and documenting the host range for these new groups is critical for mapping the virosphere, especially because while other approaches (e.g., viral metagenomics) can help identify non-cultivated viral diversity, these lack this host association. Highlightable “firsts” here include the first viral genomes for 9 bacterial phyla (subsection “Long-term evolutionary patterns of bacterial and archaeal virus genomes” and
<xref ref-type="table" rid="tbl1">Table 1</xref>
, see also
<xref ref-type="fig" rid="fig8">Author response image 2</xref>
, which is about as many as in all published literature to date, as well as a new
<italic>Bacteroides</italic>
virus, unrelated to any virus previously described that likely represents a new viral order (subsection “New viruses detected in public microbial genomic datasets with VirSorter”, third paragraph).
<fig id="fig8" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7554/eLife.08490.025</object-id>
<label>Author response image 2.</label>
<caption>
<title>Distribution of viral sequences in the RefSeq and VirSorter datasets.</title>
<p>For each host group, a circle proportional to the number of viral genomes available is noted in red for RefSeq and blue for VirSorter. Hosts for which no RefSeq references were available are highlighted in bold.</p>
<p>
<bold>DOI:</bold>
<ext-link ext-link-type="doi" xlink:href="10.7554/eLife.08490.025">http://dx.doi.org/10.7554/eLife.08490.025</ext-link>
</p>
</caption>
<graphic xlink:href="elife08490f008"></graphic>
</fig>
</p>
<p>4) The fraction of microbes that are co-infected by more than one virus remains a fundamental, yet largely (completely?) unknown number in any environment. Here we show that co-infection is common (∼50% of cells are co-infected, l. 242) and many of these co-infections are by more than one type of virus. Such co-infections likely have far-reaching implications for viral genome evolution, as they provide opportunity for gene exchange, so quantifying the co-infection frequency across viral groups offers insight into how genomically promiscuous one viral group might be relative to another.</p>
<p>5) While not completely novel observations, we also perform analyses that confirm, with a much larger dataset, prior work in key areas that are desperately needed in viral taxonomy and ecology. First, genome network analysis helps classify new viruses in a robust genome-based taxonomic framework that is largely consistent with accepted ICTV taxonomy (subsection “264 new putative viral genera identified through genome-based network clustering”). Second, leveraging this unprecedentedly large-scale, host-associated viral genomic dataset, we show that tetranucleotide frequency distance is a surprisingly robust predictor of the host of most viruses. Again, while not novel knowledge, working at this scale we added an effort to quantify the probabilistic value of these predictions across multiple host phyla, as well as compared the performance of tetranucleotide frequency to 4 other sequence composition based metrics to help provide strong guidance to researchers on using this metric. Perhaps this is why in spite of the idea being in the literature for some time, reviewer #3 notes: “I find the tetranucleotide analysis of viral genomes in order to possibly identify hosts for these viruses to be particularly attractive and plan on using it in my own research
<italic>.</italic>
” Third, the modular pattern of a global virus-host network was indeed predicted by theoretical models, although the only study of comparable size observing a modular virus-host network (Flores et al., ISMEJ, 2011) was based on plaque formation on host cultures where genetic diversity was unknown. Here, we validate the modularity with an unprecedentedly large-scale dataset that includes microbes spanning 18 phyla, and add information about the level of taxonomy at which the virus-host network is modular (as we expect it to become nested at one point, near the “tip of the tree”). Fourth, the dominance of Caudovirales in the dataset, as well as the clear separation between DNA and RNA viruses as well as Archaeal and Bacterial viruses were all expected based on the previous knowledge on viral diversity, and so these findings are largely only confirmatory.</p>
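<p>As a rough illustration of the tetranucleotide comparison mentioned in point 5, the following Python sketch assigns a virus to the candidate host whose tetranucleotide profile is closest. It is a minimal sketch under stated assumptions, not the procedure used in the manuscript: the function names are hypothetical, Euclidean distance is an assumed choice of metric, and reverse-complement strands and sequence-length effects are ignored.</p>
<preformat>
```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Return the relative frequency of each tetranucleotide in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only count 4-mers made of A, C, G, T (windows with N etc. are skipped).
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def tetra_distance(seq_a, seq_b):
    """Euclidean distance between two tetranucleotide profiles (assumed metric)."""
    freq_a, freq_b = tetra_freq(seq_a), tetra_freq(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))


def predict_host(virus_seq, host_genomes):
    """Pick the host name whose genome has the closest profile to the virus."""
    return min(host_genomes, key=lambda name: tetra_distance(virus_seq, host_genomes[name]))
```
</preformat>
<p>In practice the smallest distance alone says little about reliability; as noted above, the value of such a prediction has to be quantified per host group.</p>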
<p>To better emphasize these results, we added a new figure displaying more clearly how the curated VirSorter dataset expands the range of known viruses (
<xref ref-type="fig" rid="fig1">Figure 1</xref>
), and reorganized the manuscript so that the first three subparts of the Results section are now entirely dedicated to the exploration of this new diversity (subsection “New viruses detected in public microbial genomic datasets with VirSorter”). The questions of viral classification through genome-based network, virus-host interactions and adaptations are then addressed in three more subsections (“Long-term evolutionary patterns of bacterial and archaeal virus genomes”, “Global virus–host network is confirmed as modular” and “Virus–host adaptation signals detected at the genome composition and codon usage level”). We hope that this new organization brings more balance to the manuscript and helps to better introduce the dataset before switching to secondary analyses.</p>
<p>Another issue that the reviewers had was that there was the perception of a lack of depth in the manuscript. We acknowledge that the format of the manuscript follows a style in which only global patterns are presented. This is similar to how Rinke et al
<italic>.</italic>
(2013 Nature) handled their explorations of microbial dark matter – and is a common strategy for getting such big datasets out to specialists for follow-on analyses. Such follow-up studies will undoubtedly be extremely interesting and critical for the field. However, we chose to play to the strengths of the data and consider only global-scale stories.</p>
<p>Notably,
<italic>eLife</italic>
recently published a manuscript describing comparative phage genomics of 627 mycobacteriophage genomes (Whole genome comparison of a large collection of mycobacteriophages reveals a continuum of phage genetic diversity, Pope et al., 2015.
<italic>eLife</italic>
). The major findings in the manuscript (e.g., phage genomes are part of a genetic continuum) are consistent with and redundant to the findings from at least 5 previously published papers by the same group (Hendrix et al., PNAS, 1998; Pedulla et al., Cell, 2003; Jacobs-Sera et al., Virology, 2012; Cresawn et al., PLoS One, 2015) yet the value of such a large-scale dataset is paramount currently for future studies of the ecology, evolution and genomics of phages, which is presumably why the study was accepted at
<italic>eLife</italic>
.</p>
<p>Finally, the last two criticisms in the editor’s summary were as follows:</p>
<p>1)
<italic>“the manuscript describes in broad terms a data set generated with a tool that is supposed to be published elsewhere.”</italic>
The manuscript describing the methodological details of the tool is now published: “VirSorter: mining viral signal from microbial genomic data”, S Roux, F Enault, BL Hurwitz, MB Sullivan, PeerJ 3, e985.</p>
<p>2)
<italic>“The usefulness of the data set as a resource for others remains limited without a feature rich data base that allows convenient exploration and access.”</italic>
We strongly agree that providing convenient and enriched access to data and tools is crucial for researchers, as can be seen from the previous projects of the lead author, who developed and maintains one of the only publicly available viral metagenomic databases and analysis tools (MetaVir), and of the senior author’s laboratory, which has been building iVirus on the back of the NSF-funded iPlant Cyberinfrastructure in spite of a lack of funding for the project. In fact, both projects are unfunded, yet we maintain and/or develop them as they represent our commitment to getting the data and tools into researchers’ hands to enable them to better “see” the viruses in their datasets. Although we did not emphasize this in the manuscript, a mistake we would correct in a revised manuscript, we intend to make the dataset available on these two complementary websites (MetaVir and iVirus). MetaVir provides an automatic annotation of each sequence, with multiple visualization tools to explore and compare genome maps, as well as multiple ways of searching the data (by host, by phage affiliation, by gene taxonomic or functional affiliation, by size, etc.) and extracting a specific subset of interest. iVirus offers optimized data repository features as well as numerous analytical tools for comparative genome analyses and metagenomic fragment recruitment analyses (BLAST, bwa and bowtie2 read aligners, multiple flavors of functional gene annotation, phylogenetic tree building pipelines, etc.). Finally, a summary of the sequences and clusters is provided as supplementary files, and both the raw data and richly annotated sequences (genbank file format, including taxonomic and functional affiliation of all genes) will be available to download on Sullivan’s publications webpage should the paper be accepted, just as we do for other datasets of community interest (e.g., the Pacific Ocean Viromes and the Tara Oceans Viromes datasets).
The information about the dataset availability was added to the Material and methods (subsection “Dataset and script availability”).</p>
<p>In summary, we would argue that there is no greater challenge to exploring the ecology and evolution of viral communities in diverse ecosystems (e.g., oceans, soils, humans) than the lack of reference genomes, which causes the dominance of ‘viral dark matter’. Above we have tried to more carefully articulate the major advances this study makes, and we emphasize that the reviewers also noted the quality of the work and its relevance for the field (Reviewer 2:
<italic>“It is an excellent idea that makes a lot of sense and is badly needed to increase the knowledge about the sequence space of phages presently very biased and incomplete.”</italic>
, Reviewer 3:
<italic>“This is clearly a boon to researchers working on these organisms.”</italic>
). We hope that the new figures (
<xref ref-type="fig" rid="fig1 fig2 fig3">Figures 1, 2 and 3</xref>
), the added results and the new organization of the manuscript help to bring out how valuable the VirSorter curated dataset is, and what insights into virus–host interactions were obtained. Please find below a point-by-point response to the reviewers’ comments.</p>
<p>Reviewer #1:</p>
<p>
<italic>Roux and colleagues analyze a large set of putative viral sequences mined from published bacterial and archaeal genomes using a software that is described in a manuscript that is currently under review elsewhere (and provided to the reviewers). The method seems sound and I think the majority of the reported viral sequences are genuine. The authors use this large set (10 fold larger than existing data bases) to investigate patterns of host adaptation, host range, and virus taxonomy</italic>
.</p>
<p>The main results reported are:</p>
<p>
<italic>(i) > 12000 sequences fall into ∼600 clusters, half of which contain known viruses</italic>
;</p>
<p>
<italic>(ii) Viruses are well adapted to their hosts, across virus types and proteins</italic>
;</p>
<p>
<italic>(iii) Viruses are mostly host specific and virus/hosts define modules</italic>
.</p>
<p>
<italic>These results make sense and are mostly expected, the novel element here is to be able to do it on a massive scale</italic>
.</p>
<p>We acknowledge that some results mostly confirm what could be predicted based on smaller scale studies, but we feel that we have made significant new discoveries and phenomenological observations in exploring the global scale patterns in this dataset. We hope that our efforts above (response to editor’s summary) now better articulate the specific advances made in this manuscript.</p>
<p>
<italic>I have a couple of other comments/criticisms</italic>
:</p>
<p>
<italic>1) Only the more reliable predictions were used</italic>
.
<italic>How do things change when less stringent criteria are used?</italic>
</p>
<p>We appreciate the suggestion of including the category 3 predictions (∼90K sequences) in our analyses. During the data exploration phase of preparing this manuscript, we examined the category 3 predictions but found them to be of mixed use, since we were focused on viral sequence space in this manuscript. We went into some detail about this in the “tool” manuscript in PeerJ
<italic>,</italic>
but also here explicitly caution the reader about the value of category 3 predictions. Specifically, “we discarded all predictions lacking a viral hallmark gene or a viral gene enrichment […] as these are likely defective prophages for which boundaries are difficult to predict in silico and that often include bacterial genes” (subsection “Selection of a relevant subset of viral sequences: the VirSorter dataset”). While we were focused here on the higher confidence viral genome sequences (category 1 and 2 predictions), the category 3 predictions are of great value to specialists interested in defective prophages, mobile elements or microbial genomic islands. We hope with this added context that you can appreciate our decision to focus in this way and yet also make the category 3 predictions available through this study since they could be of value for follow-on work.</p>
<p>2) Is there a sense of saturation: if one had done the study a few years ago with fewer genomes, how many clusters would have been found? Are some parts of the bacterial world exhausted, where do the new sequences come from?</p>
<p>Because this study leverages 15K publicly available microbial genomes, it is not an ideal dataset from which to draw conclusions about saturation. Notably, however, we saw that even well studied groups, such as
<italic>Gammaproteobacteria</italic>
and
<italic>Bacilli,</italic>
do not appear saturated as new VCs (i.e., those lacking a RefSeq reference) were detected here too (new
<xref ref-type="fig" rid="fig2">Figure 2B</xref>
). We do see this as an ideal question to approach in the future using VirSorter, once the flood of SAG data is available, as these sequences will better span the microbial tree of life and provide context for both lytically infecting and cell-associated (prophage, extrachromosomal, chronic, etc.) infecting viruses.</p>
<p>3) What can be learned from this for future efforts to detect viruses? How should sequencing be targeted?</p>
<p>This is a question each individual researcher will need to answer based upon their particular research question: are you interested in capturing sequence breadth or depth? It is an age-old trade-off, and one we cannot answer well here: because we leveraged public data rather than developing an explicit experimental sampling strategy, even this scale of data is not yet very deep in any one category.</p>
<p>
<italic>4) Coinfection: the number of viruses per genome seem compatible with random. What can really be learned from this? Does this reflect the number of concomitant infections</italic>
,
<italic>or the number of genomes deposited in this genome in the past (like endogenous retroviruses in mammals)?</italic>
</p>
<p>Unfortunately, discerning genomes previously deposited in the host genome from active viral infections in silico is nearly impossible. We conservatively focus on sequences that are likely to be active and not past infections as we only consider prophages that included the capsid-associated genes (viral hallmark genes). Thus, the 12.5k sequences likely underestimate the total number of viruses in the dataset (since we miss those with unrecognizable capsid genes), but should conservatively identify active infections that represent some combination of lytic infections, prophages and chronic infections. We added a discussion about active viruses (subsection “VirSorter curated dataset includes extrachromosomal genomes and improves virome affiliation”).</p>
<p>
<italic>5) How are others supposed to use this? A data set of this size needs tools to analyze. Are the authors going to develop a data base with interactive views etc</italic>
.
<italic>?</italic>
</p>
<p>We carefully considered for some time how best to make these data available and in the end chose to make them available through the iPlant Cyberinfrastructure, which makes it easy to share large sequence datasets, and the MetaVir web server, which generates automatically annotated contig maps searchable by function, taxonomy, or host taxonomy (details in the response to editor's summary above). Notably, all viral genomes are made available in a fully annotated genbank format that can be loaded by researchers into any number of genome browsers (e.g., Artemis) for follow-on analytics using the tool of their choice. It is beyond the scope of this manuscript or our lab to develop interactive data interrogation tools for these data, as these efforts are often large-scale projects (e.g., iPlant is a $100M NSF Center, KBase is a $100M DOE Center).</p>
<p>
<italic>6) This is a computational study. I expect the scripts and code to be deposited</italic>
.</p>
<p>We agree, and apologize for not displaying this clearly in the manuscript. All of our code was previously made publicly available through GitHub and a community-available version is implemented through the iPlant Cyberinfrastructure. These details were in the prior, PeerJ publication that describes the VirSorter tool, but we now also point readers to these details in the current manuscript (subsection “Dataset and script availability”).</p>
<p>Reviewer #2:</p>
<p>
<italic>In this manuscript the authors describe an approach to increase our knowledge about prokaryotic viruses (phages) by mining prokaryotic genomes available in public repositories. It is an excellent idea that makes a lot of sense and is badly needed to increase the knowledge about the sequence space of phages presently very biased and incomplete. It can also provide a great contribution into one of the conundrums of phage biology, the infection range without the bias of culture. However, I have mixed feelings about this manuscript. On the one hand it is very comprehensive including all genomes in repositories (including many draft genomes) but the results are a bit disappointing and provide very little novelty. That the pattern of infection at large phylogenetic scale will be modular was largely expected from classical work with cultures. But the most relevant question is whether at short phylogenetic distances is nested what is left unanswered. Maybe a problem that is general to these</italic>
<italic>big data</italic>
<italic>analysis is the gross level of detail. I wonder why the authors do not provide analysis at the fine resolution level i.e. phages detected within a single species or genus. At the broad level analysed here most of the results are very predictable from classic approaches. The use of draft genomes and the possibility of discriminating plasmids from phages is another question that is left untouched in both this manuscript and the previous submission. There is a gradient in nature between infective phages and conjugative elements and establishing the borderline might be risky</italic>
.</p>
<p>In summary I missed some more fine grained analysis of examples in this big data approach.</p>
<p>We thank the reviewer for these kind words. Indeed, these more detailed analyses would likely be extremely interesting, however the density of the current manuscript (see the reply to editor's summary) hardly allows for the addition of more results, which would also lead to additional Introduction and Discussion. In the response to the editor’s summary above, we describe our rationale for why we hope to keep the manuscript focused on the big picture or global-scale analyses.</p>
<p>Reviewer #3:</p>
<p>I find this to be a very well-written report of the application of a new bioinformatic tool (VirSorter) developed by some of the authors. This tool has been applied to data mining of the available and rapidly growing genomic datasets and thereby has increased the number of putative (mostly partial) viral genomes by ten-fold. Due to association with both known genomes and SAG genomes from known sources, the analysis allowed identification of potential viruses in hosts for which no known viruses are currently available. This is clearly a boon to researchers working on these organisms. I find the tetranucleotide analysis of viral genomes in order to possibly identify hosts for these viruses to be particularly attractive and plan on using it in my own research.</p>
<p>
<italic>I am not convinced of the premise stated in the title and Abstract that this analysis provides much insight into viral dark matter or virus-host interactions. This tool enables further investigations that would allow that insight, which would otherwise be extremely difficult if not unfeasible</italic>
.</p>
<p>We thank the reviewer for the kind words. Although we agree that the tool enables exciting potential follow-up investigations, we still consider that the description of such a vast dataset, which includes (as noted by the reviewer) potential viruses for host groups with no currently isolated virus, is akin to taking one (giant) step into the viral dark matter. Notably, using these host-associated viral sequences as a complementary database doubled the ratio of affiliated genes from human gut viromes (see
<xref ref-type="fig" rid="fig7">Author response image 1</xref>
). The description of the different viral clusters linked to these new viruses and associated with specific host groups is what we consider to be new insight into viral dark matter and virus–host interactions. Clearly we failed to articulate those advances in the submitted manuscript, but we hope that our response to the editor’s summary above makes our case more clearly, and that this revised manuscript is better at bringing these points out.</p>
<p>
<italic>I wonder if the manuscript describing the development and testing of the tool (which was submitted together with this manuscript) could be combined into one manuscript</italic>
.</p>
<p>We had felt similarly and previously prepared a manuscript combining the tool and the findings presented in this current manuscript. Unfortunately, 18 months ago such a “merged” manuscript did not review well, as it frustrated both informaticists and biologists, each desiring more detail. Thus we chose to publish the tool separately (“VirSorter: mining viral signal from microbial genomic data”, S Roux, F Enault, BL Hurwitz, MB Sullivan, PeerJ 3, e985) and here present its first application to ∼15K publicly available bacterial and archaeal genomes (this study).</p>
<p>[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]</p>
<p>
<italic>1) Data availability: given that this is a Tools and Resources article, we feel that the data availability section should be more prominent. We suggest moving the availability section up (possibly before the conclusions) and providing a little more detail. iVirus.us itself seems like a rather hollow shell – clicking on data access yields a 502 bad gateway error. As far as I can tell, everything happens within the discovery environment of iPlant, for which registration is necessary. Please elaborate a little bit. The MetaVir environment seems useful, but some of the MetaVir analyses have not completed yet. In addition, we think that the</italic>
<italic>richly annotated genbank files</italic>
<italic>(promised in the rebuttal letter) should be made available not only on the author's website but uploaded to a big-data repository such as data dryad</italic>
.</p>
<p>We agree with the idea of placing more emphasis on data availability and appreciate the suggestions for how to do so. To this end, we have:</p>
<p>A) Created a “Dataset availability” section. This section is located at the end of the manuscript just before the conclusions, and now details the different places where the VirSorter Curated Dataset and the associated results are available.</p>
<p>B) Created a direct iVirus link for the datasets. As noted by the reviewers, the structure of iVirus is very young and still in development for the most part. However, we will leverage here a new feature in iVirus which allows for direct access to a set of files linked to a publication without the need for registration. This link provides direct access:
<ext-link ext-link-type="uri" xlink:href="http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/ivirus/VirSorter_curated_dataset">http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/ivirus/VirSorter_curated_dataset</ext-link>
. We added this link to the manuscript in this new section (“Dataset availability”).</p>
<p>C) Made the annotated genbank files available via DataDryad. These are organized by host and provided as a zip package now uploaded to DataDryad (DataDryad package
<ext-link ext-link-type="uri" xlink:href="https://datadryad.org/resource/doi:10.5061/dryad.b8226">dryad.b8226</ext-link>
); this information was added to the subsection “Dataset availability” of the revised manuscript.</p>
<p>
<italic>2) It seems the authors have misunderstood the request for making the scripts available. We were not asking for the VirSorter scripts, but the scripts that analyze the VirSorter data set to produce figures and results of the paper. Those scripts provide the most accurate description of the methods, and in the interest of reproducibility, they should – whenever possible – be made available. The preferred place would be a separate GitHub repository</italic>
.</p>
<p>Indeed, we misunderstood the reviewers' earlier request. To rectify this, we have now prepared the scripts used to produce the results in this manuscript for public release on our lab wiki (
<ext-link ext-link-type="uri" xlink:href="http://tmpl.arizona.edu/dokuwiki/doku.php?id=bioinformatics:scripts:vsb">http://tmpl.arizona.edu/dokuwiki/doku.php?id=bioinformatics:scripts:vsb</ext-link>
) and a GitHub repository (
<ext-link ext-link-type="uri" xlink:href="https://github.com/simroux/virsorter-curated-dataset-scripts-package">https://github.com/simroux/virsorter-curated-dataset-scripts-package</ext-link>
).</p>
<p>
<italic>3) Host association figure: We continue to be underwhelmed by this figure. There are lots of lines which clearly fall into a handful of modules, but within these modules it is pretty hard to see what is going on. Maybe a two-way clustering would be more insightful. Consider a distance matrix d_ij, where d_ij is the fraction of sequences in viral cluster (VC) i that come from genomes of host phylum j, maybe normalized for the abundance of genomes from phylum j. Then cluster this matrix both by VC and host, similar to RNA-seq being clustered by gene and tissue. The modules should show up as blocks on the diagonal, while promiscuous affiliations are off-diagonal terms. What exactly the distance matrix d_ij should be requires some thought and there are probably better choices than this proposal. But if something like this would work out, it could be more informative than the current figure. Keeping one as a supplement of the other could be a good solution</italic>
.</p>
<p>We thank you for helping us better see the issues with this figure. To clarify these results, we modified the figures and now represent the same network (and the same modules, identified through lp-brim) in matrix form, as suggested by the reviewers. Although the overall “shape” of the network is less apparent than in the “network” visualization, the “matrix” plot does make it easier to identify the connections between virus clusters and host groups. The new figure (“matrix” visualization) was thus added as
<xref ref-type="fig" rid="fig5">Figure 5</xref>
, and the former
<xref ref-type="fig" rid="fig5">Figure 5</xref>
is now displayed as
<xref ref-type="fig" rid="fig5s1">Figure 5–figure supplement 1</xref>
. We hope that these two representations together help present the findings in a manner that is most informative.</p>
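<p>As an illustration of the matrix d_ij proposed in the reviewers' comment above, the following is a minimal sketch only. It is not the authors' analysis, which relied on modules identified through lp-brim. It assumes a toy list of (viral cluster, host phylum) pairs with hypothetical names, computes d_ij as the fraction of sequences in VC i that come from phylum j (optionally normalized by a hypothetical per-phylum genome count), and then applies a simple dominant-host ordering as a stand-in for true two-way clustering, so that modules appear as blocks near the diagonal.</p>
<preformat>
```python
from collections import Counter, defaultdict


def host_fraction_matrix(assignments, genomes_per_phylum=None):
    """Build d[i][j]: fraction of sequences in viral cluster i from host phylum j.

    assignments: iterable of (viral_cluster, host_phylum) pairs, one per sequence.
    genomes_per_phylum: optional dict used to normalize counts by the number of
    genomes available for each phylum; rows are then rescaled to sum to 1.
    """
    counts = defaultdict(Counter)
    for vc, phylum in assignments:
        counts[vc][phylum] += 1
    clusters = sorted(counts)
    phyla = sorted({p for c in counts.values() for p in c})
    d = []
    for vc in clusters:
        row = [float(counts[vc][p]) for p in phyla]
        if genomes_per_phylum is not None:
            row = [x / genomes_per_phylum[p] for x, p in zip(row, phyla)]
        total = sum(row)
        d.append([x / total for x in row])
    return clusters, phyla, d


def block_order(clusters, phyla, d):
    """Reorder rows and columns so that modules fall near the diagonal.

    Columns are sorted by decreasing total weight; rows are grouped by their
    dominant column. This is a simple heuristic, not hierarchical clustering.
    """
    n_cols = len(phyla)
    weight = [sum(row[j] for row in d) for j in range(n_cols)]
    col_order = sorted(range(n_cols), key=lambda j: (-weight[j], phyla[j]))
    rank = {j: r for r, j in enumerate(col_order)}

    def row_key(i):
        dominant = max(range(n_cols), key=lambda j: (d[i][j], -rank[j]))
        return (rank[dominant], -d[i][dominant], clusters[i])

    row_order = sorted(range(len(clusters)), key=row_key)
    return (
        [clusters[i] for i in row_order],
        [phyla[j] for j in col_order],
        [[d[i][j] for j in col_order] for i in row_order],
    )


# Toy data with hypothetical cluster names (not taken from the study).
toy = (
    [("VC_1", "Proteobacteria")] * 3 + [("VC_1", "Firmicutes")]
    + [("VC_2", "Firmicutes")] * 2
    + [("VC_3", "Proteobacteria")] * 2
    + [("VC_4", "Firmicutes")] + [("VC_4", "Actinobacteria")] * 3
)
rows, cols, matrix = block_order(*host_fraction_matrix(toy))
for name, values in zip(rows, matrix):
    print(name, values)
```
</preformat>
<p>In this sketch, off-diagonal entries (such as the Firmicutes share of a mostly Proteobacteria cluster) correspond to the promiscuous affiliations mentioned by the reviewers. For real data, a proper hierarchical or modularity-based method would be a more appropriate choice than this ordering heuristic.</p>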
</body>
</sub-article>
</pmc>
</record>

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024