Cyberinfrastructure exploration server

The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

Internal identifier: 000663 (Pmc/Corpus); previous: 000662; next: 000664

Authors: Hernan A. Lorenzi; Jeff Hoover; Jason Inman; Todd Safford; Sean Murphy; Leonid Kagan; Shannon J. Williamson

RBID : PMC:3156399

Abstract

In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a hierarchy of priority rules. Additional annotation is provided in the form of Enzyme Commission (EC) numbers, GO/MeGO terms and Hidden Markov Models, together with supporting evidence.
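
The ranking step described in the abstract can be illustrated with a minimal sketch. The evidence types, their order, and the names PRIORITY, Evidence and assign_function are assumptions made for illustration only; the abstract does not list the actual VMGAP rules or evidence sources.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical evidence types, highest priority first.
# The real VMGAP hierarchy is not given in the abstract.
PRIORITY = [
    "equivalog_hmm",
    "curated_blast_hit",
    "domain_hmm",
    "environmental_blast_hit",
]


@dataclass
class Evidence:
    source: str                      # analysis that produced the hit
    term: str                        # functional term proposed by that analysis
    ec_number: Optional[str] = None  # optional EC number attached to the hit


def assign_function(evidence: List[Evidence]) -> Evidence:
    """Return the hit from the highest-priority evidence type.

    Hits from unknown sources are ignored; a protein with no usable
    evidence is labelled as a hypothetical protein.
    """
    rank = {source: i for i, source in enumerate(PRIORITY)}
    usable = [e for e in evidence if e.source in rank]
    if not usable:
        return Evidence(source="none", term="hypothetical protein")
    # The lowest rank index corresponds to the highest priority.
    return min(usable, key=lambda e: rank[e.source])
```

Called with a domain-level HMM hit and a curated BLAST hit for the same protein, this sketch returns the curated BLAST hit, because that source sits higher in the assumed list.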


DOI: 10.4056/sigs.1694706
PubMed: 21886867
PubMed Central: 3156399

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data</title>
<author>
<name sortKey="Lorenzi, Hernan A" sort="Lorenzi, Hernan A" uniqKey="Lorenzi H" first="Hernan A." last="Lorenzi">Hernan A. Lorenzi</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hoover, Jeff" sort="Hoover, Jeff" uniqKey="Hoover J" first="Jeff" last="Hoover">Jeff Hoover</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Inman, Jason" sort="Inman, Jason" uniqKey="Inman J" first="Jason" last="Inman">Jason Inman</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Safford, Todd" sort="Safford, Todd" uniqKey="Safford T" first="Todd" last="Safford">Todd Safford</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Murphy, Sean" sort="Murphy, Sean" uniqKey="Murphy S" first="Sean" last="Murphy">Sean Murphy</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kagan, Leonid" sort="Kagan, Leonid" uniqKey="Kagan L" first="Leonid" last="Kagan">Leonid Kagan</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Williamson, Shannon J" sort="Williamson, Shannon J" uniqKey="Williamson S" first="Shannon J." last="Williamson">Shannon J. Williamson</name>
<affiliation>
<nlm:aff id="aff2">J. Craig Venter Institute, 10355 Science Center Drive, San Diego, CA 92121, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21886867</idno>
<idno type="pmc">3156399</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3156399</idno>
<idno type="RBID">PMC:3156399</idno>
<idno type="doi">10.4056/sigs.1694706</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000663</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data</title>
<author>
<name sortKey="Lorenzi, Hernan A" sort="Lorenzi, Hernan A" uniqKey="Lorenzi H" first="Hernan A." last="Lorenzi">Hernan A. Lorenzi</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hoover, Jeff" sort="Hoover, Jeff" uniqKey="Hoover J" first="Jeff" last="Hoover">Jeff Hoover</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Inman, Jason" sort="Inman, Jason" uniqKey="Inman J" first="Jason" last="Inman">Jason Inman</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Safford, Todd" sort="Safford, Todd" uniqKey="Safford T" first="Todd" last="Safford">Todd Safford</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Murphy, Sean" sort="Murphy, Sean" uniqKey="Murphy S" first="Sean" last="Murphy">Sean Murphy</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kagan, Leonid" sort="Kagan, Leonid" uniqKey="Kagan L" first="Leonid" last="Kagan">Leonid Kagan</name>
<affiliation>
<nlm:aff id="aff1">J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Williamson, Shannon J" sort="Williamson, Shannon J" uniqKey="Williamson S" first="Shannon J." last="Williamson">Shannon J. Williamson</name>
<affiliation>
<nlm:aff id="aff2">J. Craig Venter Institute, 10355 Science Center Drive, San Diego, CA 92121, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Standards in Genomic Sciences</title>
<idno type="eISSN">1944-3277</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a hierarchy of priority rules. Additional annotation is provided in the form of Enzyme Commission (EC) numbers, GO/MeGO terms and Hidden Markov Models, together with supporting evidence.</p>
</div>
</front>
<back>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Stand Genomic Sci</journal-id>
<journal-id journal-id-type="publisher-id">SIGS</journal-id>
<journal-title-group>
<journal-title>Standards in Genomic Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1944-3277</issn>
<publisher>
<publisher-name>Michigan State University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21886867</article-id>
<article-id pub-id-type="pmc">3156399</article-id>
<article-id pub-id-type="publisher-id">sigs.1694706</article-id>
<article-id pub-id-type="doi">10.4056/sigs.1694706</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Standard Operating Procedures</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Lorenzi</surname>
<given-names>Hernan A.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hoover</surname>
<given-names>Jeff</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Inman</surname>
<given-names>Jason</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Safford</surname>
<given-names>Todd</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Murphy</surname>
<given-names>Sean</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kagan</surname>
<given-names>Leonid</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Williamson</surname>
<given-names>Shannon J.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<aff id="aff1">
<label>1</label>
J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA</aff>
<aff id="aff2">
<label>2</label>
J. Craig Venter Institute, 10355 Science Center Drive, San Diego, CA 92121, USA</aff>
</contrib-group>
<author-notes>
<corresp id="cor1">
<label>*</label>
Corresponding author: Shannon Williamson (
<email xlink:href="swilliamson@jcvi.org">swilliamson@jcvi.org</email>
)</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>6</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="collection">
<day>01</day>
<month>7</month>
<year>2011</year>
</pub-date>
<volume>4</volume>
<issue>3</issue>
<fpage>418</fpage>
<lpage>429</lpage>
<permissions>
<copyright-year>2011</copyright-year>
<copyright-holder>Lorenzi et al.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.5/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advances in sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a hierarchy of priority rules. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models, together with supporting evidence.</p>
</abstract>
<kwd-group kwd-group-type="author">
<title>Keywords: </title>
<kwd>J. Craig Venter Institute</kwd>
<kwd>metagenomic annotation</kwd>
<kwd>viral annotation</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source id="sp1">Office of Science</funding-source>
</award-group>
<award-group>
<funding-source id="sp2">U.S. Department of Energy</funding-source>
<award-id rid="sp2">DE-FC02-02ER63453</award-id>
</award-group>
<award-group>
<funding-source id="sp3">National Science Foundation Microbial Genome Sequencing Program</funding-source>
<award-id rid="sp3">0626826</award-id>
</award-group>
</funding-group>
</article-meta>
<notes>
<def-list list-type="simple" list-content="abbreviations">
<title>Abbreviations: </title>
<def-item>
<term>VMGAP</term>
<def>
<p>Viral MetaGenome Annotation Pipeline</p>
</def>
</def-item>
<def-item>
<term>rDNA</term>
<def>
<p>ribosomal DNA</p>
</def>
</def-item>
<def-item>
<term>ORF</term>
<def>
<p>open reading frame</p>
</def>
</def-item>
<def-item>
<term>EC</term>
<def>
<p>enzyme commission</p>
</def>
</def-item>
<def-item>
<term>GO</term>
<def>
<p>gene ontology</p>
</def>
</def-item>
<def-item>
<term>COG</term>
<def>
<p>clusters of orthologous groups</p>
</def>
</def-item>
<def-item>
<term>DB</term>
<def>
<p>database</p>
</def>
</def-item>
<def-item>
<term>HMM</term>
<def>
<p>Hidden Markov Model</p>
</def>
</def-item>
<def-item>
<term>PSSM</term>
<def>
<p>position specific scoring matrix</p>
</def>
</def-item>
</def-list>
</notes>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>Viruses are the most abundant biological agents and comprise the majority of the biodiversity on Earth [
<xref ref-type="bibr" rid="r1">1</xref>
-
<xref ref-type="bibr" rid="r3">3</xref>
]. However, understanding the population biology and dynamics of viral communities in the environment is difficult because their hosts (predominantly microbes) are unknown and cannot be grown in culture. Furthermore, the study of viral diversity is hampered by the lack of a universally conserved gene across all viral species, analogous to rDNA genes in cellular organisms. Metagenomic shotgun sequence analysis of viral communities helps to alleviate these constraints and is currently the most widely used approach to study the biodiversity of viral populations isolated directly from the environment.</p>
<p>The recent development of faster and cheaper next generation sequencing technologies has contributed to an exponential growth of metagenomic sequencing data, transforming our view of the microbial world. Despite the advancements in sequencing technology, functional annotation of metagenomic sequences is still very challenging. Metagenomic data originate from heterogeneous microbial communities, are usually noisy and partial, and reads frequently contain truncated open reading frames (ORFs). Complicating this landscape, the vast majority of viruses isolated from environmental samples are novel and consequently most of their genes do not have homologous sequences in the public databases, making functional annotation even more difficult.</p>
<p>Currently, there are a number of publicly available bioinformatics tools for the taxonomic (Ribosomal Database Project (RDP) [
<xref ref-type="bibr" rid="r4">4</xref>
], Greengenes [
<xref ref-type="bibr" rid="r5">5</xref>
], MEGAN [
<xref ref-type="bibr" rid="r6">6</xref>
], pplacer [
<xref ref-type="bibr" rid="r7">7</xref>
]) and functional (IMG/M [
<xref ref-type="bibr" rid="r8">8</xref>
], CAMERA [
<xref ref-type="bibr" rid="r9">9</xref>
], MG-RAST [
<xref ref-type="bibr" rid="r10">10</xref>
]) analysis of metagenomes. While IMG/M facilitates the functional analysis of pre-selected metagenomic data, it does not support the input and analysis of external user data. CAMERA allows users to construct customized workflows for the analysis of external metagenomic data, including functional annotation using RAMMCAP based on PFAM, TIGRFAM and COGs. MG-RAST is an alternative web resource that performs metabolic reconstructions using SEED subsystems [
<xref ref-type="bibr" rid="r11">11</xref>
] and builds automated phylogenetic profiles of metagenomic data provided by the scientific community. While MG-RAST has been used for the functional annotation of multiple viral metagenomes [
<xref ref-type="bibr" rid="r12">12</xref>
], it is not ideal for the characterization of viral metagenomic data since functional classification is solely dependent on similarity to FIGfams [
<xref ref-type="bibr" rid="r13">13</xref>
], protein families developed from manually curated bacterial and archaeal proteins. Another limitation of this tool is that it does not search for conserved protein domains or motifs that could provide additional clues about the functional roles of genes present in metagenomic samples.</p>
<p>Here we describe a viral metagenomic annotation pipeline (VMGAP) that is currently utilized at the J. Craig Venter Institute (JCVI) for the functional annotation of viral metagenomic datasets. This pipeline incorporates a number of HMM and PSSM searches and makes use of a suite of specialized databases to improve the functional identification of viral genes. Results can be imported into JCVI Metagenomic Reports (METAREP) [
<xref ref-type="bibr" rid="r14">14</xref>
], an open source tool for high performance comparative metagenomics that allows users to view, query, browse and compare extremely large annotated metagenomic data sets.</p>
</sec>
<sec sec-type="other1">
<title>Requirements</title>
<p>The VMGAP requires a protein multi-FASTA file as input and the local installation of several open-source programs, packages and public databases. The required software and packages are HMMER [
<xref ref-type="bibr" rid="r15">15</xref>
], NCBI-toolkit (BLAST searches [
<xref ref-type="bibr" rid="r16">16</xref>
]), SignalP (signal peptide prediction [
<xref ref-type="bibr" rid="r17">17</xref>
]) [
<xref ref-type="bibr" rid="r18">18</xref>
], TMHMM [
<xref ref-type="bibr" rid="r19">19</xref>
,
<xref ref-type="bibr" rid="r20">20</xref>
] and PRIAM (EC number prediction [
<xref ref-type="bibr" rid="r21">21</xref>
]) [
<xref ref-type="bibr" rid="r22">22</xref>
]. Among the public databases searched by the pipeline are GenBank NR DB, the GenBank environmental databases ENV_NT and ENV_NR, UniProt DB [
<xref ref-type="bibr" rid="r23">23</xref>
], OMNIOME DB [
<xref ref-type="bibr" rid="r24">24</xref>
], PFAM [
<xref ref-type="bibr" rid="r25">25</xref>
] and TIGRFAM [
<xref ref-type="bibr" rid="r26">26</xref>
] HMM DBs, ACLAME protein and HMM DBs [
<xref ref-type="bibr" rid="r27">27</xref>
], GenBank CDD DB [
<xref ref-type="bibr" rid="r28">28</xref>
] and the pfam2go mappings DB [
<xref ref-type="bibr" rid="r11">11</xref>
].</p>
</sec>
<sec sec-type="other2">
<title>Procedure</title>
<p>The JCVI VMGAP consists of two consecutive steps: (1) database searches and (2) functional assignments. The pipeline takes as input a multi-FASTA file containing the translations of all open reading frames (ORFs) predicted in a metagenomic sample. Protein-coding genes are predicted using the structural annotation pipeline [
<xref ref-type="bibr" rid="r29">29</xref>
], which is based on a combination of naïve six-frame translations and MetaGeneAnnotator [
<xref ref-type="bibr" rid="r30">30</xref>
,
<xref ref-type="bibr" rid="r31">31</xref>
], an
<italic>ab initio</italic>
gene-finding program that uses empirical data from completely sequenced genomes, including sequence composition and the distance and orientation of genes, to identify protein-coding genes. Once uploaded, protein sequences are used to query several databases to identify protein features and similarities, as schematically represented in
<xref ref-type="fig" rid="f1">Figure 1</xref>
. During step 1, the VMGAP performs the following sequence similarity searches:</p>
<fig id="f1" fig-type="figure" position="float">
<label>Figure 1</label>
<caption>
<p>Naming rules used for functional annotation of the VMGAP.</p>
</caption>
<graphic xlink:href="sigs.1694706-f1"></graphic>
</fig>
<sec>
<title>1) Blastp searches against a non-redundant protein database</title>
<p>The non-redundant protein database encompasses several public protein databases (GenBank NR, UniProt, PIR and OMNIOME), in which each set of redundant peptides is condensed into a single database entry without losing useful information recorded in the FASTA headers, such as EC numbers, product names and taxon identification numbers. The VMGAP reports the top 50 hits with e-values ≤ 1x10
<sup>-5</sup>
.</p>
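<p>The reporting rule above can be illustrated with a minimal Python sketch. It assumes that the blastp results are available in the common 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score); the function name and the sample rows are hypothetical and are not taken from the VMGAP code.</p>
<preformat><![CDATA[
```python
from collections import defaultdict

def top_hits(rows, max_evalue=1e-5, max_hits=50):
    """Keep, per query, at most max_hits subjects with e-value <= max_evalue."""
    by_query = defaultdict(list)
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            # Negated bit score so that sorting ranks higher scores first on ties
            by_query[query].append((evalue, -bitscore, subject))
    # Rank by ascending e-value, then descending bit score, and truncate
    return {q: [s for _, _, s in sorted(hits)[:max_hits]]
            for q, hits in by_query.items()}

# Hypothetical tabular rows for a single predicted peptide
rows = [
    "orf1\tsubjA\t62.0\t150\t57\t0\t1\t150\t3\t152\t1e-40\t160",
    "orf1\tsubjB\t31.0\t90\t62\t0\t5\t94\t10\t99\t2e-3\t38",
    "orf1\tsubjC\t45.0\t140\t77\t0\t1\t140\t1\t140\t5e-12\t70",
]
print(top_hits(rows))  # {'orf1': ['subjA', 'subjC']}
```
]]></preformat>
<p>The same function could serve the ACLAME searches by lifting the cap on the number of hits, since all hits below the e-value threshold are reported there.</p>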
</sec>
<sec>
<title>2) Blastp searches against the ACLAME database</title>
<p>ACLAME is a public protein database of mobile genetic elements (MGEs), including bacteriophages, transposons and plasmids [
<xref ref-type="bibr" rid="r27">27</xref>
]. Proteins are organized into families based on their function and sequence similarity, and families of 4 or more members are manually annotated with functional assignments using GO and MeGO terms (an ontology dedicated to MGEs developed by ACLAME). All blastp hits with e-values ≤1x10
<sup>-5</sup>
are reported.</p>
</sec>
<sec>
<title>3) Blastp and tblastn searches against environmental protein databases</title>
<p>The VMGAP queries three different environmental composite databases at the amino acid level: (i) ENV_NR, a GenBank non-redundant protein database that includes many environmental datasets, (ii) an in-house database (SANGER_PEP) composed of proteins encoded by Sanger-based viral metagenomic samples not represented in ENV_NR (
<xref ref-type="table" rid="t1">
<bold>Table 1</bold>
</xref>
), and (iii) ENV_NT, a collection of nucleotide sequences from metagenomic datasets deposited in GenBank. The purpose of these analyses is to determine how similar the viruses within the query metagenomic samples are to the viruses and microbes that inhabit the different environments represented in the subject databases. The VMGAP reports all BLAST hits with e-values ≤ 1x10
<sup>-3</sup>
.</p>
<table-wrap id="t1" position="float">
<label>Table 1</label>
<caption>
<title>Metagenomic libraries incorporated into the Sanger environmental protein database</title>
</caption>
<table frame="hsides" rules="groups">
<col width="78.99%" span="1"></col>
<col width="21.01%" span="1"></col>
<thead>
<tr>
<th valign="top" align="left" scope="col" rowspan="1" colspan="1">
<bold>Library Name</bold>
</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">
<bold>Reference</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Viral metagenomes from Yellowstone hot springs (Bear Paw)</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r32">32</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">RNA viral community in human feces</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r33">33</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Viral metagenomes from Yellowstone hot springs (Octopus)</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r32">32</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Virus from Human Blood</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r34">34</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Virus from Human Feces</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r35">35</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Virus from Marine Sediments</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r36">36</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Uncultured marine viral communities (Mission Bay)</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r37">37</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Uncultured marine viral communities (Scripps Pier)</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r37">37</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Coastal RNA virus communities</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r38">38</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Chesapeake Bay virioplankton</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r39">39</xref>
]</td>
</tr>
<tr>
<td valign="top" align="left" scope="row" rowspan="1" colspan="1">Virus from equine feces</td>
<td valign="bottom" align="center" rowspan="1" colspan="1"> [
<xref ref-type="bibr" rid="r40">40</xref>
]</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4) HMM searches against PFAM/TIGRFAM and ACLAME HMM</title>
<p>In addition to similarity searches against protein databases, the VMGAP searches for matches to HMMs from two databases: PFAM/TIGRFAM (a database of HMMs representing conserved protein domains) and ACLAME-HMMs (a compilation of HMMs that describe each of the protein families found in ACLAME). PFAM/TIGRFAM HMM searches are carried out in two different ways, requiring either a global or a local alignment to the HMMs. Local HMM alignments increase sensitivity in the detection of conserved protein domains, particularly when the predicted peptide is truncated and extends to the end of the read, which occurs frequently in metagenomic datasets. All HMM hits with e-values ≤ 1x10
<sup>-5</sup>
are recorded for further analysis.</p>
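<p>The difference between a global and a local match can be expressed as the fraction of the model covered by the alignment. The following Python sketch computes that percentage from the start and end positions of a hit on the HMM; it assumes 1-based, inclusive coordinates, and the function name is hypothetical rather than part of the pipeline.</p>
<preformat><![CDATA[
```python
def hmm_coverage(hmm_begin, hmm_end, hmm_length):
    """Percent of the HMM covered by a hit (1-based, inclusive coordinates assumed)."""
    if not 1 <= hmm_begin <= hmm_end <= hmm_length:
        raise ValueError("inconsistent HMM coordinates")
    return 100.0 * (hmm_end - hmm_begin + 1) / hmm_length

# A peptide truncated at the end of a read matches only the start of the model,
# so it can be found by a local search but not by a global one.
print(hmm_coverage(1, 120, 300))  # 40.0
# A full-length match spans the whole model.
print(hmm_coverage(1, 300, 300))  # 100.0
```
]]></preformat>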
</sec>
<sec>
<title>5) RPS-Blast against NCBI CDD database</title>
<p>The NCBI Conserved Domain Database (CDD) is a collection of position-specific scoring matrices (PSSMs) representing conserved protein domains, protein families and superfamilies compiled from NCBI-curated domains [
<xref ref-type="bibr" rid="r41">41</xref>
], PFAM/TIGRFAM, SMART [
<xref ref-type="bibr" rid="r42">42</xref>
] and COG [
<xref ref-type="bibr" rid="r43">43</xref>
]. In spite of the overlap, PSSMs derived from PFAM/TIGRFAM do not behave exactly the same as their HMM counterparts, and in some cases these searches can identify domains where HMMs fail. The VMGAP stores all hits with e-values ≤ 1x10
<sup>-5</sup>
.</p>
</sec>
<sec>
<title>6) Identification of transmembrane domains and signal peptides</title>
<p>To discover transmembrane proteins and signal peptides that could be associated with the surface of viral particles, the VMGAP utilizes two programs, SignalP for the identification of signal peptides, and TMHMM, a program that detects candidate transmembrane domains.</p>
</sec>
<sec>
<title>7) Assignment of EC numbers</title>
<p>To aid in the metabolic reconstruction of metagenomes, the VMGAP makes use of PRIAM, a collection of PSSMs in which each matrix represents an enzymatic function and is assigned to a particular EC number. Metagenomic samples are scanned for matches to these PSSMs with RPS-Blast, and only hits with e-values ≤ 1x10
<sup>-10</sup>
.</p>
</sec>
<sec>
<title>8) Rules hierarchy</title>
<p>Functional assignments for predicted peptides are made by applying a series of predefined rules to the results of the analyses performed in the previous steps (
<xref ref-type="fig" rid="f1">Figure 1</xref>
). The rules prioritize one piece of evidence over another according to how informative, trustworthy and accurate that evidence is. As shown in
<xref ref-type="fig" rid="f1">Figure 1</xref>
, hits against equivalog TIGRFAM HMMs [
<xref ref-type="bibr" rid="r26">26</xref>
] are the highest-ranked supporting evidence for functional assignments in the VMGAP. Therefore, any protein that matches an equivalog TIGRFAM HMM above the trusted cutoff over the entire length of the model (100% coverage with respect to the length of the HMM) automatically inherits the functional annotation associated with that HMM. The second and third tiers of evidence consist of highly significant BLASTP hits against ACLAME DB and the non-redundant protein database, respectively, each with at least 80% coverage (with respect to the shorter sequence), at least 50% identity and an e-value ≤ 1x10
<sup>-10</sup>
. Although proteins from ACLAME DB are also included in the non-redundant protein database, entries in the former take priority because they are curated and therefore provide better functional annotation. Hits against HMMs describing ACLAME protein families and against PFAM/non-equivalog TIGRFAM HMMs comprise the 4
<sup>th</sup>
and 5
<sup>th</sup>
layers of functional evidence, with HMMs that represent protein families given priority over those that describe conserved protein domains. Ranked 6
<sup>th</sup>
and 7
<sup>th</sup>
in the rule list are, respectively, RPS-BLAST hits with at least 90% coverage, percent identity ≥ 35% and e-value ≤ 1x10
<sup>-10</sup>
against NCBI-CDD profiles and local-local hits against PFAM/TIGRFAM HMMs with e-values ≤ 1x10
<sup>-5</sup>
. Finally, low-confidence BLASTP hits with at least 70% coverage, percent identity ≥ 30% and e-value ≤ 1x10
<sup>-5</sup>
against ACLAME DB and the non-redundant protein database occupy tiers 8 and 9 in the priority list, respectively. Proteins that lack the evidence types described above but still have some other evidence, such as hits against the environmental DBs, are named “hypothetical protein”. All remaining proteins are labeled “unknown protein”.</p>
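<p>The rule hierarchy described above can be sketched as an ordered list of predicates that is scanned from the highest to the lowest tier. The Python code below is only an illustration under stated assumptions: each hit is a dictionary whose "flag" follows the evidence-file flags in Table 2, the keys "cov", "ident", "evalue", "name", "equivalog" and "above_trusted" are hypothetical, coverage and identity are assumed to be precomputed percentages, and choosing the hit with the lowest e-value within a tier is our simplification, not a documented VMGAP behavior.</p>
<preformat><![CDATA[
```python
def blast(flag, min_cov, min_ident, max_evalue):
    """Build a predicate for BLAST-like hits; the flag is checked first."""
    return lambda h: (h["flag"] == flag and h["cov"] >= min_cov
                      and h["ident"] >= min_ident and h["evalue"] <= max_evalue)

# Tiers 1 to 9, in decreasing priority, using the thresholds given in the text
TIERS = [
    lambda h: (h["flag"] == "PFAM/TIGRFAM_HMM" and h.get("equivalog")
               and h.get("above_trusted") and h["cov"] >= 100),
    blast("ACLAME_PEP", 80, 50, 1e-10),
    blast("ALLGROUP_PEP", 80, 50, 1e-10),
    lambda h: h["flag"] == "ACLAME_HMM",
    lambda h: h["flag"] == "PFAM/TIGRFAM_HMM" and not h.get("equivalog"),
    blast("CDD_RPS", 90, 35, 1e-10),
    lambda h: h["flag"] == "FRAG_HMM" and h["evalue"] <= 1e-5,
    blast("ACLAME_PEP", 70, 30, 1e-5),
    blast("ALLGROUP_PEP", 70, 30, 1e-5),
]

def assign_name(hits):
    """Return (tier, name) from the highest-priority tier that has a match."""
    for tier, rule in enumerate(TIERS, start=1):
        matches = [h for h in hits if rule(h)]
        if matches:
            best = min(matches, key=lambda h: h["evalue"])  # assumed tie-break
            return tier, best["name"]
    # No rule fired: any leftover evidence yields "hypothetical protein"
    return (None, "hypothetical protein") if hits else (None, "unknown protein")

hits = [
    {"flag": "ALLGROUP_PEP", "cov": 95, "ident": 62, "evalue": 1e-40,
     "name": "terminase large subunit"},
    {"flag": "FRAG_HMM", "cov": 40, "evalue": 1e-8,
     "name": "terminase domain protein"},
]
print(assign_name(hits))  # (3, 'terminase large subunit')
```
]]></preformat>
<p>Keeping the tiers in a single ordered list makes the priority explicit: moving a rule up or down the list changes its rank without touching the matching logic.</p>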
</sec>
</sec>
<sec sec-type="other3">
<title>Implementation</title>
<p>The VMGAP consists of three major modules implemented in Perl (
<xref ref-type="fig" rid="f2">Figure 2</xref>
): (i) the control module, which initializes the pipeline, creates an SQLite DB [
<xref ref-type="bibr" rid="r44">44</xref>
] to store the status of computations and their results, coordinates the other modules, and allows interrupted pipelines to be resumed from the point of interruption, (ii) the compute module, which tracks the status of the individual computations and loads completed computations into the SQLite database, and (iii) the annotation module, which reads the computational results from the SQLite DB and applies a set of predefined rules to generate a tab-delimited annotation file containing the final annotation for each peptide (e.g. EC/GO assignments and protein names), and a tab-delimited evidence file that stores all the evidence that supports the annotation. Each line in the annotation file contains the functional annotation for an individual peptide, while each line in the evidence file represents one piece of evidence for a single protein (
<xref ref-type="table" rid="t2">Table 2</xref>
and
<xref ref-type="table" rid="t3">Table 3</xref>
). Additionally, the VMGAP contains an optional module, also implemented in Perl, called Com2GO (Common-Name-to-GO Mappings). Com2GO can be run after the annotation module to attempt to classify the protein names using the GO hierarchy.</p>
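<p>The idea of recording the status of each computation so that an interrupted run can be resumed can be illustrated with a short sketch. The VMGAP itself is written in Perl; the Python code below uses the standard sqlite3 module, and the table layout, column names and function names are hypothetical and chosen only for illustration.</p>
<preformat><![CDATA[
```python
import sqlite3

def init_db(conn, peptide_ids, analyses):
    """Create the status table and register one job per peptide and analysis."""
    conn.execute("""CREATE TABLE IF NOT EXISTS job (
                        peptide  TEXT,
                        analysis TEXT,
                        status   TEXT DEFAULT 'pending',
                        PRIMARY KEY (peptide, analysis))""")
    # INSERT OR IGNORE leaves jobs recorded by a previous run (and their status) untouched
    conn.executemany("INSERT OR IGNORE INTO job (peptide, analysis) VALUES (?, ?)",
                     [(p, a) for p in peptide_ids for a in analyses])
    conn.commit()

def mark_done(conn, peptide, analysis):
    """Record that a computation finished and its results were loaded."""
    conn.execute("UPDATE job SET status = 'done' WHERE peptide = ? AND analysis = ?",
                 (peptide, analysis))
    conn.commit()

def pending_jobs(conn):
    """List the computations that still have to be run."""
    return conn.execute("SELECT peptide, analysis FROM job WHERE status != 'done' "
                        "ORDER BY peptide, analysis").fetchall()

conn = sqlite3.connect(":memory:")  # a file path would be used for a real run
init_db(conn, ["orf1", "orf2"], ["ACLAME_PEP", "PRIAM"])
mark_done(conn, "orf1", "PRIAM")
# Simulated restart: re-initializing does not reset the completed job
init_db(conn, ["orf1", "orf2"], ["ACLAME_PEP", "PRIAM"])
print(pending_jobs(conn))
# [('orf1', 'ACLAME_PEP'), ('orf2', 'ACLAME_PEP'), ('orf2', 'PRIAM')]
```
]]></preformat>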
<fig id="f2" fig-type="figure" position="float">
<label>Figure 2</label>
<caption>
<p>Schematic representation of the implementation of the VMGAP. The three main modules of the pipeline are depicted by yellow squares. Orange and red circles represent input and output files respectively. VICS stands for Venter Institute Compute Services; SGE stands for Sun Grid Engine job scheduler. Single and double-headed arrows indicate information flowing in one or both directions respectively.</p>
</caption>
<graphic xlink:href="sigs.1694706-f2"></graphic>
</fig>
<table-wrap id="t2" position="float">
<label>Table 2</label>
<caption>
<title>Description of the contents of the evidence file generated by the VMGAP</title>
</caption>
<table frame="hsides" rules="groups">
<col width="3.31%" span="1"></col>
<col width="13.88%" span="1"></col>
<col width="10.19%" span="1"></col>
<col width="12.74%" span="1"></col>
<col width="12.74%" span="1"></col>
<col width="12.74%" span="1"></col>
<col width="11.47%" span="1"></col>
<col width="12.74%" span="1"></col>
<col width="10.19%" span="1"></col>
<thead>
<tr>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">1</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">2</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">  3</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">4</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">5</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">6</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">7</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">8</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">   9</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  CDD_RPS</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ALLGROUP_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Query length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ACLAME_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Query length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  SANGER_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Query length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ENV_NT</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Query length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ENV_NR</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Subject ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject definition</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Query length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Subject length</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % ident</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  FRAG_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  HMM begin</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM end</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Total
<break></break>
   e-value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM accession</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM description</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM length</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  PFAM/TIGRFAM_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  HMM begin</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM end</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Total
<break></break>
   e-value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM accession</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM description</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM length</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  PRIAM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  EC Number</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   e-value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ACLAME_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  HMM begin</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM end</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   % cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Total
<break></break>
   e-value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM accession</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM description</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM length</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  PEPSTATS</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Molecular weight</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Isoelectric point</td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  TMHMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Number of predicted helices</td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  SIGNALP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Signal peptide</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   cleavage site position</td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
<td valign="bottom" align="left" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Fields 1 and 2 correspond to the protein identifier and a flag specific for each analysis, respectively. % cov, percent coverage; % ident, percent identity; CDD_RPS, RPS-Blast vs. CDD DB; ALLGROUP_PEP, Blastp vs. protein NR DB; ACLAME_PEP, Blastp vs. ACLAME protein DB; SANGER_PEP, Blastp vs. in-house viral metagenomic DB; ENV_NT, Tblastn vs. ENV_NT DB; ENV_NR, Blastp vs. ENV_NR DB; FRAG_HMM, HMM searches vs. local PFAM/TIGRFAM HMM DB; PFAM/TIGRFAM_HMM, HMM searches vs. global PFAM/TIGRFAM HMM DB; PRIAM, RPS-Blast vs. PRIAM profile DB; ACLAME_HMM, HMM searches vs. global ACLAME HMM DB; PEPSTATS, peptide statistics; TMHMM, transmembrane domain searches; SIGNALP, signal peptide searches.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="t3" position="float">
<label>Table 3</label>
<caption>
<title>Explanation of the annotation file generated by the VMGAP</title>
</caption>
<table frame="hsides" rules="groups">
<col width="9.61%" span="1"></col>
<col width="39.35%" span="1"></col>
<col width="51.04%" span="1"></col>
<thead>
<tr>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">Column</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">   Description</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">   Example</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">1</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Unique peptide ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   JCVI_PEP_metagenomic.orf.112038372243 2.1</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">2</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Protein common name tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   common_name</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">3</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Functional description (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   phosphonate C-P lyase system protein PhnL, putative</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">4</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of functional description assignment</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   AllGroup High:rf|YP_001889651.1</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">5</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   GO tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   GO</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">6</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Gene Ontology ID (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   go:0016887||go:0005524</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">7</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of Gene Ontology assignment</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PF00005||PF00005</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">8</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   EC tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   EC</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">9</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Enzyme Commission number ID (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   3.6.3.28</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">10</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of Enzyme Commission ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PRIAM</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">11</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Hits against ENV_NT DB tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   ENV_NT</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">12</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   ENV_NT DB libraries hit with e-values ≤1 × 10
<sup>-3</sup>
</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Hydrothermal vent metagenome FOSS10958.y2,
<break></break>
   whole genome shotgun sequence ||
<break></break>
    Lake Washington Formate SIP Enrichment Freshwater Metagenome ||
<break></break>
    Human Gut Metagenome (healthy human sample In-M, Infant, Female)</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">13</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Best hit e-value per environmental library</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   6.65676e-54 || 2.14066e-44 || 1.34265e-46</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">14</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Number of hits with e-value ≤1 × 10
<sup>-3</sup>
per environmental DB library</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   1 || 1 || 4</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">15</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   HMM DB tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PFAM/TIGRFAM_HMM</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">16</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PFAM/TIGRFAM HMM hit above trusted cutoff</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PF00005</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">17</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Signalp tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   SIGNALP</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">18</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Presence (Y) or absence (N) of predicted signal peptide</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Y</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">19</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Cleavage site position</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   16</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">20</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Transmembrane domain tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   TMHMM</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">21</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Number of predicted transmembrane domains</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   2</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">22</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Protein statistics tag</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   PEPSTATS</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">23</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Molecular weight</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   17369.86</td>
</tr>
<tr>
<td valign="bottom" align="center" scope="row" rowspan="1" colspan="1">24</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Isoelectric point</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   9.9423</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Each line contains the annotation for a single predicted peptide. Multiple values within a field are separated by the symbol “||”.</p>
</table-wrap-foot>
</table-wrap>
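<p>The layout in Table 3 can be read programmatically. The following is a minimal sketch, assuming the annotation file is tab-delimited with exactly the 24 columns listed above; the field names in COLUMNS and the peptide and library identifiers in the example are hypothetical labels chosen for illustration and are not part of the VMGAP.</p>

```python
# Hypothetical field names; only the column order is taken from Table 3.
COLUMNS = [
    "peptide_id", "common_name_tag", "common_name", "common_name_src",
    "go_tag", "go_ids", "go_src",
    "ec_tag", "ec_ids", "ec_src",
    "env_nt_tag", "env_nt_libraries", "env_nt_best_evalues", "env_nt_hit_counts",
    "hmm_tag", "hmm_hits",
    "signalp_tag", "signal_peptide", "cleavage_site",
    "tmhmm_tag", "tm_domains",
    "pepstats_tag", "molecular_weight", "isoelectric_point",
]


def parse_annotation_line(line):
    """Split one tab-delimited annotation line into a dict of value lists."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    # Multiple values within a field are separated by "||".
    return {
        name: [value.strip() for value in field.split("||")]
        for name, field in zip(COLUMNS, fields)
    }


# Example line built from the values shown in Table 3
# (the peptide ID and library names are placeholders).
example = "\t".join([
    "pep_0001", "common_name",
    "phosphonate C-P lyase system protein PhnL, putative",
    "AllGroup High:rf|YP_001889651.1",
    "GO", "go:0016887||go:0005524", "PF00005||PF00005",
    "EC", "3.6.3.28", "PRIAM",
    "ENV_NT", "library_A || library_B || library_C",
    "6.65676e-54 || 2.14066e-44 || 1.34265e-46", "1 || 1 || 4",
    "PFAM/TIGRFAM_HMM", "PF00005",
    "SIGNALP", "Y", "16",
    "TMHMM", "2",
    "PEPSTATS", "17369.86", "9.9423",
])
record = parse_annotation_line(example)
print(record["go_ids"])
print(record["env_nt_hit_counts"])
```

<p>Every field is returned as a list so that single-valued and multi-valued columns can be handled uniformly; numeric conversion is left to the caller.</p>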
<p>The heart of the VMGAP is the compute module (
<xref ref-type="fig" rid="f2">Figure 2</xref>
). This module accepts a compute configuration file (see
<xref ref-type="table" rid="t4">Table 4</xref>
for the current configuration) and a sqlite results database. It compares the computations specified in the configuration with the results loaded into the sqlite results database. Missing computations are initiated, stale computations (outdated reference dataset or obsolete program options) are refreshed, and interrupted computations are resumed. The computations themselves are executed either on a local machine (for jobs that are not computationally intensive, such as SignalP) or through the JCVI high-throughput computing platform, VICS web services. VICS is a J2EE server backed by a 1600-node SGE grid and a 2-terabyte scratch disk. All of the computations are started (or restarted), and the compute module then waits for them to complete. As each computation completes, its results are parsed and loaded into the sqlite database and the status of the computation is updated. When all computations have completed, the module exits and allows the controller to proceed. The module may be interrupted manually and restarted at a later time.</p>
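<p>The reconciliation step described above can be sketched as follows. This is an illustration only, not the VMGAP source code: the table name computation, its columns, the status values, and the option and database-version strings are all assumptions made for the example.</p>

```python
import sqlite3


def plan_actions(conn, config):
    """Compare the configured jobs with the stored results and pick an action per job."""
    rows = conn.execute(
        "SELECT job, options, db_version, status FROM computation"
    ).fetchall()
    stored = {job: (options, db_version, status)
              for job, options, db_version, status in rows}
    actions = {}
    for job, (options, db_version) in config.items():
        if job not in stored:
            actions[job] = "start"      # missing computation
            continue
        old_options, old_version, status = stored[job]
        if (old_options, old_version) != (options, db_version):
            actions[job] = "refresh"    # stale options or reference dataset
        elif status != "complete":
            actions[job] = "resume"     # interrupted computation
        else:
            actions[job] = "skip"       # up to date
    return actions


# Hypothetical results database held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE computation ("
    "job TEXT PRIMARY KEY, options TEXT, db_version TEXT, status TEXT)"
)
conn.executemany("INSERT INTO computation VALUES (?, ?, ?, ?)", [
    ("TMHMM", "", "n/a", "complete"),
    ("ACLAME_PEP", "evalue=1e-5", "release_1", "complete"),
    ("ENV_NR", "evalue=1e-3", "release_4", "interrupted"),
])

# Hypothetical configuration: job name mapped to (options, reference dataset version).
config = {
    "TMHMM": ("", "n/a"),
    "ACLAME_PEP": ("evalue=1e-5", "release_2"),
    "ENV_NR": ("evalue=1e-3", "release_4"),
    "SIGNALP": ("truncate=70", "n/a"),
}
actions = plan_actions(conn, config)
print(actions)
```

<p>In this sketch a job whose stored options or reference dataset differ from the configuration is treated as stale even if it previously completed, which mirrors the refresh behavior described in the text.</p>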
<table-wrap id="t4" position="float">
<label>Table 4</label>
<caption>
<title>List of programs and parameters in the VMGAP</title>
</caption>
<table frame="hsides" rules="groups">
<col width="34.9%" span="1"></col>
<col width="27.9%" span="1"></col>
<col width="37.2%" span="1"></col>
<thead>
<tr>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">Pipeline job name</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">   Program</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">    Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">ACLAME_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   hmmpfam</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -E 0.001</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">ACLAME_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Blastp</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-5</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">ALLGROUP_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Blastp</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-5</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">CDD_RPS</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Rpsblast</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-3</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">ENV_NR</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   blastp</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-3</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">ENV_NT</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   tblastn</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-3</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">FRAG_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   hmmpfam</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -E 0.001</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">PEPSTATS</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Pepstats</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    None</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">PFAM/TIGRFAM_HMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Hmmpfam</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -E 0.001</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">PRIAM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Priam</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -e 1e-10</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">SANGER_PEP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Blastp</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -b 50 -e 1e-5</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">SIGNALP</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Signalp</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    -t gram- -trunc 70</td>
</tr>
<tr>
<td valign="bottom" align="left" scope="row" rowspan="1" colspan="1">TMHMM</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   tmhmm</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">    None</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec sec-type="other4">
<title>Data Visualization and Analysis</title>
<p>Small files can be easily imported and analyzed in Excel. For extremely large files (more than a million entries), we recommend that users import the data into METAREP [
<xref ref-type="bibr" rid="r14">14</xref>
] for further analysis and visualization. The METAREP tab-delimited import format specifies many common annotation data types, including those computed by the VMGAP. To import VMGAP annotations, we recommend the mapping outlined in
<xref ref-type="table" rid="t5">Table 5</xref>
. To import the data, users have to install a local version of METAREP. The source code can be found at the METAREP website [
<xref ref-type="bibr" rid="r45">45</xref>
]. The code also contains a Perl-based utility for importing METAREP tab-delimited files. Details about the installation and import process can be found in the METAREP manual, which can be downloaded from the METAREP dashboard [
<xref ref-type="bibr" rid="r46">46</xref>
].</p>
<table-wrap id="t5" position="float">
<label>Table 5</label>
<caption>
<title>VMGAP to METAREP mapping</title>
</caption>
<table frame="hsides" rules="groups">
<col width="10%" span="1"></col>
<col width="20.64%" span="1"></col>
<col width="41.44%" span="1"></col>
<col width="27.92%" span="1"></col>
<thead>
<tr>
<th valign="bottom" colspan="3" align="center" scope="colgroup" rowspan="1">
<bold>METAREP input file</bold>
<hr></hr>
</th>
<th valign="bottom" align="center" scope="col" rowspan="1" colspan="1">
<bold>VMGAP</bold>
<hr></hr>
</th>
</tr>
<tr>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">Column</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">  Field ID</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">   Description</th>
<th valign="bottom" align="left" scope="col" rowspan="1" colspan="1">  Description</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">1</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  peptide_ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Unique peptide ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Unique peptide ID</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">2</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Library_ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Library ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Library ID</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">3</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  com_name</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Functional description (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Common name(s)</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">4</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  com_name_src</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of functional description assignment</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  
<sup>a</sup>
NRdb BLAST +
<sup>b</sup>
FRAGHMM + PFAM +
<break></break>
  TIGRFAM + PRIAM + CDD +ACLAME</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">5</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  go_id</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Gene Ontology ID (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Gene Ontology ID (s)</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">6</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  go_src</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of Gene Ontology assignment</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  PFAM + TIGRFAM + ACLAME +Com2GO</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">7</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ec_id</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Enzyme Commission ID (s)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Enzyme Commission ID (s)</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">8</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ec_src</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Source of Enzyme Commission ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  TIGRFAM + PFAM + PRIAM</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">9</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  hmm_id</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Hidden Markov Model hits</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  ACLAME, PFAM, TIGRFAM</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">10</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  blast_taxon</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   NCBI taxonomy ID</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Best Blast Hit NRdb</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">11</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  blast_evalue</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   BLAST E-Value</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Best Blast Hit NRdb</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">12</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  blast_pid</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   BLAST percent Identity</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Best Blast Hit NRdb</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">13</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  blast_cov</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   BLAST sequence coverage of shortest sequence</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  Best Blast Hit NRdb</td>
</tr>
<tr>
<td valign="top" align="center" scope="row" rowspan="1" colspan="1">14</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  filter</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">   Any filter tag (categorical variable)</td>
<td valign="bottom" align="left" rowspan="1" colspan="1">  N/A</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<sup>a</sup>
JCVI non-redundant protein database; b, PFAM/TIGRFAM local-local alignment HMM database; c, data not available.</p>
</table-wrap-foot>
</table-wrap>
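<p>The mapping in Table 5 can be turned into a small conversion routine. The following is a hedged sketch: the 14 field IDs and their order are taken from Table 5, whereas the function name, the use of “||” to join multiple values, and the use of an empty cell for missing values are assumptions that should be checked against the METAREP manual before use.</p>

```python
# Field IDs in the order given in Table 5.
METAREP_FIELDS = [
    "peptide_ID", "Library_ID", "com_name", "com_name_src",
    "go_id", "go_src", "ec_id", "ec_src", "hmm_id",
    "blast_taxon", "blast_evalue", "blast_pid", "blast_cov", "filter",
]


def to_metarep_row(record, library_id):
    """Build one tab-delimited row from a dict keyed by the Table 5 field IDs."""
    values = dict(record, Library_ID=library_id)
    cells = []
    for field in METAREP_FIELDS:
        value = values.get(field, "")  # assumption: missing values become empty cells
        if isinstance(value, (list, tuple)):
            # Assumption: multiple values are joined with "||".
            value = "||".join(str(item) for item in value)
        cells.append(str(value))
    return "\t".join(cells)


# Hypothetical record using the example values from Table 3.
record = {
    "peptide_ID": "pep_0001",
    "com_name": "phosphonate C-P lyase system protein PhnL, putative",
    "com_name_src": "AllGroup High",
    "go_id": ["go:0016887", "go:0005524"],
    "go_src": ["PF00005", "PF00005"],
    "ec_id": "3.6.3.28",
    "ec_src": "PRIAM",
    "hmm_id": "PF00005",
}
row = to_metarep_row(record, library_id="viral_library_1")
cells = row.split("\t")
print(len(cells))
print(cells[4])
```

<p>Keeping the field order in a single list makes it straightforward to adjust the sketch if the import format changes.</p>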
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>In recent years and with the advancement of next generation sequencing platforms, metagenomic studies have become more affordable to the scientific community. This has triggered an exponential growth in the amount of metagenomic sequencing data available within public repositories and stresses the necessity for specialized highly efficient computational tools to cope with the functional annotation of these massive datasets. There are currently a variety of metagenomic annotation tools that are available to the general public through the web. Among the most popular resources is MG-RAST, an annotation tool that offers many advantages to the user: (i) it does not require a high-throughput computer facility, (ii) it uses reads instead of proteins as input and therefore there is no need for gene predictions, and (iii) the results are classified into functional categories facilitating the analysis of data. Perhaps most importantly, the functional distributions can be compared against other datasets that were annotated with MG-RAST.</p>
<p>While MG-RAST is capable of providing meaningful taxonomic and functional annotation of microbial metagenomes, it is limited in its capacity to annotate viral metagenomes because of its inherent dependence on FIGfams. To quantitatively assess the utility of the VMGAP for the functional annotation of viral metagenomic data, we ran an identical set of ~300,000 peptide sequences from a marine viral metagenomic library, or their respective coding ORFs, through the VMGAP and MG-RAST, respectively. Analysis of the results showed that the VMGAP could assign functions to almost 16% more sequences than MG-RAST (names other than hypothetical or unknown,
<xref ref-type="fig" rid="f3">Figure 3</xref>
). More specifically, when looking for names describing viral-like enzymatic functions (e.g. integrase, endonuclease, DNA polymerase) or viral-like structural functions (e.g. capsid, tail, neck, envelope), the VMGAP assigned almost 16,000 more viral-like names than MG-RAST. Of the sequences that received no functional names, ~72% contained some other evidence, such as hits against environmental databases, PFAM domains or signal peptides, while only 29% of such sequences are reported in MG-RAST (
<xref ref-type="fig" rid="f3">Figure 3</xref>
). A more in-depth analysis showed that the increase in assigned VMGAP-associated functional terms was due to the incorporation of databases that contain viral-specific annotation, such as ACLAME. Since VMGAP also performs additional analyses such as HMM, CDD and environmental DB searches as well as MeGO/GO and EC number assignments, it provides a more comprehensive repertoire of evidence types that may facilitate the discovery of novel viral functions as well as comparative analyses of metagenomic datasets.</p>
<fig id="f3" fig-type="figure" position="float">
<label>Figure 3</label>
<caption>
<p>Comparative analysis of the functional annotation performance of the VMGAP and MG-RAST for viral libraries. Total number of functional assignments represents the number of peptides from the viral library that receive a name other than “hypothetical protein” or “unknown” (VMGAP) or that have a significant hit against a FIGfam (MG-RAST). Unknown proteins are those that do not receive any evidence as described in
<xref ref-type="fig" rid="f1">Figure 1</xref>
(VMGAP) or that do not hit any FIGfams (MG-RAST). The following are examples of virus-like keywords used in this analysis: integrase, terminase, polymerase, recombinase, (endo|exo) nuclease, phage, viral, capsid, envelope, filament, and basal plate.</p>
</caption>
<graphic xlink:href="sigs.1694706-f3"></graphic>
</fig>
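<p>The keyword-based tally summarized in Figure 3 can be approximated with a regular expression. The sketch below uses only the example keywords listed in the figure legend; the category labels, the function name, and the sample names are hypothetical, and the exact keyword list and matching rules used in the original analysis are not specified here.</p>

```python
import re
from collections import Counter

# Example virus-like keywords from the Figure 3 legend.
VIRAL_PATTERN = re.compile(
    r"integrase|terminase|polymerase|recombinase|(?:endo|exo)nuclease"
    r"|phage|viral|capsid|envelope|filament|basal plate",
    re.IGNORECASE,
)
# Names that do not count as a functional assignment.
UNNAMED_PATTERN = re.compile(r"hypothetical|unknown", re.IGNORECASE)


def classify_name(name):
    """Assign a functional description to one of three illustrative categories."""
    if not name.strip() or UNNAMED_PATTERN.search(name):
        return "unnamed"
    if VIRAL_PATTERN.search(name):
        return "viral-like"
    return "other function"


# Hypothetical functional descriptions.
names = [
    "hypothetical protein",
    "phage terminase large subunit",
    "DNA polymerase I",
    "phosphonate C-P lyase system protein PhnL, putative",
    "conserved hypothetical protein",
    "putative endonuclease",
]
counts = Counter(classify_name(name) for name in names)
print(counts)
```

<p>Because the pattern matches substrings, a name such as “macrophage” would also be counted as viral-like; a production analysis would likely need word boundaries or a curated keyword list.</p>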
<p>Regarding the VMGAP implementation, the generation and storage of results in a relational sqlite database presents many advantages over working with flat files. The sqlite database allows the pipeline to monitor the status of each process launched on the grid and, in case of failure, to restart the pipeline from the point at which it failed. It also makes it easier to query results, integrate different data types when generating summary reports, and share this information, since all the analysis data (i.e. programs, parameters, cutoffs) and their results are stored in a single sqlite file. The storage of data in a sqlite database, however, might present some loading speed challenges when the data volume is very large and the storage device where the database resides is not fast enough (e.g. 7,200 RPM SATA drives). At JCVI, sqlite databases typically reside on 15,000 RPM SAS drives, with bandwidths of ~500 MB/sec. For slower systems, we recommend avoiding these databases and instead parsing the results directly from the raw outputs of the analyses to generate the annotation and evidence files.</p>
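<p>As an illustration of how a single sqlite file simplifies summary reporting, the sketch below counts the peptides with evidence from each analysis. The evidence table, its columns, and the rows are hypothetical and do not reflect the actual VMGAP schema.</p>

```python
import sqlite3

# Hypothetical evidence table held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evidence ("
    "peptide_id TEXT, analysis TEXT, hit TEXT, evalue REAL)"
)
conn.executemany("INSERT INTO evidence VALUES (?, ?, ?, ?)", [
    ("pep_0001", "PFAM/TIGRFAM_HMM", "PF00005", 1e-30),
    ("pep_0001", "ENV_NT", "library_A", 6.6e-54),
    ("pep_0002", "ENV_NT", "library_B", 2.1e-44),
    ("pep_0003", "ACLAME_PEP", "protein_X", 3.0e-12),
])


def peptides_per_analysis(conn):
    """Return the number of distinct peptides with evidence, keyed by analysis."""
    query = (
        "SELECT analysis, COUNT(DISTINCT peptide_id) "
        "FROM evidence GROUP BY analysis ORDER BY analysis"
    )
    return dict(conn.execute(query).fetchall())


summary = peptides_per_analysis(conn)
print(summary)
```

<p>The same report built from flat files would require parsing and merging one raw output per analysis.</p>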
<p>The organizational format of the tab-delimited output files (annotation and evidence) is also advantageous. Since the first column of these files contains unique protein identifiers, all of the annotation and supporting evidence for any protein or group of proteins can be retrieved using the Unix grep utility directly from the command line. These files can also be imported into Excel for inspection and analysis. Lastly, the VMGAP can be easily updated and customized to meet the specific needs and objectives of the user, either by adding further virus-specific databases as they become available or by including more specialized boutique databases (e.g. RNA virus-specific datasets).</p>
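<p>The grep-style retrieval described above can also be scripted. The following minimal sketch assumes only that the first tab-delimited column holds the protein identifier; the function name, identifiers, and file contents are hypothetical.</p>

```python
import io


def select_records(lines, wanted_ids):
    """Yield the lines whose first tab-delimited column is one of the wanted IDs."""
    wanted = set(wanted_ids)
    for line in lines:
        peptide_id = line.split("\t", 1)[0]
        if peptide_id in wanted:
            yield line.rstrip("\n")


# Hypothetical evidence file held in memory; a real file opened with open() works the same way.
evidence_file = io.StringIO(
    "pep_1\tTMHMM\t2\n"
    "pep_10\tSIGNALP\tY\t16\n"
    "pep_2\tPEPSTATS\t17369.86\t9.9423\n"
)
selected = list(select_records(evidence_file, ["pep_1", "pep_2"]))
print(selected)
```

<p>Matching the whole first column avoids a common pitfall of plain substring searches, where a query for pep_1 would also return pep_10.</p>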
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>This research was supported by the
<funding-source rid="sp1">Office of Science</funding-source>
(BER),
<funding-source rid="sp2">U.S. Department of Energy</funding-source>
, Cooperative Agreement No.
<award-id rid="sp2">DE-FC02-02ER63453</award-id>
and by the
<funding-source rid="sp3">National Science Foundation Microbial Genome Sequencing Program</funding-source>
(award number
<award-id rid="sp3">0626826</award-id>
). We would like to thank Johannes Goll for his technical expertise on METAREP.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="r1">
<label>1</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weinbauer</surname>
<given-names>MG</given-names>
</name>
</person-group>
<article-title>Ecology of prokaryotic viruses.</article-title>
<source>FEMS Microbiol Rev</source>
<year>2004</year>
;
<volume>28</volume>
:
<fpage>127</fpage>
-
<lpage>181</lpage>
<pub-id pub-id-type="doi">10.1016/j.femsre.2003.08.001</pub-id>
<pub-id pub-id-type="pmid">15109783</pub-id>
</mixed-citation>
</ref>
<ref id="r2">
<label>2</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Viral metagenomics.</article-title>
<source>Nat Rev Microbiol</source>
<year>2005</year>
;
<volume>3</volume>
:
<fpage>504</fpage>
-
<lpage>510</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro1163</pub-id>
<pub-id pub-id-type="pmid">15886693</pub-id>
</mixed-citation>
</ref>
<ref id="r3">
<label>3</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>The Phage Proteomic Tree: a genome-based taxonomy for phage.</article-title>
<source>J Bacteriol</source>
<year>2002</year>
;
<volume>184</volume>
:
<fpage>4529</fpage>
-
<lpage>4535</lpage>
<pub-id pub-id-type="doi">10.1128/JB.184.16.4529-4535.2002</pub-id>
<pub-id pub-id-type="pmid">12142423</pub-id>
</mixed-citation>
</ref>
<ref id="r4">
<label>4</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Cardenas</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fish</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chai</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Farris</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Kulam</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>McGarrell</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Marsh</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Garrity</surname>
<given-names>GM</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Ribosomal Database Project: improved alignments and new tools for rRNA analysis.</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
;
<volume>37</volume>
(
<issue>Database issue</issue>
):
<fpage>D141</fpage>
-
<lpage>D145</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn879</pub-id>
<pub-id pub-id-type="pmid">19004872</pub-id>
</mixed-citation>
</ref>
<ref id="r5">
<label>5</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeSantis</surname>
<given-names>TZ</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Larsen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Rojas</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Brodie</surname>
<given-names>EL</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Huber</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Dalevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Andersen</surname>
<given-names>GL</given-names>
</name>
</person-group>
<article-title>Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.</article-title>
<source>Appl Environ Microbiol</source>
<year>2006</year>
;
<volume>72</volume>
:
<fpage>5069</fpage>
-
<lpage>5072</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.03006-05</pub-id>
<pub-id pub-id-type="pmid">16820507</pub-id>
</mixed-citation>
</ref>
<ref id="r6">
<label>6</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
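<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data.</article-title>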
<source>Genome Res</source>
<year>2007</year>
;
<volume>17</volume>
:
<fpage>377</fpage>
-
<lpage>386</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="r7">
<label>7</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matsen</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Kodner</surname>
<given-names>RB</given-names>
</name>
<name>
<surname>Armbrust</surname>
<given-names>EV</given-names>
</name>
</person-group>
<article-title>pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree.</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
;
<volume>11</volume>
:
<fpage>538</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-538</pub-id>
<pub-id pub-id-type="pmid">21034504</pub-id>
</mixed-citation>
</ref>
<ref id="r8">
<label>8</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Markowitz</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Szeto</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Palaniappan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dalevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Grechkin</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Dubchak</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>I</given-names>
</name>
<etal></etal>
</person-group>
<article-title>IMG/M: a data management and analysis system for metagenomes.</article-title>
<source>Nucleic Acids Res</source>
<year>2008</year>
;
<volume>36</volume>
(
<issue>Database issue</issue>
):
<fpage>D534</fpage>
-
<lpage>D538</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkm869</pub-id>
<pub-id pub-id-type="pmid">17932063</pub-id>
</mixed-citation>
</ref>
<ref id="r9">
<label>9</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Altintas</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Peltier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stocks</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>EE</given-names>
</name>
<name>
<surname>Ellisman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grethe</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource.</article-title>
<source>Nucleic Acids Res</source>
<year>2011</year>
;
<volume>39</volume>
(
<issue>Database issue</issue>
):
<fpage>D546</fpage>
-
<lpage>D551</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq1102</pub-id>
<pub-id pub-id-type="pmid">21045053</pub-id>
</mixed-citation>
</ref>
<ref id="r10">
<label>10</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Paarmann</surname>
<given-names>D</given-names>
</name>
<name>
<surname>D'Souza</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Glass</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Kubal</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Paczian</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wilke</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
;
<volume>9</volume>
:
<fpage>386</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-386</pub-id>
<pub-id pub-id-type="pmid">18803844</pub-id>
</mixed-citation>
</ref>
<ref id="r11">
<label>11</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Osterman</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Overbeek</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Automatic detection of subsystem/pathway variants in genome analysis.</article-title>
<source>Bioinformatics</source>
<year>2005</year>
;
<volume>21</volume>
(
<issue>Suppl 1</issue>
):
<fpage>i478</fpage>
-
<lpage>i486</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti1052</pub-id>
<pub-id pub-id-type="pmid">15961494</pub-id>
</mixed-citation>
</ref>
<ref id="r12">
<label>12</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dinsdale</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Pantos</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Smriga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Angly</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Wegley</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Hatay</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Haynes</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Microbial ecology of four coral atolls in the Northern Line Islands.</article-title>
<source>PLoS One</source>
<year>2008</year>
;
<volume>3</volume>
:
<fpage>e1584</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0001584</pub-id>
<pub-id pub-id-type="pmid">18301735</pub-id>
</mixed-citation>
</ref>
<ref id="r13">
<label>13</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Overbeek</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>FIGfams: yet another set of protein families.</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
;
<volume>37</volume>
:
<fpage>6643</fpage>
-
<lpage>6654</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp698</pub-id>
<pub-id pub-id-type="pmid">19762480</pub-id>
</mixed-citation>
</ref>
<ref id="r14">
<label>14</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goll</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Tanenbaum</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Thiagarajan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Methe</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Yooseph</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>METAREP: JCVI metagenomics reports--an open source tool for high-performance comparative metagenomics.</article-title>
<source>Bioinformatics</source>
<year>2010</year>
;
<volume>26</volume>
:
<fpage>2631</fpage>
-
<lpage>2632</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq455</pub-id>
<pub-id pub-id-type="pmid">20798169</pub-id>
</mixed-citation>
</ref>
<ref id="r15">
<label>15</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>HMMER</collab>
</person-group>
<ext-link ext-link-type="uri" xlink:href="http://hmmer.janelia.org">http://hmmer.janelia.org</ext-link>
</mixed-citation>
</ref>
<ref id="r16">
<label>16</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>BLAST</collab>
</person-group>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/">ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/</ext-link>
</mixed-citation>
</ref>
<ref id="r17">
<label>17</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>SignalP</collab>
</person-group>
<ext-link ext-link-type="uri" xlink:href="http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp">http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp</ext-link>
</mixed-citation>
</ref>
<ref id="r18">
<label>18</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Emanuelsson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Brunak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>von Heijne</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Locating proteins in the cell using TargetP, SignalP and related tools.</article-title>
<source>Nat Protoc</source>
<year>2007</year>
;
<volume>2</volume>
:
<fpage>953</fpage>
-
<lpage>971</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2007.131</pub-id>
<pub-id pub-id-type="pmid">17446895</pub-id>
</mixed-citation>
</ref>
<ref id="r19">
<label>19</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krogh</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Larsson</surname>
<given-names>B</given-names>
</name>
<name>
<surname>von Heijne</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
</person-group>
<article-title>Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.</article-title>
<source>J Mol Biol</source>
<year>2001</year>
;
<volume>305</volume>
:
<fpage>567</fpage>
-
<lpage>580</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.2000.4315</pub-id>
<pub-id pub-id-type="pmid">11152613</pub-id>
</mixed-citation>
</ref>
<ref id="r20">
<label>20</label>
<mixed-citation publication-type="webpage">TMHMM. Transmembrane domain prediction.
<ext-link ext-link-type="uri" xlink:href="http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm">http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm</ext-link>
</mixed-citation>
</ref>
<ref id="r21">
<label>21</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>PRIAM</collab>
</person-group>
<ext-link ext-link-type="uri" xlink:href="http://priam.prabi.fr/REL_JUL06/index_jul06.html">http://priam.prabi.fr/REL_JUL06/index_jul06.html</ext-link>
</mixed-citation>
</ref>
<ref id="r22">
<label>22</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Claudel-Renard</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Chevalet</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Faraut</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kahn</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Enzyme-specific profiles for genome annotation: PRIAM.</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
;
<volume>31</volume>
:
<fpage>6633</fpage>
-
<lpage>6639</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg847</pub-id>
<pub-id pub-id-type="pmid">14602924</pub-id>
</mixed-citation>
</ref>
<ref id="r23">
<label>23</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<collab>UniProt-Consortium</collab>
</person-group>
<article-title>Ongoing and future developments at the Universal Protein Resource.</article-title>
<source>Nucleic Acids Res</source>
<year>2011</year>
;
<volume>39</volume>
(
<issue>Database issue</issue>
):
<fpage>D214</fpage>
-
<lpage>D219</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq1020</pub-id>
<pub-id pub-id-type="pmid">21051339</pub-id>
</mixed-citation>
</ref>
<ref id="r24">
<label>24</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Davidsen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Beck</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Ganapathy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Montgomery</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zafar</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Madupu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Goetz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Galinsky</surname>
<given-names>K</given-names>
</name>
<name>
<surname>White</surname>
<given-names>O</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The comprehensive microbial resource.</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
;
<volume>38</volume>
(
<issue>Database issue</issue>
):
<fpage>D340</fpage>
-
<lpage>D345</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp912</pub-id>
<pub-id pub-id-type="pmid">19892825</pub-id>
</mixed-citation>
</ref>
<ref id="r25">
<label>25</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Mistry</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Coggill</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pollington</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Gavin</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>Gunasekaran</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ceric</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Forslund</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Pfam protein families database.</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
;
<volume>38</volume>
(
<issue>Database issue</issue>
):
<fpage>D211</fpage>
-
<lpage>D222</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp985</pub-id>
<pub-id pub-id-type="pmid">19920124</pub-id>
</mixed-citation>
</ref>
<ref id="r26">
<label>26</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Selengut</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Haft</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Davidsen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ganapathy</surname>
<given-names>A</given-names>
</name>
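<name>
<surname>Gwinn-Giglio</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Nelson</surname>
<given-names>WC</given-names>
</name>
<name>
<surname>Richter</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>White</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.</article-title>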
<source>Nucleic Acids Res</source>
<year>2007</year>
;
<volume>35</volume>
(
<issue>Database issue</issue>
):
<fpage>D260</fpage>
-
<lpage>D264</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl1043</pub-id>
<pub-id pub-id-type="pmid">17151080</pub-id>
</mixed-citation>
</ref>
<ref id="r27">
<label>27</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leplae</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hebrant</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wodak</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Toussaint</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>ACLAME: a CLAssification of Mobile genetic Elements.</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
;
<volume>32</volume>
(
<issue>Database issue</issue>
):
<fpage>D45</fpage>
-
<lpage>D49</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh084</pub-id>
<pub-id pub-id-type="pmid">14681355</pub-id>
</mixed-citation>
</ref>
<ref id="r28">
<label>28</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marchler-Bauer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Chitsaz</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Derbyshire</surname>
<given-names>MK</given-names>
</name>
<name>
<surname>DeWeese-Scott</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fong</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Geer</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Geer</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Gonzales</surname>
<given-names>NR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>CDD: a Conserved Domain Database for the functional annotation of proteins.</article-title>
<source>Nucleic Acids Res</source>
<year>2011</year>
;
<volume>39</volume>
(
<issue>Database issue</issue>
):
<fpage>D225</fpage>
-
<lpage>D229</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq1189</pub-id>
<pub-id pub-id-type="pmid">21109532</pub-id>
</mixed-citation>
</ref>
<ref id="r29">
<label>29</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tanenbaum</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Goll</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Murphy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Zafar</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Thiagarajan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Madupu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Davidsen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kagan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kravitz</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data.</article-title>
<source>Stand Genomic Sci</source>
<year>2010</year>
;
<volume>2</volume>
:
<fpage>229</fpage>
-
<lpage>237</lpage>
<pub-id pub-id-type="doi">10.4056/sigs.651139</pub-id>
<pub-id pub-id-type="pmid">21304707</pub-id>
</mixed-citation>
</ref>
<ref id="r30">
<label>30</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Noguchi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Taniguchi</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Itoh</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes.</article-title>
<source>DNA Res</source>
<year>2008</year>
;
<volume>15</volume>
:
<fpage>387</fpage>
-
<lpage>396</lpage>
<pub-id pub-id-type="doi">10.1093/dnares/dsn027</pub-id>
<pub-id pub-id-type="pmid">18940874</pub-id>
</mixed-citation>
</ref>
<ref id="r31">
<label>31</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>MetaGeneAnnotator</collab>
</person-group>
<ext-link ext-link-type="uri" xlink:href="http://metagene.cb.k.u-tokyo.ac.jp/metagene/download_mga.html">http://metagene.cb.k.u-tokyo.ac.jp/metagene/download_mga.html</ext-link>
</mixed-citation>
</ref>
<ref id="r32">
<label>32</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schoenfeld</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Richardson</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Wommack</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mead</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Assembly of viral metagenomes from Yellowstone hot springs.</article-title>
<source>Appl Environ Microbiol</source>
<year>2008</year>
;
<volume>74</volume>
:
<fpage>4164</fpage>
-
<lpage>4174</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.02598-07</pub-id>
<pub-id pub-id-type="pmid">18441115</pub-id>
</mixed-citation>
</ref>
<ref id="r33">
<label>33</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Run</surname>
<given-names>JQ</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Soh</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Hibberd</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>ET</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>RNA viral community in human feces: prevalence of plant pathogenic viruses.</article-title>
<source>PLoS Biol</source>
<year>2006</year>
;
<volume>4</volume>
:
<fpage>e3</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pbio.0040003</pub-id>
<pub-id pub-id-type="pmid">16336043</pub-id>
</mixed-citation>
</ref>
<ref id="r34">
<label>34</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Method for discovering novel DNA viruses in blood using viral particle selection and shotgun sequencing.</article-title>
<source>Biotechniques</source>
<year>2005</year>
;
<volume>39</volume>
:
<fpage>729</fpage>
-
<lpage>736</lpage>
<pub-id pub-id-type="doi">10.2144/000112019</pub-id>
<pub-id pub-id-type="pmid">16312220</pub-id>
</mixed-citation>
</ref>
<ref id="r35">
<label>35</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Haynes</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Angly</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Felts</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Mueller</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Nulton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rayhawk</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Viral diversity and dynamics in an infant gut.</article-title>
<source>Res Microbiol</source>
<year>2008</year>
;
<volume>159</volume>
:
<fpage>367</fpage>
-
<lpage>373</lpage>
<pub-id pub-id-type="doi">10.1016/j.resmic.2008.04.006</pub-id>
<pub-id pub-id-type="pmid">18541415</pub-id>
</mixed-citation>
</ref>
<ref id="r36">
<label>36</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Felts</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Nulton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Salamon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Diversity and population structure of a near-shore marine-sediment viral community.</article-title>
<source>Proc Biol Sci</source>
<year>2004</year>
;
<volume>271</volume>
:
<fpage>565</fpage>
-
<lpage>574</lpage>
<pub-id pub-id-type="doi">10.1098/rspb.2003.2628</pub-id>
<pub-id pub-id-type="pmid">15156913</pub-id>
</mixed-citation>
</ref>
<ref id="r37">
<label>37</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salamon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Andresen</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Segall</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Mead</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Azam</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Genomic analysis of uncultured marine viral communities.</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>2002</year>
;
<volume>99</volume>
:
<fpage>14250</fpage>
-
<lpage>14255</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.202488399</pub-id>
<pub-id pub-id-type="pmid">12384570</pub-id>
</mixed-citation>
</ref>
<ref id="r38">
<label>38</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Culley</surname>
<given-names>AI</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Suttle</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>Metagenomic analysis of coastal RNA virus communities.</article-title>
<source>Science</source>
<year>2006</year>
;
<volume>312</volume>
:
<fpage>1795</fpage>
-
<lpage>1798</lpage>
<pub-id pub-id-type="doi">10.1126/science.1127404</pub-id>
<pub-id pub-id-type="pmid">16794078</pub-id>
</mixed-citation>
</ref>
<ref id="r39">
<label>39</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bench</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Hanson</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Williamson</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Ghosh</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Radosovich</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wommack</surname>
<given-names>KE</given-names>
</name>
</person-group>
<article-title>Metagenomic characterization of Chesapeake Bay virioplankton.</article-title>
<source>Appl Environ Microbiol</source>
<year>2007</year>
;
<volume>73</volume>
:
<fpage>7629</fpage>
-
<lpage>7641</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.00938-07</pub-id>
<pub-id pub-id-type="pmid">17921274</pub-id>
</mixed-citation>
</ref>
<ref id="r40">
<label>40</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cann</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Fandrich</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Heaphy</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes.</article-title>
<source>Virus Genes</source>
<year>2005</year>
;
<volume>30</volume>
:
<fpage>151</fpage>
-
<lpage>156</lpage>
<pub-id pub-id-type="doi">10.1007/s11262-004-5624-3</pub-id>
<pub-id pub-id-type="pmid">15744573</pub-id>
</mixed-citation>
</ref>
<ref id="r41">
<label>41</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sayers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Barrett</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Bolton</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>SH</given-names>
</name>
<name>
<surname>Canese</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chetvernin</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Dicuccio</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Federhen</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Database resources of the National Center for Biotechnology Information.</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
;
<volume>38</volume>
(
<issue>Database issue</issue>
):
<fpage>D5</fpage>
-
<lpage>D16</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp967</pub-id>
<pub-id pub-id-type="pmid">19910364</pub-id>
</mixed-citation>
</ref>
<ref id="r42">
<label>42</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Letunic</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Copley</surname>
<given-names>RR</given-names>
</name>
<name>
<surname>Pils</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pinkert</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schultz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>SMART 5: domains in the context of genomes and networks.</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
;
<volume>34</volume>
(
<issue>Database issue</issue>
):
<fpage>D257</fpage>
-
<lpage>D260</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkj079</pub-id>
<pub-id pub-id-type="pmid">16381859</pub-id>
</mixed-citation>
</ref>
<ref id="r43">
<label>43</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Fedorova</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Jackson</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Jacobs</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Kiryutin</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
<name>
<surname>Krylov</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Mazumder</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mekhedov</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Nikolskaya</surname>
<given-names>AN</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The COG database: an updated version includes eukaryotes.</article-title>
<source>BMC Bioinformatics</source>
<year>2003</year>
;
<volume>4</volume>
:
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-4-41</pub-id>
<pub-id pub-id-type="pmid">12969510</pub-id>
</mixed-citation>
</ref>
<ref id="r44">
<label>44</label>
<mixed-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>SQLite software library</collab>
</person-group>
<ext-link ext-link-type="uri" xlink:href="http://www.sqlite.org">http://www.sqlite.org</ext-link>
</mixed-citation>
</ref>
<ref id="r45">
<label>45</label>
<mixed-citation publication-type="webpage">METAREP. JCVI Metagenomics Reports - an open source tool for high-performance comparative metagenomics.
<ext-link ext-link-type="uri" xlink:href="https://github.com/jcvi/METAREP">https://github.com/jcvi/METAREP</ext-link>
</mixed-citation>
</ref>
<ref id="r46">
<label>46</label>
<mixed-citation publication-type="webpage">METAREP. JCVI Metagenomics Reports.
<ext-link ext-link-type="uri" xlink:href="http://www.jcvi.org/metarep">http://www.jcvi.org/metarep</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
